Creating the BNC
Creating the BNC was a joint effort involving a large number of participants, both organisations and individuals. The work comprised two main stages, planning (the design stage) and execution (the creation stage), as described below.
The BNC project started with a careful planning stage where the design principles for the corpus were drawn up. These established a number of selection criteria which were then used for identifying suitable texts to be included in the corpus. In addition to the selection criteria for the written and spoken components, a large number of classification features were identified for the texts in the corpus.
Selection Criteria: Written texts
Texts were selected for inclusion in the corpus according to three independent selection criteria: domain, time, and medium. Target proportions were defined for each of these criteria, as listed below.
- The domain of a text indicates the kind of writing it contains.
  - 75% of the written texts were to be chosen from informative writing, drawn in roughly equal quantities from the fields of applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science, and world affairs.
  - 25% of the written texts were to be imaginative, that is, literary and creative works.
- The medium of a text indicates the kind of publication in which it occurs. The classification used is quite broad.
  - 60% of written texts were to be books.
  - 25% were to be periodicals (newspapers etc.).
  - Between 5 and 10% were to come from other kinds of published material (brochures, advertising leaflets, etc.).
  - Between 5 and 10% were to come from unpublished written material such as personal letters and diaries, essays and memoranda, etc.
  - A small amount (less than 5%) was to come from material written to be spoken (for example, political speeches, play texts, broadcast scripts, etc.).
- The time criterion refers to the date of publication of a text. Being a synchronic corpus, the BNC should contain texts from roughly the same period. The intention was that no text should date back further than 1975. This condition was relaxed for imaginative works only, a few of which date back to 1964, because of their continued popularity and consequent effect on the language.
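The target proportions above lend themselves to a simple composition check. The following is a minimal sketch, not part of the BNC tooling: the category keys, the function names, and the `tolerance` parameter are all assumptions made for illustration, and only the percentages themselves come from the text. It compares the share of words in each category against the stated target range.

```python
# Target shares for the written component, taken from the criteria above.
# Each target is a (low, high) range of fractions; point targets have low == high.
# The dictionary keys are hypothetical labels chosen for this sketch.
DOMAIN_TARGETS = {
    "informative": (0.75, 0.75),
    "imaginative": (0.25, 0.25),
}
MEDIUM_TARGETS = {
    "book": (0.60, 0.60),
    "periodical": (0.25, 0.25),
    "misc_published": (0.05, 0.10),
    "unpublished": (0.05, 0.10),
    "written_to_be_spoken": (0.00, 0.05),
}


def shares(word_counts):
    """Convert per-category word counts into fractions of the total."""
    total = sum(word_counts.values())
    if total == 0:
        raise ValueError("word_counts must contain at least one word")
    return {category: count / total for category, count in word_counts.items()}


def deviations(word_counts, targets, tolerance=0.02):
    """Return the categories whose share lies outside the target range.

    The range is widened by `tolerance` on both sides (an assumed allowance,
    not a figure from the BNC design).
    """
    flagged = {}
    for category, share in shares(word_counts).items():
        low, high = targets[category]
        if share < low - tolerance or share > high + tolerance:
            flagged[category] = round(share, 3)
    return flagged
```

For example, calling `deviations(counts, MEDIUM_TARGETS)` on a dictionary of word counts keyed by medium returns an empty dictionary when every medium is within its target range, and otherwise maps each out-of-range medium to its observed share.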
Classification features: Written texts
- Sample size (number of words) and extent (start and end points)
- Topic or subject of the text
- Author's name, age, gender, region of origin, and domicile
- Target age group and gender
- "Level" of writing (a subjective measure of reading difficulty): the more literary or technical a text, the "higher" its level.
Designing the Spoken Component
There are two parts to the 10-million-word spoken corpus: a demographic part, containing transcriptions of spontaneous natural conversations made by members of the public, and a context-governed part, containing transcriptions of recordings made at specific types of meeting and event.
All the original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archive of the British Library.
- The Demographic part of the Spoken Corpus
A total of 124 volunteers were recruited by the British Market Research Bureau. The volunteers came from four social groupings (AB, C1, C2, and DE). There were male and female volunteers from a wide range of ages, and they lived at 38 different locations across the UK. Recruits were chosen in such a way as to make sure there were equal numbers of men and women, approximately equal numbers from each age group, and equal numbers from each social grouping.
Recruits used a personal stereo to record all their conversations unobtrusively over two or three days, and logged details of each conversation in a special notebook. Those who took part in the recordings were asked after the conversation to give permission for their speech to be included in the corpus.
Information about the participants, such as age, sex, accent, and occupation, was recorded when available.
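The recruitment constraints described above (equal numbers of men and women, and equal numbers from each social grouping) can be expressed as a small balance check. This is an illustrative sketch only: the record layout, the field names `sex` and `social_group`, and the `slack` parameter are assumptions, and age groups are left out because the text does not list the bands used.

```python
from collections import Counter

# The four social groupings named in the text.
SOCIAL_GROUPS = ("AB", "C1", "C2", "DE")
# Labels for the two sexes; the exact strings are an assumption of this sketch.
SEXES = ("female", "male")


def tally(recruits, field, expected):
    """Count recruits per value of `field`, including expected values with no recruits."""
    counts = Counter(recruit[field] for recruit in recruits)
    for value in expected:
        counts.setdefault(value, 0)
    return counts


def is_balanced(counts, slack=0):
    """True if the largest and smallest counts differ by at most `slack`."""
    values = list(counts.values())
    return max(values) - min(values) <= slack


def balance_report(recruits, slack=0):
    """Report whether the recruits are balanced by sex and by social grouping."""
    return {
        "sex": is_balanced(tally(recruits, "sex", SEXES), slack),
        "social_group": is_balanced(tally(recruits, "social_group", SOCIAL_GROUPS), slack),
    }
```

With 124 recruits, exact balance would mean 62 men and 62 women, and 31 recruits in each of the four social groupings. A non-zero `slack` could be used for criteria that only need to be approximately equal.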
- The Context-Governed part of the Spoken Corpus
The intention was to collect roughly equal quantities of speech recorded in each of the following four broad categories of social context:
- Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials.
- Business events, such as sales demonstrations, trades union meetings, consultations, interviews.
- Institutional and public events, such as sermons, political speeches, council meetings, parliamentary proceedings.
- Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins.