Designing and Creating the BNC

Creating the BNC

Making the BNC was a joint effort involving a large number of participants, both organisations and individuals. It comprised two main stages: planning (the design stage) and execution (the creation stage), as described below.

Design stage

The BNC project started with a careful planning stage in which the design principles for the corpus were drawn up. These established a number of selection criteria, which were then used to identify suitable texts for inclusion in the corpus. In addition to the selection criteria for the written and spoken components, a large number of classification features were identified for the texts in the corpus.

Selection Criteria: Written texts

Texts were selected for inclusion in the corpus according to three independent selection criteria: domain, time, and medium. Target proportions were defined for each of these criteria, as listed below.

Domain

The domain of a text indicates the kind of writing it contains.

75% of the written texts were to be chosen from informative writings, with roughly equal quantities drawn from each of the following fields: applied sciences, arts, belief & thought, commerce & finance, leisure, natural & pure science, social science, and world affairs.
25% of the written texts were to be imaginative, that is, literary and creative works.
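
As a rough illustration of what these targets imply, the sketch below converts the domain proportions into a share for each field. It assumes an exactly equal split of the informative 75% across the eight fields, whereas the design only asks for roughly equal quantities. The names INFORMATIVE_FIELDS and domain_targets are invented for this example.

```python
from fractions import Fraction

# The eight informative fields named in the design.
INFORMATIVE_FIELDS = [
    "applied sciences",
    "arts",
    "belief & thought",
    "commerce & finance",
    "leisure",
    "natural & pure science",
    "social science",
    "world affairs",
]


def domain_targets():
    """Return the target share of written texts per domain as exact fractions.

    Assumption: the informative 75% is split exactly equally across the
    eight fields; the design itself only says "roughly equal".
    """
    informative = Fraction(75, 100)
    per_field = informative / len(INFORMATIVE_FIELDS)
    targets = {field: per_field for field in INFORMATIVE_FIELDS}
    targets["imaginative"] = Fraction(25, 100)
    return targets
```

Under that assumption each informative field would account for 3/32 of the written texts, that is 9.375%, and the nine shares together sum to 100%.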

Medium

The medium of a text indicates the kind of publication in which it occurs. The classification used is quite broad.

60% of written texts were to be books.
25% were to be periodicals (newspapers, etc.).
Between 5 and 10% should come from other kinds of miscellaneous published material (brochures, advertising leaflets, etc.).
Between 5 and 10% should come from unpublished written material such as personal letters and diaries, essays and memoranda, etc.
A small amount (less than 5%) should come from material written to be spoken (for example, political speeches, play texts, broadcast scripts, etc.).
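
Because three of the five medium targets are ranges rather than fixed figures, many different allocations satisfy them. The sketch below shows one way to check a proposed allocation against the targets listed above. It is illustrative only: the names MEDIUM_RANGES and check_allocation are invented here, and "less than 5%" is modelled as the range 0 to 5, which is an assumption.

```python
# Target ranges for each medium as (minimum, maximum) percentages.
# Assumption: "less than 5%" is treated as the inclusive range 0 to 5.
MEDIUM_RANGES = {
    "books": (60, 60),
    "periodicals": (25, 25),
    "miscellaneous published": (5, 10),
    "unpublished": (5, 10),
    "written to be spoken": (0, 5),
}


def check_allocation(allocation):
    """Return a list of problems found in a proposed percentage allocation.

    An empty list means every medium is within its target range and the
    shares add up to 100%.
    """
    problems = []
    for medium, (low, high) in MEDIUM_RANGES.items():
        share = allocation.get(medium, 0)
        if not low <= share <= high:
            problems.append(f"{medium}: {share}% is outside {low}-{high}%")
    total = sum(allocation.values())
    if total != 100:
        problems.append(f"total is {total}%, expected 100%")
    return problems
```

For example, an allocation of 60% books, 25% periodicals, 7% miscellaneous published, 6% unpublished, and 2% written-to-be-spoken material passes the check with no problems reported.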

Time

The time criterion refers to the date of publication of a text. Being a synchronic corpus, the BNC should contain texts from roughly the same period. The intention was that no text should date back further than 1975. This condition was relaxed for imaginative works only, a few of which date back to 1964, because of their continued popularity and consequent effect on the language.

Classification features: Written texts

In addition to the selection criteria, a large number of classification features were identified for the texts in the corpus. No fixed proportions were specified for these features, although the intention was to ensure an appropriate level of variation within each one. The classification features include such things as:

Sample size (number of words) and extent (start and end points)
Topic or subject of the text
Author's name, age, gender, region of origin, and domicile
Target age group and gender
"Level" of writing (a subjective measure of reading difficulty): the more literary or technical a text, the "higher" its level.

Information was added when available, which means that the amount of information recorded for each text varies.
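
One way to picture a per-text record in which any feature may be missing is sketched below. This is not the corpus's actual encoding: the class name WrittenTextFeatures, all field names, and the use of plain strings for values are hypothetical choices made for this example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WrittenTextFeatures:
    """Hypothetical record of the classification features of one written text.

    Every field defaults to None, reflecting that information was only
    added when it was available.
    """

    sample_size_words: Optional[int] = None
    extent: Optional[str] = None
    topic: Optional[str] = None
    author_name: Optional[str] = None
    author_age: Optional[str] = None
    author_gender: Optional[str] = None
    author_region_of_origin: Optional[str] = None
    author_domicile: Optional[str] = None
    target_age_group: Optional[str] = None
    target_gender: Optional[str] = None
    level: Optional[str] = None

    def recorded(self):
        """Return the names of the features that have a value."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]
```

A record created with only a topic and a level would report just those two features as recorded, which mirrors the way the amount of information varies from text to text.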

Designing the Spoken Component

There are two parts to the 10-million-word spoken corpus: a demographic part, containing transcriptions of spontaneous natural conversations made by members of the public, and a context-governed part, containing transcriptions of recordings made at specific types of meeting and event.

All the original recordings transcribed for inclusion in the BNC have been deposited at the National Sound Archive of the British Library.

The Demographic part of the Spoken Corpus

A total of 124 volunteers were recruited by the British Market Research Bureau. The volunteers came from four social groupings (AB, C1, C2, and DE). There were male and female volunteers from a wide range of ages, and they lived at 38 different locations across the UK. Recruits were chosen in such a way as to make sure there were equal numbers of men and women, approximately equal numbers from each age group, and equal numbers from each social grouping.
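
The arithmetic behind these recruitment targets can be sketched as follows. The function equal_quota is a hypothetical helper, and the figures it produces are what an exactly equal split of 124 recruits would imply, not reported recruitment counts. Age groups are left out because the text above does not say how many there were.

```python
TOTAL_RECRUITS = 124
SEXES = ["male", "female"]
SOCIAL_GROUPS = ["AB", "C1", "C2", "DE"]


def equal_quota(total, categories):
    """Split total across categories as evenly as whole numbers allow.

    Any remainder is handed out one at a time to the first categories.
    """
    share, remainder = divmod(total, len(categories))
    return {
        category: share + (1 if index < remainder else 0)
        for index, category in enumerate(categories)
    }


sex_quota = equal_quota(TOTAL_RECRUITS, SEXES)
social_quota = equal_quota(TOTAL_RECRUITS, SOCIAL_GROUPS)
```

With 124 recruits, an equal split gives 62 men and 62 women, and 31 recruits in each of the four social groupings.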

Recruits used a personal stereo to record all their conversations unobtrusively over two or three days, and logged details of each conversation in a special notebook. Those who took part in the recordings were asked after the conversation to give permission for their speech to be included in the corpus.

Information about the participants, such as age, sex, accent, and occupation, was recorded when available.

The Context-Governed part of the Spoken Corpus

The intention was to collect roughly equal quantities of speech recorded in each of the following four broad categories of social context:

Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials.
Business events, such as sales demonstrations, trades union meetings, consultations, interviews.
Institutional and public events, such as sermons, political speeches, council meetings, parliamentary proceedings.
Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins.

Information about the participants, such as age and sex, was recorded when available.
