BNC

British National Corpus User Reference Guide

2. Design of BNC-baby

  Author: edited by Lou Burnard (revised LB) Date: (revised 19-22 Nov 2003)

BNC-baby is a four-part corpus constructed by a principled sampling taken from the BNC World Edition. The principles underlying selection of texts for inclusion in BNC-baby may be summarized as follows:

To select materials meeting these criteria, the following procedure was adopted. Selection of texts was made on the basis of the information about them recorded in the original corpus, in particular the text classifications provided by David Lee.

2.1. Fiction

The original BNCworld contains 477 texts (16 million words) classed as `imaginative'. These vary considerably in date of publication, target audience and medium, as well as in author properties and other features. Compared with other texts in the corpus, these texts are also rather long. Achieving a comparable variety in a one million word subset would be difficult, if not impossible. For BNC-baby, therefore, selection was restricted to a narrowly defined subset: from all the BNCworld written imaginative texts, we took only those which were published as books between 1985 and 1994, were classified explicitly as for an adult audience, and were assigned the genre label W fict prose. From this set of 356 texts, a random sample of about one million words (25 texts) was drawn. The sample was checked to ensure that no more than one title by any particular author was selected.
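The selection steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: the `Text` record and its field names are hypothetical, and the sketch enforces the one-title-per-author rule while drawing, whereas the text above describes it as a check made after the sample was drawn.

```python
import random
from dataclasses import dataclass


@dataclass
class Text:
    # Hypothetical record; field names are illustrative, not BNC header names.
    text_id: str
    author: str
    words: int
    year: int
    medium: str
    audience: str
    genre: str


def eligible(t):
    """Apply the four criteria listed for the fiction component."""
    return (
        t.medium == "book"
        and 1985 <= t.year <= 1994
        and t.audience == "adult"
        and t.genre == "W fict prose"
    )


def draw_sample(texts, target_words, seed=0):
    """Draw eligible texts in random order until the word target is reached,
    skipping any further title by an author already chosen."""
    rng = random.Random(seed)
    pool = [t for t in texts if eligible(t)]
    rng.shuffle(pool)
    chosen, seen_authors, total = [], set(), 0
    for t in pool:
        if total >= target_words:
            break
        if t.author in seen_authors:
            continue
        chosen.append(t)
        seen_authors.add(t.author)
        total += t.words
    return chosen
```

Calling `draw_sample(texts, 1_000_000)` on a list of such records returns a set of texts totalling roughly one million words, with no author represented twice.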

2.2. Newspapers

In the design of the original BNC, newspapers were not identified as a distinct category, although a very large amount of newspaper material is contained in it. On the basis of the descriptive information provided in the text headers, it is however possible to select different kinds of newspaper text. Note that each `text' is made up of several newspaper articles, often drawn from a particular subject domain, rather than complete issues of a paper. In this, BNC-baby follows the sampling methods of the original corpus compilers.

The BNC-baby selection was made to ensure a mix of national and local papers, wide coverage of topics and little duplication of dates. Approximately 60% of the newspaper data comes from five national papers, and the remaining 40% from regional newspapers. An attempt was made to include texts from different domains and genres, to maximise the spread in topic areas covered. As far as possible, the texts were selected to ensure a spread of dates of publication in order to minimise the effects of seasonal or topical variation. The size of the texts was also considered, and choices were made to ensure a roughly equal distribution across different newspapers within the national and regional subsets respectively, as well as a spread across subject areas. Differences in the amount of data from each newspaper in the component are largely due to the considerable variation in the size of the newspaper texts in BNCworld.

The following table shows the number of words in each newspaper sampled for BNC-baby:

Newspaper                          Words      %
Daily Mirror                      124251    12%
Daily Telegraph                   128794    13%
Guardian                          129598    13%
Independent                       131205    13%
Today                              91238     9%
Belfast Telegraph                  43006     4%
East Anglian Daily Times           43674     4%
Liverpool Daily Post and Echo      85441     9%
Northern Echo                      68887     7%
The Alton Herald                   56316     6%
The East Anglian                   15814     2%
The Scotsman                       66709     7%
Ulster Newsletter                  16888     2%
Total component                  1001821   100%
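The approximate 60%/40% split between national and regional papers can be checked against the figures in the table. The short Python sketch below does the arithmetic. It assumes that the five national papers are the first five rows of the table (Daily Mirror, Daily Telegraph, Guardian, Independent, Today); that grouping is an inference from the text, not something the table itself states.

```python
# Word counts copied from the table above.
WORDS = {
    "Daily Mirror": 124251,
    "Daily Telegraph": 128794,
    "Guardian": 129598,
    "Independent": 131205,
    "Today": 91238,
    "Belfast Telegraph": 43006,
    "East Anglian Daily Times": 43674,
    "Liverpool Daily Post and Echo": 85441,
    "Northern Echo": 68887,
    "The Alton Herald": 56316,
    "The East Anglian": 15814,
    "The Scotsman": 66709,
    "Ulster Newsletter": 16888,
}

# Assumption: these five titles are the "five national papers".
NATIONAL = {"Daily Mirror", "Daily Telegraph", "Guardian", "Independent", "Today"}


def shares(words, national):
    """Return (total, national words, regional words)."""
    total = sum(words.values())
    nat = sum(n for paper, n in words.items() if paper in national)
    return total, nat, total - nat


total, nat, reg = shares(WORDS, NATIONAL)
print(f"national: {nat} words ({nat / total:.1%})")  # national: 605086 words (60.4%)
print(f"regional: {reg} words ({reg / total:.1%})")  # regional: 396735 words (39.6%)
```

Under that assumption the row counts sum to the stated component total of 1001821 words, and the split agrees with the approximate proportions given in the text.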

2.3. Spoken

The conversational data in BNC-baby has been drawn only from the spoken demographic component of the BNC World Edition. Each text consists of a number of conversations recorded by one individual, capturing data produced by a number of different speakers in different situations. Speakers were recruited (as described elsewhere) according to demographic principles, in order to be broadly representative of the UK population in terms of age, gender, region, and class.

Texts for which very little information about the speakers was available were excluded from selection. From the remainder, 30 texts were then randomly selected.

The following table shows the number of words spoken by participants in the spoken part of BNC-baby, broken down by age group, sex, and social class:

Category                   Value      Words    % of category   % of corpus
Age of speaker             0-14     102,350         11%            10%
                           15-24     73,891          8%             7%
                           25-34    292,083         32%            29%
                           35-44    182,976         20%            18%
                           45-59    113,038         12%            11%
                           60+      159,948         17%            16%
                           total    924,286        100%            91%
Sex of speaker             Female   551,077         59%            55%
                           Male     384,337         41%            38%
                           total    935,414        100%            93%
Social class of speaker    AB       243,125         35%            24%
                           C1       232,981         33%            23%
                           C2       133,963         19%            13%
                           DE        86,189         12%             9%
                           total    696,258        100%            69%
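The "% of category" column is each row's word count divided by the total for its own category, which is why the three totals differ: presumably not every speaker has a recorded age, sex, and social class. As a hedged illustration, the following Python sketch recomputes that column from the word counts in the table. The dictionary and function names are invented for the example. The "% of corpus" column is not recomputed, because the size of the whole spoken component is not stated here.

```python
# Word counts copied from the table above.
AGE = {"0-14": 102350, "15-24": 73891, "25-34": 292083,
       "35-44": 182976, "45-59": 113038, "60+": 159948}
SEX = {"Female": 551077, "Male": 384337}
SOCIAL_CLASS = {"AB": 243125, "C1": 232981, "C2": 133963, "DE": 86189}


def category_percentages(counts):
    """Return the category total and each value's share of it,
    rounded to the nearest whole percent."""
    total = sum(counts.values())
    return total, {k: round(100 * v / total) for k, v in counts.items()}


for name, counts in [("Age", AGE), ("Sex", SEX), ("Social class", SOCIAL_CLASS)]:
    total, pct = category_percentages(counts)
    print(name, total, pct)
```

Because each share is rounded separately, the rounded percentages within a category need not add up to exactly 100.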

2.4. Academic

In the design of the original BNC, academic prose is not identified as a distinct category, although a large amount of such material is contained in it. On the basis of the descriptive information provided in the text headers, it is however possible to select such texts. From the set of texts identified by David Lee as "written academic", titles were randomly selected within different subject areas to maximize variation in topic. An attempt was also made to include data originally published in periodicals as well as in books, although no targets were set for the proportions of material from each medium. Of the 501 academic writing texts in the BNCworld, 30 were selected for the BNC-baby academic component.
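Random selection within subject areas is a form of stratified sampling. The sketch below shows one possible way to do it in Python, cycling over the areas so that each contributes as evenly as it can. It is an assumption-laden illustration only: the source does not say how many texts were taken from each area, and the function name, the `(text_id, subject_area)` input format, and the round-robin scheme are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_pick(texts, n, seed=0):
    """texts: iterable of (text_id, subject_area) pairs.
    Pick n texts at random, cycling over subject areas in turn."""
    rng = random.Random(seed)
    by_area = defaultdict(list)
    for text_id, area in texts:
        by_area[area].append(text_id)
    for ids in by_area.values():
        rng.shuffle(ids)  # random order within each subject area

    areas = sorted(by_area)
    picked = []
    # Take one text per area per round until n are picked or all areas are empty.
    while len(picked) < n and any(by_area[a] for a in areas):
        for a in areas:
            if by_area[a] and len(picked) < n:
                picked.append((by_area[a].pop(), a))
    return picked
```

With 501 candidate texts and `n=30`, this yields 30 distinct texts spread across whatever subject areas are present in the input.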

2.5. Design of the BNC World Edition

Sections 2 to 4 of the Users Reference Guide supplied with the BNC World Edition give a detailed overview of the design principles underlying the original construction of the British National Corpus. Because BNC-baby is sampled from that corpus, it necessarily follows the same principles. The BNC World Users Reference Guide also includes detailed information about the actual composition of the full corpus with respect to its selection and classification.

We do not duplicate that information here: it should however be consulted for a proper understanding of the composition of the BNC-baby corpus.
