[bnc] Design of the corpus - BNC User manual

Design of the corpus

This section discusses some of the basic design issues underlying the creation of the BNC. It summarizes the kinds of uses for which the corpus is intended, and the principles upon which it was created. Some summary information about the composition of the corpus is also included.

Purpose

The uses originally envisaged for the British National Corpus were set out in a working document, Planned Uses of the British National Corpus (BNCW02, 11 April 1991). This document identified the following as likely application areas for the corpus:

reference book publishing
academic linguistic research
language teaching
artificial intelligence
natural language processing
speech processing
information retrieval

The same document identified the following categories of linguistic information derivable from the corpus:

lexical
semantic/pragmatic
syntactic
morphological
graphological/written form/orthographical

General definitions

The British National Corpus is:

a sample corpus: composed of text samples generally no longer than 45,000 words.
a synchronic corpus: it includes imaginative texts from 1960 onwards and informative texts from 1975 onwards.
a general corpus: not specifically restricted to any particular subject field, register or genre.
a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.
a mixed corpus: it contains examples of both spoken and written language.

Composition

There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English. In the BNC sampler, a two per cent sample taken from the whole of the BNC, spoken and written language are present in approximately equal proportions, but other criteria are not equally balanced.

From the start, a decision was taken to select material for inclusion in the corpus according to an overt methodology, with specific target quantities of clearly defined types of language. This approach makes it possible for other researchers and corpus compilers to review, emulate or adapt concrete design goals. This section outlines these design considerations, and reports on the final make-up of the BNC.

The tables in this section show the actual make-up of the second version of the British National Corpus (the BNC World Edition) in terms of:

texts: number of distinct samples not exceeding 45,000 words
S-units: number of <s> elements identified by the CLAWS system (more or less equivalent to sentences)
W-units: number of <w> elements identified by the CLAWS system (more or less equivalent to words)

For further explanation of <s> and <w> elements, see section ??.
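
As a rough illustration of what the S-unit and W-unit counts measure, the sketch below counts opening <s> and <w> tags in a short marked-up string. The sample fragment, its attribute layout, and the function name count_units are assumptions made up for this example; they are not copied from the corpus, and real corpus files should be read with a proper SGML parser.

```python
import re

# Hypothetical fragment written for illustration only; not taken from the BNC.
SAMPLE = (
    '<s n="1"><w AT0>The <w NN1>corpus <w VBZ>is <w AJ0>large<c PUN>.\n'
    '<s n="2"><w PNP>It <w VBZ>grows<c PUN>.'
)


def count_units(text):
    """Return (s_units, w_units): the number of opening <s> and <w> tags."""
    # \b stops "<s" from matching longer element names that merely start with s.
    s_units = len(re.findall(r"<s\b", text))
    w_units = len(re.findall(r"<w\b", text))
    return s_units, w_units


s_units, w_units = count_units(SAMPLE)
print(f"s-units: {s_units}, w-units: {w_units}")
```

Punctuation in the sample is tagged with a different element, so it does not contribute to the w-unit count in this sketch.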

The BNC World Edition contains 4054 texts and occupies (including SGML markup) 1,508,392 Kbytes, or about 1.5 GB. In total, it comprises just over 100 million orthographic words (specifically, 100,467,090), but the number of w-units (POS-tagged items) is slightly lower: 97,619,934. The total number of s-units identified by CLAWS is just over 6 million (6,053,093). Counts for these and all the other elements tagged in the corpus are provided below in ??

In the following tables, each count is given both as an absolute figure and as a percentage. The percentage is calculated with reference to the relevant portion of the corpus; for example, in the table for "written text domain", it is calculated with reference to the total number of written texts. These reference totals are given in the first table below.

Table 1. Composition of the BNC World Edition
Text type	Texts	W-units	%	S-units	%
Spoken demographic	153	4206058	4.30	610563	10.08
Spoken context-governed	757	6135671	6.28	428558	7.07
All Spoken	910	10341729	10.58	1039121	17.15
Written books and periodicals	2688	78580018	80.49	4403803	72.75
Written-to-be-spoken	35	1324480	1.35	120153	1.98
Written miscellaneous	421	7373707	7.55	490016	8.09
All Written	3144	87278205	89.39	5013972	82.82
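
The percentages for the individual text types in Table 1 can be reproduced from the counts. The following sketch takes the two large counts in each row to be w-units and s-units, and divides them by the corpus totals quoted earlier (97,619,934 w-units and 6,053,093 s-units). It assumes that the percentages are cut to two decimal places rather than rounded; that is an inference from the printed figures, not something the manual states. The names ROWS and truncated_percent are illustrative.

```python
# Counts copied from Table 1: (texts, w-units, s-units) per text type.
ROWS = {
    "Spoken demographic": (153, 4206058, 610563),
    "Spoken context-governed": (757, 6135671, 428558),
    "Written books and periodicals": (2688, 78580018, 4403803),
    "Written-to-be-spoken": (35, 1324480, 120153),
    "Written miscellaneous": (421, 7373707, 490016),
}
TOTAL_W_UNITS = 97619934  # corpus total quoted in the text
TOTAL_S_UNITS = 6053093   # corpus total quoted in the text


def truncated_percent(count, total):
    """Percentage of total, cut (not rounded) to two decimal places."""
    # Integer floor division avoids floating-point error near a boundary.
    return (count * 10000 // total) / 100


for name, (texts, w_units, s_units) in ROWS.items():
    w_pct = truncated_percent(w_units, TOTAL_W_UNITS)
    s_pct = truncated_percent(s_units, TOTAL_S_UNITS)
    print(f"{name:30} {w_pct:6.2f} {s_pct:6.2f}")
```

Under this assumption, the figures for the All Written row equal the sums of its three component percentages, which suggests that subtotals were added up from the already truncated values.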

All texts are also classified according to their date of production. For spoken texts, this is the date of the recording. For written texts, it is for the most part the date of production of the material actually transcribed; in the case of imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards, and imaginative ones from 1960 onwards, reflecting the longer ‘shelf-life’ of imaginative works, though most (75 per cent) of the latter were published no earlier than 1975.

Table 2. Date of production
Creation date	texts	w-units	%	s-units	%
Unknown	162	1814051	1.85	127132	2.10
Before 1974	47	1741624	1.78	121323	2.00
1974 to 1983	156	4621950	4.73	255057	4.21
1984 to 1994	3689	89442309	91.62	5549581	91.68
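
Because every text falls into exactly one date band, the columns of Table 2 should add up to the corpus totals quoted earlier (4054 texts, 97,619,934 w-units, 6,053,093 s-units). The short check below confirms this; the names DATE_BANDS and column_sums are illustrative.

```python
# Counts copied from Table 2: (texts, w-units, s-units) per date band.
DATE_BANDS = {
    "Unknown": (162, 1814051, 127132),
    "Before 1974": (47, 1741624, 121323),
    "1974 to 1983": (156, 4621950, 255057),
    "1984 to 1994": (3689, 89442309, 5549581),
}
# Corpus totals quoted in the text: texts, w-units, s-units.
CORPUS_TOTALS = (4054, 97619934, 6053093)


def column_sums(rows):
    """Sum each column (texts, w-units, s-units) across all rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


totals = column_sums(DATE_BANDS)
print(totals, totals == CORPUS_TOTALS)
```

The same check can be applied to any of the classification tables, since each one partitions the same set of texts.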

Spoken and written components of the corpus are discussed separately in the next two sections.
