Design of the corpus
This section discusses some of the basic design issues underlying the creation of the BNC. It summarizes the kinds of uses for which the corpus is intended, and the principles upon which it was created. Some summary information about the composition of the corpus is also included.
General definitions
The British National Corpus is:
- a sample corpus: composed of text samples generally no longer than 45,000 words.
- a synchronic corpus: it includes imaginative texts from 1960 onwards and informative texts from 1975 onwards.
- a general corpus: not specifically restricted to any particular subject field, register or genre.
- a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.
- a mixed corpus: it contains examples of both spoken and written language.
Composition
There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English. In the BNC sampler, a two per cent sample taken from the whole of the BNC, spoken and written language are present in approximately equal proportions, but other criteria are not equally balanced.
From the start, a decision was taken to select material for inclusion in the corpus according to an overt methodology, with specific target quantities of clearly defined types of language. This approach makes it possible for other researchers and corpus compilers to review, emulate or adapt concrete design goals. This section outlines these design considerations, and reports on the final make-up of the BNC.
For further explanation of <s> and <w> elements, see section ??.
The BNC World Edition contains 4,054 texts and occupies (including SGML markup) 1,508,392 Kbytes, or about 1.5 GB. In total, it comprises just over 100 million orthographic words (specifically, 100,467,090), but the number of w-units (POS-tagged items) is slightly lower: 97,619,934. The total number of s-units identified by CLAWS is just over 6 million (6,053,093). Counts for these and all the other elements tagged in the corpus are provided below in ??.
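As a rough cross-check on the figures quoted above, the following Python sketch derives a few summary ratios from them. The constant names and the `summarize` function are illustrative only and are not part of any BNC tooling. The text does not say whether a Kbyte is 1,000 or 1,024 bytes, so the size is converted both ways.

```python
# Figures quoted in the text for the BNC World Edition.
TEXTS = 4054
SIZE_KBYTES = 1_508_392
ORTHOGRAPHIC_WORDS = 100_467_090
W_UNITS = 97_619_934
S_UNITS = 6_053_093


def summarize():
    """Return derived ratios; names are hypothetical, chosen for this sketch."""
    return {
        # Assumption: decimal prefixes (1 GB = 1,000,000 Kbytes).
        "size_gb_decimal": SIZE_KBYTES / 1_000_000,
        # Assumption: binary prefixes (1 GiB = 1024 * 1024 Kbytes).
        "size_gib_binary": SIZE_KBYTES / 1024**2,
        # Number of w-units per orthographic word.
        "w_units_per_word": W_UNITS / ORTHOGRAPHIC_WORDS,
        # Mean number of w-units in an s-unit.
        "w_units_per_s_unit": W_UNITS / S_UNITS,
        # Mean number of orthographic words in a text.
        "words_per_text": ORTHOGRAPHIC_WORDS / TEXTS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

On these figures, the "about 1.5" size holds under the decimal reading and comes out nearer 1.44 under the binary one. There are roughly 0.97 w-units per orthographic word, about 16 w-units per s-unit, and just under 25,000 words per text on average.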
Spoken and written components of the corpus are discussed separately in the next two sections.