Design of the corpus

This section discusses some of the basic design issues underlying the creation of the BNC. It summarizes the kinds of uses for which the corpus is intended, and the principles upon which it was created. Some summary information about the composition of the corpus is also included.

Purpose

The uses originally envisaged for the British National Corpus were set out in a working document, Planned Uses of the British National Corpus (BNCW02, 11 April 91). This document identified the following as likely application areas for the corpus:
  • reference book publishing
  • academic linguistic research
  • language teaching
  • artificial intelligence
  • natural language processing
  • speech processing
  • information retrieval
The same document identified the following categories of linguistic information derivable from the corpus:
  • lexical
  • semantic/pragmatic
  • syntactic
  • morphological
  • graphological/written form/orthographical

General definitions

The British National Corpus is:
  • a sample corpus: composed of text samples generally no longer than 45,000 words.
  • a synchronic corpus: the corpus includes imaginative texts from 1960 onwards and informative texts from 1975 onwards.
  • a general corpus: not specifically restricted to any particular subject field, register or genre.
  • a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.
  • a mixed corpus: it contains examples of both spoken and written language.

Composition

There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English. In the BNC sampler, a two per cent sample taken from the whole of the BNC, spoken and written language are present in approximately equal proportions, but other criteria are not equally balanced.

From the start, a decision was taken to select material for inclusion in the corpus according to an overt methodology, with specific target quantities of clearly defined types of language. This approach makes it possible for other researchers and corpus compilers to review, emulate or adapt concrete design goals. This section outlines these design considerations, and reports on the final make-up of the BNC.

The tables in this section show the actual make-up of the second version of the British National Corpus (the BNC World Edition) in terms of:
  • texts: number of distinct samples not exceeding 45,000 words
  • S-units: number of <s> elements identified by the CLAWS system (more or less equivalent to sentences)
  • W-units: number of <w> elements identified by the CLAWS system (more or less equivalent to words)

For further explanation of <s> and <w> elements, see section ??.
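
The two unit counts can be illustrated with a minimal sketch that counts opening <s> and <w> tags in a string of markup. The fragment below is hypothetical and simplified, written only for this example; the function name count_units is likewise an assumption, and real corpus markup may differ in detail.

```python
import re

# Hypothetical, simplified fragment written for this example only;
# it is not taken from the corpus and real markup may differ.
SAMPLE = (
    '<s n="1"><w AT0>The <w NN1>corpus <w VBZ>is <w AJ0>large<c PUN>.</s>'
    '<s n="2"><w PNP>It <w VVZ>grows<c PUN>.</s>'
)


def count_units(markup: str) -> dict:
    """Count opening <s> and <w> tags in a string of markup."""
    return {
        # \b stops the pattern from matching longer tag names that
        # merely begin with "s" or "w".
        "s_units": len(re.findall(r"<s\b", markup)),
        "w_units": len(re.findall(r"<w\b", markup)),
    }


print(count_units(SAMPLE))
```

In this fragment the punctuation marks carry a separate tag and so are not counted as w-units, which is one reason a w-unit count need not equal a count of orthographic words.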

The BNC World Edition contains 4054 texts and occupies (including SGML markup) 1,508,392 Kbytes, or about 1.5 GB. In total, it comprises just over 100 million orthographic words (specifically, 100,467,090), but the number of w-units (POS-tagged items) is slightly lower: 97,619,934. The total number of s-units identified by CLAWS is just over 6 million (6,053,093). Counts for these and all the other elements tagged in the corpus are provided below in ??

In the following tables, each count is given both as an absolute figure and as a percentage. The percentage is calculated with reference to the relevant portion of the corpus; in the table for "written text domain", for example, it is calculated with reference to the total number of written texts. These reference totals are given in the first table below.

Table 1. Composition of the BNC World Edition

Text type                      Texts   W-units      %  S-units      %
Spoken demographic               153   4206058   4.30   610563  10.08
Spoken context-governed          757   6135671   6.28   428558   7.07
All Spoken                       910  10341729  10.58  1039121  17.15
Written books and periodicals   2688  78580018  80.49  4403803  72.75
Written-to-be-spoken              35   1324480   1.35   120153   1.98
Written miscellaneous            421   7373707   7.55   490016   8.09
All Written                     3144  87278205  89.39  5013972  82.82
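
The relation between the counts and the percentages in Table 1 can be checked with a short sketch. The counts are copied from the table; the names TABLE_1, column_totals and percent are assumptions introduced for illustration. The column totals should agree with the overall figures quoted above (4054 texts, 97,619,934 w-units, 6,053,093 s-units), while recomputed percentages may differ from the printed ones in the last digit, depending on how the figures were rounded.

```python
# Counts copied from Table 1: (texts, w-units, s-units) per text type.
# The subtotal rows ("All Spoken", "All Written") are left out on purpose.
TABLE_1 = {
    "Spoken demographic": (153, 4_206_058, 610_563),
    "Spoken context-governed": (757, 6_135_671, 428_558),
    "Written books and periodicals": (2688, 78_580_018, 4_403_803),
    "Written-to-be-spoken": (35, 1_324_480, 120_153),
    "Written miscellaneous": (421, 7_373_707, 490_016),
}


def column_totals(rows):
    """Sum each column (texts, w-units, s-units) over all rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


def percent(count, total):
    """Express count as a percentage of total."""
    return 100.0 * count / total


texts, w_units, s_units = column_totals(TABLE_1)
print(texts, w_units, s_units)

# Percentage of all w-units and all s-units for each text type.
for name, (_, w, s) in TABLE_1.items():
    print(f"{name:30} {percent(w, w_units):6.2f} {percent(s, s_units):6.2f}")
```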

All texts are also classified according to their date of production. For spoken texts, this is the date of the recording. For written texts, it is for the most part the date of production of the material actually transcribed; for imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards and imaginative ones from 1960, reflecting the longer ‘shelf-life’ of imaginative texts, though most (75 per cent) of the latter were published no earlier than 1975.

Table 2. Date of production

Creation date  Texts   W-units      %  S-units      %
Unknown          162   1814051   1.85   127132   2.10
Before 1974       47   1741624   1.78   121323   2.00
1974 to 1983     156   4621950   4.73   255057   4.21
1984 to 1994    3689  89442309  91.62  5549581  91.68
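
The dating rules and the bands of Table 2 can be sketched as two small functions. This is an illustration under stated assumptions, not the procedure used by the project: the function names, the text_kind labels, the inclusive reading of the band labels, and the handling of years after 1994 are all assumptions.

```python
from typing import Optional


def classification_year(
    text_kind: str,
    recorded: Optional[int] = None,
    edition: Optional[int] = None,
    first_published: Optional[int] = None,
) -> Optional[int]:
    """Pick the year used for classification, following the rules in the text.

    text_kind is a hypothetical label: "spoken", "imaginative", or anything
    else for other written material.
    """
    if text_kind == "spoken":
        return recorded          # date of the recording
    if text_kind == "imaginative":
        return first_published   # date of first publication
    return edition               # date of the material actually transcribed


def date_band(year: Optional[int]) -> str:
    """Map a year to a row label of Table 2 (bands read as inclusive)."""
    if year is None:
        return "Unknown"
    if year < 1974:
        return "Before 1974"
    if year <= 1983:
        return "1974 to 1983"
    if year <= 1994:
        return "1984 to 1994"
    # Assumption: later years fall outside the table, so reject them.
    raise ValueError(f"year {year} is later than the bands in Table 2")


print(date_band(classification_year("imaginative", edition=1990, first_published=1962)))
```

The example call shows why the distinction matters: an imaginative work first published in 1962 but transcribed from a 1990 edition is placed in the earliest band.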

Spoken and written components of the corpus are discussed separately in the next two sections.
