[bnc] The BNC in numbers - About the British National Corpus

The BNC in numbers

The XML Edition of the BNC contains 4049 texts and occupies (including all markup) 5,228,040 Kb, or about 5.2 Gb. In total, it comprises just under 100 million orthographic words (specifically, 96,986,707), but the number of w-units (POS-tagged items) is slightly higher at 98,363,783. The tagging distinguishes a further 13,614,425 punctuation strings, giving a total content count of 110,691,482 tokens. The total number of s-units tagged is over 6 million (6,026,284). Counts for these and all the other XML elements tagged in the corpus are provided in the corpus header.
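As a rough illustration of how these counts relate to one another, the following sketch derives two ratios from the figures quoted above. The variable names are illustrative only and are not taken from the corpus header.

```python
# Counts quoted in the paragraph above (XML Edition).
ORTHOGRAPHIC_WORDS = 96_986_707
W_UNITS = 98_363_783
S_UNITS = 6_026_284

# Average number of POS-tagged items per s-unit.
w_units_per_s_unit = W_UNITS / S_UNITS

# How many w-units there are, on average, per orthographic word.
w_units_per_word = W_UNITS / ORTHOGRAPHIC_WORDS

print(f"w-units per s-unit: {w_units_per_s_unit:.2f}")          # 16.32
print(f"w-units per orthographic word: {w_units_per_word:.4f}")  # 1.0142
```

On these figures an s-unit holds about 16 w-units on average, and there are about 1.4% more w-units than orthographic words.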

To put these numbers into perspective, the average paperback book has about 250 pages per centimetre of thickness; assuming 400 words a page, we calculate that the whole corpus (taken as a round 100 million words) printed in small type on thin paper would take up about ten metres of shelf space. Reading the whole corpus aloud at a fairly rapid 150 words a minute, eight hours a day, 365 days a year, would take a little under four years.
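The arithmetic behind these two estimates can be sketched as follows. The function and constant names are illustrative, and the word count is rounded to 100 million as an assumption; with that figure the reading time comes out at about 3.8 years.

```python
WORDS = 100_000_000      # assumed round figure for the whole corpus
WORDS_PER_PAGE = 400
PAGES_PER_CM = 250       # paperback thickness, as stated in the text
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365


def shelf_metres(words: int) -> float:
    """Shelf space in metres needed to hold `words` words in print."""
    pages = words / WORDS_PER_PAGE
    centimetres = pages / PAGES_PER_CM
    return centimetres / 100


def reading_years(words: int) -> float:
    """Years needed to read `words` words aloud at the stated pace."""
    minutes = words / WORDS_PER_MINUTE
    days = minutes / 60 / HOURS_PER_DAY
    return days / DAYS_PER_YEAR


print(f"{shelf_metres(WORDS):.1f} metres of shelf space")   # 10.0
print(f"{reading_years(WORDS):.2f} years of reading aloud")  # 3.81
```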

As the following summary table shows, most of the words making up the corpus (about 90 percent) are taken from written texts of many different kinds, but about 10 percent, roughly 10 million words in total, are taken from transcribed speech recorded in both formal and informal contexts.

Table 1. Composition of the BNC World Edition
Text type	Texts	W-units	%	S-units	%
Spoken demographic	153	4206058	4.30	610563	10.08
Spoken context-governed	757	6135671	6.28	428558	7.07
All Spoken	910	10341729	10.58	1039121	17.15
Written books and periodicals	2688	78580018	80.49	4403803	72.75
Written-to-be-spoken	35	1324480	1.35	120153	1.98
Written miscellaneous	421	7373707	7.55	490016	8.09
All Written	3144	87278205	89.39	5013972	82.82
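The subtotal rows in Table 1 can be cross-checked with a short script. This is a minimal sketch which assumes that the two large-number columns are W-unit and S-unit counts, each followed by its percentage share. The names `TABLE`, `column_total` and `share` are illustrative. Shares are recomputed against the sum of the five rows listed here, so they may differ from the published percentages in the second decimal place.

```python
# (texts, w_units, s_units) for each text type, copied from Table 1.
TABLE = {
    "Spoken demographic": (153, 4_206_058, 610_563),
    "Spoken context-governed": (757, 6_135_671, 428_558),
    "Written books and periodicals": (2688, 78_580_018, 4_403_803),
    "Written-to-be-spoken": (35, 1_324_480, 120_153),
    "Written miscellaneous": (421, 7_373_707, 490_016),
}
TEXTS, W_UNITS, S_UNITS = 0, 1, 2  # column indexes into each tuple


def column_total(column: int, prefix: str = "") -> int:
    """Sum one column over all rows whose name starts with `prefix`."""
    return sum(row[column] for name, row in TABLE.items()
               if name.startswith(prefix))


def share(column: int, prefix: str) -> float:
    """Percentage of a column's grand total contributed by `prefix` rows."""
    return 100 * column_total(column, prefix) / column_total(column)


for group in ("Spoken", "Written"):
    print(
        f"All {group}: texts={column_total(TEXTS, group)}, "
        f"w-units={column_total(W_UNITS, group)} "
        f"({share(W_UNITS, group):.2f}%), "
        f"s-units={column_total(S_UNITS, group)} "
        f"({share(S_UNITS, group):.2f}%)"
    )
```

The recomputed text, W-unit and S-unit subtotals should match the "All Spoken" and "All Written" rows exactly.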

More detailed frequency information for the various kinds of text included in the corpus is available in the BNC User Reference Guide.

Word frequency lists for the whole corpus have also been produced and published online by, for example, Leech, Rayson and Wilson (see Word Frequencies in Written and Spoken English: based on the British National Corpus).
