The BNC in numbers

The XML Edition of the BNC contains 4049 texts and occupies (including all markup) 5,228,040 Kb, or about 5.2 Gb. In total, it comprises just under 100 million orthographic words (specifically, 96,986,707), but the number of w-units (POS-tagged items) is slightly higher at 98,363,783. The tagging distinguishes a further 13,614,425 punctuation strings, giving a total content count of 110,691,482 tokens. The total number of s-units tagged is over 6 million (6,026,284). Counts for these and all the other XML elements tagged in the corpus are provided in the corpus header.

To put these numbers into perspective, the average paperback book has about 250 pages per centimetre of thickness; assuming 400 words a page, 100 million words fill 250,000 pages, so the whole corpus printed in small type on thin paper would take up about ten metres of shelf space. Reading the whole corpus aloud at a fairly rapid 150 words a minute, eight hours a day, 365 days a year, would take just under four years.
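The shelf-space and reading-time estimates can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the corpus documentation: the function names are illustrative, and the corpus size is rounded to 100 million words as an assumption (the exact orthographic word count is slightly lower). On these figures the reading time works out a little under four years.

```python
# Back-of-the-envelope check of the shelf-space and reading-time estimates.
# Assumption: corpus size rounded to 100 million words.
WORDS = 100_000_000


def shelf_metres(words, words_per_page=400, pages_per_cm=250):
    """Shelf space in metres for a given number of printed words."""
    pages = words / words_per_page
    centimetres = pages / pages_per_cm
    return centimetres / 100


def reading_years(words, words_per_minute=150, hours_per_day=8, days_per_year=365):
    """Years needed to read the given number of words aloud."""
    words_per_year = words_per_minute * 60 * hours_per_day * days_per_year
    return words / words_per_year


print(shelf_metres(WORDS))             # 10.0
print(round(reading_years(WORDS), 2))  # 3.81
```

Passing the exact word count instead of the rounded figure gives a slightly shorter reading time.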

As the following summary table shows, most (about 90 percent) of the words making up the corpus are taken from written texts of many different kinds, but about 10 percent (some 10 million words in total) are taken from transcribed speech, recorded in both formal and informal contexts.

Table 1. Composition of the BNC World Edition
Text type                      Texts   W-units  W-units %  S-units  S-units %
Spoken demographic               153   4206058       4.30   610563      10.08
Spoken context-governed          757   6135671       6.28   428558       7.07
All Spoken                       910  10341729      10.58  1039121      17.78
Written books and periodicals   2688  78580018      80.49  4403803      72.75
Written-to-be-spoken              35   1324480       1.35   120153       1.98
Written miscellaneous            421   7373707       7.55   490016       8.09
All Written                     3144  87278205      89.39  5013972      82.82

More detailed frequency information for the various kinds of text included in the corpus is available in the BNC User Reference Guide.

Word frequency lists for the whole corpus have also been produced and published online by, for example, Leech, Rayson and Wilson (see Word Frequencies in Written and Spoken English: based on the British National Corpus).
