The BNC in numbers
The XML Edition of the BNC contains 4049 texts and occupies (including all markup) 5,228,040 KB, or about 5.2 GB. In total, it comprises just under 100 million orthographic words (specifically, 96,986,707), but the number of w-units (POS-tagged items) is slightly higher at 98,363,783. The tagging distinguishes a further 13,614,425 punctuation strings, giving a total content count of 110,691,482 tokens. The total number of s-units tagged is over 6 million (6,026,284). Counts for these and all the other XML elements tagged in the corpus are provided in the corpus header.
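As a rough illustration of how such element counts could be reproduced, the sketch below tallies w, c and s elements in a single XML document using Python's standard library. The sample sentence, its attribute values and the function name count_units are illustrative assumptions, not taken from the corpus; the official counts remain those in the corpus header.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def count_units(source):
    """Count w (word), c (punctuation) and s (sentence) elements.

    `source` may be a file path or a binary file-like object.
    """
    counts = Counter()
    # Stream the document so that large texts need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in ("w", "c", "s"):
            counts[elem.tag] += 1
        # Children are already counted by the time their parent ends.
        elem.clear()
    return counts

# A made-up sentence in the style of BNC XML markup (assumption).
SAMPLE = (
    b'<s n="1"><w c5="AT0" hw="the" pos="ART">The </w>'
    b'<w c5="NN1" hw="corpus" pos="SUBST">corpus</w>'
    b'<c c5="PUN">.</c></s>'
)

counts = count_units(io.BytesIO(SAMPLE))
print(counts["w"], counts["c"], counts["s"])
```

Summing the results of count_units over every text file in a local copy of the corpus should, under these assumptions, give totals comparable to those listed in the header.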
To put these numbers into perspective, the average paperback book has about 250 pages per centimetre of thickness; assuming 400 words a page, we calculate that the whole corpus printed in small type on thin paper would take up about ten metres of shelf space. Reading the whole corpus aloud at a fairly rapid 150 words a minute, eight hours a day, 365 days a year, would take about four years.
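The arithmetic behind these estimates can be checked with a short sketch. The constants are the ones given in the paragraph above; the function names are illustrative, and the round figure of 100 million words is an assumption about which count the estimates are based on.

```python
WORDS_PER_PAGE = 400
PAGES_PER_CM = 250
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

def shelf_metres(words):
    """Shelf space in metres needed to hold `words` words in print."""
    pages = words / WORDS_PER_PAGE
    centimetres = pages / PAGES_PER_CM
    return centimetres / 100

def reading_years(words):
    """Years needed to read `words` words aloud at the stated pace."""
    minutes = words / WORDS_PER_MINUTE
    days = minutes / 60 / HOURS_PER_DAY
    return days / DAYS_PER_YEAR

ROUND_WORDS = 100_000_000   # round figure (assumption)
ALL_TOKENS = 110_691_482    # total content count, including punctuation

print(f"shelf space: {shelf_metres(ROUND_WORDS):.1f} m")
print(f"reading time, words only: {reading_years(ROUND_WORDS):.1f} years")
print(f"reading time, all tokens: {reading_years(ALL_TOKENS):.1f} years")
```

With 100 million words this gives ten metres of shelf space and about 3.8 years of reading; counting all 110,691,482 tokens instead gives about 4.2 years.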
As the following summary table shows, most (about 90%) of the words making up the corpus are taken from written texts of many different kinds, but about 10% (roughly 10 million words in total) are taken from transcribed speech, recorded in both formal and informal contexts.
More detailed frequency information for the various kinds of text included in the corpus is available in the BNC User Reference Guide.
Word frequency lists for the whole corpus have also been produced and published online by, for example, Leech, Rayson and Wilson (see Word Frequencies in Written and Spoken English: based on the British National Corpus).