BNC Products

When users obtain a BNC product, they agree to the licence which gives them the right to hold and use a copy of the corpus. A corpus is a dataset which can be used in many different ways, and we regret that the University of Oxford is not able to offer support to users of the corpus. Funding for the development and support of the corpus ended many years ago, but the corpus has been created in such a way that it should be usable long into the future, with software created by the community.

The British National Corpus (BNC) Consortium was formed in 1990, and started work in 1991 on the three-year task of producing a 100-million-word corpus of modern British English for use in commercial and academic research. About 90% of the full BNC is written text and 10% is orthographically transcribed speech. The first edition was completed in 1994, and the first general release of the corpus for European researchers was announced in February 1995. A slightly revised version, BNC World, was made available worldwide in 2001. In 2007, a third edition appeared, using XML. BNC XML Edition is the version currently distributed. Two subsets of the BNC have been produced separately: BNC Sampler and BNC Baby.

The BNC corpora have historically been distributed with a free search tool called XAIRA, developed specially for the BNC but also usable with other corpora in XML format. Many users will still access the BNC via a disk which includes a copy of XAIRA. XAIRA is not supported by the University of Oxford, and staff there cannot answer queries about its installation or use. Users of XAIRA should bear in mind that it will not be usable in all circumstances or for all purposes, and that at some point in the future it is likely to stop working on the latest computing platforms.

BNC XML Edition

The full BNC contains about 100 million words: 90% written, 10% orthographically transcribed spoken text. It is annotated with word-class information (part-of-speech, simplified word class) and lemmatized. The texts also contain detailed metatextual information. It is delivered in XML format.
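Because the corpus is plain XML, the annotation can be read with any general-purpose XML library. The sketch below uses Python's standard library on a small invented fragment. It assumes that each word is a `w` element carrying `c5` (word-class tag), `hw` (lemma) and `pos` (simplified word class) attributes; the fragment, the attribute names and the `read_tokens` helper are illustrative assumptions and should be checked against the Reference Guide before use on real corpus files.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only; it is not taken from the corpus.
# The element and attribute names (w, c, c5, hw, pos) are assumptions to be
# verified against the Reference Guide.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_tokens(xml_text):
    """Return (word form, lemma, word-class tag, simplified class) per <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter("w"):
        # Word forms may carry trailing whitespace, so strip it.
        form = (w.text or "").strip()
        tokens.append((form, w.get("hw"), w.get("c5"), w.get("pos")))
    return tokens


tokens = read_tokens(SAMPLE)
for token in tokens:
    print(token)
```

Punctuation (`c` elements in this sketch) is skipped deliberately, so the result contains only annotated words.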

BNC XML Edition is distributed on two DVDs. Users are welcome to try out the free XAIRA software, which is provided for installation on a stand-alone PC or on a Windows, Unix or OSX server. A copy of the XAIRA search program and the XAIRA index files for the BNC are provided.

Full reference information about the BNC is provided in the Reference Guide for the British National Corpus (XML Edition). Information about the BNC project and the original creation of the corpus can be found on the corpus creation page. The corpus can be downloaded from the Oxford Text Archive (OTA).

BNC Baby

BNC Baby is a subset of the BNC. It consists of four one-million-word samples, each compiled as an example of a particular genre: fiction, newspapers, academic writing and spoken conversation. The texts have the same annotation as the full corpus (part of speech, metadata, etc.). The Reference Guide to BNC Baby [.pdf file] offers further information about this sample, such as a description of its design and of the way in which it is encoded.

BNC Baby is in XML format. More information about the former release of BNC Baby on CD is available on the BNC Baby CD page. The corpus can now be downloaded from the OTA.

BNC Sampler

The BNC Sampler is a subset of the full BNC. It comprises two samples, one of written and one of spoken material, of one million words each, compiled to mirror the composition of the full BNC as far as possible. The word-class annotation of the BNC Sampler texts has been carefully checked and manually corrected. The Sampler was first created at Lancaster University during the creation of the BNC. More information about the Sampler can be found in the Users Reference Guide for the BNC Sampler: XML Edition [.pdf file].

The BNC Sampler is in XML format. The corpus can be downloaded from the OTA.
