Introduction
Overview
The Users Reference Guide for the British
National Corpus contains a description of the design principles
underlying the British National Corpus (BNC), and detailed information
about the way in which it is encoded, covering both the markup conventions
applied and the linguistic annotation with which the corpus was
enriched.
This revised edition has been slightly reorganized and considerably expanded to
provide a complete reference work for users of the corpus in its new
XML form. The text of the manual is available in TEI-XML and in HTML
format, and also from the BNC website at , from which
updated versions may be obtained.
The material presented in this manual derives originally from a
number of BNC Project internal documents, combining contributions
from all the participants in the project (see further ); any errors introduced are the responsibility of
the editor. Please send any comments or corrections to
natcorp@oucs.ox.ac.uk.
Section describes the basic structure of
the BNC encoding scheme, in terms of the XML elements and attributes
distinguished and the tags used to mark them. Section describes features which are peculiar to written
texts, and section those peculiar to spoken
texts. In each case, a distinction is made between those elements
which are marked up in all texts and those which (for technical or
financial reasons) are not always so distinguished, and hence appear
in some texts only. Not all of the features described here are present
in every text of the corpus, and those that are present are not
necessarily tagged.
Section describes the structure of the
detailed metadata associated with each text, in the form of the
teiHeader element attached to each component of the corpus,
and also to the whole corpus itself.
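As a rough illustration of how the header attached to a text might be read programmatically, the following Python sketch extracts a title from a teiHeader element. Only the teiHeader element name comes from this manual's overview; the root element name (bncDoc), the path fileDesc/titleStmt/title, the absence of an XML namespace, and the sample content are assumptions made for the example and should be checked against the header documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document. The element names other than teiHeader
# are assumptions for illustration, not taken from this section.
SAMPLE = """<bncDoc xml:id="X00">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Hypothetical sample text</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <wtext/>
</bncDoc>"""


def header_title(xml_text):
    """Return the title found in the teiHeader, or None if it is absent."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    if header is None:
        return None
    # Assumed location of the title inside the header.
    title = header.find("./fileDesc/titleStmt/title")
    if title is None or not title.text:
        return None
    return title.text.strip()


title = header_title(SAMPLE)
print(title)
```

For a real corpus file, the same function could be applied to the file's contents once the assumed path has been confirmed.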
This is complemented in section by a
detailed presentation of the linguistic annotation or wordclass
tagging applied throughout the corpus. (This chapter is derived from
the Tagging Guide (Smith et al.) originally distributed separately with
BNC World.)
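To give a concrete sense of what wordclass tagging makes possible, the sketch below counts tags in a small XML fragment using Python. The element names (s, w, c), the attribute names (c5, hw, pos), and the sample sentence are assumptions made for this illustration and are not defined in this overview; the tagging chapter remains the authority on the actual scheme.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical tagged sentence. Element and attribute names are assumed
# for illustration only.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def count_wordclasses(xml_text, attr="c5"):
    """Count the values of one annotation attribute over all w elements."""
    root = ET.fromstring(xml_text)
    return Counter(
        w.get(attr) for w in root.iter("w") if w.get(attr) is not None
    )


counts = count_wordclasses(SAMPLE)
for tag, n in sorted(counts.items()):
    print(tag, n)
```

Only w elements are counted here, so the punctuation element in the sample is deliberately excluded; passing a different attribute name (for example the assumed pos attribute) would group words by a coarser class instead.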
Section discusses briefly some
ways of exploiting the BNC computationally.
Section complements the metadata supplied in
the header by listing and documenting several of the coded values used
in the markup. A brief bibliography combining significant background readings
about the BNC with works cited elsewhere in the manual is provided in section and a complete list of all the original sources from which the corpus
was compiled is given in section .
Section documents suggested settings
for those wishing to use the XAIRA system to index and query the
BNC. The pre-built XAIRA index delivered as part of the BNC XML
package was made using the XAIRA specification described in this
section. This section is provided for the convenience of XAIRA users;
it may be ignored if you are using some other software to search or
manage the corpus.
Finally, a reference section () provides an
alphabetical list of all XML elements and attributes used in the
markup of the corpus, together with the model and attribute classes to
which they belong, and macros used to simplify references to them.
This specification conforms to the 2007 (P5) edition of the TEI
Guidelines (), with which it should be read in
conjunction.
The BNC was created by an academic-industrial consortium whose
original members were:
Oxford University Press
Longman Group Ltd
Chambers Harrap
Oxford University Computing Services
Unit for Computer Research on the English Language (Lancaster University)
British Library Research and Development Department
Creation of the corpus was funded by the UK Department of Trade and
Industry and the Science and Engineering Research Council under grant number
IED4/1/2184 (1991-1994), within the DTI/SERC Joint Framework for
Information Technology. Additional funding was provided by the British
Library and the British Academy.
Maintenance, distribution, and development of the corpus have been carried out at
Oxford University Computing Services. There have been three major
revisions of the corpus:
BNC 1.0 (1995)
BNC World Edition (2000)
BNC XML Edition (2007)
For a brief historical overview of the project see Burnard 2002.
Acknowledgments
BNC 1.0
Management of the original BNC project was co-ordinated by an executive
committee whose members were as follows:
OUP: Tim Benbow; Simon Murison-Bowie
Longman: Della Summers; Rob Francis
Chambers Harrap: John Clement
OUCS: Lou Burnard
UCREL: Geoffrey Leech
British Library: Terry Cannon
DTI observers: Gerry Gavigan; Donald Bell
An Advisory Council supervised the running of the project 1991-1994.
Members of this Council were:
Dr Michael Brady
Christopher Butler
Professor David Crystal
Sir Antony Kenny (chair)
Dr Nicholas Ostler
Professor Sir Randolph Quirk
Tim Rix
Dr Henry Thompson
Many people within each member organization made
major contributions to the success of the
project. It is a pleasure to acknowledge their
hard work and dedication here.
OUP: Lyndsay Brown; Jeremy Clear (project manager 1991-2); Caroline Davis; Ginny Frewer; Frank Keenan; Tom McLean; Anita Sabin; Ray Woodall (project manager 1992-4)
Longman: Steve Crowdy (project manager); Denise Denney; Duncan Pettigrew
Chambers Harrap: Robert Allen; Ilona Morison
OUCS: Glynis Baguley; Gavin Burnage; Tony Dodd; Dominic Dunlop (project manager 1992-4)
UCREL: Tom Barney; Michael Bryant (project manager 1991-3); Elizabeth Eyes; Jean Forrest; Roger Garside; Mary Hodges; Mary Kinane; Nicholas Smith; Xungfeng Xu
The project also benefited greatly from the advice and support of many
external consultants. Listing all those who have influenced our thinking
and to whom we are indebted would be very difficult, but chief amongst them
we would like to thank:
Sue Atkins
Clive Bradley
Ann Brumfitt
Charles Clark
James Clark
Bruce Heywood
Mark Lefanu
Michael Rundle
Richard Sharman
Michael Sperberg-McQueen
Anna-Brita Stenström
Russell Sweeney
BNC World
After the completion of the first edition of the BNC, a phase of
tagging improvement was undertaken at Lancaster University with
funding from the Engineering and Physical Sciences Research Council
(Research Grant No. GR/F 99847). This tagging enhancement project was
led by Geoffrey Leech, Roger Garside and Tony McEnery. The main
objective was to correct as many tagging errors as possible, using an
enhanced version of CLAWS4. In addition, a new tool was developed (the
Template Tagger) for patching the corpus in such
a way as to eliminate further sets of errors by rule. This tool was
developed by Michael Pacey, building on a prototype written by Steven
Fligelstone. The research team working on tagging improvement was
Nicholas Smith (lead researcher), Martin Wynne and Paul Baker.
Correction and validation of the bibliographic and contextual
information in all the BNC Headers was carried out at OUCS by Lou
Burnard, with assistance at various stages from Andrew Hardie and Paul
Groves, who helped check demographic details for all spoken texts, and
in particular from David Lee, who checked bibliographic and
classification information for the bulk of the written texts. Thanks
are also due to the many users of the original version of the BNC who
took the time to notify us of errors they found.
BNC XML
Thanks are due to Martin Wynne and Ylva Berglund who first
suggested the idea of an XML version of a subset of the BNC.
Production of that edition (BNC Baby) provided valuable experience in
automatic conversion of the World edition. The bulk of the technical
work involved in producing the XML edition was carried out by Tony
Dodd and Lou Burnard, with assistance and advice from many BNC
users and beta-testers worldwide, in particular Guy Aston, Andrew Hardie, Paul
Rayson, and Sebastian Rahtz. Without their input the present revision would
have been impossible.