BNC User Reference Guide

Introduction


Overview

The User Reference Guide for the British National Corpus contains a description of the design principles underlying the British National Corpus (BNC), and detailed information about the way in which it is encoded, covering both the markup conventions applied and the linguistic annotation with which the corpus was enriched.

This revised edition has been slightly reorganized and considerably expanded to provide a complete reference work for users of the corpus in its new XML form. The text of the manual is available in TEI-XML and HTML formats, and also on the BNC website at http://www.natcorp.ox.ac.uk/XMLedition/urg.html, from which updated versions may be obtained.

The material presented in this manual derives originally from a number of BNC Project internal documents, combining contributions from all the participants in the project (see further Acknowledgments); any errors introduced are the responsibility of the editor. Please send any comments or corrections to natcorp@oucs.ox.ac.uk.

Section 2 Basic structure describes the basic structure of the BNC encoding scheme, in terms of the XML elements and attributes distinguished and the tags used to mark them. Section 3 Written texts describes features which are peculiar to written texts, and section 4 Spoken texts those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged.

Section 5 The header describes the structure of the detailed metadata associated with each text, in the form of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself.
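As a rough illustration of what it means for a header to be attached to each text, the following minimal Python sketch reads a title out of the <teiHeader> of a small made-up document. Only the <teiHeader> element name comes from this guide; the sample document, the <doc>, <title> and <body> element names, and the header_title function are assumptions made for illustration, and real corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one corpus text.
# Only <teiHeader> is named in this guide; <doc>, <title> and <body>
# are illustrative assumptions, not the actual BNC XML schema.
SAMPLE = """<doc>
  <teiHeader>
    <title>Sample text</title>
  </teiHeader>
  <body>Some content.</body>
</doc>"""


def header_title(xml_text):
    """Return the text of the first <title> inside <teiHeader>, or None."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")  # direct child of the root element
    if header is None:
        return None
    title = header.find(".//title")  # first <title> anywhere in the header
    if title is None or not title.text:
        return None
    return title.text.strip()


print(header_title(SAMPLE))
```

The function returns None rather than raising an error when no header or title is found, so the same call can be applied to every file in a collection without special-casing incomplete ones.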

This is complemented in section 6 Wordclass Tagging in BNC XML by a detailed presentation of the linguistic annotation or wordclass tagging applied throughout the corpus. (This chapter is derived from the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging (Leech and Smith), originally distributed separately with BNC World.)

Section 7 Software for the BNC briefly discusses some ways of exploiting the BNC computationally. Section 9 Miscellaneous tables complements the metadata supplied in the header by listing and documenting several of the coded values used in the markup. A brief bibliography combining significant background readings about the BNC with works cited elsewhere in the manual is provided in section 8 References, and a complete list of all the original sources from which the corpus was compiled is given in section 10 List of Sources.

Section 11 The Xaira Specification documents suggested settings for those wishing to use the XAIRA system to index and query the BNC. The pre-built XAIRA index delivered as part of the BNC XML package was made using the XAIRA specification described in this section. This section is provided for the convenience of XAIRA users; it may be ignored if you are using some other software to search or manage the corpus.

Finally, a reference section (12 Formal Specification of the BNC XML schema) provides an alphabetical list of all XML elements and attributes used in the markup of the corpus, together with the model and attribute classes to which they belong, and macros used to simplify references to them. This specification conforms to the 2007 (P5) edition of the TEI Guidelines ([24]), with which it should be read in conjunction.

The BNC was originally created by an academic-industrial consortium whose original members were:

Oxford University Press
Longman Group Ltd
Chambers Harrap
Oxford University Computing Services
Unit for Computer Research on the English Language (Lancaster University)
British Library Research and Development Department

Creation of the corpus was funded by the UK Department of Trade and Industry and the Science and Engineering Research Council under grant number IED4/1/2184 (1991-1994), within the DTI/SERC Joint Framework for Information Technology. Additional funding was provided by the British Library and the British Academy.

Maintenance, distribution, and development of the corpus have been carried out at Oxford University Computing Services. There have been three major revisions of the corpus:

BNC 1.0 (1995)
BNC World Edition (2000)
BNC XML Edition (2007)

For a brief historical overview of the project see Burnard 2002.

Acknowledgments

BNC 1.0

Management of the original BNC project was co-ordinated by an executive committee whose members were as follows:

OUP: Tim Benbow; Simon Murison-Bowie
Longman: Della Summers; Rob Francis
Chambers Harrap: John Clement
OUCS: Lou Burnard
UCREL: Geoffrey Leech
British Library: Terry Cannon
DTI observers: Gerry Gavigan; Donald Bell

An Advisory Council supervised the running of the project 1991-1994. Members of this Council were:

Dr Michael Brady
Christopher Butler
Professor David Crystal
Sir Anthony Kenny (chair)
Dr Nicholas Ostler
Professor Sir Randolph Quirk
Tim Rix
Dr Henry Thompson

Many people within each member organization made major contributions to the success of the project. It is a pleasure to acknowledge their hard work and dedication here.

OUP: Lyndsay Brown; Jeremy Clear (project manager 1991-2); Caroline Davis; Ginny Frewer; Frank Keenan; Tom McLean; Anita Sabin; Ray Woodall (project manager 1992-4)
Longman: Steve Crowdy (project manager); Denise Denney; Duncan Pettigrew
Chambers Harrap: Robert Allen; Ilona Morison
OUCS: Glynis Baguley; Gavin Burnage; Tony Dodd; Dominic Dunlop (project manager 1992-4)
UCREL: Tom Barney; Michael Bryant (project manager 1991-3); Elizabeth Eyes; Jean Forrest; Roger Garside; Mary Hodges; Mary Kinane; Nicholas Smith; Xungfeng Xu

The project also benefited greatly from the advice and support of many external consultants. Listing all those who have influenced our thinking and to whom we are indebted would be very difficult, but chief amongst them we would like to thank:

Sue Atkins
Clive Bradley
Ann Brumfitt
Charles Clark
James Clark
Bruce Heywood
Mark Lefanu
Michael Rundle
Richard Sharman
Michael Sperberg-McQueen
Anna-Brita Stenström
Russell Sweeney

BNC World

After the completion of the first edition of the BNC, a phase of tagging improvement was undertaken at Lancaster University with funding from the Engineering and Physical Sciences Research Council (Research Grant No. GR/F 99847). This tagging enhancement project was led by Geoffrey Leech, Roger Garside and Tony McEnery. The main objective was to correct as many tagging errors as possible, using an enhanced version of CLAWS4. In addition, a new tool (the Template Tagger) was developed for ‘patching’ the corpus in such a way as to eliminate further sets of errors by rule. This tool was developed by Michael Pacey, building on a prototype written by Steven Fligelstone. The research team working on tagging improvement consisted of Nicholas Smith (lead researcher), Martin Wynne and Paul Baker.

Correction and validation of the bibliographic and contextual information in all the BNC Headers was carried out at OUCS by Lou Burnard, with assistance at various stages from Andrew Hardie and Paul Groves, who helped check demographic details for all spoken texts, and in particular from David Lee, who checked bibliographic and classification information for the bulk of the written texts. Thanks are also due to the many users of the original version of the BNC who took the time to notify us of errors they found.

BNC XML

Thanks are due to Martin Wynne and Ylva Berglund, who first suggested the idea of an XML version of a subset of the BNC. Production of that edition (BNC Baby) provided valuable experience in automatic conversion of the World edition. The bulk of the technical work involved in producing the XML edition was carried out by Tony Dodd and Lou Burnard, with assistance and advice from many BNC users and beta-testers worldwide, in particular Guy Aston, Andrew Hardie, Paul Rayson, and Sebastian Rahtz. Without their input the present revision would have been impossible.


edited by Lou Burnard. Date: January 2007