An introduction to the British National Corpus (XML Edition)
Lou Burnard, Oxford University Computing Services

The British National Corpus (BNC) has been a major influence on the construction of language corpora during the last decade, if only as a significant reference point. This corpus may be seen as the culmination of a research tradition going back to the one-million word Brown corpus of 1964, but its constitution and its industrial-scale production techniques look forward to a new world in which language-focussed engineering and software development are at the heart of the information society instead of lurking on its academic fringes.

This paper1 reviews the design and management issues and decisions taken during the construction of the BNC and describes why its most recent incarnation, revised to use XML, remains relevant today.


What, exactly, is the BNC?

The British National Corpus (BNC) is a 100 million word corpus of modern British English, originally produced by a consortium of dictionary publishers and academic researchers in 1990-1994. The Consortium brought together as members dictionary publishers OUP, Longman, and Chambers, and research centres at the Universities of Lancaster and Oxford, and at the British Library. The project was originally funded under the Joint Framework for Information Technology, a British Government initiative designed to facilitate academic-industrial co-operation in the production of what were regarded as ‘pre-competitive’ resources, whereby the Department of Trade and Industry provided 50 percent funding to commercial partners, and the Science and Engineering Research Council funded 100 percent of the academics' costs.

The nineties have been called many things in social history: as far as computing facilities are concerned, however, I suggest that an appropriate epithet might well be neotenous. It is salutary to remember that in computer magazines of the early nineties, the big debate was about the relative merits of the word processors WordPerfect release 5 and WinWord (an ancestor of the now ubiquitous Microsoft Word). On your desktop, if you were a reasonably well-funded academic, you might have a ‘personal computer’ with a fast Intel 386 processor and as much as 50 Mb of diskspace — just about enough to run Microsoft's new-fangled Windows 3.1 operating system. But your real computing work would be done in your laboratory or at your centralised computing service, where you would probably have shared use of a Unix system of some kind or a VAX minicomputer. This was also a period in which a few people were starting to talk about a new hypertext concept called the World Wide Web; a few of them might even have tried an impressive new interface program called Mosaic...

The art of corpus building was however already well understood in the nineties, at least by its European practitioners. ‘Corpora are becoming mainstream’ declared Leech, with palpable surprise, in the preface to the ICAME proceedings volume of 1990. We may discern three intellectual currents, or differences of emphasis, already becoming clear at this period: the traditional school initiated by the Brown Corpus, institutionalised in LOB, and perpetuated through ICAME; the Birmingham school, which had been building up ever larger collections of textual material as part of the COBUILD project throughout the late eighties1; and the American view most famously expressed by Mitch Marcus as ‘there's no data like more data’. The locale in which these traditions most visibly began to combine into a new form was computer-aided lexicography, partly as a consequence of the availability of computer-held representations of traditionally organised dictionaries, such as Longman's Dictionary of Contemporary English, and of course the computerization of the Oxford English Dictionary itself, and partly as a result of an upsurge of interest amongst the computational linguistics community (see for example Atkins 1992).

At the same time, the early nineties were an exciting period for synergy in research applications of information technology. ‘Humanities Computing’ and ‘Computational Linguistics’ were pulling together in their first (and to date only) joint success, the establishment of Text Encoding standards appropriate to the dawning digital age.2 The term language engineering was being used to describe not a dubious kind of social policy, but a sexy new sort of technology. It is in this context that we should place the fact that production of the BNC was funded over three years, with a budget of over GBP 1.5 million.

The project came into being through an unusual coincidence of interests amongst lexicographic publishers, government, and researchers. Amongst the publishers, Oxford University Press and Longman were at that time beginning to wake up to the possible benefits of corpus use in this field. One should also point to the success of the Collins COBUILD dictionaries (first published in 1987, and probably the first major language-learner dictionary whole-heartedly to embrace corpus principles) as a vital motivating factor for rival publishers OUP and Longman. For the government, a key factor was a desire to stimulate a UK language engineering industry in the climate of expanded interest in this field in Europe. For researchers at Oxford and Lancaster, this unlikely synergy was a golden opportunity to push further the boundaries of corpus construction, as further discussed below. And for the British Library, the corpus was one of a number of exploratory projects being set up to experiment with new media at the beginning of the age of the digital library (for other examples, see the essays in Carpenter 1998).

The stated goals of the BNC project were quite explicit: it would create a language corpus at least an order of magnitude bigger than any freely available hitherto.3 The new corpus would be synchronic and contemporary, and it would comprise samples from the full range of British English language production, both spoken and written. Considerable debate and discussion focussed on the notion of sampling, and in particular on corpus design. Unlike some other collections of language data then popular, the BNC would be of avowedly non-opportunistic design. In order to make the corpus generally applicable, it would contain automatically-generated word class annotation, and it would also include very detailed contextual information. These three features, together with its general availability and large size, would make the BNC unique amongst available collections of language data, and would also justify the ‘national’ part of its title (originally included simply in recognition of the fact that the project was partly government funded).

Unstated, but clearly implicit in the project design, were other goals. For the commercial partners, the major reason for their substantial investment of time and money was of course the production of better ELT dictionaries, plus, perhaps, some regaining of competitive position by virtue of the authoritative nature of the resulting corpus. For the academic partners, an unstated goal was to provide a new model for the development of corpora within the emerging European language industries, and to put to the test emerging ideas about standardization of encoding, text representation, and documentation. But overriding all there was the simple desire to build a really big corpus!


Notes
1.
An earlier version of this paper was published under the title ‘The BNC: Where did we go wrong?’ in Teaching and Learning by Doing Corpus Analysis, ed. B. Kettemann and G. Markus. Amsterdam: Rodopi, pp. 51-71.
1.
For a summary of this highly influential work, prefiguring that of the BNC in many regards, see Renouf 1986, and the many publications of its intellectual centre, J. McH. Sinclair, e.g. Sinclair 1987.
2.
The introduction to Zampolli 1994 makes this connexion explicit.
3.
Though incontestably larger, the Bank of English corpus developed as part of the Cobuild project was not originally designed for distribution or use by anyone outside that project; to this day, IPR and other restrictions have effectively limited access to it by the research community at large.
4.
These are exhaustively discussed in e.g. Atkins 1992 for the written material, and Crowdy 1995 for the spoken material; discussion and detailed tables for each classification are also provided in the BNC User Reference Guide (Burnard 1995, revised 2000).
5.
For further description of the way TEI Headers are used by the BNC see Dunlop 1995.
6.
These were obtained from COPAC, the UK joint academic library catalogue, for the bulk of the written published material.
7.
The phrase world wide web in fact appears only twice in the corpus, in both cases as part of a brief exchange about the feasibility of publicizing the Leeds United football club which occurred on an email discussion list in January 1994. The most frequent collocates for the word web in the corpus are spider, tangled, complex, and seamless. In this respect at least the BNC is definitely no longer an accurate reflection of the English language.
8.
Such a task would however be feasible, since the original paper sources for the majority of the written parts of the corpus are still preserved at OUCS.
9.
These estimates are derived from manual inspection of a 50,000 word sample taken from the whole corpus, as further discussed in the Tagging Manual cited.