[bnc] - An introduction to the BNC XML Edition

What lessons have we learned?

Everyone knows you should research the market before distributing any kind of project, especially one with the level of initial investment needed by the BNC. But, as with some other things that everyone knows, this common-sense wisdom turns out to have been somewhat misleading in the case of the BNC. When the original project partners discussed the likely market for copies of the BNC, it seemed quite clear where that market lay and how small it would be. In the mid-nineties, it was obvious that only a specialist research community with a clear focus on Natural Language Processing (NLP), and of course the research and development departments of businesses engaged in NLP or in lexicography, would be in the least interested in a 100 million word collection of English in what was then still called machine-readable form. Both the rights framework for distribution of copies of the corpus and the methods of distribution chosen clearly reflect this ‘obvious’ model: the licence which all would-be purchasers must sign (in duplicate), for example, talks about the licensee's ‘research group’ and is quite belligerent about the need to monitor networked usage of the corpus within an institution, but nowhere entertains the notion that an individual might buy a copy for their own personal use, or for use with a group of students.

In fact, however, we rapidly discovered that the market was both much larger and quite different in nature. The major users of the BNC turn out to be people working in applied linguistics, not computational linguistics, and in particular those concerned with language learning and teaching. Their computational expertise is rather less than expected, their enthusiasms more wide-ranging. They include not only computational linguists and NLP researchers but also cultural historians and even language learners.

In retrospect, the BNC project also had the same technological blind spots as others at the time. Curiously, we did not expect the success of the XML revolution! So we wasted time in format conversion and compromises. Equally, because we did not foresee standalone computers running at 1 GHz with 20 gigabyte disks as standard home equipment, we did not anticipate that it might one day be feasible to store the digital audio version of the texts we transcribed along with their transcriptions. Consequently, we never even considered whether it would be useful to try to get rights to distribute the digital audio, and our software development efforts focussed on developing a client/server application, a system predicated on the assumption that BNC usage would be characterized by a single shared computing resource, with many entry points, rather than by the massive duplication of standalone machines.

What other opportunities did we miss? In the original design, there is a clearly discernible shift from the notion of ‘Representativeness’ to the idea of the BNC as a fonds: a source of specialist corpora. From being a sample of the whole of language, the BNC was rapidly re-positioned as a repository of language variety. This was in retrospect a sensible repositioning; a more diverse collection of materials than the BNC is hard to imagine. A rapid scan of most corpus-related discussion lists shows that close to the top of most frequently asked question lists is a question of the form ‘I am looking for a collection of texts of type X’ (recent values for X I have noticed include doctor-patient interaction, legal debate, arguments, flirtation...); in almost every case, the answer to such a request is ‘There is some, somewhere in the BNC, but it's up to you to find it...’. The XML version of the corpus makes it somewhat easier to do so, by providing better access to a range of metadata which can be searched in combination with the textual content itself. For example, we can identify the starts of novels by searching for texts which are categorised as fiction and were sampled to include their opening sections; or we can select training sessions by looking for those words within the titles of texts classified as context-governed speech.
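The two selections just described can be sketched in a few lines of Python. This is only an illustration of combining metadata criteria, not the corpus's own query mechanism: the `TextRecord` fields, their values, and the sample records are all invented for the example, and a real application would first have to extract the corresponding classification codes from the header of each corpus text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical, simplified metadata for one corpus text."""
    text_id: str   # invented identifiers, not real BNC text ids
    title: str
    medium: str    # "written" or "spoken"
    category: str  # e.g. "fiction", "context-governed", "demographic"
    sample: str    # part of the source sampled: "beginning", "middle", "end", "whole"


# Invented records, for illustration only.
RECORDS = [
    TextRecord("T001", "A novel, chapters 1-3", "written", "fiction", "beginning"),
    TextRecord("T002", "A novel, chapters 9-12", "written", "fiction", "middle"),
    TextRecord("T003", "A guide to gardening", "written", "leisure", "beginning"),
    TextRecord("T004", "Sales training session", "spoken", "context-governed", "whole"),
    TextRecord("T005", "Council meeting", "spoken", "context-governed", "whole"),
    TextRecord("T006", "Notes on a training session", "written", "commerce", "whole"),
]


def novel_openings(records):
    """Fiction texts whose sample includes the opening of the work.

    Assumption: a text sampled as "whole" also contains its opening.
    """
    return [
        r for r in records
        if r.medium == "written"
        and r.category == "fiction"
        and r.sample in {"beginning", "whole"}
    ]


def training_sessions(records):
    """Context-governed spoken texts with 'training session' in the title."""
    return [
        r for r in records
        if r.medium == "spoken"
        and r.category == "context-governed"
        and "training session" in r.title.lower()
    ]


if __name__ == "__main__":
    print([r.text_id for r in novel_openings(RECORDS)])
    print([r.text_id for r in training_sessions(RECORDS)])
```

Each selection combines several criteria: fiction alone would also return samples taken from the middle of a novel, and a title match alone would also return written texts that merely mention training sessions.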

Clearly, the design of the BNC entirely missed the opportunity to set up a grand monitor corpus, one which could watch the river of language flow and change across time. It would be rather depressing if linguists of this century continue to study the language of the nineties for as long as those of the preceding one were constrained to study that of the sixties. Nevertheless, although it would of course be interesting to build a series of BNC-like corpora at regular intervals, say every decade, there seems little chance of obtaining funding for such an enterprise. Instead, we will have a different kind of large-scale corpus of language production at our disposal for at least the foreseeable future. How best to manage the diversity and unpredictability of the Web as our future source of linguistic information is another, and quite different, story.
