[bnc] - An introduction to the BNC XML Edition

Revisions of the BNC

As we noted above, the BNC was never designed to be a monitor corpus. Nevertheless, there have been two major revisions of it since its first appearance, and it may be that a third is judged necessary in the future, given continued demand for this particular unique snapshot of the British language. In this section we discuss some of the changes which have been made to the corpus and the general principles limiting what is possible.

The second edition of the BNC, also known as the BNC World Edition, was published in December 1999, five years after the first appearance of the BNC. A small number (fewer than 50) of texts for which world rights could not be obtained were removed from the corpus so that it could, at last, be distributed worldwide.

Desirable though it might be, the scale of the BNC precludes any complete proof reading of it. Trying to correct the errors in the BNC is not unlike the task of sweeping a beach clear of sand, as imagined by the Walrus and the Carpenter:

"If seven maids with seven mops
Swept it for half a year,
Do you suppose," the Walrus said,
"That they could get it clear?"
"I doubt it," said the Carpenter,
And shed a bitter tear.

There is a sense in which any transcription of spoken text is inevitably indeterminate. Even for written texts, it is not always obvious what counts as an error: mis-spelled words do appear in published material, and should therefore also be expected to appear in a corpus. Where corrections have been made during the process of corpus construction, they are sometimes noted in the markup in such a way as to preserve both the original error and its correction: this provides some indication at least of the kinds of error likely to be encountered. However, it is impossible either to assess the extent of such errors reliably or to locate their origin precisely, because of the varied processes carried out on the source texts. In principle, an error introduced by (for example) inaccurate OCR software cannot be distinguished from an error which was present in the original without an exact proof reading of the text against its original source ⁸; the use of automatic spelling-error detection software also somewhat muddies the water.

One kind of systematic correction is however possible, and was applied to the BNC World Edition. In part because of the availability of the BNC Sampler, it was possible to improve greatly the rules used by CLAWS, and thus significantly to reduce both the error rate and the degree of indeterminacy in the POS codes for BNC World. This work, carried out at Lancaster with funding from the Engineering and Physical Sciences Research Council (Research Grant No. GR/F 99847), is described in detail in the Manual supplied with the corpus (Leech 2000), which estimates that the error rate in the whole corpus following the automatic procedures applied is now reduced to approximately 1.15 per cent of all words, while the proportion of ambiguous codes is now reduced to approximately 3.75 per cent ⁹.

At the same time, a number of semi-systematic errors were fixed. This work ranged from dealing with duplicate or wrongly labelled texts to a complete check of the demographic data associated with the speakers in each text, which had been found to contain many errors in BNC1.

The third edition of the BNC, known as the BNC XML Edition, appeared in March 2007. Somewhat to our surprise, demand for the BNC showed no sign of decreasing during the five years following release of the World Edition, even though the technology on which it relied had changed almost out of all recognition: SGML had given way to XML as the encoding language of choice, while the advent of Unicode and of the World Wide Web directly addressed many issues of the nineties, not least by making feasible a completely different software economy. In the nineties, the normal academic software development practices tended to result in monolithic, sophisticated but idiosyncratic applications; in the noughties, academic software developers relied on the availability of hundreds of small-scale co-operating utilities and tools developed for different purposes to standard interfaces.

For the BNC XML edition, the documentation of the markup scheme, the schema used to validate it, and the markup itself were all quite extensively revised. The chief goals of these revisions were:

to reduce the complexity of the markup, in particular by removing inconsistently or rarely deployed markup features
to increase usability of the corpus with generic XML tools by using only standard XML features
to improve conformance of the markup scheme with international standards such as the TEI
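
To illustrate the second of these goals, the following minimal Python sketch reads a small fragment using nothing but the generic XML parser in the standard library. The fragment is invented for illustration. The element names `s`, `w` and `c` and the attributes `c5`, `hw` and `pos` are assumptions modelled on the corpus markup and should be checked against the reference documentation; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment: element and attribute names are assumptions,
# not copied from any actual corpus text.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def count_pos_codes(xml_text):
    """Count the c5 codes carried by the <w> elements of a fragment."""
    root = ET.fromstring(xml_text)
    return Counter(w.get("c5") for w in root.iter("w"))


def plain_text(xml_text):
    """Recover the running text by concatenating all text nodes."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


if __name__ == "__main__":
    print(count_pos_codes(SAMPLE))
    print(plain_text(SAMPLE))
```

Because only standard XML features are involved, the same fragment could equally be processed with XPath or XSLT tools; nothing in the sketch is specific to Python.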

Conversion of the corpus from SGML to XML was relatively automatic, but with more tractable and generally accessible markup, it seemed appropriate to address some of the less frequented or more eccentric aspects of the BNC's markup.

The BNC World Edition, for example, had two different ways of indicating editorial correction in the corpus, which were not at all consistently applied. It had not attempted any form of standardization for the descriptions of non-linguistic features of spoken texts, or for the codes used to characterize highlighting of various kinds in the written texts. We attempted to address these inconveniences by making such descriptions more homogeneous, recoding for example ‘baby cries’, ‘baby screams’, ‘baby noise’ etc. uniformly as ‘baby cries’. As another example, the country Bahrain is referred to 174 times in the BNC; on nine occasions it is mis-spelled as Bahrein. Of these nine mis-spellings, only two are flagged as erroneous, and each uses a different method of doing so. This kind of pseudo-precision simply confuses the user. We also tried to simplify several aspects of the markup, in particular the way in which overlapping speech is represented.
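
The kind of recoding described above can be pictured as a simple lookup from variant descriptions to one canonical form. The Python sketch below is only an illustration of that idea under stated assumptions: the table contains just the three variants quoted in the text, the function name is hypothetical, and nothing here is claimed to reproduce the procedure actually used for the corpus.

```python
import re

# Assumed mapping, containing only the variants quoted in the text;
# any real table would be much longer.
CANONICAL = {
    "baby cries": "baby cries",
    "baby screams": "baby cries",
    "baby noise": "baby cries",
}


def normalize_description(desc):
    """Map a free-text event description to its canonical form.

    Case and whitespace are normalized first; descriptions with no
    entry in the table are returned in that cleaned form.
    """
    key = re.sub(r"\s+", " ", desc.strip().lower())
    return CANONICAL.get(key, key)


if __name__ == "__main__":
    for raw in ["baby screams", " Baby  noise ", "door slams"]:
        print(repr(raw), "->", normalize_description(raw))
```

Keeping the mapping in a single table makes each recoding decision easy to inspect and to revise.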

With each new release of the BNC, the typical environment in which the corpus is used has changed completely. By the time of BNC World, it was evident that the corpus could now be installed at low cost for personal use on a single standalone workstation running any version of the Windows operating system: our licensing and distribution policies changed accordingly to make this feasible. With BNC XML, this trend has continued, but in revising the software used to access the corpus we have also tried to take note of the trend towards distributed access which characterizes web-based computing. The latest version of XAIRA is designed to support web services so that access to the corpus can easily be built in to other web-based applications. Licensing constraints limit to some extent what is possible, but certainly simple look-ups of the corpus can now readily be built into other web pages without much effort. This continues the trend initiated by the development of the BNC Online service towards making the corpus more accessible to a wider community of users.


Notes

⁸ Such a task would however be feasible, since the original paper sources for the majority of the written parts of the corpus are still preserved at OUCS.

⁹ These estimates are derived from manual inspection of a 50,000 word sample taken from the whole corpus, as further discussed in the Tagging Manual cited.