Organization of the project

As this figure demonstrates, production of different types of material was shared out amongst a number of different agencies: Longman focussed on the collection and transcription of spoken materials, and OUP on the transcription of written materials, using a mixture of OCR, rekeying, and processing of materials already in digital form. Conversion of all materials to a single consistent format and validation of its structure was carried out at OUCS, which also maintained a database of contextual and workflow information. Linguistic annotation of the material was carried out at Lancaster, using the well-established CLAWS tagger (discussed below and in Garside 1996), and the resulting texts were then combined with standard metadata descriptions extracted from the database to form a single document conformant (insofar as these were already published) to the recommendations of the Text Encoding Initiative (Sperberg-McQueen 1994).
- permissions: design and implementation of a standard permissions letter for use with all those owning IPR in the materials to be included in the corpus;
- design criteria: definition of the range of text types to be included in the corpus and of their target proportions;
- enrichment and annotation: implementation of linguistic and contextual annotation of the corpus texts;
- encoding and markup: definition of the markup scheme to be applied in the final reference form of the corpus, and of procedures for mapping to it from a variety of data capture formats;
- retrieval software: definition and implementation of simple retrieval software able to make use of the detailed corpus encoding.
Permissions issues
As noted above, the BNC was the first corpus of its size to be made widely available. This was possible largely because of the work done by this task group in successfully defining standard forms of agreement, between rights owners and the Consortium on the one hand, and between corpus users and the Consortium on the other. IPR owners were requested to give permission for the inclusion of their materials in the corpus free of charge, and shown the standard licence agreement which is still used today. Acceptance of this arrangement was perhaps to some extent facilitated by the relative novelty of the concept and the prestige attached to the project; however, by no means every rights owner approached was immediately ready to assign rights to use digital versions of their material for linguistic research purposes indefinitely and free of charge. Some chose to avoid committing themselves at all, and others refused any non-paying arrangements.
Two specific problems attached to permissions issues relating to the spoken materials. Because participants had been assured that their identities would be kept secret, much effort was put into deciding how best to anonymise their contributions without unduly compromising their linguistic usefulness. Specific references to named persons were in many cases removed; the option of replacing them by alternative (but linguistically similar) names was briefly considered but felt to be impractical.
A more embarrassing problem derives from the fact that participants in the demographically sampled part of the corpus had been asked for (and had therefore given) permission only for inclusion of transcribed versions of their speech, not for inclusion of the speech itself. While such permission could in principle be sought again from the original respondents, the effectiveness of the anonymization procedures used now makes this a rather difficult task.
Two additional factors affected the willingness of IPR owners to donate materials: firstly, that no complete texts were to be included; secondly, that there was no intention of commercially exploiting or distributing the corpus materials themselves. This did not however preclude commercial usage of derived products, created as a consequence of access to the corpus. This distinction, made explicit in the standard User Licence, is obviously essential both to the continued availability of the corpus for research purposes, and to its continued usefulness in the commercial sector, for example as a testbed for language products from humble spelling correction software to sophisticated translation memories. To emphasize the non-commercial basis on which the corpus itself was to be distributed, one of the academic members of the consortium, OUCS, was appointed sole agent for licensing its use, reporting any dubious cases to the Consortium itself. Initially restricted to the EU, distribution of the corpus outside Europe was finally permitted in 1998.
Design Criteria
I referred above to the BNC's ‘non-opportunistic design’. A sense of the historical context is also perhaps helpful to understand the singling out of this aspect of the design as noteworthy. During the mid-nineties, although textual materials of all kinds were increasingly being prepared in digital form as a precursor to their appearance in print, the notion that the digital form might itself be of value was not at all widespread. Moreover, digitization in those pre-e-commerce days was far from uniform either in coverage or in format. As a consequence, there was a natural tendency in the research community to snap up such unconsidered trifles of electronic text as were available without considering too deeply their status with respect to the language as a whole. Because, to take one notorious example, large quantities of the Wall Street Journal were widely available in digital form, there was a danger that the specific register typified by that newspaper would increasingly serve as a basis for computationally-derived linguistic generalisations about the whole language.
As a corrective, therefore, the BNC project established at its outset the goal of sampling materials from across the language with respect to explicit design criteria rather than simply their contingent availability in machine-readable form. These criteria (usefully summarized in Atkins 1992) defined a specific range of text characteristics and target proportions for the material to be collected. The goal of the BNC was to make it possible to say something about language in general. But is language that which is received (read and heard) or that which is produced (written and spoken)? As good Anglo-Saxon pragmatists, the designers of the BNC chose to ignore this classic Saussurian dichotomy by attempting to take account of both perspectives.
The objective was to define a stratified sample according to stated criteria. While one might hesitate to claim that the corpus was statistically representative of the whole language in terms either of production or reception, at least the corpus would represent the degree of variability known to exist along certain specific dimensions, such as mode of production (speech or writing); medium (book, newspaper, etc.); domain (imaginative, scientific, leisure etc.); social context (formal, informal, business, etc.) and so on.
This is not the place to rehearse in detail the motivations for the text classification scheme adopted by the BNC4. For example, spoken texts may be characterized by age, sex, or social class (of respondent, not speaker), or by the domain, region, or type of speech captured; written texts may also be characterized by author age, sex, type, by audience, circulation, status, and (as noted above) by medium or domain. Some of these categories were regarded as selection criteria, i.e. the domain of values for this category was predefined, and a target proportion identified for each; while others were regarded as descriptive criteria, i.e. while no particular target was set for the proportion of material of a particular type, other things being equal, attempts would be made to maximize variability within such categories. It should be stressed that the purpose of noting these variables was to improve coverage, not to facilitate access, nor to subset the corpus according to some typological theory.
Inevitably, the design goals of the project had to be tempered by the realities of economic life. A rough guess suggests that the cost of collecting and transcribing in electronic form one million words of naturally occurring speech is at least 10 times higher than the cost of adding another million words of newspaper text: the proportion of written to spoken material in the BNC is thus 10:1, even though many people would suggest that if speech and writing are of equal significance in the language, they should therefore be present in equal amounts in the corpus. Within the spoken corpus, an attempt is made to represent equally the production of different speech types (in the context-governed part) and its reception (in the demographically sampled part).
Similarly pragmatic concerns led to the predominance within the written part of the corpus of published books and periodicals. However, while text that is published in the form of books, magazines, etc., may not be representative of the totality of written language that is produced (since writing for publication is a comparatively specialized activity in which few people engage), it is obviously representative of the written language that most people receive. In addition, it should be noted that significant amounts of other material (notably unpublished materials such as letters or grey literature) are also included. And even within a readily accessible text-type such as newspapers, care was taken to sample both broadsheet and tabloid varieties, both national and regional, in such a way that the readily available (national broadsheet) variety did not drown out the other, less readily found, variants.
The spoken part of the corpus is itself divided into two. Approximately half of it is composed of informal conversation recorded by nearly 200 volunteers recruited for the project by a market research agency and forming a balanced sample with respect to age, gender, geographical area, and social class. This sampling method reflects the demographic distribution of spoken language, but (because of its small size) would have excluded from the corpus much linguistically-significant variation due to context. To compensate for this, the other half of the spoken corpus consists of speech recorded in each of a large range of predefined situations (for example public and semi-public meetings, professional interviews, formal and semi-formal proceedings in academia, business, or leisure contexts).
In retrospect, some text classifications (author ethnic origin for example) were poorly defined and many of them were only partially or unreliably populated. Pressures of production and lack of ready information seriously affected the accuracy and consistency with which all these variables were actually recorded in the text headers. Even such a seemingly neutral concept as dating is not unproblematic for written text — are we talking about the date of the copy used or of the first publication? Similarly, when we talk of ‘Author age’, do we mean age at the time the book was published, or at the time it was written?
Of course, corpora before the BNC had been designed according to similar methods, though perhaps not on such a scale. In general, however, the metadata associated with such corpora had been regarded as something distinct from the corpus itself, to be sought out by the curious in the ‘manual of information to accompany’ the corpus. One innovation due to the Text Encoding Initiative, and adopted by the BNC, was the idea of an integrated header, attached to each text file in the corpus, and using the same formalism. This header contains information identifying and classifying each text, as well as additional specifics such as demographic data about the speakers, and housekeeping information about the size, update status, etc. Again following the TEI, the BNC factors out all common data (such as documentation and definition of the classification codes used) into a header file applicable to the whole corpus, retaining within each text header only the specific codes applicable to that text.5
During production, however, classificatory and other metadata was naturally gathered as part of the text capture process by the different data capture agencies mentioned above and stored locally before it was integrated within the OUCS database from which the TEI headers were generated. With the best will in the world, it was therefore difficult to avoid inconsistencies in the way metadata was captured, and hence to ensure that it was uniformly reliable when combined.
Annotation
Word tagging in the BNC was performed automatically, using CLAWS4, an automatic tagger developed at Lancaster University from the CLAWS1 tagger originally produced to perform a similar task on the one-million-word LOB Corpus. The system is described more fully in Leech 1994; its theory and practice are explored in Garside 1997, and full technical documentation of its usage with the BNC is provided in the Manual which accompanies the BNC World Edition (Leech 2000).
- tokenization into words (usually marked by spaces) and orthographic sentences (usually marked by punctuation); enclitic verbs (such as 'll or 's) and negative contractions (such as n't) are regarded as special cases, as are some common merged forms such as dunno (which is tokenized as ‘do + n't + know’).
- initial POS code assignment: all the POS codes which might be assigned to a token are retrieved, either by lookup from a 50,000 word lexicon, or by application of some simple morphological procedures; where more than one code is assigned to the word, the relative probability for each code is also provided by the lexicon look-up or other procedures. Probabilities are also adjusted on the basis of word-position within the sentence.
- disambiguation or code selection is then applied, using the Viterbi algorithm, which uses the probabilities associated with each code to determine the most likely path through a sequence of ambiguous codes, in rather the same way as the text messaging applications found on many current mobile phones. At the end of this stage, the possible codes are ranked in descending probability for each word in its context.
- idiom tagging is a further refinement of the procedure, in which groups of words and their tags are matched against predefined idiomatic templates, resembling finite-state networks.
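The probabilistic selection step in the pipeline above can be sketched as a toy Viterbi decoder. Everything below — the three-word lexicon, the tagset fragment, and all the probabilities — is invented for illustration; CLAWS4 itself uses a 50,000-word lexicon, the full C5 tagset, and statistics derived from real text.

```python
# Toy Viterbi decoder over ambiguous POS codes (illustrative only).

# Candidate codes for each word, with rough lexical probabilities.
# This sketch assumes every input word appears in this toy lexicon.
LEXICON = {
    "the": {"AT0": 1.0},
    "dog": {"NN1": 0.9, "VVB": 0.1},
    "barks": {"VVZ": 0.8, "NN2": 0.2},
}

# Tag-transition probabilities; "<s>" marks the sentence start.
# Pairs not listed fall back to a small smoothing value.
TRANSITIONS = {
    ("<s>", "AT0"): 0.6,
    ("AT0", "NN1"): 0.7,
    ("NN1", "VVZ"): 0.4,
    ("NN1", "NN2"): 0.1,
}

def viterbi(words):
    """Return the most probable code sequence for a list of known words."""
    best = {"<s>": (1.0, [])}  # tag -> (path probability, path so far)
    for w in words:
        nxt = {}
        for tag, p_lex in LEXICON[w].items():
            # keep only the best-scoring way of arriving at this tag
            nxt[tag] = max(
                ((p * TRANSITIONS.get((prev, tag), 0.01) * p_lex, path + [tag])
                 for prev, (p, path) in best.items()),
                key=lambda scored: scored[0],
            )
        best = nxt
    return max(best.values(), key=lambda scored: scored[0])[1]
```

With these made-up numbers, `viterbi(["the", "dog", "barks"])` selects the article–noun–verb path (AT0, NN1, VVZ), since the alternative readings of dog and barks score lower once the transition probabilities are factored in.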
With these procedures, CLAWS was able to achieve over 95% accuracy (i.e. lack of indeterminacy) in assigning POS codes to any word in the corpus. To improve on this, the Lancaster team developed further the basic ideas of ‘idiom tagging’, using a template tagger which could be taught more sophisticated contextual rules, in part derived by semi-automatic procedures from a sample set of texts which had previously been manually disambiguated. This process is further described in the Reference Manual cited.
By multiword unit we mean the situation where two or more orthographic words are considered by the CLAWS tagger to function as a single unit with a single wordclass. Common examples include adverbial phrases such as of course or in short, and prepositional sequences such as in spite of or up to. Deciding whether or not to treat these orthographic sequences as multiword units sometimes requires interpretation (in short is not adverbial in sequences such as ‘in short sharp bursts’, for example); such situations required extensions to the idiom rules.
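The grouping of orthographic words into multiword units can be sketched as a greedy longest-match scan against a table of fixed expressions. The table and its tags below are invented for illustration, and this naive version deliberately omits the contextual conditions the text describes (it would wrongly match ‘in short’ inside ‘in short sharp bursts’, which is exactly why the idiom rules needed extending).

```python
# Greedy longest-match grouping of multiword units (illustrative only).
# Keys are lower-cased word tuples; values are made-up wordclass codes.
MULTIWORDS = {
    ("of", "course"): "AV0",       # adverbial
    ("in", "spite", "of"): "PRP",  # prepositional
    ("up", "to"): "PRP",
}
MAX_LEN = max(len(key) for key in MULTIWORDS)

def group_multiwords(tokens):
    """Return (text, code) pairs; code is None for ordinary single words."""
    out, i = [], 0
    while i < len(tokens):
        # try the longest possible match first, down to two words
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MULTIWORDS:
                out.append((" ".join(tokens[i:i + n]), MULTIWORDS[key]))
                i += n
                break
        else:
            out.append((tokens[i], None))  # left for ordinary tagging
            i += 1
    return out
```

Here `group_multiwords("It was of course finished".split())` keeps ‘It’, ‘was’ and ‘finished’ as single words but emits ‘of course’ as one unit with its single code.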
The lemmatization procedure adopted derives ultimately from work reported in Beale 1987, as subsequently refined by others at Lancaster, and applied in a range of projects including the JAWS program (Fligelstone et al 1996) and the book Word Frequencies in Written and Spoken English (Leech et al 2001). The basic approach is to apply a number of morphological rules, combining simple POS-sensitive suffix stripping rules with a word list of common exceptions. This process was carried out during the XML conversion, using code and a set of rules files kindly supplied by Paul Rayson.
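The combination of POS-sensitive suffix stripping with an exception word list can be sketched as follows. The handful of rules and exceptions here are invented for illustration and are far cruder than the actual Lancaster rule files supplied by Paul Rayson.

```python
# Toy POS-sensitive lemmatizer: exceptions first, then suffix rules.

# Irregular forms looked up before any rule applies (illustrative sample).
EXCEPTIONS = {
    ("VERB", "went"): "go",
    ("VERB", "was"): "be",
    ("NOUN", "children"): "child",
}

# Ordered (suffix, replacement) rules, keyed by simplified wordclass.
SUFFIX_RULES = {
    "NOUN": [("ies", "y"), ("es", ""), ("s", "")],
    "VERB": [("ying", "y"), ("ing", ""), ("ied", "y"), ("ed", ""), ("s", "")],
}

def lemmatize(word, pos):
    """Return a headword for (word, pos) using exceptions, then suffix rules."""
    w = word.lower()
    if (pos, w) in EXCEPTIONS:
        return EXCEPTIONS[(pos, w)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        # require a stem of at least two letters to avoid mangling short words
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[: -len(suffix)] + replacement
    return w
```

So `lemmatize("granted", "VERB")` strips ‘-ed’ to give ‘grant’, `lemmatize("ponies", "NOUN")` rewrites ‘-ies’ to ‘pony’, and `lemmatize("children", "NOUN")` comes straight from the exception list.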
Encoding
The markup scheme used by the BNC was originally defined at the same time as the Text Encoding Initiative's work was being done (and to some extent by the same people); the two schemes are thus unsurprisingly close, though there are differences. Since this scheme has been so widely taken up and is well documented elsewhere, we do not discuss it in any detail here.
The segmentation of the text carried out by CLAWS is preserved by means of the <s> elements used throughout the whole text. Each <s> element carries a number to identify it. Within each <s>, every ‘word’ identified by CLAWS is marked as a <w> element carrying on its attributes the original CLAWS C5 wordclass code (e.g. c5="VVN"), the simplified wordclass code derived from it (e.g. pos="VERB"), and the root form or headword for it (e.g. hw="grant").
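This markup can be illustrated with a short invented fragment, parsed here with Python's standard ElementTree. The sentence itself, its number, and the simplified wordclass codes other than "VERB" are assumptions made for illustration; only the attribute names (c5, pos, hw) and the example values VVN, VERB and grant come from the description above.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the style of the markup described in the text:
# a numbered <s> element whose <w> children carry the C5 code (c5),
# the simplified wordclass (pos), and the headword (hw).
FRAGMENT = """<s n="42">
  <w c5="AT0" pos="ART" hw="the">The </w>
  <w c5="NN1" pos="SUBST" hw="licence">licence </w>
  <w c5="VBD" pos="VERB" hw="be">was </w>
  <w c5="VVN" pos="VERB" hw="grant">granted</w>
</s>"""

sentence = ET.fromstring(FRAGMENT)

# Extract (surface form, C5 code, headword) triples from the <w> elements.
words = [(w.text.strip(), w.get("c5"), w.get("hw"))
         for w in sentence.findall("w")]
```

Reading the corpus this way, the past participle ‘granted’ comes back with its C5 code VVN and its headword ‘grant’, exactly the three layers of annotation the attributes encode.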
The User Reference Guide delivered with the corpus contains a detailed discussion of the scope and significance of this markup system.
In marking up the spoken part of the corpus, many different technical issues had to be addressed. As noted above, this was the first time detailed markup of transcribed speech on such a scale had been attempted. The transcription itself was carried out by staff who were not linguistically trained (but who were however familiar with the regional variation being transcribed — staff recruited in Essex for example were not required to transcribe material recorded in Northern Ireland). Transcribers added a minimal (non-SGML) kind of markup to the text, which was then normalized, converted to SGML, and validated by special purpose software (see further Burnage 1993). The markup scheme made explicit a number of features, including changes of speaker and quite detailed overlap; the words used, as perceived by the transcriber; indications of false starts, truncation, uncertainty; some performance features e.g. pausing, stage directions etc. In addition, of course, detailed demographic and other information about each speaker and each speech context was recorded in the appropriate part of the Header, where this was available.
Words and sentences are tagged as in the written example above. However, sentences are now grouped into utterances, marked by the <u> element, each representing an unbroken stretch of speech, and containing within its start tag a code (such as PS04Y) which acts as a key to access the more detailed information about the speaker recorded in the TEI Header for this text. Note also the <pause> and <event> elements used to mark paralinguistic features of the transcribed speech.
As can readily be seen in the above example, the intention of the transcribers was to provide a version of the speech which was closer to writing than to unmediated audio signal. Thus, the spelling of filled pauses such as erm or mmm is normalised, and there is even use of conventional punctuation to mark intonation patterns interpreted as questions. For more discussion of the rationale behind this and other aspects of the speech transcription see Crowdy 1994.
Software and distribution
In 1994, it was not entirely obvious how one should distribute a corpus the size of the BNC on a not-for-profit basis. Low-cost options such as anonymous ftp seemed precluded by the scale of the data. Our initial policy was to distribute the text compressed to the extent that it would fit on a set of three CDs, together with some simple software system which could be installed by suitably skilled personnel to provide departmental access over a network to the local copy of the corpus. Development of such a software system was undertaken, with the aid of additional funding from the British Library, during the last year of the project. The software now delivered with the corpus is known as XAIRA (XML Aware Indexing and Retrieval Architecture) and derives from that original tool, which was called SARA (for SGML Aware Retrieval Application). XAIRA was developed as a general purpose open source XML-aware tool for searching large or small language corpora with funding from the Andrew W. Mellon Foundation: see further the website at http://www.xaira.org.
It was always intended that access to the BNC should not be contingent on the use of any particular software — this was after all the main rationale behind the use of the international standard SGML as a means of encoding the original corpus, rather than a system tailored to any particular software tool.
As noted above, the BNC dates from the pre-World Wide Web era.7 However, within a year of its publication, it was apparent that web access would be the ideal way of making it available, if only because this would enable us to provide a service to researchers outside the European Union, who were still at this time unable to obtain copies of the corpus itself because of licensing restrictions. The British Library generously offered the project a server for this purpose, and a simple web interface to the corpus was developed. This service, still available at the address http://sara.natcorp.ox.ac.uk, allows anyone to perform basic searches of the corpus, with a restricted range of display options; those wishing for more sophisticated facilities can also download a copy of the SARA client program to access the same server: a small registration fee is charged for continued use of the service beyond an initial trial period.
To complement this service, and in response to the demand for help in using the BNC from the language teaching community, a detailed tutorial guide (Aston 1999) was written, introducing the various facilities of the software in the form of focussed and linguistically-motivated exercises. The Online service remains very popular, receiving several thousand queries each month.