The British National Corpus

Gavin Burnage and Glynis Baguley


Introduction

What is a Corpus?

What is a corpus for?

Earlier interest in computerised corpora for linguistic research

Design and Implementation of the BNC

Written Language, Spoken Language

A monolingual corpus

A general corpus

A synchronic corpus

A sample corpus

Written Corpus Design

Domain

Time

Medium

Level

Implementing the selection criteria

Classification features

Spoken Corpus Design

Demographic sampling

Context-governed sampling

Constructing the BNC

Data Collection

Scanning

Keyboarding

Existing electronic material

Copyright and permissions

Textual Mark-up and the Text Encoding Initiative

Text encoding for the BNC

Linguistic Mark-up

Using the BNC

Text analysis and concordancing software: SARA

Getting hold of the BNC

References


Introduction

The British National Corpus is a collection of 100 million words of contemporary British English text held in computer-readable form. It is available as a research tool for those professionally interested in how the English language is being used in the late twentieth century within the United Kingdom. These include lexicographers and producers of English language reference works in particular, but also academic linguists. The search/retrieval software supplied with the corpus will also be of interest to information scientists.

The corpus was created in the period 1991-1994 by a consortium which comprised three commercial partners – Oxford University Press (OUP), Longman Group UK Ltd and Chambers Harrap – and two academic ones – Oxford University Computing Services (OUCS) and the Unit for Computer Research in the English Language (UCREL) at Lancaster University.

This Briefing describes the structure and composition of the corpus and explains its significance and usefulness.

When J.A.H. Murray took over the editorship of what was to become the Oxford English Dictionary in the 1870s, his raw material was an enormous collection of slips of paper. (6) Each slip bore a short extract copied from a literary work, selected because it included a word of interest. Having assembled all the slips for a particular word, Murray or a colleague would divide them into separate heaps for different parts of speech if necessary, further subdivide each heap according to the different meanings the word might have, and arrange the slips for each meaning in chronological order of composition or publication. Thus the lexicographers developed their definition(s) of the word. Selected quotations were also included in the dictionary to illustrate the usage of words.

The principles of lexicography have remained the same, but the advent of corpora enables lexicographers to study many more examples much more easily. By pressing a few keys, lexicographers can cause the search software supplied with the BNC to display in context some or all of the occurrences of the target word in the corpus. They can choose that the context displayed be just a few words either side, or the entire text sample in which the word was found. The search can be restricted to find the word only when it has a particular part of speech, or only when it occurs near another selected word, and in various other ways. Furthermore, modern lexicographers are not limited, as was Murray, by the subjective decisions of readers as to what constitutes an 'interesting' word or usage.


What is a corpus?

The definitions are many and varied, but in general a corpus is a collection of texts gathered according to particular principles for some particular purpose. Using terms more specific to linguistics, David Crystal suggests that a corpus is:

a collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting point of linguistic description or as a means of verifying hypotheses about a language. (1)

This emphasises the scientific basis many claim for corpus-based study: by observing the patterns evident in large quantities of language data, we can learn more about the language itself. David Crystal is perhaps a little tentative when he talks of a 'starting point' only.

In contrast, John Sinclair says that a corpus can be

a collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (2)

This is a much grander claim – from a carefully selected corpus we can say definitive things about a language. The British National Corpus comprises selections which together characterise British English of the late twentieth century. The exactness with which it does this is open to question, naturally, and in any event almost impossible to quantify statistically – how do you state what the English language, in all its various written and spoken forms, actually is at any given point in time? The least that can be said is that the design of the BNC incorporates a wealth of different styles and registers, all of which are real examples of British English in use.

All three definitions suggest that it is the collection as a whole, rather than its component parts, that makes a corpus worth studying. A corpus is unlike a library or a text archive in this respect. People come to a corpus to analyse the language used throughout, rather than to extract and examine a book from one particular author (although perhaps there will be occasions when someone wishes to do just that). Certainly many will construct their own subcorpora, according to their own principles and purposes, from material within the BNC. Examples may include subcorpora of teenage speech, of the language of newspaper headlines, or of scientific writing. But in almost all cases, generalisations about the language type or variety can be made because the subcorpus is made up of many different examples from many different sources.

Finally, all three definitions omit any reference to computers, but in practice, of course, it is only the computer which permits analysis of language on this scale. One dramatic example of greatly improved language processing is concordancing (the production of indexes of selected words or of all the words in a text): what was once in some cases a lifetime's task can now be done in hours. In short, computerised corpora open up new ways of looking at language, and the numerous and varied possibilities excite and intrigue researchers in many ways.

What is a corpus for?

The corpus is a general-purpose tool which can be used to whatever ends researchers choose, but a primary use is to assist in the production of dictionaries. Dictionaries are very significant commercially, especially those produced for foreign learners. Grammars for foreign learners are also very important. The use of corpora makes it a great deal easier for lexicographers and grammarians to describe the language objectively from authentic samples, rather than depending on their own intuitions about how the language is used. This is particularly true of the spoken language, which because of its ephemeral nature is much more difficult to study than written text.

In the past, linguists often prescribed a notional ideal of good usage. Today's linguists prefer to describe what native speakers actually say and write. Anyone who, in studying a foreign language, has mastered the finer points of syntax, acquired an appreciation of its literary treasures, and then found themselves barely able to order a meal in the foreign country, will appreciate the difference.

David Crystal has produced a popular work on this topic. (3) One example he uses to illustrate the distinction is the 'split infinitive'. Many people (influenced by early schooling derived ultimately from eighteenth-century grammarians who regarded Latin as the ideal language and tried to fit their use of English onto its grammar) consider phrases such as 'to boldly go' to be errors, and recommend rewriting them as 'boldly to go' or 'to go boldly'. In practice, split infinitives occur frequently, so frequently in fact that the Collins COBUILD English language dictionary states that "it is a regular feature of current speech and writing and we believe that it is now time to accept that the balance of usage is in favour of stylistic freedom in this instance" (4).

The split infinitive is an example which can arouse controversy; most corpus-based linguistic description is not so divisive. The important point is that real language, as people use it every day, can be extensively analysed using a comprehensive and balanced computerised corpus.

Outside the fields of lexicography and language teaching, corpora can help scholars to address theoretical questions such as 'Is there really such a thing as a word?' and to consider sociolinguistic issues such as differences in the ways men and women use language or variation in choice of swear words according to age or social class.

Earlier interest in computerised corpora for linguistic research

Computerised corpus linguistics dates back three or four decades. The Brown corpus, compiled in the 1960s, contains 1 million words of American English. It was followed by the LOB (Lancaster-Oslo/Bergen) corpus, which contains 1 million words of British English organised with the same text categories as the Brown corpus, and which for many years was a mainstay for corpus-based researchers. A small corpus of transcribed speech, the London-Lund corpus, was also constructed in the 1960s and 1970s. (In all cases, the names refer to the universities involved in developing these corpora.) In the 1980s, as technological advances made larger corpora possible, Cobuild created a 6 million word corpus upon which the Collins COBUILD English language dictionary was based. In 1991, work began on the construction of the 100 million word British National Corpus, based in large measure on the categorisations used in the LOB corpus. The Cobuild corpus continued to expand with less structured data collection, and became the Bank of English, which contains 200 million or more words, from which subcorpora are selected for specific purposes. The International Corpus of English (ICE) comprises a number of 1 million word corpora of written and spoken English as used in English-speaking countries world-wide. These are just a few corpus projects past and present; there are many more, in English and other languages. Information about, and in some cases access to, corpora and electronic texts is becoming ever easier to obtain by means of the Internet.

Design and Implementation of the BNC

The purpose of the British National Corpus project was to construct a balanced and representative sample of current British English. The designers therefore had to address just what, in theory at least, constituted a balanced cross-section, and what could usefully be achieved with the resources available.

Written language, spoken language

A basic question for the design of the BNC was the proportion of written to spoken language. In the minds of many outside the field of linguistics, written language, particularly literature, is the most important form language takes. But within the field the reverse is generally held to be true: spoken language is the primary form, and written language a derivation from it (albeit a highly important one, in cultural and social terms). The study of spoken language has therefore been a priority for many linguistic researchers. However, the distinction between written and spoken language is by no means absolute. This issue has been examined by Doug Biber, who, using texts from the LOB corpus and the London–Lund corpus, suggests different dimensions of language variation, which describe how various kinds of written and spoken language resemble and differ from each other (5).

The million-word LOB and Brown corpora, and the million-word components of the ICE, contain equal proportions of written and spoken language. For the 100-million-word BNC, however, such a split was, regrettably, not feasible: the expense and time required to record and transcribe 50 million words of speech would have been too great. The proportions were therefore set at 90 million words of written and 10 million words of spoken text. The spoken corpus remains a considerable achievement nonetheless: no larger purpose-built cross-section of transcribed spoken language exists at present.

A monolingual corpus

The BNC is monolingual: it deals only with English, and more specifically British (UK) English. Only texts written by people regarded as British (by birth or adoption) were included in the corpus. In the case of spoken material, respondents selected to record their conversations had to have English as their first language. Other languages used in the UK – the indigenous ones Scottish Gaelic and Welsh as well as newer arrivals such as Urdu – have worthy claims to be included in a British corpus, but the project partners' brief was to deal exclusively with English.

In practice, some non-British usage found its way quite naturally into the corpus. Newspapers and books often contain contributions by non-British authors, and although these were removed where they could be identified, some undoubtedly remain. Electronic mailing lists have contributions from around the world. Spoken corpus respondents frequently converse with non-native speakers of English, and occasionally in different languages altogether. Foreign words and expressions show up in many contexts. However, the intention of producing a corpus consisting primarily of British English was adhered to throughout its construction.

A general corpus

The BNC consists of a wide range of examples of language in use, wide enough to justify the claim that it characterises modern British English. It includes books fictional and non-fictional, from Mills and Boon to Iris Murdoch. It contains: biographies, and scientific and academic expositions; essays written by school and university students; electronic mail from a football supporters' discussion list; leaflets about restaurants, patenting, tourist attractions, driving tips, and religion; magazines about package holidays, dogs, embalming, social work, and plane-spotting; newspapers both local and national, from the Independent to the Belfast Telegraph to the Daily Mirror; transcriptions of council meetings, business meetings, parliamentary debates, school classes, television news broadcasts and radio phone-ins, and of the daily conversations of a company director, a nurse, some students, an aircraft engineer, a courier and a machine operator. This list is by no means exhaustive, but gives an impression of the nature of the corpus: it is not limited to any subject, field, or genre, but is as wide as the selection criteria and the resources available could make it.

A synchronic corpus

A popular view of language study is that it consists of examining the origins of words (etymology) and the derivation and development of languages (philology, or historical linguistics). This approach is termed diachronic: it deals with the way language develops over time. In contrast, synchronic study looks at a language at a single point in time, disregarding the historical developments which brought it to its present state. Linguistics nowadays concentrates on synchronic study, and the BNC reflects this: it deals with current British English, aiming to provide a snapshot of how English was used by British people in the period 1975–1994. In time, it will acquire a diachronic value, especially if further, similarly composed corpora are constructed.

A sample corpus

Most of the books selected for inclusion in the BNC are represented not by the full text but by samples. The use of samples allows a better cross-section of texts to be represented: about three times as many books can be included if samples of 40-45,000 words are used instead of the average 120,000-word full work. Another advantage of using samples is that publishers and authors need not worry about piracy: many feared that the public availability of full texts in the corpus would permit illegal re-use, but the use instead of much smaller samples allayed most of their fears. To permit stylistic analysis, samples were taken variously from the start, middle, or end of each book.

Works of multiple authorship, such as newspapers, magazines and journals, were generally included in their entirety. In the spoken corpus too, no sampling was necessary, and full transcriptions are used throughout.

Written Corpus Design

With the above definitions in mind, the project partners devised strategies for choosing texts to include in the corpus. Initially, four selection criteria were agreed, though one of these was later dropped.

Domain

To ensure a wide, general coverage, targets were drawn up which identified what proportion of the written corpus should consist of texts under certain subject headings. A distinction was drawn between informative texts (factual, non-fictional) and imaginative texts (fictional, literary), and a target set of 70-80% informative texts and 20-30% imaginative texts. Within these targets, subject headings were agreed for informative texts, but, in keeping with standard bibliographic practice, were considered inappropriate for imaginative texts. The targets set for informative texts, as percentages of the 90-million-word written corpus, were as follows:

Natural and pure science 5%

Applied science 5%

Social and community 15%

World and current affairs 15%

Commerce and finance 10%

Arts 10%

Belief and thought 5%

Leisure 10%

Time

Almost all of the material selected dates from 1975 until the end of the collection period in 1994. In the case of imaginative texts, some from 1960-1975 were included, as their continued currency makes them an important part of contemporary British English; informative texts tend to fall out of circulation much more rapidly.

Medium

Books, magazines, newspapers, leaflets, essays, memos, minutes – texts come in many different forms, and the BNC covers a wide range. The targets, as percentages of the 90-million-word written corpus, were as follows:

Books 60%

Periodicals 30%

Miscellaneous 10%

The category 'Periodicals' includes newspapers and magazines. 'Miscellaneous' includes many hitherto uncollected items such as unpublished letters, reports, minutes and essays, 'written to be spoken' material such as television news scripts, and published items such as leaflets, brochures, and advertisements.

Level

A fourth selection criterion was planned, using the notion of different levels. High-level language might be imaginative works of high literary standing or university textbooks; low-level would be tabloid journalism, or some advertising. In practice the notion of level proved too subjective to be used in selecting texts, though its prominence in the design specification meant that the corpus builders were aware of the need both to cover a full stylistic range and to keep the corpus a balanced cross-section of different varieties of British English.

It was considered important to include more medium-level than either high- or low-level text in order to ensure that the corpus adequately represented 'normal' English. Once the character of normal, average English is established, it becomes possible to analyse various kinds of English – such as poetry, or tabloid press journalism – and identify what makes them special or distinctive. (Poetry, for example, often uses words and phrases out of their normal context.)

Implementing the selection criteria

About half the books in the corpus were selected at random from Whitaker's Books in Print (1992) by students of Library and Information Studies at Leeds Polytechnic (now Leeds Metropolitan University). The selections were accepted provided they fulfilled the design criteria, and Dewey Decimal Classification numbers were used to establish each book's subject area ('domain'). The rest of the books were selected more systematically, using best-seller lists for the years 1987-1991 as given in the Bookseller, various literary prize winners, books popular in public libraries, and finally the British National Bibliography, where once again Dewey Decimal codes were used to balance the domain targets agreed in the corpus design.

Other published written material comprised newspapers, magazines, and brochures and leaflets of various kinds. Newspaper text is increasingly easy to acquire, since most publishers now use computerised technology to produce their papers, and because, once it is published, most newspaper text is no longer thought of as work whose copyright must be closely guarded. The availability of newspaper text means that many researchers and teachers use it in their work; the value of the BNC's newspaper text is in its quantity, its range and its balance. The other material in this category – magazines, journals and leaflets of many kinds – was collected from around the country, and mostly captured by keyboarders using PCs.

Classification features

Each item accepted for inclusion in the BNC was further classified according to a set of features which describe it in more detail:

authorship (single, multiple, corporate)

author gender

author age group

author domicile

target age group

target gender

target level (high, medium, low)

place of publication

sample type (full, beginning, middle, end)

sample composition (a single text, or one made up of a number of components, eg a collection of essays)

subject (a more detailed classification than domain)

In addition, the sample size (number of words), sample extent (start and end page-numbers) and date of origin (usually of publication) are recorded.

This classification served two purposes. One was to help in the quest for balance across the BNC: when an imbalance came to light during the construction process, corrective action could be taken. For example, examination of the 'sample type' classification revealed that far more samples were being taken from the start of books than from the middle or end, so from then on more samples were taken from the middle or end.

The second purpose of these classification features is to enable researchers to select texts with certain features. For example, it is possible to create a subcorpus of writing by female authors and compare it with a subcorpus of writing by male authors, or to create a subcorpus of texts linked to a particular region within the UK.

Spoken Corpus Design

Possibly the most innovative and exciting part of the BNC project is the creation of a sizeable corpus of transcribed spoken English. A market research organisation was commissioned to recruit volunteers from a range of locations and social backgrounds throughout the UK; these volunteers then recorded their conversations over a number of days. Other recordings made at events across the UK, and recordings made by teenagers, made up the total of 10 million words of transcribed speech. The conversations recorded by the volunteers and the teenagers form the demographic corpus, and the recordings of selected events form the context-governed corpus. This, briefly, is how the project designers constructed a balanced cross-section of current British spoken English, to complement the written part of the corpus.

Demographic sampling

One hundred and twenty-four adults (aged fifteen or over) were selected from places throughout the United Kingdom. They comprised approximately equal numbers of men and women, and of people in particular age ranges and social groupings. The age ranges were 15-24, 25-34, 35-44, 45-59, and 60 and over. The social groupings were A/B, C1, C2, and D/E. The UK was divided into three major areas – North, Midlands, and South – and again attempts were made to recruit roughly equal numbers of people from each. The numbers of respondents by gender, age group and social group, and their distribution across the three regions, are shown in the table below.

(target figures in brackets)

Category             NORTH     MIDLANDS    SOUTH      TOTAL
All respondents     47 (41)     33 (41)   44 (41)       124

Gender
  male                   22          16        21    59 (62)
  female                 25          17        23    65 (62)

Age
  15-24                   8           7         8    23 (25)
  25-34                  13           7        10    30 (25)
  35-44                   8           4        12    24 (25)
  45-59                  10           8         8    26 (25)
  60+                     8           7         6    21 (25)

Social group
  A/B                    13           4        12    29 (31)
  C1                     14           9        10    33 (31)
  C2                     12           9        12    33 (31)
  D/E                     8          11        10    29 (31)


The respondents (those who had agreed to record their conversations) were supplied with a Walkman and several audiocassettes. They also received a notebook to write down personal details about themselves and the participants (those who took part in the conversations they recorded). Guarantees of anonymity were given to respondents and participants: names and places which could identify particular people are omitted from the transcriptions in the corpus. Most of the details supplied were transcribed with the tapes, and can be treated as background details about each conversation, or used in the selection of subcorpora (conversations from the North, for example, or including pensioners).

Additional recordings of teenage speech were made under the auspices of a University of Bergen research project (the Bergen Corpus of London Teenage Language, COLT), and these too were included in the demographic part of the corpus.

Context-governed sampling

The demographic sampling yielded a wide range of conversational English, but it would be wrong to assume that spoken English consists only of conversation. To balance conversation with more formal types of spoken English, and to include situations which the demographic approach might not cover, recordings were made at various events throughout the country, selected to fit four basic categories: educational, business, public/institutional, and leisure. Within these categories, an attempt was made to record 60% dialogue (discussion, debate or some other sort of interaction) and 40% monologue (lecturing or news reading, for example). For the educational category, dialogue recordings were made in classrooms and commercial tutorials, and monologue recordings at lectures and demonstrations. In the business category, dialogue was recorded at business meetings, consultations and interviews, and monologue recordings were made of company talks, trade union speeches and sales demonstrations. In the public/institutional category, dialogue came from council chambers, the Houses of Parliament, and religious and legal meetings, while monologues recorded included political speeches, sermons, and legal proceedings. The leisure category consisted of dialogue from events such as radio chat shows and phone-ins, and club meetings, and monologues like after-dinner speeches, sports commentaries, and talks to social clubs.

Notes were made about the context of each recording – the date, place, time, setting, size of audience, and the spontaneity of the speech being recorded – and transcribed with the tapes for the use of researchers interested in the background of each event or in selecting events with certain features.

The effect of the tape recorder on conversation

Recording natural, unstilted language was an important aim in the demographic corpus, so respondents were asked to record as unobtrusively as possible. They informed participants after the event that their words had been recorded and offered to erase the conversation from the tape. In practice, no one objected.

The extent to which the presence of the tape recorder affected the conversations recorded may only become clear after detailed analysis and discussion of the transcriptions, and even that may not yield definitive conclusions; but the impression gained by those who prepared the recordings and transcriptions at Longman was that, after some initial awkwardness in some situations, the conversations quickly became natural.

-------------------------------------------------------------------------------------------------------------------

Working on the transcriptions at Oxford University Computing Services, this writer (Burnage) had the opportunity to look at a wide range of the texts, and observed, though without rigorous study, that the tape recorder was on occasion like any other external stimulus which can affect a conversation (an aeroplane passing overhead, for example, or a colleague interrupting). It sometimes comes up in conversation, and provokes arguments and comments which appear to be identical, linguistically speaking, to arguments and comments on any topic. People react to the tape recorder, just as they would react to the presence of their boss, or a friend, or the vicar. The tape recorder is one of many triggers which provoke the various acts people 'put on' throughout their daily lives. A contrasting pair of examples illustrates the point well. Two teenage girls recorded a conversation in which they played up to the tape recorder, flirting with the man they imagined would transcribe the conversation. Two pensioners recorded a couple of conversations over meal times, and worried that the transcribers would think they were 'always eating'. It is not hard to imagine similar conversations being recorded by the same people in situations where something other than a tape recorder triggered similar words. Likewise, some people made a point of swearing when first aware of the tape recorder, while others noted they definitely wouldn't swear: such actions say more about the people and the way they speak than they do about the tape recorder. When the tape recorder does influence conversation, it influences it in 'normal' ways, sometimes provoking a certain amount of linguistic activity. Most conversations in the corpus, however, appear not to have been unduly affected.

-------------------------------------------------------------------------------------------------------------------

Constructing the BNC

Assembling a 100-million-word text corpus to the stipulated design was an enormous task, beset by many unforeseen difficulties. The main stages are described below.

Data collection

After the selection of material for the corpus, the first task was to turn it into 'electronic text' – text which can be held on and manipulated by computers. OUP were responsible for the bulk of the text capture, with some contributions from Longman. There were three ways of getting texts into electronic form: scanning, using optical character recognition (OCR); keyboarding – typing the text into PCs; and using existing electronic versions of the material.

Scanning

Scanning involves making an image of each page of a document, and using software (a computer program) to interpret that image as characters. In many respects, a scanner resembles a photocopier: the document is placed face down on a transparent surface, beneath which is the mechanism to make a digital image of the page. The image is stored as a computer file, and fed into software which produces another file containing the text – or rather, an approximation of it which can be corrected by hand. Scanning is not a simple, completely accurate way of making a text machine-readable; its success often depends on the size, design, and quality of the type of the original. A photocopy of a page from a book whose type is set in a small point size, for instance, may not scan very well. So after a page has been scanned, it has to be checked for misinterpreted characters, and corrected where necessary: this is almost a form of proof-reading.

Keyboarding

Sometimes the time required to make a scanned text readable makes the whole procedure counterproductive: in many cases, it is quicker to type in the text. For the BNC, magazines, leaflets, manuscript material and some newspapers were typed rather than scanned. In some cases – hand-written material, for example, or church leaflets typed then photocopied – this was because the text was too irregular or unclear for the scanner to read. The problem presented by magazines and leaflets was usually the complexity of the layout: whereas most books simply have a continuous flow of text from the top of each page to the bottom, magazine text may be in columns of varying width and may be interrupted by pictures, boxes of text discontinuous with the main text, pull-quotes, advertisements etc. A scanner cannot deduce the structure of such a layout.

Existing electronic material

Over recent years, the use of computing, and thus of electronic texts, has expanded enormously. More and more people produce their work with word processors; publishers increasingly use computer technology to design and typeset books and papers. Repositories for electronic text – such as the Oxford Text Archive at Oxford University Computing Services – have become established sites for the storage and distribution of all kinds of textual material. The designers of the BNC therefore initially expected that a large proportion of the material to be used in the corpus would come from existing electronic versions, and, moreover, that such texts would require only minimal work to be converted to the BNC mark-up scheme. In practice, though, more material had to be scanned and keyboarded than originally envisaged, for two reasons. One was that the design stipulated particular kinds of material – in some cases even particular books and magazines – and there was simply not enough electronic material in those categories to fill the quotas. The other was that much electronic text, particularly that from publishers' typesetting tapes, was locked up in proprietary software formats or encoded in dense mark-up schemes, and the time required to research and convert the different formats was too great. Exceptions were made where the amount of material was great enough to justify the work required, and thus some newspaper material found its way into the corpus from computer tapes supplied by the publishers.

Copyright and permissions

Oxford University Press, in conjunction with Longman, were responsible for obtaining copyright clearance or permission for the use of each text in the corpus. It proved to be an onerous task, though certain conditions on the use of the corpus ensured that most of those approached were happy to co-operate. Texts – normally only sampled sections of them – were included without charge to the BNC project, on condition that no commercial exploitation would be carried out from the corpus, and that the corpus would be issued to users under the terms of a standardised licence agreement protecting the owners' rights.

Textual Mark-up and the Text Encoding Initiative

Everyone is used to the typographic conventions by which a printed text is interpreted: capitals or bold type at the top of a page or above a paragraph indicate a headline or a title, for example; speech marks indicate spoken or quoted words; italics indicate stress or a cited title. When printed texts become electronic texts, however, these conventions are difficult to interpret, and often ambiguous. The human brain can interpret printed text because it understands the context; a computer needs to be given explicit information.

For this reason, the rise in popularity of electronic text was closely followed by concerns about how to mark up – or 'encode' – texts for computer use. The Text Encoding Initiative (TEI) (7) brought together academics from around the world to suggest guidelines for standardised mark-up in electronic texts to be exchanged within the Humanities world-wide. These guidelines are based on Standard Generalized Markup Language (SGML) (8), a computer language for defining mark-up systems by means of tags to denote text features such as chapters, paragraphs and headings, and entity references to represent characters, such as accented letters, which are not part of a standard character set and hence do not travel well. Such systems, because they use only a standard character set, are independent of any software or any type of computer. The TEI recommended specific (though fairly flexible) practices so that texts created in one environment can be used successfully in another. Its recommendations were published as the TEI Guidelines, which represent international agreement by scholars in specific subject areas concerning the mark-up conventions most useful for electronic texts in their disciplines.

Text encoding for the BNC

The BNC defined its own mark-up conventions in SGML, in conformity with the TEI Guidelines for language corpora, and named them the Corpus Document Interchange Format (CDIF). CDIF defines how textual features appear in BNC texts – things such as divisions (chapters, sections, articles or features), paragraphs, sentences (or, more properly, segments which approximate to sentences), headlines, highlighted words and phrases, lists, poems, and quotations. It also defines formats for the linguistic information (roughly speaking, part-of-speech tagging) provided by UCREL, and for the background and bibliographic data prefixed to each text and to the corpus as a whole (9).

CDIF is the format in which the BNC is issued to researchers, but it is not the format in which it was originally transcribed by OUP and Longman. Scanners and typists used mark-up schemes designed by the publishers for their own use, and it was the job of Oxford University Computing Services to convert text encoded in these schemes to CDIF.

This task proved to be more onerous than expected. The publishers' encoding schemes were modified in the light of experience, partly to make life easier for those typing and scanning, and partly to make the automatic conversion to CDIF more efficient. There were concerns that a full implementation of all the mark-up conventions considered helpful or interesting when CDIF was designed would require too much time and effort, so the conventions defined in CDIF were classified as 'required', 'recommended', or 'optional'. Features such as divisions, headlines, and paragraphs were deemed to be 'required' – they had to be marked up whenever they occurred in any CDIF text. Others, such as lists, were 'recommended' and some, such as the phrases with which letters are concluded, were 'optional' and, in practice, hardly used at all because the time required to include them was disproportionate to their value.

Linguistic Mark-up

Every word in each corpus text received a grammatical classification, and each text was divided automatically into sentence-like segments. Most of this work was done by computer program at UCREL, the Unit for Computer Research in the English Language at Lancaster University. Following extensive work with the LOB corpus, UCREL developed a program called CLAWS (Constituent-Likelihood Automatic Word-Tagging System) (10), and a version of it was used to provide grammatical classifications for the British National Corpus. For the full 100 million word corpus, some 61 different grammatical categories were used – including different types of nouns, verbs, adjectives, adverbs, conjunctions, determiners, and so on – such as are used in traditional grammar books. For a special sub-corpus of 2 million words, known as the core corpus, a more sophisticated set of 161 different grammatical categories was used. (The core corpus is intended to be a representative subset of the full 100 million word corpus, consisting of 1 million words of written material and 1 million words of transcribed spoken material.)

Although most of the grammatical analysis was automatic, some manual intervention was required. Even so, it would have been impossible to examine every single grammatical code, and so some errors exist in the grammatical encoding: the error rate was estimated at 1.7% for the whole 100 million word corpus. Under 5% of the words could not be given an unambiguous classification, and so were assigned ambiguity tags (or portmanteau tags) which give a choice of possible classifications, indicating the most likely option whenever possible. In the core corpus, where the more sophisticated set of grammatical encodings was used, the error rate was as low as 0.3%.

After the initial completion of the BNC, UCREL undertook to further enhance the grammatical classification of the whole British National Corpus, work due for completion in 1996.

Using the BNC

Text analysis and concordancing software: SARA

The corpus is distributed with search and retrieval software called SARA, developed specially for the project. As described in the Introduction, it can display some or all of the occurrences of a target word in context, with as little or as much surrounding text as the user chooses; searches can be restricted to particular parts of speech or to co-occurrences with other selected words; and the classification features described above can be used to confine a search to a subcorpus. Updated releases of the corpus, incorporating the enhanced grammatical classification mentioned in the previous section, are planned.

Getting hold of the BNC

Further information about the British National Corpus, including details of how to obtain it, is available at:

http://info.ox.ac.uk/bnc/

tel. +44 (1865) 273 280   fax +44 (1865) 273 275

enquiries: natcorp@oucs.ox.ac.uk   errors: bugs@natcorp.ox.ac.uk

References

1. Crystal, D. (1991), A Dictionary of Linguistics and Phonetics. Blackwell, 3rd Edition.

2. Sinclair, J.M. (1991), Corpus, Concordance, Collocation. Oxford University Press.

3. Crystal, D. (1984), Who Cares About English Usage? Penguin.

4. Introduction to Collins COBUILD English language dictionary (1987). HarperCollins.

5. Biber, D. (1988) Variation across speech and writing. Cambridge University Press.

6. Murray, Elisabeth K.M. (1979), Caught in the Web of Words. Oxford University Press.

7. Sperberg-McQueen, C.M. and Burnard, L. (eds) (1994), Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago and Oxford: Text Encoding Initiative.

8. Goldfarb, C.F. (1990) The SGML Handbook. Clarendon Press, Oxford.

9. Burnage, G. and Dunlop, D. (1992), 'Encoding the British National Corpus', in Aarts, J., de Haan, P. and Oostdijk, N. (eds), English Language Corpora: Design, Analysis and Exploitation (Papers from the Thirteenth International Conference on English Language Research on Computerized Corpora, Nijmegen 1992).

10. Garside, R., Leech, G. and Sampson, G. (1987), The Computational Analysis of English. Longman.

Further Reading

Burnard, L. (ed) (1995), Users Reference Guide for the British National Corpus. Oxford University Computing Services.
