Reference Guide for the British National Corpus (XML Edition)

edited by Lou Burnard

Published for the British National Corpus Consortium by the Research Technologies Service at Oxford University Computing Services

January 2007

Design of the corpus

This section discusses some of the basic design issues underlying the creation of the BNC. It summarizes the kinds of uses for which the corpus is intended, and the principles upon which it was created. Some summary information about the composition of the corpus is also included.

Purpose

The uses originally envisaged for the British National Corpus were set out in a working document called Planned Uses of the British National Corpus BNCW02 (11 April 91). This document identified the following as likely application areas for the corpus:

reference book publishing
academic linguistic research
language teaching
artificial intelligence
natural language processing
speech processing
information retrieval

The same document identified the following categories of linguistic information derivable from the corpus:

lexical
semantic/pragmatic
syntactic
morphological
graphological/written form/orthographical

In the 15 or more years since that document was published, it has become apparent that the corpus, and corpus methods in general, have had a far wider impact than anticipated, notably in the field of language teaching.

General definitions

The British National Corpus is:

a sample corpus: composed of text samples generally no longer than 45,000 words.
a synchronic corpus: the corpus includes imaginative texts from 1960, informative texts from 1975.
a general corpus: not specifically restricted to any particular subject field, register or genre.
a monolingual British English corpus: it comprises text samples which are substantially the product of speakers of British English.
a mixed corpus: it contains examples of both spoken and written language.

Composition

There is a broad consensus among the participants in the project and among corpus linguists that a general-purpose corpus of the English language would ideally contain a high proportion of spoken language in relation to written texts. However, it is significantly more expensive to record and transcribe natural speech than to acquire written text in computer-readable form. Consequently the spoken component of the BNC constitutes approximately 10 per cent (10 million words) of the total and the written component 90 per cent (90 million words). These were agreed to be realistic targets, given the constraints of time and budget, yet large enough to yield valuable empirical statistical data about spoken English. In the BNC sampler, a two per cent sample taken from the whole of the BNC, spoken and written language are present in approximately equal proportions, but other criteria are not equally balanced.

From the start, a decision was taken to select material for inclusion in the corpus according to an overt methodology, with specific target quantities of clearly defined types of language. This approach makes it possible for other researchers and corpus compilers to review, emulate or adapt concrete design goals. This section outlines these design considerations, and reports on the final make-up of the BNC.

This and the other tables in this section show the actual make-up of the second version of the British National Corpus (the BNC World Edition) in terms of

texts: number of distinct samples not exceeding 45,000 words
S-units: number of <s> elements identified by the CLAWS system (more or less equivalent to sentences)
W-units: number of <w> elements identified by the CLAWS system (more or less equivalent to words)

For further explanation of <s> and <w> elements, see section Segments and words.
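
As an illustration of how such counts might be derived, the following minimal sketch counts <s> and <w> elements in a small XML fragment using the Python standard library. The fragment is invented for this example and is not taken from the corpus; the <text> wrapper and the <c> element used for punctuation are assumptions, not something stated in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, for illustration only (not corpus data).
# The <text> wrapper and the <c> punctuation element are assumptions.
SAMPLE = """
<text>
  <s><w>The </w><w>corpus </w><w>is </w><w>large</w><c>.</c></s>
  <s><w>It </w><w>contains </w><w>speech</w><c>.</c></s>
</text>
"""


def count_units(xml_string):
    """Return (s_units, w_units) for an XML string."""
    root = ET.fromstring(xml_string.strip())
    # iter() walks the whole tree, so nested elements are counted too.
    s_units = sum(1 for _ in root.iter("s"))
    w_units = sum(1 for _ in root.iter("w"))
    return s_units, w_units


s_units, w_units = count_units(SAMPLE)
print(f"s-units: {s_units}, w-units: {w_units}")
```

Punctuation is deliberately left out of the w-unit count here, in line with the way the tables in this section report totals.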

Frequency tables

The XML Edition of the BNC contains 4049 texts and occupies (including all markup) 5,228,040 Kb, or about 5.2 Gb. In total, it comprises just under 100 million orthographic words (specifically, 96816116), but the number of w-units (POS-tagged items) is slightly higher at 98363783. The tagging distinguishes a further 13614425 punctuation strings, giving a total content count of 110691482 strings. The total number of s-units tagged is about 6 million (6026284). Counts for these and all the other elements tagged in the corpus are provided in the corpus header.

In the following tables both an absolute count and a percentage are given for all the counts. The percentage is calculated with reference to the relevant portion of the corpus, for example, in the table for "written text domain", with reference to the total number of w-units in written texts. Note that punctuation strings are not included in these totals. The reference totals used are given in the first table below.

Table 1. Text type
	texts	w-units	%	s-units	%
Spoken demographic	153	4233955	4.30	610557	10.13
Spoken context-governed	755	6175896	6.27	427523	7.09
Written books and periodicals	2685	79238146	80.55	4395581	72.94
Written-to-be-spoken	35	1278618	1.29	104665	1.73
Written miscellaneous	421	7437168	7.56	487958	8.09
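
The w-unit percentages in Table 1 can be reproduced from the absolute counts, as in the sketch below. The function name is hypothetical. Truncating to two decimal places (rather than rounding) is an inference from the published figures, which it happens to match; the guide itself does not state which convention was used.

```python
# w-unit counts copied from Table 1.
W_UNITS = {
    "Spoken demographic": 4233955,
    "Spoken context-governed": 6175896,
    "Written books and periodicals": 79238146,
    "Written-to-be-spoken": 1278618,
    "Written miscellaneous": 7437168,
}


def truncated_percent(count, total):
    """Share of total as a percentage string, truncated to two decimals.

    Integer arithmetic is used so that no floating-point rounding occurs.
    Truncation (not rounding) is an assumption inferred from the table.
    """
    hundredths = count * 10000 // total
    return f"{hundredths // 100}.{hundredths % 100:02d}"


total = sum(W_UNITS.values())
shares = {name: truncated_percent(n, total) for name, n in W_UNITS.items()}

for name, share in shares.items():
    print(f"{name}\t{W_UNITS[name]}\t{share}")
```

The same calculation applies to the later tables, with the reference total replaced by the w-unit total for the relevant portion of the corpus (for example, written texts only).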

All texts are also classified according to their date of production. For spoken texts, the date was that of the recording. For written texts, the date used for classification was the date of production of the material actually transcribed, for the most part; in the case of imaginative works, however, the date of first publication was used. Informative texts were selected only from 1975 onwards, imaginative ones from 1960, reflecting their longer ‘shelf-life’, though most (75 per cent) of the latter were published no earlier than 1975.

Table 2. Publication date
	texts	w-units	%	s-units	%
Unknown	162	1831585	1.86	126416	2.09
1960-1974	46	1718449	1.74	119510	1.98
1975-1984	169	4730889	4.80	257962	4.28
1985-1993	3672	90082860	91.58	5522396	91.63

Spoken and written components of the corpus are discussed separately in the next two sections.

Design of the written component

Sampling basis: production and reception

While it is sometimes useful to distinguish in theory between language which is received (read and heard) and that which is produced (written and spoken), it was agreed that the selection of samples for a general-purpose corpus must take account of both perspectives.

Text that is published in the form of books, magazines, etc., is not representative of the totality of written language that is produced, as writing for publication is a comparatively specialized activity in which few people engage. However, it is much more representative of written language that is received, and is also easier to obtain in useful quantities, and thus forms the greater part of the written component of the corpus.

There was no single source of information about published material that could provide a satisfactory basis for a sampling frame, but a combination of various sources furnished useful information about the totality of written text produced and, particularly, received, some sources being more significant than others. They are principally statistics about books and periodicals that are published, bought or borrowed.

Catalogues of books published per annum tell us something about production but little about reception as many books are published but hardly read.

A list of books in print provides somewhat more information about reception as time will weed out the books that nobody bought (or read): such a list will contain a higher proportion of books that have continued to find a readership.

The books that have the widest reception are presumably those that figure in bestseller lists, particularly prize winners of competitions such as the Booker or Whitbread. Such works were certainly candidates for inclusion in the corpus, but the statistics of book-buying are such that very few texts achieve high sales while a vast number sell only a few or in modest numbers. If texts had been selected in strict arithmetical proportion to their sales, their range would have been severely limited. However, where a text from one particular subject domain was required, it was appropriate to prefer a book which had achieved high sales to one which had not.

Library lending statistics, where these are available, also indicate which books enjoy a wide reception and, like lists of books in print, show which books continue to be read.

Similar observations hold for magazines and periodicals. Lists of current magazines and periodicals are similar to catalogues of published books, but perhaps more informative about language reception, as it may be that periodicals are bought and read by a wider cross-section of the community than books. Also, a periodical that fails to find a readership will not continue to be published for long.

Periodical circulation figures have to be treated with the same caution as bestseller lists, as a few titles dominate the market with a very high circulation. To concentrate too exclusively on these would reduce the range of text types in the corpus and make contrastive analysis difficult.

Published written texts were selected partly at random from Whitaker's Books in Print for 1992 and partly systematically, according to the selection features outlined in section Selection features below.

Available sources are concerned almost exclusively with published books and periodicals. It is much more difficult to obtain data concerning the production or reception of unpublished writing. Intuitive estimates were therefore made in order to establish some guidelines for text sampling in the latter area.

Selection features

Texts were chosen for inclusion according to three selection features: domain (subject field), time (within certain dates) and medium (book, periodical, etc.).

The purpose of these selection features was to ensure that the corpus contained a broad range of different language styles, for two reasons. The first was so that the corpus could be regarded as a microcosm of current British English in its entirety, not just of particular types. The second was so that different types of text could be compared and contrasted with each other.

Selection Procedure

Each selection feature was divided into classes (e.g. ‘Medium’ into books, periodicals, unpublished etc.; ‘Domain’ into imaginative, informative, etc.) and target percentages were set for each class. These percentages are quite independent of each other: there was no attempt, for example, to make 25 per cent of the selected periodicals imaginative.

Seventy-five per cent of the samples were to be drawn from informative texts, and the remaining 25 per cent from imaginative texts.

Titles were to be taken from a variety of media, in the following proportions: 60 per cent from books, 30 per cent from periodicals, 10 per cent from miscellaneous sources (published, unpublished, and written to be spoken).

Half of the books in the ‘Books and Periodicals’ class were selected at random from Whitaker's Books in Print 1992. This was to provide a control group to validate the categories used in the other method of selection: the random selection disregarded Domain and Time, but texts selected by this method were classified according to these other features after selection.

Sample size and method

For books, a target sample size of 40,000 words was chosen. No extract included in the corpus exceeds 45,000 words. For the most part, texts which in their entirety were shorter than 40,000 words were further reduced by ten per cent for copyright reasons; a few texts longer than the target size were however included in their entirety. Text samples normally consist of a continuous stretch of discourse from within the whole. A convenient breakpoint (e.g. the end of a section or chapter) was chosen as far as possible to begin and end the sample so that high-level discourse units were not fragmented. Only one sample was taken from any one text. Samples were taken randomly from the beginning, middle or end of longer texts. (In a few cases, where a publication included essays or articles by a variety of authors of different nationalities, the work of non-UK authors was omitted.)
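
The general sampling rule for books can be sketched as follows. This is a simplification under stated assumptions, not the procedure actually used: the function name and return values are hypothetical, and the sketch ignores the adjustment of sample boundaries to convenient breakpoints as well as the few longer texts that were included in their entirety.

```python
import random

TARGET_WORDS = 40_000  # target sample size for books
MAX_WORDS = 45_000     # no extract in the corpus exceeds this


def planned_sample(total_words, rng=random):
    """Return (position, words) for a text of the given length.

    Simplified assumption: short texts are taken whole, less ten
    per cent; longer texts yield one sample of the target size from
    a randomly chosen position.
    """
    if total_words < TARGET_WORDS:
        return "whole", total_words * 9 // 10
    position = rng.choice(["beginning", "middle", "end"])
    return position, TARGET_WORDS


print(planned_sample(30_000))
print(planned_sample(200_000, random.Random(0)))
```

In practice the sample length would vary around the target, since the start and end points were moved to section or chapter boundaries where possible, up to the 45,000-word ceiling.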

Some types of written material are composite in structure: that is, the physical object in written form is composed of more than one text unit. Important examples are issues of a newspaper or magazine which, though editorially shaped as a document, contain discrete texts, each with its specific authorship, stylistic characteristics, register and domain. The BNC attempts to separate these discrete texts where appropriate and to classify them individually according to the selection and classification features. As far as possible, the individual stories in one issue of a newspaper were grouped according to domain, for example as ‘Business’ articles, ‘Leisure’ articles, etc.

The following subsections discuss each selection criterion, and indicate the actual numbers of words in each category included.

Domain

Classification according to subject field seems hardly appropriate to texts which are fictional or which are generally perceived to be literary or creative. Consequently, these texts are all labelled imaginative and are not assigned to particular subject areas. All other texts are treated as informative and are assigned to one of the eight domains listed below.

Table 3. Written Domain
	texts	w-units	%	s-units	%
Imaginative	476	16496420	18.75	1352150	27.10
Informative: natural & pure science	146	3821902	4.34	183384	3.67
Informative: applied science	370	7174152	8.15	356662	7.15
Informative: social science	526	14025537	15.94	698218	13.99
Informative: world affairs	483	17244534	19.60	798503	16.00
Informative: commerce & finance	295	7341163	8.34	382374	7.66
Informative: arts	261	6574857	7.47	321140	6.43
Informative: belief & thought	146	3037533	3.45	151283	3.03
Informative: leisure	438	12237834	13.91	744490	14.92

The evidence from catalogues of books and periodicals suggests that imaginative texts account for significantly less than 25 per cent of published output, and unpublished reports, correspondence, reference works and so on would seem to add further to the bulk of informative text which is produced and consumed. However, the overall distribution between informative and imaginative text samples is set to reflect the influential cultural role of literature and creative writing. The target percentages for the eight informative domains were arrived at by consensus within the project, based loosely upon the pattern of book publishing in the UK during the past 20 years or so, as reflected in the categorized figures for new publications that appear annually in Whitaker's Book list.

Medium

This categorisation is broad, since a detailed taxonomy or feature classification of text medium could have led to such a proliferation of subcategories as to make it impossible for the BNC adequately to represent all of them. The labels used here are intended to be comprehensive in the sense that any text can be assigned with reasonable confidence to these macro categories. The labels we have adopted represent the highest levels of a fuller taxonomy of text medium.

Table 4. Written Medium
	texts	w-units	%	s-units	%
Book	1411	50293803	57.18	2887523	57.88
Periodical	1208	28609494	32.52	1487644	29.82
Miscellaneous published	238	4233135	4.81	287700	5.76
Miscellaneous unpublished	249	3538882	4.02	220672	4.42
To-be-spoken	35	1278618	1.45	104665	2.09

The ‘Miscellaneous published’ category includes brochures, leaflets, manuals, advertisements. The ‘Miscellaneous unpublished’ category includes letters, memos, reports, minutes, essays. The ‘written-to-be-spoken’ category includes scripted television material, play scripts etc.
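
The target proportions for medium given under Selection Procedure (60 per cent books, 30 per cent periodicals, 10 per cent miscellaneous) can be set against the figures in Table 4, as in the sketch below. Note the assumption involved: the targets were stated for titles, whereas this comparison uses w-unit shares, so the differences are indicative only. Percentages are held as integer hundredths of a per cent and truncated, which matches the table.

```python
# Targets from the Selection Procedure section, in hundredths of a per cent.
TARGET = {"Book": 6000, "Periodical": 3000, "Miscellaneous": 1000}

# w-unit counts from Table 4; the three non-book, non-periodical
# categories are combined to match the single 10 per cent target.
W_UNITS = {
    "Book": 50293803,
    "Periodical": 28609494,
    "Miscellaneous": 4233135 + 3538882 + 1278618,
}

total = sum(W_UNITS.values())
achieved = {m: n * 10000 // total for m, n in W_UNITS.items()}
deviation = {m: achieved[m] - TARGET[m] for m in TARGET}

for m in TARGET:
    print(
        f"{m}: achieved {achieved[m] / 100:.2f}, "
        f"target {TARGET[m] / 100:.0f}, "
        f"difference {deviation[m] / 100:+.2f} points"
    )
```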

Descriptive features

Written texts may be further classified according to sets of descriptive features. These features describe the sample texts; they did not determine their selection. This information is recorded to allow more delicate contrastive analysis of particular sets of texts. As a simple example, the gross division into two time periods in the selection features can, of course, be refined and subcorpora defined over the BNC for more specific dates. However, the relative sizes of such subcorpora are undefined by the BNC design specification.

These descriptive features were monitored during the course of the data gathering, and text selection, in cases where a free choice of texts was available, took account of the relative balance of these features. Thus although no relative proportions were defined for different target age groups (for example), we ensured that the corpus does contain texts intended for children as well as for adults.

The following tables summarize the results for the first release of the corpus. Note that many texts remain unclassified.

Author information

Information about authors of written texts was included only where it was readily available, for example from the dust-wrapper of a book. Consequently, the coverage of such information is very patchy. The authorship of a written text was characterized as ‘corporate’ where it was produced by an organization and no specific author was given, and as ‘multiple’ in cases where several authors were named. Author sex was classified as ‘mixed’ where more than one author of either sex was specified, and ‘unknown’ where it could not reliably be determined from the author's name. Note that ‘author age’ means the author's age at the time of creation of the work concerned.

Table 5. Author type
	texts	w-units	%	s-units	%
Unknown	211	3786835	4.30	174371	3.49
Corporate author	347	6497144	7.38	455649	9.13
Multiple author	1322	34563219	39.29	1810901	36.30
Sole author	1261	43106734	49.01	2547283	51.06

Table 6. Sex of author
	texts	w-units	%	s-units	%
Unknown	1573	36161115	41.11	1968162	39.45
Author sex Male	920	30665582	34.86	1671420	33.50
Author sex Female	414	14588260	16.58	967522	19.39
Author sex Mixed	234	6538975	7.43	381100	7.64

Table 7. Author age-group
	texts	w-units	%	s-units	%
Unknown	2518	66000719	75.04	3687586	73.92
Author age 0-14	3	59559	0.06	3443	0.06
Author age 15-24	19	542578	0.61	29810	0.59
Author age 25-34	66	2267139	2.57	159455	3.19
Author age 35-44	191	6726926	7.64	410143	8.22
Author age 45-59	205	7230714	8.22	410644	8.23
Author age 60+	139	5126297	5.82	287123	5.75

Table 8. Domicile
	texts	w-units	%	s-units	%
Unknown	2272	57227155	65.06	3133068	62.80
Author domicile UK and Ireland	841	29760000	33.83	1798301	36.05
Author domicile Commonwealth	12	411207	0.46	25759	0.51
Author domicile Continental Europe	6	234402	0.26	12466	0.24
Author domicile USA	8	245604	0.27	15675	0.31
Author domicile Elsewhere	2	75564	0.08	2935	0.05

Target audience

Some attempt was made to characterize the kind of audience for which written texts were produced in terms of age, sex and ‘level’ (a subjective assessment of the text's technicality or difficulty). The last of these proved very difficult to assess and was very frequently confused with circulation size or audience size; for that reason, no figures for it are included here.

Table 9. Audience age
	texts	w-units	%	s-units	%
Child audience	42	903690	1.02	81074	1.62
Teenager audience	78	1831178	2.08	138098	2.76
Adult audience	2911	81928776	93.14	4597388	92.16
Any audience	110	3290288	3.74	171644	3.44

Table 10. Audience sex
	texts	w-units	%	s-units	%
Unknown	706	20271270	23.04	1131254	22.67
Male audience	61	2396935	2.72	135950	2.72
Female audience	175	6904137	7.84	503629	10.09
Mixed audience	2199	58381590	66.37	3217371	64.49

Miscellaneous classification information

Written texts were also characterized according to their place of publication and the type of sampling used.

Table 11. Publication place
	texts	w-units	%	s-units	%
Unknown	690	14718827	16.73	788440	15.80
UK (unspecific) publication	263	7163111	8.14	380824	7.63
Ireland publication	37	570652	0.64	31793	0.63
UK: North (north of Mersey-Humber line) publication	191	3781055	4.29	228247	4.57
UK: Midlands (north of Bristol Channel-Wash line) publication	93	2590345	2.94	177308	3.55
UK: South (south of Bristol Channel-Wash line) publication	1853	58587808	66.61	3360401	67.36
United States publication	14	542134	0.61	21191	0.42

Table 12. Sampling type
	texts	w-units	%	s-units	%
Unknown	1583	35551102	40.42	1991798	39.93
Whole text	270	6524975	7.41	433722	8.69
Beginning sample	584	21075222	23.96	1119251	22.43
Middle sample	510	18454807	20.98	1049692	21.04
End sample	119	4317326	4.90	253322	5.07
Composite sample	75	2030500	2.30	140419	2.81

In addition to the above, standard bibliographic details such as author, title, publication details, extent, topic keywords etc. were recorded for the majority of texts, as further described below (see The header).

Selection procedures employed

Books

Roughly half the titles were randomly selected from available candidates identified in Whitaker's Books in Print (BIP), 1992, by students of Library and Information Studies at Leeds City University. Each text randomly chosen was accepted only if it fulfilled certain criteria: it had to be published by a British publisher, contain sufficient pages of text to make its incorporation worthwhile, consist mainly of written text, fall within the designated time limits, and cost less than a set price. The students noted the ISBN, author, title and price of each book thus selected; the final selection weeded out texts by non-UK authors.

Half of the books having been selected by this method, the remaining half were selected systematically to make up the target percentages in each category. The selection proceeded as follows.

Bestsellers

Because of their wide reception, bestsellers were obvious candidates for selection. The lists used were those that appeared in the Bookseller at the end of the years 1987 to 1993 inclusive. Some of the books in the lists were rejected, for a variety of reasons. Obviously books that had already been selected by the random method were excluded, as were those by non-UK authors. In addition, a limit of 120,000 words from any one author was imposed, and books belonging to a domain or category whose quota had already been reached were not selected. Other bestseller lists were obtained from The Guardian, the British Council, and from Blackwells Paperback Shop.

The titles yielded by this search were mostly in the Imaginative category.

Literary prizes

The criteria for inclusion were the same as for bestsellers. The prize winners, together with runners-up and shortlisted titles, were taken from several sources, principally Anne Strachan, Prizewinning literature: UK literary award winners, London, 1989. For 1990 onwards the sources used were: the last issue of the Bookseller for each year; The Guardian Index, 1989–, entries under the term ‘Literature’; and The Times Index, 1989-, entries under the term ‘Literature — Awards’.

Literary prizes are in the main awarded to works that fall into the Imaginative category, but there are some Informative ones also.

Library loans

The source of statistics in this category was the record of loans under Public Lending Right, kindly provided by Dr J. Parker, the Registrar. The information comprised lists of the hundred most issued books and the hundred most issued children's books, in both cases for the years 1987 to 1993.

The lists consist almost exclusively of imaginative literature, and many titles found there also appear in the lists of bestsellers and prize winners.

Additional texts

As collection proceeded, monitoring disclosed potential shortfalls in certain domains. A further selection was therefore made, based on the ‘Short Loan’ collections of seven University libraries. (Short Loan collections typically contain books required for academic courses, which are consequently in heavy demand.)

Periodicals and magazines

Periodicals, magazines and newspapers account for 30 per cent of the total text in the corpus. Of these, about 250 titles were issues of newspapers. These were selected to cover as wide a spectrum of interests and language as possible. Newspapers were selected to represent as wide a geographic spread as possible: The Scotsman and the Belfast Telegraph are both represented, for example.

Other media

In addition to samples from books, periodicals, and magazines, the written part of the corpus contains about seven million words classified as ‘Miscellaneous Published’, ‘Miscellaneous Unpublished’, or as ‘Written to be spoken’. The distinction between ‘published’ and ‘unpublished’ is not an easy one; the former category largely contains publicity leaflets, brochures, fact sheets, and similar items, while the latter has a substantial proportion of school and university essays, unpublished creative writing or letters, and internal company memoranda. The ‘written to be spoken’ material includes scripted material intended to be read aloud, such as television news broadcasts; transcripts of more informal broadcast materials such as discussions or phone-ins are included in the spoken part of the corpus.

Copyright permissions

Before a selected text could be included, permissions had to be obtained from the copyright owner (publisher, agent, or author). A standard Permissions Request was drafted with considerable care, but some requests were refused, or simply not answered even after prompting, so that the texts concerned had to be excluded or replaced.

Design of the spoken component

Lexicographers and linguists have long hoped for corpus evidence about spoken language, but the practical difficulties of transcribing sufficiently large quantities of text have prevented the construction of a spoken corpus of over one million words. The British National Corpus project undertook to produce five to ten million words of orthographically transcribed speech, covering a wide range of speech variation. A large proportion of the spoken part of the corpus — over four million words — comprises spontaneous conversational English. The importance of conversational dialogue to linguistic study is unquestionable: it is the dominant component of general language both in terms of language reception and language production.

As with the written part of the corpus, the most important considerations in constructing the spoken part were sampling and representativeness. The method of transcription was also an important issue.

The issues of corpus sampling and representativeness have been discussed at great length by many corpus linguists. With spoken language there are no obvious objective measures that can be used to define the target population or construct a sampling frame. A comprehensive list of text types can be drawn up but there is no accurate way of estimating the relative proportions of each text type other than by a priori linguistically motivated analysis. An alternative approach, one well known to sociological researchers, is demographic sampling, and this was broadly the approach adopted for approximately half of the spoken part of the corpus. The sampling frame was defined in terms of the language production of the population of British English speakers in the United Kingdom. Representativeness was achieved by sampling a spread of language producers in terms of age, gender, social group, and region, and recording their language output over a set period of time.

We recognised, however, that many types of spoken text are produced only rarely in comparison with the total output of all ‘speech producers’: for example, broadcast interviews, lectures, legal proceedings, and other texts produced in situations where — broadly speaking — there are few producers and many receivers. A corpus constituted solely on the demographic model would thus omit important spoken text types. Consequently, the demographic component of the corpus was complemented with a separate text typology intended to cover the full range of linguistic variation found in spoken language; this is termed the context-governed part of the corpus.

The demographically sampled part of the corpus

The approach adopted uses demographic parameters to sample the population of British English speakers in the United Kingdom. Established random location sampling procedures were used to select individual members of the population by personal interview from across the country taking into account age, gender, and social group. Selected individuals used a portable tape recorder to record their own speech and the speech of people they conversed with over a period of up to a week. In this way a unique record of the language people use in everyday conversation was constructed.

Sampling procedure

124 adults (aged 15+) were recruited from across the United Kingdom. Recruits were of both sexes and from all age groups and social classes. The intention was, as far as possible, to recruit equal numbers of men and women, equal numbers from each of the six age groups, and equal numbers from each of four social classes.
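
The arithmetic behind the recruitment intention can be sketched as follows. The function name is hypothetical, and the sketch treats each characteristic separately; whether quotas were also balanced across combinations of sex, age and class is not stated here.

```python
RECRUITS = 124
GROUPS = {"sex": 2, "age group": 6, "social class": 4}


def equal_quotas(total, groups):
    """Split total into `groups` quotas that differ by at most one."""
    base, extra = divmod(total, groups)
    return [base + 1] * extra + [base] * (groups - extra)


for name, n in GROUPS.items():
    print(name, equal_quotas(RECRUITS, n))
```

With 124 recruits, exactly equal numbers are possible for sex and social class, but only approximately for six age groups.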

Additional recordings were gathered for the BNC as part of the University of Bergen COLT Teenager Language Project. This project used the same recording methods and transcription scheme as the BNC, but selected only respondents aged 16 or below.

The tables below give figures for the amount of transcribed material collected by each respondent, classified by their age, class, and sex.

Table 13. Age-group
	texts	w-units	%	s-units	%
Respondent Age 0-14	26	267005	6.30	41036	6.72
Respondent Age 15-24	36	665358	15.71	97993	16.04
Respondent Age 25-34	29	853832	20.16	121752	19.94
Respondent Age 35-44	22	845153	19.96	126690	20.74
Respondent Age 45-59	20	963483	22.75	136530	22.36
Respondent Age 60+	20	639124	15.09	86556	14.17

Table 14. Social class
	texts	w-units	%	s-units	%
Unknown	7	37622	0.88	5340	0.87
AB respondent	59	1372933	32.42	197795	32.39
C1 respondent	36	1104279	26.08	169387	27.74
C2 respondent	31	1087808	25.69	144876	23.72
DE respondent	20	631313	14.91	93159	15.25

Table 15. Sex
	texts	w-units	%	s-units	%
Unknown	5	16245	0.38	2407	0.39
Male respondent	73	1742222	41.14	248241	40.65
Female respondent	75	2475488	58.46	359909	58.94

Recruits who agreed to take part in the project were asked to record all of their conversations over a two to seven day period. The number of days varied depending on how many conversations each recruit was involved in and was prepared to record. Results indicated that most people recorded nearly all of their conversations, and that the limiting factor was usually the number of conversations a person had per day. The placement day was varied, and recruits were asked to record on the day after placement and on any other day or days of the week. In this way a broad spread of days of the week including weekdays and weekends was achieved. A conversation log allowed recruits to enter details of every conversation recorded, and included date, time and setting, and brief details of other participants.

Recording procedure

All conversations were recorded as unobtrusively as possible, so that the material gathered approximated closely to natural, spontaneous speech. In many cases the only person aware that the conversation was being taped was the person carrying the recorder. Although an initial unnaturalness on the part of the recruit was not uncommon, this soon seemed to disappear. Similarly, where non-intrusive recording was not possible, for example at a family gathering where everyone was aware they were being recorded, the same initial period of unease sometimes occurred, but in our experience again vanished quickly. The guarantee of confidentiality and complete anonymity (all references to full names and addresses have been removed from the corpus and the log), and the fact that there was an intermediary between those being recorded and those listening to the recordings, certainly helped.

For each conversational exchange the person carrying the recorder told all participants they had been recorded and explained why. Whenever possible this happened after the conversation had taken place. If any participant was unhappy about being recorded the recording was erased. During the project around 700 hours of recordings were gathered.

Sample size

The number of people recruited may seem small in comparison to some demographic studies of the population of the United Kingdom. As with any sampling method, some compromise between what was theoretically desirable and what was feasible within the constraints of the BNC project had to be made. There is no doubt that recruiting 1000 people would have given greater statistical validity, but the practical difficulties and cost implications of recruiting 1000 people and transcribing 50–100 million words of speech made this impossible. Given that we were not attempting to represent the complete range of age and social groups within each region, we considered that a sample size between 100 and 130 would be adequate. It is also important to stress that the total number of participants in all conversations was well in excess of a thousand.

Piloting the demographic sampling approach

Because this approach to spoken corpus sampling had, to our knowledge, never previously been attempted, a detailed piloting project was carried out to investigate:

the likelihood that enough material would be obtained from a sample of around 100 people
any problems that might be encountered during the recruitment and collection stages
any problems or difficulties experienced by recruits during taping or with logging details of conversations and participants
any areas where the documentation designed for the project could be improved
whether the recording quality under a wide range of conditions would be good enough for accurate transcription
whether the predicted throughput rates for tape editing, transcription and checking were accurate.

The results of the pilot generally confirmed predictions and allowed some procedures to be refined for the full project.

The context-governed part of the corpus

As mentioned above, the spoken texts in the demographic part of the corpus consist mainly of conversational English. A complementary approach was developed to create what is termed the context-governed part of the corpus. As in other spoken corpora, the range of text types was selected according to a priori linguistically motivated categories. At the top layer of the typology is a division into four equal-sized contextually based categories: educational, business, public/institutional, and leisure. Each is divided into the subcategories monologue (40 per cent) and dialogue (60 per cent). Each monologue subcategory therefore totals 10 per cent of the context-governed part of the corpus, and each dialogue subcategory 15 per cent.
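The arithmetic behind these design proportions can be made explicit with a short Python sketch. The names `CONTEXTS`, `INTERACTION_SPLIT` and `design_shares` are illustrative assumptions and not part of any BNC software; the figures are the target proportions described above, not the observed composition of the corpus.

```python
from fractions import Fraction

# Four equal-sized context categories, as described in the design.
CONTEXTS = ["educational", "business", "public/institutional", "leisure"]

# Target split within each category (hypothetical name; figures from the text).
INTERACTION_SPLIT = {
    "monologue": Fraction(40, 100),
    "dialogue": Fraction(60, 100),
}


def design_shares():
    """Return the target share of the context-governed part for each cell."""
    per_context = Fraction(1, len(CONTEXTS))  # each category is one quarter
    return {
        (context, interaction): per_context * split
        for context in CONTEXTS
        for interaction, split in INTERACTION_SPLIT.items()
    }


shares = design_shares()
for (context, interaction), share in shares.items():
    print(f"{context:22s} {interaction:10s} {float(share):.0%}")
```

Exact fractions are used so that the eight cells can be checked to sum to exactly one, with no floating-point residue.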

Within each subcategory a range of text types was defined. This range was not fixed, and the design was flexible enough to allow the inclusion of additional text types. The sampling methodology was different for each text type but the overall aim was to achieve a balanced selection within each, taking into account such features as region, level, gender of speakers, and topic. Other features, such as purpose, were applied on the basis of post hoc judgements.

Sampling procedure

For the most part, a variety of text types were sampled within three geographic regions. However, some text types, such as parliamentary proceedings, and most broadcast categories, apply to the country as a whole and were not regionally sampled. Different sampling strategies were required for each text type, and these are outlined below.

Educational and informative:

Lectures, talks, educational demonstrations: Within each sampling area a university (or college of further education) and a school were selected. A range of lectures and talks was recorded, varying the topic, level, and speaker gender.
News commentaries: Regional sampling was not applied, but both national and regional broadcasting companies were sampled. The topic, level, and gender of commentator were varied.
Classroom interaction: Schools were regionally sampled and the level (generally based on student age) and topic were varied. Home tutorials were also included.

Business:

Company talks and interviews: Sampling took into account company size, areas of activity, and gender of speakers.
Trade union talks: Talks to union members, branch meetings and annual conferences were all sampled.
Sales demonstrations: A range of topics was included.
Business meetings: Companies were selected according to size, area of activity, and purpose of meeting.
Consultations: These included medical, legal, business and professional consultations.

All categories under this heading were regionally sampled.

Public/institutional:

Political speeches: Regional sampling of local politics, plus speeches in both the House of Commons and the House of Lords.
Sermons: Different denominations were sampled.
Public/government talks: Regional sampling of local inquiries and meetings, plus national issues at different levels.
Council meetings: Regionally sampled, covering parish, town, district, and county councils.
Religious meetings: Includes church meetings, group discussions, and so on.
Parliamentary proceedings: Sampling of main sessions and committees, House of Commons and House of Lords.
Legal proceedings: Royal Courts of Justice, and local Magistrates' and similar courts were sampled.

Leisure:

Speeches: Regionally sampled, covering a variety of occasions and speakers.
Sports commentaries: Exclusively broadcast, sampling a variety of sports, commentators, and TV/radio channels.
Talks to clubs: Regionally sampled, covering a range of topics and speakers.
Broadcast chat shows and phone-ins: Only those that include a significant amount of unscripted speech were selected from both television and radio.
Club meetings: Regionally sampled, covering a wide range of clubs.

Sample size

Each monologue text type contains up to 200,000 words of text, and each dialogue text type up to 300,000 words. The length of text units within each text type varies: for example, news commentaries may be only a few minutes long (several hundred words), lectures are typically up to one hour (10,000 words), and some business meetings and parliamentary proceedings may last for several hours (20,000 words or more). For the context-governed part of the corpus an upper limit of 10,000 words per text unit was generally imposed, although a few texts are slightly above this.

Composition of the spoken component

A total of 757 texts (6,153,671 words) make up the context-governed part of the corpus. The following contexts are distinguished:

Table 16. Spoken context
	texts	w-units	%	s-units	%
Educational/Informative	169	1646380	26.65	118987	27.83
Business	129	1282416	20.76	107366	25.11
Public/Institutional	262	1672658	27.08	96500	22.57
Leisure	195	1574442	25.49	104670	24.48
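The percentage columns of Table 16 can be recomputed from the counts, as in the following Python sketch. The names `TABLE_16` and `truncated_percent` are illustrative assumptions. The printed percentages appear to be truncated rather than rounded to two decimal places; this is an inference from the figures, not something stated in this guide. The sketch sums the rows as printed, so its totals are those of the table rather than the figures quoted in the running text.

```python
# Rows of Table 16 as printed: context -> (texts, w-units, s-units).
TABLE_16 = {
    "Educational/Informative": (169, 1646380, 118987),
    "Business": (129, 1282416, 107366),
    "Public/Institutional": (262, 1672658, 96500),
    "Leisure": (195, 1574442, 104670),
}


def truncated_percent(count, total):
    """Percentage of total, truncated (not rounded) to two decimal places."""
    # Integer arithmetic avoids floating-point error before truncation.
    return (count * 10000 // total) / 100


total_w = sum(row[1] for row in TABLE_16.values())
total_s = sum(row[2] for row in TABLE_16.values())

for context, (texts, w_units, s_units) in TABLE_16.items():
    print(
        context,
        truncated_percent(w_units, total_w),
        truncated_percent(s_units, total_s),
    )
```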

In addition, the following classifications are applicable to both demographic and context-governed spoken texts:

Table 17. Region
	texts	w-units	%	s-units	%
Unknown	35	448458	4.30	27496	2.64
South	311	4687877	45.03	457726	44.09
Midlands	213	2492236	23.94	240306	23.14
North	349	2781280	26.71	312552	30.10

Table 18. Interaction type
	texts	w-units	%	s-units	%
Monologue	207	1562017	15.00	92619	8.92
Dialogue	701	8847834	84.99	945461	91.07
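Because Tables 17 and 18 classify the same set of spoken texts, their column totals should agree. The following Python sketch checks this using the figures as printed; the names `TABLE_17_REGION`, `TABLE_18_INTERACTION` and `column_totals` are illustrative assumptions.

```python
# Rows of Tables 17 and 18 as printed: label -> (texts, w-units, s-units).
TABLE_17_REGION = {
    "Unknown": (35, 448458, 27496),
    "South": (311, 4687877, 457726),
    "Midlands": (213, 2492236, 240306),
    "North": (349, 2781280, 312552),
}
TABLE_18_INTERACTION = {
    "Monologue": (207, 1562017, 92619),
    "Dialogue": (701, 8847834, 945461),
}


def column_totals(table):
    """Sum each column (texts, w-units, s-units) over all rows of a table."""
    return tuple(sum(column) for column in zip(*table.values()))


region_totals = column_totals(TABLE_17_REGION)
interaction_totals = column_totals(TABLE_18_INTERACTION)

# Both classifications cover the same texts, so the totals should be equal.
print(region_totals, interaction_totals, region_totals == interaction_totals)
```

A check of this kind is a cheap way to detect transcription errors when the tables are copied into other tools.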

Notes

The terms "POS-tagging" and "wordclass tagging" are used interchangeably in this manual.

The only exceptions to this statement are: (i) the file F9M, which contains the Rap poetry "City Psalms" by Benjamin Zephaniah. It was thoroughly hand-corrected because the tagger, not familiar with Jamaican Creole, had produced an inordinate number of tagging errors. (ii) files identified as containing many foreign and classical expressions, as mentioned above.

In BNC version 1, the quantifier "a little", meaning 'a small amount', was sometimes (but not reliably) tagged as a multiword DT0.

In our experience, human analysts too sometimes have difficulty resolving ambiguities such as these, especially when using the plain orthographic transcriptions of the BNC, and with no direct access to the original sound recordings.

That is, the error rate based on CLAWS's first choice tag only.

We borrow the term "patching" from Brill (1992), although for his tagging program the patches are discovered by an automatic procedure.

The repetition value of up to 16 words was arrived at by trial and error; an occurrence of a finite verb beyond that range was rarely in the same clause as the #AFTER-type word.

Training and testing were mostly carried out on the BNC Sampler corpus of 2 million words. For less frequent phenomena we needed to use sections from the full BNC. None of the texts used for the tagging error report is included in the Sampler.