Design of the corpus
This section discusses some of the basic design issues underlying
the creation of the BNC. It summarizes the kinds of uses for which the
corpus is intended, and the principles upon which it was created. Some
summary information about the composition of the corpus is also
included.
Purpose
The uses originally envisaged for the British National Corpus
were set out in a working document called Planned Uses of the
British National Corpus (BNCW02, 11 April 1991). This document
identified the following as likely application areas for the corpus:
reference book publishing
academic linguistic research
language teaching
artificial intelligence
natural language processing
speech processing
information retrieval
The same document identified the following categories of linguistic
information derivable from the corpus:
lexical
semantic/pragmatic
syntactic
morphological
graphological/written form/orthographical
In the 15 or more years since that document was published, it has
become apparent that the corpus, and corpus methods in general, have
had a far wider impact than anticipated, notably in the field of
language teaching.
General definitions
The British National Corpus is:
a sample corpus: composed of text samples
generally no longer than 45,000 words.
a synchronic corpus: the corpus includes
imaginative texts from 1960, informative texts from 1975.
a general corpus: not specifically
restricted to any particular subject field, register or genre.
a monolingual British English corpus: it
comprises text samples which are substantially the product of speakers
of British English.
a mixed corpus: it contains examples of
both spoken and written language.
Composition
There is a broad consensus among the participants in the project and
among corpus linguists that a general-purpose corpus of the English
language would ideally contain a high proportion of spoken language in
relation to written texts. However, it is significantly more expensive
to record and transcribe natural speech than to acquire written text in
computer-readable form. Consequently the spoken component of the BNC
constitutes approximately 10 per cent (10 million words) of the total
and the written component 90 per cent (90 million words). These were
agreed to be realistic targets, given the constraints of time and
budget, yet large enough to yield valuable empirical statistical data
about spoken English. In the BNC sampler,
a two per cent sample taken from the whole of the BNC, spoken and
written language are present in approximately equal proportions, but
other criteria are not equally balanced.
From the start, a decision was taken to select material for
inclusion in the corpus according to an overt methodology, with specific
target quantities of clearly defined types of language. This approach
makes it possible for other researchers and corpus compilers to review,
emulate or adapt concrete design goals. This section outlines these
design considerations, and reports on the final make-up of the BNC.
This and the other tables in this section show the actual make-up
of the second version of the British National Corpus (the BNC World
Edition) in terms of
texts: number of distinct samples not exceeding 45,000 words
S-units: number of s elements identified by the CLAWS system
(more or less equivalent to sentences)
W-units: number of w elements identified by the CLAWS system
(more or less equivalent to words)
For further explanation of s and w elements, see
section .
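As a minimal sketch of how such counts could be obtained, the Python fragment below tallies s and w elements in a tiny hand-made XML fragment. The fragment itself, the use of a c element for punctuation, and the function name count_units are assumptions made for illustration; they are not taken from a real corpus file.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment for illustration only; not a real BNC text.
SAMPLE = """
<text>
  <s><w>It </w><w>was </w><w>raining</w><c>.</c></s>
  <s><w>Nobody </w><w>minded</w><c>.</c></s>
</text>
"""

def count_units(xml_string):
    """Return (s_units, w_units, punctuation) for an XML string."""
    root = ET.fromstring(xml_string)
    s_units = sum(1 for _ in root.iter("s"))      # sentence-like units
    w_units = sum(1 for _ in root.iter("w"))      # word-like units
    punctuation = sum(1 for _ in root.iter("c"))  # assumed punctuation element
    return s_units, w_units, punctuation

print(count_units(SAMPLE))
```

Counting elements rather than splitting on white space keeps the figures tied to the tagging, which is why the w-unit count can differ from a count of orthographic words.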
The XML Edition of the BNC contains 4049 texts and occupies
(including all markup) 5,228,040 Kb, or about 5.2 Gb. In total, it
comprises just under 100 million orthographic words (specifically,
96986707), but the number of w-units (POS-tagged items) is slightly
higher at 98363783. The tagging distinguishes a further
13614425 punctuation strings, giving a total content count of
110691482 strings. The total number of s-units tagged is about
6 million (6026284). Counts for these and all the other elements
tagged in the corpus are provided in the corpus header.
In the following tables both an absolute count and a percentage are
given for all the counts. The percentage is calculated with reference
to the relevant portion of the corpus, for example, in the table for
"written text domain", with reference to the total number of w-units
in written texts. Note that punctuation strings are not included
in these totals. The reference totals used are given in the first table
below.
Text type                        texts   w-units      %  s-units      %
Spoken demographic                 153   4233955   4.30   610557  10.13
Spoken context-governed            755   6175896   6.27   427523   7.09
Written books and periodicals     2685  79238146  80.55  4395581  72.94
Written-to-be-spoken                35   1278618   1.29   104665   1.73
Written miscellaneous              421   7437168   7.56   487958   8.09
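The percentage columns can be reproduced from the counts. The sketch below is a hedged illustration: the function name truncated_percentage is hypothetical, and the choice to truncate rather than round is an inference from the tabulated figures (for example, 1278618 of 98363783 w-units is about 1.2999 per cent but is shown as 1.29), not something stated in this guide.

```python
def truncated_percentage(count, total):
    """Return count/total as a percentage string, truncated to two decimals."""
    hundredths = count * 10000 // total  # integer arithmetic avoids float error
    return f"{hundredths // 100}.{hundredths % 100:02d}"

TOTAL_W_UNITS = 98363783  # w-units in the whole corpus (from the text above)
TOTAL_S_UNITS = 6026284   # s-units in the whole corpus (from the text above)

# Spoken demographic row of the first table
print(truncated_percentage(4233955, TOTAL_W_UNITS))
print(truncated_percentage(610557, TOTAL_S_UNITS))
```

For the later tables the same calculation applies, with the total for the relevant portion of the corpus (written texts, or spoken demographic texts) substituted for the whole-corpus totals.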
All texts are also classified according to their date of
production. For spoken texts, this is the date of the recording. For
written texts, it is for the most part the date of publication of the
source edition used; in the case of imaginative works, however, it is
the date of first publication of the work. Informative texts were
selected only from 1975 onwards, imaginative ones from 1960,
reflecting their longer shelf-life, though most (75 per cent) of the
latter were published no earlier than 1975.
Publication date    texts   w-units      %  s-units      %
Unknown               162   1831585   1.86   126416   2.09
1960-1974              46   1718449   1.74   119510   1.98
1975-1984             169   4730889   4.80   257962   4.28
1985-1993            3672  90082860  91.58  5522396  91.63
Spoken and written components of the corpus are discussed separately in
the next two sections.
Design of the written component
Sampling basis: production and reception
While it is sometimes useful to distinguish in theory between
language which is received (read and heard) and that which
is produced (written and spoken), it was agreed that the
selection of samples for a general-purpose corpus must take account of
both perspectives.
Text that is published in the form of books, magazines, etc., is
not representative of the totality of written language that is produced,
as writing for publication is a comparatively specialized activity in
which few people engage. However, it is much more representative of
written language that is received, and is also easier to obtain in
useful quantities, and thus forms the greater part of the written
component of the corpus.
There was no single source of information about published material
that could provide a satisfactory basis for a sampling frame, but a
combination of various sources furnished useful information about the
totality of written text produced and, particularly, received, some
sources being more significant than others. They are principally
statistics about books and periodicals that are published, bought or
borrowed.
Catalogues of books published per annum tell us something
about production but little about reception as many books are published
but hardly read.
A list of books in print provides somewhat more
information about reception as time will weed out the books that nobody
bought (or read): such a list will contain a higher proportion of books
that have continued to find a readership.
The books that have the widest reception are presumably those that
figure in bestseller lists, particularly prize
winners of competitions such as the Booker or Whitbread. Such
works were certainly candidates for inclusion in the corpus, but the
statistics of book-buying are such that very few texts achieve high
sales while a vast number sell only a few copies or in modest numbers. If
texts had been selected in strict arithmetical proportion to their
sales, their range would have been severely limited. However, where a
text from one particular subject domain was required, it was
appropriate to prefer a book which had achieved high sales to one
which had not.
Library lending statistics, where these are available,
also indicate which books enjoy a wide reception and, like lists of
books in print, show which books continue to be read.
Similar observations hold for magazines and periodicals. Lists of
current magazines and periodicals are similar to catalogues of
published books, but perhaps more informative about language reception,
as it may be that periodicals are bought and read by a wider
cross-section of the community than books. Also, a periodical that fails
to find a readership will not continue to be published for long.
Periodical circulation figures have to be treated with the
same caution as bestseller lists, as a few titles dominate the market
with a very high circulation. To concentrate too exclusively on these
would reduce the range of text types in the corpus and make contrastive
analysis difficult.
Published written texts were selected partly at
random from Whitaker's Books in Print for 1992 and
partly systematically, according to the selection features outlined
below.
Available sources are concerned almost exclusively with published
books and periodicals. It is much more difficult to obtain data
concerning the production or reception of unpublished writing. Intuitive
estimates were therefore made in order to establish some guidelines for
text sampling in the latter area.
Selection features
Texts were chosen for inclusion according to three
selection features: domain (subject field), time (within
certain dates) and medium (book, periodical, etc.).
The purpose of these selection features was to ensure that the
corpus contained a broad range of different language styles, for two
reasons. The first was so that the corpus could be regarded as a
microcosm of current British English in its entirety, not just of
particular types. The second was so that different types of text could
be compared and contrasted with each other.
Selection procedure
Each selection feature was divided into classes (e.g. Medium
into books, periodicals, unpublished etc.; Domain into
imaginative, informative, etc.) and target percentages were set for
each class. These percentages are quite independent of each other: there
was no attempt, for example, to make 25 per cent of the selected
periodicals imaginative.
The design proposed that 75 per cent of the samples be drawn from
informative texts, and the remaining 25 per cent from imaginative texts.
It further proposed that titles be taken from a variety of media, in the following
proportions: 60 per cent from books, 30 per cent from periodicals, 10
per cent from miscellaneous sources (published, unpublished, and written
to be spoken).
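The arithmetic behind these targets can be sketched as follows. Applying the percentages to the 90-million-word written target, rather than to a count of samples or titles, is an assumption made here purely for illustration, and the names used are hypothetical. Each selection feature is handled separately, since the percentages are independent of each other.

```python
WRITTEN_TARGET_WORDS = 90_000_000  # written component target (stated under Composition)

# Target percentages taken from the text
DOMAIN_TARGETS = {"informative": 75, "imaginative": 25}
MEDIUM_TARGETS = {"books": 60, "periodicals": 30, "miscellaneous": 10}

def word_targets(total_words, percentages):
    """Convert percentage targets for one selection feature into word counts."""
    assert sum(percentages.values()) == 100
    return {name: total_words * pct // 100 for name, pct in percentages.items()}

print(word_targets(WRITTEN_TARGET_WORDS, DOMAIN_TARGETS))
print(word_targets(WRITTEN_TARGET_WORDS, MEDIUM_TARGETS))
```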
Half of the books in the Books and Periodicals class were
selected at random from Whitaker's Books in Print 1992.
This was to provide a control group to validate the categories used in
the other method of selection: the random selection disregarded Domain
and Time, but texts selected by this method were classified according to
these other features after selection.
Sample size and method
For books, a target sample size of 40,000 words was chosen. No
extract included in the corpus exceeds 45,000 words. For the most
part, texts which in their entirety were shorter than 40,000 words
were further reduced by ten per cent for copyright reasons; a few
texts longer than the target size were however included in their
entirety. Text samples normally consist of a continuous stretch of
discourse from within the whole. A convenient breakpoint (e.g. the end
of a section or chapter) was chosen as far as possible to begin and
end the sample so that high-level discourse units were not
fragmented. Where possible, no more than one sample was taken from any
one text; for newspaper texts and large encyclopaedic works, no sample
greater than 40,000 words was taken. Samples were taken randomly from
the beginning, middle or end of longer texts. (In cases where a
publication included essays or articles by a variety of authors of
different nationalities, the work of non-UK authors was omitted.)
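The sampling rules for books can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the procedure actually used: it ignores the adjustment of start and end points to section or chapter boundaries, it does not model the few longer texts that were included whole, and it assumes that the ten per cent removed from short texts is taken from the end, which the text does not specify. The function name choose_sample_span is hypothetical.

```python
import random

TARGET_WORDS = 40_000  # target sample size for books (from the text)

def choose_sample_span(text_length, rng):
    """Return (position, start, end) word offsets of a hypothetical sample."""
    if text_length < TARGET_WORDS:
        # Short text: keep 90 per cent. Dropping the final tenth is an
        # assumption; the source does not say which part was removed.
        return "whole (reduced)", 0, text_length * 9 // 10
    position = rng.choice(["beginning", "middle", "end"])
    if position == "beginning":
        start = 0
    elif position == "middle":
        start = (text_length - TARGET_WORDS) // 2
    else:
        start = text_length - TARGET_WORDS
    return position, start, start + TARGET_WORDS

rng = random.Random(0)  # seeded so that a run can be repeated
print(choose_sample_span(120_000, rng))
print(choose_sample_span(30_000, rng))
```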
Some types of written material are composite in structure: that is,
the physical object in written form is composed of more than one text
unit. Important examples are issues of a newspaper or magazine which,
though editorially shaped as a document, contain discrete texts, each
with its specific authorship, stylistic characteristics, register and
domain. The BNC attempts to separate these discrete texts where
appropriate and to classify them individually according to the selection
and classification features. As far as possible, the individual stories
in one issue of a newspaper were grouped according to domain, for
example as Business articles, Leisure articles, etc.
The following subsections discuss each selection criterion, and
indicate the actual numbers of words in each category included.
Domain
Classification according to subject field seems hardly appropriate
to texts which are fictional or which are generally perceived to be
literary or creative. Consequently, these texts are all labelled
imaginative and are not assigned to particular subject
areas. All other texts are treated as informative and are
assigned to one of the eight domains listed below.
Written Domain                         texts   w-units      %  s-units      %
Imaginative                              476  16496420  18.75  1352150  27.10
Informative: natural & pure science      146   3821902   4.34   183384   3.67
Informative: applied science             370   7174152   8.15   356662   7.15
Informative: social science              526  14025537  15.94   698218  13.99
Informative: world affairs               483  17244534  19.60   798503  16.00
Informative: commerce & finance          295   7341163   8.34   382374   7.66
Informative: arts                        261   6574857   7.47   321140   6.43
Informative: belief & thought            146   3037533   3.45   151283   3.03
Informative: leisure                     438  12237834  13.91   744490  14.92
The evidence from catalogues of books and periodicals suggests that
imaginative texts account for significantly less than 25 per cent of
published output, and unpublished reports, correspondence, reference
works and so on would seem to add further to the bulk of informative
text which is produced and consumed. However, the overall distribution
between informative and imaginative text samples is set to reflect the
influential cultural role of literature and creative writing. The target
percentages for the eight informative domains were arrived at by
consensus within the project, based loosely upon the pattern of book
publishing in the UK during the past 20 years or so, as reflected in the
categorized figures for new publications that appear annually in
Whitaker's Book list.
Medium
This categorisation is broad, since a detailed taxonomy or feature
classification of text medium could have led to such a proliferation of
subcategories as to make it impossible for the BNC adequately to
represent all of them. The labels used here are intended to be
comprehensive in the sense that any text can be assigned with reasonable
confidence to these macro categories. The labels we have adopted
represent the highest levels of a fuller taxonomy of text medium.
Written Medium               texts   w-units      %  s-units      %
Book                          1411  50293803  57.18  2887523  57.88
Periodical                    1208  28609494  32.52  1487644  29.82
Miscellaneous published        238   4233135   4.81   287700   5.76
Miscellaneous unpublished      249   3538882   4.02   220672   4.42
To-be-spoken                    35   1278618   1.45   104665   2.09
The Miscellaneous published category includes brochures, leaflets,
manuals and advertisements. The Miscellaneous unpublished category
includes letters, memos, reports, minutes and essays. The
written-to-be-spoken category includes scripted television
material, play scripts, etc.
Descriptive features
Written texts may be further classified according to sets of
descriptive features. These features describe
the sample texts; they did not determine their selection. This
information is recorded to allow more delicate contrastive analysis of
particular sets of texts. As a simple example, the gross division into
two time periods in the selection features can, of course, be refined
and subcorpora defined over the BNC for more specific dates. However,
the relative sizes of such subcorpora are undefined by the BNC design
specification.
These descriptive features were monitored during the course of the
data gathering, and text selection, in cases where a free choice of
texts was available, took account of the relative balance of these
features. Thus although no relative proportions were defined for
different target age groups (for example), we ensured that the corpus
does contain texts intended for children as well as for adults.
The following tables summarize the results for the first release of
the corpus. Note that many texts remain unclassified.
Author information
Information about authors of written texts was included only where
it was readily available, for example from the dust-wrapper of a book.
Consequently, the coverage of such information is very patchy. The
authorship of a written text was characterized as
corporate where it was produced by an
organization and no specific author was given, and as
multiple in cases where several authors were
named. Author sex was classified as mixed where more than one
author of either sex was specified, and unknown where it could
not reliably be determined from the author's name. Note that
author age means the author's age at the time of
creation of the work concerned.
Author type         texts   w-units      %  s-units      %
Unknown               211   3786835   4.30   174371   3.49
Corporate author      347   6497144   7.38   455649   9.13
Multiple author      1322  34563219  39.29  1810901  36.30
Sole author          1261  43106734  49.01  2547283  51.06
Sex of author        texts   w-units      %  s-units      %
Unknown               1573  36161115  41.11  1968162  39.45
Author sex Male        920  30665582  34.86  1671420  33.50
Author sex Female      414  14588260  16.58   967522  19.39
Author sex Mixed       234   6538975   7.43   381100   7.64
Author age-group    texts   w-units      %  s-units      %
Unknown              2518  66000719  75.04  3687586  73.92
Author age 0-14         3     59559   0.06     3443   0.06
Author age 15-24       19    542578   0.61    29810   0.59
Author age 25-34       66   2267139   2.57   159455   3.19
Author age 35-44      191   6726926   7.64   410143   8.22
Author age 45-59      205   7230714   8.22   410644   8.23
Author age 60+        139   5126297   5.82   287123   5.75
Domicile                              texts   w-units      %  s-units      %
Unknown                                2272  57227155  65.06  3133068  62.80
Author domicile UK and Ireland          841  29760000  33.83  1798301  36.05
Author domicile Commonwealth             12    411207   0.46    25759   0.51
Author domicile Continental Europe        6    234402   0.26    12466   0.24
Author domicile USA                       8    245604   0.27    15675   0.31
Author domicile Elsewhere                 2     75564   0.08     2935   0.05
Target audience
Some attempt was made to characterize the kind of audience for which
written texts were produced in terms of age, sex and
level (a subjective assessment of the text's
technicality or difficulty). The last of these proved very difficult
to assess and was very frequently confused with circulation size or
audience size; for that reason, no figures for it are included here.
Audience age         texts   w-units      %  s-units      %
Child audience          42    903690   1.02    81074   1.62
Teenager audience       78   1831178   2.08   138098   2.76
Adult audience        2911  81928776  93.14  4597388  92.16
Any audience           110   3290288   3.74   171644   3.44
Audience sex       texts   w-units      %  s-units      %
Unknown              706  20271270  23.04  1131254  22.67
Male audience         61   2396935   2.72   135950   2.72
Female audience      175   6904137   7.84   503629  10.09
Mixed audience      2199  58381590  66.37  3217371  64.49
Miscellaneous classification information
Written texts were also characterized according to their place of
publication and the type of sampling used.
Publication place                                                texts   w-units      %  s-units      %
Unknown                                                            690  14718827  16.73   788440  15.80
UK (unspecific) publication                                        263   7163111   8.14   380824   7.63
Ireland publication                                                 37    570652   0.64    31793   0.63
UK: North (north of Mersey-Humber line) publication                191   3781055   4.29   228247   4.57
UK: Midlands (north of Bristol Channel-Wash line) publication       93   2590345   2.94   177308   3.55
UK: South (south of Bristol Channel-Wash line) publication        1853  58587808  66.61  3360401  67.36
United States publication                                           14    542134   0.61    21191   0.42
Sampling type       texts   w-units      %  s-units      %
Unknown              1583  35551102  40.42  1991798  39.93
Whole text            270   6524975   7.41   433722   8.69
Beginning sample      584  21075222  23.96  1119251  22.43
Middle sample         510  18454807  20.98  1049692  21.04
End sample            119   4317326   4.90   253322   5.07
Composite sample       75   2030500   2.30   140419   2.81
In addition to the above, standard bibliographic details such as
author, title, publication details, extent, topic keywords etc. were
recorded for the majority of texts, as further described below.
Selection procedures employed
Books
Roughly half the titles were randomly selected from available
candidates identified in Whitaker's Books in Print
(BIP), 1992, by students of Library and Information Studies at Leeds
City University. Each text randomly chosen was accepted only if it
fulfilled certain criteria: it had to be published by a British
publisher, contain sufficient pages of text to make its incorporation
worthwhile, consist mainly of written text, fall within the
designated time limits, and cost less than a set price. The students
noted the ISBN, author, title and price of each book thus selected; the
final selection weeded out texts by non-UK authors.
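The acceptance test applied to each randomly chosen title can be pictured as a simple filter. The sketch below is illustrative only: the Candidate record, its field names, and every threshold value (minimum pages, maximum price, year range) are hypothetical, since the text does not give the actual figures.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """Hypothetical record for a title drawn at random from a catalogue."""
    title: str
    british_publisher: bool
    pages: int
    mainly_text: bool
    year: int
    price: float
    uk_author: bool

# Illustrative thresholds only: the source does not state the real values.
MIN_PAGES = 50
MAX_PRICE = 30.0
FIRST_YEAR, LAST_YEAR = 1960, 1993

def accept(c):
    """Return True if the candidate passes every criterion listed in the text."""
    return (
        c.british_publisher
        and c.pages >= MIN_PAGES            # enough text to be worth including
        and c.mainly_text                   # consists mainly of written text
        and FIRST_YEAR <= c.year <= LAST_YEAR
        and c.price < MAX_PRICE
        and c.uk_author                     # final weeding of non-UK authors
    )

print(accept(Candidate("Example title", True, 220, True, 1988, 9.99, True)))
```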
Half of the books having been selected by this method, the
remaining half were selected systematically to make up the target
percentages in each category. The selection proceeded as follows.
Bestsellers
Because of their wide reception, bestsellers were obvious
candidates for selection. The lists used were those that appeared in the
Bookseller at the end of the years 1987 to 1993
inclusive. Some of the books in the lists were rejected, for a variety
of reasons. Obviously books that had already been selected by the random
method were excluded, as were those by non-UK authors. In addition, a
limit of 120,000 words from any one author was imposed, and books
belonging to a domain or category whose quota had already been reached
were not selected. Other bestseller lists were obtained from The
Guardian, the British Council, and from Blackwells Paperback
Shop.
The titles yielded by this search were mostly in the Imaginative
category.
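The rejection rules described for bestsellers can be sketched as a single pass over a list of candidates. This is a hedged illustration with invented data: the function name select_titles, the tuple layout, and the quota figures are assumptions, and the exclusion of non-UK authors is not modelled.

```python
AUTHOR_LIMIT = 120_000  # maximum words from any one author (from the text)

def select_titles(candidates, domain_quota, already_selected=()):
    """Apply the rejection rules to (title, author, domain, words) tuples."""
    excluded = set(already_selected)
    words_by_author = {}
    words_by_domain = {}
    chosen = []
    for title, author, domain, words in candidates:
        if title in excluded:
            continue  # already picked by the random method
        if words_by_author.get(author, 0) + words > AUTHOR_LIMIT:
            continue  # would exceed the per-author limit
        if words_by_domain.get(domain, 0) >= domain_quota.get(domain, 0):
            continue  # quota for this domain already reached
        chosen.append(title)
        words_by_author[author] = words_by_author.get(author, 0) + words
        words_by_domain[domain] = words_by_domain.get(domain, 0) + words
    return chosen

# Invented data for illustration only.
candidates = [
    ("Title 1", "Author A", "imaginative", 40_000),
    ("Title 2", "Author A", "imaginative", 40_000),
    ("Title 3", "Author A", "imaginative", 40_000),
    ("Title 4", "Author A", "imaginative", 40_000),
    ("Title 5", "Author B", "leisure", 40_000),
]
quota = {"imaginative": 200_000, "leisure": 0}
print(select_titles(candidates, quota))
```

In this invented run the fourth title by the same author is dropped because it would take that author past 120,000 words, and the leisure title is dropped because its domain quota is already full.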
Literary prizes
The criteria for inclusion were the same as for bestsellers. The
prize winners, together with runners-up and shortlisted titles, were
taken from several sources, principally Anne Strachan,
Prizewinning literature: UK literary award winners,
London, 1989. For 1990 onwards the sources used were: the last issue of
the Bookseller for each year; The Guardian
Index, 1989–, entries under the term Literature;
and The Times Index, 1989–, entries under the term
Literature — Awards.
Literary prizes are in the main awarded to works that fall into the
Imaginative category, but there are some Informative ones also.
Library loans
The source of statistics in this category was the record of loans
under Public Lending Right, kindly provided by Dr J. Parker, the
Registrar. The information comprised lists of the hundred most issued
books and the hundred most issued children's books, in both cases for
the years 1987 to 1993.
The lists consist almost exclusively of imaginative literature, and
many titles found there also appear in the lists of bestsellers and
prize winners.
Additional texts
As collection proceeded, monitoring disclosed potential shortfalls
in certain domains. A further selection was therefore made, based on the
Short Loan collections of seven university libraries. (Short
Loan collections typically contain books required for academic courses,
which are consequently in heavy demand.)
Periodicals and magazines
Periodicals, magazines and newspapers account for 30 per cent of
the total text in the corpus. Of these, about 250 titles were issues of
newspapers. These were selected to cover as wide a spectrum of interests
and language as possible. Newspapers were selected to represent as wide
a geographic spread as possible:
The Scotsman and the Belfast Telegraph
are both represented, for example.
Other media
In addition to samples from books, periodicals, and magazines, the
written part of the corpus contains about seven million words classified as
Miscellaneous Published, Miscellaneous Unpublished, or
as
Written to be spoken. The distinction between published
and unpublished is not an easy one; the former category largely
contains publicity leaflets, brochures, fact sheets, and similar items,
while the latter has a substantial proportion of school and university
essays, unpublished creative writing or letters, and internal company
memoranda. The written-to-be-spoken material includes scripted
material intended to be read aloud, such as television news broadcasts;
transcripts of more informal broadcast materials such as discussions or
phone-ins are included in the spoken part of the corpus.
Copyright permissions
Before a selected text could be included, permissions had to be
obtained from the copyright owner (publisher, agent, or author). A
standard Permissions Request was drafted with considerable care, but
some requests were refused, or simply not answered even after prompting,
so that the texts concerned had to be excluded or replaced.
Design of the spoken component
Lexicographers and linguists have long hoped for corpus evidence
about spoken language, but the practical difficulties of transcribing
sufficiently large quantities of text have prevented the construction of
a spoken corpus of over one million words. The British National Corpus
project undertook to produce
five to ten million words of orthographically transcribed speech,
covering a wide range of speech variation. A
large proportion of the spoken part of the corpus — over four million words
— comprises spontaneous conversational English. The importance of
conversational dialogue to linguistic study is unquestionable: it is the
dominant component of general language both in terms of language
reception and language production.
As with the written part of the corpus, the most important
considerations in constructing the spoken part were
sampling and representativeness. The method of transcription was also an
important issue.
The issues of corpus sampling and representativeness have been
discussed at great length by many corpus linguists.
With spoken language there
are no obvious objective measures that can be used to define the target
population or construct a sampling frame. A comprehensive list of text
types can be drawn up but there is no accurate way of estimating the
relative proportions of each text type other than by a priori
linguistically motivated analysis. An alternative approach, one well
known to sociological researchers, is demographic sampling,
and this was broadly the approach adopted for approximately half of the
spoken part of the corpus. The sampling frame was defined in terms of
the language production of the population of British English speakers in
the United Kingdom. Representativeness was achieved by sampling a spread
of language producers in terms of age, gender, social group, and region,
and recording their language output over a set period of time.
We recognised, however, that many types of spoken text are produced
only rarely in comparison with the total output of all speech
producers: for example, broadcast interviews, lectures, legal
proceedings, and other texts produced in situations where —
broadly speaking — there are few producers and many receivers. A
corpus constituted solely on the demographic model would thus omit
important spoken text types. Consequently, the demographic component of
the corpus was complemented with a separate text typology intended to
cover the full range of linguistic variation found in spoken language;
this is termed the context-governed part of the corpus.
The demographically sampled part of the corpus
The approach adopted uses demographic parameters to sample the
population of British English speakers in the United Kingdom.
Established random location sampling procedures were used to select
individual members of the population by personal interview from across
the country taking into account age, gender, and social group. Selected
individuals used a portable tape recorder to record their own speech and
the speech of people they conversed with over a period of up to a week.
In this way a unique record of the language people use in everyday
conversation was constructed.
Sampling procedure
124 adults (aged 15+) were recruited from across the United
Kingdom. Recruits were of both sexes and from all age groups and social
classes. The intention was, as far as possible, to recruit equal numbers
of men and women, equal numbers from each of the six age groups, and
equal numbers from each of four social classes.
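As a rough illustration of what equal numbers would mean for 124 recruits, the sketch below divides the total as evenly as possible within each variable taken separately. The function name marginal_targets is hypothetical, and the figures it produces are idealised targets, not the numbers actually recruited.

```python
RECRUITS = 124  # number of adults recruited (from the text)

# Number of groups per variable, as stated in the text
STRATA = {"sex": 2, "age group": 6, "social class": 4}

def marginal_targets(recruits, strata):
    """Split recruits as evenly as possible within each variable separately."""
    targets = {}
    for name, n_groups in strata.items():
        base, extra = divmod(recruits, n_groups)
        # 'extra' groups receive one additional recruit
        targets[name] = [base + 1] * extra + [base] * (n_groups - extra)
    return targets

for name, sizes in marginal_targets(RECRUITS, STRATA).items():
    print(name, sizes)
```

Since 124 is not divisible by six, the age groups cannot be exactly equal, which is consistent with the text's qualification "as far as possible".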
Additional recordings were gathered for the BNC as part of the
University of Bergen COLT Teenager Language Project. This project used
the same recording methods and transcription scheme as the BNC, but
selected only respondents aged 16 or below.
The tables below give figures for the amount of transcribed
material collected by each respondent, classified by their age,
class, and sex.
Age-group               texts   w-units      %  s-units      %
Respondent Age 0-14        26    267005   6.30    41036   6.72
Respondent Age 15-24       36    665358  15.71    97993  16.04
Respondent Age 25-34       29    853832  20.16   121752  19.94
Respondent Age 35-44       22    845153  19.96   126690  20.74
Respondent Age 45-59       20    963483  22.75   136530  22.36
Respondent Age 60+         20    639124  15.09    86556  14.17
Social class     texts   w-units      %  s-units      %
Unknown              7     37622   0.88     5340   0.87
AB respondent       59   1372933  32.42   197795  32.39
C1 respondent       36   1104279  26.08   169387  27.74
C2 respondent       31   1087808  25.69   144876  23.72
DE respondent       20    631313  14.91    93159  15.25
Sex                  texts   w-units      %  s-units      %
Unknown                  5     16245   0.38     2407   0.39
Male respondent         73   1742222  41.14   248241  40.65
Female respondent       75   2475488  58.46   359909  58.94
Recruits who agreed to take part in the project were asked to
record all of their conversations over a two to seven day period. The
number of days varied depending on how many conversations each recruit
was involved in and was prepared to record. Results indicated that most
people recorded nearly all of their conversations, and that the limiting
factor was usually the number of conversations a person had per day.
The placement day was varied, and recruits were asked to record on the
day after placement and on any other day or days of the week. In this
way a broad spread of days of the week including weekdays and weekends
was achieved. A conversation log allowed recruits to enter details of
every conversation recorded, and included date, time and setting, and
brief details of other participants.
Recording procedure
All conversations were recorded as unobtrusively as possible, so
that the material gathered approximated closely to natural, spontaneous
speech. In many cases the only person aware that the conversation was
being taped was the person carrying the recorder. Although an initial
unnaturalness on the part of the recruit was not uncommon, this soon
seemed to disappear. Similarly, where non-intrusive recording was not
possible, for example at a family gathering where everyone is aware they
are being recorded, the same initial period of unease sometimes
occurred, but in our experience again vanished quickly. The guarantee of
confidentiality and complete anonymity (all references to full names and
addresses have been removed from the corpus and the log), and the fact
that there was an intermediary between those being recorded and those
listening to the recordings certainly helped.
For each conversational exchange the person carrying the recorder
told all participants they had been recorded and explained why. Whenever
possible this happened after the conversation had taken place. If any
participant was unhappy about being recorded the recording was erased.
During the project around 700 hours of recordings were gathered.
Sample size
The number of people recruited may seem small in comparison to some
demographic studies of the population of the United Kingdom. As with
any sampling method, some compromise between what was theoretically
desirable and what was feasible within the constraints of the BNC
project had to be made. There is no doubt that recruiting 1000 people
would have given greater statistical validity, but the practical
difficulties and cost implications of recruiting 1000 people and
transcribing 50–100 million words of speech made this impossible.
Given that we were not attempting to represent the complete range of age
and social groups within each region, we considered that a sample size
between 100 and 130 would be adequate. The
total number of participants in all conversations was well in excess of
a thousand.
Piloting the demographic sampling approach
Because this approach to spoken corpus sampling had, to our
knowledge, never previously been attempted, a detailed piloting project
was carried out to investigate:
the likelihood that enough material would be obtained from a
sample of around 100 respondents
any problems that might be encountered during the recruitment and
collection stages
any problems or difficulties experienced by recruits during
taping or with logging details of conversations and participants
any areas where the documentation designed for the project could
be improved
whether the recording quality under a wide range of conditions
would be good enough for accurate transcription
whether the predicted throughput rates for tape editing,
transcription and checking were accurate.
The results of the pilot generally confirmed predictions and allowed
some procedures to be refined for the full project.
The context-governed part of the corpus
As mentioned above, the spoken texts in the demographic part of
the corpus consist mainly of conversational English. A complementary
approach was developed to create what is termed the
context-governed part of the corpus. As in other spoken
corpora, the range of text types was selected according to a
priori linguistically motivated categories. At the top layer
of the typology is a division into four equal-sized contextually based
categories: educational, business, public/institutional, and leisure.
Each is divided into the subcategories monologue (40 per cent) and
dialogue (60 per cent). Each monologue subcategory therefore totals 10
per cent of the context-governed part of the corpus, and each dialogue
subcategory 15 per cent.
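The arithmetic behind these proportions can be sketched as follows. This is a minimal illustration only: the names CATEGORIES, SPLIT and design_shares are hypothetical and are not part of the BNC documentation or tooling.

```python
from fractions import Fraction

# The four equal-sized context categories named in the text.
CATEGORIES = ["educational", "business", "public/institutional", "leisure"]

# Monologue/dialogue split applied within every category.
SPLIT = {"monologue": Fraction(40, 100), "dialogue": Fraction(60, 100)}


def design_shares():
    """Return each (category, subcategory) share of the context-governed part."""
    category_share = Fraction(1, len(CATEGORIES))  # four equal parts: 1/4 each
    return {
        (category, sub): category_share * proportion
        for category in CATEGORIES
        for sub, proportion in SPLIT.items()
    }


shares = design_shares()
for (category, sub), share in shares.items():
    # Show each design target as a whole-number percentage.
    print(f"{category:<22}{sub:<10}{float(share):.0%}")
```

Exact fractions are used so that the eight subcategory shares sum to exactly one, with no floating-point drift. These are design targets; the achieved proportions are those reported in the composition tables.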
Within each subcategory a range of text types was defined. This
range was not fixed, and the design was
flexible enough to allow the inclusion of additional text types. The
sampling methodology was different for each text type but the overall
aim was to achieve a balanced selection within each, taking into account
such features as region, level, gender of speakers, and topic. Other
features, such as purpose, were applied on the basis of post
hoc judgements.
Sampling procedure
For the most part, a variety of text types were sampled within three
geographic regions. However,
some text types,
such as parliamentary proceedings, and most broadcast categories, apply
to the country as a whole and were not regionally sampled. Different
sampling strategies were required for each text type, and these are
outlined below.
Educational and informative:
Lectures, talks, educational demonstrations
Within each sampling area a university (or college of further
education) and a school were selected. A range of lectures and talks was
recorded, varying the topic, level, and speaker gender.
News commentaries
Regional sampling was not applied, but both national and
regional broadcasting companies were sampled. The topic, level, and
gender of commentator were varied.
Classroom interaction
Schools were regionally sampled and the level (generally based
on student age) and topic were varied. Home tutorials were also
included.
Business:
Company talks and interviews
Sampling took into account company size, areas of activity, and
gender of speakers.
Trade union talks
Talks to union members, branch meetings and annual conferences
were all sampled.
Sales demonstrations
A range of topics was included.
Business meetings
Companies were selected according to size, area of activity, and
purpose of meeting.
Consultations
These included medical, legal, business and professional
consultations.
All categories under this heading were regionally sampled.
Public or institutional:
Political speeches
Regional sampling of local politics, plus speeches in both the
House of Commons and the House of Lords (in the latter case,
transcriptions were made by the project, and are not taken from the official
Hansard report).
Sermons
Different denominations were sampled.
Public/government talks
Regional sampling of local inquiries and meetings, plus national
issues at different levels.
Council meetings
Regionally sampled, covering parish, town, district, and county
councils.
Religious meetings
Includes church meetings, group discussions, and so on.
Parliamentary proceedings
Sampling of main sessions and committees, House of Commons and
House of Lords.
Legal proceedings
The Royal Courts of Justice, and local magistrates' and similar courts,
were sampled.
Leisure:
Speeches
Regionally sampled, covering a variety of occasions and
speakers.
Sports commentaries
Exclusively broadcast, sampling a variety of sports,
commentators, and TV/radio channels.
Talks to clubs
Regionally sampled, covering a range of topics and speakers.
Broadcast chat shows and phone-ins
Only those that include a significant amount of unscripted
speech were selected, from both television and radio.
Club meetings
Regionally sampled, covering a wide range of clubs.
Sample size
Each monologue text type contains up to 200,000 words of text, and
each dialogue text type up to 300,000 words. The length of text units
within each text type varies: for example, news commentaries may
be only a few minutes long (several hundred words), lectures are
typically up to one hour (10,000 words), and some business meetings
and parliamentary proceedings may last for several hours (20,000
words or more). For the context-governed part of the corpus an upper limit
of 10,000 words per text unit was generally imposed, although a few
texts are slightly above this.
Composition of the spoken component
A total of 757 texts (6,153,671 words) make up the context-governed
part of the corpus. The following contexts are distinguished:
Spoken context           texts   w-units      %  s-units      %
Educational/Informative    169   1646380  26.65   118987  27.83
Business                   129   1282416  20.76   107366  25.11
Public/Institutional       262   1672658  27.08    96500  22.57
Leisure                    195   1574442  25.49   104670  24.48
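The percentage columns in the table above appear to give each row's share of the corresponding column total, truncated (not rounded) to two decimal places. This is an inference from the figures rather than something the text states, and the sketch below simply checks it. The names CONTEXTS and truncated_percent are hypothetical.

```python
# Figures copied from the table above: (texts, w-units, s-units) per context.
CONTEXTS = {
    "Educational/Informative": (169, 1646380, 118987),
    "Business": (129, 1282416, 107366),
    "Public/Institutional": (262, 1672658, 96500),
    "Leisure": (195, 1574442, 104670),
}


def truncated_percent(count, total):
    """Percentage of total, truncated (not rounded) to two decimal places."""
    # Integer floor division gives hundredths of a per cent without rounding up.
    return (count * 10000 // total) / 100


# Assumption: the denominators are the sums of the four rows.
w_total = sum(w for _, w, _ in CONTEXTS.values())
s_total = sum(s for _, _, s in CONTEXTS.values())

for name, (texts, w, s) in CONTEXTS.items():
    print(name, truncated_percent(w, w_total), truncated_percent(s, s_total))
```

The sketch uses the row sums as denominators (6,175,896 w-units and 427,523 s-units). The w-unit sum differs slightly from the word total quoted in the text, so the two figures may have been counted in different ways.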
In addition, the following classifications are applicable to both
demographic and context-governed spoken texts:
Region                   texts   w-units      %  s-units      %
Unknown                     35    448458   4.30    27496   2.64
South                      311   4687877  45.03   457726  44.09
Midlands                   213   2492236  23.94   240306  23.14
North                      349   2781280  26.71   312552  30.10
Interaction type         texts   w-units      %  s-units      %
Monologue                  207   1562017  15.00    92619   8.92
Dialogue                   701   8847834  84.99   945461  91.07
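Since both classifications cover the same set of spoken texts, their column totals should agree. The following sketch checks this using the figures from the two tables above. The names REGION, INTERACTION and column_totals are hypothetical.

```python
# Figures copied from the two tables above: (texts, w-units, s-units).
REGION = {
    "Unknown": (35, 448458, 27496),
    "South": (311, 4687877, 457726),
    "Midlands": (213, 2492236, 240306),
    "North": (349, 2781280, 312552),
}
INTERACTION = {
    "Monologue": (207, 1562017, 92619),
    "Dialogue": (701, 8847834, 945461),
}


def column_totals(table):
    """Sum the (texts, w-units, s-units) columns of a classification table."""
    return tuple(sum(column) for column in zip(*table.values()))


region_totals = column_totals(REGION)
interaction_totals = column_totals(INTERACTION)
print(region_totals)
print(interaction_totals)
print("consistent:", region_totals == interaction_totals)
```

Both tables give 908 texts, 10,409,851 w-units and 1,038,080 s-units for the spoken component as a whole.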