[bnc] Design of the spoken component - BNC User manual

Design of the spoken component

Lexicographers and linguists have long hoped for corpus evidence about spoken language, but the practical difficulties of transcribing sufficiently large quantities of text have prevented the construction of a spoken corpus of over one million words. The British National Corpus project undertook to produce five to ten million words of orthographically transcribed speech, covering a wide range of speech variation. A large proportion of the spoken part of the corpus — over four million words — comprises spontaneous conversational English. The importance of conversational dialogue to linguistic study is unquestionable: it is the dominant component of general language both in terms of language reception and language production.

As with the written part of the corpus, the most important considerations in constructing the spoken part were sampling and representativeness. The method of transcription was also an important issue.

The issues of corpus sampling and representativeness have been discussed at great length by many corpus linguists. With spoken language there are no obvious objective measures that can be used to define the target population or construct a sampling frame. A comprehensive list of text types can be drawn up but there is no accurate way of estimating the relative proportions of each text type other than by a priori linguistically motivated analysis. An alternative approach, one well known to sociological researchers, is demographic sampling, and this was broadly the approach adopted for approximately half of the spoken part of the corpus. The sampling frame was defined in terms of the language production of the population of British English speakers in the United Kingdom. Representativeness was achieved by sampling a spread of language producers in terms of age, gender, social group, and region, and recording their language output over a set period of time.

We recognised, however, that many types of spoken text are produced only rarely in comparison with the total output of all ‘speech producers’: for example, broadcast interviews, lectures, legal proceedings, and other texts produced in situations where — broadly speaking — there are few producers and many receivers. A corpus constituted solely on the demographic model would thus omit important spoken text types. Consequently, the demographic component of the corpus was complemented with a separate text typology intended to cover the full range of linguistic variation found in spoken language; this is termed the context-governed part of the corpus.

The demographically sampled part of the corpus

The approach adopted uses demographic parameters to sample the population of British English speakers in the United Kingdom. Established random location sampling procedures were used to select individual members of the population by personal interview from across the country taking into account age, gender, and social group. Selected individuals used a portable tape recorder to record their own speech and the speech of people they conversed with over a period of up to a week. In this way a unique record of the language people use in everyday conversation was constructed.

Sampling procedure

124 adults (aged 15+) were recruited from across the United Kingdom. Recruits were of both sexes and from all age groups and social classes. The intention was, as far as possible, to recruit equal numbers of men and women, equal numbers from each of the six age groups, and equal numbers from each of four social classes.

Additional recordings were gathered for the BNC as part of the University of Bergen COLT Teenager Language Project. This project used the same recording methods and transcription scheme as the BNC, but selected only respondents aged 16 or below.

The tables below give figures for the amount of transcribed material collected by each respondent, classified by their age, class, and sex.

Table 13. Age group of demographic respondent
Age group	texts	w-units	%	s-units	%
0-14	26	265382	6.30	41035	6.72
15-24	36	660847	15.71	97994	16.04
25-34	29	848162	20.16	121750	19.94
35-44	22	839622	19.96	126688	20.74
45-59	20	957382	22.76	136540	22.36
60+	20	634663	15.08	86556	14.17

Table 14. Social class of demographic respondent
Social class	texts	w-units	%	s-units	%
Unknown	7	37363	0.88	5339	0.87
AB	59	1363571	32.41	197804	32.39
C1	36	1097023	26.08	169384	27.74
C2	31	1080654	25.69	144877	23.72
DE	20	627447	14.91	93159	15.25

Table 15. Sex of demographic respondent
Sex	texts	w-units	%	s-units	%
Unknown	5	16151	0.38	2407	0.39
Male	73	1730592	41.14	248247	40.65
Female	75	2459315	58.47	359909	58.94
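
As a quick consistency check, the three tables above can be totalled independently: each classifies the same set of demographic texts, so the column sums should agree. The sketch below is illustrative only. The dictionary layout and the function name `totals` are assumptions made for the example, and the figures are copied from Tables 13 to 15.

```python
# Figures copied from Tables 13-15: (texts, w-units, s-units) per row.
AGE = {
    "0-14": (26, 265382, 41035),
    "15-24": (36, 660847, 97994),
    "25-34": (29, 848162, 121750),
    "35-44": (22, 839622, 126688),
    "45-59": (20, 957382, 136540),
    "60+": (20, 634663, 86556),
}
SOCIAL_CLASS = {
    "Unknown": (7, 37363, 5339),
    "AB": (59, 1363571, 197804),
    "C1": (36, 1097023, 169384),
    "C2": (31, 1080654, 144877),
    "DE": (20, 627447, 93159),
}
SEX = {
    "Unknown": (5, 16151, 2407),
    "Male": (73, 1730592, 248247),
    "Female": (75, 2459315, 359909),
}


def totals(table):
    """Sum each column of a table given as {label: (texts, w_units, s_units)}."""
    return tuple(sum(column) for column in zip(*table.values()))


# Each classification should yield the same (texts, w-units, s-units) totals.
for name, table in [("age", AGE), ("class", SOCIAL_CLASS), ("sex", SEX)]:
    print(name, totals(table))
```

If the three printed tuples differ, one of the tables has been miscopied.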

Recruits who agreed to take part in the project were asked to record all of their conversations over a two to seven day period. The number of days varied depending on how many conversations each recruit was involved in and was prepared to record. Results indicated that most people recorded nearly all of their conversations, and that the limiting factor was usually the number of conversations a person had per day. The placement day was varied, and recruits were asked to record on the day after placement and on any other day or days of the week. In this way a broad spread of days of the week including weekdays and weekends was achieved. A conversation log allowed recruits to enter details of every conversation recorded, and included date, time and setting, and brief details of other participants.

Recording procedure

All conversations were recorded as unobtrusively as possible, so that the material gathered approximated closely to natural, spontaneous speech. In many cases the only person aware that the conversation was being taped was the person carrying the recorder. Although an initial unnaturalness on the part of the recruit was not uncommon, this soon seemed to disappear. Similarly, where non-intrusive recording was not possible, for example at a family gathering where everyone was aware they were being recorded, the same initial period of unease sometimes occurred, but in our experience again vanished quickly. The guarantee of confidentiality and complete anonymity (all references to full names and addresses have been removed from the corpus and the log), and the fact that there was an intermediary between those being recorded and those listening to the recordings, certainly helped.

For each conversational exchange the person carrying the recorder told all participants they had been recorded and explained why. Whenever possible this happened after the conversation had taken place. If any participant was unhappy about being recorded the recording was erased. During the project around 700 hours of recordings were gathered.

Sample size

The number of people recruited may seem small in comparison to some demographic studies of the population of the United Kingdom. As with any sampling method, some compromise between what was theoretically desirable and what was feasible within the constraints of the BNC project had to be made. There is no doubt that recruiting 1000 people would have given greater statistical validity, but the practical difficulties and cost implications of recruiting 1000 people and transcribing 50–100 million words of speech made this impossible. Given that we were not attempting to represent the complete range of age and social groups within each region, we considered that a sample size between 100 and 130 would be adequate. It is also important to stress that the total number of participants in all conversations was well in excess of a thousand.

Piloting the demographic sampling approach

Because this approach to spoken corpus sampling had, to our knowledge, never previously been attempted, a detailed piloting project was carried out to investigate:

the likelihood that enough material would be obtained from a sample of around 100 people
any problems that might be encountered during the recruitment and collection stages
any problems or difficulties experienced by recruits during taping or with logging details of conversations and participants
any areas where the documentation designed for the project could be improved
whether the recording quality under a wide range of conditions would be good enough for accurate transcription
whether the predicted throughput rates for tape editing, transcription and checking were accurate.

The results of the pilot generally confirmed predictions and allowed some procedures to be refined for the full project.

The context-governed part of the corpus

As mentioned above, the spoken texts in the demographic part of the corpus consist mainly of conversational English. A complementary approach was developed to create what is termed the context-governed part of the corpus. As in other spoken corpora, the range of text types was selected according to a priori linguistically motivated categories. At the top layer of the typology is a division into four equal-sized contextually based categories: educational, business, public/institutional, and leisure. Each is divided into the subcategories monologue (40 per cent) and dialogue (60 per cent). Each monologue subcategory therefore totals 10 per cent of the context-governed part of the corpus, and each dialogue subcategory 15 per cent.
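
The arithmetic behind these design proportions can be sketched as follows. This is only an illustration of the targets described above, not a record of how the corpus was actually assembled. The names `CATEGORIES`, `SPLIT`, and `design_shares` are assumptions made for the example.

```python
from fractions import Fraction

# Four equal-sized top-level categories, as described in the text.
CATEGORIES = ["educational", "business", "public/institutional", "leisure"]

# Target split within each category.
SPLIT = {"monologue": Fraction(40, 100), "dialogue": Fraction(60, 100)}


def design_shares():
    """Return the target share of the context-governed part for each
    (category, interaction type) pair, as exact fractions."""
    category_share = Fraction(1, len(CATEGORIES))  # 25 per cent each
    return {
        (category, mode): category_share * share
        for category in CATEGORIES
        for mode, share in SPLIT.items()
    }


shares = design_shares()
for (category, mode), share in shares.items():
    print(f"{category:22s} {mode:10s} {float(share) * 100:.0f} per cent")
```

Exact fractions are used so that the eight shares sum to exactly one. The achieved proportions reported later in this section differ from these targets.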

Within each subcategory a range of text types was defined. This range was not fixed, and the design was flexible enough to allow the inclusion of additional text types. The sampling methodology was different for each text type but the overall aim was to achieve a balanced selection within each, taking into account such features as region, level, gender of speakers, and topic. Other features, such as purpose, were applied on the basis of post hoc judgements.

Sampling procedure

For the most part, a variety of text types were sampled within three geographic regions. However, some text types, such as parliamentary proceedings, and most broadcast categories, apply to the country as a whole and were not regionally sampled. Different sampling strategies were required for each text type, and these are outlined below.

Educational and informative:

Lectures, talks, educational demonstrations: Within each sampling area a university (or college of further education) and a school were selected. A range of lectures and talks was recorded, varying the topic, level, and speaker gender.
News commentaries: Regional sampling was not applied, but both national and regional broadcasting companies were sampled. The topic, level, and gender of commentator were varied.
Classroom interaction: Schools were regionally sampled and the level (generally based on student age) and topic were varied. Home tutorials were also included.

Business:

Company talks and interviews: Sampling took into account company size, areas of activity, and gender of speakers.
Trade union talks: Talks to union members, branch meetings and annual conferences were all sampled.
Sales demonstrations: A range of topics was included.
Business meetings: Companies were selected according to size, area of activity, and purpose of meeting.
Consultations: These included medical, legal, business and professional consultations.

All categories under this heading were regionally sampled.

Public/institutional:

Political speeches: Regional sampling of local politics, plus speeches in both the House of Commons and the House of Lords.
Sermons: Different denominations were sampled.
Public/government talks: Regional sampling of local inquiries and meetings, plus national issues at different levels.
Council meetings: Regionally sampled, covering parish, town, district, and county councils.
Religious meetings: Includes church meetings, group discussions, and so on.
Parliamentary proceedings: Sampling of main sessions and committees, House of Commons and House of Lords.
Legal proceedings: Royal Courts of Justice, and local Magistrates and similar courts were sampled.

Leisure:

Speeches: Regionally sampled, covering a variety of occasions and speakers.
Sports commentaries: Exclusively broadcast, sampling a variety of sports, commentators, and TV/radio channels.
Talks to clubs: Regionally sampled, covering a range of topics and speakers.
Broadcast chat shows and phone-ins: Only those that include a significant amount of unscripted speech were selected from both television and radio.
Club meetings: Regionally sampled, covering a wide range of clubs.

Sample size

Each monologue text type contains up to 200,000 words of text, and each dialogue text type up to 300,000 words. The length of text units within each text type varies: for example, news commentaries may be only a few minutes long (several hundred words), lectures are typically up to one hour (10,000 words), and some business meetings and parliamentary proceedings may last for several hours (20,000 words+). For the context-governed part of the corpus an upper limit of 10,000 words per text unit was generally imposed, although a few texts are slightly above this.

Composition of the spoken component

A total of 757 texts (6,135,671 words) make up the context-governed part of the corpus. The following contexts are distinguished:

Table 16. Context in which spoken text was captured
Context	texts	w-units	%	s-units	%
Educational/Informative	169	1633303	26.61	119252	27.82
Business	131	1285938	20.95	108101	25.22
Public/Institutional	262	1655263	26.97	96504	22.51
Leisure	195	1561167	25.44	104701	24.43
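
The percentage columns in Table 16 can be recomputed from the raw counts. In the sketch below the function name `percent_truncated` is hypothetical, and the shares are truncated to two decimal places rather than rounded. That truncation is an inference from the printed figures, not something the manual states.

```python
# (w-units, s-units) per context, copied from Table 16.
CONTEXTS = {
    "Educational/Informative": (1633303, 119252),
    "Business": (1285938, 108101),
    "Public/Institutional": (1655263, 96504),
    "Leisure": (1561167, 104701),
}


def percent_truncated(count, total):
    """Return count/total as a percentage string truncated to two decimals."""
    # Integer arithmetic avoids floating-point rounding surprises.
    hundredths = count * 10000 // total
    return f"{hundredths // 100}.{hundredths % 100:02d}"


w_total = sum(w for w, _ in CONTEXTS.values())
s_total = sum(s for _, s in CONTEXTS.values())

for label, (w_units, s_units) in CONTEXTS.items():
    print(
        label,
        percent_truncated(w_units, w_total),
        percent_truncated(s_units, s_total),
    )
```

The same function can be applied to the other tables in this section by substituting their counts.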

In addition, the following classifications are applicable to both demographic and context-governed spoken texts:

Table 17. Region where spoken text captured
Region	texts	w-units	%	s-units	%
Unknown	35	446584	4.31	27706	2.66
South	312	4658232	45.04	458253	44.10
Midlands	213	2471184	23.89	240320	23.12
North	350	2765729	26.74	312842	30.10

Table 18. Interaction type for spoken text
Interaction type	texts	w-units	%	s-units	%
Monologue	212	1578614	15.26	94272	9.07
Dialogue	698	8763115	84.73	944849	90.92
