[bnc] Design of the written component - BNC User manual

Sampling basis: production and reception

While it is sometimes useful to distinguish in theory between language which is received (read and heard) and that which is produced (written and spoken), it was agreed that the selection of samples for a general-purpose corpus must take account of both perspectives.

Text that is published in the form of books, magazines, etc., is not representative of the totality of written language that is produced, as writing for publication is a comparatively specialized activity in which few people engage. However, it is much more representative of written language that is received, and is also easier to obtain in useful quantities, and thus forms the greater part of the written component of the corpus.

There was no single source of information about published material that could provide a satisfactory basis for a sampling frame, but a combination of various sources furnished useful information about the totality of written text produced and, particularly, received, some sources being more significant than others. They are principally statistics about books and periodicals that are published, bought or borrowed.

Catalogues of books published per annum tell us something about production but little about reception as many books are published but hardly read.

A list of books in print provides somewhat more information about reception as time will weed out the books that nobody bought (or read): such a list will contain a higher proportion of books that have continued to find a readership.

The books that have the widest reception are presumably those that figure in bestseller lists, particularly prize winners of competitions such as the Booker or Whitbread. Such works were certainly candidates for inclusion in the corpus, but the statistics of book-buying are such that very few texts achieve high sales while a vast number sell only a few copies or in modest numbers. If texts had been selected in strict arithmetical proportion to their sales, their range would have been severely limited. However, where a text from one particular subject domain was required, it was appropriate to prefer a book which had achieved high sales to one which had not.

Library lending statistics, where these are available, also indicate which books enjoy a wide reception and, like lists of books in print, show which books continue to be read.

Similar observations hold for magazines and periodicals. Lists of current magazines and periodicals are similar to catalogues of published books, but perhaps more informative about language reception, as it may be that periodicals are bought and read by a wider cross-section of the community than books. Also, a periodical that fails to find a readership will not continue to be published for long.

Periodical circulation figures have to be treated with the same caution as bestseller lists, as a few titles dominate the market with a very high circulation. To concentrate too exclusively on these would reduce the range of text types in the corpus and make contrastive analysis difficult.

Published written texts were selected partly at random from Whitaker's Books in Print for 1992 and partly systematically, according to the selection features outlined in the section ‘Selection features’ below.

Available sources are concerned almost exclusively with published books and periodicals. It is much more difficult to obtain data concerning the production or reception of unpublished writing. Intuitive estimates were therefore made in order to establish some guidelines for text sampling in the latter area.

Selection features

Texts were chosen for inclusion according to three selection features: domain (subject field), time (within certain dates) and medium (book, periodical, etc.).

The purpose of these selection features was to ensure that the corpus contained a broad range of different language styles, for two reasons. The first was so that the corpus could be regarded as a microcosm of current British English in its entirety, not just of particular types. The second was so that different types of text could be compared and contrasted with each other.

Selection Procedure

Each selection feature was divided into classes (e.g. ‘Medium’ into books, periodicals, unpublished etc.; ‘Domain’ into imaginative, informative, etc.) and target percentages were set for each class. These percentages are quite independent of each other: there was no attempt, for example, to make 25 per cent of the selected periodicals imaginative.

Seventy-five per cent of the samples were to be drawn from informative texts, and the remaining 25 per cent from imaginative texts.

Titles were to be taken from a variety of media, in the following proportions: 60 per cent from books, 30 per cent from periodicals, 10 per cent from miscellaneous sources (published, unpublished, and written to be spoken).

Half of the books in the ‘Books and Periodicals’ class were selected at random from Whitaker's Books in Print 1992. This was to provide a control group to validate the categories used in the other method of selection: the random selection disregarded Domain and Time, but texts selected by this method were classified according to these other features after selection.
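The independence of the target percentages can be illustrated with a minimal sketch. It converts the stated targets for Domain and Medium into word budgets, one feature at a time, with no joint (cross-tabulated) targets. The function name `word_targets` and the overall size `TOTAL_WORDS` are assumptions made for illustration only; they do not come from the BNC project.

```python
# Target percentages stated in the text; each feature is budgeted independently.
DOMAIN_TARGETS = {"informative": 75, "imaginative": 25}
MEDIUM_TARGETS = {"book": 60, "periodical": 30, "miscellaneous": 10}


def word_targets(total_words, percentages):
    """Convert the target percentages for one selection feature into word counts."""
    if sum(percentages.values()) != 100:
        raise ValueError("target percentages must sum to 100")
    return {name: total_words * pct // 100 for name, pct in percentages.items()}


# Hypothetical overall size, chosen only to make the arithmetic easy to follow.
TOTAL_WORDS = 1_000_000

domain_words = word_targets(TOTAL_WORDS, DOMAIN_TARGETS)
medium_words = word_targets(TOTAL_WORDS, MEDIUM_TARGETS)
print(domain_words)
print(medium_words)
```

Because each feature has its own budget, nothing in this sketch constrains, say, how many of the periodical words are imaginative.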

Sample size and method

For books, a target sample size of 40,000 words was chosen. No extract included in the corpus exceeds 45,000 words. For the most part, texts which in their entirety were shorter than 40,000 words were further reduced by ten per cent for copyright reasons; a few texts longer than the target size were however included in their entirety. Text samples normally consist of a continuous stretch of discourse from within the whole. A convenient breakpoint (e.g. the end of a section or chapter) was chosen as far as possible to begin and end the sample so that high-level discourse units were not fragmented. Only one sample was taken from any one text. Samples were taken randomly from the beginning, middle or end of longer texts. (In a few cases, where a publication included essays or articles by a variety of authors of different nationalities, the work of non-UK authors was omitted.)
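The sampling rules for books can be sketched roughly as follows, assuming each text is described by a list of section (or chapter) lengths in words. The sketch takes one continuous run of whole sections from the beginning, middle or end, aims for the 40,000-word target and never passes 45,000 words. The function names and the input format are hypothetical, and the ten per cent reduction applied to short texts for copyright reasons is not modelled.

```python
TARGET_WORDS = 40_000  # target sample size for books, from the text
MAX_WORDS = 45_000     # no extract in the corpus exceeds this, from the text


def grow_window(section_lengths, start):
    """Extend a run of whole sections from `start` until the target is reached,
    never letting the running total pass MAX_WORDS."""
    total = 0
    end = start
    while end < len(section_lengths) and total < TARGET_WORDS:
        if total + section_lengths[end] > MAX_WORDS:
            break
        total += section_lengths[end]
        end += 1
    return start, end, total


def choose_sample(section_lengths, position):
    """Return (first_section, one_past_last_section, words) for one sample."""
    whole = sum(section_lengths)
    n = len(section_lengths)
    if whole <= TARGET_WORDS:
        return 0, n, whole  # short text: take all of it
    if position == "beginning":
        return grow_window(section_lengths, 0)
    if position == "end":
        # Grow backwards from the last section by working on the reversed list.
        _, end, total = grow_window(section_lengths[::-1], 0)
        return n - end, n, total
    if position == "middle":
        # Skip roughly half of the surplus words, then start at a section boundary.
        skip = (whole - TARGET_WORDS) // 2
        offset = 0
        for start, length in enumerate(section_lengths):
            if offset >= skip:
                return grow_window(section_lengths, start)
            offset += length
        return grow_window(section_lengths, n - 1)
    raise ValueError(f"unknown position: {position}")


chapters = [8_000] * 12  # hypothetical book of twelve equal chapters
print(choose_sample(chapters, "beginning"))
print(choose_sample(chapters, "middle"))
print(choose_sample(chapters, "end"))
```

A caller wishing to mimic the random choice of position could pass `random.choice(["beginning", "middle", "end"])`. If a single section is longer than the cap, the sketch returns an empty range, which signals that a finer breakpoint than the section boundary would be needed.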

Some types of written material are composite in structure: that is, the physical object in written form is composed of more than one text unit. Important examples are issues of a newspaper or magazine which, though editorially shaped as a document, contain discrete texts, each with its specific authorship, stylistic characteristics, register and domain. The BNC attempts to separate these discrete texts where appropriate and to classify them individually according to the selection and classification features. As far as possible, the individual stories in one issue of a newspaper were grouped according to domain, for example as ‘Business’ articles, ‘Leisure’ articles, etc.

The following subsections discuss each selection criterion, and indicate the actual numbers of words in each category included.

Domain

Classification according to subject field seems hardly appropriate to texts which are fictional or which are generally perceived to be literary or creative. Consequently, these texts are all labelled imaginative and are not assigned to particular subject areas. All other texts are treated as informative and are assigned to one of the eight domains listed below.

Table 3. Written domain
Domain	texts	w-units	%	s-units	%
Applied science	370	7104635	8.14	357067	7.12
Arts	261	6520634	7.47	321442	6.41
Belief and thought	146	3007244	3.44	151418	3.01
Commerce and finance	295	7257542	8.31	382717	7.63
Imaginative	477	16377726	18.76	1356458	27.05
Leisure	438	12187946	13.96	760722	15.17
Natural and pure science	146	3784273	4.33	183466	3.65
Social science	527	13906182	15.93	700122	13.96
World affairs	484	17132023	19.62	800560	15.96

The evidence from catalogues of books and periodicals suggests that imaginative texts account for significantly less than 25 per cent of published output, and unpublished reports, correspondence, reference works and so on would seem to add further to the bulk of informative text which is produced and consumed. However, the overall distribution between informative and imaginative text samples is set to reflect the influential cultural role of literature and creative writing. The target percentages for the eight informative domains were arrived at by consensus within the project, based loosely upon the pattern of book publishing in the UK during the past 20 years or so, as reflected in the categorized figures for new publications that appear annually in Whitaker's Book list.
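As a worked check, the achieved split between imaginative and informative text can be recomputed from the w-unit counts in Table 3 and set against the 25 per cent design target. This is a sketch for the reader's convenience only; the variable names are illustrative.

```python
# w-unit counts copied from Table 3.
W_UNITS = {
    "Applied science": 7104635,
    "Arts": 6520634,
    "Belief and thought": 3007244,
    "Commerce and finance": 7257542,
    "Imaginative": 16377726,
    "Leisure": 12187946,
    "Natural and pure science": 3784273,
    "Social science": 13906182,
    "World affairs": 17132023,
}
TARGET_IMAGINATIVE_PCT = 25  # design target stated in the text

total = sum(W_UNITS.values())
imaginative_pct = 100 * W_UNITS["Imaginative"] / total
informative_pct = 100 - imaginative_pct

print(f"total w-units: {total}")
print(f"imaginative: {imaginative_pct:.2f}% (target {TARGET_IMAGINATIVE_PCT}%)")
print(f"informative: {informative_pct:.2f}%")
```

On these figures the imaginative share of w-units falls somewhat below the 25 per cent target, while its share of s-units in the same table is higher.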

Medium

This categorisation is broad, since a detailed taxonomy or feature classification of text medium could have led to such a proliferation of subcategories as to make it impossible for the BNC adequately to represent all of them. The labels used here are intended to be comprehensive in the sense that any text can be assigned with reasonable confidence to these macro categories. The labels we have adopted represent the highest levels of a fuller taxonomy of text medium.

Table 4. Written medium
Medium	texts	w-units	%	s-units	%
Book	1414	49891770	57.16	2895652	57.75
Periodical	1208	28356005	32.48	1487725	29.67
Published miscellanea	238	4197450	4.80	288004	5.74
Unpublished miscellanea	249	3508500	4.01	222438	4.43
To-be-spoken	35	1324480	1.51	120153	2.39

The ‘Miscellaneous published’ category includes brochures, leaflets, manuals, advertisements. The ‘Miscellaneous unpublished’ category includes letters, memos, reports, minutes, essays. The ‘written-to-be-spoken’ category includes scripted television material, play scripts etc.
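The figures in Table 4 can be compared with the 60/30/10 targets for Medium by folding the three miscellaneous categories into one group, as in the following sketch. The grouping rule and the variable names are assumptions made for illustration; the counts are copied from the table.

```python
# w-unit counts copied from Table 4.
W_UNITS = {
    "Book": 49891770,
    "Periodical": 28356005,
    "Published miscellanea": 4197450,
    "Unpublished miscellanea": 3508500,
    "To-be-spoken": 1324480,
}
# Target percentages for Medium stated in the selection procedure.
TARGET_PCT = {"Book": 60, "Periodical": 30, "Miscellaneous": 10}

# Fold every category without its own target into "Miscellaneous".
grouped = {}
for medium, count in W_UNITS.items():
    group = medium if medium in TARGET_PCT else "Miscellaneous"
    grouped[group] = grouped.get(group, 0) + count

total = sum(grouped.values())
# Difference between achieved and target share, in percentage points.
deviation = {g: 100 * grouped[g] / total - TARGET_PCT[g] for g in TARGET_PCT}

for group, points in deviation.items():
    print(f"{group}: {points:+.2f} percentage points against target")
```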

Descriptive features

Written texts may be further classified according to sets of descriptive features. These features describe the sample texts; they did not determine their selection. This information is recorded to allow more delicate contrastive analysis of particular sets of texts. As a simple example, the gross division into two time periods in the selection features can, of course, be refined and subcorpora defined over the BNC for more specific dates. However, the relative sizes of such subcorpora are undefined by the BNC design specification.

These descriptive features were monitored during the course of the data gathering, and text selection, in cases where a free choice of texts was available, took account of the relative balance of these features. Thus although no relative proportions were defined for different target age groups (for example), we ensured that the corpus does contain texts intended for children as well as for adults.

The following tables summarize the results for the first release of the corpus. Note that many texts remain unclassified.

Author information

Information about authors of written texts was included only where it was readily available, for example from the dust-wrapper of a book. Consequently, the coverage of such information is very patchy. The authorship of a written text was characterized as ‘corporate’ where it was produced by an organization and no specific author was given, and as ‘multiple’ in cases where several authors were named. Author sex was classified as ‘mixed’ where more than one author of either sex was specified, and ‘unknown’ where it could not reliably be determined from the author's name. Note that ‘author age’ means the author's age at the time of creation of the work concerned.

Table 5. Type of author
Author type	texts	w-units	%	s-units	%
Unknown	211	3750668	4.29	175027	3.49
Corporate	347	6497415	7.44	471152	9.39
Multiple	1323	34284025	39.28	1813636	36.17
Sole	1263	42746097	48.97	2554157	50.94

Table 6. Author sex
Author sex	texts	w-units	%	s-units	%
Unknown	1573	35825335	41.04	1970482	39.29
Male	922	30434132	34.87	1675236	33.41
Female	415	14480939	16.59	972106	19.38
Mixed	234	6537799	7.49	396148	7.90

Table 7. Author age group
Author age	texts	w-units	%	s-units	%
Unknown	2519	65457159	74.99	3707600	73.94
0-14	3	59071	0.06	3447	0.06
15-24	19	537251	0.61	29862	0.59
25-34	67	2286936	2.62	163079	3.25
35-44	191	6660606	7.63	410324	8.18
45-59	205	7157985	8.20	410717	8.19
60+	140	5119197	5.86	288943	5.76

Table 8. Author domicile
Author domicile	texts	w-units	%	s-units	%
Unknown	2273	56750777	65.02	3144578	62.71
UK and Ireland	843	29570097	33.88	1812550	36.14
Commonwealth	12	407076	0.46	25762	0.51
Continental Europe	6	232275	0.26	12469	0.24
USA	8	243177	0.27	15677	0.31
Elsewhere	2	74803	0.08	2936	0.05

Target audience

Some attempt was made to characterize the kind of audience for which written texts were produced in terms of age, sex and ‘level’ (a subjective assessment of the text's technicality or difficulty). The last of these proved very difficult to assess and was very frequently confused with circulation size or audience size; for that reason, no figures for it are included here.

Table 9. Target age group
age group	texts	w-units	%	s-units	%
Child	42	895413	1.02	81085	1.61
Teenager	77	1769940	2.02	135583	2.70
Adult	2915	81345838	93.20	4625633	92.25
Any	110	3267014	3.74	171671	3.42

Table 10. Target sex
sex	texts	w-units	%	s-units	%
Unknown	707	20113523	23.04	1135038	22.63
Male	61	2366396	2.71	136564	2.72
Female	176	6882659	7.88	507713	10.12
Mixed	2200	57915627	66.35	3234657	64.51

Miscellaneous classification information

Written texts were also characterized according to their place of publication and the type of sampling used.

Table 11. Place of publication
Region	texts	w-units	%	s-units	%
Unknown	690	14583761	16.70	790465	15.76
UK (unspecific)	264	7124424	8.16	383046	7.63
Ireland	37	567046	0.64	31825	0.63
UK (North)	192	3778114	4.32	230008	4.58
UK (Midlands)	93	2622554	3.00	192379	3.83
UK (South)	1854	58066891	66.53	3365045	67.11
United States	14	535415	0.61	21204	0.42

Table 12. Sampling method
Sample type	texts	w-units	%	s-units	%
Unknown	1583	35240809	40.37	1994357	39.77
Whole text	270	6463415	7.40	433833	8.65
Beginning sample	585	20890666	23.93	1121658	22.37
Middle sample	512	18344188	21.01	1055383	21.04
End sample	119	4271138	4.89	253413	5.05
Composite	75	2067989	2.36	155328	3.09

In addition to the above, standard bibliographic details such as author, title, publication details, extent, topic keywords etc. were recorded for the majority of texts, as further described below.

Selection procedures employed

Books

Roughly half the titles were randomly selected from available candidates identified in Whitaker's Books in Print (BIP), 1992, by students of Library and Information Studies at Leeds City University. Each text randomly chosen was accepted only if it fulfilled certain criteria: it had to be published by a British publisher, contain sufficient pages of text to make its incorporation worthwhile, consist mainly of written text, fall within the designated time limits, and cost less than a set price. The students noted the ISBN, author, title and price of each book thus selected; the final selection weeded out texts by non-UK authors.

Half of the books having been selected by this method, the remaining half were selected systematically to make up the target percentages in each category. The selection proceeded as follows.

Bestsellers

Because of their wide reception, bestsellers were obvious candidates for selection. The lists used were those that appeared in the Bookseller at the end of the years 1987 to 1993 inclusive. Some of the books in the lists were rejected, for a variety of reasons. Obviously books that had already been selected by the random method were excluded, as were those by non-UK authors. In addition, a limit of 120,000 words from any one author was imposed, and books belonging to a domain or category whose quota had already been reached were not selected. Other bestseller lists were obtained from The Guardian, the British Council, and from Blackwells Paperback Shop.
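The rejection rules for bestsellers can be pictured as a single pass over a list, as in the sketch below. It is an illustration under stated assumptions, not a record of the procedure actually followed: the record fields (`title`, `author`, `uk_author`, `domain`, `words`), the function name and the sample data are all hypothetical, and words already taken from an author by the random method are not counted towards the limit.

```python
AUTHOR_WORD_LIMIT = 120_000  # per-author limit stated in the text


def select_bestsellers(candidates, randomly_selected, domain_quota):
    """Walk a bestseller list in order and keep the entries that pass every check.

    candidates        -- list of dicts with hypothetical keys: title, author,
                         uk_author, domain, words (size of the sample to be taken)
    randomly_selected -- set of titles already chosen by the random method
    domain_quota      -- dict mapping a domain name to its remaining word budget
    """
    words_by_author = {}
    remaining = dict(domain_quota)  # copy so the caller's dict is not modified
    chosen = []
    for book in candidates:
        if book["title"] in randomly_selected:
            continue  # already in the corpus via the random selection
        if not book["uk_author"]:
            continue  # non-UK authors were excluded
        author_total = words_by_author.get(book["author"], 0)
        if author_total + book["words"] > AUTHOR_WORD_LIMIT:
            continue  # would exceed the per-author limit
        if remaining.get(book["domain"], 0) <= 0:
            continue  # this domain's quota has already been reached
        words_by_author[book["author"]] = author_total + book["words"]
        remaining[book["domain"]] -= book["words"]
        chosen.append(book["title"])
    return chosen


# Hypothetical bestseller list; every sample is assumed to be 40,000 words.
books = [
    {"title": "A", "author": "X", "uk_author": True, "domain": "imaginative", "words": 40_000},
    {"title": "B", "author": "X", "uk_author": True, "domain": "imaginative", "words": 40_000},
    {"title": "C", "author": "Y", "uk_author": False, "domain": "imaginative", "words": 40_000},
    {"title": "D", "author": "X", "uk_author": True, "domain": "imaginative", "words": 40_000},
    {"title": "E", "author": "X", "uk_author": True, "domain": "imaginative", "words": 40_000},
    {"title": "G", "author": "X", "uk_author": True, "domain": "imaginative", "words": 40_000},
    {"title": "F", "author": "Z", "uk_author": True, "domain": "leisure", "words": 40_000},
]
picked = select_bestsellers(books, {"B"}, {"imaginative": 200_000, "leisure": 0})
print(picked)
```

In this example B is skipped because it was already selected at random, C because its author is not from the UK, G because it would take author X past 120,000 words, and F because the quota for its domain is exhausted.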

The titles yielded by this search were mostly in the Imaginative category.

Literary prizes

The criteria for inclusion were the same as for bestsellers. The prize winners, together with runners-up and shortlisted titles, were taken from several sources, principally Anne Strachan, Prizewinning literature: UK literary award winners, London, 1989. For 1990 onwards the sources used were: the last issue of the Bookseller for each year; The Guardian Index, 1989–, entries under the term ‘Literature’; and The Times Index, 1989–, entries under the term ‘Literature — Awards’.

Literary prizes are in the main awarded to works that fall into the Imaginative category, but there are some Informative ones also.

Library loans

The source of statistics in this category was the record of loans under Public Lending Right, kindly provided by Dr J. Parker, the Registrar. The information comprised lists of the hundred most issued books and the hundred most issued children's books, in both cases for the years 1987 to 1993.

The lists consist almost exclusively of imaginative literature, and many titles found there also appear in the lists of bestsellers and prize winners.

Additional texts

As collection proceeded, monitoring disclosed potential shortfalls in certain domains. A further selection was therefore made, based on the ‘Short Loan’ collections of seven University libraries. (Short Loan collections typically contain books required for academic courses, which are consequently in heavy demand.)

Periodicals and magazines

Periodicals, magazines and newspapers account for 30 per cent of the total text in the corpus. Of these, about 250 titles were issues of newspapers. These were selected to cover as wide a spectrum of interests and language as possible. Newspapers were selected to represent as wide a geographic spread as possible: The Scotsman and the Belfast Telegraph are both represented, for example.

Other media

In addition to samples from books, periodicals, and magazines, the written part of the corpus contains about seven million words classified as ‘Miscellaneous Published’, ‘Miscellaneous Unpublished’, or as ‘Written to be spoken’. The distinction between ‘published’ and ‘unpublished’ is not an easy one; the former category largely contains publicity leaflets, brochures, fact sheets, and similar items, while the latter has a substantial proportion of school and university essays, unpublished creative writing or letters, and internal company memoranda. The ‘written to be spoken’ material includes scripted material intended to be read aloud, such as television news broadcasts; transcripts of more informal broadcast materials such as discussions or phone-ins are included in the spoken part of the corpus.

Copyright permissions

Before a selected text could be included, permissions had to be obtained from the copyright owner (publisher, agent, or author). A standard Permissions Request was drafted with considerable care, but some requests were refused, or simply not answered even after prompting, so that the texts concerned had to be excluded or replaced.