This section consists of a series of tables listing the codes used to encode various aspects of the corpus.
The following code tables are provided:
A general discussion of the principles and practice underlying the CLAWS word class annotation scheme used in the BNC is provided by the document A brief users' guide to the grammatical tagging of the British National Corpus. This also includes a full list of the CLAWS7 word class codes applied to the BNC Sampler.
The following list gives a brief description of each SGML element used in the BNC Sampler. Elements are listed in alphabetical order. Descriptions prefixed by "(H)" are for elements which appear only in the text headers.
<activity> | (H) | participants' activity during recording | 297 |
<address> | (H) | postal or other address | 185 |
<align> | | alignment map for synchronizing overlap points | 245 |
<author> | (H) | author in bibliographic entry | 51 |
<avail> | (H) | availability code for file | 185 |
<bibNote> | (H) | note within a bibliographic entry | 1 |
<bibl> | | loosely structured bibliographic reference | 23 |
<biblScop> | (H) | page range within bibliographic entry | 39 |
<biblStr> | (H) | structured bibliographic entry | 87 |
<bncDoc> | | an individual text in the BNC Sampler | 184 |
<c> | | a punctuation mark | 285801 |
<caption> | | a floating heading or caption | 1592 |
<catDesc> | (H) | description of a category | 127 |
<catRef> | (H) | category codes applicable to a text | 184 |
<category> | (H) | a category-value pair | 127 |
<change> | (H) | change note | 963 |
<clasDecl> | (H) | description of classification scheme | 1 |
<corr> | (H) | description of correction policy | 3 |
<creation> | (H) | information about creation of a text | 185 |
<date> | | a date | 1218 |
<div> | | any subdivision of a spoken text | 238 |
<div1> | | first-level subdivision of a written text | 1218 |
<div2> | | second-level subdivision of a written text | 1165 |
<div3> | | third-level subdivision of a written text | 512 |
<div4> | | fourth-level subdivision of a written text | 137 |
<editDecl> | (H) | descriptions of editorial policies | 16 |
<ednStmt> | (H) | information about a particular edition | 185 |
<encDesc> | (H) | encoding description | 185 |
<event> | | non-verbal event within a spoken text | 477 |
<extent> | (H) | size of a corpus text | 185 |
<fileDesc> | (H) | documentation of an electronic text | 185 |
<gap> | | a spot where part of source text has been omitted | 4831 |
<head> | | any form of heading or title | 3020 |
<header> | | meta-information describing a corpus text | 185 |
<hi> | | typographically highlighted phrase | 1738 |
<hyph> | (H) | description of hyphenation policy | 2 |
<idno> | (H) | identifying number for a text | 185 |
<imprint> | (H) | imprint within a bibliographic entry | 71 |
<item> | | item within a list | 1041 |
<keywords> | (H) | descriptive keywords for topics of a text | 184 |
<l> | | line of verse | 3618 |
<label> | | label of a list item | 291 |
<langUsg> | (H) | description of languages used in a text | 1 |
<list> | | list of items | 185 |
<loc> | | synchronisation point within an alignment map | 28399 |
<locName> | (H) | name of place where speech recorded | 285 |
<locale> | (H) | description of a place where speech recorded | 264 |
<monogr> | (H) | monographic bibliographic entry | 87 |
<name> | | proper name of person, place etc. | 1417 |
<note> | | note or comment of any kind | 119 |
<p> | | paragraph in written text | 28 |
<partics> | (H) | description of spoken text participants | 99 |
<pause> | | noticeable pause in spoken text | 26091 |
<pb> | | page break in written text | 2633 |
<person> | (H) | information about a speaker | 537 |
<poem> | | group of verse lines in a written text | 161 |
<profDesc> | (H) | additional information about a text | 185 |
<projDesc> | (H) | background information about BNC project | 185 |
<ptr> | | link to a displaced element or to synchronisation point | 58524 |
<pubPlace> | (H) | place of publication within bibliographic entry | 69 |
<pubStmt> | (H) | publication or distribution information | 185 |
<quot> | (H) | description of quotation policy | 2 |
<quote> | | quotation from some other work | 59 |
<rec> | (H) | recording details | 297 |
<recStmt> | (H) | information about an audio recording | 98 |
<refsDecl> | (H) | description of reference system used | 185 |
<reg> | | description of regularisation policy | 261 |
<relation> | (H) | relationship between participants in a spoken text | 396 |
<resp> | (H) | nature of responsibility | 1346 |
<respStmt> | (H) | statement of responsibility in a bibliographic entry | 1343 |
<revDesc> | (H) | revision description | 185 |
<s> | | sentence-like linguistic segment | 172408 |
<sampDecl> | (H) | description of sampling policy | 5 |
<segm> | (H) | description of segmentation policy | 2 |
<settDesc> | (H) | description of setting in which speech occurs | 98 |
<setting> | (H) | an individual setting in which speech occurs | 297 |
<shift> | | change in voice quality | 4230 |
<sic> | | apparently erroneous transcription | 516 |
<srcDesc> | (H) | description of the source for a written text | 185 |
<stext> | | an individual spoken text | 98 |
<tagUsage> | (H) | count for a particular tag in a text | 2600 |
<tagsDecl> | (H) | list of tags used in a particular text | 185 |
<term> | (H) | individual term in a list of keywords | 262 |
<text> | | an individual written text | 86 |
<titStmt> | (H) | title statement for a text | 185 |
<title> | (H) | title within a bibliographic entry | 272 |
<trans> | (H) | declaration of transcription policy | 7 |
<trunc> | | truncated form in a spoken text | 5566 |
<txtClass> | (H) | text classification | 184 |
<u> | | utterance in a spoken text | 79811 |
<unclear> | | inaudible or incomprehensible passage in a spoken text | 19936 |
<vocal> | | non-verbal vocalization in a spoken text | 4521 |
<w> | | word or other non-punctuation token carrying a POS code | 1993554 |
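For readers who want to load the element list into a program, the following is a minimal Python sketch of a row parser. It is not part of the BNC distribution: the names ElementRow and parse_element_row are hypothetical, and the sketch assumes that the final number in each row is the number of occurrences of the element in the corpus. Empty cells are discarded, so a row parses the same way whether or not it has a cell in the "(H)" position.

```python
from typing import NamedTuple, Optional


class ElementRow(NamedTuple):
    name: str          # element name without angle brackets
    header_only: bool  # True when the row carries the "(H)" marker
    description: str
    count: int         # assumed to be the number of occurrences


def parse_element_row(line: str) -> Optional[ElementRow]:
    """Parse one pipe-delimited row; return None if the row does not fit."""
    # Split on "|" and drop empty cells, so padded and unpadded rows both work.
    cells = [cell.strip() for cell in line.split("|")]
    cells = [cell for cell in cells if cell]
    header_only = "(H)" in cells
    cells = [cell for cell in cells if cell != "(H)"]
    # What remains should be exactly: element name, description, count.
    if len(cells) != 3 or not cells[2].isdigit():
        return None
    name, description, count = cells
    return ElementRow(name.strip("<>"), header_only, description, int(count))
```

Calling parse_element_row on the <activity> row yields a record whose header_only field is true, while the <c> row yields one whose header_only field is false.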
The following table gives a brief description of each character entity used within the text of the BNC Sampler. Declarations for these entities may be found either in standard entity sets or in the entity definitions supplied as part of the BNC document type definition, in the file sampents.dtd. In either case, system-specific values should be supplied for the characters described below. The number indicates the number of times this entity reference appears in the current version of the corpus.
aacute | small a, acute accent | 5 |
acirc | small a, circumflex accent | 2 |
agrave | small a, grave accent | 3 |
amp | ampersand | 262 |
ast | asterisk | 3 |
auml | small a, dieresis or umlaut mark | 10 |
bquo | normalised begin quote mark | 8049 |
bsol | reverse solidus | 1 |
ccaron | small c, caron | 1 |
ccedil | small c, cedilla | 5 |
deg | degree sign | 26 |
dollar | dollar sign | 231 |
Eacute | capital E, acute accent | 3 |
eacute | small e, acute accent | 69 |
egrave | small e, grave accent | 4 |
equo | normalised end quote mark | 8323 |
euml | small e, dieresis or umlaut mark | 1 |
formula | mathematical formula | 740 |
frac12 | fraction one-half | 163 |
frac14 | fraction one-quarter | 27 |
frac15 | fraction one-fifth | 2 |
frac18 | fraction one-eighth | 4 |
frac34 | fraction three-quarters | 27 |
frac38 | fraction three-eighths | 13 |
frac58 | fraction five-eighths | 9 |
frac78 | fraction seven-eighths | 1 |
ft | feet indicator | 13 |
gt | greater-than sign | 645 |
hellip | ellipsis (horizontal) | 919 |
ins | inches indicator | 33 |
iuml | small i, dieresis or umlaut mark | 2 |
lcub | left curly bracket | 93 |
lsqb | left square bracket | 301 |
lt | less-than sign | 640 |
mdash | em dash | 3500 |
ndash | en dash | 784 |
ntilde | small n, tilde | 2 |
oacute | small o, acute accent | 4 |
ocirc | small o, circumflex accent | 5 |
oelig | small oe ligature | 2 |
oslash | small o, slash | 110 |
ouml | small o, dieresis or umlaut mark | 19 |
pound | pound sign | 1097 |
quot | quotation mark | 2280 |
rcub | right curly bracket | 95 |
rehy | maps to soft hyphen | 16 |
rsqb | right square bracket | 301 |
times | multiply sign | 39 |
uacute | small u, acute accent | 1 |
uuml | small u, dieresis or umlaut mark | 8 |
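As an illustration of supplying system-specific values, the Python sketch below replaces entity references with Unicode characters. It rests on several assumptions: the mapping is a small illustrative subset chosen here rather than taken from sampents.dtd, every reference is assumed to be written in the full form &name; with a closing semicolon, and the corpus-specific entities (bquo, equo, ft, ins, rehy, formula) are deliberately left unresolved because their replacement is a local decision. The names ENTITY_TO_CHAR and resolve_entities are hypothetical.

```python
import re

# Illustrative subset only: entity name -> Unicode character.
# Corpus-specific entities are omitted on purpose and pass through unchanged.
ENTITY_TO_CHAR = {
    "aacute": "\u00e1",
    "amp": "&",
    "eacute": "\u00e9",
    "frac12": "\u00bd",
    "hellip": "\u2026",
    "mdash": "\u2014",
    "ndash": "\u2013",
    "ouml": "\u00f6",
    "pound": "\u00a3",
}

# Assumes references always end with ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def resolve_entities(text, table=ENTITY_TO_CHAR):
    """Replace known entity references; leave unknown ones untouched."""
    def substitute(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)
```

Unknown names are left in place rather than dropped, so a later pass can still handle them with a fuller table.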
The type attribute on each <div1>, <div2> (etc) element of a written text may be used to supply a value which characterizes the function of the corresponding subdivision in some way. Only the following values are used in the BNC Sampler:
The following codes are used in the BNC Sampler to indicate the kind of typographic rendition associated with an element where this is typographically distinct in some way. These codes are mostly used as values for the r attribute of the <hi> element, but may be used on any element bearing this attribute.
bo | bold face | 220 |
hi | highlighted | 7 |
it | italic face | 1531 |
ro | roman face | 2 |
ul | underlined | 31 |
Changes in voice quality in spoken texts are indicated by values for the new attribute on a <shift> element, at the point where the speaker's voice changes. The following values are used in the BNC Sampler (frequencies are given in parentheses):
crying (18) laughing (1151) mimicking (46) mimicking refined accent (1) reading (327) screaming (12) shouting (165) sighing (31) singing (157) spelling (24) whingeing (3) whining (4) whispering (139) yawning (45)
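Lists in this "value (frequency)" layout recur throughout this section. The Python sketch below shows one way they might be read into (value, count) pairs. The function name parse_frequency_list is hypothetical, and the sketch assumes that no value itself contains a parenthesised number.

```python
import re

# A value is any run of text up to " (digits)"; values may contain spaces.
ITEM = re.compile(r"\s*(.+?)\s*\((\d+)\)")


def parse_frequency_list(text):
    """Return (value, count) pairs in their original order."""
    # Collapse all whitespace first, so an item broken across lines is rejoined.
    normalised = " ".join(text.split())
    return [(m.group(1), int(m.group(2))) for m in ITEM.finditer(normalised)]
```

A list of pairs is used rather than a dictionary so that near-duplicate values which differ only in case or punctuation remain visible as separate entries.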
A single set of codes, derived from the International Standard for language and country identification, is used to identify regional origins, first language, and dialects spoken by participants, as specified in the <person> element in the text header. Speakers for whom such information was recorded will use one or more of the following codes as values for the who.flang or who.dialect attributes. All available codes are listed here; note that not all of these codes are actually used in the BNC Sampler:
CAN | Canada |
CHN | China |
DEU | Germany |
FRA | France |
GBR | United Kingdom |
IND | India |
IRL | Ireland |
USA | United States |
XXX | Unknown |
ZZG | Europe |
XDE | accent: German |
XEA | accent: East Anglia |
XFR | accent: French |
XHC | accent: Home Counties |
XHM | accent: Humberside |
XIR | accent: Irish |
XIS | accent: Indian subcontinent |
XLC | accent: Lancashire |
XLO | accent: London |
XMC | accent: central Midlands |
XMD | accent: Merseyside |
XME | accent: north-east Midlands |
XMI | accent: Midlands |
XMS | accent: south Midlands |
XMW | accent: north-west Midlands |
XNC | accent: central northern England |
XNE | accent: north-east England |
XNO | accent: northern England |
XOT | accent: unidentifiable |
XSD | accent: Scottish |
XSL | accent: lower south-west England |
XSS | accent: central south-west England |
XSU | accent: upper south-west England |
XUR | accent: European |
XUS | accent: U.S.A. |
XWA | accent: Welsh |
XWE | accent: West Indian |
Where relationships between individual participants in spoken texts can be identified, they will be specified by means of the <relation> element within the text header (as discussed in section 5.3.3). The desc attribute of this element may take any of the values listed below. The number in parentheses indicates the number of times this value appears in the BNC Sampler.
acquaint | acquaintance (2) |
audience | (1) |
aunt | (2) |
b-i-l | brother-in-law (6) |
brother | (16) |
chairman | (1) |
colleagu | colleague (51) |
cous-i-l | cousin-in-law (1) |
cousin | (1) |
customer | (1) |
d-i-l | daughter-in-law (4) |
daughter | (24) |
f-i-l | father-in-law (6) |
father | (27) |
friend | (36) |
g-daught | granddaughter (5) |
g-fath | grandfather (5) |
g-moth | grandmother (6) |
g-son | grandson (5) |
husband | (40) |
intervee | interviewee (1) |
m-i-l | mother-in-law (8) |
mother | (37) |
neighbou | neighbour (4) |
nephew | (2) |
niece | (1) |
s-daught | step-daughter (1) |
s-father | step-father (1) |
server | (1) |
sis-i-l | sister-in-law (8) |
sister | (15) |
son | (28) |
speaker | (1) |
stranger | (1) |
student | (2) |
teacher | (1) |
tutor | (1) |
uncle | (1) |
wife | (41) |
In addition to the <pause>, <shift>, <trunc>, and <unclear> elements, a variety of paralinguistic phenomena are marked up in the transcriptions which form the spoken part of the BNC Sampler. The <vocal> element is used to mark a variety of non-linguistic or semi-linguistic sounds made by one or more speakers in the transcriptions; the <event> element is used to mark up other occurrences which seemed of importance to the transcribers when making sense of the spoken interaction. We list here the various annotations used for the latter two categories throughout the BNC Sampler.
The following lists in alphabetic order all the different values specified on the TYPE attribute for the <vocal> element in the BNC Sampler, together with the frequency of occurrence for each different value.
baby talk (18) belch (18) buzzing sound (1) clapping (82) clears throat (139) clicks tongue (1) cough (451) crying (30) gasp (2) giggle (2) gurgle (7) hiccup (4) howl (1) humming (18) imitates aeroplane (1) imitates banjo (1) imitates bringing up phlegm (4) imitates cat licking (1) imitates clearing throat (1) imitates vomiting (3) imitating engine revving (1) kiss (5) kissing (1) kissing sound (2) laugh (3438) laughing (1) licking sound (1) mimicking cat spitting (1) mimicking gorilla noises (2) mimicking microphone noises (1) mimicking shaving noise (1) noise for paws (1) panting (1) purring noises (1) raspberry (7) scream (14) sigh (78) singing (7) sneeze (27) sniff (35) sound effect (8) sound effects (1) sound of biting (1) spitting sound (1) squeak (1) sucking noises (2) sucking then purring noises (1) tt (1) tut (27) whine (1) whistling (44) yawn (22) yelping sound (1) |
The following lists in alphabetic order all the different values specified on the DESC attribute for the <event> element in the BNC Sampler, together with the frequency of occurrence for each different value.
Background to following (1) Band (1) Band music (1) Break in enquiry (1) Break in recording (1) Children screaming (1) Dramatic music (1) Dramatic music. (1) End of first tape (2) End of recording (3) End of side (3) Engine noises (3) General chatter (1) Gives her a kiss (1) Gunfire, celebration. (1) Horse racing on the radio (1) Intense gunfire (1) Loud rustling next to microphone (1) Mr Bean record on telly I think (1) Music (3) Music and singing (1) Music and song (1) Pen on paper (3) Plane noises (1) Portuguese speech (31) Reading title of book (1) Rock music (2) Sound of burning and dramatic music (1) Sounds of intense automatic weapons fire. (1) Spanish (1) Tape ends (1) Telephone being dialled, overlaps next part of speech. (1) Theme music leading to triumphant climax (1) Theme music, reprise, to end of job (1) Theme music. Engine noise (1) advert (1) advertisements (7) advertisements and travel news (2) another gap in tape (1) applause (9) baby crying (5) baby screaming (2) baby squealing (1) baby talking (8) background to following (1) banging (2) banging noises (1) barking (2) bell ringing (1) bell rings (2) bibs hooter (1) birds singing/whistling (1) birthdays etcetera (1) blowing kisses (2) blowing nose (1) blows nose (2) boxing on television (1) boys fighting (1) break in recording (6) break in recording while watching video (1) break in tape (5) calling from outside (1) car hooter (1) cat miaows (1) chairs being moved (1) cheering (1) cheering and shouting (1) clapping (24) classroom chatter (11) classroom chatter - barely audible speech (1) clicking computer (4) clicking of computer (1) clock chiming (4) closing music (2) cough (4) dog barking (14) dog barks (4) dog sick noise again (1) dogs barking (2) door opening (1) doorbell (1) doorbell ringing (1) doorbell rings (1) dramatic music (1) drilling noise (1) drums (1) duck noise (1) eating (1) eating dinner (1) end of first side of tape and start of second (1) end of job (1) end of recording (5) end of side (1) end of side of tape (1) end of side one of first tape (1) end of side one of tape (1) end of side one of tape. second side starts part way into tape. (1) end of tape (1) end of tape side two (1) engine noises (1) everyone claps (1) football on television (1) football on television again - changed channels (1) general background chatter continues, but foreground conversation has paused (1) general hubbub as people move forward (1) getting into car (1) going into kitchen (1) hammering something (1) happy children (1) hums tune (1) in another room (1) in background throughout following text (1) in canteen/dining room - very noisy (1) in other room (1) intake of breath (1) interruption for radio commercials (2) introduction music (1) jet passes overhead (1) key sounds (1) knock on door (1) knocking on door (3) laugh (1) laughter (5) loud music (2) loud music playing (1) loud music playing on television (1) loud television (1) machinery noises (2) makes dog whining noise (1) makes growling noise (1) makes noise as if being sick (1) making sound of a plane (1) market place noises (1) microphone hissing and conversation very quiet (1) mimicking (1) mimicking crying (1) mimicking dog barking (1) mumble (1) mumbles (1) murmuring from the floor (1) music (5) music of Waltzing Matilda playing (5) music on in background (1) music on loud (1) music on loud again (1) music playing (1) music playing on television (1) music playing with interruption for radio commercials (1) music very loud (1) news bulletin (2) noise (2) noise like the dog being sick (1) noise of dog biscuits being tipped out (1) now moved to another class? (1) paper crinkling noises (1) phone bleeping (2) phone ringing (1) phone rings (3) piano sound (1) plate stacking (1) playing (1) playing piano (1) playing with baby (1) printer noises (1) printer sounds (2) puking noise. (1) radio on (1) reading computer screen (1) reading from board (1) reading from book/text (1) reading from classroom board (1) reading from hospital form (2) reading from newspaper cutting (1) reading from package (1) reading from receipt (1) reading title of book (1) reads horoscopes (1) recording ends (1) repetitive banging (1) scraping something (1) screaming in background (1) shooting sound on video game (1) shuts door (1) shutting door (1) singing (4) singing along to record (1) singing in background (3) singing to music (2) smacking lips together (1) snooker on the telly (1) sombre music (1) something falls (1) something smashes (1) sounds of burning. (1) sounds of sporadic fire (1) sounds of violence and gunfire (1) speaking French (1) speaking Spanish (2) spells surname (1) spitting noise (1) sports news (1) students' voices in background (6) sucks teeth (3) talking in kitchen away from microphone (1) talking in other room (1) talking to dog (1) talking to the cat (1) talking with mouth full (3) tape ends (6) tape ends side one and starts side two (1) tape playing (1) tape recording (1) tape stops and starts (2) taps table (1) telephone conversation ends (14) telephone conversation starts (13) telephone ringing (5) telephone rings (1) television comes on (1) television loud (1) television on (5) television on - horse racing (1) television turned over to football commentary (1) television very loud (2) telly on loud (1) theme music to end of recording (1) throws dice (1) tickling little girl (2) too much banging (1) traffic (1) traffic news and advertisements (1) travel news (2) travel news and news bulletin (1) travel news and weather (1) unable to hear conversation for some time (1) very loud television - football (1) video film for 135 seconds (1) video playing (1) walking along corridor - just a lot of chatter (1) watching football on television (1) watching football on television and talking quietly in background (2) weather and travel news (1) whispering to dog (1) with microphone (3) woof (1) |
This section lists all of the classification codes that may appear within the <catRef> element in the header of each text. Not all of the values defined here are actually used within the BNC Sampler. For information on how many texts, words, and sentences are included within a given classification, see the additional statistical tables.
The following table shows codes which can be used to classify all kinds of text, according to their availability or their type. (For the BNC Sampler, all texts are freely available worldwide.)
allAva | Text availability |
allAva1 | free, world: Freely available worldwide |
allAva2 | restricted, world: Available worldwide |
allAva3 | restricted, Not-NA: Not available in North America |
allAva4 | restricted, Not-US: Not available in U.S.A. |
allAva5 | restricted, EU: Not available outside the European Union |
allAva6 | restricted, Not-USP: Not available in U.S.A. & Philippines |
allAva7 | restricted, Not-NAP: Not available in North America & Philippines |
allTyp | Text type |
allTyp1 | Spoken demographic |
allTyp2 | Spoken context-governed |
allTyp3 | Written books and periodicals |
allTyp4 | Written-to-be-spoken |
allTyp5 | Written miscellaneous |
The following table lists the classification codes which can be specified for spoken texts (either demographic or context-governed) only. Note that the classifications for demographically sampled texts apply to the respondent only, not necessarily to all speakers transcribed.
scgDom | Domain for context-governed material |
scgDom1 | Educational |
scgDom2 | Business |
scgDom3 | Institutional |
scgDom4 | Leisure |
sdeAge | Age band for demographic respondent |
sdeAge1 | 0-14 |
sdeAge2 | 15-24 |
sdeAge3 | 25-34 |
sdeAge4 | 35-44 |
sdeAge5 | 45-59 |
sdeAge6 | 60+ |
sdeCla | Social class for demographic respondent |
sdeCla1 | AB |
sdeCla2 | C1 |
sdeCla3 | C2 |
sdeCla4 | DE |
sdeSex | Sex of demographic respondent |
sdeSex1 | Male |
sdeSex2 | Female |
spoLog | Interaction type |
spoLog1 | Monologue |
spoLog2 | Dialogue |
spoReg | Region where text captured |
spoReg1 | South |
spoReg2 | Midlands |
spoReg3 | North |
The following table lists all classification codes which may be specified for any written text.
wbpSel | Books & periodicals: selection method |
wbpSel1 | Selective |
wbpSel2 | Random |
wmiPub | Miscellaneous materials: publication status |
wmiPub1 | Published |
wmiPub2 | Unpublished |
wriAAg | Author age band |
wriAAg1 | 0-14 |
wriAAg2 | 15-24 |
wriAAg3 | 25-34 |
wriAAg4 | 35-44 |
wriAAg5 | 45-59 |
wriAAg6 | 60+ |
wriADo | Author domicile |
wriAD036 | Australia |
wriAD124 | Canada |
wriAD250 | France |
wriAD276 | Germany |
wriAD372 | Ireland |
wriAD380 | Italy |
wriAD422 | Lebanon |
wriAD492 | Monaco |
wriAD554 | New Zealand |
wriAD620 | Portugal |
wriAD702 | Singapore |
wriAD756 | Switzerland |
wriAD826 | United Kingdom |
wriAD840 | United States |
wriAD920 | UK North (north of Mersey-Humber line) |
wriAD921 | UK Midlands (north of Bristol Channel-Wash line) |
wriAD922 | UK South (south of Bristol Channel-Wash line) |
wriASe | Sex of author |
wriASe1 | Male |
wriASe2 | Female |
wriASe3 | Mixed |
wriASe4 | Unknown |
wriATy | Type of author |
wriATy1 | Corporate |
wriATy2 | Multiple |
wriATy3 | Sole |
wriATy4 | Unknown |
wriAud | Intended age of audience |
wriAud1 | Child |
wriAud2 | Teenager |
wriAud3 | Adult |
wriAud4 | Any |
wriDom | Domain |
wriDom1 | Imaginative |
wriDom2 | Informative: natural & pure science |
wriDom3 | Informative: applied science |
wriDom4 | Informative: social science |
wriDom5 | Informative: world affairs |
wriDom6 | Informative: commerce & finance |
wriDom7 | Informative: arts |
wriDom8 | Informative: belief & thought |
wriDom9 | Informative: leisure |
wriLev | Circulation level |
wriLev1 | Low |
wriLev2 | Medium |
wriLev3 | High |
wriMed | Medium |
wriMed1 | Book |
wriMed2 | Periodical |
wriMed3 | Miscellaneous published |
wriMed4 | Miscellaneous unpublished |
wriMed5 | To-be-spoken |
wriPPl | Place of publication |
wriPP372 | Ireland |
wriPP826 | United Kingdom |
wriPP840 | United States |
wriPP920 | UK North (north of Mersey-Humber line) |
wriPP921 | UK Midlands (north of Bristol Channel-Wash line) |
wriPP922 | UK South (south of Bristol Channel-Wash line) |
wriSam | Type of sample |
wriSam1 | Whole text |
wriSam2 | Beginning sample |
wriSam3 | Middle sample |
wriSam4 | End sample |
wriSam5 | Composite |
wriSta | Reception status |
wriSta1 | Low |
wriSta2 | Medium |
wriSta3 | High |
wriTas | Target audience sex |
wriTas1 | Male |
wriTas2 | Female |
wriTas3 | Mixed |
wriTas4 | Unknown |
wriTim | Time period |
wriTim1 | 1960-1974 |
wriTim2 | 1975-1993 |
For each classification listed above, the absence of information may be indicated either by the absence of any code, or by the presence of a code ending with a zero instead of a number. For example, written texts for which type of author is unknown may be indicated either by the absence of any value beginning wriATy or by the presence of the specific value wriATy0.
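The rule for missing information can be sketched in Python as follows. The helper names split_code and is_unknown are hypothetical, and the input is assumed to be a list of code strings already extracted from a text's <catRef> element. The sketch interprets "ending with a zero" as meaning that the whole numeric part of the code is zero, because valid codes such as wriAD920 also end in the digit 0; that interpretation is an assumption, not something stated in the corpus documentation.

```python
import re
from typing import List, Optional, Tuple

# A code is taken to be a run of letters followed by a run of digits.
CODE = re.compile(r"([A-Za-z]+)(\d+)")


def split_code(code: str) -> Optional[Tuple[str, int]]:
    """Split e.g. 'wriAD920' into ('wriAD', 920); None if malformed."""
    match = CODE.fullmatch(code)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_unknown(codes: List[str], prefix: str) -> bool:
    """True when no code with this prefix is present, or its value is zero."""
    values = []
    for code in codes:
        parsed = split_code(code)
        if parsed is not None and parsed[0] == prefix:
            values.append(parsed[1])
    # Absence of any code and an explicit zero both mean "no information".
    return not values or all(value == 0 for value in values)
```

With this sketch, is_unknown(["allTyp3", "wriATy0"], "wriATy") and is_unknown(["allTyp3"], "wriATy") are both true, while is_unknown(["wriAD920"], "wriAD") is false.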