
Formal Specification of the BNC XML schema

The structure of the XML edition of the British National Corpus is described by a single XML schema, which is expressed in three different schema languages: the traditional DTD language which XML inherits from SGML; the more recently defined ISO schema language known as RELAX NG; and the W3C-defined schema language. The three schema files are all generated from the same TEI-conformant XML source file, which is also used to generate the present documentation.

This section of the document contains the TEI-conformant reference specification for all components of the BNC schema. These include definitions for attribute classes, model classes, and macro patterns as well as definitions for elements and their associated attributes and possible value lists. A full description of these concepts and how they are used to define and document XML encoding schemes is given by the TEI Guidelines (in particular, in chapter TD); the following summary provides only basic information about them.

When several elements in a schema share attributes of the same name, with values drawn from a common set, they are considered to form an attribute class. The members of such a class can then all reference the same class definition rather than each repeating the same information. In the BNC, for example, the elements <bibl>, <corr>, <div>, <head>, <hi>, and half a dozen others all have the same attribute rend, which takes a coded value from the same short list of possibilities. Rather than repeat this definition for each element, the relevant elements are all said to be members of a class att.rendered, which is defined independently of those elements (but includes a list of its members). In the same way, the <w> and <mw> elements, as members of the att.c5coded class, share the same definition for the possible CLAWS5 codes specified by their @c5 attribute. Note however that the element <c>, although it has an attribute @c5, is not a member of this class, because the possible values for this attribute on this element are entirely different.
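The attribute class mechanism can be pictured with a small Python sketch. It is an illustration only, not the way the TEI source file is processed: the AttributeClass type, the attributes_for function, and the gloss strings are invented for this example, and only the class members actually named in the text above are listed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AttributeClass:
    """A named bundle of attribute definitions shared by its member elements."""
    name: str
    attributes: dict[str, str]                      # attribute name -> short gloss
    members: set[str] = field(default_factory=set)  # element names in the class


# Only the members named in the text are listed; the real classes have more.
ATT_RENDERED = AttributeClass(
    name="att.rendered",
    attributes={"rend": "coded value describing the original rendition"},
    members={"bibl", "corr", "div", "head", "hi"},
)
ATT_C5CODED = AttributeClass(
    name="att.c5coded",
    attributes={"c5": "CLAWS5 part-of-speech code"},
    members={"w", "mw"},
)
CLASSES = [ATT_RENDERED, ATT_C5CODED]

# Attributes declared directly on an element rather than through a class.
# <c> keeps its own c5 because its value list differs from that of <w> and <mw>.
LOCAL_ATTRIBUTES = {"c": {"c5": "CLAWS5 code for a punctuation mark"}}


def attributes_for(element: str) -> dict[str, str]:
    """Collect an element's attributes from its classes and local declarations."""
    found: dict[str, str] = {}
    for cls in CLASSES:
        if element in cls.members:
            for attr, gloss in cls.attributes.items():
                found[attr] = f"{gloss} (from {cls.name})"
    found.update(LOCAL_ATTRIBUTES.get(element, {}))
    return found
```

Calling attributes_for("w") and attributes_for("c") both return a c5 entry, but only the first is inherited from att.c5coded, which mirrors the distinction drawn in the text.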

In any reasonably large schema, and particularly one derived from the TEI model, several elements are likely to have very similar content models, since it will often be the case that at a given point in the document hierarchy any one of several possible elements will be permissible. The specific subset of elements (<w>, <mw>, <c> and a few others) which may appear within an <s> element in the BNC is different from the subsets of elements which may appear within a <p> or <div> element. However, there are several elements which can appear in the same places as a <p>. Following TEI practice, we call the set of elements which can appear together (in sequence or alternation) at a specific place in the document hierarchy a model class. For example, since <l>, <lg>, <list>, <p>, <quote>, and <sp> are all permitted as immediate components of a <div> element, we define a class model.divPart, of which these six elements are all members. Wherever convenient, content models are defined in terms of these model classes.
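As a rough illustration of how a content model defined in terms of a model class might be checked, consider the following Python sketch. The membership of model.divPart is taken from the paragraph above; the content model given for <div> is a deliberate simplification (it lists only that one class), and the function names are invented for the example.

```python
# A model class maps a class name to the elements allowed at that point.
MODEL_CLASSES = {
    "model.divPart": {"l", "lg", "list", "p", "quote", "sp"},
}

# Simplified, hypothetical content models: each element lists the model
# classes (or plain element names) permitted as its immediate children.
CONTENT_MODELS = {
    "div": ["model.divPart"],
}


def allowed_children(element):
    """Expand the model-class references in an element's content model."""
    allowed = set()
    for ref in CONTENT_MODELS.get(element, []):
        # A reference that is not a known class is treated as an element name.
        allowed |= MODEL_CLASSES.get(ref, {ref})
    return allowed


def invalid_children(element, children):
    """Return the child element names that the content model does not permit."""
    allowed = allowed_children(element)
    return [child for child in children if child not in allowed]
```

With these definitions, invalid_children("div", ["p", "w", "quote"]) reports only "w", since a word element is not a member of model.divPart.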

As noted above, this usage of model classes is a distinctive and pervasive feature of the TEI encoding scheme. Because the BNC derives from the TEI scheme, it uses the same names and (as far as is practicable) the same model classes throughout. Although this introduces an occasionally redundant degree of indirection in the resulting schema, it also makes clearer the relationship between the components defined for the BNC and their origins in the TEI scheme.

Finally, we define here a few macros for commonly encountered content models. These are also taken from the TEI encoding scheme, though in a few cases with different meanings. In the TEI, for example, the macro macro.phraseSeq is defined as a mixture of various ‘phrase level’ elements and plain text; in the BNC scheme, it has been redefined as plain text only. The places where this macro is referenced are, however, unchanged; in this respect, therefore, the BNC schema is a proper subset of the full TEI schema.
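The effect of redefining a macro while leaving its points of reference untouched can be sketched as follows. Everything here other than the name macro.phraseSeq is an assumption made for illustration: "model.phrase" stands in for the TEI phrase-level elements, "exampleElement" is a hypothetical element that refers to the macro, and "#text" simply denotes plain text.

```python
# Each macro names a reusable content-model fragment.
# "model.phrase" is a hypothetical stand-in for the TEI phrase-level elements.
TEI_MACROS = {"macro.phraseSeq": {"#text", "model.phrase"}}
BNC_MACROS = {"macro.phraseSeq": {"#text"}}

# Hypothetical element whose content model refers to the macro by name.
# The reference is identical under both schemes; only the macro body differs.
CONTENT_MODELS = {"exampleElement": ["macro.phraseSeq"]}


def expand(element, macros):
    """Resolve the macro references in an element's content model."""
    allowed = set()
    for ref in CONTENT_MODELS[element]:
        allowed |= macros.get(ref, {ref})
    return allowed


bnc_content = expand("exampleElement", BNC_MACROS)
tei_content = expand("exampleElement", TEI_MACROS)
```

Here bnc_content is a proper subset of tei_content, which is the sense in which content valid under the narrower BNC definition remains valid under the TEI one.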

The remainder of this section lists in alphabetical order all of the attribute classes, model classes, elements, and macros defined for the BNC encoding scheme, using a method of display similar to that of the full TEI Guidelines. For each component, we give a brief description and also a usage example. Note that many of the elements listed here appear only in the corpus header rather than in the texts, and may thus be safely disregarded by applications which operate on the texts alone or in isolation.
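An application of the kind just mentioned, one that works on the text and ignores the header, might look something like the following Python sketch. The sample document is invented, and the element names teiHeader and wtext are assumptions about the distributed files rather than names taken from this summary; only <bncDoc>, <s>, <w>, <c> and the c5 attribute are drawn from the descriptions given here.

```python
import xml.etree.ElementTree as ET

# Invented sample; teiHeader and wtext are assumed element names.
SAMPLE = """<bncDoc n="XYZ">
  <teiHeader><fileDesc><title>Invented sample</title></fileDesc></teiHeader>
  <wtext>
    <div><p>
      <s n="1"><w c5="AT0">The </w><w c5="NN1">corpus </w><w c5="VBZ">is </w><w c5="AJ0">large</w><c c5="PUN">.</c></s>
    </p></div>
  </wtext>
</bncDoc>"""


def tagged_words(root):
    """Yield (word, c5 code) for every <w> element outside the header."""
    for child in root:
        if child.tag == "teiHeader":
            continue  # header-only elements are disregarded
        for w in child.iter("w"):
            yield (w.text or "").strip(), w.get("c5")


words = list(tagged_words(ET.fromstring(SAMPLE)))
```

Punctuation is excluded here because it is carried by <c> rather than <w>; iterating over both tags would be a small change if punctuation is wanted.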

level of spontaneity or informality of the context as assessed by transcriber. high. medium. low. not applicable or unknown. natcorp@oucs.ox.ac.uk 13 Banbury Road, Oxford OX2 6NN,UK main country of residence where known. internal identifier. year of birth where known. Aubrey, Crispin This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions. contains any bibliographic reference, occurring either within the header of a written corpus text in which case it has a fixed substructure, or within the body of a corpus text, in which case it contains only <s> elements. British intelligence services in action. Lindsay, Kennedy Dunrod Press Dundalk, Ireland 1980 74-176 contains a significant punctuation mark as identified by the CLAWS tagger. the CLAWS 5 code associated with this punctuation mark. any separating punctuation mark. opening round or square parenthesis. closing round or square parenthesis. any quotation mark. ? provides a description for one category within the text taxonomies provided in the corpus header. Academic prose provides a list of codes identifying the categories to which this text has been assigned, each code referencing a category element declared in the corpus header. targets defines a single category within a taxonomy of texts. Fiction and verse contains verbatim text which has been corrected, or an empty string if the correction consists of an addition. a code identifying the agency responsible for making the correction. OUP. OUCS. Longman. existent supplies the date of a change data.date Tag usage updated for BNC-XML supplies the year of original composition, if known; or 000-00-00 if the date is unknown. Origination/creation date not known Original publisher: A & C Black (Publishers) Ltd, London supplies a standardized representation of the date. 
1991-02-16 1989 supplies explanatory text associated with a category or other component defined in the corpus header. Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium. supplies an additional name or number for this division, taken from the original source. for a spoken text, identifies the declarations (for setting, recording etc.) in the header which apply to this division. specifies the hierarchic level of this division as a number between 1 (outermost or largest division) and 4 (innermost or smallest). identifies the type or function of the division (for a written text). advertisement section or insert. appendix. single article in a journal. any kind of promotional front matter. cartoon. chapter of a novel etc. newspaper column, regular feature etc. composite material. table of contents. any kind of front matter. free-standing leaflet or pamphlet. an academic paper in a collection. subdivision of a chapter. separate recipe in a cookbook. any subdivision. sidebar or displayed paragraph e.g. in a news story. distinct story in a periodical or collection. smaller subdivision of any kind. So you want to be an Actor? Everyone who wants ... BNC XML Edition, December 2006 supplies an identifying number for the edition. BNC XML Edition, December 2006 supplies a number for the editor where multiple editors are specified for a single text. Boileau, John Material included in the BNC was produced by several different agencies ...
This element is supplied in the BNC corpus header only
The British National Corpus (BNC) Consortium was formed in 1990... Definitive information on the sampling policies... Material included in the BNC was produced by several different agencies ... Canonical references to the BNC should ... David Lee's register and domain classification. ... ...
Used in corpus header only
provides a brief description of the event. specifies the approximate size of the text, in orthographic words, <w> elements, and <s> elements. 432434 tokens; 432859 w-units; 26215 s-units indicates a point where material has been omitted from the transcription. briefly describes the material which has been omitted. gives further details of the reason for omission. contains any type of heading, for example the title of a section or a poem. describes the kind of heading. a major heading. any sub-heading. a sub-heading providing the name of a journalist or other source of a newspaper report. Do I need any training? Apple is to fruit as dog is to X . supplies an identifying code for a text. categorizes the code number used. the canonical three character text identifier. a superseded six-character identifier used during production of the BNC. KD7 XMa0KP internal identifier. John Murray (Publishers) Ltd London 1989 Substitute plain biscuits for filled or chocolate-covered ones... Try eating a small amount ... Fluid dynamics Fluids. Dynamics Next Day at Six before the Gate appears, The Wretch divided by his Hopes and Fears. Amount: 52153 Pounds Date Award Began: 01 January 1992 The language of the British National Corpus is modern British English. ... The language of the British National Corpus is modern British English. ...
Appears only in the corpus header.
Too jellied, viscous, floating a condition to inspire more action than a sigh —... Longman
This element is used only in the header.
This element is not used in the current release of the BNC: all elements belong to the empty namespace.
specifies where the note is placed in the original source. footnote. side note. endnote. internal identifier. The short is a film about sailing.... student indicates how the paragraph is displayed. the paragraph is displayed as a caption. the displayed paragraph contains a byline. the paragraph is displayed as a floating caption. the paragraph is displayed as an attached caption. BRAVE: Louise JOBLESS Darren St John gobbled 5lb of strawberries in two pints of chilli-flavoured gravy to raise £450 for charity at Henley, Oxon. in demographic texts, supplies the ‘respondent number’ used to identify the batch of tapes. 45 Terry british rail employee ...
The respondent number is also used to identify the tapes deposited with the British Library's sound archive.
Erm right now, ... gives the number of the page beginning here. North Yorkshire: York Norman specifies the age group to which the participant belongs. Under 15 years. 15 to 24 years. 25 to 34 years. 35 to 44 years. 45 to 59 years. Over 59 years. Unknown. specifies the dialect or accent of a participant's speech, as identified by the respondent. Canadian. No accent recorded. German. East Anglian. French. Home Counties . Humberside. Irish. Indian subcontinent. Lancashire. London. Central Midlands. Merseyside. North-east Midlands. Midlands. South Midlands. North-west Midlands. Central Northern England. North-east England. Northern England. Other or unidentifiable. Scottish. Lower south-west England. Central south-west England. Upper south-west England. European. American (US). Welsh. West Indian. specifies the country of origin of the participant, as identified by the respondent. Unknown. German. French. British English. North American English. Unknown Indian language. internal identifier. specifies the social class of the participant. Higher management: administrative or professional. Lower management: supervisory or clerical. Skilled manual. Semi-skilled or unskilled. Social class unknown. specifies the age at which the participant ceased full-time education. Still in education. Left school aged 14 or under. Education continued until age 19 or over. Unknown. specifies the sex of the participant. male. female. unknown. describes the relationship or role of this participant with respect to the respondent. 77 discrete values are used 55 Nola british rail employee The British National Corpus (BNC) Consortium was formed in 1990, and started work in 1991 on the three-year task of producing a hundred-million word corpus of modern British English for use in commercial and academic research. The first edition was published in 1994. ... W hansard Parliamentary debates This material is protected by international copyright laws and may not be copied or redistributed in any way. 
Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions.HHV HansrA Thrift has gone out of fashion.
Any bibliographic source or reference provided for the quotation may be included within the quote element.
date of the recording in standardized form. duration of the recording in seconds. tape number. time of day the recording was made. kind of recording. recording made directly to Digital Audio tape. recording made to Walkman tape. provides documentation for the reference system applicable to the corpus. Canonical references to the BNC should be constructed by taking the value of the n attribute of the <bncDoc> element containing the target text, and concatenating a dot separator, followed by the value of the n attribute of the target <s> element containing the material to be referenced. ... Text enrichment Unit for Computer Research into the English Language, University of Lancaster Tag usage updated for BNC-XML.. sequence number . Come in. Definitive information on the sampling policies applied during construction of the BNC is provided in the associated documentation... an internal identifier for a setting. Strathclyde: Glasgow doctor's surgery medical consultation Unknown analysts meeting speech Yeah ...The worst poverty: a history of debt and debtors. Barty-King, Hugh Alan Sutton Publishing Ltd Gloucester 1991 85-203 Several Hon. Members rose provides information about the XML elements actually used within a BNC text.... Text mode. Written Transcribed speech contains a word or phrase used to describe the topic or nature of a text. Parliamentary debates
Used to specify a single keyword or phrase
W hansard Parliamentary debates indicates the bibliographic level of this title. the title is an analytic title, rather than a monographic one. Amnesty International meeting. Sample containing about 15274 words speech recorded in public context An awfully big adventure. Bainbridge, B Duckworth & Company Ltd London 1990 49-192 So you want to be an actor?. Sample containing about 35817 words from a book (domain: arts) Data capture and transcription Oxford University Press supplies a simplified part-of-speech code. adjective. adverb. article. conjunction. interjection. preposition. pronoun. punctuation. substantive. unclassified or non-lexical word. verb. specifies the headword under which this lexical unit is conventionally grouped, where known. a lower-case only normalised form of a word I don't care bnc contains a distinct document within the corpus, either spoken or written. shall I get it or not? I don't know what to do Yes get it eh, me and your mother
In the BNC, each change of speaker is marked by a new <u> element.
provides a brief description of the vocal event. specifies the age in years of a recorded participant at the time of the recording in which they participate. 25 marks a temporal alignment point within transcribed speech. supplies an arbitrary identifier; all elements specifying the same value for this attribute are understood to be aligned with each other in time. A unique identifier in the form.... tell Billy that Hello now can you hear me contains an informal description of the regional variety of English used by a participant in a spoken text. Home Counties contains a multi-word unit as identified by CLAWS, that is, a sequence of individual tokens which function as a single unit and can be given a single part of speech code. in response to
In CLAWS output the components of a <mw> are given ‘ditto’ tags inherited from the parent <mw>. In BNC they have been given the same code as elsewhere in the corpus.
contains descriptive text appearing within components of a TEI header. For information, the conditions of the Standard License Agreement are as follows: contains any additional information supplied about a participant in a spoken text. May well be an actor portraying a Davidian supplies page numbers for a bibliographic citation. Misfortunes of Nigel. ... 67-173 Mr. Speaker I call Mr. Dennis Turner. Mr. Speaker I call Mr. Dennis Turner.
In the BNC, used only for speaker labels in dramatic texts, or Hansard
contains a single spoken text, i.e. a transcription or collection of transcriptions from a single source. specifies the type of spoken text. demographically sampled conversation. any other spoken text. contains one or more truncated words in transcribed speech. The , then he bo bowled contains a single written text. specifies the type of written text. Academic prose. Fiction and verse. Newspapers. Non-academic prose and biography. Other published materials. Unpublished materials. specifies additional information needed by XAIRA. specifies the indexing policy to be used for one or more attributes. identifies the attribute to which the indexing policy applies. specifies the required indexing policy. . no part of the attribute will be indexed. the attribute supplies an identifier which can be linked to from elsewhere. the value of the attribute is used as an identifier on some other element. the value of the attribute is used to supply a key for some predefined codebook or taxonomy of possible values. supplies any additional ICU-conformant collating rules to be used when sorting words in the corpus.
The format for collating rules is defined at http://icu.sourceforge.net/userguide/Collate_Customization.html
specifies the XAIRA indexing policy to be used for one or more elements. specifies the required indexing policy. . no part of the element or its content will be indexed. no part of the element will be indexed, but any child elements will be. only the text and child content of the element will be indexed. only the start and end tags for the element and any child elements not otherwise specified will be indexed. specifies the label to be generated for the parent reference. specifies when the new label is to be generated. a new label is generated when a new instance of the scope-defining element is found, and remains in force until the next such element is found. a new label is generated when a new instance of the scope-defining element is found, and remains in force until that instance ends. supplies a list of element names or attribute identifiers. supplies a list of element names carrying an attribute which has been specified with the XAIRA "joinTo" indexing policy. supplies any additional ICU-conformant rules to be used when tokenization is performed by XAIRA rather than by explicit XML markup. specifies where the XAIRA indexer is to find a value. indicates the kind of value to be found. element content. attribute content. content returned by one of the pseudo- functions count(), sysid(), or lineno() . indicates whether the value required should be casefolded or not. xsd:boolean false provides data needed to define one part of a XAIRA specification. indicates what is defined by this part of the specification. an element and its attributes. a lexical form. an additional key. a lemmatization scheme. a region. the reference used to identify a document in the corpus. the default scope for results obtained from a corpus. the reference used to identify some segmentation of a corpus document. the indexing policy for some corpus component. the default language for a corpus. any additional tokenization or collation rules for some language used in a corpus. 
contains a list of XAIRA parameters of a particular type. indicates the function of this part of the specification. lists and glosses the elements, attributes, and codebooks used in a corpus. specifies how items are to be indexed. specifies any predefined regions to be made available to the client. specifies any lemmatization schemes used. specifies how items are to be referenced. specifies any special indexing policies. specifies any language-specific rules. provides the definition for a single attribute. supplies the identifier of a previously-defined value list to be used at this point. the class of elements which carry an identifier which is unique across the whole corpus. provides the unique identifier for this element. the class of elements which describe other elements by means of their generic identifiers. supplies an element's generic identifier, or one of the codes * (meaning all elements), or name() meaning that the name of the referenced element is to be used rather than its value. supplies the namespace within which the generic identifier is to be found.
The values * and name() are used for ident as well.
the class of elements whose rendition has been recorded intermittently in the BNC. a code briefly characterising the way the element content was originally presented. bold weight font. boxed. superscript. italic and bold. italic superscript. italic subscript. italic font. italic and underlined. subscript. centre-aligned. roman within italic. strike-out. bold underlined. underlined. crossed-out.

elements which carry a CLAWS 5 Part of speech code. supplies the CLAWS 5 code associated with this word.

AJ0: Adjective (general or positive) (e.g. ‘good’, ‘old’, ‘beautiful’).
AJC: Comparative adjective (e.g. ‘better’, ‘older’).
AJS: Superlative adjective (e.g. ‘best’, ‘oldest’).
AT0: Article (e.g. ‘the’, ‘a’, ‘an’, ‘no’).
AV0: General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. ‘often’, ‘well’, ‘longer’ (adv.), ‘furthest’).
AVP: Adverb particle (e.g. ‘up’, ‘off’, ‘out’).
AVQ: Wh-adverb (e.g. ‘when’, ‘where’, ‘how’, ‘why’, ‘wherever’).
CJC: Coordinating conjunction (e.g. ‘and’, ‘or’, ‘but’).
CJS: Subordinating conjunction (e.g. ‘although’, ‘when’).
CJT: The subordinating conjunction ‘that’.
CRD: Cardinal number (e.g. ‘one’, ‘3’, ‘fifty-five’, ‘3609’).
DPS: Possessive determiner-pronoun (e.g. ‘your’, ‘their’, ‘his’).
DT0: General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0.
DTQ: Wh-determiner-pronoun (e.g. ‘which’, ‘what’, ‘whose’, ‘whichever’).
EX0: Existential there, i.e. ‘there’ occurring in the ‘there is’ ... or ‘there are’ ... construction.
ITJ: Interjection or other isolate (e.g. ‘oh’, ‘yes’, ‘mhm’, ‘wow’).
NN0: Common noun, neutral for number (e.g. ‘aircraft’, ‘data’, ‘committee’).
NN1: Singular common noun (e.g. ‘pencil’, ‘goose’, ‘time’, ‘revelation’).
NN2: Plural common noun (e.g. ‘pencils’, ‘geese’, ‘times’, ‘revelations’).
NP0: Proper noun (e.g. ‘London’, ‘Michael’, ‘Mars’, ‘IBM’).
ORD: Ordinal numeral (e.g. ‘first’, ‘sixth’, ‘77th’, ‘last’).
PNI: Indefinite pronoun (e.g. ‘none’, ‘everything’, ‘one’ [as pronoun], ‘nobody’).
PNP: Personal pronoun (e.g. ‘I’, ‘you’, ‘them’, ‘ours’).
PNQ: Wh-pronoun (e.g. ‘who’, ‘whoever’, ‘whom’).
PNX: Reflexive pronoun (e.g. ‘myself’, ‘yourself’, ‘itself’, ‘ourselves’).
POS: The possessive or genitive marker ‘'s’ or ‘'’.
PRF: The preposition ‘of’.
PRP: Preposition (except for ‘of’) (e.g. ‘about’, ‘at’, ‘in’, ‘on’, ‘on behalf of’, ‘with’).
TO0: Infinitive marker ‘to’.
UNC: Unclassified items which are not appropriately considered as items of the English lexicon.
VBB: The present tense forms of the verb BE, except for ‘is’, ‘'s’: i.e. ‘am’, ‘are’, ‘'m’, ‘'re’ and ‘be’ [subjunctive or imperative].
VBD: The past tense forms of the verb BE: ‘was’ and ‘were’.
VBG: The -ing form of the verb BE: ‘being’.
VBI: The infinitive form of the verb BE: ‘be’.
VBN: The past participle form of the verb BE: ‘been’.
VBZ: The -s form of the verb BE: ‘is’, ‘'s’.
VDB: The finite base form of the verb DO: ‘do’.
VDD: The past tense form of the verb DO: ‘did’.
VDG: The -ing form of the verb DO: ‘doing’.
VDI: The infinitive form of the verb DO: ‘do’.
VDN: The past participle form of the verb DO: ‘done’.
VDZ: The -s form of the verb DO: ‘does’, ‘'s’.
VHB: The finite base form of the verb HAVE: ‘have’, ‘'ve’.
VHD: The past tense form of the verb HAVE: ‘had’, ‘'d’.
VHG: The -ing form of the verb HAVE: ‘having’.
VHI: The infinitive form of the verb HAVE: ‘have’.
VHN: The past participle form of the verb HAVE: ‘had’.
VHZ: The -s form of the verb HAVE: ‘has’, ‘'s’.
VM0: Modal auxiliary verb (e.g. ‘will’, ‘would’, ‘can’, ‘could’, ‘'ll’, ‘'d’).
VVB: The finite base form of lexical verbs (e.g. ‘forget’, ‘send’, ‘live’, ‘return’) [Including the imperative and present subjunctive].
VVD: The past tense form of lexical verbs (e.g. ‘forgot’, ‘sent’, ‘lived’, ‘returned’).
VVG: The -ing form of lexical verbs (e.g. ‘forgetting’, ‘sending’, ‘living’, ‘returning’).
VVI: The infinitive form of lexical verbs (e.g. ‘forget’, ‘send’, ‘live’, ‘return’).
VVN: The past participle form of lexical verbs (e.g. ‘forgotten’, ‘sent’, ‘lived’, ‘returned’).
VVZ: The -s form of lexical verbs (e.g. ‘forgets’, ‘sends’, ‘lives’, ‘returns’).
XX0: The negative particle ‘not’ or ‘n't’.
ZZ0: Alphabetical symbols (e.g. ‘A’, ‘a’, ‘B’, ‘b’, ‘c’, ‘d’).

AJ0-AV0: Probably AJ0 (adjective), but maybe AV0 (adverb).
AJ0-NN1: Probably AJ0 (adjective), but maybe NN1 (singular noun).
AJ0-VVD: Probably AJ0 (adjective), but maybe VVD (verb past tense).
AJ0-VVG: Probably AJ0 (adjective), but maybe VVG (-ing verb).
AJ0-VVN: Probably AJ0 (adjective), but maybe VVN (verb past participle).
AV0-AJ0: Probably AV0 (adverb), but maybe AJ0 (adjective).
AVP-PRP: Probably AVP (adverb particle), but maybe PRP (preposition).
AVQ-CJS: Probably AVQ (wh- adverb), but maybe CJS (subordinating conjunction).
CJS-AVQ: Probably CJS (subordinating conjunction), but maybe AVQ (wh- adverb).
CJS-PRP: Probably CJS (subordinating conjunction), but maybe PRP (preposition).
CJT-DT0: Probably CJT ("that" as conjunction), but maybe DT0 (determiner).
CRD-PNI: Probably CRD (number), but maybe PNI (indefinite pronoun).
DT0-CJT: Probably DT0 (determiner), but maybe CJT ("that" as conjunction).
NN1-AJ0: Probably NN1 (singular noun), but maybe AJ0 (adjective).
NN1-NP0: Probably NN1 (singular noun), but maybe NP0 (proper noun).
NN1-VVB: Probably NN1 (singular noun), but maybe VVB (verb).
NN1-VVG: Probably NN1 (singular noun), but maybe VVG (-ing verb).
NN2-VVZ: Probably NN2 (plural noun), but maybe VVZ (-s verb).
NP0-NN1: Probably NP0 (proper noun), but maybe NN1 (singular noun).
PNI-CRD: Probably PNI (indefinite pronoun), but maybe CRD (number).
PRP-AVP: Probably PRP (preposition), but maybe AVP (adverb particle).
PRP-CJS: Probably PRP (preposition), but maybe CJS (subordinating conjunction).
VVB-NN1: Probably VVB (verb), but maybe NN1 (singular noun).
VVD-AJ0: Probably VVD (verb past tense), but maybe AJ0 (adjective).
VVD-VVN: Probably VVD (verb past tense), but maybe VVN (verb past participle).
VVG-AJ0: Probably VVG (-ing verb), but maybe AJ0 (adjective).
VVG-NN1: Probably VVG (-ing verb), but maybe NN1 (singular noun).
VVN-AJ0: Probably VVN (verb past participle), but maybe AJ0 (adjective).
VVN-VVD: Probably VVN (verb past participle), but maybe VVD (verb past tense).
VVZ-NN2: Probably VVZ (-s verb), but maybe NN2 (plural noun).

indicates the duration of the element in seconds.

groups elements which can appear either within or between paragraphs. In the BNC, the members of this class are the segments identified by CLAWS, i.e. the sentence, word, punctuation, and multiword units.
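
The rule for canonical references quoted in the reference-system description above (the n value of the <bncDoc>, a dot, then the n value of the target <s>) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sample document is invented, the element name stext is an assumption, and the attribute holding the text identifier is left as a parameter (doc_attr, a name chosen for this example) in case the files being processed carry it elsewhere.

```python
import xml.etree.ElementTree as ET

# Invented sample; "KD7" is borrowed from the example identifier quoted above,
# and stext is an assumed element name.
SAMPLE = """<bncDoc n="KD7">
  <stext>
    <u><s n="1"><w c5="ITJ">Yes </w></s><s n="2"><w c5="VVB">get </w><w c5="PNP">it</w></s></u>
  </stext>
</bncDoc>"""


def canonical_refs(root, doc_attr="n"):
    """Build one reference per <s>: text identifier, dot, s-unit number."""
    doc_id = root.get(doc_attr)
    return [f"{doc_id}.{s.get('n')}" for s in root.iter("s")]


refs = canonical_refs(ET.fromstring(SAMPLE))
```

For the sample above, refs contains one entry for each of the two <s> elements, in document order.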
