Formal Specification of the BNC XML schema
The structure of the XML edition of the British National Corpus is described by a single XML schema, which is expressed in three different schema languages: the traditional DTD language which XML inherits from SGML; the more recently defined ISO schema language known as RELAX NG; and the schema language defined by the W3C. The three schema files are all generated from the same TEI-conformant XML source file, which is also used to generate the present documentation.
This section of the document contains the TEI-conformant reference specification for all components of the BNC schema. These include definitions for attribute classes, model classes, and macro patterns as well as definitions for elements and their associated attributes and possible value lists. A full description of these concepts and how they are used to define and document XML encoding schemes is given by the TEI Guidelines (in particular, in chapter TD); the following summary provides only basic information about them.
When several elements in a schema share attributes of the same name, with values drawn from a common set, they are considered to form an attribute class. The members of such a class can then all reference the same class definition rather than each repeating the same information. In the BNC, for example, the elements <bibl>, <corr>, <div>, <head>, <hi>, and half a dozen others all have the same attribute @rend, which takes a coded value drawn from the same short list of possibilities. Rather than repeat this definition for each element, therefore, the relevant elements are all said to be members of a class att.rendered, which is defined independently of those elements (but includes a list of its members). In the same way, the <w> and <mw> elements, as members of the att.c5coded class, share the same definition for the possible CLAWS5 codes specified by their @c5 attribute. Note however that the element <c>, although it has an attribute @c5, is not a member of this class because the possible values for this attribute on this element are entirely different.
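The idea of an attribute class can be illustrated with a minimal Python sketch. It is not part of the BNC schema or its tooling: the value sets are small samples rather than the full lists, the function names are hypothetical, and the punctuation codes PUN, PUL, PUR and PUQ for <c> are an assumption based on the CLAWS 5 tagset.

```python
# Sketch of an attribute class: one shared definition, referenced by its members.
# The value set is a small illustrative sample, not the full BNC list.
ATT_C5CODED = {
    "members": {"w", "mw"},
    "attribute": "c5",
    "values": {"AJ0", "AV0", "NN1", "NN2", "NP0", "VVB", "VVD", "VVG", "VVN", "VVZ"},
}

# <c> also carries @c5 but is not a class member: it has its own value list.
# PUN/PUL/PUR/PUQ are assumed here to be the punctuation codes.
C_ELEMENT_C5_VALUES = {"PUN", "PUL", "PUR", "PUQ"}


def allowed_c5_values(element_name):
    """Return the set of @c5 values permitted on the named element, or None."""
    if element_name in ATT_C5CODED["members"]:
        return ATT_C5CODED["values"]
    if element_name == "c":
        return C_ELEMENT_C5_VALUES
    return None


def is_valid_c5(element_name, code):
    """True if `code` is a permitted @c5 value on `element_name`."""
    allowed = allowed_c5_values(element_name)
    return allowed is not None and code in allowed
```

Because <w> and <mw> both resolve to the same set object, a change to the class definition applies to every member at once, which is the point of defining the class separately.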
In any reasonably large schema, and particularly one derived from the TEI model, several elements are likely to have very similar content models, since it will often be the case that at a given point in the document hierarchy any one of several possible elements will be permissible. The specific subset of elements (<w>, <mw>, <c> and a few others) which may appear within an <s> element in the BNC is different from the subsets of elements which may appear within a <p> or <div> element. However, there are several elements which can appear in the same places as a <p>. Following TEI practice, we call the set of elements which can appear together (in sequence or alternation) at a specific place in the document hierarchy a model class. For example, since <l>, <lg>, <list>, <p>, <quote>, and <sp> are all permitted as immediate components of a <div> element, we define a class model.divPart, of which these six elements are all members. Wherever convenient, content models are defined in terms of these model classes.
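As a rough illustration of how a model class might be used when checking content, the following Python sketch tests the direct children of a <div> against model.divPart. The six class members come from the text above; the simplified content model (which also admits <head> and nested <div>), the function name, and the sample fragment are assumptions made for the example, and the real content model for <div> may differ.

```python
import xml.etree.ElementTree as ET

# The six members of model.divPart named in the text above.
MODEL_DIV_PART = {"l", "lg", "list", "p", "quote", "sp"}

# Hypothetical, simplified content model: a <div> may contain <head>, nested
# <div> elements, and any member of model.divPart. The real BNC content model
# for <div> may allow other elements as well.
DIV_CONTENT = MODEL_DIV_PART | {"head", "div"}


def unexpected_div_children(div):
    """Return the tag names of direct children of `div` outside DIV_CONTENT."""
    return [child.tag for child in div if child.tag not in DIV_CONTENT]


# Invented fragment with one child (<s>) that the simplified model rejects.
sample = ET.fromstring(
    "<div><head>Title</head><p>Text</p><quote>Q</quote><s>stray</s></div>"
)
problems = unexpected_div_children(sample)
```

Defining the content model in terms of the class means that adding a member to model.divPart changes what every referencing element accepts, without editing each content model separately.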
As noted above, this usage of model classes is a distinctive and pervasive feature of the TEI encoding scheme. Because the BNC derives from the TEI scheme, it uses the same names and (as far as is practicable) the same model classes throughout. Although this introduces an occasionally redundant degree of indirection in the resulting schema, it also makes clearer the relationship between the components defined for the BNC and their origins in the TEI scheme.
Finally, we define here a few macros for commonly encountered content models. These are also taken from the TEI encoding scheme, though in a few cases with different meanings. In the TEI, for example, the macro macro.phraseSeq is defined as a mixture of various ‘phrase level’ elements and plain text; in the BNC scheme, it has been redefined as plain text only. The places where this macro is referenced, however, are unchanged; in this respect, therefore, the BNC schema is a proper subset of the full TEI schema.
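The effect of redefining a macro while leaving its references untouched can be sketched in Python as follows. This is only an analogy: the phrase-level names listed for the TEI version and the two elements that reference the macro are hypothetical placeholders, not the actual TEI or BNC definitions.

```python
# Sketch: a macro is a named content model that element definitions reference.
# The phrase-level names in the TEI version are illustrative placeholders.
TEI_MACROS = {"macro.phraseSeq": {"text", "hi", "corr", "gap"}}
BNC_MACROS = {"macro.phraseSeq": {"text"}}  # redefined as plain text only

# Hypothetical elements that reference the macro; the references themselves
# are the same in both schemes.
CONTENT_MODELS = {"title": "macro.phraseSeq", "publisher": "macro.phraseSeq"}


def allowed_content(element, macros):
    """Resolve the macro an element references to its set of permitted content."""
    return macros[CONTENT_MODELS[element]]
```

Under these assumptions, whatever the BNC definition permits for a given element is also permitted by the TEI definition, which is the sense in which one scheme is a subset of the other.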
The remainder of this section lists in alphabetical order all of the attribute classes, model classes, elements, and macros defined for the BNC encoding scheme, using a method of display similar to that of the full TEI Guidelines. For each component, we give a brief description and a usage example. Note that many of the elements listed here appear only in the corpus header rather than in the texts, and may thus be safely disregarded by applications which operate on the texts alone.
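An application that works on the texts alone might skip the header along the following lines. This is a minimal sketch using Python's standard library: the element names bncDoc, teiHeader and wtext, the miniature sample document, and the function name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document. The element names bncDoc, teiHeader and
# wtext, and the sample words, are assumptions made for this illustration.
SAMPLE = """
<bncDoc>
  <teiHeader><fileDesc><title>Sample</title></fileDesc></teiHeader>
  <wtext>
    <div level="1">
      <p><s n="1"><w c5="AT0">The </w><w c5="NN1">corpus</w><c c5="PUN">.</c></s></p>
    </div>
  </wtext>
</bncDoc>
"""


def words_outside_header(root, header_tag="teiHeader"):
    """Yield (c5 code, text) for each <w> that is not inside the header."""
    for child in root:
        if child.tag == header_tag:
            continue  # header-only elements are disregarded
        for w in child.iter("w"):
            yield w.get("c5"), (w.text or "").strip()


root = ET.fromstring(SAMPLE)
tokens = list(words_outside_header(root))
```

Only the top-level children of the root are inspected when deciding what to skip, so the header is never traversed at all.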
level of spontaneity or informality of the context as assessed by transcriber. high. medium. low. not applicable or unknown. natcorp@oucs.ox.ac.uk 13 Banbury Road, Oxford OX2 6NN,UK main country of residence where known. internal identifier. year of birth where known. Aubrey, Crispin This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions. contains any bibliographic reference, occurring either within the header of a written corpus text in which case it has a fixed substructure, or within the body of a corpus text, in which case it contains only <s> elements. British intelligence services in action. Lindsay, Kennedy Dunrod Press Dundalk, Ireland 1980 74-176 contains a significant punctuation mark as identified by the CLAWS tagger. the CLAWS 5 code associated with this punctuation mark. any separating punctuation mark. opening round or square parenthesis. closing round or square parenthesis. any quotation mark. ? provides a description for one category within the text taxonomies provided in the corpus header. Academic prose provides a list of codes identifying the categories to which this text has been assigned, each code referencing a category element declared in the corpus header. targets defines a single category within a taxonomy of texts. Fiction and verse contains verbatim text which has been corrected, or an empty string if the correction consists of an addition. a code identifying the agency responsible for making the correction. OUP. OUCS. Longman. existent supplies the date of a change data.date Tag usage updated for BNC-XML supplies the year of original composition, if known; or 000-00-00 if the date is unknown. Origination/creation date not known Original publisher: A & C Black (Publishers) Ltd, London supplies a standardized representation of the date. 
1991-02-16 1989 supplies explanatory text associated with a category or other component defined in the corpus header. Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium. supplies an additional name or number for this division, taken from the original source. for a spoken text, identifies the declarations (for setting, recording etc.) in the header which apply to this division. specifies the hierarchic level of this division as a number between 1 (outermost or largest division) and 4 (innermost or smallest). identifies the type or function of the division (for a written text). advertisement section or insert. appendix. single article in a journal. any kind of promotional front matter. cartoon. chapter of a novel etc. newspaper column, regular feature etc. composite material. table of contents. any kind of front matter. free-standing leaflet or pamphlet. an academic paper in a collection. subdivision of a chapter. separate recipe in a cookbook. any subdivision. sidebar or displayed paragraph e.g. in a news story. distinct story in a periodical or collection. smaller subdivision of any kind. So you want to be an Actor? Everyone who wants ... BNC XML Edition, December 2006 supplies an identifying number for the edition. BNC XML Edition, December 2006 supplies a number for the editor where multiple editors are specified for a single text. Boileau, John Material included in the BNC was produced by several different agencies ... The British National Corpus (BNC) Consortium was formed in 1990... Definitive information on the sampling policies... Material included in the BNC was produced by several different agencies ... Canonical references to the BNC should ... David Lee's register and domain classification. ... ... provides a brief description of the event. specifies the approximate size of the text, in orthographic words, <w> elements, and <s> elements.
432434 tokens; 432859 w-units; 26215 s-units indicates a point where material has been omitted from the transcription. briefly describes the material which has been omitted. gives further details of the reason for omission. contains any type of heading, for example the title of a section or a poem. describes the kind of heading. a major heading. any sub-heading. a sub-heading providing the name of a journalist or other source of a newspaper report. Do I need any training? Apple is to fruit as dog is to X . supplies an identifying code for a text. categorizes the code number used. the canonical three character text identifier. a superceded six-character identifier used during production of the BNC. KD7 XMa0KP internal identifier. John Murray (Publishers) Ltd London 1989 Substitute plain biscuits for filled or chocolate-covered ones... Try eating a small amount ... Fluid dynamics Fluids. Dynamics Next Day at Six before the Gate appears, The Wretch divided by his Hopes and Fears. Amount: 52153 Pounds Date Award Began: 01 January 1992 The language of the British National Corpus is modern British English. ... The language of the British National Corpus is modern British English. ... Too jellied, viscous, floating a condition to inspire more action than a sigh —... Longman*
(meaning all elements), or
name()
meaning that the name of the referenced element is
to be used rather than its value.
supplies the namespace within which the generic identifier is to be
found.
the class of elements whose rendition has been recorded
intermittently in the BNC.
a code briefly characterising the way the element content was originally
presented.
bold weight font.
boxed.
superscript.
italic and bold.
italic superscript.
italic subscript.
italic font.
italic and underlined.
subscript.
centre-aligned.
roman within italic.
strike-out.
bold underlined.
underlined.
crossed-out.
elements which carry a CLAWS 5 Part of speech code.
supplies the CLAWS 5 code associated with this word.
Adjective (general or positive) (e.g. ‘good’, ‘old’, ‘beautiful’).
Comparative adjective (e.g. ‘better’, ‘older’).
Superlative adjective (e.g. ‘best’, ‘oldest’).
Article (e.g. ‘the’, ‘a’, ‘an’, ‘no’).
General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. ‘often’, ‘well’, ‘longer’ (adv.), ‘furthest’).
Adverb particle (e.g. ‘up’, ‘off’, ‘out’).
Wh-adverb (e.g. ‘when’, ‘where’, ‘how’, ‘why’, ‘wherever’).
Coordinating conjunction (e.g. ‘and’, ‘or’, ‘but’).
Subordinating conjunction (e.g. ‘although’, ‘when’).
The subordinating conjunction ‘that’.
Cardinal number (e.g. ‘one’, ‘3’, ‘fifty-five’, ‘3609’).
Possessive determiner-pronoun (e.g. ‘your’, ‘their’, ‘his’).
General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0.
Wh-determiner-pronoun (e.g. ‘which’, ‘what’, ‘whose’, ‘whichever’).
Existential there, i.e. ‘there’ occurring in the ‘there is’ ... or ‘there are’ ... construction.
Interjection or other isolate (e.g. ‘oh’, ‘yes’, ‘mhm’, ‘wow’).
Common noun, neutral for number (e.g. ‘aircraft’, ‘data’, ‘committee’).
Singular common noun (e.g. ‘pencil’, ‘goose’, ‘time’, ‘revelation’).
Plural common noun (e.g. ‘pencils’, ‘geese’, ‘times’, ‘revelations’).
Proper noun (e.g. ‘London’, ‘Michael’, ‘Mars’, ‘IBM’).
Ordinal numeral (e.g. ‘first’, ‘sixth’, ‘77th’, ‘last’).
Indefinite pronoun (e.g. ‘none’, ‘everything’, ‘one’ [as pronoun], ‘nobody’).
Personal pronoun (e.g. ‘I’, ‘you’, ‘them’, ‘ours’).
Wh-pronoun (e.g. ‘who’, ‘whoever’, ‘whom’).
Reflexive pronoun (e.g. ‘myself’, ‘yourself’, ‘itself’, ‘ourselves’).
The possessive or genitive marker ‘'s’ or ‘'’.
The preposition ‘of’.
Preposition (except for ‘of’) (e.g. ‘about’, ‘at’, ‘in’, ‘on’, ‘on behalf of’, ‘with’).
Infinitive marker ‘to’.
Unclassified items which are not appropriately considered as items of the English lexicon.
The present tense forms of the verb BE, except for ‘is’, ‘'s’: i.e. ‘am’, ‘are’, ‘'m’, ‘'re’ and ‘be’ [subjunctive or imperative].
The past tense forms of the verb BE: ‘was’ and ‘were’.
The -ing form of the verb BE: ‘being’.
The infinitive form of the verb BE: ‘be’.
The past participle form of the verb BE: ‘been’.
The -s form of the verb BE: ‘is’, ‘'s’.
The finite base form of the verb DO: ‘do’.
The past tense form of the verb DO: ‘did’.
The -ing form of the verb DO: ‘doing’.
The infinitive form of the verb DO: ‘do’.
The past participle form of the verb DO: ‘done’.
The -s form of the verb DO: ‘does’, ‘'s’.
The finite base form of the verb HAVE: ‘have’, ‘'ve’.
The past tense form of the verb HAVE: ‘had’, ‘'d’.
The -ing form of the verb HAVE: ‘having’.
The infinitive form of the verb HAVE: ‘have’.
The past participle form of the verb HAVE: ‘had’.
The -s form of the verb HAVE: ‘has’, ‘'s’.
Modal auxiliary verb (e.g. ‘will’, ‘would’, ‘can’, ‘could’, ‘'ll’, ‘'d’).
The finite base form of lexical verbs (e.g. ‘forget’, ‘send’, ‘live’, ‘return’) [Including the imperative and present subjunctive].
The past tense form of lexical verbs (e.g. ‘forgot’, ‘sent’, ‘lived’, ‘returned’).
The -ing form of lexical verbs (e.g. ‘forgetting’, ‘sending’, ‘living’, ‘returning’).
The infinitive form of lexical verbs (e.g. ‘forget’, ‘send’, ‘live’, ‘return’).
The past participle form of lexical verbs (e.g. ‘forgotten’, ‘sent’, ‘lived’, ‘returned’).
The -s form of lexical verbs (e.g. ‘forgets’, ‘sends’, ‘lives’, ‘returns’).
The negative particle ‘not’ or ‘n't’.
Alphabetical symbols (e.g. ‘A’, ‘a’, ‘B’, ‘b’, ‘c’, ‘d’).
Probably AJ0 (adjective), but maybe AV0 (adverb).
Probably AJ0 (adjective), but maybe NN1 (singular noun).
Probably AJ0 (adjective), but maybe VVD (verb past tense).
Probably AJ0 (adjective), but maybe VVG (-ing verb).
Probably AJ0 (adjective), but maybe VVN (verb past participle).
Probably AV0 (adverb), but maybe AJ0 (adjective).
Probably AVP (adverb particle), but maybe PRP (preposition).
Probably AVQ (wh- adverb), but maybe CJS (subordinating conjunction).
Probably CJS (subordinating conjunction), but maybe AVQ (wh- adverb).
Probably CJS (subordinating conjunction), but maybe PRP (preposition).
Probably CJT ("that" as conjunction), but maybe DT0 (determiner).
Probably CRD (number), but maybe PNI (indefinite pronoun).
Probably DT0 (determiner), but maybe CJT ("that" as conjunction).
Probably NN1 (singular noun), but maybe AJ0 (adjective).
Probably NN1 (singular noun), but maybe NP0 (proper noun).
Probably NN1 (singular noun), but maybe VVB (verb).
Probably NN1 (singular noun), but maybe VVG (-ing verb).
Probably NN2 (plural noun), but maybe VVZ (-s verb).
Probably NP0 (proper noun), but maybe NN1 (singular noun).
Probably PNI (indefinite pronoun), but maybe CRD (number).
Probably PRP (preposition), but maybe AVP (adverb particle).
Probably PRP (preposition), but maybe CJS (subordinating conjunction).
Probably VVB (verb), but maybe NN1 (singular noun).
Probably VVD (verb past tense), but maybe AJ0 (adjective).
Probably VVD (verb past tense), but maybe VVN (verb past participle).
Probably VVG (-ing verb), but maybe AJ0 (adjective).
Probably VVG (-ing verb), but maybe NN1 (singular noun).
Probably VVN (verb past participle), but maybe AJ0 (adjective).
Probably VVN (verb past participle), but maybe VVD (verb past tense).
Probably VVZ (-s verb), but maybe NN2 (plural noun).
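The ‘probably X, but maybe Y’ values above pair two codes. A minimal Python sketch for separating them follows, on the assumption (not stated in this section) that such a value is written as the two codes joined by a hyphen, with the more probable code first, as in AJ0-AV0. The function name is hypothetical.

```python
def split_c5(code):
    """Split a c5 value into (most probable tag, less probable tag or None).

    Assumes ambiguity values are written as two codes joined by a hyphen,
    more probable code first; an unambiguous value has no hyphen.
    """
    first, sep, second = code.partition("-")
    return first, (second if sep else None)
```

An application that wants a single tag per word could keep only the first item of the pair, at the cost of discarding the tagger's recorded uncertainty.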
indicates the duration of the element in seconds.
groups elements which can appear either within or between
paragraphs.
In the BNC, the members of this class are the segments
identified by CLAWS, i.e. the sentence, word, punctuation, and
multiword units.