Basic structure
The mark-up scheme chosen for the British National Corpus is an application of ISO 8879, the Standard Generalized Mark-Up Language. This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. It is also a superset of the language XML, the extensible markup language currently proposed by the World Wide Web Consortium for general use on the World Wide Web. A brief summary of the encoding format used in the BNC to represent SGML constructs is given in section Markup conventions below; more detailed information about SGML and XML is readily available in many places.
The original BNC encoding format was strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations, where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names and models. In the present edition of the BNC, the tagging scheme has been changed to conform as far as possible with the published Recommendations of the TEI. Unless otherwise stated, elements used here have the same meaning as those of the published TEI scheme. More information about the relationship between the BNC's markup and both its original CDIF format and the TEI standard is given in section ??.
Section Basic structure describes the basic structure of the British National Corpus, in terms of the SGML elements distinguished and the tags used to mark them up. Section ?? describes the elements which are peculiar to written texts, and section ?? those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only.
Section ?? describes the structure of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself. Sections ?? and ?? informally describe the elements specific to written and to spoken texts respectively. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged. A list of elements actually used in the whole corpus is given below in ??.
Markup conventions
The BNC texts use the ‘reference concrete syntax’ of SGML, in which all elements are delimited by the use of tags. There are two forms of tag, a start-tag, marking the beginning of an element, and an end-tag marking its end. Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.
For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag in the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.
Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equal sign and the attribute value, which may be a number, a string literal or a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.
For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start tag <head type="main">, and a subheading with a start tag <head type="sub">.
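The tag syntax described above can be illustrated with a short Python sketch. It is not part of the BNC distribution: the function name parse_tag and both regular expressions are assumptions made for illustration, and they cover only the simple start-tag and end-tag forms discussed in this section.

```python
import re

# A start-tag or end-tag: "<", an optional solidus, the generic identifier,
# any attribute specifications, then ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")
# An attribute specification: name, equals sign, then a quoted or bare value.
ATTR_RE = re.compile(r'([A-Za-z][A-Za-z0-9]*)\s*=\s*(?:"([^"]*)"|([^\s"<>]+))')


def parse_tag(tag):
    """Split one tag into (kind, generic identifier, attribute dict)."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a tag: {tag!r}")
    solidus, gi, rest = match.groups()
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(rest):
        # Exactly one of the two value groups is non-empty for a given match.
        attrs[name] = quoted if quoted else bare
    kind = "end" if solidus else "start"
    return kind, gi, attrs


print(parse_tag('<head type="main">'))  # ('start', 'head', {'type': 'main'})
print(parse_tag("</head>"))             # ('end', 'head', {})
```

The sketch accepts both the quoted and the unquoted attribute value forms, since both occur in the corpus.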
In XML (but not always in SGML), case is significant in all tag or attribute names. A consistent style has been adopted throughout the corpus. This style uses lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized.
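The naming style just described can be expressed as a small Python function. This is only an illustrative sketch: the name make_identifier is an assumption, and the word lists passed to it are guesses at how existing identifiers such as teiHeader were derived.

```python
def make_identifier(words):
    """Join one or more words into an identifier in the corpus naming style.

    The first word is lower-cased; the first letter of the second and any
    subsequent word is capitalized. An empty list raises ValueError.
    """
    first, *rest = [word.lower() for word in words]
    return first + "".join(word.capitalize() for word in rest)


print(make_identifier(["head"]))               # head
print(make_identifier(["tei", "header"]))      # teiHeader
print(make_identifier(["editorial", "decl"]))  # editorialDecl
```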
SGML (but not XML) permits various kinds of minimization, or abbreviatory conventions. Only two such are used: end-tag omission and attribute-name omission. These conventions apply only to the elements <s>, <w> and <c> (i.e., for sentences, words, and punctuation).
For all other non-empty elements, every occurrence in the distributed form of the corpus has both a start-tag and an end-tag, and any attributes specified are supplied in the form attribute name=value (in the body of the texts), or attribute name="value" (in the headers). For the elements <s>, <w> and <c>, and all empty elements, end-tags are routinely omitted. For these three elements only, attribute values are given without any associated attribute name. See section Segments and words for some examples.
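The effect of these two minimization conventions on processing can be sketched in Python. The sample line below is invented for illustration and is not taken from the corpus; the function name tokens and the regular expression are likewise assumptions. Because end-tags are omitted, the content of each <w> or <c> element is taken to run up to the next tag.

```python
import re

# A minimized <w> or <c> start-tag carries a bare attribute value (the
# word-class code) with no attribute name; the element content follows and,
# since the end-tag is omitted, extends to the next "<".
TOKEN_RE = re.compile(r"<([wc]) ([^<>\s]+)>([^<]*)")


def tokens(segment):
    """Return (element name, word-class code, text) for each <w> or <c>."""
    return [(gi, code, text.strip())
            for gi, code, text in TOKEN_RE.findall(segment)]


# Invented sample showing only the <w> and <c> elements of one segment.
sample = "<w AT0>The <w NN1>summer <w VBD>was <w AJ0>hot<c PUN>."
for item in tokens(sample):
    print(item)
```

A full SGML parser would infer the omitted end-tags from the document grammar; the regular expression here simply relies on the fact that <w> and <c> elements contain no nested tags in this sample.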
In the present release of the corpus, the headers are marked up using XML: this means that empty-tags take a slightly different form and that attribute values are always quoted.
Only a restricted range of characters is used in element content: specifically, the upper- and lower-case alphabetics, digits, and a subset of the common punctuation marks. All other characters are represented by SGML entity references, which take the form of an ampersand (&) followed by a mnemonic for the character, and terminated by a semicolon (;) where this is necessary to resolve ambiguity.
For example, the character £ is represented by the string &pound;, the character é by the string &eacute;, and so forth. The French word ‘été’ (summer), if it appeared in the corpus, would be represented as &eacute;t&eacute;.
Finally, although this is not mandated by either XML or SGML, in the present form of the corpus, tags are never broken across linebreaks. Additionally, an attempt has been made to avoid linebreaks within the content of a single <s> element, so as to simplify processing of the text.
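The entity-reference convention described above can be illustrated with a minimal Python sketch. The table ENTITIES holds only three standard entity names as a sample and is not the corpus's full entity set; the function name decode is an assumption. The terminating semicolon is treated as optional, as in the description above.

```python
import re

# A small sample of entity names; the full set used in the corpus is larger.
ENTITIES = {"pound": "£", "eacute": "é", "ndash": "–"}

# An ampersand, a mnemonic name, and an optional terminating semicolon.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")


def decode(text):
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match):
        return ENTITIES.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(replace, text)


print(decode("&eacute;t&eacute;"))  # été
print(decode("&pound;5"))           # £5
```

Note that a reference such as &eacute immediately followed by a letter cannot be decoded without its semicolon, which is why the semicolon is required wherever the name would otherwise be ambiguous.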
Global attributes
Corpus and text elements
The British National Corpus contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.
In SGML terms, the British National Corpus consists of a single SGML element, tagged <bnc>. This element contains a single <teiHeader> element, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <teiHeader>, followed by either a <text> element (for written texts) or an <stext> element (for spoken texts). The last named element is an extension of the TEI scheme, but the others are all standard TEI elements, possibly renamed as permitted by the TEI scheme.
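The containment rules just described can be checked mechanically. The following Python sketch does so for a toy document written as XML; the sample, the function name check_structure, and the wording of the messages are all assumptions, and the distributed corpus texts themselves are SGML rather than XML, so this illustrates the structure only.

```python
import xml.etree.ElementTree as ET

# A toy skeleton: one corpus header, one written text and one spoken text.
SAMPLE = """
<bnc>
  <teiHeader/>
  <bncDoc><teiHeader/><text/></bncDoc>
  <bncDoc><teiHeader/><stext/></bncDoc>
</bnc>
"""


def check_structure(root):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if root.tag != "bnc":
        problems.append(f"root element is {root.tag}, expected bnc")
    children = list(root)
    if not children or children[0].tag != "teiHeader":
        problems.append("corpus-level teiHeader missing")
    for position, doc in enumerate(children[1:], start=1):
        if doc.tag != "bncDoc":
            problems.append(f"child {position}: expected bncDoc, found {doc.tag}")
            continue
        tags = [child.tag for child in doc]
        # Each bncDoc holds its own header, then exactly one text or stext.
        if len(tags) != 2 or tags[0] != "teiHeader" or tags[1] not in ("text", "stext"):
            problems.append(f"bncDoc {position}: unexpected content {tags}")
    return problems


print(check_structure(ET.fromstring(SAMPLE.strip())))  # []
```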
The components of the header are fully documented in section ??. Further discussion of SGML concepts and practices is provided in section ??.
Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.
The org attribute is used to characterize the internal organization of written texts. All demographically collected spoken texts have the same internal organization: each <stext> element collects together all the conversations for a given respondent, each distinct conversation being represented by a <div> element (see further ??). Since the order of these <div> elements is not significant, the org attribute always has the value ‘composite’.
Segments and words
- <s>
- a segment of spoken or written text as identified by the CLAWS segmentation scheme. The global n attribute is always supplied for <s> elements.
- <w>
- represents a grammatical (not necessarily orthographic) word. Note that the CLAWS definition of a ‘word’ does not correspond with the conventional orthographic definition. Attributes include:
- <c>
- represents a punctuation character. Attributes include:
The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, may be regarded as an end-to-end sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances.
The n attribute is specified for each <s> element and gives its sequence number within the text from which it comes. The code within each <w> or <c> tag is the word class code assigned by the CLAWS tagging system. These codes are listed below, in section ??.
In most cases, <s> elements will correspond with regular orthographic sentences, and <w> elements with regular orthographic words. However, it should be noted that several common phrases are treated as single <w> elements, typically prepositional phrases such as ‘in spite of’, while some single orthographic forms such as ‘can't’ and possessive forms such as ‘man's’ are decomposed into two <w> elements. Further discussion of these non-orthographic word forms is given in the accompanying Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith.
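The difference between grammatical and orthographic words can be made concrete with a Python sketch. The sample segment is invented for illustration, and the placement of white space inside element content (a trailing space after each complete orthographic word) is an assumption of this sketch rather than a documented property of the corpus. The function name non_orthographic is also hypothetical.

```python
import re

# Minimized <w> and <c> tags with a bare word-class code; content runs to
# the next tag because end-tags are omitted.
ITEM_RE = re.compile(r"<(w|c) ([^<>\s]+)>([^<]*)")


def non_orthographic(segment):
    """Find multiword <w> elements and orthographic words split across <w>s."""
    items = ITEM_RE.findall(segment)
    # A <w> whose text contains an internal space spans several words.
    multiword = [text.strip() for gi, _, text in items
                 if gi == "w" and " " in text.strip()]
    # Two adjacent <w> elements with no space between them form one
    # orthographic word that has been decomposed.
    split = []
    for (gi1, _, text1), (gi2, _, text2) in zip(items, items[1:]):
        if gi1 == "w" and gi2 == "w" and text1 and not text1[-1].isspace():
            split.append((text1, text2.strip()))
    return multiword, split


sample = ("<w PRP>in spite of <w AT0>the <w NN1>rain <w PNP>she "
          "<w VM0>ca<w XX0>n't <w VVI>swim<c PUN>.")
print(non_orthographic(sample))
# (['in spite of'], [('ca', "n't")])
```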
Dashes used to separate numbers are represented in a similar way, using the ndash entity.
Quotation marks are also represented by entity references. The reference name used will depend on whether or not the usage of quotation marks in the text has been normalized. Information in the header should describe the course taken for a particular text, as described in section ??.
Editorial indications
- <corr>
- any editorial correction or regularization, e.g. of material obviously mistranscribed or misspelled, or of variant spellings. Attributes include:
- <sic>
- a word or phrase which has not been corrected, but which is in doubt; for example, a spoken word which the transcribers cannot recognise, or a dubious spelling. Attributes include:
In general, the <corr> element is used wherever a word appears to be misspelled in the source, and the <sic> element where the transcriber is unable to propose a correction, but believes the original to be erroneous. The <sic> element is also used to mark words which are intentionally misspelled, for example to indicate non-standard pronunciation; in this case, the corr attribute is used to supply a standard spelling.
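As a rough illustration of how the corr attribute on <sic> might be used in processing, the Python sketch below produces either the original or the standardized reading of a passage. The sample string, the function name reading, and the exact attribute syntax accepted (a quoted or bare value) are assumptions; in the corpus itself the content of a <sic> element would normally also contain <w> tags, which this simplified sample omits.

```python
import re

# A <sic> element with an optional corr attribute (quoted or bare value).
SIC_RE = re.compile(
    r'<sic(?:\s+corr=(?:"([^"]*)"|([^\s>]+)))?\s*>(.*?)</sic>', re.S)


def reading(text, normalized=False):
    """Strip <sic> markup, keeping the source form or the corr value."""
    def replace(match):
        quoted, bare, original = match.groups()
        corr = quoted if quoted is not None else bare
        if normalized and corr is not None:
            return corr
        return original
    return SIC_RE.sub(replace, text)


sample = "He was <sic corr=going>goin</sic> home"
print(reading(sample))                   # He was goin home
print(reading(sample, normalized=True))  # He was going home
```

Where no corr value is supplied, the source form is kept in both readings, since the transcriber proposed no correction.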
Slightly different transcription policies have been followed by different transcribers, and consequently these elements may not appear in all texts. The <editorialDecl> element of the header described in section ?? gives further details of the editorial principles applied across the corpus. The value of the decls attribute for an individual text will indicate which principle or set of principles applies to it. The <tagsDecl> element in each text's header may also be consulted for an indication of the usage of these and other elements within it (see further section ??).
Users are cautioned that the corpus contains a significant number of errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size was not economically feasible. The corrections indicated by the tags discussed above are included only where errors have been detected, and no claim should be inferred that no other errors remain.
Some examples
Pointers
C87NT000
which is then supplied as the value of the target attribute.
For example,
This mechanism is also used to represent captions, notes, etc which interrupt the normal reading sequence. By far the commonest use of the <ptr> element, however, is to represent alignment of synchronous speech; see further section ??.