The original British National Corpus was provided as an application of ISO 8879, the Standard Generalized Mark-Up Language (SGML). This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. SGML was a predecessor of XML, the extensible markup language defined by the World Wide Web Consortium and now in general use on the World Wide Web, which was originally designed as a means of distributing SGML documents on the web.
This XML edition of the BNC is delivered in an XML format which is documented in this manual in section Markup conventions below; more detailed information about XML itself is readily available in many places.
The original BNC encoding format was also strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names and models. In the second edition of the BNC (BNC World), the tagging scheme was changed to conform as far as possible with the published Recommendations of the TEI (??). In the XML edition, this process has continued, and the corpus schema is now supplied in the form of a TEI customization: see further ??.
Section Markup conventions describes the basic structure of the BNC encoding scheme, in terms of the XML elements and attributes distinguished and the tags used to mark them. Section Written texts describes features which are peculiar to written texts, and section Spoken texts those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only.
Section The header describes the structure of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself. Sections Written texts and Spoken texts informally describe the elements specific to written and to spoken texts respectively. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged. Finally, a reference section (Formal Specification of the BNC XML schema) provides an alphabetical list of all elements and attributes used, together with the model and attribute classes to which they belong, and macros used to simplify references to them.
The BNC XML edition is marked up in XML and encoded in Unicode. These formats are now so pervasive as to need little explication here; for the sake of completeness however, we give a brief summary of their chief characteristics. We strongly recommend the use of XML-aware processing tools to process the corpus.
An XML document, such as the BNC, consists of a single root element, within which are nested occurrences of other element types. All element occurrences are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end. Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.
For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag in the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.
Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equals sign and the attribute value, in the form of a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.
For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start tag <head type="MAIN">, and a subheading with a start tag <head type="SUB">.
The names of elements and attributes are case-significant, as are attribute values. The style adopted throughout the BNC scheme is to use lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized.
Unless it is empty, every occurrence of an element must have both a start-tag and an end-tag. Empty elements use a special syntax in which the start-tag and end-tag are combined: for example, the point at which a page break occurs in an original source is marked <pb/> rather than <pb></pb>.
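The tag syntax described above can be explored with any XML-aware tool. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the fragment is hypothetical, written for illustration only, and the n attribute on <pb> is an assumption rather than something specified in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration (not taken from the corpus).
# The n attribute on <pb> is an assumption.
fragment = '<div><pb n="3"/><head type="MAIN">Chapter one</head></div>'

root = ET.fromstring(fragment)
head = root.find("head")
pb = root.find("pb")

head_name = head.tag            # the element's generic identifier
head_type = head.get("type")    # attribute value given in the start-tag
head_content = head.text        # content between the start-tag and end-tag

# An empty element such as <pb/> has no text content and no child elements.
pb_is_empty = pb.text is None and len(pb) == 0

# Element names are case-significant, so "Head" does not match <head>.
wrong_case = root.find("Head")

print(head_name, head_type, head_content, pb_is_empty, wrong_case)
```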
The BNC is delivered in UTF-8 encoding: this means that almost all characters in the corpus are represented directly by the appropriate Unicode character. The chief exceptions are the ampersand (&), which is always represented by the special string &amp;, the double quotation mark, which is sometimes represented by the special string &quot;, and the arithmetic less-than sign, which always appears as &lt;. These ‘named entity references’ use a syntactic convention of XML which is followed by this version of the corpus. All other characters, including accented letters such as é or special characters such as —, are represented directly.
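As a rough illustration of the entity references for the ampersand, the double quotation mark and the less-than sign, the sketch below escapes a string and then lets an XML parser restore it. The sample string is invented, and the use of Python's xml.sax.saxutils.escape is simply one convenient way to reproduce the convention, not a description of how the corpus was produced.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical text content, invented for illustration.
raw = 'R&D said "x < y" in the café'

# escape() always rewrites & and < (and also >); the extra mapping
# additionally rewrites the double quotation mark.
escaped = escape(raw, {'"': "&quot;"})

# An XML parser reverses the substitution when the document is read,
# so application code sees the original characters again.
parsed = ET.fromstring("<p>" + escaped + "</p>").text

print(escaped)
print(parsed == raw)
```

Accented letters such as é pass through unchanged in both directions, which is why an XML-aware tool is preferable to plain string matching on the raw files.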
The example begins with the start tag for a <wtext>
(written text) element,
which bears a type attribute, the value of which is
FICTION, the code used for texts derived from published
fiction. The start tag is followed by an empty <pb> element,
which provides the page number in the original source text. This in
turn is followed by the start of a <div> element, which
contains the first subdivision (chapter) of this text. This first
chapter begins with a heading (marked by a <head> element)
followed by a paragraph (marked by the <p> element). Further
details and examples are provided for all of these elements and their
functions elsewhere in this documentation.
Each distinct word and punctuation mark in the text, as identified by the CLAWS tagger, has been separately tagged with a <w> or <c> element as appropriate. These elements both bear a c5 attribute, which indicates the code from the CLAWS C5 tagset allocated to that word by the CLAWS POS-tagger; <w> elements also bear a pos attribute, which provides a less fine-grained part of speech classification for the word, and an hw attribute, which indicates the root form of the word. For example, the word ‘said’ in this example has the CLAWS C5 code VVD, the simplified POS tag VERB, and the headword say. The sequence of words and punctuation marks making up a complete segment is tagged as an <s> element, and bears an n attribute, which supplies its sequence number within the text. A combination of text identifier (the three-letter code) and <s> number may be used to reference any part of the corpus: the example above contains J10 1 and J10 2.
This is not, of course, a complete text: in particular, it lacks the TEI header which is prefixed to each text file making up the corpus. Its purpose is to indicate how the corpus is encoded. Any XML-aware processing software, including common Web browsers, should be able to operate directly on BNC texts in XML format.
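To illustrate how the c5, pos and hw attributes might be read with an XML-aware tool, here is a minimal sketch in Python. The fragment is hypothetical and is not the corpus text discussed above; apart from the codes for ‘said’ (VVD, VERB, say), which are taken from this manual, the attribute values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration. Only the codes for
# "said" come from the manual; the other attribute values are assumed.
fragment = (
    '<wtext type="FICTION"><div><p>'
    '<s n="2">'
    '<w c5="PNP" hw="she" pos="PRON">She </w>'
    '<w c5="VVD" hw="say" pos="VERB">said </w>'
    '<w c5="ITJ" hw="yes" pos="INTERJ">yes</w>'
    '<c c5="PUN">.</c>'
    '</s></p></div></wtext>'
)

root = ET.fromstring(fragment)

# One row per token: (s number, element name, text, c5, pos, hw).
rows = []
for s in root.iter("s"):
    for tok in s:
        rows.append((
            s.get("n"),
            tok.tag,
            (tok.text or "").strip(),
            tok.get("c5"),
            tok.get("pos"),   # None for <c>, which carries only c5
            tok.get("hw"),    # None for <c>
        ))

for row in rows:
    print(row)
```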
The BNC contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.
In XML terms, the corpus consists of a single element, tagged <bnc>. This element contains a single <teiHeader> element, containing metadata which relates to the whole corpus, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <teiHeader>, containing metadata relating to that specific text, followed by either a <wtext> element (for written texts) or an <stext> element (for spoken texts).
The components of the TEI header are fully documented in section The header.
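The overall structure can be sketched as follows. This is a minimal illustration in Python, assuming that the written-text element is named wtext (as in the start tag described earlier in this manual) and that each <bncDoc> carries an xml:id attribute; the miniature corpus and the identifiers AAA and BBB are invented.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature corpus; identifiers and empty elements are invented.
corpus = (
    '<bnc><teiHeader/>'
    '<bncDoc xml:id="AAA"><teiHeader/><wtext type="FICTION"/></bncDoc>'
    '<bncDoc xml:id="BBB"><teiHeader/><stext/></bncDoc>'
    '</bnc>'
)


def text_kinds(root):
    """Map each text identifier to 'written', 'spoken' or 'unknown'."""
    kinds = {}
    for doc in root.findall("bncDoc"):
        if doc.find("wtext") is not None:
            kind = "written"
        elif doc.find("stext") is not None:
            kind = "spoken"
        else:
            kind = "unknown"
        kinds[doc.get(XML_ID)] = kind
    return kinds


root = ET.fromstring(corpus)
kinds = text_kinds(root)
print(kinds)
```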
The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, is represented as a sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances. Each <s> element in turn contains <w> or <c> elements representing words and punctuation marks.
The n attribute is used to provide a sequential number for the <s> element to which it is attached. These numbers are, as far as possible, preserved across versions of the corpus, to facilitate referencing. This implies that the sequence numbering may have gaps, where duplicate sequences or segmentation errors have been identified and removed from the corpus. In cases where sequences formerly regarded as a single <s> have subsequently been split into two or more, the same number is retained for each new <s>, but it is suffixed by a fragment number. To identify any part of the corpus uniquely, therefore, all that is needed is the three-character text identifier (given as the value of the attribute xml:id on the <bncDoc> containing the text), followed by the value of the n attribute of the <s> element containing the passage to be identified.
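The referencing scheme can be sketched in Python as below. The document content is hypothetical, and the helper names s_references and find_s_unit are invented for this illustration. Because the exact form of the fragment suffix is not specified here, the sketch treats the value of n as an opaque string rather than converting it to a number.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document; the words are invented for illustration.
doc_xml = (
    '<bncDoc xml:id="J10"><teiHeader/><wtext type="FICTION">'
    '<s n="1"><w>First</w></s>'
    '<s n="2"><w>Second</w></s>'
    '</wtext></bncDoc>'
)


def s_references(doc):
    """Return references of the form 'TEXTID N' for every <s> in doc."""
    text_id = doc.get(XML_ID)
    return [f"{text_id} {s.get('n')}" for s in doc.iter("s")]


def find_s_unit(doc, text_id, n):
    """Return the <s> element for a reference, or None if it is absent.

    n is compared as a string, since gaps and fragment suffixes mean the
    values need not form a plain run of integers.
    """
    if doc.get(XML_ID) != text_id:
        return None
    for s in doc.iter("s"):
        if s.get("n") == n:
            return s
    return None


doc = ET.fromstring(doc_xml)
references = s_references(doc)
second = find_s_unit(doc, "J10", "2")
missing = find_s_unit(doc, "J10", "3")
print(references)
```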
- <s> (s-unit) contains a sentence-like division of a text.
- <w> (word) represents a grammatical (not necessarily orthographic) word.
- <c> (character) contains a significant punctuation mark as identified by the CLAWS tagger.
- <mw> contains a multi-word unit as identified by CLAWS, that is, a sequence of individual tokens which function as a single unit and can be given a single part of speech code.
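When iterating over the content of an <s> element, a multi-word unit can be treated as one token by joining the text of the <w> elements it contains. The sketch below assumes that <mw> carries its own c5 attribute and directly wraps its component <w> elements; the fragment and its codes are invented for illustration, and the function name tokens is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the c5 attribute on <mw> and all codes are assumed.
fragment = (
    '<s n="1">'
    '<mw c5="AV0">'
    '<w c5="PRP" hw="of" pos="PREP">Of </w>'
    '<w c5="NN1" hw="course" pos="SUBST">course</w>'
    '</mw>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def tokens(s):
    """Return (element name, text, c5) for each top-level unit of an <s>."""
    out = []
    for child in s:
        if child.tag == "mw":
            # Join the component words so the unit counts as one token.
            text = "".join(w.text or "" for w in child.iter("w")).strip()
        else:
            text = (child.text or "").strip()
        out.append((child.tag, text, child.get("c5")))
    return out


toks = tokens(ET.fromstring(fragment))
print(toks)
```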
Despite the best efforts of its creators, any corpus as large as the BNC will inevitably contain many errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size remains economically infeasible.
Editorial interventions in the marked up texts take three forms. On a few occasions, where markup or commentary introduced by transcribers during the process of creating the corpus may be helpful to subsequent users, it has been retained in the form of an XML comment. On some occasions, encoders have decided to correct material evidently wrong in their copy text: such corrections are marked using the <corr> element. And on several occasions, sampling, anonymization or other concerns have led to the omission of significant parts of the original source; such omissions are marked by means of the <gap> element.
The transcription and editorial policies defined for the corpus may not have been applied uniformly by different transcribers and consequently the usage of these elements is not consistent across all texts. The <tagsDecl> element in each text's header may be consulted for an indication of the usage of these and other elements within it (see further section The encoding description). Their absence should not be taken to imply that the text is either complete or perfectly transcribed.
- <gap> (omitted material) indicates a point where material has been omitted from the transcription.
- <corr> (correction) contains the correct form of a passage apparently erroneous in the copy text.
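Since the usage of these elements varies from text to text, it can be useful to count them before relying on their presence or absence. The following is a minimal sketch in Python over a hypothetical fragment; the function name editorial_counts is invented, and no attributes of <gap> or <corr> are assumed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment, invented for illustration.
fragment = '<p>Some <corr>corrected</corr> text <gap/> continues <gap/></p>'


def editorial_counts(root):
    """Count <gap> and <corr> elements anywhere below root."""
    return Counter(
        el.tag for el in root.iter() if el.tag in ("gap", "corr")
    )


counts = editorial_counts(ET.fromstring(fragment))
print(counts["corr"], counts["gap"])
```

A count of zero for either element says only that no interventions were marked in that text, not that none were needed.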