[bnc] Basic structure - Users Reference Guide for the British National Corpus (XML Edition)

Basic structure

The original British National Corpus was provided as an application of ISO 8879, the Standard Generalized Mark-Up Language (SGML). This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. SGML was a predecessor of XML, the extensible markup language defined by the World Wide Web Consortium and now in general use on the World Wide Web, which was originally designed as a means of distributing SGML documents on the web.

This XML edition of the BNC is delivered in an XML format which is documented in this manual in section Markup conventions below; more detailed information about XML itself is readily available in many places.

The original BNC encoding format was also strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations, where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names, and models. In the second edition of the BNC (BNC World), the tagging scheme was changed to conform as far as possible with the published Recommendations of the TEI (??). In the XML edition, this process has continued, and the corpus schema is now supplied in the form of a TEI customization: see further ??.

Section Markup conventions describes the basic structure of the BNC encoding scheme, in terms of the XML elements and attributes distinguished and the tags used to mark them. Section Written texts describes features which are peculiar to written texts, and section Spoken texts those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only.

Section The header describes the structure of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself. Sections Written texts and Spoken texts informally describe the elements specific to written and to spoken texts respectively. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged. Finally, a reference section (Formal Specification of the BNC XML schema) provides an alphabetical list of all elements and attributes used, together with the model and attribute classes to which they belong, and macros used to simplify references to them.

Markup conventions

The BNC XML edition is marked up in XML and encoded in Unicode. These formats are now so pervasive as to need little explication here; for the sake of completeness however, we give a brief summary of their chief characteristics. We strongly recommend the use of XML-aware processing tools to process the corpus.

An XML document, such as the BNC, consists of a single root element, within which are nested occurrences of other element types. All element occurrences are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end. Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.

For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag in the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.

Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equals sign and the attribute value, in the form of a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.

For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start tag <head type="MAIN">, and a subheading with a start tag <head type="SUB">.

The names of elements and attributes are case-significant, as are attribute values. The style adopted throughout the BNC scheme is to use lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized.

Unless it is empty, every occurrence of an element must have both a start-tag and an end-tag. Empty elements use a special syntax in which start-tag and end-tag are combined: for example, the point at which a page break occurs in an original source is marked <pb/> rather than <pb></pb>.
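The tag conventions just described can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the fragment is made up for illustration from the tag forms shown above and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only (not real corpus data): one <head> with a
# type attribute, and the two equivalent spellings of an empty <pb>.
fragment = '<div><pb n="5"/><head type="MAIN">Chapter 1</head><pb n="6"></pb></div>'

root = ET.fromstring(fragment)
head = root.find("head")

# Element and attribute names are case-significant: "type" is found,
# "TYPE" is not.
print(head.tag, head.get("type"), head.get("TYPE"))

# <pb/> and <pb></pb> parse identically: no text content and no children.
page_breaks = root.findall("pb")
print([(pb.get("n"), pb.text, len(pb)) for pb in page_breaks])
```

Because names are case-significant, a query for an element or attribute must use exactly the spelling found in the corpus.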

The BNC is delivered in UTF-8 encoding: this means that almost all characters in the corpus are represented directly by the appropriate Unicode character. The chief exceptions are the ampersand (&), which is always represented by the special string &amp;, the double quotation mark ("), which is sometimes represented by the special string &quot;, and the arithmetic less-than sign (<), which always appears as &lt;. These ‘named entity references’ use a syntactic convention of XML which is followed by this version of the corpus. All other characters, including accented letters such as é or special characters such as —, are represented directly.
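An XML-aware tool resolves these entity references automatically, which is one reason such tools are recommended. The sketch below, in Python with the standard xml.etree.ElementTree module, uses a made-up segment (not corpus data, and with most attributes left out for brevity) to show the references being decoded on reading and restored on writing.

```python
import xml.etree.ElementTree as ET

# Made-up segment for illustration; attributes are omitted for brevity.
sample = (
    '<s n="1"><w>Marks </w><c>&amp; </c><w>Spencer </w>'
    '<c>&lt;</c><c>&quot;</c><w>café</w></s>'
)

s = ET.fromstring(sample)

# The parser hands back the actual characters, not the entity references.
text = "".join(s.itertext())
print(text)

# On serialization the ampersand and less-than sign are escaped again.
serialized = ET.tostring(s, encoding="unicode")
print(serialized)
```

A utility that is not XML-aware would instead see the literal strings &amp;, &quot; and &lt; and would have to decode them itself.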

The number of linebreaks in the corpus has been reduced to a minimum in order to simplify processing by non-XML aware utilities. In particular:

XML tags are never broken across linebreaks;
the TEI Header prefixed to each text contains no linebreaks;
each <s> element begins on a new line.

Many XML-aware utilities are available to convert this representation as required.
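As an illustration of what this layout makes possible, the following Python sketch counts <s> elements twice: once with a plain line scan of the kind a non-XML-aware utility might perform, and once with an XML parser. The sample text is an assumption, laid out by hand to follow the three rules above; it is not copied from a corpus file.

```python
import xml.etree.ElementTree as ET

# Hand-made sample following the stated layout: no tag is split across
# lines, and every <s> start-tag is the first thing on its line.
sample = (
    '<wtext type="FICTION"><div level="1"><head>\n'
    '<s n="1"><w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w>'
    '<w c5="CRD" hw="1" pos="ADJ">1</w></s></head><p>\n'
    '<s n="2"><w c5="CJC" hw="but" pos="CONJ">But</w>'
    '<c c5="PUN">,</c></s></p></div></wtext>\n'
)

# Line-oriented count. The trailing space in "<s " keeps <stext> from
# being mistaken for an <s> start-tag.
by_lines = sum(1 for line in sample.splitlines() if line.startswith("<s "))

# Parser-based count, which does not depend on the line layout at all.
by_parser = sum(1 for _ in ET.fromstring(sample).iter("s"))

print(by_lines, by_parser)
```

The line scan works only as long as the layout rules hold; the parser-based count is the safer choice if the files have been reformatted.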

An example

Here is the opening of text J10 (a novel by Michael Pearce):

<wtext type="FICTION">
  <pb n="5"/>
  <div level="1">
    <head>
      <s n="1">
        <w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w>
        <w c5="CRD" hw="1" pos="ADJ">1</w>
      </s>
    </head>
    <p>
      <s n="2">
        <c c5="PUQ">‘</c>
        <w c5="CJC" hw="but" pos="CONJ">But</w>
        <c c5="PUN">,</c>
        <c c5="PUQ">’ </c>
        <w c5="VVD" hw="say" pos="VERB">said </w>
        <w c5="NP0" hw="owen" pos="SUBST">Owen</w>
        <c c5="PUN">,</c>
        <c c5="PUQ">‘</c>
        <w c5="AVQ" hw="where" pos="ADV">where </w>
        <w c5="VBZ" hw="be" pos="VERB">is </w>
        <w c5="AT0" hw="the" pos="ART">the </w>
        <w c5="NN1" hw="body" pos="SUBST">body</w>
        <c c5="PUN">?</c>
        <c c5="PUQ">’</c>
      </s>
    </p>
    ....
  </div>
</wtext>

This example has been reformatted to make its structure more apparent: as noted above, in the actual corpus texts, newlines appear only at the start of each <s> element, rather than (as here) at the start of each element. The original files also lack the extra white space at the start of each line, used in the above example to indicate how the XML elements nest within one another.

The example begins with the start tag for a <wtext> (written text) element, which bears a type attribute, the value of which is FICTION, the code used for texts derived from published fiction. The start tag is followed by an empty <pb> element, which provides the page number in the original source text. This in turn is followed by the start of a <div> element, which contains the first subdivision (chapter) of this text. This first chapter begins with a heading (marked by a <head> element) followed by a paragraph (marked by the <p> element). Further details and examples are provided for all of these elements and their functions elsewhere in this documentation.

Each distinct word and punctuation mark in the text, as identified by the CLAWS tagger, has been separately tagged with a <w> or <c> element as appropriate. These elements both bear a c5 attribute, which indicates the code from the CLAWS C5 tagset allocated to that word by the CLAWS POS-tagger; <w> elements also bear a pos attribute, which provides a less fine-grained part of speech classification for the word, and an hw attribute, which indicates the root form of the word. For example, the word ‘said’ in this example has the CLAWS 5 code VVD, the simplified POS tag VERB, and the headword say. The sequence of words and punctuation marks making up a complete segment is tagged as an <s> element, and bears an n attribute, which supplies its sequence number within the text. A combination of text identifier (the three-letter code) and <s> number may be used to reference any part of the corpus: the example above contains J10 1 and J10 2.
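The attributes just described can be read off with a few lines of code. The sketch below, in Python with the standard xml.etree.ElementTree module, works on an abridged copy of the J10 example (segment 2 is cut down to three tokens) and pairs each token with a reference of the form ‘J10 2’. The text identifier is supplied by hand here because the abridged sample has no enclosing <bncDoc>.

```python
import xml.etree.ElementTree as ET

TEXT_ID = "J10"  # supplied by hand; the sample below has no <bncDoc> wrapper

# Abridged from the J10 example: segment 2 is shortened to three tokens.
sample = (
    '<wtext type="FICTION"><div level="1"><head>'
    '<s n="1"><w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w>'
    '<w c5="CRD" hw="1" pos="ADJ">1</w></s></head><p>'
    '<s n="2"><w c5="VVD" hw="say" pos="VERB">said </w>'
    '<w c5="NP0" hw="owen" pos="SUBST">Owen</w>'
    '<c c5="PUN">,</c></s></p></div></wtext>'
)

root = ET.fromstring(sample)
rows = []
for s in root.iter("s"):
    ref = f"{TEXT_ID} {s.get('n')}"  # e.g. "J10 2"
    for tok in s:
        # <c> elements carry no pos or hw here, so get() returns None.
        rows.append((ref, tok.tag, (tok.text or "").strip(),
                     tok.get("c5"), tok.get("pos"), tok.get("hw")))

for row in rows:
    print(row)
```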

This is not, of course, a complete text: in particular, it lacks the TEI header which is prefixed to each text file making up the corpus. Its purpose is to indicate how the corpus is encoded. Any XML aware processing software, including common Web browsers, should be able to operate directly on BNC texts in XML format.

The remainder of this manual describes in more detail the intended semantics for each of the XML elements used in the corpus, with examples of their use.

Corpus and text elements

The BNC contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.

In XML terms, the corpus consists of a single element, tagged <bnc>. This element contains a single <teiHeader> element, containing metadata which relates to the whole corpus, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <teiHeader>, containing metadata relating to that specific text, followed by either a <wtext> element (for written texts) or an <stext> element (for spoken texts).

Each <bncDoc> element also carries an xml:id attribute, which supplies its standard three-character identifier.

The components of the TEI header are fully documented in section The header.

Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.

The function of these elements and their attributes may be summarized as follows:

<wtext> contains a single written text.
<stext> contains a single spoken text, i.e. a transcription or collection of transcriptions from a single source.
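The overall structure can be walked as in the following Python sketch, which uses the standard xml.etree.ElementTree module. The miniature corpus is invented for illustration: the headers and text bodies are left empty, and the identifier ZZ9 is hypothetical. Note that ElementTree exposes xml:id under its namespace-qualified name.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml:id attribute under this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented miniature corpus; "ZZ9" is a hypothetical identifier and the
# headers and text bodies are empty placeholders.
sample = (
    '<bnc><teiHeader/>'
    '<bncDoc xml:id="J10"><teiHeader/><wtext type="FICTION"/></bncDoc>'
    '<bncDoc xml:id="ZZ9"><teiHeader/><stext/></bncDoc>'
    '</bnc>'
)

corpus = ET.fromstring(sample)
kinds = {}
for doc in corpus.findall("bncDoc"):
    # Each <bncDoc> holds either a <wtext> or an <stext> after its header.
    if doc.find("wtext") is not None:
        kinds[doc.get(XML_ID)] = "written"
    elif doc.find("stext") is not None:
        kinds[doc.get(XML_ID)] = "spoken"

print(kinds)
```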

Segments and words

The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, is represented as a sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances. Each <s> element in turn contains <w> or <c> elements representing words and punctuation marks.

The n attribute is used to provide a sequential number for the <s> element to which it is attached. These numbers are, as far as possible, preserved across versions of the corpus, to facilitate referencing. This implies that the sequence numbering may have gaps, where duplicate sequences or segmentation errors have been identified and removed from the corpus. In cases where sequences formerly regarded as a single <s> have subsequently been split into two or more, the same number is retained for each new <s>, but it is suffixed by a fragment number. To identify any part of the corpus uniquely, therefore, all that is needed is the three-character text identifier (given as the value of the attribute xml:id on the <bncDoc> containing the text), followed by the value of the n attribute of the <s> element containing the passage to be identified.
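One practical consequence is that n values are best treated as opaque strings rather than as a contiguous range of integers. The Python sketch below illustrates a lookup written on that assumption. The fragment is made up and has a deliberate gap in its numbering; the helper names find_s and reference are hypothetical, and no attempt is made to parse fragment suffixes, since their exact form is not specified here.

```python
import xml.etree.ElementTree as ET

# Made-up fragment with a gap in the numbering (there is no n="3").
sample = (
    '<p><s n="1"><w c5="AT0" hw="the" pos="ART">the </w></s>'
    '<s n="2"><w c5="NN1" hw="body" pos="SUBST">body</w></s>'
    '<s n="4"><w c5="NN1" hw="time" pos="SUBST">time</w></s></p>'
)
root = ET.fromstring(sample)

def find_s(root, n):
    """Return the <s> whose n attribute equals n (a string), or None."""
    for s in root.iter("s"):
        if s.get("n") == n:
            return s
    return None

def reference(text_id, s):
    """Build a reference such as 'J10 4' from a text id and an <s>."""
    return f"{text_id} {s.get('n')}"

print(find_s(root, "3"))                     # the gap: nothing to find
print(reference("J10", find_s(root, "4")))
```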

Fragmentary sentences such as headings or labels in lists are also encoded as <s> elements, as in the following example from text CBE:

<div type="u">
  <head type="MAIN">
    <s n="835">
      <w c5="AJ0" hw="serious" pos="ADJ">Serious </w>
      <w c5="NN1" hw="fit" pos="SUBST">fit </w>
      <w c5="PRF" hw="of" pos="PREP">of </w>
      <w c5="NN2" hw="giggle" pos="SUBST">giggles</w>
    </s>
  </head>
  <p>
    <s n="836">
      <w c5="AT0" hw="a" pos="ART">A </w>
      <w c5="NN0" hw="pair" pos="SUBST">PAIR </w>
      <w c5="PRF" hw="of" pos="PREP">of </w>
      <w c5="NN1" hw="tv" pos="SUBST">TV </w>
      <w c5="NN2" hw="newsreader" pos="SUBST">newsreaders </w>
      ...
    </s>...</p>
  ...
</div>

As noted above, at the lowest level, the corpus consists of <w> (word) and <c> (punctuation) elements, grouped into <s> (segment) elements. Each <w> element carries three attributes, which indicate its morphological class or part of speech as determined by the CLAWS tagger, a simplified form of that POS code, and an automatically derived root form or lemma. Each <c> element also carries codes for part of speech, but not for lemma. For example, the word ‘corpora’ wherever it appears in the BNC is presented like this:

<w c5="NN2" pos="SUBST" hw="corpus">corpora </w>

Any white space following a word in the original source is preserved within the <w> tag, as in the previous example. White space is not added if no space is present in the source, as in the following example:

<w c5="NN2" pos="SUBST" hw="corpus">corpora</w> <c c5="PUN" pos="PUN">. </c>
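Because trailing white space is kept inside the tokens, the running text of a segment can be rebuilt simply by concatenating token contents. The Python sketch below shows this under two assumptions: the tokens are written with no white space between elements (as in the corpus files, unlike the reformatted examples in this manual), and the first word, ‘large’, is added here purely for illustration.

```python
import xml.etree.ElementTree as ET

# The last two tokens follow the example above; the first <w> is an
# illustrative addition. No white space appears between elements.
sample = (
    '<s n="1"><w c5="AJ0" hw="large" pos="ADJ">large </w>'
    '<w c5="NN2" pos="SUBST" hw="corpus">corpora</w>'
    '<c c5="PUN" pos="PUN">. </c></s>'
)

s = ET.fromstring(sample)

# Join the content of <w> and <c> only, so that any white space lying
# between elements (as in a pretty-printed copy) is never picked up.
text = "".join(tok.text or "" for tok in s.iter() if tok.tag in ("w", "c"))
print(repr(text))
```

Joining token contents rather than all text nodes is a deliberate choice: it gives the same result whether or not the file has been re-indented.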

The <w> element encloses a single token as identified by the CLAWS tagger. Usually this will correspond with a word as conventionally spelled; there are however two important exceptions. Firstly, CLAWS regards certain common abbreviated or enclitic forms such as ‘'s’ in ‘he's’ or ‘dog's’ as distinct tokens, thus enabling it to distinguish them as being an auxiliary verb in the first case, and a genitive marker in the second. Thus ‘It's’ is encoded as two <w> elements, one for ‘It’ and one for ‘'s’, and ‘dog's’ likewise as one <w> element for ‘dog’ and one for ‘'s’; the genitive case can be seen in the encoding of ‘Celia's’ in the last example of section Editorial indications below.

Secondly, CLAWS treats certain common multi-word units as if they were single tokens, giving the whole of a sequence such as ‘in spite of’ a single POS code. These multiword sequences were not distinguished from individual <w> elements in earlier versions of the corpus; in the present version however a new element <mw> (for multiword) has been introduced to mark them explicitly. The individual components of a <mw> sequence are also tagged as <w> elements in the same way as elsewhere. Thus, the phrase ‘in terms of’, which in earlier editions of the BNC would have been encoded as a single <w> element, is now encoded as follows:

<mw c5="PRP">
  <w c5="PRP" hw="in" pos="PREP">in </w>
  <w c5="NN2" hw="term" pos="SUBST">terms </w>
  <w c5="PRF" hw="of" pos="PREP">of </w>
</mw>
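A program that wants one token per CLAWS unit can treat each <mw> as a single item and ignore the <w> elements inside it. The Python sketch below is one way to do this; the helper name units is hypothetical, the final word ‘cost’ is added for illustration, and only the direct children of <s> are examined, which is a simplification.

```python
import xml.etree.ElementTree as ET

# The <mw> follows the example above; the last <w> is an illustrative
# addition so that the segment contains an ordinary word as well.
sample = (
    '<s n="1"><mw c5="PRP">'
    '<w c5="PRP" hw="in" pos="PREP">in </w>'
    '<w c5="NN2" hw="term" pos="SUBST">terms </w>'
    '<w c5="PRF" hw="of" pos="PREP">of </w></mw>'
    '<w c5="NN1" hw="cost" pos="SUBST">cost</w></s>'
)

def units(s):
    """Yield (form, c5) pairs, counting each <mw> as one unit.

    Simplification: only direct children of <s> are considered.
    """
    for child in s:
        if child.tag == "mw":
            form = "".join(w.text or "" for w in child.iter("w"))
            yield form.strip(), child.get("c5")
        elif child.tag in ("w", "c"):
            yield (child.text or "").strip(), child.get("c5")

result = list(units(ET.fromstring(sample)))
print(result)
```

Iterating over every <w> instead would yield the three component words with their individual codes, so the choice depends on which level of analysis is wanted.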

Detailed information about the procedures by which the part of speech and lemmatization information was added to the corpus is provided in section Wordclass Tagging in BNC XML, which is derived from the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith, as distributed along with the BNC World edition of the corpus. A brief summary of the codes used and their significance is also provided in the reference section below (Formal Specification of the BNC XML schema).

<s> (s-unit) contains a sentence-like division of a text.
    n: sequence number.
<w> (word) represents a grammatical (not necessarily orthographic) word.
    pos: supplies a simplified part-of-speech code.
    c5: supplies the CLAWS 5 code associated with this word.
    hw: specifies the headword under which this lexical unit is conventionally grouped, where known.
<c> (character) contains a significant punctuation mark as identified by the CLAWS tagger.
    c5: the CLAWS 5 code associated with this punctuation mark.
<mw> contains a multi-word unit as identified by CLAWS, that is, a sequence of individual tokens which function as a single unit and can be given a single part of speech code.

Editorial indications

Despite the best efforts of its creators, any corpus as large as the BNC will inevitably contain many errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size remains economically infeasible.

Editorial interventions in the marked up texts take three forms. On a few occasions, where markup or commentary introduced by transcribers during the process of creating the corpus may be helpful to subsequent users, it has been retained in the form of an XML comment. On some occasions, encoders have decided to correct material evidently wrong in their copy text: such corrections are marked using the <corr> element. And on several occasions, sampling, anonymization or other concerns have led to the omission of significant parts of the original source; such omissions are marked by means of the <gap> element.

The transcription and editorial policies defined for the corpus may not have been applied uniformly by different transcribers and consequently the usage of these elements is not consistent across all texts. The <tagsDecl> element in each text's header may be consulted for an indication of the usage of these and other elements within it (see further section The encoding description). Their absence should not be taken to imply that the text is either complete or perfectly transcribed.

In the following example, the first three chapters have been omitted for sampling reasons:

<wtext type="FICTION">
  <div level="1" n="1">
    <head>
      <s n="1">
        <w c5="NP0" hw="friday" pos="SUBST">Friday </w>
        <w c5="CRD" hw="16" pos="ADJ">16 </w>
        <w c5="NP0" hw="september" pos="SUBST">September </w>
        <w c5="PRP" hw="to" pos="PREP">to </w>
        <w c5="NP0" hw="tuesday" pos="SUBST">Tuesday </w>
        <w c5="CRD" hw="20" pos="ADJ">20 </w>
        <w c5="NP0" hw="september" pos="SUBST">September</w>
      </s>
    </head>
    <gap desc="chapters 1–3 of book 1" reason="sampling strategy"/>
    <pb n="17"/>
    <div level="2" n="4">
      <p>
        <s n="2">
          <w c5="AV0" hw="once" pos="ADV">Once </w>
          <w c5="AJ0" hw="free" pos="ADJ">free </w>
          <w c5="PRF" hw="of" pos="PREP">of </w>
          <w c5="AT0" hw="the" pos="ART">the </w>
          <w c5="AJ0" hw="knotted" pos="ADJ">knotted </w>
          <w c5="NN2" hw="tentacle" pos="SUBST">tentacles </w>
          <w c5="PRF" hw="of" pos="PREP">of </w>
          <w c5="AT0" hw="the" pos="ART">the </w>
          <w c5="AJ0" hw="eastern" pos="ADJ">eastern </w>
          <w c5="NN2" hw="suburb" pos="SUBST">suburbs</w>
          <c c5="PUN">, </c>
          <w c5="NP0" hw="dalgliesh" pos="SUBST">Dalgliesh </w>
          <w c5="VVD" hw="make" pos="VERB">made </w>
          <w c5="AJ0" hw="good" pos="ADJ">good </w>
          <w c5="NN1" hw="time" pos="SUBST">time </w>
          <w c5="CJC" hw="and" pos="CONJ">and </w>
          <w c5="PRP" hw="by" pos="PREP">by </w>
          <w c5="CRD" hw="three" pos="ADJ">three </w>
          <w c5="PNP" hw="he" pos="PRON">he </w>
          <w c5="VBD" hw="be" pos="VERB">was </w>
          <w c5="VVG" hw="drive" pos="VERB">driving </w>
          <w c5="PRP" hw="through" pos="PREP">through </w>
          <w c5="NP0" hw="lydsett" pos="SUBST">Lydsett </w>
          <w c5="NN1" hw="village" pos="SUBST">village</w>
          <c c5="PUN">.</c>
        </s>
      </p>...
    </div>...
  </div>
</wtext>

In the following example, a proper name has been omitted:

<s n="547"> <w c5="PNP" hw="i" pos="PRON">I </w> <w c5="VVD" hw="ask" pos="VERB">asked </w> <w c5="NP0" hw="mr" pos="SUBST">Mr </w> <gap desc="name" reason="anonymization"/> <w c5="CJC" hw="and" pos="CONJ">and </w> ...</s>

In the following example, a telephone number has been omitted:

<s n="762"> <w c5="PNP" hw="he" pos="PRON">He </w> <w c5="VVD" hw="appeal" pos="VERB">appealed </w> <w c5="PRP" hw="for" pos="PREP">for </w> <w c5="PNI" hw="anyone" pos="PRON">anyone </w> <w c5="PRP" hw="with" pos="PREP">with </w> <w c5="NN1" hw="information" pos="SUBST">information </w> <w c5="TO0" hw="to" pos="PREP">to </w> <w c5="VVI" hw="contact" pos="VERB">contact </w> <w c5="PNP" hw="he" pos="PRON">him </w> <w c5="AVP" hw="on" pos="ADV">on </w> <gap desc="telephone number"/> <c c5="PUN">.</c> </s>

In the following example, a typographic error in the original has been corrected:

<s n="48">... <w c5="AJ0" hw="good" pos="ADJ">good </w> <w c5="CJC" hw="or" pos="CONJ">or </w> <corr sic="herioc"> <w c5="AJ0" hw="heroic" pos="ADJ">heroic </w> </corr> <w c5="NN1" hw="behaviour" pos="SUBST">behaviour</w> ...</s>

In the following example, typographic variation in the original has been regularized:

<s n="1380"> <w c5="PNP" hw="he" pos="PRON">He </w> <w c5="VVD" hw="use" pos="VERB">used </w> <w c5="AT0" hw="the" pos="ART">the </w> <w c5="NN1" hw="telephone" pos="SUBST">telephone </w> <w c5="TO0" hw="to" pos="PREP">to </w> <w c5="VVI" hw="ring" pos="VERB">ring </w> <w c5="DPS" hw="he" pos="PRON">his </w> <w c5="DT0" hw="own" pos="ADJ">own </w> <w c5="NN1" hw="number" pos="SUBST">number </w> <w c5="CJC" hw="and" pos="CONJ">and </w> <w c5="NP0" hw="celia" pos="SUBST">Celia</w> <w c5="POS" hw="'s" pos="UNC">'s</w> <c c5="PUN">, </c> <w c5="PRP" hw="on" pos="PREP">on </w> <w c5="AT0" hw="the" pos="ART">the </w> <corr sic="offchance" resp="OUCS"> <w c5="AJ0" hw="off" pos="ADJ">off </w> <w c5="NN1" hw="chance" pos="SUBST">chance </w> </corr> <w c5="CJT" hw="that" pos="CONJ">that </w> <w c5="NP0" hw="dougal" pos="SUBST">Dougal </w> <w c5="VHD" hw="have" pos="VERB">had </w> <w c5="VVN" hw="go" pos="VERB">gone </w> <w c5="PRP" hw="to" pos="PREP">to </w> <w c5="NP0" hw="primrose" pos="SUBST">Primrose </w> <w c5="NP0" hw="hill" pos="SUBST">Hill</w> <c c5="PUN">.</c> </s>

The usage of these elements may be summarized as follows:

<gap> (omitted material) indicates a point where material has been omitted from the transcription.
    desc: briefly describes the material which has been omitted.
    reason: gives further details of the reason for omission.
    resp: indicates the agency responsible for the intervention or interpretation, for example an editor or transcriber.
<corr> (correction) contains the correct form of a passage apparently erroneous in the copy text.
    sic: contains verbatim text which has been corrected, or an empty string if the correction consists of an addition.
    rend: a code briefly characterising the way the element content was originally presented.
    resp: a code identifying the agency responsible for making the correction.

Note that the <sic> element used in preceding editions of the BNC is no longer used.
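Both kinds of intervention are easy to list mechanically. The following Python sketch collects the corrections and omissions in a small sample assembled from two of the examples above, abridged and wrapped in a <p> element so that they form one well-formed fragment; that wrapper is an assumption made only for the illustration.

```python
import xml.etree.ElementTree as ET

# Abridged from the examples above (segments 48 and 547), joined under
# a <p> purely so that the sample is a single well-formed fragment.
sample = (
    '<p><s n="48"><w c5="AJ0" hw="good" pos="ADJ">good </w>'
    '<w c5="CJC" hw="or" pos="CONJ">or </w>'
    '<corr sic="herioc"><w c5="AJ0" hw="heroic" pos="ADJ">heroic </w></corr>'
    '<w c5="NN1" hw="behaviour" pos="SUBST">behaviour</w></s>'
    '<s n="547"><w c5="VVD" hw="ask" pos="VERB">asked </w>'
    '<w c5="NP0" hw="mr" pos="SUBST">Mr </w>'
    '<gap desc="name" reason="anonymization"/>'
    '<w c5="CJC" hw="and" pos="CONJ">and </w></s></p>'
)
root = ET.fromstring(sample)

# Pair each original (sic) reading with the corrected form in the text.
corrections = [
    (corr.get("sic"), "".join(w.text or "" for w in corr.iter("w")).strip())
    for corr in root.iter("corr")
]

# Record what was omitted and why; reason may be absent, giving None.
gaps = [(gap.get("desc"), gap.get("reason")) for gap in root.iter("gap")]

print(corrections)
print(gaps)
```

As the text cautions, an empty result for a given text does not show that the text is complete or free of error, only that no intervention was marked.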
