Basic structure

The original British National Corpus was provided as an application of ISO 8879, the Standard Generalized Markup Language (SGML). This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. SGML was a predecessor of XML, the Extensible Markup Language defined by the World Wide Web Consortium and now in general use on the World Wide Web. XML was originally designed as a means of distributing SGML documents on the web. This XML edition of the BNC is delivered in an XML format which is documented later in this manual; more detailed information about XML itself is readily available in many places. The article in Wikipedia is probably as good a starting point as any.

The original BNC encoding format was also strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations, where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names and models. In the second edition of the BNC (BNC World), the tagging scheme was changed to conform as far as possible with the published Recommendations of the TEI. In the XML edition, this process has continued, and the corpus schema is now supplied in the form of a TEI customization.

Markup conventions

The BNC XML edition is marked up in XML and encoded in Unicode.
These formats are now so pervasive as to need little explication here; for the sake of completeness, however, we give a brief summary of their chief characteristics. We strongly recommend the use of XML-aware processing tools to process the corpus.

An XML document, such as the BNC, consists of a single root element, within which are nested occurrences of other element types. All element occurrences are delimited by tags. There are two forms of tag: a start-tag, marking the beginning of an element, and an end-tag, marking its end (in the case of empty elements, the two may be combined; see below). Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag. For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag of the form </head>. Everything between these two tags is regarded as the content of an element of type head.

Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equals sign and the attribute value, in the form of a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme. For example, the head element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start-tag <head type="MAIN">, and a subheading with a start-tag <head type="SUB">.

The names of elements and attributes are case-significant, as are attribute values. The style adopted throughout the BNC scheme is to use lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized: examples include teiHeader or particDesc (for participant description).
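The tag and attribute conventions just described can be tried out with any XML parser. The following is a minimal sketch in Python using the standard library module xml.etree.ElementTree; the fragment and its heading text are invented for illustration and are not taken from the corpus, and the function name headings_by_type is our own.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the heading text is invented for this sketch.
SAMPLE = (
    '<div>'
    '<head type="MAIN">Main heading</head>'
    '<head type="SUB">Subheading</head>'
    '</div>'
)

def headings_by_type(xml_text):
    """Return (type attribute, content) pairs for every head element."""
    root = ET.fromstring(xml_text)
    return [(h.get("type"), h.text) for h in root.iter("head")]

pairs = headings_by_type(SAMPLE)
print(pairs)

# Element names are case-significant: "Head" does not match "head".
upper_matches = list(ET.fromstring(SAMPLE).iter("Head"))
print(len(upper_matches))
```

Because names are case-significant, a search for Head finds nothing in a document that uses head throughout.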
Unless it is empty, every occurrence of an element must have both a start-tag and an end-tag. Empty elements may use a special syntax in which the start-tag and end-tag are combined: for example, the point at which a page break occurs in an original source is marked <pb/> rather than <pb></pb>.

The BNC is delivered in UTF-8 encoding: this means that almost all characters in the corpus are represented directly by the appropriate Unicode character. The chief exceptions are the ampersand (&), which is always represented by the special string &amp;, the double quotation mark, which is sometimes represented by the special string &quot;, and the arithmetic less-than sign, which always appears as &lt;. These named entity references use a syntactic convention of XML which is followed by this version of the corpus. All other characters, including accented letters such as é or special characters such as —, are represented directly.

The number of linebreaks in the corpus has been reduced to a minimum in order to simplify processing by utilities that are not XML-aware. In particular:

XML tags are never broken across linebreaks;

the TEI Header prefixed to each text contains no linebreaks;

each s element begins on a new line.

Many XML-aware utilities are available to convert this representation as required.

An example

Here is the opening of text J10 (a novel by Michael Pearce). In this example, as elsewhere, we have placed each element on a separate line for clarity; this is not a requirement of XML, however.

CHAPTER 1 ‘ But , ’ said Owen , ‘ where is the body ? ’ ....

This example has been reformatted to make its structure more apparent: as noted above, in the actual corpus texts, newlines appear only at the start of each s element, rather than (as here) at the start of each element. The original files also lack the extra white space at the start of each line, used in the above example to indicate how the XML elements nest within one another.

The example begins with the start-tag for a wtext (written text) element, which bears a type attribute, the value of which is FICTION, the code used for texts derived from published fiction. The start-tag is followed by an empty pb element, which provides the page number in the original source text. This in turn is followed by the start of a div element, which contains the first subdivision (chapter) of this text. This first chapter begins with a heading (marked by a head element) followed by a paragraph (marked by the p element). Further details and examples are provided for all of these elements and their functions elsewhere in this documentation.

Each distinct word and punctuation mark in the text, as identified by the CLAWS tagger, has been separately tagged with a w or c element as appropriate. These elements both bear a c5 attribute, which indicates the code from the CLAWS C5 tagset allocated to that word by the CLAWS POS-tagger; w elements also bear a pos attribute, which provides a less fine-grained part of speech classification for the word, and an hw attribute, which indicates the root form of the word. For example, the word said in this example has the CLAWS 5 code VVD, the simplified POS tag VERB, and the headword say.
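The w and c elements and their attributes can be read with a few lines of code. The sketch below, in Python with the standard library module xml.etree.ElementTree, is illustrative only: the attribute values for said are those given above, while the other tokens, their codes and the function name describe_tokens are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Illustrative segment. Only the values for "said" (VVD, VERB, say) come
# from the manual text; the other tokens and codes are assumed.
SAMPLE = (
    '<s n="2">'
    '<w c5="VVD" hw="say" pos="VERB">said </w>'
    '<w c5="NP0" hw="owen" pos="SUBST">Owen</w>'
    '<c c5="PUN">,</c>'
    '</s>'
)

def describe_tokens(s_xml):
    """Return (element name, form, c5, pos, hw) for each w or c in a segment."""
    segment = ET.fromstring(s_xml)
    rows = []
    for tok in segment:
        if tok.tag in ("w", "c"):
            form = (tok.text or "").strip()
            rows.append((tok.tag, form, tok.get("c5"), tok.get("pos"), tok.get("hw")))
    return rows

rows = describe_tokens(SAMPLE)
for row in rows:
    print(row)
```

In this sketch the c element carries only a c5 attribute, so pos and hw are returned as None for punctuation.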
The sequence of words and punctuation marks making up a complete segment is tagged as an s element, and bears an n attribute, which supplies its sequence number within the text. A combination of text identifier (the three-letter code) and s number may be used to reference any part of the corpus: the example above contains J10 1 and J10 2.

This is not, of course, a complete text: in particular, it lacks the TEI header which is prefixed to each text file making up the corpus. Its purpose is to indicate how the corpus is encoded. Any XML-aware processing software, including common Web browsers, should be able to operate directly on BNC texts in XML format. The remainder of this manual describes in more detail the intended semantics for each of the XML elements used in the corpus, with examples of their use.

Corpus and text elements

The BNC contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.

In XML terms, the corpus consists of a single element, tagged bnc. This element contains a single teiHeader element, containing metadata which relates to the whole corpus, followed by a sequence of bncDoc elements. Each such bncDoc element contains its own teiHeader, containing metadata relating to that specific text, followed by either a wtext element (for written texts) or an stext element (for spoken texts). Each bncDoc element also carries an xml:id attribute, which supplies its standard three-character identifier. The components of the TEI header are fully documented elsewhere in this manual. Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.
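The structure of a bncDoc element can be explored as follows. This is a minimal sketch in Python using xml.etree.ElementTree, under the assumption that the elements are in no namespace; the sample document is heavily abbreviated, its token codes are assumed, and the function name summarize is our own. Note that the parser reports the xml:id attribute under its full namespace-qualified name.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Abbreviated, illustrative document: header left empty, token codes assumed.
SAMPLE = (
    '<bncDoc xml:id="J10">'
    '<teiHeader/>'
    '<wtext type="FICTION">'
    '<div><head><s n="1"><w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w></s></head>'
    '<p><s n="2"><w c5="VVD" hw="say" pos="VERB">said </w></s></p></div>'
    '</wtext>'
    '</bncDoc>'
)

def summarize(doc_xml):
    """Return (text id, 'written' or 'spoken', type, list of s references)."""
    doc = ET.fromstring(doc_xml)
    text_id = doc.get(XML_ID)
    body, kind = doc.find("wtext"), "written"
    if body is None:
        body, kind = doc.find("stext"), "spoken"
    if body is None:
        raise ValueError("bncDoc contains neither wtext nor stext")
    refs = [f"{text_id} {s.get('n')}" for s in body.iter("s")]
    return text_id, kind, body.get("type"), refs

summary = summarize(SAMPLE)
print(summary)
```

The explicit tests against None are deliberate: an element with no children counts as false in ElementTree, so a shortcut such as doc.find("wtext") or doc.find("stext") is unreliable.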
The function of these elements and their attributes may be summarized as follows:

Segments and words

The s element is the basic organizational principle for the whole corpus: every text, spoken or written, is represented as a sequence of s elements, possibly grouped into higher-level constructs, such as paragraphs or utterances. Each s element in turn contains w or c elements representing words and punctuation marks.

The n attribute is used to provide a sequential number for the s element to which it is attached. To identify any part of the corpus uniquely, therefore, all that is needed is the three-character text identifier (given as the value of the attribute xml:id on the bncDoc containing the text), followed by the value of the n attribute of the s element containing the passage to be identified. These numbers are, as far as possible, preserved across versions of the corpus, to facilitate referencing. This implies that the sequence numbering may have gaps, where duplicate sequences or segmentation errors have been identified and removed from the corpus.

In a few (about 700) cases, sequences formerly regarded as a single s have subsequently been split into two or more s units. For compatibility with previous versions of the corpus, the same number is retained for each new s, but it is suffixed by a fragment number. For example, in text A18, the s formerly numbered 1307 has now been replaced by two s elements, numbered 1307_1 and 1307_2 respectively.

Fragmentary sentences such as headings or labels in lists are also encoded as s elements, as in the following example from text CBE:

Serious fit of giggles A PAIR of TV newsreaders ... ... ...

As noted above, at the lowest level, the corpus consists of w (word) and c (punctuation) elements, grouped into s (segment) elements.
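Because s numbers may carry a fragment suffix and may have gaps, software that sorts or compares references should not treat them as plain integers. The following Python sketch shows one way of handling the numbering scheme described above; it assumes that every n value is either an integer or an integer followed by an underscore and a fragment number, and the function names are our own.

```python
def parse_s_number(n):
    """Split an s number such as '1307_2' into (base, fragment).

    Unsplit segments are given fragment 0 so that they sort consistently.
    """
    base, _, fragment = n.partition("_")
    return (int(base), int(fragment) if fragment else 0)

def make_ref(text_id, n):
    """Combine a three-character text identifier and an s number."""
    return f"{text_id} {n}"

numbers = ["1308", "1307_2", "1307_1", "1306"]
ordered = sorted(numbers, key=parse_s_number)
print(ordered)
print(make_ref("A18", "1307_1"))
```

Since numbering may have gaps, the absence of a given number between two others does not indicate a missing segment.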
Each w element carries three attributes, which indicate its morphological class or part of speech, as determined by the CLAWS tagger, a simplified form of that POS code, and an automatically derived root form or lemma. Each c element also carries codes for part of speech, but not for lemma. For example, the word corpora wherever it appears in the BNC is presented like this:

corpora

Any white space following a word in the original source is preserved within the w tag, as in the previous example. White space is not added if no space is present in the source, as in the following example:

corpora.

The w element encloses a single token as identified by the CLAWS tagger. Usually this will correspond with a word as conventionally spelled; there are, however, two important exceptions. Firstly, CLAWS regards certain common abbreviated or enclitic forms such as 's in he's or dog's as distinct tokens, thus enabling it to distinguish them as being an auxiliary verb in the first case, and a genitive marker in the second. For example, It's is encoded as follows:

It 's

while dog's is encoded:

dog 's

Secondly, CLAWS treats certain common multi-word units as if they were single tokens, giving the whole of a sequence such as in spite of a single POS code. These multiword sequences were not distinguished from individual w elements in earlier versions of the corpus; in the present version, however, a new element mw (for multiword) has been introduced to mark them explicitly. The individual components of a mw sequence are also tagged as w elements in the same way as elsewhere.
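The treatment of white space, enclitics and multiword units can be illustrated with a short Python sketch using xml.etree.ElementTree. The segment below is invented for the purpose: all attribute values, and the presence of a c5 attribute on the mw element, are assumptions rather than quotations from the corpus, and the function names are our own.

```python
import xml.etree.ElementTree as ET

# Invented segment; all codes, and the c5 attribute on mw, are assumed.
SAMPLE = (
    '<s n="1">'
    '<w c5="PNP" hw="it" pos="PRON">It</w>'
    '<w c5="VBZ" hw="be" pos="VERB">\'s </w>'
    '<mw c5="PRP">'
    '<w c5="PRP" hw="in" pos="PREP">in </w>'
    '<w c5="NN2" hw="term" pos="SUBST">terms </w>'
    '<w c5="PRF" hw="of" pos="PREP">of </w>'
    '</mw>'
    '<w c5="NN2" hw="corpus" pos="SUBST">corpora</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def plain_text(segment):
    """Rebuild running text; white space is already stored inside the tokens."""
    return "".join(segment.itertext())

def tokens(segment):
    """Return (form, c5) pairs, treating each mw sequence as one token."""
    out = []
    for child in segment:
        if child.tag == "mw":
            form = "".join(child.itertext()).strip()
        elif child.tag in ("w", "c"):
            form = (child.text or "").strip()
        else:
            continue
        out.append((form, child.get("c5")))
    return out

segment = ET.fromstring(SAMPLE)
text = plain_text(segment)
toks = tokens(segment)
print(text)
print(toks)
```

Because trailing white space is kept inside each w element, simple concatenation of element content restores the original spacing, including the absence of a space before an enclitic or a full stop.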
The phrase in terms of, for example, which in earlier editions of the BNC would have been encoded as a single w element, is now encoded as follows:

in terms of

Detailed information about the procedures by which the part of speech and lemmatization information was added to the corpus is provided in a separate section of this manual, which is derived from the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith, as distributed along with the BNC World edition of the corpus. A brief summary of the codes used and their significance is also provided in the reference section below.

Editorial indications

Despite the best efforts of its creators, any corpus as large as the BNC will inevitably contain many errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size remains economically infeasible.

Editorial interventions in the marked-up texts take three forms. On a few occasions, where markup or commentary introduced by transcribers during the process of creating the corpus may be helpful to subsequent users, it has been retained in the form of an XML comment. On some occasions, encoders have decided to correct material evidently wrong in their copy text: such corrections are marked using the corr element. And on several occasions, sampling, anonymization or other concerns have led to the omission of significant parts of the original source; such omissions are marked by means of the gap element.

The transcription and editorial policies defined for the corpus may not have been applied uniformly by different transcribers, and consequently the usage of these elements is not consistent across all texts.
The tagsDecl element in each text's header may be consulted for an indication of the usage of these and other elements within it. Their absence should not be taken to imply that the text is either complete or perfectly transcribed.

In the following example, the first three chapters have been omitted for sampling reasons:

Friday 16 September to Tuesday 20 September Once free of the knotted tentacles of the eastern suburbs, Dalgliesh made good time and by three he was driving through Lydsett village.... ...

In the following example, a proper name has been omitted:

I asked Mr and ...

In the following example, a telephone number has been omitted:

He appealed for anyone with information to contact him on .

In the following example, a typographic error in the original has been corrected:

... good or heroic behaviour ...

In the following example, a word omitted in the original has been supplied as a correction:

Apart from some eye-liner aberrations as a teenager, Mr Punch, it must be said, is absolutely straight as a die.

The usage of these elements may be summarized as follows:

Note that the sic element used in preceding editions of the BNC is no longer used.
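Users who wish to know where editorial interventions occur in a text can locate gap and corr elements directly. The following Python sketch, using xml.etree.ElementTree, is illustrative only: the fragment is invented, the w elements are shown without attributes for brevity, no attributes are assumed on gap or corr, and the function names are our own.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment; w attributes omitted, none assumed on gap or corr.
SAMPLE = (
    '<p>'
    '<s n="1"><w>I </w><w>asked </w><w>Mr </w><gap/><w>and </w></s>'
    '<s n="2"><w>good </w><corr><w>behaviour</w></corr></s>'
    '</p>'
)

def editorial_counts(xml_text):
    """Count gap and corr elements anywhere in the fragment."""
    root = ET.fromstring(xml_text)
    return Counter(el.tag for el in root.iter() if el.tag in ("gap", "corr"))

def segments_with_gaps(xml_text):
    """Return the n values of s elements that contain at least one gap."""
    root = ET.fromstring(xml_text)
    return [s.get("n") for s in root.iter("s") if s.find(".//gap") is not None]

counts = editorial_counts(SAMPLE)
gapped = segments_with_gaps(SAMPLE)
print(counts)
print(gapped)
```

A count of zero for either element should be read with the caution given above: it does not show that a text is complete or perfectly transcribed.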