[bnc] BNC User Reference Manual - Basic structure of the corpus

Basic structure

The mark-up scheme chosen for the British National Corpus is an application of ISO 8879, the Standard Generalized Mark-Up Language. This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. It is also a superset of the language XML, the extensible markup language currently proposed by the World Wide Web Consortium for general use on the World Wide Web. A brief summary of the encoding format used in the BNC to represent SGML constructs is given in section Markup conventions below; more detailed information about SGML and XML is readily available in many places.

The original BNC encoding format was strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project resulted in the development of a set of comprehensive guidelines for the encoding and interchange of a wide range of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. A conscious attempt was made to conform to TEI recommendations, where these had already been formulated, but in the first version of the BNC there were a number of differences in tag names and models. In the present edition of the BNC, the tagging scheme has been changed to conform as far as possible with the published Recommendations of the TEI. Unless otherwise stated, elements used here have the same meaning as those of the published TEI scheme. More information about the relationship between the BNC's markup and both its original CDIF format and the TEI standard is given in section ??.

Section Basic structure describes the basic structure of the British National Corpus, in terms of the SGML elements distinguished and the tags used to mark them up. Section ?? describes the elements which are peculiar to written texts, and section ?? those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only.

Section ?? describes the structure of the <teiHeader> element attached to each component of the corpus, and also to the whole corpus itself. Sections ?? and ?? informally describe the elements specific to written and to spoken texts respectively. It should be noted that by no means all of the features described here will be present in every text of the corpus, nor, if present, will they necessarily be tagged. A list of elements actually used in the whole corpus is given below in ??.

Markup conventions

The BNC texts use the ‘reference concrete syntax’ of SGML, in which all elements are delimited by the use of tags. There are two forms of tag, a start-tag, marking the beginning of an element, and an end-tag marking its end. Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.

For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag in the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.

Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equal sign and the attribute value, which may be a number, a string literal or a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.

For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start tag <head type="main">, and a subheading with a start tag <head type="sub">.

In XML (but not always in SGML), case is significant in all tag or attribute names. A consistent style has been adopted throughout the corpus. This style uses lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized.

SGML (but not XML) permits various kinds of minimization, or abbreviatory conventions. Only two such are used: end-tag omission and attribute-name omission. These conventions apply only to the elements <s>, <w> and <c> (i.e., for sentences, words, and punctuation).

For all other non-empty elements, every occurrence in the distributed form of the corpus has both a start-tag and an end-tag, and any attributes specified are supplied in the form attribute name=value (in the body of the texts), or attribute name="value" (in the headers). For the elements <s>, <w> and <c>, and all empty elements, end-tags are routinely omitted. For these three elements only, attribute values are given without any associated attribute name. See section Segments and words for some examples.
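The attribute forms described above (name=value in the text bodies, name="value" in the headers) can be read with one small routine. The following Python sketch is illustrative only: the function name parse_start_tag is hypothetical, single-quoted values are accepted as an assumption, and the minimized <w> and <c> tags, which carry a bare value, are deliberately not handled here.

```python
import re

# One attribute specification: name=value, name="value" or name='value'.
ATTR = re.compile(r"""([A-Za-z][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")

def parse_start_tag(tag):
    """Split a start-tag such as <head type="main"> into (gi, attributes)."""
    inner = tag.strip()[1:-1].strip()      # drop the < and > delimiters
    parts = inner.split(None, 1)           # generic identifier, then the rest
    gi = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    attrs = {}
    for name, double_quoted, single_quoted, bare in ATTR.findall(rest):
        # Exactly one of the three value groups is non-empty for a given match
        # (or all are empty, for an empty quoted value).
        attrs[name] = double_quoted or single_quoted or bare
    return gi, attrs
```

For instance, parse_start_tag('<head type="main">') yields the generic identifier head together with a dictionary mapping type to main.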

In the present release of the corpus, the headers are marked up using XML: this means that empty-tags take a slightly different form and that attribute values are always quoted.

Only a restricted range of characters is used in element content: specifically, the upper- and lower-case alphabetics, digits, and a subset of the common punctuation marks. All other characters are represented by SGML entity references, which take the form of an ampersand (&) followed by a mnemonic for the character, and terminated by a semicolon (;) where this is necessary to resolve ambiguity.

For example, the pound sign is represented by the string &pound;, the character é by the string &eacute; and so forth. The French word ‘été’ (summer), if it appeared in the corpus, would be represented as

&eacute;t&eacute;

The mnemonics used are taken from standard entity sets, and are listed in section ??.
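A reader of the corpus will usually want to turn entity references back into characters. The Python sketch below shows one way this might be done, treating the terminating semicolon as optional. The table is a small illustrative subset, not the full list the manual refers to; the characters chosen for bquo and equo are an assumption, and decode_entities is a hypothetical name.

```python
import re

# Illustrative subset of mnemonics only. The mappings for bquo and equo
# (opening and closing quotation marks) are assumptions.
ENTITIES = {
    "pound": "\u00a3",
    "eacute": "\u00e9",
    "mdash": "\u2014",
    "ndash": "\u2013",
    "bquo": "\u201c",
    "equo": "\u201d",
}

# An ampersand, a mnemonic, and an optional terminating semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")

def decode_entities(text):
    """Replace known entity references; leave unknown ones untouched."""
    def replace(match):
        return ENTITIES.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)
```

Unknown mnemonics are passed through unchanged, so that gaps in the table remain visible in the output rather than being silently dropped.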

Finally, although this is not mandated by either XML or SGML, in the present form of the corpus, tags are never broken across linebreaks. Additionally, an attempt has been made to avoid linebreaks within the content of a single <s> element, so as to simplify processing of the text.

Global attributes

Three global attributes are defined, each of which may potentially be specified for any element. In practice their use is limited to certain specific functions, which are discussed at the appropriate place below, but for convenience their use is also summarized here:

id: system-generated identifier of an item, unique within the corpus
n: any name or identifier for an element, not necessarily unique within the corpus
rend: the rendition or appearance of an element.

Corpus and text elements

The British National Corpus contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.

In SGML terms, the British National Corpus consists of a single SGML element, tagged <bnc>. This element contains a single <teiHeader> element, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <teiHeader>, followed by either a <text> element (for written texts) or an <stext> element (for spoken texts). The last named element is an extension of the TEI scheme, but the others are all standard TEI elements, possibly renamed as permitted by the TEI scheme.

The components of the header are fully documented in section ??. Further discussion of SGML concepts and practices is provided in section ??.

Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.

Both <text> and <stext> elements take the following attributes in addition to the attributes globally available:

org

specifies how the content of the text is organised. Legal values are:

composite: composite content: i.e. no claim is made about the sequence in which elements inferior to this one are to be processed, or their inter-relationships
seq: sequential content: i.e. the elements contained by this one form a logical unit, to be processed in the sequence given

decls

supplies the identifiers of any specific encoding or editorial conventions defined in the corpus header and applicable to this specific text

The org attribute is used to characterize the internal organization of written texts. All demographically collected spoken texts have the same internal organization: each <stext> element collects together all the conversations for a given respondent, each distinct conversation being represented by a <div> element (see further ??). Since the order of these <div> elements is not significant, the org attribute always has the value ‘composite’.

Segments and words

At the lowest level, the corpus consists of <w> (word) and <c> (punctuation) elements, grouped into <s> (segment) elements:

<s>

a segment of spoken or written text as identified by the CLAWS segmentation scheme. The global n attribute is always supplied for <s> elements.

<w>

represents a grammatical (not necessarily orthographic) word. Note that the CLAWS definition of a ‘word’ does not correspond with the conventional orthographic definition. Attributes include:

type: specifies the word class assigned to this form by the CLAWS system.

<c>

represents a punctuation character. Attributes include:

type: specifies the class assigned to this character by the CLAWS system.

For this edition of the BNC, the word class tagging system has been extensively revised. A detailed description of the tagging procedures and their application is provided by the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith, which is distributed with the corpus in electronic form. A short list of the POS codes used for the type attribute on <w> and <c> is also provided in section ?? below. As noted above, the representation of this attribute used by the current version of the corpus is minimized, so, for example, the word ‘difficulty’, tagged as a singular noun, appears as

<w NN1>Difficulty

rather than as the equivalent XML encoding:

<w type="NN1">Difficulty</w>
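The correspondence between the minimized form and the equivalent XML encoding is mechanical, and might be sketched in Python as follows. The function name expand is hypothetical, and placing trailing white space outside the generated end-tag is an assumption rather than documented practice; entity references in the content are left as they are.

```python
import re

# A minimized <w CODE> or <c CODE> tag, followed by content up to the next tag.
MINIMIZED = re.compile(r"<([wc]) ([^\s>=]+)>([^<]*)")

def expand(markup):
    """Supply the omitted attribute name and end-tag for <w> and <c>."""
    def rewrite(match):
        gi, code, content = match.groups()
        stripped = content.rstrip()
        trailing = content[len(stripped):]   # white space kept after the end-tag
        return f'<{gi} type="{code}">{stripped}</{gi}>{trailing}'
    return MINIMIZED.sub(rewrite, markup)
```

Tags other than <w> and <c>, such as <s n=11> or <corr sic=loose>, do not match the pattern and pass through unchanged.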

The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, may be regarded as an end-to-end sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances.

Here is a simple example:

<s n=11> <w NN1>Difficulty <w VBZ>is <w VBG>being <w VVN>expressed <w PRP>with <w AT0>the <w NN1>method <w TO0>to <w VBI>be <w VVN>used <w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>. </s>

The n attribute is specified for each <s> element and gives its sequence number within the text from which it comes. The code within each <w> or <c> tag is the word class code assigned by the CLAWS tagging system. These codes are listed below, in section ??.
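As an illustration of how this structure might be consumed, the Python sketch below splits a stretch of markup into <s> elements and lists the word class code and form of each <w> or <c> within them. It is a minimal sketch: the name segments is hypothetical, each <s> is assumed to run to the next <s> start-tag (end-tags being omitted), and n is kept as a string because the number of leading zeros varies between texts.

```python
import re

# Start-tag of a segment, with the n value quoted or unquoted.
S_START = re.compile(r'<s n="?([^\s">]+)"?>')
# A minimized <w CODE> or <c CODE> tag, followed by content up to the next tag.
TOKEN = re.compile(r"<([wc]) ([^\s>=]+)>([^<]*)")

def segments(markup):
    """Yield (n, [(code, form), ...]) for each <s> element in the markup."""
    pieces = S_START.split(markup)
    # pieces is [text before the first <s>, n1, body1, n2, body2, ...]
    for n, body in zip(pieces[1::2], pieces[2::2]):
        yield n, [(code, form.strip()) for _gi, code, form in TOKEN.findall(body)]
```

Applied to the example above, this would give a single segment numbered 11 whose first entry pairs the code NN1 with the form Difficulty.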

In most cases, <s> elements will correspond with regular orthographic sentences, and <w> elements with regular orthographic words. However, it should be noted that several common phrases are treated as single <w> elements, typically prepositional phrases such as ‘in spite of’, while some single orthographic forms such as ‘can't’ and possessive forms such as ‘man's’ are decomposed into two <w> elements. Further discussion of these non-orthographic word forms is given in the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging by Geoffrey Leech and Nicholas Smith.

Fragmentary sentences such as headings or labels in lists are sometimes encoded as <s> elements, as in the following example:

<div1 org=seq> <head> <s n=1> <w NP0>THEOBALD<w POS>'S <w NN1>ROAD </head> <p> <s n=2> <w PNP>He <w VVD>walked <w PRP>through <w AT0>the <w AJ0>white <w NN2>corridors<c PUN>, <w PRP>past <w AT0>the <w NN1>notice <w NN2>boards<c PUN>.

Partly for this reason, the white space (if any) following each orthographic word has been retained in the encoded text. Simply removing the tags will in general produce a correctly punctuated text. Note, however, that some punctuation marks are represented as entity references:

<s n=00024> <w PNP>It <w VBD>was <w AT0>the <w NN1>sort <w PRF>of <w NN1>sight &mdash<w NN1-VVB>the <w AJ0>poor<c PUN>, <w AT0>the <w AJ0>strange &mdash <w NN1>which <w AV0>usually <w VVD>alarmed <w NP0>Graham<c PUN>.

Dashes used to separate numbers are represented in a similar way, using the ndash entity.
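The claim that removing the tags yields readable text can be tried with a few lines of Python. This is a minimal sketch under simple assumptions: plain_text is a hypothetical name, runs of white space are collapsed to single spaces, and entity references such as mdash are left undecoded.

```python
import re

# Any start-tag or end-tag.
TAG = re.compile(r"<[^>]*>")

def plain_text(markup):
    """Delete all tags and normalize the white space that remains."""
    return " ".join(TAG.sub("", markup).split())
```

Because the white space after each orthographic word is retained in the encoding, forms split into two <w> elements, such as He and 's, are rejoined without any special handling.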

Quotation marks are also represented by entity references. The reference name used will depend on whether or not the usage of quotation marks in the text has been normalized. Information in the header should describe the course taken for a particular text, as described in section ??.

Where the quoted text is a true quotation (that is, a phrase or sequence attributed to someone other than the current narrator or writer) the <quote> element discussed in section ?? may optionally be used. This does not apply to dialogue in fictional works, which is not marked, except by the presence of the quotation mark entities, as in the following example:

<p> <s n=0022> <c PUQ>&bquo<w PNP>He<w VBZ>'s <w AT0>a <w AJ0>dry <w NN1>stick<c PUN>, <w NP0>Wilson<c PUN>,<c PUQ>&equo <w VVD>said <w NP0>Mr <w NP0>Malik<c PUN>, <c PUQ>&bquo<w CJC>but <w PNP>he <w VBZ>is <w CRD>100 <w NN0>per cent <w AJ0>loyal<c PUN>. <s n=0023> <w CJC>And <w PNP>I <w VBB>am <w VVG>looking <w PRP>for <w CRD>100 <w NN0>per cent <w NN1>loyalty<c PUN>. <s n=0024> <w PNI>Everything <w AV0>else <w VM0>can <w VVI>go <w NN1>hang<c PUN>!<c PUQ>&equo </p>

Editorial indications

Editorial changes made to the texts during transcription are recorded using the following elements:

<gap>

marks the spot where some part of the original source text has been omitted for some reason. Attributes include:

desc: brief description of the material omitted e.g. "name and address".
extent: extent of omitted material e.g. "six words".
reason: brief explanation e.g. "anonymization", "inaudible".
resp: code identifying the agency responsible for marking up the omission.

The <gap> element is typically used to indicate where words identifying persons or places have been removed during transcription, where labels etc. have been suppressed for ease of processing, or where material has simply not been transcribed because it is inaudible, illegible or not transcribable (e.g. figures, graphs).

<corr>

any editorial correction or regularization, e.g. of material obviously mistranscribed or misspelled, or of variant spellings. Attributes include:

sic: supplies the original form of the word or phrase marked.
resp: code identifying the agency responsible for making the correction.

<sic>

a word or phrase which has not been corrected, but which is in doubt; for example, a spoken word which the transcribers cannot recognise, or a dubious spelling. Attributes include:

reg: supplies a corrected form for the word or phrase marked.
resp: code identifying the agency responsible for noting the need for correction.

In general, the <corr> element is used wherever a word appears to be misspelled in the source, and the <sic> element where the transcriber is unable to propose a correction, but believes the original to be erroneous. The <sic> element is also used to mark words which are intentionally misspelled, for example to indicate non-standard pronunciation; in this case, the reg attribute is used to supply a standard spelling.

Slightly different transcription policies have been followed by different transcribers, and consequently these elements may not appear in all texts. The <editorialDecl> element of the header described in section ?? gives further details of the editorial principles applied across the corpus. The value of the decls attribute for an individual text will indicate which principle or set of principles applies to it. The <tagsDecl> element in each text's header may also be consulted for an indication of the usage of these and other elements within it (see further section ??).

Users are cautioned that the corpus contains a significant number of errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size was not economically feasible. The corrections indicated by the tags discussed above are included only where errors have been detected, and no claim should be inferred that no other errors remain.

Some examples

In the following example, the start of a chapter has been deleted at OUCS for sampling reasons:

<div1 complete=n n=7 org=seq> <gap reason="sampling strategy" desc="beginning of chapter" resp="OUCS"> <p> <s n=00001> <w DPS>Her <w AJ0>thin <w NN1>voice <w VVD>trailed <w AVP>off <w PRP>into <w AJ0>thin <w NN1>air<c PUN>,

In the following example, a list of proper names has been deleted:

<div1 complete=n org=seq> <head> <s n=00081> <hi r=ul> <w CRD>27.6.90 </hi> <w NN2>Minutes <w PRF>of <w AT0>a <w NN1>meeting <w PRF>of <w AT0>the <w NP0>Juniper <w NP0>Green <w NP0>Village <w NN0>Association <w VVD-VVN>held <w PRP>in <w AT0>the <w NP0>Village <w NP0>Hall <w PRP>on <w NP0>Wednesday<c PUN>, <w ORD>27th <w NP0>June <w PRP>at <w CRD>7.30 <w AV0>pm<c PUN>. </head> <gap desc="committee members present and absentees" resp=oup>

In the following example, a telephone number has been omitted:

<s n="1541"> <w VVB-NN1>Ring <gap desc='telephone number' reason=anonymization><c PUN>.

In the following examples, typographic errors in the original are corrected:

... <s n="237"> <w PNP>It .... <w VVZ>tries <w AV0>soulfully <w XX0>not <w TO0>to <corr sic=loose> <w VVI>lose </corr> <w DPS>its <w NN1>integrity<c PUN>. ... <s n="490"> <w PNP>I <w VBD>was <corr sic=fourteeen> <w CRD>fourteen </corr> <w AV0>then<c PUN>,

In the following example, typographic variation in the original has been regularized:

<s n=00029> <w AT0>The <w NN1>sum <w PRF>of <w NN0>&pound;60 <w VHD>had <w VBN>been <w VVN>raised <w PRP>for <w AT0>the <w NN1>Telethon <w NN1>Appeal <w CJC>and <w AJC>further <corr resp=oucs sic="week end"> <w NN1>weekend </corr> <w NN2>competitions <w VBB>are <w PRP>on <w AT0>the <w NN1>programme<c PUN>.

In the following example, the transcriber has expressed a doubt as to the validity of the word ‘memorandising’, but no correction has been made, as it has for the misspelling ‘bedeviled’ which follows it:

<s n=02444> <w PNP>He <w VM0>could <w VVI>listen <w PRP>to <w DPS>her <w AJ0>gentle <w NN1-VVG>teasing <w PRP>before <w VVG>going <w PRP>into <w DPS>his <w AJ0>secret <w NN1>room <w CJC>and <sic><w VVG>memorandising </sic> <w AT0>the <w NN2>questions <w DTQ>which <corr sic=bedeviled> <w VVD>bedevilled </corr> <w PNP>him<c PUN>.
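The <corr> elements in examples like these can be collected into pairs of original and corrected forms. The Python sketch below is one possible approach, not a documented procedure: corrections is a hypothetical name, the sic value is accepted in unquoted, double-quoted or single-quoted form, and the corrected form is taken to be the element content with its tags removed.

```python
import re

# A <corr> element with its attribute string and content.
CORR = re.compile(r"<corr\b([^>]*)>(.*?)</corr>", re.S)
# The sic attribute, with an unquoted or quoted value.
SIC_ATTR = re.compile(r"""\bsic\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""")
TAG = re.compile(r"<[^>]*>")

def corrections(markup):
    """Return (original, corrected) pairs for every <corr> element found."""
    found = []
    for attrs, content in CORR.findall(markup):
        match = SIC_ATTR.search(attrs)
        # original is None when no sic attribute is supplied
        original = next((g for g in match.groups() if g is not None), None) if match else None
        corrected = " ".join(TAG.sub("", content).split())
        found.append((original, corrected))
    return found
```

A list built this way makes it straightforward to restore the source readings, or to count how often each kind of correction occurs in a given text.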

Pointers

Parts of a text are normally transcribed in the same order as they appear in the source text. In certain circumstances, however, parts of a text have been moved from the position in which they appear in the source to simplify linguistic processing. There are two common situations where this is necessary:

where a caption or note appears in the middle of a syntactic unit
where speakers overlap

Where re-ordering of the first type has occurred, the moved element is generally re-located to the end of the paragraph or similar element in which it appears. Its original position is recorded using a pointer element (<ptr>), an empty tag whose target attribute supplies the identifier of the relocated element. In the following example, the note which originally appeared between the words ‘roughie-toughie’ and ‘types’ has been relocated to the end of the paragraph. The note itself is given an automatically generated identifier, C87NT000, which is then supplied as the value of the target attribute:

<s n=0141> <w CRD>Two <w NN2>men <w VVD>retained <w DPS>their <w NN2>marbles<c PUN>, <w CJC>and <w CJS>as <w NN1-VVB>luck <w VM0>would <w VHI>have <w PNP>it <w PNP>they<w VBB>'re <w AV0>both <w AJ0>roughie-toughie <ptr target=c87nt000> <w NN2>types <w AV0>as <w AV0>well <w CJS>as <w AJ0>military <w NN2>scientists <c PUN>&mdash <w AT0>a <w NN1>cross <w PRP>between <w NP0>Albert <w NP0>Einstein <w CJC>and <w NN1>Action <w NN1-NP0>Man<c PUN>! <s n=0142>  <w DPS>their <w NN1>way <w PRP>to <w NN1>freedom <c PUN>&mdash <w AV0>so <w VVB>get <w NN1-VVG>blasting<c PUN>! </p> <note id=c87nt000> <s n=0143> <w VVN>continued <w PRP>on <w NN1>page <w CRD>7 </note>

This mechanism is also used to represent captions, notes, etc which interrupt the normal reading sequence. By far the commonest use of the <ptr> element, however, is to represent alignment of synchronous speech; see further section ??.
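To recover the original reading order, each <ptr> has to be matched with the element whose id it names. The Python sketch below does this for relocated <note> elements only. It is a sketch under stated assumptions: resolve_pointers is a hypothetical name, identifiers are compared without regard to case, and the note content is reduced to plain text.

```python
import re

# A <ptr> tag and a <note> element, with quoted or unquoted identifiers.
PTR = re.compile(r'<ptr\s+target\s*=\s*"?([^\s">]+)"?\s*/?>')
NOTE = re.compile(r'<note\s+id\s*=\s*"?([^\s">]+)"?>(.*?)</note>', re.S)
TAG = re.compile(r"<[^>]*>")

def resolve_pointers(markup):
    """Return (target, note text) for each <ptr>; text is None if no note matches."""
    notes = {
        ident.lower(): " ".join(TAG.sub("", body).split())
        for ident, body in NOTE.findall(markup)
    }
    # Case-insensitive lookup is an assumption, not documented behaviour.
    return [(target, notes.get(target.lower())) for target in PTR.findall(markup)]
```

Pointers used for the alignment of synchronous speech refer to other kinds of element and would need separate handling.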
