2 Basic structure

The mark-up scheme chosen for the British National Corpus is known as the Corpus Document Interchange Format (CDIF). This scheme is an application of ISO 8879, the Standard Generalized Mark-Up Language. This international standard provides, amongst other things, a method of specifying an application-independent document grammar, in terms of the elements which may appear in a document, their attributes, and the ways in which they may legally be combined. A brief summary of the encoding format used in the BNC to represent SGML constructs is given in section 2.1 below; for more detailed information, any introductory text on SGML may be consulted. Documents encoded using CDIF can be processed using any SGML-aware software.

The development of CDIF was strongly influenced by the proposals of the Text Encoding Initiative (TEI). This international research project aims to develop a set of comprehensive guidelines for the encoding and interchange of electronic texts amongst researchers. An initial report appeared in 1991, and a substantially revised and expanded version in early 1994. Like CDIF, the TEI Guidelines are themselves an application of SGML. In designing CDIF, a conscious attempt was made to conform to TEI recommendations, where these had already been formulated. The intention is that CDIF texts should also be amenable to any TEI-aware software. The elements and attributes proposed for use in CDIF are intended to form a ``clean'' subset of those proposed by the TEI, and should thus be compatible with those used by other major European corpus-building initiatives. In general, components with the same names in both CDIF and TEI schemes may be assumed to have identical semantics.

Since publication of the BNC in 1995, standardization of corpus encoding practices has advanced, particularly in Europe. A large number of corpora similar in motivation and execution to the BNC have been developed, notably within such projects as PAROLE, EAGLES, and others. The EU-funded Corpus Encoding Standard (CES) is to a large extent modelled on the CDIF proposals.

Section 2 describes the basic structure of the British National Corpus, in terms of the SGML elements distinguished and the tags used to mark them up. Section 3 describes the elements which are peculiar to written texts, and section 4 those peculiar to spoken texts. In each case, a distinction is made between those elements which are marked up in all texts and those which (for technical or financial reasons) are not always so distinguished, and hence appear in some texts only.

Section 5 describes the structure of the header element attached to each component of the corpus, and also to the whole corpus itself.

2.1 Markup conventions

The BNC uses the ``reference concrete syntax'' of SGML, in which all elements are delimited by the use of tags. There are two forms of tag, a start-tag, marking the beginning of an element, and an end-tag marking its end. Tags are delimited by the characters < and >, and contain the name of the element (its gi, for generic identifier), preceded by a solidus (/) in the case of an end-tag.

For example, a heading or title in a written text will be preceded by a tag of the form <head> and followed by a tag in the form </head>. Everything between these two tags is regarded as the content of an element of type <head>.

Attributes applicable to element instances, if present, are also indicated within the start-tag, and take the form of an attribute name, an equals sign and the attribute value, which may be a number, a string literal or a quoted literal. Attribute values are used for a variety of purposes, notably to represent the part of speech codes allocated to particular words by the CLAWS tagging scheme.

For example, the <head> element may take an attribute type which categorizes it in some way. A main heading will thus appear with a start-tag <head type=main>, and a subheading with a start-tag <head type=sub>.
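The attribute syntax just described can be illustrated with a short Python sketch. It is not part of the BNC distribution: the function name parse_start_tag and the regular expression are assumptions made for illustration, and the sketch handles only the ``attribute name=value'' form, with the value either bare or enclosed in double quotes.

```python
import re

# Matches name=value where the value is either double-quoted or a bare token.
# (Illustrative pattern only; not taken from the CDIF document type definition.)
ATTR_RE = re.compile(r'([A-Za-z][\w.-]*)\s*=\s*(?:"([^"]*)"|([^\s>]+))')

def parse_start_tag(tag: str) -> tuple[str, dict[str, str]]:
    """Split a start-tag such as '<head type=main>' into (gi, attributes)."""
    inner = tag.strip()[1:-1].strip()      # drop the < and > delimiters
    parts = inner.split(None, 1)           # generic identifier, then the rest
    gi = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(rest):
        # Exactly one of the two value groups is non-empty for a given match.
        attrs[name] = quoted if quoted else bare
    return gi, attrs

print(parse_start_tag("<head type=main>"))
```

A start-tag written with attribute-name omission yields an empty attribute dictionary here, because the sketch looks only for name=value pairs.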

Case is not significant in tag or attribute names, but a consistent style has been adopted throughout the corpus. This style uses lower-case letters for identifiers, unless they are derived from more than one word, in which case the first letter of the second and any subsequent word is capitalized.

SGML permits various kinds of minimization, or abbreviatory conventions. Only two such are applied in CDIF: end-tag omission and attribute-name omission. These conventions apply only to the elements <s>, <w> and <c> (i.e., for sentences, words, and punctuation).

For all other non-empty elements, every occurrence in the distributed form of the corpus has both a start-tag and an end-tag, and any attributes specified are supplied in the form ``attribute name=value''. For the three elements mentioned above, and all empty elements, end-tags are routinely omitted. For these three elements only, attribute values are given without any associated attribute name. See section 2.4 for some examples.

Only a restricted range of characters is used in element content: specifically, the upper- and lower-case alphabetics, digits, and a subset of the common punctuation marks. All other characters are represented by SGML entity references, which take the form of an ampersand (&) followed by a mnemonic for the character, and terminated by a semicolon (;) where this is necessary to resolve ambiguity.

For example, the pound sign is represented by the string &pound;, the character é by the string &eacute; and so forth. The French word ``été'' (summer), if it appeared in the corpus, would be represented as

&eacute;t&eacute

No semi-colon is needed on the second &eacute because it is followed by a blank marking the end of the word. The mnemonics used are taken from standard entity sets, and are listed in section 6.2.
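As a rough illustration of the optional semicolon, the following Python sketch replaces entity references by characters. The table ENTITIES is a deliberately tiny, hypothetical subset chosen for this example (the full list is the one given in section 6.2), and the function name decode_entities is likewise an assumption.

```python
import re

# Illustrative subset only; the real mnemonics are listed in section 6.2.
ENTITIES = {
    "eacute": "é",
    "pound": "£",
    "mdash": "\u2014",   # long dash
    "ndash": "\u2013",   # dash used between numbers
}

# An ampersand, a name, and an optional terminating semicolon.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);?")

def decode_entities(text: str) -> str:
    """Replace known entity references; leave unknown ones untouched."""
    def repl(match: re.Match) -> str:
        return ENTITIES.get(match.group(1), match.group(0))
    return ENTITY_RE.sub(repl, text)

print(decode_entities("&eacute;t&eacute "))
```

Without the first semicolon the string would read &eacutet&eacute, and the sketch would look up the unknown name eacutet and leave it unchanged. This is the ambiguity the semicolon resolves.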

2.2 Global attributes

Three global attributes are defined in the CDIF scheme, which may potentially be specified for any element. In practice their use is limited to certain specific functions, which are discussed at the appropriate place below, but for convenience their use is also summarized here:

id system-generated identifier of an item, unique within the corpus
n any name or identifier for an element, not necessarily unique within the corpus
r the rendition or appearance of an element.

2.3 Corpus and text elements

The British National Corpus contains a large number of text samples, some spoken and some written. Each such sample has some associated descriptive or bibliographic information particular to it, and there is also a large body of descriptive information which applies to the whole corpus.

In SGML terms, the BNC Sampler consists of a single SGML element, tagged <bncSampl> (the BNC proper names this element <bnc>). This element contains a single <header> element, followed by a sequence of <bncDoc> elements. Each such <bncDoc> element contains its own <header>, followed by either a <text> element (for written texts) or an <stext> element (for spoken texts).

The components of the header are fully documented in section 5.

Note that different elements are used for spoken and written texts because each has a different substructure; this represents a departure from TEI recommended practice.

Both <text> and <stext> elements take the following attributes in addition to the attributes globally available:

complete specifies whether this text is complete or a sample. Legal values are:
- Y the full text of the original has been transcribed
- N a sample of the original text has been taken
org specifies how the content of the text is organised. Legal values are:
- compo composite content: i.e. no claim is made about the sequence in which elements inferior to this one are to be processed, or their inter-relationships
- seq sequential content: i.e. elements inferior to this are regarded as forming a logical unit, to be processed in the sequence given

The complete and org attributes are used to characterize the internal organization and completeness of written and spoken texts. For demographically collected spoken texts, the complete attribute is not used. All demographically collected spoken texts have the same internal organization: each <stext> element collects together all the conversations for a given respondent, each distinct conversation being represented by a <div> element (see further 4.1). Since the order of these <div> elements is not significant, the org attribute always has the value ``compo''.

2.4 Segments and words

At the lowest level, all texts consist of <w> (word) and <c> (punctuation) elements, grouped into <s> (segment) elements.

<s> a segment of spoken or written text as identified by the CLAWS segmentation scheme. Attributes include:
- p indicates whether or not the segment was manually post-edited at Lancaster. Legal values are:
  - Y the segment was manually post-edited.
  - N the segment was not manually post-edited.
<w> represents a grammatical (not necessarily orthographic) word. Attributes include:
- type specifies the word class assigned to this form by the CLAWS system.
<c> represents a punctuation character. Attributes include:
- type specifies the class assigned to this character by the CLAWS system.

This analysis was performed by the CLAWS system developed at the University of Lancaster. The values used for the type attribute on <w> and <c> are defined in the BNC Sampler Tagging Guidelines.

The <s> element is the basic organizational principle for the whole corpus: every text, spoken or written, may be regarded as an end-to-end sequence of <s> elements, possibly grouped into higher-level constructs, such as paragraphs or utterances.

Here is a simple example:

<s n=00011>
<w NN1>Difficulty <w VBZ>is <w VBG>being
<w VVN>expressed <w PRP>with <w AT0>the
<w NN1>method <w TO0>to <w VBI>be <w VVN>used
<w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>.
</s>

The n attribute is specified for each <s> element and gives its sequence number within the text from which it comes. If the number of this sentence is different in the BNC proper and the BNC Sampler (for example, because some material has been omitted from the Sampler), both numbers are given, with the original sentence number in parentheses, as in the following example:

<s n="508(493)" p=Y><w II>By <w NP1>Reuter <w II>in <w NP1>East <w NP1>Berlin </s>
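The two-number form of the n attribute can be unpacked as in the following Python sketch. The function name parse_n is hypothetical, and the sketch assumes that the number outside the parentheses is the Sampler sequence number and the number inside is the one used in the BNC proper.

```python
import re

# Digits, optionally followed by a second number in parentheses.
N_RE = re.compile(r"^(\d+)(?:\((\d+)\))?$")

def parse_n(value: str) -> tuple[int, int]:
    """Return (sampler_number, original_number) for an n attribute value."""
    match = N_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised n value: {value!r}")
    sampler = int(match.group(1))
    # When no parenthesised number is given, both numbers are taken to agree.
    original = int(match.group(2)) if match.group(2) else sampler
    return sampler, original

print(parse_n("508(493)"))
print(parse_n("00011"))
```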

On the BNC Sampler (but not the BNC), each <s> element occupies a single line, to facilitate searching for patterns within the corpus by non-SGML-aware software.

The code within each <w> or <c> tag is the word class code assigned by the CLAWS tagging system. These codes are defined in the CLAWS C7 Tagging Guide: note that the codes used for the Sampler differ from those used in the BNC proper.
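Because end-tags and attribute names are omitted on <w> and <c>, software that is not SGML-aware can recover the word class codes with a simple pattern. The Python sketch below is one possible approach, not a tool supplied with the corpus; the names TOKEN_RE and tokens are assumptions, and each form is taken to run from its tag up to the next tag of any kind.

```python
import re

# A <w> or <c> start-tag with its bare code, then content up to the next tag.
TOKEN_RE = re.compile(r"<([wc])\s+([^>\s]+)>([^<]*)")

def tokens(segment: str) -> list[tuple[str, str, str]]:
    """Return (element, code, form) triples for every <w> and <c> found."""
    return [
        (gi, code, form.strip())
        for gi, code, form in TOKEN_RE.findall(segment)
    ]

sample = (
    "<s n=00011>\n"
    "<w NN1>Difficulty <w VBZ>is <w VBG>being\n"
    "<w VVN>expressed <w PRP>with <w AT0>the\n"
    "<w NN1>method <w TO0>to <w VBI>be <w VVN>used\n"
    "<w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>.\n"
    "</s>"
)

for element, code, form in tokens(sample):
    print(element, code, form)
```

Entity references inside a form are returned undecoded, and multi-word units such as ``per cent'' keep their internal space.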

In most cases, <s> elements will correspond with regular orthographic sentences, and <w> elements with regular orthographic words. However, it should be noted that several common phrases are treated as single <w> elements, typically prepositional phrases such as ``in spite of'', while some single orthographic forms such as ``can't'' and possessive forms such as ``man's'' are decomposed into two <w> elements.

The white space (if any) following each orthographic word has been retained in the encoded text. Simply removing the tags will in general produce a correctly punctuated text. Note that in the current version of the corpus, the long dash has not generally been tagged as a punctuation mark, and will appear instead as an entity reference:

<s n=00024>
<w PNP>It <w VBD>was <w AT0>the<w NN1>sort
<w PRF>of <w NN1>sight &mdash<w NN1-VVB>the
<w AJ0>poor<c PUN>, <w AT0>the <w AJ0>strange &mdash
<w NN1>which <w AV0>usually <w VVD>alarmed
<w NP0>Graham<c PUN>.

Dashes used to separate numbers are represented in a similar way, using the ndash entity.
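The earlier remark that simply removing the tags will in general produce a correctly punctuated text can be tried out with a few lines of Python. This is a minimal sketch under stated assumptions: the function name plain_text is hypothetical, line breaks inside a segment are treated as ordinary white space, no attribute value is assumed to contain the character >, and entity references are left undecoded.

```python
import re

# Any start-tag or end-tag.
TAG_RE = re.compile(r"<[^>]*>")

def plain_text(markup: str) -> str:
    """Remove all tags and collapse runs of white space, including newlines."""
    without_tags = TAG_RE.sub("", markup)
    return " ".join(without_tags.split())

sample = (
    "<s n=00011>\n"
    "<w NN1>Difficulty <w VBZ>is <w VBG>being\n"
    "<w VVN>expressed <w PRP>with <w AT0>the\n"
    "<w NN1>method <w TO0>to <w VBI>be <w VVN>used\n"
    "<w TO0>to <w VVI>launch <w AT0>the <w NN1>scheme<c PUN>.\n"
    "</s>"
)

print(plain_text(sample))
```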

Quotation marks as such are also represented by entity references. The reference name used will depend on whether or not the usage of quotation marks in the text has been normalized. Information in the header should describe the course taken for a particular text, as described in section 5.2.2.

Where the quoted text is a true quotation (that is, a phrase or sequence attributed to someone other than the current narrator or writer) the <quote> element discussed in section 3.2.2 may optionally be used. This does not apply to dialogue in fictional works, which is not marked, except by the presence of the quotation mark entities, as in the following example:

<p>
<s n=0022>
<c PUQ>&bquo<w PNP>He<w VBZ>'s <w AT0>a <w AJ0>dry <w NN1>stick<c PUN>,
<w NP0>Wilson<c PUN>,<c PUQ>&equo <w VVD>said <w NP0>Mr <w NP0>Malik<c PUN>,
<c PUQ>&bquo<w CJC>but <w PNP>he <w VBZ>is <w CRD>100 <w NN0>per cent
<w AJ0>loyal<c PUN>.
<s n=0023>
<w CJC>And <w PNP>I <w VBB>am <w VVG>looking <w PRP>for <w CRD>100 <w NN0>per cent
<w NN1>loyalty<c PUN>.
<s n=0024>
<w PNI>Everything <w AV0>else <w VM0>can <w VVI>go <w NN1>hang<c PUN>!<c PUQ>&equo
</p>

2.5 Editorial indications

Editorial changes made to the texts during transcription are recorded using the following elements:

<gap> an editorial omission; marks the spot where some part of the original source text has been omitted. Attributes include:
- desc brief description of the material omitted.

The <gap> element is typically used to indicate where words identifying persons or places have been removed during transcription, where labels etc. have been suppressed for ease of processing, or where material has simply not been transcribed because it is inaudible, illegible or not transcribable (e.g. figures, graphs). It is also used to indicate where passages present in the BNC proper have been omitted from the BNC Sampler.

<reg> any editorial regularization, e.g. to correct something mistranscribed or misspelled, or to normalize variant spellings. Attributes include:
- sic supplies the original form of whatever has been regularized
<sic> a word or phrase which has not been regularized, but which is in doubt; for example, a spoken word which the transcribers cannot recognise, or a dubious spelling. Attributes include:
- reg supplies the regularized form of a word or phrase apparently misspelled deliberately.

In general, the <reg> element is used wherever a word appears to be misspelled in the source, and the <sic> element where the transcriber is unable to propose a correction, but believes the original to be erroneous. The <sic> element is also used to mark words which are intentionally misspelled, for example to indicate non-standard pronunciation; in this case, the reg attribute is used to supply the standard spelling.

In addition to the attributes listed above, these three elements all share the following attributes:

ed identifies the agency responsible for the editorial decision.
cause describes the cause for the editorial change.

Slightly different transcription policies have been followed by different transcribers, and consequently these elements may not appear in all texts. The <editDecl> element of the header described in section 5.2.2 gives further details of the editorial principles applied across the corpus. The value of the decls attribute for an individual text will indicate which principle or set of principles applies to it (see further section 5.5). The <tagsDecl> element in each text's header may also be consulted for an indication of the usage of these and other elements within it (see further section 5.2).

Users of this first release of the BNC are cautioned that the corpus contains a significant number of errors, both in transcription and encoding. Every attempt has been made to reduce the incidence of such errors to an acceptable level, by using a number of automatic and semi-automatic validation and correction procedures, but exhaustive proof-reading of a corpus of this size was not economically feasible. The corrections indicated by the tags discussed above are included only where errors have been detected, and no claim should be inferred that no other errors remain.

2.5.1 Some examples

In the following example, the start of a chapter has been deleted for sampling reasons:

<div1 complete=N n=7 org=SEQ>
<gap cause="sampling strategy" desc="beginning of chapter">
<p>
<s n=00001>
<w DPS>Her <w AJ0>thin <w NN1>voice
<w VVD>trailed <w AVP>off <w PRP>into
<w AJ0>thin <w NN1>air<c PUN>,

In the following example, a surname has been deleted for anonymization:

<s n=0210 p=Y><w NP1>Jenny<gap cause=anonymization desc="last or full name"> 
<w PPIS1>I <w VV0>think<c YCOM>, <w VBZ>is <w PPH1>it<c YQUE>? </s>

In the following example, a list of proper names has been deleted:

<div1 complete=N org=SEQ>
<head>
<s n=00081>
<hi r=ul>
<w CRD>27.6.90 </hi>
<w NN2>Minutes <w PRF>of <w AT0>a <w NN1>meeting
<w PRF>of <w AT0>the <w NP0>Juniper <w NP0>Green
<w NP0>Village <w NN0>Association <w VVD-VVN>held
<w PRP>in <w AT0>the <w NP0>Village <w NP0>Hall
<w PRP>on <w NP0>Wednesday<c PUN>, <w ORD>27th
<w NP0>June <w PRP>at <w CRD>7.30 <w AV0>pm<c PUN>.
</head>
<gap desc="Committee members present and absentees" ed=OUP>

In the following example, a typographic error in the original has been regularized:

...
<w PRP>upon <w DTQ>which <w NP0>Odette <w VHD>had
<w VVN>worked <w DT0>a few <w AJ0>hasty
<reg sic=stiches> <w NN2>stitches
</reg> <w PRF>of <w NN1>embroidery
...

In the following example, typographic variation in the original has been regularized:

<s n=00029>
<w AT0>The <w NN1>sum <w PRF>of <w NN0>&pound;60
<w VHD>had <w VBN>been <w VVN>raised
<w PRP>for <w AT0>the <w NN1>Telethon <w NN1>Appeal
<w CJC>and <w AJC>further <reg ed=OUCS sic="week end">
<w NN1>weekend </reg> <w NN2>competitions
<w VBB>are <w PRP>on <w AT0>the <w NN1>programme<c PUN>.

In the following example, the transcriber has expressed a doubt as to the validity of the word ``memorandising'', but no correction has been made, as it has for the misspelling ``bedeviled'' which follows it:

<s n=02444>
<w PNP>He <w VM0>could <w VVI>listen
<w PRP>to <w DPS>her <w AJ0>gentle
<w NN1-VVG>teasing <w PRP>before
<w VVG>going <w PRP>into <w DPS>his <w AJ0>secret
<w NN1>room <w CJC>and <sic> <w VVG>memorandising </sic>
<w AT0>the <w NN2>questions <w DTQ>which
<reg sic=bedeviled> <w VVD>bedevilled </reg>
<w PNP>him<c PUN>.
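The regularizations recorded in examples such as these can be listed mechanically. The Python sketch below pairs the value of each sic attribute with the regularized form inside the corresponding <reg> element. The function name regularizations and the patterns are assumptions made for illustration, and <reg> elements are assumed not to be nested.

```python
import re

# A <reg ...> start-tag, its content, and the matching </reg> end-tag.
REG_RE = re.compile(r"<reg\b([^>]*)>(.*?)</reg>", re.DOTALL)
# The sic attribute, with either a double-quoted or a bare value.
SIC_ATTR_RE = re.compile(r'\bsic\s*=\s*(?:"([^"]*)"|([^\s>]+))')
TAG_RE = re.compile(r"<[^>]*>")

def regularizations(markup: str) -> list[tuple[str, str]]:
    """Return (original_form, regularized_form) pairs for each <reg> element."""
    pairs = []
    for attrs, content in REG_RE.findall(markup):
        match = SIC_ATTR_RE.search(attrs)
        if match is None:
            continue  # no original form was recorded for this element
        original = match.group(1) if match.group(1) is not None else match.group(2)
        regularized = " ".join(TAG_RE.sub("", content).split())
        pairs.append((original, regularized))
    return pairs

sample = (
    '<w CJC>and <w AJC>further <reg ed=OUCS sic="week end">\n'
    "<w NN1>weekend </reg> <w NN2>competitions\n"
    "<reg sic=bedeviled> <w VVD>bedevilled </reg>\n"
)

print(regularizations(sample))
```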

2.6 Pointers

Parts of a text are normally transcribed in the same order as they appear in the source text. In certain circumstances, however, parts of a text have been moved from the position in which they appear in the source to simplify linguistic processing. There are two common situations where this is necessary:

- where a caption or note appears in the middle of a syntactic unit
- where speakers overlap

Where re-ordering of the first type has occurred, the moved element is generally re-located to the end of the paragraph or similar element in which it appears. Its original position is recorded using a pointer element (<ptr>), an empty tag whose t attribute supplies the identifier of the relocated element. In the following example, the note which originally appeared between the words ``roughie-toughie'' and ``types'' has been relocated to the end of the paragraph. The note itself is given an automatically generated identifier, C87NT000, which is then supplied as the value of the t attribute:

<s n=0141>
<w CRD>Two <w NN2>men <w VVD>retained <w DPS>their 
<w NN2>marbles<c PUN>, <w CJC>and <w CJS>as <w NN1-VVB>luck 
<w VM0>would <w VHI>have <w PNP>it <w PNP>they<w VBB>'re
<w AV0>both <w AJ0>roughie-toughie <ptr t=C87NT000> <w NN2>types 
<w AV0>as <w AV0>well <w CJS>as <w AJ0>military <w NN2>scientists 
<c PUN>&mdash <w AT0>a <w NN1>cross <w PRP>between <w NP0>Albert 
<w NP0>Einstein <w CJC>and <w NN1>Action <w NN1-NP0>Man<c PUN>!
<s n=0142>
<!-- ... -->
<w DPS>their <w NN1>way <w PRP>to <w NN1>freedom <c PUN>&mdash 
<w AV0>so <w VVB>get <w NN1-VVG>blasting<c PUN>!
</p>
<note id=C87NT000>
<s n=0143>
<w VVN>continued <w PRP>on <w NN1>page <w CRD>7
</note>

A similar mechanism is used to represent alignment of synchronous speech; see further section 4.4.
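For relocated notes, the link between a <ptr> and its target can be followed as in the Python sketch below. It is an illustration only: the function name resolve_pointers is hypothetical, and the sketch assumes unquoted identifiers, with t and id written as the first attribute of their tags, as in the example above.

```python
import re

PTR_RE = re.compile(r"<ptr\s+t=([^\s>]+)>")
NOTE_RE = re.compile(r"<note\s+id=([^\s>]+)>(.*?)</note>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]*>")

def resolve_pointers(markup: str) -> dict[str, str]:
    """Map each <ptr> target to the plain text of the <note> with that id."""
    notes = {
        note_id: " ".join(TAG_RE.sub("", body).split())
        for note_id, body in NOTE_RE.findall(markup)
    }
    # Pointers whose target is not found in the markup are silently skipped.
    return {t: notes[t] for t in PTR_RE.findall(markup) if t in notes}

sample = (
    "<w AJ0>roughie-toughie <ptr t=C87NT000> <w NN2>types\n"
    "</p>\n"
    "<note id=C87NT000>\n"
    "<s n=0143>\n"
    "<w VVN>continued <w PRP>on <w NN1>page <w CRD>7\n"
    "</note>"
)

print(resolve_pointers(sample))
```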
