Basic structure
The original British National Corpus was provided as an
application of ISO 8879, the Standard Generalized Mark-Up Language
(SGML). This international standard provides, amongst other things, a
method of specifying an application-independent document grammar, in
terms of the elements which may appear in a document, their
attributes, and the ways in which they may legally be combined. SGML
was a predecessor of XML, the extensible markup language defined by
the World Wide Web Consortium and now in general use on the World Wide
Web. XML was originally designed as a means of distributing SGML
documents on the web.
This XML edition of the BNC is delivered in an XML format which is
documented in this manual in section below;
more detailed information about XML itself is readily available in
many places. The article in Wikipedia () is probably as
good a starting point as any; another is at
The original BNC encoding format was also strongly influenced by the
proposals of the Text Encoding Initiative (TEI). This
international research project resulted in the development of a set of
comprehensive guidelines for the encoding and interchange of a wide
range of electronic texts amongst researchers. An initial report
appeared in 1991, and a substantially revised and expanded version in
early 1994. A conscious attempt was made to conform to
TEI recommendations, where these had already been
formulated, but in the first version of the BNC there were a number of
differences in tag names and models. In the second edition of the BNC
(BNC World), the tagging scheme was changed to conform as far as
possible with the published Recommendations of the TEI
(). In the XML edition, this process has continued,
and the corpus schema is now supplied in the form of a TEI customization: see
further .
Markup conventions
The BNC XML edition is marked up in XML and encoded in
Unicode. These formats are now so pervasive as to need little
explication here; for the sake of completeness, however, we give a
brief summary of their chief characteristics. We strongly recommend
the use of XML-aware processing tools to process the corpus; see
further .
An XML document, such as the BNC, consists of a single root
element, within which are nested occurrences of other element
types. All element occurrences are delimited by
tags. There are two forms of tag, a
start-tag, marking the beginning of an
element, and an end-tag marking its end (in
the case of empty elements, the two may be
combined; see below).
Tags are delimited by the characters < and >, and contain the
name of the element (its gi, for generic
identifier), preceded by a solidus (/) in the case of an
end-tag.
For example, a heading or title in a written text will be preceded
by a tag of the form <head> and followed by a tag of the form
</head>. Everything between these two tags is regarded as the
content of an element of type head.
Attributes applicable to element instances, if
present, are also indicated within the start-tag, and take the form of
an attribute name, an equals sign and the attribute value, in the form
of a quoted literal. Attribute values are used
for a variety of purposes, notably to represent the part of speech codes
allocated to particular words by the CLAWS tagging scheme.
For example, the head element may take an attribute
type which categorizes it in some way. A main heading
will thus appear with a start-tag <head type="MAIN">, and a
subheading with a start-tag <head type="SUB">.
The names of elements and attributes are case-significant, as are
attribute values. The style adopted throughout the BNC scheme is to
use lower-case letters for identifiers, unless they are derived from
more than one word, in which case the first letter of the second and
any subsequent word is capitalized: examples include
teiHeader or particDesc (for participant description).
Unless it is empty, every occurrence of an element
must have both a start-tag and an end-tag. Empty elements may use a
special syntax in which the start-tag and end-tag are combined: for
example, the point at which a page break occurs in an original source
is marked <pb/> rather than <pb></pb>.
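As a rough illustration of the conventions just described, the following Python sketch parses a tiny hand-made fragment with the standard library module xml.etree.ElementTree. The fragment and its wrapping div element are assumptions made for the purpose of the example; they are not taken from any corpus file.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment for illustration only; not taken from a corpus file.
fragment = '<div><head type="MAIN">A heading</head><pb/><pb></pb></div>'
root = ET.fromstring(fragment)

head = root.find("head")
print(head.tag)          # the element name (generic identifier)
print(head.get("type"))  # the attribute value given in the start-tag
print(head.text)         # the content between start-tag and end-tag

# Names are case-significant: "Head" does not match "head".
print(root.find("Head") is None)

# The combined form <pb/> and the pair <pb></pb> both yield empty elements.
pbs = root.findall("pb")
print([(pb.tag, len(pb)) for pb in pbs])
```

Any XML-aware library should behave in the same way; ElementTree is used here only because it ships with Python.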
The BNC is delivered in UTF-8 encoding: this means that almost
all characters in the corpus are represented directly by the
appropriate Unicode character. The chief exceptions are the ampersand
(&), which is always represented by the special string
&amp;, the double quotation mark, which is sometimes
represented by the special string &quot;, and the
arithmetic less-than sign, which always appears as
&lt;. These named entity
references use a syntactic convention of XML which is
followed by this version of the corpus. All other characters,
including accented letters such as é or special characters such as —,
are represented directly.
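The effect of these entity references can be seen in the following Python sketch, which again uses xml.etree.ElementTree. The sample sentence is invented for illustration. An XML-aware parser resolves the three entity references automatically and passes UTF-8 characters through unchanged.

```python
import xml.etree.ElementTree as ET

# Invented sample, encoded as UTF-8 bytes as a corpus file would be.
# \u00e9 is the accented letter e-acute, written as an escape for safety.
raw = (
    "<s>Marks &amp; Spencer said &quot;x &lt; y&quot; at the caf\u00e9</s>"
).encode("utf-8")

s = ET.fromstring(raw)
print(s.text)  # entity references appear as the characters they stand for

# On output the serializer escapes the ampersand and less-than sign again.
serialized = ET.tostring(s, encoding="unicode")
print(serialized)
```

A utility that is not XML-aware sees the entity references as literal strings and must decode them itself.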
The number of linebreaks in the corpus has been reduced to a
minimum in order to simplify processing by utilities that are not XML-aware. In particular:
XML tags are never broken across linebreaks;
the TEI header prefixed to each text contains no linebreaks;
each s element begins on a new line.
Many XML-aware utilities are available to convert this representation as required.
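The layout rules above are what make simple line-oriented processing feasible. The following Python sketch shows one way a utility that is not XML-aware might list segment numbers. It assumes that every s start-tag has the form <s n="...">, and it runs on an invented sample in which the attributes of w elements are omitted for brevity.

```python
import re

# Invented stand-in for a few lines of a corpus file. Assumed layout:
# each <s> start-tag opens a new line and no tag is split across lines.
sample = (
    '<wtext type="FICTION"><div><head>\n'
    '<s n="1"><w>CHAPTER </w><w>1</w></s></head><p>\n'
    '<s n="2"><w>But</w></s></p></div></wtext>\n'
)

S_START = re.compile(r'^<s n="([^"]+)"')


def segment_numbers(lines):
    """Return the n value of every line that opens with an <s> start-tag."""
    numbers = []
    for line in lines:
        match = S_START.match(line)
        if match:
            numbers.append(match.group(1))
    return numbers


numbers = segment_numbers(sample.splitlines())
print(numbers)
```

This kind of shortcut is fragile, and an XML-aware parser remains the safer choice for anything beyond simple counting.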
An example
Here is the opening of text J10 (a novel by Michael Pearce). In
this example, as elsewhere, we have placed each element on a separate
line for clarity; this is not, however, a requirement of XML.
CHAPTER
1
‘
But
,
’
said
Owen
,‘
where
is
the
body
?
’
....
This example has been reformatted to make its structure more apparent:
as noted above, in the actual corpus texts, newlines appear only at
the start of each s element, rather than (as here) at the
start of each element. The original files also lack the extra white
space at the start of each line, used in the above example to indicate
how the XML elements nest within one another.
The example begins with the start tag for a wtext
(written text) element,
which bears a type attribute, the value of which is
FICTION, the code used for texts derived from published
fiction. The start tag is followed by an empty pb element,
which provides the page number in the original source text. This in
turn is followed by the start of a div element, which
contains the first subdivision (chapter) of this text. This first
chapter begins with a heading (marked by a head element)
followed by a paragraph (marked by the p element). Further
details and examples are provided for all of these elements and their
functions elsewhere in this documentation.
Each distinct word and punctuation mark in the text, as identified
by the CLAWS tagger, has been separately tagged with a w or
c element as appropriate. These elements both bear a
c5 attribute, which indicates the code from the CLAWS
C5 tagset allocated to that word by the CLAWS POS-tagger; w
elements also bear a pos attribute, which provides a less
fine-grained part of speech classification for the word, and an
hw attribute, which indicates the root form of the
word. For example, the word said in this example has the CLAWS
5 code VVD, the simplified POS tag VERB, and the
headword say. The sequence of words and punctuation marks
making up a complete segment is tagged as an s element, and
bears an n attribute, which supplies its sequence
number within the text. A combination of text identifier (the
three-letter code) and s number may be used to reference any part
of the corpus: the example above contains J10 1 and J10 2.
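The structure described in the preceding paragraphs can be sketched in Python as follows. The fragment is a reconstruction from the prose, not a copy of the corpus file. Only the codes for said (VVD, VERB, say), the FICTION type and the segment numbers 1 and 2 are stated in the text. All other attribute values are illustrative guesses, and the attributes of pb and div are left out.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the prose description; see the caveats above.
fragment = """
<wtext type="FICTION">
 <pb/>
 <div>
  <head>
   <s n="1"><w c5="NN1" hw="chapter" pos="SUBST">CHAPTER </w><w c5="CRD" hw="1" pos="ADJ">1</w></s>
  </head>
  <p>
   <s n="2"><c c5="PUQ">‘</c><w c5="CJC" hw="but" pos="CONJ">But</w><c c5="PUN">,</c><c c5="PUQ">’ </c><w c5="VVD" hw="say" pos="VERB">said </w><w c5="NP0" hw="owen" pos="SUBST">Owen</w></s>
  </p>
 </div>
</wtext>
"""

TEXT_ID = "J10"  # text identifier, as given in the prose

root = ET.fromstring(fragment.strip())

# A reference is the text identifier followed by the s number.
references = [f"{TEXT_ID} {s.get('n')}" for s in root.iter("s")]
print(references)

# Look up the three codes carried by the w element for "said".
said = next(w for w in root.iter("w") if (w.text or "").strip() == "said")
print(said.get("c5"), said.get("pos"), said.get("hw"))
```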
This is not, of course, a complete text: in particular, it lacks
the TEI header which is prefixed to each text file making up the
corpus. Its purpose is to indicate how the corpus is encoded. Any
XML-aware processing software, including common Web browsers, should be
able to operate directly on BNC texts in XML format.
The remainder of this manual describes in more detail the intended
semantics for each of the XML elements used in the corpus, with
examples of their use.
Corpus and text elements
The BNC contains a large number of
text samples, some spoken and some written. Each such
sample has some associated descriptive or bibliographic information
particular to it, and there is also a large body of descriptive
information which applies to the whole corpus.
In XML terms, the corpus consists of a single
element, tagged bnc. This element contains a single
teiHeader element, containing metadata which relates to the
whole corpus, followed by a sequence of bncDoc
elements. Each such bncDoc element contains its own
teiHeader, containing metadata relating to that specific
text, followed by either a wtext element (for
written texts) or an stext element (for spoken texts).
Each bncDoc element also carries an xml:id attribute, which
supplies its standard three-character identifier.
The components of the TEI header are fully documented in section .
Note that different elements are used for spoken and written texts
because each has a different substructure; this represents a departure
from TEI recommended practice.
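The following Python sketch walks a miniature structure of this kind and reports whether each text is written or spoken. The identifiers XX1 and XX2 and the empty headers are placeholders invented for the example. Note that ElementTree exposes the xml:id attribute under its namespace-qualified name.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Miniature stand-in for the corpus; identifiers and headers are placeholders.
corpus = ET.fromstring(
    "<bnc><teiHeader/>"
    '<bncDoc xml:id="XX1"><teiHeader/><wtext type="FICTION"/></bncDoc>'
    '<bncDoc xml:id="XX2"><teiHeader/><stext/></bncDoc>'
    "</bnc>"
)


def text_kinds(bnc):
    """Map each text identifier to 'written' or 'spoken'."""
    kinds = {}
    for doc in bnc.findall("bncDoc"):
        if doc.find("wtext") is not None:
            kinds[doc.get(XML_ID)] = "written"
        elif doc.find("stext") is not None:
            kinds[doc.get(XML_ID)] = "spoken"
    return kinds


kinds = text_kinds(corpus)
print(kinds)
```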
The function of these elements and their attributes may be summarized as follows:
Segments and words
The s element is the basic organizational principle for
the whole corpus: every text, spoken or written, is represented as a
sequence of s elements, possibly grouped into
higher-level constructs, such as paragraphs or utterances. Each
s element in turn contains w or c elements
representing words and punctuation marks.
The n attribute is used to provide a sequential
number for the s element to which it is attached. To identify any part of the corpus
uniquely, therefore, all that is needed is the three-character text
identifier (given as the value of the attribute xml:id
on the bncDoc containing the text), followed by the value of
the n attribute of the s element containing
the passage to be identified.
These numbers are, as far as possible, preserved across versions of
the corpus, to facilitate referencing. This implies that the sequence
numbering may have gaps, where duplicate sequences or segmentation
errors have been identified and removed from the corpus. In a few
(about 700) cases, sequences formerly regarded as a single s
have subsequently been split into two or more s units. For
compatibility with previous versions of the corpus, the same number is
retained for each new s, but it is suffixed by a fragment
number. For example, in text A18, the s formerly numbered
1307 has now been replaced by two s elements, numbered
1307_1 and 1307_2 respectively.
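Software that sorts or compares s numbers therefore cannot treat them as plain integers. The following Python sketch shows one possible approach. It assumes that a fragment suffix always consists of an underscore followed by an integer, as in the example above; the helper name parse_s_number is our own.

```python
def parse_s_number(n):
    """Split an s number such as '1307' or '1307_2' into (number, fragment).

    Unsplit segments are given fragment 0 so that they sort consistently.
    """
    base, _, fragment = n.partition("_")
    return int(base), int(fragment) if fragment else 0


numbers = ["1308", "1307_2", "1306", "1307_1"]
ordered = sorted(numbers, key=parse_s_number)
print(ordered)
```

Sorting on the parsed pair keeps the fragments of a split segment together and in order, which a plain string sort would not guarantee for numbers of differing lengths.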
Fragmentary sentences such as headings or labels in lists are
also encoded as s elements, as in the following example from
text CBE:
Serious
fit
of
giggles
A
PAIR
of
TV
newsreaders
...
... ...
As noted above, at the lowest level, the corpus consists of
w (word) and c (punctuation) elements, grouped into
s (segment) elements. Each w element carries three
attributes to indicate its morphological class or part of speech, as
determined by the CLAWS tagger, a simplified form of that POS code,
and an automatically-derived root form or lemma. Each c
element also carries codes for part of speech, but not for lemma. For
example, the word corpora wherever it appears in
the BNC is presented like this: corpora
Any white space following a word in the original source is
preserved within the w tag, as in the previous example. White
space is not added if no space is present in the source, as in the
following example:
corpora.
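Because white space is stored inside the elements in this way, the running text of a segment can be recovered simply by concatenating the content of its w and c elements. The following Python sketch does this for an invented segment. The attributes are omitted here for brevity, although real w and c elements always carry them.

```python
import xml.etree.ElementTree as ET

# Invented segment; attributes are omitted for brevity.
s = ET.fromstring(
    '<s n="1">'
    "<w>The </w>"
    "<w>corpora</w>"
    "<c>.</c>"
    "</s>"
)

# No spaces need to be inserted: they are already part of the content.
plain = "".join(s.itertext())
print(plain)
```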
The w element encloses a single token as identified by
the CLAWS tagger. Usually this will correspond with a word as
conventionally spelled; there are however two important
exceptions. Firstly, CLAWS regards certain common abbreviated or
enclitic forms such as 's in
he's or dog's as distinct tokens, thus
enabling it to distinguish them as being an auxiliary verb in the
first case, and a genitive marker in the second. For example,
It's is encoded as follows:
It
's
while dog's is encoded:
dog
's
Secondly, CLAWS treats certain common multi-word units as if they
were single tokens, giving the whole of a sequence such as
in spite of a single POS code. These multiword
sequences were not distinguished from individual w elements
in earlier versions of the corpus; in the present version however a
new element mw (for multiword) has been introduced to mark
them explicitly. The individual components of a mw sequence
are also tagged as w elements in the same way as
elsewhere. Thus, the phrase in terms of, which
in earlier editions of the BNC would have been encoded as a single
w element, is now encoded as follows:
in
terms
of
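Software that counts tokens must decide whether to treat a multiword unit as one token or as several. The following Python sketch supports both choices. The segment is invented and every attribute value in it is illustrative. We also assume, for the purpose of the example, that the mw element carries its POS code in a c5 attribute and that w and mw elements are direct children of s.

```python
import xml.etree.ElementTree as ET

# Invented segment; all attribute values are illustrative assumptions.
s = ET.fromstring(
    '<s n="1">'
    '<mw c5="PRP">'
    '<w c5="PRP" hw="in" pos="PREP">in </w>'
    '<w c5="NN2" hw="term" pos="SUBST">terms </w>'
    '<w c5="PRF" hw="of" pos="PREP">of </w>'
    "</mw>"
    '<w c5="NN1" hw="cost" pos="SUBST">cost</w>'
    "</s>"
)


def tokens(segment, merge_multiwords=True):
    """List the token strings in a segment, optionally one per mw unit."""
    result = []
    for child in segment:
        if child.tag == "mw" and merge_multiwords:
            # Treat the whole multiword unit as a single token.
            result.append("".join(child.itertext()).strip())
        elif child.tag == "mw":
            # Treat each component word as a token in its own right.
            result.extend((w.text or "").strip() for w in child)
        else:
            result.append((child.text or "").strip())
    return result


merged = tokens(s)
split = tokens(s, merge_multiwords=False)
print(merged)
print(split)
```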
Detailed information about the procedures by which the part of
speech and lemmatization information was added to the corpus is
provided in section , which is derived from
the Manual to accompany The British National Corpus
(Version 2) with Improved Word-class Tagging by Geoffrey Leech
and Nicholas Smith, as distributed along with the BNC World
edition of the corpus.
A brief summary of the codes used and their significance is also
provided in the reference section below
().
Editorial indications
Despite the best efforts of its creators, any corpus as large as
the BNC will inevitably contain many errors, both in transcription and
encoding. Every attempt has been made to reduce the incidence of such
errors to an acceptable level, using a number of automatic and
semi-automatic validation and correction procedures, but exhaustive
proof-reading of a corpus of this size remains economically
infeasible.
Editorial interventions in the marked-up texts take three forms. On a
few occasions, where markup or commentary introduced by transcribers
during the process of creating the corpus may be helpful to subsequent
users, it has been retained in the form of an XML comment. On some
occasions, encoders have decided to correct material evidently wrong
in their copy text: such corrections are marked using the
corr element. And on several occasions, sampling,
anonymization or other concerns have led to the omission of
significant parts of the original source; such omissions are marked by
means of the gap element.
The transcription and editorial policies defined for the corpus may
not have been applied uniformly by different transcribers and
consequently the usage of these elements is not consistent across all
texts. The tagsDecl element in each text's header may be
consulted for an indication of the usage of these and other elements
within it (see further section ). Their absence
should not be taken to imply that the text is either complete or
perfectly transcribed.
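Where a quick check of a particular text is wanted, the gap and corr elements can also be counted directly. The following Python sketch does so for an invented paragraph. The attributes of gap and corr are deliberately left out, and the placement of corr around a w element is an assumption made only for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented paragraph; element placement and missing attributes are
# assumptions made for illustration only.
p = ET.fromstring(
    '<p><s n="1"><w>I </w><w>asked </w><w>Mr </w><gap/>'
    "<w>for </w><corr><w>help</w></corr></s></p>"
)


def editorial_counts(element):
    """Count gap and corr elements anywhere below the given element."""
    return Counter(e.tag for e in element.iter() if e.tag in ("gap", "corr"))


counts = editorial_counts(p)
print(counts)
```

A count of zero shows only that no intervention was recorded, not that the text is free of omissions or errors.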
In the following example, the first three chapters have been omitted
for sampling reasons:
Friday
16
September
to
Tuesday
20
September
Once
free
of
the
knotted
tentacles of the
eastern suburbs, Dalgliesh made good time and by three he
was driving through
Lydsett village.... ...
In the following example, a proper name has been omitted:
I
asked
Mr
and
...
In the following example, a telephone number has been omitted:
He appealed for anyone with information to contact him on .
In the following example, a typographic error in the original has been
corrected:
...
good
or
heroic
behaviour
...
In the following example, a word omitted in the original has been
supplied as a correction:
Apart from some eye-liner aberrations as a teenager, Mr Punch, it must be said, is absolutely straight as a die.
The usage of these elements may be summarized as follows:
Note that the sic element used in preceding editions of the
BNC is no longer used.