The header
The header of a TEI-conformant text provides a
structured description of its contents, analogous to the title page
and front matter of a book. The component elements of a TEI header are
intended to provide in machine-processable form all the information
needed to make sensible use of the Corpus.
Every separate text in the British National Corpus (i.e. each
bncDoc element) has its own header, referred to below as a
text header. In addition, the corpus itself has a header,
referred to below as the corpus header, containing
information which is applicable to the whole corpus. Both
corpus and text headers are represented by teiHeader
elements.
The corpus header is supplied in a separate file called
bncHdr.xml, whereas text headers are prefixed to each
file in the Texts directory. In the remainder of this section, we describe the components of the
teiHeader element, as used within the BNC.
A TEI header contains a file description, an encoding description,
a profile description and a revision description, represented by the
following four elements: fileDesc, encodingDesc, profileDesc and
revisionDesc.
The file description
The file description (fileDesc) is the first of the four
main constituents of the header. It is intended to document an electronic file,
i.e. (in the case of a corpus header) the whole corpus, or (in
the case of a text header) any characteristics peculiar to an
individual file within it. In each case, it contains the following
five subdivisions: titleStmt, editionStmt, extent, publicationStmt and sourceDesc.
Further detail for each of these is given in the following
subsections.
The title statement
The title statement (titleStmt) element of a BNC text
contains one or more title elements, optionally followed by
author, editor, or respStmt elements. These
sub-elements are used throughout the header, wherever the title of a
work or a statement of responsibility is required.
For the corpus header, the title statement looks like this:
The British National Corpus: XML Edition
Lead partner in consortium
Oxford University Press
Text selection for miscellaneous and unpublished written materials
W R Chambers
Text selection, data capture and transcription for spoken texts and for 14% of
published written texts
Longman ELT
Text selection for 86% published written texts
Oxford University Press
Data capture and transcription for all miscellaneous and unpublished written
texts and for 86% of published written texts
Oxford University Press
XML conversion, encoding, storage and distribution
Oxford University Computing Services
Text enrichment
Unit for Computer Research into the English Language,
University of Lancaster
In individual corpus texts, the title statement follows a pattern
like the following:
The National Trust Magazine. Sample containing
about 21015 words from a periodical (domain: arts)
Data capture and transcription
Oxford University Press
The content of the title element includes the title of the
source, followed by the phrase "Sample containing about", the
approximate word count for the sample, and further information about
the text type and domain, all extracted from other parts of the
header. This is followed by responsibility statements showing which
of the BNC Consortium members was responsible for capturing the
text originally.
Here are some typical examples:
How we won the open: the caddies' stories. Sample containing
about 36083 words from a book (domain: leisure)
Harlow Women's Institute committee meeting.
Sample containing about 246 words speech recorded
in public context
The Scotsman: Arts section. Sample containing
about 48246 words from a periodical (domain: arts)
32 conversations recorded by `Frank' (PS09E)
between 21 and 28 February 1992 with 9 interlocutors,
totalling 3193 s-units, 20607 words, and 3 hours
22 minutes 23 seconds of recordings.
[Leaflets advertising goods and
products]. Sample containing about 23409 words
of miscellanea (domain: commerce)
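The title pattern described above is regular enough to be unpacked mechanically. The following is a minimal sketch in Python, assuming only the wording shown in the examples; the regular expression and the function name parse_title are illustrative and not part of the BNC distribution. Titles that do not follow the pattern (such as the spoken demographic example above) are reported as None.

```python
import re
from typing import Optional

# Pattern inferred from the example titles; an assumption, not a
# specification published with the corpus.
TITLE_PATTERN = re.compile(
    r"^(?P<source>.+?)\.\s+Sample containing about "
    r"(?P<words>\d+) words\s*(?P<rest>.*)$"
)
DOMAIN_PATTERN = re.compile(r"\(domain:\s*(?P<domain>[^)]+)\)")


def parse_title(title: str) -> Optional[dict]:
    """Split a text-header title into source title, word count and domain.

    Returns None when the title does not contain the phrase
    "Sample containing about".
    """
    # Titles may be wrapped over several lines, so collapse whitespace first.
    normalized = " ".join(title.split())
    match = TITLE_PATTERN.match(normalized)
    if match is None:
        return None
    domain = DOMAIN_PATTERN.search(match.group("rest"))
    return {
        "source": match.group("source"),
        "words": int(match.group("words")),
        "domain": domain.group("domain").strip() if domain else None,
    }


if __name__ == "__main__":
    print(parse_title(
        "The National Trust Magazine. Sample containing\n"
        "about 21015 words from a periodical (domain: arts)"
    ))
```

Spoken context-governed titles carry no domain in parentheses, so the domain field is simply left as None for them.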
A respStmt element is used to indicate
each agency responsible for any significant effort in the creation of
the text. Since responsibilities for data encoding and storage, and for
enrichment, are the same for all texts, they are not repeated in each
text header. The responsibility for
original data capture and transcription varies text by text, and is
therefore stated in each text header, in the following manner:
Data capture and transcription
Longman ELT
Author and editor information for the source from which a text is derived (e.g.
the author of a book) is not included in the fileDesc element but in the sourceDesc element discussed below.
The edition statement
The editionStmt element is used to specify an
edition for each file making up the corpus. It takes the same form in
both the corpus header and individual text headers:
BNC XML Edition, January 2007
The extent statement
The extent
element is used in each text header to specify the size of the text to
which it is attached, as in the following example:
21015 tokens; 21247 w-units; 957 s-units
These counts do not include the size of the header
itself. The number of tokens is generated by the Unix
wc utility, which simply counts blank-delimited
strings; the other figures give the number of w and
s elements respectively.
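As an illustration of the two kinds of count, here is a small Python sketch. It assumes the extent string always has the "number label" form separated by semicolons, as in the example above; the function names parse_extent and count_tokens are hypothetical.

```python
import re


def parse_extent(extent: str) -> dict:
    """Turn '21015 tokens; 21247 w-units; 957 s-units' into a dict of ints."""
    counts = {}
    for part in extent.split(";"):
        match = re.fullmatch(r"\s*(\d+)\s+(\S+)\s*", part)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def count_tokens(text: str) -> int:
    """Count blank-delimited strings, the notion of token used by `wc -w`."""
    return len(text.split())


if __name__ == "__main__":
    print(parse_extent("21015 tokens; 21247 w-units; 957 s-units"))
    print(count_tokens("a short   run of\nblank-delimited strings"))
```

The token count depends only on whitespace, whereas the w-unit and s-unit figures depend on the markup, so the two measures need not agree.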
The publication statement
The publicationStmt element is used
to specify publication and
availability information for an electronic text. It contains the
following three elements:
Individual text headers contain the following fixed text for the
first two of these:
Distributed under licence by Oxford University Computing
Services on behalf of the BNC Consortium.
This material is protected by international copyright
laws and may not be copied or redistributed in any way.
Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full
licencing and distribution conditions.
For contractual reasons, the corpus header includes a somewhat longer rehearsal of
the terms and conditions under which the BNC is made available.
For individual text headers, two identification numbers are
supplied, distinguished by the value of their type
attribute.
A0A
CAMfct
The second identifier (of type old) is the old-style
mnemonic or numeric code attached to BNC texts during the production of
the corpus, and is still used to label the original printed source materials in the
BNC Archive. The first, a three-character code (of type bnc),
is the standard BNC identifier. It is used both as the name of the file in
which the text is stored and as the value supplied for the
xml:id attribute on the bncDoc element containing
the whole text, and should always be used to cite the text. The code
is a completely arbitrary identifier, and does not indicate anything
about the nature of the text.
The source description
The sourceDesc element is used to supply
bibliographic details for the original source material from which an
electronic text derives. In the case of a BNC text, this might be a
book, pamphlet, newspaper etc., or a recording. One of the following
elements available within the sourceDesc will be used, as
appropriate:
These elements are not used within the corpus header, which simply
contains a note about the sources from which the corpus was derived,
tagged as a para (paragraph). The headers of individual texts
each contain one of the above elements to specify their source.
Context-governed spoken texts derived from broadcast or similar
published material may have either a recording statement or a
bibliographic record as their source.
All bibliographic data supplied in the individual text headers is
collected together and reproduced elsewhere in this documentation.
The recording statement
The recording statement (recordingStmt) element
contains one or more recording elements:
The value of the n attribute here provides the
number of the audio tape holding the original recording, as deposited
with the British Library's Sound Archive in London.
In the following simple example, typical of most
of the context-governed parts of the BNC, the
recording element has no content at all:
When, as is often the case for the spoken demographic parts of the
BNC, a text has been made up by transcribing several different
recordings made by a single respondent over a period of time, each
such recording will have its own recording element, as in the
following example:
Note the presence of an xml:id attribute on each of the
above recordings. The value given here is used to indicate the
recording from which a given part of the text was transcribed. Each
recording is transcribed as a distinct div (division) element
within an stext. In that element, the identifier of the
source recording is supplied as the value of
a decls attribute. Thus, in the spoken text derived
from the above mentioned recordings, there will be a div
element starting as follows:
...
which will contain the part of the text transcribed from that recording. As noted
above, the identifier supplied on the n attribute is quite
distinct, and identifies the tape on which the original recording was
made, and by which it is referenced in the British Library's Sound Archive.
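The link between a div and its source recording can be followed programmatically. The Python sketch below is illustrative only: the sample document, its identifiers and its tape numbers are invented, the nesting of elements is simplified, and the helper tape_for_each_div is hypothetical. A leading "#" on the decls value is tolerated, since the exact form of the pointer is not stated here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: identifiers and tape numbers are placeholders.
SAMPLE = """
<bncDoc xml:id="XXX">
  <teiHeader>
    <recordingStmt>
      <recording n="000101" xml:id="XXXRE000"/>
      <recording n="000102" xml:id="XXXRE001"/>
    </recordingStmt>
  </teiHeader>
  <stext>
    <div decls="XXXRE001"/>
  </stext>
</bncDoc>
"""


def tape_for_each_div(root: ET.Element) -> list:
    """Pair each div's recording identifier with the tape number (n)."""
    tapes = {rec.get(XML_ID): rec.get("n") for rec in root.iter("recording")}
    pairs = []
    for div in root.iter("div"):
        ref = (div.get("decls") or "").lstrip("#")
        pairs.append((ref, tapes.get(ref)))
    return pairs


if __name__ == "__main__":
    print(tape_for_each_div(ET.fromstring(SAMPLE)))
```

The xml:id value identifies the recording within the corpus, while the n value is only looked up at the end, which mirrors the distinction drawn above between the two identifiers.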
Structured bibliographic record
In addition to its usage within the corpus texts, the bibl element is also used to record
bibliographic information for each non-spoken component of the BNC.
In this case, its structure is constrained to contain only the
following elements in the order specified:
During production of the BNC, the n attribute was
used with both author and imprint elements to supply
a six-letter code identifying the author or imprint concerned. The
values used should be unique across the corpus, but this is not
validated in the current release of the DTD.
The imprint element is
supplied for published texts only and contains the following elements in the order given:
The following example demonstrates how these elements are used to
record bibliographic details for a typical book:
It might have been Jerusalem. Healy, Thomas Polygon Books
Edinburgh 1991
1-81
The following example is typical of the case where a collection of
leaflets or newsletters has been treated as a single text:
[Potato Marketing Board leaflets] Potato Marketing Board London 1991
Occasionally, a bibliographic item has two titles, for
example a series title as well as an individual title, or multiple
authors. In the BNC such cases are treated simply by repeating the
element concerned, sometimes using the level attribute
to distinguish the bibliographic level of the
title:
Damages for personal injury and death: Damages on death Saunt,
Thomas Kemp, David Longman Group UK Ltd
Harlow 1993
52-68
Where series information is available for a
given title, this is not normally tagged distinctly. Instead the
series title is given as part of the monographic title, usually
preceded by a colon.
This level of bibliographic description has not been carried out with complete
consistency across the current release of the corpus.
The encoding description
The second major component of the TEI header is the encoding
description (encodingDesc). This contains
information about the relationship between an encoded text and its
original source and describes the editorial and other principles
employed throughout the corpus. It also contains reference information
used throughout the corpus.
The BNC encodingDesc element has the following six
components:
In the BNC, one of each of these elements appears in the corpus
header. Only the tagsDecl element appears
in the individual text headers.
Documentary components of the encoding
description
The projectDesc element for the corpus gives a brief
description of the goals, organization and results of the BNC project.
The samplingDecl, editorialDecl and
refsDecl elements similarly supply brief prose descriptions
of the sampling procedures, the editorial principles, and the
referencing system applied in the project. This information is also summarized elsewhere
in this documentation.
The tagging declaration
The tagging declaration (tagsDecl) element is
used slightly differently in corpus and in text headers. In the corpus
header, it is used to list every element name actually used within the
corpus, together with a brief description of its function. In text
headers, it is used to specify the number of elements actually tagged
within each text. In either case it consists of a namespace
element, containing a number of
tagUsage elements, defined as follows:
In the corpus header, each tagUsage element contains a
brief description of the element specified by its gi attribute;
the occurs attribute is not supplied, as in the following
extract:
Non-verbal event in spoken text
Point where source material has omitted
Header or headline in written text
In text headers, the tagUsage elements are empty, but the
occurs attribute is always supplied, and indicates the
number of such elements which appear within the text, as in the following
example, taken from a typical written text:
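As a sketch of how such counts might be read programmatically, the following Python fragment collects the gi and occurs values of each tagUsage element into a dictionary. The sample fragment and its figures are invented for illustration and are not taken from any corpus text; the function name tag_counts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment in the shape described above: empty tagUsage
# elements carrying gi and occurs attributes.
SAMPLE = """
<tagsDecl>
  <namespace>
    <tagUsage gi="head" occurs="5"/>
    <tagUsage gi="p" occurs="120"/>
    <tagUsage gi="s" occurs="957"/>
    <tagUsage gi="w" occurs="21247"/>
  </namespace>
</tagsDecl>
"""


def tag_counts(tags_decl: ET.Element) -> dict:
    """Map each element name (gi) to the number of times it occurs."""
    return {
        usage.get("gi"): int(usage.get("occurs"))
        for usage in tags_decl.iter("tagUsage")
        # Corpus-header entries have no occurs attribute, so skip them.
        if usage.get("occurs") is not None
    }


if __name__ == "__main__":
    print(tag_counts(ET.fromstring(SAMPLE)))
```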
The reference and classification declarations
The refsDecl element for the corpus header defines the
approved format for references to the corpus. It takes the following form:
Canonical references in the British National Corpus
are to text segment (s) elements, and
are constructed by taking the value of the xml:id attribute
of the bncDoc element containing the target text,
and concatenating a dot separator, followed by the value
of the n attribute of the target s element.
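The construction just described (text identifier, a dot, then the n value of the s element) can be sketched in Python as follows. The sample document is invented and heavily simplified; only the identifier A0A is taken from this section, and the function name canonical_references is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented, simplified document: real texts have a header and far more markup.
SAMPLE = """
<bncDoc xml:id="A0A">
  <div>
    <s n="1">First invented sentence.</s>
    <s n="2">Second invented sentence.</s>
  </div>
</bncDoc>
"""


def canonical_references(bnc_doc: ET.Element) -> list:
    """Build 'textid.n' references for every s element in a bncDoc."""
    text_id = bnc_doc.get(XML_ID)
    return [f"{text_id}.{s.get('n')}" for s in bnc_doc.iter("s")]


if __name__ == "__main__":
    print(canonical_references(ET.fromstring(SAMPLE)))
```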
The standard TEI classDecl element is used in the BNC Corpus Header
to formally define several text classification
schemes which are used in the corpus. Each scheme or taxonomy
defines a number of code/description pairs, applicable to a text in
the corpus. For example, the written domain taxonomy defines twelve
subject domains ("Imagination", "Informative: natural science",
"Informative: applied science" etc.) and each
written text is assigned to one of them. Each
taxonomy is defined in the corpus header, using the following elements:
Here, for example, is the start of the taxonomy element
defining the Written domain classification system as it appears in
the corpus header:
Written Domain
Imaginative
Informative: natural & pure science
Informative: applied science
...
For a complete list of the taxonomies used in the BNC and the
number of texts etc. classified according to them, refer to the corpus
header.
The classification categories applicable to a given text are specified
by the catRef element within the associated text header. Its
target lists the identifiers of all category
elements applicable to that text. For example, the header of a written text
assigned to the social science domain which has a corporate author will
include a catRef element like the following:
(The
dots above represent the identifiers of all other category codes applicable to this text).
A full list of all category codes, and the numbers of texts so
classified in the current release of the corpus, is provided elsewhere in this
documentation.
Further information about the classification and categorization of
individual texts is provided within the textClass element
discussed below.
The Xaira Specification
The Xaira Specification element is used by the XAIRA indexing software to index the BNC. A
brief description of its components is provided below; for full information, consult the Xaira
documentation available from
http://www.xaira.org/
The profile description
The third component of a TEI header is the profile description. In
the BNC this is used to provide the following elements:
The creation element
This element is provided to record the date of publication for
texts originally published separately, and any details concerning the origination
of any spoken or written texts, whether or not covered elsewhere. It
is supplied in every text header, although the details provided
vary. As a minimum, a date (tagged with the standard date
element) will be included; this gives the date the content of this
text was first created. For a spoken text, this will be the same as
the date of the recording; for a written text, it will normally be the
date of first publication of the edition, which may not be the same as
the date of publication of the copy used.
Here are two typical examples:
1992-02-11:
1971: originally published by Jonathan Cape.
Note that the BNC contains modernized editions of some classic texts
such as Defoe's Robinson Crusoe (FRX); the creation
date specified here is that of the creation of the modernized version
rather than the 18th-century original.
For imaginative works, the creation date is also the date used to
classify the text (by means of the WRITIM category). For
other written works, such as textbooks, which are likely to have been
extensively revised since their first publication, the date used to
classify the text will be that of the edition described in the
sourceDesc, but the original date will also be recorded
within the creation element.
The langUsage element
Unlike the other elements of the profile description, the language
usage element occurs only in the corpus header.
It contains the following text:
The language of the British National Corpus is modern
British English. Words, fragments, and passages from many
other languages, both ancient and modern, occur within the
corpus where these may be represented using a Latin
alphabet. Long passages in these languages, and material
in other languages, are generally silently deleted. In no
case is the lang attribute used to indicate the language
of a word, phrase or passage, nor are alternate writing
system definitions used.
The participant description
The participant description (particDesc) element is used
to provide information about speakers of texts transcribed for the
BNC. It appears only within individual spoken text headers to define
the participants specific to those texts.
It contains a series of person elements describing the
participants whose speech is transcribed in this text.
The person element
Each person element describes a single participant in a
language interaction. It carries a number of attributes which are used
to provide encoded values for some key aspects of the person concerned:
The xml:id attribute is required for each
participant whose speech is included in a text, and its value is unique
within the corpus. Although a given individual will always have the same
identifier within a single text, there is no way of identifying the same
individual should they appear in different texts. Since all
demographically sampled conversations collected by a single respondent
are treated together as a single text, and respondents were recruited from
many different social contexts, the probability of the same person
being recorded by different respondents is rather low, though not
completely impossible.
On many occasions the speaker of a given utterance cannot be
identified. A special code is used to indicate an unknown speaker,
but, for consistency, this is also made unique to each text. Thus, an
"unknown speaker" in one text will have a different identifying code
from an "unknown speaker" in another. As far as possible, different
speakers are given different identifying codes, even where they cannot
be identified with any confidence; thus there may be more than one
"unidentified" speaker in the same text.
Where several speakers speak together, if they are identified, then
all of the relevant codes are given; if however they are not, then a
special "unknown speaker group" code is used.
Where it is available, additional information about a participant
is provided by one or more of the following elements, appearing
within the person element:
In each case, the information provided is that given by the respondent
and is taken from the log books issued to all participants in the
demographic part of the corpus. It has not been normalized.
Here is a typical example from the demographic part of the corpus:
Terry
14
student
London
Here is a typical example from the context-governed part of the corpus:
frank harasikwa politician
Euro candidate presenting self for selection
Any recorded relationship between speakers in the demographically
sampled part of the corpus is specified by means of the
role attribute, which indicates how the speaker
concerned is related to the respondent, for example as a friend,
colleague, brother, wife, etc. For example, the participant information recorded in
the header for a text (KSU) comprising conversations between four
participants: Michael and Steve (who are brothers), their mother
Christine and their aunt Leslie is as follows:
13 Michael
student
45 Christine
credit controller
45 Leslie
unemployed
21 Steve
unemployed
In the context-governed part of the corpus, however, there is no
respondent, and relationship information must be deduced from the other
information provided. The role attribute for
person elements in these texts will usually have the value
unspecified.
The setting description
The settingDesc element is used to describe the context
within which a spoken text takes place. It appears once in the header
of each spoken text, and contains one or more setting
elements for each distinct recording.
The content of each setting element supplies additional
details about the place, time of day, and other activities going on,
using the following additional elements:
Some typical examples follow:
Essex: Harlow
Harlow College
A'level lecture
Lancashire: Morecambe
at home
watching television
Text classification
The TEI provides a number of ways in which classification or
text-type information may be specified for a text, grouped together
within a textClass element, which appears once in the header
of each text. Classifications may be represented using references to
internally defined classifications provided in the classDecl
element of the corpus header (such as the BNC classification scheme described
above), by reference to some other
predefined classification system, or by an open set of keywords. All
three methods are used in the BNC, using the following elements:
A catRef element is provided in the header of each
text. Its target attribute contains values for each of the
classification codes defined in the corpus header. In each case, the
classification code consists of a code used as the identifier of a
category element within a taxonomy element defined
in the corpus header. For example, ALLTIM1 indicates a text
dated 1960-1974. A list of the values used is given elsewhere in this
documentation.
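Resolving the codes listed on a catRef element amounts to splitting its target value on whitespace and looking each code up among the categories defined in the corpus header. The Python sketch below assumes the lookup table has already been built; only the ALLTIM1 entry comes from this section, UNKNOWN9 is a made-up code included to show an unresolved lookup, and the function name resolve_categories is hypothetical.

```python
from typing import Dict, Optional


def resolve_categories(
    target: str, taxonomy: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Map each code in a catRef target value to its description.

    Codes missing from the lookup table are mapped to None. A leading
    '#' on a code is tolerated, since the pointer syntax is an assumption.
    """
    resolved = {}
    for code in target.split():
        code = code.lstrip("#")
        resolved[code] = taxonomy.get(code)
    return resolved


if __name__ == "__main__":
    lookup = {"ALLTIM1": "dated 1960-1974"}
    print(resolve_categories("ALLTIM1 UNKNOWN9", lookup))
```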
This taxonomy is that originally defined for selection and
description of texts during the design of the corpus, as further
discussed elsewhere. It is of course possible to classify the texts in
many other ways, and no claim is made that this method is universally
applicable or even generally useful, though it does serve to identify
broadly distinct sub-parts of the corpus for investigation. The reader
is also cautioned that, although an attempt has been made in the
current edition of the corpus to correct the more egregious
classification errors noted in the first edition, unquestionably many
errors and inconsistencies remain. In particular, the categories WRILEV
(perceived level of difficulty) and WRISTA (estimated circulation
size) were incorrectly differentiated during the preparation of the corpus
and cannot be relied on.
A classCode element is also provided for every text in the
corpus. This contains the code assigned to the text in a genre-based
analysis carried out at Lancaster University by David Lee since
publication of the first edition of the BNC. Lee's scheme classes the
texts more delicately in most cases, since it takes into account their
topic or subject matter (see further
below).
Lee's scheme is also used as the basis of a very simple
categorization for each text, which is provided by means of the
type attribute on its text or stext
element. This categorization distinguishes six categories for written
text (fiction, academic prose, non-academic prose, newspapers, other
published, unpublished), and two for spoken text (conversation,
other). It may prove a convenient way of distinguishing the major
text types represented in the corpus.
In the first release of the BNC, most texts were assigned a set of
descriptive keywords, tagged as term elements within the
keywords element. These terms were not taken from any
particular descriptive thesaurus or closed vocabulary; the words or
phrases used are those which seemed useful to the data preparation
agency concerned, and are thus often inconsistent or even
misleading. They have been retained unchanged in the present version
of the BNC, pending a more thorough revision. In the World (second)
Edition this set of keywords was complemented for most written texts
by a second set, also tagged using a keywords element, but
with a value for its source attribute of
COPAC, indicating that the terms so tagged are derived
from a different source. The source used was a major online library
catalogue service. Like
other public access catalogue systems, COPAC uses a well-defined
controlled list of keywords for its subject indexing, details of which
are not further given here.
Here is an example showing how one text (BND) is classified in each
of these ways:
...
W_religion
Marriage - Religious aspects - Christianity
Marriage - Christian viewpoints
Christian guide to marriage
......
The revision description
The revision description (revisionDesc) element
is the fourth and final element of a standard TEI header.
In the BNC, it consists of a series of change elements.
Here is part of a typical example:
Tag usage updated for BNC-XML
Last check for BNC World first
release
...
corrected tagUsage
POS codes revised for BNC-2; header
updated
Initial accession to
corpus