The header
The header of a TEI-conformant text provides a structured description of its contents, analogous to the title page and front matter of a book. The component elements of a TEI header are intended to provide in machine-processable form all the information needed to make sensible use of the Corpus.
Every separate text in the British National Corpus (i.e. each <bncDoc> element) has its own header, referred to below as a text header. In addition, the corpus itself has a header, referred to below as the corpus header, containing information which is applicable to the whole corpus, possibly with some local over-riding, as described in section. Both corpus and text headers are represented by <teiHeader> elements.
Each header has the following major components:
- <fileDesc> (file description) contains a full bibliographic description of an electronic file.
- <encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
- <profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting.
- <revisionDesc> (revision description) summarizes the revision history for a file.
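The four components listed above can be seen by parsing a header and listing its direct children. The following is only a minimal sketch: the element names come from the list above, but the sample content, the `HEADER` string and the `header_components` helper are invented for illustration, and the elements are assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical skeleton header. Element names follow the list above;
# all content is invented for illustration only.
HEADER = """
<teiHeader>
  <fileDesc><titleStmt><title>Example title</title></titleStmt></fileDesc>
  <encodingDesc><tagsDecl/></encodingDesc>
  <profileDesc><creation><date>1990</date></creation></profileDesc>
  <revisionDesc/>
</teiHeader>
"""


def header_components(header_xml: str) -> list:
    """Return the tag names of the direct children of a <teiHeader>."""
    root = ET.fromstring(header_xml)
    return [child.tag for child in root]


print(header_components(HEADER))
```

A real text header may omit or reorder some of these components, so code of this kind should not rely on position.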
The file description
The file description (<fileDesc>) contains the following elements:
- <titleStmt> (title statement) groups information about the title of a work and those responsible for its intellectual content.
- <editionStmt> (edition statement) groups information relating to one edition of a text.
- <extent> specifies the approximate size of the text, in orthographic words, <w> elements, and <s> elements.
- <publicationStmt> (publication statement) groups information concerning the publication or distribution of an electronic or other text.
- <sourceDesc> supplies a description of the source text(s) from which an electronic text was derived or generated.
Further detail for each of these is given in the following subsections.
The title statement
The title statement (<titleStmt>) element of a BNC text contains one or more <title> elements, optionally followed by <author>, <editor>, or <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.
The content of the <title> element includes the title of the source, followed by the phrase "Sample containing about", the approximate word count for the sample, and further information about the text type and domain, all extracted from other parts of the header. This is followed by responsibility statements showing which of the BNC Consortium members was responsible for capturing the text originally.
Author and editor information for the source from which a text is derived (e.g. the author of a book) is not included in the <fileDesc> element's title statement but in the <sourceDesc> element discussed below (The source description).
The extent statement
The orthographic word count is that produced by the wc utility, which simply counts blank-delimited strings; the other figures give the number of <w> and <s> elements respectively.
The publication statement
- <distributor> supplies the name of a person or other agency responsible for the distribution of a text.
- <availability> supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc.
- <idno> (identifying number) supplies an identifying code for a text.
The second identifier (of type old) is the old-style mnemonic or numeric code attached to BNC texts in early releases of the corpus, and used to label original printed source materials in the BNC Archive. The first, three-character code (of type bnc) is the standard BNC identifier. It is used both for the name of the file in which the text is stored and as the value supplied for the xml:id attribute on the <bncDoc> element containing the whole text, and should always be used to cite the text.
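The relationship between the two identifiers and the xml:id attribute can be checked mechanically. The sketch below is illustrative only: the identifier values `XYZ` and `OldCode1`, the `DOC` string and the `identifiers` helper are all hypothetical, and the elements are assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: both identifier values are invented.
DOC = """
<bncDoc xml:id="XYZ">
  <teiHeader>
    <fileDesc>
      <publicationStmt>
        <idno type="bnc">XYZ</idno>
        <idno type="old">OldCode1</idno>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
</bncDoc>
"""


def identifiers(doc_xml: str):
    """Return (xml:id of <bncDoc>, {idno type: idno value})."""
    root = ET.fromstring(doc_xml)
    ids = {idno.get("type"): (idno.text or "").strip()
           for idno in root.iter("idno")}
    return root.get(XML_ID), ids


doc_id, ids = identifiers(DOC)
# The standard identifier should agree with the xml:id on <bncDoc>.
print(doc_id == ids["bnc"])
```

Citing a text by `ids["bnc"]` rather than `ids["old"]` follows the recommendation in the paragraph above.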
The source description
- <recordingStmt> (recording statement) describes a set of recordings used in transcription of a spoken text.
- <bibl> (bibliographic citation) contains any bibliographic reference, occurring either within the header of a written corpus text, in which case it has a fixed substructure, or within the body of a corpus text, in which case it contains only <s> elements.
These elements are not used within the corpus header, which simply contains a note about the sources from which the corpus was derived, tagged as a <para> (paragraph). The headers of individual texts each contain one of the above elements to specify their source.
Context-governed spoken texts derived from broadcast or similar ‘published’ material may have either a recording statement or a bibliographic record as their source.
All bibliographic data supplied in the individual text headers is collected together and reproduced in section ?? below.
The recording statement
The value of the n attribute here provides the number of the audio tape holding the original recording, as deposited with the British Library's Sound Archive in London.
Structured bibliographic record
- <title> contains the full title of a work of any kind.
- <editor> (editor) secondary statement of responsibility for a bibliographic item, for example the name of an individual, institution or organization (or of several such) acting as editor, compiler, translator, etc.
- <author> in a bibliographic reference, contains the name of the author(s), personal or corporate, of a work; the primary statement of responsibility for any bibliographic item.
- <imprint> groups information relating to the publication or distribution of a bibliographic item.
- <pp> supplies page numbers for a bibliographic citation.
During production of the BNC, the n attribute was used with both <author> and <imprint> elements to supply a six-letter code identifying the author or imprint concerned. The values used should be unique across the corpus, but this is not validated in the current release of the DTD.
Where ‘series’ information is available for a given title, this is not normally tagged distinctly. Instead the series title is given as part of the monographic title, usually preceded by a colon.
This level of bibliographic description has not been carried out with complete consistency across the current release of the corpus.
The encoding description
The second major component of the TEI header is the encoding description (<encodingDesc>). This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus.
- <projectDesc> (project description) describes in detail the aim or purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- <samplingDecl> (sampling declaration) contains a prose description of the rationale and methods used in sampling texts in the creation of a corpus or collection.
- <editorialDecl> (editorial practice declaration) provides details of editorial principles and practices applied during the encoding of a text.
- <tagsDecl> (tagging declaration) provides information about the XML elements actually used within a BNC text
- <refsDecl> (references declaration) provides documentation for the reference system applicable to the corpus.
- <classDecl> (classification declarations) contains one or more taxonomies defining any classificatory codes used elsewhere in the text.
- <xairaSpecification> specifies additional information needed by xaira.
In the BNC, one of each of these elements appears in the corpus header. Only the <tagsDecl> element appears in the individual text headers.
Documentary components of the encoding description
The <projectDesc> element for the corpus gives a brief description of the goals, organization and results of the BNC project. The <samplingDecl>, <editorialDecl> and <refsDecl> elements similarly supply brief prose descriptions. The corpus header is reproduced in section The BNC corpus header below.
The tagging declaration
The reference and classification declarations
- <taxonomy> (taxonomy) defines a typology used to classify texts either implicitly, by means of a bibliographic citation, or explicitly by a structured taxonomy.
- <desc> (description) supplies explanatory text associated with a category or other component defined in the corpus header
- <category> (category) defines a single category within a taxonomy of texts.
- <bibl> (bibliographic citation) contains any bibliographic reference, occurring either within the header of a written corpus text, in which case it has a fixed substructure, or within the body of a corpus text, in which case it contains only <s> elements.
A full list of all category codes, and the numbers of texts so classified in the current release of the corpus, is provided in section Text and genre classification codes.
Further information about the classification and categorization of individual texts is provided within the <textClass> element discussed below (Text classification).
The profile description
- <creation> contains information about the creation of a text.
- <particDesc> (participation description) describes the identifiable speakers, voices, or other participants in a linguistic interaction.
- <settingDesc> (setting description) describes the setting or settings within which a language interaction takes place, either as a prose description or as a series of setting elements.
- <langUsage> (language usage) describes the languages, sublanguages, registers, dialects etc. represented within a text.
- <textClass> (text classification) groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
The creation element
This element is provided to record the date of first publication of individual published texts, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, although the details provided vary. As a minimum, a date (tagged with the standard <date> element) will be included; this gives the date the content of this text was first created. For a spoken text, this will be the same as the date of the recording; for a written text, it will normally be the date of first publication.
For imaginative works, the creation date is also the date used to classify the text (by means of the WRITIM category). For other written works, such as textbooks, which are likely to have been extensively revised since their first publication, the date used to classify the text will be that of the edition described in the <sourceDesc>, but the original date will also be recorded within the <creation> element.
The <langUsage> element
The participant description
The participant description (<particDesc>) element is used to provide information about speakers of texts transcribed for the BNC. It appears only within individual spoken text headers to define the participants specific to those texts.
It contains a series of <person> elements describing the participants whose speech is transcribed in this text.
The person element
- <person> provides information about an identifiable individual, for example a participant in a language interaction, or a person referred to in a historical source. Attributes:
  - ageGroup: specifies the age group to which the participant belongs.
  - dialect: specifies the dialect or accent of a participant's speech, as identified by the respondent.
  - firstLang: specifies the country of origin of the participant, as identified by the respondent.
  - n: internal identifier.
  - educ: specifies the age at which the participant ceased full-time education.
  - soc: specifies the social class of the participant.
  - sex: specifies the sex of the participant.
  - role: describes the relationship or role of this participant with respect to the respondent.
  - xml:id: provides the unique identifier for this element.
The xml:id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual should they appear in different texts. Since all demographically sampled conversations collected by a single respondent are treated together as a single text, this is however rather unlikely.
On many occasions the speaker of a given utterance cannot be identified. A special code is used to indicate an unknown speaker, but, for consistency, this is also made unique to each text. Thus, an "unknown speaker" in one text will have a different identifying code from an "unknown speaker" in another.
Where several speakers speak together and they are identified, all of the relevant codes are given; if they are not identified, a special "unknown speaker group" code is used.
A <person> element may contain the following elements:
- <persName> (personal name) contains a proper noun or proper-noun phrase referring to a person, possibly including any or all of the person's forenames, surnames, honorifics, added names, etc.
- <age> specifies the age in years of a recorded participant at the time of the recording in which they participate.
- <occupation> contains an informal description of a person's trade, profession or occupation.
- <dialect> contains an informal description of the regional variety of English used by a participant in a spoken text.
- <persNote> contains any additional information supplied about a participant in a spoken text.
In each case, the information provided is that given by the respondent and is taken from the log books issued to all participants in the demographic part of the corpus. It has not been normalized.
In the context-governed part of the corpus, however, there is no respondent and relationship information must be deduced from the other information provided. The role attribute for <person> elements in these texts will usually have the value unspecified.
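The participant information described in this section lends itself to a simple lookup table keyed on xml:id. The following is a rough sketch under stated assumptions: the attribute and element names are those documented above, but every value in `PARTIC` (including the identifiers `P1` and `P2` and the age group code) is invented, and `participants` is a hypothetical helper. Attributes and child elements are kept apart because dialect occurs as both.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical participant description: all values are invented.
PARTIC = """
<particDesc>
  <person xml:id="P1" sex="f" role="self" ageGroup="Ag2">
    <age>30</age>
    <occupation>teacher</occupation>
  </person>
  <person xml:id="P2" sex="m" role="unspecified"/>
</particDesc>
"""


def participants(partic_xml: str) -> dict:
    """Map each person's xml:id to its attributes and child elements."""
    table = {}
    for person in ET.fromstring(partic_xml).iter("person"):
        # Everything except xml:id itself is stored as an attribute.
        attributes = {k: v for k, v in person.attrib.items() if k != XML_ID}
        # Child elements are stored separately, since <dialect> would
        # otherwise collide with the dialect attribute.
        elements = {child.tag: (child.text or "").strip() for child in person}
        table[person.get(XML_ID)] = {
            "attributes": attributes,
            "elements": elements,
        }
    return table


table = participants(PARTIC)
print(table["P1"]["elements"]["occupation"])
```

Because identifiers are unique only as labels within the corpus and the same individual may not be recognisable across texts, a table of this kind is best built separately for each text.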
The setting description
- <date> contains a date in any format.
- <locale> (locale) contains a brief informal description of the nature of a place, for example a room, a restaurant, a park bench etc.
- <activity> (activity) contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.
- <placeName> (place name) contains an absolute or relative place name.
Text classification
- <catRef> (category reference) provides a list of codes identifying the categories to which this text has been assigned, each code referencing a category element declared in the corpus header.
- <classCode> (classification code) contains the classification code used for this text in some standard classification system.
- <keywords> (keywords) contains a list of keywords or phrases identifying the topic or nature of a text.
A <catRef> element is provided in the header of each text. Its target attribute contains values for each of the classification codes listed in the following table and defined in the corpus header. In each case, the classification code is the identifier of a <category> element within a <taxonomy> element defined in the corpus header (see above). For example, ALLTIM1 indicates ‘dated 1960-1974’. A list of the values used is given in section Text and genre classification codes below.
This taxonomy is that originally defined for selection and description of texts during the design of the corpus, as further discussed elsewhere. It is of course possible to classify the texts in many other ways, and no claim is made that this method is universally applicable or even generally useful, though it does serve to identify broadly distinct sub-parts of the corpus for investigation. The reader is also cautioned that, although an attempt has been made in the current edition of the corpus to correct the more egregious classification errors noted in the first edition, unquestionably many errors and inconsistencies remain. In particular, the categories WRILEV (perceived level of difficulty) and WRISTA (estimated circulation size) were incorrectly differentiated during the preparation of the corpus and cannot be relied on.
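Resolving the codes in a <catRef> against the categories declared in the corpus header might be sketched as follows. This is an illustration under several assumptions rather than a description of any BNC tool: the code ALLTIM1 and its gloss come from the text above, but the code `HYPO1`, the taxonomy identifier, the use of a <desc> child for the gloss, and the whitespace-separated form of the target attribute are all assumed, and `category_labels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical corpus header excerpt. Only ALLTIM1 and its gloss are
# taken from the text; the rest is invented for illustration.
CORPUS_HEADER = """
<classDecl>
  <taxonomy xml:id="ALLTIM">
    <category xml:id="ALLTIM1"><desc>dated 1960-1974</desc></category>
    <category xml:id="HYPO1"><desc>hypothetical category</desc></category>
  </taxonomy>
</classDecl>
"""

# Hypothetical text header excerpt; codes are assumed to be
# separated by whitespace within the target attribute.
TEXT_HEADER = """
<teiHeader>
  <profileDesc>
    <textClass><catRef target="ALLTIM1 HYPO1"/></textClass>
  </profileDesc>
</teiHeader>
"""


def category_labels(corpus_header_xml: str, text_header_xml: str) -> dict:
    """Map each code in the text's <catRef> to its category description."""
    labels = {}
    for cat in ET.fromstring(corpus_header_xml).iter("category"):
        desc = cat.find("desc")
        labels[cat.get(XML_ID)] = (desc.text or "").strip() if desc is not None else ""
    cat_ref = next(ET.fromstring(text_header_xml).iter("catRef"), None)
    codes = cat_ref.get("target", "").split() if cat_ref is not None else []
    # Codes with no matching category map to None rather than raising.
    return {code: labels.get(code) for code in codes}


resolved = category_labels(CORPUS_HEADER, TEXT_HEADER)
print(resolved["ALLTIM1"])
```

Given the caution above about WRILEV and WRISTA, a consumer of such a table may wish to discard those two categories before using it for selection.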
A <classCode> element is also provided for every text in the corpus. This contains the code assigned to the text in a genre-based analysis carried out at Lancaster University by David Lee since publication of the first edition of the BNC. Lee's scheme, which is further described in an article (Leeref), classes the texts more delicately in most cases, since it takes into account their topic or subject matter.
Lee's scheme is also used as the basis of a very simple categorization for each text, which is provided by means of the type attribute on its <text> or <stext> element. This categorization distinguishes six categories for written text (fiction, academic prose, non-academic prose, newspapers, other published, unpublished), and two for spoken text (conversation, other). It may prove a convenient way of distinguishing the major text types represented in the corpus.
In the first release of the BNC, most texts were assigned a set of descriptive keywords, tagged as <term> elements within the <keywords> element. These terms were not taken from any particular descriptive thesaurus or closed vocabulary; the words or phrases used are those which seemed useful to the data preparation agency concerned, and are thus often inconsistent or even misleading. They have been retained unchanged in the present version of the BNC, pending a more thorough revision. In the World (second) Edition this set of keywords was complemented for most written texts by a second set, also tagged using a <keywords> element, but with a value for its source attribute of COPAC, indicating that the terms so tagged are derived from a different source. The source used was a major online library catalogue service (see http://www.copac.ac.uk). Like other public access catalogue systems, COPAC uses a well-defined controlled list of keywords for its subject indexing, details of which are not further given here.
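Since the two keyword sets differ in reliability, it can be useful to separate them by the source attribute. The sketch below assumes that the original keyword set carries no source attribute, which the text above does not state; the label `original`, the sample terms and the `keywords_by_source` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical <textClass> excerpt: all terms are invented, and the
# first <keywords> element is assumed to lack a source attribute.
TEXT_CLASS = """
<textClass>
  <keywords><term>gardening</term><term>hobbies</term></keywords>
  <keywords source="COPAC"><term>Gardening</term></keywords>
</textClass>
"""


def keywords_by_source(text_class_xml: str) -> dict:
    """Group <term> values by the source attribute of their <keywords>."""
    groups = {}
    for keywords in ET.fromstring(text_class_xml).iter("keywords"):
        # "original" is a placeholder label for keywords with no source.
        source = keywords.get("source", "original")
        terms = [(term.text or "").strip() for term in keywords.iter("term")]
        groups.setdefault(source, []).extend(terms)
    return groups


groups = keywords_by_source(TEXT_CLASS)
print(groups.get("COPAC", []))
```

Keeping the COPAC terms apart in this way lets them be used for subject indexing without mixing in the less consistent first-release keywords.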
The revision description