The header
The header of a TEI-conformant text generally provides a highly structured description of its contents, analogous to the title page and front matter provided for conventional printed books. Such information is all too often missing in electronic texts; or if supplied, provided only in the form of external documentation such as this manual. The component elements of a TEI header are intended to provide in machine-processable form all the information needed to make sensible use of the Corpus.
- type
- specifies the kind of document to which the header is attached. Legal values are:
- creator
- specifies the agency responsible for creating the header.
- status
- specifies the revision status of the associated document. Legal values are:
- update
- specifies the date on which the header content was last changed or created.
- <fileDesc>
- contains a full bibliographic description of the corpus itself or of a text within it.
- <encodingDesc>
- documents the relationship between an electronic text and the source or sources from which it was derived.
- <profileDesc>
- provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it.
- <revisionDesc>
- summarizes the revision history of a file.
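The four major components listed above lend themselves to a simple presence check. The following Python sketch is illustrative only: it assumes that the header is available as well-formed XML, uses the standard TEI name <teiHeader> for the enclosing element (the name used in a given release of the corpus may differ), and the function name missing_components is invented for this example.

```python
import xml.etree.ElementTree as ET

# The four major header components, in the order described above.
EXPECTED = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]

# Hypothetical, deliberately incomplete header used only for demonstration.
SAMPLE_HEADER = "<teiHeader><fileDesc/><profileDesc/></teiHeader>"

def missing_components(header):
    """Return the names of expected components that are not direct children."""
    present = {child.tag for child in header}
    return [name for name in EXPECTED if name not in present]

header = ET.fromstring(SAMPLE_HEADER)
print(missing_components(header))
```

The check reports absent components rather than rejecting the header, since not every component need carry the same content in corpus and text headers.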
The file description
- <titleStmt>
- contains title information, identifying the corpus, or a text within it.
- <editionStmt>
- contains additional information relating to a particular version of the corpus (not used with individual corpus texts).
- <extent>
- describes the approximate size of the electronic file as stored on some carrier medium.
- <publicationStmt>
- formally describes the publication or distribution of the corpus and its constituent texts.
- <sourceDesc>
- supplies a bibliographic description for the copy text(s) from which a particular corpus text was derived or generated.
Further detail for each of these is given in the following subsections.
The title statement
- for written texts, a (possibly shortened) version of the original source title, or, if there is none, a descriptive phrase enclosed in square brackets
- an indication of the size and type of the document
- a note indicating the domain or subject matter of the document
Author and editor information for the source from which a text is derived (e.g. the author of a book) is not included in the <fileDesc> element but in the <sourceDesc> element discussed below (The source description).
The edition statement
Note that here, as elsewhere, the element <para> is used to structure the textual note within a header element. This element is defined as a renaming of the standard TEI <p> element, since the <p> element itself has been redefined with a more restrictive content model (see further section ??.)
The extent statement
Counts are provided for each element actually tagged in a text, as further discussed below (The tagging declaration).
The publication statement
- name and address of distributor
- These are tagged using the standard <distributor> and <address> elements respectively; for BNC World, in both corpus and text headers, the name and address given are as follows:
<distributor>Oxford University Computing Services</distributor>
<address>
  <addrLine>13 Banbury Road, Oxford OX2 6NN U.K.</addrLine>
  <addrLine>Telephone: +44 1865 273221</addrLine>
  <addrLine>Facsimile: +44 1865 273275</addrLine>
  <addrLine>Internet mail: natcorp@oucs.ox.ac.uk</addrLine>
</address>
- identification numbers for the published text
- These are tagged using the standard <idno> element. For the corpus header only one such element is specified, as follows:
<idno type="BNC">BNC-W</idno>
For individual text headers, two identification numbers are supplied, distinguished by the value for the type attribute.
<idno type="bnc">A0A</idno> <idno type="old">CAMfct</idno>
The second identifier (of type old) is the old-style mnemonic or numeric code attached to BNC texts in early releases of the corpus, and used to label original printed source materials in the BNC Archive. The first, three-character code (of type bnc) is the standard BNC identifier. It is used both for the filename in which the text is stored and as the value supplied for the id attribute on the <bncDoc> element containing the whole text.
- availability information
- For contractual reasons, the corpus header includes a brief rehearsal of the terms and conditions under which the BNC is made available; this is reproduced in section ?? below. A similar brief notice is also provided in the same place for each individual text, as in the following example:
<availability status="restricted"><para> Available worldwide THIS TEXT IS AVAILABLE THROUGHOUT THE WORLD only as part of the British National Corpus at nominal charge FOR ACADEMIC RESEARCH PURPOSES SUBJECT TO A SIGNED END USER LICENCE HAVING BEEN RECEIVED BY OXFORD UNIVERSITY COMPUTING SERVICES, from whom forms and supporting materials are available. THIS TEXT IS NOT AVAILABLE FOR COMMERCIAL RESEARCH AND EXPLOITATION unless terms have first been agreed with the BNC Consortium Exploitation Committee. Apply in the first instance to Oxford University Computing Services. It is your responsibility, as a user, to ensure that an End User Licence is in place. For your information, the Terms of the End User Licence are set out in the corpus header, which is likely to have a file name similar to "corphdr" or "CORPHDR". Distribution of any part of the corpus under the terms of the Licence must include a copy of the corpus header. Distribution of this corpus text under the terms of the Licence must include this header embodying this notice. Permissions grantor for World: CAMRA (imprint) of St Albans </para></availability>
Note the inclusion at the end of the notice of the name and address of the agency owning rights in the text concerned, which has granted permission for its inclusion in the corpus. If no such agency is named, permission for rights additional to those explicitly given by the licensing arrangements in place should be sought from the BNC Consortium in the first instance. Note that the BNC World edition includes only texts for which world rights have been cleared by the BNC Consortium.
- date of publication
- The BNC was first officially published on 24 November 1994. The present edition has 31st October 2000 as its official publication date.
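As a rough illustration of how the two identifiers in a text header might be read by software, the following Python sketch collects <idno> values by their type attribute. It assumes that the header fragment is well-formed XML; the enclosing <publicationStmt> wrapper and the function name idno_by_type are supplied only so that the example runs on its own. The <idno> elements and their values are those shown above.

```python
import xml.etree.ElementTree as ET

# The two <idno> elements are those quoted in the text; the wrapper element
# is added here only so that the fragment parses as a single document.
FRAGMENT = """
<publicationStmt>
  <idno type="bnc">A0A</idno>
  <idno type="old">CAMfct</idno>
</publicationStmt>
"""

def idno_by_type(xml_text):
    """Return a dict mapping each idno type to its (stripped) text content."""
    root = ET.fromstring(xml_text)
    return {el.get("type"): (el.text or "").strip() for el in root.iter("idno")}

ids = idno_by_type(FRAGMENT)
print(ids["bnc"], ids["old"])
```

Keying on the type attribute rather than on element order means the lookup does not depend on which identifier happens to be given first.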
The source description
- <recordingStmt>
- describes a set of recordings used in transcription of a spoken text, either as a series of paragraphs or as a formally structured recording element (The recording statement).
- <biblStruct>
- contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order. (Structured bibliographic record)
These elements are not used within the corpus header, which simply contains a note about the sources from which the corpus was derived, tagged as a <para> (paragraph). The headers of individual texts each contain one of the above elements to specify their source.
Context-governed spoken texts derived from broadcast or similar ‘published’ material may have either a recording statement or a bibliographic record as their source.
The recording statement
- <recording>
- details of a particular audio recording used as the source of a spoken text, either directly or from a public broadcast. Attributes include:
- time
- specifies the time of day when the recording was made.
- type
- characterizes the recording in terms of the equipment used to make it. Legal values include:
The standard TEI global attribute n is used (for this element only) to provide the number of the audio tape holding the original recording, as deposited with the National Sound Archive in London.
Structured bibliographic record
The standard TEI <biblStruct> element is used to record bibliographic information for each non-spoken component of the BNC. As defined in the TEI, this element has a complex structure designed to support a wide range of standard bibliographic practices. In the BNC, its structure is restricted as further described below.
At the highest level, all BNC <biblStruct> elements will contain a <monogr> element holding other elements that describe the item in question. In a few cases, this may be preceded by an <analytic> element, as further described at the end of this section.
- <title>
- the title or chief name of a work, including any alternative titles or subtitles; this must be given first. In several cases, a generated title or descriptive paraphrase is used, generally enclosed within square brackets. In the current version of the corpus, subtitles, alternative or series titles are not distinguished from the main title, other than by the use of conventional punctuation.
- <author>
- the name of an author (personal or corporate) of a work; names are generally given in canonical form, with surnames preceding forenames. Unlike the TEI equivalent element of the same name, the BNC version has two additional attributes:
- <editor>
- the name of the editor (personal or corporate) for a work.
- <imprint>
- groups information relating to the publication or distribution of a bibliographic item.
- <biblScope>
- defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work. Attributes include:
The n attribute is used with both <author> and <imprint> elements to supply a six-letter code identifying the author or imprint concerned. The values used should be unique across the corpus, but this is not validated by the current release of the DTD.
Where ‘series’ information is available for a given title, this is not normally tagged distinctly. Instead the series title is given as part of the monographic title, usually preceded by a colon.
This level of bibliographic description has not been carried out with complete consistency across the current release of the corpus.
The encoding description
The second major component of the TEI header is the encoding description (<encodingDesc>). This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus.
- <projectDesc>
- describes in detail the purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
- <samplingDecl>
- contains a prose description of the rationale and methods used in sampling texts in the creation of the corpus.
- <editorialDecl>
- provides details of editorial principles and practices applied during the encoding of a text.
- <tagsDecl>
- provides detailed information about the tagging applied to a corpus text.
- <refsDecl>
- specifies how canonical references are constructed for a text.
- <classDecl>
- contains a series of <category> elements, defining the classification codes used for texts within the corpus.
In the BNC, one of each of these elements appears in the corpus header, with the exception of the <tagsDecl> element which is also given in the individual text headers.
Documentary components of the encoding description
The <projectDesc> element for the corpus gives a brief description of the goals, organization and results of the BNC project. It is reproduced in section ?? below.
An exactly equivalent method is used to indicate the various editorial practices applicable in different portions of the corpus. The list of practices is given in an <editorialDecl> element, reproduced in section ?? below, each item of which has an identifying code which is subsequently referenced via the decls attribute on the <div> or <text> element to which that editorial practice applies.
- CN000
- Errors tagged with <sic> when seen; no normalization
- CN001
- Errors tagged with <sic> if seen; normalisation with <corr>
- CN002
- Normalized to standard British English or control list member
- CN004
- Corrections and normalizations applied silently
- HN000
- Smart elision of line-end hyphens; &rehy used for remainder
- HN001
- Dumb elision of line-end hyphens; true hyphens hand-reinstated
- HN002
- Line-end hyphens removed by hand where appropriate
- HN003
- Source material contains no line-end hyphens
- QN000
- Open and close quote normalized to &bquo, &equo
- QN001
- Open and close quote normalized to &quo
- QN002
- Quotation may be represented using <shift>
- SN000
- Segmentation carried out by CLAWS5.
The tagging declaration
The reference and classification declarations
The <catDesc> supplied for a parent <category> (for example, wridom) is understood to apply also to each <catDesc> contained by each of its constituent (daughter) <category> elements. That is, the full description for category wridom3 is ‘Domain for written corpus texts: informative: natural science’. A full list of all category codes, and the numbers of texts so classified in the current release of the corpus, is provided in section ??.
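The way a daughter category inherits the description of its parent can be sketched in Python as follows. This is a minimal illustration, not the method used in preparing the corpus: the <classDecl> fragment is hypothetical, the division of the wording between the parent and daughter <catDesc> elements is assumed, and the separator and the function name full_descriptions are choices made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of a <classDecl>; how the description is split between
# the parent and daughter <catDesc> elements is an assumption.
CLASS_DECL = """
<classDecl>
  <category id="wridom">
    <catDesc>Domain for written corpus texts</catDesc>
    <category id="wridom3">
      <catDesc>informative: natural science</catDesc>
    </category>
  </category>
</classDecl>
"""

def full_descriptions(xml_text, sep=": "):
    """Map each category id to its description prefixed by its ancestors'."""
    result = {}

    def walk(node, inherited):
        # Visit only direct <category> children so that nesting is respected.
        for cat in node.findall("category"):
            desc = (cat.findtext("catDesc") or "").strip()
            parts = inherited + [desc] if desc else inherited
            result[cat.get("id")] = sep.join(parts)
            walk(cat, parts)

    walk(ET.fromstring(xml_text), [])
    return result

print(full_descriptions(CLASS_DECL)["wridom3"])
```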
Information about the classification and categorization of an individual text is held within the <textClass> element discussed below (Text classification).
The profile description
- <creation>
- contains information about the creation of a text.
- <langUsage>
- describes the languages, sublanguages, registers, dialects etc. represented within a text.
- <particDesc>
- describes the identifiable participants in a linguistic interaction together with their relationships, where known.
- <settingDesc>
- describes the setting or settings within which a language interaction takes place.
- <textClass>
- groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.
The creation element
This element is provided to record the date of first publication of individual published texts, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, although the details provided vary. As a minimum, a date (tagged with the standard <date> element) will be included; this gives the date the content of this text was first created. For a spoken text, this will be the same as the date of the recording; for a written text, it will normally be the date of first publication.
For imaginative works, the creation date is also the date used to classify the text (by means of the writim category). For other written works, such as textbooks, which are likely to have been extensively revised since their first publication, the date used to classify the text will be that of the edition described in the <sourceDesc>, but the original date will also be recorded within the <creation> element.
The <langUsage> element
The participant description
The participant description (<particDesc>) element is used to provide information about speakers of texts transcribed for the BNC. In its basic structure it is close to the element defined by the TEI but it has been modified to include some more specific elements provided for the BNC. It appears both within the corpus header, to define the generic ‘unknown participant’, and also within individual spoken text headers to define the participants specific to those texts.
It contains a series of <person> elements describing the participants whose speech is transcribed in this text, followed by an optional <particLinks> element describing any relationships or links amongst them.
The person element
- id
- (mandatory) supplies a unique code used to identify this speaker and their utterances in the transcription.
- role
- specifies the role of this participant with respect to the respondent, as specified by the respondent.
- sex
- specifies the sex of the participant. Possible values are:
- age
- specifies the age group to which the participant belongs. Possible values are:
- flang
- specifies the first language or mother tongue of the participant. Possible values are listed in section ??.
- dialect
- specifies any dialect spoken by the participant, as specified by the respondent. Possible values are listed in section ??.
- soc
- specifies the social class of the participant. Legal values are:
- educ
- specifies the age at which the participant ceased full-time education. Possible values are:
- resp
- (for spoken demographic participants only) specifies the identifier of the respondent in whose data this participant's interactions are recorded.
The global id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual appearing in different texts. For this reason, all demographically sampled conversations collected by a single respondent are treated together as a single text.
The value for the flang attribute consists of a two-letter language code taken from ISO 639 (normally EN for English), optionally suffixed by a three-letter country code taken from ISO 3166. Thus ‘EN-GBR’ is English as spoken in the United Kingdom; ‘EN-CAN’ is English as spoken in Canada, and ‘FR-FRA’ is French as spoken in France.
The value for the dialect attribute is also a three-letter code taken from a local extension to ISO 3166. A full list of codes used and their meanings is given in section ??.
In each case, the information provided is that given by the respondent and is taken from the log books issued to all participants in the demographic part of the corpus. It has not been normalized.
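The structure of a flang value described above can be checked mechanically. The Python sketch below assumes, on the basis of the examples ‘EN-GBR’, ‘EN-CAN’ and ‘FR-FRA’ only, that codes are in upper case and that the country suffix is joined by a hyphen; the function name parse_flang is invented for this example, and no attempt is made to check the codes against the ISO lists themselves.

```python
import re

# Two-letter language code, optionally followed by "-" and a three-letter
# country code, as in the examples EN-GBR, EN-CAN and FR-FRA.
FLANG_PATTERN = re.compile(r"^([A-Z]{2})(?:-([A-Z]{3}))?$")

def parse_flang(value):
    """Split a flang value into (language, country); country may be None."""
    match = FLANG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a recognised flang value: {value!r}")
    return match.group(1), match.group(2)

for sample in ("EN-GBR", "FR-FRA", "EN"):
    print(sample, parse_flang(sample))
```

Since the values recorded have not been normalized, a real application would probably want to log unrecognised values rather than raise an error.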
The particLinks element
- <relation>
- describes any kind of relationship or linkage amongst a
specified group of participants. Attributes include:
- active
- identifies the ‘active’ participants in a directed relationship, or all the participants in a mutual one.
- desc
- supplies a name for the relationship, seen from the point of view of the active participant in a directed relationship.
- mutual
- indicates whether the relationship holds equally amongst all participants. Legal values are:
- passive
- identifies the ‘passive’ participants in a directed relationship.
A list of the different types of relationship identified amongst participants is given in section ??.
Following the TEI Guidelines, we distinguish between mutual relationships, in which all participants are on an equal footing, and directed relationships, in which the roles of the participants are typically described differently. The roles applicable to a directed relationship are arbitrarily classed here as either active or passive. For example, the relationships ‘colleague’ or ‘spouse’ would be classed as mutual, while ‘employee’ or ‘wife’ would be classed as directed. A relationship such as ‘sister’ may or may not be directed, depending on whether it obtains between two women or between a man and a woman.
For a mutual relationship, only the active attribute will be supplied; for a directed one, both active and passive attributes will be supplied. In either case, these attributes take as value a list of the identifiers of the <person> elements understood to be involved in the relationship concerned.
In the current edition of the corpus, relation information is provided in this form only for the context-governed participants. The relationships between participants in the demographically-sampled part of the corpus are indicated by the role attribute as discussed above.
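The convention for the active and passive attributes can be illustrated with a short Python sketch. The <particLinks> fragment, the participant identifiers and the function name read_relations are all hypothetical. Because the legal values of the mutual attribute are not reproduced here, the sketch infers whether a relationship is mutual solely from the absence of a passive attribute, as described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical content: identifiers P1 to P3 are invented; the desc values
# are taken from the examples discussed in the text.
PARTIC_LINKS = """
<particLinks>
  <relation desc="colleague" active="P1 P2"/>
  <relation desc="employee" active="P3" passive="P1"/>
</particLinks>
"""

def read_relations(xml_text):
    """Return one dict per <relation>, with identifier lists split out."""
    relations = []
    for rel in ET.fromstring(xml_text).iter("relation"):
        active = rel.get("active", "").split()
        passive = rel.get("passive", "").split()
        relations.append({
            "desc": rel.get("desc"),
            "active": active,
            "passive": passive,
            # No passive participants means the relationship is mutual.
            "mutual": not passive,
        })
    return relations

for relation in read_relations(PARTIC_LINKS):
    print(relation)
```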
The setting description
- <name>
- contains a place name, usually prefixed by the name of the English county in which it is located.
- <locale>
- contains a brief informal description of the nature of a place, for example a room, a restaurant, a park bench etc.
- <activity>
- contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything. Bears an additional attribute:
Text classification
- <catRef>
- specifies one or more defined categories within some taxonomy or text typology. Attributes include:
- <classCode>
- contains the code used for this text in an externally-defined classification system: in this release of the BNC, the genre codes defined by David Lee are used.
- <keywords>
- contains a list of keywords or phrases identifying the topic or nature of a text, each of which is tagged as a <term> element.
A <catRef> element is provided in the header of each text. Its target attribute contains values for each of the classification codes listed in the following table and defined in the corpus header. In each case, the classification code consists of an alphabetic prefix (e.g. alltim) identifying the category (e.g. "date"), followed by a single digit indicating a value for that category. Thus the code alltim1 indicates ‘dated 1960-1974’. The value 0 is always used to indicate missing or unknown values. A list of the values used is given in section ?? below.
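The composition of a classification code (alphabetic prefix plus a single digit, with 0 meaning missing or unknown) can be unpacked as in the following Python sketch. It assumes that the target attribute holds a whitespace-separated list of codes; the particular combination of codes in the sample and the function name parse_target are invented for this example.

```python
import re

# An alphabetic prefix naming the category, followed by exactly one digit.
CODE_PATTERN = re.compile(r"^([a-z]+)(\d)$")

def parse_target(target):
    """Map each category prefix to its digit, using None where the digit is 0."""
    values = {}
    for code in target.split():
        match = CODE_PATTERN.match(code)
        if match is None:
            raise ValueError(f"not a recognised classification code: {code!r}")
        prefix, digit = match.group(1), int(match.group(2))
        # The value 0 always indicates a missing or unknown value.
        values[prefix] = None if digit == 0 else digit
    return values

# Hypothetical target value combining codes mentioned in this section.
print(parse_target("alltim1 wridom3 wrilev0"))
```

Representing 0 as None keeps unknown values from being mistaken for a real category when texts are counted or filtered.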
This taxonomy is that originally defined for selection and description of texts during the design of the corpus, as further discussed elsewhere. It is of course possible to classify the texts in many other ways, and no claim is made that this method is universally applicable or even generally useful, though it does serve to identify broadly distinct sub-parts of the corpus for investigation. The reader is also cautioned that, although an attempt has been made in the current edition of the corpus to correct the more egregious classification errors noted in the first edition, unquestionably many errors and inconsistencies remain. In particular, the categories wrilev (perceived level of difficulty) and wrista (estimated circulation size) were incorrectly differentiated during the preparation of the corpus and cannot be relied on.
A <classCode> element is also provided for every text in the corpus. It contains the code assigned to this text in David Lee's genre-based analysis carried out at Lancaster University since publication of the first edition of the BNC.
In the first release of the BNC, most texts were assigned a set of descriptive keywords, tagged within the <keywords> element. These terms were not taken from any particular descriptive thesaurus or closed vocabulary; the words or phrases used are those which seemed useful to the data preparation agency concerned, and are thus often inconsistent or even misleading. They have been retained unchanged in the present version of the BNC, pending a more thorough revision.
In this edition of the BNC, a second set of keywords has been supplied for the majority of written texts. These keywords are also tagged using a <keywords> element, but with a value for the source attribute of COPAC, indicating that the terms so tagged are derived from a different source. The source used is a major online library catalogue service (see http://www.copac.ac.uk), from which we have taken the subject keywords provided for each title identifiable as forming part of the BNC. Like other public access catalogue systems, COPAC uses a well-defined controlled list of keywords for its subject indexing, details of which are not further given here.
The revision description
The revision description (<revisionDesc>) element is the fourth and final element in the standard TEI header. In the BNC, it consists of a series of <change> elements, each containing a <date>, a <respStmt>, and a <para> element. A new <change> element was added at the start of the list for each major change made in the text or header during preparation of the current edition of the corpus.
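As an illustration of this structure, the following Python sketch reads each <change> element into a simple record. Everything in the sample fragment is hypothetical: the date format, the use of the standard TEI <resp> and <name> elements inside <respStmt>, and the wording of the notes are assumptions, and the function name read_changes is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical revision description; none of these values come from the corpus.
REVISION_DESC = """
<revisionDesc>
  <change>
    <date>2000-09-01</date>
    <respStmt><resp>ed</resp> <name>Example Editor</name></respStmt>
    <para>Hypothetical later change</para>
  </change>
  <change>
    <date>1999-12-01</date>
    <respStmt><resp>ed</resp> <name>Example Editor</name></respStmt>
    <para>Hypothetical earlier change</para>
  </change>
</revisionDesc>
"""

def flatten(element):
    """Return all text inside an element with whitespace runs collapsed."""
    return " ".join("".join(element.itertext()).split())

def read_changes(xml_text):
    """Return (date, responsibility, note) tuples in document order."""
    changes = []
    for change in ET.fromstring(xml_text).findall("change"):
        changes.append(tuple(
            flatten(change.find(tag)) for tag in ("date", "respStmt", "para")
        ))
    return changes

# New changes are added at the start of the list, so the first is the latest.
print(read_changes(REVISION_DESC)[0])
```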