The header

The header of a TEI-conformant text provides a structured description of its contents, analogous to the title page and front matter of a book. The component elements of a TEI header are intended to provide, in machine-processable form, all the information needed to make sensible use of the Corpus.

Every separate text in the British National Corpus (i.e. each bncDoc element) has its own header, referred to below as a text header. In addition, the corpus itself has a header, referred to below as the corpus header, containing information which is applicable to the whole corpus. Both corpus and text headers are represented by teiHeader elements. The corpus header is supplied in a separate file called bncHdr.xml, whereas text headers are prefixed to each file in the Texts directory.

In the remainder of this section, we describe the components of the teiHeader element, as used within the BNC. A TEI header contains a file description, an encoding description, a profile description and a revision description, represented by the following four elements: fileDesc, encodingDesc, profileDesc and revisionDesc.

The file description

The file description (fileDesc) is the first of the four main constituents of the header. It is intended to document an electronic file, i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) any characteristics peculiar to an individual file within it. In each case, it contains the following five subdivisions: the title statement (titleStmt), the edition statement (editionStmt), the extent statement (extent), the publication statement (publicationStmt) and the source description (sourceDesc). Further detail for each of these is given in the following subsections.

The title statement

The title statement (titleStmt) element of a BNC text contains one or more title elements, optionally followed by author, editor, or respStmt elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.
For the corpus header, the title statement looks like this:

    The British National Corpus: XML Edition
    Lead partner in consortium: Oxford University Press
    Text selection for miscellaneous and unpublished written materials: W R Chambers
    Text selection, data capture and transcription for spoken texts and for 14% of published written texts: Longman ELT
    Text selection for 86% published written texts: Oxford University Press
    Data capture and transcription for all miscellaneous and unpublished written texts and for 86% of published written texts: Oxford University Press
    XML conversion, encoding, storage and distribution: Oxford University Computing Services
    Text enrichment: Unit for Computer Research into the English Language, University of Lancaster

In individual corpus texts, the title statement follows a pattern like the following:

    The National Trust Magazine. Sample containing about 21015 words from a periodical (domain: arts)
    Data capture and transcription: Oxford University Press

The content of the title element includes the title of the source, followed by the phrase "Sample containing about", the approximate word count for the sample, and further information about the text type and domain, all extracted from other parts of the header. This is followed by responsibility statements showing which of the BNC Consortium members was responsible for capturing the text originally. Here are some typical examples:

    How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure)
    Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context
    The Scotsman: Arts section. Sample containing about 48246 words from a periodical (domain: arts)
    32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings.
    [Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce)

A respStmt element is used to indicate each agency responsible for any significant effort in the creation of the text. Since responsibilities for data encoding and storage, and for enrichment, are the same for all texts, they are not repeated in each text header. The responsibility for original data capture and transcription varies text by text, and is therefore stated in each text header, in the following manner:

    Data capture and transcription: Longman ELT

Author and editor information for the source from which a text is derived (e.g. the author of a book) is not included in the title statement but in the sourceDesc element discussed below.

The edition statement

The editionStmt element is used to specify an edition for each file making up the corpus. It takes the same form in both the corpus header and individual text headers:

    BNC XML Edition, January 2007

The extent statement

The extent element is used in each text header to specify the size of the text to which it is attached, as in the following example:

    21015 tokens; 21247 w-units; 957 s-units

These counts do not include the size of the header itself. The number of tokens is generated by the Unix wc utility, which simply counts blank-delimited strings; the other figures give the number of w and s elements respectively.

The publication statement

The publicationStmt element is used to specify publication and availability information for an electronic text. It contains the following three elements:

Individual text headers contain the following fixed text for the first two of these:

    Distributed under licence by Oxford University Computing Services on behalf of the BNC Consortium.
    This material is protected by international copyright laws and may not be copied or redistributed in any way. Consult the BNC Web Site at http://www.natcorp.ox.ac.uk for full licencing and distribution conditions.
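The format of the extent statement described above lends itself to a simple mechanical reading. The following is a minimal sketch in Python, not part of the BNC distribution: the function names parse_extent and count_tokens are hypothetical, and the sketch assumes that every extent string follows the "number label; number label" pattern of the example quoted above.

```python
import re


def parse_extent(extent: str) -> dict[str, int]:
    """Split an extent string such as '21015 tokens; 21247 w-units; 957 s-units'
    into a mapping from label to count (hypothetical helper)."""
    counts: dict[str, int] = {}
    for part in extent.split(";"):
        match = re.fullmatch(r"\s*(\d+)\s+(\S+)\s*", part)
        if match is None:
            raise ValueError(f"unexpected extent component: {part!r}")
        counts[match.group(2)] = int(match.group(1))
    return counts


def count_tokens(text: str) -> int:
    """Count blank-delimited strings, in the manner of the Unix wc word count."""
    return len(text.split())


extent = parse_extent("21015 tokens; 21247 w-units; 957 s-units")
```

Because the token figure counts blank-delimited strings while the w-unit figure counts w elements, the two numbers need not agree, as the example values show.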
For contractual reasons, the corpus header includes a somewhat longer rehearsal of the terms and conditions under which the BNC is made available.

For individual text headers, two identification numbers are supplied, distinguished by the value of their type attribute.

    A0A
    CAMfct

The second identifier (of type old) is the old-style mnemonic or numeric code attached to BNC texts during the production of the corpus, and is still used to label the original printed source materials in the BNC Archive. The first, three-character code (of type bnc) is the standard BNC identifier. It is also used both for the name of the file in which the text is stored and as the value supplied for the xml:id attribute on the bncDoc element containing the whole text, and should always be used to cite the text. The code is a completely arbitrary identifier, and does not indicate anything about the nature of the text.

The source description

The sourceDesc element is used to supply bibliographic details for the original source material from which an electronic text derives. In the case of a BNC text, this might be a book, pamphlet, newspaper etc., or a recording. One of the following elements available within the sourceDesc will be used, as appropriate:

These elements are not used within the corpus header, which simply contains a note about the sources from which the corpus was derived, tagged as a para (paragraph). The headers of individual texts each contain one of the above elements to specify their source. Context-governed spoken texts derived from broadcast or similar published material may have either a recording statement or a bibliographic record as their source.

All bibliographic data supplied in the individual text headers is collected together and reproduced in a later section.
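Returning briefly to the two identification numbers described under the publication statement, the following Python sketch shows one way of selecting the identifier that should be used for citation. It rests on assumptions: the element name idno is the standard TEI element for identifiers but is not named in the text above, the sample fragment is invented for illustration, and the function name identifiers is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration; the element name idno is an assumption.
SAMPLE = """<publicationStmt>
  <idno type="bnc">A0A</idno>
  <idno type="old">CAMfct</idno>
</publicationStmt>"""


def identifiers(xml_text: str) -> dict[str, str]:
    """Map each identifier's type attribute to its text content."""
    root = ET.fromstring(xml_text)
    return {
        element.get("type", ""): (element.text or "").strip()
        for element in root.iter("idno")
    }


ids = identifiers(SAMPLE)
citation_id = ids["bnc"]  # the standard BNC identifier, used to cite the text
```

Keying the result on the type attribute keeps the arbitrary three-character code apart from the old-style production code, so that only the former is used in citations.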
The recording statement

The recording statement (recordingStmt) element contains one or more recording elements:

The value of the n attribute here provides the number of the audio tape holding the original recording, as deposited with the British Library's Sound Archive in London.

In the following simple example, typical of most of the context-governed parts of the BNC, the recording element has no content at all:

When, as is often the case for the spoken demographic parts of the BNC, a text has been made up by transcribing several different recordings made by a single respondent over a period of time, each such recording will have its own recording element, as in the following example:

Note the presence of an xml:id attribute on each of the above recordings. The value given here is used to indicate the recording from which a given part of the text was transcribed. Each recording is transcribed as a distinct div (division) element within an stext. In that element, the identifier of the source recording is supplied as the value of a decls attribute. Thus, in the spoken text derived from the above mentioned recordings, there will be a div element starting as follows:

... which will contain the part of text transcribed from that recording. As noted above, the identifier supplied on the n attribute is quite distinct, and identifies the tape on which the original recording was made, and by which it is referenced in the British Library's Sound Archive.

Structured bibliographic record

In addition to its usage within the corpus texts, the bibl element is also used to record bibliographic information for each non-spoken component of the BNC. In this case, its structure is constrained to contain only the following elements in the order specified:

During production of the BNC, the n attribute was used with both author and imprint elements to supply a six-letter code identifying the author or imprint concerned.
The values used should be unique across the corpus, but this is not validated in the current release of the DTD.

The imprint element is supplied for published texts only and contains the following elements in the order given:

The following example demonstrates how these elements are used to record bibliographic details for a typical book:

    It might have been Jerusalem. Healy, Thomas Polygon Books Edinburgh 1991 1-81

The following example is typical of the case where a collection of leaflets or newsletters has been treated as a single text:

    [Potato Marketing Board leaflets] Potato Marketing Board London 1991

Occasionally, a bibliographic item has two titles, for example a series title as well as an individual title, or multiple authors. In the BNC such cases are treated simply by repeating the element concerned, sometimes using the level attribute to distinguish the bibliographic level of the title:

    Damages for personal injury and death: Damages on death Saunt, Thomas Kemp, David Longman Group UK Ltd Harlow 1993 52-68

Where series information is available for a given title, this is not normally tagged distinctly. Instead the series title is given as part of the monographic title, usually preceded by a colon. This level of bibliographic description has not been carried out with complete consistency across the current release of the corpus.

The encoding description

The second major component of the TEI header is the encoding description (encodingDesc). This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus. The BNC encodingDesc element has the following six components:

In the BNC, one of each of these elements appears in the corpus header. Only the tagsDecl element appears in the individual text headers.
Documentary components of the encoding description

The projectDesc element for the corpus gives a brief description of the goals, organization and results of the BNC project. The samplingDecl, editorialDecl and refsDecl elements similarly supply brief prose descriptions of the sampling procedures used in the project and the referencing system applied. This information is also summarized elsewhere in this documentation.

The tagging declaration

The tagging declaration (tagsDecl) element is used slightly differently in corpus and in text headers. In the corpus header, it is used to list every element name actually used within the corpus, together with a brief description of its function. In text headers, it is used to specify the number of elements actually tagged within each text. In either case it consists of a namespace element, containing a number of tagUsage elements, defined as follows:

In the corpus header, each tagUsage element contains a brief description of the element specified by its gi attribute; the occurs attribute is not supplied, as in the following extract:

    Non-verbal event in spoken text
    Point where source material has omitted
    Header or headline in written text

In text headers, the tagUsage elements are empty, but the occurs attribute is always supplied, and indicates the number of such elements which appear within the text, as in the following example, taken from a typical written text:

The reference and classification declarations

The refsDecl element for the corpus header defines the approved format for references to the corpus. It takes the following form:

    Canonical references in the British National Corpus are to text segment (s) elements, and are constructed by taking the value of the xml:id attribute of the bncDoc element containing the target text, and concatenating a dot separator, followed by the value of the n attribute of the target s element.
The standard TEI classDecl element is used in the BNC Corpus Header to formally define several text classification schemes which are used in the corpus. Each scheme or taxonomy defines a number of code/description pairs, applicable to a text in the corpus. For example, the written domain taxonomy defines twelve subject domains ("Imagination", "Informative: natural science", "Informative: applied science" etc.) and each written text is assigned to one of them. Each taxonomy is defined in the corpus header, using the following elements:

Here, for example, is the start of the taxonomy element defining the Written domain classification system as it appears in the corpus header:

    Written Domain
    Imaginative
    Informative: natural & pure science
    Informative: applied science
    ...

For a complete list of the taxonomies used in the BNC and the number of texts etc. classified according to them, refer to the corpus header and to the relevant chapter of this documentation.

The classification categories applicable to a given text are specified by the catRef element within the associated text header. Its target lists the identifiers of all category elements applicable to that text. For example, the header of a written text assigned to the social science domain which has a corporate author will include a catRef element like the following:

(The dots above represent the identifiers of all other category codes applicable to this text.) A full list of all category codes, and the numbers of texts so classified in the current release of the corpus, is provided elsewhere in this documentation. Further information about the classification and categorization of individual texts is provided within the textClass element discussed below.

The Xaira Specification

The Xaira Specification element is used by the XAIRA indexing software to index the BNC.
A brief description of its components is provided below; for full information, consult the Xaira documentation available from http://www.xaira.org/

The profile description

The third component of a TEI header is the profile description. In the BNC this is used to provide the following elements:

The creation element

This element is provided to record the date of publication for texts originally published separately, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, although the details provided vary. As a minimum, a date (tagged with the standard date element) will be included; this gives the date the content of this text was first created. For a spoken text, this will be the same as the date of the recording; for a written text, it will normally be the date of first publication of the edition, which may not be the same as the date of publication of the copy used. Here are two typical examples:

    1992-02-11:
    1971: originally published by Jonathan Cape.

Note that the BNC contains modernized editions of some classic texts such as Defoe's Robinson Crusoe (FRX); the creation date specified here is that of the creation of the modernized version rather than the 18th-century original. For imaginative works, the creation date is also the date used to classify the text (by means of the WRITIM category). For other written works, such as textbooks, which are likely to have been extensively revised since their first publication, the date used to classify the text will be that of the edition described in the sourceDesc, but the original date will also be recorded within the creation element.

The langUsage element

Unlike the other elements of the profile description, the language usage element occurs only in the corpus header. It contains the following text:

The language of the British National Corpus is modern British English.
Words, fragments, and passages from many other languages, both ancient and modern, occur within the corpus where these may be represented using a Latin alphabet. Long passages in these languages, and material in other languages, are generally silently deleted. In no case is the lang attribute used to indicate the language of a word, phrase or passage, nor are alternate writing system definitions used.

The participant description

The participant description (particDesc) element is used to provide information about speakers of texts transcribed for the BNC. It appears only within individual spoken text headers to define the participants specific to those texts. It contains a series of person elements describing the participants whose speech is transcribed in this text.

The person element

Each person element describes a single participant in a language interaction. It carries a number of attributes which are used to provide encoded values for some key aspects of the person concerned:

The xml:id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual should they appear in different texts. Since all demographically sampled conversations collected by a single respondent are treated together as a single text, and respondents were recruited from many different social contexts, the probability of the same person being recorded by different respondents is rather low, though not completely impossible.

On many occasions the speaker of a given utterance cannot be identified. A special code is used to indicate an unknown speaker, but, for consistency, this is also made unique to each text. Thus, an "unknown speaker" in one text will have a different identifying code from an "unknown speaker" in another.
As far as possible, different speakers are given different identifying codes, even where they cannot be identified with any confidence; thus there may be more than one "unidentified" speaker in the same text. Where several speakers speak together, if they are identified, then all of the relevant codes are given; if however they are not, then a special "unknown speaker group" code is used.

Where it is available, additional information about a participant is provided by one or more of the following elements, appearing within the person element:

In each case, the information provided is that given by the respondent and is taken from the log books issued to all participants in the demographic part of the corpus. It has not been normalized. Here is a typical example from the demographic part of the corpus:

    Terry 14 student London

Here is a typical example from the context-governed part of the corpus:

    frank harasikwa politician Euro candidate presenting self for selection

Any recorded relationship between speakers in the demographically sampled part of the corpus is specified by means of the role attribute, which indicates how the speaker concerned is related to the respondent, for example as a friend, colleague, brother, wife, etc. For example, the participant information recorded in the header for a text (KSU) comprising conversations between four participants: Michael and Steve (who are brothers), their mother Christine and their aunt Leslie is as follows:

    13 Michael student
    45 Christine credit controller
    45 Leslie unemployed
    21 Steve unemployed

In the context-governed part of the corpus however, there is no respondent and relationship information must be deduced from the other information provided. The role attribute for person elements in these texts will usually have the value unspecified.

The setting description

The settingDesc element is used to describe the context within which a spoken text takes place.
It appears once in the header of each spoken text, and contains one or more setting elements for each distinct recording. The content of each setting element supplies additional details about the place, time of day, and other activities going on, using the following additional elements:

Some typical examples follow:

    Essex: Harlow Harlow College A'level lecture
    Lancashire: Morecambe at home watching television

Text classification

The TEI provides a number of ways in which classification or text-type information may be specified for a text, grouped together within a textClass element, which appears once in the header of each text. Classifications may be represented using references to internally defined classifications provided in the classCode element (such as the BNC classification scheme described above), by reference to some other predefined classification system, or by an open set of keywords. All three methods are used in the BNC, using the following elements:

A catRef element is provided in the header of each text. Its target attribute contains values for each of the classification codes defined in the corpus header. In each case, the classification code consists of a code used as the identifier of a category element within a taxonomy element defined in the corpus header. For example: ALLTIM1 indicates dated 1960-1974. A list of the values used is given in a later section.

This taxonomy is that originally defined for selection and description of texts during the design of the corpus, as further discussed elsewhere. It is of course possible to classify the texts in many other ways, and no claim is made that this method is universally applicable or even generally useful, though it does serve to identify broadly distinct sub-parts of the corpus for investigation.
The reader is also cautioned that, although an attempt has been made in the current edition of the corpus to correct the more egregious classification errors noted in the first edition, unquestionably many errors and inconsistencies remain. In particular, the categories WRILEV (perceived level of difficulty) and WRISTA (estimated circulation size) were incorrectly differentiated during the preparation of the corpus and cannot be relied on.

A classCode element is also provided for every text in the corpus. This contains the code assigned to the text in a genre-based analysis carried out at Lancaster University by David Lee since publication of the first edition of the BNC. Lee's scheme classes the texts more delicately in most cases, since it takes into account their topic or subject matter (see further below).

Lee's scheme is also used as the basis of a very simple categorization for each text, which is provided by means of the type attribute on its text or stext element. This categorization distinguishes six categories for written text (fiction, academic prose, non-academic prose, newspapers, other published, unpublished), and two for spoken text (conversation, other). It may be found a convenient way of distinguishing the major text types represented in the corpus.

In the first release of the BNC, most texts were assigned a set of descriptive keywords, tagged as term elements within the keywords element. These terms were not taken from any particular descriptive thesaurus or closed vocabulary; the words or phrases used are those which seemed useful to the data preparation agency concerned, and are thus often inconsistent or even misleading. They have been retained unchanged in the present version of the BNC, pending a more thorough revision.
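As an illustration of how the classification codes carried by a catRef element might be handled, here is a minimal Python sketch. It makes several assumptions that are not stated in this documentation: that the codes in the target attribute are separated by white space, and that each code consists of a taxonomy prefix (letters) followed by an optional numeric value, as in ALLTIM1. The function name parse_target is hypothetical, and the sample values WRILEV2 and WRISTA3 are invented; only ALLTIM1 and the caution about WRILEV and WRISTA are taken from the text above.

```python
import re

# Taxonomies that the documentation says cannot be relied on.
UNRELIABLE = {"WRILEV", "WRISTA"}


def parse_target(target: str) -> list[dict[str, object]]:
    """Split a catRef target value into codes, separating the taxonomy
    prefix from its numeric value (an assumed code layout)."""
    parsed = []
    for code in target.split():
        match = re.fullmatch(r"([A-Za-z]+)(\d*)", code)
        if match is None:
            raise ValueError(f"unexpected classification code: {code!r}")
        taxonomy, value = match.groups()
        parsed.append({
            "code": code,
            "taxonomy": taxonomy,
            "value": value,
            "reliable": taxonomy not in UNRELIABLE,
        })
    return parsed


# ALLTIM1 comes from the text above; the other two values are invented.
codes = parse_target("ALLTIM1 WRILEV2 WRISTA3")
```

Flagging the two unreliable taxonomies at parsing time makes it harder to select sub-corpora on criteria that the documentation itself warns against.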
In the World (second) Edition this set of keywords was complemented for most written texts by a second set, also tagged using a keywords element, but with a value for its source attribute of COPAC, indicating that the terms so tagged are derived from a different source. The source used was a major online library catalogue service. Like other public access catalogue systems, COPAC uses a well-defined controlled list of keywords for its subject indexing, details of which are not further given here.

Here is an example showing how one text (BND) is classified in each of these ways:

    ... W_religion Marriage - Religious aspects - Christianity Marriage - Christian viewpoints Christian guide to marriage ......

The revision description

The revision description (revisionDesc) element is the fourth and final element of a standard TEI header. In the BNC, it consists of a series of change elements. Here is part of a typical example:

    Tag usage updated for BNC-XML
    Last check for BNC World first release
    ...
    corrected tagUsage
    POS codes revised for BNC-2; header updated
    Initial accession to corpus