5 The header

The header of a TEI-conformant text generally provides a highly structured description of its contents, analogous to the title page and front matter provided for conventional printed books. Such information is all too often missing in electronic texts; or if supplied, provided only in the form of external documentation such as this manual. The header elements described in this section are intended to provide in machine-processable form all the information needed to make sensible use of the Corpus.

Every separate text in the British National Corpus (i.e. each <bncDoc> element) has its own header, referred to below as a text header. The corpus itself also has a header, referred to below as the corpus header, containing information which is applicable to the whole corpus, possibly with some local over-riding, as described in section 5.5. Both corpus and text headers are represented by <header> elements, the type attribute being used to distinguish the two.

<header> supplies the descriptive and declarative information making up an ``electronic title page'' prefixed to every CDIF-conformant text, and also that prefixed to the corpus as a whole. Attributes include:
- type specifies the kind of document to which the header is attached. Legal values are:
  - corpus the header is attached to the corpus.
  - text the header is attached to a single text.
- creator specifies the agency responsible for creating the header.
- status specifies the revision status of the associated document. Legal values are:
  - new for BNC release 1.0.
  - update for all subsequent releases.
- update specifies the date on which the header content was last changed or created.

In the remainder of this section, we describe the components of the <header> elements, which are closely modelled on components of the corresponding TEI element, the <teiHeader>. The CDIF header contains a file description (section 5.1), an encoding description (section 5.2), a profile description (section 5.3) and a revision description (section 5.4), represented by the following four elements:

<fileDesc> contains a full bibliographic description of the corpus itself or of a text within it.
<encDesc> documents the relationship between an electronic text and the source or sources from which it was derived.
<profDesc> provides further information about various aspects of a text, specifically the language used, the situation and date of its production, the participants and their setting, and a descriptive classification for it.
<revDesc> summarizes the revision history of a file.

5.1 The file description (`<fileDesc>`)

The file description is the first of the four main constituents of the header and is represented by the <fileDesc> element. It documents the electronic file itself, i.e. (in the case of a corpus header) the whole corpus, or (in the case of a text header) any characteristics peculiar to an individual file within it. It comprises the following five subdivisions:

<titStmt> groups information concerning the title of the corpus and its constituent texts.
<ednStmt> contains any additional information relating to a particular version of a corpus text.
<extent> describes the approximate size of the electronic text as stored on some carrier medium, specified in words (corpus header) and additionally in Kb (corpus texts).
<pubStmt> groups information concerning the publication or distribution of the corpus and its constituent texts.
<srcDesc> supplies a bibliographic description of the copy text(s) from which an electronic text was derived or generated.

Further detail of each of these is given in the following subsections. Note that all except the source description relate only to the electronic file (the corpus text itself).

5.1.1 The title statement (`<titStmt>`)

This element corresponds with the TEI <titleStmt>, but has a simpler structure, consisting of a <title> element, followed by zero or more <respStmt> elements. These sub-elements are used throughout the header, wherever the title of a work or a statement of responsibility is required.

<title> the title or chief name of a work, including any alternative titles or subtitles.
<respStmt> supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription.
<resp> contains a phrase describing the nature of a person's or institution's intellectual responsibility.
<name> proper name of a person, place or institution.

In the file description, the <title> element contains a (possibly shortened) version of the title of the text concerned, generally followed by the phrase ``an electronic sample''. For texts derived from unpublished, untitled, or spoken materials a descriptive summary title is used. A <respStmt> element is used to indicate each agency responsible for any significant effort in the creation of the text. Responsibilities for data encoding and storage, and for enrichment are the same for all texts, but the responsibility for original data capture varies.

    <titStmt>
     <title>Captain Pugwash and the huge reward -- an electronic sample</title>
     <respStmt><resp>Data capture and transcription</resp>
     <name>Oxford University Press</name></respStmt>
    </titStmt>

5.1.2 The edition statement

This element corresponds with the TEI <editionStmt>, except that its content is an unstructured note. In the BNC Sampler, it contains the following text:

<ednStmt n=s1>Header automatically generated by mkhdr 0.30
Updated for sampler by interleave 1.1</ednStmt>

5.1.3 The extent statement

This element corresponds with the TEI <extent> element in that it describes the number of words in the whole corpus or in an individual text. It differs in that the size is specified formally as the value of an attribute and that it normally has no content.

<extent> describes the approximate size of the electronic text as stored on some carrier medium, specified in words (corpus header) and additionally in Kb (corpus texts). Attributes include:
- words specifies the number of BNC-defined words in the text.
- kb specifies the size of an individual text in Kbytes.

The number of words is calculated according to a simple algorithm which defines words as blank-delimited strings, and is therefore not identical to the number of <w> elements actually present in the text.

The kb attribute is supplied for individual text headers only. Its value gives the size of the text, including its header, as a number of kilobytes (multiples of 1,024 octets, rounded up to the next integer) in its canonical CDIF representation as a Unix text file using the ISO 646 coded character set. This is useful in calculating media requirements or file download times. For example:

 
    <extent words=992 kb=20>
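
The two calculations described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `extent_attributes` is hypothetical, the text is assumed to be available as a plain string, and ASCII is used as a stand-in for the ISO 646 coded character set. The source does not say how markup is treated when counting words, so the sketch simply counts blank-delimited strings in whatever it is given.

```python
import math

def extent_attributes(text: str, encoding: str = "ascii") -> dict:
    """Return words/kb values for a text (hypothetical helper, not BNC software)."""
    # Words are blank-delimited strings, following the simple algorithm above.
    words = len(text.split())
    # Size of the encoded text in octets, expressed in units of 1,024 octets
    # and rounded up to the next integer.
    octets = len(text.encode(encoding))
    kb = math.ceil(octets / 1024)
    return {"words": words, "kb": kb}

print(extent_attributes("The cat sat on the mat."))
```

A text of exactly 1,024 octets yields kb=1, while one of 1,025 octets yields kb=2, which is the rounding-up behaviour the definition calls for.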

5.1.4 The publication statement (`<pubStmt>`)

This corresponds to the TEI <publicationStmt> but has a narrower focus, since it relates only to the public availability of the electronic text. It contains the following sub-elements:

<respStmt> supplies information about any person or institution responsible for the intellectual content of a text, edition, or electronic transcription.
<address> contains a postal or other address, for example of a publisher, distributor, etc.
<avail> supplies information about the availability of a text, for example any restrictions on its use or distribution, its copyright status, etc. Attributes include:
- status supplies a code identifying the current availability of the text. Legal values are:
  - free the text is freely available.
  - restrict the text is not freely available.
  - unknown the status of the text is unknown.
- region specifies the territories within which rights in the electronic text apply. Values include:
  - EU European Union only
  - not-NA all parts of the world other than USA and Canada
  - not-NAP all parts of the world other than the USA, Canada, and the Philippines
  - not-US all parts of the world outside the USA
  - not-USP all parts of the world other than the USA and the Philippines
  - world rights apply in all parts of the world.

All the texts included in the BNC Sampler are believed to have been cleared for world rights.

5.1.5 The source description (`<srcDesc>`)

This element corresponds with the TEI <sourceDesc>, except that its content is constrained to include only the following possible sub-elements:

<recStmt> describes a set of recordings used in transcription of a spoken text, either as a series of paragraphs or as a formally structured recording element.
<biblStr> contains a structured bibliographic citation, in which only bibliographic sub-elements appear and in a specified order.

When a particular text contains items derived from more than one bibliographic source or recording, all relevant sources for which information is available are listed in the text header, and individual <div>, <div1> or <div2> elements are associated with the correct citation or recording by means of the decls attribute, as described in section 5.5.

Context-governed spoken texts derived from broadcast or similar ``published'' material may have either a recording statement or a bibliographic record as their source.

5.1.5.1 The recording statement (`<recStmt>`)

A recording statement consists of one or more <rec> elements, with the following attributes:

<rec> details of a particular audio recording used as the source of a spoken text, either directly or from a public broadcast. Attributes include:
- type characterizes the recording in terms of the equipment used to make it. Legal values include:
  - dat recording made on Digital Audio tape
  - unknown recording equipment or quality unknown
  - walk recording made on Walkman
- date specifies the date of the recording
- time specifies the time of day when the recording was made.
- dur specifies the duration of the recording, in seconds

Like the <extent> element, this element differs from its TEI equivalent (the <recording> element) in that much of its content is sufficiently regular to be represented by attributes rather than by an included prose description.

Here is a typical recording statement:

<recStmt>
<rec dur=3300 date=1992-07-15 time="12:15" type=DAT>
</rec>
</recStmt>
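
Because the regular content of a <rec> element is carried by attributes, it can be read with a small amount of code. The following Python sketch is an assumption-laden illustration, not part of any BNC tool: the names `ATTR` and `parse_attributes` are hypothetical, and the regular expression handles only the quoted and unquoted attribute forms seen in the examples in this manual, not SGML in general.

```python
import re

# name=value pairs whose value is quoted or is a bare SGML token.
ATTR = re.compile(r"""
    ([A-Za-z][\w.-]*)      # attribute name
    \s*=\s*
    (?: "([^"]*)"          # double-quoted value
      | '([^']*)'          # single-quoted value
      | ([^\s>]+) )        # unquoted token
""", re.VERBOSE)

def parse_attributes(start_tag: str) -> dict:
    """Return the attributes of one start-tag as a name -> value mapping."""
    attrs = {}
    for name, dq, sq, bare in ATTR.findall(start_tag):
        attrs[name.lower()] = dq or sq or bare
    return attrs

rec = parse_attributes('<rec dur=3300 date=1992-07-15 time="12:15" type=DAT>')
# dur is given in seconds; convert it for display.
minutes, seconds = divmod(int(rec["dur"]), 60)
print(rec["date"], rec["time"], f"{minutes} min {seconds} s", rec["type"].lower())
```

For the recording statement shown above, the duration of 3300 seconds corresponds to 55 minutes.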

5.1.5.2 Structured bibliographic record (`<biblStr>`)

The <biblStr> element corresponds to the TEI <biblStruct> element. It has the following component sub-elements:

<analytic> contains bibliographic elements describing an item (e.g. an article or poem) published within a monograph, journal, or periodical and not as an independent publication (not used in BNC Sampler).
<monogr> contains bibliographic elements describing an item (e.g. a book or journal) published as an independent item (i.e. as a separate physical object).

At least one <monogr> element must be present in a <biblStr> element. It may contain the following elements:

<title> the title or chief name of a work, including any alternative titles or subtitles.
<author> in a bibliographic reference, contains the name of an author (personal or corporate) of a work; names should be given in a canonical form, with surnames preceding forenames. Attributes include:
- domicile specifies the author's domicile, as established for the purposes of the BNC ``Britishness'' test.
- born specifies the author's year of birth, where available.
<respStmt> supplies information about any person or institution responsible for the intellectual content of a source text, edition, or electronic transcription.
<edition> provides bibliographic details for an edition of some text.
<imprint> groups information relating to the publication or distribution of a bibliographic item.
<idno> supplies any standard or non-standard number used to identify a bibliographic item. Attributes include:
- type categorizes the number, for example as an ISBN or other standard series. Possible values include:
  - bl British Library call number
  - bnc British National Corpus text identifier
  - isbn International standard book number
  - issn International standard serial number
  - pub Publisher's reference code
<biblScop> defines the scope of a bibliographic reference, for example as a list of page numbers, or a named subdivision of a larger work. Attributes include:
- type identifies the type of information conveyed by the element. Legal values are:
  - issue the element contains an issue number, or volume and issue numbers.
  - pp the element contains a page number or page range.
  - vol the element contains a volume number.
<bibNote> a descriptive note supplying additional information of any kind relating to a bibliographic item described within a corpus or text header.

The order in which these components appear is more tightly constrained in CDIF than in the corresponding TEI element. In particular, the <title> element must be present and it must be given first. None of the other components is mandatory, but if any of them are supplied, they must be in the following order, following the title:

any number of statements of intellectual responsibility (i.e. <author> or <respStmt> elements) relating to the work
any number of edition statements, each followed by an optional <respStmt> (this information is not available in the current version of the corpus)
any number of <imprint> statements
any number of <bibNote>, <idno> or <biblScop> elements, in any order.
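
The ordering constraint just listed can be expressed compactly as a regular expression over the sequence of child element names. The Python sketch below is illustrative only; the names `CODES`, `ORDER` and `monogr_order_ok` are hypothetical, and in practice the constraint is enforced by the DTD rather than by code like this.

```python
import re

# One letter per element type, so that the ordering rule becomes a pattern.
CODES = {"title": "T", "author": "A", "respStmt": "R", "edition": "E",
         "imprint": "I", "bibNote": "N", "idno": "N", "biblScop": "N"}

# Title first; then responsibility statements; then editions, each with an
# optional respStmt; then imprints; then notes, numbers and scopes in any order.
ORDER = re.compile(r"T[AR]*(?:ER?)*I*N*")

def monogr_order_ok(children: list[str]) -> bool:
    """Check the names of the children of a <monogr> against the rule above."""
    try:
        encoded = "".join(CODES[name] for name in children)
    except KeyError:
        return False  # an element not listed for <monogr>
    return ORDER.fullmatch(encoded) is not None

print(monogr_order_ok(["title", "author", "imprint"]))
print(monogr_order_ok(["title", "imprint", "author"]))
```

The first call follows the required order; the second places an <author> after the <imprint> and is therefore rejected.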

As noted above, a title may be generated if necessary. In the current version of the corpus, subtitles or alternative titles, if recorded, are not distinguished from the main title, other than by the use of conventional punctuation.

The domicile and born attributes are specified for some authors only, where the information is available. This information is not recorded for editors or other people with intellectual responsibility for a text.

The n attribute is used with both <author> and <imprint> elements to supply a six-letter code used to identify the author or imprint concerned. The values used are in fact unique across the corpus, but this is not validated by the current release of the DTD, for technical reasons.

For published texts at least one <imprint> element should be present. It may contain names of persons or organizations, tagged with the <name> element, and dates, tagged using the <date> element. Where a place of publication is specified, the <pubPlace> element is used.

<name> proper name of a person, place or institution. Attributes include:
- type categorizes the name. Legal values are:
  - org name of an organisation
  - person name of a person
  - place name of a place
<date> a calendar date in any format. Attributes include:
- value specifies a standard value for this date in ISO 8601 format
<pubPlace> place of publication for a book, article, etc.

Here is an example source description for a text taken from a provincial newspaper:

<srcDesc>
<biblStr>
<monogr>
<title>
East Anglian daily times
</title>
<imprint n=EASTAN1>
<name>
East Anglian Daily Times Company
</name>
<pubPlace>
Ipswich
</pubPlace>
<date value=1993-03>
1993-03
</date>
</imprint>
</monogr>
</biblStr>
</srcDesc>

and here is one for a typical printed source:

 <srcDesc>
<biblStr>
<monogr>
<title>
Captain Pugwash and the huge reward
</title>
<author n=Ryan-J1 domicile=Rye>
Ryan, John
</author>
<imprint n=GUNGAR1>
<name>
Gungarden Books
</name>
<pubPlace>
Rye, East Sussex
</pubPlace>
<date value=1991>
1991
</date>
</imprint>
</monogr>
</biblStr>
</srcDesc>

5.2 The encoding description (`<encDesc>`)

The second major component of the CDIF header is the encoding description, represented by the <encDesc> element. This contains information about the relationship between an encoded text and its original source and describes the editorial and other principles employed throughout the corpus. It also contains reference information used throughout the corpus.

The <encDesc> element has the following six components:

<projDesc> describes in detail the purpose for which an electronic file was encoded, together with any other relevant information concerning the process by which it was assembled or collected.
<sampDecl> contains a prose description of the rationale and methods used in sampling texts in the creation of the corpus.
<editDecl> provides details of editorial principles and practices applied during the encoding of a text.
<tagsDecl> provides detailed information about the tagging applied to a corpus text.
<refsDecl> specifies how canonical references are constructed for a text.
<clasDecl> contains a series of <category> elements, defining the classification codes used for texts within the corpus.

With the exception of the <tagsDecl> element, each of these elements appears only in the corpus header. Each can appear once only, except for <sampDecl> which appears once for each kind of sampling method employed.

5.2.1 Documentary components of the encoding description

The <projDesc> element for the corpus gives a brief description of the goals, organization and results of the BNC project. It appears in the corpus header only.

The <sampDecl> elements for the corpus read as follows:

<sampDecl id=SD000>
Published: chosen selectively from candidate population
</sampDecl>
<sampDecl id=SD001>
Published: chosen at random from candidate population
</sampDecl>
<sampDecl id=SD002>
Unpublished: chosen according to relevant design criteria
</sampDecl>
<sampDecl id=SD003>
Spoken: obtained from demographic sample of UK population
</sampDecl>
<sampDecl id=SD004>
Spoken: obtained in context determined by design criteria
</sampDecl>

The values given for the id attribute on the <sampDecl> applying to a particular text will be specified in the list of identifier values supplied as the value for the target attribute of the <catRef> element prefixed to the text's header. For example, the header of a spoken demographic text will include a <catRef> element like the following:

<catRef target='... SD003 ...'>

where the dots indicate other declarations applicable to this text.

The <refsDecl> element for the corpus header defines the approved format for references to the corpus. Only one format is defined, but it is defined with different identifying keys. These are not currently used.

The <clasDecl> element for the corpus header defines all the classification codes used by component texts. It is discussed further below in section 5.2.4.

5.2.2 The editorial declaration

The <editDecl> element in the corpus header contains the following elements, each specifying a particular kind of editorial practice used for some portion of the corpus, and supplying an identifying code for it. Where, as for the <segm> element, the same principles apply across the whole corpus, this is documented once within the corpus header as a series of paragraphs. Where different parts of the corpus apply different practices (as for example with the <quot> or <hyph> elements) all possible practices are defined once for all in the corpus header.

<corr> specifies a set of correction and normalisation practices applied in creating one or more components of the corpus.
<quot> specifies editorial practice adopted with respect to quotation marks in the original. Attributes include:
- form specifies how quotation marks are indicated within the text. Legal values are:
  - nonstd open and close quotation marks are represented indiscriminately by the same entity reference.
  - std use of quotation marks has been standardized; open and close quotation marks are distinct.
  - unknown use of quotation marks is unknown.
<hyph> summarizes the way in which end-of-line hyphenation in a source text has been treated in the encoded version of it.
<segm> describes the principles according to which the text has been segmented.
<trans> describes the principles according to which the text has been transduced, either in transcribing it from audio tape to written form, or in converting from an electronic original into CDIF.

The following series of <editDecl> elements is defined in the current version of the corpus header:

<editDecl id=CN000>
<corr>
Errors tagged with <sic> when seen; no normalization
</corr>
</editDecl>
<editDecl id=CN001>
<corr>
Errors tagged with <sic> if seen; norm'n with <reg>
</corr>
</editDecl>
<editDecl id=CN002>
<corr>
Normalized to standard British English or control list member
</corr>
</editDecl>
<editDecl id=CN004>
<corr>
Corrections and normalizations applied silently
</corr>
</editDecl>
<editDecl id=HN000>
<hyph>
Smart elision of line-end hyphens; &rehy used for remainder
</hyph>
</editDecl>
<editDecl id=HN001>
<hyph>
Dumb elision of line-end hyphens; true hyphens hand-reinstated
</hyph>
</editDecl>
<editDecl id=HN002>
<hyph>
Line-end hyphens removed by hand where appropriate
</hyph>
</editDecl>
<editDecl id=HN003>
<hyph>
Source material contains no line-end hyphens
</hyph>
</editDecl>
<editDecl id=QN000>
<quot>
Open, close quote normalized to &bquo, &equo
</quot>
</editDecl>
<editDecl id=QN001>
<quot>
Open and close quote normalized to &quo
</quot>
</editDecl>
<editDecl id=QN002>
<quot>
Quotation may be represented using <shift>
</quot>
</editDecl>
<editDecl id=SN000>
<segm>
Segmentation and word-class marking by CLAWS 5
</segm>
</editDecl>
<editDecl id=SN001>
<segm>
Segmentation and word-class marking by CLAWS 6
</segm>
</editDecl>
<editDecl id=SN002>
<segm>
Segmentation, word-class by CLAWS 6, augmented by hand
</segm>
</editDecl>
<editDecl id=TN000>
<trans>
Copy-typed from hard-copy into OUP format; transduced to CDIF
</trans>
</editDecl>
<editDecl id=TN001>
<trans>
Copy-typed from hard-copy into Longman format; transduced to CDIF
</trans>
</editDecl>
<editDecl id=TN002>
<trans>
Scanned from hard-copy into OUP format; transduced to CDIF
</trans>
</editDecl>
<editDecl id=TN003>
<trans>
Scanned from hard-copy into Longman format; transduced to CDIF
</trans>
</editDecl>
<editDecl id=TN004>
<trans>
Transduced from M-R into OUP format; transduced to CDIF
</trans>
</editDecl>
<editDecl id=TN005>
<trans>
Transduced from M-R into Longman format; transduced to CDIF
</trans>
</editDecl>
<editDecl id=TN006>
<trans>
Recording transcribed into Longman format; transduced to CDIF
</trans>
</editDecl>

The editorial practices applicable to a given text are specified by the target attribute of the <catRef> element prefixed to the text's header, in the same way as for other declarable elements in the header. For example, the header of a text in which corrections have been silently applied will include a <catRef> element like the following:

<catRef target='... CN004 ...'>

where the dots indicate other declarations applicable to this text.
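
Resolving a target value against the declarations in the corpus header amounts to a lookup per identifier. The Python sketch below assumes a hypothetical table `DECLARATIONS` holding a few of the declarations quoted in this section; a real table would be built from the whole corpus header. The function name `resolve_target`, the sample target value and the identifier `spoDom1` are invented for illustration.

```python
# A few declarations copied from the corpus header excerpts in this section.
DECLARATIONS = {
    "SD003": ("sampDecl", "Spoken: obtained from demographic sample of UK population"),
    "CN004": ("corr", "Corrections and normalizations applied silently"),
    "HN003": ("hyph", "Source material contains no line-end hyphens"),
    "TN006": ("trans", "Recording transcribed into Longman format; transduced to CDIF"),
}

def resolve_target(target: str) -> dict:
    """Map each identifier in a catRef target value to its declaration, if known."""
    resolved = {}
    for ident in target.split():
        # Identifiers are compared without regard to case (an assumption here).
        resolved[ident] = DECLARATIONS.get(ident.upper())
    return resolved

# Hypothetical target value; spoDom1 is not in the table and resolves to None.
result = resolve_target("SD003 CN004 HN003 TN006 spoDom1")
for ident, decl in result.items():
    print(ident, decl)
```

Identifiers missing from the table are reported as None rather than raising an error, so that an incomplete table can still be used for inspection.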

5.2.3 The tagging declaration (`<tagsDecl>`)

This element is used slightly differently in corpus and in text headers. In the corpus header, it is used to list every element name actually used within the corpus, together with a brief description of its function. In text headers, the same element is used to specify the number of SGML elements actually tagged within each text. In both cases it consists of a number of <tagUsage> elements, defined as follows:

<tagUsage> supplies information about the usage of a specific element within a <text>. Attributes include:
- gi the name (generic identifier) of the element indicated by the tag.
- occurs the number of occurrences of this element within the text.

In the corpus header, each <tagUsage> element contains a brief description of the element specified by its gi attribute; the occurs attribute is not supplied. In text headers, the <tagUsage> elements are empty, but the occurs attribute is always supplied, and indicates the number of such elements which appear within the text.

A typical written text has a tag declaration like the following:

    <tagsDecl>
      <tagUsage gi=c occurs=5746>
      </tagUsage>
      <tagUsage gi=caption occurs=84>
      </tagUsage>
      <tagUsage gi=div1 occurs=37>
      </tagUsage>
      <tagUsage gi=div2 occurs=66>
      </tagUsage>
      <tagUsage gi=div3 occurs=13>
      </tagUsage>
      <tagUsage gi=gap occurs=6>
      </tagUsage>
      <tagUsage gi=head occurs=156>
      </tagUsage>
      <tagUsage gi=hi occurs=147>
      </tagUsage>
      <tagUsage gi=l occurs=2>
      </tagUsage>
      <tagUsage gi=p occurs=596>
      </tagUsage>
      <tagUsage gi=poem occurs=1>
      </tagUsage>
      <tagUsage gi=ptr occurs=84>
      </tagUsage>
      <tagUsage gi=quote occurs=3>
      </tagUsage>
      <tagUsage gi=s occurs=2411>
      </tagUsage>
      <tagUsage gi=salute occurs=17>
      </tagUsage>
      <tagUsage gi=sic occurs=1>
      </tagUsage>
      <tagUsage gi=text occurs=1>
      </tagUsage>
      <tagUsage gi=w occurs=41465>
      </tagUsage>
    </tagsDecl>

A typical spoken text has a tag declaration like the following:

    <tagsDecl>
      <tagUsage gi=align occurs=2>
      </tagUsage>
      <tagUsage gi=c occurs=530>
      </tagUsage>
      <tagUsage gi=div occurs=2>
      </tagUsage>
      <tagUsage gi=event occurs=10>
      </tagUsage>
      <tagUsage gi=loc occurs=66>
      </tagUsage>
      <tagUsage gi=pause occurs=67>
      </tagUsage>
      <tagUsage gi=ptr occurs=132>
      </tagUsage>
      <tagUsage gi=s occurs=494>
      </tagUsage>
      <tagUsage gi=shift occurs=10>
      </tagUsage>
      <tagUsage gi=sic occurs=1>
      </tagUsage>
      <tagUsage gi=stext occurs=1>
      </tagUsage>
      <tagUsage gi=trunc occurs=9>
      </tagUsage>
      <tagUsage gi=u occurs=391>
      </tagUsage>
      <tagUsage gi=unclear occurs=47>
      </tagUsage>
      <tagUsage gi=vocal occurs=48>
      </tagUsage>
      <tagUsage gi=w occurs=2386>
      </tagUsage>
    </tagsDecl>
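
Counts of this kind could be derived by tallying start-tags in the body of a text. The following Python sketch shows one way of doing so; it is not the program actually used to build the headers. The names `START_TAG` and `tag_usage` and the sample fragment are hypothetical, and the pattern ignores complications such as comments or marked sections.

```python
import re
from collections import Counter

# A start-tag is "<" followed directly by a name; end-tags begin with "</".
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)")

def tag_usage(body: str) -> str:
    """Build a text-header style <tagsDecl> from the start-tags found in body."""
    counts = Counter(name.lower() for name in START_TAG.findall(body))
    lines = ["<tagsDecl>"]
    for gi in sorted(counts):
        lines.append(f"  <tagUsage gi={gi} occurs={counts[gi]}>")
        lines.append("  </tagUsage>")
    lines.append("</tagsDecl>")
    return "\n".join(lines)

# Invented fragment, used only to exercise the function.
sample = "<text><p><s n=1><w AT0>The <w NN1>cat <w VVD>sat<c PUN>.</s></p></text>"
print(tag_usage(sample))
```

The entries are emitted in alphabetical order of element name, as in the two declarations shown above.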

5.2.4 The classification declaration (`<clasDecl>`)

The <clasDecl> element contains the descriptive taxonomy used to classify texts within the corpus. It occurs once, in the corpus header, and consists of a set of <category> elements, each representing a particular textual classification feature and a value for that feature.

<category> contains an individual descriptive category or feature-value pair.
<catDesc> describes some category within a taxonomy or text typology, in the form of a brief prose description.

The global id attribute is required for the <category> element, since it is used to associate a <catRef> within a text header with the descriptive category appropriate to it.

The <catDesc> element is used to contain the value for a feature within a <category>, unless that category is further subdivided, in which case a nested <category> element may be used.

For example, the following <category> elements appear within the BNC <clasDecl> element in the header:

<category id=wriDom>
<catDesc>
Domain for written corpus texts
</catDesc>
<category id=wriDom1>
<catDesc>
Imaginative
</catDesc>
</category>
<category id=wriDom2>
<catDesc>
Informative -- natural & pure science
</catDesc>
</category>
<category id=wriDom3>
<catDesc>
Informative -- applied science
</catDesc>
</category>
<category id=wriDom4>
<catDesc>
Informative -- social science
</catDesc>
</category>
<category id=wriDom5>
<catDesc>
Informative -- world affairs
</catDesc>
</category>
<category id=wriDom6>
<catDesc>
Informative -- commerce & finance
</catDesc>
</category>
<category id=wriDom7>
<catDesc>
Informative -- arts
</catDesc>
</category>
<category id=wriDom8>
<catDesc>
Informative -- belief & thought
</catDesc>
</category>
<category id=wriDom9>
<catDesc>
Informative -- leisure
</catDesc>
</category>

The <catDesc> element defined by the outer <category> element here (that with identifier wriDom) is understood to apply also to each <catDesc> contained by each of its constituent (daughter) <category> elements. That is, the full description for category wriDom2 is ``Domain for written corpus texts: Informative -- natural & pure science''.
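
This inheritance of descriptions from an enclosing category can be sketched as a recursive walk over the category tree. The Python below is an illustration under assumptions: the `Category` class and the `full_descriptions` function are hypothetical, only two of the daughter categories are reproduced, and the colon used to join descriptions is a presentational choice.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    ident: str
    desc: str
    children: list["Category"] = field(default_factory=list)

def full_descriptions(cat, inherited=()):
    """Yield (identifier, full description) pairs, prefixing each description
    with the descriptions of all enclosing categories."""
    path = (*inherited, cat.desc)
    yield cat.ident, ": ".join(path)
    for child in cat.children:
        yield from full_descriptions(child, path)

wri_dom = Category("wriDom", "Domain for written corpus texts", [
    Category("wriDom1", "Imaginative"),
    Category("wriDom4", "Informative -- social science"),
])
table = dict(full_descriptions(wri_dom))
print(table["wriDom4"])
```

Here the entry for wriDom4 combines the description of the outer wriDom category with its own.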

The category descriptions applicable to a given text are specified by the <catRef> element within its header, as described above. Its target attribute lists the identifiers of all <category> elements applicable to that text. Thus, the header of a written text assigned to the social science domain which has a corporate author will include a <catRef> element like the following:

<catRef target='... wriaty1 wridom4...'>

The dots above represent the identifiers of all other category codes applicable to this text.

A full list of all category codes, and the numbers of texts so classified in the current release of the corpus, is provided in section 6.9.

5.3 The profile description (`<profDesc>`)

The third component of the CDIF header is the profile description, which is represented by the <profDesc> element. These are the components of the profile description:

<creation> contains information about the creation of a text.
<langUsg> describes the languages, sublanguages, registers, dialects etc. represented within a text.
<partics> describes the identifiable speakers in a linguistic interaction together with their relationships, where known.
<settDesc> describes the setting or settings within which a language interaction takes place, as a series of <setting> elements.
<txtClass> groups information which describes the nature or topic of a text in terms of a standard classification scheme, thesaurus, etc.

5.3.1 The creation element

This element is provided to record the date of first publication of individual published texts, and any details concerning the origination of any spoken or written texts, whether or not covered elsewhere. It is supplied in every text header, but its content is not used in the current release: it either refers to the information recorded in the <biblStr> element, or contains an empty string or a statement that creation information is not available. It does, however, carry a date attribute which specifies the date the text was created:

<creation date="1991-07/1993-02">
See <biblStr> for publication details.</creation>

5.3.2 The `<langUsg>` element

Unlike the other elements of the profile description, the language usage element occurs only once, and definitively, in the corpus header. It contains the following text:

<langUsg>
The language of the British National Corpus is modern
British English.  Words, fragments, and passages from many
other languages, both ancient and modern, occur within the
corpus where these may be represented using a Latin
alphabet.  Long passages in these languages, and material
in other languages, are generally silently deleted.  In no
case is the lang attribute used to indicate the language
of a word, phrase or passage, nor are alternate writing
system definitions used.
</langUsg>

5.3.3 The participant description (`<partics>`)

This element appears both within the corpus header, to define the generic ``unknown participant'', and also within individual spoken text headers to define the participants specific to those texts.

It contains a series of <person> elements describing the participants whose speech is transcribed in this text, followed by an optional group of <relation> elements describing any relationships or links amongst them.

The <person> element has the following description and attributes:

<person> describes a single participant in a language interaction. Attributes include:
- role specifies the role of this participant in the corpus. Legal values are:
  - resp person is a recruited respondent
  - other person is not a recruited respondent
- sex specifies the sex of the participant. Legal values are:
  - m male
  - f female
  - u unknown or inapplicable
- age specifies the age group to which the participant belongs. Legal values are:
  - 0 Under 15 years
  - 1 15 to 24 years
  - 2 25 to 34 years
  - 3 35 to 44 years
  - 4 45 to 59 years
  - 5 Over 59 years
  - X Unknown
- flang specifies the first language or mother tongue of the participant.
- dialect specifies the dialect spoken by the participant.
- soc specifies the social class of the participant. Legal values are:
  - AB AB (top or middle management, administrative or professional)
  - C1 C1 (junior management, supervisory or clerical)
  - C2 C2 (skilled manual)
  - DE DE (semi-skilled or unskilled)
  - UU Class unknown
- educ specifies the age at which the participant ceased full-time education. Legal values are:
  - 0 Still in education
  - 1 Left school aged 14 or under
  - 2 Left school aged 15 or 16
  - 3 Left school aged 17 or 18
  - 4 Education continued until age 19 or over
  - X Information not available
- resp specifies the identifier of the respondent in whose data this participant's interactions are recorded.

The global id attribute is required for each participant whose speech is included in a text, and its value is unique within the corpus. Although a given individual will always have the same identifier within a single text, there is no way of identifying the same individual appearing in different texts. For this reason, all demographically sampled conversations collected by a single respondent are treated together as a single text.

The value for the flang attribute consists of a two-letter language code taken from ISO 639 (normally EN for English), optionally suffixed by a three-letter country code taken from ISO 3166. Thus ``EN-GBR'' is English as spoken in the United Kingdom; ``EN-CAN'' is English as spoken in Canada, and ``FR-FRA'' is French as spoken in France.
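The format described above can be checked mechanically. The following Python sketch is illustrative only and is not part of the corpus software: the function name `parse_flang` and the regular expression are assumptions, and the sketch validates only the shape of a code, not whether it is actually registered in ISO 639 or ISO 3166.

```python
import re

# Assumed shape: a two-letter language code, optionally followed by a hyphen
# and a three-letter country code (for example "EN" or "EN-GBR").
FLANG_PATTERN = re.compile(r"^([A-Za-z]{2})(?:-([A-Za-z]{3}))?$")


def parse_flang(value):
    """Split a flang value into (language, country); country is None if absent."""
    match = FLANG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"not a well-formed flang value: {value!r}")
    language, country = match.groups()
    return (language.upper(), country.upper() if country else None)
```

For instance, `parse_flang("EN-GBR")` separates the language code `EN` from the country code `GBR`, while `parse_flang("EN")` reports that no country code was supplied.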

The value for the dialect attribute is also a three-letter code taken from a local extension to ISO 3166. A full list of codes used and their meanings is given in section 6.6 .

In addition to the encoded information specified by the attributes listed above, the following details may be supplied within the <person> element itself in some cases:

- age specified more exactly than by the age attribute, which groups respondents into age bands.
- BMRB code code used for processing by the British Market Research Bureau in selecting demographic participants.
- name a forename used to identify the person.
- occupation short characterization of the person's occupation.
- notes any other information available about the person.

Here are some sample <person> elements:

<person id=PS2N3 n=W0001 sex=M soc=UU age=X educ=X>
Name: Douglas
Occupation: radio presenter</person>

<person id=PS0RA n=W0021 role=OTHER sex=F 
   soc=UU resp=PS0PN age=0 dialect=XMW flang=EN-GBR educ=X>
Age:          8
Name:         Emily
Occupation:   student
</person>
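As a rough illustration of how the free-text details relate to the coded attributes, the Python sketch below reads ``Label: value'' lines from the content of a <person> element and derives the age band from an exact age. The function names are hypothetical, and the assumption that every detail occupies one line with a colon after its label is drawn only from the samples shown here.

```python
# Upper bound (inclusive) of each age band, following the legal values of
# the age attribute; anything above the last bound falls into band 5.
AGE_BANDS = [(14, "0"), (24, "1"), (34, "2"), (44, "3"), (59, "4")]


def age_band(age):
    """Return the age attribute code for an exact age in years."""
    for upper, code in AGE_BANDS:
        if age <= upper:
            return code
    return "5"


def parse_person_details(content):
    """Collect 'Label: value' lines from <person> content into a dictionary."""
    details = {}
    for line in content.splitlines():
        label, colon, value = line.partition(":")
        if colon and label.strip():
            details[label.strip()] = value.strip()
    return details
```

Applied to the second sample, the exact age 8 falls in band 0, which agrees with the age=0 attribute on that element.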

Relationships between participants, where known, are represented using the <relation> element which has the following description and attributes:

<relation> describes any kind of relationship or linkage amongst a specified group of participants. Attributes include:
- active identifies the ``active'' participants in a directed relationship, or all the participants in a mutual one.
- desc supplies a name for the relationship, seen from the point of view of the active participant in a directed relationship.
- mutual indicates whether the relationship holds equally amongst all participants. Legal values are:
  - Y the relationship is mutual
  - N the relationship is directed
- passive identifies the ``passive'' participants in a directed relationship.

A list of the different types of relationship identified amongst participants is given in section 6.7 .

Following the TEI Guidelines, we distinguish between mutual relationships, in which all participants are on an equal footing, and directed relationships, in which the roles of the participants are typically described differently. The roles applicable to a directed relationship are arbitrarily classed here as either active or passive. For example, the relationships ``colleague'' or ``spouse'' would be classed as mutual, while ``employee'' or ``wife'' would be classed as directed. A relationship such as ``sister'' may or may not be directed, depending on whether it obtains between two women or between a man and a woman.

For a mutual relationship, only the active attribute will be supplied; for a directed one, both active and passive attributes will be supplied. In either case, these attributes take as value a list of the identifiers of the <person> elements understood to be involved in the relationship concerned.

The following example shows the participant information recorded in the header for a text (KSU) comprising conversations between four participants: Michael and Steve (who are brothers), their mother Christine and their aunt Leslie.

    <partics>
      <person age=0 educ=0 flang=EN-GBR id=PS6RM n=W0001 role=other sex=m
        soc=AB>
        Age:          13
        BNC name:     Michael2
        Name:         Michael
        Occupation:   pupil
      </person>
      <person age=4 educ=X flang=EN-GBR id=PS6RN n=W0002 resp=PS6RM role=other
        sex=f soc=DE>
        Age:          45
        Name:         Christine
        Occupation:   credit controller
      </person>
      <person age=4 educ=X flang=EN-GBR id=PS6RP n=W0003 resp=PS6RM role=other
        sex=f soc=DE>
        Age:          45
        Name:         Leslie
        Occupation:   unemployed
      </person>
      <person age=1 educ=X flang=EN-GBR id=PS6RR n=W0004 resp=PS6RM role=other
        sex=m soc=UU>
        Age:          21
        Name:         Steve
        Occupation:   unemployed
      </person>
      <relation active='PS6RM PS6RR' desc=brother mutual=Y>
      <relation active=PS6RP desc=aunt mutual=N passive=PS6RM>
      <relation active=PS6RN desc=mother mutual=N passive='PS6RM PS6RR'>
      <relation active=PS6RM desc=nephew mutual=N passive=PS6RP>
      <relation active=PS6RM desc=son mutual=N passive=PS6RN>
    </partics>

The relationship ``brother'' between Michael and Steve is mutual, and therefore not directed; their identifiers (PS6RM and PS6RR) are consequently both supplied on the active attribute. The relationship ``mother'' between Christine and Michael and Steve is not mutual: she (PS6RN) is at the active end of the relationship and they are both at the passive end. In the ``son'' relationship, Michael is the active participant, and Christine the passive one. Note however that not all possible relationships are expressed.
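One way to read these attributes programmatically is to expand each <relation> into explicit (subject, object) pairs. The Python sketch below is an interpretation rather than a documented algorithm: the function name `relation_pairs` is hypothetical, and treating a mutual relationship as holding in both directions between every pair of listed participants is an assumption based on the description above.

```python
from itertools import permutations


def relation_pairs(active, passive=None, mutual="N"):
    """Return the (subject, object) pairs implied by one <relation> element.

    active and passive are the attribute values as written in the header,
    that is, whitespace-separated lists of <person> identifiers.
    """
    active_ids = active.split()
    if mutual.upper() == "Y":
        # Mutual: every listed participant stands in the relation to every other.
        return list(permutations(active_ids, 2))
    # Directed: each active participant is related to each passive participant.
    passive_ids = (passive or "").split()
    return [(a, p) for a in active_ids for p in passive_ids]
```

For the ``mother'' relation in the example, this yields one pair from PS6RN to PS6RM and one from PS6RN to PS6RR; for the mutual ``brother'' relation it yields a pair in each direction between PS6RM and PS6RR.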

5.3.4 The setting description (`<settDesc>`)

This element appears once in the header of each spoken text, and contains one or more <setting> elements documenting the context within which a spoken text takes place.

<setting> describes one particular setting in which a language interaction takes place. Attributes include:
- audSize specifies the size of the audience present at this setting.
- county specifies the name of the British administrative county in which this setting is located.
- spont indicates the degree of spontaneity assigned to interactions in this setting. Legal values are:
  - H high spontaneity
  - L low spontaneity
  - M medium spontaneity
  - U spontaneity unknown
- who supplies the identifiers of the participants at this setting.

The content of each <setting> element supplies additional details about the place, time of day, and other activities going on, using the following additional elements:

<locName> contains the name of a city, town, or village.
<locale> contains a brief informal description of the nature of a place, for example a room, a restaurant, a park bench etc.
<activity> contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything.

Some typical examples follow:

<setting county=Essex spont=M who='PS000 DCJPS000 DCJPS001'> 
 <locName>Harlow </locName> 
 <locale>Harlow College </locale> 
 <activity>A level lecture conversation </activity> 
</setting>

<setting county=Lancashire spont=H who='PS03W PS03Y'> 
<locName>Morecambe </locName> 
<locale>nightclub </locale> 
<activity>at work conversation </activity> 
</setting>

5.3.5 The text classification (`<txtClass>`)

This element appears once in the header of each text. It contains Dewey classification codes, references to the bnc classification scheme described in section 5.2.4 , and descriptive keywords which together describe the text concerned. The following elements are used for these three purposes:

<catRef> specifies one or more defined categories within some taxonomy or text typology. Attributes include:
- target identifies the categories concerned.
<keywords> contains a list of keywords or phrases identifying the topic or nature of a text, each of which is tagged as a term.
<term> contains a technical term or phrase, particularly in a list of descriptive keywords.
<ddcRef> contains the classification code used for this text in the standard Dewey Decimal classification system.

The last of these always references a classification from the Dewey Decimal Classification scheme. This information is not available in the current version of the corpus.

The target attribute supplies a list of all identifiable classification or editorial codes applying to the text, as discussed above.

The terms specified by the <keywords> element are not taken from any particular descriptive thesaurus; the words or phrases used are those which seemed useful to the data preparation agency concerned. It is hoped to standardize the terminology in a later version of the corpus. Each term is marked as a distinct <term> element, as in the following example:

    <txtClass>
      <catRef target='allAva2 wriATy2 wriAud3 wriDom9 wriLev2 wriMed2 wriPP922
        wriSta2 wriTAS3 wriTim2'>
      <keywords>
        <term>
          horse riding
        </term>
        <term>
          care of horses
        </term>
      </keywords>
    </txtClass>

5.4 The revision description (`<revDesc>`)

The revision description, encoded by the <revDesc> element, is the fourth and final element in the header. It is used to record details of any significant change to the corpus and has the following components:

<change> summarizes a particular change or correction made to a particular version of an electronic text which is shared between several researchers.

Unlike its counterpart in the TEI scheme, the CDIF <change> element must contain a date and a <respStmt> element specifying the nature of the change, as in the following example:

 <change n=1.1>
   <date>10-jan-1994</date>
   <respStmt><resp>made header</resp><name>DD</name></respStmt>
 </change>

When any significant change is made to any component of the corpus, the following steps should be taken:

- a <change> element is added to the <revDesc> of the text affected
- the update attribute of the text header is changed to the date of the change
- the value of the status attribute of the text header is set to ``update''
- the revision number specified on the n attribute of the <edStmt> of the corpus header is incremented
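The four steps listed above can be sketched in Python as follows. This is a minimal illustration, not a description of any tool used to maintain the corpus: the headers are modelled as plain dictionaries, and the function name and the keys `revDesc`, `update`, `status` and `edStmt_n` are hypothetical stand-ins for the corresponding elements and attributes. Because the text does not say how revision numbers are incremented, the caller supplies the new value.

```python
def record_change(text_header, corpus_header, date, resp, name, new_revision):
    """Apply the four update steps to dictionary stand-ins for the headers."""
    # Step 1: add a <change> entry to the revision description of the text.
    text_header.setdefault("revDesc", []).append(
        {"date": date, "resp": resp, "name": name}
    )
    # Step 2: set the update attribute to the date of the change.
    text_header["update"] = date
    # Step 3: mark the text header as updated.
    text_header["status"] = "update"
    # Step 4: store the incremented revision number on the corpus header;
    # the numbering scheme is not stated, so it is passed in by the caller.
    corpus_header["edStmt_n"] = new_revision
    return text_header, corpus_header
```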

5.5 Use of `decls` attribute

The decls attribute may be specified for any element defined as a member of the decling (declaring) class: specifically, the elements <text> or <stext>, and the larger division elements (<div>, <div1>, or <div2>).

It is used for two distinct but related purposes:

- to supply a specific title for parts of composite works
- to specify encoding or other declarations applicable to all or part of a text where a number of possibilities have been provided for in the header.

Its value is a list of identifiers, each of which has been supplied elsewhere in a text or corpus header as the identifier for some element which is a member of the declabl (declarable) class: specifically, the <biblStr> element, the <editDecl> element and its constituents (<corr>, <hyph>, <quot>, <segm> and <trans>), and the following other header elements: <recStmt>, <sampDecl>, <setting> and <txtClass>.

For these elements, the corpus header will normally contain several mutually incompatible options, for example, several editorial declarations. Individual texts, or portions of texts, specify explicitly which of the available options applies to them by using the decls attribute. In cases (for example <setting>), where the set of declarable elements applies only within portions of a single text, they will be specified in the text header rather than the corpus header, but the same principle applies.

Declarable elements, once specified, are inherited by all sub-components. That is, if the decls attribute of a <text> specifies a particular value for some declarable element, that value is understood to apply to all components of the text, unless over-ridden. If the decls attribute of a <div1> within that text specifies a different value, the new value applies to the contents of that <div1> only; the value specified by the <text> applies to all subsequent <div1> elements in the same text, unless they also specify a different decls value.
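The inheritance rule can be illustrated with a short Python sketch. It rests on assumptions that go beyond the text: the function name `effective_decls` and all identifiers in the usage note are hypothetical, and the sketch assumes that a new value replaces an inherited one only when both identify header elements of the same kind (for example, two alternative editorial declarations).

```python
def effective_decls(path, declared_kind):
    """Work out which declarations are in force for an element.

    path lists the decls attribute values (a string, or None where the
    attribute is absent) of the element's ancestors, outermost first,
    ending with the element itself. declared_kind maps each identifier
    to the kind of header element it identifies.
    """
    in_force = {}
    for decls in path:
        for identifier in (decls or "").split():
            # A value set lower in the hierarchy replaces an inherited value
            # of the same kind; other inherited values remain in force.
            in_force[declared_kind[identifier]] = identifier
    return in_force
```

Suppose, hypothetically, that a <text> specifies decls='ed1 samp1' and that one of its <div1> elements specifies decls='ed2', where ed1 and ed2 identify editorial declarations and samp1 a sampling declaration. Within that <div1>, ed2 and samp1 are in force; in a sibling <div1> with no decls attribute, ed1 and samp1 apply.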

For non-declarable elements, the header of an individual text will specify only those respects (if any) in which it differs from the defaults stated in the corpus header.

Note that this is a simplification of the decls mechanism described in the TEI Guidelines.
