4. Descriptive tagging

4.1. Written texts

4.1.1. Structural organization

Written texts exhibit a bewildering variety and richness of structural forms. Some have very little organization at levels higher than the paragraph; others may have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles and so forth. The following elements are used to represent all such textual divisions:

<div1>: major subdivision of a written text, e.g. chapter.
<div2>: further subdivision of a written text, entirely contained within a <div1>, e.g. section.
<div3>: further subdivision of a written text, entirely contained within a <div2>, e.g. subsection.
<div4>: smallest possible subdivision of a written text, entirely contained within a <div3>, e.g. sub-subsection.

Most written texts, of whatever kind, are hierarchically subdivided using these elements. Structural subdivisions smaller than level 4 (but above paragraph level) are all tagged <div4>. In all texts, structural subdivisions at the highest level (<div1>) are always identified; lower levels of subdivision (i.e. <div2>, <div3> or <div4>) may also be supplied where appropriate, but are not required.

These elements all carry a type attribute, which is used to categorize the division in some respect, in addition to the global attributes id, n, and r. In most cases, no specific categorization is applied and the attribute has the value u. For <div1> and <div2> elements, however, more specific values are used: <div1> elements may have a type of article or chapter, while <div2> elements may have a type of section or chapter.

The n attribute is sometimes used to supply an identifying name or number used within the text for a given division, for example, a chapter number, as in the following example:

<div1 n="4" type="chapter">
  <p><s n="525">Delaney made his way through the crew door 
  into the cavernous section of the C130 that gave it its 
  friendly nick-name of Fat Albert.</s>
<!-- BPA -->

More often, however, chapter names or numbers will appear within the text, tagged using the <head> element discussed in section 4.1.2.1. Headings and captions below.

A sequence of paragraph-level elements of arbitrary length may precede the first structural subdivision at any level. A text may have no structural divisions within it at all. Note that any prefatory or appended matter not forming part of a text will not generally be captured: the TEI elements <front> and <back> are not used.
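
As a rough illustration of how this division hierarchy might be consumed, the following Python sketch walks the <div1> to <div4> elements of a fragment and reports each division's depth, type and n. The sample fragment and the function name list_divisions are invented for illustration and are not taken from the corpus; only the standard library xml.etree.ElementTree module is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration (not corpus data).
SAMPLE = """
<text>
  <div1 n="1" type="chapter">
    <p><s n="1">First paragraph.</s></p>
    <div2 n="1.1" type="section">
      <p><s n="2">Nested paragraph.</s></p>
    </div2>
  </div1>
  <div1 type="u">
    <p><s n="3">Unnumbered division.</s></p>
  </div1>
</text>
"""

# Map each division tag to its nesting level.
DIV_TAGS = {"div1": 1, "div2": 2, "div3": 3, "div4": 4}


def list_divisions(root):
    """Return (depth, type, n) for every division, in document order."""
    found = []
    for elem in root.iter():
        if elem.tag in DIV_TAGS:
            # Assumption: a missing type is treated like the value "u".
            found.append((DIV_TAGS[elem.tag], elem.get("type", "u"), elem.get("n")))
    return found


divisions = list_divisions(ET.fromstring(SAMPLE))
for depth, div_type, n in divisions:
    print("  " * (depth - 1) + f"div{depth} type={div_type} n={n}")
```

Divisions without an n attribute are reported with n=None, which reflects the point above that n is only sometimes supplied.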

4.1.2. Paragraph-level elements and chunks

Written texts may be organized into structural units containing more than one <s> element and smaller than any of the divisions discussed in section 4.1.1. Structural organization above. The most commonly found such element is the <p> (paragraph), but there are several others. Their common identifying feature is that they may appear directly within divisions (that is, directly within <div1>, <div2> etc., or within <text> elements, not nested within some other element such as a paragraph).

An alphabetically ordered list of these elements follows:

<bibl>

a loosely structured bibliographic citation appearing within a corpus text (see 4.1.2.5. Notes and citations).

<caption>

(1) a heading, title etc. attached to a picture or diagram, usually with deictic content; (2) a ‘pull quote’ or other text about or extracted from a text and superimposed upon it to draw attention to it (see 4.1.2.1. Headings and captions). Attributes include:

type

categorizes the caption. Legal values are:

BYLINE: caption containing authorship or provenance of an article in a newspaper or periodical
DISPLAY: extra-textual caption such as a pull quote or displayed box
unspec: not specified or unknown

<head>

a title or heading prefixed to some division of a written text or to a poem (see 4.1.2.1. Headings and captions). Attributes include:

type

characterises the heading in some respect. Legal values are:

BYLINE: heading containing authorship or provenance of an article in a periodical
MAIN: a main heading (only one allowed per div)
SUB: a secondary heading (may be zero or more per div)

<list>

a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit (see 4.1.2.4. Lists).

<note>

any form of note, additional comment or gloss within a written or spoken text (see 4.1.2.5. Notes and citations).

<p>

a paragraph in a written text.

<poem>

a poem, or an extract from one, embedded or quoted within a spoken or written text (see 4.1.2.3. Poems).

<quote>

a quotation from some author other than that of the surrounding text, usually either embedded or displayed (see 4.1.2.2. Quotations).

Examples for each of these (except <p>) are discussed in more detail in the following subsections.

4.1.2.1. Headings and captions

Headings and captions serve a variety of functions in written texts. The BNC scheme distinguishes between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines etc.) and <caption> elements which are logically independent of the position they may have within a textual division (for example, captions attached to pictures or figures, ‘pull-quotes’ embedded within the text, ‘by-lines’ identifying authorship and provenance of a newspaper or periodical article).

One or more <head> elements may appear in sequence at the start of any <div1>, <div2>, <div3> or <div4> element, or at the start of a <list> or <poem>, as in the following example:

<div3 n="6.4" type="u">
   <head type="MAIN"><s n="368">Monte Carlo simulation
   methodology</s></head>
   <div4 n="6.4.1" type="u">
      <head type="MAIN"><s n="369">Some preliminaries</s></head>
      <p><s n="370">Although the methodology is of general 
      applicability, it has been developed here using ARC/INFO 
      running under the VMS operating system on a microVAX 2.</s>
<!-- B1G -->

In the following example, a <caption> element is used to mark the presence of a "scare quote" in the middle of a newspaper story:

<p><s n="9">Tanning beds were introduced to reduce the
risk of burning. </s></p>
<caption id="K36CA00K" type="unspec"><s n="10">Harmful</s></caption>
<p><s n="11">Manufacturers say they are safer because they use UVA 
rays and block the harmful UVB burning rays.</s></p>
<!-- K36 -->

Where captions would interrupt the normal flow, pointers are used as discussed in section 3.7. Pointers.

4.1.2.2. Quotations

A quotation is an extract from some work other than the text itself which is embedded within it, for example as an epigraph or illustration. Quotations in the corpus may be marked up using the <quote> element. Any reference for the citation will usually be contained within it, tagged using the <bibl> element, as in this example:

<quote>
<p><s n="2">Now hatred is by far the longest pleasure; 
Men love in haste but they detest at leisure...</s>
<bibl><s n="3">Lord Byron</s></bibl></p>
</quote>
<!-- G01 -->

4.1.2.3. Poems

Quoted poems or fragments of verse or song may appear both within and between paragraphs. The <l> (line) element is used to mark each group of consecutive metrical lines, and any titles or headings present are marked with <head> elements. Each such group of lines is marked as a <poem> element, with no indication of its completeness.

No provision is made for marking units of verse such as stanzas, verse paragraphs etc. A part attribute is defined for the <l> which allows incomplete lines to be indicated, but in the current version of the corpus this always takes the value N.

The <poem> element may be thought of as a specialized kind of <quote> element, since none of the works sampled for BNC-baby is primarily poetic. For example:

<p>
<s n="472">Thus when Burns wrote:</s>
<poem><l part="N">
<s n="473">The rank is but the guinea stamp, The man's the gold for a' that</s>
</l></poem></p>
<!-- ECV -->

Note that the <l> element is not used to mark typographic lineation.

4.1.2.4. Lists

A list is a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit. Lists may appear within or between paragraphs. Where marked, lists are tagged with the <list> element.

A <list> element consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be preceded by a <label> element, used to hold the identifier or tag sometimes attached to a list item, for example ‘(a)’, or a word or phrase used for similar purposes.

The <item> element may appear only inside lists. It contains the same mixture of elements as a paragraph, and may thus contain one or more nested lists. It may also contain a series of paragraphs, each marked with a <p> element.

Here is an example of a simple list:

<s n="1716">The personnel file might be 
examined to list all the employees who 
meet the following criteria:</s>
<list type="simple">
<item><s n="1717">Speaks Japanese AND</s>
</item><item><s n="1718">Graduate Engineer AND</s>
</item><item><s n="1719">Single</s>
</item></list>
<!-- FPG -->

Here is an example of a labelled list:

<p><s n="24">This expression is derived assuming</s>
<list type="labelled">
<label><s n="25">(a)</s></label>
<item><s n="26">the volume change on mixing 
   <gap desc="formula"/>,</s></item>
<label><s n="27">(b)</s></label>
<item><s n="28">the molecules are all 
   of equal size,</s></item>
<label><s n="29">(c)</s></label>
<item><s n="30">all possible arrangements have 
   the same energy, <pb n="159"/>
   <gap desc="formula"/>, and</s></item>
<label><s n="31">(d)</s></label>
<item><s n="32">the motion of the components 
  about their equilibrium positions remains 
  unchanged on mixing</s></item>
</list>
<!-- HRG -->
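
The label and item structure described above could be read back as pairs along the following lines. This is a minimal Python sketch: the sample list is a simplified, invented fragment, and pair_labels is a hypothetical helper name, not part of any BNC tool.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical labelled list (not corpus data).
SAMPLE = """
<list type="labelled">
  <label><s n="1">(a)</s></label>
  <item><s n="2">first condition,</s></item>
  <label><s n="3">(b)</s></label>
  <item><s n="4">second condition</s></item>
  <item><s n="5">an item with no label</s></item>
</list>
"""


def pair_labels(list_elem):
    """Pair each <item> with the <label> immediately preceding it, if any."""
    pairs = []
    pending = None  # label text waiting for its item
    for child in list_elem:
        if child.tag == "label":
            pending = "".join(child.itertext()).strip()
        elif child.tag == "item":
            pairs.append((pending, "".join(child.itertext()).strip()))
            pending = None  # a label applies to one item only
        # Any other child (e.g. an optional <head>) is skipped.
    return pairs


pairs = pair_labels(ET.fromstring(SAMPLE))
for label, item in pairs:
    print(label or "-", item)
```

Because labels are optional, an item that has no preceding <label> is paired with None rather than inheriting the previous label.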

4.1.2.5. Notes and citations

Annotations occurring in written texts, and bibliographic citations or references, have been marked up in some texts, using the <note> element. This element has the following additional attributes:

type

identifies the provenance of the note, i.e. editorial or authorial. Legal values are:

ED: note supplied by BNC transcriber or encoder: all notes in BNC-baby are of this type

resp

code for the person or organization responsible for BNC-supplied note. Legal values are:

OUCS: Note supplied by OUCS staff
OUP: Note supplied by OUP transcribers

place

specifies the location of an original note in the source text. Legal values are:

foot: foot of page
end: end of current division or text
side: left or right margin
unspecified: unknown or unspecified: this is the only value used in BNC-baby.

In BNC-baby the <note> element is used for the following purposes:

for general transcriber comments such as ‘following page blank’
to specify material which has been omitted from the transcription (such material may also be encoded using the <gap> element)
to record bylines and minor headings which do not appear at the start of a <div1>-type element

Here is an example of a typical transcriber's note:

<note resp="OUP" type="ED">
<s n="1840">Page 172 Blank</s></note>
<!-- CB5 -->

The following example shows a note encoding an omission at the end of one division, and another encoding a byline at the start of the next:

<s n="800">Maintenance crews have been working 18-hour 
shifts to repair the damage.</s></p>
<note><s n="801">Racing facts-list omitted</s>
</note>
</div1>
<div1 type="u">
<head><s n="802">Twiceover opening</s></head>
<note><s n="803">by Micky Twiceover</s></note>
<p><s n="804">GREY November or no, Festival fever is here 
again, with its prospects for of makin' whoopee and 
temperatures rising as they always do with the 
actual Festival opening.</s>
<!-- CBM -->

In a few cases, annotations which take the form of bibliographic citations or references have been encoded using the <bibl> element, often in conjunction with a <quote> element, as in the following example:

<quote><p><s n="3308">`Vengeance is mine; I will repay, saith the Lord'</s>
<bibl><s n="3309"><hi rend="it">Romans </hi>12:19</s></bibl></p>     
</quote>
<!-- G01 -->
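
One possible way of separating a quotation from its reference is sketched below in Python, reusing the G01 example just shown. The function name quotations is hypothetical, and the sketch assumes that any <bibl> of interest sits inside the <quote> it documents, as in this example.

```python
import xml.etree.ElementTree as ET

# The G01 example from the text above.
SAMPLE = """
<quote><p><s n="3308">`Vengeance is mine; I will repay, saith the Lord'</s>
<bibl><s n="3309"><hi rend="it">Romans </hi>12:19</s></bibl></p>
</quote>
"""


def quotations(root):
    """Return (quoted text, reference) pairs; reference is None if no <bibl>."""
    out = []
    for quote in root.iter("quote"):
        bibl = quote.find(".//bibl")
        source = "".join(bibl.itertext()).strip() if bibl is not None else None
        # Sentences inside the <bibl> belong to the reference, not the quotation.
        bibl_sentences = set(bibl.iter("s")) if bibl is not None else set()
        parts = [
            " ".join("".join(s.itertext()).split())
            for s in quote.iter("s")
            if s not in bibl_sentences
        ]
        out.append((" ".join(parts), source))
    return out


quotes = quotations(ET.fromstring(SAMPLE))
for text, source in quotes:
    print(text, "|", source)
```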

4.1.3. Phrase-level elements

Phrase-level elements are elements which cannot appear directly within a textual division, but must be contained by some other element. In practice, this means they will be contained within an <s> element.

4.1.3.1. Highlighted phrases

Typographic highlighting in the original may not be marked in the transcript at all. Alternatively, highlighted phrases, and the kind of highlighting used, may be recorded in one of two ways:

using the global rend (rendition) attribute
using the <hi> (highlighted) element

The former is used where the function of the highlighting is clear, for example to mark a heading, and where the boundaries of the highlighted phrase therefore coincide with the boundaries of some other XML element. The latter is used where the function is not clear, where the DTD does not provide a tag to identify the feature concerned or where the highlighted phrase is not coterminous with some other element.

When the <hi> element is used, its rend attribute must be supplied. On all other elements, the rend attribute is optional. Its value indicates the nature of the highlighting used, e.g. italic font, quoted, small caps etc. In BNC-baby only the following values are used:

bo: bold face
hi: superscript
it: italic font
lo: subscript
ul: underlined

The <hi> element is frequently used for text that is italicized in the original source because it is a title, a foreign word or phrase, a technical term, etc. The current version of the markup does not however distinguish amongst these functions, as shown in the following examples:

<s n="36">He was always searching, rummaging: 
<hi rend="it">Les vrais paradis sont les paradis 
qu'on a perdus </hi>.</s>
<!-- FAJ -->

<s n="295">"Oh, so there will be 
<hi rend="it">two </hi>new volumes, 
will there?"</s>
<!-- H9D -->

<p><s n="167">Wivenhoe gemmologist Stephanie 
Coward, <hi rend="it">above </hi>, is showing 
her work at the Pam Schomberg Gallery, St 
John's Street, Colchester.</s>
<!-- CFC -->

It should be noted that the purpose of the rend attribute is not to provide information adequate to the needs of a typesetter, but simply to record some qualitative information about the original. In particular, the present version of the corpus includes no indication of size of type or style of writing.

Like all other phrase-level elements, each <hi> element must be entirely contained by an <s> element. This implies that where a highlighted passage contains more than one sentence, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next.
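
The rend codes listed above can be expanded when highlighted phrases are collected, for instance as in the following Python sketch. The sample sentences are invented, and the names REND and highlighted are hypothetical; the table simply restates the five values given in this section.

```python
import xml.etree.ElementTree as ET

# The five rend values listed in this section.
REND = {
    "bo": "bold face",
    "hi": "superscript",
    "it": "italic font",
    "lo": "subscript",
    "ul": "underlined",
}

# Hypothetical sentences, invented for illustration (not corpus data).
SAMPLE = """
<p><s n="1">She read <hi rend="it">Middlemarch </hi>twice.</s>
<s n="2">The word <hi rend="bo">never </hi>was printed in bold.</s></p>
"""


def highlighted(root):
    """Return (sentence n, highlighting description, phrase) for each <hi>."""
    out = []
    for s in root.iter("s"):
        for hi in s.iter("hi"):
            description = REND.get(hi.get("rend"), "unknown")
            out.append((s.get("n"), description, "".join(hi.itertext()).strip()))
    return out


phrases = highlighted(ET.fromstring(SAMPLE))
for n, description, phrase in phrases:
    print(n, description, phrase)
```

Since each <hi> is confined to a single <s>, a highlighted passage spanning several sentences will appear as several entries, one per sentence.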

4.1.3.2. Page breaks

For convenience of reference, the original pagination of the sources from which written texts were taken has been preserved as far as possible. This is done using the following element, which can appear anywhere within <s> elements in written texts:

<pb>: marks the start of a new page in the original source; used to indicate where e.g. articles in periodicals are split across several pages.

In this example, the presence of a page break in the middle of a sentence is indicated by the <pb> element:

<s n="1152">Throughout the interaction region 
subsequent to the collision they have obtained 
a complete set <pb n="142"/>of bounded normal 
modes that are expressed in terms of spin-weighted 
spherical harmonics.</s>
<!-- B2K -->
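
Because <pb> marks the start of a new page, the page on which each sentence begins can be recovered by tracking the most recent <pb> in document order. The Python sketch below illustrates the idea on an invented fragment; start_pages is a hypothetical name, and the first_page argument is an assumption needed because the fragment does not begin with a <pb>.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration (not corpus data).
SAMPLE = """
<p>
  <s n="1">A sentence on the first page.</s>
  <s n="2">This one starts here <pb n="142"/>and ends overleaf.</s>
  <s n="3">This one is wholly on the new page.</s>
</p>
"""


def start_pages(root, first_page=None):
    """Map each sentence's n to the page on which the sentence starts."""
    pages = {}
    current = first_page  # page in force before any <pb> is seen
    for elem in root.iter():  # document order: an <s> is visited before its <pb>
        if elem.tag == "pb":
            current = elem.get("n")
        elif elem.tag == "s":
            pages[elem.get("n")] = current
    return pages


pages = start_pages(ET.fromstring(SAMPLE), first_page="141")
for n, page in pages.items():
    print(f"s {n} starts on page {page}")
```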

4.2. Spoken texts

4.2.1. Basic structure

Spoken texts are organized quite differently from written texts. In particular, a complex hierarchy of divisions and subdivisions is inappropriate. The following structural elements are used to represent the organization of spoken texts:

<stext>: an individual spoken text.
<div>: any subdivision or grouping of the utterances (etc.) making up a spoken text.

In the demographically sampled spoken texts, each distinct conversation recorded by a given respondent is treated as a distinct <div> element. All the conversations from a single respondent are grouped together to form a single <stext> element. Each <div> element within a demographically sampled spoken text consists of a sequence of <u> elements (see section 4.2.2. Utterances), interspersed with a variety of empty elements used to indicate para-linguistic phenomena noticed by the transcribers (see section 4.2.3. Paralinguistic phenomena).

To handle overlapping utterances, TEI recommends the use of a device known as an alignment map, discussed in section 4.2.4. Alignment of overlapping speech below. A single alignment map, represented by the <align> element, may be defined for a whole spoken text, or for each division of it: if overlap is present, the alignment map is given at the start of the division or text concerned.

Each utterance is further subdivided into <s> elements, and then into <w> and <c> elements, in the same way as for written texts.

The methods and principles applied in transcription and normalisation of speech are discussed in TGCW21 Spoken Corpus Transcription Guide and summarised in the appropriate part of the corpus header. The editorial tags discussed in section 3.6. Editorial indications above are also used to represent normalisation practice when dealing with transcribed speech.

4.2.2. Utterances

An utterance is a discrete sequence of speech produced by one participant, or group of participants, in a conversation; it is represented by the <u> element, which has the following additional attribute:

who: identifies the person or group responsible for the utterance.

The who attribute is mandatory: its function is to identify the person or group of people making the utterance, using the unique code defined for that person in the appropriate section of the header (see section 5.3.2. The <langUsage> element). A typical example follows:

<u who="PS0H7">
<s n="3778"><w type="ITJ">Mmm mm</w></s></u>
<!-- KCV -->

The code PS0H7 used here gives the value of the id attribute of some <person> element within the header of the text from which this example is taken (see further 5.3.3. The participant description). The code PS000 is used where the speaker cannot be identified and the code PS001 is used for a group of unidentified speakers. Where there are several distinct, but unidentified, speakers within a text, distinct identifiers are used. For example, if text XYZ contains two different but unidentified speakers, one of them will be given the identifier XYZSP001, and the other XYZSP002.
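
As an informal illustration of how the who attribute might be used, the following Python sketch counts the <s> elements attributed to each speaker code in a division. The fragment is invented (only the codes PS0H7, PS000 and PS001 come from the text above), and sentences_per_speaker is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical division, invented for illustration (not corpus data).
SAMPLE = """
<div>
  <u who="PS0H7"><s n="1">Hello.</s></u>
  <u who="PS000"><s n="2">Mm.</s></u>
  <u who="PS0H7"><s n="3">Right then.</s><s n="4">Off we go.</s></u>
  <u who="PS001"><s n="5">Yes!</s></u>
</div>
"""

# Codes with a reserved meaning, as described in this section.
UNIDENTIFIED = {"PS000": "unidentified speaker", "PS001": "unidentified group"}


def sentences_per_speaker(root):
    """Count <s> elements per value of the who attribute."""
    counts = Counter()
    for u in root.iter("u"):
        counts[u.get("who")] += len(u.findall("s"))
    return dict(counts)


counts = sentences_per_speaker(ET.fromstring(SAMPLE))
for who, total in counts.items():
    print(who, UNIDENTIFIED.get(who, "identified in header"), total)
```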

4.2.3. Paralinguistic phenomena

In transcribing spoken language, it is necessary to select from the possibly very large set of distinct paralinguistic phenomena which might be of interest. In the texts transcribed for the BNC, encoders were instructed to mark the following such phenomena:

voice quality: for example, whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance.
non-verbal but vocalised sounds: for example, coughs, humming noises etc.
non-verbal and non-vocal events: for example passing lorries, animal noises, and other matters considered worthy of note.
significant pauses: silence, within or between utterances, longer than was judged normal for the speaker or speakers.
unclear passages: whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons.
speech management phenomena: for example truncation, false starts, and correction.
overlap: points at which more than one speaker was active.

Other aspects of spoken texts are not explicitly recorded in the encoding, although their headers contain considerable amounts of situational and participant information.

The elements used to mark these phenomena are listed below in alphabetical order:

<event>

any non-verbal and non-vocal event (such as a door slamming) occurring during a conversation and regarded as worthy of note. Attributes include:

desc: description of the event.
dur: duration of the event in seconds.

<pause>

a marked pause during or between utterances. Attributes include:

dur: duration of the pause in seconds.

<shift>

a marked change in voice quality for any one speaker. Attributes include:

new: description of the voice quality after the shift.

<trunc>

a word or phrase which has been truncated during speech.

<unclear>

a point in a spoken text at which it is unclear what is happening, e.g. who is speaking or what is being said. Attributes include:

dur: the duration of the passage in seconds.

<vocal>

a non-linguistic but communicative sound made by one of the participants in a spoken text. Attributes include:

desc: the kind of sound made
dur: duration of the sound in seconds.

The value of the dur attribute is normally specified only if it is greater than 5 seconds, and its accuracy is only approximate.

With the exception of the <trunc> element, which is a special case of the editorial tags discussed in section 3.6. Editorial indications above, all of these elements are empty, and may appear anywhere within a transcription.

The following example shows how the presence of the <event> tag can sometimes help make sense of otherwise seemingly random bits of conversation:

<u who="PS1A9"><s n="775">What are you doing?</s>
<s n="776"><event desc="dog barks"/>My giddy aunt!</s>
<s n="777">Are you playing rugby this afternoon Kevin?</s>
</u>
<!-- KBC -->

The values used for the desc attribute of the <event> and <vocal> elements are free text strings.

As noted above, a distinction is made between discrete vocal events, such as laughter, and changes in voice quality, such as words which are spoken in a laughing tone. The former are encoded using the <vocal> element, as in the following example:

<u who="PS0Y5"><s n="49"><vocal desc="laugh"/>Right <unclear/>.</s>
</u>
<!-- KB5 -->

The <shift> element is used instead where the laughter indicates a change in voice quality, as in the following example:

<u who="PS02G"><s n="10649"><shift new="laughing"/>Good 
job I didn't have to read it <shift/>!</s></u>
<!-- KB7 -->

Here the passage between the tags <shift new="laughing"/> and <shift/> is spoken with a laughing intonation.

A list of values currently used for the new attribute is given below in section 6.2. Voice quality codes.
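
Since <shift> is an empty element, the stretch of speech it governs has to be reconstructed by reading the sentence in order and switching quality at each <shift>. The Python sketch below shows one way of doing so. It assumes, following the example above, that a <shift> without a new attribute marks a return to the speaker's normal voice; the sample sentence is invented, voice_spans is a hypothetical name, and only <shift> elements that are direct children of the <s> are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence modelled on the KB7 example (not corpus data).
SAMPLE = (
    '<s n="1">Well <shift new="laughing"/>good job I didn\'t '
    "have to read it <shift/>! Anyway.</s>"
)


def voice_spans(s_elem):
    """Split a sentence into (voice quality, text) spans; None means normal."""
    spans = []
    quality = None  # current voice quality

    def add(text):
        # Record non-empty text under the quality currently in force.
        if text and text.strip():
            spans.append((quality, text.strip()))

    add(s_elem.text)
    for child in s_elem:
        if child.tag == "shift":
            quality = child.get("new")  # absent attribute resets to None
        else:
            add("".join(child.itertext()))
        add(child.tail)  # text following the child element
    return spans


spans = voice_spans(ET.fromstring(SAMPLE))
for quality, text in spans:
    print(quality or "normal", "->", text)
```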

The <trunc> element is used to enclose fragmentary words caused by repair or hesitation, as in the following example:

<u who="PS03W"><s n="757">Then <trunc>Mar </trunc>this 
guy called Mark is there I think.</s>
<!-- KBD -->

4.2.4. Alignment of overlapping speech

By default it is assumed that the events represented in a transcription are non-overlapping and that they are transcribed in temporal sequence. That is, unless otherwise specified, it is implied that the end of one utterance precedes the start of the next following it in the text, perhaps with an interposed <pause> element. Where this is not the case, the following elements are used:

<align>

defines an alignment map used to synchronise points within a spoken text.

<loc>

a synchronisation point within an alignment map to which other elements may refer.

<ptr>

an empty tag pointing from one part of a text to some other element. Attributes include:

target: supplies the identifier of some other element in a text; for alignment, specifically, a <loc> element within an alignment.

For each point of synchrony, i.e. at each place where the number of simultaneous utterances, events, vocals etc. increases or decreases, a <loc> element is defined within an <align> element, which appears at the start of the enclosing <div>, if any. At each place to be synchronised within the text, a <ptr> element is inserted. The target attributes of these <ptr> elements are then used to specify the identifier of the <loc> with which each is to be synchronised.

The following example demonstrates how this mechanism is used to indicate that the second speaker (PS02G) speaks simultaneously with the first, starting from the words "and it's not far from town":

<u who="PS02H"><s n="126">Handy, really.</s>
<s n="127"><pause/><ptr target="KB7LC00S"/>And it's 
not far from town. <ptr target="KB7LC00T"/></s>
</u>
<u who="PS02G"><s n="128"><ptr target="KB7LC00S"/>Very handy.</s>
<s n="129">You can go in for a drink. <ptr target="KB7LC00T"/></s>
</u>
<!-- KB7 -->
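
The overlap in this example can be recovered mechanically by collecting, for each utterance, the text lying between one <ptr> and the next, and grouping the results by the pair of targets. The Python sketch below does this for the KB7 example; the enclosing <div> wrapper and the names events and overlaps are assumptions made for illustration. Note that the sketch matches <ptr> elements by their target values alone and does not consult the <align> map itself.

```python
import xml.etree.ElementTree as ET

# The KB7 example from the text above, wrapped in an assumed <div>.
SAMPLE = """
<div>
<u who="PS02H"><s n="126">Handy, really.</s>
<s n="127"><pause/><ptr target="KB7LC00S"/>And it's
not far from town. <ptr target="KB7LC00T"/></s>
</u>
<u who="PS02G"><s n="128"><ptr target="KB7LC00S"/>Very handy.</s>
<s n="129">You can go in for a drink. <ptr target="KB7LC00T"/></s>
</u>
</div>
"""


def events(elem):
    """Yield ("ptr", target) and ("text", string) events in document order."""
    if elem.tag == "ptr":
        yield ("ptr", elem.get("target"))
    if elem.text:
        yield ("text", elem.text)
    for child in elem:
        yield from events(child)
        if child.tail:
            yield ("text", child.tail)


def overlaps(div):
    """Group text between consecutive pointers by (start target, end target)."""
    result = {}
    for u in div.iter("u"):
        start, buffer = None, []
        for kind, value in events(u):
            if kind == "ptr":
                if start is not None:
                    text = " ".join("".join(buffer).split())
                    result.setdefault((start, value), []).append((u.get("who"), text))
                start, buffer = value, []
            elif start is not None:
                buffer.append(value)
    return result


aligned = overlaps(ET.fromstring(SAMPLE))
for (start, end), segments in aligned.items():
    print(start, end, segments)
```

Any key that collects segments from more than one speaker identifies a stretch of simultaneous speech.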

Date: (revised 19-22 Nov 2003) Author: edited by Lou Burnard (revised LB).
British National Corpus.

British National Corpus User Reference Guide
