Spoken texts

Basic structure: spoken texts

The spoken material transcribed for the BNC is also organized into ‘texts’, which are subdivided into ‘divisions’, made up of <w> and <mw> elements grouped into <s> elements in the same way as written texts. However, there are a number of other elements specific to spoken texts, and their hierarchic organization is naturally not the same as that of written texts. For this reason, a different element (<stext>) is used to represent a spoken text.

In demographically sampled spoken texts, each distinct conversation recorded by a given respondent is treated as a distinct <div> element. All the conversations from a single respondent are then grouped together to form a single <stext> element.

Context-governed spoken texts do not use the <div> element; each <stext> element containing a context-governed spoken text consists of a sequence of <u> elements, interspersed with a variety of empty elements used to indicate paralinguistic phenomena noticed by the transcribers.

The <s> elements making up a spoken text are grouped not into <p> or other similar elements, but instead into <u> elements. Each <u> (utterance) element marks a stretch of uninterrupted speech from a given speaker (see section Utterances). Interspersed within and between utterances, a variety of empty elements are used to indicate paralinguistic phenomena noticed by the transcribers (see section Paralinguistic phenomena).
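
The two layouts described above can be told apart programmatically. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the two miniature texts and the function name utterance_groups are hypothetical, invented for illustration, and real BNC files carry a header and far richer markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature texts (not taken from the corpus).
DEMOGRAPHIC = """<stext>
  <div><u who="A"><s n="1"><w>Hello </w></s></u></div>
  <div>
    <u who="B"><s n="2"><w>Right </w></s></u>
    <u who="A"><s n="3"><w>Yes</w></s></u>
  </div>
</stext>"""

CONTEXT_GOVERNED = """<stext>
  <u who="C"><s n="1"><w>Good </w><w>morning</w></s></u>
</stext>"""


def utterance_groups(stext):
    """Return one list of <u> elements per <div>, or a single list
    covering the whole text when no <div> children are present."""
    divs = stext.findall("div")  # direct children only
    if divs:
        return [div.findall("u") for div in divs]
    return [stext.findall("u")]


for source in (DEMOGRAPHIC, CONTEXT_GOVERNED):
    groups = utterance_groups(ET.fromstring(source))
    print([len(group) for group in groups])
```

Under these assumptions, each inner list corresponds to one recorded conversation in a demographically sampled text, and to the whole text otherwise.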

The methods and principles applied in transcription and normalisation of speech are discussed in TGCW21 Spoken Corpus Transcription Guide and summarised in the appropriate part of the corpus header. The editorial tags discussed in section Editorial indications above are also used to represent normalisation practice when dealing with transcribed speech.


Utterances

The term utterance is used in the BNC to refer to a continuous stretch of speech produced by one participant in a conversation, or by a group of participants. Structurally, the corresponding element behaves in a similar way to the <p> element in a written text: it groups a sequence of <s> elements together.
  • <u> (utterance) a stretch of speech usually preceded and followed by silence or by a change of speaker.
The who attribute is required on every <u>: its function is to identify the person or group of people making the utterance, using the unique code defined for that person in the appropriate section of the header (see section The <langUsage> element). A simple example follows:
 <u who="PS1LW">
   <s n="159">
     <w c5="ITJ" hw="mm" pos="INTERJ">Mm </w>
     <w c5="ITJ" hw="mm" pos="INTERJ">mm</w>
     <c c5="PUN">.</c>
   </s>
 </u>
The code PS1LW used here will be specified as the value for the xml:id attribute of some <person> element within the header of the text from which this example is taken. A code ending PS000, PSUNK, or PS001 is used where the speaker cannot be identified, prefixed by the identifier for the text. Where there are several distinct, but unidentified, speakers within the same text, distinct identifiers are used.
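
As an illustration of how the who attribute and the token elements might be read back out of an utterance, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name utterance_text is hypothetical; the sample is the utterance shown above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<u who="PS1LW">
  <s n="159">
    <w c5="ITJ" hw="mm" pos="INTERJ">Mm </w>
    <w c5="ITJ" hw="mm" pos="INTERJ">mm</w>
    <c c5="PUN">.</c>
  </s>
</u>"""


def utterance_text(u):
    """Concatenate the text of every <w> and <c> token in an utterance."""
    return "".join(tok.text or "" for tok in u.iter() if tok.tag in ("w", "c"))


u = ET.fromstring(SAMPLE)
speaker = u.get("who")  # speaker code, resolved via the header
text = utterance_text(u)
print(speaker, repr(text))
```

Looking the speaker code up in the header is not shown here, since it depends on the structure of the header for the text concerned.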

Paralinguistic phenomena

In transcribing spoken language, it is necessary to select from the possibly very large set of distinct paralinguistic phenomena which might be of interest. In the texts transcribed for the BNC, encoders were instructed to mark the following such phenomena:
  • voice quality: for example, whispering, laughing, etc., both as discrete events and as changes in voice quality affecting passages within an utterance.
  • non-verbal but vocalised sounds: for example, coughs, humming noises, etc.
  • non-verbal and non-vocal events: for example, passing lorries, animal noises, and other matters considered worthy of note.
  • significant pauses: silence, within or between utterances, longer than was judged normal for the speaker or speakers.
  • unclear passages: whole utterances or passages within them which were inaudible or incomprehensible for a variety of reasons.
  • speech management phenomena: for example, truncation, false starts, and correction.
  • overlapping speech: points at which more than one speaker was active.
Other aspects of spoken texts are not explicitly recorded in the encoding, although their headers contain considerable amounts of situational and participant information.

In many cases, because no standardized set of descriptions was predefined, transcribers gave very widely differing accounts of the same phenomena. An attempt has however been made to normalize the descriptions for some of these elements in the BNC XML editions.

The elements used to mark these phenomena are listed below in alphabetical order:
  • <event> (Event) any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication.
    desc provides a brief description of the event.
    dur (duration) indicates the duration of the element in seconds.
  • <pause> a pause either between or within utterances.
  • <shift> (Shift) marks the point at which some paralinguistic feature of a series of utterances by any one speaker changes.
    new specifies the new state of the paralinguistic feature specified.
  • <trunc> contains one or more truncated words in transcribed speech.
  • <unclear> contains a word, phrase, or passage which cannot be transcribed with certainty because it is illegible or inaudible in the source.
  • <vocal> (Vocalized semi-lexical) any vocalized but not necessarily lexical phenomenon, for example voiced pauses, non-lexical backchannels, etc.
    desc provides a brief description of the vocal event.
    dur (duration) indicates the duration of the element in seconds.
    who indicates the person, or group of people, to whom the element content is ascribed.

The value of the dur attribute is normally specified only if it is greater than 5 seconds, and its accuracy is only approximate.

With the exception of the <trunc> element, which is a special case of the editorial tags discussed in section Editorial indications above, all of these elements are empty, and may appear anywhere within a transcription.

The following example shows an event, several pauses and a patch of unclear speech:
 <s n="5490">
   <event desc="radio on"/>
   <pause dur="34"/>
   <w c5="PNP" hw="you" pos="PRON">You </w>
   <w c5="VVN" hw="get" pos="VERB">got</w>
   <w c5="TO0" hw="ta" pos="PREP">ta </w>
   <unclear/>
   <w c5="NN1" hw="radio" pos="SUBST">Radio </w>
   <w c5="CRD" hw="two" pos="ADJ">Two </w>
   <w c5="PRP" hw="with" pos="PREP">with </w>
   <w c5="DT0" hw="that" pos="ADJ">that</w>
   <c c5="PUN">.</c>
 </s>
 <s n="5491">
   <pause dur="6"/>
   <w c5="AJ0" hw="bloody" pos="ADJ">Bloody </w>
   <w c5="NN1" hw="pirate" pos="SUBST">pirate </w>
   <w c5="NN1" hw="station" pos="SUBST">station </w>
   <w c5="VM0" hw="would" pos="VERB">would</w>
   <w c5="XX0" hw="not" pos="ADV">n't </w>
   <w c5="PNP" hw="you" pos="PRON">you</w>
   <c c5="PUN">?</c>
 </s>
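
A minimal sketch of how such empty elements might be collected from a sentence, using Python's standard xml.etree.ElementTree module. The sample is a shortened fragment of the first sentence above; the function name paralinguistic_summary is hypothetical, and the sketch assumes that dur values, where present, are whole numbers.

```python
import xml.etree.ElementTree as ET

# Shortened fragment of the example sentence; the remaining words are omitted.
SAMPLE = """<s n="5490">
  <event desc="radio on"/>
  <pause dur="34"/>
  <w c5="PNP" hw="you" pos="PRON">You </w>
  <w c5="VVN" hw="get" pos="VERB">got</w>
  <w c5="TO0" hw="ta" pos="PREP">ta </w>
  <unclear/>
  <w c5="NN1" hw="radio" pos="SUBST">Radio </w>
</s>"""


def paralinguistic_summary(s):
    """Return (event descriptions, timed pause durations, count of unclear passages)."""
    events = [e.get("desc") for e in s.iter("event")]
    # Assumption: dur is an integer when given; untimed pauses are skipped.
    pauses = [int(p.get("dur")) for p in s.iter("pause") if p.get("dur") is not None]
    unclear = sum(1 for _ in s.iter("unclear"))
    return events, pauses, unclear


summary = paralinguistic_summary(ET.fromstring(SAMPLE))
print(summary)
```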
Where the whole of an utterance is unclear, that is, where no speech has actually been transcribed, the <unclear> element is used on its own, with an optional who attribute to indicate who is speaking, if this is identifiable. For example:
 <u who="xx">
   <s>....</s>
 </u>
 <unclear who="yy"/>
 <u who="xx">
   <s>... </s>
 </u>
Here YY's remarks, whatever they are, are too unclear to be transcribed, and so no <u> element is provided.
The values used for the desc attribute of the <event> element are not constrained in the current version of the corpus, in which more than a thousand different values occur. Some common examples follow:
 <event desc="laughter"/>
 <event desc="telephone noise"/>
A list of the most frequent values is given in Event descriptions.
As noted above, a distinction is made between discrete vocal events, such as laughter, and changes in voice quality, such as words which are spoken in a laughing tone. The former are encoded using the <vocal> element, as in the following example:
 <u who="PS09T">
   <s n="4307">
     <vocal desc="laugh"/>
     <c c5="PUN">, </c>
     <w c5="PNP" hw="you" pos="PRON">you</w>
     <w c5="VM0" hw="will" pos="VERB">'ll </w>
     <w c5="VHI" hw="have" pos="VERB">have </w>
     <w c5="TO0" hw="to" pos="PREP">to </w>
     <w c5="VVI" hw="take" pos="VERB">take </w>
     <w c5="DT0-CJT" hw="that" pos="ADJ">that </w>
     <w c5="AVP-PRP" hw="off" pos="ADV">off </w>
     <w c5="AV0" hw="there" pos="ADV">there </w>
     <vocal desc="laugh"/>
     <w c5="ITJ" hw="yeah" pos="INTERJ">yeah </w>
     <w c5="PNP" hw="you" pos="PRON">you </w>
     <w c5="VM0" hw="can" pos="VERB">can </w>
     <pause/>
     <vocal desc="laugh"/>
     <pause/>
   </s>
 </u>
The <shift> element is used instead where the laughter indicates a change in voice quality, as in the following example:
 <u who="PS01V">
   <s n="4188">
     <w c5="CJC" hw="and" pos="CONJ">And </w>
     <w c5="UNC" hw="erm" pos="UNC">erm </w>
     <pause/>
     <w c5="CJC" hw="and" pos="CONJ">and </w>
     <w c5="AV0" hw="then" pos="ADV">then </w>
     <w c5="PNP" hw="we" pos="PRON">we </w>
     <w c5="VVD" hw="go" pos="VERB">went </w>
     <w c5="CJC" hw="and" pos="CONJ">and </w>
     <w c5="VVD" hw="get" pos="VERB">got </w>
     <w c5="DPS" hw="i" pos="PRON">my </w>
     <w c5="NN0" hw="fruit" pos="SUBST">fruit </w>
     <w c5="CJC" hw="and" pos="CONJ">and </w>
     <w c5="NN1" hw="veg" pos="SUBST">veg </w>
     <w c5="CJC" hw="and" pos="CONJ">and </w>
     <w c5="AV0" hw="then" pos="ADV">then </w>
     <w c5="PNP" hw="we" pos="PRON">we </w>
     <w c5="VVD" hw="go" pos="VERB">went </w>
     <w c5="PRP" hw="in" pos="PREP">in </w>
     <w c5="AJ0-NN1" hw="top" pos="ADJ">Top </w>
     <w c5="NP0" hw="marks" pos="SUBST">Marks </w>
     <w c5="CJC" hw="and" pos="CONJ">and </w>
     <w c5="VVD" hw="get" pos="VERB">got </w>
     <w c5="PNP" hw="they" pos="PRON">them </w>
     <shift new="laughing"/>
     <w c5="AV0" hw="so" pos="ADV">so </w>
     <w c5="PNP" hw="we" pos="PRON">we </w>
     <w c5="AV0" hw="never" pos="ADV">never </w>
     <w c5="VVD" hw="get" pos="VERB">got </w>
     <shift/>
     <w c5="PNP" hw="we" pos="PRON">we </w>
     <w c5="VVD" hw="go" pos="VERB">went </w>
     <w c5="AVP" hw="through" pos="ADV">through </w>
     <w c5="PRP" hw="for" pos="PREP">for </w>
     <w c5="AT0" hw="a" pos="ART">a </w>
     <w c5="NN1" hw="video" pos="SUBST">video </w>
     <w c5="AV0" hw="really" pos="ADV">really</w>
     <c c5="PUN">, </c>
     <w c5="AV0" hw="never" pos="ADV">never </w>
     <w c5="VVN-VVD" hw="get" pos="VERB">got </w>
     <w c5="AVP" hw="round" pos="ADV">round </w>
     <w c5="PRP" hw="to" pos="PREP">to </w>
     <w c5="VVG" hw="look" pos="VERB">looking </w>
     <w c5="PRP" hw="for" pos="PREP">for </w>
     <w c5="AT0" hw="a" pos="ART">a </w>
     <w c5="NN1" hw="video" pos="SUBST">video </w>
     <w c5="VDD" hw="do" pos="VERB">did </w>
     <w c5="PNP" hw="we" pos="PRON">we</w>
     <c c5="PUN">?</c>
   </s>
 </u>

Here the passage between the tags <shift new="laughing"/> and <shift/> is spoken with a laughing intonation.

A list of values currently used for the new attribute is given below in section Voice quality codes.
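
Because <shift> is an empty milestone rather than a container, software reading the corpus has to carry the current voice quality forward from one token to the next. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample abbreviates the sentence shown above (the c5, hw and pos attributes are omitted for brevity), the function name words_with_quality is hypothetical, and the sketch assumes that a <shift/> with no new attribute marks a return to the speaker's normal voice.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment of the example; token attributes are omitted.
SAMPLE = """<s n="4188">
  <w>got </w><w>them </w>
  <shift new="laughing"/>
  <w>so </w><w>we </w><w>never </w><w>got </w>
  <shift/>
  <w>we </w><w>went </w>
</s>"""


def words_with_quality(s):
    """Pair each word with the voice quality in force when it was spoken."""
    quality = None  # None stands for the speaker's normal voice (assumption)
    pairs = []
    for el in s:  # direct children, in document order
        if el.tag == "shift":
            quality = el.get("new")  # a bare <shift/> resets to None
        elif el.tag == "w":
            pairs.append(((el.text or "").strip(), quality))
    return pairs


pairs = words_with_quality(ET.fromstring(SAMPLE))
laughing = [word for word, quality in pairs if quality == "laughing"]
print(laughing)
```

A fuller treatment would also reset the state at the end of each utterance or on a change of speaker, which this sketch does not attempt.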

Alignment of overlapping speech

By default it is assumed that the events represented in a transcription are non-overlapping and that they are transcribed in temporal sequence. That is, unless otherwise specified, it is implied that the end of one utterance precedes the start of the next following it in the text, perhaps with an interposed <pause> element. Where this is not the case, the following element is used:
  • <align> marks a temporal alignment point within transcribed speech

The with attribute of an <align> element may be thought of as identifying some point in time. Where two or more <align> elements specify the same value for this attribute, their locations are assumed to be synchronised.

The following example demonstrates how this mechanism is used to indicate that one speaker's attempt to take the floor has been unsuccessful:
 <u who="PS6U5">
   <s n="485">
     <w c5="AJ0" hw="poor" pos="ADJ">Poor </w>
     <w c5="AJ0" hw="old" pos="ADJ">old </w>
     <w c5="NP0" hw="luxembourg" pos="SUBST">Luxembourg</w>
     <w c5="VHZ" hw="have" pos="VERB">'s </w>
     <w c5="VVN-AJ0" hw="beat" pos="VERB">beaten</w>
     <c c5="PUN">.</c>
   </s>
   <s n="486">
     <w c5="PNP" hw="you" pos="PRON">You </w>
     <w c5="PNP" hw="you" pos="PRON">you</w>
     <w c5="VHB" hw="have" pos="VERB">'ve </w>
     <w c5="PNP" hw="you" pos="PRON">you</w>
     <w c5="VHB" hw="have" pos="VERB">'ve </w>
     <w c5="AV0" hw="absolutely" pos="ADV">absolutely </w>
     <w c5="AV0" hw="just" pos="ADV">just </w>
     <w c5="VVN" hw="go" pos="VERB">gone </w>
     <w c5="AV0-AJ0" hw="straight" pos="ADV">straight </w>
     <align with="KNYLC01D"/>
     <w c5="PRP" hw="over" pos="PREP">over </w>
     <w c5="PNP" hw="it" pos="PRON">it </w>
     <align with="KNYLC01E"/>
   </s>
 </u>
 <u who="PS4YX">
   <s n="487">
     <align with="KNYLC01D"/>
     <w c5="PNP" hw="i" pos="PRON">I </w>
     <w c5="VHB" hw="have" pos="VERB">have</w>
     <w c5="XX0" hw="not" pos="ADV">n't</w>
     <c c5="PUN">.</c>
   </s>
 </u>
 <u who="PS6U5">
   <s n="488">
     <w c5="CJC" hw="and" pos="CONJ">and </w>
     <w c5="VVN" hw="forget" pos="VERB">forgotten </w>
     <w c5="AT0" hw="the" pos="ART">the </w>
     <w c5="AJ0" hw="poor" pos="ADJ">poor </w>
     <w c5="AJ0" hw="little" pos="ADJ">little </w>
     <w c5="NN1" hw="country" pos="SUBST">country</w>
     <c c5="PUN">.</c>
   </s>
 </u>
This encoding is the CDIF equivalent of what might be presented in a conventional playscript as follows:
 W0001: Poor old Luxembourg's beaten. You, you've, you've absolutely just gone straight over it --
 W0014: (interrupting) I haven't.
 W0001: (at the same time) and forgotten the poor little country.
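
One way to recover the overlap from this encoding is to group <align> elements by the value of their with attribute: any value used by more than one element identifies a synchronised point. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample abbreviates the example above (token attributes are omitted, and the enclosing <div> is added only to give the fragment a single root); the function name synchronised_points is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Abbreviated fragment; the <div> wrapper is an assumption made for parsing.
SAMPLE = """<div>
  <u who="PS6U5"><s n="486">
    <w>straight </w>
    <align with="KNYLC01D"/>
    <w>over </w><w>it </w>
    <align with="KNYLC01E"/>
  </s></u>
  <u who="PS4YX"><s n="487">
    <align with="KNYLC01D"/>
    <w>I </w><w>have</w><w>n't</w>
  </s></u>
</div>"""


def synchronised_points(root):
    """Map each alignment identifier to (speaker, next token) pairs."""
    points = defaultdict(list)
    for u in root.iter("u"):
        for s in u.iter("s"):
            children = list(s)
            for i, el in enumerate(children):
                if el.tag != "align":
                    continue
                # First word or punctuation token following the alignment point.
                following = next(
                    ((c.text or "").strip() for c in children[i + 1:] if c.tag in ("w", "c")),
                    None,
                )
                points[el.get("with")].append((u.get("who"), following))
    return dict(points)


points = synchronised_points(ET.fromstring(SAMPLE))
overlaps = {key: value for key, value in points.items() if len(value) > 1}
print(overlaps)
```

In this fragment only KNYLC01D is shared, so the sketch reports that PS4YX begins speaking at the point where PS6U5 says "over".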
