4 Spoken texts

The methods and principles applied in the transcription and normalisation of speech are discussed in document TGCW21, the Spoken Corpus Transcription Guide, and summarised in the appropriate part of the corpus header. The editorial tags discussed in section 2.5 above are also used to represent normalisation practice when dealing with transcribed speech.

4.1 Basic structure: spoken texts

Spoken texts are organized quite differently from written texts. In particular, a complex hierarchy of divisions and subdivisions is inappropriate. The structural elements used to represent the organization of spoken texts are <stext>, <div>, <u>, and <align>, described in the paragraphs which follow.

In demographically sampled spoken texts, each distinct conversation recorded by a given respondent is treated as a distinct <div> element. All the conversations from a single respondent are grouped together to form a single <stext> element. Each <div> element within a demographically sampled spoken text consists of a sequence of <u> elements (see section 4.2), interspersed with a variety of empty elements used to indicate para-linguistic phenomena noticed by the transcribers (see section 4.3).

Context-governed spoken texts do not use the <div> element; each <stext> element containing a context-governed spoken text consists of a sequence of <u> elements again interspersed with a variety of empty elements used to indicate para-linguistic phenomena noticed by the transcribers.
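Schematically, therefore, a demographically sampled spoken text nests its elements as in the following sketch, in which the utterance content is elided and the speaker codes and number of divisions are invented purely for illustration; a context-governed text has the same shape but omits the <div> level:

<stext>
<div>
<u who=W0001> ... </u>
<u who=W0002> ... </u>
</div>
<div>
<u who=W0002> ... </u>
</div>
</stext>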

To handle overlapping utterances, CDIF uses a device known as an alignment map, discussed in section 4.4 below. A single alignment map, represented by the <align> element, may be defined for a whole spoken text, or for each division of it: if overlap is present, the alignment map is given at the start of the division or text concerned.

Each utterance is further subdivided into <s> elements, and then into <w> and <c> elements, in the same way as for written texts. The principles underlying the orthographic transcription and use of punctuation are discussed in document TGCW21.

4.2 Utterances

An utterance is a discrete sequence of speech produced by one participant, or group of participants, in a conversation; it is represented by the <u> element, which takes one additional attribute, who, described below.

The who attribute is mandatory: its function is to identify the person or group of people making the utterance, using the unique code defined for that person in the appropriate section of the header (see section 5.3.2). A simple example follows:

<u who=W0001>
<s n=00010>
<w ITJ>Mm <w ITJ>mm <c PUN>.
</u>
The code W0001 used here will be specified as the value for the id attribute of some <person> element within the header of the text from which this example is taken. Where the speaker cannot be identified, one of the following codes is used: . Where there are several distinct, but unidentified, speakers within a text, distinct identifiers are used. For example, if text xyz contains two different but unidentified speakers, one of them will be given the identifier XYZSP001, and the other XYZSP002.
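For instance, utterances by the two unidentified speakers of the hypothetical text xyz might be marked up along the following lines; the words, word-class codes and sentence numbers here are invented purely for illustration:

<u who=XYZSP001>
<s n=00001>
<w ITJ>Mm <c PUN>.
</u>
<u who=XYZSP002>
<s n=00002>
<w ITJ>Yeah <c PUN>.
</u>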

4.3 Paralinguistic phenomena

In transcribing spoken language, it is necessary to select from the possibly very large set of distinct paralinguistic phenomena which might be of interest. In the texts transcribed for the BNC, encoders were instructed to mark only a restricted set of such phenomena.

Other aspects of spoken texts are not explicitly recorded in the encoding, although their headers contain considerable amounts of situational and participant information.

The elements used to mark these phenomena are listed below in alphabetical order:

<event> a non-verbal but audible event, such as a door closing or a radio being switched on, described by its desc attribute
<pause> a noticeable pause in speech, whose length may be given by the dur attribute
<shift> a change in voice quality, such as a shift into a laughing tone, indicated by its new attribute
<trunc> a truncated word or phrase, as in a false start or repetition
<unclear> a passage of speech which could not be transcribed
<vocal> a non-verbal but vocalised sound, such as laughter or a cough, described by its desc attribute

The value of the dur attribute is normally specified only if it is greater than 5 seconds, and is in any case only approximate.

With the exception of the <trunc> element, which is a special case of the editorial tags discussed in section 2.5 above, all of these elements are empty, and may appear anywhere within a transcription.

The following example shows an event, several pauses and a patch of unclear speech:

<u who=D00011>
<s n=00011>
<event desc="radio on"><w PNP><pause dur=34>You
<w VVD>got<w TO0>ta <unclear><w NN1>Radio
<w CRD>Two <w PRP>with <w DT0>that <c PUN>.
<s n=00012>
<pause dur=6><w AJ0>Bloody <w NN1>pirate
<w NN1>station <w VM0>would<w XX0>n't
<w PNP>you <c PUN>?
</u>

Where the whole of an utterance is unclear, that is, where no speech has actually been transcribed, the <unclear> element is used on its own, with an optional who attribute to indicate who is speaking, if this is identifiable. For example:

<u who=PS1L4>
<s n=0037><w UH>Oh<c YSTP>. </s></u>
<unclear who=PS1L3>
<u who=PS000><vocal desc=laugh></u>
<unclear who=PS1L4>
<u who=PS1L3> <s n=0038><w RR>Perhaps <w PPY>you <w VM>would <w VVI>like 
<w TO>to <w VVI>go <w CC>and <w VDI>do <w APPGE>your <w DA>own <w NN1>thing<c YSTP>. </s>
Here PS1L3's remarks before the laugh indicated by the <vocal> element, whatever they are, are too unclear to be transcribed, and so no <u> element is provided.

The values used for the desc attribute of the <event> element are not constrained in the current version of the corpus. Some common examples follow:

<event desc="a lot of people talking">
<event desc="door closes">
<event desc="tuning in radio">
<event desc="radio advertisements playing">

As noted above, a distinction is made between discrete vocal events, such as laughter, and changes in voice quality, such as words which are spoken in a laughing tone. The former are encoded using the <vocal> element, as in the following example:

<u id=D0038 who=W0011>
<s n=00040>
<vocal desc=laugh><w PNP>you<w VM0>'ll <w VHI>have
<w TO0>to <w VVI>take <w DT0>that
<w PRP>off <w AV0>there <w ITJ><vocal desc=laugh>yeah
<w PNP>you <w VM0>can <pause><vocal desc=laugh><pause>
</u>
The <shift> element is used instead where the laughter indicates a change in voice quality, as in the following example:
<u who=W0003>
<s n=00669>
<w CJC>And <w UNC>erm <w CJC><pause>and <w AV0>then
<w PNP>we <w VVD>went <w PRP>in <w AJ0-NN1>Top <w NN2>Marks
<w CJC>and <w VVD-VVN>got <w PNP>them
<w CJS><shift new=laughing>so <w PNP>we <w AV0>never
<w VVD-VVN>got <w PNP><shift>we <w VVD>went <w AVP>through
<w PRP>for <w AT0>a <w NN1>video <w AV0>really <c PUN>,
<w AV0>never <w VVD-VVN>got <w AVP>round
<w PRP>to <w VVG>looking <w PRP>for <w AT0>a
<w NN1>video <w VDD>did <w PNP>we<c PUN>?
</u>

Here the passage between the tags <shift new=laughing> and <shift> is spoken with a laughing intonation.

A list of values currently used for the new attribute is given below in section 6.5.

The <trunc> element is commonly used for false starts or repetitions, as in the following example:

<w CCB>but <w AT>the <w NN1>thing <w VBZ>is <w CST>that 
<w DDQ>which <w RT>then <w VVZ>leads <w II>to <trunc> <w FU>unem </trunc> 
<w NN1>unemployment<c YCOM>, <w CCB>but <w AT>the <w NN1>thing <w VBZ>is 
...

4.4 Alignment of overlapping speech

By default it is assumed that the events represented in a transcription are non-overlapping and that they are transcribed in temporal sequence. That is, unless otherwise specified, it is implied that the end of one utterance precedes the start of the next following it in the text, perhaps with an interposed <pause> element. Where this is not the case, the following elements are used:

<align> an alignment map for a spoken text or for one of its divisions, containing one <loc> element for each point of synchronization
<loc> a point of synchronization within an alignment map, carrying a unique identifier
<ptr> an empty pointer element placed in the text at each point to be synchronized; its t (target) attribute specifies the identifier of the corresponding <loc>

For each point of synchrony, i.e. at each place where the number of simultaneous utterances, events, vocals etc. increases or decreases, a <loc> element is defined within an <align> element, which appears at the start of the enclosing <div>, if any. At each place to be synchronised within the text, a <ptr> element is inserted. The t (target) attributes of these <ptr> elements are then used to specify the identifier of the <loc> with which each is to be synchronised.
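On this description, the alignment map referenced by the <ptr> elements in the example below might be declared at the start of the enclosing text or division along the following lines. This is only an illustrative sketch: in particular, the name of the attribute carrying each <loc> identifier (assumed here to be id) is not specified in this section.

<align>
<loc id=P1>
<loc id=P2>
</align>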

The following example demonstrates how this mechanism is used to indicate that one speaker's attempt to take the floor has been unsuccessful:

<u who=W0014>
<s n=00011>
<w AJ0>Poor <w AJ0>old <w NP0>Luxembourg'<w VBZ>s <w AJ0-VVN>beaten<c PUN>.
<s n=00012>
<w PNP>You <w PNP>you<w VHB>'ve <w PNP>you<w VHB>'ve <w AV0>absolutely <w AV0>just
<w VVN>gone <w AV0>straight <ptr t=P1> <w PRP>over <w PNP>it <ptr t=P2>
</u>
<u who=W0001>
<s n=00013>
<ptr t=P1> <w PNP>I <w VHB>haven<w XX0>'t<c PUN>. <ptr t=P2>
</u>
<u who=W0014>
<s n=00014>
<w CJC>and <w VVN>forgotten <w AT0>the <w AJ0>poor <w AJ0>little
<w NN1>country<c PUN>.
</u>

This encoding is the CDIF equivalent of what might be presented in a conventional playscript as follows:

W0014: Poor old Luxembourg's beaten. You, you've, you've absolutely just
gone straight over it --
W0001: (interrupting) I haven't.
W0014: (at the same time) and forgotten the poor little country.

