Spoken texts
Basic structure: spoken texts
Spoken texts are organized quite differently from written texts. In
particular, a complex hierarchy of divisions and subdivisions is
inappropriate. The following structural elements are used to represent
the organization of spoken texts:
- <stext>
- an individual spoken text.
- <div>
- any subdivision or grouping of the utterances (etc.) making up
a spoken text.
In demographically sampled spoken texts, each distinct conversation
recorded by a given respondent is treated as a distinct <div>
element. All the conversations from a single respondent are grouped
together to form a single <stext> element. Each <div>
element within a demographically sampled spoken text consists of a
sequence of <u> elements (see section Utterances),
interspersed with a variety of empty elements used to indicate
paralinguistic phenomena noticed by the transcribers (see section
Paralinguistic phenomena).
Context-governed spoken texts do not use the <div> element;
each <stext> element containing a context-governed spoken text
consists of a sequence of <u> elements, again interspersed with a
variety of empty elements used to indicate paralinguistic phenomena
noticed by the transcribers.
To handle overlapping utterances, TEI recommends the use of a device
known as an alignment map, discussed in section
Alignment of overlapping speech below. A single alignment map, represented by
the <align> element, may be defined for a whole spoken text, or
for each division of it: if overlap is present, the alignment map is
given at the start of the division or text concerned.
Each utterance is further subdivided into <s> elements, and
then into <w> and <c> elements, in the same way as for
written texts.
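The containment rules above can be modelled in a few lines. The following Python sketch is illustrative only: the class names (SpokenText, Division, Utterance) are hypothetical, not part of the BNC or TEI, and the empty paralinguistic elements are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical in-memory model of the spoken-text hierarchy; class and
# field names are illustrative only. Paralinguistic elements are omitted.

@dataclass
class Utterance:
    """A <u> element: the speaker code and its <s> units as lists of <w>/<c> forms."""
    who: str
    sentences: List[List[str]] = field(default_factory=list)

@dataclass
class Division:
    """A <div>: one conversation within a demographically sampled text."""
    utterances: List[Utterance] = field(default_factory=list)

@dataclass
class SpokenText:
    """An <stext>: demographic texts fill `divisions`,
    context-governed texts fill `utterances` directly."""
    divisions: List[Division] = field(default_factory=list)
    utterances: List[Utterance] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Context-governed texts have no <div>; demographic texts group
        # their utterances into <div>s, so the two layouts never mix.
        if self.divisions and self.utterances:
            raise ValueError("an stext uses either <div> elements or bare <u> elements, not both")

    def all_utterances(self) -> List[Utterance]:
        """Utterances in document order, whichever layout the text uses."""
        if self.divisions:
            return [u for d in self.divisions for u in d.utterances]
        return list(self.utterances)
```

A demographically sampled text with two recorded conversations would then be two Division objects inside one SpokenText, while a context-governed text would list its utterances directly.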
The methods and principles applied in transcription and
normalisation of speech are discussed in the Spoken Corpus
Transcription Guide (TGCW21) and summarised in the appropriate part
of the corpus header. The editorial tags discussed in section ??
above are also used to represent normalisation practice when dealing
with transcribed speech.
Utterances
An utterance is a discrete sequence of speech produced by one
participant, or group of participants, in a conversation; it is
represented by the <u> element, which has the following additional
attribute:
- who
- identifies the person or group responsible for the utterance.
The who attribute is mandatory: its function is to identify the
person or group of people making the utterance, using the unique code
defined for that person in the appropriate section of the header (see
section ??). A simple example follows:
<u who=w0001>
<s n=00010>
<w ITJ>Mm <w ITJ>mm <c PUN>.
</u>
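Because end tags for <w> and <c> are omitted in this markup, a token's text runs up to the next tag. The following Python sketch reads one <u> element under that assumption. The function name parse_utterance is hypothetical, upper-casing the who value to match header codes such as W0001 is an assumption, and real corpus processing would be better served by a proper SGML parser.

```python
import re
from typing import List, Optional, Tuple

# Any start or end tag, e.g. <u who=w0001>, <w ITJ>, <c PUN>, </u>.
TAG = re.compile(r"<(/?)(\w+)([^>]*)>")

def parse_utterance(sgml: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (speaker code, [(class code, form), ...]) for one <u> element.

    Assumes <w> and <c> end tags are omitted, so a token runs until the next
    <w>, <c>, <s> or <u> tag; other tags inside it (e.g. <pause>) are skipped.
    """
    who = ""
    tokens: List[Tuple[str, str]] = []
    current: Optional[List[str]] = None   # [class code, text so far]
    pos = 0
    for m in TAG.finditer(sgml):
        if current is not None:
            current[1] += sgml[pos:m.start()]
        pos = m.end()
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3).strip()
        if name not in ("w", "c", "s", "u"):
            continue                      # paralinguistic or other tags
        if current is not None:           # any structural tag ends the token
            tokens.append((current[0], current[1].strip()))
            current = None
        if closing:
            continue
        if name == "u":
            found = re.search(r'who="?([^"\s>]+)', attrs)
            # Upper-cased to match header codes such as W0001 (an assumption).
            who = found.group(1).upper() if found else ""
        elif name in ("w", "c"):
            current = [attrs, ""]
    if current is not None:               # text after the last tag
        tokens.append((current[0], (current[1] + sgml[pos:]).strip()))
    return who, tokens
```

Applied to the example above, this should yield the speaker W0001 and the tokens Mm (ITJ), mm (ITJ) and . (PUN).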
The code W0001 used here is specified as the value of the id
attribute of some <person> element within the header of the text from
which this example is taken. The code PS000 is used where the speaker
cannot be identified, and the code PS001 is used for a group of
unidentified speakers. Where there are several distinct but
unidentified speakers within a text, distinct identifiers are used.
For example, if text xyz contains two different but unidentified
speakers, one of them will be given the identifier XYZSP001, and the
other XYZSP002.
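The identifier conventions just described can be sketched as follows. These are hypothetical helpers, not part of any BNC tool: the three-digit zero padding is inferred from the XYZSP001 example, and numbering speakers in order of first appearance is an assumption.

```python
from typing import Dict, Iterable

UNKNOWN_SPEAKER = "PS000"  # a speaker who cannot be identified
UNKNOWN_GROUP = "PS001"    # a group of unidentified speakers

def unidentified_speaker_id(text_id: str, n: int) -> str:
    """Code for the n-th distinct unidentified speaker in a text, e.g. XYZSP001.

    The three-digit zero padding is inferred from the XYZSP001 example.
    """
    if not 1 <= n <= 999:
        raise ValueError("speaker number must be between 1 and 999")
    return f"{text_id.upper()}SP{n:03d}"

def assign_unidentified(text_id: str, labels: Iterable[str]) -> Dict[str, str]:
    """Give each distinct transcriber label (hypothetical, e.g. "man 1")
    its own code, numbered in order of first appearance (an assumption)."""
    codes: Dict[str, str] = {}
    for label in labels:
        if label not in codes:
            codes[label] = unidentified_speaker_id(text_id, len(codes) + 1)
    return codes
```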
Paralinguistic phenomena
In transcribing spoken language, it is necessary to select from the
possibly very large set of distinct paralinguistic phenomena which might
be of interest. In the texts transcribed for the BNC, encoders were
instructed to mark the following such phenomena:
- voice quality
- for example, whispering, laughing, etc., both as discrete events
and as changes in voice quality affecting passages within an utterance.
- non-verbal but vocalised sounds
- for example, coughs, humming noises etc.
- non-verbal and non-vocal events
- for example passing lorries, animal noises, and other matters
considered worthy of note.
- significant pauses
- silence, within or between utterances, longer than was judged
normal for the speaker or speakers.
- unclear passages
- whole utterances or passages within them which were inaudible or
incomprehensible for a variety of reasons.
- speech management phenomena
- for example truncation, false starts, and correction.
- overlap
- points at which more than one speaker was active.
Other aspects of spoken texts are not explicitly recorded in
the encoding, although their headers contain considerable amounts of
situational and participant information.
The elements used to mark these phenomena are listed below in
alphabetical order:
- <event>
- any non-verbal and non-vocal event (such as a door slamming)
occurring during a conversation and regarded as worthy of note.
Attributes include:
- desc
- description of the event.
- dur
- duration of the event in seconds.
- <pause>
- a marked pause during or between utterances. Attributes include:
- dur
- duration of the pause in seconds.
- <shift>
- a marked change in voice quality for any one speaker. Attributes
include:
- new
- description of the voice quality after the shift.
- <trunc>
- a word or phrase which has been truncated during speech.
- <unclear>
- a point in a spoken text at which it is unclear what is
happening, e.g. who is speaking or what is being said. Attributes
include:
- dur
- the duration of the passage in seconds.
- who
- the person or group responsible for the unclear piece of speech.
- <vocal>
- a non-linguistic but communicative sound made by one of the
participants in a spoken text. Attributes include:
- desc
- the kind of sound made.
- dur
- duration of the sound in seconds.
The value of the dur attribute is normally specified
only if it is greater than 5 seconds, and its accuracy is only
approximate.
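As a sketch of the dur convention, the hypothetical helper below writes an empty element and drops the duration unless it exceeds 5 seconds. Rounding to whole seconds is an assumption, suggested by the approximate nature of the values and by the integer durations in the examples.

```python
from typing import Optional

DUR_THRESHOLD = 5  # seconds: dur is normally recorded only above this

def empty_tag(name: str, dur: Optional[float] = None, **attrs: str) -> str:
    """Build an empty element such as <pause dur=34> or <vocal desc=laugh>.

    A measured duration is rounded to whole seconds (an assumption, since
    the value is only approximate) and omitted unless it exceeds DUR_THRESHOLD.
    """
    parts = [name]
    for key, value in attrs.items():
        # Quote values containing spaces, as in desc="radio on".
        parts.append(f'{key}="{value}"' if " " in value else f"{key}={value}")
    if dur is not None and dur > DUR_THRESHOLD:
        parts.append(f"dur={round(dur)}")
    return "<" + " ".join(parts) + ">"
```

So a measured pause of 34 seconds would become <pause dur=34>, while a 3-second pause would be written simply as <pause>.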
With the exception of the <trunc> element, which is a special case of
the editorial tags discussed in section ?? above, all of these elements
are empty and may appear anywhere within a transcription.
The following example shows an event, several pauses and a patch of
unclear speech:
<u who=d00011>
<s n=00011>
<event desc="radio on"><w PNP><pause dur=34>You
<w VVD>got<w TO0>ta <unclear><w NN1>Radio
<w CRD>Two <w PRP>with <w DT0>that <c PUN>.
<s n=00012>
<pause dur=6><w AJ0>Bloody <w NN1>pirate
<w NN1>station <w VM0>would<w XX0>n't
<w PNP>you <c PUN>?
</u>
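A transcription like this can be scanned for its empty paralinguistic elements with a short regular-expression sketch. The function names are hypothetical, and the attribute pattern assumes values are either quoted (desc="radio on") or a single bare token (dur=34), as in the examples here. Since dur is normally recorded only for pauses over 5 seconds, the summed total is at best a lower bound.

```python
import re
from typing import Dict, List, Tuple

# The empty paralinguistic elements listed above (<trunc> has content, so it is excluded).
NAMES = ("event", "pause", "shift", "unclear", "vocal")
EMPTY_TAG = re.compile(r"<(" + "|".join(NAMES) + r")\b([^>]*)>")
# An attribute is name=value, with the value either quoted or a bare token.
ATTR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def paralinguistic_elements(sgml: str) -> List[Tuple[str, Dict[str, str]]]:
    """List (element name, attributes) for each empty paralinguistic tag, in order."""
    found = []
    for m in EMPTY_TAG.finditer(sgml):
        attrs = {key: quoted or bare for key, quoted, bare in ATTR.findall(m.group(2))}
        found.append((m.group(1), attrs))
    return found

def total_marked_pause(sgml: str) -> int:
    """Sum of dur values on <pause> elements; pauses without dur count as 0,
    so the result is only a lower bound on pause time."""
    return sum(int(attrs.get("dur", 0))
               for name, attrs in paralinguistic_elements(sgml) if name == "pause")
```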
Where the whole of an utterance is unclear, that is, where no speech
has actually been transcribed, the <unclear> element is used on its
own, with an optional who attribute to indicate who is speaking, if
this is identifiable. For example:
<u who=xx><s>....</u>
<unclear who=yy>
<u who=xx><s>... </u>
Here YY's remarks, whatever they are, are too unclear to be
transcribed, and so no <u> element is provided.
The values used for the desc attribute of the <event> element are not
constrained in the current version of the corpus. Some common examples
follow:
<event desc="a lot of people talking">
<event desc="door closes">
<event desc="tuning in radio">
<event desc="radio advertisements playing">
As noted above, a distinction is made between discrete vocal events,
such as laughter, and changes in voice quality, such as words which are
spoken in a laughing tone. The former are encoded using the
<vocal> element, as in the following example:
<u id=d0038 who=w0011>
<s n=00040>
<vocal desc=laugh><w PNP>you<w VM0>'ll <w VHI>have
<w TO0>to <w VVI>take <w DT0>that
<w PRP>off <w AV0>there <w ITJ><vocal desc=laugh>yeah
<w PNP>you <w VM0>can <pause><vocal desc=laugh><pause>
</u>
The <shift> element is used instead where the laughter indicates a
change in voice quality, as in the following example:
<u who=w0003>
<s n=00669>
<w CJC>And <w UNC>erm <w CJC><pause>and <w AV0>then
<w PNP>we <w VVD>went <w PRP>in <w AJ0-NN1>Top <w NN2>Marks
<w CJC>and <w VVD-VVN>got <w PNP>them
<w CJS><shift new=laughing>so <w PNP>we <w AV0>never
<w VVD-VVN>got <w PNP><shift>we <w VVD>went <w AVP>through
<w PRP>for <w AT0>a <w NN1>video <w AV0>really <c PUN>,
<w AV0>never <w VVD-VVN>got <w AVP>round
<w PRP>to <w VVG>looking <w PRP>for <w AT0>a
<w NN1>video <w VDD>did <w PNP>we<c PUN>?
</u>
Here the passage between the tags <shift new=laughing>
and <shift> is spoken with a laughing intonation.
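The scoping of <shift> can be made concrete with a sketch that pairs each transcribed word with the voice quality in effect when it was spoken. It is an illustration under assumptions: the function name is hypothetical, it processes one <u> element at a time, and the label "normal" for speech after a bare <shift> is an invented placeholder.

```python
import re
from typing import List, Tuple

TAG = re.compile(r"<(/?)(\w+)([^>]*)>")

def voice_quality_words(utterance: str) -> List[Tuple[str, str]]:
    """Pair each transcribed word or punctuation mark in one <u> element
    with its voice quality.

    A <shift new=...> starts a new quality; a bare <shift> is taken to return
    to the speaker's normal voice, labelled "normal" here (a placeholder).
    Discrete <vocal> events do not change the quality.
    """
    quality = "normal"
    words: List[Tuple[str, str]] = []
    pos, in_token = 0, False
    for m in TAG.finditer(utterance):
        text = utterance[pos:m.start()].strip()
        if in_token and text:
            words.append((text, quality))
        pos = m.end()
        closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
        if name == "shift" and not closing:
            found = re.search(r'new="?([^"\s>]+)', attrs)
            quality = found.group(1) if found else "normal"
        elif name in ("w", "c"):
            in_token = not closing
        elif name in ("s", "u"):
            in_token = False
    tail = utterance[pos:].strip()
    if in_token and tail:
        words.append((tail, quality))
    return words
```

In the example above, this should mark "so we never got" as laughing and the surrounding words as normal.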
A list of values currently used for the new
attribute is given below in section ??.
Alignment of overlapping speech
By default it is assumed that the events represented in a
transcription are non-overlapping and that they are transcribed in
temporal sequence. That is, unless otherwise specified, it is implied
that the end of one utterance precedes the start of the next following
it in the text, perhaps with an interposed
<pause> element.
Where this is not the case, the following elements are used:
- <align>
- defines an alignment map used to synchronise points within a
spoken text.
- <loc>
- a synchronisation point within an alignment map to which other
elements may refer.
- <ptr>
- an empty tag pointing from one part of a text to some other
element. Attributes include:
- target
- supplies the identifier of some other element in a text; for
alignment, specifically, a <loc> element within an alignment.
For each point of synchrony, i.e. at each place where the number of
simultaneous utterances, events, vocals etc. increases or decreases, a
<loc> element is defined within an <align> element,
which appears at the start of the enclosing <div>, if any. At
each place to be synchronised within the text, a <ptr> element
is inserted. The target attribute of each <ptr> element specifies
the identifier of the <loc> with which it is to be synchronised.
The following example demonstrates how this mechanism is used to
indicate that one speaker's attempt to take the floor has been
unsuccessful:
<u who=w0014>
<s n=00011>
<w AJ0>Poor <w AJ0>old <w NP0>Luxembourg<w VBZ>'s <w AJ0-VVN>beaten<c PUN>.
<s n=00012>
<w PNP>You <w PNP>you<w VHB>'ve <w PNP>you<w VHB>'ve <w AV0>absolutely <w AV0>just
<w VVN>gone <w AV0>straight <ptr target=P1> <w PRP>over <w PNP>it <ptr target=P2>
</u>
<u who=w0001>
<s n=00013>
<ptr target=P1> <w PNP>I <w VHB>haven<w XX0>'t<c PUN>. <ptr target=P2>
</u>
<u who=w0014>
<s n=00014>
<w CJC>and <w VVN>forgotten <w AT0>the <w AJ0>poor <w AJ0>little
<w NN1>country<c PUN>.
</u>
This encoding is the CDIF equivalent of what might be presented in a
conventional playscript as follows:
W0014: Poor old Luxembourg's beaten. You, you've, you've absolutely just
gone straight over it --
W0001: (interrupting, overlapping "over it") I haven't.
W0014: and forgotten the poor little country.
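The overlap in this example can be recovered mechanically. The following Python sketch is hypothetical: the example omits its <align> map, so the order of the <loc> elements is passed in as loc_order, and for each consecutive pair of locations the function collects what each speaker says between the <ptr> elements pointing at them.

```python
import re
from typing import Dict, List, Optional, Tuple

TAG = re.compile(r"<(/?)(\w+)([^>]*)>")

def _attr(attrs: str, key: str) -> Optional[str]:
    """Value of one attribute, quoted or bare; a trailing '/' is ignored."""
    m = re.search(key + r'="?([^"\s/>]+)', attrs)
    return m.group(1) if m else None

def synchronised_speech(utterances: List[str], loc_order: List[str]
                        ) -> Dict[Tuple[str, str], Dict[str, List[str]]]:
    """For each consecutive pair of <loc> identifiers, collect what each
    speaker says between their <ptr> elements pointing at those locations."""
    words: Dict[str, List[str]] = {}                       # speaker -> tokens so far
    marks: Dict[str, Dict[str, int]] = {loc: {} for loc in loc_order}
    for sgml in utterances:
        who, pos, in_token = "", 0, False
        for m in TAG.finditer(sgml):
            text = sgml[pos:m.start()].strip()
            if in_token and text:
                words.setdefault(who, []).extend(text.split())
            pos = m.end()
            closing, name, attrs = m.group(1), m.group(2).lower(), m.group(3)
            if name == "u" and not closing:
                who = (_attr(attrs, "who") or "").upper()
            if name in ("w", "c"):
                in_token = not closing
            elif name in ("s", "u"):
                in_token = False
            elif name == "ptr":
                target = _attr(attrs, "target")
                if target in marks:
                    # Record how many tokens this speaker had produced here.
                    marks[target][who] = len(words.get(who, []))
        tail = sgml[pos:].strip()
        if in_token and tail:
            words.setdefault(who, []).extend(tail.split())
    segments: Dict[Tuple[str, str], Dict[str, List[str]]] = {}
    for start, end in zip(loc_order, loc_order[1:]):
        segments[(start, end)] = {
            who: words.get(who, [])[i:marks[end][who]]
            for who, i in marks[start].items() if who in marks[end]
        }
    return segments
```

Applied to the three utterances above with loc_order ["P1", "P2"], it should pair W0014's "over it" with W0001's "I haven't.", split into tokens as in the markup.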