3 Written texts

3.1 Divisions of written texts

Written texts exhibit a bewildering variety and richness of different structural forms. Some have very little organization at levels higher than the paragraph; others may have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles and so forth. The elements <div1>, <div2>, <div3> and <div4> are used to represent all such textual divisions.

Most written texts, of whatever kind, are hierarchically subdivided using these elements. Structural subdivisions below level 4 (but above paragraph level) are all tagged <div4>. Structural subdivisions at the highest level (<div1>) are identified in all texts; lower levels of subdivision (i.e. <div2>, <div3> or <div4>) may also be supplied where appropriate, but are not required.

These elements have the following attributes in common, in addition to the global attributes id, n, and r:

The n attribute may sometimes be used to carry an identifying name or number used within the text for a given division, for example, a chapter number, as in the following example:

<div1 type=CHAPTER n=THREE org=SEQ complete=Y>

More often, however, chapter names or numbers will be tagged using the <head> element discussed in section 3.2.1 below.

Where supplied, the value of the attribute type characterises the function of the textual division, according to an informal taxonomy. A list of the values actually used for all written texts is given below in section 6.3. If a value is supplied for one division at a given level, it may be assumed to apply to all subsequent divisions at the same level until the end of the enclosing element.
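
The inheritance rule for the type attribute can be illustrated with a minimal Python sketch. It is not part of cdif or of any BNC software: the function name resolve_types and the flattened (level, type) input are assumptions made for illustration only, and the sketch treats the start of a division at a given level as the end of the enclosing element for all deeper levels.

```python
from typing import Dict, List, Optional, Tuple

def resolve_types(divisions: List[Tuple[int, Optional[str]]]) -> List[Optional[str]]:
    """Fill in inherited type values.

    `divisions` is a hypothetical flattened view of a text: one
    (level, explicit_type_or_None) pair per <div1>..<div4> start-tag,
    in document order.
    """
    current: Dict[int, str] = {}   # last explicit type seen at each level
    resolved: List[Optional[str]] = []
    for level, explicit in divisions:
        # Starting a division at this level ends every deeper division,
        # so values inherited at deeper levels stop applying.
        for deeper in [lv for lv in current if lv > level]:
            del current[deeper]
        if explicit is not None:
            current[level] = explicit
        resolved.append(current.get(level))
    return resolved

example = [(1, "CHAPTER"), (2, "SECTION"), (2, None), (1, None), (2, None)]
print(resolve_types(example))
```

In this sketch the third division inherits SECTION from its predecessor, the fourth inherits CHAPTER, and the last has no type because its enclosing <div1> is a new one.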

A sequence of paragraph-level elements of arbitrary length may precede the first structural subdivision at any level. A text may have no structural divisions within it at all. Note that any prefatory or appended matter not forming part of a text will not generally be captured: the TEI <front> and <back> elements are not used by cdif.

3.2 Paragraph-level elements and chunks

Written texts may be organized into structural units containing more than one <s> element and smaller than any of the divisions discussed in section 3.1 above. The most commonly found such element is the <p> (paragraph), but there are several others. Their common identifying feature is that they may appear directly within divisions (that is, directly within <div1>, <div2> etc., or within <text> elements, not nested within some other element such as a paragraph).

An alphabetically ordered list of these elements follows:

Examples for each of these (except <p>) are discussed in more detail in the following subsections. Only the <p> and <head> elements are required for cdif conformance. For an indication of the usage of the others within a given text, the <tagsDecl> element in its header should be consulted.

3.2.1 Headings and captions

Headings and captions serve a variety of functions in written texts. cdif distinguishes between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines etc.) and <caption> elements which are logically independent of the position they may have within a textual division (for example, captions attached to pictures or figures, ``pull-quotes'' embedded within the text, ``by-lines'' identifying authorship and provenance of a newspaper or periodical article).

One or more <head> elements may appear in sequence at the start of any <div1>, <div2>, <div3> or <div4> element, or at the start of a <list> or <poem>.

In the following example, the <head> element is followed by a number of <caption> elements introducing particular parts of an illustrated newspaper story:

<div1 complete=Y org=SEQ>
<head>
<s n=00040>
<w NN2>TROUSERS <w VVB>SUIT
</head>
<caption>
<s n=00041>
<w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine
<w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser
<w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s
<w AJ0>soft <w NN2>pastels<c PUN>.
<s n=00042>
<w NP0>Smart <w CJC>and <w AJ0>acceptable
<w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but
<w AJ0>soft <w AV0>enough <w PRP>for
<w AJ0>relaxed <w NN2>days
</caption>

The type attribute may be used to distinguish more exactly the function of the caption or heading, as indicated below.

<div1 complete=Y org=SEQ>
<head type=MAIN>
<s n=0223>
<w PNP>They<w VBB>'re <w VDG>doing <w AJ0>fine
</head>
<head type=SUB>
<s n=0224>
<w NP0>Dominic <w VVZ>sees <w AJ0-NN1>double
</head>

Where captions would interrupt the normal flow, pointers are used as discussed in section 2.6.

3.2.2 Quotations

A quotation is an extract from a work other than the text itself, embedded within that text, for example as an epigraph or illustration. It is marked up using the <quote> element. This may contain any combination of other chunks (for example paragraphs, poems, lists) but may not directly contain phrase-level elements. Any reference for the citation should also be contained within it, and will usually be separately tagged using the <bibl> element, as in the following example:

<quote>
<p>
<s n=2080>
<w DT0>This <w NN1>way <w PRP>for <w AT0>the <w AJ0>sorrowful <w NN1>city<c PUN>.
<s n=2081>
<w DT0>This <w NN1>way <w PRP>for <w AJ0>eternal <w NN1>suffering<c PUN>.
<s n=2082>
<w DT0>This <w NN1>way <w TO0>to <w VVI>join <w AT0>the <w AJ0>lost
<w NN0>people<c PUN>&hellip
<s n=2083>
<w VVB>Abandon <w DT0>all <w NN1>hope<c PUN>, <w PNP>you <w PNQ>who
<w VVB>enter<c PUN>&hellip 
<bibl><s n=2084>
<w NP0>Dante </bibl>
</p>
</quote>

3.2.3 Poems

Poems or fragments of verse or song may appear both within and between paragraphs. The <l> (line) element is used to mark each metrical line, and any titles or headings present are marked with <head> elements. Each such group of lines is marked as a <poem> element, with no indication of its completeness.

No provision is made for marking units of verse such as stanzas, verse paragraphs etc. A part attribute is defined for the <l> element, which allows incomplete lines to be indicated, but in the current version of the corpus this always takes the value ``u'' (for unknown).

For example:

<poem>
<l part=U>
<s n=0900>
<w PNP>I <w VVB>send <w DPS>my <w NN1>soul <w PRP>through 
<w NN1>time <w CJC>and <w NN1>space <w TO0>to <w VVI>greet 
<w PNP>you<c PUN>.
</l>
<l part=U> 
<s n=0901>
<w PNP>You <w VBD>were <w AT0>a <w NN1>poet<c PUN>.
<s n=0902>
<w PNP>You <w VM0>will <w VVI>understand<c PUN>.
</l>
</poem>

Note that the <l> element is not used to mark typographic lineation. On the few occasions where lineation has been recorded in the BNC, it is marked with the <lb> tag; it does not appear in the BNC Sampler.
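
To show how the <l> and <s> boundaries interact, the following Python sketch recovers the text of each metrical line from the poem example above. It is only an illustration under stated assumptions: the regular expressions and the helper names plain_text and verse_lines are hypothetical, not part of any cdif tool, and they assume that every <l> has an explicit end-tag and that words and punctuation are tagged with <w> and <c> as in the examples.

```python
import re
from typing import List

# Text following each <w ...> or <c ...> tag, up to the next tag.
TOKEN = re.compile(r"<[wc] [^>]+>([^<]*)")
# One metrical line; \b keeps <list> and <lb> from matching.
LINE = re.compile(r"<l\b[^>]*>(.*?)</l>", re.S)

def plain_text(chunk: str) -> str:
    """Join the word and punctuation text of a chunk, normalising spaces."""
    return " ".join("".join(TOKEN.findall(chunk)).split())

def verse_lines(poem_markup: str) -> List[str]:
    """Return the plain text of each <l> element in document order."""
    return [plain_text(body) for body in LINE.findall(poem_markup)]

POEM = """<poem>
<l part=U>
<s n=0900>
<w PNP>I <w VVB>send <w DPS>my <w NN1>soul <w PRP>through
<w NN1>time <w CJC>and <w NN1>space <w TO0>to <w VVI>greet
<w PNP>you<c PUN>.
</l>
<l part=U>
<s n=0901>
<w PNP>You <w VBD>were <w AT0>a <w NN1>poet<c PUN>.
<s n=0902>
<w PNP>You <w VM0>will <w VVI>understand<c PUN>.
</l>
</poem>"""

for line in verse_lines(POEM):
    print(line)
```

The second metrical line contains two <s> elements, which is why line and sentence boundaries have to be handled separately.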

3.2.4 Lists

A list is a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit. Lists may appear within or between paragraphs. Where marked, lists are tagged with the <list> element.

A <list> element consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be preceded by a <label> element. The <label> element holds the identifier or tag sometimes attached to a list item, for example ``(a)''; it may also contain a word or phrase used for a similar purpose.

The <item> element may appear only inside lists. It contains the same mixture of elements as a paragraph, and may thus contain one or more nested lists. It may also contain a series of paragraphs, each marked with a <p> element.

Here is an example of a simple list:

<list>
<item>
<s n=0087>
<w VBZ>Is <w DPS>your <w NN1>nylon <hi r=it> <w NN1>nightie </hi>
<w AJ0>fireproof<c PUN>?
</item>
<item>
<s n=0088>
<w AT0>The <w NN1>hurricane <w VBD>was <hi r=it> <w AJ0-AV0>mighty </hi>
<w AJ0>fierce<c PUN>. <pb n=78>
</item>
<item>
<s n=0089>
<w VM0>Will <w PNP>you <hi r=it> <w VVI>mow </hi> <w AT0>the <w NN1>lawn<c PUN>?
</item>
<item>
<s n=0090>
<w VDD>Did <w PNP>you <hi r=it> <w VVI>know </hi> <w AT0>the <w NN1>time<c PUN>?
</item>
</list>

Here is an example of a labelled list:

<list>
<label>
<s n=0423>
<w CRD>1<c PUN>. </label>
<item>
<s n=0424>
<w NN1-NP0>Surya <c PUN>&mdash <w NN1>Sun <c PUN>&mdash <w AJ0>Creative
<w NN1>agent
</item>
<label>
<s n=0425>
<w CRD>2<c PUN>. </label>
<item>
<p>
<s n=0426>
<w NN1-NP0>Vayu <c PUN>&mdash <w NN1>Air <c PUN>&mdash <w NP0>Preserving
<w NN1>agent <pb n=43>
</p>
</item>
<label>
<s n=0427>
<w CRD>3<c PUN>. </label>
<item>
<p>
<s n=0428>
<w NN2>Agni <c PUN>&mdash <w NN1>Fire <c PUN>&mdash <w AJ0>Destructive
<w NN1>agent
</p>
</item>
</list>
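
The pairing of optional <label> elements with the <item> elements they precede can be sketched in Python as follows. This is an illustration rather than a real cdif parser: the names plain_text and pair_labels are hypothetical, the sample is abridged from the labelled list above, and the sketch assumes a single list with no nested lists and with explicit end-tags on every <label> and <item>.

```python
import re
from typing import List, Optional, Tuple

# Text following each <w ...> or <c ...> tag, up to the next tag.
TOKEN = re.compile(r"<[wc] [^>]+>([^<]*)")
# A complete <label> or <item> element (no nesting assumed).
PART = re.compile(r"<(label|item)>(.*?)</\1>", re.S)

def plain_text(chunk: str) -> str:
    """Join the word and punctuation text of a chunk, normalising spaces."""
    return " ".join("".join(TOKEN.findall(chunk)).split())

def pair_labels(list_markup: str) -> List[Tuple[Optional[str], str]]:
    """Return (label, item) pairs; label is None for unlabelled items."""
    pairs: List[Tuple[Optional[str], str]] = []
    pending: Optional[str] = None
    for match in PART.finditer(list_markup):
        kind, body = match.group(1), match.group(2)
        if kind == "label":
            pending = plain_text(body)   # applies to the next <item>
        else:
            pairs.append((pending, plain_text(body)))
            pending = None
    return pairs

SAMPLE = """<list>
<label>
<s n=0423>
<w CRD>1<c PUN>. </label>
<item>
<s n=0424>
<w NN1>Sun
</item>
<item>
<p>
<s n=0426>
<w NN1>Air <pb n=43>
</p>
</item>
</list>"""

print(pair_labels(SAMPLE))
```

Entity references such as &mdash are passed through unchanged by this sketch; a fuller treatment would resolve them against the entity set declared for the corpus.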

3.2.5 Notes and citations

Annotations occurring in written texts, and bibliographic citations or references, have been marked up in some texts, using the <note> element. This element has the following additional attributes:

Notes within headers are tagged using a distinct <bibNote> element, which is a departure from TEI-recommended practice, as is the use of the <note> element for both original and supplied annotation. The two usages are distinguished by the type attribute.

Here for example is a typical transcriber's note:

<note type=ED>
<s n=0001>
<w NN1-NP0>Page <w NN2>numbers <w XX0>not <w AJ0>available
</note>

Original notes may contain any mixture of other chunks, and may also contain paragraphs: they may appear in written texts only. They will normally be relocated to the end of the section in which they appear, and their original position marked by a <ptr> element, as discussed in section 2.6.

For example:

<s n=053>
<w CJS-PRP>As <w AT0>the <w NP0>UK<w POS>'s <w AJ0>main <w AJ0>independent
<w NN1>AIDS <w AV0>home <w NN1-VVB>care <w NN1>provider<c PUN>, 
<w PNP>we <w VVD>cared <w AVP-PRP>for <w PRP>around <w NN0>25% 
<w PRF>of <w DT0>all <w DT0>those <w PNQ>who <w VVD>died 
<w PRF>of <w NN1>AIDS <w ORD>last <w NN1>year <ptr t=A02NT001><c PUN>.
<s n=054>
<w PRP>In <w NP0>London<c PUN>, <w NN1-VVB>demand <w PRP>for <w DPS>our 
<w NN1>Home <w NN1-VVB>Care <w NN2>services <w VVD-VVN>doubled <w AVP-PRP>over <w AT0>the
<w ORD>last <w CRD>twelve <w NN2>months<c PUN>.
<!-- ... -->
<s n=056>
<w PNP>I <w VVB>expect <w NN1-VVB>demand <w PRP>for <w DT0>this 
<w NN1>service <w TO0>to <w VVI>continue <w TO0>to <w VVI>grow 
<w AVP-PRP>over <w AT0>the <w AJ0>coming <w NN1>year<c PUN>.
</p>
<note id=A02NT001 n=2 type=ORIG>
<s n=057>
<w NN1>AIDS <w NN2>deaths<c PUN>: <w NP0>April <w CRD>1990 <c PUN>&mdash
<w NP0>March <w CRD>1991<c PUN>, <w NP0>UK <w NN1>total <c PUL>(<w NN1-NP0>CDSC
<w NN2>figures <c PUN>&mdash <w CRD>584 <w NP0>April <w CRD>1991<c PUN>.<c PUR>)
<s n=058>
<w DPS>Our <w NN1>Home <w NN1-VVB>Care <w NN2>teams <w VVD>saw <w CRD>141
<w NN0>AIDS <w AJ0-VVD>related <w NN2>deaths <w ORD>last <w NN1>year
</note>

Note the use of the n attribute to carry the original footnote number in the above example.
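
The link between a <ptr> element and the relocated <note> it points to can be checked mechanically. The Python sketch below is a rough illustration, not an official tool: the function names index_notes and unresolved_pointers are hypothetical, and the regular expressions assume unquoted attribute values and explicit </note> end-tags, as in the example above. The second pointer in the sample (A02NT002) is invented to show an unresolved reference.

```python
import re
from typing import Dict, List

PTR = re.compile(r"<ptr\s+t=([A-Za-z0-9]+)>")
NOTE = re.compile(r"<note\s+([^>]*)>(.*?)</note>", re.S)
ATTR = re.compile(r"(\w+)=(\w+)")   # unquoted name=value pairs

def index_notes(markup: str) -> Dict[str, Dict[str, str]]:
    """Map each note id to its attributes; notes without an id are skipped."""
    notes: Dict[str, Dict[str, str]] = {}
    for match in NOTE.finditer(markup):
        attrs = dict(ATTR.findall(match.group(1)))
        if "id" in attrs:
            notes[attrs["id"]] = attrs
    return notes

def unresolved_pointers(markup: str) -> List[str]:
    """List pointer targets that have no matching <note id=...>."""
    notes = index_notes(markup)
    return [target for target in PTR.findall(markup) if target not in notes]

SAMPLE = """<s n=053>
<w ORD>last <w NN1>year <ptr t=A02NT001><c PUN>.
<s n=054>
<w NN2>months <ptr t=A02NT002><c PUN>.
<note type=ED>
<s n=0001>
<w NN2>numbers
</note>
<note id=A02NT001 n=2 type=ORIG>
<s n=057>
<w NN1>AIDS <w NN2>deaths
</note>"""

print(index_notes(SAMPLE))
print(unresolved_pointers(SAMPLE))
```

The n value stored for each note is the original footnote number, so the index can also be used to restore the numbering of the source document.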

Bibliographic citations or references within running texts may also be marked, using the <bibl> element; this is done in some texts only in the present version of the corpus.

For example:

<bibl>
<s n=1379>
<w NP0>Mordechai <w NP0>Chaim <w NP0>Rumkowski<c PUN>, 
<w AJS>Eldest <w PRF>of <w AT0>the <w NN2>Jews <w PRP>in 
<w AT0>the <w NN1-NP0>Lodz <w NN1>ghetto<c PUN>,
<w VVG>speaking <w PRP>in <w CRD>1942 </bibl>

3.3 Phrase-level elements

Phrase-level elements are elements which cannot appear directly within a textual division, but must be contained by some other element. In practice, this means they will be contained within an <s> element.

3.3.1 Highlighted phrases

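Typographic highlighting in the original may not be marked in the transcript at all. Alternatively, highlighted phrases, and the kind of highlighting used, may be recorded in one of two ways: by supplying the r attribute on the cdif element that encloses the phrase, or by enclosing the phrase in a <hi> element.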

The former is used where the function of the highlighting is clear, for example to mark a heading, and where the boundaries of the highlighted phrase therefore coincide with the boundaries of some other cdif element. The latter is used where the function is not clear, where cdif does not provide a tag to identify the feature concerned or where the highlighted phrase is not coterminous with some other cdif element.

When the <hi> element is used, its r attribute must be supplied. On all other cdif elements, the r attribute is optional. Its value indicates the nature of the highlighting used, e.g. italic font, quoted, small caps etc. A list of the values used for this attribute is given in section 6.4 below.

It should be noted that the purpose of the r attribute is not to provide information adequate to the needs of a typesetter, but simply to record some qualitative information about the original. In particular, the present version of the corpus includes no indication of size of type or style of writing.

Like all other phrase-level elements, each <hi> element must be entirely contained by an <s> element. This implies that where, for example, a bolded passage contains more than one sentence, or an italicised phrase begins in one verse line and ends in another, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next.

For example, in the following four lines of verse, the first three are rendered in italics, and the r attribute is therefore specified for each <l> element. In the fourth line, only the first few words are in italics: a <hi> element must be used within the <l> to carry this information.

<l part=U r=it>
<s n=394><w PNP>It <w VBD>was <w CRD>one <w PRF>of <w AT0>a <w NN0>pair<c PUN>.
<s n=395><w DPS>Its <w AJ0>precious <w NN1>twin
</l>
<l part=U r=it>
<s n=396><w VBD>was <w VVN>stolen <w PRP>by <w AT0>the <w NN2>soldiers<c PUN>.
<s n=397><w DT0>All <w AT0>the <w NN1>time
</l>
<l part=U r=it>
<s n=398><w DPS>her <w NN1>uncle <w VVD>stood <w AV0>there <w VVG>clutching <w DT0>this
<w CRD-PNI>one <w PRP>in
</l>
<l part=U>
<s n=399><hi r=it> <w DPS>his <w AJ0>big <w NN1>fist </hi> <c PUN>&mdash <w AV0>so<c PUN>!
<s n=400><w PNP>She <w VDZ>does <w DT0>a little <w NN1>mime<c PUN>.
</l>
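
The rule that a <hi> element must be closed at the end of its enclosing <s> and re-opened in the next can be sketched as follows. This Python fragment is illustrative only: the function mark_highlight and its input format (sentence number plus a list of word and highlighted-flag pairs) are assumptions, and the <w> and <c> tagging is omitted to keep the output short.

```python
from typing import List, Tuple

Sentence = Tuple[str, List[Tuple[str, bool]]]   # (n value, [(word, highlighted)])

def mark_highlight(sentences: List[Sentence], rend: str = "it") -> List[str]:
    """Emit one marked-up string per sentence, never letting <hi> cross <s>."""
    output: List[str] = []
    for number, tokens in sentences:
        parts = [f"<s n={number}>"]
        open_hi = False
        for word, highlighted in tokens:
            if highlighted and not open_hi:
                parts.append(f"<hi r={rend}>")
                open_hi = True
            elif not highlighted and open_hi:
                parts.append("</hi>")
                open_hi = False
            parts.append(word)
        if open_hi:
            # Close at the sentence boundary; the next sentence re-opens it.
            parts.append("</hi>")
        output.append(" ".join(parts))
    return output

# A highlighted passage that runs from the end of one sentence into the next.
passage = [
    ("1", [("plain", False), ("bold", True)]),
    ("2", [("still", True), ("done", False)]),
]
for line in mark_highlight(passage):
    print(line)
```

Where every word of an enclosing element is highlighted, the r attribute on that element would be used instead, as in the first three verse lines of the example above.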

3.3.2 Miscellaneous phrase-level elements

The following miscellaneous phrase-level elements also appear within <s> elements in written texts:

In this example, the presence of a page break between two verse lines is indicated by the <pb> tag:

<l part=U>
<s n=1403>
<c PUN>&mdash <w CJC>and <w NN2>creditors <w VVB>grow <w AJ0>cruel<c PUN>,
</l>
<l part=U>
<pb n=75>
</l>
<l part=U>
<s n=1404>
<w AV0>so <w PNP>he <w VVZ>bows <w CJC>and <w NN2-VVZ>scrapes<c PUN>,
</l>

