Written texts

Divisions of written texts

Written texts exhibit a rich variety of structural forms. Some have very little organization above the level of the paragraph; others may have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles and so forth. In the BNC all such structural divisions are represented uniformly by means of the div element.

In written texts, the n attribute is sometimes used to supply an identifying name or number used within the text for a given division, for example a chapter number, as in the following example:

...

More often, however, chapter names or numbers will appear within the text, tagged using the head element discussed below.

The value of the type attribute is used to characterise the function of the textual division (see the reference documentation for the values used). If a value is supplied for one division at a given level, it may be assumed to apply to all subsequent divisions at the same level until the end of the enclosing element, although it is not always explicitly specified.

Where div elements are nested, for example where the chapters of a novel are grouped into parts each of which may have its own title or number, the level attribute is used to indicate the depth of nesting. This is not strictly necessary (since an XML-aware processor retains this information) but has been added for the convenience of users of previous versions of the corpus, in which the level was explicitly coded into the name of the surrounding element (div1, div2 etc.).

In text ANY, for example, each chapter of the original novel corresponds to a div element with level="2", because the work contains groups of chapters, each of which begins with a page containing just a date. The opening of the text is therefore encoded as follows:

Monday, January 13th, 1986. Victor Wilcox lies awake ... ...
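The inheritance rule for the type attribute and the redundancy of the level attribute can be illustrated with a short Python sketch. It is not part of the BNC documentation: the sample fragment is invented, the wrapper element name wtext and the type values "part" and "chapter" are assumptions, and walk_divs is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration; not taken from a corpus file.
# The wrapper name "wtext" and the type values are assumptions.
SAMPLE = """
<wtext>
  <div level="1" n="1" type="part">
    <div level="2" type="chapter"><p><s n="1">One.</s></p></div>
    <div level="2"><p><s n="2">Two.</s></p></div>
  </div>
  <div level="1" n="2">
    <div level="2"><p><s n="3">Three.</s></p></div>
  </div>
</wtext>
"""

def walk_divs(parent, depth=1, out=None):
    """Collect (depth, level, n, effective type) for every div, in document order."""
    if out is None:
        out = []
    inherited = None  # last explicit type seen among the div children of this parent
    for div in parent.findall("div"):
        explicit = div.get("type")
        if explicit is not None:
            inherited = explicit
        out.append((depth, div.get("level"), div.get("n"), inherited))
        # A fresh call starts with no inherited type: the rule only holds
        # until the end of the enclosing element.
        walk_divs(div, depth + 1, out)
    return out

rows = walk_divs(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

In this sketch the second level-1 division inherits "part" from its earlier sibling, while the division nested inside it has no effective type, because no sibling within the same enclosing element supplied one. The depth computed from the tree always matches the level attribute, which is why the text describes that attribute as not strictly necessary.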
Note however that in some texts initial sentences (like Monday, January 13th, 1986 above) may have been misplaced, so that they appear at the start of an inner div rather than at the start of its parent.

A sequence of paragraph-level elements of arbitrary length may precede the first structural subdivision at any level. A text may have no structural divisions within it at all. Note that any prefatory or appended matter not forming part of a text will not generally be captured: the TEI elements front and back are not used.

Paragraph-level elements and chunks

Written texts may be organized into structural units containing more than one s element and smaller than any of the divisions discussed above. The most commonly found such element is the p (paragraph).

Several other elements may however appear directly within div or within text elements, not nested within some other element such as a paragraph. A list of these elements follows:

Each of these elements contains one or more s elements, as discussed above; in some cases these are enclosed by an intermediate element. They are used chiefly to indicate the function of sections of the text, as indicated in the list above. The following sections provide examples of the use of each of these elements.

Headings and captions

One or more head elements of specified types may appear in sequence at the start of any div element, or at the start of a list or poem, as in the following examples:

AGEISM THE FOUNDATION OF AGE DISCRIMINATION STEVE SCRUTTON ...

As shown above, the type attribute is used to distinguish more exactly the function of a heading.

Note that, in the BNC, captions or headings which float within the text, that is, which appear elsewhere than at the very beginning of the section which they name, are not encoded as head elements. A head element can appear only at the start of a text division and is logically associated with it (for example, chapter titles, newspaper headlines etc.).
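Because head elements can only open a division, the headings of a division can be read off as the leading run of its children. The following Python sketch shows one way to do this; the sample fragment, the type values "main" and "sub", and the helper names text_of and leading_heads are all illustrative assumptions and are not drawn from the corpus or its documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the content and the type values "main" and "sub"
# are invented for illustration.
SAMPLE = """
<div level="1" type="chapter">
  <head type="main"><s n="1"><w>CHAPTER </w><w>ONE</w></s></head>
  <head type="sub"><s n="2"><w>The </w><w>Arrival</w></s></head>
  <p><s n="3"><w>It </w><w>rained</w><c>.</c></s></p>
</div>
"""

def text_of(elem):
    """Join all text inside an element and normalise whitespace."""
    return " ".join("".join(elem.itertext()).split())

def leading_heads(container):
    """Return (type, text) for the run of head elements that opens a container."""
    heads = []
    for child in container:
        if child.tag != "head":
            break  # heads only open the container, so stop at the first non-head
        heads.append((child.get("type"), text_of(child)))
    return heads

div = ET.fromstring(SAMPLE)
heads = leading_heads(div)
print(heads)
```

The same function could be applied to a list or lg element, since the text says headings may also open a list or a poem.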
Paragraphs which provide heading or captioning information, but which are logically independent of their position within a textual division (for example, captions attached to pictures or figures, or pull-quotes embedded within the text) are represented in the same way as any other paragraph of text, using the p element, but with the value caption specified in their rend attribute. In the following example, the head element is followed by a number of captions introducing particular parts of a magazine story:

TROUSERS SUIT There is nothing masculine about these new trouser suits in summer's soft pastels.......

Quotations

A quotation is an extract from a work other than the text itself, embedded within it, for example as an epigraph or illustration. It is marked up using the quote element. This may contain any combination of other chunks (for example paragraphs, poems, lists) but may not directly contain w or s elements. A reference for the citation may also be contained within it. For example:

This way for the sorrowful city. ... Abandon all hope, you who enter… Dante

Spoken paragraphs

As noted above, the sp element is used to mark parts of a written text which were, or are intended to be, spoken, for example the speeches in a dramatic text or a published interview. Such parts are generally readily identifiable by the use of such conventions as speaker prefixes (the label supplying the name of the speaker) and stage directions, for which the following additional elements are used:

The sp element is used only for speech which is presented as such in a written text, by contrast with the u element discussed elsewhere in this documentation, which is used only for speaker turns identified in a spoken text, i.e. one which has been transcribed from audio tape.

If present, a speaker element will appear only at the start of the sp element, followed by one or more p elements containing the actual speech. Here is an example of a stage direction occurring within a speech:

Seven books a week.
He dances Library books.

These elements appear frequently in formal transcriptions of written proceedings, notably those parts of the BNC which are extracted from Hansard:

That millionaire mammy's boy — Interruption Mr. Speaker Order. That is not wholly unparliamentary.

Poetry

Poetry is distinguished from prose in the BNC where it is so presented in the original, for example as fragments of verse or song appearing within or between paragraphs of prose. The l (line) element is used to mark each verse line; where there are several such lines, perhaps with a heading, they are grouped together using the lg (linegroup) element, and any title or heading present is marked with a head element. For example:

I send my soul through time and space to greet you. You were a poet. You will understand.

Note that the l element is not used to mark typographic lineation. Layout information is not, in general, preserved in the BNC.

Lists

A list is a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit. Lists may appear within or between paragraphs. Where marked, lists are tagged with the list element, which may contain the following subelements:

A list element consists of an optional head element, followed by one or more item elements, each of which may optionally be preceded by a label element. The label holds the identifier or tag sometimes attached to a list item, for example (a); it may also contain a word or phrase used for a similar purpose.

The item element may appear only inside lists. It contains the same mixture of elements as a paragraph, and may thus contain one or more nested lists. It may also contain a series of paragraphs, each marked with a p element.

Here is an example of a simple list:

Is your nylon nightie fireproof? The hurricane was mighty fierce. Will you mow the lawn? Did you know the time?

Here is an example of a labelled list:

1. Surya — Sun — Creative agent
2. Vayu — Air — Preserving agent
3. Agni — Fire — Destructive agent

Notes and citations

Annotations occurring in written texts, and bibliographic citations or references, have been marked up in some texts, using the note element. Original notes may contain any mixture of other chunks, and may also contain paragraphs: they appear in written texts only. They may be relocated to the end of the section in which they appear. For example:

The short is a film about sailing. ...

Note the use of the n attribute to carry the original footnote number in the above example.

Bibliographic references

Bibliographic citations or references within running text may also be marked, using the bibl element; in the present version of the corpus this is done in some texts only. For example:

Zombie no go unless you tell im to go The Communards.

Note that the bibl element used within corpus texts has none of the more detailed sub-elements described for it elsewhere in this documentation. Like all the other elements described in the present subsection, the bibl element appearing within corpus texts contains only s elements.

Phrase-level elements

Phrase-level elements are elements which cannot appear directly within a textual division, but must be contained by some other element. In practice, this means they will be contained within an s element. In addition to the w, mw, and c elements already discussed, only the following phrase-level elements appear within s elements in written texts:

Page breaks

Wherever possible, the original pagination and page numbering of the source text have been preserved. The pb element is used to mark the approximate position in the text at which each new page starts, and its n attribute supplies the number of the page.
— and creditors grow cruel, so he bows and scrapes,

Where several pages have been left out of a transcription, for example because they are blank or contain illustrations only, a pb element may be given for each, as in this example:

I haven't been to an organized campsite for perhaps fifteen years, so all this is new to me.

Highlighted phrases

Typographic changes or highlighting in the original may not be marked in the transcript at all. Alternatively, highlighted phrases, and the kind of highlighting used, may be recorded in one of two ways:
  • using the rend (rendition) attribute on elements for which this is defined
  • using the hi (highlighted) element

The former is used where the whole of the content of one of the elements bibl, corr, div, head, item, l, label, list, p, quote or stage is highlighted. The latter is used on all other occasions. In either case, the values available for the rend attribute and their significance are listed in the reference documentation. Note that the purpose of the rend attribute is not to provide information adequate to the needs of a typesetter, but simply to record some qualitative information about the original.

Like all other phrase-level elements, each hi element must be entirely contained by an s element. This implies that where, for example, a bolded passage contains more than one sentence, or an italicised phrase begins in one verse line and ends in another, the hi element must be closed at the end of the enclosing element and then re-opened within the next.

Apple is to fruit as dog is to X.

For example, in the following four lines of verse, the first three are rendered in italics, and the rend attribute is therefore specified for each l element. In the fourth line, only the first few words are in italics: a hi element must be used within the l to carry this information.

It was one of a pair. Its precious twin was stolen by the soldiers. All the time her uncle stood there clutching this one in his big fist — so! She does a little mime.
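
The two ways of recording highlighting can be handled uniformly by a processor, since both carry a rend attribute. The Python sketch below is a minimal illustration under stated assumptions: the verse fragment is invented, the rend value "it" is assumed to stand for italics, and text_of and highlighted are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Hypothetical verse fragment, invented for illustration.
# The rend value "it" (italic) is an assumption.
SAMPLE = """
<lg>
  <l rend="it"><s n="1"><w>It </w><w>was </w><w>one</w><c>.</c></s></l>
  <l><s n="2"><hi rend="it"><w>So</w><c>!</c></hi><w> She </w><w>mimes</w><c>.</c></s></l>
</lg>
"""

def text_of(elem):
    """Join all text inside an element and normalise whitespace."""
    return " ".join("".join(elem.itertext()).split())

def highlighted(root):
    """List (tag, rend, text) for every element carrying a rend attribute."""
    return [(e.tag, e.get("rend"), text_of(e)) for e in root.iter() if e.get("rend")]

spans = highlighted(ET.fromstring(SAMPLE))
for span in spans:
    print(span)
```

The first line is reported once as a whole l element, while the second line reports only the hi span nested inside its s element. A highlighted passage that crosses a sentence or line boundary would appear as several consecutive entries, since hi must be closed and re-opened at each boundary.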