Markup Scheme for the British National Corpus: the issues and some initial proposals <author>Lou Burnard <date>25 April 1991 <abstract> This is an initial discussion paper intended to help in the definition of the target encoding scheme to be used for all texts to be included in the British National Corpus (BNC). It gives a brief summary of the overall nature of the proposed encoding scheme and its relationship with the current proposals of the Text Encoding Initiative (TEI) and discusses the various textual features which the markup scheme should distinguish. Future versions of this document will include a reference section in which the agreed tagging scheme is more concisely documented. </abstract> </front><body> <div1><head>One DTD or many? <p>I begin with the highly questionable assumption that it is both possible and desirable to define a single Document Type Definition (DTD) encompassing the full range of materials to be included in the BNC. Clearly, at some level, this must be possible, since in many cases (perhaps most), users of the corpus will want to treat it as a single highly organised resource, from which comparable instances of word usage can be extracted. Equally clearly, the corpus will contain many types of structural unit that are highly specific to certain kinds of discourse. At the very least, there will be textual features that simply cannot occur in some kinds of text, and which validating software should be capable of rejecting as erroneous. For example, it might be felt essential to distinguish such features as address, salutation and signature within letters. Nevertheless, it seems worth the effort of defining, at least initially, a view based on the commonalities of the corpus, and expanding this to include additional features and alternative views at a later stage only when this highly generalised approach has proved demonstrably inadequate either to the source materials to hand or to some specific applications. 
(For further discussion see section 4.1 of JHC's paper on markup for the Oxford Pilot Corpus) <div1><head>An interchange format <p>If we have only one DTD, or a very small number of them, then texts using it can be interchanged effectively with minimal formality. This interchange of texts -- both initially between collaborators in the project and subsequently amongst the research community using it -- is crucial to the success of the project. In designing the scheme, I have therefore given highest priority to facilitating interchange rather than local processing or ease of data capture and propose to call the format itself the <hi>Corpus Document Interchange Format</hi> or CDIF. <p>The advantages of a single interchange format based on SGML do not need to be rehearsed here, though it is worth stressing that interfaces between CDIF and several local processors, whether for data capture or enrichment, are much easier to specify and to write than interfaces amongst several local processors. <p>Compatibility with the TEI's recently formulated draft TEI P1:<citn>Guidelines for the Encoding and Interchange of Machine Readable Text</citn> (hereinafter P1) is of obvious importance to the eventual re-usability of the BNC, and many of the suggestions in this document derive from it. It is however more important, in my view, that our CDIF should identify and distinguish a subset of textual features also identified in P1 than that it should call them by the same names. <div1><head>Overview <p>This section summarizes the rest. Textual elements described in more detail below are <hi>bolded</hi>; the assumption is that occurrences of each such element will be marked up in CDIF using an SGML start and end-tag pair named after the element. <div2><head>Overall structure <p>The BNC will consist of a large number of discrete <hi>text</hi>s, each of which will have an identifying <hi>header</hi>. 
The corpus itself also has a header, in which documentary information relating to all of the texts will be held. Each text has an optional <hi>front</hi>, a <hi>body</hi> and an optional <hi>back</hi>. Each of these units is subdivided into <hi>div</hi>s, representing the major structural divisions of the text in question. The largest subdivision of a given text will be tagged div1, the next smallest div2 and so on. Written prose texts may also be further subdivided into <hi>p</hi>s (paragraphs) and spoken texts into <hi>turn</hi>s. In all texts, the smallest structural unit will be the <hi>s</hi> or segment corresponding with a unit of linguistic analysis roughly analogous to the conventional orthographic sentence, though no particular linguistic claim is made for it. <div2><head>Floating Features <p>Within this overall structure, it should be possible to mark a number of non-structural or floating features. Examples include <hi>head</hi> for titles and captions (not properly floating, since they are generally tied to a particular structural element); <hi>q</hi> for quoted matter and direct speech; <hi>list</hi> for lists and <hi>item</hi> for the items within them; <hi>note</hi> for footnotes etc.; <hi>corr</hi> for editorial corrections of the original source made by the encoder; and, optionally, a variety of lexically `awkward' items such as <hi>abbr</hi>eviations, <hi>acronym</hi>s, <hi>number</hi>s, <hi>name</hi>s, <hi>date</hi>s, <hi>citn</hi> for bibliographic or other citations, <hi>address</hi> for street addresses and <hi>foreign</hi> for non-English words or phrases. Distinguishing most of these would simplify the task of automatic word-class tagging as well as facilitating more sophisticated kinds of analysis, though for some the cost may prove prohibitive. <div2 id=refs><head>Reference scheme <p>A single referencing scheme, based on the structural hierarchy outlined above, will be automatically generated as texts are loaded into the corpus. 
Thus a given s will acquire a number indicating its sequence within the enclosing div, itself identified by its number within any enclosing div above it, and ultimately within the enclosing text. For example, the value <q>T98.1.9/12</q> might identify the 12th s in chapter 9 of book 1 of the text with number T98. This will only work, of course, if segments are nested within the other structural tags. <p>In addition, for texts derived from printed sources, it may be convenient to include page or column references. As these cannot easily be accommodated within the same hierarchic structure, a decision needs to be taken as to whether page breaks should be indicated by empty `milestone' tags (for which I would propose the shorter name <hi>point</hi>) or by a separate concurrent hierarchy of <hi>page</hi> elements. This issue is discussed further below, and also in P1 (section 5.6.4). <div2><head>Word level analysis <p>Below the s level come the individual tokens to which Lancaster will be attaching word class codes. More discussion is needed before recommendations can be made here: some of the issues are summarised below. At the very least, a <hi>w</hi> tag will be needed. To represent the more complex kinds of linguistic analysis, in particular alignments between individual tokens which cross structural boundaries, we would need to use the alignment map mechanism discussed in P1 section 6.2; this is not envisaged for the initial CDIF format at least. <div2><head>Text entities <p>A uniform system must be adopted for representing the huge range of orthographic and other symbols encountered in written texts, and also the various paralinguistic phenomena encountered in spoken texts. The simplest such system is to use SGML entity references for all of these: names for most written symbols are already available in standardised <q>entity sets</q>, while a suitable set of names for items such as pauses, ums and ers, inaudible mutters etc. needs to be devised. 
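<p>As an illustrative sketch only, a fragment using entity references might look like the following. The names eacute, pound and mdash are taken from the standard ISO 8879 entity sets (ISOlat1, ISOnum and ISOpub respectively); the names pause, um and unclear, together with their declarations, are invented here purely as placeholders for whatever set is eventually agreed for spoken material.
<xmp>
<!-- hypothetical declarations for spoken-text phenomena: names are placeholders -->
<!ENTITY pause   SDATA "[pause]"   -- unfilled pause -->
<!ENTITY um      SDATA "[um]"      -- filled pause -->
<!ENTITY unclear SDATA "[unclear]" -- inaudible or unintelligible stretch -->

<!-- written text: symbols referenced by their ISO entity names -->
<s>The caf&eacute; charged &pound;2 &mdash; far too much.</s>

<!-- spoken text: paralinguistic phenomena referenced in the same way -->
<s>&um; I thought &pause; it was &unclear; yesterday</s>
</xmp>
<p>The attraction of this approach is that a local processor need only supply its own replacement text for each entity name, leaving the interchanged text itself untouched.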
<div2><head>Presentational vs. Descriptive Markup <p>For a brief discussion of the difference between the descriptive and the presentational approaches to markup of corpora, see Clear, op cit, section 2. The compromise proposed there is a reasonable one, but skates over some of the more difficult cases, such as the use of different typesizes in newspaper material (which may be highly significant) or use of white space in verse. The recommendation appears to be to use descriptive markup, but to retain renditional aspects of written texts for which conventions have existed long enough to become the subject of study in themselves (e.g. punctuation), discarding all others. This is pragmatic, but needs a better rationale. <p>The mechanism P1 recommends (use of a RENDITION attribute, which may be attached to any tagged element) is optimal from the point of view of flexibility, and (in cases where some sort of presentational markup is already present in the text, as will normally be the case for typesetting tapes etc.) is also probably easy to automate. It does however imply the need to create textual features such as <hi>highlighted</hi> or <hi>q.mark</hi> where rendition is to be attached to some textual element not otherwise distinguished, which may seem rather artificial. It would also greatly add to the density of the tagging, while providing access to a number of textual features of occasionally questionable linguistic importance. <p>The best course seems to be to retain only presentational features which are of clear relevance in linguistic analysis, such as punctuation, capitalisation and (probably) font changes. Such features as lineation, use of white space and different sizes of type will be silently dropped from the tagging. As noted above, an exception will be made for pagination, if it is felt that access to the page numbers of the original source provides a useful additional referencing mechanism to that provided by the existing structural reference scheme. 
Lineation of verse or song will also be preserved, since here it has a structural significance. <div2><head>Handling of existing markup <p>The extent and nature of the markup already present in texts entering the corpus is a fairly unknown quantity of which the most we can say is that it is likely to vary greatly, but will probably be heavily oriented towards the presentational. As far as possible, we will attempt to convert this automatically to descriptive markup as outlined in this document. A case might be made for retaining in some way whatever level of encoding exists in all texts so that their original appearance is reproducible: my feeling however is that this would introduce both an unwarranted degree of complexity and an intolerable amount of inconsistency. Our aim is a single encoding scheme in which encoding and text are clearly distinguished; this, in my view, implies that the corpus should use only SGML tags to mark all textual features, rather than a mixture, in which SGML tags are used for structural features, and an ad hoc collection of different schemes for e.g. rendition or linguistic analysis. The implications of this are twofold: firstly, any feature marked up in a text for which we have no corresponding SGML tag will be silently dropped from the encoding; secondly, all markup introduced into a text must be converted to an SGML form. <p>This need not, of course, imply that the word class tagging produced by Lancaster must be rewritten to generate SGML tags (to take the obvious example). It does however imply that a conversion into SGML must be undertaken at some stage before the enhanced texts are integrated into the Corpus. <div1><head>Structural Features <p>The structural features of a corpus component are the units and sub-units of which it is composed. 
CDIF makes the assumption that every corpus component can be completely described by a single hierarchic tree, in which tokens form the leaf nodes, segments the next highest, and the text itself forms the root. Features which do not form part of this structure are, by definition, non-structural or floating features, discussed in the next section. <div2><head>What is a text? <p>For our purposes, a text is simply a discrete textual unit. Its relationship to other textual objects outside the Corpus does not concern us: it may be a sample from a book, a collection of fragments or a complete work, but these are matters for the task group on Corpus Design to pronounce on. For distribution purposes, the text will form the basic unit of access to the corpus, though a case might be made for sampling at a lower level in some situations. Documentation of the corpus will also be carried out in terms of texts. <div2><head>The header <p>Each text has a header describing its provenance, classification and status. Exactly what information should be recorded here and under what headings has yet to be determined and will be the subject of a separate working paper, for which input from the Corpus Design taskgroup will be needed. If the taxonomy already worked out for the Longman Corpus is adopted for the whole corpus, as seems highly desirable, then keywords characterising each text along the various dimensions which that taxonomy defines will be included here. Bibliographic details of written texts will also be included in the header, as will speaker information for spoken texts. Version control details (level of tagging, correction status etc.) might also be included here, and will be automatically incorporated from the text management database. <p>Note that header information common to several texts within the corpus (for example, definition of any taxonomic codes, short forms of cited works etc.) need not be repeated across all of them. 
There will be a separate header for the whole corpus in which all such declarations and definitions will be held. Individual text headers will invoke these global declarations by reference (see further P1, section 7.2). <div2><head>Front, body and back <p>Most printed texts contain initial prefatory matter (forewords, tables of contents, dedications etc.) which it is convenient to treat separately from the body of the text proper and from any appendixes or other back matter. It is not clear to what extent the Corpus will include complete printed works; if these are included, and if there is a consensus in favour of distinguishing the function of their parts, for example because they may be held to exhibit different linguistic characteristics from the rest of the text, then at the very least <tag>front</tag> and <tag>back</tag> tags will be needed. P1 proposes a range of subdivisions within these (for example <tag>title.page</tag>, <tag>preface</tag>, <tag>dedication</tag>, <tag>contents</tag> etc) which are probably of less relevance in a linguistically oriented corpus. <div2><head>Divs <p>Different types of text may be organised in different ways: into sections, chapters or parts in a conventional prose text; newspapers or magazines are organised into stories; poems and plays into a variety of highly genre-specific categories. To unify all these we propose a single high-level structural organisation based on what P1 calls <hi>div</hi>s. The largest subdivisions of the body of a text will be tagged simply <tag>div1</tag>. If these are further subdivided, each subdivision will be tagged as a <tag>div2</tag>. Sub-subdivisions will be tagged as <tag>div3</tag>, and so on. If thought desirable, an attribute giving the name (for example `chapter') of the structural subdivision could be added. 
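<p>The nesting described above might be sketched as follows. The attribute name TYPE, its values and the headings are assumptions made purely for the purpose of illustration; nothing in this paper fixes them.
<xmp>
<body>
  <div1 type=part><head>Book One</head>
    <div2 type=chapter><head>Chapter Nine</head>
      <!-- type=... is a hypothetical attribute naming the kind of subdivision -->
      <p><s>First segment of the chapter.</s> <s>Second segment.</s></p>
    </div2>
  </div1>
</body>
</xmp>
<p>Because the depth of a subdivision is carried by the tag name itself, a processor can recover the level of any element without needing to know what a given kind of text happens to call it.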
<div2><head>Paragraphs <p>While structural units such as divs are chiefly of use to locate word occurrences within a source for reference purposes, paragraph units are chiefly of use to delimit meaningfully large chunks of text, so that (for example) retrieval software can provide substantially more context for word occurrences than is provided by the conventional KWIC concordance. Within running prose, paragraphs are easily detected and will be tagged with the <tag>p</tag> tag. Whether equivalent units (stanza for example) are needed for verse is more open to question. In some kinds of texts, for example newspapers, the paragraph is an important stylistic and linguistic feature which should not be neglected. <p>In spoken texts, the structurally equivalent unit will be tagged <hi>turn</hi>. This is an extension to P1, which does not have much to say about spoken texts. <div2><head>Segments and lines <p>The smallest structural unit larger than an individual word, corresponding more or less with an orthographic sentence in written texts, is here called a segment, tagged with an <tag>s</tag> tag. Automatic or semi-automatic procedures for identifying segment boundaries in written texts are not hard to find, and will be applied to provide manageable units of analysis for the linguistic tagging. <p>Because the s tag is, like div, intended to be semantically neutral, representing only a segmentation of the text rather than any profound linguistic claim, it could also be used to mark units of analysis in spoken texts. Identifying the boundaries of such units in spoken texts is rather less simple however. To cater for highly analysed texts, segmented at several hierarchic levels, the <hi>s</hi> tag may be nested recursively. <div1><head>Global attributes <p>All tagged elements may carry additional information relating to the specific element occurrence, represented in the encoding scheme by SGML attribute/value pairs. 
Three such attributes are proposed as potentially useful for all element types: ID, which supplies a system-generated identifier for the element occurrence, N which may be used to supply an alternative identification, and RENDITION which specifies its rendition or physical appearance. Other attributes may be found useful for other element types, notably LANG for textual elements in languages other than English. <div2><head>ID and N <p>As discussed in section <xref target=refs>, the ID attribute is used primarily to supply a reference value for each structural unit within the corpus. These will be automatically generated during the process of loading a text into the corpus, and will provide each S with a unique identifier. <p>The N attribute may be used additionally if the reference scheme implied by the hierarchically organised ID values differs from conventional practice for the text in question. I am not sure whether it will be useful for our purposes. <div2><head>Rendition <p>The Rendition attribute is necessary only if it is decided that some level of presentational markup should be preserved. I have argued above that it should not, but if there is a consensus in favour of at least attempting to preserve information about the way in which particular textual elements are presented in a source text, we will need to decide on an appropriate set of codes to specify that information. <div1><head>Floating Features <p>This section lists a large number of textual features which might be distinguished in addition to the structural features discussed so far. All of them nest within paragraphs or turns; most of them within segments. For most of them, a tag is provided in P1 and is specified below. Some of these items are easily (and non-controversially) identified by automatic or semi-automatic means; others are not. Distinguishing some would enormously enhance the usefulness of the corpus; for others the benefit would be marginal. 
The benefits are likely to be twofold: in the long run researchers will be able to ask more interesting and detailed questions of the corpus material more simply; in the short run corpus processing and enrichment will be incrementally simplified. From these two perspectives, a few of these features are, in my opinion, essential; the majority are probably desirable or merely nice. I have presented the list in alphabetical order so as not to pre-empt discussion. <div2><head>Abbreviations and acronyms <p>A reason sometimes given for tagging abbreviations is that the stops they often contain confuse stupid sentence-recognition algorithms. This argument carries no weight when, as we intend, sentences are explicitly tagged. However, a case could be made for abbreviations and acronyms as being intrinsically interesting linguistic objects. P1 proposes a tag <tag>abbrev</tag>, with a TYPE attribute to distinguish abbreviations proper from acronyms, and other attributes to supply the full form of the phrase. This is almost certainly unnecessary for our purposes. <div2><head>Addresses <p>Street addresses and similar items such as telephone numbers are fairly easy to pick out by their inclusion of digits, and probably disrupt straightforward linguistic analysis sufficiently to warrant tagging them as such. P1 proposes a tag <tag>address</tag> with very detailed content in which street and town names etc. are distinguished: these again seem unnecessary for our purposes. <div2><head>Bibliographic references <p>For linguistic purposes we probably do not need to identify the subcomponents of such phrases as <q>Buggins, loc cit</q> or distinguish author, title and page references within the body of a formal bibliographic citation. Marking that a particular phrase is such a citation would however be of considerable usefulness, though possibly of quite high difficulty to determine automatically. The tag <tag>citn</tag> is provided by P1 for this purpose. 
I propose to use it additionally for titles of books, songs etc. with or without the formal appearance of a bibliographic citation. <div2><head>Dates and numbers <p>P1 proposes that dates and numbers should be marked as such largely for linguistic purposes. A number of corpus studies have shown both the unexpectedly high frequency of numeric strings in written language, and an immense variety of ways in which dates and numbers are presented. This literature should be reviewed before making a firm decision about the feasibility or desirability of distinguishing these items in CDIF. <div2><head>Editorial corrections <p>P1 (5.4) proposes a number of tags for situations in which correction or normalisation of the source material by the transcriber or an editor is to be recorded. There are not likely to be many of these in our material, and they will of necessity have to be inserted manually, so that following the specification of P1 should not be too onerous. Corrections made by the transcriber should be tagged with the <tag>corr</tag> tag; the original reading, if required, being specified additionally as the value of a SIC attribute. Additions and deletions made by the transcriber should similarly be tagged using the <tag>add</tag> and <tag>del</tag> tags respectively. <p>Note that the same tags could be applied to the spoken material where the transcriber wishes to indicate that normalisation additional to the usual transcription has occurred, together with an indication of the un-normalised or uncorrected form. <div2><head>Emphasised words and phrases <p>Linguistic foregrounding or emphasis is characteristically realised in printed texts by a variety of forms of highlighting, in spoken texts by a variety of prosodic features. 
As both highlighting and (say) raised pitch have other functions and are thus ambiguous, it might be thought advantageous to make explicit cases of linguistic emphasis, assuming that a clear decision procedure for what constitutes such can be agreed. The <tag>emph</tag> tag is available for this purpose. <div2><head>Figures, graphics, non-textual objects <p>The presence of non-textual objects such as illustrations, displayed mathematical formulae, tables of numbers etc. has traditionally been indicated by an appropriate note, and we could represent them in the same way without difficulty. The advantage of marking the location of a figure (etc) explicitly with an empty <tag>figure</tag> tag is that the latter can be linked to an external representation of the figure or more significantly with an associated textual object such as a heading. <p>Determining the exact location of non-textual objects within the sequence of a text is often problematic, particularly in newspaper-like material. <div2><head>Foreign words and phrases <p>A convenient way of determining whether a given word or phrase is sufficiently <q>foreign</q> to be tagged as such might be the presence of some renditional distinction, such as italics, underlining, quotes etc, in the original. Otherwise much fruitless discussion is likely as to whether or not (for example) <q>croissant</q> is a foreign word. However it is arrived at, the usefulness of making a distinction between words regarded as part of the language and words not so regarded seems self evident. <p>P1 proposes that the language of any textual element should be determined by the value of a global LANG attribute. This attribute has the useful characteristic that its value can be specified as a default for all element occurrences lower down the hierarchy. P1 also proposes that, where a change of language occurs in the middle of a structural unit, a tag <tag>foreign</tag> may be supplied to carry the required LANG attribute. 
Whether or not we use this tag will therefore depend on the granularity of our structural tagging -- if the smallest structural unit in CDIF is a lexical token, there will be no need for a special tag for foreign phrases. <div2><head>Headings and captions <p>The function of a heading is to introduce a structural subdivision of some kind: it is therefore not a strictly floating feature, being constrained to appear at the beginning (or end) of some other feature. One possible exception is the kind of subheading often used in magazine design, where a part of the text is repeated in a display box to organise the page in a visually satisfying way. Another is the heading attached to a floating feature, such as the caption attached to a picture. Identifying headings and captions in a text is usually simple, while to distinguish them from the body of the text seems essential. <p>Distinguishing amongst different kinds or levels of heading seems less useful however, and a single <tag>head</tag> tag should be adequate for all structurally bound headings and captions. Note that an application program processing the markup will have access to information about where a heading appears within the structure, and will thus be able to distinguish a heading found at the start of a div3 from one found at the start of a div1. <p>Running heads in a printed text, like page numbers, may be regarded as presentational features only and may therefore be disregarded. Very occasionally, for example in children's books, we may find running heads which change from page to page: these will have to be treated as headings which have no associated structural unit, unless we declare a concurrent page-based hierarchy. <div2><head>Highlighted phrases <p>When the underlying cause for highlighting (i.e. typographic emphasis such as bolding, italics, change of typesize etc) is evident, it should be represented by the appropriate tag (emph, citn, head etc.). 
There will remain some cases where no decision can be made: these should be marked with the <tag>hi</tag> tag. <p>One possible scenario for data conversion might be to translate all typographically distinguished elements initially into highlighted units, and to then refine these on the basis of an expanding body of usage rules. <div2><head>Lists and list items <p>Lists can appear in both spoken and written texts and appear to disrupt the normal hierarchic structure in both cases. The solution to this proposed in P1 is to allow for lists to appear within paragraphs and between paragraphs, but not to span paragraphs. P1 also distinguishes ordered or enumerated lists from unordered lists and from glossary lists (which are really a rudimentary sort of table, with only two columns) and proposes a variety of tags to deal with such things as list headings, item enumerators etc. (See P1 5.3.8) Of the three styles proposed there for dealing with item enumerators, the last (in which the enumerator is explicitly tagged with an <tag>enum</tag> tag) seems the most appropriate for our purposes. <div2><head>Names of persons, places and institutions <p>Tagging explicitly the names of persons and places would greatly improve the performance of some basic text processing operations, if only by preventing such false tokenisations as the following: <xmp> ... <w>New</w><w>York-born</w><w>financier</w> ... </xmp> More research needs to be done to determine the ease and reliability with which algorithms can be implemented to identify proper names by means of surface features such as capitalisation. P1 proposes a <q>propname</q> tag, with an attribute to identify the type of name concerned. I propose to simplify this to <tag>name</tag> and to leave its type unspecified. 
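<p>By way of a sketch, and assuming only the simplified <tag>name</tag> tag proposed above, the phrase cited earlier might first be marked up with the name identified, and then be tokenised with the name kept intact. The second form shows just one possible treatment of the hyphenated compound and is not offered as a recommendation.
<xmp>
<!-- step 1: the name is identified before tokenisation -->
... <name>New York</name>-born financier ...

<!-- step 2 (one hypothetical outcome): the tokeniser does not split inside the name -->
... <w><name>New York</name>-born</w><w>financier</w> ...
</xmp>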
<div2><head>Notes <p>Footnotes and end notes in an original printed source should be distinguished from notes or comments supplied by the transcriber; for the most part the latter should be restricted to editorial corrections, as discussed above. P1 (5.3.9) proposes a single <tag>note</tag> tag, with a number of attributes specifying for example the note's function, provenance and location. <p>P1 assumes that the body of a note will be given at the point in the text at which it occurs <tag>note</tag>like this<tag>/note</tag>, which is the most convenient and efficient way when texts are being prepared by hand. For texts which are being converted from a typesetting tape (for example) or which come off an optical scanner, the much more likely situation is that notes will be found out of the normal sequence <note>like this</note> as they would on the printed page. To convert from one form to the other is not difficult, but requires some thought. <p>An alternative approach would be to mark the footnote reference as a point in the text, to collect all the footnote bodies together in a separate optional section, probably within the <tag>back</tag>, and rely on a co-referring SGML identifier to link the two together. This might be regarded as more faithful to the appearance of the original, in some cases, but would involve more complex programming, both when texts are being converted and whenever they are being subsequently processed. <div2><head>Pagination and lineation <p>As mentioned above, there are two possible approaches to the problem of representing pagination. One is to define a concurrent hierarchy in which pages are regarded as textual elements in their own right and carry ID values to identify them. This would be appropriate if the pagination of the original source material was of enduring importance in its own right and likely to be the subject of much research. 
The other is to mark only the points at which the page breaks of the original source occur, using empty elements, named <q>milestones</q> in P1. The disadvantage of the first method is that not all SGML processors can handle the optional CONCUR facility efficiently or at all; the disadvantage of the second is that the markup no longer reflects the fact that a page has scope as well as boundaries. On balance, I believe that the second is the lesser of two evils, and propose that we mark page breaks only, using a <tag>point</tag> at the start of each page. The ID attribute for this tag will carry an automatically generated number unique within the corpus; the N attribute will carry the number printed on the source page. <p>This method can be generalised to encompass other structural units which do not fit into the main hierarchy, in which case an additional attribute UNIT will be used to indicate the kind of unit concerned. One particularly useful application would be to represent the lineation of verse in this way, rather than defining a concurrent hierarchy for metrical structure. <div2><head>Quotations and direct speech <p>P1 does not distinguish quoted matter (where an authorial voice is attributing a piece of discourse to someone else) from dialogue (where the work itself attributes discourse to one or more speakers), on the grounds that the distinction is hard to sustain in most kinds of imaginative writing. I share this opinion and propose that all quoted matter, whether included in narrative, set off as a block quotation, or presented as dialogue should all be regarded as the same textual feature, which may be tagged using a single <tag>q</tag> tag. <p>Quoted elements are rendered in a variety of different ways in printed texts: they may be italicised, enclosed in quotation marks of various kinds, set off in blocks etc. If these distinctions are to be preserved, then the RENDITION attribute is the most appropriate method. 
A small set of codes identifying the different kinds of rendition needs to be defined. <p>Two minor complications with quoted matter are that it may be nested within other quoted matter and that it may be interrupted by phrases such as <q>he said</q>. For the latter, P1 offers a special tag <tag>in.quot</tag>; the former should present few problems, provided that q tags can self-nest. <p>A more serious complication is the position of the q element within the single hierarchy proposed here. Given that one single quotation might contain several segments, while another might be contained entirely within one segment, it is clear that segment and quotation at least belong to different hierarchies. Quotations can behave in a similarly cavalier way with respect to paragraphs or other structural divisions, and thus would appear to be candidates for a concurrent hierarchy, unless we can agree to a convention whereby a quotation spanning paragraphs is tacitly treated as two consecutive quotations, perhaps with an attribute indicating whether or not it initiates or concludes a sequence of such fragmented quotations. <div2><head>Other matter in quotation marks <p>Occasionally quotation marks are used in printed text to indicate matters other than quotation or direct speech: generally when titles etc are being cited or more generally whenever words are mentioned rather than used. The tag <tag>q.mark</tag> is provided to mark such occasions where we cannot determine which descriptive tag to apply, analogously with the <tag>hi</tag> discussed above. <div1><head>Word level tagging <p>Hic desunt multa <div2><head>Words <div2><head>Alignment maps <div1><head>Alphabetical list of tags <p>This section lists all the features to be distinguished by markup, giving for each: <ul> <li>a proposed tag name <li>a definition <li>the name used for the corresponding or related feature or features in the current TEI scheme, if any, and a reference to the section in P1 where it is discussed. 
<li>an indication as to whether explicit encoding of this feature is Essential, Desirable or Nice. <li>an indication as to whether automatic recognition and tagging of this feature is likely to be Easy, Tricky or Impossible. </ul> <gl> <gt>abbr<gd>abbrev[5.3.7] <gt>address<gd>address[7.5.3] <gt>al.map<gd>al.map[6.2.5] <gt>back<gd>back[5.2.5] <gt>body<gd>body[5.2.4] <gt>citn<gd>citn[5.5] <gt>corr<gd>corr[5.4] <gt>date<gd>date[5.3.11] <gt>div1<gd>div1[5.2.4] <gt>emph<gd>emph[5.3.2] <gt>enum<gd>enum[5.3.8] <gt>figure<gd>figure[5.9] <gt>foreign<gd>foreign[5.3.4] <gt>front<gd>front[5.2.3] <gt>head<gd>head[5.2.4] <gt>header<gd>tei.header[4] <gt>hi<gd>highlighted[5.3.2] <gt>item<gd>item[5.3.8] <gt>l<gd>l[7.3.1] <gt>list<gd>list[5.3.8] <gt>name<gd>propname[5.3.6] <gt>note<gd>note[5.3.9] <gt>number<gd>num[5.3.11] <gt>p<gd>p[5.3.1] <gt>point<gd>milestone[5.6.4] <gt>q<gd>q[5.3.3] <gt>q.mark<gd>q.mark[5.3.3] <gt>s<gd>s[5.8] <gt>text<gd>tei.1 <gt>turn<gd>No equivalent <gt>w<gd>No equivalent </gl> <div1><head>Entity lists </ldoc>