I begin with the highly questionable assumption that it is
both possible and desirable to define a single Document Type
Definition (DTD) encompassing the full range of materials to be
included in the BNC. Clearly, at some level, this must be
possible, since in many cases (perhaps most), users of the corpus
will want to treat it as a single highly organised resource, from
which comparable instances of word usage can be extracted.
Equally clearly, the corpus will contain many types of structural
unit that are highly specific to certain kinds of discourse. At
the very least, there will be textual features that simply cannot
occur in some kinds of text, and which validating software should
be capable of rejecting as erroneous. For example, it might be
felt essential to distinguish such features as address,
salutation and signature within letters. Nevertheless, it seems
worth the effort of defining, at least initially, a view based on
the commonalities of the corpus, and expanding this to include
additional features and alternative views at a later stage only
when this highly generalised approach has proved demonstrably
inadequate either to the source materials to hand or to some
specific applications. (For further discussion see section 4.1 of
JHC's paper on markup for the Oxford Pilot Corpus.)
If we have only one DTD, or a very small number of them, then
texts using it can be interchanged effectively with minimal
formality. This interchange of texts -- both initially between
collaborators in the project and subsequently amongst the
research community using it -- is crucial to the success of the
project. In designing the scheme, I have therefore given highest
priority to facilitating interchange rather than local processing
or ease of data capture, and propose to call the format itself the
Corpus Document Interchange Format or CDIF.
The advantages of a single interchange format based on SGML do
not need to be rehearsed here, though it is worth stressing that
interfaces between CDIF and several local processors, whether for
data capture or enrichment, are much easier to specify and to
write than interfaces amongst several local processors.
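Compatibility with the TEI's recently formulated draft TEI
P1:
This section summarizes the rest. Textual elements described
in more detail below are bolded; the assumption is that
occurrences of each such element will be marked up in CDIF using
an SGML start and end-tag pair named after the element.
The BNC will consist of a large number of discrete texts, each
of which will have an identifying header. The corpus itself also
has a header, in which documentary information relating to all of
the texts will be held. Each text has an optional front, a body
and an optional back. Each of these units is subdivided into
divs, representing the major structural divisions of the text in
question. The largest subdivision of a given text will be tagged
div1, the next smallest div2 and so on. Written prose texts may
also be further subdivided into ps (paragraphs) and spoken texts
into turns. In all texts, the smallest structural unit will be
the s or segment corresponding with a unit of linguistic analysis
roughly analogous to the conventional orthographic sentence,
though no particular linguistic claim is made for it.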
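Within this overall structure, it should be possible to mark a
number of non-structural or floating features. Examples include
head for titles and captions (not properly floating, since they
are generally tied to a particular structural element); q for
quoted matter and direct speech; list for lists and item for the
items within them; note for footnotes etc.; corr for editorial
corrections of the original source made by the encoder; and,
optionally, a variety of lexically `awkward' items such as
abbreviations, acronyms, numbers, names, dates, citn for
bibliographic or other citations, address for street addresses
and foreign for non-English words or phrases. Distinguishing most
of these would simplify the task of automatic word-class tagging
as well as facilitating more sophisticated kinds of analysis,
though for some the cost may prove prohibitive.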
A single referencing scheme, based on the structural hierarchy
outlined above, will be automatically generated as texts are
loaded into the corpus. Thus a given s will acquire a number
indicating its sequence within the enclosing div, itself
identified by its number within any enclosing div above it, and
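ultimately within the enclosing text. For example, the value
T98.1.9/12 might identify the 12th s in chapter 9 of book 1 of
the text with number T98. This will only work, of course, if
segments are nested within the other structural tags.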
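The shape of such a reference can be sketched in a few lines of Python. This is only an illustration, assuming the separators seen in the example value T98.1.9/12 (a full stop between the text identifier and each div number, a slash before the s number); the function names `make_ref` and `parse_ref` are hypothetical and not part of CDIF.

```python
def make_ref(text_id, div_numbers, s_number):
    """Build a reference such as 'T98.1.9/12' from its components."""
    # Text identifier and div numbers are joined with full stops.
    path = ".".join([text_id] + [str(n) for n in div_numbers])
    # The segment number follows a slash.
    return f"{path}/{s_number}"


def parse_ref(ref):
    """Split a reference into (text_id, div_numbers, s_number)."""
    path, s_number = ref.rsplit("/", 1)
    text_id, *divs = path.split(".")
    return text_id, [int(n) for n in divs], int(s_number)
```

Under these assumptions, `make_ref("T98", [1, 9], 12)` yields the example value, and `parse_ref` recovers the text identifier, the chain of enclosing div numbers and the segment number from it.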
In addition, for texts derived from printed sources, it may be
convenient to include page or column references. As these cannot
easily be accommodated within the same hierarchic structure, a
decision needs to be taken as to whether page breaks should be
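indicated by empty `milestone' tags (for which I would propose the
shorter name point) or by a separate concurrent hierarchy of page
elements. This issue is discussed further below, and also in P1
(section 5.6.4).
Below the s level come the individual tokens to which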
Lancaster will be attaching word class codes. More discussion is
needed before recommendations can be made here: some of the
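issues are summarised below. At the very least, a w tag will be
needed. To represent the more complex kinds of linguistic
analysis, in particular alignments between individual tokens
which cross structural boundaries, we would need to use the
alignment map mechanism discussed in P1 section 6.2; this is not
envisaged for the initial CDIF format at least.
A uniform system must be adopted for representing the huge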
range of orthographic and other symbols encountered in written
texts, and also the various paralinguistic phenomena encountered
in spoken texts. The simplest such system is to use SGML entity
references for all of these: names for most written symbols are
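already available in standardised entity sets, while a suitable
set of names for items such as pauses, ums and ers, inaudible
mutters etc. needs to be devised.
For a brief discussion of the difference between the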
descriptive and the presentational approaches to markup of
corpora, see Clear, op cit, section 2. The compromise proposed
there is a reasonable one, but skates over some of the more
difficult cases, such as the use of different typesizes in
newspaper material (which may be highly significant) or use of
white space in verse. The recommendation appears to be to use
descriptive markup, but to retain renditional aspects of written
texts for which conventions have existed long enough to become
the subject of study in themselves (e.g. punctuation), discarding
all others. This is pragmatic, but needs a better rationale.
The mechanism P1 recommends (use of a RENDITION attribute,
which may be attached to any tagged element) is optimal from the
point of view of flexibility, and (in cases where some sort of
presentational markup is already present in the text, as will
normally be the case for typesetting tapes etc.) is also probably
easy to automate. It does however imply the need to create
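textual features such as highlighted or q.mark where rendition is
to be attached to some textual element not otherwise
distinguished, which may seem rather artificial. It would also
greatly add to the density of the tagging, while providing access
to a number of textual features of occasionally questionable
linguistic importance.
The best course seems to be to retain only presentational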
features which are of clear relevance in linguistic analysis,
such as punctuation, capitalisation and (probably) font changes.
Such features as lineation, use of white space and different
sizes of type will be silently dropped from the tagging. As noted
above, an exception will be made for pagination, if it is felt
that access to the page numbers of the original source provides a
useful additional referencing mechanism to that provided by the
existing structural reference scheme. Lineation of verse or song
will also be preserved, since here it has a structural
significance.
The extent and nature of the markup already present in texts
entering the corpus is a fairly unknown quantity of which the
most we can say is that it is likely to vary greatly, but will
probably be heavily oriented towards the presentational. As far
as possible, we will attempt to convert this automatically to
descriptive markup as outlined in this document. A case might be
made for retaining in some way whatever level of encoding exists
in all texts so that their original appearance is reproducible:
my feeling however is that this would introduce both an
unwarranted degree of complexity and an intolerable amount of
inconsistency. Our aim is a single encoding scheme in which
encoding and text are clearly distinguished; this, in my view,
implies that the corpus should use only SGML tags to mark all
textual features, rather than a mixture, in which SGML tags are
used for structural features, and an ad hoc collection of
different schemes for e.g. rendition or linguistic analysis. The
implications of this are twofold: firstly, any feature marked up
in a text for which we have no corresponding SGML tag will be
silently dropped from the encoding; secondly, all markup
introduced into a text must be converted to an SGML form.
This need not, of course, imply that the word class tagging
produced by Lancaster must be rewritten to generate SGML tags (to
take the obvious example). It does however imply that a
conversion into SGML must be undertaken at some stage before the
enhanced texts are integrated into the Corpus.
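The two implications stated above (unmapped features are silently dropped; everything retained ends up as SGML) can be illustrated with a minimal Python sketch. The input representation, a list of (code, text) pairs, and the source codes in `TAG_FOR_CODE` are purely hypothetical assumptions; only the target tag names head, q and note come from this document.

```python
# Hypothetical codes found in incoming texts, mapped to CDIF tag names.
TAG_FOR_CODE = {"H": "head", "Q": "q", "N": "note"}


def to_sgml(spans):
    """Convert (code, text) pairs to a string using SGML tags only.

    A span whose code has no corresponding tag keeps its text but
    loses its markup. Escaping of '<' and '&' inside the text is
    deliberately left out of this sketch.
    """
    out = []
    for code, text in spans:
        tag = TAG_FOR_CODE.get(code)
        if tag is None:
            out.append(text)                      # markup silently dropped
        else:
            out.append(f"<{tag}>{text}</{tag}>")  # markup converted
    return "".join(out)
```

For instance, a span carrying an unrecognised code "X" passes through as plain text, while a span coded "H" is wrapped in a head start-tag and end-tag.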
The structural features of a corpus component are the units
and sub-units of which it is composed. CDIF makes the assumption
that every corpus component can be completely described by a
single hierarchic tree, in which tokens form the leaf nodes,
segments the next highest, and the text itself forms the root.
Features which do not form part of this structure are, by
definition, non-structural or floating features, discussed in the
next section.
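As a rough illustration of the single-tree assumption, the following Python sketch models a corpus component as nested nodes and derives segment references from the nesting alone. The class `Node` and the functions `tokens` and `number_segments` are hypothetical names introduced for this example, and the reference format follows the T98.1.9/12 pattern used elsewhere in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                                   # "text", "div", "s" or "w"
    children: list = field(default_factory=list)
    word: str = ""                             # set only on w (token) leaves


def tokens(node):
    """Return the leaf tokens below a node, in document order."""
    if node.tag == "w":
        return [node.word]
    return [w for child in node.children for w in tokens(child)]


def number_segments(node, prefix):
    """Map references to s nodes, numbering each within its parent."""
    refs = {}
    div_n = s_n = 0
    for child in node.children:
        if child.tag == "div":
            div_n += 1
            refs.update(number_segments(child, f"{prefix}.{div_n}"))
        elif child.tag == "s":
            s_n += 1
            refs[f"{prefix}/{s_n}"] = child
    return refs
```

Calling `number_segments(text, "T98")` on a text node returns one reference per segment; a feature that cannot be placed in such a tree is, in the terms used here, a floating feature.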
For our purposes, a text is simply a discrete textual unit.
Its relationship to other textual objects outside the Corpus does
not concern us: it may be a sample from a book, a collection of
fragments or a complete work, but these are matters for the task
group on Corpus Design to pronounce on. For distribution
purposes, the text will form the basic unit of access to the
corpus, though a case might be made for sampling at a lower level
in some situations. Documentation of the corpus will also be
carried out in terms of texts.
Each text has a header describing its provenance,
classification and status. Exactly what information should be
recorded here and under what headings has yet to be determined
and will be the subject of a separate working paper, for which
input from the Corpus Design taskgroup will be needed. If the
taxonomy already worked out for the Longman Corpus is adopted for
the whole corpus, as seems highly desirable, then keywords
characterising each text along the various dimensions which that
taxonomy defines will be included here. Bibliographic details of
written texts will also be included in the header, as will
speaker information for spoken texts. Version control details
(level of tagging, correction status etc.) might also be included
here, and will be automatically incorporated from the text
management database.
Note that header information common to several texts within
the corpus (for example, definition of any taxonomic codes, short
forms of cited works etc.) need not be repeated across all of
them. There will be a separate header for the whole corpus in
which all such declarations and definitions will be held.
Individual text headers will invoke these global declarations by
reference (see further P1, section 7.2)
Most printed texts contain initial prefatory matter
(forewords, tables of contents, dedications etc.) which it is
convenient to treat separately from the body of the text proper
and from any appendixes or other back matter. It is not clear to
what extent the Corpus will include complete printed works; if
these are included, and if there is a consensus in favour of
distinguishing the function of their parts, for example because
they may be held to exhibit different linguistic characteristics
from the rest of the text, then at the very least
Different types of text may be organised in different ways:
into sections, chapters or parts in a conventional prose text;
newspapers or magazines are organised into stories; poems and
plays into a variety of highly genre-specific categories. To
unify all these we propose a single high-level structural
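organisation based on what P1 calls divs. The largest
subdivisions of the body of a text will be tagged simply
In spoken texts, the structurally equivalent unit will be
tagged turn. This is an extension to P1, which does not have much
to say about spoken texts.
The smallest structural unit larger than an individual word,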
corresponding more or less with an orthographic sentence in
written texts, is here called a segment, tagged with an
Because the s tag is, like div, intended to be semantically
neutral, representing only a segmentation of the text rather than
any profound linguistic claim, it could also be used to mark
units of analysis in spoken texts. Identifying the boundaries of
such units in spoken texts is rather less simple however. To
cater for highly analysed texts, segmented at several hierarchic
levels, the s tag may be nested recursively.
All tagged elements may carry additional information relating
to the specific element occurrence, represented in the encoding
scheme by SGML attribute/value pairs. Three such attributes are
proposed as potentially useful for all element types: ID, which
supplies a system-generated identifier for the element
occurrence, N which may be used to supply an alternative
identification, and RENDITION which specifies its rendition or
physical appearance. Other attributes may be found useful for
other element types, notably LANG for textual elements in
languages other than English.
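A small Python sketch of how attribute/value pairs might be written on a start-tag follows. It is an illustration only: the helper `start_tag` and the sample values ("s12", "italic") are hypothetical, and no escaping of quotation marks inside values is attempted.

```python
def start_tag(name, **attrs):
    """Render an SGML start-tag with upper-case attribute names."""
    parts = [name] + [f'{key.upper()}="{value}"' for key, value in attrs.items()]
    return "<" + " ".join(parts) + ">"
```

With these assumptions, `start_tag("s", id="s12", rendition="italic")` produces a start-tag carrying an ID and a RENDITION attribute, and an element with no attributes is rendered as a bare tag.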
As discussed in section
The N attribute may be used additionally if the reference
scheme implied by the hierarchically organised ID values differs
from conventional practice for the text in question. I am not
sure whether it will be useful for our purposes.
The RENDITION attribute is necessary only if it is decided
that some level of presentational markup should be preserved. I
have argued above that it should not, but if there is a consensus
in favour of at least attempting to preserve information about
the way in which particular textual elements are presented in a
source text, we will need to decide on an appropriate set of
codes to specify that information.
This section lists a large number of textual features which
might be distinguished in addition to the structural features
discussed so far. All of them nest within paragraphs or turns;
most of them within segments. For most of them, a tag is provided
in P1 and is specified below. Some of these items are easily (and
non-controversially) identified by automatic or semi-automatic
means; others are not. Distinguishing some would enormously
enhance the usefulness of the corpus; for others the benefit
would be marginal. The benefits are likely to be twofold: in the
long run researchers will be able to ask more interesting and
detailed questions of the corpus material more simply; in the
short run corpus processing and enrichment will be incrementally
simplified. From these two perspectives, a few of these features
are, in my opinion, essential; the majority are probably
desirable or merely nice. I have presented the list in
alphabetical order so as not to pre-empt discussion.
A reason sometimes given for tagging abbreviations is that the
stops they often contain confuse stupid sentence-recognition
algorithms. This argument carries no weight when, as we intend,
sentences are explicitly tagged. However, a case could be made
for abbreviations and acronyms as being intrinsically interesting
linguistic objects. P1 proposes a tag
Street addresses and similar items such as telephone numbers
are fairly easy to pick out by their inclusion of digits, and
probably disrupt straightforward linguistic analysis sufficiently
to warrant tagging them as such. P1 proposes a tag
For linguistic purposes we probably do not need to identify
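the subcomponents of such phrases as Buggins, loc cit or
distinguish author, title and page references within the body of
a formal bibliographic citation. Marking that a particular phrase
is such a citation would however be of considerable usefulness,
though possibly of quite high difficulty to determine
automatically. The tag
P1 proposes that dates and numbers should be marked as such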
largely for linguistic purposes. A number of corpus studies have
shown both the unexpectedly high frequency of numeric strings in
written language, and an immense variety of ways in which dates
and numbers are presented. This literature should be reviewed
before making a firm decision about the feasibility or
desirability of distinguishing these items in CDIF.
P1 (5.4) proposes a number of tags for situations in which
correction or normalisation of the source material by the
transcriber or an editor is to be recorded. There are not likely
to be many of these in our material, and they will of necessity
have to be inserted manually, so that following the
specification of P1 should not be too onerous. Corrections made
by the transcriber should be tagged with the
Note that the same tags could be applied to the spoken
material where the transcriber wishes to indicate that
normalisation additional to the usual transcription has occurred,
together with an indication of the un-normalised or uncorrected
form.
Linguistic foregrounding or emphasis is characteristically
realised in printed texts by a variety of forms of highlighting,
in spoken texts by a variety of prosodic features. As both
highlighting and (say) raised pitch have other functions and are
thus ambiguous, it might be thought advantageous to make explicit
cases of linguistic emphasis, assuming that a clear decision
procedure for what constitutes such can be agreed. The
The presence of non-textual objects such as illustrations,
displayed mathematical formulae, tables of numbers etc. has
traditionally been indicated by an appropriate note, and we could
represent them in the same way without difficulty. The advantage
of marking the location of a figure (etc) explicitly with an
empty
Determining the exact location of non-textual objects within
the sequence of a text is often problematic, particularly in
newspaper-like material.
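A convenient way of determining whether a given word or phrase
is sufficiently foreign to be tagged as such might be the
presence of some renditional distinction, such as italics,
underlining, quotes etc, in the original. Otherwise much
fruitless discussion is likely as to whether or not (for example)
croissant is a foreign word. However it is arrived at, the
usefulness of making a distinction between words regarded as part
of the language and words not so regarded seems self evident.
P1 proposes that the language of any textual element should be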
determined by the value of a global LANG attribute. This
attribute has the useful characteristic that its value can be
specified as a default for all element occurrences lower down the
hierarchy. P1 also proposes that, where a change of language
occurs in the middle of a structural unit, a tag
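The defaulting behaviour described for LANG can be sketched as a walk up the hierarchy. This is a minimal illustration under stated assumptions: the `Element` class, the `effective_lang` function and the language codes "en" and "fr" are hypothetical, and a real SGML processor would resolve inherited values in its own way.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    tag: str
    lang: Optional[str] = None            # explicit LANG value, if any
    parent: Optional["Element"] = None    # enclosing element, if any


def effective_lang(element, corpus_default="en"):
    """Return the nearest LANG value found on the element or an ancestor."""
    node = element
    while node is not None:
        if node.lang is not None:
            return node.lang
        node = node.parent
    return corpus_default
```

An s with no LANG of its own thus takes the value of its enclosing div, and an element with no LANG anywhere above it falls back to the assumed corpus-wide default.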
The function of a heading is to introduce a structural
subdivision of some kind: it is therefore not a strictly floating
feature, being constrained to appear at the beginning (or end) of
some other feature. One possible exception is the kind of
subheading often used in magazine design, where a part of the
text is repeated in a display box to organise the page in a
visually satisfying way. Another is the heading attached to a
floating feature, such as the caption attached to a picture.
Identifying headings and captions in a text is usually simple,
while to distinguish them from the body of the text seems
essential.
Distinguishing amongst different kinds or levels of heading
seems less useful however, and a single
Running heads in a printed text, like page numbers, may be
regarded as presentational features only and may therefore be
disregarded. Very occasionally, for example in children's books,
we may find running heads which change from page to page: these
will have to be treated as headings which have no associated
structural unit, unless we declare a concurrent page-based
hierarchy.
When the underlying cause for highlighting (i.e. typographic
emphasis such as bolding, italics, change of typesize etc) is
evident, it should be represented by the appropriate tag (emph,
citn, head etc.). There will remain some cases where no decision
can be made: these should be marked with the
One possible scenario for data conversion might be to
translate all typographically distinguished elements initially
into highlighted units, and to then refine these on the basis of
an expanding body of usage rules.
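The two-pass scenario just outlined might look roughly as follows in Python. Everything specific here is an assumption made for illustration: the dictionary representation of a unit, the rendition codes "bold" and "italic", and the two sample rules in `RULES`, which stand in for the expanding body of usage rules and make no claim to linguistic adequacy.

```python
def refine(units, rules):
    """Retag 'highlighted' units using the first rule that matches.

    units: list of dicts with keys 'tag', 'rend' and 'text'.
    rules: list of (predicate, tag) pairs, tried in order.
    Units matched by no rule stay tagged as highlighted.
    """
    refined = []
    for unit in units:
        new_tag = unit["tag"]
        if new_tag == "highlighted":
            for predicate, tag in rules:
                if predicate(unit):
                    new_tag = tag
                    break
        refined.append({**unit, "tag": new_tag})
    return refined


# Hypothetical usage rules, for illustration only.
RULES = [
    (lambda u: u["rend"] == "bold" and u["text"].istitle(), "head"),
    (lambda u: u["rend"] == "italic" and u["text"].endswith("!"), "emph"),
]
```

New rules can be appended to the list as conversion experience accumulates, and the pass can be rerun over whatever remains tagged as highlighted.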
Lists can appear in both spoken and written texts and appear
to disrupt the normal hierarchic structure in both cases. The
solution to this proposed in P1 is to allow for lists to appear
within paragraphs and between paragraphs, but not to span
paragraphs. P1 also distinguishes ordered or enumerated lists
from unordered lists and from glossary lists (which are really a
rudimentary sort of table, with only two columns) and proposes a
variety of tags to deal with such things as list headings, item
enumerators etc. (See P1 5.3.8) Of the three styles proposed
there for dealing with item enumerators, the last (in which the
enumerator is explicitly tagged with an
Tagging explicitly the names of persons and places would
greatly improve the performance of some basic text processing
operations, if only by preventing such false tokenisations as the
following:
Footnotes and end notes in an original printed source should
be distinguished from notes or comments supplied by the
transcriber; for the most part the latter should be restricted to
editorial corrections, as discussed above. P1 (5.3.9) proposes a
single
P1 assumes that the body of a note will be given at the point
in the text at which it occurs
An alternative approach would be to mark the footnote
reference as a point in the text, to collect all the footnote
bodies together in a separate optional section, probably within
the
As mentioned above, there are two possible approaches to the
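problem of representing pagination. The first is to define a
concurrent hierarchy in which pages are regarded as textual
elements in their own right, and carry ID values to identify
them. This would be appropriate if the pagination of the original
source material was of enduring importance in its own right and
likely to be the subject of much research. The other is to mark
only the points at which the page breaks of the original source
occur, using empty elements, named milestones in P1. The
disadvantage of the first method is that not all SGML processors
can handle the optional CONCUR facility efficiently or at all;
the disadvantage of the second is that the markup no longer
reflects the fact that a page has scope as well as boundaries. On
balance, I believe that the second is the lesser of two evils,
and propose that we mark page breaks only, using a
This method can be generalised to encompass other structural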
units which do not fit into the main hierarchy, in which case an
additional attribute UNIT will be used to indicate the kind of
unit concerned. One particularly useful application would be to
represent the lineation of verse in this way, rather than
defining a concurrent hierarchy for metrical structure.
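Although boundary-only markup does not record the scope of a page directly, that scope can be recovered by a simple scan. The Python sketch below assumes a hypothetical in-memory representation in which tokens are strings and each milestone is a tuple ("point", unit, number); the UNIT values "page" and "line" are likewise assumptions made for illustration.

```python
def page_of_each_token(stream, first_page=1):
    """Pair every token with the page on which it falls.

    stream mixes plain token strings with milestone tuples of the
    form ("point", unit, number). Only milestones whose unit is
    "page" change the current page; others are skipped here.
    """
    page = first_page
    result = []
    for item in stream:
        if isinstance(item, tuple):       # a milestone, not a token
            _, unit, number = item
            if unit == "page":
                page = number
        else:
            result.append((item, page))
    return result
```

The same scan, keyed on a different UNIT value, would recover verse lineation from line milestones without any concurrent hierarchy.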
P1 does not distinguish quoted matter (where an authorial
voice is attributing a piece of discourse to someone else) from
dialogue (where the work itself attributes discourse to one or
more speakers), on the grounds that the distinction is hard to
sustain in most kinds of imaginative writing. I share this
opinion and propose that all quoted matter, whether included in
narrative, set off as a block quotation, or presented as dialogue
should all be regarded as the same textual feature, which may be
tagged using a single
Quoted elements are rendered in a variety of different ways in
printed texts: they may be italicised, enclosed in quotation
marks of various kinds, set off in blocks etc. If these
distinctions are to be preserved, then the RENDITION attribute is
the most appropriate method. A small set of codes identifying the
different kinds of rendition needs to be defined.
Two minor complications with quoted matter are that it may be
nested within other quoted matter and that it may be interrupted
by phrases such as he said. For the latter, P1 offers a
special tag
A more serious complication is the position of the q element
within the single hierarchy proposed here. Given that one single
quotation might contain several segments, while another might be
contained entirely within one segment, it is clear that segment
and quotation at least belong to different hierarchies.
Quotations can behave in a similarly cavalier way with respect to
paragraphs or other structural divisions, and thus would appear
to be candidates for a concurrent hierarchy, unless we can agree
to a convention whereby a quotation spanning paragraphs is
tacitly treated as two consecutive quotations, perhaps with an
attribute indicating whether or not it initiates or concludes a
sequence of such fragmented quotations.
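The fragmenting convention suggested here could be sketched as follows. The function name `fragment_quotation` and the attribute values "single", "initial", "medial" and "final" are hypothetical; the paper proposes only that some attribute should indicate whether a fragment initiates or concludes a sequence.

```python
def fragment_quotation(parts):
    """Label the per-paragraph parts of one quotation.

    parts: the text of a single quotation, already split at
    paragraph boundaries. Returns (label, text) pairs, one per
    q element to be written.
    """
    if len(parts) == 1:
        return [("single", parts[0])]     # quotation does not span paragraphs
    last = len(parts) - 1
    labelled = []
    for i, text in enumerate(parts):
        if i == 0:
            label = "initial"
        elif i == last:
            label = "final"
        else:
            label = "medial"
        labelled.append((label, text))
    return labelled
```

A processor that needs the whole quotation back can then concatenate consecutive fragments from an initial one through to the matching final one.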
Occasionally quotation marks are used in printed text to
indicate matters other than quotation or direct speech: generally
when titles etc are being cited or more generally whenever words
are mentioned rather than used. The tag
Hic desunt multa
This section lists all the features to be distinguished by
markup, giving for each:
Corpus Document Interchange Format or CDIF.
bolded; the assumption is that
occurrences of each such element will be marked up in CDIF using
an SGML start and end-tag pair named after the element.
texts, each of which will have an identifying
header. The corpus itself also has a header, in which
documentary information relating to all of the texts will be
held. Each text has an optional front, a body
and an optional back. Each of these units is subdivided
into divs, representing the major structural divisions
of the text in question. The largest subdivision of a given text
will be tagged div1, the next smallest div2 and so on. Written
prose texts may also be further subdivided into ps
(paragraphs) and spoken texts into turns. In all texts,
the smallest structural unit will be the s or segment
corresponding with a unit of linguistic analysis roughly
analogous to the conventional orthographic sentence, though no
particular linguistic claim is made for it.
head for titles and captions (not properly floating,
since they are generally tied to a particular structural
element); q for quoted matter and direct speech;
list for lists and item for the items within
them; note for footnotes etc.; corr for
editorial corrections of the original source made by the encoder;
and, optionally, a variety of lexically `awkward' items such as
abbreviations, acronyms, numbers,
names, dates, citn for bibliographic
or other citations, address for street addresses and
foreign for non-English words or phrases. Distinguishing
most of these would simplify the task of automatic word-class
tagging as well as facilitating more sophisticated kinds of
analysis, though for some the cost may prove prohibitive.
T98.1.9/12
might identify the 12th s in chapter 9 of book
1 of the text with number T98. This will only work, of course, if
segments are nested within the other structural tags.
point) or by a separate concurrent
hierarchy of page elements. This issue is discussed
further below, and also in P1 (section 5.6.4).
w tag
will be needed. To represent the more complex kinds of linguistic
analysis, in particular alignments between individual tokens
which cross structural boundaries, we would need to use the
alignment map mechanism discussed in P1 section 6.2; this is not
envisaged for the initial CDIF format at least.
entity sets
, while a
suitable set of names for items such as pauses, ums and ers,
inaudible mutters etc. needs to be devised.
highlighted or q.mark
where rendition is to be attached to some textual element not
otherwise distinguished, which may seem rather artificial. It
would also greatly add to the density of the tagging, while
providing access to a number of textual features of occasionally
questionable linguistic importance.
divs. The largest
subdivisions of the body of a text will be tagged simply
turn. This is an extension to P1, which does not
have much to say about spoken texts.
s tag may be nested recursively.
Buggins, loc cit
or
distinguish author, title and page references within the body of
a formal bibliographic citation. Marking that a particular phrase
is such a citation would however be of considerable usefulness,
though possibly of quite high difficulty to determine
automatically. The tag foreign
to be tagged as such might be the
presence of some renditional distinction, such as italics,
underlining, quotes etc, in the original. Otherwise much
fruitless discussion is likely as to whether or not (for example)
croissant
is a foreign word. However it is arrived at, the
usefulness of making a distinction between words regarded as part
of the language and words not so regarded seems self evident.
propname
tag, with an attribute to identify
the type of name concerned. I propose to simplify this to
milestones
in P1. The disadvantage of the
first method is that not all SGML processors can handle the
optional CONCUR facility efficiently or at all; the disadvantage
of the second is that the markup no longer reflects the fact that
a page has scope as well as boundaries. On balance, I believe
that the second is the lesser of two evils, and propose that we
mark page breaks only, using a he said
. For the latter, P1 offers a
special tag