Compatibility issues

The first version of the BNC was released slightly in advance of the publication of the TEI's definitive Recommendations, and over a year before publication of the Corpus Encoding Standard. Although all three standards have much in common (in particular, CDIF - Corpus Document Interchange Format - the BNC's own initial DTD, was influential in the design of the other two), they are not identical. Several elements are named differently, and some, more significantly, have different content models or attributes.

In the present release of the Corpus, considerable effort has been made to improve compatibility of the BNC DTD with TEI and with CES, while retaining as far as possible a degree of compatibility with CDIF. The objective was to ensure that a document which conformed to the BNC's DTD would also conform to either of the other two standards, rather than to ensure that any CES or TEI conformant document would also be BNC conformant. This necessarily involved some modification of the original tagging of the corpus, which is detailed in this section.

Differences between the BNC DTD and TEI

The present version of the BNC document type definition (DTD) can be expressed as a set of extensions to the standard TEI DTD, using the extension mechanism recommended by that standard. Full details of the procedure are given in chapter 3 of the TEI Guidelines. Essentially, the procedure requires the definition of two ‘extension’ files, called here bncMods.ent and bncMods.dtd. The former contains definitions of the parameter entities needed for this set of extensions; the latter contains the actual SGML element and attribute definitions which make up the required modifications. Copies of these files are included in the present release, along with the DTD derived from them. The present section describes their content informally.

The DTD described elsewhere in this document makes use of several elements already defined in other TEI tagsets, in particular the base tag sets for prose and for spoken texts, and the additional tagsets for language corpora and analysis. To combine all of these with the extension files mentioned above, a TEI conformant document will begin as follows:
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.spoken "INCLUDE">
  <!ENTITY % TEI.general "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!ENTITY % TEI.corpora "INCLUDE">
  <!ENTITY % TEI.extensions.ent SYSTEM "bncMods.ent">
  <!ENTITY % TEI.extensions.dtd SYSTEM "bncMods.dtd">
]>

This file can be compiled into a single-file version of the DTD, in which all parameter entity references have been resolved and any redundant declarations removed, using software such as the TEI PizzaChef.

The file bncMods.ent consists of a number of SGML parameter entity definitions, which override the definitions provided in the TEI DTD itself. These declarations have the following effects:
  • to exclude from the DTD a large number of standard TEI elements which are not actually used in the BNC DTD;
  • to provide alternative names for some standard TEI elements;
  • to exclude from the TEI DTD some elements which are redefined, either with a stricter content model, or with differing attribute lists, in the BNC DTD;
  • to specify the location within the TEI class system of some elements not defined in the TEI DTD.
Taking these in turn, some 114 standard TEI elements are excluded from the DTD by means of parameter entity declarations like the following:
<!ENTITY % ab "IGNORE">
<!-- ... -->
<!ENTITY % xref "IGNORE">

The following is a complete list of standard TEI elements excluded in this way: <ab>, <abbr>, <add>, <affiliation>, <alt>, <altGrp>, <anchor>, <argument>, <authority>, <back>, <biblFull>, <birth>, <broadcast>, <byline>, <cb>, <channel>, <cit>, <cl>, <constitution>, <correction>, <dateline>, <dateRange>, <del>, <derivation>, <distinct>, <div0>, <div5>, <div6>, <div7>, <divGen>, <docAuthor>, <docDate>, <docEdition>, <docImprint>, <docTitle>, <domain>, <education>, <emph>, <epigraph>, <equipment>, <expan>, <factuality>, <firstLang>, <foreign>, <front>, <fsdDecl>, <funder>, <gloss>, <group>, <headLabel>, <headItem>, <hyphenation>, <index>, <interp>, <interpGrp>, <interpretation>, <join>, <joinGrp>, <kinesic>, <link>, <linkGrp>, <m>, <measure>, <meeting>, <metDecl>, <mentioned>, <milestone>, <normalization>, <notesStmt>, <num>, <opener>, <orig>, <personGrp>, <phr>, <postBox>, <postCode>, <preparedness>, <principal>, <purpose>, <q>, <quotation>, <rendition>, <residence>, <rs>, <reg>, <scriptStmt>, <seg>, <segmentation>, <series>, <seriesStmt>, <signed>, <soCalled>, <socecStatus>, <span>, <spanGrp>, <sponsor>, <state>, <stdVals>, <step>, <street>, <symbol>, <textDesc>, <time>, <timeRange>, <titlePage>, <titlePart>, <trailer>, <variantEncoding>, <when>, <writing>, <xptr>, <xref>.
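
The way an IGNORE value suppresses an element can be sketched as follows. This is a simplified illustration, not an extract from the TEI DTD: it assumes that each element declaration in the main DTD is wrapped in a marked section keyed on a parameter entity named after the element, and the content model shown for <ab> is a placeholder. Because SGML honours only the first declaration of an entity, and the extension file is read before the TEI defaults, the value given in the extension file takes precedence.

```sgml
<!-- From the extension file, read before the TEI defaults -->
<!ENTITY % ab "IGNORE">

<!-- Default in the main DTD (sketch); it has no effect here
     because the entity ab has already been declared -->
<!ENTITY % ab "INCLUDE">

<!-- The marked section is skipped when %ab; expands to IGNORE,
     so the element is never declared (placeholder content model) -->
<![ %ab; [
<!ELEMENT ab - - (#PCDATA)>
]]>
```

If the extension file made no declaration for ab, the default value INCLUDE would apply and the element would be declared as usual.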

Four elements in the TEI DTD are given different names in the BNC DTD. For example, the TEI element <speaker> is renamed <spkr>. The declarations below effect this and the other renamings required, by changing the value of the relevant parameter entity:
<!ENTITY % n.teiCorpus.2 "bnc">
<!ENTITY % n.TEI.2 "bncDoc">
<!ENTITY % n.p "para">
<!ENTITY % n.speaker "spkr">
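
The renaming relies on the SGML rule that only the first declaration of an entity is honoured. The sketch below assumes that the main DTD declares each element through a parameter entity holding its name, prefixed with n.; the content model is a placeholder for illustration only and is not the one used by TEI or the BNC.

```sgml
<!-- From the extension file: declared first, so this value is used -->
<!ENTITY % n.speaker "spkr">

<!-- Default name in the main DTD (sketch); ignored as a duplicate -->
<!ENTITY % n.speaker "speaker">

<!-- Declaration written against the entity rather than a literal name;
     with the override above it declares an element called spkr -->
<!ELEMENT %n.speaker; - O (#PCDATA)>
```
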
The next part of the bncMods.ent file contains IGNORE declarations like those above, which have the effect of removing the existing definitions for 22 TEI elements which are to be redefined. The redefinitions are provided in the second of the two BNC extension files, bncMods.dtd, along with definitions for some new elements not otherwise available. Their effects are summarized in the following table.
Table 1. Summary of differences between TEI and BNC

TEI element    Difference in BNC DTD
<TEI.2>        Changed content model to allow either text or stext; renamed as bncDoc
<activity>     Simplified content model; added attribute
<age>          New element
<align>        New element
<author>       Simplified content model; added attributes
<body>         Simplified content model
<c>            Redefined to use endtag and shortref minimization
<caption>      New element
<change>       Changed content model
<dialect>      New element
<div>          Changed content model, specific to speech
<div1>         Simplified content model
<div2>         Simplified content model
<div3>         Simplified content model
<div4>         Simplified content model
<item>         Changed to disallow mixed content
<loc>          New element
<p>            New simplified element (TEI p is renamed para)
<person>       Simplified content model; added attributes
<poem>         New element
<quote>        Changed content model to disallow mixed content
<recording>    Simplified content model; added attributes
<s>            Simplified content model
<shift>        Mandatory attribute made optional
<sp>           Changed content model
<stext>        New element
<text>         Simplified content model
<trunc>        New element
<unclear>      Simplified content model; added attributes
<w>            Redefined to use endtag and shortref minimization

Finally, as mentioned above, there are six elements defined in the BNC DTD which do not appear in the TEI DTD. These must be added to the appropriate element class in the TEI content model. The following declarations in the bncMods.ent file have that effect:
<!ENTITY % x.chunk "p|">
<!ENTITY % x.common "caption|poem|unclear|">
<!ENTITY % x.divtop "align|">
<!ENTITY % x.seg "trunc|">
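
The trailing vertical bar in each value makes sense once the entity is substituted into a class definition. The sketch below assumes that the main DTD defines an empty default for x.seg and begins the corresponding class entity, here called m.seg, with a reference to it. The class members listed are illustrative rather than a complete copy of the TEI definition.

```sgml
<!-- From bncMods.ent: note the trailing "|" -->
<!ENTITY % x.seg "trunc|">

<!-- Empty default in the main DTD (sketch); ignored as a duplicate -->
<!ENTITY % x.seg "">

<!-- Class definition (illustrative member list): the reference
     expands so that the value begins "trunc| s | w | c" -->
<!ENTITY % m.seg "%x.seg; s | w | c">
```

With the empty default the class would contain only its standard members; with the override, trunc is allowed wherever a member of the class is allowed.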

Detailed discussion of the extension mechanism and general conformance issues relating to the use of the TEI is given in chapters 28 and 29 of the TEI Guidelines and is not further discussed here. For an explanation of the mechanisms used above, the detailed presentation of the general organization of the TEI DTD provided in chapter 3 of the Guidelines may also be helpful.

Differences between the BNC DTD and CDIF

This section lists significant differences between the current BNC DTD and CDIF 1.0. It lists elements whose names have been changed, elements whose attributes have changed, and elements whose content has been changed in such a way that CDIF-conformant files will not parse against the new DTD.

The following CDIF elements have been given different names:
  • <avail> is now <availability>
  • <biblscop> is now <biblScope>
  • <biblstr> is now <biblStruct>
  • <bibnote> is now <note>
  • <clasdecl> is now <classDecl>
  • <corr> is now an <item> within <encodingDesc>
  • <editdecl> is now <editorialDecl>
  • <ednstmt> is now <editionStmt>
  • <encdesc> is now <encodingDesc>
  • <header> is now <teiHeader>
  • <hyph> is now an <item> within <encodingDesc>
  • <partics> is now <particDesc>
  • <profdesc> is now <profileDesc>
  • <projdesc> is now <projectDesc>
  • <pubstmt> is now <publicationStmt>
  • <quot> is now an <item> within <encodingDesc>
  • <rec> is now <recording>
  • <recstmt> is now <recordingStmt>
  • <reg> is now <corr>
  • <relation> is now <particLinks>
  • <revdesc> is now <revisionDesc>
  • <segm> is now an <item> within <encodingDesc>
  • <settdesc> is now <settingDesc>
  • <srcdesc> is now <sourceDesc>
  • <titstmt> is now <titleStmt>
  • <txtclass> is now <textClass>

The following elements have significantly different attributes:

  • <activity> has acquired the attribute spont, formerly present on its parent <setting>
  • <sic>, <gap>, and <corr> have all acquired attributes resp (rather than ed)
  • <corr> and <sic> no longer have a cause attribute
  • <gap> has acquired the attribute reason (in place of cause)
  • the complete attribute has been removed from <text> and <stext>
  • <w> has a different set of values for its type attribute
The content models for elements in the BNC DTD are generally less restrictive than those of CDIF. In the following list, we specify only those elements where an element conforming to the CDIF model would not also conform to the BNC model.
Table 2. CDIF elements whose content model has changed

<address>
  CDIF model: #PCDATA
  BNC model:  (addrLine+ | (name | postBox | postCode | street)*)
<avail>
  CDIF model: #PCDATA
  BNC model:  (para)*
<change>
  CDIF model: (date, respStmt+)
  BNC model:  (date, respStmt+, para)
<clasdecl>
  CDIF model: (category+)
  BNC model:  (taxonomy+)
<editdecl>
  CDIF model: (corr | quot | hyph | segm | trans)+
  BNC model:  (para+)
<ednstmt>
  CDIF model: #PCDATA
  BNC model:  ((edition, respStmt*) | para+)
<encDesc>
  CDIF model: (projDesc, (sampDecl | editDecl)*, refsDecl+, tagsDecl?, clasDecl?)
  BNC model:  (projectDesc*, samplingDecl*, editorialDecl*, tagsDecl?, refsDecl*, classDecl*, para*)
<imprint>
  CDIF model: (pubPlace | name | date)+
  BNC model:  (pubPlace | publisher | date | biblScope)*
<langusg>
  CDIF model: #PCDATA
  BNC model:  (para | language)+
<list>
  CDIF model: (head*, (label?, item)+)
  BNC model:  (head?, (item* | (label, item)*))
<monogr>
  CDIF model: title+, (author | respStmt)*, (edition, respStmt?)*, imprint*, (bibNote | idno | biblScop)*
  BNC model:  ((((author | editor | respStmt)+, title+, (editor | respStmt)*) | (title+, (author | editor | respStmt)*))?, (note | meeting)*, (edition, (editor | respStmt)*)*, imprint, (imprint | extent | biblScope)*)
<partics>
  CDIF model: (person+, relation*)
  BNC model:  (para+ | (person+, particLinks?))
<projdesc>
  CDIF model: #PCDATA
  BNC model:  (para)+
<refsdecl>
  CDIF model: #PCDATA
  BNC model:  (para)+
<sampdecl>
  CDIF model: #PCDATA
  BNC model:  (para)+
<setting>
  CDIF model: (locName?, locale?, activity?)
  BNC model:  (para+ | (name | date | locale | activity)*)
<sp>
  CDIF model: (spkr*, (p | sp | bibl | caption | list | note | poem | quote)*) +(stage)
  BNC model:  (spkr?, (p | l | lg | poem | stage | note | caption)+)
<text>
  CDIF model: ((p | sp | bibl | caption | list | note | poem | quote)*, (div1)*) +(gap | lb | loc | pb | ptr)
  BNC model:  (body) +(gap | lb | milestone | pb | ptr)
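
As a worked illustration of one row of Table 2, consider <list>. The fragment below is a made-up example, not taken from the corpus, and the content of each element is elided. It satisfies the CDIF model (head*, (label?, item)+), which lets labelled and unlabelled items alternate, but not the BNC model (head?, (item* | (label, item)*)), which requires the items to be either all unlabelled or all labelled.

```sgml
<list>
<head>...</head>
<label>...</label><item>...</item>
<item>...</item>  <!-- no label: rejected by the BNC model -->
</list>
```

Supplying a label for the second item, or removing the label from the first, would make the fragment conform to both models.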
