[bnc] Software for the BNC - Users Reference Guide for the British National Corpus (XML Edition)

Software for the BNC

A design goal of the original BNC project was that it should not be delivered in a format which was proprietary or which required the use of any particular piece of software. This, together with the desire to conform to emerging international standards, was a key factor in determining the choice of SGML as the vehicle for the corpus interchange format. Six years after this decision, SGML is still a widely used international standard format for which many public domain and commercial utilities exist. Indeed, in the shape of XML, which is a simplified version of the original standard, SGML now dominates development of the world wide web, and hence of most sectors of the information processing community. New XML software appears almost every week, and it has been adopted by current ‘major players’ from Sun and IBM to Microsoft.

That said, it must be recognised that the requirements of corpus linguists and others wishing to make use of the BNC are often rather specialist, and therefore unlikely to be supported by mainstream commercially produced software. For this and other reasons, the research user of the BNC should expect to have to do some programming. This is another reason behind the choice of XML as a vehicle for the system: because of the wide take up of these formalisms, there exist many utility libraries and generic programming interfaces which greatly simplify such processes as extracting the tags from a file, selecting portions of the text according to its logical structure, picking out files with certain attributes by searching their headers, and so on.

The BNC uses XML in a simple and straightforward way described in the rest of this manual; simple programs can be readily written using standard UNIX utilities such as grep or perl to access the corpus just as plain text files. More reliably, programs can be written to application programming interfaces (APIs) such as the W3C's Document Object Model (DOM) or the Simple API for XML (SAX), using application libraries developed for almost every modern programming language (C, Perl, Python, Tcl etc.). Information about such resources is not provided here, but is readily found on the World Wide Web: currently, one good place to start looking is www.xml.com. Increasingly, support for XML is built into standard utilities such as web browsers and database systems, and stylesheet processors offering a high level of sophistication are readily available.
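As a minimal sketch of the library-based approach, the following Python fragment uses the standard xml.etree.ElementTree module to list the text of each <s> element in a small fragment and to count its <w> elements. The fragment is an invented sample which only imitates BNC markup: the nesting of <w> and <c> inside <s>, and the n attribute, are assumptions made for illustration, and the actual encoding is the one described elsewhere in this manual.

```python
import xml.etree.ElementTree as ET

# Invented sample imitating BNC markup; it is not taken from the corpus.
SAMPLE = """<wtext>
  <p>
    <s n="1"><w>The </w><w>cat </w><w>sat</w><c>.</c></s>
    <s n="2"><w>It </w><w>purred</w><c>.</c></s>
  </p>
</wtext>"""


def sentences(root):
    """Yield (n, text) for each <s> element, with all tags removed."""
    for s in root.iter("s"):
        yield s.get("n"), "".join(s.itertext())


def count_words(root):
    """Count the <w> elements anywhere below root."""
    return sum(1 for _ in root.iter("w"))


root = ET.fromstring(SAMPLE)
for n, text in sentences(root):
    print(n, text)
print("w elements:", count_words(root))
```

For a real text, ET.parse applied to the file would take the place of ET.fromstring; for very large inputs an event-based interface such as SAX may be preferable, since it does not hold the whole document in memory.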

When the BNC was first published, the top of the range personal computer might have as much as 50 or even 100 megabytes of disk storage and 8 Mb of RAM. At the time of writing, 50 or 100 gigabyte hard disks and 640 Mb of RAM are commonplace on entry level machines. It is thus quite likely that software capable of efficiently handling the 4.5 gigabytes of text which make up the BNC will also soon become commonplace. For the moment, however, it has to be recognized that general purpose tools for XML do not always cope very well with the large size of the whole corpus, although they can still be very useful for processing subsets extracted from it. To handle the whole of the corpus, special purpose indexing software will usually be necessary. Although such systems exist, they are often expensive or difficult to implement. For that reason, the XML edition of the BNC is still provided along with its own access software called Xaira (which can, incidentally, be used with any collection of XML texts, not simply the BNC). It should be emphasized, however, that use of the BNC is not synonymous with use of Xaira. Most generic tools developed for corpus linguistics and NLP can be used with the BNC, although the tools may vary in the extent to which they can make use of the markup in the corpus.

Whatever software is used, the programmer must have a clear understanding of the various elements tagged in the corpus, the contexts in which they may appear, and their intended semantics. The syntax of an XML document is defined by a schema. For TEI conformant texts, the TEI Header provides additional meta-information. The semantics of XML elements are provided by documentation such as that provided elsewhere in this manual.

The BNC delivery format

An XML document like the BNC must have the following components:

an SGML declaration defining various SGML-specific limits;
a document type declaration defining the elements, entities, and attributes which are legal in the document;
an SGML document instance, that is, the document text itself.

These three components are all included as part of the standard release of the corpus.

Text files

The BNC is delivered in compressed format, using the GNU tar utility. When expanded, it comprises 4054 distinct files, ranging in size from 1 to 45 Kbytes, and totalling about 1.5 Gbytes. Each file contains a single BNC document, i.e. a TEI header and its associated spoken or written text, and has the same name as the value of the id attribute on its <bncDoc> element. Files are grouped according to their names into a three-level hierarchy. For example, all files with names beginning AA are in a subdirectory AA, which is within a subdirectory A (along with all other subdirectories beginning with the letter A). Not all possible three-letter filenames are actually used.

Each single-letter subdirectory (A to K, excluding I) is delivered as a separate compressed archive file. The whole corpus should be unpacked into a single hierarchy, which, as delivered, is called BNC/Texts. The full name for the corpus text with identifier ABC is thus BNC/Texts/A/AB/ABC.
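The naming rule just described can be sketched in a few lines of Python. The function name text_path and its root parameter are hypothetical, introduced only for illustration; the sketch follows the naming given above literally and makes no assumption about any filename extension used in a particular release.

```python
from pathlib import PurePosixPath


def text_path(ident, root="BNC/Texts"):
    """Return the delivered location of the text with a three-character identifier."""
    if len(ident) != 3:
        raise ValueError(f"expected a three-character identifier, got {ident!r}")
    # The directory levels are the first letter and the first two letters of the name.
    return PurePosixPath(root) / ident[0] / ident[:2] / ident


print(text_path("ABC"))  # BNC/Texts/A/AB/ABC
```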

Note that the three-character identifiers used (and hence the directory structure) are entirely arbitrary and do not convey any information about the type of text contained. Each text contains a TEI Header which specifies all such meta information, either directly, or by reference to the corpus header, as described in section The header. For convenience, however, this release includes an XML file called bncIndex.xml and a simple finder file called bncfinder.dat, either of which may be used to select files of particular types, as further discussed in Creating a subcorpus.

XML components

All ancillary files relating to the XML structure and processing of the corpus are included in the standard release within a subdirectory called XML. This contains the following files:

bncxml.rng: The BNC XML schema expressed in RelaxNG syntax
bncxml.rng: The BNC XML schema expressed in RelaxNG compact syntax
bncxml.rng: The BNC XML schema expressed in W3C schema language
bncxml.rng: The BNC XML schema expressed as a Document Type Definition (DTD)
driver1.sgm and driver2.sgm: Example XML driver files for processing the BNC.
bncfinder.dat and bncIndex.xml: Ancillary data files which may be used to facilitate access to the corpus (see Creating a subcorpus below.)

The remainder of this section discusses how these files may be used together as an XML document. This is by no means the only way of processing the corpus, of course, and is intended solely to demonstrate the function of the various files listed above. Some basic understanding of the components of an XML system is assumed.

A number of XSLT stylesheets are provided to demonstrate some simple tasks. These include:

display.xsl: converts a BNC text to an HTML format which can be read directly in a browser
justthetext.xsl: removes all the tagging from a BNC text; also removes the whole of the header.
onewordperline.xsl: converts a BNC text to a "one word per line" format
justthecodes.xsl: removes all the words from a BNC text; also removes the whole of the header.

The BNC corpus header

As discussed in section Basic structure above, the BNC consists of an overall corpus header, and a large number of distinct BNC documents, each with its own header. The corpus header must be present for an XML processor to work with any part of the Corpus, because the corpus header contains declarations of elements (such as the classification records) referred to by almost every part of the corpus.

The various elements making up the header and their functions are discussed in section The header. The corpus header itself is included in the file bncHdr.xml.

Creating a subcorpus

Two files are provided with this version of the corpus to assist in the selection of files according to their classification codes: bncfinder.dat and bncIndex.xml.

Using bncfinder.dat

The file bncfinder.dat is a straightforward ASCII format data file, containing one record for each file in the corpus. Within each record, the following blank-delimited fields are present:

three-character identifier
size of text in Kbytes
number of <p> or <u> elements
number of <s> elements
number of <w> elements
number of orthographic words
all classification codes assigned to this text

Here are two typical records from this file (wrapped to fit on the page: in the original, each is a single line):

A00 107 112 423 6673 6894 alltim3 allava2 alltyp5 wriaag0 wriad0 wriase0 wriaty2 wriaud3 wridom4 wrilev2 wrimed3 wripp5 wrisam5 wrista2 writas3
KSP 41 259 306 1543 1427 alltim3 allava2 alltyp1 sdeage1 sdecla1 sdesex1 spolog2 sporeg1
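The field layout can be made explicit by reading a record into named fields. The following Python sketch assumes only the field order listed above; the names FinderRecord and parse_record, and the field names themselves, are hypothetical and chosen for illustration.

```python
from typing import NamedTuple


class FinderRecord(NamedTuple):
    ident: str    # three-character identifier
    kbytes: int   # size of text in Kbytes
    p_or_u: int   # number of <p> or <u> elements
    s_units: int  # number of <s> elements
    w_units: int  # number of <w> elements
    words: int    # number of orthographic words
    codes: tuple  # all classification codes assigned to the text


def parse_record(line):
    """Split one blank-delimited bncfinder.dat record into named fields."""
    fields = line.split()
    ident, *counts = fields[:6]
    return FinderRecord(ident, *map(int, counts), tuple(fields[6:]))


record = parse_record(
    "KSP 41 259 306 1543 1427 alltim3 allava2 alltyp1 "
    "sdeage1 sdecla1 sdesex1 spolog2 sporeg1"
)
print(record.ident, record.words, record.codes[-1])  # KSP 1427 sporeg1
```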

This file can be rapidly searched with simple Unix utilities such as grep to identify subcorpora having particular characteristics: for example, the following command line will select records for all spoken demographic texts collected by female respondents:

$ grep sdesex2 bncfinder.dat

The same information could of course be obtained by searching through the corpus texts themselves; however the above is likely to be much quicker.

The classification codes used in the bncfinder.dat file are listed in section Text and genre classification codes.

The lines selected by such a procedure can be processed in many ways. Here for example is a program written in the perl language which just creates the references needed to embed texts in a driver file like those above:

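# Write the identifier of each selected text to bncrefs.sgm and report totals
open(OUT, ">bncrefs.sgm") || die "Cannot create bncrefs.sgm: $!\n";
while (<>) {
    # Fields: identifier, Kbytes, <p>/<u> count, <s> count, <w> count, orthographic words
    ($id, $k, $p, $s, $w, $o) = split;
    $ntexts++;
    $kbsz += $k;
    $nsents += $s;
    $nwords += $o;
    print OUT "$id\n";
}
close(OUT);
print "You have selected $ntexts texts, totalling $kbsz Kb, $nsents s-units and $nwords orthographic words\n";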

Assuming that this program is stored in the file subcorp.prl, a command line like the following might be used to create a bncrefs.sgm file defining a subcorpus comprising all the spoken demographic texts collected by female respondents:

$ grep sdesex2 bncfinder.dat | perl subcorp.prl
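The selection and the totals can also be computed in a single step. The following Python sketch is one possible equivalent, not part of the release: the function names select and summarise are hypothetical, and the two records are those quoted earlier in this section. Unlike a plain grep, it compares whole classification codes, so a search string can never match part of some other field.

```python
def select(lines, code):
    """Yield the fields of each record carrying exactly the given classification code."""
    for line in lines:
        fields = line.split()
        # Classification codes start at the seventh blank-delimited field.
        if code in fields[6:]:
            yield fields


def summarise(records):
    """Return (texts, Kbytes, s-units, orthographic words) for the selected records."""
    ntexts = kbsz = nsents = nwords = 0
    for fields in records:
        ntexts += 1
        kbsz += int(fields[1])
        nsents += int(fields[3])
        nwords += int(fields[5])
    return ntexts, kbsz, nsents, nwords


LINES = [
    "A00 107 112 423 6673 6894 alltim3 allava2 alltyp5 wriaag0 wriad0 wriase0 "
    "wriaty2 wriaud3 wridom4 wrilev2 wrimed3 wripp5 wrisam5 wrista2 writas3",
    "KSP 41 259 306 1543 1427 alltim3 allava2 alltyp1 sdeage1 sdecla1 sdesex1 "
    "spolog2 sporeg1",
]
print(summarise(select(LINES, "alltim3")))  # (2, 148, 729, 8321)
```

Since select accepts any iterable of lines, an open bncfinder.dat file object could be passed in place of LINES.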

Using bncIndex.xml

The file bncIndex.xml contains similar information to that provided in bncfinder.dat, but formatted for processing as an XML document. It contains a single <bncIndex> element which encloses a series of <doc> elements, one for each text in the BNC. Each <doc> element contains the following subelements:

<idno>

The three character identifier of the text

<title>

Either the short title of the text, taken from the <title> in the <fileDesc>, or the phrase [Unscripted conversation]

<class>

A classification code applied to the text, as given by the target attribute of the <catRef> element. One <class> is given for each classification and the following attributes are used to specify it:

type: the identifying code specified by some <category> element in the corpus header
value: a numeric code which, when appended to the type value, gives the classification code for this text.

<genre>

The genre code assigned to this text in David Lee's classification scheme, as recorded in the <classCode> element in its header

<counts>

Size of this text measured in various ways, as specified by the following attributes:

kb: size in Kbytes
ow: size in orthographic words
w: number of <w> elements
s: number of <s> elements

Here are the XML elements for the two texts mentioned above (reformatted slightly to fit on the page):

<doc>
  <idno>A00</idno>
  <title>[ACET factsheets &amp; newsletters].</title>
  <class type="alltim" value="3"/><class type="allava" value="2"/>
  <class type="alltyp" value="5"/><class type="wriaag" value="0"/>
  <class type="wriad" value="0"/><class type="wriase" value="0"/>
  <class type="wriaty" value="2"/><class type="wriaud" value="3"/>
  <class type="wridom" value="4"/><class type="wrilev" value="2"/>
  <class type="wrimed" value="3"/><class type="wripp" value="5"/>
  <class type="wrisam" value="5"/><class type="wrista" value="2"/>
  <class type="writas" value="3"/>
  <genre>W_non_ac_medicine</genre>
  <counts kb="107" ow="6894" w="6673" s="423"/>
</doc>
<doc>
  <idno>KSP</idno>
  <title>[Spontaneous conversation]</title>
  <class type="alltim" value="3"/><class type="allava" value="2"/>
  <class type="alltyp" value="1"/><class type="sdeage" value="1"/>
  <class type="sdecla" value="1"/><class type="sdesex" value="1"/>
  <class type="spolog" value="2"/><class type="sporeg" value="1"/>
  <genre>S_conv</genre>
  <counts kb="41" ow="1427" w="1543" s="306"/>
</doc>
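The relationship between the type and value attributes and the classification codes can be illustrated with a short Python sketch using the standard xml.etree.ElementTree module. The sample is an abbreviated copy of the KSP entry shown above, wrapped in a <bncIndex> element; the function names class_codes and docs_with are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the KSP entry, keeping only two of its <class> elements.
SAMPLE = """<bncIndex>
<doc>
  <idno>KSP</idno>
  <title>[Spontaneous conversation]</title>
  <class type="alltim" value="3"/><class type="sdesex" value="1"/>
  <genre>S_conv</genre>
  <counts kb="41" ow="1427" w="1543" s="306"/>
</doc>
</bncIndex>"""


def class_codes(doc):
    """Rebuild codes such as 'sdesex1' by appending each value to its type."""
    return [c.get("type") + c.get("value") for c in doc.findall("class")]


def docs_with(index, code):
    """Yield (idno, title, orthographic words) for every <doc> carrying the code."""
    for doc in index.iter("doc"):
        if code in class_codes(doc):
            words = int(doc.find("counts").get("ow"))
            yield doc.findtext("idno"), doc.findtext("title"), words


index = ET.fromstring(SAMPLE)
print(list(docs_with(index, "sdesex1")))
```

To work with the full index, ET.parse("bncIndex.xml").getroot() would take the place of ET.fromstring(SAMPLE).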

Although XML files do not require a DTD, one is also provided in the directory for convenience. It is called bncIndex.dtd and listed below:

<!ELEMENT bncIndex (doc+) >
<!ELEMENT doc (idno, title, class+, genre, counts)>
<!ELEMENT idno (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ELEMENT class EMPTY >
<!ATTLIST class
    type CDATA #REQUIRED
    value CDATA #REQUIRED>
<!ELEMENT genre (#PCDATA) >
<!ELEMENT counts EMPTY >
<!ATTLIST counts
    kb CDATA #REQUIRED
    ow CDATA #REQUIRED
    w CDATA #REQUIRED
    s CDATA #REQUIRED>

XML files may be processed in many different ways, but one of the most convenient is to use an XSLT stylesheet to transform them for display or to search them. XSLT is a very high-level programming language defined by the W3C, which offers the ability to transform and process XML documents in a variety of ways. It is (at the time of writing) the language of choice for manipulating XML on the web, where a large number of free tools and tutorials may also be found.

To give a flavour of the language, the following XSLT stylesheet will process bncIndex.xml, selecting only texts with classification "wridom4", and displaying their identifiers, titles, genres, and sizes as an HTML format table:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="bncIndex">
    <html>
      <center>
        <h2>There are <xsl:value-of select="count(doc/class[@type='wridom' and @value='4'])"/> BNC World Files with wridom classified as 4</h2>
        <table>
          <!-- Apply templates only to the matching class elements, so that the built-in rules do not copy the text of every other element into the output -->
          <xsl:apply-templates select="doc/class[@type='wridom' and @value='4']"/>
        </table>
      </center>
    </html>
  </xsl:template>
  <xsl:template match="doc/class[@type='wridom' and @value='4']">
    <tr>
      <td><xsl:value-of select="../idno"/></td>
      <td><xsl:value-of select="../title"/></td>
      <td><xsl:value-of select="../genre"/></td>
      <td><xsl:value-of select="../counts/@kb"/></td>
      <td><xsl:value-of select="../counts/@ow"/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>
