[bnc] BNC User Reference Manual - Software for the BNC

Software for the BNC

A design goal of the original BNC project was that it should not be delivered in a format which was proprietary or which required the use of any particular piece of software. This, together with the desire to conform to emerging international standards, was a key factor in determining the choice of SGML as the vehicle for the corpus interchange format. Six years after this decision, SGML is still a widely used international standard format for which many public domain and commercial utilities exist. Indeed, in the shape of XML, which is a simplified version of the original standard, SGML now dominates development of the world wide web, and hence of most sectors of the information processing community. New XML software appears almost every week, and it has been adopted by current ‘major players’ from Sun and IBM to Microsoft.

That said, it must be recognised that the requirements of corpus linguists and others wishing to make use of the BNC are often rather specialist, and therefore unlikely to be supported by mainstream commercially produced software. For this and other reasons, the research user of the BNC should expect to have to do some programming. This is another reason behind the choice of SGML or XML as a vehicle for the system: because of the wide take-up of these formalisms, there exist many utility libraries and generic programming interfaces which greatly simplify such processes as extracting the tags from a file, selecting portions of the text according to its logical structure, picking out files with certain attributes by searching their headers, and so on.

The BNC uses SGML in a simple and straightforward way described in the rest of this manual; simple programs can be readily written using standard UNIX utilities such as grep or perl to access the corpus just as plain text files. More reliably, programs can be written to application programming interfaces (APIs) such as the W3C's Document Object Model (DOM) or the Simple API for XML (SAX), using application libraries developed for almost every modern programming language (C, Perl, Python, tcl etc.). Information about such resources is not provided here, but is readily found on the World Wide Web: currently, one good place to start looking is www.xml.com.

When the BNC was first published, the top of the range personal computer might have as much as 50 or even 100 megabytes of disk storage and 8 Mb of RAM. At the time of writing, 20 or 30 gigabyte hard disks and 128 Mb of RAM are commonplace on entry level machines. It is thus quite likely that software capable of efficiently handling the 1.5 gigabytes of text which make up the BNC will also soon become commonplace. For the moment, however, it has to be recognized that general purpose tools for SGML and XML do not always cope very well with the large size of the whole corpus, although they can still be very useful for processing subsets extracted from it. To handle the whole of the corpus, special purpose indexing software will usually be necessary. Although such systems exist, they are often expensive or difficult to implement. For that reason, the BNC Project also developed its own low-cost alternative, the SARA package, which is documented separately. It should be emphasized, however, that use of the BNC is not synonymous with use of SARA. Most generic tools developed for corpus linguistics and NLP can be used with the BNC, although the tools may vary in the extent to which they can make use of the markup in the corpus.

Whatever software is used, the programmer must have a clear understanding of the various elements tagged in the corpus, the contexts in which they may appear, and their intended semantics. The syntax of an SGML document is defined by a document type definition and by an SGML declaration. For TEI conformant texts, the TEI Header provides additional meta-information. The semantics of SGML elements are provided by documentation such as that provided elsewhere in this manual.

The BNC delivery format

An SGML document like the BNC must have the following components:

an SGML declaration defining various SGML-specific limits;
a document type declaration defining the elements, entities, and attributes which are legal in the document;
an SGML document instance, that is, the document text itself.

These three components are all included as part of the standard release of the corpus.

Text files

The BNC is delivered in compressed format, using the GNU tar utility. When expanded, it comprises 4054 distinct files, ranging in size from 1 to 45 Kbytes, and totalling about 1.5 Gbytes. Each file contains a single BNC document, i.e. a TEI header and its associated spoken or written text, and has the same name as the value of the id attribute on its <bncDoc> element. Files are grouped according to their names into a three-level hierarchy. For example, all files with names beginning AA are in a subdirectory AA, which is within a subdirectory A (along with all other subdirectories beginning with the letter A). Not all possible three-letter filenames are actually used.

Each single-letter subdirectory (A to K, excluding I) is delivered as a separate compressed archive file. The whole corpus should be unpacked into a single hierarchy, which, as delivered, is called BNC/Texts. The full name for the corpus text with identifier ABC is thus BNC/Texts/A/AB/ABC.
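The naming rule just described can be sketched in a few lines of Python. This is only an illustration: the function name text_path and the default root BNC/Texts (the name of the hierarchy as delivered) are assumptions made for the example, not part of the distribution.

```python
from pathlib import PurePosixPath


def text_path(ident: str, root: str = "BNC/Texts") -> PurePosixPath:
    """Return the location of the corpus text with this identifier."""
    if len(ident) != 3:
        raise ValueError(f"expected a three-character identifier, got {ident!r}")
    # Directory levels: first letter, then first two letters, then the file itself.
    return PurePosixPath(root) / ident[0] / ident[:2] / ident


print(text_path("ABC"))  # BNC/Texts/A/AB/ABC
```

If the corpus has been unpacked elsewhere, the root argument can be changed accordingly, for example text_path("ABC", root="/home/BNC/Texts").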

Note that the three-character identifiers used (and hence the directory structure) are entirely arbitrary and do not convey any information about the type of text contained. Each text contains a TEI Header which specifies all such meta-information, either directly or by reference to the corpus header, as described in section ??. For convenience, however, this release includes an XML file called bncIndex.xml and a simple finder file called bncfinder.dat, either of which may be used to select files of particular types, as further discussed in Creating a subcorpus.

SGML components

All ancillary files relating to the SGML structure and processing of the corpus are included in the standard release within a subdirectory called SGML. This contains the following files:

bnc.dec: The BNC SGML declaration provided has been modified to allow for the use of either an XML or an SGML format: this is necessary since the TEI Headers provided with the current release of the corpus are distributed in XML format. The corpus texts themselves are however still distributed in SGML format.
bnc.dtd: The BNC document type declaration, as a single SGML file. This file is automatically generated from the standard TEI DTD using a pair of ‘extension files’ as further discussed in section ??.
bncMods.dtd and bncMods.ent: The two extension files used to parameterize the TEI for BNC usage.
bncChars.dtd: SGML declarations for all character entities used in the BNC.
bncDocs.dtd: SGML declarations for all documents making up the BNC.
driver1.sgm and driver2.sgm: Example SGML driver files for processing the BNC.
bncfinder.dat and bncIndex.xml: Ancillary data files which may be used to facilitate access to the corpus (see Creating a subcorpus below).

The remainder of this section discusses how these files may be used together as an SGML document. This is by no means the only way of processing the corpus, of course, and is intended solely to demonstrate the function of the various files listed above. Some basic understanding of the components of an SGML system is assumed.

To process a single text from the corpus (say, text ABC), a driver file like the following could be used:

<!DOCTYPE bnc SYSTEM "http://www.hcu.ox.ac.uk/TEI/Guidelines/DTD/tei2.dtd" [
  <!ENTITY % TEI.prose "INCLUDE">
  <!ENTITY % TEI.spoken "INCLUDE">
  <!ENTITY % TEI.general "INCLUDE">
  <!ENTITY % TEI.analysis "INCLUDE">
  <!ENTITY % TEI.corpus "INCLUDE">
  <!ENTITY % TEI.extensions.ent SYSTEM "/home/BNC/SGML/bncMods.ent">
  <!ENTITY % TEI.extensions.dtd SYSTEM "/home/BNC/SGML/bncMods.dtd">
  <!ENTITY % BNCchars SYSTEM "/home/BNC/SGML/BNCchars.ent">
  %BNCchars;
  <!ENTITY corphdr SYSTEM "/home/BNC/Texts/corphdr">
  <!ENTITY text SYSTEM "/home/BNC/Texts/A/AB/ABC">
]>
<bnc>
&corphdr;
&text;
</bnc>

This driver assumes that the standard TEI DTD is available from the URL given (which was true as of the date of this manual), and that the files from the BNC World distribution have been installed under /home/BNC. Alternatively, if the driver file is to be used offline, using the ‘compiled’ version of the BNC dtd, it might look like the following:

<!DOCTYPE bnc SYSTEM "/home/BNC/SGML/bnc.dtd" [
  <!ENTITY % BNCchars SYSTEM "/home/BNC/SGML/BNCchars.ent">
  %BNCchars;
  <!ENTITY BNChdr SYSTEM "/home/BNC/Texts/corphdr">
  <!ENTITY text1 SYSTEM "/home/BNC/Texts/A/AB/ABC">
]>
<bnc>
&BNChdr;
&text1;
</bnc>

To process more than one file from the corpus, a set of declarations like the one given above for the entity text would be necessary, one for each text concerned. For convenience, a file containing such declarations for every text in the corpus is also provided: this file, bncdocs.dtd, consists of declarations like the following:

<!ENTITY ABC SYSTEM "BNC/Texts/A/AB/ABC">
<!ENTITY ABD SYSTEM "BNC/Texts/A/AB/ABD">

With these declarations in force, it becomes possible to refer to the corpus file ABC simply by means of the entity reference &ABC;, as in the following example:

<!DOCTYPE bnc SYSTEM "/home/BNC/SGML/bnc.dtd" [
  <!ENTITY % BNCdocs SYSTEM "/home/BNC/SGML/bncDocs.ent">
  %BNCdocs;
  <!ENTITY % BNCchars SYSTEM "/home/BNC/SGML/bncChars.ent">
  %BNCchars;
  <!ENTITY BNChdr SYSTEM "/home/BNC/Texts/corphdr">
]>
<bnc>
&BNChdr;
&ABC;
&ABD;
</bnc>

The first line declares that what follows is an SGML document and that the dtd describing it is located in the file with the SYSTEM identifier given (/home/BNC/SGML/bnc.dtd). The next few lines (the portion within square brackets) comprise the DTD subset declaration: declarations here are to be processed before the content of the DTD. It comprises three entity declarations.

The first, for BNCdocs, associates that name with the external entity containing declarations for all the documents making up the BNC itself (i.e. the file bncDocs.ent), and then immediately references that entity. The percent sign is a syntactic convention of SGML which need not concern us here: the effect is that each file in the corpus can now be referenced using a name such as &ABC;. The second, for BNCchars, does almost exactly the same thing, but for the character references used within the corpus (see further ??). The third associates the name BNChdr with the file containing the corpus header.

Following this, the driver file contains the SGML document itself, beginning with the <bnc> start-tag, and ending with the </bnc> end-tag. Between these tags are entity references, one for the corpus header, followed by one for each file to be included in this view of the corpus.
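Since a driver file of this kind has a fixed shape, it can be generated for any list of text identifiers. The following Python sketch is one way of doing so; the function name build_driver is hypothetical, and the file names and the /home/BNC locations are simply copied from the example driver above, so they may need adjusting to match a particular installation.

```python
def build_driver(idents, sgml_dir="/home/BNC/SGML", texts_dir="/home/BNC/Texts"):
    """Return the text of an SGML driver file embedding the given corpus texts."""
    # One entity reference per text, e.g. &ABC; &ABD;
    refs = " ".join(f"&{ident};" for ident in idents)
    lines = [
        f'<!DOCTYPE bnc SYSTEM "{sgml_dir}/bnc.dtd" [',
        f'  <!ENTITY % BNCdocs SYSTEM "{sgml_dir}/bncDocs.ent"> %BNCdocs;',
        f'  <!ENTITY % BNCchars SYSTEM "{sgml_dir}/bncChars.ent"> %BNCchars;',
        f'  <!ENTITY BNChdr SYSTEM "{texts_dir}/corphdr">',
        "]>",
        # The corpus header always comes first, followed by the selected texts.
        f"<bnc> &BNChdr; {refs} </bnc>",
    ]
    return "\n".join(lines) + "\n"


print(build_driver(["ABC", "ABD"]), end="")
```

The result is plain text, so it can be written to a file and passed to whatever SGML processor is in use.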

The BNC corpus header

As discussed in section ?? above, the BNC consists of an overall corpus header, and a large number of distinct BNC documents, each with its own header. The corpus header must be present for an SGML processor to work with any part of the Corpus, because the corpus header contains declarations of elements (such as the classification records) referred to by almost every part of the corpus.

The various elements making up the header and their functions are discussed in section ??. The corpus header itself is included in the file corphdr. Its contents are reproduced below, reformatted for legibility.

Creating a subcorpus

Two files are provided with this version of the corpus to assist in the selection of files according to their classification codes: bncfinder.dat and bncIndex.xml.

Using bncfinder.dat

The file bncfinder.dat is a straightforward ASCII format data file, containing one record for each file in the corpus. Within each record, the following blank-delimited fields are present:

three-character identifier
size of text in Kbytes
number of <p> or <u> elements
number of <s> elements
number of <w> elements
number of orthographic words
all classification codes assigned to this text

Here are two typical records from this file (wrapped to fit on the page: in the original, each is a single line):

A00 107 112 423 6673 6894 alltim3 allava2 alltyp5 wriaag0 wriad0 wriase0 wriaty2
    wriaud3 wridom4 wrilev2 wrimed3 wripp5 wrisam5 wrista2 writas3
KSP 41 259 306 1543 1427 alltim3 allava2 alltyp1 sdeage1 sdecla1 sdesex1 spolog2
    sporeg1
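The field layout listed above can be read in Python along the following lines. This is a minimal sketch: the class name FinderRecord and its attribute names are invented for the example, and it assumes that every record has the six leading fields in the order given.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinderRecord:
    ident: str        # three-character identifier
    kbytes: int       # size of text in Kbytes
    divisions: int    # number of <p> or <u> elements
    s_units: int      # number of <s> elements
    w_units: int      # number of <w> elements
    words: int        # number of orthographic words
    codes: List[str]  # all classification codes assigned to the text


def parse_record(line: str) -> FinderRecord:
    """Split one blank-delimited bncfinder.dat record into named fields."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"too few fields in record: {line!r}")
    counts = [int(f) for f in fields[1:6]]
    return FinderRecord(fields[0], *counts, fields[6:])


rec = parse_record(
    "KSP 41 259 306 1543 1427 alltim3 allava2 alltyp1 "
    "sdeage1 sdecla1 sdesex1 spolog2 sporeg1"
)
print(rec.ident, rec.words, rec.codes[-1])  # KSP 1427 sporeg1
```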

This file can be rapidly searched with simple Unix utilities such as grep to identify subcorpora having particular characteristics: for example, the following command line will select records for all spoken demographic texts collected by female respondents:

$ grep sdesex2 bncfinder.dat

The same information could of course be obtained by searching through the corpus texts themselves; however the above is likely to be much quicker.

The classification codes used in the bncfinder.dat file are listed in section ??.

The lines selected by such a procedure can be processed in many ways. Here for example is a program written in the perl language which just creates the references needed to embed texts in a driver file like those above:

open(OUT, ">bncrefs.sgm") || die "Cannot create bncrefs.sgm: $!\n";
while (<>) {
    # Fields: identifier, Kbytes, <p>/<u> count, <s> count, <w> count, orthographic words
    ($id, $k, $p, $s, $w, $o) = split;
    $ntexts++;
    $kbsz   += $k;
    $nsents += $s;
    $nwords += $o;
    print OUT "&$id;\n";    # entity reference for this text, e.g. &KSP;
}
close(OUT);
print "You have selected $ntexts texts, totalling $kbsz Kb, $nsents s-units and $nwords orthographic words\n";

Assuming that this program is stored in the file subcorp.prl, a command line like the following might be used to create a bncrefs.sgm file defining a subcorpus comprising all the spoken demographic texts collected by female respondents:

$ grep sdesex2 bncfinder.dat | perl subcorp.prl
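For those who prefer a single script to a pipeline, the selection and the summary can be combined. The Python sketch below is an illustrative equivalent rather than part of the distribution: the function name select_texts is hypothetical, and the two sample records are the ones quoted earlier in this section. Unlike grep, which matches the code anywhere in the line, it compares the code against whole classification fields.

```python
import io


def select_texts(records, code, out):
    """Write an entity reference for each record carrying `code`; return totals."""
    totals = {"texts": 0, "kbytes": 0, "s_units": 0, "words": 0}
    for line in records:
        fields = line.split()
        # Fields 0-5 are the identifier and counts; the rest are classification codes.
        if len(fields) < 7 or code not in fields[6:]:
            continue
        out.write(f"&{fields[0]};\n")
        totals["texts"] += 1
        totals["kbytes"] += int(fields[1])
        totals["s_units"] += int(fields[3])
        totals["words"] += int(fields[5])
    return totals


sample = [
    "A00 107 112 423 6673 6894 alltim3 allava2 alltyp5 wriaag0 wriad0 wriase0 "
    "wriaty2 wriaud3 wridom4 wrilev2 wrimed3 wripp5 wrisam5 wrista2 writas3",
    "KSP 41 259 306 1543 1427 alltim3 allava2 alltyp1 sdeage1 sdecla1 sdesex1 "
    "spolog2 sporeg1",
]
out = io.StringIO()
print(select_texts(sample, "sdesex1", out))
print(out.getvalue(), end="")
```

To run the same selection over the real data, the sample list and the StringIO object would be replaced by open file objects, for example bncfinder.dat opened for reading and bncrefs.sgm opened for writing.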

Using bncIndex.xml

The file bncIndex.xml contains similar information to that provided in bncfinder.dat, but formatted for processing as an XML document. It contains a single <bncIndex> element which encloses a series of <doc> elements, one for each text in the BNC. Each <doc> element contains the following subelements:

<idno>

The three character identifier of the text

<title>

Either the short title of the text, taken from the <title> in the <fileDesc>, or the phrase [Unscripted conversation]

<class>

A classification code applied to the text, as given by the target attribute of the <catRef> element. One <class> is given for each classification and the following attributes are used to specify it:

type: the identifying code specified by some <category> element in the corpus header
value: a numeric code which, when appended to the type value, gives the classification code for this text.

<genre>

The genre code assigned to this text in David Lee's classification scheme, as recorded in the <classCode> element in its header

<counts>

Size of this text measured in various ways, as specified by the following attributes:

kb: size in Kbytes
ow: size in orthographic words
w: number of <w> elements
s: number of <s> elements

Here are the XML elements for the two texts mentioned above (reformatted slightly to fit on the page):

<doc>
  <idno>A00</idno>
  <title>[ACET factsheets &amp; newsletters].</title>
  <class type="alltim" value="3"/><class type="allava" value="2"/>
  <class type="alltyp" value="5"/><class type="wriaag" value="0"/>
  <class type="wriad" value="0"/><class type="wriase" value="0"/>
  <class type="wriaty" value="2"/><class type="wriaud" value="3"/>
  <class type="wridom" value="4"/><class type="wrilev" value="2"/>
  <class type="wrimed" value="3"/><class type="wripp" value="5"/>
  <class type="wrisam" value="5"/><class type="wrista" value="2"/>
  <class type="writas" value="3"/>
  <genre>W_non_ac_medicine</genre>
  <counts kb="107" ow="6894" w="6673" s="423"/>
</doc>

<doc>
  <idno>KSP</idno>
  <title>[Spontaneous conversation]</title>
  <class type="alltim" value="3"/><class type="allava" value="2"/>
  <class type="alltyp" value="1"/><class type="sdeage" value="1"/>
  <class type="sdecla" value="1"/><class type="sdesex" value="1"/>
  <class type="spolog" value="2"/><class type="sporeg" value="1"/>
  <genre>S_conv</genre>
  <counts kb="41" ow="1427" w="1543" s="306"/>
</doc>
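Because the classification codes are split here into a type and a value, a program reading this file has to join the two attributes again to obtain codes of the form used in bncfinder.dat. A minimal Python sketch using the standard xml.etree.ElementTree module follows; the function name classification_codes is invented for the example, and the embedded sample is the KSP entry shown above, wrapped in a <bncIndex> element.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<bncIndex>
<doc>
 <idno>KSP</idno>
 <title>[Spontaneous conversation]</title>
 <class type="alltim" value="3"/><class type="allava" value="2"/>
 <class type="alltyp" value="1"/><class type="sdeage" value="1"/>
 <class type="sdecla" value="1"/><class type="sdesex" value="1"/>
 <class type="spolog" value="2"/><class type="sporeg" value="1"/>
 <genre>S_conv</genre>
 <counts kb="41" ow="1427" w="1543" s="306"/>
</doc>
</bncIndex>"""


def classification_codes(doc):
    """Join each <class> element's type and value into a code such as 'sdesex1'."""
    return [c.get("type") + c.get("value") for c in doc.findall("class")]


root = ET.fromstring(SAMPLE)
for doc in root.findall("doc"):
    counts = doc.find("counts")
    # Attribute values are strings; convert with int() if arithmetic is needed.
    print(doc.findtext("idno"), counts.get("ow"), classification_codes(doc))
```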

Although XML files do not require a DTD, one is also provided in the directory for convenience. It is called bncIndex.dtd and listed below:

<!ELEMENT bncIndex (doc+) >
<!ELEMENT doc (idno, title, class+, genre, counts)>
<!ELEMENT idno (#PCDATA) >
<!ELEMENT title (#PCDATA) >
<!ELEMENT class EMPTY >
<!ATTLIST class
    type CDATA #REQUIRED
    value CDATA #REQUIRED>
<!ELEMENT genre (#PCDATA) >
<!ELEMENT counts EMPTY >
<!ATTLIST counts
    kb CDATA #REQUIRED
    ow CDATA #REQUIRED
    w CDATA #REQUIRED
    s CDATA #REQUIRED>

XML files may be processed in many different ways, but one of the most convenient is to use an XSLT stylesheet to transform them for display or to search them. XSLT is a very high-level programming language defined by the W3C, which offers the ability to transform and process XML documents in a variety of ways. It is (at the time of writing) the language of choice for manipulating XML on the web, where a large number of free tools and tutorials may also be found.

To give a flavour of the language, the following XSLT stylesheet will process bncIndex.xml, selecting only texts with classification "wridom4" and displaying their identifiers, titles, genre codes, and sizes as an HTML format table:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>

  <xsl:template match="bncIndex">
    <html>
      <center>
        <h2>There are
          <xsl:value-of select="count(doc/class[@type='wridom' and @value='4'])"/>
          BNC World Files with wridom classified as 4</h2>
        <table>
          <!-- Select only the matching class elements, so that the built-in
               template rules do not copy the text of every other doc -->
          <xsl:apply-templates select="doc/class[@type='wridom' and @value='4']"/>
        </table>
      </center>
    </html>
  </xsl:template>

  <xsl:template match="doc/class[@type='wridom' and @value='4']">
    <tr>
      <td><xsl:value-of select="../idno"/></td>
      <td><xsl:value-of select="../title"/></td>
      <td><xsl:value-of select="../genre"/></td>
      <td><xsl:value-of select="../counts/@kb"/></td>
      <td><xsl:value-of select="../counts/@ow"/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>
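For comparison, the same selection can be made from a general purpose programming language. The Python sketch below uses the standard xml.etree.ElementTree module and is offered only as an illustration: the function name docs_with_class is hypothetical, and the embedded sample is an abbreviated version of the two entries shown earlier, with most of their <class> elements left out for brevity.

```python
import xml.etree.ElementTree as ET


def docs_with_class(root, class_type, value):
    """Yield (idno, title, genre, kb, ow) for each <doc> with the given classification."""
    for doc in root.iter("doc"):
        if any(c.get("type") == class_type and c.get("value") == value
               for c in doc.findall("class")):
            counts = doc.find("counts")
            yield (doc.findtext("idno"), doc.findtext("title"),
                   doc.findtext("genre"), counts.get("kb"), counts.get("ow"))


# Abbreviated sample: only one <class> element is kept for each text.
SAMPLE = """<bncIndex>
 <doc><idno>A00</idno><title>[ACET factsheets &amp; newsletters].</title>
  <class type="wridom" value="4"/><genre>W_non_ac_medicine</genre>
  <counts kb="107" ow="6894" w="6673" s="423"/></doc>
 <doc><idno>KSP</idno><title>[Spontaneous conversation]</title>
  <class type="sdesex" value="1"/><genre>S_conv</genre>
  <counts kb="41" ow="1427" w="1543" s="306"/></doc>
</bncIndex>"""

for row in docs_with_class(ET.fromstring(SAMPLE), "wridom", "4"):
    print("\t".join(row))
```

To search the real index, the root element would be obtained with ET.parse("bncIndex.xml").getroot() in place of the sample string.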
