Software for the BNC
A design goal of the original BNC project was that it should not be delivered in a format which was proprietary or which required the use of any particular piece of software. This, together with the desire to conform to emerging international standards, was a key factor in determining the choice of SGML as the vehicle for the corpus interchange format. Six years after this decision, SGML is still a widely used international standard format for which many public domain and commercial utilities exist. Indeed, in the shape of XML, which is a simplified version of the original standard, SGML now dominates development of the world wide web, and hence of most sectors of the information processing community. New XML software appears almost every week, and it has been adopted by current ‘major players’ from Sun and IBM to Microsoft.
That said, it must be recognised that the requirements of corpus linguists and others wishing to make use of the BNC are often rather specialist, and therefore unlikely to be supported by mainstream commercially produced software. For this and other reasons, the research user of the BNC should expect to have to do some programming. This is another reason behind the choice of XML as a vehicle for the system: because of the wide take up of these formalisms, there exist many utility libraries and generic programming interfaces which greatly simplify such processes as extracting the tags from a file, selecting portions of the text according to its logical structure, picking out files with certain attributes by searching their headers, and so on.
The BNC uses XML in a simple and straightforward way, described in the rest of this manual; simple programs can readily be written using standard UNIX utilities such as grep or perl to access the corpus as if it were a set of plain text files. More reliably, programs can be written to application programming interfaces (APIs) such as the W3C's Document Object Model (DOM) or the Simple API for XML (SAX), using application libraries developed for almost every modern programming language (C, Perl, Python, tcl etc.). Information about such resources is not provided here, but is readily found on the World Wide Web: currently, one good place to start looking is www.xml.com. Increasingly, support for XML is built into standard utilities such as web browsers and database systems, and stylesheet processors offering a high level of sophistication are readily available.
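As an illustration of the API-based approach, the following minimal Python sketch uses the standard library's SAX interface to count word forms in a stream of XML. It is not part of the BNC distribution. The sample document is invented, and the element names <w> (word) and <c> (punctuation) are assumptions made for the example; they should be checked against the schema and the rest of this manual.

```python
import xml.sax
from collections import Counter

# Invented stand-in for a BNC text. The element names <s>, <w> and <c>
# are assumptions made for this sketch, not taken from the schema.
SAMPLE = b"""<bncDoc id="ABC">
  <s n="1"><w>The </w><w>cat </w><w>sat</w><c>.</c></s>
  <s n="2"><w>The </w><w>dog </w><w>barked</w><c>.</c></s>
</bncDoc>"""


class WordCounter(xml.sax.ContentHandler):
    """Count the lower-cased text content of every <w> element."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._buffer = None  # list of text chunks while inside a <w>

    def startElement(self, name, attrs):
        if name == "w":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so collect them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "w" and self._buffer is not None:
            word = "".join(self._buffer).strip().lower()
            if word:
                self.counts[word] += 1
            self._buffer = None


def count_words(xml_bytes):
    """Parse XML held in a bytes object and return a Counter of word forms."""
    handler = WordCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


if __name__ == "__main__":
    print(count_words(SAMPLE).most_common())
```

Because SAX reports events as the file is read rather than building the whole document in memory, the same handler could in principle be passed to xml.sax.parse() with a file name in order to process one corpus text at a time.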
When the BNC was first published, the top of the range personal computer might have as much as 50 or even 100 megabytes of disk storage and 8 Mb of RAM. At the time of writing, 50 or 100 gigabyte hard disks and 640 Mb of RAM are commonplace on entry level machines. It is thus quite likely that software capable of efficiently handling the 4.5 gigabytes of text which make up the BNC will also soon become commonplace. For the moment, however, it has to be recognised that general purpose tools for XML do not always cope very well with the large size of the whole corpus, although they can still be very useful for processing subsets extracted from it. To handle the whole of the corpus, special purpose indexing software will usually be necessary. Although such systems exist, they are often expensive or difficult to implement. For that reason, the XML edition of the BNC is still provided along with its own access software called Xaira (which can, incidentally, be used with any collection of XML texts, not simply the BNC). It should be emphasized, however, that use of the BNC is not synonymous with use of Xaira. Most generic tools developed for corpus linguistics and NLP can be used with the BNC, although the tools may vary in the extent to which they can make use of the markup in the corpus.
Whatever software is used, the programmer must have a clear understanding of the various elements tagged in the corpus, the contexts in which they may appear, and their intended semantics. The syntax of an XML document is defined by a schema. For TEI conformant texts, the TEI Header provides additional meta-information. The semantics of XML elements are provided by documentation such as that provided elsewhere in this manual.
The BNC delivery format
These three components are all included as part of the standard release of the corpus.
Text files
The BNC is delivered in compressed format, using the GNU tar utility. When expanded, it comprises 4054 distinct files, ranging in size from 1 to 45 Kbytes, and totalling about 1.5 Gbytes. Each file contains a single BNC document, i.e. a TEI header and its associated spoken or written text, and has the same name as the value of the id attribute on its <bncDoc> element. Files are grouped according to their names into a three-level hierarchy. For example, all files with names beginning AA are in a subdirectory AA, which is within a subdirectory A (along with all other subdirectories beginning with the letter A). Not all possible three-letter filenames are actually used.
Each single-letter subdirectory (A to K, excluding I) is delivered as a separate compressed archive file. The whole corpus should be unpacked into a single hierarchy, which, as delivered, is called BNC/Texts. The full name for the corpus text with identifier ABC is thus BNC/Texts/A/AB/ABC.
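The naming rule just described can be expressed as a short Python function. This is a sketch only: the function name text_path and the optional suffix parameter are hypothetical, introduced for illustration, and the default root simply follows the hierarchy name given above.

```python
from pathlib import PurePosixPath


def text_path(text_id, root="BNC/Texts", suffix=""):
    """Return the location of a corpus text from its three-character identifier.

    The text ABC lives in subdirectory AB, which lives in subdirectory A.
    `suffix` is a hypothetical convenience for installations whose files
    carry an extension; it is empty by default.
    """
    if len(text_id) != 3:
        raise ValueError(f"expected a three-character identifier, got {text_id!r}")
    return PurePosixPath(root) / text_id[0] / text_id[:2] / (text_id + suffix)


if __name__ == "__main__":
    print(text_path("ABC"))  # BNC/Texts/A/AB/ABC
```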
Note that the three-character identifiers used (and hence the directory structure) are entirely arbitrary and do not convey any information about the type of text contained. Each text contains a TEI Header which specifies all such meta-information, either directly, or by reference to the corpus header, as described in section The header. For convenience, however, this release includes an XML file called bncIndex.xml and a simple finder file called bncfinder.dat, either of which may be used to select files of particular types, as further discussed in Creating a subcorpus.
XML components
- bncxml.rng
- The BNC XML schema expressed in RelaxNG syntax
- bncxml.rng
- The BNC XML schema expressed in RelaxNG compact syntax
- bncxml.rng
- The BNC XML schema expressed in W3C schema language
- bncxml.rng
- The BNC XML schema expressed as a Document Type Definition (DTD)
- driver1.sgm and driver2.sgm
- Example XML driver files for processing the BNC.
- bncfinder.dat and bncIndex.xml
- Ancillary data files which may be used to facilitate access to the corpus (see Creating a subcorpus below.)
The remainder of this section discusses how these files may be used together as an XML document. This is by no means the only way of processing the corpus, of course, and is intended solely to demonstrate the function of the various files listed above. Some basic understanding of the components of an XML system is assumed.
- display.xsl
- converts a BNC text to an HTML format which can be read directly in a browser
- justthetext.xsl
- removes all the tagging from a BNC text; also removes the whole of the header.
- onewordperline.xsl
- converts a BNC text to a "one word per line" format
- justthecodes.xsl
- removes all the words from a BNC text; also removes the whole of the header.
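For readers who prefer a general-purpose language to XSLT, the kind of transformation performed by justthetext.xsl and onewordperline.xsl can be approximated in a few lines of Python. The sketch below is not a port of those stylesheets and its output format is not guaranteed to match theirs. The sample document is invented, and the element names teiHeader and w are assumptions that should be checked against the schema.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a BNC document: a header followed by a text.
# The element names teiHeader, s, w and c are assumptions.
SAMPLE = """<bncDoc id="ABC">
  <teiHeader><fileDesc><title>Sample title</title></fileDesc></teiHeader>
  <text>
    <s n="1"><w>The </w><w>cat </w><w>sat</w><c>.</c></s>
  </text>
</bncDoc>"""


def _parse_without_header(xml_string, header_tag="teiHeader"):
    """Parse the document and drop any header that is a direct child of the root."""
    root = ET.fromstring(xml_string)
    for header in root.findall(header_tag):
        root.remove(header)
    return root


def just_the_text(xml_string):
    """Return the text with all tagging and the header removed."""
    root = _parse_without_header(xml_string)
    # itertext() yields character data only; split/join normalises whitespace.
    return " ".join("".join(root.itertext()).split())


def one_word_per_line(xml_string, word_tag="w"):
    """Return the content of each word element on a line of its own."""
    root = _parse_without_header(xml_string)
    words = ("".join(w.itertext()).strip() for w in root.iter(word_tag))
    return "\n".join(word for word in words if word)


if __name__ == "__main__":
    print(just_the_text(SAMPLE))
    print(one_word_per_line(SAMPLE))
```

Unlike a streaming approach, this version reads a whole document into memory, which is reasonable for a single text but not for the corpus as a whole.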
The BNC corpus header
As discussed in section Basic structure above, the BNC consists of an overall corpus header, and a large number of distinct BNC documents, each with its own header. The corpus header must be present for an XML processor to work with any part of the Corpus, because the corpus header contains declarations of elements (such as the classification records) referred to by almost every part of the corpus.
The various elements making up the header and their functions are discussed in section The header. The corpus header itself is included in the file bncHdr.xml.
Creating a subcorpus
Two files are provided with this version of the corpus to assist in the selection of files according to their classification codes: bncfinder.dat and bncIndex.xml.
Using bncfinder.dat
The classification codes used in the bncfinder.dat file are listed in section Text and genre classification codes.
Using bncIndex.xml
- <idno>
- The three character identifier of the text
- <title>
- Either the short title of the text, taken from the <title> in the <fileDesc>, or the phrase [Unscripted conversation]
- <class>
- A classification code applied to the text, as given by the target attribute of the <catRef> element. One <class> is given for each classification and the following attributes are used to specify it:
- <genre>
- The genre code assigned to this text in David Lee's classification scheme, as recorded in the <classCode> element in its header
- <counts>
- Size of this text measured in various ways, as specified by the following attributes:
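As a sketch of how an index built from the elements listed above might be used to select a subcorpus, the following Python fragment collects the identifiers of all texts carrying a given genre code. Several things here are assumptions rather than facts about bncIndex.xml: the wrapper elements <index> and <entry> are hypothetical, the genre code is assumed to be the text content of <genre>, and the sample entries are invented. If the real file declares an XML namespace, the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Invented sample. The wrapper elements <index> and <entry> are
# hypothetical; only idno, title and genre come from the list above.
SAMPLE_INDEX = """<index>
  <entry><idno>ABC</idno><title>A sample novel</title><genre>W_fict_prose</genre></entry>
  <entry><idno>KB0</idno><title>[Unscripted conversation]</title><genre>S_conv</genre></entry>
  <entry><idno>ABD</idno><title>Another novel</title><genre>W_fict_prose</genre></entry>
</index>"""


def ids_with_genre(index_xml, genre):
    """Return the identifiers of all entries whose <genre> matches `genre`.

    Any element with both an <idno> child and a <genre> child is treated
    as an entry, so the code does not depend on the wrapper's name.
    """
    root = ET.fromstring(index_xml)
    selected = []
    for element in root.iter():
        idno = element.find("idno")
        code = element.find("genre")
        if idno is None or code is None:
            continue
        if (code.text or "").strip() == genre:
            selected.append((idno.text or "").strip())
    return selected


if __name__ == "__main__":
    print(ids_with_genre(SAMPLE_INDEX, "W_fict_prose"))
```

The identifiers returned can then be mapped to file locations using the directory scheme described under Text files.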
XML files may be processed in many different ways, but one of the most convenient is to use an XSLT stylesheet to transform them for display or to search them. XSLT is a very high-level programming language defined by the W3C, which offers the ability to transform and process XML documents in a variety of ways. It is (at the time of writing) the language of choice for manipulating XML on the web, where a large number of free tools and tutorials may also be found.