Getting to Know BNC XML
In this tutorial we will use a few general-purpose tools to explore the BNC. You will use a plain text editor (Notepad), an XML editor (Oxygen), and a web browser.
We won't try to teach you everything about XML: just enough to give you some idea of its potential!
- Let's start by having a look at a BNC file. Use the File Explorer to navigate to the Texts directory of the BNC. You will see that this contains subdirectories A to K. Choose one of these (say A), and open it. It contains subdirectories like AA, AB, AC etc. Choose one of these (say AY) and open it, and you will see, finally, some filenames: AYA.xml etc.
- Select AYX.xml and double click on it. You should see some sort of default structured display of the XML file.
- This is not what the file really looks like! Select AYX.xml again, but this time right click, and select "Open With". Then choose Notepad from the list of available software.
- An XML file consists of a mixture of tags and data. In the BNC there are at least two tags per word, plus two for each sentence, plus others for other features such as paragraphs or utterances.
- Within the start-tags we can see attributes and their values.
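To make the idea of tags, attributes and data concrete, here is a minimal Python sketch that parses a tiny hand-written fragment in the style of a BNC sentence and lists each element with its attributes and text. The fragment is invented for illustration and is not copied from AYX.xml; the attribute names (n, c5, hw, pos) are assumed to match what you see in Notepad, so check them against the real file.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of a BNC sentence (an assumption,
# not an excerpt from AYX.xml).
FRAGMENT = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="cat" pos="SUBST">cat </w>'
    '<w c5="VVD" hw="sit" pos="VERB">sat</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def describe(fragment):
    """Return (tag, attributes, text) for the root element and each child."""
    root = ET.fromstring(fragment)
    rows = [(root.tag, dict(root.attrib), None)]
    for child in root:
        rows.append((child.tag, dict(child.attrib), child.text))
    return rows


for tag, attrs, text in describe(FRAGMENT):
    print(tag, attrs, repr(text))
```

Each word costs a start-tag and an end-tag, which is why the raw file looks so much denser than the text it contains.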
- Now open the same file with a general purpose editor that understands XML markup. Select AYX.xml again, right click, select "Open With", and choose Oxygen.
- Oxygen colours the file to distinguish markup and text. It can also reformat it more clearly: click the "Format and Indent" button (several horizontal blue lines). Scroll down to see how Oxygen represents the structure of the text visually.
- Click Undo and the reformatting disappears. You can also run what Oxygen calls a transformation scenario on the text. This is a predefined set of instructions involving a stylesheet and various parameters about input and output. A stylesheet is a way of specifying what should be done to the tagging in an XML text. Stylesheets can be written in different languages: the BNC is delivered with three examples, all written in XSLT.
- In Oxygen, select Transformation -> Configure Transformation Scenario from the Document menu (or press CTRL-SHIFT-C)
- A list of available scenarios appears. Probably none of these relate to the BNC; we need to define some that do. Proceed as follows:
- Click the New button (bottom left)
- Enter BNC-words as the Name of the scenario
- The box labelled XSL URL needs to contain the location of the stylesheet to be used. Use the folder button to navigate to it (it is in BNC-XML/XML/Scripts/justTheWords.xsl)
- Now we need to specify where the output from the transformation should go. Click the Output tab.
- Click the Save As radio button.
- Enter the following into the Save As box:
${home}/${cfn}.txt
- Click the Open in browser checkbox, and the Saved file radio button
- Press OK to return to the Configure Transformation Scenario window. Try out the transformation by clicking the Transform Now button.
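If you are curious what a "words only" transformation involves, the following Python sketch does something similar in spirit: it throws away the tags and keeps the text. It is not a port of justTheWords.xsl, whose exact output may differ; the function name just_the_words, the sample text, and the choice of one sentence per line are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the style of BNC markup (an assumption,
# not an excerpt from the corpus).
SAMPLE = (
    '<p>'
    '<s n="1"><w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="cat" pos="SUBST">cat </w>'
    '<w c5="VVD" hw="sit" pos="VERB">sat</w><c c5="PUN">.</c></s>'
    '<s n="2"><w c5="PNP" hw="it" pos="PRON">It </w>'
    '<w c5="VVD" hw="purr" pos="VERB">purred</w><c c5="PUN">.</c></s>'
    '</p>'
)


def just_the_words(xml_text):
    """Strip all markup and return the text, one <s> element per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for sentence in root.iter("s"):
        # itertext() yields the text of every descendant in document order.
        lines.append("".join(sentence.itertext()).strip())
    return "\n".join(lines)


print(just_the_words(SAMPLE))
```

The same function could be pointed at a whole corpus file by reading its contents into a string first, though a real file is much larger than this sample.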
- To save time setting up the other Transformation Scenarios, you can just press Duplicate and edit the results. We suggest you create a scenario called BNC-opl which uses the oneWordPerLine.xsl stylesheet, and another called BNC-display which uses the display.xsl stylesheet.
- Run the BNC-opl scenario to see a version of any corpus text file which you could load into (for example) a spreadsheet.
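A one-word-per-line layout is easy to imitate outside Oxygen too. The sketch below writes one tab-separated row per word or punctuation mark, which is a shape most spreadsheets can import. It is only an approximation of the idea: the columns chosen here (sentence number, word form, headword, c5 tag) and the function name one_word_per_line are assumptions, and the output of oneWordPerLine.xsl itself may be laid out differently.

```python
import csv
import io
import xml.etree.ElementTree as ET


def one_word_per_line(xml_text):
    """Return tab-separated rows: sentence number, form, headword, c5 tag."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for sentence in root.iter("s"):
        for token in sentence.iter():
            # Assumed element names: <w> for words, <c> for punctuation.
            if token.tag in ("w", "c"):
                writer.writerow([
                    sentence.get("n"),
                    (token.text or "").strip(),
                    token.get("hw", ""),  # punctuation has no headword here
                    token.get("c5", ""),
                ])
    return out.getvalue()


# Hand-written sample, not taken from the corpus.
SAMPLE = (
    '<s n="1"><w c5="AT0" hw="the" pos="ART">The </w>'
    '<c c5="PUN">.</c></s>'
)
print(one_word_per_line(SAMPLE), end="")
```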
- Run the BNC-display scenario to reformat any corpus text file for display using the default web browser.
- Can you do something like this without using Oxygen? Certainly.
- Go back to the XML source of the file and type the following line at its beginning: <?xml-stylesheet type="text/css" href="http://www.natcorp.ox.ac.uk/workshop/bnc.css"?>
- Save the file on your desktop rather than in the BNC hierarchy: name it just AYX.xml
- Now just click on the file as you did before. This time, you should see the file formatted rather extravagantly.
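If you would rather not edit the file by hand, the same stylesheet line can be added with a few lines of Python. The sketch below works on the file contents as a string; the function name add_css_stylesheet and the root element used in the demonstration are assumptions. One detail worth knowing: if a file begins with an XML declaration (a line starting with <?xml followed by a space), the stylesheet instruction has to come after it, so the sketch checks for that case.

```python
import re
import xml.etree.ElementTree as ET

CSS_HREF = "http://www.natcorp.ox.ac.uk/workshop/bnc.css"


def add_css_stylesheet(xml_text, href=CSS_HREF):
    """Return xml_text with an xml-stylesheet instruction near the top."""
    pi = '<?xml-stylesheet type="text/css" href="%s"?>\n' % href
    # An XML declaration, if present, must remain the very first thing.
    match = re.match(r"<\?xml\s[^>]*\?>\s*", xml_text)
    if match:
        return xml_text[:match.end()] + pi + xml_text[match.end():]
    return pi + xml_text


# Tiny stand-in documents (hypothetical, not real corpus files).
plain = add_css_stylesheet("<bncDoc/>")
declared = add_css_stylesheet('<?xml version="1.0"?>\n<bncDoc/>')

# Both results should still be well-formed XML.
print(ET.fromstring(plain).tag, ET.fromstring(declared).tag)
```

To apply this to a real file, read it into a string, pass it through the function, and write the result to a copy outside the BNC hierarchy, just as in the manual steps.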