
British National Corpus

Formatting SARA hitlists



1. Formatting SARA hitlists

You can print the results of a SARA search, and you can cut and paste individual solutions to the clipboard. You can also save all the current solutions into a file by clicking on the "List solutions" button (or selecting Listing from the Query menu).

A Frequently Asked Question is: ‘Ah, but how do I format this XML file? It looks like gobbledegook when I open it in Word!’ This document explains why, and what you can do to improve the situation.

The first thing to understand is that the XML format used contains useful information as well as annoying pointy brackets. The start of the file tells you when the query was run, by whom, on which SARA server, and what query syntax was actually used. If you added a note to your query, that also appears at the start of the file. Each result in the query file is separately tagged as a <hit> element, with attributes giving the text identifier and the sentence number where it was found. Within each hit, the actual bit of text which SARA decided matched your query is tagged as a <kw> element.
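To make that concrete, here is a minimal Python sketch (not part of SARA) that pulls each hit apart into its text identifier, its sentence number, and the words before, inside, and after the <kw> element. Only the <hit> and <kw> element names come from the description above. The <listing> wrapper, the attribute names text and sentence, and the sample sentences are invented for illustration, so check them against a real listing file before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing fragment. Only <hit> and <kw> are taken from the
# description; the wrapper element, attribute names and sentences are made up.
SAMPLE = """<listing>
<hit text="TXT1" sentence="12">The cat sat on the <kw>mat</kw> all day.</hit>
<hit text="TXT2" sentence="340">A <kw>mat</kw> lay by the door.</hit>
</listing>"""


def read_hits(xml_text):
    """Return (text id, sentence no., left context, keyword, right context) per hit."""
    rows = []
    for hit in ET.fromstring(xml_text).iter("hit"):
        kw = hit.find("kw")
        if kw is None:
            continue  # skip any hit that has no marked keyword
        rows.append((
            hit.get("text"),      # assumed attribute name
            hit.get("sentence"),  # assumed attribute name
            hit.text or "",       # text before <kw>
            kw.text or "",        # the matched words
            kw.tail or "",        # text after </kw>
        ))
    return rows


for text_id, sentence, left, keyword, right in read_hits(SAMPLE):
    print(f"{text_id} {sentence}: {left}[{keyword}]{right}")
```

Because the keyword arrives already separated from its context, nothing has to be guessed from the layout of the line.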

(Terminological note: when you see something like this in an XML file:

  <foo burble="wibble">drone drone</foo>
what you are looking at is an instance of the <foo> element, the content of which is ‘drone drone’, and which carries a burble attribute whose value is wibble. All clear now?)
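If you would rather see those three pieces picked out by a program, here is a small Python sketch using the standard library's XML parser on the same <foo> example. It is only an illustration of the terminology; nothing in the recipes below depends on it.

```python
import xml.etree.ElementTree as ET

# Parse the one-element example from the terminological note.
elem = ET.fromstring('<foo burble="wibble">drone drone</foo>')

print(elem.tag)            # element name: foo
print(elem.text)           # content: drone drone
print(elem.get("burble"))  # value of the burble attribute: wibble
```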

Why don't we just save all the files in plain text? Because then you would have to reformat them manually to separate out the additional information (text numbers and so on) or to highlight the hit words, or else do without that information. That's not what computers are for. Why don't we just save all the files in RTF or Word or (insert name of favourite word processor here) format? Because we want to make it possible for you to use these files on any computer system on any platform, not just the ones that people in Redmond think you should buy.

We use XML because it was designed as the language for interchange of information on the web. There are already dozens of programs using XML in various ways, and there will be plenty more. The recipes below describe how you can, cost free, get your XML query files into a nicely formatted shape with some tools I happen to know about. Feel free to experiment with others -- and there will be plenty of others to experiment with. Check out the TEI Software page for links to some I've found useful.

1.1. Before you begin: saving a listing file

Since the release of the BNC World edition, we've produced a new version of the client software which provides a couple of new features to improve handling of listing files. You'll need the new version to carry out the recipes described in this note. If you don't know which version you're using, carry out the following:

1.2. Recipe 1: use CSS and a web browser

CSS is short for ‘Cascading Style Sheets’. It is a W3C-defined language for specifying how any XML or HTML document should be formatted. It allows you to state formatting properties (such as font, colour, size etc.) for any XML element, and also (within some limits) to attach additional text to one. Unfortunately, web browsers of the current generation vary greatly in their ability to handle CSS, but most of them can make a reasonable stab at displaying a SARA listing file.

In this example, I'm using a CSS stylesheet with the name bnchits.css but you can use any filename you like. You can download the text of my stylesheet from this web site (click on this link using the right mouse button): feel free to tinker with it if you don't like my choice of layout properties.

If you copy this stylesheet into the same directory as any XML listing file produced by SARA, you should be able to view the listing file with a web browser such as Opera directly. Here's a screen shot of Opera viewing this sample listing file, using this stylesheet:

Here's what you need to do to view your listing file in the same way:

(In my opinion, at present IE5's support for CSS is not brilliant. In particular it's not very good at handling attribute values, so it cannot display the text and line numbers. Netscape 6 and Amaya I have not tried, but I am told they work better. Opera on the other hand does a very good job of displaying CSS.)

1.3. Recipe 2: use XSLT to generate HTML and load that into a word processor

XSLT is a powerful general purpose stylesheet language, also defined by the W3C, which does rather more with an XML file than CSS does. You can do just about any kind of transformation imaginable using this language: converting an XML listing file to HTML with it is rather like using an Arabian scimitar to clip your nails, but none the less effective for that.

You can use an XSLT stylesheet in the same way as you used the CSS stylesheet in the previous recipe, by simply supplying its name to a web browser and having the web browser reformat the XML under its control. However, web browsers vary greatly in the ways that they do this, and the results are hardly reliable.

An alternative approach is to use one of the many XSLT engines available to translate the XML file into another format, such as HTML or plain text, which we can then load into a web browser or word processor with more predictable results. Suitable engines include xt, xalan, and saxon, but there are many others, each of them working in more or less the same way. We will use saxon in this example.
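For readers who want to see what such a translation amounts to before installing an XSLT engine, here is a rough Python sketch of the same kind of XML-to-HTML conversion. To be clear, this is plain Python rather than XSLT, and it says nothing about how saxon or any other engine works internally. As before, only <hit> and <kw> are taken from the description of the listing format; the <listing> wrapper, the attribute names text and sentence, and the sample sentences are invented for illustration.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical listing fragment; wrapper, attribute names and text are made up.
SAMPLE = """<listing>
<hit text="TXT1" sentence="12">The cat sat on the <kw>mat</kw> all day.</hit>
<hit text="TXT2" sentence="340">A <kw>mat</kw> lay by the door.</hit>
</listing>"""


def hits_to_html(xml_text):
    """Render each <hit> as an HTML paragraph with the <kw> text in bold."""
    lines = ["<html><body>"]
    for hit in ET.fromstring(xml_text).iter("hit"):
        parts = [html.escape(hit.text or "")]
        for child in hit:
            text = html.escape("".join(child.itertext()))
            if child.tag == "kw":
                text = "<b>" + text + "</b>"  # highlight the matched words
            parts.append(text)
            parts.append(html.escape(child.tail or ""))
        # Assumed attribute names; "?" is shown if one is missing.
        ref = html.escape(hit.get("text", "?") + " " + hit.get("sentence", "?"))
        body = "".join(parts)
        lines.append(f"<p>[{ref}] {body}</p>")
    lines.append("</body></html>")
    return "\n".join(lines)


print(hits_to_html(SAMPLE))
```

The HTML this prints can be saved to a file and opened in a web browser or word processor, which is the same end point the XSLT route is aiming for.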

As you have probably realised, you could use XSLT for many things. Here are a couple more examples, each of which was made by running this sample XML listing file through the XSLT stylesheet specified:


Date: (revised 21 June 2001)  Author: (revised LB).
© British National Corpus