[bnc] - Indexing a Corpus with XAIRA: a Tutorial

Indexing a Corpus with Xaira

In this exercise you will learn how to index your own corpus for use with Xaira. Indexing is quite a complex process, and we will not try to cover all of the issues involved. The reference specification for the Xaira indexer is available online at http://www.oucs.ox.ac.uk/rts/xaira/Doc/indexing.xml; the Xaira Tools Windows utility, which we will use in this exercise, also includes a Help file which should be consulted for more detailed information.

The Xaira Tools Windows utility also has a ‘wizard’ interface, which can easily cope with the most common varieties of corpus. We'll demonstrate that by using it to build three different versions of the same corpus with increasing amounts of complex markup.

In the tutorial, we use as a sample corpus three different versions of Varney The Vampyre, a famous 19th-century English popular novel. First serialized in 1845-47, this work has no literary merit whatsoever, but it does have some interesting linguistic properties. Naturally, you should feel free to substitute your own texts if you prefer: we provide this text only to show how the various options work. You can download our three versions from the website for this workshop: http://www.natcorp.ox.ac.uk/workshop/Materials/allvarney.zip

Download this archive, and unpack it into a temporary folder on your desktop. The archive contains three different folders, Plain, Xml, and Wtagged. Each folder contains about 230 small files, each containing one chapter of the novel.
Although you have a version of Xaira on your desktops, for this exercise you need to download and install a later version. Uninstall the old version first. You can find the new version at: http://www.oucs.ox.ac.uk/rts/xaira/Download/ (Build 241114)
Once you have installed the new version of Xaira you can fire up the xaira-tools program (it should be on the Start menu, along with the other bits of Xaira).
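If you would rather script the unpacking of the Varney archive than do it by hand, the following is a minimal sketch. It assumes that the archive has already been downloaded and that the Plain, Xml, and Wtagged folders sit at the top level of the zip file; the function name and the target folder are our own invention, not part of Xaira.

```python
import zipfile
from collections import Counter
from pathlib import Path


def unpack_and_count(archive_path, target_dir):
    """Unpack a zip archive and count the files under each top-level folder."""
    target = Path(target_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)

    counts = Counter()
    for path in target.rglob("*"):
        if path.is_file():
            relative = path.relative_to(target)
            # Assumption: the first path component is the folder name
            # (Plain, Xml or Wtagged in the tutorial archive).
            if len(relative.parts) > 1:
                counts[relative.parts[0]] += 1
    return dict(counts)
```

Calling `unpack_and_count("allvarney.zip", "Work")` should then report roughly 230 files for each of the three folders, which is a quick way to confirm that nothing went missing during the download.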

Indexing a corpus with no markup

Select the `Index Wizard' command from the File menu.
The Corpus Name dialog opens. Enter a name for your corpus ("Varney-Plain" for example), and a description of the corpus if you like. Press NEXT to continue.
The Corpus Root dialog opens. Xaira creates a directory using the name you gave in the previous step to hold all the components of your indexed corpus. This directory will be created inside a directory called My Corpora along with your other Documents and Settings. Assuming you don't want to change that, just press NEXT to continue.
The Texts dialog opens. You need to tell Xaira where to find the texts you will use for your corpus. Press BROWSE and navigate to the Work folder (or wherever you unzipped the Varney archive). Select the folder named Plain inside this archive. Press NEXT to continue.
The Markup dialog opens. Decision time! Xaira needs to know what sort of markup there is in these texts. These files are innocent of markup, so select the radio button labelled Plain Text and press NEXT to continue.
The File list dialog opens, showing all the files available in the directory you specified above. We will use all of them, so just press NEXT to continue.
The Character encoding dialog opens. Xaira will try to guess what character encoding your files are using. It's usually safest to accept the default. Press NEXT to continue.
The Reading files dialog opens. This is where Xaira checks to see whether your XML is all valid. (When you copied the plain text files, Xaira added some minimal tagging to your documents to make them into XML.) Press GO and when it's finished press NEXT to continue.
The Language dialog opens: this is important if you are working with a non-Roman alphabet. Pick the appropriate language from the list (en) and press NEXT to continue.
The Indexing dialog appears. This is where the real work takes place. When you press INDEX the Xaira indexer will process all your files, creating the indexes it needs. It will also create (in the directory etc) a file called corpus.log with details of the whole process: you will need to look at this if anything goes wrong. The process can take several minutes if you are indexing more than a few small files.
Click on the checkbox "View corpus in Xaira client" if necessary, and then press the Finish button to exit the wizard.
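The Reading files step mentions that plain text is given some minimal tagging before it is checked. We do not document here exactly which tags Xaira adds, so the sketch below only illustrates the general idea: it wraps each input line in an element and escapes the characters that are special in XML. The element names `text` and `line` and the attribute names `id` and `n` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_plain_text(text, text_id):
    """Wrap plain text in minimal XML, one element per input line.

    The element and attribute names are invented for illustration;
    they are not the ones Xaira itself generates.
    """
    lines = [f"<text id={quoteattr(text_id)}>"]
    for number, line in enumerate(text.splitlines(), start=1):
        # escape() protects &, < and > so that the result stays well-formed.
        lines.append(f'  <line n="{number}">{escape(line)}</line>')
    lines.append("</text>")
    return "\n".join(lines)
```

Numbering the lines in this way is one plausible reason why, in a plain text corpus, the natural unit of context and reference is the input line.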

Your corpus is ready for use! Using the Xaira client with this version:

You can look for words and phrases (find all one letter words)
The context returned for any query is an input line
Hits are referenced by chapter number and input line number
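To make the three points above concrete, here is a rough sketch of what a search for one-letter words amounts to on plain text. It is not how the Xaira client is implemented: it simply scans each chapter line by line with a regular expression and reports each hit with its chapter identifier and line number. The function name and the shape of the `chapters` argument are assumptions of ours.

```python
import re

# A single letter with a word boundary on each side. Note that this simple
# pattern also matches the "t" in "don't"; a real tokenizer may differ.
ONE_LETTER_WORD = re.compile(r"\b[^\W\d_]\b")


def find_one_letter_words(chapters):
    """Yield (chapter, line_number, word, line) for each one-letter word.

    `chapters` maps a chapter identifier to that chapter's plain text.
    """
    for chapter, text in chapters.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            for match in ONE_LETTER_WORD.finditer(line):
                # The whole input line is returned as the context of the hit.
                yield chapter, line_number, match.group(), line
```

For example, `list(find_one_letter_words({"1": "I saw a bat"}))` yields one hit for "I" and one for "a", both referenced as chapter "1", line 1.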

Feeling ambitious? How would you go about finding all forms of the verb "go"?

Indexing a corpus with simple XML markup

There are many ways in which your corpus might use XML markup. As far as Xaira is concerned, the XML tagging might be used to:

indicate where the boundaries of texts are (if for example you have more than one text in a file)
provide metadata about each text, such as its title, or some indication of its type
indicate default context boundaries, such as sentences or paragraphs, which can be used to identify locations within the text
explicitly tokenize the text and add word-level annotation such as POS tags

Xaira doesn't mind what XML tags you use to indicate any of the above, but it needs to know which they are.

As noted above, we've provided two versions of the Varney text marked up in XML to demonstrate some of what's possible. In this exercise we will look at a version which just has a few XML tags: a <chap> element for each chapter, a <p> for each paragraph, and within that a <q> element for direct speech, and a <hi> element for italicized words. Each <chap> element has a @n attribute which gives its number within the whole text, and at least one <head> element giving its title.
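If you want to see what this markup looks like before indexing it, the sketch below parses a small chapter with Python's standard library and summarizes the elements just described. The sample text is invented for illustration and is not taken from the novel; only the element and attribute names come from the description above.

```python
import xml.etree.ElementTree as ET

# An invented fragment using the tag set described above.
SAMPLE = """<chap n="1">
  <head>An invented chapter title</head>
  <p>The clock struck <hi>twelve</hi>.</p>
  <p><q>Ah! who is there?</q> she cried.</p>
</chap>"""


def summarize_chapter(xml_text):
    """Summarize one <chap> element: number, headings, paragraphs, speech."""
    chap = ET.fromstring(xml_text)
    return {
        "n": chap.get("n"),
        "heads": [head.text for head in chap.findall("head")],
        "paragraphs": len(chap.findall("p")),
        # itertext() gathers the text of an element and all its descendants.
        "speech": ["".join(q.itertext()) for q in chap.iter("q")],
        "highlighted": ["".join(hi.itertext()) for hi in chap.iter("hi")],
    }
```

Running `summarize_chapter(SAMPLE)` on one of your own files is a quick way to confirm which elements and attributes it actually contains before you have to name them in the wizard.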

Indexing XML is much the same as indexing plain text, except that a few additional dialogs will appear which enable you to tell Xaira how it is meant to use the XML tagging in your corpus.

If the Xaira Tools window is still open, close it. Then start the application up again (this is a bug!)
Select the `Index Wizard' command from the File menu.
The Corpus Name dialog opens. Enter a name for your corpus ("Varney-XML" for example), and a description of the corpus if you like. Press NEXT to continue.
The Corpus Root dialog opens. Xaira creates a new directory using the name you gave in the previous step to hold all the components of your indexed corpus. This directory will be created inside a directory called My Corpora along with your other Documents and Settings. Assuming you don't want to change that, just press NEXT to continue.
The Texts dialog opens. You need to tell Xaira where to find the texts you will use for your corpus. Press BROWSE and navigate to the Work folder (or wherever you unzipped the Varney archive). Select the folder named Xml inside this archive. Press NEXT to continue.
The Markup dialog opens. Decision time again! Xaira needs to know what sort of markup there is in these texts. This time, select the radio button "XML".
The File Structure dialog opens. If your corpus contains more than one file, you need to tell Xaira how the files you selected map to the XML structure. The usual option, and the simplest, is that each file contains one XML text element: this is what the Wizard knows as Model 1, and is what we will use throughout this exercise. Press NEXT to continue with the default options.
The File list dialog opens, showing all the files available in the directory you specified above. This time we will just select a few of them, so press the Select/deselect files button.
Another File list dialog appears, with buttons down the right hand side allowing you to select, reorder, or deselect files from the list. This has many features; for this exercise, we suggest you simply select the first ten files, for example by clicking on the filename and dragging downwards. When ten filenames are highlighted, press the Select button to select only these files, removing the others, and then press the OK button to return to the Wizard. Press NEXT to continue.
The Reading files dialog opens. This is where Xaira checks to see whether your XML is all valid. Press GO to check the selected files. File ch-000.xml is in error (it is empty), so if you included it, you should be able to press the View Log button to see the error message this produces.
If there are many errors, you can print the error report out before closing the window in which it is displayed. In the Reading files dialog, press Back to return to the File list dialog, and then press Select/deselect files to reopen the file selection dialog. Click on ch-000.xml and press the Delete button to remove it from this version of the corpus. Then continue as before: this time you should get no errors, and can press NEXT to continue.
As before, the Language dialog opens: pick the appropriate language from the list (en) and press NEXT to continue.
On the Text Delineation dialog, select the element chap from the scrolling list on the left, and "n" from the list on the right. This tells Xaira to treat each new <chap> element as a new text, and to identify each text by means of its @n attribute value. Press NEXT to continue.
On the Unit Delineation dialog, select the element p from the scrolling list on the left, and "Auto number" from the list on the right. This tells Xaira to use the <p> element as the default context for searches, and to identify hits by paragraph number within chapter. Press NEXT to continue.
In the Word Delineation dialog, check the box Use Unicode rules to tokenize text. This tells Xaira that it should use standard rules about word separator characters to identify the word forms to be indexed within the corpus. Press NEXT to continue.
The Bibliography dialog is used if each corpus text contains structured metadata which Xaira will use to identify individual texts. This is not used in the Varney corpus, so just press NEXT to continue.
The Indexing dialog appears as before. Press Index to process all your files.
Click on the checkbox "View corpus in Xaira client" if necessary, and then press the Finish button to exit the wizard.
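If you would like to find problem files such as the empty ch-000.xml before you start the wizard, a check along the following lines may help. It is only a sketch: it uses Python's standard XML parser, which tests whether each file is well-formed and nothing more, and we make no claim that the Reading files step applies exactly the same test. The function name and the `*.xml` file pattern are assumptions.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_well_formed(folder, pattern="*.xml"):
    """Return {filename: error message} for files that fail to parse."""
    errors = {}
    for path in sorted(Path(folder).glob(pattern)):
        try:
            ET.parse(path)
        except ET.ParseError as error:
            # An empty file is reported here too, since it has no root element.
            errors[path.name] = str(error)
    return errors
```

An empty dictionary means every file parsed cleanly; otherwise the keys tell you which files to deselect or repair.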

Your corpus is ready for use! Using the Xaira client with this version:

You can find highlighted phrases (use the XML query button to search for the <hi> element)
The context returned for any query is a paragraph rather than a single line
Hits are referenced by chapter number and paragraph sequence number
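The effect of the Text Delineation and Unit Delineation choices can be imitated outside Xaira. The sketch below, which is our own illustration rather than anything the client does internally, finds every <hi> element in a chapter and reports it with the chapter's @n value, the paragraph's sequence number, and the whole paragraph as context.

```python
import xml.etree.ElementTree as ET


def find_highlighted(chapter_xml):
    """Return (chapter n, paragraph number, highlighted text, paragraph text)."""
    chap = ET.fromstring(chapter_xml)
    hits = []
    # Paragraphs are numbered in document order within the chapter,
    # mirroring the "Auto number" choice made in the wizard.
    for number, paragraph in enumerate(chap.iter("p"), start=1):
        context = "".join(paragraph.itertext())
        for hi in paragraph.iter("hi"):
            hits.append((chap.get("n"), number, "".join(hi.itertext()), context))
    return hits
```

Because the paragraph is the unit of context, a hit inside a long paragraph brings the whole paragraph with it, which is the behaviour you should see in the client for this version of the corpus.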

Feeling ambitious? Try using Query Builder to find occurrences of the word "Ah" at the start of a paragraph.

Indexing a corpus with word-level markup

In the third version of the corpus, we have marked up the word boundaries using the <w> XML element. As well as allowing us to tokenize the text explicitly, this also allows us to annotate each word with a part of speech and a root form, using attributes @pos and @hw respectively. This was done using a freely available tagging program called TreeTagger, which also enables us to identify sentence divisions automatically in the text, with moderately successful results. These sentence divisions are useful for delimiting searches and give us rather more accurate reference information, as we will see.
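To give an idea of what word-level markup makes available, the sketch below reads the <w> elements out of a small invented sample. We assume that sentences are marked with an <s> element carrying an @n attribute, and that the root form is held in an attribute called hw, which is the key name the wizard reports for this corpus; pass a different `lemma_attribute` if your files use another name. The part of speech codes in the sample are illustrative only.

```python
import xml.etree.ElementTree as ET

# An invented fragment; the words and tags are for illustration only.
SAMPLE = """<chap n="1">
  <p>
    <s n="1"><w pos="PP" hw="he">He</w> <w pos="VVD" hw="lie">lay</w> <w pos="RB" hw="still">still</w></s>
    <s n="2"><w pos="PP" hw="it">It</w> <w pos="VBD" hw="be">was</w> <w pos="DT" hw="a">a</w> <w pos="NN" hw="lie">lie</w></s>
  </p>
</chap>"""


def read_tokens(xml_text, lemma_attribute="hw"):
    """Return (chapter, sentence, form, pos, lemma) for every <w> element."""
    chap = ET.fromstring(xml_text)
    tokens = []
    for sentence in chap.iter("s"):
        for word in sentence.iter("w"):
            tokens.append((chap.get("n"), sentence.get("n"), word.text,
                           word.get("pos"), word.get(lemma_attribute)))
    return tokens
```

Each token now carries a chapter and sentence reference as well as its annotation, which is the information the indexer needs in order to offer searches by part of speech or by root form.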

If the Xaira Tools window is still open, close it. Then start the application up again (this is a bug!)
Select the `Index Wizard' command from the File menu.
The Corpus Name dialog opens. Enter a name for your corpus ("Varney-POS" for example), and a description of the corpus if you like. Press NEXT to continue.
The Corpus Root dialog opens. Press NEXT to continue.
The Texts dialog opens. Press BROWSE and navigate to the Work folder (or wherever you unzipped the Varney archive). Select the folder named Wtagged inside this archive. Press NEXT to continue.
The Markup dialog opens. Decision time again! Select the radio button "XML".
The File Structure dialog opens. As before, choose Model 1 and press NEXT to continue with the default options.
The File list dialog opens. Although it may take a little longer, we recommend indexing the whole of this version of the corpus so simply press NEXT to continue.
The Reading files dialog opens. Press GO to validate the files: there should be no errors this time. Press NEXT to continue.
As before, the Language dialog opens: pick the appropriate language from the list (en) and press NEXT to continue.
On the Text Delineation dialog, select the element chap from the scrolling list on the left, and "n" from the list on the right, as before. Press NEXT to continue.
On the Unit Delineation dialog, this time select the element s from the scrolling list on the left, and "n" from the list on the right. Press NEXT to continue.
On the Word Delineation dialog, select the <w> element and press NEXT to continue.
The Additional Keys dialog opens: the default keys identified by the wizard are the attributes pos and hw. Press NEXT to continue.
The Bibliography dialog appears; as before, just press NEXT to continue.
The Indexing dialog appears as before. For this corpus, we need to take one more step which the Wizard cannot help us with. Do not press Index yet! Instead, press Cancel to close the wizard.
The Corpus Wizard cannot do everything. The Xaira Tools utility allows you to fine tune almost all aspects of your corpus indexing. We will demonstrate this by defining what Xaira calls a lemma scheme for your corpus. Proceed as follows:
- In Xaira Tools, select `LemmaSchemes' from the Tools menu. The Lemma Schemes dialog opens.
- Press Add to create a new lemma scheme: a second Lemma scheme dialog box opens.
- In the Name box, enter "TT" for the lemmatization scheme.
- In the Gloss box, enter "Tree Tagger", since this is the agency solely responsible for this lemmatization scheme.
- Choose hw from the list of available additional keys and press the Add button to add it to the lemma scheme.
- Press OK to return to the first Lemma Schemes dialog, and press OK again.
- You are now ready to index your texts. Select Indexer/Run from the Tools menu.
When indexing is complete, start the Xaira client, select Open from the File menu, and navigate to the xcorpus file created in My Corpora\Varney-POS to open it.

Your corpus is ready for use! Using the Xaira client with this version:

You can distinguish verbal and nominal forms of the same orthograph (do a word query for "lie")
You can find words grouped by lemma (find all forms of the lemma "be")
Hits are referenced by chapter number and sentence number
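The first two of these queries depend on the part of speech and root form recorded for each word. As a rough illustration of the idea, and not of how Xaira stores its indexes, the sketch below groups word forms by lemma and lists the tags found for a given form. The token list is invented, and the tag names are illustrative only.

```python
from collections import defaultdict

# Invented (form, pos, lemma) triples; the tags are illustrative only.
TOKENS = [
    ("lie", "NN", "lie"), ("lie", "VV", "lie"), ("lay", "VVD", "lie"),
    ("was", "VBD", "be"), ("is", "VBZ", "be"), ("been", "VBN", "be"),
    ("was", "VBD", "be"),
]


def forms_by_lemma(tokens):
    """Map each lemma to the set of word forms recorded under it."""
    index = defaultdict(set)
    for form, _pos, lemma in tokens:
        index[lemma].add(form.lower())
    return index


def tags_for_form(tokens, form):
    """Return the set of part of speech tags attached to one word form."""
    return {pos for token_form, pos, _lemma in tokens
            if token_form.lower() == form.lower()}
```

Here `forms_by_lemma(TOKENS)["be"]` gathers "was", "is" and "been" under one heading, while `tags_for_form(TOKENS, "lie")` shows that the same spelling occurs both as a noun and as a verb.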

Feeling ambitious? Try using Query Builder to find sentences containing more than three adjectives in a row.
