Indexing a corpus for SARA

1. Introduction

This is a preliminary draft of a document explaining how to build a SARA index for any TEI-encoded corpus. It is not complete and should not be relied on for anything other than general indications. Definitive information will be provided in the SARA Technical Manual, currently (July 2001) in production.

Your comments and feedback on the usability of this document are welcome. Please send them by email to natcorp@oucs.ox.ac.uk.

This document assumes that you already know how to use SARA, and are familiar with the general concepts of TEI corpus encoding.

2. Basic organization of a SARA system

SARA (SGML Aware Retrieval Application) is a system for providing reasonably fast access to very large amounts of SGML-tagged language corpus data. It is not particularly suited to small (less than 10 Mb) datasets or data which is not structured in any way, though it can be used with such data. Using the system involves three components:

a client which interfaces between the user and...
a server which translates user requests into index lookups on the SARA data structures, which were created by...
the indexer, which supplements user-supplied corpus data with an inverted file index and lexicon.

To build a SARA system you need the following ingredients:

a collection of SGML or XML encoded texts
enough disk space: you will need approximately 3 times as much as the space occupied by the texts to build the system
a suitably powerful machine running any recent Windows operating system, or most flavours of Unix
up to date versions of the SARA software (see http://www.natcorp.ox.ac.uk/tools/sara/ for current release information)
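
As a rough way of checking the disk space requirement listed above, the following Python sketch totals the size of the files in a folder of texts and compares three times that figure with the free space on the target disk. The function names and the use of Python are illustrative assumptions and not part of SARA; the factor of 3 is simply the approximate figure quoted above.

```python
import shutil
from pathlib import Path


def required_bytes(text_dir: Path, factor: int = 3) -> int:
    """Return factor times the total size of the files directly in text_dir."""
    total = sum(f.stat().st_size for f in text_dir.iterdir() if f.is_file())
    return total * factor


def has_enough_space(text_dir: Path, target_dir: Path, factor: int = 3) -> bool:
    """Compare the estimate with the free space on the disk holding target_dir."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes(text_dir, factor)
```

Because the estimate is only approximate, it is sensible to leave some margin rather than treat a result of True as a guarantee.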

To begin with, we recommend that you create a single folder named after your corpus, and then create three subfolders called Text, Index, and Etc within it. The following discussion assumes you have done that, and that the top-level folder is called myCorps. Note that all names in SARA are case-sensitive.
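
The recommended layout can of course be created by hand. If you prefer to script it, a minimal Python sketch follows; the function name is hypothetical, and only the folder names Text, Index and Etc come from the recommendation above.

```python
from pathlib import Path


def make_corpus_layout(root: Path) -> list:
    """Create root/Text, root/Index and root/Etc and return the three paths."""
    subdirs = [root / name for name in ("Text", "Index", "Etc")]
    for folder in subdirs:
        # parents=True also creates the top-level corpus folder if needed
        folder.mkdir(parents=True, exist_ok=True)
    return subdirs
```

Calling make_corpus_layout(Path("myCorps")) creates the myCorps folder and its three subfolders in the current directory. The names are spelled exactly as in the text because SARA names are case-sensitive.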

3. Text files

Put your corpus files in the Text folder. Note carefully the following constraints:

each file in the Text folder must contain a single text together with its header, i.e. (if you are using TEI) a <TEI.2> element;
any preliminary matter (such as <!DOCTYPE declarations etc) should be removed from the file;
the file should be given a sensible (short) name, preferably without any extension, since this is the name by which the Client will identify it.
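
The constraints above can be checked mechanically before indexing. The sketch below is one possible pre-flight check in Python, assuming TEI texts; it is not a SARA tool, and the function name, the Latin-1 decoding and the strict reading of "a single text" as exactly one <TEI.2> start-tag are all assumptions made for illustration.

```python
from pathlib import Path


def check_text_file(path: Path) -> list:
    """Return a list of human-readable problems found in one text file."""
    problems = []
    # latin-1 decodes any byte sequence, so this never fails on legacy data
    content = path.read_text(encoding="latin-1")
    if "<!DOCTYPE" in content:
        problems.append("contains a <!DOCTYPE declaration")
    if not content.lstrip().startswith("<TEI.2"):
        problems.append("does not begin with a <TEI.2> element")
    if content.count("<TEI.2") != 1:
        problems.append("expected exactly one <TEI.2> start-tag")
    if path.suffix:
        # an extension is only discouraged, so treat this as a warning
        problems.append("file name has an extension")
    return problems
```

An empty list means that none of these particular problems was detected; it does not show that the file is valid TEI.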

You must also create a corpus header file. This can be placed anywhere, but it is convenient to put it in the same folder as the texts. A corpus header has exactly the same structure as any other TEI header; it is typically used to supply definitions for any code books or other encoded data used across the whole corpus. A minimal corpus header looks like this:

<teiHeader type="corpus"><fileDesc>
<titleStmt><title><!-- title for your corpus here--></title>
<respStmt>
<resp>Corpus built by</resp><name><!-- Your Name Here--></name>
</respStmt>
</titleStmt>
<editionStmt><p> First TEI-conformant version </p></editionStmt>
<publicationStmt>
<authority>Distributed by the compiler</authority>
<availability status="restricted">
<p>Availability limited to compiler</p>
</availability>
</publicationStmt>
<sourceDesc><p><!-- describe your source material here --></p></sourceDesc>
</fileDesc>
</teiHeader>
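
If you build several corpora, it may be convenient to generate this minimal header from a template. The following Python sketch fills the three commented slots of the header shown above and checks that the result parses as XML. The function names are hypothetical, and well-formedness is a weaker property than TEI conformance; an SGML header that relies on tag omission would fail this check even though it may be perfectly acceptable to SARA.

```python
from xml.etree import ElementTree
from xml.sax.saxutils import escape

HEADER_TEMPLATE = """<teiHeader type="corpus"><fileDesc>
<titleStmt><title>{title}</title>
<respStmt>
<resp>Corpus built by</resp><name>{compiler}</name>
</respStmt>
</titleStmt>
<editionStmt><p> First TEI-conformant version </p></editionStmt>
<publicationStmt>
<authority>Distributed by the compiler</authority>
<availability status="restricted">
<p>Availability limited to compiler</p>
</availability>
</publicationStmt>
<sourceDesc><p>{source}</p></sourceDesc>
</fileDesc>
</teiHeader>
"""


def make_corpus_header(title: str, compiler: str, source: str) -> str:
    """Fill in the template, escaping &, < and > in the supplied values."""
    return HEADER_TEMPLATE.format(
        title=escape(title), compiler=escape(compiler), source=escape(source)
    )


def is_well_formed(header: str) -> bool:
    """True if the header parses as XML (a weaker check than TEI validity)."""
    try:
        ElementTree.fromstring(header)
    except ElementTree.ParseError:
        return False
    return True
```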

4. Directive files

The behaviour of different parts of SARA is controlled by two files: the corpus parameter file and the corpus description file. You must create these files next. You can do this with any text editor you like (notepad, emacs, whatever comes to hand). Both files have roughly the same format, consisting of a series of lines each of which supplies a parameter and a value. The order of the lines is unimportant, but in other matters (case, spacing, etc.) it is safest to follow the examples closely.

4.1. The corpus parameter file

This file is used by each part of the SARA system in order to locate the files to be operated on. You must specify this file either explicitly or implicitly for SARA to work, and its contents must correctly identify the location of the other SARA system files.

This file also contains values for some internal settings used by the Indexer, which must also be available to the server. For this reason it is essential that the same file is used by both server/client and indexer.

A parameter file can also include comments, introduced by a sharp sign, as in the following example:

NAME=myCorps
TXT=/SARA/myCorps/Text/
HDR=corphdr
#      name for corpus header file (within TXT path)
ETC=/SARA/myCorps/Etc/
IDX=/SARA/myCorps/Index/
ACC=/SARA/Adm/
#      path to Account files (only needed for networked systems)
SORT=/temp/
#     path to temporary sort space (needed when indexing)
TMP=/temp/
#     path to scratch space used by server/client
#do not change the following settings!
HASHSHIFT=3
HASHLAST=6
IGRAN=100
ILOC=30000
GRAN=1000000

If you name the file corpus.prm, then you won't have to specify its name to the indexer or server. Obviously, you should change the path names in the above example to correspond with those you are actually using.
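
Since a wrong path in the parameter file is an easy mistake to make, it can be worth checking the file before running the indexer. The Python sketch below reads lines in the format shown above and reports which of the TXT, ETC and IDX folders do not exist. It is only an approximation: the function names are hypothetical, and SARA's own parsing rules may differ in detail from the simple NAME=value reading assumed here.

```python
from pathlib import Path


def parse_prm(text: str) -> dict:
    """Parse NAME=value lines; blank lines and lines starting with # are skipped."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a NAME=value line: {raw!r}")
        params[key] = value
    return params


def missing_folders(params: dict, keys=("TXT", "ETC", "IDX")) -> list:
    """List the path parameters that are absent or do not name an existing folder."""
    return [k for k in keys if k not in params or not Path(params[k]).is_dir()]
```

For example, missing_folders(parse_prm(Path("corpus.prm").read_text())) returns an empty list when all three folders are present. All values are kept as strings, so numeric settings such as HASHSHIFT are not interpreted.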

4.2. The corpus description file

This file contains a detailed description of the corpus data itself, for example what tags it contains, which elements function as words or sentences, which attribute is used to identify citations, what character entities are present, and a whole host of other things. You can use the indexer to make a default description file, and then modify it to match your requirements.

The corpus description file must have exactly the same name as the corpus to which it applies, must have the extension .dsc, and must be kept in the Etc folder. The name is case-sensitive.

Here is the minimum you need to create a corpus description file:

ver 101
scope P
scope S
wtag w type

This assumes that your corpus has elements P and S marking paragraphs and sentences, and that POS tagging is included as the value of the type attribute on <w> elements.

If your corpus has no POS tagging at all, your dsc file should contain the line

option noposindex
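
The naming rule and the minimal content described above can be combined in a short Python sketch. The function names are hypothetical; the four default lines are exactly those of the minimal example. Any further line, such as option noposindex, is passed in by the caller, and the sketch makes no attempt to decide which other lines such an option should replace.

```python
from pathlib import Path

MINIMAL_DSC_LINES = ["ver 101", "scope P", "scope S", "wtag w type"]


def dsc_path(etc_dir: Path, corpus_name: str) -> Path:
    """The description file is named after the corpus, with the extension .dsc."""
    return etc_dir / f"{corpus_name}.dsc"


def write_minimal_dsc(etc_dir: Path, corpus_name: str, extra_lines=()) -> Path:
    """Write the minimal description file, plus any extra lines; return its path."""
    target = dsc_path(etc_dir, corpus_name)
    lines = MINIMAL_DSC_LINES + list(extra_lines)
    target.write_text("\n".join(lines) + "\n", encoding="ascii")
    return target
```

With the layout used in this document, write_minimal_dsc(Path("/SARA/myCorps/Etc"), "myCorps") would create /SARA/myCorps/Etc/myCorps.dsc. Note that it overwrites any existing file of that name.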

5. Run the indexer

The indexer is a standalone program supplied with the latest release of the SARA system. On Unix systems, it is built at the same time as the server and other utilities, and is executed from the command line in the same way. On Windows systems, you need to start up a Windows command processor or DOS prompt and then type the appropriate commands. For example, if you have put the file indexer.exe in the folder C:\sara, your corpus folder is C:\sara\myCorps, and your corpus.prm is in the corpus folder, then you would type the following commands in the DOS window:

cd \sara\myCorps
\sara\indexer

If all is well you should see a display like the following:

C:\sara\mycorps>..\indexer
Opened index OK
Found 3 texts
DSC file is /SARA/myCorps/Etc/myCorps.dsc
Utility menu

1. Rescan file list
2. Build index
3. Build hash index files
4. Build and merge
5. Sort hash index files
6. Compress index files
7. Build dictionary
8. Pack file index
9. Index signature
10. Index bibliography
11. Statistics
12. Build frob file
13. Delete all files
14. Complete new build
15. Test a single text
16. Register a subcorpus
17. Deregister a subcorpus
18. Exit

Enter option number:

The line beginning "Found" should specify the number of text files found in the Text folder, together with the corpus header. Make sure this is correct. The cursor sits at the end of a list of choices: you want option 14, so type 14 and press return.

Almost certainly, a sequence of error messages will scroll across the screen, followed by a list of filenames, before the utility menu reappears. This time type 18 to exit.

Now have a look in the folder you named as your Etc folder in the corpus parameter file. You will see that it contains a large number of files which were not there previously. In particular, there will be a file called unknowns.txt, which you now need to append to your corpus description file (myCorps.dsc in this example), as it contains default entries for all the SGML elements, character entities, POS codes, etc. actually present in your data but not defined in your existing description.

Open the description file using the text editor of your choice, and append the contents of unknowns.txt to the end of it. If you want to edit any of the lines (for example, to supply a textual description which the client can display along with the tag name) you can do so now or later. See the reference manual for full information about what you can type in the description file.
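
If you would rather not do the append by hand, it can be scripted. The following Python sketch copies the bytes of unknowns.txt onto the end of the description file, inserting a line break first if the description file does not already end with one. The function name is hypothetical, and the sketch makes no attempt to interpret or check the entries it copies.

```python
from pathlib import Path


def append_unknowns(dsc_file: Path, unknowns_file: Path) -> int:
    """Append unknowns_file to dsc_file; return the number of bytes written."""
    existing = dsc_file.read_bytes()
    addition = unknowns_file.read_bytes()
    if existing and not existing.endswith(b"\n"):
        # keep the first appended entry on a line of its own
        addition = b"\n" + addition
    with dsc_file.open("ab") as out:
        out.write(addition)
    return len(addition)
```

Working in bytes leaves line endings and character encoding exactly as the indexer wrote them. Run it once only per indexing pass, since a second call would append the same entries again.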

Now run the indexer again, in exactly the same way as before. This time, you should not see any error messages: if you do, you should try to resolve them. (These messages are also written to a file called corpus.log if you want to review those which have scrolled off the top of the window.)

No matter the size of the corpus, SARA will always fill up the Index folder with a large number of subdirectories used as hash buckets, each of which has a number in the same range. Within these numbered hash buckets, SARA places large numbers of files with the extensions .HID and .sid. If you are short of disk space, you can safely delete the .HID files once the index run is complete.
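
Because the .HID files are spread over many numbered subdirectories, deleting them by hand is tedious. A minimal Python sketch for doing so follows. The function name is hypothetical, the extension is matched without regard to case as a precaution, and by default the function only lists what it would delete.

```python
from pathlib import Path


def remove_hid_files(index_dir: Path, dry_run: bool = True) -> list:
    """Find .HID files anywhere under index_dir; delete them unless dry_run."""
    victims = sorted(
        p for p in index_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".hid"
    )
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Call it first with the default dry_run=True to review the list, and only then with dry_run=False, and only after the index run has completed.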

6. Testing your new system

The index you have built can be used unchanged on any platform for which a SARA server has been correctly built. This includes a variety of Unix and Linux systems as well as any Microsoft Windows 32-bit environment. You can even use it under Mac OS X, but this tutorial won't tell you how.

6.1. Running under Windows

Start up the SARA Windows client. On the first screen that appears, press Menu rather than OK. You should see a list of the servers for which your client is currently registered. You need to tell it to look at the new corpus you have just indexed. Press the ADD button. Type the name of your corpus (or some other string) into the Name window and check the box marked "Local". Then press the Browse button, and navigate to the location where you have stored the corpus parameter file for your new corpus. Press OK. You will be returned to the list of available corpora, in which your new corpus should now be included. Select the name of your corpus in the list by clicking on it, and then press the OK button.

If all goes well, your corpus will now open and you should be able to search it as usual.

6.2. Running under Unix

On Unix systems, even if you are the only user, you have to set up the server for network access. This means you must first run the corpadm program (which is also built at the same time as the server and the indexer) to create the account directories. These directories are created in the path specified by the ACC directive in your corpus parameter file. There is detailed documentation of the corpadm program on the BNC web site.

Once you have set up a username and password, you should start up the server (sarad) and you can then test the functioning of your index using the solve client, or any other client you may have.

For example:

% corpadm
   corpadm> add guest guest
   corpadm> quit
% sarad
   Started server
% solve fishy
   Connected! 2 solutions
   ...
%

British National Corpus