Introduction to SARA98

This worksheet introduces you to some of the key features of the SARA software. It does not cover everything that the software can do but it gives a good indication of the kinds of facilities available. Please use it as a basis for your exploration of the system.

1. Getting Started

Start up SARA by clicking on the SARA icon (the big blue S) on your Windows desktop, or by selecting SARA from the Program Menu items in the usual way.

The About SARA popup appears, with the text ‘Waiting to contact server zeus’.

If you are using SARA to connect to some other server or corpus, the name which appears may be different. SARA can be configured to work with other corpora, with other servers providing access to the BNC, or with a copy of the BNC installed locally on your hard disk. You can click on the menu button to see what other servers are available, or to add a new one. Configuring your client to operate with other corpora is not covered in this tutorial.

Click on the OK button to start working with the BNC. On a networked server, a logon dialogue box appears. Enter your username and password, and then press the OK button or hit Enter. A brief SARA message confirming that you are connected to the server appears. Press Enter again.

The message ‘Initialising please wait’ appears at the bottom of the screen, and there is a short delay during which details about the corpus being searched are loaded into the program. When this process is complete, a minimized window titled BNC-2 appears in the bottom left of the screen.

This is the corpus Browse window. If you open it, you will see a list of the texts making up the corpus. You can select texts from the list for various purposes, e.g. to make a subcorpus or simply to browse them; however, this use of the window is not covered in this tutorial.

Enlarge the window to its full size by clicking on the middle of the three buttons in its top right hand corner, and drag the minimized BNC2 window to the bottom left hand corner, for tidiness' sake.

At the top of the screen you see the usual Windows menu items (File, Edit, Texts, View, Window, and, on the far right, Help). Below that you can see a number of buttons which we call collectively the Toolbar.

Using the mouse, move the cursor so that it hovers over each button in turn, without clicking the mouse.

After a second or so, the name of the button will appear in a small popup and a brief description of its purpose will appear at the bottom of the screen. Each button provides rapid access to a specific function also available from a menu. In this tutorial we will be using the following buttons:

Word Query
SGML Query
Query Builder
Edit
Sort
Thin
Page/Line mode
Format (this one isn't really a button)
Collocation

Locate each of the buttons listed above on the toolbar. As with other Windows applications, you can change the buttons available and move them around, using the Toolbars command on the View menu and the mouse in the usual way.

In this tutorial we will discuss just a few of the functions available; for an overview of all of them, you may like to explore the built-in Help system, by selecting Contents on the Help menu, or simply by pressing the F1 key.

The bottom of the screen contains a message area and a status bar, in which SARA posts useful information about its current state. Until you carry out a query, however, there is not much to say here.

2. A Quick query

The Quick query box is the unlabelled box on the toolbar. It provides the quickest way of searching the corpus for a word or phrase.

Type just about into the quick query box and then press Enter.

The cursor will turn briefly into an hourglass while SARA searches, and then the ‘Too Many Solutions’ alert will appear, telling you the result of the search: there are 1651 occurrences of this phrase in 927 different texts.

By default SARA will not download and display more than 100 solutions to any query; you must therefore specify how many solutions you want to see and how they should be selected from those available. You can change this behaviour using the Preferences command on the View menu, which we discuss later on in this tutorial.

Using the appropriate radio buttons, specify that you would like a random sample of 100 hits, with only one per text, then click on OK.

Solutions will now be downloaded, one by one. By default, you will see only the first solution. Press the PgDn key to see the next solution.
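
The selection made in the ‘Too Many Solutions’ alert can be pictured with a short Python sketch. It is an illustration only: the function name, the (text, sentence) pairs and the sampling method are assumptions made for this example, and nothing here describes how SARA itself chooses solutions.

```python
import random


def sample_one_per_text(hits, limit, seed=None):
    """Return up to `limit` hits chosen at random, at most one per text.

    `hits` is a list of (text_id, sentence_number) pairs. This is a
    hypothetical representation, not SARA's internal format.
    """
    rng = random.Random(seed)
    by_text = {}
    for text_id, sentence in hits:
        by_text.setdefault(text_id, []).append((text_id, sentence))
    # Pick one hit at random from each text, then shuffle the texts.
    chosen = [rng.choice(group) for group in by_text.values()]
    rng.shuffle(chosen)
    return chosen[:limit]


# Text identifiers other than A0F are invented for the example.
hits = [("A0F", 142), ("A0F", 977), ("B1K", 12), ("C8T", 301), ("C8T", 45)]
sample = sample_one_per_text(hits, limit=2, seed=1)
```

With the one-per-text option, the number of solutions can never exceed the number of different texts containing the phrase.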

If you look at the status bar, you will see that it now contains additional information. Reading from left to right, you should see something like the following: BNC2 bnc 2:100(100) A0F 142. This indicates that the name of the corpus being searched is BNC2, that the lemmatization scheme in effect is called bnc, that the currently highlighted solution is number 2 of 100 chosen from 100 different texts, and that it appears in text A0F at sentence number 142 (your numbers will probably differ, since you are doing a random sample).

3. Displaying solutions

On the toolbar there is a window in which you see the word Plain. Use the arrow next to this to select each of the other possible display formats (Colour, SGML, and Custom) in turn, and compare the results.

In Colour format, different parts of speech are displayed in different colours, and the POS code itself appears when the mouse hovers over a word. In SGML format, you can see all the underlying markup in the file. Custom format displays the text in a user-defined way, using the markup according to specific requirements. In Plain format, you see only the words and punctuation of the text, with the search term highlighted.

Locate the blue Page/Line mode toggle button and click on it. Then click it again.

This button switches between displaying solutions one at a time and displaying solutions in the traditional one-per-line KWIC format. In either mode, you can scroll through the solutions using the PgDn and PgUp keys; in line mode you can also use the arrow keys.
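
For readers unfamiliar with the KWIC (key word in context) layout, the following Python sketch builds one-per-line concordance entries from a list of words. The function name, the context width and the sample sentence are all invented for illustration; SARA produces this display for you.

```python
def kwic_lines(tokens, phrase, width=4):
    """Return (left context, hit, right context) for each match of `phrase`."""
    target = [w.lower() for w in phrase.split()]
    lowered = [t.lower() for t in tokens]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(tokens[max(0, i - width):i])
            hit = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, hit, right))
    return lines


# Invented sentence, not taken from the corpus.
text = "We had just about finished when it was just about time to go".split()
for left, hit, right in kwic_lines(text, "just about"):
    # Right-align the left context so that the hits line up in a column.
    print(f"{left:>25}  [{hit}]  {right}")
```

Lining up the hit in a central column is what makes patterns to its left and right easy to scan.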

Select line-mode display and press the Down arrow key a few times.

Note the dashed lines surrounding what is known as the current solution. Information about this solution is displayed on the status bar at the bottom of the screen.

Click the right mouse button.

A menu appears from which you can choose to copy the current solution to the clipboard, to expand the amount of context visible for that solution (Max Scope), to select the current solution or a series of solutions (only in Line mode), and to view information about the Source of the current solution. Experiment with these options to see their effect.

You can also change the font in which solutions are displayed, set the colour scheme used for display of POS information, and set default preferences for display mode etc. in subsequent queries using commands on the View menu.

4. Sorting solutions

Switch to Line mode and Custom or Plain format. Find the Sort button on the toolbar (it has letters A and Z with an arrow beside them), and click on it.

A dialogue box appears in which you can specify how the concordance lines should be sorted. You can use different sorting orders and other options for each of two sort keys, and also specify a particular collating method, using radio buttons in this dialogue box. You can also indicate how many words are to be considered when sorting using the Span window.

As primary key, select the Right radio button and specify a 2 word span. Press the Sort button.

The lines will be sorted by the words to the right of the hit, so phrases such as just about everyone will be grouped together.
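
The effect of a right-hand sort key with a 2 word span can be sketched in Python as follows. The concordance lines are invented, and the collating rule used here (plain comparison of lower-cased words) is an assumption; SARA offers several collating methods of its own.

```python
def sort_by_right_context(lines, span=2):
    """Sort (left, hit, right) lines by the first `span` words to the right."""
    def key(line):
        return [w.lower() for w in line[2].split()[:span]]
    return sorted(lines, key=key)


# Invented concordance lines for the phrase "just about".
lines = [
    ("it was", "just about", "time to go"),
    ("and", "just about", "everyone agreed"),
    ("he is", "just about", "everyone 's friend"),
    ("I had", "just about", "enough of it"),
]
ordered = sort_by_right_context(lines, span=2)
```

Because only the first two words to the right are compared, lines that share those words stay together regardless of what precedes the hit.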

Now change to Colour display format, and press the Sort button again.

When solutions are displayed in colour format, a radio button labelled POS code is available as a Collating option, making it possible for you to sort the lines according to the POS codes of words they contain.

Click on the POS code radio button and then click on the Sort button or press Enter to sort the solutions again, this time using the POS code of the 2 words following the hit as primary key.

All solutions where just about precedes an article or determiner (AT0) will be sorted to the top of the list, and all cases where it precedes a verb (V..) will sort to the end.

5. Setting Preferences

Before doing further queries, let us set some Preferences for the toolbar and display of results.

Under the View menu, first select Toolbars. Check all the options except Compatibility, and click on OK. Then select the View menu again, followed by Preferences. Under Default View options, check the following: Custom format, Automatic scope, View Query and Concordance.

View Query will show you the text of any query at the head of the solutions display; Concordance will show the solutions in a KWIC (one-per-line) format. Custom format will display certain elements, such as new paragraphs and utterances, in particular ways on the screen, while Automatic scope will display roughly one sentence of context for each citation.

The preferences just specified will be active for the remainder of this SARA session: if, by any chance, you are disconnected from the server or accidentally close the program, you should reset them before continuing.

6. Word queries

SARA maintains a list of all the distinct word forms in the corpus, together with their frequencies and part of speech codes. We refer to this list as the lexicon. You use the Word Query command to search the lexicon in a number of different ways, and to find places in the corpus where these word forms occur. The Collocation facilities allow you to detect statistically significant patterns of co-occurrence for word forms in the corpus, while the Lemmatization facilities allow you to group words together under linguistically significant headings regardless of their orthographic form.

The Word Query button looks like a small white box with a vertical yellow stripe.

6.1. Welcome to Wine Street

Click on the Word Query button to open the Word Query dialog box. At the foot of this dialog box there are two check boxes, labelled Show Controls and Show Forms. Check both of them.

The Word Query dialog box expands to provide additional tabs with which you can control its behaviour, and an additional window opens for display of forms.

Click in the box at the top left of the dialog box, and type in the string wine. Then click on the LookUp button (or press Enter).

A list of all the word-forms in the lexicon which begin with the letters wine is displayed in alphabetical order. The other columns show the frequency and the number of different forms grouped under that entry using the current lemmatization scheme. In the default scheme (called bnc) words which differ only in their part-of-speech code are grouped under a single entry.

Click on the column heading Frequency.

The entries are sorted in descending order of frequency.

You can also restrict entries by frequency by using radio buttons on the Download tab. We do not cover that in this tutorial.

Click on the entry for wine.

The different word-forms grouped under this entry are displayed in the lower window. You will see that wine, while generally classified as a singular common noun (NN1), also appears in the lexicon as a proper noun (NP0).

Click on the NP0 form in the lower window to select it. Then click the Query button to download the solutions containing Wine as a proper noun.

Scroll through the solutions. In what city is Wine Street? Who was born there? What famous poem was written there?

Once your curiosity is satisfied, you may wish to investigate uses of wine as a common noun. To save time re-defining a query from scratch, we will use the Edit button on the toolbar: this looks like a tiny pencil writing on a blue-edged screen.

Click on the Edit button. Your previous Word Query dialog box reappears. Click on the greyed out wine entry in the upper window, then on the wine form marked NN1 (singular common noun) in the lower window, and then click the Query button.

The ‘Too Many Solutions’ dialogue appears, reminding us that there are 6050 solutions to this query. In the next section we will use the collocation options in SARA as a means of investigating this mass of data.

6.2. Collocations

In the ‘Too Many Solutions’ dialog box, click the Random radio button and enter 20 as the number of hits to be downloaded. Press Return or click the OK button. Once the hits have been displayed, click the Collocation button on the toolbar.

The Collocations dialog box will be displayed.

You do not need to download all the results of a query to perform a collocation analysis, but you must download some.

Make sure that the Downloads only box is unchecked so that the analysis will be calculated for all 6050 solutions, not just the 20 you have downloaded.

Click the Controls checkbox.

The dialog box expands to offer you several additional tabs which can be used to control the calculation. You can set the window, i.e. the span of words around the hit within which collocations are to be sought; you can use the download tab to control which of the possible collocates are to be displayed; you can apply a lemmatization scheme, and you can also modify the way the collocation scores are calculated. In this exercise we will use only the first two of these.

Click on the Window tab and set the window to 3 words to the left and 3 to the right.

Click on the Download tab, and then on the second radio button, which reads Only the highest-scoring matching word forms. Enter the number 10 in the box following, if it is not already there.

To perform the collocation analysis with these parameters, click on the Calculate button.

It may take several minutes for SARA to calculate all the collocates and their scores: wait for the red light on the status bar to go out before you try to do anything else. This may be a good time for a cup of coffee.

The collocate display is designed to show you words which cluster together: this is expressed by means of a frequency-based statistic known as the z-score. The higher the z-score, the more significant the clustering.
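
This worksheet does not give the formula SARA uses for its z-score. As a rough guide to what such a statistic measures, here is a Python sketch of one commonly used formulation, which compares the observed number of co-occurrences with the number expected if the collocate were scattered evenly through the corpus. The function and parameter names are invented for the example, and the figures SARA reports may be calculated differently.

```python
import math


def z_score(observed, f_node, f_collocate, corpus_size, span):
    """Collocation z-score under one common formulation (an assumption here).

    observed     co-occurrences of node and collocate within the window
    f_node       corpus frequency of the node word (e.g. wine)
    f_collocate  corpus frequency of the collocate (e.g. mulled)
    corpus_size  total number of words in the corpus
    span         number of word positions in the window around each hit
    """
    # Chance of meeting the collocate at any one position outside the node.
    p = f_collocate / (corpus_size - f_node)
    expected = p * f_node * span
    return (observed - expected) / math.sqrt(expected * (1 - p))


# Toy figures chosen for easy arithmetic, not taken from the BNC.
score = z_score(observed=20, f_node=100, f_collocate=100,
                corpus_size=1100, span=1)
```

In this toy case 10 co-occurrences are expected by chance and 20 are observed, so the score is positive; a score near zero would mean the two words co-occur about as often as chance predicts.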

You can re-sort the list by clicking on the relevant column heading. You can save a copy of the list by clicking on the Save button and choosing an appropriate format. You can also choose to calculate significance using the Mutual Information statistic rather than Z-scores.
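
Mutual Information can be sketched in the same spirit. The version below takes the base-2 logarithm of the ratio between observed and expected co-occurrences; this is a widely used definition, but it is offered here as an assumption, since the worksheet does not state how SARA computes the figure. All names and numbers are invented for the example.

```python
import math


def mutual_information(observed, f_node, f_collocate, corpus_size, span=1):
    """Base-2 log of observed over expected co-occurrences (assumed formula)."""
    expected = f_node * f_collocate * span / corpus_size
    return math.log2(observed / expected)


# Toy figures chosen for easy arithmetic, not taken from the BNC.
mi = mutual_information(observed=4, f_node=10, f_collocate=50,
                        corpus_size=1000)
```

Because the expected count sits in the denominator, this kind of measure tends to favour rare collocates, which is one reason the two statistics can rank the same list differently.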

Click on the word mulled in the collocates list, then click on the Query button.

A new window opens, displaying the 27 hits where mulled collocates with wine. In what containers is mulled wine typically consumed?

Close the window in which these solutions are displayed, then click on the Collocation button again.

The collocates list previously displayed reappears. You may like to investigate some of the other words closely associated with wine.

6.3. Working with lemmata

In an inflected language like English, it is often convenient to group under the same heading words which have different forms, as well as to distinguish words which have the same form but different part of speech codes. For example, consider the word rise. This may be a noun or a verb. In the verbal sense, it may be regarded as consisting of a number of inflected forms: rose, risen, rises, rising, etc. The lemmatization feature of SARA allows us to perform such groupings.

Open the Word Query dialog box and type rise in the box. Make sure that both the Show Forms and Show Controls check boxes are selected, and that the Lemmata tab is open. Then click the Lookup button.

A list of words beginning with rise appears.

If the display does not change properly when you click the Lemmata tab, close the Word Query dialogue, and then re-open it.

Click on the form rise in the upper box.

In the lower box, you will see a list of the different forms grouped under this heading by the current lemmatization scheme. In the default (BNC) lemmatization scheme, this word has six different POS codes, the frequency for each of which is given.

Now change the lemmatization scheme. Select lancaster from the menu at the right, and click on the Apply button.

In the Lancaster lemmatization scheme, nominal and verbal forms of rise are treated as different head words, so there are now two entries for the word in the upper list, one tagged SUBST and the other VERB. The frequency count given for each of these includes all of its inflected forms together.
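
The difference between the two ways of grouping can be sketched in Python. The lexicon entries and frequencies below are invented for illustration (they are not BNC figures), and the two functions are only a simplified picture of what a lemmatization scheme does.

```python
from collections import defaultdict

# Hypothetical entries: (word form, head word, word class, frequency).
LEXICON = [
    ("rise", "rise", "SUBST", 50),
    ("rises", "rise", "SUBST", 10),
    ("rise", "rise", "VERB", 40),
    ("rose", "rise", "VERB", 60),
    ("risen", "rise", "VERB", 20),
]


def group_by_form(lexicon):
    """Group entries that share a spelling, whatever their word class."""
    totals = defaultdict(int)
    for form, _head, _word_class, freq in lexicon:
        totals[form] += freq
    return dict(totals)


def group_by_head_word(lexicon):
    """Group inflected forms under (head word, word class)."""
    totals = defaultdict(int)
    for _form, head, word_class, freq in lexicon:
        totals[(head, word_class)] += freq
    return dict(totals)


by_form = group_by_form(LEXICON)
by_head = group_by_head_word(LEXICON)
```

Grouping by form gives one entry for the spelling rise, covering both noun and verb uses; grouping by head word gives separate totals for the noun and the verb, each including its inflected forms.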

Click on rise VERB in the upper list and scroll through the list of forms which appears in the lower list.

You will see that inflected forms such as risen, rose etc. are now all grouped together. The archaic form riseth is, however, treated separately.

7. Introducing subcorpora

In the last section we saw that the verbal lemma rise was approximately twice as frequent as the nominal one. As 90% of the BNC is composed of written texts, this difference probably reflects the relative frequencies with which they are employed in writing. Is there also a difference in speech? You can investigate this question by submitting the same query you posed using the full corpus to the subcorpus of spoken texts.

First close the Word query dialog box.

Towards the end of the toolbar you will see a small box displaying the word all. This is the Subcorpus selection box, which shows that you are currently using the full corpus.

Click in the box and select spoken.

You have now activated just the subcorpus of spoken texts: the Status bar should now show the current corpus as BNC2:spoken.

Now redo your lemmatised Word query for rise.

This time, the results will cover only the spoken texts. Save the list, or take a note of the frequencies of the verbal and nominal lemmas.

Since the spoken component is approximately 10% of the whole corpus, we would expect the frequencies of the two lemmas to be approximately one-tenth what they were in the whole corpus. However, the new figures are very much smaller than this, particularly as far as the verb forms are concerned.
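
The reasoning behind this comparison is simple proportional scaling, sketched here in Python. The counts are hypothetical placeholders; substitute the frequencies you noted for the full corpus and for the spoken subcorpus.

```python
def expected_in_subcorpus(full_frequency, subcorpus_share):
    """Frequency expected if the word were spread evenly across the corpus."""
    return full_frequency * subcorpus_share


def ratio_to_expected(observed, full_frequency, subcorpus_share):
    """Below 1.0 means under-represented in the subcorpus, above 1.0 over-represented."""
    return observed / expected_in_subcorpus(full_frequency, subcorpus_share)


# Hypothetical counts, not BNC figures.
full_corpus_count = 20000
spoken_count = 800
spoken_share = 0.10  # the spoken component is roughly 10% of the corpus

expected = expected_in_subcorpus(full_corpus_count, spoken_share)
ratio = ratio_to_expected(spoken_count, full_corpus_count, spoken_share)
```

A ratio well below 1.0 for both lemmas, and lower for the verb than for the noun, would correspond to the pattern described above.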

If you have time, you may also like to compare the collocates of the nominal lemma in the full corpus and in the spoken subcorpus: you will find that combinations such as sharp rise are much less common in speech.

The collocate frequencies provided will be for Lancaster lemmata, which you activated in your last Word Query. Their significance level will in each case be calculated with respect to the corpus you are using.

Close all the open queries once you have completed this section.

7.1. Defining your own subcorpus

You can define a subcorpus of your own in three ways. The first uses information in the text headers to identify all the texts in a particular category — information which is provided in the <catRef> element. We shall use this method to define a subcorpus of imaginative written texts — novels, stories, plays and poems.

Make sure that you are using the full corpus — that the box on the toolbar shows the word all.

Click on the SGML query button on the toolbar.

The SGML dialog box will be displayed.

Check the Show header tags box (otherwise the list of elements won't include elements in the text headers). Then scroll down the list of elements until you find <catRef> and select it.

You will see a list of attributes for this element in the lower window.

Scroll through this list until you find written_domain. Select it and click on Add.

This will take you to the Attribute Query dialog box, where you should select the written domain that you wish to search for.

Select imaginative and click on OK to return to the SGML dialog box.

The window at the bottom right hand corner of the box shows the attributes and values you have selected.

Click on OK to send the query to the server. Download all the solutions by clicking the ‘Download All’ radio button.

There are 477 imaginative texts in the corpus, so there are 477 hits for your search. To see what these really look like, you should display them in SGML format: each concordance line will contain a string beginning catRef target="alltim3 allava2...." towards the end of which you will find the value wridom1. These are the BNC text categorization codes. You have thus found all the text headers (and consequently all the texts) with the categorisation wridom1, i.e. all the imaginative texts.
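
If you save solutions in SGML format, the category codes can be picked out programmatically. The Python sketch below is illustrative only: the header fragment is shortened and hypothetical (a real target attribute lists more codes than the three shown), and the function names are invented for the example.

```python
import re


def category_codes(header_fragment):
    """Return the codes listed in the target attribute of a <catRef> tag."""
    match = re.search(r'<catRef\s+target="([^"]*)"', header_fragment)
    return match.group(1).split() if match else []


def is_imaginative(header_fragment):
    """True if the header carries the code wridom1 (imaginative writing)."""
    return "wridom1" in category_codes(header_fragment)


# Shortened, hypothetical fragment: real headers carry a longer code list.
sample = '<catRef target="alltim3 allava2 wridom1">'
codes = category_codes(sample)
```

Each code in the list stands for one classification of the text, so a test for membership is enough to decide whether a text belongs in the subcorpus.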

Click the Make subcorpus button on the toolbar.

This is the only active button to the right of the box showing the current corpus. A window will appear in which you can type the name of your new subcorpus, which will consist of all the texts for which you have downloaded solutions.

Enter the name imaginative and press Enter.

You will see that imaginative now appears in the list of available subcorpora.

Select imaginative from the list of subcorpora. Then type fictional into the Quick query box and press Enter.

You should now see all the fictional occurrences of the word fictional. Try looking up other words or phrases which seem typical of imaginative writing, such as frightfully, throb, or lips. Compare their frequencies in this subcorpus with their frequencies in the full corpus.

Do not try to obtain a collocates list for a subcorpus which has not been registered. It is not possible to register subcorpora while you are using SARA.

7.2. A second way to make subcorpora

David Lee, of Lancaster University, has provided his own hierarchical classification of all the BNC texts in the World Edition. This makes it possible to define subcorpora using classes such as w (written), w ac (written academic), w ac medicine (written academic medicine).

Make sure you are using all as your corpus, then click on the SGML query button. Check that the Show header tags box is checked, and select <classCode> from the list of elements. Then click on OK.

You will see that there are 4053 solutions, i.e. one for each text in the corpus.

Download a random 20.

This will show you the form of the information provided in the <classCode> element. Choose one category that you like the look of, such as s interview or w non ac.

To make the subcorpus, you need to identify all the texts which contain this classification within the <classCode> element in the header. This requires a complex query, which specifies (for example) that you are only interested in the word interview or the phrase non ac where it occurs within this element.

To create this query, click on the QueryBuilder button on the toolbar (it is shaped like a T).

The QueryBuilder screen appears. You use this screen to define complex queries, each component of the query being represented as a node on this screen. The lefthand node defines the scope of the query — that is, where the search is to be carried out. As you see, this starts off with the assumption that you will search within a single BNC document (i.e. <bncDoc> element). To the right of this node you define what it is you want to look for, as one or more linked content nodes. The box is red because you must supply something.

Click in the red content node and select Phrase from the menu that appears.

This will display the Phrase Query dialog box.

Phrase Queries are similar to Quick Queries, but they also allow additional specifications as to case-sensitivity and searching text headers that are not available elsewhere.

Enter the string corresponding to the code you want to look for, such as s interview or w non ac, then check the Search headers box.

This is necessary since the string you are searching for is not part of a text, but of a header.

Click on OK to insert this query in the content node and return to the Query Builder.

Next, click in the scope node and select SGML. This will display the SGML dialog box. Make sure the Show header tags box is checked, then select <classCode> from the list of elements (type a C to get to the relevant part of the list quickly) and click on OK.

You will now see the complete query: at the foot of the window there should be a message saying Query is OK.

Click on OK to send the query to the server. Download all the solutions, corresponding to all the occurrences of your string in a <classCode> element, making a note of their number.

Wait till all the solutions have been downloaded: then, to define this set of texts as a subcorpus, click on the Make subcorpus button and choose an appropriate name for it.

Now you can activate your subcorpus from the subcorpora box on the toolbar, and design Quick queries or Word queries concerning it. For instance, is there much bad language in the BNC spoken interviews? What about in non-academic writing? You can compare figures with those for the subcorpus of all the spoken texts, or for the corpus as a whole, bearing in mind the different numbers of texts in each.

7.3. A last way of making subcorpora

Yet another way of making subcorpora is to do a query for a word or phrase which you think is likely to occur in the kinds of texts you want to include.

Make sure you are searching the whole corpus again, by choosing all from the list of available subcorpora. Now do a Quick query for the word oaky.

Nearly all the 13 solutions appear to have something to do with wine-tasting, with one exception where it is a proper name. Double-click on this solution to select it, then click on the arrow at the right of the Thin button on the toolbar, and choose Reverse selection. The solution you selected will be deleted from the display.

Now click on the Make subcorpus button and save the 9 remaining texts as a subcorpus called wines. Activate the subcorpus and use it to look for other wine-tasting terms, such as robust, fruity, ripe, etc.

Not all the solutions, you will find, have to do with wines, because many of these texts are of a varied nature, being taken from the ‘leisure pages’ of newspapers and the like.

8. Building more complex queries

As you've seen, you can type either a word or a phrase into the Quick query box. But suppose you want to search for a phrase in which some of the words, or the word order, can vary? In this section we'll explore some of the facilities for defining more complex, or less exact, queries.

One very simple way is to use the underline character _ as a placeholder for any single word in a phrase.

Type the string cheese _ wine into the Quick query box and press Enter.

As you scroll through the 29 solutions, you will note that there are some different spellings for the middle word in the recovered phrases.
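
The behaviour of the underline placeholder can be pictured with a small Python sketch. The matching rules used here (one whole word per placeholder, comparison ignoring case) are assumptions made for the example, and the sample words are invented.

```python
def match_phrase(tokens, pattern):
    """Find word sequences matching `pattern`, where _ stands for any one word."""
    wanted = pattern.lower().split()
    hits = []
    for i in range(len(tokens) - len(wanted) + 1):
        window = tokens[i:i + len(wanted)]
        if all(p == "_" or p == w.lower() for p, w in zip(wanted, window)):
            hits.append(" ".join(window))
    return hits


# Invented word list, not taken from the corpus.
tokens = "cheese and wine , cheese & wine , cheese wine".split()
found = match_phrase(tokens, "cheese _ wine")
```

The placeholder must be filled by exactly one word, so the final cheese wine in the sample, with nothing in between, is not matched.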

Type the string wine _ cheese into the Quick query box and press Enter.

This is slightly less common in terms of absolute frequency, but not in terms of number of texts. In the general case, we would like to be able to find the two words cheese and wine in the same phrase, in either order. The QueryBuilder is the right tool for this job.

Click on the QueryBuilder button on the toolbar. Click in the red content node and select Phrase from the menu that appears. Type in wine and click on OK.

You will see that the content node has now become black, since you have provided it with valid content. You will also see that the node has small branches growing from its sides: these allow you to add other nodes with other contents, and to link the nodes in different ways.

Click on the right branch of the node. A second red content node will appear to the right. Click in this new node, select Phrase as before, and type in the word cheese. Click on OK.

Nodes which are presented left to right across the screen are interpreted as alternatives. Your current query will thus find occurrences of wine OR of cheese.

Click on the downward-growing branch beneath the first node to create another node below it. Click in this new node and select Any. Then click on the branch connecting the two nodes and select Link-type and then Next.

Nodes which are presented top to bottom down the screen are interpreted as additive. Your current query will find occurrences of wine or cheese, but only if they are followed by another word. The Next link means that the two nodes concerned must directly follow each other, with no other words intervening.

Now create a node beneath the Any node, also linked with a Next link, containing the Phrase Query cheese, and add another node to the right of the cheese node, containing wine, as an alternative.

Your current query will find wine or cheese, followed by any word, followed by wine or cheese again.

Click on OK to send the query to the server.

There should be 76 solutions — rather more than we found for ‘cheese _ wine’ and for ‘wine _ cheese’ together in our earlier quick queries. Scroll through the results to see why.
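
The logic of the query just built (wine or cheese, then any word, then wine or cheese) can be written out in Python as a check on your understanding. This is a sketch of the logic only, with invented sample text; it does not reproduce how SARA evaluates queries.

```python
def find_sequences(tokens, alternatives=("wine", "cheese")):
    """Find three-word runs: alternative, any word, alternative."""
    wanted = set(alternatives)
    found = []
    for i in range(len(tokens) - 2):
        first, last = tokens[i].lower(), tokens[i + 2].lower()
        if first in wanted and last in wanted:
            found.append(tuple(tokens[i:i + 3]))
    return found


# Invented word list, not taken from the corpus.
tokens = ("wine and cheese were served before cheese and wine "
          "tasting while wine or wine").split()
matches = find_sequences(tokens)
```

Note that the first and last positions are tested independently of each other, so nothing in the query requires the two words to be different.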

The Query Builder can be used to search for combinations of words within particular contexts by additionally changing the scope node. For instance, you might want to find co-occurrences of oaky and fruity within the same sentence, or within an overall span of five words.

Open the QueryBuilder again. Click on the scope node. Select SGML, and then choose <s> (the SGML element containing a single sentence).

You can specify any SGML element as the scope for a query. Alternatively, if you select Span you can specify the number of words within which the rest of the query is to be satisfied.

Once you have specified the scope, you need to build up the required content in the content nodes, as before.

Create two content nodes, containing phrase queries for oaky and fruity respectively. Join the nodes vertically with a two-way link.

A two-way link will find the two nodes in either order. A one-way link will find them only if they occur in the top-to-bottom order. A next link will find them only if they are directly adjacent.
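
The three link types can be compared in a short Python sketch that tests a single scope (here, one sentence given as a list of words). The function name, the link labels used as strings, and the sample sentence are all invented for illustration.

```python
def linked(tokens, first, second, link="two-way"):
    """Test whether `first` and `second` co-occur under the given link type."""
    words = [t.lower() for t in tokens]
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    if link == "next":
        # second must directly follow first, with nothing in between
        return any(j == i + 1 for i in pos_first for j in pos_second)
    if link == "one-way":
        # second must come somewhere after first
        return any(i < j for i in pos_first for j in pos_second)
    if link == "two-way":
        # both must be present, in either order
        return bool(pos_first) and bool(pos_second)
    raise ValueError(f"unknown link type: {link}")


# Invented sentence, not taken from the corpus.
sentence = "a fruity and oaky red".split()
```

For the sample sentence, a two-way link between oaky and fruity succeeds, a one-way link from oaky to fruity fails because fruity comes first, and a next link from fruity to oaky fails because and intervenes.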

As a final exercise, see if you can work out how to compare the number of occurrences of evaluative terms such as wrong or correct in all spoken utterances, in utterances spoken by men, and in utterances spoken by women. To choose utterances by a particular type of speaker, first specify <u> as the scope node, and then choose from the attributes listed below the list of elements. To select by the speaker's sex, choose who_sex. If you wish, you can additionally restrict the search by the age of the respondent (who_age) or by other criteria.

Advanced students may like to compare sentence-initial occurrences of the word right within speech and within writing. Hint: content nodes in a QueryBuilder query can contain any kind of query, including an SGML query.

British National Corpus