British National Corpus
Introduction to SARA98
This worksheet introduces you to some of the key features of the SARA software. It does not cover everything that the software can do but it gives a good indication of the kinds of facilities available. Please use it as a basis for your exploration of the system.
The About SARA popup appears, with the text ‘Waiting to contact server zeus’.
If you are using SARA to connect to some other server or corpus, the name which appears may be different. SARA can be configured to work with other corpora, with other servers providing access to the BNC, or with a copy of the BNC installed locally on your hard disk. You can click on the menu button to see what other servers are available, or to add a new one. Configuring your client to operate with other corpora is not covered in this tutorial.
The message ‘Initialising please wait’ appears at the bottom of the screen, and there is a short delay during which details about the corpus being searched are loaded into the program. When this process is complete, a minimized window titled BNC-2 appears in the bottom left of the screen.
This is the corpus Browse window. If you open it, you will see a list of the texts making up the corpus. You can select texts from the list for various purposes, e.g. to make a subcorpus or simply to browse them; however, this use of the window is not covered in this tutorial.
At the top of the screen you see the usual Windows menu items (File, Edit, Texts, View, Window, and, on the far right, Help). Below that you can see a number of buttons which we call collectively the Toolbar.
Using the mouse, move the cursor so that it hovers over each button in turn, without clicking the mouse.
After a second or so, the name of the button will appear in a small popup and a brief description of its purpose will appear at the bottom of the screen. Each button provides rapid access to a specific function also available from a menu. In this tutorial we will be using the following buttons:
As with other Windows applications, you can change the buttons available and move them around, using the Toolbars command on the View menu and the mouse in the usual way.
In this tutorial we will discuss just a few of the functions available; for an overview of all of them, you may like to explore the built-in Help system by selecting Contents on the Help menu, or by simply pressing the F1 key.
The bottom of the screen contains a message area and a status bar, in which SARA posts useful information about its current state. Until you carry out a query, however, there is not much to say here.
The Quick query box is the unlabelled box on the tool bar. It provides the quickest way of searching the corpus for a word or phrase.
The cursor will turn briefly into an hourglass while SARA searches, and then the ‘Too Many Solutions’ alert will appear, telling you the result of the search: there are 1651 occurrences of this phrase in 927 different texts.
By default SARA will not download and display more than 100 solutions to any query; you must therefore specify how many solutions you want to see and how they should be selected from those available. You can change this behaviour using the Preferences command on the View menu, which we discuss later on in this tutorial.
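The choice you face at this point (how many solutions to see, and which ones) can be pictured with a small Python sketch. It is only an illustration: the function name `choose_solutions` and the labels `"initial"` and `"random"` are our own inventions rather than SARA commands, and SARA itself makes the selection for you.

```python
import random


def choose_solutions(total, limit=100, how="random", seed=None):
    """Pick which solution numbers (1-based) to download.

    `how` is a hypothetical label: "initial" keeps the first `limit`
    solutions, "random" draws `limit` of them without replacement.
    """
    if how not in ("initial", "random"):
        raise ValueError(f"unknown selection method: {how!r}")
    if total <= limit:
        # Nothing to choose: everything fits under the limit.
        return list(range(1, total + 1))
    if how == "initial":
        return list(range(1, limit + 1))
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, total + 1), limit))
```

With the 1651 hits reported above, `choose_solutions(1651, how="random")` returns 100 solution numbers scattered through the whole set, which is why a random sample on your machine will not show the same solutions as ours.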
Solutions will now be downloaded, one by one. By default, you will see only the first solution. Press the PgDn key to see the next solution.
If you look at the status bar, you will see that it now contains additional information. Reading from left to right, you should see something like the following: BNC2 bnc 2:100(100) A0F 142. This indicates that the name of the corpus being searched is BNC2, that the lemmatization scheme in effect is called bnc, that the currently highlighted solution is number 2 of 100 chosen from 100 different texts, and that it appears in text A0F at sentence number 142 (your numbers will probably differ, since you are doing a random sample).
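If you keep notes on your searches, the status bar fields can be pulled apart mechanically. The sketch below assumes that the line always has the shape of the single example given above; that layout is inferred from this worksheet, not taken from any SARA specification, and the names `StatusBar` and `parse_status` are our own.

```python
import re
from typing import NamedTuple


class StatusBar(NamedTuple):
    corpus: str      # e.g. BNC2
    scheme: str      # lemmatization scheme, e.g. bnc
    current: int     # number of the highlighted solution
    downloaded: int  # number of solutions downloaded
    texts: int       # number of different texts they come from
    text_id: str     # e.g. A0F
    sentence: int    # sentence number within that text


# Assumed layout: corpus scheme current:downloaded(texts) text sentence
STATUS_RE = re.compile(
    r"^(?P<corpus>\S+)\s+(?P<scheme>\S+)\s+"
    r"(?P<current>\d+):(?P<downloaded>\d+)\((?P<texts>\d+)\)\s+"
    r"(?P<text_id>\S+)\s+(?P<sentence>\d+)$"
)


def parse_status(line):
    """Split a status line into named fields, or raise ValueError."""
    match = STATUS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised status line: {line!r}")
    g = match.groupdict()
    return StatusBar(
        g["corpus"], g["scheme"], int(g["current"]), int(g["downloaded"]),
        int(g["texts"]), g["text_id"], int(g["sentence"]),
    )
```

For the example line, `parse_status("BNC2 bnc 2:100(100) A0F 142").text_id` gives `"A0F"`.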
In Colour format, different parts of speech are displayed in different colours, and the POS code itself appears when the mouse hovers over a word. In SGML format, you can see all the underlying markup in the file. Custom format displays the text in a user-defined way, using the markup according to specific requirements. In Plain format, you see only the words and punctuation of the text, with the search term highlighted.
This button switches between displaying solutions one at a time and displaying solutions in the traditional one-per-line KWIC format. In either mode, you can scroll through the solutions using the PgDn and PgUp keys; in line mode you can also use the arrow keys.
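If the KWIC (key word in context) layout is new to you, the following sketch shows the idea on a toy sentence. It is not how SARA draws its display; `kwic_lines` and its `width` parameter are hypothetical names chosen for this illustration, and the text is assumed to be already split into words.

```python
def kwic_lines(tokens, phrase, width=30):
    """Return one line per occurrence of `phrase`, with the hit aligned.

    `tokens` and `phrase` are lists of words; matching ignores case.
    `width` is the number of characters of context kept on each side.
    """
    n = len(phrase)
    if n == 0:
        return []
    target = [w.lower() for w in phrase]
    lines = []
    for i in range(len(tokens) - n + 1):
        if [w.lower() for w in tokens[i:i + n]] == target:
            left = " ".join(tokens[:i])[-width:]       # nearest left context
            right = " ".join(tokens[i + n:])[:width]   # nearest right context
            hit = " ".join(tokens[i:i + n])
            lines.append(f"{left:>{width}}  {hit}  {right:<{width}}")
    return lines


text = "I have just about finished and she is just about ready".split()
for line in kwic_lines(text, ["just", "about"], width=12):
    print(line)
```

Because the left context is right-justified, every hit starts in the same column, which is what makes patterns on either side of the search term easy to spot by eye.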
Note the dashed lines surrounding what is known as the current solution. Information about this solution is displayed on the status bar at the bottom of the screen.
A menu appears from which you can choose to copy the current solution to the clipboard, to expand the amount of context visible for that solution (Max Scope), to select the current solution or a series of solutions (only in Line mode), and to view information about the Source of the current solution. Experiment with these options to see their effect.
You can also change the font in which solutions are displayed, set the colour scheme used for display of POS information, and set default preferences for display mode etc. in subsequent queries using commands on the View menu.
A dialogue box appears in which you can specify how the concordance lines should be sorted. You can use different sorting orders and other options for each of two sort keys, and also specify a particular collating method, using radio buttons in this dialogue box. You can also indicate how many words are to be considered when sorting using the Span window.
The lines will be sorted by the words to the right of the hit, so phrases such as just about everyone will be grouped together.
When solutions are displayed in colour format, a radio button labelled POS code is available as a Collating option, making it possible for you to sort the lines according to the POS codes of words they contain.
All solutions where just about precedes an article or determiner (AT0) will be sorted to the top of the list, and all cases where it precedes a verb (V..) will sort to the end.
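The two collating options just described, by word and by POS code, amount to sorting the same lines on different keys. Here is a minimal sketch, assuming each solution is stored as a dictionary whose `"right"` entry lists the (word, POS code) pairs that follow the hit; that data layout and the name `sort_solutions` are our own, chosen only for illustration.

```python
def sort_solutions(solutions, span=1, collate="word"):
    """Sort solutions on the first `span` tokens to the right of the hit.

    collate="word" compares the words themselves (ignoring case);
    collate="pos" compares their POS codes instead.
    """
    if collate not in ("word", "pos"):
        raise ValueError(f"unknown collating option: {collate!r}")

    def key(solution):
        right = solution["right"][:span]
        if collate == "word":
            return [word.lower() for word, _pos in right]
        return [pos for _word, pos in right]

    return sorted(solutions, key=key)


solutions = [
    {"hit": "just about", "right": [("finished", "VVN")]},
    {"hit": "just about", "right": [("everyone", "PNI")]},
    {"hit": "just about", "right": [("the", "AT0")]},
]
for s in sort_solutions(solutions, collate="pos"):
    print(s["right"][0])
```

Sorted by word, the three toy lines come out as everyone, finished, the; sorted by POS code, the AT0 line moves to the top and the verb to the end, as in the display described above.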
Before doing further queries, let us set some Preferences for the toolbar and display of results.
View Query will show you the text of any query at the head of the solutions display; Concordance will show the solutions in a KWIC (one-per-line) format. Custom format will display certain elements, such as new paragraphs and utterances, in particular ways on the screen, while Automatic scope will display roughly one sentence of context for each citation.
The preferences just specified will be active for the remainder of this SARA session: if, by any chance, you are disconnected from the server or accidentally close the program, you should reset them before continuing.
SARA maintains a list of all the distinct word forms in the corpus, together with their frequencies and part of speech codes. We refer to this list as the lexicon. You use the Word Query command to search the lexicon in a number of different ways, and to find places in the corpus where these word forms occur. The Collocation facilities allow you to detect statistically significant patterns of co-occurrence for word forms in the corpus, while the Lemmatization facilities allow you to group words together under linguistically significant headings despite their orthographic form.
The Word Query button looks like a small white box with a vertical yellow stripe.
The Word Query dialog box expands to provide additional tabs with which you can control its behaviour, and an additional window opens for display of forms.
A list of all the word-forms in the lexicon which begin with the letters wine is displayed in alphabetical order. The other columns show the frequency and the number of different forms grouped under that entry using the current lemmatization scheme. In the default scheme (called bnc) words which differ only in their part-of-speech code are grouped under a single entry.
The entries are sorted in descending order of frequency.
You can also restrict entries by frequency by using radio buttons on the Download tab. We do not cover that in this tutorial.
The different word-forms grouped under this entry are displayed in the lower window. You will see that wine, while generally classified as a singular common noun (NN1), also appears in the lexicon as a proper noun (NP0).
Scroll through the solutions. In what city is Wine Street? Who was born there? What famous poem was written there?
Once your curiosity is satisfied, you may wish to investigate uses of wine as a common noun. To save time re-defining a query from scratch, we will use the Edit button on the toolbar: this looks like a tiny pencil writing on a blue-edged screen.
The ‘Too Many Solutions’ dialogue appears, reminding us that there are 6050 solutions to this query. In the next section we will use the collocation options in SARA as a means of investigating this mass of data.
The Collocations dialog box will be displayed.
You do not need to download all the results of a query to perform a collocation analysis, but you must download some.
Make sure that the Downloads only box is unchecked so that the analysis will be calculated for all 6050 solutions, not just the 20 you have downloaded.
The dialog box expands to offer you several additional tabs which can be used to control the calculation. You can set the window, i.e. the span of words around the hit within which collocations are to be sought; you can use the download tab to control which of the possible collocates are to be displayed; you can apply a lemmatization scheme, and you can also modify the way the collocation scores are calculated. In this exercise we will use only the first two of these.
It may take several minutes for SARA to calculate all the collocates and their scores: wait for the red light on the status bar to go out before you try to do anything else. This may be a good time for a cup of coffee.
The collocate display is designed to show you words which cluster together: this is expressed by means of a frequency-based statistic known as the z-score. The higher the z-score, the more significant the clustering.
You can re-sort the list by clicking on the relevant column heading. You can save a copy of the list by clicking on the Save button and choosing an appropriate format. You can also choose to calculate significance using the Mutual Information statistic rather than Z-scores.
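To give a feel for what the two statistics measure, here is a sketch of one common textbook formulation of the collocation z-score and of Mutual Information. This worksheet does not say exactly how SARA computes its scores, so treat the formulas, the function names and the parameter names below as assumptions made for illustration only.

```python
import math


def expected_cooccurrences(node_freq, collocate_freq, corpus_size, span):
    """Co-occurrences expected if the collocate were spread evenly.

    p is the chance that any one word outside the node is the collocate;
    there are node_freq * span word slots around the node.
    """
    p = collocate_freq / (corpus_size - node_freq)
    return p * node_freq * span


def z_score(observed, node_freq, collocate_freq, corpus_size, span):
    """(observed - expected) in units of the expected standard deviation."""
    p = collocate_freq / (corpus_size - node_freq)
    expected = p * node_freq * span
    return (observed - expected) / math.sqrt(expected * (1 - p))


def mutual_information(observed, node_freq, collocate_freq, corpus_size, span):
    """log2 of observed over expected; needs observed > 0."""
    expected = expected_cooccurrences(node_freq, collocate_freq, corpus_size, span)
    return math.log2(observed / expected)
```

As a usage example, take the 6050 solutions for wine and the 27 co-occurrences with mulled from this tutorial, and assume (hypothetically) that mulled occurs 50 times in a corpus of roughly 100 million words with a window of 6 words: `z_score(27, 6050, 50, 100_000_000, 6)` then comes out very large, because far fewer than one co-occurrence would be expected by chance.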
A new window opens, displaying the 27 hits where mulled collocates with wine. In what containers is mulled wine typically consumed?
The collocates list previously displayed reappears. You may like to investigate some of the other words closely associated with wine.
In an inflected language like English, it is often convenient to group under the same heading words which have different forms, as well as to distinguish words which have the same form but different part of speech codes. For example, consider the word rise. This may be a noun or a verb. In the verbal sense, it may be regarded as consisting of a number of inflected forms rose, risen, rises, rising, etc. The lemmatization feature of SARA allows us to perform such groupings.
A list of words beginning with rise appears.
If the display does not change properly when you click the lemmata tab, close the Word Query dialogue, and then re-open it.
In the lower box, you will see a list of the different forms grouped under this heading by the current lemmatization scheme. In the default (BNC) lemmatization scheme, this word has six different POS codes, the frequency for each of which is given.
In the Lancaster lemmatization scheme, nominal and verbal forms of rise are treated as different head words, so there are now two entries for the word in the upper list, one tagged SUBST and the other VERB. The frequency count given for each of these includes all of its inflected forms together.
You will see that inflected forms such as risen, rose etc. are now all grouped together. The archaic form riseth is, however, treated separately.
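The difference between the two schemes comes down to the key under which lexicon entries are grouped. The sketch below uses a tiny invented lexicon: the frequencies are made up, the `HEADWORDS` table is hypothetical, and deriving the word class from the first letter of the POS code is a simplification of our own. It is not the data or the method that SARA uses.

```python
from collections import defaultdict

# (word form, POS code, frequency); all frequencies are invented.
LEXICON = [
    ("rise", "NN1", 50), ("rise", "VVB", 20), ("rise", "VVI", 30),
    ("rises", "NN2", 5), ("rises", "VVZ", 10),
    ("rose", "VVD", 40), ("risen", "VVN", 15), ("rising", "VVG", 25),
]

# Hypothetical lookup from (form, word class) to head word.
HEADWORDS = {
    ("rises", "SUBST"): "rise",
    ("rises", "VERB"): "rise",
    ("rose", "VERB"): "rise",
    ("risen", "VERB"): "rise",
    ("rising", "VERB"): "rise",
}


def by_form(form, pos):
    """Default-style key: forms differing only in POS code share an entry."""
    return form


def by_headword(form, pos):
    """Lancaster-style key: (head word, word class)."""
    word_class = "VERB" if pos.startswith("V") else "SUBST"
    return (HEADWORDS.get((form, word_class), form), word_class)


def group(lexicon, key):
    """Sum the frequencies of all entries that share the same key."""
    totals = defaultdict(int)
    for form, pos, freq in lexicon:
        totals[key(form, pos)] += freq
    return dict(totals)
```

Here `group(LEXICON, by_form)` keeps rise, rises, rose, risen and rising as separate entries, while `group(LEXICON, by_headword)` produces just two entries, one for the noun and one for the verb, each with the summed frequency of its inflected forms.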
In the last section we saw that the verbal lemma rise was approximately twice as frequent as the nominal one. As 90% of the BNC is composed of written texts, this difference probably reflects the relative frequencies with which they are employed in writing. Is there also a difference in speech? You can investigate this question by submitting the same query you posed using the full corpus to the subcorpus of spoken texts.
Towards the end of the toolbar you will see a small box displaying the word all. This is the Subcorpus selection box, which shows that you are currently using the full corpus.
You have now activated just the subcorpus of spoken texts: the Status bar should now show the current corpus as BNC2:spoken.
This time, the results will relate only to the spoken texts. Save the list, or make a note of the frequencies of the verbal and nominal lemmas.
Since the spoken component is approximately 10% of the whole corpus, we would expect the frequencies of the two lemmas to be approximately one-tenth of what they were in the whole corpus. However, the new figures are very much smaller than this, particularly as far as the verb forms are concerned.
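The comparison being made here is simple arithmetic, and you can apply it to the figures you noted down. In the sketch below the function names are our own and the 10% share is the approximate figure quoted above; substitute your own frequencies for the placeholder arguments.

```python
def expected_in_subcorpus(full_freq, share=0.10):
    """Frequency we would expect if usage were the same in the subcorpus."""
    return full_freq * share


def observed_to_expected(sub_freq, full_freq, share=0.10):
    """Ratio below 1.0 means the item is under-represented in the subcorpus."""
    return sub_freq / expected_in_subcorpus(full_freq, share)
```

For instance, a lemma with a (purely hypothetical) frequency of 20000 in the full corpus would be expected about 2000 times in the spoken texts; if you actually found 500, `observed_to_expected(500, 20000)` gives 0.25, that is, a quarter of what even usage would predict.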
If you have time, you may also like to compare the collocates of the nominal lemma in the full corpus and in the spoken subcorpus: you will find that combinations such as sharp rise are much less common in speech.
The collocate frequencies provided will be for Lancaster lemmata, which you activated in your last Word Query. Their significance level will in each case be calculated with respect to the corpus you are using.
You can define a subcorpus of your own in three ways. The first uses information in the text headers to identify all the texts in a particular category — information which is provided in the <catRef> element. We shall use this method to define a subcorpus of imaginative written texts — novels, stories, plays and poems.
Make sure that you are using the full corpus — that the box on the toolbar shows the word all.
The SGML dialog box will be displayed.
You will see a list of attributes for this element in the lower window.
This will take you to the Attribute Query dialog box, where you should select the written domain that you wish to search for.
The window at the bottom right hand corner of the box shows the attributes and values you have selected.
There are 477 imaginative texts in the corpus, so there are 477 hits for your search. To see what these really look like, you should display them in SGML format: each concordance line will contain a string beginning catRef target="alltim3 allava2...." towards the end of which you will find the value wridom1. These are the BNC text categorization codes. You have thus found all the text headers (and consequently all the texts) with the categorisation wridom1, i.e. all the imaginative texts.
This is the only active button to the right of the box showing the current corpus. A window will appear in which you can type the name of your new subcorpus, which will consist of all the texts for which you have downloaded solutions.
You will see that imaginative now appears in the list of available subcorpora.
You should now see all the fictional occurrences of the word fictional. Try looking up other words or phrases which seem typical of imaginative writing, such as frightfully, throb, or lips. Compare their frequencies in this subcorpus with their frequencies in the full corpus.
Do not try to obtain a collocates list for a subcorpus which has not been registered. It is not possible to register subcorpora while you are using SARA.
David Lee, of Lancaster University, has provided his own hierarchical classification of all the BNC texts in the World Edition. This makes it possible to define subcorpora using classes such as w (written), w ac (written academic), w ac medicine (written academic medicine).
You will see that there are 4053 solutions, i.e. one for each text in the corpus.
This will show you the form of the information provided in the <classCode> element. Choose one category that you like the look of, such as s interview or w non ac.
To make the subcorpus, you need to identify all the texts which contain this classification within the <classCode> element in the header. This requires a complex query, which specifies (for example) that you are only interested in the word interview or the phrase non ac where it occurs within this element.
The QueryBuilder screen appears. You use this screen to define complex queries, each component of the query being represented as a node on this screen. The lefthand node defines the scope of the query — that is, where the search is to be carried out. As you see, this starts off with the assumption that you will search within a single BNC document (i.e. <bncDoc> element). To the right of this node you define what it is you want to look for, as one or more linked content nodes. The box is red because you must supply something.
This will display the Phrase Query dialog box.
Phrase Queries are similar to Quick Queries, but they offer additional options that are not available elsewhere, namely case-sensitive matching and searching within text headers.
This is necessary since the string you are searching for is not part of a text, but of a header.
You will now see the complete query: at the foot of the window there should be a message saying Query is OK.
Now you can activate your subcorpus from the subcorpora box on the toolbar, and design Quick queries or Word queries concerning it. For instance, is there much bad language in the BNC spoken interviews? What about in non-academic writing? You can compare figures with those for the subcorpus of all the spoken texts, or for the corpus as a whole, bearing in mind the different numbers of texts in each.
Yet another way of making subcorpora is to do a query for a word or phrase which you think is likely to occur in the kinds of texts you want to include.
Nearly all of the 13 solutions appear to have something to do with wine-tasting; the one exception is a solution in which the search term is a proper name. Double-click on this solution to select it, then click on the arrow at the right of the Thin button on the toolbar, and choose Reverse selection. The solution you selected will be deleted from the display.
Not all the solutions, you will find, have to do with wines, because many of these texts are of a varied nature, being taken from the ‘leisure pages’ of newspapers and the like.
As you've seen, you can type either a word or a phrase into the Quick query box. But suppose you want to search for a phrase in which some of the words, or the word order, can vary? In this section we'll explore some of the facilities for defining more complex, or less exact, queries.
One very simple way is to use the underline character _ as a placeholder for any single word in a phrase.
As you scroll through the 29 solutions, you will note that there are some different spellings for the middle word in the recovered phrases.
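The way the underline placeholder behaves can be imitated in a few lines of Python. This is a sketch of the matching idea only, working on a text that has already been split into words; `find_pattern` is a hypothetical name, and SARA's own matching is done against its index, not in this way.

```python
def find_pattern(tokens, pattern):
    """Return the start positions where `pattern` matches in `tokens`.

    `pattern` is a space-separated phrase; "_" stands for any one word.
    Matching ignores case.
    """
    parts = pattern.lower().split()
    if not parts:
        return []
    hits = []
    for i in range(len(tokens) - len(parts) + 1):
        window = tokens[i:i + len(parts)]
        if all(p == "_" or p == w.lower() for p, w in zip(parts, window)):
            hits.append(i)
    return hits


text = "a cheese and wine party with cheese 'n' wine or wine and cheese".split()
print(find_pattern(text, "cheese _ wine"))
```

On this toy sentence the pattern matches both cheese and wine and cheese 'n' wine, which is the kind of variation in the middle word that you will have noticed while scrolling, but it does not match wine and cheese, since the order of the fixed words still matters.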
This is slightly less common in terms of absolute frequency, but not in terms of number of texts. In the general case, we would like to be able to find the two words cheese and wine in the same phrase, in either order. The QueryBuilder is the right tool for this job.
You will see that the content node has now become black, since you have provided it with valid content. You will also see that the node has small branches growing from its sides: these allow you to add other nodes with other contents, and to link the nodes in different ways.
Nodes which are presented left to right across the screen are interpreted as alternatives. Your current query will thus find occurrences of wine OR of cheese.
Nodes which are presented top to bottom down the screen are interpreted as additive. Your current query will find occurrences of wine or cheese, but only if they are followed by another word. The Next link means that the two nodes concerned must directly follow each other, with no other words intervening.
Your current query will find wine or cheese, followed by any word, followed by wine or cheese again.
There should be 76 solutions — rather more than we found for ‘cheese _ wine’ and for ‘wine _ cheese’ together in our earlier quick queries. Scroll through the results to see why.
The Query Builder can be used to search for combinations of words within particular contexts by additionally changing the scope node. For instance, you might want to find co-occurrences of oaky and fruity within the same sentence, or within an overall span of five words.
You can specify any SGML element as the scope for a query. Alternatively, if you select Span you can specify the number of words within which the rest of the query is to be satisfied.
Once you have specified the scope, you need to build up the required content in the content nodes, as before.
A two-way link will find the two nodes in either order. A one-way link will find them only if they occur in the top-to-bottom order. A next link will find them only if they are directly adjacent.
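The three kinds of link can be compared side by side in a short sketch. Several things here are our own assumptions and not statements about SARA: the function name `cooccurrences`, the link labels, the reading of a span of five words as "both words fall inside a window of five consecutive words", and the reading of a next link as "the second word immediately follows the first".

```python
def cooccurrences(tokens, first, second, link="two-way", span=5):
    """Return (i, j) positions of `first` and `second` satisfying the link.

    "two-way": either order, both words inside a window of `span` words.
    "one-way": `first` before `second`, inside the same window.
    "next":    `second` immediately after `first`.
    """
    if link not in ("two-way", "one-way", "next"):
        raise ValueError(f"unknown link type: {link!r}")
    words = [t.lower() for t in tokens]
    pairs = []
    for i, w in enumerate(words):
        if w != first:
            continue
        for j, v in enumerate(words):
            if v != second or i == j:
                continue
            if link == "next":
                ok = j == i + 1
            elif link == "one-way":
                ok = i < j and j - i < span
            else:
                ok = abs(j - i) < span
            if ok:
                pairs.append((i, j))
    return pairs


text = "a fruity and oaky red then an oaky fruity white".split()
for kind in ("two-way", "one-way", "next"):
    print(kind, cooccurrences(text, "oaky", "fruity", link=kind))
```

On the toy sentence, the two-way link finds both fruity and oaky and oaky fruity, whereas the one-way and next links find only the second, where oaky comes first.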
As a final exercise, see if you can work out how to compare the number of occurrences of evaluative terms such as wrong or correct in all spoken utterances, in utterances spoken by men, and in utterances spoken by women. To choose utterances by a particular type of speaker you should first specify <u> as the scope node, and then choose from the attributes listed below the list of elements. To select speakers by sex, choose who_sex. If you wish, you can additionally restrict the search by the age of the respondent (who_age) or other criteria.
Advanced students may like to compare sentence-initial occurrences of the word right within speech and within writing. Hint: content nodes in a QueryBuilder query can contain any kind of query, including an SGML query.