This document summarizes all the facilities provided by the SARA client program. In the BNC Handbook, these facilities are introduced in the context of specific tasks, and as a means of exploring the BNC. In this document, we review them in the same order as they are presented on the initial menu bar, for ease of reference. You will find extensive cross references between the descriptions below, and also in the online help system.
This document describes the original version of the SARA software. Some facilities described herein have been enhanced or modified in the current version.
Once the SARA client program has been correctly installed on your system, you launch it in the same way as any other Windows application, for example by double clicking on an icon. The SARA login dialogue box will then appear. You must have a username and a password before you can access a networked SARA server, for licensing reasons. These will normally be allocated by the person responsible for managing the BNC server to which you are connecting. Note that the username may be quite different from that you use for other network services such as email. SARA can also be run in local mode (i.e. with both the server and the client running on the same machine). In this case, you may not need to log in.
When you have typed in your username, press the Tab key to move to the password box, and type in your password. It will not appear on the screen, but will be validated by the server when you press the Return key or click on the OK button. If you have typed a valid password and username, the box will be replaced by the message of the day, identifying the server to which you are connected. Press Return again, or click on OK, and you will see the main SARA screen.
If you make a mistake entering your password, the system will let you try again. If you want to give up, click on the Cancel button.
You can change your password, but only once you have successfully logged in (see further section 7.5).
If your system is not configured correctly, or if the server which you are trying to reach is not available, you will be offered the chance to select a different server address, or to run SARA in local mode, as further discussed in section 7.5. Do not change the server address unless you need to: any changes made here will apply to all subsequent use of SARA on your machine. Click on the Cancel button to close the comms dialogue box if you simply want to try the same server again later.
All use of SARA is done within the main SARA window. There is a menu bar across the top of this window, which can be used to select the various SARA commands available to you, and a tool bar which can be used to select particular commands rapidly. See section 7.1 for a summary of the commands accessible from the tool bar. At the bottom of the screen, there is an iconified representation of the corpus which you are currently searching (in the current release, this is always the whole of the BNC).
The SARA client can operate in either of two modes. In query mode (the default, and usual mode of operation) the client accepts queries, acts on them, and displays their results in one or more query windows. In browse mode, the client opens a browser window showing the whole of a particular text, which you can then read. In browse mode, the Query menu is replaced by a Browser menu.
To use SARA you give various commands, selected from the menu bar or the tool bar. Some commands affect the behaviour of the client or the server, for example by setting limits for the amount of text to be downloaded in response to a query, or by changing the format of text being displayed by the client. Other commands create new queries and open new windows in which to display their results.
The commands available are logically grouped together according to Microsoft interface guidelines. The order in which they are discussed below follows their order on the menu bar.
To select an item from the menu bar, click on the appropriate word with the mouse, or type the appropriate keyboard shortcut. A menu will open up, from which further options can be selected with the mouse, possibly with further sub-menus in some cases.
The following options are available from the top level menu structure:
Most of the commands on this menu manipulate queries, as opposed to the results which they return from the corpus: the exceptions are Print and Print preview, both of which relate to the solutions returned by the current query.
The following commands are provided on the File menu:
By default, the first query defined during your SARA session is named Query1 , the second Query2 , and so on. You can give any query a more meaningful name, if you wish, before saving it in an SQY file. The name of a query appears in the title bar of the window containing its results. It can contain only characters which are legal in filenames under MS-DOS, and may not exceed eight characters in length.
Queries are opened or saved using the normal Windows dialogue boxes for file manipulation, which allow you to change drives, specify file names etc. If you do not know how to use these, consult any introductory text on using Microsoft Windows.
The New query option on the File menu opens a submenu from which you can select which type of query you want to perform. SARA allows you to define five different kinds of query:
More detail about each kind of query is given in the appropriate section below. There is a button on the tool bar for each kind of query: it is generally quicker to press the button than to select it from the menu.
A word query may be defined in any of the following ways:
Any of the above will cause the Word Query dialogue box to be displayed, containing a window into which you can type a word, or part of a word, to be searched for in the SARA index. If the Pattern checkbox to the right of the window is checked, whatever you type will be interpreted as a pattern. If it is not checked, whatever you type will be interpreted as a word stem. (Strictly speaking, a word stem is also a kind of pattern: the word stem XXX is exactly equivalent to the pattern XXX.*)
The Lookup button carries out a search of the SARA index. Every form found in the index which starts with the same letters as the word or part of a word you typed in will be displayed in the lower window. If the Pattern checkbox was checked, every word matching the pattern you typed in will be displayed.
For example, typing in colour with the Pattern checkbox unchecked will produce a list of words beginning with the letters `colour' (`colour', `coloured', `colouring', etc.). If the box is checked, only the word `colour' will be produced, since this is the only word which matches that pattern. (Patterns are described below in section 3.4.) Typing in colou?r.* with the Pattern box checked will produce a list of all words beginning with the letters `color' or `colour'. Note that, in this case, if the box is not checked, no words will be returned, since the text is then read literally and there is no word beginning `colou?r.*' in the BNC.
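The difference between a word-stem lookup and a pattern lookup can be sketched in a few lines of Python. This is only an illustration, not the SARA implementation: the `INDEX` list and the `lookup` function are hypothetical, Python's `re` module stands in for SARA's pattern matcher, and case-insensitive matching is an assumption.

```python
import re

# Toy stand-in for the SARA word index (hypothetical contents).
INDEX = ["color", "colour", "coloured", "colouring", "colourless", "column"]

def lookup(text, pattern=False):
    """Return the index entries a Word Query lookup would list."""
    if pattern:
        # Pattern box checked: the whole word must match the pattern.
        rx = re.compile(text, re.IGNORECASE)
        return [w for w in INDEX if rx.fullmatch(w)]
    # Pattern box unchecked: the text is read literally, as a word stem.
    return [w for w in INDEX if w.lower().startswith(text.lower())]

print(lookup("colour"))                   # all forms starting with 'colour'
print(lookup("colour", pattern=True))     # ['colour']
print(lookup("colou?r.*", pattern=True))  # forms starting 'color' or 'colour'
print(lookup("colou?r.*"))                # [] (no word starts with that text)
```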
A pattern expression which begins with anything other than a literal will usually involve a search through the whole BNC index, which will take a very long time indeed, and should be avoided. This implies that searches for word-endings are not easily done.
Note that the items treated by SARA as single words may not correspond with orthographic words. In particular, hyphenated words and words followed by some punctuation characters may not always be indexed in the way you expect.
The lower window will not display more than 200 items: a warning message will appear if the word or word part you typed was not specific enough, perhaps because it was too short. If the word you wish to look up is also a very common prefix, check the Pattern box to select only the word, rather than all words beginning with that string of characters.
You can click on one or more of the word forms displayed in the lower window to select them. As is usual with Windows applications, clicking on one or more items with the CTRL key depressed will select each of them; clicking one and then another with the SHIFT key depressed will select both those two and all the other items between them in the list.
When an item is selected in this way it is highlighted on the screen, and a count is displayed below the box indicating the frequency and z-score for the selected word forms within the texts making up the BNC. (Note that words occurring within the text headers are excluded.)
When items are selected, the Query button can be pressed to carry out a search for these word forms within the BNC. Section 3.9 gives further details of the process of downloading the results of a search.
The other buttons on the Word Query dialogue box have the following effects:
When the Word Query is part of a Query Builder query, the Query button is labelled OK and clicking it simply adds the word query into the query being constructed.
A phrase query may be defined in any of the following ways:
Any of the above will cause the Phrase Query dialogue box to be displayed. This dialogue box contains a window into which you can type a word or phrase, a checkbox labelled Ignore Case, and a checkbox labelled Search Headers.
You can type any sequence of words, or a single word, into the window. Press the OK button (or the Return key) and a search is carried out for the specified phrase within the BNC.
If the Search Headers checkbox is checked, then the search is carried out within the TEI headers as well as the text. Otherwise, only the texts are searched.
If the Ignore Case checkbox is unchecked, the search is case-sensitive. If the box is checked, a search for Sara will recover occurrences of `Sara', `SARA', or `sara'; if it is not, only the first of these will be found.
These two check boxes are the only ways SARA provides for searching in a case sensitive way, or for searching within the headers, other than by using a CQL query .
A phrase query can contain punctuation characters as well as words. For example, the phrase query , whereas will find occurrences of `whereas' only where they are preceded by a comma. When searching for a match, newlines between components of a phrase query are not significant: for example, it makes no difference whether the comma is at the end of one line and the `whereas' at the start of the next.
A special punctuation character known as the Anyword character _ can be used within a phrase query (but not at the start or end of one). It will match any single item in the index. For example, the phrase query home _ centre will recover phrases such as `home loan centre', `home improvement centre', `home planning centre' etc.
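As a rough sketch of how the Anyword character behaves, the following Python function matches a phrase query against a list of tokens. The function name `find_phrase`, the sample sentence, and the case-insensitive comparison are all assumptions made for illustration; the real server works on the BNC index rather than on a token list.

```python
def find_phrase(tokens, query):
    """Return the start positions at which the phrase query matches."""
    parts = query.lower().split()
    hits = []
    for start in range(len(tokens) - len(parts) + 1):
        window = tokens[start:start + len(parts)]
        # '_' matches any single item; other parts must match exactly.
        if all(p == "_" or p == t.lower() for p, t in zip(parts, window)):
            hits.append(start)
    return hits

tokens = "the home loan centre and the home improvement centre near home".split()
print(find_phrase(tokens, "home _ centre"))  # [1, 6]
```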
Note that not every item in the index is a conventional orthographic word. As further discussed in section , the index uses L-words which may be parts of conventional orthographic words such as `n't' or orthographic phrases such as `in spite of'.
Each part of a phrase query is searched for separately, and the results are then combined. Consequently, if a phrase query contains any very common words (for example, `to', `the' etc.) it may take a very long time to execute: in such cases it is usually better to replace the very high frequency word with an AnyWord character. For example, to find the phrase `die the death', type die _ death and discard the (fairly small) number of false positives such as `die a death', using the Thin command described in section 6.3 .
There is no limit on the number of words a phrase query may contain, but the total length of the string may not exceed 200 characters.
Click on the OK button to send the query to the server, or click on the Cancel button to cancel it; see further 3.9 .
A pattern query may be defined in any of the following ways:
All of the above will cause the Pattern Query dialogue box to be displayed. This dialogue box contains a window into which you can type a pattern query. The pattern is validated, and a search is carried out for all the words which match it. See further 3.9 .
As noted above in section 3.2 , a pattern can also be typed as part of a Word Query in order to produce a list of matching words. This is a very useful way of checking the results of a pattern query without actually carrying it out by searching the BNC.
A pattern is a string of characters which is used as a template to match words in the SARA index. The characters making up a pattern can be:
The dot . is a special character which matches any single character. For example the pattern f... matches any four letter word beginning with F.
A sequence of characters within square brackets matches any one of them. For example the pattern [aeiou] matches any vowel;
A sequence can contain a hyphen to express a range. For example, the patterns [0-9] and [0123456789] are equivalent: either one will match any digit.
The caret ^ is a special character which can appear at the start of a sequence of characters within square brackets, to indicate that any character not in the sequence should be matched. For example, the pattern [^aeiou] will match any consonant; the pattern [^0-9] will match anything which is not a digit.
Single characters or bracketed sequences can be repeated as often as necessary to make up a complete pattern. For example, the pattern [0-9][0-9][0-9] will match all three-digit numbers; the pattern m[0-9][0-9] will match an M followed by two digits.
The question mark ? is a special character which can follow either a single character or a bracketed sequence of characters, to indicate that the character is optional. For example, the pattern colou?r will match either `colour' or `color'; the pattern [0-9][0-9][0-9]? will match all two- or three-digit numbers, e.g. 99 or 42 or 123 or 912.
The star * is a special character which can follow either a single character or a bracketed sequence of characters, to indicate that the character is optionally repeatable. For example, the pattern hm[hm]* will match words beginning with HM and containing only those two letters, no matter how long they are, for example `hm' or `hmmmm' or `hmmhmhmmmm'; the pattern sorrow.* will match any word beginning with the letters `sorrow', including `sorrow' itself.
The plus + is a special character which can follow either a single character or a bracketed sequence of characters, to indicate that the character is repeated at least once. For example, the pattern sorrow.+ will match any word beginning with the letters `sorrow', except for `sorrow' itself; the pattern m[0-9]+ will match all words composed of the letter M followed by at least one digit, and nothing but digits, e.g. M1, M2345; similarly, the pattern e+k will find `ek', `eek', `eeeeek' etc.
The plus or star character can be used to indicate repetition at any point in a pattern. However, matching of patterns beginning with such sequences (for example .*ing, to recover all words ending with `ing') is likely to be unacceptably slow, since it requires a scan through the entire word index. In general, it is best to make the first component of any pattern a literal. Repetition can however be effectively used in the middle of a pattern: for example effec.*ly will match `effectively' or `effectually'.
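The cost difference can be illustrated with a small Python sketch. It assumes, purely for illustration, that the word index is a sorted list: a pattern with a leading literal then needs to examine only the entries sharing that prefix, while a pattern such as .*ing must examine every entry. The names `literal_prefix` and `candidates` and the sample index are hypothetical, and nothing here describes how the SARA server actually stores its index.

```python
import bisect
import re

def literal_prefix(pattern):
    """Return the plain letters and digits that every match must start with."""
    if "|" in pattern:
        return ""  # be conservative when alternatives are present
    prefix = re.match(r"[a-z0-9]*", pattern).group(0)
    # A following ? or * makes the last literal optional, so drop it.
    if prefix and pattern[len(prefix):len(prefix) + 1] in ("?", "*"):
        prefix = prefix[:-1]
    return prefix

def candidates(sorted_index, pattern):
    """Return (matching words, number of index entries examined)."""
    prefix = literal_prefix(pattern)
    lo = bisect.bisect_left(sorted_index, prefix)
    hi = bisect.bisect_left(sorted_index, prefix + "\uffff")
    scanned = sorted_index[lo:hi]
    rx = re.compile(pattern)
    return [w for w in scanned if rx.fullmatch(w)], len(scanned)

index = sorted(["effect", "effectively", "effectually", "effort",
                "sing", "singing", "song", "thing"])
print(candidates(index, "effec.*ly"))  # examines only the 3 'effec...' entries
print(candidates(index, ".*ing"))      # examines all 8 entries
```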
Two or more patterns can be combined as alternatives using the disjunction meta-character (a vertical bar). For example, the pattern seek|sought will match either the word `seek' or the word `sought'. Parentheses () can be used to group parts of a pattern together: for example, the same effect could be obtained by the pattern s(eek|ought).
Any character preceded by the backslash (\) will be treated as a literal even if it is a meta-character. For example, the pattern Mr?s?\. will match any of `M.', `Mr.', `Mrs.' or `Ms.'. Without the backslash, the final dot would be interpreted as a meta-character, matching any character at all. A backslash is unnecessary within square brackets: the pattern M[rs.]* would have a similar effect to the above, except that it would also match forms lacking a final dot (plus a number of probably unintended matches, such as `mss.').
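The pattern features described in this section (dot, bracketed sequences, ?, *, +, the vertical bar, parentheses and the backslash) happen to have the same meaning in Python regular expressions, so the examples above can be tried out with the `re` module. This is a sketch under two assumptions: that a SARA pattern must match the whole word (hence `fullmatch`), and that matching ignores case. The `matches` helper is hypothetical.

```python
import re

def matches(pattern, word):
    """True if the whole word matches the pattern, ignoring case."""
    return re.fullmatch(pattern, word, re.IGNORECASE) is not None

# (pattern, word, expected result) taken from the descriptions above.
examples = [
    ("f...", "four", True),
    ("f...", "fifty", False),
    ("[0-9][0-9][0-9]?", "42", True),
    ("[0-9][0-9][0-9]?", "1234", False),
    ("hm[hm]*", "hmmhmhmmmm", True),
    ("sorrow.*", "sorrow", True),
    ("sorrow.+", "sorrow", False),
    ("e+k", "eeeeek", True),
    ("s(eek|ought)", "sought", True),
    (r"Mr?s?\.", "Mrs.", True),
    (r"Mr?s?\.", "Mrs", False),
    ("M[rs.]*", "mss.", True),
]

for pattern, word, expected in examples:
    result = matches(pattern, word)
    print(f"{pattern:18} {word:12} {result}")
    assert result == expected
```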
A POS, or part of speech query behaves in the same way as a word query, except that it searches for only a single word, which can be further restricted according to its part of speech (POS) code. It may be defined in any of the following ways:
All of the above will cause the POS Query dialogue box to be displayed, containing two display windows. When the word to be searched for is typed into the upper window, and the mouse is clicked in the lower window, the lower window is filled with a list of the different parts of speech that the word in question has been assigned within the corpus. The same effect can be obtained by typing in a word and pressing the Tab key.
For example, the word `snore' appears in the corpus as a verb (VV1), as a noun (NN1), and as a portmanteau (NN1-VV1). All three possibilities appear in the lower box.
To search for the nominal senses only, highlight the NN1 in the lower window, and press OK. To search for both nominal and portmanteau cases, hold down the control key while highlighting the NN1 and NN1-VV1 entries, and then press OK.
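The two steps of a POS query (list the codes attached to a word, then search for the selected ones) can be sketched as follows. The tagged tokens, text identifiers and function names are invented for illustration; the POS codes are simply those quoted in the `snore' example above.

```python
# Hypothetical tagged tokens: (text id, word, POS code).
TAGGED = [
    ("A01", "snore", "VV1"),
    ("A01", "snore", "NN1"),
    ("B07", "snore", "NN1-VV1"),
    ("B07", "snore", "NN1"),
    ("C03", "sleep", "NN1"),
]

def pos_codes(word):
    """Codes shown in the lower window once a word has been typed."""
    return sorted({pos for _, w, pos in TAGGED if w == word})

def pos_query(word, selected):
    """Occurrences of the word carrying any of the selected codes."""
    return [(t, w, p) for t, w, p in TAGGED if w == word and p in selected]

print(pos_codes("snore"))                            # ['NN1', 'NN1-VV1', 'VV1']
print(len(pos_query("snore", {"NN1"})))              # 2
print(len(pos_query("snore", {"NN1", "NN1-VV1"})))   # 3
```

Note that `pos_query` always takes a word as well as a set of codes, mirroring the restriction that a part of speech cannot be searched for on its own.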
Note that it is not possible to search for a particular part of speech without specifying the word to which it is attached. This implies that you cannot use SARA to search for such things as sequences of three or more adjectives, nor for occurrences of a specific word preceded by any word with a particular part of speech.
The Help system contains a list of POS codes used in the current version of the corpus: this list also appears in appendix below. A brief explanation of each POS code is also displayed when you select it from the upper box in the POS query dialogue box.
Click on the OK button to send the query to the server, or click on the Cancel button to cancel it; see further 3.9 .
An SGML query may be defined in any of the following ways:
All of the above will cause the SGML dialogue box to be displayed.
As well as information about words and their parts of speech, the BNC index searched by SARA contains details of where the SGML elements of which the corpus is composed begin and end. (SGML, the ISO Standard Generalised Markup Language, is briefly described above at section ; see also chapter 5 of the BNC Users' Reference Guide.)
The start of an SGML element is indicated by a start-tag; its end is indicated by an end-tag. Start-tags may additionally carry named attributes, with particular values, to convey additional information about the element occurrences they delimit.
You can use this information to restrict searches to particular types of text (the categorisation of a text is indicated by attributes of a <catRef> element within its header), or to find particular types of text component: for example newspaper headlines, which are mostly tagged <head type=main> in the BNC, or pauses (<pause>) in spoken texts.
The SGML dialogue box contains a scrollable list of the element names or tags defined for the corpus. For an explanation of the way these elements are used in the corpus, refer to the BNC Users Reference Guide . If the Show Header Tags checkbox is checked, all tags used in the corpus will appear; if it is not, then tags which are used only in the headers will be excluded. To search the corpus for an SGML start- or end-tag, you select the name of the element concerned from this list by clicking on it. A brief description of the way this element is used is then displayed.
Provided that the Start radio button is selected, a list of any attributes defined for this element will then be displayed in the lower left hand window. You can restrict the search to occurrences of this element having particular values for some combination of these attributes by selecting attribute names from the list, one at a time, and adding them into the query. Alternatively, if you do not select any attribute name from the list, the query will select occurrences of this SGML element whatever attribute values it may have.
When you select an attribute name from the list, clicking on the Add button will open a further dialogue, indicating the range of values possible for that attribute. Click on the desired value (or values) and then press OK to close this dialogue box. Several attribute value constraints may be added in this way. You can also remove a particular constraint by selecting it from the right hand window in the SGML dialogue box, and then clicking on the Remove button, or remove all of them by clicking on the Remove All button.
Click on the OK button to send the query to the server, or click on the Cancel button to cancel it; see further 3.9 .
Query Builder is a special purpose tool which allows you to create complex queries using a visual interface. The Query Builder command can be used in either of the following ways:
Either of these will cause the Query Builder dialogue box to be displayed. This dialogue box is used to define a Query Builder query as further described in this section.
Parts of a complex query are represented in the Query Builder dialogue box by nodes of various types. A Query Builder query always has at least two nodes: one, the scope node, defines the context within which a complex query is to be evaluated. The other nodes, which may be linked in various ways, are known as content nodes. These define the various things which are to be found within this scope. Any form of query can be used in a content node (except for a CQL or Query Builder query).
For example, you might use the Query Builder to search for the word `fork' followed or preceded by the word `knife' within the scope of a single <s> (sentence) element. In this case, the scope node would indicate a single SGML element occurrence, and there would be two content nodes, one for `knife' and the other for `fork'. Alternatively, you might specify the same search but define its scope as a number of words. The default scope for all Query Builder queries is a <bncDoc> element, i.e. any one of the 4124 distinct text samples making up the BNC.
The scope of a query is represented in Query Builder by the scope node which appears on the left of the dialogue box. To the right of this is a single empty content node. Clicking with the mouse inside a content node opens a submenu, from which you can select either Edit, Clear, or (for nodes other than the first one) Delete. Selecting Edit opens a further submenu, from which you select the type of query you wish to define for that node, or, if you have already defined a query for the node, to edit it. Selecting Clear cancels any previous choice, allowing you to select a new query type for the node. Delete removes the content node, but leaves the rest of the query unchanged.
When a single content node has been filled, further nodes can be added to its right, above it, or below it, simply by clicking the mouse on the branch in that direction. Nodes added to the right of a query node represent alternatives. For example, the Query Builder representation of a query to find either the word `fork' or the word `knife' within the scope of a single <bncDoc> element is shown in figure 1.
[Figure 1: Either FORK or KNIFE]
Alternation can also be contained within a single content node by using a pattern query, or a word query with alternatives. Figure 2 shows another way of achieving the same effect as the preceding query, using a pattern query.
[Figure 2: Either FORK or KNIFE (another way)]
Nodes added above or below a content node represent additional constraints. The query represented in figure 3, for example, searches for both the word `fork' and the word `knife' within the scope of a single <bncDoc> element.
[Figure 3: FORK preceding KNIFE]
The vertical line linking the two content nodes indicates the order and proximity required. Clicking on the line opens a submenu from which you can select one of the following possibilities:
To change the scope of a complex query, click on the scope node. A submenu opens, from which you can choose either SGML or Span. Choosing SGML opens the SGML dialogue box, from which you can select an SGML element, possibly modified by attribute values, as in an SGML query (see further section 3.6 ). Choosing Span opens a dialogue box in which you can enter the number of words within which the rest of the query must be satisfied.
The example shown in figure 4 will find the words `fork' and `knife' in either order, provided they appear within five words of each other.
[Figure 4: FORK followed or preceded by KNIFE within 5 words]
When nodes are added both to the right of and above or below a content node, they must all be satisfied. For example, the query shown in figure 5 will find occurrences of `fork' or `spoon', but only where they are followed by `knife' within a span of five words.
[Figure 5: FORK or SPOON followed or preceded by KNIFE within 5 words]
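A Span scope with two linked content nodes can be pictured with the Python sketch below. It is an illustration only: the function `within_span` and the sample sentence are hypothetical, and the sketch assumes that "within five words" means that both words fall inside some run of five consecutive words, which may not be exactly how the server counts.

```python
def within_span(tokens, first, second, span, ordered=False):
    """Return (i, j) position pairs where both words fall within the span.

    With ordered=True, `second` must come after `first`; otherwise
    either order is accepted.
    """
    hits = []
    for i, a in enumerate(tokens):
        if a != first:
            continue
        for j, b in enumerate(tokens):
            if b != second or i == j:
                continue
            if ordered and j < i:
                continue
            # Assumption: both words must lie inside `span` consecutive words.
            if abs(i - j) < span:
                hits.append((i, j))
    return hits

text = ("put the knife and fork down then find a fork and "
        "after much more searching a knife").split()
print(within_span(text, "fork", "knife", 5))                # [(4, 2)]
print(within_span(text, "fork", "knife", 5, ordered=True))  # []
```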
A content node can contain any kind of query (other than a CQL or a Query Builder query): one or more alternatives chosen from the word query dialogue box; a phrase query; a pattern query; a POS query; or an SGML query. The Anyword character can also be entered as a content node in its own right.
Once you have completed defining the query, press the OK button to carry out a search, or press Cancel to cancel it. See further 3.9 .
CQL (pronounced ``sequel'') is short for the corpus query language. It is the command language which a SARA client program uses to communicate with the SARA server. Usually expressions in CQL are generated for you by the client program, but there is no reason why you should not type them in directly as well. There are also a few features of the command language which cannot be easily (or at all) expressed by the current client except in this way.
A CQL query may be defined in either of the following ways:
Either of the above causes the CQL query dialogue box to be displayed. This dialogue box contains a window into which you can type a CQL query. The query is then validated, and a search is carried out (see further 3.9 ).
The syntax of CQL is defined briefly here. The CQL form of any query can always be viewed by switching on the Query Text option on the Query menu (see 6.4 ).
A CQL query is made up of one or more atomic queries. An atomic query may be one of the following:
Four unary operators are allowed in CQL:
A CQL expression containing more than one atomic query may use the following binary operators:
When queries are joined, the scope of the expression may be defined in one of the following ways:
Some simple examples follow:
Whichever type of SARA query you define, the process of executing it is the same, and proceeds as follows:
The Too Many Solutions dialogue box allows you to reset the download limit temporarily, and also to specify which of the available solutions should be displayed. The number of solutions to be downloaded can be re-set manually, by typing a new number into the box at the bottom, or automatically, by clicking on either or both of the Download all and One per text buttons.
In either case, when solutions are downloaded, they appear in order, starting from the beginning of the corpus. If the Random checkbox is selected, solutions are chosen at random until the specified number has been reached; if it is not, then either all solutions are chosen, or (if One per text is chosen) the first in each text, until the limit has been reached.
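The selection options can be summarised in a short Python sketch. Everything in it is an assumption made for illustration: the `select_solutions` function, the (text identifier, sentence number) representation of a hit, and in particular the way Random and One per text are combined, which the description above does not spell out.

```python
import random

# Hypothetical solutions in corpus order: (text id, sentence number).
HITS = [("A00", 3), ("A00", 17), ("A01", 5), ("B12", 1), ("B12", 9)]

def select_solutions(hits, limit, one_per_text=False, randomize=False, seed=None):
    """Choose which solutions to download, keeping corpus order."""
    pool = hits
    if one_per_text:
        # Keep only the first solution found in each text.
        seen, pool = set(), []
        for text_id, sentence in hits:
            if text_id not in seen:
                seen.add(text_id)
                pool.append((text_id, sentence))
    if randomize:
        # Pick at random, then restore corpus order for display.
        chosen = random.Random(seed).sample(pool, min(limit, len(pool)))
        return sorted(chosen, key=pool.index)
    return pool[:limit]

print(select_solutions(HITS, 2))                      # first two solutions
print(select_solutions(HITS, 10, one_per_text=True))  # one solution per text
print(select_solutions(HITS, 2, randomize=True))      # two chosen at random
```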
When downloading is complete, the red Busy light will go out.
You can scroll, sort, thin, save, see the sources of, or otherwise manipulate the solution set using options from the Query menu, as described in section 6 .
You can interrupt execution of a query at any time before downloading of solutions begins by pressing the Esc key. This will abort processing of the query as soon as possible.
You can print the results of a query in three different ways:
Only the first of these is discussed in this section; for the other two, refer to sections 4 and 6.6 respectively.
Choosing the Print command will open a standard Windows Print dialogue box. You can select whether printing should be done in landscape or portrait mode by clicking on the appropriate button. You can also choose the printer to be used, and configure the printer in the normal Windows manner.
The current version of SARA does not allow you to change the page layout of the report printed: it contains a running title, derived from the query, and page numbering. References for each hit are printed down the left margin, indicating the text and the sentence number from which it comes. As much of each hit as will fit on a single line is included.
You can use the Print Preview command to see a rough indication on the screen of how the results will look when printed.
For more flexible formatting of the results of a search, you should use the Listing command on the Query menu to save the results in SGML format, as described in section 6.6 below. This file can then be formatted in any way appropriate using the word processor of your choice.
The Edit menu allows you to save or select the current solution from the result set for a query. It has three commands:
At any time one of the set of solutions being displayed is known as the current solution: in page display mode (6.4.1 ), this is the solution which is visible on the screen; in line display mode, it is the solution which has a broken line above and below it. In line mode you make a solution current by scrolling to it, and clicking on it with the mouse. In either mode, you can move forward or backward through the result set by using the arrow keys on the tool bar or the cursor keys on the keyboard.
Choosing the Copy command from the Edit menu copies the current solution to the Windows clipboard, in whatever format it has on the screen. If you now open a Windows application such as Notepad or some other word processor, and select the Paste command from its Edit menu, the current solution will be available to that application. This is a simple way of copying information from SARA to other programs.
A bookmark is a name which you can attach to the current solution so that you can refer to it again, after it has ceased to be current. Bookmarks are specific to particular queries, and are saved and retrieved along with queries.
To create a bookmark for the current solution, select the Bookmark command from the Edit menu. The Bookmark dialogue box will appear, into which you can type any short name for the bookmark. If the name is already in use, you will be offered the choice of overwriting the current bookmark of that name. A list of the names you are currently using, and the queries to which they relate, appears in the lower part of the window.
To make a different solution current, you can choose one of your named bookmarks by means of the Goto command on the Edit menu. Choosing this command opens the Goto Bookmark dialogue box.
Clicking on a bookmark in this window will make the solution to which it points the current one, assuming that the result set from which it comes is available. If you delete a result, for example, by thinning the result set, any associated bookmark will also be deleted. Result sets remain available for as long as they are either open as windows or present as minimized icons on the desk top. If you save a query (using the Save command on the File menu), its associated bookmarks are saved along with it. If you subsequently close the query (using the Close command of the File menu, or the window controls), its associated bookmarks disappear from the Goto dialogue box, but will re-appear if you re-open it.
The Browser menu is available only when SARA is in browse mode. In this mode, you are able to browse through the whole text of any part of the corpus in a special purpose window.
When SARA starts, it is initially in browse mode. As soon as you open a query, or create a new one, SARA switches out of browse mode, and the Browser menu is replaced by the Query menu, discussed in section 6 . To open the browser window, click on the BNC icon at the bottom left of the SARA main screen, or click on the Browse button which appears on the bottom right of the Source window described below in section 6.7 .
In browse mode, you can move from one text in the corpus to the next, simply by using the arrow keys on the button bar, or their equivalent keyboard shortcuts. The texts appear in alphabetical order, according to their three character identifiers.
Each text sample of the BNC has a similar SGML structure. Each text is represented by a <bncDoc> element, which is composed of a <header> element and either a <text> or an <stext> element. These elements are all further subdivided into elements of other named kinds. <header> elements have a rather complex substructure, following international standards for bibliographic description. Both <text> and <stext> elements are composed fundamentally of <s> (sentence) elements, which contain a mixture of <w> (word) elements and <c> (punctuation) elements. In written texts, these are grouped into elements such as <p> (paragraph) or <head> (heading); in spoken texts, they are grouped into <u> (utterance) elements. (For a more detailed description, see the BNC Users' Reference Guide .)
In browse mode, this structure is presented visually in the form of a list of container elements, each of which can be selectively expanded. When a text is first displayed, only the outermost <bncDoc> element containing it is visible. It appears in a special browse window, with a plus sign to the left of the SGML start-tag, which indicates that this element is not yet fully expanded. Click on the plus sign to see the SGML elements of which it is composed. At the next level down, a <bncDoc> element has two subcomponents, a <header> element, and a <text> or <stext> element.
When an element is expanded, the plus sign in front of its start-tag turns into a minus sign, indicating that that element has been expanded. You can continue in this way, expanding elements down to the lowest level <w> elements for any text. If you click on a minus sign, the expansion of the element will be removed.
If you entered browse mode by clicking on the Browse button during display of the results of a SARA query, a red horizontal line will also appear in the Browser window. This line marks the place in the text where the current hit occurs; you can move directly to this point by clicking on the box at the left end of the line. Since this requires that the whole of the text must be downloaded from the server to the client, there may be some delay between your clicking on the box, and the display of the element containing the hit. Once the text is available, the display will automatically scroll to it.
You can now inspect the content of any elements before or after the hit by clicking on the plus signs, as before.
You use the Tags command on the Browser menu to determine which of the SGML tags around parts of the text are to be displayed. By default, all tags are displayed in the Browse window (the Tags command on the Browser menu is checked); click on this command to switch off display of the low-level tags for words, punctuation, and s-units.
The Query menu allows you to manipulate the results of a query in various ways. You can edit a query using the Edit command; sort the results in various ways using the Sort command; thin the results using the Thin command; set various options about the appearance of the results using the Concordance, Options, Query Text or Annotation commands; save the results to a file in SGML format using the Listing command; or display bibliographic information about a particular result using the Source command. You can also calculate collocational information for the results of a query using the Collocation command.
Selecting the Edit command from the Query menu will redisplay whichever dialogue box it was that launched the query whose results are currently displayed. The command is grayed out and unavailable if no results are being displayed.
The query dialogue box will be displayed as it was when the query was sent to the server by pressing the OK button. You can change any part of the dialogue box, and resubmit it by pressing the OK button again, or press Cancel to close the dialogue box and start again.
By default, the results of a query are displayed in their order of appearance within the corpus, that is, with the texts ordered alphabetically by their three character identifiers. This order is rarely of any particular significance, except that it groups solutions from the same text, and so it is generally desirable to reorder a line mode (concordance) display. This is done by selecting the Sort command from the Query menu, which displays the Sort dialogue box.
You can use the radio buttons in this dialogue box to specify either one or two keys for the sort, and a single collating sequence, applicable to both keys. The keys determine which part of each hit is to be used to sort the results; the collating sequence determines how these keys are to be compared when deciding on their relative order.
The Primary keys for all the context lines are compared first, according to the collating sequence indicated. If any duplicates are found, the Secondary keys are used to order them. Note that the same collating method must be used for both keys.
The Span box indicates how many words make up the key in each case. The Left, Centre or Right radio buttons indicate the position of the key relative to the query focus (i.e. the hit word in the context). If the Left radio button is selected, and the Span is 1, the key will be the word to the left of each query focus. If the Centre radio button is selected, and the Span is 1, the key will be the first word of the query focus itself. If the Right radio button is selected and the Span is 1, the key will be the first word following the query focus.
The Ascending and Descending radio buttons indicate whether the keys are to be sorted into ascending or descending alphabetical order.
The collating method used for both keys is indicated by the radio buttons to the right of the dialogue box. With the ASCII radio button selected, keys are compared according to the ASCII character sequence, in which all uppercase letters precede all lower case ones: `Zebra' precedes `antelope'. With the Ignore case button selected, case distinctions are ignored, so that `Zebra' and `zebra' are regarded as the same key. With the Ignore accents button selected, accented letters are treated as if they were unaccented, so that `élève' and `élevé' are regarded as the same key.
If the results being sorted are being displayed in either POS or SGML mode (see further section 6.4.1), then the POS code button is available for selection. Selecting it causes keys to be sorted not by their orthographic form but by the alphabetical order of their part of speech code. This has the effect of grouping together keys with the same POS code. You can use it, for example, to sort a set of results by the POS code of the word following the query focus.
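The interaction between keys, spans and collating sequences described above can be sketched in Python. This is an illustration only, not SARA's own implementation: the Hit tuple, the function names and their parameters are all hypothetical, each hit is assumed to be already split into words, and a Left key with a span greater than 1 is assumed to be read outwards from the focus.

```python
import unicodedata
from typing import List, NamedTuple


class Hit(NamedTuple):
    # Hypothetical representation of one concordance line, split into words.
    left: List[str]
    focus: List[str]
    right: List[str]


def strip_accents(word):
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sort_key(hit, position="centre", span=1, ignore_case=False, ignore_accents=False):
    # Select the words making up the key, relative to the query focus.
    if position == "left":
        start = max(0, len(hit.left) - span)
        words = hit.left[start:][::-1]  # assumption: nearest word first
    elif position == "centre":
        words = hit.focus[:span]
    elif position == "right":
        words = hit.right[:span]
    else:
        raise ValueError("position must be 'left', 'centre' or 'right'")
    if ignore_accents:
        words = [strip_accents(w) for w in words]
    if ignore_case:
        words = [w.lower() for w in words]
    return tuple(words)


def sort_hits(hits, primary, secondary=None, descending=False, **collation):
    # primary and secondary are (position, span) pairs; the same collation
    # settings are applied to both keys, as in the Sort dialogue box.
    def key(hit):
        keys = [sort_key(hit, *primary, **collation)]
        if secondary is not None:
            keys.append(sort_key(hit, *secondary, **collation))
        return keys

    return sorted(hits, key=key, reverse=descending)


hits = [Hit(["the"], ["antelope"], ["ran"]), Hit(["a"], ["Zebra"], ["slept"])]
ascii_order = sort_hits(hits, ("centre", 1))
caseless_order = sort_hits(hits, ("centre", 1), ignore_case=True)
```

With plain (ASCII-style) comparison the `Zebra' line sorts first, because Python compares strings by character code; with ignore_case=True the `antelope' line sorts first.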
Selecting the Thin command from the Query menu opens up a sub-menu from which four selections are available, each of which allows you to reduce the number of displayed solutions in the current result set. The commands available are:
The current item in a displayed list can be selected either by double clicking on it, or by pressing the space bar.
Each time you request a random selection from a given set of results, you will get a different random sequence. The only way to get the same random selection more than once is to save the query after thinning it. When a thinned query is saved, any thinning is saved at the same time.
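The behaviour of random thinning can be pictured with a small Python sketch. It is an assumption-laden illustration, not a description of how SARA works internally: thin_random and its parameters are hypothetical names, and the hits are represented as a plain list.

```python
import random


def thin_random(hits, how_many, rng=None):
    # A fresh, unseeded generator is used unless one is passed in, so two
    # calls on the same result set will normally pick different hits.
    if rng is None:
        rng = random.Random()
    count = min(how_many, len(hits))
    # Sample positions rather than hits, so the original order is preserved.
    chosen = sorted(rng.sample(range(len(hits)), count))
    return [hits[i] for i in chosen]


all_hits = ["hit %d" % n for n in range(1, 101)]
first_try = thin_random(all_hits, 10)
second_try = thin_random(all_hits, 10)  # usually differs from first_try
```

Because nothing records which hits were picked, the only way to return to first_try later is to keep the list itself, which corresponds to saving the thinned query.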
The results of a query can be displayed in one of two modes and in one of four different formats. You can also vary the amount of context or scope displayed for each result. Which options are in effect for a particular query will depend on the initial settings specified by the User Preferences dialogue box (see 7.5). The display mode can be changed by using the Concordance command on the Query menu, or by toggling the Concordance button; the format can be changed for a particular set of results by selecting Options from the Query menu. This also determines the amount of context displayed for each result.
In line mode, each occurrence of the item searched for is displayed as a single line on the screen; in page mode, each occurrence is displayed in full on the screen, taking as many lines as necessary.
The Concordance button is used to switch between one mode and the other. The initial mode is set by the Concordance checkbox in the User Preferences dialogue box (see 7.5): if this is checked, line mode is used; otherwise page mode is used. Selecting the Concordance command from the Query menu, or clicking on the Concordance button, enables you to switch modes for a particular set of results.
The usual Windows controls are available to enable you to display different parts of a large set of results. In line mode, you can use the vertical scroll bar to the right of the window to scroll up and down the results; in either mode, you can use the arrow buttons in the tool bar to step through the solutions one at a time. You can also use the cursor keys, PgUp and PgDn, Home and End, to move through the result set in the usual way.
Select the Options command from the Query menu to display the Options dialogue box. The radio buttons selected here determine the format used to display the current results and the amount of context (or scope) visible to either side of the query focus, as further discussed in section 6.4.3.
The following four display formats are available:
Changing any of these options will affect the display of the current query only. To change the display of all subsequent queries, changes must be made in the User Preferences dialogue box (see 7.5). Note also that changing the format of the display will usually require that the results be downloaded again.
The maximum amount of context which can be displayed for each hit is set by the Max download length specified in the User Preferences dialogue box (7.5). This sets an upper limit, as a number of characters. Setting it very high will result in long download times; setting it too low will limit the usefulness of what can be displayed on the screen.
Within this overall limit, there are four options for determining the amount of context displayed on the screen by default:
If the scope setting results in less than the maximum download length being displayed, you can always expand what is displayed up to that maximum by double-clicking on the display with the right mouse button. This will expand the context up to what would have been obtained if the Maximum scope setting were in force, for the current hit only.
The query focus is that part of a downloaded hit which is normally highlighted within the display. In a simple word, pattern, or phrase query, it is the whole of the word or phrase found which matched the query. In an SGML query, it is the SGML start- or end-tag which matched the query. In a Query Builder query, it is the part of the text which was matched by the last content node, i.e. that nearest the bottom of the screen.
In custom mode, hits are displayed according to a format which you can tailor to your own liking. You can specify whether or not particular SGML elements should be displayed starting on a new line, whether or not their associated attributes should be displayed, and also specify additional characters to be displayed in association with them.
Two such specifications can be supplied: one, held in a file called linefmt.txt, determines how hits should be displayed in line mode displays; the other, held in a file called pagefmt.txt, determines how hits should be displayed in page mode displays. These are ordinary ASCII files which can be edited and displayed by any editor (such as Notepad), or by pressing the Configure button on the Options dialogue box. The files must be held in the working directory used by the SARA client on your system, and must be writable. (See further section 10.)
The syntax of these files is fairly self explanatory. Each line specifies how a particular element type is to be displayed: if no line is supplied for any element, no special action is taken for it. A line begins with the name of an element (optionally followed by an attribute name) or entity. This is followed by a quoted string, which gives the replacement value for the named entity or for the element's start-tag. A second quoted string can also be supplied to provide a replacement for an element's end-tag. Within replacement strings, the string %s is used to represent the value of the attribute whose name was specified. Formats intended for use in page-mode displays can also use the string \n to indicate a new line and \t to indicate a tab indent.
For example, the default page format file contains the following lines:
div1 "\n"
pause "..."
event desc "[%s]"
u who "\n%s>: "
The first line indicates that the display should start a new line at the start of each new <div1> element. The second line indicates that any <pause> element should be displayed as three dots. The third line indicates that any <event> element should be displayed as whatever value has been supplied for its desc attribute, enclosed in square brackets. Finally, the last line indicates that the content of every <u> element should be prefixed by the start of a new line, the value of its who attribute, and the string >: (i.e. angle bracket, colon, space).
Care should be taken in preparing custom format files, as no syntax checking is currently performed.
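As a way of checking a format file before using it, the syntax described above can be modelled in a few lines of Python. This is a minimal sketch based only on the description given here, not SARA's own parser: the names parse_format and render_start_tag are hypothetical, malformed lines are simply skipped (how SARA itself treats them is not specified), and None is returned for elements with no rule.

```python
import re

# element name, optional attribute name, start-tag replacement,
# optional end-tag replacement
LINE_RE = re.compile(r'^\s*(\w+)(?:\s+(\w+))?\s+"([^"]*)"(?:\s+"([^"]*)")?\s*$')

DEFAULT_PAGE_FORMAT = r'''
div1 "\n"
pause "..."
event desc "[%s]"
u who "\n%s>: "
'''


def unescape(value):
    # Turn the two-character sequences \n and \t into real characters.
    return value.replace("\\n", "\n").replace("\\t", "\t")


def parse_format(text):
    rules = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # assumption: lines that do not fit the pattern are ignored
        name, attr, start, end = match.groups()
        rules[name] = (attr, unescape(start), unescape(end or ""))
    return rules


def render_start_tag(rules, name, attrs):
    # Return the replacement text for a start-tag, or None if no rule exists.
    if name not in rules:
        return None
    attr, start, _end = rules[name]
    if attr is not None:
        start = start.replace("%s", attrs.get(attr, ""))
    return start


rules = parse_format(DEFAULT_PAGE_FORMAT)
speaker_prefix = render_start_tag(rules, "u", {"who": "PS001"})
```

Here PS001 is an invented speaker code; speaker_prefix holds a newline followed by the code and the string >: as described for the last line of the default file.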
In addition to the display of results, the query window can contain two other components, each in a separate pane.
Both query text and annotation are saved together with the query, along with any valid bookmarks you defined for it.
Selecting the Listing command from the Query menu opens a standard file dialogue box in which you can specify a name for the file in which the current result set is to be saved. The result set is saved in SGML format in a file with the same name as the query itself, with the suffix SGM.
Here is the start of a sample listing file, showing the results of a search for the word `corpuses'.
<!DOCTYPE bncXtract PUBLIC "-//BNC//DTD BNC extract 0//EN">
<bncXtract>
<hdr date='10-Nov-1996 00:03:29' user=lou server='163.32.247' format=untagged>
<source>This data is extracted from the British National Corpus. All rights in the texts cited are reserved. This data may not be reproduced or redistributed in any form, other than as provided for by the Fair Use provisions of the Copyright Act</source>
<query><![CDATA["corpuses"] ]></query>
</hdr>
<hit text=EWA n=531><left> Where an absolute norm for English cannot be relied on, the next best thing is to compare the corpus whose style is under scrutiny with one or more comparable <focus>corpuses<right>, thus establishing a relative norm. </hit>
<hit text=FRG n=1222><left> These methodological difficulties are associated with a more general problem of deriving generalizations from <focus>corpuses<right>. </hit>
</bncXtract>
Housekeeping information about the query itself is saved in a <hdr> element at the start of the file, giving the date the query was solved, the name of the user, and the server machine, as well as the actual text of the query. Each result in the query result set is saved as a separate <hit> element. The text attribute gives the three character identifier of the text in which the hit was found; the n attribute gives its sentence number. The query focus of the hit is represented as a <focus> element; its left context is represented as a <left> element, and its right context as a <right> element.
Results are saved in a listing file in the format in which they are displayed. Thus, if the result set included SGML tags (i.e. results are being displayed in POS or SGML format), these tags would also appear in the listing file, which would make it difficult to process by other SGML-aware software. To make this less problematic, any angle brackets appearing as content of a <hit> element are converted to square brackets before the listing file is produced. For example, the second <hit> element above would appear as follows if the same query were saved in SGML mode:
<hit text=FRG n=1222><left>[s n=1222] [w DT0]These [w AJ0]methodological [w NN2]difficulties [w VBB]are [w VVN]associated [w PRP]with [w AT0]a [w AV0]more [w AJ0]general [w NN1]problem [w PRF]of [w AJ0-VVG]deriving [w NN2]generalizations [w PRP]from [w NN2]<focus>corpuses<right>[c PUN]. </hit>
Note that both the above examples have been reformatted to fit on the printed page: in an actual listing file, no extra line breaks are introduced within the body of a <hit> element. A full specification of the listing file format is included in .
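Because a listing file uses unquoted attribute values and leaves <left>, <focus> and <right> unclosed, a strict XML parser is unlikely to accept it. A simple pattern match is often enough to pull out the hits. The Python sketch below relies only on the layout of the sample shown above; the names read_hits and protect_tags are hypothetical, and protect_tags merely imitates the bracket conversion described in the text rather than reproducing SARA's code.

```python
import re

# One <hit> element: text identifier, sentence number, then the three parts.
HIT_RE = re.compile(
    r"<hit text=(\w+) n=(\d+)><left>(.*?)<focus>(.*?)<right>(.*?)</hit>",
    re.DOTALL,
)

SAMPLE = (
    "<hit text=FRG n=1222><left> These methodological difficulties are "
    "associated with a more general problem of deriving generalizations "
    "from <focus>corpuses<right>. </hit>"
)


def read_hits(listing):
    # Return one dictionary per <hit> element found in the listing text.
    hits = []
    for text_id, n, left, focus, right in HIT_RE.findall(listing):
        hits.append({
            "text": text_id,      # three character text identifier
            "n": int(n),          # sentence number within that text
            "left": left.strip(),
            "focus": focus,
            "right": right.strip(),
        })
    return hits


def protect_tags(content):
    # Imitates the conversion of angle brackets to square brackets.
    return content.replace("<", "[").replace(">", "]")


hits = read_hits(SAMPLE)
```

The same pattern also matches hits saved in POS or SGML format, since the tags inside them have already been converted to square brackets.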
Selecting the Source command from the Query menu or clicking on the Source button on the tool bar will display a Bibliographic data window containing information about the text in which the currently selected result appears. It also gives an indication of the size of the text in words and s-units. The information presented is the same as that available from the reference list included in the BNC Users' Reference Guide. Further information about a text, for example its classification, is available only by inspecting elements in its header.
The Bibliographic data for a written text will generally specify its author, title and publisher. The Bibliographic data for a spoken text will identify the situation in which it was recorded, and will also supply demographic or descriptive details for each person speaking in a lower window. This window can be scrolled left to right or up and down as needed.
Click on the OK button to close the Bibliographic data window. Click on the Browse button to switch to Browse mode, enabling you to browse the whole of this text, as discussed above in section 5.
The Collocation command allows you to calculate how frequently words collocate, i.e. appear together, within the current results. For example, if your current query results show occurrences of the word `death', you might wish to see how often the word `die' appears within a certain number of words of the focus.
Selecting the Collocation command from the Query menu opens the Collocation dialogue box. The name of the current query is displayed, together with the number of hits. Enter the word, punctuation mark, or SGML start-tag for which a collocation score is required (the collocate) in the box labelled Collocate and press the Calculate button. Two counts appear in the box below, indicating how often this collocate appears within a specified span, and what proportion of the hits this represents. You can repeat this process as often as you like, with each new collocate appearing in the same results box. If the collocate appears more frequently than the query focus itself, it is displayed in the highlight colour.
Collocation scores are calculated within the span (i.e. number of L-words) indicated in the box at the bottom left of the dialogue box, by default one word to either side of the first word in the query focus. The span is always counted from the leftmost end of the query focus. Changing the span causes the scores for all words to be recalculated.
You can calculate collocation scores with respect either to the number of hits actually downloaded, or with respect to the number of hits present in the corpus, depending on the setting of the Use downloaded hits only checkbox in the top left corner of the dialogue box.
Note that it is not possible to find out which words collocate strongly with a given word other than by trial and error: you must specify the words for which a collocation score is required. It is also impossible to specify a pattern as a collocate.
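The kind of count involved can be illustrated with a short Python sketch. It rests on several assumptions that the description above leaves open: each hit is given as a list of words plus the position of the first word of the query focus, the focus word itself is not counted as part of the span, matching is exact, and a hit is counted once however many times the collocate occurs in it. The function name and parameters are hypothetical.

```python
def collocation_score(hits, collocate, span=1):
    # hits: list of (words, focus_start) pairs.
    # Returns (number of hits with the collocate in span, proportion of hits).
    matching = 0
    for words, focus_start in hits:
        lo = max(0, focus_start - span)
        before = words[lo:focus_start]
        after = words[focus_start + 1:focus_start + 1 + span]
        if collocate in before + after:
            matching += 1
    proportion = matching / len(hits) if hits else 0.0
    return matching, proportion


death_hits = [
    ("he did not die a natural death".split(), 6),
    ("the death of a salesman".split(), 1),
    ("to die a slow death".split(), 4),
]
narrow = collocation_score(death_hits, "die", span=1)  # (0, 0.0)
wide = collocation_score(death_hits, "die", span=3)    # 2 of the 3 hits
```

Widening the span from 1 to 3 brings `die' into range in two of the three invented example lines, which is why changing the span means every score has to be recalculated.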
You can print the contents of the Collocation dialogue box at any time, by clicking on the Print button. This is the only way of saving the results of a Collocation analysis in the current release of the software.
This menu contains commands which allow you to customise the appearance of the main SARA window.
Choose this command to display or hide the tool bar at the top of the window. A check mark appears next to this menu item when the tool bar is displayed.
The tool bar contains a row of buttons, each of which provides rapid access to one of the following commonly used SARA functions.
[Figure: The SARA Tool Bar]
The buttons, together with a brief description of what each one does, are listed below, in left to right order.
Choose this command to display or hide the status bar at the bottom of the main SARA window. A check mark appears next to the menu item when the status bar is displayed.
The status bar has three areas: the leftmost part is used to display messages describing the action to be executed by the currently selected menu item or tool bar button; the central part provides information about the currently selected solution; the rightmost part displays information about the current state of the keyboard.
The left area of the status bar describes actions of menu items as you use the arrow keys to navigate through menus. This area similarly shows messages that describe the actions of tool bar buttons as you depress them, before releasing them. If after viewing the description of the tool bar button command you wish not to execute the command, then release the mouse button while the pointer is off the tool bar button.
The centre area of the status bar identifies the currently selected solution. The first pane shows the number of this solution, the total number of solutions and the number of texts from which they are taken; the next shows the three letter identifier of the text in which the current solution occurs; and the last gives the number of this <s> element within this text. For example, if the currently selected solution occurs in the 145th s-unit of text ABC, and is the third of 12 solutions taken from 8 texts, the display will read 3:12(8) ABC 145.
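For anyone transcribing these readings into notes, the layout of the centre area can be expressed as a pair of small Python helpers. They follow the single example given above; the function names are hypothetical, and the exact spacing between the panes is an assumption.

```python
import re

STATUS_RE = re.compile(r"(\d+):(\d+)\((\d+)\) (\w+) (\d+)")


def format_status(solution, total, texts, text_id, s_number):
    # solution number : total solutions (number of texts), text id, s-unit
    return f"{solution}:{total}({texts}) {text_id} {s_number}"


def parse_status(display):
    # Split a reading such as "3:12(8) ABC 145" back into its five parts.
    match = STATUS_RE.fullmatch(display)
    if match is None:
        raise ValueError("not a recognised status reading: %r" % display)
    solution, total, texts, text_id, s_number = match.groups()
    return int(solution), int(total), int(texts), text_id, int(s_number)


reading = format_status(3, 12, 8, "ABC", 145)  # "3:12(8) ABC 145"
```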
The status bar display may not be updated until a set of results has been completely downloaded.
The rightmost part of the status bar will contain one or more of the following words if the corresponding key on the keyboard is currently latched down:
Selecting Font from the View menu causes the standard Windows Font dialogue box to appear. This dialogue box lists the fonts available on your system. You can use it to set the font in which solutions are to be displayed.
Selecting Colours from the View menu causes the Colours dialogue box to appear. This is used to specify the colours used to display parts of the results. Different colours may be specified for each part of speech code, for the query focus (i.e. the actual hit word) and for the default. You can also specify that any of these items is to be displayed in a bold face, in italic, or both.
To change the colour used for words of a particular type, first highlight the relevant item or items from the list at the left of the dialogue box by clicking on them with the mouse, in the same way as words are selected from the word query box. Next press the Colour button, which causes the colour palette dialogue box to open. Click on the desired colour. The palette dialogue box disappears, and the Sample box in the colours dialogue box changes to show the effect of the choice just made. If this is unsatisfactory, click the Reset button to revert to the original colour. The Bold and Italic check boxes can be selected to change the weight and slant of the selected items independently of their colour.
A set of such specifications is known as a colour scheme. Each colour scheme is saved in a file with extension .COL. The name of the colour scheme currently in force is displayed at the top right of the colours dialogue box: to select a new scheme, press the Open scheme button. This opens a file dialogue box, in which the scheme may be named. The Save scheme button saves the set of colours currently defined. The Merge scheme button opens an existing colour scheme and allows you to modify it further.
Click the OK button to close the dialogue box and save all changes. Click the Cancel button to leave without changing the colour scheme.
Selecting Preferences from the View menu causes the User preferences dialogue box to be displayed.
This dialogue box is used to set the default behaviour of the SARA client. Changes made here affect all subsequent queries in this and following sessions, but not any currently displayed result set. In addition to changing the defaults for the way that results are displayed (as discussed above in section 6.4.3) you can reset
Pressing the Comms button displays the Communications dialogue box. To connect to a different server, you will need to know two numbers: the IP address of the computer concerned, and the port on which it listens for calls from a SARA client. Consult your local administrator for more information on these.
Pressing the Password button displays the Password dialogue box. To change your password, you must supply your current password, and type the new password twice. The new password will take effect from the next time you try to log in.
The commands on this menu allow you to move and manipulate the windows on the screen, in the same way as most other Microsoft Windows applications. The following commands are available:
The commands on this menu allow you to consult SARA's built-in help system, in the same way as most other Microsoft Windows applications. The following commands are available:
Before trying to install the SARA Windows client, you should find out the following information:
You should also check that the computer on which you plan to install the client has the following characteristics:
The installation process creates a SARA directory to hold client executables and a number of parameter files. If you are installing the software on a local network, you should take care that these parameter files are installed in a writable directory, though the executable itself need not be.
Please read the README file supplied with the software for details of any changes and for specific details of the installation process. Up to date information about the current state of the SARA software is also regularly announced on the bnc-discuss mailing list, and on the BNC's own web site. See http://www.natcorp.ox.ac.uk/SARA for details.