Using CLAWS to annotate the British National Corpus

Roger Garside, Department of Computing, University of Lancaster



The Annotation Process

The main steps of the CLAWS annotation process as applied to the BNC are as follows:

  • text-files to be processed are deposited by OUCS (Oxford University Computing Services) in a "drop" directory on the UCREL computer system, from which an automatic procedure logs them into UCREL processing directories overnight, so that they are available to the corpus analysts when they start work.
  • a corpus analyst selects text-files to be processed, and invokes a procedure to carry out the various steps of the tagging process in a work area allocated to that analyst, monitoring the process and logging the completion and success or failure of each step (a minimal sketch of such a driver follows this list). This makes it easy to keep track of where each text-file has reached in processing.
  • the first step on a new text-file is to run it through an SGML parser, to check the validity of the formatting of the file on arrival. This is followed by a number of filter programs which check for various special SGML situations which are valid according to the parser but cause the Claws tagging system problems. Examples are certain formats of quoted SGML attribute values, and certain empty SGML elements; these are silently corrected in the source text-file (although the original text-file is archived in this case).
  • next the Claws part-of-speech tagging system is run. The text is divided into separate orthographic units, and a part-of-speech marker is assigned to each such unit; Claws also segments the text into units which approximate to sentences. The output from this step is a list of orthographic units, together with the preferred tag and some information about how it was chosen. The SGML header information and the more voluminous information associated with SGML tags are put in a supplementary output file, to be re-incorporated with the main output at a later step. This tagging phase is discussed in more detail in the next section.
  • the next step is a post-processing program whose task is to reformat the Claws output into the format required for returning to OUCS. This involves re-incorporating the information in the supplementary file, deleting the extra tagging process information, and representing each part-of-speech marker as an SGML entity. It is at this stage that portmanteau tags are introduced (see next section). Our original plans were to merge this step into the output phase of the Claws system, but we decided that it was preferable to retain the intermediate format with the extra information while manual corrections were being made to the annotation.
  • after this there are a number of further filter programs, which check various aspects of the current output file (for example, an intelligent differencing program checks the validity of all differences between the source text and the output text) or which make various systematic changes to the output file (such as ensuring the SGML element structure is still valid after Claws has segmented the text). This culminates in a rerun of the SGML parser, to ensure the validity of the output file.
  • at this stage the automatic processing terminates, and manual processing commences. All the error reports from the above steps are examined, and where necessary scripts are invoked to correct the annotation. Usually the Claws output is corrected, but in extremis the final output file or the original source text can be adjusted. The scripts ensure that steps are rerun as appropriate, and that the monitor information is kept up-to-date, providing an audit trail of what has been done. Selected hundred-sentence blocks are manually corrected to check the validity of the automatic processing.
  • the output files are then returned to OUCS. This is at present a manual procedure, but we expect to automate it at a later stage.
  • after the annotated text-files have been sent to OUCS and some further processing has been performed there, there is a further window of opportunity within which UCREL can revise the text-file annotation if necessary. We are planning to use this to ensure consistency between texts tagged at different times, and to eliminate certain erroneous tagging decisions where this can be done without disruption to the remainder of the text.
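
To make the workflow above concrete, the following is a minimal sketch (in Python, purely for illustration) of the kind of per-file driver described in the second step: it runs each stage in turn, appends success or failure to a monitor file to provide the audit trail, and stops for manual intervention when a stage fails. The step names and commands are illustrative assumptions, not the actual UCREL scripts.

    import datetime
    import pathlib
    import subprocess

    # Illustrative step names and commands only; the real pipeline uses UCREL's own scripts.
    STEPS = [
        ("sgml-parse",   ["sgml_parse"]),      # validate incoming SGML mark-up
        ("sgml-filter",  ["sgml_filter"]),     # silently correct known problem cases
        ("claws",        ["claws4"]),          # part-of-speech tagging proper
        ("postprocess",  ["claws_to_sgml"]),   # reformat for return to OUCS
        ("diff-check",   ["diff_check"]),      # compare source text against output text
        ("sgml-reparse", ["sgml_parse"]),      # final validity check on the output
    ]

    def run_pipeline(textfile: pathlib.Path, workdir: pathlib.Path) -> bool:
        """Run each step on one text-file, appending an audit trail to its monitor file."""
        monitor = workdir / (textfile.stem + ".monitor")
        with monitor.open("a") as log:
            for name, command in STEPS:
                result = subprocess.run(command + [str(textfile)], cwd=workdir)
                status = "ok" if result.returncode == 0 else "FAILED"
                log.write(f"{datetime.datetime.now().isoformat()} {name} {status}\n")
                if result.returncode != 0:
                    return False   # leave the file here; the analyst corrects and reruns
        return True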

Automatic Tagging of the BNC

The tagging of the BNC is carried out with a version of the Claws stochastic part-of-speech tagger (Marshall 1983; Garside, Leech and Sampson 1987). Its main steps are:

  • formatting of the input text into orthographic units, and segmentation into units approximating to sentences.
  • assignment of a list of potential part-of-speech markers to each orthographic unit, based on a lexicon, suffix-list, and a set of rules to deal with capitalised words, hyphenated words, etc.
  • modification of the potential part-of-speech lists by matching to a collection of pattern templates, which make use of the original words and the potential part-of-speech markers already introduced.
  • selection of the preferred part-of-speech by calculation of the most likely part-of-speech sequence, based on probabilities taken from a large corpus of annotated text and using the well-known Viterbi alignment procedure (a minimal sketch of this step follows this list).
  • reformatting and output of the results.
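
The selection step can be illustrated with a minimal Viterbi sketch (in Python, for illustration only; the probability figures below are invented and are not Claws's actual parameters). Each word carries its candidate tags with lexical probabilities, and bigram transition probabilities estimated from a tagged training corpus score each possible tag sequence.

    import math

    def viterbi(candidates, trans):
        """candidates: one dict per word, mapping candidate tag -> lexical probability.
           trans[(t1, t2)]: estimated probability of tag t2 following tag t1."""
        # first column: score and path for each possible starting tag
        cols = [{t: (math.log(p), [t]) for t, p in candidates[0].items()}]
        for word in candidates[1:]:
            col = {}
            for t, p in word.items():
                score, path = max(
                    (s + math.log(trans.get((prev, t), 1e-6)) + math.log(p), pth)
                    for prev, (s, pth) in cols[-1].items())
                col[t] = (score, path + [t])
            cols.append(col)
        return max(cols[-1].values())[1]   # best-scoring path through the final column

    # Invented figures: after an article, the noun reading of an ambiguous word wins.
    sentence = [{"AT0": 1.0}, {"NN1": 0.6, "VVB": 0.4}]
    trans = {("AT0", "NN1"): 0.5, ("AT0", "VVB"): 0.01}
    print(viterbi(sentence, trans))        # -> ['AT0', 'NN1']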

The main modifications we have made to the Claws system (we are currently using version 16 of the Claws4 system) are:

  • earlier versions of Claws used a tagset which evolved from the tagsets used to tag the Brown and LOB corpora. The current version of this (which we call the c6 tagset) has some 170 to 180 tags or parts-of-speech, and is being used to annotate the 2 million word core corpus. For the rest of the BNC we are using a more restricted (c5) tagset of about 65 tags, eliminating some of the finer distinctions made in the larger tagset (for instance, the distinctions between various classes of common noun). Claws has been rewritten so as to be independent of the particular tagset used, and the appropriate tagset is now read in before the other resources (lexicon, etc.) are read.
    In some cases Claws needs to preserve a distinction between certain tags in order to perform the disambiguation process adequately, where we do not wish to maintain the distinction in the final output. In this case, we use what we call process tags, which are mapped onto a smaller set of output tags in the final reformatting and output step listed above.
  • earlier versions of Claws chose a single part-of-speech marker for each orthographic unit, and (in common with other stochastic tagging systems) operated at an accuracy rate of about 96%. In order to provide more useful results for a substantial proportion of the residual words which cannot be successfully tagged, we have introduced portmanteau tags. A portmanteau tag is used in a situation where there is insufficient evidence for Claws to make a clear distinction between two tags. Thus, in the notoriously difficult choice between a past participle and the past tense of a verb, if there is insufficient probabilistic evidence to choose between the two, Claws marks the word as VVN-VVD. A set of fifteen such portmanteau tags has been declared, covering the major pairs of confusable tags. Experiments have been done to choose a threshold for each portmanteau tag, involving a trade-off between reduced tagging accuracy and reduced tag ambiguity (a minimal sketch of this decision follows this list).
  • a great deal of effort has been required in interfacing the Claws system to the SGML mark-up of the input text-files, and in ensuring that the addition of segment markers is consistent with the other SGML mark-up. The resources used by Claws (lexicon, etc.) have now been translated from using the LOB notation for accented letters and other special symbols to using a set of SGML entity names.
  • the lexicons used by the Claws system are in a constant state of improvement. One major change is that we have incorporated a lexicon of some four to five thousand proper names (mainly place names, but also common personal names, etc.).
  • we have developed the pattern template idea (which we erroneously call an idiomlist) very extensively in the current version of Claws. We now have several such template lists, matched at different stages of the tagging process; each pattern consists of a sequence of required or optional items, each of which is a regular expression to match an orthographic unit (with specified restrictions on typographic case) or one of the potential part-of-speech markers assigned at an earlier stage. We use this to find sequences of orthographic units which should be treated as single grammatical units (for example according to), as foreign expressions (for example hoi polloi), or as place names or other naming expressions (for example Ann Arbor and the Sunday Times). We also use this mechanism to catch particular word and tag patterns which are commonly mis-tagged, in order to supply the correct tag sequence.
  • finally, the most recent version of Claws has been modified to deal with the spoken section of the BNC. There are supplementary lexicons and lists of pattern templates for spoken data; we have made an attempt to deal with common patterns of orthography used to represent non-standard speech (such as truncated words, and, for example, writing having as `avin'); and Claws looks for vocalized pauses and repetitions of parts of phrases (thus we erm, we stopped going is disambiguated as if it read simply we stopped going).
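
As a minimal sketch of the portmanteau decision mentioned in the list above (in Python, for illustration only): if the best tag does not sufficiently dominate the runner-up, and the pair is one of the declared confusable pairs, a portmanteau tag is output instead of forcing a choice. The pairs and the single threshold shown are illustrative assumptions, not the values actually tuned for Claws4, which chooses a threshold separately for each portmanteau tag.

    # Illustrative confusable pairs; the real system declares fifteen such tags.
    PORTMANTEAU = {
        frozenset({"VVD", "VVN"}): "VVN-VVD",   # past tense vs past participle
        frozenset({"AJ0", "VVN"}): "AJ0-VVN",   # adjective vs past participle
        frozenset({"NN1", "VVB"}): "NN1-VVB",   # singular noun vs base form of verb
    }
    THRESHOLD = 0.8   # single illustrative threshold; in practice one per portmanteau tag

    def choose_tag(scored):
        """scored: dict mapping candidate tag -> probability from the selection step."""
        ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
        best_tag, best_p = ranked[0]
        if len(ranked) == 1 or best_p / sum(p for _, p in ranked) >= THRESHOLD:
            return best_tag
        second_tag = ranked[1][0]
        # fall back to the single best tag if the pair is not a declared portmanteau
        return PORTMANTEAU.get(frozenset({best_tag, second_tag}), best_tag)

    print(choose_tag({"VVN": 0.55, "VVD": 0.45}))   # -> VVN-VVD
    print(choose_tag({"VVN": 0.95, "VVD": 0.05}))   # -> VVN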

Some BNC text analysed by CLAWS

The following is an example of a piece of BNC text with c5 part-of-speech markers (taken from Captain Pugwash and the Huge Reward):

<s c="0000002 002" n=00001>
When&AVQ-CJS; Captain&NP0; Pugwash&NP0; retires&VVZ; from&PRP;
active&AJ0; piracy&NN1; he&PNP; is&VBZ; amazed&AJ0-VVN; and&CJC;
delighted&AJ0-VVN; to&TO0; be&VBI; offered&VVN; a&AT0; Huge&AJ0;
Reward&NN1; for&PRP; what&DTQ; seems&VVZ; to&TO0; be&VBI; a&AT0;
simple&AJ0; task&NN1;.&PUN;
<s c="0000005 022" n=00002>
Little&DT0; does&VDZ; he&PNP; realise&VVI; what&DTQ; villainy&NN1;
and&CJC; treachery&NN1; lurk&NN1-VVB; in&PRP; the&AT0; little&AJ0;
town&NN1; of&PRF; Sinkport&NN1-NP0;,&PUN; or&CJC; what&DTQ; a&AT0;
hideous&AJ0; fate&NN1; may&VM0; await&VVI; him&PNP; there&AV0;.&PUN;
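
Each orthographic unit in the output above is followed by its part-of-speech marker written as an SGML entity reference (word&TAG;), with <s ...> elements delimiting the sentence-like segments. The following is a minimal sketch (in Python, for illustration only) of reading this format back into (word, tag) pairs:

    import re

    # a word followed by &TAG; where TAG may be a portmanteau such as AVQ-CJS
    TOKEN = re.compile(r"(\S+?)&([A-Z0-9]+(?:-[A-Z0-9]+)?);")

    def read_tagged(text):
        """Yield one list of (word, tag) pairs per <s ...> segment."""
        for segment in re.split(r"<s[^>]*>", text)[1:]:
            yield TOKEN.findall(segment)

    sample = '''<s c="0000002 002" n=00001>
    When&AVQ-CJS; Captain&NP0; Pugwash&NP0; retires&VVZ; from&PRP;'''
    for pairs in read_tagged(sample):
        print(pairs[:3])   # -> [('When', 'AVQ-CJS'), ('Captain', 'NP0'), ('Pugwash', 'NP0')]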

The C5 Tagset

AJ0 adjective (unmarked) (e.g. GOOD, OLD)
AJC comparative adjective (e.g. BETTER, OLDER)
AJS superlative adjective (e.g. BEST, OLDEST)
AT0 article (e.g. THE, A, AN)
AV0 adverb (unmarked) (e.g. OFTEN, WELL, LONGER, FURTHEST)
AVP adverb particle (e.g. UP, OFF, OUT)
AVQ wh-adverb (e.g. WHEN, HOW, WHY)
CJC coordinating conjunction (e.g. AND, OR)
CJS subordinating conjunction (e.g. ALTHOUGH, WHEN)
CJT the conjunction THAT
CRD cardinal numeral (e.g. 3, FIFTY-FIVE, 6609) (excl ONE)
DPS possessive determiner form (e.g. YOUR, THEIR)
DT0 general determiner (e.g. THESE, SOME)
DTQ wh-determiner (e.g. WHOSE, WHICH)
EX0 existential THERE
ITJ interjection or other isolate (e.g. OH, YES, MHM)
NN0 noun (neutral for number) (e.g. AIRCRAFT, DATA)
NN1 singular noun (e.g. PENCIL, GOOSE)
NN2 plural noun (e.g. PENCILS, GEESE)
NNN <<PROCESS TAG>> numeral noun, neutral for number (e.g. DOZEN, HUNDRED)
NNN <<PROCESS TAG>> plural numeral noun (e.g. HUNDREDS, THOUSANDS)
NNS <<PROCESS TAG>> noun of style (e.g. PRESIDENT, GOVERNMENTS, MESSRS.)
NP0 proper noun (e.g. LONDON, MICHAEL, MARS)
NUL the null tag (for items not to be tagged)
ORD ordinal (e.g. SIXTH, 77TH, LAST)
PNI indefinite pronoun (e.g. NONE, EVERYTHING)
PNP personal pronoun (e.g. YOU, THEM, OURS)
PNQ wh-pronoun (e.g. WHO, WHOEVER)
PNX reflexive pronoun (e.g. ITSELF, OURSELVES)
POS the possessive (or genitive morpheme) 'S or '
PRF the preposition OF
PRP preposition (except for OF) (e.g. FOR, ABOVE, TO)
PUL punctuation - left bracket (i.e. ( or [ )
PUN punctuation - general mark (i.e. . ! , : ; - ? ... )
PUQ punctuation - quotation mark (i.e. ` ' " )
PUR punctuation - right bracket (i.e. ) or ] )
TO0 infinitive marker TO
UNC "unclassified" items which are not words of the English lexicon
VBB the "base forms" of the verb "BE" (except the infinitive), i.e. AM, ARE
VBD past form of the verb "BE", i.e. WAS, WERE
VBG -ing form of the verb "BE", i.e. BEING
VBI infinitive of the verb "BE"
VBN past participle of the verb "BE", i.e. BEEN
VBZ -s form of the verb "BE", i.e. IS, 'S
VDB base form of the verb "DO" (except the infinitive), i.e. DO
VDD past form of the verb "DO", i.e. DID
VDG -ing form of the verb "DO", i.e. DOING
VDI infinitive of the verb "DO"
VDN past participle of the verb "DO", i.e. DONE
VDZ -s form of the verb "DO", i.e. DOES
VHB base form of the verb "HAVE" (except the infinitive), i.e. HAVE
VHD past tense form of the verb "HAVE", i.e. HAD, 'D
VHG -ing form of the verb "HAVE", i.e. HAVING
VHI infinitive of the verb "HAVE"
VHN past participle of the verb "HAVE", i.e. HAD
VHZ -s form of the verb "HAVE", i.e. HAS, 'S
VM0 modal auxiliary verb (e.g. CAN, COULD, WILL, 'LL)
VVB base form of lexical verb (except the infinitive)(e.g. TAKE, LIVE)
VVD past tense form of lexical verb (e.g. TOOK, LIVED)
VVG -ing form of lexical verb (e.g. TAKING, LIVING)
VVI infinitive of lexical verb
VVN past participle form of lex. verb (e.g. TAKEN, LIVED)
VVZ -s form of lexical verb (e.g. TAKES, LIVES)
XX0 the negative NOT or N'T
ZZ0 alphabetical symbol (e.g. A, B, c, d)

Bibliography

Garside, R.G. (1993). The Large-scale Production of Syntactically-analysed Corpora, Literary and Linguistic Computing, 8: 39-46.

Garside, R.G., Leech, G.N., and Sampson, G.R. (eds) (1987). The Computational Analysis of English: A Corpus-based Approach. Longman, London.

Leech, G.N., and Garside, R.G. (1991). Running a Grammar Factory: the Production of Syntactically Analysed Corpora or `Treebanks'. In English Computer Corpora: Selected Papers and Research Guide, edited by S. Johansson and A.-B. Stenström. Mouton de Gruyter, Berlin.

Marshall, I. (1983). Choice of Grammatical Word-class without Global Syntactic Analysis: Tagging Words in the LOB Corpus, Computers and the Humanities, 17: 139-50.