The Automatic Tagging of the British National Corpus (Information to be used with the BNC Sampler Corpus)

Geoffrey Leech and Nicholas Smith

UCREL, Lancaster University, 1998

The BNC2 was automatically tagged using CLAWS4, a much improved version of the CLAWS1 automatic tagger (developed by Roger Garside, Ian Marshall, Eric Atwell and Geoffrey Leech in 1983) that was used to tag the LOB Corpus. CLAWS4 is principally the work of Roger Garside, although many other researchers at Lancaster have contributed to its performance in one way or another. Further information about CLAWS4 can be found in Garside and Smith (1997).

CLAWS4 is a hybrid tagger, employing a mixture of probabilistic and non-probabilistic techniques. It assigns a tag to a word as a result of five main processes:

a. Tokenization

The first major step in automatic tagging is to divide up the text or corpus to be tagged into (1) individual word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence-final punctuation followed by a capital letter). This procedure is not as straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence demarcators) and of capital letters (which can signal a naming expression as well as the beginning of a sentence). Faults in tokenization occasionally occur, but hardly ever cause tagging errors.

In tokenization, an orthographic word boundary (normally a space, with or without accompanying punctuation) is the default test for identifying the beginning and end of word-tokens. (See, however, the next paragraph and d below.) Hyphens are counted as word-internal, so that a hyphenated word such as key-ring is given just one tag (NN1 - 'singular common noun'). Because of the different ways of writing compound words, the same compound may occur in three forms: as a single word written 'solid' (markup), as a hyphenated word (mark-up) or as a sequence of two words (mark up). In the first two cases, CLAWS4 will give the compound a single tag, whereas in the third case, it will receive two tags: one for mark and the other for up.
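The boundary tests just described can be illustrated with a small sketch. The boundary heuristics here are deliberately simplified assumptions for the example; the real CLAWS4 tokenizer handles abbreviations, names and other hard cases that this illustration does not.

```python
import re

def split_sentences(text):
    # Approximate a sentence boundary as sentence-final punctuation
    # followed by whitespace and a capital letter (a simplification:
    # this misreads abbreviations such as "Dr. Smith").
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def tokenize(sentence):
    # Word tokens are delimited by spaces; surrounding punctuation is
    # stripped, but internal hyphens are kept, so "key-ring" stays
    # one token (and so receives one tag).
    tokens = []
    for chunk in sentence.split():
        token = chunk.strip('.,!?;:"()')
        if token:
            tokens.append(token)
    return tokens

sentences = split_sentences("He bought a key-ring. It was cheap.")
# -> ['He bought a key-ring.', 'It was cheap.']
tokens = tokenize(sentences[0])
# -> ['He', 'bought', 'a', 'key-ring']
```

On this scheme, as in the text, markup and mark-up would each be one token, while mark up would yield two.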

A set of special cases dealt with by tokenization is the set of enclitic verb and negative contractions such as 's, 're, 'll and n't, which are orthographically attached to the preceding word. These will each be given a tag of their own, so that (for example) the orthographic forms It's, they're, and can't are given two tags in sequence: pronoun + verb, verb + negative, etc. There are also some 'merged' forms such as won't, gimme and dunno, which are decomposed into more than one word for tagging purposes. For example, dunno actually ends up with the three tags for do + n't + know.
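The enclitic and merged-form handling might be sketched as follows. The clitic list and the merged-form table here are assumptions for the example, not CLAWS4's actual lists, which are more extensive.

```python
ENCLITICS = ("n't", "'s", "'re", "'ll", "'ve", "'d", "'m")

# Merged forms are decomposed into more than one word for tagging
# purposes (the entries below are illustrative examples only).
MERGED = {
    "dunno": ["do", "n't", "know"],
    "gimme": ["give", "me"],
    "won't": ["will", "n't"],
}

def split_token(token):
    low = token.lower()
    if low in MERGED:
        return MERGED[low]
    for clitic in ENCLITICS:
        # Split off an enclitic only if something precedes it.
        if low.endswith(clitic) and len(low) > len(clitic):
            return [token[:-len(clitic)], token[-len(clitic):]]
    return [token]

# split_token("It's")  -> ['It', "'s"]        (pronoun + verb)
# split_token("can't") -> ['ca', "n't"]       (verb + negative)
# split_token("dunno") -> ['do', "n't", 'know']
```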

b. Initial assignment of tags

The second stage of CLAWS4 tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various JJ (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VV0, VVI, i.e. as a noun or as a verb (either present tense or infinitive); the token broadcast can be tagged as VV0, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (JJ), as in a broadcast concert.

To find the list of potential tags associated with a word, CLAWS first looks up the word in a lexicon of c.50,000 word entries. This lexicon look-up accounts for a large proportion of the word tokens in a text. However, many rarer words or names will not be found in the lexicon, and are tagged by other test procedures. Some of the other procedures are:

When a word is associated with more than one tag, information is given by the lexicon look-up or other procedures on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form or, where the available quantitative data are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.
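Lexicon look-up with rarity markers might be sketched as follows. The lexicon entries, the tag labels for for, and the numeric weights attached to 'unmarked', 'rare' and 'very rare' are illustrative assumptions, not CLAWS4's actual lexicon or figures.

```python
# Assumed weights for the three rarity markers described in the text.
RARITY_WEIGHT = {"unmarked": 1.0, "rare": 0.1, "very rare": 0.01}

# Toy lexicon: each entry pairs a candidate tag with a rarity marker.
LEXICON = {
    "various": [("JJ", "unmarked")],
    "paint":   [("NN1", "unmarked"), ("VV0", "rare"), ("VVI", "rare")],
    "for":     [("IF", "unmarked"), ("CS", "very rare")],  # preposition vs conjunction
}

def candidate_tags(word):
    """Return (tag, probability) pairs for a word, normalised to sum to 1.
    Unknown words fall back to a single guessed tag here; CLAWS4 instead
    applies further test procedures."""
    entries = LEXICON.get(word.lower(), [("NN1", "unmarked")])
    weighted = [(tag, RARITY_WEIGHT[mark]) for tag, mark in entries]
    total = sum(w for _, w in weighted)
    return [(tag, w / total) for tag, w in weighted]
```

So candidate_tags("for") gives the preposition tag almost all of the probability mass, matching the intuition described above.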

Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of various tags depends partly on whether it occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less likely to be a proper noun than an adjective or a common noun (normally written brown). Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.
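The sentence-initial adjustment can be sketched as below; the discount factor and the proper-noun tag label NP1 are illustrative assumptions, not CLAWS4's actual values.

```python
def adjust_for_position(tag_probs, word, sentence_initial):
    """Reduce the proper-noun probability for a capitalised word at the
    start of a sentence, then renormalise the distribution."""
    if not (sentence_initial and word[:1].isupper()):
        return tag_probs
    # Discount the proper-noun reading; 0.2 is an invented factor.
    discounted = {tag: p * (0.2 if tag == "NP1" else 1.0)
                  for tag, p in tag_probs.items()}
    total = sum(discounted.values())
    return {tag: p / total for tag, p in discounted.items()}

probs = {"NP1": 0.6, "JJ": 0.2, "NN1": 0.2}
adjusted = adjust_for_position(probs, "Brown", sentence_initial=True)
# The proper-noun share of "Brown" drops from 0.6 to roughly 0.23.
```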

c. Tag selection (or disambiguation)

The next stage, logically, is to choose the most probable tag from any ambiguous set of tags associated with a word token by tag assignment (but see d below). This is another probabilistic procedure, this time making use of the context in which a word occurs. The Viterbi algorithm uses the probabilistic estimates available, both the tag-word associations and the sequential tag-tag likelihoods, to calculate the most likely path through the sequence of tag ambiguities. (The model employed is largely equivalent to a hidden Markov model.) After tag selection, there is a single 'winning tag' for each word token in a text. However, this is not necessarily the right answer. If CLAWS tagging stopped at this point, only c.95-96% of the word tokens would be correctly tagged. This is the main reason for including an additional stage (or rather a set of stages) termed 'idiomtagging'.
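A compact sketch of Viterbi tag selection, in the spirit of the hidden-Markov-style model described above, is given below. The lexical and transition probabilities are toy figures invented for the example.

```python
def viterbi(tokens, lexical, transition, start="START"):
    """Return the most likely tag sequence for a token list.
    lexical[word] maps tag -> P(word | tag) (toy values);
    transition[(t1, t2)] maps a tag bigram -> P(t2 | t1)."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for word in tokens:
        new_best = {}
        for tag, p_lex in lexical[word].items():
            # Pick the predecessor tag giving the highest path probability;
            # unseen transitions get a small smoothing value.
            prob, path = max(
                (p * transition.get((prev, tag), 1e-6) * p_lex, prev_path)
                for prev, (p, prev_path) in best.items()
            )
            new_best[tag] = (prob, path + [tag])
        best = new_best
    return max(best.values())[1]

lexical = {
    "the":   {"AT": 1.0},
    "paint": {"NN1": 0.7, "VV0": 0.3},
}
transition = {("START", "AT"): 0.9, ("AT", "NN1"): 0.8, ("AT", "VV0"): 0.01}

# viterbi(["the", "paint"], lexical, transition) -> ['AT', 'NN1']
```

The article-noun transition outweighs the verb reading of paint, so the noun tag wins, just as context-sensitive tag selection is meant to work.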

d. Idiomtagging

Idiomtagging is a stage of CLAWS4's operation in which sequences of words and tags are matched against a 'template'. Depending on the match, the tags may be disambiguated or corrected. In practice, there are two main reasons for idiomtagging:

Idiomtagging is a matching procedure which operates on lists of rules which might loosely be termed 'idioms'. Among these are:

The idiomtagging component of CLAWS is quite powerful in matching 'template' expressions, which may contain wild-card symbols, Boolean operators and gaps of up to n words. These templates are much more variable than 'idioms' in the ordinary sense, and resemble finite-state networks in specifying a 'recognition grammar' for identifying any string of words+tags matching a set of conditions.

Another important point about idiomtagging is that it is split into two phases which operate at different points in the tagging system. One part of the idiomtagging takes place at the end of Stage c, in effect retrospectively correcting some of the errors which would otherwise occur in CLAWS output. Another part, however, actually takes place between Stages b and c. This means it can utilise ambiguous input and also produce ambiguous output, perhaps adjusting the likelihood of one tag relative to another. As an example, consider the case of so long as, which can be a single grammatical item - a conditional conjunction meaning 'provided that'. The difficulty is that so long as can also be a sequence of three separate grammatical items: degree adverb + adjective/adverb + conjunction. In this case, the tagging ambiguity belongs to a whole word sequence rather than a single word, and the output of the idiomtagging has to be passed on to the probabilistic tag selection stage. Hence, although we have called idiomtagging 'Stage d', it is actually split between two stages, one preceding c and one following c.
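A minimal sketch of idiomtagging as template matching is given below. The matcher and the rule format are invented for illustration (real CLAWS4 templates allow wild-cards, Boolean operators and gaps, and can leave both readings open for the probabilistic stage); the CS31/CS32/CS33 labels follow the ditto-tag numbering convention for multi-word items, but the specific tags here are assumptions.

```python
def apply_idiom_rule(tokens, pattern, new_tags):
    """tokens: list of (word, tag) pairs. Wherever the word pattern
    occurs as a contiguous run, overwrite its tags with new_tags."""
    n = len(pattern)
    out = list(tokens)
    for i in range(len(out) - n + 1):
        if [w.lower() for w, _ in out[i:i + n]] == pattern:
            for j, tag in enumerate(new_tags):
                out[i + j] = (out[i + j][0], tag)
    return out

# "so long as" initially tagged as three separate items (labels assumed).
tagged = [("so", "RG"), ("long", "JJ"), ("as", "CS"),
          ("you", "PPY"), ("pay", "VV0")]
# Retag the sequence as a single three-word conditional conjunction.
result = apply_idiom_rule(tagged, ["so", "long", "as"],
                          ["CS31", "CS32", "CS33"])
# -> [('so', 'CS31'), ('long', 'CS32'), ('as', 'CS33'),
#     ('you', 'PPY'), ('pay', 'VV0')]
```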

e. Post-processing

When the text emerges from Stages c and d, each word has an associated set of one or more tags, and each tag is associated with a probability expressed as a percentage. An example is:

entering VVG 86% NN1 14% JJ 0%

Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.
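Reading off the winning tag from a line in this format can be sketched as follows; the parsing is an assumption based only on the example line shown above.

```python
def winning_tag(line):
    """Parse a line of the form 'word TAG p% TAG p% ...' and return the
    word together with its highest-percentage tag."""
    parts = line.split()
    word, pairs = parts[0], parts[1:]
    # The fields after the word alternate: tag, percentage.
    scored = [(pairs[i], int(pairs[i + 1].rstrip('%')))
              for i in range(0, len(pairs), 2)]
    best_tag, _ = max(scored, key=lambda tp: tp[1])
    return word, best_tag

# winning_tag("entering VVG 86% NN1 14% JJ 0%") -> ('entering', 'VVG')
```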

The post-processing phase is designed to produce output in the form the user will find most usable. The output for the BNC is as follows:

For the BNC Sampler Corpus, the tags have undergone a further stage of manual editing, so that erroneous tags have been corrected.

Reference: For further information, see:

Garside, R. and N. Smith, 'A hybrid grammatical tagger: CLAWS4', in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora, London: Longman (1997), pp.102-121.