The Automatic Tagging of the British National Corpus (Information to be used with the BNC Sampler Corpus)
Geoffrey Leech and Nicholas Smith
UCREL, Lancaster University, 1998
The BNC2 was automatically tagged using CLAWS4, a much
improved version of the CLAWS1 automatic tagger (developed in 1983
by Roger Garside, Ian Marshall, Eric Atwell and Geoffrey Leech),
which was used to tag the LOB Corpus. The advanced version CLAWS4 is principally
the work of Roger Garside, although many other researchers at
Lancaster have contributed to its performance in one way or another.
Further information about CLAWS4 can be obtained from Garside
and Smith (1997).
CLAWS4 is a hybrid tagger, employing a mixture of probabilistic
and non-probabilistic techniques. It assigns a tag to a word as
a result of five main processes:
a. Tokenization
The first major step in automatic tagging is to divide up the
text or corpus to be tagged into individual (1) word tokens and
(2) orthographic sentences. These are the segments usually demarcated
by (1) spaces and (2) sentence boundaries (i.e. sentence final
punctuation followed by a capital letter). This procedure is not
so straightforward as it might seem, particularly because of the
ambiguity of full stops (which can be abbreviation marks as well
as sentence-demarcators) and of capital letters (which can signal
a naming expression, as well as the beginning of a sentence).
Faults in tokenization occasionally occur, but hardly ever cause
tagging errors.
In tokenization, an orthographic word boundary (normally a space,
with or without accompanying punctuation) is the default test
for identifying the beginning and end of word-tokens. (See, however,
the next paragraph and d below.) Hyphens are counted as
word-internal, so that a hyphenated word such as key-ring
is given just one tag (NN1 - 'singular common noun'). Because
of the different ways of writing compound words, the same compound
may occur in three forms: as a single word written 'solid' (markup),
as a hyphenated word (mark-up) or as a sequence of two
words (mark up). In the first two cases, CLAWS4 will give
the compound a single tag, whereas in the third case, it will
receive two tags: one for mark and the other for up.
A set of special cases dealt with by tokenization is the set of
enclitic verb and negative contractions such as 's, 're, 'll
and n't, which are orthographically attached to the preceding
word. These will each be given a tag of their own, so that (for
example) the orthographic forms It's, they're, and
can't are given two tags in sequence: pronoun + verb, verb
+ negative, etc. There are also some 'merged' forms such as won't,
gimme and dunno, which are decomposed into more than
one word for tagging purposes. For example, dunno actually
ends up with the three tags for do + n't + know.
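As a rough illustration, the enclitic-splitting step of tokenization might be sketched as follows. The contraction and merged-form tables here are invented examples for illustration, not CLAWS4's actual lists:

```python
# Sketch of enclitic splitting in tokenization. The ENCLITICS and
# MERGED tables are illustrative, not CLAWS4's real word lists.

ENCLITICS = ("n't", "'s", "'re", "'ll", "'ve", "'d", "'m")
MERGED = {
    "won't": ["will", "n't"],
    "can't": ["can", "n't"],
    "dunno": ["do", "n't", "know"],
    "gimme": ["give", "me"],
}

def split_token(word):
    """Split one orthographic word into the tokens that receive tags."""
    low = word.lower()
    if low in MERGED:                       # 'merged' forms decompose fully
        return MERGED[low]
    for enc in ENCLITICS:                   # detach enclitic contractions
        if low.endswith(enc) and len(low) > len(enc):
            return [word[:-len(enc)], word[-len(enc):]]
    return [word]

def tokenize(text):
    """Whitespace tokenization followed by enclitic splitting."""
    tokens = []
    for word in text.split():
        tokens.extend(split_token(word))
    return tokens
```

On this sketch, It's yields two tokens (It + 's) and dunno three (do + n't + know), each of which then receives its own tag.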
b. Initial assignment of tags
The second stage of CLAWS4 tagging is to assign to each word token
one or more tags. Many word tokens are unambiguous, and so will
be assigned just one tag: e.g. various JJ (adjective).
Other word tokens are ambiguous, taking from two to seven potential
tags. For example, the token paint can be tagged NN1, VV0,
VVI, i.e. as a noun or as a verb (either present tense or infinitive);
the token broadcast can be tagged as VV0, VVI, VVD, VVN
(verb which is either present tense, infinitive, past tense, or
past participle). In addition, it can be a noun (NN1) or an adjective
(JJ), as in a broadcast concert.
To find the list of potential tags associated with a word, CLAWS
first looks up the word in a lexicon of c.50,000 word entries.
This lexicon look-up accounts for a large proportion of the word
tokens in a text. However, many rarer words or names will not
be found in the lexicon, and are tagged by other test procedures.
Some of the other procedures are:
When a word is associated with more than one tag, information
is given by the lexicon look-up or other procedures on the relative
probability of each tag. For example, the word for can
be a preposition or a conjunction, but is much more likely to
be a preposition. This information is provided by the lexicon,
either in numerical form or, where the available quantitative data
are insufficient, by a simple distinction between 'unmarked', 'rare'
and 'very rare' tags.
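A minimal sketch of this look-up stage follows, assuming an invented three-word lexicon and invented numeric weights for the 'unmarked'/'rare'/'very rare' distinction (the real lexicon has c.50,000 entries and different values):

```python
# Sketch of initial tag assignment by lexicon look-up. The entries and
# the weights attached to the rarity labels are invented for illustration.

RARITY = {"unmarked": 1.0, "rare": 0.1, "very rare": 0.01}

LEXICON = {
    "various": {"JJ": 1.0},                               # unambiguous
    "paint":   {"NN1": 0.7, "VV0": 0.2, "VVI": 0.1},      # noun or verb
    "for":     {"IF": RARITY["unmarked"],                 # preposition
                "CS": RARITY["rare"]},                    # conjunction (rarer)
}

def assign_tags(token):
    """Return the candidate tags for a token with normalised probabilities."""
    entry = LEXICON.get(token.lower())
    if entry is None:
        # Unknown words would be handled by other procedures
        # (not modelled here); default to common noun as a placeholder.
        return {"NN1": 1.0}
    total = sum(entry.values())
    return {tag: weight / total for tag, weight in entry.items()}
```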
Some adjustment of probability is made according to the position
of the word in the sentence. If a word begins with a capital,
the likelihood of various tags depends partly on whether it occurs
at the beginning of a sentence. For instance, the word Brown
at the beginning of a sentence is less likely to be a proper noun
than an adjective or a common noun (normally written brown).
Hence the likelihood of a proper noun tag being assigned is reduced
at the beginning of a sentence.
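The positional adjustment just described might be sketched like this; the down-weighting factor of 0.5 is an invented value, not the one CLAWS4 uses:

```python
# Sketch of the sentence-position adjustment: the proper-noun reading
# (NP1) of a capitalised word is down-weighted when the word is
# sentence-initial, since the capital may merely mark the sentence start.
# The factor 0.5 is invented for illustration.

def adjust_for_position(tag_probs, word, sentence_initial):
    """Return tag probabilities adjusted for sentence position."""
    probs = dict(tag_probs)
    if sentence_initial and word[:1].isupper() and "NP1" in probs:
        probs["NP1"] *= 0.5
        total = sum(probs.values())
        probs = {tag: p / total for tag, p in probs.items()}
    return probs
```

So for sentence-initial Brown, the common-noun or adjective readings gain relative to the proper-noun reading.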
c. Tag selection (or disambiguation)
The next stage, logically, is to choose the most probable tag
from any ambiguous set of tags associated with a word token by
tag assignment (but see d below). This is another probabilistic
procedure, this time making use of the context in which a word
occurs. A method known as the Viterbi algorithm uses the probabilistic
estimates available, both in terms of the tag-word associations
and the sequential tag-tag likelihoods, to calculate the most
likely path through the sequence of tag ambiguities. (The model
employed is largely equivalent to a hidden Markov model.) After
tag selection, there is a single 'winning tag' for each word token
in a text. However, this is not necessarily the right answer.
If the CLAWS tagging stopped at this point, only c.95-96% of the
word-tokens would be correctly tagged. This is the main reason
for including an additional stage (or rather a set of stages)
termed 'idiomtagging'.
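The most-likely-path calculation can be sketched as a minimal Viterbi decoder over the tag ambiguities. The emission and transition probabilities below are toy values, not CLAWS4's:

```python
# Minimal Viterbi decoding over a sequence of tag ambiguities, in the
# spirit of the tag-selection stage. All probabilities are toy values.

def viterbi(words, candidates, trans, start):
    """candidates: word -> {tag: P(word|tag)};
    trans: (t1, t2) -> P(t2|t1); start: tag -> initial probability.
    Returns the most likely tag sequence."""
    # Best score and path ending in each tag of the first word.
    best = {tag: (start.get(tag, 1e-6) * p, [tag])
            for tag, p in candidates[words[0]].items()}
    for word in words[1:]:
        new = {}
        for tag, emit in candidates[word].items():
            # Extend the best previous path into this tag.
            score, path = max(
                (prev_score * trans.get((prev_tag, tag), 1e-6) * emit,
                 prev_path + [tag])
                for prev_tag, (prev_score, prev_path) in best.items())
            new[tag] = (score, path)
        best = new
    return max(best.values())[1]
```

With a determiner before it, the noun reading of an ambiguous word like paint wins over the verb readings, because the determiner-to-noun transition dominates.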
d. Idiomtagging
Idiomtagging is a stage of CLAWS4's operation in which sequences
of words and tags are matched against a 'template'. Depending
on the match, the tags may be disambiguated or corrected. In practice,
there are two main reasons for idiomtagging:
Idiomtagging is a matching procedure which operates on lists of
rules which might loosely be termed 'idioms'. Among these are:
The idiomtagging component of CLAWS is quite powerful in matching
'template' expressions in which there are wild-card symbols, Boolean
operators and gaps of up to n words. These templates are much more
variable than 'idioms' in the ordinary sense, and resemble
finite-state networks in specifying a 'recognition grammar' for
identifying any string of words+tags matching a set of conditions.
Another important point about idiomtagging is that it is split
into two phases which operate at different points in the tagging
system. One part of the idiomtagging takes place at the end of
Stage c, in effect retrospectively correcting some of the
errors which would otherwise occur in CLAWS output. Another part,
however, actually takes place between Stages b and
c. This means it can utilise ambiguous input and also
produce ambiguous output, perhaps adjusting the likelihood of
one tag relative to another. As an example, consider the case
of so long as, which can be a single grammatical item
- a conditional conjunction meaning 'provided that'. The difficulty
is that so long as can also be a sequence of three separate
grammatical items: degree adverb + adjective/adverb + conjunction.
In this case, the tagging ambiguity belongs to a whole word sequence
rather than a single word, and the output of the idiomtagging
has to be passed on to the probabilistic tag selection stage.
Hence, although we have called idiomtagging 'Stage d',
it is actually split between two stages, one preceding c
and one following c.
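The core of template matching can be sketched as below; the rule format is invented for illustration, and real CLAWS4 templates are richer (wild-cards, Boolean operators and bounded gaps):

```python
# Sketch of idiomtag template matching. A template is a list of word
# patterns; '*' is a wild-card matching any single token. This invented
# format only hints at the real recognition grammar.

def match_template(tokens, template):
    """Return the start indices at which the template matches."""
    hits = []
    n = len(template)
    for i in range(len(tokens) - n + 1):
        if all(pat == "*" or tok.lower() == pat
               for tok, pat in zip(tokens[i:i + n], template)):
            hits.append(i)
    return hits
```

A hit on a template like ["so", "long", "as"] would let the idiomtagger re-weight the single-conjunction reading against the three-word reading, leaving both as ambiguous output for the probabilistic selection stage.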
e. Post-processing
When the text emerges from Stages c and d, each
word has an associated set of one or more tags, and each tag
is itself associated with a probability represented as a percentage.
An example is:
entering VVG 86% NN1 14% JJ 0%
Clearly VVG (-ing participle of the verb enter)
is judged by CLAWS4 to be the most likely tag in this case.
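Reading an ambiguity line of the kind shown above and picking the winning tag amounts to the following sketch (the line format assumed is simply the one illustrated in the text):

```python
# Sketch of reading a percentage-annotated ambiguity line, such as
# "entering VVG 86% NN1 14% JJ 0%", and selecting the winning tag.

def winning_tag(line):
    """Return (word, most_probable_tag) from an ambiguity line."""
    parts = line.split()
    word, pairs = parts[0], parts[1:]
    scored = {pairs[i]: int(pairs[i + 1].rstrip("%"))
              for i in range(0, len(pairs), 2)}
    return word, max(scored, key=scored.get)
```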
The post-processing phase is designed to produce output in the
form that the user will find most usable. The output for
the BNC is as follows:
For the BNC Sampler Corpus, the tags have undergone a further
stage of manual editing, so that erroneous tags have been corrected.
Reference:
Garside, R. and N. Smith, 'A hybrid grammatical tagger: CLAWS4', in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora, London: Longman (1997), pp.102-121.