The Automatic Tagging of the British National Corpus (Information to be used with the BNC Sampler Corpus)
Geoffrey Leech and Nicholas Smith
UCREL, Lancaster University, 1998
The BNC2 was automatically tagged using CLAWS4, a much
improved version of the CLAWS1 automatic tagger (developed in 1983
by Roger Garside, Ian Marshall, Eric Atwell and Geoffrey Leech),
which was used to tag the LOB Corpus. The advanced version CLAWS4 is principally
the work of Roger Garside, although many other researchers at
Lancaster have contributed to its performance in one way or another.
Further information about CLAWS4 can be obtained from Garside
and Smith (1997).
CLAWS4 is a hybrid tagger, employing a mixture of probabilistic
and non-probabilistic techniques. It assigns a tag to a word as
a result of five main processes:
a. Tokenization
The first major step in automatic tagging is to divide up the
text or corpus to be tagged into individual (1) word tokens and
(2) orthographic sentences. These are the segments usually demarcated
by (1) spaces and (2) sentence boundaries (i.e. sentence final
punctuation followed by a capital letter). This procedure is not
so straightforward as it might seem, particularly because of the
ambiguity of full stops (which can be abbreviation marks as well
as sentence-demarcators) and of capital letters (which can signal
a naming expression, as well as the beginning of a sentence).
Faults in tokenization occasionally occur, but hardly ever cause
tagging errors.
In tokenization, an orthographic word boundary (normally a space,
with or without accompanying punctuation) is the default test
for identifying the beginning and end of word-tokens. (See, however,
the next paragraph and d below.) Hyphens are counted as
word-internal, so that a hyphenated word such as key-ring
is given just one tag (NN1 - 'singular common noun'). Because
of the different ways of writing compound words, the same compound
may occur in three forms: as a single word written 'solid' (markup),
as a hyphenated word (mark-up) or as a sequence of two
words (mark up). In the first two cases, CLAWS4 will give
the compound a single tag, whereas in the third case, it will
receive two tags: one for mark and the other for up.
A set of special cases dealt with by tokenization is the set of
enclitic verb and negative contractions such as 's, 're, 'll
and n't, which are orthographically attached to the preceding
word. These will each be given a tag of their own, so that (for
example) the orthographic forms It's, they're, and
can't are given two tags in sequence: pronoun + verb, verb
+ negative, etc. There are also some 'merged' forms such as won't,
gimme and dunno, which are decomposed into more than
one word for tagging purposes. For example, dunno actually
ends up with the three tags for do + n't + know.
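As a rough illustration, the enclitic-splitting step of tokenization might be sketched as follows. The contraction and merged-form tables here are invented examples for illustration, not CLAWS4's actual lists:

```python
# Sketch of enclitic splitting in tokenization. The ENCLITICS and
# MERGED tables are illustrative, not CLAWS4's real word lists.

ENCLITICS = ("n't", "'s", "'re", "'ll", "'ve", "'d", "'m")
MERGED = {
    "won't": ["will", "n't"],
    "can't": ["can", "n't"],
    "dunno": ["do", "n't", "know"],
    "gimme": ["give", "me"],
}

def split_token(word):
    """Split one orthographic word into the tokens that receive tags."""
    low = word.lower()
    if low in MERGED:                       # 'merged' forms decompose fully
        return MERGED[low]
    for enc in ENCLITICS:                   # detach enclitic contractions
        if low.endswith(enc) and len(low) > len(enc):
            return [word[:-len(enc)], word[-len(enc):]]
    return [word]

def tokenize(text):
    """Whitespace tokenization followed by enclitic splitting."""
    tokens = []
    for word in text.split():
        tokens.extend(split_token(word))
    return tokens
```

On this sketch, It's yields two tokens (It + 's) and dunno three (do + n't + know), each of which then receives its own tag.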
b. Initial assignment of tags
The second stage of CLAWS4 tagging is to assign to each word token
one or more tags. Many word tokens are unambiguous, and so will
be assigned just one tag: e.g. various JJ (adjective).
Other word tokens are ambiguous, taking from two to seven potential
tags. For example, the token paint can be tagged NN1, VV0,
VVI, i.e. as a noun or as a verb (either present tense or infinitive);
the token broadcast can be tagged as VV0, VVI, VVD, VVN
(verb which is either present tense, infinitive, past tense, or
past participle). In addition, it can be a noun (NN1) or an adjective
(JJ), as in a broadcast concert.
To find the list of potential tags associated with a word, CLAWS
first looks up the word in a lexicon of c.50,000 word entries.
This lexicon look-up accounts for a large proportion of the word
tokens in a text. However, many rarer words or names will not
be found in the lexicon, and are tagged by other test procedures.
Some of the other procedures are:
When a word is associated with more than one tag, information
is given by the lexicon look-up or other procedures on the relative
probability of each tag. For example, the word for can
be a preposition or a conjunction, but is much more likely to
be a preposition. This information is provided by the lexicon,
either in numerical form or, where the available quantitative data
are insufficient, by a simple distinction between 'unmarked', 'rare'
and 'very rare' tags.
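A minimal sketch of this look-up stage follows, assuming an invented three-word lexicon and invented numeric weights for the 'unmarked'/'rare'/'very rare' distinction (the real lexicon has c.50,000 entries and different values):

```python
# Sketch of initial tag assignment by lexicon look-up. The entries and
# the weights attached to the rarity labels are invented for illustration.

RARITY = {"unmarked": 1.0, "rare": 0.1, "very rare": 0.01}

LEXICON = {
    "various": {"JJ": 1.0},                               # unambiguous
    "paint":   {"NN1": 0.7, "VV0": 0.2, "VVI": 0.1},      # noun or verb
    "for":     {"IF": RARITY["unmarked"],                 # preposition
                "CS": RARITY["rare"]},                    # conjunction (rarer)
}

def assign_tags(token):
    """Return the candidate tags for a token with normalised probabilities."""
    entry = LEXICON.get(token.lower())
    if entry is None:
        # Unknown words would be handled by other procedures
        # (not modelled here); default to common noun as a placeholder.
        return {"NN1": 1.0}
    total = sum(entry.values())
    return {tag: weight / total for tag, weight in entry.items()}
```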
Some adjustment of probability is made according to the position
of the word in the sentence. If a word begins with a capital,
the likelihood of various tags depends partly on whether it occurs
at the beginning of a sentence. For instance, the word Brown
at the beginning of a sentence is less likely to be a proper noun
than an adjective or a common noun (normally written brown).
Hence the likelihood of a proper noun tag being assigned is reduced
at the beginning of a sentence.
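The positional adjustment just described might be sketched like this; the down-weighting factor of 0.5 is an invented value, not the one CLAWS4 uses:

```python
# Sketch of the sentence-position adjustment: the proper-noun reading
# (NP1) of a capitalised word is down-weighted when the word is
# sentence-initial, since the capital may merely mark the sentence start.
# The factor 0.5 is invented for illustration.

def adjust_for_position(tag_probs, word, sentence_initial):
    """Return tag probabilities adjusted for sentence position."""
    probs = dict(tag_probs)
    if sentence_initial and word[:1].isupper() and "NP1" in probs:
        probs["NP1"] *= 0.5
        total = sum(probs.values())
        probs = {tag: p / total for tag, p in probs.items()}
    return probs
```

So for sentence-initial Brown, the common-noun or adjective readings gain relative to the proper-noun reading.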
c. Tag selection (or disambiguation)
The next stage, logically, is to choose the most probable tag
from any ambiguous set of tags associated with a word token by
tag assignment (but see d below). This is another probabilistic
procedure, this time making use of the context in which a word
occurs. A method known as the Viterbi algorithm uses the probabilistic
estimates available, both in terms of the tag-word associations
and the sequential tag-tag likelihoods, to calculate the most
likely path through the sequence of tag ambiguities. (The model
employed is largely equivalent to a hidden Markov model.) After
tag selection, there is a single 'winning tag' for each word token
in a text. However, this is not necessarily the right answer.
If the CLAWS tagging stopped at this point, only c.95-96% of the
word-tokens would be correctly tagged. This is the main reason
for including an additional stage (or rather a set of stages)
termed 'idiomtagging'.
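The most-likely-path calculation can be sketched as a minimal Viterbi decoder over the tag ambiguities. The emission and transition probabilities below are toy values, not CLAWS4's:

```python
# Minimal Viterbi decoding over a sequence of tag ambiguities, in the
# spirit of the tag-selection stage. All probabilities are toy values.

def viterbi(words, candidates, trans, start):
    """candidates: word -> {tag: P(word|tag)};
    trans: (t1, t2) -> P(t2|t1); start: tag -> initial probability.
    Returns the most likely tag sequence."""
    # Best score and path ending in each tag of the first word.
    best = {tag: (start.get(tag, 1e-6) * p, [tag])
            for tag, p in candidates[words[0]].items()}
    for word in words[1:]:
        new = {}
        for tag, emit in candidates[word].items():
            # Extend the best previous path into this tag.
            score, path = max(
                (prev_score * trans.get((prev_tag, tag), 1e-6) * emit,
                 prev_path + [tag])
                for prev_tag, (prev_score, prev_path) in best.items())
            new[tag] = (score, path)
        best = new
    return max(best.values())[1]
```

With a determiner before it, the noun reading of an ambiguous word like paint wins over the verb readings, because the determiner-to-noun transition dominates.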
d. Idiomtagging
Idiomtagging is a stage of CLAWS4's operation in which sequences
of words and tags are matched against a 'template'. Depending
on the match, the tags may be disambiguated or corrected. In practice,
there are two main reasons for idiomtagging:
Idiomtagging is a matching procedure which operates on lists of
rules which might loosely be termed 'idioms'. Among these are:
The idiomtagging component of CLAWS is quite powerful in matching
'template' expressions in which there are wild-card symbols, Boolean
operators and gaps of up to n words. These templates are much more
variable than 'idioms' in the ordinary sense, and resemble
finite-state networks in specifying a 'recognition grammar' for
identifying any string of words+tags matching a set of conditions.
Another important point about idiomtagging is that it is split
into two phases which operate at different points in the tagging
system. One part of the idiomtagging takes place at the end of
Stage c, in effect retrospectively correcting some of the
errors which would otherwise occur in CLAWS output. Another part,
however, actually takes place between Stages b and
c. This means it can utilise ambiguous input and also
produce ambiguous output, perhaps adjusting the likelihood of
one tag relative to another. As an example, consider the case
of so long as, which can be a single grammatical item
- a conditional conjunction meaning 'provided that'. The difficulty
is that so long as can also be a sequence of three separate
grammatical items: degree adverb + adjective/adverb + conjunction.
In this case, the tagging ambiguity belongs to a whole word sequence
rather than a single word, and the output of the idiomtagging
has to be passed on to the probabilistic tag selection stage.
Hence, although we have called idiomtagging 'Stage d',
it is actually split between two stages, one preceding c
and one following c.
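The core of template matching can be sketched as below; the rule format is invented for illustration, and real CLAWS4 templates are richer (wild-cards, Boolean operators and bounded gaps):

```python
# Sketch of idiomtag template matching. A template is a list of word
# patterns; '*' is a wild-card matching any single token. This invented
# format only hints at the real recognition grammar.

def match_template(tokens, template):
    """Return the start indices at which the template matches."""
    hits = []
    n = len(template)
    for i in range(len(tokens) - n + 1):
        if all(pat == "*" or tok.lower() == pat
               for tok, pat in zip(tokens[i:i + n], template)):
            hits.append(i)
    return hits
```

A hit on a template like ["so", "long", "as"] would let the idiomtagger re-weight the single-conjunction reading against the three-word reading, leaving both as ambiguous output for the probabilistic selection stage.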
e. Post-processing
When the text emerges from Stages c and d, each
word has an associated set of one or more tags, and each tag
is itself associated with a probability represented as a percentage.
An example is:
entering VVG 86% NN1 14% JJ 0%
Clearly VVG (-ing participle of the verb enter)
is judged by CLAWS4 to be the most likely tag in this case.
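Reading an ambiguity line of the kind shown above and picking the winning tag amounts to the following sketch (the line format assumed is simply the one illustrated in the text):

```python
# Sketch of reading a percentage-annotated ambiguity line, such as
# "entering VVG 86% NN1 14% JJ 0%", and selecting the winning tag.

def winning_tag(line):
    """Return (word, most_probable_tag) from an ambiguity line."""
    parts = line.split()
    word, pairs = parts[0], parts[1:]
    scored = {pairs[i]: int(pairs[i + 1].rstrip("%"))
              for i in range(0, len(pairs), 2)}
    return word, max(scored, key=scored.get)
```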
The post-processing phase is designed to produce output in the
form that the user will find most usable. The output for
the BNC is as follows:
For the BNC Sampler Corpus, the tags have undergone a further
stage of manual editing, so that erroneous tags have been corrected.
Reference:
Garside, R. and N. Smith, 'A hybrid grammatical tagger: CLAWS4', in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Text Corpora, London: Longman (1997), pp.102-121.