Wordclass Tagging in BNC XML
This section is a revised version of the Manual to accompany
The British National Corpus (Version 2) with Improved Word-class
Tagging by Geoffrey Leech and Nicholas Smith, originally
distributed in HTML form with the BNC World Edition and available
from the BNC Website.
Introduction
The wordclass tagging (the terms "POS-tagging" and
"wordclass tagging" are used interchangeably in this manual)
has not changed significantly between the BNC World edition (2001) and
the BNC XML edition (2006). In particular, no attempt has been made to
retag the corpus completely, desirable though this might be. Changes
have been made in the treatment of multiword units, and some additional
annotation has been provided (see below), but
in most respects the wordclass information provided by the corpus
is now identical to that provided with the first release of the BNC in
1994.
The BNC is wordclass-tagged using a set of 57 tags (known as C5)
which we refer to as the "BNC Basic Tagset". (There are also 4
punctuation tags, excluded from consideration here.) Each C5 tag
identifies a grammatical class of words and is expressed as a
three-character code, such as NN1 for
"singular common noun". The codes are, in many cases, mnemonic.
The BNC, consisting of c.100 million words, was tagged
automatically, using the CLAWS4 automatic tagger developed by Roger
Garside at Lancaster, and a second program, known as Template Tagger,
developed by Mike Pacey and Steve Fligelstone. Further details are given below, and also in Garside
and Leech 1997, chapters 7-9. With such a large corpus, there was no opportunity to
undertake post-editing, i.e. disambiguation and
correction of tagging errors produced by the automatic tagger, and so
the errors (about 1.15 per cent of all words) remain. (The only
exceptions to this statement are: (i) the file F9M, which contains the
Rap poetry "City Psalms" by Benjamin Zephaniah, and which was thoroughly
hand-corrected because the tagger, not being familiar with Jamaican
Creole, had produced an inordinate number of tagging errors; and (ii)
files identified as containing many foreign and classical expressions,
as mentioned below.) In addition,
the corpus contains ambiguous taggings (c.3.75 per cent of all words),
shown in the form of ambiguity tags (also called 'portmanteau tags'),
consisting of two C5 tags linked by a hyphen:
e.g. VVD-VVN. These tags indicate that the automatic
tagger was unable to determine, with sufficient confidence, which was
the correct category, and so left two possibilities for users to
disambiguate themselves, if they should wish to do so. For example, in
the case of VVD-VVN, the first (more likely) tag, say for a word such as
wanted, is VVD: past tense of lexical verb; and the second (less
likely) tag is VVN: past participle of lexical verb. On the whole,
the likelihood of the first tag of an ambiguity tag being correct is
better than 3 to 1 (see, however, the details for individual tags in the
error report document).
After the automatic tagging, some manual tagging was undertaken to
correct some particularly blatant errors, mainly foreign or classical
words embedded in English text. CLAWS is not very successful at
detecting these foreign words and tagging them with their appropriate
tag (UNC), except when they form part of established expressions such
as ad hoc or nom de plume - in which case they are
normally given tags appropriate to their grammatical function, e.g. as
nouns or adverbs. The main purpose of the report on estimated error rates is to
document the rather small percentage of ambiguities and errors
remaining in the tagged BNC, so that users of the corpus can assess
the accuracy of the tagging for their own purposes. Since, not
surprisingly, we have been unable to inspect each of the 100 million
tags in the BNC, we have had to estimate ambiguity rates and error
rates on the basis of a manual post-editing of a corpus sample of
50,000 words. The estimate is based on twenty-four 2,000-word text
extracts and two 1,000-word extracts, selected so as to be as far as
possible representative of the whole corpus.
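To make these figures concrete, here is a minimal Python sketch (ours, not part of the original manual; the variable names are illustrative) which checks the sample size quoted above and extrapolates the sample-based rates to the whole corpus:

# Sample composition and extrapolation, using the figures quoted above.
sample_words = 24 * 2000 + 2 * 1000          # twenty-four 2,000-word and two 1,000-word extracts
assert sample_words == 50_000                # the post-edited sample

corpus_words = 100_000_000                   # the BNC is c.100 million words
error_rate = 0.0115                          # c.1.15 per cent of words mis-tagged
ambiguity_rate = 0.0375                      # c.3.75 per cent carry ambiguity tags

print(round(corpus_words * error_rate))      # c.1,150,000 residual errors
print(round(corpus_words * ambiguity_rate))  # c.3,750,000 ambiguity tags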
Tokenization: splitting the text into words
Regarding the segmentation of a text into individual word-tokens
(called tokenization), our tagging practice in general follows the
default assumption that an orthographic word (separated by spaces from adjacent words,
with or without punctuation) is the appropriate
unit for wordclass tagging. There are, however, exceptions to
this. For example, a single orthographic word may consist of more
than one grammatical word: in the case of enclitic verb contractions
(as in she’s, they’ll, we’re) and negative contractions (as
in don’t, isn’t, won’t), it is appropriate to assign two
different wordclass tags to the same orthographic word. A full list
of such contracted forms recognized by CLAWS and preserved in the XML
markup is given in section .
Also quite frequent is the opposite circumstance, where two or more
orthographic words are given a single wordclass tag: e.g. multiword
adverbs such as of course and in short, and
multiword prepositions such as instead of and up to
are each assigned a single word tag (AV0 for adverbs,
PRP for prepositions). Sometimes, whether such
orthographic sequences are to be treated as a single word for tagging
purposes depends on the context and its interpretation. In
short is in some circumstances not an adverb but a sequence of
preposition + adjective (e.g. in short sharp bursts). Up
to in some contexts needs to be treated as a sequence of two
grammatical words: adverbial particle +
preposition-or-infinitive-marker (e.g. We had to phone her up to
get the code.).
In the BNC XML edition, these multiword
units are marked using an additional XML element (mw) which
carries the wordclass assigned to the whole sequence. Within the
mw element, the individual orthographic words are also
marked, using the w element in the same way as elsewhere. For
example, the multiword unit of course is marked up as
follows:
<mw c5="AV0">
<w c5="PRF" hw="of" pos="PREP">of </w>
<w c5="NN1" hw="course" pos="SUBST">course </w>
</mw>
Wordclass tags for the constituent parts of multiword
units were automatically inserted using a fixed lookup table (reproduced
elsewhere in this manual); there may therefore be residual
errors in their usage.
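For users who process the corpus programmatically, the following Python sketch (ours; it assumes only the element and attribute names shown in the example above) extracts a multiword unit and its constituent words using the standard library:

import xml.etree.ElementTree as ET

# The <mw> fragment shown above, as a parseable string.
fragment = '''<mw c5="AV0">
<w c5="PRF" hw="of" pos="PREP">of </w>
<w c5="NN1" hw="course" pos="SUBST">course </w>
</mw>'''

mw = ET.fromstring(fragment)
print("multiword tag:", mw.get("c5"))   # AV0 for the whole sequence
for w in mw.findall("w"):
    # Each orthographic word keeps its own C5 tag inside the multiword.
    print(w.text.strip(), w.get("c5"), w.get("hw"))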
In one respect, we have allowed the orthographic occurrence of
spaces to be criterial. This is in the tagging of compound words such
as markup, mark-up and mark up. Since English
orthographic practice is often variable in such matters, the same
‘compound’ expression may occur in the corpus tagged as two words (if
they are separated by spaces) or as one word (if the sequence is
printed solid or with a hyphen). Thus mark up (as a noun)
will be tagged NN1 AVP, whereas markup or
mark-up will be tagged simply NN1.
Tagging Guidelines and Borderline Cases
Many detailed decisions have to be made about how
to draw the line between the correct and the incorrect assignment
of a tag. So that the concept of what is a 'correct' or 'accurate'
annotation can be determined, there have to be detailed guidelines of
tagging practice. These constitute the Wordclass Tagging Guidelines.
The
Guidelines have to give much attention to borderline phenomena, where
the distinction between (say) an adjective and a verb participle in
-ing is unclear, and to clarifying the criteria for differentiating them.
To promote consistency of tagging practice, the guidelines may even
impose somewhat arbitrary dividing lines between one word class and
another. Consider the case of a word such as rising, which may be a
present participle form of a verb (VVG), an adjective (AJ0) or a singular
common noun (NN1). The difference may be illustrated by the three
examples:
Oil prices are rising again. (verb, VVG)
the rising sun (adjective, AJ0)
the attempted rising was put down (noun, NN1)
The assignment of an example of ‘Verb+ing’ to the adjective
category relies heavily on a semantic criterion, viz. the ability to
paraphrase Verb+ing Noun by ‘Noun + Relative Clause that/which/who be
Verb+ing’ or ‘that/which/who Verb(s)’ (e.g. the
rising sun = the sun which is/was rising; a
working mother = a mother who works). These contrast with a case
such as dining table, where the first word dining is judged to be a noun. The reason for this
is that the paraphrasable meaning of the expression is not ‘a table
which is/was dining or dines’, but rather ‘a table (used) for
dining’. Although somewhat arbitrary, this relative clause test is
well established in English grammatical literature, and such criteria
are useful in enabling a reasonable degree of consistency in tagging
practice to be achieved, so that the success rate of corpus tagging
can be checked and evaluated.
It also has to be recognized that some borderline cases may
occasionally have to be considered unresolvable. We may conclude, for
example, that the word Hatching (occurring as
a heading on its own, without any syntactic context) could be equally
well analysed VVG or NN1, and in such a case one would
be tempted to leave the ambiguity (VVG-NN1) in the
corpus, showing uncertainty where any grammarian would be likely to
acknowledge it. However, in our calculations of ambiguity, we have
adhered to the common assumption that ideally, all tags should be
correctly disambiguated. Other examples of unresolvability from the
sample texts are:
the importance of weaving in the East (verb or noun? - VVG-NN1)
Armed with the knowledge (past participle verb or adjective? - VVN-AJ0)
the Lord is my shepherd (common noun or proper noun? - NN1-NP0)
In practice, in our post-edited sample, we treated the first tag as correct in these cases.
Ambiguity tags, and the principle of asymmetry
As in the first version of the BNC, we have introduced only a limited number of ambiguity tags, to deal with
particular cases where the tagger has difficulty in distinguishing two categories, and where incorrect taggings
would otherwise result rather frequently. Ambiguity tags involve only the following 18 wordclass labels, and each ambiguity tag combines exactly two of them:
AJ0 general adjective (positive)
AV0 general adverb
AVP adverbial particle
AVQ wh- adverb
CJS general subordinator
CJT subordinator: that
CRD cardinal numeral
DT0 determiner-pronoun
NN1 singular common noun
NN2 plural common noun
NP0 proper noun
PNI indefinite pronoun
PRP general preposition
VVB lexical verb: finite base form
VVD lexical verb: past tense
VVG lexical verb: present participle (-ing form)
VVN lexical verb: past participle
VVZ lexical verb: -s form
The permitted ambiguity tags are listed in the Wordclass Tagging
Guidelines (see the Ambiguity Tag list below).
It will be noted that overall 30 ambiguity tags are recognized. We
also observe that each ambiguity tag (eg VVD-VVN) is
matched by another ambiguity tag which is its mirror image (eg
VVN-VVD). The ordering of tags is significant: it is the
first of the two tags which is estimated by the tagger to be the more
likely. Hence the interpretation of an ambiguity tag X-Y may be
expressed as follows: There is not sufficient confidence to choose
between tags X and Y; however, X is considered to be more likely.
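This ordering convention is straightforward to exploit programmatically. A minimal Python sketch (ours; the function name is illustrative) splits a C5 value into its preferred and alternative readings:

def split_ambiguity_tag(c5):
    # Split a C5 value such as 'VVD-VVN' into (preferred, alternative).
    # The first tag is the one the tagger considers more likely, as
    # described above; a plain tag comes back with no alternative.
    preferred, sep, alternative = c5.partition("-")
    return (preferred, alternative if sep else None)

print(split_ambiguity_tag("VVD-VVN"))   # ('VVD', 'VVN'): VVD is more likely
print(split_ambiguity_tag("NN1"))       # ('NN1', None): unambiguous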
Guidelines to the Wordclass Tagging
Preliminaries
The BNC basic tagset
For completeness, we begin by listing the C5 tagset used throughout
the BNC, followed by the ambiguity codes used:
Tag   Description
AJ0   Adjective (general or positive) (e.g. good, old, beautiful)
AJC   Comparative adjective (e.g. better, older)
AJS   Superlative adjective (e.g. best, oldest)
AT0   Article (e.g. the, a, an, no)
AV0   General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest)
AVP   Adverb particle (e.g. up, off, out)
AVQ   Wh-adverb (e.g. when, where, how, why, wherever)
CJC   Coordinating conjunction (e.g. and, or, but)
CJS   Subordinating conjunction (e.g. although, when)
CJT   The subordinating conjunction that
CRD   Cardinal number (e.g. one, 3, fifty-five, 3609)
DPS   Possessive determiner-pronoun (e.g. your, their, his)
DT0   General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0
DTQ   Wh-determiner-pronoun (e.g. which, what, whose, whichever)
EX0   Existential there, i.e. there occurring in the there is ... or there are ... construction
ITJ   Interjection or other isolate (e.g. oh, yes, mhm, wow)
NN0   Common noun, neutral for number (e.g. aircraft, data, committee)
NN1   Singular common noun (e.g. pencil, goose, time, revelation)
NN2   Plural common noun (e.g. pencils, geese, times, revelations)
NP0   Proper noun (e.g. London, Michael, Mars, IBM)
ORD   Ordinal numeral (e.g. first, sixth, 77th, last)
PNI   Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
PNP   Personal pronoun (e.g. I, you, them, ours)
PNQ   Wh-pronoun (e.g. who, whoever, whom)
PNX   Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
POS   The possessive or genitive marker 's or '
PRF   The preposition of
PRP   Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)
PUL   Punctuation: left bracket, i.e. ( or [
PUN   Punctuation: general separating mark, i.e. . ! , : ; - or ?
PUQ   Punctuation: quotation mark, i.e. ' or "
PUR   Punctuation: right bracket, i.e. ) or ]
TO0   Infinitive marker to
UNC   Unclassified items which are not appropriately considered as items of the English lexicon
VBB   The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative]
VBD   The past tense forms of the verb BE: was and were
VBG   The -ing form of the verb BE: being
VBI   The infinitive form of the verb BE: be
VBN   The past participle form of the verb BE: been
VBZ   The -s form of the verb BE: is, 's
VDB   The finite base form of the verb DO: do
VDD   The past tense form of the verb DO: did
VDG   The -ing form of the verb DO: doing
VDI   The infinitive form of the verb DO: do
VDN   The past participle form of the verb DO: done
VDZ   The -s form of the verb DO: does, 's
VHB   The finite base form of the verb HAVE: have, 've
VHD   The past tense form of the verb HAVE: had, 'd
VHG   The -ing form of the verb HAVE: having
VHI   The infinitive form of the verb HAVE: have
VHN   The past participle form of the verb HAVE: had
VHZ   The -s form of the verb HAVE: has, 's
VM0   Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)
VVB   The finite base form of lexical verbs (e.g. forget, send, live, return) [including the imperative and present subjunctive]
VVD   The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)
VVG   The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)
VVI   The infinitive form of lexical verbs (e.g. forget, send, live, return)
VVN   The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)
VVZ   The -s form of lexical verbs (e.g. forgets, sends, lives, returns)
XX0   The negative particle not or n't
ZZ0   Alphabetical symbols (e.g. A, a, B, b, c, d)
Total number of wordclass tags in the BNC basic tagset = 57, plus 4 punctuation tags
Ambiguity Tag list
In addition, there are 30 "Ambiguity Tags". These are applied
wherever the probabilities assigned by the CLAWS
automatic tagger to its first and second choice tags were
considered too low for reliable disambiguation. So, for example, the
ambiguity tag AJ0-AV0 indicates that the choice between
adjective (AJ0) and adverb (AV0) is left
open, although the tagger has a preference for an adjective
reading. The mirror tag, AV0-AJ0, again shows
adjective-adverb ambiguity, but this time the more likely reading is
the adverb.
Ambiguity tag   Ambiguous between   More probable tag
AJ0-AV0   AJ0 or AV0   AJ0
AJ0-NN1   AJ0 or NN1   AJ0
AJ0-VVD   AJ0 or VVD   AJ0
AJ0-VVG   AJ0 or VVG   AJ0
AJ0-VVN   AJ0 or VVN   AJ0
AV0-AJ0   AV0 or AJ0   AV0
AVP-PRP   AVP or PRP   AVP
AVQ-CJS   AVQ or CJS   AVQ
CJS-AVQ   CJS or AVQ   CJS
CJS-PRP   CJS or PRP   CJS
CJT-DT0   CJT or DT0   CJT
CRD-PNI   CRD or PNI   CRD
DT0-CJT   DT0 or CJT   DT0
NN1-AJ0   NN1 or AJ0   NN1
NN1-NP0   NN1 or NP0   NN1
NN1-VVB   NN1 or VVB   NN1
NN1-VVG   NN1 or VVG   NN1
NN2-VVZ   NN2 or VVZ   NN2
NP0-NN1   NP0 or NN1   NP0
PNI-CRD   PNI or CRD   PNI
PRP-AVP   PRP or AVP   PRP
PRP-CJS   PRP or CJS   PRP
VVB-NN1   VVB or NN1   VVB
VVD-AJ0   VVD or AJ0   VVD
VVD-VVN   VVD or VVN   VVD
VVG-AJ0   VVG or AJ0   VVG
VVG-NN1   VVG or NN1   VVG
VVN-AJ0   VVN or AJ0   VVN
VVN-VVD   VVN or VVD   VVN
VVZ-NN2   VVZ or NN2   VVZ
Total number of wordclass tags including punctuation and ambiguity tags = 91.
Appearance of wordclass tags and citations
Throughout this section, we show text examples in a simplified format
which is different from the XML contained in the corpus but which
will highlight the particular tag that is being discussed. The XML
tagging (used to mark, for example, paragraph and pause markers) is
not relevant to the present discussion and will usually
not be displayed in the output from software such as concordance
generators or search engines.
As was noted above, each word in the corpus is marked by an XML
w element which provides three additional pieces of
information: the wordclass, carried by the c5
attribute; a headword or lemma derived from the word, carried by the
hw attribute; and a simplified wordclass derived from
the c5 value, carried by the pos attribute.
In the XML source therefore, we will see sentences like this:
<w c5="AV0" hw="apparently" pos="ADV">apparently </w>
<w c5="PNP" hw="we" pos="PRON">we </w>
<w c5="VVB" hw="eat" pos="VERB">eat </w>
<w c5="DT0" hw="more" pos="ADJ">more </w>
<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>
<w c5="CJS" hw="than" pos="CONJ">than </w>
<w c5="DT0" hw="any" pos="ADJ">any </w>
<w c5="AJ0" hw="other" pos="ADJ">other </w>
<w c5="NN1" hw="country" pos="SUBST">country</w>
<c c5="PUN">.</c>
For simplicity of discussion throughout this section we have chosen
not to present examples in this way, but instead to suppress the bulk
of the XML markup. Only the wordclass of the word (or words) in
question is preserved; it is placed after the word it relates to in
the example sentences. Under subordinating conjunctions, for
instance, the citation above appears as
follows:
...apparently we eat more chocolate than_CJS any other country. [G3U.1000]
This is purely as an aid to
reading the present document; in the corpus itself, all wordclass
tagging is represented using the XML conventions shown above.
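The mapping from the XML representation to this simplified format is mechanical. The following Python sketch (ours; it tags every word rather than only the word under discussion, and assumes only the w and c elements shown above) renders a tagged sentence in the word_TAG style:

import xml.etree.ElementTree as ET

# The example sentence above, abridged and wrapped in a root element.
xml = '''<s>
<w c5="AV0" hw="apparently" pos="ADV">apparently </w>
<w c5="PNP" hw="we" pos="PRON">we </w>
<w c5="VVB" hw="eat" pos="VERB">eat </w>
<w c5="NN1" hw="chocolate" pos="SUBST">chocolate</w>
<c c5="PUN">.</c>
</s>'''

s = ET.fromstring(xml)
parts = []
for el in s:
    text = (el.text or "").strip()
    # <w> elements become word_TAG; punctuation (<c>) passes through bare.
    parts.append(text + "_" + el.get("c5") if el.tag == "w" else text)
print(" ".join(parts))   # apparently_AV0 we_PNP eat_VVB chocolate_NN1 .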
As noted above, any example from the BNC can be identified by means
of the text identifier (a three character code such as G3U) and the
number of the s element within it. We use this method
throughout the following examples, where they are taken from the BNC.
Thus, the example above is taken from s-unit 1000 of text G3U. In
the two sections below, we
occasionally cite cases where the POS-tagging in the corpus does not
match the tag given in the citation, in that it is either an error or
an ambiguity tag. This is to give an idea of the contexts in which the
resolution of ambiguities has been less reliable. We list the tag
found in the corpus next to the file reference with an asterisk;
e.g. for well we give the ideal tag as
VVB, but the actual tag as AV0:
Tears well_VVB up in my eyes. [BN3.5 *AV0]
Note also that we occasionally use invented examples, rather than
corpus citations, especially where a contrast between categories is
being made.
Appearance and tagging of contracted forms
Contracted forms, including enclitics (e.g. he's, she'll),
negatives (e.g. don't and can't), and 'fused words' (e.g.
wanna and gimme), are broken down by the tagger into their
component parts, with each part being assigned its own tag. No spaces are
introduced into the text of POS-tagged contracted words (the spaces in the
notation below merely separate the tagged parts):
doesn't = does_VDZ n't_XX0
dunno = du_VDB n_XX0 no_VVI
wanna = wan_VVB na_TO0 or wan_VVB na_AT0
gimme = gim_VVB me_PNP
This procedure sometimes results in strange-looking word
divisions, particularly with the fused words. However, they do
provide a ready means of comparison with the full forms, such as
want_VVB to_TO0 and give_VVB me_PNP.
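Because no spaces are introduced between the parts, the original orthographic form can be recovered simply by concatenating the text content of consecutive w elements. A minimal Python sketch (ours; the hw values shown are assumptions consistent with the conventions described above):

import xml.etree.ElementTree as ET

# 'doesn't' split into does_VDZ + n't_XX0: no space between the parts,
# and a trailing space only after the whole orthographic word.
xml = '''<s>
<w c5="VDZ" hw="do" pos="VERB">does</w>
<w c5="XX0" hw="not" pos="ADV">n't </w>
<w c5="VVI" hw="matter" pos="VERB">matter</w>
</s>'''

s = ET.fromstring(xml)
# Joining the text content (spaces included) restores the orthography.
print("".join(w.text for w in s.iter("w")))   # doesn't matter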
Note that in the case of ain't it has been tricky to
resolve the tag of the first part (ai)
satisfactorily. Therefore in all contexts we have tagged this as an
unclassified word, followed by the negative particle:
Ai_UNC n't_XX0 got yours yet [KCT.1281]
Appearance and tagging of multiwords
The term `multiwords' denotes multiple-word combinations to which
CLAWS assigns a single wordclass tag - for example, a complex
preposition, an adverbial, or a foreign expression naturalised into
English as a compound noun. In the XML version of the corpus, these
sequences are explicitly marked using an XML element
(mw). The individual orthographic words of which the
sequence is composed are also marked, in the same way as other words,
using the w element.
For example, as noted above, in the XML source of the corpus, the multiword
sequence of course is tagged as follows:
<mw c5="AV0">
<w c5="PRF" lemma="of" pos="PREP">of </w>
<w c5="NN1" lemma="course" pos="SUBST">course </w>
</mw>
When displaying examples which contain multiwords in this chapter, we
display only the wordclass of the outermost mw element. Its
boundaries are indicated, where possible, by extra highlighting:
Of course_AV0 I can. [H9V.212]
The wordclass tags assigned to constituent parts of multiword items
are listed in the lookup table mentioned above. This part of the wordclass
tagging was done automatically during the XML conversion process, and
has not been checked by CLAWS.
Note that some multiwords can represent different categories
according to context, e.g. in between in:
The stage in between_PRP the original negative and the dupe is called an interpositive [FB8.295]
The truth lies somewhere in between_AV0 [ABK.2834]
Moreover, sometimes it is more appropriate to tag a word
combination as consisting of ordinary words than as a multiword
sequence, as in the case of but for below:
but_CJC for_PRP years now darkness has been growing [F99.2027]
cf. which they would not have done but
for_PRP the presence of the police. [H81.766]
Words joined by the slash character
Words which are joined together by a slash (/) but no whitespace, such as
and/or, are not split up in tagged versions of the text:
if they are of the same wordclass, they are assigned that tag;
if they are of different wordclasses, the whole sequence is assigned the
'unclassified' tag, UNC.
Examples:
A title and/or_CJC an author's name [H0S.358]
You should be a graduate in Electrical/Electronic_AJ0 Engineering, Physics , Mathematics , Computing or a related discipline. [CJU.1049]
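This two-case rule can be stated directly as code. A minimal Python sketch (ours; the function name is illustrative, and the tags the parts would receive on their own are taken as given):

def tag_slash_compound(part_tags):
    # Tag a slash-joined form such as and/or, given the C5 tags its
    # parts would receive individually: same wordclass -> that tag;
    # different wordclasses -> the 'unclassified' tag UNC.
    return part_tags[0] if len(set(part_tags)) == 1 else "UNC"

print(tag_slash_compound(["CJC", "CJC"]))   # CJC, as in and/or_CJC
print(tag_slash_compound(["NN1", "AJ0"]))   # UNC: mixed wordclasses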
Introduction to Word Classes
Nouns
Common nouns
Singular common nouns are tagged NN1, while plurals take NN2:
A child_NN1.
Several children_NN2
An air_NN1 of distinction_NN1
Fifteen miles_NN2 away
Nouns which are morphologically invariant for number, or which can take either a singular or plural verb (so-called
'neutral for number'), are tagged NN0:
Now the government_NN0 is considering new warnings on steroids ... [K24.3057]
... the Government_NN0 are putting people's lives in jeopardy. [A7W.518]
I caught a fish_NN0.[KBW.316]
I had caught four fish_NN0 with hardly any effort[B0P.1387]
We make no special distinction between common nouns that can be mass (or `non-count')
nouns (eg water, cheese), and other common nouns. All are tagged NN1 when
singular and NN2 when plural:
Cheese_NN1 is a protein of high biological value. [ABB.1950]
three cheeses_NN2. [CH6.7834]
A car_NN1 glistens in the distance_NN1. [HH0.1035]
Three cars_NN2, two lorries_NN2 and a motorbike_NN1! [CHR.290]
In general we try to tag abbreviations for common nouns (and other word classes) as if
they were written as full forms. Abbreviations for measurement nouns are generally
tagged NN0 as they are invariant for number.
Crewe are top of div_NN1 3 by 8 points [J1C.961] (where div = division)
1 km_NN0
400 km_NN0 (km = 'kilometre' or 'kilometres')
1 oz_NN0.
6 oz_NN0 (oz = 'ounce' or 'ounces')
Nouns such as hundred, hundreds, dozens and gross are all tagged as numbers,
CRD, rather than nouns.
Proper nouns
The tag NP0 ideally should denote any kind of proper noun, but in practice the
open-endedness of naming expressions makes it difficult to capture all possible types
consistently. We have confined its coverage mainly to personal and
geographical names, and to names of days of the week or months of the
year. Within these, some rather arbitrary borderlines have had to be drawn.
Sally_NP0 Joe_NP0 Bloggs_NP0
Madame_NP0 Pompadour_NP0
Leonardo_NP0 da_NP0 Vinci_NP0
London_NP0 Lake_NP0 Tanganyika_NP0 New_NP0 York_NP0
April_NP0
Number
Note that the distinction between singular and plural proper
nouns is not indicated in the tagset, plural proper nouns being a
comparative rarity:
John_NP0 Smith_NP0. All of the Smiths_NP0.
Multiwords
Note also that proper nouns are not processed as
multiwords (though there may be good linguistic reasons for doing
so). Each word in such a sequence gets its own tag.
Initials
A person's initials preceding a surname are tagged NP0,
just as the surname itself. The choice whether to use a space and/or full-stop between
initials (eg J.F. or J. F. or J F or
JF) is determined by the original source text; the tagged version follows
the same format.
John F. Kennedy = John_NP0 F._NP0 Kennedy_NP0
J. F. Kennedy = J._NP0 F._NP0 Kennedy_NP0
J.F. Kennedy = J.F._NP0 Kennedy_NP0
In the spoken part of the BNC, however, the components
of names — and, in fact, most words — that are spelt aloud as individual letters,
such as I B M, and J R in J R Hartley, are not
tagged NP0 but ZZ0 (letter of the
alphabet). See further the discussion of the ZZ0 tag below.
Nouns of style
Preceding a proper noun, or sequence of proper nouns, style (or title) nouns with
uppercase initial capitals are tagged NP0:
Pastor_NP0 Tokes_NP0 Chairman_NP0 Mao_NP0 Sub-Lieutenant_NP0 R_NP0 C_NP0 V_NP0 Wynn_NP0 Sister_NP0 Wendy_NP0
Contrast the last example with the following:
You remember your sister_NN1 Wendy_NP0... [HGJ.800]
where Wendy is in apposition to a common noun sister,
in lowercase letters.
Geographical names
For names of towns, streets, countries and states, seas, oceans, lakes, rivers,
mountains and other geographical placenames, the general rule is to tag as NP0. If
the word the precedes, it is tagged AT0:
East_NP0 Timor_NP0 South_NP0 Carolina_NP0 Baker_NP0 Street_NP0 West_NP0 Harbour_NP0 Lane_NP0 the_AT0 United_NP0 Kingdom_NP0 the_AT0 Baltic_NP0 the_AT0 Indian_NP0 Ocean_NP0 Mount_NP0 St_NP0 Helens_NP0 the_AT0 Alps_NP0
Other tags are used for the constituents of more verbose (especially political)
descriptions of placenames, or those that are not typically marked on maps:
Latin_AJ0 America_NP0 Western_AJ0 Europe_NP0 the_AT0 Western_AJ0 Region_NN1 the_AT0 People_NN0's_POS Republic_NN1 of_PRF China_NP0 the_AT0 Dominican_AJ0 Republic_NN1 the_AT0 Sultanate_NN1 of_PRF Oman_NP0
The examples show a little arbitrariness in application. For
example, contrast
the_AT0 United_NP0 States_NP0 the_AT0 Soviet_AJ0 Union_NN1
Multiword names containing a compass point, i.e. those beginning
North, South, East, West,
North East, South-west etc. nearly always become NP0,
whereas those with Northern, Southern,
Eastern, Western follow the non-NP0 pattern. Rare
exceptions are:
Northern_NP0 Ireland_NP0 Western_NP0 Samoa_NP0
Non-personal and non-geographical names
Where names of organisations, sports teams, commercial products (including
newspapers), shops, restaurants, horses, ships etc.
consist of ordinary words (common nouns, adjectives etc.),
they receive ordinary tags (NN1,
AJ0 etc.). Only if a word used as part of a name is an existing NP0 (typically a personal or
geographical name), or a specially-coined word, is it tagged
NP0. Some examples follow:
Organisations, sports teams etc.
Cable_NN1 and_CJC Wireless_NN1
Procter_NP0 and_CJC Gamble_NP0
Acorn_NN1 Marketing_NN1 Limited_AJ0
Minolta_NP0; IBM_NP0; NATO_NP0
Wolverhampton_NP0 Wanderers_NN2 (football_NN1 club_NN1)
Tottenham_NP0 Hotspur_NP0 (football_NN1 club_NN1)
The_AT0 Chicago_NP0 Bears_NN2
Spartak_NP0 Moscow_NP0
World_NN1 Health_NN1 Organisation_NN1
Oxfam_NP0
There is a slight inconsistency here, in that acronyms of organisation names
(WHO, NATO, IBM etc.) take NP0, whereas the expanded forms of these names take
regular tags.
Products (including newspapers and magazines)
Windows_NN2 software_NN1
Weetabix_NP0
Lancashire_NP0 Evening_NN1 Post_NN1
Mars_NP0 bars_NN2
Time_NN1 Magazine_NN1
Scotchgard_NP0
The_AT0 Reader_NN1 's_POS Digest_NN1
Perrier_NP0 water_NN1
Company names may sometimes be used to represent product names; in such
cases the same tags apply. For example:
John drives a Volkswagen_NP0 Golf_NN1
John drives a Volkswagen_NP0.
Shops, pubs, restaurants, hotels, horses, ships etc.
Body_NN1 Shop_NN1
Mothercare_NP0
The_AT0 Grand_AJ0 Theatre_NN1
Sainsburys_NP0 supermarket_NN1
The_AT0 King_NN1 's_POS Arms_NN2
The_AT0 Ritz_NP0
Red_AJ0 Rum_NN1
Aldaniti_NP0
The_AT0 Bounty_NN1
The_AT0 Titanic_NP0
Here again NP0 is reserved for parts of names that are specially coined, or
derived from existing personal/geographical proper nouns.
Verbs
Type
The second character of a verb tag marks the type of verb as
follows:
B  forms of be (VBB VBD VBG VBI VBN VBZ)
D  forms of do (VDB VDD VDG VDI VDN VDZ)
H  forms of have (VHB VHD VHG VHI VHN VHZ)
M  other modal verbs (VM0)
V  lexical verbs (VVB VVD VVG VVI VVN VVZ)
Inflection
The third character of a verb tag marks the verb inflection as follows:
B  finite base form
D  past tense
Z  -s form (3rd person singular present)
N  past participle
I  infinitive
G  present participle
be, have, and do
Auxiliary and main uses of these verbs are not distinguished:
she is_VBZ playing her best tennis for six years. [CH3.1382]
she is_VBZ just a star. [CH3.6939]
John has_VHZ built a set of bookshelves. [C9X.121]
John has_VHZ great courage. [CA9.1869]
We did_VDD n't_XX0 see anybody. [KB2.702]
They do_VDB nice work. [ANY.514]
Note the variant form of have in non-standard English:
they shouldn't of_VHI left it the last minute [KD8.7288]
That could of_VHI been 'bout us [B38.322]
Lexical verbs
Tags beginning VV- apply to all other (lexical) verbs.
She travels_VVZ in every Saturday morning. [KRH.4013]
The young kids want_VVB to dance_VVI and have fun [CHA.1599]
I thought_VVD he looked_VVD a sad sort of a boy. [CDY.2831]
...after running_VVG out of coal, the crew were forced_VVN to burn_VVI timber and resin [HPS.269]
Modals
All modals are tagged VM0. We do not differentiate between so-called past and present forms:
We can_VM0 go there. We could_VM0 go there.
We used_VM0 to_TO0 go there every year.
The form let's is treated as one verb:
Let's_VM0 go_VVI! [A61.1443]
Contracted forms
Contracted forms (can't, won't, gimme, dunno etc) are split into their component parts, which are tagged individually.
Are_VBB n't_XX0 you coming? [A0R.2215]
I du_VDB n_XX0 no_VVI [KR0.23]
Subjunctives and Imperatives
No special tags are used for these:
She suggested that they get_VVB married. [CBC.12107]
Please be_VBB patient. [CHJ.899]
Do_VDB n't_XX0 just stand there watching! [ACB.3470]
Catenative or semi-auxiliary verbs
Again, no special tagging is used for such forms as going
to, ought to, or used to + infinitive:
you're not going_VVG to_TO0 get killed [KCE.6550]
you ought_VM0 to_TO0 let them know. [KCT.6115]
Adjectives
Adjectives are given one of the wordclass tags AJ0,
AJC, or AJS.
The general tag for adjectives (AJ0) subsumes:
Predicative and attributive uses
The ground was dry_AJ0 and dusty_AJ0 [GWA.118]
The dust from the dry_AJ0 ground [GWA.121]
Quasi-comparatives and quasi-superlatives
Adjectives which have a heightening or downtoning effect rather like that of comparatives and superlatives,
but which do not behave syntactically like comparatives or superlatives, are treated as ordinary adjectives.
Examples include utter, upper and
uppermost:
Events in Eastern Europe were evidently uppermost_AJ0 in Mr Li's mind. [A95.366]
Family contacts were very important in uniting the upper_AJ0 classes [FB6.1495]
Adjectives used catenatively
For example, able and
unable:
Will you be able_AJ0 to manage? (catenative)
Your son is very able_AJ0 (non-catenative)
Comparative adjectives receive the tag AJC;
superlatives take AJS:
A faster_AJC car. The best_AJS in its class.
Ambiguities frequently arise between adjectives and other wordclasses, in
particular adverbs, nouns and participles.
Adverbs
Adverbs are given one of the tags
AV0, AVQ, or AVP.
AV0 is the default tag for adverbs. It incorporates a very mixed bag, including:
adverbs of time, manner, place etc.
Eg slowly; here; soon
degree adverbs
Eg very and rather in
very_AV0 tall_AJ0
rather_AV0 painfully_AV0
sentence adverbs
for example:
However_AV0, … In addition_AV0
postnominal adverbs
for example:
aged between 2 and 11 years inclusive_AV0 [AMD.31]
the buildings thereon_AV0 [J16.813]
during 1986-91 inclusive_AV0 [FT0.1400]
Diamonds galore_AV0 [FPH.900]
discourse markers
such as well,
right, like:
you know like_AV0, it's worthwhile opening a cinema at 4 o'clock... [F7A.358]
Note that adverbs, unlike adjectives, are not tagged as positive, comparative, or superlative.
This is because of the relative rarity of comparative and superlative
adverbs.
Interrogative and relative wh-adverbs (when, where, how, why, wherever)
are tagged AVQ whether the word occurs in interrogative
or relative use.
"When_AVQ do your courses start?" [A0F.3117] "...if you let me know when_AVQ the police are called in." [BMU.2291] Yet why_AVQ is that so? [CR7.3089]
Ordinal-type adverbs (including first, fourth,
etc.) are treated separately with the ORD tag.
Prepositional adverbs (also known as adverbial particles) are
treated together with prepositions and tagged AVP: see the section
on Prepositions and prepositional adverbs below.
Articles, determiners & pronouns
Articles, definite or indefinite, are tagged
AT0. Pronouns which act as determiners of various kinds
(all, which, your etc.) are given tags DPS,
DT0, or DTQ, and distinguished from
pronouns which do not have a determiner function. These are marked
using one of the tags PNP, PNI,
PNQ, or PNX depending on their function.
Articles
All articles are tagged AT0. An article is defined
here as a determiner word which typically begins a noun phrase, but
which cannot occur as the head of a noun phrase.
Examples include a/an, the, no and
every:
Have a_AT0 break
Every_AT0 year
There's no_AT0 time
Determiners
Recognising that there is a high degree of formal and functional overlap between determiners and pronouns, we have conflated under the D-- heading
words that are capable of either function. We distinguish three classes of determiner pronouns:
Determiner-Pronoun
Words such as few, both, another are
tagged DT0:
free secondary education for all_DT0 [ECB.1610]
Few_DT0 diseases are incurable [GV1.1129]
for the benefit of the few_DT0 [HHX.10183]
Interrogative determiner-pronoun
The wh- (interrogative) determiner-pronoun is tagged DTQ. Which and what are always tagged DTQ:
Which_DTQ country do you live in? [A7N.979]
And she didn't say which_DTQ? [KCF.351]
What_DTQ time is it? [A0N.406]
Prenominal possessive determiner pronoun
Forms such as my, your, etc are always tagged
DPS, for example:
my_DPS hat
Compare this with the nominal use:
That is your way. This is mine_PNP [ASD.726-7]
Pronouns
Tags beginning P-- indicate pronouns which do not share the determiner function, for example
I, it , anyone.
Pronouns are differentiated according to whether they are:
personal (PNP), e.g. I, him, they, us (note that it is also included here)
reflexive personal (PNX), eg herself,
themselves
indefinite pronouns (PNI), anyone, everything,
nobody
interrogative (PNQ), eg who, whoever
Relative pronouns
Which as a relative (or interrogative) pronoun is grouped with the
other determiner-pronouns, and tagged DTQ:
Give 4 details which_DTQ should appear on an order form [HBP.417]
Meanwhile, that as a relative clause complementizer is treated with
that as a complement clause complementizer, and tagged CJT:
I got some currants that_CJT are left over [KST.3733]
this girl that_CJT Claire knows [KC7.1101]
He dismissed reports that_CJT his party was divided over tactics [A28.11]
We both knew that_CJT enough was enough. [FEX.268]
Note, however, that that takes the tag DT0 when it functions as a demonstrative pronoun or determiner:
Look at that_DT0 bear! [KP8.1547]
I guess I was sad about that_DT0. [BMM.239]
Prepositions and prepositional adverbs
Prepositions
Most prepositions are tagged PRP, including a
large number of multiword items. Examples include:
at_PRP the Pompidou Centre in_PRP Paris [A04.325]
I use humour as_PRP a protection [FBL.356]
Heard about_PRP this have you? [KE6.9556]
According to_PRP ancient tradition, ... [A04.784]
Many disputes are dealt with by bodies other than_PRP courts. [F9B.4]
Nice walls and a big sky to look at_PRP. [A25.122]
Of
The preposition of is assigned a special tag PRF
because of its frequency and its almost exclusively postnominal function. Examples:
a couple of_PRF cans of_PRF Coke [AJN.283]
DNA consists of_PRF a string of_PRF four kinds of_PRF bases [AE7.107]
Note that numerous multiwords contain of,
eg in front of, in light of, by means of, etc.
Prepositional adverbs/particles
Preposition-type words which have no complement are tagged AVP.
Typical uses of AVP are in phrasal verb constructions, or when it functions as a
place adjunct:
We gave up_AVP after two hours. [KSV.1029]
there were a lot of horses around_AVP. [HR7.3101]
There are many instances of ambiguity between PRP and AVP.
Conjunctions
Co-ordinating conjunction
Co-ordinators such as and, or, but,
nor etc are tagged CJC:
Fish and_CJC chips
James laughed and_CJC spilled wine. [A0N.136]
She was paralysed but_CJC she could still feel the pain. [FLY.529]
Subordinating conjunction
All subordinating conjunctions are tagged CJS
and introduce one of:
an adverbial clause (of time, reason, condition etc.)
"When_CJS you 've done it , you should go
home,"[CRE.949]
I still stayed there after_CJS I heard the shooting [HW8.3263]
As_CJS you may know Scorton will again enter the Best Kept Village competition in 1992 [HPK.768]
Do send me an interim copy as_CJS soon as you can [HD3.69]
If_CJS it's wet just take your time. [KCL.554]
a comparative clause
introduced by than or
as, and occurring with or without ellipsis:
It was worse than_CJS she could have imagined. [CH0.1315]
...apparently we eat more chocolate than_CJS any other country. [G3U.1000]
"it's as good as_CJS it's going to get." [K9K.199]
make the transporter as light as_CJS possible. [CA1.1113]
a nominal wh-clause
containing whether or if
Can you tell me whether_CJS ivies do damage
trees. [C9C.720]
Complement clause
The conjunction that at the start of a clause introducing reported speech and thought, and also
at the start of a relative clause, is tagged CJT:
Historians knew that_CJT this was nonsense. [G3C.363]
China announced that_CJT it was ending martial law in the Tibetan capital Lhasa. [KRU.95]
The problem that_CJT he was having was that_CJT she was his legal wife 's sister [HE3.210]
Numerals
Cardinal numbers and similar items are tagged
CRD. Ordinal numbers and similar items are tagged ORD.
Numbers and fractions
All cardinal numbers, numeral nouns, fractions and so on take the tag CRD,
whether they are written as words or numerals, and whether functioning nominally or prenominally.
Examples:
5_CRD out of 10_CRD [CGM.525]
one_CRD striking feature of the years 1929-31_CRD [A6G.134]
his first_ORD innings, when he scored forty-two_CRD, with seven_CRD fours_CRD [KJT.128]
Hundreds_CRD of people audition each year [K1S.2239]
About a dozen_CRD there. [HEU.131]
Ordinal numbers and similar
Ordinal numbers are assigned ORD in all syntactic positions, including adverbial positions,
as in We only came fourth_ORD in the county championship last_ORD year [EDT.1629]
Note that ORD is also assigned to less overtly numeric words like next and last, even in clear adverbial, adjectival or nominal contexts. This is because next and last function like ordinals both syntactically and semantically.
Currency and measurement expressions
Measurement expressions, consisting of numbers and a unit of measurement of some kind
(together as one word), are assigned a noun tag, usually NN0 (neutral for number) or NN2 (plural):
6kg_NN0 £600_NN0 12.5%_NN0
Formulae
Other sequences of numeric and alphabetic characters are assigned UNC
(unclassified) tags:
Figure 2b_UNC [FTC.250]
Serial no. S835508_UNC [C9H.2282]
A4_UNC sheet of paper [CN4.296]
Mark drove home along the M1_UNC [AC2.2210]
Miscellaneous other tags
Existential there
The tag EX0 is used for there when it
merely states that something exists or existed. It occurs at the
beginning of a clause and is usually followed by the verb be and an indefinite noun phrase; for example:
There_EX0 was a long pause and then a smile [A4H.416]
Waiter! Waiter! There_EX0's an awful film on my soup! [CHR.657-9]
There_EX0 appears to be little alternative [ECE.2139]
Compare this with there when it has a clear locative meaning ('in/to that place'):
Don't stand there_AV0 grinning like a stuck pig [C85.1553]
Interjection
The tag ITJ is used for any interjection:
Hello_ITJ, Nell.
Oi_ITJ - come here!
Yes_ITJ, please_AV0 do
No_ITJ not_XX0 yet_AV0
(For the distinction between ITJ and the unclassified tag, UNC, see the Disambiguation Guide below.)
Genitive morpheme
The tag POS is used for the
genitive morpheme 's (singular) or '
(plural after an s):
teacher_NN1's_POS pet
teachers_NN2'_POS pet
Note the lack of space between the noun and the following POS, as 's is
tokenized in the same way whether it represents a genitive or a contracted verb. See further
the treatment of 's under Apostrophe S below.
Infinitive marker
The tag TO0 is used for
the infinitive marker. This includes elliptical uses.
"Do you want to_TO0 talk about it?" [EFG.1935]In the summer holidays I can , I can get up early if I want to_TO0 . [KPG.4153]
Note the morphological variation of to in the following colloquial forms:
We got_VVN ta_TO0 go
We wan_VVB na_TO0 stay.
Unclassified words
The tag UNC
is used for unclassified (or unclassifiable) words. It is applied in contexts where no other wordclass tag
seems appropriate, including
"Noise words" and pause fillers in spoken utterances; imitations of animal or machine sounds:
blah_UNC blah_UNC blah_UNC
er_UNC I think so
Certain fused forms (in written or spoken data) for which no other tag would be appropriate:
Methinks_UNC
That ai_UNC n't_XX0 right.
0.5 cm increments/30_UNC seconds [HWT.282]
Fits with most lap/diagonal_UNC seat belts. [BNX.392]
Truncated words in speech. Partial words that are not completed by a
speaker, whether through hesitation or an interruption, are also
usually marked with the XML element trunc; for example
the partial word bathr in the following:
The bathr_UNC data. er you can't beat a white bathroom suite anyway. [KCF.771]
Partial repetitions of multiwords in spoken data.
Occasionally in spoken data, when a multiword sequence is used, it appears to be repeated, but only partially so.
In the following example, the orthographic word sort is used twice:
we're going to sort sort of summarize... [G5X.106]
We treat the first sort as an incomplete multiword, and tag it UNC (rather like truncated words, above). The complete multiword sort of is tagged AV0, as normally.
we're going to sort_UNC sort of_AV0 summarize...
For further examples, and for the distinction between UNC
and ITJ, see the Disambiguation Guide below.
Negative particle
XX0 is the tag for the negative particle not, and also for its contracted or fused form n't:
Brown did_VDD n't_XX0 see it that way. [A6W.338]
no, that is not_XX0 correct. [JK0.257]
Letter
ZZ0 is used for a free-standing letter of the
alphabet such as A, X, x, p, r. If, however, the letter
clearly represents a separate word, or an abbreviation of a separate
word, we have tried to assign the appropriate POS-tag for the full
form of that word, rather than ZZ0. For example:
I as personal pronoun is PNP rather than ZZ0.
a as indefinite article is tagged AT0
F as in John F. Kennedy is tagged NP0
v meaning 'versus' is tagged PRP in
Italy v_PRP New Zealand ... Hungary v_PRP Thailand [A1N.507].
Although the same should apply to v., the
full-stop has sometimes incorrectly produced a
new sentence break. (See e.g. CHS.1076, EB2.19, EDL.313)
In spoken texts, words which are spelt out by the speaker are transcribed letter by
letter, and each letter is tagged ZZ0.
I_ZZ0 B_ZZ0 M_ZZ0 compatible [JYM.6]
children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3807]
Disambiguation Guide
The following is a guide to resolution of the most common tagging
ambiguities. It states the principles by which we have drawn the line
between the "correct" and the "incorrect" assignment of a tag in
particular contexts (as applied in the report on tagging error rates). Note that
in the next two sections, we also cite examples where the POS-tagging
in the corpus is less reliable and does not match that given for the
citation. In such cases we append the actual tag in the corpus to the
file reference with an asterisk. E.g. under Adjective vs. adverb (next
section), the preferred tag for long is AV0, but the actual
tag is the ambiguity tag AV0-AJ0:
You're not supposed to keep medicine that long_AV0. [H8Y.1976 *AV0-AJ0]
Note also that in this section we use a number of invented
examples (in addition to corpus citations) to clarify the distinction
between categories.
Disambiguation by Tag Pair
Adjective vs. adverb
After a verb or an object, there is sometimes a difficult choice
between AJ0 and AV0, or between AJC and AV0. e.g.:
We arrived tired_AJ0, but safe_AJ0 [CCP.529]
Here, both tired and safe are AJ0. The main test is to see whether one can express the relation between these words
and their logical subjects using the verb be: They arrived tired but safe implies 'They were tired but safe'. The
word tagged AJ0 refers to a property of a noun, rather than to a
property of an event or situation. Contrast:
After a little he remembered it and sang out loud_AV0. [A0N.1144]
This sentence does not imply that he was loud,
but is more or less equivalent to He sang out loudly. It means that his singing was loud.
It follows that when, in colloquial English, a word which we normally
expect to be an adjective is used as an adverb, we should tag it AV0:
You did great_AV0 though. [HH0.3248 *AV0-AJ0]
Here is another pair of examples, where the AJ0/AV0 word follows
an object:
everyone below 25 grew their hair too long_AJ0. [ARP.590 *AV0-AJ0] (i.e. 'their hair was too long'.)
Try not to keep her too long_AV0. [FAB.3620 *AV0-AJ0] (i.e. NOT 'she will be too long.')
Also note the similar distinction between AJC and AV0:
They'll have to make the taxes higher_AJC. ('the taxes will be higher')
We can make this piece higher_AJC if you want to. [BNG.2268]
You'll have to aim higher_AV0. (NOT 'you will be higher')
You should aim higher_AV0 [ACN.984 *AJC]
Similar considerations arise for the choice between AJS and AV0:
I thought it best_AJS to call. [AT4.3239]
I liked the cartoons best_AV0 [CAM.194]
Adjective vs. noun
There are many words in English which can be tagged either
adjective (AJ0) or noun (NN1). Colour words like black, white
and red are fairly consistent in allowing the two tags,
and may be used to illustrate the difference. In attributive
(premodifying) or predicative (complementing) positions without
further modification these words are normally adjectives:
a white_AJ0 screen
The screen is white_AJ0.
When the word is the head of a noun phrase, on the other hand, it is a noun:
Red_NN1 is my favourite colour.
They painted the wall a brilliant white_NN1.
Sometimes a word cannot be used predicatively as an adjective,
but can occur attributively in a way which suggests adjectival
use. For example, past and present are
adjectives in
All past_AJ0 and present_AJ0 employees of the branch are invited. [K99.216]
We do not find present or similar words being used as predicative
adjectives, however:
*These needs are past, present, and future.
(Note that present can be used as a predicative adjective
meaning the opposite of absent; but this meaning is not comparable
to the temporal meanings of past, present and future
above.)
Contrast K99.216 above with cases where past, present etc.
are heads of noun phrases, e.g. following the definite article,
and are clearly nouns:
You're living in the past_NN1. [HGS.1045]
I don't even want to think about the future_NN1. [JY4.2864]
The only reason for treating past and present in
the example above as adjectives is that they have an institutionalized
meaning as modifiers, which is rather different from the meaning
they have as nouns. Further examples of this type are words such
as model in model behaviour, giant in a giant
caterpillar and vintage in vintage cars.
Words ending in -ing are a particular problem: when they
premodify a noun, they can be tagged either NN1 (noun) or AJ0 (adjective).
Contrast:
new spending_NN1 plans [CEN.5922]
a working_AJ0 mother [ED4.153]
his reading_NN1 ability [CFV.1897]
in the coming_AJ0 weeks [HKU.1333]
The guideline is as follows.
If X-ing + Noun is equivalent in meaning to Noun
who/which X-es (or X-ed or BE + X-ing), then X-ing is an adjective (AJ0).
That is, a word ending -ing is an adjective when it is the
notional subject of the noun it premodifies. For example:
two smiling_AJ0 children [HTT.743] ('two children who are smiling')
In other cases, X-ing is generally a noun (NN1). In such cases, it is often possible to paraphrase X-ing + Noun
by a more explicit phrase in which X-ing is clearly a noun:
new spending_NN1 plans ('new plans for spending') his reading_NN1 ability ('his ability in reading')
Further examples:
a mating_AJ0 animal [GU8.2142]
the mating_NN1 game [ECG.336 *AJ0-NN1]
a falling_AJ0 rate of unemployment [KR2.2129]
slimming_NN1 tablets. [KCA.941 *NN1-VVG]
Determiner-pronoun vs. adverb
More and less can be assigned to either of the tags
DT0 or AV0. The difference between them is that DT0 is for noun-phrase-like
(and determiner-like) uses of the word in question, whereas AV0
is for adverbial uses. The two can be hard to distinguish, particularly
after a verb:
(a) You should relax more_AV0. (b) You should spend more_DT0.
Since relax is an intransitive verb in (a), more
cannot be a noun phrase following it. Instead, more can
be paraphrased roughly as 'to a greater extent' or 'to a greater
degree'. On the other hand, spend in (b) is a transitive
verb, and so more is a determiner-pronoun form following
it. As confirmation of this, note that sentence (b) could be turned
into a passive with more as subject: More should be
spent.... There are unfortunately some verbs for which the
distinction is less clear than in the above examples, e.g.:
You should eat more. You should read more. You should smoke less.
In these cases, the verb may be used transitively or intransitively
with almost identical meanings, so that the syntactic structures
of the immediate and/or surrounding context are the only clues
as to which is the case:
Do you smoke? (Intransitive)
How many do you smoke in a week? (Transitive)
Contrast (c) and (d) below:
(c) At the moment we have 23 fixtures per season. Personally, I would rather play more_DT0.
(d) You should work less and play more_AV0.
(In (d) the adverb more has roughly the meaning of 'more
often'.)
Note. The automatic disambiguation of determiners and adverbs is not reliable, because transitivity has not been encoded in the tagger. Sentences like (c) and (d), where more follows the verb at the end of a sentence, are invariably tagged AV0.
Adjective vs. participle
Another area of borderline cases is the tagging of words as adjectives
(AJ0) or as participles (VVG or
VVN).
One test is to see whether a degree adverb like very can be inserted in front of the
word: e.g. in We were very surprised, surprised is an AJ0.
Another test, having the opposite effect, is to see
whether there is an agent by-phrase following the word in
-ed or -en. If so it is a VVN: We were
surprised_VVN by pirates. Even where it is not present, the
possibility of adding the by-phrase, without changing the
meaning of the word, is evidence in favour of VVN. (However, this
criterion can clash with the preceding one — since it occasionally
happens that an -ed word is both preceded by an adverb like
very and followed by a by-phrase: E.g. I was so
irritated by his behaviour that I put the phone down. When these
do occur, we give preference to AJ0.)
A third test is negative: to see whether the word in question
can be placed before a noun. e.g.:
The effect is lasting_AJ0 (compare a lasting_AJ0 effect).
The door is locked_AJ0 (compare the locked_AJ0 door.)
This shows that lasting or locked can easily be (but need not be) an AJ0. If the word could not be placed (with
the same meaning) before the noun, this would be evidence that the word is a participle.
Even though an -ing word is normally a VVG after
the verb be, it is generally treated as an AJ0 before a noun:
The man was dying_VVG. [HTM.1494 *VVG-AJ0]
the dying_AJ0 man. [FSF.1787]
However, when the -ing or -ed forms part of
a premodifying phrase, the VVG or VVN tag is preferred:
an interest_NN1 earning_VVG account
a hypothesis_NN1 driven_VVN approach
In these examples the NN1+VVG/VVN sequence has the character of a premodifying adjective compound. We can therefore imagine the
two words bracketed together forming an adjective: an interest-earning_AJ0 account. But within the adjective, the VVG and VVN tags retain their verbal character, with the initial noun acting as object of the verb (cf. the account earns interest).
The same applies when the premodifying compound phrase is noun-like:
a shanty_NN1 singing_VVG competition[K4W.2952]
If the verb be can be replaced by another verb such
as seem or become, without changing the meaning
of the following AJ0 / VVN word, this is a strong indication that
the construction is not properly a passive, and that the word
is an AJ0:
The building was infested_AJ0 with cockroaches (cf. The building seemed/became infested with cockroaches)
A further distinction which can be used to test with 'event' verbs is that the AJ0 refers to a 'resultant state', whereas the
VVN refers to an event:
Bill was married_AJ0. (i.e. he was not single)
Bill was married_VVN to Sarah on the 15th May. (i.e. the actual event)
This is a manifestation of the general semantic character of adjectives
(which typically refer to states or qualities) and verbs (which
typically refer to events or actions).
However, this criterion is not definitive, as VVG and VVN can also sometimes refer to states, when the meaning of the verb is stative:
She is not disturbed_VVN by that sort of threat.
The tourists were standing_VVG around a map of the city.
Finally, here is a test which clearly identifies an -ing form as a verb.
A verb takes following complements such as a noun phrase, an adjective or an adverbial; these cannot follow the same word when it is an adjective. E.g.:
Are you expecting_VVG someone?[G01.2610]
The arithmetic is looking_VVG good. [K1M.3611]
Turning_VVG suddenly, she ran for the safety of the car [CK8.297]
Contrast:
His manner was insulting_AJ0.
where insulting could not normally be followed by an object:
* insulting us.
Preposition vs. prepositional adverb vs. general adverb
This kind of ambiguity occurs frequently, particularly in spoken texts. Compare:
(a) She ran down_PRP the hill.
(b) She ran down_AVP her best friends.
In (a), down is a preposition, because:
An adverb could be inserted before it:
She ran quickly down the hill. (But not: *She ran viciously down her best friends.)
It can be moved (somewhat awkwardly) to the front of a wh-word:
This is the hill down_PRP which he ran.
Down_PRP which slopes do you like ski-ing?
In (b), down is an adverbial particle because:
It can be placed before or after the noun phrase acting as
object of the verb:
She ran her best friends down_AVP. (But not: *She ran the hill down.)
If the noun phrase is replaced by a pronoun, the pronoun has
to be placed in front of the particle:
She ran them down_AVP. (= her best friends) (But not: *She ran down them.)
Similarly: The dentist took all my teeth out_AVP. (The dentist took them out)
Notice that the syntactic distinction between (for example) down as an adverbial particle and down as a preposition
is independent of the semantic distinction between locative and
non-locative interpretations of down.
When the verb is simply followed by down or out,
etc., without a following noun phrase, it is normally an AVP:
Income tax is coming down_AVP.
The decorations are put up_AVP on Christmas Eve.
However, it is important to recognize 'stranded' prepositions,
which have been deprived of the company of their noun phrase,
the prepositional complement, because it has been fronted or omitted
through ellipsis (e.g. in relative clauses, with passives, in
questions, etc.):
This is the hill (which) she ran down_PRP. (Cf. This is the hill down which she ran.)
The poor were looked down on_PRP by the rich. (Here on is the stranded preposition.)
Which car did she arrive in_PRP?
The same tests apply to words which are tagged either as prepositions or as general adverbs (AV0), such as across, past and behind.
Note, additionally, the use of
about as a degree adverb.
Interjection vs. unclassified
The borderline between interjections or exclamatory particles (tagged
ITJ) and unclassified 'noise' words (tagged
UNC) is drawn as follows:
ITJ is used for 'institutionalized' interjections or discourse particles such as good-bye, oh, no, oops, hallelujah, whoa, wow; however, well, right and like functioning as discourse markers are tagged AV0.
UNC is used in contexts where no other wordclass tag seems appropriate:
'noise' words and pause fillers in spoken utterances; this
includes imitations of animal or machine sounds:
blah_UNC blah_UNC blah_UNC
er_UNC I think so. Erm_UNC nope_ITJ.
certain fused forms which cannot easily be broken down into
separate word classes:
methinks_UNC; ai_UNC n't_XX0
constituent w elements within multiword expressions
for which no unique C5 code can be found.
The contraction ain't is a special case: its first half
is tagged UNC because it abbreviates so many different verb forms
(am not, is not, are not, has not, have not) that no single
tag can be applied to it (unless one were to invent a special
tag for that purpose).
Disambiguation by Word
In this section we discuss some common words which belong to more than one word class, and are among the most problematic for disambiguation. As in section 3, if the tag stated in the example differs from the actual tag in the corpus, we append the latter to the file reference number in the next line, e.g. *AV0 in:
Tears well_VVB up in my eyes. [BN3.5 *AV0]
Apostrophe S
In the BNC the two-character sequence 's is generally tagged as a
separate wordform, following
without a space the immediately preceding word.
Contracted forms
When it represents a shortened form of is, has or (rarely) does, it has the appropriate verb tag.
Occasionally, for example with auxiliaries followed by past participles, there are difficulties determining what the full form of the verb should be.
Examples:
That_DT0's_VBZ perfect is that one... (= That is...) [KCX.1254]
She_NP0 's_VHZ got tickets. (= She has...) [KPV.6479]
well, what_DTQ 's_VDZ he do?, is he a plumber? (= What does...) [KD6.310]
Genitives
Britain_NP0's_POS small businesses [HMH.67]
After today_AV0's_POS announcement [K6F.39]
's plural
When 's acts as a marker of the -s plural, or as part
of the verb form let's, it is part of a single word, and
is not assigned its own tag. E.g.:
success in the three R_ZZ0's [EVY.59]
in the 1980_CRD's [HJ1.22024]
Let_VM0's go_VVI. [A61.1443]
Note that let's is not considered a contraction of let
us, but is treated as a single 'verbal particle', tagged VM0,
on the grounds that it is closely analogous to modal auxiliaries.
ABOUT
Degree adverb:
When about has an approximating meaning, typically premodifying
a quantifying expression, it is tagged AV0 (not
PRP):
...it was about_AV0 three weeks ago [FAJ.1714]
about_AV0 half the size of a grain of rice [AJ4.33]
Note also the multiword just about, as in:
We're just about_AV0 ready.
Preposition vs. particle:
See further at
Examples:
my mother was reading a novel about_PRP gypsies... . [ARJ.2068]
How did this transformation come about_AVP? [A11.786]
AS
Comparative constructions:
As is a degree adverb (AV0) when it occurs before an adjective,
adverb or determiner (and sometimes other words) in phrases of
the type as X as Y, or simply as X (where the comparative
clause or phrase as Y is omitted but understood):
I go to see them as_AV0 often as I can . [AC7.1189]
and they employ ninety people, twice as_AV0 many as last year. [K1C.3540]
And every bit as_AV0 good .[EEW.1132 *CJS]
In the first and second examples above, the second as introduces
a comparative construction which expresses 'equal comparison',
as contrasted with the unequal comparison of more X than Y.
When as is a word introducing such a comparative construction,
it is tagged CJS:
Capitalism is not as_AV0 good as_CJS it claims. [CFT.2042]
Linked together, they can crunch numbers as_AV0 fast as_CJS any mainframe.[CRB.271]
She will deposit as_AV0 many as_CJS a dozen eggs there. [F9F.424]
Notice that as in this comparative use is tagged CJS whether
or not it introduces a clause. Often it introduces a noun phrase.
In the following example, it introduces an adjective:
always reply as_AV0 quickly as_CJS possible. [C9R.989]
Introducing other clauses:
The tag CJS is also used when introducing other subordinate clauses,
such as adverbial clauses of time or reason:
New York called just as_CJS I was leaving. [APU.1543]
As_CJS you've gone to so much trouble , it would seem discourteous to refuse [KY9.2107]
Preposition:
The tag PRP is used for as functioning clearly as a preposition:
Consider it as_PRP a kind of insurance [AD0.1641]
As_PRP head of information, Christina will lead a team of four TEC staff... [BM4.2830]
Usually the meaning is related to the equative meaning of the
verb be. However, the guideline restricts PRP to cases
where as is followed by the normal noun phrase or nominal,
as is normal for prepositions. Where the as is followed
by an adjective or a past participle clause, it is tagged CJS,
even though it may retain the equative type of meaning:
We regard these results as_CJS encouraging. [B1G.184]
I very much hope that you will in fact support the motion as_CJS originally intended. [KGX.93]
Multiwords:
As is part of many multiwords which get tagged with a
single tag: e.g. as soon as, such as, in so far as,
as long as, as well as. The sequence as well as, for
example, is tagged as a preposition (PRP) in such examples as
Sometimes as well as_PRP going this way we actually need to go in this was too. [G5N.31]
Note that this is different from the multiword adverb as well (meaning
also); it is also different from the sequence of as
well as as three separate words, e.g. in:
She's as_AV0 well_AJ0 as_CJS can be expected. [F9X.2095]
BUT
The coordinating conjunction CJC is overwhelmingly the most common
use of but. The following other cases can also be detected:
Adverb:
But is an adverb when its meaning is similar to 'only':
She can spare you but_AV0 a few minutes [CCD.82 *CJC]
There is but_AV0 one penalty. [ALS.185 *CJC]
Subordinating conjunction or preposition:
But is either a conjunction (CJS) or a preposition (PRP)
if it has the meaning of 'except (for)', 'other than' or 'apart
from'. CJS is used when it introduces a clause, and PRP is used
when it introduces a phrase:
...mediocre albums that do nothing but_CJS take up shelf space [C9M.1014]
I couldn't help but_CJS notice. [JY0.5323 *CJC]
I always feel they are open meetings in everything but_PRP name. [HJ3.5520]
No one had guessed she was anything but_PRP a boy. [C85.517]
Coordinating conjunction:
Otherwise but is a coordinating conjunction, tagged CJC,
linking units of the same kind (e.g. clauses or adjective/adverb
phrases). Its function is to express contrastive or 'adversative'
meaning:
God and minds do exist , but_CJC materially so . [ABM.1265]
And that's it for another week but_CJC don't forget the late news at eleven thirty. [J1M.2520]
Hares ( but_CJC not rabbits ) are particularly vulnerable... [B72.892]
Multiwords
Note also multiwords such as but for (PRP):
The fare increases would have been bigger but for_PRP the governments last minute intervention. [K6D.124]
HOME
As a locative adverb, home has no determiner or article
preceding:
We stayed home_AV0. [FAP.313]
Contrast the noun use, which does:
This is my home_NN1. [AMB.1805]
LIKE
Discoursal function:
In speech, when like has a discoursal function as a 'hedge',
we tag it AV0:
well she says like_AV0, I won't be a minute [KCY.1518]
I'm driving along, you know like_AV0 <trunc>wha</trunc> when you're
in the car by yourself and everything's turning over in your head [KBU.1096]
Other functions:
Like very frequently occurs as a preposition or as a verb.
The noun and adjective uses are fairly rare:
...but I like_VVB Monday best. [FU4.1089]
He didn't look like_PRP a goodie. [H0M.1353]
... fuel, weapons, ground crew and the like_NN1. [JNN.105 *AJ0-NN1]
Churchill and Eden were not of like_AJ0 minds... [ACH.1297]
LITTLE
Adjective:
The meaning of little (AJ0) is the opposite of big:
Bless their dear little_AJ0 faces. [HRB.722]
Little_AJ0 green shoots of recovery are stirring. [CEL.968]
Determiner-pronoun:
The meaning of little (DT0) is 'not much':
I have little_DT0 to say. [G1Y.1133]
...there was little_DT0 food left. [FSJ.720]
Adverb:
As an adverb (AV0) little also has the meaning 'not much':
I care very little_AV0 about petty-minded, selfish "rules". [B0P.211]
A little
Note that a little can also be a multiword adverb (AV0):
They are all a little_AV0 drunk. [G0F.2117]
However, the quantifier a little meaning 'a small amount' is not tagged as a multiword, but as AT0 + DT0. (In BNC version 1 this quantifier was sometimes, though not reliably, tagged as a multiword DT0.)
You couldn't let me have a_AT0 little_DT0 milk? [GUM.1656]
[See ]
MUCH
Determiner-pronoun:
Much_DT0 of this work has to be done on the spot. [C8R.24]
I've spent too much_DT0 money. [KPV.62659]
Adverb:
Thanks very much_AV0. [A73.5]
I didn't sleep much_AV0 last night [ALH.1495]
See also
MORE and LESS
See for a fuller
discussion. Further examples:
You deserve more_DT0 than a medal. [K97.3705]
More_DT0 haste, less_DT0 speed. [J10.4543]
...this will make him more_AV0 tired than usual [A75.282]
But I couldn't agree more_AV0 [BMD.3]
More than as a multiword premodifier counts as an AV0:
more than_AV0 one in a million [K5N.46]
NO
Article
No_AT0 problem_NN1. [H4H.227]
Noun
As a noun, no is usually an abbreviation for number:
quoting Ref_NN1 No_NN1 BCE90_UNC [CJU.673]
Adverb
but the matter was taken no_AV0 further_AV0. [ARF.183 no: *AT0]
To put it no_AV0 more_AV0 strongly_AV0, it has not been proved beyond doubt that.... [EW7.125]
Interjection:
No is tagged as an interjection (ITJ) where it functions
as the opposite of Yes.
"...See how easy my job can be?" "Frankly, no_ITJ". [HR4.2329]
ONE
Numeral:
The clearest cases of CRD are in a quantifying noun phrase, typically allowing the substitution of another numerical expression (e.g. one chip
contrasts with two chips) or of the digit 1 (1 chip):
Can I have one_CRD chip, please? [KDB.1416]
So are there criticisms? Just one_CRD. [CG2.1489]
... one_CRD in five sufferers never tells their partners. [CF5.8 *PNI]
Orford Ness is one_CRD of Britain's most unusual coastal features. [CF8.86]
In such noun phrases, one functions like a determiner-pronoun such as some.
Indefinite Pronoun:
The clearest cases of PNI are:
As a substitute form, standing for an understood noun or noun
phrase:
The channel was not a broad one_PNI [AEA.1457]
In this use, one has a plural form ones.
As a generic personal pronoun, meaning 'people in general':
And I think one_PNI might go on to argue that far from saving labour it creates it. [J17.1915]
Note that the reliability of the ambiguity tag PNI-CRD (in which the pronoun is rated more likely)
is somewhat low. See
RIGHT
As both an adverb (AV0) and an adjective (AJ0) right means
the opposite of 'wrong' and also the opposite of 'left'. As a
noun, it generally means 'entitlements': e.g. I have a
right_NN1 to know. The uses of right as a verb are
very rare.
Less obvious points:
Discoursal function:
As a discourse marker, right is tagged AV0:
Right_AV0, how you doing there? [KBL.4671]
Right_AV0, er, members, any questions ? [F7V.138]
Degree adverb (intensifier):
In dialectal usage, right can be an intensifier, and is tagged AV0:
it's a ... it's a right_AV0 soft carpet. [KB2.1242-4]
SO
In most cases so is tagged as an adverb (AV0):
So_AV0 this is where you work... [H8M.2964]
Right, so_AV0 what's fifty three per cent as a decimal? [JP4.357]
They waited but nothing happened so_AV0 they made a fuss. [FU1.2484]
As a pro-form meaning 'thus' or standing for a clause or predicate,
so is tagged AV0:
So_AV0 say I and so_AV0 say the folk. [G11.228]
"Yes, I think so_AV0." [CCM.151]
As a degree adverb or intensifier, so is tagged AV0:
tough and long lasting - that's why they're so_AV0 popular. [BN4.929]
There would not be so_AV0 many lonely people in our land [B1Y.1262]
Introducing purpose clauses, so is tagged CJS (subordinating
conjunction):
Drink your tea so_CJS they can have your cup. [KB2.1767]
Note that so is frequently part of a multiword: so
that, so far, so as to, (in) so far as, etc.
See the list of multiwords
THAT
As a demonstrative (pronoun or determiner), that is tagged DT0
That_DT0's_VBZ my coat yeah. [KBS.1309]
he's getting hooked on the taste of vaseline, that_DT0 dog. [KCL.197]
As a clause-initiating conjunction, that is
tagged CJT.
This applies to that as a complementizer:
Many experts claim that_CJT it is good for your growing baby, too. [G2T.1091]
and also to that as a relativizer (introducing a relative
clause):
A ship that_CJT never enters harbour. [BPA.1326]
This is different from the more traditional analysis which treats
that introducing a relative clause as a relative pronoun.
As a degree adverb (intensifier):
It wasn't all that_AV0 bad. [KPP.321]
That occurs commonly in multiwords such as so
that, in that, in order that.
THEN
In all functions except clear adjectival usage (AJ0, usually following the), then
receives the tag AV0:
And then_AV0 she spoke. [H8T.2675]
"Come on, then_AV0." [K8V.1722]
Mr Willi Brandt, the then_AJ0 Mayor of West Berlin. [A87.84]
...the then_AJ0 state governor, who wasn't then_AV0 Bill Clinton [A87.84]
TO
Infinitive marker
When used with an infinitive, to is always tagged
TO0. Note elliptical uses of the pre-infinitival
to, especially in informal spoken texts:
In the summer holidays, I can, I can get up early if I want to_TO0. [KPG.4153]
Note also the common colloquial spelling of want to, got to,
and going to as fused words:
wanna = wan_VVB na_TO0
gotta = got_VVN ta_TO0
gonna = gon_VVG na_TO0
Preposition
When used as a preposition, to is always tagged
PRP. Prepositions are normally followed by a noun phrase or nominal
clause. Where the preposition is 'stranded' (i.e. where the noun
phrase associated with the preposition has been moved or ellided )
it can be confused with an adverbial particle:
That 's the school that Terry goes to_PRP. [KB8.2442]
...what you're entitled to_PRP by law is money back [FUT.360]
"Where to_PRP?" "The moon." [FNW.240-1]
Adverbial particle
The adverbial particle to is rare but does occur, for example
in come to meaning 'regain consciousness'.
WELL
Adverb
By far the most common function for well is as an adverb:
She's playing well_AV0
Discoursal function:
When well has the function of a discourse marker, it is
treated as an adverb (AV0):
Oh well_AV0! That'll be the finish! [FX6.196-7]
I bet he doesn't get up till about, well_AV0, it's eleven now. [KBL.3808]
Degree adverb:
Well is tagged AV0, too, where it has an intensifying function: e.g.
It was dark outside and well_AV0 past your bedtime. [ASS.898]
Adjective
Well is tagged as an adjective where it means 'in good
health': You don't look well_AJ0. [HPR.107]
Verb
As a verb, well is very rare, but occurs in the phrasal
verb well up. NB. This use has not been accurately tagged in the corpus:
Tears well_VVB up in my eyes. [BN3.5 *AV0]
WHEN
When can introduce three types of clauses: an adverbial
clause, a nominal clause, or a relative clause. Where it introduces
an adverbial clause, it is tagged CJS. Otherwise it is
tagged AVQ. The AVQ tag is also used for
when introducing a question. Examples:
Adverbial clause:
When_CJS I got back to my flat, I decided to ring Toby. [CS4.1265]
the crowd left quietly when_CJS the police arrived. [APP.1017]
(when = at the time at which)
If you smoke when_CJS you're pregnant... [A0J.1598] (when = whenever)
Note that when is also a subordinating conjunction in abbreviated
adverbial clauses which lack a subject and finite verb, such as
when in doubt, when ready, when completed.
Nominal clause
I can't remember when_AVQ we last had a frost. [KBF.11728]
"Do you remember when_AVQ we used to go with Daddy in the boat on Saturdays?" [A6N.2022]
You never know when_AVQ the next big story will break. [HJ6.100]
Before an infinitive, when is also tagged AVQ:
Otto knew when_AVQ to change the subject. [FAT.1603]
Also when the rest of the infinitive clause is understood:
Tell me when_AVQ.
Relative clause
in the year when_AVQ I was born (when = in which)
the moment when_AVQ he arrived (when = at which)
Note that when can often be omitted in relative clauses:
the moment he arrived.
Direct questions
When_AVQ did you find out?
WHERE
Where is like when
in that it can be a wh- adverb
(AVQ) or a subordinating conjunction
(CJS). However, with where the
CJS tag is much less likely. Examples:
In adverbial clauses
...to hit him where_CJS it hurts. [CEN.2816]
In other contexts
Nominal clause:
I don't know where_AVQ she picked them up. [G1D.1163]
Relative clauses
It was the house where_AVQ the poor woodcutter lived with Hansel and Gretel
Direct questions:
Where_AVQ are you going? [KB9.2650]
WORTH
Preposition
worth is tagged PRP where it could answer a question
such as 'How much is X worth?' or 'What is X worth?'
these pictures are worth_PRP a small fortune. [FNT.1060]
That makes him worth_PRP about $60m. [CT3.479]
'Darling, it's not worth_PRP getting upset. [HH9.2308]
worth also occurs as a 'stranded preposition' in questions
used to elicit such responses, and in some other common constructions:
how much d'ya think it's worth_PRP? [KCX.1344]
share prices say nothing about what a company is worth_PRP. [A9U.305 *NN1]
Please go ahead and push Grapevine for all you are worth_PRP. [AP1.575]
Noun
worth is tagged NN1 when it is an obvious
noun (meaning 'value'). Typically this occurs following expressions
of quantity, whether or not the quantity is expressed by a possessive
or genitive (e.g. its, 's).
Baker showed his worth_NN1 for Ipswich in the 20th minute [CF9.102]
hundreds of pounds' worth_NN1 of damage. [A0H.15]
£2,500 WORTH_NN1 OF PRIZES [ECJ.1147]
Features of spoken corpus tagging
The spoken and written texts of the BNC have been tagged in the same way, except that the following
phenomena occur almost entirely in the spoken part of the corpus.
Individual letters
Words spelt out by a speaker as individual letters have been transcribed letter by letter,
each being tagged ZZ0.
children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3805]
...ten ninety minute tapes! T_ZZ0 D_ZZ0 K_ZZ0 tapes! [KPG.3534-5]
In the written corpus these items would nearly always be written
and tagged as whole words (ENT or TDK in the above
example).
Truncated words
Words that are left incomplete by the speaker are enclosed
within an XML
trunc element and tagged UNC. Examples include bathr and
su in the following
The <trunc>bathr_UNC</trunc> er you can't
beat a white bathroom suite anyway. [KCF.721]
Aye, they only came in the <trunc>su_UNC</trunc> they only came up here in the summer. [GYS.127]
Partial repetition of multiwords
Occasionally in spoken data it happens that only a portion of a
multiword sequence is repeated. In this example, the word sort is used twice; in both cases
it appears to function not as a separate word but as part of the multiword adverb
sort of.
we're going to sort sort of summarize... [G5X.106]
We treat the first sort as an incomplete multiword, and tag it UNC
(rather like truncated words, above). The complete multiword sort of is tagged AV0, as usual.
we're going to sort_UNC sort of_AV0 summarize...
Further examples of incomplete multiwords are the as long in as long as (conjunction), the of in because of (preposition), and the in in in general (adverb) below:
As_UNC long_UNC As_CJS long as everyone recognizes that for an area of that size... [J9T.258]
because_PRP of the <pause> of_UNC the drought. When we were away it didn't get watered in. [KCH.982]
I know that in_UNC in_UNC in_AV0 general, in in in erm, imperial measure, it is <trunc> f </trunc> five feet eight inches [JK1.480]
The second example shows that when words are repeated, the incomplete
portion of a multiword is not necessarily immediately adjacent to the
fully formed multiword. In the last example, the three instances of
in before erm, imperial measure have not been
analysed as part of the multiword in general; they are
instead tagged as ordinary words (in this case, ambiguous between
preposition and prepositional adverb: PRP-AVP). There are a few cases
where the tagger has probably been over-zealous in spotting repeated
portions of multiwords:
What happens now_UNC, now_CJS that you are winched down? [HEF.9]
Here, the first instance of now would probably have been better
interpreted as a single-word adverb (= 'at this time'), not as part of
the multiword conjunction now that. (In our experience, human analysts
too sometimes have difficulty resolving ambiguities such as these,
especially when using the plain orthographic transcriptions of the
BNC, with no direct access to the original sound recordings.)
Er and erm inside multiwords
Generally (in both written and spoken texts) the pause fillers er
and erm take the tag UNC. This applies also
when they appear within a multiword
sequence, as in every er so often. The code assigned to the surrounding
mw element is identical to that which would have been
assigned if the filler were not present.
And your homework was handed in every er so often_AV0, you know [G64.152]
something had gone wrong with the pause gas pipes because erm of_PRP pause flooding. [KB8.5356]
these kind of books were, er, generally er, at, at er best_AV0 ignored [HUN]
Note that in the last example the word at preceding the multiword
at er best is treated as a partial repetition of that multiword, and
therefore tagged UNC.
POS-tagging Error Rates
This section reports on the accuracy of the results of the improved tagging programs.
Levels of estimation
Based on the findings from the 50,000-word test sample, the estimated ambiguity and error rates for the BNC are shown below in three different degrees of detail:
First, as a general assessment of accuracy, the estimated
rates are given for the whole corpus. (See below.)
Secondly, separate estimates of ambiguity rates and error rates are given for each of the 57 word tags in the corpus. This will enable users of the corpus to assign appropriate degrees of reliability to each tag. Some tags are always correct; other tags are quite often erroneous. For example, the tag VDD stands for a single form of the verb do: the form did. Since the spelling did is unambiguous, the chances of ambiguity or error, in the use of the tag VDD, are virtually nil. On the other hand, the tag VVB (base finite form of a lexical verb) is not only quite frequent, but also highly prone to ambiguity and error. 15 per cent of the occurrences of VVB are errors - a much higher error rate than any other tag. (See below.)
Thirdly, separate estimates of ambiguity rates and error
rates are given for ‘wrong-tag--right-tag’ pairings XXX, YYY,
consisting of (i) the actually-occurring erroneous tag XXX, and (ii)
the correct tag YYY which should have occurred in its place. However,
because the number of possible tag-pairs is large (57²), and most of
these tag-pairs have few or no errors, only the more common pairings
of erroneous tag and correct tag are separately listed, with their
estimated probability of occurrence. This list of tag-pairings will
help users further, in enabling them to estimate not merely the
reliability of a tag, but, if that tag is incorrect, the likelihood
that the correct tag would have been some other particular tag. In
this way, the frequency of grammatical word classes, or individual
words in those classes, can be estimated more accurately for the whole
BNC. (See below.)
Presentation of Ambiguity Rates and Error Rates (fine-grained mode of calculation)
In this section, we examine ambiguities and errors using a
‘fine-grained’ mode of calculation, treating each error as of equal
importance to any other error. In we look at
the same data in terms of a ‘coarse-grained’ mode of calculation,
ignoring errors and ambiguities involving subcategories of the same
part of speech.
Overall estimated ambiguity and error rates: based on the 50,000 word sample
As the following table shows, the ambiguity rate varies
considerably between written and spoken texts. (However, note that
the calculation for speech is based on a small sample of 5,000
words.)
Estimated ambiguity and error rates for the whole corpus (fine-grained calculation)

                 Sample tag count   Ambiguity rate (%)   Error rate (%)
Written texts    45,000             3.83%                1.14%
Spoken texts      5,000             3.00%                1.17%
All texts        50,000             3.75%                1.15%
It will be noted that written texts on the whole have a higher ambiguity rate, whereas spoken texts have a slightly greater error rate.
The success of an automatic tagger is sometimes represented in terms
of the information-retrieval measures of precision and recall, rather
than ambiguity rate and error rate as in . Precision is the
extent to which incorrect tags are successfully discarded from the
output. Recall is the extent to which all correct tags are
successfully retained in the output of the tagger, allowing, however,
for more than one reading to occur for one word (i.e. ambiguous
tagging is permitted). According to these measures, the success of
the tagging is as follows:
                 Precision   Recall
Written texts    96.17%      98.86%
Spoken texts     97.00%      98.83%
All texts        96.25%      98.85%
However, from now on we will continue to use 'ambiguity rate' and 'error rate', which appear to us more transparent. (As the figures show, precision here is simply 100 per cent minus the ambiguity rate, and recall is 100 per cent minus the error rate.)
Estimated ambiguity and error rates for each tag (fine-grained mode of calculation)
The estimates for individual tags are again based on the 50,000-word sample, and the ambiguity rate for each tag is based on the number of ambiguity tags which begin with that tag. The table also specifies the estimated likelihood that a given tag, in the first position of the ambiguity tag, is the correct tag.
In , column (b) shows the overall
frequency of particular tags (not including ambiguity tags). Column
(c) gives the overall occurrence of ambiguity tags, as well as of
particular ambiguity tags, beginning with a given tag. (Ambiguity
tags marked * are less ‘serious’ in that they apply to two
subcategories of the same part of speech, such as past tense and past
participle of the verb - see 4.1 below.) Column (d) shows which tags
are more or less likely to be found as the first part of an ambiguity
tag. For example, both NP0 and VVG have an
especially high incidence of ambiguity tags. Column (e) tells us,
given that we have observed an ambiguity tag, the likelihood
of the first tag's being correct. Overall, there is more than a 3-1
chance that the first tag will be correct; but there are some
exceptions, where the chances of the first tag’s being correct are
much lower: for example, PNI (indefinite pronoun). Note
that (f) and (g) exclude errors where the first tag of an ambiguity
tag is wrong; contrast , and column (c), below.
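To make these modes of calculation concrete, here is the arithmetic for the AJ0 row of the table, using the table's own figures (b = 3412, c = 338, f = 46):

    ambiguity rate (d) = c / (b + c) = 338 / (3412 + 338) = 9.01%
    error rate (g)     = f / b       = 46 / 3412          = 1.35%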
Estimated ambiguity rates and error rates by tag

(a)    (b)         (c)            (d)             (e)                   (f)     (g)
Tag    Single tag  Ambiguity tag  Ambiguity rate  1st tag of ambiguity  Error   Error rate
       count       count          (%)             tag correct (% of     count   (%)
                                  (c / (b + c))   all ambiguity tags)           (f / b)

AJ0    3412        all 338        9.01%           282 (83.43%)          46      1.35%
       (AJ0-AV0 48, AJ0-NN1 209, AJ0-VVD 21, AJ0-VVG 28, AJ0-VVN 32)
AJC    142         0              0.0%                                  4       2.82%
AJS    26          0              0.0%                                  2       7.69%
AT0    4351        0              0.0%                                  2       0.05%
AV0    2450        all 45         1.80%           37 (82.22%)           57      2.33%
       (AV0-AJ0 45)
AVP    379         all 44         10.40%          34 (77.27%)           6       1.58%
       (AVP-PRP 44)
AVQ    157         all 10         5.99%           10 (100.00%)          9       5.73%
       (AVQ-CJS 10)
CJC    1915        0              0.0%                                  3       0.16%
CJS    692         all 39         5.34%           30 (76.92%)           18      2.60%
       (CJS-AVQ 26, CJS-PRP 13)
CJT    236         all 28         10.61%                                3       1.27%
       (CJT-DT0 28)
CRD    940         all 1          0.11%           0 (0.00%)             0       0.00%
       (CRD-PNI 1)
DPS    787         0              0.0%                                  0       0.00%
DT0    1180        all 20         1.67%           16 (80.00%)           19      1.61%
       (DT0-CJT 20)
DTQ    370         0              0.0%                                  0       0.00%
EX0    131         0              0.0%                                  1       0.76%
ITJ    214         0              0.0%                                  2       0.93%
NN0    270         0              0.0%                                  10      3.70%
NN1    7198        all 514        6.66%           395 (76.84%)          86      1.19%
       (NN1-AJ0 130, NN1-NP0 92*, NN1-VVB 243, NN1-VVG 49)
NN2    2718        all 55         1.98%           48 (87.27%)           30      1.10%
       (NN2-VVZ 55)
NP0    1385        all 264        16.01%          224 (84.84%)          31      2.24%
       (NP0-NN1 264*)
ORD    136         0              0.0%                                  0       0.00%
PNI    159         all 8          4.79%           3 (37.50%)            5       3.14%
       (PNI-CRD 8)
PNP    2646        0              0.0%                                  0       0.00%
PNQ    112         0              0.0%                                  0       0.00%
PNX    84          0              0.0%                                  0       0.00%
POS    217         0              0.0%                                  5       2.30%
PRF    1615        0              0.0%                                  0       0.00%
PRP    4051        all 166        3.94%           154 (92.77%)          24      0.59%
       (PRP-AVP 132, PRP-CJS 34)
TO0    819         0              0.0%                                  6       0.73%
UNC    158         0              0.0%                                  4       2.53%
VBB    328         0              0.0%                                  1       0.30%
VBD    663         0              0.0%                                  0       0.00%
VBG    37          0              0.0%                                  0       0.00%
VBI    374         0              0.0%                                  0       0.00%
VBN    133         0              0.0%                                  0       0.00%
VBZ    640         0              0.0%                                  4       0.63%
VDB    87          0              0.0%                                  0       0.00%
VDD    71          0              0.0%                                  0       0.00%
VDG    10          0              0.0%                                  0       0.00%
VDI    36          0              0.0%                                  0       0.00%
VDN    20          0              0.0%                                  0       0.00%
VDZ    22          0              0.0%                                  0       0.00%
VHB    150         0              0.0%                                  1       0.67%
VHD    258         0              0.0%                                  0       0.00%
VHG    16          0              0.0%                                  0       0.00%
VHI    119         0              0.0%                                  0       0.00%
VHN    9           0              0.0%                                  0       0.00%
VHZ    116         0              0.0%                                  1       0.86%
VM0    782         0              0.0%                                  3       0.38%
VVB    560         all 84         13.04%          56 (66.67%)           84      15.00%
       (VVB-NN1 84)
VVD    970         all 90         8.49%           62 (58.89%)           50      5.15%
       (VVD-AJ0 11, VVD-VVN 79*)
VVG    597         all 132        18.11%          112 (84.84%)          9       1.51%
       (VVG-AJ0 83, VVG-NN1 49)
VVI    1211        0              0.0%                                  7       0.58%
VVN    1086        all 158        12.70%          113 (71.52%)          27      2.49%
       (VVN-AJ0 50, VVN-VVD 108*)
VVZ    295         all 26         8.10%           14 (53.85%)           11      3.73%
       (VVZ-NN2 26)
XX0    363         0              0.0%                                  0       0.00%
ZZ0    75          0              0.0%                                  3       4.00%
Estimated error rates specifying the incorrect tag and the correct tag (fine-grained calculation)
The next table, , gives the frequency, as
a percentage, of error-prone tag-pairs where XXX is the incorrect tag
and YYY is the correct tag which should have occurred in its
place. In the third column, the number of the specified error-type is
listed, as a frequency count from the sample of 50,000 words. In the
fourth column, this is expressed as a percentage of all the tagging
errors of word category XXX (in column
(f)). The fifth column answers the question: if tag XXX occurs, what
is the likelihood that it is an error for tag YYY? Where the number
of occurrences of a given error-type is less than 5 (i.e. 1 in 10,000
words), they are ignored. Hence, is not exhaustive: only the more likely error-types are listed. In the second column, we add, where useful, the individual words which trigger these errors.
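For instance, taking the first row of the table (incorrect AJ0 corrected to AV0, 12 occurrences), and recalling from the preceding table that AJ0 has 46 errors and 3412 occurrences in the sample:

    column (4) = 12 / 46   = 26.1%
    column (5) = 12 / 3412 = 0.4% (approximately)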
Estimated frequency of selected tag-pairs

(1) Incorrect  (2) Correct            (3) No. of     (4) % of all incorrect  (5) % of all
tag XXX        tag YYY                occurrences    uses of tag XXX         tags XXX

AJ0            AV0                    12             26.1%                   0.4%
               NN1                    12             26.1%                   0.4%
               NP0                    5              10.9%                   0.1%
               VVN                    8              17.4%                   0.2%
AV0            AJ0                    6              10.5%                   0.2%
               AJC                    8              14.0%                   0.3%
               DT0                    24             42.1%                   1.0%
               EX0 (there)            5              8.8%                    0.2%
               PRP                    5              8.8%                    0.2%
AVQ            CJS (when, where)      6              66.7%                   3.8%
CJS            PRP                    10             55.6%                   1.4%
DT0            AV0                    15             78.9%                   1.3%
NN1            AJ0                    13             15.1%                   0.2%
               NN0*                   8              9.3%                    0.1%
               NP0*                   22             25.6%                   0.3%
               UNC                    9              10.5%                   0.2%
               VVI                    13             15.1%                   0.2%
NN2            NP0*                   14             46.7%                   0.5%
NP0            NN1*                   10             32.3%                   0.7%
               NN0*                   5              16.1%                   0.4%
PRP            AV0                    7              29.2%                   0.2%
               AVP                    5              20.8%                   0.1%
TO0            PRP (to)               6              100.0%                  0.7%
VVB            AJ0                    7              8.3%                    1.3%
               NN1                    7              8.3%                    1.3%
               VVI*                   55             65.5%                   9.8%
VVD            AJ0                    6              12.0%                   0.6%
               VVN*                   44             88.0%                   4.5%
VVG            NN1                    9              100.0%                  1.5%
VVI            NN1                    5              71.4%                   0.4%
VVN            AJ0                    7              25.9%                   0.6%
               VVD*                   17             63.0%                   1.6%
VVZ            NN2                    8              72.7%                   2.7%
As in the previous table, the asterisk * indicates a 'less serious' error, in which the erroneous and correct tags belong to the same major category or part of speech. As the table shows, the most frequent specific error types are within the verb category: VVB for VVI (55 occurrences, or 9.8% of all VVB tags) and VVD for VVN (44 occurrences, or 4.5% of all VVD tags).
A further mode of calculation: ignoring subcategories of the same part of speech
Presentation of Ambiguity and Error Rates (coarse-grained calculation)
Yet a further way of looking at the ambiguities and errors in the corpus is to make a coarse-grained calculation in counting these phenomena. In a fine-grained measurement, which is the one assumed up to now, each tag is considered to define its own word class, different from all other word classes. Using the coarse-grained calculation, on the other hand, we consider words to belong to different word classes (parts of speech) only when the major category is different. If we consider the pair NN1 (singular common noun) and NP0 (proper noun), the coarse-grained calculation says that the ambiguity tag NN1-NP0 or NP0-NN1 does not show tagging uncertainty, since both the proposed tags agree in categorizing the word as the same part of speech (a noun). So this does not add to the ambiguity rate. Similarly, the coarse-grained point of view on error is that, if a word is tagged as NN1 when it should be NP0, or vice versa, this is not an error, because both tags are within the noun category. To summarize: in the fine-grained calculation, minor differences of wordclass count towards the ambiguity and error rates; in the coarse-grained calculation, they do not. A sketch of this comparison follows.
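The following sketch (illustrative Python; the mapping covers only the tags discussed in this section, not the full C5 tagset) shows how the coarse-grained comparison can be pictured: map each tag to its major part of speech before comparing.

    # Illustrative mapping from C5 tags to major parts of speech
    # (only the tags discussed in this section are included).
    MAJOR = {
        "NN0": "NOUN", "NN1": "NOUN", "NN2": "NOUN", "NP0": "NOUN",
        "VVB": "VERB", "VVD": "VERB", "VVG": "VERB",
        "VVI": "VERB", "VVN": "VERB", "VVZ": "VERB",
        "AJ0": "ADJ", "AV0": "ADV",
    }

    def coarse_error(assigned, correct):
        """True only when the two tags disagree on major part of speech."""
        return MAJOR.get(assigned, assigned) != MAJOR.get(correct, correct)

    assert not coarse_error("NN1", "NP0")  # both nouns: not counted
    assert coarse_error("AJ0", "AV0")      # adjective vs adverb: counted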
In this section, the same calculations are made as in section 3,
except that errors and ambiguities which are confined within a major
category (noun, verb, etc.) are ignored. In practice, most of the
errors and ambiguities of this kind come from the difficulty the
tagger finds in recognizing the difference between NN1 (singular
common noun) and NP0 (proper noun), between VVD (past tense lexical
verb) and VVN (past participle lexical verb), and between VVB (finite
present tense base form, lexical verb) and VVI (infinitive lexical
verb). Thus the ambiguity tags NN1-NP0, VVD-VVN and their mirror
images do not occur in the relevant table () below. However,
since there are no ambiguity tags for VVB and VVI, the problem of
distinguishing these two shows up only in the error calculation.
The
three tables in this section correspond with the three tables in the
preceding section.
Estimated ambiguity and error rates for the whole corpus

                 Sample tag count   Ambiguity rate (%)   Error rate (%)
Written texts    45,000             2.78%                0.69%
Spoken texts      5,000             2.67%                0.87%
All texts        50,000             2.77%                0.71%
It will be noted from that this method of
calculation reduces the overall ambiguity rate by c.1 per cent, and
the overall error rate by c.0.5 per cent. We will not present
coarse-grained tables corresponding to and
above: these
tables would be unchanged from the fine-grained calculation, except
that the rows marked with an asterisk (*) would be deleted, and the
other calculations changed as necessary.
Different modes of calculation: eliminating ambiguities
Given that the elimination of errors was beyond our capability
within the time frame and budget we had available, the corpus in its
present form, containing ambiguity tags as well as a small proportion
of errors, is designed for what we believe will be the most common
type of user, who will find it easier to tolerate ambiguity than
error. However, other users may prefer a corpus which does not
contain ambiguities, even though its error rate is higher. For this
latter type of user, the present corpus is easy to interpret as a
corpus free of ambiguities, simply by deleting or ignoring the second
tag of any ambiguity tag, and accepting the first tag as the only
one. In what follows, we therefore allow two modes of calculation: in
addition to the "safer" mode, in which ambiguities are allowed and
consequently errors are relatively low, we allow a "riskier" mode in
which ambiguities are abolished, and errors are more frequent. In
fact, if ambiguity tags are eliminated, the overall error rate rises
to almost 2 per cent.
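In practical terms, converting the corpus to this "riskier" mode amounts to keeping only the first member of each hyphenated ambiguity tag. A minimal sketch in Python:

    def riskier_tag(c5):
        """Collapse an ambiguity tag such as 'VVD-VVN' to its first
        (more likely) member; plain tags pass through unchanged."""
        return c5.split("-", 1)[0]

    # riskier_tag("VVD-VVN") -> "VVD"; riskier_tag("NN1") -> "NN1"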
Estimated error rates for the whole corpus

                 Sample tag count   Error rate (%)
Written texts    45,000             2.01%
Spoken texts      5,000             1.92%
All texts        50,000             2.00%
The following table gives an error count (c) for each tag:
i.e. the number of errors in the 50,000 word sample where that tag
was the erroneous tag. [Cf. the "safer" error count in ,
column (f).] In addition, each tag has a correction count (d):
i.e. the number of erroneous tags for which that tag was the correct
tag. If we subtract the Error count (c) from the Tag count (b), and
add the Correction count (d) to the result, we arrive at the "Real
tag count" (e) representing the number of occurrences of that tag in
the corrected sample corpus. Not included in the table is the small
number of ‘multiword’ errors which resulted in two tags being
replaced by one (error count), or one tag being replaced by two
(correction count), due to the incorrect non-use or use of multiword
tags. The last column divides the error count by the tag count to
provide the error rate (as a percentage).
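For example, for the AJ0 row of the table below:

    real tag count (e) = b - c + d   = 3750 - 102 + 132   = 3780
    error rate (f)     = (c / b) x 100 = (102 / 3750) x 100 = 2.72%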
Estimated error rates (by tag)

(a)    (b)         (c)           (d)                (e)              (f)
Tag    Tag count   Error count   Correction count   Real tag count   Error rate (%)
                                                    (b - c + d)      (c / b) x 100

AJ0    3750        102           (132)              3780             2.72%
AJC    142         4             (12)               150              2.82%
AJS    26          2             (0)                24               7.69%
AT0    4351        2             (3)                4352             0.05%
AV0    2495        65            (67)               2497             2.61%
AVP    423         16            (17)               424              3.78%
AVQ    167         9             (6)                164              5.39%
CJC    1915        3             (1)                1913             0.16%
CJS    731         27            (5)                709              3.69%
CJT    264         3             (15)               276              1.14%
CRD    940         1             (11)               950              0.11%
DPS    787         0             (0)                787              0.00%
DT0    1200        23            (29)               1206             1.92%
DTQ    370         0             (0)                370              0.00%
EX0    131         1             (5)                135              0.76%
ITJ    214         2             (2)                214              0.93%
NN0    270         10            (16)               276              3.70%
NN1    7712        205           (152)              7659             2.66%
NN2    2773        37            (29)               2765             1.33%
ORD    136         0             (2)                138              0.00%
NP0    1649        71            (102)              1680             4.31%
PNI    167         10            (1)                158              5.99%
PNP    2646        0             (1)                2647             0.00%
PNQ    112         0             (0)                112              0.00%
PNX    84          0             (1)                85               0.00%
POS    217         5             (6)                218              2.30%
PRF    1615        0             (0)                1615             0.00%
PRP    4217        36            (45)               4226             0.85%
TO0    819         6             (1)                814              0.73%
UNC    158         4             (29)               183              2.53%
VBB    328         1             (0)                327              0.30%
VBD    663         0             (0)                663              0.00%
VBG    37          0             (0)                37               0.00%
VBI    374         0             (0)                374              0.00%
VBN    133         0             (0)                133              0.00%
VBZ    640         4             (5)                641              0.63%
VDB    87          0             (0)                87               0.00%
VDD    71          0             (0)                71               0.00%
VDG    10          0             (0)                10               0.00%
VDI    36          0             (0)                36               0.00%
VDN    20          0             (0)                20               0.00%
VDZ    22          0             (0)                22               0.00%
VHB    150         1             (0)                151              0.67%
VHD    258         0             (0)                258              0.00%
VHG    16          0             (0)                16               0.00%
VHI    119         0             (1)                120              0.00%
VHN    9           0             (0)                9                0.00%
VHZ    116         1             (0)                115              0.86%
VM0    782         3             (0)                779              0.38%
VVB    644         112           (13)               545              17.39%
VVD    1060        78            (60)               1042             7.36%
VVG    729         29            (29)               729              3.98%
VVI    1211        7             (73)               1277             0.57%
VVN    1244        72            (87)               1259             5.79%
VVZ    321         23            (12)               310              7.17%
XX0    363         0             (0)                363              0.00%
ZZ0    75          3             (4)                76               4.00%
It is clear from this table that the amount of error in the
tagging of the corpus varies greatly from one tag to another. The
most error-prone tag, by a large margin, is VVB, with
more than 17 per cent error, while many of the tags are associated
with no errors at all, and well over half the tags have less than a 1
per cent error rate.
The final table gives figures for the third level of detail, where we
itemise individual tag pairs XXX, YYY, where XXX is the incorrect
tag, and YYY is the correct one which should have appeared but did
not. Only those pairings which account for 5 or more errors are
listed. This table differs from in that here
the second tags of ambiguity tags are not taken into account
("riskier mode"). It will be seen that the errors which occur tend to
fall into a relatively small number of major categories.
The percentages in columns 4 and 5 of this table are calculated
with respect to the figures given in .
Estimated frequency of selected tag-pairs

Incorrect   Correct                  No. of        % of all incorrect   % of all
tag XXX     tag YYY                  occurrences   uses of tag XXX      tags XXX

AJ0         AV0                      22            21.57%               0.59%
            NN1                      41            40.19%               1.09%
            NP0                      5             4.90%                0.13%
            VVG                      14            13.73%               0.37%
            VVN                      14            13.73%               0.37%
AV0         AJ0                      9             13.85%               0.36%
            AJC                      8             12.31%               0.32%
            DT0                      26            40.00%               1.04%
            EX0 (there)              5             7.69%                0.20%
            PRP                      6             9.23%                0.24%
AVP         CJT                      6             94.12%               1.42%
AVQ         CJS (when, where)        6             66.67%               3.59%
CJS         PRP                      15            55.56%               2.05%
DT0         AV0 (much, more, etc.)   15            65.22%               1.25%
NN1         AJ0                      63            30.73%               0.82%
            NN0                      8             3.90%                0.10%
            NP0                      74            36.10%               0.96%
            UNC                      9             4.39%                0.12%
            VVB                      9             4.39%                0.12%
            VVG                      13            6.34%                0.17%
            VVI                      13            6.34%                0.17%
NN2         NP0                      14            37.84%               0.50%
            UNC                      9             24.32%               0.32%
            VVZ                      10            27.02%               0.36%
NN0         UNC                      7             70.00%               2.59%
NP0         NN1                      50            70.42%               3.03%
            NN2                      5             7.04%                0.30%
PNI         CRD (one)                9             90.00%               5.39%
PRP         AV0                      8             22.22%               0.19%
TO0         PRP (to)                 6             100.00%              0.73%
VVB         AJ0                      7             6.25%                1.09%
            NN1                      35            31.25%               5.43%
            VVI                      55            49.11%               8.54%
            VVN                      5             4.46%                0.85%
VVD         AJ0                      14            17.95%               1.32%
            VVN                      64            82.05%               6.04%
VVG         AJ0                      11            37.93%               1.51%
            NN1                      18            62.07%               2.47%
VVI         NN1                      5             71.43%               0.41%
VVZ         NN2                      20            86.96%               6.23%
Some of the error types above are associated with one or two particular words, and where these occur they are listed. For example, the AV0 - EX0 type of error occurs invariably with the one word there.
Finally, we list here the text samples used to constitute the
manually-conducted 50,000-word error analysis. Each sample consisted
of 2,000 words taken from the BNC texts listed below, except that two
samples, one of written and one of spoken English, consisted of 1,000
words only. These samples are marked "*" in the list below. The reason
for using half-length samples in two cases was to maintain the
proportion of written and spoken data as 90% - 10%, so as to keep the
proportions of the sample the same as the proportions in the BNC as a
whole. The BNC text files are cited by the three-character code used
in the BNC Users Reference Guide.
Written imaginative writing: G0S, ADY, H7P, GW0, FSF
Written informative writing:
  Natural Science: JXN
  Applied Science: HWV, CEG
  Social Science: CLH, EE8, *A6Y
  World Affairs: A4J, CMT, EE2, EB7
  Commerce and finance: HGP, B27
  Arts: C9U, G1N
  Belief and thought: CA9
  Leisure: EX0, ADR, CE4
Spoken demographic: KBG
Spoken context-governed: D8Y, *FXH
POS-Tagging Workflow
The overall process of creating the wordclass annotation may be
summarized as follows:
A. Tokenization
B. Initial tag assignment
C. Tag selection (disambiguation)
D. Idiom-Tagging
E. Template Tagger
F. Postprocessing: including Ambiguity tagging
The first four phases were carried out automatically, using
CLAWS4, an automatic tagger which developed out of the CLAWS1
automatic tagger (authors: Roger Garside and Ian
Marshall 1983) used to tag the LOB Corpus. The advanced version
CLAWS4 is principally the work of Roger Garside, although many other
researchers at Lancaster have contributed to its performance in one
way or another. Further information about CLAWS4 can be obtained from
Leech, Garside and Bryant 1994 and Garside and Smith 1997. CLAWS4 is a
hybrid tagger, employing a mixture of probabilistic and
non-probabilistic techniques. The fifth and sixth phases used other
systems, described in the appropriate sections below.
A. Tokenization
The first major step in automatic tagging is to divide up the
text or corpus to be tagged into individual (1) word tokens and
(2) orthographic sentences. These are the segments usually
demarcated by (1) spaces and (2) sentence boundaries
(i.e. sentence final punctuation followed by a capital
letter). This procedure is not so straightforward as it might
seem, particularly because of the ambiguity of full stops (which
can be abbreviation marks as well as sentence-demarcators) and
of capital letters (which can signal a naming expression, as
well as the beginning of a sentence). Faults in tokenization
occasionally occur, but rarely cause tagging errors.
In tokenization, an orthographic word boundary (normally a
space, with or without accompanying punctuation) is the default
test for identifying the beginning and end of word-tokens. (See,
however, the next paragraph and
below.) Hyphens are counted as word-internal, so that a
hyphenated word such as key-ring is given just one tag
(NN1). Because of the different ways of writing
compound words, the same compound may occur in three forms: as a
single word written ‘solid’ (markup), as a hyphenated
word (mark-up) or as a sequence of two words (mark
up). In the first two cases, CLAWS4 will give the compound
a single tag, whereas in the third case, it will receive two
tags: one for mark and the other for up.
A set of special cases dealt with by tokenization is the set
of enclitic verb and negative contractions such as 's, 're,
'll and 'nt, which are orthographically attached
to the preceding word. These will be given a tag of their own,
so that (for example) the orthographic forms It's,
they're, and can't are given two tags in
sequence: pronoun + verb, verb + negative, etc. There are also
some 'merged' forms such as won't and dunno,
which are decomposed into more than one word for tagging
purposes. For example, dunno actually ends up with the
three tags for do + n't + know (for a list of these
contracted forms, see ).
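The enclitic-splitting step can be sketched as follows; the contraction tables here are small illustrative fragments, not the actual CLAWS4 lists:

    # Sketch of enclitic splitting during tokenization (illustrative data).
    MERGED = {
        "dunno": ["do", "n't", "know"],   # three tokens, as noted above
        "won't": ["wo", "n't"],
    }
    ENCLITICS = ("n't", "'s", "'re", "'ll")

    def split_token(tok):
        low = tok.lower()
        if low in MERGED:
            return MERGED[low]
        for enc in ENCLITICS:
            if low.endswith(enc) and len(low) > len(enc):
                return [tok[:-len(enc)], tok[-len(enc):]]
        return [tok]

    # split_token("It's") -> ["It", "'s"]; split_token("can't") -> ["ca", "n't"]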
B. Initial assignment of tags
The second stage of CLAWS POS-tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various AJ0 (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VVB, VVI, i.e. as a noun or as a verb; the token broadcast can be tagged as VVB, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (AJ0), as in a broadcast concert.
To find the list of potential tags associated with a word, CLAWS first looks up the word in a lexicon of c.50,000 word entries. This lexicon look-up accounts for a large proportion of the word tokens in a text. However, many rarer words or names will not be found in the lexicon, and are tagged by other test procedures. Some of the other procedures are:
Look for the ending of a word: e.g. words in -ness will normally be nouns.
Look for an initial capital letter (especially when the word is not sentence-initial). Rare names which are not in the lexicon and do not match other procedures will normally be recognized as proper nouns on the basis of the initial capital.
Look for a final -(e)s. This is stripped off, to see if the word otherwise matches a noun or verb; if it does, the word in -s is tagged as a plural noun or a singular present-tense verb.
Numbers and formulae (e.g. 271, *K9, +) are tagged by special rules.
If all else fails, a word is tagged ambiguously as either a noun, an adjective or a lexical verb.
When a word is associated with more than one tag, information is given by the lexicon look-up or other procedures on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form or, where the available numerical data are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.
Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of various tags depends partly on whether the word occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less likely to be a proper noun than an adjective or a common noun (normally written brown). Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.
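The look-up-and-fallback logic of this stage might be sketched as follows; the toy lexicon and the deliberately simplified fallback rules are our own illustration, not the actual CLAWS4 resources:

    # Toy sketch of initial tag assignment (stage B). The real lexicon
    # has c.50,000 entries with graded tag probabilities.
    LEXICON = {
        "various": ["AJ0"],
        "paint": ["NN1", "VVB", "VVI"],
        "broadcast": ["VVB", "VVI", "VVD", "VVN", "NN1", "AJ0"],
    }

    def candidate_tags(word, sentence_initial=False):
        tags = LEXICON.get(word.lower())
        if tags:
            return tags
        if word.endswith("ness"):                       # suffix rule: -ness -> noun
            return ["NN1"]
        if word[:1].isupper() and not sentence_initial:  # unknown capitalised word
            return ["NP0"]
        if word.endswith("s") and word[:-1].lower() in LEXICON:
            return ["NN2", "VVZ"]                       # -(e)s stripping
        return ["NN1", "AJ0", "VVB"]                    # last-resort ambiguity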
C. Tag selection (or disambiguation)
The next stage, logically, is to choose the most probable tag
from any ambiguous set of tags associated with a word token by
tag assignment (but see below). This is another
probabilistic procedure, this time making use of the context in
which a word occurs. A method known as Viterbi
alignment uses the probabilistic estimates available, both
in terms of the tag-word associations and the sequential tag-tag
likelihoods, to calculate the most likely path through the
sequence of tag ambiguities. (The model employed is largely
equivalent to a hidden Markov model.) After tag selection, a
single 'winning tag' is selected for each word token in a
text. (The less likely tags are not obliterated: they follow the
winning tag in descending probability order.) However, the
winning tag is not necessarily the right answer. If the CLAWS
tagging stopped at this point, only c.95-96% of the word-tokens
would be correctly tagged. This is the main reason for including
an additional stage (or rather a set of stages) termed
'idiom-tagging'.
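The tag-selection step itself can be pictured with a schematic Viterbi decoder over a bigram tag model; the data structures and smoothing below are toy stand-ins for CLAWS4's actual statistics:

    import math

    def viterbi(words, lexicon, trans):
        """Choose the most likely tag path through a sentence.
        lexicon: word -> {tag: P(tag | word)}   (from stage B)
        trans:   (prev_tag, tag) -> bigram likelihood
        Toy model; CLAWS4's real statistics are richer than this."""
        best = {t: (math.log(p), [t]) for t, p in lexicon[words[0]].items()}
        for w in words[1:]:
            step = {}
            for t, p in lexicon[w].items():
                score, path = max(
                    (s + math.log(trans.get((pt, t), 1e-6)), pp)
                    for pt, (s, pp) in best.items()
                )
                step[t] = (score + math.log(p), path + [t])
            best = step
        return max(best.values())[1]

    # e.g. viterbi(["the", "paint"], lexicon, trans) might return ["AT0", "NN1"]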
D. Idiom-Tagging
Idiom-tagging is a stage of CLAWS4's operation in which
sequences of words and tags are matched against a
template. Depending on the match, the tags may be
disambiguated or corrected. In practice, there are two main
reasons for idiom-tagging:
The correct tag can only be selected if CLAWS looks at a word+tag sequence as a whole. In tag selection, this was not done, since the program merely used 'bigrams' consisting of two tags in sequence. In other words, idiom-tagging is more powerful than the Viterbi disambiguation algorithm because it is able to operate on a 'window' of several word tokens at once.
There are many cases in English where a sequence of orthographic words is best assigned a single tag. Such cases include because of (a preposition), so long as (a conjunction), and of course (an adverb). These so-called multiwords are the opposite of the contracted forms such as don't and there's, where one orthographic word is assigned more than one tag. Thus idiom-tagging here plays the role of adjusting tokenization to larger units.
Idiom-tagging is a matching procedure which operates on lists of rules which might loosely be termed idioms. Among these are:
a list of multiwords (just described) such as because of, so long as and of course.
a list of place name expressions (e.g. Mount X , where X is some word beginning with a capital).
a list of personal name expressions (e.g. Dr. (X) Y, where X and Y are words beginning with a cap.; the word X may or may not appear in the matching word sequence).
a list of foreign or classical language expressions used in English (e.g. de jure, hoi polloi)
a list of grammatical sequences where there are typically 'slots' in the sequence which may or may not be filled: e.g. Modal + (adverb/negative) + (adverb/negative) + Infinitive. This matches a sequence such as would not necessarily like. The recognition that the word token like here is an infinitive verb (rather than, say, a present-tense verb or a preposition) could not be trusted if the tagger was not equipped with an idiom-tagging component, but had to rely simply on tag-pair probabilities.
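As a simple illustration of the first kind of idiom (multiword retokenization), the following sketch matches the multiwords mentioned above; the matching logic is ours, not the actual CLAWS4 idiom machinery:

    # Illustrative multiword retokenization (stage D); patterns and tags
    # follow the examples in the text.
    MULTIWORDS = {
        ("because", "of"): "PRP",
        ("so", "long", "as"): "CJS",
        ("of", "course"): "AV0",
    }

    def apply_multiwords(tokens):
        out, i = [], 0
        while i < len(tokens):
            for pattern, tag in MULTIWORDS.items():
                n = len(pattern)
                if tuple(t.lower() for t in tokens[i:i + n]) == pattern:
                    out.append((" ".join(tokens[i:i + n]), tag))  # one unit, one tag
                    i += n
                    break
            else:
                out.append((tokens[i], None))  # left for stages B and C to tag
                i += 1
        return out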
The idiom-tagging component of CLAWS is quite powerful in
matching 'template' expressions in which there are wild-card
symbols, Boolean operators and gaps of up to
n words. They are much more variable than idioms in
the ordinary sense, and resemble finite-state networks.
Another important point about idiom-tagging is that it is
split up into two main phases which operate at different points
in the tagging system. One part of the idiom-tagging takes place
at the end of Stage C., in effect retrospectively correcting
some of the errors which would otherwise occur in CLAWS
output. Another part, however, actually takes place
between Stages B. and C. This means it can utilise
ambiguous input and also produce ambiguous output, perhaps
adjusting the likelihood of one tag relative to another. As an
example, consider the case of so long as, which can be
a single grammatical item - a conditional conjunction meaning
'provided that'. The difficulty is that so long as can
also be a sequence of three separate grammatical items: degree
adverb + adjective/adverb + conjunction/preposition. In this
case, the tagging ambiguity belongs to a whole word sequence
rather than a single word, and the output of the idiom-tagging
has to be passed on to the probabilistic tag selection
stage. Hence, although we have called idiom-tagging Stage D,
it is actually split between two stages, one preceding C. and
one following C.
When the text emerges from Stages C. and D., each word has an associated set of one or more tags associated with it, and each tag itself is associated with a probability represented as a percentage. An example is:
entering VVG 86% NN1 14% AJ0 0%
Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.
E. After CLAWS: the Template Tagger
The error rate with CLAWS4 averages around 3% (that is, the error rate based on CLAWS's first-choice tag only). For the BNC Tagging Enhancement project, we decided to concentrate our efforts on the rule-based part of the system, where most of the inroads in error reduction had been made. This involved (a) developing software with more powerful pattern-matching capabilities than the CLAWS Idiomlist, and (b) carrying out a more systematic analysis of errors, to identify appropriate error-correcting rules.
The next program, known as Template Tagger, supplements
rather than supplants CLAWS. It takes a CLAWS output file as
its input, and "patches" any erroneous tags it finds by using
hand-written template rules. (We borrow the term "patching" from
Brill (1992), although for his tagging program the patches are
discovered by an automatic procedure.) Figure 1 above shows where
Template Tagger fits in the overall tagging scheme. Effectively,
it is an elaborate 'search and replace' tool, capable of
matching longer-distance and more variable dependencies than is
possible with the Idiomlist:
it can refer to information at the level of the word, or tag, or by user-defined categories grouping lexical, grammatical, semantic or other related features
it can handle a wide and variable context window, incorporating
repetition of the value in (a) a specified number of times, or indefinitely up to the left or right sentence boundary (or other delimiter) from any given word or tag; and
different levels of optionality: necessarily present, optional, and necessarily excluded.
These features can best be understood by an example. In BNC1 there were quite a number of errors disambiguating prepositions from subordinating conjunctions, in connection with words like after, before, since and so on. The following rule corrects many such cases from subordinating conjunction (CJS) to prepositions (PRP) tags. It applies a basic grammatical principle that subordinating conjunctions mark the start of clauses and generally require a finite verb somewhere later in the sentence.
#AFTER [CJS^PRP] PRP, ([!#FINITE_VB/VVN])16, #PUNC1
The two commas divide the rule into three units, each
containing a word or tag or word+tag combination. Square
brackets contain tag patterns, and a tag following square
brackets is the replacement tag (i.e. the action part of the
rule). #AFTER refers to a list of words like after,
before and since, that have similar grammatical
properties. These words are defined in a separate file; not all
conjunction-preposition words are listed - as, for
instance, can be used elliptically, without the requirement for
a following verb. (See Tagging Guidelines under as). The definition for #FINITE_VB
contains a list of possible POS-tags (rather than word values),
e.g. VVZ/VV0/VM0. Finally, #PUNC1 is a 'hard' punctuation boundary
(one of . : ; ? and ! ). The patching rule can be interpreted
as:
'If a sequence of the following kind occurs:
a word like after, before or since,
which CLAWS has identified as most likely being a
subordinating conjunction, and less likely a preposition;
an interval of up to 16 words, none of which has been tagged
as a finite verb or past participle (NB [! … ] negates the tag
pattern). The repetition value of up to 16 words was arrived at by
trial and error; an occurrence of a finite verb beyond that range was
rarely in the same clause as the #AFTER-type word;
a 'hard' punctuation boundary
then change the conjunction tag to preposition.'
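A rough procedural rendering of this rule is sketched below; the word and tag lists are abbreviated stand-ins for the real #AFTER, #FINITE_VB and #PUNC1 definitions:

    # Sketch of the #AFTER patching rule (abbreviated word/tag lists).
    AFTER_WORDS = {"after", "before", "since", "until"}
    FINITE_VB_OR_VVN = {"VVZ", "VVB", "VVD", "VM0", "VBZ", "VBD", "VVN"}
    HARD_PUNC = {".", ":", ";", "?", "!"}

    def patch_cjs_to_prp(tagged):
        """tagged: list of (word, C5 tag) pairs. Retag an #AFTER word
        from CJS to PRP when a hard punctuation mark follows within 16
        words and no finite verb (or VVN) intervenes -- i.e. there is
        no clause for the word to subordinate."""
        for i, (word, tag) in enumerate(tagged):
            if word.lower() in AFTER_WORDS and tag == "CJS":
                for w, t in tagged[i + 1 : i + 18]:  # up to 16 words + #PUNC1
                    if w in HARD_PUNC:
                        tagged[i] = (word, "PRP")    # rule fires
                        break
                    if t in FINITE_VB_OR_VVN:
                        break                        # finite verb: leave CJS
        return tagged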
The rule doesn't always work accurately, and doesn't cater for all preposition-conjunction errors. (i) It relies to a large extent on CLAWS having correctly identified finite verb tags in the right context of the preposition-conjunction; sometimes, however, a past participle is confused with a past tense form. (We therefore added VVN, ie past participle, as a possible alternative to #FINITE_VB in the second part of the pattern. The downside of this was that Template Tagger ignored some conjunction-preposition errors containing genuine use of VVN in the right context). (ii) The scope of the rule doesn't cover long sentences where more than 16 non-finite-verb words occur after the conjunction-preposition. A separate rule had to be written to handle such cases. (iii) Adverb uses of after, before and since etc. need to be fixed by additional rules.
Targeting and writing the Template rules
The Templates are targeted at the most error-prone
categories introduced (or rather, left unresolved) by CLAWS. As
with the preposition-conjunction example just shown, many
disambiguation errors congregate around pairs of tags, for
example adjective and adverb, or noun and verb. Sometimes a
triple is involved, e.g. a past tense verb (VVD),
past participle (VVN) and adjective
(AJ0) in the case of surprised.
A small team of researchers sought out patterns in the errors by concordancing a training corpus that contained two parallel versions of the tagging: the automatic version produced by CLAWS and a hand-corrected version, which served as a benchmark. A concordance query of the form "tag A | tag B" would retrieve lines where the former version assigned an incorrect tag A and the latter a correct tag B. An example is shown below, in which A is a subordinating conjunction and B a preposition.
the company which have occurred | since CJS [PRP] | the balance sheet date .
nd-green shirt with epaulettes . | Before CJS [PRP] | the show , the uniforms were approved by
rt towards the library catalogue | since CJS [PRP] | the advent of online systems . The overall
ales . There have been no events | since CJS [PRP] | the balance sheet date which materially af
n in demand , adding 13p to 173p | since CJS [PRP] | the end of October . Printing group Linx h
Hugh Candidus of Peterborough . | After CJS [PRP] | the appointment of Henry of Poitou , a sel
boys would be in the Ravenna mud | until CJS [PRP] | the spring . Our landlady obviously liked
ution in treatment brought about | since CJS [PRP] | the arrival of penicillin and antibiotics
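Such a retrieval can be sketched as follows, assuming the two taggings are available as aligned lists of (word, tag) pairs; the function name and data layout are invented for the illustration.

def find_errors(claws_tokens, benchmark_tokens, tag_a, tag_b, context=8):
    """Yield concordance-style lines where CLAWS assigned tag_a but the
    hand-corrected benchmark has tag_b."""
    for i, ((word, auto_tag), (_, gold_tag)) in enumerate(
            zip(claws_tokens, benchmark_tokens)):
        if auto_tag == tag_a and gold_tag == tag_b:
            left = " ".join(w for w, _ in claws_tokens[max(0, i - context):i])
            right = " ".join(w for w, _ in claws_tokens[i + 1:i + 1 + context])
            yield f"{left} | {word} {auto_tag} [{gold_tag}] | {right}"

# e.g. the preposition-conjunction lines above correspond to
# find_errors(claws, gold, "CJS", "PRP")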
By working interactively with the parallel concordance,
sorting on the tags of the immediate context, testing for
significant collocates to the left and right, and generally
applying his/her linguistic knowledge, the researcher can often
detect sufficient commonality between the tagging errors to
formulate a patching rule (or a set of rules) such as that
shown above. It took several iterations of training and testing
to refine the rules to a point where they could be applied by
Template Tagger to the full corpus.
Training and testing were mostly carried out on the BNC Sampler corpus of 2 million words. For less frequent phenomena we needed to use sections from the full BNC. None of the texts used for the tagging error report is included in the Sampler.
It should be said that some categories of error were easier to write rules for than others. Finding productive rules for noun-verb correction was especially difficult, because of the many types of ambiguity between nouns, verbs and other categories, and the widely differing contexts in which they appear. The errors and ambiguity tags associated with NN1-VVB and NN2-VVZ in BNC2 in the error report testify to this problem. Here a more sophisticated lexicon, detailing the selectional restrictions of individual verbs and nouns (and other categories), would undoubtedly have been useful.
Ordering the rules
In some instances the ordering of rules was important. When two rules in the same ruleset compete, the longer match applies. Clashes arise in the case of the multiply ambiguous word as, for instance. Besides the clear grammatical choices between a preposition and a complementiser introducing an adverbial clause, there are many "interfering" idiomatic uses (as well as, as regards, etc.) and elliptical uses (The TGV goes as fast as the Bullet train [sc. goes]). To avoid interference between the rules, we found it preferable to let an earlier pass of the rules handle the more idiomatic (or exceptional) structures, and a later pass deal with the more regular grammatical dependencies.
In many rulesets, however, we found that ordering did not affect the overall result, since we tried to ensure that each rule held 'true' in all cases. Because more than one rule sometimes carried out the same tag change to a particular word, however, the system was not optimised for speed and efficiency.
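The longest-match convention described above can be pictured with a small selection sketch; the (span, match, action) rule representation is invented for the illustration and is much simpler than Template Tagger's actual pattern language.

def select_rule(rules, tokens, i):
    """Return the action of the longest rule whose pattern matches at
    position i, or None if no rule matches.
    rules: (span, matches, action) triples, where matches(tokens, i)
    tests a pattern covering `span` tokens."""
    matching = [(span, action) for span, matches, action in rules
                if matches(tokens, i)]
    if not matching:
        return None
    span, action = max(matching, key=lambda m: m[0])  # the longer match applies
    return action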
Besides the ordering of rules within rulesets, it is worth considering the placement of Template Tagger within the tagging schema (Figure 1). Ideally, the full pattern-matching functionality of Template Tagger would be exploited earlier in the schema, using it in place of the CLAWS Idiomlist not just after statistical disambiguation, where it is undoubtedly necessary, but also before it. In this way Template Tagger could have prevented much unnecessary ambiguity from passing to Stage C above. The reason we did not do this was pragmatic: Template Tagger was in fact developed as a general-purpose annotation tool (see Fligelstone, Pacey and Rayson 1997), and not exclusively for the POS-tagging of BNC2. In future versions of the tagging software we hope to integrate Template Tagger more fully with CLAWS.
F. Postprocessing, including Ambiguity tagging
The post-processing phase has the task of producing output in the form which users will find most usable.
The text is produced in a horizontal format, so that it can be read from left to right across the page or across the screen.
The tags are enclosed in angle-brackets as follows: <w NN1>, according to the standard TEI-based CDIF mark-up of the British National Corpus.
Normally the word will be output with a single tag - the one which CLAWS4 calculates to be most probable.
"Ambiguity tags" (such as NN1-AJ0) are output if the difference between the probability of the first tag and that of the second fails to reach a pre-decided threshold (see the sketch below).
The final phase, "ambiguity tagging", merits a little further discussion. The requirement for such tags is clear when one observes that, even with Template Tagger applied on top of CLAWS, a residuum of error, around 2%, remains in the corpus. By permitting ambiguity tags we are effectively able to "hedge" in many instances that might otherwise have counted as errors - improving the chances of retrieving a particular tag, but at the cost of retrieving other tags as well. We considered that a reasonable goal would be to employ sufficient ambiguity tags to achieve an overall error rate for the corpus of 1%.
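The output decision can be sketched as follows; the probability scores, threshold table and function name are stand-ins invented for the illustration, not CLAWS's internal representation.

def output_tag(word, candidates, thresholds):
    """candidates: ranked (tag, probability) pairs, best first.
    thresholds: minimum probability difference, per (A, B) tag pair,
    required to output the first tag on its own."""
    (tag_a, p_a), (tag_b, p_b) = candidates[0], candidates[1]
    if p_a - p_b < thresholds.get((tag_a, tag_b), 0.0):
        return f"{tag_a}-{tag_b}"    # ambiguity ('portmanteau') tag
    return tag_a                     # single most probable tag

# e.g. output_tag("wanted", [("VVD", 0.60), ("VVN", 0.35)],
#                 {("VVD", "VVN"): 0.30}) returns "VVD-VVN"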
Because CLAWS's reliability in statistical disambiguation varies according to the POS-tags involved, we calculated the thresholds for application of ambiguity tags separately for each relevant tag-pair A-B (where A is CLAWS's first-choice and B its second-choice tag). First, the tag-pairs were chosen according to their error frequencies in a training corpus of 100,000 words. The proportion of A-B errors to the total number of errors indicated how many errors of that type would be allowed in order to achieve the 1% error rate overall; we will refer to this figure as "the target number of errors" for A-B. We then
collected each instance of A-B error, noting the difference in probability score between A and B.
plotted each error against the probability difference
found the threshold on the difference axis that would yield the target number of errors. Below this threshold each instance of A-B would be converted to an ambiguity tag (a sketch of this calibration is given below).
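In the same stand-in terms as the sketch above, the calibration for one tag pair might look like this; the actual procedure was graphical, as just described, rather than programmatic.

def calibrate_threshold(error_diffs, target_errors):
    """error_diffs: the probability difference (p_A - p_B) recorded at
    each A-B error instance in the training corpus. Instances whose
    difference falls below the returned threshold become ambiguity
    tags; those at or above it keep tag A (and remain errors)."""
    diffs = sorted(error_diffs, reverse=True)
    if len(diffs) <= target_errors:
        return 0.0                   # already within the error budget
    if target_errors == 0:
        return diffs[0] + 1e-9       # hedge every A-B instance
    # keep the target number of highest-confidence errors; convert the
    # rest to ambiguity tags (ties make the count approximate)
    return diffs[target_errors - 1]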
As we report under Error
rates, the BNC in fact contains an error rate higher than
1%. This is because some thresholds applied at the 1% rate
incurred a very high frequency of potential ambiguity tags: we
hand-adjusted such thresholds if permitting a slight rise in
errors led to a substantial reduction in the number of
ambiguities. Further comments on stages E. and F. can be found in Smith 1997.
Additional annotation in BNC XML
As noted above, the linguistic annotation of the corpus was
enhanced in the BNC XML edition in three respects:
multiwords and their constituent items are explicitly tagged
using the mw and w XML elements (illustrated below)
an additional wordclass scheme, using a much simplified version
of the C5 tagset, was deployed
lemmatization of each word was carried out automatically on the
basis of manually-defined rules.
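All three enhancements can be seen together in a constructed fragment such as the following, in which the c5 attribute carries the C5 tag, pos the simplified wordclass and hw the lemma; the example is made up for this manual rather than quoted from the corpus.

<w c5="VVD" hw="want" pos="VERB">wanted </w>
<mw c5="AV0">
  <w c5="PRF" hw="of" pos="PREP">of </w>
  <w c5="NN1" hw="course" pos="SUBST">course </w>
</mw>

Here the mw element carries the wordclass of the multiword unit of course as a whole, while the nested w elements retain the tags of its constituent words.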
The simplified wordclass scheme used for the second of these
enhancements is listed in of the manual, where the
mapping between these values and the C5 tags from which they are
derived is also specified.
The lemmatization procedure adopted derives ultimately from work
reported in Beale 1987, as subsequently
refined by others at Lancaster, and applied in a range of projects
including the JAWS program (Fligelstone et al
1996) and the book Word Frequencies in Written and Spoken
English (Leech et al 2001). The basic
approach is to apply a number of morphological rules, combining simple
POS-sensitive suffix stripping rules with a word list of common
exceptions.
This process was carried out during the XML conversion, using code and a set of rules files kindly supplied by Paul Rayson.
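The flavour of this rule-based approach can be suggested with a minimal sketch; the exception list and suffix rules below are tiny invented samples, not the actual Lancaster rule files.

# Minimal sketch of POS-sensitive suffix stripping with an exception list.
EXCEPTIONS = {("went", "VVD"): "go", ("mice", "NN2"): "mouse"}

SUFFIX_RULES = [          # (C5 tag, suffix, replacement), tried in order
    ("NN2", "ies", "y"),  # ladies  -> lady
    ("NN2", "s",   ""),   # tables  -> table
    ("VVD", "ied", "y"),  # carried -> carry
    ("VVD", "ed",  ""),   # walked  -> walk
    ("VVG", "ing", ""),   # walking -> walk
    ("VVZ", "s",   ""),   # walks   -> walk
]

def lemmatize(word, c5_tag):
    w = word.lower()
    if (w, c5_tag) in EXCEPTIONS:              # exception list takes priority
        return EXCEPTIONS[(w, c5_tag)]
    for tag, suffix, repl in SUFFIX_RULES:     # then POS-sensitive stripping
        if c5_tag == tag and w.endswith(suffix):
            return w[: len(w) - len(suffix)] + repl
    return w                                   # default: the word form itself

# e.g. lemmatize("wanted", "VVD") -> "want"; lemmatize("went", "VVD") -> "go"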