Wordclass Tagging in BNC XML
The wordclass tagging has not changed significantly between the BNC World edition (2001) and the BNC XML edition (2006). In particular, no attempt has been made to completely retag the corpus, desirable though this might be. Changes have been made in the treatment of multiword units and some additional annotation has been provided (see Additional annotation in BNC XML), but in most respects the wordclass information provided by the corpus now is identical to that provided with the first release of the BNC in 1994.
The BNC is wordclass tagged using a set of 57 tags (known as C5)
which we refer to as the "BNC Basic Tagset". (There are also 4
punctuation tags, excluded from consideration here.) Each C5 tag
represents a grammatical class of words, and consists of a partially
mnemonic sequence of three characters: e.g. NN1 for
"singular common noun".
The BNC, consisting of c.100 million words, was tagged
automatically, using the CLAWS4 automatic tagger developed by Roger
Garside at Lancaster, and a second program, known as Template Tagger,
developed chiefly by Mike Pacey. (Further details are given below, and also in R. Garside, G. Leech and
T. McEnery, 1997 (eds.), Corpus Annotation: Linguistic
Information from Computer Text Corpora, London: Longman,
chapters 7-9). With such a large corpus, there was no opportunity to
undertake post-editing, i.e. disambiguation and
correction of tagging errors produced by the automatic tagger, and so
the errors (about 1.15 per cent of all words) remain in the
distributed form of the corpus. In addition, the distributed form of
the corpus contains ambiguous taggings (c.3.75 per cent of all words),
shown in the form of ambiguity tags (also called ‘portmanteau tags’),
consisting of two C5 tags linked by a hyphen, e.g.
VVD-VVN. These tags indicate that the automatic
tagger was unable to determine, with sufficient confidence, which was
the correct category, and so left two possibilities for users to
disambiguate themselves, if they should wish to do so. For example, in
the case of VVD-VVN, the first (preferred) tag, say for a word such as
wanted, is VVD: past tense of lexical verb; and the second (less
favoured) tag is VVN: past participle of lexical verb. On the whole,
the likelihood of the first tag of an ambiguity tag being correct is
over 3 to 1; see, however, the details of individual tags in Table 23, "Estimated ambiguity and error rates for the whole corpus (fine-grained calculation)", of the error report document.
After the automatic tagging, some manual tagging was undertaken to correct some particularly blatant errors, mainly foreign or classical words embedded in English text. CLAWS is not very successful at detecting these foreign words and tagging them with their appropriate tag (UNC), except when they form part of established expressions such as ad hoc or nom de plume - in which case they are normally given tags appropriate to their grammatical function, e.g. as nouns or adverbs.
The main purpose of the report on estimated error rates is to document the rather small percentage of ambiguities and errors remaining in the tagged BNC, so that users of the corpus can assess the accuracy of the tagging for their own purposes. Since, not surprisingly, we have been unable to inspect each of the 100 million tags in the BNC, we have had to estimate ambiguity rates and error rates on the basis of a manual post-editing of a corpus sample of 50,000 words. The estimate is based on twenty-four 2,000-word text extracts and two 1,000-word extracts, selected so as to be as far as possible representative of the whole corpus.
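The figures quoted above can be checked with a few lines of arithmetic. The sketch below only restates numbers given in the text (corpus size, error rate, ambiguity rate, sample composition). The variable names are illustrative, and the corpus-wide counts are rough projections from the estimated rates, not measured values.

```python
# Figures quoted in the text; the rates are held as "per 10,000 words"
# so that the arithmetic stays in integers.
CORPUS_WORDS = 100_000_000        # c.100 million words
ERRORS_PER_10K = 115              # about 1.15 per cent of all words
AMBIGUOUS_PER_10K = 375           # c.3.75 per cent of all words

# Post-edited sample: 24 extracts of 2,000 words plus 2 extracts of 1,000 words.
sample_words = 24 * 2_000 + 2 * 1_000

# Rough corpus-wide counts implied by the estimated rates.
estimated_errors = CORPUS_WORDS * ERRORS_PER_10K // 10_000
estimated_ambiguous = CORPUS_WORDS * AMBIGUOUS_PER_10K // 10_000

print(sample_words, estimated_errors, estimated_ambiguous)
```

The sample composition does add up to the stated 50,000 words, and the rates correspond to roughly 1.15 million mistagged words and 3.75 million ambiguity-tagged words across the corpus.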
Tokenization: splitting the text into words
Regarding the segmentation of a text into individual word-tokens (called tokenization), our tagging practice in general follows the default assumption that an orthographic word (separated by spaces, with or without punctuation, from adjacent words) is the appropriate unit for wordclass tagging. There are, however, exceptions to this. For example, a single orthographic word may consist of more than one grammatical word: in the case of enclitic verb contractions (as in she’s, they’ll, we’re) and negative contractions (as in don’t, isn’t, won’t), it is appropriate to assign two different wordclass tags to the same orthographic word. A full list of such contracted forms recognized by CLAWS and preserved in the XML markup is given in section Contracted forms and multiwords.
Also quite frequent is the opposite circumstance, where two or more
orthographic words are given a single wordclass tag: e.g. multiword
adverbs such as of course and in short, and
multiword prepositions such as instead of and up to
are each assigned a single word tag (AV0 for adverbs,
PRP for prepositions). Sometimes, whether such
orthographic sequences are to be treated as a single word for tagging
purposes depends on the context and its interpretation. In
short is in some circumstances not an adverb but a sequence of
preposition + adjective (eg. in short, sharp bursts ). Up
to in some contexts needs to be treated as a sequence of two
grammatical words: adverbial-particle +
preposition-or-infinitive-marker (eg. We had to phone her up to
get the code.).
In one respect, we have allowed the orthographic occurrence of
spaces to be criterial. This is in the tagging of compound words such
as markup, mark-up and mark up. Since English
orthographic practice is often variable in such matters, the same
‘compound’ expression may occur in the corpus tagged as two words (if
they are separated by spaces) or as one word (if the sequence is
printed solid or with a hyphen). Thus mark up (as a noun)
will be tagged NN1 AVP, whereas markup or
mark-up will be tagged simply NN1.
Tagging Guidelines and Borderline Cases
Many detailed decisions have to be made in deciding how to draw the line between the correct and the incorrect assignment of a tag. So that the concept of what is a ‘correct’ or ‘accurate’ annotation can be determined, there have to be detailed guidelines of tagging practice. These constitute the Wordclass Tagging Guidelines.
For example, a word ending in -ing may be tagged as a present participle (VVG), an adjective (AJ0) or a singular common noun (NN1). The difference may be illustrated by the three examples:
The assignment of an example of ‘Verb+ing’ to the adjective category relies heavily on a semantic criterion, viz. the ability to paraphrase Verb+ing Noun by ‘Noun + Relative Clause that/which/who be Verb+ing’ or ‘that/which/who Verb(s)’ (e.g. the rising sun = the sun which is/was rising; a working mother = a mother who works). These contrast with a case such as dining table, where the first word dining is judged to be a noun. The reason for this is that the paraphrasable meaning of the expression is not ‘a table which is/was dining or dines’, but rather ‘a table (used) for dining’. Although somewhat arbitrary, this relative clause test is well established in English grammatical literature, and such criteria are useful in enabling a reasonable degree of consistency in tagging practice to be achieved, so that the success rate of corpus tagging can be checked and evaluated. (See further Adjective vs. noun)
VVG or NN1, and in such a case one would be tempted to leave the ambiguity (VVG-NN1) in the corpus, showing uncertainty where any grammarian would be likely to acknowledge it. However, in our calculations of ambiguity, we have adhered to the common assumption that ideally, all tags should be correctly disambiguated. Other examples of unresolvability from the sample texts are:
In practice, in our post-edited sample, we chose the first tag to be correct in these cases.
Ambiguity tags, and the principle of asymmetry
- AJ0 general adjective (positive)
- AV0 general adverb
- AVP adverbial particle
- AVQ wh- adverb
- CJS general subordinator
- CJT subordinator: that
- CRD cardinal numeral
- DT0 determiner-pronoun
- NN1 singular common noun
- NN2 plural common noun
- NP0 proper noun
- PNI indefinite pronoun
- PRP general preposition
- VVB lexical verb: finite base form
- VVD lexical verb: past tense
- VVG lexical verb: present participle (-ing form)
- VVN lexical verb: past participle
- VVZ lexical verb: -s form
The permitted ambiguity tags are listed in the Wordclass tagging guidelines (Ambiguity Tag list).
It will be noted that overall 30 ambiguity tags are recognized. We
also observe that each ambiguity tag (eg VVD-VVN) is
matched by another ambiguity tag which is its mirror image (eg
VVN-VVD). The ordering of tags is significant: it is the
first of the two tags which is estimated by the tagger to be the more
likely. Hence the interpretation of an ambiguity tag X-Y may be
expressed as follows: ‘There is not sufficient confidence to choose
between tags X and Y; however, X is considered to be more likely.’
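The asymmetry of ambiguity tags is easy to handle in code. The following is a minimal sketch, assuming only that an ambiguity tag is two C5 tags joined by a hyphen, as described above. The function names `split_tag` and `mirror` are illustrative and not part of any BNC tool.

```python
from typing import Optional, Tuple


def split_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Return (preferred, alternative); alternative is None for a plain C5 tag."""
    first, sep, second = tag.partition("-")
    return (first, second) if sep else (first, None)


def mirror(tag: str) -> str:
    """Return the mirror image of an ambiguity tag, e.g. VVD-VVN -> VVN-VVD."""
    preferred, alternative = split_tag(tag)
    if alternative is None:
        raise ValueError(f"{tag!r} is not an ambiguity tag")
    return f"{alternative}-{preferred}"


print(split_tag("VVD-VVN"))
print(mirror("VVD-VVN"))
```

A user who wants a single tag per word without further disambiguation could simply keep the first element returned by `split_tag`, since that is the tag the tagger considered more likely.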
Guidelines to the Wordclass Tagging
Appearance of wordclass tags and citations
Throughout this section, we will show text examples in a format which is different from the XML contained in the corpus but which will highlight the particular tag that is being discussed. The XML tagging (for example, paragraph and pause markers) is not generally relevant to the present discussion and is usually invisible when using concordancing software such as Xaira, BNCWeb, or WordSmith.
As noted above, each word in the corpus is marked by an XML <w> element which provides three additional pieces of information: the wordclass, carried by the c5 attribute; a headword or lemma derived from the word, carried by the hw attribute; and a simplified wordclass derived from the c5 value, carried by the pos attribute.
This is purely as an aid to reading the present document; in the corpus itself, all wordclass tagging is represented using the XML conventions shown above.
...apparently we eat more chocolate than_CJS any other country. [G3U.1000]
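The relationship between the XML markup and the word_TAG display format used in this section can be sketched as follows. The fragment below is a hand-made illustration rather than a corpus extract: the enclosing <s> element and the particular pos values are assumptions, and only the c5, hw and pos attribute names come from the description above.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment in the style of the corpus markup (not a real extract).
SAMPLE = (
    "<s>"
    '<w c5="AT0" hw="the" pos="ART">the </w>'
    '<w c5="AJ0" hw="dry" pos="ADJ">dry </w>'
    '<w c5="NN1" hw="ground" pos="SUBST">ground</w>'
    "</s>"
)


def word_records(xml_fragment):
    """Return (word, c5, hw, pos) for every <w> element in the fragment."""
    root = ET.fromstring(xml_fragment)
    return [
        ((w.text or "").strip(), w.get("c5"), w.get("hw"), w.get("pos"))
        for w in root.iter("w")
    ]


def to_display(xml_fragment):
    """Render every word as word_TAG, using the c5 value as the tag."""
    return " ".join(f"{word}_{c5}" for word, c5, _hw, _pos in word_records(xml_fragment))


print(to_display(SAMPLE))
```

The examples in this section tag only the word under discussion. The sketch tags every word, which is a simplification.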
VVB, but the actual tag as
Note also that we occasionally use invented examples, rather than corpus citations, especially where a contrast between categories is being made.
Tears well_VVB up in my eyes.[BN3.5 *AV0]
Appearance and tagging of contracted forms
doesn't = does_VDZ n't_XX0
dunno = Du_VDB n_XX0 no_VVI
wanna = wan_VVB na_TO0 or wan_VVB na_AT0
gimme = Gim_VVB me_PNP
This procedure sometimes results in strange-looking word divisions, particularly with the fused words. However, they do provide a ready means of comparison with the full forms, such as want_VVB to_TO0 and give_VVB me_PNP.
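As a rough illustration of how such splits can be represented, the sketch below stores the four examples above in a lookup table. The table is limited to those examples and is not the full list of contracted forms. The name `split_contraction` is hypothetical, and the pieces are lower-cased for simplicity, whereas the corpus preserves the original capitalisation.

```python
# Only the four forms shown above; pieces are lower-cased for simplicity.
CONTRACTED = {
    "doesn't": [("does", "VDZ"), ("n't", "XX0")],
    "dunno": [("du", "VDB"), ("n", "XX0"), ("no", "VVI")],
    "wanna": [("wan", "VVB"), ("na", "TO0")],  # na may also be tagged AT0
    "gimme": [("gim", "VVB"), ("me", "PNP")],
}


def split_contraction(word):
    """Return the tagged pieces of a known contraction, else the word untagged."""
    return CONTRACTED.get(word.lower(), [(word, None)])


for form in CONTRACTED:
    pieces = split_contraction(form)
    print(form, "=", " ".join(f"{piece}_{tag}" for piece, tag in pieces))
```

In every entry the pieces concatenate back to the original orthographic word, which is what makes the comparison with the full forms straightforward.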
Appearance and tagging of multiwords
The term `multiwords' denotes multiple-word combinations which CLAWS determines function as one wordclass - for example, a complex preposition, an adverbial, or a foreign expression naturalised into English as a compound noun. In the XML version of the corpus, these sequences are explicitly marked using an XML element (<mw>). The individual orthographic words of which the sequence is composed are also marked, in the same way as other words, using the <w> element.
When displaying examples which contain multiwords in this chapter, we display only the wordclass of the outermost <mw> element. Its boundaries are indicated, where possible, by extra highlighting:
<mw c5="AV0"> <w c5="PRF" hw="of" pos="PREP">of </w> <w c5="NN1" hw="course" pos="SUBST">course </w> </mw>
Of course_AV0 I can. [H9V.212]
The wordclass tags assigned to constituent parts of multiword items are listed in Contracted forms and multiwords. This part of the wordclass tagging was done automatically during the XML conversion process, and has not been checked by CLAWS.
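The display convention for multiwords can be sketched in the same way. The fragment below follows the of course example above and uses the hw attribute name for the headword. The function names are illustrative only.

```python
import xml.etree.ElementTree as ET

MW_XML = (
    '<mw c5="AV0">'
    '<w c5="PRF" hw="of" pos="PREP">of </w>'
    '<w c5="NN1" hw="course" pos="SUBST">course </w>'
    "</mw>"
)


def display_multiword(mw):
    """Show the whole sequence with the wordclass of the outer <mw> only."""
    words = [(w.text or "").strip() for w in mw.iter("w")]
    return f"{' '.join(words)}_{mw.get('c5')}"


def constituent_tags(mw):
    """Return the (word, c5) pairs of the individual <w> elements."""
    return [((w.text or "").strip(), w.get("c5")) for w in mw.iter("w")]


mw = ET.fromstring(MW_XML)
print(display_multiword(mw))
print(constituent_tags(mw))
```

The first function reproduces the display used in this chapter, while the second exposes the constituent tags that were added during XML conversion.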
The stage in between_PRP the original negative and the dupe is called an interpositive [FB8.295]
The truth lies somewhere in between_AV0 [ABK.2834]
but_CJC for_PRP years now darkness has been growing [F99.2027] cf.
which they would not have done but for_PRP the presence of the police. [H81.766]
Words joined by the slash character
Introduction to Word Classes
Singular common nouns take the tag NN1, while plurals take NN2.
An air_NN1 of distinction_NN1
Fifteen miles_NN2 away
Now the government_NN0 is considering new warnings on steroids ... [K24.3057]
... the Government_NN0 are putting people's lives in jeopardy. [A7W.518]
I caught a fish_NN0.[KBW.316]
I had caught four fish_NN0 with hardly any effort[B0P.1387]
Cheese_NN1 is a protein of high biological value. [ABB.1950]
three cheeses_NN2. [CH6.7834]
A car_NN1 glistens in the distance_NN1. [HH0.1035]
Three cars_NN2, two lorries_NN2 and a motorbike_NN1! [CHR.290]
NN0 as they are invariant for number.
Crewe are top of div_NN1 3 by 8 points [J1C.961] (where div = division)
400 km_NN0 (km = 'kilometre' or 'kilometres')
6 oz_NN0 (oz = 'ounce' or 'ounces')
Nouns such as hundred, hundreds, dozens, gross, are all tagged as numbers, CRD, rather than nouns.
Sally_NP0; Joe_NP0 Bloggs_NP0; Madame_NP0 Pompadour_NP0; Leonardo_NP0 da_NP0 Vinci_NP0
London_NP0; Lake_NP0 Tanganyika_NP0; New_NP0 York_NP0
April_NP0; Sunday_NP0
- Note that the distinction between singular and plural proper
nouns is not indicated in the tagset, plural proper nouns being a
John_NP0 Smith_NP0. All of the Smiths_NP0.
- Note also that proper nouns are not processed as multiwords (though there may be good linguistic reasons for doing so). Each word in such a sequence gets its own tag.
A person's initials preceding a surname are tagged NP0, just like the surname itself. The choice whether to use a space and/or full-stop between initials (eg J.F. or J. F. or J F or JF) is determined by the original source text; the tagged version follows the same format.
John F. Kennedy = John_NP0 F._NP0 Kennedy_NP0
J. F. Kennedy = J._NP0 F._NP0 Kennedy_NP0
J.F. Kennedy = J.F._NP0 Kennedy_NP0
In the spoken part of the BNC, however, the components of names (and, in fact, most words) that are spelt aloud as individual letters, such as I B M, and J R in J R Hartley, are not tagged NP0 but ZZ0 (letter of the alphabet). See below.
- Nouns of style
Preceding a proper noun, or sequence of proper nouns, style (or title) nouns with initial capitals are tagged NP0.
Sub-Lieutenant_NP0 R_NP0 C_NP0 V_NP0 Wynn_NP0
- Geographical names
For names of towns, streets, countries and states, seas, oceans, lakes, rivers, mountains and other geographical placenames, the general rule is to tag as NP0. If the word the precedes, it is tagged AT0.
West_NP0 Harbour_NP0 Lane_NP0
the_AT0 United_NP0 Kingdom_NP0
the_AT0 Indian_NP0 Ocean_NP0
Mount_NP0 St_NP0 Helens_NP0
the_AT0 Alps_NP0
Other tags are used for the constituents of more verbose (especially political) descriptions of placenames, or those that are not typically marked on maps:
the_AT0 Western_AJ0 Region_NN1
the_AT0 People_NN0's_POS Republic_NN1 of_PRF China_NP0
the_AT0 Dominican_AJ0 Republic_NN1
the_AT0 Sultanate_NN1 of_PRF Oman_NP0
The examples show a little arbitrariness in application. For example, contrast
the_AT0 United_NP0 States_NP0
the_AT0 Soviet_AJ0 Union_NN1
- Non-personal and non-geographical names
- Where names of organisations, sports teams, commercial products (incl
newspapers), shops, restaurants, horses, ships etc.
consist of ordinary words (common nouns, adjectives etc.),
they receive ordinary tags (AJ0 etc.). Only if a word used as part of a name is an existing NP0 (typically a personal or geographical name), or a specially-coined word, is it tagged NP0. Some examples follow:
- Organisations, sports teams etc.
Cable_NN1 and_CJC Wireless_NN1
Procter_NP0 and_CJC Gamble_NP0
Acorn_NN1 Marketing_NN1 Limited_AJ0
Minolta_NP0; IBM_NP0; NATO_NP0
Wolverhampton_NP0 Wanderers_NN2 (football_NN1 club_NN1)
Tottenham_NP0 Hotspur_NP0 (football_NN1 club_NN1)
The_AT0 Chicago_NP0 Bears_NN2
World_NN1 Health_NN1 Organisation_NN1
There is a slight inconsistency here, in that acronyms of organisation names (WHO, NATO, IBM etc.) take NP0, whereas the expanded forms of these names take regular tags.
- Products (including newspapers and magazines)
Lancashire_NP0 Evening_NN1 Post_NN1
The_AT0 Reader_NN1 's_POS Digest_NN1
- Shops, pubs, restaurants, hotels, horses, ships etc.
Here again NP0 is reserved for parts of names that are specially coined, or derived from existing personal/geographical proper nouns.
The_AT0 Grand_AJ0 Theatre_NN1
The_AT0 King_NN1 's_POS Arms_NN2
- The second character of a verb tag marks the type of verb as follows:
- The third character of a verb tag marks the verb inflection as follows:
- be, have, and do
Auxiliary and main uses of these verbs are not distinguished:
she is_VBZ playing her best tennis for six years. [CH3.1382]
she is_VBZ just a star. [CH3.6939]
John has_VHZ built a set of bookshelves. [C9X.121]
John has_VHZ great courage. [CA9.1869]
We did_VDD n't_XX0 see anybody. [KB2.702]
They do_VDB nice work. [ANY.514]
- Lexical verbs
- Tags beginning VV- apply to all other (lexical) verbs.
She travels_VVZ in every Saturday morning. [KRH.4013]
The young kids want_VVB to dance_VVI and have fun [CHA.1599]
I thought_VVD he looked_VVD a sad sort of a boy. [CDY.2831]
...after running_VVG out of coal, the crew were forced_VVN to burn_VVI timber and resin [HPS.269]
All modals are tagged VM0. We do not differentiate between so-called past and present forms:
We can_VM0 go there.
We could_VM0 go there.
We used_VM0 to_TO0 go there every year.
- Contracted forms
- Subjunctives and Imperatives
- No special tags are used for these:
She suggested that they get_VVB married. [CBC.12107]
Please be_VBB patient. [CHJ.899]
Do_VDB n't_XX0 just stand there watching! [ACB.3470]
- Catenative or semi-auxiliary verbs
- Again, no special tagging is used for such forms as going
to, ought to, or used to + infinitive:
you're not going_VVG to_TO0 get killed [KCE.6550]
you ought_VM0 to_TO0 let them know. [KCT.6115]
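The three-character structure of the verb tags can be made explicit with a small decoder. The two tables below are not reproduced from the guidelines. They are inferred from the tag glosses and examples in this document (for instance is_VBZ, has_VHZ, did_VDD, to dance_VVI, can_VM0) and should be read as an assumption. The function name is illustrative.

```python
# Second character: type of verb (inferred from the examples above).
VERB_TYPE = {
    "B": "be",
    "D": "do",
    "H": "have",
    "M": "modal",
    "V": "lexical verb",
}

# Third character: inflection (inferred from the tag glosses and examples).
VERB_FORM = {
    "B": "finite base form",
    "D": "past tense",
    "G": "-ing form",
    "I": "infinitive",
    "N": "past participle",
    "Z": "-s form",
    "0": "no inflection distinguished",
}


def describe_verb_tag(tag):
    """Split a verb tag such as VVZ into (verb type, inflection)."""
    if len(tag) != 3 or tag[0] != "V":
        raise ValueError(f"{tag!r} is not a verb tag")
    try:
        return VERB_TYPE[tag[1]], VERB_FORM[tag[2]]
    except KeyError:
        raise ValueError(f"{tag!r} is not a recognised verb tag") from None


for tag in ("VBZ", "VHZ", "VDD", "VVI", "VM0"):
    print(tag, describe_verb_tag(tag))
```

The decoder accepts any combination of the two tables, including combinations that may not exist in the tagset, so it describes tags rather than validating them.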
Adjectives are given one of the wordclass tags AJ0, AJC or AJS.
- Predicative and attributive uses
The ground was dry_AJ0 and dusty_AJ0 [GWA.118]
The dust from the dry_AJ0 ground [GWA.121]
- Quasi-comparatives and quasi-superlatives
Adjectives which have a heightening or downtoning effect rather like that of comparatives and superlatives,
but which do not behave syntactically like comparatives or superlatives, are treated as ordinary adjectives.
Examples include utter, upper and uppermost.
Events in Eastern Europe were evidently uppermost_AJ0 in Mr Li's mind. [A95.366]
Family contacts were very important in uniting the upper_AJ0 classes [FB6.1495]
- Adjectives used catenatively
- For example, able and
Will you be able_AJ0 to manage? (catenative)
Your son is very able_AJ0 (non-catenative)
Comparative adjectives take AJC; superlatives take AJS.
A faster_AJC car.
The best_AJS in its class.
Ambiguities frequently arise between adjectives and other wordclasses, in particular adverbs, nouns and participles.
Adverbs are given one of the tags AV0, AVQ, or AVP.
AV0 is the default tag for adverbs. It incorporates a very mixed bag, including:
- adverbs of time, manner, place etc.
- Eg slowly; here; soon
- degree adverbs
- Eg very and rather in
- sentence adverbs
- for example:
- postnominal adverbs
- for example:
aged between 2 and 11 years inclusive_AV0 [AMD.31]
the buildings thereon_AV0 [J16.813]
during 1986-91 inclusive_AV0 [FT0.1400]
Diamonds galore_AV0 [FPH.900]
- discourse markers
- such as well,
you know like_AV0, it's worthwhile opening a cinema at 4 o'clock... [F7A.358]
Note that adverbs, unlike adjectives, are not tagged as positive, comparative, or superlative. This is because of the relative rarity of comparative and superlative adverbs.
Wh- adverbs are tagged AVQ whether the word occurs in interrogative or relative use.
"When_AVQ do your courses start?" [A0F.3117]
"...if you let me know when_AVQ the police are called in." [BMU.2291]
Yet why_AVQ is that so? [CR7.3089]
Ordinal-type adverbs (including first, fourth,
etc.) are treated separately with the ORD tag.
Prepositional Adverbs (also known as "Adverbial Particle") are
treated with prepositions and tagged AVP.
Articles, determiners & pronouns
Articles, definite or indefinite, are tagged AT0. Pronouns which act as
determiners of various kinds (all, which, your etc.) are given tags
DPS, DT0 or DTQ, and distinguished from pronouns which do not have a
determiner function. These are marked using one of the tags
PNI, PNP, PNQ or PNX depending on their function.
- All articles are tagged AT0. An article is defined here as a determiner word which typically begins a noun phrase, but which cannot occur as the head of a noun phrase. Examples include a/an, the, no and every:
Have a_AT0 break
There's no_AT0 time
- Recognising that there is a high degree of formal and functional overlap between determiners and pronouns, we have conflated under the D-- heading
words that are capable of either function. We distinguish three classes of determiner pronouns:
- Words such as few, both, another are tagged DT0.
free secondary education for all_DT0 [ECB.1610]
Few_DT0 diseases are incurable [GV1.1129]
for the benefit of the few_DT0 [HHX.10183]
- Interrogative determiner-pronoun
- The wh- (interrogative) determiner-pronoun is tagged DTQ. Which and what are always tagged DTQ.
Which_DTQ country do you live in? [A7N.979]
And she didn't say which_DTQ? [KCF.351 ]
What_DTQ time is it? [A0N.406]
- Prenominal possessive determiner pronoun
Tags beginning P-- indicate pronouns which do not share the determiner function, for example I, it, anyone. Pronouns are differentiated according to whether they are:
- Relative pronouns
Which as a relative (or interrogative) pronoun is grouped with the other determiner-pronouns, and tagged DTQ.
Give 4 details which_DTQ should appear on an order form [HBP.417]
Meanwhile, that as a relative clause complementizer is treated with that as a complement clause complementizer, and tagged CJT.
I got some currants that_CJT are left over [KST.3733]
this girl that_CJT Claire knows [KC7.1101]
He dismissed reports that_CJT his party was divided over tactics [A28.11]
We both knew that_CJT enough was enough. [FEX.268]
Prepositions and prepositional adverbs
- Most prepositions are tagged PRP, including a large number of multiword items. Examples include:
at_PRP the Pompidou Centre in_PRP Paris [A04.325]
I use humour as_PRP a protection [FBL.356]
Heard about_PRP this have you? [KE6.9556]
According_PRP to ancient tradition, ...[A04.784]
Many disputes are dealt with by bodies other_PRP than courts. [F9B.4]
Nice walls and a big sky to look at_PRP. [A25.122]
- The preposition of is assigned a special tag PRF because of its frequency and its almost exclusively postnominal function. Examples:
a couple of_PRF cans of_PRF Coke [AJN.283]
DNA consists of_PRF a string of_PRF four kinds of_PRF bases [AE7.107]
Note that numerous multiwords contain of, eg in front of, in light of, by means of, etc.
- Prepositional adverbs/particles
- Preposition-type words which have no complement are tagged AVP. Typical uses of AVP are in phrasal verb constructions, or when it functions as a place adjunct:
We gave up_AVP after two hours. [KSV.1029]
there were a lot of horses around_AVP. [HR7.3101]
There are many instances of ambiguity between PRP and AVP.
- Co-ordinating conjunction
- Co-ordinators such as and, or, but, nor etc are tagged CJC.
Fish and_CJC chips
James laughed and_CJC spilled wine. [A0N.136]
She was paralysed but_CJC she could still feel the pain. [FLY.529]
- Subordinating conjunction
- All subordinating conjunctions are tagged CJS and introduce one of:
- an adverbial clause (of time, reason, condition etc.)
"When_CJS you 've done it , you should go home,"[CRE.949]
I still stayed there after_CJS I heard the shooting [HW8.3263]
As_CJS you may know Scorton will again enter the Best Kept Village competition in 1992 [HPK.768]
Do send me an interim copy as_CJS soon as you can [HD3.69]
If_CJS it's wet just take your time. [KCL.554]
- a comparative clause
- introduced by than or
as, and occurring with or without ellipsis:
It was worse than_CJS she could have imagined.[CH0.1315]
...apparently we eat more chocolate than_CJS any other country.[G3U.1000]
"it's as good as_CJS it's going to get."[K9K.199]
make the transporter as light as_CJS possible. [CA1.1113]
- a nominal wh-clause
- containing whether or if
Can you tell me whether_CJS ivies do damage trees. [C9C.720]
- Complementary clause
- The conjunction that at the start of a clause introducing reported speech and thought, and also
at the start of a relative clause, is tagged CJT.
Historians knew that_CJT this was nonsense.[G3C.363]
China announced that_CJT it was ending martial law in the Tibetan capital Lhasa. [KRU.95]
The problem that_CJT he was having was that_CJT she was his legal wife 's sister [HE3.210]
Cardinal numbers and similar items are tagged CRD. Ordinal numbers and similar items are tagged ORD.
- Numbers and fractions
- All cardinal numbers, numeral nouns, fractions and so on take the tag CRD, whether they are written as words or numerals, and whether functioning nominally or prenominally. Examples:
5_CRD out of 10_CRD[CGM.525]
one_CRD striking feature of the years 1929_CRD-31[A6G.134]
his first_ORD innings, when he scored forty_CRD-two, with seven_CRD fours_CRD [KJT.128]
Hundreds_CRD of people audition each year [K1S.2239]
About a dozen_CRD there. [HEU.131]
- Ordinal numbers and similar
- Ordinal numbers are assigned ORD in all syntactic positions, including adverbial positions, as in:
We only came fourth_ORD in the county championship last_ORD year [EDT.1629]
Note that ORD is also assigned to less overtly numeric words like next and last, even in clear adverbial, adjectival or nominal contexts. This is because next and last function like ordinals both syntactically and semantically.
- Currency and measurement expressions
- Measurement expressions, consisting of numbers and a unit of measurement of some kind
(together as one word), are assigned a noun tag, usually
NN0 (neutral for number) or NN2.
12&ins;_NN2 ( = 12 inches)
- Other sequences of numeric and alphabetic characters are assigned UNC.
Figure 2b_UNC [FTC.250]
Serial no. S835508_UNC [C9H.2282]
A4_UNC sheet of paper [CN4.296]
Mark drove home along the M1_UNC [AC2.2210]
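The rules for numeric items can be summarised as a toy classifier. This is only a sketch of the distinctions described above and not a description of how CLAWS works. The word lists and the unit list are tiny illustrative samples, and the treatment of forms such as 4th as ORD is an assumption. The function name is hypothetical.

```python
import re

# Tiny illustrative samples, not complete lists.
ORDINAL_WORDS = {"first", "fourth", "next", "last"}
CARDINAL_WORDS = {"one", "seven", "forty", "dozen", "hundred", "hundreds"}
UNITS = ("km", "oz")


def guess_numeric_tag(token):
    """Guess CRD, ORD, NN0 or UNC for a number-like token, else None."""
    t = token.lower()
    if t in ORDINAL_WORDS or re.fullmatch(r"\d+(st|nd|rd|th)", t):
        return "ORD"
    if t in CARDINAL_WORDS or re.fullmatch(r"\d+([.,]\d+)*", t):
        return "CRD"
    if re.fullmatch(r"\d+(%s)" % "|".join(UNITS), t):
        return "NN0"  # number and unit written together as one word
    if re.fullmatch(r"(?=.*\d)(?=.*[a-z])[a-z0-9]+", t):
        return "UNC"  # other mixtures of digits and letters
    return None


for token in ("10", "forty", "fourth", "last", "6oz", "2b", "S835508", "M1", "fish"):
    print(token, guess_numeric_tag(token))
```

The order of the checks matters: the measurement rule has to run before the general digits-and-letters rule, otherwise a form such as 6oz would fall through to UNC.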
Miscellaneous other tags
- Existential there
- The tag EX0 is used for there when it does not carry any real meaning: it merely states that something exists or existed. It occurs at the beginning of a clause and is usually followed by the verb be and an indefinite noun phrase; for example:
There_EX0 was a long pause and then a smile [A4H.416]
Waiter! Waiter! There_EX0's an awful film on my soup! [CHR.657-9]
There_EX0 appears to be little alternative [ECE.2139]
Compare this with there when it has a clear locative meaning ('in/to that place'):
Don't stand there_AV0 grinning like a stuck pig [C85.1553]
- The tag ITJ is used for any interjection:
Oi_ITJ - come here!
Yes_ITJ , please_AV0 do
No_ITJ not_XX0 yet_AV0
(For the distinction between ITJ and the unclassified tag, UNC, see Interjection vs. unclassified)
- Genitive morpheme
- The tag POS is used for the genitive morpheme 's (singular) or ' (plural after an s):
teacher_NN1 's_POS pet
teachers_NN2 '_POS pet
Note the lack of space between the noun and the following POS, as 's is tokenized in the same way whether it represents a genitive or a contracted verb. See further on tagging of 's in apostrophe 'S
- Infinitive marker
- The tag TO0 is used for the infinitive marker. This includes elliptical uses.
"Do you want to_TO0 talk about it?" [EFG.1935]
In the summer holidays I can , I can get up early if I want to_TO0 . [KPG.4153]
Note the morphological variation of to in the following colloquial forms:
We got_VVN ta_TO0 go
We wan_VVB na_TO0 stay.
- Unclassified words
- The tag UNC is used for unclassified (or unclassifiable) words. It is applied in contexts where no other wordclass tag seems appropriate, including
- "Noise words" and pause fillers in spoken utterances; imitations of animal or machine sounds:
blah_UNC blah_UNC blah_UNC
er_UNC I think so
- Certain fused forms (in written or spoken data) for which no other tag would be appropriate:
That ai_UNC n't_XX0 right.
0.5 cm increments_UNC/30 seconds [HWT.282]
Fits with most lap/diagonal_UNC seat belts. [BNX.392]
- Truncated words in speech. Partial words that are not completed by a
speaker, whether through hesitation or an interruption, are also
usually marked with the XML tags <trunc>; for example
the partial word bathr in the following:
The bathr_UNC data. er you can't beat a white bathroom suite anyway. [KCF.771]
- Partial repetitions of multiwords in spoken data.
Occasionally in spoken data, when a multiword sequence is used, it appears to be repeated, but only partially so. In the following example, the orthographic word sort is used twice:
we're going to sort sort of summarize... [G5X.106]
We treat the first sort as an incomplete multiword, and tag it UNC (rather like truncated words, above). The complete multiword sort of is tagged AV0, as normally.
we're going to sort_UNC sort of_AV0 summarize...
For the distinction between UNC and ITJ, see Interjection vs. unclassified.
- Negative particle
XX0 is the tag for the negative particle not, and also for its contracted or fused form, n't.
Brown did_VDD n't_XX0 see it that way. [A6W.338]
no, that is not_XX0 correct. [JK0.257]
ZZ0 is used for a free-standing letter of the alphabet such as A, X, x, p, r. If, however, the letter clearly represents a separate word, or an abbreviation of a separate word, we have tried to assign the appropriate POS-tag for the full form of that word, rather than ZZ0. For example,
- I as personal pronoun is PNP rather than ZZ0.
- a as indefinite article is tagged AT0.
- F as in John F. Kennedy is tagged NP0.
- v meaning 'versus' is tagged PRP.
Italy v_PRP New Zealand ... Hungary v_PRP Thailand [A1N.507].
Although the same should apply to v. the full-stop is liable to force a new sentence break. (See eg CHS.1076, EB2.19, EDL.313)
- In spoken texts, words which are spelt out by the speaker are transcribed letter by
letter, and each letter is tagged ZZ0.
I_ZZ0 B_ZZ0 M_ZZ0 compatible [JYM.6]
children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3807]
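Because spelt-out words appear as runs of single letters tagged ZZ0, a user may want to rejoin them before counting or searching. The following is a minimal sketch, assuming the text is already available as (word, tag) pairs. The function name is illustrative and the AJ0 tag on compatible is an assumption made for the example.

```python
from itertools import groupby


def join_spelt_letters(tokens):
    """Join each run of consecutive ZZ0 tokens into one word; keep the rest."""
    words = []
    for is_letter, group in groupby(tokens, key=lambda pair: pair[1] == "ZZ0"):
        group_words = [word for word, _tag in group]
        if is_letter:
            words.append("".join(group_words))
        else:
            words.extend(group_words)
    return words


tokens = [("I", "ZZ0"), ("B", "ZZ0"), ("M", "ZZ0"), ("compatible", "AJ0")]
print(join_spelt_letters(tokens))
```

Two separately spelt words that happen to be adjacent would be merged by this approach, so it is only suitable as a first approximation.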
You're not supposed to keep medicine that long_AV0. [H8Y.1976 *AV0-AJ0]
Note also that in this section we use a number of invented examples (in addition to corpus citations) to clarify the distinction between categories.
Disambiguation by Tag Pair
Adjective vs. adverb
We arrived tired_AJ0, but safe_AJ0 [CCP.529]
After a little he remembered it and sang out loud_AV0. [A0N.1144]
This sentence does not imply that he was loud, but is more or less equivalent to He sang out loudly. It means that his singing was loud.
You did great_AV0 though. [HH0.3248 *AV0-AJ0]
AJ0/AV0 word follows an object:
everyone below 25 grew their hair too long_AJ0. [ARP.590 *AV0-AJ0]
(i.e. 'their hair was too long'.)
Try not to keep her too long_AV0. [FAB.3620 *AV0-AJ0]
(i.e. NOT 'she will be too long.')
They'll have to make the taxes higher_AJC. ('the taxes will be higher')
We can make this piece higher_AJC if you want to. [BNG.2268]
You'll have to aim higher_AV0. (NOT 'you will be higher')
You should aim higher_AV0 [ACN.984 *AJC]
Adjective vs. noun
AJ0) or noun (NN1). Colour words like black, white and red are fairly consistent in allowing the two tags, and may be used to illustrate the difference. In attributive (premodifying) or predicative (complementing) positions without further modification these words are normally adjectives:
a white_AJ0 screen, The screen is white_AJ0.
When the word is the head of a noun phrase, on the other hand, it is a noun:
Red_NN1 is my favourite colour.
They painted the wall a brilliant white_NN1.
All past_AJ0 and present_AJ0 employees of the branch are invited. [K99.216]
*These needs are past, present, and future.
(Note that present can be used as a predicative adjective meaning the opposite of absent; but this meaning is not comparable to the temporal meanings of past, present and future above.)
You're living in the past_NN1. [HGS.1045]
I don't even want to think about the future_NN1. [JY4.2864]
The only reason for treating past and present in the example above as adjectives is that they have an institutionalized meaning as modifiers, which is rather different from the meaning they have as nouns. Further examples of this type are words such as model in model behaviour, giant in a giant caterpillar and vintage in vintage cars.
new spending_NN1 plans [CEN.5922]
a working_AJ0 mother [ED4.153]
his reading_NN1 ability [CFV.1897]
in the coming_AJ0 weeks [HKU.1333]
AJ0). That is, a word ending -ing is an adjective when it is the notional subject of the noun it premodifies. For example:
two smiling_AJ0 children [HTT.743] ('two children who are smiling')
NN1). In such cases, it is often possible to paraphrase X-ing + Noun by a more explicit phrase in which X-ing is clearly a noun:
new spending_NN1 plans ('new plans for spending')
his reading_NN1 ability ('his ability in reading')
Determiner-pronoun vs. adverb
AV0. The difference between them is that
DT0 is for noun-phrase-like (and determiner-like) uses of the word in question, whereas
AV0 is for adverbial uses. The two can be hard to distinguish, particularly after a verb:
(a) You should relax more_AV0.
(b) You should spend more_DT0.
You should eat more.
You should read more.
You should smoke less.
Do you smoke? (Intransitive)
How many do you smoke in a week? (Transitive)
(c) At the moment we have 23 fixtures per season. Personally, I would rather play more_DT0.
(d) You should work less and play more_AV0.
(In (d) the adverb more has roughly the meaning of 'more often'.)
Note. The automatic disambiguation of determiners and adverbs is not reliable, because transitivity has not been encoded in the tagger. Sentences like (c) and (d), where more follows the verb at the end of a sentence, are invariably tagged AV0.
Adjective vs. participle
Another area of borderline cases is the tagging of words as adjectives
AJ0) or as participles (
One test is to see whether a degree adverb like very can be inserted in front of the word: e.g. in We were very surprised, surprised is an AJ0.
Even where it is not present, the possibility of adding the by-phrase, without changing the meaning of the word, is evidence in favour of
We were surprised_VVN by pirates.
VVN. (However, this criterion can clash with the preceding one — since it occasionally happens that an -ed word is both preceded by an adverb like very and followed by a by-phrase: E.g. I was so irritated by his behaviour that I put the phone down. When these do occur, we give preference to
This shows that lasting or locked can easily be (but need not be) an
The effect is lasting_AJ0 (compare a lasting_AJ0 effect).
The door is locked_AJ0 (compare the locked_AJ0 door.)
AJ0. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is a participle.
VVG after the verb be, it is generally treated as an
AJ0 before a noun:
The man was dying_VVG. [HTM.1494 *VVG-AJ0]
the dying_AJ0 man. [FSF.1787]
VVN tag is preferred:
an interest_NN1 earning_VVG account
a hypothesis_NN1 driven_VVN approach
In these examples the
NN1+VVG/VVN sequence has the character of a premodifying adjective compound. We can therefore imagine the
two words bracketed together forming an adjective: an interest-earning_AJ0 account. But within the adjective, the VVG and VVN tags retain their verbal character, with the initial noun acting as object of the verb (cf. the account earns interest).
a shanty_NN1 singing_VVG competition [K4W.2952]
AJ0 / VVN word, this is a strong indication that the construction is not properly a passive, and that the word is an
The building was infested_AJ0 with cockroaches
(cf.: The building seemed/became infested with cockroaches)
This is a manifestation of the general semantic character of adjectives (which typically refer to states or qualities) and verbs (which typically refer to events or actions).
Bill was married_AJ0. (i.e. he was not single)
Bill was married_VVN to Sarah on the 15th May. (i.e. the actual event)
She is not disturbed_VVN by that sort of threat.
The tourists were standing_VVG around a map of the city.
Are you expecting_VVG someone? [G01.2610]
The arithmetic is looking_VVG good. [K1M.3611]
Turning_VVG suddenly, she ran for the safety of the car [CK8.297]
Preposition vs. prepositional adverb vs. general adverb
(a) She ran down_PRP the hill.
(b) She ran down_AVP her best friends.
- It can be placed before or after the noun phrase acting as
object of the verb:
She ran her best friends down_AVP.
(But not: *She ran the hill down.)
- If the noun phrase is replaced by a pronoun, the pronoun has
to be placed in front of the particle:
She ran them down_AVP. (= her best friends)
(But not: *She ran down them.)
The dentist took all my teeth out_AVP. (The dentist took them out)
Notice that the syntactic distinction between (for example) down as an adverbial particle and down as a preposition is independent of the semantic distinction between locative and non-locative interpretations of down.
Income tax is coming down_AVP.
The decorations are put up_AVP on Christmas Eve.
This is the hill (which) she ran down_PRP.
(Cf. This is the hill down which she ran.)
The poor were looked down on_PRP by the rich.
(Here on is the stranded preposition)
Which car did she arrive in_PRP?
The same tests apply to words which are tagged either as prepositions or as general adverbs (AV0), such as across, past and behind.
Note, additionally, the use of about as a degree adverb.
Interjection vs. unclassified
The borderline between interjections or exclamatory particles (tagged
ITJ) and unclassified 'noise' words (tagged
UNC) is drawn as follows:
ITJ is used for 'institutionalized' interjections or discourse particles such as good-bye, oh, no, oops, hallelujah, whoa, wow; however
right and like
functioning as discourse markers are tagged AV0.
UNC is used in contexts where no other wordclass tag seems appropriate:
- 'noise' words and pause fillers in spoken utterances; this
includes imitations of animal or machine sounds:
blah_UNC blah_UNC blah_UNC
er_UNC I think so.
- certain fused forms which cannot easily be broken down into
separate word classes:
- constituent <w> elements within multiword expressions for which no unique C5 code can be found
The contraction ain't is a special case: its first half is tagged UNC because it abbreviates so many different verb forms (am not, is not, are not, has not, have not) that no single tag can be applied to it (unless one were to invent a special tag for that purpose).
Disambiguation by Word
Tears well_VVB up in my eyes. [BN3.5 *AV0]
- Contracted forms
When it represents a shortened form of is, has or (rarely) does, it has the appropriate verb tag.
Occasionally, for example with auxiliaries followed by past participles, there are difficulties determining what the full form of the verb should be.
That_DT0's_VBZ perfect is that one... (= That is...) [KCX.1254]
She_NP0 's_VHZ got tickets. (= She has...) [KPV.6479]
well, what_DTQ 's_VDZ he do?, is he a plumber? (= What does...) [KD6.310]
Britain_NP0's_POS small businesses [HMH.67]
After today_AV0's_POS announcement [K6F.39]
- 's plural
- When 's acts as a marker of the -s plural, or as part
of the verb form let's, it is part of a single word, and
is not assigned its own tag. E.g.:
success in the three R_ZZ0's [EVY.59]
in the 1980_CRD's [HJ1.22024]
Let_VM0's go_VVI. [A61.1443]
Note that let's is not considered a contraction of let
us, but is treated as a single 'verbal particle', tagged VM0
on the grounds that it is closely analogous to modal auxiliaries.
- Degree adverb:
When about has an approximating meaning, typically premodifying
a quantifying expression, it is tagged AV0.
Note also the multiword just about, as in:
...it was about_AV0 three weeks ago [FAJ.1714]
about_AV0 half the size of a grain of rice [AJ4.33]
We're just about_AV0 ready.
- Preposition vs. particle:
- See further at Preposition vs. prepositional adverb vs. general adverb
- Comparative constructions:
- As is a degree adverb (AV0) when it occurs before an adjective,
adverb or determiner (and sometimes other words) in phrases of
the type as X as Y, or simply as X (where the comparative
clause or phrase as Y is omitted but understood):
In the first and second examples above, the second as introduces a comparative construction which expresses 'equal comparison', as contrasted with the unequal comparison of more X than Y. When as is a word introducing such a comparative construction, it is tagged CJS:
I go to see them as_AV0 often as I can . [AC7.1189]
and they employ ninety people, twice as_AV0 many as last year. [K1C.3540]
And every bit as_AV0 good . [EEW.1132 *CJS]
Notice that as in this comparative use is tagged CJS whether or not it introduces a clause. Often it introduces a noun phrase. In the following example, it introduces an adjective:
Capitalism is not as_AV0 good as_CJS it claims. [CFT.2042]
Linked together, they can crunch numbers as_AV0 fast as_CJS any mainframe. [CRB.271]
She will deposit as_AV0 many as_CJS a dozen eggs there. [F9F.424]
always reply as_AV0 quickly as_CJS possible. [C9R.989]
- Introducing other clauses:
The tag CJS is also used when introducing other subordinate clauses,
such as adverbial clauses of time or reason:
New York called just as_CJS I was leaving. [APU.1543]
As_CJS you've gone to so much trouble , it would seem discourteous to refuse [KY9.2107]
- The tag PRP is used for as functioning clearly as a preposition:
Usually the meaning is related to the equative meaning of the verb be. However, the guideline restricts PRP to cases where as is followed by a noun phrase or nominal, as is normal for prepositions. Where the as is followed by an adjective or a past participle clause, it is tagged CJS, even though it may retain the equative type of meaning:
Consider it as_PRP a kind of insurance [AD0.1641]
As_PRP head of information, Christina will lead a team of four TEC staff... [BM4.2830]
We regard these results as_CJS encouraging. [B1G.184]
I very much hope that you will in fact support the motion as_CJS originally intended. [KGX.93]
- As is part of many multiwords which get tagged with a
single tag: e.g. as soon as, such as, in so far as,
as long as, as well as. The sequence as well as, for
example, is tagged as a preposition (
PRP) in such examples as
Note that this is different from the multiword adverb as well (meaning also); it is also different from the sequence as well as analysed as three separate words, e.g. in:
Sometimes as well as_PRP going this way we actually need to go in this was too. [G5N.31]
She's as_AV0 well_AJ0 as_CJS can be expected. [F9X.2095]
CJC is overwhelmingly the most common use of but. The following other cases can also be detected:
- But is an adverb when its meaning is similar to 'only':
She can spare you but_AV0 a few minutes [CCD.82 *CJC] There is but_AV0 one penalty. [ALS.185 *CJC]
- Subordinating conjunction or preposition:
- But is either a conjunction (
CJS) or a preposition (
PRP) if it has the meaning of 'except (for)', 'other than' or 'apart from'.
CJS is used when it introduces a clause, and
PRP is used when it introduces a phrase:
...mediocre albums that do nothing but_CJS take up shelf space [C9M.1014]
I couldn't help but_CJS notice. [JY0.5323 *CJC]
I always feel they are open meetings in everything but_PRP name. [HJ3.5520]
No one had guessed she was anything but_PRP a boy. [C85.517]
- Coordinating conjunction:
Otherwise but is a coordinating conjunction, tagged
CJC, linking units of the same kind (e.g. clauses or adjective/adverb phrases). Its function is to express contrastive or 'adversative' meaning:
God and minds do exist , but_CJC materially so . [ABM.1265]
And that's it for another week but_CJC don't forget the late news at eleven thirty. [J1M.2520]
Hares ( but_CJC not rabbits ) are particularly vulnerable... [B72.892]
- Note also multiwords such as but for (PRP):
The fare increases would have been bigger but for_PRP the governments last minute intervention. [K6D.124]
- Discoursal function:
In speech, when like has a discoursal function as a 'hedge',
we tag it AV0:
well she says like_AV0, I won't be a minute [KCY.1518]
I'm driving along, you know like_AV0 <trunc> wha</trunc> when you're in the car by yourself and everything's turning over in your head [KBU.1096]
- Other functions:
- Like very frequently occurs as a preposition or as a verb.
The noun and adjective uses are fairly rare:
...but I like_VVB Monday best. [FU4.1089]
He didn't look like_PRP a goodie. [H0M.1353]
... fuel, weapons, ground crew and the like_NN1. [JNN.105 *AJ0-NN1]
Churchill and Eden were not of like_AJ0 minds... [ACH.1297]
The meaning of little (
AJ0) is the opposite of big:
Bless their dear little_AJ0 faces. [HRB.722]
Little_AJ0 green shoots of recovery are stirring. [CEL.968]
The meaning of little (
DT0) is 'not much':
I have little_DT0 to say. [G1Y.1133]
...there was little_DT0 food left. [FSJ.720]
As an adverb (
AV0) little also has the meaning 'not much':
I care very little_AV0 about petty-minded, selfish "rules". [B0P.211]
- A little
Note that a little can also be a multiword adverb (
However, the quantifier a little meaning 'a small amount' is not tagged as a multiword 3 but as
They are all a little_AV0 drunk. [G0F.2117]
AT0 + DT0
[See Determiner-pronoun vs. adverb ]
You couldn't let me have a_AT0 little_DT0 milk? [GUM.1656]
Much_DT0 of this work has to be done on the spot. [C8R.24]
I've spent too much_DT0 money. [KPV.62659]
Thanks very much_AV0. [A73.5]
I didn't sleep much_AV0 last night [ALH.1495]
See also Determiner-pronoun vs. adverb
MORE and LESS
You deserve more_DT0 than a medal. [K97.3705]
More_DT0 haste, less_DT0 speed. [J10.4543]
...this will make him more_AV0 tired than usual [A75.282]
But I couldn't agree more_AV0 [BMD.3]
No_AT0 problem_NN1. [H4H.227]
As a noun, no is usually an abbreviation for number:
quoting Ref_NN1 No_NN1 BCE90_UNC [CJU.673]
but the matter was taken no_AV0 further_AV0. [ARF.183 no: *AT0]
To put it no_AV0 more_AV0 strongly_AV0, it has not been proved beyond doubt that.... [EW7.125]
- No is tagged as an interjection (
ITJ) where it functions as the opposite of Yes.
"...See how easy my job can be?"
"Frankly, no_ITJ". [HR4.2329]
The clearest cases of
CRD are in a quantifying noun phrase, typically allowing the substitution of another numerical expression (e.g. one chip contrasts with two chips) or of the digit 1 (1 chip):
In such noun phrases, one functions like a determiner-pronoun such as some.
Can I have one_CRD chip, please? [KDB.1416]
So are there criticisms? Just one_CRD. [CG2.1489]
... one_CRD in five sufferers never tells their partners. [CF5.8 *PNI]
Orford Ness is one_CRD of Britain's most unusual coastal features. [CF8.86]
- Indefinite Pronoun:
- The clearest cases of PNI are:
(a) As a substitute form, standing for an understood noun or noun phrase:
In this use, one has a plural form ones.
The channel was not a broad one_PNI [AEA.1457]
(b) As a generic personal pronoun, meaning 'people in general':
And I think one_PNI might go on to argue that far from saving labour it creates it. [J17.1915]
Note that the reliability of the ambiguity tag
PNI-CRD (in which the pronoun is rated more likely)
is somewhat low. See POS-tagging Error Rates
As both an adverb (
AV0) and an adjective (
AJ0) right means
the opposite of 'wrong' and also the opposite of 'left'. As a
noun, it generally means 'entitlements': e.g. I have a
right_NN1 to know. The uses of right as a verb are
- Discoursal function:
As a discourse marker, right is tagged AV0:
Right_AV0, how you doing there? [KBL.4671]
Right_AV0, er, members, any questions ? [F7V.138]
- Degree adverb (intensifier):
In dialectal usage, right can be an intensifier, and is tagged AV0:
it's a ... it's a right_AV0 soft carpet. [KB2.1242-4]
- In most cases so is tagged as an adverb (AV0):
So_AV0 this is where you work... [H8M.2964]
Right, so_AV0 what's fifty three per cent as a decimal? [JP4.357]
They waited but nothing happened so_AV0 they made a fuss. [FU1.2484]
- As a pro-form meaning 'thus' or standing for a clause or predicate,
so is tagged AV0:
So_AV0 say I and so_AV0 say the folk. [G11.228]
"Yes, I think so_AV0." [CCM.151]
- As a degree adverb or intensifier, so is tagged AV0:
tough and long lasting - that's why they're so_AV0 popular. [BN4.929]
There would not be so_AV0 many lonely people in our land [B1Y.1262]
- Introducing purpose clauses, so is tagged CJS:
Drink your tea so_CJS they can have your cup. [KB2.1767]
- Note that so is frequently part of a multiword: so that, so far, so as to, (in) so far as, etc. See the list of multiwords
- As a demonstrative (pronoun or determiner), that is tagged DT0:
That_DT0's_VBZ my coat yeah. [KBS.1309]
he's getting hooked on the taste of vaseline, that_DT0 dog. [KCL.197]
- As a clause-initiating conjunction, that is
CJT. This applies to that as a complementizer:
and also to that as a relativizer (introducing a relative clause):
Many experts claim that_CJT it is good for your growing baby, too. [G2T.1091]
This is different from the more traditional analysis which treats that introducing a relative clause as a relative pronoun.
A ship that_CJT never enters harbour. [BPA.1326]
- As a degree adverb (intensifier):
It wasn't all that_AV0 bad. [KPP.321]
- That occurs commonly in multiwords such as so that, in that, in order that.
AJ0, usually following the), then receives the tag
And then_AV0 she spoke. [H8T.2675]
"Come on, then_AV0." [K8V.1722]
Mr Willi Brandt, the then_AJ0 Mayor of West Berlin. [A87.84]
...the then_AJ0 state governor , who wasn't then_AV0 Bill Clinton [A87.84]
- Infinitive marker
When used with an infinitive, to is always tagged
TO0. Note elliptical uses of the pre-infinitival to, especially in informal spoken texts:
Note also the common colloquial spelling of want to, got to, and going to as fused words:
In the summer holidays, I can, I can get up early if I want to_TO0. [KPG.4153]
wanna = wan_VVB na_TO0
gotta = got_VVN ta_TO0
gonna = gon_VVG na_TO0
- When used as a preposition, to is always tagged
PRP. Prepositions are normally followed by a noun phrase or nominal clause. Where the preposition is 'stranded' (i.e. where the noun phrase associated with the preposition has been moved or elided) it can be confused with an adverbial particle:
That 's the school that Terry goes to_PRP. [KB8.2442]
...what you're entitled to_PRP by law is money back [FUT.360]
"Where to_PRP?""The_PRP moon." [FNW.240-1]
- Adverbial particle
- The adverbial particle to is rare but does occur, for example in come to meaning 'regain consciousness'.
- By far the most common function for well is as an adverb:
She's playing well_AV0
- Discoursal function:
- When well has the function of a discourse marker, it is
treated as an adverb (AV0):
Oh well_AV0! That'll be the finish! [FX6.196-7]
I bet he doesn't get up till about, well_AV0, it's eleven now. [KBL.3808]
- Degree adverb:
- Well is tagged
AV0, too, where it has an intensifying function: e.g.
It was dark outside and well_AV0 past your bedtime. [ASS.898]
- Well is tagged as an adjective where it means 'in good
You don't look well_AJ0. [HPR.107]
- As a verb, well is very rare, but occurs in the phrasal
verb well up. NB. This use has not been accurately tagged in the corpus:
Tears well_VVB up in my eyes. [BN3.5 *AV0]
CJS. Otherwise it is tagged
AVQ tag is also used for when introducing a question. Examples:
- Adverbial clause:
Note that when is also a subordinating conjunction in abbreviated adverbial clauses which lack a subject and finite verb, such as when in doubt, when ready, when completed.
When_CJS I got back to my flat, I decided to ring Toby. [CS4.1265]
the crowd left quietly when_CJS the police arrived. [APP.1017] (when = at the time at which)
If you smoke when_CJS you're pregnant... [A0J.1598] (when = whenever)
- Nominal clause
Before an infinitive, when is also tagged AVQ:
I can't remember when_AVQ we last had a frost. [KBF.11728]
"Do you remember when_AVQ we used to go with Daddy in the boat on Saturdays?" [A6N.2022]
You never know when_AVQ the next big story will break. [HJ6.100]
Also when the rest of the infinitive clause is understood:
Otto knew when_AVQ to change the subject. [FAT.1603]
Tell me when_AVQ.
- Relative clause
Note that when can often be omitted in relative clauses: the moment he arrived.
in the year when_AVQ I was born (when = in which)
the moment when_AVQ he arrived (when = at which)
- Direct questions
When_AVQ did you find out?
AVQ) or a subordinating conjunction (
CJS). However, with where the
CJS tag is much less likely. Examples:
- worth is tagged
PRP where it could answer a question such as 'How much is X worth?' or 'What is X worth?'
worth also occurs as a 'stranded preposition' in questions used to elicit such responses, and in some other common constructions:
these pictures are worth_PRP a small fortune. [FNT.1060]
That makes him worth_PRP about $60m. [CT3.479]
'Darling, it's not worth_PRP getting upset. [HH9.2308]
how much d'ya think it's worth_PRP? [KCX.1344]
share prices say nothing about what a company is worth_PRP. [A9U.305 *NN1]
Please go ahead and push Grapevine for all you are worth_PRP. [AP1.575]
- worth is tagged
NN1 when it is an obvious noun (meaning 'value'). Typically this occurs following expressions of quantity, whether or not the quantity is expressed by a possessive or genitive (e.g. its, 's).
Baker showed his worth_NN1 for Ipswich in the 20th minute [CF9.102]
hundreds of pounds' worth_NN1 of damage. [A0H.15]
£2,500 WORTH_NN1 OF PRIZES [ECJ.1147]
Features of spoken corpus tagging
- Individual letters
- Words spelt out by a speaker as individual letters have been transcribed letter by letter,
each being tagged ZZ0.
children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3805]
...ten ninety minute tapes! T_ZZ0 D_ZZ0 K_ZZ0 tapes! [KPG.3534-5]
In the written corpus these items would nearly always be written and tagged as whole words (ENT or TDK in the above example).
- Truncated words
- Words that are left incomplete by the speaker are enclosed
within an XML
<trunc> element and tagged
UNC. Examples include bathr and su in the following
The <trunc> bathr_UNC </trunc> er you can't beat a white bathroom suite anyway. [KCF.721]
Aye, they only came in the <trunc> su_UNC </trunc> they only came up here in the summer. [GYS.127]
- Partial repetition of multiwords
Occasionally in spoken data it happens that only a portion of a
multiword sequence is repeated. In this example, the word sort is used twice; in both cases
it appears to function not as a separate word but as part of the multiword adverb
We treat the first sort as an incomplete multiword, and tag it
we're going to sort sort of summarize... [G5X.106]
UNC (rather like truncated words, above). The complete multiword sort of is tagged
AV0, as normally.
Further examples of incomplete multiwords are the as long in as long as (conjunction), of in because of (preposition) and the in in in general (adverb) below
we're going to sort_UNC sort of_AV0 summarize...
The second example shows that when words are repeated, the incomplete portion of a multiword is not necessarily immediately adjacent to the fully formed multiword. In the last example, the three instances of in before erm, imperial measure have not been analysed as part of the multiword in general; they are instead tagged as ordinary words (in this case, ambiguous between preposition and prepositional adverb: PRP-AVP). There are a few cases where the tagger has probably been over-zealous in spotting repeated portions of multiwords:
As_UNC long_UNC As_CJS long as everyone recognizes that for an area of that size... [J9T.258]
because_PRP of the <pause> of_UNC the drought. When we were away it didn't get watered in. [KCH.982]
I know that in_UNC in_UNC in_AV0 general, in in in erm, imperial measure, it is <trunc> f </trunc> five feet eight inches [JK1.480]
Here, the first instance of now would probably have better been interpreted as a single word adverb (='at this time'), not part of the multiword conjunction now that4.
What happens now_UNC, now_CJS that you are winched down? [HEF.9]
- Er and erm inside multiwords
- Generally (in both written and spoken texts) the pause fillers er
and erm take the tag
UNC. This applies also when they appear within a multiword sequence, as in every er so often. The code assigned to the surrounding <mw> element is identical to that which would have been assigned if the filler were not present.
Note that in the last example the word at preceding the multiword at er best is treated as a partial repetition of that multiword, and therefore tagged
And your homework was handed in every er so often_AV0, you know [G64.152]
something had gone wrong with the <pause> gas pipes because erm of_PRP <pause> flooding. [KB8.5356]
these kind of books were, er, generally er, at , at er best_AV0 ignored [HUN]
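The treatment of pause fillers inside multiwords can be illustrated with a minimal sketch. It assumes that the wordclass code is carried in a c5 attribute on <mw> and <w> elements; the XML fragment and its word-level tags are invented for illustration and are not copied from the corpus. The function name multiwords is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the <mw>/<w> nesting follows the description above,
# but the c5 attribute name and the word-level tags are assumptions.
SNIPPET = """
<s>
  <w c5="VVN">handed </w><w c5="AVP">in </w>
  <mw c5="AV0"><w c5="AT0">every </w><w c5="UNC">er </w><w c5="AV0">so </w><w c5="AV0">often</w></mw>
</s>
"""

FILLERS = {"er", "erm"}  # pause fillers, which take the tag UNC


def multiwords(xml_text):
    """Return (text without pause fillers, multiword code) for each <mw>."""
    root = ET.fromstring(xml_text.strip())
    found = []
    for mw in root.iter("mw"):
        words = [(w.text or "").strip() for w in mw.iter("w")]
        core = [w for w in words if w.lower() not in FILLERS]
        found.append((" ".join(core), mw.get("c5")))
    return found


print(multiwords(SNIPPET))
```

Dropping the fillers before joining the words recovers the multiword every so often together with the code on its enclosing element, which is the same code it would carry without the filler.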
POS-tagging Error Rates
This section reports on the accuracy of the results of the improved tagging programs.
Levels of estimation
- (a) First, as a general assessment of accuracy, the estimated rates are given for the whole corpus. (See Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation) below.)
- (b) Secondly, separate estimates of ambiguity rates and error rates are given for each of the 57 word tags in the corpus. This will enable users of the corpus to assign appropriate degrees of reliability to each tag. Some tags are always correct; other tags are quite often erroneous. For example, the tag VDD stands for a single form of the verb do: the form did. Since the spelling did is unambiguous, the chances of ambiguity or error, in the use of the tag VDD, are virtually nil. On the other hand, the tag VVB (base finite form of a lexical verb) is not only quite frequent, but also highly prone to ambiguity and error. 15 per cent of the occurrences of VVB are errors - a much higher error rate than any other tag. (See Table 25. Estimated ambiguity rates and error rates by tag below.)
- (c) Thirdly, separate estimates of ambiguity rates and error rates are given for ‘wrong-tag--right-tag’ pairings XXX, YYY, consisting of (i) the actually-occurring erroneous tag XXX, and (ii) the correct tag YYY which should have occurred in its place. However, because the number of possible tag-pairs is large (57², i.e. 3,249), and most of these tag-pairs have few or no errors, only the more common pairings of erroneous tag and correct tag are separately listed, with their estimated probability of occurrence. This list of tag-pairings will help users further, in enabling them to estimate not merely the reliability of a tag, but, if that tag is incorrect, the likelihood that the correct tag would have been some other particular tag. In this way, the frequency of grammatical word classes, or individual words in those classes, can be estimated more accurately for the whole BNC. (See Table 26. Estimated frequency of selected tag-pairs below.)
Presentation of Ambiguity Rates and Error Rates (fine-grained mode of calculation)
In this section, we examine ambiguities and errors using a ‘fine-grained’ mode of calculation, treating each error as of equal importance to any other error. In Presentation of Ambiguity and Error Rates (coarse-grained calculation) we look at the same data in terms of a ‘coarse-grained’ mode of calculation, ignoring errors and ambiguities involving subcategories of the same part of speech.
Overall estimated ambiguity and error rates: based on the 50,000 word sample
As the following table shows, the ambiguity rate varies considerably between written and spoken texts. (However, note that the calculation for speech is based on a small sample of 5,000 words.)
|Sample tag count||Ambiguity rate (%)||Error rate (%)|
It will be noted that written texts on the whole have a higher ambiguity rate, whereas spoken texts have a slightly greater error rate.
The success of an automatic tagger is sometimes represented in terms of the information-retrieval measures of precision and recall, rather than ambiguity rate and error rate as in Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation). Precision is the extent to which incorrect tags are successfully discarded from the output. Recall is the extent to which all correct tags are successfully retained in the output of the tagger, allowing, however, for more than one reading to occur for one word (i.e. ambiguous tagging is permitted). According to these measures, the success of the tagging is as follows:
However, from now on we will continue to use ‘ambiguity rate’ and ‘error rate’, which appear to us more transparent.
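The two measures can be made concrete with a short sketch. The source gives no formulas, so the definitions below are one common formulation and should be read as an assumption: recall is the share of words whose correct tag survives in the output, and precision is the share of output tags that are correct, with an ambiguity tag counting as two output tags. The sample data are invented.

```python
def precision_recall(output_tags, correct_tags):
    """Precision and recall for a tagging that may contain ambiguity tags.

    output_tags: tagger output, e.g. "NN1" or "VVD-VVN" (one per word)
    correct_tags: the single correct tag for each word
    """
    retained = 0      # words whose correct tag is among the output readings
    total_output = 0  # total number of readings in the output
    for out, gold in zip(output_tags, correct_tags):
        readings = out.split("-")
        total_output += len(readings)
        if gold in readings:
            retained += 1
    return retained / total_output, retained / len(correct_tags)


# Invented toy sample of four words
output = ["VVD-VVN", "NN1", "AJ0", "AV0"]
gold = ["VVN", "NN1", "AV0", "AV0"]
precision, recall = precision_recall(output, gold)
print(precision, recall)
```

Under these definitions an ambiguity tag protects recall (the correct reading is kept) at the cost of precision (one extra reading is left in the output), which mirrors the trade-off between ambiguity rate and error rate.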
Estimated ambiguity and error rates for each tag (fine-grained mode of calculation)
The estimates for individual tags are again based on the 50,000-word sample, and the ambiguity rate for each tag is based on the number of ambiguity tags which begin with a given tag. The table also specifies the estimated likelihood that a given tag, in the first position of the ambiguity tag, is the correct tag.
In Table 25. Estimated ambiguity rates and error
rates by tag, column (b) shows the overall
frequency of particular tags (not including ambiguity tags). Column
(c) gives the overall occurrence of ambiguity tags, as well as of
particular ambiguity tags, beginning with a given tag. (Ambiguity
tags marked * are less ‘serious’ in that they apply to two
subcategories of the same part of speech, such as past tense and past
participle of the verb - see 4.1 below.) Column (d) shows which tags
are more or less likely to be found as the first part of an ambiguity
tag. For example, both
VVG have an
especially high incidence of ambiguity tags. Column (e) tells us,
given that we have observed an ambiguity tag, what is the likelihood
of the first tag’s being correct? Overall, there is more than a 3-1
chance that the first tag will be correct; but there are some
exceptions, where the chances of the first tag’s being correct are
much lower: for example,
PNI (indefinite pronoun). Note
that (f) and (g) exclude errors where the first tag of an ambiguity
tag is wrong; contrast Table 28. Estimated error rates for the whole corpus, and Table 29. Estimated error rates (by tag) column (c), below.
Estimated error rates specifying the incorrect tag and the correct tag (fine-grained calculation)
The next table, Table 26. Estimated frequency of selected tag-pairs, gives the frequency, as a percentage, of error-prone tag-pairs where XXX is the incorrect tag and YYY is the correct tag which should have occurred in its place. In the third column, the number of the specified error-type is listed, as a frequency count from the sample of 50,000 words. In the fourth column, this is expressed as a percentage of all the tagging errors of word category XXX (in Table 25. Estimated ambiguity rates and error rates by tag column (f)). The fifth column answers the question: if tag XXX occurs, what is the likelihood that it is an error for tag YYY? Where the number of occurrences of a given error-type is less than 5 (i.e. 1 in 10,000 words), they are ignored. Hence, Table 26. Estimated frequency of selected tag-pairs is not exhaustive: only the more likely error-types are listed. In the second column, we add, where useful, the individual words which trigger these errors.
As before, the asterisk * indicates a ‘less serious’ error, in which the erroneous and correct tags belong to the same major category or part of speech. As the table shows, the most frequent specific error types are within the verb category: VVB → VVI (55, or 9.8% of all VVB tags) and VVD → VVN (44, or 4.5% of all VVD tags).
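The three figures reported for each tag-pair (count, share of all errors for tag XXX, and share of all occurrences of tag XXX) can be derived as in the following sketch. It assumes a hand-checked sample supplied as (assigned tag, correct tag) pairs with ambiguity tags already set aside; the function name tag_pair_table and the toy data are illustrative only and do not reproduce the figures in the table.

```python
from collections import Counter


def tag_pair_table(pairs, min_count=5):
    """Rows of (XXX, YYY, count, % of XXX errors, % of XXX occurrences).

    pairs: iterable of (assigned_tag, correct_tag) from a checked sample.
    Error types with fewer than min_count occurrences are left out.
    """
    pairs = list(pairs)
    occurrences = Counter(assigned for assigned, _ in pairs)
    errors = Counter((a, c) for a, c in pairs if a != c)
    errors_by_tag = Counter()
    for (assigned, _), n in errors.items():
        errors_by_tag[assigned] += n
    rows = []
    for (assigned, correct), n in sorted(errors.items()):
        if n < min_count:
            continue
        rows.append((assigned, correct, n,
                     100 * n / errors_by_tag[assigned],
                     100 * n / occurrences[assigned]))
    return rows


# Invented toy sample: 10 words tagged VVB, 10 tagged VVD
toy = ([("VVB", "VVB")] * 6 + [("VVB", "VVI")] * 3 + [("VVB", "NN1")]
       + [("VVD", "VVD")] * 8 + [("VVD", "VVN")] * 2)
for row in tag_pair_table(toy, min_count=1):
    print(row)
```

The min_count parameter corresponds to the cut-off described above, under which error types with fewer than 5 occurrences in the sample are not listed.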
A further mode of calculation: ignoring subcategories of the same part of speech
Presentation of Ambiguity and Error Rates (coarse-grained calculation)
Yet a further way of looking at the ambiguities and errors in the corpus is to make a coarse-grained calculation in counting these phenomena. In a fine-grained measurement, which is the one assumed up to now, each tag is considered to define its own word class which is different from all other word classes. Using the coarse-grained calculation, on the other hand, we consider words to belong to different word classes (parts of speech) only when the major category is different. If we consider the pair NN1 (singular common noun) and NP0 (proper noun), the coarse-grained calculation says that the ambiguity tag NN1-NP0 or NP0-NN1 does not show tagging uncertainty, since both the proposed tags agree in categorizing the word as the same part of speech (a noun). So this does not add to the ambiguity rate. Similarly, the coarse-grained point of view on error is that, if a word is tagged as NN1 when it should be NP0, or vice versa, then this is not an error, because both tags are within the noun category. To summarize: in the fine-grained calculation, minor differences of wordclass count towards the ambiguity and error rates; in the coarse-grained calculation, they do not.
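The difference between the two modes can be sketched as follows. The grouping of tags into major categories is an assumption made for illustration: it covers only the noun and lexical-verb tags, and any tag not listed is treated as a category of its own. The names MAJOR, counts_as_ambiguous and counts_as_error are hypothetical.

```python
# Illustrative grouping only: noun and lexical-verb tags are merged,
# every other tag is treated as its own major category (an assumption).
MAJOR = {
    "NN0": "noun", "NN1": "noun", "NN2": "noun", "NP0": "noun",
    "VVB": "verb", "VVD": "verb", "VVG": "verb",
    "VVI": "verb", "VVN": "verb", "VVZ": "verb",
}


def major(tag):
    """Map a C5 tag to its major category (or to itself if unlisted)."""
    return MAJOR.get(tag, tag)


def counts_as_ambiguous(tag, coarse):
    """Does this tag add to the ambiguity rate in the chosen mode?"""
    parts = tag.split("-")
    if len(parts) == 1:
        return False
    if not coarse:
        return True
    return major(parts[0]) != major(parts[1])


def counts_as_error(assigned, correct, coarse):
    """Does this tagging add to the error rate in the chosen mode?"""
    readings = assigned.split("-")
    if coarse:
        return major(correct) not in {major(r) for r in readings}
    return correct not in readings


print(counts_as_ambiguous("NN1-NP0", coarse=False))  # fine-grained
print(counts_as_ambiguous("NN1-NP0", coarse=True))   # coarse-grained
```

In this sketch NN1-NP0 counts as an ambiguity only in the fine-grained mode, while a cross-category tag such as VVG-AJ0 counts in both.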
In this section, the same calculations are made as in section 3, except that errors and ambiguities which are confined within a major category (noun, verb, etc.) are ignored. In practice, most of the errors and ambiguities of this kind come from the difficulty the tagger finds in recognizing the difference between NN1 (singular common noun) and NP0 (proper noun), between VVD (past tense lexical verb) and VVN (past participle lexical verb), and between VVB (finite present tense base form, lexical verb) and VVI (infinitive lexical verb). Thus the ambiguity tags NN1-NP0, VVD-VVN and their mirror images do not occur in the relevant table (Table 28. Estimated error rates for the whole corpus) below. However, since there are no ambiguity tags for VVB and VVI, the problem of distinguishing these two shows up only in the error calculation.
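The distinction between the two calculations can be illustrated with a short Python sketch. The mapping from tag prefixes to major categories is an assumption made for this illustration and covers only nouns and lexical verbs; it is not the procedure actually used to produce the tables.

```python
# Hypothetical mapping from C5 tag prefixes to major categories.
# Only nouns and lexical verbs are covered; other tags map to themselves.
MAJOR_CATEGORY = {"NN": "noun", "NP": "noun", "VV": "verb"}


def major_category(tag):
    """Return the major category of a C5 tag, or the tag itself if unmapped."""
    return MAJOR_CATEGORY.get(tag[:2], tag)


def counts_as_ambiguous(tag, coarse=False):
    """True if `tag` is an ambiguity tag that counts towards the ambiguity rate."""
    if "-" not in tag:
        return False
    first, second = tag.split("-", 1)
    if coarse:
        # Ambiguity within one major category is ignored.
        return major_category(first) != major_category(second)
    return True


def counts_as_error(assigned, correct, coarse=False):
    """True if a single assigned tag counts as an error against the correct tag."""
    if coarse:
        return major_category(assigned) != major_category(correct)
    return assigned != correct
```

Under these assumptions, NN1-NP0 counts as ambiguous in the fine-grained calculation but not in the coarse-grained one, and a VVB tag that should have been VVI is an error only in the fine-grained calculation.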
The three tables in this section correspond with the three tables in the preceding section.
|Sample tag count||Ambiguity rate (%)||Error rate (%)|
It will be noted from Table 27. Estimated ambiguity and error rates for the whole corpus that this method of calculation reduces the overall ambiguity rate by c.1 per cent, and the overall error rate by c.0.5 per cent. We will not present coarse-grained tables corresponding to Table 25. Estimated ambiguity rates and error rates by tag and Table 26. Estimated frequency of selected tag-pairs above: these tables would be unchanged from the fine-grained calculation, except that the rows marked with an asterisk (*) would be deleted, and the other calculations changed as necessary.
Different modes of calculation: eliminating ambiguities
Given that the elimination of errors was beyond our capability within the time frame and budget we had available, the corpus in its present form, containing ambiguity tags as well as a small proportion of errors, is designed for what we believe will be the most common type of user, who will find it easier to tolerate ambiguity than error. However, other users may prefer a corpus which does not contain ambiguities, even though its error rate is higher. For this latter type of user, the present corpus is easy to interpret as a corpus free of ambiguities, simply by deleting or ignoring the second tag of any ambiguity tag, and accepting the first tag as the only one. In what follows, we therefore allow two modes of calculation: in addition to the "safer" mode, in which ambiguities are allowed and consequently errors are relatively low, we allow a "riskier" mode in which ambiguities are abolished, and errors are more frequent. In fact, if ambiguity tags are eliminated, the overall error rate rises to almost 2 per cent.
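A user who prefers the "riskier" mode could derive it along the following lines. This is a minimal Python sketch; the function name and the representation of the corpus as (word, tag) pairs are assumptions made for illustration.

```python
def to_riskier_mode(tagged):
    """Keep only the first tag of any ambiguity tag (e.g. VVD-VVN becomes VVD).

    `tagged` is a list of (word, tag) pairs; plain C5 tags pass through unchanged.
    """
    return [(word, tag.split("-", 1)[0]) for word, tag in tagged]
```

For example, `to_riskier_mode([("walked", "VVD-VVN"), ("dog", "NN1")])` gives `[("walked", "VVD"), ("dog", "NN1")]`.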
|Sample tag count||Error rate (%)|
The following table gives an error count (c) for each tag: i.e. the number of errors in the 50,000 word sample where that tag was the erroneous tag. [Cf. the "safer" error count in Table 26. Estimated frequency of selected tag-pairs, column (f).] In addition, each tag has a correction count (d): i.e. the number of erroneous tags for which that tag was the correct tag. If we subtract the Error count (c) from the Tag count (b), and add the Correction count (d) to the result, we arrive at the "Real tag count" (e) representing the number of occurrences of that tag in the corrected sample corpus. Not included in the table is the small number of ‘multiword’ errors which resulted in two tags being replaced by one (error count), or one tag being replaced by two (correction count), due to the incorrect non-use or use of multiword tags. The last column divides the error count by the tag count to provide the error rate (as a percentage).
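The arithmetic behind columns (e) and the final error rate column can be written out as follows. This is a sketch only: the function names are ours, and the figures in the usage note are invented for illustration rather than taken from the table.

```python
def real_tag_count(tag_count, error_count, correction_count):
    """Column (e): tag count (b) minus error count (c) plus correction count (d)."""
    return tag_count - error_count + correction_count


def error_rate(tag_count, error_count):
    """Error count (c) divided by tag count (b), expressed as a percentage."""
    return 100.0 * error_count / tag_count
```

With a hypothetical tag that occurs 1,000 times in the sample, 50 of them wrongly, and that should have been assigned in 30 further places, the real tag count is 980 and the error rate is 5 per cent.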
It is clear from this table that the amount of error in the
tagging of the corpus varies greatly from one tag to another. The
most error-prone tag, by a large margin, shows
more than 17 per cent error, while many of the tags are associated
with no errors at all, and well over half the tags have an error
rate of less than 1 per cent.
The final table gives figures for the third level of detail, where we
itemise individual tag pairs XXX, YYY, where XXX is the incorrect
tag, and YYY is the correct one which should have appeared but did
not. Only those pairings which account for 5 or more errors are
listed. This table differs from Table 26. Estimated frequency of selected tag-pairs in that here
the second tags of ambiguity tags are not taken into account
("riskier mode"). It will be seen that the errors which occur tend to
fall into a relatively small number of major categories.
Some of the error types above are associated with one or two particular words, and where these occur they are listed. For example, the AV0 - EX0 type of error occurs invariably with the one word there.
- Written imaginative writing
- G0S, ADY, H7P, GW0, FSF
- Written informative writing
- Spoken demographic
- Spoken context-governed
- D8Y, *FXH
The first four phases were carried out automatically, using CLAWS4, an automatic tagger which developed out of the CLAWS1 automatic tagger (authors: Roger Garside and Ian Marshall 1983) used to tag the LOB Corpus. The advanced version CLAWS4 is principally the work of Roger Garside, although many other researchers at Lancaster have contributed to its performance in one way or another. Further information about CLAWS4 can be obtained from Leech, Garside and Bryant 1994 and Garside and Smith 1997. CLAWS4 is a hybrid tagger, employing a mixture of probabilistic and non-probabilistic techniques. The fifth and sixth phases used other systems, described in the appropriate section below.
The first major step in automatic tagging is to divide up the text or corpus to be tagged into individual (1) word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence final punctuation followed by a capital letter). This procedure is not so straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence-demarcators) and of capital letters (which can signal a naming expression, as well as the beginning of a sentence). Faults in tokenization occasionally occur, but rarely cause tagging errors.
In tokenization, an orthographic word boundary (normally a
space, with or without accompanying punctuation) is the default
test for identifying the beginning and end of word-tokens. (See,
however, the next paragraph and D. Idiom-Tagging
below.) Hyphens are counted as word-internal, so that a
hyphenated word such as key-ring is given just one tag
(NN1). Because of the different ways of writing
compound words, the same compound may occur in three forms: as a
single word written ‘solid’ (markup), as a hyphenated
word (mark-up) or as a sequence of two words (mark
up). In the first two cases, CLAWS4 will give the compound
a single tag, whereas in the third case, it will receive two
tags: one for mark and the other for up.
A set of special cases dealt with by tokenization is the set of enclitic verb and negative contractions such as 's, 're, 'll and n't, which are orthographically attached to the preceding word. These will be given a tag of their own, so that (for example) the orthographic forms It's, they're, and can't are given two tags in sequence: pronoun + verb, verb + negative, etc. There are also 'merged' forms such as won't and dunno, which are decomposed into more than one word for tagging purposes. For example, dunno actually ends up with the three tags for do + n't + know (for a list of these contracted forms, see Contracted forms and multiwords).
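The treatment of hyphens and contractions described above can be approximated by the following Python sketch. It is not the CLAWS4 tokenizer: the regular expression, the one-entry table of merged forms and the function names are all assumptions made for illustration, and punctuation and sentence splitting are not handled.

```python
import re

# Hypothetical table of 'merged' forms; only the example given in the text is included.
MERGED_FORMS = {"dunno": ["do", "n't", "know"]}

# Enclitic contractions attached to the preceding word (illustrative subset).
ENCLITIC = re.compile(r"^(.+?)(n't|'s|'re|'ll)$", re.IGNORECASE)


def split_token(token):
    """Split one orthographic word into the units that receive separate tags."""
    if token.lower() in MERGED_FORMS:
        return list(MERGED_FORMS[token.lower()])
    match = ENCLITIC.match(token)
    if match:
        return [match.group(1), match.group(2)]
    # Hyphens are word-internal, so 'key-ring' stays a single token.
    return [token]


def tokenize(text):
    """Split on spaces, then separate contractions from their host words."""
    units = []
    for token in text.split():
        units.extend(split_token(token))
    return units
```

Here `tokenize("I dunno")` yields the four units I, do, n't and know, while key-ring remains one unit and mark up yields two.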
B. Initial assignment of tags
The second stage of CLAWS POS-tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various AJ0 (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VVB, VVI, i.e. as a noun or as a verb; the token broadcast can be tagged as VVB, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (AJ0), as in a broadcast concert.
- Look for the ending of a word: e.g. words in -ness will normally be nouns.
- Look for an initial capital letter (especially when the word is not sentence-initial). Rare names which are not in the lexicon and do not match other procedures will normally be recognized as proper nouns on the basis of the initial capital.
- Look for a final -(e)s. This is stripped off, to see if the word otherwise matches a noun or verb; if it does, the word in -s is tagged as a plural noun or a singular present-tense verb.
- Numbers and formulae (e.g. 271, *K9, +) are tagged by special rules.
- If all else fails, a word is tagged ambiguously as either a noun, an adjective or a lexical verb.
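The procedures listed above can be pictured as a cascade of tests. The Python sketch below is a loose illustration under stated assumptions: the tiny lexicon, the choice of CRD for numbers, the exact tags returned by the final fallback and the order of the tests are ours, not a description of the CLAWS4 implementation.

```python
import re

# Tiny hypothetical lexicon: word -> candidate C5 tags.
LEXICON = {"cat": ["NN1"], "box": ["NN1", "VVB", "VVI"]}


def guess_tags(word, sentence_initial=False):
    """Return candidate tags for a non-empty word, falling back on simple heuristics."""
    lower = word.lower()
    if lower in LEXICON:
        return list(LEXICON[lower])
    if lower.endswith("ness"):                      # word ending
        return ["NN1"]
    if word[0].isupper() and not sentence_initial:  # initial capital
        return ["NP0"]
    if lower.endswith("s"):                         # strip final -(e)s
        stems = [lower[:-1]]
        if lower.endswith("es"):
            stems.append(lower[:-2])
        for stem in stems:
            base = LEXICON.get(stem, [])
            tags = []
            if "NN1" in base:
                tags.append("NN2")                  # plural noun
            if "VVB" in base:
                tags.append("VVZ")                  # singular present-tense verb
            if tags:
                return tags
    if re.fullmatch(r"\d+", word):                  # plain numbers only
        return ["CRD"]
    return ["NN1", "AJ0", "VVB"]                    # if all else fails
```

With this lexicon, boxes receives the candidates NN2 and VVZ, and an unknown capitalised word in mid-sentence is guessed to be a proper noun.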
When a word is associated with more than one tag, information is given by the lexicon look-up or other procedures on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form, or where numerical data available are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.
Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of various tags depends partly on whether the word occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less likely to be a proper noun than an adjective or a common noun (normally written brown). Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.
C. Tag selection (or disambiguation)
The next stage, logically, is to choose the most probable tag from any ambiguous set of tags associated with a word token by tag assignment (but see D. Idiom-Tagging below). This is another probabilistic procedure, this time making use of the context in which a word occurs. A method known as Viterbi alignment uses the probabilistic estimates available, both in terms of the tag-word associations and the sequential tag-tag likelihoods, to calculate the most likely path through the sequence of tag ambiguities. (The model employed is largely equivalent to a hidden Markov model.) After tag selection, a single 'winning tag' is selected for each word token in a text. (The less likely tags are not obliterated: they follow the winning tag in descending probability order.) However, the winning tag is not necessarily the right answer. If the CLAWS tagging stopped at this point, only c.95-96% of the word-tokens would be correctly tagged. This is the main reason for including an additional stage (or rather a set of stages) termed 'idiom-tagging'.
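The general idea of finding the most likely path through a sequence of tag ambiguities can be sketched in Python as follows. This is a textbook bigram Viterbi search, not the CLAWS4 code; the function signature and all the probabilities in the example are invented for illustration.

```python
def viterbi(candidates, transition, start="<s>"):
    """Return the most probable tag sequence.

    candidates: one dict per token, mapping each possible tag to its
                lexical likelihood for that word.
    transition: dict mapping (previous_tag, tag) to a tag-pair likelihood;
                unseen pairs receive a small default value.
    """
    # best[tag] = (score of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for options in candidates:
        new_best = {}
        for tag, lexical_p in options.items():
            score, path = max(
                ((p * transition.get((prev, tag), 1e-6) * lexical_p, prev_path)
                 for prev, (p, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[tag] = (score, path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical likelihoods for the ambiguous token 'paint'.
PAINT = {"NN1": 0.5, "VVB": 0.3, "VVI": 0.2}
TRANSITION = {
    ("<s>", "AT0"): 0.3, ("AT0", "NN1"): 0.6, ("AT0", "VVB"): 0.01, ("AT0", "VVI"): 0.01,
    ("<s>", "TO0"): 0.1, ("TO0", "VVI"): 0.9, ("TO0", "NN1"): 0.01, ("TO0", "VVB"): 0.01,
}

the_paint = viterbi([{"AT0": 1.0}, PAINT], TRANSITION)
to_paint = viterbi([{"TO0": 1.0}, PAINT], TRANSITION)
```

With these invented figures, paint is selected as a noun after the article and as an infinitive after to, which shows how the context can override the lexical preference of an individual word.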
- The correct tag can only be selected if CLAWS looks at a word+tag sequence as a whole. In tag selection, this was not done, since the program merely used 'bigrams' consisting of two tags in sequence. In other words, idiom-tagging is more powerful than the Viterbi disambiguation algorithm because it is able to operate on a 'window' of several word tokens at once.
- There are many cases in English where a sequence of orthographic words is best assigned a single tag. Such cases include because of (a preposition), so long as (a conjunction), and of course (an adverb). These so-called multiwords are the opposite of the contracted forms such as don't and there's, where one orthographic word is assigned more than one tag. Thus idiom-tagging here plays the role of adjusting tokenization to larger units.
- a list of multiwords (just described) such as because of, so long as and of course.
- a list of place name expressions (e.g. Mount X , where X is some word beginning with a capital).
- a list of personal name expressions (e.g. Dr. (X) Y, where X and Y are words beginning with a capital; the word X may or may not appear in the matching word sequence).
- a list of foreign or classical language expressions used in English (e.g. de jure, hoi polloi)
- a list of grammatical sequences where there are typically 'slots' in the sequence which may or may not be filled: e.g. Modal + (adverb/negative) + (adverb/negative) + Infinitive. This matches a sequence such as would not necessarily like. The recognition that the word token like here is an infinitive verb (rather than, say, a present-tense verb or a preposition) could not be trusted if the tagger was not equipped with an idiom-tagging component, but had to rely simply on tag-pair probabilities.
The idiom-tagging component of CLAWS is quite powerful in matching 'template' expressions in which there are wild-card symbols, Boolean operators and gaps of up to n words. They are much more variable than ‘idioms’ in the ordinary sense, and resemble finite-state networks.
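A template with optional slots, such as Modal + (adverb/negative) + (adverb/negative) + Infinitive, might be matched along the following lines. The Python sketch is an illustration only: the data representation (each token carrying its candidate tags with the winning tag first) and the function name are assumptions, and the real Idiomlist formalism is considerably richer.

```python
def apply_modal_infinitive(tokens):
    """Promote VVI after a modal, allowing up to two adverb/negative slots.

    tokens: list of (word, candidate_tags) with the winning tag first.
    Returns a new list; the input is left unchanged.
    """
    result = [(word, list(tags)) for word, tags in tokens]
    for i, (_, tags) in enumerate(result):
        if tags[0] != "VM0":
            continue
        j = i + 1
        skipped = 0
        # Optional slots: at most two adverbs (AV0) or negatives (XX0).
        while j < len(result) and skipped < 2 and result[j][1][0] in ("AV0", "XX0"):
            j += 1
            skipped += 1
        if j < len(result) and "VVI" in result[j][1]:
            word, candidates = result[j]
            result[j] = (word, ["VVI"] + [t for t in candidates if t != "VVI"])
    return result


sentence = [
    ("would", ["VM0"]),
    ("not", ["XX0"]),
    ("necessarily", ["AV0"]),
    ("like", ["PRP", "VVB", "VVI"]),
]
patched = apply_modal_infinitive(sentence)
```

In the example, like enters with a preposition reading as its winning tag and leaves with VVI in first position, the other candidates being retained after it.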
Another important point about idiom-tagging is that it is split up into two main phases which operate at different points in the tagging system. One part of the idiom-tagging takes place at the end of Stage C., in effect retrospectively correcting some of the errors which would otherwise occur in CLAWS output. Another part, however, actually takes place between Stages B. and C. This means it can utilise ambiguous input and also produce ambiguous output, perhaps adjusting the likelihood of one tag relative to another. As an example, consider the case of so long as, which can be a single grammatical item - a conditional conjunction meaning 'provided that'. The difficulty is that so long as can also be a sequence of three separate grammatical items: degree adverb + adjective/adverb + conjunction/preposition. In this case, the tagging ambiguity belongs to a whole word sequence rather than a single word, and the output of the idiom-tagging has to be passed on to the probabilistic tag selection stage. Hence, although we have called idiom-tagging ‘Stage D’, it is actually split between two stages, one preceding C. and one following C.
Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.
entering VVG 86% NN1 14% AJ0 0%
E. After CLAWS: the Template Tagger
The error rate with CLAWS4 averages around 3%. For the BNC Tagging Enhancement project, we decided to concentrate our efforts on the rule-based part of the system, where most of the inroads in error reduction had been made. This involved (a) developing software with more powerful pattern-matching capabilities than the CLAWS Idiomlist, and (b) carrying out a more systematic analysis of errors, to identify appropriate error-correcting rules.
- it can refer to information at the level of the word, or tag, or by user-defined categories grouping lexical, grammatical, semantic or other related features
- it can handle a wide and variable context window, incorporating
These features can best be understood by an example. In BNC1 there were quite a number of errors in disambiguating prepositions from subordinating conjunctions, in connection with words like after, before, since and so on. The following rule corrects many such cases from subordinating conjunction (CJS) to preposition (PRP) tags. It applies a basic grammatical principle that subordinating conjunctions mark the start of clauses and generally require a finite verb somewhere later in the sentence.
#AFTER [CJS^PRP] PRP, ([!#FINITE_VB/VVN])16, #PUNC1
The two commas divide the rule into three units, each containing a word or tag or word+tag combination. Square brackets contain tag patterns, and a tag following square brackets is the replacement tag (ie the action part of the rule). #AFTER refers to a list of words like after, before and since, that have similar grammatical properties. These words are defined in a separate file; not all conjunction-preposition words are listed - as, for instance, can be used elliptically, without the requirement for a following verb. (See Tagging Guidelines under as). The definition for #FINITE_VB contains a list of possible POS-tags (rather than word values), eg VVZ/VV0/VM0. Finally #PUNC1 is a 'hard' punctuation boundary (one of . : ; ? and ! ). The patching rule can be interpreted as: 'If a sequence of the following kind occurs:
- a word like after, before or since, which CLAWS has identified as most likely being a subordinating conjunction, and less likely a preposition;
- an interval of up to 16 words, none of which has been tagged as a finite verb or past participle (NB [! … ] negates the tag pattern);
- a 'hard' punctuation boundary
then change the conjunction tag to preposition.'
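The logic of this patching rule can be paraphrased in Python as below. This is a sketch of the rule's interpretation, not Template Tagger itself: the word list, the set of C5 tags standing in for #FINITE_VB and the token representation (candidate tags with the winning tag first) are assumptions made for illustration.

```python
AFTER_WORDS = {"after", "before", "since"}          # stand-in for #AFTER
FINITE_VB = {"VVB", "VVD", "VVZ", "VM0", "VBB", "VBD", "VBZ",
             "VHB", "VHD", "VHZ", "VDB", "VDD", "VDZ"}  # stand-in for #FINITE_VB
BLOCKING = FINITE_VB | {"VVN"}
PUNC1 = {".", ":", ";", "?", "!"}                   # 'hard' punctuation


def patch_cjs_to_prp(tokens, max_gap=16):
    """Change CJS to PRP when no finite verb or VVN follows before hard punctuation.

    tokens: list of (word, candidate_tags) with the winning tag first.
    """
    out = [(word, list(tags)) for word, tags in tokens]
    for i, (word, tags) in enumerate(out):
        if word.lower() not in AFTER_WORDS or tags[0] != "CJS" or "PRP" not in tags[1:]:
            continue
        for gap, (next_word, next_tags) in enumerate(out[i + 1:]):
            if next_word in PUNC1:
                # At most max_gap words intervened, none of them blocking.
                out[i] = (word, ["PRP"] + [t for t in tags if t != "PRP"])
                break
            if gap >= max_gap or next_tags[0] in BLOCKING:
                break
    return out


phrase = [("After", ["CJS", "PRP"]), ("the", ["AT0"]), ("long", ["AJ0"]),
          ("meeting", ["NN1"]), (".", ["PUN"])]
clause = [("After", ["CJS", "PRP"]), ("he", ["PNP"]), ("left", ["VVD"]),
          (".", ["PUN"])]
```

Applied to the two examples, the rule retags After as a preposition in the first (no verb before the full stop) and leaves it as a conjunction in the second, where a past tense verb intervenes.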
The rule doesn't always work accurately, and doesn't cater for all preposition-conjunction errors. (i) It relies to a large extent on CLAWS having correctly identified finite verb tags in the right context of the preposition-conjunction; sometimes, however, a past participle is confused with a past tense form. (We therefore added VVN, ie past participle, as a possible alternative to #FINITE_VB in the second part of the pattern. The downside of this was that Template Tagger ignored some conjunction-preposition errors containing genuine use of VVN in the right context). (ii) The scope of the rule doesn't cover long sentences where more than 16 non-finite-verb words occur after the conjunction-preposition. A separate rule had to be written to handle such cases. (iii) Adverb uses of after, before and since etc. need to be fixed by additional rules.
Targeting and writing the Template rules
The Templates are targeted at the most error-prone
categories introduced (or rather, left unresolved) by CLAWS. As
with the preposition-conjunction example just shown, many
disambiguation errors congregate around pairs of tags, for
example adjective and adverb, or noun and verb. Sometimes a
triple is involved, eg a past tense verb (VVD),
past participle (VVN) and adjective
(AJ0) in the case of surprised.
A small team of researchers sought out patterns in the errors by concordancing a training corpus that contained two parallel versions of the tagging: the automatic version produced by CLAWS and a hand-corrected version, which served as a benchmark. A concordance query of the form "tag A | tag B", would retrieve lines where the former version assigned an incorrect tag A and the latter a correct tag B. An example is shown below, in which A is a subordinating conjunction and B a preposition.
By working interactively with the parallel concordance, sorting on the tags of the immediate context, testing for significant collocates to the left and right, and generally applying his/her linguistic knowledge, the researcher can often detect sufficient commonality between the tagging errors to formulate a patching rule (or a set of rules) such as that shown above. It took several iterations of training and testing to refine the rules to a point where they could be applied by Template Tagger to the full corpus.
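A query of the form "tag A | tag B" over two parallel taggings might be approximated as follows. The Python sketch assumes that the automatic and hand-corrected versions are token-aligned lists of (word, tag) pairs; the function name and output format are hypothetical and do not reproduce the concordancer actually used.

```python
def error_concordance(auto, gold, tag_a, tag_b, width=3):
    """Return context lines where `auto` has tag_a and `gold` has tag_b.

    auto, gold: parallel lists of (word, tag) for the same tokens.
    width: number of context words shown on each side.
    """
    lines = []
    for i, ((word, a), (_, b)) in enumerate(zip(auto, gold)):
        if a == tag_a and b == tag_b:
            left = " ".join(w for w, _ in auto[max(0, i - width):i])
            right = " ".join(w for w, _ in auto[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}_{a}|{b}] {right}".strip())
    return lines


auto = [("left", "VVD"), ("after", "CJS"), ("the", "AT0"), ("meeting", "NN1")]
gold = [("left", "VVD"), ("after", "PRP"), ("the", "AT0"), ("meeting", "NN1")]
hits = error_concordance(auto, gold, "CJS", "PRP")
```

Sorting such lines on the tags to the left or right of the bracketed word is then a matter of keeping the context tags alongside the words.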
It should be said that some categories of error were easier to write rules for than others. Finding productive rules for noun-verb correction was especially difficult, because of the many types of ambiguity between nouns, verbs and other categories, and the widely differing contexts in which they appear. The errors and ambiguity tags associated with NN1-VVB and NN2-VVZ in BNC2 in the error report testify to this problem. Here a more sophisticated lexicon, detailing the selectional restrictions of individual verbs and nouns (and other categories) would have undoubtedly been useful.
Ordering the rules
In some instances the ordering of rules was important. When two rules in the same ruleset compete, the longer match applies. Clashes arise in the case of the multiply ambiguous word as, for instance. Besides the clear grammatical choices between a preposition and a complementiser introducing an adverbial clause, there are many "interfering" idiomatic uses (as well as, as regards etc) and elliptical uses (The TGV goes as fast as the Bullet train [sc. goes]). To avoid interference between the rules, we found it preferable to let an earlier pass of the rules handle more idiomatic (or exceptional) structures, and let a later pass deal with the more regular grammatical dependencies.
In many rule sets, however, we found that ordering did not affect the overall result, as we tried to ensure each rule was 'true' in all cases. Since, however, more than one rule sometimes carried out the same tag change to a particular word, the system was not optimised for speed and efficiency.
Besides the ordering of rules within rulesets, it is worth considering the placement of Template Tagger within the tagging schema (Figure 1). Ideally, it would be sensible to exploit the full pattern-matching functionality of the Template Tagger earlier in the schema, using it in place of the CLAWS Idiomlist not just after statistical disambiguation, where it is undoubtedly necessary, but also before it. In this way Template Tagger could have precluded much unnecessary ambiguity passing to Stage C. above. The reason we did not do this was pragmatic: Template Tagger was in fact developed as a general-purpose annotation tool (see Fligelstone, Pacey and Rayson 1997), and not exclusively for the POS-tagging of BNC2. In future versions of the tagging software we hope to integrate Template Tagger more fully with CLAWS.
F. Postprocessing, including Ambiguity tagging
- The text is produced in a horizontal format, so that it can be read from left to right across the page or across the screen.
- The tags are enclosed in angle-brackets as follows: <NN1> according to the standard TEI-based CDIF mark-up of the British National Corpus.
- Normally the word will be output with a single tag - the one which CLAWS4 calculates to be most probable.
- "Ambiguity tags" (such as <NN1-AJ0>) are output if the difference between the probability of the first tag and of the second fails to reach a pre-decided threshold.
The final phase, "ambiguity tagging", merits a little further discussion. The requirement for such tags is clear when one observes that even using Template Tagger on top of CLAWS, there remains a residuum of error, around 2%, in the corpus. By permitting ambiguity tags we are effectively able to "hedge" in many instances that might otherwise have counted as errors - improving the chances of retrieving a particular tag, but at the cost of retrieving other tags as well. We considered that a reasonable goal would be to employ sufficient ambiguity tags to achieve an overall error rate for the corpus of 1%.
- collected each instance of A-B error, noting the difference in probability score between A and B.
- plotted each error against the probability difference
- found the threshold on the difference axis that would yield the target number of errors. Below this threshold each instance of A-B would be converted to an ambiguity tag.
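The threshold-finding step, and the way a threshold then decides between a single tag and an ambiguity tag, can be sketched in Python as follows. The function names and all the figures are invented for illustration; in practice a separate threshold was determined for each tag pair A-B and, as noted below, some were adjusted by hand.

```python
def find_threshold(error_diffs, target_errors):
    """Smallest threshold leaving at most `target_errors` errors at or above it.

    error_diffs: for each A-B error, the probability of the (wrong) first tag A
                 minus the probability of the (correct) second tag B.
    Errors whose difference falls below the threshold are converted to
    ambiguity tags and so no longer count as errors.
    """
    for threshold in [0.0] + sorted(set(error_diffs)) + [float("inf")]:
        remaining = sum(1 for d in error_diffs if d >= threshold)
        if remaining <= target_errors:
            return threshold


def output_tag(first, p_first, second, p_second, threshold):
    """Emit an ambiguity tag if the probability difference is below the threshold."""
    if p_first - p_second < threshold:
        return f"{first}-{second}"
    return first


# Hypothetical probability differences for five A-B errors.
diffs = [0.05, 0.10, 0.20, 0.40, 0.70]
threshold = find_threshold(diffs, target_errors=2)
```

With these figures the threshold comes out at 0.40: the three errors with smaller differences become ambiguity tags and two remain as errors. A word tagged VVG 86%, NN1 14% lies well above the threshold and keeps the single tag VVG.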
As we report under Error rates, the BNC in fact contains a higher error rate than 1%. This is because some thresholds applied at the 1% rate incurred a very high frequency of potential ambiguity tags: we hand-adjusted such thresholds if permitting a slight rise in errors led to a substantial reduction in the number of ambiguities. Further comments on stages E. and F. can be found in Smith 1997.
Additional annotation in BNC XML
The simplified wordclass scheme used for the second of these enhancements is listed in Simplified Wordclass Tags of the manual, where the mapping between these values and the C5 tags from which they are derived is also specified.
The lemmatization procedure adopted derives ultimately from work reported in Beale 1987, as subsequently refined by others at Lancaster, and applied in a range of projects including the JAWS program (Fligelstone 1994) and the book Word Frequencies in Written and Spoken English (Leech et al 2001). The basic approach is to apply a number of morphological rules, combining simple POS-sensitive suffix stripping rules with a word list of common exceptions.
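The basic approach, POS-sensitive suffix stripping backed by a list of exceptions, can be illustrated with the following Python sketch. The rules and exceptions shown are a handful of invented examples keyed on C5 tags; they are not the rules files actually used, and such simple rules would mis-handle many words that a fuller exception list has to catch.

```python
# Hypothetical exception list: (word, C5 tag) -> lemma.
EXCEPTIONS = {("went", "VVD"): "go", ("children", "NN2"): "child"}

# Hypothetical suffix rules per tag: (suffix to strip, replacement), tried in order.
SUFFIX_RULES = {
    "NN2": [("ies", "y"), ("s", "")],
    "VVZ": [("s", "")],
    "VVD": [("ed", "")],
    "VVN": [("ed", "")],
    "VVG": [("ing", "")],
}


def lemmatize(word, tag):
    """Return a lemma for `word`, using exceptions first and suffix rules second."""
    lower = word.lower()
    if (lower, tag) in EXCEPTIONS:
        return EXCEPTIONS[(lower, tag)]
    for suffix, replacement in SUFFIX_RULES.get(tag, []):
        if lower.endswith(suffix) and len(lower) > len(suffix):
            return lower[: len(lower) - len(suffix)] + replacement
    return lower
```

For instance, `lemmatize("cities", "NN2")` gives city by rule, while `lemmatize("went", "VVD")` gives go from the exception list.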
This process was carried out during the XML conversion, using code and a set of rules files kindly supplied by Paul Rayson.
- Beale, A.D. (1987) 'Towards a distributional lexicon'. In Garside et al. (1987).
- Brill, E. (1992) 'A simple rule-based part-of-speech tagger'. Proceedings of the 3rd conference on Applied Natural Language Processing. Italy: Trento.
- Fligelstone, S., Pacey, M. and Rayson, P. (1997) 'How to Generalize the Task of Annotation'. In Garside et al. (1997).
- Garside, R., Leech, G. and Sampson, G. (eds.) (1987) The Computational Analysis of English. London: Longman.
- Garside, R., Leech, G. and McEnery, A. (eds.) (1997) Corpus Annotation. London: Longman.
- Garside, R. and Smith, N. (1997) 'A hybrid grammatical tagger: CLAWS4'. In Garside et al. (1997).
- Leech, G., Garside, R. and Bryant, M. (1994) 'CLAWS4: The tagging of the British National Corpus'. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94). Japan: Kyoto, pp. 622-628.
- Leech, G., Rayson, P. and Wilson, A. (2001) Word frequencies in written and spoken English based on the British National Corpus. London: Longman.
- Marshall, I. (1983) 'Choice of Grammatical wordclass without Global Syntactic Analysis: Tagging Words in the LOB Corpus'. Computers and the Humanities 17, 139-50.
- Smith, N. (1997) 'Improving a Tagger'. In Garside et al. (1997).