This section of the User Reference Guide is derived from the Manual to accompany The British National Corpus (Version 2) with Improved Word-class Tagging originally prepared for the BNC World edition by Geoffrey Leech and Nicholas Smith at the University of Lancaster.
The wordclass tagging has not changed significantly between the BNC World edition (2001) and the BNC XML edition (2006). In particular, no attempt has been made to retag the corpus completely, desirable though this might be. Changes have been made in the treatment of multiword units and some additional annotation has been provided (see 6.6.4.7 Additional annotation in BNC XML), but in most respects the wordclass information provided by the corpus now is identical to that provided with the first release of the BNC in 1994.
The BNC is wordclass-tagged using a set of 57 tags (known as C5) which we refer to as the "BNC Basic Tagset". (There are also 4 punctuation tags, excluded from consideration here.) Each C5 tag denotes a grammatical class of words and is written as a three-character code such as NN1 for "singular common noun". The codes are, in many cases, mnemonic.
The BNC, consisting of c.100 million words, was tagged automatically, using the CLAWS4 automatic tagger developed by Roger Garside at Lancaster, and a second program, known as Template Tagger, developed by Mike Pacey and Steve Fligelstone. Further details are given below, and also in Garside and Leech 1997, chapters 7-9. With such a large corpus, there was no opportunity to undertake post-editing, i.e. disambiguation and correction of tagging errors produced by the automatic tagger, and so the errors (about 1.15 per cent of all words) remain. In addition, the corpus contains ambiguous taggings (c.3.75 per cent of all words), shown in the form of ambiguity tags (also called ‘portmanteau tags’), consisting of two C5 tags linked by a hyphen: e.g. VVD-VVN. These tags indicate that the automatic tagger was unable to determine, with sufficient confidence, which was the correct category, and so left two possibilities for users to disambiguate themselves, if they should wish to do so. For example, in the case of VVD-VVN, the first (more likely) tag, say for a word such as wanted, is VVD: past tense of lexical verb; and the second (less likely) tag is VVN: past participle of lexical verb. On the whole, the likelihood of the first tag of an ambiguity tag being correct is better than 3 to 1; see, however, the details of individual tags in Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation) of the error report document.
After the automatic tagging, some manual tagging was undertaken to correct some particularly blatant errors, mainly foreign or classical words embedded in English text. CLAWS is not very successful at detecting these foreign words and tagging them with their appropriate tag (UNC), except when they form part of established expressions such as ad hoc or nom de plume - in which case they are normally given tags appropriate to their grammatical function, e.g. as nouns or adverbs.
The main purpose of the report on estimated error rates is to document the rather small percentage of ambiguities and errors remaining in the tagged BNC, so that users of the corpus can assess the accuracy of the tagging for their own purposes. Since, not surprisingly, we have been unable to inspect each of the 100 million tags in the BNC, we have had to estimate ambiguity rates and error rates on the basis of a manual post-editing of a corpus sample of 50,000 words. The estimate is based on twenty-four 2,000-word text extracts and two 1,000-word extracts, selected so as to be as far as possible representative of the whole corpus.
Regarding the segmentation of a text into individual word-tokens (called tokenization), our tagging practice in general follows the default assumption that an orthographic word (separated by spaces from adjacent words, with or without punctuation) is the appropriate unit for wordclass tagging. There are, however, exceptions to this. For example, a single orthographic word may consist of more than one grammatical word: in the case of enclitic verb contractions (as in she’s, they’ll, we’re) and negative contractions (as in don’t, isn’t, won’t), it is appropriate to assign two different wordclass tags to the same orthographic word. A full list of such contracted forms recognized by CLAWS and preserved in the XML markup is given in section 9.7 Contracted forms and multiwords.
Also quite frequent is the opposite circumstance, where two or more orthographic words are given a single wordclass tag: e.g. multiword adverbs such as of course and in short, and multiword prepositions such as instead of and up to are each assigned a single word tag (AV0 for adverbs, PRP for prepositions). Sometimes, whether such orthographic sequences are to be treated as a single word for tagging purposes depends on the context and its interpretation. In short is in some circumstances not an adverb but a sequence of preposition + adjective (e.g. in short sharp bursts). Up to in some contexts needs to be treated as a sequence of two grammatical words: adverbial-particle + preposition-or-infinitive-marker (e.g. We had to phone her up to get the code.).
In the XML edition of the corpus, such multiword sequences are marked using an element (<mw>) which carries the wordclass assigned to the whole sequence. Within the <mw> element, the individual orthographic words are also marked, using the <w> element in the same way as elsewhere. For example, the multiword unit of course is marked up as follows:
<mw c5="AV0"> <w c5="PRF" hw="of" pos="PREP">of </w> <w c5="NN1" hw="course" pos="SUBST">course </w> </mw>
In one respect, we have allowed the orthographic occurrence of spaces to be criterial. This is in the tagging of compound words such as markup, mark-up and mark up. Since English orthographic practice is often variable in such matters, the same ‘compound’ expression may occur in the corpus tagged as two words (if they are separated by spaces) or as one word (if the sequence is printed solid or with a hyphen). Thus mark up (as a noun) will be tagged NN1 AVP, whereas markup or mark-up will be tagged simply NN1.
Many detailed decisions have to be made about where to draw the line between the correct and the incorrect assignment of a tag. So that the concept of what is a ‘correct’ or ‘accurate’ annotation can be determined, there have to be detailed guidelines of tagging practice. These constitute the Wordclass Tagging Guidelines.
The assignment of an example of ‘Verb+ing’ to the adjective category relies heavily on a semantic criterion, viz. the ability to paraphrase Verb+ing Noun by ‘Noun + Relative Clause that/which/who be Verb+ing’ or ‘that/which/who Verb(s)’ (e.g. the rising sun = the sun which is/was rising; a working mother = a mother who works). These contrast with a case such as dining table, where the first word dining is judged to be a noun. The reason for this is that the paraphrasable meaning of the expression is not ‘a table which is/was dining or dines’, but rather ‘a table (used) for dining’. Although somewhat arbitrary, this relative clause test is well established in English grammatical literature, and such criteria are useful in enabling a reasonable degree of consistency in tagging practice to be achieved, so that the success rate of corpus tagging can be checked and evaluated. (See further Adjective vs. noun)
In practice, in our post-edited sample, we chose the first tag to be correct in these cases.
The permitted ambiguity tags are listed in the Wordclass tagging guidelines ( Ambiguity Tag list).
It will be noted that overall 30 ambiguity tags are recognized. We also observe that each ambiguity tag (eg VVD-VVN) is matched by another ambiguity tag which is its mirror image (eg VVN-VVD). The ordering of tags is significant: it is the first of the two tags which is estimated by the tagger to be the more likely. Hence the interpretation of an ambiguity tag X-Y may be expressed as follows: ‘There is not sufficient confidence to choose between tags X and Y; however, X is considered to be more likely.’
Tag | Description |
AJ0 | Adjective (general or positive) (e.g. good, old, beautiful) |
AJC | Comparative adjective (e.g. better, older) |
AJS | Superlative adjective (e.g. best, oldest) |
AT0 | Article (e.g. the, a, an, no) |
AV0 | General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest) |
AVP | Adverb particle (e.g. up, off, out) |
AVQ | Wh-adverb (e.g. when, where, how, why, wherever) |
CJC | Coordinating conjunction (e.g. and, or, but) |
CJS | Subordinating conjunction (e.g. although, when) |
CJT | The subordinating conjunction that |
CRD | Cardinal number (e.g. one, 3, fifty-five, 3609) |
DPS | Possessive determiner-pronoun (e.g. your, their, his) |
DT0 | General determiner-pronoun: i.e. a determiner-pronoun which is not a DTQ or an AT0. |
DTQ | Wh-determiner-pronoun (e.g. which, what, whose, whichever) |
EX0 | Existential there, i.e. there occurring in the there is ... or there are ... construction |
ITJ | Interjection or other isolate (e.g. oh, yes, mhm, wow) |
NN0 | Common noun, neutral for number (e.g. aircraft, data, committee) |
NN1 | Singular common noun (e.g. pencil, goose, time, revelation) |
NN2 | Plural common noun (e.g. pencils, geese, times, revelations) |
NP0 | Proper noun (e.g. London, Michael, Mars, IBM) |
ORD | Ordinal numeral (e.g. first, sixth, 77th, last). |
PNI | Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) |
PNP | Personal pronoun (e.g. I, you, them, ours) |
PNQ | Wh-pronoun (e.g. who, whoever, whom) |
PNX | Reflexive pronoun (e.g. myself, yourself, itself, ourselves) |
POS | The possessive or genitive marker 's or ' |
PRF | The preposition of |
PRP | Preposition (except for of) (e.g. about, at, in, on, on behalf of, with) |
PUL | Punctuation: left bracket - i.e. ( or [ |
PUN | Punctuation: general separating mark - i.e. . , ! , : ; - or ? |
PUQ | Punctuation: quotation mark - i.e. ' or " |
PUR | Punctuation: right bracket - i.e. ) or ] |
TO0 | Infinitive marker to |
UNC | Unclassified items which are not appropriately considered as items of the English lexicon. |
VBB | The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative] |
VBD | The past tense forms of the verb BE: was and were |
VBG | The -ing form of the verb BE: being |
VBI | The infinitive form of the verb BE: be |
VBN | The past participle form of the verb BE: been |
VBZ | The -s form of the verb BE: is, 's |
VDB | The finite base form of the verb DO: do |
VDD | The past tense form of the verb DO: did |
VDG | The -ing form of the verb DO: doing |
VDI | The infinitive form of the verb DO: do |
VDN | The past participle form of the verb DO: done |
VDZ | The -s form of the verb DO: does, 's |
VHB | The finite base form of the verb HAVE: have, 've |
VHD | The past tense form of the verb HAVE: had, 'd |
VHG | The -ing form of the verb HAVE: having |
VHI | The infinitive form of the verb HAVE: have |
VHN | The past participle form of the verb HAVE: had |
VHZ | The -s form of the verb HAVE: has, 's |
VM0 | Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd) |
VVB | The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive] |
VVD | The past tense form of lexical verbs (e.g. forgot, sent, lived, returned) |
VVG | The -ing form of lexical verbs (e.g. forgetting, sending, living, returning) |
VVI | The infinitive form of lexical verbs (e.g. forget, send, live, return) |
VVN | The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned) |
VVZ | The -s form of lexical verbs (e.g. forgets, sends, lives, returns) |
XX0 | The negative particle not or n't |
ZZ0 | Alphabetical symbols (e.g. A, a, B, b, c, d) |
Total number of wordclass tags in the BNC basic tagset = 57, plus 4 punctuation tags
In addition, there are 30 "Ambiguity Tags". These are applied wherever the probabilities assigned by the CLAWS automatic tagger to its first and second choice tags were considered too low for reliable disambiguation. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
Ambiguity tag | Ambiguous between | More probable tag |
AJ0-AV0 | AJ0 or AV0 | AJ0 |
AJ0-NN1 | AJ0 or NN1 | AJ0 |
AJ0-VVD | AJ0 or VVD | AJ0 |
AJ0-VVG | AJ0 or VVG | AJ0 |
AJ0-VVN | AJ0 or VVN | AJ0 |
AV0-AJ0 | AV0 or AJ0 | AV0 |
AVP-PRP | AVP or PRP | AVP |
AVQ-CJS | AVQ or CJS | AVQ |
CJS-AVQ | CJS or AVQ | CJS |
CJS-PRP | CJS or PRP | CJS |
CJT-DT0 | CJT or DT0 | CJT |
CRD-PNI | CRD or PNI | CRD |
DT0-CJT | DT0 or CJT | DT0 |
NN1-AJ0 | NN1 or AJ0 | NN1 |
NN1-NP0 | NN1 or NP0 | NN1 |
NN1-VVB | NN1 or VVB | NN1 |
NN1-VVG | NN1 or VVG | NN1 |
NN2-VVZ | NN2 or VVZ | NN2 |
NP0-NN1 | NP0 or NN1 | NP0 |
PNI-CRD | PNI or CRD | PNI |
PRP-AVP | PRP or AVP | PRP |
PRP-CJS | PRP or CJS | PRP |
VVB-NN1 | VVB or NN1 | VVB |
VVD-AJ0 | VVD or AJ0 | VVD |
VVD-VVN | VVD or VVN | VVD |
VVG-AJ0 | VVG or AJ0 | VVG |
VVG-NN1 | VVG or NN1 | VVG |
VVN-AJ0 | VVN or AJ0 | VVN |
VVN-VVD | VVN or VVD | VVN |
VVZ-NN2 | VVZ or NN2 | VVZ |
Total number of wordclass tags including punctuation and ambiguity tags = 91.
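Because an ambiguity tag is simply two C5 codes joined by a hyphen, with the more likely code first, users who want a fully disambiguated analysis can fall back on the first code mechanically. The following minimal Python sketch (the function name is ours, not part of any BNC software) illustrates the convention:

def resolve_c5(tag):
    """Return (best_tag, is_ambiguous) for a C5 tag or ambiguity tag.

    For an ambiguity tag such as 'VVD-VVN' the first component is the one
    CLAWS judged more likely, so we fall back on it; a plain tag such as
    'NN1' is returned unchanged.
    """
    if "-" in tag:
        first, _second = tag.split("-", 1)
        return first, True
    return tag, False

print(resolve_c5("VVD-VVN"))   # ('VVD', True)
print(resolve_c5("NN1"))       # ('NN1', False)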
Throughout this section, we will show text examples in a format which is different from the XML contained in the corpus but which will highlight the particular tag that is being discussed. The XML tagging (for example, paragraph and pause markers) is not generally relevant to the present discussion and is usually invisible when using concordancing software such as Xaira, BNCWeb, or WordSmith.
As noted above, each word in the corpus is marked by an XML <w> element which provides three additional pieces of information: the wordclass, carried by the c5 attribute; a headword or lemma derived from the word, carried by the hw attribute; and a simplified wordclass derived from the c5 value, carried by the pos attribute. In the examples in this section, the C5 tag is simply attached to the relevant word by an underscore, as in:
...apparently we eat more chocolate than_CJS any other country. [G3U.1000]
This is purely as an aid to reading the present document; in the corpus itself, all wordclass tagging is represented using the XML conventions shown above. Each example is followed by a reference giving the text from which it is taken and the number of the <s> element within it. We use this method throughout the following examples, where they are taken from the BNC. Thus, the example above is taken from s-unit 1000 of text G3U.
In sections 6.5.9 Disambiguation Guide and 6.5.9.2 Disambiguation by Word below, we occasionally cite cases where the POS-tagging in the corpus does not match the tag given in the citation, in that it is either an error or an ambiguity tag. This is to give an idea of the contexts in which the resolution of ambiguities has been less reliable. We list the tag found in the corpus next to the file reference with an asterisk; e.g. for well we give the ideal tag as VVB, but the actual tag as AV0:
Tears well_VVB up in my eyes. [BN3.5 *AV0]
Note also that we occasionally use invented examples, rather than corpus citations, especially where a contrast between categories is being made.
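For readers who want to recover the full XML information rather than the simplified underscore format, the following sketch shows how the <w> markup described above might be read with Python's standard library. The sample string is an invented fragment modelled on the example sentence; the particular c5, hw and pos values given to each word are illustrative assumptions, and real corpus files may additionally declare an XML namespace which would need to be handled.

import xml.etree.ElementTree as ET

# An invented <s> fragment in the style described above (attribute values
# are illustrative, not taken from the corpus itself).
sample = ('<s n="1000">'
          '<w c5="PNP" hw="we" pos="PRON">we </w>'
          '<w c5="VVB" hw="eat" pos="VERB">eat </w>'
          '<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>'
          '<w c5="CJS" hw="than" pos="CONJ">than</w>'
          '</s>')

sentence = ET.fromstring(sample)
for w in sentence.iter("w"):
    # The element text is the word form; the attributes carry the C5 tag,
    # the headword (lemma) and the simplified wordclass.
    print(w.text.strip(), w.get("c5"), w.get("hw"), w.get("pos"))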
CLAWS splits enclitic verb contractions and negative contractions into their grammatical parts, for example:
doesn't = does_VDZ n't_XX0
dunno = du_VDB n_XX0 no_VVI
wanna = wan_VVB na_TO0 or wan_VVB na_AT0
gimme = gim_VVB me_PNP
This procedure sometimes results in strange-looking word divisions, particularly with the fused words. However, they do provide a ready means of comparison with the full forms, such as want_VVB to_TO0 and give_VVB me_PNP.
Ai_UNC n't_XX0 got yours yet [KCT.1281]
The term `multiwords' denotes multiple-word combinations to which CLAWS assigns a single wordclass tag - for example, a complex preposition, an adverbial, or a foreign expression naturalised into English as a compound noun. In the XML version of the corpus, these sequences are explicitly marked using an XML element (<mw>). The individual orthographic words of which the sequence is composed are also marked, in the same way as other words, using the <w> element.
When displaying examples which contain multiwords in this chapter, we display only the wordclass of the outermost <mw> element. Its boundaries are indicated, where possible, by extra highlighting:
Of course_AV0 I can. [H9V.212]
The wordclass tags assigned to constituent parts of multiword items are listed in 9.7 Contracted forms and multiwords. This part of the wordclass tagging was done automatically during the XML conversion process, and has not been checked by CLAWS.
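A sketch, under the same assumptions as the earlier example, of how multiword units could be pulled out of the markup: the <mw> element carries the tag assigned to the whole sequence, while its child <w> elements carry the tags of the constituent words.

import xml.etree.ElementTree as ET

# Invented fragment based on the 'Of course I can' example above;
# the attribute values for 'I' and 'can' are illustrative assumptions.
sample = ('<s n="212">'
          '<mw c5="AV0">'
          '<w c5="PRF" hw="of" pos="PREP">Of </w>'
          '<w c5="NN1" hw="course" pos="SUBST">course </w>'
          '</mw>'
          '<w c5="PNP" hw="i" pos="PRON">I </w>'
          '<w c5="VM0" hw="can" pos="VERB">can</w>'
          '</s>')

for mw in ET.fromstring(sample).iter("mw"):
    words = "".join(w.text for w in mw.findall("w")).strip()
    constituent_tags = [w.get("c5") for w in mw.findall("w")]
    print(words, mw.get("c5"), constituent_tags)   # Of course AV0 ['PRF', 'NN1']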
The stage in between_PRP the original negative and the dupe is called an interpositive [FB8.295]
The truth lies somewhere in between_AV0 [ABK.2834]
but_CJC for_PRP years now darkness has been growing [F99.2027] cf.
which they would not have done but for_PRP the presence of the police. [H81.766]
A title and/or_CJC an author's name [H0S.358]
You should be a graduate in Electrical/Electronic_AJ0 Engineering, Physics , Mathematics , Computing or a related discipline. [CJU.1049]
A child_NN1.
Several children_NN2
An air_NN1 of distinction_NN1
Fifteen miles_NN2 away
Now the government_NN0 is considering new warnings on steroids ... [K24.3057]
... the Government_NN0 are putting people's lives in jeopardy. [A7W.518]
I caught a fish_NN0.[KBW.316]
I had caught four fish_NN0 with hardly any effort[B0P.1387]
Cheese_NN1 is a protein of high biological value. [ABB.1950]
three cheeses_NN2. [CH6.7834]
A car_NN1 glistens in the distance_NN1. [HH0.1035]
Three cars_NN2, two lorries_NN2 and a motorbike_NN1! [CHR.290]
Crewe are top of div_NN1 3 by 8 points [J1C.961] (where div = division)
1 km_NN0
400 km_NN0 (km = 'kilometre' or 'kilometres')
1 oz_NN0.
6 oz_NN0 (oz = 'ounce' or 'ounces')
Nouns such as hundred, hundreds, dozens, gross, are all tagged as numbers, CRD, rather than nouns.
Sally_NP0
Joe_NP0 Bloggs_NP0
Madame_NP0 Pompadour_NP0
Leonardo_NP0 da_NP0 Vinci_NP0
London_NP0
Lake_NP0 Tanganyika_NP0
New_NP0 York_NP0
April_NP0
John_NP0 Smith_NP0. All of the Smiths_NP0.
John F. Kennedy = John_NP0 F._NP0 Kennedy_NP0
J. F. Kennedy = J._NP0 F._NP0 Kennedy_NP0
J.F. Kennedy = J.F._NP0 Kennedy_NP0
In the spoken part of the BNC, however, the components of names — and, in fact, most words — that are spelt aloud as individual letters, such as I B M, and J R in J R Hartley, are not tagged NP0 but ZZ0 (letter of the alphabet). See further Letter
Pastor_NP0 Tokes_NP0
Chairman_NP0 Mao_NP0
Sub-Lieutenant_NP0 R_NP0 C_NP0 V_NP0 Wynn_NP0
Sister_NP0 Wendy_NP0
Contrast the following, where Wendy is in apposition to the common noun sister, in lowercase letters:
You remember your sister_NN1 Wendy_NP0... [HGJ.800]
East_NP0 Timor_NP0
South_NP0 Carolina_NP0
Baker_NP0 Street_NP0
West_NP0 Harbour_NP0 Lane_NP0
the_AT0 United_NP0 Kingdom_NP0
the_AT0 Baltic_NP0
the_AT0 Indian_NP0 Ocean_NP0
Mount_NP0 St_NP0 Helens_NP0
the_AT0 Alps_NP0
Latin_AJ0 America_NP0
Western_AJ0 Europe_NP0
the_AT0 Western_AJ0 Region_NN1
the_AT0 People_NN0 's_POS Republic_NN1 of_PRF China_NP0
the_AT0 Dominican_AJ0 Republic_NN1
the_AT0 Sultanate_NN1 of_PRF Oman_NP0
the_AT0 United_NP0 States_NP0
the_AT0 Soviet_AJ0 Union_NN1
Northern_NP0 Ireland_NP0
Western_NP0 Samoa_NP0
There is a slight inconsistency here, in that acronyms of organisation names (WHO, NATO, IBM etc.) take NP0, whereas the expanded forms of these names take regular tags.
Cable_NN1 and_CJC Wireless_NN1
Procter_NP0 and_CJC Gamble_NP0
Acorn_NN1 Marketing_NN1 Limited_AJ0
Minolta_NP0; IBM_NP0; NATO_NP0
Wolverhampton_NP0 Wanderers_NN2 ( football_NN1 club_NN1 )
Tottenham_NP0 Hotspur_NP0 (football_NN1 club_NN1 )
The_AT0 Chicago_NP0 Bears_NN2
Spartak_NP0 Moscow_NP0
World_NN1 Health_NN1 Organisation_NN1
Oxfam_NP0
Windows_NN2 software_NN1
Weetabix_NP0
Lancashire_NP0 Evening_NN1 Post_NN1
Mars_NP0 bars_NN2
Time_NN1 Magazine_NN1
Scotchgard_NP0
The_AT0 Reader_NN1 's_POS Digest_NN1
Perrier_NP0 water_NN1
John drives a Volkswagen_NP0 Golf_NN1
John drives a Volkswagen_NP0.
Here again NP0 is reserved for parts of names that are specially coined, or derived from existing personal/geographical proper nouns.
Body_NN1 Shop_NN1
Mothercare_NP0
The_AT0 Grand_AJ0 Theatre_NN1
Sainsburys_NP0 supermarket_NN1
The_AT0 King_NN1 's_POS Arms_NN2
The_AT0 Ritz_NP0
Red_AJ0 Rum_NN1
Aldaniti_NP0
The_AT0 Bounty_NN1
The_AT0 Titanic_NP0
B | Forms of be (VBB VBD VBG VBI VBN VBZ) |
D | Forms of do ( VDB VDD VDG VDI VDN VDZ) |
H | Forms of have ( VHB VHD VHG VHI VHN VHZ) |
M | Other modal verbs (VM0) |
V | Lexical verb (VVB VVD VVG VVI VVN VVZ) |
B | base form finite |
D | past tense |
Z | 3rd person sing present |
N | past participle |
I | infinitive |
G | present participle |
she is_VBZ playing her best tennis for six years. [CH3.1382]
she is_VBZ just a star. [CH3.6939]
John has_VHZ built a set of bookshelves. [C9X.121]
John has_VHZ great courage. [CA9.1869]
We did_VDD n't_XX0 see anybody. [KB2.702]
They do_VDB nice work. [ANY.514]
they shouldn't of_VHI left it the last minute [KD8.7288]
That could of_VHI been 'bout us [B38.322]
She travels_VVZ in every Saturday morning. [KRH.4013]
The young kids want_VVB to dance_VVI and have fun [CHA.1599]
I thought_VVD he looked_VVD a sad sort of a boy. [CDY.2831]
...after running_VVG out of coal, the crew were forced_VVN to burn_VVI timber and resin [HPS.269]
We can_VM0 go there.
We could_VM0 go there.
We used_VM0 to_TO0 go there every year.
Let's_VM0 go_VVI! [A61.1443]
Are_VBB n't_XX0 you coming? [A0R.2215]
I du_VDB n_XX0 no_VVI [KR0.23]
She suggested that they get_VVB married. [CBC.12107]
Please be_VBB patient. [CHJ.899]
Do_VDB n't_XX0 just stand there watching! [ACB.3470]
you're not going_VVG to_TO0 get killed [KCE.6550]
you ought_VM0 to_TO0 let them know. [KCT.6115]
Adjectives are given one of the wordclass tags AJ0, AJC, or AJS.
The ground was dry_AJ0 and dusty_AJ0 [GWA.118]
The dust from the dry_AJ0 ground [GWA.121]
Events in Eastern Europe were evidently uppermost_AJ0 in Mr Li's mind. [A95.366]
Family contacts were very important in uniting the upper_AJ0 classes [FB6.1495]
Will you be able_AJ0 to manage? (catenative)
Your son is very able_AJ0 (non-catenative)
A faster_AJC car.
The best_AJS in its class.
Ambiguities frequently arise between adjectives and other wordclasses, in particular adverbs, nouns and participles.
Adverbs are given one of the tags AV0, AVQ, or AVP
very_AV0 tall_AJ0
rather_AV0 painfully_AV0
However_AV0, …
In addition_AV0
aged between 2 and 11 years inclusive_AV0 [AMD.31]
the buildings thereon_AV0 [J16.813]
during 1986-91 inclusive_AV0 [FT0.1400]
Diamonds galore_AV0 [FPH.900]
you know like_AV0, it's worthwhile opening a cinema at 4 o'clock... [F7A.358]
Note that adverbs, unlike adjectives, are not tagged as positive, comparative, or superlative. This is because of the relative rarity of comparative and superlative adverbs.
"When_AVQ do your courses start?" [A0F.3117]
"...if you let me know when_AVQ the police are called in." [BMU.2291]
Yet why_AVQ is that so? [CR7.3089]
Ordinal-type adverbs (including first, fourth, etc.) are treated separately with the ORD tag.
Prepositional adverbs (also known as "adverbial particles") are tagged AVP and are discussed together with prepositions: see Prepositions
Articles, definite or indefinite, are tagged AT0. Pronouns which act as determiners of various kinds (all, which, your etc.) are given tags DPS, DT0, or DTQ, and distinguished from pronouns which do not have a determiner function. These are marked using one of the tags PNP, PNI, PNQ, or PNX depending on their function.
Have a_AT0 break
Every_AT0 year
There's no_AT0 time
free secondary education for all_DT0 [ECB.1610]
Few_DT0 diseases are incurable [GV1.1129]
for the benefit of the few_DT0 [HHX.10183]
Which_DTQ country do you live in? [A7N.979]
And she didn't say which_DTQ? [KCF.351 ]
What_DTQ time is it? [A0N.406]
my_DPS hat
That is your way. This is mine_PNP [ASD.726-7]
Give 4 details which_DTQ should appear on an order form [HBP.417]
I got some currants that_CJT are left over [KST.3733]
this girl that_CJT Claire knows [KC7.1101]
He dismissed reports that_CJT his party was divided over tactics [A28.11]
We both knew that_CJT enough was enough. [FEX.268]
Look at that_DT0 bear! [KP8.1547]
I guess I was sad about that_DT0.[BMM.239]
at_PRP the Pompidou Centre in_PRP Paris [A04.325]
I use humour as_PRP a protection [FBL.356]
Heard about_PRP this have you? [KE6.9556]
According to_PRP ancient tradition, ...[A04.784]
Many disputes are dealt with by bodies other than_PRP courts. [F9B.4]
Nice walls and a big sky to look at_PRP. [A25.122]
Note that numerous multiwords contain of, e.g. in front of, in light of, by means of, etc.
a couple of_PRF cans of_PRF Coke [AJN.283]
DNA consists of_PRF a string of_PRF four kinds of_PRF bases [AE7.107]
There are many instances of ambiguity between PRP and AVP.
We gave up_AVP after two hours. [KSV.1029]
there were a lot of horses around_AVP. [HR7.3101]
Fish and_CJC chips
James laughed and_CJC spilled wine. [A0N.136]
She was paralysed but_CJC she could still feel the pain. [FLY.529]
"When_CJS you 've done it , you should go home,"[CRE.949]
I still stayed there after_CJS I heard the shooting [HW8.3263]
As_CJS you may know Scorton will again enter the Best Kept Village competition in 1992 [HPK.768]
Do send me an interim copy as_CJS soon as you can [HD3.69]
If_CJS it's wet just take your time. [KCL.554]
It was worse than_CJS she could have imagined.[CH0.1315]
...apparently we eat more chocolate than_CJS any other country.[G3U.1000]
"it's as good as_CJS it's going to get."[K9K.199]
make the transporter as light as_CJS possible. [CA1.1113]
Can you tell me whether_CJS ivies do damage trees. [C9C.720]
Historians knew that_CJT this was nonsense.[G3C.363]
China announced that_CJT it was ending martial law in the Tibetan capital Lhasa. [KRU.95]
The problem that_CJT he was having was that_CJT she was his legal wife 's sister [HE3.210]
Cardinal numbers and similar items are tagged CRD. Ordinal numbers and similar items are tagged ORD.
5_CRD out of 10_CRD[CGM.525]
one_CRD striking feature of the years 1929-31_CRD [A6G.134]
his first_ORD innings, when he scored forty-two_CRD, with seven_CRD fours_CRD [KJT.128]
Hundreds_CRD of people audition each year [K1S.2239]
About a dozen_CRD there. [HEU.131]
Note that ORD is also assigned to less overtly numeric words like next and last, even in clear adverbial, adjectival or nominal contexts. This is because next and last function like ordinals both syntactically and semantically.
We only came fourth_ORD in the county championship last_ORD year [EDT.1629]
6kg_NN0
£600_NN0
12.5%_NN0
Figure 2b_UNC [FTC.250]
Serial no. S835508_UNC [C9H.2282]
A4_UNC sheet of paper [CN4.296]
Mark drove home along the M1_UNC [AC2.2210]
There_EX0 was a long pause and then a smile [A4H.416]
Waiter! Waiter! There_EX0's an awful film on my soup! [CHR.657-9]
There_EX0 appears to be little alternative [ECE.2139]
Compare this with there when it has a clear locative meaning ('in/to that place'):
Don't stand there_AV0 grinning like a stuck pig [C85.1553]
(For the distinction between ITJ and the unclassified tag, UNC, see Interjection vs. unclassified)
Hello_ITJ, Nell.
Oi_ITJ - come here!
Yes_ITJ , please_AV0 do
No_ITJ not_XX0 yet_AV0
Note the lack of space between the noun and the following POS, as 's is tokenized in the same way whether it represents a genitive or a contracted verb. See further on the tagging of 's in Apostrophe S.
teacher_NN1 's_POS pet
teachers_NN2 '_POS pet
"Do you want to_TO0 talk about it?" [EFG.1935]
In the summer holidays I can, I can get up early if I want to_TO0. [KPG.4153]
Note the morphological variation of to in the following colloquial forms:
We got_VVN ta_TO0 go
We wan_VVB na_TO0 stay.
blah_UNC blah_UNC blah_UNC
er_UNC I think so
Methinks_UNC
That ai_UNC n't_XX0 right.
0.5 cm increments/30_UNC seconds [HWT.282]
Fits with most lap/diagonal_UNC seat belts. [BNX.392]
Truncated words are enclosed in the <trunc> element and tagged UNC; for example the partial word bathr in the following:
The bathr_UNC er you can't beat a white bathroom suite anyway. [KCF.771]
we're going to sort sort of summarize... [G5X.106]
We treat the first sort as an incomplete multiword, and tag it UNC (rather like truncated words, above). The complete multiword sort of is tagged AV0, as normally:
we're going to sort_UNC sort of_AV0 summarize...
Brown did_VDD n't_XX0 see it that way. [A6W.338]
no, that is not_XX0 correct. [JK0.257]
Italy v_PRP New Zealand ... Hungary v_PRP Thailand [A1N.507]
Although the same should apply to v., the full stop has sometimes incorrectly produced a new sentence break (see e.g. CHS.1076, EB2.19, EDL.313).
I_ZZ0 B_ZZ0 M_ZZ0 compatible [JYM.6]
children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3807]
You're not supposed to keep medicine that long_AV0. [H8Y.1976 *AV0-AJ0]
Note also that in this section we use a number of invented examples (in addition to corpus citations) to clarify the distinction between categories.
We arrived tired_AJ0, but safe_AJ0 [CCP.529]
After a little he remembered it and sang out loud_AV0. [A0N.1144]
This sentence does not imply that he was loud, but is more or less equivalent to He sang out loudly. It means that his singing was loud.
You did great_AV0 though. [HH0.3248 *AV0-AJ0]
everyone below 25 grew their hair too long_AJ0. [ARP.590 *AV0-AJ0]
(i.e. 'their hair was too long'.)
Try not to keep her too long_AV0. [FAB.3620 *AV0-AJ0]
(i.e. NOT 'she will be too long.')
They'll have to make the taxes higher_AJC. ('the taxes will be higher')
We can make this piece higher_AJC if you want to. [BNG.2268]
You'll have to aim higher_AV0. (NOT 'you will be higher')
You should aim higher_AV0 [ACN.984 *AJC]
I thought it best_AJS to call. [AT4.3239]
I liked the cartoons best_AV0 [CAM.194]
a white_AJ0 screen; The screen is white_AJ0.
When the word is the head of a noun phrase, on the other hand, it is a noun:
Red_NN1 is my favourite colour.
They painted the wall a brilliant white_NN1.
All past_AJ0 and present_AJ0 employees of the branch are invited. [K99.216]
*These needs are past, present, and future.
(Note that present can be used as a predicative adjective meaning the opposite of absent; but this meaning is not comparable to the temporal meanings of past, present and future above.)
You're living in the past_NN1. [HGS.1045]
I don't even want to think about the future_NN1. [JY4.2864]
The only reason for treating past and present in the example above as adjectives is that they have an institutionalized meaning as modifiers, which is rather different from the meaning they have as nouns. Further examples of this type are words such as model in model behaviour, giant in a giant caterpillar and vintage in vintage cars.
new spending_NN1 plans [CEN.5922]
a working_AJ0 mother [ED4.153]
his reading_NN1 ability [CFV.1897]
in the coming_AJ0 weeks [HKU.1333]
two smiling_AJ0 children [HTT.743] ('two children who are smiling')
new spending_NN1 plans ('new plans for spending')
his reading_NN1 ability ('his ability in reading')
a mating_AJ0 animal [GU8.2142]
the mating_NN1 game [ECG.336 *AJ0-NN1]
a falling_AJ0 rate of unemployment [KR2.2129]
slimming_NN1 tablets. [KCA.941 *NN1-VVG]
(a) You should relax more_AV0.
(b) You should spend more_DT0.
You should eat more.
You should read more.
You should smoke less.
Do you smoke? (Intransitive)
How many do you smoke in a week? (Transitive)
(c) At the moment we have 23 fixtures per season. Personally, I would rather play more_DT0.
(d) You should work less and play more_AV0.
(In (d) the adverb more has roughly the meaning of 'more often'.)
Note. The automatic disambiguation of determiners and adverbs is not reliable, because transitivity has not been encoded in the tagger. Sentences like (c) and (d), where more follows the verb at the end of a sentence, are invariably tagged AV0.
Another area of borderline cases is the tagging of words as adjectives (AJ0) or as participles (VVG or VVN).
One test is to see whether a degree adverb like very can be inserted in front of the word: e.g. in We were very surprised, surprised is an AJ0.
Another indication is a following by-phrase expressing the 'doer', which favours the participle reading:
We were surprised_VVN by pirates.
Even where it is not present, the possibility of adding the by-phrase, without changing the meaning of the word, is evidence in favour of VVN. (However, this criterion can clash with the preceding one, since it occasionally happens that an -ed word is both preceded by an adverb like very and followed by a by-phrase, e.g. I was so irritated by his behaviour that I put the phone down. When these do occur, we give preference to AJ0.)
A further test is whether the word can be used, with the same meaning, in front of the noun:
The effect is lasting_AJ0 (compare a lasting_AJ0 effect).
The door is locked_AJ0 (compare the locked_AJ0 door).
This shows that lasting or locked can easily be (but need not be) an AJ0. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is a participle.
The man was dying_VVG. [HTM.1494 *VVG-AJ0]
the dying_AJ0 man. [FSF.1787]
an interest_NN1 earning_VVG account
a hypothesis_NN1 driven_VVN approach
In these examples the NN1+VVG/VVN sequence has the character of a premodifying adjective compound. We can therefore imagine the two words bracketed together forming an adjective: an interest-earning_AJ0 account. But within the adjective, the VVG and VVN tags retain their verbal character, with the initial noun acting as object of the verb (cf. the account earns interest).
a shanty_NN1 singing_VVG competition[K4W.2952]
The building was infested_AJ0 with cockroaches
(cf.: The building seemed/became infested with cockroaches)
This is a manifestation of the general semantic character of adjectives (which typically refer to states or qualities) and verbs (which typically refer to events or actions):
Bill was married_AJ0. (i.e. he was not single)
Bill was married_VVN to Sarah on the 15th May. (i.e. the actual event)
She is not disturbed_VVN by that sort of threat.
The tourists were standing_VVG around a map of the city.
Are you expecting_VVG someone?[G01.2610]
The arithmetic is looking_VVG good. [K1M.3611]
Turning_VVG suddenly, she ran for the safety of the car [CK8.297]
His manner was insulting_AJ0. (* insulting us)
Here insulting could not normally be followed by an object.
(a) She ran down_PRP the hill.
(b) She ran down_AVP her best friends.
She ran quickly down the hill.
(But not: *She ran viciously down her best friends.)
This is the hill down_PRP which he ran.
Down_PRP which slopes do you like ski-ing?
She ran her best friends down_AVP.
(But not: *She ran the hill down.)
Similarly:
She ran them down_AVP. (= her best friends)
(But not: *She ran down them.)
The dentist took all my teeth out_AVP. (The dentist took them out)
Notice that the syntactic distinction between (for example) down as an adverbial particle and down as a preposition is independent of the semantic distinction between locative and non-locative interpretations of down.
Income tax is coming down_AVP.
The decorations are put up_AVP on Christmas Eve.
This is the hill (which) she ran down_PRP.
(Cf. This is the hill down which she ran.)
The poor were looked down on_PRP by the rich.
(Here on is the stranded preposition)
Which car did she arrive in_PRP?
The same tests apply to words which are tagged either as prepositions or as general adverbs (AV0), such as across, past and behind.
Note, additionally, the use of about as a degree adverb.
The borderline between interjections or exclamatory particles (tagged ITJ) and unclassified 'noise' words (tagged UNC) is drawn as follows:
ITJ is used for 'institutionalized' interjections or discourse particles such as good-bye, oh, no, oops, hallelujah, whoa, wow; however, well, right and like functioning as discourse markers are tagged AV0.
blah_UNC blah_UNC blah_UNC
er_UNC I think so.
Erm_UNC nope_ITJ.
methinks_UNC.
ai_UNC n't_XX0
UNC is also used for <w> elements within multiword expressions for which no unique C5 code can be found. The contraction ain't is a special case: its first half is tagged UNC because it abbreviates so many different verb forms (am not, is not, are not, has not, have not) that no single tag can be applied to it (unless one were to invent a special tag for that purpose).
That_DT0's_VBZ perfect is that one... (= That is...) [KCX.1254]
She_PNP 's_VHZ got tickets. (= She has...) [KPV.6479]
well, what_DTQ 's_VDZ he do?, is he a plumber? (= What does...) [KD6.310]
success in the three R_ZZ0's [EVY.59]
in the 1980_CRD's [HJ1.22024]
Let's_VM0 go_VVI. [A61.1443]
Note that let's is not considered a contraction of let us, but is treated as a single 'verbal particle', tagged VM0, on the grounds that it is closely analogous to modal auxiliaries.
...it was about_AV0 three weeks ago [FAJ.1714]
about_AV0 half the size of a grain of rice [AJ4.33]
Note also the multiword just about, as in:
We're just about_AV0 ready.
my mother was reading a novel about_PRP gypsies... . [ARJ.2068]
How did this transformation come about_AVP? [A11.786]
In the first and second examples above, the second as introduces a comparative construction which expresses 'equal comparison', as contrasted with the unequal comparison of more X than Y. When as is a word introducing such a comparative construction, it is tagged CJS:
I go to see them as_AV0 often as I can. [AC7.1189]
and they employ ninety people, twice as_AV0 many as last year. [K1C.3540]
And every bit as_AV0 good .[EEW.1132 *CJS]
Notice that as in this comparative use is tagged CJS whether or not it introduces a clause. Often it introduces a noun phrase, and in the last example below it introduces an adjective:
Capitalism is not as_AV0 good as_CJS it claims. [CFT.2042]
Linked together, they can crunch numbers as_AV0 fast as_CJS any mainframe.[CRB.271]
She will deposit as_AV0 many as_CJS a dozen eggs there. [F9F.424]
always reply as_AV0 quickly as_CJS possible. [C9R.989]
New York called just as_CJS I was leaving. [APU.1543]
As_CJS you've gone to so much trouble , it would seem discourteous to refuse [KY9.2107]
Usually the meaning is related to the equative meaning of the verb be. However, the guideline restricts PRP to cases where as is followed by a normal noun phrase or nominal, as is usual for prepositions:
Consider it as_PRP a kind of insurance [AD0.1641]
As_PRP head of information, Christina will lead a team of four TEC staff... [BM4.2830]
Where the as is followed by an adjective or a past participle clause, it is tagged CJS, even though it may retain the equative type of meaning:
We regard these results as_CJS encouraging. [B1G.184]
I very much hope that you will in fact support the motion as_CJS originally intended. [KGX.93]
Sometimes as well as_PRP going this way we actually need to go in this was too. [G5N.31]
Note that this is different from the multiword adverb as well (meaning also); it is also different from the sequence of as well as as three separate words, e.g. in:
She's as_AV0 well_AJ0 as_CJS can be expected. [F9X.2095]
She can spare you but_AV0 a few minutes [CCD.82 *CJC]
There is but_AV0 one penalty. [ALS.185 *CJC]
...mediocre albums that do nothing but_CJS take up shelf space [C9M.1014]
I couldn't help but_CJS notice. [JY0.5323 *CJC]
I always feel they are open meetings in everything but_PRP name. [HJ3.5520]
No one had guessed she was anything but_PRP a boy. [C85.517]
God and minds do exist , but_CJC materially so . [ABM.1265]
And that's it for another week but_CJC don't forget the late news at eleven thirty. [J1M.2520]
Hares ( but_CJC not rabbits ) are particularly vulnerable... [B72.892]
The fare increases would have been bigger but for_PRP the governments last minute intervention. [K6D.124]
We stayed home_AV0. [FAP.313]
This is my home_NN1. [AMB.1805]
well she says like_AV0, I won't be a minute [KCY.1518]
I'm driving along, you know like_AV0 <trunc>wha</trunc> when you're in the car by yourself and everything's turning over in your head [KBU.1096]
...but I like_VVB Monday best. [FU4.1089]
He didn't look like_PRP a goodie. [H0M.1353]
... fuel, weapons, ground crew and the like_NN1. [JNN.105 *AJ0-NN1]
Churchill and Eden were not of like_AJ0 minds... [ACH.1297]
Bless their dear little_AJ0 faces. [HRB.722]
Little_AJ0 green shoots of recovery are stirring. [CEL.968]
I have little_DT0 to say. [G1Y.1133]
...there was little_DT0 food left. [FSJ.720]
I care very little_AV0 about petty-minded, selfish "rules". [B0P.211]
They are all a little_AV0 drunk. [G0F.2117]
However, the quantifier a little meaning 'a small amount' is not tagged as a multiword but as AT0 + DT0:
You couldn't let me have a_AT0 little_DT0 milk? [GUM.1656]
[See Determiner-pronoun vs. adverb]
Much_DT0 of this work has to be done on the spot. [C8R.24]
I've spent too much_DT0 money. [KPV.62659]
Thanks very much_AV0. [A73.5]
I didn't sleep much_AV0 last night [ALH.1495]
See also Determiner-pronoun vs. adverb
You deserve more_DT0 than a medal. [K97.3705]
More_DT0 haste, less_DT0 speed. [J10.4543]
...this will make him more_AV0 tired than usual [A75.282]
But I couldn't agree more_AV0 [BMD.3]
more than_AV0 one in a million [K5N.46]
No_AT0 problem_NN1. [H4H.227]
quoting Ref_NN1 No_NN1 BCE90_UNC [CJU.673]
but the matter was taken no_AV0 further_AV0. [ARF.183 no: *AT0]
To put it no_AV0 more_AV0 strongly_AV0, it has not been proved beyond doubt that.... [EW7.125]
"...See how easy my job can be?"
"Frankly, no_ITJ". [HR4.2329]
Can I have one_CRD chip, please? [KDB.1416]
In such noun phrases, one functions like a determiner-pronoun such as some.
So are there criticisms? Just one_CRD. [CG2.1489]
... one_CRD in five sufferers never tells their partners. [CF5.8 *PNI]
Orford Ness is one_CRD of Britain's most unusual coastal features. [CF8.86]
The channel was not a broad one_PNI [AEA.1457]
In this use, one has a plural form ones.
And I think one_PNI might go on to argue that far from saving labour it creates it. [J17.1915]
Note that the reliability of the ambiguity tag PNI-CRD (in which the pronoun is rated more likely) is somewhat low. See 6.6 POS-tagging Error Rates
As both an adverb (AV0) and an adjective (AJ0) right means the opposite of 'wrong' and also the opposite of 'left'. As a noun, it generally means 'entitlements': e.g. I have a right_NN1 to know. The uses of right as a verb are very rare.
Right_AV0, how you doing there? [KBL.4671]
Right_AV0, er, members, any questions ? [F7V.138]
it's a ... it's a right_AV0 soft carpet. [KB2.1242-4]
So_AV0 this is where you work... [H8M.2964]
Right, so_AV0 what's fifty three per cent as a decimal? [JP4.357]
They waited but nothing happened so_AV0 they made a fuss. [FU1.2484]
So_AV0 say I and so_AV0 say the folk. [G11.228]
"Yes, I think so_AV0." [CCM.151]
tough and long lasting - that's why they're so_AV0 popular. [BN4.929]
There would not be so_AV0 many lonely people in our land [B1Y.1262]
Drink your tea so_CJS they can have your cup. [KB2.1767]
That_DT0's_VBZ my coat yeah. [KBS.1309]
he's getting hooked on the taste of vaseline, that_DT0 dog. [KCL.197]
Many experts claim that_CJT it is good for your growing baby, too. [G2T.1091]
CJT is also applied to that as a relativizer (introducing a relative clause):
A ship that_CJT never enters harbour. [BPA.1326]
This is different from the more traditional analysis, which treats that introducing a relative clause as a relative pronoun.
It wasn't all that_AV0 bad. [KPP.321]
And then_AV0 she spoke. [H8T.2675]
"Come on, then_AV0." [K8V.1722]
Mr Willi Brandt, the then_AJ0 Mayor of West Berlin. [A87.84]
...the then_AJ0 state governor , who wasn't then_AV0 Bill Clinton [A87.84]
In the summer holidays, I can, I can get up early if I want to_TO0. [KPG.4153]
Note also the common colloquial spelling of want to, got to, and going to as fused words:
wanna = wan_VVB na_TO0
gotta = got_VVN ta_TO0
gonna = gon_VVG na_TO0
That 's the school that Terry goes to_PRP. [KB8.2442]
...what you're entitled to_PRP by law is money back [FUT.360]
"Where to_PRP?""The_PRP moon." [FNW.240-1]
She's playing well_AV0
Oh well_AV0! That'll be the finish! [FX6.196-7]
I bet he doesn't get up till about, well_AV0, it's eleven now. [KBL.3808]
It was dark outside and well_AV0 past your bedtime. [ASS.898]
You don't look well_AJ0. [HPR.107]
Tears well_VVB up in my eyes. [BN3.5 *AV0]
When_CJS I got back to my flat, I decided to ring Toby. [CS4.1265]
the crowd left quietly when_CJS the police arrived. [APP.1017] (when = at the time at which)
If you smoke when_CJS you're pregnant... [A0J.1598] (when = whenever)
Note that when is also a subordinating conjunction in abbreviated adverbial clauses which lack a subject and finite verb, such as when in doubt, when ready, when completed.
I can't remember when_AVQ we last had a frost. [KBF.11728]
"Do you remember when_AVQ we used to go with Daddy in the boat on Saturdays?" [A6N.2022]
You never know when_AVQ the next big story will break. [HJ6.100]
Before an infinitive, when is also tagged AVQ:
Otto knew when_AVQ to change the subject. [FAT.1603]
Also when the rest of the infinitive clause is understood:
Tell me when_AVQ.
in the year when_AVQ I was born (when = in which)
the moment when_AVQ he arrived (when = at which)
Note that when can often be omitted in such relative clauses: the moment he arrived.
When_AVQ did you find out?
...to hit him where_CJS it hurts. [CEN.2816]
I don't know where_AVQ she picked them up. [G1D.1163]
It was the house where_AVQ the poor woodcutter lived with Hansel and Gretel
Where_AVQ are you going? [KB9.2650]
these pictures are worth_PRP a small fortune. [FNT.1060]
That makes him worth_PRP about $60m. [CT3.479]
'Darling, it's not worth_PRP getting upset. [HH9.2308]
worth also occurs as a 'stranded preposition' in questions used to elicit such responses, and in some other common constructions:
how much d'ya think it's worth_PRP? [KCX.1344]
share prices say nothing about what a company is worth_PRP. [A9U.305 *NN1]
Please go ahead and push Grapevine for all you are worth_PRP. [AP1.575]
Baker showed his worth_NN1 for Ipswich in the 20th minute [CF9.102]
hundreds of pounds' worth_NN1 of damage. [A0H.15]
£2,500 WORTH_NN1 OF PRIZES [ECJ.1147]
children who go to the E_ZZ0 N_ZZ0 T_ZZ0 clinic [KB8.3805]
...ten ninety minute tapes! T_ZZ0 D_ZZ0 K_ZZ0 tapes! [KPG.3534-5]
In the written corpus these items would nearly always be written and tagged as whole words (ENT or TDK in the above example).
Truncated words are enclosed in the <trunc> element and tagged UNC. Examples include bathr and su in the following:
The <trunc>bathr_UNC</trunc> er you can't beat a white bathroom suite anyway. [KCF.721]
Aye, they only came in the <trunc>su_UNC</trunc> they only came up here in the summer. [GYS.127]
we're going to sort sort of summarize... [G5X.106]
We treat the first sort as an incomplete multiword, and tag it UNC (rather like truncated words, above). The complete multiword sort of is tagged AV0, as normally:
we're going to sort_UNC sort of_AV0 summarize...
Further examples of incomplete multiwords are the as long in as long as (conjunction), of in because of (preposition) and the in in in general (adverb) below:
As_UNC long_UNC As_CJS long as everyone recognizes that for an area of that size... [J9T.258]
because_PRP of the <pause> of_UNC the drought. When we were away it didn't get watered in. [KCH.982]
I know that in_UNC in_UNC in_AV0 general, in in in erm, imperial measure, it is <trunc> f </trunc> five feet eight inches [JK1.480]
The second example shows that when words are repeated, the incomplete portion of a multiword is not necessarily immediately adjacent to the fully formed multiword. In the last example, the three instances of in before erm, imperial measure have not been analysed as part of the multiword in general; they are instead tagged as ordinary words (in this case, ambiguous between preposition and prepositional adverb: PRP-AVP). There are a few cases where the tagger has probably been over-zealous in spotting repeated portions of multiwords:
What happens now_UNC, now_CJS that you are winched down? [HEF.9]
Here, the first instance of now would probably have better been interpreted as a single-word adverb (= 'at this time'), not part of the multiword conjunction now that.
Where a filler word such as er or erm interrupts a multiword, the wordclass assigned to the <mw> element is identical to that which would have been assigned if the filler were not present:
And your homework was handed in every er so often_AV0, you know [G64.152]
something had gone wrong with the <pause> gas pipes because erm of_PRP <pause> flooding. [KB8.5356]
these kind of books were, er, generally er, at, at er best_AV0 ignored [HUN]
Note that in the last example the word at preceding the multiword at er best is treated as a partial repetition of that multiword, and therefore tagged UNC.
This section reports on the accuracy of the results of the improved tagging programs.
In this section, we examine ambiguities and errors using a ‘fine-grained’ mode of calculation, treating each error as of equal importance to any other error. In 6.6.3.1 Presentation of Ambiguity and Error Rates (coarse-grained calculation) we look at the same data in terms of a ‘coarse-grained’ mode of calculation, ignoring errors and ambiguities involving subcategories of the same part of speech.
As the following table shows, the ambiguity rate varies considerably between written and spoken texts. (However, note that the calculation for speech is based on a small sample of 5,000 words.)
Texts | Sample tag count | Ambiguity rate (%) | Error rate (%) |
Written texts | 45,000 | 3.83% | 1.14% |
Spoken texts | 5,000 | 3.00% | 1.17% |
All texts | 50,000 | 3.75% | 1.15% |
It will be noted that written texts on the whole have a higher ambiguity rate, whereas spoken texts have a slightly greater error rate.
The success of an automatic tagger is sometimes represented in terms of the information-retrieval measures of precision and recall, rather than ambiguity rate and error rate as in Table 23. Estimated ambiguity and error rates for the whole corpus (fine-grained calculation). Precision is the extent to which incorrect tags are successfully discarded from the output. Recall is the extent to which all correct tags are successfully retained in the output of the tagger, allowing, however, for more than one reading to occur for one word (i.e. ambiguous tagging is permitted). According to these measures, the success of the tagging is as follows:
Texts | Precision | Recall |
Written texts | 96.17% | 98.86% |
Spoken texts | 97.00% | 98.83% |
All texts | 96.25% | 98.85% |
However, from now on we will continue to use ‘ambiguity rate’ and ‘error rate’, which appear to us more transparent.
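The two sets of figures are directly related: in each row, precision is 100 per cent minus the ambiguity rate and recall is 100 per cent minus the error rate. A minimal check of that relationship (our own arithmetic, using the rates from Table 23):

# Ambiguity and error rates (%) from Table 23; precision and recall
# follow directly as 100 minus each rate.
rates = {
    "Written texts": (3.83, 1.14),
    "Spoken texts":  (3.00, 1.17),
    "All texts":     (3.75, 1.15),
}
for texts, (ambiguity, error) in rates.items():
    precision = 100 - ambiguity
    recall = 100 - error
    print(f"{texts}: precision {precision:.2f}%, recall {recall:.2f}%")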
The estimates for individual tags are again based on the 50,000-word sample, and the ambiguity rate for each tag is based on the number of ambiguity tags which begin with a given tag. The table also specifies the estimated likelihood that a given tag, in the first position of an ambiguity tag, is the correct tag.
In Table 25. Estimated ambiguity rates and error rates by tag, column (b) shows the overall frequency of particular tags (not including ambiguity tags). Column (c) gives the overall occurrence of ambiguity tags, as well as of particular ambiguity tags, beginning with a given tag. (Ambiguity tags marked * are less ‘serious’ in that they apply to two subcategories of the same part of speech, such as the past tense and past participle of the verb; see the coarse-grained calculation below.) Column (d) shows which tags are more or less likely to be found as the first part of an ambiguity tag. For example, both NP0 and VVG have an especially high incidence of ambiguity tags. Column (e) tells us, given that we have observed an ambiguity tag, how likely it is that the first tag is correct. Overall, there is more than a 3-1 chance that the first tag will be correct; but there are some exceptions, where the chances of the first tag's being correct are much lower: for example, PNI (indefinite pronoun). Note that (f) and (g) exclude errors where the first tag of an ambiguity tag is wrong; contrast Table 28. Estimated error rates for the whole corpus, and Table 29. Estimated error rates (by tag) column (c), below.
(a) Tag | (b) Single tag count (out of 50,000 words) | (c) Ambiguity tag count (out of 50,000 words) | (d) Ambiguity rate (%) (c / (b + c)) | (e) 1st tag of ambiguity tag correct (% of all ambiguity tags) | (f) Error count | (g) Error rate (%) (f / b) |
AJ0 | 3412 | all 338 | 9.01% | 282 (83.43%) | 46 | 1.35% |
(AJ0-AV0 48) | | | | | |
(AJ0-NN1 209) | ||||||
(AJ0-VVD 21) | ||||||
(AJ0-VVG 28) | ||||||
(AJ0-VVN 32) | ||||||
AJC | 142 | 0.0% | 4 | 2.82% | ||
AJS | 26 | 0.0% | 2 | 7.69% | ||
AT0 | 4351 | 0.0% | 2 | 0.05% | ||
AV0 | 2450 | all 45 | 1.80% | 37 (82.22%) | 57 | 2.33% |
(AV0-AJ0 45) | ||||||
AVP | 379 | all 44 | 10.40% | 34 (77.27%) | 6 | 1.58% |
(AVP-PRP 44) | ||||||
AVQ | 157 | all 10 | 5.99% | 10 (100.00%) | 9 | 5.73% |
(AVQ-CJS 10) | ||||||
CJC | 1915 | 0.0% | 3 | 0.16% | ||
CJS | 692 | all 39 | 5.34% | 30 (76.92%) | 18 | 2.60% |
(CJS-AVQ 26) | ||||||
(CJS-PRP 13) | ||||||
CJT | 236 | all 28 | 10.61% | | 3 | 1.27% |
(CJT-DT0 28 ) | ||||||
CRD | 940 | all 1 | 0.11% | 0 (0.00%) | 0 | 0.00% |
(CRD-PNI 1) | ||||||
DPS | 787 | 0.0% | 0 | 0.00% | ||
DT0 | 1180 | all 20 | 1.67% | 16 (80.00%) | 19 | 1.61% |
(DT0-CJT 20) | ||||||
DTQ | 370 | 0.0% | 0 | 0.00% | ||
EX0 | 131 | 0.0% | 1 | 0.76% | ||
ITJ | 214 | 0.0% | 2 | 0.93% | ||
NN0 | 270 | 0.0% | 10 | 3.70% | ||
NN1 | 7198 | all 514 | 6.66% | 395 (76.84%) | 86 | 1.19% |
(NN1-AJ0 130) | ||||||
(NN1-NP0 92)* | ||||||
(NN1-VVB 243) | ||||||
(NN1-VVG 49) | ||||||
NN2 | 2718 | all 55 | 1.98% | 48 (87.27%) | 30 | 1.10% |
(NN2-VVZ 55) | ||||||
NP0 | 1385 | all 264 | 16.01% | 224 (84.84%) | 31 | 2.24% |
(NP0-NN1 264)* | ||||||
ORD | 136 | 0.0% | 0 | 0.00% | ||
PNI | 159 | all 8 | 4.79% | 3 (37.50%) | 5 | 3.14% |
(PNI-CRD 8) | ||||||
PNP | 2646 | 0.0% | 0 | 0.00% | ||
PNQ | 112 | 0.0% | 0 | 0.00% | ||
PNX | 84 | 0.0% | 0 | 0.00% | ||
POS | 217 | 0.0% | 5 | 2.30% | ||
PRF | 1615 | 0.0% | 0 | 0.00% | ||
PRP | 4051 | all 166 | 3.94% | 154 (92.77%) | 24 | 0.59% |
(PRP-AVP 132) | ||||||
(PRP-CJS 34) | ||||||
TO0 | 819 | 0.0% | 6 | 0.73% | ||
UNC | 158 | 0.0% | 4 | 2.53% | ||
VBB | 328 | 0.0% | 1 | 0.30% | ||
VBD | 663 | 0.0% | 0 | 0.00% | ||
VBG | 37 | 0.0% | 0 | 0.00% | ||
VBI | 374 | 0.0% | 0 | 0.00% | ||
VBN | 133 | 0.0% | 0 | 0.00% | ||
VBZ | 640 | 0.0% | 4 | 0.63% | ||
VDB | 87 | 0.0% | 0 | 0.00% | ||
VDD | 71 | 0.0% | 0 | 0.00% | ||
VDG | 10 | 0.0% | 0 | 0.00% | ||
VDI | 36 | 0.0% | 0 | 0.00% | ||
VDN | 20 | 0.0% | 0 | 0.00% | ||
VDZ | 22 | 0.0% | 0 | 0.00% | ||
VHB | 150 | 0.0% | 1 | 0.67% | ||
VHD | 258 | 0.0% | 0 | 0.00% | ||
VHG | 16 | 0.0% | 0 | 0.00% | ||
VHI | 119 | 0.0% | 0 | 0.00% | ||
VHN | 9 | 0.0% | 0 | 0.00% | ||
VHZ | 116 | 0.0% | 1 | 0.86% | ||
VM0 | 782 | 0.0% | 3 | 0.38% | ||
VVB | 560 | all 84 | 13.04% | 56 (66.67%) | 84 | 15.00% |
(VVB-NN1 84) | ||||||
VVD | 970 | all 90 | 8.49% | 62 (58.89%) | 50 | 5.15% |
(VVD-AJ0 11) | ||||||
(VVD-VVN 79)* | ||||||
VVG | 597 | all 132 | 18.11% | 112 (84.84%) | 9 | 1.51% |
(VVG-AJ0 83) | ||||||
(VVG-NN1 49) | ||||||
VVI | 1211 | 0.0% | 7 | 0.58% | ||
VVN | 1086 | all 158 | 12.70% | 113 (71.52%) | 27 | 2.49% |
(VVN-AJ0 50) | ||||||
(VVN-VVD 108)* | ||||||
VVZ | 295 | all 26 | 8.10% | 14 (53.85%) | 11 | 3.73% |
(VVZ-NN2 26) | ||||||
XX0 | 363 | 0.0% | 0 | 0.00% | ||
ZZ0 | 75 | 0.0% | 3 | 4.00% |
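As a worked check on the column definitions, the percentages in the AJ0 row of Table 25 can be reproduced from the raw counts, taking the ambiguity rate as c / (b + c), the first-tag accuracy as the proportion of ambiguity tags whose first tag is correct, and the error rate as f / b:

# Raw counts for the AJ0 row of Table 25.
b = 3412             # (b) single-tag count
c = 338              # (c) ambiguity-tag count
first_correct = 282  # (e) ambiguity tags whose first tag is correct
f = 46               # (f) error count

print(f"(d) ambiguity rate   {100 * c / (b + c):.2f}%")        # 9.01%
print(f"(e) 1st tag correct  {100 * first_correct / c:.2f}%")  # 83.43%
print(f"(g) error rate       {100 * f / b:.2f}%")              # 1.35%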
The next table, Table 26. Estimated frequency of selected tag-pairs, gives the frequency, as a percentage, of error-prone tag-pairs where XXX is the incorrect tag and YYY is the correct tag which should have occurred in its place. In the third column, the number of the specified error-type is listed, as a frequency count from the sample of 50,000 words. In the fourth column, this is expressed as a percentage of all the tagging errors of word category XXX (in Table 25. Estimated ambiguity rates and error rates by tag column (f)). The fifth column answers the question: if tag XXX occurs, what is the likelihood that it is an error for tag YYY? Where the number of occurrences of a given error-type is less than 5 (i.e. 1 in 10,000 words), they are ignored. Hence, Table 26. Estimated frequency of selected tag-pairs is not exhaustive: only the more likely error-types are listed. In the second column, we add, where useful, the individual words which trigger these errors.
(1) Incorrect tag XXX | (2) Corrected tag YYY | (3) No. of occurrences of this error type | (4) % of all incorrect uses of tag(XXX) | (5) % of all tags XXX |
AJ0 | AV0 | 12 | 26.1% | 0.4% |
NN1 | 12 | 26.1% | 0.4% | |
NP0 | 5 | 10.9% | 0.1% | |
VVN | 8 | 17.4% | 0.2% | |
AV0 | AJ0 | 6 | 10.5% | 0.2% |
AJC | 8 | 14.0% | 0.3% | |
DT0 | 24 | 42.1% | 1.0% | |
EX0 (there) | 5 | 8.8% | 0.2% |
PRP | 5 | 8.8% | 0.2% | |
AVQ | CJS (when, where) | 6 | 66.7% | 3.8% |
CJS | PRP | 10 | 55.6% | 1.4% |
DT0 | AV0 | 15 | 78.9% | 1.3% |
NN1 | AJ0 | 13 | 15.1% | 0.2% |
NN0* | 8 | 9.3% | 0.1% | |
NP0* | 22 | 25.6% | 0.3% | |
UNC | 9 | 10.5% | 0.2% | |
VVI | 13 | 15.1% | 0.2% | |
NN2 | NP0* | 14 | 46.7% | 0.5% |
NP0 | NN1* | 10 | 32.3% | 0.7% |
NN0* | 5 | 16.1% | 0.4% | |
PRP | AV0 | 7 | 29.2% | 0.2% |
AVP | 5 | 20.8% | 0.1% | |
TO0 | PRP (to) | 6 | 100.0% | 0.7% |
VVB | AJ0 | 7 | 8.3% | 1.3% |
NN1 | 7 | 8.3% | 1.3% | |
VVI* | 55 | 65.5% | 9.8% | |
VVD | AJ0 | 6 | 12.0% | 0.6% |
VVN* | 44 | 88.0% | 4.5% | |
VVG | NN1 | 9 | 100.0% | 1.5% |
VVI | NN1 | 5 | 71.4% | 0.4% |
VVN | AJ0 | 7 | 25.9% | 0.6% |
VVD* | 17 | 63.0% | 1.6% | |
VVZ | NN2 | 8 | 72.7% | 2.7% |
As before, the asterisk (*) indicates a ‘less serious’ error, in which the erroneous and correct tags belong to the same major category or part of speech. As the table shows, the most frequent specific error types are within the verb category: VVB → VVI (55, or 9.8% of all VVB tags) and VVD → VVN (44, or 4.5% of all VVD tags).
Yet a further way of looking at the ambiguities and errors in the corpus is to make a coarse-grained calculation in counting these phenomena. In the fine-grained measurement, which is the one assumed up to now, each tag is considered to define its own word class, distinct from all other word classes. Using the coarse-grained calculation, on the other hand, we consider words to belong to different word classes (parts of speech) only when the major category is different. If we consider the pair NN1 (singular common noun) and NP0 (proper noun), the coarse-grained calculation says that the ambiguity tag NN1-NP0 or NP0-NN1 does not show tagging uncertainty, since both the proposed tags agree in categorizing the word as the same part of speech (a noun). So this does not add to the ambiguity rate. Similarly, the coarse-grained point of view on error is that, if a word is tagged as NN1 when it should be NP0, or vice versa, then this is not an error, because both tags are within the noun category. To summarize: in the fine-grained calculation, minor differences of wordclass count towards the ambiguity and error rates; in the coarse-grained calculation, they do not.
In this section, the same calculations are made as in section 3, except that errors and ambiguities which are confined within a major category (noun, verb, etc.) are ignored. In practice, most of the errors and ambiguities of this kind come from the difficulty the tagger finds in recognizing the difference between NN1 (singular common noun) and NP0 (proper noun), between VVD (past tense lexical verb) and VVN (past participle lexical verb), and between VVB (finite present tense base form, lexical verb) and VVI (infinitive lexical verb). Thus the ambiguity tags NN1-NP0, VVD-VVN and their mirror images do not occur in the relevant table (Table 28. Estimated error rates for the whole corpus) below. However, since there are no ambiguity tags for VVB and VVI, the problem of distinguishing these two shows up only in the error calculation.
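To make the distinction concrete, the following is a minimal sketch (ours, not part of the original error report) in which a mapping from C5 tags to major categories decides whether an ambiguity tag or an error counts under the coarse-grained calculation; the mapping shown is a small illustrative subset of the tagset:

# Illustrative subset of C5 tags mapped to major categories; not the full tagset.
MAJOR = {
    "NN0": "NOUN", "NN1": "NOUN", "NN2": "NOUN", "NP0": "NOUN",
    "VVB": "LEX-VERB", "VVD": "LEX-VERB", "VVG": "LEX-VERB",
    "VVI": "LEX-VERB", "VVN": "LEX-VERB", "VVZ": "LEX-VERB",
    "AJ0": "ADJ", "AV0": "ADV", "PRP": "PREP", "CJS": "CONJ",
}

def counts_as_ambiguity(ambiguity_tag, coarse=False):
    # Fine-grained: every ambiguity tag counts; coarse-grained: only when the
    # two halves belong to different major categories.
    first, second = ambiguity_tag.split("-")
    return True if not coarse else MAJOR[first] != MAJOR[second]

def counts_as_error(assigned, correct, coarse=False):
    if assigned == correct:
        return False
    return True if not coarse else MAJOR[assigned] != MAJOR[correct]

print(counts_as_ambiguity("NP0-NN1"))               # True  (fine-grained)
print(counts_as_ambiguity("NP0-NN1", coarse=True))  # False (both are nouns)
print(counts_as_error("VVD", "AJ0", coarse=True))   # True  (verb vs adjective)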
The three tables in this section correspond with the three tables in the preceding section.
Text type | Sample tag count | Ambiguity rate (%) | Error rate (%)
Written texts | 45,000 | 2.78% | 0.69% |
Spoken texts | 5,000 | 2.67% | 0.87% |
All texts | 50,000 | 2.77% | 0.71% |
It will be noted from Table 27. Estimated ambiguity and error rates for the whole corpus that this method of calculation reduces the overall ambiguity rate by c.1 per cent, and the overall error rate by c.0.5 per cent. We will not present coarse-grained tables corresponding to Table 25. Estimated ambiguity rates and error rates by tag and Table 26. Estimated frequency of selected tag-pairs above: these tables would be unchanged from the fine-grained calculation, except that the rows marked with an asterisk (*) would be deleted, and the other calculations changed as necessary.
Given that the elimination of errors was beyond our capability within the time frame and budget we had available, the corpus in its present form, containing ambiguity tags as well as a small proportion of errors, is designed for what we believe will be the most common type of user, who will find it easier to tolerate ambiguity than error. However, other users may prefer a corpus which does not contain ambiguities, even though its error rate is higher. For this latter type of user, the present corpus is easy to interpret as a corpus free of ambiguities, simply by deleting or ignoring the second tag of any ambiguity tag, and accepting the first tag as the only one. In what follows, we therefore allow two modes of calculation: in addition to the "safer" mode, in which ambiguities are allowed and consequently errors are relatively low, we allow a "riskier" mode in which ambiguities are abolished, and errors are more frequent. In fact, if ambiguity tags are eliminated, the overall error rate rises to almost 2 per cent.
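As an illustration of the "riskier" mode just described, the conversion is simply a matter of keeping the first half of every ambiguity tag; the function below is a minimal sketch of ours, not part of the distributed corpus tools:

def riskier_tag(c5_tag):
    # 'VVD-VVN' becomes 'VVD'; plain tags such as 'NN1' are returned unchanged.
    return c5_tag.split("-")[0]

print([riskier_tag(t) for t in ["NN1", "VVD-VVN", "NP0-NN1", "PRP"]])
# ['NN1', 'VVD', 'NP0', 'PRP']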
Text type | Sample tag count | Error rate (%)
Written texts | 45,000 | 2.01% |
Spoken texts | 5,000 | 1.92% |
All texts | 50,000 | 2.00% |
The following table gives an error count (c) for each tag: i.e. the number of errors in the 50,000 word sample where that tag was the erroneous tag. [Cf. the "safer" error count in Table 25. Estimated ambiguity rates and error rates by tag, column (f).] In addition, each tag has a correction count (d): i.e. the number of erroneous tags for which that tag was the correct tag. If we subtract the Error count (c) from the Tag count (b), and add the Correction count (d) to the result, we arrive at the "Real tag count" (e), representing the number of occurrences of that tag in the corrected sample corpus. Not included in the table is the small number of ‘multiword’ errors which resulted in two tags being replaced by one (error count), or one tag being replaced by two (correction count), due to the incorrect non-use or use of multiword tags. The last column divides the error count by the tag count to provide the error rate (as a percentage).
(a) Tag | (b) Tag count | (c) Error count | (d) Correction count | (e) Real tag count (b - c + d) | (f) Error rate (%) (c / b) x 100 |
AJ0 | 3750 | 102 | (132) | 3780 | 2.72% |
AJC | 142 | 4 | (12) | 150 | 2.82% |
AJS | 26 | 2 | (0) | 24 | 7.69% |
AT0 | 4351 | 2 | (3) | 4352 | 0.05% |
AV0 | 2495 | 65 | (67) | 2497 | 2.61% |
AVP | 423 | 16 | (17) | 424 | 3.78% |
AVQ | 167 | 9 | (6) | 164 | 5.39% |
CJC | 1915 | 3 | (1) | 1913 | 0.16% |
CJS | 731 | 27 | (5) | 709 | 3.69% |
CJT | 264 | 3 | (15) | 276 | 1.14% |
CRD | 940 | 1 | (11) | 950 | 0.11% |
DPS | 787 | 0 | (0) | 787 | 0.00% |
DT0 | 1200 | 23 | (29) | 1206 | 1.92% |
DTQ | 370 | 0 | (0) | 370 | 0.00% |
EX0 | 131 | 1 | (5) | 135 | 0.76% |
ITJ | 214 | 2 | (2) | 214 | 0.93% |
NN0 | 270 | 10 | (16) | 276 | 3.70% |
NN1 | 7712 | 205 | (152) | 7659 | 2.66% |
NN2 | 2773 | 37 | (29) | 2765 | 1.33% |
ORD | 136 | 0 | (2) | 138 | 0.00% |
NP0 | 1649 | 71 | (102) | 1680 | 4.31% |
PNI | 167 | 10 | (1) | 158 | 5.99% |
PNP | 2646 | 0 | (1) | 2647 | 0.00% |
PNQ | 112 | 0 | (0) | 112 | 0.00% |
PNX | 84 | 0 | (1) | 85 | 0.00% |
POS | 217 | 5 | (6) | 218 | 2.30% |
PRF | 1615 | 0 | (0) | 1615 | 0.00% |
PRP | 4217 | 36 | (45) | 4226 | 0.85% |
TO0 | 819 | 6 | (1) | 814 | 0.73% |
UNC | 158 | 4 | (29) | 183 | 2.53% |
VBB | 328 | 1 | (0) | 327 | 0.30% |
VBD | 663 | 0 | (0) | 663 | 0.00% |
VBG | 37 | 0 | (0) | 37 | 0.00% |
VBI | 374 | 0 | (0) | 374 | 0.00% |
VBN | 133 | 0 | (0) | 133 | 0.00% |
VBZ | 640 | 4 | (5) | 641 | 0.63% |
VDB | 87 | 0 | (0) | 87 | 0.00% |
VDD | 71 | 0 | (0) | 71 | 0.00% |
VDG | 10 | 0 | (0) | 10 | 0.00% |
VDI | 36 | 0 | (0) | 36 | 0.00% |
VDN | 20 | 0 | (0) | 20 | 0.00% |
VDZ | 22 | 0 | (0) | 22 | 0.00% |
VHB | 150 | 1 | (0) | 151 | 0.67% |
VHD | 258 | 0 | (0) | 258 | 0.00% |
VHG | 16 | 0 | (0) | 16 | 0.00% |
VHI | 119 | 0 | (1) | 120 | 0.00% |
VHN | 9 | 0 | (0) | 9 | 0.00% |
VHZ | 116 | 1 | (0) | 115 | 0.86% |
VM0 | 782 | 3 | (0) | 779 | 0.38% |
VVB | 644 | 112 | (13) | 545 | 17.39% |
VVD | 1060 | 78 | (60) | 1042 | 7.36% |
VVG | 729 | 29 | (29) | 729 | 3.98% |
VVI | 1211 | 7 | (73) | 1277 | 0.57% |
VVN | 1244 | 72 | (87) | 1259 | 5.79% |
VVZ | 321 | 23 | (12) | 310 | 7.17% |
XX0 | 363 | 0 | (0) | 363 | 0.00% |
ZZ0 | 75 | 3 | (4) | 76 | 4.00% |
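Before turning to the discussion of the table, the arithmetic behind columns (e) and (f) can be restated compactly; the sketch below uses the AJ0 row above as a worked example, and the function names are ours, purely for illustration:

def real_tag_count(tag_count, error_count, correction_count):
    # Column (e): (b) - (c) + (d)
    return tag_count - error_count + correction_count

def error_rate(tag_count, error_count):
    # Column (f): (c) / (b) x 100, as a percentage
    return 100.0 * error_count / tag_count

print(real_tag_count(3750, 102, 132))    # 3780  (AJ0 row)
print(round(error_rate(3750, 102), 2))   # 2.72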
It is clear from this table that the amount of error in the tagging of the corpus varies greatly from one tag to another. The most error-prone tag, by a large margin, is VVB, with more than 17 per cent error, while many of the tags are associated with no errors at all, and well over half the tags have an error rate of less than 1 per cent. The final table gives figures for the third level of detail, where we itemise individual tag pairs XXX, YYY, where XXX is the incorrect tag, and YYY is the correct one which should have appeared but did not. Only those pairings which account for 5 or more errors are listed. This table differs from Table 26. Estimated frequency of selected tag-pairs in that here the second tags of ambiguity tags are not taken into account ("riskier mode"). It will be seen that the errors which occur tend to fall into a relatively small number of major categories.
Incorrect tag XXX | Correct tag YYY | No. of occurrences of this error type | % of all incorrect uses of tag XXX | % of all tags XXX |
AJ0 | AV0 | 22 | 21.57% | 0.59% |
NN1 | 41 | 40.19% | 1.09% | |
NP0 | 5 | 4.90% | 0.13% | |
VVG | 14 | 13.73% | 0.37% | |
VVN | 14 | 13.73% | 0.37% | |
AV0 | AJ0 | 9 | 13.85% | 0.36% |
AJC | 8 | 12.31% | 0.32% | |
DT0 | 26 | 40.00% | 1.04% | |
EX0 (there) | 5 | 7.69% | 0.20% | |
PRP | 6 | 9.23% | 0.24% | |
AVP | CJT | 6 | 37.50% | 1.42% |
AVQ | CJS (when, where) | 6 | 66.67% | 3.59% |
CJS | PRP | 15 | 55.56% | 2.05% |
DT0 | AV0 (much, more, etc.) | 15 | 65.22% | 1.25% |
NN1 | AJ0 | 63 | 30.73% | 0.82% |
NN0 | 8 | 3.90% | 0.10% | |
NP0 | 74 | 36.10% | 0.96% | |
UNC | 9 | 4.39% | 0.12% | |
VVB | 9 | 4.39% | 0.12% | |
VVG | 13 | 6.34% | 0.17% | |
VVI | 13 | 6.34% | 0.17% | |
NN2 | NP0 | 14 | 37.84% | 0.50% |
UNC | 9 | 24.32% | 0.32% | |
VVZ | 10 | 27.02% | 0.36% | |
NN0 | UNC | 7 | 70.00% | 2.59% |
NP0 | NN1 | 50 | 70.42% | 3.03% |
NN2 | 5 | 7.04% | 0.30% | |
PNI | CRD (one) | 9 | 90.00% | 5.39% |
PRP | AV0 | 8 | 22.22% | 0.19% |
TO0 | PRP (to) | 6 | 100.00% | 0.73% |
VVB | AJ0 | 7 | 6.25% | 1.09% |
NN1 | 35 | 31.25% | 5.43% | |
VVI | 55 | 49.11% | 8.54% | |
VVN | 5 | 4.46% | 0.85% | |
VVD | AJ0 | 14 | 17.95% | 1.32% |
VVN | 64 | 82.05% | 6.04% | |
VVG | AJ0 | 11 | 37.93% | 1.51% |
NN1 | 18 | 62.07% | 2.47% | |
VVI | NN1 | 5 | 71.43% | 0.41% |
VVZ | NN2 | 20 | 86.96% | 6.23% |
Some of the error types above are associated with one or two particular words, and where these occur they are listed. For example, the AV0 - EX0 type of error occurs invariably with the one word there.
The first four phases were carried out automatically, using CLAWS4, an automatic tagger which developed out of the CLAWS1 automatic tagger (authors: Roger Garside and Ian Marshall 1983) used to tag the LOB Corpus. The advanced version, CLAWS4, is principally the work of Roger Garside, although many other researchers at Lancaster have contributed to its performance in one way or another. Further information about CLAWS4 can be obtained from Leech, Garside and Bryant 1994 and Garside and Smith 1997. CLAWS4 is a hybrid tagger, employing a mixture of probabilistic and non-probabilistic techniques. The fifth and sixth phases used other systems, described in the appropriate sections below.
The first major step in automatic tagging is to divide up the text or corpus to be tagged into individual (1) word tokens and (2) orthographic sentences. These are the segments usually demarcated by (1) spaces and (2) sentence boundaries (i.e. sentence-final punctuation followed by a capital letter). This procedure is not as straightforward as it might seem, particularly because of the ambiguity of full stops (which can be abbreviation marks as well as sentence-demarcators) and of capital letters (which can signal a naming expression as well as the beginning of a sentence). Faults in tokenization occasionally occur, but rarely cause tagging errors.
In tokenization, an orthographic word boundary (normally a space, with or without accompanying punctuation) is the default test for identifying the beginning and end of word-tokens. (See, however, the next paragraph and 6.6.4.4 D. Idiom-Tagging below.) Hyphens are counted as word-internal, so that a hyphenated word such as key-ring is given just one tag (NN1). Because of the different ways of writing compound words, the same compound may occur in three forms: as a single word written ‘solid’ (markup), as a hyphenated word (mark-up) or as a sequence of two words (mark up). In the first two cases, CLAWS4 will give the compound a single tag, whereas in the third case, it will receive two tags: one for mark and the other for up.
A set of special cases dealt with by tokenization is the set of enclitic verb and negative contractions such as 's, 're, 'll and n't, which are orthographically attached to the preceding word. These are given a tag of their own, so that (for example) the orthographic forms It's, they're, and can't are given two tags in sequence: pronoun + verb, verb + negative, etc. There are also some 'merged' forms such as won't and dunno, which are decomposed into more than one word for tagging purposes. For example, dunno actually ends up with the three tags for do + n't + know (for a list of these contracted forms, see 9.7 Contracted forms and multiwords).
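The following sketch illustrates this treatment of enclitics and merged forms; the splitter and the short lists it uses are ours, for illustration only, and do not reproduce the full inventory given in 9.7 Contracted forms and multiwords:

# Illustrative only: a handful of enclitics and one 'merged' form.
ENCLITICS = ("n't", "'s", "'re", "'ll", "'ve", "'d", "'m")
MERGED = {"dunno": ["do", "n't", "know"]}

def word_tokens(orthographic_form):
    # Return the word-tokens that will each receive a tag of their own.
    low = orthographic_form.lower()
    if low in MERGED:
        return MERGED[low]
    for enc in ENCLITICS:
        if low.endswith(enc) and len(low) > len(enc):
            return [orthographic_form[:-len(enc)], orthographic_form[-len(enc):]]
    return [orthographic_form]

print(word_tokens("It's"))      # ['It', "'s"]
print(word_tokens("can't"))     # ['ca', "n't"]
print(word_tokens("dunno"))     # ['do', "n't", 'know']
print(word_tokens("key-ring"))  # ['key-ring']  (hyphenated words keep a single tag)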
The second stage of CLAWS POS-tagging is to assign to each word token one or more tags. Many word tokens are unambiguous, and so will be assigned just one tag: e.g. various AJ0 (adjective). Other word tokens are ambiguous, taking from two to seven potential tags. For example, the token paint can be tagged NN1, VVB, VVI, i.e. as a noun or as a verb; the token broadcast can be tagged as VVB, VVI, VVD, VVN (verb which is either present tense, infinitive, past tense, or past participle). In addition, it can be a noun (NN1) or an adjective (AJ0), as in a broadcast concert.
When a word is associated with more than one tag, information is given by the lexicon look-up or other procedures on the relative probability of each tag. For example, the word for can be a preposition or a conjunction, but is much more likely to be a preposition. This information is provided by the lexicon, either in numerical form or, where the numerical data available are insufficient, by a simple distinction between 'unmarked', 'rare' and 'very rare' tags.
Some adjustment of probability is made according to the position of the word in the sentence. If a word begins with a capital, the likelihood of various tags depends partly on whether the word occurs at the beginning of a sentence. For instance, the word Brown at the beginning of a sentence is less likely to be a proper noun than an adjective or a common noun (normally written brown). Hence the likelihood of a proper noun tag being assigned is reduced at the beginning of a sentence.
The next stage, logically, is to choose the most probable tag from any ambiguous set of tags associated with a word token by tag assignment (but see 6.6.4.4 D. Idiom-Tagging below). This is another probabilistic procedure, this time making use of the context in which a word occurs. A method known as Viterbi alignment uses the probabilistic estimates available, both in terms of the tag-word associations and the sequential tag-tag likelihoods, to calculate the most likely path through the sequence of tag ambiguities. (The model employed is largely equivalent to a hidden Markov model.) After tag selection, a single 'winning tag' is selected for each word token in a text. (The less likely tags are not obliterated: they follow the winning tag in descending probability order.) However, the winning tag is not necessarily the right answer. If the CLAWS tagging stopped at this point, only c.95-96% of the word-tokens would be correctly tagged. This is the main reason for including an additional stage (or rather a set of stages) termed 'idiom-tagging'.
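As a minimal sketch of the idea (ours, not the CLAWS4 implementation), the following shows Viterbi selection over a toy sentence; all the probabilities are invented purely for illustration:

import math

# Toy word-tag likelihoods (as if from lexicon look-up) and tag-tag transition
# likelihoods; the figures are invented.
WORD_TAGS = {
    "the":   {"AT0": 1.0},
    "paint": {"NN1": 0.6, "VVB": 0.3, "VVI": 0.1},
    "dries": {"VVZ": 0.9, "NN2": 0.1},
}
TRANS = {
    ("AT0", "NN1"): 0.8, ("AT0", "VVB"): 0.05, ("AT0", "VVI"): 0.05,
    ("NN1", "VVZ"): 0.6, ("NN1", "NN2"): 0.1,
    ("VVB", "VVZ"): 0.1, ("VVB", "NN2"): 0.1,
    ("VVI", "VVZ"): 0.1, ("VVI", "NN2"): 0.1,
}

def viterbi(words):
    # best[tag] = (log probability of the best path ending in tag, that path)
    best = {t: (math.log(p), [t]) for t, p in WORD_TAGS[words[0]].items()}
    for word in words[1:]:
        new_best = {}
        for tag, p_word in WORD_TAGS[word].items():
            new_best[tag] = max(
                (score + math.log(TRANS.get((prev, tag), 1e-6)) + math.log(p_word),
                 path + [tag])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return max(best.values())[1]

print(viterbi(["the", "paint", "dries"]))   # ['AT0', 'NN1', 'VVZ']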
The idiom-tagging component of CLAWS is quite powerful in matching 'template' expressions in which there are wild-card symbols, Boolean operators and gaps of up to n words. These templates are much more variable than ‘idioms’ in the ordinary sense, and resemble finite-state networks.
Another important point about idiom-tagging is that it is split up into two main phases which operate at different points in the tagging system. One part of the idiom-tagging takes place at the end of Stage C., in effect retrospectively correcting some of the errors which would otherwise occur in CLAWS output. Another part, however, actually takes place between Stages B. and C. This means it can utilise ambiguous input and also produce ambiguous output, perhaps adjusting the likelihood of one tag relative to another. As an example, consider the case of so long as, which can be a single grammatical item - a conditional conjunction meaning 'provided that'. The difficulty is that so long as can also be a sequence of three separate grammatical items: degree adverb + adjective/adverb + conjunction/preposition. In this case, the tagging ambiguity belongs to a whole word sequence rather than a single word, and the output of the idiom-tagging has to be passed on to the probabilistic tag selection stage. Hence, although we have called idiom-tagging ‘Stage D’, it is actually split between two stages, one preceding C. and one following C.
entering VVG 86% NN1 14% AJ0 0%
Clearly VVG (-ing participle of the verb enter) is judged by CLAWS4 to be the most likely tag in this case.
The error rate with CLAWS4 averages around 3%.6 For the BNC Tagging Enhancement project, we decided to concentrate our efforts on the rule-based part of the system, where most of the inroads in error reduction had been made. This involved (a) developing software with more powerful pattern-matching capabilities than the CLAWS Idiomlist, and (b) carrying out a more systematic analysis of errors, to identify appropriate error-correcting rules.
These features can best be understood by an example. In BNC1 there were quite a number of errors in disambiguating prepositions from subordinating conjunctions, in connection with words like after, before, since and so on. The following rule corrects many such cases from subordinating conjunction (CJS) to preposition (PRP) tags. It applies a basic grammatical principle: subordinating conjunctions mark the start of clauses and generally require a finite verb somewhere later in the sentence.
#AFTER [CJS^PRP] PRP, ([!#FINITE_VB/VVN])16, #PUNC1
The two commas divide the rule into three units, each containing a word, a tag, or a word+tag combination. Square brackets contain tag patterns, and a tag following square brackets is the replacement tag (i.e. the action part of the rule). #AFTER refers to a list of words like after, before and since that have similar grammatical properties. These words are defined in a separate file; not all conjunction-preposition words are listed - as, for instance, can be used elliptically, without the requirement for a following verb (see Tagging Guidelines under as). The definition for #FINITE_VB contains a list of possible POS-tags (rather than word values), e.g. VVZ/VV0/VM0. Finally, #PUNC1 is a 'hard' punctuation boundary (one of . : ; ? and !). The patching rule can be interpreted as: 'If a sequence of the following kind occurs: a word like after, before or since, which CLAWS has identified as most likely being a subordinating conjunction, and less likely a preposition; an interval of up to 16 words, none of which has been tagged as a finite verb or past participle8 (NB [! … ] negates the tag pattern); and then a 'hard' punctuation boundary: change the conjunction tag to preposition.'
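The logic of this particular rule can also be paraphrased in code. The sketch below is ours, not the Template Tagger formalism; the word and tag lists are abbreviated stand-ins for #AFTER and #FINITE_VB, and the token representation is invented for illustration:

# Each token is a (word, tags) pair, tags being the CLAWS candidates in
# descending order of likelihood, e.g. ("after", ["CJS", "PRP"]).
AFTER_WORDS = {"after", "before", "since"}              # stand-in for #AFTER
FINITE_VB = {"VVZ", "VVB", "VVD", "VBZ", "VBD", "VM0"}  # abbreviated stand-in for #FINITE_VB
HARD_PUNC = {".", ":", ";", "?", "!"}                   # #PUNC1

def patch_cjs_to_prp(tokens, window=16):
    # If an #AFTER word tagged [CJS^PRP] is followed, within the window, by a
    # hard punctuation mark with no intervening finite verb or VVN, retag it PRP.
    for i, (word, tags) in enumerate(tokens):
        if word.lower() not in AFTER_WORDS or tags[:2] != ["CJS", "PRP"]:
            continue
        matched = False
        for later_word, later_tags in tokens[i + 1 : i + 2 + window]:
            if later_word in HARD_PUNC:
                matched = True          # reached #PUNC1 with nothing blocking
                break
            if later_tags[0] in FINITE_VB or later_tags[0] == "VVN":
                break                   # a finite verb (or VVN) blocks the rule
        if matched:
            tokens[i] = (word, ["PRP"])
    return tokens

sentence = [("She", ["PNP"]), ("left", ["VVD"]), ("after", ["CJS", "PRP"]),
            ("the", ["AT0"]), ("meeting", ["NN1"]), (".", ["PUN"])]
print(patch_cjs_to_prp(sentence)[2])    # ('after', ['PRP'])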
The rule doesn't always work accurately, and doesn't cater for all preposition-conjunction errors. (i) It relies to a large extent on CLAWS having correctly identified finite verb tags in the right context of the preposition-conjunction; sometimes, however, a past participle is confused with a past tense form. (We therefore added VVN, i.e. past participle, as a possible alternative to #FINITE_VB in the second part of the pattern. The downside of this was that Template Tagger ignored some conjunction-preposition errors containing a genuine use of VVN in the right context.) (ii) The scope of the rule doesn't cover long sentences where more than 16 non-finite-verb words occur after the conjunction-preposition. A separate rule had to be written to handle such cases. (iii) Adverb uses of after, before, since etc. need to be fixed by additional rules.
The Templates are targeted at the most error-prone categories introduced (or rather, left unresolved) by CLAWS. As with the preposition-conjunction example just shown, many disambiguation errors congregate around pairs of tags, for example adjective and adverb, or noun and verb. Sometimes a triple is involved, e.g. a past tense verb (VVD), past participle (VVN) and adjective (AJ0) in the case of surprised.
A small team of researchers sought out patterns in the errors by concordancing a training corpus that contained two parallel versions of the tagging: the automatic version produced by CLAWS and a hand-corrected version, which served as a benchmark. A concordance query of the form "tag A | tag B" would retrieve lines where the former version assigned an incorrect tag A and the latter a correct tag B. An example is shown below, in which A is a subordinating conjunction and B a preposition.
By working interactively with the parallel concordance, sorting on the tags of the immediate context, testing for significant collocates to the left and right, and generally applying his/her linguistic knowledge, the researcher can often detect sufficient commonality between the tagging errors to formulate a patching rule (or a set of rules) such as that shown above. It took several iterations of training and testing to refine the rules to a point where they could be applied by Template Tagger to the full corpus.9
It should be said that some categories of error were easier to write rules for than others. Finding productive rules for noun-verb correction was especially difficult, because of the many types of ambiguity between nouns, verbs and other categories, and the widely differing contexts in which they appear. The errors and ambiguity tags associated with NN1-VVB and NN2-VVZ in BNC2, documented in the error report, testify to this problem. Here a more sophisticated lexicon, detailing the selectional restrictions of individual verbs and nouns (and other categories), would undoubtedly have been useful.
In some instances the ordering of rules was important. When two rules in the same ruleset compete, the longer match applies. Clashes arise in the case of the multiply ambiguous word as, for instance. Besides the clear grammatical choices between a preposition and a complementiser introducing an adverbial clause, there are many "interfering" idiomatic uses (as well as, as regards, etc.) and elliptical uses (The TGV goes as fast as the Bullet train [sc. goes]). To avoid interference between the rules, we found it preferable to let an earlier pass of the rules handle more idiomatic (or exceptional) structures, and let a later pass deal with the more regular grammatical dependencies.
In many rule sets, however, we found that ordering did not affect the overall result, as we tried to ensure each rule was 'true' in all cases. Since, however, more than one rule sometimes carried out the same tag change to a particular word, the system was not optimised for speed and efficiency.
Besides the ordering of rules within rulesets, it is worth considering the placement of Template Tagger within the tagging schema (Figure 1). Ideally, it would be sensible to exploit the full pattern-matching functionality of Template Tagger earlier in the schema, using it in place of the CLAWS Idiomlist not just after statistical disambiguation, where it is undoubtedly necessary, but also before it. In this way Template Tagger could have precluded much unnecessary ambiguity passing to Stage C. above. The reason we did not do this was pragmatic: Template Tagger was in fact developed as a general-purpose annotation tool (see Fligelstone, Pacey and Rayson 1997), and not exclusively for the POS-tagging of BNC2. In future versions of the tagging software we hope to integrate Template Tagger more fully with CLAWS.
In the output, each word is given a tag in the form <NN1>, according to the standard TEI-based CDIF mark-up of the British National Corpus. Ambiguity tags (e.g. <NN1-AJ0>) are output if the difference between the probability of the first tag and of the second fails to reach a pre-decided threshold.
The final phase, "ambiguity tagging", merits a little further discussion. The requirement for such tags is clear when one observes that even using Template Tagger on top of CLAWS, there remains a residuum of error, around 2%, in the corpus. By permitting ambiguity tags we are effectively able to "hedge" in many instances that might otherwise have counted as errors - improving the chances of retrieving a particular tag, but at the cost of retrieving other tags as well. We considered that a reasonable goal would be to employ sufficient ambiguity tags to achieve an overall error rate for the corpus of 1%.
As we report under Error rates, the BNC in fact contains a higher error rate than 1%. This is because some thresholds applied at the 1% rate incurred a very high frequency of potential ambiguity tags: we hand-adjusted such thresholds if permitting a slight rise in errors led to a substantial reduction in the number of ambiguities. Further comments on stages E. and F. can be found in Smith 1997.
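A minimal sketch of the thresholding idea, with an invented threshold and invented candidate data (the actual thresholds were tag-pair specific and hand-adjusted, as described above):

def output_tag(ranked, threshold=0.75):
    # ranked: [(tag, probability), ...] in descending order of likelihood.
    # Emit a single tag if the winner is clear enough, otherwise an ambiguity
    # tag joining the two most likely candidates.
    if len(ranked) == 1:
        return ranked[0][0]
    (t1, p1), (t2, p2) = ranked[0], ranked[1]
    return t1 if p1 - p2 >= threshold else f"{t1}-{t2}"

print(output_tag([("VVD", 0.55), ("VVN", 0.45)]))                # 'VVD-VVN'
print(output_tag([("NN1", 0.97), ("AT0", 0.02), ("NP0", 0.01)])) # 'NN1'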
Two enhancements were made for the BNC XML edition: a lemma (or headword) and a simplified wordclass code were added for each word, given as attribute values on the <mw> and <w> XML elements. The simplified wordclass scheme used for the second of these enhancements is listed in 9.8 Simplified Wordclass Tags of the manual, where the mapping between these values and the C5 tags from which they are derived is also specified.
The lemmatization procedure adopted derives ultimately from work reported in Beale 1987, as subsequently refined by others at Lancaster, and applied in a range of projects including the JAWS program (Fligelstone et al 1996) and the book Word Frequencies in Written and Spoken English (Leech et al 2001). The basic approach is to apply a number of morphological rules, combining simple POS-sensitive suffix stripping rules with a word list of common exceptions.
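A minimal sketch of that approach (ours, not the rule files supplied by Paul Rayson), combining a few POS-sensitive suffix-stripping rules with a small exception list:

# Illustrative subset only: the real exception list and rule set are much larger.
EXCEPTIONS = {("men", "NN2"): "man", ("went", "VVD"): "go", ("better", "AJC"): "good"}
SUFFIX_RULES = [            # (C5 tag, suffix, replacement)
    ("NN2", "ies", "y"),    # parties  -> party
    ("NN2", "s", ""),       # rings    -> ring
    ("VVZ", "s", ""),       # wants    -> want
    ("VVD", "ed", ""),      # wanted   -> want
    ("VVN", "ed", ""),      # painted  -> paint
    ("VVG", "ing", ""),     # walking  -> walk
]

def lemma(word, c5_tag):
    w = word.lower()
    if (w, c5_tag) in EXCEPTIONS:
        return EXCEPTIONS[(w, c5_tag)]
    for tag, suffix, replacement in SUFFIX_RULES:
        if c5_tag == tag and w.endswith(suffix):
            return w[: len(w) - len(suffix)] + replacement
    return w

print(lemma("wanted", "VVD"))   # want
print(lemma("parties", "NN2"))  # party
print(lemma("men", "NN2"))      # man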
This process was carried out during the XML conversion, using code and a set of rules files kindly supplied by Paul Rayson.