A POST-EDITOR'S GUIDE TO CLAWS7 TAGGING

May 1996

Written by the UCREL team.

UCREL

University of Lancaster

Bailrigg

Lancaster

England LA1 4YT

GENERAL INTRODUCTION TO WORD-CLASS TAGGING
INTRODUCTION BY WORD-CLASS TO THE CLAWS7 TAGGING SCHEME
DISAMBIGUATION GUIDE (BY TAG-PAIR)
DISAMBIGUATION GUIDE (BY WORD)
CLAWS7 TAGLIST

SECTION 1 GENERAL INTRODUCTION TO WORD-CLASS TAGGING

1.1 A Basic Introduction to the CLAWS tagging scheme

CLAWS (Constituent Likelihood Automatic Word-tagging System) is a suite of computer programs for automatically assigning an appropriate grammatical tag to each word in a body of continuous text.

One or more potential word-tags from the CLAWS version 7 (C7) tagset are assigned using:

(i) probability data

(ii) the lexicon or wordlist

(iii) the suffixlist

(iv) the idiomlist

(i) Probability Data

CLAWS assigns potential word-tags using a number of rules based on the ending and orthography of the word, and then uses a Hidden Markov Model method for estimating the most likely word-tag in each context. This is a type of statistical language model which calculates the probabilities of a certain sequence of words requiring a certain sequence of grammatical tags.

Further information on probabilistic language analysis and the CLAWS programs can be found in The Computational Analysis of English, Garside, Leech and Sampson (1987), especially Chapters 3 & 4 and in Roger Garside's chapter in Short and Thomas (1996).
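The idea of choosing the most likely sequence of tags can be illustrated with a very small sketch. It is not the CLAWS implementation: it is a generic Viterbi search over tag pairs, and the function name most_likely_tags, the start tag ".", the fallback probability for unseen tag pairs and every figure in the tables are assumptions invented for this example.

```python
def most_likely_tags(candidates, transition, start="."):
    """Pick the most probable tag sequence for a run of words.

    candidates: one dict per word, mapping each candidate tag to a weight.
    transition: dict mapping (previous_tag, tag) to a probability.
    start:      tag assumed to precede the first word (an assumption here).
    """
    UNSEEN = 1e-6  # small probability for tag pairs missing from the table
    # best maps each tag to (probability, path) for the best path ending there
    best = {start: (1.0, [])}
    for options in candidates:
        step = {}
        for tag, weight in options.items():
            step[tag] = max(
                (
                    (prob * transition.get((prev, tag), UNSEEN) * weight, path + [tag])
                    for prev, (prob, path) in best.items()
                ),
                key=lambda item: item[0],
            )
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


# Invented figures, for illustration only.
TRANSITION = {
    (".", "AT"): 0.3,
    (".", "NN2"): 0.1,
    ("AT", "NN1"): 0.5,
    ("AT", "VV0"): 0.001,
    ("NN2", "NN1"): 0.001,
    ("NN2", "VV0"): 0.2,
}
MUSHROOM = {"NN1": 1.0, "VV0": 0.01}  # a rare tag treated as a weight of 0.01

print(most_likely_tags([{"AT": 1.0}, MUSHROOM], TRANSITION))   # "the mushroom"
print(most_likely_tags([{"NN2": 1.0}, MUSHROOM], TRANSITION))  # "towns mushroom"
```

With these invented figures the first call selects AT NN1 and the second selects NN2 VV0: the context can outweigh the fact that the verb tag is rare for the word.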

(ii) The Lexicon

The lexicon or wordlist consists of approximately 12,000 words, each listed with the possible tags for that word. Each word has between one and six candidate tags.

In effect, words have been added to the wordlist wherever CLAWS, using the probability data alone, failed to assign the correct tag: such a word had to be included in the list so that CLAWS could consult its entry before tag assignment was finalised.

For example, using probability data, CLAWS would automatically assign the tag for noun to the word 'mushroom' (NN1). But when we encounter the use of 'mushroom' as a verb (VV0) we know that it is now essential that the lexicon includes the following entry:

mushroom		NN1 VV0@

(The @ is a rarity symbol, indicating that this tag applies in less than one in a hundred cases. There is also a % rarity symbol, indicating that this tag applies in less than one in a thousand cases.)
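As a rough illustration of how an entry in this format might be read by a program, the sketch below splits a lexicon line into the word and its candidate tags, keeping any rarity symbol separate from the tag. The function name parse_lexicon_entry and the shape of the returned value are assumptions made for this example; they are not part of CLAWS.

```python
def parse_lexicon_entry(line):
    """Split a lexicon line such as 'mushroom NN1 VV0@' into (word, candidates).

    Each candidate is a (tag, rarity) pair, where rarity is '@', '%' or None.
    Hypothetical helper for illustration; not the actual CLAWS reader.
    """
    word, *raw_tags = line.split()
    candidates = []
    for raw in raw_tags:
        if raw[-1] in "@%":
            # Separate the rarity symbol from the tag itself.
            candidates.append((raw[:-1], raw[-1]))
        else:
            candidates.append((raw, None))
    return word, candidates


word, candidates = parse_lexicon_entry("mushroom\t\tNN1 VV0@")
print(word, candidates)
```

The line is assumed to contain a word followed by at least one tag, separated by spaces or tabs.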

However, with the BNC facilities, the use of increasingly large databases is now possible and we are moving towards the compilation of a much larger lexicon, incorporating other large wordlists for example.

The fact remains that it is the post-editor's responsibility to formulate lists of words which are candidates for inclusion in the lexicon, and it is recommended that, where the post-editor finds a CLAWS error, a check on the current lexicon is made, and if necessary a suggestion should be made for a new lexicon entry.

(iii) The Suffixlist

When CLAWS encounters a word that is not found in the lexicon, a basic morphological analysis of the word is carried out in order to assign a candidate tag or tags to the word. The suffixlist is a list of common or predictable word endings, each coupled with a list of one or more candidate tags. CLAWS will attempt to match the longest possible suffix to the word and assign the candidate tags associated with that suffix. Here are a few lines of the suffix list:

cle                       NN1 
dle                       NN1 VV0 
fle                       NN1 VV0 
gle                       NN1 VV0 
ile                       JJ NN1 
obile                     NN1 
phile                     NN1 JJ% 
pile                      NN1 
tile                      JJ

Post-editors should also suggest new suffixlist entries when they feel it is appropriate.
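The longest-suffix matching described above can be sketched as follows. The entries are copied from the extract of the suffixlist; the function name candidate_tags, the lower-casing of the word and the empty result for an unmatched word are assumptions made for this example, not a description of the CLAWS code.

```python
# A few suffixlist entries copied from the extract above.
SUFFIXES = {
    "cle": ["NN1"],
    "dle": ["NN1", "VV0"],
    "ile": ["JJ", "NN1"],
    "obile": ["NN1"],
    "phile": ["NN1", "JJ%"],
    "pile": ["NN1"],
    "tile": ["JJ"],
}


def candidate_tags(word, suffixes=SUFFIXES):
    """Return the tags of the longest suffix that matches the word, or []."""
    lowered = word.lower()
    # Try the longest possible ending first, then progressively shorter ones.
    for start in range(len(lowered)):
        ending = lowered[start:]
        if ending in suffixes:
            return suffixes[ending]
    return []


print(candidate_tags("francophile"))  # matches 'phile' rather than 'ile'
print(candidate_tags("fertile"))      # matches 'tile' rather than 'ile'
print(candidate_tags("juvenile"))     # only 'ile' matches
```

Because longer endings are tried first, a word ending in 'phile' never falls through to the more general 'ile' entry.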

(iv) The Idiomlist

This is not a list of idioms in the usual sense but a list of multi-word sequences to aid CLAWS disambiguation procedures. The entries fall roughly into two main groups:

(a) Ditto tag group

(b) Word disambiguation group

Although the list is less than twenty pages in length, its usefulness in helping with CLAWS disambiguation procedures is invaluable.

Ditto tags are used for sequences whose syntactic role in combination differs from the role of the same words in other contexts. For example, the phrase

all of a sudden

would cause CLAWS and post-editors alike a few tagging headaches. We know however that this particular combination of words only occurs as an adverbial form. Therefore this sequence of words can be treated as a single unit for grammatical tagging purposes. By the addition of numerals to the basic tag format we can indicate that this is a sequence of tags, representing a sequence of words which form one grammatical unit. These are called ditto tags. The above example is tagged:

all_RR41 of_RR42 a_RR43 sudden_RR44

(RR is the tag for general adverb)
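A small sketch may help to show how ditto tags can be checked mechanically. It assumes, on the basis of the example above, that the first added numeral gives the number of words in the sequence and the second gives the position of the word within it, and that no sequence is longer than nine words. The function names split_ditto_tag and is_complete_ditto are invented for this example.

```python
def split_ditto_tag(tag):
    """Return (base_tag, length, position) for a ditto tag, else None.

    Assumes a ditto tag is a basic tag followed by exactly two digits.
    """
    if len(tag) > 2 and tag[-2:].isdigit():
        length, position = int(tag[-2]), int(tag[-1])
        if 1 <= position <= length:
            return tag[:-2], length, position
    return None


def is_complete_ditto(tagged_words):
    """Check that word_TAG items form one whole ditto sequence, in order."""
    parts = [split_ditto_tag(item.rsplit("_", 1)[-1]) for item in tagged_words]
    if not parts or None in parts:
        return False
    base, length, _ = parts[0]
    expected = [(base, length, n) for n in range(1, length + 1)]
    return parts == expected


print(split_ditto_tag("RR41"))
print(is_complete_ditto("all_RR41 of_RR42 a_RR43 sudden_RR44".split()))
print(is_complete_ditto("all_RR41 of_RR42".split()))
```

A check of this kind could be used to spot a ditto sequence that has been broken during manual correction, for instance where only some of the words in the sequence have been retagged.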

Ditto tags are not included in the lexicon; they only appear in the idiomlist. (Only single orthographic words are included in the lexicon.) For most purposes, ditto tag sequences should be considered as a closed set. As far as possible, post-editors should not invent ditto tags. There are several good reasons for this. Often a decision will already have been reached that a certain collocation should not have ditto tags. Also, some software for processing tagged text (such as SARA) needs to know all ditto tag sequences that occur. If a post-editor feels that it is appropriate to create a new ditto, the following procedure must be followed:

do not use your new suggested ditto tag straight away

check the use of the word or construction across the corpus, and see how it has been tagged

suggest the new tag to your colleagues

if agreed upon, it will need to be tagged consistently in the text and corpus in question

the new ditto tag should be documented

a recommendation should be made for it to be included in the CLAWS idiomlist

The idiomlist is also currently used for a wide variety of word disambiguation problems. Three examples are given below.

The expression at times is not usually ambiguous, but there are several combinations of tags available from the lexicon. There are four candidate tags in the lexicon for times but whenever this word occurs after at the tag should be NNT2. In order to avoid errors it is a simple procedure to make the following entry in the idiomlist:

at II, times NNT2

The word junior has noun and adjective tags in the lexicon but whenever it occurs after a proper noun (NP1), the tag should be JJ (adjective) so there is an idiomlist entry:

[NP1], junior JJ

Multi-word place-names, such as Alice Springs, can cause problems for CLAWS. There are two candidate tags for "springs" in the lexicon so there is an idiomlist entry solely for the Australian town:

Alice NP1, Springs NP1
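The way entries of this kind narrow the choice of tags can be sketched as below. This is a simplified reading of the three entries shown above, not the CLAWS idiomlist processor: the function names, the treatment of a bracketed item such as [NP1] as "any word that has this tag among its candidates", the case-insensitive matching of words and the candidate tag lists in the example are all assumptions made for illustration.

```python
def parse_idiom(entry):
    """Parse 'at II, times NNT2' or '[NP1], junior JJ' into a list of slots.

    A slot is (word, tag); word is None for a bracketed slot such as [NP1].
    """
    slots = []
    for part in entry.split(","):
        fields = part.split()
        if len(fields) == 1 and fields[0].startswith("["):
            slots.append((None, fields[0].strip("[]")))
        else:
            slots.append((fields[0].lower(), fields[1]))
    return slots


def apply_idiom(tokens, slots):
    """Narrow candidate tags wherever the idiom matches.

    tokens: list of (word, candidate_tags) pairs. A new list is returned.
    """
    result = [(word, list(tags)) for word, tags in tokens]
    for i in range(len(result) - len(slots) + 1):
        window = result[i:i + len(slots)]
        if all(
            (tag in tags) if word is None else (w.lower() == word)
            for (w, tags), (word, tag) in zip(window, slots)
        ):
            # Every slot matched: keep only the tag named by the idiom.
            for offset, (_, tag) in enumerate(slots):
                result[i + offset] = (result[i + offset][0], [tag])
    return result


junior = parse_idiom("[NP1], junior JJ")
print(apply_idiom([("Davis", ["NP1"]), ("junior", ["NN1", "JJ"])], junior))
print(apply_idiom([("the", ["AT"]), ("junior", ["NN1", "JJ"])], junior))
```

In the first call the candidates for junior are reduced to JJ because a proper noun precedes it; in the second call nothing matches and both candidates are left for the probabilistic stage.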

1.2 A Basic Introduction to the Principles of Post-editing

At present the rate of accuracy for the CLAWS tagging system is between 95% and 98%, depending on the type of text. The primary object of post-editing is to correct the erroneous output, for example where a noun has been tagged as a verb, or an adverb as an adjective.

Section 3 of this document gives guidelines for deciding which is the correct tag in cases where the distinction is not always obvious. Section 3.1 deals with pairs of tags which may often be confused and Section 3.2 supplements this with guidelines relating to specific ambiguous words.

The important by-product of post-editing is to improve the tagging system itself. In the past, all post-editing has been done manually with corrections input using a simple screen-editor. The BNC project, however, involves the accurate tagging of 100 million words; as the greatest throughput of the tagging and parsing projects was previously 1 million words per year, this gives an indication of how large the task is.

It is therefore the responsibility of the post-editors to:

investigate recurring errors in the CLAWS output

log all inconsistencies accurately and conscientiously

devise methods to improve CLAWS accuracy - these methods will not necessarily mean programming modifications (although obviously creative thinking is encouraged) but include constant updating of the lexicon, the idiomlist, the suffixlist and, of course, this document.
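One simple way of investigating recurring errors is to compare the automatic output with the post-edited version and count each pair of (CLAWS tag, corrected tag). The sketch below does this for text in the word_TAG format used in this guide. The function name compare_tagging is invented, and the "automatic" tagging in the example is a made-up erroneous output, not real CLAWS output.

```python
from collections import Counter


def compare_tagging(automatic, corrected):
    """Return (accuracy, error_counts) for two word_TAG versions of a text.

    error_counts maps (automatic_tag, corrected_tag) to a frequency.
    Assumes every item has the form word_TAG and the text is not empty.
    """
    auto = [item.rsplit("_", 1) for item in automatic.split()]
    gold = [item.rsplit("_", 1) for item in corrected.split()]
    if [word for word, _ in auto] != [word for word, _ in gold]:
        raise ValueError("the two versions do not contain the same words")
    errors = Counter(
        (a_tag, g_tag) for (_, a_tag), (_, g_tag) in zip(auto, gold) if a_tag != g_tag
    )
    accuracy = 1 - sum(errors.values()) / len(auto)
    return accuracy, errors


accuracy, errors = compare_tagging(
    "The_AT Far_JJ East_ND1",   # invented automatic output
    "The_AT Far_NP1 East_NP1",  # post-edited version
)
print(round(accuracy, 2))
print(errors.most_common())
```

Tag pairs that recur frequently in such a count are natural candidates for a new lexicon, suffixlist or idiomlist entry.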

1.3 General-specific ambiguity

The tags correspond to a number of general word classes (nouns, adverbs, determiners etc.). The complete tagset can be seen in Section 5. Within these tag-groups there is normally a 'general' tag, and a number of 'specific' tags, used for various subcategories of the general word-class. e.g.:

RR = general adverb (general tag)

RG = degree adverb (specific tag)

Either by searching a lexicon, or by assessing the likely properties of a word on the basis of its form (using a set of 'suffix rules'), CLAWS7 decides which of the tags might be assigned to a particular word, and then calculates which is most likely to be the appropriate one in the given context.

In the case of a word which is in the lexicon, the entry will specify the tags to be considered. e.g.:

jeer		NN1 VV0

In practice, CLAWS7 makes few errors in deciding whether a word is a noun or a verb, because it is deciding between two distinct classes of words with very different syntactic properties. However, a word may have different meanings suggesting that it should be allowed 2 or more different tags within the same general word class. e.g.:

I have broken my FOOT
Do you have a one FOOT ruler?

In the first example, 'foot' is clearly a general noun (NN1), as there is no separate tag for 'parts of the body', but in the second example, 'foot' corresponds to the specialized tag NNU, used for 'units of measurement'. In order to assign tags along these lines, we would need to either a) include the tags NN1 and NNU in the lexicon entry for 'foot', or b) include just one of the tags, and correct the output manually.

In fact, we do neither. Instead, we adopt a principle which can be stated thus:

AVOID PROLIFERATION OF TAGGING AMBIGUITIES BY ALLOWING THE MORE GENERAL CATEGORY TO SUBSUME THE MORE SPECIFIC ONE

Thus, foot will always be tagged NN1, whichever of its noun meanings is the case. Unambiguous words or items such as ft. or inch (when the latter is a noun), will, on the other hand, contain the specific tag in the lexicon entry, but not the general one.

1.3.1 Exceptions

To aid automatic tagging or parsing, a few known general-specific ambiguities have been allowed to persist. The notable examples are:

so RR (general adverb, as in 'so you think you're clever, do you?'), RG (degree adverb, as in: 'so nice')

too RR (as in 'I'm coming too'), RG (as in 'too much')

more, less RRR (comparative general adverb), RGR (comparative degree adverb)

if normally CS (subordinating conjunction) but tagged CSW when it means 'whether'.

Note also that NN1 and NP1 are considered to be distinct tags and not specific varieties of a generic 'noun' tag, so the principle of the non-proliferation of tags does not apply.

1.3.2 Updating Resources

One further important reason for post-editing is to improve the automatic tagging system. From time to time, one may, in the course of post-editing, become aware of a lexicon entry which is the cause of automatic tagging errors, or which conflicts with the principle of 'no general-specific ambiguity'. It may be decided subsequently to amend the lexicon.

Similarly, particular sequences of words may be particularly likely to be erroneously tagged. Often, this is because the words themselves, in that particular sequence, require tags which are different from those they would normally be assigned. e.g.:

at_RR21 all_RR22 (equivalent to a single adverb)

The Far_NP1 East_NP1 (tagged as a proper noun)

In many such cases, the CLAWS7 idiomlist can help to make it more likely that the correct tags will be assigned. Post-editing is therefore a means of discovering word chains which might be included in the idiomlist.

Finally, post-editors frequently encounter usages which seem open to alternative tagging strategies. This is particularly true in the case of a range of types of naming expression. Achievement of a consistent post-edited output therefore rests on the adoption of conventions according to the type of expression. Section 2 includes descriptions of conventions devised so far to deal with the tagging of such naming expressions (see section on nouns, 2.6.3ff).

SECTION 2 INTRODUCTION BY WORD-CLASS TO THE CLAWS7 TAGGING SCHEME

2.1 Adjectives

The main class of adjectives, those which can be used predicatively or attributively (whether or not with the same meaning), are tagged JJ.

JJR is used for comparative adjectives (e.g. whiter) and JJT is used for superlative adjectives (e.g. whitest).

JK (catenative) is used for able, unable and willing in sentences like: "Will you be able_JK to manage?" but not when used as a general adjective as in: "Your son is very able_JJ."

SYNTACTIC AMBIGUITY (see Section 3)

1. JJ > VVG, JJ > VVN

2. JJ > NN1

3. JJ > RR, JJR > RRR

2.2 Adverbs

As well as the general adverb (RR), tags exist for degree adverbs (very, too etc.) (RG), prepositional adverbs or particles (RP), locative adverbs (RL) and adverbs of time (RT). For the first two of these, comparative and superlative tags exist (RRR, RRT, RGR, RGT). Further adverb tags are listed in Section 5.

Adverb words are subject to tagging errors due to a variety of sources of ambiguity. See esp.:

Section 2: Articles and Determiners (2.3)
           Prepositions vs. RP and RL (2.7.2)

Section 3: DAR > RRR (more, less)
           II > RP, II > RL (more on prepositions vs. adverbs/particles)
           JJ > RR, JJR > RRR (adjectives vs. adverbs)

Section 4: any; as; but; each; how; much; no; so; when

It is within the adverb class that most legitimate cases of general-specific ambiguity exist (see Section 4). Some tags may require changing manually from the general one to the specific one or vice-versa.

In some cases, specific usage is not tagged as such (e.g. really can only be an RR, even when it is an intensifier). In other cases there are specific tags available (e.g. bloody can be an RR or an RG). This is an area of inconsistency that needs to be addressed. Suggestions from post-editors are needed.

Underlying a number of other automatic tagging errors is the fact that some words, though frequently used, occur rarely as adverbs (e.g. but, each, good, no). It is therefore important for the post-editor to be familiar with the principles of tag choice affecting such words.

2.3 Articles, Determiners & Pronouns

All determiners capable of a pronominal function receive D tags (see Section 5 for a complete list), regardless of whether they are acting pronominally or not. These are categorised according to the positions in which they may occur in a complex noun phrase.

Words that are only pronouns (they, nobody etc.) are given P tags. The, a/an, no and every are tagged as articles (AT or AT1).

The main source of automatic tagging errors associated with 'D'-words is ambiguity between determiners and adverbs. See Section 3: DAR > RRR (more and less) and Section 4: any; each; much; no.

Errors are sometimes associated with the ambiguity between that as a determiner (DD1), as a conjunction (CST) and as a degree adverb (RG).

2.4 Conjunctions

The two most commonly occurring types of conjunction are:

coordinating conjunctions (and_CC, or_CC; but_CCB),

subordinating conjunctions (if_CS, because_CS).

The latter are sub-categorized as follows:

CS general subordinating conjunction

CSA as (see Section 4)

CSN than

CST that as a conjunction

CSW whether; also if when it means whether.

The sequence "as well as" may be tagged as a CC idiom. However, at present CLAWS7 tends to mis-tag this due to the 'overlapping' idiom "as_RR21 well_RR22" (meaning 'also'). Attention should therefore be paid to this at the post-editing stage.

Sometimes plus and minus may be tagged CC, but they should be tagged II when they link noun phrases and signify addition or subtraction (see guidelines for times in Section 4).

2.6 Nouns

2.6.1 Introduction

The basic tags for common nouns are NN1, NN2 and NN.

e.g.: table_NN1

tables_NN2

men_NN2

people_NN

group_NN1

groups_NN2

sheep_NN

NN is used for nouns that can have plural and singular modifiers yet do not change their form (i.e. words that are morphologically neutral for number) like sheep, and words that have distinct singular and plural meanings, like aids and butchers.

Various subcategories of noun are given more specialized tags (see section 5 for a complete list). Certain conventions have been adopted for the tagging of various types of noun phrase, either automatically, or, if necessary, manually by post-editors. These are described below.

Because of the principle of avoiding ambiguity within a word class between a general and a specific category, many decisions have had to be made about the appropriate tagging for nouns with more than one meaning.

The subcategories of noun which present most dilemmas are NNL1/NNL2 (locative nouns) and NP/NP1/NP2 (proper nouns). There are also the tags NNT1/NNT2, used for temporal nouns (which may also be used adverbially), though cases of ambiguity between NNT* and NN* seldom arise (two exceptions being fall and spring, which are tagged NN1).

2.6.2 Major Sub-Categories Of Noun

There follows a brief description of the criteria used for deciding when these tags are appropriate. This is followed by details of the conventions adopted for tagging various types of noun-phrase which have been found to raise questions.

2.6.2.1 NP/NP1/NP2

These proper noun tags are used for:

names of people

'proper noun' elements of company names etc.

names of places (e.g. countries, towns and villages), where these are not included in other classes of "capitalised noun" tags, e.g. NNL1. Compare:

Vermont_NP1
Atlantic_NP1 City_NNL1

names of newspapers

names of institutions and products where the words are not common nouns, e.g.:

IBM_NP1, Spar_NP1, Lancia_NP1

but NOT for:

names of horses,

names of ships etc.

names of teams

names of pubs, hotels etc.

titles of books, plays, films etc.,

and any other lexical nouns forming part of a naming expression.

The general principle can be stated as:

Bona fide proper nouns are words that are (i) names of people, (ii) geographical places and (iii) other names of things that are not also common lexical nouns. Bona fide proper nouns are given NP tags. Other nouns that form part or the whole of a naming expression are tagged as common nouns, unless they are words that are normally an NP/NP1/NP2 in any case.

A few examples (there are more for each category below).

Ships:

HMS_NNB Brilliant_JJ
The_AT Queen_NNB Mary_NP1

People:

John_NP1 Smith_NP1
Kate_NP1 Moss_NP1

Horses:

Shergar_NP1
White_JJ Flash_NN1

Product and company names:

IBM_NP1
Word_NN1 for_IF Windows_NN2
Volkswagen_NP1 Golf_NN1

Place names (see also more below):

Lancaster_NP1
Leicester_NP1 Square_NNL1
Old_NP1 Street_NNL1

The NP tag is used only for those cases where a proper noun is morphologically neutral for number (see NN above). There is therefore a very restricted, but open, set of words that are given this tag. Only those proper nouns that are countable and unchangeable in form come into this category, e.g. Mercedes, Sainsburys, Tescos.

2.6.2.3 NNL1/NNL2

The NNL tags are used for a closed list of words which have a locative meaning, and which occur (normally with a capital letter) in complex expressions for naming geographical places. Institutions have general common noun tags, although there is sometimes a difficult distinction to be drawn between institutional and geographical reference, e.g. The British Museum. Since the NNL tags are only used in compounds, they are only assigned by the idiomlist and not by the lexicon.

Leicester_NP1 Square_NNL1
Old_NP1 Street_NNL1
The_AT Atlantic_NP1 Ocean_NNL1

The full list of words that can be tagged NNL1 or NNL2 is as follows:

NNL1          NNL2
City          Hills
Close         Islands
Hill          Isles
Island        Mountains
Isle
Lake
Lane
Mount
Ocean
Place
River
Road
Sea
Square
Street

Shortened forms of these words get NNL tags, such as Rd, St (meaning Street) and Mtns, etc.

The NNL tag is also still regarded as valid for a word referring, not to the physical location itself, but to the activity or institution which is associated with or contained within that location. Thus:

Downing_NP1 Street_NNL1 registered_VVD its_APPGE disapproval_NN1 ._.

See also the section on place names below.

2.6.2.4 NNB tag

NNB (where B stands for 'before') is used for a set of nouns which function as a title in a person's name, and is only used when these words are used as a title, e.g.:

Sergeant_NNB Jones_NP1
He was a conscientious sergeant_NN1
footballer_NN1 Geoff_NP1 Hurst_NP1
"Yes, Sergeant_NN1"

Nouns denoting family relations, even though some of them do not meet the "title in a person's name" criterion, are included in the NNB set (e.g. uncle, aunt, auntie). NNB is used for abbreviated nouns of style occurring before a name:

Mr._NNB Jones
Coun._NNB Alf Roberts

NNA (where A = "after") is used for abbreviated nouns of style or title appended to names:

Anne Collins J.P._NNA
Gordon Banks O.B.E._NNA

Note that 'St.', meaning 'Saint', is always tagged NP1.

2.6.2.5 NNT1/NNT2

These tags are used for temporal nouns, which can be used as the head of an adverbial expression. e.g.:

I saw him last week_NNT1
Every year_NNT1, it's the same story

Nouns that also have a non-temporal meaning, such as Spring and Fall, are always tagged NN1.

See also Section 4: "times"

2.6.3 Categories Of Naming Expressions

2.6.3.1 Place Names And Locations

The tagging for place-names and geographical locations consists predominantly of NP, NNL and common noun tags. Words that are tagged as common nouns are parts of names which contain a capitalized word which is not generally recognized to be part of the name itself, but is used instead as a qualifier to the name, and is thus performing its normal lexical function (in spite of the capital). Thus:

New_NP1 York_NP1
West_NP1 Berlin_NP1

But:

New_NP1 York_NP1 West_ND1 
East_ND1 Chicago_NP1

The difference between 'West Berlin' and 'East Chicago' is that the former is (or was) an institutionalised name, whereas the latter is not. (The fact that West Berlin is no longer an official entity is not relevant here, since reference to places in the past is possible). References to places may often contain an NNL1 (or NNL2), such as Lake, Mount, River or Isles. However, a word which may be tagged NNL1/2 is tagged NP1 in cases where it is an integral part of the name, not an additional descriptive qualification in the way NNL-tagged words are. Thus:

Mount_NNL1 Igman_NP1
River_NNL1 Tyne_NP1
Lake_NNL1 Placid_NP1
Fylde_NP1 Avenue_NNL1

But:

Lake_NP1 District_NN1
Street_NP1 Lane_NNL1
Avenue_NP1 Road_NNL1

The test we apply is to see whether e.g. Lake District is a kind of lake. The answer is no, so Lake is an NP1. Lake Placid is a lake, however, so Lake is an NNL1. An extension of this rule is applied in the case of names such as Long Beach, which is not a kind of beach, or Bowling Green, which is the name of a town. In applying this test we do not recognise the derivation of such names, and tag them as proper nouns:

Long_NP1 Beach_NP1
Bowling_NP1 Green_NP1

A qualification to the rule, on the other hand, applies in the case of a plural NNL or NN word which has become part of a singular proper noun. In such cases, the tag NP1 is used. E.g.:

Alice_NP1 Springs_NP1
Beverly_NP1 Hills_NP1
Strawberry_NP1 Fields_NP1
Grand_NP1 Rapids_NP1
Yorktown_NP1 Heights_NP1

It is clear that such names do not function as plural nouns. Compare, for example:

Beverly_NP1 Hills_NP1 is a suburb of L.A.
The Malvern_NP1 Hills_NNL2 are made of granite

It is for this reason that "United States" is tagged:

United_NP1 States_NP1

Note also that a place name preserves its tagging when it is subsumed into a longer naming expression:

Long_NP1 Island_NNL1
Long_NP1 Island_NNL1 Sound_NN1

Other words which are tagged as NP1 when part of a place-name are Greater and St., e.g.:

Greater_NP1 Manchester_NP1
St._NP1 Louis_NP1

2.6.3.2 Nationality And Language.

There is sometimes a problem deciding how to tag words relating to nationality, language etc. when they have an adjectival form. As a rule, the language is tagged as a noun (NN1), whilst the same word used as an adjective should be tagged JJ:

French_JJ people usually speak French_NN1

The tagging for specific or generic reference is shown in the table below.

If the word for generic or plural specific reference is the same as the adjectival form (i.e. does not undergo morphological pluralisation), then the word is tagged as an adjective, e.g.: The French_JJ (cf. the poor_JJ). In most cases, the word is not available for singular reference. Even when it is (e.g. A Japanese), the tag JJ is retained. The key test is whether these words can take an additional plural ending. Those which cannot are treated as adjectives (see type 1 in the table below). On the other hand, words which can be used either adjectivally or for singular specific reference, but which take a plural ending for generic reference (or plural specific reference), are tagged as nouns when they refer to people (see type 2 in the table below).

Type 1

Language          Person           People           Adjective       
NN1               JJ               JJ               JJ              


Dutch             Dutch            Dutch                            
English           English          English                          
Flemish           Flemish          Flemish                          
French            French           French                           
Irish             Irish            Irish                            
Polish            Polish           Polish                           
Spanish           Spanish          Spanish                          
Welsh             Welsh            Welsh                            
Chinese           Chinese          Chinese                          
Japanese          Japanese         Japanese                         
Portuguese        Portuguese       Portuguese                       
Swiss             Swiss            Swiss                            
Vietnamese        Vietnamese       Vietnamese

Type 2

NN1               NN1              NN2              JJ              
                  African          Africans         African         
                  American         Americans        American        
Arabic            Arab             Arabs            Arabic/Arab     
                  Arabian          Arabians         Arabian         
                  Asian            Asians           Asian           
                  Australian       Australians      Australian      
                  Canadian         Canadians        Canadian        
German            German           Germans          German          
                  Belgian          Belgians         Belgian         
                  Brazilian        Brazilians       Brazilian       
                  European         Europeans        European        
Hungarian         Hungarian        Hungarians       Hungarian       
                  Indian           Indians          Indian          
Italian           Italian          Italians         Italian         
Norwegian         Norwegian        Norwegians       Norwegian       
Russian           Russian          Russians         Russian

In the case of compound adjectives or nouns referring to nationality, the tagging is extrapolated from the tagging applied to the name of the country. Thus:

The West_NP1 German_JJ Chancellor_NNS1
West_NP1 Indian_JJ food_NN1
The West_NP1 Indians_NN2 
Puerto_NP1 Ricans_NN2 
A South_NP1 African_NN1 
South_NP1 African_JJ policies

2.6.3.3 Race And Religion

The guidelines for tagging words of adjectival form relating to race and/or faith are parallel to those for nationality. Those which can be pluralised may be tagged NN1, NN2 or JJ:

A South_NP1 African_JJ Black_NN1
A Black_JJ South_NP1 African_NN1
A Black_JJ youth_NN1 and three whites_NN2
White_JJ supremacy_NN1
Black_JJ Liberation_NN1
black_JJ nationalists_NN2
black_JJ nationalist_JJ campaigners_NN2

Similarly:

A Roman_JJ Catholic_JJ priest
A Roman_JJ Catholic_NN1
Moslem_JJ customs_NN2
A group of Buddhists_NN2

2.6.3.4 Directional Terms

Words representing points on the compass are tagged ND1, whether they are used adjectivally, nominally or adverbially. This applies whether they are simple words, hyphenated words or abbreviations. e.g.:

north-east_ND1
S.E._ND1
south_ND1
southwest_ND1

Their derivative '-ern' adjectives are tagged JJ:

Northwestern_JJ
Southern_JJ

These rules are overridden in cases where:

1. The word is an essential part of the name of a country, region or place (see paragraphs 1 & 4 above):

The Middle_NP1 East_NP1
West_NP1 Germany_NP1

2. When the word stands alone in reference to a company or similar organization (see paragraph 6a below):

Following the revelations of malpractice in the Eastern_JJ Railway_NN1 Co._NN1, the head of Eastern_NP1 has resigned

3. When the word is a proper noun in its own right. e.g.:

Colonel_NNB Oliver_NP1 North_NP1

2.6.3.5 Company Names

A typical company name would be tagged thus:

Schlitz_NP1 Brewing_NN1 Co._NN1

The proper noun tag is retained for elements of the title which are clearly names (e.g. surnames), but other elements are tagged as they would be normally. Thus:

Filmpower_NP1
Chrysler_NP1 Motor_NN1 Corp._NN1
Safeway_NP1 Limited_JJ

If a proper noun happens to coincide with a common noun, we still assign an NP-tag:

Gateway_NP1 Productions_NN2 Inc._JJ
Storeys_NP1 Ltd._JJ

But if the title is merely composed of common lexical items, proper noun tags are not used:

Resorts_NN2 International_JJ
News_NN1 International_JJ
Federated_JJ Department_NN1 Stores_NN2

This distinction is not always obvious. The test used to judge whether an NN tag or an NP tag is appropriate is to ask whether these words are used in something approximating to the normal way. Since, presumably, Gateway Productions do not make gates, and Storeys are probably named after a person called Storey, there is a sense in which they are much further from their normal lexical function than the words 'Resorts', 'Federated' and 'News' in the examples above.

A different tagging strategy is employed when the company title is truncated to a single word. In such cases we use the NP1 tag:

A spokesman for Federated_NP1 said...
When I started working for International_NP1 ...

See also the section above on 'directional words' (Southern, East etc.), which often form part of corporate titles, and where similar considerations apply. Note that commerciality is not a criterion for applying the above guidelines. Non-commercial organizations are tagged in the same way:

Maine_NP1 Correctional_JJ Centre_NN1
New_NP1 York_NP1 Central_JJ Youth_NN1 Club_NN1

It is characteristic of such titles, that examples will always occur which show up the inadequacy of any concise guidelines for tagging and tag-correction. For instance, in the case of 'Pan American' and 'Pan Am', it was decided these should always be tagged as proper nouns:

Pan_NP1 Am_NP1
Pan_NP1 American_NP1

and that 'General Electric' should be tagged:

General_JJ Electric_NN1

Probably the best way to ensure that the tagging of a large body of text remains as consistent as possible is to build up a 'caselaw' of such tagging decisions as they are made, and if possible to add them to the idiomlist.
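
One way a post-editing team might keep such a caselaw is as a simple lookup table. The Python sketch below is hypothetical: it does not reproduce the format of the CLAWS idiomlist, the names CASELAW and apply_caselaw are our own, and the only entries are the three decisions recorded above.

```python
# Hypothetical caselaw table, keyed by the lower-cased word sequence.
# The entries are the decisions recorded in the text above.
CASELAW = {
    ("pan", "am"): ["NP1", "NP1"],
    ("pan", "american"): ["NP1", "NP1"],
    ("general", "electric"): ["JJ", "NN1"],
}


def apply_caselaw(words, tags, caselaw=CASELAW):
    """Return a copy of tags with any recorded decisions applied.

    Longer word sequences are tried before shorter ones.
    """
    result = list(tags)
    longest = max((len(key) for key in caselaw), default=0)
    i = 0
    while i < len(words):
        for n in range(longest, 0, -1):
            key = tuple(w.lower() for w in words[i:i + n])
            if len(key) == n and key in caselaw:
                result[i:i + n] = caselaw[key]
                i += n
                break
        else:
            # No recorded decision starts at this word
            i += 1
    return result
```

The advantage of a table of this kind is that each decision is made once and then applied uniformly, whoever is doing the post-editing.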

2.6.3.6 Product Names

Product names are given NP tags when the words do not coincide with common lexical items, or when they are bona fide proper nouns:

Cadillac_NP1 Eldorado_NP1
I drive a Mini_NP1
He wasn't very good with Hoovers_NP2

This also applies when the NP term precedes the head of the phrase:

A Burberry_NP1 raincoat
A Boeing_NP1 airliner

2.6.3.7 Names Of Teams

Names of sports teams (e.g. The Green Bay Packers; The Chicago Bears; New York Rangers; Boston Bruins; the Blue Devils) are tagged as though they were common nouns. This is the case even when the head of the team-name has the appearance of a proper noun (e.g. The Finns). This rule only applies to the head of the naming expression, and to words which are explicitly referring to teams. Such words will usually be plural, but see the 'Chicago Bear' example below. Other elements of a team's name (e.g. a place name) should be tagged as they would be in other contexts. When a place-name is substituted for the full name of a team (see the 'Los Angeles' example below), it retains its NP tags, as the team reference is implicit.

These points are illustrated in the examples that follow.

Manchester_NP1 United_JJ
Birmingham_NP1 City_NN1
Oldham_NP1 Athletic_JJ
Queen_NN1 of_IO the_AT South_ND1
Tottenham_NP1 Hotspur_NP1
New_NP1 York_NP1 Rangers_NN2
The_AT Buffalo_NP1 Sabres_NN2
Indiana_NP1 Pistons_NN2
Green_NP1 Bay_NNL1 Packers_NN2
Chicago_NP1 Bear_NN1 Chuck_NP1 Smith_NP1
Los_NP1 Angeles_NP1 beat_VVD the_AT Flying_JJ Finns_NN2

2.6.3.8 Horses

We have adopted the convention of tagging the words in horses' names as though they were ordinary lexical items, and ignoring the capital letters:

Arabian_JJ Knight_NN1
Black_JJ and_CC White_JJ
Fish_NN and_CC Chips_NN2
King_NNB Arthur_NP1
Happy_JJ Days_NNT2

Note that Arthur is tagged NP1 because it is unquestionably a bona fide proper noun.

2.6.3.9 Names Of Ships Etc.

These are treated in the same way as names of horses, a common tag being JJ:

H.M.S._NNB Invincible_JJ
H.M.S._NNB Tenacious_JJ
H.M.S._NNB Tiger_NN1
Sir_NP1 Galahad_NP1
The_AT Queen_NNB Elizabeth_NP1 II_MC

Note that in the last two examples, the NP1 tag is used for personal names. The same would apply for quasi-proper nouns:

H.M.S._NNB Nautilus_NP1

2.6.3.10 Titles Of Newspapers

We have adopted the convention of tagging the titles of newspapers as common lexical items:

The_AT Sun_NN1
The_AT Daily_JJ Telegraph_NN1

However, the following points should be noted:

(1) Times is tagged NP1 rather than NNT2 when it occurs in a newspaper title.

(2) If the name occurs as part of a company name, there may be a conflict with the guidelines under section 2.6.3.5 Company Names above. In this case, we give priority to the test described there:

Mirror_NP1 Group_NN1 Newspapers_NN2
The owner of Today_NP1 Newspapers_NN2

2.6.3.11 Titles Of Books, Plays, Films Etc.

As with names of horses and ships, we attempt to keep NP-tagging to a minimum, using it only for words which would be NP1 or NP2 in other contexts:

What_DDQ Katy_NP1 Did_VDD Next_MD
The_AT Diary_NN1 of_IO a_AT1 Nobody_NN1
Frankenstein_NP1
Life_NN1 the_AT Universe_NN1 and_CC Everything_PN1

2.6.3.12 Names Of Hotels And Pubs Etc.

The conventions adopted for tagging names of hotels are similar to those for names of companies. In other words, where the full name is given, NP tags are restricted to bona fide proper nouns, e.g.:

Park_NN1 Lane_NNL1 Hotel_NN1

Truncated names are changed to NP1 in order to avoid an adjective standing alone as the head of a noun phrase, but not if the truncated name is a noun in any case:

The Regency_NN1

These points are illustrated further in the following examples:

The Cumberland_NP1 Hotel_NN1
The Post_NP1 House_NN1
The Post_NP1 House_NN1 Hotel_NN1
The White_JJ House_NN1 Hotel_NN1
The Imperial_JJ Hotel_NN1
Let's have a drink at the Imperial_NP1

2.6.3.13 Festivals And Commemorative Events

As far as possible, these are tagged using ordinary lexical item tags, including, where appropriate, NNT1:

At a Republic_NN1 Day_NNT1 gathering
New_JJ Year_NNT1 's_GE Day_NNT1
Lincoln_NP1 Day_NNT1

NNT1 tags are used for Christmas, Passover, Easter etc:

Next Christmas_NNT1
Easter_NNT1 Sunday_NPD1
Christmas_NNT1 Day_NNT1

2.6.4 Cited Words

Words like 'no' and 'must', which may be used as if they were nouns, are governed by tagging conventions which vary according to the presence or absence of quotation marks, and according to whether or not the word in question has been pluralised. When such a 'cited word' has its normal (singular) form, it is given its normal tag if quotation marks are used:

That sounds like a "yes_UH" to me
It's a "maybe_RR" rather than a "must_VM"

However, when there are no quotation marks present, we use a tag appropriate to the context (usually NN1):

A resounding no_NN1
An absolute must_NN1

Plural cited words are tagged NN2 whether or not quotation marks are used:

No ifs_NN2 or buts_NN2, just do it!
The noes_NN2 have it
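
The three conventions above can be summarised as a small decision function. This is a sketch only: the function name and its parameters are our own, and the caller is assumed to have already established the word's normal tag, whether it is quoted, and whether it is plural.

```python
def cited_word_tag(normal_tag, quoted, plural, context_tag="NN1"):
    """Choose a tag for a cited word such as 'no' or 'must'.

    normal_tag  -- the tag the word has in ordinary use (e.g. UH, VM)
    quoted      -- True if the word appears in quotation marks
    plural      -- True if the word has been pluralised (ifs, noes)
    context_tag -- tag suited to the context when unquoted (usually NN1)
    """
    if plural:
        # Plural cited words are NN2 with or without quotation marks
        return "NN2"
    if quoted:
        return normal_tag
    return context_tag
```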

2.7 Prepositions

2.7.1 Tags Used For Prepositions

Most prepositions are tagged II. More specific tags are used as follows:

IF for

IO of

IW with, without
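
As a minimal sketch, the mapping above can be written as a lookup with II as the default. The names used are our own, and the function assumes that the word has already been identified as a preposition; it does not attempt the II / RP / RL disambiguation discussed below.

```python
# Specific preposition tags listed above; every other preposition is II.
SPECIFIC_PREPOSITION_TAGS = {
    "for": "IF",
    "of": "IO",
    "with": "IW",
    "without": "IW",
}


def preposition_tag(word):
    """Return the preposition tag for a word already known to be a preposition."""
    return SPECIFIC_PREPOSITION_TAGS.get(word.lower(), "II")
```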

2.7.2 Prepositions & RPs

A preposition-type word will receive an adverbial particle tag when used in phrasal verb constructions, or when having the function of an adverbial in the sentence or clause, e.g.:

Japanese companies have insisted on keeping down_RP sales of US cars
He did not rule out_RP use of a surcharge
Rota put the Cannocks up_RP 4-3 in the second period
Seattle reeled off_RP six points for a 17-point lead
Out_RP in the garden, the dog was running around_RP

A list of possible RPs:

'bout                     RG II RP@
about                     II RG% RP@
along                     II RP
around                    II RP RG@
away                      RL RP JJ%
back                      RP NN1 JJ@ VV0%
by                        II RP%
down                      RP II@ NN1% VV0% JJ% NP1:%
in                        II RP@ .NNU%
off                       RP II JJ%
on                        II RP@
on/off                    RP
out                       RP II%
over                      II RP JJ% RG@ NN1%
round                     JJ II RP NN1@ VV0@
through                   II RP@ JJ%
thru                      II RP@ JJ%
to                        TO II RP%
under                     II RP@ RG@
up                        RP II@ VV0%

A similar potential ambiguity exists with words which may be tagged either as prepositions or as 'RL' (locative adverb), e.g.:

I walked across_II the park
He looked across_RL and saw Jim

Words which may be tagged RL or II (IW):

aboard                    RL II
above                     II RL JJ@
across                    II RL@
alongside                 II RL
astride                   RL II
behind                    II RL@ NN1%
below                     II RL RG
beneath                   II RL@ JJ%
beside                    II RL%
between                   II RL%
beyond                    II RL@ NN1%
ere                       CS II RL
inside                    II RL NN1@ JJ@
near                      II RL JJ@ VV0@
nigh                      RL II@
opposite                  JJ II% NN1@ RL@
outside                   II RL JJ NN1@
past                      NN1 II RL JJ
throughout                II RL@
underneath                II RL NN1@
within                    II RL@
without                   IW RL% RR%

There are no RL/RP ambiguities in the lexicon. One or the other is preferred in each case, or else the RR tag is used.
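
Lexicon entries of the kind listed above can be read mechanically, which may help when checking which words are candidates for a given tag. The Python sketch below is an illustration under stated assumptions: the function names are our own, an entry is assumed to be a word followed by whitespace-separated tags, and the trailing markers (@, % and :) are kept exactly as they appear without being interpreted.

```python
import re

# A tag is upper-case letters plus optional digits, followed by optional markers.
TAG_FIELD = re.compile(r"^([A-Z]+[0-9]*)([@%:]*)$")


def parse_entry(line):
    """Split an entry such as 'about  II RG% RP@' into (word, [(tag, marker), ...])."""
    word, *fields = line.split()
    tags = []
    for field in fields:
        match = TAG_FIELD.match(field)
        if match is None:
            raise ValueError(f"unrecognised tag field: {field!r}")
        tags.append((match.group(1), match.group(2)))
    return word, tags


def words_with_tag(lines, tag):
    """List the words whose entries include the given tag, in input order."""
    found = []
    for line in lines:
        word, tags = parse_entry(line)
        if any(name == tag for name, _ in tags):
            found.append(word)
    return found


# A few entries copied from the lists above
SAMPLE_ENTRIES = [
    "about                     II RG% RP@",
    "along                     II RP",
    "across                    II RL@",
    "behind                    II RL@ NN1%",
]
```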

Stranded prepositions are liable to be automatically tagged RP, and should be changed to II. They occur when the preposition becomes detached from its noun phrase as may happen in various types of clausal construction, eg:

Question:

What team do you play in_II ?
(In_II which team do you play?)

Relative Clause:

I know the story you are talking about_II
( ... about_II which you are talking)

Passive:

The car was worked on_II by a fool
(A fool worked on_II the car)

SYNTACTIC AMBIGUITY (see Section 3)

II > RL

II > RP

2.8 Verbs

2.8.1 Tags Used For Lexical Verbs And For 'do', 'be' And 'have'

Apart from modals (see below), tags for all verbs except be, do, and have contain the letters 'VV'. In the case of the 3 verbs mentioned, this changes to 'VB', 'VD' and 'VH' respectively.

The third element of the tag makes distinctions of form/function as follows:

I infinitive base form,

0 (zero) base form:

applaud_VV0; have_VH0; be_VB0; do_VD0

Z 3rd person sing. (present tense) form:

plays_VVZ; does_VDZ; has_VHZ; is_VBZ

(present tense forms of the verb 'to be': am_VBM; are_VBR)

D past tense form:

liked_VVD; took_VVD; had_VHD; did_VDD

(was_VBDZ; were_VBDR)

N past participle form:

liked_VVN; taken_VVN; had_VHN; done_VDN; been_VBN

G present participle form:

saying_VVG; doing_VDG; being_VBG; having_VHG
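
The way these tags are built up (a two-letter prefix for the verb, then a third element for the form) can be sketched as follows. The function is our own illustration, not part of CLAWS. It deliberately refuses the past tense of 'be', because was and were have their own tags (VBDZ, VBDR), and it does not cover am_VBM, are_VBR or the modals.

```python
# Prefixes for the three special verbs; all other lexical verbs take 'VV'.
VERB_PREFIXES = {"be": "VB", "do": "VD", "have": "VH"}

# Third element of the tag, as listed above.
FORM_ELEMENTS = {"I", "0", "Z", "D", "N", "G"}


def verb_tag(lemma, form):
    """Compose a verb tag from a lemma and a form element.

    lemma -- base form of the verb, e.g. 'take', 'have'
    form  -- one of I, 0, Z, D, N, G
    """
    if form not in FORM_ELEMENTS:
        raise ValueError(f"unknown form element: {form!r}")
    if lemma == "be" and form == "D":
        # was_VBDZ and were_VBDR cannot be derived from the lemma alone
        raise ValueError("past tense of 'be' is VBDZ or VBDR")
    return VERB_PREFIXES.get(lemma, "VV") + form
```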

In cases of verb-form ambiguity, function takes precedence. Thus, the word 'put' should be tagged VV0, VVD or VVN according to its grammatical function. Automatic tagging errors are sometimes associated with this source of ambiguity, particularly where no auxiliary co-occurs with a VVN (or VHN), or where the auxiliary is several words distant from its associated past participle (in questions for example).

Contracted forms of 'have', 'had' etc. should carry the same tags as the complete forms. Errors may be associated with certain ambiguous forms:

'd_VHD = 'had'; 'd_VM = 'would' 
's_VHZ = 'has'; 's_VBZ = 'is'; 's_GE = genitive 's'

Contracted negated forms are broken up by CLAWS7 into their constituent parts and tagged separately:

is_VBZ n't_XX
have_VH0 n't_XX
will_VM n't_XX  (= won't)
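
The splitting shown above can be approximated as follows. This is a rough sketch rather than a description of how CLAWS7 itself does the job: the function name is our own, the only irregular form handled is won't (taken from the example above), and the treatment of other irregular contractions is left open.

```python
# Irregular stems: only the case shown in the examples above is included.
IRREGULAR_STEMS = {"wo": "will"}


def split_negative(token):
    """Split a contracted negative into [stem, "n't"]; leave other tokens whole."""
    lower = token.lower()
    if not lower.endswith("n't") or len(lower) <= 3:
        return [token]
    stem = token[:-3]
    # Restore the full form where the stem is irregular (won't = will + n't)
    stem = IRREGULAR_STEMS.get(stem.lower(), stem)
    return [stem, "n't"]
```

The second part would then receive the XX tag, and the first part the tag of the full verb form.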

2.8.2 Modal Auxiliaries

Modal auxiliaries are tagged 'VM'. A list is given below:

'd		VM VHD
'll		VM
'ud		VM
can		VM NN1% VV0%
could		VM
dare		VV0 VM@ NN1%
may		VM NPM1: NP1:%
mayst		VM
might		VM NN1%
must		VM NN1%
need		NN1 VV0 VM@
shall		VM
should		VM
will		VM NN1@ VV0% NP1@
wilt		VV0 VM NN1%
would		VM

2.8.3 Catenatives And Modal Catenatives

The following verb-forms receive 'K' (catenative) tags when used as in the examples:

he is bound_VVNK to arrive soon
Do you think it is going_VVGK to rain?
We used_VMK to think it was impossible
We ought_VMK to leave

(ought is always VMK)

SYNTACTIC AMBIGUITY

1. VVG > NN1 (see Section 3)

2. VVG > JJ (see Section 3)

3. VVN > JJ (see Section 3)

4. VVG > VVGK (going)

5. VMK > JJ > VVN > VVD (used)

6. VV0 > NN1

7. VVZ > NN2

8. VVD > VVN

9. VHD > VHN (had)

2.9 Foreign Expressions

2.9.1 Naturalised And Commonly-Used Expressions

Single-word expressions are given tags appropriate to the word's use in English:

Ostpolitik_NN1
literati_NN2

We do not attempt to assign a tag according to the class of a word in its language of origin, but rather according to its syntactic use in the English sentence. Thus:

Vermicelli_NN1

is tagged as singular in spite of the Italian plural ending. Multi-word expressions are generally treated as units, and given ditto tags (see Section 4: Ditto-tags) appropriate to the expression as a whole:

Pate_NN131 de_NN132 cheval_NN133
fin_JJ31 de_JJ32 siecle_JJ33
in_RR21 extremis_RR22
personae_NN231 non_NN232 gratae_NN233
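
Judging from the examples above, a ditto tag appears to consist of the tag for the expression as a whole, followed by one digit for the number of words in the expression and one digit for the position of the word within it. The Python sketch below is based on that reading alone; the function names are our own, and expressions of more than nine words are not handled.

```python
def ditto_tags(base_tag, length):
    """Generate the ditto tags for an expression of the given length.

    Assumes the pattern seen in the examples: base tag + length + position.
    """
    if not 2 <= length <= 9:
        raise ValueError("length must be between 2 and 9")
    return [f"{base_tag}{length}{position}" for position in range(1, length + 1)]


def split_ditto(tag):
    """Split a ditto tag into (base tag, length, position)."""
    if len(tag) < 3 or not tag[-2:].isdigit():
        raise ValueError(f"not a ditto tag: {tag!r}")
    base, length, position = tag[:-2], int(tag[-2]), int(tag[-1])
    if not 1 <= position <= length:
        raise ValueError(f"position out of range in {tag!r}")
    return base, length, position
```

A check of this kind can be useful in post-editing, since a position digit greater than the length digit, or a stray extra digit, indicates a mistyped ditto tag.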

2.9.2 Non-Naturalised Expressions

For foreign expressions which are not naturalised in any appreciable way (as in quotations or book titles, for example), the tag FW is used for each word. Since a post-editor will apply an FW tag to any foreign word whose meaning he or she is unable to fathom, there is obviously a fuzzy boundary to the FW word class. Company names are usually tagged NP1 (see also 2.9.4 below) e.g.:

Volkswagen_NP1, Alfa_NP1 Romeo_NP1

2.9.3 Foreign words in whole sentences or clauses

In cases such as those in the examples below, FW is used:

J'y_FW suis_FW, j'y_FW reste_FW
festina_FW lente_FW
che_FW sara_FW sara_FW
c'est_FW la_FW vie_FW

It would be difficult or misleading to tag them with tags from an English tagset - for example sara above is a third person singular future indicative verb form, yet obviously there is no need for such a tag in English. Similarly festina above is a singular imperative in Latin and there is no such tag for English.

No changes are to be made manually to contracted combinations of words such as j'y or n'est which are to be tagged by CLAWS7 as single words.

2.9.4 Names

Words such as de, van and von are tagged NP1 when part of a name, e.g.:

Ludwig_NP1 van_NP1 Beethoven_NP1, Ferdinand_NP1 de_NP1 Saussure_NP1

Company and country names are usually tagged NP1 as well even if all of the words are foreign and non-naturalised:

Credit_NP1 Lyonnais_NP1
Banque_NP1 Nationale_NP1 de_NP1 Paris_NP1
les_NP1 Etats_NP1 Unis_NP1

2.9.5 "Borrowed" Prepositions

Expressions such as per or a la are given normal preposition tags, unless they are subsumed under a longer expression which should be tagged as a unit:

per_II pound_NNU1
a_II21 la_II22 Lancaster_NP1

but:

per_JJ21 RR21 diem_JJ22 RR22
per_NNU21 cent_NNU22
a_JJ31 RR31 la_JJ32 RR32 carte_JJ33 RR33

2.10 Interjections

Interjections are tagged UH. Some words are always considered as interjections, for example: aha, blimey, crikey, ha, huh, oh, sh, um, yes. Other words are sometimes tagged as interjections, and sometimes not, for example:

adieu UH NN1@
boo UH VV0@ NN1@
bye UH NN1%
clonk NN1 VV0 UH
hallelujah UH NN1@
no UH AT RR%

The only words that should be tagged as UH are those indicating exclamation, or some other kind of interactive signal (which is not integrated with the syntax of the sentence), e.g. yes, no, whoa.

The following are all tagged as FU (unclassified word): other types of exclamation (oops), onomatopoeic words that are not exclamatory (whoosh, ding), transcriptions of non-linguistic utterances ('de do da da da da'), hesitations and stutters (er, erm), truncated words, etc. Also exclamatory words or expressions which retain the spelling of a word in another class, from which they derive, are not tagged UH - e.g.:

God_NP1 Almighty_JJ
Sure_JJ
Bless_VV0 you_PPY

Post-editors should look at the full list of lexicon entries for FU and UH to help clarify this area.

2.11 Numbers

There are several tags for different types of strings representing numbers and strings containing numbers. These are:

MC1

the cardinal number 1: one 1 i I

MC

other cardinal numbers: two three ninety-four 2 745 ii XVI

MC2

morphological plurals of cardinal numbers: ones twos hundreds ten's

MCMC

hyphenated numbers: 40-50 1917-89

MD

ordinal numbers: first 3rd 75th last next

MF

fractions: two-thirds quarter 1/2 5⅞

NNO

(letter O, not 0) numeral nouns: hundred thousand million dozen gross

NNO2

morphologically plural number nouns: hundreds thousands dozens

NNU

strings with numbers and units of measurement: $100 £5 6in
This tag is also used for the units of measurement themselves where they are separate words. However, foot is tagged NN1 and feet NN2 even when they are units of measurement, in order to minimise ambiguity.

FO

other strings that are a mixture of numeric and alphabetic characters: A41 M6 G7 4GL 3M
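
For strings written in digits, the distinctions above can be roughed out with a few patterns. The Python sketch below is an illustration only. It ignores numbers written as words or as Roman numerals, and the small UNITS set is a hypothetical stand-in for a proper list of units of measurement.

```python
import re

# Hypothetical, deliberately tiny list of unit abbreviations.
UNITS = {"in", "ft", "cm", "kg", "lb"}


def number_tag(token):
    """Suggest a tag for a digit-based string, or None if none applies."""
    if token == "1":
        return "MC1"
    if re.fullmatch(r"\d+", token):
        return "MC"
    if re.fullmatch(r"\d+-\d+", token):
        return "MCMC"
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "MD"
    if re.fullmatch(r"\d+/\d+", token):
        return "MF"
    if re.fullmatch(r"[$£]\d+", token):
        return "NNU"
    unit = re.fullmatch(r"(\d+)([a-z]+)", token)
    if unit and unit.group(2) in UNITS:
        return "NNU"
    if re.search(r"\d", token) and re.search(r"[A-Za-z]", token):
        # Any other mixture of numeric and alphabetic characters
        return "FO"
    return None
```

The order of the checks matters: the ordinal and unit patterns are tried before the catch-all FO pattern, so that 3rd and 6in are not swallowed by it.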

SECTION 3 DISAMBIGUATION GUIDE (BY TAG-PAIR)

CSA / II / RG see Section 4: as

DAR / RRR

more and less can be assigned either of these tags.

The difference between them is that DAR is for noun-phrase-like (and determiner) uses of the word in question, whereas RRR is for adverbial uses. The two can be difficult to distinguish, particularly after a verb, eg:

You should relax more_RRR
You should spend more_DAR

Since relax is an intransitive verb in this context, more cannot be a noun phrase. Instead, one can paraphrase it roughly as "to a greater extent". On the other hand, spend is a transitive verb, and so more is a DAR in this context. (We can notice that more after spend is the direct object of the verb, because it can be made the subject of a passive: "More should be spent..."). There are some verbs for which the distinction is less clear than in these examples, eg:

You should eat more
You should smoke less

Note that the verb may be used transitively or intransitively with almost identical meanings, so that the syntactic structures of the immediate and/or surrounding context are the only clue as to which is the case:

"Do you smoke?"   (Intransitive)
"How many do you smoke a day?"  (Transitive)

Contrast:

At the moment we have 23 fixtures per season.
Personally, I would rather play more_DAR

with:

If you're going to make the big time, I can see you'll have to play more_RRR, and not just wait for the ball to come to you.

(see also RG / RR for degree and general adverb tagging of 'more' and 'less').

II / RL & II / RP

Compare:

(a) He ran down_II the hill

and

(b) He ran down_RP his friends

In (a), down is a preposition because:

You could insert an adverb before it:

He ran quickly down the hill

But not:

*He ran viciously down his friends

You can move it to the front of a relative clause or question:

This is the hill down which he ran
Down which hills do you like running?

In (b), down is an adverbial particle because:

You can place it before or after the noun phrase:

He ran his friends down_RP

But not:

*He ran the hill down

If you replace the noun phrase with a pronoun, you HAVE TO place the pronoun in front of the particle:

He ran them down

But not:

*He ran down them

Similarly:

She put the cat out_RP > She put IT out_RP

whilst

She went through_II the gap > She went through_II IT

Notice that the syntactic distinction between down_RP and down_II is independent of the semantic distinction between locative and non-locative uses of down. When the verb is simply followed by down or out etc., without a following noun phrase, it is normally an RP:

Income tax is coming down_RP
The decorations were taken down_RP on 12th night

However, tagging errors may occur with stranded prepositions which are denuded of their noun phrase because it has been fronted or ellipted (eg. in relative clauses, passives, questions etc.):

This is the hill (which) she ran down_II
(ie. This is the hill down_II which she ran)
On Shrove Tuesday, this hill will be run down_II by housewives
(ie. Housewives will run down_II it)
Which car did you arrive in_II?
(ie. In_II which car did you arrive?)

The same tests apply to words which are tagged either as prepositions or as locative adverbs (RL), eg. across, past, behind etc. (See Section 2.7.2 for lists).

JJ / NN1

Words ending in -ing, when they premodify a noun, may be tagged either NN1 or JJ, eg:

New_JJ spending_NN1 reductions_NN2
her_APPGE acting_NN1 ability_NN1
a_AT1 working_JJ mother_NN1

(but see also JJ / VVG).

If "X-ing NOUN" is equivalent in meaning to "NOUN which X-es" - (ie. if the NOUN is the notional subject of the verb X) - then "X-ing" is a JJ.

For example:

The smiling_JJ children 
(i.e. The children are smiling)

In other cases, X-ing is an NN1. In such cases, it is often possible to paraphrase X-ing NOUN by a more explicit phrase in which X-ing is clearly a noun. eg:

new spending_NN1 reductions 
(new reductions in spending)
her acting_NN1 ability
(her ability in acting)

Further examples:

A boxing_NN1 match
A falling_JJ rate of exchange
Slimming_NN1 tablets
The mating_NN1 season
a couple of mating_JJ chimpanzees

JJ / RR & JJR / RRR

After a verb or an object, there is sometimes a tricky choice between JJ and RR, or between JJR and RRR. eg:

They arrived tired and hungry

Here, both "tired" and "hungry" are JJ. The main test is to see whether you can express the relation between these words and their logical subject, using the verb "be": "They arrived tired and hungry" implies "They were tired and hungry". The JJ/RR word refers to a property of a noun, rather than to a property of an event or a situation. Contrast:

Peter sang out loud_RR and clear_RR

This sentence does not imply that Peter was loud and clear, but is more or less equivalent to "Peter sang out loudly and clearly". It means that his SINGING was loud and clear. It follows that when, in colloquial English, a word which we normally expect to be an adjective is used as an adverb, we tag it RR. eg:

We did terrific_RR today

A simple pair of examples where the JJ/RR word follows an object:

I thought the game too long_JJ 
(the game was too long)
They work their staff very hard_RR
(NOT "the staff are very hard")

Also JJR / RRR:

They'll have to make the taxes higher_JJR
(The taxes will be higher)

But:

You'll have to aim higher_RRR

Note: well is an adjective when it is the opposite of ill:

Mary is/feels well_JJ

Otherwise it is an adverb:

"He writes well_RR".

JJ / VVG & JJ / VVN

The tagging of words like "surprised" in "John was surprised", or "lasting" in "the effect was lasting" can be a problem. In both cases, the word can be a JJ. One test is to see whether you can insert an adverb like "very" in front of the word. eg. in "John was very surprised", "surprised" is a JJ.

Another test, having the opposite effect, is to see whether there is an agent "by"-phrase following an "ed/en" word. If so, it is a VVN. eg. in "John was surprised by the pirates", "surprised" is a VVN. Even where it is not present, the possibility of adding a "by"-phrase, without changing the meaning of the word, is evidence in favour of a VVN. (However, this criterion can clash with the preceding one - since it occasionally happens that an "ed"-word is preceded by an adverb like "very" AND followed by a "by"-phrase: eg. "John was very offended by her remarks". Fortunately, such cases are rare. When they do occur, however, give preference to JJ).

A third test is negative: to see whether the word in question can be placed before a noun. eg:

The effect is lasting:   a lasting effect

This shows that "lasting" can be (but need not be) a JJ. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is not a JJ, but a VVG or a VVN.

Even though an "-ing" word is normally a VVG after the verb "be" it is generally treated as a JJ before a noun:

The man was dying_VVG

But:

The dying_JJ man

When the -ing or -en/ed word forms part of a phrase premodifying the noun, as in the following examples, the VVG/VVN tag is preferred:

interest_NN1 earning_VVG account_NN1
a hypothesis_NN1 driven_VVN approach_NN1

In these examples, the NN1 VVG sequence is similar in function to a compound pre-modifying adjective. In hyphenated form they would be given a JJ tag. The same applies when the phrase is a noun-like compound. eg:

a [ carol_NN1 singing_VVG ] contest_NN1

If the verb be can be replaced by another verb such as seem or become, without changing the meaning of the following JJ/VVN word, this is a strong indication that the construction is not properly a passive, and that the word is a JJ. eg:

The building was infested_JJ with cockroaches

(The building became/seemed infested...)

I could see he was favourably disposed_JJ to the idea

(He seemed favourably disposed...)

A further distinction which can be used as a test with 'event' verbs is that the JJ refers to a 'resultant state', whereas the VVN refers to an event. eg:

Bill was married_JJ (as opposed to single)
Bill was married_VVN to Sarah on May 14th (the actual event)

Some further examples:

Three people were injured_VVN in the accident
I could see he was (seemed) injured_JJ
He lay injured_JJ on the road
We have three injured_JJ players in the side
Our players are not worried_JJ
She is not worried_VVN by that sort of threat

JJR / RRR see JJ / RR

NN1 / JJ see JJ / NN1

NP2 / NN2

Note that NP2 is not used for names of teams, even those which are apparently not common nouns. NP2 is used for proper nouns which happen to be plural, eg.

The Rockies, The Hebrides

for plural product names, eg.

Lancias_NP2 are pretty fast

and for naming families, eg.

The Staffords_NP2 are always quarrelling.

RG / RR

RG is restricted to adverbs of degree (also called intensifiers, etc.) which precede the word or expression they modify. Clear cases of RG are very, and so and as in comparatives (see section on as below).

Adverbs which have a range of functions, including adverb of degree, are not normally tagged RG, but are given the more general RR tag instead.

She_PPHS1 was_VBDZ scantily_RR clad_JJ

Here 'scantily' is an RR rather than an RG because it could also occur after a verb:

She_PPHS1 dressed_VVD scantily_RR

This is another case of the general principle of avoiding general-specific ambiguities within a word class. RG is usually only for words which do not have a more general range of adverbial uses.

There are exceptions to this, however. (See Section 2: Adverbs. See also Section 4: so). The words which may be tagged RG or RR are:

so

too

quite

rather

Examples:

She is so_RG attractive
I would think so_RR
This is too_RG heavy
Can I come too_RR?
That's rather_RG nice
I would rather_RR go out
He's quite_RG talkative
Quite_RR, I agree

Note that about may be an RP or an RG. However, this does not violate the principle mentioned above, since both RP and RG are sub-categories of RR:

He's about_RG 12, I think
Stop messing about_RP

RL / II see II / RP

RP / II see II / RP

RR / JJ see JJ / RR

RR / RG see RG / RR

RRQ / CS see Section 4: when

RRR / JJR see JJ / RR

VVG / JJ see JJ / VVG & JJ / VVN

VVN / JJ see JJ / VVN

SECTION 4 DISAMBIGUATION GUIDE (BY WORD)

ANY

any is tagged DD when it functions pronominally or as a determiner:

Do it any_DD way you like
I'm afraid I haven't got any_DD

and RR when it modifies an adverb or an adjective:

They are not called that any_RR longer_RRR
I cannot run any_RR faster_RRR
It was not really any_RR better_JJR than before

Note that the word following may also be ambiguous between adverb and determiner. In such cases, it is possible that both may be erroneously tagged, and require correction thus:

You won't feel any_DD more_DAR pain
If you have any_DD more_DAR , you'll burst
He doesn't play chess any_RR more_RRR

AS

as can be tagged RG, II or CSA.

It is an RG when it occurs before an adjective, adverb or determiner (and sometimes other words) in phrases such as:

I don't think that one is as_RG good 
I go there as_RG often (as...)
There are not as_RG many (as...)

In the 2nd and 3rd examples above, the second as is always a CSA because it introduces a comparative construction (an equal comparison, as contrasted with an unequal comparison introduced by than). Thus, in the following, the second as is tagged CSA:

She's not as_RG (or so_RG) pretty as_CSA I thought
An ostrich can run as_RG quickly as_CSA a zebra
He has as_RG many as_CSA six children

Notice that as in this comparative use is tagged CSA whether or not it introduces a clause, as normally understood. In the second case above, as precedes a noun phrase. In the following, it precedes an adjective:

Please come as_RG quickly as_CSA possible

CSA is also the tag used when as introduces other clauses (eg. clauses of time or clauses of reason). eg:

As_CSA I arrived, he was leaving
I'll lend you the money, as_CSA you're my friend

II is the tag for as as an undoubted preposition - it usually has an equative meaning, as in:

They regard him as_II a friend
As_II governor of the province, I have to take action

The guideline restricts II to cases of as followed by a noun-phrase-type structure - which may be a pronoun. If as is followed by an adjective, a past participle etc., it is tagged CSA, even though it has the same equative type of meaning as as_II. eg:

The novel as_CSA originally written
Many people regard his paintings as_CSA hideous

BUT

But is most commonly a CCB, but there are less common cases when it can be an RR, an II or a CS.

It is an RR in phrases such as:

You can but_RR try
We could not but_RR offer our help

It is an II when it has a meaning like except or apart from, eg:

All but_II one of us
We've asked everyone but_II the doctor
I've tried everything but_II taking tablets
Everything but_II the girl

It is a CS when it introduces a clause such as:

There's no doubt but_CS he's the guilty party (rare)
There was nothing for it but_CS to give her the job
She would do nothing but_CS fly combat missions

Otherwise it is a CCB (co-ordinating conjunction):

I like this but_CCB I don't like that
(co-ordinated sentences)
I like this one but_CCB not that one
(co-ordinated noun phrases)

EACH

When each could be replaced by apiece, it is tagged RA. Otherwise it is tagged as DD1:

Five pounds each_RA is a bit steep
They scored a goal each_RA

But:

They each_DD1 scored a goal
We go fishing each_DD1 Sunday in the Summer
Each_DD1 one a peach
I'll give you a fiver for each_DD1

HIS

His is tagged APPGE when it is a pre-nominal possessive pronoun, ie. when it is part of the set: my; your; her etc.:

It was his_APPGE fault

It is tagged PPGE when it is a nominal possessive pronoun, ie. when it is part of the set: mine; yours; hers etc.:

John's not here, so use his_PPGE

HOW

how may be tagged as an RGQ or as an RRQ. As an RGQ it always premodifies another word, for example an adjective or an expression of quantity:

How_RGQ much_DA1 opposition is there?
I do not know how_RGQ willing_JJ he is

how as an RRQ, has a general adverbial meaning, and can often be paraphrased by an expression such as by what means or in which manner:

How_RRQ will you manage?
I wonder how_RRQ it will look

also:

How_RRQ are you?
How_RRQ does it feel?

How_RGQ implies a question which could be answered with the phrase in question, but with how replaced by a degree adverb (RG). eg:

I'm not sure how_RGQ likely it is
(How likely is it?  It is very_RG likely)

Note that the same principles apply to the word however (RGQV; RRQV), and the expression no matter how:

No_RGQV31 matter_RGQV32 how_RGQV33 difficult_JJ the situation, Red Adair always succeeds 
Be careful, however_RRQV you decide to do it!

(However may of course also be a general adverb (RR) ):

There were, however_RR, too few people in the audience

MUCH

Much is tagged DA1 when it functions pronominally or as a determiner:

There is not much_DA1 point in resisting
She didn't say very much_DA1

but it is tagged RR when it functions adverbially or pre-modifies an adjectival or adverbial head:

I don't like that very much_RR
This one is much_RR better_JJR

As with any (see above), co-occurrence with other ambiguous determiner/adverbs should be checked in case of a double error:

President Carter plays golf much_RR less_RRR these days 
He has much_DA1 less_DAR enthusiasm for the game

NO

When it means the opposite of yes, no is tagged UH. This is true even when the use is nominal, providing the quotation marks are present:

A resounding "no_UH"

If they are absent, the tag should be changed to NN1.

I'll take that as a no_NN1, then.

(See also Section 2.6.4: Cited Words)

Otherwise, no is tagged AT, e.g.:

There is no_AT question of that happening

ONE

MC1 where one precedes a noun or noun phrase, as in:

one_MC1 book
one_MC1 bag of spuds

and where it is the head of a noun phrase with a dependent prepositional phrase:

one_MC1 of the books

and when referring to 'one' as a number entity:

this is the number one_MC1
one_MC1 is an integer
type a one_MC1 at the prompt

PN1 where it is a personal pronoun such as:

one_PN1 ought to be careful
one_PN1 doesn't like to make a fuss

and when functioning as a substitute form:

the prettiest one_PN1 is called Flo
the one_PN1 you are holding is a bomb
his idea is not one_PN1 that holds much water

SO

The CS tag is used when so is equivalent to the expression so that. It has a purposive function:

We hid it so_CS no one would notice
He only said it so_CS he could impress us

It is an RR when it occurs, usually after a punctuation mark or at the beginning of a sentence, with a meaning approximating to therefore:

It is raining, so_RR I am staying at home
So_RR we gave up the struggle, you see
He swore at me, so_RR I hit him

It is likewise an RR if preceded by a conjunction in examples like those directly above:

He swore at me, and_CC so_RR I hit him

In expressions where so is used as a substitute form, and in cases where its use is clearly adverbial (= like that), it is tagged RR:

substitute:

so_RR I believe
I might feel that, but I would never say so_RR
So_RR did John
I'm afraid so_RR

adverbial:

Don't take on so_RR!

It is tagged RG when used in positions where very could occur:

She is so_RG friendly
I have never been so_RG angry
Thank you so_RG much

and when it corresponds to the first as in 'as...as...' comparisons:

They're not doing so_RG well_RR as_CSA before

TIMES

Times is now always tagged NNT2 except in two cases. When it occurs as the written form of the multiplication sign 'x' and both sides of the mathematical operation are stated (even though they may not be expressed as numbers), it is tagged II:

Three times_II two is six
The number of rows times_II the number of columns

When it occurs as part of a newspaper title, it is tagged NP1.

In all the following diverse cases, NNT2 is used:

Recite your twelve times_NNT2 table
It's ten times_NNT2 better than before
(because of the following comparative adjective)
London is 10 times_NNT2 the size of Lancaster
(grammatically, 10 times could be replaced by twice)
How many times_NNT2 must that have happened?
Those were good times_NNT2
They clocked up some very fast times_NNT2
Knock three times_NNT2

Of course, times may also occur as a VVZ (She times_VVZ his response).

WHEN

When may be tagged RRQ or CS. When can introduce three types of clause:

adverbial clause [Fa],

noun clause [Fn],

relative clause [Fr].

When it introduces an adverbial clause or a non-restrictive relative clause, it is a CS. When it introduces either a noun clause or a restrictive relative clause, it is an RRQ. Examples:

adverbial:

 
When_CS I arrived, John left 
John left when_CS I arrived	(at the time at which)
I smoke when_CS I'm tense 	(whenever)

noun clause:

I cannot remember when_RRQ I was christened
I don't know when_RRQ the next bus is due 
			(the date/point in time at which)

relative:

In the year when_RRQ I was born   (in which)
The moment when_RRQ he arrived    (at which)

Note that when can often be omitted in a relative clause.

There are also non-restrictive relative clauses introduced by when, which are now to be tagged as CS. Previously they were tagged RRQ. It is no longer necessary to distinguish these from adverbial clauses introduced by when. Here are some examples of non-restrictive relative clauses:

In 1968, when_CS the students were revolting in Paris...

Here, when could best be paraphrased as at the time when.

Another example:

 
School finished at 4 o'clock precisely, when_CS a loud bell sounded

Non-restrictive relative clauses do not define or restrict the meaning of the antecedent. If the antecedent is a precise temporal expression (such as "4 o'clock", "1990", "yesterday"), when is usually a non-restrictive relative.

These are different from restrictive relatives, such as:

 
In the year when_RRQ I was born

Here the year is defined by the relative clause. Typically restrictive relatives are not preceded by a comma, and the when can normally be omitted.

Another use of when_RRQ is in direct questions:

When_RRQ did you find out?

In abbreviated adverbial clauses, where when is followed by an adjective, a preposition phrase, a non-finite clause etc., when is a CS:

when_CS ready
when_CS in doubt
when_CS arriving late

but before an infinitive, when is an RRQ:

I don't know when_RRQ to apply

Note that the infinitive clause may be implied:

Tell me when_RRQ (to start)

and that a noun clause may be abbreviated simply to the word when:

It was Guy Fawkes, but I can't remember when_RRQ
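
The rule that when before an infinitive is an RRQ lends itself to a simple mechanical check. The Python sketch below is only an illustration and is not part of CLAWS: the function names are hypothetical, and it assumes whitespace-separated word_TAG input in the notation used in this guide. It flags any when tagged CS that is directly followed by a word tagged TO, so that a post-editor can review it. Implied infinitives and a bare when cannot be detected this way and still need human judgement.

```python
from typing import List, Tuple


def split_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    Assumes the tag follows the LAST underscore in each token.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def suspicious_when(text: str) -> List[int]:
    """Return token indices where 'when' is tagged CS but the next tag is TO."""
    pairs = split_tagged(text)
    hits = []
    for i in range(len(pairs) - 1):
        word, tag = pairs[i]
        next_tag = pairs[i + 1][1]
        if word.lower() == "when" and tag == "CS" and next_tag == "TO":
            hits.append(i)
    return hits


# A mistagged line: 'when' should be RRQ before the infinitive marker.
print(suspicious_when("I_PPIS1 do_VD0 n't_XX know_VV0 when_CS to_TO apply_VV0"))
```

A flagged index is a candidate for review only; the sketch never changes a tag itself.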

WHERE

The tagging of where is consistent with when.

WORTH

Two tags are allowed: II and NN1. II is used for expressions which could be an answer to the question: how much is it worth? or what is it worth?:

My records are worth_II a small fortune
He is worth_II about two million
It's not worth_II gambling on

It also occurs as a stranded preposition (see Sections 2 and 3) in the questions used to elicit such responses, and in other common constructions:

What do you think they are worth_II ?
He knew exactly how much they were worth_II
She gave it everything she was worth_II

NN1 is used when worth is obviously nominal, and also in expressions where worth is preceded by a quantity, whether or not the quantity in question has been written as a genitive:

You don't know your own worth_NN1
I'd like a pound's worth_NN1
They purchased a million dollars worth_NN1 of equipment

SECTION 5 CLAWS7 TAGLIST

! punctuation tag - exclamation mark

" punctuation tag - quotation marks

( punctuation tag - left bracket

) punctuation tag - right bracket

, punctuation tag - comma

- punctuation tag - dash

----- new sentence marker

. punctuation tag - full-stop

... punctuation tag - ellipsis

: punctuation tag - colon

; punctuation tag - semi-colon

? punctuation tag - question-mark

APPGE possessive pronoun, prenominal (my, your, our etc.)

AT article (the, no)

AT1 singular article (a, an, every)

BCS before-conjunction (in order (that), even (if etc.))

BTO before-infinitive marker (in order, so as (to))

CC coordinating conjunction (and, or)

CCB coordinating conjunction (but)

CS subordinating conjunction (if, because, unless)

CSA as (as a conjunction)

CSN than (as a conjunction)

CST that (as a conjunction)

CSW whether (as a conjunction)

DA after-determiner, capable of pronominal function (such, former, same)

DA1 singular after-determiner (little, much)

DA2 plural after-determiner (few, several, many)

DAR comparative after-determiner (more, less)

DAT superlative after-determiner (most, least)

DB before-determiner, capable of pronominal function (all, half)

DB2 plural before-determiner, capable of pronominal function (both)

DD determiner, capable of pronominal function (any, some)

DD1 singular determiner (this, that, another)

DD2 plural determiner (these, those)

DDQ wh-determiner (which, what)

DDQGE wh-determiner, genitive (whose)

DDQV wh-ever determiner (whichever, whatever)

EX existential there

FO formula

FU unclassified

FW foreign word

GE germanic genitive marker - (' or 's)

IF for (as a preposition)

II preposition

IO of (as a preposition)

IW with, without (as prepositions)

JJ general adjective

JJR general comparative adjective (older, better, bigger)

JJT general superlative adjective (oldest, best, biggest)

JK adjective catenative (able in be able to; willing in be willing to)

MC cardinal number neutral for number (two, three...)

MCGE genitive cardinal number, neutral for number (two's, 100's)

MCMC hyphenated number (40-50, 1770-1827)

MC1 singular cardinal number (one)

MC2 plural cardinal number (tens, twenties)

MD ordinal number (first, 2nd, next, last)

MF fraction (quarters, two-thirds)

ND1 singular noun of direction (north, southeast)

NN common noun, neutral for number (sheep, cod)

NNA following noun of title (M.A.)

NNB preceding noun of title (Mr, Prof)

NN1 singular common noun (book, girl)

NN2 plural common noun (books, girls)

NNL1 singular locative noun (street, Bay)

NNL2 plural locative noun (islands, roads)

NNO numeral noun, neutral for number (dozen, thousand)

NNO2 plural numeral noun (hundreds, thousands)

NNT temporal noun, neutral for number (no known examples)

NNT1 singular temporal noun (day, week, year)

NNT2 plural temporal noun (days, weeks, years)

NNU unit of measurement, neutral for number (in., cc.)

NNU1 singular unit of measurement (inch, centimetre)

NNU2 plural unit of measurement (inches, centimetres)

NP proper noun, neutral for number (Philippines, Mercedes)

NP1 singular proper noun (London, Jane, Frederick)

NP2 plural proper noun (Browns, Reagans, Koreas)

NPD1 singular weekday noun (Sunday)

NPD2 plural weekday noun (Sundays)

NPM1 singular month noun (October)

NPM2 plural month noun (Octobers)

PN indefinite pronoun, neutral for number (none)

PN1 singular indefinite pronoun (one, everything, nobody)

PNQO whom

PNQS who

PNQV whoever, whomever, whomsoever, whosoever

PNX1 reflexive indefinite pronoun (oneself)

PP nominal possessive personal pronoun (mine, yours)

PPH1 it

PPHO1 him, her

PPHO2 them

PPHS1 he, she

PPHS2 they

PPIO1 me

PPIO2 us

PPIS1 I

PPIS2 we

PPX1 singular reflexive personal pronoun (yourself, itself)

PPX2 plural reflexive personal pronoun (yourselves, ourselves)

PPY you

RA adverb, after nominal head (else, galore)

REX adverb introducing appositional constructions (namely, viz, eg.)

RG degree adverb (very, so, too)

RGA post-nominal/adverbial/adjectival degree adverb (indeed, enough)

RGQ wh- degree adverb (how)

RGQV wh-ever degree adverb (however)

RGR comparative degree adverb (more, less)

RGT superlative degree adverb (most, least)

RL locative adverb (alongside, forward)

RP prep. adverb; particle (in, up, about)

RPK prep. adv., catenative (about in be about to)

RR general adverb (actually)

RRQ wh- general adverb (where, when, why, how)

RRQV wh-ever general adverb (wherever, whenever)

RRR comparative general adverb (better, longer)

RRT superlative general adverb (best, longest)

RT nominal adverb of time (now, tomorrow)

TO infinitive marker (to)

UH interjection (oh, yes, um)

VB0 be

VBDR were

VBDZ was

VBG being

VBM am

VBN been

VBR are

VBZ is

VD0 do

VDD did

VDG doing

VDN done

VDZ does

VH0 have

VHD had (past tense)

VHG having

VHN had (past participle)

VHZ has

VM modal auxiliary (can, will, would etc.)

VMK modal catenative (ought, used)

VV0 base form of lexical verb (give, work etc.)

VVD past tense form of lexical verb (gave, worked etc.)

VVG -ing form of lexical verb (giving, working etc.)

VVN past participle form of lexical verb (given, worked etc.)

VVZ -s form of lexical verb (gives, works etc.)

VVGK -ing form in a catenative verb (going in be going to)

VVNK past part. in a catenative verb (bound in be bound to)

XX not, n't

ZZ1 singular letter of the alphabet (A, a, B, etc.)

ZZ2 plural letter of the alphabet (As, b's, etc.)

NOTE: DITTO TAGS

Any of the tags listed above may in theory be modified by the addition of a pair of digits, e.g. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. For example, the expression in terms of is treated as a single preposition, receiving the tags:

in_II31 terms_II32 of_II33

The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence. Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto-tags may or may not be required for a particular word sequence:

at_RR21 length_RR22
a_DD21/RR21 lot_DD22/RR22
in_CS21/II that_CS22/DD1
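
The two-digit convention described above can be illustrated with a short Python sketch. This is not IDIOMTAG or any part of CLAWS: the names DittoTag, parse_ditto, split_token and ditto_units are hypothetical. The sketch assumes the word_TAG notation used in this guide, that no ordinary C7 tag ends in two digits, and that a sequence has at most nine words. It splits a ditto tag into base tag, sequence length and position, and groups complete sequences into single units.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class DittoTag(NamedTuple):
    base: str      # the ordinary tag, e.g. "II"
    length: int    # number of words in the multi-word unit
    position: int  # 1-based position of this word in the unit


# Assumption: a ditto tag is an ordinary tag plus exactly two trailing digits.
_DITTO_RE = re.compile(r"^(.+?)(\d)(\d)$")


def parse_ditto(tag: str) -> Optional[DittoTag]:
    """Return the parts of a ditto tag, or None for an ordinary tag."""
    match = _DITTO_RE.match(tag)
    if match is None:
        return None
    base, length, position = match.group(1), int(match.group(2)), int(match.group(3))
    if length < 2 or not 1 <= position <= length:
        return None
    return DittoTag(base, length, position)


def split_token(token: str) -> Tuple[str, str]:
    """Split 'word_TAG' at the last underscore."""
    word, _, tag = token.rpartition("_")
    return word, tag


def ditto_units(tokens: List[str]) -> List[Tuple[str, List[str]]]:
    """Collect (base_tag, words) for every complete ditto sequence."""
    units = []
    i = 0
    while i < len(tokens):
        first = parse_ditto(split_token(tokens[i])[1])
        if first is not None and first.position == 1:
            window = [split_token(t) for t in tokens[i:i + first.length]]
            expected = [DittoTag(first.base, first.length, p)
                        for p in range(1, first.length + 1)]
            if [parse_ditto(tag) for _, tag in window] == expected:
                units.append((first.base, [word for word, _ in window]))
                i += first.length
                continue
        i += 1
    return units


print(ditto_units("in_II31 terms_II32 of_II33 the_AT book_NN1".split()))
# [('II', ['in', 'terms', 'of'])]
```

An incomplete or out-of-order sequence is simply not reported as a unit, which makes the same function usable as a consistency check during post-editing.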
