A POST-EDITOR'S GUIDE TO CLAWS7 TAGGING

May 1996

Written by the UCREL team.
UCREL
University of Lancaster
Bailrigg
Lancaster
England LA1 4YT

CONTENTS

  1. GENERAL INTRODUCTION TO WORD-CLASS TAGGING
  2. INTRODUCTION BY WORD-CLASS TO THE CLAWS7 TAGGING SCHEME
  3. DISAMBIGUATION GUIDE (BY TAG-PAIR)
  4. DISAMBIGUATION GUIDE (BY WORD)
  5. CLAWS7 TAGLIST

SECTION 1

GENERAL INTRODUCTION TO WORD-CLASS TAGGING


1.1 A Basic Introduction to the CLAWS tagging scheme


CLAWS (Constituent Likelihood Automatic Word-tagging System) is a suite of computer programs for automatically assigning an appropriate grammatical tag to each word in a body of continuous text.

One or more potential word-tags from the CLAWS version 7 (C7) tagset are assigned to each word using:

  • (i) probability data
  • (ii) the lexicon or wordlist
  • (iii) the suffixlist
  • (iv) the idiomlist

    (i) Probability Data

    CLAWS assigns potential word-tags using a number of rules based on the ending and orthography of the word, and then uses a Hidden Markov Model method for estimating the most likely word-tag in each context. This is a type of statistical language model which calculates the probabilities of a certain sequence of words requiring a certain sequence of grammatical tags.
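
    As an illustration of the general idea only, the sketch below runs a small Viterbi search over candidate tags. It is not the CLAWS implementation: the function name, the data structures and every probability value are invented for the example, and only the tags themselves (AT, NN1, NN2, VV0) come from the C7 tagset.

```python
from math import log

# Hypothetical figures, invented purely for illustration.
CANDIDATES = {                      # word -> {tag: P(tag | word)}
    "the": {"AT": 1.0},
    "prices": {"NN2": 1.0},
    "mushroom": {"NN1": 0.99, "VV0": 0.01},
}
TRANS = {                           # (previous tag, tag) -> P(tag | previous tag)
    ("AT", "NN1"): 0.5,
    ("NN2", "NN1"): 0.001,
    ("NN2", "VV0"): 0.3,
}
START = {"AT": 0.5, "NN2": 0.2}     # tag -> P(tag at start of sentence)
FLOOR = 1e-6                        # small probability for unlisted events


def viterbi(words, candidates=CANDIDATES, trans=TRANS, start=START):
    """Return the most likely tag sequence for a list of words."""
    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {
        tag: (log(start.get(tag, FLOOR)) + log(p), [tag])
        for tag, p in candidates[words[0]].items()
    }
    for word in words[1:]:
        new_best = {}
        for tag, p in candidates[word].items():
            # Choose the best previous tag to precede this candidate tag.
            score, path = max(
                (prev_score + log(trans.get((prev, tag), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[tag] = (score + log(p), path + [tag])
        best = new_best
    return max(best.values())[1]
```

    With these invented figures, the rare verb reading of 'mushroom' is preferred after a plural noun, and the noun reading after an article.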

    Further information on probabilistic language analysis and the CLAWS programs can be found in The Computational Analysis of English, Garside, Leech and Sampson (1987), especially Chapters 3 & 4 and in Roger Garside's chapter in Short and Thomas (1996).

    (ii) The Lexicon

    The lexicon or wordlist consists of approximately 12,000 words, each listed with the possible tags for that word. Each word has between one and six candidate tags.

    In effect, words have been added to the wordlist wherever CLAWS, using the probability data alone, failed to assign a correct tag. CLAWS consults the list before tag assignment is finalised.

    For example, using probability data, CLAWS would automatically assign the tag for noun to the word 'mushroom' (NN1). But when we encounter the use of 'mushroom' as a verb (VV0) we know that it is now essential that the lexicon includes the following entry:

    mushroom		NN1 VV0@
    

    (The @ is a rarity symbol, indicating that this tag applies in less than one in a hundred cases. There is also a % rarity symbol, indicating that this tag applies in less than one in a thousand cases.)
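
    A minimal sketch of reading such an entry, assuming that fields are separated by whitespace and that a rarity symbol, when present, is the last character of a tag. The function name and the returned structure are hypothetical and are not part of CLAWS.

```python
def parse_lexicon_entry(line):
    """Split an entry such as 'mushroom  NN1 VV0@' into a word and its tags.

    Returns (word, [(tag, rarity), ...]) where rarity is '@', '%' or None.
    """
    word, *raw_tags = line.split()
    tags = []
    for raw in raw_tags:
        if raw[-1] in "@%":
            tags.append((raw[:-1], raw[-1]))   # strip the rarity symbol
        else:
            tags.append((raw, None))
    return word, tags
```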

    However, with the BNC facilities, the use of increasingly large databases is now possible and we are moving towards the compilation of a much larger lexicon, incorporating other large wordlists for example.

    The fact remains that it is the post-editor's responsibility to formulate lists of words which are candidates for inclusion in the lexicon, and it is recommended that, where the post-editor finds a CLAWS error, a check on the current lexicon is made, and if necessary a suggestion should be made for a new lexicon entry.

    (iii) the Suffixlist

    When CLAWS encounters a word that is not found in the lexicon, a basic morphological analysis of the word is carried out in order to assign a candidate tag or tags to the word. The suffixlist is a list of common or predictable word endings, each coupled with a list of one or more candidate tags. CLAWS will attempt to match the longest possible suffix to the word and assign the candidate tags associated with that suffix. Here are a few lines of the suffixlist:

    cle                       NN1 
    dle                       NN1 VV0 
    fle                       NN1 VV0 
    gle                       NN1 VV0 
    ile                       JJ NN1 
    obile                     NN1 
    phile                     NN1 JJ% 
    pile                      NN1 
    tile                      JJ 
    

    Post-editors should also suggest new suffixlist entries when they feel it is appropriate.
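
    The longest-suffix matching described above can be sketched as follows. The table holds only the nine lines quoted above; the function name is hypothetical, and it is an assumption of this sketch that matching ignores case and that a suffix may cover the whole word.

```python
# The nine suffixlist lines quoted above; the real list is much longer.
SUFFIXES = {
    "cle": ["NN1"],
    "dle": ["NN1", "VV0"],
    "fle": ["NN1", "VV0"],
    "gle": ["NN1", "VV0"],
    "ile": ["JJ", "NN1"],
    "obile": ["NN1"],
    "phile": ["NN1", "JJ%"],
    "pile": ["NN1"],
    "tile": ["JJ"],
}


def suffix_tags(word, suffixes=SUFFIXES):
    """Return the candidate tags of the longest matching suffix, or None."""
    word = word.lower()
    # Try the longest ending first, then progressively shorter ones.
    for start in range(len(word)):
        tags = suffixes.get(word[start:])
        if tags is not None:
            return tags
    return None
```

    For 'automobile' both 'ile' and 'obile' match, and the longer suffix determines the result.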

    (iv) The Idiomlist

    This is not a list of idioms in the usual sense but a list of multi-word sequences to aid CLAWS disambiguation procedures. The entries fall roughly into two main groups:

  • (a) Ditto tag group
  • (b) Word disambiguation group

    Although the list is less than twenty pages in length, its usefulness in helping with CLAWS disambiguation procedures is invaluable.

    Ditto tags are used for sequences whose syntactic role in combination differs from the role of the same words in other contexts. For example, the phrase

    all of a sudden
    

    would cause CLAWS and post-editors alike a few tagging headaches. We know however that this particular combination of words only occurs as an adverbial form. Therefore this sequence of words can be treated as a single unit for grammatical tagging purposes. By the addition of numerals to the basic tag format we can indicate that this is a sequence of tags, representing a sequence of words which form one grammatical unit. These are called ditto tags. The above example is tagged:

    all_RR41 of_RR42 a_RR43 sudden_RR44
    

    (RR is the tag for general adverb)
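
    The sketch below collects ditto sequences from text in the word_TAG format used in this guide. It infers from the examples (RR41 to RR44, RR21 and RR22) that the first numeral gives the number of words in the sequence and the second the position of the word within it. That reading, the restriction to base tags made of letters only, and the function names are assumptions of the sketch.

```python
import re

# Base tag in capitals, then sequence length, then position (assumed reading).
DITTO = re.compile(r"^([A-Z]+)(\d)(\d)$")


def ditto_groups(tagged_text):
    """Return [(base_tag, [words])] for each complete ditto sequence found."""
    groups = []
    current = None                      # (base, length, words collected so far)
    for token in tagged_text.split():
        word, _, tag = token.rpartition("_")
        match = DITTO.match(tag)
        if not match:
            current = None              # an ordinary tag breaks any open sequence
            continue
        base, length, pos = match.group(1), int(match.group(2)), int(match.group(3))
        if pos == 1:
            current = (base, length, [word])
        elif (current and current[0] == base and current[1] == length
              and len(current[2]) == pos - 1):
            current[2].append(word)
        else:
            current = None              # out-of-order or mismatched ditto tag
            continue
        if len(current[2]) == current[1]:
            groups.append((base, current[2]))
            current = None
    return groups
```

    Ordinary tags that end in a single numeral, such as NN1 or NNT2, are not mistaken for ditto tags because the pattern requires two final digits.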

    Ditto tags are not included in the lexicon; they only appear in the idiomlist. (Only single orthographic words are included in the lexicon.) For most purposes, ditto tag sequences should be considered as a closed set. As far as possible, post-editors should not invent ditto tags. There are several good reasons for this. Often a decision will already have been reached that a certain collocation should not have ditto tags. Also, some software for processing tagged text (such as SARA) needs to know all ditto tag sequences that occur. If a post-editor feels that it is appropriate to create a new ditto, the following procedure must be followed:

  • do not use your new suggested ditto tag straight away
  • check the use of the word or construction across the corpus, and see how it has been tagged
  • suggest the new tag to your colleagues
  • if agreed upon, it will need to be tagged consistently in the text and corpus in question
  • the new ditto tag should be documented
  • a recommendation should be made for it to be included in the CLAWS idiomlist

    The idiomlist is also currently used for a wide variety of word disambiguation problems. Three examples are given below.

    The expression at times is not usually ambiguous, but there are several combinations of tags available from the lexicon. There are four candidate tags in the lexicon for times but whenever this word occurs after at the tag should be NNT2. In order to avoid errors it is a simple procedure to make the following entry in the idiomlist:

    at II, times NNT2
    

    The word junior has noun and adjective tags in the lexicon but whenever it occurs after a proper noun (NP1), the tag should be JJ (adjective) so there is an idiomlist entry:

    [NP1], junior JJ
    

    Multi-word place-names, such as Alice Springs, can cause problems for CLAWS. There are two candidate tags for "springs" in the lexicon so there is an idiomlist entry solely for the Australian town:

    Alice NP1, Springs NP1
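
    One possible reading of the three entries above is sketched below. It assumes that elements are separated by commas, that an element of the form 'word TAG' matches that word and assigns the tag, and that an element in square brackets matches any word already carrying that tag and leaves it unchanged. The function names, the case-insensitive matching and the (word, tag) token representation are all hypothetical; the real idiomlist syntax may be richer.

```python
def parse_idiom(entry):
    """Turn '[NP1], junior JJ' into [(None, 'NP1', None), ('junior', None, 'JJ')]."""
    pattern = []
    for element in entry.split(","):
        element = element.strip()
        if element.startswith("[") and element.endswith("]"):
            pattern.append((None, element[1:-1], None))   # match on existing tag
        else:
            word, tag = element.split()
            pattern.append((word.lower(), None, tag))     # match on word, assign tag
    return pattern


def apply_idiom(tokens, entry):
    """Apply one idiomlist entry to a list of (word, tag) pairs; return a new list."""
    pattern = parse_idiom(entry)
    out = list(tokens)
    for i in range(len(out) - len(pattern) + 1):
        window = out[i:i + len(pattern)]
        if all(
            (want_word is None or want_word == word.lower())
            and (want_tag is None or want_tag == tag)
            for (want_word, want_tag, _), (word, tag) in zip(pattern, window)
        ):
            for j, (_, _, new_tag) in enumerate(pattern):
                if new_tag is not None:
                    out[i + j] = (out[i + j][0], new_tag)
    return out
```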
    

    1.2 A Basic Introduction to the Principles of Post-editing


    At present the rate of accuracy for the CLAWS tagging system is between 95% and 98%, depending on the type of text. The primary object of post-editing is to correct the erroneous output, for example where a noun has been tagged as a verb, or an adverb as an adjective.

    Section 3 of this document gives guidelines for deciding which is the correct tag in cases where the distinction is not always obvious. Section 3.1 deals with pairs of tags which may often be confused and Section 3.2 supplements this with guidelines relating to specific ambiguous words.

    An important by-product of post-editing is the improvement of the tagging system itself. In the past, all post-editing was done manually, with corrections input using a simple screen-editor. The BNC project, however, involves the accurate tagging of 100 million words. As the greatest throughput of the tagging and parsing projects was previously 1 million words per year, this gives an indication of the size of the task.

    It is therefore the responsibility of the post-editors to:

  • investigate recurring errors in the CLAWS output
  • log all inconsistencies accurately and conscientiously
  • devise methods to improve CLAWS accuracy - these methods will not necessarily mean programming modifications (although obviously creative thinking is encouraged) but include constant updating of the lexicon, the idiomlist, the suffixlist and, of course, this document.
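
    One simple way of spotting recurring errors is to tally the corrections made during post-editing. The sketch below compares the automatic output with the post-edited version of the same text, both in word_TAG format, and counts each (word, automatic tag, corrected tag) combination. The function name is hypothetical, and the sketch assumes that the two versions contain the same words in the same order.

```python
from collections import Counter


def tag_corrections(auto_text, edited_text):
    """Count (word, automatic tag, corrected tag) triples where the tags differ."""
    auto = [token.rpartition("_") for token in auto_text.split()]
    edited = [token.rpartition("_") for token in edited_text.split()]
    if len(auto) != len(edited):
        raise ValueError("the two versions contain different numbers of tokens")
    counts = Counter()
    for (word, _, auto_tag), (edited_word, _, good_tag) in zip(auto, edited):
        if word != edited_word:
            raise ValueError(f"word mismatch: {word!r} vs {edited_word!r}")
        if auto_tag != good_tag:
            counts[(word.lower(), auto_tag, good_tag)] += 1
    return counts
```

    Combinations that recur across a text are natural candidates for a new lexicon, suffixlist or idiomlist entry.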

    1.3 General-specific ambiguity


    The tags correspond to a number of general word classes (nouns, adverbs, determiners etc.). The complete tagset can be seen in Section 5. Within these tag-groups there is normally a 'general' tag, and a number of 'specific' tags, used for various subcategories of the general word-class. e.g.:

  • RR = general adverb (general tag)
  • RG = degree adverb (specific tag)

    Either by searching a lexicon, or by assessing the likely properties of a word on the basis of its form (using a set of 'suffix rules'), CLAWS7 decides which of the tags might be assigned to a particular word, and then calculates which is most likely to be the appropriate one in the given context.

    In the case of a word which is in the lexicon, the entry will specify the tags to be considered. e.g.:

    jeer		NN1 VV0
    

    In practice, CLAWS7 makes few errors in deciding whether a word is a noun or a verb, because it is deciding between two distinct classes of words with very different syntactic properties. However, a word may have different meanings suggesting that it should be allowed 2 or more different tags within the same general word class. e.g.:

    I have broken my FOOT
    Do you have a one FOOT ruler?
    

    In the first example, 'foot' is clearly a general noun (NN1), as there is no separate tag for 'parts of the body', but in the second example, 'foot' corresponds to the specialized tag NNU, used for 'units of measurement'. In order to assign tags along these lines, we would need to either a) include the tags NN1 and NNU in the lexicon entry for 'foot', or b) include just one of the tags, and correct the output manually.

    In fact, we do neither. Instead, we adopt a principle which can be stated thus:


    AVOID PROLIFERATION OF TAGGING AMBIGUITIES BY ALLOWING THE MORE GENERAL CATEGORY TO SUBSUME THE MORE SPECIFIC ONE


    Thus, foot will always be tagged NN1, whichever of its noun meanings is the case. Unambiguous words or items such as ft. or inch (when the latter is a noun), will, on the other hand, contain the specific tag in the lexicon entry, but not the general one.

    1.3.1 Exceptions

    To aid automatic tagging or parsing, a few known general-specific ambiguities have been allowed to persist. The notable examples are:

  • so RR (general adverb, as in 'so you think you're clever, do you?'), RG (degree adverb, as in: 'so nice')
  • too RR (as in 'I'm coming too'), RG (as in 'too much')
  • more, less RRR (comparative general adverb), RGR (comparative degree adverb)
  • if normally CS (subordinating conjunction) but tagged CSW when it means 'whether'.

    Note also that NN1 and NP1 are considered to be distinct tags and not specific varieties of a generic 'noun' tag, so the principle of the non-proliferation of tags does not apply.
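
    A post-editor checking lexicon entries against this principle might use something like the sketch below. The table of specific and general tags is partial and illustrative, built only from the pairs mentioned in this section, and the function name is hypothetical. NN1 and NP1 are deliberately not paired, as they count as distinct tags.

```python
# Specific tag -> general tag (partial, illustrative table).
SPECIFIC_TO_GENERAL = {"RG": "RR", "RGR": "RRR", "NNU": "NN1", "CSW": "CS"}

# Words for which a general-specific ambiguity is allowed to persist.
ALLOWED_EXCEPTIONS = {"so", "too", "more", "less", "if"}


def general_specific_conflicts(word, tags):
    """Return sorted (specific, general) pairs that both occur in a lexicon entry."""
    if word.lower() in ALLOWED_EXCEPTIONS:
        return []
    bare = {tag.rstrip("@%") for tag in tags}     # ignore rarity symbols
    return sorted(
        (specific, general)
        for specific, general in SPECIFIC_TO_GENERAL.items()
        if specific in bare and general in bare
    )
```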

    1.3.2 Updating Resources

    One further important reason for post-editing is to improve the automatic tagging system. From time to time, one may, in the course of post-editing, become aware of a lexicon entry which is the cause of automatic tagging errors, or which conflicts with the principle of 'no general-specific ambiguity'. It may be decided subsequently to amend the lexicon.

    Similarly, particular sequences of words may be particularly likely to be erroneously tagged. Often, this is because the words themselves, in that particular sequence, require tags which are different from those they would normally be assigned. e.g.:

    at_RR21 all_RR22 (equivalent to a single adverb)

    The Far_NP1 East_NP1 (tagged as a proper noun)

    In many such cases, the CLAWS7 idiomlist can help to make it more likely that the correct tags will be assigned. Post-editing is therefore a means of discovering word chains which might be included in the idiomlist.

    Finally, post-editors frequently encounter usages which seem open to alternative tagging strategies. This is particularly true in the case of a range of types of naming expression. Achievement of a consistent post-edited output therefore rests on the adoption of conventions according to the type of expression. Section 2 includes descriptions of conventions devised so far to deal with the tagging of such naming expressions (see section on nouns, 2.6.3ff).

    SECTION 2

    INTRODUCTION BY WORD-CLASS TO THE CLAWS7 TAGGING SCHEME


    2.1 Adjectives


    The main class of adjectives, those which can be used predicatively or attributively (whether or not with the same meaning), are tagged JJ.

    JJR is used for comparative adjectives (e.g. whiter) and JJT is used for superlative adjectives (e.g. whitest).

    JK (catenative) is used for able, unable and willing in sentences like: "Will you be able_JK to manage?" but not when used as a general adjective as in: "Your son is very able_JJ."

    SYNTACTIC AMBIGUITY (see Section 3)

  • 1. JJ > VVG, JJ > VVN
  • 2. JJ > NN1
  • 3. JJ > RR, JJR > RRR

    2.2 Adverbs


    As well as the general adverb (RR), tags exist for degree adverbs (very, too etc.) (RG), prepositional adverbs or particles (RP), locative adverbs (RL) and adverbs of time (RT). For the first two of these (RR and RG), comparative and superlative tags exist (RRR, RRT, RGR, RGT). Further adverb tags are listed in Section 5.

    Adverb words are subject to tagging errors due to a variety of sources of ambiguity. See esp.:

  • Section 2: Articles and Determiners (2.4)
  • Prepositions vs. RP and RL (2.7.2)

    2.3 Articles, Determiners & Pronouns


    All determiners capable of a pronominal function receive D tags (see Section 5 for a complete list), regardless of whether they are acting pronominally or not. These are categorised according to the positions in which they may occur in a complex noun phrase.

    Words that are only pronouns (they, nobody etc.) are given P tags. The, a/an, no and every are tagged as articles (AT or AT1).

    The main source of automatic tagging errors associated with 'D'-words is ambiguity between determiners and adverbs. See Section 3: DAR > RRR (more and less) and Section 4: any; each; much; no.

    Errors are sometimes associated with the ambiguity between that as a determiner (DD1), as a conjunction (CST) and as a degree adverb (RG).


    2.4 Conjunctions


    The two most commonly occurring types of conjunction are:

  • coordinating conjunctions (and_CC, or_CC; but_CCB),
  • subordinating conjunctions (if_CS, because_CS).

    The latter are sub-categorized as follows:

  • CS general subordinating conjunction
  • CSA as (see Section 4)
  • CSN than
  • CST that as a conjunction
  • CSW whether; also if when it means whether.

    The sequence "as well as" may be tagged as a CC idiom. However, at present CLAWS7 tends to mis-tag this due to the 'overlapping' idiom "as_RR21 well_RR22" (meaning 'also'). Attention should therefore be paid to this at the post-editing stage.

    Sometimes plus and minus may be tagged CC, but should be tagged II when linking noun phrases, if signifying addition or subtraction (see guidelines for times in Section 4.).


    2.6 Nouns


    2.6.1 Introduction


    The basic tags for common nouns are NN1, NN2 and NN.

    e.g.:

    table_NN1
    tables_NN2
    men_NN2
    people_NN
    group_NN1
    groups_NN2
    sheep_NN

    NN is used for nouns that can have plural and singular modifiers yet do not change their form (i.e. words that are morphologically neutral for number) like sheep, and words that have distinct singular and plural meanings, like aids and butchers.

    Various subcategories of noun are given more specialized tags (see section 5 for a complete list). Certain conventions have been adopted for the tagging of various types of noun phrase, either automatically, or, if necessary, manually by post-editors. These are described below.

    Because of the principle of avoiding ambiguity within a word class between a general and a specific category, many decisions have had to be made about the appropriate tagging for nouns with more than one meaning.

    The three subcategories of noun which present most dilemmas are NNL1/NNL2 (locative nouns) and NP/NP1/NP2 (proper nouns). There are also the tags NNT1/NNT2 used for temporal nouns (which may also be used adverbially), though cases of ambiguity between NNT* and NN* seldom arise (two exceptions being fall and spring, which are tagged NN1).

    2.6.2 Major Sub-Categories Of Noun


    There follows a brief description of the criteria used for deciding when these tags are appropriate. This is followed by details of the conventions adopted for tagging various types of noun-phrase which have been found to raise questions.

    2.6.2.1 NP/NP1/NP2

    These proper noun tags are used for:

  • names of people
  • 'proper noun' elements of company names etc.
  • names of places (e.g. countries, towns and villages), where these are not included in other classes of "capitalised noun" tags, e.g. NNL1. Compare:

    Vermont_NP1
    Atlantic_NP1 City_NNL1

  • names of newspapers
  • names of institutions and products where the words are not common nouns, e.g.:

    IBM_NP1, Spar_NP1, Lancia_NP1
    

    but NOT for:

  • names of horses
  • names of ships etc.
  • names of teams
  • names of pubs, hotels etc.
  • titles of books, plays, films etc.
  • any other lexical nouns forming part of a naming expression.

    The general principle can be stated as:


    Bona fide proper nouns are words that are (i) names of people, (ii) geographical places and (iii) other names of things that are not also common lexical nouns. Bona fide proper nouns are given NP tags. Other nouns that form part or the whole of a naming expression are tagged as common nouns, unless they are words that are normally an NP/NP1/NP2 in any case.


    A few examples (there are more for each category below).

    Ships:

    HMS_NNB Brilliant_JJ
    The_AT Queen_NNB Mary_NP1
    

    People:

    John_NP1 Smith_NP1
    Kate_NP1 Moss_NP1
    

    Horses:

    Shergar_NP1
    White_JJ Flash_NN1
    

    Product and company names:

    IBM_NP1
    Word_NN1 for_IF Windows_NN2
    Volkswagen_NP1 Golf_NN1
    

    Place names (see also more below):

    Lancaster_NP1
    Leicester_NP1 Square_NNL1
    Old_NP1 Street_NNL1
    

    The NP tag is used only for those cases where a proper noun is morphologically neutral for number (see NN above). There are therefore a very restricted, but open, number of words that are given this tag. Only those proper nouns that are countable and unchangeable in form come into this category, e.g. Mercedes, Sainsburys, Tescos.

    2.6.2.3 NNL1/NNL2

    The NNL tags are used for a closed list of words which have a locative meaning, and which occur (normally with a capital letter) in complex expressions for naming geographical places. Institutions have general common noun tags, although there is sometimes a difficult distinction to be drawn between institutional and geographical reference, e.g. The British Museum. Since the NNL tags are only used in compounds, they are only assigned by the idiomlist and not by the lexicon.

    Leicester_NP1 Square_NNL1
    Old_NP1 Street_NNL1
    The_AT Atlantic_NP1 Ocean_NNL1
    

    The full list of words that can be tagged NNL1 or NNL2 is as follows:

    NNL1              NNL2
    City              Hills
    Close             Islands
    Hill              Isles
    Island            Mountains
    Isle
    Lake
    Lane
    Mount
    Ocean
    Place
    River
    Road
    Sea
    Square
    Street
    

    Shortened forms of these words get NNL tags, such as Rd, St (meaning Street) and Mtns, etc.

    The NNL tag is also still regarded as valid for a word referring, not to the physical location itself, but to the activity or institution which is associated with or contained within that location. Thus:

    Downing_NP1 Street_NNL1 registered_VVD its_APPGE disapproval_NN1 ._.
    

    See also the section on place names below.

    2.6.2.4 NNB tag

    NNB (where B stands for 'before') is used for a set of nouns which function as a title in a person's name, and is only used when these words are used as a title, e.g.:

    Sergeant_NNB Jones_NP1
    He was a conscientious sergeant_NN1
    footballer_NN1 Geoff_NP1 Hurst_NP1
    "Yes, Sergeant_NN1"
    

    Nouns denoting family relations, even though some of them do not meet the "title in a person's name" criterion, are included in the NNB set (e.g. uncle, aunt, auntie). NNB is used for abbreviated nouns of style occurring before a name:

    Mr._NNB Jones
    Coun._NNB Alf Roberts
    

    NNA (where A = "after") is used for abbreviated nouns of style or title appended to names:

    Anne Collins J.P._NNA
    Gordon Banks O.B.E._NNA
    

    Note that 'St.', meaning 'Saint', is always tagged NP1.

    2.6.2.5 NNT1/NNT2

    These tags are used for temporal nouns, which can be used as the head of an adverbial expression. e.g.:

    I saw him last week_NNT1
    Every year_NNT1, it's the same story
    

    Nouns that also have a non-temporal meaning, such as Spring and Fall, are always tagged NN1.

    See also Section 4: "times"


    2.6.3 Categories Of Naming Expressions


    2.6.3.1 Place Names And Locations

    The tagging for place-names and geographical locations consists predominantly of NP, NNL and common noun tags. Words that are tagged as common nouns are parts of names which contain a capitalized word which is not generally recognized to be part of the name itself, but is used instead as a qualifier to the name, and is thus performing its normal lexical function (in spite of the capital). Thus:

    New_NP1 York_NP1
    West_NP1 Berlin_NP1 
    

    But:

    New_NP1 York_NP1 West_ND1 
    East_ND1 Chicago_NP1
    

    The difference between 'West Berlin' and 'East Chicago' is that the former is (or was) an institutionalised name, whereas the latter is not. (The fact that West Berlin is no longer an official entity is not relevant here, since reference to places in the past is possible.) References to places may often contain an NNL1 (or NNL2), such as Lake, Mount, River or Isles. However, a word which may be tagged NNL1/2 is tagged NP1 in cases where it is an integral part of the name, not an additional descriptive qualification in the way NNL-tagged words are. Thus:

    Mount_NNL1 Igman_NP1
    River_NNL1 Tyne_NP1
    Lake_NNL1 Placid_NP1
    Fylde_NP1 Avenue_NNL1
    

    But:

    Lake_NP1 District_NN1
    Street_NP1 Lane_NNL1
    Avenue_NP1 Road_NNL1
    

    The test we apply is to see whether, e.g., Lake District is a kind of lake. The answer is no, so Lake is an NP1. Lake Placid is a lake, however, so Lake is an NNL1. An extension of this rule is applied in the case of names such as Long Beach, which is not a kind of beach, or Bowling Green, which is the name of a town. In applying this test we do not recognise the derivation of such names, and tag them as proper nouns:

    Long_NP1 Beach_NP1
    Bowling_NP1 Green_NP1
    

    A qualification to the rule, on the other hand, applies in the case of a plural NNL or NN word which has become part of a singular proper noun. In such cases, the tag NP1 is used. E.g.:

    Alice_NP1 Springs_NP1
    Beverly_NP1 Hills_NP1
    Strawberry_NP1 Fields_NP1
    Grand_NP1 Rapids_NP1
    Yorktown_NP1 Heights_NP1  
    

    It is clear that such names do not function as plural nouns. Compare, for example:

    Beverly_NP1 Hills_NP1 is a suburb of L.A.
    The Malvern_NP1 Hills_NNL2 are made of granite
    

    It is for this reason that "United States" is tagged:

    United_NP1 States_NP1
    

    Note also that a place name preserves its tagging when it is subsumed into a longer naming expression:

    Long_NP1 Island_NNL1
    Long_NP1 Island_NNL1 Sound_NN1
    

    Other words which are tagged as NP1 when part of a place-name are Greater and St., e.g.:

    Greater_NP1 Manchester_NP1
    St._NP1 Louis_NP1
    

    2.6.3.2 Nationality And Language.

    There is sometimes a problem deciding how to tag words relating to nationality, language etc. when they have an adjectival form. As a rule, the language is tagged as a noun (NN1), whilst the same word used as an adjective should be tagged JJ:

    French_JJ people usually speak French_NN1
    

    The tagging for specific or generic reference is shown in the table below.

    If the word for generic or plural specific reference is the same as the adjectival form (i.e. does not undergo morphological pluralisation), then the word is tagged as an adjective, e.g.: The French_JJ (cf. the poor_JJ). In most cases, the word is not available for singular reference. Even when it is (e.g. A Japanese), the tag JJ is retained. The key test is whether these words can take an additional plural ending. Those which cannot are treated as adjectives (see Type 1 in the table below). On the other hand, words which can be used either adjectivally or for singular specific reference, but which take a plural ending for generic reference (or plural specific reference), are tagged as nouns when they refer to people (see Type 2 in the table below).

    Type 1

    Language          Person           People           Adjective       
    NN1               JJ               JJ               JJ              
    
    
    Dutch             Dutch            Dutch                            
    English           English          English                          
    Flemish           Flemish          Flemish                          
    French            French           French                           
    Irish             Irish            Irish                            
    Polish            Polish           Polish                           
    Spanish           Spanish          Spanish                          
    Welsh             Welsh            Welsh                            
    Chinese           Chinese          Chinese                          
    Japanese          Japanese         Japanese                         
    Portuguese        Portuguese       Portuguese                       
    Swiss             Swiss            Swiss                            
    Vietnamese        Vietnamese       Vietnamese                       
    
    

    Type 2

    Language          Person           People           Adjective       
    NN1               NN1              NN2              JJ              
                      African          Africans         African         
                      American         Americans        American        
    Arabic            Arab             Arabs            Arabic/Arab     
                      Arabian          Arabians         Arabian         
                      Asian            Asians           Asian           
                      Australian       Australians      Australian      
                      Canadian         Canadians        Canadian        
    German            German           Germans          German          
                      Belgian          Belgians         Belgian         
                      Brazilian        Brazilians       Brazilian       
                      European         Europeans        European        
    Hungarian         Hungarian        Hungarians       Hungarian       
                      Indian           Indians          Indian          
    Italian           Italian          Italians         Italian         
    Norwegian         Norwegian        Norwegians       Norwegian       
    Russian           Russian          Russians         Russian         
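
    The two tables can be summarised in a small lookup, sketched below. The set holds the Type 1 words listed above. The function name and the role labels ('language', 'person', 'people', 'adjective') are hypothetical, and it is an assumption of the sketch that any word not in the Type 1 set behaves as Type 2.

```python
# Type 1 words from the table above: these cannot take a plural ending.
TYPE1 = {
    "Dutch", "English", "Flemish", "French", "Irish", "Polish", "Spanish",
    "Welsh", "Chinese", "Japanese", "Portuguese", "Swiss", "Vietnamese",
}


def nationality_tag(word, role):
    """Return the tag for a nationality word used in the given role."""
    if role not in {"language", "person", "people", "adjective"}:
        raise ValueError(f"unknown role: {role!r}")
    if role == "language":
        return "NN1"
    if role == "adjective":
        return "JJ"
    if word in TYPE1:
        return "JJ"                     # Type 1: person and people stay adjectival
    return "NN1" if role == "person" else "NN2"   # Type 2: tagged as nouns
```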
    
    

    In the case of compound adjectives or nouns referring to nationality, the tagging is extrapolated from the tagging applied to the name of the country. Thus:

    The West_NP1 German_JJ Chancellor_NNS1
    West_NP1 Indian_JJ food_NN1
    The West_NP1 Indians_NN2 
    Puerto_NP1 Ricans_NN2 
    A South_NP1 African_NN1 
    South_NP1 African_JJ policies
    

    2.6.3.3 Race And Religion

    The guidelines for tagging words of adjectival form relating to race and/or faith are parallel to those for nationality. Those which can be pluralised may be tagged NN1, NN2 or JJ:

    A South_NP1 African_JJ Black_NN1
    A Black_JJ South_NP1 African_NN1
    A Black_JJ youth_NN1 and three whites_NN2
    White_JJ supremacy_NN1
    Black_JJ Liberation_NN1
    black_JJ nationalists_NN2
    black_JJ nationalist_JJ campaigners_NN2
    

    Similarly:

    A Roman_JJ Catholic_JJ priest
    A Roman_JJ Catholic_NN1
    Moslem_JJ customs_NN2
    A group of Buddhists_NN2
    

    2.6.3.4 Directional Terms

    Words representing points on the compass are tagged ND1, whether they are used adjectivally, nominally or adverbially. This applies whether they are simple words, hyphenated words or abbreviations. e.g.:

    north-east_ND1
    S.E._ND1
    south_ND1
    southwest_ND1
    

    Their derivative '-ern' adjectives are tagged JJ:

    Northwestern_JJ
    Southern_JJ
    

    These rules are overridden in cases where:

    1. The word is an essential part of the name of a country, region or place (see paragraphs 1 & 4 above):

    The Middle_NP1 East_NP1
    West_NP1 Germany_NP1
    

    2. When the word stands alone in reference to a company or similar organization (see paragraph 6a below):

    Following the revelations of malpractice in the Eastern_JJ Railway_NN1 Co._NN1, the head of Eastern_NP1 has resigned
    

    3. When the word is a proper noun in its own right. e.g.:

    Colonel_NNB Oliver_NP1 North_NP1
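
    The default rule for directional terms, before any of the three overrides above is considered, can be sketched as a pattern match. The function name is hypothetical. As a simplifying assumption, only two-letter abbreviations such as S.E. or NW are recognised, because single letters are too ambiguous to classify out of context.

```python
import re

POINT = r"(?:(?:north|south)-?(?:east|west)|north|south|east|west)"
COMPASS = re.compile(POINT)                  # north, south-east, southwest ...
ERN = re.compile(POINT + "ern")              # southern, northwestern ...
ABBREV = re.compile(r"[NS]\.?[EW]\.?")       # NE, S.E., N.W. ...


def directional_tag(word):
    """Return 'ND1', 'JJ' or None; overrides such as West_NP1 Germany_NP1 are not handled."""
    if ABBREV.fullmatch(word):
        return "ND1"
    lowered = word.lower()
    if COMPASS.fullmatch(lowered):
        return "ND1"
    if ERN.fullmatch(lowered):
        return "JJ"
    return None
```

    Deciding whether an override applies still requires the surrounding context, so a result from this function is only a starting point for the post-editor.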
    

    2.6.3.5 Company Names

    A typical company name would be tagged thus:

    Schlitz_NP1 Brewing_NN1 Co._NN1
    

    The proper noun tag is retained for elements of the title which are clearly names (e.g. surnames), but other elements are tagged as they would be normally. Thus:

    Filmpower_NP1
    Chrysler_NP1 Motor_NN1 Corp._NN1
    Safeway_NP1 Limited_JJ
    

    If a proper noun happens to coincide with a common noun, we still assign an NP-tag:

    Gateway_NP1 Productions_NN2 Inc._JJ
    Storeys_NP1 Ltd._JJ
    

    But if the title is merely composed of common lexical items, proper noun tags are not used:

    Resorts_NN2 International_JJ
    News_NN1 International_JJ
    Federated_JJ Department_NN1 Stores_NN2
    

    This distinction is not always obvious. The test used to judge whether an NN tag or an NP tag is appropriate is to ask whether these words are used in something approximating to the normal way. Since, presumably, Gateway Productions do not make gates, and Storeys are probably named after a person called Storey, there is a sense in which they are much further from their normal lexical function than the words 'Resorts', 'Federated' and 'News' in the examples above.

    A different tagging strategy is employed when the company title is truncated to a single word. In such cases we use the NP1 tag:

    A spokesman for Federated_NP1 said...
    When I started working for International_NP1 ...
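
    The truncation rule can be sketched in Python as follows. This is an illustration only: the function name tag_company_mention and the representation of a tagged title as a list of (word, tag) pairs are assumptions made for the example and are not part of CLAWS.

```python
def tag_company_mention(full_name, mention):
    """Tag a mention of a company whose full title is already tagged.

    full_name: list of (word, tag) pairs for the post-edited full title.
    mention:   list of words as they appear in the running text.
    """
    known = dict(full_name)
    if len(mention) == 1 and len(full_name) > 1:
        # Title truncated to a single word: use NP1, whatever tag the
        # word carries inside the full title.
        return [(mention[0], "NP1")]
    # Otherwise keep the tags used in the full title (hypothetical
    # fallback: words not in the title are returned with tag None).
    return [(w, known.get(w)) for w in mention]


federated = [("Federated", "JJ"), ("Department", "NN1"), ("Stores", "NN2")]
print(tag_company_mention(federated, ["Federated"]))
```

    The single-word case is checked first, so that 'Federated' standing alone becomes NP1 even though it is JJ in the full title.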
    

    See also the section above on 'directional words' (Southern, East etc.), which often form part of corporate titles, and where similar considerations apply. Note that commerciality is not a criterion for applying the above guidelines. Non-commercial organizations are tagged in the same way:

    Maine_NP1 Correctional_JJ Centre_NN1
    New_NP1 York_NP1 Central_JJ Youth_NN1 Club_NN1
    

    It is characteristic of such titles that examples will always occur which show up the inadequacy of any concise guidelines for tagging and tag-correction. For instance, in the case of 'Pan American' and 'Pan Am', it was decided these should always be tagged as proper nouns:

    Pan_NP1 Am_NP1
    Pan_NP1 American_NP1
    

    and that 'General Electric' should be tagged:

    General_JJ Electric_NN1
    

    Probably the best way to ensure that the tagging of a large body of text remains as consistent as possible is to build up a 'caselaw' of such tagging decisions as they are made, and if possible to add them to the idiomlist.
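
    One way a post-editing team might keep such a caselaw is as a simple lookup table applied after automatic tagging. The sketch below is hypothetical: the table layout and the name apply_caselaw are assumptions, and it does not reproduce the format of the CLAWS idiomlist. The three entries are the decisions quoted above.

```python
# Hypothetical caselaw table: each key is a sequence of words, each value
# the tags decided for that sequence.
CASELAW = {
    ("Pan", "Am"): ("NP1", "NP1"),
    ("Pan", "American"): ("NP1", "NP1"),
    ("General", "Electric"): ("JJ", "NN1"),
}


def apply_caselaw(tagged, caselaw=CASELAW):
    """Overwrite tags wherever a recorded word sequence occurs.

    tagged: list of (word, tag) pairs; a new list is returned.
    """
    result = list(tagged)
    for i in range(len(result)):
        for words, tags in caselaw.items():
            window = tuple(w for w, _ in result[i:i + len(words)])
            if window == words:
                # Same-length replacement, so later indexes stay valid.
                result[i:i + len(words)] = list(zip(words, tags))
    return result


print(apply_caselaw([("Pan", "NN1"), ("Am", "VBM"), ("said", "VVD")]))
```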

    2.6.3.6 Product Names

    Product names are given NP tags when the words do not coincide with common lexical items, or when they are bona fide proper nouns:

    Cadillac_NP1 Eldorado_NP1
    I drive a Mini_NP1
    He wasn't very good with Hoovers_NP2
    

    This also applies when the NP term precedes the head of the phrase:

    A Burberry_NP1 raincoat
    A Boeing_NP1 airliner
    

    2.6.3.7 Names Of Teams

    Names of sports teams (e.g. The Green Bay Packers; The Chicago Bears; New York Rangers; Boston Bruins; the Blue Devils) are tagged as though they were common nouns. This is the case even when the head of the team-name has the appearance of a proper noun (e.g. The Finns). This rule only applies to the head of the naming expression, and to words which are explicitly referring to teams. Such words will usually be plural, but see example 5, below. Other elements of a team's name (e.g. a place name) should be tagged as they would be in other contexts. When a place-name is substituted for the full name of a team (see example 6), it retains its NP tags, as the team reference is implicit.

    These points are illustrated in the examples that follow.

    Manchester_NP1 United_JJ
    Birmingham_NP1 City_NN1
    Oldham_NP1 Athletic_JJ
    Queen_NN1 of_IO the_AT South_ND1
    Tottenham_NP1 Hotspur_NP1
    New_NP1 York_NP1 Rangers_NN2
    The_AT Buffalo_NP1 Sabres_NN2
    Indiana_NP1 Pistons_NN2
    Green_NP1 Bay_NNL1 Packers_NN2
    Chicago_NP1 Bear_NN1 Chuck_NP1 Smith_NP1
    Los_NP1 Angeles_NP1 beat_VVD the_AT Flying_JJ Finns_NN2
    

    2.6.3.8 Horses

    We have adopted the convention of tagging the words in horses' names as though they were ordinary lexical items, and ignoring the capital letters:

    Arabian_JJ Knight_NN1
    Black_JJ and_CC White_JJ
    Fish_NN and_CC Chips_NN2
    King_NNB Arthur_NP1
    Happy_JJ Days_NNT2
    

    Note that Arthur is tagged NP1 because it is unquestionably a bona fide proper noun.

    2.6.3.9 Names Of Ships Etc.

    These are treated in the same way as names of horses, a common tag being JJ:

    H.M.S._NNB Invincible_JJ
    H.M.S._NNB Tenacious_JJ
    H.M.S._NNB Tiger_NN1
    Sir_NP1 Galahad_NP1
    The_AT Queen_NNB Elizabeth_NP1 II_MC
    

    Note that in the last two examples, the NP1 tag is used for personal names. The same would apply for quasi-proper nouns:

    H.M.S._NNB Nautilus_NP1
    

    2.6.3.10 Titles Of Newspapers

    We have adopted the convention of tagging the titles of newspapers as common lexical items:

    The_AT Sun_NN1
    The_AT Daily_JJ Telegraph_NN1
    

    However, the following points should be noted:

    (1) Times is tagged NP1 rather than NNT2 when it occurs in a newspaper title.

    (2) If the name occurs as part of a company name, there may be a conflict with the guidelines under section 2.6.3.5 Company Names above. In this case, we give priority to the test described there:

    Mirror_NP1 Group_NN1 Newspapers_NN2
    The owner of Today_NP1 Newspapers_NN2
    

    2.6.3.11 Titles Of Books, Plays, Films Etc.

    As with names of horses and ships, we attempt to keep NP-tagging to a minimum, using it only for words which would be NP1 or NP2 in other contexts:

    What_DDQ Katy_NP1 Did_VDD Next_MD
    The_AT Diary_NN1 of_IO a_AT1 Nobody_NN1
    Frankenstein_NP1
    Life_NN1 the_AT Universe_NN1 and_CC Everything_PN1 
    

    2.6.3.12 Names Of Hotels And Pubs Etc.

    The conventions adopted for tagging names of hotels are similar to those for names of companies. In other words, where the full name is given, NP tags are restricted to bona fide proper nouns, e.g.:

    Park_NN1 Lane_NNL1 Hotel_NN1
    

    Truncated names are changed to NP1 in order to avoid an adjective standing alone as the head of a noun phrase, but not if the truncated name is a noun in any case:

    The Regency_NN1
    

    These points are illustrated further in the following examples:

    The Cumberland_NP1 Hotel_NN1
    The Post_NP1 House_NN1
    The Post_NP1 House_NN1 Hotel_NN1
    The White_JJ House_NN1 Hotel_NN1
    The Imperial_JJ Hotel_NN1
    Let's have a drink at the Imperial_NP1
    

    2.6.3.13 Festivals And Commemorative Events:

    As far as possible, these are tagged using ordinary lexical item tags, including, where appropriate, NNT1:

    At a Republic_NN1 Day_NNT1 gathering
    New_JJ Year_NNT1 's_GE Day_NNT1
    Lincoln_NP1 Day_NNT1
    

    NNT1 tags are used for Christmas, Passover, Easter etc:

    Next Christmas_NNT1
    Easter_NNT1 Sunday_NPD1
    Christmas_NNT1 Day_NNT1
    

    2.6.4 Cited Words


    Words like 'no' and 'must', which may be used as if they were nouns, are governed by tagging conventions which vary according to the presence or absence of quotation marks, and according to whether or not the word in question has been pluralised. When such a 'cited word' has its normal (singular) form, it is given its normal tag if quotation marks are used:

    That sounds like a "yes_UH" to me
    It's a "maybe_RR" rather than a "must_VM"
    

    However, when there are no quotation marks present, we use a tag appropriate to the context (usually NN1):

    A resounding no_NN1
    An absolute must_NN1
    

    Plural cited words are tagged NN2 whether or not quotation marks are used:

    No ifs_NN2 or buts_NN2, just do it!
    The noes_NN2 have it
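
    The three conventions for cited words can be summarised as a small decision function. This is a sketch only; the function name and its parameters are assumptions made for illustration.

```python
def cited_word_tag(normal_tag, quoted, plural, context_tag="NN1"):
    """Return the tag for a cited word such as 'no' or 'must'.

    normal_tag:  tag the word would carry in ordinary use (e.g. "VM").
    quoted:      True if the word appears inside quotation marks.
    plural:      True if the cited word has been pluralised.
    context_tag: tag appropriate to the context when unquoted
                 (usually NN1, per the guideline above).
    """
    if plural:
        return "NN2"        # plural cited words: quotes are irrelevant
    if quoted:
        return normal_tag   # singular and quoted: normal tag
    return context_tag      # singular, no quotes: contextual tag


print(cited_word_tag("VM", quoted=True, plural=False))
print(cited_word_tag("VM", quoted=False, plural=False))
```

    The plural test comes first because plural cited words are NN2 whether or not quotation marks are used.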
    

    2.7 Prepositions


    2.7.1 Tags Used For Prepositions


    Most prepositions are tagged II. More specific tags are used as follows:

  • IF for
  • IO of
  • IW with, without

    2.7.2 Prepositions & RPs


    A preposition-type word will receive an adverbial particle tag when used in phrasal verb constructions, or when having the function of an adverbial in the sentence or clause, e.g.:

    Japanese companies have insisted on keeping down_RP sales of US cars
    He did not rule out_RP use of a surcharge
    Rota put the Cannocks up_RP 4-3 in the second period
    Seattle reeled off_RP six points for a 17-point lead
    Out_RP in the garden, the dog was running around_RP
    

    A list of possible RPs:

    'bout                     RG II RP@
    about                     II RG% RP@
    along                     II RP
    around                    II RP RG@
    away                      RL RP JJ%
    back                      RP NN1 JJ@ VV0%
    by                        II RP%
    down                      RP II@ NN1% VV0% JJ% NP1:%
    in                        II RP@ .NNU%
    off                       RP II JJ%
    on                        II RP@
    on/off                    RP
    out                       RP II%
    over                      II RP JJ% RG@ NN1%
    round                     JJ II RP NN1@ VV0@
    through                   II RP@ JJ%
    thru                      II RP@ JJ%
    to                        TO II RP%
    under                     II RP@ RG@
    up                        RP II@ VV0%
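
    Entries in the format shown above (a word followed by its candidate tags) are easy to read mechanically. The sketch below is an assumption about how one might do so in Python; it is not the CLAWS lexicon reader. It separates a trailing '@' or '%' from each tag and keeps it as an opaque marker without interpreting it.

```python
def parse_lexicon_entry(line):
    """Split a lexicon line into the word and its candidate tags.

    Each candidate is returned as (tag, marker), where marker is the
    trailing '@' or '%' if present and '' otherwise. Any other
    punctuation inside a tag is left untouched.
    """
    word, *candidates = line.split()
    parsed = []
    for cand in candidates:
        if cand[-1] in "@%":
            parsed.append((cand[:-1], cand[-1]))
        else:
            parsed.append((cand, ""))
    return word, parsed


def may_be(entry_line, tag):
    """True if `tag` is among the candidate tags of the entry."""
    _, parsed = parse_lexicon_entry(entry_line)
    return any(t == tag for t, _ in parsed)


print(parse_lexicon_entry("about                     II RG% RP@"))
print(may_be("along                     II RP", "RP"))
```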
    

    A similar potential ambiguity exists with words which may be tagged either as prepositions or as 'RL' (locative adverb), e.g.:

    I walked across_II the park
    He looked across_RL and saw Jim
    

    Words which may be tagged RL or II (IW):

    aboard                    RL II
    above                     II RL JJ@
    across                    II RL@
    alongside                 II RL
    astride                   RL II
    behind                    II RL@ NN1%
    below                     II RL RG
    beneath                   II RL@ JJ%
    beside                    II RL%
    between                   II RL%
    beyond                    II RL@ NN1%
    ere                       CS II RL
    inside                    II RL NN1@ JJ@
    near                      II RL JJ@ VV0@
    nigh                      RL II@
    opposite                  JJ II% NN1@ RL@
    outside                   II RL JJ NN1@
    past                      NN1 II RL JJ
    throughout                II RL@
    underneath                II RL NN1@
    within                    II RL@
    without                   IW RL% RR%
    

    There are no RL/RP ambiguities in the lexicon. One or the other is preferred in each case, or else the RR tag is used.

    Stranded prepositions are liable to be automatically tagged RP, and should be changed to II. They occur when the preposition becomes detached from its noun phrase as may happen in various types of clausal construction, eg:

    Question:
    What team do you play in_II ?
    (In_II which team do you play?)

    Relative Clause:
    I know the story you are talking about_II
    ( ... about_II which you are talking)

    Passive:
    The car was worked on_II by a fool
    (A fool worked on_II the car)
    

    SYNTACTIC AMBIGUITY (see Section 3)

  • II > RL
  • II > RP

    2.8 Verbs


    2.8.1 Tags used for lexical verbs and for do, be and have

    Apart from modals (see below), tags for all verbs except be, do, and have contain the letters 'VV'. In the case of the 3 verbs mentioned, this changes to 'VB', 'VD' and 'VH' respectively.

    The third element of the tag makes distinctions of form/function as follows:

  • I infinitive
  • 0 (zero) base form:
    applaud_VV0; have_VH0; be_VB0; do_VD0
  • Z 3rd person sing. (present tense) form:
    plays_VVZ; does_VDZ; has_VHZ; is_VBZ
    (other present tense forms of the verb 'to be': am_VBM; are_VBR)
  • D past tense form:
    liked_VVD; took_VVD; had_VHD; did_VDD
    (was_VBDZ; were_VBDR)
  • N past participle form:
    liked_VVN; taken_VVN; had_VHN; done_VDN; been_VBN
  • G present participle form:
    saying_VVG; doing_VDG; being_VBG; having_VHG
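
    The way a verb tag is built up from these elements can be sketched as follows. The function verb_tag and the two tables are assumptions made for illustration; the special tags for 'be' are those given in the list above.

```python
# Two-letter prefixes: 'be', 'do' and 'have' have their own; every
# other lexical verb uses VV.
PREFIX = {"be": "VB", "do": "VD", "have": "VH"}

# Forms of 'be' that carry special tags in the list above.
BE_SPECIAL = {"am": "VBM", "are": "VBR", "was": "VBDZ", "were": "VBDR"}

FORM_CODES = ("I", "0", "Z", "D", "N", "G")


def verb_tag(lemma, form_code, word=None):
    """Compose a verb tag from the lemma and the form code.

    form_code is one of "I", "0", "Z", "D", "N", "G".
    word is the surface form, needed only for the special forms of 'be'.
    """
    if form_code not in FORM_CODES:
        raise ValueError(f"unknown form code: {form_code!r}")
    if lemma == "be" and word in BE_SPECIAL:
        return BE_SPECIAL[word]
    return PREFIX.get(lemma, "VV") + form_code


print(verb_tag("take", "N"), verb_tag("have", "N"), verb_tag("be", "D", "was"))
```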
    

    In cases of verb-form ambiguity, function takes precedence. Thus, the word 'put' should be tagged VV0, VVD or VVN according to its grammatical function. Automatic tagging errors are sometimes associated with this source of ambiguity, particularly where no auxiliary co-occurs with a VVN (or VHN), or where the auxiliary is several words distant from its associated past participle (in questions for example).

    Contracted forms of 'have', 'had' etc. should carry the same tags as the complete forms. Errors may be associated with certain ambiguous forms:

    'd_VHD = 'had'; 'd_VM = 'would' 
    's_VHZ = 'has'; 's_VBZ = 'is'; 's_GE = genitive 's'
    

    Contracted negated forms are broken up by CLAWS7 into their constituent parts and tagged separately:

    is_VBZ n't_XX
    have_VH0 n't_XX
    will_VM n't_XX  (= won't)
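
    The splitting shown in these examples can be imitated with a few lines of Python. This is a rough sketch and not the CLAWS7 tokeniser: the name split_negation is hypothetical, and the only irregular stem handled is the one attested above (won't). Other irregular contractions are simply cut before "n't".

```python
# Irregular stem attested in the examples above; other irregular
# contractions are not covered by this sketch.
IRREGULAR_STEMS = {"won't": "will"}


def split_negation(token):
    """Split a contracted negative into its stem and "n't".

    Tokens that do not end in "n't" are returned unchanged.
    """
    lower = token.lower()
    if len(token) <= 3 or not lower.endswith("n't"):
        return [token]
    if lower in IRREGULAR_STEMS:
        return [IRREGULAR_STEMS[lower], "n't"]
    return [token[:-3], "n't"]


print(split_negation("isn't"), split_negation("won't"))
```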
    

    2.8.2 Modal Auxiliaries

    Modal auxiliaries are tagged 'VM'. A list is given below:

    'd		VM VHD
    'll		VM
    'ud		VM
    can		VM NN1% VV0%
    could		VM
    dare		VV0 VM@ NN1%
    may		VM NPM1: NP1:%
    mayst		VM
    might		VM NN1%
    must		VM NN1%
    need		NN1 VV0 VM@
    shall		VM
    should		VM
    will		VM NN1@ VV0% NP1@
    wilt		VV0 VM NN1%
    would		VM
    

    2.8.3 Catenatives And Modal Catenatives

    The following verb-forms receive 'K' (catenative) tags when used as in the examples:

    he is bound_VVNK to arrive soon
    Do you think it is going_VVGK to rain?
    We used_VMK to think it was impossible
    We ought_VMK to leave
    

    (ought is always VMK)

    SYNTACTIC AMBIGUITY

    1. VVG > NN1 (see Section 3)

    2. VVG > JJ (see Section 3)

    3. VVN > JJ (see Section 3)

    4. VVG > VVGK (going)

    5. VMK > JJ > VVN > VVD (used)

    6. VV0 > NN1

    7. VVZ > NN2

    8. VVD > VVN

    9. VHD > VHN (had)


    2.9 Foreign Expressions

    2.9.1 Naturalised And Commonly-Used Expressions

    Single-word expressions are given tags appropriate to the word's use in English:

    Ostpolitik_NN1
    literati_NN2
    

    We do not attempt to assign a tag according to the class of a word in its language of origin, but rather according to its syntactic use in the English sentence. Thus:

    Vermicelli_NN1
    

    is tagged as singular in spite of the Italian plural ending. Multi-word expressions are generally treated as units, and given ditto tags (see Section 4: Ditto-tags) appropriate to the expression as a whole:

    Pate_NN131 de_NN132 cheval_NN133
    fin_JJ31 de_JJ32 siecle_JJ33
    in_RR21 extremis_RR22
    personae_NN231 non_NN232 gratae_NN233
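
    In the examples above, each ditto tag ends in two digits: the number of words in the expression, then the position of the word within it (so NN131 is word 1 of a 3-word NN1). Under that reading, a checker for ditto-tag sequences might look like the sketch below. The function names and the regular expression are assumptions for illustration, not part of CLAWS.

```python
import re

# A ditto tag is read here as: base tag + number of words in the
# expression (one digit) + position of this word (one digit).
DITTO = re.compile(r"^(?P<base>[A-Z]+\d?)(?P<length>[2-9])(?P<position>[1-9])$")


def parse_ditto(tag):
    """Return (base, length, position), or None if not a ditto tag."""
    m = DITTO.match(tag)
    if not m:
        return None
    length, position = int(m["length"]), int(m["position"])
    if position > length:
        return None
    return m["base"], length, position


def is_complete_expression(tags):
    """True if tags form one whole ditto-tagged expression, in order."""
    parsed = [parse_ditto(t) for t in tags]
    if not parsed or None in parsed:
        return False
    base, length, _ = parsed[0]
    return len(parsed) == length and all(
        p == (base, length, i) for i, p in enumerate(parsed, start=1)
    )


print(parse_ditto("NN131"), is_complete_expression(["JJ31", "JJ32", "JJ33"]))
```

    A check of this kind could be run over post-edited text to catch incomplete or misnumbered ditto sequences.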
    

    2.9.2 Non-Naturalised Expressions

    For foreign expressions which are not naturalised in any appreciable way (as in quotations or book titles, for example), the tag FW is used for each word. Since a post-editor will apply an FW tag to any foreign word whose meaning he or she is unable to fathom, there is obviously a fuzzy boundary to the FW word class. Company names are usually tagged NP1 (see also 2.9.4 below) e.g.:

    Volkswagen_NP1, Alfa_NP1 Romeo_NP1
    

    2.9.3 Foreign words in whole sentences or clauses

    In cases such as those in the examples below, FW is used:

    J'y_FW suis_FW, j'y_FW reste_FW
    festina_FW lente_FW
    che_FW sara_FW sara_FW
    c'est_FW la_FW vie_FW
    

    It would be difficult or misleading to tag them with tags from an English tagset - for example sara above is a third person singular future indicative verb form, yet obviously there is no need for such a tag in English. Similarly festina above is a singular imperative in Latin and there is no such tag for English.

    No changes are to be made manually to contracted combinations of words such as j'y or n'est which are to be tagged by CLAWS7 as single words.

    2.9.4 Names

    Words such as de, van and von are tagged NP1 when part of a name, e.g.:

    Ludwig_NP1 van_NP1 Beethoven_NP1, Ferdinand_NP1 de_NP1 Saussure_NP1
    

    Company and country names are usually tagged NP1 as well even if all of the words are foreign and non-naturalised:

    Credit_NP1 Lyonnais_NP1
    Banque_NP1 Nationale_NP1 de_NP1 Paris_NP1
    les_NP1 Etats_NP1 Unis_NP1
    

    2.9.5 "Borrowed" Prepositions

    Expressions such as per or a la are given normal preposition tags, unless they are subsumed under a longer expression which should be tagged as a unit:

    per_II pound_NNU1
    a_II21 la_II22 Lancaster_NP1
    

    but:

    per_JJ21 RR21 diem_JJ22 RR22
    per_NNU21 cent_NNU22
    a_JJ31 RR31 la_JJ32 RR32 carte_JJ33 RR33
    

    2.10 Interjections

    Interjections are tagged UH. Some words are always considered as interjections, for example: aha, blimey, crikey, ha, huh, oh, sh, um, yes. Other words are sometimes tagged as interjections, and sometimes not, for example:

    adieu UH NN1@
    boo UH VV0@ NN1@
    bye UH NN1%
    clonk NN1 VV0 UH
    hallelujah UH NN1@
    no UH AT RR%
    

    The only words that should be tagged as UH are those indicating exclamation, or some other kind of interactive signal (which is not integrated with the syntax of the sentence), e.g. yes, no, whoa.

    The following are all tagged as FU (unclassified word): other types of exclamation (oops), onomatopoeic words that are not exclamatory (whoosh, ding), transcriptions of non-linguistic utterances ('de do da da da da'), hesitations and stutters (er, erm), truncated words, etc. Also exclamatory words or expressions which retain the spelling of a word in another class, from which they derive, are not tagged UH - e.g.:

    God_NP1 Almighty_JJ
    Sure_JJ
    Bless_VV0 you_PPY
    

    Post-editors should look at the full list of lexicon entries for FU and UH to help clarify this area.

    2.11 Numbers

    There are several tags for different types of strings representing numbers and strings containing numbers. These are:

    MC1     the cardinal number 1: one 1 i I
    MC      other cardinal numbers: two three ninety-four 2 745 ii XVI
    MC2     morphological plurals of cardinal numbers: ones twos hundreds ten's
    MCMC    hyphenated numbers: 40-50 1917-89
    MD      ordinal numbers: first 3rd 75th last next
    MF      fractions: two-thirds quarter 1/2 5⅞
    NNO     (letter O, not 0) numeral nouns: hundred thousand million dozen gross
    NNO2    morphologically plural number nouns: hundreds thousands dozens
    NNU     strings with numbers and units of measurement: $100 £5 6in
            This tag is also used for the units of measurement themselves
            where they are separate words. However, foot is tagged NN1 and
            feet NN2 even when they are units of measurement, in order to
            minimise ambiguity.
    FO      other strings that are a mixture of numeric and alphabetic characters: A41 M6 G7 4GL 3M

    See also one in section 4.
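
    For tokens that contain digits, the distinctions in section 2.11 can be approximated with pattern matching. The following is a rough sketch under stated assumptions: the function number_tag is hypothetical, the set of unit abbreviations is a deliberately tiny invented sample, and number words (one, first, hundred etc.) are left to the lexicon.

```python
import re

# Hypothetical, deliberately tiny set of unit abbreviations; a real
# list would have to come from the lexicon.
UNITS = {"in", "ft", "cm", "mm", "km", "kg", "lb", "oz"}


def number_tag(token):
    """Suggest a number-related tag for a token containing digits.

    Returns None for tokens without digits, and for tokens that fit
    none of the patterns below.
    """
    if not any(ch.isdigit() for ch in token):
        return None
    if token == "1":
        return "MC1"
    if re.fullmatch(r"\d+", token):
        return "MC"
    if re.fullmatch(r"\d+-\d+", token):
        return "MCMC"
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "MD"
    if re.fullmatch(r"\d+/\d+", token):
        return "MF"
    if re.fullmatch(r"[$£]\d+", token):
        return "NNU"
    m = re.fullmatch(r"\d+([A-Za-z]+)", token)
    if m and m.group(1).lower() in UNITS:
        return "NNU"
    if re.fullmatch(r"[A-Za-z0-9]+", token):
        return "FO"
    return None


for tok in ["1", "745", "40-50", "3rd", "1/2", "$100", "6in", "4GL"]:
    print(tok, number_tag(tok))
```

    The more specific patterns are tried first, so that FO only receives the strings left over, in line with its description as the tag for 'other' mixed strings.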

    SECTION 3

    DISAMBIGUATION GUIDE (BY TAG-PAIR)

    CSA / II / RG see Section 4: as


    DAR / RRR


    more and less can be assigned either of these tags.

    The difference between them is that DAR is for noun-phrase-like (and determiner) uses of the word in question, whereas RRR is for adverbial uses. The two can be difficult to distinguish, particularly after a verb, eg:

    You should relax more_RRR
    You should spend more_DAR
    

    Since relax is an intransitive verb in this context, more cannot be a noun phrase. Instead, one can paraphrase it roughly as "to a greater extent". On the other hand, spend is a transitive verb, and so more is a DAR in this context. (We can notice that more after spend is the direct object of the verb, because it can be made the subject of a passive: "More should be spent..."). There are some verbs for which the distinction is less clear than in these examples, eg:

    You should eat more
    You should smoke less
    

    Note that the verb may be used transitively or intransitively with almost identical meanings, so that the syntactic structures of the immediate and/or surrounding context are the only clue as to which is the case:

    "Do you smoke?"   (Intransitive)
    "How many do you smoke a day?"  (Transitive)
    

    Contrast:

    At the moment we have 23 fixtures per season.
    Personally, I would rather play more_DAR
    

    with:

    If you're going to make the big time, I can see you'll have to play more_RRR, and not just wait for the ball to come to you.
    

    (see also RG / RR for degree and general adverb tagging of 'more' and 'less').


    II / RL & II / RP


    Compare:

    (a) He ran down_II the hill
    

    and

    (b) He ran down_RP his friends
    

    In (a), down is a preposition because:

    1. You could insert an adverb before it:
      He ran quickly down the hill
      

      But not:

      *He ran viciously down his friends
      
    2. You can move it to the front of a relative clause or question:
      This is the hill down which he ran
      Down which hills do you like running?
      

    In (b), down is an adverbial particle because:

    1. You can place it before or after the noun phrase:
      He ran his friends down_RP
      

      But not:

      *He ran the hill down
      
    2. If you replace the noun phrase with a pronoun, you HAVE TO place the pronoun in front of the particle:
      He ran them down 
      

      But not:

      *He ran down them 
      

    Similarly:

    She put the cat out_RP > She put IT out_RP
    

    whilst

    She went through_II the gap > She went through_II IT
    

    Notice that the syntactic distinction between down_RP and down_II is independent of the semantic distinction between locative and non-locative uses of down. When the verb is simply followed by down or out etc., without a following noun phrase, it is normally an RP:

    Income tax is coming down_RP
    The decorations were taken down_RP on 12th night
    

    However, tagging errors may occur with stranded prepositions which are denuded of their noun phrase because it has been fronted or ellipted (eg. in relative clauses, passives, questions etc.):

    This is the hill (which) she ran down_II
    (ie. This is the hill down_II which she ran)
    On Shrove Tuesday, this hill will be run down_II by housewives
    (ie. Housewives will run down_II it)
    Which car did you arrive in_II?
    (ie. In_II which car did you arrive?)
    

    The same tests apply to words which are tagged either as prepositions or as locative adverbs (RL), eg. across, past, behind etc. (See Section 2.7.2 for lists).


    JJ / NN1


    Words ending in -ing, when they premodify a noun, may be tagged either NN1 or JJ, eg:

    New_JJ spending_NN1 reductions_NN2
    her_APPGE acting_NN1 ability_NN1
    a_AT1 working_JJ mother_NN1
    

    (but see also JJ / VVG).

    If "X-ing NOUN" is equivalent in meaning to "NOUN which X-es" - (ie. if the NOUN is the notional subject of the verb X) - then "X-ing" is a JJ.

    For example:

    The smiling_JJ children 
    (i.e. The children are smiling)
    

    In other cases, X-ing is an NN1. In such cases, it is often possible to paraphrase X-ing NOUN by a more explicit phrase in which X-ing is clearly a noun. eg:

    new spending_NN1 reductions 
    (new reductions in spending)
    her acting_NN1 ability
    (her ability in acting)
    

    Further examples:

    A boxing_NN1 match
    A falling_JJ rate of exchange
    Slimming_NN1 tablets
    The mating_NN1 season
    a couple of mating_JJ chimpanzees
    

    JJ / RR & JJR / RRR


    After a verb or an object, there is sometimes a tricky choice between JJ and RR, or between JJR and RRR. eg:

    They arrived tired and hungry
    

    Here, both "tired" and "hungry" are JJ. The main test is to see whether you can express the relation between these words and their logical subject, using the verb "be": "They arrived tired and hungry" implies "They were tired and hungry". The JJ/RR word refers to a property of a noun, rather than to a property of an event or a situation. Contrast:

    Peter sang out loud_RR and clear_RR
    

    This sentence does not imply that Peter was loud and clear, but is more or less equivalent to "Peter sang out loudly and clearly". It means that his SINGING was loud and clear. It follows that when, in colloquial English, a word which we normally expect to be an adjective is used as an adverb, we tag it RR. eg:

    We did terrific_RR today
    

    A simple pair of examples where the JJ/RR word follows an object:

    I thought the game too long_JJ 
    (the game was too long)
    They work their staff very hard_RR
    (NOT "the staff are very hard")
    

    Also JJR / RRR:

    They'll have to make the taxes higher_JJR
    (The taxes will be higher)
    

    But:

    You'll have to aim higher_RRR
    

    Note: well is an adjective when it is the opposite of ill:

    Mary is/feels well_JJ
    

    Otherwise it is an adverb:

    "He writes well_RR". 
    

    JJ / VVG & JJ / VVN


    The tagging of words like "surprised" in "John was surprised", or "lasting" in "the effect was lasting" can be a problem. In both cases, the word can be a JJ. One test is to see whether you can insert an adverb like "very" in front of the word. eg. in "John was very surprised", "surprised" is a JJ.

    Another test, having the opposite effect, is to see whether there is an agent "by"-phrase following an "ed/en" word. If so, it is a VVN. eg. in "John was surprised by the pirates", "surprised" is a VVN. Even where it is not present, the possibility of adding a "by"-phrase, without changing the meaning of the word, is evidence in favour of a VVN. (However, this criterion can clash with the preceding one - since it occasionally happens that an "ed"-word is preceded by an adverb like "very" AND followed by a "by"-phrase: eg. "John was very offended by her remarks". Fortunately, such cases are rare. When they do occur, however, give preference to JJ).

    A third test is negative: to see whether the word in question can be placed before a noun. eg:

    The effect is lasting:   a lasting effect
    

    This shows that "lasting" can be (but need not be) a JJ. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is not a JJ, but a VVG or a VVN.

    Even though an "-ing" word is normally a VVG after the verb "be" it is generally treated as a JJ before a noun:

    The man was dying_VVG
    

    But:

    The dying_JJ man
    

    When the -ing or -en/ed word forms part of a phrase premodifying the noun, as in the following examples, the VVG/VVN tag is preferred:

    interest_NN1 earning_VVG account_NN1
    a hypothesis_NN1 driven_VVN approach_NN1
    

    In these examples, the NN1 VVG sequence is similar in function to a compound pre-modifying adjective. In hyphenated form they would be given a JJ tag. The same applies when the phrase is a noun-like compound. eg:

    a [ carol_NN1 singing_VVG ] contest_NN1
    

    If the verb be can be replaced by another verb such as seem or become, without changing the meaning of the following JJ/VVN word, this is a strong indication that the construction is not properly a passive, and that the word is a JJ. eg:

    The building was infested_JJ with cockroaches
    

    (The building became/seemed infested...)

    I could see he was favourably disposed_JJ to the idea
    

    (He seemed favourably disposed...)

    A further distinction which can be used as a test with 'event' verbs is that the JJ refers to a 'resultant state', whereas the VVN refers to an event. eg:

    Bill was married_JJ (as opposed to single)
    Bill was married_VVN to Sarah on May 14th (the actual event)
    

    Some further examples:

    Three people were injured_VVN in the accident
    I could see he was (seemed) injured_JJ
    He lay injured_JJ on the road
    We have three injured_JJ players in the side
    Our players are not worried_JJ
    She is not worried_VVN by that sort of threat
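
    The tests for an "-ed/-en" word after "be" can be combined into a rough decision aid. The sketch below is hypothetical: the function name and parameters are invented for illustration, and the answers to the tests still have to be supplied by the post-editor. The preference for JJ when "very" and a "by"-phrase co-occur follows the guideline above; ranking the "by"-phrase test above the seem/become test is an assumption of this sketch, not something the guidelines state.

```python
def ed_word_tag(very_present=False, by_phrase=False, seem_ok=False):
    """Suggest JJ or VVN for an -ed/-en word after 'be'.

    very_present: an adverb like "very" precedes the word.
    by_phrase:    an agent "by"-phrase follows (or could be added
                  without changing the meaning).
    seem_ok:      'be' can be replaced by 'seem' or 'become' without
                  changing the meaning.
    Returns None when none of the tests gives evidence either way.
    """
    if very_present:
        return "JJ"     # wins even when a by-phrase is also present
    if by_phrase:
        return "VVN"
    if seem_ok:
        return "JJ"     # ordering after by_phrase is an assumption
    return None


print(ed_word_tag(very_present=True, by_phrase=True))
print(ed_word_tag(by_phrase=True))
```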
    

    JJR / RRR see JJ / RR

    NN1 / JJ see JJ / NN1


    NP2 / NN2


    Note that NP2 is not used for names of teams, even those which are apparently not common nouns. NP2 is used for proper nouns which happen to be plural, eg.

    The Rockies, The Hebrides
    

    for plural product names, eg.

    Lancias_NP2 are pretty fast
    

    and for naming families, eg.

    The Staffords_NP2 are always quarrelling. 
    

    RG / RR


    RG is restricted to adverbs of degree (also called intensifiers, etc.) which precede the word or expression they modify. Clear cases of RG are very, and so and as in comparatives (see section on as below).

    Adverbs which have a range of functions, including adverb of degree, are not normally tagged RG, but are given the more general RR tag instead.

    She_PPHS1 was_VBDZ scantily_RR clad_JJ
    

    Here 'scantily' is an RR rather than an RG because it could also occur after a verb:

    She_PPHS1 dressed_VVD scantily_RR
    

    This is another case of the general principle of avoiding general-specific ambiguities within a word class. RG is usually only for words which do not have a more general range of adverbial uses.

    There are exceptions to this, however. (See Section 2: Adverbs. See also Section 4: so). The words which may be tagged RG or RR are:

  • so
  • too
  • quite
  • rather

    Examples:

    She is so_RG attractive
    I would think so_RR
    This is too_RG heavy
    Can I come too_RR?
    That's rather_RG nice
    I would rather_RR go out
    He's quite_RG talkative
    Quite_RR, I agree
    

    Note that about may be an RP or an RG. However, this does not violate the principle mentioned above, since both RP and RG are sub-categories of RR:

    He's about_RG 12, I think
    Stop messing about_RP
    
  • RL / II see II / RP
  • RP / II see II / RP
  • RR / JJ see JJ / RR
  • RR / RG see RG / RR
  • RRQ / CS see Section 4: when
  • RRR / JJR see JJ / RR
  • VVG / JJ see JJ / VVG & JJ / VVN
  • VVN / JJ see JJ / VVN

    SECTION 4

    DISAMBIGUATION GUIDE (BY WORD)


    ANY


    any is tagged DD when it functions pronominally or as a determiner:

    Do it any_DD way you like
    I'm afraid I haven't got any_DD
    

    and RR when it modifies an adverb or an adjective:

    They are not called that any_RR longer_RRR
    I cannot run any_RR faster_RRR
    It was not really any_RR better_JJR than before
    

    Note that the word following may also be ambiguous between adverb and determiner. In such cases, it is possible that both may be erroneously tagged, and require correction thus:

    You won't feel any_DD more_DAR pain
    If you have any_DD more_DAR , you'll burst
    He doesn't play chess any_RR more_RRR
    

    AS


    as can be tagged RG, II or CSA.

    It is an RG when it occurs before an adjective, adverb or determiner (and sometimes other words) in phrases such as:

    I don't think that one is as_RG good 
    I go there as_RG often (as...)
    There are not as_RG many (as...)
    

    In the 2nd and 3rd examples above, the second as is always a CSA because it introduces a comparative construction (an equal comparison, as contrasted with an unequal comparison introduced by than). Thus, in the following, the second as is tagged CSA:

    She's not as_RG (or so_RG) pretty as_CSA I thought
    An ostrich can run as_RG quickly as_CSA a zebra
    He has as_RG many as_CSA six children
    

    Notice that as in this comparative use is tagged CSA whether or not it introduces a clause, as normally understood. In the second case above, as precedes a noun phrase. In the following, it precedes an adjective:

    Please come as_RG quickly as_CSA possible
    

    CSA is also the tag used when as introduces other clauses (eg. clauses of time or clauses of reason). eg:

    As_CSA I arrived, he was leaving
    I'll lend you the money, as_CSA you're my friend
    

    II is the tag for as as an undoubted preposition - it usually has an equative meaning, as in:

    They regard him as_II a friend
    As_II governor of the province, I have to take action
    

    The guideline restricts II to cases of as followed by a noun-phrase-type structure - which may be a pronoun. If as is followed by an adjective, a past participle etc., it is tagged CSA, even though it has the same equative type of meaning as as_II. eg:

    The novel as_CSA originally written
    Many people regard his paintings as_CSA hideous
    

    BUT


    But is most commonly a CCB, but there are rare cases when it can be an RR, an II or a CS.

    It is an RR in phrases such as:

    You can but_RR try
    We could not but_RR offer our help
    

    It is an II when it has a meaning like except or apart from, eg:

    All but_II one of us
    We've asked everyone but_II the doctor
    I've tried everything but_II taking tablets
    Everything but_II the girl.
    

    It is a CS when it introduces a clause such as:

    There's no doubt but_CS he's the guilty party (rare)
    There was nothing for it but_CS to give her the job
    She would do nothing but_CS fly combat missions
    

    Otherwise it is a CCB (co-ordinating conjunction):

    I like this but_CCB I don't like that
    (co-ordinated sentences)
    I like this one but_CCB not that one
    (co-ordinated noun phrases)
    

    EACH


    When each could be replaced by apiece, it is tagged RA. Otherwise it is tagged as DD1:

    Five pounds each_RA is a bit steep
    They scored a goal each_RA
    

    But:

    They each_DD1 scored a goal
    We go fishing each_DD1 Sunday in the Summer
    Each_DD1 one a peach
    I'll give you a fiver for each_DD1
    

    HIS


    His is tagged APPGE when it is a pre-nominal possessive pronoun, ie. when it is part of the set: my; your; her etc.:

    It was his_APPGE fault
    

    It is tagged PPGE when it is a nominal possessive pronoun, ie. when it is part of the set: mine; yours; hers etc.:

    John's not here, so use his_PPGE
    

    HOW


    How may be tagged as an RGQ or as an RRQ. As an RGQ it always premodifies another word, for example an adjective or an expression of quantity:

    How_RGQ much_DA1 opposition is there?
    I do not know how_RGQ willing_JJ he is
    

    How, as an RRQ, has a general adverbial meaning, and can often be paraphrased by an expression such as by what means or in which manner:

    How_RRQ will you manage?
    I wonder how_RRQ it will look
    

    also:

    How_RRQ are you?
    How_RRQ does it feel?
    

    How_RGQ implies a question which could be answered with the phrase in question, but with how replaced by a degree adverb (RG). eg:

    I'm not sure how_RGQ likely it is
    (How likely is it?  It is very_RG likely)
    

    Note that the same principles apply to the word however (RGQV; RRQV), and the expression no matter how:

    No_RGQV31 matter_RGQV32 how_RGQV33 difficult_JJ the situation, Red Adair always succeeds 
    Be careful, however_RRQV you decide to do it!
    

    (However may, of course, also be a general adverb (RR)):

    There were, however_RR, too few people in the audience
    

    MUCH


    Much is tagged DA1 when it functions pronominally or as a determiner:

    There is not much_DA1 point in resisting
    She didn't say very much_DA1
    

    but it is tagged RR when it functions adverbially or pre-modifies an adjectival or adverbial head:

    I don't like that very much_RR
    This one is much_RR better_JJR
    

    As with any (see above), co-occurrence with other ambiguous determiner/adverbs should be checked in case of a double error:

    President Carter plays golf much_RR less_RRR these days 
    He has much_DA1 less_DAR enthusiasm for the game
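
    When checking for such double errors, it can help to pull out every much that is directly followed by another ambiguous word, so that both tags can be reviewed together. The following is a minimal sketch in Python and is not part of CLAWS: the names split_token, much_pairs, FOLLOWERS and GUIDE_PAIRS are hypothetical, and the word list (more, less) is an assumption. A pair that does not match the two examples above is not necessarily wrong; it is simply worth a second look.

```python
# Tag pairs that appear in the guide's own examples for "much" + comparative.
GUIDE_PAIRS = {("RR", "RRR"), ("DA1", "DAR")}

# Assumed (hypothetical) list of ambiguous words that may follow "much".
FOLLOWERS = {"more", "less"}


def split_token(token):
    """Split 'word_TAG' into (word, tag); tag is None for untagged words."""
    word, sep, tag = token.rpartition("_")
    return (word, tag) if sep else (token, None)


def much_pairs(sentence):
    """List each 'much' that is directly followed by a word in FOLLOWERS.

    Each result is (much_tag, next_word, next_tag, matches_guide), where
    matches_guide is True only if the tag pair occurs in GUIDE_PAIRS.
    """
    tokens = [split_token(t) for t in sentence.split()]
    found = []
    for (word, tag), (nxt, nxt_tag) in zip(tokens, tokens[1:]):
        if word.lower() == "much" and nxt.lower() in FOLLOWERS:
            found.append((tag, nxt, nxt_tag, (tag, nxt_tag) in GUIDE_PAIRS))
    return found


for line in [
    "He has much_DA1 less_DAR enthusiasm for the game",
    "He has much_RR less_DAR enthusiasm for the game",
]:
    print(much_pairs(line))
```

    The second input line is an invented mis-tagging, included only to show a pair that would be reported for review.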
    

    NO


    When it means the opposite of yes, no is tagged UH. This is true even when the use is nominal, providing the quotation marks are present:

    A resounding "no_UH" 
    

    If they are absent, the tag should be changed to NN1.

    I'll take that as a no_NN1, then.
    

    (See also Section 2.6.4: Cited Words)

    Otherwise, no is tagged AT, e.g.:

    There is no_AT question of that happening
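
    The three-way choice for no can be written out as a small decision function. This is a sketch in Python under stated assumptions, not part of CLAWS: the function name tag_no and its three yes/no parameters are hypothetical, and the judgements they stand for (whether no means the opposite of yes, whether its use is nominal, whether it is in quotation marks) are still made by the post-editor.

```python
def tag_no(opposite_of_yes, nominal=False, quoted=False):
    """Return the C7 tag for 'no' according to the guideline above.

    opposite_of_yes: True if 'no' means the opposite of 'yes'.
    nominal:         True if that 'no' is used as a noun.
    quoted:          True if the nominal 'no' is inside quotation marks.
    """
    if not opposite_of_yes:
        return "AT"   # "There is no_AT question of that happening"
    if nominal and not quoted:
        return "NN1"  # "I'll take that as a no_NN1, then."
    return "UH"       # plain interjection, or nominal use in quotation marks


print(tag_no(True, nominal=True, quoted=True))   # A resounding "no_UH"
print(tag_no(True, nominal=True, quoted=False))  # a no_NN1
print(tag_no(False))                             # no_AT question
```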
    

    ONE

    MC1 where one precedes a noun or noun phrase, as in:

    one_MC1 book
    one_MC1 bag of spuds
    

    and where it is the head of a noun phrase with a dependent prepositional phrase:

    one_MC1 of the books
    

    and when referring to 'one' as a number entity:

    this is the number one_MC1
    one_MC1 is an integer
    type a one_MC1 at the prompt
    

    PN1 where it is a personal pronoun such as:

    one_PN1 ought to be careful
    one_PN1 doesn't like to make a fuss
    

    and when functioning as a substitute form:

    the prettiest one_PN1 is called Flo
    the one_PN1 you are holding is a bomb
    his idea is not one_PN1 that holds much water
    

    SO


    The CS tag is used when so is equivalent to the expression so that. It has a purposive function:

    We hid it so_CS no one would notice
    He only said it so_CS he could impress us
    

    It is an RR when it occurs, usually after a punctuation mark or at the beginning of a sentence, with a meaning approximating to therefore:

    It is raining, so_RR I am staying at home
    So_RR we gave up the struggle, you see
    He swore at me, so_RR I hit him
    

    It is likewise an RR if preceded by a conjunction in examples like those directly above:

    He swore at me, and_CC so_RR I hit him
    

    In expressions where so is used as a substitute form, and in cases where its use is clearly adverbial (= like that), it is tagged RR:

    substitute:

    so_RR I believe
    I might feel that, but I would never say so_RR
    So_RR did John
    I'm afraid so_RR
    

    adverbial:

    Don't take on so_RR!
    

    It is tagged RG when used in positions where very could occur:

    She is so_RG friendly
    I have never been so_RG angry
    Thank you so_RG much
    

    and when it corresponds to the first as in 'as...as...' comparisons:

    They're not doing so_RG well_RR as_CSA before
    

    TIMES


    Times is now always tagged NNT2, except when it expresses multiplication, where it is tagged II:

    Three times_II two is six
    The number of rows times_II the number of columns
    

    In all the following diverse cases, NNT2 is used:

    Recite your twelve times_NNT2 table
    It's ten times_NNT2 better than before
    (because of the following comparative adjective)
    London is 10 times_NNT2 the size of Lancaster
    (grammatically, 10 times could be replaced by twice)
    How many times_NNT2 must that have happened?
    Those were good times_NNT2
    They clocked up some very fast times_NNT2
    Knock three times_NNT2
    

    Of course, times may also occur as a VVZ (She times_VVZ his response).


    WHEN

    When may be tagged RRQ or CS. When can introduce three types of clause:

  • adverbial clause [Fa],
  • noun clause [Fn],
  • relative clause [Fr].

    When it introduces an adverbial clause or a non-restrictive relative clause, it is a CS. When it introduces either a noun clause or a restrictive relative clause, it is an RRQ. Examples:

    adverbial:

     
    When_CS I arrived, John left 
    John left when_CS I arrived    (at the time at which)
    I smoke when_CS I'm tense      (whenever)
    

    noun clause:

    I cannot remember when_RRQ I was christened
    I don't know when_RRQ the next bus is due 
    (the date/point in time at which)
    

    relative:

    In the year when_RRQ I was born   (in which)
    The moment when_RRQ he arrived    (at which)
    

    Note that when can often be omitted in a relative clause.

    There are also non-restrictive relative clauses introduced by when, which are now to be tagged as CS. Previously they were tagged RRQ. It is no longer necessary to distinguish these from adverbial clauses introduced by when. Here are some examples of non-restrictive relative clauses:

    In 1968, when_CS the students were revolting in Paris...
    

    Here, when could best be paraphrased as at the time when.

    Another example:

     
    School finished at 4 o'clock precisely, when_CS a loud bell sounded
    

    Non-restrictive relative clauses do not define or restrict the meaning of the antecedent. If the antecedent is a precise temporal expression (such as "4 o'clock", "1990", "yesterday"), when is usually a non-restrictive relative.

    These are different from restrictive relatives, such as:

     
    In the year when_RRQ I was born
    

    Here the year is defined by the relative clause. Typically restrictive relatives are not preceded by a comma, and the when can normally be omitted.

    Another use of when_RRQ is in direct questions:

    When_RRQ did you find out?
    

    In abbreviated adverbial clauses, where when is followed by an adjective, a preposition phrase, a non-finite clause etc., when is a CS:

    when_CS ready
    when_CS in doubt
    when_CS arriving late
    

    but before an infinitive, when is an RRQ:

    I don't know when_RRQ to apply
    

    Note that the infinitive clause may be implied:

    Tell me when_RRQ (to start)
    

    and that a noun clause may be abbreviated simply to the word when:

    It was Guy Fawkes, but I can't remember when_RRQ
    

    WHERE

    The tagging of where is consistent with when.
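
    The guidelines for when, which carry over to where, can be condensed into a lookup table. The following is a minimal sketch in Python, not part of CLAWS: the category labels (adverbial, noun_clause and so on) and the names WHEN_TAGS and tag_when are hypothetical, chosen for illustration. Deciding which category a given clause belongs to remains the post-editor's job.

```python
# Clause type or use of "when" -> C7 tag, as described in the WHEN entry.
# The keys are illustrative labels, not CLAWS terminology.
WHEN_TAGS = {
    "adverbial": "CS",                # When_CS I arrived, John left
    "abbreviated_adverbial": "CS",    # when_CS in doubt
    "nonrestrictive_relative": "CS",  # In 1968, when_CS the students ...
    "restrictive_relative": "RRQ",    # In the year when_RRQ I was born
    "noun_clause": "RRQ",             # I cannot remember when_RRQ ...
    "direct_question": "RRQ",         # When_RRQ did you find out?
    "before_infinitive": "RRQ",       # I don't know when_RRQ to apply
}


def tag_when(use):
    """Return the tag for 'when' (or 'where') given one of the labels above."""
    try:
        return WHEN_TAGS[use]
    except KeyError:
        raise ValueError(f"unknown use of when: {use!r}") from None


for use in ("adverbial", "restrictive_relative", "nonrestrictive_relative"):
    print(use, tag_when(use))
```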


    WORTH


    Two tags are allowed: II and NN1. II is used for expressions which could be an answer to the question: how much is it worth? or what is it worth?:

    My records are worth_II a small fortune
    He is worth_II about two million
    It's not worth_II gambling on
    

    It also occurs as a stranded preposition (see Sections 2 and 3) in the questions used to elicit such responses, and in other common constructions:

    What do you think they are worth_II ?
    He knew exactly how much they were worth_II
    She gave it everything she was worth_II
    

    NN1 is used when worth is obviously nominal, and also in expressions where worth is preceded by a quantity, whether or not the quantity in question has been written as a genitive:

    You don't know your own worth_NN1
    I'd like a pound's worth_NN1
    They purchased a million dollars worth_NN1 of equipment
    

    SECTION 5

    CLAWS7 TAGLIST

  • ! punctuation tag - exclamation mark
  • " punctuation tag - quotation marks
  • ( punctuation tag - left bracket
  • ) punctuation tag - right bracket
  • , punctuation tag - comma
  • - punctuation tag - dash
  • ----- new sentence marker
  • . punctuation tag - full-stop
  • ... punctuation tag - ellipsis
  • : punctuation tag - colon
  • ; punctuation tag - semi-colon
  • ? punctuation tag - question-mark
  • APPGE possessive pronoun, prenominal (my, your, our etc.)
  • AT article (the, no)
  • AT1 singular article (a, an, every)
  • BCS before-conjunction (in order (that), even (if etc.))
  • BTO before-infinitive marker (in order, so as (to))
  • CC coordinating conjunction (and, or)
  • CCB coordinating conjunction (but)
  • CS subordinating conjunction (if, because, unless)
  • CSA as as a conjunction
  • CSN than as a conjunction
  • CST that as a conjunction
  • CSW whether as a conjunction
  • DA after-determiner, capable of pronominal function (such, former, same)
  • DA1 singular after-determiner (little, much)
  • DA2 plural after-determiner (few, several, many)
  • DAR comparative after-determiner (more, less)
  • DAT superlative after-determiner (most, least)
  • DB before-determiner, capable of pronominal function (all, half)
  • DB2 plural before-determiner, capable of pronominal function (both)
  • DD determiner, capable of pronominal function (any, some)
  • DD1 singular determiner (this, that, another)
  • DD2 plural determiner (these, those)
  • DDQ wh-determiner (which, what)
  • DDQGE wh-determiner, genitive (whose)
  • DDQV wh-ever determiner (whichever, whatever)
  • EX existential there
  • FO formula
  • FU unclassified
  • FW foreign word
  • GE germanic genitive marker - (' or 's)
  • IF for as a preposition
  • II preposition
  • IO of as a preposition
  • IW with; without as preposition
  • JJ general adjective
  • JJR general comparative adjective (older, better, bigger)
  • JJT general superlative adjective (oldest, best, biggest)
  • JK adjective catenative (able in be able to; willing in be willing to)
  • MC cardinal number neutral for number (two, three...)
  • MCGE genitive cardinal number, neutral for number (two's, 100's)
  • MCMC hyphenated number (40-50, 1770-1827)
  • MC1 singular cardinal number (one)
  • MC2 plural cardinal number (tens, twenties)
  • MD ordinal number (first, 2nd, next, last)
  • MF fraction (quarters, two-thirds)
  • ND1 singular noun of direction (north, southeast)
  • NN common noun, neutral for number (sheep, cod)
  • NNA following noun of title (M.A.)
  • NNB preceding noun of title (Mr, Prof)
  • NN1 singular common noun (book, girl)
  • NN2 plural common noun (books, girls)
  • NNL1 singular locative noun (street, Bay)
  • NNL2 plural locative noun (islands, roads)
  • NNO numeral noun, neutral for number (dozen, thousand)
  • NNO2 plural numeral noun (hundreds, thousands)
  • NNT temporal noun, neutral for number (no known examples)
  • NNT1 singular temporal noun (day, week, year)
  • NNT2 plural temporal noun (days, weeks, years)
  • NNU unit of measurement, neutral for number (in., cc.)
  • NNU1 singular unit of measurement (inch, centimetre)
  • NNU2 plural unit of measurement (inches, centimetres)
  • NP proper noun, neutral for number (Philippines, Mercedes)
  • NP1 singular proper noun (London, Jane, Frederick)
  • NP2 plural proper noun (Browns, Reagans, Koreas)
  • NPD1 singular weekday noun (Sunday)
  • NPD2 plural weekday noun (Sundays)
  • NPM1 singular month noun (October)
  • NPM2 plural month noun (Octobers)
  • PN indefinite pronoun, neutral for number (none)
  • PN1 singular indefinite pronoun (one, everything, nobody)
  • PNQO whom
  • PNQS who
  • PNQV whoever, whomever, whomsoever, whosoever
  • PNX1 reflexive indefinite pronoun (oneself)
  • PPGE nominal possessive personal pronoun (mine, yours)
  • PPH1 it
  • PPHO1 him, her
  • PPHO2 them
  • PPHS1 he, she
  • PPHS2 they
  • PPIO1 me
  • PPIO2 us
  • PPIS1 I
  • PPIS2 we
  • PPX1 singular reflexive personal pronoun (yourself, itself)
  • PPX2 plural reflexive personal pronoun (yourselves, ourselves)
  • PPY you
  • RA adverb, after nominal head (else, galore)
  • REX adverb introducing appositional constructions (namely, viz, eg.)
  • RG degree adverb (very, so, too)
  • RGA post-nominal/adverbial/adjectival degree adverb (indeed, enough)
  • RGQ wh- degree adverb (how)
  • RGQV wh-ever degree adverb (however)
  • RGR comparative degree adverb (more, less)
  • RGT superlative degree adverb (most, least)
  • RL locative adverb (alongside, forward)
  • RP prep. adverb; particle (in, up, about)
  • RPK prep. adv., catenative (about in be about to)
  • RR general adverb (actually)
  • RRQ wh- general adverb (where, when, why, how)
  • RRQV wh-ever general adverb (wherever, whenever)
  • RRR comparative general adverb (better, longer)
  • RRT superlative general adverb (best, longest)
  • RT nominal adverb of time (now, tomorrow)
  • TO infinitive marker (to)
  • UH interjection (oh, yes, um)
  • VB0 be
  • VBDR were
  • VBDZ was
  • VBG being
  • VBM am
  • VBN been
  • VBR are
  • VBZ is
  • VD0 do
  • VDD did
  • VDG doing
  • VDN done
  • VDZ does
  • VH0 have
  • VHD had (past tense)
  • VHG having
  • VHN had (past participle)
  • VHZ has
  • VM modal auxiliary (can, will, would etc.)
  • VMK modal catenative (ought, used)
  • VV0 base form of lexical verb (give, work etc.)
  • VVD past tense form of lexical verb (gave, worked etc.)
  • VVG -ing form of lexical verb (giving, working etc.)
  • VVN past participle form of lexical verb (given, worked etc.)
  • VVZ -s form of lexical verb (gives, works etc.)
  • VVGK -ing form in a catenative verb (going in be going to)
  • VVNK past part. in a catenative verb (bound in be bound to)
  • XX not, n't
  • ZZ1 singular letter of the alphabet (A, a, B, etc.)
  • ZZ2 plural letter of the alphabet (As, b's, etc.)

    NOTE: DITTO TAGS

    Any of the tags listed above may in theory be modified by the addition of a pair of numbers to it: eg. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. For example the expression in terms of is treated as a single preposition, receiving the tags:

    in_II31 terms_II32 of_II33 
    

    The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence. Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto-tags may or may not be required for a particular word sequence:

    at_RR21 length_RR22
    a_DD21/RR21 lot_DD22/RR22
    in_CS21/II that_CS22/DD1
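
    The two-digit suffix described above can be decoded mechanically. The following is a minimal sketch in Python and is not the IDIOMTAG program: the names parse_ditto and is_complete_sequence are hypothetical. It assumes that no ordinary C7 tag ends in two digits (which holds for the taglist above) and that the sequence length is a single digit. It handles one tag at a time and does not split alternatives written with a slash.

```python
import re

# Base tag (letters plus an optional digit, e.g. II, DD1), then the
# sequence length (2-9), then the position within the sequence (1-9).
DITTO = re.compile(r"([A-Z]+\d?)([2-9])([1-9])")


def parse_ditto(tag):
    """Return (base_tag, length, position) for a ditto tag, else None."""
    match = DITTO.fullmatch(tag)
    if match is None:
        return None
    base, length, position = match.group(1), int(match.group(2)), int(match.group(3))
    if position > length:
        return None  # e.g. a tag ending in 23 cannot be a valid ditto tag
    return base, length, position


def is_complete_sequence(tags):
    """True if tags form one full ditto sequence, e.g. II31 II32 II33."""
    parsed = [parse_ditto(t) for t in tags]
    if not parsed or None in parsed:
        return False
    base, length, _ = parsed[0]
    return len(parsed) == length and all(
        p == (base, length, i) for i, p in enumerate(parsed, start=1)
    )


print(parse_ditto("II31"))
print(parse_ditto("DD1"))
print(is_complete_sequence(["II31", "II32", "II33"]))
```

    A check such as is_complete_sequence may be useful during post-editing for spotting a ditto sequence that has been broken by a manual tag change.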