CLAWS (Constituent Likelihood Automatic Word-tagging System) is a suite of computer programs for automatically assigning an appropriate grammatical tag to each word in a body of continuous text.
One or more potential word-tags from the CLAWS version 7 (C7) tagset are assigned using:
CLAWS assigns potential word-tags using a number of rules based on the ending and orthography of the word, and then uses a Hidden Markov Model method for estimating the most likely word-tag in each context. This is a type of statistical language model which calculates the probabilities of a certain sequence of words requiring a certain sequence of grammatical tags.
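The idea of choosing the most likely tag sequence can be sketched in a few lines of Python. This is only a toy illustration of a Viterbi-style search over candidate tags, not the CLAWS implementation. The function name and the transition probabilities are invented for the example, and the sketch uses tag-to-tag probabilities only.

```python
# Toy sketch only: the probabilities below are invented for illustration
# and are NOT taken from CLAWS.
TRANSITION = {
    ("^", "AT"): 0.5, ("^", "NN2"): 0.5,      # "^" marks the sentence start
    ("AT", "NN1"): 0.6, ("AT", "VV0"): 0.01,
    ("NN2", "NN1"): 0.1, ("NN2", "VV0"): 0.3,
}

def most_likely_tags(candidates, transition=TRANSITION, start="^"):
    """candidates: one list of candidate tags per word.
    Returns the highest-scoring tag sequence (Viterbi search)."""
    # best[tag] = (score, path) for the best path ending in that tag
    best = {start: (1.0, [])}
    for tags in candidates:
        step = {}
        for tag in tags:
            score, path = max(
                ((prev_score * transition.get((prev, tag), 1e-6), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            step[tag] = (score, path + [tag])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# "the mushroom" versus "prices mushroom"
print(most_likely_tags([["AT"], ["NN1", "VV0"]]))   # ['AT', 'NN1']
print(most_likely_tags([["NN2"], ["NN1", "VV0"]]))  # ['NN2', 'VV0']
```

With these made-up figures the same word receives a different tag depending on the tag that precedes it, which is the behaviour the paragraph above describes.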
Further information on probabilistic language analysis and the CLAWS programs can be found in The Computational Analysis of English, Garside, Leech and Sampson (1987), especially Chapters 3 & 4 and in Roger Garside's chapter in Short and Thomas (1996).
The lexicon or wordlist consists of approximately 12,000 words, each listed with the possible tags for that word. Each word has between one and six candidate tags.
In effect, words have been added to the wordlist wherever CLAWS, using the probability data alone, failed to assign the correct tag; the list is then consulted by CLAWS before tag assignment is finalised.
For example, using probability data, CLAWS would automatically assign the tag for noun to the word 'mushroom' (NN1). But when we encounter the use of 'mushroom' as a verb (VV0), we know that it is now essential that the lexicon includes the following entry:
mushroom NN1 VV0@
(The @ is a rarity symbol, indicating that this tag applies in less than one in a hundred cases. There is also a % rarity symbol, indicating that this tag applies in less than one in a thousand cases.)
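As an illustration of the entry format, the following Python sketch splits a lexicon line into its candidate tags and rarity markers. The helper name parse_lexicon_entry and the labels "rare" and "very rare" are assumptions made for this example; they are not part of CLAWS.

```python
# Illustrative only: parse_lexicon_entry is a hypothetical helper.
RARITY = {
    "@": "rare",       # less than one in a hundred cases
    "%": "very rare",  # less than one in a thousand cases
}

def parse_lexicon_entry(line):
    """Split 'word TAG TAG@ ...' into (word, [(tag, rarity), ...])."""
    word, *raw_tags = line.split()
    candidates = []
    for raw in raw_tags:
        if raw[-1] in RARITY:
            candidates.append((raw[:-1], RARITY[raw[-1]]))
        else:
            candidates.append((raw, None))
    return word, candidates

print(parse_lexicon_entry("mushroom NN1 VV0@"))
# ('mushroom', [('NN1', None), ('VV0', 'rare')])
```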
However, with the BNC facilities, the use of increasingly large databases is now possible and we are moving towards the compilation of a much larger lexicon, incorporating other large wordlists for example.
The fact remains that it is the post-editor's responsibility to formulate lists of words which are candidates for inclusion in the lexicon, and it is recommended that, where the post-editor finds a CLAWS error, a check on the current lexicon is made, and if necessary a suggestion should be made for a new lexicon entry.
When CLAWS encounters a word that is not found in the lexicon, a basic morphological analysis of the word is carried out in order to assign a candidate tag or tags to the word. The suffixlist is a list of common or predictable word endings, coupled with a list of one or more candidate tags. CLAWS will attempt to match the longest possible suffix to the word and assign the candidate tags associated with that suffix. Here are a few lines of the suffix list:
cle NN1
dle NN1 VV0
fle NN1 VV0
gle NN1 VV0
ile JJ NN1
obile NN1
phile NN1 JJ%
pile NN1
tile JJ
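Longest-suffix matching can be sketched as follows, using only the suffix entries quoted above. The dictionary layout, the function name candidate_tags and the lower-casing of the word are assumptions for the example, not a description of how CLAWS stores or searches its suffixlist.

```python
# Illustrative only: data layout and function name are assumptions.
SUFFIXES = {
    "cle": ["NN1"],
    "dle": ["NN1", "VV0"],
    "fle": ["NN1", "VV0"],
    "gle": ["NN1", "VV0"],
    "ile": ["JJ", "NN1"],
    "obile": ["NN1"],
    "phile": ["NN1", "JJ%"],
    "pile": ["NN1"],
    "tile": ["JJ"],
}

def candidate_tags(word, suffixes=SUFFIXES):
    """Return (suffix, tags) for the longest matching suffix, or None."""
    lowered = word.lower()
    # Try the longest possible ending first, then progressively shorter ones.
    for length in range(len(lowered), 0, -1):
        suffix = lowered[-length:]
        if suffix in suffixes:
            return suffix, suffixes[suffix]
    return None

print(candidate_tags("snowmobile"))  # ('obile', ['NN1'])
print(candidate_tags("fragile"))     # ('ile', ['JJ', 'NN1'])
```

Because longer endings are tried first, "snowmobile" is matched by "obile" rather than by the shorter "ile".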
Post-editors should also suggest new suffixlist entries when they feel it is appropriate.
This is not a list of idioms in the usual sense but a list of multi-word sequences to aid CLAWS disambiguation procedures. The entries fall roughly into two main groups:
Although the list is less than twenty pages in length, its usefulness in helping with CLAWS disambiguation procedures is invaluable.
Ditto tags are used for sequences whose syntactic role in combination differs from the role of the same words in other contexts. For example, the phrase
all of a sudden
would cause CLAWS and post-editors alike a few tagging headaches. We know however that this particular combination of words only occurs as an adverbial form. Therefore this sequence of words can be treated as a single unit for grammatical tagging purposes. By the addition of numerals to the basic tag format we can indicate that this is a sequence of tags, representing a sequence of words which form one grammatical unit. These are called ditto tags. The above example is tagged:
all_RR41 of_RR42 a_RR43 sudden_RR44
(RR is the tag for general adverb)
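Judging from the examples in this document (RR41 to RR44 here, RR21 and RR22 for "at all"), the first numeral gives the number of words in the sequence and the second gives the position of the word within it. The Python sketch below builds ditto tags on that reading; the function name ditto_tag and the limit of nine words are assumptions for the example.

```python
# Illustrative only: ditto_tag is a hypothetical helper.
def ditto_tag(words, tag):
    """Attach ditto tags of the form TAG<length><position> to each word."""
    count = len(words)
    if not 2 <= count <= 9:
        raise ValueError("this sketch handles sequences of 2 to 9 words")
    return " ".join(
        f"{word}_{tag}{count}{position}"
        for position, word in enumerate(words, start=1)
    )

print(ditto_tag(["all", "of", "a", "sudden"], "RR"))
# all_RR41 of_RR42 a_RR43 sudden_RR44
```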
Ditto tags are not included in the lexicon; they only appear in the idiomlist. (Only single orthographic words are included in the lexicon.) For most purposes, ditto tag sequences should be considered as a closed set. As far as possible, post-editors should not invent ditto tags. There are several good reasons for this. Often a decision will already have been reached that a certain collocation should not have ditto tags. Also, some software for processing tagged text (such as SARA) needs to know all ditto tag sequences that occur. If a post-editor feels that it is appropriate to create a new ditto, the following procedure must be followed:
The idiomlist is also currently used for a wide variety of word disambiguation problems. Three examples are given below.
The expression at times is not usually ambiguous, but there are several combinations of tags available from the lexicon. There are four candidate tags in the lexicon for times but whenever this word occurs after at the tag should be NNT2. In order to avoid errors it is a simple procedure to make the following entry in the idiomlist:
at II, times NNT2
The word junior has noun and adjective tags in the lexicon but whenever it occurs after a proper noun (NP1), the tag should be JJ (adjective) so there is an idiomlist entry:
[NP1], junior JJ
Multi-word place-names, such as Alice Springs, can cause problems for CLAWS. There are two candidate tags for "springs" in the lexicon so there is an idiomlist entry solely for the Australian town:
Alice NP1, Springs NP1
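The three entries above can be read as patterns over consecutive words: each comma-separated element is either a word with the tag it should receive, or a bracketed tag that the word in that position must already carry. The Python sketch below applies an entry on that reading. It is a simplification for illustration: the function names are hypothetical, each word is assumed to have a single current tag, and word matching is case-sensitive.

```python
# Illustrative only: a simplified reading of the idiomlist entry format.
def parse_idiom(entry):
    """'at II, times NNT2' -> [('at', 'II'), ('times', 'NNT2')].
    A bracketed element such as '[NP1]' becomes (None, 'NP1')."""
    pattern = []
    for element in entry.split(","):
        parts = element.split()
        if len(parts) == 1 and parts[0].startswith("[") and parts[0].endswith("]"):
            pattern.append((None, parts[0][1:-1]))
        else:
            word, tag = parts
            pattern.append((word, tag))
    return pattern

def apply_idiom(tokens, entry):
    """tokens: list of (word, tag). Returns a retagged copy of the list."""
    pattern = parse_idiom(entry)
    result = list(tokens)
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        matched = all(
            (word == w) if word is not None else (t == tag)
            for (word, tag), (w, t) in zip(pattern, window)
        )
        if matched:
            for offset, (word, tag) in enumerate(pattern):
                if word is not None:
                    result[start + offset] = (window[offset][0], tag)
    return result

print(apply_idiom([("at", "II"), ("times", "NN2")], "at II, times NNT2"))
# [('at', 'II'), ('times', 'NNT2')]
```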
At present the accuracy of the CLAWS tagging system is between 95% and 98%, depending on the type of text. The primary object of post-editing is to correct the erroneous output, for example where a noun has been tagged as a verb, or an adverb as an adjective.
Section 3 of this document gives guidelines for deciding which is the correct tag in cases where the distinction is not always obvious. Section 3.1 deals with pairs of tags which may often be confused and Section 3.2 supplements this with guidelines relating to specific ambiguous words.
The important by-product of post-editing is to improve the tagging system itself. In the past, all post-editing has been done manually, with corrections input using a simple screen-editor. The BNC project, however, involves the accurate tagging of 100 million words, and as the greatest throughput of the tagging and parsing projects was previously 1 million words per year, this gives an indication of the huge task involved.
It is therefore the responsibility of the post-editors to:
The tags correspond to a number of general word classes (nouns, adverbs, determiners etc.). The complete tagset can be seen in Section 5. Within these tag-groups there is normally a 'general' tag, and a number of 'specific' tags, used for various subcategories of the general word-class. e.g.:
Either by searching a lexicon, or by assessing the likely properties of a word on the basis of its form (using a set of 'suffix rules'), CLAWS7 decides which of the tags might be assigned to a particular word, and then calculates which is most likely to be the appropriate one in the given context.
In the case of a word which is in the lexicon, the entry will specify the tags to be considered. e.g.:
jeer NN1 VV0
In practice, CLAWS7 makes few errors in deciding whether a word is a noun or a verb, because it is deciding between two distinct classes of words with very different syntactic properties. However, a word may have different meanings suggesting that it should be allowed 2 or more different tags within the same general word class. e.g.:
I have broken my FOOT
Do you have a one FOOT ruler?
In the first example, 'foot' is clearly a general noun (NN1), as there is no separate tag for 'parts of the body', but in the second example, 'foot' corresponds to the specialized tag NNU, used for 'units of measurement'. In order to assign tags along these lines, we would need to either a) include the tags NN1 and NNU in the lexicon entry for 'foot', or b) include just one of the tags, and correct the output manually.
In fact, we do neither. Instead, we adopt a principle which can be stated thus:
AVOID PROLIFERATION OF TAGGING AMBIGUITIES BY ALLOWING THE MORE GENERAL CATEGORY TO SUBSUME THE MORE SPECIFIC ONE
Thus, foot will always be tagged NN1, whichever of its noun meanings is the case. Unambiguous words or items such as ft. or inch (when the latter is a noun), will, on the other hand, contain the specific tag in the lexicon entry, but not the general one.
To aid automatic tagging or parsing, a few known general-specific ambiguities have been allowed to persist. The notable examples are:
Note also that NN1 and NP1 are considered to be distinct tags and not specific varieties of a generic 'noun' tag, so the principle of the non-proliferation of tags does not apply.
One further important reason for post-editing is to improve the automatic tagging system. From time to time, one may, in the course of post-editing, become aware of a lexicon entry which is the cause of automatic tagging errors, or which conflicts with the principle of 'no general-specific ambiguity'. It may be decided subsequently to amend the lexicon.
Similarly, particular sequences of words may be particularly likely to be erroneously tagged. Often, this is because the words themselves, in that particular sequence, require tags which are different from those they would normally be assigned. e.g.:
at_RR21 all_RR22 (equivalent to a single adverb)
The Far_NP1 East_NP1 (tagged as a proper noun)
In many such cases, the CLAWS7 idiomlist can help to make it more likely that the correct tags will be assigned. Post-editing is therefore a means of discovering word chains which might be included in the idiomlist.
Finally, post-editors frequently encounter usages which seem open to alternative tagging strategies. This is particularly true in the case of a range of types of naming expression. Achievement of a consistent post-edited output therefore rests on the adoption of conventions according to the type of expression. Section 2 includes descriptions of conventions devised so far to deal with the tagging of such naming expressions (see section on nouns, 2.6.3ff).
The main class of adjectives, those which can be used predicatively or attributively (whether or not with the same meaning), are tagged JJ.
JJR is used for comparative adjectives (e.g. whiter) and JJT is used for superlative adjectives (e.g. whitest).
JK (catenative) is used for able, unable and willing in sentences like: "Will you be able_JK to manage?" but not when used as a general adjective as in: "Your son is very able_JJ."
SYNTACTIC AMBIGUITY (see Section 3)
As well as the general adverb (RR), tags exist for degree adverbs (very, too etc.) (RG), prepositional adverbs or particles (RP), locative adverbs (RL), adverbs of time (RT). For the first two of these comparative and superlative tags exist (RRR, RRT, RGR, RGT). Further adverb tags are listed in Section 5.
Adverb words are subject to tagging errors due to a variety of sources of ambiguity. See esp.:
It is within the adverb class that most legitimate cases of general-specific ambiguity exist (see Section 4). Some tags may require changing manually from the general one to the specific one or vice-versa.
In some cases, specific usage is not tagged as such (e.g. really can only be an RR, even when it is an intensifier). In other cases there are specific tags available (e.g. bloody can be an RR or an RG). This is an area of inconsistency that needs to be addressed. Suggestions from post-editors are needed.
Underlying a number of other automatic tagging errors is the fact that some words, though frequently used, occur rarely as adverbs (e.g. but, each, good, no). It is therefore important for the post-editor to be familiar with the principles of tag choice affecting such words.
All determiners capable of a pronominal function receive D tags (see Section 5 for a complete list), regardless of whether they are acting pronominally or not. These are categorised according to the positions in which they may occur in a complex noun phrase.
Words that are only pronouns (they, nobody etc.) are given P tags. The, a/an, no and every are tagged as articles (AT or AT1).
The main source of automatic tagging errors associated with 'D'-words is ambiguity between determiners and adverbs. See Section 3: DAR > RRR (more and less) and Section 4: any; each; much; no.
Errors are sometimes associated with the ambiguity between that as a determiner (DD1), as a conjunction (CST) and as a degree adverb (RG).
The two most commonly occurring types of conjunction are:
The latter are sub-categorized as follows:
The sequence "as well as" may be tagged as a CC idiom. However, at present CLAWS7 tends to mis-tag this due to the 'overlapping' idiom "as_RR21 well_RR22" (meaning 'also'). Attention should therefore be paid to this at the post-editing stage.
Sometimes plus and minus may be tagged CC, but should be tagged II when linking noun phrases, if signifying addition or subtraction (see guidelines for times in Section 4.).
The basic tags for common nouns are NN1, NN2 and NN.
NN is used for nouns that can have plural and singular modifiers yet do not change their form (i.e. words that are morphologically neutral for number) like sheep, and words that have distinct singular and plural meanings, like aids and butchers.
Various subcategories of noun are given more specialized tags (see section 5 for a complete list). Certain conventions have been adopted for the tagging of various types of noun phrase, either automatically, or, if necessary, manually by post-editors. These are described below.
Because of the principle of avoiding ambiguity within a word class between a general and a specific category, many decisions have had to be made about the appropriate tagging for nouns with more than one meaning.
The three subcategories of noun which present most dilemmas are NNL1/NNL2 (locative nouns) and NP/NP1/NP2 (proper nouns). There are also the tags NNT1/NNT2 used for temporal nouns (which may also be used adverbially), though cases of ambiguity between NNT* and NN* seldom arise (two exceptions being fall and spring, which are tagged NN1).
There follows a brief description of the criteria used for deciding when these tags are appropriate. This is followed by details of the conventions adopted for tagging various types of noun-phrase which have been found to raise questions.
These proper noun tags are used for:
Vermont_NP1 Atlantic_NP1 City_NNL1
IBM_NP1, Spar_NP1, Lancia_NP1
but NOT for:
The general principle can be stated as:
Bona fide proper nouns are words that are (i) names of people, (ii) geographical places and (iii) other names of things that are not also common lexical nouns. Bona fide proper nouns are given NP tags. Other nouns that form part or the whole of a naming expression are tagged as common nouns, unless they are words that are normally an NP/NP1/NP2 in any case.
A few examples (there are more for each category below).
HMS_NNB Brilliant_JJ The_AT Queen_NNB Mary_NP1
John_NP1 Smith_NP1 Kate_NP1 Moss_NP1
Shergar_NP1 White_JJ Flash_NN1
Product and company names:
IBM_NP1 Word_NN1 for_IF Windows_NN2 Volkswagen_NP1 Golf_NN1
Place names (see also more below):
Lancaster_NP1 Leicester_NP1 Square_NNL1 Old_NP1 Street_NNL1
The NP tag is used only for those cases where a proper noun is morphologically neutral for number (see NN above). There are therefore a very restricted, but open, number of words that are given this tag. Only those proper nouns that are countable and unchangeable in form come into this category, e.g. Mercedes, Sainsburys, Tescos.
The NNL tags are used for a closed list of words which have a locative meaning, and which occur (normally with a capital letter) in complex expressions for naming geographical places. Institutions have general common noun tags, although there is sometimes a difficult distinction to be drawn between institutional and geographical reference, e.g. The British Museum. Since the NNL tags are only used in compounds, they are only assigned by the idiomlist and not by the lexicon.
Leicester_NP1 Square_NNL1 Old_NP1 Street_NNL1 The_AT Atlantic_NP1 Ocean_NNL1
The full list of words that can be tagged NNL1 is as follows:
NNL1      NNL2
City      Hills
Close     Islands
Hill      Isles
Island    Mountains
Isle
Lake
Lane
Mount
Ocean
Place
River
Road
Sea
Square
Street
Shortened forms of these words get NNL tags, such as Rd, St (meaning Street) and Mtns, etc.
The NNL tag is also still regarded as valid for a word referring, not to the physical location itself, but to the activity or institution which is associated with or contained within that location. Thus:
Downing_NP1 Street_NNL1 registered_VVD its_APPGE disapproval_NN1 ._.
See also the section on place names below.
NNB (where B stands for 'before') is used for a set of nouns which function as a title in a person's name, and is only used when these words are used as a title, e.g.:
Sergeant_NNB Jones_NP1
He was a conscientious sergeant_NN1
footballer_NN1 Geoff_NP1 Hurst_NP1
"Yes, Sergeant_NN1"
Nouns denoting family relations, even though some of them do not meet the "title in a person's name" criterion, are included in the NNB set (e.g. uncle, aunt, auntie). NNB is used for abbreviated nouns of style occurring before a name:
Mr._NNB Jones
Coun._NNB Alf Roberts
NNA (where A = "after") is used for abbreviated nouns of style or title appended to names:
Anne Collins J.P._NNA
Gordon Banks O.B.E._NNA
Note that 'St.', meaning 'Saint', is always tagged NP1.
These tags are used for temporal nouns, which can be used as the head of an adverbial expression. e.g.:
I saw him last week_NNT1
Every year_NNT1, it's the same story
Nouns that also have a non-temporal meaning, such as Spring and Fall, are always tagged NN1.
See also Section 4: "times"
The tagging for place-names and geographical locations consists predominantly of NP, NNL and common noun tags. Words that are tagged as common nouns are parts of names which contain a capitalized word which is not generally recognized to be part of the name itself, but is used instead as a qualifier to the name, and is thus performing its normal lexical function (in spite of the capital). Thus:
New_NP1 York_NP1 West_NP1 Berlin_NP1
New_NP1 York_NP1 West_ND1 East_ND1 Chicago_NP1
The difference between 'West Berlin' and 'East Chicago' is that the former is (or was) an institutionalised name, whereas the latter is not. (The fact that West Berlin is no longer an official entity is not relevant here, since reference to places in the past is possible.) References to places may often contain an NNL1 (or NNL2), such as Lake, Mount, River or Isles. However, a word which may be tagged NNL1/2 is tagged NP1 in cases where it is an integral part of the name, not an additional descriptive qualification in the way NNL-tagged words are. Thus:
Mount_NNL1 Igman_NP1 River_NNL1 Tyne_NP1 Lake_NNL1 Placid_NP1 Fylde_NP1 Avenue_NNL1
Lake_NP1 District_NN1 Street_NP1 Lane_NNL1 Avenue_NP1 Road_NNL1
The test we apply is to see whether e.g. Lake District is a kind of lake. The answer is no, so Lake is an NP1. Lake Placid is a lake however, so it is an NNL1. An extension of this rule is applied in the case of names such as Long Beach, which is not a kind of beach, or Bowling Green, which is the name of a town. In applying this test we do not recognise the derivation of such names, and tag them as proper nouns:
Long_NP1 Beach_NP1 Bowling_NP1 Green_NP1
A qualification to the rule, on the other hand, applies in the case of a plural NNL or NN word which has become part of a singular proper noun. In such cases, the tag NP1 is used. E.g.:
Alice_NP1 Springs_NP1
Beverly_NP1 Hills_NP1
Strawberry_NP1 Fields_NP1
Grand_NP1 Rapids_NP1
Yorktown_NP1 Heights_NP1
It is clear that such names do not function as plural nouns. Compare, for example:
Beverly_NP1 Hills_NP1 is a suburb of L.A.
The Malvern_NP1 Hills_NNL2 are made of granite
It is for this reason that "United States" is tagged:
Note also that a place name preserves its tagging when it is subsumed into a longer naming expression:
Long_NP1 Island_NNL1 Long_NP1 Island_NNL1 Sound_NN1
Other words which are tagged as NP1 when part of a place-name are Greater and St., e.g.:
Greater_NP1 Manchester_NP1 St._NP1 Louis_NP1
There is sometimes a problem deciding how to tag words relating to nationality, language etc. when they have an adjectival form. As a rule, the language is tagged as a noun (NN1), whilst the same word used as an adjective should be tagged JJ:
French_JJ people usually speak French_NN1
The tagging for specific or generic reference is shown in the table below.
If the word for generic or plural specific reference is the same as the adjectival form (i.e. does not undergo morphological pluralisation), then the word is tagged as an adjective, e.g.: The French_JJ (cf. the poor_JJ). In most cases, the word is not available for singular reference. Even when it is (e.g. A Japanese), the tag JJ is retained. The key test is whether these words can take an additional plural ending. Those which cannot are treated as adjectives (see type 1 in the table below). On the other hand, words which can be used either adjectivally or for singular specific reference, but which take a plural ending for generic reference (or plural specific reference), are tagged as nouns when they refer to people (see type 2 in the table below).
Language Person People Adjective NN1 JJ JJ JJ Dutch Dutch Dutch English English English Flemish Flemish Flemish French French French Irish Irish Irish Polish Polish Polish Spanish Spanish Spanish Welsh Welsh Welsh Chinese Chinese Chinese Japanese Japanese Japanese Portuguese Portuguese Portuguese Swiss Swiss Swiss Vietnamese Vietnamese Vietnamese
Language    Person      People       Adjective
NN1         NN1         NN2          JJ
-           African     Africans     African
-           American    Americans    American
Arabic      Arab        Arabs        Arabic/Arab
-           Arabian     Arabians     Arabian
-           Asian       Asians       Asian
-           Australian  Australians  Australian
-           Canadian    Canadians    Canadian
German      German      Germans      German
-           Belgian     Belgians     Belgian
-           Brazilian   Brazilians   Brazilian
-           European    Europeans    European
Hungarian   Hungarian   Hungarians   Hungarian
-           Indian      Indians      Indian
Italian     Italian     Italians     Italian
Norwegian   Norwegian   Norwegians   Norwegian
Russian     Russian     Russians     Russian
In the case of compound adjectives or nouns referring to nationality, the tagging is extrapolated from the tagging applied to the name of the country. Thus:
The West_NP1 German_JJ Chancellor_NNS1
West_NP1 Indian_JJ food_NN1
The West_NP1 Indians_NN2
Puerto_NP1 Ricans_NN2
A South_NP1 African_NN1
South_NP1 African_JJ policies
The guidelines for tagging words of adjectival form relating to race and/or faith are parallel to those for nationality. Those which can be pluralised may be tagged NN1, NN2 or JJ:
A South_NP1 African_JJ Black_NN1
A Black_JJ South_NP1 African_NN1
A Black_JJ youth_NN1 and three whites_NN2
White_JJ supremacy_NN1
Black_JJ Liberation_NN1
black_JJ nationalists_NN2
black_JJ nationalist_JJ campaigners_NN2
A Roman_JJ Catholic_JJ priest
A Roman_JJ Catholic_NN1
Moslem_JJ customs_NN2
A group of Buddhists_NN2
Words representing points on the compass are tagged ND1, whether they are used adjectivally, nominally or adverbially. This applies whether they are simple words, hyphenated words or abbreviations. e.g.:
north-east_ND1 S.E._ND1 south_ND1 southwest_ND1
Their derivative '-ern' adjectives are tagged JJ:
These rules are overridden in cases where:
1. The word is an essential part of the name of a country, region or place (see paragraphs 1 & 4 above):
The Middle_NP1 East_NP1 West_NP1 Germany_NP1
2. When the word stands alone in reference to a company or similar organization (see paragraph 6a below):
Following the revelations of malpractice in the Eastern_JJ Railway_NN1 Co._NN1, the head of Eastern_NP1 has resigned
3. When the word is a proper noun in its own right. e.g.:
Colonel_NNB Oliver_NP1 North_NP1
A typical company name would be tagged thus:
Schlitz_NP1 Brewing_NN1 Co._NN1
The proper noun tag is retained for elements of the title which are clearly names (e.g. surnames), but other elements are tagged as they would be normally. Thus:
Filmpower_NP1 Chrysler_NP1 Motor_NN1 Corp._NN1 Safeway_NP1 Limited_JJ
If a proper noun happens to coincide with a common noun, we still assign an NP-tag:
Gateway_NP1 Productions_NN2 Inc._JJ Storeys_NP1 Ltd._JJ
But if the title is merely composed of common lexical items, proper noun tags are not used:
Resorts_NN2 International_JJ News_NN1 International_JJ Federated_JJ Department_NN1 Stores_NN2
This distinction is not always obvious. The test used to judge whether an NN tag or an NP tag is appropriate is to ask whether these words are used in something approximating to the normal way. Since, presumably, Gateway Productions do not make gates, and Storeys are probably named after a person called Storey, there is a sense in which they are much further from their normal lexical function than the words 'Resorts', 'Federated' and 'News' in the examples above.
A different tagging strategy is employed when the company title is truncated to a single word. In such cases we use the NP1 tag:
A spokesman for Federated_NP1 said...
When I started working for International_NP1 ...
See also the section above on 'directional words' (Southern, East etc.), which often form part of corporate titles, and where similar considerations apply. Note that commerciality is not a criterion for applying the above guidelines. Non-commercial organizations are tagged in the same way:
Maine_NP1 Correctional_JJ Centre_NN1 New_NP1 York_NP1 Central_JJ Youth_NN1 Club_NN1
It is characteristic of such titles, that examples will always occur which show up the inadequacy of any concise guidelines for tagging and tag-correction. For instance, in the case of 'Pan American' and 'Pan Am', it was decided these should always be tagged as proper nouns:
Pan_NP1 Am_NP1 Pan_NP1 American_NP1
and that 'General Electric' should be tagged:
Probably the best way to ensure that the tagging of a large body of text remains as consistent as possible is to build up a 'caselaw' of such tagging decisions as they are made, and if possible to add them to the idiomlist.
Product names are given NP tags when the words do not coincide with common lexical items, or when they are bona fide proper nouns:
Cadillac_NP1 Eldorado_NP1
I drive a Mini_NP1
He wasn't very good with Hoovers_NP2
This also applies when the NP term precedes the head of the phrase:
A Burberry_NP1 raincoat
A Boeing_NP1 airliner
Names of sports teams (e.g. The Green Bay Packers; The Chicago Bears; New York Rangers; Boston Bruins; the Blue Devils) are tagged as though they were common nouns. This is the case even when the head of the team-name has the appearance of a proper noun (e.g. The Finns). This rule only applies to the head of the naming expression, and to words which are explicitly referring to teams. Such words will usually be plural, but see example 5, below. Other elements of a team's name (e.g. a place name) should be tagged as they would be in other contexts. When a place-name is substituted for the full name of a team (see example 6), it retains its NP tags, as the team reference is implicit.
These points are illustrated in the examples that follow.
Manchester_NP1 United_JJ Birmingham_NP1 City_NN1 Oldham_NP1 Athletic_JJ Queen_NN1 of_IO the_AT South_ND1 Tottenham_NP1 Hotspur_NP1 New_NP1 York_NP1 Rangers_NN2 The_AT Buffalo_NP1 Sabres_NN2 Indiana_NP1 Pistons_NN2 Green_NP1 Bay_NNL1 Packers_NN2 Chicago_NP1 Bear_NN1 Chuck_NP1 Smith_NP1 Los_NP1 Angeles_NP1 beat_VVD the_AT Flying_JJ Finns_NN2
We have adopted the convention of tagging the words in horses' names as though they were ordinary lexical items, and ignoring the capital letters:
Arabian_JJ Knight_NN1
Black_JJ and_CC White_JJ
Fish_NN and_CC Chips_NN2
King_NNB Arthur_NP1
Happy_JJ Days_NNT2
Note that Arthur is tagged NP1 because it is unquestionably a bona fide proper noun.
These are treated in the same way as names of horses, a common tag being JJ:
H.M.S._NNB1 Invincible_JJ
H.M.S._NNB1 Tenacious_JJ
H.M.S._NNB1 Tiger_NN1
Sir_NP1 Galahad_NP1
The_AT Queen_NNB Elizabeth_NP1 II_MC
Note that in the last two examples, the NP1 tag is used for personal names. The same would apply for quasi-proper nouns:
We have adopted the convention of tagging the titles of newspapers as common lexical items:
The_AT Sun_NN1 The_AT Daily_JJ Telegraph_NN1
However, the following points should be noted:
(1) Times is tagged NP1 rather than NNT2 when it occurs in a newspaper title.
(2) If the name occurs as part of a company name, there may be a conflict with the guidelines under Company Names above. In this case, we give priority to the test described there on page 21:
Mirror_NP1 Group_NN1 Newspapers_NN2 The owner of Today_NP1 Newspapers_NN2
As with names of Horses and Ships, we attempt to keep NP-tagging to a minimum, using it only for words which would be NP1 or NP2 in other contexts:
What_DDQ Katy_NP1 Did_VDD Next_MD
The_AT Diary_NN1 of_IO a_AT1 Nobody_NN1
Frankenstein_NP1
Life_NN1 the_AT Universe_NN1 and_CC Everything_PN1
The conventions adopted for tagging names of hotels are similar to those for names of companies. In other words, where the full name is given, NP tags are restricted to bona fide proper nouns, e.g.:
Park_NN1 Lane_NNL1 Hotel_NN1
Truncated names are changed to NP1 in order to avoid an adjective standing alone as the head of a noun phrase, but not if the truncated name is a noun in any case:
These points are illustrated further in the following examples:
The Cumberland_NP1 Hotel_NN1
The Post_NP1 House_NN1
The Post_NP1 House_NN1 Hotel_NN1
The White_JJ House_NN1 Hotel_NN1
The Imperial_JJ Hotel_NN1
Let's have a drink at the Imperial_NP1
As far as possible, these are tagged using ordinary lexical item tags, including, where appropriate NNT1:
At a Republic_NN1 Day_NNT1 gathering
New_JJ Year_NNT1 's_GE Day_NNT1
Lincoln_NP1 Day_NNT1
NNT1 tags are used for Christmas, Passover, Easter etc:
Next Christmas_NNT1
Easter_NNT1 Sunday_NPD1
Christmas_NNT1 Day_NNT1
Words like 'no' and 'must', which may be used as if they were nouns, are governed by tagging conventions which vary according to the presence or absence of quotation marks, and according to whether or not the word in question has been pluralised. When such a 'cited word' has its normal (singular) form, it is given its normal tag if quotation marks are used:
That sounds like a "yes"_UH to me
It's a "maybe_RR" rather than a "must_VM"
However, when there are no quotation marks present, we use a tag appropriate to the context (usually NN1):
A resounding no_NN1
An absolute must_NN1
Plural cited words are tagged NN2 whether or not quotation marks are used:
No ifs_NN2 or buts_NN2, just do it!
The noes_NN2 have it
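The three conventions for cited words amount to a small decision procedure, sketched below in Python. The function name cited_word_tag and its parameters are assumptions for this example; the choice of context tag remains a matter for the post-editor's judgement.

```python
# Illustrative only: cited_word_tag is a hypothetical helper.
def cited_word_tag(normal_tag, quoted, plural, context_tag="NN1"):
    """normal_tag: the tag the word has in ordinary use (e.g. 'VM' for must).
    quoted: True if the word appears in quotation marks.
    plural: True if the cited word has been pluralised."""
    if plural:
        return "NN2"        # plural cited words: NN2, with or without quotes
    if quoted:
        return normal_tag   # singular form in quotation marks: normal tag
    return context_tag      # otherwise a tag suited to the context, usually NN1

print(cited_word_tag("VM", quoted=True, plural=False))   # VM
print(cited_word_tag("VM", quoted=False, plural=False))  # NN1
```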
Most prepositions are tagged II. More specific tags are used as follows:
A preposition-type word will receive an adverbial particle tag when used in phrasal verb constructions, or when having the function of an adverbial in the sentence or clause, e.g.:
Japanese companies have insisted on keeping down_RP sales of US cars
He did not rule out_RP use of a surcharge
Rota put the Cannocks up_RP 4-3 in the second period
Seattle reeled off_RP six points for a 17-point lead
Out_RP in the garden, the dog was running around_RP
A list of possible RPs:
'bout RG II RP@
about II RG% RP@
along II RP
around II RP RG@
away RL RP JJ%
back RP NN1 JJ@ VV0%
by II RP%
down RP II@ NN1% VV0% JJ% NP1:%
in II RP@ .NNU%
off RP II JJ%
on II RP@
on/off RP
out RP II%
over II RP JJ% RG@ NN1%
round JJ II RP NN1@ VV0@
through II RP@ JJ%
thru II RP@ JJ%
to TO II RP%
under II RP@ RG@
up RP II@ VV0%
A similar potential ambiguity exists with words which may be tagged either as prepositions or as 'RL' (locative adverb), e.g.:
I walked across_II the park
He looked across_RL and saw Jim
Words which may be tagged RL or II (IW):
aboard RL II
above II RL JJ@
across II RL@
alongside II RL
astride RL II
behind II RL@ NN1%
below II RL RG
beneath II RL@ JJ%
beside II RL%
between II RL%
beyond II RL@ NN1%
ere CS II RL
inside II RL NN1@ JJ@
near II RL JJ@ VV0@
nigh RL II@
opposite JJ II% NN1@ RL@
outside II RL JJ NN1@
past NN1 II RL JJ
throughout II RL@
underneath II RL NN1@
within II RL@
without IW RL% RR%
There are no RL/RP ambiguities in the lexicon. One or the other is preferred in each case, or else the RR tag is used.
Stranded prepositions are liable to be automatically tagged RP, and should be changed to II. They occur when the preposition becomes detached from its noun phrase as may happen in various types of clausal construction, eg:
Question: What team do you play in_II ? (In_II which team do you play?)
Relative Clause: I know the story you are talking about_II ( ... about_II which you are talking)
Passive: The car was worked on_II by a fool (A fool worked on_II the car)
SYNTACTIC AMBIGUITY (see Section 3)
Apart from modals (see below), tags for all verbs except be, do, and have contain the letters 'VV'. In the case of the 3 verbs mentioned, this changes to 'VB', 'VD' and 'VH' respectively.
The third element of the tag makes distinctions of form/function as follows:
applaud_VV0; have_VH0; be_VB0; do_VD0
plays_VVZ; does_VDZ; has_VHZ; is_VBZ
liked_VVD; took_VVD; had_VHD; did_VDD
liked_VVN; taken_VVN; had_VHN; done_VDN; been_VBN
saying_VVG; doing_VDG; being_VBG; having_VHG
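The pattern in the examples above (a two-letter stem chosen by the verb, plus a third element for the form) can be sketched as a small lookup. This is an illustration under stated assumptions: the form labels (`base`, `s_form`, `past`, `past_participle`, `ing_form`) are hypothetical names, modals are excluded, and the past tense of be is refused because the examples do not list it (elsewhere in these guidelines was is tagged VBDZ).

```python
# Stems for the three special verbs; every other lexical verb uses 'VV'.
STEMS = {"be": "VB", "do": "VD", "have": "VH"}

# Third element of the tag, keyed by hypothetical form labels.
FORMS = {
    "base": "0",
    "s_form": "Z",
    "past": "D",
    "past_participle": "N",
    "ing_form": "G",
}


def verb_tag(lemma: str, form: str) -> str:
    """Compose a verb tag from the lemma's stem and the form element."""
    if lemma == "be" and form == "past":
        raise ValueError("past tense of 'be' is not covered by this sketch")
    return STEMS.get(lemma, "VV") + FORMS[form]


print(verb_tag("take", "past_participle"))  # VVN
print(verb_tag("have", "past_participle"))  # VHN
```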
In cases of verb-form ambiguity, function takes precedence. Thus, the word 'put' should be tagged VV0, VVD or VVN according to its grammatical function. Automatic tagging errors are sometimes associated with this source of ambiguity, particularly where no auxiliary co-occurs with a VVN (or VHN), or where the auxiliary is several words distant from its associated past participle (in questions for example).
Contracted forms of 'have', 'had' etc. should carry the same tags as the complete forms. Errors may be associated with certain ambiguous forms:
'd_VHD = 'had'; 'd_VM = 'would'
's_VHZ = 'has'; 's_VBZ = 'is'; 's_GE = genitive 's'
Contracted negated forms are broken up by CLAWS7 into their constituent parts and tagged separately:
is_VBZ n't_XX
have_VH0 n't_XX
will_VM n't_XX (= won't)
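The splitting of contracted negatives can be sketched as follows. This is not the CLAWS7 implementation, only an illustration of the behaviour shown in the examples: the function name is hypothetical, and won't is the only irregular form handled because it is the only one the examples confirm.

```python
from typing import List


def split_negation(token: str) -> List[str]:
    """Split a contracted negative into its verb part and n't."""
    lowered = token.lower()
    if not lowered.endswith("n't"):
        return [token]
    if lowered == "won't":
        # Shown in the examples as will_VM n't_XX.
        return ["will", "n't"]
    # Regular case: everything before n't is treated as the verb.
    return [token[:-3], "n't"]


print(split_negation("isn't"))  # ['is', "n't"]
print(split_negation("won't"))  # ['will', "n't"]
```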
Modal auxiliaries are tagged 'VM'. A list is given below:
'd VM VHD
'll VM
'ud VM
can VM NN1% VV0%
could VM
dare VV0 VM@ NN1%
may VM NPM1: NP1:%
mayst VM
might VM NN1%
must VM NN1%
need NN1 VV0 VM@
shall VM
should VM
will VM NN1@ VV0% NP1@
wilt VV0 VM NN1%
would VM
The following verb-forms receive 'K' (catenative) tags when used as in the examples:
he is bound_VVNK to arrive soon
Do you think it is going_VVGK to rain?
We used_VMK to think it was impossible
We ought_VMK to leave
(ought is always VMK)
1. VVG > NN1 (see Section 3)
2. VVG > JJ (see Section 3)
3. VVN > JJ (see Section 3)
4. VVG > VVGK (going)
5. VMK > JJ > VVN > VVD (used)
6. VV0 > NN1
7. VVZ > NN2
8. VVD > VVN
9. VHD > VHN (had)
Single-word expressions are given tags appropriate to the word's use in English:
We do not attempt to assign a tag according to the class of a word within its language of origin, but rather according to its syntactic use in the English sentence. Thus:
is tagged as singular in spite of the Italian plural ending. Multi-word expressions are generally treated as units, and given ditto tags (see Section 4: Ditto-tags) appropriate to the expression as a whole:
Pate_NN131 de_NN132 cheval_NN133
fin_JJ31 de_JJ32 siecle_JJ33
in_RR21 extremis_RR22
personae_NN231 non_NN232 gratae_NN233
For foreign expressions which are not naturalised in any appreciable way (as in quotations or book titles, for example), the tag FW is used for each word. Since a post-editor will apply an FW tag to any foreign word whose meaning he or she is unable to fathom, there is obviously a fuzzy boundary to the FW word class. Company names are usually tagged NP1 (see also 2.9.4 below) e.g.:
Volkswagen_NP1, Alfa_NP1 Romeo_NP1
In cases such as those in the examples below, FW is used:
J'y_FW suis_FW, j'y_FW reste_FW
festina_FW lente_FW
che_FW sara_FW sara_FW
c'est_FW la_FW vie_FW
It would be difficult or misleading to tag them with tags from an English tagset - for example sara above is a third person singular future indicative verb form, yet obviously there is no need for such a tag in English. Similarly festina above is a singular imperative in Latin and there is no such tag for English.
No changes are to be made manually to contracted combinations of words such as j'y or n'est which are to be tagged by CLAWS7 as single words.
Words such as de, van and von are tagged NP1 when part of a name, e.g.:
Ludwig_NP1 van_NP1 Beethoven_NP1, Ferdinand_NP1 de_NP1 Saussure_NP1
Company and country names are usually tagged NP1 as well even if all of the words are foreign and non-naturalised:
Credit_NP1 Lyonnais_NP1
Banque_NP1 Nationale_NP1 de_NP1 Paris_NP1
les_NP1 Etats_NP1 Unis_NP1
Expressions such as per or a la are given normal preposition tags, unless they are subsumed under a longer expression which should be tagged as a unit:
per_II pound_NNU1
a_II21 la_II22 Lancaster_NP1
per_JJ21 RR21 diem_JJ22 RR22
per_NNU21 cent_NNU22
a_JJ31 RR31 la_JJ32 RR32 carte_JJ33 RR33
Interjections are tagged UH. Some words are always considered as interjections, for example: aha, blimey, crikey, ha, huh, oh, sh, um, yes. Other words are sometimes tagged as interjections, and sometimes not, for example:
adieu UH NN1@
boo UH VV0@ NN1@
bye UH NN1%
clonk NN1 VV0 UH
hallelujah UH NN1@
no UH AT RR%
The only words that should be tagged as UH are those indicating exclamation, or some other kind of interactive signal (which is not integrated with the syntax of the sentence), e.g. yes, no, whoa.
The following are all tagged as FU (unclassified word): other types of exclamation (oops), onomatopoeic words that are not exclamatory (whoosh, ding), transcriptions of non-linguistic utterances ('de do da da da da'), hesitations and stutters (er, erm), truncated words, etc. Also exclamatory words or expressions which retain the spelling of a word in another class, from which they derive, are not tagged UH - e.g.:
God_NP1 Almighty_JJ
Sure_JJ
Bless_VV0 you_PPY
Post-editors should look at the full list of lexicon entries for FU and UH to help clarify this area.
There are several tags for different types of strings representing numbers and strings containing numbers. These are:
foot is tagged NN1 and feet NN2 even when they are units of measurement, in order to minimise ambiguity.
See also one in section 4.
CSA / II / RG see Section 4: as
more and less can be assigned either of these tags (DAR or RRR).
The difference between them is that DAR is for noun-phrase-like (and determiner) uses of the word in question, whereas RRR is for adverbial uses. The two can be difficult to distinguish, particularly after a verb: eg:
You should relax more_RRR
You should spend more_DAR
Since relax is an intransitive verb in this context, more cannot be a noun phrase. Instead, one can paraphrase it roughly as "to a greater extent". On the other hand, spend is a transitive verb, and so more is a DAR in this context. (We can notice that more after spend is the direct object of the verb, because it can be made the subject of a passive: "More should be spent..."). There are some verbs for which the distinction is less clear than in these examples, eg:
You should eat more
You should smoke less
Note that the verb may be used transitively or intransitively with almost identical meanings, so that the syntactic structures of the immediate and/or surrounding context are the only clue as to which is the case:
"Do you smoke?" (Intransitive)
"How many do you smoke a day?" (Transitive)
At the moment we have 23 fixtures per season. Personally, I would rather play more_DAR
If you're going to make the big time, I can see you'll have to play more_RRR, and not just wait for the ball to come to you.
(see also RG / RR for degree and general adverb tagging of 'more' and 'less').
(a) He ran down_II the hill
(b) He ran down_RP his friends
In (a), down is a preposition because:
He ran quickly down the hill
*He ran viciously down his friends
This is the hill down which he ran
Down which hills do you like running?
In (b), down is an adverbial particle because:
He ran his friends down_RP
*He ran the hill down
He ran them down
*He ran down them
She put the cat out_RP > She put IT out_RP
She went through_II the gap > She went through_II IT
Notice that the syntactic distinction between down_RP and down_II is independent of the semantic distinction between locative and non-locative uses of down. When the verb is simply followed by down or out etc., without a following noun phrase, it is normally an RP:
Income tax is coming down_RP
The decorations were taken down_RP on 12th night
However, tagging errors may occur with stranded prepositions which are denuded of their noun phrase because it has been fronted or ellipted (eg. in relative clauses, passives, questions etc.):
This is the hill (which) she ran down_II (ie. This is the hill down_II which she ran)
On Shrove Tuesday, this hill will be run down_II by housewives (ie. Housewives will run down_II it)
Which car did you arrive in_II? (ie. In_II which car did you arrive?)
The same tests apply to words which are tagged either as prepositions or as locative adverbs RL eg. across, past, behind etc. (See section 3 for lists).
Words ending in -ing, when they premodify a noun, may be tagged either NN1 or JJ, eg:
New_JJ spending_NN1 reductions_NN2
her_APPGE acting_NN1 ability_NN1
a_AT1 working_JJ mother_NN1
(but see also JJ / VVG).
If "X-ing NOUN" is equivalent in meaning to "NOUN which X-es" - (ie. if the NOUN is the notional subject of the verb X) - then "X-ing" is a JJ.
The smiling_JJ children (i.e. The children are smiling)
In other cases, X-ing is an NN1. In such cases, it is often possible to paraphrase X-ing NOUN by a more explicit phrase in which X-ing is clearly a noun. eg:
new spending_NN1 reductions (new reductions in spending)
her acting_NN1 ability (her ability in acting)
A boxing_NN1 match
A falling_JJ rate of exchange
Slimming_NN1 tablets
The mating_NN1 season
a couple of mating_JJ chimpanzees
After a verb or an object, there is sometimes a tricky choice between JJ and RR, or between JJR and RRR. eg:
They arrived tired and hungry
Here, both "tired" and "hungry" are JJ. The main test is to see whether you can express the relation between these words and their logical subject, using the verb "be": "They arrived tired and hungry" implies "They were tired and hungry". The JJ/RR word refers to a property of a noun, rather than to a property of an event or a situation. Contrast:
Peter sang out loud_RR and clear_RR
This sentence does not imply that Peter was loud and clear, but is more or less equivalent to "Peter sang out loudly and clearly". It means that his SINGING was loud and clear. It follows that when, in colloquial English, a word which we normally expect to be an adjective is used as an adverb, we tag it RR. eg:
We did terrific_RR today
A simple pair of examples where the JJ/RR word follows an object:
I thought the game too long_JJ (the game was too long)
They work their staff very hard_RR (NOT "the staff are very hard")
Also JJR / RRR:
They'll have to make the taxes higher_JJR (The taxes will be higher)
You'll have to aim higher_RRR
Note: well is an adjective when it is the opposite of ill:
Mary is/feels well_JJ
Otherwise it is an adverb:
"He writes well_RR".
The tagging of words like "surprised" in "John was surprised", or "lasting" in "the effect was lasting" can be a problem. In both cases, the word can be a JJ. One test is to see whether you can insert an adverb like "very" in front of the word. eg. in "John was very surprised", "surprised" is a JJ.
Another test, having the opposite effect, is to see whether there is an agent "by"-phrase following an "ed/en" word. If so, it is a VVN. eg. in "John was surprised by the pirates", "surprised" is a VVN. Even where it is not present, the possibility of adding a "by"-phrase, without changing the meaning of the word, is evidence in favour of a VVN. (However, this criterion can clash with the preceding one, since it occasionally happens that an "ed" word is preceded by an adverb like "very" AND followed by a "by"-phrase: eg. "John was very offended by her remarks". Fortunately, such cases are rare. When they do occur, however, give preference to JJ).
A third test is negative: to see whether the word in question can be placed before a noun. eg:
The effect is lasting: a lasting effect
This shows that "lasting" can be (but need not be) a JJ. If the word could not be placed (with the same meaning) before the noun, this would be evidence that the word is not a JJ, but a VVG or a VVN.
Even though an "-ing" word is normally a VVG after the verb "be" it is generally treated as a JJ before a noun:
The man was dying_VVG
The dying_JJ man
When the -ing or -en/ed word forms part of a phrase premodifying the noun, as in the following examples, the VVG/VVN tag is preferred:
interest_NN1 earning_VVG account_NN1 a hypothesis_NN1 driven_VVN approach_NN1
In these examples, the NN1 VVG sequence is similar in function to a compound pre-modifying adjective. In hyphenated form they would be given a JJ tag. The same applies when the phrase is a noun-like compound. eg:
a [ carol_NN1 singing_VVG ] contest_NN1
If the verb be can be replaced by another verb such as seem or become, without changing the meaning of the following JJ/VVN word, this is a strong indication that the construction is not properly a passive, and that the word is a JJ. eg:
The building was infested_JJ with cockroaches
(The building became/seemed infested...)
I could see he was favourably disposed_JJ to the idea
(He seemed favourably disposed...)
A further distinction which can be used as a test with 'event' verbs is that the JJ refers to a 'resultant state', whereas the VVN refers to an event. eg:
Bill was married_JJ (as opposed to single)
Bill was married_VVN to Sarah on May 14th (the actual event)
Some further examples:
Three people were injured_VVN in the accident
I could see he was (seemed) injured_JJ
He lay injured_JJ on the road
We have three injured_JJ players in the side
Our players are not worried_JJ
She is not worried_VVN by that sort of threat
JJR / RRR see JJ / RR
NN1 / JJ see JJ / NN1
Note that NP2 is not used for names of teams, even those which are apparently not common nouns. NP2 is used for proper nouns which happen to be plural, eg.
The Rockies, The Hebrides
for plural product names, eg.
Lancias_NP2 are pretty fast
and for naming families, eg.
The Staffords_NP2 are always quarrelling.
RG is restricted to adverbs of degree (also called intensifiers, etc.) which precede the word or expression they modify. Clear cases of RG are very, and so and as in comparatives (see section on as below).
Adverbs which have a range of functions, including adverb of degree, are not normally tagged RG, but are given the more general RR tag instead.
She_PPHS1 was_VBDZ scantily_RR clad_JJ
Here 'scantily' is an RR rather than an RG because it could also occur after a verb:
She_PPHS1 dressed_VVD scantily_RR
This is another case of the general principle of avoiding general-specific ambiguities within a word class. RG is usually only for words which do not have a more general range of adverbial uses.
There are exceptions to this, however. (See Section 2: Adverbs. See also Section 4: so). The words which may be tagged RG or RR are:
She is so_RG attractive
I would think so_RR
This is too_RG heavy
Can I come too_RR?
That's rather_RG nice
I would rather_RR go out
He's quite_RG talkative
Quite_RR, I agree
Note that about may be an RP or an RG. However, this does not violate the principle mentioned above, since both RP and RG are sub-categories of RR:
He's about_RG 12, I think
Stop messing about_RP
any is tagged DD when it functions pronominally or as a determiner:
Do it any_DD way you like
I'm afraid I haven't got any_DD
and RR when it modifies an adverb or an adjective:
They are not called that any_RR longer_RRR
I cannot run any_RR faster_RRR
It was not really any_RR better_JJR than before
Note that the word following may also be ambiguous between adverb and determiner. In such cases, it is possible that both may be erroneously tagged, and require correction thus:
You won't feel any_DD more_DAR pain
If you have any_DD more_DAR, you'll burst
He doesn't play chess any_RR more_RRR
as can be tagged RG, II or CSA.
It is an RG when it occurs before an adjective, adverb or determiner (and sometimes other words) in phrases such as:
I don't think that one is as_RG good
I go there as_RG often (as...)
There are not as_RG many (as...)
In the 2nd and 3rd examples above, the second as is always a CSA because it introduces a comparative construction (an equal comparison, as contrasted with an unequal comparison introduced by than). Thus, in the following, the second as is tagged CSA:
She's not as_RG (or so_RG) pretty as_CSA I thought
An ostrich can run as_RG quickly as_CSA a zebra
He has as_RG many as_CSA six children
Notice that as in this comparative use is tagged CSA whether or not it introduces a clause, as normally understood. In the second case above, as precedes a noun phrase. In the following, it precedes an adjective:
Please come as_RG quickly as_CSA possible
CSA is also the tag used when as introduces other clauses (eg. clauses of time or clauses of reason). eg:
As_CSA I arrived, he was leaving
I'll lend you the money, as_CSA you're my friend
II is the tag for as as an undoubted preposition - it usually has an equative meaning, as in:
They regard him as_II a friend
As_II governor of the province, I have to take action
The guideline restricts II to cases of as followed by a noun-phrase-type structure - which may be a pronoun. If as is followed by an adjective, a past participle etc., it is tagged CSA, even though it has the same equative type of meaning as as_II. eg:
The novel as_CSA originally written
Many people regard his paintings as_CSA hideous
But is most commonly a CCB, but there are rare cases when it can be an RR, an II or a CS.
It is an RR in phrases such as:
You can but_RR try
We could not but_RR offer our help
It is an II when it has a meaning like except or apart from, eg:
All but_II one of us
We've asked everyone but_II the doctor
I've tried everything but_II taking tablets
Everything but the girl.
It is a CS when it introduces a clause such as:
There's no doubt but_CS he's the guilty party (rare)
There was nothing for it but_CS to give her the job
She would do nothing but_CS fly combat missions
Otherwise it is a CCB (co-ordinating conjunction):
I like this but_CCB I don't like that (co-ordinated sentences)
I like this one but_CCB not that one (co-ordinated noun phrases)
When each could be replaced by apiece, it is tagged RA. Otherwise it is tagged as DD1:
Five pounds each_RA is a bit steep
They scored a goal each_RA

They each_DD1 scored a goal
We go fishing each_DD1 Sunday in the Summer
Each_DD1 one a peach
I'll give you a fiver for each_DD1
His is tagged APPGE when it is a pre-nominal possessive pronoun, ie. when it is part of the set my; your; her etc.:
It was his_APPGE fault
It is tagged PPGE when it is a nominal possessive pronoun, ie. when it is part of the set: mine; yours; hers etc.:
John's not here, so use his_PPGE
how may be tagged as an RGQ or as an RRQ. As an RGQ it always premodifies another word, for example an adjective or an expression of quantity:
How_RGQ much_DA1 opposition is there?
I do not know how_RGQ willing_JJ he is
how as an RRQ, has a general adverbial meaning, and can often be paraphrased by an expression such as by what means or in which manner:
How_RRQ will you manage?
I wonder how_RRQ it will look
How_RRQ are you? How_RRQ does it feel?
How_RGQ implies a question which could be answered with the phrase in question, but with how replaced by a degree adverb (RG). eg:
I'm not sure how_RGQ likely it is (How likely is it? It is very_RG likely)
Note that the same principles apply to the word however (RGQV; RRQV), and the expression no matter how:
No_RGQV31 matter_RGQV32 how_RGQV33 difficult_JJ the situation, Red Adair always succeeds
Be careful, however_RRQV you decide to do it!
(However may of course also be a general adverb (RR) ):
There were, however_RR, too few people in the audience
Much is tagged DA1 when it functions pronominally or as a determiner:
There is not much_DA1 point in resisting
She didn't say very much_DA1
but it is tagged RR when it functions adverbially or pre-modifies an adjectival or adverbial head:
I don't like that very much_RR
This one is much_RR better_JJR
As with any (see above), co-occurrence with other ambiguous determiner/adverbs should be checked in case of a double error:
President Carter plays golf much_RR less_RRR these days
He has much_DA1 less_DAR enthusiasm for the game
When it means the opposite of yes, no is tagged UH. This is true even when the use is nominal, providing the quotation marks are present:
A resounding "no_UH"
If they are absent, the tag should be changed to NN1.
I'll take that as a no_NN1, then.
(See also Section 2.6.4: Cited Words)
Otherwise, no is tagged AT, e.g.:
There is no_AT question of that happening
MC1 where one precedes a noun or noun phrase, as in:
one_MC1 book
one_MC1 bag of spuds
and where it is the head of a noun phrase with a dependent prepositional phrase:
one_MC1 of the books
and when referring to 'one' as a number entity:
this is the number one_MC1
one_MC1 is an integer type
a one_MC1 at the prompt
PN1 where it is a personal pronoun such as:
one_PN1 ought to be careful
one_PN1 doesn't like to make a fuss
and when functioning as a substitute form:
the prettiest one_PN1 is called Flo
the one_PN1 you are holding is a bomb
his idea is not one_PN1 that holds much water
The CS tag is used when so is equivalent to the expression so that. It has a purposive function:
We hid it so_CS no one would notice
He only said it so_CS he could impress us
It is an RR when it occurs, usually after a punctuation mark or at the beginning of a sentence, with a meaning approximating to therefore:
It is raining, so_RR I am staying at home
So_RR we gave up the struggle, you see
He swore at me, so_RR I hit him
It is likewise an RR if preceded by a conjunction in examples like those directly above:
He swore at me, and_CC so_RR I hit him
In expressions where so is used as a substitute form, and in cases where its use is clearly adverbial (= like that), it is tagged RR:
so_RR I believe
I might feel that, but I would never say so_RR
So_RR did John
I'm afraid so_RR
Don't take on so_RR!
It is tagged RG when used in positions where very could occur:
She is so_RG friendly
I have never been so_RG angry
Thank you so_RG much
and when it corresponds to the first as in 'as...as...' comparisons:
They're not doing so_RG well_RR as_CSA before
Times is now always tagged NNT2, except when it is used as a preposition in multiplication, where it is tagged II:
Three times_II two is six
The number of rows times_II the number of columns
In all the following diverse cases, NNT2 is used:
Recite your twelve times_NNT2 table
It's ten times_NNT2 better than before (because of the following comparative adjective)
London is 10 times_NNT2 the size of Lancaster (grammatically, 10 times could be replaced by twice)
How many times_NNT2 must that have happened?
Those were good times_NNT2
They clocked up some very fast times_NNT2
Knock three times_NNT2
Of course, times may also occur as a VVZ (She times_VVZ his response).
When may be tagged RRQ or CS. When can introduce three types of clause:
When it introduces an adverbial clause or a non-restrictive relative clause, it is a CS. When it introduces either a noun clause or a restrictive relative clause, it is an RRQ. Examples:
When_CS I arrived, John left
John left when_CS I arrived (at the time at which)
I smoke when_CS I'm tense (whenever)

I cannot remember when_RRQ I was christened
I don't know when_RRQ the next bus is due (the date/point in time at which)

In the year when_RRQ I was born (in which)
The moment when_RRQ he arrived (at which)
Note that when can often be omitted in a relative clause.
There are also non-restrictive relative clauses introduced by when, which are now to be tagged as CS. Previously they were tagged RRQ. It is no longer necessary to distinguish these from adverbial clauses introduced by when. Here are some examples of non-restrictive relative clauses:
In 1968, when_CS the students were revolting in Paris...
Here, when could best be paraphrased as at the time when.
School finished at 4 o'clock precisely, when_CS a loud bell sounded
Non-restrictive relative clauses do not define or restrict the meaning of the antecedent. If the antecedent is a precise temporal expression (such as "4 o'clock", "1990", "yesterday"), when is usually a non-restrictive relative.
These are different from restrictive relatives, such as:
In the year when_RRQ I was born
Here the year is defined by the relative clause. Typically restrictive relatives are not preceded by a comma, and the when can normally be omitted.
Another use of when_RRQ is in direct questions:
When_RRQ did you find out?
In abbreviated adverbial clauses, where when is followed by an adjective, a preposition phrase, a non-finite clause etc., when is a CS:
when_CS ready
when_CS in doubt
when_CS arriving late
but before an infinitive, when is an RRQ:
I don't know when_RRQ to apply
Note that the infinitive clause may be implied:
Tell me when_RRQ (to start)
and that a noun clause may be abbreviated simply to the word when:
It was Guy Fawkes, but I can't remember when_RRQ
The tagging of where is consistent with when.
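The guidance on when (and, by the same conventions, where) can be condensed into a lookup from clause type to tag. The table below is only a summary sketch of the rules stated above: the category labels are hypothetical, and deciding which category a given clause belongs to still requires the post-editor's judgement.

```python
# Clause type introduced by 'when' -> tag, as summarised from the guidance above.
WHEN_TAGS = {
    "adverbial clause": "CS",
    "non-restrictive relative clause": "CS",
    "abbreviated adverbial clause": "CS",
    "noun clause": "RRQ",
    "restrictive relative clause": "RRQ",
    "direct question": "RRQ",
    "before infinitive": "RRQ",
}


def when_tag(clause_type: str) -> str:
    """Look up the tag for 'when' given a (manually judged) clause type."""
    return WHEN_TAGS[clause_type]


print(when_tag("non-restrictive relative clause"))  # CS
print(when_tag("before infinitive"))                # RRQ
```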
Two tags are allowed for worth: II and NN1. II is used for expressions which could be an answer to the question: how much is it worth? or what is it worth?:
My records are worth_II a small fortune
He is worth_II about two million
It's not worth_II gambling on
It also occurs as a stranded preposition (see Sections 2 and 3) in the questions used to elicit such responses, and in other common constructions:
What do you think they are worth_II ?
He knew exactly how much they were worth_II
She gave it everything she was worth_II
NN1 is used when worth is obviously nominal, and also in expressions where worth is preceded by a quantity, whether or not the quantity in question has been written as a genitive:
You don't know your own worth_NN1
I'd like a pound's worth_NN1
They purchased a million dollars worth_NN1 of equipment
NOTE: DITTO TAGS
Any of the tags listed above may in theory be modified by the addition of a pair of numbers to it: eg. DD21, DD22. This signifies that the tag occurs as part of a sequence of similar tags, representing a sequence of words which for grammatical purposes are treated as a single unit. For example the expression in terms of is treated as a single preposition, receiving the tags:
in_II31 terms_II32 of_II33
The first of the two digits indicates the number of words/tags in the sequence, and the second digit the position of each word within that sequence. Such ditto tags are not included in the lexicon, but are assigned automatically by a program called IDIOMTAG which looks for a range of multi-word sequences included in the idiomlist. The following sample entries from the idiomlist show that syntactic ambiguity is taken into account, and also that, depending on the context, ditto-tags may or may not be required for a particular word sequence:
at_RR21 length_RR22
a_DD21/RR21 lot_DD22/RR22
in_CS21/II that_CS22/DD1
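The two-digit suffix described above can be generated and decoded mechanically. The sketch below is not IDIOMTAG; both function names are hypothetical. It assumes sequences of two to nine words (one digit each for length and position) and that no ordinary tag itself ends in two such digits.

```python
import re
from typing import List, Optional, Tuple


def ditto_tags(words: List[str], base_tag: str) -> List[str]:
    """Attach ditto tags: base tag + sequence length + position in sequence."""
    n = len(words)
    if not 2 <= n <= 9:
        raise ValueError("a ditto sequence needs between 2 and 9 words")
    return [f"{word}_{base_tag}{n}{i}" for i, word in enumerate(words, start=1)]


def split_ditto(tag: str) -> Optional[Tuple[str, int, int]]:
    """Return (base tag, length, position), or None for a non-ditto tag."""
    match = re.fullmatch(r"(.+?)([2-9])([1-9])", tag)
    if match is None:
        return None
    base, length, position = match.group(1), int(match.group(2)), int(match.group(3))
    if position > length:
        return None
    return base, length, position


print(ditto_tags(["in", "terms", "of"], "II"))  # ['in_II31', 'terms_II32', 'of_II33']
print(split_ditto("NN131"))                     # ('NN1', 3, 1)
```

Note that the base tag may itself end in a digit (NN1, NN2), which is why the decoder peels off exactly the last two digits rather than all trailing digits.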