A BRIEF USERS' GUIDE TO THE GRAMMATICAL TAGGING OF THE BRITISH NATIONAL CORPUS

Geoffrey Leech


A. GENERAL INFORMATION

1. The Tagged British National Corpus

All the 100 million words of the British National Corpus (BNC) have been grammatically tagged: that is, a label is attached to each of them, indicating its grammatical class, or part of speech. All punctuation marks in the corpus are also grammatically tagged (however, the punctuation marks are omitted from all word counts, for example when calculating size of texts or error rates).

The grammatical tagging was undertaken by the Unit for Computer Research on the English Language (UCREL) at Lancaster University, U.K. In addition to the co-directors of UCREL, Roger Garside and Geoffrey Leech, the team undertaking the tagging included Michael Bryant, Elizabeth Eyes, Nick Smith, Mary Hodges, Mary Kinane, Tom Barney, Simon Botley, and Xu Xunfeng.

2. The BNC Basic Tagset and the BNC Enriched Tagset

The 100-million-word British National Corpus is tagged using a tagset (known as C5) which we may refer to as the "BNC Basic Tagset".

Additionally, a two-million-word subset of the BNC, the "Core Corpus", has been tagged using a richer, more detailed tagset (known as C7) which we will refer to as the "BNC Enriched Tagset".

3. The Automatic Tagging of the Corpus: Errors and Ambiguity Tags

The 100-million-word BNC was tagged automatically, using the CLAWS4 automatic tagger developed by Roger Garside at Lancaster. With such a large corpus, there was no opportunity to post-edit and correct the tagging errors produced by the automatic tagger, and so the errors (c. 1.7% of all words, excluding punctuation marks) remain in the distributed form of the corpus. In addition, the distributed form of the corpus contains ambiguous taggings, shown in the form of ambiguity tags (also called "portmanteau tags"), such as VVD-VVN, which indicates that the automatic tagger was unable to decide, with sufficient likelihood of success, between the two categories VVD (past tense verb) and VVN (past participle), and so left both possibilities for users to disambiguate. Approximately 4.7% of the tags in the Basic Tagset tagging of the BNC (excluding punctuation tags) are ambiguity tags.
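In a program, an ambiguity tag of this form can be recognized and split into its candidate tags mechanically. A minimal Python sketch (assuming the tags have already been extracted from the corpus markup):

    def candidates(tag):
        """Return the candidate tags encoded by a Basic Tagset (C5) tag.

        An ordinary tag such as 'NN1' yields itself alone; an ambiguity
        ("portmanteau") tag such as 'VVD-VVN' yields both candidates,
        left for the user to disambiguate.
        """
        return tag.split('-')

    print(candidates('VVD-VVN'))   # ['VVD', 'VVN']
    print(candidates('NN1'))       # ['NN1']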

However, the tagging of the 2 million words of the Core Corpus has been manually post-edited and corrected, and so has a low error rate (less than 0.3%) in comparison with the Basic Tagset tagging of the whole 100 million words.

4. The Core Corpus: Equality of Spoken and Written Data

The Core Corpus of 2 million words is intended to be a representative subset of the whole corpus, in the sense that it contains samples from all the major subdivisions of the whole BNC, and in approximately the same proportions as those found in the BNC as a whole. There is, however, one major exception to this statement: whereas in the whole BNC only c. 10 million words (10% of the corpus) consist of spoken data, the Core Corpus is divided approximately equally between written and spoken material (c. 1 million words each). It is generally felt that in an ideal balanced corpus of the language, at least half of the material should be spoken English. It was only the impracticality of collecting and transcribing 50 million words of the spoken language which led to the abandonment of this goal of "ideal balance" in the case of the whole BNC.

5. The BNC Tagging Enhancement Project

Although an appreciable number of errors regrettably remain in the tagging of the BNC, a project is currently under way whose goals are (a) to eliminate errors from the available version of the BNC, and (b) to provide a tool, to be made available to BNC users, enabling them to change the tagging in the corpus in accordance with their needs. For example, there may well be users who feel that some of the tags in the present corpus are inappropriate to their needs, and that some tags should be merged, or should be split into two or more categories. The tag adaptation tool will enable this to be done, on the basis of the information in the existing tagging of the corpus.
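As an illustration of the kind of adaptation envisaged, the following Python sketch (a hypothetical example, not the project's actual tool) merges the comparative and superlative adjective tags into the general adjective tag:

    # Hypothetical merging table: a user who does not need the
    # comparative/superlative distinction might map AJC and AJS
    # onto the general adjective tag AJ0.
    MERGE = {"AJC": "AJ0", "AJS": "AJ0"}

    def adapt(tagged):
        """Re-tag a sequence of (word, tag) pairs according to MERGE."""
        return [(word, MERGE.get(tag, tag)) for word, tag in tagged]

    print(adapt([("older", "AJC"), ("oldest", "AJS"), ("old", "AJ0")]))
    # [('older', 'AJ0'), ('oldest', 'AJ0'), ('old', 'AJ0')]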

This project is entitled "The British National Corpus Tag Enhancement Project", and will run from 1 January 1995 to 30 June 1996. The work is being undertaken by UCREL at Lancaster University, and is funded by the Engineering and Physical Sciences Research Council. Also collaborating on the project are the three publishers Oxford University Press, Longman and Chambers Harrap.

By September 1996, it is intended to produce for general release, through Oxford University Computing Services, an upgraded version of the whole corpus, in which tagging errors will have been reduced to a negligible frequency.

6. Some General Characteristics of the Grammatical Tagging

(a) Tag Non-proliferation

One general principle of the grammatical tagging in the corpus is designed to avoid unnecessary proliferation of ambiguities in the tagging system. It has been decided, for practical reasons, that a more general tag within a word category subsumes a less general one. For example, in the Enriched Tagset, there are many tags within the category of noun: NN1 (singular common noun), NNL1 (singular locative noun, a subcategory of NN1), NN2 (plural common noun) and NNU2 (plural noun of measure). In general, a noun is assigned to the more general category unless its use is restricted to the more specific category. Examples are inches, meters, miles, which are assigned the more specific tag NNU2. On the other hand, the more general tag is assigned to feet, even in such phrases as ten feet long, since this noun also has non-measurement uses. However, with the abbreviation ft. the measurement use is the only one, so NNU (measure noun neutral for number) is assigned. Similarly, the noun square has the general singular common noun tag NN1. But if square occurs in a place name (as in Trafalgar Square), the tag NNL1 is assigned, because its capital letter and position indicate that the word can only be a locative noun.

This non-proliferation principle is not followed in cases where it would lead to the loss of important information, e.g. for parsing purposes. A few notable exceptions in the Enriched Tagset are the adverbs so, too, rather and quite, which are tagged RR (general adverb) or RG (degree adverb) according to whether or not the adverb premodifies an adjective, adverb, etc. Also, if is tagged CS (subordinating conjunction) where it introduces an adverbial clause, and CSW (subordinating conjunction introducing a yes-no interrogative clause) where it is comparable to whether. Similarly, fall and spring are tagged NNT1 (temporal noun) when referring to the seasons - alongside summer, autumn and winter - and NN1 in non-temporal uses. More and less are also given two adverb tags (RGR and RRR), depending on whether or not they have a premodifying function. A distinction is always made between common nouns (e.g. NN1) and proper nouns (NP0 in the Basic Tagset, NP1 etc. in the Enriched Tagset).

(b) Segmentation

The tagging practice in general follows the default assumption that an orthographic word (separated by spaces from adjacent words) is the appropriate unit for grammatical tagging. There are, however, exceptions to this. A single orthographic word may contain more than one grammatical word: e.g. in the case of verb contractions and negative contractions such as she's, they'll, we're, don't, isn't, two tags are assigned in sequence to the same orthographic word. Also quite frequent is the opposite circumstance, where two or more orthographic words are given a single grammatical tag: e.g. compound conjunctions such as so that and as well as are each assigned a single conjunction tag, and likewise compound prepositions such as instead of and up to are each assigned a single preposition tag. Naturally, whether such orthographic sequences should be treated as a single word for grammatical tagging purposes depends on the context. As well as in some contexts is not a conjunction, but a sequence of adverb - adverb - conjunction/preposition. Up to in some contexts is a sequence of adverbial particle - preposition.
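For a program processing the tagged corpus, this means that tagging units cannot be assumed to align one-to-one with orthographic words. A minimal sketch of one possible representation (the data structures are illustrative, not the BNC's own encoding), pairing each tag with the stretch of text it covers:

    # One orthographic word carrying two tags in sequence:
    she_s = [("she", "PNP"), ("'s", "VBZ")]

    # Two orthographic words carrying a single tag:
    so_that = [("so that", "CJS")]

    # Either way, a tagging unit is a (text, tag) pair, so the same
    # code can walk both kinds of sequence uniformly.
    for text, tag in she_s + so_that:
        print(f"{text!r} -> {tag}")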

In one respect, we have inevitably allowed the orthographic occurrence of spaces to be criterial. This is in the tagging of compound sequences such as fox holes, fox-holes, and foxholes. Since orthographic practice is uncertain in such cases, the same "compound" may occur in the corpus tagged as two words (if its elements are separated by a space) or as one word (if the sequence is printed solid or with a hyphen).

(c) Annotation Guidelines

Many more detailed decisions have had to be made about where to draw the line between one tag and another, so that the concept of a "correct annotation", and therefore the accuracy of the automatic tagging, can be determined. These decisions are recorded as detailed guidelines on tagging practice, incorporated into a BNC Tagging Manual which is not yet generally available, but which we intend to edit by August 1995 for general distribution to users of the corpus.

B. THE BNC BASIC TAGSET

The following is a brief description of the Basic Tagset used in the tagging of the whole 100-million-word BNC.

1. A List of Grammatical Tags with Brief Definitions and Clarifications

Each tag consists of three characters. Generally, the first two characters indicate the general part of speech, and the third character is used to indicate a subcategory. When the most general, unmarked category of a part of speech is indicated, in general the third character is 0. (For example, AJ0 is the tag for the most general class of adjectives.)
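Since the tags are built systematically from these characters, they can be decomposed mechanically. The following Python sketch (illustrative only; it assumes the three-character tags described above and ignores ambiguity tags) splits a Basic Tagset tag into its general part-of-speech prefix and its subcategory character:

    def decompose(tag):
        """Split a three-character C5 tag into (general class, subcategory).

        The first two characters generally give the part of speech; the
        third gives the subcategory, with '0' marking the most general,
        unmarked category.
        """
        prefix, sub = tag[:2], tag[2]
        return prefix, (None if sub == '0' else sub)

    print(decompose('AJ0'))   # ('AJ', None) - general adjective
    print(decompose('AJC'))   # ('AJ', 'C')  - comparative adjective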

AJ0 Adjective (general or positive) (e.g. good, old, beautiful)

AJC Comparative adjective (e.g. better, older)

AJS Superlative adjective (e.g. best, oldest)

AT0 Article (e.g. the, a, an, no) [N.B. no is included among articles, which are defined here as determiner words which typically begin a noun phrase, but which cannot occur as the head of a noun phrase.]

AV0 General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest). [Note that adverbs, unlike adjectives, are not tagged as positive, comparative, or superlative. This is because of the relative rarity of comparative and superlative adverbs.]

AVP Adverb particle (e.g. up, off, out) [N.B. AVP is used for such "prepositional adverbs", whether or not they are used idiomatically in a phrasal verb: e.g. in 'Come out here' and 'I can't hold out any longer', the same AVP tag is used for out.]

AVQ Wh-adverb (e.g. when, where, how, why, wherever) [The same tag is used, whether the word occurs in interrogative or relative use.]

CJC Coordinating conjunction (e.g. and, or, but)

CJS Subordinating conjunction (e.g. although, when)

CJT The subordinating conjunction that [N.B. that is tagged CJT when it introduces not only a nominal clause, but also a relative clause, as in 'the day that follows Christmas'. Some theories treat that here as a relative pronoun, whereas others treat it as a conjunction. We have adopted the latter analysis.]

CRD Cardinal number (e.g. one, 3, fifty-five, 3609)

DPS Possessive determiner (e.g. your, their, his)

DT0 General determiner: i.e. a determiner which is not a DTQ. [Here a determiner is defined as a word which typically occurs either as the first word in a noun phrase, or as the head of a noun phrase. E.g. This is tagged DT0 both in 'This is my house' and in 'This house is mine'.]

DTQ Wh-determiner (e.g. which, what, whose, whichever) [The category of determiner here is defined as for DT0 above. These words are tagged as wh-determiners whether they occur in interrogative use or in relative use.]

EX0 Existential there, i.e. there occurring in the there is ... or there are ... construction

ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)

NN0 Common noun, neutral for number (e.g. aircraft, data, committee) [N.B. Singular collective nouns such as committee and team are tagged NN0, on the grounds that they are capable of taking singular or plural agreement with the following verb: e.g. 'The committee disagrees/disagree'.]

NN1 Singular common noun (e.g. pencil, goose, time, revelation)

NN2 Plural common noun (e.g. pencils, geese, times, revelations)

NP0 Proper noun (e.g. London, Michael, Mars, IBM) [N.B. the distinction between singular and plural proper nouns is not indicated in the tagset, plural proper nouns being a comparative rarity.]

ORD Ordinal numeral (e.g. first, sixth, 77th, last). [N.B. The ORD tag is used whether these words are used in a nominal or in an adverbial role. Next and last, as "general ordinals", are also assigned to this category.]

PNI Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody) [N.B. This tag applies to words which always function as [heads of] noun phrases. Words like some and these, which can also occur before a noun head in an article-like function, are tagged as determiners (see DT0 and AT0 above).]

PNP Personal pronoun (e.g. I, you, them, ours) [Note that possessive pronouns like ours and theirs are tagged as personal pronouns.]

PNQ Wh-pronoun (e.g. who, whoever, whom) [N.B. These words are tagged as wh-pronouns whether they occur in interrogative or in relative use.]

PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)

POS The possessive or genitive marker 's or ' (e.g. for 'Peter's or somebody else's', the sequence of tags is: NP0 POS CJC PNI AV0 POS)

PRF The preposition of. Because of its frequency and its almost exclusively postnominal function, of is assigned a special tag of its own.

PRP Preposition (except for of) (e.g. about, at, in, on, on behalf of, with)

PUL Punctuation: left bracket - i.e. ( or [

PUN Punctuation: general separating mark - i.e. . ! , : ; - or ?

PUQ Punctuation: quotation mark - i.e. ' or "

PUR Punctuation: right bracket - i.e. ) or ]

TO0 Infinitive marker to

UNC Unclassified items: items which cannot appropriately be classified as belonging to the English lexicon. [Items tagged UNC include foreign (non-English) words, special typographical symbols, formulae, and (in spoken language) hesitation fillers such as er and erm.]

VBB The present tense forms of the verb BE, except for is, 's: i.e. am, are, 'm, 're and be [subjunctive or imperative]

VBD The past tense forms of the verb BE: was and were

VBG The -ing form of the verb BE: being

VBI The infinitive form of the verb BE: be

VBN The past participle form of the verb BE: been

VBZ The -s form of the verb BE: is, 's

VDB The finite base form of the verb DO: do

VDD The past tense form of the verb DO: did

VDG The -ing form of the verb DO: doing

VDI The infinitive form of the verb DO: do

VDN The past participle form of the verb DO: done

VDZ The -s form of the verb DO: does, 's

VHB The finite base form of the verb HAVE: have, 've

VHD The past tense form of the verb HAVE: had, 'd

VHG The -ing form of the verb HAVE: having

VHI The infinitive form of the verb HAVE: have

VHN The past participle form of the verb HAVE: had

VHZ The -s form of the verb HAVE: has, 's

VM0 Modal auxiliary verb (e.g. will, would, can, could, 'll, 'd)

VVB The finite base form of lexical verbs (e.g. forget, send, live, return) [Including the imperative and present subjunctive]

VVD The past tense form of lexical verbs (e.g. forgot, sent, lived, returned)

VVG The -ing form of lexical verbs (e.g. forgetting, sending, living, returning)

VVI The infinitive form of lexical verbs (e.g. forget, send, live, return)

VVN The past participle form of lexical verbs (e.g. forgotten, sent, lived, returned)

VVZ The -s form of lexical verbs (e.g. forgets, sends, lives, returns)

XX0 The negative particle not or n't

ZZ0 Alphabetical symbols (e.g. A, a, B, b, c, d)

Total number of grammatical tags in the BNC Basic Tagset: 61

2. A List of Ambiguity Tags

AJ0-AV0 AJ0-VVN AJ0-VVD AJ0-NN1 AJ0-VVG

AVP-PRP AVQ-CJS CJS-PRP CJT-DT0 CRD-PNI

NN1-NP0 NN1-VVB NN1-VVG NN2-VVZ VVD-VVN

3. The Frequency of Ambiguity Tags in the British National Corpus (an estimate based on a 36,000 word sample)

The following is the result of a provisional analysis of ambiguity tags in a part of the BNC. It can be used to estimate the number of ambiguity tags representing a given tag in the corpus, and the likelihood of errors occurring in ambiguity tags. This is a first step towards a more detailed analysis being undertaken at Lancaster.

KEY TO THE COLUMNS IN THE TABLE BELOW

(1) = Ambiguity tag occurring in the BNC

(2) = Ambiguity tag correctly disambiguated as...

(3) = Number of occurrences, in the 36,000 words, of the specific type of ambiguity tag disambiguation

(4) = Number of uncertain [still ambiguous] cases included in (3), if any.

(5) = Total number of occurrences of the specific ambiguity tag.

(6) = Erroneous tags as a percentage of all instances of the specific ambiguity tag (i.e. Error Rate)

[N.B. * marks cases where neither of the tags included in the ambiguity tag was correct, i.e. the ambiguity tag itself was an error.]

(1)        (2)     (3)   (4)    (5)   (6)
AJ0-AV0    AJ0      37   [1]
AJ0-AV0    AV0      31   [4]
*AJ0-AV0   other   *16           84   (19%)
AJ0-NN1    AJ0     126   [27]
AJ0-NN1    NN1     114   [10]
*AJ0-NN1   other   *16   [3]    256   (6%)
AJ0-VVD    AJ0      15
AJ0-VVD    VVD      10
*AJ0-VVD   other    *4           29   (14%)
AJ0-VVG    AJ0      25   [2]
AJ0-VVG    VVG      31
*AJ0-VVG   other    *4           60   (7%)
AJ0-VVN    AJ0      12
AJ0-VVN    VVN      19   [2]
*AJ0-VVN   other    *4           35   (11%)
AVP-PRP    AVP      52   [1]
AVP-PRP    PRP      44   [4]
*AVP-PRP   other    *2           98   (2%)
AVQ-CJS    AVQ      25   [1]
AVQ-CJS    CJS      51   [2]     76   (0%)
CJS-PRP    CJS      22
CJS-PRP    PRP      51
*CJS-PRP   other    *1           74   (1%)
CJT-DT0    CJT      24
CJT-DT0    DT0      52   [1]     76   (0%)
CRD-PNI    CRD       5   [2]
CRD-PNI    PNI      12           17   (0%)
NN1-NP0    NN1      42   [10]
NN1-NP0    NP0     144   [25]
*NN1-NP0   other    *5          191   (3%)
NN1-VVB    NN1     113   [5]
NN1-VVB    VVB      59   [5]
*NN1-VVB   other   *17   [1]    189   (9%)
NN1-VVG    NN1      34   [8]
NN1-VVG    VVG      31   [2]
*NN1-VVG   other    *6           71   (8%)
NN2-VVZ    NN2      82
NN2-VVZ    VVZ      36   [1]
*NN2-VVZ   other    *1          119   (0.8%)
VVD-VVN    VVD     147   [2]
VVD-VVN    VVN     118   [2]
*VVD-VVN   other    *1          266   (0.3%)

Total Number of Ambiguity Tags in the 36,000 words: 1641

Percentage of words having ambiguity tags (discounting 77 ambiguity-tag errors, which are accounted for in the table below): 1564/36,000 = 4.34%
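The counts in the table above can be turned directly into estimates of how a given ambiguity tag is likely to resolve. A minimal Python sketch, using the AJ0-AV0 row from the table:

    # Counts for the ambiguity tag AJ0-AV0, from the table above:
    # 37 cases resolved as AJ0, 31 as AV0, 16 where neither was correct.
    counts = {'AJ0': 37, 'AV0': 31, 'other': 16}
    total = sum(counts.values())          # 84 occurrences

    for tag, n in counts.items():
        print(f"{tag}: {n / total:.0%}")  # AJ0: 44%, AV0: 37%, other: 19%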

4. Errors in a 36,000 Word Sample of the BNC

(Excluding errors coinciding with the use of ambiguity tags: marked * in the Ambiguity Tag Report at B3 above.)

The following is the result of a provisional analysis of errors in the tagging of the BNC. It can be used to estimate the likelihood of error in relation to particular tags, and also to determine the most likely tags to be correct replacements of erroneous tags. This is a first step towards a more detailed analysis being undertaken at Lancaster.

KEY TO THE COLUMNS IN THE TABLE BELOW

(1) = Erroneous tag occurring in the corpus

(2) = Correct tag which should have occurred [Where numbers are less than 5 in this column, the tags are lumped together under the heading "other"]

(3) = Number of tagging errors of this specific type

(4) = Number of uncertain or ambiguous cases included in (3), if any [These are included in the total of (3)]

(5) = Total number of errors involving this particular erroneous tag.

(6) = Total number of instances of the specified tag occurring in the whole sample.

(7) = Number of erroneous tags as a percentage of all instances of the specified tag [i.e. Error Rate]

(1)        (2)       (3)   (4)    (5)    (6)    (7)
AJ0        AV0        13           62    2027   3%
           NN1        28   [3]
           VVN         6
           NP0         5
           other      10
AJC        AJS         1            2      63   3%
           AV0         1
AJS                                 0      45   0%
AT0                                 0    2907   0%
AV0        AJ0        17          109    1765   6%
           CJS        15   [4]
           DT0        37
           PRP        27
           other      13
AVP        PRP        18   [1]     20     292   7%
           other       2
AVQ        CJS         4            4      65   6%
CJC        PRP         1            1    1349   0.07%
CJS        PRP        14           20     430   5%
           other       6
CJT        DT0         6   [1]      7     199   4%
           other       1
CRD        PNI         3            4     526   0.8%
           CRD+POS     1
DPS                                 0     493   0%
DT0        AV0        11           13     798   2%
           CJT         2
DTQ                                 0     288   0%
EX0                                 0      72   0%
ITJ        various     5            5     558   0.9%
NN0        NP0         3   [2]      3     351   0.9%
NN1        AJ0        12   [1]     67    4637   1%
           NP0        18   [4]
           UNC        12
           VVB        10
           other      15   [1]
NN2        VVZ         6   [1]     13    1545   0.8%
           other       7
NP0        AJ0        10           38    1159   3%
           NN1        21   [2]
           other       7
ORD                                 0     106   0%
PNI        AV0         1            1     114   0.9%
PNP        DPS        11           11    2554   0.4%
PNQ                                 0      57   0%
POS        VBZ         3            3     194   2%
PRF                                 0     955   0%
PRP        AVP         8   [1]     19    2687   0.7%
           other      11
TO0        PRP         5            5     547   0.9%
UNC        PRP         1            1     291   0.3%
VBB                                 0     231   0%
VBD                                 0     417   0%
VBG                                 0       9   0%
VBI        VB0         1            1     255   0.4%
VBN                                 0      99   0%
VBZ        POS        10           11     568   2%
           VHZ         1
VDB        VDI         1            1     153   0.7%
VDD                                 0      88   0%
VDG                                 0      14   0%
VDI                                 0      45   0%
VDN                                 0      25   0%
VDZ                                 0      45   0%
VHB        VHI         1            1     141   0.7%
VHD                                 0     139   0%
VHG                                 0      19   0%
VHI                                 0      86   0%
VHN        VHD         4            4      24   17%
VHZ        VBZ         6            7     125   6%
           POS         1
VM0                                 0     599   0%
VVB        NN1        21           67     585   11%
           VVI        30   [2]
           other      16   [1]
VVD        AJ0         2           38     676   6%
           VVB         7
           VVN        29
VVG        AJ0         8           12     435   3%
           other       4   [2]
VVI        NN1         9           13     932   1%
           other       4   [1]
VVN        VVD         7           11     623   2%
           other       4
VVZ        NN0         2           11     237   5%
           NN2         9
ZZ0        CRD         3            3      52   6%
Sequences              3            3

("Sequences" are cases where the error correction consists in replacing a sequence of tags by a single tag.)

------------------------------------------------------------------------------------

Total number of errors listed under individual tags above: 591

Errors in the form of erroneous ambiguity tags (marked * in the ambiguity tag table in B3 above): 77

TOTAL OF ERRORS IN THE 36,000 WORD SAMPLE: 668

Error rate: 1.85%

(Note that the error rate of this error-analysis sample is higher than for the corpus as a whole, because the sample consists of 5 spoken text samples and 13 written text samples, each of 2,000 words, whereas the whole corpus consists of c. 90% written data and only 10% spoken data. The spoken material has a higher error rate [though a lower ambiguity-tag rate] than the corpus as a whole.)
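To extrapolate from the sample to the whole corpus, the per-medium error rates would therefore need to be reweighted to the BNC's own 90:10 written-to-spoken proportions. A sketch of the calculation in Python (the two per-medium rates are placeholders, since this guide does not quote them separately):

    # Placeholder per-medium error rates; the separate written and
    # spoken figures for the sample are not quoted in this guide.
    written_rate = 0.015
    spoken_rate = 0.025

    # The whole BNC is c. 90% written and 10% spoken, so a corpus-wide
    # estimate weights the two rates accordingly.
    estimate = 0.9 * written_rate + 0.1 * spoken_rate
    print(f"estimated corpus-wide error rate: {estimate:.2%}")   # 1.60%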

C. THE BNC ENRICHED TAGSET

1. Differences between the Enriched Tagset and the Basic Tagset

The Enriched Tagset is a larger set of grammatical word labels than the Basic Tagset: it has 139 tags (excluding punctuation tags), as against the Basic Tagset's 61. Many of the tags correspond to those in the Basic Tagset, and others represent more detailed grammatical distinctions. For example, alongside the Basic Tagset's three adverb tags

AV0 General adverb

AVP Adverbial particle

AVQ Wh-adverb

the Enriched Tagset has seventeen:

RA Adverb, after nominal head (e.g. else, galore)

REX Adverb introducing appositional constructions (e.g. i.e., e.g., viz)

RG Positive degree adverb (e.g. very, so, too)

RGA Post-modifying positive degree adverb (e.g. enough, indeed)

RGQ Wh- degree adverb (e.g. how when modifying a gradable adjective, adverb, etc.)

RGQV Wh-ever degree adverb (however when modifying a gradable adjective, adverb etc.)

RGR Comparative degree adverb (e.g. more, less)

RGT Superlative degree adverb (e.g. most, least)

RL Locative adverb (e.g. forward, alongside, there)

RP Adverbial particle (e.g. about, in, out, up)

RPK Catenative adverbial particle (about in be about to)

RR General positive adverb (e.g. often, well, long, easily)

RRQ Wh- general adverb (e.g. how, when, where, why)

RRQV Wh-ever general adverb (e.g. however, whenever, wherever)

RRR Comparative general adverb (e.g. more, oftener, longer, further)

RRT Superlative general adverb (e.g. most, oftenest, longest, furthest)

RT Nominal adverb of time (e.g. now, tomorrow, yesterday)

These additional distinctions are made for a combination of morphological, syntactic, and semantic reasons. Where a semantic category is distinguished by a separate tag, it is because that category is judged to be distinctive in its grammatical, as well as its semantic characteristics. On the other hand, many categories familiar from English grammars (e.g. manner adverbs, frequency adverbs, conjunctive adverbs) are not separately represented, because they would be too difficult to distinguish by automatic tagging, given the current limitations of tagging software. Hence the Enriched Tagset is in practice something of a compromise between distinctions it would be linguistically desirable to make, and distinctions which can be feasibly made with current software.

Some conjunctions and prepositions are given individual tags, on the grounds that their syntactic functions are in some respects unique. This applies, for example, to the conjunctions but, as and than, and the prepositions for and of.

It will be noted that the symbols used for Enriched Tagset tags do not in general closely resemble those of the Basic Tagset. The Basic Tagset tags are selected with the overriding goal of ease of interpretation - hence the mnemonic choice of symbols such as AJ- for adjective and AV- for adverb. The Enriched Tagset labels are chosen with more attention to analysability: each character (with a few exceptions) has a distinct meaning of its own. For example, the symbol RGR for comparative degree adverb makes sense character by character as follows: R (= adverb); G (= of degree); R (= comparative). In this way, it is possible to search the corpus according to individual features: for example, if we search for all tags in which the last character is R, we find all examples of comparative words. If we search the corpus for all tags beginning with the character R, we find all examples of adverbs. The Enriched Tagset also has a strong family resemblance to the tagsets used in other tagged corpora, such as the Brown Corpus and the LOB (Lancaster-Oslo/Bergen) Corpus, and will therefore have the advantage of familiarity for many users.
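Because each character carries a meaning of its own, such feature-based searches reduce to simple pattern matching over the tags. A minimal Python sketch (the sample tag list is illustrative, not drawn from the corpus):

    # Illustrative sample of Enriched Tagset (C7) tags.
    tags = ['RGR', 'RRR', 'JJR', 'DAR', 'NN1', 'RT', 'RR']

    # All adverb tags begin with R.
    adverbs = [t for t in tags if t.startswith('R')]

    # Comparative tags end in R - though in practice a search would
    # need to allow for tags such as RR, whose final R is not a
    # comparative marker.
    comparatives = [t for t in tags if t.endswith('R') and t != 'RR']

    print(adverbs)        # ['RGR', 'RRR', 'RT', 'RR']
    print(comparatives)   # ['RGR', 'RRR', 'JJR', 'DAR']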

Since the Core Corpus has been post-edited and corrected throughout, there are no ambiguity tags in the Enriched Tagset tagging of the Core Corpus.

2. List of Tags in the BNC Enriched Tagset

The BNC Enriched Tagset has been used for the tagging of the Core Corpus of 2 million words of spoken and written English. The symbols representing tags in this Tagset are similar to those employed in other well known corpora, such as the Brown Corpus and the LOB Corpus.

The complete list of the BNC Enriched Tagset (also known as the C7 Tagset) is given below, with brief definitions and exemplifications of the categories represented by each tag.

APPGE Possessive determiner, pre-nominal (e.g. my, your, her, his, their)

AT Article, neutral for number (e.g. the, no) [N.B. no is included among articles, which are defined here as determiner words which typically begin a noun phrase, but which cannot occur as the head of a noun phrase. A word which is neutral for number is one that can co-occur with either singular or plural forms: e.g. the house, the houses; no brother, no brothers.]

AT1 Singular article (e.g. a, an, every)

BCS "Before-conjunction" (e.g. in order preceding that, even preceding if)

BTO "Before-infinitive-marker" (e.g. in order or so as preceding to)

CC Coordinating conjunction, general (e.g. and, or)

CCB Coordinating conjunction but

CS Subordinating conjunction, general (e.g. if, when, while, because)

CSA As as conjunction

CSN Than as conjunction

CST That as conjunction [N.B. that is tagged CST when it introduces not only a nominal clause, but also a relative clause, as in 'the day that follows Christmas'.]

CSW The conjunction whether, or if when it is equivalent in function to whether.

DA "After-determiner" (or postdeterminer), neutral for number, e.g. such, former, same. [Where determiners occur in a sequence, postdeterminers tend to occur after other determiners or articles: e.g. 'all such friends', 'this same problem'. N.B.a determiner in this tagset, like the Basic Tagset, is defined as a word which typically occurs either as the first word of a noun phrase, or as the head of a noun phrase. E.g same is tagged DA in both the following contexts: 'This is the same tune'; 'This tune is the same.]

DA1 Singular "after-determiner" (or postdeterminer), e.g. little, much

DA2 Plural "after-determiner" (or postdeterminer), e.g. few, many, several

DA2R Plural "after-determiner", comparative form (e.g. fewer)

DA2T Plural "after-determiner", superlative form (e.g. fewest)

DAR Comparative "after-determiner", neutral for number (e.g. more, less)

DAT Superlative "after-determiner", neutral for number (e.g. most, least)

DB "Before-determiner" (or predeterminer), neutral for number (e.g. all, half) [N.B. where there is a sequence of determiners, predeterminers occur before other determiners or articles (e.g. 'all those years').]

DB2 Plural "before-determiner" (or predeterminer), e.g. both

DD Central determiner, neutral for number (e.g. some, any, enough) [N.B. central determiners are the most unmarked category, which in a sequence precedes predeterminers or follows postdeterminers.]

DD1 Singular central determiner (e.g. this, that, another)

DD2 Plural central determiner (e.g. these, those)

DDQ Wh-determiner (e.g. which, what)

DDQGE Wh-determiner, possessive (e.g. whose)

DDQV Wh-ever determiner (e.g. whichever, whatever)

EX Existential there

IF For as a preposition

II Preposition (general class: e.g. at, by, in, to, instead of)

IO Of as a preposition

IW With and without as prepositions

JJ Adjective (general or positive) (e.g. good, old, beautiful)

JJR General comparative adjective (e.g. better, older)

JJT General superlative adjectives (e.g. best, oldest)

JK Catenative adjective (with a quasi-auxiliary function, e.g. able in 'be able to'; willing in 'be willing to')

LE "Leading coordinator": a word introducing correlative coordination (e.g. both in both ... and, either in either ... or)

MC Cardinal number, neutral for number (e.g. two, three, four, 98, 1066) [Although numbers like two and three may be considered basically plural, the fact that they have singular agreement in uses such as 'Two's company, three's a crowd' assigns them to this number-neutral category.]

MC-MC Two numbers linked by a hyphen or dash (e.g. 40-50, 1770-1827)

MC1 Singular cardinal number (e.g. one, 1)

MC2 Plural cardinal number (e.g. tens, twenties, 1900s)

MD Ordinal number (e.g. first, sixth, 77th, last) [N.B. The MD tag is used whether these words are used in a nominal or in an adverbial role. Next and last, as "general ordinals", are also assigned to this category.]

MF Fractional number, neutral for number (e.g. quarter, three-fourths, two-thirds) [Again, these are treated as number-neutral because of their ability to agree with singulars and with plurals: 'A quarter was/were eaten'.]

ND1 Singular noun of direction (e.g. north, east, southwest, NNW)

NN Common noun, neutral for number (e.g. sheep, cod, group, people). [N.B. Singular collective nouns, such as team, are tagged NN, on the grounds that they are capable of taking singular or plural agreement e.g. 'Our team has/have lost'.]

NN1 Singular common noun (e.g. bath, powder, disgrace, sister)

NN2 Plural common noun (e.g. baths, powders, sisters)

NNJ Human organization noun (e.g. council, orchestra, corporation, Company) [N.B. these are typically collective nouns (see NN above), and therefore are left unspecified for number. They often occur, with an initial capital, in names of private or public organizations, as in 'the Ford Motor Company'.]

NNJ2 Plural human organization noun (e.g. councils, orchestras, corporations)

NNL Locative noun, neutral for number (e.g. Is. as an abbreviation for Island(s))

NNL1 Singular locative noun (e.g. island, street). [They are often abbreviated as part of the names of places, as in Mt. Aconcagua, Wall St, Belsize Pk.]

NNL2 Plural locative noun (e.g. islands, streets) [Again, they can occur with an initial capital as part of a complex place name: e.g. 'the Grampian Mountains'.]

NNO Numeral noun, neutral for number (cf. MC above): e.g. hundred, thousand, dozen

NNO2 Plural numeral noun (e.g. hundreds, thousands, dozens)

NNSA Noun of style or title, following a name (e.g. PhD, J.P., Bart when following a person's name)

NNSB Noun of style or title, preceding a name (e.g. Sir, Queen, Ms, Mr when occurring as the first part of a person's name) [These are often in the form of abbreviations.]

NNT1 Singular temporal noun (e.g. day, week, year, Easter)

NNT2 Plural temporal noun (e.g. days, weeks, years)

NNU Unit-of-measurement noun, neutral for number (e.g. the abbreviations in., ft, cc)

NNU1 Singular unit-of-measurement noun (e.g. inch, litre, hectare)

NNU2 Plural unit-of-measurement noun (e.g. inches, litres, hectares)

NP Proper noun, neutral for number (e.g. acronymic names of companies and organizations, such as IBM, NATO, BBC). [This tag also occurs widely in the pre-final parts of complex names, such as 'the Pacific Ocean', 'Cambridge University', 'North Germany'.]

NP1 Singular proper noun (e.g. Vivian, Clinton, Mexico)

NP2 Plural proper noun (e.g. Kennedys, Pyrenees, Cyclades)

NPD1 Singular weekday noun (e.g. Saturday, Wednesday)

NPD2 Plural weekday noun (e.g. Sundays, Fridays)

NPM1 Singular month noun (e.g. April, October)

NPM2 Plural month noun (e.g. Junes, Januaries)

PN Indefinite pronoun, neutral for number (e.g. none) [N.B. pronoun tags always apply to words which function as [heads of] noun phrases. Words like some and any, which can also occur in the position of an article/determiner, are treated as determiners (see DD above) in both the following contexts: 'Did you get any beans?' 'No, I couldn't find any.']

PN1 Singular indefinite pronoun (e.g. one [as pronoun, not numeral], somebody, no one, everything)

PNQO Wh-pronoun, objective case (whom)

PNQS Wh-pronoun, subjective case (who)

PNQVS Wh-ever pronoun, subjective case (whoever)

PNX1 Reflexive indefinite pronoun, singular (oneself)

PP$ Nominal possessive pronoun (e.g. mine, yours, his, ours)

PPH1 Singular personal pronoun, third person (it)

PPHO1 Singular personal pronoun, third person, objective case (him, her)

PPHO2 Plural personal pronoun, third person, objective case (them)

PPHS1 Singular personal pronoun, third person, subjective case (he, she)

PPHS2 Plural personal pronoun, third person, subjective case (they)

PPIO1 Singular personal pronoun, first person, objective case (me)

PPIO2 Plural personal pronoun, first person, objective case (us)

PPIS1 Singular personal pronoun, first person, subjective case (I)

PPIS2 Plural personal pronoun, first person, subjective case (we)

PPX1 Singular reflexive pronoun (e.g. myself, yourself, herself)

PPX2 Plural reflexive pronoun (ourselves, yourselves, themselves)

PPY Second person personal pronoun (you)

RA Adverb, after nominal head (e.g. else, galore)

REX Adverb introducing appositional constructions (e.g. i.e., e.g., viz)

RG Positive degree adverb (e.g. very, so, too)

RGA Post-modifying positive degree adverb (e.g. enough, indeed)

RGQ Wh- degree adverb (e.g. how when modifying a gradable adjective, adverb, etc.)

RGQV Wh-ever degree adverb (however when modifying a gradable adjective, adverb etc.)

RGR Comparative degree adverb (e.g. more, less)

RGT Superlative degree adverb (e.g. most, least)

RL Locative adverb (e.g. forward, alongside, there)

RP Adverbial particle (e.g. about, in, out, up)

RPK Catenative adverbial particle (about in be about to) [Compare JK above.]

RR General positive adverb (e.g. often, well, long, easily)

RRQ Wh- general adverb (e.g. how, when, where, why)

RRQV Wh-ever general adverb (e.g. however, whenever, wherever)

RRR Comparative general adverb (e.g. more, oftener, longer, further)

RRT Superlative general adverb (e.g. most, oftenest, longest, furthest)

RT Nominal adverb of time (e.g. now, tomorrow, yesterday)

TO The infinitive marker to

UH Interjection, or other isolate (e.g. oh, yes, wow)

VB0 be as a finite form (imperative or subjunctive)

VBDR were

VBDZ was

VBG being

VBI be as an infinitive form

VBM am, 'm

VBN been

VBR are, 're

VBZ is, 's

VD0 do as a finite form (in declarative and imperative clauses)

VDD did

VDG doing

VDI do as an infinitive form

VDN done

VDZ does, 's

VH0 have, 've as a finite form (in declarative and imperative clauses)

VHD had, 'd as a past tense form

VHG having

VHI have as an infinitive form

VHN had as a past participle

VHZ has, 's

VM Modal auxiliary verb (e.g. can, could, will, would, must)

VMK Catenative modal auxiliary (i.e. ought and used when followed by the infinitive marker to)

VV0 The base form of the lexical verb as a finite form (in declarative and imperative clauses) (e.g. give, find, look, receive)

VVD The past tense form of the lexical verb (e.g. gave, found, looked, received)

VVG The -ing form of the lexical verb (e.g. giving, finding, looking, receiving)

VVGK The -ing form as a catenative verb (e.g. going in be going to)

VVI The base form of the lexical verb as an infinitive (e.g. give, find, look, receive)

VVN The past participle form of the lexical verb (e.g. given, found, looked, received)

VVNK The past participle as a catenative verb (e.g. bound in be bound to)

VVZ The -s form of the lexical verb (e.g. gives, finds, looks, receives)

XX not, -n't

ZZ1 Singular letter of the alphabet (a, b, S, etc.)

ZZ2 Plural letter of the alphabet (a's, b's, Ss etc.)