BNC User Reference Guide

9 Miscellaneous tables

This section consists of a series of supplementary tables listing the values used for some open or semi-open value lists, together with other aspects of the corpus and its encoding not covered by the reference information in section 12 Formal Specification of the BNC XML schema.

The following code tables are provided:

9.1 XML tag usage by text type

Each of the 4049 texts in the BNC is categorized broadly by type (written fiction, written academic prose, spoken demographic, etc.). This table lists the usage of the various XML elements documented in this manual within the corpus, both in total and in each of the different text types. Note that elements which appear only in corpus or text headers are excluded.

Table 31. Tag usage by Text Type
Each cell gives the share of the element's total occurrences, with the count in parentheses.
Element | Total | Academic writing | Published fiction | News and journalism | Published non-fiction | Other published writing | Unpublished writing | Conversation | Other spoken
align | 407023 | -- | -- | -- | -- | -- | -- | 66.96% (272552) | 33.03% (134471)
bibl | 1036 | 17.85% (185) | 10.90% (113) | -- | 55.59% (576) | 15.54% (161) | 0.09% (1) | -- | --
c | 13614363 | 14.55% (1981729) | 23.65% (3220541) | 8.68% (1182536) | 22.31% (3038629) | 16.64% (2266554) | 3.87% (527101) | 5.03% (684858) | 5.23% (712415)
corr | 17000 | 11.42% (1943) | 7.86% (1337) | 11.07% (1882) | 28.34% (4819) | 28.94% (4921) | 12.12% (2062) | 0.02% (5) | 0.18% (31)
div | 210145 | 12.04% (25308) | 3.10% (6518) | 18.31% (38484) | 20.35% (42778) | 33.35% (70090) | 11.02% (23172) | 1.73% (3640) | 0.07% (155)
event | 6943 | -- | -- | -- | -- | -- | -- | 36.85% (2559) | 63.14% (4384)
gap | 65159 | 21.16% (13790) | 0.35% (232) | 1.62% (1060) | 14.64% (9542) | 16.74% (10911) | 8.79% (5731) | 7.67% (4998) | 28.99% (18895)
head | 222085 | 10.71% (23797) | 2.62% (5836) | 22.11% (49108) | 21.74% (48288) | 33.44% (74283) | 9.35% (20773) | -- | --
hi | 210508 | 27.84% (58613) | 12.50% (26315) | 0.14% (302) | 31.23% (65758) | 25.28% (53236) | 2.98% (6284) | -- | --
item | 117237 | 27.82% (32621) | 0.74% (870) | 2.23% (2621) | 22.93% (26893) | 30.82% (36139) | 15.43% (18093) | -- | --
l | 51310 | 2.59% (1333) | 71.39% (36631) | 0.17% (89) | 13.59% (6974) | 8.62% (4426) | 3.61% (1857) | -- | --
label | 65697 | 43.83% (28799) | 0.65% (430) | 1.66% (1093) | 21.27% (13976) | 21.96% (14428) | 10.61% (6971) | -- | --
lg | 3040 | 7.23% (220) | 54.53% (1658) | 0.23% (7) | 21.71% (660) | 11.71% (356) | 4.57% (139) | -- | --
list | 19758 | 20.72% (4095) | 0.71% (142) | 1.63% (323) | 26.41% (5220) | 31.75% (6274) | 18.74% (3704) | -- | --
mw | 792599 | 19.55% (155017) | 16.83% (133469) | 7.74% (61366) | 25.39% (201249) | 16.73% (132634) | 4.09% (32478) | 3.24% (25742) | 6.38% (50644)
note | 117 | 45.29% (53) | 0.85% (1) | 0.85% (1) | 5.12% (6) | 47.86% (56) | -- | -- | --
p | 1599693 | 8.78% (140550) | 27.13% (434019) | 17.95% (287171) | 18.18% (290826) | 20.35% (325612) | 7.59% (121515) | -- | --
pause | 216354 | -- | -- | -- | -- | -- | -- | 64.98% (140589) | 35.01% (75765)
pb | 94620 | 26.16% (24760) | 25.60% (24224) | 0.15% (148) | 31.63% (29931) | 14.75% (13961) | 1.68% (1596) | -- | --
quote | 15208 | 40.20% (6114) | 4.58% (698) | 0.03% (5) | 45.66% (6945) | 6.14% (934) | 3.36% (512) | -- | --
s | 6026276 | 11.55% (696038) | 21.96% (1323573) | 8.43% (508609) | 18.83% (1135264) | 16.95% (1021633) | 5.02% (303078) | 10.13% (610558) | 7.09% (427523)
shift | 36053 | -- | -- | -- | -- | -- | -- | 70.90% (25564) | 29.09% (10489)
sp | 29112 | 0.21% (62) | 1.28% (373) | -- | 4.76% (1386) | 35.69% (10391) | 58.05% (16900) | -- | --
speaker | 23466 | 0.26% (62) | 1.58% (373) | -- | 5.90% (1385) | 44.16% (10363) | 48.08% (11283) | -- | --
stage | 507 | 1.38% (7) | 10.25% (52) | -- | 5.71% (29) | 82.44% (418) | 0.19% (1) | -- | --
trunc | 52674 | -- | -- | -- | -- | -- | -- | 38.69% (20382) | 61.30% (32292)
u | 784483 | -- | -- | -- | -- | -- | -- | 67.02% (525789) | 32.97% (258694)
unclear | 203045 | -- | -- | -- | -- | -- | -- | 62.39% (126686) | 37.60% (76359)
vocal | 43457 | -- | -- | -- | -- | -- | -- | 63.61% (27645) | 36.38% (15812)
w | 98363707 | 16.04% (15781859) | 16.41% (16143913) | 9.56% (9412174) | 24.58% (24179010) | 18.26% (17970212) | 4.54% (4466681) | 4.30% (4233962) | 6.27% (6175896)
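
Each percentage in Table 31 is the share of an element's total occurrences that falls in one text type. The short Python sketch below recomputes two cells of the align row from the counts given in the table. The function name share is illustrative only, and the use of truncation rather than rounding is an inference from the published figures (134471 out of 407023 is about 33.038%, printed as 33.03%); the manual itself does not state how the figures were derived.

```python
def share(count: int, total: int) -> str:
    """Return count/total as a percentage with two decimals, truncated (not rounded)."""
    # Work in hundredths of a percent with integer arithmetic to avoid float error.
    hundredths = (count * 10000) // total
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


# Counts for the align element, copied from Table 31.
ALIGN_TOTAL = 407023
print("Conversation:", share(272552, ALIGN_TOTAL))
print("Other spoken:", share(134471, ALIGN_TOTAL))
```

The same function can be used to check any other cell against its row total.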

9.2 Voice quality codes

Changes in voice quality in spoken texts are indicated by values for the new attribute on a <shift> element, at the point where the speaker's voice changes. 156 distinct values are used, but most of them appear only infrequently. The following list gives the values which appear more than 10 times in the whole corpus:
voice quality number of occurrences
laughing 9268
reading 2463
singing 2045
shouting 1419
whispering 1247
yawning 363
sighing 276
mimicking 241
spelling 224
crying 108
screaming 97
whining 40
whingeing 38
praying 23
reading bible 22
reading newspaper 20
reading+laughing 15
reading book 14
on telephone 11
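
A frequency list of this kind could be produced along the following lines. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the XML fragment is a made-up illustration in the style of a spoken text, not an extract from the corpus, and the function names voice_quality_counts and frequent are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment for illustration only; not taken from the corpus.
SAMPLE = """<u>
<s n="1"><shift new="laughing"/><w c5="ITJ">Yes</w><c c5="PUN">.</c></s>
<s n="2"><shift new="reading"/><w c5="AV0">Once </w><w c5="PRP">upon </w></s>
<s n="3"><shift new="laughing"/><w c5="ITJ">No</w></s>
</u>"""


def voice_quality_counts(xml_text: str) -> Counter:
    """Count the values of the new attribute on every <shift> element."""
    root = ET.fromstring(xml_text)
    return Counter(
        el.get("new") for el in root.iter("shift") if el.get("new") is not None
    )


def frequent(counts: Counter, threshold: int) -> dict:
    """Keep only the values that occur more than `threshold` times."""
    return {value: n for value, n in counts.most_common() if n > threshold}


counts = voice_quality_counts(SAMPLE)
print(frequent(counts, 1))
```

Run over the whole corpus with a threshold of 10, the same approach should yield a list comparable to the one above, assuming the files are read one at a time rather than as a single string.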

9.3 Gap descriptions

Where material is omitted for some reason during the transcription of a text, either written or spoken, the <gap> element is used to provide a brief description of the material omitted and the reason for its exclusion. The desc attribute supplies the description, and the cause attribute explains why the material was omitted. Over 1700 distinct descriptions are used, but most of them appear only infrequently. The following list gives the 65 values which appear 25 times or more in the whole corpus:
material omitted number of occurrences
name 29698
formula 12476
figure 4060
address 3914
many nonRoman characters 2835
telephone number 2173
table 1393
illustration 903
photograph 338
footnote 197
references etc. 188
date 172
list of names 171
picture 144
personal name 142
advert 141
reference 133
adverts 126
list 123
period quotation 112
name and address 101
phonetic transcription 95
list of venues 95
table of contents 92
names 91
ingredients 91
publication details 84
footnotes 83
contents omitted 72
contents 65
list of ingredients 64
diagram 60
hebrew 59
address, telephone number etc. 50
tel. no 47
cover page 47
text 46
venue, dates, times, prices etc. 46
venue 46
telephone no 44
number 41
names and addresses 41
other venues 41
venues 40
form 39
personal names 38
computer code 38
gaelic 36
email address 36
author details 35
address and telephone 35
cover omitted 35
company name 34
credits 33
caption 31
dates 30
name and phone number 29
sales details 28
address, dates, times, prices etc. 28
period quotation/verse 27
notes 27
map 26
period/overseas quotation 25
quotation 25

The cause of a gap in transcription is usually self-evident, which may be why only a small number of values are used for the cause attribute. The following four values are the most significant:
reason for omission number of occurrences
anonymization 31924
label 303
sampling strategy 115
repeated elsewhere 6

9.4 Event descriptions

The <event> element is used in spoken texts to mark any point at which a transcriber noted a non-linguistic but significant event. The brief texts used to describe such events are highly varied, and there are more than 1500 different values for the desc attribute which stores them. The following list shows the 60 or so values which appear more than 10 times in the whole corpus:
Event description number of occurrences
clapping 1134
music 397
recorded jingle 330
break in recording 310
speaking french 212
pre-recorded blurb 199
too quiet to hear 180
recording ends 177
tape change 158
phonecall starts 138
phone rings 138
phonecall ends 125
tape breaks here 120
piano music 109
paper rustling 93
tv on 80
people talking 77
applause 67
dog barks 63
tape jumps 62
dog barking 55
baby talk 55
jingle 47
banging 47
classroom chatter 47
talk in background 41
people laughing 35
advert 35
door knock 33
noise - traffic 30
playing piano 30
tape ends 28
door bell 27
noise - background 26
portuguese speech 24
writing on board 24
hits ball 23
music in background 21
speaking italian 19
television 19
baby crying 18
door closing 17
bell ring 17
singing 16
crockery noise 16
noseblow 16
telephone conversation ends 16
talking from other room 16
everyone talking 15
radio on 15
noise 15
door opening 15
introduction music 15
drilling noise 14
microphone moved 14
plane overhead 13
knocking 13
phonic 13
closing music 12
noise - train 12
cat noise 12
clicks fingers 11
speaking german 11

9.5 Speaker relationships

In demographically sampled texts, the role of each speaker with respect to the respondent is supplied by the role attribute on the <person> element. The following table lists all 79 values used in the current version of the corpus in descending frequency order.

role name persons
unspecified 6862
other 1454
friend 654
? 354
self 306
colleague 216
daughter 102
son 100
husband 68
wife 66
mother 64
stranger 52
neighbour 50
father 42
sister 42
brother 38
mother-in-law 22
sister-in-law 22
teacher 22
acquaintance 18
brother-in-law 18
employee 14
son-in-law 14
father-in-law 12
granddaughter 12
niece 12
chairman 10
grandson 10
daughter-in-law 8
nephew 8
aunt 6
boyfriend 6
customer 6
girlfriend 6
babysitter 4
cousin 4
fiancé 4
friend's son 4
grandmother 4
lecturer 4
son's teacher 4
aunt-in-law 2
boss 2
boyfriend's father 2
boyfriend's mother 2
brother's friend 2
brother-in-law's mother 2
child's teacher 2
cousin-in-law 2
cousin-in-law's son 2
cousin-in-law's wife 2
daughter's boyfriend 2
daughter's friend 2
employee's wife 2
friend's brother 2
friend's father 2
friend's granddaughter 2
friend's mother 2
friend's sister 2
grandmother-in-law 2
hairdresser 2
hairdresser's son 2
housekeeper 2
husband's great-niece 2
husband's niece 2
neighbour's son 2
partner 2
partner's mother 2
plumber 2
sister's friend 2
sister's friend's mother 2
sister-in-law's father 2
sister-in-law's mother 2
son's friend 2
step-father 2
stepfather 2
uncle 2
visitor 2

9.6 Text and genre classification codes

Texts are classified in several different ways in the BNC, as described in section 5.3.5 Text classification. Each text carries a number of text classification codes, specified as a string of values on the target attribute of its <catRefs> element. Each code identifies one of the values in one of the 23 <taxonomy> elements provided in the BNC Header, corresponding with the design criteria outlined in 1 Design of the corpus. Possible values for these codes and brief explanations of their meanings are listed in the corpus header. Distribution tables showing the number of texts, words, and sentences classified under most of them are given above in section 1 Design of the corpus and elsewhere in the current section.

One of the codes listed below is also supplied for each text as the content of a <classCode> element in its text header, as an alternative way of characterising each text. A description of the analysis scheme used and its rationale are provided in Lee 2001. The codes used in the present version of the corpus have been updated to take note of a small number of corrections made by Lee on his web site (http://clix.to/davidlee00) since publication of that article.

Table 37. Genre classification for spoken texts
Classification Number of texts W-units % S-units %
S brdcast discussn 53 761595 0.77 41144 0.68
S brdcast documentary 10 41893 0.04 2369 0.03
S brdcast news 12 263255 0.26 12454 0.20
S classroom 58 433646 0.44 51355 0.85
S consult 128 139320 0.14 20698 0.34
S conv 153 4233955 4.30 610557 10.13
S courtroom 13 129067 0.13 6366 0.10
S demonstratn 6 32062 0.03 2175 0.03
S interview 13 125096 0.12 11809 0.19
S interview oral history 119 822489 0.83 57831 0.95
S lect commerce 3 15233 0.01 406 0.00
S lect humanities arts 4 51510 0.05 2639 0.04
S lect nat science 4 22938 0.02 1019 0.01
S lect polit law edu 7 51407 0.05 1670 0.02
S lect soc science 13 162030 0.16 8136 0.13
S meeting 132 1391207 1.41 103266 1.71
S parliament 6 97289 0.09 2609 0.04
S pub debate 16 287062 0.29 13347 0.22
S sermon 16 82775 0.08 3345 0.05
S speech scripted 25 193020 0.19 9571 0.15
S speech unscripted 51 469492 0.47 33121 0.54
S sportslive 4 33630 0.03 1867 0.03
S tutorial 18 144783 0.14 8814 0.14
S unclassified 44 425097 0.43 31512 0.52
W ac:humanities arts 87 3358167 3.41 130167 2.15
W ac:medicine 24 1435608 1.45 66811 1.10
W ac:nat science 43 1122939 1.14 51203 0.84
W ac:polit law edu 186 4703304 4.78 190220 3.15
W ac:soc science 142 4785423 4.86 307998 5.11
W ac:tech engin 23 692141 0.70 34982 0.58
W admin 12 222803 0.22 14045 0.23
W advert 59 553625 0.56 42147 0.69
W biography 100 3556688 3.61 172615 2.86
W commerce 112 3807494 3.87 187127 3.10
W email 7 214022 0.21 17411 0.28
W essay school 7 147736 0.15 7871 0.13
W essay univ 3 56273 0.05 2905 0.04
W fict drama 2 46094 0.04 4932 0.08
W fict poetry 30 223682 0.22 38137 0.63
W fict prose 431 16033647 16.30 1293211 21.45
W hansard 4 1168362 1.18 63234 1.04
W institut doc 43 552124 0.56 30159 0.50
W instructional 15 440548 0.44 27875 0.46
W letters personal 6 52915 0.05 2583 0.04
W letters prof 11 66591 0.06 4800 0.07
W misc 502 9237504 9.39 521286 8.65
W news script 32 1248609 1.26 102937 1.70
W newsp brdsht nat: arts 51 352137 0.35 16991 0.28
W newsp brdsht nat: commerce 44 430075 0.43 21103 0.35
W newsp brdsht nat: editorial 12 102718 0.10 5150 0.08
W newsp brdsht nat: misc 95 1040943 1.05 51455 0.85
W newsp brdsht nat: report 49 668613 0.67 32079 0.53
W newsp brdsht nat: science 29 65880 0.06 3390 0.05
W newsp brdsht nat: social 36 82605 0.08 4388 0.07
W newsp brdsht nat: sports 24 300033 0.30 14679 0.24
W newsp other: arts 15 240877 0.24 13005 0.21
W newsp other: commerce 17 419996 0.42 20506 0.34
W newsp other: report 39 2735074 2.78 150676 2.50
W newsp other: science 23 55319 0.05 2798 0.04
W newsp other: social 37 1151490 1.17 63846 1.05
W newsp other: sports 9 1033352 1.05 56802 0.94
W newsp tabloid 6 733066 0.74 51744 0.85
W nonAc: humanities arts 110 3744321 3.80 155839 2.58
W nonAc: medicine 17 504610 0.51 27156 0.45
W nonAc: nat science 62 2533635 2.57 120610 2.00
W nonAc: polit law edu 93 4521040 4.59 208785 3.46
W nonAc: soc science 123 3708033 3.76 183588 3.04
W nonAc: tech engin 123 1220026 1.24 52750 0.87
W pop lore 211 7450814 7.57 425448 7.05
W religion 35 1132976 1.15 60334 1.00

9.7 Contracted forms and multiwords

The following tables summarize and document the tokenization decisions taken by the CLAWS system, where these do not coincide with normal orthographic convention.

The first list specifies common word-endings or enclitics which are regarded by CLAWS as indicating the start of a new ‘word’, although words containing them are conventionally represented as a single orthographic word.

The second list specifies some common two-, three- or four-word phrases treated by CLAWS as single tokens. These are represented in this version of the corpus by means of a <mw> element; the table gives the C5 code assigned to this element, and also the codes assigned to the distinct <w> elements constituting it.

9.7.1 Contracted forms

Words ending with certain character strings are treated by CLAWS as distinct words, even though they are conventionally fused together when written. For example, ‘they're’ is treated as if it were two distinct ‘words’ — they and 're. The fact that these two items are orthographically fused is evident in the XML encoding of the corpus because there is no whitespace following the string ‘they’. Some XML processors may however assume that the end of an XML element such as the <w> enclosing the string should always be treated as a word separator, and may therefore introduce unwanted extra space.
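
The difference between the two treatments can be seen in a small Python sketch. The <s> element below is a hypothetical example in the style of the corpus (the attribute values are illustrative assumptions, not quoted from any text). Concatenating element content as it stands preserves the fused form, whereas joining tokens with a space introduces the unwanted gap.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence for illustration; note there is no space after "they".
SAMPLE = """<s n="1"><w c5="PNP" hw="they" pos="PRON">they</w><w c5="VBB" hw="be" pos="VERB">'re </w><w c5="AV0" hw="here" pos="ADV">here</w><c c5="PUN">.</c></s>"""

sentence = ET.fromstring(SAMPLE)
# Collect the text of every word (<w>) and punctuation (<c>) element in order.
tokens = [el.text or "" for el in sentence.iter() if el.tag in ("w", "c")]

# Faithful: rely on the whitespace already present inside the elements.
faithful = "".join(tokens)
# Naive: treat every element boundary as a word separator.
naive = " ".join(t.strip() for t in tokens)

print(faithful)
print(naive)
```

Software that needs the original orthography should therefore take the first approach and leave whitespace handling to the corpus encoding.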

In the following table we show how contracted forms are tokenized by CLAWS. The left column shows the contracted form; the right column shows the content of the two or more <w> elements used to represent it.

orthographic form tokenization
[word]'d [word] 'd
[word]'m [word] 'm
[word]'s [word] 's
[word]'ll [word] 'll
[word]n't [word] n't
[word]'re [word] 're
[word]'ve [word] 've
[word]'d've [word] 'd 've
'tis 't is
'twas 't was
'twere 't were
'twould 't would
ain't ai n't
aint ai nt
aintcha ai nt cha
arent are nt
c'mon c'm on
can't ca n't
cannot can not
couldnt could nt
d'ya d' ya
d'you d' you
didnt did nt
doesnt does nt
dont do nt
dunnit dun n it
dunno du n no
geroff ger off
gimme gim me
gonna gon na
gotta got ta
hadnt had nt
hasnt has nt
hes he s
inne in n e
innit in n it
isnt is nt
lorra lor ra
m'lud m' lud
ought'a ough t 'a
oughta ought a
shan't sha n't
shouldn't've should n't 've
shouldn't should n't
t'other t' other
thats that s
theres there s
theyve they ve
tis t is
twas t was
twere t were
twould t would
wanna wan na
wannit wann it
wasnt was nt
weve we ve
won't wo n't
wotta wott a
wouldn't've would n't 've
wouldnt would nt
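
The splitting behaviour shown in the table can be approximated with a few lines of Python. This is a rough sketch for illustration, not the CLAWS algorithm: the function name split_contraction is hypothetical, only a handful of the whole-word entries are included, and the regular expression covers just the general enclitic rows (n't, 'll, 're, 've, 'd, 'm, 's) written with a plain ASCII apostrophe.

```python
import re

# A few whole-word entries copied from the table above (not the full list).
EXCEPTIONS = {
    "gonna": ["gon", "na"],
    "wanna": ["wan", "na"],
    "dunno": ["du", "n", "no"],
    "innit": ["in", "n", "it"],
    "cannot": ["can", "not"],
}

# One trailing enclitic preceded by at least one other character.
ENCLITIC = re.compile(r"^(.+?)(n't|'ll|'re|'ve|'d|'m|'s)$")


def split_contraction(word: str) -> list[str]:
    """Split one orthographic word into the pieces listed in the table."""
    parts = EXCEPTIONS.get(word.lower())
    if parts is not None:
        # Slice the original word so that its capitalisation is preserved.
        pieces, pos = [], 0
        for part in parts:
            pieces.append(word[pos:pos + len(part)])
            pos += len(part)
        return pieces
    clitics: list[str] = []
    rest = word
    # Peel enclitics off the right-hand end until none is left.
    while (m := ENCLITIC.match(rest)) is not None:
        rest = m.group(1)
        clitics.insert(0, m.group(2))
    return [rest] + clitics


for example in ["they're", "can't", "shouldn't've", "Gonna"]:
    print(example, "->", split_contraction(example))
```

Forms such as hes or thats, which the table splits without any apostrophe, would need further whole-word entries, since a general rule for them would also split ordinary words ending in s.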

9.7.2 Multiwords

CLAWS recognizes certain sequences of orthographically distinct words as constituting a single item: examples include common prepositional phrases such as ‘in spite of’, as well as phrases from other languages such as ‘aide memoire’. In this version of the corpus, such items are explicitly tagged using an XML <mw> (for multiword) tag carrying the appropriate wordclass tag, as indicated below. Within this <mw> element however, in a departure from earlier versions of the corpus, the individual words are also tagged using <w> tags in the same way as elsewhere in the corpus.

The following table lists alphabetically all multiwords recognized in the corpus, indicating for each both the wordclass code or codes assigned to it and the wordclass codes assigned to its constituent <w> elements. Note that these latter wordclass codes were assigned automatically during the XML conversion process and therefore should not be included in any assessment of the CLAWS error rate.

multiword mw wordclass/es constituent wordclasses
ab initio AV0 or AJ0 UNC UNC
a bit AV0 AT0 NN1
a capella AJ0 or AV0 UNC UNC
according as CJS VVG CJS
according to PRP VVG PRP
ad astra AV0 or AJ0 UNC UNC
ad hoc AV0 or AJ0 UNC UNC
ad hominem AV0 or AJ0 UNC UNC
ad infinitum AV0 UNC UNC
adjacent to PRP AJ0 PRP
ad lib AJ0 or AV0 or NN1 UNC UNC
ad nauseam AV0 or AJ0 UNC UNC
affaire de coeur NN1 UNC UNC UNC
affaire d'honneur NN1 UNC UNC
a fortiori AV0 or AJ0 UNC UNC
agent provocateur NN1 UNC UNC
agnus dei NN1 UNC UNC
a good deal AV0 AT0 AJ0 NN1
a great deal AV0 AT0 AJ0 NN1
ahead of PRP AV0 PRF
a heck of a lot AV0 AT0 NN1 PRF AT0 NN1
aide de camp NN1 UNC UNC UNC
aide memoire NN1 UNC UNC
a la PRP UNC UNC
a la carte AJ0 or AV0 UNC UNC UNC
a la mode AJ0 or AV0 UNC UNC UNC
al dente AJ0 or AV0 UNC UNC
al fresco AV0 or AJ0 UNC UNC
a little AV0 AT0 AV0/DT0
a little bit AV0 AT0 AJ0 NN1
alla breve AV0 or AJ0 or NN0 UNC UNC
all but AV0 AV0 CJS
all of a sudden AV0 DT0 PRF AT0 NN1
all right AV0 or AJ0 AV0 AV0
all the same AV0 DT0 AT0 DT0
alma mater NN1 UNC UNC
along with PRP AVP PRP
a lot AV0 AT0 NN1
alter ego NN1 UNC UNC
an' all AV0 CJC DT0
an awful lot AV0 AT0 AJ0 NN1
ancien regime NN1 UNC UNC
and so forth AV0 CJC AV0 AV0
and so on AV0 CJC AV0 AV0
anno domini AV0 or NN1 UNC UNC
annus horribilis NN1 UNC UNC
annus mirabilis NN1 UNC UNC
ante meridiem AV0 UNC UNC
any longer AV0 AV0 AV0
anything but AV0 PNI AV0
apart from PRP AV0 PRP
a posteriori AV0 or AJ0 UNC UNC
a priori AV0 or AJ0 UNC UNC
a propos PRP or AV0 UNC UNC
aqua vitae NN1 UNC UNC
art nouveau NN1 UNC UNC
as against PRP CJS PRP
as between PRP CJS PRP
as for PRP CJS PRP
as from PRP CJS PRP
aside from PRP AV0 PRP
as if CJS CJS CJS
as it were AV0 CJS PNP VBD
as long as CJS AV0 AV0 CJS
as of PRP CJS PRF
as opposed to PRP CJS VVN PRP
as regards PRP CJS VVZ
as soon as CJS AV0 AV0 CJS
as though CJS CJS CJS
asti spumante NN1 UNC UNC
as to PRP CJS PRP
as usual AV0 CJS AJ0
as well as PRP AV0 AV0 CJS
as well AV0 CJS AV0
as yet AV0 CJS AV0
at all AV0 PRP DT0
at best AV0 PRP AJS
at first AV0 PRP ORD
at large AV0 PRP AJ0
at last AV0 PRP ORD
at least AV0 PRP AV0
at length AV0 PRP NN1
at long length AV0 PRP AJ0 NN1
at most AV0 PRP DT0
at once AV0 PRP AV0
at present AV0 PRP NN1
at random AV0 PRP AJ0
at worst AV0 PRP AV0
au contraire AV0 UNC UNC
au fait AJ0 UNC UNC
auf wiedersehen ITJ UNC UNC
au pair NN1 UNC UNC
au revoir ITJ UNC UNC
aurora australis NN1 UNC UNC
aurora borealis NN1 UNC UNC
avant garde NN1 or AJ0 UNC UNC
away from PRP AV0 PRP
bar mitzvah NN1 or AJ0 UNC UNC
basso profundo NN1 UNC UNC
beau monde NN1 UNC UNC
because of PRP CJS PRF
belles lettres NN2 UNC UNC
bete noire NN1 UNC UNC
billet doux NN1 UNC UNC
bona fides NN2 UNC UNC
bona fide AJ0 UNC UNC
bon appetit ITJ UNC UNC
bon mot NN1 UNC UNC
bon vivant NN1 UNC UNC
bon viveur NN1 UNC UNC
bon voyage ITJ UNC UNC
brand new AJ0 NN1 AJ0
but for PRP CJS PRP
by and by AV0 AVP CJC AVP
by and large AV0 AVP CJC AJ0
by far AV0 PRP AV0
by means of PRP PRP NN0 PRF
by no means AV0 PRP PRP NN0
by now AV0 PRP AV0
by reason of PRP PRP NN1 PRF
by the by AV0 PRP AT0 NN1
by way of PRP PRP NN1 PRF
cafe au lait NN1 UNC UNC UNC
camera obscura NN1 UNC UNC
carte blanche NN1 UNC UNC
casus belli NN1 UNC UNC
cause celebre NN1 UNC UNC
ceteris paribus AV0 UNC UNC
chaise longue NN1 UNC UNC
charge d'affaires NN1 UNC UNC
chez moi AV0 UNC UNC
chez nous AV0 UNC UNC
chilli con carne NN1 NN1 NN1 NN1
chop suey NN1 UNC UNC
chow mein NN1 UNC UNC
clamp down NN1 VVB/VVI AVP
close to AV0 AV0/AJ0 PRP
compos mentis AJ0 UNC UNC
con brio AJ0 or AV0 UNC UNC
con fuoco AJ0 or AV0 UNC UNC
con moto AJ0 or AV0 UNC UNC
considering that CJS VVG CJT
contrary to PRP JJ PRP
cordon bleu NN1 UNC UNC
cordon sanitaire NN1 UNC UNC
corpus delicti NN1 UNC UNC
corpus juris NN1 UNC UNC
coup de grace NN1 UNC UNC UNC
coup d'etat NN1 UNC UNC
coup de theatre NN1 UNC UNC UNC
creme de la creme NN1 UNC UNC UNC UNC
creme de menthe NN1 UNC UNC UNC
cri de coeur NN1 UNC UNC UNC
croix de guerre NN0 UNC UNC UNC
cul de sac NN1 UNC UNC UNC
danse macabre NN1 UNC UNC
de facto AV0 or AJ0 UNC UNC
dei gratia AV0 UNC UNC
deja vu NN1 UNC UNC
de jure AV0 or AJ0 UNC UNC
delirium tremens NN1 UNC UNC
de luxe AJ0 UNC UNC
demi monde NN1 UNC UNC
depending on PRP VVG PRP
de profundis AV0 UNC UNC
de rigeur AJ0 UNC UNC
de trop AJ0 UNC UNC
deus ex machina NN1 UNC UNC UNC
double entendre NN1 UNC UNC
dramatis personae NN2 UNC UNC
due to PRP AJ0 PRP
each other PNX DT0 NN1
eminence grise NN1 UNC UNC
en bloc AV0 UNC UNC
en famille AV0 UNC UNC
enfants terribles NN2 UNC UNC
enfant terrible NN1 UNC UNC
en masse AV0 UNC UNC
en passant AV0 UNC UNC
en route AV0 UNC UNC
en suite AJ0 UNC UNC
entente cordiale NN1 UNC UNC
esprit de corps NN1 UNC UNC UNC
et al AV0 UNC UNC
et cetera AV0 UNC UNC
even if CJS AV0 CJS
even so AV0 AV0 AV0
even though CJS AV0 CJS
even when CJS AV0 CJS
ever so AV0 AV0 AV0
every so often AV0 AT0 AV0 AV0
ex army AJ0 PRP NN1
ex cathedra AV0 or AJ0 UNC UNC
except for PRP CJS PRP
excepting for PRP VVG PRP
except that CJS CJS CJT
ex gratia AV0 or AJ0 UNC UNC
ex libris AV0 UNC UNC
ex officio AV0 or AJ0 UNC UNC
ex parte AV0 or AJ0 UNC UNC
ex tempore AV0 or AJ0 UNC UNC
fait accompli NN1 UNC UNC
far from AV0 AV0 PRP
far off AJ0 AV0 AVP
faux amis NN2 UNC UNC
faux ami NN1 UNC UNC
faux pas NN0 UNC UNC
fed up AJ0 VVN AVP
femme fatale NN1 UNC UNC
fin de siecle NN1 UNC UNC UNC
follow up NN1 VVB/VVI AVP
force majeure NN1 UNC UNC
for certain AV0 PRP AJ0
for ever AV0 PRP AV0
for example AV0 PRP NN1
for fear of PRP PRP NN1 PRF
for good AV0 PRP AJ0
for instance AV0 PRP NN1
for keeps AV0 PRP NN2
for long AV0 PRP AV0
for once AV0 PRP AV0
for sure AV0 PRP AJ0
for the most part AV0 PRP AT0 AV0 NN1
for the time being AV0 PRP AT0 NN1 VBG
fromage frais NN1 UNC UNC
from now on AV0 PRP AV0 AVP
from time to time AV0 PRP NN1 PRP NN1
getting on for AV0 VVG AVP PRP
grande dame NN1 UNC UNC
grand prix NN1 UNC UNC
grown ups NN2 VVN NN2
grown up NN1 VVN AVP
gung ho AJ0 or AV0 UNC UNC
habeas corpus NN1 UNC UNC
half way AV0 DT0 NN1
hara kiri NN1 UNC UNC
hard up AJ0 AJ0 AVP
hasta la vista ITJ UNC UNC UNC
hasta luego ITJ UNC UNC
haute couture NN1 UNC UNC
haute cuisine NN1 UNC UNC
have nots NN2 VHB NN2
hey presto ITJ ITJ ITJ
hoi polloi NN0 UNC UNC
homo sapiens NN1 UNC UNC
hors d'oeuvres NN2 UNC UNC
hors d'oeuvre NN1 UNC UNC
hysteron proteron NN1 UNC UNC
idee fixe NN1 UNC UNC
in absentia AV0 UNC UNC
in accordance with PRP PRP NN1 PRP
in accord with PRP PRP NN1 PRP
in addition AV0 PRP NN1
in addition to PRP PRP NN1 PRP
in aid of PRP PRP NN1 PRF
in answer to PRP PRP NN1 PRP
in as much as CJS PRP AV0 DT0 CJS
inasmuch as CJS UNC CJS
in association with PRP PRP NN1 PRP
in back of PRP PRP NN1 PRF
in between PRP or AV0 AVP PRP/AV0
in brief AV0 PRP AJ0
in camera AV0 UNC UNC
in case of PRP PRP NN1 PRF
in case CJS or AV0 PRP NN1
in charge of PRP PRP NN1 PRF
in common AV0 PRP AJ0
in common with PRP PRP NN1 PRP
in comparison with PRP PRP NN1 PRP
in conjunction with PRP PRP NN1 PRP
in connection with PRP PRP NN1 PRP
in consultation with PRP PRP NN1 PRP
in contact with PRP PRP NN1 PRP
in cooperation with PRP PRP NN1 PRP
in course with PRP PRP NN1 PRP
in defence of PRP PRP NN1 PRF
in defiance of PRP PRP NN1 PRF
in excess of PRP PRP NN1 PRF
in extremis AV0 UNC UNC
in face of PRP PRP NN1 PRF
in favor of PRP PRP NN1 PRF
in favour of PRP PRP NN1 PRF
in flagrante delicto AV0 or AJ0 UNC UNC UNC
in front of PRP PRP NN1 PRF
in full AV0 PRP AJ0
in general AV0 PRP AJ0
in keeping with PRP PRP NN1 PRP
in lieu of PRP PRP UNC PRF
in light of PRP PRP NN1 PRF
in line with PRP PRP NN1 PRP
in loco parentis AV0 or AJ0 UNC UNC UNC
in medias res AV0 UNC UNC UNC
in memoriam AV0 UNC UNC
in need of PRP PRP NN1 PRF
in particular AV0 PRP AJ0
in perpetuum AV0 UNC UNC
in place of PRP PRP NN1 PRF
in possession of PRP PRP NN1 PRF
in private AV0 PRP AJ0
in proportion to PRP PRP NN1 PRP
in propria persona AV0 UNC UNC UNC
in public AV0 PRP AJ0
in pursuit of PRP PRP NN1 PRF
in quest of PRP PRP NN1 PRF
in receipt of PRP PRP NN1 PRF
in regard to PRP PRP NN1 PRP
in relation to PRP PRP NN1 PRP
in reply to PRP PRP NN1 PRP
in respect of PRP PRP NN1 PRF
in response to PRP PRP NN1 PRP
in return for PRP PRP NN1 PRP
in search of PRP PRP NN1 PRF
in short AV0 PRP AJ0
inside out AV0 or AJ0 AV0 AVP
in situ AV0 UNC UNC
in so far as CJS PRP AV0 AV0 CJS
insofar as CJS UNC CJS
in spite of PRP PRP NN1 PRF
instead of PRP AV0 PRF
in support of PRP PRP NN1 PRF
inter alia AV0 UNC UNC
in terms of PRP PRP NN2 PRF
in that CJS PRP CJT
in the light of PRP PRP AT0 NN1 PRF
in the main AV0 PRP AT0 AJ0
in the order of AV0 PRP AT0 NN1 PRF
into line with PRP PRP NN1 PRP
in toto AV0 or AJ0 UNC UNC
in touch with PRP PRP NN1 PRP
in vain AV0 PRP AJ0
in view of PRP PRP NN1 PRF
in vitro AJ0 or AV0 UNC UNC
in vivo AJ0 or AV0 UNC UNC
ipso facto AV0 UNC UNC
irrespective of PRP AJ0 PRF
je ne sais quoi NN1 UNC UNC UNC UNC
joie de vivre NN1 UNC UNC UNC
just about AV0 AV0 AV0
kind of AV0 NN1 PRF
know how NN1 VVB AVQ
kung fu NN1 UNC UNC
la dolce vita NN1 UNC UNC UNC
laissez faire NN1 UNC UNC
le mot juste NN1 UNC UNC UNC
less than AV0 AV0/DT0 CJS
let alone PRP VVB AJ0
let 's VM0 VVB PNP
lingua franca NN0 UNC UNC
lo and behold ITJ ITJ CJC VVB
loc cit AV0 UNC UNC
locum tenens NN1 UNC UNC
long-term wise AV0 AJ0 AV0
magna carta NN1 UNC UNC
magna cum laude AJ0 or AV0 UNC UNC UNC
magnum opus NN1 UNC UNC
maitre d'hotel NN1 UNC UNC
mal de mer NN1 UNC UNC UNC
matter of fact NN1 or AJ0 NN1 PRF NN1
mea culpa NN1 UNC UNC
medecins sans frontieres NN0 UNC UNC UNC
medicins sans frontieres NN0 UNC UNC UNC
menage a trois NN1 UNC UNC UNC
mezzo soprano NN1 UNC UNC
modus operandi NN1 UNC UNC
modus vivendi NN1 UNC UNC
more than AV0 AV0/DT0 CJS
more than AV0 AV0 CJS
mot juste NN1 UNC UNC
nearer to PRP AJC/AV0 PRP
nearest to PRP AJS/AV0 PRP
near to PRP AJ0/AV0 PRP
nem con AV0 UNC UNC
next to PRP ORD PRP
nigh on AV0 AV0 AVP
noblesse oblige NN1 UNC UNC
no doubt AV0 AT0 NN1
no longer AV0 AV0 AV0
no matter who PNQ AT0 NN1 PNQ
no matter whom PNQ AT0 NN1 PNQ
no matter whose DTQ AT0 NN1 DTQ
nom de guerre NN1 UNC UNC UNC
nom de plume NN1 UNC UNC UNC
non compos mentis AJ0 UNC UNC UNC
none other PNI PNI AJ0
none the less AV0 PNI AT0 AV0
none the AV0 PNI AT0
non sequitur NN1 UNC UNC
no one PNI AT0 PNI
not withstanding AV0 XX0 UNC
nouveau riche NN1 UNC UNC
nouveaux riches NN2 UNC UNC
nouvelle cuisine NN1 UNC UNC
now that CJS AV0 CJT
objet d'art NN1 UNC UNC
objets d'art NN2 UNC UNC
of course AV0 PRF NN1
off guard AV0 PRP NN1
off of PRP AVP PRF
oft times AV0 AV0 NN2
old fashioned AJ0 AJ0 VVN
on account of PRP PRP NN1 PRF
on behalf of PRP PRP NN1 PRF
on board PRP or AV0 PRP NN1
once again AV0 AV0 AV0
once and for all AV0 AV0 CJC PRP DT0
once more AV0 AV0 AV0
one another PNX CRD DT0
one 's CRD CRD POS
on the part of PRP PRP AT0 NN1 PRF
on top of PRP PRP NN1 PRF
on to PRP AVP PRP/TO0
op cit AV0 UNC UNC
other than PRP AJ0 CJS
out of PRP AVP PRF
out of date AJ0 AVP PRF NN1
out of line with PRP AVP PRF NN1 PRP
out of touch with PRP AVP PRF NN1 PRP
outside of PRP PRP PRF
over here AV0 PRP AV0
over there AV0 PRP AV0
owing to PRP VVG PRP
papier mache NN1 UNC UNC
par excellence AJ0 UNC UNC
pas de deux NN0 UNC UNC UNC
pate de foie gras NN1 UNC UNC UNC UNC
pax britannica NN1 UNC UNC
pax romana NN1 UNC UNC
per annum AV0 UNC UNC
per capita AV0 or AJ0 UNC UNC
per cent NN0 UNC UNC
per diem AV0 or AJ0 or NN1 UNC UNC
per se AV0 UNC UNC
personae non gratae NN2 UNC UNC UNC
persona non grata NN1 UNC UNC UNC
pertaining to PRP VVG PRP
petit bourgeois NN1 UNC UNC
petite bougeoisie NN1 UNC UNC
petits bourgeois NN2 UNC UNC
piece de resistance NN1 UNC UNC UNC
pied a terre NN1 UNC UNC UNC
pina colada NN1 UNC UNC
pince nez NN0 UNC UNC
poco a poco AV0 UNC UNC UNC
point blank AV0 or AJ0 NN1 AJ0
poste restante NN1 or AV0 UNC UNC
post hoc AV0 or AJ0 UNC UNC
post meridiem AV0 UNC UNC
post mortem NN1 or AJ0 UNC UNC
pot pourri NN1 UNC UNC
prima donna NN1 UNC UNC
prima facie AJ0 or AV0 UNC UNC
primus inter pares NN1 UNC AJ0 UNC
prior to PRP AJ0 PRP
pro forma NN1 UNC UNC
pro rata AV0 or AJ0 UNC UNC
pro tem AV0 UNC UNC
provided that CJS VVN CJT
providing that CJS VVG CJT
pursuant to PRP AJ0 PRP
quid pro quo NN1 UNC UNC UNC
raison d'etre NN1 UNC UNC
rather than PRP AV0 CJS
relative to PRP AJ0 PRP
rigor mortis NN1 UNC UNC
roman a clef NN1 UNC UNC UNC
save for PRP VVI PRP
save that CJS VVI CJT
savoir faire NN1 UNC UNC
savoir vivre NN1 UNC UNC
seeing as CJS VVG CJS
seeing that CJS VVG CJT
semper fidelis AJ0 UNC UNC
shish kebab NN1 UNC UNC
sine die AV0 UNC UNC
sine qua non NN1 UNC UNC UNC
sinn fein NN1 UNC UNC
so called AJ0 AV0 VVN
so long as CJS AV0 AV0 CJS
some one PNI DT0 PNI
something like AV0 PNI PRP
so much as AV0 AV0 DT0 CJS
son et lumiere NN1 UNC UNC UNC
sort of AV0 NN1 PRF
so that CJS AV0 CJT
sotto voce AV0 or AJ0 UNC UNC
spaghetti bolognese NN1 UNC UNC
spot on AV0 or AJ0 NN1 AVP
status quo NN1 UNC UNC
straight forward AJ0 AV0 AJ0
subject to PRP AJ0 PRP
sub judice AV0 or AJ0 UNC UNC
sub poena NN1 UNC UNC
subsequent to PRP AJ0 PRP
such as PRP DT0 PRP
such that CJS DT0 CJT
sui generis AJ0 UNC UNC
sui juris AJ0 UNC UNC
summa cum laude AJ0 or AV0 UNC UNC UNC
super duper AJ0 AJ0 XXX
supposing that CJS VVG PRP
table d'hote NN1 UNC UNC
tabula rasa NN1 UNC UNC
tai chi NN1 UNC UNC
tai kwan do NN1 UNC UNC UNC
terra firma NN1 UNC UNC
terra incognita NN1 UNC UNC
thanks to PRP NN2 PRP
that is AV0 DT0 VBZ
that is to say AV0 DT0 VBZ TO0 VVI
through thick and thin AV0 PRP AJ0 CJC AJ0
time and again AV0 NN1 CJC AV0
to and fro AV0 PRP CJC AV0
tour de force NN1 UNC UNC UNC
tout court AJ0 UNC UNC
tout de suite AV0 UNC UNC UNC
ultra vires AJ0 or AV0 UNC UNC
under way AV0 PRP NN1
up front AJ0 or AV0 AVP AJ0
upside down AV0 or AJ0 NN1 AVP
up to PRP or AV0 AVP PRP/TO0
up to date AJ0 AVP TO0 NN1
up to the minute AJ0 AVP PRP AT0 NN1
up until PRP AVP CJS/PRP
upward of AV0 AV0 PRF
upwards of AV0 AV0 PRF
vice versa AV0 UNC UNC
vin de table NN1 UNC UNC UNC
vis a vis PRP UNC UNC UNC
viva voce NN1 or AJ0 or AV0 UNC UNC
vol au vent NN1 UNC UNC UNC
volte face NN1 UNC UNC
vox populi NN1 UNC UNC
well off AJ0 AV0 AVP
whether or not CJS CJS CJC XX0
wiener schnitzel NN1 UNC UNC
with a view to PRP PRP AT0 NN1 PRP
with reference to PRP PRP NN1 PRP
with regard to PRP PRP NN1 PRP
with relation to PRP PRP NN1 PRP
with respect to PRP PRP NN1 PRP
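
As an illustration of how the two levels of tagging relate, the following Python sketch extracts each multiword from a fragment together with its own wordclass code and those of its constituents. The fragment is a hypothetical example built from the "in spite of" row of the table; the attribute name c5 for the C5 code, and the function name multiwords, are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence; the c5 values follow the "in spite of" row above.
SAMPLE = """<s n="1"><mw c5="PRP"><w c5="PRP" hw="in" pos="PREP">in </w><w c5="NN1" hw="spite" pos="SUBST">spite </w><w c5="PRF" hw="of" pos="PREP">of </w></mw><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="rain" pos="SUBST">rain</w></s>"""


def multiwords(xml_text: str) -> list:
    """Return (text, mw code, constituent codes) for every <mw> element."""
    root = ET.fromstring(xml_text)
    found = []
    for mw in root.iter("mw"):
        words = list(mw.iter("w"))
        # Join the constituent words as written, then drop the trailing space.
        text = "".join(w.text or "" for w in words).strip()
        found.append((text, mw.get("c5"), [w.get("c5") for w in words]))
    return found


for text, code, parts in multiwords(SAMPLE):
    print(text, code, " ".join(parts))
```

A program that counts words should decide explicitly whether to count the <mw> element once or its constituent <w> elements individually, since both are present in the encoding.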

9.8 Simplified Wordclass Tags

This table lists, for each of the twelve simplified wordclass tags used by the pos attribute, the corresponding CLAWS C5 tags of which the class consists.
POS value significance combines
ADJ adjective AJ0, AJC, AJS, CRD, DT0, ORD
ADV adverb AV0, AVP, AVQ, XX0
ART article AT0
CONJ conjunction CJC, CJS, CJT
INTERJ interjection ITJ
PREP preposition PRF, PRP, TO0
PRON pronoun DPS, DTQ, EX0, PNI, PNP, PNQ, PNX
STOP punctuation POS, PUL, PUN, PUQ, PUR
SUBST substantive NN0, NN1, NN2, NP0, ONE, ZZ0, NN1-NP0, NP0-NN1
UNC unclassified, uncertain, or non-lexical word UNC, AJ0-AV0, AV0-AJ0, AJ0-NN1, NN1-AJ0, AJ0-VVD, VVD-AJ0, AJ0-VVG, VVG-AJ0, AJ0-VVN, VVN-AJ0, AVP-PRP, PRP-AVP, AVQ-CJS, CJS-AVQ, CJS-PRP, PRP-CJS, CJT-DT0, DT0-CJT, CRD-PNI, PNI-CRD, NN1-VVB, VVB-NN1, NN1-VVG, VVG-NN1, NN2-VVZ, VVZ-NN2
VERB verb VBB, VBD, VBG, VBI, VBN, VBZ, VDB, VDD, VDG, VDI, VDN, VDZ, VHB, VHD, VHG, VHI, VHN, VHZ, VM0, VVB, VVD, VVG, VVI, VVN, VVZ, VVD-VVN, VVN-VVD
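
For software that needs to move from a C5 tag to its simplified class, the table can be turned into a lookup. The Python sketch below transcribes the rows listed above; the names SIMPLIFIED, C5_TO_POS and simplify are illustrative only.

```python
# Simplified class -> C5 tags, transcribed from the rows of the table above.
SIMPLIFIED = {
    "ADJ": "AJ0 AJC AJS CRD DT0 ORD",
    "ADV": "AV0 AVP AVQ XX0",
    "ART": "AT0",
    "CONJ": "CJC CJS CJT",
    "INTERJ": "ITJ",
    "PREP": "PRF PRP TO0",
    "PRON": "DPS DTQ EX0 PNI PNP PNQ PNX",
    "STOP": "POS PUL PUN PUQ PUR",
    "SUBST": "NN0 NN1 NN2 NP0 ONE ZZ0 NN1-NP0 NP0-NN1",
    "UNC": "UNC AJ0-AV0 AV0-AJ0 AJ0-NN1 NN1-AJ0 AJ0-VVD VVD-AJ0 AJ0-VVG VVG-AJ0 "
           "AJ0-VVN VVN-AJ0 AVP-PRP PRP-AVP AVQ-CJS CJS-AVQ CJS-PRP PRP-CJS "
           "CJT-DT0 DT0-CJT CRD-PNI PNI-CRD NN1-VVB VVB-NN1 NN1-VVG VVG-NN1 "
           "NN2-VVZ VVZ-NN2",
    "VERB": "VBB VBD VBG VBI VBN VBZ VDB VDD VDG VDI VDN VDZ VHB VHD VHG VHI "
            "VHN VHZ VM0 VVB VVD VVG VVI VVN VVZ VVD-VVN VVN-VVD",
}

# Invert the table so that each C5 tag points at its simplified class.
C5_TO_POS = {c5: pos for pos, tags in SIMPLIFIED.items() for c5 in tags.split()}


def simplify(c5: str) -> str:
    """Return the simplified wordclass for a C5 tag listed in the table."""
    return C5_TO_POS[c5]


print(simplify("AJ0"), simplify("TO0"), simplify("NN1-VVB"))
```

Note that most ambiguity tags fall under UNC, while NN1-NP0, NP0-NN1, VVD-VVN and VVN-VVD stay within SUBST and VERB respectively, because both alternatives belong to the same simplified class.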



edited by Lou Burnard. Date: January 2007
This page is copyrighted