Miscellaneous tables
This section consists of a series of supplementary tables listing
values used for some open or semi-open value lists, and other
aspects of the corpus and its encoding not covered by the reference
information elsewhere in this manual.
The following code tables are provided:
XML tag usage by text type: a breakdown of XML tag usage by
text type;
Voice quality codes: the most frequent values used in the corpus
for the new attribute on the shift element, to
indicate changes in voice quality for spoken texts;
Gap descriptions: the most frequent values used in
the corpus for the desc attribute on the gap
element, to describe material not transcribed;
Event descriptions: the most frequent values used in
the corpus for the desc attribute on the event
element, which describes non-linguistic events noted by the
transcriber of a spoken text;
Speaker relationships: the codes used to identify
the relationships of participants, as specified in the
role attribute on person;
Text and genre classification codes: the text-type codes making up
David Lee's text classification system as applied to the BNC;
Contracted forms and multiwords: the multiword items identified
by the CLAWS system and tagged as mw elements, together with
the C5 wordclass tag assigned to each of their constituent
parts;
Simplified Wordclass Tags: the mapping between simple POS
codes and CLAWS5 wordclass tags.
XML tag usage by text type
Each of the 4049 texts in the BNC is categorized broadly by type
(written fiction, written academic prose, spoken demographic,
etc.). This table lists the usage of the various XML elements
documented in this manual within the corpus, both in total and in each of the different
text types. Note that elements which appear only in corpus or text
headers are excluded.
Tag usage by Text Type
Element | Total | Academic writing | Published fiction | News and journalism | Published non-fiction | Other published writing | Unpublished writing | Conversation | Other spoken
align | 407023 | -- | -- | -- | -- | -- | -- | 66.96% 272552 | 33.03% 134471
bibl | 1036 | 17.85% 185 | 10.90% 113 | -- | 55.59% 576 | 15.54% 161 | 0.09% 1 | -- | --
c | 13614363 | 14.55% 1981729 | 23.65% 3220541 | 8.68% 1182536 | 22.31% 3038629 | 16.64% 2266554 | 3.87% 527101 | 5.03% 684858 | 5.23% 712415
corr | 17000 | 11.42% 1943 | 7.86% 1337 | 11.07% 1882 | 28.34% 4819 | 28.94% 4921 | 12.12% 2062 | 0.02% 5 | 0.18% 31
div | 210145 | 12.04% 25308 | 3.10% 6518 | 18.31% 38484 | 20.35% 42778 | 33.35% 70090 | 11.02% 23172 | 1.73% 3640 | 0.07% 155
event | 6943 | -- | -- | -- | -- | -- | -- | 36.85% 2559 | 63.14% 4384
gap | 65159 | 21.16% 13790 | 0.35% 232 | 1.62% 1060 | 14.64% 9542 | 16.74% 10911 | 8.79% 5731 | 7.67% 4998 | 28.99% 18895
head | 222085 | 10.71% 23797 | 2.62% 5836 | 22.11% 49108 | 21.74% 48288 | 33.44% 74283 | 9.35% 20773 | -- | --
hi | 210508 | 27.84% 58613 | 12.50% 26315 | 0.14% 302 | 31.23% 65758 | 25.28% 53236 | 2.98% 6284 | -- | --
item | 117237 | 27.82% 32621 | 0.74% 870 | 2.23% 2621 | 22.93% 26893 | 30.82% 36139 | 15.43% 18093 | -- | --
l | 51310 | 2.59% 1333 | 71.39% 36631 | 0.17% 89 | 13.59% 6974 | 8.62% 4426 | 3.61% 1857 | -- | --
label | 65697 | 43.83% 28799 | 0.65% 430 | 1.66% 1093 | 21.27% 13976 | 21.96% 14428 | 10.61% 6971 | -- | --
lg | 3040 | 7.23% 220 | 54.53% 1658 | 0.23% 7 | 21.71% 660 | 11.71% 356 | 4.57% 139 | -- | --
list | 19758 | 20.72% 4095 | 0.71% 142 | 1.63% 323 | 26.41% 5220 | 31.75% 6274 | 18.74% 3704 | -- | --
mw | 792599 | 19.55% 155017 | 16.83% 133469 | 7.74% 61366 | 25.39% 201249 | 16.73% 132634 | 4.09% 32478 | 3.24% 25742 | 6.38% 50644
note | 117 | 45.29% 53 | 0.85% 1 | 0.85% 1 | 5.12% 6 | 47.86% 56 | -- | -- | --
p | 1599693 | 8.78% 140550 | 27.13% 434019 | 17.95% 287171 | 18.18% 290826 | 20.35% 325612 | 7.59% 121515 | -- | --
pause | 216354 | -- | -- | -- | -- | -- | -- | 64.98% 140589 | 35.01% 75765
pb | 94620 | 26.16% 24760 | 25.60% 24224 | 0.15% 148 | 31.63% 29931 | 14.75% 13961 | 1.68% 1596 | -- | --
quote | 15208 | 40.20% 6114 | 4.58% 698 | 0.03% 5 | 45.66% 6945 | 6.14% 934 | 3.36% 512 | -- | --
s | 6026276 | 11.55% 696038 | 21.96% 1323573 | 8.43% 508609 | 18.83% 1135264 | 16.95% 1021633 | 5.02% 303078 | 10.13% 610558 | 7.09% 427523
shift | 36053 | -- | -- | -- | -- | -- | -- | 70.90% 25564 | 29.09% 10489
sp | 29112 | 0.21% 62 | 1.28% 373 | -- | 4.76% 1386 | 35.69% 10391 | 58.05% 16900 | -- | --
speaker | 23466 | 0.26% 62 | 1.58% 373 | -- | 5.90% 1385 | 44.16% 10363 | 48.08% 11283 | -- | --
stage | 507 | 1.38% 7 | 10.25% 52 | -- | 5.71% 29 | 82.44% 418 | 0.19% 1 | -- | --
trunc | 52674 | -- | -- | -- | -- | -- | -- | 38.69% 20382 | 61.30% 32292
u | 784483 | -- | -- | -- | -- | -- | -- | 67.02% 525789 | 32.97% 258694
unclear | 203045 | -- | -- | -- | -- | -- | -- | 62.39% 126686 | 37.60% 76359
vocal | 43457 | -- | -- | -- | -- | -- | -- | 63.61% 27645 | 36.38% 15812
w | 98363707 | 16.04% 15781859 | 16.41% 16143913 | 9.56% 9412174 | 24.58% 24179010 | 18.26% 17970212 | 4.54% 4466681 | 4.30% 4233962 | 6.27% 6175896
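The percentages in this table can be reproduced from the raw counts. The following Python sketch is illustrative only: the function name share is hypothetical, and truncation (rather than rounding) to two decimal places is an assumption inferred from the figures shown, for example 33.03% rather than 33.04% for align in other spoken texts. It is not something this manual states.

```python
def share(count: int, total: int) -> str:
    """Return count/total as a percentage truncated to two decimals.

    Truncation (not rounding) is an assumption inferred from the table.
    Integer arithmetic avoids floating-point surprises.
    """
    hundredths = count * 10000 // total
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


# Figures taken from the event row of the table above.
event_total = 6943
event_by_type = {"Conversation": 2559, "Other spoken": 4384}

# The per-type counts should add up to the total for the element.
assert sum(event_by_type.values()) == event_total

for text_type, count in event_by_type.items():
    print(text_type, share(count, event_total))
```

Because the figures are truncated, the percentages in a row may sum to slightly less than 100.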
Voice quality codes
Changes in voice quality in spoken texts are indicated by values
for the new attribute on a shift element, at
the point where the speaker's voice
changes. 156 distinct values are used, but most of them appear only
infrequently. The following list gives the values which appear more
than 10 times in the whole corpus:
voice quality | occurrences
laughing | 9268
reading | 2463
singing | 2045
shouting | 1419
whispering | 1247
yawning | 363
sighing | 276
mimicking | 241
spelling | 224
crying | 108
screaming | 97
whining | 40
whingeing | 38
praying | 23
reading bible | 22
reading newspaper | 20
reading+laughing | 15
reading book | 14
on telephone | 11
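A frequency list of this kind can be derived by counting attribute values. The sketch below is a minimal illustration using Python's standard library. The sample fragment is invented for the example, and the function names are hypothetical; only the shift element and its new attribute come from the description above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented fragment for illustration; not taken from the corpus.
SAMPLE = """<u>
  <shift new="laughing"/><w>yes </w>
  <shift new="reading"/><w>once </w>
  <shift new="laughing"/><w>no </w>
</u>"""


def voice_quality_counts(root):
    """Count the values of the new attribute on every shift element."""
    return Counter(
        el.get("new") for el in root.iter("shift") if el.get("new")
    )


def frequent(counts, threshold):
    """Return (value, count) pairs occurring more than threshold times."""
    return [(v, n) for v, n in counts.most_common() if n > threshold]


counts = voice_quality_counts(ET.fromstring(SAMPLE))
print(frequent(counts, 1))
```

The same approach would apply to the desc attribute on gap or event elements, by changing the element and attribute names.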
Gap descriptions
Where material is omitted for some reason during the transcription
of a text, either written or spoken, the gap element is used
to provide a brief description of the material omitted and the reason
for its exclusion. The desc attribute supplies the
description, and the cause attribute explains why the
material was omitted. Over 1700 distinct descriptions are used, but most of them
appear only infrequently. The following list gives the 65 values which
appear 25 times or more in the whole corpus:
material omitted | occurrences
name | 29698
formula | 12476
figure | 4060
address | 3914
many nonRoman characters | 2835
telephone number | 2173
table | 1393
illustration | 903
photograph | 338
footnote | 197
references etc. | 188
date | 172
list of names | 171
picture | 144
personal name | 142
advert | 141
reference | 133
adverts | 126
list | 123
period quotation | 112
name and address | 101
phonetic transcription | 95
list of venues | 95
table of contents | 92
names | 91
ingredients | 91
publication details | 84
footnotes | 83
contents omitted | 72
contents | 65
list of ingredients | 64
diagram | 60
hebrew | 59
address, telephone number etc. | 50
tel. no | 47
cover page | 47
text | 46
venue, dates, times, prices etc. | 46
venue | 46
telephone no | 44
number | 41
names and addresses | 41
other venues | 41
venues | 40
form | 39
personal names | 38
computer code | 38
gaelic | 36
email address | 36
author details | 35
address and telephone | 35
cover omitted | 35
company name | 34
credits | 33
caption | 31
dates | 30
name and phone number | 29
sales details | 28
address, dates, times, prices etc. | 28
period quotation/verse | 27
notes | 27
map | 26
period/overseas quotation | 25
quotation | 25
The cause for a gap in transcription is usually self-evident, which
may be why only a small number of values is used for the
cause attribute. The following four values are the
most significant:
reason for omission | occurrences
anonymization | 31924
label | 303
sampling strategy | 115
repeated elsewhere | 6
Event descriptions
The event element is used in spoken texts to mark wherever
some non-linguistic but significant event is noted by a
transcriber. The brief texts used to describe such events are very
varied, and there are more than 1500 different values for the
desc attribute which stores them. The following list
shows the 60 or so values which appear more than 10 times in the whole
corpus:
event description | occurrences
clapping | 1134
music | 397
recorded jingle | 330
break in recording | 310
speaking french | 212
pre-recorded blurb | 199
too quiet to hear | 180
recording ends | 177
tape change | 158
phonecall starts | 138
phone rings | 138
phonecall ends | 125
tape breaks here | 120
piano music | 109
paper rustling | 93
tv on | 80
people talking | 77
applause | 67
dog barks | 63
tape jumps | 62
dog barking | 55
baby talk | 55
jingle | 47
banging | 47
classroom chatter | 47
talk in background | 41
people laughing | 35
advert | 35
door knock | 33
noise - traffic | 30
playing piano | 30
tape ends | 28
door bell | 27
noise - background | 26
portuguese speech | 24
writing on board | 24
hits ball | 23
music in background | 21
speaking italian | 19
television | 19
baby crying | 18
door closing | 17
bell ring | 17
singing | 16
crockery noise | 16
noseblow | 16
telephone conversation ends | 16
talking from other room | 16
everyone talking | 15
radio on | 15
noise | 15
door opening | 15
introduction music | 15
drilling noise | 14
microphone moved | 14
plane overhead | 13
knocking | 13
phonic | 13
closing music | 12
noise - train | 12
cat noise | 12
clicks fingers | 11
speaking german | 11
Speaker relationships
In demographically sampled texts, the role of each speaker with
respect to the respondent is supplied by the role
attribute on the person element. The following table lists
all 79 values used in the current version of the corpus in descending
frequency order.
role name | persons
unspecified | 6862
other | 1454
friend | 654
? | 354
self | 306
colleague | 216
daughter | 102
son | 100
husband | 68
wife | 66
mother | 64
stranger | 52
neighbour | 50
father | 42
sister | 42
brother | 38
mother-in-law | 22
sister-in-law | 22
teacher | 22
acquaintance | 18
brother-in-law | 18
employee | 14
son-in-law | 14
father-in-law | 12
granddaughter | 12
niece | 12
chairman | 10
grandson | 10
daughter-in-law | 8
nephew | 8
aunt | 6
boyfriend | 6
customer | 6
girlfriend | 6
babysitter | 4
cousin | 4
fiancé | 4
friend's son | 4
grandmother | 4
lecturer | 4
son's teacher | 4
aunt-in-law | 2
boss | 2
boyfriend's father | 2
boyfriend's mother | 2
brother's friend | 2
brother-in-law's mother | 2
child's teacher | 2
cousin-in-law | 2
cousin-in-law's son | 2
cousin-in-law's wife | 2
daughter's boyfriend | 2
daughter's friend | 2
employee's wife | 2
friend's brother | 2
friend's father | 2
friend's granddaughter | 2
friend's mother | 2
friend's sister | 2
grandmother-in-law | 2
hairdresser | 2
hairdresser's son | 2
housekeeper | 2
husband's great-niece | 2
husband's niece | 2
neighbour's son | 2
partner | 2
partner's mother | 2
plumber | 2
sister's friend | 2
sister's friend's mother | 2
sister-in-law's father | 2
sister-in-law's mother | 2
son's friend | 2
step-father | 2
stepfather | 2
uncle | 2
visitor | 2
Text and genre classification codes
Texts are classified in several different ways in the BNC, as
described elsewhere in this manual. Each text carries a
number of text classification codes, specified as a string of values on
the target attribute of its catRefs
element. Each code identifies one of the values in one of the 23
taxonomy elements provided in the BNC Header, corresponding
with the design criteria for the corpus. Possible
values for these codes and brief explanations of their meanings are listed in the corpus
header. Distribution tables showing the number of texts, words, and
sentences classified under most of them are given in an earlier section
and elsewhere in the current section.
One of the codes listed below is also supplied for each text as the
content of a classCode element in its text header, as an
alternative way of characterising each text. A description of the
analysis scheme used and its rationale is provided in Lee 2001. The codes used in the present version of
the corpus have been updated to take account of a small number of
corrections made by Lee on his web site since publication of that
article.
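As a rough illustration of how such a code might be read, the Python sketch below extracts the content of a classCode element and separates the leading S or W from the rest of the code. The header fragment, its nesting, and the function name genre_of are assumptions made for the example; only the classCode element and the form of the codes come from this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment; the nesting shown is an assumption.
HEADER = """<teiHeader><profileDesc><textClass>
  <classCode>W ac:humanities arts</classCode>
</textClass></profileDesc></teiHeader>"""


def genre_of(header_root):
    """Return (code, medium, genre) from the first classCode, or None."""
    el = next(header_root.iter("classCode"), None)
    if el is None or not (el.text or "").strip():
        return None
    # Normalize internal whitespace before splitting off the S/W prefix.
    code = " ".join(el.text.split())
    medium, _, genre = code.partition(" ")
    return code, medium, genre


print(genre_of(ET.fromstring(HEADER)))
```

Splitting on the first space treats the initial S or W as the medium and leaves the remainder, which may itself contain spaces, as the genre label.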
Genre classification for spoken and written texts
Classification | Number of texts | W-units | % | S-units | %
S brdcast discussn | 53 | 761595 | 0.77 | 41144 | 0.68
S brdcast documentary | 10 | 41893 | 0.04 | 2369 | 0.03
S brdcast news | 12 | 263255 | 0.26 | 12454 | 0.20
S classroom | 58 | 433646 | 0.44 | 51355 | 0.85
S consult | 128 | 139320 | 0.14 | 20698 | 0.34
S conv | 153 | 4233955 | 4.30 | 610557 | 10.13
S courtroom | 13 | 129067 | 0.13 | 6366 | 0.10
S demonstratn | 6 | 32062 | 0.03 | 2175 | 0.03
S interview | 13 | 125096 | 0.12 | 11809 | 0.19
S interview oral history | 119 | 822489 | 0.83 | 57831 | 0.95
S lect commerce | 3 | 15233 | 0.01 | 406 | 0.00
S lect humanities arts | 4 | 51510 | 0.05 | 2639 | 0.04
S lect nat science | 4 | 22938 | 0.02 | 1019 | 0.01
S lect polit law edu | 7 | 51407 | 0.05 | 1670 | 0.02
S lect soc science | 13 | 162030 | 0.16 | 8136 | 0.13
S meeting | 132 | 1391207 | 1.41 | 103266 | 1.71
S parliament | 6 | 97289 | 0.09 | 2609 | 0.04
S pub debate | 16 | 287062 | 0.29 | 13347 | 0.22
S sermon | 16 | 82775 | 0.08 | 3345 | 0.05
S speech scripted | 25 | 193020 | 0.19 | 9571 | 0.15
S speech unscripted | 51 | 469492 | 0.47 | 33121 | 0.54
S sportslive | 4 | 33630 | 0.03 | 1867 | 0.03
S tutorial | 18 | 144783 | 0.14 | 8814 | 0.14
S unclassified | 44 | 425097 | 0.43 | 31512 | 0.52
W ac:humanities arts | 87 | 3358167 | 3.41 | 130167 | 2.15
W ac:medicine | 24 | 1435608 | 1.45 | 66811 | 1.10
W ac:nat science | 43 | 1122939 | 1.14 | 51203 | 0.84
W ac:polit law edu | 186 | 4703304 | 4.78 | 190220 | 3.15
W ac:soc science | 142 | 4785423 | 4.8 | 307998 | 5.12
W ac:tech engin | 23 | 692141 | 0.70 | 34982 | 0.58
W admin | 12 | 222803 | 0.22 | 14045 | 0.23
W advert | 59 | 553625 | 0.56 | 42147 | 0.69
W biography | 100 | 3556688 | 3.61 | 172615 | 2.86
W commerce | 112 | 3807494 | 3.87 | 187127 | 3.10
W email | 7 | 214022 | 0.21 | 17411 | 0.28
W essay school | 7 | 147736 | 0.15 | 7871 | 0.13
W essay univ | 3 | 56273 | 0.05 | 2905 | 0.04
W fict drama | 2 | 46094 | 0.04 | 4932 | 0.08
W fict poetry | 30 | 223682 | 0.22 | 38137 | 0.63
W fict prose | 431 | 16033647 | 16.30 | 1293211 | 21.45
W hansard | 4 | 1168362 | 1.18 | 63234 | 1.04
W institut doc | 43 | 552124 | 0.56 | 30159 | 0.50
W instructional | 15 | 440548 | 0.44 | 27875 | 0.46
W letters personal | 6 | 52915 | 0.05 | 2583 | 0.04
W letters prof | 11 | 66591 | 0.06 | 4800 | 0.07
W misc | 502 | 9237504 | 9.39 | 521286 | 8.65
W news script | 32 | 1248609 | 1.26 | 102937 | 1.70
W newsp brdsht nat: arts | 51 | 352137 | 0.35 | 16991 | 0.28
W newsp brdsht nat: commerce | 44 | 430075 | 0.43 | 21103 | 0.35
W newsp brdsht nat: editorial | 12 | 102718 | 0.10 | 5150 | 0.08
W newsp brdsht nat: misc | 95 | 1040943 | 1.05 | 51455 | 0.85
W newsp brdsht nat: report | 49 | 668613 | 0.67 | 32079 | 0.53
W newsp brdsht nat: science | 29 | 65880 | 0.06 | 3390 | 0.05
W newsp brdsht nat: social | 36 | 82605 | 0.08 | 4388 | 0.07
W newsp brdsht nat: sports | 24 | 300033 | 0.30 | 14679 | 0.24
W newsp other: arts | 15 | 240877 | 0.24 | 13005 | 0.21
W newsp other: commerce | 17 | 419996 | 0.42 | 20506 | 0.34
W newsp other: report | 39 | 2735074 | 2.78 | 150676 | 2.50
W newsp other: science | 23 | 55319 | 0.05 | 2798 | 0.04
W newsp other: social | 37 | 1151490 | 1.17 | 63846 | 1.05
W newsp other: sports | 9 | 1033352 | 1.05 | 56802 | 0.94
W newsp tabloid | 6 | 733066 | 0.74 | 51744 | 0.85
W nonAc: humanities arts | 110 | 3744321 | 3.80 | 155839 | 2.58
W nonAc: medicine | 17 | 504610 | 0.51 | 27156 | 0.45
W nonAc: nat science | 62 | 2533635 | 2.57 | 120610 | 2.00
W nonAc: polit law edu | 93 | 4521040 | 4.59 | 208785 | 3.46
W nonAc: soc science | 123 | 3708033 | 3.76 | 183588 | 3.04
W nonAc: tech engin | 123 | 1220026 | 1.24 | 52750 | 0.87
W pop lore | 211 | 7450814 | 7.57 | 425448 | 7.05
W religion | 35 | 1132976 | 1.15 | 60334 | 1.00
Contracted forms and multiwords
The following tables summarize and document the tokenization
decisions taken by the CLAWS system, where these do not coincide with
normal orthographic convention.
The first list specifies common word-endings or
enclitics which are regarded by CLAWS as indicating the
start of a new word, although words containing
them are conventionally represented as a single orthographic word.
The second list specifies some common two, three or four word
phrases treated by CLAWS as single tokens. These are represented in
this version of the corpus by means of a mw element; the
table gives the C5 code assigned to this element, and also the
codes assigned to the distinct w elements constituting it.
Contracted forms
Words ending with certain character strings are treated by CLAWS as
distinct words, even though they are conventionally fused together
when written. For example, they're is treated as if it were two
distinct words — they and 're. The fact that
these two items are orthographically fused is evident in the XML
encoding of the corpus because there is no whitespace following the
string they. Some XML processors may however assume that the
end of an XML element such as the w enclosing the string
should always be treated as a word separator, and may therefore
introduce unwanted extra space.
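The difference can be seen in a small Python sketch. The sample sentence and the c5 attribute name are illustrative assumptions, as are the function names; the point is only that concatenating element content as it stands preserves the fused form, whereas joining elements with a space does not.

```python
import xml.etree.ElementTree as ET

# Illustrative sentence; note there is no space after "they".
SAMPLE = """<s><w c5="PNP">they</w><w c5="VBB">'re </w><w c5="AJ0">late</w><c c5="PUN">.</c></s>"""


def surface_text(s_element):
    """Concatenate element content exactly as encoded."""
    return "".join(s_element.itertext())


def naive_text(s_element):
    """Treat every element boundary as a word separator (not recommended)."""
    return " ".join((child.text or "").strip() for child in s_element)


s = ET.fromstring(SAMPLE)
print(surface_text(s))  # they're late.
print(naive_text(s))    # they 're late .
```

The first function relies on the whitespace already present in the encoding; the second shows the unwanted extra space described above.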
In the following table we show how contracted forms are tokenized
by CLAWS. The left column shows the contracted form; the right column
shows the content of the two or more w elements used to
represent it.
orthographic form | tokenization
[word]'d | [word] 'd
[word]'m | [word] 'm
[word]'s | [word] 's
[word]'ll | [word] 'll
[word]n't | [word] n't
[word]'re | [word] 're
[word]'ve | [word] 've
[word]'d've | [word] 'd 've
'tis | 't is
'twas | 't was
'twere | 't were
'twould | 't would
ain't | ai n't
aint | ai nt
aintcha | ai nt cha
arent | are nt
c'mon | c'm on
can't | ca n't
cannot | can not
couldnt | could nt
d'ya | d' ya
d'you | d' you
didnt | did nt
doesnt | does nt
dont | do nt
dunnit | dun n it
dunno | du n no
geroff | ger off
gimme | gim me
gonna | gon na
gotta | got ta
hadnt | had nt
hasnt | has nt
hes | he s
inne | in n e
innit | in n it
isnt | is nt
lorra | lor ra
m'lud | m' lud
ought'a | ough t 'a
oughta | ought a
shan't | sha n't
shouldn't've | should n't 've
shouldn't | should n't
t'other | t' other
thats | that s
theres | there s
theyve | they ve
tis | t is
twas | t was
twere | t were
twould | t would
wanna | wan na
wannit | wann it
wasnt | was nt
weve | we ve
won't | wo n't
wotta | wott a
wouldn't've | would n't 've
wouldnt | would nt
Multiwords
CLAWS recognizes certain sequences of orthographically distinct
words as constituting a single item: examples include common
prepositional phrases such as in spite of, as well as phrases
from other languages such as aide memoire. In this version of
the corpus, such items are explicitly tagged using an XML mw (for
multiword) tag carrying the appropriate wordclass tag, as indicated
below. Within this mw element however, in a departure from
earlier versions of the corpus, the individual words are also tagged
using w tags in the same way as elsewhere in the corpus.
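A minimal Python sketch of how such items might be retrieved is given here. The sample sentence is invented, and the c5 attribute name and the function name multiwords are assumptions for the purpose of illustration; the nesting of w elements inside mw follows the description above.

```python
import xml.etree.ElementTree as ET

# Invented sentence; attribute names are assumptions.
SAMPLE = """<s><w c5="AV0">still </w><mw c5="PRP"><w c5="PRP">in </w><w c5="NN1">spite </w><w c5="PRF">of </w></mw><w c5="DT0">that</w></s>"""


def multiwords(root):
    """List (phrase, mw wordclass, constituent wordclasses) for each mw."""
    found = []
    for mw in root.iter("mw"):
        parts = list(mw.iter("w"))
        phrase = " ".join((w.text or "").strip() for w in parts)
        found.append((phrase, mw.get("c5"), [w.get("c5") for w in parts]))
    return found


print(multiwords(ET.fromstring(SAMPLE)))
```

Each tuple pairs the wordclass carried by the mw element with the wordclasses of the w elements it contains.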
The following table lists all multiwords recognized in the corpus
alphabetically, indicating both the wordclass code or codes assigned to each,
and also the wordclass codes assigned to its constituent w
elements. Note that these latter wordclass codes were assigned
automatically during the XML conversion process and therefore should
not be included in any assessment of the CLAWS error rate.
multiword | mw wordclass/es | constituent wordclasses
ab initio | AV0 or AJ0 | UNC UNC
a bit | AV0 | AT0 NN1
a capella | AJ0 or AV0 | UNC UNC
according as | CJS | VVG CJS
according to | PRP | VVG PRP
ad astra | AV0 or AJ0 | UNC UNC
ad hoc | AV0 or AJ0 | UNC UNC
ad hominem | AV0 or AJ0 | UNC UNC
ad infinitum | AV0 | UNC UNC
adjacent to | PRP | AJ0 PRP
ad lib | AJ0 or AV0 or NN1 | UNC UNC
ad nauseam | AV0 or AJ0 | UNC UNC
affaire de coeur | NN1 | UNC UNC UNC
affaire d'honneur | NN1 | UNC UNC
a fortiori | AV0 or AJ0 | UNC UNC
agent provocateur | NN1 | UNC UNC
agnus dei | NN1 | UNC UNC
a good deal | AV0 | AT0 AJ0 NN1
a great deal | AV0 | AT0 AJ0 NN1
ahead of | PRP | AV0 PRF
a heck of a lot | AV0 | AT0 NN1 PRF AT0 NN1
aide de camp | NN1 | UNC UNC UNC
aide memoire | NN1 | UNC UNC
a la | PRP | UNC UNC
a la carte | AJ0 or AV0 | UNC UNC UNC
a la mode | AJ0 or AV0 | UNC UNC UNC
al dente | AJ0 or AV0 | UNC UNC
al fresco | AV0 or AJ0 | UNC UNC
a little | AV0 | AT0 AV0/DT0
a little bit | AV0 | AT0 AJ0 NN1
alla breve | AV0 or AJ0 or NN0 | UNC UNC
all but | AV0 | AV0 CJS
all of a sudden | AV0 | DT0 PRF AT0 NN1
all right | AV0 or AJ0 | AV0 AV0
all the same | AV0 | DT0 AT0 DT0
alma mater | NN1 | UNC UNC
along with | PRP | AVP PRP
a lot | AV0 | AT0 NN1
alter ego | NN1 | UNC UNC
an' all | AV0 | CJC DT0
an awful lot | AV0 | AT0 AJ0 NN1
ancien regime | NN1 | UNC UNC
and so forth | AV0 | CJC AV0 AV0
and so on | AV0 | CJC AV0 AV0
anno domini | AV0 or NN1 | UNC UNC
annus horribilis | NN1 | UNC UNC
annus mirabilis | NN1 | UNC UNC
ante meridiem | AV0 | UNC UNC
any longer | AV0 | AV0 AV0
anything but | AV0 | PNI AV0
apart from | PRP | AV0 PRP
a posteriori | AV0 or AJ0 | UNC UNC
a priori | AV0 or AJ0 | UNC UNC
a propos | PRP or AV0 | UNC UNC
aqua vitae | NN1 | UNC UNC
art nouveau | NN1 | UNC UNC
as against | PRP | CJS PRP
as between | PRP | CJS PRP
as for | PRP | CJS PRP
as from | PRP | CJS PRP
aside from | PRP | AV0 PRP
as if | CJS | CJS CJS
as it were | AV0 | CJS PNP VBD
as long as | CJS | AV0 AV0 CJS
as of | PRP | CJS PRF
as opposed to | PRP | CJS VVN PRP
as regards | PRP | CJS VVZ
as soon as | CJS | AV0 AV0 CJS
as though | CJS | CJS CJS
asti spumante | NN1 | UNC UNC
as to | PRP | CJS PRP
as usual | AV0 | CJS AJ0
as well as | PRP | AV0 AV0 CJS
as well | AV0 | CJS AV0
as yet | AV0 | CJS AV0
at all | AV0 | PRP DT0
at best | AV0 | PRP AJS
at first | AV0 | PRP ORD
at large | AV0 | PRP AJ0
at last | AV0 | PRP ORD
at least | AV0 | PRP AV0
at length | AV0 | PRP NN1
at long length | AV0 | PRP AJ0 NN1
at most | AV0 | PRP DT0
at once | AV0 | PRP AV0
at present | AV0 | PRP NN1
at random | AV0 | PRP AJ0
at worst | AV0 | PRP AV0
au contraire | AV0 | UNC UNC
au fait | AJ0 | UNC UNC
auf wiedersehen | ITJ | UNC UNC
au pair | NN1 | UNC UNC
au revoir | ITJ | UNC UNC
aurora australis | NN1 | UNC UNC
aurora borealis | NN1 | UNC UNC
avant garde | NN1 or AJ0 | UNC UNC
away from | PRP | AV0 PRP
bar mitzvah | NN1 or AJ0 | UNC UNC
basso profundo | NN1 | UNC UNC
beau monde | NN1 | UNC UNC
because of | PRP | CJS PRF
belles lettres | NN2 | UNC UNC
bete noire | NN1 | UNC UNC
billet doux | NN1 | UNC UNC
bona fides | NN2 | UNC UNC
bona fide | AJ0 | UNC UNC
bon appetit | ITJ | UNC UNC
bon mot | NN1 | UNC UNC
bon vivant | NN1 | UNC UNC
bon viveur | NN1 | UNC UNC
bon voyage | ITJ | UNC UNC
brand new | AJ0 | NN1 AJ0
but for | PRP | CJS PRP
by and by | AV0 | AVP CJC AVP
by and large | AV0 | AVP CJC AJ0
by far | AV0 | PRP AV0
by means of | PRP | PRP NN0 PRF
by no means | AV0 | PRP PRP NN0
by now | AV0 | PRP AV0
by reason of | PRP | PRP NN1 PRF
by the by | AV0 | PRP AT0 NN1
by way of | PRP | PRP NN1 PRF
cafe au lait | NN1 | UNC UNC UNC
camera obscura | NN1 | UNC UNC
carte blanche | NN1 | UNC UNC
casus belli | NN1 | UNC UNC
cause celebre | NN1 | UNC UNC
ceteris paribus | AV0 | UNC UNC
chaise longue | NN1 | UNC UNC
charge d'affaires | NN1 | UNC UNC
chez moi | AV0 | UNC UNC
chez nous | AV0 | UNC UNC
chilli con carne | NN1 | NN1 NN1 NN1
chop suey | NN1 | UNC UNC
chow mein | NN1 | UNC UNC
clamp down | NN1 | VVB/VVI AVP
close to | AV0 | AV0/AJ0 PRP
compos mentis | AJ0 | UNC UNC
con brio | AJ0 or AV0 | UNC UNC
con fuoco | AJ0 or AV0 | UNC UNC
con moto | AJ0 or AV0 | UNC UNC
considering that | CJS | VVG CJT
contrary to | PRP | JJ PRP
cordon bleu | NN1 | UNC UNC
cordon sanitaire | NN1 | UNC UNC
corpus delicti | NN1 | UNC UNC
corpus juris | NN1 | UNC UNC
coup de grace | NN1 | UNC UNC UNC
coup d'etat | NN1 | UNC UNC
coup de theatre | NN1 | UNC UNC UNC
creme de la creme | NN1 | UNC UNC UNC UNC
creme de menthe | NN1 | UNC UNC UNC
cri de coeur | NN1 | UNC UNC UNC
croix de guerre | NN0 | UNC UNC UNC
cul de sac | NN1 | UNC UNC UNC
danse macabre | NN1 | UNC UNC
de facto | AV0 or AJ0 | UNC UNC
dei gratia | AV0 | UNC UNC
deja vu | NN1 | UNC UNC
de jure | AV0 or AJ0 | UNC UNC
delirium tremens | NN1 | UNC UNC
de luxe | AJ0 | UNC UNC
demi monde | NN1 | UNC UNC
depending on | PRP | VVG PRP
de profundis | AV0 | UNC UNC
de rigeur | AJ0 | UNC UNC
de trop | AJ0 | UNC UNC
deus ex machina | NN1 | UNC UNC UNC
double entendre | NN1 | UNC UNC
dramatis personae | NN2 | UNC UNC
due to | PRP | AJ0 PRP
each other | PNX | DT0 NN1
eminence grise | NN1 | UNC UNC
en bloc | AV0 | UNC UNC
en famille | AV0 | UNC UNC
enfants terribles | NN2 | UNC UNC
enfant terrible | NN1 | UNC UNC
en masse | AV0 | UNC UNC
en passant | AV0 | UNC UNC
en route | AV0 | UNC UNC
en suite | AJ0 | UNC UNC
entente cordiale | NN1 | UNC UNC
esprit de corps | NN1 | UNC UNC UNC
et al | AV0 | UNC UNC
et cetera | AV0 | UNC UNC
even if | CJS | AV0 CJS
even so | AV0 | AV0 AV0
even though | CJS | AV0 CJS
even when | CJS | AV0 CJS
ever so | AV0 | AV0 AV0
every so often | AV0 | AT0 AV0 AV0
ex army | AJ0 | PRP NN1
ex cathedra | AV0 or AJ0 | UNC UNC
except for | PRP | CJS PRP
excepting for | PRP | VVG PRP
except that | CJS | CJS CJT
ex gratia | AV0 or AJ0 | UNC UNC
ex libris | AV0 | UNC UNC
ex officio | AV0 or AJ0 | UNC UNC
ex parte | AV0 or AJ0 | UNC UNC
ex tempore | AV0 or AJ0 | UNC UNC
fait accompli | NN1 | UNC UNC
far from | AV0 | AV0 PRP
far off | AJ0 | AV0 AVP
faux amis | NN2 | UNC UNC
faux ami | NN1 | UNC UNC
faux pas | NN0 | UNC UNC
fed up | AJ0 | VVN AVP
femme fatale | NN1 | UNC UNC
fin de siecle | NN1 | UNC UNC UNC
follow up | NN1 | VVB/VVI AVP
force majeure | NN1 | UNC UNC
for certain | AV0 | PRP AJ0
for ever | AV0 | PRP AV0
for example | AV0 | PRP NN1
for fear of | PRP | PRP NN1 PRF
for good | AV0 | PRP AJ0
for instance | AV0 | PRP NN1
for keeps | AV0 | PRP NN2
for long | AV0 | PRP AV0
for once | AV0 | PRP AV0
for sure | AV0 | PRP AJ0
for the most part | AV0 | PRP AT0 AV0 NN1
for the time being | AV0 | PRP AT0 NN1 VBG
fromage frais | NN1 | UNC UNC
from now on | AV0 | PRP AV0 AVP
from time to time | AV0 | PRP NN1 PRP NN1
getting on for | AV0 | VVG AVP PRP
grande dame | NN1 | UNC UNC
grand prix | NN1 | UNC UNC
grown ups | NN2 | VVN NN2
grown up | NN1 | VVN AVP
gung ho | AJ0 or AV0 | UNC UNC
habeas corpus | NN1 | UNC UNC
half way | AV0 | DT0 NN1
hara kiri | NN1 | UNC UNC
hard up | AJ0 | AJ0 AVP
hasta la vista | ITJ | UNC UNC UNC
hasta luego | ITJ | UNC UNC
haute couture | NN1 | UNC UNC
haute cuisine | NN1 | UNC UNC
have nots | NN2 | VHB NN2
hey presto | ITJ | ITJ ITJ
hoi polloi | NN0 | UNC UNC
homo sapiens | NN1 | UNC UNC
hors d'oeuvres | NN2 | UNC UNC
hors d'oeuvre | NN1 | UNC UNC
hysteron proteron | NN1 | UNC UNC
idee fixe | NN1 | UNC UNC
in absentia | AV0 | UNC UNC
in accordance with | PRP | PRP NN1 PRP
in accord with | PRP | PRP NN1 PRP
in addition | AV0 | PRP NN1
in addition to | PRP | PRP NN1 PRP
in aid of | PRP | PRP NN1 PRF
in answer to | PRP | PRP NN1 PRP
in as much as | CJS | PRP AV0 DT0 CJS
inasmuch as | CJS | UNC CJS
in association with | PRP | PRP NN1 PRP
in back of | PRP | PRP NN1 PRF
in between | PRP or AV0 | AVP PRP/AV0
in brief | AV0 | PRP AJ0
in camera | AV0 | UNC UNC
in case of | PRP | PRP NN1 PRF
in case | CJS or AV0 | PRP NN1
in charge of | PRP | PRP NN1 PRF
in common | AV0 | PRP AJ0
in common with | PRP | PRP NN1 PRP
in comparison with | PRP | PRP NN1 PRP
in conjunction with | PRP | PRP NN1 PRP
in connection with | PRP | PRP NN1 PRP
in consultation with | PRP | PRP NN1 PRP
in contact with | PRP | PRP NN1 PRP
in cooperation with | PRP | PRP NN1 PRP
in course with | PRP | PRP NN1 PRP
in defence of | PRP | PRP NN1 PRF
in defiance of | PRP | PRP NN1 PRF
in excess of | PRP | PRP NN1 PRF
in extremis | AV0 | UNC UNC
in face of | PRP | PRP NN1 PRF
in favor of | PRP | PRP NN1 PRF
in favour of | PRP | PRP NN1 PRF
in flagrante delicto | AV0 or AJ0 | UNC UNC UNC
in front of | PRP | PRP NN1 PRF
in full | AV0 | PRP AJ0
in general | AV0 | PRP AJ0
in keeping with | PRP | PRP NN1 PRP
in lieu of | PRP | PRP UNC PRF
in light of | PRP | PRP NN1 PRF
in line with | PRP | PRP NN1 PRP
in loco parentis | AV0 or AJ0 | UNC UNC UNC
in medias res | AV0 | UNC UNC UNC
in memoriam | AV0 | UNC UNC
in need of | PRP | PRP NN1 PRF
in particular | AV0 | PRP AJ0
in perpetuum | AV0 | UNC UNC
in place of | PRP | PRP NN1 PRF
in possession of | PRP | PRP NN1 PRF
in private | AV0 | PRP AJ0
in proportion to | PRP | PRP NN1 PRP
in propria persona | AV0 | UNC UNC UNC
in public | AV0 | PRP AJ0
in pursuit of | PRP | PRP NN1 PRF
in quest of | PRP | PRP NN1 PRF
in receipt of | PRP | PRP NN1 PRF
in regard to | PRP | PRP NN1 PRP
in relation to | PRP | PRP NN1 PRP
in reply to | PRP | PRP NN1 PRP
in respect of | PRP | PRP NN1 PRF
in response to | PRP | PRP NN1 PRP
in return for | PRP | PRP NN1 PRP
in search of | PRP | PRP NN1 PRF
in short | AV0 | PRP AJ0
inside out | AV0 or AJ0 | AV0 AVP
in situ | AV0 | UNC UNC
in so far as | CJS | PRP AV0 AV0 CJS
insofar as | CJS | UNC CJS
in spite of | PRP | PRP NN1 PRF
instead of | PRP | AV0 PRF
in support of | PRP | PRP NN1 PRF
inter alia | AV0 | UNC UNC
in terms of | PRP | PRP NN2 PRF
in that | CJS | PRP CJT
in the light of | PRP | PRP AT0 NN1 PRF
in the main | AV0 | PRP AT0 AJ0
in the order of | AV0 | PRP AT0 NN1 PRF
into line with | PRP | PRP NN1 PRP
in toto | AV0 or AJ0 | UNC UNC
in touch with | PRP | PRP NN1 PRP
in vain | AV0 | PRP AJ0
in view of | PRP | PRP NN1 PRF
in vitro | AJ0 or AV0 | UNC UNC
in vivo | AJ0 or AV0 | UNC UNC
ipso facto | AV0 | UNC UNC
irrespective of | PRP | AJ0 PRF
je ne sais quoi | NN1 | UNC UNC UNC UNC
joie de vivre | NN1 | UNC UNC UNC
just about | AV0 | AV0 AV0
kind of | AV0 | NN1 PRF
know how | NN1 | VVB AVQ
kung fu | NN1 | UNC UNC
la dolce vita | NN1 | UNC UNC UNC
laissez faire | NN1 | UNC UNC
le mot juste | NN1 | UNC UNC UNC
less than | AV0 | AV0/DT0 CJS
let alone | PRP | VVB AJ0
let 's | VM0 | VVB PNP
lingua franca | NN0 | UNC UNC
lo and behold | ITJ | ITJ CJC VVB
loc cit | AV0 | UNC UNC
locum tenens | NN1 | UNC UNC
long-term wise | AV0 | AJ0 AV0
magna carta | NN1 | UNC UNC
magna cum laude | AJ0 or AV0 | UNC UNC UNC
magnum opus | NN1 | UNC UNC
maitre d'hotel | NN1 | UNC UNC
mal de mer | NN1 | UNC UNC UNC
matter of fact | NN1 or AJ0 | NN1 PRF NN1
mea culpa | NN1 | UNC UNC
medecins sans frontieres | NN0 | UNC UNC UNC
medicins sans frontieres | NN0 | UNC UNC UNC
menage a trois | NN1 | UNC UNC UNC
mezzo soprano | NN1 | UNC UNC
modus operandi | NN1 | UNC UNC
modus vivendi | NN1 | UNC UNC
more than | AV0 | AV0/DT0 CJS
more than | AV0 | AV0 CJS
mot juste | NN1 | UNC UNC
nearer to | PRP | AJC/AV0 PRP
nearest to | PRP | AJS/AV0 PRP
near to | PRP | AJ0/AV0 PRP
nem con | AV0 | UNC UNC
next to | PRP | ORD PRP
nigh on | AV0 | AV0 AVP
noblesse oblige | NN1 | UNC UNC
no doubt | AV0 | AT0 NN1
no longer | AV0 | AV0 AV0
no matter who | PNQ | AT0 NN1 PNQ
no matter whom | PNQ | AT0 NN1 PNQ
no matter whose | DTQ | AT0 NN1 DTQ
nom de guerre | NN1 | UNC UNC UNC
nom de plume | NN1 | UNC UNC UNC
non compos mentis | AJ0 | UNC UNC UNC
none other | PNI | PNI AJ0
none the less | AV0 | PNI AT0 AV0
none the | AV0 | PNI AT0
non sequitur | NN1 | UNC UNC
no one | PNI | AT0 PNI
not withstanding | AV0 | XX0 UNC
nouveau riche | NN1 | UNC UNC
nouveaux riches | NN2 | UNC UNC
nouvelle cuisine | NN1 | UNC UNC
now that | CJS | AV0 CJT
objet d'art | NN1 | UNC UNC
objets d'art | NN2 | UNC UNC
of course | AV0 | PRF NN1
off guard | AV0 | PRP NN1
off of | PRP | AVP PRF
oft times | AV0 | AV0 NN2
old fashioned | AJ0 | AJ0 VVN
on account of | PRP | PRP NN1 PRF
on behalf of | PRP | PRP NN1 PRF
on board | PRP or AV0 | PRP NN1
once again | AV0 | AV0 AV0
once and for all | AV0 | AV0 CJC PRP DT0
once more | AV0 | AV0 AV0
one another | PNX | CRD DT0
one 's | CRD | CRD POS
on the part of | PRP | PRP AT0 NN1 PRF
on top of | PRP | PRP NN1 PRF
on to | PRP | AVP PRP/TO0
op cit | AV0 | UNC UNC
other than | PRP | AJ0 CJS
out of | PRP | AVP PRF
out of date | AJ0 | AVP PRF NN1
out of line with | PRP | AVP PRF NN1 PRP
out of touch with | PRP | AVP PRF NN1 PRP
outside of | PRP | PRP PRF
over here | AV0 | PRP AV0
over there | AV0 | PRP AV0
owing to | PRP | VVG PRP
papier mache | NN1 | UNC UNC
par excellence | AJ0 | UNC UNC
pas de deux | NN0 | UNC UNC UNC
pate de foie gras | NN1 | UNC UNC UNC UNC
pax britannica | NN1 | UNC UNC
pax romana | NN1 | UNC UNC
per annum | AV0 | UNC UNC
per capita | AV0 or AJ0 | UNC UNC
per cent | NN0 | UNC UNC
per diem | AV0 or AJ0 or NN1 | UNC UNC
per se | AV0 | UNC UNC
personae non gratae | NN2 | UNC UNC UNC
persona non grata | NN1 | UNC UNC UNC
pertaining to | PRP | VVG PRP
petit bourgeois | NN1 | UNC UNC
petite bougeoisie | NN1 | UNC UNC
petits bourgeois | NN2 | UNC UNC
piece de resistance | NN1 | UNC UNC UNC
pied a terre | NN1 | UNC UNC UNC
pina colada | NN1 | UNC UNC
pince nez | NN0 | UNC UNC
poco a poco | AV0 | UNC UNC UNC
point blank | AV0 or AJ0 | NN1 AJ0
poste restante | NN1 or AV0 | UNC UNC
post hoc | AV0 or AJ0 | UNC UNC
post meridiem | AV0 | UNC UNC
post mortem | NN1 or AJ0 | UNC UNC
pot pourri | NN1 | UNC UNC
prima donna | NN1 | UNC UNC
prima facie | AJ0 or AV0 | UNC UNC
primus inter pares | NN1 | UNC AJ0 UNC
prior to | PRP | AJ0 PRP
pro forma | NN1 | UNC UNC
pro rata | AV0 or AJ0 | UNC UNC
pro tem | AV0 | UNC UNC
provided that | CJS | VVN CJT
providing that | CJS | VVG CJT
pursuant to | PRP | AJ0 PRP
quid pro quo | NN1 | UNC UNC UNC
raison d'etre | NN1 | UNC UNC
rather than | PRP | AV0 CJS
relative to | PRP | AJ0 PRP
rigor mortis | NN1 | UNC UNC
roman a clef | NN1 | UNC UNC UNC
save for | PRP | VVI PRP
save that | CJS | VVI CJT
savoir faire | NN1 | UNC UNC
savoir vivre | NN1 | UNC UNC
seeing as | CJS | VVG CJS
seeing that | CJS | VVG CJT
semper fidelis | AJ0 | UNC UNC
shish kebab | NN1 | UNC UNC
sine die | AV0 | UNC UNC
sine qua non | NN1 | UNC UNC UNC
sinn fein | NN1 | UNC UNC
so called | AJ0 | AV0 VVN
so long as | CJS | AV0 AV0 CJS
some one | PNI | DT0 PNI
something like | AV0 | PNI PRP
so much as | AV0 | AV0 DT0 CJS
son et lumiere | NN1 | UNC UNC UNC
sort of | AV0 | NN1 PRF
so that | CJS | AV0 CJT
sotto voce | AV0 or AJ0 | UNC UNC
spaghetti bolognese | NN1 | UNC UNC
spot on | AV0 or AJ0 | NN1 AVP
status quo | NN1 | UNC UNC
straight forward | AJ0 | AV0 AJ0
subject to | PRP | AJ0 PRP
sub judice | AV0 or AJ0 | UNC UNC
sub poena | NN1 | UNC UNC
subsequent to | PRP | AJ0 PRP
such as | PRP | DT0 PRP
such that | CJS | DT0 CJT
sui generis | AJ0 | UNC UNC
sui juris | AJ0 | UNC UNC
summa cum laude | AJ0 or AV0 | UNC UNC UNC
super duper | AJ0 | AJ0 XXX
supposing that | CJS | VVG PRP
table d'hote | NN1 | UNC UNC
tabula rasa | NN1 | UNC UNC
tai chi | NN1 | UNC UNC
tai kwan do | NN1 | UNC UNC UNC
terra firma | NN1 | UNC UNC
terra incognita | NN1 | UNC UNC
thanks to | PRP | NN2 PRP
that is | AV0 | DT0 VBZ
that is to say | AV0 | DT0 VBZ TO0 VVI
through thick and thin | AV0 | PRP AJ0 CJC AJ0
time and again | AV0 | NN1 CJC AV0
to and fro | AV0 | PRP CJC AV0
tour de force | NN1 | UNC UNC UNC
tout court | AJ0 | UNC UNC
tout de suite | AV0 | UNC UNC UNC
ultra vires | AJ0 or AV0 | UNC UNC
under way | AV0 | PRP NN1
up front | AJ0 or AV0 | AVP AJ0
upside down | AV0 or AJ0 | NN1 AVP
up to | PRP or AV0 | AVP PRP/TO0
up to date | AJ0 | AVP TO0 NN1
up to the minute | AJ0 | AVP PRP AT0 NN1
up until | PRP | AVP CJS/PRP
upward of | AV0 | AV0 PRF
upwards of | AV0 | AV0 PRF
vice versa | AV0 | UNC UNC
vin de table | NN1 | UNC UNC UNC
vis a vis | PRP | UNC UNC UNC
viva voce | NN1 or AJ0 or AV0 | UNC UNC
vol au vent | NN1 | UNC UNC UNC
volte face | NN1 | UNC UNC
vox populi | NN1 | UNC UNC
well off | AJ0 | AV0 AVP
whether or not | CJS | CJS CJC XX0
wiener schnitzel | NN1 | UNC UNC
with a view to | PRP | PRP AT0 NN1 PRP
with reference to | PRP | PRP NN1 PRP
with regard to | PRP | PRP NN1 PRP
with relation to | PRP | PRP NN1 PRP
with respect to | PRP | PRP NN1 PRP
Simplified Wordclass Tags
This table lists, for each of the eleven simplified wordclass tags
used by the pos attribute, the corresponding CLAWS C5
tags of which the class consists.
POS value | significance | combines
ADJ | adjective | AJ0, AJC, AJS, CRD, DT0, ORD
ADV | adverb | AV0, AVP, AVQ, XX0
ART | article | AT0
CONJ | conjunction | CJC, CJS, CJT
INTERJ | interjection | ITJ
PREP | preposition | PRF, PRP, TO0
PRON | pronoun | DPS, DTQ, EX0, PNI, PNP, PNQ, PNX
STOP | punctuation | POS, PUL, PUN, PUQ, PUR
SUBST | substantive | NN0, NN1, NN2, NP0, ONE, ZZ0, NN1-NP0, NP0-NN1
UNC | unclassified, uncertain, or non-lexical word | UNC, AJ0-AV0, AV0-AJ0, AJ0-NN1, NN1-AJ0, AJ0-VVD, VVD-AJ0, AJ0-VVG, VVG-AJ0, AJ0-VVN, VVN-AJ0, AVP-PRP, PRP-AVP, AVQ-CJS, CJS-AVQ, CJS-PRP, PRP-CJS, CJT-DT0, DT0-CJT, CRD-PNI, PNI-CRD, NN1-VVB, VVB-NN1, NN1-VVG, VVG-NN1, NN2-VVZ, VVZ-NN2
VERB | verb | VBB, VBD, VBG, VBI, VBN, VBZ, VDB, VDD, VDG, VDI, VDN, VDZ, VHB, VHD, VHG, VHI, VHN, VHZ, VM0, VVB, VVD, VVG, VVI, VVN, VVZ, VVD-VVN, VVN-VVD
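For programmatic use, the table above can be expressed as a lookup from C5 tag to simplified tag. The Python sketch below transcribes the table; the names SIMPLE_POS and C5_TO_POS are hypothetical and are not part of the corpus or of any BNC software.

```python
# Simplified POS value -> the C5 tags it combines, transcribed from the table.
SIMPLE_POS = {
    "ADJ": "AJ0 AJC AJS CRD DT0 ORD",
    "ADV": "AV0 AVP AVQ XX0",
    "ART": "AT0",
    "CONJ": "CJC CJS CJT",
    "INTERJ": "ITJ",
    "PREP": "PRF PRP TO0",
    "PRON": "DPS DTQ EX0 PNI PNP PNQ PNX",
    "STOP": "POS PUL PUN PUQ PUR",
    "SUBST": "NN0 NN1 NN2 NP0 ONE ZZ0 NN1-NP0 NP0-NN1",
    "UNC": (
        "UNC AJ0-AV0 AV0-AJ0 AJ0-NN1 NN1-AJ0 AJ0-VVD VVD-AJ0 "
        "AJ0-VVG VVG-AJ0 AJ0-VVN VVN-AJ0 AVP-PRP PRP-AVP AVQ-CJS "
        "CJS-AVQ CJS-PRP PRP-CJS CJT-DT0 DT0-CJT CRD-PNI PNI-CRD "
        "NN1-VVB VVB-NN1 NN1-VVG VVG-NN1 NN2-VVZ VVZ-NN2"
    ),
    "VERB": (
        "VBB VBD VBG VBI VBN VBZ VDB VDD VDG VDI VDN VDZ "
        "VHB VHD VHG VHI VHN VHZ VM0 VVB VVD VVG VVI VVN VVZ "
        "VVD-VVN VVN-VVD"
    ),
}

# Inverted index: C5 tag -> simplified POS value.
C5_TO_POS = {
    c5: pos for pos, tags in SIMPLE_POS.items() for c5 in tags.split()
}

print(C5_TO_POS["NN1"], C5_TO_POS["AJ0-NN1"], C5_TO_POS["VVD-VVN"])
```

Note that most ambiguity tags map to UNC, while the noun/proper-noun and past-tense/past-participle ambiguities stay within SUBST and VERB respectively.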