Notes
2. The terms "POS-tagging" and
"wordclass tagging" are used interchangeably in this manual.
3. The only exceptions to this
statement are: (i) the file F9M, which contains the rap poetry "City
Psalms" by Benjamin Zephaniah, and which was thoroughly hand-corrected
because the tagger, not familiar with Jamaican Creole, had produced an
inordinate number of tagging errors; and (ii) files identified as
containing many foreign and classical expressions, as mentioned above.
4. In BNC version 1, the quantifier "a little",
meaning 'a small amount', was sometimes (but not reliably) tagged as a
multiword DT0.
5. In our experience, human analysts,
too, sometimes have difficulty resolving ambiguities such as these,
especially when using the plain orthographic transcriptions of the
BNC, and with no direct access to the original sound
recordings.
6. That is, the error rate based on CLAWS's first choice tag only.
7. We borrow the term
"patching" from Brill (1992), although
for his tagging program the patches are discovered by an
automatic procedure.
8. The repetition value of up to 16 words was arrived at by trial and error; an occurrence of a finite verb beyond that range was rarely in the same clause as the #AFTER-type word.
9. Training and testing were mostly carried out on the BNC Sampler corpus of 2 million words. For less frequent phenomena we needed to use sections from the full BNC. None of the texts used for the tagging error report is included in the Sampler.