BNC User Reference Guide

Reference Guide for the British National Corpus (XML Edition): Notes

Notes
1. The article in Wikipedia (http://en.wikipedia.org/wiki/XML) is probably as good a starting point as any; another is at http://homepages.inf.ed.ac.uk/wadler/xml/
2. The terms "POS-tagging" and "wordclass tagging" are used interchangeably in this manual.
3. The only exceptions to this statement are: (i) the file F9M, which contains the rap poetry "City Psalms" by Benjamin Zephaniah, and which was thoroughly hand-corrected because the tagger, not being familiar with Jamaican Creole, had produced an inordinate number of tagging errors; and (ii) files identified as containing many foreign and classical expressions, as mentioned above.
4. In BNC version 1, the quantifier "a little", meaning 'a small amount', was sometimes (but not reliably) tagged as a multiword DT0.
5. In our experience, human analysts too sometimes have difficulty resolving ambiguities such as these, especially when working from the plain orthographic transcriptions of the BNC with no direct access to the original sound recordings.
6. That is, the error rate based on CLAWS's first choice tag only.
7. We borrow the term "patching" from Brill (1992), although for his tagging program the patches are discovered by an automatic procedure.
8. The repetition value of up to 16 words was arrived at by trial and error; an occurrence of a finite verb beyond that range was rarely in the same clause as the #AFTER-type word.
9. Training and testing were mostly carried out on the BNC Sampler corpus of 2 million words. For less frequent phenomena we needed to use sections from the full BNC. None of the texts used for the tagging error report is included in the Sampler.


edited by Lou Burnard. Date: January 2007
This page is copyrighted