Making electronic texts
Three ways of creating electronic versions of the texts were envisaged at the start of the BNC project: scanning, keyboarding, and re-use of existing electronic texts.
- Optical character readers are becoming increasingly sophisticated, and many BNC books were "scanned" in this manner. High-quality original texts were required to ensure the error rate was low. Hand editing was still required, though, to correct scanning errors and insert textual markup.
- For leaflets, hand-written items, and of course recorded speech, keyboarding -- manually typing in texts -- was the only viable option, as it was for a surprising number of magazines and newspapers. Scanners are not accurate enough at recognizing small typefaces, lower-quality typography, or handwriting. In such cases it would have taken longer to correct the scanned output than it did for a trained typist simply to re-type the documents in full.
- Existing electronic texts: The corpus designers believed that many texts would already exist in electronic form -- publishers' and typesetters' versions of newspapers, magazines and some books -- and that converting such texts to the standard format required for the corpus would be reasonably straightforward. In the event, texts in electronic form which fitted the corpus design were far fewer than had been supposed. Newspaper text frequently came in machine-readable form, but often required programs to reformat it.
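
As an illustration of the kind of reformatting program mentioned above, the following is a minimal sketch in Python. It is not the software used by the BNC project. It assumes a hypothetical input layout (a headline followed by paragraphs separated by blank lines), and the function name `reformat_article` and the tags `<text>`, `<head>` and `<p>` are chosen for illustration only.

```python
from html import escape


def reformat_article(raw: str) -> str:
    """Convert a plain-text article into simple tagged markup.

    Assumed (hypothetical) input layout: the first block is the headline,
    and every later block, separated by blank lines, is one paragraph.
    """
    # Group non-empty lines into blocks, joining wrapped lines with a space.
    blocks = []
    current = []
    for line in raw.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))

    if not blocks:
        return "<text></text>"

    headline, paragraphs = blocks[0], blocks[1:]
    # Escape &, < and > so the content cannot be mistaken for markup.
    lines = ["<text>", f"<head>{escape(headline, quote=False)}</head>"]
    lines += [f"<p>{escape(p, quote=False)}</p>" for p in paragraphs]
    lines.append("</text>")
    return "\n".join(lines)


# Invented sample input, for demonstration only.
sample = """Council approves new library

Plans for a new library were approved
on Tuesday.

Work begins next year & ends in 1995.
"""
print(reformat_article(sample))
```

Even under these simplified assumptions the program has to rejoin wrapped lines and escape reserved characters, which suggests why converting real typesetting files was less straightforward than expected.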