Background
The British National Corpus is a 100-million-word collection of texts originally created for use in lexicography and language engineering. Designed to represent a wide cross-section of British English from the end of the 20th century, it contains samples of both written and spoken language from a huge variety of carefully selected sources. The corpus also has rich metadata and detailed linguistic annotation. In the 15 years or more since its first release, the BNC has established itself as a key resource and a point of comparison for all corpus-based work, not only in linguistics and Natural Language Processing, but also in language teaching and more generally within the digital humanities. Its design principles and internal organization have had a major influence on the development of these fields. Moreover, despite the fact that it can no longer be considered representative of contemporary language usage, and despite the enormous increase in availability of language data from the Web, demand for the BNC shows no sign of decreasing. We believe that this is because the BNC remains exceptional in several respects: its size, its balanced design, its rich markup and metadata, and its availability.
Apart from correcting some known errors and inconsistencies, the new edition changes none of these aspects, except perhaps the last. In converting the corpus format from SGML to XML, we believe we have opened up the corpus to many new and exciting applications in many different disciplines. To that end, this workshop aims to share expertise about the corpus and its usability in XML form. As the lingua franca of the web, XML is the mechanism of choice for software development across all academic fields; expertise in its basic concepts and awareness of the huge range of software tools and techniques supporting it are correspondingly pervasive. XAIRA, the software we have developed for searching the BNC, is thus compatible with any other XML corpus, and can also interface with other XML software at a variety of levels. Where in the past corpora tended to be associated with monolithic “closed box” software systems, the trend now is towards open-ended and modular systems working to well-defined standard interfaces, so that development can be carried out by many hands and systems can evolve flexibly to satisfy demand.
The BNC XML edition, like the previous editions, is owned by a small consortium of UK publishers and academics, but OUCS is solely responsible for its maintenance and licensed distribution. is Assistant Director at OUCS, where he has been responsible for management and development of the BNC since the initial corpus design phase. is responsible for the BNC website and help desk. Both have extensive experience in digital humanities support activity, not restricted to the BNC or even to corpus linguistics, and are thus well placed to comment on the wider research implications of this technology. was until recently Dean of the Faculty of Translation at the Scuola Superiore for Linguistics and Translation Study (SSLMIT) at the University of Bologna and has written and published extensively on corpus methods in language pedagogy.