Organization of the project

As this figure demonstrates, production of different types of material was shared out amongst a number of different agencies: Longman focussed on the collection and transcription of spoken materials, and OUP on the transcription of written materials, using a mixture of OCR, rekeying, and processing of materials already in digital form. Conversion of all materials to a single consistent format and validation of its structure was carried out at OUCS, which also maintained a database of contextual and workflow information. Linguistic annotation of the material was carried out at Lancaster, using the well-established CLAWS tagger (discussed below and in Garside 1996), and the resulting texts were then combined with standard metadata descriptions extracted from the database to form a single document conformant (insofar as these were already published) to the recommendations of the Text Encoding Initiative (Sperberg-McQueen 1994).
- permissions: design and implementation of a standard permissions letter for use with all those owning IPR in the materials to be included in the corpus;
- design criteria: definition of the range of text types to be included in the corpus and of their target proportions;
- enrichment and annotation: implementation of linguistic and contextual annotation of the corpus texts;
- encoding and markup: definition of the markup scheme to be applied in the final reference form of the corpus, and of procedures for mapping to it from a variety of data capture formats;
- retrieval software: definition and implementation of simple retrieval software able to make use of the detailed corpus encoding.
Permissions issues
As noted above, the BNC was the first corpus of its size to be made widely available. This was possible largely because of the work done by this task group in successfully defining standard forms of agreement, between rights owners and the Consortium on the one hand, and between corpus users and the Consortium on the other. IPR owners were requested to give permission for the inclusion of their materials in the corpus free of charge, and shown the standard licence agreement which is still used today. Acceptance of this arrangement was perhaps to some extent facilitated by the relative novelty of the concept and the prestige attached to the project; however, by no means every rights owner approached was immediately ready to assign rights to use digital versions of their material for linguistic research purposes indefinitely and free of charge. Some chose to avoid committing themselves at all, and others refused any non-paying arrangements.
Two specific problems attached to permissions issues relating to the spoken materials. Because participants had been assured that their identities would be kept secret, much effort was put into deciding how best to anonymise their contributions without unduly compromising their linguistic usefulness. Specific references to named persons were in many cases removed; the option of replacing them by alternative (but linguistically similar) names was briefly considered but felt to be impractical.
A more embarrassing problem derives from the fact that participants in the demographically sampled part of the corpus had been asked for (and had therefore given) permission only for inclusion of transcribed versions of their speech, not for inclusion of the speech itself. While such permission could in principle be sought again from the original respondents, the effectiveness of the anonymization procedures used now makes this a rather difficult task.
Two additional factors affected the willingness of IPR owners to donate materials: firstly, that no complete texts were to be included; secondly, that there was no intention of commercially exploiting or distributing the corpus materials themselves. This did not however preclude commercial usage of derived products, created as a consequence of access to the corpus. This distinction, made explicit in the standard User Licence, is obviously essential both to the continued availability of the corpus for research purposes, and to its continued usefulness in the commercial sector, for example as a testbed for language products from humble spelling correction software to sophisticated translation memories. To emphasize the non-commercial basis on which the corpus itself was to be distributed, one of the academic members of the consortium, OUCS, was appointed sole agent for licensing its use, reporting any dubious cases to the Consortium itself. Initially restricted to the EU, distribution of the corpus outside Europe was finally permitted in 1998.
Design Criteria
I referred above to the BNC's ‘non-opportunistic design’. A sense of the historical context is also perhaps helpful to understand the singling out of this aspect of the design as noteworthy. During the mid-nineties, although textual materials of all kinds were increasingly being prepared in digital form as a precursor to their appearance in print, the notion that the digital form might itself be of value was not at all widespread. Moreover, digitization in those pre-e-commerce days was far from uniform either in coverage or in format. As a consequence, there was a natural tendency in the research community to snap up such unconsidered trifles of electronic text as were available without considering too deeply their status with respect to the language as a whole. Because, to take one notorious example, large quantities of the Wall Street Journal were widely available in digital form, there was a danger that the specific register typified by that newspaper would increasingly serve as a basis for computationally-derived linguistic generalisations about the whole language.
As a corrective, therefore, the BNC project established at its outset the goal of sampling materials from across the language with respect to explicit design criteria rather than simply their contingent availability in machine-readable form. These criteria (usefully summarized in Atkins 1992) defined a specific range of text characteristics and target proportions for the material to be collected. The goal of the BNC was to make it possible to say something about language in general. But is language that which is received (read and heard) or that which is produced (written and spoken)? As good Anglo-Saxon pragmatists, the designers of the BNC chose to ignore this classic Saussurian dichotomy by attempting to take account of both perspectives.
The objective was to define a stratified sample according to stated criteria. While one might hesitate to claim that the corpus was statistically representative of the whole language in terms either of production or reception, at least the corpus would represent the degree of variability known to exist along certain specific dimensions, such as mode of production (speech or writing); medium (book, newspaper, etc.); domain (imaginative, scientific, leisure etc.); social context (formal, informal, business, etc.) and so on.
This is not the place to rehearse in detail the motivations for the text classification scheme adopted by the BNC4. For example, spoken texts may be characterized by age, sex, or social class (of respondent, not speaker), or by the domain, region, or type of speech captured; written texts may also be characterized by author age, sex, type, by audience, circulation, status, and (as noted above) by medium or domain. Some of these categories were regarded as selection criteria, i.e. the domain of values for this category was predefined, and a target proportion identified for each; while others were regarded as descriptive criteria, i.e. while no particular target was set for the proportion of material of a particular type, other things being equal, attempts would be made to maximize variability within such categories. It should be stressed that the purpose of noting these variables was to improve coverage, not to facilitate access, nor to subset the corpus according to some typological theory.
Inevitably, the design goals of the project had to be tempered by the realities of economic life. A rough guess suggests that the cost of collecting and transcribing in electronic form one million words of naturally occurring speech is at least 10 times higher than the cost of adding another million words of newspaper text: the proportion of written to spoken material in the BNC is thus 10:1, even though many people would suggest that if speech and writing are of equal significance in the language, they should therefore be present in equal amounts in the corpus. Within the spoken corpus, an attempt is made to represent equally the production of different speech types (in the context-governed part) and its reception (in the demographically sampled part).
Similarly pragmatic concerns led to the predominance within the written part of the corpus of published books and periodicals. However, while text that is published in the form of books, magazines, etc., may not be representative of the totality of written language that is produced (since writing for publication is a comparatively specialized activity in which few people engage), it is obviously representative of the written language that most people receive. In addition, it should be noted that significant amounts of other material (notably unpublished materials such as letters or grey literature) are also included. And even within a readily accessible text-type such as newspapers, care was taken to sample both broadsheet and tabloid varieties, both national and regional, in such a way that the readily available (national broadsheet) variety did not drown out the other, less readily found, variants.
The spoken part of the corpus is itself divided into two. Approximately half of it is composed of informal conversation recorded by nearly 200 volunteers recruited for the project by a market research agency and forming a balanced sample with respect to age, gender, geographical area, and social class. This sampling method reflects the demographic distribution of spoken language, but (because of its small size) would have excluded from the corpus much linguistically-significant variation due to context. To compensate for this, the other half of the spoken corpus consists of speech recorded in each of a large range of predefined situations (for example public and semi-public meetings, professional interviews, formal and semi-formal proceedings in academia, business, or leisure contexts).
In retrospect, some text classifications (author ethnic origin for example) were poorly defined and many of them were only partially or unreliably populated. Pressures of production and lack of ready information seriously affected the accuracy and consistency with which all these variables were actually recorded in the text headers. Even such a seemingly neutral concept as dating is not unproblematic for written text — are we talking about the date of the copy used or of the first publication? Similarly, when we talk of ‘Author age’, do we mean age at the time the book was published, or at the time it was written?
Of course, corpora before the BNC had been designed according to similar methods, though perhaps not on such a scale. In general, however, the metadata associated with such corpora had been regarded as something distinct from the corpus itself, to be sought out by the curious in the ‘manual of information to accompany’ the corpus. One innovation due to the Text Encoding Initiative, and adopted by the BNC, was the idea of an integrated header, attached to each text file in the corpus, and using the same formalism. This header contains information identifying and classifying each text, as well as additional specifics such as demographic data about the speakers, and housekeeping information about the size, update status, etc. Again following the TEI, the BNC factors out all common data (such as documentation and definition of the classification codes used) into a header file applicable to the whole corpus, retaining within each text header only the specific codes applicable to that text.5
During production, however, classificatory and other metadata was naturally gathered as part of the text capture process by the different data capture agencies mentioned above and stored locally before it was integrated within the OUCS database from which the TEI headers were generated. With the best will in the world, it was therefore difficult to avoid inconsistencies in the way metadata was captured, and hence to ensure that it was uniformly reliable when combined.
Annotation
Word tagging in the BNC was performed automatically, using CLAWS4, an automatic tagger developed at Lancaster University from the CLAWS1 tagger originally produced to perform a similar task on the one-million-word LOB Corpus. The system is described more fully in Leech 1994; its theory and practice are explored in Garside 1997, and full technical documentation of its usage with the BNC is provided in the Manual which accompanies the BNC World Edition (Leech 2000).
- tokenization into words (usually marked by spaces) and orthographic sentences (usually marked by punctuation); enclitic verbs (such as 'll or 's) and negative contractions (such as n't) are regarded as special cases, as are some common merged forms such as dunno (which is tokenized as ‘do + n't + know’).
- initial POS code assignment: all the POS codes which might be assigned to a token are retrieved, either by lookup from a 50,000 word lexicon, or by application of some simple morphological procedures; where more than one code is assigned to the word, the relative probability for each code is also provided by the lexicon look-up or other procedures. Probabilities are also adjusted on the basis of word-position within the sentence.
- disambiguation or code selection is then applied, using the Viterbi algorithm, which uses the probabilities associated with each code to determine the most likely path through a sequence of ambiguous codes, in rather the same way as the text messaging applications found on many current mobile phones. At the end of this stage, the possible codes are ranked in descending probability for each word in its context.
- idiom tagging is a further refinement of the procedure, in which groups of words and their tags are matched against predefined idiomatic templates, resembling finite-state networks.
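The probabilistic selection step in the pipeline above can be sketched as a toy Viterbi decoder. Everything below — the three-word lexicon, the tagset fragment, and all the probabilities — is invented for illustration; CLAWS4 itself uses a 50,000-word lexicon, the full C5 tagset, and statistics derived from real text.

```python
# Toy Viterbi decoder over ambiguous POS codes (illustrative only).

# Candidate codes for each word, with rough lexical probabilities.
# This sketch assumes every input word appears in this toy lexicon.
LEXICON = {
    "the": {"AT0": 1.0},
    "dog": {"NN1": 0.9, "VVB": 0.1},
    "barks": {"VVZ": 0.8, "NN2": 0.2},
}

# Tag-transition probabilities; "<s>" marks the sentence start.
# Pairs not listed fall back to a small smoothing value.
TRANSITIONS = {
    ("<s>", "AT0"): 0.6,
    ("AT0", "NN1"): 0.7,
    ("NN1", "VVZ"): 0.4,
    ("NN1", "NN2"): 0.1,
}

def viterbi(words):
    """Return the most probable code sequence for a list of known words."""
    best = {"<s>": (1.0, [])}  # tag -> (path probability, path so far)
    for w in words:
        nxt = {}
        for tag, p_lex in LEXICON[w].items():
            # keep only the best-scoring way of arriving at this tag
            nxt[tag] = max(
                ((p * TRANSITIONS.get((prev, tag), 0.01) * p_lex, path + [tag])
                 for prev, (p, path) in best.items()),
                key=lambda scored: scored[0],
            )
        best = nxt
    return max(best.values(), key=lambda scored: scored[0])[1]
```

With these made-up numbers, `viterbi(["the", "dog", "barks"])` selects the article–noun–verb path (AT0, NN1, VVZ), since the alternative readings of dog and barks score lower once the transition probabilities are factored in.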
With these procedures, CLAWS was able to achieve over 95% accuracy (i.e. lack of indeterminacy) in assigning POS codes to any word in the corpus. To improve on this, the Lancaster team developed further the basic ideas of ‘idiom tagging’, using a template tagger which could be taught more sophisticated contextual rules, in part derived by semi-automatic procedures from a sample set of texts which had previously been manually disambiguated. This process is further described in the Reference Manual cited.
By multiword unit we mean the situation where two or more orthographic words are considered by the CLAWS tagger to function as a single unit with a single wordclass. Common examples include adverbial phrases such as of course or in short, and prepositional sequences such as in spite of or up to. Deciding whether or not to treat these orthographic sequences as multiword units sometimes requires interpretation (in short is not adverbial in sequences such as ‘in short sharp bursts’, for example); such situations required extensions to the idiom rules.
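The grouping of orthographic words into multiword units can be sketched as a greedy longest-match scan against a table of fixed expressions. The table and its tags below are invented for illustration, and this naive version deliberately omits the contextual conditions the text describes (it would wrongly match ‘in short’ inside ‘in short sharp bursts’, which is exactly why the idiom rules needed extending).

```python
# Greedy longest-match grouping of multiword units (illustrative only).
# Keys are lower-cased word tuples; values are made-up wordclass codes.
MULTIWORDS = {
    ("of", "course"): "AV0",       # adverbial
    ("in", "spite", "of"): "PRP",  # prepositional
    ("up", "to"): "PRP",
}
MAX_LEN = max(len(key) for key in MULTIWORDS)

def group_multiwords(tokens):
    """Return (text, code) pairs; code is None for ordinary single words."""
    out, i = [], 0
    while i < len(tokens):
        # try the longest possible match first, down to two words
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in MULTIWORDS:
                out.append((" ".join(tokens[i:i + n]), MULTIWORDS[key]))
                i += n
                break
        else:
            out.append((tokens[i], None))  # left for ordinary tagging
            i += 1
    return out
```

Here `group_multiwords("It was of course finished".split())` keeps ‘It’, ‘was’ and ‘finished’ as single words but emits ‘of course’ as one unit with its single code.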
The lemmatization procedure adopted derives ultimately from work reported in Beale 1987, as subsequently refined by others at Lancaster, and applied in a range of projects including the JAWS program (Fligelstone et al 1996) and the book Word Frequencies in Written and Spoken English (Leech et al 2001). The basic approach is to apply a number of morphological rules, combining simple POS-sensitive suffix stripping rules with a word list of common exceptions. This process was carried out during the XML conversion, using code and a set of rules files kindly supplied by Paul Rayson.
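The combination of POS-sensitive suffix stripping with an exception word list can be sketched as follows. The handful of rules and exceptions here are invented for illustration and are far cruder than the actual Lancaster rule files supplied by Paul Rayson.

```python
# Toy POS-sensitive lemmatizer: exceptions first, then suffix rules.

# Irregular forms looked up before any rule applies (illustrative sample).
EXCEPTIONS = {
    ("VERB", "went"): "go",
    ("VERB", "was"): "be",
    ("NOUN", "children"): "child",
}

# Ordered (suffix, replacement) rules, keyed by simplified wordclass.
SUFFIX_RULES = {
    "NOUN": [("ies", "y"), ("es", ""), ("s", "")],
    "VERB": [("ying", "y"), ("ing", ""), ("ied", "y"), ("ed", ""), ("s", "")],
}

def lemmatize(word, pos):
    """Return a headword for (word, pos) using exceptions, then suffix rules."""
    w = word.lower()
    if (pos, w) in EXCEPTIONS:
        return EXCEPTIONS[(pos, w)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        # require a stem of at least two letters to avoid mangling short words
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[: -len(suffix)] + replacement
    return w
```

So `lemmatize("granted", "VERB")` strips ‘-ed’ to give ‘grant’, `lemmatize("ponies", "NOUN")` rewrites ‘-ies’ to ‘pony’, and `lemmatize("children", "NOUN")` comes straight from the exception list.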
Encoding
The markup scheme used by the BNC was originally defined at the same time as the Text Encoding Initiative's work was being done (and to some extent by the same people); the two schemes are thus unsurprisingly close, though there are differences. Since this scheme has been so widely taken up and is well documented elsewhere, we do not discuss it in any detail here.
The segmentation of the text carried out by CLAWS is preserved by means of the <s> elements used throughout the whole text. Each <s> element carries a number to identify it. Within each <s>, every ‘word’ identified by CLAWS is marked as a <w> element carrying on its attributes the original CLAWS C5 wordclass code (e.g. c5="VVN"), the simplified wordclass code derived from it (e.g. pos="VERB"), and the root form or headword for it (e.g. hw="grant").
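This markup can be illustrated with a short invented fragment, parsed here with Python's standard ElementTree. The sentence itself, its number, and the simplified wordclass codes other than "VERB" are assumptions made for illustration; only the attribute names (c5, pos, hw) and the example values VVN, VERB and grant come from the description above.

```python
import xml.etree.ElementTree as ET

# A made-up fragment in the style of the markup described in the text:
# a numbered <s> element whose <w> children carry the C5 code (c5),
# the simplified wordclass (pos), and the headword (hw).
FRAGMENT = """<s n="42">
  <w c5="AT0" pos="ART" hw="the">The </w>
  <w c5="NN1" pos="SUBST" hw="licence">licence </w>
  <w c5="VBD" pos="VERB" hw="be">was </w>
  <w c5="VVN" pos="VERB" hw="grant">granted</w>
</s>"""

sentence = ET.fromstring(FRAGMENT)

# Extract (surface form, C5 code, headword) triples from the <w> elements.
words = [(w.text.strip(), w.get("c5"), w.get("hw"))
         for w in sentence.findall("w")]
```

Reading the corpus this way, the past participle ‘granted’ comes back with its C5 code VVN and its headword ‘grant’, exactly the three layers of annotation the attributes encode.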
The User Reference Guide delivered with the corpus contains a detailed discussion of the scope and significance of this markup system.
In marking up the spoken part of the corpus, many different technical issues had to be addressed. As noted above, this was the first time detailed markup of transcribed speech on such a scale had been attempted. The transcription itself was carried out by staff who were not linguistically trained (but who were however familiar with the regional variation being transcribed — staff recruited in Essex for example were not required to transcribe material recorded in Northern Ireland). Transcribers added a minimal (non-SGML) kind of markup to the text, which was then normalized, converted to SGML, and validated by special purpose software (see further Burnage 1993). The markup scheme made explicit a number of features, including changes of speaker and quite detailed overlap; the words used, as perceived by the transcriber; indications of false starts, truncation, uncertainty; some performance features e.g. pausing, stage directions etc. In addition, of course, detailed demographic and other information about each speaker and each speech context was recorded in the appropriate part of the Header, where this was available.
Words and sentences are tagged as in the written example above. However, sentences are now grouped into utterances, marked by the <u> element, each representing an unbroken stretch of speech, and containing within its start tag a code (such as PS04Y) which acts as a key to access the more detailed information about the speaker recorded in the TEI Header for this text. Note also the <pause> and <event> elements used to mark paralinguistic features of the transcribed speech.
As can readily be seen in the above example, the intention of the transcribers was to provide a version of the speech which was closer to writing than to unmediated audio signal. Thus, the spelling of filled pauses such as erm or mmm is normalised, and there is even use of conventional punctuation to mark intonation patterns interpreted as questions. For more discussion of the rationale behind this and other aspects of the speech transcription see Crowdy 1994.
Software and distribution
In 1994, it was not entirely obvious how one should distribute a corpus the size of the BNC on a not-for-profit basis. Low-cost options such as anonymous ftp seemed precluded by the scale of the data. Our initial policy was to distribute the text compressed to the extent that it would fit on a set of three CDs, together with some simple software system which could be installed by suitably skilled personnel to provide departmental access over a network to the local copy of the corpus. Development of such a software system was undertaken, with the aid of additional funding from the British Library, during the last year of the project. The software now delivered with the corpus is known as XAIRA (XML Aware Indexing and Retrieval Architecture) and derives from that original tool, which was called SARA (for SGML Aware Retrieval Application). XAIRA was developed as a general purpose open source XML-aware tool for searching large or small language corpora with funding from the Andrew W. Mellon Foundation: see further the website at http://www.xaira.org.
It was always intended that access to the BNC should not be contingent on the use of any particular software — this was after all the main rationale behind the use of the international standard SGML as a means of encoding the original corpus, rather than a system tailored to any particular software tool.
As noted above, the BNC dates from the pre-World Wide Web era.7 However, within a year of its publication, it was apparent that web access would be the ideal way of making it available, if only because this would enable us to provide a service to researchers outside the European Union, who were still at this time unable to obtain copies of the corpus itself because of licensing restrictions. The British Library generously offered the project a server for this purpose, and a simple web interface to the corpus was developed. This service, still available at the address http://sara.natcorp.ox.ac.uk, allows anyone to perform basic searches of the corpus, with a restricted range of display options; those wishing for more sophisticated facilities can also download a copy of the SARA client program to access the same server: a small registration fee is charged for continued use of the service beyond an initial trial period.
To complement this service, and in response to the demand for help in using the BNC from the language teaching community, a detailed tutorial guide (Aston 1999) was written, introducing the various facilities of the software in the form of focussed and linguistically-motivated exercises. The Online service remains very popular, receiving several thousand queries each month.