The Xaira Specification

The xairaSpecification supplied in the corpus header determines the behaviour of the XAIRA indexer, and hence of the XAIRA-indexed system delivered with the BNC. In this section, we document that specification as it applies to the BNC only. The information provided here is for reference purposes only, and is of no interest unless you are using the XAIRA system to index the BNC or a similar corpus. Note, however, that this document is not an exhaustive description of the capabilities of the XAIRA system: for more information on that, please consult the project web site.

The xairaSpecification element is a member of the model.encodingPart class, and may therefore be included within the encodingDesc element of the TEI Header for any corpus. It is organized as a number of xairaList elements, each of which contains a number of xairaItem elements. Both of these latter elements have a type attribute which specifies more exactly the function of the item or list, by supplying one of a number of predefined codes, as further described in this section.

The following values are defined for the type attribute on xairaList:

elementSpec

lists and glosses the elements, attributes, and codebooks used in a corpus

keySpec

specifies how items are to be indexed

regionSpec

specifies any predefined regions to be made available to the client

lemmaSpec

specifies any lemmatization schemes used

refSpec

specifies how items are to be referenced

indexSpec

specifies any special indexing policies

langSpec

specifies any language-specific rules

All of these are used in the BNC.

The following values are defined for the type attribute on the xairaItem element:

element

an element

form

a lexical form for the indexer

addKey

an additional key for the indexer

lemmaScheme

a lemmatization scheme

region

a region

textRef

a reference identifying a document

unitRef

a reference identifying a low level unit within a document

scopeRef

a reference identifying a low level unit used to delimit results obtained when querying a corpus

indexPol

defines an index policy

defaultLang

specifies the default language for a corpus

langRules

specifies non-default tokenization or collation rules for a language used in a corpus

All of these except the last are used in the BNC.

Element specification

A XAIRA element specification consists of a xairaList of type elementSpec containing one or more xairaItem elements, one for each element that the Xaira indexer or client needs to be aware of. Elements which are not mentioned within the Xaira element specification may nevertheless appear within a corpus. When the indexer finds such an element, it will index it using all default options; the client will not have access to any explanatory text or gloss for such elements. Equally, the specification may include definitions for elements which do not appear within the corpus.

The simplest form of element specification just provides a description for the element, for example "marks a pause in the transcription".

More usually, an element specification will also supply glosses for the attributes of an element. These are supplied by an attList element embedded within the xairaItem, consisting of one attDef element for each attribute concerned. In the following example, the first description glosses the element itself and the remaining three gloss its attributes:

a lexical token as identified by CLAWS

the part of speech assigned to a token by CLAWS

the base form of a word as determined by the Lancaster lemmatization scheme

a simplified part of speech code derived from the CLAWS C5 tag

Descriptions may also be supplied for the values indexed for given attributes. This is accomplished by providing a valList element within the attDef. In the following example, the descriptions gloss an element, one of its attributes, and then each value of that attribute:

a person whose speech is recorded in the corpus

the age group to which a person belongs

unknown age

under 16 years old

aged 16 to 35 years

aged 36 to 45 years

46 or older

The values A0, A1 etc. supplied by the ident attribute on valItem need not be unique across the corpus.
A single definition may be supplied for global attributes which appear on any element. In the example given, an item described as "global attributes" carries the attribute gloss "a name or number used to label any element".

If the element or attribute to be defined is taken from some non-default namespace, the ns attribute must be supplied on the xairaItem element. In the same example, the globally-available xml:id attribute is explicitly associated in this way with the namespace http://www.w3.org/XML/1998/namespace.

A type attribute may also be specified on the valList element to indicate whether the list of values it contains is exhaustive or exemplary; at present, however, Xaira does not use this information.

Key specification

A Xaira key specification is used to define how the indexer should identify which parts of the input documents are to be regarded as lexical forms and what additional keys should be associated with those forms. Additional keys are used to distinguish otherwise identical forms in the index (for example, the same spelling with two different POS codes); they are also used to build up lemma schemes and regions, on which see below.

The key specification consists of a xairaList of type keySpec. If no specification is given, the indexer will assume that default implicit tokenization is in force and that no additional keys are defined. If a key specification is supplied, it contains at least one xairaItem type="form", optionally followed by one or more xairaItem type="addKey" elements. Each of these may contain a desc element to document its purpose, and should also contain a valSource element to specify an element or attribute within the corpus being indexed which is to be used as the source for the values to be used as a key.

The BNC index specification begins by specifying that the elements w and c delimit the forms which the indexer must index.
The valSource element specifies where the indexer is to find the value which is to be treated as the form part of the index entry. In both cases, it is found as the content of an element, either c or w. Since no further information is given about where such elements are to be found, this will apply to every occurrence of a w or c element, irrespective of its context. Since no namespace is specified, the element is assumed to be in the current or default namespace.

Next, the BNC index specification defines three additional keys, corresponding with the attributes c5, hw, and pos. The first is the CLAWS C5 code, which is supplied as the value of a c5 attribute on the elements w and c; its definition names those two elements and gives XXX as a default value. This defines an additional key called c5, the value of which is supplied by the attribute also called c5, but only when that attribute is supplied on an element called w or c, at any point in the document structure. Other attributes called c5 (such as that on mw) will not be used for this purpose. When an additional key value is required but no value is available, because the attribute or element specified does not exist or has no value, the literal content of the defaultVal element (XXX in this definition) will be used instead. In the BNC, this should not happen, and this value should not therefore appear.

The remaining two additional keys are defined in much the same way, except that they derive from attributes specified only for the w element; XXX again appears as the default value. These additional keys are used in the BNC lemma scheme specification discussed below. The caseFold attribute is used to specify that forms should be case folded before indexing, so that forms differing only in letter case will be stored identically.
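The behaviour just described (an attribute-derived key, a fallback to the default value, and case folding of forms) can be modelled in a few lines of Python. This is only an illustrative sketch, not the XAIRA indexer: the function names, the sample sentence, and the use of Python's `casefold()` as the folding rule are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DEFAULT_VALUE = "XXX"  # stands in for the content of the defaultVal element


def additional_key(element, attribute, source_elements, default=DEFAULT_VALUE):
    """Return the additional key value for one form-bearing element.

    The attribute is consulted only on the named source elements; a missing
    or empty attribute falls back to the default value.
    """
    if element.tag in source_elements:
        value = element.get(attribute)
        if value:
            return value
    return default


def index_form(element, case_fold=True):
    """Return the form to store, optionally case folded (assumed folding rule)."""
    text = (element.text or "").strip()
    return text.casefold() if case_fold else text


# Hypothetical input: the third w element deliberately lacks a c5 attribute.
sample = ET.fromstring(
    '<s><w c5="AT0">The </w><w c5="NN1">Dog</w><w>barks</w><c c5="PUN">.</c></s>'
)

entries = [
    (index_form(el), additional_key(el, "c5", {"w", "c"}))
    for el in sample.iter()
    if el.tag in {"w", "c"}
]
print(entries)
```

In this sketch, "The" and "the" would be stored identically, and the token without a c5 attribute receives the key value XXX, which is the situation the text says should not arise in the BNC itself.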
The last additional key defined in the BNC index specification is derived from a source other than an element or attribute. Its description reads "defines the additional keys used to support filtering of text from different regions of selected texts", and its definition lists the strings stext, teiHeader, wtext, and nowhere. The effect of this is to define an additional key called region, the value of which on a given form in the index will be one of the strings stext, teiHeader, wtext, or nowhere, depending on the location of the form being indexed. The name() identifier in the definition indicates that it is the name of the associated elements which is to be used as the value of the key, rather than their content. If no nameList were provided, then the key generated would contain the name of the nearest ancestor element. This key is used in the subsequent region specification.

Lemma Scheme Specification

Any combination of additional keys may be used to form a lemma scheme. This enables the values of the nominated keys to be treated as alternate forms for the associated index entries. For example, occurrences of words such as "dogs", "dogged", "dogging" etc. in the BNC all have the value "dog" for an additional key called "Headword". To distinguish verbal senses from nominal ones, this additional key would need to be combined with another key giving the part of speech (noun or verb) for each occurrence. The resulting lemma scheme would then distinguish forms of "dog (noun)" from forms of "dog (verb)".

Xaira supports the definition of multiple lemma schemes, but only a single one is defined for the BNC. All lemma schemes are defined together in a single xairaList type="lemmaSpec" element, containing one xairaItem type="lemmaScheme" for each scheme. This element contains an optional desc, followed by a nameList containing the names of the additional keys used to constitute the scheme. (The name of the additional key is the name supplied by the ident attribute when the key was defined.)
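To make the idea of a lemma scheme concrete, the following Python sketch groups forms by the combined values of the additional keys that make up a scheme. It is a hedged illustration only: the occurrence records, the part-of-speech values SUBST and VERB, and the function name `lemma_groups` are invented for the example and say nothing about how the XAIRA index is actually stored.

```python
from collections import defaultdict

# Hypothetical index entries: (form, {additional key name: value}).
occurrences = [
    ("dogs", {"Headword": "dog", "pos": "SUBST"}),
    ("dog", {"Headword": "dog", "pos": "SUBST"}),
    ("dogged", {"Headword": "dog", "pos": "VERB"}),
    ("dogging", {"Headword": "dog", "pos": "VERB"}),
]


def lemma_groups(entries, key_names):
    """Group forms under the tuple of values of the nominated additional keys."""
    groups = defaultdict(set)
    for form, keys in entries:
        lemma = tuple(keys[name] for name in key_names)
        groups[lemma].add(form)
    return dict(groups)


# A scheme built from the headword alone conflates noun and verb uses ...
by_headword = lemma_groups(occurrences, ["Headword"])
# ... whereas combining headword and part of speech keeps them apart.
by_headword_and_pos = lemma_groups(occurrences, ["Headword", "pos"])

print(by_headword)
print(by_headword_and_pos)
```

Adding the second key to the scheme is what separates "dog (noun)" from "dog (verb)" in the grouping.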
Thus, the lemma scheme defined for the BNC names the two additional keys Headword and pos. This defines a lemma scheme called BNC which is based on the combination of the values given by the additional keys Headword and pos, which were defined in the previous section.

Region Specification

A region is a collection of possibly discontinuous sections of a corpus defined by the XML tagging within it. For example, each BNC document contains a teiHeader element and either a wtext or an stext element. We say that all the parts of each document contained by a teiHeader element constitute one region. All the parts contained by either a wtext or an stext element constitute another region. Regions (unlike partitions) span document boundaries, and are not made up of whole texts but of defined parts of them.

A region is defined by means of a xairaItem of type region. The ident attribute on the xairaItem supplies a name for the region, which can be used by the client to limit searches to locations within the named region. The definition of the region is contained within a nameList. It combines the name of a previously-defined additional key (region in the case of the BNC), which is tagged as an ident element, with a list of one or more values. Word occurrences whose region additional key has one of the values specified will be considered to fall within the region being defined. Since these values are element names, they are tagged within the nameList using the gi element.

For example, one part of the BNC region specification names the key region and the value stext, and so defines a region called speechOnly. Any word for which the additional key region has the value stext will fall within this region.
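The relationship between the region additional key and a named region can be sketched as follows. This is an assumption-laden model rather than XAIRA's own algorithm: it takes the key value to be the name of the nearest enclosing stext, teiHeader, or wtext element, treats nowhere as the fallback when no such ancestor exists (the source does not spell this out), and uses a made-up document in which the header also contains a w element.

```python
import xml.etree.ElementTree as ET

REGION_ELEMENTS = {"stext", "teiHeader", "wtext"}
FALLBACK = "nowhere"  # assumed fallback when no listed ancestor encloses the form

# Region name -> values of the "region" additional key that fall within it.
REGIONS = {"speechOnly": {"stext"}}


def region_keys(root):
    """Yield (form, region key) for each w element, using the nearest listed ancestor."""

    def walk(element, current):
        if element.tag in REGION_ELEMENTS:
            current = element.tag
        if element.tag == "w":
            yield (element.text or "").strip(), current
        for child in element:
            yield from walk(child, current)

    yield from walk(root, FALLBACK)


def in_region(region_key, region_name):
    """Report whether a form with this key value falls within the named region."""
    return region_key in REGIONS[region_name]


# Hypothetical document, for illustration only.
doc = ET.fromstring(
    "<bncDoc>"
    "<teiHeader><title><w>Talk</w></title></teiHeader>"
    "<stext><u><s><w>hello</w><w>there</w></s></u></stext>"
    "</bncDoc>"
)

keys = list(region_keys(doc))
speech_only = [form for form, key in keys if in_region(key, "speechOnly")]
print(keys)
print(speech_only)
```

Because membership is decided per word occurrence, a region naturally spans document boundaries and need not consist of whole texts.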
Two other regions are defined in a very similar way in the BNC: one pairs the key region with the value teiHeader, and the other pairs it with the values stext and wtext. The first of these defines the region headerOnly, for words occurring within the header; the second defines the region textOnly, for words occurring within wtext or stext elements, as indicated by the values supplied for their respective region additional key.

Reference specification

The index maps occurrences of index terms as defined in the previous section to locations in the corpus, which may be identified in a number of ways, additional to the internally-defined location system. This external referencing scheme is used by the system to label the context of occurrences found by the search program. Occurrences themselves are precisely located by the internal location scheme. Although the index contains information about the complete XPath location of occurrences within the corpus, the internal location scheme is highly optimized and cannot be used to support access via arbitrary XPath or XQL queries.

The referencing scheme used to identify contexts has the following components:

a single text identifier: this may be derived from a system identifier, or specified by a nominated attribute on the element which contains the text, or it may be calculated by the indexer in terms of the XML structure indexed.

a single scope identifier: this may be derived from the value of a specified attribute on any element in the text; calculated by the indexer in terms of the XML structure; or derived from the physical input structure.

optionally, additional unit labels: these may be derived from the value of a specified attribute on any element in the text; calculated by the indexer in terms of the XML structure; or derived from the physical input line number.
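As a rough illustration of how a text identifier and a scope identifier might be combined into a single reference label, the sketch below fills a template of the form %1.%2, which is the format quoted for the BNC in this section. Reading %1 and %2 as positional placeholders is an assumption inferred from the ABC.123 example, and `format_reference` is a hypothetical helper, not part of XAIRA.

```python
import re


def format_reference(template, values):
    """Replace %1, %2, ... in the template with the corresponding value."""

    def substitute(match):
        # Placeholders are numbered from 1; Python lists are indexed from 0.
        return values[int(match.group(1)) - 1]

    return re.sub(r"%(\d+)", substitute, template)


# Text identifier ABC (from bncDoc) and scope identifier 123 (from s).
label = format_reference("%1.%2", ["ABC", "123"])
print(label)
```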

The element from which the text identifier is derived also delimits a single text in the corpus. This effectively limits the kinds of value which may be used to identify it: it must be an attribute value or a pseudo value; element content is not permitted.

The referencing specification for a Xaira index is given by a xairaList type="refSpec", containing exactly one xairaItem type="textRef", followed by one xairaItem type="scopeRef" and optionally one or more further xairaItem type="unitRef" elements. Each such xairaItem element contains a valSource element as defined above, to indicate where the value for the reference is to be obtained in the input document. It may also contain a labelGen element which further defines the parts of the document to which the reference applies and its format.

The BNC reference specification names the elements bncDoc and s, and gives the format %1.%2. In the BNC, each bncDoc begins a new text, which is identified by the value of its xml:id attribute, and the scope for each query is to be a complete s element, identified by its n attribute. The reference is to be formatted with a dot between the two values. This specification will produce references like ABC.123 for an s element with attribute n set to 123, found within a bncDoc element whose xml:id attribute has the value ABC.

Indexing Policies

In addition to index terms derived from the lexical content of a corpus, a Xaira index also contains information about the occurrence of XML start- and end-tags within the corpus. This information is used to facilitate a number of search options: searching for non-lexical features, searching for lexical features within a given structural context, scoping co-occurrences of lexical or non-lexical features, etc. By default an entry is made in the index for each occurrence of each tag, both start and end. This entry may also distinguish start-tag occurrences depending on the values of specified attributes supplied with them.
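The default treatment of tags described above can be pictured with a small Python model that lists the tag entries produced for a fragment of XML. It is a sketch under stated assumptions: the entry strings, the `tag_entries` function, and the choice of type as the distinguishing attribute are illustrative, and the real index format is not documented here.

```python
import xml.etree.ElementTree as ET


def tag_entries(root, distinguishing=("type",)):
    """List tag entries: one per start-tag, one per end-tag, plus an extra
    start-tag entry for each distinguishing attribute that is present."""
    entries = []

    def walk(element):
        entries.append(element.tag)
        for name in distinguishing:
            value = element.get(name)
            if value is not None:
                entries.append(f'{element.tag} {name}="{value}"')
        for child in element:
            walk(child)
        entries.append("/" + element.tag)

    walk(root)
    return entries


plain = tag_entries(ET.fromstring("<head>The heading</head>"))
sub = tag_entries(ET.fromstring('<head type="sub">The subheading</head>'))
print(plain)
print(sub)
```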
Note that distinguishing start-tags by attribute value in this way is independent of the use of such attribute values in the creation of index terms, as described in the previous section. For example, <head>The heading</head> will create index entries for the tags head and /head, whereas <head type="sub">The subheading</head> will create index entries for the tags head, head type="sub" and /head.

The content of every element found in a corpus is indexed by default, as are all of the tags, and all of their attributes. This behaviour may be modified by specifying explicit indexing policies for elements to which this default policy does not apply. An indexing policy may not be specified for elements or attributes which have been nominated as the sources for an additional key or reference, since these are indexed in a different way. Any indexing policy specified for such elements or attributes will be ignored by the indexer.

The following indexing policies are used in the BNC:

none

No part of the specified element or attribute will be indexed. In the case of an element, this means that none of its start and end tags, its attributes, its child elements, and its character data content will be included in the index. In the case of an attribute, its value will be omitted.

markup

This policy applies only to elements. Only start- and end-tags and attributes for the specified element and for any child elements will be indexed; no content of the element or its children will be indexed.

jointo

This policy applies only to attributes. The specified attribute is available for use as the target of an attribute indexed with the joinfrom policy.

joinfrom

This policy applies only to attributes. The attribute specified has values which correspond with those on an attribute of some other element which has been indexed with the jointo policy, or (if no jointo attribute has been defined) which uses the xml:id attribute.

taxonomy

This policy applies only to attributes. The attribute specified has values which correspond with the xml:id attribute on some category element within a TEI-conformant taxonomy element.

For every element or attribute to which a non-default indexing policy applies, a xairaItem type="indexPol" appears within the xairaList type="indexSpec" element. This may contain either an elementPolicy or an attributePolicy element, depending on whether it relates to elements or attributes.

Index policies NONE and MARKUP

Within the BNC, an indexing policy of none is applied to the element revisionDesc. The effect of this is that, although revisionDesc elements will be visible in search results, they cannot be searched for, and a query for one, or for anything contained by one, will return no hits.

The indexing policy markup is applied to the element bibliography. One occurrence of this element, declared in its own namespace, is necessary for a XAIRA system: it holds metadata relating to each text constituting the corpus. In the BNC this bibliographic information is copied from the text headers, which are also indexed in their own right. To avoid duplication of this content, the indexer is instructed to index only the structure of the bibliography but not its content.

Index policies JOINFROM and JOINTO

The purpose of the joinfrom and jointo indexing policies is to support join queries. A join query is one in which attributes are effectively transferred from one element to another. For example, in the BNC, each text header contains detailed data about individual speakers within the person element, and the transcribed part of the corpus uses the attribute who on each u element to identify the speaker or speakers of each speech:

<person xml:id="ABC" age="A" soc="B1"> ... </person>
<person xml:id="DEF" age="Z" soc="A1"> ... </person>
...
<u who="ABC">....</u>
<u who="DEF">....</u>

Since the values for who all correspond with the value of an xml:id attribute on some person element, a join query can be effected.
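The effect of such a join can be imitated outside XAIRA with a short Python sketch, which looks up each who value among the xml:id values of the person elements and copies the matching attributes across. This is a minimal model under stated assumptions, not the indexer's implementation: the wrapper element, the speech text, and the function `join_attributes` are hypothetical, and each who value is assumed to name a single speaker.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = ET.fromstring(
    "<text>"
    '<person xml:id="ABC" age="A" soc="B1"/>'
    '<person xml:id="DEF" age="Z" soc="A1"/>'
    '<u who="ABC">first speech</u>'
    '<u who="DEF">second speech</u>'
    "</text>"
)


def join_attributes(root, from_element, from_attribute, to_element):
    """Return, for each from_element, its own attributes plus those of the
    to_element whose xml:id equals the value of from_attribute."""
    targets = {}
    for target in root.iter(to_element):
        key = target.get(XML_ID)
        if key in targets:
            # Join-to values are expected to be unique.
            raise ValueError(f"duplicate join-to value: {key}")
        targets[key] = target

    joined = []
    for source in root.iter(from_element):
        attributes = dict(source.attrib)
        target = targets.get(source.get(from_attribute))
        if target is not None:
            for name, value in target.attrib.items():
                if name != XML_ID:
                    attributes.setdefault(name, value)
        joined.append(attributes)
    return joined


joined = join_attributes(doc, "u", "who", "person")
print(joined)
```

The duplicate check in the sketch mirrors the requirement that a join-to value identify exactly one target element.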
The XAIRA client can be configured to support queries in which the attributes age and soc appear to be attributes of the u element, their values being transferred from the person element whose xml:id value is equal to that given by the who attribute on u. The effect is as it would be if the u elements above looked like this:

<u who="ABC" age="A" soc="B1">....</u>
<u who="DEF" age="Z" soc="A1">....</u>

This is accomplished by a set of indexing policies which name the elements u and person. First, we declare a join-to policy for any xml:id attribute. Next we declare the join-from policy for the who attribute on the u element. As well as specifying which attribute carries the value required (who), we need additionally to supply the name of the element on which the corresponding join-to attribute should be found (person). Values are transferred when a match is found between the value for the who attribute and that of whichever attribute of the nominated element has been indexed with the join-to policy.

Note that only one attribute of a given element may be indexed with the join-to policy, and that the values of attributes indexed with the join-to policy must be unique within the specified element and attribute combination. Thus, there may be only one person element with the value ABC for its xml:id attribute, though the same value may appear on other attributes. If the value appears on the xml:id attribute of some other element, it will not be found with this join-to policy. Note that, since the globally-available xml:id attribute is used to hold the join-to attribute, its values must be unique across the whole corpus.

Index policy taxonomy

A taxonomy is a special kind of codebook, the purpose of which is to provide a set of defined codes to classify the texts making up a corpus. The BNC defines several different taxonomies as means of classifying its constituent texts. The element or attribute within a particular text which identifies its classification, by referencing one or more codes within a taxonomy, is called its classifier.
Each distinct taxonomy for a corpus is defined by a TEI taxonomy element, within the corpus header. This defines the codes available for use and gives a gloss to them. Where, as is usual, the texts in a corpus are classified along more than one dimension (for example, by text type, by medium of distribution, by audience type etc.), a taxonomy must be defined for each dimension, rather than defining a single taxonomy with disjoint sets of children. Note that the classification codes used must be unique across the whole corpus, irrespective of the taxonomy to which they belong. This approach also enables the client to regard each taxonomy as defining a partition of the corpus. To use a taxonomy defined in this way, the relevant attribute must be defined with the taxonomy indexing policy. In the case of the BNC, classification information is carried by two attributes:

the targets attribute on the catRef element in each text header supplies a list of values for all the original selection and descriptive criteria

the type attribute on wtext and stext elements carries a broad-brush text-type categorization, derived from the other classification codes

The declarations that achieve this effect apply the taxonomy policy to the relevant attributes on the catRef, wtext, and stext elements.

Language specification

As a Unicode system, XAIRA is able to handle data in any natural language or writing system. However, it is still necessary to specify the language or languages used in the corpus being indexed. This specification is performed by a xairaList of type langSpec. This contains at least one xairaItem type="defaultLang", and optionally other xairaItem type="langRules" elements. The BNC uses only standard English and thus contains only a default language specification, which looks like this: