The Xaira Specification
The xairaSpecification supplied in the corpus header
determines the behaviour of the XAIRA indexer, and hence of the
XAIRA-indexed system delivered with the BNC. In this section, we
document that specification as it applies to the BNC only. The
information provided here is for reference purposes only, and is of no
interest unless you are using the XAIRA system to index the BNC or a
similar corpus. Note however that this document is not an exhaustive
description of the capabilities of the XAIRA system: for more
information on that, please consult the project web site at
The xairaSpecification element
is a member of the
model.encodingPart class, and may therefore be included
within the encodingDesc element of the TEI Header for any
corpus. It is organized as a number of
xairaList elements, each of which contains a number of
xairaItem elements. Both of these latter elements have a
type attribute which specifies more exactly the function of
the item or list, by supplying one of a number of predefined codes, as
further described in this section.
The following values are
defined for the type attribute on xairaList:
elementSpec: lists and glosses the elements, attributes, and codebooks used in a corpus
keySpec: specifies how items are to be indexed
regionSpec: specifies any predefined regions to be made available to the client
lemmaSpec: specifies any lemmatization schemes used
refSpec: specifies how items are to be referenced
indexSpec: specifies any special indexing policies
langSpec: specifies any language-specific rules
All of these are used in the BNC.
The following values are defined for the type attribute
on the xairaItem element:
element: an element
form: a lexical form for the indexer
addKey: an additional key for the indexer
lemmaScheme: a lemmatization scheme
region: a region
textRef: a reference identifying a document
unitRef: a reference identifying a low-level unit within a document
scopeRef: a reference identifying a low-level unit used to delimit results obtained when querying a corpus
indexPol: defines an index policy
defaultLang: specifies the default language for a corpus
langRules: specifies non-default tokenization or collation rules for a language used in a corpus
All of these except the last are used in the BNC.
Element specification
A XAIRA element specification consists of a xairaList of
type elementSpec containing one or more
xairaItem elements, one for each element that the Xaira
indexer or client needs to be aware of. Elements which are not
mentioned within the Xaira element specification may however appear
within a corpus. When the indexer finds such an element, it will index
it using all default options; the client will not have access to any
explanatory text or gloss for such elements. Equally, the
specification may include definitions for elements which do not appear
within the corpus.
The simplest form of element specification just provides a description for
the element:
marks a pause in the transcription
More usually, an element specification will also supply glosses for the
attributes of an element. These are supplied by an attList
element embedded within the xairaItem, consisting of one
attDef element for each attribute concerned:
a lexical token as identified by CLAWS
the part of speech assigned to a token by CLAWS
the base form of a word as determined by the
Lancaster lemmatization scheme
a simplified part of speech code derived from
the CLAWS C5 tag
Descriptions may also be supplied for the values indexed for given
attributes. This is accomplished by providing a valList
element within the attDef, as in the following example:
a person whose speech is recorded in the corpus
the age group to which a person belongs
unknown age
under 16 years old
aged 16 to 35 years
aged 36 to 45 years
46 or older
The values A0, A1 etc. supplied by the ident attribute
on valItem need not be unique across the corpus.
A single definition may be supplied for global
attributes which appear on any element by using the following syntax:
global attributes
a name or number used to label any element
If the element or attribute to be defined is taken from some non-default
namespace, the ns attribute must be supplied on the
xairaItem element:
global attributes
a name or number used to label any element
Here the globally-available xml:id attribute is explicitly
associated with the namespace
http://www.w3.org/XML/1998/namespace
A type attribute may also be specified on the
valList element to indicate whether the list of values it
contains is exhaustive or exemplary; at present, however, Xaira does not use
this information.
Key specification
A Xaira key specification is used to define how the indexer should
identify which parts of the input documents are to be regarded as
lexical forms and what additional keys should
be associated with those forms. Additional keys are used to
distinguish otherwise identical forms in the index (for example, the
same spelling with two different POS codes); they are also used to
build up lemma schemes and regions, on which
see below.
The key specification
consists of a xairaList of type
keySpec. If no specification is given, the indexer will
assume default implicit tokenization is in force and no additional
keys are defined.
If a key specification is supplied, it contains at least one xairaItem
type="form", optionally followed by one or more xairaItem
type="addKey" elements, each of which may contain a desc
element to document its purpose, and should also contain a
valSource element to specify an
element or attribute within the corpus being indexed which is to be
used as the source for the values to be used as a key.
The BNC index specification begins by specifying that the elements
w and c delimit the forms which the indexer must
index:
...
The valSource element specifies where the
indexer is to find the value which is to be treated as the form part
of the index entry. In both cases, it is found as element content, of
a c or w element. Since no further information is
given about where such elements are to be found, this will apply to
every occurrence of a w or c element, irrespective
of its context. Since no namespace is specified, the element is
assumed to be in the current or default namespace.
Next, the BNC index specification defines three additional keys,
corresponding with the attributes c5,
hw, and pos. First, the CLAWS
C5 code which is supplied as the value of a c5 attribute
on the elements w and c:
w
c
XXX
This defines an additional key called c5, the
value of which is supplied by the attribute also called c5,
but only when that attribute is supplied on an element called
w or c and at any point in the document
structure. Other attributes called c5 (such as that on
mw) will not be used
for this purpose.
When an additional key value is required, but no value is
available, because the attribute or element specified does not exist
or has no value, the literal content of the defaultVal element
(XXX in the example above) will be used instead. In the
BNC, this should not happen, and this value should not therefore appear.
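The selection and fallback behaviour described above can be sketched in Python. This is only an illustration under stated assumptions, not the Xaira indexer's code: the function name additional_key and the sample markup are hypothetical, and the w element without a c5 attribute is included purely to show the default value being applied.

```python
import xml.etree.ElementTree as ET

DEFAULT_VAL = "XXX"  # literal content of defaultVal in the BNC specification


def additional_key(elem, attr, allowed=("w", "c"), default=DEFAULT_VAL):
    """Return the key value for one element, or None if it is not a key source.

    Hypothetical helper: only elements named in `allowed` supply the key;
    a missing or empty attribute falls back to the default value.
    """
    if elem.tag not in allowed:
        return None  # e.g. a c5 attribute on mw is not used for this key
    value = elem.get(attr)
    return value if value else default


# Invented sample: the second w deliberately lacks a c5 attribute.
sample = ET.fromstring(
    '<s><mw c5="AV0"><w c5="PRP">in </w><w>short</w></mw><c c5="PUN">.</c></s>'
)
c5_keys = [(e.tag, additional_key(e, "c5")) for e in sample.iter()]
```

Here the mw and s elements yield no key at all, while the w element lacking the attribute receives XXX.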
The remaining two additional keys are defined in much the same way,
except that they derive from attributes specified only for the
w element:
w
w
XXX
These additional keys are used in the BNC lemma scheme specification discussed
below.
The caseFold attribute
is used to specify that forms should be case folded before indexing,
so that forms differing only in letter case will be stored
identically.
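The effect of case folding on an index can be illustrated with a small Python sketch. It assumes that Python's str.casefold is an acceptable stand-in for whatever folding rules Xaira actually applies; the function build_form_index and the token list are hypothetical.

```python
from collections import defaultdict


def build_form_index(tokens, case_fold=True):
    """Map each (optionally case-folded) form to the positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        # With folding on, "The" and "the" share one index entry.
        key = token.casefold() if case_fold else token
        index[key].append(position)
    return dict(index)


tokens = ["The", "dog", "saw", "the", "DOG"]
folded = build_form_index(tokens)
unfolded = build_form_index(tokens, case_fold=False)
```

With folding enabled the five tokens collapse to three entries; without it, each distinct spelling keeps its own entry.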
The last additional key defined in the BNC index specification is
derived from a source other than an element or attribute:
defines the additional keys used to support filtering of text
from different regions of selected texts
stext
teiHeader
wtext
nowhere
The effect of this is to define an additional key called
region, the value of which on a given form in the
index will be one of the strings stext, teiHeader, wtext,
or nowhere depending on the location of the form being
indexed. The name() identifier here indicates that it is the name
of the associated elements which is to be used as the value of
the key, rather than their content. If no nameList were provided, then
the key generated would contain the name of the nearest ancestor
element. This key is used in the subsequent region specification.
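One way to picture how such a key is computed is the following Python sketch. It assumes that the value is the name of the nearest enclosing element from the list, and that nowhere is used when no listed element encloses the form; both points are interpretations of the description above rather than documented Xaira behaviour, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

REGION_NAMES = ("stext", "teiHeader", "wtext")  # names listed in the key definition
FALLBACK = "nowhere"  # assumed to apply when no listed ancestor exists


def region_keys(root, names=REGION_NAMES, fallback=FALLBACK):
    """Return (form, region) pairs for every w or c element under root."""
    results = []

    def walk(elem, current):
        if elem.tag in names:
            current = elem.tag  # nearest listed ancestor so far
        if elem.tag in ("w", "c"):
            results.append(((elem.text or "").strip(), current))
        for child in elem:
            walk(child, current)

    walk(root, fallback)
    return results


sample = ET.fromstring(
    "<bncDoc><w>stray</w>"
    "<teiHeader><title><w>Sample</w></title></teiHeader>"
    "<stext><u><s><w>hello</w><c>!</c></s></u></stext></bncDoc>"
)
keys = region_keys(sample)
```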
Lemma Scheme Specification
Any combination of additional keys may be used to form a
lemma scheme. This enables the values of the nominated
keys to be treated as alternate forms for the associated index
entries. For example, occurrences of words such as "dogs", "dogged",
"dogging" etc in the BNC all have the value "dog"
for an additional key called "Headword". To distinguish verbal senses
from nominal ones, this additional key would need to be combined with
another key giving the part of speech (noun or verb) for each
occurrence. The resulting lemma scheme would then distinguish forms of
"dog (noun)" from forms of "dog (verb)".
Xaira supports the definition of multiple lemma schemes, but only a
single one is defined for the BNC. All lemma schemes are defined together
in a single xairaList type="lemmaSpec" element, containing
one xairaItem type="lemmaScheme" for each scheme. This
element contains an optional desc, followed by a nameList
containing the names of the additional keys used to constitute the
scheme. (The name of the additional key is the name supplied by the
ident attribute when the key was defined.) Thus, the lemma
scheme defined for the BNC has the following specification:
Headword
pos
This defines a lemma scheme called BNC which is based on the
combination of the values given by the additional keys
Headword and pos which were defined in
the previous section.
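The grouping that a lemma scheme produces can be sketched in Python as follows. The function group_by_lemma is hypothetical, and the occurrence data (including the pos values SUBST and VERB) are illustrative rather than taken from the corpus index.

```python
from collections import defaultdict


def group_by_lemma(occurrences, scheme=("Headword", "pos")):
    """Group forms by the tuple of additional-key values named in the scheme."""
    groups = defaultdict(set)
    for form, keys in occurrences:
        lemma = tuple(keys[name] for name in scheme)
        groups[lemma].add(form)
    return {lemma: sorted(forms) for lemma, forms in groups.items()}


# Illustrative occurrences: each form carries its additional keys.
occurrences = [
    ("dogs", {"Headword": "dog", "pos": "SUBST"}),
    ("dogged", {"Headword": "dog", "pos": "VERB"}),
    ("dogging", {"Headword": "dog", "pos": "VERB"}),
    ("dog", {"Headword": "dog", "pos": "SUBST"}),
]
lemmas = group_by_lemma(occurrences)
headword_only = group_by_lemma(occurrences, scheme=("Headword",))
```

Using both keys separates the nominal from the verbal forms; using Headword alone merges them into a single lemma.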
Region Specification
A region is a collection of possibly discontinuous
sections of a corpus defined by the XML tagging within it. For
example, each BNC document contains a teiHeader element and either a
wtext or an stext element. We say that all the parts of
each document contained by a teiHeader element constitute one
region. All the parts contained by either a wtext or an
stext element constitute another region. Regions (unlike
partitions) span document boundaries, and are not made up of whole
texts but of defined parts of them.
A region is defined by means of a xairaItem of type
region. The ident attribute on the
xairaItem supplies a name for the region, which can be used
by the client to limit searches to locations within the named
region.
The definition of the region is contained within a
nameList. It combines the name of a previously-defined
additional key (region in the case of the BNC) which is
tagged as an ident element, with a
list of one or more values. Word occurrences whose
region additional key has the value specified will be
considered to fall within the region being defined. Since these
values are element names, they are tagged within the
nameList using the gi element.
For example:
region
stext
This part of the BNC region specification defines a region called speechOnly. Any
word for which the additional key region has the value
stext will fall within this region.
Two other regions are defined in a very similar way in the BNC:
region
teiHeader
region
stext
wtext
The first of these defines the region headerOnly, for
words occurring within the header; the second defines the region
textOnly for words occurring within wtext or
stext elements, as indicated by the values supplied for their
respective region additional key.
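The three BNC regions amount to a mapping from region names to sets of permitted values for the region additional key. A minimal Python sketch of that idea follows; the helper names and the sample hits are hypothetical, and the way the real client applies regions may differ.

```python
# Region name -> values of the "region" additional key that fall within it.
REGIONS = {
    "speechOnly": {"stext"},
    "headerOnly": {"teiHeader"},
    "textOnly": {"stext", "wtext"},
}


def in_region(region_name, region_key, regions=REGIONS):
    """True if a word with this region key value falls within the named region."""
    return region_key in regions[region_name]


def filter_hits(hits, region_name):
    """Keep only the forms whose region key places them inside the region."""
    return [form for form, key in hits if in_region(region_name, key)]


# Invented hits: (form, value of the region additional key).
hits = [("hello", "stext"), ("Title", "teiHeader"), ("report", "wtext")]
text_only = filter_hits(hits, "textOnly")
speech_only = filter_hits(hits, "speechOnly")
```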
Reference specification
The index maps occurrences of index terms as defined in the
previous section to locations in the corpus, which may be identified in a
number of ways, additional to the internally-defined location
system. This external referencing scheme is used by the
system to label the context of occurrences found by the search
program. Occurrences themselves are precisely located by the internal
location scheme. Although the index contains information about the
complete XPath location of occurrences within the corpus, the internal
location scheme is highly optimized and cannot be used to support
access via arbitrary XPath expressions or XQL queries.
The referencing scheme used to identify contexts has the following
components:
a single text identifier: this may be derived from
a system identifier, or specified by a nominated attribute on the
element which contains the text, or it may be calculated by the indexer
in terms of the XML structure indexed.
a single scope identifier: this may be derived from
the value of a specified attribute on any element in the text;
calculated by the indexer in terms of the XML structure; or derived
from the physical input structure.
optionally, additional unit labels: these may be derived
from the value of a specified attribute on any element in the text;
calculated by the indexer in terms of the XML structure; or derived
from the physical input line number.
The element from which the text identifier is derived also delimits
a single text in the corpus. This effectively limits the
kinds of value which may be used to identify it: it must be an attribute value
or a pseudo value; element content is not permitted.
The referencing specification for a Xaira index is given by a xairaList
type="refSpec", containing exactly one xairaItem
type="textRef", followed by one xairaItem type="scopeRef" and optionally
one or more further xairaItem type="unitRef" elements. Each such
xairaItem element contains a valSource element as
defined above, to indicate where the value for the reference is to be
obtained in the input document. It may also contain a
labelGen element which further defines the parts of the document to
which the reference applies and its format.
bncDoc
s
%1.%2
In the BNC, each bncDoc begins a new text, which is
identified by the value of its xml:id attribute, and
the scope for each query is to be a complete
s element, identified by its
n attribute. The reference is to be formatted with a dot
between the two values.
This specification will produce references like
ABC.123 for an s element with attribute
n set to 123, found within a
bncDoc element whose xml:id attribute has the value
ABC.
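The formatting step can be mimicked in Python as below. This is a sketch only: it assumes that %1 stands for the text identifier and %2 for the scope identifier, as the example above suggests, and format_reference is a hypothetical helper rather than part of Xaira.

```python
import re


def format_reference(template, *values):
    """Replace each %N in the template with the N-th value (1-based)."""

    def substitute(match):
        return str(values[int(match.group(1)) - 1])

    return re.sub(r"%(\d+)", substitute, template)


# Text identifier from bncDoc/@xml:id, scope identifier from s/@n.
label = format_reference("%1.%2", "ABC", 123)
```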
Indexing Policies
In addition to index terms derived from the lexical content of a
corpus, a Xaira index also contains information about the
occurrence of XML start- and end-tags within the corpus. This
information is used to facilitate a number of search options:
searching for non-lexical features, searching for lexical
features within a given structural context, scoping
co-occurrences of lexical or non-lexical features, etc.
By default an entry is made in the index for each occurrence of each tag,
both start and end. This entry may also distinguish start-tag
occurrences depending on the values of specified attributes
supplied with them. (Note that this is independent of the use of
such attribute values in the creation of index terms as described in the previous section.)
For example:
<head>The heading</head> will create index entries for the tags head and /head
<head type="sub">The subheading</head> will create index entries for the tags head, head type="sub" and /head
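The default behaviour for these two head examples can be imitated with a short Python sketch. The entry strings simply follow the notation used above; how Xaira actually stores tag entries is not described here, so the function tag_entries is purely illustrative.

```python
import xml.etree.ElementTree as ET


def tag_entries(elem):
    """List the tag index entries for one element, in the notation used above."""
    entries = [elem.tag]  # the bare start-tag
    for name, value in elem.attrib.items():
        # one further start-tag entry per attribute/value pair
        entries.append(f'{elem.tag} {name}="{value}"')
    entries.append("/" + elem.tag)  # the end-tag
    return entries


plain = tag_entries(ET.fromstring("<head>The heading</head>"))
sub = tag_entries(ET.fromstring('<head type="sub">The subheading</head>'))
```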
The content of every element found in a corpus is indexed by
default, as are all of the tags, and all of their attributes. This
behaviour may be modified by specifying explicit indexing policies for
elements to which this default policy does not apply. An indexing
policy may not be specified for elements or attributes which have been
nominated as the sources for an additional key or reference, since
these are indexed in a different way. Any indexing policy
specified for such elements or attributes will be ignored by the indexer.
The following indexing policies are used in the BNC:
none
No part of the specified element or attribute will be
indexed. In the case of an element, this means that none of its start and
end tags, its attributes, its child elements, and its character data content
will be included in the index. In the case of an attribute, its
value will be omitted.
markup
This policy applies only to elements. Only start- and end-tags and attributes for the specified element
and for any child elements will be indexed; no content of the element or its children will be indexed.
jointo
This policy applies only to attributes. The specified attribute
is available for use as the target of an attribute indexed with the
joinfrom policy.
joinfrom
This policy applies only to attributes. The attribute specified
has values which correspond with those on an attribute of some other
element which has been indexed with the
jointo policy, or (if no jointo attribute has been defined) which uses
the xml:id attribute.
taxonomy
This policy applies only to attributes. The attribute specified
has values which correspond with the xml:id attribute on some
category element within a TEI-conformant taxonomy
element.
For every element or attribute to which a non-default indexing
policy applies, a xairaItem type="indexPol" appears within
the xairaList type="indexSpec" element. This may contain
either an elementPolicy or an attributePolicy
element, depending on whether it relates to elements or attributes.
Index policies NONE and MARKUP
Within the BNC, an indexing policy of none is applied
to the element revisionDesc:
The effect of this is that, although revisionDesc elements
will be visible in search results, they cannot be searched for, and a query
for one, or for anything contained by one, will return no hits.
The indexing policy markup is applied to the element
bibliography. One occurrence of this element, declared in its
own namespace, is necessary for a XAIRA system: it holds metadata
relating to each text constituting the corpus. In the BNC this
bibliographic information is copied from the text headers, which are
also indexed in their own right. To avoid duplication of this content,
the indexer is instructed to index only the structure of the
bibliography but not its content:
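As a rough illustration of the difference between the none and markup policies (in Python, not in the Xaira declaration syntax), the sketch below walks a small invented document and records what would be indexed. The function index_entries and the entry format are hypothetical, attributes are left out for brevity, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET


def index_entries(elem, policies, suppress_content=False):
    """Return ("tag", name) and ("text", chars) entries under the given policies."""
    policy = policies.get(elem.tag, "default")
    if policy == "none":
        return []  # no tags, no children, no character data
    # "markup" suppresses character data here and in all descendants
    suppress = suppress_content or policy == "markup"
    entries = [("tag", elem.tag)]
    if elem.text and elem.text.strip() and not suppress:
        entries.append(("text", elem.text.strip()))
    for child in elem:
        entries.extend(index_entries(child, policies, suppress))
        # text following a child belongs to the current element
        if child.tail and child.tail.strip() and not suppress:
            entries.append(("text", child.tail.strip()))
    entries.append(("tag", "/" + elem.tag))
    return entries


sample = ET.fromstring(
    "<doc><revisionDesc><change>fixed</change></revisionDesc>"
    "<bibliography><title>A text</title></bibliography>"
    "<p>Hello</p></doc>"
)
policies = {"revisionDesc": "none", "bibliography": "markup"}
entries = index_entries(sample, policies)
```

In the result, revisionDesc leaves no trace, bibliography contributes tags only, and the p element is indexed in full.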
Index policies JOINFROM and JOINTO
The purpose of joinfrom and jointo indexing policies is
to support join queries. A join query is one in which
attributes are effectively transferred from one element to another.
For example, in the BNC, each text header contains detailed data about individual
speakers within the person element, and
also uses the attribute who to identify the
speaker or speakers of each speech in the
transcribed part of the corpus:
<person xml:id="ABC" age="A" soc="B1"> ... </person>
<person xml:id="DEF" age="Z" soc="A1"> ... </person>
...
<u who="ABC">....</u>
<u who="DEF"> .... </u>
Since the values for who all correspond with the value of
an xml:id attribute on some person element, a join query can be
effected. The XAIRA client can be configured to
support queries in which the attributes age and soc appear to be
attributes of the u element, their values being transferred
from the person element whose xml:id value is equal to that
given by the who attribute on u. The effect is
as it would be if the u elements above looked like this:
<u who="ABC" age="A" soc="B1">....</u>
<u who="DEF" age="Z" soc="A1">....</u>
This is accomplished by the following set of indexing policies:
u
person
First, we declare a join-to policy for any xml:id
attribute. Next we declare the join-from policy for the who
attribute on the u element. As well as specifying which
attribute carries the value required (who), we need additionally to
supply the name of the element on which the corresponding join-to
attribute should be found (person). Values are transferred when a
match is found between the value for the who attribute and
that of whichever attribute of the nominated element has been indexed
with the join-to policy. Note that only one attribute of a given
element may be indexed with the join-to policy and that the values of
attributes indexed with the join-to policy must be unique within the
specified element and attribute combination. Thus, there may be only
one person element with the value ABC for its xml:id
attribute, though the same value may appear on other attributes. If
the value appears on the xml:id attribute of some other
element, it will not be found with this join-to policy. Note that,
since the globally-available xml:id attribute is used to
hold the join-to attribute, its values must be unique across the
whole corpus.
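The effect of a join can be reproduced outside Xaira with a few lines of Python. The sketch below is an assumption-laden illustration, not the indexer's algorithm: joined_attributes is a hypothetical function, the sample wraps the person and u elements shown earlier in an invented doc element, and each who attribute is assumed to hold a single value.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """
<doc>
  <person xml:id="ABC" age="A" soc="B1"/>
  <person xml:id="DEF" age="Z" soc="A1"/>
  <u who="ABC">first</u>
  <u who="DEF">second</u>
</doc>
"""


def joined_attributes(root, from_gi="u", from_attr="who",
                      to_gi="person", to_attr=XML_ID):
    """Return each join-from element's attributes plus those of its target."""
    targets = {}
    for target in root.iter(to_gi):
        key = target.get(to_attr)
        if key is None:
            continue
        if key in targets:
            # join-to values must be unique for the element/attribute pair
            raise ValueError(f"duplicate join-to value: {key}")
        targets[key] = target
    joined = []
    for source in root.iter(from_gi):
        attrs = dict(source.attrib)
        target = targets.get(source.get(from_attr))
        if target is not None:
            for name, value in target.attrib.items():
                if name != to_attr:
                    attrs.setdefault(name, value)  # transfer age, soc, ...
        joined.append(attrs)
    return joined


result = joined_attributes(ET.fromstring(SAMPLE))
```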
Index policy taxonomy
A taxonomy is a special kind of codebook, the purpose
of which is to provide a set of defined codes to classify the texts
making up a corpus. The BNC defines several different taxonomies as a
means of classifying its constituent texts.
The element or attribute within a particular
text which identifies its classification, by referencing one or more
codes within a taxonomy, is called its classifier.
Each distinct taxonomy for a corpus is defined by a TEI
taxonomy element, within the corpus header. This defines the
codes available for use and gives a gloss to them. Where, as is
usual, the texts in a corpus are classified along more than one
dimension (for example, by text type, by medium of distribution, by
audience type etc.), a taxonomy must be defined for each
dimension, rather than defining a single taxonomy with disjoint sets
of children. Note that the classification codes used must be unique
across the whole corpus, irrespective of the taxonomy to which they
belong. This approach also enables the client to regard each taxonomy
as defining a partition of the corpus.
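The uniqueness constraint on classification codes is easy to check mechanically. The following Python sketch shows one possible check; the function check_unique_codes, the taxonomy names, and the codes are all invented for illustration and are not the BNC's own.

```python
def check_unique_codes(taxonomies):
    """Map each code to its taxonomy, rejecting codes reused across taxonomies."""
    owner = {}
    for taxonomy, codes in taxonomies.items():
        for code in codes:
            if code in owner:
                raise ValueError(
                    f"code {code!r} appears in both {owner[code]!r} and {taxonomy!r}"
                )
            owner[code] = taxonomy
    return owner


# Hypothetical taxonomies, one per classification dimension.
taxonomies = {
    "medium": ["med1", "med2"],
    "audience": ["aud1", "aud2"],
}
owner = check_unique_codes(taxonomies)
```

Because every code belongs to exactly one taxonomy, a classifier value on its own is enough to tell which dimension it describes.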
To use a taxonomy defined in this way, the relevant attribute must be defined
with the taxonomy indexing policy. In the case of the
BNC, classification information is carried by two attributes:
the targets attribute on the catRef element in each
text header supplies a list of values for all the original selection
and descriptive criteria;
the type attribute on wtext and
stext elements carries a broad-brush text-type categorization,
derived from the other classification codes.
The following declarations achieve this effect:
catRef
wtext
stext
Language specification
As a Unicode system, XAIRA is able to handle data in any natural
language or writing system. However, it is still necessary to
specify the language or languages used in the corpus being
indexed. This specification is performed by a xairaList of
type langSpec. This contains at least one
xairaItem type="defaultLang", and optionally other
xairaItem type="langRules" elements.
The BNC uses only standard English and thus contains only a default
language specification, which looks like this: