OUP2BNC(1l)          MISC. REFERENCE MANUAL PAGES          OUP2BNC(1l)

NAME
     oup2bnc - update bnc database with information derived from OUP
     database

SYNOPSIS
     oup2bnc [-ivy] [-n|-u] [-D database] [-b period_start]
     [-e period_end] [-q query]

DESCRIPTION
     oup2bnc updates the BNC database by taking bibliographic and text
     selection data from the bncOUPdbClean database table (which is a
     cleaned-up version of bncOUPdb, the data as received from OUP) and
     updating records in, or appending records to, the following
     tables:

     bncAEjoin
          Links between authors/editors, named in bncAuthEd, and
          monographic works named in bncWork.

     bncAuthEd
          Names and other details of authors and editors.

     bncFile
          Type of text (written, spoken etc.).

     bncImprint
          Names and other details of imprints (publishers).

     bncRegClas
          Regional classifications (USA; France; north, middle, south
          UK; etc.).

     bncSample
          Sample type, page range etc. for books. (See NOTES below.)

     bncSelClas
          Selection classifications (type of sample, author age band
          etc.).

     bncWork
          Name and other details of monographic text.

     Each unprocessed record in bncOUPjoin, a table linking
     bncOUPdbClean and the tables above, is processed in turn. Data in
     bncOUPdbClean is mapped to categories and region codes, generally
     cleaned up, and used to inform decisions about text properties
     not directly stated in the OUP data - sample type, for example.
     This done, the results are used to update the tables listed
     above.

     In the cases of bncAuthEd and bncImprint, a serious attempt is
     made to minimize the number of entries in the tables by searching
     for exact or fuzzy matches for every author, editor, or imprint
     encountered. For imprints, a single exact match of name and
     publishing location results in the existing imprint being used; a
     fuzzy match results in the user being prompted for a decision as
     to whether an existing imprint should be used or a new one added.
     For authors and
     editors, the user is always prompted, even if the match is exact:
     matches are less frequent for author names than for imprints, and
     the possibility exists that there are two different authors
     having the same name. These questions cannot be suppressed by the
     -y option, although they are suppressed by the -n option. (See
     OPTIONS and NOTES below.)

     Whenever it appears that a target table already holds data
     corresponding to information provided by OUP, but with a
     different value, the user is prompted to decide whether the OUP
     data should replace the existing data. This commonly occurs for
     sample type, where data generated at OUCS is likely to be more
     accurate than the guess intuited from OUP's information, and for
     text title, where it is a matter of taste as to whether OUP's or
     OUCS' title is the more accurate.

     Less frequently, oup2bnc suggests alternative values for fields
     which it has itself added to the database. This happens, for
     example, when the user, in response to a question, states that
     two differently-spelled imprint names in fact correspond to a
     single imprint (as in W R Chambers and Chambers). In such cases,
     it is generally better to choose the more verbose name.

     After all data pertaining to a text has been processed, and if
     database changes have resulted, the user is prompted as to
     whether the changes should be accepted. If there are no changes,
     or if the changes are accepted, and if the -u option is in force,
     the corresponding row in the bncOUPjoin table is updated to
     indicate that it has been processed. This causes the row to be
     ignored in subsequent runs of oup2bnc, unless the -i option is
     used for those runs.

     Questions demanding a yes/no answer may be suppressed through use
     of the -n and -y options. (See below.)
     Whenever the user is prompted for such an answer, entering ``!''
     will have the effect of answering this and all subsequent
     questions affirmatively; entering ``q'' will provide a negative
     answer to the current question, then terminate the program. The
     -y option and ``!'' response are dangerous, and their use is not
     recommended.

OPTIONS
     -b period-start
          Process only data received on or after period-start.
          Period-start must be a valid date for Ingres.

     -D database
          Use database instead of the default database, bnc, in
          compiling the report.

     -e period-end
          Process only data received on or before period-end.
          Period-end must be a valid date for Ingres.

     -i   Process all records in bncOUPjoin: ignore flags stating that
          records have already been processed. (See also -u below.)

     -n   Provide a negative answer to all yes/no questions. This
          allows the effect of a run to be judged without updating the
          database. (See NOTES and BUGS below, however.)

     -q query
          Append query as an and-ed condition to the query which
          selects records for processing from the bncOUPjoin table.
          For example, -q "ojTeKey like 'Z%'" would limit the
          selection to texts having BNC codes beginning with Z.

     -u   Set a flag in each bncOUPjoin record processed to show that
          the record has been processed. (See also -i above.)

     -v   Echo the BNC code and disposition of each text processed as
          it is processed, even if no changes to the database result.

     -y   Provide an affirmative answer to all yes/no questions. Use
          with the utmost caution.

DIAGNOSTICS
     Invalid command lines elicit a usage message. Invalid arguments
     for -b, -D, -e and -q will elicit Ingres error messages. Failure
     of any SQL statement results in the printing of an Ingres error
     message, and an immediate error exit.

ENVIRONMENT
     II_SYSTEM
          Location of Ingres files. Defaults to /usr/local.

FILES
     ~natcorp/bin/oup2bnc
          The program itself.
          The man page is embedded: hand the program to nroff -man.

     ~/perl, ~natcorp/perl
          Directories searched, in that order, after the ``standard
          places'' are searched, for required perl library files.

AUTHOR
     Dominic Dunlop

SEE ALSO
     perl(1l); Ingres documentation; TGCW36 - The new BNC database

NOTES
     A quirk of the database design results in there potentially being
     a bncSample record, describing sample start and end points,
     corresponding to each version (A_, B_, ...) of a text, but only
     one bncSelClas record describing sample type per text. Thus,
     oup2bnc never overwrites existing bncSample information because
     it inserts records corresponding to the initial (dot) version of
     a text, whereas the overnight program inserts information about
     sampling in the B_ version. oup2bnc may, however, wish to
     overwrite the bncSelClas information on sample type generated by
     the overnight program from the B_ file. In general, it should not
     be allowed to do so, as the information about the B_ file is
     considerably more likely to be correct for texts as they will
     appear in the corpus.

     Publication dates already on file for magazines, which are
     derived from information in texts' dummy headers or from the
     names of the files as delivered from OUP, are likely to be more
     accurate than those in OUP's database. Consequently, requests to
     replace existing publication date information should generally be
     refused.

     There should be at most one row in bncOUPjoin corresponding to
     each corpus text name. However, the heuristic used to create
     entries in bncOUPjoin produces a small number of clashes - texts
     linked to two or more bncOUPdbClean records. Because oup2bnc
     processes texts in order of text name, such clashes result in the
     user being presented with two or more invitations to accept
     changes for a given text name in succession.
     The second and any subsequent such requests should be regarded
     with suspicion.

     In attempting to match new author names with those already known,
     the program takes date of birth into account if this is known for
     the new author. If an identically named author is already known,
     a possible match will be flagged either if the date of birth of
     the existing author is not known, or if it is within two years of
     that of the new author. This tolerance has to be allowed because
     OUP's information gives author age at the date of data entry,
     and, because we do not know when that date was, calculated birth
     dates are not always correct.

     Under the -n option, the user is not required to enter a choice
     when possible matches for imprint or author name are presented.
     Instead, it is assumed that none of the matches is acceptable.
     This has the effect of suggesting that more new imprints and
     authors will be added to the database than would be the case in
     practice.

BUGS
     The program takes a snapshot of the state of the bncOUPjoin table
     at the start of its run. Updates made by other programs to this
     table after the snapshot has been taken are not noticed by
     oup2bnc. If the -u option is in force, the updated rows will be
     flagged as having been processed, even though oup2bnc has used
     their old, rather than the updated, contents. (Excuse: sqlperl
     does not support multiple cursors...)

     The program is quite slow, taking 10-15 seconds per text if
     updates are required. Most of this time is spent waiting for
     Ingres to service requests. Other programs accessing the tables
     concerned may be locked out for as long as it takes to process
     each text.
     Speeding the program up would require rewriting it to access the
     database more efficiently (for example, by making a single query
     to retrieve all bncSelClas information for a given text, rather
     than a separate query for each classification type), optimizing
     database organization, or both. Don't hold your breath waiting
     for either to happen.

     The -n option is very inefficient in operation: it has the effect
     of performing each update on the database, then rolling it back.
     This at least has the advantage that any difficulties concerned
     with updating the database will be flagged.

     Dates which are not in ISO or ISO-like format in OUP data can get
     badly mangled. When asked, do not allow mangled values to be put
     into the database.

     The program could really do with having a logging facility. As it
     is, if you want a log, you must record the session in which you
     run it with script, cmdtool, or some similar utility.

Sun Release 4.1        Last change: TGCW55: 25 August, 1993
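
EXAMPLES
     The author-matching rule described under NOTES can be sketched as
     follows. This is an illustration only, not the code used by
     oup2bnc (which is written in perl). The function name, the use of
     plain birth years, the exact string comparison of names, the
     inclusive reading of "within two years", and the treatment of a
     new author with no known date of birth are all assumptions made
     for the sketch.

```python
from typing import Optional

# Tolerance stated in NOTES: birth dates within two years may match.
TOLERANCE_YEARS = 2


def is_possible_match(
    new_name: str,
    new_birth_year: Optional[int],
    known_name: str,
    known_birth_year: Optional[int],
) -> bool:
    """Return True if a known author should be offered as a possible match.

    Hypothetical helper: names are compared exactly, and birth dates are
    reduced to years, which is an assumption of this sketch.
    """
    if new_name != known_name:
        return False
    # Assumption: with no birth date for the new author, the name alone
    # decides, since date of birth is only used "if this is known".
    if new_birth_year is None:
        return True
    # A known author with no recorded birth date is always flagged.
    if known_birth_year is None:
        return True
    # Otherwise the two birth years must agree to within the tolerance.
    return abs(new_birth_year - known_birth_year) <= TOLERANCE_YEARS
```

     Under these assumptions, a new author born in 1950 would be
     offered a same-named existing author born in 1952 or with no
     recorded birth date, but not one born in 1953.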