XValidate - Cross-validate a set of files
XValidate attempts to perform a cross-validation on a set of files to look
for inconsistencies. It takes up to two parameters:
XValidate [-n] [[@]filename]
where:
- n is the validation level: currently '0' (or omitted) for full validation;
'1' for middling validation; '2' for minimal validation (no longer used)
- filename is the file to be cross-validated; most commonly this is used with
the @ prefix to indicate a control file containing a list of file names
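To illustrate the calling convention above, here is a minimal sketch in Python of how the two optional parameters might be interpreted. It is not the program's actual code: the names Args and parse_args are hypothetical, and it assumes that a missing level means full validation ('0').

```python
from typing import NamedTuple, Optional


class Args(NamedTuple):
    level: int               # 0 = full, 1 = middling, 2 = minimal
    filename: Optional[str]  # file to validate, or the control file
    is_control_file: bool    # True when the name carried the @ prefix


def parse_args(argv):
    """Interpret arguments of the form [-n] [[@]filename] (both optional)."""
    level = 0            # assumption: no -n means full validation
    filename = None
    is_control = False
    for arg in argv:
        if arg.startswith("-") and arg[1:].isdigit():
            level = int(arg[1:])                  # e.g. "-1" gives level 1
        elif arg.startswith("@"):
            filename, is_control = arg[1:], True  # control file listing names
        else:
            filename = arg                        # a single file to validate
    return Args(level, filename, is_control)
```

For example, parse_args(["-1", "@all.ctl"]) yields level 1 and treats all.ctl (a made-up name) as a control file.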
Note that, if using a control file, the following control flags are supported:
- -X:filename - this specifies additional exception files (instead of looking
for xvalidate.xxx in the current folder). This allows cross-validation to
be done for each index in turn and then for all the indexes together without
(much) duplication of exception listings.
- -V:n - this allows the validation level to be overridden for individual
sections of the control file.
Note also that, if a file contains DQE~VALFULL (even with qualifiers) then
the validation level for that file is set to 0.
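The control flags and the DQE~VALFULL rule can be sketched as follows. This is an illustration only: the layout of a control file is not described here, so the sketch assumes one entry per line with each flag on a line of its own, and the names read_control and effective_level are hypothetical.

```python
def read_control(lines, default_level=0):
    """Return (files, exception_files); files is a list of (name, level)."""
    level = default_level
    files, exception_files = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith("-X:"):
            exception_files.append(line[3:])  # additional exception file
        elif line.startswith("-V:"):
            level = int(line[3:])             # applies to the entries that follow
        else:
            files.append((line, level))       # file name at the current level
    return files, exception_files


def effective_level(level, file_text):
    """A file containing DQE~VALFULL is validated at level 0 regardless."""
    # A substring test also matches the marker when qualifiers follow it.
    return 0 if "DQE~VALFULL" in file_text else level
```

Treating -V:n as applying until the next -V:n is one reading of "individual sections"; the real program may delimit sections differently.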
As discussed under SCAN_FILE, the (current) validation level is stored in
each record. Then, when we are doing the comparisons (see below), the program
sets up two control variables, val_full and val_most, depending on the contents
of the two records being compared. By default both are set to "True", implying
all validation should be done.
However,
- If the author is "Anon." or starts with [, and the two records
are in different files and for different publication details then val_full
is reset to "False";
- If the validation level for both records is not set to 0 and neither
item type is a "major" type (vi,ss,nv,na,sl,pm,ts or n.) then both
val_full and val_most are reset to "False".
This allows an incremental approach to cross-validation rather than trying
to do it all at once.
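The two rules above can be written out as a small Python sketch. The Record fields and the function name validation_flags are hypothetical stand-ins for whatever the scan records really hold. The sketch also reads "both records is not set to 0" as "neither record is at level 0", which is an assumption.

```python
from dataclasses import dataclass

# Item types treated as "major" in the text above.
MAJOR_TYPES = {"vi", "ss", "nv", "na", "sl", "pm", "ts", "n."}


@dataclass
class Record:
    author: str
    file: str          # file the record was scanned from
    pub_details: str   # publication details
    item_type: str
    level: int         # validation level stored when the file was scanned


def validation_flags(a, b):
    """Return (val_full, val_most) for a pair of records being compared."""
    val_full = val_most = True
    # The authors already match at this point, so testing one is enough.
    anonymous = a.author == "Anon." or a.author.startswith("[")
    if anonymous and a.file != b.file and a.pub_details != b.pub_details:
        val_full = False
    # Assumption: "not set to 0" means neither record is at level 0.
    no_full_level = a.level != 0 and b.level != 0
    no_major = (a.item_type not in MAJOR_TYPES
                and b.item_type not in MAJOR_TYPES)
    if no_full_level and no_major:
        val_full = val_most = False
    return val_full, val_most
```

Under this reading, a pair of level-1 records keeps full validation as long as either item is of a major type.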
Initially the program just calls SCAN_FILE
for each file being cross-validated, building up data in the scandata
structure. It then calls qsort to sort the magbuf
and itmbuf arrays using the SCANITEM_COMP_RTN
comparison routine for the latter. It then simply compares each scanitem
record against the next one in the array. Generally speaking the author and
compacted item title have to match before we do any comparison. However, one
exception is if we have a reprint that comes from one of "our" files
but doesn't appear to be specified therein. Specifically:
- If the author, compacted title, publication details or item types differ
between the two (and neither of the latter is "ex"); and
- The new item is a reprint (i.e. edition is '2' which will always sort after
the original appearance); and
- We have at least a year and month for the date of original publication;
and
- The original publication was in one of "our" magazines (i.e. the
magazine ID is in magbuf); and
- It isn't an anonymous cover; and
- The validation level for the new item is "0" or the item type
is a "major" type (as discussed above).
In this case we output a diagnostic saying the original appearance is missing.
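The conditions for the "original appearance is missing" diagnostic can be combined into a single predicate, sketched below in Python. This is a loose reading of the list above, not the program's code: the Item fields, the anon_cover flag and the function name original_missing are all hypothetical, and how the real records hold the original publication date and magazine ID is not specified here.

```python
from dataclasses import dataclass
from typing import Optional

MAJOR_TYPES = {"vi", "ss", "nv", "na", "sl", "pm", "ts", "n."}


@dataclass
class Item:
    author: str
    title_key: str              # compacted title
    pub_details: str
    item_type: str
    edition: str                # '2' marks a reprint
    orig_year: Optional[int]    # original publication date, if known
    orig_month: Optional[int]
    orig_mag_id: str            # magazine of the original appearance
    anon_cover: bool            # hypothetical flag for an anonymous cover
    level: int                  # validation level for this item


def original_missing(prev, new, our_mags):
    """True if 'new' is a reprint whose original should have been listed."""
    differs = (
        (prev.author, prev.title_key, prev.pub_details, prev.item_type)
        != (new.author, new.title_key, new.pub_details, new.item_type)
    )
    return (
        differs
        and "ex" not in (prev.item_type, new.item_type)
        and new.edition == "2"
        and new.orig_year is not None
        and new.orig_month is not None
        and new.orig_mag_id in our_mags      # stands in for the magbuf lookup
        and not new.anon_cover
        and (new.level == 0 or new.item_type in MAJOR_TYPES)
    )
```

In the program this test would apply to each adjacent pair in the sorted array; here our_mags is simply a set of magazine IDs.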
Otherwise, assuming the author and compacted title match, we do the following
checks. Note that, to minimise diagnostics, these are in a long If/Then/Else
chain so that if one mismatch is found then the subsequent checks are omitted (even
if we choose not to report the mismatch). The checks, in order, are then as
follows:
- If only one record is a dummy series entry (i.e. ends in '|') we do nothing;
- Else we check if the publication details are the same but the real authors
are different, but only report an error if val_full is set;
- Else if the real authors are different we do nothing;
- Else we check if the full titles are the same, but only report an error
if val_full is set;
- Else we check to see if the serial maximum differs for two entries in the
same file (saving the title if so to avoid multiple identical errors);
- Else, if we have a serial maximum, we check to see if the item types, title
additional and series name are the same (while this partially duplicates checks
later, we'll be bypassing them for serial parts so we need to do them here);
- Else if the serial parts differ and/or are unknown then we do nothing;
- Else we check to see if the item types are the same (ignoring differences
if one or the other is "ex", "iw" or "br" or
if the title is generic), but only report an error if val_full is set;
- Else we check to see if the publication details differ, except that:
  - we ignore dummy series entries;
  - we ignore generic titles;
  - we ignore minor item types (iw,as,av,bg,bi,br,cl,cn,cs,ct,ed,fp,fr,gp,gr,hd,hu,ia,il,in,is,iv,ix,lc,lr,mr,ms,ob,pi,pr,pt,pz,qa,qz,rc,rv,th or ??);
  - we only report an error if val_most is set;
- Else we check if the title additionals are the same, but only report an
error if val_full is set;
- Else, if we're doing dummy series entries, we do nothing else;
- Else we check to see if the series names match;
- Else we check to see if the original titles match;
- Else we check to see if the bylines match for items with the same publication
details;
- Else we check to see if the co-authors match for items with the same publication
details;
- Else we check to see if the secondary names match;
- Else we check to see if the appearance notes match;
- Else we check to see if the ED notes match (allowing for differences caused
by an appended magazine ID), but only report an error if val_full is
set.
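The "first mismatch wins" behaviour of the checks above can be illustrated with a much-reduced Python sketch. It covers only a handful of the checks and leaves out the serial, item type and exception handling. The Rec fields, the function name compare and the diagnostic strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rec:
    title: str
    real_author: str
    pub_details: str
    series: str = ""


def is_dummy(r):
    """A dummy series entry is one whose title ends in '|'."""
    return r.title.endswith("|")


def compare(a, b, val_full=True, val_most=True) -> Optional[str]:
    """Return one diagnostic, or None; later checks are skipped once a
    branch is taken, even when that branch reports nothing."""
    if is_dummy(a) != is_dummy(b):
        return None                     # only one is a dummy: do nothing
    elif a.pub_details == b.pub_details and a.real_author != b.real_author:
        return "real authors differ" if val_full else None
    elif a.real_author != b.real_author:
        return None
    elif a.title != b.title:
        return "full titles differ" if val_full else None
    # (serial and item type checks omitted from this sketch)
    elif a.pub_details != b.pub_details:
        return "publication details differ" if val_most else None
    elif a.series != b.series:
        return "series names differ"
    return None
```

Note that when val_most is not set, a publication detail mismatch is silently swallowed and the series names are never compared, which is the behaviour the note about omitted checks describes.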