Current Processes for Maintaining Index(es)

1. Create Text File containing new ToCs
2. Convert Text File to Data File
3. Validate the Data File
4. Revalidate the Control Files
5. Check for Prior Appearances
6. Complete processing of the data file
7. Handle knock-on impact of changes
8. Regenerate ATTRIB.TXT and ALL.TXT
9. Update COVERS.CVT
10. Regenerate the Big List
11. Regenerate & Upload the Indexes

This uses the "Quarterly Update" as an example as this is a superset of the
processes involved in adding other ToCs to the index(es). Note that programs
and files flagged ** are ones that were only used by Phil and hence are less
well documented (and probably less portable) than the others.

The source list for the Big List (ARCHIVE.TXT**) contains an entry for each
magazine. The magazines included in the quarterly update have a #NOTE record
saying "Current: Checked" which includes the date it was last checked, the latest
cover scan that has been retrieved, the latest entry in FT_LINKS.CVT, and possibly
other notes. (Note that there are also some flagged "Current: Checked: TODO"
which are under consideration but have not yet been added to the quarterly update.)

The quarterly update then basically goes through each such entry, checks the
website (and/or Amazon) to see if there are any new issues and, if so, updates
the entry accordingly and adds a skeleton ToC for each such issue to a text
file. Currently this is in a format similar to that described in the user
documentation with some exceptions and additions:

Clearly, exactly the same process is used to create the text file from ToCs
supplied by other users and/or taken from magazines or magazine (ToC) scans.
Note that the holding file I use for FMI ToCs (fmtocs.txt**) contains a series
of notes at the beginning that help convert ToCs supplied by users into a standard
format.

Note that, when doing the quarterly update, a cover scan for each new issue
(where relevant) should also be downloaded and stored with all the other new/updated
cover scans for the current period (see below).

The text file is converted to the internal data file format via a program
called MagParse** – a C++ program which prompts for the input file name,
defaulting to the previous file used. In general I just reuse the same file
(temp.txt) over and over again.

MagParse primarily tries to convert the text file format into the internal
format used by all the main programs. It is fairly fussy and will report errors
if it doesn’t understand something. "Some" of these are reported via dialog
boxes but the best place to check is the error file the program creates. This
is named xxx.err where the input text file was named xxx.txt (and is in the
same folder as the text file).

One particular error that MagParse traps is the accidental inclusion
of any 8-bit characters (such as á or –). The programs all work on 7-bit characters
(partly because the 8-bit mappings in Bill’s computer were not standard Windows
mappings, though I was never quite sure why, and partly because 8-bit characters
sometimes get corrupted in e-mail exchanges).
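By way of illustration, the core of such a check is tiny – something like the
following sketch. This is not MagParse’s actual code, and the report format is
invented for the example (the input name temp.txt is the reusable file
mentioned above):

    #include <fstream>
    #include <iostream>
    #include <string>

    // Report any byte outside the 7-bit ASCII range, with its line number,
    // so the offending characters can be fixed before conversion.
    int main() {
        std::ifstream in("temp.txt", std::ios::binary);
        std::string line;
        for (int lineNo = 1; std::getline(in, line); ++lineNo)
            for (unsigned char c : line)
                if (c > 127)
                    std::cout << "Line " << lineNo
                              << ": 8-bit character (code " << int(c) << ")\n";
    }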
Note that the data file is created as xxx.mag in the same folder. To avoid
accidental over-writing of the data file (as has happened before) the program
checks for the existence of such a file when it starts and will refuse to overwrite
it. As such, if you want to fix errors that have been reported and then rerun
MagParse you must first manually delete xxx.mag.
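Again, purely to illustrate the idea of the guard (the actual MagParse code may
well differ; this sketch assumes C++17’s std::filesystem, and temp.mag follows
the xxx.txt to xxx.mag convention described above):

    #include <filesystem>
    #include <iostream>

    // Refuse to run if the output data file already exists, forcing the
    // user to delete it deliberately rather than lose it by accident.
    int main() {
        const std::filesystem::path out = "temp.mag";  // xxx.mag for xxx.txt
        if (std::filesystem::exists(out)) {
            std::cerr << out << " already exists - delete it and rerun.\n";
            return 1;
        }
        // ... normal conversion would proceed here and create the file ...
    }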
Anything MagParse doesn’t understand at all it puts in an EB
record, so this is a useful way to include post-processing notes (e.g. for
things MagParse doesn’t handle). The first thing to do after running
MagParse is therefore to check all instances of EB in case it failed to convert
something that it should have converted, and/or to fix up any
post-processing notes.

Validate attempts to validate a data file against the various formatting
and capitalisation rules. The degree of validation performed can be controlled
by means of the Validation
Control Flags but it’s desirable to do a full validation on all new ToCs
added to the database so that the "validation level" of the database as a whole
gradually improves.

The errors reported by Validate generally fall into three categories:

I tend to handle these in three passes. On the first pass I delete all the
diagnostics in categories 1 & 2 and fix all the remaining (or add them to
the exceptions file VALIDATE.XXX)
– note that this pass includes resolving any ambiguous names in the file.
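The general pattern of an exceptions file is easy to sketch, though this is
only an illustration and not Validate’s actual implementation – here the
diagnostics are assumed to be one per line in a hypothetical file validate.err,
suppressed when they appear verbatim in VALIDATE.XXX:

    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    // Echo only those diagnostics that are not listed, verbatim, in the
    // exceptions file of known-acceptable cases.
    int main() {
        std::set<std::string> exceptions;
        std::ifstream exc("VALIDATE.XXX");
        for (std::string line; std::getline(exc, line); )
            exceptions.insert(line);

        std::ifstream errs("validate.err");   // hypothetical diagnostics file
        for (std::string line; std::getline(errs, line); )
            if (!exceptions.count(line))
                std::cout << line << '\n';
    }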
Once the first pass is complete, I rerun Validate, delete the diagnostics
in the first category and handle all the undefined names – this is often the
most time-consuming part of the process. My approach is to add these records
to the latest version of ATTRIB.TXT (see below) with a distinguishing flag (^^^)
and then check and resolve each flagged item.

Typically the names will fall into one of four categories:

I generally leave resolution of the first category of diagnostic (adjustments)
until after the next two steps as some of the items might be reprints
which might, of course, affect the date range.

As part of the previous step, it is almost inevitable that changes will have
been made to PSEUD.CVT and possibly ABBREV.CVT and/or SERIES.CVT and errors
might have crept in. ValNames is a simple program that validates the
contents of the three control files, looking for obvious errors.

It is likely that some of the items added will either be reprints of something
already in the database or, when indexing older magazines, will be earlier appearances
of items already in the database. The programs require that all instances of
a given item specify the same "first appearance data" so any such conflicts
need to be resolved (this was originally a requirement of Bill’s programs that
has been carried over to the new programs but might be worth reconsidering at
some point).

There is no "official" way of doing this, but I use a file called STORY.IDX**
which contains the earliest known appearance of all "significant" items in the
database, and a program called Mrg_Sty** which tries to merge the new
data with the existing data, flagging up any discrepancies. It also flags all
new additions to STORY.IDX (with a ^^^ suffix) so you can manually check them
for subtle variations that Mrg_Sty can’t spot (e.g. UK/US spellings).
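The general merge idea can be sketched as follows, under the assumption – not
documented anywhere – that STORY.IDX maps an item key (shown here as an
invented "Title|Author" form) to its earliest known appearance:

    #include <iostream>
    #include <map>
    #include <string>

    // Merge new first-appearance data into the existing index: conflicts
    // are reported for manual resolution, and genuinely new items are
    // flagged with a ^^^ suffix so they can be checked by hand later.
    void merge(std::map<std::string, std::string>& index,
               const std::map<std::string, std::string>& incoming) {
        for (const auto& [key, appearance] : incoming) {
            auto it = index.find(key);
            if (it == index.end())
                index[key] = appearance + " ^^^";          // new addition
            else if (it->second != appearance)
                std::cout << "Conflict for " << key << ": "
                          << it->second << " vs " << appearance << '\n';
        }
    }

    int main() {
        std::map<std::string, std::string> index =
            {{"Some Story|Doe, Jane", "Magazine A, Jan 1930"}};
        merge(index, {{"Some Story|Doe, Jane", "Magazine B, Dec 1929"},
                      {"Another Story|Roe, Richard", "Magazine C, Mar 1931"}});
    }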
Mrg_Sty is an old program, and I’m not entirely sure I know how it works
any more, but it seems to do the job and there’s always something more important
to work on.

By now we should have updated all known prior appearances so we can run Validate
again to see what date adjustments are needed in PSEUD.CVT. Once these have
been made, the contents of the (new) data file need to be moved to the associated
data files. Note that this might involve creating new data files (if new magazines
have been indexed) and it is important to ensure that the new files are also
added to the appropriate Index Definition
File.

The changes made as part of adding the new ToCs will probably have a knock-on
impact on the existing data, in one of three ways.

Firstly, any new disambiguations will require changes to existing references
to the associated names, so the first step is to run Validate on the
entire database. With judicious use of the Validation Control Flags and VALIDATE.XXX
it should be possible to maintain the database in a state where Validate
produces no errors at all. As such, any errors thrown up at this point will
purely be a result of the new data and can easily be resolved.

Secondly, the new data may have identified some earlier appearances of existing
items. Some of these will have become apparent in step 5 above, but there may
well be additional instances. My approach is to use Mrg_Sty to merge
each group of files in turn and then use ALL.TXT** (see below) to identify which
files are affected.

The third area is somewhat more complex and relates to cross-validation within
and between the files. One aspect of this is to check that a repeated item (e.g.
a column or series) has the same characteristics in all instances. Another looks
for instances of an item under one name being reprinted under a different name.
To address these I have a program called Xvalidate**.

Xvalidate is simultaneously very simple and very complex. It is very
simple because the guts of it are shared with IdxGen (and GenAttrib);
it is very complex because it tries to do some very complicated validation and,
to be honest, there are times I don’t quite understand what’s going on! However,
as with Validate, if the database is held in a state where Xvalidate
produces no errors then rerunning it after every change throws up any problems
introduced by the change and they can then be easily resolved.

Note that it can be startling quite how many errors Xvalidate throws
up so I tend to run it on each group of files in turn before running it on the
whole (magazine) database.

One special type of error that Xvalidate throws up is when one file
references an item that was published in another of the magazines in the database,
but the file for that magazine does not include the item in question, possibly
because the relevant issue hasn’t been indexed. In these cases we generate
a skeleton entry so that we can catch any discrepancies if the issue is subsequently
indexed. There is a small program, CvtSkel**, which converts the original
item line into a skeleton entry of the required format.

While not (yet) formally documented as control/support files, ATTRIB.TXT and
ALL.TXT were files originally generated by Bill (I think as offshoots of his
index generation programs) which proved so useful that, when Bill could no longer
supply them, I wrote my own program GenAttrib** to generate them. FWIW,
GenAttrib was written by generalising the core code used by Xvalidate
and that core code then formed the basis for IdxGen.

ATTRIB.TXT contains an entry for every name in the database (or for whichever
part of the database you run the program on) indicating the date range of their
(original) appearances and summarising the types of entry (e.g. fiction, poems,
etc.). If the name has an entry in PSEUD.CVT then it also includes the main
data from PSEUD.CVT. This is used, as mentioned above, when adding new entries
to PSEUD.CVT as it shows all names in use, rather than just those in PSEUD.CVT,
and gives an idea of when they were active.
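To illustrate the sort of aggregation involved – the record layout here is
invented for the example, not the real internal format – a sketch that
summarises, per name, the date range and entry-type counts:

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // One (original) appearance of a name: invented stand-in layout.
    struct Appearance { std::string name; int year; std::string type; };

    int main() {
        std::vector<Appearance> db = {
            {"Doe, Jane", 1931, "fiction"},
            {"Doe, Jane", 1947, "poem"},
            {"Roe, Richard", 1940, "fiction"}};

        struct Summary { int first = 9999, last = 0;
                         std::map<std::string, int> types; };
        std::map<std::string, Summary> summaries;
        for (const auto& a : db) {
            auto& s = summaries[a.name];
            s.first = std::min(s.first, a.year);
            s.last  = std::max(s.last, a.year);
            ++s.types[a.type];
        }
        // One line per name: date range plus a count of each entry type.
        for (const auto& [name, s] : summaries) {
            std::cout << name << " (" << s.first << "-" << s.last << ")";
            for (const auto& [type, n] : s.types)
                std::cout << "  " << type << ": " << n;
            std::cout << '\n';
        }
    }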
ALL.TXT is basically a flat file version of the whole database (or whichever
part of the database you run the program on). It is useful when trying to disambiguate
an author as it lists all the different appearances (apart from artwork, which
is more of a challenge). It’s also useful for finding all appearances of a given
item when trying to update prior appearance data.

Although not directly part of the indexing process, another job I have is the
maintenance of COVERS.CVT. Currently
all images are held on my website although the file allows for multiple sources
of images (adding new locations would currently require program changes to IdxGen).

The first step is obviously gathering new/updated images. For the Quarterly
Update this is simply part of the process, but in parallel with this images
come from a wide variety of sources – scans of new listings on eBay, direct
contributions from others, fully-scanned magazines posted on pulpscans or
similar. As these are acquired, the filenames need to be normalised, the images
adjusted to remove any skewing and shrunk to a standard size (400px
wide), and (if they are replacements) checked to confirm they are an improvement on
the existing images.

Periodically there is then a need to update COVERS.CVT (and elsewhere, see
below) with any new cover scans. To assist in this I use a program called CvtCovers**
which reads a list of file names, compares it against the existing COVERS.CVT
and attempts to add new entries for any new images by parsing the file name
and attempting to deduce the corresponding issue abbreviation.

This is usually at least 90% successful but there are some file names the program
cannot (yet) parse successfully so the diagnostic file needs to be checked for
any errors. There are also unavoidable ambiguities (e.g. when two magazines
with the same name exist at the same time) and, at times, confusion about whether
a date or issue number should be used, so all new entries are flagged (with
a trailing ^^^) and these need to be checked at some point.
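For flavour, here is a sketch of the kind of file-name parsing involved. The
naming scheme shown (title_YYYY-MM.jpg) is an assumption for the example; the
real CvtCovers logic is undocumented and handles far more variations:

    #include <iostream>
    #include <regex>
    #include <string>

    // Try to split a cover-scan file name such as "astounding_1939-07.jpg"
    // into a magazine title and an issue date; anything that does not
    // match goes to a diagnostic list for manual attention.
    int main() {
        const std::regex pattern(R"(([a-z_]+)_(\d{4})-(\d{2})\.jpg)");
        for (std::string name : {"astounding_1939-07.jpg", "oddly named.jpg"}) {
            std::smatch m;
            if (std::regex_match(name, m, pattern))
                std::cout << m[1] << " -> " << m[2] << '/' << m[3] << '\n';
            else
                std::cout << "Cannot parse: " << name << '\n';
        }
    }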
CvtCovers asks if you want to keep any changes made, so at this point
I tend to say "No" and focus on addressing any "fixable" errors that are reported.
The next step is to add any new images to the relevant Illustrated Checklist
and/or to ARCHIVE.TXT – I have a macro in my text editor that assists with the
former but this is still a very time-consuming process! It is also likely that,
during this process, some mistakes can be found in the file names so these need
to be corrected.

Once you’re happy that all the files are named correctly you can run CvtCovers
again and say you do want to keep the changes, and then check COVERS.CVT
to see if any of the conversions are wrong and to fix up any entries that the
program was unable to convert. In a small number of cases when an image has
been obtained that is not needed in COVERS.CVT they can be added to UNMATCHED_COVERS.TXT**. It is also necessary to create two thumbnails for each new or improved image
(one 100px tall; the other 150px tall) – I use a piece of shareware called ThumbsUp for this
– and then to upload everything to the appropriate website.

As a final "belts-and-braces" exercise, I also run ChkCovers** which
basically compares all the magazine images in the folder structure against the
FM database via COVERS.CVT and UNMATCHED_COVERS.TXT – among other useful things
this does, it is also useful for identifying cases where the translation in
COVERS.CVT is incorrect (e.g. using a month instead of an issue number or vice
versa).
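The comparison itself can be sketched as below, assuming C++17 std::filesystem
and a plain list of known image names distilled from COVERS.CVT and
UNMATCHED_COVERS.TXT (the file known_images.txt is hypothetical, and the real
file formats are more involved):

    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    // Walk the image folder tree and report any cover scan that is not
    // accounted for in the known-images list (which could indicate a bad
    // COVERS.CVT translation or a missing entry).
    int main() {
        std::set<std::string> known;
        std::ifstream list("known_images.txt");   // hypothetical distilled list
        for (std::string line; std::getline(list, line); )
            known.insert(line);

        namespace fs = std::filesystem;
        for (const auto& entry : fs::recursive_directory_iterator("covers")) {
            if (!entry.is_regular_file()) continue;
            const std::string name = entry.path().filename().string();
            if (!known.count(name))
                std::cout << "Unmatched image: " << name << '\n';
        }
    }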
website" which, these days, includes a lot of different things including: The Big List is generated from ARCHIVE.TXT (and ABBREVIATIONS.TXT**) by means
of a pair of programs. MagPop** reads, parses and validates the input
file(s) and creates an Access 97 database called MAGS.MDB. MagGen** then
reads this database and generates all the relevant HTML files as well as a new
copy of ARCHIVE.TXT which should be used to replace the working copy (as it
is somewhat enhanced/tidied up). (One of the many "projects for a rainy day" is to rewrite these two to remove
(One of the many "projects for a rainy day" is to rewrite these two to remove
the need for the intermediate database. It shouldn’t be too hard – the only
tricky bit is that the database implicitly does additional validation by rejecting
duplicate records with the same key and this would need to be replicated by
manual checks.)
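That duplicate-key check is easy enough to sketch. Here the key is simply
assumed to be the first field of a tilde-separated record, which is an
assumption for the example rather than the actual ARCHIVE.TXT layout:

    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    // Reject any record whose key duplicates one already seen - the
    // manual equivalent of the Access database's unique-key constraint.
    int main() {
        std::vector<std::string> records = {
            "AMZ~Amazing Stories", "ASF~Astounding", "AMZ~Amazing Stories"};
        std::set<std::string> seen;
        for (const auto& rec : records) {
            const std::string key = rec.substr(0, rec.find('~'));
            if (!seen.insert(key).second)
                std::cout << "Duplicate key: " << key << '\n';
        }
    }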
Note that The Big List and The Fictionmags Index Family are relatively independent
(necessary because I maintained one and Bill maintained the other) but are linked.
The Big List links to the Indexes as follows:

    http://www.philsp.com/homeville/FMI/link.asp?magid=xxx

where "xxx" is the abbreviation in ABBREV.CVT (defined via the ABBREV keyword
in ARCHIVE.TXT). If the abbreviation can’t be matched (e.g. because of a typo in ARCHIVE.TXT)
the link just goes to the front page of the index.

The reverse link is a bit trickier as the FM database doesn’t contain any mention
of which magazines are or are not defined in ARCHIVE.TXT. Instead, MagGen
generates a special file called ZZMAGIDS.TXT which contains a series of entries
of the form:

    xxx~yyyy

where xxx is the abbreviation in ABBREV.CVT as above and yyyy is the name specified
on the appropriate MAGID header in ABBREV.CVT. IdxGen then reads this file
and, for each (group) header, checks if the abbreviation is listed in
ZZMAGIDS.TXT and, if so, generates a link of the form:

    http://www.philsp.com/links2.asp?magid=yyyy

which works in much the same way as above.
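A sketch of that lookup, assuming the simple one-entry-per-line xxx~yyyy layout
described above (the abbreviation "AMZ" is a hypothetical example):

    #include <fstream>
    #include <iostream>
    #include <map>
    #include <string>

    // Load ZZMAGIDS.TXT (lines of the form xxx~yyyy) and, given a magazine
    // abbreviation, emit the reverse link to The Big List if one is defined.
    int main() {
        std::map<std::string, std::string> magids;
        std::ifstream in("ZZMAGIDS.TXT");
        for (std::string line; std::getline(in, line); ) {
            const auto sep = line.find('~');
            if (sep != std::string::npos)
                magids[line.substr(0, sep)] = line.substr(sep + 1);
        }
        const std::string abbrev = "AMZ";   // hypothetical group header
        if (auto it = magids.find(abbrev); it != magids.end())
            std::cout << "http://www.philsp.com/links2.asp?magid="
                      << it->second << '\n';
    }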
This allows the two sets of files to be compiled independently of each other,
but pragmatically it is best not to release a new version of The Big
List without an associated version of the Index Family or there might be explicit
index links (e.g. for newly indexed magazines) that don’t work. If the two are
being generated "at the same time" then The Big List should be generated first
so that IdxGen has an up-to-date version of ZZMAGIDS.TXT to work from.

In theory the current schedule is:

though, in reality, the monthly schedule for the indexes is more of an aspiration
than a reality.

Regenerating the Indexes is simply a question of running IdxGen on each
of the Index
Configuration Files in turn and checking the logs. Note that IdxGen
is deliberately intolerant of serious errors (e.g. it will exit if it encounters
a series name that is not defined in SERIES.CVT) so it is essential to do a
full Validate on the magazine database before running it.

It may also report some less serious errors (e.g. mismatched quote characters)
and it’s up to you whether to rerun the program after you fix these or leave it
until next time.

When uploading a new version of The Big List I tend just to upload the new
files on top of the old files as the upload is fairly quick, the changes relatively
minor from release to release and the traffic fairly light.

For the Indexes, though, I use a process that Bill pioneered:

One key point to remember after regenerating the indexes is to update the LASTUPDATE
field in the Index Configuration Files.