IdxGen - Generate a set of Index Files

Introduction

IdxGen generates a set of Index Files based on the contents of an Index Configuration File. This file discusses some of the general background issues:

Basic Index Structure
Calculating Bookmarks and Page Breaks
Boilerplate Files
Sorted Files
Data File Naming Convention
External Sort Command File
Configuration Data Structure
Other Global Data
Aggregation
Regression Testing

The individual routines are discussed in a separate file.

Basic Index Structure

While the individual indexes look quite different from one another, they share an overriding structure, although not every index uses every part. The core level is what might be called the Index 1 Level, which contains a list of all the "things" being indexed (e.g. Artist Names, Magazine Names, Story Titles, etc.).

In most cases there is a level below this, which we could call the Listings Level, which lists all the items for each "thing" (e.g. issues for each Magazine, items by each author, items in each series, etc.). Note that this level is not used for the Story and Book Title Indexes as these do not have an expansion. Conversely, for Magazine Issues and for Book Authors there is a further level below, which we could call the Contents Level, which lists the contents of each magazine issue or book.

Above the Index 1 Level there may be up to two higher, hierarchical levels which provide index ranges into the lower-level indexes. The intention is that these should be dynamic, based upon the number of items in the Index 1 Level. As each level will contain up to MAXPAGESIZE entries of the next level, we can see that with a modest page size of, say, 300 lines:

Each Index 1 Level would contain (up to) 300 items
Each Index 2 Level would handle up to 300x300 = 90,000 items
Each Index 3 Level would handle up to 300x300x300 = 27,000,000 items
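
The arithmetic above can be sketched in C as follows. This is only an illustration: index_levels_needed is a hypothetical helper, not a routine in IdxGen, and it assumes every page at every level is filled to the same page size.

```c
/* Hypothetical helper (not part of IdxGen): how many index levels are
 * needed to hold `items` entries when each page holds at most
 * `page_size` entries of the level below.
 * Returns 1, 2 or 3, or 0 if three levels are not enough. */
static int index_levels_needed(long long items, long long page_size)
{
    long long capacity = page_size;     /* capacity of one Index 1 page */
    int level;

    if (items < 0 || page_size < 2)
        return 0;
    for (level = 1; level <= 3; level++) {
        if (items <= capacity)
            return level;
        if (level < 3)
            capacity *= page_size;      /* each extra level multiplies capacity */
    }
    return 0;
}
```

With a page size of 300, an index of 2,000,000 items needs all three levels, and anything above 27,000,000 items would not fit.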

Currently (June 2021) the largest index (the Story Title Index) has about 2,000,000 items in it so this should do for the foreseeable future. The table below summarises the situation as of June 2021. Note that one complication is that the top level of the index is always a single page with a fixed name, so a small index will only have the Index 3 Level; a medium-sized one will have Index 3 and Index 1 Levels and large indexes would have all three.

Note also that, while the formats of Listings and Contents Levels vary quite widely from index to index, the lowest Index level will always be simply a name or a continuation line and all the levels above that will be a range of names.

	Index 3/Top	Index 2	Index 1	Listings	Contents	Notes
Artists	A01	BBBnnnn	BBnnnn	Bnnnn
Biographical Notes	A02	CCCnnnn	CCnnnn	Cnnnn
Book Authors	A03	DDDnnnn	DDnnnn	Dnnnn	Ennnn
Book Titles	A04	FFnnnn	Fnnnn			This index has no listings level
Chronological	A05	HHnnnn	Hnnnn	Innnn
Magazine Issues	A06	JJJnnnn	JJnnnn	Jnnnn	Knnnn
Series	A07	LLLnnnn	LLnnnn	Lnnnn
Story Authors	A08	MMnnnn	Mnnnn	Nnnnn
Story Titles	A09	Onnnn	Pnnnn			This index has no listings level
Full Text Links	A10	QQQnnnn	QQnnnn	Qnnnn
Names	A11	Rnnn	Snnn			This index has no listings level
Defined by	topfilnam_ptr	idxlv2pre_ptr	idxlv1pre_ptr	lstingspre_ptr	contentspre_ptr

The creation and the structuring of the index levels is all handled by WRITE_INDEX_LINE which depends on a set of static tables as listed in the "Defined by" line above. These are accessed by the index type which is held in the curr_index_type field in the Configuration Data structure. There are also four other static tables similarly referenced:

idxtyp_ptr - Name of the index to be used on links from the index pages
idxpgttl_ptr - Page Title to be used on the index pages
idxuseital - Set to PSP_TRUE if the index pages should put the names in italics rather than in bold
idxlinecnt - This is set up during the "set up" phase to contain the number of lines in the Level 1 index (and is used to determine the number of levels).

Using WRITE_INDEX_LINE

Using the routine for normal indexes is very straightforward:

In the SETUP module, set idxlinecnt[IDX_xxx] to the total number of lines in the index.
When writing a line to the index, set config_ptr->formatted_name to the name as you want it to appear in the Level 1 index and config_ptr->curr_name to the name you want to appear in the higher level indexes (this might be the same or might be a simplified form)
Call WRITE_INDEX_LINE specifying IDXLIN_NORMAL for a normal line or IDXLIN_CONT for a continuation line
When writing the trailers to the various files, call WRITE_INDEX_LINE specifying IDXLIN_TRAIL

Note that the code generated for the Level 1 index is of the form:

<LI><A HREF="xnn.htm#Annn">formatted name</LI> (for IDXLIN_NORMAL)
<LI>________: <A HREF="xnn.htm#Annn">continuation text</LI> (for IDXLIN_CONT)
<LI>special text</LI> (for others)

It is up to the caller to ensure that formatted_name and continuation text contain "</A>" at the appropriate point to terminate the hyperlink - this allows just the first part of the text to be hyperlinked if so desired.
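
As an illustration of the three line shapes, here is a minimal sketch of a formatter. It is not the code of WRITE_INDEX_LINE: the function name, its parameters, the numeric values of the line types and the IDXLIN_OTHER stand-in are all assumptions made for the example.

```c
#include <stdio.h>

/* Line types; the names IDXLIN_NORMAL and IDXLIN_CONT come from the text,
 * the values and IDXLIN_OTHER are assumptions for this sketch. */
enum idxlin_type { IDXLIN_NORMAL, IDXLIN_CONT, IDXLIN_OTHER };

/* Hypothetical formatter for one Level 1 index line.
 * `target` is the link target (page file name plus "#" and anchor).
 * `text` must already contain "</A>" where the hyperlink should end;
 * for IDXLIN_OTHER the target is ignored and no link is generated. */
static int format_level1_line(char *out, size_t outsize, enum idxlin_type type,
                              const char *target, const char *text)
{
    switch (type) {
    case IDXLIN_NORMAL:
        return snprintf(out, outsize, "<LI><A HREF=\"%s\">%s</LI>",
                        target, text);
    case IDXLIN_CONT:
        return snprintf(out, outsize, "<LI>________: <A HREF=\"%s\">%s</LI>",
                        target, text);
    default:
        return snprintf(out, outsize, "<LI>%s</LI>", text);
    }
}
```

Because the caller supplies the "</A>", passing a text such as "Smith, John</A> (artist)" would hyperlink only the name.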

Calculating Bookmarks and Page Breaks

The indexes are linked together by hyperlinks, which requires the creation of unique anchors for anything that might be the target of a link. This presents a minor challenge as we need to know what the anchors in Index 1 are when creating Index 2, but also what the anchors in Index 2 are when creating Index 1. Handling this always requires two passes through the indexes: one to set up the anchors and the other to implement them. There are various ways of doing this, but the approach IdxGen uses is to have one set of modules (SETUP_xxx_IDX) whose primary purpose is to calculate the anchors and a second set (BUILD_xxx_IDX) which generates each index in turn. A key part of the first pass is also to decide where to insert page breaks in the index so that pages do not get too large; depending on the index, these may or may not fall in the middle of a section, as discussed below.

There are two approaches to handling the anchors in the two passes. Each requires the anchor to be stored in the first pass, but the second pass can then take one of two approaches:

Look up the anchor in the table, throwing a new page when the anchor says one is needed;
Calculate the anchors and page breaks again and periodically check they match those stored in the table.

The first of these is potentially the easiest, but the second provides a belts-and-braces sanity check - i.e., if the anchors/page numbers don't match then the code in either the SETUP or BUILD module (or both) is incorrect. For this reason the program currently adopts the second approach, although it can take quite a bit of time to track down an anomaly if the program detects the two don't match.
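
A minimal sketch of the second approach is given below. The names, the table size and the fixed-entries-per-page rule are all hypothetical and much simpler than the real SETUP and BUILD modules; the point is only that the second pass recomputes each page number and compares it with the value stored by the first pass.

```c
#include <stdio.h>

#define MAXENTRIES 1000                 /* hypothetical table size */

static int stored_page[MAXENTRIES];     /* filled by the first (setup) pass */

/* Hypothetical page rule used in this sketch:
 * a fixed number of entries per page, pages numbered from 1. */
static int page_for(int entry, int page_size)
{
    return entry / page_size + 1;
}

/* First pass: record the page on which each entry will appear. */
static void setup_pass(int count, int page_size)
{
    int i;

    for (i = 0; i < count && i < MAXENTRIES; i++)
        stored_page[i] = page_for(i, page_size);
}

/* Second pass: recompute each page and compare it with the stored value.
 * Returns the number of mismatches; 0 means the two passes agree. */
static int build_pass(int count, int page_size)
{
    int i, mismatches = 0;

    for (i = 0; i < count && i < MAXENTRIES; i++) {
        if (page_for(i, page_size) != stored_page[i]) {
            fprintf(stderr, "entry %d: page mismatch\n", i);
            mismatches++;
        }
    }
    return mismatches;
}
```

If the two passes ever apply different rules (simulated here by calling them with different page sizes), the mismatch count is non-zero, which is the signal that one of the modules is wrong.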

The anchors are either stored in an in-memory table or added to the sorted files and the page breaks vary from index to index, as follows:

Book Author Index (SETUP_BOKAUT_IDX): This is more complex than most indexes as it has to handle both the listings level and the contents level:
- A line is counted in the contents level for each line in the consolidated books file and a new page is thrown in the contents level if there is an 'A' record (i.e. start of a new book) and the count > MINPAGESIZE. At the moment the assumption is that an overlong page is better than breaking a book contents listing across multiple pages so no check is made on MAXPAGESIZE.
- A line is counted at the listings level for each 'A' record and a new page is thrown if the author name has changed and the count > MINPAGESIZE, or the count > MAXPAGESIZE; in the latter case a continuation record is needed.
- An anchor is created at the listings level for each new author in the index and is stored internally
- An anchor is created at the contents level for each book and is stored internally in the Issue Link Table
Book Title Index (SETUP_BOKTTL_IDX):
- The routine ignores anything that isn't a book or is by a secondary author or where the author is the subject.
Magazine Issues Index (SETUP_ISSUE_IDX): Like the Book Author Index this also has to handle both the listings level and the contents level:
- A line is counted in the contents level for each line in the consolidated magazine file (except those in an Additional Entries File) and a new page is thrown in the contents level if there is an 'A' record (i.e. start of a new issue) which does not immediately follow a header record, and the count > MINPAGESIZE. Assuming the minimum and maximum page sizes are sensible there should be no need for a continuation page here so the code simply checks to see if the count > MAXPAGESIZE and outputs a diagnostic if so.
- A line is counted at the listings level for each 'A' record and a new page is thrown if we have a new main features record and the count > MINPAGESIZE, or the count > MAXPAGESIZE; in the latter case a continuation record is needed.
- An anchor is created at the contents level for each issue and for each features record and is stored internally in the Issue Link Table
- Currently no anchors are created at the listings level.
Story Title Index (SETUP_STYTTL_IDX):
- The routine ignores all items of types hd, mg, il, fp, pt, cv or ct, as well as any subject records
- Otherwise a line is counted whenever the author or title changes and a new page is thrown as soon as the count > MINPAGESIZE.
- An anchor is created for every distinct author/title pair and is stored in the styttl_ptr field in the associated SCANITEM record.
Story Author Index (SETUP_STYAUT_IDX):
- The routine ignores all items of types hd, mg, il, fp, pt, cv or ct, as well as any secondary author records (e.g. subjects, editors, etc.)
- Otherwise a line is counted for each input record
- A new page is thrown when the item changes and either the author changes and the count > MINPAGESIZE, or the count > MAXPAGESIZE; in the latter case a continuation record is needed.
- An anchor is created for each author and is stored internally in the Names Link Table
Artist Index (SETUP_ARTIST_IDX):
- The routine reads the sorted artist file (IdxGen.art) and counts a line for each input record.
- A new page is thrown when either the artist changes and the count > MINPAGESIZE, or the count > MAXPAGESIZE; in the latter case a continuation record is needed.
- An anchor is created for each author and is stored internally in the Names Link Table
Chronological Index (SETUP_CHRON_IDX):
- The routine creates and reads the sorted Chronological Index file (IdxGen.crn)
- It ignores all items of types hd, mg, il, fp, pt, cv or ct, as well as any subject records, but otherwise a line is counted for each input record
- A new page is thrown when either the author changes and the count > MINPAGESIZE, or the count > MAXPAGESIZE; in the latter case a continuation record is needed.
- An anchor is created for each author and is stored internally in the Names Link Table
Series Index (SETUP_SERIES_IDX):
- The routine creates and reads the sorted Series Index file (IdxGen.ser)
- A new page is thrown when the series or series type changes and the count > MINPAGESIZE. Assuming the minimum and maximum page sizes are sensible there should be no need for a continuation page here so the code simply checks to see if the count > MAXPAGESIZE and outputs a diagnostic if so.
- An anchor is created for each series and is stored internally in the Series Link Table
Biographical Notes (SETUP_BIOG_NOTES):
- The routine reads the contents of the Names Link Table and counts 3 lines for every author which is in the current index and has some biographical notes.
- A new page is thrown whenever the count > MINPAGESIZE
- An anchor is created for each author and is stored internally in the Names Link Table
Full Text Index:
- The routine reads the Full Text file (IdxGen.ftx) and counts a line for each input record.
- A new page is thrown whenever the magazine name changes and the count > MINPAGESIZE, or the count > MAXPAGESIZE; in the latter case a continuation record is needed.
- An anchor is created for each magazine name but is only used internally.
- Note that there is no SETUP module for this index - the idxlinecnt is set up as part of creating the Full Text file in BUILD_ISSUE_IDX
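
The page-break rule shared by several of the indexes above (a new page when the name changes and the count > MINPAGESIZE, or a continuation when the count > MAXPAGESIZE) can be sketched as follows. The enum and function names are hypothetical, and the individual SETUP modules apply variations of this test rather than a single shared routine.

```c
/* Possible outcomes of the page-break test (hypothetical names). */
enum page_break { PB_NONE, PB_NEW_PAGE, PB_CONTINUATION };

/* Decide whether to throw a new page before the next line.
 * key_changed - non-zero if the author/artist/magazine name has changed
 * line_count  - lines counted on the current page so far */
static enum page_break page_break_needed(int key_changed, int line_count,
                                         int minpagesize, int maxpagesize)
{
    if (key_changed && line_count > minpagesize)
        return PB_NEW_PAGE;         /* clean break between sections */
    if (line_count > maxpagesize)
        return PB_CONTINUATION;     /* mid-section break: continuation record needed */
    return PB_NONE;
}
```

With the default sizes of 200 and 1000, a change of name after 201 lines starts a new page, while a single name running past 1000 lines is split with a continuation record.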

Boilerplate Files

Boilerplate files were first used in the programs that generate the GCP Website and provide a means of having standard, potentially fairly complex, HTML page layouts that can easily be changed without requiring programmatic changes. The principle is very straightforward: the file is a standard HTML file that can be created with any HTML editor and which contains a number of special flags of the form <!x>, <!x+> and <!x-> where x is some special code known to the program (often just a single letter).

A line that starts with <!x> indicates that the whole line should be replaced with whatever x represents to the program while <!x+> and <!x-> delimit an entire section that should be omitted under certain conditions (e.g. the "Previous Page" link on the first page).
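
A minimal sketch of how a line might be classified is shown below. It is not the program's own parser: the names are hypothetical, and it assumes that the code x is a single letter or digit and that the flag is at the very start of the line.

```c
#include <ctype.h>

/* Kinds of boilerplate line (hypothetical names). */
enum bp_flag { BP_NONE, BP_REPLACE, BP_BEGIN, BP_END };

/* Classify one line of a boilerplate file.
 * On a match, *code receives the single-character code x.
 * Ordinary HTML, including comments such as <!-- ... -->, gives BP_NONE. */
static enum bp_flag classify_boilerplate_line(const char *line, char *code)
{
    if (line[0] != '<' || line[1] != '!' ||
        !isalnum((unsigned char)line[2]))
        return BP_NONE;
    if (line[3] == '>') {
        *code = line[2];
        return BP_REPLACE;          /* <!x>  : replace the whole line */
    }
    if (line[3] == '+' && line[4] == '>') {
        *code = line[2];
        return BP_BEGIN;            /* <!x+> : start of a conditional section */
    }
    if (line[3] == '-' && line[4] == '>') {
        *code = line[2];
        return BP_END;              /* <!x-> : end of a conditional section */
    }
    return BP_NONE;
}
```

A caller would copy BP_NONE lines through unchanged, substitute BP_REPLACE lines, and skip everything between a BP_BEGIN and its BP_END when the section does not apply.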

There are three boilerplate files used by IdxGen:

index_toc.htm - The Table of Contents
index_hdr.htm - The Page Header
index_trl.htm - The Page Trailer

A consistent set of substitution variables is used across the three files, even though no single file uses them all:

1 - The link to the Introduction Text (if any)
3 - The link to the Missing Issues Page (if any)
4 - The link to the Titles Not Included Page (if any)
5 - The link to the In the Next Update Page (if any)
A - A subfolder string ("a/") to be inserted if subfolders are being used.
B - A folder redirection string ("../") to be inserted if subfolders are being used.
C - The names of the creator(s) of this index
D - The current date in the form MMM DD, CCYY
E - The names of the editor(s) of this index
F - The background colour for this page (e.g. FFFFE0)
H - The link to and name of the top page for this Index (if required)
I - The official name of the Index
N - The filename for the next page on this level (if any)
P - The filename for the previous page on this level (if any)
T - The page title to be used on the current page
U - The location to insert any additional index-specific lines in the Table of Contents
X - The link to the Index by Publisher (if any)
Y - The current year in the form CCYY
Z - The links to the Book Indexes (only generated if any books were found in the index)

Sorted Files

While much of the data needed by the program resides in structures in memory, it has proved impossible to hold everything in memory, as discussed below. A number of sorted data files are therefore created and used by the program, as follows:

IdxGen.tm1 - This file is created by SETUP_FILES to contain all the records from the book files, prefixed by a special sort header.
IdxGen.tm2 - This is a version of IdxGen.tm1 sorted by SETUP_FILES into the order needed for the Book Author Index.
IdxGen.bks - This is identical to IdxGen.tm2 but has the sort header stripped off (also by SETUP_FILES) and is a primary input to SETUP_BOKAUT_IDX
IdxGen.mgs - This file is created by READ_MAG_FILES (called from SETUP_FILES) and contains all the records from the magazine files, in the order required by the index, with additional fields added to the 'A' records specifying any cover scans, full text and/or about links. Note that the latter two occupy the same field, as about links are only used on features records and full text links on non-features records. It is a primary input to SETUP_ISSUE_IDX
IdxGen.tmp - This file is created by SETUP_BOKAUT_IDX & SETUP_ISSUE_IDX via the SCAN_RECORD routine.
IdxGen.tm3 - This is a version of IdxGen.tmp sorted by SETUP_STYTTL_IDX into the order needed for the Story/Book Title Indexes.
IdxGen.tm4 - This file is created by SETUP_STYTTL_IDX as a copy of IdxGen.tm3 but with Story Title Index anchors added for each Story Title.
IdxGen.tm5 - This is a version of IdxGen.tm4 sorted by SETUP_STYAUT_IDX into the order needed for the Story Author Index.
IdxGen.tm6 - This is a version of IdxGen.tm5 with additional records added for the first appearance of reprinted items.
IdxGen.Aut - This is a version of IdxGen.tm6 sorted by SETUP_STYAUT_IDX into the order needed for the Story Author Index and is the primary input to BUILD_STYAUT_IDX.
IdxGen.tm7 - This file is created by SETUP_STYAUT_IDX as a copy of IdxGen.Aut but with Story Author Index anchors added for each Story Title and is in Story Title Index order.
IdxGen.Ttl - This is a version of IdxGen.tm7 sorted by SETUP_STYAUT_IDX into the order needed for the Story Title Index and is the primary input to BUILD_STYTTL_IDX.
IdxGen.tm8 - This file is created by SETUP_CHRON_IDX and is a version of IdxGen.Ttl containing only those records needed for the Chronological Index, rewritten in Chronological Index order.
IdxGen.Crn - This is a version of IdxGen.tm8 sorted by SETUP_CHRON_IDX into the order needed for the Chronological Index and is the primary input to the main part of BUILD_CHRON_IDX.
IdxGen.tm9 - This file is created by SETUP_SERIES_IDX and is a version of IdxGen.Ttl containing only those records needed for the Series Index, rewritten to ensure all identical items are grouped together.
IdxGen.Ser - This is a version of IdxGen.tm9 sorted by SETUP_SERIES_IDX into the order needed for the Series Index and is the primary input to the main part of BUILD_SERIES_IDX.
IdxGen.ftx - This file is created by BUILD_ISSUE_IDX and contains details of all the Full-Text Links for this index; it is read by BUILD_FULLTEXT_IDX to create the Full Text Index.

Data File Naming Convention

The program supports the input of a mixture of magazine data files, book data files and additional reference files. At the moment, at least, these files need to be distinct (i.e. you can't have a single file with both magazine and book data, other than where the book is being listed as part of a magazine) and need to follow a strict naming convention:

magazine data files should be named either xxxxx.mag or mags.xxx where xxx/xxxxx is the (case-insensitive) magazine abbreviation in ABBREV.CVT. The translation of this abbreviation is used to determine the order in which magazines will be displayed for those indexes where the Index Configuration File indicates the magazines should be sorted by name.
additional reference files should be named 00xxx.mag (where, by convention, the xxx is the mnemonic of the index, although the program doesn't check this). These will be used to generate entries in the Story Author and Story Title indexes, but nowhere else.
all other input files are assumed to be book data files.
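
The naming rules above can be sketched as a small classifier. This is an illustration only: the function and enum names are hypothetical, the name passed in is assumed to be a bare file name with no folder part, and treating the ".mag" extension and "mags." prefix as case-insensitive is an assumption (the text only says the abbreviation is case-insensitive).

```c
#include <ctype.h>
#include <string.h>

enum file_kind { FILE_MAGAZINE, FILE_REFERENCE, FILE_BOOK };

/* Case-insensitive comparison of the first n characters of two strings. */
static int same_prefix_nocase(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
        if (a[i] == '\0')
            return 1;
    }
    return 1;
}

/* Classify a data file by name alone (hypothetical helper). */
static enum file_kind classify_data_file(const char *name)
{
    size_t len = strlen(name);
    int has_mag_ext = len > 4 &&
                      same_prefix_nocase(name + len - 4, ".mag", 4);

    if (has_mag_ext && name[0] == '0' && name[1] == '0')
        return FILE_REFERENCE;      /* 00xxx.mag */
    if (has_mag_ext)
        return FILE_MAGAZINE;       /* xxxxx.mag */
    if (len > 5 && same_prefix_nocase(name, "mags.", 5))
        return FILE_MAGAZINE;       /* mags.xxx */
    return FILE_BOOK;               /* everything else */
}
```

Note that the 00xxx.mag test has to come before the general .mag test, as an additional reference file would otherwise be taken for a magazine file.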

External Sort Command File

The data in the indexes is too large for in-memory sorts with the 32-bit compiler, and a brief experiment with the 64-bit compiler took 8 hours for a single full sort of the data (as opposed to 5 minutes for an external sort program). So, at various points, IdxGen calls an external command file (specified in the Index Configuration File) to sort one file into another. As the sort order varies from instance to instance, the command file is also passed a parameter indicating the type of sort that is required. Thus the formatted command that is executed might be:

   sortfil BOOKDATA IdxGen.tm1 IdxGen.Bks

where IdxGen.tm1 is the input file and IdxGen.Bks is the output file. The possible command parameters are:

BOOKDATA: Sort the file created by READ_BOOK_FILE - this is a straight sort into increasing order
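
As a sketch of how such a command line might be assembled before being executed, consider the following. The function name and parameter names are hypothetical and this is not necessarily how IdxGen builds the command; it simply formats the four parts in the order shown above and reports an error if the buffer is too small.

```c
#include <stdio.h>

/* Hypothetical helper: format "<command file> <sort type> <input> <output>".
 * Returns 0 on success, -1 if the command does not fit in `out`. */
static int build_sort_command(char *out, size_t outsize,
                              const char *sortfile, const char *sort_type,
                              const char *infile, const char *outfile)
{
    int n = snprintf(out, outsize, "%s %s %s %s",
                     sortfile, sort_type, infile, outfile);

    return (n >= 0 && (size_t)n < outsize) ? 0 : -1;
}
```

The resulting string could then be passed to the standard system() function from <stdlib.h>, whose return value should be checked before the output file is read.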

The current tool used for sorting is CMSort where the key switches are:

/V=$1F indicates treat the file as a CSV file separated by %x1f characters
/SV=n,1,0 means sort the nth field (1st field =1) from the first character to the end of the field

So, for instance, a sort key could be:

    /V=$1F /SV=1,1,0 /SV=2,1,0 /SV=3,1,0 /SV=4,1,0 /SV=5,1,0 /SV=6,1,0 /SV=7,1,0 /SV=11,1,0 /SV=8,1,0 /SV=9,1,0 /SV=10,1,0

However, experimentation shows that sorting with a key like this takes 4-5 times longer than a straight sort so, instead, the files are created with the fields in a different order depending on the way we want to sort them, as described in the documentation of SETUP_SCANITEM.
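
To illustrate the idea, the sketch below joins a set of fields with the 0x1F separator in whatever order the caller supplies, so that the most significant sort field can simply be written first and a straight sort of the whole line gives the required order. The function name and the example field order are hypothetical; the real record layouts are those described under SETUP_SCANITEM.

```c
#include <stddef.h>

#define FIELD_SEP '\x1f'    /* field separator used in the sorted files */

/* Hypothetical helper: join `count` fields with 0x1F separators.
 * Returns 0 on success, -1 if the record does not fit in `out`. */
static int join_fields(char *out, size_t outsize,
                       const char *fields[], int count)
{
    size_t used = 0;
    int i;

    if (outsize == 0)
        return -1;
    for (i = 0; i < count; i++) {
        const char *p = fields[i];

        if (i > 0) {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = FIELD_SEP;
        }
        while (*p != '\0') {
            if (used + 1 >= outsize)
                return -1;
            out[used++] = *p++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

Writing the same fields as { title, author, date } for one file and { author, title, date } for another gives two files that each sort correctly without any per-field sort keys.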

Configuration Data Structure

To allow the code to be broken into smaller sections without huge parameter lists, the bulk of the data that the program uses is held in a structure called config_data which is passed as an argument to most routines. It contains the following groups of fields:

Settings from the Index Configuration File:

	char	idxnam[128];			/* Name of Index */
	char	editor[128];			/* Name of Editor(s) */
	char	idxdir[128];			/* Folder to generate indexes into */
	char	boiler[128];			/* Folder containing boilerplates */
	char	sortfile[128];			/* Name of the file containing the sort commands */
	char	ctrlfile[128];			/* Name of the file defining the files to include in the index */
	char	ablkfile[128];			/* Name of the file containing the about links for the index */
	char	intrfile[128];			/* Name of the file containing the Introduction */
	char	missfile[128];			/* Name of the file containing a list of missing issues */
	char	omitfile[128];			/* Name of the file containing a list of magazines deliberately omitted from the index */
	char	nxtufile[128];			/* Name of the file containing a list of items for the next update */
	char	toctext[128];			/* Name of the file containing additional text to insert in Table of Contents */
	char	lastupdate[10];			/* Date of last update */
	int	subfolders;			/* Set to PSP_TRUE if files should be generated in sub-folders */
	int	pubindex;			/* Set to PSP_TRUE if we want a "by Publisher" index */
	int	sortnames;			/* Set to PSP_TRUE if the magazine files should be sorted by name */
	int	fullimages;			/* Set to PSP_TRUE if cover scans are to be displayed full-size */
	int	minpagesize;			/* The minimum number of lines to display on a page (default 200) */
	int	maxpagesize;			/* The maximum number of lines to display on a page (default 1000) */
	int	permlinks;			/* Set to PSP_TRUE if permanent links should be output */
	int	report_diags;			/* Set to PSP_TRUE if extended diagnostics should be output */
	char	special1[128];			/* First internal special flag */
	char	special2[128];			/* Second internal special flag */
	char	special3[128];			/* Third internal special flag */
	char	special4[128];			/* Fourth internal special flag */
	char	special5[128];			/* Fifth internal special flag */
	char	special6[128];			/* Sixth internal special flag */
	char	special7[128];			/* Seventh internal special flag */
	char	special8[128];			/* Eighth internal special flag */
	char	special9[128];			/* Ninth internal special flag */

The scandata structure used by SCAN_FILE:

	struct	scandata *scandata_ptr;		/* Pointer to scandata structure */

The lists of magazine files (in a sub-structure so that they can be sorted by magazine sort name) and the consolidated magazine file name:

	struct magfile {
		char	*magnam_ptr;		/* Pointers to Magazine Names */
		char    filnam[MAXFILENAME];	/* Magazine File Name */
	};
	struct  magfile magfiles[MAXMAGFILES];	/* Magazine File & Sort Names */
	int	magfile_cnt;			/* Count of Magazine Files */
	char	magfilnam[128];			/* Magazine File Name */

The name of the consolidated book file and a flag to indicate that we have some books:

	int	got_books;			/* Flag to say we have some books */
	char	bookfilnam[128];		/* Book File Name */

A number of fields related to the index type currently being built:

	char	curr_index_type;		/* Current index type */
	FILE	*topfil_ptr;			/* Top-Level Index Output File Pointer */
	FILE	*midfil_ptr;			/* Middle-Level Index Output File Pointer */
	int	midpage_count;			/* Page and line counts for the middle-level index */
	int	midline_count;
	int	lstpage_count;			/* Page, line & anchor counts for the listings */
	int	lstline_count;
	int	lstanchor_count;
	char	first_name[1024];		/* First name for top-level index */
	char	last_name[1024];		/* Last name for top-level index */
	char	curr_name[1024];		/* Current name */
	char	formatted_name[4096];		/* formatted version of same for index headings (can be huge for house names) */

And some miscellaneous (useful) data:

	char	uplink[4];			/* Set to "../" if we have subfolders; to "" otherwise */
	char	toplink[6];			/* Set to "../a/" if we have subfolders; to "" otherwise */
	char	tmpdir[256];			/* Folder to use for temporary files */
	FILE	*namfil_ptr;			/* Names Link File */
	FILE	*dmpfil_ptr;			/* Diagnostic Dump File (if needed) */
	FILE	*csvfil_ptr;			/* CSV file for Names Link Database (if needed) */

Note that, as we store key information related to the current index type in the structure, it is critical that the index types are processed sequentially, not concurrently (see BUILD_FULLTEXT_IDX below).
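
As a small illustration of a routine working on the structure through a pointer, the sketch below sets the two link prefixes from the subfolders flag, following the comments on uplink and toplink above. It uses a cut-down, hypothetical stand-in for config_data containing only the three fields involved; the function name is also hypothetical.

```c
#include <string.h>

/* Cut-down, hypothetical stand-in for config_data (three fields only). */
struct link_fields {
    int  subfolders;    /* non-zero if files are generated in sub-folders */
    char uplink[4];     /* "../" or "" */
    char toplink[6];    /* "../a/" or "" */
};

/* Set the link prefixes according to the subfolders flag. */
static void set_link_prefixes(struct link_fields *config_ptr)
{
    if (config_ptr->subfolders) {
        strcpy(config_ptr->uplink, "../");      /* 3 chars + NUL fits uplink[4] */
        strcpy(config_ptr->toplink, "../a/");   /* 5 chars + NUL fits toplink[6] */
    } else {
        config_ptr->uplink[0] = '\0';
        config_ptr->toplink[0] = '\0';
    }
}
```

The field sizes in the structure are exactly large enough for these two strings, so any change to the folder names would also need the array sizes to be revisited.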

Other Global Data

We also need to expose some of the data globally so that we can sort them efficiently via qsort. This includes:

The Issue Link Table containing the list of magazine issue IDs and book IDs and the associated links and a list of subscripts into that array:

static	char	*isslink_det_ptr[MAXISSUES];	/* Issue Details Table */
static	char	*isslink_exp_ptr[MAXISSUES];	/* Expanded Details Table (also used for cover scan links in READ_MAG_FILES) */
static	char	*isslink_txt_ptr[MAXISSUES];	/* Full Text/About Link Table (in READ_MAG_FILES) */
						/* Book Title Link (elsewhere) */
static	char	*isslink_pre_ptr[MAXISSUES];	/* Issue Link Page/Anchor Prefix Table */
static	int	isslink_pageno[MAXISSUES];
static	int	isslink_anchor[MAXISSUES];	/* Issue Link Page/Anchor Table */
static	char	isslink_edition[MAXISSUES];	/* and edition */
static	int	isslink_idx[MAXISSUES];		/* and Indexes into Table(s) */
static	int	isslink_cnt;			/* Count of Issue Links */

This table is set up by SETUP_BOKAUT_IDX and SETUP_ISSUE_IDX and sorted into order by the latter. The three fields isslink_pre_ptr, isslink_pageno & isslink_anchor define a link to the associated book in the Book Contents List or to the associated magazine issue in the Magazine Contents List. isslink_edition contains the edition number (for books) and isslink_idx is used so that we can sort the table efficiently.
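
The reason for the separate array of subscripts, and for the data being global, can be seen in the following sketch. It uses small hypothetical stand-ins (demo_det_ptr, demo_idx) rather than the real tables: qsort only moves the integers in the subscript array, while the comparator reaches the strings through the file-scope table, since a qsort comparator receives no extra context argument.

```c
#include <stdlib.h>
#include <string.h>

#define MAXDEMO 8                           /* hypothetical table size */

static const char *demo_det_ptr[MAXDEMO];   /* stand-in for isslink_det_ptr */
static int demo_idx[MAXDEMO];               /* stand-in for isslink_idx */
static int demo_cnt;                        /* stand-in for isslink_cnt */

/* qsort comparator: compares the strings that two subscripts refer to. */
static int compare_by_details(const void *a, const void *b)
{
    int ia = *(const int *)a;
    int ib = *(const int *)b;

    return strcmp(demo_det_ptr[ia], demo_det_ptr[ib]);
}

/* Sort the subscript array; the parallel data arrays are left untouched. */
static void sort_demo_index(void)
{
    int i;

    for (i = 0; i < demo_cnt; i++)
        demo_idx[i] = i;
    qsort(demo_idx, (size_t)demo_cnt, sizeof demo_idx[0], compare_by_details);
}
```

After the sort, demo_idx[0] is the subscript of the entry that sorts first, and the same subscript can be used to look up the matching entry in each of the parallel arrays.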

isslink_det_ptr is one of three formats:

␢␢␢␢10SWC+^^2255925 for feature records, as discussed under SETUP_ISSUE_IDX
$000054 for books that do not have a formal book abbreviation
- In this case isslink_exp_ptr contains details of the book title, author/editor, publisher etc.
1924WSMOct11 or 1925*GSSWrld for magazine issues and for books with a formal book abbreviation
- Note that for magazine issues this always contains the "old format" of the details - where relevant the "new format" is stored in the isslink_exp_ptr field.

isslink_txt_ptr is only used for magazine issues where the associated file had VALNOABB specified, implying that the magazine name changes from issue to issue and, in that case, stores the actual name of the magazine (nouvp.mag in the WFI is a classic example).

The Names Link Table containing the list of authors in the Story Author, Book Author, Artist, Chronological & Biographical Notes Indexes:

static	char	*nameslink_nrmaut_ptr [MAXNAMES]; /* Normalised author names */
static	char	*nameslink_auth_ptr [MAXNAMES];	/* Standard author names */
static	char	nameslink_namtyp[MAXNAMES];	/* Name Type Table: Bitmap indicating which indexes the name appears in: */
						/* 1=Story Author; 2=Artist; 4=Book Author */
static	int	nameslink_pseudsub[MAXNAMES];	/* Index into PSEUD.CVT */
static	int	nameslink_stypag[MAXNAMES];
static	int	nameslink_styanc[MAXNAMES];
static	int	nameslink_stylin[MAXNAMES];	/* Story Author Index Page/Anchor/Line Table */
static	int	nameslink_artpag[MAXNAMES];
static	int	nameslink_artanc[MAXNAMES];	/* Artist Index Page/Anchor Table */
static	int	nameslink_bokpag[MAXNAMES];
static	int	nameslink_bokanc[MAXNAMES];	/* Book Author Index page/anchor table */
static	int	nameslink_crnpag[MAXNAMES];
static	int	nameslink_crnanc[MAXNAMES];	/* Chronological Index Page/Anchor Table */
static	int	nameslink_biopag[MAXNAMES];
static	int	nameslink_bioanc[MAXNAMES];
static	int	nameslink_biotyp[MAXNAMES];	/* Biographical Notes page/anchor/type table */
static	int	nameslink_idx[MAXNAMES];	/* and indexes into table */
static	int	nameslink_cnt;			/* and count of Links */
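
The nameslink_namtyp bitmap can be read and updated with ordinary bitwise operations, as in the sketch below. The bit values (1, 2 and 4) are those given in the comment above; the macro and function names are hypothetical.

```c
/* Bit values from the comment on nameslink_namtyp; names are hypothetical. */
#define NAMTYP_STORY_AUTHOR 1
#define NAMTYP_ARTIST       2
#define NAMTYP_BOOK_AUTHOR  4

/* Record that a name appears in the index identified by `flag`. */
static char add_name_type(char namtyp, int flag)
{
    return (char)(namtyp | flag);
}

/* Return non-zero if the name appears in the index identified by `flag`. */
static int name_in_index(char namtyp, int flag)
{
    return (namtyp & flag) != 0;
}
```

A name that is both a story author and a book author therefore has the value 5 in the table.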

Similar information for the book titles in the Book Title Index:

static	int	bokttllink_recnum[MAXBOOKS];	/* Record number in Books File */
static	int	bokttllink_pageno[MAXBOOKS];
static	int	bokttllink_anchor[MAXBOOKS];	/* Book Title page/anchor table */
static	char	*bokttllink_pubdet[MAXBOOKS];	/* Abbreviated publication details */
static	int	bokttllink_idx[MAXBOOKS];	/* and indexes into table */
static	int	bokttllink_cnt;			/* and count of Links */

Similar information for the series names in the Series Index:

static	char	*serieslink_nam_ptr [MAXSERIES]; /* Series Names Table */
static	int	serieslink_pageno[MAXSERIES];
static	int	serieslink_anchor[MAXSERIES];	/* Series Page/Anchor Table */
static	int	serieslink_idx[MAXSERIES];	/* and indexes into table */
static	int	serieslink_cnt;			/* and count of Links */

The array of column headers that should be suppressed when encountered:

static	char	*colhdr_arr[MAXCOLHDRS]; 	/* Array of column headers to be suppressed */
static	int	colhdr_cnt;			/* and count thereof */

Key information (mainly static) related to the different index types:

static	char*	topfilnam_ptr[];		/* File names for the top-level index for each index type */
static	char*	idxlv2pre_ptr[];		/* Prefix letter(s) for the level 2 index for each index type */
static	char*	idxlv1pre_ptr[];		/* Prefix letter(s) for the level 1 index for each index type */
static	char*	lstingspre_ptr[];		/* Prefix letter(s) for the listings level for each index type ("" if none) */
static	char*	contentspre_ptr[];		/* Prefix letter(s) for the contents level for each index type ("" if none) */
static	char*	idxtyp_ptr[];			/* Index Type Name to be used when linking to top-level index for each index type */
static	char*	idxpgttl_ptr[];			/* Page Title to be used in Index Headers for each index type */
static	int	idxemphchr[];			/* Type of emphasis to be used in index levels: 'B' = Bold; 'I' = Italics; ' ' = None */
static	int	idxlinecnt[];			/* Number of lines in the bottom-level index */
static	int	idxsinglvl[];			/* Set to true if this index is wholly contained in the top-level index page */
static	char*	idxbgcolor_ptr[];		/* RGB colour value to set as background colour for this index */

Lastly, some odd counts for the Statistics File and for Terminal Diagnostics:

static	int	author_cnt=0;			/* Number of authors in the index */
static	int	artist_cnt=0;			/* Number of artists in the index */
static	int	magtitles_cnt=0;		/* Number of magazine titles in the index */
static	int	magissues_cnt=0;		/* Number of magazine issues in the index */
static	int	books_cnt=0;			/* Number of books in the index */
static	int	fiction_cnt=0;			/* Number of fiction items in the index */
static	int	poems_cnt=0;			/* Number of poems and plays in the index */
static	int	nonfiction_cnt=0;		/* Number of non-fiction items in the index */
static	int	covers_cnt=0;			/* Number of cover images in the index */
static	int	pages_cnt=0;			/* Number of pages in the index */
static	int	ftlinks_cnt=0;			/* Number of full-text links in the index */
static	int	biog_cnt=0;			/* Number of biographical notes in the index */

static	int	nonfatal_diags=0;		/* Number of non-fatal diagnostics (check idxgen.prt) */
static	int	max_html_files=0;		/* Maximum number of HTML files for any index type (compared to MAXHTMLFILES) */
static	int	multgrp_max=0;			/* Maximum number of different filnam/byline groups (compared to MAX_MULTGRP) */
static	int	multitm_max=0;			/* Maximum number of occurrences for a single item (compared to MAX_MULTITM) */
static	int	multrep_max=0;			/* Maximum number of entries with reprint author/title details for an item (compared to MAX_MULTREP) */
static	int	multxtr_max=0;			/* Maximum number of groups with distinct bylines/original titles (compared to MAX_MULTXTR) */

static	int	do_diag=PSP_FALSE;		/* Diagnostic flag; this can be used anywhere to create a line on which a breakpoint */
						/* can be set when a particular condition occurs */

Aggregation

For simple, uncomplicated, items the intent of aggregation is simply to display the first appearance of an item, followed by any reprint information on subsequent lines (in chronological order) as in:

Justice Comes to Red Creek, (nv) Exciting Western January 1951
- Exciting Western (Canada) January 1951
- Exciting Western (UK) October 1951

In this simple case, we have entries in the sorted data for the original appearance as well as the reprints, each of which identifies the original appearance. However, there are cases where all we have is an instance defining the reprint appearance which may or may not identify the original appearance, as in (in the WFI):

Nightmare Island, (ss) Short Stories June 25, 1932
- All Star Western & Frontier Magazine December 1932

Clearly, in this case, we need to output the first line (even if the first appearance is unknown) as part of processing the second line, while in the previous case above we want to suppress the first line for the reprints as we have already displayed it. We handle this by checking the pubdet_ptr field in each scanitem structure to see if it is the same as the previous one.
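
The check just described can be sketched as follows. This is only a minimal illustration, not the actual code: scanitem_sketch is a hypothetical, cut-down stand-in for the real scanitem structure and need_first_line is an invented name. It assumes that pubdet_ptr holds the original publication details as a string (empty when unknown) and that the record for the original appearance carries the same value as its reprints.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the real scanitem structure. */
struct scanitem_sketch {
    const char *pubdet_ptr;   /* original publication details; "" if unknown (assumed) */
};

/*
 * Returns non-zero if the first-appearance line should be written before
 * the line for cur.  prev is the previous record for the same item, or
 * NULL if cur is the first record for that item.
 */
static int need_first_line(const struct scanitem_sketch *prev,
                           const struct scanitem_sketch *cur)
{
    assert(cur != NULL && cur->pubdet_ptr != NULL);
    if (prev == NULL)
        return 1;                       /* nothing displayed yet for this item */
    /* Same original publication as the previous record: already displayed. */
    return strcmp(prev->pubdet_ptr, cur->pubdet_ptr) != 0;
}
```

With this sketch, the reprints in the first example above would suppress the first line (their details match the preceding record), while a lone reprint record such as the "Nightmare Island" one would still produce it.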

There are even (obscure) cases where we have (vague) original appearance data without any reprint information, as in:

Blue-Fired Cow-Killing Crazies, (ss) Western Digest 1996

in which case we need to output the first line but not the second line even though we are dealing with a reprint record (in this case magid_ptr is empty).

However, if the item is a serial (say) then we don't want to list each part separately so we try to aggregate all the instances into a single line so that, if the above were serialised over three issues, it might appear as:

Justice Comes to Red Creek, (nv) Exciting Western Jan, Feb, Mar 1951
- Exciting Western (Canada) Jan, Feb, Mar 1951
- Exciting Western (UK) Oct, Dec 1951, Feb 1952

There are several points to note about this simple case:

To save space (as, for columns, there could be hundreds of these) the dates are abbreviated from the normal form.
Each issue for the serial (or column) needs to be hyperlinked, but the text associated with that hyperlink might contain just the month/day, the magazine name and month/day or the month/day plus year. This is handled by some clever (i.e. messy) code in FORMAT_PUBDET but also means that we need to know what year the next item is while we're processing the current one.
In this case, for reprints, we have to ignore which issue each item is reprinted from (or it would destroy the aggregation) so we need to ignore the setting of pubdet_ptr no matter whether it is specified or not.
We immediately have a problem with sort orders. To achieve the chronological sequencing we need to sort the items (via SETUP_SCANITEM) in order of dtpubl_ptr (after edition which is needed to ensure the first appearance comes before any reprints with the same date). To achieve the grouping in the example just above it would be ideal to sort all the items in a given magazine together as well, but that cannot precede the publication date or the basic sequence (the first example above) might be disrupted. This means that when we are aggregating reprints, if a series of items are reprinted in multiple magazines with overlapping dates they will be presented in date order, not magazine order - this has not yet been resolved.
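
The need to look ahead to the next item's year can be illustrated with a small sketch. This is not the code in FORMAT_PUBDET (which also has to deal with hyperlinks and magazine names); the issue_date structure and the format_issue_dates function are hypothetical names introduced purely for illustration. The year is appended only when the next item falls in a different year, or when there is no next item.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical date of one issue; day is 0 for a monthly issue. */
struct issue_date {
    int year;
    int month;   /* 1..12 */
    int day;     /* 0 = no day */
};

static const char *const month_abbr[12] = {
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
};

/* Builds an abbreviated list such as "Oct, Dec 1951, Feb 1952" in buf. */
static char *format_issue_dates(const struct issue_date *d, size_t n,
                                char *buf, size_t bufsz)
{
    size_t used = 0;
    size_t i;

    if (bufsz == 0)
        return buf;
    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        char piece[32];
        int len;
        /* Look ahead: the year is only shown when it is about to change. */
        int year_changes = (i + 1 == n) || (d[i + 1].year != d[i].year);

        assert(d[i].month >= 1 && d[i].month <= 12);
        len = snprintf(piece, sizeof piece, "%s%s",
                       i > 0 ? ", " : "", month_abbr[d[i].month - 1]);
        if (d[i].day > 0)
            len += snprintf(piece + len, sizeof piece - (size_t)len,
                            " %d", d[i].day);
        if (year_changes)
            len += snprintf(piece + len, sizeof piece - (size_t)len,
                            " %d", d[i].year);
        if (used + (size_t)len >= bufsz)
            break;                      /* out of room: stop at a whole date */
        memcpy(buf + used, piece, (size_t)len + 1);
        used += (size_t)len;
    }
    return buf;
}
```

Given the three UK issues from the example above (October 1951, December 1951 and February 1952), this sketch produces "Oct, Dec 1951, Feb 1952".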

To make life even more interesting, some items appear as a single item in their original appearance but multiple items when reprinted or vice versa (note that the first of these is particularly common if the original appearance lies outside the current index), as in (in the WFI):

The Charmed Life, (nv) All-Story Weekly September 22, 1917
- Best Stories Oct, Nov 1927

or:

Trouble Trail, (n.) Western Story Magazine Aug 28, Sep 4, Sep 11, Sep 18, Sep 25, Oct 2 1926 (as by George Owen Baxter)
- Triple Western December 1950 (as by Max Brand)
- Triple Western (Canada) December 1950 (as by Max Brand)

This latter example highlights yet another problem with aggregation, albeit one new to the v2 indexes: bylines. Although this example is fairly straightforward, a long-running column may well appear under multiple bylines (the most common known example being an editor who uses their initials for some instances). In a perfect world we would list these separately, but that would mean sorting by byline_ptr before dtpubl_ptr, which would disrupt the "normal entries". We might be able to "put aside" any with a different byline while we process the current set, or come up with an alternative (e.g. append the byline to only those issues that differ, which is fine as long as there isn't an overall byline to display as well). All this has been left as an exercise for later: for now we just (try to) ignore such differences.

Notes on Some (Past) Problem Areas

Many of the items in the database effectively contain two different pieces of publication data (details of the current publication and of the original publication), although obviously in many cases these are the same and/or the second one is missing. In many/most cases there is also a record in the database for the original publication so we don't need to worry about it, but there are many instances where this is not the case.

If the original publication was under a different byline or title then this has always been catered for in FLUSH_SORT as we would need records in different areas of the sort hierarchy. However, if this was not the case, originally the issue was postponed until WRITE_STORY_ITEM, when the code checked to see if we already had the original publication and, if not, output it. This worked fine for simple cases, but led to problems in aggregation for two reasons:

For repeated items (such as columns) for which we only had reprint information, some/all of the original appearances weren't listed.
When they were listed, it was not uncommon for the original appearances to be aggregated with the reprint appearances rather than with the other original appearances.

A brief attempt was made to fix this by creating new records in FLUSH_SORT for such records but this seemed to introduce more problems (not least a major increase in the size of the files) so a second attempt (in 2024) focussed on SETUP_STYAUT_IDX. Having sorted the data into the order required for the STYAUT (i.e. Names) index, the routine then read through the data sequentially and for each set of records for a given item checked that we had a record for the original publication and, if not, added one. This worked pretty well, but there were three immediate complications:

The original publication details might be multi-issue details (i.e. including '+' or '-'). One possibility would be to try to convert these into the equivalent "[Part 1 of m]" in the same way as XVALIDATE does but this was deemed unnecessary as the complications only occur with aggregations and such items (probably) wouldn't be aggregated if the original publication details didn't exist. Instead the code in WRITE_STORY_ITEM discussed above was left in place to handle such cases. To distinguish between cases where we did want the original publication details listed and where we didn't a new flag (done_original) was added to the scanitem structure purely for this purpose. We also don't want to create an "original publication" record if the current item type is "ex" as we don't know what it is an extract from.
This approach breaks a cardinal (unwritten) rule that all the files from IdxGen.tmp onward have exactly the same records (or defined subsets thereof). In particular these records are added after SETUP_STYTTL_IDX has set up all the anchors for the title index but before BUILD_STYTTL_IDX checks them, which is a recipe for disaster. To avoid this the same flag (done_original) is used to flag the new items and BUILD_STYTTL_IDX simply ignores them. This can't do any harm as we only added them for the benefit of the Names Index.
The aggregation code attempts to aggregate all records of the same "type" that come from the same file (by checking the filabb_ptr field in the scanitem structure) – this copes with cases where the same magazine changes abbreviations but is fundamentally the same magazine. This is a problem when creating records for the original appearance from a reprint item as we don't have the relevant data for the original printing – it will usually be the same as the magazine abbreviation but clearly this is not always the case. To address this SETUP_ISSUE_IDX creates a lookaside list of abbreviations that don't match the associated filabb_ptr and SETUP_STYAUT_IDX then checks this when adding the new record.
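
A lookaside list of this kind might be sketched as below. This is a minimal illustration under stated assumptions, not the structure actually used by SETUP_ISSUE_IDX and SETUP_STYAUT_IDX: the type and function names, the fixed capacity and the abbreviation length are all hypothetical. Only abbreviations that differ from their filabb_ptr value are recorded, and a lookup that finds nothing falls back to the abbreviation itself (the usual case described above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOOKASIDE_MAX 64    /* hypothetical capacity */
#define ABBR_LEN      16    /* hypothetical maximum length, including NUL */

struct lookaside_entry {
    char abbr[ABBR_LEN];    /* magazine abbreviation */
    char filabb[ABBR_LEN];  /* file abbreviation it actually belongs to */
};

struct lookaside {
    struct lookaside_entry entry[LOOKASIDE_MAX];
    size_t count;
};

/* Records abbr -> filabb if they differ.  Returns 0 only if the list is
 * full or a string is too long; otherwise returns 1. */
static int lookaside_note(struct lookaside *t, const char *abbr,
                          const char *filabb)
{
    size_t i;

    assert(t != NULL && abbr != NULL && filabb != NULL);
    if (strcmp(abbr, filabb) == 0)
        return 1;                       /* the usual case: nothing to record */
    for (i = 0; i < t->count; i++)
        if (strcmp(t->entry[i].abbr, abbr) == 0)
            return 1;                   /* already recorded */
    if (t->count == LOOKASIDE_MAX ||
        strlen(abbr) >= ABBR_LEN || strlen(filabb) >= ABBR_LEN)
        return 0;
    strcpy(t->entry[t->count].abbr, abbr);
    strcpy(t->entry[t->count].filabb, filabb);
    t->count++;
    return 1;
}

/* Returns the file abbreviation to use for abbr when creating a record
 * for an original appearance; defaults to abbr itself. */
static const char *lookaside_resolve(const struct lookaside *t,
                                     const char *abbr)
{
    size_t i;

    assert(t != NULL && abbr != NULL);
    for (i = 0; i < t->count; i++)
        if (strcmp(t->entry[i].abbr, abbr) == 0)
            return t->entry[i].filabb;
    return abbr;
}
```

A linear search is used here purely for brevity; it is only reasonable if the number of mismatched abbreviations is small.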

Even with the above fix there was still a problem with book reviews. Initially there was an additional test in the aggregation check to see if the next item "was either not about the specified author or had the same byline on the piece about the author". This worked fine when the only reviews of a given book were by the same author. However, if the same book is reviewed by multiple different authors, and those reviews are then reprinted elsewhere (very common in British Reprint Editions of SF magazines), this breaks, as all the original editions are sorted before all the reprint editions, so that for each review the original and the reprint are probably not adjacent and hence end up in separate groups. Both before and after the above fix this meant the same review was listed twice, separated by other reviews of the same book by other people.

This was fixed by removing the check in WRITE_STORY_ITEM that, when aggregating items for a "subject", required them all to have the same byline, and instead handling this via the group mechanism as discussed under WRITE_STORY_ITEM.

There was also a problem when an item being aggregated had previously appeared under a different title (e.g. Jon Gustafson's "A Different Perspective" column, which ran in Figment and was mainly (but not always) a reprint of his "The Gimlet Eye Returns" column in Pulphouse: The Hardback Magazine). The listing under "The Gimlet Eye Returns" worked fine as it simply listed the items that appeared under that name followed by the reprints with the different title (see the discussion of the Item Extras Array). However, it didn't work the other way round, so the listing under "A Different Perspective" alternated the original and reprint details even though the original appeared under a different name. To handle this, the Group Extras Array was extended to include the prior publication details, used only when there was an original title. This meant that a separate Group Extras item was created for each of the above instances, causing them to be listed correctly.

Regression Testing

As with all web-based applications, regression testing is a nightmare as a small change in the program that has little visual effect can have a massive impact on the HTML source with anchor names changing, page breaks moving, etc. etc.

To address this, an approach has been adopted that attempts to preserve the key information about the website while removing much of the formatting code that gets in the way. This is achieved by running the batch file d:\new\reformat_index.bat which first merges all HTML files for a particular index into a single text file and then invokes an UltraEdit macro (TidyIdxgen) to strip out the unnecessary formatting. The resultant files are:

00issues.txt - the Magazine Contents Lists (k*.htm)
00names.txt - the Index by Name (n*.htm)
00series.txt - the Index by Series/Imprint (l0*.htm)
00title.txt - the Index by Title (p*.htm)
00chron.txt - the Index by Date (i*.htm)
00books.txt - the Book Contents Lists (e*.htm)

The text files can then be compared from one version of the index to another to give a clearer idea of where there are differences.
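
The general idea of stripping formatting before comparison can be illustrated with a small sketch. This is not what the TidyIdxgen macro does (its rules are not described here) and strip_tags is a hypothetical name; it simply drops everything between '<' and '>', so that changes to anchor names and other markup no longer show up as differences.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copies in to out, omitting anything between '<' and '>'.
 * Output is truncated to fit outsz and is always NUL-terminated.
 * Returns the number of characters written (excluding the NUL).
 */
static size_t strip_tags(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    int in_tag = 0;

    assert(in != NULL && out != NULL && outsz > 0);
    for (; *in != '\0'; in++) {
        if (*in == '<')
            in_tag = 1;                 /* start of a tag: stop copying */
        else if (*in == '>' && in_tag)
            in_tag = 0;                 /* end of the tag: resume copying */
        else if (!in_tag && n + 1 < outsz)
            out[n++] = *in;
    }
    out[n] = '\0';
    return n;
}
```

For example, a line whose title is wrapped in bold and anchor tags would be reduced to just its visible text, so two versions of the index that differ only in anchor names would compare as equal on that line.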