IdxGen generates a set of Index Files based on the contents of an Index Configuration File. This file discusses some of the general background issues:
While the individual indexes look quite different one from the other, there is an over-riding structure that is common to each index, although not all indexes use all parts. The core level is what might be called the Index 1 Level which contains a list of all the "things" being indexed (e.g. Artist Names, Magazine Names, Story Titles, etc.).
In most cases there is a level below this which we could call the Listings Level which lists all the items for each "thing" (e.g. issues for each Magazine, items by each author, items in each series, etc. etc.). Note that this level is not used for the Story and Book Title Indexes as these do not have an expansion. Conversely, for Magazine Issues and for Book Authors there is a further level below which we could call the Contents Level which lists the contents of each magazine issue or book.
Above the Index 1 Level there then might be up to two higher, hierarchical, levels which provide index ranges into the lower-level indexes. The intention is that these should be dynamic based upon the number of items in the Index 1 Level. As each level will contain MAXPAGESIZE entries of the next level, we can see that with a modest page size of, say, 300 lines:
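The arithmetic behind this can be sketched in C as follows. This is only an illustration: the function names index_capacity and levels_needed are hypothetical (they are not IdxGen routines), and the sketch assumes that every page at every level holds exactly the page size, with at most three index levels.

```c
#include <assert.h>

/* Hypothetical helper: the largest number of bottom-level entries that can
 * be reached through `levels` index levels (1 to 3) when every page holds
 * at most `page_size` lines. */
long index_capacity(int levels, long page_size)
{
    long capacity = 1;

    assert(levels >= 1 && levels <= 3);
    assert(page_size > 0);
    for (int i = 0; i < levels; i++)
        capacity *= page_size;
    return capacity;
}

/* Hypothetical helper: the number of index levels (1 to 3) needed to
 * reach `item_count` entries with the given page size. */
int levels_needed(long item_count, long page_size)
{
    int levels = 1;

    while (levels < 3 && index_capacity(levels, page_size) < item_count)
        levels++;
    return levels;
}
```

With a page size of 300 this gives capacities of 300, 90,000 and 27,000,000 entries for one, two and three levels respectively.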
Currently (June 2021) the largest index (the Story Title Index) has about 2,000,000 items in it so this should do for the foreseeable future. The table below summarises the situation as of June 2021. Note that one complication is that the top level of the index is always a single page with a fixed name, so a small index will only have the Index 3 Level; a medium-sized one will have Index 3 and Index 1 Levels and large indexes would have all three.
Note also that, while the formats of Listings and Contents Levels vary quite widely from index to index, the lowest Index level will always be simply a name or a continuation line and all the levels above that will be a range of names.
Index | Index 3/Top | Index 2 | Index 1 | Listings | Contents | Notes
Artists | A01 | BBBnnnn | BBnnnn | Bnnnn | - | -
Biographical Notes | A02 | CCCnnnn | CCnnnn | Cnnnn | - | -
Book Authors | A03 | DDDnnnn | DDnnnn | Dnnnn | Ennnn | -
Book Titles | A04 | FFnnnn | Fnnnn | - | - | This index has no listings level
Chronological | A05 | HHnnnn | Hnnnn | Innnn | - | -
Magazine Issues | A06 | JJJnnnn | JJnnnn | Jnnnn | Knnnn | -
Series | A07 | LLLnnnn | LLnnnn | Lnnnn | - | -
Story Authors | A08 | MMnnnn | Mnnnn | Nnnnn | - | -
Story Titles | A09 | Onnnn | Pnnnn | - | - | This index has no listings level
Full Text Links | A10 | QQQnnnn | QQnnnn | Qnnnn | - | -
Names | A11 | Rnnn | Snnn | - | - | This index has no listings level
Defined by | topfilnam_ptr | idxlv2pre_ptr | idxlv1pre_ptr | lstingspre_ptr | contentspre_ptr | -
The creation and the structuring of the index levels is all handled by WRITE_INDEX_LINE which depends on a set of static tables as listed in the "Defined by" line above. These are accessed by the index type which is held in the curr_index_type field in the Configuration Data structure. There are also four other static tables similarly referenced:
Using the routine for normal indexes is very straightforward:
Note that the code generated for the Level 1 index is of the form:
<LI><A HREF="xnn.htm#Annn">formatted name</LI> (for IDXLIN_NORMAL)
<LI>________: <A HREF="xnn.htm#Annn">continuation text</LI> (for IDXLIN_CONT)
<LI>special text</LI> (for others)
It is up to the caller to ensure that formatted_name and continuation text contain "</A>" at the appropriate point to terminate the hyperlink - this allows just the first part of the text to be hyperlinked if so desired.
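A minimal sketch of how the three line forms might be generated is shown below. It is not the WRITE_INDEX_LINE code: the function name format_index_line, its parameters, the enum values and the name IDXLIN_SPECIAL are all assumptions made for illustration (only IDXLIN_NORMAL and IDXLIN_CONT appear in the text above). The debug check that the text contains "</A>" simply makes the caller's responsibility explicit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* IDXLIN_NORMAL and IDXLIN_CONT are named in the text; the numeric values
 * and IDXLIN_SPECIAL are assumptions for this sketch. */
enum idxlin_type { IDXLIN_NORMAL, IDXLIN_CONT, IDXLIN_SPECIAL };

/* Format one Level 1 index line into buf and return the snprintf result.
 * For the two hyperlinked forms, `text` must already contain "</A>" at
 * the point where the hyperlink should end. */
int format_index_line(char *buf, size_t size, enum idxlin_type type,
                      const char *page_file, const char *anchor,
                      const char *text)
{
    assert(buf != NULL && text != NULL);
    if (type == IDXLIN_NORMAL || type == IDXLIN_CONT) {
        assert(page_file != NULL && anchor != NULL);
        assert(strstr(text, "</A>") != NULL);  /* caller closes the link */
    }

    switch (type) {
    case IDXLIN_NORMAL:
        return snprintf(buf, size, "<LI><A HREF=\"%s#%s\">%s</LI>",
                        page_file, anchor, text);
    case IDXLIN_CONT:
        return snprintf(buf, size,
                        "<LI>________: <A HREF=\"%s#%s\">%s</LI>",
                        page_file, anchor, text);
    default:
        /* All other line types carry plain text with no hyperlink. */
        return snprintf(buf, size, "<LI>%s</LI>", text);
    }
}
```

Leaving "</A>" to the caller keeps the routine unaware of how much of the text should be clickable.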
The indexes are linked together by hyperlinks, which requires the creation of unique anchors for anything that might be the target of a link. This presents a minor challenge as we need to know what the anchors in Index 1 are when creating Index 2, but also what the anchors in Index 2 are when creating Index 1. Handling this always requires two passes through the indexes: one to set up the anchors and the other to implement them. There are various ways of doing this, but the approach IdxGen uses is to have one set of modules (SETUP_xxx_IDX) whose primary purpose is to calculate the anchors and a second set (BUILD_xxx_IDX) which generates each index in turn. A key part of the first pass is also to decide where to insert page breaks in the index so that pages do not get too large; depending on the index, these may or may not be in the middle of a section, as discussed below.
There are two approaches to handling the anchors in the two passes. Each requires the anchor to be stored in the first pass, but the second pass can then take one of two approaches:
The first of these is potentially the easiest, but the second provides a belts-and-braces sanity check - i.e., if the anchors/page numbers don't match then the code in either the SETUP or BUILD module (or both) is incorrect. For this reason the program currently adopts the second approach, although it can take quite a bit of time to track down an anomaly if the program detects the two don't match.
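The sanity check can be illustrated with the sketch below. It assumes that the second approach means recalculating the page and anchor in the build pass and comparing them with the values stored by the setup pass. All names (setup_pass, build_pass, stored_page, stored_anchor, MAX_ENTRIES) are hypothetical, as is the simplification that anchors restart at 1 on each page and that a page break falls after a fixed number of entries.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_ENTRIES 1000                /* hypothetical table size */

static int stored_page[MAX_ENTRIES];    /* page chosen by the setup pass   */
static int stored_anchor[MAX_ENTRIES];  /* anchor chosen by the setup pass */

/* Pass 1 (the SETUP_xxx_IDX role): calculate and store page/anchor,
 * starting a new page after max_page_size entries. */
void setup_pass(int count, int max_page_size)
{
    int page = 1, line = 0;

    assert(count >= 0 && count <= MAX_ENTRIES && max_page_size > 0);
    for (int i = 0; i < count; i++) {
        if (line == max_page_size) {
            page++;
            line = 0;
        }
        line++;
        stored_page[i] = page;
        stored_anchor[i] = line;
    }
}

/* Pass 2 (the BUILD_xxx_IDX role): recalculate independently and compare
 * with the stored values. Returns the number of mismatches found. */
int build_pass(int count, int max_page_size)
{
    int page = 1, line = 0, mismatches = 0;

    assert(count >= 0 && count <= MAX_ENTRIES && max_page_size > 0);
    for (int i = 0; i < count; i++) {
        if (line == max_page_size) {
            page++;
            line = 0;
        }
        line++;
        if (page != stored_page[i] || line != stored_anchor[i]) {
            fprintf(stderr, "entry %d: setup %d/%d, build %d/%d\n",
                    i, stored_page[i], stored_anchor[i], page, line);
            mismatches++;
        }
    }
    return mismatches;
}
```

The two loops are deliberately written separately: the check is only useful because the two passes calculate the positions independently, so any disagreement points to a fault in one of them.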
The anchors are either stored in an in-memory table or added to the sorted files and the page breaks vary from index to index, as follows:
Boilerplate files were first used in the programs that generate the GCP Website and provide a means of having standard, potentially fairly complex, HTML page layouts that can be easily changed without requiring programmatic changes. The principle is very straightforward - the file is a standard HTML file that can be created with any HTML editor and which contains a number of special flags of the form <!x>, <!x+> and <!x-> where x is some special code known to the program (often just a single letter).
A line that starts with <!x> indicates that the whole line should be replaced with whatever x represents to the program while <!x+> and <!x-> delimit an entire section that should be omitted under certain conditions (e.g. the "Previous Page" link on the first page).
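The flag handling described above might be sketched as follows. This is an illustration rather than the IdxGen code: the struct subst table, the function expand_line and the skip_code state variable are hypothetical, and the sketch assumes that the <!x+> and <!x-> markers each start a line of their own and that sections are not nested.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical substitution table entry for one flag code x. */
struct subst {
    char code;          /* the flag letter x                          */
    const char *value;  /* replacement text for a "<!x>" line         */
    int omit;           /* non-zero: drop the <!x+> ... <!x-> section */
};

static const struct subst *find_subst(const struct subst *tab, size_t n,
                                      char code)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].code == code)
            return &tab[i];
    return NULL;
}

/* Process one boilerplate line. *skip_code holds the code of the section
 * currently being omitted ('\0' if none). Returns 1 if a line was written
 * to out, 0 if the input line produces no output. */
int expand_line(const char *line, const struct subst *tab, size_t n,
                char *skip_code, char *out, size_t outsize)
{
    assert(line && skip_code && out && outsize > 0);

    if (strncmp(line, "<!", 2) == 0 && line[2] != '\0' && line[3] != '\0') {
        char code = line[2];
        const struct subst *s = find_subst(tab, n, code);

        if (line[3] == '+' && line[4] == '>') {      /* section start */
            if (*skip_code == '\0' && s != NULL && s->omit)
                *skip_code = code;
            return 0;
        }
        if (line[3] == '-' && line[4] == '>') {      /* section end */
            if (*skip_code == code)
                *skip_code = '\0';
            return 0;
        }
        if (line[3] == '>' && s != NULL) {           /* whole-line flag */
            if (*skip_code != '\0')
                return 0;
            snprintf(out, outsize, "%s", s->value);
            return 1;
        }
    }
    if (*skip_code != '\0')
        return 0;                                    /* inside omitted section */
    snprintf(out, outsize, "%s", line);              /* ordinary HTML line */
    return 1;
}
```

Lines that start with "<!" but do not match a known flag (such as a DOCTYPE declaration) are passed through unchanged.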
There are three boilerplate files used by IdxGen:
A consistent set of substitution variables is used across the three files, even though no single file uses them all:
While much of the data needed by the program resides in structures in memory, it has proved impossible to hold everything in memory, as discussed below. As such, a number of sorted data files are created and used by the program, as follows:
The program supports the input of a mixture of magazine data files, book data files and additional reference files. At the moment, at least, these files need to be distinct (i.e. you can't have a single file with both magazine and book data, other than where the book is being listed as part of a magazine) and need to follow a strict naming convention:
The data in the indexes is too large for in-memory sorts with the 32-bit compiler, and a brief experiment with the 64-bit compiler resulted in 8 hours for a single full sort of the data (as opposed to 5 minutes for an external sort program) so, at various points, IdxGen calls an external command file (specified in the Index Configuration File) to sort a pair of files. As the sort order varies from instance to instance the command file is also passed a parameter indicating the type of sort that is required. Thus the formatted command that is executed might be:
sortfil BOOKDATA IdxGen.tm1 IdxGen.Bks
where BOOKDATA is the parameter indicating the type of sort, IdxGen.tm1 is the input file and IdxGen.Bks is the output file. The possible command parameters are:
The current tool used for sorting is CMSort where the key switches are:
So, for instance, a sort key could be:
/V=$1F /SV=1,1,0 /SV=2,1,0 /SV=3,1,0 /SV=4,1,0 /SV=5,1,0 /SV=6,1,0 /SV=7,1,0 /SV=11,1,0 /SV=8,1,0 /SV=9,1,0 /SV=10,1,0
However, experimentation shows that sorting with a key like this takes 4-5 times longer than a straight sort so, instead, the files are created in a different order depending on the way we want to sort them, as described in the documentation of SETUP_SCANITEM.
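The formatted command shown earlier (command file, sort type, input file, output file) could be assembled along the following lines. This is a sketch only: the function name format_sort_command is hypothetical, and how IdxGen actually executes the command (for example via the C library system call) is not stated here, so that step is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<command file> <sort type> <input file> <output file>" in buf.
 * Returns 0 on success, or -1 if an argument is empty or the command
 * would not fit in the buffer. */
int format_sort_command(char *buf, size_t size, const char *sortfile,
                        const char *sort_type, const char *infile,
                        const char *outfile)
{
    int n;

    assert(buf != NULL && size > 0);
    assert(sortfile && sort_type && infile && outfile);
    if (strlen(sortfile) == 0 || strlen(sort_type) == 0)
        return -1;

    n = snprintf(buf, size, "%s %s %s %s",
                 sortfile, sort_type, infile, outfile);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Checking the snprintf return value guards against silently executing a truncated command when file names are long.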
To allow the code to be broken into smaller sections without huge parameter lists, the bulk of the data that the program uses is held in a structure called config_data which is passed as an argument to most routines. It contains the following groups of fields:
Settings from the Index Configuration File:
char idxnam[128]; /* Name of Index */
char editor[128]; /* Name of Editor(s) */
char idxdir[128]; /* Folder to generate indexes into */
char boiler[128]; /* Folder containing boilerplates */
char sortfile[128]; /* Name of the file containing the sort commands */
char ctrlfile[128]; /* Name of the file defining the files to include in the index */
char ablkfile[128]; /* Name of the file containing the about links for the index */
char intrfile[128]; /* Name of the file containing the Introduction */
char missfile[128]; /* Name of the file containing a list of missing issues */
char omitfile[128]; /* Name of the file containing a list of magazines deliberately omitted from the index */
char nxtufile[128]; /* Name of the file containing a list of items for the next update */
char toctext[128]; /* Name of the file containing additional text to insert in Table of Contents */
char lastupdate[10]; /* Date of last update */
int subfolders; /* Set to PSP_TRUE if files should be generated in sub-folders */
int pubindex; /* Set to PSP_TRUE if we want a "by Publisher" index */
int sortnames; /* Set to PSP_TRUE if the magazine files should be sorted by name */
int fullimages; /* Set to PSP_TRUE if cover scans are to be displayed full-size */
int minpagesize; /* The minimum number of lines to display on a page (default 200) */
int maxpagesize; /* The maximum number of lines to display on a page (default 1000) */
int permlinks; /* Set to PSP_TRUE if permanent links should be output */
int report_diags; /* Set to PSP_TRUE if extended diagnostics should be output */
char special1[128]; /* First internal special flag */
char special2[128]; /* Second internal special flag */
char special3[128]; /* Third internal special flag */
char special4[128]; /* Fourth internal special flag */
char special5[128]; /* Fifth internal special flag */
char special6[128]; /* Sixth internal special flag */
char special7[128]; /* Seventh internal special flag */
char special8[128]; /* Eighth internal special flag */
char special9[128]; /* Ninth internal special flag */
The scandata structure used by SCAN_FILE:
struct scandata *scandata_ptr; /* Pointer to scandata structure */
The lists of magazine files (in a sub-structure so that they can be sorted by magazine sort name) and the consolidated magazine file name:
struct magfile {
char *magnam_ptr; /* Pointers to Magazine Names */
char filnam[MAXFILENAME]; /* Magazine File Name */
};
struct magfile magfiles[MAXMAGFILES]; /* Magazine File & Sort Names */
int magfile_cnt; /* Count of Magazine Files */
char magfilnam[128]; /* Magazine File Name */
The name of the consolidated book file and a flag to indicate that we have some books:
int got_books; /* Flag to say we have some books */
char bookfilnam[128]; /* Book File Name */
A number of fields related to the index type currently being built:
char curr_index_type; /* Current index type */
FILE *topfil_ptr; /* Top-Level Index Output File Pointer */
FILE *midfil_ptr; /* Middle-Level Index Output File Pointer */
int midpage_count; /* Page and line counts for the middle-level index */
int midline_count;
int lstpage_count; /* Page, line & anchor counts for the listings */
int lstline_count;
int lstanchor_count;
char first_name[1024]; /* First name for top-level index */
char last_name[1024]; /* Last name for top-level index */
char curr_name[1024]; /* Current name */
char formatted_name[4096]; /* formatted version of same for index headings (can be huge for house names) */
And some miscellaneous (useful) data:
char uplink[4]; /* Set to "../" if we have subfolders; to "" otherwise */
char toplink[6]; /* Set to "../a/" if we have subfolders; to "" otherwise */
char tmpdir[256]; /* Folder to use for temporary files */
FILE *namfil_ptr; /* Names Link File */
FILE *dmpfil_ptr; /* Diagnostic Dump File (if needed) */
FILE *csvfil_ptr; /* CSV file for Names Link Database (if needed) */
Note that, as we store key information related to the current index type in the structure, it is critical that the index types are processed sequentially, not concurrently (see BUILD_FULLTEXT_IDX below).
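As a small illustration of the uplink and toplink fields listed above, the sketch below sets them from the subfolders flag. The struct is a cut-down stand-in for config_data containing only the three fields involved, the function name set_links is hypothetical, and the values 1 and 0 for PSP_TRUE and PSP_FALSE are assumptions (only the names appear in this document).

```c
#include <assert.h>
#include <string.h>

#define PSP_TRUE  1   /* assumed value */
#define PSP_FALSE 0   /* assumed value */

/* Cut-down stand-in for config_data: only the fields used here. */
struct config_data {
    int  subfolders;  /* PSP_TRUE if files are generated in sub-folders */
    char uplink[4];   /* "../" if we have subfolders; "" otherwise      */
    char toplink[6];  /* "../a/" if we have subfolders; "" otherwise    */
};

/* Hypothetical helper: derive the two link prefixes from the flag. */
void set_links(struct config_data *cfg)
{
    assert(cfg != NULL);
    if (cfg->subfolders == PSP_TRUE) {
        strcpy(cfg->uplink, "../");     /* 3 chars + NUL fits uplink[4]  */
        strcpy(cfg->toplink, "../a/");  /* 5 chars + NUL fits toplink[6] */
    } else {
        cfg->uplink[0] = '\0';
        cfg->toplink[0] = '\0';
    }
}
```

The array sizes in the structure are exactly large enough for these two strings, so any longer prefix would need the declarations to change as well.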
We also need to expose some of the data globally so that we can sort them efficiently via qsort. This includes:
The Issue Link Table containing the list of magazine issue IDs and book IDs and the associated links and a list of subscripts into that array:
static char *isslink_det_ptr[MAXISSUES]; /* Issue Details Table */
static char *isslink_exp_ptr[MAXISSUES]; /* Expanded Details Table (also used for cover scan links in READ_MAG_FILES) */
static char *isslink_txt_ptr[MAXISSUES]; /* Full Text/About Link Table (in READ_MAG_FILES) */
/* Book Title Link (elsewhere) */
static char *isslink_pre_ptr[MAXISSUES]; /* Issue Link Page/Anchor Prefix Table */
static int isslink_pageno[MAXISSUES];
static int isslink_anchor[MAXISSUES]; /* Issue Link Page/Anchor Table */
static char isslink_edition[MAXISSUES]; /* and edition */
static int isslink_idx[MAXISSUES]; /* and Indexes into Table(s) */
static int isslink_cnt; /* Count of Issue Links */
This table is set up by SETUP_BOKAUT_IDX and SETUP_ISSUE_IDX and sorted into order by the latter. The three fields isslink_pre_ptr, isslink_pageno & isslink_anchor define a link to the associated book in the Book Contents List or to the associated magazine issue in the Magazine Contents List. isslink_edition contains the edition number (for books) and isslink_idx is used so that we can sort the table efficiently.
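The idea of sorting via a list of subscripts can be sketched as follows. Only a reduced subset of the table is declared, MAXISSUES is given a tiny illustrative value, and the choice of isslink_det_ptr compared with strcmp as the sort key is an assumption; the function names compare_isslink and sort_isslinks are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAXISSUES 8   /* tiny illustrative limit, not the real value */

static char *isslink_det_ptr[MAXISSUES];  /* Issue Details Table        */
static int   isslink_idx[MAXISSUES];      /* subscripts into the table  */
static int   isslink_cnt;                 /* count of Issue Links       */

/* qsort callback: the arguments point at two subscripts, and the
 * comparison is made on the table entries they refer to. This is why
 * the tables need to be visible outside any one routine. */
static int compare_isslink(const void *a, const void *b)
{
    int ia = *(const int *)a;
    int ib = *(const int *)b;

    return strcmp(isslink_det_ptr[ia], isslink_det_ptr[ib]);
}

/* Sort the subscripts only; the parallel tables are left untouched. */
void sort_isslinks(void)
{
    assert(isslink_cnt >= 0 && isslink_cnt <= MAXISSUES);
    for (int i = 0; i < isslink_cnt; i++)
        isslink_idx[i] = i;
    qsort(isslink_idx, (size_t)isslink_cnt, sizeof isslink_idx[0],
          compare_isslink);
}
```

After the sort, walking isslink_idx in order visits the entries in sorted sequence, and the same subscript can be used to pick up the matching page number, anchor and edition without moving any of them.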
isslink_det_ptr is one of three formats:
isslink_txt_ptr is only used for magazine issues where the associated file had VALNOABB specified, implying that the magazine name changes from issue to issue; in that case it stores the actual name of the magazine (nouvp.mag in the WFI is a classic example).
The Names Link Table containing the list of authors in the Story Author, Book Author, Artist, Chronological & Biographical Notes Indexes:
static char *nameslink_nrmaut_ptr [MAXNAMES]; /* Normalised author names */
static char *nameslink_auth_ptr [MAXNAMES]; /* Standard author names */
static char nameslink_namtyp[MAXNAMES]; /* Name Type Table: Bitmap indicating which indexes the name appears in: */
/* 1=Story Author; 2=Artist; 4=Book Author */
static int nameslink_pseudsub[MAXNAMES]; /* Index into PSEUD.CVT */
static int nameslink_stypag[MAXNAMES];
static int nameslink_styanc[MAXNAMES];
static int nameslink_stylin[MAXNAMES]; /* Story Author Index Page/Anchor/Line Table */
static int nameslink_artpag[MAXNAMES];
static int nameslink_artanc[MAXNAMES]; /* Artist Index Page/Anchor Table */
static int nameslink_bokpag[MAXNAMES];
static int nameslink_bokanc[MAXNAMES]; /* Book Author Index page/anchor table */
static int nameslink_crnpag[MAXNAMES];
static int nameslink_crnanc[MAXNAMES]; /* Chronological Index Page/Anchor Table */
static int nameslink_biopag[MAXNAMES];
static int nameslink_bioanc[MAXNAMES];
static int nameslink_biotyp[MAXNAMES]; /* Biographical Notes page/anchor/type table */
static int nameslink_idx[MAXNAMES]; /* and indexes into table */
static int nameslink_cnt; /* and count of Links */
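The bitmap held in nameslink_namtyp can be manipulated as in the sketch below. The bit values come from the comment on that field; the macro names and the helper functions namtyp_add and namtyp_has are hypothetical and only illustrate the technique.

```c
#include <assert.h>

/* Bit values as listed for nameslink_namtyp; the macro names are
 * assumptions made for this sketch. */
#define NAMTYP_STORY_AUTHOR 1
#define NAMTYP_ARTIST       2
#define NAMTYP_BOOK_AUTHOR  4

/* Record that a name appears in one more index. */
char namtyp_add(char namtyp, int index_bit)
{
    assert(index_bit == NAMTYP_STORY_AUTHOR ||
           index_bit == NAMTYP_ARTIST ||
           index_bit == NAMTYP_BOOK_AUTHOR);
    return (char)(namtyp | index_bit);
}

/* Return non-zero if the name appears in the given index. */
int namtyp_has(char namtyp, int index_bit)
{
    return (namtyp & index_bit) != 0;
}
```

A name that is both a story author and a book author would therefore carry the value 5, and a single byte per name is enough to cover all three indexes.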
Similar information for the book titles in the Book Title Index:
static int bokttllink_recnum[MAXBOOKS]; /* Record number in Books File */
static int bokttllink_pageno[MAXBOOKS];
static int bokttllink_anchor[MAXBOOKS]; /* Book Title page/anchor table */
static char *bokttllink_pubdet[MAXBOOKS]; /* Abbreviated publication details */
static int bokttllink_idx[MAXBOOKS]; /* and indexes into table */
static int bokttllink_cnt; /* and count of Links */
Similar information for the series names in the Series Index:
static char *serieslink_nam_ptr [MAXSERIES]; /* Series Names Table */
static int serieslink_pageno[MAXSERIES];
static int serieslink_anchor[MAXSERIES]; /* Series Page/Anchor Table */
static int serieslink_idx[MAXSERIES]; /* and indexes into table */
static int serieslink_cnt; /* and count of Links */
The array of column headers that should be suppressed when encountered:
static char *colhdr_arr[MAXCOLHDRS]; /* Array of column headers to be suppressed */
static int colhdr_cnt; /* and count thereof */
Key information (mainly static) related to the different index types:
static char* topfilnam_ptr[]; /* File names for the top-level index for each index type */
static char* idxlv2pre_ptr[]; /* Prefix letter(s) for the level 2 index for each index type */
static char* idxlv1pre_ptr[]; /* Prefix letter(s) for the level 1 index for each index type */
static char* lstingspre_ptr[]; /* Prefix letter(s) for the listings level for each index type ("" if none) */
static char* contentspre_ptr[]; /* Prefix letter(s) for the contents level for each index type ("" if none) */
static char* idxtyp_ptr[]; /* Index Type Name to be used when linking to top-level index for each index type */
static char* idxpgttl_ptr[]; /* Page Title to be used in Index Headers for each index type */
static int idxemphchr[]; /* Type of emphasis to be used in index levels: 'B' = Bold; 'I' = Italics; ' ' = None */
static int idxlinecnt[]; /* Number of lines in the bottom-level index */
static int idxsinglvl[]; /* Set to true if this index is wholly contained in the top-level index page */
static char* idxbgcolor_ptr[]; /* RGB colour value to set as background colour for this index */
Lastly, some odd counts for the Statistics File and for Terminal Diagnostics:
static int author_cnt=0; /* Number of authors in the index */
static int artist_cnt=0; /* Number of artists in the index */
static int magtitles_cnt=0; /* Number of magazine titles in the index */
static int magissues_cnt=0; /* Number of magazine issues in the index */
static int books_cnt=0; /* Number of books in the index */
static int fiction_cnt=0; /* Number of fiction items in the index */
static int poems_cnt=0; /* Number of poems and plays in the index */
static int nonfiction_cnt=0; /* Number of non-fiction items in the index */
static int covers_cnt=0; /* Number of cover images in the index */
static int pages_cnt=0; /* Number of pages in the index */
static int ftlinks_cnt=0; /* Number of full-text links in the index */
static int biog_cnt=0; /* Number of biographical notes in the index */
static int nonfatal_diags=0; /* Number of non-fatal diagnostics (check idxgen.prt) */
static int max_html_files=0; /* Maximum number of HTML files for any index type (compared to MAXHTMLFILES) */
static int multgrp_max=0; /* Maximum number of different filnam/byline groups (compared to MAX_MULTGRP) */
static int multitm_max=0; /* Maximum number of occurrences for a single item (compared to MAX_MULTITM) */
static int multrep_max=0; /* Maximum number of entries with reprint author/title details for an item (compared to MAX_MULTREP) */
static int multxtr_max=0; /* Maximum number of groups with distinct bylines/original titles (compared to MAX_MULTXTR) */
static int do_diag=PSP_FALSE; /* Diagnostic flag; this can be used anywhere to create a line on which a breakpoint */
/* can be set when a particular condition occurs */
For simple, uncomplicated, items the intent of aggregation is simply to display the first appearance of an item, followed by any reprint information on subsequent lines (in chronological order) as in:
In this simple case, we have entries in the sorted data for the original appearance as well as the reprints, each of which identifies the original appearance. However, there are cases where all we have is an instance defining the reprint appearance which may or may not identify the original appearance, as in (in the WFI):
Clearly, in this case, we need to output the first line (even if the first appearance is unknown) as part of processing the second line, while in the previous case above we want to suppress the first line for the reprints as we have already displayed it. We handle this by checking the pubdet_ptr field in each scanitem structure to see if it is the same as the previous one.
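The pubdet_ptr check might look something like the sketch below. It is heavily simplified and rests on assumptions: the struct is a reduced stand-in for scanitem with only the one field named above, pubdet_ptr is assumed to hold the original publication details as a string, the comparison is assumed to be by string content, and the function name need_first_line is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reduced stand-in for the scanitem structure: only the field named in
 * the text, assumed here to hold the original publication details. */
struct scanitem {
    const char *pubdet_ptr;
};

/* Return non-zero if the first-appearance line should be output for
 * `curr`: that is, when there is no previous record or when its
 * publication details differ from those of the previous record. */
int need_first_line(const struct scanitem *prev,
                    const struct scanitem *curr)
{
    assert(curr != NULL && curr->pubdet_ptr != NULL);
    if (prev == NULL || prev->pubdet_ptr == NULL)
        return 1;
    return strcmp(prev->pubdet_ptr, curr->pubdet_ptr) != 0;
}
```

Because the records are processed in sorted order, comparing only with the immediately preceding record is enough to tell whether the first line has already been displayed.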
There are even (obscure) cases where we have (vague) original appearance data without any reprint information, as in:
in which case we need to output the first line but not the second line even though we are dealing with a reprint record (in this case magid_ptr is empty).
However, if the item is a serial (say) then we don't want to list each part separately so we try to aggregate all the instances into a single line so that, if the above were serialised over three issues, it might appear as:
There are several points to note about this simple case:
To make life even more interesting, some items appear as a single item in their original appearance but multiple items when reprinted or vice versa (note that the first of these is particularly common if the original appearance lies outside the current index), as in (in the WFI):
or:
This latter highlights yet another problem with aggregation, albeit one new to the v2 indexes: bylines. Although this example is fairly straightforward, when we are talking about a long column it is possible that the column will appear under multiple bylines (the most common known example being when an editor uses their initials for some instances). In a perfect world we would list these separately, but that would mean sorting by byline_ptr before dtpubl_ptr, which would disrupt the "normal entries". We might be able to "put aside" any with a different byline while we process the current set, or come up with an alternative (e.g. append the byline to only those issues that differ, which is fine as long as there isn't an overall byline to display as well). All this has been left as an exercise for later: for now we just (try to) ignore such differences.
Many of the items in the database effectively contain two different pieces of publication data: details of the current publication and of the original publication, although obviously in many cases these are the same and/or the second one is missing. In many/most cases there is also a record in the database for the original publication so we don't need to worry about it, but there are many instances where this is not the case.
If the original publication was under a different byline or title then this has always been catered for in FLUSH_SORT as we would need records in different areas of the sort hierarchy. However, if this was not the case, the issue was originally postponed until WRITE_STORY_ITEM, where the code checked to see if we already had the original publication and, if not, output it. This worked fine for simple cases, but led to problems in aggregation for two reasons:
A brief attempt was made to fix this by creating new records in FLUSH_SORT for such records but this seemed to introduce more problems (not least a major increase in the size of the files) so a second attempt (in 2024) focussed on SETUP_STYAUT_IDX. Having sorted the data into the order required for the STYAUT (i.e. Names) index, the routine then read through the data sequentially and for each set of records for a given item checked that we had a record for the original publication and, if not, added one. This worked pretty well, but there were three immediate complications:
Even with the above fix there was still a problem with book reviews. Initially there was an additional check in the aggregation check to see if the next item "was either not about the specified author or had the same byline on the piece about the author". This works fine when the only reviews of a given book are by the same author. However, if the same book is reviewed by multiple different authors, and those reviews are then reprinted elsewhere (very common in British Reprint Editions of SF magazines) this breaks as all the original editions are sorted before all the reprint editions so that for each review the original and the reprint are probably not adjacent and hence end up in separate groups. Both before and after the above fix this meant the same review was listed twice, separated by other reviews of the same book by other people.
This was fixed by removing the check in WRITE_STORY_ITEM that, when aggregating items for a "subject", they all had the same byline, and instead handling this via the group mechanism as discussed under WRITE_STORY_ITEM.
There was also a problem when an item being aggregated had previously appeared under a different title (e.g. Jon Gustafson's "A Different Perspective" column, which ran in Figment and was mainly (but not always) a reprint of his "The Gimlet Eye Returns" column in Pulphouse: The Hardback Magazine). The listing under "The Gimlet Eye Returns" worked fine as it simply listed the items that appeared under that name followed by the reprints with the different title (see the discussion of the Item Extras Array). However, it didn't work the other way round, so the listing under "A Different Perspective" alternated the original and reprint details even though the original appeared under a different name. To handle this, the Group Extras Array was extended to include the prior publication details, used only when there was an original title. This meant that a separate Group Extras item was created for each of the above instances, causing them to be listed correctly.
As with all web-based applications, regression testing is a nightmare as a small change in the program that has little visual effect can have a massive impact on the HTML source with anchor names changing, page breaks moving, etc. etc.
To address this, an approach has been adopted that attempts to preserve the key information about the website while removing much of the formatting code that gets in the way. This is achieved by running the batch file d:\new\reformat_index.bat which first merges all HTML files for a particular index into a single text file and then invokes an UltraEdit macro (TidyIdxgen) to strip out the unnecessary formatting. The resultant files are:
The text files can then be compared from one version of the index to another to give a clearer idea of where there are differences.