MagParse - Parse an external file into Fictionmags Format

MagParse creates a draft file in Fictionmags format from a variety of input formats:

a plain text file
one or more ISFDb source files
the NESFA database file (not currently used)
Bill's website format (historic & archived)

Plain Text File

MagParse translates a number of entries representing magazine issues and/or books into the internal Fictionmags format. It is primarily used for converting files submitted to the Fictionmags Index by assorted users, all of whom tend to use a subtly different format. Note that the format used by MagParse is neither the input format documented in the user documentation nor the output format used in the indexes themselves, though it is similar to both.

The start of a typical entry is identified by a leading ">>" which identifies a magazine or book header. For magazines the basic format is:

>>Magazine Name [issue details] ed. editor(s) (publisher, price, pagecount, format, cover by artist)

where:

Magazine Name is (surprise, surprise!) the name of the magazine being indexed
Issue Details are in the same format as on the book records
The editors are specified in the usual way for names and may be omitted if the editor isn't known.
The publisher, if specified, must be the first field after the '('. Note that it is delimited by a comma so the name shouldn't contain a comma, although as a special case a trailing ", Inc.", ", Ltd.", ", Limited" or ", LLC" are allowed. Note also that, as the publisher can be almost anything, the program currently requires it to start with an alphabetic character to distinguish it from the other fields.
The price is optional and may be specified in most common currencies (starting $ for US dollars; A$ for Australian dollars; C$ for Canadian dollars; ^E= for Euros; \ for British pounds). Note that for pre-decimal British prices the format uses a trailing /- for whole shillings (e.g. 5/-) and /nd for pennies (e.g. 4/6d).
The pagecount is optional and may contain multiple counts in roman and numeric formats, but must end in "pp" or "pp+" (e.g. ix+236pp).
The format is optional and specifies the magazine format (e.g. "pulp", "quarto", "standard", etc.)
The cover artist is optional. If the cover is photographic this part may be specified as "cover photo by artist"
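
The field rules above can be sketched roughly as follows. This is only an illustration in Python (MagParse itself is not written in Python) and the function name classify_header_fields is made up for the example. It assumes the text between the brackets has already been isolated, and it will take a format word in first position (e.g. "pulp") to be the publisher, which may or may not match what the program does.

```python
import re

def classify_header_fields(paren_text):
    """Assign each comma-separated field inside the header's (...) to a role."""
    # Protect company suffixes such as ", Inc." so they do not split the publisher.
    protected = re.sub(r",\s+(Inc\.|Ltd\.|Limited|LLC)(?=,|$)",
                       lambda m: "\x00" + m.group(1), paren_text)
    fields = [f.strip().replace("\x00", ", ") for f in protected.split(",")]
    result = {}
    for index, field in enumerate(fields):
        if field.startswith(("cover by ", "cover photo by ")):
            result["cover"] = field
        elif re.fullmatch(r"[0-9ivxlcdm+]+pp\+?", field):
            # Roman and/or numeric counts ending in "pp" or "pp+".
            result["pagecount"] = field
        elif re.match(r"(A\$|C\$|\$|\^E=|\\)", field) or re.fullmatch(r"\d+/(-|\d+d)", field):
            # Currency prefix, or pre-decimal British price such as 5/- or 4/6d.
            result["price"] = field
        elif index == 0 and field[:1].isalpha():
            # The publisher must come first and start with a letter.
            result["publisher"] = field
        else:
            result["format"] = field
    return result
```

For example, classify_header_fields("Popular Publications, Inc., $0.25, 130pp+, pulp, cover by Jane Doe") keeps "Popular Publications, Inc." together as the publisher.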

For books the basic format is similar but is identified by the book title being in angle brackets:

>><book title> ed. editor(s) (publisher, ISBN, date, price, pagecount, format, booktype, cover by artist)
>><book title> by author(s) (publisher, ISBN, date, price, pagecount, format, booktype, cover by artist)

where:

The ISBN, if specified, must start with 978-, 0- or 1-
The date must be specified and is in the same format as for magazine issue details except that it shouldn't contain any commas (e.g. "March 2010", or "March 15 2010" and so on)
The booktype, if specified, defines the type of book (e.g. "co", "n.", "oa", etc.)

The other fields are the same as for magazines.

Individual items (i.e. EA records) are then identical for books or magazines and are specified in the format:

[page number] * author(s)[/subject(s)] * title * item type [[series name]][; illus. artist(s)] [* original appearance data]

where:

page number is anything that comes before the first divider (*)
author(s) may be specified in either internal format (e.g. Smith, John) or external format, with a leading @ used to distinguish the latter case (e.g. @John Smith). If in the internal format then everything between the first and second dividers is passed through as the author names unchecked; if in the external format then there are various additional options and restrictions:
- If there are multiple authors they are separated by commas or (preferably) ampersands; the code also interprets the string " and " as " & "
- An attempt is made to identify and isolate any trailing "extras" field by looking for phrases like ", Jr." or patterns like two upper case characters after a comma and space (e.g. ", MD") or two upper case characters followed by periods similarly (e.g. ", B.A.") but this is far from exhaustive or foolproof. Note in particular that, if a degree (such as B.A.) is specified there must not be a space after the first period.
- Various qualifiers are supported to identify secondary names in the format ", prefix:name(s)" (note comma comes before the space in this case) where "prefix" can currently be "tr", "hp", "gho", "adapt", "by", "ed", "with", "read by", "as told to" or "as told by"
- If there are subject names to be added to the item, these are specified as part of this field, separated from the main authors by a '/' character
title is the title of the item in external format (e.g. "The House" or "An Exciting Column| An Item") and the code will attempt to identify and extract the title additional and item additional fields where appropriate.
the item type is as usual; technically this can be omitted and will default to "??" but this isn't recommended
if the item is part of a series then the series name should be specified after the item type, enclosed in square brackets, either in internal or external format (e.g. either "[Sherlock Holmes]" or "[Holmes, Sherlock]"); multiple series names may be specified in either (or even mixed) format and are separated by '/' characters
if the item is illustrated then the artist name(s) may be specified after the prefix "; illus. ". They are normally specified in external format but some users supply them in internal format in which case the special prefix "; illusx. " may be used to suppress conversion.
if the original appearance data is known then it may be specified after a further divider in the format of magazine name terminated by a comma followed by issue data (e.g. "Argosy, June 24, 1935"); in addition the special format " * (r)" may be used to indicate a reprint from an unknown source. If no indication is given that the item is a reprint it is assumed to be original in the enclosing magazine or book and the appropriate publication details (where known) will be added.
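
As a rough illustration of the item layout described above, the following Python sketch splits one item line into its parts. The name split_item_line and the dictionary keys are invented for the example; it does not attempt the author "extras", qualifier or title-additional handling, and it relies on the fixed field order.

```python
import re

def split_item_line(line):
    """Split one item line on the '*' dividers into named parts."""
    parts = [part.strip() for part in line.split("*")]
    if len(parts) < 3:
        raise ValueError("expected at least: page * author(s) * title")
    # Subjects, if any, follow the main authors after a '/'.
    authors, _, subjects = parts[1].partition("/")
    item = {"page": parts[0], "authors": authors, "subjects": subjects,
            "title": parts[2]}
    rest = parts[3] if len(parts) > 3 else ""
    # Illustrators follow "; illus. " (external) or "; illusx. " (internal).
    rest, _, artists = rest.partition("; illus")
    if artists:
        item["artists_internal"] = artists.startswith("x. ")
        item["artists"] = artists.partition(". ")[2]
    # Series names sit in square brackets, separated by '/'.
    series = re.search(r"\[(.*)\]", rest)
    item["series"] = series.group(1).split("/") if series else []
    # An omitted item type defaults to "??".
    item["type"] = rest.split("[")[0].strip() or "??"
    if len(parts) > 4:
        item["original_appearance"] = parts[4]
    return item
```

Because the split is purely positional, an illustrator placed after the original appearance data would end up in the wrong field, which is the limitation noted for the program itself.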

In addition to the above, any existing A/D/E record may be included in the input file (and must include the terminating '~' character to identify them as such). If an existing 'A' record is used then any following items (to the next ">>" or 'A' record) are assumed to be part of that book/magazine and will inherit any specified publication details.

Any other line is treated as an item appearance note (EB) record unless it starts with "translated from " in which case it is translated to an item note (ED) record - note that such records should not contain asterisks or the code will try to treat them as an EA record item.

Errors detected by the program are written to a file called xxx.err in the same folder as the input file where "xxx" matches the file name of the input file; some errors are also output to the screen (for historic reasons) and the program will report at the end of conversion if any errors have been detected. Whether there are any errors or not, the program will create a converted output file called xxx.mag in the same folder. Note that for safety reasons the program will not run if an existing file called xxx.mag is detected in the relevant folder, to avoid accidentally over-writing an existing file (as has happened in the past).
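
The file naming and the over-write guard amount to something like the sketch below. It is a Python illustration only; plan_output_files is a made-up name and it assumes that "xxx" is the input file name with its extension removed.

```python
from pathlib import Path

def plan_output_files(input_path):
    """Return the (.err, .mag) paths for an input file, refusing to clobber."""
    source = Path(input_path)
    error_file = source.with_suffix(".err")   # xxx.err beside the input file
    output_file = source.with_suffix(".mag")  # xxx.mag beside the input file
    if output_file.exists():
        # Do not run at all if a previous conversion is still in the folder.
        raise FileExistsError(f"{output_file} already exists; move it first")
    return error_file, output_file
```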

Note that MagParse was never intended for use by anybody other than the developer so it is not particularly flexible. In particular, the order of the fields in the records above is currently fixed so that if, for example, the illustrator is specified after the first appearance data it will not be translated correctly.

ISFDb Source Files

The ISFDb database is made up of a number of different page types:

ea.cgi: a name record, listing all the core items for a given author
pe.cgi: a series record, listing all entries in a given series
title.cgi: an item record, listing all appearances (for a story) or editions (for a book) of a given item
pl.cgi: a book/magazine record, listing the contents of a given edition of a book or magazine issue

Currently the program only handles the pl.cgi page type.

One generic problem is that, even though the pages are (presumably) generated by a software program from the underlying database, the precise format of the page content is fairly fluid with elements sometimes starting a new line and sometimes just running on from the previous element. The program currently tries to handle this by progressively stripping off the recognised content and either leaving what's left of the current line or, if there's nothing there, reading a new line (e.g. in the routine CheckStart). This is not wholly successful and a possible alternative would be to read the entire file into a single (massive) CString variable at the start (as is done partially with the pl item header).

Book/Magazine Records (pl*.cgi)

Each record basically contains the following sections:

Page Header
Item Header
Contents Header
Contents
Item Trailer: this is currently ignored

Page Header

This contains all the standard ISFDb page format such as search box, left-side hyperlinks, etc. It is (currently) identified by the presence of "<div id="content">" which announces the start of the next section. This section is currently ignored.

Item Header

This contains overall details of the item (e.g. editor, date, publisher, etc.). It currently seems to start with "<div class="ContentBox">" (although the code suggests that some records do, or did, start with "<div class="MetadataBox">"). The former may be just at the start of the line while the latter apparently is/was on a line by itself. The code checks to see if it has one or the other: if so, it steps over it; if not then it panics.

If there is a cover scan associated with the item then the image and the item header data are held in a table, which is not present if there is no cover scan, so we need to check for the existence of "<table>" and remember if we found it (so that we can handle the terminating "</table>" later). If there is a table then it is followed by code (typically on a single line) along the lines of:

<table>
<tr class="scan">
<td>
<a href="http://www.collectorshowcase.fr/images2/weird_4911.jpg">
<img src="http://www.collectorshowcase.fr/images2/weird_4911.jpg" alt="picture" class="scan"></a>
</td>
<td class="pubheader">

so we try to check for these and step over. In all cases this should be followed by "<ul>" and multiple items that start with "<li>" (but don't have a terminating "</li>"). As the information associated with a particular "<li>" may or may not be on the same data line, the code first concatenates any following lines which do not start with either "<li>" or "</ul>". Within this section it also strips any spaces before a < or after a > as these tend to be variable and cause confusion. It then checks each of the "<li>" elements as follows:

Publication: along the lines of "<li><b>Publication:</b>Weird Tales, November 1949<span..." where, on a good day, everything between the "<b>" and the "<span" is the magazine title and issue date (or book title). As this is pretty much a free text field, though, there are wide variations. The title may also be preceded by a "hint" field (indicated by "<span class="hint" title="") in which case we use the title in the hint field instead (not sure when this happens: possibly in foreign-language titles?) Note that everything after the "<span" (in the first case) or the terminating " (in the second) is discarded.
Author(s)/Editor(s): these start with "<li><b>Authors:</b>", "<li><b>Author:</b>", "<li><b>Editors:</b>" or "<li><b>Editor:</b>" and are all parsed by ParseAuthor (and suffixed with !eds. or !ed. in the latter cases).
Issue/Publication Date: these start with "<li><b>Year:</b>" or "<li><b>Date:</b>" and are in the form YYYY-MM-DD or similar. The code just strips off any -00 segments and collapses any remaining - characters.
ISBN: starts with "<li><b>ISBN:</b>" followed by the ISBN. In some cases there is a secondary ISBN in a smaller font in square brackets (e.g. "0-345-32759-4 [<small>978-0-345-32759-8</small>]"). If the former is a 10-digit ISBN then the latter is used instead.
Publisher: starts with "<li><b>Publisher:</b>". The code calls StripHref to remove any href before storing it.
Price: starts with "<li><b>Price:</b>". We just store the value, normalising it to "" if specified as "$0.00".
Pagecount: starts with "<li><b>Pages:</b>". For some reason this is occasionally in square brackets so we just strip those off if we find them.
Binding: starts with "<li><b>Binding:</b>". We just store the value unless it is set to "unknown".
Format: starts with "<li><b>Format:</b><div class" or "<li><b>Format:</b><span class" with the format following the next ">" and terminated by the next "<".
Type: starts with "<li><b>Type:</b>" followed by a keyword. The ones currently understood are:
- "MAGAZINE" or "FANZINE": store as "mg"
- "COLLECTION" or "OMNIBUS": store as "co"
- "ANTHOLOGY": store as "oa"
- "NONFICTION": store as "nf"
- "NOVEL" or "NOVEL[non-genre]": store as "n."
- "CHAPBOOK" or "CHAPBOOK[graphic format]": store as "??" as these are ambiguous
Series Name: starts with "<li><b>Pub. Series:</b>". The code calls StripHref to remove any href before storing it.
Series Number: starts with "<li><b>Pub. Series #:</b>". The code just stores it.
Cover Artist: starts with "<li><b>Cover:</b>". Sometimes (usually?) this contains a link to the title record for the book/issue followed by "by xxx" and sometimes (historically?) it just contains the artist name. In either case we call ParseAuthor to parse the name. In some rare cases it is followed by "(variant of " suggesting the cover is a reprint in which case we need to parse the latter first via CheckVariant and generate the cover using "Eifc.A0" rather than "DC".
Unwanted Bits: There are a number of recognised header fields that we want to ignore. These are:
- "<li><b>Title Reference:</b>"
- "<li><b>Container Title:</b>"
- "<li><b>Bibliographic Comments:</b>"
- "<li><b>Webpages:</b>": a link to an online copy of the book/issue
- "<li><b>Catalog ID:</b>"
- "<li><b>Cover2:</b>": details of a second cover for the magazine
- "<li><a href="http://www.isfdb.org/wiki/index.php/Special:Upload"
Notes: These start with "<li><div class="notes"><b>Notes:</b>" and are a bit of a headache as they may contain lists (and even nested lists) of contents which can interfere with the parsing of the main elements. Basically we just concatenate them all into a single field (separated with ";") but count any "<ul>" we find (removing each) and then remove the same number of "</ul>" elements together with the trailing "</div>".
External IDs: These start with "<li><b>External IDs:</b>" but I don't know what they are and am not entirely sure how we handle them.

If we find any other header records then we log an "Unexpected header line" error so that we can investigate and work out how to handle them.
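
A few of the simpler header fields above boil down to small normalisations, sketched here in Python for illustration. All the names are invented for the example and each helper assumes the "<li><b>...:</b>" prefix has already been removed. Reading "collapses any remaining - characters" as "removes them" is an assumption.

```python
import re

# Publication type keywords and the book types they are stored as.
PUB_TYPES = {
    "MAGAZINE": "mg", "FANZINE": "mg",
    "COLLECTION": "co", "OMNIBUS": "co",
    "ANTHOLOGY": "oa",
    "NONFICTION": "nf",
    "NOVEL": "n.", "NOVEL[non-genre]": "n.",
    "CHAPBOOK": "??", "CHAPBOOK[graphic format]": "??",  # ambiguous
}

def normalise_date(value):
    """Drop any -00 segments, then remove the remaining hyphens."""
    return value.replace("-00", "").replace("-", "")

def choose_isbn(value):
    """Prefer the bracketed secondary ISBN when the first one has 10 digits."""
    match = re.match(r"([0-9Xx-]+)\s*\[<small>([0-9Xx-]+)</small>\]", value)
    if not match:
        return value.strip()
    primary, secondary = match.groups()
    return secondary if len(primary.replace("-", "")) == 10 else primary

def normalise_price(value):
    """A price of $0.00 is treated as no price at all."""
    return "" if value == "$0.00" else value

def normalise_pagecount(value):
    """Page counts occasionally arrive wrapped in square brackets."""
    return value.strip("[]")
```

For instance, normalise_date("1949-11-00") gives "194911" and choose_isbn("0-345-32759-4 [<small>978-0-345-32759-8</small>]") gives "978-0-345-32759-8".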

If we determined that we had a magazine then we try to split off the issue information from the title. These appear in a bewildering variety of ways, such as:

Weird Tales Vol. 1 No. 2
Weird Tales, Issue #47
Weird Tales #6
Weird Tales, November 1949
Weird Tales, 14 November 1949

The first three can be detected by looking for the prefix, but if those fail we look for the first comma and assume the rest is a date. In the latter case it tries to reverse the day and month and insert a comma before the year to keep the parse code happy. It also does a bit of house-keeping on the title, removing any trailing "," or ":" and converting "[UK]" to "(UK)". Note that if it can't find any issue information it logs an error and converts the book type to "an".
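
The splitting logic just described might look something like the following Python sketch. The function names are hypothetical, and the assumption that only day-first dates are rearranged (with month-and-year dates left alone) is mine rather than something confirmed by the code. Error logging and the change of book type to "an" are left out.

```python
import re

def tidy_title(title):
    """Remove a trailing ',' or ':' and convert '[UK]' to '(UK)'."""
    return title.strip().rstrip(",:").replace("[UK]", "(UK)")

def split_title_and_issue(publication):
    """Return (title, issue), with issue set to None if nothing is found."""
    for prefix in (" Vol. ", ", Issue #", " #"):
        position = publication.find(prefix)
        if position != -1:
            title = publication[:position]
            issue = publication[position:].lstrip(", ")
            break
    else:
        # No recognised prefix: assume everything after the first comma is a date.
        title, comma, issue = publication.partition(",")
        if not comma:
            return tidy_title(publication), None
        issue = issue.strip()
        day_first = re.fullmatch(r"(\d{1,2}) (\w+) (\d{4})", issue)
        if day_first:
            # "14 November 1949" becomes "November 14, 1949".
            issue = "{1} {0}, {2}".format(*day_first.groups())
    return tidy_title(title), issue
```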

In all cases it then calls CONVERT_TITLE to strip off any "title additional".

If it is a magazine then it tries to convert the title and issue into the relevant Magazine ID (not sure what the date2 stuff is about).

There might then be one of two records indicating missing or incomplete contents - "Stub record" or "Placeholder, contents incomplete" which we translate (in due course) into the FM format equivalents.

The code then generates the header records for the issue/book from the information parsed to date, including any notes. It then checks we have the terminating "</ul>" and, if we had a cover scan, the terminating "</td>", "</table>" and "Cover art..." records. In the latter case there may also be a record starting "on <a href=" (no idea why) which we just ignore.

Contents Header

There are then a small number of records between the item header and the contents themselves. The contents are terminated by "<div id="VerificationBox">" or "<div class="VerificationBox">" so we check for those first and, if found, just return as it means there aren't any contents. If not then the next record should be either "<div id="ContentBox">" or "<div class="ContentBox">" and we log an error and return if neither is found.

There then might be some optional sections starting with "<span class="containertitle">". It's unclear what the purpose of these is so we just want to skip over them. We're really looking for the presence of a record starting "<h2>Contents " but rather than just throwing away anything that might be before that we explicitly look for records we have checked and found to be harmless (as well as the "VerificationBox" records as above).

Assuming all has gone according to plan we then get a record starting "<h2>Contents " followed by a record containing "<ul>" followed by multiple groups of records starting "<li>" which are parsed as per the next section.

Each item in the contents section may be spread across multiple lines so we first consolidate them all into a single record as in the Item Header section and remove any spaces before a "<", after a ">" or either side of a divider (which has been translated into the trigraph "^.@"). There are then a number of basic formats for the record:

<li>page number^.@title^.@type byauthor
<li>page number^.@title^.@interview ofinterviewee^.@interview byinterviewer
<li>page number^.@Review:titlebyauthor^.@review byreviewer
<li>page number^.@Review of the xxx "title" by author^.@essay byreviewer

where "title", "author", "interviewee", "interviewer", "reviewer" and the fixed text "Review" are usually hyperlinked to the relevant item or name record. Note that:

the page number may be omitted in which case the following divider is omitted as well so we need to check first if it is a feasible page number.
special page numbers of "fep" and "bep" are used where the FM format uses "ifc." and "ibc.".
the author, interviewer and reviewer fields may specify "uncredited", in which case they aren't hyperlinked and are either translated as "Anon." for main authors or just omitted for the author of a book being reviewed (where it is the subject field).
in the fourth format there is a single hyperlink from "Review" to the end of the author name(s) rather than separate hyperlinks for review, title and author(s).
the title may be prefixed with an "en-space" ("^ _") which we need to strip off and ???
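
The consolidation step and the special page numbers can be sketched as below. This is a Python illustration with invented names; it assumes the dividers have already been translated to the "^.@" trigraph and that continuation lines are joined with a single space before the spacing is tidied.

```python
import re

# ISFDb page labels and the FM equivalents.
PAGE_ALIASES = {"fep": "ifc.", "bep": "ibc."}

def consolidate_items(lines):
    """Join continuation lines onto their '<li>' record and tidy the spacing."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(("<li>", "</ul>")) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    tidied = []
    for record in records:
        record = re.sub(r"\s+<", "<", record)            # no spaces before '<'
        record = re.sub(r">\s+", ">", record)            # no spaces after '>'
        record = re.sub(r"\s*\^\.@\s*", "^.@", record)   # nor around a divider
        tidied.append(record)
    return tidied

def translate_page(page):
    """Map 'fep'/'bep' to 'ifc.'/'ibc.' and leave other page numbers alone."""
    return PAGE_ALIASES.get(page, page)
```

Removing the space before each "<" is why the formats listed above show run-together text such as "byauthor".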

An added complication is that various parts of the text may be enclosed in a hint or tooltip structure of the form:

<span class="hint" title="xxx"><a href="xxx" dir="ltr">xxx</a><img src="http://www.isfdb.org/question_mark_icon.gif" alt="Question mark" class="help"></span>
<div class="tooltip"><a href="xxx" dir="ltr">xxx</a><sup class="mouseover">?</sup><span class="tooltiptext tooltipnarrow">xxx</span></div>

in which case we try to isolate the standard "<a href="xxx" dir="ltr">xxx</a>" for the rest of the program to parse (maybe the other title is better??) by calling StripTtip as soon as we have consolidated the record.

Once we have isolated the title there are also cases where a repeated column (or untitled letter or similar) is distinguished by suffixing the issue data. Thus, for example, in "Weird Tales, March-April 2008" we have "The Eyrie (Weird Tales, March-April 2008)". This is easily handled when they are identical but more problematic otherwise. For instance, in "Interzone, #144 June 1999" we have "Ansible Link (Interzone #144)" - at the moment the code doesn't even attempt to handle this.

Note that, for all the above formats, we should now be at the point where we have "^.@type byauthor" (or "^.@interview ofinterviewee^.@interview byinterviewer") but there are occasionally other bits in the way:

there might be a type prefix followed by another divider: current prefixes are "non-genre", "graphic format" and "juvenile" - all are currently just ignored
there might be a series field, in square brackets, followed by another divider. As a further complication, the series field might just be a series name or might be a series name followed by a divider followed by a series number. In either case the series name will be hyperlinked so the hyperlink needs to be stripped off.
if the item is a reprint there will be the date of the original appearance (in brackets) followed by another divider.
as shown above, if we have an interview we need to parse the "^.@interview ofinterviewee" part and store the interviewee as the subject
there might be a type prefix (again) followed by another divider: not sure if all three can appear either here or as above but some have been noted in both positions

The "type" is then translated as follows:

"shortstory", "short story", "shortfiction", or "short fiction": ss
"novella": na
"novelette": nv
"novel": n.
"interview": iv
"poem": pm
"serial": sl
"essay": ar or br (if we're doing a review)
"interior artwork": il
"collection": co
"anthology": an

Anything else is reported as an error so that we can handle it next time round.
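
In effect this is a table lookup with one special case for reviews, along the lines of the Python sketch below. The names are made up for the example and the error is raised here rather than logged.

```python
ITEM_TYPES = {
    "shortstory": "ss", "short story": "ss",
    "shortfiction": "ss", "short fiction": "ss",
    "novella": "na",
    "novelette": "nv",
    "novel": "n.",
    "interview": "iv",
    "poem": "pm",
    "serial": "sl",
    "interior artwork": "il",
    "collection": "co",
    "anthology": "an",
}

def translate_type(isfdb_type, is_review=False):
    """Translate an ISFDb item type keyword into the FM item type."""
    if isfdb_type == "essay":
        # An essay is a book review if we are parsing a review record.
        return "br" if is_review else "ar"
    if isfdb_type not in ITEM_TYPES:
        raise ValueError(f"Unrecognised item type: {isfdb_type!r}")
    return ITEM_TYPES[isfdb_type]
```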

On a good day, all we have left is the author, but that may be followed by any (combination) of four suffix clauses:

"(variant of": usually details of the original title a reprint appeared under, but this is not guaranteed. In the simple case this is just followed by a hyperlinked title but there are other possibilities starting with "<i>" that I'm not currently clear about. This is parsed via CheckVariant.
"(book publication as": followed by details of the book version inside a hyperlink, possibly with date attached. As this is identical in structure to the variant we use the same routine to parse it.
"(trans. of": followed by details of original version, possibly embedded in a tooltip class, but details are currently vague.
"[as by": contains details of the original name(s) the item was published under, which could be either a variant or a pseudonym (there's no way of telling which). If there are multiple authors they are separated by "<b>and</b>", each occurrence of which is converted to " & " before being passed to ParseAuthor.

There's just the (main) author left to parse, via ParseAuthor, with "uncredited" and "Anonymous" normalised to "Anon." (and an item type of "ar" being reset to "ms" in such cases) and "various" normalised to "[Various]".

Having parsed the record, there's some final tidying up to be done:

if the item type is "ar" and the title is "Letter", or "Letter [n]", or "Letter #n" then the item type is reset to "lt" and the title to "[letter]"
if the item type is "ar" and the title is "Letter: xxx" then the item type is reset to "lt" and the title to "xxx"
if the item type is "il" and the title is "Cartoon: xxx" then the item type is reset to "ct" and the title to "xxx"
if the title contains "(Part x of y)" then it is reset to "[Part x of y]" (with an additional leading space)
if the title or the series contains something like "..." it is reset to "^._"
If we have a series and the item type is non-fiction (apart from letters) we add the series title as a prefix (inserting any series number in the "||nnnn n.|" format) and change the item type to "cl". As a special case, if the series name matches the title or is a prefix in the title, we reset it back to a column with just the series name.
There is then a complex bit of shuffling to see if an item whose type isn't "il" is preceded by, or followed by, an item with the same title and a type of "il"; if so, the latter is suppressed and its "author" is added to the former record as the artist.
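
The simpler title rules in this list (letters, cartoons, parts and ellipses) can be sketched as follows. This is a Python illustration with an invented function name; it takes "n" to be a number, takes "..." to mean exactly three periods, and reads the "additional leading space" literally, all of which are assumptions. The series and artwork-merging rules are not covered.

```python
import re

def tidy_item(item_type, title):
    """Apply the letter, cartoon, part and ellipsis rules to one item."""
    if item_type == "ar" and re.fullmatch(r"Letter( \[\d+\]| #\d+)?", title):
        item_type, title = "lt", "[letter]"
    elif item_type == "ar" and title.startswith("Letter: "):
        item_type, title = "lt", title[len("Letter: "):]
    elif item_type == "il" and title.startswith("Cartoon: "):
        item_type, title = "ct", title[len("Cartoon: "):]
    # "(Part x of y)" becomes " [Part x of y]", keeping any space already there.
    title = re.sub(r"\(Part (\w+) of (\w+)\)", r" [Part \1 of \2]", title)
    # An ellipsis is stored as the trigraph "^._".
    title = title.replace("...", "^._")
    return item_type, title
```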

CheckVariant

ParseAuthor

In most cases an author name is just embedded in a hyperlink, but if there are multiple authors then each is embedded in its own hyperlink and separated with "<b>and</b>". There's also something messy with embedded "[as by " clauses but I'm not sure how that works.

The ISFDb distinguishes ambiguous authors by adding a suffix of the form " (x)" to the names so we strip those off so that we can handle our own disambiguation.
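
A minimal sketch of the two points above (multiple hyperlinked authors and the disambiguation suffix) is given here in Python. The name parse_authors is hypothetical, it returns the names in external format joined with " & ", and it makes no attempt at the embedded "[as by " clauses.

```python
import re

def parse_authors(fragment):
    """Extract author names from hyperlinks separated by '<b>and</b>'."""
    names = []
    for part in fragment.split("<b>and</b>"):
        name = re.sub(r"<[^>]+>", "", part).strip()   # drop the hyperlink tags
        name = re.sub(r"\s+\([^)]*\)$", "", name)      # drop a trailing " (x)"
        names.append(name)
    return " & ".join(names)
```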

StripHref

This routine checks to see if the specified string contains "<a href=" and, if so, strips off the hyperlink. It also searches the string for any other instances of "<a href=" and, if found, strips them off as well (e.g. when multiple authors are specified as each has its own hyperlink).
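
In Python terms the behaviour described is roughly the following. This is a sketch only, with strip_href standing in for the real routine, and it assumes every hyperlink is closed with a plain "</a>".

```python
import re

def strip_href(text):
    """Remove every '<a href=...>' opening tag and its closing '</a>'."""
    text = re.sub(r"<a href=[^>]*>", "", text)
    return text.replace("</a>", "")
```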

StripTtip

As mentioned above, sometimes fields such as authors or titles are embedded inside hints or tooltips with constructs along the lines of:

<span class="hint" title="xxx"><a href="xxx" dir="ltr">xxx</a><img src="http://www.isfdb.org/question_mark_icon.gif" alt="Question mark" class="help"></span>
<div class="tooltip"><a href="xxx" dir="ltr">xxx</a><sup class="mouseover">?</sup><span class="tooltiptext tooltipnarrow">xxx</span></div>

where the hyperlink may be omitted. This routine isolates the (first) "xxx" string (and surrounding hyperlink if there is one).
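
The two constructs can be unwrapped with a pair of patterns, as in the Python sketch below. The name strip_ttip is only a stand-in for the real routine, and the patterns assume the attribute order shown in the two examples above.

```python
import re

HINT = re.compile(r'<span class="hint" title="[^"]*">(.*?)<img [^>]*></span>')
TOOLTIP = re.compile(r'<div class="tooltip">(.*?)<sup class="mouseover">.*?</div>')

def strip_ttip(text):
    """Replace each hint or tooltip wrapper with the first 'xxx' it contains."""
    text = HINT.sub(lambda m: m.group(1), text)
    return TOOLTIP.sub(lambda m: m.group(1), text)
```

Anything outside the wrapper is left untouched, so the routine can be applied to a whole consolidated record.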