MagParse - Parse an external file into Fictionmags Format
MagParse creates a draft file in
Fictionmags format from a variety of input formats:
Plain Text File
MagParse translates a number of entries
representing magazine issues and/or books into the internal Fictionmags format.
It is primarily used for converting files submitted to the Fictionmags Index
by assorted users, all of whom tend to use a subtly different format. Note that
the format used by MagParse is neither the input
format documented in the user documentation nor the output format used in
the indexes themselves, though it is similar to both.
The start of a typical entry is marked
by a leading ">>", which identifies a magazine or book header.
For magazines the basic format is:
>>Magazine Name [issue details] ed. editor(s) (publisher, price, pagecount, format, cover by artist)
where:
- Magazine Name is (surprise, surprise!)
the name of the magazine being indexed
- Issue Details are in the same
format as on the book records
- The editors are specified in the
usual way for names and may be omitted if the editor isn't known.
- The publisher, if specified, must
be the first field after the '('. Note that it is delimited by a comma so
the name shouldn't contain a comma, although as a special case a trailing
", Inc.", ", Ltd.", ", Limited" or ", LLC"
are allowed. Note also that, as the publisher can be almost anything, the
program currently requires it to start with an alphabetic character to distinguish
it from the other fields.
- The price is optional and may
be specified in most common currencies (starting $ for US dollars; A$ for
Australian dollars; C$ for Canadian dollars; ^E= for Euros; \ for British
pounds). Note that for pre-decimal British prices the format uses a trailing
/- for whole shillings (e.g. 5/-) and /nd for pennies (e.g. 4/6d).
- The pagecount is optional and
may contain multiple counts in roman and numeric formats, but must end in
"pp" or "pp+" (e.g. ix+236pp).
- The format is optional and specifies
the magazine format (e.g. "pulp", "quarto", "standard",
etc.)
- The cover artist is optional.
If the cover is photographic this part may be specified as "cover photo
by artist"
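As a rough illustration of the rules above, the bracketed part of a magazine header might be classified along the following lines. This is only a sketch, not MagParse's actual code: the function name classify_header_fields, the regular expression and the dictionary keys are all assumptions made for the example. It also shares the limitation noted above, in that a leading alphabetic field is always taken to be the publisher.

```python
import re

# Company suffixes allowed after a comma inside the publisher name.
PUBLISHER_SUFFIXES = ("Inc.", "Ltd.", "Limited", "LLC")

# Prices: a currency prefix ($, A$, C$, ^E=, \) or a pre-decimal
# British price such as 5/- or 4/6d.
PRICE_RE = re.compile(r"(?:A\$|C\$|\$|\^E=|\\)\S+|\d+/-|\d+/\d+d")


def classify_header_fields(details):
    """Classify the comma-separated fields found between '(' and ')'."""
    parts = [p.strip() for p in details.split(",")]
    fields = {}
    start = 0
    first = parts[0] if parts else ""
    # The publisher, if present, comes first and starts with a letter.
    if (first[:1].isalpha() and not PRICE_RE.fullmatch(first)
            and not first.endswith(("pp", "pp+"))
            and not first.startswith("cover ")):
        start = 1
        # Re-attach a trailing ", Inc." etc. that the split removed.
        if len(parts) > 1 and parts[1] in PUBLISHER_SUFFIXES:
            first += ", " + parts[1]
            start = 2
        fields["publisher"] = first
    for part in parts[start:]:
        if PRICE_RE.fullmatch(part):
            fields["price"] = part
        elif part.endswith(("pp", "pp+")):
            fields["pagecount"] = part
        elif part.startswith(("cover by ", "cover photo by ")):
            fields["cover"] = part.split(" by ", 1)[1]
        elif part:
            fields["format"] = part
    return fields
```

For example, classify_header_fields("Weird Tales, Inc., $0.25, 96pp, pulp, cover by Matt Fox") keeps "Weird Tales, Inc." together as the publisher and files the remaining four fields under price, pagecount, format and cover.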
For books the basic format is similar
but is identified by the book title being in angle brackets:
>><book title> ed. editor(s) (publisher, ISBN, date, price, pagecount, format, booktype, cover by artist)
>><book title> by author(s) (publisher, ISBN, date, price, pagecount, format, booktype, cover by artist)
where:
- The ISBN, if specified, must start
with 978-, 0- or 1-
- The date must be specified and
is in the same format as for magazine issue details except that it shouldn't
contain any commas (e.g. "March 2010", or "March 15 2010"
and so on)
- The booktype, if specified, defines
the type of book (e.g. "co", "n.", "oa", etc.)
The other fields are the same as
for magazines.
Individual items (i.e. EA records)
are then identical for books or magazines and are specified in the format:
[page number] * author(s)[/subject(s)] * title * item type [[series name]][; illus. artist(s)] [* original appearance data]
where:
- page number is anything that comes
before the first divider (*)
- author(s) may be specified in
either internal format (e.g. Smith, John) or external format, with a leading
@ used to distinguish the latter case (e.g. @John Smith). If in the internal
format then everything between the first and second dividers is passed through
as the author names unchecked; if in the external format then there are various
additional options and restrictions:
- If there are multiple authors
they are separated by commas or (preferably) ampersands; the code also
interprets the string " and " as " & "
- An attempt is made to identify
and isolate any trailing "extras"
field by looking for phrases like ", Jr." or patterns such as
two upper-case characters after a comma and space (e.g. ", MD")
or two upper case characters followed by periods similarly (e.g. ",
B.A.") but this is far from exhaustive or foolproof. Note in particular
that, if a degree (such as B.A.) is specified there must not be
a space after the first period.
- Various qualifiers are supported
to identify secondary names
in the format ", prefix:name(s)" (note comma comes before the
space in this case) where "prefix" can currently be "tr",
"hp", "gho", "adapt", "by", "ed",
"with", "read by", "as told to" or "as
told by"
- If there are subject names
to be added to the item, these are specified as part of this field, separated
from the main authors by a '/' character
- title is the title of the item
in external format (e.g. "The House" or "An Exciting Column|
An Item") and the code will attempt to identify and extract the title
additional and item additional fields where appropriate.
- the item type is as usual; technically
this can be omitted and will default to "??" but this isn't recommended
- if the item is part of a series
then the series name should be specified after the item type, enclosed in
square brackets, either in internal or external format (e.g. either "[Sherlock
Holmes]" or "[Holmes, Sherlock]"); multiple series names may
be specified in either (or even mixed) format and are separated by '/' characters
- if the item is illustrated then
the artist name(s) may be specified after the prefix "; illus. ".
They are normally specified in external format but some users supply them
in internal format in which case the special prefix "; illusx. "
may be used to suppress conversion.
- if the original appearance data
is known then it may be specified after a further divider in the format of
magazine name terminated by a comma followed by issue data (e.g. "Argosy,
June 24, 1935"); in addition the special format " * (r)" may
be used to indicate a reprint from an unknown source. If no indication is
given that the item is a reprint it is assumed to be original in the enclosing
magazine or book and the appropriate publication details (where known) will
be added.
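The item format described above can be pulled apart roughly as follows. This is a minimal sketch under stated assumptions rather than the real parser: the function name split_item and the dictionary keys are invented for the example, and the conversion of "@" external names, title additionals and the original appearance data is not attempted.

```python
import re


def split_item(line):
    """Split an item line on '*' dividers and pick apart the type field."""
    # At most five fields: page, names, title, type field, original appearance.
    parts = [p.strip() for p in line.split("*", 4)]
    parts += [""] * (5 - len(parts))
    page, names, title, type_field, origin = parts

    # Subject names follow the main authors after a '/'.
    authors, _, subjects = names.partition("/")

    # Artists follow "; illus. " (or "; illusx. " for internal format).
    artists = ""
    m = re.search(r";\s*illusx?\.\s*", type_field)
    if m:
        artists = type_field[m.end():].strip()
        type_field = type_field[:m.start()]

    # Series names are in square brackets, separated by '/'.
    series = []
    m = re.search(r"\[(.*)\]", type_field)
    if m:
        series = [s.strip() for s in m.group(1).split("/")]
        type_field = type_field[:m.start()]

    return {
        "page": page,
        "authors": authors.strip(),
        "subjects": subjects.strip(),
        "title": title,
        "type": type_field.strip() or "??",  # omitted type defaults to "??"
        "series": series,
        "artists": artists,
        "origin": origin,
    }
```

Because the dividers are plain asterisks, a title or note containing an asterisk would be split in the wrong place, which is the same restriction noted for note records below.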
In addition to the above, any existing
A/D/E records may be included in the input file (and must include the terminating
'~' character to identify them as such). If an existing 'A' record is used then
any following items (to the next ">>" or 'A' record) are assumed
to be part of that book/magazine and will inherit any specified publication
details.
Any other line is treated as an item
appearance note (EB) record unless it starts with "translated from "
in which case it is translated to an item note (ED) record - note that such
records should not contain asterisks or the code will try to treat them as an
EA record item.
Errors detected by the program are
written to a file called xxx.err in the same folder as the input file, where
"xxx" matches the file name of the input file; some errors are also
output to the screen (for historic reasons) and the program will report at the
end of conversion if any errors have been detected. Whether there are any errors
or not, the program will create a converted output file called xxx.mag in the
same folder. Note that for safety reasons the program will not run if an existing
file called xxx.mag is detected in the relevant folder, to avoid accidentally
over-writing an existing file (as has happened in the past).
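The file naming and the overwrite check can be sketched as below. The function name output_paths is hypothetical, and the sketch assumes that "xxx" is the input file name with its extension removed.

```python
from pathlib import Path


def output_paths(input_file):
    """Return the (.err, .mag) paths for an input file.

    Raises FileExistsError if the .mag file is already present, so that
    an earlier conversion is never silently overwritten.
    """
    src = Path(input_file)
    err_path = src.with_suffix(".err")
    mag_path = src.with_suffix(".mag")
    if mag_path.exists():
        raise FileExistsError(f"{mag_path} already exists; refusing to run")
    return err_path, mag_path
```

Doing the check before any parsing starts means that nothing is written at all when the .mag file is already there.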
Note that MagParse was never
intended for use by anybody other than the developer so it is not particularly
flexible. In particular, the order of the fields in the records above is currently
fixed so that if, for example, the illustrator is specified after the
first appearance data it will not be translated correctly.
ISFDb Source Files
The ISFDb
database is made up of a number of different page types:
- ea.cgi:
a name record, listing all the core items for a given author
- pe.cgi:
a series record, listing all entries in a given series
- title.cgi:
an item record, listing all appearances (for a story) or editions (for a book)
of a given item
- pl.cgi:
a book/magazine record, listing the contents of a given edition of a book
or magazine issue
Currently the program only handles
the pl.cgi page type.
One generic problem is that, even
though the pages are (presumably) generated by a software program from the underlying
database, the precise format of the page content is fairly fluid with elements
sometimes starting a new line and sometimes just running on from the previous
element. The program currently tries to handle this by progressively stripping
off the recognised content and either leaving what's left of the current line
or, if there's nothing there, reading a new line (e.g. in the routine CheckStart).
This is not wholly successful and a possible alternative would be to read the
entire file into a single (massive) CString variable at the start (as is done
partially with the pl item header).
Book/Magazine Records (pl*.cgi)
Each record basically contains the
following sections:
Page Header
This contains all the standard ISFDb
page format such as search box, left-side hyperlinks, etc. It is (currently)
identified by the presence of "<div id="content">"
which announces the start of the next section. This section is currently ignored.
Item Header
This contains overall details of
the item (e.g. editor, date, publisher, etc.). It currently seems to start with
"<div class="ContentBox">"
(although the code suggests that some records do, or did, start with "<div
class="MetadataBox">"). The former may be just at the
start of the line while the latter apparently is/was on a line by itself. The
code checks to see if it has one or the other: if so, it steps over it; if not
then it panics.
If there is a cover scan associated
with the item then the image and the item header data
are held in a table, which is not present if there is no cover scan, so we need
to check for the existence of "<table>"
and remember if we found it (so that we can handle the terminating "</table>"
later). If there is a table then it is followed by code (typically on a single
line) along the lines of:
<table>
<tr class="scan">
<td>
<a href="http://www.collectorshowcase.fr/images2/weird_4911.jpg">
<img src="http://www.collectorshowcase.fr/images2/weird_4911.jpg"
alt="picture" class="scan"></a>
</td>
<td class="pubheader">
so we try to check for these and
step over. In all cases this should be followed by "<ul>"
and multiple items that start with "<li>"
(but don't have a terminating "</li>").
As the information associated with a particular "<li>"
may or may not be on the same data line, the code first concatenates any following
lines which do not start with either "<li>"
or "</ul>". Within this section
it also strips any spaces before a < or after
a > as these tend to be variable and cause confusion.
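The consolidation step might look something like this. It is a sketch only: the function name consolidate_items is an assumption, and joining continuation lines with a single space is a guess at what the real code does.

```python
import re


def consolidate_items(lines):
    """Join continuation lines onto the preceding record.

    A new record starts at every line beginning with "<li>" or "</ul>";
    anything else is appended to the record before it.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(("<li>", "</ul>")) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    # Remove the variable spacing after '>' and before '<'.
    return [re.sub(r"\s+<", "<", re.sub(r">\s+", ">", r)) for r in records]
```

Spaces inside a tag (for instance between "span" and "class") are left alone, as only whitespace directly next to the angle brackets is removed.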
It then checks each of the "<li>"
elements as follows:
- Publication: along the lines of
"<li><b>Publication:</b>Weird
Tales, November 1949<span..." where, on a good day, everything
between the "<b>" and the "<span"
is the magazine title and issue date (or book title). As this is pretty much
a free text field, though, there are wide variations. The title may also be
preceded by a "hint" field (indicated by "<span
class="hint" title="") in which case we use the
title in the hint field instead (not sure when this happens: possibly in foreign-language
titles?) Note that everything after the "<span"
(in the first case) or the terminating "
(in the second) is ignored.
- Author(s)/Editor(s): these start
with "<li><b>Authors:</b>",
"<li><b>Author:</b>",
"<li><b>Editors:</b>"
or "<li><b>Editor:</b>"
and are all parsed by ParseAuthor (and suffixed
with !eds. or !ed. in the latter cases).
- Issue/Publication Date: these
start with "<li><b>Year:</b>"
or "<li><b>Date:</b>"
and are in the form YYYY-MM-DD or similar. The code just strips off any -00
segments and collapses any remaining - characters.
- ISBN : starts with "<li><b>ISBN:</b>"
followed by the ISBN. In some cases there is a secondary ISBN in a smaller
font in square brackets (e.g. "0-345-32759-4 [<small>978-0-345-32759-8</small>]").
If the former is a 10-digit ISBN then the latter is used instead.
- Publisher: starts with "<li><b>Publisher:</b>".
The code calls StripHref to remove any href before
storing it.
- Price: starts with "<li><b>Price:</b>".
We just store the value, normalising it to "" if specified as "$0.00".
- Pagecount: starts with "<li><b>Pages:</b>".
For some reason this is occasionally in square brackets so we just strip those
off if we find them.
- Binding: starts with "<li><b>Binding:</b>".
We just store the value unless it is set to "unknown".
- Format: starts with "<li><b>Format:</b><div
class" or "<li><b>Format:</b><span
class" with the format following the next ">"
and terminated by the next "<".
- Type: starts with "<li><b>Type:</b>"
followed by a keyword. The ones currently understood are:
- "MAGAZINE"
or "FANZINE": store as "mg"
- "COLLECTION"
or "OMNIBUS": store as "co"
- "ANTHOLOGY":
store as "oa"
- "NONFICTION":
store as "nf"
- "NOVEL"
or "NOVEL[non-genre]": store as
"n."
- "CHAPBOOK"
or "CHAPBOOK[graphic format]":
store as "??" as these are ambiguous
- Series Name: starts with "<li><b>Pub.
Series:</b>". The code calls StripHref
to remove any href before storing it.
- Series Number: starts with "<li><b>Pub.
Series #:</b>". The code just stores it.
- Cover Artist: starts with "<li><b>Cover:</b>".
Sometimes (usually?) this contains a link to the title record for the book/issue
followed by "by xxx" and sometimes
(historically?) it just contains the artist name. In either case we call ParseAuthor
to parse the name. In some rare cases it is followed by "(variant
of " suggesting the cover is a reprint in which case we need to
parse the latter first via CheckVariant and generate
the cover using "Eifc.A0" rather than "DC".
- Unwanted Bits: There are a number
of recognised header fields that we want to ignore. These are:
- "<li><b>Title
Reference:</b>"
- "<li><b>Container
Title:</b>"
- "<li><b>Bibliographic
Comments:</b>"
- "<li><b>Webpages:</b>":
a link to an online copy of the book/issue
- "<li><b>Catalog
ID:</b>"
- "<li><b>Cover2:</b>":
details of a second cover for the magazine
- "<li><a
href="http://www.isfdb.org/wiki/index.php/Special:Upload"
- Notes: These start with "<li><div
class="notes"><b>Notes:</b>" and are
a bit of a headache as they may contain lists (and even nested lists) of contents
which can interfere with the parsing of the main elements. Basically we just
concatenate them all into a single field (separated with ";") but
count any "<ul>" we find (removing
each) and then remove the same number of "</ul>"
elements together with the trailing "</div>".
- External IDs: These start with
"<li><b>External IDs:</b>"
but I don't know what they are and am not entirely sure how we handle them.
If we find any other header records
then we log an "Unexpected header line" error so that we can investigate
and work out how to handle them.
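Several of the simpler header fields above come down to small normalisation steps, which can be sketched as follows. The function and table names here (normalise_date, choose_isbn, normalise_price, PUB_TYPES) are assumptions made for illustration and are not taken from the program itself.

```python
import re

# Publication type keywords and the book types they are stored as.
PUB_TYPES = {
    "MAGAZINE": "mg", "FANZINE": "mg",
    "COLLECTION": "co", "OMNIBUS": "co",
    "ANTHOLOGY": "oa",
    "NONFICTION": "nf",
    "NOVEL": "n.", "NOVEL[non-genre]": "n.",
    "CHAPBOOK": "??", "CHAPBOOK[graphic format]": "??",  # ambiguous
}


def normalise_date(value):
    """Drop any -00 segments, then collapse the remaining '-' characters."""
    return value.replace("-00", "").replace("-", "")


def choose_isbn(value):
    """Prefer the bracketed secondary ISBN when the first has 10 digits."""
    m = re.fullmatch(r"(.*?)\s*\[<small>(.*?)</small>\]", value)
    if not m:
        return value.strip()
    primary, secondary = m.group(1), m.group(2)
    return secondary if len(primary.replace("-", "")) == 10 else primary


def normalise_price(value):
    """Treat a price of $0.00 as 'no price given'."""
    return "" if value == "$0.00" else value
```

So a date of "1949-11-00" becomes "194911", and "0-345-32759-4 [<small>978-0-345-32759-8</small>]" yields the 13-digit form.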
If we determined that we had a magazine
then we try to split off the issue information from the title. These appear
in a bewildering variety of ways, such as:
- Weird Tales Vol.
1 No. 2
- Weird Tales, Issue
#47
- Weird Tales #6
- Weird Tales, November
1949
- Weird Tales, 14
November 1949
The first three can
be detected by looking for the prefix, but if those fail we look for the first
comma and assume the rest is a date. In the latter case it tries to reverse
the day and month and insert a comma before the year to keep the parse code
happy. It also does a bit of house-keeping on the title, removing any trailing
"," or ":" and converting "[UK]" to "(UK)".
Note that if it can't find any issue information it logs an error and converts
the book type to "an".
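The title/issue split for the five example forms might be sketched like this. The function name split_issue and the exact regular expressions are assumptions; the real code no doubt handles many more variants.

```python
import re


def split_issue(text):
    """Split a magazine title from its issue details.

    Returns (title, issue); issue is None if nothing was recognised, in
    which case the caller would log an error and treat the item as "an".
    """
    text = text.replace("[UK]", "(UK)")
    # First try the explicit prefixes: "Vol. n", "Issue #n" and "#n".
    m = re.search(r",?\s*(Vol\. ?\d.*|Issue #.*|#\d.*)$", text)
    if m:
        title, issue = text[:m.start()], m.group(1)
    elif "," in text:
        # Otherwise assume everything after the first comma is a date.
        title, issue = text.split(",", 1)
        issue = issue.strip()
        # "14 November 1949" -> "November 14, 1949"
        issue = re.sub(r"^(\d{1,2}) ([A-Za-z]+) (\d{4})$", r"\2 \1, \3", issue)
    else:
        title, issue = text, None
    # House-keeping: drop any trailing "," or ":" from the title.
    return title.strip().rstrip(",:").strip(), issue
```

With this sketch "Weird Tales, 14 November 1949" comes back as the title "Weird Tales" and the issue "November 14, 1949".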
In all cases it then
calls CONVERT_TITLE to strip off any "title additional".
If it is a magazine
then it tries to convert the title and issue into the relevant Magazine ID (not
sure what the date2 stuff is about).
There might then
be one of two records indicating missing or incomplete contents - "Stub
record" or "Placeholder, contents incomplete"
which we translate (in due course) into the FM format equivalents.
The code then generates
the header records for the issue/book from the information parsed to date, including
any notes. It then checks we have the terminating "</ul>"
and, if we had a cover scan, the terminating "</td>",
"</table>" and "Cover
art..." records. In the latter case there may also be a record starting
"on <a href=" (no idea why) which
we just ignore.
Contents Header
There are then a small number of
records between the item header and the contents themselves. The contents are
terminated by "<div id="VerificationBox">"
or "<div class="VerificationBox">"
so we check for those first and, if found, just return as it means there aren't
any contents. If not then the next record should be either "<div
id="ContentBox">" or "<div
class="ContentBox">" and we log an error and return
if neither is found.
There then might be some optional
sections starting with "<span class="containertitle">".
It's unclear what the purpose of these is so we just want to skip over them.
We're really looking for the presence of a record starting "<h2>Contents
" but rather than just throwing away anything that might be before
that we explicitly look for records we have checked and found to be harmless
(as well as the "VerificationBox" records
as above).
Assuming all has gone according to
plan we then get a record starting "<h2>Contents
" followed by a record containing "<ul>"
followed by multiple groups of records starting "<li>"
which are parsed as per the next section.
Contents
Each item in the contents section
may be spread across multiple lines so we first consolidate them all into a
single record as in the Item Header section and
remove any spaces before a "<", after
a ">" or either side of a divider
(which has been translated into the trigraph "^.@").
There are then a number of basic formats for the record:
<li>page
number^.@title^.@type
byauthor
<li>page number^.@title^.@interview
ofinterviewee^.@interview byinterviewer
<li>page number^.@Review:titlebyauthor^.@review
byreviewer
<li>page number^.@Review
of the xxx "title" by author^.@essay
byreviewer
where "title", "author",
"interviewee", "interviewer", "reviewer" and the
fixed text "Review" are usually hyperlinked
to the relevant item or name record. Note that:
- the page number may be omitted
in which case the following divider is omitted as well so we need to check
first if it is a feasible page number.
- special page numbers of "fep"
and "bep" are used where the FM format
uses "ifc." and "ibc.".
- the author, interviewer and reviewer
fields may specify "uncredited", in which case they aren't hyperlinked
and are either translated as "Anon." for main authors or just omitted
for the author of a book being reviewed (where it is the subject field).
- in the fourth format there is
a single hyperlink from "Review" to
the end of the author name(s) rather than separate hyperlinks for review,
title and author(s).
- the title may be prefixed with
an "en-space" ("^ _") which
we need to strip off and ???
An added complication is that various
parts of the text may be enclosed in a hint or tooltip structure of the form:
<span class="hint"
title="xxx"><a
href="xxx" dir="ltr">xxx</a><img
src="http://www.isfdb.org/question_mark_icon.gif" alt="Question
mark" class="help"></span>
<div class="tooltip"><a href="xxx" dir="ltr">xxx</a><sup
class="mouseover">?</sup><span class="tooltiptext
tooltipnarrow">xxx</span></div>
in which case we try to isolate the
standard "<a href="xxx" dir="ltr">xxx</a>"
for the rest of the program to parse (maybe the other title is better??) by
calling StripTtip as soon as we have consolidated the
record.
Once we have isolated the title there
are also cases where a repeated column (or untitled letter or similar) is distinguished
by suffixing the issue data. Thus, for example, in "Weird Tales, March-April
2008" we have "The Eyrie (Weird Tales, March-April 2008)". This
is easily handled when they are identical but more problematic otherwise. For
instance, in "Interzone, #144 June 1999" we have "Ansible Link
(Interzone #144)" - at the moment the code doesn't even attempt to handle
this.
Note that, for all the above formats,
we should now be at the point where we have "^.@type
byauthor" (or "^.@interview
ofinterviewee^.@interview byinterviewer")
but there are occasionally other bits in the way:
- there might be
a type prefix followed by another divider: current prefixes are "non-genre",
"graphic format" and "juvenile"
- all are currently just ignored
- there might be
a series field, in square brackets, followed by another divider. As a further
complication, the series field might just be a series name or might be a series
name followed by a divider followed by a series number. In either case the
series name will be hyperlinked so the hyperlink needs to be stripped off.
- if the item is
a reprint there will be the date of the original appearance (in brackets) followed
by another divider.
- as shown above,
if we have an interview we need to parse the "^.@interview
ofinterviewee" part and store the interviewee as the subject
- there might be
a type prefix (again) followed by another divider: not sure if all three can
appear either here or as above but some have been noted in both positions
The "type" will then be translated
as follows:
- "shortstory",
"short story", "shortfiction",
or "short fiction": ss
- "novella":
na
- "novelette":
nv
- "novel":
n.
- "interview":
iv
- "poem":
pm
- "serial":
sl
- "essay":
ar or br (if we're doing a review)
- "interior
artwork": il
- "collection":
co
- "anthology":
an
Anything else is reported as an error
so that we can handle it next time round.
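The translation above is essentially a lookup table, which could be sketched as below. The names ITEM_TYPES and translate_type are hypothetical, as is the choice to return None for an unknown type and leave the error report to the caller.

```python
# ISFDb item type text and the Fictionmags item type it maps to.
ITEM_TYPES = {
    "shortstory": "ss", "short story": "ss",
    "shortfiction": "ss", "short fiction": "ss",
    "novella": "na",
    "novelette": "nv",
    "novel": "n.",
    "interview": "iv",
    "poem": "pm",
    "serial": "sl",
    "interior artwork": "il",
    "collection": "co",
    "anthology": "an",
}


def translate_type(isfdb_type, is_review=False):
    """Return the Fictionmags type, or None if the type is not recognised."""
    if isfdb_type == "essay":
        # An "essay" is a book review when we are parsing a review record.
        return "br" if is_review else "ar"
    return ITEM_TYPES.get(isfdb_type)
```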
On a good day, all we have left is
the author, but that may be followed by any (combination) of four suffix clauses:
- "(variant
of": usually details of the original title a reprint appeared
under, but this is not guaranteed. In the simple case this is just followed
by a hyperlinked title but there are other possibilities starting with "<i>"
that I'm not currently clear about. This is parsed via CheckVariant.
- "(book
publication as": followed by details of the book version inside
a hyperlink, possibly with date attached. As this is identical in structure
to the variant we use the same routine to parse it.
- "(trans.
of": followed by details of original version, possibly embedded
in a tooltip class, but details are currently vague.
- "[as
by": contains details of the original name(s) the item was published
under, which could be either a variant or a pseudonym (there's no way of telling
which). If there are multiple authors they are separated by "<b>and</b>",
each occurrence of which is converted to " & " before being
passed to ParseAuthor.
There's just the (main) author left
to parse, via ParseAuthor, with "uncredited"
and "Anonymous" normalised to "Anon." (and an item type
of "ar" being reset to "ms" in such cases) and "various"
normalised to "[Various]".
Having parsed the record, there's
some final tidying up to be done:
- if the item type is "ar"
and the title is "Letter", or "Letter
[n]", or "Letter #n" then
the item type is reset to "lt" and the title to "[letter]"
- if the item type is "ar"
and the title is "Letter: xxx" then
the item type is reset to "lt" and the title to "xxx"
- if the item type is "il"
and the title is "Cartoon:
xxx" then the item type
is reset to "ct" and the title to "xxx"
- if the title contains "(Part
x of y)" then
it is reset to "[Part x of y]" (with an additional leading space)
- if the title or the series contains
something like "..." it is reset to
"^._"
- If we have a series and the item
type is non-fiction (apart from letters) we add the series title as a prefix
(inserting any series number in the "||nnnn n.|" format) and change
the item type to "cl". As a special case, if the series name matches
the title or is a prefix in the title, we reset it back to a column with just
the series name.
- There is then a complex bit of
shuffling to see if an item whose type isn't "il" is preceded by,
or followed by, an item with the same title and a type of "il".
If so, the latter is suppressed, with its "author" being added to the
former record as artist.
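The first three tidying rules (letters and cartoons) are simple enough to sketch directly. The function name tidy is an assumption, as is the reading of "n" in "Letter [n]" and "Letter #n" as a run of digits; the part, ellipsis, series and artwork rules are not covered.

```python
import re


def tidy(item_type, title):
    """Apply the letter and cartoon rules to an (item type, title) pair."""
    if item_type == "ar":
        # "Letter", "Letter [n]" or "Letter #n" -> untitled letter
        if re.fullmatch(r"Letter(?: \[\d+\]| #\d+)?", title):
            return "lt", "[letter]"
        # "Letter: xxx" -> letter titled "xxx"
        m = re.fullmatch(r"Letter: (.+)", title)
        if m:
            return "lt", m.group(1)
    if item_type == "il":
        # "Cartoon: xxx" -> cartoon titled "xxx"
        m = re.fullmatch(r"Cartoon: (.+)", title)
        if m:
            return "ct", m.group(1)
    return item_type, title
```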
CheckVariant
ParseAuthor
In most cases an author name is just
embedded in a hyperlink, but if there are multiple authors then each is embedded
in its own hyperlink and separated with "<b>and</b>".
There's also something messy with embedded "[as by
" clauses but I'm not sure how
that works.
The ISFDb distinguishes ambiguous
authors by adding a suffix of the form " (x)"
to the names so we strip those off so that we can handle our own disambiguation.
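The two straightforward parts of this (multiple hyperlinked authors and the disambiguation suffix) can be sketched as follows. The function name parse_authors is hypothetical, the "[as by" handling is not attempted, and treating any trailing bracketed text as the ISFDb suffix is an assumption.

```python
import re


def parse_authors(html):
    """Return the list of author names found in a hyperlinked author field."""
    names = []
    for chunk in html.split("<b>and</b>"):
        # Drop the surrounding hyperlink (and any other) tags.
        name = re.sub(r"<[^>]+>", "", chunk).strip()
        # Drop a trailing ISFDb disambiguator such as " (I)".
        name = re.sub(r" \([^()]*\)$", "", name)
        if name:
            names.append(name)
    return names
```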
StripHref
This routine checks to see if the
specified string contains "<a href="
and, if so, strips off the hyperlink. It also searches the string for any other
instances of "<a href=" and, if found,
strips them off as well (e.g. when multiple authors are specified as each has
its own hyperlink).
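In outline, the effect is much the same as the following sketch, where the function name strip_href and the use of a regular expression are assumptions rather than a description of how the routine is written.

```python
import re


def strip_href(text):
    """Remove every <a href=...> ... </a> wrapper, keeping the linked text."""
    text = re.sub(r"<a href=[^>]*>", "", text)
    return text.replace("</a>", "")
```

A string with no hyperlink in it is returned unchanged.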
StripTtip
As mentioned above, sometimes fields
such as authors or titles are embedded inside hints or tooltips with constructs
along the lines of:
<span class="hint"
title="xxx"><a href="xxx" dir="ltr">xxx</a><img
src="http://www.isfdb.org/question_mark_icon.gif" alt="Question
mark" class="help"></span>
<div class="tooltip"><a href="xxx" dir="ltr">xxx</a><sup
class="mouseover">?</sup><span class="tooltiptext
tooltipnarrow">xxx</span></div>
where the hyperlink may be omitted.
This routine isolates the (first) "xxx" string (and surrounding hyperlink
if there is one).
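A sketch of this for the two constructs shown above, assuming that the spaces around tags have already been stripped so that each construct sits on one line. The names strip_ttip, HINT_RE and TOOLTIP_RE are invented for the example.

```python
import re

# <span class="hint" title="..."> CONTENT <img ...></span>
HINT_RE = re.compile(
    r'<span class="hint" title="[^"]*">(.*?)<img [^>]*></span>')

# <div class="tooltip"> CONTENT <sup ...>?</sup><span ...>...</span></div>
TOOLTIP_RE = re.compile(
    r'<div class="tooltip">(.*?)<sup class="mouseover">\?</sup>'
    r'<span class="tooltiptext[^"]*">.*?</span></div>')


def strip_ttip(text):
    """Replace each hint/tooltip wrapper with its first "xxx" content,
    keeping the surrounding hyperlink if there is one."""
    text = HINT_RE.sub(r"\1", text)
    return TOOLTIP_RE.sub(r"\1", text)
```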