nonvalidating scanning/parsing

"Andrew Dalke" <dalke@acm.org>
1 May 2000 13:13:56 -0400

From comp.compilers

Related articles
*nonvalidating scanning/parsing dalke@acm.org (Andrew Dalke)* (2000-05-01)**

| List of all articles for this month |

From:	"Andrew Dalke" <dalke@acm.org>
Newsgroups:	comp.compilers
Date:	1 May 2000 13:13:56 -0400
Organization:	MindSpring Enterprises
Keywords:	parse, question

Hello,

    I have need to parse a large number (10s or even 100s) of different
file formats. They are characterized by being line-oriented, machine
generated and rather context dependent. The files are often large
(even in the 100s of MB), and mostly described by a regular grammar.

    I'm looking for a way to generate parsers for these formats, for use
in Python. Nothing like what I want seems available for Python, so
I'm looking for comparable systems in other languages, so I get some
idea of what's available and where I can go wrong.

    These files come from two sources; database records and program
execution output. Examples of the database files I need to parse are:
220MB at ftp://ftp.expasy.ch/databases/swiss-prot/release/sprot38.dat
21MB (compressed) at ftp://ncbi.nlm.nih.gov/genbank/gbmam.seq.Z 55K at
http://www.expasy.ch/cgi-bin/get-pdb-entry.pl?9pti

HTML marked-up records from the first two formats are at:
http://www.expasy.ch/cgi-bin/get-sprot-entry?P80222
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Nucleotide&lis
t_uids=3822546&dopt=GenBank Links from the "DR" lines of the SwissProt
record (the first one) gives pointers to another dozen database
formats.

Examples of executable output I need to parse are at:
490K at ftp://bio.perl.org/pub/katel/Blast/Nucleic/wisteria.txt
3K (a small one!) at ftp://bio.perl.org/pub/katel/Blast/Nucleic/lupine.txt

Parsers can be written for these with lex/yacc, or ANTLR or any of the
Python equivalents (eg, SPARK and Plex). There are some
complications. One is the lack of simple lexical tokenization. The
same word, say, "sequence", can mean 3 or 4 different things depending
on the context. All of the parser generators allow explicit state
changes, but they are (IMO), harder to maintain than I want to deal
with.

Another problem is that there is a lot of data in a record, but often
I only want a few items. There can be dozens of fields, but I only
want two or three. Because the formats are line-oriented, hand
written parsers often do things like "read the 2nd word from the first
line, then skip lines up to the line starting with 'SQ ', and read the
rest of that line." This can be quite a bit faster than parsers which
read and verify every character.

This obviously allows grammars which are not part of the "formal"
language description, but that's fine - because the data is machine
generated, it's pretty much guaranteed to be in the right format so
this extra check isn't needed. However, the parser generaters I've
come across will verify every character, and there's no way to improve
the lexical performance by having them generate more lenient parsers.

What I'm looking for, then, is the ability to specify the full grammar,
the fields I'm interested in, and some amount of leniency, and use
that to create the parser.

For example, I can imagine the following:

ID = Re("ID (?P<id>\w{8})
(?P<year>\d{4})-(?P<month>\w{3}))-(?P<day>\d{2})\n")
  ... other definitions ...
SQ = Re("SQ( (?P<sequence>\w{10})*( (?P<sequence>\w{0,9}))\n")
END = Re("(?P<end_record>//)\n")

record = ID + ... + Repeat(SQ) + END # "Repeat(s)" means 0 or more s'es
grammar = Repeat(record)

parser = GenerateParser(grammar,
                                                events = ("id", "sequence", "end_record"),
                                                lenient=1)

# assuming some sort of SAX-like event based callbacks..
parser.add_handler(SimpleHandler())
parser.parse(data)

What this would do is have it create the equivalent to

recordloop:
  quit if eof
  skip 3 characters
  read 8 characters, and send the "id" event with the associated data
  skip 12 characters
loop:
  read up to the newline
  if line starts with "SQ" (or maybe just "S" if there are no
      other lines starting with "S", goto SQloop)
  if line starts with two /'s, goto ENDloop
  goto loop:
SQloop:
  skip the space
  read up to the space or newline and send the "sequence" event with the data
  if it was space, goto SQloop
  # by construction, the // must be next, so
ENDloop:
  send the "end_record" event with the two-character string //
  goto recordloop

(state transitions are probably better, but this is easier to follow -
hopefully I didn't make any mistakes :).

I'm thinking of using this named-regex description (the ?P<>) because
that seems to solve the highly context dependent problem in a terse
syntax that most people follow.

The hardest part is determining the minimum amount of work needed
to identify unambiguously the position of the next required event.
I've got some ideas on how to do that, but would like pointers to
related work.

Suggestions?

Oh, and I can get away with having a regular grammar only (some of
the formats technically are even context sensitive, but enforcement
can be done in the event handler).

While I'm here, one of the problems I have to deal with is families
of related formats. I gave pointers to BLAST output earlier. Almost
every minor version of BLAST has a slightly different format, even
though the main information does not change (ie, the presentation
changes but not the semantics). Indeed, I should be able to use the
same event handler for the different formats.

It isn't always obvious to recognize which version of the output
is being parsed, so I want to generate the parser which is the union
of all known variations of the format (allowing, perhaps, some error
if there is no way to disambiguate, or a way to establish proper
order if a newer format is a backwards compatible subset of an older
format).

If the grammar is described with a state diagram, then version
identification can be done by traversing each diagram seperately,
where the one(s) remaining signify the specific version(s).

Is this the usual/proper way to do version identification?

                                        Andrew Dalke
                                        dalke@acm.org

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

nonvalidating scanning/parsing

"Andrew Dalke" <dalke@acm.org>1 May 2000 13:13:56 -0400

"Andrew Dalke" <dalke@acm.org>
1 May 2000 13:13:56 -0400