Re: State-of-the-art parser generation?

Chris F Clark <cfc@world.std.com>
14 Dec 2003 22:09:26 -0500

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 14 Dec 2003 22:09:26 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 03-12-095
Keywords: parse, tools
Posted-Date: 14 Dec 2003 22:09:26 EST

Plain-old "vanilla" LALR(1) parser generation is certainly
"out-of-date" in the sense that any moderately competent university
student could be expected to write one as part of their compiler class
given an appropriate text book. The same is true of LL(1) parser
generation. A graduate student working on a masters thesis could
similarly be expected to implement a "new wrinkle" on either of the
techniques. The state of the art in the area seems to be adaptive
grammars (grammars whose rules can change as the parse progresses).
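
To make the "student exercise" claim concrete, here is a minimal sketch (in Python, over a made-up toy grammar with invented names) of the sort of hand-written LL(1) recursive-descent parser such an exercise would produce:

    # Toy LL(1) grammar:  E -> T ('+' T)*    T -> NUM | '(' E ')'
    # A hand-written recursive-descent parser of the kind a compiler-class
    # exercise would yield.  Grammar and names are invented for illustration.
    import re

    def tokenize(src):
        return re.findall(r'\d+|[()+]', src)

    class Parser:
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0

        def peek(self):
            return self.toks[self.pos] if self.pos < len(self.toks) else None

        def expect(self, tok):
            if self.peek() != tok:
                raise SyntaxError("expected %r, got %r" % (tok, self.peek()))
            self.pos += 1

        def parse_E(self):                     # E -> T ('+' T)*
            node = self.parse_T()
            while self.peek() == '+':
                self.pos += 1
                node = ('+', node, self.parse_T())
            return node

        def parse_T(self):                     # T -> NUM | '(' E ')'
            t = self.peek()
            if t is not None and t.isdigit():
                self.pos += 1
                return ('num', int(t))
            self.expect('(')
            node = self.parse_E()
            self.expect(')')
            return node

    print(Parser(tokenize("1+(2+3)")).parse_E())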


Decent error recovery is still somewhat at the PhD/research level.
There are some relatively good error recovery techniques published in
the literature. However, they are not generally reflected in most
parser generators and most parsers have either intolerable or at best
poor behavior in the presence of syntax errors. Moreover, many of the
"good" error recovery techniques require some level of extra
information (in addition to the grammar) to generate the error
correction tables--for example, weights for each token giving its
importance.
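
To illustrate the kind of extra information involved: one family of published techniques resynchronizes after an error at tokens deemed "important". A rough sketch of the idea, with weights that are purely invented for illustration:

    # Hypothetical panic-mode recovery: on a syntax error, skip tokens until
    # one whose "importance" weight exceeds a threshold, then resynchronize
    # there.  The weights below are invented; a real table would be tuned
    # per language.
    SYNC_WEIGHT = {';': 10, '}': 9, ')': 5, ',': 3}

    def recover(tokens, pos, threshold=5):
        """Skip forward from the error position to a high-weight token."""
        while pos < len(tokens) and SYNC_WEIGHT.get(tokens[pos], 0) < threshold:
            pos += 1
        return pos                     # the caller resumes parsing here

    # After an error at index 3, resume at the next ';':
    toks = ['x', '=', '3', '@', 'y', ';', 'z', '=', '4', ';']
    print(recover(toks, 3))            # -> 5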


More importantly, I am not aware of even good metrics that measure
parser error recovery, especially not on arbitrary grammars. It would
seem to be a significant advance to the field if someone would come up
with a "theory" of error recovery that would measure how close a
parser came to generating a correct parse given an arbitrary grammar
and an arbitrary incorrect input--better, of course, if the
measurements were in some sense realistic. The closest I've seen to
that was a series of erroneous Pascal sources with grades on the
related error recovery by specific parsers.
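
One obvious building block such a theory might use is the token-level edit distance between the erroneous input and the nearest repair, compared against what the parser actually discarded or inserted while recovering. This is only a sketch of a candidate ingredient, not an established result:

    # Token-level edit distance (classic Levenshtein over token lists) --
    # one plausible ingredient of such a measure, nothing more.
    def token_edit_distance(a, b):
        dist = list(range(len(b) + 1))
        for i, ta in enumerate(a, 1):
            prev, dist[0] = dist[0], i
            for j, tb in enumerate(b, 1):
                cur = min(dist[j] + 1,          # delete ta
                          dist[j - 1] + 1,      # insert tb
                          prev + (ta != tb))    # substitute (0 cost if equal)
                prev, dist[j] = dist[j], cur
        return dist[-1]

    bad  = ['if', '(', 'x', '{', 'y', '=', '1', ';', '}']
    good = ['if', '(', 'x', ')', '{', 'y', '=', '1', ';', '}']
    print(token_edit_distance(bad, good))   # -> 1: one missing ')'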


If you are really interested in error recovery, please realize that
most of the important errors occur at the sub-lexical level, e.g.
one-character errors that just happen to cause problem tokens. My
favorite examples are missing quote characters that change strings
into text to be parsed and vice versa. Similarly, missing opening or
closing braces are a common problem in C-like languages (parens have
the same effect in many more languages) and tend to have disastrous
effects on the resulting parses. (Note, these problems occur in both
hand-written and automatically generated parsers, so it's not just a
parser generation problem.) I've never seen a lexer/parser that could
figure out where the missing quote was in a program text and correct
the program to the intended error-free version, much less one that
could do that if there were more than one error in the program.
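
A toy demonstration of the string/code flip, using a deliberately simplistic lexer (illustrative only):

    # Deliberately simplistic lexer: an unterminated string simply swallows
    # the rest of the line, which is roughly what real lexers do.
    import re

    def lex(src):
        return re.findall(r'"[^"]*"?|\w+|[=;()]', src)

    print(lex('s = "hello"; t = f(x);'))   # string, then code
    print(lex('s = "hello;  t = f(x);'))   # missing quote: the rest is "string"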


Now, as to the maintainability of the grammar aspects. There is a
reasonable amount of recent and not-so-recent work in that area.


GLR (Tomita/Lang) parsing is something that anyone can implement and
has some nice properties in that it eliminates the annoying conflict
problems of traditional LALR parser generators. Its advocates claim
that it allows writing of grammars by a larger class of naive grammar
writers. Its drawback is that it hides the fact that the grammar may
be ambiguous, which can allow people to write grammars that do not
parse what they intend.
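
The hidden-ambiguity drawback is easy to see even without a full GLR implementation. The following is not GLR itself, just a brute-force enumeration of every parse tree for the ambiguous toy grammar E -> E '+' E | NUM; these are exactly the trees a generalized parser would silently hand back:

    # Brute-force enumeration of all parse trees for  E -> E '+' E | NUM.
    # A generalized parser accepts this grammar without complaint and
    # produces every one of these trees.
    def parses(toks):
        if len(toks) == 1 and toks[0].isdigit():
            yield ('num', toks[0])
        for i, t in enumerate(toks):
            if t == '+':
                for left in parses(toks[:i]):
                    for right in parses(toks[i + 1:]):
                        yield ('+', left, right)

    trees = list(parses(['1', '+', '2', '+', '3']))
    print(len(trees))       # -> 2: (1+2)+3 and 1+(2+3)
    for t in trees:
        print(t)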


Predicated parsing is a slightly more controlled extension to grammars
than GLR parsing and is easy to implement for LL tools, and only
moderately difficult for LR tools (I only know of one predicated LR
parser generator and several LL ones). It does not hide ambiguous
grammars, but requires more thought on the grammar writers' part.
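
A rough sketch of the predicate idea in a hand-written LL-style parser: an alternative is guarded by a speculative check on the upcoming tokens rather than a fixed lookahead depth. The "grammar" fragment and helper names below are invented for illustration.

    # The predicate looks_like_declaration() decides which branch to take.
    def looks_like_declaration(tokens, pos):
        """Predicate: does a 'TYPE IDENT' prefix start here?"""
        return (pos + 1 < len(tokens)
                and tokens[pos] in ('int', 'float')
                and tokens[pos + 1].isidentifier())

    def parse_statement(tokens, pos):
        if looks_like_declaration(tokens, pos):   # predicate picks the branch
            return ('decl', tokens[pos], tokens[pos + 1]), pos + 2
        return ('expr', tokens[pos]), pos + 1     # otherwise fall through

    print(parse_statement(['int', 'x'], 0))   # -> (('decl', 'int', 'x'), 2)
    print(parse_statement(['x'], 0))          # -> (('expr', 'x'), 1)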


Improvements/replacements for the precedence/associativity
declarations in LALR parsing also make grammars easier to write and
have no drawbacks that I can see--although I've never seen them in
practice, only read theoretical discussions of them. The best examples
allow one to specify that one "prefers" one parse tree to be built
rather than another. This might combine nicely with GLR parsing to
eliminate some of the ambiguity problems.
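
A rough sketch of the "prefer this tree to that one" idea, assuming the generalized parser hands back all candidate trees and a preference function picks one; the scoring rule below is invented for illustration:

    # Given all candidate trees from a generalized parse, a disambiguation
    # rule -- here, "prefer left-associative '+'" -- selects one.
    def right_assoc_penalty(tree):
        if tree[0] != '+':
            return 0
        _, left, right = tree
        penalty = 1 if right[0] == '+' else 0
        return penalty + right_assoc_penalty(left) + right_assoc_penalty(right)

    candidates = [
        ('+', ('+', ('num', '1'), ('num', '2')), ('num', '3')),   # (1+2)+3
        ('+', ('num', '1'), ('+', ('num', '2'), ('num', '3'))),   # 1+(2+3)
    ]
    print(min(candidates, key=right_assoc_penalty))   # picks (1+2)+3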


Grammar inheritance (treating grammars like objects whose rules can be
inherited and overridden) is an easy extension to any parser
technology. I think "modular" grammars are a similar idea, but am not
familiar with the details.
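
A sketch of how grammar inheritance falls out almost for free in a recursive-descent setting, where rules are methods and a derived grammar overrides just one of them; class and rule names are invented for illustration:

    # Rules are methods, so a derived grammar overrides one rule and
    # inherits the rest unchanged.
    class BaseExprGrammar:
        def __init__(self, tokens):
            self.toks, self.pos = tokens, 0

        def atom(self):                    # base rule: numbers only
            t = self.toks[self.pos]
            self.pos += 1
            return ('num', t)

        def expr(self):                    # E -> atom ('+' atom)*
            node = self.atom()
            while self.pos < len(self.toks) and self.toks[self.pos] == '+':
                self.pos += 1
                node = ('+', node, self.atom())
            return node

    class ExprWithIdents(BaseExprGrammar):
        def atom(self):                    # override one rule, keep expr()
            t = self.toks[self.pos]
            self.pos += 1
            return ('id', t) if t.isalpha() else ('num', t)

    print(ExprWithIdents(['x', '+', '1']).expr())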


Finally, just one word to the wise. The world doesn't really need a
new parser generator unless it is truly an improvement on the current
state and more importantly one that its author is willing to nurture
long-term. I have seen far too many flash-in-the-pan efforts where
someone has come up with a small improvement, written yet-another-
incompatible-parser-generator that incorporates the idea, and then
abandoned it after a few months (or years) and that work hasn't really
added to the field. If you sit down to write a parser generator,
expect to spend about 5 person-years on it before you will really know
whether you have done a good job that will endure. Most of the parser
generators that have endured have had at least two authors, so that
at least one of them has the wherewithal to continue on through the
long acceptance process.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark                  Internet : compres@world.std.com
Compiler Resources, Inc.     Web Site : http://world.std.com/~compres
19 Bronte Way #33M           voice    : (508) 435-5016
Marlboro, MA 01752 USA       fax      : (508) 251-2347 (24 hours)

