Hand-rolled parsers

firth@sei.cmu.edu (Robert Firth)
9 Nov 89 19:28:48 GMT

          From comp.compilers

Related articles
Hand-rolled parsers firth@sei.cmu.edu (1989-11-09)
Re: Hand-rolled parsers bart@videovax.tv.tek.com (Bart Massey) (1989-11-10)
Re: Hand-rolled parsers chris@mimsy.umd.edu (1989-11-12)
Re: Hand-rolled parsers diamond@ws.sony.junet (1989-11-14)
| List of all articles for this month |
From: firth@sei.cmu.edu (Robert Firth)
Newsgroups: comp.compilers
Date: 9 Nov 89 19:28:48 GMT
Organization: Carnegie-Mellon University (Software Engineering Institute), Pgh, PA

This addresses the question 'Why do people write recursive descent parsers by
hand, rather than using parser-generator tools such as lex and yacc?'. It
answers a slightly different question, namely, 'Why do I write RD parsers by
hand...', and as such represents a personal and biased view.


1. Ease of construction.


Given a decent grammar, I can write the basic RD parser in a couple of days,
and do some solid component testing along the way. Moreover, I can write it
in a reasonable language. The tools I've used, on the other hand, have made
the job much harder, for these reasons


. truly bizarre, unfriendly, and unreadable input format or syntax


. very unhelpful and unforgiving error messages. such as the message
    "12 shift/reduce conflicts" as the first level error message; the next
    level is a 50 page trace of the parser generator internal states.


. serious bugs in the lexer generator or parser generator, leading to any of:
    infinite loops, core dumps, silent generation of code skeletons that
    won't compile.


2. Flexibility


The hand lexer and parser are not committed to a single inadequate technique.
For example, an RD parser has no trouble distinguishing between the character
"'" as the start of an Ada character literal, and the character "'" as the
Ada attribute symbol. The average lexer generator cannot disambiguate these
cases. Another one I had trouble with was the difference between Pascal
"1.10" and "1..10".


As another example, when parsing an expression I can use good old operator
precedence techniques, and gain a lot of efficiency for a small loss of
robustness.


Finally, if I do tree building by hand I can straighten out on the fly a lot
of stuff that might cause trouble during later traversals, for example by
flattening nested lists, massaging slightly different productions into a
common format, and so on.


3. Error Handling


I've found most mechanically generated parsers to be truly dreadful at error
handling. A message like "syntax error on line 33" is not just amateur, it's
pathetic. No user should have to put up with such rubbish. As another
example, try giving a certain C compiler an input where the ";" has been
omitted before the "}". This is arguably the simplest repair possible, but
you'll get a stream junk consequential error messages.


With an RD parser, the error handling


. can be tailored closely to the error situation


. can give a very informative message


. can be tuned (in respect of both message and recovery
    action) based on usage statistics


4. Efficiency


RD parsers written by hand are much smaller and faster than the
machine-generated ones I've seen. One example from my own experience is a
parser for Modula-2, where the hand version is about 600 lines of code (24
kbytes object) and the machine version was over TWO MEGABYTES of object code
and static data.


For the future - one day I'll give up these antique ways, and move to modern
technology. But that technology will at have to be based on something at
least as powerful as attribute grammars or von-Wyngaarden grammars; as far as
I can see, mechanical context-free parsing just doesn't hack it.


As I said, just a personal view.
[From firth@sei.cmu.edu (Robert Firth)]





Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.