Re: Advice on Writing a Parser Generator

Chris F Clark <>
23 Aug 2003 23:09:17 -0400

          From comp.compilers


From: Chris F Clark <>
Newsgroups: comp.compilers
Date: 23 Aug 2003 23:09:17 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 03-08-027 03-08-040 03-08-048 03-08-055
Keywords: tools, practice
Posted-Date: 23 Aug 2003 23:09:17 EDT

Joachim Durchholz wrote:
> Wouldn't it be better if the tables simply contained the addresses of
> action functions? I see a whole lot of advantages in such an approach:

Several parser generators have used that approach. The ones that come
most readily to mind are the ones written by Paul Mann: LALR, autom,
... so, obviously, at least one parser generator writer agrees with
you.
Of course, that implementation depends upon being able to "take the
address of functions", which not all languages allow.

Of course, each of your advantages has a corresponding set of
disadvantages. :-) (The rest should not be seen as a criticism of the
proposal, just an attempt to make it clear that there isn't a
one-size-fits-all engineering solution.)

> 1. We'd get just large tables. The huge switch statements tend to drive
> compilers nuts, either during parsing or during optimization.

You have substituted the creation of lots of small functions (see the
moderator's note) which are called by address, which is beyond the
capacity of most current optimizers to inline effectively. Note, I
don't really like that argument, because if lots of code started using
that idiom, the optimizers would get rewritten to support it
efficiently, especially if it got incorporated into a standard
benchmark. Still, an optimizer is more likely to be able to
successfully optimize code embedded in the branches of a case
statement than code in lots of little functions called through a
table of pointers.

Additionally, you have partitioned the data space of the action code,
so that it is not possible to use "local" variables from one action to
the next. You may see that as either an advantage or a disadvantage.
Again, different languages have different mechanisms for supporting
the sharing of data between disjoint functions, so what works for one
language might not be so good for another.

> 2. It would improve separation between the yacc-generated code and the
> action code. The action code can modify all parts of the parser state,
> it must be written to avoid accidentally clobbering parser state, etc.
> In other words, it's extremely unmodular - and if anybody exploits the
> advantages of non-modularity, his code will almost certainly break with
> the next change in yacc. (IIRC yacc had more than its share of
> compatibility issues.)

Separating the actions from the parser is good and can increase
modularity. One of the noted authors, perhaps Robert Martin or Jack
Crenshaw, wrote up a pattern on "indirect visitors" (in Dr. Dobb's, if
I recall correctly) where he exploited having separate functions in a
table. In particular, it allows one to change what the parser "does"
(i.e. the actions) by simply changing the functions invoked, and that
is divorced from parsing.

On the downside, having the functions separate from the parser means
that to access any information within the parser, the parser must
provide access mechanisms (e.g. get [and perhaps set] functions).
This may expose the parser internals at more uncontrolled places than
simply the actions which occur during the parse.

More importantly, not having easy access to the parser's internals is
only one step in protecting them, and it can provide a false sense of
security. Programs which step out of array bounds or access invalid
pointers can still trash the internals of a parser (and don't tell me
about Purify, BoundsChecker, Insure++, or even Third Degree; they
fail to detect all cases of problems--nor have I found that functional
languages solve the problem completely).

> 3. Oh, and finally, we'd not have to manually match up rule numbers and
> action code :-)

And, yes, your parser generator ought to mechanize that process. Of
course, there are times when you would like to intervene and "hack"
that process to implement something that the author of the parser
generator didn't envision. So, sometimes you want to manually match
up the rule numbers and actions.


Now, with all of the above said, I have implemented both in various
versions of Yacc++, and my experience is that the case statement
approach is generally far superior in measurable performance to the
little-function approach with a wide variety of compilers (when the
performance matters, which it usually doesn't). Moreover, if one
takes the time to optimize the code a little bit by folding duplicate
cases into a single one, the actual number of cases is quite small
and doesn't give compilers a problem. In the long run, I intend for
Yacc++ to support both models, as each has its advantages, and in most
cases it isn't the critical choice.

The advantage to users of being able to have local variables that
persist from one case to another is not imagined and is a definite
convenience for them that should not be underestimated. However, a
language with good packaging of the "little" functions, so that users
could share exactly the data they want, would totally overcome this
advantage and promote more reliability too.

The point about making the case statement a separate function from the
code which does the parsing is worthwhile and enables enough ability
for the user to intercede (perhaps overriding the case function via
inheritance). The cost of a procedure call at that point is not
sufficient to negatively affect the performance of any "real" grammar
that I have ever measured (it is generally well within the error bars
of the testing).

The point about protecting your parser from errant action code is also
not imagined, particularly if one provides the convenience of a
"parser stack" that carries user data for them. It is surprisingly easy
to write code that looks correct but which trashes the parser's stack
in various devious ways. Your users are going to assume that the
problem lies in the parser and not their code until you can *prove*
otherwise. That is not an argument for either cases or functions, but
for checking that the "impossible" doesn't happen and that your well-
defined pointers are still valid before accessing them.

Note, accessors that allow users to manipulate the parser's internal
state are actually a good idea even if you use the case approach
(provided you make the parser check its sanity). You would be amazed
at the problems that clients have solved simply by making judicious
inspections and/or adjustments to the parser's internal state at the
appropriate points.

Hope this helps,

Chris Clark Internet :
Compiler Resources, Inc. Web Site :
19 Bronte Way #33M voice : (508) 435-5016
Marlboro, MA 01752 USA fax : (508) 251-2347 (24 hours)
