RE: Partially handwritten parsers


          From comp.compilers

Related articles
Partially handwritten parsers (Lowell Thomas) (2009-05-30)
RE: Partially handwritten parsers (Quinn Tyler Jackson) (2009-06-01)
RE: Partially handwritten parsers (Lowell Thomas) (2009-06-19)

From: "Quinn Tyler Jackson" <>
Newsgroups: comp.compilers
Date: Mon, 1 Jun 2009 10:08:33 -0700
Organization: Compilers Central
References: 09-05-137
Keywords: parse
Posted-Date: 02 Jun 2009 13:25:14 EDT

Lowell Thomas said:

> I've read many of the previous comments here about handwritten parsers and
> it's a "mixed bag" to quote a couple of posts. The chief
> criticism seems to be that you quickly lose sight of exactly what
> language it is you are parsing. But many agree that speed favors
> hand writing.

The moderator commented:

[How often is the parser a significant part of a program's runtime? The
scanner is usually much slower. -John]

The Meta-S parser is scannerless and works on memory-based (rather
than file-based) input. On linear grammars, the single most expensive
operations are the making and breaking of strings. Optimizations such
as reference-counting the strings and allocating from a specialized
pool yield significant performance gains. (All of this is explored in
tedious detail in Adapting to Babel, so I won't go into the specifics
here.)

That said, I do not agree with the comment that "speed favors hand
writing" in a general sense. Properly designed generated parsers can
benefit from something that hand-written parsers cannot: the "optimize
once, improve everywhere" concept.

Certain optimizations that one can effect (such as memoization, for
instance) will improve performance across all grammars, all parsers
generated from those grammars, and in all systems that use those
parsers. Yes, if a hand-written parser relied on some library, and
that library were optimized, that improvement would also be distributed
across all parsers that used that library, but that is not what I'm
talking about here.

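As an illustration of "optimize once, improve everywhere," here is a
minimal Python sketch (the runtime, names, and toy grammar are my own
invention, not Meta-S): memoization added once in the parser runtime
caches every (rule, position) result, so any rule re-tried after
backtracking is answered from the table instead of being re-evaluated.

```python
def memoized(rule):
    """Added once in the runtime; every generated rule benefits."""
    def wrapped(self, pos):
        key = (rule.__name__, pos)
        if key not in self.memo:
            self.evals += 1              # count real rule evaluations
            self.memo[key] = rule(self, pos)
        return self.memo[key]
    return wrapped

class Parser:
    # Toy grammar:  E ::= T "+" E | T ;   T ::= "n"
    # On input "n", E's first alternative matches T, fails at "+", and
    # the second alternative re-tries T at the same position -- the memo
    # table answers that second call without re-evaluating T.
    def __init__(self, text):
        self.text, self.memo, self.evals = text, {}, 0

    @memoized
    def E(self, pos):
        t = self.T(pos)
        if t is not None and self.text[t:t + 1] == "+":
            e = self.E(t + 1)
            if e is not None:
                return e
        return self.T(pos)               # second alternative: E ::= T

    @memoized
    def T(self, pos):
        return pos + 1 if self.text[pos:pos + 1] == "n" else None

p = Parser("n")
assert p.E(0) == 1                       # consumed the whole input
assert p.evals == 2                      # T evaluated once, called twice
```

Without the decorator the same parse evaluates T twice; on grammars
with heavy backtracking this is the usual packrat-style improvement,
and it lands in every parser the generator emits.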
In many cases, grammars can be analyzed at grammar-compile-time and
optimized globally and in ways that are specific to that grammar, and
in a deterministic way. It has been my experience that in linear
grammars parsed in a top-down fashion, the following heuristic
applies: "Failure is more expensive than success."

An example of this in a trivial top-down case:

grammar X {
      S ::= K ":" L;
      K ::= a|b|c;
      L ::= K;
      a ::= "apple"|"ape"|"appear"|"age";
      b ::= "bear"|"ball"|"boy"|"bridge";
      c ::= x|y;
      x ::= "call"|"cure"|"compiler";
      y ::= "consequence"|"condition"|"crow"|z;
      z ::= "center"|"correspond";
}

Given the input "center:correspond", a naive top-down parser will:

1. S descends into K.
2. K descends into a and fails.
3. K descends into b and fails.
4. K descends into c.
5. c descends into x and fails.
6. c descends into y.
7. y descends into z and matches "center".
8. S matches ":" and then descends into L.
9. L descends into K.
10. Steps 2-6 are repeated against "correspond".
11. y descends into z and matches "correspond".
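To make the cost of those failures concrete, here is a sketch of that
naive parse in Python (a stand-in for illustration, not Meta-S), with a
counter on rule entries; the rule names follow the grammar above.

```python
# Each rule is a list of alternatives; an alternative that names another
# rule is a reference, anything else is a literal.
RULES = {
    "K": ["a", "b", "c"],
    "a": ["apple", "ape", "appear", "age"],
    "b": ["bear", "ball", "boy", "bridge"],
    "c": ["x", "y"],
    "x": ["call", "cure", "compiler"],
    "y": ["consequence", "condition", "crow", "z"],
    "z": ["center", "correspond"],
}

descents = 0                      # how many rules we enter

def match(rule, text, pos):
    """Naive top-down matching: try every alternative in order."""
    global descents
    descents += 1
    for alt in RULES[rule]:
        if alt in RULES:          # nonterminal reference
            end = match(alt, text, pos)
        else:                     # literal
            end = pos + len(alt) if text.startswith(alt, pos) else None
        if end is not None:
            return end
    return None

def S(text):                      # S ::= K ":" L ;  L ::= K
    end = match("K", text, 0)
    if end is None or not text.startswith(":", end):
        return None
    return match("K", text, end + 1)

assert S("center:correspond") == len("center:correspond")
assert descents == 14             # 7 rule entries per side of the ":"
```

Of the fourteen rule entries, ten exist only to fail: every descent
into a, b, and x, plus the failed literal alternatives of y.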

A "simple" optimization:

Before descending into a rule, first peek ahead one character to see
whether a match is even possible. For instance, rule a can only match
if the next character of input is an 'a', rule b only if it is a 'b',
and rule c only if it is a 'c'. The optimized top-down case, given the
same input:

1. S descends into K.
2. K skips a and b altogether and descends into c directly.
3. et cetera.

The net effect of descending into fewer rules that are bound to fail is
that the parse completes more quickly.

OK, now my point: in a generated parser, these lookaheads can be
generated from the grammar itself. If the grammar changes (for
instance, if rule z is changed to "center"|"correspond"|"quux"), the
lookaheads change with it.

If the lookahead-building algorithm is improved, the optimization
carries across the whole parse. For instance, suppose the lookaheads
examine two characters before descending:

The caller of a only descends into a if the next two characters match the
regular expression 'a[pg]'.
The caller of b only descends into b if the next two characters match the
regular expression 'b[eaor]'.
Et cetera.

This one change to the algorithm carries through the whole system.

Doing that in a hand-written parser would, IMO, make the system
exceptionally brittle and stubbornly resistant to change.

Quinn Tyler Jackson
Chief Scientist
Thothic Technology Partners, LLC
