Re: grammar ambiguity

Chris F Clark <world!>
27 Feb 2000 02:42:08 -0500

          From comp.compilers


From: Chris F Clark <world!>
Newsgroups: comp.compilers
Date: 27 Feb 2000 02:42:08 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 00-02-051 00-02-104
Keywords: parse

William B. Clodius asked about why we switched from LALR with
non-terminal lookahead to minimal state LR:

> It appears that you preferred this implementation. What were the
> tradeoffs involved that led you to drop this implementation, and
> what has prevented you from reimplementing it in subsequent versions
> of YACC++?

The reasons are not as insightful (for this group) as I would like,
but I'll post them anyway.

The short answer to both questions is time.

The longer answer is that we originally switched because we were
certain we could implement both, but decided not to complicate the LR
implementation with our own non-standard extension until it settled.

A little history is in order here. We first implemented non-terminal
lookahead accidentally. We implemented the LR(0) machine from memory.
It seemed natural that one would have the whole vocabulary as indexes
into the actions within a state. Only later, when we were looking
something up, did we discover that "shift" and "goto" actions were
different and that a "normal" LR table distinguishes between terminal
and non-terminal indexes.
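To make the distinction concrete, here is a minimal sketch (mine, not
Yacc++'s code) of the two table layouts for a hypothetical one-rule
grammar `a: B c;`. The state numbers and the grammar are invented for
illustration; the point is only that a shift and a goto are both just
transitions on a symbol, which is why a from-memory LR(0)
implementation can merge them without noticing.

```python
# Textbook layout: separate ACTION (terminal-indexed) and GOTO
# (non-terminal-indexed) tables, for the hypothetical grammar  a: B c;
ACTION = {0: {"B": ("shift", 1)}, 1: {"c": ("shift", 2)}}
GOTO   = {0: {"a": 3}}

# Uniform layout: one table keyed by the whole vocabulary, terminals
# and non-terminals sharing a single index space within each state.
TABLE = {
    0: {"B": ("shift", 1), "a": ("goto", 3)},
    1: {"c": ("shift", 2)},
}

# Both encode the same transitions; the split is bookkeeping.
for state, edges in sorted(TABLE.items()):
    for symbol, act in sorted(edges.items()):
        print(state, symbol, act)
```

With the uniform layout, nothing in the machinery stops a non-terminal
from acting as a lookahead symbol, which is how the feature arose
accidentally.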

The curious aspect of using non-terminal lookaheads is that the parser
sometimes reduces right-to-left. The reason is that in the conflict
situations the parser has to reduce the lookahead non-terminal before
it reduces the non-terminal the lookahead is disambiguating. With the
LALR(1) engine we had at the time, that meant one non-terminal
(including the non-terminals inside it) could be reduced out-of-order.
At the time, we weren't certain what effect that would have on
usability. Moreover, we hadn't yet seen the paper on non-terminal
lookaheads, so we had no idea whether it was a useful feature or what
class of grammars it would accept. There were also complications when
the lookahead non-terminal had other non-terminals nested in it. We
invented a concept we called "non-terminal diving" to deal with that,
but I don't recall if we implemented it.
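The out-of-order reduction can be seen with a small invented grammar
in the same shape as the one in the footnote below: `a: b c | d e;
b: B; d: B; c: C; e: E;`. This sketch (my construction, not Yacc++'s
parser) just records the reduction order; on input `B C` the parser
must reduce `C` to `c` before it can reduce the earlier `B` to `b`,
so the reductions come out right-to-left.

```python
# Hypothetical grammar:  a: b c | d e;  b: B;  d: B;  c: C;  e: E;
# The parser cannot reduce B to b or d until it knows whether a c or
# an e follows, so the lookahead non-terminal is reduced first.

def parse(tokens):
    first, second = tokens          # e.g. ["B", "C"]
    reductions = []
    look = {"C": "c", "E": "e"}[second]
    reductions.append(look)         # reduce the lookahead first (C -> c)
    reductions.append("b" if look == "c" else "d")  # now disambiguate B
    reductions.append("a")
    return reductions

print(parse(["B", "C"]))   # ['c', 'b', 'a']: c is reduced before b
```

Note that `c` (derived from the second token) is reduced before `b`
(derived from the first), which is the right-to-left behavior
described above.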

As a result, we turned the non-terminal lookahead code off and
implemented the minimal state LR, loosely following David Spector's
SIGPLAN article. That turned out to introduce its own hair. In
particular, eventually one has to split states that the LR(0) engine
has merged. That code was problematic and temperamental.

The issue then was that once we implemented the LR splitting, we were
anxious to get the code out the door. Therefore, we decided that
shipping that code without the non-terminal lookahead was better than
trying to get both features working together.

Once we had the product out in the market, a whole new reality
appeared. Customers are more anxious for solutions to specific nits
than for whole new generalizations. The generalizations may put sizzle
in the ads and make us developers feel better. However, getting all
the details right (and documented in a way that lets people find the
answer to the question they are asking) is much more significant in
having happy users. As a result, the non-terminal lookahead got put on
the list of things that would get improved in the "spare time".

It wasn't until version 2.2 that we got a block of spare time large
enough to reimplement the non-terminal lookahead in the generator. Of
course, the new implementation was an "improvement" on the old one and
would use as many non-terminals of lookahead as it required.(*) In
addition, by that time we had a C++ grammar, typed in from one of
Stroustrup's books, that we were curious to see how it would fare.
The results were devastating. Essentially, the generator looped until
it used all available memory (some hundreds of thousands of states
later) and crashed.

That experiment led to making parts of the generator more robust. In
addition, we learned some things about how lookahead loops in the
grammar occur. Now, at 2.4, some of the robustness changes we made
have become useful for solving a normal parsing problem. Therefore, we
have finally merged the code into our main stream. However, with the
non-terminal lookahead turned on the generator still loops on certain
ambiguous grammars (including our C++ one). And we still don't have
the revised run-time support implemented. Thus, the feature is hidden
behind an undocumented switch.

My expectation is that it won't be released even in the next version
(2.5), as we have other more pressing customer requests to satisfy
first. However, I think that 2.6 has a good shot.

*) There is an interesting concept motivating the non-terminal
lookahead "improvement". Take a grammar with a conflict, say:

a: b c | d e; // where b and d conflict
b: B;
d: B;

Now, if the input matches the grammar, there must be some path through
the grammar that gets from the conflict state to the eof. In our
sample, the path matches either c (for b) or e (for d). If the
grammar is unambiguous, the lookahead "grammars" that get from each
conflicting non-terminal to the eof must be disjoint (i.e., there is
no f that is a valid lookahead for both b and d).

If we extracted the lookahead grammars for b and d, parsed the
remaining input with each, and chose to reduce b or d by noting whose
lookahead grammar matched, we would have resolved the conflict and
obtained the correct parse.
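For the sample grammar above, the idea can be sketched in a few lines
(this is my reading of the concept, not Yacc++'s implementation; the
lookahead grammars here are trivially single-symbol, with `c: C;` and
`e: E;` assumed as the continuations):

```python
# Lookahead "grammar" for each conflicting non-terminal: the input that
# must follow for that choice to be correct (c for b, e for d).
LOOKAHEAD_GRAMMAR = {"b": ["C"], "d": ["E"]}

def resolve_conflict(rest):
    """Parse the remaining input with each lookahead grammar and
    reduce whichever non-terminal's grammar matched."""
    matches = [nt for nt, la in LOOKAHEAD_GRAMMAR.items() if rest == la]
    # If the grammar is unambiguous, the lookahead grammars are
    # disjoint, so exactly one can match.
    assert len(matches) == 1
    return matches[0]

print(resolve_conflict(["C"]))  # b: remaining input matched b's grammar
print(resolve_conflict(["E"]))  # d
```

The disjointness assertion is exactly the unambiguity condition from
the paragraph above: an input matching both lookahead grammars would
mean two parses for the same string.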

Further, it seems that the lookahead grammars must each be "simple"
based upon the following argument. If the non-terminal appears in the
top-level (goal) rule, then the lookahead path is simply the items in
the rule that follow the non-terminal. If the non-terminal appears in
a one-deep nested rule, the lookahead is the items to complete that
rule followed by the items following it in the top-level rule. And so
on, by induction. Even if the conflict occurs in a recursive rule, it
seems that there still must be a regular expression (with elements
being drawn from the set of terminals and non-terminals) that
describes the path to the eof.
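The induction can be sketched directly: the path from a non-terminal
to the eof is the items after it in its rule, followed by the path of
that rule's own left-hand side, and so on up to the goal. The grammar
below (`goal: X q; X: b r;`) is a hypothetical one of mine, chosen
only to show the construction.

```python
# Hypothetical grammar:  goal: X q;   X: b r;
RULES = {
    "goal": ["X", "q"],
    "X":    ["b", "r"],
}

def path_to_eof(nt, goal="goal"):
    """Items remaining between non-terminal nt and the eof: finish the
    enclosing rule, then whatever follows that rule's left-hand side.
    Note: a recursive rule would make this recurse forever, which
    echoes how lookahead loops arise in the generator."""
    if nt == goal:
        return []
    for lhs, rhs in RULES.items():
        if nt in rhs:
            i = rhs.index(nt)
            return rhs[i + 1:] + path_to_eof(lhs, goal)
    return []

print(path_to_eof("b"))   # ['r', 'q']: finish X, then finish goal
```

Each level of nesting contributes a flat sequence of items, which is
why the overall path stays a regular expression over terminals and
non-terminals in the non-recursive case.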

For the grammar to be ambiguous, there must be a place where the paths
converge (have the same set of items on their way to eof) and describe
the same input (set) between the conflict and the convergence.

The question is whether the simplicity of the lookahead grammars
forces the state generation algorithm to converge for all unambiguous
grammars. My intuition says yes. However, the theorem that the
ambiguity of cfl's is undecidable might make that not so, particularly
in light of how easily ambiguity is detected once the paths exist.
The way our prototype appears to loop on some grammars suggests that
there are grammars that don't converge (or that we have a bug).


Chris Clark                   CompuServe : 74252,1375
Compiler Resources, Inc.      voice : (508) 435-5016
3 Proctor Street              fax   : (508) 435-4847 (24 hours)
Hopkinton, MA 01748 USA       Web Site in Progress