Re: Preview in grammar

Chris F Clark <cfc@shell01.TheWorld.com>
15 May 2006 22:22:28 -0400

          From comp.compilers

Related articles
[9 earlier articles]
Re: Preview in grammar franku@fmi.uni-passau.de (Ulrich Frank) (2006-05-11)
Re: Preview in grammar Meyer-Eltz@t-online.de (Detlef Meyer-Eltz) (2006-05-12)
Re: Preview in grammar cfc@shell01.TheWorld.com (Chris F Clark) (2006-05-12)
Re: Preview in grammar pbmann@gmail.com (2006-05-12)
Re: Preview in grammar pbmann@gmail.com (2006-05-12)
Re: Preview in grammar pbmann@gmail.com (2006-05-15)
Re: Preview in grammar cfc@shell01.TheWorld.com (Chris F Clark) (2006-05-15)
| List of all articles for this month |

From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: 15 May 2006 22:22:28 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 06-05-011 06-05-037 06-05-044
Keywords: parse
Posted-Date: 15 May 2006 22:22:28 EDT

Paul Mann wrote:
> When I implemented this algorithm, it was not necessary to unpack
> the parser tables. You have the parser algorithm already.


Paul is correct. Using his algorithm you don't need to unpack the
tables (instead you need to be able to save an restore the parser
stack) because he is letting the parser unpack the tables. In
contrast, the algorithm I outlined requires unpacking the tables but
not saving the state, because the algorithm is "simulating" what the
parser would do without actually running the parser. One can see that
they are essentially the same, just with a different division of
labor. In either case, one can extract the "legal next token"
information from LALR tables (and not just LR ones).


His other point was correct also. With LALR tables one needs a
dynamic (run-time) algorithm for determining the "legal next token"
because state merging has destroyed some of the properties of the
tables (by merging them) that are necessary to solve the problem with
a static (compile-time) algorithm. The dynamic algorithm determines
at run-time whether this reduction of some non-terminal due to some
token is the result of a merge of the tables where the resulting state
that one gets to after the reduction doesn't allow the token.
Canonical LR tables preserve the required static properties by not
merging states.


Example:


context1: "(" nt1 ")"; // nt1 has ")" as follow token


context2: "if" "(" nt1 "+"; // nt1 has "+" as follow token


In LALR parsers, the states for reducing nt1 may get merged with a
lookahead of both ")" and "+" as the reduce tokens. Depending on the
actual input, it is likely that only one of the contexts applies and
thus one of the tokens "(" or "+" is not legal. However, you need to
unwind the stack to see if you are in context1 or context2 (or
potentially either) to determine which token is legal (or if both
are).


That being said, my recollection is that canonical LR tables are
overkill. Canonical LR will not merge states that have different
lookaheads, so that you get potentially 3 reduce states for nt1, ")"
legal (in context1), "+" legal (in context2), and both legal (could be
in either context and the lookahead disambiguates it). This makes the
check trivial. However, to do that canonical LR parsers have to keep
separates states for all the items leading up to the different reduce
states also. This can cause an exponential increase in the parser
table size.


Not all state mergings are problematic. First of all, some mergings
do not affect the list of follow tokens (ones that can cause a reduce
if they follow the non-terminal in question). In fact, it is quite
common for a non-terminal to appear in several different rules, but
the context of the rules still generate the same follow sets. Here is
an example of that situation.


context3: nt2 "(" ")" | nt2 nt3 "."; // follow for nt2 is "(" + first(nt3 ".")


context4: "[" nt2 nt3 "." "]" | nt2 "(" expr+ ")"; // same follows for nt2


On the other hand, sometimes the rule changes the follow set not just
for the non-terminal in question but for a non-terminal embedded at
the end of a rule that defines that non-terminal. Or if the
non-terminal is nullable, for a preceding non-terminal in some rule.
Example:


nt2: nt4 "+" nt4; // merges that affect the lookahead of nt2 also
                                    // affect nt4.


nt3: nt5 | /* empty */; // nt3 nullable,
                                                // thus merges affecting nt3 can affect nt2
                                                // (see context3 above)


Finally, the distinction between LR and LALR grammars is overstated in
most cases. I have seen very few grammars (in practice maybe none)
that are LR and not LALR. In contrast, I've seen many grammars that
are LALR(2), LALR(3), or GLR and not LALR(1). Since GLR algorithms
have their own interesting merge properties, which I believe are also
dynamic, it is (in my opinion) better to just get used to asking the
question dynamically.


This is probably getting into overkill now. However, hopefully it has
given some appreciation into the distinction of what one knows at
compile (parser generation) time versus run (actually parsing) time
and the subtleties of rule interactions.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.