Re: Trouble understanding LL(k) vs. LALR(k) etc.

Chris F Clark <cfc@shell01.TheWorld.com>
15 Apr 2004 12:29:41 -0400

From comp.compilers

Related articles
[2 earlier articles]
Re: Trouble understanding LL(k) vs. LALR(k) etc. cfc@shell01.TheWorld.com (Chris F Clark) (2004-03-15)
Re: Trouble understanding LL(k) vs. LALR(k) etc. cdc@maxnet.co.nz (Carl Cerecke) (2004-03-19)
Re: Trouble understanding LL(k) vs. LALR(k) etc. derkgwen@HotPOP.com (Derk Gwen) (2004-03-26)
Re: Trouble understanding LL(k) vs. LALR(k) etc. cfc@shell01.TheWorld.com (Chris F Clark) (2004-03-26)
Re: Trouble understanding LL(k) vs. LALR(k) etc. f40mczf02@sneakemail.com (Karsten Nyblad) (2004-04-03)
Re: Trouble understanding LL(k) vs. LALR(k) etc. clint@0lsen.net (Clint Olsen) (2004-04-14)
*Re: Trouble understanding LL(k) vs. LALR(k) etc. cfc@shell01.TheWorld.com (Chris F Clark)* (2004-04-15)**
Re: Trouble understanding LL(k) vs. LALR(k) etc. clint@0lsen.net (Clint Olsen) (2004-04-15)
DFA's and LR machine (was Re: Trouble understanding...) cdc@maxnet.co.nz (Carl Cerecke) (2004-04-21)
Re: Trouble understanding LL(k) vs. LALR(k) etc. cfc@shell01.TheWorld.com (Chris F Clark) (2004-04-21)
Re: DFA's and LR machine (was Re: Trouble understanding...) cdc@maxnet.co.nz (Carl Cerecke) (2004-05-08)
Re: DFA's and LR machine (was Re: Trouble understanding...) cfc@shell01.TheWorld.com (Chris F Clark) (2004-05-09)

| List of all articles for this month |

From:	Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups:	comp.compilers
Date:	15 Apr 2004 12:29:41 -0400
Organization:	The World Public Access UNIX, Brookline, MA
References:	04-03-042 04-03-057 04-03-072 04-03-091 04-04-029
Keywords:	parse
Posted-Date:	15 Apr 2004 12:29:41 EDT

Clint Olsen asked some questions regarding LR+RE parsing.

> When you mean 'integrate', do you mean deciding if the language at a
> certain non-terminal is regular and therefore can be handled with a
> DFA, or are you talking about EBNF notation?

By integrate I mean merge/combine, use both in an intertwined manner,
where one algorithm handles both aspects. So, yes, the input is in
EBNF, but more importantly, the right-hand-sides of each rule are
treated as a regular expression that the parser builds "into" the
tables as an appropriate DFA. Likewise, at lexical time, token
definitions can be recursive, because the engine has a push-down stack
as well as a DFA. The engine for the lexer and the parser are both
essentially the same. There is no DFA model and a separate DPDA
model. It is a combined DPDA/DFA model.

> In yacc you use left recursion in 'one-or-more items' contexts to keep
> the stack size the lowest, but integrating the actions for both cases
> becomes problematic - on the first item, initialize a list; append to
> the list every time after. The RE notation is inherently right
> recursive, which maps nicely to LL but not LR. Your documentation
> seems to imply that the necessary states to handle the two cases in a
> traditional yacc implementation are created automagically, but you
> don't specify what you do with the action code in those cases.

We don't use recursion (left or right) to implement the closure (* and
+) operators. We simply make the parser run the correct DFA as part
of that rule (and the DFA is simply part of the table, it isn't
stepping outside to some other mechanism).

As to the action code, the stack now is not fixed length. If you have
a regular expression on the right-side, the context of the stack will
vary depending as to what the expression matched. An example would
help here.

// grammar (fragment)
a_comma_list: "a" ("," "a")*;
    { int i; for (i = 0; i < yy_psr_last(); ++i ) {
                  cout << yy_psr_ref(i).yy_type();
    } }

// sample input
a,a,a

The yy_psr_last() function returns the length (number of tokens) of
the text that was matched, 5 for the sample input (3 a's and 2
commas). The yy_psr_ref(i) functions returns the tokens in sequence,
for the sample input "a" "," "a" "," "a" and the yy_type() function
returns the type number assigned for that token, perhaps 1 for "a" and
2 for ",". As a result, the example would print 1 2 1 2 1 for the
sample input.

As a practical matter, we recommend using another feature of Yacc++
that automatically constructs the appropriate ast. Again, we use a
similar approach and make variadic objects (i.e. the number of
children vary and one loops over them the same way). The result is
very flat structures without the tower-to-the-left or
tower-to-the-right nature of binary tree representations. One simply
has a parent with n-children, all equally accessible because the
parent contains an array of (pointers to) them.

Note this running of the DFA is actually very similar to what some
LL-regular parsers do. That is once one recognizes which rule to
apply, one simply "runs" the regular expression that matches that
rule. It isn't as obvious that this happens in a recursive descent
parser because the stack is implicit, one simply eats the tokens
desired (possibly assigning them to local variables).

There are some complications in the LR world, as one might be
"running" two separate rules that match the current prefix at the same
time, and the rules might have started at different places on the
stack, and one rule might stop matching before the tokens of the other
rule are swallowed (i.e. the prefix no longer matches both). However,
these (and other related problems) are all solvable in the "LR"
framework.

The key point is that one doesn't have to convert the regular
expressions into recursive rules to make the process work, one can
work directly with regular right hand sides. Moreover, it isn't that
hard to do once one knows a few simplifying ideas.

One of the simplifying ideas was illustrated above. Regular
expressions don't preserve structure. Thus, it is valid to throw that
structure away when parsing a right-hand-side that has a regular
expression in it. To further illustrate this point, the following
fragment puts the same contents on the parser stack as the one above
(even though we rearranged where the regular expression operator is).

a_comma_list: ("a" ",")* "a";

If one wants structure, one must introduce non-terminals to give the
text structure. The next fragments are subtly different than the ones
above and each other.

a_comma_list: part1 "a";
part1: ("a" ",")*;

a_comma_list: part1a* "a";
part1a: "a" ",";

a_comma_list: part1b? "a";
part1b: ("a" ",")+;

a_comma_list: part1c? "a";
part1: a_comma_list ",";

a_comma_list: "a" part2;
part2: ("," "a")*;

...

Finally, as you asked (in a round about way), the merged algorithm
does degenerate into either a DFA algorithm or a DPDA algorithm for
certain inputs. (Not that we ever think of it that way.) If one uses
only regular expressions and no recursion, the resulting tables will
simply be a DFA state machine and the stack part of the mechanism
won't ever be invoked. Similarly, if one uses a traditional yacc
grammar with only recursive rules and no regular expressions, all the
right hand sides will be parsed as a simple sequence without using a
complicated DFA and the resulting stack contents will always be fixed
length.

I hope this answers your question (and doesn't do so in a tedious
way). I did actually omit some details, but I think for a first cut
this should cover the rough aspects of the merged model.

It is worth metioning, that not all LR parser enthusiasts agree with
me on these points. For example, the idea that regular expressions
should erase structure, while prevalent in the normal treatment of
regular expressions, does not sit well with many folks who believe the
structure should be preserved whether written as a recursive rule or
as a regular expression (Kannapin being one of this group).

Hope this helps,
-Chris

*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Trouble understanding LL(k) vs. LALR(k) etc.

Chris F Clark <cfc@shell01.TheWorld.com>15 Apr 2004 12:29:41 -0400

Chris F Clark <cfc@shell01.TheWorld.com>
15 Apr 2004 12:29:41 -0400