From: | Chris F Clark <cfc@shell01.TheWorld.com> |
Newsgroups: | comp.compilers |
Date: | 15 Apr 2004 12:29:41 -0400 |
Organization: | The World Public Access UNIX, Brookline, MA |
References: | 04-03-042 04-03-057 04-03-072 04-03-091 04-04-029 |
Keywords: | parse |
Posted-Date: | 15 Apr 2004 12:29:41 EDT |
Clint Olsen asked some questions regarding LR+RE parsing.
> When you mean 'integrate', do you mean deciding if the language at a
> certain non-terminal is regular and therefore can be handled with a
> DFA, or are you talking about EBNF notation?
By integrate I mean merge/combine, use both in an intertwined manner,
where one algorithm handles both aspects. So, yes, the input is in
EBNF, but more importantly, the right-hand-sides of each rule are
treated as a regular expression that the parser builds "into" the
tables as an appropriate DFA. Likewise, at lexical time, token
definitions can be recursive, because the engine has a push-down stack
as well as a DFA. The engine for the lexer and the parser are both
essentially the same. There is no DFA model and a separate DPDA
model. It is a combined DPDA/DFA model.
> In yacc you use left recursion in 'one-or-more items' contexts to keep
> the stack size the lowest, but integrating the actions for both cases
> becomes problematic - on the first item, initialize a list; append to
> the list every time after. The RE notation is inherently right
> recursive, which maps nicely to LL but not LR. Your documentation
> seems to imply that the necessary states to handle the two cases in a
> traditional yacc implementation are created automagically, but you
> don't specify what you do with the action code in those cases.
We don't use recursion (left or right) to implement the closure (* and
+) operators. We simply make the parser run the correct DFA as part
of that rule (and the DFA is simply part of the table, it isn't
stepping outside to some other mechanism).
As to the action code, the stack now is not fixed length. If you have
a regular expression on the right-side, the context of the stack will
vary depending as to what the expression matched. An example would
help here.
// grammar (fragment)
a_comma_list: "a" ("," "a")*;
{ int i; for (i = 0; i < yy_psr_last(); ++i ) {
cout << yy_psr_ref(i).yy_type();
} }
// sample input
a,a,a
The yy_psr_last() function returns the length (number of tokens) of
the text that was matched, 5 for the sample input (3 a's and 2
commas). The yy_psr_ref(i) functions returns the tokens in sequence,
for the sample input "a" "," "a" "," "a" and the yy_type() function
returns the type number assigned for that token, perhaps 1 for "a" and
2 for ",". As a result, the example would print 1 2 1 2 1 for the
sample input.
As a practical matter, we recommend using another feature of Yacc++
that automatically constructs the appropriate ast. Again, we use a
similar approach and make variadic objects (i.e. the number of
children vary and one loops over them the same way). The result is
very flat structures without the tower-to-the-left or
tower-to-the-right nature of binary tree representations. One simply
has a parent with n-children, all equally accessible because the
parent contains an array of (pointers to) them.
Note this running of the DFA is actually very similar to what some
LL-regular parsers do. That is once one recognizes which rule to
apply, one simply "runs" the regular expression that matches that
rule. It isn't as obvious that this happens in a recursive descent
parser because the stack is implicit, one simply eats the tokens
desired (possibly assigning them to local variables).
There are some complications in the LR world, as one might be
"running" two separate rules that match the current prefix at the same
time, and the rules might have started at different places on the
stack, and one rule might stop matching before the tokens of the other
rule are swallowed (i.e. the prefix no longer matches both). However,
these (and other related problems) are all solvable in the "LR"
framework.
The key point is that one doesn't have to convert the regular
expressions into recursive rules to make the process work, one can
work directly with regular right hand sides. Moreover, it isn't that
hard to do once one knows a few simplifying ideas.
One of the simplifying ideas was illustrated above. Regular
expressions don't preserve structure. Thus, it is valid to throw that
structure away when parsing a right-hand-side that has a regular
expression in it. To further illustrate this point, the following
fragment puts the same contents on the parser stack as the one above
(even though we rearranged where the regular expression operator is).
a_comma_list: ("a" ",")* "a";
If one wants structure, one must introduce non-terminals to give the
text structure. The next fragments are subtly different than the ones
above and each other.
a_comma_list: part1 "a";
part1: ("a" ",")*;
a_comma_list: part1a* "a";
part1a: "a" ",";
a_comma_list: part1b? "a";
part1b: ("a" ",")+;
a_comma_list: part1c? "a";
part1: a_comma_list ",";
a_comma_list: "a" part2;
part2: ("," "a")*;
...
Finally, as you asked (in a round about way), the merged algorithm
does degenerate into either a DFA algorithm or a DPDA algorithm for
certain inputs. (Not that we ever think of it that way.) If one uses
only regular expressions and no recursion, the resulting tables will
simply be a DFA state machine and the stack part of the mechanism
won't ever be invoked. Similarly, if one uses a traditional yacc
grammar with only recursive rules and no regular expressions, all the
right hand sides will be parsed as a simple sequence without using a
complicated DFA and the resulting stack contents will always be fixed
length.
I hope this answers your question (and doesn't do so in a tedious
way). I did actually omit some details, but I think for a first cut
this should cover the rough aspects of the merged model.
It is worth metioning, that not all LR parser enthusiasts agree with
me on these points. For example, the idea that regular expressions
should erase structure, while prevalent in the normal treatment of
regular expressions, does not sit well with many folks who believe the
structure should be preserved whether written as a recursive rule or
as a regular expression (Kannapin being one of this group).
Hope this helps,
-Chris
*****************************************************************************
Chris Clark Internet : compres@world.std.com
Compiler Resources, Inc. Web Site : http://world.std.com/~compres
23 Bailey Rd voice : (508) 435-5016
Berlin, MA 01503 USA fax : (978) 838-0263 (24 hours)
------------------------------------------------------------------------------
Return to the
comp.compilers page.
Search the
comp.compilers archives again.