Re: Recursive Descent vs. LALR

bear@sonic.net
4 Jul 2003 00:11:57 -0400

From comp.compilers

Related articles
Recursive Descent vs. LALR JohnMResler@netscape.net (John Resler) (2003-06-20)
Re: Recursive Descent vs. LALR bear@sonic.net (2003-06-25)
Re: Recursive Descent vs. LALR cfc@shell01.TheWorld.com (Chris F Clark) (2003-07-02)
Re: Recursive Descent vs. LALR kamalpr@yahoo.com (2003-07-03)
Re: Recursive Descent vs. LALR John.M.Resler@Boeing.com (Boeing) (2003-07-04)
*Re: Recursive Descent vs. LALR bear@sonic.net* (2003-07-04)**

| List of all articles for this month |

From:	bear@sonic.net
Newsgroups:	comp.compilers
Date:	4 Jul 2003 00:11:57 -0400
Organization:	...disorganized...
References:	03-06-093 03-06-117 03-07-024
Keywords:	parse
Posted-Date:	04 Jul 2003 00:11:57 EDT

Chris F Clark wrote:
>
> Note, this question has been asked before, and I've answered it before
> (as well as have others). In some ways, I'm a little surpised it
> isn't in the FAQ, but I guess it isn't asked that frequently, only
> about once per year or two. Here is more than you wanted to know.

I thank you for responding, especially in such detail, but I must
admit that several points of your response elicit confusion or refer
to formalisms or parser models that aren't immediately clear.

> 2) Dike grammars
>
> These are grammars that handle parenthesis (and only parenthesis).
> They have only central recursion.

I had not heard of this classification before. They have some
interesting
properties, such as being able to parse left-to-right with equal ease as
right-to-left. But they don't seem very expressive.

> 2) LL(k) (top-down or predictive) grammars
>
> The k refers to how many tokens of lookahead one needs at the be
> examine at the start of a rule to determine which rule (or
> alternative of a rule) to apply.
>
> If at the beginning of any non-terminal in the grammar, one can
> always look at just one token and decide whether the rule should be
> apllied or not (to a syntactically legal input), the grammar is
> LL(1). This is the most common case.

Nonterminals are just symbols. They don't have beginnings, in a
way that's useful to think about. What I think you mean is that,
given the *first* nonterminal in the expectation sequence of the
parser, and the *first* unshifted token of input, you should be
able to uniquely determine which reduction rule to use.

In terms of the grammar, this means that the 'before' sequence
of every reduction rule consists of a single nonterminal and the
'after' sequence of every rule starts with a single terminal token,
and that no two different grammar rules can have both identical
'before' sequences and 'after' sequences that start with the same
terminal token.

In parsing, when the reduction rule is applied to the expectation
sequence, it replaces the nonterminal token at the beginning of the
expectation sequence with the 'after' sequence of the rule, so now
the expectation sequence starts with a terminal token that matches
the first unshifted token of input. So a 'Left shift' happens,
simultaneously removing the nonterminal from the expectation sequence
and the unshifted input.

(note that there are also right-linear grammars, where a 'Right Shift'
is used exposing new nonterminals to match at the end of the
expectation sequence rather than the beginning, but I have not seen
them used in programming languages.)

> Also, some forms of central recursion mixed with right recursion
> are not LL(1). In particular, the common if-then/if-then-else rule
> of most commonly used programming languages is not LL(k) for any k.

> This is not as big of problem as it would seem as there is a simple
> easy way (hack) to extend LL parsers to handle the if-then/if-then-
> else conflict by having it simply choose the else version if and
> only if the else is present. This binds the else to the closest
> if, which is the normally desired semantics. Most LL parser
> generators will automatically do this for you with at most a
> warning.

What happens is something like this:

('Statement) --> ("if" "(" 'Boolean-Expr ")" 'Statement 'Optional-Else )
('Optional-Else) --> ( "else" 'Statement )
('Optional-Else) --> ()

The third reduction rule shown here, with an empty 'after' sequence,
violates the requirement to have a nonterminal at the beginning of
each 'after' sequence. In theory, we can look at "else" waiting for
us in input and still use either rule to reduce the 'Optional-Else
nonterminal, because we can always match an empty sequence at the
beginning of the unshifted input and use the third rule, even if
there's an "else" sitting there waiting for us. Moreover, if there
is another 'Optional-else following the current one, it can eat the
"else" token, attaching it to a different "if", and the parse will
succeed (although the semantics will be ambiguous).

In practice, it's no trouble to parse because you can just pick the
second rule if you see an "else" in the input and the third rule
if you see any other thing in the input.

> 3) LR(k) (bottom up or shift-reduce) grammars
>
> In LR(k) grammars, the k refers to how many tokens one needs at the
> end of the rule to decide which rule to apply.

Okay, here you lost me. "end of a rule?" A grammar rule is a
'before' sequence and an 'after' sequence. For all type-2
languages (including LR languages) the 'before' sequence must
consist of a single nonterminal constrained to match at the
start of the expectation sequence, and the 'after' sequence can
be an arbitrary sequence of terminals and nonterminals. A
parsing rule for an LR(k) grammar, in turn, is an inputmatch
sequence of up to K terminal tokens to match in input, a
nonterminal to match at the beginning of the expectation
sequence, and a pointer to the grammar rule to apply as a
reduction.

Is there also a shifted-input index to say where the input
matching needs to start relative to the first unshifted input?
You talk about left context or shifted input below, but
I thought the input matching index was assumed to be zero
(never looking at shifted input) for LR(k) parser rules?

I think I recognize your "tokens at the end of a rule" as
being the input sequence of a parsing rule, but don't
understand what the word "end" means in this context.

I recognize the hierarchy of languages parsable by three different
LR parsing algorithms: simple LR (SLR), lookahead LR (LALR) and
canonical LR (LR languages). I recognize LR languages as a subset
of context-free or type-2 languages. But I don't know exactly
what you're saying about them here.

> LR(0) is important because two important sub-classes of LR grammars
> arise from it, SLR(k) and LALR(k). The key feature of LR(0)
> grammars is that they throw away all left (preceding) context in
> the lookahead calculations. The throwing away of the left context
> occurs, because the LR(0) mahcine "merges" states without regard to
> their lookahead information. thus, if in one context the lookahead
> for one rule is three tokens: a, b, or c and in another the
> lookahead for the same rule is d or e, the LR(0) machaine marges
> those two states, confusing the lookaheads, so that at the end of
> the rule any of a, b, c, d, or e should result in a reduction (note
> if the language is LR(0) meaning we don't need to look at all, it
> doesn't matter, since only the correct token will appear anyway in
> a syntactically legal program). Most "LR" parser generation
> algorithms first build the LR(0) machine and then resolve the
> conflicts the occur in the reduce states.

I don't know what "states" you're talking about merging. And
although I *think* you're talking about parsing rules, I can't tell
for sure whether you mean grammar rules or parsing rules. Are you
talking about the "states" of the DFSA that controls the SLR parser?
In the Dragon Book (section 4.7, page 226-227 in my edition) it says
that an SLR parser can use a single state machine as its control
program. This implies to me that the inputmatch indices of all its
parser rules must be equal.

The DFSA is made from the parser rules' inputmatch sequences, and must
recognize all legal subsequences of k tokens of input, identifying
unique accepting states for each. During an SLR parse, the lookahead
tokens are fed into this DFSA. The accepting state returned from it
is then used as an index (along with the first nonterminal of the
current sequence) to look up which grammar rule to apply as a
reduction.

Given that kind of structure, there are "states" that are important,
but I'm still not sure what it means to "merge states".

Your "left context" appears to be equivalent to what I've learned
as "shifted input"; normally we are trying to reach a production
that will cause the beginning of the expectation sequence to match
input starting at the first unshifted input token. The first token
of the expectation sequence, of course, is always a nonterminal. If it
were a terminal token matching the first unshifted input, it would
immediately cause a "shift" (which will result in an "accept" if it
shifts the last token of input and simultaneously empties the
expectation sequence). If it were a terminal token not matching the
first unshifted input, it would immediately result in an "error". But
if the inputmatch index is -1, a parsing rule can start input matching
against the most recently shifted input token instead of the first
unshifted input token, giving your "left context." and if K is 2 or
greater, the inputmatch index can be -2, to start matching two tokens
before the first unmatched token. And so on.

> A grammar is SLR(k) if at the end of all uses of a rule, the same k
> tokens always determine whether to apply the rule or not. A
> grammar is LALR(k) if at the end of each use of a rule, there are k
> tokens which decide whther to apply the rule or not and those k
> tokens do not conflict with the lookahead tokens for some other
> rule that is applied in the same LR(0) reduce state.

I do not know what you mean by "the end of all uses of a rule."
I believe that you must be talking about parsing rules rather
than grammar rules since you mention the input matching sequence
("lookahead tokens") Does "The same k tokens" mean the sequence
of k tokens starting at a particular point in the input? Or does
it mean identical sequences of k tokens in the input matching
sequences of two different parsing rules? I think it must be the
first - to match what I think an SLR(k) language means, all parsing
rules must have the same inputmatch index (and, although I'm not
sure whether it's by convention or definition, its value is zero.)

I'm still not sure I grasp the difference between SLR(k) and LALR(K)
however. I know that for full LR parsing, you need a different DFSA
for each value of inputmatch index to make sure you find the correct
indexes to look up the applicable grammar rule for the next reduction,
and for a SLR(k) language you only need one. But I'm not sure how to
characterize the inputmatching needs for an LALR(K) language.

> The
> difference between the two grammar classes is subtle, and all
> SLR(k) grammars are LALR(k) (and all LALR(k) grammars are LR(k)).
> Essentially, LALR(k) gives some context sensitivity to the
> lookahead tokens and different sets of conflicting rules can have
> different lookahead tokens to distinguish them.

Obviously, if you have two different rules in the parser that point to
different grammar rules with the exact same 'before' sequence, and also
have the exact same inputmatch sequence and inputmatch index, they
conflict. Just as obviously, some difference in those values is
preserved by all LALR parsers (to the extent that ambiguous grammars are
avoided) so conflict is avoided. SLR(k) parsers have parsing rules
differentiated by the 'before' sequence (one nonterminal token, since
it's a type 2 language) and the inputmatch sequence (up to k
nonterminals).
LR(k) parsers have parsing rules which may also be differentiated by
inputmatch index (ie, they may use "left context" to determine which
grammar rule to apply). I'm not sure where the middle ground is for
LALR to occupy.

> That brings us back to "full" or canonical LR(k), where k is >= 1.
> LR(k) parsing preserves the needed left context to disambiguate
> certain situations where LALR(k) finds the rules to conflict. The
> canonical way of doing this is to not merge states with different
> lookaheads in the first place. That causes an explosion in the
> table size as many states are kept distinct simply because they
> have different lookaheads, when in reality the lookahead for those
> states will never be consulted. A more modern method for solving
> the problem involves splitting the states only when a conflict is
> detected. Then, if the grammar is LALR, no splitting will occur
> and one has the same size machine as the LR(0), but as conflicts
> arise the require left context to resolve, the tables slowly grow.

"Not merge states with different lookaheads" -- after scrambling a
little bit with that one, I'm assuming you mean that in ordinary
cases you use the same DFSA output state to report the same sequence
of upcoming input tokens. I was having trouble understanding why
you'd ever _NOT_ do this, and then I realized that you must be
treating shift as optional, so that parser rules don't necessarily
need an inputmatch index at all. That would mean that the "correct"
reduction rule might not match if the input were shifted before it
expected, or that the "correct" reduction rule might not match until
something (what? Another parser rule?) explicitly caused a shift.

In my universe, rules that need left context use an inputmatch index
other than zero to explicitly match against already-shifted input.
With different DFA's for different inputmatch indices, there is
absolutely never a possibility of needing to avoid merging states
within a DFA. Usually you can tell from the first token of the
expectation sequence which DFA you need to use, and you can always
tell the optimal order to try them in - but once in a great while
you do need to check them all, which for an LR(K) language, in the
worst case, can cause it to take up to K times as long to find the
correct reduction rule.

Are you assuming shift *isn't* automatic whenever there's a matching
terminal token at the start of the expectation sequence, and thus
you may have an expectation sequence that starts with _unshifted_
_terminal_ _tokens_ (this violates a major invariant in my parser)
for the parsing rules to match their input sequences against?

Doesn't that cause an explosion in the number of parser rules? And
doesn't it mean you need special rules just to decide when it's okay
to shift? Hmmm. You are skipping the one-DFSA-per-different
inputmatch index so you probably find rules slightly faster in the
average case, but you're also actually taking the time of a rule
lookup just to decide when to shift. Isn't that an efficiency hit?
Or have I misunderstood completely?

What is clear though, is that now we're talking about different
implementations of parsers rather than about different language
families. We ought to be able to talk about different language
families in ways that don't require people to figure out from
context exactly what our particular parser implementation does or
doesn't do.

Bear

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: Recursive Descent vs. LALR

bear@sonic.net4 Jul 2003 00:11:57 -0400

bear@sonic.net
4 Jul 2003 00:11:57 -0400