From: Chris F Clark <cfc@shell01.TheWorld.com>
Newsgroups: comp.compilers
Date: Sun, 28 Feb 2010 13:25:04 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 10-02-024 10-02-029 10-02-047 10-02-055 10-02-062 10-02-064 10-02-070 10-02-072 10-02-078 10-02-080
Keywords: errors, parse
Posted-Date: 01 Mar 2010 00:35:53 EST

I hope not to be too disappointing in this response to
"Joel E. Denny" <jdenny@clemson.edu> who writes:
> On Fri, 19 Feb 2010, Chris F Clark wrote:
>
>> A true LR algorithm does not merge left-contexts (that affect
>> reductions--the canonical LR algorithm never merges, and minimal
>> state ones merge and then re-split (ala Spector) or use lane tracing
>> (ala Pager) before merging to avoid problematic merges) and thus
>> each state has only the lookaheads that are appropriate to it.
>> Thus, an LR algorithm doesn't have the issue unless one "optimizes"
>> the states as described above.
>
> I'm not sure how that could be correct for any minimal LR algorithm. My
> understanding is that, if every state has only the lookaheads that are
> appropriate to it (or, more precisely, to all of its viable prefixes),
> then the tables are canonical LR. If there are minimal LR algorithms that
> somehow avoid the merging of different lookahead sets, I'd be interested in
> hearing how they accomplish that.
The minimal algorithms don't avoid all merging of different lookahead
sets, just the merges that have an impact on which reductions occur.  I
cannot explain how it works in Pager's algorithm because I never
completely grasped it, but I can explain the perspective from
Spector's algorithm because we used (a variation on) it in Yacc++. I
apologize in advance for any lack of rigor in this explanation.
However, if this causes questions, I am willing to *try* and discuss
them, from my limited background.
Ok, the very brief, very hand-wavy discussion of Spector's algorithm.
You run the typical LR(0) algorithm to create your initial machine
doing all the mergings that one expects. This gives you a minimal
state machine, perhaps too minimal. In particular, the LR(0) machine
has thrown away all the left context information that the canonical
LR(1) algorithm preserves in its distinct states. For many grammars,
those that are SLR(1) or LALR(1), this is not an issue. One still has
correct tables.
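For the curious, here is a rough Python sketch of the LR(0) closure and
goto operations that build that initial machine (the sketch is purely
my illustration for this post, not Yacc++ code).  An item is a
production with a dot position, and a state is a set of items:

    # An item is (lhs, rhs, dot); `grammar` maps each non-terminal
    # to a list of right-hand sides (tuples of symbols).

    def closure(items, grammar):
        # For every item with the dot before a non-terminal, add that
        # non-terminal's productions with the dot at the start.
        result = set(items)
        work = list(items)
        while work:
            lhs, rhs, dot = work.pop()
            if dot < len(rhs) and rhs[dot] in grammar:
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        work.append(item)
        return frozenset(result)

    def goto(items, symbol, grammar):
        # Advance the dot over `symbol` in every item that can shift it.
        moved = {(lhs, rhs, dot + 1)
                 for lhs, rhs, dot in items
                 if dot < len(rhs) and rhs[dot] == symbol}
        return closure(moved, grammar)

Because states that end up with identical item sets compare equal,
building the machine from goto() merges them into one state
automatically; that is the merging I mean here.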
However, for some reduce-reduce conflicts, the reason we get the
conflict in the LR(0) tables is that the construction algorithm has
thrown away the left context information in merging the states. The
following algorithm does not apply to shift-reduce conflicts unless
there are two different shift-reduce conflicts for that lookahead,
i.e. there is a reduce-reduce conflict behind the shift-reduce
conflict.
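The standard textbook example shows the effect (the example is not
specific to Yacc++):

    S -> a A d | b B d | a B e | b A e
    A -> c
    B -> c

After the prefix "a c" the canonical LR(1) construction builds a state
with the items [A -> c ., d] and [B -> c ., e]; after "b c" it builds
[A -> c ., e] and [B -> c ., d].  The LR(0)/LALR construction merges
those two states, because their items are identical once lookaheads
are dropped, and the merged lookahead sets {d, e} on both reductions
create a reduce-reduce conflict that neither original state had.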
So, for each "path" from the start state to the reduce-reduce conflict
state we calculate what the left contexts were that apply to that
reduction. Each path has a specific reduction that if you take, when
you reduce to that non-terminal, you will be capable of continuing the
path and eventually arriving at an accepting state.
Now, by left contexts, what one really calculates is the states
involved in the path and the follow sets appropriate to the path. It
is worth noting that the follow sets involved here don't necessarily
come from the non-terminal we are attempting to reduce; they can come
from a non-terminal which (eventually) derives the non-terminal we are
trying to reduce (as the last element in a production, or in a
sequence of such last-element productions).
The problematic merge states are part of that path.  Moreover, the
relevant lookaheads of the two states that were merged to form the
merge state are captured in the calculation.  If those lookahead sets
are not the same, the merge has lost this distinction.  Undoing the
merge will therefore restore that left context information to the
resulting tables.  Thus, we eliminate the reduce-reduce conflict by
not merging the states and by splitting all the states between the
formerly merged state and the state that had the conflict (including
splitting the conflict state itself).  This restores the proper LR(1)
lookaheads to the conflict state, which is now two (or more) states.
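To make the criterion concrete, here is a small Python sketch of the
test (the data layout and names are mine, not Yacc++'s).  Each path
contributes a kernel mapping each candidate reduction to the lookahead
set valid along that path; two kernels may be merged only if the union
does not introduce a reduce-reduce conflict that neither has alone:

    def conflicts(kernel):
        # A kernel maps each candidate reduction (production) to its
        # lookahead set.  Return the lookaheads claimed by more than
        # one reduction.
        owner, clashes = {}, set()
        for reduction, lookaheads in kernel.items():
            for la in lookaheads:
                if la in owner and owner[la] != reduction:
                    clashes.add(la)
                owner[la] = reduction
        return clashes

    def merge_is_safe(kernel_a, kernel_b):
        # Merging is safe unless the union creates a reduce-reduce
        # conflict that neither kernel has on its own; if it is not
        # safe, the states (and the path back to where they diverged)
        # stay split.
        merged = {}
        for kernel in (kernel_a, kernel_b):
            for reduction, lookaheads in kernel.items():
                merged.setdefault(reduction, set()).update(lookaheads)
        new = conflicts(merged) - (conflicts(kernel_a) | conflicts(kernel_b))
        return not new

With the textbook grammar above, the kernel after "a c" maps A -> c to
{d} and B -> c to {e}, while the kernel after "b c" maps A -> c to {e}
and B -> c to {d}; merge_is_safe() returns False, so the two states
are kept apart and the canonical lookaheads survive.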
However, you will note that we only did this to states where the LR(1)
lookahead participated in a reduce decision. In a typical grammar,
you will find states where there is only one possible non-terminal to
reduce to. In those states, splitting will not reduce the number of
conflicts and is thus pointless.  (More on this below.)
If you work through the algorithm, you will see that for an LR(1)
grammar this produces the minimal number of states that accept that
grammar. In particular, if the grammar is LALR(1), the resulting
tables are identical to the LALR(1) tables.  You can also prove that,
for an LR(1) grammar, the tables are correct: every sentence the
grammar accepts is accepted by the constructed tables.
It is possible that the resulting tables will still reduce too early.
For example, when there is only one non-terminal to reduce, and the
lookahead that causes that reduction occurs on only one path to the
reduction state while on another path that reduction is an error, the
algorithm won't detect it.  It doesn't split states to re-introduce
errors in the reduction states; it only splits states to resolve
conflicts.  I think this slightly contradicts what I said in my
earlier post on the topic about all the lookaheads being appropriate
to a state, but I only worked through that possibility just now.
There may be some states that have lookaheads that come from a prefix,
but that prefix occurs along a path that the parser didn't take to
reach this state.  However, those lookaheads don't result in a
traditional conflict, just in a reduce action where an error action
would have been appropriate.  Thus, for a minimal state algorithm
there may still be some spurious reduces before an error is detected.
You could adjust the algorithm to avoid those reduces as well, and I
believe you would then get the canonical LR(1) tables.
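A tiny example of that delayed error detection (again mine, not from
Yacc++):

    S -> a A | b A c
    A -> d

In the canonical LR(1) tables, the state reached after "a d" reduces
A -> d only on end-of-input, and the state reached after "b d" reduces
only on c.  A minimal-state (or LALR) construction merges them, since
there is only one possible reduction and hence no conflict, giving the
merged state the lookahead set {end-of-input, c}.  On the erroneous
input "a d c" the merged parser reduces A -> d and only then reports
the error, whereas the canonical parser reports it before reducing.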
> According to another of your emails, you've done some work with
> Yacc++ to correct the expected tokens reported upon syntax errors.
> I've also worked on this for Bison. Do you know of any papers that
> discuss this topic?
There aren't any specifically on that topic concerning Yacc++.  Only
Barbara Zino and I worked on it, and neither of us wrote any papers on
that aspect.  Most of what we've written on Yacc++ is actually here in
the comp.compilers archives.  Since neither of us was in a situation
where publishing papers was relevant, the informality of this
newsgroup was more appropriate for sharing our knowledge.
<digression>
Moreover, the work we did was not particularly groundbreaking.
We implemented the essential "display_expected()" function, which
returns, for a given state, the tokens (and, if desired, the
non-terminals) that do not have error actions in that state.  (Our
model treats non-terminals more like tokens than most LR algorithms
do, so we can recover the expected non-terminals also.)
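The idea is roughly the following Python sketch (this is not the
actual Yacc++ implementation; the names and data layout here are mine,
purely for illustration):

    def display_expected(actions, nonterminals, include_nonterminals=False):
        # `actions` is the action-table row for the current state,
        # mapping each grammar symbol to its parser action; symbols
        # whose action is 'error' are not legal continuations here.
        expected = []
        for symbol, action in actions.items():
            if action == 'error':
                continue
            if symbol in nonterminals and not include_nonterminals:
                continue
            expected.append(symbol)
        return sorted(expected)

Passing include_nonterminals=True is what lets the derived versions
mentioned below report, say, that an "expression" was expected rather
than listing every token that can start one.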
We have some options that allow one to control which states are split.
When working with non-LR(1) grammars, it is sometimes advantageous to
split reduce-reduce and/or shift-reduce conflicts even when splitting
doesn't resolve them.  This makes our tables closer to the canonical
LR(1) tables.  I would have to look to see whether we handle the error
case I was discussing above.  We may have thought of that issue and
put in an option to split those states, or we may not have.
We also have some example derived versions of the "display_expected()"
function that we use to illustrate ideas, such as reporting expected
non-terminals and not just expected tokens (i.e. so one could say an
"expression" was expected).
In particular, over my career I worked on PL/I and Pascal compilers
and thus learned the differences between reserved words and keywords,
and about languages that have context-dependent keywords that are
sometimes identifiers and sometimes not.  In fact, the Yacc++ language
uses context-dependent keywords to keep itself simpler, and also
supports (in a fairly extensive way) grammars of that style.  Thus, it
was natural and important that the "display_expected()" function be
able to skip reporting all the context-dependent keywords that were
legal simply because an identifier was legal in that place.
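That filter is simple to state; here is a hedged sketch in the same
illustrative style (the contextual keyword set and the token name
below are made up, not taken from Yacc++):

    # Hypothetical examples of keywords that are reserved only in context.
    CONTEXT_KEYWORDS = {'left', 'right', 'nonassoc'}

    def filter_contextual_keywords(expected, identifier='IDENTIFIER',
                                   context_keywords=CONTEXT_KEYWORDS):
        # If an identifier is legal here, each contextual keyword is
        # legal only because it can also be read as an identifier, so
        # listing it would just clutter the message.
        if identifier in expected:
            return [sym for sym in expected if sym not in context_keywords]
        return expected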
We also have an idea we call "syntax assists", which allows specific
error reporting numbers to be associated with certain rules; the error
mechanism can then use those numbers to generate better error
messages, e.g. to display which forms of an "if statement" the parser
thought you were working on at the point you made an error.
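As a rough illustration only (the table layout, message numbers, and
rule spellings below are hypothetical, not the actual Yacc++
mechanism), the idea is a mapping from productions to message numbers
that the error reporter consults for the items active in the failing
state:

    # Hypothetical assist table: production -> error message number.
    SYNTAX_ASSISTS = {
        'if_stmt : IF expr THEN stmt': 101,
        'if_stmt : IF expr THEN stmt ELSE stmt': 102,
    }

    def assist_messages(active_productions, assists=SYNTAX_ASSISTS):
        # Given the productions the parser was in the middle of when
        # the error occurred, return the associated message numbers
        # (if any) so the reporter can name the constructs involved.
        return sorted({assists[p] for p in active_productions if p in assists})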
Again, these are all small tweaks to allow a grammar writer more
control of the error process, but none of them are particularly
exceptional in themselves. Of course, I would like to think that as a
collection they allow a fairly professional result to be crafted.
<end digression>
There are, of course, a (very) few papers that discuss reporting
correct expected tokens, but mostly in the light of other work the
author was doing, and I don't have any references at hand.  It doesn't
tend to be a topic that stands on its own, since most of the work is
intimately connected to implementation details.  It also doesn't help
that the parsing problem is not generally considered to be
academically interesting, and thus very few PhD students are working
on it.
Well, I hope this helps,
-Chris
******************************************************************************
Chris Clark email: christopher.f.clark@compiler-resources.com
Compiler Resources, Inc. Web Site: http://world.std.com/~compres
23 Bailey Rd voice: (508) 435-5016
Berlin, MA 01503 USA twitter: @intel_chris
------------------------------------------------------------------------------