YAYERB (yet another yacc error recovery bug)

megatest!djones@decwrl.dec.com (Dave Jones)
Mon, 8 Jun 1992 00:29:30 GMT

          From comp.compilers

Related articles
YAYERB (yet another yacc error recovery bug) megatest!djones@decwrl.dec.com (1992-06-08)
| List of all articles for this month |
Newsgroups: comp.compilers
From: megatest!djones@decwrl.dec.com (Dave Jones)
Keywords: yacc, errors
Organization: Megatest Corporation, San Jose, Ca
Date: Mon, 8 Jun 1992 00:29:30 GMT

Old hands here will recall a long discussion about a one-character typo in
the orignial yacc which caused it invariably to screw up on
error-recovery. The problem was that it would do default reductions in
states that could shift the "error" token. Fixing the typo eliminates
those default reductions and makes it work in most cases.


I've just discovered that the repaired version can screw up in a similar
way. This is not caused by a typo, but by a design flaw. Default
reductions should also not be generated in states which can be produced
immediately after shifting the error-token (before any more input is
shifted), not just in the states that can actually shift error. In
fairness it should be said that this is pretty complicated stuff, and it
is not at all surprising that the implications of "adding default
reductions" may not have been thought out completely. Or then it may have
been thought out and it was decided (in those days of 64K address spaces)
that the memory savings were worth breaking error-recovery in some cases.
I rather doubt that however, because there is nothing about it in the BUGS
section of the man-page.


I'll tell you a little about the project I'm working on, just to make the
following example easier to follow.


It is a microcode assembler. It must emit exactly one microcode
instruction for each coded instruction in the source. An empty
microinstruction is illegal. (If it were legal, it would not mean for the
compiler to emit nothing; It would mean to emit a default "do-nothing"
microinstruction. The difference is important because timing is
critical.) So I don't want empty instructions in the grammar. A small
complication is that although the manual says that every instruction must
end with a semicolon, in many existing programs, (which are compiled
without error by a predecessor program), the semicolon is omited after the
last instruction. That small complication breaks the error-recovery.


I coded it up as a semicolon-separated list, followed by an optional
semicolon. (It could have been done several other ways, but it all comes to
the same thing.) Here it is, simplified somewhat:


body : '{' instr_list '}'
;


instr_list : emited_instrs
                                                        | emited_instrs ';'
                                                        ;


emited_instrs : one_emited_instr
                                                        | emited_instrs ';' one_emited_instr
                                                        ;


                one_emited_instr : partial_instr { emit_instr(); }
                                                        ;


                partial_instr : statement
| partial_instr ',' statement
                                                        ;


                statement : error
                                                        | et cetera
                                                        ;


It is not at all obvious that only one error may be detected, and then
yacc may discard the rest of the input up to the right curly brace. The
problem is our old nemesis, the default reduction -- this time present not
as the result of a typo in the yacc source code, but by design.


Let's say we have this input:


{
/* instr 1 */
foo = %ar1 + foobar1, /* stat 1 */
foo2 = bar2 + foobar2 /* stat 2 */
                      ;




/* instr 2 */
foo3 = bar3 - foobar3
                      ;
                }


The token "%ar" is a syntax error. Yacc turns it into "error" and pops the
stack until it finds a state that can accept "error". It does so and turns
the error into a "statement". Now in this particular example, we want yacc
to eat input until it finds the next comma, after "foobar1". But it does
not.


The next input is the '+'. Yacc has gone to a state that will push a comma
if it finds one, or produce a "partial_instr" if it does not.


Seeing the '+', it produces a "partial_instr" by the rule
"partial_instr <== statement".


So far so good. Now it must decide whether to emit it, or continue
building it with another ", statement". Here is the state that makes that
decision:


state 6
one_emited_instr : partial_instr _ (6)
partial_instr : partial_instr _, statement


, shift 12
. reduce 6 /* default */


OOPS! Seeing the '+' next, it makes the (default) decision to produce an
"emited_instr". What we really want is this state:


state 6
one_emited_instr : partial_instr _ (6)
partial_instr : partial_instr _, statement


, shift 12
                ; reduce 6
                } reduce 6
. error /* default */


If it were that way, yacc would eat input until it found the next comma,
semicolon or right curly brace.


I fixed this by associating an action with error, which discards input "by
hand":




%%


statement : error { stat_error_sync(); }
                                        | et cetera
;


%%


static void stat_error_sync()
{
        /* yychar is still the token that produced the error */


        do {
yychar = yylex();
        } while(yychar != 0 && yychar != ',' && yychar != ';' && yychar != '}');
}
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.