lex/yacc "bugs"

dww@inf.fu-berlin.de (Debora Weber-Wulff)
Fri, 31 Jan 92 09:07:46 +0100

          From comp.compilers

Related articles
Errors in lex and yacc weberwu@inf.fu-berlin.de (1991-10-25)
lex/yacc "bugs" dww@inf.fu-berlin.de (1992-01-31)
Re: lex/yacc "bugs" megatest!djones@decwrl.dec.com (1992-02-10)
| List of all articles for this month |
Newsgroups: comp.compilers
From: dww@inf.fu-berlin.de (Debora Weber-Wulff)
Keywords: lex, yacc, errors, summary
Organization: Compilers Central
References: 91-10-101
Date: Fri, 31 Jan 92 09:07:46 +0100

This is the collection of problems with lex and yacc that I got as a response
to my request on comp.compilers. Hope it helps


Debora Weber-Wulff dww@inf.fu-berlin.de
Institut fuer Informatik +49 30 89691 124
Nestorstr. 8-9 (INCLUDE "standard.disclaimer")
D-W-1000 Berlin 31 (PRINTN (WITTY-MESSAGE TODAY))


[Some of the messages look to me like they're reporting misunderstandings
about what yacc and lex are supposed to do, though others are reporting
real bugs. -John]


-----snip-----
Subject: Errors in lex and yacc
Reply-To: david.tarditi@CS.CMU.EDU


The lookahead operator doesn't work correctly in lex. The attached
lexical specification shows the bug. On input aba, it returns ab as a
token, not a. This bug was posted to comp.compilers a while ago (sometime
in 1989, I think).


      David Tarditi (tarditi@cs.cmu.edu)
-------------
%%
(a|ab)/ba { printf("lookahead: %s\n",yytext); return 1; }
.|\n { printf("no lookahead: %s\n",yytext); return 1; }
%%
main()
{ while (yylex()); }


yywrap()
{ return 1; }
-----snip-----


No workaround was posted with the bug report, as far as I remember. I
don't know of any simple fix for the bug. Please let me know if you find
one. I maintain sml-lex, a lexer generator for Standard ML, and it had
the same bug.


    David Tarditi


-----snip-----


>From gnu.ai.mit.edu!asherman Sun Oct 27 23:30:44 1991
Subject: Errors in lex and yacc


The following crashes our Ultrix lex on a MIPS box:


cp /bin/csh foo
lex foo
-AJS


-----snip-----


>From unido!compres!chris Wed Nov 6 05:14:48 1991


S> The principle of longest match is
> understood to be used (lex does the order of the REs on this),
> but what the correctness of this means is never stated.


An interesting point to be sure. It's easy to show that the longest
match principle has non-local effects. Inserting a character at the
end of one token can cause the whole stream of tokens to change.


Consider the following two token definitions:


A : (00)+ (0 (1 1?)?)?;
B : (101)+;


A legal stream would then be:
                A B
00 101


However, changing the A token to be 000 would cause the lex to fail
when blindly using longest match. Here are the desired tokens:


A B
                000 101


The lexer would actually find:


A ERROR
                0001 01


I think the problem comes up in practice in lexers for Pascal like
languages with the following three tokens.


INTEGER: (0..9)+;
REAL: (0..9)+ ("." (0..9)+ ("e" (0..9)+)? | "e" (0..9)+);
DOTDOT: ".."


After seeing a "." after an INTEGER, should one shift or reduce? Of
course, two character lookahead solves it for this case. However, you can
construct a grammar and sequence of tokens which requires arbitrary
lookahead to correctly lex. The problem is very reminiscent of the "I'll
just check the semantic attributes for directing my recursive descent
parse".


Isn't it amazing when people use principles that are not well understood
what a mess they can sweep under the rug. [And, then they wonder why they
bump their heads on the ceiling after the floor has been risen three
feet.]
-----snip-----


Newsgroups: comp.compilers
Reply-To: bliss@sp64.csrd.uiuc.edu (Brian Bliss)
Subject: Re: Lookahead vs. Scanner Feedback
Organization: UIUC Center for Supercomputing Research and Development
Date: Tue, 7 Jan 92 00:00:15 GMT


In article 92-01-012, hjelm+@cs.cmu.edu (Mark Hjelm) writes:
|> I have a parser, written using Yacc and Lex, for ANSI C. The grammar is
|> taken pretty much verbatim from the standard. The scanner uses the symbol
|> table to decide whether to return "identifier" or "typedef name" as the
|> token type for an identifier. How do I KNOW that there are no situations
|> which, due to parser lookahead, would cause the scanner to return an
|> incorrect token type for an identifier ... ?


One place where every yacc/lex based C compiler I know of is
broken is on a typedef name redefined in an inner scope:


typedef int foo;


main ()
      {
            char foo;
      }


is legal ANSI C. I think there was a thread on this a while back.


bb
-----snip-----
*****


It seems that yacc fails to find some ambiguity-related shift-reduce
conflicts in the presence of epsilon rules. The following yacc input
(adapted from the yacc manual) correctly pinpoints the "dangling else"
ambiguity as a shift-reduce conflict.
--------------------------------------------------------------------
%%
stat : 'IF' 'COND' stat
| 'IF' 'COND' stat 'ELSE' stat
| /*epsilon rule*/
;
--------------------------------------------------------------------
However, the following slightly modified version fails to identify
the shift-reduce conflict.
--------------------------------------------------------------------
%%
stat : 'IFCOND' stat
| 'IFCOND' stat 'ELSE' stat
| /*epsilon rule*/
;
--------------------------------------------------------------------
Note that changing the epsilon rule to a non-empty rule (such as
--------------------------------------------------------------------
%%
stat : 'IFCOND' stat
| 'IFCOND' stat 'ELSE' stat
| 'X'
;
--------------------------------------------------------------------
) fixes the problem. Also, the problem does not occur with byacc.


-----snip-----


Organization: Computer Systems Group, U of Waterloo




I don't know if you consider this an error or not, but abusing the precedence
rules will give strange results:


%left 'a' 'b'
%%
A : 'b' 'a' 'c'
    | B 'a' 'b'
    ;
B : 'b'
    ;
%%


No diagnostics are given, not even in y.output, but the automaton generated
doesn't accept the language. In particular, it doesn't accept ``bac''.


Peter
-----snip-----


Hello,


I have discovered in error in Yacc/Bison. I do not know whether it is
already known.


Error: When using ambigous grammars and embedded actions, reduce/reduce
              conflicts can arise. Some of these can be resolved using the
              precendence/associativity information. This is not done in Yacc
              nor Bison.


Example:


---------------------------------------------cut-here----


%token ID
%left '*'


%%
e : e {printf( "shifting *\n") } '*' e { printf("reduced e*e\n") };
e : ID;


---------------------------------------------cut-here----


This results in a reduce/reduce conflict. See y.output from Yacc:


5: reduce/reduce conflict (red'ns 2 and 1 ) on *
state 5
e : e_$$1 * e
e : e $$1 * e_ (2)
$$1 : _ (1)


* reduce 1
. reduce 2


$$1 goto 3


Yacc makes the parser treat '*' as if it were right-associative. With
LA = '*' it should reduce 2, not 1. The action is *Really* associated
with the shift of the '*'.


Hope this helped,


Arlet Ottens
Delft University of technology
The Netherlands


Email: ottens@dutiba.tudelft.nl


-----snip-----


Here are a few yacc bugs (in Sun's version, which I believe is a slightly
upgraded version of the System V yacc). These were as of SunOS 4.0.3:


- If you explicitly declare literal tokens, yacc creates bogus defines -
for example, the statement


%token '2'


in your grammar leads to


#define 2 50


in your output parser.


- Contrary to the documentation, if the user calls the YYERROR macro in
an action the parser does not call the yyerror() routine.


- You can't redefine the stacktype with a simple define -


#define YYSTYPE sometype


doesn't always work.




These were the bugs which we fixed in our yacc development environment, ydb;
we also fixed things which would count as limitations rather than bugs, like
limited table sizes.


Yours,


Peter Garst
pg@bsg.com
-----snip-----
------ Here's some other stuff I've gathered over the months -----
Reply-To: bart@cs.uoregon.edu (Bart Massey)
Newsgroups: comp.compilers
Subject: Parsing makefiles
Date: 1 Nov 91 06:22:27 GMT
References: 91-10-122
Reply-To: bart@cs.uoregon.edu (Bart Massey)
Organization: Department of Computer Science, University of Oregon
Approved: compilers@iecc.cambridge.ma.us


In article 91-10-122:
> Where can I find a "make" grammar?
> [ ... hard-to-yaccify
> distinction between lines that start with whitespace and lines that don't.]


Yeah, but you do this in the lexer.


BTW, the distinction is more brutal than that -- it's definitely "lines
that start with a tab and lines that don't." The following two scripts
(at least now, as I'm typing them) have very different effects when
executed by make


foo:
BAR=bletch; echo $$BAR


foo:
   BAR=bletch; echo $$BAR


(For those upon whom the distinction may be lost :-) (probably by the news
transport system), the first has a tab before the B, the second has a
space followed by a tab.) This is obviously a total design botch, but it
happened long ago.


If the lexer returns the appropriate token for lines that begin with a tab
(probably with its value just some sort of string containing the command)
then the parser hardly cares...


The moderator writes:
> For the make stuff, I suppose that you could have the lexer return the
> command lines as TAB-token REST-OF-LINE-token and just glom everything
> together, but by the time you've done that, there's hardly anything left
> for the parser to do.


Unless you write a modal lexer, though, that's kind of hard to
do, and not really worth it. You just want an
EXECUTABLE-LINE-token, whose value (LEX yylval) is some kind of
structure like


{
int ignore_errors;
int execute_silently;
char *to-execute;
}


There's still a *bit* for the parser to do, but you're right
about YACC probably being overkill.


Bart Massey
bart@cs.uoregon.edu
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.