Re: Can (f)lex handle the NULL character?


Related articles
Can (f)lex handle the NULL character? dcardani@totalint.com (1999-05-07)
Re: Can (f)lex handle the NULL character? rkrayhawk@aol.com (1999-05-09)
Re: Can (f)lex handle the NULL character? dcardani@totalint.com (1999-05-16)
Re: Can (f)lex handle the NULL character? rkrayhawk@aol.com (1999-05-20)

From: rkrayhawk@aol.com (RKRayhawk)
Newsgroups: comp.compilers
Date: 20 May 1999 01:49:10 -0400
Organization: AOL http://www.aol.com
References: 99-05-049
Keywords: lex

dcardani@totalint.com (Darrin Cardani) posted notes on 1999/05/07


Concerning << ...the NULL character (a 0 byte) in them. Reading this causes
flex to go into an infinite loop saying "NUL or End of Buffer" over and
over. >>


and code was posted on 16 May 1999:


stream { BEGIN INSTREAM; return STREAM; }
<INSTREAM>endstream { BEGIN NOTINITIAL; return ENDSTREAM; }
<INSTREAM>. { return ANYTHING; }

Here are some comments and suggestions based upon a partial review of
FLEX_VERSION "2.5.4".


Your message text does not match the one I find in my version of flex, which
has "--(end of buffer or a NUL)". If your version of flex is much older than
mine, you might seek an update; but I suppose you just paraphrased the error.


Here is a hack of the code from the flex source file gen.c, presenting only
the cerr text (the fprintf alternative is chopped) for simplicity. You do not
have to be using C++ for this to be relevant; I am just getting the thing
down to a presentable size. Each %d placeholder is filled in by gen.c at
scanner generation time, as noted in the comments:


if ( ddebug )   /* gen.c's flag for the -d option: emit the trace at all? */
    {
      if ( yy_flex_debug )
            {
            if ( yy_act == 0 )
                  cerr << "--scanner backing up\n";
            else if ( yy_act < %d )              /* %d = num_rules */
                  cerr << "--accepting rule at line " << yy_rule_linenum[yy_act]
                       << "(\"" << yytext << "\")\n";
            else if ( yy_act == %d )             /* %d = num_rules */
                  cerr << "--accepting default rule (\"" << yytext << "\")\n";
            else if ( yy_act == %d )             /* %d = num_rules + 1 */
                  cerr << "--(end of buffer or a NUL)\n";
            else
                  cerr << "--EOF (start condition " << YY_START << ")\n";
            }
    }


The condition test that guards the "--(end of buffer or a NUL)" message,
      else if ( yy_act == %d )
is emitted with its %d filled in from the expression
        num_rules + 1

Note that this code gets inserted into your executable only if you generate
the scanner in debug mode (the flex -d option).
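
Even then, the trace can be switched off at run time. Here is a minimal
sketch (assuming a plain C scanner built with flex -d; yy_flex_debug is the
documented run-time switch for the -d trace) that you could use to check
whether the scanner still appears to hang once the messages are silenced:

      /* link this against the flex-generated scanner */
      extern int yy_flex_debug;
      extern int yylex( void );

      int main( void )
      {
          yy_flex_debug = 0;     /* silence "--(end of buffer or a NUL)" etc. */
          while ( yylex() != 0 )
              ;                  /* if this still spins, the loop is real */
          return 0;
      }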


Don't be too surprised if all you have discovered is a weakness in the
debugging functionality. Perhaps you are getting the error message
repeatedly, but not actually going into an infinite loop. To try to get some
relief, you could coerce token ANYTHING to a value other than 258; the
symptom might go away. That is, declare a dummy token that takes 258,
followed by ANYTHING, followed by another dummy that takes 260.
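
Purely as a sketch (the DUMMY_ names are hypothetical, and the exact numbers
depend on whether your yacc/bison starts user tokens at 257 or 258), the idea
in the parser's .y file would look something like:

      %token DUMMY_BEFORE   /* takes the value ANYTHING used to have (e.g. 258) */
      %token ANYTHING       /* now one higher (e.g. 259)                        */
      %token DUMMY_AFTER    /* occupies the next slot (e.g. 260)                */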


Although that expression may be surfacing the problem, the real problem (if
there is one in flex) would be in the various possible computations of the
variable yy_act. To home in on that we would need to know everything about
your parameters at code generation time (full table, compressed, speed, and
so on).


The message text does suggest that there could be a tiny design problem in
flex, since end of buffer and NUL need to be distinguishable in your
application. But actually, on simple review, this does not look like
showstopper code being generated; it is just warning-flag waving.


At any rate, if you are actually driving flex into a loop, the cause is more
likely to be the start states, which are handled weakly in flex (because this
high-level structural feature was retrofitted onto a coding paradigm that
originally had no such feature).


In your post you do not indicate whether you are defining inclusive or
exclusive start conditions.
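
For reference, here is a hypothetical declaration of each flavor in the
definitions section of the .l file (you would have one or the other, not
both; the name INSTREAM is taken from your rules):

      /* exclusive: while INSTREAM is active, only <INSTREAM>-prefixed rules
         (plus flex's built-in default rule) can match */
      %x INSTREAM

      /* inclusive: rules with no start-condition prefix remain active too */
      %s INSTREAM
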
------
Notice that your rule
      <INSTREAM>. { return ANYTHING; }


definitely does not catch all of the remaining text in the sense of your
phrase "return every character (all values from 0x00 to 0xFF)". The . pattern
skips the newline character (\n).


This is important since you do NOT list a specification of
          <INSTREAM>\n {whatever;}


This could be producing unexpected results. Note that unmatched input is
scooped up by lex under a default rule that simply echoes the text to the
output (not to your parser). Indeed, the echoed newline characters may be
rather hard to spot on the screen (or in a redirected output file) at program
execution time.
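
If you want that echoing to be impossible to miss while you debug (purely a
suggestion, not something your current spec requires), flex's -s switch, or
%option nodefault if your flex accepts it, suppresses the default rule so
that any unmatched character aborts the run instead of being silently echoed:

      /* same effect as running flex -s: unmatched input now dies with
         "flex scanner jammed" rather than being echoed to the output */
      %option nodefault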


There are at least three gotchas on start states which relate to things you
may not realize you are engaging (note the difference between inclusive and
exclusive here):
      - earlier generic rules can match if the current start state is
        inclusive (%s)
      - later generic rules can match if the current start state is inclusive
        (%s) and its set of rules fails to catch the text
      - if an exclusive start state (%x) is active, but not exhaustive of all
        text, then to quote the flex manual:
              "The default rule (to `ECHO' any unmatched character) remains
              active in start conditions. It is equivalent to:
                                      <*>.|\n ECHO;
              "


For this to be relevant, you must have newline characters in your binary
file. It is not impossible that you are weaving through flex in a way that
leaves yy_act set inappropriately.


The super default rule quoted above is certainly nice, but in your
application you may wish to specify start state INSTREAM in a way that
exhausts _all_ possibilities, rather than letting the grand phantom
          <*>.|\n ECHO;
handle the leftovers.
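
One hypothetical way to make INSTREAM exhaustive (the token names are yours;
whether a newline should really come back as ANYTHING is your call):

      <INSTREAM>endstream   { BEGIN NOTINITIAL; return ENDSTREAM; }
      <INSTREAM>\n          { /* the newline that . skips */ return ANYTHING; }
      <INSTREAM>.           { /* every other character    */ return ANYTHING; }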


The processing of the newline character does not actually encounter the
problem, but it could set it up by mangling yy_act. And the next input
character, which would be anything but newline, as specified in
    <INSTREAM>. { return ANYTHING; }
crashes and burns. Actually, if my interpretation is right, it only burns
with a hot waving of warning flags - for every single character except
newline - but really might never crash.


Another possibility here (and don't be insulted if you are beyond this kind
of mistake) is that you could be bringing an obsolete version of your parser
include file into your lexer compile. Conceivably, that could adversely
affect the setting of num_rules. Check to see that your parser compile is
placing the token defines into the file and directory that you bring into the
lexer compile with the #include directive.
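
A minimal sketch of what the top of the .l file might look like, assuming a
yacc/bison-style parser whose token definitions land in y.tab.h (substitute
whatever header your parser build actually produces):

      %{
      /* the parser's token values (STREAM, ENDSTREAM, ANYTHING, ...);
         if this header is stale, the lexer and parser disagree on numbers */
      #include "y.tab.h"
      %}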


Let us know whether you are using %s or %x, and whether your rules are
exhaustive in all %x states.


Best Wishes,


Robert Rayhawk
RKRayhawk@aol.com

