Handling EOF in lex & flex actions that call input() directly.

greyham@research.canon.oz.au (Graham Stoney)
Fri, 7 Jan 1994 09:44:00 GMT

          From comp.compilers


Newsgroups: comp.compilers
From: greyham@research.canon.oz.au (Graham Stoney)
Summary: How should an action that calls input() directly respond to EOF?
Keywords: flex, question
Organization: Canon Information Systems Research Australia
Date: Fri, 7 Jan 1994 09:44:00 GMT

Rats! You wouldn't believe how bad a day it's just turned into. Not only
is there a bushfire raging right outside the office window, but I just
clicked "Close Window" by accident and blew away the vi session in which
I'd just finished writing a 266-line comp.compilers article. That, in
combination with NeXT's broken expreserve, means I'm back to the start
again. Oh well, here goes...


The latest 2.4 releases of flex have uncovered a problem in one of the
grimier areas of c2man: dealing with EOF properly when encountered in a
function which calls input() from outside the normal lexer state machine.


I have a separate get_comment function which reads the contents of a
comment after the lexer has matched its start with a /* regexp. It's done
like that because I have two different types of comment token recognised
by the grammar, and they differ depending on whether the comment started
as the first thing on the line, and whether the comment ended as the last
thing on the line. This function also coalesces the contents of comments
on consecutive lines into one token, to handle input like this:


/* 1. This comment gets coalesced with */
/* 2. this one into a single token */


/* 3. But this is a separate one due to the blank line */


This causes problems because it is necessary to read past the terminating
'/' on the first comment in order to determine that there is another
comment on the next line. If an EOF is found (as occurs after comment 3)
after this terminating '/', the trouble starts because although we can
simply return our final comment token, the lexer regexp state machine has
not seen the '\n' after the comment, and so will not recognise patterns
which match at the start of line when it starts lexing the next input
file.


In the past, I've had my get_comment function unput a newline if it
reaches EOF, so that the lexer sees an end of line and will match
^-anchored rules at the start of the next input file. This always seemed
like a bit of a fudge though, and although it works in lex and flex 2.3,
it fails with the latest flex 2.4.6.


Does anyone know of a cleaner approach to this problem? I think it's a
little hard to call this a bug in flex 2.4.6; it does act differently
from lex, but this doesn't seem to be a particularly well-documented area
of lex's functionality.
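
One avenue I haven't properly tried yet (sketched below, untested, and
the flag name is invented) is flex 2.4's new <<EOF>> rules, which let an
action run when end-of-file is hit; something along these lines might be
able to replace the unput('\n') trick:

<<EOF>> {
        /* hypothetical: fake the end-of-line that EOF swallowed,
         * so that ^-anchored rules still match at the start of
         * the next input file */
        if (!saw_final_newline)
                ... /* arrange for beginning-of-line context here */
        yyterminate();
}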


Another fun crowd-pleaser is that lex & flex return different values from
input() at EOF. (lex is definitely wrong here though since it prevents
NULs in the input).


Here's a rough outline of the relevant rules and get_comment function:


^{WS}*"/*"      { /* match comment at the start of a line */
                        int ret;
                        if ((ret = get_comment(FALSE)) != 0)
                                return ret;
                }
"/*"            { /* match comment after something else on the line */
                        int ret;
                        if ((ret = get_comment(TRUE)) != 0)
                                return ret;
                }
...
%%
...


/* this marvellously finely handcrafted state machine parses a C comment,
 * extracts the interesting text joining adjacent comment lines into one big
 * block, and determines if the comment is at the end of the line or not.
 */
static int
get_comment(ateol)
boolean ateol;
{
        int c;
        enum
        {
                ...
        } state = START;

        while ((c = input()) > 0)
        {
                switch (c)
                {
                case ...:
                        if (theres_something_more_but_its_not_a_comment)
                        {
                                unput(whatever_weve_read_ahead);
                                goto leave;
                        }
                        ...
                }
        }

leave:
        if (c > 0)
        {
                unput(c);       /* let lex see the terminating character */
                if (c == '\n')
                        line_num--;     /* compensate for seeing it twice */
        }
        else
                unput('\n');    /* make sure lex sees a \n at end-of-file */

        return ateol ? T_EOLCOMMENT : T_COMMENT;
}


The following example of a comment lexer appears in the flex
documentation, but it doesn't suffer from the same problem: once it sees
the terminating `/', it does not need to keep looking for another comment
on the next line to coalesce, so it returns straight away. If it hits
EOF while input()ing, that causes a call to error() which (presumably)
resets the lexer. But hitting EOF while looking for the start of a
possible comment on the next line isn't an error condition.


%%
"/*"    {
                register int c;

                for ( ; ; )
                        {
                        while ( (c = input()) != '*' &&
                                c != EOF )
                                ;       /* eat up text of comment */

                        if ( c == '*' )
                                {
                                while ( (c = input()) == '*' )
                                        ;
                                if ( c == '/' )
                                        break;  /* found the end */
                                }

                        if ( c == EOF )
                                {
                                error( "EOF in comment" );
                                break;
                                }
                        }
        }




I guess one option might be to coalesce the comments in the grammar, but
the problem there is that they must be immediately adjacent in order to be
coalesced - if there is a blank line between them then they are separate
tokens; but knowledge of blank lines is gone by the time it gets to the
parser.


Does anyone have any clues how I could get around this problem?


regards,
Graham
--
Graham Stoney, Hardware/Software Engineer
Canon Information Systems Research Australia
Ph: + 61 2 805 2909 Fax: + 61 2 805 2929