|From:||email@example.com (John Lilley)|
|Date:||1 Dec 1996 22:56:00 -0500|
|References:||96-10-076 96-10-081 96-10-149 96-11-079 96-11-103 96-11-123|
>But it's not just keywords. Consider the two Pascal fragments:
>The first tokenises as REAL_LITERAL(1.4) whereas the second goes to
>INTEGER_LITERAL(1) KEYWORD(..) INTEGER_LITERAL(4). After lexing 1. if
>you then see another period you have to back up.
Ah yes, this is true. (And FORTRAN is worse, of course). But even
this nasty case can be treated with a hack. Of course, anything can
be treated with a hack :-)
In this case, a normal, non-backtracking, one-char-lookahead lexer can
deal with it if you define tokens like (and I simplify):
FLOAT = "[0-9]+\.[0-9]+"
INT = "[0-9]+"
RANGE = "\.\."
RANGE_START = "[0-9]+\.\."
Of course, you must (a) write the grammar to accommodate either INT
RANGE INT or RANGE_START INT, and (b) strip the trailing ".." when you
process the RANGE_START token. And it gets worse, because typically
the range can be an expression, not just an integer.
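To make the idea concrete, here is a rough sketch of a hand-written scanner for the four simplified tokens above. It is only an illustration, not anybody's production lexer: the function name tokenize and the (kind, text) tuple shape are my own assumptions, and anything outside the simplified token set (including a lone "1.") is simply rejected. The point is that the index only ever moves forward, and each decision looks at a single upcoming character.

```python
def tokenize(src):
    """Scan src into (kind, text) pairs without ever backing up.

    Hypothetical sketch: handles only INT, FLOAT, RANGE and RANGE_START
    as defined in the simplified token list.
    """
    tokens = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isdigit():
            start = i
            while i < n and src[i].isdigit():
                i += 1
            if i < n and src[i] == ".":
                i += 1  # consume the first dot; we never un-read it
                if i < n and src[i] == ".":
                    i += 1
                    tokens.append(("RANGE_START", src[start:i]))
                elif i < n and src[i].isdigit():
                    while i < n and src[i].isdigit():
                        i += 1
                    tokens.append(("FLOAT", src[start:i]))
                else:
                    # "1." alone is not a token in this simplified set
                    raise SyntaxError("digit or '.' expected at %d" % i)
            else:
                tokens.append(("INT", src[start:i]))
        elif c == ".":
            i += 1
            if i < n and src[i] == ".":
                i += 1
                tokens.append(("RANGE", ".."))
            else:
                raise SyntaxError("'.' expected at %d" % i)
        else:
            raise SyntaxError("unexpected %r at %d" % (c, i))
    return tokens
```

With this, "1.4" comes out as a single FLOAT, "1..4" as RANGE_START followed by INT, and "1 .. 4" as INT RANGE INT, so the grammar has to accept both of the last two shapes.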
A "better" approach, if you can stomach it, is to hack the lexer
output, and turn RANGE_START into two separate INT, RANGE tokens. But
now we're approaching the land of the original problem, which is
hacking the lexer output with a perfect hash table.
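For what it's worth, the output hack itself is tiny. The sketch below assumes the lexer hands back (kind, text) pairs and that a RANGE_START token's text always ends in ".."; the name split_range_start is made up for this example. It sits between lexer and parser, so the grammar only ever sees INT RANGE INT.

```python
def split_range_start(tokens):
    """Rewrite each RANGE_START token as an INT followed by a RANGE.

    Hypothetical filter: tokens is any iterable of (kind, text) pairs,
    and RANGE_START text is assumed to be digits followed by "..".
    """
    for kind, text in tokens:
        if kind == "RANGE_START":
            yield ("INT", text[:-2])  # drop the trailing ".."
            yield ("RANGE", "..")
        else:
            yield (kind, text)


# Example: the token stream a lexer might produce for "1..4"
raw = [("RANGE_START", "1.."), ("INT", "4")]
fixed = list(split_range_start(raw))
```

After the filter, fixed holds INT("1"), RANGE(".."), INT("4"), the same stream you would get from "1 .. 4".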
Other than that, Mrs. Lincoln...