From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 30 Oct 1998 13:52:45 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 98-10-155
Keywords: Java, lex
Jon Willeke asked:
> I'm using JLex to write a lexer for M. I have a case that calls for
> lookahead, which JLex doesn't support. For example, consider this
> line of code:
>
> write $p($p,",")
>
> The first "$p" is an abbreviated form of the special function
> "$piece." The second "$p" is an abbreviated form of the special
> variable "$principal." I'd like the lexer to be smart enough to tell
> the difference, and it's easy with lookahead: functions are always
> followed by an open paren.
While I can't speak to the exact issues with JLex, there is a
forthcoming "new" SIGPLAN Notices column (called Practical Parsing
Patterns) which will be covering exactly these kinds of issues (in a
non-generator specific manner). The first article in this column (to
appear in the December issue) addresses lookahead issues in lexers and
how to get around lack of lookahead--there are more options than one
would think.
So that there will be some reason for people to read the article, I
won't post the entire text here (it's also 5 pages, so it's long even
for my postings). I will suggest a couple of solutions.
Make a special lexer rule for function names that repeats your
identifier rule but includes the open paren, e.g.
ident: "a".."z"+;
function_ident: "a".."z"+ "(";
The only problem with going down that path is that it doesn't handle
whitespace (or comments) gracefully. To handle whitespace (or
comments), you need to include them and the result gets very complex
very fast.
ident: "a".."z"+
| "a".."z"+ " "+; // trailing whitespace
function_ident: "a".."z"+ "("
| "a".."z"+ " "* "("; // trailing whitespace after ident
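To see how the two rules above interact, here is a minimal sketch in plain Java using java.util.regex rather than JLex itself. The class name, the token spellings, and the CHAR fallback are my own assumptions for illustration, and only blanks are treated as whitespace. Note that the function rule swallows the open paren, so the parser must not expect a separate paren token after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a generated lexer; not JLex output.
class TwoRuleLexer {
    // ident: "a".."z"+
    static final Pattern IDENT = Pattern.compile("[a-z]+");
    // function_ident: "a".."z"+ " "* "("
    static final Pattern FUNCTION_IDENT = Pattern.compile("[a-z]+ *\\(");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Matcher fn = FUNCTION_IDENT.matcher(input).region(pos, input.length());
            Matcher id = IDENT.matcher(input).region(pos, input.length());
            if (fn.lookingAt()) {
                // Whenever this rule matches, its match is longer than the
                // plain identifier match, so longest-match would pick it.
                String name = input.substring(pos, fn.end() - 1).trim();
                tokens.add("FUNCTION_IDENT(" + name + ")");
                pos = fn.end(); // the "(" is consumed as part of the token
            } else if (id.lookingAt()) {
                tokens.add("IDENT(" + id.group() + ")");
                pos = id.end();
            } else {
                // Anything else: skip blanks, emit other characters one by one.
                char c = input.charAt(pos);
                if (c != ' ') {
                    tokens.add("CHAR(" + c + ")");
                }
                pos++;
            }
        }
        return tokens;
    }
}
```

Calling TwoRuleLexer.tokenize("p(p,x)") classifies the first p as a function identifier and the second as a plain identifier, with no lookahead operator needed in the rules.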
Alternatively, and probably better for this case, you should just defer
abbreviation expansion until after the lexer. That means you can
either do it in the parser or stick a phase between the lexer and the
parser. Generally, it is preferable to do one's spelling look-ups in
the lexer. However, in your case, the look-ups are somewhat context
sensitive and thus belong in the parser.
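A phase between the lexer and the parser could look roughly like the sketch below. It is only an illustration under some assumptions of mine: tokens are plain strings, the lexer has already dropped whitespace and comments, and the class and method names are made up. The only abbreviation handled is the "$p" from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical filter that sits between the lexer and the parser.
class AbbreviationFilter {
    // Expands "$p" using one token of lookahead on the token stream:
    // followed by "(" it is the function, otherwise the variable.
    static List<String> expand(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (tok.equalsIgnoreCase("$p")) {
                boolean isCall = i + 1 < tokens.size()
                        && tokens.get(i + 1).equals("(");
                out.add(isCall ? "$piece" : "$principal");
            } else {
                out.add(tok); // every other token passes through unchanged
            }
        }
        return out;
    }
}
```

The lexer stays free of lookahead, since the filter works on whole tokens and can peek at the next one as easily as a parser could.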
The really ambitious (or really perverse) would split the difference
and do the lookups in the lexer, identifying exactly which
abbreviations are ambiguous and returning special tokens for the
ambiguous ones, which the parser can then use context to decipher. The
subsequent maintainers would then be saddled with that complexity
until they got fed up.
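For the curious, the split approach might be sketched as follows. Everything here is hypothetical: the class, the Kind enum, and the table layout are mine, and the "$qqf" and "$qqv" entries are made-up placeholders rather than real M names. Only "$p" comes from the question. The lexer would call classify, and the parser would call resolve once it knows whether an open paren follows.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup tables shared by the lexer and the parser.
class AmbiguityAwareLookup {
    enum Kind { FUNCTION, VARIABLE, AMBIGUOUS, UNKNOWN }

    static final Map<String, String> FUNCTIONS = new HashMap<>();
    static final Map<String, String> VARIABLES = new HashMap<>();
    static {
        FUNCTIONS.put("$p", "$piece");       // from the question
        VARIABLES.put("$p", "$principal");   // from the question
        FUNCTIONS.put("$qqf", "$qqfunc");    // made-up entry, not real M
        VARIABLES.put("$qqv", "$qqvar");     // made-up entry, not real M
    }

    // Lexer side: decide which token kind to return for an abbreviation.
    static Kind classify(String abbrev) {
        boolean isFunction = FUNCTIONS.containsKey(abbrev);
        boolean isVariable = VARIABLES.containsKey(abbrev);
        if (isFunction && isVariable) return Kind.AMBIGUOUS;
        if (isFunction) return Kind.FUNCTION;
        if (isVariable) return Kind.VARIABLE;
        return Kind.UNKNOWN;
    }

    // Parser side: only AMBIGUOUS tokens need the syntactic context.
    static String resolve(String abbrev, boolean followedByOpenParen) {
        switch (classify(abbrev)) {
            case FUNCTION:
                return FUNCTIONS.get(abbrev);
            case VARIABLE:
                return VARIABLES.get(abbrev);
            case AMBIGUOUS:
                return followedByOpenParen
                        ? FUNCTIONS.get(abbrev)
                        : VARIABLES.get(abbrev);
            default:
                return abbrev; // not an abbreviation we know about
        }
    }
}
```

Even in this toy form the knowledge of what "$p" means is spread across two components, which is the maintenance cost mentioned above.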
Hope this helps,
-Chris
*****************************************************************************
Chris Clark Internet : cfc@world.std.com
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)