Re: how to handle lookahead in JLex?

Chris F Clark <cfc@world.std.com>
30 Oct 1998 13:52:45 -0500

          From comp.compilers

Related articles
how to handle lookahead in JLex? jwilleke@ix.netcom.com (1998-10-24)
Re: how to handle lookahead in JLex? jwilleke@ix.netcom.com (1998-10-30)
Re: how to handle lookahead in JLex? cfc@world.std.com (Chris F Clark) (1998-10-30)
Re: how to handle lookahead in JLex? kleing@informatik.tu-muenchen.de (Gerwin Klein) (1998-11-01)

From: Chris F Clark <cfc@world.std.com>
Newsgroups: comp.compilers
Date: 30 Oct 1998 13:52:45 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 98-10-155
Keywords: Java, lex

Jon Willeke asked:
> I'm using JLex to write a lexer for M. I have a case that calls for
> lookahead, which JLex doesn't support. For example, consider this
> line of code:
>
> write $p($p,",")
>
> The first "$p" is an abbreviated form of the special function
> "$piece." The second "$p" is an abbreviated form of the special
> variable "$principal." I'd like the lexer to be smart enough to tell
> the difference, and it's easy with lookahead: functions are always
> followed by an open paren.


While I can't speak to the exact issues with JLex, there is a
forthcoming "new" SIGPLAN Notices column (called Practical Parsing
Patterns) which will be covering exactly these kinds of issues (in a
non-generator specific manner). The first article in this column (to
appear in the December issue) addresses lookahead issues in lexers and
how to get around lack of lookahead--there are more options than one
would think.


So that there will be some reason for people to read the article, I
won't post the entire text here (it's also 5 pages, so it's long even
for my postings). I will, however, suggest a couple of solutions.


Make a special lexer rule for function names that repeats your
identifier rule but includes the opening paren, e.g.:


ident: "a".."z"+;


function_ident: "a".."z"+ "(";


The only problem with going down that path is that it doesn't handle
whitespace (or comments) gracefully. To handle whitespace (or
comments), you need to include them in the rules, and the result gets
very complex very fast.


ident: "a".."z"+
          | "a".."z"+ " "+; // trailing whitespace


function_ident: "a".."z"+ "("
          | "a".."z"+ " "* "("; // trailing whitespace after ident
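

As a rough illustration of how the two rules interact, here is a
minimal Java sketch that emulates them with java.util.regex rather
than with JLex itself. The class name, the method name, and the
returned kind strings are all made up for the example. It assumes a
longest-match lexer, which it imitates by trying the function rule
first, since that rule always matches more text than the plain
identifier rule whenever both apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FunctionIdentRule {
    // Hypothetical regex equivalents of the two rules above:
    // letters, optional blanks, then an opening paren ...
    private static final Pattern FUNCTION_IDENT = Pattern.compile("[a-z]+ *\\(");
    // ... versus letters alone.
    private static final Pattern IDENT = Pattern.compile("[a-z]+");

    /** Classifies the token that starts at pos in input. */
    static String classify(String input, int pos) {
        Matcher f = FUNCTION_IDENT.matcher(input).region(pos, input.length());
        if (f.lookingAt()) {
            return "FUNCTION_IDENT"; // note: this match swallows the "("
        }
        Matcher i = IDENT.matcher(input).region(pos, input.length());
        if (i.lookingAt()) {
            return "IDENT";
        }
        return "NONE";
    }
}
```

Note that the function rule consumes the paren as part of the token,
so the parser's grammar has to expect a function name that already
includes it. Comments between the name and the paren are not handled
here at all, which is exactly where this approach starts to get ugly.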


Alternatively, and probably better for this case, you should just
defer abbreviation expansion until after the lexer. That means you can
either do it in the parser or stick a phase between the lexer and the
parser. Generally, it is preferable to do one's spelling look-ups in
the lexer. However, in your case, the look-ups are somewhat context
sensitive and thus belong in the parser.
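

A phase between the lexer and the parser could look something like
the following Java sketch. It is only an illustration under stated
assumptions: the Token class, the kind names ("NAME", "LPAREN",
"OTHER", "FUNCTION", "VARIABLE"), and the expand method are invented
for the example and are not part of JLex. The lexer is assumed to
emit "$p" as a plain NAME token, and the pass peeks at the next token
to decide which spelling was meant.

```java
import java.util.ArrayList;
import java.util.List;

final class AbbreviationPass {
    /** Minimal token type; the kinds used here are hypothetical. */
    static final class Token {
        final String kind;
        final String text;

        Token(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Expands "$p" to "$piece" or "$principal" using one token of lookahead. */
    static List<Token> expand(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.kind.equals("NAME") && t.text.equals("$p")) {
                // Functions are always followed by an open paren.
                boolean call = i + 1 < tokens.size()
                        && tokens.get(i + 1).kind.equals("LPAREN");
                out.add(call
                        ? new Token("FUNCTION", "$piece")
                        : new Token("VARIABLE", "$principal"));
            } else {
                out.add(t); // everything else passes through unchanged
            }
        }
        return out;
    }
}
```

Because the pass works on tokens, whitespace and comments that the
lexer has already discarded no longer get in the way of the
lookahead.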


The really ambitious (or really perverse) would split the difference
and do the look-ups in the lexer, identifying exactly which
abbreviations are ambiguous and returning special tokens for the
ambiguous ones, which the parser can then use context to decipher. The
subsequent maintainers would then be saddled with that complexity
until they got fed up.
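

For the curious, the split approach might be sketched in Java as
below. Everything here is hypothetical: the class, the two tables,
and the kind strings are invented for illustration, and the "$h"
entry is an assumed example of an unambiguous abbreviation that
should be checked against the M standard. The lexer side classifies
an abbreviation as AMBIGUOUS only when it appears in both tables, and
the parser side resolves it once it knows whether a paren follows.

```java
import java.util.HashMap;
import java.util.Map;

final class AmbiguousAbbreviations {
    // Illustrative tables only; a real lexer would list every abbreviation.
    static final Map<String, String> FUNCTIONS = new HashMap<>();
    static final Map<String, String> VARIABLES = new HashMap<>();

    static {
        FUNCTIONS.put("$p", "$piece");
        VARIABLES.put("$p", "$principal");
        VARIABLES.put("$h", "$horolog"); // assumed unambiguous example
    }

    /** Lexer side: decide which kind of token to return for an abbreviation. */
    static String lexKind(String abbrev) {
        boolean isFunction = FUNCTIONS.containsKey(abbrev);
        boolean isVariable = VARIABLES.containsKey(abbrev);
        if (isFunction && isVariable) {
            return "AMBIGUOUS"; // special token; the parser must decide
        }
        if (isFunction) {
            return "FUNCTION";
        }
        if (isVariable) {
            return "VARIABLE";
        }
        return "UNKNOWN";
    }

    /** Parser side: resolve an abbreviation once the context is known. */
    static String resolve(String abbrev, boolean followedByParen) {
        Map<String, String> table = followedByParen ? FUNCTIONS : VARIABLES;
        return table.get(abbrev); // null if the name is not valid in this context
    }
}
```

The tables now have to be kept in sync with both the lexer and the
parser, which is the maintenance burden mentioned above.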


Hope this helps,
-Chris


*****************************************************************************
Chris Clark Internet : cfc@world.std.com
Compiler Resources, Inc. CompuServe : 74252,1375
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

