Re: Context sensitive scanner ?

Chris Clark USG <clark@quarry.zk3.dec.com>
5 Dec 1997 00:54:04 -0500

          From comp.compilers

Related articles
[7 earlier articles]
Re: Context sensitive scanner ? henry@zoo.toronto.edu (Henry Spencer) (1997-11-28)
Re: Context sensitive scanner ? ok@cs.rmit.edu.au (1997-11-29)
Re: Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-29)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-30)
Re: Context sensitive scanner ? johnm@non.net (1997-11-30)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-30)
Re: Context sensitive scanner ? clark@quarry.zk3.dec.com (Chris Clark USG) (1997-12-05)
Re: Context sensitive scanner ? mark@research.techforce.nl (Mark Thiehatten) (1997-12-07)
Re: Context sensitive scanner ? qjackson@direct.ca (1997-12-07)
From: Chris Clark USG <clark@quarry.zk3.dec.com>
Newsgroups: comp.compilers
Date: 5 Dec 1997 00:54:04 -0500
Organization: Digital Equipment Corporation - Marlboro, MA
References: 97-11-117 97-11-122 97-11-172
Keywords: lex

Sorry to keep continuing this subject, but
Albert Theo Hofkamp <hat@se-46.wpa.wtb.tue.nl> asked:
> I am asking for a scanner which tries to do interpretation within
> the limits of what the parser expects. In other words, when the parser
> doesn't want a real, and the scanner gets a real as input, it will not
return a real, but 2 integer numbers separated by a dot.
>
Obviously :-), there are some serious consequences to this
approach. Is there literature about this?


Literature: look up "Scannerless NSLR(1) Parsing" in Sigplan Notices
(I think about 10 years back). Anyway, what they implemented was a
lexer which used the parser context to determine what tokens were
legal to follow the current token and thus determine the
lookahead/termination criteria for matching the token. The idea was
very similar to yours: in certain contexts "1..10" represents a range
(int dot_dot int), and in those contexts "1.129", representing a
real, was not legal.
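To make the range-versus-real distinction concrete, here is a minimal
Python sketch (mine, not from the paper): the same input lexes
differently depending on whether the parser context permits a REAL.
When a real is not allowed, "1.129" comes back as two integers
separated by a dot, exactly as described above.

```python
import re

# Hypothetical sketch: a tokenizer whose behaviour depends on a
# context flag supplied by the parser.
def tokenize(text, real_allowed=True):
    # Order matters: try REAL first (when permitted), then the range
    # operator, then a lone dot, then a plain integer.
    patterns = [('DOTDOT', r'\.\.'), ('DOT', r'\.'), ('INT', r'\d+')]
    if real_allowed:
        patterns.insert(0, ('REAL', r'\d+\.\d+'))
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for name, pat in patterns:
            m = re.match(pat, text[i:])
            if m:
                tokens.append((name, m.group()))
                i += len(m.group())
                break
        else:
            raise ValueError('unexpected character: %r' % text[i])
    return tokens
```

Note that "1..10" never matches the REAL pattern in either mode,
because `\d+\.\d+` requires a digit immediately after the dot.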


The basic idea is that you can determine which tokens are legal in
each parser state (although it may take some digging depending on the
parser generator). You can also determine the potential transitions
between parser states. Given that information you can determine which
tokens can follow a given token (either globally or in the current
context). You can then create different token recognizers (lexer
states) which match the different tokens based on what legally can
follow them in the different contexts. In theory, you can even
generate these different recognizers automatically--I don't remember
if the paper mentioned above did so or not--it certainly hinted at
doing it automatically even if they did it by hand only to prove its
feasibility.
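As a toy illustration of that analysis (the grammar fragment and the
table layout here are invented, not from any particular generator):
given a map from parser states to their legal tokens and the
transitions between states, the tokens that can follow a given token
globally are the union of the legal sets of every state that token
can shift into.

```python
# Invented LR-style fragment for the range/real example: which tokens
# each parser state accepts, and where shifting a token leads.
legal = {
    0: {'INT'},             # start: expect the first number
    1: {'DOTDOT', 'DOT'},   # after an INT: range operator or fraction dot
    2: {'INT'},             # after DOTDOT: the upper bound
    3: {'INT'},             # after DOT: the fractional digits
}
goto = {
    (0, 'INT'): 1,
    (1, 'DOTDOT'): 2,
    (1, 'DOT'): 3,
    (2, 'INT'): 1,
    (3, 'INT'): 1,
}

def follow_set(token):
    """Tokens that may legally appear after `token` in any context."""
    out = set()
    for (state, tok), dest in goto.items():
        if tok == token:
            out |= legal[dest]
    return out
```

A generator doing this automatically would compute `follow_set` for
every token and emit one token recognizer (lexer state) per distinct
follow set.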


At run-time you have three options, the lexer and the parser can
interact to determine the context and which token recognizer (lexer
state) to use, the lexer can control the state internally, or the
lexer can dynamically query whether certain tokens are currently
legal (your idea).


I don't think your alternative of having the lexer query the parser
for which tokens are legal has been cited in this discussion as either
good or bad. It can certainly be implemented. {A parser member
function in Yacc++ generated parsers provides exactly which tokens the
parser is expecting in the current context--technically it prints them
for error recovery purposes, but the code could be modified to do what
you want. A similar function could be written for yacc generated
parser tables. A function that returned which tokens could be legal
after the current token could also be written, although it is a bit
more complicated.} You will, of course, need to decide whether you
will do a series of queries and pick a lexer state based upon that,
or query after having lexed the token whether that token would be
legal and then recover if it isn't.
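A rough sketch of the query-first variant, assuming a parser object
that exposes something playing the role of the Yacc++ member function
mentioned above (all names here are hypothetical):

```python
# Hypothetical stand-in for a generated parser: `expected_tokens`
# reports which tokens the parser would accept in its current state.
class TinyParser:
    EXPECT = {
        'start': {'INT'},
        'after_int': {'DOTDOT', 'DOT'},
        'after_eq': {'REAL', 'INT'},   # a context where a real is legal
    }

    def __init__(self, state='start'):
        self.state = state

    def expected_tokens(self):
        return self.EXPECT[self.state]

def pick_recognizer(parser):
    # Query first: enable the REAL recognizer only when the parser
    # could actually accept a REAL in its current state.
    return 'real_ok' if 'REAL' in parser.expected_tokens() else 'no_real'
```

The lex-then-check alternative would instead match greedily, ask
whether the matched token is in `expected_tokens()`, and back up and
re-lex if it isn't.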


As to the other two implementations:


The point has been clearly made that feedback from the parser to the
lexer (using lexer states controlled from the parser) is not a good
idea when avoidable. {To make it even feasible for the parser to
direct the lexer in Yacc++ we implemented "shift action code" in a
very specific way to make certain that it would have reliable
lookahead properties (i.e. the generator guarantees that the tokens
after the shift action code will not participate in lookahead until
after the action code has been executed <under certain predictable
conditions exactly one token of lookahead can be read before the
action code is executed>) and the manual has a section which explains
why the user cares.}


However, at the same time using (manually controlled) state in a lexer
to control which tokens are returned should also be avoided when
possible. Having read more than a few lexers, I have seen that lexers
which use state are significantly less robust than ones which don't.
In addition, the lexers which used state often had more unintended
lexical artifacts (quirks) and places where they mislexed input. The
same also holds true for lexers which use lookahead to determine which
token to return. Of course, when you are parsing a language defined
by someone else, you may have to choose between these three evils
(parser interaction, lexer states, and lookahead) and the specifics of
the situation may make one of them better than the others.


One important thing worth noting is that if the lexer controls the
state internally, it must simulate some portion of the parser's
actions. If those changes of state are simple and fixed, this may not
be a significant problem. However, if you find yourself keeping a
stack and having numerous states to consider, you probably are redoing
the parser generator's work by hand, which means you are likely to
introduce an error (if not immediately, sometime in the eventual
maintenance of the lexer)--be forewarned.


For these reasons, if you choose to abandon your idea of querying, I
would recommend considering the co-routine approach if your conditions
advance beyond trivial. Of course, the co-routine approach is not a
panacea either, and there are things to be careful about when using it
also, but separating things into passes generally makes things easier
and cleaner.
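For what it's worth, here is one way the co-routine arrangement might
look sketched in Python, with generators standing in for co-routines
and the range/real token set from earlier (my example, not a
prescription):

```python
import re

# The lexer is a generator: it yields tokens only as the parser asks
# for them, keeping the two passes separate with no shared lexer state.
def lexer(text):
    for m in re.finditer(r'\d+\.\d+|\.\.|\.|\d+|\S', text):
        s = m.group()
        if re.fullmatch(r'\d+\.\d+', s):
            yield ('REAL', s)
        elif s == '..':
            yield ('DOTDOT', s)
        elif s == '.':
            yield ('DOT', s)
        elif s.isdigit():
            yield ('INT', s)
        else:
            yield ('CHAR', s)

def parse_range(tokens):
    """Consume INT DOTDOT INT and return the bounds as a pair."""
    toks = iter(tokens)
    kind, lo = next(toks)
    assert kind == 'INT', kind
    kind, _ = next(toks)
    assert kind == 'DOTDOT', kind
    kind, hi = next(toks)
    assert kind == 'INT', kind
    return int(lo), int(hi)
```

The parser pulls tokens on demand; if it needed to change lexing
context, it could resume a different lexer generator instead of
poking state into a shared one.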


Hope this helps,
-Chris Clark
************************************************************************
Compiler Resources, Inc. email: compres@world.std.com
3 Proctor St. http://world.std.com/~compres
Hopkinton, MA 01748 phone: (508) 435-5016
USA 24hr fax: (508) 435-4847
--

