Re: Context sensitive scanner ?

Henry Spencer <henry@zoo.toronto.edu>
28 Nov 1997 01:03:00 -0500

          From comp.compilers

Related articles
Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-20)
Re: Context sensitive scanner ? johnm@non.net (1997-11-23)
Re: Context sensitive scanner ? pjj@cs.man.ac.uk (1997-11-23)
Re: Context sensitive scanner ? Mikael.Pettersson@sophia.inria.fr (Mikael Pettersson) (1997-11-23)
Re: Context sensitive scanner ? genew@vip.net (1997-11-23)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-24)
Re: Context sensitive scanner ? cfc@world.std.com (Chris F Clark) (1997-11-28)
Re: Context sensitive scanner ? henry@zoo.toronto.edu (Henry Spencer) (1997-11-28)
Re: Context sensitive scanner ? ok@cs.rmit.edu.au (1997-11-29)
Re: Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-29)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-30)
Re: Context sensitive scanner ? johnm@non.net (1997-11-30)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-30)
Re: Context sensitive scanner ? clark@quarry.zk3.dec.com (Chris Clark USG) (1997-12-05)
[2 later articles]

From: Henry Spencer <henry@zoo.toronto.edu>
Newsgroups: comp.compilers
Date: 28 Nov 1997 01:03:00 -0500
Organization: SP Systems, Toronto
References: 97-11-117 97-11-127
Keywords: lex

Mikael Pettersson <Mikael.Pettersson@sophia.inria.fr> wrote:
>1. A simple low-level scanner generates a stream of tokens.
>2. A "transformation" scanner is a coroutine with an input stream
> and an output stream... Push one transformation scanner for each
> individual kind of transformation...
>This strategy was quite successful... Haven't seen anything like this
>described in the compiling literature, though.


I do not recall seeing the idea of pushing little coroutines, one per
syntactic blemish, before. However, the idea that there might be more
than one layer of scanner is quite old.
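In today's terms, the coroutine pipeline Pettersson describes can be sketched with Python generators, each "transformation scanner" reading one token stream and yielding another (the token kinds and transformations here are illustrative, not his actual code):

```python
def scanner(text):
    """Low-level scanner: one raw token per whitespace-separated word."""
    yield from text.split()

def glue_assign(tokens):
    """One transformation scanner: fuse adjacent ':' '=' into ':='."""
    prev = None
    for tok in tokens:
        if prev == ":" and tok == "=":
            yield ":="
            prev = None
        else:
            if prev is not None:
                yield prev
            prev = tok
    if prev is not None:
        yield prev

def tag_keywords(tokens):
    """Another transformation scanner: recognize reserved words."""
    keywords = {"if", "then", "else"}
    for tok in tokens:
        yield ("KW", tok) if tok in keywords else ("ID", tok)

# Push one coroutine per transformation; each sees a simple stream.
stream = tag_keywords(glue_assign(scanner("if x : = 1")))
```

Each layer is oblivious to the others, so adding a new fix-up means writing one small generator and inserting it into the chain.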


When I was in grad school, about 20 years ago, one of my profs preferred
a model of a compiler in which the parser was preceded by two separate
phases: the scanner (which partitioned the input text into tokens) and
the screener (which tidied up the token stream, doing things like deleting
non-significant white space and recognizing keywords).
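A screener in that model might look like the following sketch (token kinds and the keyword set are hypothetical, chosen only to show the two tidying jobs mentioned above):

```python
def screener(tokens):
    """Screener phase: tidy the raw (kind, text) token stream
    before the parser sees it."""
    keywords = {"begin", "end", "while", "do"}
    for kind, text in tokens:
        if kind == "WHITE":
            continue                    # delete non-significant white space
        if kind == "ID" and text in keywords:
            yield ("KEYWORD", text)     # recognize reserved words
        else:
            yield (kind, text)
```

The scanner stays a dumb partitioner of characters into tokens, and all the stream-level tidying lives in one small, easily changed phase.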


I've also implemented several systems in which the scanner is followed by
an error-repair layer which interfaces the scanner to the (top-down)
parser, hiding syntax errors from the parser. (One of the nice things
about a top-down parser is that it can tell you what tokens it thinks
might come next, and if what you've got isn't one of those, you can give
it what it wants and then invoke heuristics to try to resynchronize the
input with the parser... without the parser ever having to know. The
parser always sees a syntactically correct program, and the semantic
routines intertwined with the parser never have to suddenly stop and rip
out everything they've done when the parser detects an error.)
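One simple form of such a repair layer is insertion repair: when the lookahead token is not among those the parser could accept, fabricate one that is and carry on. The sketch below illustrates the interface (it is not Spencer's actual implementation, and real repair heuristics also delete or replace input tokens):

```python
class RepairingStream:
    """Error-repair layer between the scanner and a top-down parser.

    The parser asks for the next token, passing the set of tokens it
    could accept here.  On a mismatch we record the error, hand the
    parser a token it wants, and leave the input untouched so later
    requests can resynchronize.  The parser itself never sees a
    syntactically incorrect program."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0
        self.errors = []

    def next(self, expected):
        tok = self.tokens[self.pos] if self.pos < len(self.tokens) else "EOF"
        if tok in expected:
            self.pos += 1
            return tok
        # Syntax error: give the parser what it wants and move on.
        self.errors.append("expected one of %s, got %r" % (sorted(expected), tok))
        return min(expected)
```

Because the parser supplies its set of acceptable tokens at each step, the repair layer needs no grammar knowledge of its own, and the semantic routines behind the parser run exactly as they would on correct input.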


A slight digression... More recently, I've discovered that a split into
multiple layers can do wonders for parsing even simple-looking notations
like regular expressions. Most RE implementations, including a couple I
wrote, just charge in and do it all in one layer... but splitting it into
two layers, with a distinct scanner underneath the parser, turns out to do
wonders for handling real RE syntaxes (which are rather messier than the
ones in the textbooks). The extra leverage provided by the layer split
really quite surprised me; all kinds of nuisances like context-dependent
metacharacters (e.g., $ is magic only at the end of the RE) turn out to
require only a line or two of code... *if* it doesn't have to be embedded
in the middle of a parser, or *if* it can look ahead to the next token
instead of just the next character.
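As a toy illustration of that leverage (a deliberately tiny RE syntax, not any real implementation): because the scanner layer can see where it is in the pattern, the context-dependence of `$` really does come down to a line or two.

```python
def re_scan(pattern):
    """Scanner layer for a toy RE syntax: yields (kind, char) tokens.

    '$' is an anchor only at the very end of the RE; anywhere else it
    is an ordinary character.  With a separate scanner layer, that
    context-dependent rule costs one conditional."""
    chars = list(pattern)
    for i, ch in enumerate(chars):
        if ch == "$":
            yield ("ANCHOR" if i == len(chars) - 1 else "CHAR", ch)
        elif ch in "*+?":
            yield ("REPEAT", ch)
        else:
            yield ("CHAR", ch)
```

The parser above this layer then deals only in token kinds, never in raw characters, and stays free of positional special cases.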


Actually, I guess that wasn't a digression after all. I think there's a
general moral to be drawn. When you add a layer split, what you're buying
is separation of concerns: the simple and narrow interface between the
layers hides the complexity of the previous or following layer. That way,
code to deal with some oddity can see a simple environment, where a little
bit of code can have a lot of leverage.
--
If NT is the answer, you didn't | Henry Spencer
understand the question. -- Peter Blake | henry@zoo.toronto.edu