Re: Context sensitive scanner ?



Related articles
Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-20)
Re: Context sensitive scanner ? johnm@non.net (1997-11-23)
Re: Context sensitive scanner ? pjj@cs.man.ac.uk (1997-11-23)
Re: Context sensitive scanner ? Mikael.Pettersson@sophia.inria.fr (Mikael Pettersson) (1997-11-23)
Re: Context sensitive scanner ? genew@vip.net (1997-11-23)
Re: Context sensitive scanner ? thetick@magelang.com (Scott Stanchfield) (1997-11-24)
Re: Context sensitive scanner ? cfc@world.std.com (Chris F Clark) (1997-11-28)
Re: Context sensitive scanner ? henry@zoo.toronto.edu (Henry Spencer) (1997-11-28)
Re: Context sensitive scanner ? ok@cs.rmit.edu.au (1997-11-29)
Re: Context sensitive scanner ? hat@se-46.wpa.wtb.tue.nl (Albert Theo Hofkamp) (1997-11-29)
[6 later articles]
From: Mikael Pettersson <Mikael.Pettersson@sophia.inria.fr>
Newsgroups: comp.compilers
Date: 23 Nov 1997 19:43:19 -0500
Organization: INRIA Sophia-Antipolis
References: 97-11-117
Keywords: lex

Albert Theo Hofkamp <a.hofkamp@wtb.tue.nl> wrote:
>We are busy writing a language where the following constructs occur:
>
>1) Literal reals (such as 1.2),
>2) Nested index operations on arrays (such as x.1.2).
>
>Obviously, indices are always unsigned positive integers.
>
>Since the scanner is not context-sensitive, it does not understand that
>the second expression should be returned as IDEN x, DOT, NUM 1, DOT NUM 2
> ...
>I'd rather want to know what tokens the parser is expecting, and recognize
>only those tokens (maybe with an exception on keywords) in the scanner.


I don't know of any tools for dealing with this situation, but I
once had to cope with something similar, only worse.


Back in 1988, I wrote a compiler (a translator actually) for a proprietary
Basic dialect, using yacc for the parser and a hand-written scanner (never
liked Unix C lex much). The language had some "interesting" features:


* It had an IF <exp> THEN <statements> [ELSE ...] END IF construct:
  "END" and "IF" are separate tokens, thus requiring two tokens of
  lookahead in the parser ("END" all by itself is a statement).
  As we all know, yacc does LALR(1), not LALR(2). My very first
  grammar had > 600 reduce/reduce conflicts!
* You could *omit* the "END IF" part if the following statement
  started with a line number. In that case, enough "virtual"
  "END IF" tokens would be supplied to terminate any pending open "IF"s.
* Two-token lookahead was also required for some other cases, like
    "END WHILE" and "END FOR".
* There were other cases where the token to be returned for some
    construct depended on a limited range of surrounding context.
    (I cannot, and really don't want to, recall the details now.)


Now, one could imagine some awful hack, adding persistent (i.e., C
"static") state variables in the scanner routines to recognize the
different situations and cope with them. An early version of my
scanner did just that, but it was an unmaintainable mess that never
really worked well.


I eventually recast the problem as a stack of coroutines
processing streams of tokens. That is:
1. A simple low-level scanner generates a stream of tokens.
2. A "transformation" scanner is a coroutine with an input stream
      and an output stream. It steps through its input stream
      looking for its particular kind of transformation. If it
      doesn't find one, the first input token is output as-is,
      otherwise the transformation consumes N input tokens and
      produces M output tokens.
3. Push one transformation scanner for each individual kind
      of transformation. Keep each scanner as simple as possible.
4. The yacc-generated parser gets its input tokens from the
      output stream of the top-most transformation scanner.
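The four steps above can be sketched with Python generators standing in for the coroutines (the original implementation was C); the whitespace tokenizer, the rule interface, and the fold_negative example are invented here for illustration, not taken from the original compiler:

```python
def low_level_scanner(text):
    """1. A simple low-level scanner: here, one token per
    whitespace-separated word (a stand-in for the real lexer)."""
    yield from text.split()

def make_transformer(rule):
    """2. Wrap a rule as a transformation scanner. 'rule' inspects the
    head of the stream and returns either None (pass the first token
    through as-is) or (n_consumed, replacement_tokens)."""
    def transformer(tokens):
        buf = list(tokens)          # simplification: buffer the stream
        i = 0
        while i < len(buf):
            hit = rule(buf[i:])
            if hit is None:
                yield buf[i]
                i += 1
            else:
                n, out = hit
                yield from out      # N tokens in, M tokens out
                i += n
    return transformer

def stack(tokens, *transformers):
    """3./4. Push one transformer per kind of rewrite; the parser pulls
    its tokens from the output of the top-most one."""
    for t in transformers:
        tokens = t(tokens)
    return tokens

# Hypothetical example rule: fold a unary minus into the number after it,
# consuming two tokens and producing one.
def fold_negative(head):
    if head[:1] == ["-"] and len(head) > 1 and head[1].isdigit():
        return 2, ["-" + head[1]]
    return None
```

With this shape, each transformer stays oblivious to the others; adding a new special case means pushing one more small pass rather than touching the scanner or the grammar.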


For instance, one transformation scanner looked for two-token
sequences "END IF/WHILE/FOR", emitting instead synthetic
"END_IF/END_WHILE/END_FOR" tokens.
Another maintained the number of pending END_IF tokens by counting
IF tokens: upon seeing a LINE_NUMBER token, it would insert
the pending number of END_IF tokens before the LINE_NUMBER.
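Those two transformation scanners can be sketched as plain generators (Python here for brevity; the original was C, and the token names are illustrative). Note the ordering: the pairing pass must run first, so that explicit "END IF" becomes END_IF before the counting pass sees it:

```python
def pair_end(tokens):
    """Merge the two-token sequences END IF/WHILE/FOR into single
    synthetic END_IF/END_WHILE/END_FOR tokens; a lone END passes through."""
    pending = None                      # one-token lookahead buffer
    for tok in tokens:
        if pending == "END" and tok in ("IF", "WHILE", "FOR"):
            yield "END_" + tok
            pending = None
        else:
            if pending is not None:
                yield pending
            pending = tok
    if pending is not None:
        yield pending

def close_pending_ifs(tokens):
    """Count open IFs; before a LINE_NUMBER token, emit one END_IF for
    each IF still pending (the 'virtual END IF' rule)."""
    depth = 0
    for tok in tokens:
        if tok == "IF":
            depth += 1
        elif tok == "END_IF":
            depth -= 1
        elif tok == "LINE_NUMBER":
            for _ in range(depth):
                yield "END_IF"
            depth = 0
        yield tok

def token_stream(tokens):
    """Stack the two passes; the parser would pull from the result."""
    return close_pending_ifs(pair_end(tokens))
```

For example, an IF left open when a LINE_NUMBER arrives gets its END_IF inserted automatically, while an explicitly closed one is paired and counted off first.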


This strategy was quite successful, since I could essentially
forget about all the nasty special cases in both the yacc grammar
and the low-level scanner. The implementation of the coroutines
is almost trivial. I haven't seen anything like this described
in the compiler literature, though.


/Mikael
--

