|Context sensitive scanner ? email@example.com (Albert Theo Hofkamp) (1997-11-20)|
|Re: Context sensitive scanner ? firstname.lastname@example.org (1997-11-23)|
|Re: Context sensitive scanner ? email@example.com (1997-11-23)|
|Re: Context sensitive scanner ? Mikael.Pettersson@sophia.inria.fr (Mikael Pettersson) (1997-11-23)|
|Re: Context sensitive scanner ? firstname.lastname@example.org (1997-11-23)|
|Re: Context sensitive scanner ? email@example.com (Scott Stanchfield) (1997-11-24)|
|Re: Context sensitive scanner ? firstname.lastname@example.org (Chris F Clark) (1997-11-28)|
|Re: Context sensitive scanner ? email@example.com (Henry Spencer) (1997-11-28)|
|Re: Context sensitive scanner ? firstname.lastname@example.org (1997-11-29)|
|Re: Context sensitive scanner ? email@example.com (Albert Theo Hofkamp) (1997-11-29)|
|Re: Context sensitive scanner ? firstname.lastname@example.org (Scott Stanchfield) (1997-11-30)|
|Re: Context sensitive scanner ? email@example.com (1997-11-30)|
|[4 later articles]|
|From:||"Scott Stanchfield" <firstname.lastname@example.org>|
|Date:||24 Nov 1997 23:38:47 -0500|
|Organization:||MageLang Institute - http://www.magelang.com|
Wow! That sounds _exactly_ like what I had to do to parse Visual Basic!
Add on top of that labeled-end-ifs...(I think this was the one.)
IF X=1 THEN
100 END IF
which needs THREE tokens of lookahead. I handled it the same way you did,
although I called it a "scanner wrapper", and, using lex at the time, had my
makefil rename the lex-generated scanner to "real_yylex" and wrote my own
yylex that called real_yylex and changed things like END IF to END_IF.
Did your language have something evil like NEXT X,Y...
(Obviously the language was designed purely with the end user in mind, and
no thought for the poor compiler-writer...)
Perhaps we should write a paper on this stuff... ;)
Scott Stanchfield -- http://www.scruz.net/~thetick
Magelang Institute -- http://www.magelang.com
>>the second expression should be returned as IDEN x, DOT, NUM 1, DOT NUM 2
>>I'd rather want to know what tokens the parser is expecting, and recognize
>>only those tokens (maybe with an exception on keywords) in the scanner.
>I don't know about any tools for dealing with this situation, but I
>once had to cope with something similar, but worse.
>Back in 1988, I wrote a compiler (a translator actually) for a proprietary
>Basic dialect, using yacc for the parser and a hand-written scanner (never
>liked Unix C lex much). The language had some "interesting" features:
>* It had an IF <exp> THEN <statements> [ELSE ...] END IF construct:
> * "END" and "IF" are separate tokens, thus requiring two tokens
> lookahead in the parser ("END" all by itself is a statement).
> As we all know, yacc does LALR(1), not LALR(2). My very first
> grammar had > 600 reduce/reduce conflicts!
> * You could *omit* the "END IF" part if the following statement
> started with a line number. In this case, a number of "virtual"
> "END IF" would be supplied to terminate any pending open "IF".
>* Two-token lookahead was also required for some other cases, like
> "END WHILE" and "END FOR".
>* There were other cases where the token to be returned for some
> construct depended on a limited range of surrounding context.
> (I cannot, and really don't want to, recall the details now.)
>Now, one could imagine some awful hack, adding persistent (i.e., C
>"static") state variables in the scanner routines to recognize the
>different situations and cope with them. An early version of my
>scanner did just that, but it was an unmaintainable mess that never
>really worked well.
>I eventually recast the problem as a stack of coroutines
>processing streams of tokens. That is:
>1. A simple low-level scanner generates a stream of tokens.
>2. A "transformation" scanner is a coroutine with an input stream
> and an output stream. It steps through its input stream
> looking for its particular kind of transformation. If it
> doesn't find one, the first input token is output as-is,
> otherwise the transformation consumes N input tokens and
> produces M output tokens.
>3. Push one transformation scanner for each individual kind
> of transformation. Keep each scanner as simple as possible.
>4. The yacc-generated parser gets its input tokens from the
> output stream of the top-most transformation scanner.
>For instance, one transformation scanner looked for two-token
>sequences "END IF/WHILE/FOR", emitting instead synthetic
>Another maintained the number of pending END_IF tokens by counting
>IF tokens: upon seeing a LINE_NUMBER token, it would insert
>the pending number of END_IF tokens before the LINE_NUMBER.
>This strategy was quite successful, since I could essentially
>forget about all the nasty special cases in both the yacc grammar
>and the low-level scanner. The implementation of the coroutines
>is almost trivial. Haven't seen anything like this described
>in the compiling litterature, though.
>Send compilers articles to email@example.com, meta-mail to
>firstname.lastname@example.org. Archives at http://www.iecc.com/compilers
Return to the
Search the comp.compilers archives again.