LR-parser-based lexical analysis - does it work?

"Sönke Kannapinn" <>
13 Oct 2002 16:27:13 -0400

          From comp.compilers

Related articles
LR-parser-based lexical analysis - does it work? (Sönke Kannapinn) (2002-10-13)
Re: LR-parser-based lexical analysis - does it work? (Chris F Clark) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? (Vladimir N. Makarov) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? (VBDis) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? (Brian Smith) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? (Josef Grosch) (2002-10-18)
Re: LR-parser-based lexical analysis - does it work? (Zack Weinberg) (2002-10-20)

From: "Sönke Kannapinn" <>
Newsgroups: comp.compilers
Date: 13 Oct 2002 16:27:13 -0400
Organization: Siemens Business Services
Keywords: lex, parse, question, comment
Posted-Date: 13 Oct 2002 16:27:13 EDT


In a compiler generator project we are thinking about building
scanners using LR parser techniques.

        (Background: We are interested in describing compiler syntax
        and semantics *including* the lexical level using attributed
        CFGs. The high-speed performance of DFA-based scanners is
        nice, but we don't need ultimate speed.)

That's certainly not a new idea. However, in the literature available
to me (including the Web), I couldn't find much on this topic. I found

F. L. DeRemer. Lexical Analysis, pp109-120, Compiler Construction
- An Advanced Course, F. L. Bauer and J. Eickel (editors), Springer
Verlag, 1974.

Here, the scanner generation process just uses a CFG description and
LR items, but the scanners generated are still essentially DFAs
instead of doing "full CFG" parsing. (Not what we want to do.)

I also found a 2002 contribution to the CC conference called
"Disambiguation Filters for Scannerless Generalized LR Parsers".
It targets reverse-engineering tools and uses GLR parsing. It seems
to me that such parsers more or less rule out executing semantic
actions during parsing, because GLR parsers "deal with ambiguities":
a reduce action may eventually turn out not to contribute to the
derivation tree. So that's also not what we're looking for.

Let me try to sketch what I have in mind instead:
Imagine that we have a finite set of token languages L_1, L_2,
..., L_n describing the lexical structure of the target language L
such that (L_1+L_2+...+L_n)* covers L. Furthermore, the lexical
analysis component delivers, for any w in L, an "interpretation"
((w_1,i_1), (w_2,i_2), ..., (w_m,i_m)) such that w = w_1 w_2 ... w_m,
and w_j in L_{i_j}. Indices i_1, ... , i_m represent the token
encodings delivered. For the token languages L_i, we have CFGs
G_i=(N,T,P,S_i). And we have a kind of "scanner CFG" G that is built
by "taking the union of" the G_i and adding a rule
S := /*empty*/ | S S_1 | S S_2 | ... | S S_n. We are interested in
getting a scanner for L by building a conflict-free, say, LR(1)
parser for G.
There should be rules to eliminate some LR conflicts "automatically"
a) realizing what is called "longest match";
b) realizing the fact that usually keywords override identifiers.
There should also be some clean way to deal with whitespace and comments.
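The grammar construction above can be sketched programmatically. The
following is a minimal illustration (all names are mine, not from any
existing tool): token grammars G_i are unioned and a new start symbol
S with the rule S := /*empty*/ | S S_1 | ... | S S_n is added. It only
builds the grammar G; constructing the LR(1) automaton and resolving
conflicts per rules (a) and (b) is the open part of the question.

```python
# Hypothetical sketch: build the combined "scanner CFG" G from per-token
# grammars G_i = (N, T, P, S_i). A grammar is a dict with a start symbol
# and a map from nonterminal to a list of right-hand sides (symbol lists).
# Nonterminal names are assumed distinct across the G_i except where
# deliberately shared; "S" is assumed fresh.

def build_scanner_grammar(token_grammars):
    """Union the G_i and add S -> empty | S S_1 | ... | S S_n."""
    productions = {}
    start_symbols = []
    for g in token_grammars:
        start_symbols.append(g["start"])
        for lhs, rhss in g["productions"].items():
            productions.setdefault(lhs, []).extend(rhss)
    # The new start symbol S derives any sequence of tokens, so the
    # language of G covers (L_1 + L_2 + ... + L_n)*.
    productions["S"] = [[]] + [["S", s_i] for s_i in start_symbols]
    return {"start": "S", "productions": productions}

# Two toy token languages: identifiers (L_1) and integer literals (L_2),
# given as left-recursive CFGs over character classes.
ident = {"start": "Ident",
         "productions": {"Ident": [["letter"], ["Ident", "letter"]]}}
number = {"start": "Num",
          "productions": {"Num": [["digit"], ["Num", "digit"]]}}

g = build_scanner_grammar([ident, number])
print(g["productions"]["S"])  # [[], ['S', 'Ident'], ['S', 'Num']]
```

Note that with this construction, every boundary between two adjacent
tokens is a potential shift/reduce conflict point in the LR automaton
(e.g. after reading "ab", reduce Ident or shift another letter) - which
is exactly where the "longest match" resolution rule (a) would apply.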

Here are some questions which some reader of comp.compilers is
perhaps able to answer:

* Does that approach work (at least for languages whose lexical
structure is "not too complicated")? Or does one really need GLR
parsing?

* If it doesn't work: Where are the problems with it? Do you know
counter-examples of programming languages where one can't do
lexical analysis like that?
(I know of Pascal's '..' problem; are there other problem cases?)

* What is known on performance drawbacks of LR-parsing-based
scanners vs. DFA-based scanners?

* Do you know relevant literature I should be aware of?
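For readers unfamiliar with the '..' problem mentioned above, here is a
minimal sketch (illustrative code, not from any real Pascal compiler): a
scanner that commits to a real literal as soon as it sees "digit '.'"
and cannot back up gets stuck on a subrange expression like "1..10",
because the longest-match strategy needs more than one character of
lookahead at the '.' to decide between REAL and INT DOTDOT INT.

```python
# Minimal sketch of the Pascal '..' pitfall: a one-pass scanner with no
# backtracking past the '.' cannot tokenize "1..10".

def naive_scan(src):
    tokens, i = [], 0
    while i < len(src):
        if src[i].isdigit():
            j = i
            while j < len(src) and src[j].isdigit():
                j += 1
            if j < len(src) and src[j] == '.':
                # Commit to a real literal: digits '.' digits.
                # A correct scanner would peek one more character here
                # and back off if it sees a second '.'.
                j += 1
                if j >= len(src) or not src[j].isdigit():
                    raise ValueError(f"stuck at {j}: expected digit after '.'")
                while j < len(src) and src[j].isdigit():
                    j += 1
                tokens.append(('REAL', src[i:j]))
            else:
                tokens.append(('INT', src[i:j]))
            i = j
        elif src.startswith('..', i):
            tokens.append(('DOTDOT', '..'))
            i += 2
        else:
            raise ValueError(f"unexpected {src[i]!r} at {i}")
    return tokens

print(naive_scan('3.14'))   # [('REAL', '3.14')]
try:
    naive_scan('1..10')     # committed to a real literal after "1."
except ValueError as e:
    print('failed:', e)
```

In an LR-based scanner this shows up as a conflict state after
"digits '.'" where the naive longest-match rule (a) resolves the wrong
way; deciding correctly needs two characters of lookahead.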

If you can provide an answer to any of these questions, or another
useful comment on this scenario, I would be very grateful.

Thanks for your help,
Soenke Kannapinn

Soenke Kannapinn
String.concat[ "soenke", ".", "kannapinn",
        "@", "wincor-nixdorf", ".", "com" ]
[Traditionally people didn't use parsers as lexers because they were
too slow, but in these days of gigahertz PCs that's not much of an
issue any more. -John]
