Unbodged lookahead algorithms for lexers

dans@chiark.greenend.org.uk (Dan Sheppard)
27 May 1999 23:24:02 -0400

          From comp.compilers

Related articles
Unbodged lookahead algorithms for lexers dans@chiark.greenend.org.uk (1999-05-27)
Re: Unbodged lookahead algorithms for lexers mw@willms.demon.co.uk (Martin Williams) (1999-06-06)
| List of all articles for this month |

From: dans@chiark.greenend.org.uk (Dan Sheppard)
Newsgroups: comp.compilers
Date: 27 May 1999 23:24:02 -0400
Organization: Linux Unlimited
Keywords: lex, design

I'm interested in lookahead algorithms for lexers implemented by
direct DFA construction from the expression (without going via the
correponding NFA). Most of the ones I've seen implemented in practice
chiecken-out on difficult lookaheads (for example if the union of the
last set of the left subexpression and the first set of the second
expression is non-empty). I'm interested in algorithms which are
reasonably efficient, and which implement the lookahead operator
well. I'd rather they were greedy of the left half, rather than the
right, but I don't really mind.

1/ At the moment I've come up with a rather complex algorithm which
uses a number of `registers' for the location of the matched part, and
commands on the arcs of the DFA to update the register contents
(copying, storing present location). It's a little unweildy to
generate, and slows down the lexer quite a bit, but I think it's quite

I've investigated a number of alternative schemes, but they seem to be
flawed for some lookahead expressions.

2/ Make it an epsilon transition and record the contents. This doesn't
work since I'm using a DFA and I don't know which DFA state containing
the epsilon transition corresponds to the current accepting state.

3/ Roll backwards using a reversed second half, after doing a full
match, using a non-greedy lexer. This showed promise, but fails in a
number of cases (eg. xx/x*). Looking at many lexer implementations
they do something like this and then bodge it by saying `some
lookaheads are dangerous' or somesuch.

4/ Do a full match, marking each time we potentially pass the
lookahead. Now roll a reversed regexp of the rhs backward until it
both accepts the expression and is at a marked position. The first
such match would mark the greedy point. This would work but is
horrendously inefficient. Particularly since I'm using a DFA with
combined states for potentially hundreds of patterns, each with their
own lookahead points. I would need either to mark each cell in many
different ways during the forwards phase, doing vast quantities of
useless work, or once I got a match I would need to roll an automaton
for the pattern forwards first to mark the accepting
positions. Potentially covering the entire input three times with

Any thoughts?


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.