Related articles |
---|
Unbodged lookahead algorithms for lexers dans@chiark.greenend.org.uk (1999-05-27) |
Re: Unbodged lookahead algorithms for lexers mw@willms.demon.co.uk (Martin Williams) (1999-06-06) |
From: | dans@chiark.greenend.org.uk (Dan Sheppard) |
Newsgroups: | comp.compilers |
Date: | 27 May 1999 23:24:02 -0400 |
Organization: | Linux Unlimited |
Keywords: | lex, design |
I'm interested in lookahead algorithms for lexers implemented by
direct DFA construction from the expression (without going via the
correponding NFA). Most of the ones I've seen implemented in practice
chiecken-out on difficult lookaheads (for example if the union of the
last set of the left subexpression and the first set of the second
expression is non-empty). I'm interested in algorithms which are
reasonably efficient, and which implement the lookahead operator
well. I'd rather they were greedy of the left half, rather than the
right, but I don't really mind.
1/ At the moment I've come up with a rather complex algorithm which
uses a number of `registers' for the location of the matched part, and
commands on the arcs of the DFA to update the register contents
(copying, storing present location). It's a little unweildy to
generate, and slows down the lexer quite a bit, but I think it's quite
cute.
I've investigated a number of alternative schemes, but they seem to be
flawed for some lookahead expressions.
2/ Make it an epsilon transition and record the contents. This doesn't
work since I'm using a DFA and I don't know which DFA state containing
the epsilon transition corresponds to the current accepting state.
3/ Roll backwards using a reversed second half, after doing a full
match, using a non-greedy lexer. This showed promise, but fails in a
number of cases (eg. xx/x*). Looking at many lexer implementations
they do something like this and then bodge it by saying `some
lookaheads are dangerous' or somesuch.
4/ Do a full match, marking each time we potentially pass the
lookahead. Now roll a reversed regexp of the rhs backward until it
both accepts the expression and is at a marked position. The first
such match would mark the greedy point. This would work but is
horrendously inefficient. Particularly since I'm using a DFA with
combined states for potentially hundreds of patterns, each with their
own lookahead points. I would need either to mark each cell in many
different ways during the forwards phase, doing vast quantities of
useless work, or once I got a match I would need to roll an automaton
for the pattern forwards first to mark the accepting
positions. Potentially covering the entire input three times with
automata.
Any thoughts?
Dan.
--
http://www.chiark.greenend.org.uk/~dans/
Return to the
comp.compilers page.
Search the
comp.compilers archives again.