27 May 1999 23:24:02 -0400

Related articles |
---|

Unbodged lookahead algorithms for lexers dans@chiark.greenend.org.uk (1999-05-27) |

Re: Unbodged lookahead algorithms for lexers mw@willms.demon.co.uk (Martin Williams) (1999-06-06) |

From: | dans@chiark.greenend.org.uk (Dan Sheppard) |

Newsgroups: | comp.compilers |

Date: | 27 May 1999 23:24:02 -0400 |

Organization: | Linux Unlimited |

Keywords: | lex, design |

I'm interested in lookahead algorithms for lexers implemented by

direct DFA construction from the expression (without going via the

correponding NFA). Most of the ones I've seen implemented in practice

chiecken-out on difficult lookaheads (for example if the union of the

last set of the left subexpression and the first set of the second

expression is non-empty). I'm interested in algorithms which are

reasonably efficient, and which implement the lookahead operator

well. I'd rather they were greedy of the left half, rather than the

right, but I don't really mind.

1/ At the moment I've come up with a rather complex algorithm which

uses a number of `registers' for the location of the matched part, and

commands on the arcs of the DFA to update the register contents

(copying, storing present location). It's a little unweildy to

generate, and slows down the lexer quite a bit, but I think it's quite

cute.

I've investigated a number of alternative schemes, but they seem to be

flawed for some lookahead expressions.

2/ Make it an epsilon transition and record the contents. This doesn't

work since I'm using a DFA and I don't know which DFA state containing

the epsilon transition corresponds to the current accepting state.

3/ Roll backwards using a reversed second half, after doing a full

match, using a non-greedy lexer. This showed promise, but fails in a

number of cases (eg. xx/x*). Looking at many lexer implementations

they do something like this and then bodge it by saying `some

lookaheads are dangerous' or somesuch.

4/ Do a full match, marking each time we potentially pass the

lookahead. Now roll a reversed regexp of the rhs backward until it

both accepts the expression and is at a marked position. The first

such match would mark the greedy point. This would work but is

horrendously inefficient. Particularly since I'm using a DFA with

combined states for potentially hundreds of patterns, each with their

own lookahead points. I would need either to mark each cell in many

different ways during the forwards phase, doing vast quantities of

useless work, or once I got a match I would need to roll an automaton

for the pattern forwards first to mark the accepting

positions. Potentially covering the entire input three times with

automata.

Any thoughts?

Dan.

--

http://www.chiark.greenend.org.uk/~dans/

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.