6 Jun 1999 23:00:11 -0400

Related articles |
---|

Unbodged lookahead algorithms for lexers dans@chiark.greenend.org.uk (1999-05-27) |

Re: Unbodged lookahead algorithms for lexers mw@willms.demon.co.uk (Martin Williams) (1999-06-06) |

From: | "Martin Williams" <mw@willms.demon.co.uk> |

Newsgroups: | comp.compilers |

Date: | 6 Jun 1999 23:00:11 -0400 |

Organization: | Compilers Central |

References: | 99-05-133 |

Keywords: | lex |

Dan Sheppard wrote:

*> I'm interested in lookahead algorithms for lexers implemented by*

*> direct DFA construction from the expression (without going via the*

*> correponding NFA). Most of the ones I've seen implemented in practice*

*> chiecken-out on difficult lookaheads...*

*> <snip>*

*> 1/ At the moment I've come up with a rather complex algorithm which*

*> uses a number of `registers' for the location of the matched part, and*

*> commands on the arcs of the DFA to update the register contents...*

*> <snip>*

*> 2/ Make it an epsilon transition and record the contents. This doesn't*

*> work since I'm using a DFA and I don't know which DFA state containing*

*> the epsilon transition corresponds to the current accepting state.*

I took this approach in a (hobby) lexalike. Create an eps position for

the lookahead operator and you can then determine which DFA states

contain one, and generate "arc" code accordingly.

My present attempt fails to deal with difficult lookaheads, but I wonder

if it might be coerced into success ?

Easy example:

Rule #1: aa* / bb* ....Action code....

DFA: | a b other

--+-----------------

State: 1 | 2 - - Start

2 | 2 3 - Contains eps lookahead position.

3 | - 3 - Match

Generated DFA code:

LOOP

...... =20

CASE State OF (* "arc" code *)

2: MatchPos[1] :=3D pos; (* rule #1 *)

ELSE ;

END;

State :=3D NextState(State,SrcBuf[pos]);

INC(pos);

......

END;

MatchPos is an array over each rule of the lex source. It is consulted

by the lexer generated "pre" action code of each (lookahead containing)

rule to determine the token end position.

In practice many states will not require "arc" code. Some states may

update several MatchPos elements.

Example: Difficult

Rule #1: aa* / ab* ....Action code....

DFA: | a b other

--+-----------------

State: 1 | 2 - - Start

2 | 3 - - Contains eps lookahead position.

3 | 3 4 - Match. Contains eps lookahead position.

4 | - 4 - Match.

Code generated on the above simple principle (fails):

LOOP

...... =20

CASE State OF (* "arc" code *)

2: MatchPos[1] :=3D pos; (* rule #1 *)

| 3: MatchPos[1] :=3D pos; (* rule #1 *)

ELSE ;

END;

State :=3D NextState(State,SrcBuf[pos]);

INC(pos);

......

END;

The result is the same as if we had moved the lookahead one character

to the right.

Code that would work in this circumstance:

LOOP

...... =20

CASE State OF (* "arc" code *)

2: SavePos[1] :=3D pos; (* rule #1. Save match point. =

*)

| 3: MatchPos[1] :=3D SavePos[1]; (* rule #1. Match point. *)

SavePos[1] :=3D pos; (* rule #1. Save match point. =

*)

ELSE ;

END;

State :=3D NextState(State,SrcBuf[pos]);

INC(pos);

......

END;

Note that the lookahead marker position in state 2 is "redundant". If

there's going to be a match we will always encounter the lookahead

marker position in state 3. It is, in fact, telling us something

different.

I wonder if it might be possible to exploit this property to:

a) easily identify all lookahead difficulties in the DFA, and

b) easily remove some of them by generating "code that would work"

It will always pay to notice that one or other side of the

expression matches a fixed length token. In these circumstances the

eps position need not be generated - the lexer generated "pre" action

code can make use of the constant length property instead to

determine the token size.

Martin Williams

mw@willms.demon.co.uk

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.