Re: Unbodged lookahead algorithms for lexers

"Martin Williams" <mw@willms.demon.co.uk>
6 Jun 1999 23:00:11 -0400

          From comp.compilers

Related articles
Unbodged lookahead algorithms for lexers dans@chiark.greenend.org.uk (1999-05-27)
Re: Unbodged lookahead algorithms for lexers mw@willms.demon.co.uk (Martin Williams) (1999-06-06)
| List of all articles for this month |

From: "Martin Williams" <mw@willms.demon.co.uk>
Newsgroups: comp.compilers
Date: 6 Jun 1999 23:00:11 -0400
Organization: Compilers Central
References: 99-05-133
Keywords: lex

Dan Sheppard wrote:


> I'm interested in lookahead algorithms for lexers implemented by
> direct DFA construction from the expression (without going via the
> correponding NFA). Most of the ones I've seen implemented in practice
> chiecken-out on difficult lookaheads...


> <snip>


> 1/ At the moment I've come up with a rather complex algorithm which
> uses a number of `registers' for the location of the matched part, and
> commands on the arcs of the DFA to update the register contents...


> <snip>


> 2/ Make it an epsilon transition and record the contents. This doesn't
> work since I'm using a DFA and I don't know which DFA state containing
> the epsilon transition corresponds to the current accepting state.




I took this approach in a (hobby) lexalike. Create an eps position for
the lookahead operator and you can then determine which DFA states
contain one, and generate "arc" code accordingly.


My present attempt fails to deal with difficult lookaheads, but I wonder
if it might be coerced into success ?


Easy example:


Rule #1: aa* / bb* ....Action code....


DFA: | a b other
                --+-----------------
State: 1 | 2 - - Start
                2 | 2 3 - Contains eps lookahead position.
                3 | - 3 - Match


Generated DFA code:


        LOOP
            ...... =20
            CASE State OF (* "arc" code *)
                2: MatchPos[1] :=3D pos; (* rule #1 *)
            ELSE ;
            END;
            State :=3D NextState(State,SrcBuf[pos]);
            INC(pos);
            ......
        END;




MatchPos is an array over each rule of the lex source. It is consulted
by the lexer generated "pre" action code of each (lookahead containing)
rule to determine the token end position.


In practice many states will not require "arc" code. Some states may
update several MatchPos elements.




Example: Difficult


Rule #1: aa* / ab* ....Action code....


DFA: | a b other
                --+-----------------
State: 1 | 2 - - Start
                2 | 3 - - Contains eps lookahead position.
                3 | 3 4 - Match. Contains eps lookahead position.
                4 | - 4 - Match.




Code generated on the above simple principle (fails):


        LOOP
            ...... =20
            CASE State OF (* "arc" code *)
                2: MatchPos[1] :=3D pos; (* rule #1 *)
            | 3: MatchPos[1] :=3D pos; (* rule #1 *)
            ELSE ;
            END;
            State :=3D NextState(State,SrcBuf[pos]);
            INC(pos);
            ......
        END;


The result is the same as if we had moved the lookahead one character
to the right.


Code that would work in this circumstance:


        LOOP
            ...... =20
            CASE State OF (* "arc" code *)
                2: SavePos[1] :=3D pos; (* rule #1. Save match point. =
*)
            | 3: MatchPos[1] :=3D SavePos[1]; (* rule #1. Match point. *)
                      SavePos[1] :=3D pos; (* rule #1. Save match point. =
*)
            ELSE ;
            END;
            State :=3D NextState(State,SrcBuf[pos]);
            INC(pos);
            ......
        END;


Note that the lookahead marker position in state 2 is "redundant". If
there's going to be a match we will always encounter the lookahead
marker position in state 3. It is, in fact, telling us something
different.


I wonder if it might be possible to exploit this property to:


a) easily identify all lookahead difficulties in the DFA, and
b) easily remove some of them by generating "code that would work"




It will always pay to notice that one or other side of the
expression matches a fixed length token. In these circumstances the
eps position need not be generated - the lexer generated "pre" action
code can make use of the constant length property instead to
determine the token size.






Martin Williams
mw@willms.demon.co.uk


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.