Re: building a lexer generator

Karsten Nyblad <148f3wg02@sneakemail.com>
1 Dec 2006 09:17:33 -0500

From comp.compilers

Related articles
building a lexer generator mips@cyberspace.org (2006-11-29)
*Re: building a lexer generator 148f3wg02@sneakemail.com (Karsten Nyblad)* (2006-12-01)**
Re: building a lexer generator torbenm@app-3.diku.dk (2006-12-01)
Re: building a lexer generator DrDiettrich1@aol.com (Hans-Peter Diettrich) (2006-12-03)
Re: building a lexer generator mips@cyberspace.org (2007-01-05)

| List of all articles for this month |

From:	Karsten Nyblad <148f3wg02@sneakemail.com>
Newsgroups:	comp.compilers
Date:	1 Dec 2006 09:17:33 -0500
Organization:	Compilers Central
References:	06-11-123
Keywords:	lex
Posted-Date:	01 Dec 2006 09:17:33 EST

mips@cyberspace.org wrote:
> I'M Trying To Build A Lexer Generator
...
> For example, when i have five
> keywords and one identifier definition how do the lexer process them ?
> Does it have a DFA for each keywords and the identifier or does it
> process all with the same FSM ? I would bet on the second one but
> this mean that i must merge all the DFA (or i surely missed
> something).

I would start by parsing the regular expressions defining the tokens to
be scanned. During the parsing I would build an abstract syntax tree
(AST). In east AST I would mark all characters that may be the last
character of the token, e.g., if the token is defined as a*bc+d*e*, I
would mark d and e. Then I would build one big merged DFA. That is in
practice the same as putting all the regular expressions into one
regular expression, the original expressions are separated by "union"
operators. During the building process I would copy the marks from the
ASTs to the states you can come to by reading a marked character. If a
state may get two marks during this process, then you only copy the mark
of the regular expression you prefer to recognize, e.g., if a state can
recognize a keyword and an identifier, you let the state recognize the
keyword.

Normally you will continue by finding the minimal DFA. During this
process you may not merge to accepting states if these states have
different marks.

The DFA build during this process has the property that from all states
there is a path to an accepting state. This makes it possible to
optimize process of finding the minimal DFA, because two states can only
be equivalent if there are transitions on the same character. The
normal way of to minimalize a DFA is to divide the state into a class of
accepting states and a class of not accepting states. Now for each
character and each class you divide the class into new classes such that
the states are put in the same class if and only if a transition on the
character takes them to states in the same class. The process of the
last statement is repeated until no further progress. Finally the
states belonging to the same classes are merged. This algorithm can be
optimized because in stead of the initial division of two states, you
divide the states into separate classes if either they are marked with
different marks or if they have transitions on different characters or
both. (The algorithm described in my first paragraph will put marks on
the accepting states and only them.)

Karsten Nyblad
148f3wg02 at sneakemail dot com

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Re: building a lexer generator

Karsten Nyblad <148f3wg02@sneakemail.com>1 Dec 2006 09:17:33 -0500

Karsten Nyblad <148f3wg02@sneakemail.com>
1 Dec 2006 09:17:33 -0500