Re: Theory about DFA's and f?lex start states

Chris F Clark <>
21 Nov 2000 13:52:34 -0500

          From comp.compilers

Related articles
[3 earlier articles]
Re: Theory about DFA's and f?lex start states (2000-11-11)
Re: Theory about DFA's and f?lex start states (Chris F Clark) (2000-11-14)
Re: Theory about DFA's and f?lex start states (Joachim Durchholz) (2000-11-14)
Re: Theory about DFA's and f?lex start states (Jonathan Barker) (2000-11-14)
Re: Theory about DFA's and f?lex start states (2000-11-14)
Re: Theory about DFA's and f?lex start states (2000-11-19)
Re: Theory about DFA's and f?lex start states (Chris F Clark) (2000-11-21)
Re: Theory about DFA's and f?lex start states (Joachim Durchholz) (2000-11-21)
Re: Theory about DFA's and f?lex start states (2000-11-22)
| List of all articles for this month |

From: Chris F Clark <>
Newsgroups: comp.compilers
Date: 21 Nov 2000 13:52:34 -0500
Organization: The World Public Access UNIX, Brookline, MA
References: 00-11-073 00-11-080 00-11-083 00-11-100 00-11-134
Keywords: lex, theory
Posted-Date: 21 Nov 2000 13:52:34 EST

Paul wrote:

> One /can/ theoretically combine the set of DFA's over M to generate
> D. But stepping back [to] a practical standpoint . . . I'm
> struggling to write the lexer because . . . many of the tokens have
> overlapping definitions . . . . So the (natural?) thing is to try to
> partition those token definitions into several distinct layers and
> switch between them when appropriate. So what model of computation
> is this?

It isn't a new model of computation, just a way humans simplify
analysis by layering problems into sub-problems. There are proofs
that the layering introduces no "power" into the class of problems it
solves (at least when the underlying machines are finite state
machines with no auxillary store)--that is anything that can be solved
by a layered automaton can be solved by a machine with no layers.

However, usually the non-start-state solution to the problem makes the
tokens more context sensitive (as context sensitivity is what start
states buy one) which makes finding the token boundaries more
difficult. This is not an issue at the theoretical level, which is
why the proofs work. It is an issue at the practical level, which is
why start states help simplify the human analysis.

Note, if the start states are actually controlled by the parser (which
uses a stack automaton) the theoretical proof can break down and one
is actually dealing with a larger class of languages (usually the
LR-regular ones to be precise), but this is "lexer/parser feedback"
which some people prefer to avoid as it has its own issues in some

> Given the additional thought, I think the point of using stack
> instructions for switching between the elements in M (DPDFA) versus
> explicit goto instructions (like flex) is not too relevant.

I think the Visual Parse++ people might differ on this aspect. The
developer(s) specifically made their start states (user) stackable.

Moreover, from a theoretical standpoint adding a stack adds a lot (!)
of power to class of lanugages one can parse--although it does make
the backtracking analysis much more complicated (and as far as I know
intractable in general). To emphasize this point, the current ANTLR
lexer offers LL(1) "lexing" (with only limited backtracking via
syntactic predicates) and the lexer with Yacc++ offers LR(1) "lexing",
both of those offer an integrated stack in the lexing model, both have
sacrificed general backtracking to do so.

> The important thing is that you have multiple "exclusive" DFA's
> (using the word "exclusive" to mean that merging them would cause
> conflicts) and that you are switching between them.

Yes, exclusive DFA's are a very powerful technique. You can
definitely use them to solve some otherwise intractable lexing
problems--my favorite example is hexadecimal numbers v. identifiers.
C has solved the problem of hex numbers by using the 0x prefix.
However, other languages have used context to determine when a hex
number is valid. In such languages, start states are almost always
the way to go.

Of course, if I were designing a new language, I would probably follow
the C convention unless there were reasons not to. Start states make
ones grammar more fragile and error recovery more difficult. In fact,
most bad syntax errors occur at the "lexical" level (forgetting a
single character, say an opening or closing quote or brace (ok a brace
is generally both a character and a token, but it happens to be the
kind of bad thing which causes "state shifts")). Most compilers do
particularly poorly in recovering from errors at the lexical level or
errors involving state shifts.

Recently I wrote a Verilog grammar for a client and it uses start
states for numeric types because that's the way that language is
structured. The compiler originally recovered really poorly if the
user made an error in the middle of a number because the lexer got
caught in the "wrong" lexical state. Making syntax errors revert to
the "default" state fixed many of those poor error recoveries, but
still the system is more fragile than it would have been without the
need for start states.

Thus, while if I were writing a new lexer generator (which I did once
as a co-author), I would include a start state model in it, at the
same time, I would not encourage too many people to use start states
in lexers (and try to find other features that minimized their (start
states) use by providing alternate solutions with better semantics),
because making lexers more fragile is not desirable.

Hope this helps,

Chris Clark Internet :
Compiler Resources, Inc. Web Site :
3 Proctor Street voice : (508) 435-5016
Hopkinton, MA 01748 USA fax : (508) 435-4847 (24 hours)

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.