Re: Finite State Machines in compilers

Chris Clark USG <clark@quarry.zk3.dec.com>
20 Nov 1997 22:30:37 -0500

          From comp.compilers

Related articles
Finite State Machines in compilers h-bergst@dsv.su.se (Henrik_Bergstrom) (1997-11-11)
Re: Finite State Machines in compilers robt@sensi.co.uk (Robert Trevellyan) (1997-11-13)
Re: Finite State Machines in compilers eravila@spin.com.mx (M. en C. Eduardo René Rodríguez Ávila) (1997-11-16)
Re: Finite State Machines in compilers clark@quarry.zk3.dec.com (Chris Clark USG) (1997-11-20)
| List of all articles for this month |

From: Chris Clark USG <clark@quarry.zk3.dec.com>
Newsgroups: comp.compilers
Date: 20 Nov 1997 22:30:37 -0500
Organization: Digital Equipment Corporation - Marlboro, MA
References: 97-11-058
Keywords: FSM

> As part of my work on my masters thesis I need information on how Finite
> State Machines (FSMs) are used in complers. I don't meen that they are
> used for constructing the lexical analyser, but rather things like how
> they are implemented, how large the state-set is etc.


>From my experience looking at buggy FSMs sent to us by the Yacc++ user
community, I have seen some trends in FSM sizes.


For lexical analyzers there seem to be 4 general sizes/shapes of
FSMs.


1) very small. These are the FSMs which describe scripting languages,
shell languages, and lisp-like languages, where there are generally
only a few distinct token types and those relatively simple. A
typical language of this type has tokens like "whitespace",
"punctuation", and "words". The punctuation token type is further
subdivided into a unique token type for each punctuation character,
but the tokens are (almost) all one character long, so that does not
increase number of states (by very much). A typical number of states
for lexical analyzers in this category is 10 or fewer states.


2) small. These are the FSMs describing Algol-like languages
(i.e. Pascal, C, C++, Java, Basic, PL/I) where any keywords of the
language are handled via a symbol table lookup (or a "screener"
function). Also, most assembly language grammars fit in this
category. These languages sub-divide the word category into
identifiers, integers, and floating-point numbers and introduce more
multi-character punctuation. Some of them have complicated tokens for
comments. Some have tokens that require multiple characters of
lookahead to disambiguate (e.g. "1..23" (3 tokens) versus "1.123" (1
token)). The typical number of states for lexical analyzers in this
category are around 25 states.


3) not-small. These are the FSMs for the languages in class 2 where
keywords are done in-line in the lexical analyzer (i.e. as rules)
rather than as a separate function. These FSMs add a large number of
states with only a few transitions leaving them--the exact number
depending on the number of keywords and their length. Typical numbers
of states for this class are 100 to 200 states, although I have seem
them with 1000's of states. (Josef Grolsh, the Cocktail author, came
up with a special table optimization (tunnel automata) just to deal
with FSMs of this class.)


4) complex. Finally, I have seen a few lexical analyzers where some
of the parser functionality was pushed down into the lexer. These
often have complicated start-state sets and implement some recursive
tokens. Some of the text markup languages (SGML, HTML) tend to fall
into this category, as implementors parse entire tags in the lexer.
I've seen similarly complicated lexers describing pre-processors,
where the control flow (#if/#else/#endif) is implemented by changing
lexer states. I don't have figures for this class (fortunately I
don't have to look at many examples of that class), but it does seem
to me that the number of states is limited to only a few 100 because
the ability of the programmer to reason about the correctness of the
underlying machine breaks down at that point. (I once spent over a
week just trying to determine what one lexical analyzer was attempting
to match.)


As to FSMs for parsing the range is more varied. The smallest class
tends to have under 25 states. There is another class with around 100
states. Algol-like languages tend to vary in the 200-500 state range.
The large class tends to be in the 1000's range. I haven't yet seen a
working parser FSM in the 10k state range. (I know I wouldn't
understand it even if I did.)


At the parsing level, I have seen some effects that mimic the
distinctions in lexical analyzers. For example, although the number
of states is larger, the fact that the parser advances through the
program and is usually only dealing with a few states at a time
simplifies the reasoning process. Thus, a 200 state parser grammar is
about as difficult to interpret as a 25 state lexer.


Inheritance of grammars has a similar effect. I have seen grammars
where several disjoint subsets (of the 200-500 state variety) were
inherited together to produce a many 1000 state monsters, but the
reasoning was compartmentalized into the subset parts, so it wasn't
nearly as complex to reason about them as a single 1000 state grammar
would have been.


Another similarity between lexing and parsing FSMs is that static type
analysis maps closely to keyword processing. That is, a parsing
grammar where the type analysis is coded into the rules has the same
kind of state explosion that coding keywords as rules does at the
lexical level. The same thing is also true about precedence levels.
A grammar which uses implicit precedence levels (and the yacc %left
et. al. directives) has only a fraction of the states where the
precedence levels are coded as nested recursive terminals (expression,
term, factor, ...).


-Chris Clark
************************************************************************
Compiler Resources, Inc. email: compres@world.std.com
3 Proctor St. http://world.std.com/~compres
Hopkinton, MA 01748 phone: (508) 435-5016
USA 24hr fax: (508) 435-4847
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.