Re: Compress the finite state machine

Torben Mogensen <torbenm@diku.dk>
4 Mar 1999 12:11:54 -0500

          From comp.compilers

Related articles
Compress the finite state machine felixmish@usa.net (Felix Mish) (1999-03-02)
Re: Compress the finite state machine cfc@world.std.com (Chris F Clark) (1999-03-04)
Re: Compress the finite state machine torbenm@diku.dk (Torben Mogensen) (1999-03-04)
Re: Compress the finite state machine jes6@eecs.lehigh.edu (James E. Stine) (1999-03-04)
Re: Compress the finite state machine allanmac@blueprint.com (Allan MacKinnon) (1999-03-04)
Re: Compress the finite state machine hunk@alpha1.csd.uwm.edu (1999-03-04)
Re: Compress the finite state machine stans@lucent.com (Sicco Tans) (1999-03-10)
Re: Compress the finite state machine torbenm@diku.dk (Torben Mogensen) (1999-03-22)
Re: Compress the finite state machine sassth@unx.sas.com (Steven Hand) (1999-03-22)
| List of all articles for this month |

From: Torben Mogensen <torbenm@diku.dk>
Newsgroups: comp.compilers
Date: 4 Mar 1999 12:11:54 -0500
Organization: Compilers Central
References: 99-03-010
Keywords: DFA

Felix Mish <felixmish@usa.net> wrote:


>I am evaluating the ways to compress the finite state machine.
>Is there anyone be kind enough to provide some possible solutions?


You don't say whether your machine is deterministic or
nondeterministic. I'll assume the first.


The first step is, obviously, to minimize the number of states in the
DFA. You can find a method for this in Aho, Sethi and Ullman's
compiler book, though the version described there isn't particularly
efficient.


The next step is to reduce the space used by each state. There are
several ways to do this.


One is to divide the alphabet into equivalence classes: If it for all
states is the case that the transition made for two characters is the
same, these are equivalent. For example, it is often the case that all
digits are equivalent in DFA's used for lexical analysis, and often
most letters have the same property (some may be used for keywords,
though). When you have found the equivalence classes, make a single
global table that maps a character to its equivalence class and make
the state transition tables use the equivalence class numbers
instead. If you can reduce 256 different characters to, say, 50
equivalence classes that is a factor 5 savings.


Next, you can try to make several states share the same row in the
table. Two states may have transitions on disjoint subsets of the
alphabet. In this case, they can share a row in the transition table
if you also for each state provide a bitmap that masks out the
irrelevant parts of the row. Since a bitmap takes up far less space
than a row, you can save quite a bit here. You can relax the sharing
criterium a bit: Two states need not have disjoint domains. It is O.K
that they both have a defined transition on a character if the
trasitionsa re identical. Finding an optimal sharing of rows between
states is equivalent to graph colouring: The states form nodes in an
undirected graph. An edge is drawn between two states if there exist a
character such that they both have defined transitions on this
character and these go to different states. Find the minimal number of
colours for the nodes such that states connected by edges have
different colours. Graph colouring is also used for register
allocation and the same heuristics should work fine for row sharing.


If these measures are insufficient, you can try alphabet splitting:
Replace each character from the alphabet by two adjacent characters
from a smaller set. If, for example, your alphabet has 256 characters,
you can split each character into two characters from an alphabet of
16 characters. When you in a DFA replace each transition by two
sequential transitions on the smaller characters, you might get an
NFA, as symbols that are distinct in the original alphabet may not be
distinguished by the first character in their encodings. Hence, you
must convert this NFA to a DFA. If the original DFA was constructed
from a regular expression or NFA, it is easier to modify these before
conversion to a DFA instead of modifying the DFA and re-do NFA to DFA
conversion. While the number of states in your DFA will (on average)
double by this transformation, your alphabet will be far less than
half the size (it may, e.g., go from 256 to 16 characters). Hence,
there is a substantial space saving.


All three methods can be combined: First alphabet splitting, then
equivalence classing and finally row sharing. However, the combined
time overhead may be substantial.


Torben Mogensen (torbenm@diku.dk)


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.