DFA - State Reduction



From: Alexander Morou <alexander.morou@gmail.com>
Newsgroups: comp.compilers
Date: Wed, 25 Feb 2009 09:16:49 -0600
Organization: Compilers Central
Keywords: lex, DFA
Posted-Date: 25 Feb 2009 14:28:56 EST


I hope by this point you're not all tired of me, but I have a question
of a different nature this time around. For the most part, I've
successfully implemented a very basic regular-expression NFA system
and an NFA->DFA conversion routine (though I seem to have mishandled
the ? Kleene operator on grouped elements; I'll probably rewrite it
later to clean up the mess caused by the post-reduction determination
mentioned below).

I was curious what the buzz was about minimal-state DFAs, and I also
took a look at the code being generated by the DFA lexical verifier.
I noticed something most of you probably already know: for every state
(in an example, let's call it state 'A'), if there exist another state
or states (let's call them set B) within the original DFA structure
that have identical transition characters and identical destinations
for those characters, those states are equivalent to A.
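The equivalence test above can be sketched quickly; here is a minimal
version in Python (the encoding and names are mine, not from the post):
two states are directly equivalent when they agree on acceptance and
have identical transition tables (same characters, same destinations).

```python
def equivalent_groups(transitions, accepting):
    """transitions: {state: {char: dest}}; returns lists of equivalent states."""
    groups = {}
    for state, table in transitions.items():
        # A state's "signature": is it accepting, and where does it go on what?
        signature = (state in accepting, frozenset(table.items()))
        groups.setdefault(signature, []).append(state)
    return [g for g in groups.values() if len(g) > 1]

# States 1 and 2 both go to state 3 on 'b' and are non-accepting,
# so they form one equivalent group:
dfa = {0: {'a': 1, 'c': 2}, 1: {'b': 3}, 2: {'b': 3}, 3: {}}
print(equivalent_groups(dfa, accepting={3}))   # → [[1, 2]]
```

Merging each group onto a single representative (and repeating, since
each merge can expose new equivalences) gives the replacement pass
described next.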

That allows some basic replacements. In the DFA/NFA system I
implemented, I made the transitions bi-directional, so it was easy to
tell, in either direction, where states were coming from and going to.
So I could say: for every element of B equivalent to A, redirect every
state that transitions into that element of B so that it points at A
instead, effectively replacing all instances of B with A (the ratio of
A:B being 1:many). This often reduced the state count by quite a bit,
but I also noticed another pattern in the states: all terminal edges
with corresponding incoming transition characters are equivalent. That
is true for the most part, provided the individual target you're
minimizing doesn't have a start->end determination requirement. That
is: so long as you're only interested in verifying the pattern and
capturing the characters within it, and the order in which the
verification occurs is not important, this is OK.

I used both simplification methods and reduced state sets from 430 to
231 on sets of keywords in a given language. Initially I thought that
was great, until I tried to do a simple thing: figure out which
keyword had been verified. Reducing the end points in that manner
eliminated the determination aspect of the DFA (creating what's
somewhat a DFA/NFA hybrid: DFA left-to-right, NFA right-to-left),
which in most cases wasn't an issue. To give a visual example, if I
have a small regular language 'L' with the following three terms:
L := ascending | descending | astral;

The resulting DFA yields a version of L that's not directly
expressible in the above form, so I'll give it in two pieces:
L1 := (a|de)scending;
L2 := as(tral|cending);

Effectively, the first character determines whether the L1 or L2 path
is followed, and by the third character the DFA knows which of the two
options within L1 or L2 applies. The end state for ascending and
descending is identical, though, since their DFA was reduced on the
right half of the expression (the first minimization method, mixed
with terminal-state reduction, causes a cascade effect that collapses
the shared 'cending' part into a single path).
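The cascade can be reproduced on this exact example. Below is my own
sketch (not the poster's code): build a trie-shaped DFA for the three
keywords, then repeatedly merge states with identical acceptance status
and transition tables. The shared keyword tail collapses into one path,
and all three accepting states merge into one, so the minimized DFA can
verify a keyword but no longer says which keyword it saw.

```python
def build_trie(words):
    # Trie as a DFA: {state: {char: dest}}, plus the set of accepting states.
    transitions, accepting, next_id = {0: {}}, set(), 1
    for word in words:
        s = 0
        for ch in word:
            if ch not in transitions[s]:
                transitions[s][ch] = next_id
                transitions[next_id] = {}
                next_id += 1
            s = transitions[s][ch]
        accepting.add(s)
    return transitions, accepting

def minimize(transitions, accepting):
    # Repeatedly merge states with identical signatures until stable;
    # each merge can expose new equivalences one level further back.
    changed = True
    while changed:
        groups = {}
        for s, table in transitions.items():
            groups.setdefault((s in accepting, frozenset(table.items())), []).append(s)
        remap = {s: min(g) for g in groups.values() for s in g}
        changed = any(remap[s] != s for s in transitions)
        transitions = {remap[s]: {c: remap[d] for c, d in t.items()}
                       for s, t in transitions.items()}
        accepting = {remap[s] for s in accepting}
    return transitions, accepting

trie, acc = build_trie(["ascending", "descending", "astral"])
mini, macc = minimize(trie, acc)
print(len(trie), len(mini), len(macc))   # → 24 16 1
```

The 24-state trie shrinks to 16 states, and the single surviving
accepting state is shared by all three keywords, which is exactly the
loss of determination described above.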

The question I have is: if you were to implement a DFA system where
you could determine the exact set of transitions while eliminating the
terminal edges in such a manner, would the more complex code needed
for that determination outweigh the advantage of a reduced state set,
or would that depend directly on the level of complexity of the system
in question?

In testing, I found that the redetermination was actually *more*
difficult than creating the NFA and corresponding DFA in the first
place: you would have to keep track of the initial paths, and the
points of divergence in the DFA would have to be tracked as well,
since there are cases where earlier choices affect later ones, which
greatly complicates the matter.

(As an example, for a 'worst case' scenario I created a series of faux
keywords that merge on their first three characters; the end state set
for ~1664 literals, each 3 characters long, was ~6 states, making the
pattern something akin to [A-P][A-Z][A-D] if all you're concerned with
is verifying the pattern and capturing it.)
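As a sanity check on that count (my arithmetic, not the poster's code),
the pattern [A-P][A-Z][A-D] covers 16 choices for the first character,
26 for the second, and 4 for the third:

```python
# Distinct 3-character literals matched by [A-P][A-Z][A-D]:
print(16 * 26 * 4)   # → 1664
```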

So I'm probably going to abandon the idea of minimizing based on
terminal edges, unless someone else has an idea I haven't thought of.
In cases where determination is important I could use the first
minimization method alone, and where the end result matters less than
verifying the pattern I could also minimize the end states, basically
rewriting a somewhat more optimal version of the initial regular
expression (similar to L1 and L2).
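One standard way to keep determination (my sketch, not something from
the post) is to label each accepting state with the token it accepts
and include that label in the equivalence signature. Accepting states
for different keywords then never merge, and any merge cascade that
would have crossed them stops, so the minimized DFA still reports which
keyword matched:

```python
def build_tagged_trie(words):
    # Trie as a DFA, with tags[state] = the word that state accepts.
    transitions, tags, next_id = {0: {}}, {}, 1
    for word in words:
        s = 0
        for ch in word:
            if ch not in transitions[s]:
                transitions[s][ch] = next_id
                transitions[next_id] = {}
                next_id += 1
            s = transitions[s][ch]
        tags[s] = word
    return transitions, tags

def minimize_tagged(transitions, tags):
    changed = True
    while changed:
        groups = {}
        for s, table in transitions.items():
            # The token tag is part of the signature, so differently
            # tagged accepting states can never fall in the same group.
            sig = (tags.get(s), frozenset(table.items()))
            groups.setdefault(sig, []).append(s)
        remap = {s: min(g) for g in groups.values() for s in g}
        changed = any(remap[s] != s for s in transitions)
        transitions = {remap[s]: {c: remap[d] for c, d in t.items()}
                       for s, t in transitions.items()}
        tags = {remap[s]: w for s, w in tags.items()}
    return transitions, tags

trie, tags = build_tagged_trie(["ascending", "descending", "astral"])
mini, mtags = minimize_tagged(trie, tags)
print(len(mini), sorted(mtags.values()))
```

On these three keywords, keeping the tags prevents every one of the
suffix merges (the trie stays at 24 states), which illustrates the
trade-off in the question above: for keyword sets whose shared suffixes
lead to differently tagged end states, determination and terminal-edge
reduction pull directly against each other.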

The primary reason I bring the question to this newsgroup is that you
give answers that aren't couched in high-level math and set theory.
That's probably material I'm already using in practice, but since I've
never encountered the symbols, their definitions, or their functional
use, they're foreign to me, and the typical articles that discuss
NFAs, DFAs, reduction and so on are beyond me for that very reason.
