
Related articles:
- DFA - State Reduction, alexander.morou@gmail.com (Alexander Morou) (2009-02-25)
- Re: DFA - State Reduction, cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-27)
- Re: DFA - State Reduction, cfc@shell01.TheWorld.com (Chris F Clark) (2009-02-27)

From: Alexander Morou <alexander.morou@gmail.com>
Newsgroups: comp.compilers
Date: Wed, 25 Feb 2009 09:16:49 -0600
Organization: Compilers Central
Keywords: lex, DFA
Posted-Date: 25 Feb 2009 14:28:56 EST

Greetings,

I hope by this point you're not all tired of me, but I have a question of a different nature this time around. For the most part, I've successfully implemented a basic regular-expression NFA system and an NFA->DFA conversion routine (though I seem to have mishandled the optional `?` operator on grouped elements; I'll probably rewrite it later to clean up the mess caused by the post-reduction determination mentioned below).

I was curious what the buzz was about minimal-state DFAs, and I also took a look at the code being generated by the DFA lexical verifier. I noticed something most of you probably already know: for any state (in an example, let's call it state 'A'), if there exists another state or set of states (let's call them set B) within the original DFA that has identical transition characters and identical destinations for those characters, those states are considered equivalent to one another.
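As a rough sketch of that equivalence-based merging (the names and data layout here are my own, not from any particular lexer generator): group states by their transition map plus accepting status, keep one representative per group, and redirect incoming edges.

```python
# Sketch: repeatedly merge DFA states that have identical transition maps
# and the same accepting status, redirecting incoming edges to the survivor.
def merge_equivalent_states(transitions, accepting):
    """transitions: {state: {char: target}}; accepting: set of states.
    Both are modified in place and returned."""
    changed = True
    while changed:
        changed = False
        # Group states by an equivalence signature: (accepting?, edge set).
        groups = {}
        for s, trans in transitions.items():
            sig = (s in accepting, frozenset(trans.items()))
            groups.setdefault(sig, []).append(s)
        for members in groups.values():
            if len(members) > 1:
                keep, *drop = sorted(members)
                remap = {d: keep for d in drop}
                for d in drop:
                    del transitions[d]
                    accepting.discard(d)
                # Redirect every edge that pointed at a dropped state.
                for trans in transitions.values():
                    for ch, target in trans.items():
                        if target in remap:
                            trans[ch] = remap[target]
                changed = True
                break  # signatures are now stale; rebuild the groups
    return transitions, accepting
```

For example, a DFA for the two words "ab" and "cb" has two interior states that both transition on 'b' to the same accepting state; the sketch collapses them into one.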

That makes some basic replacements possible. In the DFA/NFA system I implemented, the structure is bi-directional, so it was easy to tell, in either direction, where states were coming from and going to. So I could say: for every element of A that has a corresponding element or set of elements in B, redirect every state that transitions into those elements of B to the corresponding element of A, effectively replacing all instances of B with A. The ratio of A:B was 1:many.

This often reduced the state count by quite a bit, but I also noticed another pattern in the states: all terminal (accepting) states with corresponding incoming transition characters are equivalent. This is true for the most part, provided the individual target you're minimizing doesn't have a start->end determination requirement. That is: as long as you're only interested in verifying the pattern and capturing the characters within it, and the order in which the verification occurs is not important, this is OK.
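The limitation can be seen on a toy machine (my own construction, not the system described above): once every accepting state has been merged into one, the DFA can still verify that *some* keyword matched, but the end state alone no longer says *which*.

```python
# Run a table-driven DFA; returns the final state on acceptance, else None.
def run(dfa, start, accepting, text):
    state = start
    for ch in text:
        state = dfa.get(state, {}).get(ch)
        if state is None:
            return None  # no transition: reject
    return state if state in accepting else None

# Two keywords, "ab" and "cb", with their accepting states already merged
# into the single state 3 by the terminal-state reduction described above.
dfa = {0: {'a': 1, 'c': 2}, 1: {'b': 3}, 2: {'b': 3}}
accepting = {3}

# Both inputs end in the same state, so the end state cannot distinguish
# them; recovering the token identity needs extra bookkeeping.
assert run(dfa, 0, accepting, 'ab') == run(dfa, 0, accepting, 'cb') == 3
```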

I used both simplification methods and reduced the state set from 430 to 231 on a set of keywords in a given language. Initially I thought that was great, until I tried to do a simple thing: figure out which keyword had actually been verified. Reducing the end points in that manner eliminated the determination aspect of the DFA (creating something of a DFA/NFA hybrid: DFA left-to-right, NFA right-to-left), which in most cases wasn't an issue. To give a visual example, if I have a small regular language 'L' with the following three terms:

L := ascending | descending | astral;

the resulting DFA yields a version of L that isn't directly expressible in the above form, so I'll give it in two pieces:

L1 := (a|de)scending;
L2 := as(tral|cending);
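A quick sanity check, using Python's re module as a stand-in for the DFA, that the two factored forms (shared suffix "scending", shared prefix "as") together cover exactly the three original keywords:

```python
import re

L  = re.compile(r"^(ascending|descending|astral)$")
L1 = re.compile(r"^(a|de)scending$")    # factored on the common suffix
L2 = re.compile(r"^as(tral|cending)$")  # factored on the common prefix

for word in ["ascending", "descending", "astral"]:
    assert L.match(word)
# L1 covers the "scending" pair; L2 covers the "as"-prefixed pair.
assert L1.match("ascending") and L1.match("descending")
assert L2.match("astral") and L2.match("ascending")
assert not L1.match("astral") and not L2.match("descending")
```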

Effectively, the first character determines whether the L1 or L2 path is followed, and by the third character the machine knows which of the two options within L1 or L2 applies. The end state for ascending and descending is identical, though, since their DFA was reduced on the right half of the expression (the first minimization method combined with terminal-state reduction causes a cascade effect, collapsing the 'scending' part into a single path).

The question I have is: if you were to implement a DFA system where you could still determine the exact set of transitions while merging the terminal states in this manner, would the trade-off of more complex determination code cancel out the advantage of the reduced state set, or would that depend directly on the complexity of the system in question?

In testing, I found that this re-determination was actually *more* difficult than creating the NFA and corresponding DFA in the first place: you would have to keep track of the initial paths, and the points of divergence in the DFA would have to be tracked as well, since there are cases where earlier choices affect later choices, greatly complicating the matter.

(As an example, in a 'worst case' scenario I created a series of faux keywords that merge on the first three characters: the end state set for ~1664 literals, each 3 characters long, was ~6 states, making the pattern something akin to [A-P][A-Z][A-D] if all you're concerned with is verifying and capturing it.)
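The literal count in that experiment follows directly from the sizes of the three character classes; a throwaway check (my own reconstruction, not the original test code):

```python
from itertools import product
from string import ascii_uppercase

# Reconstruct the worst-case keyword set: every 3-character string
# matching [A-P][A-Z][A-D].
first  = ascii_uppercase[:16]   # 'A'..'P', 16 characters
second = ascii_uppercase        # 'A'..'Z', 26 characters
third  = ascii_uppercase[:4]    # 'A'..'D', 4 characters
keywords = {''.join(t) for t in product(first, second, third)}
print(len(keywords))  # 16 * 26 * 4 = 1664
```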

So I'm probably going to abandon the idea of minimizing based on terminal states, unless someone else has an idea I haven't thought of. Though in cases where determination is important I could use the first minimization method alone, and where the end result matters less than verifying the pattern I could -also- minimize the end states, basically rewriting a somewhat more optimal version of the initial regular expression (similar to L1 and L2).

The primary reason I bring the question to this newsgroup is that you give answers that aren't couched in high-level math and set theory. That's stuff I'm probably already using, but since I've never encountered the symbols, their descriptions, or their functional use, they're foreign to me, and the typical articles that discuss NFAs, DFAs, reduction and so on are beyond me for that very reason.
