Mon, 1 Oct 2007 10:19:30 -0400 (EDT)

Related articles |
---|

Perfecting State Elimination bjoern@hoehrmann.de (Bjoern Hoehrmann) (2007-10-01) |

Re: Perfecting State Elimination anton@mips.complang.tuwien.ac.at (2007-10-01) |

Re: Perfecting State Elimination bjoern@hoehrmann.de (Bjoern Hoehrmann) (2007-10-01) |

From: | Bjoern Hoehrmann <bjoern@hoehrmann.de> |

Newsgroups: | comp.compilers,comp.theory |

Followup-To: | comp.compilers |

Date: | Mon, 1 Oct 2007 10:19:30 -0400 (EDT) |

Organization: | Arcor |

Keywords: | DFA, lex, optimize |

Hi,

I am trying to find a method that will make regular expressions as

short as possible (say, using only concatenation, alternation, zero or

more repetition, characters and epsilon, I want to minimize the number

of characters) given limited resources. I could not find much material

on the problem on the web, except that I can turn the expression into

a minimal DFA and the DFA into a regular expression using state elimi-

nation, and that some removal sequences are better than others.

Unfortunately it turns out even the best removal sequence does not ge-

nerate the shortest regular expression. For example, with a DFA like

'>'

- - I -----------------+

'>' \

- - J -------------------+ Final State

'>' /

- - K -----------------+

the shortest expression would have only a single '>' at the end, but

as far as I understand it, plain state elimination will duplicate it.

Short of another approach, I wondered how to fix it and it's easy to

see that in the automaton above a Guard State can be introduced like

epsilon

- - I -----------------+

epsilon \ '>'

- - J ------------------ G -----------> Final State

epsilon /

- - K -----------------+

It's clear that this change does not semantically alter the automaton,

it does not make matters worse since eliminating the guard state first

would take us back to the original, and if it is eliminated after what

it is guarding, the regular expression will be shorter for some removal

sequence.

Put differently, the transformation would add the minimal amount of

guard states such that the automaton is deterministic from both ends

for all non-epsilon transitions, meaning it maximizes left-factoring

and right-factoring while minimizing the number of states [1]. From

this description it sounds similar to Brzozozski's DFA minimization

algorithm.

My question now is, is this transformation sufficient to ensure that

there is a state removal sequence such that state elimination would

generate the shortest possible regular expression as defined above?

If not, is there anything that would make it so?

Is it correct to say, in the model above, removing a state while it

has an incoming epsilon transition will never generate a shorter

regular expression than when removing the state after all epsilon

coming to it have been removed?

[1] Or, it minimizes the number of characters in the automaton that

state elimination could duplicate. Presumably though the result

does not need to be fully deterministic, specifically, if there

is a state with only one incoming transition on c and loops in c,

it does not need a guard, but I haven't thought much about that.

(Followup-To set to comp.compilers)

Thanks,

--

Bjvrn Hvhrmann 7 mailto:bjoern@hoehrmann.de 7 http://bjoern.hoehrmann.de

Weinh. Str. 22 7 Telefon: +49(0)621/4309674 7 http://www.bjoernsworld.de

68309 Mannheim 7 PGP Pub. KeyID: 0xA4357E78 7 http://www.websitedev.de/

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.