Perfecting State Elimination

Bjoern Hoehrmann <bjoern@hoehrmann.de>
Mon, 1 Oct 2007 10:19:30 -0400 (EDT)

From comp.compilers

Related articles
*Perfecting State Elimination bjoern@hoehrmann.de (Bjoern Hoehrmann)* (2007-10-01)**
Re: Perfecting State Elimination anton@mips.complang.tuwien.ac.at (2007-10-01)
Re: Perfecting State Elimination bjoern@hoehrmann.de (Bjoern Hoehrmann) (2007-10-01)

| List of all articles for this month |

From:	Bjoern Hoehrmann <bjoern@hoehrmann.de>
Newsgroups:	comp.compilers,comp.theory
Followup-To:	comp.compilers
Date:	Mon, 1 Oct 2007 10:19:30 -0400 (EDT)
Organization:	Arcor
Keywords:	DFA, lex, optimize
Posted-Date:	01 Oct 2007 10:19:30 EDT

Hi,

    I am trying to find a method that will make regular expressions as
short as possible (say, using only concatenation, alternation, zero or
more repetition, characters and epsilon, I want to minimize the number
of characters) given limited resources. I could not find much material
on the problem on the web, except that I can turn the expression into
a minimal DFA and the DFA into a regular expression using state elimi-
nation, and that some removal sequences are better than others.

Unfortunately it turns out even the best removal sequence does not ge-
nerate the shortest regular expression. For example, with a DFA like

                                  '>'
      - - I -----------------+
                                  '>' \
      - - J -------------------+ Final State
                                  '>' /
      - - K -----------------+

the shortest expression would have only a single '>' at the end, but
as far as I understand it, plain state elimination will duplicate it.
Short of another approach, I wondered how to fix it and it's easy to
see that in the automaton above a Guard State can be introduced like

                            epsilon
      - - I -----------------+
                            epsilon \ '>'
      - - J ------------------ G -----------> Final State
                            epsilon /
      - - K -----------------+

It's clear that this change does not semantically alter the automaton,
it does not make matters worse since eliminating the guard state first
would take us back to the original, and if it is eliminated after what
it is guarding, the regular expression will be shorter for some removal
sequence.

Put differently, the transformation would add the minimal amount of
guard states such that the automaton is deterministic from both ends
for all non-epsilon transitions, meaning it maximizes left-factoring
and right-factoring while minimizing the number of states [1]. From
this description it sounds similar to Brzozozski's DFA minimization
algorithm.

My question now is, is this transformation sufficient to ensure that
there is a state removal sequence such that state elimination would
generate the shortest possible regular expression as defined above?
If not, is there anything that would make it so?

Is it correct to say, in the model above, removing a state while it
has an incoming epsilon transition will never generate a shorter
regular expression than when removing the state after all epsilon
coming to it have been removed?

[1] Or, it minimizes the number of characters in the automaton that
        state elimination could duplicate. Presumably though the result
        does not need to be fully deterministic, specifically, if there
        is a state with only one incoming transition on c and loops in c,
        it does not need a guard, but I haven't thought much about that.

(Followup-To set to comp.compilers)

Thanks,
--
Bjvrn Hvhrmann 7 mailto:bjoern@hoehrmann.de 7 http://bjoern.hoehrmann.de
Weinh. Str. 22 7 Telefon: +49(0)621/4309674 7 http://www.bjoernsworld.de
68309 Mannheim 7 PGP Pub. KeyID: 0xA4357E78 7 http://www.websitedev.de/

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.

Perfecting State Elimination

Bjoern Hoehrmann <bjoern@hoehrmann.de>Mon, 1 Oct 2007 10:19:30 -0400 (EDT)

Bjoern Hoehrmann <bjoern@hoehrmann.de>
Mon, 1 Oct 2007 10:19:30 -0400 (EDT)