Re: Regexps from DFA

clark@quarry.zk3.dec.com (Chris Clark USG)
7 Feb 1997 23:35:06 -0500

          From comp.compilers

Related articles
Regexps from DFA gvmt@csa.iisc.ernet.in (G Venkatesha Murthy) (1997-02-02)
Re: Regexps from DFA anton@a0.complang.tuwien.ac.at (1997-02-03)
re: Regexps from DFA jm@ai.univ-paris8.fr (Jean Mehat) (1997-02-03)
Re: Regexps from DFA dimock@deas.harvard.edu (1997-02-03)
Re: Regexps from DFA clark@quarry.zk3.dec.com (1997-02-07)
Re: Regexps from DFA mslamm@pluto.mscc.huji.ac.il (1997-02-07)
Re: Regexps from DFA lijnzaad@ebi.ac.uk (Philip Lijnzaad) (1997-02-07)
| List of all articles for this month |

From: clark@quarry.zk3.dec.com (Chris Clark USG)
Newsgroups: comp.compilers
Date: 7 Feb 1997 23:35:06 -0500
Organization: Digital Equipment Corporation - Marlboro, MA
References: 97-02-030
Keywords: lex, DFA

> It might be expressed as a ``minimal regexpr for a given set of strings''.
> I didn't clarify the notion of generality vs. simplicity, as '.*' can match
> anything and has a minimal number of states as a DFA, while the exact match
> for all strings is minimal in term of size of the generated language.
> The interesting expr/automatas are somewhere in between.


Expressed that way, there is at least one "natural" solution to your
problem. This will construct a set of NFA's which describe the
languages starting with your exact set of strings and work its way to
"sigma*". Each machine it adds to its set in the construction process
accepts a little larger language (in that it accepts more strings)
than the element of the set it was derived from. That seems to match
your generality concept. (BTW, If you want DFA's, you simply need to
run a conversion algorithm. Allyn Dimock supplied several
references.)


1) Create the following NFA as the initial element of your set.
Create a unique start state.
For each string in your set
For each character in the string
Add a unique state, if the character is the
last character of the string, make the
state an accepting state
Add a transition (for the character) from the
state describing the previous
character (the start state if this is
the first character of the string) to
the state for this character.


    The NFA for the strings "car" and "cat" would look like the one below.
-[0] (start state)
                    | "c" "a" "r"
+------>[1]----->[2]----->[3]+ (accepting state)
|
                    | "c" "a" "t"
|------>[4]----->[5]----->[6]+ (accepting state)


      This machine accepts the minimal set of strings. It is obviously
      not minimal in terms of states nor transitions except under some
      restrictive rules. However, it has some nice properities for the
      set it constructs in the rest of the algorithm. Any other NFA
      which described the set of strings could also be used as the
      initial element and some would yield slightly different sets at
      the end of this algorithm.


2) For each NFA currently in your set (the input NFA), create a set of
new NFA's by the following method. Take each pair of states in the
input NFA and merge them, each pair-wise merger of state yields a
potential new NFA for your set, if the resulting NFA is not in your
set add it. Here are two of the new machines which would be added
given original machine.


    Onw is the merger of states 3 and 5 from the above machine.
    The resulting machine would look like:
-[0] (start state)
                    | "c" "a" "r"
+----->[1]----->[2]----|
| |
                    | "c" "a" | + "t"
|----->[4]-------------+--->[5]---------->[6]+ (accepting state)
                                                                                  (accepting state)
        That machine would, of course, accept "car", "cart", "ca", and "cat".


    If the two states to be merged caused a cycle to be formed, the resulting
    machine would except regular expressions which repetitions, as in the
    next machine, where states 4 and 6 from the original machine were merged.
-[0] (start state)
                    | "c" "a" "r"
+----->[1]----->[2]----->[3]+ (accepting state)
|
                    | "c" + "a" "t"
|----->[4]----->[5]---------|
                                    ^ (accepting state) |
                                    |-------------------|
        The resulting machine would accept "car" | "c"("at")*.


3) Repeat step 2 until there are no more machines which can be added.
One element of you machine set, should be a machine with one state,
which is both the start and an accepting state and has a transition to
itself on each character in any string in the problem (i.e. sigma*,
unless sigma includes some characters not in any of the strings).
Note, that the resulting set of machines is finite (and strictly
limited by a function of the number of initial characters within the
initial strings).


Other extensions to the set which produce finite supersets are also
possible. For example, machines could be added to the set where
non-accepting state were changed to accepting states. Or, additional
transitions could be added between states.


-Chris
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.