Thu, 20 May 1993 21:18:53 GMT

Related articles |
---|

Regular Expression -> Minimal DFA algorithm markh@csd4.csd.uwm.edu (1993-05-18) |

Regular Expression -> Minimal DFA algorithm II markh@csd4.csd.uwm.edu (1993-05-20) |

Newsgroups: | comp.theory,comp.compilers |

From: | markh@csd4.csd.uwm.edu (Mark) |

Keywords: | DFA, FTP |

Organization: | Computing Services Division, University of Wisconsin - Milwaukee |

References: | 93-05-083 |

Date: | Thu, 20 May 1993 21:18:53 GMT |

Earlier, I posted a description of a Regular Expression -> DFA algorithm.

With some slight alterations, this algorithm has been modified somewhat

and clarified so that it now generates DFA's basically by directly

applying the subset construction on the regular expression itself.

The minimilization step is still kept separate. It's possible to merge

this step with the other two, but in contrast to the above, there will not

be any significant gains as a result.

You can get a copy of the software by anonymous ftp at csd4.csd.uwm.edu

in the directory pub/regex. You WILL need an ANSI-C compiler to compile

it. The library functions in <stdlib.h> (particularily realloc()) are

assumed to have the semantics indicated by ANSI. The program regex.c is

the older version, and regex2.c is the newer version.

THE ALGORITHM

Regular Expressions are defined using the following syntax:

(0) 0 ------- to denote the empty set

(1) 1 ------- to denote the empty string

(2) x ------- any valid C identifier

(3) L ------- string literal

(3) [A] ----- denotes the regular expression 1 | A.

(4) A+ ------ denotes the regular expression A A*.

(5) A* ------ ITERATION (the set 1 | A | A A | A A A | ...)

(6) A | B --- ALTERATION (the union of A and B)

(7) A B ----- CONCATENATION

For example, these are valid regular expressions:

(a [b+ a*])+ | c* a b

a* (b a*)*

If E is a regular expression, then it will be reduced to the following

normal form:

E = 1 | E1 | E2 | ...

where each alterand, E1, E2, ... has the form x E', where x is a symbol.

The following reduction rules are applied to carry out this

transformation:

x -> x 1

[A] -> 1 | A

A+ -> A A*

A* -> 1 | A+

A | B -> merge(a, b) where A reduces to a, and B to b.

0 A -> 0

1 A -> A

[A] B -> B | A B

A+ B -> A (A* B)

A* B -> B | A+ B

(A | B) C -> A C | B C

(A B) C -> A (B C)

The merge operation combines expressions starting with the same symbol,

e.g.:

merge (a 0 | b 1, b 2 | c 3) = a 0 | b (1 | 2) | c 3.

The formal definition is fairly straightforward.

The expression, A, in each term of the form x A in the sum is considered a

state in the DFA. Each newly generated state is reduced to normal form in

the same way, and this in turn may generate more new states. When all the

states that were generated from the top-level expression have been fully

reduced, the result is an DFA representing the original expression. This

process will always halt.

For example, take this expression

E = a* (b a*)*

for convenience, represent it as follows:

E = x0 x1, where x0 = a*, x1 = x2*, x2 = b x0

then E reduces as follows:

E = x0 x1 = a* x1 -> x1 | a x3, where x3 = a* x1 = x0 x1 = E

x1 = x2* -> 1 | x2+

x2+ -> x2 x0* = x2 x1 = (b x0) x1 -> b (x0 x1) = b E

thus: x1 -> merge(1, b E) = 1 | b E.

thus: E -> merge(1 | b E, a E) = 1 | a E | b E

Therefore E reduces to 1 | a E | b E.

In any implementation of this algorithm, it is ABSOLUTELY essential to

avoid the formation of duplicate expressions. Subexpressions are

therefore hashed and a table lookup precedes the formation of any

expression. Without this, the example above would get into an infinite

loop creating, for instance, an infinite number of copies of E = (x0 x1).

The software I wrote takes these steps to derive an DFA from the regular

expresison E. And it just so happens that for this example, E is the only

"state" in the DFA, so this is the minimal DFA.

Compare with the method described in Berry and Sethi ("From Regular

Expressions to Deterministic Automata" Theoretical Computer Science 1986).

After the conversion, standard techniques are used to minimlaize the DFA.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.