|Regular Expression -> Minimal DFA algorithm email@example.com (1993-05-18)|
|Regular Expression -> Minimal DFA algorithm II firstname.lastname@example.org (1993-05-20)|
|Organization:||Computing Services Division, University of Wisconsin - Milwaukee|
|Date:||Thu, 20 May 1993 21:18:53 GMT|
Earlier, I posted a description of a Regular Expression -> DFA algorithm.
With some slight alterations, this algorithm has been modified somewhat
and clarified so that it now generates DFA's basically by directly
applying the subset construction on the regular expression itself.
The minimilization step is still kept separate. It's possible to merge
this step with the other two, but in contrast to the above, there will not
be any significant gains as a result.
You can get a copy of the software by anonymous ftp at csd4.csd.uwm.edu
in the directory pub/regex. You WILL need an ANSI-C compiler to compile
it. The library functions in <stdlib.h> (particularily realloc()) are
assumed to have the semantics indicated by ANSI. The program regex.c is
the older version, and regex2.c is the newer version.
Regular Expressions are defined using the following syntax:
(0) 0 ------- to denote the empty set
(1) 1 ------- to denote the empty string
(2) x ------- any valid C identifier
(3) L ------- string literal
(3) [A] ----- denotes the regular expression 1 | A.
(4) A+ ------ denotes the regular expression A A*.
(5) A* ------ ITERATION (the set 1 | A | A A | A A A | ...)
(6) A | B --- ALTERATION (the union of A and B)
(7) A B ----- CONCATENATION
For example, these are valid regular expressions:
(a [b+ a*])+ | c* a b
a* (b a*)*
If E is a regular expression, then it will be reduced to the following
E = 1 | E1 | E2 | ...
where each alterand, E1, E2, ... has the form x E', where x is a symbol.
The following reduction rules are applied to carry out this
x -> x 1
[A] -> 1 | A
A+ -> A A*
A* -> 1 | A+
A | B -> merge(a, b) where A reduces to a, and B to b.
0 A -> 0
1 A -> A
[A] B -> B | A B
A+ B -> A (A* B)
A* B -> B | A+ B
(A | B) C -> A C | B C
(A B) C -> A (B C)
The merge operation combines expressions starting with the same symbol,
merge (a 0 | b 1, b 2 | c 3) = a 0 | b (1 | 2) | c 3.
The formal definition is fairly straightforward.
The expression, A, in each term of the form x A in the sum is considered a
state in the DFA. Each newly generated state is reduced to normal form in
the same way, and this in turn may generate more new states. When all the
states that were generated from the top-level expression have been fully
reduced, the result is an DFA representing the original expression. This
process will always halt.
For example, take this expression
E = a* (b a*)*
for convenience, represent it as follows:
E = x0 x1, where x0 = a*, x1 = x2*, x2 = b x0
then E reduces as follows:
E = x0 x1 = a* x1 -> x1 | a x3, where x3 = a* x1 = x0 x1 = E
x1 = x2* -> 1 | x2+
x2+ -> x2 x0* = x2 x1 = (b x0) x1 -> b (x0 x1) = b E
thus: x1 -> merge(1, b E) = 1 | b E.
thus: E -> merge(1 | b E, a E) = 1 | a E | b E
Therefore E reduces to 1 | a E | b E.
In any implementation of this algorithm, it is ABSOLUTELY essential to
avoid the formation of duplicate expressions. Subexpressions are
therefore hashed and a table lookup precedes the formation of any
expression. Without this, the example above would get into an infinite
loop creating, for instance, an infinite number of copies of E = (x0 x1).
The software I wrote takes these steps to derive an DFA from the regular
expresison E. And it just so happens that for this example, E is the only
"state" in the DFA, so this is the minimal DFA.
Compare with the method described in Berry and Sethi ("From Regular
Expressions to Deterministic Automata" Theoretical Computer Science 1986).
After the conversion, standard techniques are used to minimlaize the DFA.
Return to the
Search the comp.compilers archives again.