Sat, 2 Oct 1993 17:47:13 GMT

Related articles |
---|

The Equational Approach: (1) Regular Expressions -> DFA's. markh@csd4.csd.uwm.edu (1993-10-02) |

Newsgroups: | comp.compilers,comp.theory |

From: | markh@csd4.csd.uwm.edu (Mark) |

Keywords: | parse, theory |

Organization: | Computing Services Division, University of Wisconsin - Milwaukee |

Date: | Sat, 2 Oct 1993 17:47:13 GMT |

(1.1) Introduction

This article is a lead into a brief series of articles that illustrate

the algebraic and equational approach to formal language theory, and its

applications, such as constructing efficient and intuitive parsers, and

performing optimizations. I will demonstrate why it is actually better to

view languages, grammars, and automata as systems of equations, by

exploiting this view throughout. This is basis of methods I've been using

for quite a while now to write language software of all sorts, and for

performing symbolic manipulations on languages and automata.

Since the question indicated by the topic has been brought up a few

times in comp.compilers, the initial article is also being posted there.

Follow-ups will be send exclusively to comp.theory. Another question that

will be indirectly answered down the line is how to generate parsers for

context free grammars written in EBNF notation.

A couple months back, I described a method for reducing regular

expressions to finite automata that involves performing algebraic

manipulations on regular expressions. The goal of these calculations is

to get a system of equations of the form:

Q = 1 + x1 Q1 + x2 Q2 + ...

or Q = x1 Q1 + x2 Q2 + ...

which is the resulting automaton in equational form.

What's described below is a formalization of the method, and terms.

There are slight differences between the algorithm described here and the

one originally presented, this version tends to be a bit more powerful in

that it can handle a wider range of regular expression operators.

A regular expression over an alphabet, X, is built out of the following

items and operations:

(0) 0 ------- to denote the empty set.

(1) 1 ------- to denote the empty string in X*.

(2) x ------- an element of X.

(3) [A] ----- denotes the regular expression 1 + A.

(4) A+ ------ denotes the regular expression A A*.

(5) A* ------ ITERATION (the set 1 + A + A A + A A A + ...)

(6) A + B --- ALTERATION (the union of A and B)

(7) A B ----- CONCATENATION

the following operation may also be included:

(8) A - B --- DIFFERENCE (the set difference of A and B).

For example, these are valid regular expressions:

(a [b+ a*])+ + c* a b (E1)

a* (b a*)* (E2)

The goal in converting a regular expression to a DFA is to get a series of

linear equations between a number of regular expressions { Q0, Q1 ...

Q{q-1} ), with Q0 being identified as the original expression. For

example, systems of equations corresponding to the regular expressions

above are:

(a [b+ a*])+ + c* a b (S1)

Q0 = a Q1 + c Q2 (Q0 = (a [b+ a*])+ + c* a b)

Q1 = 1 + a Q1 + b Q1 (Q1 = (a + b)*)

Q2 = a Q3 + c Q2 (Q2 = c* a b)

Q3 = b Q4 (Q3 = b)

Q4 = 1 (Q4 = 1)

a* (b a*)* (S2)

Q0 = 1 + a Q0 + b Q0

The right hand side of each equation is a sum of terms of one of the forms:

x Q or 1

Because all the Q's appear only once in each term, and as a factor on the

right hand side of the term, the system is known as a Right Linear System.

(1.2) Automaton = System of Equations

If you've ever had to deal with the task of writing a program to do

symbolic manipulations on circuits, one thing would have stood out in

time: a circuit diagram is actually a concise visual encoding of a

mathematical system of relations. Each component stands for a equation

(or more generally, a relation) between the currents and voltages of its

terminals, each line represents a variable representing a voltage.

The same is true for automata, though this fact is not widely

appreciated. A Finite Automaton is equivalent to a Right Linear System of

equations with the following relations:

States ............... Expressions Q0, Q1,..., Q{q-1}

Start State .......... Q0

Q a final state ...... Q = 1 + ...

Arcs Q --[x]--> Q' ... Q = ... + x Q' + ...

The transition function (or more generally, the transition relation) of

the automaton is thus effectively coded as the system itself.

As an exercise, draw the automata corresponding to the systems S1 and

S2 listed above using this correspondence.

If you've ever studied Quantum Field Theory, you'll notice a passing

resemblance in this way of approaching matters to what is done there

(where the terms in perturbation expansions are represented by diagrams,

with rules for manipulating them).

Formally, if the transition function of the automaton is d: Q x X -> Q,

where Q is the set of states, and if \Q is defined to be 1 when Q is a

final state, and 0 otherwise, then the system of equations derived from the

automaton can be represented compactly as:

Q = \Q + sum for all x in X: (x d(Q, x))

Since this correspondence is so readily transparent, I won't even

bother to make the distinction and will consider an automaton to be

nothing more than a fancy way of depicting a system of equations.

This correspondence, incidentally, generalizes to Push Down Automata,

as you will see later.

(1.3) The Factoring Property.

Every regular expression over the alphabet X is a set of words taken

from X. If this expression is denoted E, then make the following

definitions:

\E = 1 if 1 is contained in the regular expression E

= 0 else

and for all symbols x in X:

x\E = { w: x w is contained in E }

The latter operator bears a degree of resemblance to the partial

derivative operator and so the same name is sometimes used (and a notation

less constrained by the ASCII character set, more resembling the

operator).

Take any element of the regular expression E. It's a word in X*.

Either this word is the empty word (in which case \E = 1), or it begins

with a symbol, x, and has the form x w (in which case w is in the set x\E,

and x w in the set x (x\E)). Thus the FACTORING PROPERTY:

E = \E + sum for all x in X: x (x\E)

Any method used to perform this decomposition will, as a consequence,

convert a regular expression into a system of Right Linear Equations -- a

finite automaton.

(1.4) Factoring Regular Expressions

From the definitions of \E and x\E, you can derive the basic properties

(which could have also served to define the operators) with respect to which

the calculations can be carried out. Experience proves that it's actually

better NOT to calculate each of the x\E's individually, but rather en masse.

Therefore, define the Total Differential Operator:

dE = sum x (x\E)

A consequence of this definition is that:

E = \E + dE

The following properties are then evident (and should be very easy for you

to prove -- which is your exercise :)):

E \E x\E dE

0 0 0 0

1 1 0 0

x 0 1 x

y 0 0 y

[A] 1 x\A dA

A* 1 x\A A* dA A*

A+ \A x\A A* dA A*

A B \A \B \A x\B + x\A B \A dB + dA B

A + B \A + \B x\A + x\B dA + dB

A - B \A - \B x\A - x\B dA - dB

where it's assumed that x and y are distinct. Note that 1 - 1 = 0, 1 - 0 = 0,

0 - 0 = 0, BUT 0 - 1 = 0 in this algebra.

(1.5) Calculation Method

The calculation method comes straight out of the table, after noting the

following rules:

(a) (A B) C = A (B C)

x = x 1, where x is any symbol

1 A = A

(b) Terms can be ordered and grouped in any way.

(A + B) C = A C + B C

A + A = A

0 + A = A

x A + x B = x (A + B)

(c) x A - x B = x (A - B)

A - A = 0

0 - A = 0

To decompose a regular expression into its factors, make the following

transformations:

[A] = 1 + dA

A* = 1 + dA A*

A+ = \A + dA A*

A B = (\A \B) + \A dB + dA B

A + B = (\A + \B) + (dA + dB)

using the table to calculate the terms \A, \B, dA, dB. It is to be

understood that these reductions only apply to the expression itself, NOT

TO ANY OF ITS SUBEXPRESSIONS. Apply rules (b), and (c) as needed to

collect all the like terms, and apply rules (a), (b), (c) as needed to get

the final result. This system of reduction rules, by the way, will always

lead to the desired result in a finite number of steps, because the terms

being operated on by the dA and \A operators grow progressively smaller as

you carry out the reduction.

The result of this calculation is to calculate the state of the resulting

finite automata corresponding to the original expression. To calculate

the other states, you will then have to decompose each of the expressions

appearing in the factoring that have not already been decomposed. An

implementation will take care to store the expressions encountered in a

hash table in order to avoid duplications.

(1.6) Examples

(a* b)* a* (E3)

d ( (a* b)* a* ) = d(a* b) (a* b)* a* + da a*

= (a a* b + b) (a* b)* a* + a a*

= a (a* b (a* b)* + a*) a* + b (a* b)* a

d ( a* b (a* b)* a* + a*) = a a* b (a* b)* a* + b (a* b)* a* + a a*

= a (a* b (a* b)* a* + a*) + b (a* b)* a*

\ ( (a* b)* a*) = \(a* b)* \a* = 1 1 = 1

\ ( a* b (a* b)* + a*) = \(a* b (a* b)*) + \a* = \(a* b (a* b)*) + 1 = 1

Resulting from this is a 2-state finite automaton given by the system of

equations:

Q0 = 1 + a Q1 + b Q0 (Q0 = (a* b)* a*) (S3)

Q1 = 1 + a Q1 + b Q0 (Q1 = a* b (a* b)* a* + a*)

A minimalization step can be added to eliminate equivalent terms, which

will lead to a 1-state minimal DFA (Q0 = 1 + a Q0 + b Q0). In the

presence of the difference operator it will also be necessary to eliminate

useless states (those that generate no words), before minimalization.

Another example that uses the difference operator is the following:

(a + b)* - b* a a b* (E4)

For convenience, label the following expressions (I'm cheating by looking

ahead):

Q3 = (a + b)*

Q0 = (a + b)* - b* a a b* = Q3 - b* a a b*

Q1 = (a + b)* - a b* = Q3 - a b*

Q2 = (a + b)* - b* = Q3 - b*

then:

\Q3 = \(a + b)* = 1

d Q3 = (a + b)(a + b)* = a Q3 + b Q3

\Q0 = \Q3 - \(b* a a b*) = 1 - 0 = 1

d Q0

= d Q3 - d (b* a a b*)

= a Q3 + b Q3 - (b b* a a b* + a a b*)

= a (Q3 - a b*) + b (Q3 - b* a a b*)

= a Q1 + b Q0

\Q1 = \Q3 - \(a b*) = 1 - 0 = 1

d Q1

= d Q3 - d(a b*)

= a Q3 + b Q3 - a b*

= a (Q3 - b*) + b Q3

= a Q2 + b Q3

\Q2 = \Q3 - \b* = 1 - 1 = 0

d Q2

= d Q3 - d b*

= a Q3 + b Q3 - b b*

= a Q3 + b ( Q3 - b*)

= a Q3 + b Q2

The result is a 4 state automaton:

Q0 = 1 + a Q1 + b Q0 (S4)

Q1 = 1 + a Q2 + b Q3

Q2 = a Q3 + b Q2

Q3 = 1 + a Q3 + b Q3

(1.7) Complexity

This algorithm (and all algorithms for converting R.E.'s to DFA's) will

generate at least an exponential number of states in the worst case. In

particular, the regular expression:

(a + b)* a (a + b)^n

has a 2^(n+1) state minimal DFA. There are algorithms that will convert

regular expressions into NFA's in polynomial (and possibly even linear)

time, the R.E. -> NFA algorithm I presented one here a while back does.

But what's also common, as an approach, is to use a LAZY implementation of

the R.E. -> DFA conversion. That is, instead of converting the regular

expression to a DFA before processing input, defer expansion of a state

until it is actually encountered in an input.

The conversion to an NFA can be accomplished in the same way as

described above, if the difference operator (A - B) is not used, and if

the rule:

x A + x B = x (A + B)

is no longer used. For example, converting the expression:

(a + b)* a (a + b) (a + b) (a + b)

you get the following system (exercise: do the calculations):

Q0 = a Q0 + b Q0 + a Q1

Q1 = a Q2 + b Q2

Q2 = a Q3 + b Q3

Q3 = a Q4 + b Q4

Q4 = 1

whereas if you attempted to combine a Q0 and a Q1 into a single term a (Q0

+ Q1) and expand Q0 + Q1 (and each of the terms spawned from combining the

terms on the right hand side) the automaton will blow up into 16 states.

(1.8) Derivation Sequences = Inequalities

One can define an inequality relation on regular expressions by the

following:

A >= B iff A = A + B

(which is just a way of saying B is a subset of A, without mentioning sets).

Then the following properties hold:

(a) A >= \A, A >= x x\A, A >= dA

(b) A >= B if and only if \A >= \B and x\A >= x\B

(c) A >= 0

(d) if A >= B and B >= A then A = B

(e) if A >= B and B >= C then A >= C

(f) A >= A

which effectively serves as an alternate definition.

When you carry out a reduction of a state Q, to derive a word w in the

language, you're actually performing a series of inequalities to prove

that Q >= w. For example, with the finite automaton S4

Q0 = 1 + a Q1 + b Q0 (S4)

Q1 = 1 + a Q2 + b Q3

Q2 = a Q3 + b Q2

Q3 = 1 + a Q3 + b Q3

you can make the following derivation:

Q0 >= a Q1 >= a a Q2 >= a a 1 = a a

thus, Q0 >= a a.

A consequence of this definition is the following property:

E = { w in X*: E >= w }

that is, every regular expression is equal to the set of words derived

from it.

(1.9) Infinite Automata and Recursive Systems of Equations

Suppose you attempt to treat the following set as a regular expression

and factor it:

S = { a^n b^n: n >= 0 }.

This set is effectively an infinite regular expression given by the

following recursive equation:

S = 1 + a b + a a b b + a a a b b b + ...

= 1 + a (1 + a b + a a b b + ... ) b

= 1 + a S b

If you attempt to construct a DFA for it, you will end up with an

infinite number of states. In fact, we'll do that. Define the following

classes of expressions:

S(n) = S b^n n >= 0

T(n) = b^n

Then the following holds:

\T(0) = 1

\T(n) = 0, if n > 0

dT(0) = 0

dT(n) = b T(n - 1) if n > 0

\S(n) = \S \b^n = \(1 + a S b) \T(n) = 1 \T(n)

= \T(n)

dS(n) = dS b^n + d(b^n)

= a S b b^n + dT(n)

= a S(n + 1) + dT(n)

Therefore, you get the following (infinite) system of equations:

S(0) = 1 + a S(1)

S(n) = b T(n - 1) + a S(n + 1) if n > 0

T(0) = 1

T(n) = b T(n - 1), if n > 0

This represents the infinite automaton:

[[S(0)]]

| a

v b

S(1) ---------> [[T(0)]]

| a ^

v b | b

S(2) ---------> T(1)

| a ^

v b | b

S(3) ---------> T(2)

| a ^

v b | b

... ...

It's easy to show this is minimal. The only possible equivalent states here

are (apart from equality):

S(n) = S(n'), 0 < n < n'

T(n) = T(n'), 0 < n < n'

Let N be the smallest integer, n, such that either of the equivalences above

holds. If

S(N) = S(n'),

then

b T(N) + a S(N+1) = b T(n') + a S(n'+1),

and thus

T(N) = T(n').

If

T(N) = T(n')

then

b T(N-1) = b T(n'-1).

Thus

T(N-1) = T(n'-1),

which is impossible by assumption. This Automaton represents a special

case of what I call an Indexed Infinite Automaton (or, committing a major

abuse of terms, an Infinite Finite Automaton), and the grammar is an Index

Grammar of degree 1 and dimension 1. The automaton has the topology of a

1-ary tree.

If you ever studied Quantum Theory, you should have experienced a

feeling of deja vu. It actually looks like we're talking about quantum

states, energy levels, and operators, and that resemblance you'll see is a

little bit more than a passing resemblance. The states S(0) and T(0) look

like ground states, the transition S(n) >= a S(n + 1) like the absorption

of a quantum of energy, and the transition T(n) >= b T(n - 1), the

emission of a quantum of energy.

If you introduce some quantum notation (bras and kets) and apply a

similar set of formal properties to these operators, the result will be

that you've just come up with a way to represent solutions to recursive

systems of equations.

As you'll see later on, the expression above will be represented by the

expression:

<0| (<1| a)* (b |1>)* |0>

and thus, you can actually solve the equation S = 1 + a S b. The <n|'s

are Creation operators, and the |n>'s are Anhilliation operators. There

are Commutation and Orthogonality relations. For example, you can perform

the following manipulations:

<0| (<1| a) (b |1>) |0>

= <0| <1| |1> |0> a b

= <0| |0> a b

= a b

and

<0| (<1| a) |0> = <0| <1| |0> a = <0| 0 a = 0

To summarize development in one term: you now have a generalization of

Regular Expressions, known as Context Free Expressions...

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.