
Related articles:

Generalising regular expressions to stack machines & all machines whopkins@alpha2.csd.uwm.edu (2001-09-20)

Re: Generalising regular expressions to stack machines & all machines joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)

From: whopkins@alpha2.csd.uwm.edu (Mark William Hopkins)

Newsgroups: comp.compilers

Date: 20 Sep 2001 00:24:12 -0400

Organization: University of Wisconsin - Milwaukee, Computing Services Division

Keywords: parse, DFA

Posted-Date: 20 Sep 2001 00:24:12 EDT

From: Chris F Clark <world!cfc@uunet.uu.net>

>> By the way, I was very impressed that the regular expressions have

>> remained practically unchanged for a so long time.

Context free expressions are the objects you're seeking. Tentative

efforts were made to erect a theory of CFE's in the early 1970's but

this fell flat. I successfully resurrected the whole enterprise in

the late 1980's and early 1990's.

> Now, regular expressions (RE's) haven't seemed to advance in a long

> time because it has been proven that all regular expressions can be

> reduced to simple FSM's and vice-versa...

The reason for the lack of advancement was a shift in focus that took

place in the 1960's away from the mathematical and algebraic

underpinnings of formal language and automata theory to a more

procedural point of view.

It's only recently that progress has been made once again at the

foundation, particularly with recent development in areas related to

Kleene algebra.

See Kozen's web site at www.cs.cornell.edu/kozen.

There is also an extensive (but unpublished) literature originated by

me posted here and elsewhere (comp.theory, sci.math.research) on the

general issue. A few references are at

www.csd.uwm.edu/~whopkins

which is in flux again. This outlines key aspects of the more general

'Algebraic Approach'.

In the original spirit of the people who came up with the concepts of BNF

and context-free grammars, a grammar is algebraically a system of fixed

point equations whose 'solution' defines the language expression, itself.

Part of the hangup is that they originally got this conception wrong.

In place of

Grammar = System of Equations

it should have read

Grammar = System of Inequalities

And in place of

Language = Fixed point solution to the system

(with all its problems associated with non-uniqueness which held them up

seriously at the time), it should read:

Language = LEAST fixed point solution to the system

This is a generalization, since the least solution to the system of

inequalities will also be a solution to the corresponding system of

equations.

I'll show below how this concept leads you directly to the notation and

calculus for context free expressions. The developments are general and

applicable to any N-stack automaton/machine; and so comprise a general

notation for Turing-computable languages/translations (since Turing

machines are 2 stack machines).

Language expressions for given language families are direct generalisations

of regular expressions which reside in a non-trivial algebra. Regular

expressions, themselves, comprise a 'free' algebra. More generally, you

can define 'rational expressions' for any monoid M -- any algebra with

a product satisfying (ab)c = a(bc) and an identity: a1 = a = 1a

The rational expressions R(M) will then consist of the subsets formed

from the finite subsets by applications of the operators:

AB = { ab: a in A, b in B }

A+B = A union B

A* = {1} union A union AA union AAA ...
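
Concretely, the three operators can be sketched over finite sets of
words. This is a minimal Python sketch; the function names and the
bound n that truncates the infinite union in A* are my own choices:

```python
def cat(A, B):
    """AB = { ab : a in A, b in B }"""
    return {a + b for a in A for b in B}

def plus(A, B):
    """A + B = A union B"""
    return A | B

def star(A, n):
    """Truncated Kleene star: {1} union A union AA union ... up to
    n-fold products.  The true A* is the infinite union; n only
    bounds this finite approximation."""
    result, power = {''}, {''}   # '' plays the role of the identity 1
    for _ in range(n):
        power = cat(power, A)
        result |= power
    return result
```

For example, star({'ab'}, 2) gives {'', 'ab', 'abab'}.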

Regular expressions over an alphabet X are rational expressions over the

free monoid M = X* -- R(X*). The rational expressions for the direct

product monoid X* x Y* are then R(X* x Y*) which are the rational

transductions -- translations between X* and Y* which are recognizable

by finite state machines.

Therefore, finite state transductions, themselves, can actually be

specified by 'regular expressions' without any extensions. Instead, it's

the algebra which is non-free. You have relations

xy = yx, for x in X, y in Y.
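
To illustrate (a hypothetical sketch, with helper names of my own):
under these relations, a word over the merged alphabet is determined
entirely by its two projections, so 'abab' and 'aabb' denote the same
element of X* x Y*:

```python
def project(word, alphabet):
    """Keep only the letters of `word` that lie in `alphabet`."""
    return ''.join(c for c in word if c in alphabet)

def as_pair(word, X, Y):
    """Read a word over the merged alphabet as an element of X* x Y*.
    The relations xy = yx say exactly that only the two projections
    matter, not the interleaving."""
    return (project(word, X), project(word, Y))
```

With X = {'a'} and Y = {'b'}, both 'abab' and 'aabb' map to the pair
('aa', 'bb').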

This is the simplest example of the extension of regular expressions for

language or transduction families higher up the Chomsky hierarchy.

In general, the Kleene algebras R(M) don't form a free Kleene

algebra since M may have non-trivial relations. The best known example is

the Parikh vector algebra formed by taking the set

X* = {x1,x2,...,xn}*

and imposing commutativity on X* to arrive at the commutative monoid M of

Parikh vectors. Then R(M) consists of all regular expressions over Parikh

vectors.

A corollary of a well-known theorem (Parikh's Theorem), for which I

recently co-published a purely algebraic proof, in fact says that R(M) also comprises

all the expressions for context-free subsets. The paper can be found on

Kozen's web site.

So, in this case, these expressions actually are context-free expressions

for the monoid M.

This is the next simplest example of 'regular expressions' for something

higher up the Chomsky hierarchy.

Regular expressions comprise an algebra called a Kleene Algebra. This is

an algebra with a product operation (generalising "concatenation"), an

identity (generalising "empty word"), and partial ordering (generalising

"subset inclusion") such that

Least upper bounds are defined

0 = lub {}

a+b = lub {a,b}

such that

(lub A) (lub B) = lub {ab:a in A, b in B} = lub AB

This structure defines a DIOID. In addition, there's a Kleene star

operator:

a* = lub {1,a,aa,aaa,aaaa,...}

such that

*-Continuity: lub (ac, abc, abbc, abbbc, ...) = a b* c

An abstract Kleene algebra defines the operator in the abstract by the

axioms:

a* >= 1 + a a*

x >= a + bx + xc --> x >= b* a c*

where >= is the partial ordering.

Both algebras completely capture the essence of regular expressions in

that every identity between regular expressions is provable within

either algebra. A proof can be found at my web site. Other proofs

can be found at Kozen's site.

If you allow least upper bounds on more general subsets, you can

arrive at different varieties of Kleene algebras. The Kleene algebra

with the *-continuity property above can be equivalently described by

the

Rational Continuity Property:

Every rational subset of the algebra K (i.e. every member of R(K)) has a least upper bound, and

lub A lub B = lub (AB) for all rational subsets A, B in R(K)

The most general variety requires these two properties for all

subsets. The resulting structure is called a QUANTALE.

To specify an algebraic notation or calculus for context-free

expressions means to be able to solve fixed point systems. That's in

the original spirit of the people who developed BNF's.

In the algebraic spirit, a grammar is a system of inequalities. The

grammar

S -> NP VP

VP -> v NP; VP -> VP PP

PP -> p NP

NP -> n; NP -> NP PP

specifies a language consisting of a subset of {v,p,n}* which is the

least solution to the corresponding system of inequalities:

S >= NP VP; VP >= v NP; VP >= VP PP

PP >= p NP; NP >= n; NP >= NP PP

As you can verify (using Kleene algebra) the least solution is

PP = pn (pn)*; NP = n (pn)*; VP = vn (pn)*; S = n (pn)* vn (pn)*

so the language would actually be the regular language given by the

regular expression

S = n (pn)* vn (pn)*
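
This can be checked mechanically by Kleene iteration: start all four
variables at the bottom element and apply the right-hand sides of the
inequalities until nothing changes. A Python sketch (the truncation
to words of length <= MAXLEN is my own device to keep the sets
finite):

```python
import re
from itertools import product

MAXLEN = 6  # truncate the least fixed point to words of length <= MAXLEN

def cat(A, B):
    return {a + b for a in A for b in B if len(a + b) <= MAXLEN}

def least_solution():
    # Iterate the right-hand sides from the bottom (all-empty)
    # assignment; monotonicity guarantees convergence at this bound.
    S, VP, PP, NP = set(), set(), set(), set()
    while True:
        S2  = cat(NP, VP)
        VP2 = cat({'v'}, NP) | cat(VP, PP)
        PP2 = cat({'p'}, NP)
        NP2 = {'n'} | cat(NP, PP)
        if (S2, VP2, PP2, NP2) == (S, VP, PP, NP):
            return S
        S, VP, PP, NP = S2, VP2, PP2, NP2

def regex_language():
    # All words of length <= MAXLEN matching n (pn)* vn (pn)*
    pat = re.compile(r'n(?:pn)*vn(?:pn)*')
    return {w for L in range(MAXLEN + 1)
              for w in map(''.join, product('nvp', repeat=L))
              if pat.fullmatch(w)}
```

Both sides come out to {'nvn', 'npnvn', 'nvnpn'} at this bound.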

What hung up everyone is the question:

What about when the solution is NOT a regular language?

In the case of Parikh's theorem, you CAN stay within the confines of

regular expressions and there is a relatively simple way to find the

solution and (in fact) a closed form for the solution! This makes use

of a formal calculus structure complete with its own definition of

differential operators, Taylor's Theorem, etc.

To be able to solve general fixed point systems means to take the

Kleene algebra in question K, and extend it so that it includes all

the least fixed point solutions of all its systems.

This can be done for Kleene algebras with *-continuity. It's an open

issue whether a fixed-point extension exists for Kleene Algebras where

the * operator is defined abstractly -- because the axioms aren't

strong enough.

The result, for instance, of extending the regular expression algebra

R(X*) will then be C(X*) the algebra of context-free languages over

alphabet X.

Extending R(M) will yield C(M), the algebra of context-free subsets of

the monoid M.

In the special case, M = X* x Y*, you get the algebra C(X* x Y*),

which is the algebra of push-down transductions. THIS PROVIDES A

BASIS FOR A PURELY ALGEBRAIC THEORY OF/CALCULUS & NOTATION FOR PARSING

THEORY completely on par with and analogous to the theory of regular

expressions.

This extension is described as follows.

Take K, formally add in a set of operators

<0|, <1|, <2|, ..., <n-1|

|0>, |1>, |2>, ..., |n-1>

This is the same Bra-Ket notation used in Quantum Theory. They are subject

to the same identities:

<i| |j> = 1 if i = j; 0 otherwise

|0><0| + |1><1| + ... + |n-1><n-1| = 1

Subject these to the relations

<i| k = k <i|; |i> k = k |i>

and impose the axioms for a Quantale. The resulting structure gives you

the algebra and notation for context-free sets.

Call the resulting algebra Cn(K), if the original algebra is K. It's

enough to just work with n = 2. Then:

FUNDAMENTAL THEOREM OF CONTEXT FREE EXPRESSIONS:

* The set of all expressions e in C2(K) which commute with all

the bra-ket symbols comprises the closure of the algebra K.

* Every context-free subset of K has a least upper bound in C2(K)

representable in this way.

This is actually directly provable for the special case K = R(X*) or

K = R(X* x Y*), using an (infinite-dimensional) MATRIX REPRESENTATION for the

notation. The general proof for general (*-continuous) Kleene algebras K is

more complex; since it involves extending K to a Quantale. In one of my

writeups I show how this can be done; but have not yet integrated the

construction into a complete proof of the results above.

Examples of grammars and their solutions are listed below:

EXAMPLE 1:

S -> aSb | SS | x

S = <0| (a<1| + x + |1>b)* |0>

EXAMPLE 2:

S -> aSb | 1

S = <0| (a<1|)* (|1>b)* |0>
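
Read operationally, this expression is a one-counter recognizer for
the language a^n b^n: <0| starts from a cleared stack, each a<1|
reads an a and pushes, each |1>b pops and reads a b, and the final
|0> accepts only on an empty stack with the input exhausted. A
minimal sketch:

```python
def accepts(word):
    """Run <0| (a<1|)* (|1>b)* |0> as a one-counter pushdown recognizer."""
    depth = 0                                   # <0| : start with a cleared stack
    i = 0
    while i < len(word) and word[i] == 'a':     # (a <1|)* : read a, push 1
        depth += 1
        i += 1
    while i < len(word) and word[i] == 'b' and depth > 0:
        depth -= 1                              # (|1> b)* : pop 1, read b
        i += 1
    return i == len(word) and depth == 0        # |0> : empty stack, input consumed
```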

EXAMPLE 3:

S -> SSa | xb | c

S = <0| (xb+c) ((xb+c)<1| + |1>a)* |0>

This is over the word algebra X* x Y*, where X = {x}, Y = {a,b,c}.

This can be interpreted as the grammar S -> SS | x | 1, annotated by

'parse tree construction' operations

a: Build S from SS

b: Build S from x

c: Build S from 1

A word (vw) in S consists of a word v in X*, and its corresponding parse

tree w in Y*. Using Kleene algebra identities, you can write this as

the sum (least upper bound):

S = <0| c U |0>

S = <0| [c U<1|] xb U (xb<1| U)* |0>

S = x^0 <0| cU |0>

+ x^{n+1} sum <0| [ cU<1| ] bU (b<1|U)^n |0>, summed over all n=0,1,2,...

where

U = (c<1| + |1>a)*

[A] is the operator defined as A + 1

So the set of translations corresponding to the word xxx would be

S[xxx] = <0| [cU<1|] bU b<1|U b<1|U |0>

In this grammar, xxx not only has an infinite number of parses (the

set above is infinite), but the parses themselves form an entire

context-free language!

This is a clear demonstration of how and why the general algebraic

notation is strictly more powerful than traditional parse tree

representations; which can't even begin to approach an example like

this in as efficient a manner. [The best a parse-forest representation

can do is a representation which is order n^3. The context-free expression

representation is order n].

EXAMPLE 4:

S -> NP VP a

VP -> v NP b | VP PP c

PP -> p NP d

NP -> n e | NP PP f

S = <0| ne (B + Apne<2|)* A |0>

where

A = |1>b + |2>dc

B = (p<3| + |0>v<0|<1|) ne + |3>df

In this example, the word algebra is M = {n,v,p}* x {a,b,c,d,e,f}* with

input alphabet X = {n,v,p} and output alphabet {a,b,c,d,e,f}. Unlike the

similar example posed before where the annotations consisting of the

symbols a-f were absent, this solution is NOT regular, but

context-free. You can simplify it to the equivalent system:

S >= NP v NP b (p NP d c)*

NP >= n e (p NP d f)*

VP >= v NP b (p NP d c)*

PP >= p NP d

thus effectively reducing it to 2 variables. The least solution commutes

with all the operators so we can build it up progressively.

Given the solutions S and NP, define SF, NPF as the least solutions to:

NPF >= |0><0| v NP b (p NP d c)* SF +

|1> b (p NP d c)* SF +

|2> d c (p NP d c)* SF +

|3> d f (p NP d f)* NPF

SF >= |0>

and define SS = S SF; NPS = NP NPF. The solutions SF, NPF are solutions

to the corresponding equations, so we can write:

<1| NPS = <1| NP NPF = NP <1| NPF = NP b (p NP d c)* SF

<2| NPS = NP <2| NPF = NP d c (p NP d c)* SF

<3| NPS = NP <3| NPF = NP d f (p NP d f)* NPF

<0| NPS = NP <0| NPF = <0| NP v NP b (p NP d c)* SF = <0| SS

<0| SS = S <0| SF = S <0| |0> = S

Using Kleene algebra identities you get:

|0><0| v NP b (p NP d c)* SF = |0><0| v <1| NPS

|1> b (p NP d c)* SF = |1> b SF + |1> b p NP d c (p NP d c)* SF

= |1> b (SF + p <2| NPS)

|2> d c (p NP d c)* SF = |2> d c (SF + p <2| NPS)

|3> d f (p NP d f)* NPF = |3> d f (NPF + p <3| NPS)

NP NPF = n e (p NP d f)* NPF

= n e NPF + n e p NP d f (p NP d f)* NPF

= n e NPF + n e p <3| NPS

so

S = <0| SS = <0| NPS

NPS = NP NPF = n e (NPF + p <3| NPS)

NPF = |0><0| v <1| NPS

+ |1> b (SF + p <2| NPS)

+ |2> d c (SF + p <2| NPS)

+ |3> d f (NPF + p <3| NPS)

or substituting NPQ = NPF + p <3| NPS and SQ = SF + p <2| NPS, you get

S = <0| NPS

NPS = n e NPQ

NPQ = p <3| NPS

+ |0><0| v <1| NPS

+ (|1> b + |2> d c) SQ

+ |3> d f NPQ

SQ = |0> + p <2| NPS

whose least solution is a regular expression:

S = <0| ne (B + Apne<2|)* A |0>

SQ = (pne<2|B*A)* |0>

NPQ = (B + Apne<2|)* A |0>

NPS = ne (B + Apne<2|)* A |0>

where

A = |1>b + |2>dc

B = (p<3| + |0>v<0|<1|) ne + |3>df

So the language is:

S = <0| ne (B + Apne<2|)* A |0>

In a parsing application, the system, itself, would actually be of more

fundamental importance. This system,

S = <0| NPS

NPS = n e NPQ

NPQ = p <3| NPS

+ |0><0| v <1| NPS

+ (|1> b + |2> d c) SQ

+ |3> d f NPQ

SQ = |0> + p <2| NPS

is directly interpretable as a push down transducer (= parser) in which

<0| = Clear stack

|0> = Test for empty stack

<i| = Push symbol i, i = 1, 2, 3

|i> = Test & pop for symbol i, i = 1, 2, 3

p = Read & test for equality to p

n = Read & test for equality to n

v = Read & test for equality to v

a,b,c,d,e,f = Do/write the corresponding action/symbol
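
Under that reading, the system can be run directly as a
nondeterministic pushdown transducer. The sketch below transcribes
the four equations into mutually recursive procedures and enumerates
every accepting run by backtracking (the depth-first strategy and
function names are my own; the transitions come straight from the
system):

```python
def parse(word):
    """Enumerate all outputs of the pushdown transducer for
    S = <0| NPS over the input alphabet {n, v, p}."""
    results = set()

    def NPS(i, stack, out):                          # NPS = n e NPQ
        if i < len(word) and word[i] == 'n':
            NPQ(i + 1, stack, out + 'e')

    def NPQ(i, stack, out):
        if i < len(word) and word[i] == 'p':         # p <3| NPS
            NPS(i + 1, stack + ['3'], out)
        if not stack and i < len(word) and word[i] == 'v':
            NPS(i + 1, stack + ['1'], out)           # |0><0| v <1| NPS
        if stack and stack[-1] == '1':               # |1> b SQ
            SQ(i, stack[:-1], out + 'b')
        if stack and stack[-1] == '2':               # |2> d c SQ
            SQ(i, stack[:-1], out + 'dc')
        if stack and stack[-1] == '3':               # |3> d f NPQ
            NPQ(i, stack[:-1], out + 'df')

    def SQ(i, stack, out):
        if not stack and i == len(word):             # |0> : accept on empty stack
            results.add(out)
        if i < len(word) and word[i] == 'p':         # p <2| NPS
            NPS(i + 1, stack + ['2'], out)

    NPS(0, [], '')                                   # S = <0| NPS : cleared stack
    return results
```

On the unambiguous input nvn this yields the single translation eeb;
on nvnpn it yields both eebedc and eeedfb -- the two PP attachments
-- so the grammar's ambiguity shows up directly in the output set.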

All of this can be generalised to Turing machines since a Turing machine

is just a 2-stack machine. One then works with two copies of the

bra-ket algebra; each copy commuting with the other and with the original

algebra K.

So, any computation can be represented as a 'regular expression'. The

classical algebra and theory of regular expressions generalise completely

to a general calculus for computation.
