Generalising regular expressions to stack machines & all machines

whopkins@alpha2.csd.uwm.edu (Mark William Hopkins)
20 Sep 2001 00:24:12 -0400

          From comp.compilers

Related articles
Generalising regular expressions to stack machines & all machines whopkins@alpha2.csd.uwm.edu (2001-09-20)
Re: Generalising regular expressions to stack machines & all machines joachim_d@gmx.de (Joachim Durchholz) (2001-10-06)
| List of all articles for this month |
From: whopkins@alpha2.csd.uwm.edu (Mark William Hopkins)
Newsgroups: comp.compilers
Date: 20 Sep 2001 00:24:12 -0400
Organization: University of Wisconsin - Milwaukee, Computing Services Division
Keywords: parse, DFA
Posted-Date: 20 Sep 2001 00:24:12 EDT

From: Chris F Clark <world!cfc@uunet.uu.net>
>> By the way, I was very impressed that the regular expressions have
>> remained practically unchanged for a so long time.


Context free expressions are the objects you're seeking. Tentative
efforts were made to erect a theory of CFE's in the early 1970's but
this fell flat. I successfully resurrected the whole enterprise in
the late 1980's and early 1990's.


>Now, regular expressions (RE's) haven't seemed to advance in a long
>time because it has been proven that all regular expressions can be
>reduced to simple FSM's and vice-versa...


The reason for the lack of advancement was a shift in focus that took
place in the 1960's away from the mathematical and algebraic
underpinnings of formal language and automata theory to a more
procedural point of view.


It's only recently that progress has been had once again at the
foundation, particularly with recent development in areas related to
Kleene algebra.


See Kozen's web site at www.cs.cornell.edu/kozen.


There is also an extensive (but unpublished) literature originated by
me posted here and elsewhere (comp.theory, sci.math.research) on the
general issue. A few references are at


www.csd.uwm.edu/~whopkins


which is in flux again. This outlines key aspects of the more general
'Algebraic Approach'.


In the original spirit of the people who came up with the concepts of BNF
and context-free grammars, a grammar is algebraically a system of fixed
point equations whose 'solution' defines the language expression, itself.
Part of the hangup is that they originally got this conception wrong.
In place of


                            Grammar = System of Equations


it should have read


                            Grammar = System of Inequalities


And in place of


                            Language = Fixed point solution to the system


(with all its problems associated with non-uniqueness which held them up
seriously at the time), it should read:


                            Lanugage = LEAST fixed point solution to the system


This is a generalization, since the least solution to the system of
inequalities will also be a solution to the corresponding system of
equations.


I'll show below how this concept leads you directly to the notation and
calculus for context free expressions. The developments are general and
applicable to any N-stack automaton/machine; and so comprise a general
notation for Turing computeable language/translations (since Turing
machines are 2 stack machines).


Language expressions for given language families are direct generalisations
of regular expressions which reside in a non-trivial algebra. Regular
expressions, themselves, comprise a 'free' algebra. More generally, you
can define 'rational expressions' for any algebra M with the properties


                  M has a product with properties (ab)c = a(bc); a1 = a = 1a


The rational expressions R(M) will then consist of the subsets formed
from the finite subsets by applications of the operators:


                          AB = { ab: a in A, b in B }
                          A+B = A union B
                          A* = {1} union A union AA union AAA ...


Regular expressions over an alphabet X are rational expressions over the
free monoid M = X* -- R(X*). The rational expressions for the direct
product monoid X* x Y* are then R(X* x Y*) which are the rational
transductions -- translations between X* and Y* which are recognizeable
by finite state machines.


Therefore, finite state transductions, themselves, can actually be
specified by 'regular expressions' without any extensions. Instead, it's
the algebra which is non-free. You have relations


                                          xy = yx, for x in X, y in Y.


This is the simplest example of the extension of regular expressions for
language or transduction families higher up the Chomsky hierarchy.


Generally the Kleene algebras R(M) don't generally form a free Kleene
algebra since M may have non-trivial relations. The best known example is
the Parikh vector algebra formed by taking the set


                                X* = {x1,x2,...,xn}*


and imposing commutativity on X* to arrive at the commutative monoid M of
Parikh vectors. Then R(M) consists of all regular expressions over Parikh
vectors.


A corollary of a well-known theorem which I recently co-published a purely
algberaic proof to (Parikh's Theorem) in fact says that R(M) also comprises
all the expressions for context-free subsets. The paper can be found on
Kozen's web site.


So, in this case, these expressions actually are context-free expressions
for the monoid M.


This is the next simplest example of 'regular expressions' for something
higher up the Chomsky hierarchy.


Regular expressions comprise an algebra called a Kleene Algebra. This is
an algebra with a product operation (generalising "concatenation"), an
identity (generalising "empty word"), and partial ordering (generalising
"subset inclusion") such that


                      Least upper bounds are defined
                            0 = lub {}
                            a+b = lub {a,b}
such that
                            (lub A) (lub B) = lub {ab:a in A, b in B} = lub AB


This structure defines a DIOID. In addition, there's a Kleene star
operator:
                            a* = lub {1,a,aa,aaa,aaaa,...}
such that
                      *-Continuity: lub (ac, abc, abbc, abbbc, ...) = a b* c


An abstract Kleene algebra defines the operator in the abstract by the
axioms:
                            a* >= 1 + a a*
                            x >= a + bx + xc --> x >= b* a c*


where >= is the partial ordering.


Both algebra completely capture the essence of regular expressions in
that every identity between regular expressions is provable within
either algebra. A proof can be found at my web site. Other proofs
can be found at Kozen's site.


If you allow least upper bounds on more general subsets, you can
arrive at different varieties of Kleene algebras. The Kleene algebra
with the *-continuity property above can be equivalently described by
the


          Rational Continuity Property:
                Every rational subset R(K) of the algebra K has a least upper bound
                lub A lub B = lub (AB) for all rational subsets A, B in R(K)


The most general property specifies these two properties for all
subsets. The resulting structure is called a QUANTALE.


To specify an algebraic notation or calculus for context-free
expressions means to be able to solved fixed point systems. That's in
the original spirit of the people who developed BNF's.


In the algebraic spirit. A grammar is a system of inequalities. The
grammar
                          S -> NP VP
                          VP -> v NP; VP -> VP PP
                          PP -> p NP
                          NP -> n; NP -> NP PP


specifies a language consisting of a subset of {v,p,n}* which is the
least solution to the corresponding system of inequalities:


                          S >= NP VP; VP >= v NP; VP >= VP PP
                          PP >= p NP; NP >= n; NP >= NP PP


As you can verify (using Kleene algebra) the least solution is


              PP = pn (pn)*; NP = n (pn)*; VP = vn (pn)*; S = n (pn)* vn (pn)*


so the language would actually be the regular language given by the
regular expression
                                          S = n (pn)* vn (pn)*


What hung up everyone is the question:


                    What about when the solution is NOT a regular language?


In the case of Parikh's theorem, you CAN stay within the confines of
regular expressions and there is a relatively simple way to find the
solution and (in fact) a closed form for the solution! This makes use
of a formal calculus structure complete with its own definition of
differential operators, Taylor's Theorem, etc.


To be able to solve general fixed point systems means to take the
Kleene algebra in question K, and extend it so that it includes all
the first point solutions of all its systems.


This can be done for Kleene algberas with *-continuity. It's an open
issue whether a fixed-point extension exists for Kleene Algebras where
the * operator is defined abstractly -- because the axioms aren't
strong enough.


The result, for instance, of extending the regular expression algebra
R(X*) will then be C(X*) the algebra of context-free languages over
alphabet X.


Extending R(M) will yield C(M), the algebra of context-free subsets of
the monoid M.


In the special case, M = X* x Y*, you get the algebra C(X* x Y*),
which is the algebra of push-down transductions. THIS PROVIDES A
BASIS FOR A PURELY ALGEBRAIC THEORY OF/CALCULUS & NOTATION FOR PARSING
THEORY completely on par with and analogous to the theory of regular
expressions.


This extension is described as follows.


                  Take K, formally add in a set of operators
                        <0|, <1|, <2|, ..., <n-1|
                        |0>, |1>, |2>, ..., |n-1>


This is the same Bra-Ket notation used in Quantum Theory. They are subject
to the same identities:


                          <i| |j> = 1 if i = j; 0 otherwise
                          |0><0| + |1><1| + ... + |n-1><n-1| = 1


Subject these to the relations


                              <i| k = k <i|; |i> k = k |i>


and impose the axioms for a Quantale. The resulting structure gives you
the algebra and notation for context-free sets.


Call the resulting algebra Cn(K), if the original algebra is K. It's
enough to just work with n = 2. Then:


              FUNDAMENTAL THEOREM OF CONTEXT FREE EXPRESSIONS:


              * The set of all expressions e in C2(K) which commute with all
                  the bra-ket symbols comprises the closure of the algebra K.


              * Every context-free subset of K has a least upper bound in C2(K)
                  representable in this way.


This is actually directly provable for the special case K = R(X*) or
K = R(X* x Y*), using a (infinite-dimensional) MATRIX REPRESENTATION for the
notation. The general proof for general (*-continuous) Kleene algebras K is
more complex; since it involves extending K to a Quantale. In one of my
writeups I show how this can be done; but have not yet integrated the
construction into a complete proof of the results above.


Examples of grammars and their solutions are listed below:


EXAMPLE 1:
S -> aSb | SS | x
S = <0| (a<1| + x + |1>b)* |0>


EXAMPLE 2:
S -> aSb | 1
S = <0| (a<1|)* (|1>b)* |0>


EXAMPLE 3:
S -> SSa | xb | c
S = <0| (xb+c) ((xb+c)<1| + |1>a)* |0>


This is over the word algebra X* x Y*, where X = {x}, Y = {a,b,c}.


This can be intrepreted as the grammar S -> SS | x | 1, annotated by
'parse tree construction' operations


                                      a: Build S from SS
                                      b: Build S from x
                                      c: Build S from 1


A word (vw) in S consists of a word v in X*, and its corresponding parse
tree w in Y*. Using Kleene algebra identities, you can write this as
the sum (least upper bound):


S = <0| c U |0>
S = <0| [c U<1|] xb U (xb<1| U)* |0>
S = x^0 <0| cU |0>
    + x^{n+1} sum <0| [ cU<1| ] bU (b<1|U)^n |0>, summed over all n=0,1,2,...
where
      U = (c<1| + |1>a)*
      [A] is the operator defined as A + 1


So the set of translations corresponding to the word xxx would be


                          S[xxx] = <0| [cU<1|] bU b<1|U b<1|U |0>


In this grammar, xxx not only has an infinite number of parses (the
set above is infinite), but the parses, themselves, form an entire
context free language themselves!


This is a clear demonstration of how and why the general algebraic
notation is strictly more powerful than traditional parse tree
representations; which can't even begin to approach an example like
this in as efficient a manner. [The best a parse-forest representation
can do is a representation which is order n^3. The context-free expression
representation is order n].


EXAMPLE 4:
S -> NP VP a
VP -> v NP b | VP PP c
PP -> p NP d
NP -> n e | NP PP f


S = <0| ne (B + Apne<2|)* A |0>
where
                    A = |1>b + |2>dc
                    B = (p<3| + |0>v<0|<1|) ne + |3>df


In this example, the word algebra is M = {n,v,p}* x {a,b,c,d,e,f}* with
input alphabet X = {n,v,p} and output alphabet {a,b,c,d,e,f}. Unlike the
similar example posed before where the annotations consisting of the
symbols a-f were absent, this solution is NOT regular; but is a
context-free. You can simplify it to the equivalent system:


S >= NP v NP b (p NP d c)*
NP >= n e (p NP d f)*


VP >= v NP b (p NP d c)*
PP >= p NP d


this effectively reducing it to 2 variables. The least solution commutes
with all the operators so we can build it up progressively.


Given the solutions S and NP, define SF, NPF as the least solutions to:


NPF >= |0><0| v NP b (p NP d c)* SF +
              |1> b (p NP d c)* SF +
              |2> d c (p NP d c)* SF +
              |3> d f (p NP d f)* NPF
SF >= |0>


and define SS = S SF; NPS = NP NPF. The solutions SF, NPF are solutions
to the corresponding equations, so we can write:


          <1| NPS = <1| NP NPF = NP <1| NPF = NP b (p NP d c)* SF
          <2| NPS = NP <2| NPF = NP d c (p NP d c)* SF
          <3| NPS = NP <3| NPF = NP d f (p NP d f)* NPF
          <0| NPS = NP <0| NPF = <0| NP v NP b (p NP d c)* SF = <0| SS


          <0| SS = S <0| SF = S <0| |0> = S


Using Kleene algebra identities you get:


          |0><0| v NP b (p NP d c)* SF = |0><0| v <1| NPS
          |1> b (p NP d c)* SF = |1> b SF + |1> b p NP d c (p NP d c)* SF
                                                    = |1> b (SF + p <2| NPS)
          |2> d c (p NP d c)* SF = |2> d c (SF + p <2| NPS)
          |3> d f (p NP d f)* NPF = |3> d f (NPF + p <3| NPS)
          NP NPF = n e (p NP d f)* NPF
                        = n e NPF + n e p NP d f (p NP d f)* NPF
                        = n e NPF + n e p <3| NPS
so


        S = <0| SS = <0| NPS
        NPS = NP NPF = n e (NPF + p <3| NPS)
        NPF = |0><0| v <1| NPS
                + |1> b (SF + p <2| NPS)
                + |2> d c (SF + p <2| NPS)
                + |3> d f (NPF + p <3| NPS)


or substituting NPQ = NPF + p <3| NPS and SQ = SF + p <2| NPS, you get


            S = <0| NPS
            NPS = n e NPQ
            NPQ = p <3| NPS
                    + |0><0| v <1| NPS
                    + (|1> b + |2> d c) SQ
                    + |3> d f NPQ
            SQ = |0> + p <2| NPS


whose least solution is a regular expression:


            S = <0| ne (B + Apne<2|)* A |0>
            SQ = (pne<2|B*A)* |0>
            NPQ = (B + Apne<2|)* A |0>
            NPS = ne (B + Apne<2|)* A |0>
where
                    A = |1>b + |2>dc
                    B = (p<3| + |0>v<0|<1|) ne + |3>df


So the language is:
            S = <0| ne (B + Apne<2|)* A |0>


In a parsing application, the system, itself, would actually be of more
fundamental importance. This system,


            S = <0| NPS
            NPS = n e NPQ
            NPQ = p <3| NPS
                    + |0><0| v <1| NPS
                    + (|1> b + |2> d c) SQ
                    + |3> d f NPQ
            SQ = |0> + p <2| NPS


is directly interpretable as a push down transducer (= parser) in which


              <0| = Clear stack
              |0> = Test for empty stack
              <i| = Push symbol i, i = 1, 2, 3
              |i> = Test & pop for symbol i, i = 1, 2, 3
              p = Read & test for equality to p
              n = Read & test for equality to n
              v = Read & test for equality to v
              a,b,c,d,e,f = Do/write the corresponding action/symbol


All of this can be generalised to Turing machines since a Turing machine
is just a 2-stack machine. The one is working with two copies of the
bra-ket algebra; each copy commuting with the other and with the original
algebra K.


So, any computation can be represented as a 'regular expression'. The
classical algebra and theory of regular expressions generalise completely
to a general calculus for computation.


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.