Re: Have I discovered something new?

"Mark" <whopkins@alpha2.csd.uwm.edu>
24 Jul 2002 02:26:47 -0400

          From comp.compilers

Related articles
Have I discovered something new? steve.horne@iris.co.uk (Stephen Horne) (2002-07-15)
Re: Have I discovered something new? torbenm@diku.dk (Torben Ægidius Mogensen) (2002-07-21)
Re: Have I discovered something new? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21)
Have I discovered something new? cfc@world.std.com (Chris F Clark) (2002-07-21)
Re: Have I discovered something new? kgw-news@stiscan.com (2002-07-21)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-24)
Re: Have I discovered something new? steve@lurking.demon.co.uk (Steve Horne) (2002-07-25)
Re: Have I discovered something new? robert.corbett@sun.com (Robert Corbett) (2002-07-31)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-31)
Re: Have I discovered something new? cfc@shell01.TheWorld.com (Chris F Clark) (2002-08-04)
Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-08-04)
From: "Mark" <whopkins@alpha2.csd.uwm.edu>
Newsgroups: comp.compilers
Date: 24 Jul 2002 02:26:47 -0400
Organization: University of Wisconsin - Milwaukee, Computing Services Division
References: 02-07-057
Keywords: parse
Posted-Date: 24 Jul 2002 02:26:46 EDT

"Stephen Horne" <steve.horne@iris.co.uk> writes:
>My main question refers to something I read in 'Parsing Techniques, A
>Practical Guide'. This book contained a statement that the possibility
>of an linear-time parsing algorithm for arbitrary context-free
>grammars was still an open question - that there was no proof or
>disproof, that every known context-free grammar can be parsed in
>linear time using specially designed methods, but that no general
>algorithm currently exists which works for all context-free grammars.


The problem is ambiguous and can actually be resolved into quite a few
INDEPENDENT problems, each with its own complexity:


(A) Recognition:
        Given a CFG G over an alphabet X, and a word w in X*,
        determine if w is in the subset L(G) of X*.


That -- commonly confused with "parsing" -- is most decidedly NOT parsing.
Known optimal recognition methods are based on Valiant (a path-search
method) and are of order O(n^log_2(7)).
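For contrast with that bound, here is a minimal sketch (Python, my own naming)
of the classic CYK table-filling recognizer, which settles problem (A) in
O(n^3) time for a grammar in Chomsky normal form; Valiant's method gets the
exponent down to log_2(7) by doing the same table-filling with fast boolean
matrix multiplication:

    def cyk_recognize(word, binary_rules, terminal_rules, start='S'):
        """Classic O(n^3) CYK recognition: decide whether word is in L(G)
        for a grammar in Chomsky normal form.  binary_rules maps
        A -> {(B, C), ...} and terminal_rules maps A -> {terminals}."""
        n = len(word)
        # table[i][j] = set of nonterminals deriving word[i..j]
        table = [[set() for _ in range(n)] for _ in range(n)]
        for i, sym in enumerate(word):
            for lhs, terms in terminal_rules.items():
                if sym in terms:
                    table[i][i].add(lhs)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span - 1
                for k in range(i, j):                    # split point
                    for lhs, rhss in binary_rules.items():
                        for (B, C) in rhss:
                            if B in table[i][k] and C in table[k + 1][j]:
                                table[i][j].add(lhs)
        return n > 0 and start in table[0][n - 1]

    # S -> S S | x  (the grammar used later in this post, with the epsilon
    # production dropped so it stays in Chomsky normal form)
    print(cyk_recognize('xxx', {'S': {('S', 'S')}}, {'S': {'x'}}))   # True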


(B) Enumeration 1:
        Given an SSDT G over an input alphabet X, output alphabet Y,
        and a word w in X*, enumerate the set G(w) of v in Y* such that
        wv is generated by G.


An SSDT (simple syntax-directed transduction) with input alphabet X and
output alphabet Y is a context-free grammar over the word algebra X* x Y*.


A complete enumeration is impossible, since G(w) is generally an
infinite set for a given w in X* -- in fact, a context-free language
over Y*, itself.


(C) Enumeration 2:
        With the same conditions as in (B), let G(w) = {v0,v1,v2,...},
        a countable subset of Y*. Given G, a word w, and a number
        n = 0,1,2,..., find vn (if G(w) is finite of size N and n >= N,
        no answer need be given).


This can be regarded as "parsing", if G is deterministic, for then G(w)
will never have more than one member in it.


(D) Enumeration 3: Choice
        Enumeration 2, with n = 0. Find at least ONE output v corresponding
        to w.


This MIGHT be equivalent, in complexity terms, to (A) Recognition.


(E) Parsing
        Given G and w, as in (B), find the set G(w), itself.


This is the form of "parsing" as it is more commonly thought of, for
instance, in comprehensive methods (any dynamic-programming based
method: Tomita, Earley, etc.) where entire tree-sets are produced
for a given word.


The complexity depends on the form of representation chosen for
sets G(w). The "list them one-by-one" representation, besides being
the least efficient form, is not an option since G(w) may be infinite.


Since G(w) can be any context-free language (over Y), whatever
representation is used for parse-sets must amount to nothing more and
nothing less than a representation for context-free languages themselves
-- i.e., a context-free analogue of the notation for regular
expressions -- "context-free expressions".


Tree-based representation is also out of the question. A tree (or
"packed" forest) is just a fancy graphic notation for a context-free grammar.
At best, it will yield representations of O(n^3) complexity for the
description of a set G(w).


A suitable notation for context-free expressions makes this problem LINEAR
-- as is already dictated by Kolmogorov complexity theory. Translation
cannot increase the complexity of a description. Just writing down "G(w)"
is, itself, a notation for the set G(w), since it uniquely identifies
the set. G(w) is its own representation ... and is linear in w and G
(it's just G + w + "(" + ")").


So, the best representation gives you a linear complexity for (E).


But this particular representation is not particularly amenable to use by
the other versions of parsing listed before.


Generally, you can factor the various versions of parsing into (E),
plus something:


(A) Recognition = Parsing + Emptiness Test (of G(w)'s).
(B),(C),(D) Enumeration = Parsing + Enumeration (of individual elements
        v in Y* from the set G(w)).


So, the notation used for sets G(w) -- and, indeed, for context-free
expressions in general -- should be suitable to use by these other
problems.


So, let me describe one such notation for context-free expressions,
which I've made reference to a large number of times here and
elsewhere.


Regular expressions, context free expressions and general language
expressions can be thought of as residing in an algebra called a
Quantale. This has the following structure:


      (A) A PRODUCT (interpreted as word concatenation) (a,b) |-> ab.
      (B) An IDENTITY (interpreted as the 'empty word' or 'null action') 1.
      (C) A PARTIAL ORDERING relation <= (which can be interpreted as
              'inclusion' or 'derivability').


(A) and (B) together form a MONOID -- an algebra with the following
rules:
(ab)c = a(bc); a1 = a = 1a.


Every word algebra X* over an alphabet X is a monoid, in fact a FREE
monoid generated by the set X. That is: the structure derived by
forming 'words' from X and subjecting them to the two rules above,
but nothing more.


A non-free monoid would, in contrast, be one with the additional
rule:
ab = ba.


This example gives you a COMMUTATIVE monoid ... and the free commutative
monoid P(X) generated by X is just X* itself, with the additional rule
ab = ba imposed. P(X) is just the family of Parikh vectors over X with
the correspondence:
                          If X = {x1,x2,...,xn}
then (a1,a2,...,an) <-> x1^a1 x2^a2 ... xn^an.
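For a concrete instance of this correspondence, here is a small Python helper
(my own naming) that computes the Parikh vector of a word -- the image of the
word in the free commutative monoid, where only the letter counts survive:

    from collections import Counter

    def parikh_vector(word, alphabet):
        """Parikh vector of a word: how many times each letter of the
        alphabet occurs, with the order of the letters forgotten."""
        counts = Counter(word)
        return tuple(counts[x] for x in alphabet)

    # two words that differ only by reordering have the same image in P(X)
    print(parikh_vector('aab', 'abc'))   # (2, 1, 0)
    print(parikh_vector('aba', 'abc'))   # (2, 1, 0)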


The other example, alluded to above, is the non-free monoid derived by
taking (X union Y)* and subjecting it to the relations
xy = yx, x in X, y in Y.
This gives you the direct product
X* x Y*


which is the word algebra underlying SSDT's and all other classes of
translation/transductions.


A 3-way or n-way generalization (i.e. context-free relation) could be
arrived at by considering more general products, like


X1* x X2* x ... x Xn*.


The partial ordering, in a Quantale Q, has the following properties:


                    (A) Completeness
Every subset A of Q has a least upper bound (sum A).


Sums correspond to "0" and the "|" operator and are interpreted as
"failure" (for 0) and "non-deterministic branching" (for |).


(B) Continuity (or Distributivity)
If A and B are subsets of Q, then
sum AB = (sum A) (sum B)


where AB is the set { ab: a in A, b in B }.


Sums of special interest are:
0 = sum {}
a+b = sum {a,b}
a* = sum {1,a,aa,aaa,aaaa,...}.
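In the quantale 2^X* of all languages over X (discussed further below), the
product is elementwise concatenation, sums are unions, and the special sums
0, a+b, a* come out as the familiar regular operations. A small Python sketch
(finite truncations only, names my own) showing these operations and checking
the continuity law on a tiny case:

    from itertools import product

    def q_sum(family):
        """Sum (least upper bound) of a family of languages: their union."""
        out = set()
        for A in family:
            out |= A
        return out

    def q_prod(A, B):
        """Quantale product in 2^X*: elementwise concatenation."""
        return {u + v for u, v in product(A, B)}

    def q_star(A, max_power=4):
        """a* = sum {1, a, aa, aaa, ...}, truncated so the demo stays finite."""
        powers, current = [{''}], {''}
        for _ in range(max_power):
            current = q_prod(current, A)
            powers.append(current)
        return q_sum(powers)

    # continuity:  sum(AB) = (sum A)(sum B)  for A = {A1, A2}, B = {B1}
    A1, A2, B1 = {'a'}, {'aa'}, {'b', 'bb'}
    print(q_sum([q_prod(A1, B1), q_prod(A2, B1)]) == q_prod(q_sum([A1, A2]), B1))  # True
    print(sorted(q_star({'x'})))   # ['', 'x', 'xx', 'xxx', 'xxxx']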


Other types of algebras, more restricted than the Quantale, can be
arrived at by limiting the scope of (A) and (B) to more specific
families of subsets, such as:


Finite subsets of Q -- dioids = idempotent semirings
Rational subsets of Q -- *-continuous Kleene algebras
Context-free subsets of Q -- (no name exists)
Turing subsets of Q -- (no name exists)
Countable subsets of Q -- "omega-continuous" Kleene algebras


A suitable algebra for representing context-free expressions need only
have context-free subsets covered by (A) and (B).


Dioids can be defined by a purely algebraic axiom set:
(ab)c = a(bc); a1 = a = 1a; a0d = 0; a(b+c)d = abd+acd
(a+b)+c = a+(b+c); a+0 = a = 0+a; a+a = a; a+b = b+a
The ordering is defined by
a >= b <==> a = a + b.


*-continuous Kleene algebras have the additional operator a* and can be
represented with the additional axiom:


sum { ad, abd, abbd, abbbd, ... } = a b* d.


From this alone, it's possible to prove that (A) and (B) hold for all
rational subsets of Q.


The family R(Q) of rational subsets of Q is, itself, a *-continuous
Kleene algebra. It's defined as the smallest family of subsets of Q
representable by regular expressions; i.e., the closure of the family
of finite subsets of Q under the 3 Kleene operations
A union B; AB = {ab:a in A, b in B}
A* = {1} union A union AA union AAA union ...


Likewise, the family C(Q) of context-free subsets is, itself, the
archetypical (context-free-closed) algebra.


The family C(X*) is just the family of context-free languages over X,
and C(X* x Y*) is the family of simple syntax directed transductions
(SSDT's) over input alphabet X and output alphabet Y.


The family 2^X* of ALL subsets of X* is the free quantale generated by X;
likewise 2^M is the free extension of the monoid M to a quantale.


A suitable algebra for C(M) for ANY monoid is obtained by taking the
quantale 2^M, and adding to it an operator algebra powerful enough to
represent a stack machine. A more direct representation of a stack
machine yields a notation more amenable to parsing methods.


The algebra I've described elsewhere is the formal Dirac "Bra-Ket"
algebra C_n, given by the 2n symbols:


<0|, <1|, ..., <n-1|, |0>, |1>, ...., |n-1>,


subject to the identities:


Orthogonality: <i| |j> = 1 if i = j; 0 otherwise
Completeness: |0><0| + |1><1| + ... + |n-1><n-1| = 1.


The special case for n = 2 can be written as:


b, d, p, q; subject to the relations
bd = 1 = pq; bq = 0 = pd; db + qp = 1


Any of the other algebras C_n can be embedded within C_2, so that
C_2 alone suffices.


A stack-based interpretation can be posed as:


|0><0| = empty stack test
<i| = push symbol #i (i = 1,...,n-1)
|i> = pop and test for symbol #i (i = 1,...,n-1).
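As a quick operational check of those identities, here is a minimal Python
sketch (my own encoding) that reduces a product of bra/ket symbols using only
the orthogonality rule; reading <i| as a push and |i> as a pop, the reduction
is exactly a run of the stack machine just described:

    def reduce_braket(word):
        """Reduce a product of C_n symbols with the rule <i||j> = 1 if i == j,
        else 0.  A symbol is ('bra', i) for <i| (push) or ('ket', i) for |i>
        (pop).  Returns None for the zero element; otherwise the reduced word,
        which always ends up with all kets to the left of all bras."""
        out = []                          # stack of not-yet-cancelled symbols
        for kind, i in word:
            if kind == 'ket' and out and out[-1][0] == 'bra':
                j = out.pop()[1]          # an adjacent pair <j| |i>
                if i != j:
                    return None           # <j||i> = 0 kills the whole product
                # i == j: <i||i> = 1, both symbols disappear
            else:
                out.append((kind, i))
        return out

    print(reduce_braket([('bra', 1), ('ket', 1)]))   # []    i.e. the identity 1
    print(reduce_braket([('bra', 1), ('ket', 0)]))   # None  i.e. 0
    print(reduce_braket([('ket', 1), ('bra', 1)]))   # already in normal form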


Since quantales have addition over unrestricted subsets, you can define
matrix algebras of quantales of any dimension (even uncountably infinite).
This gives you an algebra M_n(Q) of n x n matrices over a quantale Q.


The algebra C_2 is its own 2 x 2 matrix algebra
C_2 = M_2(C_2)
with the correspondence:

        [ A B ]
        [ C D ]   <-->   dAb + dBp + qCb + qDp


This also means that
C_2 = M_2(C_2) = M_2(M_2(C_2)) = M_4(C_2)
C_2 = M_4(C_2) = M_4(M_2(C_2)) = M_8(C_2)
etc.
so that C_2 contains ALL of its (finite-dimensional) matrix algebras.


The algebra formed by combining 2^M and C_n and subjecting it to the
relations
A q = q A; A in 2^M; q in C_n
gives you an algebra
C_n(2^M).
This can be represented as a matrix algebra over 2^M by identifying:

        A <-> A I,  for A in 2^M
        d <-> transpose of b
        q <-> transpose of p

        b <-> [ 1 0 0 0 0 0 ... ]          p <-> [ 0 1 0 0 0 0 ... ]
              [ 0 0 1 0 0 0 ... ]                [ 0 0 0 1 0 0 ... ]
              [ 0 0 0 0 1 0 ... ]                [ 0 0 0 0 0 1 ... ]
              [       ...       ]                [       ...       ]


The subalgebra obtained by taking the rational subsets R(M) and C_n
gives you an algebra C_n(M), which provides a notation for context-free
languages over M built directly on top of the algebra of regular
expressions R(M) over M.


The basic property of C_n(M) is that:


Fundamental Theorem of Context Free Expressions:
(A kind of "Schur's Lemma"):
An expression e in C_n(M) commutes with all the members
of C_n if and only if e has the form
e = A I,
                        for some context-free subset A in C(M).
This holds for any n > 1.


So, a general context free expression over a monoid M is a regular expression
combined with the extra notation for C_n subject to the commutativity rule


qw = wq, w in M, q in C_n.


A good example showing how this algebra works and showing its suitability
as a notation for context-free expressions (and parser output sets G(w))
comes by looking at the worst kind of contorted pathological grammar
you can think of:


S -> 1 | S S | x.
taken over the monoid {x}*.


In algebraic terms, a grammar is a system of inequalities whose least
solution is the expression for the language (or transduction or relation,
etc.) generated by the grammar:


S = least s for which: s >= 1 + s s + x.
In algebraic notation, the grammar is just:


S >= 1 + S S + x.


The corresponding system of inequalities would be a "context-free" system,
by analogous use of terms.
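This least-solution reading can be checked mechanically: iterate the grammar
as a monotone map starting from the empty set, as in the Python sketch below
(a finite truncation, names my own). For this grammar the least solution is
just x*:

    def least_solution(max_len=5, iterations=10):
        """Kleene iteration for the least solution of  S >= 1 + S S + x
        over the monoid {x}*, truncating words longer than max_len so each
        approximation stays finite."""
        S = set()
        for _ in range(iterations):
            step = {''} | {'x'} | {u + v for u in S for v in S}
            S = {w for w in step if len(w) <= max_len}
        return sorted(S, key=len)

    print(least_solution())   # ['', 'x', 'xx', 'xxx', 'xxxx', 'xxxxx']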


A parser actually works with an SSDT, not a grammar. It is concerned with
deriving output structures from a grammar base. The two canonical
extensions of this grammar to an SSDT are:


"Top-down": start >= Z S; S >= A + B S S + C x.
"Bottom-up": start >= S z; S >= a + S S b + x c.


The bottom up case is the one we're interested in here. The output
alphabet Y = {a,b,c,z} has the standard interpretation as tree-building
operations:
a = "form an S node from []"
b = "form an S node from [SS]"
c = "form an S node from [x]"
z = "form the root node from [S]"


The subset of X* x Y* corresponding to this SSDT commutes, in the
larger algebra C_n(X* x Y*), with all the operator symbols. So, you
can actually derive a regular expression (with operator symbols)
directly from the grammar itself.


The least solution to any system containing the inequality
x >= u + x v
is also the least solution to the system with the inequality replaced
by
x >= u v*.


So the above system can be replaced by:


                                          start >= S z
                                          S >= (a + x c) (Sb)*.


In general, all left-recursions can be eliminated from a context-free
system.


Define SF as the least solution to
SF >= |1> b (Sb)* SF + z |0>.


A similar rule allows inequalities of the form
x >= u + v x
to be replaced by inequalities of the form
x >= v* u
to yield equivalent systems.


Using the above property, SF can also be written as the least solution to
SF >= (|1>b (Sb)*)* z |0>
which is just
SF = (|1>b (Sb)*)* z |0>,
itself. Thus
SF = |1>b (Sb)* SF + z |0>.


In general, the least solution to ANY system of inequalities will
also satisfy the corresponding system of equalities. So, we can also
write
start = S z
S = (a + x c) (Sb)*
and obtain the desired SSDT as the least solution.


But now S commutes with the operators. So, define SS = S SF. Then,


<0| SS = S <0| SF = S z = start.
<1| SS = S <1| SF = S b (Sb)* SF.
So
start = <0| SS
SS = S SF = (a + xc) (Sb)* SF
and
(Sb)* SF = SF + Sb (Sb)* SF
= SF + <1| SS
= <1| SS + |1>b (Sb)* SF + z |0>.


Thus, the SSDT in question is the least solution to a right-linear
system:
start >= <0| A
A >= (a + xc) B
B >= <1| A + |1>b B + z |0>.
or equivalently,
start >= <0| (a + xc) B
B >= (<1| (a + xc) + |1>b) B + z |0>,
or
start >= <0| (a + xc) B
B >= (<1| (a + xc) + |1>b)* z |0>,
or just
start = <0| (a + xc) (<1| (a + xc) + |1>b)* z |0>.


To arrive at something more suitable for parsing, it's best to keep
with the right-linear system above and reduce until every item
on the right starts with an input word:


start >= <0| A
A >= a B + x c B
B >= <1| A + |1>b B + z|0>


The least solution to any system involving
A >= uB + a
B >= vA + wB + b
is that obtained by solving this subsystem first:
                              A >= a + u (vu + w)* (va + b)
B >= (vu + w)* (va + b)
Applying this to the above gives you the equivalent system:
start >= <0| A
A >= x c B + a U (<1| x c B + z |0>).
B >= U (<1| x c B + z |0>).
where
U = (<1|a + |1>b)*
or
start >= <0| A
A >= x [a U <1|] c B + aUz |0>
B >= x U<1| c B + Uz |0>.
where
[z] = 1 + z = "optional z".


From this, you can directly read off the output sets G(x^n) corresponding
to the words x^n, n = 0, 1, 2, ...


G(1) = <0| aUz |0> = a <0| (<1|a + |1>b)* |0> z = a D(a,b) z
where
D(p,q) is the context-free language (the Dyck language)
corresponding to properly nested sequences of p's & q's.


G(x) = <0| [aU<1|] c U z |0>
G(x^{n+1}) = <0| [aU<1|] c (U<1|c)^n U z |0>
= <0| [aU<1|] (cU<1|)^n cUz |0>, n = 1, 2, 3, ...


All of these subsets of Y* are infinite. The complexity of the
expression for G(x^n) is O(n), or O(log(n)) if exponential notation is
explicitly used.


Enumeration would amount to following all the paths of the right-linear
system in a fashion similar to Tomita.
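Here is a rough sketch (Python, my own encoding) of that path-following for
this particular right-linear system, reading <i| as "push i" and |i> as
"pop i": a breadth-first search over (state, input position, stack, output)
that emits an output word whenever z|0> can fire. Since each G(x^n) is
infinite, the enumeration is cut off at a chosen output length:

    from collections import deque

    def enumerate_G(n, max_out_len=8):
        """Enumerate (a finite slice of) G(x^n) by following all paths of
             start >= <0| A
             A     >= a B + x c B
             B     >= <1| A + |1>b B + z |0>
        with <i| = push i, |i> = pop i; the bottom marker 0 is pushed at the
        start and popped by the final z |0>."""
        results = set()
        queue = deque([('A', 0, (0,), '')])   # (state, input used, stack, output)
        while queue:
            state, pos, stack, out = queue.popleft()
            if len(out) >= max_out_len:
                continue
            if state == 'A':
                queue.append(('B', pos, stack, out + 'a'))            # A >= a B
                if pos < n:
                    queue.append(('B', pos + 1, stack, out + 'c'))    # A >= x c B
            else:
                queue.append(('A', pos, stack + (1,), out))           # B >= <1| A
                if stack[-1] == 1:
                    queue.append(('B', pos, stack[:-1], out + 'b'))   # B >= |1>b B
                if pos == n and stack == (0,):
                    results.add(out + 'z')                            # B >= z |0>
        return sorted(results, key=lambda v: (len(v), v))

    print(enumerate_G(0))   # shortest members of a D(a,b) z: ['az', 'aabz', ...]
    print(enumerate_G(1))   # shortest members of G(x):       ['cz', 'acbz', ...]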


Emptiness test amounts to reducing the pure C_n expression obtained by
replacing all X and Y symbols by 1:


G(1) -> empty[G(1)] = <0| u |0>
G(x^{n+1}) -> empty[G(x^{n+1})] = <0| [u<1|] (u<1|)^n u |0>
u = (<1| + |1>)*


There may be a way of directly implementing Valiant in purely algebraic
form as an algorithm for reducing expressions in C_n to normal form.
Normal form (like the analogue in Quantum Physics) consists of a sum
of terms where in each term all the |n>'s appear to the left of <n|'s.
The normal form of u is just


u = (<1| + |1>)* = |1>* <1|*.
so
empty[G(1)] = <0| |1>* <1|* |0>
            = <0| |0> = 1
since
<0| |1>* = <0| (1 + |1> |1>*)
         = <0| + <0| |1> |1>* = <0|
and similarly
<1|* |0> = |0>.

