From: mark@omnifest.uwm.edu (Mark Hopkins)
Newsgroups: comp.compilers
Date: 19 May 1996 18:32:21 -0400
Organization: Omnifest
Keywords: parse, LR(1)

Quantum Parsing: Transcending the LR(k) Paradigm.
Contents:
      (1) Equivalence vs. Non-Equivalence & Algebraic Transformations
      (2) Kleene Algebras, Translation Expressions and Context-Free Expressions
      (3) Transforming a SSDTG: Quantum Grammars
      (4) Examples. Issues Associated with Writing Quantum Parser Generators


(1) Equivalence vs. Non-Equivalence & Algebraic Transformations
      In what sense are equivalent grammars different, in a way that keeps
people from applying purely algebraic methods to grammar transformation and
parser construction? Take the following grammar:


                      S = A a + B b + c.
                      A = C d + e A f + g.
                      B = C d + e B f + g.
                      C = C g + h.


It generates a deterministic language, since it can be transformed into the
following LR(0) grammar:


                      S = A I + c.
                      A = e A f + g + h D.
                      D = d + g D.
                      I = a + b.


but the grammar itself is not LR(k) for any k >= 0.


      All deterministic context free languages can be given by LR(0) grammars
(provided you add in an extra symbol, $, to denote the end of input), but
the transformation is not "intuitive", as is often said.


      This is, in itself, a vague notion and a rather weak reason for not doing
a transformation (especially since the algorithms used in parser generators
are already, as a rule, non-intuitive). But there's a more general notion
underlying this which captures what is being said, and thus renders the
reasons for not doing transformations impotent, without losing anything.


      The "structure" of a grammar is, in reality, a systems of transformations
that captures the actual set of reductions that get carried out when the
grammar is applied. This can be represented as a Simple Syntax Directed
Translation Grammar, so that in reality what we're talking about in parser
applications is transforming SSDTGs not CFGs. Equivalent CFGs will generally
have NON-equivalent SSDTGs, the case above being the basis of an example.


      A (simple) syntax directed translation grammar consists of the following:


                        X: An input alphabet.
                        Y: An output alphabet.
                        N: A set of non-terminals.
                        S: The main non-terminal.
                        P: A system of rules of the form
                                                n: v -> w
                              where n is a non-terminal, v and w are in (X+N)* and (Y+N)*
                              and are constrained so that the same number of non-terminals
                              occur in v and w (in the same order).


The (simple) in SSDTG refers to the (in the same order) in the definition
above.


      Derivations are defined by the following:


                    (1) If n: v -> w is a rule in P then the following is a one-step
                            derivation:
                                            (v1 n v2 -> w1 n w2) |- (v1 v v2 -> w1 w w2).
                    (2) A derivation is a sequence of 0, 1, or more one-step
                            derivations of the form A |- B, B |- C, C |- ... |- Y, Y |- Z,
                            in which case we write A |-* Z.


and associated with each translation (v -> w) is a set:


                [v -> w] = { (x, y): (v -> w) |-* (x -> y), x is in X*, y is in Y* }


and finally, the set of translations recognized by the SSDTG is just [S -> S].


      If we use 1 to denote the empty word, and if we define:


                                    [n] = [n -> n], for non-terminals,
                                    [x] = [x -> 1], for words x in X*,
                            and [y] = [1 -> y], for words y in Y*,


then we can state the following:


                              (1) [x] [y] = [x -> y] = [y] [x], for x in X*, y in Y*
                              (2) [x1 ... xm] = [x1] ... [xm], for x1, ..., xm in X,
                                      [y1 ... yn] = [y1] ... [yn], for y1, ..., yn in Y.
                              (3) [x n v -> y n w] = [x] [y] [n] [v -> w],
                                      for x in X*, y in Y*, and n in N.
                              (4) [n] = sum { [u -> v], such that n: u -> v is in P }


where the product of sets is defined as follows:


                      U1 U2 = { (x1 x2, y1 y2): (x1, y1) is in U1, (x2, y2) in U2 }


This allows us to generate all sets [u -> v] from the [n], [x] and [y]
for n in N, x in X, and y in Y, and then to write down a system of set
equations:
                                [n] = sum of products of [z], with z from N+X+Y.
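
For concreteness, these product sets are easy to machine-check. A small
Python sketch (mine, not part of the original article; the names inp, out
and product are ad hoc):

# Translations are sets of (input word, output word) pairs.
def product(U1, U2):
    """U1 U2 = { (x1 x2, y1 y2) : (x1, y1) in U1, (x2, y2) in U2 }."""
    return {(x1 + x2, y1 + y2) for (x1, y1) in U1 for (x2, y2) in U2}

def inp(x): return {(x, "")}    # [x] = [x -> 1], x an input letter
def out(y): return {("", y)}    # [y] = [1 -> y], y an output letter

# [x][y] = [x -> y] = [y][x]: the two alphabets commute.
assert product(inp("a"), out("p")) == product(out("p"), inp("a")) == {("a", "p")}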


If we identify each symbol with its corresponding set, we can write a system
of equations in the elements of N:


                                            n = sum of products of terms from N+X+Y


whose least solution in the n's is (n = [n]), and we can write the system of
rules in P down in the form


                                              n -> u v, where n: u -> v


Thus, a Simple Syntax Directed Translation Grammar is, itself, just a Context
Free Grammar in which you have two alphabets (X and Y) which commute with
one another (xy = yx, for all words x in X*, y in Y*).


      Similarly, a Finite State Machine can be considered as a Finite State
Automaton with two mutually commuting alphabets, so that the family of
translation systems generated by the SSDTGs (which are called the
Simple Syntax Directed Translations, or SSDTs) is the algebraic closure of
the RTs (Rational Transductions: the family generated by finite state
machines) in which every equation has a solution -- just like the CFLs
are the algebraic closure of the RLs.


      (Algebraic closure, here, means the smallest extension of a family of
languages over which every finite system of equations has a least solution).


      Each Context Free Grammar, then, has its structure captured by appending
to each rule an "action symbol" which is interpreted as the action
"build up a new node in the current parse tree from the set of items on
the furthest right using this rule."


      Thus, in the examples above, we can make the following changes:


                S = A a + B b + c.     -->   S = A a p + B b q + c r.
                A = C d + e A f + g.   -->   A = C d s + e A f t + g u.
                B = C d + e B f + g.   -->   B = C d v + e B f w + g x.
                C = C g + h.           -->   C = C g y + h z.
and
                S = A I + c.           -->   S = A I p + c q.
                A = e A f + g + h D.   -->   A = e A f r + g s + h D t.
                D = d + g D.           -->   D = d u + g D v.
                I = a + b.             -->   I = a w + b x.


where X = [abcdefgh] and Y = [pqrstuvwxyz] in the first case and [pqrstuvwx] in
the second. Even though the two grammars are equivalent, the two SSDTGs are
not. But on the other hand, precisely because the SSDTGs capture the
structure of a grammar, you're free to carry out any transformation on it
without worrying about "destroying the structure of the original". Thus,
algebraic methods can be freely applied, noting especially that the two
alphabets X and Y commute.


      The second grammar will reduce as follows:


                                  S = (eAfr + gs + hDt) J + cq = eAfrJ + gsJ + hDtJ + cq
                                  J = Ip = (aw + bx) p = awp + bxp
                                  A = eAfr + gs + hDt.
                                  D = du + gDv.


or, by letting K = f r J, L = t J, M = f r, N = t, O = v we can write
down the system:


                                  S = eAK + gsJ + hDL + cq
                                  J = awp + bxp
                                  K = frJ
                                  L = tJ
                                  A = eAM + gs + hDN
                                  M = fr
                                  N = t
                                  D = du + gDO
                                  O = v


This SSDT is deterministic with respect to its input alphabet: in each
equation, no two alternatives begin with the same input symbol, and no
reduce-only alternative competes with one that shifts. The non-terminal A,
moreover, appears only in the combinations AK and AM, so A itself can be
eliminated by substituting new terms for AK and AM, and then writing down
equations for (AK) and (AM), in place of A.


                                  AK = Q = eAMK + gsK + hDNK = eAR + gfsrJ + hDW
                                  MK = R = frK
                                  NK = W = tK
and
                                  AM = T = eAMM + gsM + hDNM = eAU + gfsr + hDV
                                  MM = U = frM
                                  NM = V = tM


where gsK = gsfrJ = gf(srJ), and gsM = gsfr = gf(sr), because f commutes
with s and r.


      The fact that we could derive this SSDTG, which is deterministic, by mere
substitution also proves that the grammar is LR(0), as can be easily
verified by running it through an LR(0) parser generator (yielding 6 shift
states and 10 reduce states).


      Eliminating a shift-reduce conflict may also result in a shift-shift
conflict (i.e. two terms which begin with the same element of X), in which
case you'll need to factor out the common subterm (e.g. aU + aV would become
aW, with W being a new term set to U + V). If everything can be reduced to
deterministic form with only one round of factorings (e.g., if W required
no further factorings), then the original grammar had to have been LR(1).
Factoring corresponds to "looking ahead", while reordering elements of X
and Y corresponds to "deferring actions" (when yx is changed to xy), or
"deferring shifts" (when xy is changed to yx).


      If the first grammar is reduced, then from:


                S = A a p + B b q + c r.
                A = C d s + e A f t + g u.
                B = C d v + e B f w + g x.
                C = C g y + h z.


you'll get:


                C = h z (g y)*
                B = e B fw + (hz (gy)* dv + gx).
                A = e A ft + (hz (gy)* ds + gu).
                S = A ap + B bq + cr
                    = e (A fatp + B fbwq)
                    + hz (gy)* d (sap + vbq)
                    + g (uap + xbq)
                    + cr


thus yielding:


            S = e (A fatp + B fbwq) + hz (gy)* d (sap + vbq) + g (uap + xbq) + cr
            A = e A ft + (hz (gy)* ds + gu).
            B = e B fw + (hz (gy)* dv + gx).


where A and B were substituted for in S and all the shift-shift conflicts
were factored out. This is also deterministic, except for the appearance
of the subterm A fatp + B fbwq. When A and B are substituted for here,
and all the shift-shift conflicts are factored out, the result will be:


          e (A ffattp + B ffbwwq) + hz (gy)* df (astp + bvwq) + gf (autp + bxwq)


which yields further terms ad infinitum of the form:


                                      e (A f^m a t^m p + B f^m b w^m q).


Therefore, the original grammar cannot possibly be LR(k) for any k >= 0.


(2) Kleene Algebras, Translation Expressions and Context-Free Expressions
      In its most general form, an SSDTG is a system of equations over a certain
kind of algebra that contains the sets X and Y and for which:


                                              xy = yx, for each x in X, and y in Y.


This algebra has the identities 1 and 0, and the three basic operations:


                                    Products: a, b -> ab
                                    Sums: a, b -> a + b


which satisfy the following properties:


                              Associativity: (ab)c = a(bc), (a+b)+c = a+(b+c)
                              Identity: a1 = a = 1a, a+0 = a = 0+a
                              Commutativity: a+b = b+a
                              Idempotency: a+a = a
                              Linearity: a0d = 0, a(b+c)d = abd + acd
and
                                    Star: a -> a*


which satisfies the properties:


                              (1) 0* = 1
                              (2) The least solution to (a + bx + xc <= x) is b* a c*,


where the ordering relation is defined by: (a <= b <==> a + b = b).


      This system is known as a Kleene Algebra.
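
To make axiom (2) concrete: the sketch below (mine, in Python; all names
are ad hoc) computes the least solution of x = a + bx + xc by fixpoint
iteration over languages truncated at a fixed word length, and checks
that it agrees with b* a c* up to that length:

MAXLEN = 6

def trunc(L):
    return frozenset(w for w in L if len(w) <= MAXLEN)

def cat(L1, L2):
    return trunc(u + v for u in L1 for v in L2)

def star(L):
    s = frozenset({""})
    while True:
        s2 = trunc(s | cat(L, s))
        if s2 == s:
            return s
        s = s2

a, b, c = frozenset({"a"}), frozenset({"b"}), frozenset({"c"})

# Least solution of x = a + b x + x c, iterating up from x = 0 (empty set).
x = frozenset()
while True:
    x2 = trunc(a | cat(b, x) | cat(x, c))
    if x2 == x:
        break
    x = x2

assert x == cat(cat(star(b), a), star(c))   # axiom (2): least solution is b* a c*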


      It is possible to prove that the algebra generated from a set, X, of
symbols subject only to these axioms is none other than the algebra of
regular expressions over X, itself. Among other things, what this means
is that it's possible to prove every equation between regular expressions
using the axioms above. This comes from three basic properties which I
list below with their traditional names in parentheses:


(Regular Expression -> Finite State Automata):
      Every element of a Kleene Algebra generated by a set X can be expressed
      as the least solution to a system of equations of the form:


                      e = q0; qi = sum of terms of the forms 1 and x qj (x in X), i = 0,...,n


      called Right-Linear form.


(Finite State Automata -> Regular Expression):
      Every finite system of Right-Linear equations over a Kleene Algebra has a
      least solution, i.e., a least tuple (q0, ..., qn) satisfying


                                    Ri(q0,...,qn) <= qi, i = 0, ..., n


      with Ri(q0,...,qn) being a sum of terms of the forms 1 and x qj, x in X.


Each equation can be written in a form that is strongly suggestive of
Taylor's Theorem in Calculus:


                                      qi = qi_0 + sum (x d/dx qi: x in X)


where qi_0 is either 0 or 1, and d/dx qi is either 0 or qj, for some j.


(Minimalizing Finite State Automata):
      Every finite system of right-linear equations can be reduced to minimal
      deterministic form, whose least solution in q0 will match that of the
      original system of equations in q0.


      Deterministic means that no two terms on the right hand side start in the
      same x (i.e., no shift-shift conflicts), and minimal means there is no
      equivalence among states defined by:


                  qi == qj <==> qi_0 = qj_0, and d/dx qi == d/dx qj, for all x.


      except for actual identity (qi == qj <==> i = j).


All these results follow from purely algebraic algorithms that use only the
axioms listed above and nothing more. Thus, any equation that exists between
regular expressions can be proven from the axioms above.


      It's common to represent systems of equations (and finite state automata)
in matrix form. For example, the regular expression a (b a)* can be written
as the least solution in q_0 to the following system:


                                                    q_0 = a q_1
                                                    q_1 = 1 + b q_0


(proof: they're solutions since (b a)* = 1 + b a (b a)* = 1 + b q_0, and
they're least solutions since if 1 + b q_0 = 1 + b a q_1 <= q_1, then by axiom
(b a)* <= q_1, and thus a (b a)* <= a q_1 <= q_0).


      In matrix form, this system is expressed as:


                                  / q_0 \     / 0  a \ / q_0 \     / 0 \
                                  |     |  =  |      | |     |  +  |   |
                                  \ q_1 /     \ b  0 / \ q_1 /     \ 1 /


whose solution is then represented formally as:


                                            / q_0 \     / 0  a \*  / 0 \
                                            |     |  =  |      |   |   |
                                            \ q_1 /     \ b  0 /   \ 1 /


This formal representation can be made logically sound, if we DEFINE the
n x n matrix algebra of a Kleene algebra K as follows:


                          (A B)_ij = sum A_ik B_kj, for k = 1, 2, ..., n
                      (A + B)_ij = A_ij + B_ij
                              (e)_ij = e if i = j, 0 else, for any expression e in K.


The star operation is then defined by:


                        (A*)_ij is the least solution in q_i to the system of equations


                                        q_0 = d0 + A_00 q_0 + A_01 q_1 + A_02 q_2 + ...
                                        q_1 = d1 + A_10 q_0 + A_11 q_1 + A_12 q_2 + ...
                                        q_2 = d2 + A_20 q_0 + A_21 q_1 + A_22 q_2 + ...
                                        ...............................................


                        formed by setting d_j = 1, and all other d's = 0.


It's not too difficult to prove that the resulting algebra does, indeed,
satisfy all the axioms of a Kleene Algebra. We will call this the n x n
matrix algebra of K, denoting it by M_n(K).
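
As an illustration, the star of a 2 x 2 matrix can be written out
symbolically, by solving the linear system above by elimination. The
Python sketch below is my own; it uses the resulting closed form
(A*)_00 = (a + b d* c)*, etc., and recovers the a (b a)* example:

def add(p, q):
    if p == "0": return q
    if q == "0": return p
    return "(" + p + " + " + q + ")"

def mul(p, q):
    if p == "0" or q == "0": return "0"
    if p == "1": return q
    if q == "1": return p
    return p + " " + q

def st(p):
    return "1" if p == "0" else "(" + p + ")*"

def mstar(A):
    """Star of [[a, b], [c, d]] via f = (a + b d* c)*."""
    (a, b), (c, d) = A
    f = st(add(a, mul(mul(b, st(d)), c)))
    return [[f, mul(mul(f, b), st(d))],
            [mul(mul(st(d), c), f),
             add(st(d), mul(mul(mul(st(d), c), f), mul(b, st(d))))]]

Astar = mstar([["0", "a"], ["b", "0"]])
# q = A* (0, 1)^T, so q_0 is the (0,1) entry of A*:
print(Astar[0][1])   # prints "(a b)* a", which equals a (b a)*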


      This example shows that Kleene algebras don't just have to be regular
expressions. In fact, they may be defined with extra relations on them.
In particular, the algebra we used for SSDTG's was built from two sets X and Y
such that
                        (1) x y = y x, for all x in X, y in Y
                        (2) All finite systems of equations have (least) solutions.


These two extra properties, in fact, precisely define the algebra which
comprises the entire family of SSDT's with input alphabet X and output
alphabet Y. If you remove clause (2), then you get a definition for the
family of Rational Transductions, which is what Finite State Machines
process. The corresponding elements of the algebras, each representing
a translation process, may then be called Translation Expressions.


      The only missing item here is coming up with ways to actually write down
closed-form solutions to systems of equations. This requires coming up
with a notation for Translation Expressions, which as you'll see, is actually
very easy to do.


      Another Kleene Algebra which is not free, which will prove highly pertinent
here, is the algebra C_2 which is generated from { b, d, p, q } and defined
to satisfy the relations:


                                            bd = pq = 1, bq = pd = 0, db + qp = 1


More generally, one can define C_n from the set { <0|, ..., <n-1|, |0>, ..., |n-1> }
by the relations:
                                              <i| |j> = 1 if i = j, 0 else
                                        |0><0| + |1><1| + ... + |n-1><n-1| = 1


These algebras have the remarkable properties that:


(1) C_n is contained within C_2 by the correspondence:


                                  <i| <--> u_i, |i> <--> v_i


        where the set { v_0, v_1, ..., v_{n-1} } forms a complete prefix
        binary code in { d, q }, and u_i is the "adjoint" of v_i (e.g., the
        adjoint of qqddq is pbbpp: the word reversed, with d replaced by b
        and q replaced by p). Example, with:
                              v_0 = dd, v_1 = dq, v_2 = q,
                              u_0 = bb, u_1 = pb, u_2 = p
        you have:
          u_0 v_0 = bbdd = bd = 1   u_1 v_0 = pbdd = pd = 0   u_2 v_0 = pdd = 0d = 0
          u_0 v_1 = bbdq = bq = 0   u_1 v_1 = pbdq = pq = 1   u_2 v_1 = pdq = 0q = 0
          u_0 v_2 = bbq  = b0 = 0   u_1 v_2 = pbq  = p0 = 0   u_2 v_2 = pq  = 1


          v_0 u_0 + v_1 u_1 + v_2 u_2 = ddbb + dqpb + qp
                                      = d(db+qp)b + qp = db + qp = 1


(2) C_2 is isomorphic to its own 2x2 matrix algebra by the correspondence:


                              / A  B \
                              |      |  <-->  d A b + d B p + q C b + q D p
                              \ C  D /


        which preserves all the operations (the star operation has to be defined
        a certain way for matrices to make this correspondence work), and is
        invertible to the form:


                              / b E d   b E q \
                              |               |  <-->  E
                              \ p E d   p E q /


        which proves the two algebras are the same: C_2 = M_2(C_2).


(3) More generally, C_n = M_n(C_n).
(4) C_2 contains all of its finite dimensional matrix algebras.


      Given two Kleene Algebras, K and L, one can define their "tensor" product
K x L as the Kleene Algebra which contains K and L as subalgebras and for
which the relations
                                                  kl = lk, for k in K, l in L


hold. Therefore, if you let R(X) denote the free Kleene Algebra generated
from X, you have the following identities:


                                        R(X) = the family of regular languages over X.
      R(X, Y) = R(X) x R(Y) = the family of rational transductions from X to Y.


The ones that assume particular importance here are those formed as tensor
products with C_2 (or generally C_n) for which we will show the following:


            C(X) is the subalgebra of C_2 x R(X) comprised of all expressions e
            such that eb = be, ed = de, ep = pe, eq = qe.


            C(X, Y) is the subalgebra of C_2 x R(X, Y) defined similarly.


where C(X) is defined as the algebra comprised of the Context-Free Languages
over X, and C(X, Y) as that of the Simple Syntax Directed Transductions from
X to Y.


      Expressions corresponding to C_2 x R(X) (and C_2 x R(X, Y)) will be called
Context Free (Translation) Expressions, and will be deemed "operator free"
when they commute with all the members of C_2. Therefore, every context
free language is represented by an operator-free context-free expression,
and similarly for simple syntax directed translations.


      This notation bears a direct link to parsing applications (the context
in which I originally came up with it), via the "standard interpretation":


                        <i| = push symbol i.
                        |i> = pop and test for symbol i.
                            x = shift input symbol x (i.e., read a symbol from the input
                                    and test its equality to x).
                            y = write output symbol y (or: perform action y).
                        A B = do A, then do B.
                    A + B = do either A or B.
                          A* = do A either 0, 1, 2, or more times.
                            0 = fail (or: error).
                            1 = do nothing (or: empty word).


A parsing application can then be thought of as starting with an infinitely
deep stack of 0's and a string to read from the input. The input is
recognized if there is at least one pathway that leads to it being
successfully processed, with the stack ending up in the same form as it
started out. One way to enforce this setup is to require a context-free
expression to be writeable in the form:


                                                            <0| e |0>


where the only occurrences of <0| or |0> in the expression e are in the
combination |0><0|. This will guarantee that the resulting expression
will be operator-free.
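
The standard interpretation is directly executable. Below is a small
nondeterministic interpreter, a sketch of my own rather than anything
from the original: a configuration is (input position, stack, output),
each primitive maps a set of configurations to a new set, and an input
is recognized when some pathway consumes it entirely and restores the
stack. (The star combinator terminates only when every pass shifts some
input, which is true of the Dyck expression used as the test.)

def shift(x):                       # x = shift input symbol x
    return lambda confs, inp: {(i + 1, st, out) for (i, st, out) in confs
                               if i < len(inp) and inp[i] == x}

def emit(y):                        # y = write output symbol y
    return lambda confs, inp: {(i, st, out + y) for (i, st, out) in confs}

def push(k):                        # <k| = push symbol k
    return lambda confs, inp: {(i, st + (k,), out) for (i, st, out) in confs}

def pop(k):                         # |k> = pop and test for symbol k
    return lambda confs, inp: {(i, st[:-1], out) for (i, st, out) in confs
                               if st and st[-1] == k}

def seq(*fs):                       # A B = do A, then do B
    def run(confs, inp):
        for f in fs:
            confs = f(confs, inp)
        return confs
    return run

def alt(*fs):                       # A + B = do either A or B
    return lambda confs, inp: set().union(*(f(confs, inp) for f in fs))

def star(f):                        # A* = do A zero or more times
    def run(confs, inp):
        seen, frontier = set(confs), confs
        while frontier:
            frontier = f(frontier, inp) - seen
            seen |= frontier
        return seen
    return run

# The Dyck parser <0| (a <1| + |1> c)* |0> (see Example 1 below):
dyck = seq(push(0),
           star(alt(seq(shift("a"), push(1)),
                    seq(pop(1), shift("c")))),
           pop(0))

def recognizes(expr, inp):
    return any(i == len(inp) and st == ()
               for (i, st, out) in expr({(0, (), "")}, inp))

assert recognizes(dyck, "aacc") and not recognizes(dyck, "aac")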


      A "standard representation" of C_2 can be provided using infinite matrices.
The symbols b, d, p and q can be represented as follows (where the 0's are
written as ":" to clarify the matrices forms):


                                  b = 1 : : : : : ... d = 1 : : : : : ...
                                          : : 1 : : : ... : : : : : : ...
                                          : : : : 1 : ... : 1 : : : : ...
                                          : : : : : : ... : : : : : : ...
                                          : : : : : : ... : : 1 : : : ...
                                          : : : : : : ... : : : : : : ...
                                          ........... ...........


                                  p = : 1 : : : : ... q = : : : : : : ...
                                          : : : 1 : : ... 1 : : : : : ...
                                          : : : : : 1 ... : : : : : : ...
                                          : : : : : : ... : 1 : : : : ...
                                          : : : : : : ... : : : : : : ...
                                          : : : : : : ... : : 1 : : : ...
                                          ........... ...........


C_2 x R(X) is then represented as the infinite matrix algebra where each
element x in R(X) corresponds to a diagonal matrix:


                                                    x = x : : : : : ...
                                                            : x : : : : ...
                                                            : : x : : : ...
                                                            : : : x : : ...
                                                            : : : : x : ...
                                                            : : : : : x ...
                                                            ...........


The components of the matrices in the representation for C_2 x R(X) are
generally language expressions of an arbitrary sort over the alphabet X
(meaning: they are sets), so that all matrix operations are definable
(i.e. (AB)_ij = union of all sets (A_ik B_kj), for k = 0, 1, 2, ...,
(A+B)_ij = A_ij union B_ij).
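
This representation is easy to machine-check. In the following Python
sketch (mine), each generator is taken as a partial map on basis
indices -- d: e_j -> e_{2j}, b: e_{2j} -> e_j, q: e_j -> e_{2j+1},
p: e_{2j+1} -> e_j, with None standing for the zero vector -- and the
defining relations of C_2 are verified on an initial segment of the basis:

d = lambda j: 2 * j                              # d: e_j -> e_{2j}
b = lambda j: j // 2 if j % 2 == 0 else None     # b: e_{2j} -> e_j
q = lambda j: 2 * j + 1                          # q: e_j -> e_{2j+1}
p = lambda j: j // 2 if j % 2 == 1 else None     # p: e_{2j+1} -> e_j

def comp(f, g):      # the product f g acts as (f g)(v) = f(g(v))
    return lambda j: None if g(j) is None else f(g(j))

N = range(1000)
assert all(comp(b, d)(j) == j for j in N)        # b d = 1
assert all(comp(p, q)(j) == j for j in N)        # p q = 1
assert all(comp(b, q)(j) is None for j in N)     # b q = 0
assert all(comp(p, d)(j) is None for j in N)     # p d = 0
# d b + q p = 1: db fixes exactly the even basis vectors, qp the odd ones.
assert all((comp(d, b)(j) == j) != (comp(q, p)(j) == j) for j in N)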


      We can get a representation for C_2 x R(X) x R(Y) by requiring that
the x matrices consist of x's on the diagonal at even positions, with 1's
at odd positions, and that the y's occupy the diagonal at odd positions
with the 1's at even positions. The b, d, p and q matrices are then
represented with each 0 in the original replaced by a 2 x 2 matrix of 0's
and each 1 by the 2 x 2 identity matrix.


      A rather significant example of a context-free expression over the
alphabet X = {a, c} is b (pa + cq)* d. Using our definition of the star
operation, we can express this ultimately as the least solution in q_0 to the
following system of equations:


                                                    q_0 = 1 + a q_1
                                                    q_1 = c q_0 + a q_2
                                                    q_2 = c q_1 + a q_3
                                                    ...


This is none other than the language S consisting of all well-formed
nested sequences of brackets in a and c (called the Dyck language in a
and c), which is also the least solution to the equation S = 1 + a S c S
and to the equation S = (a S c)*. The other q's in fact will be expressible
in terms of S as:
                                                    q_i = S (c S)^i


and it is easy to verify by direct substitution that this does indeed solve
the system of equations above, e.g.,


                          q_1 = S (c S) = (1 + a S c S) (c S)
                                                      = c S + a S c S c S
                                                      = c S (c S)^0 + a S (c S)^2
                                                      = c q_0 + a q_2.
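
The same verification can be done by brute force. A short Python check
(my own): build S as the least solution of S = 1 + a S c S up to a length
bound, then confirm the equations for q_0 and q_1 on all words within
the bound:

MAXLEN = 8

def trunc(L):
    return frozenset(w for w in L if len(w) <= MAXLEN)

def cat(*Ls):
    out = frozenset({""})
    for L in Ls:
        out = trunc(u + v for u in out for v in L)
    return out

# Least solution of S = 1 + a S c S, iterating up from the empty set.
S = frozenset()
while True:
    S2 = trunc({""} | cat({"a"}, S, {"c"}, S))
    if S2 == S:
        break
    S = S2

q = [cat(S, *([{"c"}, S] * i)) for i in range(4)]          # q_i = S (c S)^i
assert q[0] == trunc({""} | cat({"a"}, q[1]))              # q_0 = 1 + a q_1
assert q[1] == trunc(cat({"c"}, q[0]) | cat({"a"}, q[2]))  # q_1 = c q_0 + a q_2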


(3) Transforming a SSDTG: Quantum Grammars
      To transform an SSDTG into a form most suitable for generating parsers,
we will want to (1) remove all of its cyclic left-recursions and (2) factor
out the shift-shift conflicts that occur in all the subterms on the right
hand side.


      Ultimately, we can express it in a kind of 2-Greibach Normal Form as
follows:


                (a) A subset N_0 of the set N of non-terminals is defined, which
                        contains S. The set N is partitioned so that each partition
                        contains exactly one member from N_0. We will therefore write:


                                    n1 == n2 if n1 and n2 are in the same partition
                                    [n1] = n if n1 == n and n is in N_0.


                (b) Every rule is of one of the forms


                                  n -> a, n -> a r, n -> a l r


                        where the a's are elements of (X + Y)*, the l's are in N_0 and
                        n == r.


We can relax these restrictions to allow for non-cyclic left-recursions
n -> a l r where a is in the set Y*, provided that:


                if n -> a l r, and n -> a' l' r',
                then FIRST(l) and FIRST(l') have an empty intersection,


which is imposed to remove the possibility of shift-shift conflicts, where
FIRST() is defined recursively as the least solution to the following
system:


                    FIRST(n B) = union FIRST(w): n -> w is a rule, B in (X + Y + N)*
                    FIRST(y B) = FIRST(B), for y in Y, B in (X + Y + N)*
                    FIRST(x B) = { x }, for x in X, B in (X + Y + N)*
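
The FIRST sets are computed as an ordinary least fixpoint. A rough Python
sketch (mine; the encoding of rules as token lists, with "{...}" tokens
marking output symbols, is an arbitrary choice), run here on the grammar
that appears as Example 2 in section (4) below:

def first_of(rhs, F):
    for sym in rhs:
        if sym in F:                  # FIRST(n B) = union of FIRST(w), n -> w
            return set(F[sym])
        if not sym.startswith("{"):   # FIRST(x B) = { x } for x in X
            return {sym}
                                      # FIRST(y B) = FIRST(B) for y in Y
    return set()

def first_sets(rules):
    F = {n: set() for n in rules}
    changed = True
    while changed:
        changed = False
        for n, rhss in rules.items():
            for rhs in rhss:
                f = first_of(rhs, F)
                if not f <= F[n]:
                    F[n] |= f
                    changed = True
    return F

rules = {
    "E": [["x", "{1}", "A"], ["u", "E", "C"], ["r", "E", "B"]],
    "A": [[], ["p", "{3}", "A"], ["b", "E", "D"]],
    "B": [["s", "{5}", "A"]],
    "C": [["{2}", "A"]],
    "D": [["{4}", "A"]],
}
print(first_sets(rules))    # e.g. FIRST(C) = FIRST(A) = {'p', 'b'}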


      We will not detail how this transformation can be made, but two points
can be raised:


                  (a) As mentioned at the start of this presentation, there is
                          nothing that restricts our ability to perform this or any
                          algebraic transformation on grounds of "you will lose the
                          structure of the grammar", since we're working with SSDTG's
                          which preserve that structure.


                  (b) A *DETERMINISTIC* normal form (i.e. one free of shift-reduce
                          conflicts and reduce-reduce conflicts, as well as shift-shift
                          conflicts) will be possible exactly when the original grammar
                          is LR(k) for some k >= 0.


The process of trying to make the grammar deterministic amounts to
substitution, to remove shift-reduce and reduce-reduce conflicts, followed
by factoring out common initial subterms to remove shift-shift conflicts.
Each stage of factoring increases the k in LR(k) by 1, and can be carried
out arbitrarily far.


      Once a grammar is in the normal form as described above, we can then
carry out the following process:


      (a) For each combination lr that appears in a rule n -> alr, define
              the symbols <lr| and |lr>. Also define the symbols <0| and |0>.


      (b) Write down the equation, for each non-terminal n in N_0:


                          n_F = d_n |0> + sum (|nr> r [r]_F: m -> a n r is a rule in P)
              where d_S = 1, and d_n = 0 for all other non-terminals in N_0.


      (c) For each non-terminal, n, define n_S = n [n]_F.


In the algebra C_2 x R(X, Y), the non-terminal S will be seen to satisfy a
FINITE system of equations, provided we assume that all the original
non-terminals lie in the subalgebra C(X, Y) of operator-free expressions.


      First of all, we note that:


                                <i| n = n <i|, |i> n = n |i>


for all non-terminals n, by our assumption that n be in C(X, Y). Thus
we can write:


                        <0| S_S = <0| S S_F = S <0| S_F,
and
                        <0| S_F = <0| d_S |0> + <0| sum (|lr> r [r]_F)
                                        = 1 + 0
Thus
                                                S = <0| S_S


Second, we note that:


                        <lr| l_S = <lr| l l_F = l <lr| l_F
with
                        <lr| l_F = <lr| d_l |0> + <lr| sum (|l'r'> r' [r']_F)


with the only non-cancelling term being that for which l' = l, r' = r.
Thus:
                                <lr| l_F = <lr| |lr> r [r]_F = r [r]_F = r_S.


Therefore, we may write:


                  n_S = n [n]_F = sum (a [n]_F) + sum (a r [n]_F) + sum (a l r [n]_F)
                                          = sum (a [n]_F) + sum (a r_S) + sum (a <lr| l_S)


noting that r_S = r [r]_F = r [n]_F, since we're supposing in the normal
form that n == r.


      Thus the original grammar may be transformed by the following rather
simple prescription:


              (a) Set up the rules S -> <0| S_S, and S_F -> |0>.
              (b) Change each rule n -> a into n_S -> a [n]_F.
              (c) Change each rule n -> a r into n_S -> a r_S.
              (d) Change each rule n -> a l r into n_S -> a <lr| l_S
                      and add the rule l_F -> |lr> r_S.


The result is what I call a Quantum Grammar. It is deterministic exactly
when the original grammar is deterministic and therefore can also serve
as a deterministic parser under the Standard Interpretation given before.


              (e) Write down a system of equations in S, each n_F for n in N_0,
                      and each n_S for n in N, as follows:


                                          n = sum of all t: n -> t is a rule.


It is possible to solve this system in terms of S to arrive at an expression
of the form:
                        <0| e |0>, where e is an expression in C_2 x R(X, Y)
                        whose only occurrences of <0|, |0> are in the combination |0><0|.


Actually, since none of the rules use the <0|, |0> operators, except for the
first two formed, then there won't be ANY occurrences of |0><0| at all in e.
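
The prescription is mechanical enough to write down directly. The Python
sketch below is mine, not the author's generator, and its token spellings
("<Sr|", "S_S", and so on) are arbitrary; run on Example 1 of the next
section, it reproduces the quantum grammar derived there by hand:

def quantum_grammar(rules, start, cls):
    """rules: list of (n, a, l, r), with l and r possibly None:
         (n, a, None, None) for n -> a
         (n, a, None, r)    for n -> a r
         (n, a, l, r)       for n -> a l r  (l in N_0, n == r)
       cls[n] is [n], the N_0 representative of n's class.
       Returns the quantum grammar as (lhs, list-of-tokens) rules."""
    out = [(start, ["<0|", start + "_S"]),            # (a) S -> <0| S_S
           (start + "_F", ["|0>"])]                   #     S_F -> |0>
    for (n, a, l, r) in rules:
        if l is None and r is None:
            out.append((n + "_S", a + [cls[n] + "_F"]))               # (b)
        elif l is None:
            out.append((n + "_S", a + [r + "_S"]))                    # (c)
        else:
            out.append((n + "_S", a + ["<" + l + r + "|", l + "_S"])) # (d)
            out.append((l + "_F", ["|" + l + r + ">", r + "_S"]))
    return out

# Example 1: S -> 1, S -> a S r, r -> c S, with [S] = [r] = S:
rules = [("S", [], None, None), ("S", ["a"], "S", "r"), ("r", ["c"], None, "S")]
for lhs, rhs in quantum_grammar(rules, "S", {"S": "S", "r": "S"}):
    print(lhs, "->", " ".join(rhs))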


(4) Examples. Issues Associated with Writing Quantum Parser Generators
      A couple examples will be provided to show the ease of use of this method
(what makes it generally easy is that grammars that are actually written
down tend to already be close to the normal form given above).


EXAMPLE 1: The Dyck Context Free Language S = (a S c)*.
The expression (a S c)* can be written as S = 1 + a S c (a S c)* = 1 + a S c S.
Thus, we get a system of equations:


                                    S = 1 + a S r
                                    r = c S


There is only one partition here [S, r], so things are rather easy:


                                  S = <0| S_S
                              S_S = S_F + a <1| S_S
                              S_F = |1> r_S + |0>
                              r_S = c S_S


where <1| stands for the context <Sr|.


      This could be simplified further, by eliminating r_S and S_F:


                                  S = <0| S_S
                              S_S = a <1| S_S + |1> c S_S + |0>


The closed form of this system is then:


                              S = <0| (a <1| + |1> c)* |0>


which was essentially the first example shown before (identifying <0|, |0>,
<1|, |1> with b, d, p, q respectively).
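
Since this quantum grammar is deterministic, the closed form runs as an
ordinary one-pass parser. A toy Python rendering, my own: the stack holds
copies of the context symbol 1; a <1| pushes, |1> c pops, and |0> accepts
on an empty stack:

def parse_dyck(inp):
    stack = []
    for ch in inp:
        if ch == "a":
            stack.append(1)        # a <1|: shift a, push 1
        elif ch == "c" and stack:
            stack.pop()            # |1> c: pop 1, shift c
        else:
            return False           # no shift or reduce applies
    return not stack               # |0>: accept with the stack restored

assert parse_dyck("aacc") and parse_dyck("acac") and not parse_dyck("aaccc")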


EXAMPLE 2: The expression SSDT Grammar:
                      E = x {1} + u E {2} + E p {3} + E b E {4} + r E s {5}


                    with X = { x, u, p, b, r, s } and Y = { {n}: n = 1,...,5 },
                    with {n} standing for possible action sequences.


In normal form, this becomes:
                      E = x {1} A + u E C + r E B
                      A = 1 + p {3} A + b E D
                      B = s {5} A
                      C = {2} A
                      D = {4} A


Again, there is only one partition: [E,A,B,C,D]. The contexts here are
then denoted as follows:


                        <1| for <E C|, <2| for <E B| and <3| for <E D|,
                        |1> for |E C>, |2> for |E B> and |3> for |E D>.


and we can then write:


                        E = <0| E_S
                    E_S = x {1} A_S + u <1| E_S + r <2| E_S
                    A_S = E_F + p {3} A_S + b <3| E_S
                    B_S = s {5} A_S
                    C_S = {2} A_S
                    D_S = {4} A_S
                    E_F = |0> + |1> C_S + |2> B_S + |3> D_S


which, after eliminating C_S, B_S, D_S and then E_F would simplify to:


                        E = <0| E_S
                    E_S = x {1} A_S + u <1| E_S + r <2| E_S
                    A_S = p {3} A_S + b <3| E_S +
                                |0> + |1> {2} A_S + |2> s {5} A_S + |3> {4} A_S


Notice this time the presence of shift-reduce conflicts in A_S that
arise from the fact that there was an empty transition A_S -> E_F. This
is where operator precedences would be applied. For example, if (uE) binds
more tightly than (Ep), which binds more tightly than (EbE), and if (EbE)
associates to the left, then we could set up a table to resolve all the
conflicts:


      Context:      E -> E $ p     E -> E $ b E
                    p {3} A_S      b <3| E_S        Context:
                       ===            ===       |1> {2} A_S    E -> u E $      {2}
                       |||            ===       |3> {4} A_S    E -> E b E $    {4}


where === means resolve in favor of the item listed at the end of the row,
and ||| means resolve in favor of the item listed at the top of the column.


      An easy way to declare the majority of precedences among rules is simply
to state that (unless otherwise indicated) each rule listed in an actual
grammar has precedence over all the ones listed after it. Only the
associativity of a rule (such as E -> E b E) would have to be explicitly
declared, e.g.,


                                    E -> x {1}
                                    E -> r E s {5}
                                    E -> u E {2}
                                    E -> E p {3}
                                    E -> E b E {4} %left


Therefore, a parser generator for Quantum Grammars would be able to use
rule ordering to resolve shift-reduce and reduce-reduce conflicts.
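
As a sketch of how such a generator might do this (my own illustration,
not a prescription from the article): number the rules in order of
declaration and compare rule numbers, letting %left break ties in favor
of the reduce:

def resolve(shift_rule, reduce_rule, assoc=None):
    """Return 'shift' or 'reduce'; lower number = declared earlier =
       higher precedence. assoc matters when a rule conflicts with itself."""
    if shift_rule == reduce_rule:
        return "reduce" if assoc == "left" else "shift"
    return "shift" if shift_rule < reduce_rule else "reduce"

# Rules of Example 2 in declaration order:
#   1: E -> x {1}    2: E -> r E s {5}    3: E -> u E {2}
#   4: E -> E p {3}  5: E -> E b E {4} %left
assert resolve(4, 3) == "reduce"                # shift p vs reduce {2}
assert resolve(5, 3) == "reduce"                # shift b vs reduce {2}
assert resolve(4, 5) == "shift"                 # shift p vs reduce {4}
assert resolve(5, 5, assoc="left") == "reduce"  # shift b vs reduce {4}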


      The only significant factors that would have to be resolved, and which
would make different implementations different in terms of their performance,
are:
                          (a) How to resolve "errors" (i.e., those places where
                                  a parser has no shift or reduce for the pending
                                  input symbol and stack state).


                          (b) An algorithm to transform an SSDTG into normal form. Such
                                  an algorithm should be able to selectively perform up to
                                  k rounds of factorings (to accomplish a reduction of an
                                  LR(k) grammar), where k may even be dynamically determined.


                  and (c) How to declare precedences. An easy way is simply to
                                  stick a %prec N next to a given rule, where N can be
                                  a whole number. If operator precedences and associativities
                                  are declared, then a %prec N (and %left and %right) are
                                  automatically inserted wherever the operator appears in
                                  a rule (e.g. as the first operator terminal).
--

