
From: mark@omnifest.uwm.edu (Mark Hopkins)

Newsgroups: comp.compilers

Date: 19 May 1996 18:32:21 -0400

Organization: Omnifest

Keywords: parse, LR(1)

Quantum Parsing: Transcending the LR(k) Paradigm.

Contents:

(1) Equivalence vs. Non-Equivalence & Algebraic Transformations

(2) Kleene Algebras, Translation Expressions and Context-Free Expressions

(3) Transforming a SSDTG: Quantum Grammars

(4) Examples. Issues Associated with Writing Quantum Parser Generators

(1) Equivalence vs. Non-Equivalence & Algebraic Transformations

In what sense are equivalent grammars different, in a way that keeps people

from applying purely algebraic methods to grammar transformation and

parser construction? Take the following grammar:

S = A a + B b + c.

A = C d + e A f + g.

B = C d + e B f + g.

C = C g + h.

It represents a deterministic language since it can be transformed into the

LR(0) grammar:

S = A I + c.

A = e A f + g + h D.

D = d + g D.

I = a + b.

but is not, itself, LR(k) for any k >= 0.

All deterministic context free languages can be given by LR(0) grammars

(provided you add in an extra symbol, $, to denote the end of input), but

the transformation is not "intuitive", as is often said.

This is, in itself, a vague notion and rather weak reason for not doing

a transformation (especially since the algorithms used in parser generators

are already, as a rule, non-intuitive). But there's a more general notion

underlying this which captures what is being said, and thus renders the

reasons for not doing transformations impotent without losing anything.

The "structure" of a grammar is, in reality, a system of transformations

that captures the actual set of reductions that get carried out when the

grammar is applied. This can be represented as a Simple Syntax Directed

Translation Grammar, so that in reality what we're talking about in parser

applications is transforming SSDTGs not CFGs. Equivalent CFGs will generally

have NON-equivalent SSDTGs, the case above being the basis of an example.

A (simple) syntax directed translation grammar consists of the following:

X: An input alphabet.

Y: An output alphabet.

N: A set of non-terminals.

S: The main non-terminal.

P: A system of rules of the form

n: v -> w

where n is a non-terminal, v and w are in (X+N)* and (Y+N)*

and are constrained so that the same number of non-terminals

occur in v and w (in the same order).

The (simple) in SSDTG refers to the (in the same order) in the definition

above.

Derivations are defined by the following:

(1) If n: v -> w is a rule in P then the following is a one-step

derivation:

(v1 n v2 -> w1 n w2) |- (v1 v v2 -> w1 w w2).

(2) A derivation is a sequence of 0, 1, or more one-step

derivations of the form A |- B, B |- C, C |- ... |- Y, Y |- Z,

in which case we write A |-* Z.

and associated with each translation (v -> w) is a set:

[v -> w] = { (x, y): (v -> w) |-* (x -> y), x is in X*, y is in Y* }

and finally, the set of translations recognized by the SSDTG is just [S -> S].

If we use 1 to denote the empty word, and if we define:

[n] = [n -> n], for non-terminals,

[x] = [x -> 1], for words x in X*,

and [y] = [1 -> y], for words y in Y*,

then we can state the following:

(1) [x] [y] = [x -> y] = [y] [x], for x in X*, y in Y*

(2) [x1 ... xm] = [x1] ... [xm], for x1, ..., xm in X,

[y1 ... yn] = [y1] ... [yn], for y1, ..., yn in Y.

(3) [x n v -> y n w] = [x] [y] [n] [v -> w],

for x in X*, y in Y*, and n in N.

(4) [n] = sum { [u -> v], such that n: u -> v is in P }

where the product of sets is defined as follows:

U1 U2 = { (x1 x2, y1 y2): (x1, y1) is in U1, (x2, y2) in U2 }
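The product definition above can be spot-checked in a few lines; the sketch below (my own encoding, not the article's) models a translation set as a Python set of (input, output) word pairs:

```python
# Toy model of translation sets: a translation is a pair (x, y) of an
# input word and an output word; a set of translations is a Python set.

def product(U1, U2):
    # U1 U2 = { (x1 x2, y1 y2): (x1, y1) in U1, (x2, y2) in U2 }
    return {(x1 + x2, y1 + y2) for (x1, y1) in U1 for (x2, y2) in U2}

# [x] = [x -> 1] and [y] = [1 -> y]: input and output words live in
# separate components, so [x][y] = [y][x] = [x -> y], as in rule (1).
X_a = {("a", "")}   # [a]
Y_p = {("", "p")}   # [p]

assert product(X_a, Y_p) == product(Y_p, X_a) == {("a", "p")}
```

Since the two components never mix, [x] and [y] commute, which is exactly the algebraic content of rule (1).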

This allows us to generate all sets [u -> v] from the [n], [x] and [y]

for n in N, x in X, and y in Y, and then to write down a system of set

equations:

[n] = sum of products of [z], of z's from N+X+Y.

If we identify each symbol with its corresponding set, we can write a system

of equations in the elements of N:

n = sum of products of terms from N+X+Y

whose least solution in the n's is (n = [n]), and we can write the system of

rules in P down in the form

n -> u v, where n: u -> v

Thus, a Simple Syntax Directed Translation Grammar is, itself, just a Context

Free Grammar in which you have two alphabets (X and Y) which commute with

one another (xy = yx, for all words x in X*, y in Y*).

Similarly, a Finite State Machine can be considered as a Finite State

Automaton with two mutually commuting alphabets, so that the family of

translation systems generated by the SSDTGs (which are called the

Simple Syntax Directed Translations, or SSDTs) is the algebraic closure of

the RT's (Rational Transductions: the family generated by finite state

machines) in which every equation has a solution -- just like the CFLs

are the algebraic closure of the RLs.

(Algebraic closure, here, means the smallest extension of a family of

languages over which every finite system of equations has a least solution).

Each Context Free Grammar, then, has its structure captured by appending

to each rule an "action symbol" which is interpreted as the action

"build up a new node in the current parse tree from the set of items on

the furthest right using this rule."

Thus, in the examples above, we can make the following changes:

S = A a + B b + c. --> S = A a p + B b q + c r.

A = C d + e A f + g. A = C d s + e A f t + g u.

B = C d + e B f + g. B = C d v + e B f w + g x.

C = C g + h. C = C g y + h z.

and

S = A I + c. --> S = A I p + c q.

A = e A f + g + h D. A = e A f r + g s + h D t.

D = d + g D. D = d u + g D v.

I = a + b. I = a w + b x.

where X = [abcdefgh] and Y = [pqrstuvwxyz] in the first case and [pqrstuvw] in

the second. Even though the two grammars are equivalent, the two SSDTGs are

not. But on the other hand, precisely because the SSDTGs capture the

structure of a grammar, you're free to carry out any transformation on it

without worrying about "destroying the structure of the original". Thus,

algebraic methods can be freely applied, noting especially that the two

alphabets X and Y commute.

The second grammar will reduce as follows:

S = (eAfr + gs + hDt) J + cq = eAfrJ + gsJ + hDtJ + cq

J = Ip = (aw + bx) p = awp + bxp

A = eAfr + rs + hDt.

D = du + gDv.

or, by letting K = f r J, L = t J, M = f r, N = t, O = v we can write

down the system:

S = eAK + gsJ + hDL + cq

J = awp + bxp

K = frJ

L = tJ

A = eAM + rs + hDN

M = fr

N = t

D = du + gDO

O = v

This SSDT is deterministic with respect to its input alphabet, except for

the rule listed under A, which has a shift-reduce conflict (A -> eAM

involves a shift on e, but A -> rs involves only the reductions r and s).

This is easily eliminated by substituting new terms for AK and AN, and then

writing down equations for (AK) and (AN), in place of A.

AK = Q = eAMK + rsK + hDNK = eAR + frsrJ + hDS

MK = R = frK

NK = S = tK

and

AM = T = eAMM + rsM + hDNM = eAU + frsr + hDV

MM = U = frM

NM = V = tM

where rsK = rsfrJ = f (rsrJ), and rsM = rsfr = f (rsr), because f commutes

with r and s.

The fact that we could derive this SSDTG, which is deterministic, by mere

substitution also proves that the grammar is LR(0), as can be easily

verified by running it through an LR(0) parser generator (yielding 6 shift

states and 10 reduce states).

A shift-reduce conflict may also result in a shift-shift conflict (i.e.

two terms which begin in the same element of X), in which case you'll need

to factor out the common subterm (e.g. aU + aV would become aW, with W

being a new term being set to U + V). If everything can be reduced to

deterministic form with only one round of factorings (e.g., if W required

no further factorings), then the original grammar had to have been LR(1).

Factoring corresponds to "looking ahead", while reordering elements of X

and Y corresponds to "deferring actions" (when yx is changed to xy), or

"deferring shifts" (when xy is changed to yx).
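One round of factoring can be carried out mechanically; here is a minimal sketch (the tuple encoding and the helper name factor_once are mine, not the article's):

```python
# One round of factoring: alternatives are tuples of symbols, and
# alternatives sharing the same first symbol are merged under a fresh
# nonterminal, as in  a U + a V  becoming  a W  with  W = U + V.

def factor_once(alternatives, fresh):
    """Group alternatives by first symbol; for each group with more
    than one member, emit a new rule  W -> sum of the tails."""
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0] if alt else None, []).append(alt)
    new_alts, new_rules = [], {}
    for head, group in groups.items():
        if head is None or len(group) == 1:
            new_alts.extend(group)
        else:
            w = next(fresh)                          # fresh nonterminal W
            new_rules[w] = [alt[1:] for alt in group]   # W = U + V
            new_alts.append((head, w))
    return new_alts, new_rules

fresh = iter("WZ")
alts, rules = factor_once([("a", "U"), ("a", "V"), ("b", "U")], fresh)
# a U + a V + b U  becomes  a W + b U  with  W = U + V
```

Each such round adds one symbol of lookahead, matching the remark that each stage of factoring increases the k in LR(k) by 1.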

If the first grammar is reduced, then from:

S = A a p + B b q + c r.

A = C d s + e A f t + g u.

B = C d v + e B f w + g x.

C = C g y + h z.

you'll get:

C = h z (g y)*

B = e B fw + (hz (gy)* dv + gx).

A = e A ft + (hz (gy)* ds + gu).

S = A ap + B bq + cr

= e (A fatp + B fbwq)

+ hz (gy)* d (sap + vbq)

+ g (uap + xbq)

+ cr

thus yielding:

S = e (A fatp + B fbwq) + hz (gy)* d (sap + vbq) + g (uap + xbq) + cr

A = e A ft + (hz (gy)* ds + gu).

B = e B fw + (hz (gy)* dv + gx).

where A and B were substituted for in S and all the shift-shift conflicts

were factored out. This is also deterministic, except for the appearance

of the subterm A fatp + B fbwq. When A and B are substituted for here,

and all the shift-shift conflicts are factored out, the result will be:

e (A ffattp + B ffbwwq) + hz (gy)* df (astp + bvwq) + gf (autp + bxwq)

which yields further terms ad infinitum of the form:

e (A f^m a t^m p + B f^m b w^m q).

Therefore, the original grammar cannot possibly be LR(k) for any k >= 0.

(2) Kleene Algebras, Translation Expressions and Context-Free Expressions

In its most general form, an SSDTG is a system of equations over a certain

kind of algebra that contains the sets X and Y and for which:

xy = yx, for each x in X, and y in Y.

This algebra has the identities 1 and 0, and the three basic operations:

Products: a, b -> ab

Sums: a, b -> a + b

which satisfies the following properties:

Associativity: (ab)c = a(bc), (a+b)+c = a+(b+c)

Identity: a1 = a = 1a, a+0 = a = 0+a

Commutativity: a+b = b+a

Idempotency: a+a = a

Linearity: a0d = 0, a(b+c)d = abd + acd

and

Star: a -> a*

which satisfies the properties:

(1) 0* = 1

(2) The least solution to (a + bx + xc <= x) is b* a c*,

where the ordering relation is defined by: (a <= b <==> a + b = b).

This system is known as a Kleene Algebra.

It is possible to prove that the algebra generated from a set, X, of

symbols subject only to these axioms is none other than the algebra of

regular expressions over X, itself. Among other things, what this means

is that it's possible to prove every equation between regular expressions

using the axioms above. This comes from three basic properties which I

list below with their traditional names in parentheses:

(Regular Expression -> Finite State Automata):

Every element of a Kleene Algebra generated by a set X can be expressed

as the least solution to a system of equations of the form:

x = q0; qi = sum of terms of the forms (1, x qj), i = 1,...,n

called Right-Linear form.

(Finite State Automata -> Regular Expression):

Every finite system of Right-Linear equations over a Kleene Algebra has a

least solution, i.e.,

Ri(q0,...,qn) <= qi, i = 0, ..., n

with Ri(q0,...,qn) being a sum of terms of the forms (1, x qj), x in X.

Each equation can be written in a form that is strongly suggestive of

Taylor's Theorem in Calculus:

qi = qi_0 + sum (x d/dx qi: x in X)

where qi_0 is either 0 or 1, and d/dx qi is either 0 or qj, for some j.

(Minimalizing Finite State Automata):

Every finite system of right-linear equations can be reduced to minimal

deterministic form, whose least solution in q0 will match that of the

original system of equations in q0.

Deterministic means that no two terms on the right hand side start in the

same x (i.e., no shift-shift conflicts), and minimal means there is no

equivalence among states defined by:

qi == qj <==> qi_0 = qj_0, and d/dx qi == d/dx qj, for all x.

except for actual identity (qi == qj <==> i = j).

All these results follow from purely algebraic algorithms that use only the

axioms listed above and nothing more. Thus, any equation that exists between

regular expressions can be proven from the axioms above.
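The Taylor-like expansion above is, concretely, the word derivative of an expression, and iterating it drives the deterministic automaton directly. A minimal sketch (the tuple encoding of expressions is mine, not the article's):

```python
# q = q_0 + sum(x d/dx q): q_0 is the "nullable" constant term, and
# d/dx is the word derivative.  Expressions are tuples built from
# 0, 1, symbols, +, concatenation (".") and star.

ZERO, ONE = ("0",), ("1",)

def nullable(e):                      # the constant term q_0: 1 or 0
    tag = e[0]
    if tag in ("1", "*"): return True
    if tag in ("0", "sym"): return False
    if tag == "+": return nullable(e[1]) or nullable(e[2])
    if tag == ".": return nullable(e[1]) and nullable(e[2])

def deriv(e, x):                      # d/dx e
    tag = e[0]
    if tag in ("0", "1"): return ZERO
    if tag == "sym": return ONE if e[1] == x else ZERO
    if tag == "+": return ("+", deriv(e[1], x), deriv(e[2], x))
    if tag == "*": return (".", deriv(e[1], x), e)
    if tag == ".":
        d = (".", deriv(e[1], x), e[2])
        return ("+", d, deriv(e[2], x)) if nullable(e[1]) else d

def accepts(e, word):                 # run the lazily-built automaton
    for x in word:
        e = deriv(e, x)
    return nullable(e)

# a (b a)* from the example below:
a, b = ("sym", "a"), ("sym", "b")
e = (".", a, ("*", (".", b, a)))
assert accepts(e, "aba") and accepts(e, "a") and not accepts(e, "ab")
```

Identifying equal (or equivalent) derivatives is exactly the minimization step described above.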

It's common to represent systems of equations (and finite state automata)

in matrix form. For example, the regular expression a (b a)* can be written

as the least solution in q_0 to the following system:

q_0 = a q_1

q_1 = 1 + b q_0

(proof: they're solutions since (b a)* = 1 + b a (b a)* = 1 + b q_0, and

they're least solutions since if 1 + b q_0 = 1 + b a q_1 <= q_1, then by axiom

(b a)* <= q_1, and thus a (b a)* <= a q_1 <= q_0).

In matrix form, this system is expressed as:

    | q_0 |   | 0 a | | q_0 |   | 0 |
    |     | = |     | |     | + |   |
    | q_1 |   | b 0 | | q_1 |   | 1 |

whose solution is then represented formally as:

    | q_0 |   | 0 a |* | 0 |
    |     | = |     |  |   |
    | q_1 |   | b 0 |  | 1 |

This formal representation can be made logically sound, if we DEFINE the

n x n matrix algebra of a Kleene algebra K as follows:

(A B)_ij = sum A_ik B_kj, for k = 1, 2, ..., n

(A + B)_ij = A_ij + B_ij

(e)_ij = e if i = j, 0 else, for any expression e in K.

The star operation is then defined by:

(A*)_ij is the least solution in q_i to the system of equations

q_0 = d0 + A_00 q_0 + A_01 q_1 + A_02 q_2 + ...

q_1 = d1 + A_10 q_0 + A_11 q_1 + A_12 q_2 + ...

q_2 = d2 + A_20 q_0 + A_21 q_1 + A_22 q_2 + ...

...............................................

formed by setting d_j = 1, and all other d's = 0.

It's not too difficult to prove that the resulting algebra does, indeed,

satisfy all the axioms of a Kleene Algebra. We will call this the n x n

matrix algebra of K, denoting it by M_n(K).
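The matrix construction can be exercised in the smallest Kleene algebra of all, the Booleans (where a* = 1 for every a); M_n of that algebra computes reachability. A sketch (code mine), with star obtained by iterating Q = I + A Q to its least fixpoint:

```python
# M_n(K) instantiated at the Boolean Kleene algebra {0, 1}: product is
# "and", sum is "or", and a* = 1 for every a.  Matrix star is then the
# reflexive-transitive closure, computed here as the least solution of
# Q = I + A Q (the article defines star via such least solutions).

def identity(n):
    return [[i == j for j in range(n)] for i in range(n)]

def mat_mul(A, B):
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def mat_add(A, B):
    return [[a or b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_star(A):
    Q = identity(len(A))
    while True:                        # iterate Q := I + A Q to fixpoint
        Q2 = mat_add(identity(len(A)), mat_mul(A, Q))
        if Q2 == Q:
            return Q
        Q = Q2

# Two states with edges 0 -> 1 and 1 -> 0 (the shape of the a/b example):
A = [[False, True], [True, False]]
S = mat_star(A)
assert S == [[True, True], [True, True]]   # every state reaches every state
```

The iteration is monotone and the lattice finite, so it reaches the least fixpoint, which is what the axioms demand of star.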

This example shows that Kleene algebras don't just have to be regular

expressions. In fact, they may be defined with extra relations on them.

In particular, the algebra we used for SSDTG's was built from two sets X and Y

such that

(1) x y = y x, for all x in X, y in Y

(2) All finite systems of equations have (least) solutions.

These two extra properties, in fact, precisely define the algebra which

comprises the entire family of SSDT's with input alphabet X and output

alphabet Y. If you remove clause (2), then you get a definition for the

family of Rational Transductions, which is what Finite State Machines

process. The corresponding elements of the algebras, each representing

a translation process, may then be called Translation Expressions.

The only missing item here is coming up with ways to actually write down

closed-form solutions to systems of equations. This requires coming up

with a notation for Translation Expressions, which as you'll see, is actually

very easy to do.

Another Kleene Algebra which is not free, which will prove highly pertinent

here, is the algebra C_2 which is generated from { b, d, p, q } and defined

to satisfy the relations:

bd = pq = 1, bq = pd = 0, db + qp = 1

More generally, one can define C_n from the set { <0|, ..., <n-1|, |0>, ..., |n-1> }

by the relations:

<i| |j> = 1 if i = j, 0 else

|0><0| + |1><1| + ... + |n-1><n-1| = 1

These algebras have the remarkable properties that:

(1) C_n is contained within C_2 by the correspondence:

<i| <--> u_i, |i> <--> v_i

where the set { v_0, v_1, ..., v_{n-1} } forms a complete prefix

binary code in { d, q }, and u_i is the "adjoint" of v_i (e.g., the

adjoint of qqddq is pbbpp: qqddq written backwards, with each d replaced

by b and each q by p).

Example, with:

v_0 = dd, v_1 = dq, v_2 = q,

u_0 = bb, u_1 = pb, u_2 = p

you have:

u_0 v_0 = bbdd = bd = 1 u_1 v_0 = pbdd = pd = 0 u_2 v_0 = pdd = 0d = 0

u_0 v_1 = bbdq = bq = 0 u_1 v_1 = pbdq = pq = 1 u_2 v_1 = pdq = 0q = 0

u_0 v_2 = bbq = b0 = 0 u_1 v_2 = pbq = p0 = 0 u_2 v_2 = pq = 1

v_0 u_0 + v_1 u_1 + v_2 u_2 = ddbb + dqpb + qp

= d(db+qp)b + qp = db + qp = 1

(2) C_2 is isomorphic to its own 2x2 matrix algebra by the correspondence:

    | A B |
    |     |  <-->  d A b + d B p + q C b + q D p
    | C D |

which preserves all the operations (the star operation has to be defined

a certain way for matrices to make this correspondence work), and is

invertible to the form:

    | b E d   b E q |
    |               |  <-->  E
    | p E d   p E q |

which proves the two algebras are the same: C_2 = M_2(C_2).

(3) More generally, C_n = M_n(C_n).

(4) C_2 contains all of its finite dimensional matrix algebras.
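A concrete model makes the relations of C_2 easy to check: read b and p as pushing 0 and 1 onto a stack, and d and q as popping and testing for 0 and 1, with 0 the nowhere-defined partial function. The encoding below is mine; note that db + qp = 1 relies on the stack being non-empty, which is the role of the infinitely deep stack of 0's in the parsing setup described later:

```python
# A stack model of C_2: b and p push 0 and 1; d and q pop and test for
# 0 and 1; None plays the role of 0 (failure).  Products are read left
# to right, as actions.

def b(s): return s + (0,)                              # push 0
def p(s): return s + (1,)                              # push 1
def d(s): return s[:-1] if s and s[-1] == 0 else None  # pop, test 0
def q(s): return s[:-1] if s and s[-1] == 1 else None  # pop, test 1

def seq(*ops):            # the product: do the ops left to right
    def run(s):
        for op in ops:
            if s is None:
                return None
            s = op(s)
        return s
    return run

for s in [(0,), (0, 1), (0, 0, 1, 1)]:     # stacks over a base of 0's
    assert seq(b, d)(s) == s and seq(p, q)(s) == s        # bd = pq = 1
    assert seq(b, q)(s) is None and seq(p, d)(s) is None  # bq = pd = 0
    # db + qp = 1: exactly one summand is defined, and it is the identity.
    r1, r2 = seq(d, b)(s), seq(q, p)(s)
    assert (r1 == s and r2 is None) or (r2 == s and r1 is None)
```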

Given two Kleene Algebras, K and L, one can define their "tensor" product

K x L as the Kleene Algebra which contains K and L as subalgebras and for

which the relations

kl = lk, for k in K, l in L

hold. Therefore, if you let R(X) denote the free Kleene Algebra generated

from X, you have the following identities:

R(X) = the family of regular languages over X.

R(X, Y) = R(X) x R(Y) = the family of rational transductions from X to Y,

The ones that assume particular importance here are those formed as tensor

products with C_2 (or generally C_n) for which we will show the following:

C(X) is the subalgebra of C_2 x R(X) comprised of all expressions e

such that eb = be, ed = de, ep = pe, eq = qe.

C(X, Y) is the subalgebra of C_2 x R(X, Y) defined similarly.

where C(X) is defined as the algebra comprised of the Context-Free Languages

over X, and C(X, Y) as that of the Simple Syntax Directed Transductions from

X to Y.

Expressions corresponding to C_2 x R(X) (and C_2 x R(X, Y)) will be called

Context Free (Translation) Expressions, and will be deemed "operator free"

when they commute with all the members of C_2. Therefore, every context

free language is represented by an operator-free context-free expression,

and similarly for simple syntax directed translations.

This notation bears a direct link to parsing applications (the context

in which I originally came up with it), via the "standard interpretation":

<i| = push symbol i.

|i> = pop and test for symbol i.

x = shift input symbol x (i.e., read a symbol from the input

and test its equality to x).

y = write output symbol y (or: perform action y).

A B = do A, then do B.

A + B = do either A or B.

A* = do A either 0, 1, 2, or more times.

0 = fail (or: error).

1 = do nothing (or: empty word).

A parsing application can then be thought of as starting with an infinitely

deep stack of 0's and a string to read from the input. The input is

recognized if there is at least one pathway that leads to it being

successfully processed with the stack ending up in the same form as it started

out. One way to enforce this setup is to require a context-free expression

to be writeable in the form:

<0| e |0>

where the only occurrences of <0| or |0> in the expression e are in the

combination |0><0|. This will guarantee that the resulting expression

will be operator-free.

A "standard representation" of C_2 can be provided using infinite matrices.

The symbols b, d, p and q can be represented as follows (where the 0's are

written as ":" to clarify the matrix forms):

b = 1 : : : : : ... d = 1 : : : : : ...

: : 1 : : : ... : : : : : : ...

: : : : 1 : ... : 1 : : : : ...

: : : : : : ... : : : : : : ...

: : : : : : ... : : 1 : : : ...

: : : : : : ... : : : : : : ...

........... ...........

p = : 1 : : : : ... q = : : : : : : ...

: : : 1 : : ... 1 : : : : : ...

: : : : : 1 ... : : : : : : ...

: : : : : : ... : 1 : : : : ...

: : : : : : ... : : : : : : ...

: : : : : : ... : : 1 : : : ...

........... ...........

C_2 x R(X) is then represented as the infinite matrix algebra where each

element x in R(X) corresponds to a diagonal matrix:

x = x : : : : : ...

: x : : : : ...

: : x : : : ...

: : : x : : ...

: : : : x : ...

: : : : : x ...

...........

The components of the matrices in the representation for C_2 x R(X) are

generally language expressions of an arbitrary sort over the alphabet X

(meaning: they are sets), so that all matrix operations are defineable

(i.e. (AB)_ij = union of all sets (A_ik B_kj), for k = 0, 1, 2, ...,

(A+B)_ij = A_ij union B_ij).

We can get a representation for C_2 x R(X) x R(Y) by requiring that

the x matrices consist of x's on the diagonal at even positions, with 1's

at odd positions, and that the y's occupy the diagonal at odd positions

with the 1's at even positions. The b, d, p and q matrices are then

represented with each 0 in the original replaced by a 2 x 2 matrix of 0's

and each 1 by the 2 x 2 identity matrix.

A rather significant example of a context-free expression over the

alphabet X = {a, c} is b (pa + cq)* d. Using our definition of the star

operation, we can express this ultimately as the least solution in q_0 to the

following system of equations:

q_0 = 1 + a q_1

q_1 = c q_0 + a q_2

q_2 = c q_1 + a q_3

...

This is none other than the language S consisting of all well-formed

nested sequences of brackets in a and c (called the Dyck language in a

and c), which is also the least solution to the equation, S = 1 + a S c S

and to the equation: S = (a S c)*. The other q's in fact will be expressible

in terms of S as:

q_i = S (c S)^i

and it is easy to verify by direct substitution that this does indeed solve

the system of equations above, e.g.,

q_1 = S (c S) = (1 + a S c S) (c S)

= c S + a S c S c S

= c S (c S)^0 + a S (c S)^2

= c q_0 + a q_2.
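Everything in this example can be executed: under the standard interpretation of b, d, p, q as pushes and pops, a small nondeterministic interpreter (my encoding, not the article's) runs b (pa + cq)* d and accepts exactly the Dyck words in a and c:

```python
# Executing b (pa + cq)* d: b/p push 0/1, d/q pop and test for 0/1,
# letters shift input.  A state is a pair (input position, stack);
# + and * explore all alternatives.

def run(expr, word, states):
    tag = expr[0]
    if tag == "push":
        return {(i, s + (expr[1],)) for (i, s) in states}
    if tag == "pop":
        return {(i, s[:-1]) for (i, s) in states if s and s[-1] == expr[1]}
    if tag == "shift":
        return {(i + 1, s) for (i, s) in states
                if i < len(word) and word[i] == expr[1]}
    if tag == "seq":
        for e in expr[1]:
            states = run(e, word, states)
        return states
    if tag == "sum":
        return set().union(*(run(e, word, states) for e in expr[1]))
    if tag == "star":                  # 0, 1, 2, or more iterations
        seen, frontier = set(states), states
        while frontier:
            frontier = run(expr[1], word, frontier) - seen
            seen |= frontier
        return seen

def accepts(expr, word):               # consumed all input, stack restored
    return (len(word), ()) in run(expr, word, {(0, ())})

# b (p a + c q)* d
dyck = ("seq", [("push", 0),
                ("star", ("sum", [("seq", [("push", 1), ("shift", "a")]),
                                  ("seq", [("shift", "c"), ("pop", 1)])])),
                ("pop", 0)])

assert accepts(dyck, "") and accepts(dyck, "ac") and accepts(dyck, "acac")
assert not accepts(dyck, "a") and not accepts(dyck, "ca")
```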

(3) Transforming a SSDTG: Quantum Grammars

To transform a SSDTG into a form most suitable for generating parsers,

we will want to (1) remove all of its cyclic left-recursions and (2) factor

out the shift-shift conflicts that occur in all the subterms on the right

hand side.

Ultimately, we can express it in a kind of 2-Greibach Normal Form as

follows:

(a) A subset N_0 of the set N of non-terminals is defined, which

contains S. The set N is partitioned so that each block of the partition

contains exactly one member from N_0. We will therefore write:

n1 == n2 if n1 and n2 are in the same partition

[n1] = n if n1 == n and n is in N_0.

(b) Every rule is of one of the forms

n -> a, n -> a r, n -> a l r

where the a's are elements of (X + Y)*, the l's are in N_0 and

n == r.

We can relax these restrictions to allow for non-cyclic left-recursions

n -> a l r where a is in the set Y*, provided that:

if n -> a l r, and n -> a' l' r',

then FIRST(l) and FIRST(l') have an empty intersection,

which is imposed to remove the possibility of shift-shift conflicts, where

FIRST() is defined recursively as the least solution to the following

system:

FIRST(n B) = union FIRST(w): n -> w is a rule, B in (X + Y + N)*

FIRST(y B) = FIRST(B), for y in Y, B in (X + Y + N)*

FIRST(x B) = { x }, for x in X, B in (X + Y + N)*
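The FIRST system above is a finite least-fixpoint computation. A sketch follows (dictionary encoding mine); it takes the article's equations literally, so a non-terminal in lead position is treated as non-erasing:

```python
# FIRST as the least solution of the system above: Y symbols are
# transparent, X symbols are opaque, and a leading non-terminal
# contributes the union of its alternatives' FIRST sets.

def first_sets(rules, X, Y):
    # rules: dict nonterminal -> list of alternatives (tuples of symbols)
    F = {n: set() for n in rules}

    def first_of(w):
        for sym in w:
            if sym in X:
                return {sym}           # FIRST(x B) = { x }
            if sym in Y:
                continue               # FIRST(y B) = FIRST(B)
            return set(F[sym])         # FIRST(n B): article's equation;
                                       # a nullable n would need FIRST(B)
        return set()

    changed = True
    while changed:
        changed = False
        for n, alts in rules.items():
            new = set().union(*(first_of(w) for w in alts))
            if new != F[n]:
                F[n], changed = new, True
    return F

# The normal form of Example 1 below:  S = 1 + a S r,  r = c S.
rules = {"S": [(), ("a", "S", "r")], "r": [("c", "S")]}
F = first_sets(rules, X={"a", "c"}, Y=set())
assert F == {"S": {"a"}, "r": {"c"}}
```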

We will not detail how this transformation can be made, but two points

can be raised:

(a) As mentioned at the start of this presentation, there is

nothing that restricts our ability to perform this or any

algebraic transformation on grounds of "you will lose the

structure of the grammar", since we're working with SSDTG's

which preserve that structure.

(b) A *DETERMINISTIC* normal form (i.e. one free of shift-reduce

conflicts and reduce-reduce conflicts, as well as shift-shift

conflicts) will be possible exactly when the original grammar

is LR(k) for some k >= 0.

The process of trying to make the grammar deterministic amounts to substituting

to remove shift-reduce and reduce-reduce conflicts, and then factoring out

common initial subterms to remove shift-shift conflicts. Each stage of

factoring increases the k in LR(k) by 1, and can be carried out arbitrarily

far.

Once a grammar is in the normal form as described above, we can then

carry out the following process:

(a) For each combination lr that appears in a rule n -> alr, define

the symbols <lr| and |lr>. Also define the symbols <0| and |0>.

(b) Write down the equation, for each non-terminal n in N_0:

n_F = d_n |0> + sum (|nr> r [r]_F: m -> a n r is a rule in P)

where d_S = 1, and d_n = 0 for all other non-terminals in N_0 (the sum

runs over every rule in which n occurs as the left-recursive non-terminal).

(c) For each non-terminal, n, define n_S = n [n]_F.

In the algebra C_2 x R(X, Y), the non-terminal S will be seen to satisfy a

FINITE system of equations, provided we assume that all the original

non-terminals lie in the subalgebra C(X, Y) of operator-free expressions.

First of all, we note that:

<i| n = n <i|, |i> n = n |i>

for all non-terminals n, by our assumption that n be in C(X, Y). Thus

we can write:

<0| S_S = <0| S S_F = S <0| S_F,

and

<0| S_F = <0| d_S |0> + <0| sum (|lr> r [r]_F)

= 1 + 0

Thus

S = <0| S_S

Second, we note that:

<lr| l_S = <lr| l l_F = l <lr| l_F

with

<lr| l_F = <lr| d_l |0> + <lr| sum (|l'r'> r' [r']_F)

with the only non-cancelling term being that for which l' = l, r' = r.

Thus:

<lr| l_F = <lr| |lr> r [r]_F = r [r]_F = r_S.

Therefore, we may write:

n_S = n n_F = sum (a n_F) + sum (a r n_F) + sum (a l r n_F)

= sum (a n_F) + sum (a r_S) + sum (a <lr| l_S)

noting that r_S = r [r]_F = r n_F, since we're supposing in the normal

form that n == r.

Thus the original grammar may be transformed by the following rather

simple prescription:

(a) Set up the rules S -> <0| S_S, and S_F -> |0>.

(b) Change each rule n -> a into n_S -> a [n]_F.

(c) Change each rule n -> a r into n_S -> a r_S.

(d) Change each rule n -> a l r into n_S -> a <lr| l_S

and add the rule l_F -> |lr> r_S.

The result is what I call a Quantum Grammar. It is deterministic exactly

when the original grammar is deterministic and therefore can also serve

as a deterministic parser under the Standard Interpretation given before.

(e) Write down a system of equations in S, each n_F for n in N_0,

and each n_S for n in N, as follows:

n = sum of all t: n -> t is a rule.

It is possible to solve this system in terms of S to arrive at an expression

of the form:

<0| e |0>, where e is an expression in C_2 x R(X, Y)

whose only occurrences of <0|, |0> are in the combination |0><0|.

Actually, since none of the rules use the <0|, |0> operators, except for the

first two formed, then there won't be ANY occurrences of |0><0| at all in e.
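The prescription (a)-(d) is mechanical enough to write down directly. In the sketch below (encoding and names mine), a normal-form rule is a tuple (n, a, l, r), with a the (X+Y)* prefix and l, r the optional left-recursive pair, and part[n] gives [n]:

```python
# Steps (a)-(d): transform a normal-form SSDTG into a quantum grammar.
# Rules come in as (n, a, l, r): n -> a, n -> a r, or n -> a l r,
# with a a tuple of X/Y symbols and l in N_0.

def quantum_grammar(rules, S, part):
    out = [(S, ("<0|", S + "_S")),            # (a)  S -> <0| S_S
           (S + "_F", ("|0>",))]              #      S_F -> |0>
    for (n, a, l, r) in rules:
        if l is None and r is None:           # (b)  n -> a
            out.append((n + "_S", a + (part[n] + "_F",)))
        elif l is None:                       # (c)  n -> a r
            out.append((n + "_S", a + (r + "_S",)))
        else:                                 # (d)  n -> a l r
            out.append((n + "_S", a + ("<" + l + r + "|", l + "_S")))
            out.append((l + "_F", ("|" + l + r + ">", r + "_S")))
    return out

# Example 1 below:  S = 1 + a S r,  r = c S,  both in the block of S.
rules = [("S", (), None, None),          # S -> 1
         ("S", ("a",), "S", "r"),        # S -> a S r
         ("r", ("c",), None, "S")]       # r -> c S
g = quantum_grammar(rules, "S", {"S": "S", "r": "S"})
```

Run on Example 1, this reproduces S_S = S_F + a <Sr| S_S, S_F = |Sr> r_S + |0>, and r_S = c S_S, with <Sr| playing the role of <1|.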

(4) Examples. Issues Associated with Writing Quantum Parser Generators

A couple of examples will be provided to show the ease of use of this method

(what makes it generally easy is that grammars that are actually written

down tend to already be close to the normal form given above).

EXAMPLE 1: The Dyck Context Free Language S = (a S c)*.

The expression (a S c)* can be written as S = 1 + a S c (a S c)* = 1 + a S c S.

Thus, we get a system of equations:

S = 1 + a S r

r = c S

There is only one block here, [S, r], so things are rather easy:

S = <0| S_S

S_S = S_F + a <1| S_S

S_F = |1> r_S + |0>

r_S = c S_S

where <1| stands for the context <Sr|.

This could be simplified further, by eliminating r_S and S_F:

S = <0| S_S

S_S = a <1| S_S + |1> c S_S + |0>

The closed form of this system is then:

S = <0| (a <1| + |1> c)* |0>

which was essentially the first example shown before.
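Read under the Standard Interpretation, the simplified system above is already a working deterministic parser; transcribing it into a driver loop (code mine) gives:

```python
# The simplified quantum grammar
#     S_S = a <1| S_S + |1> c S_S + |0>
# as a deterministic parser: on input a, shift and push 1; otherwise
# pop, where a popped 1 demands a c and a popped 0 demands end of input.

def parse_dyck(word):
    stack = [0]                      # the <0| of  S = <0| S_S
    i = 0
    while True:
        if i < len(word) and word[i] == "a":
            stack.append(1)          # a <1| S_S : shift a, push 1
            i += 1
        elif stack[-1] == 1 and i < len(word) and word[i] == "c":
            stack.pop()              # |1> c S_S : pop 1, shift c
            i += 1
        elif stack[-1] == 0 and i == len(word):
            return True              # |0>       : accept
        else:
            return False             # no alternative applies: error

assert parse_dyck("") and parse_dyck("ac") and parse_dyck("aacc")
assert parse_dyck("acac") and not parse_dyck("a") and not parse_dyck("ca")
```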

EXAMPLE 2: The expression SSDT Grammar:

E = x {1} + u E {2} + E p {3} + E b E {4} + r E s {5}

with X = { x, u, p, b, r, s } and Y = { {n}: n = 0,1,2,3,4,5},

with {n} standing for possible action sequences.

In normal form, this becomes:

E = x {1} A + u E C + r E B

A = 1 + p {3} A + b E D

B = s {5} A

C = {2} A

D = {4} A

Again, there is only one partition: [E, A, B, C, D]. The contexts here are

then denoted as follows:

<1| for <E C|, <2| for <E B| and <3| for <E D|,

|1> for |E C>, |2> for |E B> and |3> for |E D>.

and we can then write:

E = <0| E_S

E_S = x {1} A_S + u <1| E_S + r <2| E_S

A_S = E_F + p {3} A_S + b <3| E_S

B_S = s {5} A_S

C_S = {2} A_S

D_S = {4} A_S

E_F = |0> + |1> C_S + |2> B_S + |3> D_S

which, after eliminating C_S, B_S, D_S and then E_F would simplify to:

E = <0| E_S

E_S = x {1} A_S + u <1| E_S + r <2| E_S

A_S = p {3} A_S + b <3| E_S +

|0> + |1> {2} A_S + |2> s {5} A_S + |3> {4} A_S

Notice this time the presence of shift-reduce conflicts in A_S that

arise from the fact that there was an empty transition A_S -> E_F. This

is where operator precedences would be applied. For example, if (uE) binds

more tightly than (Ep), which binds more tightly than (EbE), and if (EbE)

associates to the left, then we could set up a table to resolve all the

conflicts:

                      p {3} A_S        b <3| E_S
    Context:          E -> E $ p       E -> E $ b E

    |1> {2} A_S          ===              ===        Context: E -> u E $ {2}

    |3> {4} A_S          |||              ===        Context: E -> E b E $ {4}

where === means resolve in favor of the item listed at the end of the row,

and ||| means resolve in favor of the item listed at the top of the column.
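With the resolutions fixed, the quantum grammar for Example 2 also transcribes directly into a deterministic driver. The sketch below (code mine) hard-codes them: a reduce through |1> wins over shifting p or b, shifting p wins over |3>, and |3> wins over shifting b (left associativity):

```python
# Driver for Example 2's quantum grammar with the table's resolutions
# baked into the if/elif order; returns the action sequence or None.

def parse_E(word):
    out, stack, i, state = [], [0], 0, "E"
    while True:
        t = word[i] if i < len(word) else None
        if state == "E":                             # E_S
            if t == "x": out.append("{1}"); i += 1; state = "A"
            elif t == "u": stack.append(1); i += 1   # u <1| E_S
            elif t == "r": stack.append(2); i += 1   # r <2| E_S
            else: return None
        else:                                        # A_S
            if t == "s" and stack[-1] == 2:          # |2> s {5} A_S
                stack.pop(); out.append("{5}"); i += 1
            elif stack[-1] == 1 and t in ("p", "b", "s", None):
                stack.pop(); out.append("{2}")       # |1> {2} : u E wins
            elif t == "p":                           # p {3}   : beats |3>
                out.append("{3}"); i += 1
            elif stack[-1] == 3 and t in ("b", "s", None):
                stack.pop(); out.append("{4}")       # |3> {4} : left assoc
            elif t == "b":                           # b <3| E_S
                stack.append(3); i += 1; state = "E"
            elif stack[-1] == 0 and t is None:
                return out                           # |0> : accept
            else:
                return None

assert parse_E("uxp") == ["{1}", "{2}", "{3}"]              # (u x) p
assert parse_E("xbxbx") == ["{1}", "{1}", "{4}", "{1}", "{4}"]  # left assoc
```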

An easy way to declare the majority of precedences among rules is simply

to state that (unless otherwise indicated) the rules listed in an actual

grammar have precedence over all the ones listed after it. Only the

associativity of a rule (such as E -> E b E) would have to be explicitly

declared, e.g.,

E -> x {1}

E -> r E s {5}

E -> u E {2}

E -> E p {3}

E -> E b E {4} %left

Therefore, a parser generator for Quantum Grammars would be able to use

rule ordering to resolve shift-reduce and reduce-reduce conflicts.

The only significant factors that would have to be resolved, and which

would make different implementations different in terms of their performance,

are:

(a) How to resolve "errors" (i.e., those places where

a parser has no shift or reduce for the pending

input symbol and stack state).

(b) An algorithm to transform an SSDTG into normal form. Such

an algorithm should be able to selectively perform up to

k rounds of factorings (to accomplish a reduction of an

LR(k) grammar), where k may even be dynamically determined.

and (c) How to declare precedences. An easy way is simply to

stick a %prec N next to a given rule, where N can be

a whole number. If operator precedences and associativities

are declared, then a %prec N (and %left and %right) are

automatically inserted wherever the operator appears in

a rule (e.g. as the first operator terminal).
