18 Oct 2002 23:38:14 -0400

Related articles |
---|

LR(infinity) with EBNF and Context-Free Expressions whopkins@alpha2.csd.uwm.edu (Mark) (2002-10-18) |

From: | "Mark" <whopkins@alpha2.csd.uwm.edu> |

Newsgroups: | comp.compilers |

Date: | 18 Oct 2002 23:38:14 -0400 |

Organization: | University of Wisconsin - Milwaukee, Computing Services Division |

Keywords: | parse |

Posted-Date: | 18 Oct 2002 23:38:14 EDT |

From SLK Parsers (parsersinc@yahoo.com):

*> The classical, and proven grammar analysis algorithms in parsing theory are*

*> based on definitions that do not include a notion of regular expressions as*

*> in EBNF. Adhoc programming methods that internally translate EBNF run the*

*> risk of introducing incorrectness.*

A proper understanding of the parsing problem (i.e., in algebraic

terms), allows one to extend the common methods to the more general

case.

First, one should understand that parsers do not work with context

free grammars, but with syntax direct translations (SDT's), and

usually with simple SDT's (SSDT's). One generally only says that such

and such grammar G is being "processed" by a parser only in reference

to one of its canonical extensions to a SSDT. It's the extension

that's actually being worked with, not G.

For a grammar G, such as:

S -> T | S a T

T -> F | T m F

F -> u S v | n F | x

over alphabet X = {a,m,u,v,n,x}, the canonical bottom-up extension BU(G) would

be the SSDT:

start -> S z

S -> T b | S a T c

T -> F d | T m F e

F -> u S v f | n F g | x h

with output alphabet Y = {b,c,d,e,f,g,h,z} and a "standard bottom-up

interpretation" for Y as

z = accept

b = reduce T to S

c = reduce S a T to S

and so on.

The canonical top-down extension TD(G) would be the SSDT:

start -> z S

S -> b T | c S a T

T -> d F | e T m F

F -> f u S v | g n F | h x

with the same output alphabet, but a "standard top-down interpretation" as:

z = build root

b = build branch S -> T

c = build branch S -> S a T

and so on.

The original grammar G is a grammar over the word algebra X*, and the

extensions BU(G) and TD(G) are grammars over the word algebra X* x Y*.

However, since the concept of grammars over word algebras is (still!)

unknown, it'll be necessary to define this first.

So, with that, we'll proceed below to show how to do LR parsing in the

most general context.

This is divided into the sections:

(1) Grammars Over Word Algebras

(2) Grammars As Systems Of Inequalities

(3) LR Parsing On CFG's In Algebraic Form

(4) LR Parsing On EBNF Systems In Algebraic Form

(5) Graphical Depiction Of The Parsing Mechanism

(6) The Formal Bra-Ket Stack Algebra

(7) The LR Parser As A Context-Free Expression (simple CFG's)

(8) The Context-Free Expresson Algebra

(9) The LR Parser As A Context-Free Expression (EBNF's)

(10) The LR Parser For As A Context-Free (Translation) Expression (SSDT's)

capped off with the punchline:

(11) Direct Algebraic Manipulation Supersedes And Obsoletes LR Parsing

--------------------------------------

(1) Grammars Over Word Algebras

In general, a grammar G = (Q,S,M,H) defines a subset of a

word algebra M. At the very least, M is a MONOID, which is defined as an

algebra containing the items

1 (corresponding to "empty word")

(u,v) |-> uv (corresponding to "concatenation").

M could have more structure, such as the structure for a group, linear

algebra, ring, field, etc., and everything below proceeds virtually

unchanged (the exception: free extensions M[Q], referred to below, are

defined as free extensions *with respect to* whatever structure one is

working within).

Q is a set of variables (often called, in respect to the tree-interpretation

"non-terminals") and S the main variable. Q can be any set, finite or

infinite. There is, for instance, such a notion as an infinite regular

grammar.

For general grammars, H is a subset of M[Q] x M[Q], where M[Q] is the

free extension of M by the set Q. For context-free grammars, H is

restricted to being a subset of Q x M[Q].

For a regular grammar, one restricts H further to the subset Q x (M[Q] + M).

The closest thing you have to a general notion of context-sensitive

grammars is to restrict H to the subset M[Q] x M[Q] of words (a,b)

for which |a| <= |b|, where |.| is defined recursively by:

|m| = L(m), m in M

|q| = 1, for q in Q

|ab| = |a| + |b|

where L: M -> Real Numbers is a suitable function defined over M such

that

L(1) = 0; L(mn) = L(m) + L(n)

For X*, one normally defines L(w) = length of word w, to recover the

usual definition of context-sensitive grammar.

If the monoid M were X*, then M[Q] would be X*[Q] = (X + Q)*, where +

denotes disjoint union. So, in the "standard" definition of a grammar,

you'll see reference to (X + Q)* in the description of the set H of

productions. The fact that (X + Q)* generalizes to M[Q] is what's

still unknown. (Except it's not unknown anymore since you're reading

it here).

The set H of productions defines a transformation system over M[Q].

Corresponding to H is the relation => defined recursively by:

(1) a => a, for all a in M[Q]

(2) if a => bdc, (d,e) in H then a => bec

for all a, b, c, d, e in M[Q]

And corresponding to each a in M[Q] is the set

[a] = { m in M: a => m }, for a in M[Q].

of all words in M generated from a.

The subset of M generated by the grammar is then [S]. Thus L(G) = [S].

This relation has the following properties:

(3) If a => b, b => c, then a => c

(4) If a => b, then cad => cbd

(5) If (a,b) in H, then cad => cbd

(6) If (a,b) in H, cbd => e, then cad => e

(7) If a => b, c => d, then ac => bd

and following from this, you get:

(8) [ab] >= [a][b]

(9) [m] >= {m}, for m in M.

(10) [man] >= {m}[a]{n}, for m, n in M

For context-free grammars, (8) and (9) strengthen to

(8') [ab] = [a][b]

(9') [m] = {m}.

(10') [man] = {m}[a]{n}

In fact, the property (10') is where "context-free" part of the name, itself,

comes from. It says that the reductions a => m' are independent of the

left context m and right context n.

When M = X*, the general definition of a context-free grammar gives you the

usual definition of a context-free grammar over an alphabet X. For

M = X* x Y*, it gives you the usual definition of a simple syntax directed

translation with input alphabet X, output alphabet Y. Note that:

X*[Q] = (X + Q)*

X* x Y* = (X + Y)*,

subject to relations xy=yx, x in X, y in Y

(X* x Y*)[Q] = (X + Y + Q)*,

subject to relations xy=yx, x in X, y in Y.

The role the commutativity rule xy = yx plays in parsing applications is that

each application of the rule is, in essence, a "look-ahead" operation, when

applied in the direction yx -> xy. The commutativity rule is, in fact,

the algebraic underpinning of look-ahead.

--------------------------------------

(2) Grammars As Systems Of Inequalities

The grammar description can be recast in purely algebraic form.

Let M be a monoid. Its power set 2^M is then also a monoid with:

* Identity {1}

* Product AB = { ab: a in A, b in B },

but is also partially ordered with

* A >= B if A contains B as a subset.

It has the additional structure:

0 <==> {}

A + B <==> A union B

The + operation corresponds to the "|" above. It is the least-upper-bound

operator with respect to the partial ordering >=, and 0 is the least

element:

0 = lub {}

A + B = lub {A, B}

A grammar (of the general type) can be regarded as a system

of inequalities:

a >= b, if (a,b) is in H,

taken over 2^M. For context-free grammars, then, the solution set:

(q = [q]: q in Q)

represents the LEAST solution to the corresponding system. The

subset [S] generated by the grammar is then the least solution to the

system of inequalities in the main variable S.

The free extension 2^M[Q] is, itself, also a partially ordered algebra with

the same additional structure as 2^M:

a >= b; a + b; 0.

It's this structure that the corresponding system of inequalities is

defined over.

Example: S -> empty word; S -> S T; T -> x; T -> u S v.

X = { u, x, v }.

The corresponding system is:

S >= 1 + ST = {empty word} union ST

T >= x + uSv = {x} union {u}S{v}

and the least solution is:

S = [S] = D(u,v) ^ {x}*

T = [T] = (D(u,v) ^ {x}*)+

where

D(u,v) = The Dyck language on the complementary set u, v.

A ^ B = The interleave of languages A, B

A+ = A A*.

--------------------------------------

(3) LR Parsing On CFG's In Algebraic Form

This background provides you with a proper understanding in algebraic terms

of what parsers actually deal with. Also, with that background, I'll revert

to using "+" in place of "|" below. So the grammar in the original example

would be written as systems:

Original CFG:

S >= T + SaT; T >= F + TmF; F >= uSv + nF + x

or equivalently

S >= T; S >= SaT; T >= F; T >= TmF; F >= uSv; F >= nF; F >= x.

The LR parsing method works with the bottom-up canonical extension BU(G) of a

grammar G = (Q,S,X*,H). So, if G is as in the original example, then BU(G) is:

start >= Sz

S >= Tb + SaTc

T >= Fd + TmFe

F >= uSvf + nFg + xh

defined as a grammar over X* x Y*, with Y = {a,b,c,d,e,f,g,h,z}.

LR defines an auxillary system through the definitions:

For all a in (X* x Y*)[Q]:

[R] <a> >= a' + sum (<q>: taken over all q in Q such that da/dq != 0)

[S] a' >= a0 + sum (z <da/dz>: taken over all z in X + Q)

the least solution of which yields definitions for all the <a> and a'.

The first set [R] yields the effect of reductions; the second set [S], the

effect of shifts.

The derivative operation is defined over (X* x Y*)[Q] by:

da/dz = { v in (X* x Y*)[Q]: zv is in a }

which yields an equivalent recursive definition:

For y in X + Q: dy/dz = 1 if y = z, 0 else

For y in Y: dy/dz = 0

For a, b in (X* x Y*)[Q]:

d(a+b)/dz = da/dz + db/dz

d(ab)/dz = da/dz b + a0 db/dz

d(a*)/dz = da/dz a*

with

a0 = a intersect {1}

also having an equivalent recursive definition:

For y in X + Y + Q: a0 = 0

For a, b in (X* x Y*)[Q]:

(a+b)0 = a0 + b0

(ab)0 = a0 b0

(a*)0 = 1.

This is the algebra underlying the Item Set Construction. The set of

"states" of the LR parser is actually the set of expressions generated

from "start" by taking derivatives.

Note that,

<a+b> = <a> + <b> and (a+b)' = a' + b'.

These both follow, via inductive arguments, ultimately from the linearity

of d/dz: d(a+b)/dz = da/dz + db/dz. Also, we have:

<y> = y' = y; for all y in Y

<1> = 1' = 1; <0> = 0' = 0

Let's apply the algebra to the example above.

<start> >= Sz + <S>

<S> >= Tb + <T> + SaTc + <S>

<T> >= Fd + <F> + TmFe + <T>

<F> >= uSvf + nFg + xh

the least solution of which is:

<start> = (Sz + Tb + SaTc + Fd + TmFe + uSvf + nFg + xh)'

= S <z+aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>

<S> = (Tb + SaTc + Fd + TmFe + uSvf + nFg + xh)'

= S <aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>

<T> = (Fd + TmFe + uSvf + nFg + xh)'

= T <mFe> + F <d> + u <Svf> + n <Fg> + x <h>

<F> = (uSvf + nFg + xh)'

= u <Svf> + n <Fg> + x <h>

All that the item-set construction does is obtain the closure of <start>.

This is done algebraically below:

<start> >= S <z+aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>

<z+aTc> >= (z + aTc)' >= z + a <Tc>

<b+mFe> >= (b + mFe)' >= b + m <Fe>

<d> >= d' >= d

<Svf> >= (Svf)' + <S>

>= (Svf + Tb + SaTc + Fd + TmFe + uSvf + nFg + xh)'

>= S <vf+aTc> + T <b+mFe> + F <d> + u <Svf> + n <Fg> + x <h>

<Fg> >= (Fg)' + <F>

>= (Fg + uSvf + nFg + xh)'

>= F <g> + u <Svf> + n <Fg> + x <h>

<h> >= h' >= h

<Tc> >= (Tc)' + <T>

>= (Tc + Fd + TmFe + uSvf + nFg + xh)'

>= T <c+mFe> + F <d> + u <Svf> + n <Fg> + x <h>

<Fe> >= (Fe)' + <F>

>= (Fe + uSvf + nFg + xh)'

>= F <e> + u <Svf> + n <Fg> + x <h>

<vf+aTc> >= (vf + aTc)' >= v <f> + a <Tc>

<g> >= g' >= g

<c+mFe> >= (c+mFe)' >= c + m <Fe>

<f> >= f' >= f

<e> >= e' >= e,

thus resulting in the following set of expressions and equivalent system:

q0 = <start>; q0 >= S q5 + T q7 + F q9 + u q1 + n q4 + x qC

q1 = <Svf>; q1 >= S q6 + T q7 + F q9 + u q1 + n q4 + x qC

q2 = <Tc>; q2 >= T q8 + F q9 + u q1 + n q4 + x qC

q3 = <Fe>; q3 >= F qA + u q1 + n q4 + x qC

q4 = <Fg>; q4 >= F qB + u q1 + n q4 + x qC

q5 = <z+aTc>; q5 >= z + a q2

q6 = <vf+aTc>; q6 >= v qD + a q2

q7 = <b+mFe>; q7 >= b + m q3

q8 = <c+mFe>; q8 >= c + m q3

q9 = <d>; q9 >= d

qA = <e>; qA >= e

qB = <g>; qB >= g

qC = <h>; qC >= h

qD = <f>; qD >= f,

The rules q >= y (y in Y) are all REDUCE's, the rules q >= n q (n a

non-terminal in Q) are all GOTO's, and the rules q >= x q' (x in X) are

all SHIFT's.

--------------------------------------

(4) LR Parsing On EBNF Systems In Algebraic Form

This works just as well on ANY system of inequalities, where the right-hand

side is a regular expression. So, it can handle EBNF grammars. Take the

following example:

S >= T (aT)*

T >= F (mF)*

F >= n* (uSv + x)

The canonical bottom-up extension to a SSDT is then:

start >= Sz

S >= T (aT)* i

T >= F (mF)* j

F >= n* (uSv + x) k

with

Y = { i, j, k, z }.

One finds,

<start> >= (Sz)' + <S>

<S> >= (T (aT)* i)' + <T>

<T> >= (F (mF)* j)' + <F>

<F> >= (n* (uSv + x) k)'

The least solution for <start> is:

<start> >= (Sz + T (aT)* i + F (mF)* j + n* (uSv + x) k)'

>= S <z> + T <(aT)*i> + F <(mF)*j>

+ n <n* (uSv+x) k> + u <Svk> + x <k>.

This time the derivative operation acts non-trivially, giving you (for

instance):

d(n* (uSv+x) k)/du = d(n*)/du (uSv+x) k + d((uSv+x)k)/du

= dn/du n* (uSv+x) k + d(uSvk)/du + d(xk)/du

= 0 n* (uSv+x) k + Svk + dx/du k

= Svk

However, this and the other derivatives are found directly by expanding all

stars A* like A* = A A* + 1 and applying algebra. Thus, for instance, you

get:

n* (uSv + x) k = (uSv + x) k + n n* (uSv + x) k

= u (Svk) + x (k) + n (n* (uSv + x) k),

resulting in:

<n* (uSv + x) k> >= u <Svk> + x <k> + n <n* (uSv + x) k>.

So there's no need to actually work out derivatives at all.

The least solution for <S>, <T> and <F> are thus worked out to be:

<S> >= (T (aT)* i + F (mF)* j + n* (uSv + x) k)'

>= T <(aT)* i> + F <(mF)* j>

+ u <Svk> + x <k> + n <n* (uSv + x) k>

<T> >= (F (mF)* j + n* (uSv + x) k)'

>= F <(mF)* j> + u <Svk> + x <k> + n <n* (uSv + x) k>

<F> >= (n* (uSv + x) k)'

>= u <Svk> + x <k> + n <n* (uSv + x) k>.

The results for the closure of <start> are progressively determined to be:

<start> >= (Sz + T(aT)*i + F(mF)*j + n* (uSv + x) k)'

>= S <z> + T <(aT)*i> + F <(mF)*j>

+ u <Svk> + x <k> + n <n* (uSv+x) k>

<z> >= z' >= z

<(aT)*i> >= ((aT)*i)' >= i + a <T(aT)*i>

<(mF)*j> >= ((mF)*j)' >= j + m <F(mF)*j>

<n* (uSv+x) k> >= u <Svk> + x <k> + n <n* (uSv+x) k>

<Svk> >= (Svk + T (aT)* i + F (mF)* j + n* (uSv + x) k)'

>= S <vk> + T <(aT)*i> + F <(mF)*j> + u <Svk> + x <k> + n <n* (uSv+x) k>

<k> >= k' >= k

<T(aT)*i> >= (T(aT)*i + F(mF)*j + n* (uSv+x) k)'

>= T <(aT)*i> + F <(mF)*j> + u <Svk> + x <k> + n <n* (uSv+x) k>

<F(mF)*j> >= (F(mF)*j + n* (uSv+x) k)'

>= F <(mF)*j> + u <Svk> + x <k> + n <n* (uSv+x) k>

<vk> >= (vk)' >= v <k>

This yields the following equivalent system:

q0 = <start>; q0 >= S q6 + T q5 + F q4 + u q1 + x q9 + n q8

q1 = <Svk>; q1 >= S q7 + T q5 + F q4 + u q1 + x q9 + n q8

q2 = <T(aT)*i>; q2 >= T q5 + F q4 + u q1 + x q9 + n q8

q3 = <F(mF)*j>; q3 >= F q4 + u q1 + x q9 + n q8

q4 = <(mF)*j>; q4 >= j + m q3

q5 = <(aT)*i>; q5 >= i + a q2

q6 = <z>; q6 >= z

q7 = <vk>; q7 >= v q9

q8 = <n* (uSv+x) k>; q8 >= u q1 + x q9 + n q8

q9 = <k>; q9 >= k

--------------------------------------

(5) Graphical Depiction Of The Parsing Mechanism

By itself, none of these results provides a parsing mechanism. The

extra missing ingredient is usually kept separate from this part of the

description in the usual presentation of the LR method: that is the stacking

mechanism. This has to be explicitly described in order to best understand

how it works in the special case of ordinary grammars; but especially in

the general case of more general types of systems, including EBNF.

Take the first result, consisting of the "states" (expressions) q0,q1,...,qD.

For each element of X and Q, one can write down the set of all transitions

corresponding to that element in graphic terms (abbreviating the q's to

numbers and letters):

S: [0] -> 5 [1] -> 6

T: [0,1] -> 7 [2] -> 8

F: [0,1,2] -> 9 [3] -> A [4] -> B

u: [0,1,2,3,4] -> 1

n: [0,1,2,3,4] -> 4

x: [0,1,2,3,4] -> C

a: [5,6] -> 2

m: [7,8] -> 3

v: [6] -> D

Consider the rule S -> S a T, which eventually became S >= SaTc. Combine

the graphs for S, a, and T in a natural fashion.

/ [0] -> 5 \ ( [5,6] -> 2 ) / [0,1] -> 7 \

\ [1] -> 6 / \ [2] -> 8 /

= / [0] -> 5 ---*---> 2 \ / [0,1] -> 7 \

\ [1] -> 6 --/ / \ [2] -> 8 /

= / [0] -> 5 ---*---> 2 -> 8 \

\ [1] -> 6 --/ /

This represents all the possible progressions of "states" that are

"traversed" going through S, a and T. A REDUCE is supposed to backtrack

on this and re-traverse along S. So, take the final graph and take its

ADJOINT (the reversal of the graph), and combine it with S:

/ [0] -> 5 ---*---> 2 -> 8 \+ / [0] -> 5 \

\ [1] -> 6 --/ / \ [1] -> 6 /

where G+ denotes the adjoint of G. Then reduce it:

/ [0] -> 5 ---*---> 2 -> 8 \+ / [0] -> 5 \

\ [1] -> 6 --/ / \ [1] -> 6 /

= / 8 <- 2 <---*--- 5 <- [0] \ / [0] -> 5 \

\ \-- 6 <- [1] / \ [1] -> 6 /

= / 8 <- 2 <---*--- 5 <- [0] -> 5 \

\ \-- 6 <- [1] -> 6 /

In LR parsing, all the action symbols in the alphabet Y = {a,b,c,d,e,f,g,h,z}

are interpreted as reduce actions, with z specially interpreted as an

"accept" action. The graph above corresponds to element c.

The corresponding graphs for b, d, e, f, g and h are derived likewise. Let

[Z] be the graph corresponding to a symbol Z in X union Q. Define a in X*[Q]

recursively by:

[ab] = [a] [b]; [1] = 1.

Corresponding to the rule S -> S a T is the action c = [SaT]+ [S]. The

other actions are listed below:

b = [T]+ [S]

= (7 <- [0,1]) ([0] -> 5; [1] -> 6)

= / 7 <--*---- [0] -> 5 \

\ \--- [1] -> 6 /

d = [F]+ [T]

= (9 <- [0,1,2]; A <- [3]; B <- [4]) ([0,1] -> 7; [2] -> 8)

= / 9 <--*---- [0,1] -> 7 \

\ \--- [2] ---> 8 /

e = [TmF]+ [T]

= / A <- 3 <--*---- 7 <- [0,1] -> 7 \

\ \--- 8 <-- [2] --> 8 /

f = [uSv]+ [F]

= / /--- [0,1,2] -> 9 \

| D <- 6 <- 1 <--*------ [3] ---> A |

\ \----- [4] ---> B /

g = [nF]+ [F]

= / /--- [0,1,2] -> 9 \

| B <- 4 <--*------ [3] ---> A |

\ \----- [4] ---> B /

h = [x]+ [F]

= / /--- [0,1,2] -> 9 \

| C <--*------ [3] ---> A |

\ \----- [4] ---> B /

The accept action is supposed to be where the parser stops, so the

action corresponding to it would be:

z = [S]+ (0 <-)

= / 5 <- [0] \ (0 <- )

\ 6 <- [1] /

= (5 <- 0 <- )

Finally, the parser starts out at "state" q0:

Initial action: (-> 0)

The end result is a 2-state "automaton" given by the right-linear system:

S >= (-> 0) F

F >= x [x] F SHIFTS: for any symbol x in X

F >= [b]+ [a] F REDUCES: for any inequality a >= b in H

F >= [S]+ (0 <-) ACCEPT

--------------------------------------

(6) The Formal Bra-Ket Stack Algebra

The graphs just a convenient way to diagram the stacking actions the parser

takes. A more precise rendition of the above is obtained by embedding the

(implicit) graph algebra into a stack algebra.

We treated a combination of graphs (G1; G2; ...; Gn) as a sum

(G1; G2; ...; Gn) = G1 + G2 + ... + Gn;

and a concatenation of graphs as a product. Products distribute over

sums, so the graph

/ A <- 3 <--*---- 7 <- [0,1] \

\ \--- 8 <-- [2] /

would actually be

(A <- 3 <- ) (7 <- [0,1]; 8 <- [2])

= (A <- 3 <- ) ( (7 <- [0,1]) + (8 <- [2]) )

= (A <- 3 <- ) (7 <- [0,1]) + (A <- 3 <- ) (8 <- [2])

= (A <- 3 <- 7 <- [0,1]) + (A <- 3 <- 8 <- [2])

= (A<-) (3<-) (7<-) [0,1] + (A<-) (3<-) (8<-) [2]

The symbol [a1,a2,...,an] was just a sum:

[a1,a2,...,an] = [a1] + [a2] + ... + [an].

When combined in graphs, combination was done according to the following

rules:

[a] [b] = [a], if a = b

= 0, else

(->a) [b] = (->a), if a = b

= 0, else

[a] (b<-) = (b<-), if a = b

= 0, else

We also pose the identity

[0,1,2,...,n] = 1

where n is the number of distinct states, since this element will behave

like 1, upon multiplication with any other graph.

Completing the correspondence to stacking actions, the basic graphs

(->a) and (b<-) would be respectively the push and pop operations defined

by:

(->a) = push symbol a

(b<-) = pop and test for equality to b.

The symbol 0 receives special interpretation as:

(->0) = create empty stack

(0<-) = test for empty stack and delete stack.

Under this interpretation, the symbol [b] would be

[b] = test for equality to b on stack top (0 means empty)

= (b<-)(->b)

Thus, the algebra above can be subsumed by the more fundamental identities:

(->a)(b<-) = 1, if a = b

= 0, else

(0<-)(->0) + (1<-)(->1) + ... + (n<-)(->n) = 1

These are the same identities as are encountered in the algebra of Bra's and

Ket's used in Quantum Physics. So, we'll make the identification:

(->a) = <a|; (<-b) = |b>

[a] = |a><a| = projection on a

Adjoints are defined recursively by:

(uv)+ = (v+) (u+)

(u+v)+ = (u)+ + (v)+

<a|+ = |a>; |b>+ = <b|

The fundamental identities of the formal bra-ket algebra are then:

<i| |j> = delta_{ij}

|0><0| + |1><1| + ... + |n><n| = 1

where delta_{ij} is the Kroenecker Delta defined by:

delta_{ij} = 1, if i = j; 0 else.

--------------------------------------

(7) The LR Parser As A Context-Free Expression (simple CFG's)

So, revisiting the first LR result, we can write (using, again, the

notaton [S] to stand for S's graph, and so on):

[S] = [0]<5| + [1]<6|

[T] = [0,1]<7| + [2]<8|

[F] = [0,1,2]<9| + [3]<A| + [4]<B|

[u] = [0,1,2,3,4]<1|

[n] = [0,1,2,3,4]<4|

[x] = [0,1,2,3,4]<C|

[a] = [5,6]<2|

[m] = [7,8]<3|

[v] = [6]<D|

Thus, for instance:

[SaT]+ [S] = [T]+ [a]+ [S]+ [S]

= (|7>[0,1] + |8>[2]) (|2>[5,6])

(|5>[0] + |6>[1]) ([0]<5| + [1]<6|)

= (|8>|2>[5,6]) (|5>[0]<5| + |6>[1]<6|)

= |8>|2> (|5>[0]<5| + |6>[1]<6|).

We'll also use the extended notation defined recursively by:

<ab| = <a|<b|; |ab> = |a>|b>

<a+b| = <a|+<b|; |a+b> = |a>+|b>

[a1 ... an] = |an ... a1><a1 ... an|

[a1,...,an] = [a1] + ... + [an]

so we can write

c = [SaT]+ [S] = |8 2> [0 5, 1 6]

since

|5>[0]<5| = |5>|0><0|<5| = |5 0><0 5| = [0 5]

|6>[1]<6| = |6>|1><1|<6| = |6 1><1 6| = [1 6]

|5>[0]<5| + |6>[1]<6| = [0 5] + [1 6] = [0 5, 1 6]

The other actions are, respectively:

b = [T]+ [S] = |7> ([0]<5| + [1]<6|)

d = [F]+ [T] = |9> ([0,1]<7| + [2]<8|)

e = [TmF]+ [T] = |A 3> [2 8, (0, 1) 7]

f = [uSv]+ [F] = |D 6 1> ([0,1,2]<9| + [3]<A| + [4]<B|)

g = [nF]+ [F] = |B 4> ([0,1,2]<9| + [3]<A| + [4]<B|)

h = [x]+ [F] = |C> ([0,1,2]<9| + [3]<A| + [4]<B|)

Final Action: z = [S]+ |0> = |5 0>

Initial Action: <0|

The parsing automaton is then represented as 2-state automaton with:

S >= <0| F

F >= x [x] F SHIFTS: for any symbol x in X

F >= [b]+ [a] F REDUCES: for any inequality a >= b in H

F >= [S]+ |0> ACCEPT

which, in turn is a system that has, as its least solution the regular

expression:

S = <0| ([X] + [H])* [S]+ |0>

where

[X] = sum ([x] x: x in X)

[H] = sum ([b]+ [a]: (a,b) in H)

--------------------------------------

(8) The Context-Free Expresson Algebra

Let D be the set of all these symbols:

D = { <i|: i = 0,...,n-1 } union { |i>: i = 0,...,n-1 }

It is assumed that these symbols commute with those of X.

dx = xd; dy = yd;

for x in X, d in D.

Then the expression above will reduce to the sum (reinserting the "state"

q0 for 0):

S = <q0| ([X] + [H])* [S]+ |q0>

= sum (w in X*: w is in the language L(G) recognized by G)

In general, we'd have:

<q0| ([X] + [H])*

= sum (<q0|<qa|...<qz| v: v reduces bottom-up right-to-left

to za...zz, with q0 >= za qa; qa >= zb qb; ... qy >= zz qz)

which can still be established by the same inductive proof for general

grammars/systems that would be carried out in the "standard" LR parsing

formalism for ordinary context-free grammars. This is where the rules

[R] and [S] defined way up above would come into play.

The underlying algebra is a partially ordered monoid in which:

(A) (Rational) Completeness

Every rational subset U has a least upper bound sum U,

e.g., sum {1,a,aa,aaa,...} = sum {a}* = a*

(B) (Rational) Distributivity

If U, V are rational subsets then sum (UV) = (sum U) (sum V)

e.g., sum a{b}*c = sum {ac,abc,abbc,abbbc,...}

= sum {a}{b}*{c}

= (sum {a}) (sum {b}*) (sum {c})

= a b* c

Such an algebra is called a *-continuous Kleene Algebra and can be

equivalently defined by the items

0 = sum {}; a + b = {a,b};

a* = sum { 1, a, aa, aaa, ... }

through the axioms:

(uv)w = u(vw); u1 = u = 1u

(u+v)+w = u+(v+w); u+0 = u = 0+u; u+v = v+u; u+u = u

with general distributivity stated in the forms:

u0w = 0;

u(v+w)x = uvx + uwx;

sum (u v^n x: n = 0,1,2,...) = u v* x

Every rational subset will have a least upper bound which will be

distributive. This can be proven inductively, noting that the rational

subsets are the closure of the finite subsets under the 3 rational

operations:

set-wise product: AB = {ab: a in A, b in B}

union: A+B = A union B

Kleene star: A* = {1} union A union AA union AAA union ...

If one takes any such algebra K, and adds a copy of the formal Bra-Ket

algebra symbols D in such a way that the D's commute with the K's, then the

result is a *-continuous Kleene Algebra which contains the least upper

not only of the rational subsets of K, but of all the CONTEXT-FREE subsets

of K. So, the result provides an algebra for context-free expressions

over K.

The special case K = R(M), for a monoid M yields an algebra of context-free

expressions that faithfully represents the family C(M) of context-free

subsets of M. This specializes further into the two cases:

M = X* -- C(M) = context-free languages over X.

M = X* x Y* -- C(M) = simple syntax directed translations between X and Y.

--------------------------------------

(9) The LR Parser As A Context-Free Expression (EBNF's)

For the example involving a general system with regular expressions in

the right-hand side (which is what EBNF actually is):

S >= T (aT)*

T >= F (mF)*

F >= n* (uSv + x)

with the corresponding canonical bottom-up SSDT:

start >= Sz

S >= T (aT)* i

T >= F (mF)* j

F >= n* (uSv + x) k

we established the following system:

q0 >= S q6 + T q5 + F q4 + u q1 + x q9 + n q8

q1 >= S q7 + T q5 + F q4 + u q1 + x q9 + n q8

q2 >= T q5 + F q4 + u q1 + x q9 + n q8

q3 >= F q4 + u q1 + x q9 + n q8

q4 >= j + m q3

q5 >= i + a q2

q6 >= z

q7 >= v q9

q8 >= u q1 + x q9 + n q8

q9 >= k

which yields the definitions (again, abbreviating q's by numbers):

[S] = [0]<6| + [1]<7|

[T] = [0,1,2]<5|

[F] = [0,1,2,3]<4|

[u] = [0,1,2,3,8]<1|

[x] = [0,1,2,3,8]<9|

[n] = [0,1,2,3,8]<8|

[v] = [7]<9|

[m] = [4]<3|

[a] = [5]<2|

The actions are computed as follows:

z = [S]+ |0> = |6 0>

i = [T (aT)*]+ [S]

j = [F (mF)*]+ [T]

k = [n* (uSv + x)]+ [F].

where [] in the more general context is defined by identifying:

|A*> = |A>*; <A*| = <A|*.

Thus,

[T (aT)*] = [T] ([a] [T])*

= [0,1,2]<5| ([5]<2| [0,1,2]<5|)*

= [0,1,2]<5| ([5]<2|<5|)*

= [0,1,2]<5| (|5><5|<2|<5|)*

= [0,1,2] (<5||5><5|<2|)* <5|

= [0,1,2] (<5|<2|)* <5|

= [0,1,2] <(5 2)* 5|

Thus

i = [T (aT)*]+ [S] = |5 (2 5)*> [0,1,2] ([0]<6| + [1]<7|)

= |5 (2 5)*>([0]<6| + [1]<7|).

Similarly,

j = [F (mF)*]+ [T] = |4 (3 4)*>[0,1,2]<5|

k = [n* (uSv + x)]+ [F] = |9 [7 1] 8*>[0,1,2,3]<4|

where we also write:

[A] = 1 + A

with

<[A]| = 1 + <A|; |[A]> = 1 + |A>.

The corresponding automaton is then:

A = <q0| ([X] + [H])* |q6 q0>

with

[X] = [q0,q1,q2,q3,q8] (u<q1| + x<q9| + n<q8|)

+ [q7] v<q9|

+ [q4] m<q3|

+ [q5] a<q2|

[H] = i + j + k

= |q5 (q2 q5)*>([q0]<q6| + [q1]<q7|)

+ |q4 (q3 q4)*>[q0,q1,q2]<q5|

+ |q9 [q7 q1] q8*>[q0,q1,q2,q3]<q4|.

Finally, infinite look-ahead is accomplished by applying commutativity,

noting that the X's commute with the symbols <q|, |q>. We can write:

A = <0| ([X] + [H])* |6 0>

= <0| ([H]* [X])* [H]* |6 0>.

Noting that

[H] = |5 (2 5)*>([0]<6| + [1]<7|)

+ |4 (3 4)*>[0,1,2]<5|

+ |9 [7 1] 8*>[0,1,2,3]<4|,

we have

[H]^2 = |4 (3 4)*>[0,1,2]<5| |5 (2 5)*>([0]<6| + [1]<7|)

= |4 (3 4)*>[0,1,2] |(2 5)*> ([0]<6| + [1]<7|)

= |4 (3 4)* (2 5)*> ([0]<6| + [1]<7|)

and

[H]^3 = 0.

Thus

[H]* = 1 + [H] + [H]^2

= 1

+ |(5 + 4 (3 4)*) (2 5)*>([0]<6| + [1]<7|)

+ |4 (3 4)*>[0,1,2]<5|

+ |9 [7 1] 8*>[0,1,2,3]<4|,

and

[H]* |6 0> = |6 0> + |(5 + 4 (3 4)*) (2 5)*>[0]<6| |6 0>

= |(6 + (5 + 4 (3 4)*) (2 5)*) 0>

[H]* [X] = [H]* [0,1,2,3,8] (u<1| + x<9| + n<8|)

+ [H]* [7] v<9|

+ [H]* [4] m<3|

+ [H]* [5] a<2|

= [0,1,2,3,8] (u<1| + x<9| + n<8|)

+ [7] v<9| + |(5 + 4 (3 4)*) (2 5)*>[1] v <7 9|

+ [4] m<3| + |9 [7 1] 8*>[0,1,2,3] m <4 3|

+ [5] a<2| + |4 (3 4)*>[0,1,2] a <5 2|

= u [0,1,2,3,8]<1|

+ x [0,1,2,3,8]<9|

+ n [0,1,2,3,8]<8|

+ v (|7> + |(5 + 4 (3 4)*) (2 5)*>[1]) <7 9|

+ m (|4> + |9 [7 1] 8*>[0,1,2,3]) <4 3|

+ a (|5> + |4 (3 4)*>[0,1,2]) <5 2|

Thus, the automaton can be represented as an infinite-lookahead automaton

defined by:

A = <0| ([H]* [X])* [H]* |6 0>

= <0| U* V |0>

with

U = [H]* [X]

= u [0,1,2,3,8]<1|

+ x [0,1,2,3,8]<9|

+ n [0,1,2,3,8]<8|

+ v (|7> + |(5 + 4 (3 4)*) (2 5)*>[1]) <7 9|

+ m (|4> + |9 [7 1] 8*>[0,1,2,3]) <4 3|

+ a (|5> + |4 (3 4)*>[0,1,2]) <5 2|

and

V = [H]* |6> [0]

= |(6 + (5 + 4 (3 4)*) (2 5)*)> [0].

--------------------------------------

(10) The LR Parser For As A Context-Free (Translation) Expression (SSDT's)

All of the formalism is purely algebraic and applies regardless of whether

the grammar is taken over a free monoid X* or a more general monoid M. So,

it's not only more general in the sense of applying to more general types

of grammars/systems, but also in the sense of applying to more general

types of word algebras.

In particular, if the original grammar, itself, was an SSDT, then the

corresponding word algebra would have been of the form M x N for an

input word algebra M and output word algebra N. All the results above

apply virtually unchanged.

So this can be used even when actions are embedded within the right-hand

sides of rules. This is particularly important for EBNF, where you almost

always embed actions within the rules, instead of relying on the

rule-reduction mechanism alone.

Take the EBNF

S >= T (aT)*

T >= F (mF)*

F >= n* (uSv + x)

and extend it with an action alphabet Y = {b,c,d,e} to the SSDT:

S >= T (aTb)*

T >= F (mFc)*

F >= nFd + uSv + xe.

This is then a grammar over the word algebra X* x Y*, where X = {a,m,n,u,v,x}.

The corresponding system canonical extension is:

start >= Sz

S >= T (aTb)* i

T >= F (mFc)* j

F >= (nFd + uSv + xe) k

which is a grammar over the word algebra X* x Y* x Z* with Z = {z,i,j,k}.

The result of applying the definitions <a> and a' to find the closure

of <start> is then:

q0 = <start>; q0 >= S qD + T q5 + F q6 + n q4 + u q1 + x qA

q1 = <Svk>; q1 >= S q9 + T q5 + F q6 + n q4 + u q1 + x qA

q2 = <Tb(aTb)*i>; q2 >= T q7 + F q6 + n q4 + u q1 + x qA

q3 = <Fc(mFc)*j>; q3 >= F q8 + n q4 + u q1 + x qA

q4 = <Fdk>; q4 >= F qB + n q4 + u q1 + x qA

q5 = <(aTb)*i>; q5 >= i + a q2

q6 = <(mFc)*j>; q6 >= j + m q3

q7 = <b(aTb)*i>; q7 >= b q5

q8 = <c(mFc)*j>; q8 >= c q6

q9 = <vk>; q9 >= v qC

qA = <ek>; qA >= e qC

qB = <dk>; qB >= d qC

qC = <k>; qC >= k

qD = <z>; qD >= z,

which yields:

[S] = [0]<D| + [1]<9|

[T] = [0,1]<5| + [2]<7|

[F] = [0,1,2]<6| + [3]<8| + [4]<B|

[0,1,2,3,4]<(4 B + 1 9 + A) C|

[n] = [0,1,2,3,4]<4|

[u] = [0,1,2,3,4]<1|

[x] = [0,1,2,3,4]<A|

[a] = [5]<2|

[m] = [6]<3|

[v] = [9]<C|

[b] = [7]<5|

[c] = [8]<6|

[d] = [B]<C|

[e] = [A]<C|.

*>From this, we get the following actions:*

i = [T(aTb)*]+ [S] = |5 (7 2 5)*> ([0]<D| + [1]<9|)

j = [F(mFc)*]+ [T] = |6 (8 3 6)*> ([0,1]<5| + [2]<7|)

k = [nFd + uSv + xe]+ [F] = |C (9 1 + A + B 4)>([0,1,2]<6| + [3]<8| + [4]<B|)

z = |D 0>

Thus, the automaton is given by the expression:

<0| ([X] + [Y] + [H])* |D 0>

with

[X] = n[n] + u[u] + x[x] + a[a] + m[m] + v[v]

= [0,1,2,3,4] (n<4| + u<1| + x<A|)

+ [5] a<2|

+ [6] m<3|

+ [9] v<C|

[Y] = b[b] + c[c] + d[d] + e[e]

= [7]b<5| + [8]c<6| + ([A]e + [B]d)<C|

[H] = i + j + k

= |5 (7 2 5)*> ([0]<D| + [1]<9|)

+ |6 (8 3 6)*> ([0,1]<5| + [2]<7|)

+ |C (9 1 + A + B 4)>([0,1,2]<6| + [3]<8| + [4]<B|)

The reduction of this to an infinite-lookahead automaton would be

similar, using the identities, for instance:

<0| ([X] + [Y] + [H])* |D 0>

= <0| ([H]* [X] + [H]* [Y])* [H]* |D 0>

= <0| (([H]* [Y])* [H]* [X])* ([H]* Y) [H]* |D 0>

= <0| U* V |0>

this time with

U = W* [H]* [X]

V = W* [H]* |D> [0]

W = [H]* [Y].

The evaluation of these expressions will be left as an exercise.

--------------------------------------

(11) Direct Algebraic Manipulation Supersedes And Obsoletes LR Parsing

The punchline, unfortunately, is that much of this formalism -- having

been developed in the 1960's, is simply rendered obsolete by the formalism

just developed. It is, in fact, very easy to derive a context-free

expression for even the most general case by hand:

start >= Sz

S >= T (aTb)* i

T >= F (mFc)* j

F >= (nFd + uSv + xe) k.

The least solution to any context-free system of inequalities is also a

solution to the corresponding equations:

start = Sz

S = T (aTb)* i

T = F (mFc)* j

F = (nFd + uSv + xe) k.

We'll use that below.

Define SF, TF, FF as the least solutions respectively to:

SF >= |0>z + |1> v k FF

TF >= [0,1] (aTb)* i SF + |2> b (aTb)* i SF

FF >= [0,1,2] (mFc)* j TF + |3> c (mFc)* j TF + |4> d k FF.

In the extended algebra incorporating the formal Bra-Kets, the least

solutions to the original system are sums over context-free subsets of

X* x Y*, and so commute with the Bras and Kets. Thus, defining

SS = S SF; TS = T TF; FS = F FF

SQ = (aTb)* i SF; TQ = (mFc)* j TF,

we can write

<0| SS = S <0| SF = S z = start.

<1| SS = S <1| SF = S v k FF

[0,1] TS = T [0,1] TF = T (aTb)* i SF = S SF = SS

<2| TS = T <2| TF = T b (aTb)* i SF

[0,1,2] FS = F [0,1,2] FF = F (mFc)* j TF = T TF = TS

<3| FS = F <3| FF = F c (mFc)* j TF

<4| FS = F <4| FF = F d k FF.

Thus

start >= <0| SS

SS >= T (aTb)* i SF = [0,1] TS

TS >= F (mFc)* j TF = [0,1,2] FS

FS >= (nFd + uSv + xe) k FF

= n F d k FF + u S v k FF + x e k FF

= m <4| FS + u <1| SS + x e k FF

FF >= [0,1,2] TQ + |3> c TQ + |4> d k FF

TF >= [0,1] SQ + |2> b SQ

SF >= |0>z + |1> v k FF

TQ >= (mFc)* j TF

= m F c j (mFc)* j TF + j TF

= m <3| FS + j ([0,1] + |2>b) SQ

SQ >= (aTb)* i SF

= a T b (aTb)* i SF + i SF

= a <2| TS + i (|0>z + |1>vk FF)

thus arriving at a right-linear system:

start >= <0| FS

FS >= (m<4| + u<1|) FS + xek FF

FF >= ([0,1,2] + |3>c) TQ + |4>dk FF

TQ >= m<3| FS + j ([0,1] + |2>b) SQ

SQ >= a<2| FS + |0>iz + |1>ivk FF

whose least solution for "start" will be the corresponding context-free

expression.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.