31 Jul 2002 01:17:59 -0400

Related articles |
---|

[2 earlier articles] |

Re: Have I discovered something new? joachim_d@gmx.de (Joachim Durchholz) (2002-07-21) |

Have I discovered something new? cfc@world.std.com (Chris F Clark) (2002-07-21) |

Re: Have I discovered something new? kgw-news@stiscan.com (2002-07-21) |

Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-24) |

Re: Have I discovered something new? steve@lurking.demon.co.uk (Steve Horne) (2002-07-25) |

Re: Have I discovered something new? robert.corbett@sun.com (Robert Corbett) (2002-07-31) |

Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-07-31) |

Re: Have I discovered something new? cfc@shell01.TheWorld.com (Chris F Clark) (2002-08-04) |

Re: Have I discovered something new? whopkins@alpha2.csd.uwm.edu (Mark) (2002-08-04) |

From: | "Mark" <whopkins@alpha2.csd.uwm.edu> |

Newsgroups: | comp.compilers |

Date: | 31 Jul 2002 01:17:59 -0400 |

Organization: | University of Wisconsin - Milwaukee, Computing Services Division |

References: | 02-07-057 |

Keywords: | parse |

Posted-Date: | 31 Jul 2002 01:17:59 EDT |

"Steve Horne" <steve@lurking.demon.co.uk> writes:

*>The basic idea was that for each token recognised, the 'parser' should*

*>form a new and more specialised grammar from the old one in which the*

*>head part of all the strings in the grammar is restricted the the*

*>sequence of tokens shifted before. Thus any ambiguity in the input is*

*>simply maintained in the output grammar.*

The general problem is to find the intersection of a SSDT T and the

regular expression v (X* x Y*), where v is an input symbol in X.

Suppose the grammar for T is

qi -> q1 Ai1 | ... | qn Ain | Bi; i = 1,...,n

where all the left-recursions are removed to the A-expressions.

The "empty-test" grammar is:

[qi] -> [q1] [Ai1] | ... | [qn] [Ain] | [Bi]; i = 1,...,n

where for the A's and B's:

[e] = 1 if e contains the empty word 1; [e] = 0 else

That gives you the solutions [qi] = 1 or 0, for each i (which, thus,

embodies the emptiness test on each q).

In terms of this, the grammar for the intersection with v (X* x Y*) is:

qi' -> q1' Ai1 | ... | qn' Ain |

[q1] Ai1' | ... | [qn] Ain' | Bi'; i = 1,...,n

where for the e's

e' = { vrs in e: r in X*, s in Y* }

The example

[(avb)* + (bw + cv)*]' = abv(abv)* + c (bw + cv)*

with X = {v,w,x}; Y = {a,b,c}, shows the effect that the commutativity

of the X's with the Y's has on the determination of e'.

The Y symbols all clump up to the head of the components of the

expression e'.

The total grammar is the reduction of the one involving the rules for

q1,...,qn,[q1],...,[qn],q1',...,qn'.

All the [qi]'s reduce to 0 or 1, so they are all gone in a reduced

grammar.

A tree-based approach is a grammar-based approach, since a tree is

just a fancy notation for a grammar (as described in my previous

article).

Any tree-based or grammar-based method is out.

To put it more directly: the instant you decide to do anything with

trees/forests or grammars, in any way at all (no matter what or how),

you've already lost the game, so to say.

To get something that maintains something linear in the size of the

original word MUST make use of some equivalent algebraic notation for

context-free expressions, as illustrated in the previous article.

It's the bare minimum necessary to get an O(n) resolution to the

version of the parsing problem you're alluding to (which matches what

I called "Parsing" under header (E)).

There is, in fact, a very simple way to get the desired partial

reductions. Once you've expressed the SSDT in question as a context

free expression e, then you simply write down e', itself.

In fact, everything can be PRE-COMPUTED all in advance. The net

result will be that you'll be traversing what, for all intents

and purposes, is a finite state automaton (in a suitably generalized

context).

This is defined recursively by:

(AB)' = A' B + [A] B'

1' = 0

x' = 1, if x = V, 0 else; x in X union Y union D

(sum U) = sum { u': u in U }

where D comprises the set of operator symbols.

The last rule leads to:

0' = 0; (A+B)' = A' + B'; A*' = A' A*

as consequences; using the definitions

0 = sum {};

A+B = sum {A,B};

A* = sum {1,A,AA,AAA,...}

Here's a real eqregious example to show how this all works:

S -> 1 | S | SS | x S T

T -> u S v | y

with input alphabet

X = {u,v,x,y}.

The canonical bottom-up SSDTG is then

start -> Sz

S -> a | Sb | SSc | xSTd

T -> uSve | yf

with output alphabet

Y = {a,b,c,d,e,f,z}

and standard interpretation in terms of tree-building operations:

a = "reduce S -> 1"

b = "reduce S -> S"

c = "reduce S -> SS"

d = "reduce S -> xST"

e = "reduce S -> uSv"

f = "reduce S -> y"

z = "reduce root -> S"

You can get a context-free expression (CFE) in a similar way as before.

Remove the left-recursion, so:

start >= Sz

S >= a + Sb + SSc + xSTd

T >= uSve + yf

becomes:

start >= Sz

S >= (a + xSTd) (b + Sc)*

T >= uSve + yf

Define SF and TF by:

SF >= |0>z + |1>Td (b+Sc)* SF + |2>veTF + |3> c (b+Sc)* SF

TF >= |4>d (b+Sc)* SF.

and SS, TS by:

SS = S SF; TS = T TF

and set

SQ = (b + Sc)* SF.

Then the corresponding CFE comes out of reducing the system to right-linear

form:

start >= <0| SS

SS >= a SQ + x <1| SS

SQ >= b SQ + <3| SS + SF

SF >= |0> z + |1><4| TS + |2> ve TF + |3> c SQ

TS >= u <2| SS + yf TF

TF >= |4> d SQ

The components of the rules not involving input symbols are treated as

right-recursions and are eliminated. The above system can be written

as

start >= A SS

SS >= B SQ + xM

SQ >= C SQ + D SS + E SF

SF >= F TS + G SQ + vN + |0>z

TS >= uP + yQ

TF >= H SQ

where

A = <0|, B = a, C = b, D = <3|, E = 1, F = |1><4|, G = |3> c, H = |4> d

M = <1| SS, N = |2>e TF, P = <2| SS, Q = f TF

This yields,

SS >= R (xM + W ((vN + |0>z) + F(uP + yQ)))

where

R = (B (C + EG)* D)* = (a (b + |3>c)* <3|)*

W = B (C + EG)* E = a (b + |3>c)*

and

TF >= H X (DxM + E(vN + |0>z) + EF(uP + yQ))

with

X = (C + DB + EG)* = (b + <3|a + |3>c)*

and

start >= AR (xM + W ((vN + |0>z) + F(uP + yQ)))

Together, this yields a quasi-deterministic finite-state automaton

via the rlght-linear system:

start >= x <0| R <1| SS

+ v <0| R W |2> e TF

+ u <0| R W |1><4|<2| SS

+ y <0| R W |1><4| f TF

+ <0| R W |0> z

SS >= x R <1| SS

+ v R W |2> e TF

+ u R W |1><4|<2| SS

+ y R W |1><4| f TF

+ R W |0> z

TF >= x |4>d <3|<1| SS

+ v |4>d |2> e TF

+ u |4>d |1><4|<2| SS

+ y |4>d |1><4| f TF

+ |4>d |0> z

where

R = (a (b + |3>c)* <3|)*

W = a (b + |3>c)*

X = (b + <3|a + |3>c)*.

Parsing the word xuxyv yields the context-free expression:

start = (<0|R<1|) (RW|1><4|<2|) (R<1|) (RW|1><4|f) (|4>d|2>e) (|4>d|0>z)

where the bracketed items corresponding respectively to the symbols

x, u, x, y and v in the input; and (for |1>d|0>z) the end-marker.

This expression gives you the subset of Y* corresponding to all the

parses of the input. This subset is not only infinite, but it's not

even regular -- it's a context-free subset in its own right!

This is why you can't get away from using some form of algebraic

notation for context-free expressions.

The other versions of parsing I mentioned (recognition, enumeration)

then reduce to the problem of testing for emptiness and enumerating

on this context-free expression.

The emptiness test comes by replacing all the Y-symbols by 1's:

R = ((1 + |3>)* <3|)*

W = (1 + |3>)*

X = (1 + <3| + |3>)*.

start = <0|R<1| RW|1><4|<2| R<1| RW|1><4| |4>|2> |4>|0>

and reducing to normal form:

R = X = |3>* <3|*

W = |3>*

and

start = <0|<3|*<1| |1><4|<2| <3|*<1| |1><4| |4>|2> |4>|0>

= <0| <3|* <4| <2| <4| |4> |2> |4> |0>

= <0| <3|* |0> = 1

For the word xuxv, the expression would be the part without the

RW|1><4|:

start = <0|R<1| RW|1><4|<2| R<1| |4>|2> |4>|0>

and the emptiness test would give you:

start = <0|<3|*<1| |1><4|<2| <3|* <1| |4>|2> |4>|0> = 0

since <1| |4> = 0. So, xuxv is not recognized by the original CFG.

In general, if a version of Valiant's algorithm is used, rendered

suitably in algebraic form as a process for reducing pure operator

expressions to normal form, then the test will be O(n^log_2(7)) in

complexity.

The problem of enumeration would be that of rattling off the actual

output words of Y* from the context-free expression. Here, that would mean

enumerating all the words, one-by-one, that come out of the

expression:

<0|R<1| RW|1><4|<2| R<1| RW|1><4|f |4>d|2>e |4>d|0>z

with

R = (a (b + |3>c)* <3|)*

W = a (b + |3>c)*

X = (b + <3|a + |3>c)*

Since the SSDT came from the original grammar via the canomical "bottom-up"

extension, then each word is actually a word that directs the building of a

parse tree from bottom-up.

I won't discuss that here.

------------------

As an aside:

In the first article, different varieities of Kleene algebras were

described: one closed under arbitrary sums (Quantales); another

under sums of rational subsets (*-continuous Kleene algebras) and

two others closed respectively closed under sums of context-free

subsets (a "CF-continuous" Kleene algebra), and Turing subsets

(a "turing-continuous" Kleene algebra). The latter two are unknown.

The key theorem, not precisely stated in the first article, in full

generality gives you a way to represent these larger algebras using

regular expressions.

In detail, if you take a *-continuous Kleene Algebra, K; then close

it up under context-free sums or turing-sums to get respectively C(K)

or T(K), then the result -- regarded as a *-continous Kleene algebra --

can be expressed as that part of the algebra

Cn x K (for any n >= 2)

Cm x Cn x K (for any m, n >= 1)

which commutes with the operator symbols in the Cm's and Cn's.

The tensor product A x B of the algebras A and B is the common

extension of A and B in which the A's commute with the B's. It can

be formally defined by the set { (a,b): a in A, b in B } subject to

the identities:

(a,b) (a',b') = (a a', b b')

(1,1) = 1

and

sum (U x V) = (sum U, sum V); U, V rational subsets of A & B

What was described in th first article was the construction of

Cn x R(M), for a monoid M. The closure of R(M) under context-free sums

is C(R(M)) = C(M), the context-free subsets of M. So the result

is that C2 x R(M) is an (*-continuous) Kleene algebra for context-free

subsets of M. The cases M = X*, M = X* x Y* give you respectively

algebra for context free languages over X, and SSDT's for input alphabet

X, output alphabet Y. The latter are what parsers work with.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.