Fri, 18 Feb 1994 05:31:30 GMT

Related articles |
---|

LR(k) k >= 1, using Quantum Grammars mark@freenet.uwm.edu (1994-02-18) |

Re: LR(k) k >= 1, using Quantum Grammars parrt@s1.arc.umn.edu (Terence Parr) (1994-02-20) |

Newsgroups: | comp.compilers |

From: | mark@freenet.uwm.edu (Mark Hopkins) |

Keywords: | parse, LR(1) |

Organization: | Compilers Central |

Date: | Fri, 18 Feb 1994 05:31:30 GMT |

LR(k) Parsing using Quantum Grammars

This thread was from way back in early October ...

Nandakumar Sankaran writes:

*> I am looking for LR(k) parsers that are publicly available. I have*

*> rummaged thru the compilers list and have been unable to find one.*

From: "Terence Parr" <parrt@s1.arc.umn.edu>

*> This is because computing k-token lookahead is an exponential problem in*

*> theory. As usual, much can be done in practice, but requires what can be*

*> termed "polynomial approximations to exponential k-token lookahead" to get*

*> anything beyond 1-token lookahead. My Ph.D. thesis describes how this can*

*> be done (paper coming soon to a theatre near you). Also, ANTLR applies*

*> this theory to build LL(k) for k>1 if you want to take a look (send mail*

*> to pccts@ecn.purdue.edu).*

I believe that this problem can be solved in such a way that not only will

the resulting parsers be efficient in size in terms of k, but actually in

most cases, smaller than the corresponding LR(0) parser (!).

Some notes on Quantum Parsing and the LR(k) issue to follow. These

are all still preliminary observations...

LR(k) Parsing as Quantum Parsing:

Quantum Parsing derives from LR(k) parsing. In these notes I will

bring up some motivational issues that will explain part of the story of

how the Quantum notation came about. The short story is this: I got tired

of playing with LR(k) parsing by hand and dealing with the excessive

redundancy, started to apply a method to factor out redundancies by

introducing stack operations, and later decided to step back and try to

develop an algebraic foundation for this method.

The relation between LR(k) parsing and Quantum parsing may be summarized

by the following formula:

Quantum Parser Generation =

LR(k) Parser Generation + Goto Elimination

The Goto actions of an LR(k) parser are effectively precompiled. What this

does is make the parsing an order of magnitude faster, factors out the

redundancies in those states that differ only by their Goto actions (which

is actually where almost all the redundancy in LR(k) parsing tables comes

from) and reduces the number of states by an equal order of magnitude.

This will be illustrated with the following grammar:

S = a A c + b A d + b B c + a B d

A = e

B = e

We will first augment this grammar with a set of Action Symbols, {n} for

n = 0, 1, 2, 3, 4, 5, 6:

S' = S {0}

S = a A c {1} + b A d {2} + b B c {3} + a B d {4}

A = e {5}

B = e {6}

The action {0} will denote the "accept" action. I will use the notation

{ n1 n2 ... nk } = {n1} {n2} ... {nk}

to denote a sequence of actions, and will apply a similar notation more

generally to regular expressions of actions, e.g.:

{ 1 3* 0 } = {1} {3}* {0}

The LR(0) parser generater will create the following finite state

machine, given by the following system of equations:

q0 = S q1 + b q2 + a q3

q1 = {0}

q2 = A q4 + B q5 + e q6

q3 = A q7 + B q8 + e q6

q4 = d q9

q5 = c q10

q6 = {5} + {6}

q7 = c q11

q8 = d q12

q9 = {2}

q10 = {3}

q11 = {1}

q12 = {4}

I chose this grammar because it's a classic example of a grammar which

is LR(1) but not LALR(1).

The reason that this is not LALR(1) is because this parser is incomplete:

it's missing the information needed to resolve the state q6. What is it

missing? The information about what to do after processing an action. This

is generally left to the parser to interpret at run-time. Unfortunately,

the parser just doesn't get it. Further, being an interpreter, for all

practical purposes, the parser suffers from the same problems most

interpreters have.

A Quantum Parser is directly derived from this by adding extra terms after

each action symbol that will encode where the parser is to go. This is

done as a consequence of eliminating all the goto actions, e.g. the A q4 and

B q5 terms in the equation for q1. Additionally, what happens as a result is

that states that differ only in their goto terms (q1 and q2) will be merged.

This step literally is a Quantum leap beyond the usual LALR(1) traceback

method -- we're not just deriving the lookahead sets, we're deriving the

entire context. There will even be enough information present to resolve

the LR(1) conflict (in state q6) while simultaneously REDUCING the

redundancy.

Step 0: Add the equation

S = q0

Step 1: Introduce operator symbols on the states with goto actions,

starting with <0| for q0. Prefix each occurrence of each of

these states in the right by its operator symbol (this is why

the equation for S is added):

S = <0| q0

<0| -> q0 = S q1 + b <1| q2 + a <2| q3

q1 = {0}

<1| -> q2 = A q4 + B q5 + e q6

<2| -> q3 = A q7 + B q8 + e q6

q4 = d q9

q5 = c q10

q6 = {5} + {6}

q7 = c q11

q8 = d q12

q9 = {2}

q10 = {3}

q11 = {1}

q12 = {4}

Step 2: Resolve each of the actions by tracing back all the possible

paths, the same as would be done with LALR(1) parser generation.

For example, action {2} in state q9 corresponds to the rule S -> b A d.

Tracing back all the transitions on b, A, and d that reach q9, we get the

following tree:

q9 -> {2}

q4 -> d q9

q2 -> A q4

q0 -> b <1| q2

... -> <0| q0

--------------

q0 -> S q1

In the LR(k) parser we are effectively replacing the derivation

<0| q0 -> <0| <1| b A d {2}

by

<0| q0 -> <0| S q1

We skip the intermediate step and go directly from <0|<1| b A d {2} to

<0| S q1. The stack operations that take place in going from the first

term to the second are precompiled into the term:

q9 -> {2} |1> |0> <0| q1

The action {2} is therefore appended by the term |1> |0> <0| q1.

Tracing back on action {5} in state q6, we get the following tree:

q6 -> {5}

q2 -> e q6 q3 -> e q6

... -> <1| q2 ... -> <2| q3

------------- -------------

q2 -> A q4 q3 -> A q7

We therefore add terms for each of these branches:

q6 -> {5} (|1><1| q4 + |2><2| q7)

Note that we are actually tracing back a half step further by including the

stack operation on the states q2 and q3. This helps us reduce conflicts.

*>From here on out, I will use the notation*

[n1, n2, ..., nk] to denote |n1><n1| + |n2><n2| + ... + |nk><nk|

Note that

[n1, n2, ..., nk] = [n1] + [n2] + ... + [nk]

<x| [y] = 0 = [y] |x> if x != y

<x| [x] = <x|

[x] |x> = |x>

I will also use the notation

[x1 x2 ... xn] = |x1>|x2>...|xn><xn|...<x2|<x1|

The accept action deserves special consideration. Here, we're tracing

back the accept action:

S -> <0| q0

q0 -> S q1

q1 -> {0}

but instead of a goto action we simply halt the parser. The corresponding

term will pop the stack symbol <0|, but will not push any new symbol on the

stack or go to another state. So we augment the {0} action with the term:

q1 -> {0} |0>

The final result of this process is the following augmentation:

S = <0| q0

q0 = S q1 + b <1| q2 + a <2| q3

q1 = {0} |0>

q2 = A q4 + B q5 + e q6

q3 = A q7 + B q8 + e q6

q4 = d q9

q5 = c q10

q6 = {5} ([1] q4 + [2] q7)

+ {6} ([1] q5 + [2] q8)

q7 = c q11

q8 = d q12

q9 = {2} |1> q1

q10 = {3} |1> q1

q11 = {1} |2> q1

q12 = {4} |2> q1

Step 3: Remove all the goto actions

S = <0| q0

q0 = b <1| q2 + a <2| q3

q1 = {0} |0>

q2 = e q6

q3 = e q6

q4 = d q9

q5 = c q10

q6 = {5} ([1] q4 + [2] q7)

+ {6} ([1] q5 + [2] q8)

q7 = c q11

q8 = d q12

q9 = {2} |1> q1

q10 = {3} |1> q1

q11 = {1} |2> q1

q12 = {4} |2> q1

Step 4: Merge states (q2 and q3 in this case)

S = <0| q0

q0 = b <1| q2 + a <2| q2

q1 = {0} |0>

q2 = e q6

q4 = d q9

q5 = c q10

q6 = {5} ([1] q4 + [2] q7)

+ {6} ([1] q5 + [2] q8)

q7 = c q11

q8 = d q12

q9 = {2} |1> q1

q10 = {3} |1> q1

q11 = {1} |2> q1

q12 = {4} |2> q1

This is the resulting Quantum Grammar. Note that the LR(1) conflict in

state q6 is gone. Because we added the stack operations [1] and [2]

we are now able to distinguish which action to take on a transition on

symbol d by looking at the stack, e.g.

q6 -> {5} [1] q4 -> {5} [1] d q9

but q6 -> {6} [2] q8 -> {6} [2] d q12

We can precompile this conflict resolution by factoring out the symbol d.

The reason we can factor d out is because the actions, the stack operations

and input symbols all commute.

Step 5: Factor out lambda terms

All those terms that are have no input symbol in them are denoted lambda

terms, corresponding to lambda transitions. If the grammar is LR(k) there

will be no cycles formed by these lambda transitions so that we will be

safe in performing substitutions [this needs to be verified in detail].

For example:

q6 = {5} ([1] q4 + [2] q7) + {6} ([1] q5 + [2] q8)

= {5} ([1] d q9 + [2] c q11) + {6} ([1] c q10 + [2] d q12)

= c ([2] {5} q11 + [1] {6} q10)

+ d ([1] {5} q9 + [2] {6} q12)

Defining these parenthesized terms as states q13 and q14 we get:

S = <0| q0

q0 = b <1| q2 + a <2| q2

q1 = {0} |0>

q2 = e q6

q4 = d q9

q5 = c q10

q6 = c q13 + d q14

q7 = c q11

q8 = d q12

q9 = {2} |1> q1

q10 = {3} |1> q1

q11 = {1} |2> q1

q12 = {4} |2> q1

q13 = [2] {5} q11 + [1] {6} q10

q14 = [1] {5} q9 + [2] {6} q12

Working on state S, we get:

S = <0| q0

= b <0| <1| q2 + a <0| <2| q2

After performing similar substitutions on q13 and q14 we get:

S = b <0|<1| q2 + a <0|<2| q2

q0 = b <1| q2 + a <2| q2

q1 = {0} |0>

q2 = e q6

q4 = d q9

q5 = c q10

q6 = c q13 + d q14

q7 = c q11

q8 = d q12

q9 = {2} |1> q1

q10 = {3} |1> q1

q11 = {1} |2> q1

q12 = {4} |2> q1

q13 = {5 1 0} |2> |0> + {6 3 0} |1> |0>

q14 = {5 2 0} |1> |0> + {6 4 0} |2> |0>

(note the cancellations carried out in q13 and q14). As result of this,

the states q0, q1, q4, q5, q7, q8, q9, q10, q11, and q12 all become useless

and can be eliminated. The result is a 5 state Quantum Grammar:

S = b <0|<1| q2 + a <0|<2| q2

q2 = e q6

q6 = c q13 + d q14

q13 = {5 1 0} |2> |0> + {6 3 0} |1> |0>

q14 = {5 2 0} |1> |0> + {6 4 0} |2> |0>

As of yet, I know of no way to directly arrive at this without having to

derive the LR(0) parser first. There is a generalized item set

construction out there waiting to be discovered that directly generates

Quantum Parsers, and is closely related to the method presented in

the other case studies worked out here.

We don't actually need to carry out any of these substitutions except

for the ones in the conflicting state, q6. If the grammar is LR(k), all

such conflicting states will be resolved after the necessary substitutions

are completed, and this will happen with a finite number of steps. As

seen above, the number of states may drop dramatically.

Notice as a result of these substitutions certain actions will tend to

be grouped together into natural combinations. For instance, the action

{5 1 0} will correspond to the reduction sequence:

a A c {1 0} -> a e {5} c {1 0} = a e c {5 1 0}

S {0} -> a A c {1} {0} = a A c {1 0}

S' -> S {0}

or the reduction:

S' -> a e c {5 1 0}

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.