LR(k) k >= 1, using Quantum Grammars

mark@freenet.uwm.edu (Mark Hopkins)
Fri, 18 Feb 1994 05:31:30 GMT

          From comp.compilers

Related articles
LR(k) k >= 1, using Quantum Grammars mark@freenet.uwm.edu (1994-02-18)
Re: LR(k) k >= 1, using Quantum Grammars parrt@s1.arc.umn.edu (Terence Parr) (1994-02-20)
| List of all articles for this month |

Newsgroups: comp.compilers
From: mark@freenet.uwm.edu (Mark Hopkins)
Keywords: parse, LR(1)
Organization: Compilers Central
Date: Fri, 18 Feb 1994 05:31:30 GMT

LR(k) Parsing using Quantum Grammars


This thread was from way back in early October ...


Nandakumar Sankaran writes:
> I am looking for LR(k) parsers that are publicly available. I have
> rummaged thru the compilers list and have been unable to find one.


From: "Terence Parr" <parrt@s1.arc.umn.edu>
> This is because computing k-token lookahead is an exponential problem in
> theory. As usual, much can be done in practice, but requires what can be
> termed "polynomial approximations to exponential k-token lookahead" to get
> anything beyond 1-token lookahead. My Ph.D. thesis describes how this can
> be done (paper coming soon to a theatre near you). Also, ANTLR applies
> this theory to build LL(k) for k>1 if you want to take a look (send mail
> to pccts@ecn.purdue.edu).


I believe that this problem can be solved in such a way that not only will
the resulting parsers be efficient in size in terms of k, but actually in
most cases, smaller than the corresponding LR(0) parser (!).


      Some notes on Quantum Parsing and the LR(k) issue to follow. These
are all still preliminary observations...


LR(k) Parsing as Quantum Parsing:
      Quantum Parsing derives from LR(k) parsing. In these notes I will
bring up some motivational issues that will explain part of the story of
how the Quantum notation came about. The short story is this: I got tired
of playing with LR(k) parsing by hand and dealing with the excessive
redundancy, started to apply a method to factor out redundancies by
introducing stack operations, and later decided to step back and try to
develop an algebraic foundation for this method.


      The relation between LR(k) parsing and Quantum parsing may be summarized
by the following formula:


                Quantum Parser Generation =
                            LR(k) Parser Generation + Goto Elimination


The Goto actions of an LR(k) parser are effectively precompiled. What this
does is make the parsing an order of magnitude faster, factors out the
redundancies in those states that differ only by their Goto actions (which
is actually where almost all the redundancy in LR(k) parsing tables comes
from) and reduces the number of states by an equal order of magnitude.


      This will be illustrated with the following grammar:


                                S = a A c + b A d + b B c + a B d
                                A = e
                                B = e


We will first augment this grammar with a set of Action Symbols, {n} for
n = 0, 1, 2, 3, 4, 5, 6:


                              S' = S {0}
                              S = a A c {1} + b A d {2} + b B c {3} + a B d {4}
                              A = e {5}
                              B = e {6}


The action {0} will denote the "accept" action. I will use the notation


                            { n1 n2 ... nk } = {n1} {n2} ... {nk}


to denote a sequence of actions, and will apply a similar notation more
generally to regular expressions of actions, e.g.:


                                          { 1 3* 0 } = {1} {3}* {0}


      The LR(0) parser generater will create the following finite state
machine, given by the following system of equations:


                                        q0 = S q1 + b q2 + a q3
                                        q1 = {0}
                                        q2 = A q4 + B q5 + e q6
                                        q3 = A q7 + B q8 + e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = {5} + {6}
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2}
                                        q10 = {3}
                                        q11 = {1}
                                        q12 = {4}


I chose this grammar because it's a classic example of a grammar which
is LR(1) but not LALR(1).


      The reason that this is not LALR(1) is because this parser is incomplete:
it's missing the information needed to resolve the state q6. What is it
missing? The information about what to do after processing an action. This
is generally left to the parser to interpret at run-time. Unfortunately,
the parser just doesn't get it. Further, being an interpreter, for all
practical purposes, the parser suffers from the same problems most
interpreters have.


      A Quantum Parser is directly derived from this by adding extra terms after
each action symbol that will encode where the parser is to go. This is
done as a consequence of eliminating all the goto actions, e.g. the A q4 and
B q5 terms in the equation for q1. Additionally, what happens as a result is
that states that differ only in their goto terms (q1 and q2) will be merged.
This step literally is a Quantum leap beyond the usual LALR(1) traceback
method -- we're not just deriving the lookahead sets, we're deriving the
entire context. There will even be enough information present to resolve
the LR(1) conflict (in state q6) while simultaneously REDUCING the
redundancy.


      Step 0: Add the equation
                                        S = q0


      Step 1: Introduce operator symbols on the states with goto actions,
                      starting with <0| for q0. Prefix each occurrence of each of
                      these states in the right by its operator symbol (this is why
                      the equation for S is added):


                                        S = <0| q0
                      <0| -> q0 = S q1 + b <1| q2 + a <2| q3
                                        q1 = {0}
                      <1| -> q2 = A q4 + B q5 + e q6
                      <2| -> q3 = A q7 + B q8 + e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = {5} + {6}
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2}
                                        q10 = {3}
                                        q11 = {1}
                                        q12 = {4}


      Step 2: Resolve each of the actions by tracing back all the possible
                      paths, the same as would be done with LALR(1) parser generation.


            For example, action {2} in state q9 corresponds to the rule S -> b A d.
Tracing back all the transitions on b, A, and d that reach q9, we get the
following tree:
                                        q9 -> {2}
                                        q4 -> d q9
                                        q2 -> A q4
                                        q0 -> b <1| q2
                                        ... -> <0| q0
                                        --------------
                                        q0 -> S q1


In the LR(k) parser we are effectively replacing the derivation


                                    <0| q0 -> <0| <1| b A d {2}
by
                                    <0| q0 -> <0| S q1


We skip the intermediate step and go directly from <0|<1| b A d {2} to
<0| S q1. The stack operations that take place in going from the first
term to the second are precompiled into the term:


                                    q9 -> {2} |1> |0> <0| q1


The action {2} is therefore appended by the term |1> |0> <0| q1.


      Tracing back on action {5} in state q6, we get the following tree:


                                                q6 -> {5}
                        q2 -> e q6 q3 -> e q6
                        ... -> <1| q2 ... -> <2| q3
                        ------------- -------------
                        q2 -> A q4 q3 -> A q7


We therefore add terms for each of these branches:


                        q6 -> {5} (|1><1| q4 + |2><2| q7)


Note that we are actually tracing back a half step further by including the
stack operation on the states q2 and q3. This helps us reduce conflicts.
>From here on out, I will use the notation


              [n1, n2, ..., nk] to denote |n1><n1| + |n2><n2| + ... + |nk><nk|


Note that
                                  [n1, n2, ..., nk] = [n1] + [n2] + ... + [nk]
                                  <x| [y] = 0 = [y] |x> if x != y
                                  <x| [x] = <x|
                                  [x] |x> = |x>


I will also use the notation


                        [x1 x2 ... xn] = |x1>|x2>...|xn><xn|...<x2|<x1|


      The accept action deserves special consideration. Here, we're tracing
back the accept action:


                                    S -> <0| q0
                                    q0 -> S q1
                                    q1 -> {0}


but instead of a goto action we simply halt the parser. The corresponding
term will pop the stack symbol <0|, but will not push any new symbol on the
stack or go to another state. So we augment the {0} action with the term:


                                    q1 -> {0} |0>


      The final result of this process is the following augmentation:


                                        S = <0| q0
                                        q0 = S q1 + b <1| q2 + a <2| q3
                                        q1 = {0} |0>
                                        q2 = A q4 + B q5 + e q6
                                        q3 = A q7 + B q8 + e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = {5} ([1] q4 + [2] q7)
                                              + {6} ([1] q5 + [2] q8)
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2} |1> q1
                                        q10 = {3} |1> q1
                                        q11 = {1} |2> q1
                                        q12 = {4} |2> q1


      Step 3: Remove all the goto actions
                                        S = <0| q0
                                        q0 = b <1| q2 + a <2| q3
                                        q1 = {0} |0>
                                        q2 = e q6
                                        q3 = e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = {5} ([1] q4 + [2] q7)
                                              + {6} ([1] q5 + [2] q8)
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2} |1> q1
                                        q10 = {3} |1> q1
                                        q11 = {1} |2> q1
                                        q12 = {4} |2> q1


      Step 4: Merge states (q2 and q3 in this case)
                                        S = <0| q0
                                        q0 = b <1| q2 + a <2| q2
                                        q1 = {0} |0>
                                        q2 = e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = {5} ([1] q4 + [2] q7)
                                              + {6} ([1] q5 + [2] q8)
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2} |1> q1
                                        q10 = {3} |1> q1
                                        q11 = {1} |2> q1
                                        q12 = {4} |2> q1


      This is the resulting Quantum Grammar. Note that the LR(1) conflict in
state q6 is gone. Because we added the stack operations [1] and [2]
we are now able to distinguish which action to take on a transition on
symbol d by looking at the stack, e.g.


                                    q6 -> {5} [1] q4 -> {5} [1] d q9
                          but q6 -> {6} [2] q8 -> {6} [2] d q12


      We can precompile this conflict resolution by factoring out the symbol d.
The reason we can factor d out is because the actions, the stack operations
and input symbols all commute.


      Step 5: Factor out lambda terms
      All those terms that are have no input symbol in them are denoted lambda
terms, corresponding to lambda transitions. If the grammar is LR(k) there
will be no cycles formed by these lambda transitions so that we will be
safe in performing substitutions [this needs to be verified in detail].
For example:


            q6 = {5} ([1] q4 + [2] q7) + {6} ([1] q5 + [2] q8)
                  = {5} ([1] d q9 + [2] c q11) + {6} ([1] c q10 + [2] d q12)
                  = c ([2] {5} q11 + [1] {6} q10)
                  + d ([1] {5} q9 + [2] {6} q12)


Defining these parenthesized terms as states q13 and q14 we get:
                                        S = <0| q0
                                        q0 = b <1| q2 + a <2| q2
                                        q1 = {0} |0>
                                        q2 = e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = c q13 + d q14
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2} |1> q1
                                        q10 = {3} |1> q1
                                        q11 = {1} |2> q1
                                        q12 = {4} |2> q1
                                        q13 = [2] {5} q11 + [1] {6} q10
                                        q14 = [1] {5} q9 + [2] {6} q12


Working on state S, we get:
                                        S = <0| q0
                                            = b <0| <1| q2 + a <0| <2| q2


After performing similar substitutions on q13 and q14 we get:


                                        S = b <0|<1| q2 + a <0|<2| q2
                                        q0 = b <1| q2 + a <2| q2
                                        q1 = {0} |0>
                                        q2 = e q6
                                        q4 = d q9
                                        q5 = c q10
                                        q6 = c q13 + d q14
                                        q7 = c q11
                                        q8 = d q12
                                        q9 = {2} |1> q1
                                        q10 = {3} |1> q1
                                        q11 = {1} |2> q1
                                        q12 = {4} |2> q1
                                        q13 = {5 1 0} |2> |0> + {6 3 0} |1> |0>
                                        q14 = {5 2 0} |1> |0> + {6 4 0} |2> |0>


(note the cancellations carried out in q13 and q14). As result of this,
the states q0, q1, q4, q5, q7, q8, q9, q10, q11, and q12 all become useless
and can be eliminated. The result is a 5 state Quantum Grammar:


                                        S = b <0|<1| q2 + a <0|<2| q2
                                        q2 = e q6
                                        q6 = c q13 + d q14
                                        q13 = {5 1 0} |2> |0> + {6 3 0} |1> |0>
                                        q14 = {5 2 0} |1> |0> + {6 4 0} |2> |0>


As of yet, I know of no way to directly arrive at this without having to
derive the LR(0) parser first. There is a generalized item set
construction out there waiting to be discovered that directly generates
Quantum Parsers, and is closely related to the method presented in
the other case studies worked out here.


      We don't actually need to carry out any of these substitutions except
for the ones in the conflicting state, q6. If the grammar is LR(k), all
such conflicting states will be resolved after the necessary substitutions
are completed, and this will happen with a finite number of steps. As
seen above, the number of states may drop dramatically.


      Notice as a result of these substitutions certain actions will tend to
be grouped together into natural combinations. For instance, the action
{5 1 0} will correspond to the reduction sequence:


                                    a A c {1 0} -> a e {5} c {1 0} = a e c {5 1 0}
                                    S {0} -> a A c {1} {0} = a A c {1 0}
                                    S' -> S {0}
or the reduction:
                                    S' -> a e c {5 1 0}
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.