Re: How to get this to an lalr(1) grammar?

"Mark Hopkins" <mark@omnifest.uwm.edu>
Sun, 9 Oct 1994 19:47:30 GMT

          From comp.compilers

Related articles
How to get this to an lalr(1) grammar? Mark_Tarrabain@mindlink.bc.ca (1994-08-22)
Re: How to get this to an lalr(1) grammar? jholland@world.std.com (1994-08-22)
Re: How to get this to an lalr(1) grammar? pjj@cs.man.ac.uk (1994-08-28)
How to get this to an lalr(1) grammar? mph@zdenka.demon.co.uk (1994-08-28)
Re: How to get this to an lalr(1) grammar? mickunas@mickunas.cs.uiuc.edu (1994-09-10)
Re: How to get this to an lalr(1) grammar? dekker@dutiag.twi.tudelft.nl (1994-09-14)
Re: How to get this to an lalr(1) grammar? mark@omnifest.uwm.edu (Mark Hopkins) (1994-10-09)
| List of all articles for this month |

Newsgroups: comp.compilers
From: "Mark Hopkins" <mark@omnifest.uwm.edu>
Keywords: parse, LALR
Organization: Compilers Central
References: 94-08-137 94-09-052
Date: Sun, 9 Oct 1994 19:47:30 GMT

The grammar (fixed up by Rene Dekker (dekker@dutiag.twi.tudelft.nl)):
> s -> c | s t c
> c -> V n p
> a -> A | C o
> o -> /* empty */ | A
> t -> T | D | a u
> u -> /* empty */ | T
> n -> /* empty */ | L b | q
> q -> N | q a N
> b -> /* empty */ | B q
> p -> /* empty */ | P N


was noted to be LR(2), and it was asked if it could be converted to LR(1)
or LALR(1).


      The question is a little vague because what you really want is an LALR(1)
*parser* not LALR(1) *recognizer*. There's a big difference there. In fact
its possible to have a grammar that has a finite state recognizer, but an
inherently context free parser. In fact in this case, the language above is
given by the regular expression:


                                  c (t c)*
                                  where
                                        c = V [L | [L B] N (a N)*] [P N]
                                        t = T | D | a [T]
                                        a = A | C [A]


But might not even have a finite state parser (but it does and we'll prove it).


      The initial question has to be answered by making the output actions
explicit. It's only then that one can determine the complexity of the
language. Even there, it all depends on what actions are done where. If
no actions are done, for instance, you have a recognizer and the language is
just regular.


      To specify languages with explicit actions, you need two alphabets:


                            X = input alphabet = { V, A, C, T, D, L, N, B, P }
                            Y = output alphabet = { y0, y1, y2, ... }


It is axiomatically asserted that


                                      x y = y x for all x in X, y in Y.


Given this, the set


                      F = { w_in w_out: w_in in X*, w_out in Y*,
                                  w_out is a valid output action for w_in }


defines the parser's language, and can be specified by a context free
grammar. A parse of w_in, then, represents the task of finding all the
possible output words, w_out such that w_in w_out is in the set F. An
unambiguous language is one where there is only one w_out corresponding
to each w_in.


      The set of languages specifiable by these means are one and the same as the
Simple Syntax Directed Translation languages.


      Let's take the actions to stand for the production rules, e.g.


                              y0: s -> c, y1: s -> s t c, etc.


then the language can be given by the following context free grammar (listed
as a series of equations):


                                            s = c y0 | s t c y1
                                            c = V n p y2
                                            a = A y3 | C o y4
                                            o = y5 | A y6
                                            t = T y7 | D y8 | a u y9
                                            u = y10 | T y11
                                            n = y12 | L b y13 | q y14
                                            q = N y15 | q a N y16
                                            b = y17 | B q y18
                                            p = y19 | P N y20


You can eliminate the recursions and make some substitutions to yield:


                          o = y5 | A y6,
                          p = y19 | P N y20,
                          u = y10 | T y11,
                          a = A y3 | C o y4,
                          q = N y15 (a N y16)*,
                          t = T y7 | D y8 | a u y9,
                          b = y17 | B q y18,
                          n = y12 | L b y13 | q y14,
                          c = V n p y2,
                          s = c y0 (t c y1)*


This is non-recursive and so it specifies a regular language and thus a
(possibly non-deterministic) finite state machine. With the help of my
RE -> DFA converter (a copy of it and related software is at iecc.com
in /pub/files/regx.tar.gz last I checked), after back-substituting all the
states representing transitions on output symbols, the following results (Q1
is the start state):


Q1 = V Q2
Q2 = L Q3 | N y15 Q8 |
          y12 (P Q9 | y19 y1 y0 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q3 = B Q6 | y17 y13 (P Q9 |
          y19 y1 y0 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q6 = N y15 Q16
Q8 = A y3 Q17 | C Q13 |
          y14 (P Q9 | y19 y1 y0 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q9 = N y20 y2 y0 Q20
Q13 = A y6 y4 Q17 | y5 y4 N y16 Q8
Q16 = A y3 Q28 | C Q22 |
            y18 y13 (P Q9 | y19 y1 y0 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q17 = N y16 Q8
Q20 = 1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34
Q22 = A y6 y4 Q28 | y5 y4 N y16 Q16
Q25 = A y6 y4 Q31 | y5 y4 T y11 y9 Q34 | y5 y4 y10 y9 V Q38
Q28 = N y16 Q16
Q31 = T y11 y9 Q34 | y10 y9 V Q38
Q34 = V Q38
Q38 = L Q39 | N y15 Q44 | y12 (P Q45 |
            y19 y2 y1 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q39 = B Q42 | y17 y13 (P Q45 |
            y19 y2 y1 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q42 = N y15 Q52
Q44 = A y3 Q53 | C Q49 |
            y14 (P Q45 | y19 y2 y1 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q45 = N y20 y2 y1 Q20
Q49 = A y6 y4 Q53 | y5 y4 N y16 Q44
Q52 = A Q56 | C Q57 |
            y18 y13 (P Q45 | y19 y2 y1 (1 | A y3 Q31 | C Q25 | D y8 Q34 | T y7 Q34))
Q53 = N y16 Q44
Q56 = y3 N y16 Q52
Q57 = A y6 y4 Q59 | y5 y4 N y16 Q52
Q59 = N y16 Q52


      Factor out the input symbols (this is where the commutativity axtiom comes
into play) and use the notation:


                                          {n1-n2-...-nk} for yn1 yn2 ... ynk


Then the result is the following:


  Q1 = V Q2
  Q2 = L Q3 | N {15} Q8 | P {12} Q9 | {12-19-1-0} | A {12-19-1-0-3} Q31 |
            C {12-19-1-0} Q25 | D {12-19-1-0-8} Q34 | T {12-19-1-0-7} Q34
  Q3 = B Q6 | P {17-13} Q9 | {17-13-19-1-0} | A {17-13-19-1-0-3} Q31 |
            C {17-13-19-1-0} Q25 | D {17-13-19-1-0-8} Q34 | T {17-13-19-1-0-7} Q34
  Q6 = N {15} Q16
  Q8 = A ({3} Q17 | {14-19-1-0-3} Q31) | P {14} Q9 | {14-19-1-0} |
            C (Q13 | {14-19-1-0} Q25) | D {14-19-1-0-8} Q34 | T {14-19-1-0-7} Q34
  Q9 = N {20-2-0} Q20
Q13 = A {6-4} Q17 | N {5-4-16} Q8
Q16 = A ({3} Q28 | {18-13-19-1-0-3} Q31) | P {18-13} Q9 | {18-13-19-1-0} |
            C (Q22 | {18-13-19-1-0} Q25) | D {18-13-19-1-0-8} Q34 |
            T {18-13-19-1-0-7} Q34
Q17 = N {16} Q8
Q20 = 1 | A {3} Q31 | C Q25 | D {8} Q34 | T {7} Q34
Q22 = A {6-4} Q28 | N {5-4-16} Q16
Q25 = A {6-4} Q31 | T {5-4-11-9} Q34 | V {5-4-10-9} Q38
Q28 = N {16} Q16
Q31 = T {11-9} Q34 | V {10-9} Q38
Q34 = V Q38
Q38 = L Q39 | N {15} Q44 | P {12} Q45 | {12-19-2-1} | A {12-19-2-1-3} Q31 |
            C {12-19-2-1} Q25 | D {12-19-2-1-8} Q34 | T {12-19-2-1-7} Q34
Q39 = B Q42 | P {17-13} Q45 | {17-13-19-2-1} | A {17-13-19-2-1-3} Q31 |
            C {17-13-19-2-1} Q25 | D {17-13-19-2-1-8} Q34 | T {17-13-19-2-1-7} Q34
Q42 = N {15} Q52
Q44 = A ({3} Q53 | {14-19-2-1-3} Q31) | C (Q49 | {14-19-2-1} Q25) |
            P {14} Q45 | {14-19-2-1} | D {14-19-2-1-8} Q34 | T {14-19-2-1-7} Q34
Q45 = N {20-2-1} Q20
Q49 = A {6-4} Q53 | N {5-4-16} Q44
Q52 = A (Q56 | {18-13-19-2-1-3} Q31) | C (Q57 | {18-13-19-2-1} Q25) |
            P {18-13} Q45 | {18-13-19-2-1} | D {18-13-19-2-1-8} Q34 |
            T {18-13-19-2-1-7} Q34
Q53 = N {16} Q44
Q56 = N {3-16} Q52
Q57 = A {6-4} Q59 | N {5-4-16} Q52
Q59 = N {16} Q52


      The only apparent non-determinisms are the transitions on symbols A
and C in states Q8, Q16, Q44 and Q52.


  Q8 = A ({3} Q17 | {14-19-1-0-3} Q31) | C (Q13 | {14-19-1-0} Q25) |...
Q16 = A ({3} Q28 | {18-13-19-1-0-3} Q31) | C (Q22 | {18-13-19-1-0} Q25) |...
Q44 = A ({3} Q53 | {14-19-2-1-3} Q31) | C (Q49 | {14-19-2-1} Q25) |...
Q52 = A (Q56 | {18-13-19-2-1-3} Q31) | C (Q57 | {18-13-19-2-1} Q25) |...


      The input symbols for Q31 are T and V, and for Q17, Q28, Q53, and Q56 is
N. So transitions on A are deterministic here. Similarily, the inputs for
Q25 are A, T and V, and for states Q13, Q22, Q49 and Q57 are A and N. This
time there is an apparent non-determinism on a transition on symbol A.
Factoring out the A from each of the 4 subexpressions results in:


Q13 | {14-19-1-0} Q25 = A ({6-4} Q17 | {14-19-1-0-6-4} Q31) | ...
Q22 | {18-13-19-1-0} Q25 = A ({6-4} Q28 | {18-13-19-1-0-6-4} Q31) | ...
Q49 | {14-19-2-1} Q25 = A ({6-4} Q53 | {14-19-2-1-6-4} Q31) | ...
Q57 | {18-13-19-2-1} Q25 = A ({6-4} Q59 | {18-13-19-2-1-6-4} Q31) | ...


      Since Q17, Q28, Q53 and Q59 have transitions only on N, and Q31 only on
T and V, everything is deterministic. So the whole finite state machine is
deterministic.


      The two layers of factorings done here represent two-lookaheads, so the
input grammar is LR(2). Since the syntax reduces to a form suitable for
a deterministic finite state machine the *parser* language is regular,
not just the *recognizer* language.


      There's a lot of redundancy engendered by the repeated substitutions of
subexpressions, but there's ways to deal with that up front...
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.