Quantum Parsing (was: Re: Parsers are easy to write)

markh@csd4.csd.uwm.edu (Mark)
Mon, 28 Sep 1992 00:32:02 GMT

          From comp.compilers

Related articles
Backtracking yacc jarmo@ksvltd.FI (Jarmo Raiha) (1992-09-10)
Parsers are easy to write (was: Re: Backtracking yacc) markh@csd4.csd.uwm.edu (1992-09-25)
Quantum Parsing (was: Re: Parsers are easy to write) markh@csd4.csd.uwm.edu (1992-09-28)
Re: Quantum Parsing (was: Re: Parsers are easy to write) dak@kaa.informatik.rwth-aachen.de (1992-09-30)
Re: Quantum Parsing (was: Re: Parsers are easy to write) markh@csd4.csd.uwm.edu (1992-10-03)
| List of all articles for this month |
Newsgroups: comp.compilers,comp.theory
From: markh@csd4.csd.uwm.edu (Mark)
Organization: Computing Services Division, University of Wisconsin - Milwaukee
Date: Mon, 28 Sep 1992 00:32:02 GMT
Keywords: parse, LL(1), yacc
References: 92-09-059 92-09-185

markh@csd4.csd.uwm.edu (Mark) writes:
>In fact, I challenge anyone to find me one that I cannot craft a parser
>for by hand in a short time ...
>[Don't forget to make sure that your hand-written parser diagnoses all
>invalid input, as a yacc parser does. My experience is that the biggest wins
>from using yacc are the automatic diagnosis of ambiguous grammars and the
>extreme ease of adding little tweaks later on. -John]


The moderator brings up a valid point. Without the constraint of proper
error handling, one could easily come up with the default solution to the
problem of writing a parser by hand: a top-down recursive parser with
backtracking.


If anyone is curious, I do use have a method. It's approach is entirely
algebraic, and I'll give a cursory description below.


Suppose you have a Context Free Grammar, consisting of the following:
Terminals (T) (lexical items), Non-terminals (N), and Actions (Y). The
steps for deriving a parser are basically the following:


(1) Write the grammar as a recursive system of equations of the form:


                                Non-terminal = RE(T, N, Y)


where RE is a regular expression in the alphabet T \union N \union Y.


A regular expression over an alphabet, A, can consist of the following:


                                    (1) 0 = (roughly corresponding to YACC's "error").
                                    (2) 1 (the "empty string")
                                    (3) any member a in A.
                                    (4) concatenation RE RE
                                    (5) alteration RE | RE
                                    (6) iteration RE*


There is an underlying algebra here, for example 0 x = 0 = x 0, 1 x = x = x 1,
x | y = y | x, etc.


(2) Eliminate the recursions from the system. More generally, eliminate all
        the non-terminals from the right-hand sides of all the equations. The
        result is a SINGLE equation of the form:


                                        Start-symbol = ^ RE(T, X, X+) $.


All the non-terminals are eliminated.


The set X consists of "operators" (x), corresponding to stack symbols of a
push-down automaton. They're called CREATION OPERATORS. The set, X+
consists of the complementary operators X+ = { x+: x is in X }, called
DESTRUCTION OPERATORS. They satisfy the following rules:


                                      ^ is in X
                                      $ is in X+, with $ = ^+.
                                      x y+ = 0, if x != y
                                      x x+ = 1
                                      sum for all x in X (x+ x) = 1
                                      x t = t x, x+ t = t x+, for all t in N \union T \union Y


The symbols ^, and $ are boundary markers and do not occur within the
regular expression RE(T, X, X+). The X's and X+'s interpretations are the
following:


                                          x = "push symbol x"
                                          x+ = "pop symbol x, if x is on the stack's top"
                                          ^ = "stack bottom"
                                          $ = "stack is empty"


Every context free language can be represented as a regular expression
over such an algebra.


(3) Develop a finite automaton from the right-hand-side. When developing a FA,
        the arcs can contain one of more of the following:


                                  * zero or one "shifts" (any element of T)
                                  * zero or more "actions" (elements of Y)
                                  * zero or one "pushes" (any element of X)
                                  * zero or one "pops" (any element of X+)


      An arc is a REDUCE arc if it contains an pop, it's a SHIFT, if it
contains no pops, but at least one shift, and it's a LAMBDA arc, if it
contains no pops or shifts.


      A state in the finite automaton is a PURE SHIFT state if its only arcs
are shift arcs, and it's a PURE REDUCE state if its only arcs are reduce
arcs. It's a CRITICAL STATE if it has at least one shift arc and one
reduce arc.


      The ^ and $ are ignored when constructing the automaton.


      The finite automaton is "deterministic" if its only states are pure
shift and pure reduce states.


      Step (3) breaks down as follows:
      (3a) Develop a minimal DFA
      (3b) Eliminate all the lambda arcs.
      (3c) Disambiguate all critical states.


Step (1) is a null-step. The grammar should start out in the form stated.
Step (2) is a rigorous and straightforward step, but also involves several
rules of optimization that can be categorized as follows:


                            (a) Left-recursion elimination
                            (b) Stack-symbol elimination
                                    (i) Conversion of stack symbol to counter.
                                    (ii) Conversion of stack symbol to variable.
                                    (iii) Equating stack symbols


Step (3a) is rigorous, basically the algorithm for eliminating lambda
transitions from a finite automaton. Step (3b) involves heuristics.


Formalizing (2) and (3b) is tantamount to writing an expert system.


As an example of how the algebra works, take the syntax:


                                                    S = (a y0 | b S c y1)*.


                                            where y0 and y1 are actions.


Upon conversion, this becomes:


                                              S = ^ (b x | a y0 | c x+ y1)* $.


where x, x+, ^, and $ commute with a, b, c, y0, and y1; and with the
creation-destruction rules:


                                              ^ $ = 1 ^ x+ = 0 $ ^ | x+ x = 1
                                              x $ = 0 x x+ = 1


Under this algebra, the regular expression above PRECISELY generates the
given context free language.


      The parser is the (essentially) 1-state finite automaton generated from
the regular expression above:


                    Start state = Q0.
                    (final state) Q0 -> b x Q0 | a y0 Q0 | c x+ y1 Q0.


Q0 is almost a shift state. It can be made so, by factoring out the x+
and $:


                    Start state = Q0.
                    (final state) Q0 -> b x Q0 | a y0 Q0 | c Q1.
                                                Q1 -> x+ y1 Q0.


      In practice I fudge on the steps in the following ways:


(2) Not all non-terminals need to be "eliminated", if they do not contribute
          to a recursion.
(3a) Not all the lambda arcs need to be eliminated if they don't result in
          in critical states.
(3b) Some of the critical states are left alone, especially those that
          involve precedence relations. Instead, an "action" table is generated
          whose rows consist of elements of X+, and columns elements of T.


Another example is the following "expression" syntax:


                              E = x y0 | u E y1 | E p y2 | E b E y3 | ( E ) y4.


where y0, y1, y2, y3 and y4 are actions.


After conversion, this results in the non-recursive system of equations:


                        E = ^ E0 $.
                        E0 = x y0 F | u x0 E0 | ( x1 E0.
                        F = 1 | x0+ y1 F | p y2 F | b x2 E0 | x2+ y3 F | x1+ ) y4 F.


Which leads to a 2-state non-deterministic parser


                                        Start state = E
                                        E -> x y0 F | u x0 E | ( x1 E
          (final state) F -> x0+ y1 F | p y2 F | b x2 E0 | x2+ y3 F | x1+ ) y4 F.


State E is a shift state, and state F is critical, containing shifts from
p and b, and reduces involving x0+, x1+, and x2+. An action table is
formed involving the shifts (b, p) vs. the destruction operators (x0+ and
x2+). The operator x1+ is not involved, since it is already accompanied
by a shift: ).


The arcs headed by x0+ and x2+ essentially correspond to the following points
in the parse process:


                                  x0+: E -> u E .
                                  x2+: E -> E b E .


The shifts correspond to the parsing points:


                                  b: E -> E . b E
                                  p: E -> E . p


So the action table resolves the conflicts:


                                        x0+, b: (u E) b E vs. u (E b E)
                                        x0+, p: (u E) p vs. u (E p)
                                        x2+, b: (E b E) b E vs. E b (E b E)
                                        x2+, p: (E b E) p vs. E b (E p)


Thus, the traditional operator precedence parser is systematically derived.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.