# Bottom up hand-written parsers in under an hour

## mark@freenet.uwm.edu (Mark Hopkins)Sat, 12 Feb 1994 04:44:43 GMT

From comp.compilers

Related articles
Bottom up hand-written parsers in under an hour mark@freenet.uwm.edu (1994-02-12)
Bottoms-up, another hour has come: Quantum Parsing mark@freenet.uwm.edu (1994-02-17)
| List of all articles for this month |

 Newsgroups: comp.compilers From: mark@freenet.uwm.edu (Mark Hopkins) Keywords: parse Organization: Compilers Central Date: Sat, 12 Feb 1994 04:44:43 GMT

Context Free Grammars can be thought of as systems of recursive
equations whose variables are the grammar's non-terminals. These
variables denote context free sets. The equations can be solved, with the
solution expressed in terms of context free expressions. With the right
kind of notation the solution will also form a compact representation of a
parser for the grammar.

The notation to be presented by way of a case study is actually more
powerful than this, it originates from mid '92 and has been described here
briefly on several occastions. With it, we can not only represents
context free sets, but also sets generated by translation grammars and so
is ideally suitable for parsing applications. The form of the solution
will be a linear system of equations that will be called a Quantum
Grammar. One could ideally convert this system into regular expressions
(over algebraic sets called Monoids, for all you theorists out there), but
that's aside from our purpose.

In general, there are three sets: the input set (X), the action set
(Y), and the operator set (D). The operator set has the form:

{ <0|, <1|, ..., <m|, |0>, |1>, ..., |m> }

and the following properties hold:

(A) x y = y x, x d = d x, d y = y d
for all x in X, y in Y, d in D

(B) <m| |n> = 1 if m = n, 0 else

The symbol 1 denotes the empty string, and 0 effectively denotes the fail
string, and they have the following properties (expressed using the
notation of regular expressions):

(C) A 0 = 0 = 0 A, A + 0 = A = 0 + A
A 1 = A = 1 A

(Technically, this discussion amounts to saying that we're considering
regular expressions taken over non-free Monoids with 0). Also, the
following axiom will be invoked:

(D) |0><0| + ... |m><m| = 1

but note that being a regular expression equation, this leads to some
subtle changes in the way regular expressions are defined in terms of
sets. As an exercise given the algebraic relations defined by (A) and
(B), try coming up with a definition of regular expressions in terms of
sets that also satisfies (C), or (C) and (D) (C's easy, but C and D
together is a bit of a challenge).

Now, consider the grammar:

S -> w E d S | i E t S | i E t S e S
| x q E z | o L c
L -> | L S
E -> E b E | u E | E p | x | m E n

In equational form, using the notation regular expressions this can be
presented as the system:

S = w E d S + i E t S + i E t S e S + x q E z + o L c
L = 1 + L S
E = E b E + u E + E p + x + m E n

The input set, X, here is { w, d, i, t, e, x, q, z, o, c, b, u, p, x, m, n }.

It is assumed that some of the terms in this grammar are augmented by a
set of action symbols:

S = w E d S y0 + i E t S y1 + i E t S e S y2 + x q E z y3 + o L c
L = y4 + L S y5
E = E b E y6 + u E y7 + E p y8 + x y9 + m E n

Thus, Y = { y0, ..., y9 }. They are appended to some of the terms here,
but that's not a necessary restriction. These actions can stand for
anything, but are usually understood to stand for code segments in parsing
applications.

Before undergoing the process, the system is first left-factored:

S = w E d S y0 + i E t S (y1 + e S y2) + x q E z y3 + o L c
L = y4 + L S y5
E = E (b E y6 + p y8) + u E y7 + x y9 + m E n

Two simple steps are required to get a parser from this, and that's
all. Step 1: insert operator symbols, step 2: chop up the terms.

In the first step, each non-terminal, other than those in
left-recursive contexts, is bracketed by a balanced pair of operators.
Each occurrence of a given non-terminal should be bracketed by distinct
operators, but two different non-terminals may share the same operators if
neither occurs in the left-recursive context of the other (as is the case
here):

S = w <1| E |1> d <1| S |1> y0
+ i <2| E |2> t <2| S |2> (y1 + e <3| S |3> y2)
+ x q <4| E |4> z y3
+ o <5| L |5> c
L = y4 + L <6| S |6> y5
E = E (b <7| E |7> y6 + p y8)
+ u <8| E |8> y7
+ x y9
+ m <9| E |9> n

Thus, D = { <0|, ..., <9|, |0>, ..., |9> }, here. In addition, we add
another non-terminal, start, and the following equation:

start = <0| S |0>.

In the second step, we define a new set of non-terminals, SF, LF, EF
and make the following (re)definitions:

N redefined as N NF.
NF = { x in (X + Y)*: start ->* w N x }

These sets can be derived directly from the grammar above in the
following way: (i) append NF to each term in the equation for N
(except the new start symbol):

start = <0| S |0>.
S = w <1| E |1> d <1| S |1> y0 SF
+ i <2| E |2> t <2| S |2> (y1 SF + e <3| S |3> y2 SF)
+ x q <4| E |4> z y3 SF
+ o <5| L |5> c SF
L = y4 LF
+ L <6| S |6> y5 LF
E = E (b <7| E |7> y6 EF + p y8 EF)
+ u <8| E |8> y7 EF
+ x y9 EF
+ m <9| E |9> n EF

and (ii) starting at the right of each term, chop the term of past the
last non-terminal, N, and place the remainder in the equation for NF. For
example, |0> is removed from the first equation and placed in the equation
for SF, |1> y0 SF is removed from the second equation, placed in SF's
equation, and then |1> d <1| S is chopped off and placed in the equation
for EF. The final result of this process is:

start = <0| S.
S = w <1| E
+ i <2| E
+ x q <4| E
+ o <5| L
SF = |0>
+ |1> y0 SF
+ |2> (y1 SF + e <3| S)
+ |3> y2 SF
+ |6> y5 LF
E = E
+ u <8| E
+ x y9 EF
+ m <9| E
EF = |1> d <1| S
+ |2> t <2| S
+ |4> z y3 SF
+ |7> y6 EF
+ |8> y7 EF
+ |9> n EF
+ (b <7| E + p y8 EF)
L = y4 LF
+ L
LF = |5> c SF
+ <6| S

The spurious left-recursions are then eliminated and the result is:

start = <0| S.
S = w <1| E
+ i <2| E
+ x q <4| E
+ o <5| L
SF = |0>
+ |1> y0 SF
+ |2> (y1 SF + e <3| S)
+ |3> y2 SF
+ |6> y5 LF
E = u <8| E
+ x y9 EF
+ m <9| E
EF = |1> d <1| S
+ |2> t <2| S
+ |4> z y3 SF
+ |7> y6 EF
+ |8> y7 EF
+ |9> n EF
+ (b <7| E + p y8 EF)
L = y4 LF
LF = |5> c SF
+ <6| S

This is a Quantum Grammar, essentially a listing of a push down
transducer in equational form. Each equation can be interpreted as the
specification of a state, the <m| operator means "push m", the |m>
operator means "pop and test for equality to m". For instance, the state
representing SF is:

SF: Input Stack Context Stack Writes Actions Next State
|0> (end) <0| - accept -
|1> y0 SF - <1| - y0 SF
|2> y1 SF - <2| - y1 SF
|2> e <3| S e <2| <3| - S
|3> y2 SF - <3| - y2 SF
|6> y5 LF - <6| - y5 LF

On the surface, parts of this grammar may look non-deterministic, such
as state LF. But the determinism can be brought out by substitution and
algebra, e.g.,

LF = |5> c SF + <6| S
= c |5> SF
+ w <6| <1| E + i <6| <2| E + x q <6| <4| E + o <6| <5| L

In the other cases, the non-determinism is inherent,

In equation SF:
|2> y1 SF vs. |2> e <3| S

In equation EF:
{ |7> y6 EF, |8> y7 EF } vs. { b <7| E, p y8 EF }

The first non-determinism corresponds to the ambiguity:

i E t (i E t S e S) vs. i E t (i E t S) e S

(the famous dangling-else ambiguity), and the second set corresponds
to cases such as:

E b (E b E) vs. (E b E) b E

(the precedence conflicts).

To handle the first case, we can impose a disambiguation rule, e.g.
that |2> e <3| S take precedence over |2> y1 SF.

To handle the latter set of conflicts, the same method can be used and
an action table can be created, which amounts to a 2 x 2 table filled by
whatever disambiguation rules exist. In this case present case, the
disambiguation rules are none other than the precedence rules of the
various operators in the syntax.

The contexts corresponding to the first set of terms (the reduce
contexts) are respectively:

|7> y6 EF: E -> E b E .
|8> y7 EF: E -> u E .

The contexts corresponding to the second set of terms (the shift
contexts) are:
b <7| E: E -> E . b E
p y8 EF: E -> E . p

each table entry is filled out by combining a shift context and reduce
context and deciding which way to disambiguate. For example, the |7> vs b
entry corresponds to the context:

E -> E b E . b E

and is filled out with a shift action if b is right-associative, or a
reduce otherwise. The entire table is listed below:

Context |7> y6 EF |8> y7 EF
b <7| E E b E . b E u E . b E
p y8 EF E b E . p u E . p

Note that this treatment is similar in spirit to how precedence
grammars are handled, but substantially different in detail. I claim that
taking tables by Context vs. Lexical Item is actually the more natural way
to be approaching precedence relations and that the customary approach to
precedence grammars bungles up this insight. For one thing, you don't get
a plethora of extraneous and impossible combinations (e.g. a p vs. u entry
in a precedence table), but only the possible ones.

Another thing to note is that the conflicts are easy to resolve because
they bear such a direct relation to the original syntax, which in turns
arises from the fact that the transformation made from the syntax was so
direct.

On a more theoretical side, also note that the general parsing problem
in light of the developments above can be cast as the problem of finding
the intersection of w Y* with the Translation Expression, start, where w
is the input word, and w Y* (in view of the commutation relations (A)) the
set of interleaves of actions in the input. The result of the parsing
process will be the determination of the set, F, where:

(w Y*) & (start) = w F

This set, F, is in general a context free set compactly representing the
set of all parses of w, but for deterministic parsers is a regular
expression representing one string (the parse sequence).

Now, after all that I'm sure you're dying to see actual code (ha,
and you thought it was all a bluff!). Well, it's below. When the sample
input, w:
w x b x b x d o
x q u x z
x q x z
c

is given, the result will be that:

F = y9 y9 y6 y9 y6 y4 y9 y7 y3 y5 y9 y3 y5 y0

will be output. And yes, it did take less than an hour to write and
test. I challenge you to find any inconsistencies in it.

Stay tuned for future developments, there's more to come...

---------------------------------------------------------------------------
/* Must use an ANSI-C compiler to compile this program. */
#include <stdio.h>
#include <ctype.h>
#include <stdarg.h>

static unsigned ERRORS = 0;
void ERROR(const char *Format, ...) {
va_list AP;
va_start(AP, Format);
vfprintf(stderr, Format, AP); fputc('\n', stderr);
va_end(AP);
if (++ERRORS >= 24)
fprintf(stderr, "Too many errors. Aborting.\n"), exit(1);
}

void FATAL(const char *Format, ...) {
va_list AP;
va_start(AP, Format);
vfprintf(stderr, Format, AP); fputc('\n', stderr);
va_end(AP);
exit(1);
}

typedef unsigned char Lexical;
Lexical Scan(void) {
int Ch;
do Ch = getchar(); while (isspace(Ch));
switch (Ch) {
case EOF: return 0;
case 'w': case 'd': case 'i': case 't': case 'e':
case 'q': case 'z': case 'o': case 'c': case 'b':
case 'u': case 'p': case 'x': case 'm': case 'n': return Ch;
default: return 'x'; /* Anything else is considered to be x. */
}
}

typedef unsigned char Context;
#define STACK_MAX 0x10
Context Stack[STACK_MAX], *SP;
void PUSH(Context Tag, ...) {
if (SP >= Stack + STACK_MAX) FATAL("Syntax too complex.");
*SP++ = Tag;
}

void Parse(void) {
Lexical X;
SP = Stack, *SP++ = 0;
X = Scan();
qS:
switch (X) {
case 'w': PUSH(1); X = Scan(); goto qE;
case 'i': PUSH(2); X = Scan(); goto qE;
case 'o': PUSH(5); X = Scan(); goto qL;
default: ERROR("Missing 'x'."); goto HaveX;
case 'x': X = Scan();
HaveX:
if (X != 'q') ERROR("Missing 'q'."); else X = Scan();
PUSH(4);
goto qE;
}
qSF:
switch (*--SP) {
case 0:
if (X != 0) ERROR("Extra symbols present in input.");
return;
case 1: printf(" y0"); goto qSF;
case 2:
/* Incorporates the precedence |2> e <3| S over |2> y1 SF */
if (X == 'e') { X = Scan(), PUSH(3); goto qS; }
printf(" y1");
goto qSF;
case 3: printf(" y2"); goto qSF;
case 6: printf(" y5"); goto qLF;
default: FATAL("Internal error (SF).");
}
qE:
switch (X) {
case 'u': PUSH(8); X = Scan(); goto qE;
case 'm': PUSH(9); X = Scan(); goto qE;
default: ERROR("Missing 'x'."); goto HaveX2;
case 'x':
X = Scan();
HaveX2:
printf(" y9");
goto qEF;
}
qEF:
/* The action table here is incorporated directly into the code,
with the following precedences:
E b E b E = (E b E) b E, E b E p = E b (E p)
u E b E = (u E) b E, u E p = (u E) p
*/
switch (*--SP) {
case 7: /* |7> y6 EF has precedence only over b <7| E. */
if (X == 'p') goto ShiftP;
printf(" y6");
goto qEF;
case 8: /* |8> y7 EF has precedence. */
printf(" y7");
goto qEF;
}
switch (X) {
case 'b': ShiftB: SP++, X = Scan(); PUSH(7); goto qE;
case 'p': ShiftP: SP++, X = Scan(); printf(" y8"); goto qEF;
}
switch (*SP) {
case 1:
if (X != 'd') ERROR("Missing 'd' in w ... d.");
else X = Scan();
SP++;
goto qS;
case 2:
if (X != 't') ERROR("Missing 't' in i ... t.");
else X = Scan();
SP++;
goto qS;
case 4:
if (X != 'z') ERROR("Missing 'z' in x q ... z.");
else X = Scan();
printf(" y3");
goto qSF;
case 9:
if (X != 'n') ERROR("Missing 'n' in m ... n.");
else X = Scan();
goto qEF;
default: FATAL("Internal error (EF).");
}
qL: printf(" y4");
qLF:
if (SP[-1] == 5 && X == 'c') {
SP--, X = Scan(); goto qSF;
}
PUSH(6);
goto qS;
}

void main(void) {
Parse();
if (ERRORS > 0)
fprintf(stderr, "%d error(s) present in input.\n", ERRORS);
exit(0);
}
--

Post a followup to this message