Sat, 12 Feb 1994 04:44:43 GMT

Related articles |
---|

Bottom up hand-written parsers in under an hour mark@freenet.uwm.edu (1994-02-12) |

Bottoms-up, another hour has come: Quantum Parsing mark@freenet.uwm.edu (1994-02-17) |

Newsgroups: | comp.compilers |

From: | mark@freenet.uwm.edu (Mark Hopkins) |

Keywords: | parse |

Organization: | Compilers Central |

Date: | Sat, 12 Feb 1994 04:44:43 GMT |

Context Free Grammars can be thought of as systems of recursive

equations whose variables are the grammar's non-terminals. These

variables denote context free sets. The equations can be solved, with the

solution expressed in terms of context free expressions. With the right

kind of notation the solution will also form a compact representation of a

parser for the grammar.

The notation to be presented by way of a case study is actually more

powerful than this, it originates from mid '92 and has been described here

briefly on several occastions. With it, we can not only represents

context free sets, but also sets generated by translation grammars and so

is ideally suitable for parsing applications. The form of the solution

will be a linear system of equations that will be called a Quantum

Grammar. One could ideally convert this system into regular expressions

(over algebraic sets called Monoids, for all you theorists out there), but

that's aside from our purpose.

In general, there are three sets: the input set (X), the action set

(Y), and the operator set (D). The operator set has the form:

{ <0|, <1|, ..., <m|, |0>, |1>, ..., |m> }

and the following properties hold:

(A) x y = y x, x d = d x, d y = y d

for all x in X, y in Y, d in D

(B) <m| |n> = 1 if m = n, 0 else

The symbol 1 denotes the empty string, and 0 effectively denotes the fail

string, and they have the following properties (expressed using the

notation of regular expressions):

(C) A 0 = 0 = 0 A, A + 0 = A = 0 + A

A 1 = A = 1 A

(Technically, this discussion amounts to saying that we're considering

regular expressions taken over non-free Monoids with 0). Also, the

following axiom will be invoked:

(D) |0><0| + ... |m><m| = 1

but note that being a regular expression equation, this leads to some

subtle changes in the way regular expressions are defined in terms of

sets. As an exercise given the algebraic relations defined by (A) and

(B), try coming up with a definition of regular expressions in terms of

sets that also satisfies (C), or (C) and (D) (C's easy, but C and D

together is a bit of a challenge).

Now, consider the grammar:

S -> w E d S | i E t S | i E t S e S

| x q E z | o L c

L -> | L S

E -> E b E | u E | E p | x | m E n

In equational form, using the notation regular expressions this can be

presented as the system:

S = w E d S + i E t S + i E t S e S + x q E z + o L c

L = 1 + L S

E = E b E + u E + E p + x + m E n

The input set, X, here is { w, d, i, t, e, x, q, z, o, c, b, u, p, x, m, n }.

It is assumed that some of the terms in this grammar are augmented by a

set of action symbols:

S = w E d S y0 + i E t S y1 + i E t S e S y2 + x q E z y3 + o L c

L = y4 + L S y5

E = E b E y6 + u E y7 + E p y8 + x y9 + m E n

Thus, Y = { y0, ..., y9 }. They are appended to some of the terms here,

but that's not a necessary restriction. These actions can stand for

anything, but are usually understood to stand for code segments in parsing

applications.

Before undergoing the process, the system is first left-factored:

S = w E d S y0 + i E t S (y1 + e S y2) + x q E z y3 + o L c

L = y4 + L S y5

E = E (b E y6 + p y8) + u E y7 + x y9 + m E n

Two simple steps are required to get a parser from this, and that's

all. Step 1: insert operator symbols, step 2: chop up the terms.

In the first step, each non-terminal, other than those in

left-recursive contexts, is bracketed by a balanced pair of operators.

Each occurrence of a given non-terminal should be bracketed by distinct

operators, but two different non-terminals may share the same operators if

neither occurs in the left-recursive context of the other (as is the case

here):

S = w <1| E |1> d <1| S |1> y0

+ i <2| E |2> t <2| S |2> (y1 + e <3| S |3> y2)

+ x q <4| E |4> z y3

+ o <5| L |5> c

L = y4 + L <6| S |6> y5

E = E (b <7| E |7> y6 + p y8)

+ u <8| E |8> y7

+ x y9

+ m <9| E |9> n

Thus, D = { <0|, ..., <9|, |0>, ..., |9> }, here. In addition, we add

another non-terminal, start, and the following equation:

start = <0| S |0>.

In the second step, we define a new set of non-terminals, SF, LF, EF

and make the following (re)definitions:

N redefined as N NF.

NF = { x in (X + Y)*: start ->* w N x }

These sets can be derived directly from the grammar above in the

following way: (i) append NF to each term in the equation for N

(except the new start symbol):

start = <0| S |0>.

S = w <1| E |1> d <1| S |1> y0 SF

+ i <2| E |2> t <2| S |2> (y1 SF + e <3| S |3> y2 SF)

+ x q <4| E |4> z y3 SF

+ o <5| L |5> c SF

L = y4 LF

+ L <6| S |6> y5 LF

E = E (b <7| E |7> y6 EF + p y8 EF)

+ u <8| E |8> y7 EF

+ x y9 EF

+ m <9| E |9> n EF

and (ii) starting at the right of each term, chop the term of past the

last non-terminal, N, and place the remainder in the equation for NF. For

example, |0> is removed from the first equation and placed in the equation

for SF, |1> y0 SF is removed from the second equation, placed in SF's

equation, and then |1> d <1| S is chopped off and placed in the equation

for EF. The final result of this process is:

start = <0| S.

S = w <1| E

+ i <2| E

+ x q <4| E

+ o <5| L

SF = |0>

+ |1> y0 SF

+ |2> (y1 SF + e <3| S)

+ |3> y2 SF

+ |6> y5 LF

E = E

+ u <8| E

+ x y9 EF

+ m <9| E

EF = |1> d <1| S

+ |2> t <2| S

+ |4> z y3 SF

+ |7> y6 EF

+ |8> y7 EF

+ |9> n EF

+ (b <7| E + p y8 EF)

L = y4 LF

+ L

LF = |5> c SF

+ <6| S

The spurious left-recursions are then eliminated and the result is:

start = <0| S.

S = w <1| E

+ i <2| E

+ x q <4| E

+ o <5| L

SF = |0>

+ |1> y0 SF

+ |2> (y1 SF + e <3| S)

+ |3> y2 SF

+ |6> y5 LF

E = u <8| E

+ x y9 EF

+ m <9| E

EF = |1> d <1| S

+ |2> t <2| S

+ |4> z y3 SF

+ |7> y6 EF

+ |8> y7 EF

+ |9> n EF

+ (b <7| E + p y8 EF)

L = y4 LF

LF = |5> c SF

+ <6| S

This is a Quantum Grammar, essentially a listing of a push down

transducer in equational form. Each equation can be interpreted as the

specification of a state, the <m| operator means "push m", the |m>

operator means "pop and test for equality to m". For instance, the state

representing SF is:

SF: Input Stack Context Stack Writes Actions Next State

|0> (end) <0| - accept -

|1> y0 SF - <1| - y0 SF

|2> y1 SF - <2| - y1 SF

|2> e <3| S e <2| <3| - S

|3> y2 SF - <3| - y2 SF

|6> y5 LF - <6| - y5 LF

On the surface, parts of this grammar may look non-deterministic, such

as state LF. But the determinism can be brought out by substitution and

algebra, e.g.,

LF = |5> c SF + <6| S

= c |5> SF

+ w <6| <1| E + i <6| <2| E + x q <6| <4| E + o <6| <5| L

In the other cases, the non-determinism is inherent,

In equation SF:

|2> y1 SF vs. |2> e <3| S

In equation EF:

{ |7> y6 EF, |8> y7 EF } vs. { b <7| E, p y8 EF }

The first non-determinism corresponds to the ambiguity:

i E t (i E t S e S) vs. i E t (i E t S) e S

(the famous dangling-else ambiguity), and the second set corresponds

to cases such as:

E b (E b E) vs. (E b E) b E

(the precedence conflicts).

To handle the first case, we can impose a disambiguation rule, e.g.

that |2> e <3| S take precedence over |2> y1 SF.

To handle the latter set of conflicts, the same method can be used and

an action table can be created, which amounts to a 2 x 2 table filled by

whatever disambiguation rules exist. In this case present case, the

disambiguation rules are none other than the precedence rules of the

various operators in the syntax.

The contexts corresponding to the first set of terms (the reduce

contexts) are respectively:

|7> y6 EF: E -> E b E .

|8> y7 EF: E -> u E .

The contexts corresponding to the second set of terms (the shift

contexts) are:

b <7| E: E -> E . b E

p y8 EF: E -> E . p

each table entry is filled out by combining a shift context and reduce

context and deciding which way to disambiguate. For example, the |7> vs b

entry corresponds to the context:

E -> E b E . b E

and is filled out with a shift action if b is right-associative, or a

reduce otherwise. The entire table is listed below:

Context |7> y6 EF |8> y7 EF

b <7| E E b E . b E u E . b E

p y8 EF E b E . p u E . p

Note that this treatment is similar in spirit to how precedence

grammars are handled, but substantially different in detail. I claim that

taking tables by Context vs. Lexical Item is actually the more natural way

to be approaching precedence relations and that the customary approach to

precedence grammars bungles up this insight. For one thing, you don't get

a plethora of extraneous and impossible combinations (e.g. a p vs. u entry

in a precedence table), but only the possible ones.

Another thing to note is that the conflicts are easy to resolve because

they bear such a direct relation to the original syntax, which in turns

arises from the fact that the transformation made from the syntax was so

direct.

On a more theoretical side, also note that the general parsing problem

in light of the developments above can be cast as the problem of finding

the intersection of w Y* with the Translation Expression, start, where w

is the input word, and w Y* (in view of the commutation relations (A)) the

set of interleaves of actions in the input. The result of the parsing

process will be the determination of the set, F, where:

(w Y*) & (start) = w F

This set, F, is in general a context free set compactly representing the

set of all parses of w, but for deterministic parsers is a regular

expression representing one string (the parse sequence).

Now, after all that I'm sure you're dying to see actual code (ha,

and you thought it was all a bluff!). Well, it's below. When the sample

input, w:

w x b x b x d o

x q u x z

x q x z

c

is given, the result will be that:

F = y9 y9 y6 y9 y6 y4 y9 y7 y3 y5 y9 y3 y5 y0

will be output. And yes, it did take less than an hour to write and

test. I challenge you to find any inconsistencies in it.

Stay tuned for future developments, there's more to come...

---------------------------------------------------------------------------

/* Must use an ANSI-C compiler to compile this program. */

#include <stdio.h>

#include <ctype.h>

#include <stdarg.h>

static unsigned ERRORS = 0;

void ERROR(const char *Format, ...) {

va_list AP;

va_start(AP, Format);

vfprintf(stderr, Format, AP); fputc('\n', stderr);

va_end(AP);

if (++ERRORS >= 24)

fprintf(stderr, "Too many errors. Aborting.\n"), exit(1);

*}*

void FATAL(const char *Format, ...) {

va_list AP;

va_start(AP, Format);

vfprintf(stderr, Format, AP); fputc('\n', stderr);

va_end(AP);

exit(1);

*}*

typedef unsigned char Lexical;

Lexical Scan(void) {

int Ch;

do Ch = getchar(); while (isspace(Ch));

switch (Ch) {

case EOF: return 0;

case 'w': case 'd': case 'i': case 't': case 'e':

case 'q': case 'z': case 'o': case 'c': case 'b':

case 'u': case 'p': case 'x': case 'm': case 'n': return Ch;

default: return 'x'; /* Anything else is considered to be x. */

}

*}*

typedef unsigned char Context;

#define STACK_MAX 0x10

Context Stack[STACK_MAX], *SP;

void PUSH(Context Tag, ...) {

if (SP >= Stack + STACK_MAX) FATAL("Syntax too complex.");

*SP++ = Tag;

*}*

void Parse(void) {

Lexical X;

SP = Stack, *SP++ = 0;

X = Scan();

qS:

switch (X) {

case 'w': PUSH(1); X = Scan(); goto qE;

case 'i': PUSH(2); X = Scan(); goto qE;

case 'o': PUSH(5); X = Scan(); goto qL;

default: ERROR("Missing 'x'."); goto HaveX;

case 'x': X = Scan();

HaveX:

if (X != 'q') ERROR("Missing 'q'."); else X = Scan();

PUSH(4);

goto qE;

}

qSF:

switch (*--SP) {

case 0:

if (X != 0) ERROR("Extra symbols present in input.");

return;

case 1: printf(" y0"); goto qSF;

case 2:

/* Incorporates the precedence |2> e <3| S over |2> y1 SF */

if (X == 'e') { X = Scan(), PUSH(3); goto qS; }

printf(" y1");

goto qSF;

case 3: printf(" y2"); goto qSF;

case 6: printf(" y5"); goto qLF;

default: FATAL("Internal error (SF).");

}

qE:

switch (X) {

case 'u': PUSH(8); X = Scan(); goto qE;

case 'm': PUSH(9); X = Scan(); goto qE;

default: ERROR("Missing 'x'."); goto HaveX2;

case 'x':

X = Scan();

HaveX2:

printf(" y9");

goto qEF;

}

qEF:

/* The action table here is incorporated directly into the code,

with the following precedences:

E b E b E = (E b E) b E, E b E p = E b (E p)

u E b E = (u E) b E, u E p = (u E) p

*/

switch (*--SP) {

case 7: /* |7> y6 EF has precedence only over b <7| E. */

if (X == 'p') goto ShiftP;

printf(" y6");

goto qEF;

case 8: /* |8> y7 EF has precedence. */

printf(" y7");

goto qEF;

}

switch (X) {

case 'b': ShiftB: SP++, X = Scan(); PUSH(7); goto qE;

case 'p': ShiftP: SP++, X = Scan(); printf(" y8"); goto qEF;

}

switch (*SP) {

case 1:

if (X != 'd') ERROR("Missing 'd' in w ... d.");

else X = Scan();

SP++;

goto qS;

case 2:

if (X != 't') ERROR("Missing 't' in i ... t.");

else X = Scan();

SP++;

goto qS;

case 4:

if (X != 'z') ERROR("Missing 'z' in x q ... z.");

else X = Scan();

printf(" y3");

goto qSF;

case 9:

if (X != 'n') ERROR("Missing 'n' in m ... n.");

else X = Scan();

goto qEF;

default: FATAL("Internal error (EF).");

}

qL: printf(" y4");

qLF:

if (SP[-1] == 5 && X == 'c') {

SP--, X = Scan(); goto qSF;

}

PUSH(6);

goto qS;

*}*

void main(void) {

Parse();

if (ERRORS > 0)

fprintf(stderr, "%d error(s) present in input.\n", ERRORS);

exit(0);

*}*

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.