
Newsgroups: comp.compilers,comp.theory

From: markh@csd4.csd.uwm.edu (Mark)

Organization: Computing Services Division, University of Wisconsin - Milwaukee

Date: Mon, 28 Sep 1992 00:32:02 GMT

Keywords: parse, LL(1), yacc

References: 92-09-059 92-09-185

markh@csd4.csd.uwm.edu (Mark) writes:

>In fact, I challenge anyone to find me one that I cannot craft a parser

>for by hand in a short time ...

>[Don't forget to make sure that your hand-written parser diagnoses all

>invalid input, as a yacc parser does. My experience is that the biggest wins

>from using yacc are the automatic diagnosis of ambiguous grammars and the

>extreme ease of adding little tweaks later on. -John]

The moderator brings up a valid point. Without the constraint of proper

error handling, one could easily come up with the default solution to the

problem of writing a parser by hand: a top-down recursive parser with

backtracking.

If anyone is curious, I do have a method. Its approach is entirely

algebraic, and I'll give a cursory description below.

Suppose you have a Context Free Grammar, consisting of the following:

Terminals (T) (lexical items), Non-terminals (N), and Actions (Y). The

steps for deriving a parser are basically the following:

(1) Write the grammar as a recursive system of equations of the form:

Non-terminal = RE(T, N, Y)

where RE is a regular expression in the alphabet T \union N \union Y.

A regular expression over an alphabet, A, can consist of the following:

(1) 0 (roughly corresponding to YACC's "error")

(2) 1 (the "empty string")

(3) any member a in A

(4) concatenation RE RE

(5) alternation RE | RE

(6) iteration RE*

There is an underlying algebra here, for example 0 x = 0 = x 0, 1 x = x = x 1,

x | y = y | x, etc.
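These identities can be applied mechanically. As a small illustration (my own
sketch, not part of the method proper; the tuple encodings are invented for
the demo), here are "smart constructors" that apply the 0 and 1 laws on the
fly:

```python
# Illustrative sketch: regular expressions as nested tuples, with the
# identities 0 x = 0 = x 0, 1 x = x = x 1, and x | y = y | x applied
# by smart constructors.  The encodings are made up for this demo.

ZERO, ONE = ('0',), ('1',)

def sym(a):
    return ('sym', a)

def cat(r, s):                     # concatenation
    if r == ZERO or s == ZERO:
        return ZERO                # 0 x = 0 = x 0
    if r == ONE:
        return s                   # 1 x = x
    if s == ONE:
        return r                   # x 1 = x
    return ('cat', r, s)

def alt(r, s):                     # alternation, normalized so x|y = y|x
    if r == ZERO:
        return s                   # 0 | x = x
    if s == ZERO:
        return r
    return ('alt',) + tuple(sorted((r, s)))

print(cat(ZERO, sym('a')))                                 # ('0',)
print(cat(ONE, sym('a')))                                  # ('sym', 'a')
print(alt(sym('b'), sym('a')) == alt(sym('a'), sym('b')))  # True
```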

(2) Eliminate the recursions from the system. More generally, eliminate all

the non-terminals from the right-hand sides of all the equations. The

result is a SINGLE equation of the form:

Start-symbol = ^ RE(T, X, X+) $.

All the non-terminals are eliminated.

The set X consists of "operators" (x), corresponding to stack symbols of a

push-down automaton. They're called CREATION OPERATORS. The set X+

consists of the complementary operators X+ = { x+: x is in X }, called

DESTRUCTION OPERATORS. They satisfy the following rules:

^ is in X

$ is in X+, with $ = ^+.

x y+ = 0, if x != y

x x+ = 1

sum for all x in X (x+ x) = 1

x t = t x, x+ t = t x+, for all t in N \union T \union Y

The symbols ^ and $ are boundary markers and do not occur within the

regular expression RE(T, X, X+). The interpretations of the X's and X+'s

are the following:

x = "push symbol x"

x+ = "pop symbol x, if x is on the stack's top"

^ = "stack bottom"

$ = "stack is empty"

Every context free language can be represented as a regular expression

over such an algebra.
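The creation/destruction rules are exactly stack semantics, which can be
checked directly. A minimal sketch (my own illustration; the function and
its return conventions are invented for the demo) that reduces a word of
operators by x x+ = 1 and x y+ = 0:

```python
# Illustrative sketch: reduce a word of creation/destruction operators.
# 'x' pushes x; 'x+' pops x.  Implements x x+ = 1 and x y+ = 0 (a
# mismatched pop, or a pop on the empty stack, corresponds to ^ x+ = 0).

def reduce_word(ops):
    """Return 1 if the word reduces to the identity, 0 if it annihilates,
    None if it is a nonempty creation string (reduces to neither)."""
    stack = []
    for op in ops:
        if op.endswith('+'):               # destruction operator: pop
            if not stack or stack[-1] != op[:-1]:
                return 0                   # x y+ = 0 for x != y
            stack.pop()                    # x x+ = 1
        else:                              # creation operator: push
            stack.append(op)
    return 1 if not stack else None

print(reduce_word(['x', 'x+']))                 # 1
print(reduce_word(['x', 'y+']))                 # 0
print(reduce_word(['x', 'x', 'x+', 'x+']))      # 1
```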

(3) Develop a finite automaton from the right-hand-side. When developing a FA,

the arcs can contain one or more of the following:

* zero or one "shifts" (any element of T)

* zero or more "actions" (elements of Y)

* zero or one "pushes" (any element of X)

* zero or one "pops" (any element of X+)

An arc is a REDUCE arc if it contains a pop; it's a SHIFT arc if it

contains no pops but at least one shift; and it's a LAMBDA arc if it

contains no pops or shifts.

A state in the finite automaton is a PURE SHIFT state if its only arcs

are shift arcs, and it's a PURE REDUCE state if its only arcs are reduce

arcs. It's a CRITICAL STATE if it has at least one shift arc and one

reduce arc.

The ^ and $ are ignored when constructing the automaton.

The finite automaton is "deterministic" if its only states are pure

shift and pure reduce states.

Step (3) breaks down as follows:

(3a) Develop a minimal DFA

(3b) Eliminate all the lambda arcs.

(3c) Disambiguate all critical states.

Step (1) is a null-step. The grammar should start out in the form stated.

Step (2) is a rigorous and straightforward step, but also involves several

rules of optimization that can be categorized as follows:

(a) Left-recursion elimination

(b) Stack-symbol elimination

(i) Conversion of stack symbol to counter.

(ii) Conversion of stack symbol to variable.

(iii) Equating stack symbols

Step (3b) is rigorous, basically the algorithm for eliminating lambda

transitions from a finite automaton. Step (3c) involves heuristics.

Formalizing (2) and (3c) is tantamount to writing an expert system.

As an example of how the algebra works, take the syntax:

S = (a y0 | b S c y1)*.

where y0 and y1 are actions.

Upon conversion, this becomes:

S = ^ (b x | a y0 | c x+ y1)* $.

where x, x+, ^, and $ commute with a, b, c, y0, and y1; and with the

creation-destruction rules:

^ $ = 1        x $ = 0

^ x+ = 0       x x+ = 1

$ ^ | x+ x = 1

Under this algebra, the regular expression above PRECISELY generates the

given context free language.

The parser is the (essentially) 1-state finite automaton generated from

the regular expression above:

Start state = Q0.

(final state) Q0 -> b x Q0 | a y0 Q0 | c x+ y1 Q0.

Q0 is almost a shift state. It can be made so by factoring out the x+

and $:

Start state = Q0.

(final state) Q0 -> b x Q0 | a y0 Q0 | c Q1.

Q1 -> x+ y1 Q0.
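To make this concrete, here is a hand-coding of that automaton (my own
sketch, with invented names; the actions y0 and y1 just record themselves
so a run can be inspected):

```python
# Illustrative sketch of the parser for S = (a y0 | b S c y1)*.
# Q0's arcs:  b x Q0  (shift b, push x),  a y0 Q0  (shift a, act),
#             c Q1 with Q1 -> x+ y1 Q0    (shift c, pop x, act).
# Accept at end of input only if the stack is empty ($).

def parse_s(tokens):
    stack, trace = [], []
    for t in tokens:
        if t == 'b':
            stack.append('x')            # b x Q0
        elif t == 'a':
            trace.append('y0')           # a y0 Q0
        elif t == 'c':
            if not stack:
                return None              # pop fails: 0 ("error")
            stack.pop()
            trace.append('y1')           # c x+ y1 Q0
        else:
            return None
    return trace if not stack else None  # $: stack must be empty

print(parse_s(list('a')))        # ['y0']
print(parse_s(list('bac')))      # ['y0', 'y1']
print(parse_s(list('bca')))      # ['y1', 'y0']
print(parse_s(list('ba')))       # None: unmatched b
```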

In practice I fudge on the steps in the following ways:

(2) Not all non-terminals need to be "eliminated", if they do not contribute

to a recursion.

(3b) Not all the lambda arcs need to be eliminated if they don't result

in critical states.

(3c) Some of the critical states are left alone, especially those that

involve precedence relations. Instead, an "action" table is generated

whose rows consist of elements of X+, and columns elements of T.

Another example is the following "expression" syntax:

E = x y0 | u E y1 | E p y2 | E b E y3 | ( E ) y4.

where y0, y1, y2, y3 and y4 are actions.

After conversion, this results in the non-recursive system of equations:

E = ^ E0 $.

E0 = x y0 F | u x0 E0 | ( x1 E0.

F = 1 | x0+ y1 F | p y2 F | b x2 E0 | x2+ y3 F | x1+ ) y4 F.

This leads to a 2-state non-deterministic parser:

Start state = E

E -> x y0 F | u x0 E | ( x1 E

(final state) F -> x0+ y1 F | p y2 F | b x2 E | x2+ y3 F | x1+ ) y4 F.

State E is a shift state, and state F is critical, containing shifts from

p and b, and reduces involving x0+, x1+, and x2+. An action table is

formed involving the shifts (b, p) vs. the destruction operators (x0+ and

x2+). The operator x1+ is not involved, since it is already accompanied

by a shift: ).

The arcs headed by x0+ and x2+ essentially correspond to the following points

in the parse process:

x0+: E -> u E .

x2+: E -> E b E .

The shifts correspond to the parsing points:

b: E -> E . b E

p: E -> E . p

So the action table resolves the conflicts:

x0+, b: (u E) b E vs. u (E b E)

x0+, p: (u E) p vs. u (E p)

x2+, b: (E b E) b E vs. E b (E b E)

x2+, p: (E b E) p vs. E b (E p)

Thus, the traditional operator precedence parser is systematically derived.
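As a concrete rendering, the two states plus the action table can be
hand-coded as below. This is my own sketch; in particular, the entries in
ACTION (postfix p binding tightest, u tighter than b, b left-associative)
are one possible resolution of the conflicts, not something fixed by the
derivation itself:

```python
# Illustrative sketch of the 2-state parser for
#   E = x y0 | u E y1 | E p y2 | E b E y3 | ( E ) y4.
# State E expects an operand; state F follows one.  The action table
# resolves the reduce arcs x0+ y1 and x2+ y3 against the lookahead.

ACTION = {                  # (stack top, lookahead) -> 'reduce' or 'shift'
    ('x0', 'p'): 'shift',   # u (E p): postfix p binds tighter than u
    ('x0', 'b'): 'reduce',  # (u E) b E: u binds tighter than b
    ('x0', ')'): 'reduce',  # close off u E before the parenthesis
    ('x2', 'p'): 'shift',   # E b (E p)
    ('x2', 'b'): 'reduce',  # (E b E) b E: left-associative b
    ('x2', ')'): 'reduce',  # close off E b E before the parenthesis
}

def parse_expr(tokens):
    stack, out, state = ['^'], [], 'E'
    toks = list(tokens) + ['$']
    i = 0
    while True:
        t = toks[i]
        if state == 'E':                        # expecting an operand
            if t == 'x':
                out.append('y0'); state = 'F'   # x y0 F
            elif t == 'u':
                stack.append('x0')              # u x0 E
            elif t == '(':
                stack.append('x1')              # ( x1 E
            else:
                return None
            i += 1
        else:                                   # state F: after an operand
            top = stack[-1]
            if t == ')' and top == 'x1':
                stack.pop(); out.append('y4'); i += 1   # x1+ ) y4 F
            elif ACTION.get((top, t)) == 'reduce':
                stack.pop()                     # x0+ y1 F  or  x2+ y3 F
                out.append('y1' if top == 'x0' else 'y3')
            elif t == 'p':
                out.append('y2'); i += 1        # p y2 F
            elif t == 'b':
                stack.append('x2'); state = 'E'; i += 1  # b x2 E
            elif t == '$':
                if top in ('x0', 'x2'):         # drain remaining reductions
                    stack.pop()
                    out.append('y1' if top == 'x0' else 'y3')
                else:
                    return out if stack == ['^'] else None
            else:
                return None

print(parse_expr(list('uxp')))     # ['y0', 'y2', 'y1']: u (x p)
print(parse_expr(list('xbx')))     # ['y0', 'y0', 'y3']: x b x
print(parse_expr(list('(ux)')))    # ['y0', 'y1', 'y4']: ( u x )
```

The output is the sequence of actions fired, so the bracketings in the
comments can be read off directly from where y1/y3 land relative to y2.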
