Related articles
Bottom-up handwritten parsers in under an hour  mark@freenet.uwm.edu (1994-02-12)
Bottoms-up, another hour has come: Quantum Parsing  mark@freenet.uwm.edu (1994-02-17)
Newsgroups:  comp.compilers 
From:  mark@freenet.uwm.edu (Mark Hopkins) 
Keywords:  parse 
Organization:  Compilers Central 
References:  94-02-079 
Date:  Thu, 17 Feb 1994 09:24:30 GMT 
(0) Introduction
This is a follow-up to that other article, "Bottom-up parsing by
hand, in under an hour" (at which point I could have sworn I heard some
people saying D.O.C.). So if you just joined in, refer to that article
for details.
A reference list in case you're interested:
Ref: "Bottom-up parsing by hand...", me, comp.compilers, Feb 1994
     "The Equational Approach", me, comp.compilers, comp.theory, Oct 1993
     "Quantum Parsing", me, comp.compilers, Aug 1992
     "Introduction to Formal Language Theory", Harrison, M.A., 1978
Harrison discusses representation theorems in Chapter 10 of his book,
particularly the Chomsky-Schuetzenberger Representation Theorem, which
helps shed light on some of the stuff we're doing here. Schuetzenberger
really got off on this kind of stuff and on monoids and algebra, so he was
probably thinking along the same lines, but for some inexplicable reason
just never made the leap to the Context Free Expression notation we're
using here (nor for that matter did anyone else...)
(1) Operator Representations
In the previous article, we introduced the operator bra-ket notation
(borrowed from Quantum Physics), <m, m>, and gave them the respective
interpretations:
The Stack Representation
<m: Push m.
m>: Pop and test for m.
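The stack interpretation is immediate to code up; a minimal sketch (the class and method names are mine, not from the article):

```python
class StackRep:
    """Stack interpretation of the bra-ket operators <m and m>."""
    def __init__(self):
        self.stack = []

    def bra(self, m):       # <m : push m
        self.stack.append(m)

    def ket(self, m):       # m> : pop and test for m
        return bool(self.stack) and self.stack.pop() == m

s = StackRep()
s.bra(1); s.bra(2)
assert s.ket(2) and s.ket(1)     # <1 <2 2> 1> is balanced
assert not s.ket(1)              # nothing left to pop
```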
A point to emphasize, however, is that these are not merely abbreviations
for stack operations, but are abstract algebraic operators in their own
right. This point can be brought to bear by pointing out other possible
representations:
The Counter Representation
<0: Counter = 0
<m: Counter = M*(Counter + 1) + m
where M is the largest operator number.
0>: Test for Counter == 0
m>: Test for (Counter - m)%M == 0, and Counter = (Counter - m)/M - 1. 
In a 32-bit representation, this leads to an equivalent stack depth of 
2^32 for M = 1, and about 30 for M = 2. So this representation is
effectively limited in its application to M < 3.
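Here is a sketch of the counter representation with the pop rule spelled out as a test-and-unwind; the randomized equivalence check against an explicit stack is my own addition:

```python
import random

M = 2  # largest operator number

class CounterRep:
    """Counter interpretation of <m / m>, equivalent to a stack for m in 1..M."""
    def __init__(self):
        self.c = 0                       # <0 : Counter = 0

    def bra(self, m):                    # <m : Counter = M*(Counter + 1) + m
        self.c = M * (self.c + 1) + m

    def ket(self, m):                    # m> : test, then unwind one level
        if self.c == 0 or (self.c - m) % M != 0:
            return False
        self.c = (self.c - m) // M - 1
        return True

# Equivalence check against a real stack on random bra/ket sequences.
random.seed(0)
for _ in range(200):
    cr, stack = CounterRep(), []
    for _ in range(20):
        m = random.randint(1, M)
        if random.random() < 0.5:
            cr.bra(m); stack.append(m)
        else:
            ok = bool(stack) and stack[-1] == m
            assert cr.ket(m) == ok       # same accept/reject behavior
            if ok:
                stack.pop()
```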
Another representation, of a theoretical nature, is the Automaton
Representation. Take the quantum grammar:
S = <0 T
T = <1 p T + 1> q T + <2 b T + 2> d T + 0>
Underlying this is a 2-state finite automaton:
S = T
T = p T + q T + b T + d T + 1

    [S] ---> [[T]]
              | ^
              \_/  p, q, b, d
Replicate this automaton, creating an infinite M-ary tree of copies of it. 
Then interpret the operators as follows:
<0: Go to the root level
0>: Test for root level.
<m: Go to branch #m below the current level.
m>: Test for branch #m and return to parent level.
With this interpretation, the term <1 p T will become a transition from
state T to state T one level down on the left branch. The term 1> q T
will return from the left branch back to T. The resulting automaton 
looks like this:
                        [S]
                         |
                         v
                       [[T]]
                     * ^   ^ *
                    /  |   |  \
                 p |  q|   |d  | b
                   v   |   |   v
                [T]----+   +----[T]
              * ^   ^ *       * ^   ^ *
             p| q|  |d |b    p| q|  |d |b
              v  |  |  v      v  |  |  v
            [T]--+  +--[T]  [T]--+  +--[T]
            ...      ...    ...      ...
Notice that only one S state is in use, so none of the other copies of
it are shown.
In earlier references, I called it an Infinite Automaton. A better
name for it, perhaps, would be: Quantum Automaton.
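The quantum automaton never has to be built explicitly: an operator stack simply records which node of the infinite tree the automaton is sitting at. A recognizer sketch for the grammar above (my own Python formulation):

```python
def accepts(word):
    """Recognizer for  S = <0 T,  T = <1 p T + 1> q T + <2 b T + 2> d T + 0>.

    The infinite tree of T-copies is never built: the operator stack
    just records which node of the tree the automaton is sitting at.
    """
    stack = [0]                   # <0 : go to the root level
    push = {'p': 1, 'b': 2}       # <1 p ...   and   <2 b ...
    pop  = {'q': 1, 'd': 2}       # 1> q ...   and   2> d ...
    for ch in word:
        if ch in push:
            stack.append(push[ch])
        elif ch in pop and stack[-1] == pop[ch]:
            stack.pop()           # return to the parent level
        else:
            return False
    return stack == [0]           # 0> : test for root level

assert accepts("pq") and accepts("pbdq") and accepts("pqbd")
assert not accepts("pd") and not accepts("q")
```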
Enough theory for today. In this article, we will be graduating from
the Toy language of the last article to a semi-respectable language: a 
C subset. It's still somewhat of a toy (no function calls), but you
can actually use it as an interpreter (by the time we're through here).
This language will have 2 types (int, char), dynamic unbounded arrays
(because I was too lazy to think about bounds checking), and pointers.
As for statements, it will have the usual complement, except for the
switch/case/default (same reason for omission). Expressions: all the
operators except typecasting and fields (because there's no structures
or unions).
No functions. That you'll have to wait for in the next language.
The language has the following grammar, presented in Equational form:
start = (S + D)+
D = Sp Dc ("," Dc)* ";"
Sp = "int" + "char"
Dc = X + "*" Dc + Dc "[" "]"
S = "if" "(" E ")" S ["else" S]
+ "while" "(" E ")" S
+ "do" S "while" "(" E ")" ";"
+ "for" "(" [E] ";" [E] ";" [E] ")" S
+ "{" S* "}"
+ X ":" S
+ "goto" X ";"
+ "break" ";"
+ "continue" ";"
+ [E] ";"
E = E "?" E ":" E
+ E "," E
+ E "" E
+ E "&&" E
+ E b E
+ u E
+ E p
+ E "[" E "]"
+ "(" E ")"
+ X
+ NUM
+ CHAR
+ STR
X: Identifier
NUM: Integer number
CHAR: Character constant
STR: String constant
b: Binary, infix operators, including the following:
== != < > <= >= << >> | ^ & + - * / %
= <<= >>= |= ^= &= += -= *= /= %=
u: Unary, prefix operators, including the following:
! ~ + - ++ -- & *
p: Postfix operators, including: ++ --
Note that we're using regular expression notation on the right hand
sides of the rules:
1 means empty string,
[A] = 1 + A,
and A* = 1 + A + A A + ...
This only serves to emphasise the point of view here that Context Free
Grammars are recursive systems of equations between Context Free
Expressions. Ironically, this view of grammars as equation systems was
the one originally adopted in the early days of Formal Language Theory,
but for some inexplicable reason it was largely dropped (with the
effect of setting the discovery of the methods described here back by over
30 years). Well, no use crying over spilt milk; I wasn't around then, so it
makes no difference to me...
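For what it's worth, the notation can be read off directly as operations on languages; a sketch over finite approximations (the truncation bound on A* is my own device, just to keep the sets finite):

```python
def concat(a, b): return {x + y for x in a for y in b}
def opt(a):       return {""} | a                # [A] = 1 + A
def star(a, n=3):                                # A* = 1 + A + A A + ...
    out, cur = {""}, {""}
    for _ in range(n):                           # truncated at n repetitions
        cur = concat(cur, a)
        out |= cur
    return out

# [E] ";"  with E = {"x"}: the optional expression before the semicolon
assert concat(opt({"x"}), {";"}) == {";", "x;"}
# ("," Dc)*  with Dc = {"x"}: zero or more comma-separated declarators
assert star(concat({","}, {"x"}), n=2) == {"", ",x", ",x,x"}
```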
As you might have guessed, D, S and E stand respectively for Declaration,
Statement, and Expression. In addition, you have Sp ("type specifier"),
and Dc ("declarator"). All the operators have the exact same precedence as
in C.
As in the prior article, we'll augment these rules by an Action 
Alphabet: Y = { y0, y1, ..., y42 }. This alphabet, also mentioned in the
article, commutes with the input alphabet, X: x y = y x for all x in X,
y in Y. The result of the augmentation:
start = (S + D)+
D = Sp y0 Dc y1 ("," Dc y1)* ";"
Sp = "int" y2 + "char" y3
Dc = X y4 + "*" Dc y5 + Dc "[" "]" y6
S = "if" "(" E y7 ")" S [y8 + "else" y9 S y10]
+ "while" "(" y11 E y12 ")" S y13
+ "do" y14 S y15 "while" "(" E y16 ")" ";"
+ "for" "(" [E] y17 ";" (E y18 + y19) ";" [E] y20 ")" S y21
+ "{" S* "}"
+ X ":" y22 S
+ "goto" X ";" y23
+ "break" ";" y24
+ "continue" ";" y25
+ E ";" y26
+ ";"
E = E y27 "?" E y28 ":" E y29
+ E "," y30 E
+ E "" y31 E y32
+ E "&&" y33 E y34
+ E b E y35
+ u E y36
+ E p y37
+ E "[" E "]" y38
+ "(" E ")"
+ X y39
+ NUM y40
+ CHAR y41
+ STR y42
I stuck them where I did because I'm thinking ahead to the next article
(the Translation Grammar), where the y's will be defined and the source
code given. The effect of y14, for instance, will be to mark the place
for a "continue" statement inside the loop to go to. As mentioned in the
last article, this notation is powerful enough to represent any language
acceptable to a push-down transducer: the Simple Syntax Directed 
Translation Languages. Therefore, it constitutes a rather elegant means
of representing Translation Expressions as well, not just Context Free
Expressions (and I haven't even pulled out the big toys).
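Since every y commutes with every x, a mixed word is fully determined by its input projection and its action projection. A quick sketch (the two sample interleavings are mine, built from the E = E "&&" y33 E y34 and X y39 rules):

```python
def projections(word, inputs, actions):
    """Split a mixed word into its (input, action) projections.

    Since x y = y x for every x in the input alphabet and y in the
    action alphabet, any two interleavings with the same projections
    denote the same element.
    """
    return (tuple(s for s in word if s in inputs),
            tuple(s for s in word if s in actions))

X = {"a", "&&", "b"}
Y = {"y33", "y34", "y39"}
# two interleavings of the same translation of  a && b :
w1 = ["a", "y39", "&&", "y33", "b", "y39", "y34"]
w2 = ["a", "&&", "b", "y39", "y33", "y39", "y34"]
assert projections(w1, X, Y) == projections(w2, X, Y)
```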
Next, the operator alphabet, D = { <0, ..., <m, 0>, ..., m> } is
introduced (D stands for Dyck, see the Harrison Reference). Note that
in the following, I didn't even bother to enclose any of the
nonterminals in the equation for D by operators. They all "inherit"
the <0, 0> operator pair from D (which gets it from the first rule).
The area enclosed within an operator pair, <m, m> will be called
Context #m. When we carry out conflict resolution, our disambiguation
rules will be conditioned on the context they occur in (which almost
sounds like a tautology).
start = (<0 (S + D) 0>)+
D = Sp y0 Dc y1 ("," Dc y1)* ";"
Sp = "int" y2 + "char" y3
Dc = X y4 + "*" <1 Dc 1> y5 + Dc "[" "]" y6
S = "if" "(" <2 E 2> y7 ")" <3 S 3> [y8 + "else" y9 <4 S 4> y10]
+ "while" "(" y11 <5 E 5> y12 ")" <6 S 6> y13
+ "do" y14 <7 S 7> y15 "while" "(" <8 E 8> y16 ")" ";"
+ "for" "(" [<9 E 9>] y17 ";"
(<10 E 10> y18 + y19) ";"
[<11 E 11>] y20 ")" <12 S 12> y21
+ "{" (<13 S 13>)* "}"
+ X ":" y22 S
+ "goto" X ";" y23
+ "break" ";" y24
+ "continue" ";" y25
+ E ";" y26
+ ";"
E = E y27 "?" <14 E 14> y28 ":" <15 E 15> y29
+ E "," y30 <16 E 16>
+ E "" y31 <17 E 17> y32
+ E "&&" y33 <18 E 18> y34
+ E b <19 E 19> y35
+ u <20 E 20> y36
+ E p y37
+ E "[" <21 E 21> "]" y38
+ "(" <22 E 22> ")"
+ X y39
+ NUM y40
+ CHAR y41
+ STR y42
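Tracing the contexts by hand for a small input like `if (x) ;` gives the operator sequence <0 <2 2> <3 3> 0> (the x is parsed in context #2, the empty statement in context #3). A sketch of the balance check, using the stack representation from section (1):

```python
def balanced(ops):
    """Check that a sequence of '<m' and 'm>' operators nests properly."""
    stack = []
    for op in ops:
        if op.startswith('<'):                   # <m : push m
            stack.append(int(op[1:]))
        elif not stack or stack.pop() != int(op[:-1]):
            return False                         # m> : pop and test for m
    return not stack

# if (x) ;   -->   <0 "if" "(" <2 [x y39] 2> y7 ")" <3 [";"] 3> y8 0>
assert balanced(['<0', '<2', '2>', '<3', '3>', '0>'])
assert not balanced(['<0', '<2', '3>'])
```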
I skipped the left-recursion removal step performed first in the
previous article. Ideally, one should carry out an entire left-recursion
removal process AFTER the action symbols are inserted, but BEFORE the
operator symbols are added. So in actual fact, one will have effectively
converted the syntax to a Greibach normal form before operator symbols are
added. But it's not absolutely essential to do so.
Roughly speaking, the operator symbols are inserted in such a way as to
guarantee that all occurrences of a given nonterminal with distinct
"followsets" will be enclosed within a distinct context, including the
leftrecursive occurrences, which are considered to "inherit" their
contexts from their parent nonterminal. The one apparent exception to
the general rule is the two occurrences of Dc, which both inherit context
#0. However, they have the same follow set: y1 ("," Dc y1)* ";"
Conversion to a Quantum Grammar is a little bit more difficult than in
the last article because this time we're faced with regular expressions on
the right hand sides. As before, we append N_F to each term on the right
hand side of the equation for the nonterminal N (which introduces new
nonterminals, D_F, Sp_F, Dc_F, S_F, E_F). But now, instead of chopping
off terms after the nonterminals, we now have to construct all the 
possible nonterminal to nonterminal paths that occur with the regular
expression on the right hand side. For instance, in the equation for D,
these paths would correspond to the following:
D = Sp y0 Dc y1 ("," Dc y1)* ";" D_F
Paths: Sp y0 Dc
Dc y1 "," Dc
Dc y1 ";" D_F
Other than that, the basic rule would be the same, we'd place each of the
resulting pieces into their appropriate equations:
Sp_F = ... + y0 Dc
Dc_F = ... + y1 "," Dc
Dc_F = ... + y1 ";" D_F
and the resulting equation for D becomes just: D = Sp.
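The path construction is mechanical enough to automate. Below is a sketch that computes first/last/follow position sets (a Glushkov-style construction, my choice of method) for the D rule and extracts exactly the three paths listed above; it assumes every terminal-only cycle passes through a nonterminal, which holds here:

```python
import itertools

# Regex AST: ('sym', name), ('seq', [nodes]), ('star', node).
ids, SYM, FOLLOW = itertools.count(), {}, {}

def glushkov(node):
    """Return (nullable, first, last) position sets; fill FOLLOW."""
    kind = node[0]
    if kind == 'sym':
        p = next(ids)
        SYM[p], FOLLOW[p] = node[1], set()
        return False, {p}, {p}
    if kind == 'star':
        _, first, last = glushkov(node[1])
        for q in last:
            FOLLOW[q] |= first                  # loop back around the star
        return True, first, last
    null, first, last = True, set(), set()      # 'seq'
    for child in node[1]:
        cn, cf, cl = glushkov(child)
        for q in last:
            FOLLOW[q] |= cf
        if null:
            first |= cf
        last = cl | (last if cn else set())
        null = null and cn
    return null, first, last

NONTERMS = {'Sp', 'Dc'}

def paths(origin, starts, lasts, end_name):
    """Collect origin --terminals--> next-nonterminal (or end_name) paths."""
    found, work = set(), [(p, ()) for p in starts]
    while work:
        p, labels = work.pop()
        if SYM[p] in NONTERMS:
            found.add((origin, labels, SYM[p]))
            continue                            # paths stop at a nonterminal
        labels += (SYM[p],)
        if p in lasts:
            found.add((origin, labels, end_name))
        work.extend((q, labels) for q in FOLLOW[p])
    return found

# D = Sp y0 Dc y1 ("," Dc y1)* ";"
rhs = ('seq', [('sym', 'Sp'), ('sym', 'y0'), ('sym', 'Dc'), ('sym', 'y1'),
               ('star', ('seq', [('sym', ','), ('sym', 'Dc'), ('sym', 'y1')])),
               ('sym', ';')])
_, first, last = glushkov(rhs)

result = set()
for p in SYM:
    if SYM[p] in NONTERMS:
        result |= paths(SYM[p], FOLLOW[p], last, 'D_F')

assert result == {('Sp', ('y0',), 'Dc'),
                  ('Dc', ('y1', ','), 'Dc'),
                  ('Dc', ('y1', ';'), 'D_F')}
```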
As a result of this process, we get the following Quantum Grammar:
start = <0 (S + D)
D = Sp = "int" y2 Sp_F + "char" y3 Sp_F
D_F = 0><0 (S + D)
+ 0>
Sp_F = y0 Dc
Dc = X y4 Dc_F
+ "*" <1 Dc
Dc_F = y1 ("," Dc + ";" D_F)
+ 1> y5 Dc_F
+ "[" "]" y6 Dc_F
S = "if" "(" <2 E
+ "while" "(" y11 <5 E
+ "do" y14 <7 S
+ "for" "(" (<9 E + F1
+ "{" S1
+ X ":" y22 S
S_F = 0>
+ 0><0 (S + D)
+ 3> y8 S_F + 3> "else" y9 <4 S + 4> y10 S_F
+ 6> y13 S_F
+ 7> y15 "while" "(" <8 E
+ 12> y21 S_F
+ 13> S1
+ "goto" X ";" y23 S_F
+ "break" ";" y24 S_F
+ "continue" ";" y25 S_F
+ E
+ ";" S_F
F1 = y17 ";" <10 E + y19 F2
F2 = ";" (<11 E + F3)
F3 = y20 ")" <12 S
S1 = <13 S + "}" S_F
E = u <20 E
+ "(" <22 E
+ X y39 E_F
+ NUM y40 E_F
+ CHAR y41 E_F
+ STR y42 E_F
E_F = 2> y7 ")" <3 S
+ 5> y12 ")" <6 S
+ 8> y16 ")" ";" S_F
+ 9> F1
+ 10> y18 F2
+ 11> F3
+ ";" y26 S_F
+ 14> y28 ":" <15 E
+ 15> y29 E_F
+ 16> E_F
+ 17> y32 E_F
+ 18> y34 E_F
+ 19> y35 E_F 
+ 20> y36 E_F
+ 21> "]" y38 E_F
+ 22> ")" E_F
+ y27 "?" <14 E
+ "," y30 <16 E
+ "" y31 <17 E
+ "&&" y33 <18 E
+ b <19 E
+ p y37 E_F
+ "[" <21 E
The terms F1, F2, F3, and S1 were added for convenience, to
avoid duplication.
All the spurious left-recursions have been removed, such as the 
following:
S_F = S_F: from S = ... + X ":" y22 S + ...
E = E : from E = E p y37 + E b <19 E 19> y35 + ...
In the following table, the contexts each of the nonterminals are
defined in, and their inherited contexts, are listed:

     explicit contexts       inherited contexts (from)
D    0                       -
Sp   -                       0 (D)
Dc   1                       0 (D), 1 (Dc)
S    0, 3-4, 6-7, 12-13      -
E    2, 5, 8-11, 14-21       0, 3-4, 6-7, 12-13 (S),
                             0, 2-21 (E)
Except for the trivial left-recursions, 
Dc = Dc ... + ...,
E = E ... + ...,
there are no conflicting contexts. But note that each of these
occurrences will be directly responsible for precedence conflicts.
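As a concrete check, the declaration fragment of the quantum grammar (D, Sp, Sp_F, Dc, Dc_F, D_F) runs directly as a loop over states with an explicit operator stack. Note how the choice in Dc_F is conditioned on the context: in context #1 we pop (1> y5) before reading y1. The tokenization and function name are my own sketch, not the source code promised for the next article:

```python
def parse_declaration(tokens):
    """Run the declaration fragment of the quantum grammar.

    States: Sp, Sp_F, Dc, Dc_F, D_F; contexts #0 and #1 live on the
    operator stack.  Assumes a single well-formed declaration; returns
    the action sequence (y0..y6) that would drive the translation.
    """
    toks = tokens + ['$']
    i, stack, actions, state = 0, [0], [], 'Sp'   # <0 pushed at the start
    while True:
        t = toks[i]
        if state == 'Sp':                 # Sp = "int" y2 Sp_F + "char" y3 Sp_F
            actions.append({'int': 'y2', 'char': 'y3'}[t])
            i, state = i + 1, 'Sp_F'
        elif state == 'Sp_F':             # Sp_F = y0 Dc
            actions.append('y0')
            state = 'Dc'
        elif state == 'Dc':               # Dc = X y4 Dc_F + "*" <1 Dc
            if t == '*':
                stack.append(1)           # <1
                i += 1
            else:
                actions.append('y4')      # X y4
                i, state = i + 1, 'Dc_F'
        elif state == 'Dc_F':
            if stack[-1] == 1:            # 1> y5 Dc_F : context #1 pops first
                stack.pop()
                actions.append('y5')
            elif t == '[':                # "[" "]" y6 Dc_F
                i += 2
                actions.append('y6')
            else:                         # y1 ("," Dc + ";" D_F)
                actions.append('y1')
                i, state = i + 1, ('Dc' if t == ',' else 'D_F')
        else:                             # D_F = 0>
            assert stack.pop() == 0 and toks[i] == '$'
            return actions

assert parse_declaration(['int', '*', 'x', ';']) == \
    ['y2', 'y0', 'y4', 'y5', 'y1']
assert parse_declaration(['char', 'x', '[', ']', ',', '*', 'y', ';']) == \
    ['y3', 'y0', 'y4', 'y6', 'y1', 'y4', 'y5', 'y1']
```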
More to follow soon...
