Related articles |
---|
Case Study 2: How to build your own stack frames. mark@omnifest.uwm.edu (1995-08-26) |
Newsgroups: | comp.compilers |
From: | mark@omnifest.uwm.edu (Mark Hopkins) |
Keywords: | interpreter |
Organization: | Omnifest |
Date: | Sat, 26 Aug 1995 17:34:48 GMT |
Actually, I'm not entirely sure if either this or the previous
case study is entirely relevant to the original issue. However, the same
basic idea is the same: derive it, don't design it.
The general idea is you start out with the recursive specification and
then systematically derive the form and operation of the stack frame by
eliminating the recursion.
The following is a design reference on the C-BC interpreter. In every
place where possible the actual form of the interpreter was derived, including
the byte-code language. However, only one derivation is explicitly specified
at the very end: that of the call-return mechanism.
(1) C-BC Overview
C-BC is an extension of the UNIX BC language which includes a nearly
complete superset of C's statement and expression syntax, but is compatible
with BC. It has multi-dimensional dynamic arrays, pointers with dynamic
allocation, and arbitrary precision arithmetic with several base types for
real numbers, complex numbers, and finite fields.
The source code may be found in the alt.sources archives from early October
of 1993.
(2) Rational Infinite Terms
Source code is mapped into an infinitary language. The expressions in this
language can be infinitely large, but in such a way that each contains only
a finite number of distinct subexpressions, thus the term "rational infinite".
Given any grammar, one can extend the language represented by that grammar
to include rational infinite terms. The way this is done is to represent
such a term by a rational infinite tree: its parse tree. The grammar from
which the C-BC object language is derived is (in equational form):
String = 1 + Code String + Code? String: String
where 1 denotes the empty word (the ambiguity of the grammar will not
present a problem for what follows below).
In general, a parse tree can be represented as a finite series of
equations whose variables are the non-terminals of the grammar, but
labeled. On the right hand side of each equation would be a copy of a
one of the terms on the right-hand side of the corresponding grammar,
with each non-terminal labeled. For example, one parse tree for the
item (a? b: c? d: e) will be:
S
S ? S : S
C S C S
S ? S : S
C S C S C S
a b c d e
with the corresponding system of equations:
S0 = S1? S2: S3
S1 = C1 S4 S2 = C2 S5 S3 = S6? S7: S8
C1 = a C2 = b S6 = C3 S9 S7 = C4 S10 S8 = C5 S11
S4 = 1 S5 = 1 C3 = c C4 = d C5 = e
S9 = 1 S10 = 1 S11 = 1
A parse tree for a finite term will not have any cycles in it. Therefore, the
corresponding system of equations won't either: no recursions.
However, when the language is extended to allow for rational infinite trees
this will change. The parse trees will be infinite. But since a rational
infinite tree has a finite number of distinct subterms, it can be represented
as a "tree" with cycles in it. The corresponding system of equations,
expressing the relations between the subtrees, will therefore be finite but
cyclic.
For example, the infinite expression
a? b: c a? b: c a? b: c a? b: ...
will has a "tree" that looks like this:
S <------
/ | \ |
a ? b : S |
/ \ |
c * -->
and the corresponding system of equations:
S0 = a? b: S1
S1 = c S0
This system will then correspond to the sequence:
S0: if (a) { b; return; } else goto S1
S1: c; goto S0
Likewise, all intra-procedural control flow in C-BC is represented by
rational infinite terms, so that all that needs to be translated as the
operators of the language. This makes the object language an order of
magnitude simpler.
References:
Guy Cousineau "An Algebraic Definition for Control Structures",
Theoretical Computer Science 12 (1980) 175-192.
"A Representation of Trees by Languages"
Theoretical Computer Science 7 (1978) 25-55.
(3) C-BC Control Flow
The basis of the interpreter is therefore the rational infinite string,
which is presented as a series of items of the form:
(x: E :y z)
Each item stands for an equations according to one of the following schemes:
* x, y, z are labels for subterms
* E is either NULL or is a String that contains no (S? S: S) in it.
* If E is NULL: x = 1
* If y and z are distinct: x = E? y: z
* If y and z are the same: x = E y
A special label is reserved to correspond to the "halt" statement, which
cause the interpreter to exit.
The way an item is interpreted corresponds directly to the meaning of the
item being interpreted:
(0) If E is NULL, the current execution sequence is ended, and
control either resumes one level back (this is basically the
"return" statement), or at the top level goes idle.
(1) The sequence E is executed. Usually it is an expression to be
evaluated, but may involve function calls, which recursively
start up another level of execution. There are no jumps,
breaks, returns or any other kind of control flow in E.
(2) If y or z is NULL, the interpreter halts and returns to the
system (this is the "halt" statement).
(3) If y == z, then execution continues with the string labeled y.
(4) If y != z, then execution continues based on the value of the
expression evaluated. If it is 0, execution continues at z,
otherwise it continues at y.
All intraprocedural control flow can be handled. For instance, one might
have a translation rule such as:
while (E) S <---> ... :c c) (c: E :x b) (x: S :c c) (b: ...
Inside this while-loop, breaks and continues would be translated as follows:
break; <---> ... :b b) (y: ... where y is a new label
continue; <---> ... :c c) (z: ... where z is a new label
where b and c are the same labels as in the translation of the while loop.
Labeled statements are translated as follows:
L: S <----> ... :L L) (L: S ...
and goto's referring to this label, as follows:
goto L; <----> .. :L L) (x: S ... where x is a new label
Together, this allows one to define the entirety of a function's body.
A function definition is translated as in the following example:
define f(...) { <----> (0: NULL - -) (1:
E E
while (A) S :2 2) (2: A :3 4) (3: S :2 2) (4:
if (B) T B :5 6) (5: T :6 6) (6:
return C C :0 0)
}
which, when taken together forms the following set:
(0: NULL :- -)
(1: E :2 2)
(2: A :3 4)
(3: S :2 2)
(4: B :5 6)
(5: T :6 6)
(6: C :0 0)
When the function is called, execution starts at the subterm labeled 1.
The significance of all this is that now it's possible to set up a purely
expression based interpreter, involving no control flow statements at all --
because control flow is directly represented by the very shape of the infinite
expressions. So the actual interpreter language has relatively few operators
(basically just the numeric, stack, pointer, and array operators). This
separates out control from evaluation.
The same factoring idea leads to a very similar architecture, which is an
extension of the SECD machine that I call SEQCD. This was described in
detail in the previous case study.
(4) The C-BC Execution Machine
Though C-BC is not actually a functional language its execution machine is
in fact the direct ancestor of the SEQCD machine. So there is a great deal
of similarity between the two.
Following is a description of the object language used internally by
the C-BC interpreter. This language was not concocted out of the blue but
was *derived* as the least that would be necessary to effect a complete
translation of the source language AND make the corresponding translation
grammar a SSDTG.
A SDTG (Syntax Directed Translation Grammar) is one in which for each
non-terminal, a set of transformation rules are stated of the form
N: a => b
where the number of non-terminals of each type in the string a is the same
as the corresponding number in the string b. The grammar is called an SSDTG
(Simple SDTG) if the order of the non-terminals is also the same. For
instance, YACC is technically a processor for an SSDTG, not for a context
free grammar.
Any SDTG can be represented by an SSDTG in one of two ways:
(1) Place intermediate values on a stack.
(2) Use the SSDTG to translate to a parse tree, and accomplish the
reordering by traversing the parse tree after it's fully formed.
so as a result, SSDTG's assume a primary role. YACC uses the first method.
Likewise, the corresponding reduction rules were derived systematically
from the condition of being the least which would be necessary to correctly
interpret the action of the corresponding high-level statement. Much of
this designc could have, in fact, been automated.
The configuration of the C-BC machine takes on the following form:
S, E, C
where S is an stack containing a set of values, E is the current environment
and C is the code-string currently undergoing evaluation. The specification
itself is recursive, with the recursion occurring with the specification of
the function calls. The process of removing this recursion will lead to an
extra component, D, which will be a dump (or display frame) for the
call-return mechanism. This can be (and actually was) systematically derived
from the recursive specification.
An environment, E, contains a set of object:value pairings. The types of
objects and values are as follows:
TABLE 1
Object Value
Array Variable Array = List of Variables
Pointer Variable Variable
Scalar Variable Constant
Function String (plus extra information
corresponding to parameters and local
variables)
Variables are indexed and denoted &n (for scalars and pointers), ;n (for
arrays) for n = 0, 1, 2, .... Functions are also indexed denoted xn for
n = 0, 1, 2, ...
Given an environment the current value of a variable v will be denoted
E(v). The nth component of an array variable, a, will be denoted E(a)[n].
The effect of modifying the current environment E by assigning an object
x a value v will be denoted: (x <- v, E).
Technically, a configuration (S, E, C) is a function that maps input to
output. This mapping will be explicitly indicated only where necessary.
Otherwise, all configuration rules of the form
(S, E, C) = (S', E', C')
are to be understood in the sense:
(S, E, C)(Input) = (S', E', C')(Input)
Also, list notation will be used for inputs and outputs with the basic
list primitives:
x:L ------------ the list formed from L by adding x to the front
L:x ------------ the list formed from L by adding x to the back
[] -------------- the empty list
[a,b, ..., z] --- a: b: ... z:[]
List notation will also be used for C and S.
The Execution Machine: TABLE 2
S, E, constant:C --> S:constant, E, C [1]
S, E, &n:C --> S:&n, E, C
S, E, ;n:C --> S:;n, E, C
S, E, xn:C --> S:xn, E, C
S:A:B, E, : :C --> S:E(A)[B], E, C
S:A, E, , :C --> S, E, C
S:A, E, n :C --> S:-A, E, C
S:A, E, ! :C --> S:!A, E, C
S:A, E, d :C --> S:A:E(A), E, C
S:A, E, l :C --> S:E(A), E, C
S:A:B, E, op:C --> S:A OP B, E, C [2]
S:A, E, r :C (x I) --> S:r, (A<-x, E), C (I) [3]
S:A:B, E, s :C --> S:B, (A<-B, E), C
S:A, E, a :C --> S:E(A), (A<-E(A)+1, E), C
S:A, E, m :C --> S:E(A), (A<-E(A)-1, E), C
S:A, E, A :C --> S:E(A)+1, (A<-E(A)+1, E) C
S:A, E, M :C --> S:E(A)-1, (A<-E(A)-1, E), C
S:A, E, w :C --> f(A): (S, E, C) [4]
S:A, E, P :C --> g(A): (S, E, C) [4]
S:a1:...:an:f, E, @:C (x1:x2:... xm:I) --> y1:y2:...:yp:
(S:f(a1,...,an), E', C) (I) [5]
Notes:
[1] A constant is 0, 1 or a scalar value of one of the forms [...], "...",
or $...$.
[2] The correspondences between byte codes (op) and operations (OP) are as
follows:
op: + - * / % ^ < > = # { }
OP: + - * div mod exponent < > == != <= >=
with all the operations (including div, mod and exponent) defined in the
C-BC reference.
[3] r will be equal to 0 or 1 depending on whether the item read has the
correct syntax for the type of the expression A or not.
[4] f(A) and g(A) are the two "write" functions that convert the value A into
an ASCII string suitable for output. Both operations are described in
the C-BC reference.
[5] The function f is assumed to have arity n for some n >= 0, and when
called will recursively invoke a new level of execution with respect
to its corresponding String. The resulting environment is denoted E',
The values x1, x2, ..., xm are whatever the function reads in as input,
and y1, y2, ..., yp whatever it outputs. All three of these items are
recursively defined by applying the specification to the function's body.
Example:
The following example illustrates the application of the rules. To make
the presentation easier, the notations x L and L x are used respectively in
place of x:L and L:x.
String: &2 d &1 a * s ,
Reduction:
S, (&2<-4, &1<-2, E), &2 d &1 a * s , C
--> S &2, (&2<-4, &1<-2, E), d &1 a * s , C
--> S &2 4, (&2<-4, &1<-2, E), &1 a * s , C
--> S &2 4 &1, (&2<-4, &1<-2, E), a * s , C
--> S &2 4 2, (&2<-4, &1<-3, E), * s , C
--> S &2 8, (&2<-4, &1<-3, E), s , C
--> S 8, (&2<-8, &1<-3, E), , C
--> S, (&2<-8, &1<-3, E), C
If the variable &1 and &2 corresponded to variables named x and i respectively
then this sequence would have had the effect: x *= i++.
(5) The C-BC Translation Grammar
The source to byte-code translation is carried out with a SSDTG. A SSDTG
is defined as consisting of the following:
X: Input alphabet
Y: Output alphabet
N: Non-terminals
S, an element of N denoting the START symbol
P: a set of rules of the form
n: a => b
where n is a non-terminal in N, a is a word in (X + N)*
and b is a word in (Y + N)*.
All rules are constrained so that for each rule (n: a => n) the number and
order of non-terminals in a and b is the same. This notation is extended
so that
n: a1 => b1
a2 => b2
...
an => bn
will denote the set of translations { n: ai => bi: i = 1, 2, ..., n }.
The grammar itself is ambiguous with respect to the syntax for expressions
and is resolved with precedence rules. These are described in the C-BC
reference..
(a) Expressions
The description below uses the following abbreviations:
f_n: Function #n
a_n: Array variable #n
v_n: Variable #n
l_x: Label denoting state #x
e, e1, e2, ..., en: Expressions
s, s1, s2, ..., sn: Statements
x, x1, x2, ..., xn, y, z: Continuation string labels
If a P appears in the first column, the expression following is printable
if it is of scalar type. If a ? appears in the first column, the expression
is printable if any only if the final sub-expression is. If an L appears in
the first column, the expression is an L-value and will be printable when
dereferenced if and only if it is of a scalar type.
I/O Expressions:
E: -> e, e1, ..., en => e r :x1 x)
(x1: 1 + e1 r :x2 x)
(x2: 1 + ...
... en r :xn x)
(xn: 1 + :x x)
(x:
<- e, e1, ..., en => e w :x1 x)
(x1: 1 + e1 w :x2 x)
(x2: 1 + ...
... en w :xn x)
(xn: 1 + :x x)
(x:
-> e => e r
<- e => e w
Logic Expressions:
E: e? e1: e2 => e :x y) (x: e1 :z z) (y: e2 :z z) (z:
e1 || e2 => e1 :x z) (z: e2 :x y) (y: 0 :w w) (x: 1 :w w) (w:
e1 && e2 => e1 :z y) (z: e1 :x y) (x: 1 :w w) (y: 0 :w w) (w:
!e => e!
Relational Expressions:
E: e1 == e2 => e1 e2 =
e1 != e2 => e1 e2 #
e1 <= e2 => e1 e2 {
e1 >= e2 => e1 e2 }
e1 < e2 => e1 e2 <
e1 > e2 => e1 e2 >
Assignment/Update and Sequential Expressions:
E: ++e => e A
--e => e M
e++ => e a
e-- => e m
? e1, e2 => e1 , e2
e1 = e2 => e1 e2 s
e1 op= e2 => e1 d e2 op s op: + - * / ^ %
e1 =op e2 => e1 d e2 op s op: + - * / ^ %
Arithmetic Operators:
E:
P + e => e
P - e => e n
P e1 op e2 => e1 e2 op op: + - * / ^ %
Addressing/Pointer Expressions:
E: & e => e
new e => e N
free e => e F
L * e => e
Other Expressions:
E:
L e1[e2] => e1 e2 :
L a_n[e2] => ;n e2 : (special case)
? (e) => e
P f_n(e1, ..., em) => e1 ... em xn @
L v_n => &n
a_n[] => ;n
Constants:
E:
P number => [number]
P "string" => "string"
P $galois => $galois$
P $galois$ => $galois$
P 0 => 0
These are the mappings for the following built-in objects:
Built-In Variables:
ibase: v_0, obase: v_1, scale: v_2
last: v_3, lastc: v_4, lastg: v_5, lasts: v_6
Built-In Functions:
sqrt: f_0, scale: f_1, length: f_2,
shell: f_3, ifile: f_4, ofile: f_5,
comp: f_6, re: f_7, im: f_8,
field: f_9, gal_p: f_10, gal_m: f_11,
gal_gx: f_12, gal_set: f_13, gal_get: f_14
(b) Automatic Conversions.
L-values are automatically derefernced in all contexts except those
specifically requiring L-values. If this operation is denoted "deref e",
then the following effective translation rule applies.
P E: deref e => e l
Arrays, likewise, are automatically converted to pointers in most contexts,
as described in the C-BC documentation. If this operation is denoted as
"convert e", the following rule then applies:
E: convert e => e 0 :
(c) Statements
Expression Statements:
S: e ; => e P (if e is printable)
e ; => e , (if e is not printable)
; =>
Jump Statements:
In all these statements, a break occurs in control-flow. Therefore,
when the following continuation string is created, it is created with a new
label, denoted as y. The labels c and b denote the c-label and b-label of the
innermost containing loop statement. The label, -, denotes the null
continuation string.
S: goto l_x; => :x x) (y:
continue; => :c c) (y:
break; => :b b) (y:
return; => 0 :0 0) (y:
return(); => 0 :0 0) (y:
return e; => e :0 0) (y:
halt; => :- -) (y:
Other Statements:
In the following statements, the labels c and b denote the labels that will
be matched respectively to "continue" and "break" statements contained
immediately inside these statements.
S: { s1 ... sn } => s1 ... sn
l_x: s => :x x) (x: s
if (e) s => e :x z) (x: s :z z) (z:
if (e) s1 else s2 => e :x y) (x: s1 :z z) (y: s2 :z z) (z:
while (e) s => :c c) (c: e :x b) (x: s :c c) (b:
do s while (e); => :x x) (x: s :c c) (c: e :x b) (b:
for ([e1]; [e2]; [e3]) s => [e1 ,] :x x) [(x: e2 :y b)] [(c: e3, :x x)]
(y: s :c c) (b:
In the for statement, the expressions denoted in brackets are optional. If
they do not occur, then the corresponding items enclosed in []'s on the right
hand side will not appear either. If expression e2 is absent, therefore,
labels x and y will be considered identical. Likewise, if expression e3 is
absent, labels c and x will be considered identical. Also, in this version
of C-BC, the :c c) (c: sequence in the do-while statement will be optimized
away if this statement does not contain a continue immediately within it.
(d) Pseudo Statements
Some of the top-level commands (called here: pseudo-statements) are
translated and processed in the interpreter. These include the following:
P: include "string" => "string" I
log "string" => "string" L
log => "" L
quit =>
The command "quit" is not processed at all, since the interpreter exits as
soon as it is seen.
(e) Function Bodies and Execution Sequences
A function definition consists of a set of parameter declarations, auto
variable declarations, and a sequence of statements. In this version of
C-BC, the declarations are processed and associated with the function name
in the function name space. If the sequence of statements is s1 ... sn, it is
translated as if by the following rule:
P: Header { s1 ... sn } => (0: - :? ?) (1: s1 ... sn 0 :0 0)
The - and ? denote respectively empty continuation strings and don't-care
continuation labels. In all function definitions, state 0 will be the ending
state, consisting of an empty continuation string, and state 1 will be the
starting state. Note the final 0 :0 0) sequence, which corresponds to an
implicit return; statement at the end of the function.
An execution sequence is a single statement listed outside a function.
If there is more than one statement on a given line, each one is still
considered a separate execution sequence.
If the statement, s, is the execution sequence, it is translated as if
by the following rule:
P: s => (0: - :? ?) (1: s :0 0)
Note that in this case, there is no value returned. Effectively, execution
sequences are anonymous functions of type void that are (re)defined, and then
executed.
Future extensions of C-BC may allow for the addition of "void" types,
and may denote this function as "main". In this case, the following
declarations would be understood: void main();
Finally, at the top level of C-BC, one has the rule
start: P1 P2 ... Pn => P1 P2 ... Pn
which states that a program is a sequence of function definitions, statements
and pseudo-statements translated in the order in which they are issued.
(f) Declarations
The syntax and semantics for declarations is described in detail in the
C-BC documentation, so no further information will be provided here.
Declarations may be made at the top-level (and so one has a rule of the
form P: Dec => ...) or as part of a function's header, whose syntax is not
further specified here.
(6) Generating the Stack Frame
In the process of removing the recursion from the specification of the
C-BC Execution Machine listed in part (4), a stack frame will be generated.
In order to carry out the reduction
S:a1:...:an:f, E, @:C (x1:x2:... xm:I) --> y1:y2:...:yp:
(S:f(a1,...,an), E', C)(I) [5]
we need to temporarily save the the environment. A new environment with the
function's parameters and locals is created. Then the values for a1 through
an are stored into function's parameters and the locals are initialised (all
to 0, there are no declaration initialisers in C-BC). The function's string
is prefixed to C, along with a return code. When the return code is reached
the original environment will be restored.
An extra component, D, called the DUMP (or frame display) is then added to
the configuration (S, E, C). Each of the rules previously stated will be
modified by adding D to both sides.
Assume that the function's string is named Cf, its parameters are named
&P1, ..., &Pn, that its locals are named &L1, ..., &Lp (the parameters and
locals are the "extra information" referred to by the listing in TABLE 1).
Also let pi denote E(Pi), and li denote E(li).
The function call is accomplished inductively as follows:
S:a1:...:an:f, E, @:C, D (x1:x2:... xm:I)
--> S, (L1<-0, ..., Lp<-0, P1<-a1, ..., Pn<-an, E), Cf:#:C,
[P1, p1, ..., Pn, pn]:[L1, l1,...,Lp, lp]:D (x1:x2:...:xm:I)
--> ... [execute Cf]
--> y1:y2:...:yp: (S:A, E', #:C,
[P1, p1, ..., Pn, pn]:[L1, l1, ..., Lp: lp]:D) (I)
--> y1:y2:...:yp: (S:A, (P1<-p1, ..., Pn<-pn, L1<-l1, ..., Lp<-lp,E'), C, D
where the x's and y's are as in note [5] under TABLE 2. This has the effect
of restoring the values of the locals L1, ..., Lp and parameters P1, ..., Pn
after the function call. The value A is whatever is left on the stack
after the function is done. Since all functions in C-BC return a single
value, there will always be an extra item on the stack after the function is
complete.
This listing leads to several more rules:
FUNCTION CALLS:
S:a1:...:an:f, E, @:C, D
--> S, (L1<-0,...,Lp<-0,P1<-a1,...,Pn<-an,E), Cf:#:C,
[P1,E(P1),...,Pn,E(Pn)]:[L1,E(L1),..., Lp,E(Lp)]:D
where n is the arity of the function, L1, ..., Lp its locals, P1, ..., Pn its
parameters, and Cf its string. All these items are listed with the function
in the environment so that f's entry in the environment will look like:
E(f) = ([P1, ..., Pn], [L1, ..., Lp], Cf)
Note that such an entry is established as a result of processing a function
definition. This specification, hwoever was left out of the SSDTG.
FUNCTION RETURNS:
S:A, E', #:C, [P1, p1, ..., Pn, pn]:[L1, l1, ..., Lp, lp]:D
--> S:A, (P1<-p1, ..., Pn<-pn, L1<-l1, ..., Lp<-lp,E'), C, D
So the format of the dump is a list of items of the form:
[[P1, p1, ..., Pn, pn], [L1, l1, ..., Lp, lp]]
In the C-BC interpreter, the dump doesn't really take on this form. Instead
each variable is represented by name and its corresponding environment entry
is actually a stack of values instead of just one. When parameters and
locals are initialised the new values are pushed on the corresponding variable
stacks, which are popped upon the function's return.
However, this is the process that was actually applied to the code itself
to arrive at the interpreter, minus the function-call recursion. Originally,
I was going to just leave it alone but there were some other parts of the
interpreter that needed to explicitly manipulate the dump.
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.