Sat, 26 Aug 1995 17:34:48 GMT

Related articles |
---|

Case Study 2: How to build your own stack frames. mark@omnifest.uwm.edu (1995-08-26) |

Newsgroups: | comp.compilers |

From: | mark@omnifest.uwm.edu (Mark Hopkins) |

Keywords: | interpreter |

Organization: | Omnifest |

Date: | Sat, 26 Aug 1995 17:34:48 GMT |

Actually, I'm not entirely sure if either this or the previous

case study is entirely relevant to the original issue. However, the same

basic idea is the same: derive it, don't design it.

The general idea is you start out with the recursive specification and

then systematically derive the form and operation of the stack frame by

eliminating the recursion.

The following is a design reference on the C-BC interpreter. In every

place where possible the actual form of the interpreter was derived, including

the byte-code language. However, only one derivation is explicitly specified

at the very end: that of the call-return mechanism.

(1) C-BC Overview

C-BC is an extension of the UNIX BC language which includes a nearly

complete superset of C's statement and expression syntax, but is compatible

with BC. It has multi-dimensional dynamic arrays, pointers with dynamic

allocation, and arbitrary precision arithmetic with several base types for

real numbers, complex numbers, and finite fields.

The source code may be found in the alt.sources archives from early October

of 1993.

(2) Rational Infinite Terms

Source code is mapped into an infinitary language. The expressions in this

language can be infinitely large, but in such a way that each contains only

a finite number of distinct subexpressions, thus the term "rational infinite".

Given any grammar, one can extend the language represented by that grammar

to include rational infinite terms. The way this is done is to represent

such a term by a rational infinite tree: its parse tree. The grammar from

which the C-BC object language is derived is (in equational form):

String = 1 + Code String + Code? String: String

where 1 denotes the empty word (the ambiguity of the grammar will not

present a problem for what follows below).

In general, a parse tree can be represented as a finite series of

equations whose variables are the non-terminals of the grammar, but

labeled. On the right hand side of each equation would be a copy of a

one of the terms on the right-hand side of the corresponding grammar,

with each non-terminal labeled. For example, one parse tree for the

item (a? b: c? d: e) will be:

S

S ? S : S

C S C S

S ? S : S

C S C S C S

a b c d e

with the corresponding system of equations:

S0 = S1? S2: S3

S1 = C1 S4 S2 = C2 S5 S3 = S6? S7: S8

C1 = a C2 = b S6 = C3 S9 S7 = C4 S10 S8 = C5 S11

S4 = 1 S5 = 1 C3 = c C4 = d C5 = e

S9 = 1 S10 = 1 S11 = 1

A parse tree for a finite term will not have any cycles in it. Therefore, the

corresponding system of equations won't either: no recursions.

However, when the language is extended to allow for rational infinite trees

this will change. The parse trees will be infinite. But since a rational

infinite tree has a finite number of distinct subterms, it can be represented

as a "tree" with cycles in it. The corresponding system of equations,

expressing the relations between the subtrees, will therefore be finite but

cyclic.

For example, the infinite expression

a? b: c a? b: c a? b: c a? b: ...

will has a "tree" that looks like this:

S <------

/ | \ |

a ? b : S |

/ \ |

c * -->

and the corresponding system of equations:

S0 = a? b: S1

S1 = c S0

This system will then correspond to the sequence:

S0: if (a) { b; return; } else goto S1

S1: c; goto S0

Likewise, all intra-procedural control flow in C-BC is represented by

rational infinite terms, so that all that needs to be translated as the

operators of the language. This makes the object language an order of

magnitude simpler.

References:

Guy Cousineau "An Algebraic Definition for Control Structures",

Theoretical Computer Science 12 (1980) 175-192.

"A Representation of Trees by Languages"

Theoretical Computer Science 7 (1978) 25-55.

(3) C-BC Control Flow

The basis of the interpreter is therefore the rational infinite string,

which is presented as a series of items of the form:

(x: E :y z)

Each item stands for an equations according to one of the following schemes:

* x, y, z are labels for subterms

* E is either NULL or is a String that contains no (S? S: S) in it.

* If E is NULL: x = 1

* If y and z are distinct: x = E? y: z

* If y and z are the same: x = E y

A special label is reserved to correspond to the "halt" statement, which

cause the interpreter to exit.

The way an item is interpreted corresponds directly to the meaning of the

item being interpreted:

(0) If E is NULL, the current execution sequence is ended, and

control either resumes one level back (this is basically the

"return" statement), or at the top level goes idle.

(1) The sequence E is executed. Usually it is an expression to be

evaluated, but may involve function calls, which recursively

start up another level of execution. There are no jumps,

breaks, returns or any other kind of control flow in E.

(2) If y or z is NULL, the interpreter halts and returns to the

system (this is the "halt" statement).

(3) If y == z, then execution continues with the string labeled y.

(4) If y != z, then execution continues based on the value of the

expression evaluated. If it is 0, execution continues at z,

otherwise it continues at y.

All intraprocedural control flow can be handled. For instance, one might

have a translation rule such as:

while (E) S <---> ... :c c) (c: E :x b) (x: S :c c) (b: ...

Inside this while-loop, breaks and continues would be translated as follows:

break; <---> ... :b b) (y: ... where y is a new label

continue; <---> ... :c c) (z: ... where z is a new label

where b and c are the same labels as in the translation of the while loop.

Labeled statements are translated as follows:

L: S <----> ... :L L) (L: S ...

and goto's referring to this label, as follows:

goto L; <----> .. :L L) (x: S ... where x is a new label

Together, this allows one to define the entirety of a function's body.

A function definition is translated as in the following example:

define f(...) { <----> (0: NULL - -) (1:

E E

while (A) S :2 2) (2: A :3 4) (3: S :2 2) (4:

if (B) T B :5 6) (5: T :6 6) (6:

return C C :0 0)

}

which, when taken together forms the following set:

(0: NULL :- -)

(1: E :2 2)

(2: A :3 4)

(3: S :2 2)

(4: B :5 6)

(5: T :6 6)

(6: C :0 0)

When the function is called, execution starts at the subterm labeled 1.

The significance of all this is that now it's possible to set up a purely

expression based interpreter, involving no control flow statements at all --

because control flow is directly represented by the very shape of the infinite

expressions. So the actual interpreter language has relatively few operators

(basically just the numeric, stack, pointer, and array operators). This

separates out control from evaluation.

The same factoring idea leads to a very similar architecture, which is an

extension of the SECD machine that I call SEQCD. This was described in

detail in the previous case study.

(4) The C-BC Execution Machine

Though C-BC is not actually a functional language its execution machine is

in fact the direct ancestor of the SEQCD machine. So there is a great deal

of similarity between the two.

Following is a description of the object language used internally by

the C-BC interpreter. This language was not concocted out of the blue but

was *derived* as the least that would be necessary to effect a complete

translation of the source language AND make the corresponding translation

grammar a SSDTG.

A SDTG (Syntax Directed Translation Grammar) is one in which for each

non-terminal, a set of transformation rules are stated of the form

N: a => b

where the number of non-terminals of each type in the string a is the same

as the corresponding number in the string b. The grammar is called an SSDTG

(Simple SDTG) if the order of the non-terminals is also the same. For

instance, YACC is technically a processor for an SSDTG, not for a context

free grammar.

Any SDTG can be represented by an SSDTG in one of two ways:

(1) Place intermediate values on a stack.

(2) Use the SSDTG to translate to a parse tree, and accomplish the

reordering by traversing the parse tree after it's fully formed.

so as a result, SSDTG's assume a primary role. YACC uses the first method.

Likewise, the corresponding reduction rules were derived systematically

from the condition of being the least which would be necessary to correctly

interpret the action of the corresponding high-level statement. Much of

this designc could have, in fact, been automated.

The configuration of the C-BC machine takes on the following form:

S, E, C

where S is an stack containing a set of values, E is the current environment

and C is the code-string currently undergoing evaluation. The specification

itself is recursive, with the recursion occurring with the specification of

the function calls. The process of removing this recursion will lead to an

extra component, D, which will be a dump (or display frame) for the

call-return mechanism. This can be (and actually was) systematically derived

from the recursive specification.

An environment, E, contains a set of object:value pairings. The types of

objects and values are as follows:

TABLE 1

Object Value

Array Variable Array = List of Variables

Pointer Variable Variable

Scalar Variable Constant

Function String (plus extra information

corresponding to parameters and local

variables)

Variables are indexed and denoted &n (for scalars and pointers), ;n (for

arrays) for n = 0, 1, 2, .... Functions are also indexed denoted xn for

n = 0, 1, 2, ...

Given an environment the current value of a variable v will be denoted

E(v). The nth component of an array variable, a, will be denoted E(a)[n].

The effect of modifying the current environment E by assigning an object

x a value v will be denoted: (x <- v, E).

Technically, a configuration (S, E, C) is a function that maps input to

output. This mapping will be explicitly indicated only where necessary.

Otherwise, all configuration rules of the form

(S, E, C) = (S', E', C')

are to be understood in the sense:

(S, E, C)(Input) = (S', E', C')(Input)

Also, list notation will be used for inputs and outputs with the basic

list primitives:

x:L ------------ the list formed from L by adding x to the front

L:x ------------ the list formed from L by adding x to the back

[] -------------- the empty list

[a,b, ..., z] --- a: b: ... z:[]

List notation will also be used for C and S.

The Execution Machine: TABLE 2

S, E, constant:C --> S:constant, E, C [1]

S, E, &n:C --> S:&n, E, C

S, E, ;n:C --> S:;n, E, C

S, E, xn:C --> S:xn, E, C

S:A:B, E, : :C --> S:E(A)[B], E, C

S:A, E, , :C --> S, E, C

S:A, E, n :C --> S:-A, E, C

S:A, E, ! :C --> S:!A, E, C

S:A, E, d :C --> S:A:E(A), E, C

S:A, E, l :C --> S:E(A), E, C

S:A:B, E, op:C --> S:A OP B, E, C [2]

S:A, E, r :C (x I) --> S:r, (A<-x, E), C (I) [3]

S:A:B, E, s :C --> S:B, (A<-B, E), C

S:A, E, a :C --> S:E(A), (A<-E(A)+1, E), C

S:A, E, m :C --> S:E(A), (A<-E(A)-1, E), C

S:A, E, A :C --> S:E(A)+1, (A<-E(A)+1, E) C

S:A, E, M :C --> S:E(A)-1, (A<-E(A)-1, E), C

S:A, E, w :C --> f(A): (S, E, C) [4]

S:A, E, P :C --> g(A): (S, E, C) [4]

S:a1:...:an:f, E, @:C (x1:x2:... xm:I) --> y1:y2:...:yp:

(S:f(a1,...,an), E', C) (I) [5]

Notes:

[1] A constant is 0, 1 or a scalar value of one of the forms [...], "...",

or $...$.

[2] The correspondences between byte codes (op) and operations (OP) are as

follows:

op: + - * / % ^ < > = # { }

OP: + - * div mod exponent < > == != <= >=

with all the operations (including div, mod and exponent) defined in the

C-BC reference.

[3] r will be equal to 0 or 1 depending on whether the item read has the

correct syntax for the type of the expression A or not.

[4] f(A) and g(A) are the two "write" functions that convert the value A into

an ASCII string suitable for output. Both operations are described in

the C-BC reference.

[5] The function f is assumed to have arity n for some n >= 0, and when

called will recursively invoke a new level of execution with respect

to its corresponding String. The resulting environment is denoted E',

The values x1, x2, ..., xm are whatever the function reads in as input,

and y1, y2, ..., yp whatever it outputs. All three of these items are

recursively defined by applying the specification to the function's body.

Example:

The following example illustrates the application of the rules. To make

the presentation easier, the notations x L and L x are used respectively in

place of x:L and L:x.

String: &2 d &1 a * s ,

Reduction:

S, (&2<-4, &1<-2, E), &2 d &1 a * s , C

--> S &2, (&2<-4, &1<-2, E), d &1 a * s , C

--> S &2 4, (&2<-4, &1<-2, E), &1 a * s , C

--> S &2 4 &1, (&2<-4, &1<-2, E), a * s , C

--> S &2 4 2, (&2<-4, &1<-3, E), * s , C

--> S &2 8, (&2<-4, &1<-3, E), s , C

--> S 8, (&2<-8, &1<-3, E), , C

--> S, (&2<-8, &1<-3, E), C

If the variable &1 and &2 corresponded to variables named x and i respectively

then this sequence would have had the effect: x *= i++.

(5) The C-BC Translation Grammar

The source to byte-code translation is carried out with a SSDTG. A SSDTG

is defined as consisting of the following:

X: Input alphabet

Y: Output alphabet

N: Non-terminals

S, an element of N denoting the START symbol

P: a set of rules of the form

n: a => b

where n is a non-terminal in N, a is a word in (X + N)*

and b is a word in (Y + N)*.

All rules are constrained so that for each rule (n: a => n) the number and

order of non-terminals in a and b is the same. This notation is extended

so that

n: a1 => b1

a2 => b2

...

an => bn

will denote the set of translations { n: ai => bi: i = 1, 2, ..., n }.

The grammar itself is ambiguous with respect to the syntax for expressions

and is resolved with precedence rules. These are described in the C-BC

reference..

(a) Expressions

The description below uses the following abbreviations:

f_n: Function #n

a_n: Array variable #n

v_n: Variable #n

l_x: Label denoting state #x

e, e1, e2, ..., en: Expressions

s, s1, s2, ..., sn: Statements

x, x1, x2, ..., xn, y, z: Continuation string labels

If a P appears in the first column, the expression following is printable

if it is of scalar type. If a ? appears in the first column, the expression

is printable if any only if the final sub-expression is. If an L appears in

the first column, the expression is an L-value and will be printable when

dereferenced if and only if it is of a scalar type.

I/O Expressions:

E: -> e, e1, ..., en => e r :x1 x)

(x1: 1 + e1 r :x2 x)

(x2: 1 + ...

... en r :xn x)

(xn: 1 + :x x)

(x:

<- e, e1, ..., en => e w :x1 x)

(x1: 1 + e1 w :x2 x)

(x2: 1 + ...

... en w :xn x)

(xn: 1 + :x x)

(x:

-> e => e r

<- e => e w

Logic Expressions:

E: e? e1: e2 => e :x y) (x: e1 :z z) (y: e2 :z z) (z:

e1 || e2 => e1 :x z) (z: e2 :x y) (y: 0 :w w) (x: 1 :w w) (w:

e1 && e2 => e1 :z y) (z: e1 :x y) (x: 1 :w w) (y: 0 :w w) (w:

!e => e!

Relational Expressions:

E: e1 == e2 => e1 e2 =

e1 != e2 => e1 e2 #

e1 <= e2 => e1 e2 {

e1 >= e2 => e1 e2 }

e1 < e2 => e1 e2 <

e1 > e2 => e1 e2 >

Assignment/Update and Sequential Expressions:

E: ++e => e A

--e => e M

e++ => e a

e-- => e m

? e1, e2 => e1 , e2

e1 = e2 => e1 e2 s

e1 op= e2 => e1 d e2 op s op: + - * / ^ %

e1 =op e2 => e1 d e2 op s op: + - * / ^ %

Arithmetic Operators:

E:

P + e => e

P - e => e n

P e1 op e2 => e1 e2 op op: + - * / ^ %

Addressing/Pointer Expressions:

E: & e => e

new e => e N

free e => e F

L * e => e

Other Expressions:

E:

L e1[e2] => e1 e2 :

L a_n[e2] => ;n e2 : (special case)

? (e) => e

P f_n(e1, ..., em) => e1 ... em xn @

L v_n => &n

a_n[] => ;n

Constants:

E:

P number => [number]

P "string" => "string"

P $galois => $galois$

P $galois$ => $galois$

P 0 => 0

These are the mappings for the following built-in objects:

Built-In Variables:

ibase: v_0, obase: v_1, scale: v_2

last: v_3, lastc: v_4, lastg: v_5, lasts: v_6

Built-In Functions:

sqrt: f_0, scale: f_1, length: f_2,

shell: f_3, ifile: f_4, ofile: f_5,

comp: f_6, re: f_7, im: f_8,

field: f_9, gal_p: f_10, gal_m: f_11,

gal_gx: f_12, gal_set: f_13, gal_get: f_14

(b) Automatic Conversions.

L-values are automatically derefernced in all contexts except those

specifically requiring L-values. If this operation is denoted "deref e",

then the following effective translation rule applies.

P E: deref e => e l

Arrays, likewise, are automatically converted to pointers in most contexts,

as described in the C-BC documentation. If this operation is denoted as

"convert e", the following rule then applies:

E: convert e => e 0 :

(c) Statements

Expression Statements:

S: e ; => e P (if e is printable)

e ; => e , (if e is not printable)

; =>

Jump Statements:

In all these statements, a break occurs in control-flow. Therefore,

when the following continuation string is created, it is created with a new

label, denoted as y. The labels c and b denote the c-label and b-label of the

innermost containing loop statement. The label, -, denotes the null

continuation string.

S: goto l_x; => :x x) (y:

continue; => :c c) (y:

break; => :b b) (y:

return; => 0 :0 0) (y:

return(); => 0 :0 0) (y:

return e; => e :0 0) (y:

halt; => :- -) (y:

Other Statements:

In the following statements, the labels c and b denote the labels that will

be matched respectively to "continue" and "break" statements contained

immediately inside these statements.

S: { s1 ... sn } => s1 ... sn

l_x: s => :x x) (x: s

if (e) s => e :x z) (x: s :z z) (z:

if (e) s1 else s2 => e :x y) (x: s1 :z z) (y: s2 :z z) (z:

while (e) s => :c c) (c: e :x b) (x: s :c c) (b:

do s while (e); => :x x) (x: s :c c) (c: e :x b) (b:

for ([e1]; [e2]; [e3]) s => [e1 ,] :x x) [(x: e2 :y b)] [(c: e3, :x x)]

(y: s :c c) (b:

In the for statement, the expressions denoted in brackets are optional. If

they do not occur, then the corresponding items enclosed in []'s on the right

hand side will not appear either. If expression e2 is absent, therefore,

labels x and y will be considered identical. Likewise, if expression e3 is

absent, labels c and x will be considered identical. Also, in this version

of C-BC, the :c c) (c: sequence in the do-while statement will be optimized

away if this statement does not contain a continue immediately within it.

(d) Pseudo Statements

Some of the top-level commands (called here: pseudo-statements) are

translated and processed in the interpreter. These include the following:

P: include "string" => "string" I

log "string" => "string" L

log => "" L

quit =>

The command "quit" is not processed at all, since the interpreter exits as

soon as it is seen.

(e) Function Bodies and Execution Sequences

A function definition consists of a set of parameter declarations, auto

variable declarations, and a sequence of statements. In this version of

C-BC, the declarations are processed and associated with the function name

in the function name space. If the sequence of statements is s1 ... sn, it is

translated as if by the following rule:

P: Header { s1 ... sn } => (0: - :? ?) (1: s1 ... sn 0 :0 0)

The - and ? denote respectively empty continuation strings and don't-care

continuation labels. In all function definitions, state 0 will be the ending

state, consisting of an empty continuation string, and state 1 will be the

starting state. Note the final 0 :0 0) sequence, which corresponds to an

implicit return; statement at the end of the function.

An execution sequence is a single statement listed outside a function.

If there is more than one statement on a given line, each one is still

considered a separate execution sequence.

If the statement, s, is the execution sequence, it is translated as if

by the following rule:

P: s => (0: - :? ?) (1: s :0 0)

Note that in this case, there is no value returned. Effectively, execution

sequences are anonymous functions of type void that are (re)defined, and then

executed.

Future extensions of C-BC may allow for the addition of "void" types,

and may denote this function as "main". In this case, the following

declarations would be understood: void main();

Finally, at the top level of C-BC, one has the rule

start: P1 P2 ... Pn => P1 P2 ... Pn

which states that a program is a sequence of function definitions, statements

and pseudo-statements translated in the order in which they are issued.

(f) Declarations

The syntax and semantics for declarations is described in detail in the

C-BC documentation, so no further information will be provided here.

Declarations may be made at the top-level (and so one has a rule of the

form P: Dec => ...) or as part of a function's header, whose syntax is not

further specified here.

(6) Generating the Stack Frame

In the process of removing the recursion from the specification of the

C-BC Execution Machine listed in part (4), a stack frame will be generated.

In order to carry out the reduction

S:a1:...:an:f, E, @:C (x1:x2:... xm:I) --> y1:y2:...:yp:

(S:f(a1,...,an), E', C)(I) [5]

we need to temporarily save the the environment. A new environment with the

function's parameters and locals is created. Then the values for a1 through

an are stored into function's parameters and the locals are initialised (all

to 0, there are no declaration initialisers in C-BC). The function's string

is prefixed to C, along with a return code. When the return code is reached

the original environment will be restored.

An extra component, D, called the DUMP (or frame display) is then added to

the configuration (S, E, C). Each of the rules previously stated will be

modified by adding D to both sides.

Assume that the function's string is named Cf, its parameters are named

&P1, ..., &Pn, that its locals are named &L1, ..., &Lp (the parameters and

locals are the "extra information" referred to by the listing in TABLE 1).

Also let pi denote E(Pi), and li denote E(li).

The function call is accomplished inductively as follows:

S:a1:...:an:f, E, @:C, D (x1:x2:... xm:I)

--> S, (L1<-0, ..., Lp<-0, P1<-a1, ..., Pn<-an, E), Cf:#:C,

[P1, p1, ..., Pn, pn]:[L1, l1,...,Lp, lp]:D (x1:x2:...:xm:I)

--> ... [execute Cf]

--> y1:y2:...:yp: (S:A, E', #:C,

[P1, p1, ..., Pn, pn]:[L1, l1, ..., Lp: lp]:D) (I)

--> y1:y2:...:yp: (S:A, (P1<-p1, ..., Pn<-pn, L1<-l1, ..., Lp<-lp,E'), C, D

where the x's and y's are as in note [5] under TABLE 2. This has the effect

of restoring the values of the locals L1, ..., Lp and parameters P1, ..., Pn

after the function call. The value A is whatever is left on the stack

after the function is done. Since all functions in C-BC return a single

value, there will always be an extra item on the stack after the function is

complete.

This listing leads to several more rules:

FUNCTION CALLS:

S:a1:...:an:f, E, @:C, D

--> S, (L1<-0,...,Lp<-0,P1<-a1,...,Pn<-an,E), Cf:#:C,

[P1,E(P1),...,Pn,E(Pn)]:[L1,E(L1),..., Lp,E(Lp)]:D

where n is the arity of the function, L1, ..., Lp its locals, P1, ..., Pn its

parameters, and Cf its string. All these items are listed with the function

in the environment so that f's entry in the environment will look like:

E(f) = ([P1, ..., Pn], [L1, ..., Lp], Cf)

Note that such an entry is established as a result of processing a function

definition. This specification, hwoever was left out of the SSDTG.

FUNCTION RETURNS:

S:A, E', #:C, [P1, p1, ..., Pn, pn]:[L1, l1, ..., Lp, lp]:D

--> S:A, (P1<-p1, ..., Pn<-pn, L1<-l1, ..., Lp<-lp,E'), C, D

So the format of the dump is a list of items of the form:

[[P1, p1, ..., Pn, pn], [L1, l1, ..., Lp, lp]]

In the C-BC interpreter, the dump doesn't really take on this form. Instead

each variable is represented by name and its corresponding environment entry

is actually a stack of values instead of just one. When parameters and

locals are initialised the new values are pushed on the corresponding variable

stacks, which are popped upon the function's return.

However, this is the process that was actually applied to the code itself

to arrive at the interpreter, minus the function-call recursion. Originally,

I was going to just leave it alone but there were some other parts of the

interpreter that needed to explicitly manipulate the dump.

--

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.