Define C

mark@omnifest.uwm.edu (Mark Hopkins)
8 Jun 1996 22:12:31 -0400

          From comp.compilers

Related articles
Define C mark@omnifest.uwm.edu (1996-06-08)
| List of all articles for this month |

From: mark@omnifest.uwm.edu (Mark Hopkins)
Newsgroups: comp.compilers
Date: 8 Jun 1996 22:12:31 -0400
Organization: Omnifest
Keywords: C, theory

This is the outline of an approach in which the C programming language is
reduced to primitive form which is directly represented within an extended
Lambda Calculus. An evaluation mechanism, analogous to Landin's SECD
Machine, can then be rigorously defined which will process an expression in
the extended Lambda Calculus exactly as an imperative machine would, when
the expression is the representation of a program segment -- all without
explicitly being told to do so.


      How is the Lambda Calculus extended? By allowing its terms to be infinite
in size. The structure of an infinite lambda term will exactly mimick the
control flow graph of a program which is how we get away with defining
programs without states or machines. By extending the definition of alpha
conversion it is even possible to come up with a new and much more compact
notation for expressing lambda terms which also has the benefit of directly
embodying pointer semantics. Once again, we get away with doing this without
ever having to refer to a machine, or memory state.


      The basic correspondence was first noted by Landin in 1965:


                                X = E; Q <--> (lambda X.Q) E


for which he coined the syntax "let X = E in Q". I'll retain the C-like
syntax since it's our intent to show how C itself can be regarded as nothing
more than a notation for the extended lambda calculus.


      There is a common conception about how an imperative program and
functional program differ. One says that "variables can be updated and
there are side effects" in an imperative program, whereas in a functional
program "variables uniquely denote single values forever". In reality that
is an illusory distinction: either semantics can be used with either language
type and indeed one has the following combinations of (imperative/functional)
semantics with (imperative/functional) languages:


                              Functional Language Imperative Language
Funct. Seman. The reduction of the <What I'm about to show>
                              language to the Lambda Calc.


Imp. Seman. The evaluation of the The processing of the language
                              language on a SECD-like on an imperative machine
                              machine.


So in reality, it's not the language families that are different, but the
semantics that are commonly used with them. As an example:


                                    (lambda x. (lambda x.x+2) x+1) x+1


has a functional semantics in which each level of scoping of the variable
"x" represents an instance of a DISTINCT variable that just happens to have
the same name. This expression could just as well be written:


                                    (lambda u. (lambda v.v+2) u+1) x+1


The imperative semantics of this expression is that expression by applying
the Landin correspondence: x = x + 1; x = x + 1; x + 2 -- a "single" variable
named x is incremented twice and then the value x + 2 is returned.


      Thus, we see that "variable updating" and "name clashes" represent exactly
the same thing, just seen under different semantic filters. This provides the
theme behind what's to follow: every feature commonly attributed to an
imperative language is actually seen to be a feature of the lambda calculus
in disguise, hidden by the fact that it is being interpreted under a different
semantics. What's interesting is that this fact, in virtue of the existence
of SECD-like machines as I described above, which can evaluate any lambda
term which is reducible in applicative order, is largely reversible! Most
features of the Lambda Calculus may be rendered an imperative interpretation.


      Also, note that in the example above the corresponding conversion would
look like this: u = x + 1; v = u + 1; v + 2. So you can at last see:
variables aren't actually being UPDATED (nooooo!) they're really just
different variables that just happen to have the same name, but different
scopes (which in imperative semantics translates as: "different lifetimes").


(1) Rational Infinite Lambda Expressions and Control Structures
      But wait, you might say: what about while loops? How many variables are
in a while loop that correspond to any variable inside its body which is
assigned to? Isn't this where the correspondence breaks down? Indeed, this
is exactly where Landin's original 1965 treatment ran aground and is why
you don't hear nowadays how and why imperative and functional languages are
the same thing in different guises. Well, I aim to correct this matter.


      The answer is yes. If we're to consistently apply our principples then
we must admit indeed that there are in infinite number of distinct variables
in the body of a while loop and that the loop, itself, is really nothing
more than the compact notation for an infnitely nested set of conditional
expressions. In particular, we make the correspondence:


                              while (E) S <--> E? (S x): Q
                              Q


where x denotes a copy of the entire expression on the right: E? (S x): Q,
and where (S x) denotes what would result if we further applied this
correspondence and others to the (S x) combination. For example:


                      I = 0; J = 1; while (I < N) { I++; J *= 2; } J


when written in full would look like this:


                      I = 0; J = 1; (I < N)? (I = I + 1; J = J * 2; (I < N)?...: J): J


or more compactly as:


                      I = 0; J = 1; x where x denotes the expression
                                                                (I < N)? (I = I + 1; J = J * 2; x): J


An expression, such as this, which though infinite in size has only a
finite number of distinct subterms, is called rational (infinite). As
a general rule: control structures correspond to rational infinite lambda
terms.


      Note the use of the label x and its reference within the expression itself.
Doesn't this look a lot like a goto label and goto statement? We could have
written the while loop in equivalent form like so:


                    I = 0; J = 1;
              x: if (I < N) { I++; J *= 2; goto x; } else goto y;
              y: J


Thus we arrive at the root of our correspondence here: goto labels denote
subexpressions which comprise both the statement labeled and everything
that comes after it. A goto statement corresponds to an embedded copy of
the subexpression labeled by goto label which the statement refers to.


      What we've been calling "states" all along are actually expressions!


      A complete set of correspondences may thus be written down for every
possible combination (S Q) of a statement followed by Q, which represents
"what comes after" and which can denote any expression or any combination
of expressions (or even things which the language itself does not have
notation for but which we define anyhow for the sake of carrying out an
analysis). The ultimate question of "what comes after" the main program
can then be answered: anything that's pertinent to the analysis which
we're trying to figure out the behavior of. This could include file streams,
peripheral I/O streams, memory blocks or just expressions.


      There's no real sense in carrying out the reduction in detail here since
in most cases it's fairly obvious where the labels and "goto"'s go for a
given control structure. But what should be pointed out is that everything
will then be reduced to one of the primitive forms:


              (Expression evaluation followed by Q): E; Q
              (Conditional expression): E? Q: Q
              (Return): return [E]; Q


In the latter case, as with the other control-breaking statements (goto x,
continue; break;) the final outcome of the reduction does not involve Q.
Another thing that should be pointed out is that a labeled statement followed
by Q, (x: S) Q is interpreted as the definition of a SUBEXPRESSION x: (S Q).
So this is the mechanism by which expressions are embedded in other expressons
(possibly themselves).


      Now reconsider the expression in the example cited above:


                          I = 0; J = 1; x x: (I < N)? (I = I + 1; J = J * 2; x): J


We noted that a variable such as I or J occurs an infinite number of times,
one for each lambda operator for each copy of x (noting that x is actually
the lambda expression: (I < N)? (lambda I. (lambda J.x) J*2) I+1: J.


      How do we reduce this? Related questions are: how do you substitute a
values for a variable in an infinite lambda expression, and which variables
are free in that expression (and which are bound)? Fortunately, since in
the lambda calculus, the definitions of substitution, free and bound variables
only rely on the LOCAL structure of an expression, it does not actually
matter whether the expression itself is infinite in size or not, the exact
same definition applies! It only looks somewhat complicaetd as the price
we pay for writing them in compact, finite, notation. For instance if
you substitute 3 for I in x, you get (3<N)?(I = 3+1; J = J+2; x): J,
but if you substitute 32 for N in x, you get a term (X<32)?(I=I+1;J=J*2;y):J
where y is the term that resulted from making this substitution (i.e. the
substitution propagates all the way down to infinite since there are no
lambdas blocking it).


      The set of free variables for the subexpression x would have to be
defined in a way that reflects the cyclic structure of x:


                                          FR(x) = {N,I,J} union FR(x)


The least solution to this formula (which is {N, I, J}) then gives you
FR(x).


      Does this remind you of the kinds of equations you would see in variable
flow analysis? If one defines USED(S) as the set of variables used by a
statement S before being defined, and DEFINED(S) as the set of variables
which are defined before being used, then one can write down the equation:


                                FR(S Q) = USED(S) union (FR(Q) - DEFINED(S))


which (a) defines the correspndence between variable flow analysis and
variable analysis in the lambda calculus, (b) shows we could actually
*derive* definitions for USED and DEFINED from the definition of FR and
(c) we can therefore dispense with USED and DEFINED entirely and work with
FR instead.


(2) Algebraic Infinite Lambda Expressions and Call-Returns
      This time consider the following program sequence:


                              x = 1; P(5); x


where P(n) is defined as the following procedure (written in a C-like
syntax)


                                P(n) { if (n > 0) x *= n, P(n - 1); }


We ask the same question: how many distinct copies of n are there in this
program segment? Once again the answer is Infinity, if we are to apply
our principles in a consistent manner. Also: what comes after the embedded
call to P? Implicitly, we note the existence of the return statement


                          P(n) { if (n > 0) x *= n, P(n - 1); return; }


so where is the "return" do a "goto" to?


      The combination P(n); Q is actually an infinitely sized expression the
structure of whose subexpressions is (potentially) so vastly infinite that
not even a finite number of labels would suffice to label all the distinct
ones. Yet, we can still arrive at a compact notation by allowing labels
to carry "parameters" within them which can be substituted by subexpression,
thus yielding a potentially infinite number of distinct subexpressions for
that label. In particular, here, the sequence P(n); x will correspond to
the instantiation P[n, x], where P is a parametrized label defined by:


                                        P[n, Q]: ((n > 0)? (x = x*n; P[n-1,Q]): Q)


The extra label Q is where the return statement "goes to", so each
occurence of "return" (including the implicit one at the end, if need
be) is replaced by Q.


      The original example therefore wil correspond to the infinite lambda
expression:
                                                x = 1; P[5, x]


which when written out in full would look something like this:


                x = 1; (n>0)?(x = x*n; (n-1>0? (x = x*(n-1);...): x): x


where the more deeply nested instantiations of P would contain copies
of (n-1)-1, ((n-1)-1)-1, etc. respectively, thus generating an infinitely
sized lambda expression now with an infinite number of DISTINCT subterms.


      A (possibly) infinite expression whose subterm structure can be
specified by the use of parametrized labels is referred to as Algebraic
(infinite).


      As another example, suppose we had the sequence


                                                              x = P(3); x


where now P(n) is defined as the procedure


                            P(n) { if (n <= 0) return 1; else return P(n-1)*n; }


This time, we represent the procedure by a parametrized label where there
are now two extra parameters:


                      P[n,v,Q]: (n <= 0)? (v = 1; Q): P[n-1, w, (v = w*n; Q)]


Note the appearance of the auxilliary variable w. This arose by
rewriting the procedure in the form:


                  P(n) { if (n <= 0; return 1; else { w = P(n-1); return w*n; }}


first and then applying the transformation.


      The example above will then be transformed into the lambda expression


                                                              P[3,x,x]


which I'll leave to you to write out in all its infinite glory (since it
takes a long time to completely unroll an infinitely big expression from
its compact notation).


      Thus there is a slight difference between the way void procedures and
value-returning procedures are made to correspond to infinite lambda
expressions.


      With the preliminaries out of the way, we can then conclude the following:


                            All sequences (S Q) of statements S following by combinations
                            of values in Q may be reduced to an algebraic infinite
                            expression involving only the combinations (E; Q) and
                            (E? Q: Q), in addition to the lambda operators.


Thus, to define a language such as C it is really only necessary to look
at how its expressions behave. The statement part of the language is
completely encapsulated in terms of the algebraic infinite lambda terms
used.


      To carry out this definition, we will have to adopt a convention which
will make expressions deterministic: all concurrent "side-effects" are
considered to be evaluated from left to right, e.g., (i++ + j++) is
regarded as equivalent to the sequence (u = i; i++; v = j; j++; u + v).


      Systematically, we may apply reductions to the combinations (E; Q)
and (E? Q: Q) and these will only spawn smaller combinations of the
same type (which may be reduced by the same means) and of the following
type (u = E; Q), where u is an "auxillary variable" which we use solely
to carry out the analysis with. Ultimately, everything will reduce to
a set of primitive combinations which could even be taken to be a kind
of intermediate code. This reduction is indicated in the following
charts:


E E; Q E? Q: R u = E; Q
--- ---- ------- --------
A,B A;(B;Q) A; (B?Q:R) A; (u=B;Q)
A?B:C A?(B;Q):(C;Q) A?(B?Q:R):(C?Q:R) A?(u=B;Q):(u=C;Q)
A||B A?Q:(B;Q) A?Q:(B?Q:R) u = A?1:(B?1:0); Q
A&&B A?(B;Q):Q A?(B?Q:R):R u = A?(B?1:0):0; Q
!A A; Q A? R: Q u = A?0:1; Q
A rel B A; B; Q u=A;v=B;u rel v?Q:R v=A; w=B; u=v rel w; Q
rel denotes one of the relative comparison operators ==, !=, <, >, <=, >=


A op B A; B; Q u=A op B; u? Q: R v=A; w=B; u=v op w; Q
op denotes one of the operators + - * / % & | ^ << >>


op A A; Q u = op A; u? Q: R v = A; u = op v; Q


P(A...Z) a=A;...;z=Z; (not allowed) (not allowed)
                        P[a...z,Q]
where P is a void function. Parameters are pre-evaluated, thus the
assignments a=A, ..., z=Z.


P(A...Z) u=P(A...Z);Q u=P(A...Z);u?Q:R a=A;...;z=Z;P[a...z,u,Q]
where P is a value-returning function


X Q u=X; u? Q: R <u=X;Q is in primitive form>
where X denotes a variable


C Q either Q or R <u=C;Q is in primitive form>
where C denotes a constant


&L L;Q u=&L; u?Q:R <see below>
A[B] A;B;Q u=A[B]; u?Q:R v=A;w=B; u=v[w];Q
A.X A;Q u=A.X; u?Q:R v=&A;u=v->X;Q
A->X A;Q u=A->X; u?Q:R v=A;u=v->X;Q
*A A;Q u=*A; u?Q:R v=A;u=*v;Q


L inc u=&L [(*u)inc;Q] u=&L [(*u)inc?Q:R] v=&L[u=*v;(*v)inc;Q]
inc L u=&L [(*u)inc;Q] u=&L [inc(*u)?Q:R] v=&L[(*v)inc;u=*v;Q]
where inc denotes one of the operators ++ or --


L op A u=&L u=(L op A); u?Q:R v=&L [w=A;(*v)=w;u=*v;Q]
                    [v=A;*u op v;Q]
where op denotes any of the assignment operators = += -= *= /= %= &= ^= |=
<<= or >>=.




The variables v and w, like u, are auxillary variables used only in the
context of the analysis and are intended to be unique to each application
of each rule (so as to avoid name clashes and the like). The form
u=&L[...*u...] is indicated only schematically and is completed in
different ways depending on which assignable L-value L corresponds to.


The rest of this reduction will be presented in a followup article, along
with the remainder of this presentation.
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.