Fundamentals for a Compiler and Language (was: Writing A Plain English Compiler)

From: federation2005@netzero.com
Newsgroups: comp.compilers
Date: Mon, 29 Dec 2014 13:30:10 -0800 (PST)
Organization: Compilers Central
References: 14-11-004
Injection-Date: Mon, 29 Dec 2014 21:30:10 +0000
Keywords: design
Posted-Date: 29 Dec 2014 18:09:55 EST

On Tuesday, November 4, 2014 10:27:29 PM UTC-6, Gerry Rzeppa wrote:
> [And with that, this thread is over unless someone has something
> to say that's related to compilers. Thanks, all. -John]
> Exactly how does one go about writing a Plain English compiler?


What you should be focusing on (and, to sort of abide by the moderator's
wish) is what goes on in the semantics and on the back end, more so than how
the language is presented to the user.


That means that the "fundamentals"
> 1. Type definitions;
> 2. Global Variable definitions;
> 3. Routine Headers;
> 4. Conditional Commands; and
> 5. Unconditional Commands.
have to be properly addressed; and first you have to ask whether these really
are the fundamentals, and if not, then what they should be.


In fact, you can get away with just the following for the procedural element
(a minimal sketch in code follows the list):
      * the comma statement A, B (do A, discard its results, then do B [with
results])
      * the conditional A? B: C. The comma can be treated as a special case A, B
= A? B: B.
      * recursive definitions of continuation or call-return segments. Doing
things by continuations is preferable, since you can define call-return
semantics in terms of them, but not so easily the other way around.
      * the ability to reference continuations (i.e. gotos).
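
To make that list concrete, here is a minimal sketch of how those four
primitives might be modelled (in Haskell; the names and the one-integer state
are my own, purely illustrative): statements are continuation transformers,
the comma and the conditional are the only composition forms, and a loop is
nothing but a continuation ("label") defined recursively in terms of itself.

-- A minimal sketch of the four primitives over a one-integer state.
type State = Int
type Cont  = State -> State   -- a continuation: "the rest of the program"
type Stmt  = Cont -> Cont     -- a statement transforms its continuation

-- the comma statement A, B: do A, discard its results, then do B
comma :: Stmt -> Stmt -> Stmt
comma a b k = a (b k)

-- the conditional A? B: C (here A is a test on the state)
cond :: (State -> Bool) -> Stmt -> Stmt -> Stmt
cond e b c k st = if e st then b k st else c k st

-- a primitive command that updates the state
command :: (State -> State) -> Stmt
command f k = k . f

-- cyclic control flow: the loop entry w is a continuation ("label")
-- defined recursively in terms of itself, i.e. a goto target
while :: (State -> Bool) -> Stmt -> Stmt
while e s k = w
  where w st = if e st then s w st else k st

-- usage: count the state down to zero
main :: IO ()
main = print (while (> 0) (command (subtract 1)) id 5)   -- prints 0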


Fortran actually headed in that direction early on (as did the scripting
language DC), but decided to fall back in more recent years.


The distinction between locals and globals has to go one level deeper: it must
also include the distinction between thread-local and shared.
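
As a rough illustration of that finer distinction (Haskell again; MVar and
IORef are just stand-ins I chose, nothing prescribed by the above): "shared"
means one cell visible to every thread, "thread-local" means each thread gets
its own fresh cell.

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Monad (forM, replicateM_)
import Data.IORef

main :: IO ()
main = do
  shared <- newMVar (0 :: Int)            -- one cell, visible to every thread
  dones  <- forM [1 .. 4 :: Int] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      local <- newIORef (0 :: Int)        -- a fresh cell per thread: thread-local
      replicateM_ 1000 $ do
        modifyMVar_ shared (pure . (+ 1)) -- every thread updates the same cell
        modifyIORef' local (+ 1)          -- each thread updates only its own
      readIORef local >>= putMVar done
    pure done
  locals <- mapM takeMVar dones
  total  <- readMVar shared
  print (total, locals)                   -- (4000,[1000,1000,1000,1000])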


Finally, as for type systems: it appears that contemporary languages have been
progressively trying to incorporate more and more of the Curry-Howard
isomorphism (or "Type-Proposition" correspondence) and, in particular, the
various modifications that try to extend this correspondence to include
quantifiers in predicate logic (which then requires indexed or parametrized
types). These are embodied by the various higher-order type theories that have
cropped up since Carnac and Goedel (Goedel's Dialectica interpretation,
Martin-Loef, Carnac's systems B and C, Lambek/Scott's formalism). I think C++
was moving in the direction of Hindley-Milner type theory.
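
To give one small illustration of what "indexed or parametrized types" buys
you (a hypothetical Haskell/GADT sketch of my own, not tied to any of the
systems named above): a vector type indexed by its length lets a function
signature state the proposition "the argument is non-empty," and the type
checker verify it.

{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- natural numbers promoted to the type level, to serve as indices
data Nat = Z | S Nat

-- a vector whose type carries its length as an index
data Vec (n :: Nat) a where
  Nil  :: Vec 'Z a
  Cons :: a -> Vec n a -> Vec ('S n) a

-- the type is the proposition "the vector is non-empty"; the body is its
-- proof: no case for Nil is needed, and none would typecheck
safeHead :: Vec ('S n) a -> a
safeHead (Cons x _) = x

main :: IO ()
main = print (safeHead (Cons (1 :: Int) (Cons 2 Nil)))   -- prints 1
-- safeHead Nil is rejected at compile time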


This aspect of the language design -- which is (in fact) the aspect that bears
the greatest affinity to natural language (think Montague semantics) -- is
highly non-trivial. This is witnessed by the large number of ways that
different languages have tried to approach this issue.


In particular, what's non-trivial about it is that objects which can be named
and addressed can be contained in or contain other objects. Fortran actually
went a step further allowing *overlapping* objects (by way of the equivalence
statement). Languages like C++, in contrast, seem to shy away from this.


The Lambda operator *can* be incorporated *seamlessly* into any of these
languages, provided you recognize the continuation for what it really is: an
infinitary lambda expression (i.e. an expression with an infinitary parse
tree). So, for instance, the cyclic statement (while (E) S) followed by a
continuation Q would correspond to the infinitary expression
E? S (E? S (E? S (...): Q): Q): Q, which is called "rational" since it has
only a finite number of distinct subexpressions. A goto label is simply a
label of a subexpression, and a goto statement is just a compact reference to
the labelled subexpression.
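
A small sketch of that reading (again Haskell, with made-up constructor
names): the conceptually infinite conditional tree is a lazily
self-referential value, so the expression for while (E) S has only one
distinct loop subexpression, and the name bound to it plays the role of the
goto label.

-- a tiny expression tree: conditional, command application, terminal Q
data Expr = Cond String Expr Expr   -- E ? then-branch : else-branch
          | Apply String Expr       -- S (...): the command applied to what follows
          | Stop String             -- the final continuation Q

-- "while (E) S" followed by Q as a rational infinitary expression:
-- the label w names the one repeated subexpression; referring to w is the goto
whileExpr :: String -> String -> String -> Expr
whileExpr e s q = w
  where w = Cond e (Apply s w) (Stop q)   -- w occurs inside itself

-- print a finite prefix of the (conceptually infinite) unfolding
unfold :: Int -> Expr -> String
unfold 0 _            = "..."
unfold n (Cond e t f) = e ++ "? " ++ unfold (n - 1) t ++ ": " ++ unfold (n - 1) f
unfold n (Apply s k)  = s ++ " (" ++ unfold n k ++ ")"
unfold _ (Stop q)     = q

main :: IO ()
main = putStrLn (unfold 3 (whileExpr "E" "S" "Q"))
-- prints: E? S (E? S (E? S (...): Q): Q): Q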


This works quite well with continuation semantics, since the call and return
are already factored out by the time you get to this semantics. All you have
are jumps. So, I won't belabor how you represent calls and returns in
continuation form.


The correspondence goes all the way back to Landin, via the following
equivalence for assignments to simple variables (x = A, Q) <-> (lambda x.Q) A,
where Q is a continuation. But Landin never incorporated the above-mentioned
insight on cyclic control-flow structures, so the whole programme languished
in the 1960's.
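
In Haskell terms (a hedged illustration of the idea, not Landin's own
notation), the equivalence is just the familiar fact that an assignment whose
scope is the rest of the program is the application of a lambda:

-- (x = A, Q)  <->  (lambda x. Q) A
asAssignment :: Int
asAssignment = let x = 6 * 7 in x + 1     -- "x = A, Q"

asApplication :: Int
asApplication = (\x -> x + 1) (6 * 7)     -- "(lambda x. Q) A"

main :: IO ()
main = print (asAssignment == asApplication)   -- True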


Ultimately, the correspondence is of the following form: to each statement S
there corresponds a context S[] (i.e. an infinitary expression with a missing
subexpression) such that S[Q] is the result of applying Q after executing S.
Thus, for instance, the context corresponding to the simple assignment
statement x = A is just (lambda x.[]) A.


But the non-trivial part is the type system, the objects residing within it,
and pointers. What's non-trivial about it is what goes on with the
object-subobject hierarchy; i.e. what does lambda V[I].Q mean, for instance,
for element I of the vector V? There is, in fact, a facility native to the
lambda calculus for addressing this very issue: the equivalence
      lambda V[I].Q = (V[I] = [], Q)
      = (V = (lambda J.(I == J? []: V[J])), Q)
      = (lambda V.Q) (lambda J.(I == J? []: V[J]))
which effectively treats the structured object as a function applied to its
"designators" (here, the index [I]) and the assignment as a partial update to
the function itself.
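
Read as ordinary functional code (a hypothetical Haskell rendering, with the
assigned value written in place of the hole []), the equivalence says an
assignment to V[I] produces a pointwise update of V viewed as a function of
its index, which is then handed to the continuation Q:

-- a vector modelled as a function from indices to values
type Vector a = Int -> a

-- (V[I] = x, Q): update V at index i, then continue with Q on the new V.
-- This is (lambda V. Q) (lambda J. I == J ? x : V[J]).
assignAt :: Int -> a -> Vector a -> (Vector a -> r) -> r
assignAt i x v q = q (\j -> if i == j then x else v j)

main :: IO ()
main = do
  let v = \j -> j * 10                               -- V[j] = 10*j
  assignAt 2 99 v (\v' -> print (map v' [1, 2, 3]))  -- [10,99,30]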


A good translation method will be one that treats *all* statements equivalent
to this form the same way (no matter whether they come from an assignment to a
designated subobject or not), while avoiding any unnecessary structure-wide
copying of unchanged elements.


A proper account of how pointers fit into this meshes completely with the
question of how the inheritance and object/subobject hierarchies are defined
in the language. For each object type T, the corresponding pointer operator *
is actually a universal object (i.e. the type-cast pointer operator *(T *)),
so that if P is a pointer variable, then P is the designator that bears the
same relation to *P that the index I bears to the component V[I]. This fits
snugly into the type/subtype hierarchy in such a way that if T' is derived
from T, then the operator *(T' *) is a subobject of *(T *).
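
Under the same assumptions as the vector sketch above (hypothetical Haskell),
the pointer operator for a type T can be modelled as one big object: a
function from pointer values to Ts. Then *P is that object applied to the
designator P, and assignment through a pointer is the same partial-update
pattern as assignment to V[I].

-- the "universal object" for a type t: everything reachable through (t *)
newtype Ptr = Ptr Int deriving (Eq, Show)
type Deref t = Ptr -> t          -- the operator *(T *), as one big object

-- *P is the universal object applied to the designator P
deref :: Deref t -> Ptr -> t
deref star p = star p

-- (*P = x, Q): assignment through a pointer is a partial update of the
-- universal object, exactly as in the V[I] case
assignThrough :: Ptr -> t -> Deref t -> (Deref t -> r) -> r
assignThrough p x star q = q (\p' -> if p' == p then x else star p')

main :: IO ()
main = do
  let star0 = \(Ptr n) -> n * 2                        -- initially *(Ptr n) == 2n
  assignThrough (Ptr 1) 99 star0 $ \star1 ->
    print (deref star1 (Ptr 0), deref star1 (Ptr 1))   -- (0,99)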

