Re: Object Oriented Compiler Construction



Newsgroups: comp.compilers
From: chase@Think.COM (David Chase)
Keywords: OOP, design
Organization: Thinking Machines Corporation, Cambridge MA, USA
References: 94-01-010
Date: Wed, 5 Jan 1994 22:38:14 GMT

I've worked on compilers written (or partially written) in
"object-oriented" languages. The Olivetti Modula-3 front-end (mostly
written by Mick Jordan and Trevor Morris) was originally coded in
Modula-2+ with the AST described in IDL (using multiple inheritance in
some instances). It was then ported to Modula-3 (which only has single
inheritance). I'm more or less repeating Mick's opinions of the exercise,
but what I recall was that single inheritance was not an incredible
burden. Multiple inheritance would have been nice, but it was not
necessary. The instances where it would have been nice include things
like (and this depends upon how you organize your "tree") "names", which
can be expressions (with values, types, etc) and can also be named things
(with associated symbol table information). The one interesting thing
about this is that the workaround for no MI was very much like the
implementation that is used for (virtual) MI within C++ -- namely, the
object contains a pointer to the type it ought to have inherited from.
This approach comes with constraints similar to those in C++ -- you cannot
get from the "base" class back to the "derived", just like in C++ (note
that in C++, you CAN get back if the base class contains virtual functions
overridden in the derived class, and that any derived-to-base conversions
are performed automatically). But it turned out that this was not
something that we needed to do (at least, that's what I recall).
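To make that workaround concrete, here is a minimal C++ sketch (the node
and field names are invented for illustration): a Name inherits from Expr
and carries a pointer to the SymbolInfo it "ought to have inherited from",
much like the hidden pointer C++ itself uses for virtual base classes.

```cpp
#include <string>

// Hypothetical node types sketching the no-MI workaround: instead of
// Name inheriting from both Expr and SymbolInfo, it inherits from Expr
// and *contains a pointer* to the SymbolInfo.
struct SymbolInfo {
    std::string symbol;
    // Note: no pointer back to the Name. As described above, you cannot
    // get from this "base" back to the "derived" object.
};

struct Type { std::string name; };

struct Expr {                        // the single-inheritance base
    Type* type = nullptr;            // expressions have values, types, etc.
    virtual ~Expr() {}
};

struct Name : Expr {                 // a name is an expression...
    SymbolInfo* info;                // ...and points at its symbol info
    explicit Name(SymbolInfo* i) : info(i) {}
};
```

Navigation is one-way, matching the constraint described: from the Name
you reach the SymbolInfo in one hop, but the SymbolInfo knows nothing
about the Name that refers to it.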

There was some attempt at factoring the AST into the "views" by the
various components, and this was done using opaque supertypes. Multiple
inheritance might have worked better here, or different rules for opaque
inheritance. I thought it was convoluted, but this may have just been a
consequence of dealing with the rules concerning opaque inheritance (there
is some sort of a restriction in Modula-3 that I remember as "not as
opaque as I'd like").

The single-inheritance type system worked just as you might expect for
getting rid of silly unions. Note that M-3 has checked downcasting (aka
NARROWing), garbage collection, and exception handling. The exception
handling was really quite nice -- I cannot emphasize enough how it can
change your style to know that you are (1) working in a safe language and
(2) able to throw exceptions. (Without the safe language constraint, you
can damage your abstract machine, in which case the only recourse is to
exit the entire program. With a safety guarantee, you can reliably throw
the exception to someplace where you know that things will work properly.)
This allowed me, in the code generator, to just "give up" in some sticky
situations (e.g., out of memory), without really worrying where I was
giving up to (better for modular design -- I don't need to know the magic
name of the error reporter for this compiler). Because of the clean exit,
it was still possible to write a long-running compiler server that would
recover from an occasional error.
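The "just give up" style might look like the following C++ sketch (the
names emit, compile_all, and CompileError are invented for illustration):
the code generator throws without knowing the magic name of the error
reporter, and a long-running driver loop recovers per compilation unit.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Invented exception type: thrown from anywhere, caught at the top.
struct CompileError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Deep inside code generation: hit a sticky situation and give up,
// without worrying about where we are giving up to.
int emit(const std::string& unit) {
    if (unit.empty())
        throw CompileError("sticky situation (e.g., out of memory)");
    return static_cast<int>(unit.size());   // stand-in for emitted code
}

// The compiler-server loop: because a throw leaves the abstract machine
// intact, one failed unit does not force the whole process to exit.
int compile_all(const std::vector<std::string>& units) {
    int ok = 0;
    for (const auto& u : units) {
        try { emit(u); ++ok; }
        catch (const CompileError&) { /* report and continue */ }
    }
    return ok;
}
```

Of course C++ gives no safety guarantee the way Modula-3 does, so this
only captures the shape of the style, not the reliability argument.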

At Sun, I worked on an optimizing back-end written in C++. My experience
with that "object-oriented" language was very mixed, and it's not clear
whether the best and worst parts had anything to do with objects at all.
Part of the problem was that we had six people learning C++ along the way,
as the language (and compilers, and OS) changed underneath us.

+ Best had to be mandatory prototypes.

+ I'm generally a fan of the object-oriented style, meaning that state is
packaged up, hidden, and attached to behavior. Of course, this goes way
back (I recall reading about modular design and information hiding in 1980
in papers by Parnas and Habermann). Just in case anyone has forgotten the
advantages of this, I'll repeat them. You can have multiple instances of
a "thing" (e.g., multiple dominator trees, forwards, backwards, and over
subsets of the control flow graph. Very Very Nice.) You can have a
defined interface through which your data structures are modified, and you
can hope to assure yourself of correct behavior of the data structures
without understanding the entire program. You can swap out different
versions of a data structure, as long as the interface remains the same.
In some instances, it assists in unit testing (this can get pretty
difficult in a full-blown optimizer).

There are at least two rules for choosing "what should be an object". One
rule is to look for things like AST nodes, or quads -- basically,
something where you have a great number of different "types" that all
share some basic characteristics (e.g., all AST nodes are tree nodes, and
correspond to some point in the original program). In that case, what
you're "really" doing is using inheritance to get a little bit of
polymorphism out of your data types.
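A minimal C++ sketch of this first rule (node names and fields invented
for illustration): many node "types" sharing the basic tree-node
characteristics, with inheritance supplying a little polymorphism.

```cpp
#include <memory>
#include <string>

// Every node corresponds to some point in the original program.
struct AstNode {
    int line;
    explicit AstNode(int l) : line(l) {}
    virtual std::string kind() const = 0;   // the bit of polymorphism
    virtual ~AstNode() {}
};

struct IntLit : AstNode {
    long value;
    IntLit(int l, long v) : AstNode(l), value(v) {}
    std::string kind() const override { return "IntLit"; }
};

struct BinOp : AstNode {
    char op;
    std::unique_ptr<AstNode> lhs, rhs;
    BinOp(int l, char o,
          std::unique_ptr<AstNode> a, std::unique_ptr<AstNode> b)
        : AstNode(l), op(o), lhs(std::move(a)), rhs(std::move(b)) {}
    std::string kind() const override { return "BinOp"; }
};
```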

Another rule is to look for "great big wads of state", such as the
control-flow-graph, dominator tree, and interval decomposition of the
basic blocks that make up a program. In this case, there is only one CFG,
but it may help to define the ways in which people can modify it and
examine it, both to enable (for instance) incremental update, and to allow
changes to the CFG data structures. One thing to note is that, at least
in a compiler, the interfaces are driven both by what would be a
"nice interface" for someone using program analysis or writing program
transformation tools, and by what operations you can actually support,
given efficiency and algorithmic constraints. It may not be a good idea
to write your incremental CFG package in terms of arbitrary edge/node
insertion/deletion -- it may be much easier, and about as useful, to work
in terms of other operations, such as "interpose node on edge", "add
header to region", or "duplicate region and interpose". You can see where
some of these things make sense for (for instance) loop code motion, or
insertion of register fill/spill code. For other things (e.g., software
pipelining) the graph update is probably too messy to perform
incrementally, at least in the first cut.
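The second rule might be sketched in C++ as follows (class and method
names are invented): one big wad of state behind a narrow interface that
exposes higher-level operations like "interpose node on edge" rather than
arbitrary edge/node insertion and deletion.

```cpp
#include <map>
#include <vector>

class Cfg {
    int next_id_ = 0;
    std::map<int, std::vector<int>> succ_;   // node -> successor list
public:
    int add_node() { succ_[next_id_]; return next_id_++; }
    void add_edge(int a, int b) { succ_[a].push_back(b); }
    const std::vector<int>& successors(int n) const { return succ_.at(n); }

    // "Interpose node on edge": replace a->b with a->n->b, returning n.
    // The kind of operation useful for loop code motion or inserting
    // register fill/spill code.
    int interpose(int a, int b) {
        int n = add_node();
        for (int& s : succ_[a])
            if (s == b) s = n;
        add_edge(n, b);
        return n;
    }
};
```

Because all mutation goes through these few operations, the class can
hope to support incremental update of derived structures (dominators,
intervals) without clients needing to know how.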

    (As an unrelated issue -- I am a big fan of incremental maintenance
      of data structures. Working with partially updated information is a
      terrible mess.)

- Worst had to be all the automatic conversion, overloading, and default
parameters, at least the way that we ended up using them. Those make it
easy to WRITE code, easy to SKIM code, but not easy to understand, debug,
or modify. I had a good time writing some of this code, but decided
(after modifying some code that someone else wrote, using those same
features) that perhaps the fun was not worth it in the long run. Remember
that PL/1 had lots of automatic conversions, and remember how much fun
those were.
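An invented illustration of the gripe: with a non-explicit constructor,
the call site reads like one thing but silently does another.

```cpp
#include <string>

struct Reg {
    int num;
    Reg(int n) : num(n) {}          // non-explicit: int converts silently
};

inline std::string emit(Reg r) { return "reg" + std::to_string(r.num); }

// At the call site, emit(3) skims as "emit the integer 3", but it
// silently constructs a Reg. Marking the constructor `explicit` would
// force the conversion to be visible: emit(Reg(3)).
```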

- I'm also not a fan of destructors. I'd much prefer to just have garbage
collection. I recall often being forced to think hard about whether to do
a deep copy or a shallow copy, a deep delete or a shallow delete. If
anything, the bugs that did occur were more confusing than in code with
straightforward memory management, because sometimes destructors would be
implicitly run. A coding style that emerged was that destructors did not
really do much -- every significant object type ended up with a special
"destruct" method for explicit use.
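That style might look like this sketch (the type and its fields are
invented): the real C++ destructor is deliberately trivial, and teardown
happens only through an explicit destruct() call.

```cpp
struct SymbolTable {
    int* entries = nullptr;
    int n = 0;
    bool destructed = false;

    void init(int count) { entries = new int[count](); n = count; }

    // Explicit, visible teardown -- the caller decides when it runs.
    void destruct() {
        delete[] entries;
        entries = nullptr;
        n = 0;
        destructed = true;
    }

    ~SymbolTable() {}   // does not really do much: no hidden side effects
};
```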

(I think the preceding two gripes can be summed up as "half-hidden behavior
is not a good thing". Implicit destruction/construction/conversion is
hidden because you don't see it in the source code, but it often has
visible side-effects, so it isn't as hidden as it should be. A corollary
of this can be made a part of a coding style -- implicit behavior should
be truly hidden.)

- I got very very annoyed at the concessions-to-stupid-linker requirement
that all the private data and functions appear in the class definition.
(This was especially annoying because I knew that all over the world,
people were spending a jillion dollars on development of their shiny new
C++ compilers, debuggers, analyzers, what have you, yet not doing anything
to improve the linkers). This made header files obnoxious to read, often
forced excessive recompilation (the language needs real modules and
interfaces), and tended to slow down compilation even further (parsing all
that header file crap that really did not need to be reparsed).
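One workaround that later became common (the "pimpl" idiom) pushes the
private data behind a pointer to an incomplete type, so the class
definition in the header stays stable while the private fields change.
This is a sketch with invented names; in a real project the two halves
would live in separate header and source files.

```cpp
struct OptimizerImpl;               // private data hides behind this

class Optimizer {
    OptimizerImpl* impl_;           // clients never see the fields
public:
    Optimizer();
    ~Optimizer();
    int passes_run() const;
};

// --- would normally be in the .cc file, invisible to clients ---
struct OptimizerImpl { int passes = 0; };
Optimizer::Optimizer() : impl_(new OptimizerImpl) {}
Optimizer::~Optimizer() { delete impl_; }
int Optimizer::passes_run() const { return impl_->passes; }
```

It trades an extra indirection and allocation for real separation of
interface from implementation -- a partial substitute for the modules
and interfaces the language lacks.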

- One non-intuitive problem was the eventual size of the outermost
"compilation object". It just kept on accumulating flags, fields, and
subobjects like the biggest ball of tar that I ever saw, and it was hard
(i.e., careful design was required) to partition the information among the
blob and its components -- it is clearly wrong (non-modular, in an
inverted way, and bad for code re-use) to have every tiny sub-object
needing knowledge of the top-level compilation blob. Nonetheless, there
are cases where things must be widely known -- for instance, if you are
generating position-independent code, it affects the way you generate
references (on Sparc) to floating point constants, which surprises you
when it comes time to expand a convert-float-to-unsigned "instruction"
into real code (it requires a floating point temporary). (It happens that
there was a different solution to this problem than the one that first
comes to mind -- it turned out to be better to modify the
float-to-unsigned "instruction" to take all its constant inputs as
parameters, thus moving their generation to where you knew about PIC-ness,
and exposing their generation to various other optimizations (CSE, base
register generation, scheduling)). Similar surprises occur each time you
need to generate a memory reference for some oddball reason (profiling and
PIC, code coverage and PIC).

We didn't use multiple inheritance. We made little use of virtual
functions (mostly, I used virtual functions). Templates didn't exist when
we started working on the compiler. Memory consumption does matter,
because someone will always try to shove a too-big program through the
compiler. (And if they don't have one handy, they can generate one by
concatenating all the files in a Fortran program into one big file, and
asking for lots of inlining.)

So, that's about what I can tell you about OOP and writing a compiler (my
former employer might not want me to tell everything I learned, and my
current employer might not want me to take the time.)

David Chase, speaking for myself
Thinking Machines Corp.
