# Adding garbage collection to C++

## Dain.Samples@UC.EDUTue, 11 Aug 1992 17:27:14 GMT

From comp.compilers

Related articles
Adding garbage collection to C++ Dain.Samples@UC.EDU (1992-08-11)
Re: Adding garbage collection to C++ tmb@arolla.idiap.ch (1992-08-12)
Re: Adding garbage collection to C++ mw@ki.fht-mannheim.de (1992-08-13)
Re: Adding garbage collection to C++ kelvin@kickapoo.cs.iastate.edu (1992-08-13)
Re: Adding garbage collection to C++ fjh@cs.mu.OZ.AU (1992-08-14)
Re: Adding garbage collection to C++ jos@and.nl (1992-08-14)
Re: Adding garbage collection to C++ henry@zoo.toronto.edu (1992-08-14)
[10 later articles]
| List of all articles for this month |

 Newsgroups: comp.compilers From: Dain.Samples@UC.EDU Organization: Compilers Central Date: Tue, 11 Aug 1992 17:27:14 GMT Keywords: C++, storage, GC

I have been doing some work on implementing garbage collection for C++,
but some of these comments will apply to GC for C. Alternatively, these
comments could also be taken to ask the questions: why not use C++ as an
IL?

I will summarize some of the contents of a paper I will be presenting at
the Int'l Workshop on Memory Management in Sept. A copy of the paper can
be ftp'd from the e-address below. I would be interested in any
commentary.

I have reached the conclusion that only conservative garbage collection
allows no changes to the existing language. Even so, as Hans Boehm has
pointed out, there are (relatively easily solved) problems with code
produced by highly intelligent optimizers (i.e., make them less
intelligent, or even more intelligent, depending on your point of view).
Any other GC-scheme that assumes no changes to C (or C++) requires
prodigious amounts of compiler-generated information to describe the
structure of runtime objects based on the most general of possibilities
(e.g., see the '92 SIGPLAN paper by Diwan, Moss, and Hudson), or
constrains the programming idioms that can be used by programmers. (Much
of the discussion about using C as an IL has focused on the observation
that a translator emitting C can in fact easily constrain itself.)

Since such constraints border on language redefinition anyway, I addressed
the question: what is a minimum number of language enhancements that could
be added to C++ to allow reasonable garbage collection to be implemented
as part of the runtime system? I haven't tried to back-pedal the language
changes to C since I take advantage of the class/object structure in C++
to simplify the declaration of *which* thingies are going to be collected.
I also have not totally removed the need for prodigious amounts of
information', but I do think it is more manageable and reasonably
generated.

latter keyword is used to declare pointers that can be used to point into
objects that are collected, and the other two are used to declare
collectible objects and allow the programmer to allocate normally gc'd
objects on the heap.

The long-range goal is to allow garbage collection to be treated as
malloc/free is treated in C programming systems: not really' part of the
language, but assumed to exist, assumed to work, assumed to have a certain
(minimal) functionality, and replaceable by the user. (An aside: does
anyone really want to argue that C without malloc [or other dynamic memory
allocator] is still C?) The paper describes the language enhancements to
C++, and describes how the compiler can generate data structures that
support many flavors of GC, including copying, conservative,
mark-and-sweep, etc.

I'm currently working on a prototype that implements a basic, non-copying
mark-and-sweep collector with conservative collection of the runtime
stack. I hope to have it completed by Sept.

If you have read this far and still aren't sure whether the paper would
interest you or not, I include a section of it that describes some of the
sub-goals of the design:

To streamline what follows, we define some terms. C++ classes that have
been declared collectible are referred to as {\em gc-classes\/} or {\em
gc-types}, their instantiations as {\em gc-objects}, and the space in
which they are collected as {\em gc-space}. Objects that are pointed to
by an application from outside gc-space are said to be {\em rooted}, and
the pointers are called {\em roots}. Gc-objects to which there exists a
path of pointers from a rooted object are said to be {\em grounded};
rooted objects are trivially grounded. Pointers to or into gc-objects are
called {\em gc-pointers}. Gc-pointers {\em into\/} gc-objects are also
called {\em embedded\/} pointers (also called {\em interior\/} pointers by
some). Variables that contain bit-patterns that could be interpreted as
gc-pointers but might actually be something that just happens to look like
a gc-pointer (e.g., an integer, a sequence of characters, etc.) are called
{\em ambiguous roots}. The act of informing the GC system that a pointer
may point to a gc-object is called {\em registration}.

As mentioned, the primary goal of this design is to retro-fit C++ with
garbage collection facilities while not changing the spirit of the
language. That spirit, based on the spirit of C, can be summarized
provocatively as: (1) Programmers should be able to do anything they want,
subject to their willingness to deal with the consequences (side-effects
and interactions of features whose invariants are violated). (2) No one
pays for any feature of the language that they do not use.

Based on this spirit, we consequently adopt the following goals for the
system.

Adding GC to C++ should not force overhead on users that do not use it.
When GC is used, it should not interfere with software modules that do not
use it. GC must not interfere with the current and/or standard
definitions of the language. That is, a GC facility should not cause
re-definition or loss of language features.

The dual of this requirement is that programmers that use GC must take the
responsibility either for not violating the invariants of the design, or
for knowing how to deal with the consequences. Almost any restriction
imposed by a GC-cooperative C++ compiler can be bypassed by a sufficiently
determined programmer. The compiler therefore can take responsibility
only for maintaining its GC invariants, but cannot be expected to find any
and all ways in which users may have violated those invariants.

The language should be changed as little as possible. A strict goal of no
change was seen rather quickly to be unrealistic. Either all runtime
objects would have to be garbage collected---a {\em major\/} change to the
design of the language and a violation of the spirit of the language---or,
at a minimum, programmers would need language support to be able to
specify which (classes of) objects are to be garbage collected. A design
consequence of this goal is that a program can have two dynamic memory
spaces: the space dedicated to objects that are to be garbage collected,
and the usual heap' of dynamically allocatable memory.

Use of a garbage collector must not impose a complete dichotomy on runtime
objects: collected objects must be able to refer to, contain, and be
contained in non-collected objects, and {\em vice versa}. Gc-objects are
not restricted to exist only in gc-space: gc-objects should be
instantiable as global, automatic, or heap objects, if the programmer so
desires. Obviously, only those that are in gc-space can and will be
automatically reclaimed. While it may be necessary to add data to an
object to support GC, it must be the case that objects of a type have
exactly one memory representation, no matter where they reside (global
space, stack, heap, or gc-space). Furthermore, it should be possible for
the user to declare that a specific object is to be automatically
reclaimed. For instance, the programmer should not have to declare a
special class or data type to have arrays of characters (strings)
automatically reclaimed.

Arrays of gc-objects must be supported.

The design should support pointers into objects and pointer-to-member.
For instance, if a gc-object has an array of characters as an element, the
programmer should still be able to traverse that array with a
pointer-to-char. Compiler and runtime mechanisms must exist that allow
correct reclamation of gc-objects that have such embedded' pointers into
them. (`Embedded' here refers to where the pointer is pointing, and not
to the location of the pointer itself.)

C unions and Pascal tagless variant records are a way of hiding
information from the compiler and/or exposing representation information
to the user. Unions should be supported wherever possible. (However,
unions can be supported only with non- or mostly-copying GC strategies;
cf.~\S\ref{unions}).

It would be nice if the design did not preclude supporting multiple and
discontiguous gc-spaces, where each gc-space may have a different GC
strategy. The prototype being constructed does not yet support multiple
gc-spaces, and so this paper does not pursue this goal directly. The
problem of cross-registration of root pointers between GC strategies is
complex enough that more data is needed on the desirability of this
feature to justify a design to support it at this time.
--
A. Dain Samples, Dain.Samples@uc.edu, wk:(513)556-4783, hm:(513)771-5492
Dept of ECE, ML#30 University of Cincinnati, Cin'ti OH 45221-0030
[About 15 years ago I used a PDP-11 Lisp interpreter written in C to which
I added a garbage collected malloc(). When it ran out of space, it ran
down the malloc arena marking everything free, then marked the stdio
buffers as in use and then followed all of the pointers kept by Lisp
objects. Worked pretty well, considering. There was also a lengthy tech
report from UC Santa Cruz a few years ago on a GC version of C++. -John]
--

Post a followup to this message