Stacks are faster than heaps (was re Appel, continuations) (Paul Wilson)
Wed, 29 Jan 1992 11:31:26 -0600

          From comp.compilers

Related articles
Re: Compiling with Continuations/Andrew Appel (Vincent Delacour) (1992-01-28)

Newsgroups: comp.compilers
From: (Paul Wilson)
Keywords: storage
Organization: Compilers Central
References: 92-01-115
Date: Wed, 29 Jan 1992 11:31:26 -0600

> (1) even supposing that sml-nj uses that virtual memory like god
>himself (no page fault, etc) memory (even _virtual_ memory) is a precious
>resource. With such a demand, unless you have a very powerful machine your
>sml job will ask for *all of it*.

One thing to keep in mind is that simple GC's have abysmal, even
pathological behavior in a conventional memory hierarchy. That is, they
romp through a large amount of memory, allocating mostly very short-lived
objects, and don't reuse ANY of it until they've used up that big hunk of
memory. Yuck-o-matic. This defeats an LRU replacement strategy very
nicely, and other replacement policies aren't all that much better. As
long as your heap's virtual size is greater than your physical memory size,
you spend most of your time paging. (Even if you optimize away page
fetches, which is easy, you're still disk-bound paging out stuff that's
mostly recently-generated garbage. Ungar pointed this out in his 1984
paper on Generation Scavenging.)

Generational GC's are much better in terms of virtual memory, but usually
still pathological at the level of CACHE memories---you allocate more than
a cache-full of data, and don't reuse ANY of it until the next GC. So you
get cache misses equal to the rate of allocation, and most of those cause
write-backs (of recently-allocated short-lived garbage), so you get
cache-to-memory-and-back traffic at least TWICE the rate of allocation.

One way around this is to have a large cache and a small youngest
generation, which fits entirely in the cache, along with anything else
referenced on the same time scale.

On top of that, gc's tend to fare especially badly in direct-mapped
caches, because DM caches assume certain things about reference frequency
distributions that are systematically false in a garbage-collected
system, unless you're clever about heap layout.

> (2) experience shows that even when you run your sml world alone,
>the machine pages a lot, and you can get very bad performances (so sml-nj
>is not god himself ;-)

Appel's gc is explicitly designed to run in large memory, and optimized
(within that constraint) to be reasonably cheap in terms of copying cost
while still being very easy to implement.

If you want to run in small memory, you need a gc with a youngest
generation that doesn't move around so much. (Check out Ungar's, or mine,
which was described in my 1989 OOPSLA paper. They're a little more
complicated, but better for running in tighter RAM.)

>It turns out that this serious drawback is related to the overall strategy
>of sml-nj compilation : what Appel "allocates like crazy" are essentially
>... continuation objects (activation records). In other words, the problem
>with sml wasting machine resource comes from a bad pair of memory
>management and model of execution.

Allocating continuations on the heap increases heap usage TREMENDOUSLY.
(Roughly, continuations are like activation records, or more precisely,
suspension records. A chain of them acts as a control stack.)

In our Scheme system, adding a (software) stack cache reduced heap
allocation by roughly one to three ORDERS OF MAGNITUDE.

I suspect a similar hack for SML of NJ would decrease heap allocation by a
factor of several. (Hard to say how much with any precision, because it
depends on funny things about compiler analysis and code generation.)

I'd guess that the extra heap allocation caused by the lack of a stack (or
stack cache) accounts for a tremendous number of cache misses. This may
help explain why Ryan Stansifer's experiments showed T (Yale's Scheme) to
be faster, usually, than SML of NJ.

> [...]
>Finally, I'd like to clarify that despite the appearance I *really* don't
>criticize A. Appel's work with sml-nj [more than this: I'd be happy to
>have done it in his place, and sign it]. I put the blame on people
>reciting that sml-nj is a thunder without having tried it, or worse: even
>when they actually tried it, preferring the paradox and *belief* to facts.
>It's better to try to understand what's going on, and in that respect, it
>should be emphasized that negative results in research work are not a
>shame, they are as valuable as other results, and infinitely more valuable
>than slogans.

Hear, hear. I think SML of NJ is very cool. Add a stack cache, and it might
be a lot faster.

>[87] A. Appel, "Garbage Collection can be Faster than Stack
> Allocation", Information Processing Letters, 1987.

One thing to keep in mind about this paper. The success reported was on
an old VAX with huge RAM. On a modern RISC machine, however, cache-level
locality is absolutely crucial to performance. So while garbage
collection CAN be faster than stack allocation, it's not likely to be on
any machine you use in the future. It CAN be fairly cheap, though, so
gc'd languages are still a very good idea; just don't stress the gc
unnecessarily by throwing everything onto the heap when you don't have to.

    I've got a paper on this stuff (submitted for publication), and can
send a .tex version of the technical summary to anybody who's interested.

          -- Paul

| Paul R. Wilson, runtime environmentalist email:
| U. of Texas Computer Sciences Dept. voice: (512) 471-9555
| Taylor Hall 2.124, Austin, TX 78712-1188 fax: (512) 471-8885
