Re: Writing fast compilers...

brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
18 Aug 91 21:28:48 GMT

          From comp.compilers

Related articles
[4 earlier articles]
Re: Writing fast compilers... preston@helena.rice.edu (1991-08-13)
Re: Writing fast compilers... alex@vmars.tuwien.ac.at (1991-08-13)
Re: Writing fast compilers... pcg@aber.ac.uk (1991-08-14)
Re: Writing fast compilers... markh@csd4.csd.uwm.edu (1991-08-16)
Re: Writing fast compilers... glew@pdx007.intel.com (1991-08-16)
Re: Writing fast compilers... blenko-tom@CS.YALE.EDU (1991-08-16)
Re: Writing fast compilers... brnstnd@kramden.acf.nyu.edu (1991-08-18)
Re: Writing fast compilers... henry@zoo.toronto.edu (1991-08-20)
Re: Writing fast compilers... andy@DEC-Lite.Stanford.EDU (1991-08-21)

Newsgroups: comp.compilers
From: brnstnd@kramden.acf.nyu.edu (Dan Bernstein)
Keywords: optimize, design
Organization: Compilers Central
References: <+M6D-1A@xds13.ferranti.com> <3606@crdos1.crd.ge.COM> <1991Aug18.003201.6867@zoo.toronto.edu>
Date: 18 Aug 91 21:28:48 GMT

[From comp.arch -John]


In article <1991Aug18.003201.6867@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
> For example, the desirable
> degree of loop unrolling is heavily influenced by things like cache
> design. In an ideal world, these are things you *would* want to leave to
> the compiler, so that your code would run well on any machine, rather
> than well on one and poorly on others.


I disagree, for two reasons. First, current optimizers are incompetent.
Last year I posted to comp.lang.misc a thirty-line description of an
optimization which I had done. It was obvious that the technology
necessary for that optimization---e.g., range analysis, simple algebra,
range invariant introduction---was way beyond the reach of compilers we
know how to write.


Several people thought I was trying to show off. What they didn't
realize is that this optimization was no more than the statement that if
you've used some of the bits in an 8-bit byte, then you have between 0
and 7 bits left. In context this is obvious even to novice programmers.
How long do you think it'll be before compilers can do this? (For the
curious: The resulting code is in bit_printbuf() in bitout.c in my
yabbawhap package, comp.sources.unix volume 24. Can your compiler prove
that curbb8 is between 0 and 8 after line 29?)
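
To make that concrete, here is a rough sketch of the kind of bit-output
routine involved. This is an illustration only, not the actual code from
bitout.c; apart from curbb8, the names and the flushing details are
invented.

    #include <stdio.h>

    /* curbb8 counts the bits already placed in the current output byte. */
    static unsigned char curbyte = 0;
    static int curbb8 = 0;

    void bit_put(int bit, FILE *f)
    {
      curbyte |= (unsigned char) ((bit & 1) << curbb8);
      curbb8++;                       /* now between 1 and 8 */
      if (curbb8 == 8) {              /* byte is full: flush it */
        putc(curbyte, f);
        curbyte = 0;
        curbb8 = 0;
      }
      /* Here curbb8 is back between 0 and 7, so at most 7 bits remain
         free in the byte: the fact that is obvious to any human reader
         but beyond the range analysis of the compilers in question. */
    }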


The original whap program ran three to four times slower than compress,
depending on the machine. The MIPS compiler produced a 25% speedup; the
somewhat smarter Astronautics ZS compiler produced a 40% speedup on a
similar architecture. After several dozen hand optimizations like the
bit output optimization above, I had whap running nearly as fast as
compress with no help from the compiler. With -O it runs faster on some
machines (e.g., a Sun 3).


Keep in mind that AP coding inherently does about three times more
dictionary manipulations than LZW coding. I didn't use a radically
different data structure in whap than the hash trie in compress. Where'd
this 2x-3x speedup come from? Hand optimization, at a level way below
``replace insertion sort with heap sort'' but still above what the MIPS
or any other compiler could comprehend.
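
For those who haven't looked inside compress: the dictionary is a hash
trie, where each entry stands for a (prefix code, next character) pair
and children are found by hashing rather than by walking per-node lists.
Here is a simplified sketch of the lookup-or-insert step; it is
illustrative only, and neither compress nor whap uses exactly this code.

    #define HSIZE 69001                   /* a prime table size, as in compress */

    static long keytab[HSIZE];            /* packed (prefix,char) key; -1 means empty */
    static unsigned short codetab[HSIZE]; /* code assigned to that pair */

    /* Look up the (prefix, ch) pair; return its code if present, or insert
       it as newcode and return -1.  keytab must start out filled with -1,
       and a real coder clears the table before it can fill up. */
    long trie_step(long prefix, int ch, unsigned short newcode)
    {
      long key = (prefix << 8) | ch;
      long i = key % HSIZE;

      while (keytab[i] != -1) {
        if (keytab[i] == key)
          return codetab[i];              /* hit: the current string extends */
        i++;                              /* linear probe; compress double-hashes */
        if (i == HSIZE) i = 0;
      }
      keytab[i] = key;                    /* miss: one more dictionary entry */
      codetab[i] = newcode;
      return -1;
    }

An AP coder has to do this sort of lookup about three times as often per
input byte as an LZW coder, which is the gap the hand optimizations had
to make up.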


I think that most CPU-intensive programs can see similar benefits from
hand optimization. Available compiler technology simply isn't competent
to achieve these results. People are!


(Note that I am not arguing for optimizers to be eliminated entirely.
There are some transformations---notably smart instruction scheduling---
which are not only impossible to express in high-level code but which
compilers really can do reasonably well. My final yabbawhap code still
gets a speedup of over 20% under the ZS scheduler. I see no reason to
give that up.)


Henry says that you should leave transformations to the compiler merely
because they produce worse code on some machines. I can sympathize with
this sentiment. The programs in my yabbawhap package, for instance, run
faster on most architectures with -DPTRS, and faster on the Convex with
-UPTRS. But did I ignore pointer optimizations merely because they
should be turned off on some machines? No! I provided the PTRS define
for the user to choose between the code without the transformation and
the code with the transformation. (The total cost in source lines of
supporting both versions was 24 lines. Long live macros.)
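
The flavor of it, if you haven't seen the source (this is an
illustrative sketch, not the actual yabbawhap macros):

    #ifdef PTRS
    #define COPYSEG(dst, src, n) \
      do { register char *d_ = (dst), *s_ = (src); \
           register int n_ = (n); \
           while (n_--) *d_++ = *s_++; } while (0)
    #else
    #define COPYSEG(dst, src, n) \
      do { register int i_, n_ = (n); \
           for (i_ = 0; i_ < n_; i_++) (dst)[i_] = (src)[i_]; } while (0)
    #endif

On most machines the pointer form wins; on the Convex the indexed form
wins, presumably because the vectorizer has an easier time recognizing
it. Either way the choice costs the source a handful of lines, not a
rewrite.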


If you follow Henry's advice, you're leaving yourself at the mercy of
compilers which, as noted above, are just too stupid to perform most
optimizations which are obvious to any human who takes the time to look.
Chances are your code will work poorly on most machines. If, on the
other hand, you provide a switch to choose between different versions,
your code will work well on all machines.


One disadvantage of the latter strategy is that (feedback compilers
aside) a human has to get into the loop to set the switch at some point.
The yabbawhap README has to take time out to note that on a Convex you
should set -UPTRS. Even this can be solved. In my Q language I have a
``quickpick'' control structure which asserts that any number of
sections of code do the same thing:


    quick
    pick
        a = 3; b = 5; if(x) a = 7; b = 11; fi
    pick
        if(x) a = 7; b = 11; else a = 3; b = 5; fi
    end


The compiler is free to choose the form which results in fewer
instructions, or a lower expected run time. (Currently q2c always
chooses the first form, but it's not a high-quality compiler.) Of
course, this can be combined with preprocessor macros to let a human
override the compiler's decisions.
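
In C today you can fake the human-override half of this with the
preprocessor; a quickpick-aware compiler would simply make the choice
itself when nobody overrides it. (PICK_FIRST is an invented name here,
not anything q2c emits.)

    #ifdef PICK_FIRST
          a = 3; b = 5; if (x) { a = 7; b = 11; }
    #else
          if (x) { a = 7; b = 11; } else { a = 3; b = 5; }
    #endif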


Obviously this could be added to any language. I think it would
completely answer Henry's objections to ``machine-specific
optimizations'' done by hand. I think it's perfectly natural to specify
the same code in several different ways for several different
architectures, just as I find the following data-rotation code in C
quite natural (y is a local array):


    #ifdef VECTOR
          /* build the rotated array in y with three independent,
             vectorizable loops, then copy it back */
          for (i = 0; i < k; i++) y[i] = x[i + r - k];
          for (i = k; i < r; i++) y[i] = x[i - k];
          for (i = 0; i < r; i++) x[i] = y[i];
    #else
          /* save the last k elements, shift the rest in place from the
             top down, then restore: less copying, but the overlapping
             shift loop is harder for a vectorizer to handle */
          for (i = 0; i < k; i++) y[i] = x[i + r - k];
          for (i = r - 1; i >= k; i--) x[i] = x[i - k];
          for (i = 0; i < k; i++) x[i] = y[i];
    #endif


I'm not going to write just the #ifdef VECTOR lines and pretend that a
scalar compiler will know what I mean, or vice versa. I think of the
code in both ways, and like Schrodinger's cat I'll gladly defer the
decision between one and the other until the last possible moment.


---Dan
--

