Re: Superoptimizer (was: optimizing compiler against assembly programmer.)



From: qua@microunity.com (Henry Massalin)
Newsgroups: comp.compilers
Date: 16 Jul 1997 23:04:32 -0400
Organization: Compilers Central
References: 97-06-071 97-06-110 97-06-140 97-07-023 97-07-067
Keywords: performance, optimize, architecture

Hi!


As author of the original Superoptimizer program, I feel moved to
respond to this thread, if only to say hello and expand on a few
things.


First of all, I feel quite honored that even 10 years after the fact,
this is still being talked about. I'd like to publicly acknowledge my
Ph.D. advisor, Calton Pu. It was his help and encouragement to forge
ahead that led to what was my very first research paper.




| The 68K has some tricky instructions so I do expect weird
| combinations to output unexpected results, but what about
| more streamlined ISAs [...] Would the Superoptimizer really
| be useful on a MIPS ?


Oh, yes! See the rest of the response.




| As the title above suggests, the SO produces the *shortest*
| possible program. This is not always the same thing as the
| *fastest* possible program.


True. The original program used this metric because:


        (1) It is the simplest cost-metric to implement.


        (2) It happened that for the instructions in its search-space,
there was a 1:1 correspondence between program length and
execution-time (i.e., no latency-dependent stalls).




As many have already pointed out, shortest-program isn't the only
metric, nor necessarily the preferred one.


However, "fastest-program" isn't the correct metric either. At least,
not unless you have a *very* sophisticated cache and memory model to
back it up. Why? Because one can always compute complicated
functions in essentially constant time using table-lookup in a
sufficiently large address space. But most people would probably not
want to dedicate a (2^67)-byte chunk of cache memory just to make
double-precision sin(x) go fast... :-)


So you'd somehow have to cap the total memory used. A weighted
cost-function may help:


        cost = exec_time (uS) + mem_used * (ConversionFactor uS/byte)
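

To make this concrete, here is a minimal Python sketch of what such a
metric might look like. The function name, the toy numbers, and the
1e-6 uS/byte factor below are invented for illustration; they are not
from any real superoptimizer.

    def weighted_cost(exec_time_us, mem_used_bytes, us_per_byte):
        # Trade microseconds of runtime against bytes of footprint.
        return exec_time_us + mem_used_bytes * us_per_byte

    # Toy numbers only: a "free" O(1) lookup in a huge table loses to
    # a short code fragment once its footprint is charged for.
    table_version = weighted_cost(0.01, 2**30, 1e-6)  # 1 GiB table
    code_version = weighted_cost(0.20, 64, 1e-6)      # 64-byte fragment
    assert code_version < table_version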


Of course, all this may be overly complicated: All superopt-like
programs that I'm aware of implicitly cap memory usage - they don't
search the class of functions having internal lookup tables.


While on the topic of metrics, an interesting one might be to assume
all memory except for the code-fragment starts out zeroed, and to
optimize the total runtime for N executions of the target function
plus the initialization cost of any internal tables. This does bias
the search toward easy-to-initialize tables; on the other hand, if a
table is easy to initialize, there's probably also a simple
instruction sequence that obviates the need for that table...
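

In the same hypothetical spirit, that amortized metric might be
written as (again, the names are invented here):

    def amortized_cost(table_init_us, per_call_us, n_calls):
        # Total time for N calls, charging the one-time table setup.
        return table_init_us + n_calls * per_call_us

For large N the per-call term dominates, so a table only pays its way
if the setup is cheap relative to the N uses - which is exactly the
bias described above.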


Of course, there is the question of how to search over functions that
have internal data tables. Counting through all possible values for
each of the table entries doesn't get you very far. Some kind of
theorem-prover is necessary to eliminate huge chunks of search space.
This is difficult, but not impossible.


The general approach I've tried is to "execute" the test-code
symbolically and come up with an expression that relates the values in
the tables to the squared-difference between the function-value
returned and the wanted result. The goal is to find the global
minimum squared-error and the table entries that correspond to it. If
the table affects the output in an arithmetic way (i.e., no bitwise
masking ops between the table-lookup and the final output), one can
take partial-derivatives with respect to each table entry and solve
for zero to locate the minima.


A correct function has a sum-of-squared-differences equal to 0, which
must be the global minimum, so this method will find, if they exist,
the necessary table entries that go with a particular code-fragment.
You can do this in O(N^3) time, where N is the number of elements in
the table. (I think it's O(N^3)...might be O(N^4). And again,
subject to the condition that the table affects the output in an
arithmetic way.) This is a powerful technique because the length of
code fragment required to compute a function often goes down
significantly when you allow lookup-tables. So even though the cost
of program-testing goes way up (symbolic execution is *very*
expensive), the searching-depth required goes way down.
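

For the arithmetic case, here is one way the table-solving step could
be sketched. This is a reading of the method, not the original code;
it assumes the symbolic execution has already reduced the fragment's
output on input x to an affine form A[x].t + b[x] in the unknown
table t (fit_table and its arguments are invented names):

    import numpy as np

    def fit_table(A, b, want):
        # A    : (num_inputs, N) coefficients from symbolic execution
        # b    : (num_inputs,)   constant terms
        # want : (num_inputs,)   desired outputs of the target function
        # Minimizing sum_x (A[x].t + b[x] - want[x])^2 over t is an
        # ordinary linear least-squares problem; solving it is the
        # O(N^3)-ish step mentioned above.
        t, *_ = np.linalg.lstsq(A, want - b, rcond=None)
        err = float(np.sum((A @ t + b - want) ** 2))
        return t, err

If err comes back (numerically) zero, that code fragment plus table t
computes the target exactly; otherwise the fragment can be discarded
without ever enumerating table contents.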


I have, in fact, used a variant of this method to help find 8- and
16-bit fixed-point approximations to sin(x) (among others) - each
having small (16-entry) internal lookup tables. Using the Microunity
processor's vector-lookup instruction, this gets nearly 50 million
evaluations of complex z*exp(I w) per second at a chip clock of 200 MHz.


A similar method can often help find values for immediate fields
in instructions without having to search through all possibilities.
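

One familiar instance (not from the Superoptimizer itself) is the
reciprocal-multiplication constant that turns division by a fixed d
into a multiply and a shift; the immediate falls out of a closed form
rather than out of a search. A sketch, with a 16-bit word size chosen
only so the check can be exhaustive:

    W = 16                    # toy word size, small enough to verify
    d = 10                    # divisor to replace with multiply+shift
    L = (d - 1).bit_length()  # ceil(log2(d))
    s = W + L
    m = -((-(1 << s)) // d)   # ceil(2^s / d), written down - no search
    assert all((x * m) >> s == x // d for x in range(1 << W))

The point is only that the constant comes from a formula plus a
check, rather than from trying every possible immediate.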




| (2) Optimization is provably NP-Complete and that's how
| the SO runs.


This is still true. However, you can use theorem-prover generated
rulebases and symbolic execution to greatly reduce the size of the
exponent. The idea is to derive sets of equivalent code sequences.
You search only one member from each equivalence set; when you find a
code fragment that does what you want, then you can go grovelling for
which particular choice from each equivalence set gives the shortest
overall execution time. In fact, once you're at the level of
equivalence-sets, you're really representing small code-fragments as
a set of transformation-rules. The testing phase goes faster, since
you don't have to emulate every instruction one at a time. It helps
with the symbolic execution necessary to find values for any internal
lookup-tables. And it cuts down on the branching factor.


With equivalence-sets up to depth 5, the branching factor is around 4 -
small enough to do a rich search to depth 20 in about a week on an
ordinary 10-processor DEC-Alpha machine :-)
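

As a toy illustration of why collapsing equivalent fragments tames
the branching factor (an invented example, nothing like the real
rulebase): enumerate straight-line fragments over a handful of 8-bit
unary ops, fingerprint each by its outputs on all 256 inputs, and
only ever extend one representative per fingerprint.

    OPS = {
        "neg":  lambda x: (-x) & 0xFF,
        "not":  lambda x: (~x) & 0xFF,
        "inc":  lambda x: (x + 1) & 0xFF,
        "shl1": lambda x: (x << 1) & 0xFF,
        "shr1": lambda x: x >> 1,
    }

    def fingerprint(seq):
        # A fragment's behaviour is its output on every 8-bit input.
        outs = []
        for x in range(256):
            for op in seq:
                x = OPS[op](x)
            outs.append(x)
        return tuple(outs)

    reps = {(): fingerprint(())}   # one representative per behaviour
    seen = set(reps.values())
    for depth in range(1, 4):
        new = {}
        for seq in reps:
            for op in OPS:
                fp = fingerprint(seq + (op,))
                if fp not in seen:         # new behaviour: keep one member
                    seen.add(fp)
                    new[seq + (op,)] = fp
        reps.update(new)
        raw = sum(len(OPS) ** k for k in range(depth + 1))
        print("depth", depth, "behaviours:", len(reps), "raw fragments:", raw)

Extending two fragments that compute the same function by the same op
gives the same function again, so searching one member per class loses
nothing - the saving is purely in the branching factor.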


Another idea is to treat classes of short code-fragments as if they
were lookup-tables, albeit with constraints on the entries, and use a
method similar to the lookup-table-finder discussed earlier. At this
level, you're essentially "proving" what classes of code-fragments
admit potential solutions.


One of these days I may actually get the time to write a followup
paper on the subject....


So while NP-class problems are hard, it doesn't mean you can't take
huge whacks at that exponent... :-)


Henry
--

