Run time optimizations firstname.lastname@example.org (Sanjay Jinturkar) (1993-04-20)
Re: Run time optimizations email@example.com (1993-04-22)
Re: Run time optimizations firstname.lastname@example.org (1993-04-23)
Re: Run time optimizations email@example.com (1993-04-23)
Re: Run time optimizations firstname.lastname@example.org (1993-04-24)
Re: Run time optimizations email@example.com (1993-04-28)
From: firstname.lastname@example.org (Paul Haahr)
Date: Sat, 24 Apr 1993 03:18:33 GMT
David Keppel <email@example.com> writes:
> * Look at runtime values and decide how to generate code that is more
> optimized than the general case but still conservative given the runtime
> values. Examples: Bitblt [Pike, Locanthi & Reiser 85]
One data point here: I had written, back when I was in school, a 680n0 (n
>= 2) compile-on-the-fly bitblt. It ran at (roughly) memory bandwidth on
my 68020-based sun3/60, and generally moved 3-5x as many bits per second
as an ``interpretive'' bitblt. I was pretty pleased with this code. ;-)
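(For readers who haven't seen the two approaches side by side: this is a
minimal C sketch, not Paul's actual 680x0 code, of what the "interpretive"
inner loop looks like. The names `blt_interp` and `blt_op` are made up for
illustration. A compile-on-the-fly bitblt would instead emit a loop
specialized to the chosen operation, so the per-word dispatch disappears.)

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch of an "interpretive" bitblt inner loop: one
 * general routine that re-tests the operation code on every word.
 * A run-time code generator would emit only the body for the one
 * op actually requested, with the switch and its branches gone. */
enum blt_op { BLT_COPY, BLT_OR, BLT_XOR };

static void blt_interp(uint32_t *dst, const uint32_t *src,
                       size_t nwords, enum blt_op op)
{
    for (size_t i = 0; i < nwords; i++) {
        switch (op) {            /* per-word dispatch overhead */
        case BLT_COPY: dst[i]  = src[i]; break;
        case BLT_OR:   dst[i] |= src[i]; break;
        case BLT_XOR:  dst[i] ^= src[i]; break;
        }
    }
}
```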
I recently compiled it and timed it on my 68040-based nextstation.
The compile-on-the-fly version and the interpretive version ran
in the same amount of time, reproducible within 10%. In some cases,
the compiling bitblt was faster, in others it was slower, seemingly
due to the i-cache flushing that was going on before jumping into
the actual copying routine.
My explanations for what's going on, which are all in the form of
first impressions and could easily be completely mistaken, are:
+ The 68040 outruns the memory by a lot. No matter how much
other work you do in bitblt, performance is most directly
related to memory bandwidth and you have cycles to spare.
+ The 68040 is (internally) clock-doubled, so you already
have twice as many instruction cycles as you have chances
to get at the memory buses. That means that a loop with
two memory references that are not coming out of cache can
effectively have two other completely ``free'' instructions.
This makes loop-unrolling, one of the prime advantages of
the compile-on-the-fly code, much less important.
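(To make the unrolling point concrete, here is an illustrative C sketch,
again not the original code, of the kind of unrolled word-copy loop that
compile-on-the-fly generation buys you; `copy_unrolled` is a made-up name.
On a 68020 the saved loop-overhead instructions mattered; on a
clock-doubled 68040 the memory bus is the bottleneck and that overhead
rides along "free".)

```c
#include <stdint.h>
#include <stddef.h>

/* 4-way unrolled word copy: one loop test and branch per four
 * words instead of one per word.  Run-time code generation can
 * unroll aggressively because it knows the transfer length and
 * operation when it emits the loop. */
static void copy_unrolled(uint32_t *dst, const uint32_t *src,
                          size_t nwords)
{
    size_t i = 0;
    for (; i + 4 <= nwords; i += 4) {   /* unrolled body */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < nwords; i++)             /* remainder words */
        dst[i] = src[i];
}
```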
+ There's much more of a penalty on the 68040 for flushing
the icache: the cache is an order of magnitude larger and
almost certainly has useful code from outside the bitblt
in it, not to mention advantages stemming from repeated
bitblts being cached.
+ The icache is bigger, so you don't need the incredible
density of the compile-on-the-fly code to keep the entire
inner loop in cache. (In fact, I suspect that the inner
loop typically fits entirely in the 68020's 256 byte icache,
but I'm not sure.)
Anyway, I don't want to disparage the approach of run-time code
generation, but do want to remind people that as hardware changes,
engineering trade-offs change.