Re: Run time optimizations

haahr@mv.us.adobe.com (Paul Haahr)
Sat, 24 Apr 1993 03:18:33 GMT

From comp.compilers

Related articles
Run time optimizations sj3e@server.cs.virginia.edu (Sanjay Jinturkar) (1993-04-20)
Re: Run time optimizations krishna@cs.unm.edu (1993-04-22)
Re: Run time optimizations pardo@cs.washington.edu (1993-04-23)
Re: Run time optimizations jim@float.co.uk (1993-04-23)
Re: Run time optimizations haahr@mv.us.adobe.com (1993-04-24)
Re: Run time optimizations jeremy@sw.oz.au (1993-04-28)
Newsgroups: comp.compilers
From: haahr@mv.us.adobe.com (Paul Haahr)
Keywords: optimize, architecture
Organization: Compilers Central
References: 93-04-069 93-04-091
Date: Sat, 24 Apr 1993 03:18:33 GMT

David Keppel <pardo@cs.washington.edu> writes:
> * Look at runtime values and decide how to generate code that is more
> optimized than the general case but still conservative given the runtime
> values. Examples: Bitblt [Pike, Locanthi & Reiser 85]


One data point here: I had written, back when I was in school, a 680n0 (n
>= 2) compile-on-the-fly bitblt. It ran at (roughly) memory bandwidth on
my 68020-based Sun 3/60, and generally moved 3-5x as many bits per second
as an ``interpretive'' bitblt. I was pretty pleased with this code. ;-)
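
Roughly, and glossing over the shifts, edge masks, and the actual
runtime code emission, the difference between the two versions looks
like the C sketch below. This is only an illustration of the idea, not
my actual code, and all the names in it are made up:

    #include <stddef.h>

    typedef unsigned long word;

    /* "Interpretive" bitblt: one general loop that re-tests the
       raster op on every word it moves. */
    void blit_interp(word *dst, const word *src, size_t n, int op)
    {
        size_t i;
        for (i = 0; i < n; i++) {
            switch (op) {
            case 0: dst[i]  = src[i]; break;   /* copy */
            case 1: dst[i] |= src[i]; break;   /* or   */
            case 2: dst[i] &= src[i]; break;   /* and  */
            case 3: dst[i] ^= src[i]; break;   /* xor  */
            }
        }
    }

    /* Specialized loops, one per raster op.  A compile-on-the-fly
       bitblt emits the machine-code equivalent of one of these at
       runtime, with the op, shifts, and masks burned in, and then
       unrolls it before jumping to it. */
    static void blit_copy(word *d, const word *s, size_t n)
    {
        while (n--) *d++ = *s++;
    }

    static void blit_xor(word *d, const word *s, size_t n)
    {
        while (n--) *d++ ^= *s++;
    }

    void blit_specialized(word *dst, const word *src, size_t n, int op)
    {
        switch (op) {           /* decided once, outside the loop */
        case 0: blit_copy(dst, src, n); break;
        case 3: blit_xor(dst, src, n); break;
        /* ... and so on for the other raster ops ... */
        }
    }

The 68020 win came from hoisting all the per-word decisions out of the
inner loop; the rest of this message is about why that win evaporated
on the 68040.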


I recently compiled it and timed it on my 68040-based NeXTstation.
The compile-on-the-fly version and the interpretive version ran
in the same amount of time, reproducible to within 10%. In some cases
the compiling bitblt was faster, in others it was slower, seemingly
due to the i-cache flushing that was going on before jumping into
the actual copying routine.


My explanations for what's going on, which are all in the form of
first impressions and could easily be completely mistaken, are:


+ The 68040 outruns the memory by a lot. No matter how much
other work you do in bitblt, performance is most directly
related to memory bandwidth and you have cycles to spare.


+ The 68040 is (internally) clock-doubled, so you already
have twice as many instruction cycles as you have chances
to get at the memory buses. That means that a loop with
two memory references that are not coming out of cache can
effectively have two other completely ``free'' instructions.
This makes loop-unrolling, one of the prime advantages of
the compile-on-the-fly code, much less important (see the
sketch after this list).


+ There's much more of a penalty on the 68040 for flushing
the icache: the cache is an order of magnitude larger and
almost certainly has useful code from outside the bitblt
in it, not to mention advantages stemming from repeated
bitblts being cached.


+ The icache is bigger, so you don't need the incredible
density of the compile-on-the-fly code to keep the entire
inner loop in cache. (In fact, I suspect that the inner
loop typically fits entirely in the 68020's 256-byte icache,
but I'm not sure.)
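

To make the loop-unrolling point concrete, here is the shape of the
inner loop, as a C sketch standing in for the generated 680x0 code
(the unroll factor of four and the names are illustrative):

    #include <stddef.h>

    typedef unsigned long word;

    void copy_unrolled(word *d, const word *s, size_t n)
    {
        /* Each word still costs two memory references (a load and
           a store), but the loop overhead -- pointer updates,
           compare, branch -- is amortized over four words.  On a
           68020 the overhead instructions compete with the data
           for the bus; on a clock-doubled 68040 they largely hide
           behind the load/store traffic, so unrolling buys much
           less. */
        while (n >= 4) {
            d[0] = s[0];
            d[1] = s[1];
            d[2] = s[2];
            d[3] = s[3];
            d += 4; s += 4; n -= 4;
        }
        while (n--)             /* leftover words */
            *d++ = *s++;
    }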


Anyway, I don't want to disparage the approach of run-time code
generation, but do want to remind people that as hardware changes,
engineering trade-offs change.


paul
--

