Re: Optimizations to improve icache hit rate

A Pietu Pohjalainen <>
28 Apr 2004 15:21:46 -0400

          From comp.compilers

Related articles
Optimizations to improve icache hit rate (Jeff Kenton) (2004-04-21)
Re: Optimizations to improve icache hit rate (A Pietu Pohjalainen) (2004-04-28)
Re: Optimizations to improve icache hit rate (2004-04-29)
| List of all articles for this month |

From: A Pietu Pohjalainen <>
Newsgroups: comp.compilers
Date: 28 Apr 2004 15:21:46 -0400
Organization: University of Helsinki
References: 04-04-068
Keywords: architecture, storage
Posted-Date: 28 Apr 2004 15:21:46 EDT

Jeff Kenton <> wrote:
> I had a discussion during an interview about optimizations designed to
> decrease icache misses. It seems to me that code re-arrangement will
> give small benefits at most, and that general optimizations,
> especially those that decrease overall code size, will be more
> productive in the end.

> Comments?
> [Sounds to me like it depends on the size of the cache lines. -John]

In order to visualise the effect of size of the cache in optimization,
I did this experiment some time ago: I have a simple C-program that
loops for 2^20 times. Then the loop is manually unrolled for 2^1,
2^2, 2^3, .., 2^15 times.

As one might guess, the runtime for this loop first decreases, as the
unroll causes less calls to the exit-condition of the loop. However,
once the L1 and L2 caches on a x86 machine get full, the execution time
suddenly jumps.

The loop body in my experiment happened to be 87 bytes. For example, on
a PentiumIII / 1GHz machine with 16kB L1 cache and 256 kB L2 cache the
average runtimes with the following unroll counts are:

unrolls | execution
  1 | 63 ms
  2 | 49 ms
  4 | 40 ms
  8 | 30 ms
  16 | 24 ms
  32 | 24 ms
  64 | 21 ms
  128 | 20 ms
  256 | 21 ms
  512 | 32 ms
  1024 | 48 ms
  2048 | 50 ms
  4096 | 90 ms
  8192 | 140 ms
  16384 | 140 ms
  32768 | 140 ms

As we can calculate, the loop body can be unrolled to L1 cache to
16384 / 87 = 188 times. The first penalty is indeed placed between 128
and 215 unrolls. similarily, the L2 can contain 262144 / 87 = 3013
unrolls - and the second penalty in execution time is placed between
2048 and 4096 unrolls.

This penalty comes from the fact that when the execution reaches to the
end of the unrolled loop, the beginning of the loop has already been
removed from the cache. In the PIII case, the impact seems to take the
execution time from 20 ms to 140 ms.

To the question I'd believe that code re-arrangement optimizations are
orthogonal to those that decrease overall code size. With smaller code
fragments, you can fit more fragments to your limited cache size - which
means more possible orderings of the fragments.

(Full runtime information of my unroll experiment with few Pentium III,
  Pentium IV, Athlon and Opteron processors is available at )

Pietu Pohjalainen

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.