Re: Can this type of cache miss be reduced?

Max Hailperin <>
Tue, 02 Jun 2009 07:22:42 -0500

          From comp.compilers

Related articles
Can this type of cache miss be reduced? (Eric Fisher) (2009-06-01)
Re: Can this type of cache miss be reduced? (George Neuner) (2009-06-01)
Re: Can this type of cache miss be reduced? (Max Hailperin) (2009-06-02)
Re: Can this type of cache miss be reduced? (Eric Fisher) (2009-06-03)
Re: Can this type of cache miss be reduced? (Louis Krupp) (2009-06-03)
Re: Can this type of cache miss be reduced? (glen herrmannsfeldt) (2009-06-03)
Re: Can this type of cache miss be reduced? (Max Hailperin) (2009-06-03)
| List of all articles for this month |

From: Max Hailperin <>
Newsgroups: comp.compilers
Date: Tue, 02 Jun 2009 07:22:42 -0500
Organization: Compilers Central
References: 09-06-003
Keywords: architecture
Posted-Date: 02 Jun 2009 13:26:16 EDT

Eric Fisher <> writes:

> Optimizations for cache miss are often that loop transformations, such
> as loop interchange, loop blocking, etc.
> But, for a large one-dimensional array, suppose the elements are only
> accessed once, can we still reduce the cache miss?
> Example:
> #define NUM 320*240*3
> static const char a[NUM] = {.......};
> char *ptr=a;
> for (i = 0; i < NUM; i++)
> {
> x = *ptr++;
> y = *ptr++;
> z = *ptr++;
> fun(x, y, z);
> }

Yes and no, in some mixture that depends on your architecture, what
the fun(x,y,z) is doing, and your definition of "cache miss."

Momentarily ignoring the fun(x,y,z), your loop would naively seem to
experience one cache miss per`cache-block worth of data. But ignoring
fun(x,y,z) may be inappropriate. Assuming that the cache block is
larger than 3 array elements (which is typically true), one possible
source of additional misses is interference between the caching of the
array accesses and the cache footprint of the fun(x,y,z) computation.

Thus, one way to reduce the cache misses of your loop might be to
reduce interference between the array and whatever data structures fun
is using. Given that your array isn't really all that big by modern
architectures' standards, you might have room to fit it and fun's data
structures into the L1 cache's index space at non-overlapping
positions by adjusting the position of one or the other. But that
kind of interprocedural analysis of conflicts isn't easy for
compilers. More importantly, if you were to experiment manually with
adjusting the positioning of the array relative to fun's data
structures, I would suspect you would find little performance impact
on modern architectures. In the olden days of off-chip caches, the
caches often had very low associativity, even direct-mapped (1-way),
and so conflicts could easily kill you. But nowadays more and more
cache (even L2 and L3) is on-chip and consequently is practical to
engineer with high associativity (e.g., 8-way). You have to try
pretty hard to get a conflict miss with that kind of associativity.

Another issue arises with the array accesses themselves. I said
earlier that your loop naively seems to experience one cache miss per
block of data. If you define "cache miss" as bringing data into an
(initially cold) cache from the next higher level, then this naive
assumption is correct. But if you define "cache miss" as being an
event that stalls the processor, then the cache misses could be
dramatically lower by using prefetching. That is, each block of data
could be brought into the cache while the previous block is being
processed, so that by the time the loop is ready for the next block of
data, it already is in the cache. In the olden days of simple cache
controllers, this was another opportunity for the compiler --
compilers could (and sometimes did) insert extra code into loops to do
non-blocking accesses to array elements that a later iteration would
access. But again, hardware progress has rendered this kind of
compiler-based prefetching largely irrelevant, at least for such a
simple pattern of memory accesses as your linear scan. The modern
cache controllers automatically detect that kind of linear access and
will do the prefetching for you. The code you showed, compiled
completely naively, would probably experience very few cache-miss
stalls on a modern processor, assuming that the memory bandwidth could
keep up (and there were no stalls caused by the fun(x,y,z) part).

There is quite a literature on "streaming" computations that sweep
through large volumes of memory. (Again, I will say that yours isn't
really that large -- even though I'm not quite sure exactly how large,
because there is a factor of 3 difference between the declaration of
the array and how much the loop acceesses.) So you might want to do
some searches on that keyword. One of the things you will discover is
that many of the approaches to streaming computation avoids using the
cache at all. More specialized structures can perform as well as or
better than a cache for taking advantage of the spatial locality.
Without any temporal locality, using a cache as a glorified prefetch
buffer is just "polluting" the whole cache, evicting all the data used
by other parts of your overall computation. That is, the non-cache
ways to do streaming access will perform the array accesses themselves
at a rate that is just as good as the cache-based ways (perhaps even a
bit better), but the overall system performance will be higher because
other data structures that exhibit temporal locality will not be
evicted from the cache.

  -Max Hailperin
    Professor of Computer Science
    Gustavus Adolphus College

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.