Cache size restrictions obsolete for unrolling?

Stephan Ceram <linuxkaffee_@_gmx.net>
7 Jan 2009 21:24:15 GMT

          From comp.compilers

Related articles
Cache size restrictions obsolete for unrolling? linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-07)
Re: Cache size restrictions obsolete for unrolling? harold.aptroot@gmail.com (Harold Aptroot) (2009-01-09)
Re: Cache size restrictions obsolete for unrolling? gneuner2@comcast.net (George Neuner) (2009-01-10)
Re: Cache size restrictions obsolete for unrolling? linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-10)
Re: Cache size restrictions obsolete for unrolling? jgd@cix.compulink.co.uk (2009-01-10)
Re: Cache size restrictions obsolete for unrolling? harold.aptroot@gmail.com (Harold Aptroot) (2009-01-10)
From: Stephan Ceram <linuxkaffee_@_gmx.net>
Newsgroups: comp.compilers
Date: 7 Jan 2009 21:24:15 GMT
Organization: Compilers Central
Keywords: storage, performance, question
Posted-Date: 09 Jan 2009 07:32:53 EST

My experience has been that for some DSPs it's better to unroll
loops as much as possible without taking the instruction cache into
account. In my experiments, I wrote a program that fits exactly
into the cache (i.e. the code of the loops was slightly smaller than
the I-cache capacity). For this test program I measured the
execution time using a cycle-accurate simulator. Next, I increased the
unrolling factor stepwise until the unrolled loop exceeded
the cache capacity.


I expected to see a performance decrease, i.e. the more aggressively
the loop was unrolled, the more capacity cache misses should arise,
leading to an increase in the execution time. However, my measurements
showed the opposite. For a loop with 100 iterations, increasing the
unrolling factor (with one exception) continuously reduced the program
run time. How is this possible?


My feeling is that modern processors have sophisticated features (like
prefetching, fast memories, ...) that go a long way towards hiding or
avoiding instruction cache misses, so the misses rarely hurt even if a
frequently executed loop exceeds the cache capacity. In contrast,
aggressive unrolling reduces the expensive execution of branches
(especially mispredicted ones) at the loop header and exposes more
optimization potential. In total, this pays off even at the cost of
some additional cache misses. So my first conclusion is that the
commonly found restriction of unrolling factors, meant to keep loops
from growing too large to fit in the cache, is obsolete and does not
hold for modern processors and compilers.


Do you agree, or are my assumptions wrong on some point?


Thank you for your opinion.


Cheers,
Stephan


