Related articles
Cache size restrictions obsolete for unrolling? linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-07)
Re: Cache size restrictions obsolete for unrolling? harold.aptroot@gmail.com (Harold Aptroot) (2009-01-09) |
Re: Cache size restrictions obsolete for unrolling? gneuner2@comcast.net (George Neuner) (2009-01-10) |
Re: Cache size restrictions obsolete for unrolling? linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-10) |
Re: Cache size restrictions obsolete for unrolling? jgd@cix.compulink.co.uk (2009-01-10) |
Re: Cache size restrictions obsolete for unrolling? harold.aptroot@gmail.com (Harold Aptroot) (2009-01-10) |
From: Stephan Ceram <linuxkaffee_@_gmx.net>
Newsgroups: comp.compilers
Date: 7 Jan 2009 21:24:15 GMT
Organization: Compilers Central
Keywords: storage, performance, question
Posted-Date: 09 Jan 2009 07:32:53 EST
In my experience, for some DSPs it is better to unroll loops as much
as possible without regard to the instruction cache. In my
experiments, I wrote a program that fit exactly into the cache
(i.e. the code of the loops was slightly smaller than the I-cache
capacity). For this test program I measured the execution time using
a cycle-accurate simulator. Then I increased the unrolling factor
stepwise until the unrolled loop exceeded the cache capacity.
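For concreteness, here is a sketch of what increasing the unrolling
factor means. This is not the benchmark from my experiments; the
function and array names are made up, and I assume a simple
100-iteration reduction loop where 100 is divisible by the factor, so
no remainder loop is needed:

    /* Original loop: the loop-control branch executes 100 times. */
    int sum_rolled(const int a[100])
    {
        int i, sum = 0;
        for (i = 0; i < 100; i++)
            sum += a[i];
        return sum;
    }

    /* Unrolled by a factor of 4: the loop-control branch executes
       only 25 times, at the cost of a larger loop body. */
    int sum_unrolled4(const int a[100])
    {
        int i, sum = 0;
        for (i = 0; i < 100; i += 4) {
            sum += a[i];
            sum += a[i + 1];
            sum += a[i + 2];
            sum += a[i + 3];
        }
        return sum;
    }

Raising the factor further (8, 16, ...) keeps shrinking the number of
branches while the loop body grows toward, and eventually past, the
I-cache capacity.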
I expected performance to degrade: the more aggressively the loop was
unrolled, the more capacity misses should occur, leading to an
increase in execution time. However, my measurements showed the
opposite. For a loop with 100 iterations, increasing the unrolling
factor (with one exception) continuously reduced the program's run
time. How is this possible?
My feeling is that modern processors have sophisticated features
(like prefetching, fast memories, ...) that go a long way toward
hiding or avoiding instruction cache misses, so the misses rarely
hurt even when a frequently executed loop exceeds the cache capacity.
In contrast, aggressive unrolling reduces the expensive execution of
branches (especially mispredicted ones) in the loop header and
exposes more optimization potential. In total, this pays off even at
the cost of some additional cache misses. So my tentative conclusion
is that the commonly found restriction on unrolling factors, meant to
keep loops from outgrowing the cache, is obsolete and does not hold
for modern processors and compilers.
Do you agree, or are my assumptions wrong on some point?
Thank you for your opinions.
Cheers,
Stephan