|Cache size restrictions obsolete for unrolling? linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-07)|
|Re: Cache size restrictions obsolete for unrolling? firstname.lastname@example.org (Harold Aptroot) (2009-01-09)|
|Re: Cache size restrictions obsolete for unrolling? email@example.com (George Neuner) (2009-01-10)|
|Re: Cache size restrictions obsolete for unrolling? linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-10)|
|Re: Cache size restrictions obsolete for unrolling? firstname.lastname@example.org (2009-01-10)|
|Re: Cache size restrictions obsolete for unrolling? email@example.com (Harold Aptroot) (2009-01-10)|
|From:||George Neuner <firstname.lastname@example.org>|
|Date:||Sat, 10 Jan 2009 06:27:07 -0500|
|Organization:||A noiseless patient Spider|
|Keywords:||architecture, performance, DSP|
|Posted-Date:||10 Jan 2009 10:40:26 EST|
On 7 Jan 2009 21:24:15 GMT, Stephan Ceram <linuxkaffee_@_gmx.net>
>I've made the experience that for some DSPs it's better to unroll
>loops as much as possible without taking care of the instruction
>cache. In my experiments, I've written a program that fits exactly
>into the cache (i.e. the code of the loops was slightly smaller than
>the I-cache capacity). For this test program I've measured the
>execution time using a cycle-true simulator. Next, I increased the
>unrolling factor stepwise resulting in the unrolled loop that exceeded
>the cache capacity.
>I expected to get a performance decrease, i.e. the stronger the loop
>was unrolled the more capacity cache misses should arise leading to a
>decrease in the execution time. However, my measurement showed the
>opposite. For a loop with 100 iterations, the increase of the
>unrolling factor (with one exception) continuously reduced the program
>run time. How is this possible?
If this really is a DSP the results you are see are likely do to
memory banking and/or parallel ALUs. Without details of the system -
chip, memory configuration, disassembly of your code, etc. - it's
impossible to say for certain.
>My feeling is that modern processors have sophisticated features (like
>prefetching, fast memories ...) that heavily help to hide/avoid
>instruction cache misses, thus they rarely occur even if a frequently
>executed loop exceeds the cache capacity.
Again, if you are really talking about DSPs, they are architecturally
quite different from CPU systems.
First of all, many DSPs don't have an I-cache in the traditional
sense. Instead they typically use clock matched SRAM and wide
instruction buses. Most DSPs are VLIW (2 or more instructions per
code word) and many can grab 2 or more words per fetch. Code and data
are typically placed in separate memory banks attached to independent
buses. Usually the data memory is dual ported and can be read and
written simultaneously. Also it is typical to use both data and
instruction buses to fetch memory operands simultaneously (so long as
the memory system permits it). DSPs are designed with the expectation
that memory access is fast - usually 1 cycle - and they achieve high
throughput by using deep pipelines and parallel units rather than by
avoiding memory access.
Almost all DSPs have a "loop cache" - a FIFO/RA buffer at the back end
of their instruction pipeline and a dedicated loop start instruction.
Instructions get pushed through the FIFO as the DSP executes. When a
loop start instruction is found, the DSP tracks its position in the
buffer (some track several loop starts simultaneously). If a branch
references a start point that is still in the buffer, the DSP fetches
the loop instructions from the buffer instead of from memory.
In a typical system using clock matched SRAM and where code and data
are placed appropriately, you should see little difference in
execution time between a loop that fits in the cache and one that
doesn't - a loop and the equivalent straight line code should run at
approximately the same speed given appropriate code and data
placement. To see a clear difference, you need to introduce some
memory access interference (code/data or data/data) by placing things
together in the same bank of memory or in competing memory banks
(odd/even, etc.). Access to the loop cache will always be single
Your compiler (assuming it's relatively modern) will separate out code
and data and try to place data as best it can for non-interfering
memory access. DSP linkers typically take as input a file detailing
the memory configuration so they know how to lay out data segments for
the loader. It's likely all of this has been optimized for your
particular setup and you'll need to futz with it to study performance
of different data placement.
>In contract, aggressive unrolling reduced the expensive execution of
>branches (especially mispredicted) in the loop header and produced
>more optimization potential. In total, this pays off even at the cost
>of some more cache misses. So my first conclusion is that the
>commonly found restriction of unrolling factors to avoid too large
>loops not fitting in the cache is obsolete and does not hold for
>modern processors and compilers. Do you agree or are my assumptions
>wrong in some point?
I would say they don't hold for most DSPs.
DSPs expect to be connected to SRAM ... with clock matched SRAM every
access takes 1 cycle so there is no branch penalty to worry about -
both targets are equally "close". Consequently, most DSPs don't do
branch prediction ... only the very fastest (GHz+ clock speeds) and
those that are designed for connection to DRAM can benefit from it.
What matters most for loop speed is whether the loop creates register
over-pressure. In the best case (the normal case with SRAM) the time
to do each body calculation will be constant, but it can be affected
by the extra instructions to load/incr/save loop variables.
Return to the
Search the comp.compilers archives again.