|Unrolling and spilling linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-28)|
|Re: Unrolling and spilling SidTouati@inria.fr (Sid Touati) (2009-01-29)|
|Re: Unrolling and spilling firstname.lastname@example.org (Harold Aptroot) (2009-01-29)|
|Re: Unrolling and spilling email@example.com (cr88192) (2009-01-30)|
|Re: Unrolling and spilling firstname.lastname@example.org (Carlie Coats) (2009-05-23)|
|From:||Carlie Coats <email@example.com>|
|Date:||Sat, 23 May 2009 11:52:46 -0400|
|Posted-Date:||24 May 2009 19:41:15 EDT|
Harold Aptroot wrote:
> Instruction scheduling is a very important optimization for pipelined
> processors, they really should have mentioned how it affects
> Software pipelining is a good example that combines scheduling with
> Anyway, when an array is accessed in the loop body, and it reads
> back from the same array a few iterations after it had written that
> entry, unrolling could cause the compiler to keep the value in a
> register (staying live across the boundaries of the copies of the
> loop bodies), adding to the pressure. I don't expect that to be
> overly common...
Maybe you don't do numerical analysis codes...
A simple example is a numerical second-derivative code; let's chase
For pseudocode, you have:
input array Y[0:N+1]. For realism, assume N large.
spacing DX, scratch variable DXSQ=DX*DX
output array D2Y[1:N]
D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ
With clever set-up, the main body has one array load, four
sequentially-dependent FP-arithmetic ops, and one array store, using 5
And long pipeline depth will kill your performance.
Unrolled by 4:
D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ
D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ
D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ
D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ
The main body now has one array load, four independent *sets* of four
sequentially-dependent FP-arithmetic operations, and four
array-stores, using eleven FP registers. You can "cover" a lot of
that pipeline depth, and get almost four times the performance.
For big problems on K-long pipelines and lots of registers, unroll by
K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you
have nowhere near 59 registers ;-(
Return to the
Search the comp.compilers archives again.