Related articles |
---|
Unrolling and spilling linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-28) |
Re: Unrolling and spilling SidTouati@inria.fr (Sid Touati) (2009-01-29) |
Re: Unrolling and spilling harold.aptroot@gmail.com (Harold Aptroot) (2009-01-29) |
Re: Unrolling and spilling cr88192@hotmail.com (cr88192) (2009-01-30) |
Re: Unrolling and spilling carlie@jyarborough.com (Carlie Coats) (2009-05-23) |
From: | Carlie Coats <carlie@jyarborough.com> |
Newsgroups: | comp.compilers |
Date: | Sat, 23 May 2009 11:52:46 -0400 |
Organization: | Compilers Central |
References: | 09-01-058 09-01-061 |
Keywords: | optimize |
Posted-Date: | 24 May 2009 19:41:15 EDT |
Harold Aptroot wrote:
[snip...]
> Instruction scheduling is a very important optimization for pipelined
> processors, they really should have mentioned how it affects
> spilling/unrolling
> Software pipelining is a good example that combines scheduling with
> unrolling..
> Anyway, when an array is accessed in the loop body, and it reads
> back from the same array a few iterations after it had written that
> entry, unrolling could cause the compiler to keep the value in a
> register (staying live across the boundaries of the copies of the
> loop bodies), adding to the pressure. I don't expect that to be
> overly common...
Maybe you don't do numerical analysis codes...
A simple example is a numerical second-derivative code; let's chase
through that:
For pseudocode, you have:
input array Y[0:N+1]. For realism, assume N large.
spacing DX, scratch variable DXSQ=DX*DX
output array D2Y[1:N]
Loop: I=1,...,N
D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ
With clever set-up, the main body has one array load, four
sequentially-dependent FP-arithmetic ops, and one array store, using 5
FP registers.
And long pipeline depth will kill your performance.
Unrolled by 4:
Loop:
D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ
D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ
D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ
D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ
The main body now has one array load, four independent *sets* of four
sequentially-dependent FP-arithmetic operations, and four
array-stores, using eleven FP registers. You can "cover" a lot of
that pipeline depth, and get almost four times the performance.
For big problems on K-long pipelines and lots of registers, unroll by
K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you
have nowhere near 59 registers ;-(
Return to the
comp.compilers page.
Search the
comp.compilers archives again.