Re: Unrolling and spilling

Carlie Coats <>
Sat, 23 May 2009 11:52:46 -0400

          From comp.compilers

Related articles
Unrolling and spilling (Stephan Ceram) (2009-01-28)
Re: Unrolling and spilling (Sid Touati) (2009-01-29)
Re: Unrolling and spilling (Harold Aptroot) (2009-01-29)
Re: Unrolling and spilling (cr88192) (2009-01-30)
Re: Unrolling and spilling (Carlie Coats) (2009-05-23)
| List of all articles for this month |

From: Carlie Coats <>
Newsgroups: comp.compilers
Date: Sat, 23 May 2009 11:52:46 -0400
Organization: Compilers Central
References: 09-01-058 09-01-061
Keywords: optimize
Posted-Date: 24 May 2009 19:41:15 EDT

Harold Aptroot wrote:
> Instruction scheduling is a very important optimization for pipelined
> processors, they really should have mentioned how it affects
> spilling/unrolling
> Software pipelining is a good example that combines scheduling with
> unrolling..

> Anyway, when an array is accessed in the loop body, and it reads
> back from the same array a few iterations after it had written that
> entry, unrolling could cause the compiler to keep the value in a
> register (staying live across the boundaries of the copies of the
> loop bodies), adding to the pressure. I don't expect that to be
> overly common...

Maybe you don't do numerical analysis codes...

A simple example is a numerical second-derivative code; let's chase
through that:

For pseudocode, you have:

          input array Y[0:N+1]. For realism, assume N large.
          spacing DX, scratch variable DXSQ=DX*DX
          output array D2Y[1:N]
          Loop: I=1,...,N
                  D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ

With clever set-up, the main body has one array load, four
sequentially-dependent FP-arithmetic ops, and one array store, using 5
FP registers.

And long pipeline depth will kill your performance.

Unrolled by 4:
                  D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ
                  D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ
                  D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ
                  D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ

The main body now has one array load, four independent *sets* of four
sequentially-dependent FP-arithmetic operations, and four
array-stores, using eleven FP registers. You can "cover" a lot of
that pipeline depth, and get almost four times the performance.

For big problems on K-long pipelines and lots of registers, unroll by
K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you
have nowhere near 59 registers ;-(

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.