Sat, 23 May 2009 11:52:46 -0400

Related articles |
---|

Unrolling and spilling linuxkaffee_@_gmx.net (Stephan Ceram) (2009-01-28) |

Re: Unrolling and spilling SidTouati@inria.fr (Sid Touati) (2009-01-29) |

Re: Unrolling and spilling harold.aptroot@gmail.com (Harold Aptroot) (2009-01-29) |

Re: Unrolling and spilling cr88192@hotmail.com (cr88192) (2009-01-30) |

Re: Unrolling and spilling carlie@jyarborough.com (Carlie Coats) (2009-05-23) |

From: | Carlie Coats <carlie@jyarborough.com> |

Newsgroups: | comp.compilers |

Date: | Sat, 23 May 2009 11:52:46 -0400 |

Organization: | Compilers Central |

References: | 09-01-058 09-01-061 |

Keywords: | optimize |

Posted-Date: | 24 May 2009 19:41:15 EDT |

Harold Aptroot wrote:

[snip...]

*> Instruction scheduling is a very important optimization for pipelined*

*> processors, they really should have mentioned how it affects*

*> spilling/unrolling*

*> Software pipelining is a good example that combines scheduling with*

*> unrolling..*

*> Anyway, when an array is accessed in the loop body, and it reads*

*> back from the same array a few iterations after it had written that*

*> entry, unrolling could cause the compiler to keep the value in a*

*> register (staying live across the boundaries of the copies of the*

*> loop bodies), adding to the pressure. I don't expect that to be*

*> overly common...*

Maybe you don't do numerical analysis codes...

A simple example is a numerical second-derivative code; let's chase

through that:

For pseudocode, you have:

input array Y[0:N+1]. For realism, assume N large.

spacing DX, scratch variable DXSQ=DX*DX

output array D2Y[1:N]

Loop: I=1,...,N

D2Y[I] = (Y[I+1]-2.0*Y[I]+Y[I-1])/DXSQ

With clever set-up, the main body has one array load, four

sequentially-dependent FP-arithmetic ops, and one array store, using 5

FP registers.

And long pipeline depth will kill your performance.

Unrolled by 4:

Loop:

D2Y[I ] = (Y[I+1]-2.0*Y[I ]+Y[I-1])/DXSQ

D2Y[I+1] = (Y[I+2]-2.0*Y[I+1]+Y[I ])/DXSQ

D2Y[I+2] = (Y[I+3]-2.0*Y[I+2]+Y[I+1])/DXSQ

D2Y[I+3] = (Y[I+4]-2.0*Y[I+3]+Y[I+2])/DXSQ

The main body now has one array load, four independent *sets* of four

sequentially-dependent FP-arithmetic operations, and four

array-stores, using eleven FP registers. You can "cover" a lot of

that pipeline depth, and get almost four times the performance.

For big problems on K-long pipelines and lots of registers, unroll by

K-1, using 2K-1 FP registers... Oops (Prescott!) maybe K=30 and you

have nowhere near 59 registers ;-(

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.