|[8 earlier articles]|
|Re: Graph colouring and local register allocation email@example.com (sdt) (2007-12-01)|
|Re: Graph colouring and local register allocation firstname.lastname@example.org (email@example.com) (2007-12-01)|
|Re: Graph colouring and local register allocation gneuner2@/comcast.net (George Neuner) (2007-12-01)|
|Re: Graph colouring and local register allocation firstname.lastname@example.org (=?ISO-8859-1?Q?Pertti_Kellom=E4ki?=) (2007-12-03)|
|Re: Graph colouring and local register allocation SidTouati@inria.fr (Sid Touati) (2007-12-04)|
|Re: Graph colouring and local register allocation email@example.com (2007-12-05)|
|Re: Graph colouring and local register allocation SidTouati@inria.fr (Sid Touati) (2007-12-08)|
|Re: Graph colouring and local register allocation firstname.lastname@example.org (2007-12-09)|
|Re: Graph colouring and local register allocation email@example.com (glen herrmannsfeldt) (2007-12-09)|
|Re: Graph colouring and local register allocation SidTouati@inria.fr (Sid Touati) (2007-12-10)|
|From:||Sid Touati <SidTouati@inria.fr>|
|Date:||Sat, 08 Dec 2007 19:07:17 +0100|
|References:||07-10-103 07-11-019 07-11-063 07-11-091 07-12-003 07-12-005 07-12-007 07-12-010 07-12-013|
Yes I agree with some of your remarks about OoO execution and register
renaming. Let's debate further. Do not forget that:
1/ OoO execution is a heuristics (a basic list scheduler). And we know
that any list scheduler can be twice the optimal schedule...
2/ Do not forget that register OOO have a limited instruction window.
Nowadays, codes are large, and basic blocks are larger that instruction
3/ Register renaming is limited by the number of hidden hardware
4/ The commit stages of the pipelines are still in-order.
Many other architectural restrictions are not modeled by ILP schedulers.
We can launch maybe a new thread to debate why compilers do not model
every microarch restrictions.
I suggest you an experience to check if OoO execution and register
renaming are restricted in practice. For instance, take a set of loops
with a reasonable amount of ILP. Choose loops with high data locality
in order to avoid data cache problems. Or even better, choose compute
bound loops (not memory bound). Then unroll these loops with a high
unrolling degree in order to produce large loop bodies. Compile and
optimise with all the options you want. Execute and see if OoO
execution and register renaming allow to be more or less close to
expected performances. Use hardware performance counters to really
measure the real performance metrics.
Torben Aegidius Mogensen a ecrit :
> If you work with processors that have little or no runtime scheduling
> but lots of registers (Itanium, Cell, etc.), then compile-time
> scheduling becomes more important than register allocation.
We have our experience on itanium2 processors, and trust me, 128
registers are not sufficient for many codes... Just take the case of
scientific computing codes. Loops are highly parallel, and loads require
higher static latencies (at least 8 cycles, since floating point data
reside only on the L2). Consequently, ILP scheduling techniques are
aggressive regarding register requirement, and instruction scheduling of
highly parallel loops becomes quickly restricted by registers. If you
spill, you lose (the compiler cannot guess how spill would react at
execution, while it can statically guess how arithmetic operations would
More details on scheduling instructions on itanium2 can be found in the
An Efficient Memory Operations Optimization Technique for Vector Loops
on Itanium 2 Processors. Concurrency and Computation: Practice and
Experience. Volume 11, issue 11, september 2006, pages 1485 - 1508.
If you are interested in ooO processors such that alpha 21264 and
Power4, take a look on the following reference:
Christophe Lemuet and William Jalby and Sid-Ahmed-Ali Touati. Improving
Load/Store Queues Usage in Scientific Computing. The International
Conference on Parallel Processing (ICPPb04). Montreal, august 2004, IEEE.
Regarding the Cell, I do not know if its architecture is really intended
for ILP. I would say that Cell targets medium or coarse grain parallelism.
Thanks for sharing,
ps: 128 registers are far from what high performance codes require. If
some compilers cannot allocate more than 128 registers in practice, this
can be explained in another debate...
Return to the
Search the comp.compilers archives again.