From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.compilers
Date: Tue, 07 Feb 2023 08:35:00 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
References: <Adkz+TvWa4zLl8W9Qd6ovtClKZpZrA==> 23-01-078 23-02-001 23-02-007 23-02-011 23-02-015
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="11090"; mail-complaints-to="abuse@iecc.com"
Keywords: architecture
Posted-Date: 07 Feb 2023 21:24:39 EST
gah4 <gah4@u.washington.edu> writes:
>On Friday, February 3, 2023 at 10:17:06 AM UTC-8, Anton Ertl wrote:
>
>(snip, I wrote)
>
>> >This would have been especially useful for Itanium, which
>> >(mostly) failed due to problems with code generation.
>
>> I dispute the latter claim. My take is that IA-64 failed because the
>> original assumption that in-order performance would exceed OoO
>> performance was wrong. OoO processors surpassed in-order CPUs; they
>> managed to get higher clock rates (my guess is that this is due to
>> them having smaller feedback loops) and they benefit from better
>> branch prediction, which extends to 512-instruction reorder buffers on
>> recent Intel CPUs, far beyond what compilers can achieve on IA-64.
>> The death knell for IA-64 competitiveness was the introduction of SIMD
>> instruction set extensions which made OoO CPUs surpass IA-64 even in
>> those vectorizable codes where IA-64 had been competitive.
...
>But okay, the biggest failure of Itanium is that it was two years or
>so behind schedule when it came out.
If it had had superior performance when McKinley came out in 2002,
that would have been just a minor speed bump on the way to success.
>And partly, as well as I
>remember, is the need to implement x86 instructions, too.
Certainly, backwards compatibility is paramount in the markets that
IA-64 was intended for. And the 386 compatibility also had to be
close to competitive. But it wasn't.
>But is it the whole idea of compile-time instruction scheduling the
>cause of the failure, or just the way they did it?
It seems to me that in-order execution, combined with architectural
features for reducing dependencies and with compiler scheduling,
turned out to be inferior to out-of-order execution. Moreover, in
those areas where compiler scheduling works well (numerical code), it
also works well at using SIMD instructions (auto-vectorization), and
those SIMD instructions could be added relatively cheaply to the
hardware.
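To make that concrete, here is the kind of loop I mean (a generic
example of mine, not taken from any particular benchmark). A compiler
that can schedule it well for an in-order machine can usually also
auto-vectorize it, e.g. GCC or Clang at -O3:

  /* A loop of this shape is the easy case for both compile-time
     scheduling and auto-vectorization: no loop-carried dependence,
     unit-stride memory accesses, a perfectly predictable branch. */
  void saxpy(float *restrict y, const float *restrict x, float a, int n)
  {
    for (int i = 0; i < n; i++)
      y[i] = a * x[i] + y[i];
  }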
>All OoO processors have a limit
>to how far they can go. But the compiler does not have that limit.
And the compiler can make more sophisticated scheduling decisions
based on the critical path, while the hardware scheduler picks the
oldest ready instruction. These were the ideas that seduced Intel,
HP, and Transmeta into investing huge amounts of money in EPIC.
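For illustration, here is a minimal sketch of what critical-path
(list) scheduling looks like; the dependence graph, latencies, and
issue width are made up, and a real scheduler also has to model
register pressure and functional units:

  /* Each cycle, among the ready instructions, issue the one with the
     greatest latency-weighted path length to the end of the DAG.
     A hardware scheduler would instead pick the oldest ready one. */
  #include <stdio.h>

  #define N 6        /* instructions */
  #define WIDTH 2    /* issue slots per cycle */

  static int latency[N] = {3, 1, 1, 3, 1, 1};
  static int succ[N][N], nsucc[N], npred[N];
  static int height[N];    /* critical-path height */
  static int ready_at[N];  /* earliest cycle an instruction is ready */
  static int done[N];

  static void edge(int a, int b) { succ[a][nsucc[a]++] = b; npred[b]++; }

  static int compute_height(int i)
  {
    if (height[i]) return height[i];
    int best = 0;
    for (int k = 0; k < nsucc[i]; k++) {
      int h = compute_height(succ[i][k]);
      if (h > best) best = h;
    }
    return height[i] = latency[i] + best;
  }

  int main(void)
  {
    edge(0, 1); edge(1, 2); edge(3, 4);  /* instruction 5 is independent */
    for (int i = 0; i < N; i++) compute_height(i);

    int scheduled = 0;
    for (int cycle = 0; scheduled < N; cycle++) {
      for (int slot = 0; slot < WIDTH; slot++) {
        int pick = -1;
        for (int i = 0; i < N; i++)  /* ready, with the greatest height */
          if (!done[i] && npred[i] == 0 && ready_at[i] <= cycle &&
              (pick < 0 || height[i] > height[pick]))
            pick = i;
        if (pick < 0) break;
        printf("cycle %d: issue i%d (height %d)\n",
               cycle, pick, height[pick]);
        done[pick] = 1; scheduled++;
        for (int k = 0; k < nsucc[pick]; k++) {
          int s = succ[pick][k];
          npred[s]--;
          if (cycle + latency[pick] > ready_at[s])
            ready_at[s] = cycle + latency[pick];
        }
      }
    }
    return 0;
  }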
But the compiler has other limits. It cannot schedule across indirect
calls (used in object-oriented dispatch) or across compilation-unit
boundaries (in particular, calls to and returns from shared
libraries). Another important limit is the predictability of
branches. Static branch prediction using profiles has ~10%
mispredictions, while (hardware) dynamic branch prediction has a much
lower misprediction rate (I remember numbers like 3%, for the same
benchmarks that have 10% mispredictions with static branch
prediction, in papers from the last century); I expect that this has
improved even more in the meantime. If the compiler mispredicts, it
schedules instructions from the wrong path, instructions that will be
useless for execution.
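As a small illustration of the indirect-call barrier mentioned above
(the names and types here are mine, purely for illustration):

  /* The indirect call through op (think virtual dispatch) is opaque
     to the compiler: it must assume op can read and write arbitrary
     memory, so it cannot hoist later loads above the call or sink
     stores below it.  An OoO core, after predicting the call target,
     keeps scheduling straight through the call. */
  typedef double (*op_fn)(double);

  double apply_sum(const double *a, int n, op_fn op)
  {
    double s = 0.0;
    for (int i = 0; i < n; i++)
      s += op(a[i]);
    return s;
  }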
In the end a compiler typically can schedule across a few dozen
instructions, while hardware can schedule across a few hundred.
Compiler scheduling works well for simple loops, and that's where
IA-64 shone, but only doing loops well is not good enough for
general-purpose software.
>Now, since transistors are cheap now, and one can throw a large
>number into reorder buffers and such, one can build really deep
>pipelines.
It's interesting that Intel managed to produce their first OoO CPU
(the Pentium Pro with 5.5M transistors) in 350nm, while Merced (the
first Itanium), at 25.4M transistors, was too large for the 250nm
process, and they had to switch to 180nm (contributing to the
delays). So, while the theory was that the EPIC principle would
reduce hardware complexity (allowing more functional units to be
added for increased performance), in practice the Itanium hardware
was more complex, and the performance advantages did not appear.
>But the reason for bringing this up, is that if Intel had a defined
>intermediate code, and supplied the back end that used it,
>and even more, could update that back end later, that would have
>been very convenient for compiler writers.
Something like this happened at roughly the same time with LLVM.
There were other initiatives, but LLVM was the one that succeeded:
there was the Open Research Compiler for IA-64 from Intel and the
Chinese Academy of Sciences, and SGI released its compiler targeting
IA-64 as Open64.
>Even more, design for it could have been done in parallel with the
>processor, making both work well together.
Intel, HP, and others worked on compilers in parallel to the hardware
work. It's just that the result did not perform as well for
general-purpose code as OoO processors with conventional compilers.
>[Multiflow found that VLIW compile-time instruction scheduling was
>swell for code with predictable memory access patterns, much less so
>for code with data-dependent access patterns.
Multiflow and Cydrome built computers for numerical computations (aka
HPC aka (mini-)supercomputing). These tend to spend a lot of time in
simple loops with high iteration counts and have statically
predictable branches. They developed compiler techniques such as
trace scheduling (Multiflow), which works well for code with
predictable branches, and modulo scheduling (Cydrome), which works
well for simple loops with high iteration counts. The IA-64 and
Transmeta architects
wanted to extend these successes to general-purpose computing, but it
did not work out.
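For illustration, here is what modulo scheduling amounts to, written
out by hand at the source level (a real modulo scheduler works on the
machine code and overlaps more than two iterations):

  /* Source-level sketch of software pipelining: the load for
     iteration i+1 is issued while the multiply and store of
     iteration i complete, so the load latency is hidden; a prologue
     and an epilogue handle the first load and the last store. */
  void scale(float *restrict dst, const float *restrict src,
             float a, int n)
  {
    if (n <= 0) return;
    float x = src[0];              /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
      float next = src[i + 1];     /* load for iteration i+1 ... */
      dst[i] = a * x;              /* ... overlaps the work of iteration i */
      x = next;
    }
    dst[n - 1] = a * x;            /* epilogue: last multiply and store */
  }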
Concerning memory access patterns: While they are not particularly
predictable in general-purpose code, most general-purpose code
benefits quite a lot from caches (more than numerical code), so I
don't think that this was a big problem for IA-64.
Some people mention varying latencies due to caches as a problem, but
the choice of a small L1 cache (16KB compared to 64KB for the earlier
21264 and K7 CPUs) for McKinley indicates that average latency was
more of a problem for IA-64 than varying latencies.
>And if the memory access is that predictable, you can
>likely use SIMD instructions instead. -John]
Yes, SIMD ate EPIC's lunch on the numerical program side, leaving few
programs where IA-64 outdid the mainstream.
- anton
--
M. Anton Ertl
anton@mips.complang.tuwien.ac.at
http://www.complang.tuwien.ac.at/anton/