Related articles
Compilers taking advantage of architectural enhancements aglew@crhc.uiuc.edu (1990-10-11)
Re: Compilers taking advantage of architectural enhancements preston@titan.rice.edu (1990-10-11)
Re: Compilers taking advantage of architectural enhancements golds@fjcnet.GOV (1990-10-12)
Re: Compilers taking advantage of architectural enhancements aglew@crhc.uiuc.edu (1990-10-12)
Re: Compilers taking advantage of architectural enhancements spot@TR4.GP.CS.CMU.EDU (1990-10-12)
Re: Compilers taking advantage of architectural enhancements ctl8588@rigel.tamu.edu (1990-10-14)
Re: Compilers taking advantage of architectural enhancements jourdan@minos.inria.fr (1990-10-15)
[8 later articles]
Newsgroups: comp.compilers
From: aglew@crhc.uiuc.edu (Andy Glew)
In-Reply-To: gillies@m.cs.uiuc.edu's message of 10 Oct 90 21:51 CDT
Keywords: design, optimize
Organization: Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign
References: <162639@<1990Oct9> <3300194@m.cs.uiuc.edu>
Date: 11 Oct 90 19:49:20 GMT
[Copied from comp.arch. -John]
>[gillies@m.cs.uiuc.edu]
>One of the problems in new CPU designs is that designers don't realize
>which architecture enhancements are pointless, because we don't and
>may never have the optimization technology to take advantage of them.
Ah, a potentially interesting and useful topic. Perhaps we can start
a discussion that will lead to a list of possible hardware
architectural enhancements that a compiler can/cannot take advantage
of? Maybe the real compiler guys (preston, David C.) will help us
out?
By the way, it is useful in such discussions to distinguish between
run-of-the-mill optimizations that can pretty much be added to any
existing compiler by a fairly good CS grad, and optimizations that
require so much background that only the "experts" in the field can
reasonably be expected to do them. For example, many of the
parallelizing FORTRAN loop optimizations can reasonably be expected to
be done only by people from Rice or UIUC CSRD (note that I'm
UIUC, but CRHC, not CSRD), so unless you are willing to pay for these
guys (or their assorted front companies) you aren't likely to get them
onto your machine.
Cost of compiler development can be significant. Sometimes a
company might put a hardware feature in even though they know a
compiler approach would be better, because they can't afford the
compiler guys (or the compiler guys already have exclusive contracts
with a competitor).
Let me list a few things to start off the discussion. I hope and
expect to be shot down on a few of them. It's easy to list a few of
the hardware enhancements that we already know compilers can take
advantage of.
Branch Delay Slots - small number
One or two branch delay slots can be filled quite easily.
DIFFICULTY: a reasonably good CS grad can do it.
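To make the "easy" case concrete, here is a minimal sketch in C of filling a
single delay slot from within the basic block. The three-address IR and all
names are invented for illustration, and memory operations and other side
effects are ignored; the idea is just to scan backwards from the branch for an
instruction that can legally be moved past everything between it and the branch.

    /* Toy three-address instruction: registers are small integers, -1 = none.
     * Memory references and side effects are not modelled. */
    typedef struct {
        const char *op;
        int dest;            /* register written, or -1 */
        int src1, src2;      /* registers read, or -1   */
    } Insn;

    static int reads(const Insn *i, int r)  { return r >= 0 && (i->src1 == r || i->src2 == r); }
    static int writes(const Insn *i, int r) { return r >= 0 && i->dest == r; }

    /* The block is insn[0..n-1] and insn[n-1] is the delayed branch.
     * Returns the index of an instruction that may be moved into the
     * delay slot, or -1, in which case the code generator emits a NOP. */
    int fill_delay_slot(const Insn *insn, int n)
    {
        int branch = n - 1;
        for (int c = branch - 1; c >= 0; c--) {             /* candidate */
            int ok = 1;
            for (int j = c + 1; j <= branch && ok; j++) {
                /* Moving the candidate below insn[j] must not reorder
                 * any dependent pair of register accesses: */
                if (reads(&insn[j], insn[c].dest))  ok = 0; /* j consumes c's result    */
                if (writes(&insn[j], insn[c].dest)) ok = 0; /* output dependence        */
                if (reads(&insn[c], insn[j].dest))  ok = 0; /* c would read a new value */
            }
            if (ok) return c;
        }
        return -1;
    }

A fancier filler would also consider pulling instructions down from the branch
target or the fall-through path, which is where the work described in the next
item comes in.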
Branch Delay Slots - large number
Most groups report difficulty filling a large number of branch delay
slots, although my group's compiler guy (UIUC IMPACT, Prof. Wen-mei Hwu)
has been able to get good results for up to 8 delay slots, on slightly
different hardware (forward semantic, with profiling feedback).
DIFFICULTY: given the slightly different hardware, and exposure
to the necessary ideas, a reasonably good CS grad can do it.
Register file - moderate sized (up to 32 registers)
Most compilers can use these efficiently.
DIFFICULTY: a reasonably good CS grad can do it.
Avoiding IBM's patents is a slight problem.
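As I understand it, the patents in question cover Chaitin-style graph
coloring, which is the standard way of using a register file of this size
well. Below is a minimal sketch of the simplify/select phases; spilling and
coalescing are left out, and the names, sizes, and adjacency-matrix
representation are all invented for illustration.

    #define MAXV 64          /* virtual registers */
    #define K     8          /* machine registers */

    static int interferes[MAXV][MAXV];   /* symmetric interference graph */
    static int color[MAXV];              /* assigned machine register    */

    static int degree(int v, const int removed[], int n)
    {
        int d = 0;
        for (int u = 0; u < n; u++)
            if (u != v && !removed[u] && interferes[v][u]) d++;
        return d;
    }

    /* Returns 1 on success.  A return of 0 means some node always had
     * degree >= K; a real allocator would pick a spill candidate, insert
     * spill code, and try again. */
    int color_graph(int n)
    {
        int removed[MAXV] = {0}, stack[MAXV], top = 0;

        for (int v = 0; v < n; v++) color[v] = -1;

        /* Simplify: repeatedly remove a node with fewer than K neighbours. */
        for (int done = 0; done < n; done++) {
            int v = -1;
            for (int u = 0; u < n; u++)
                if (!removed[u] && degree(u, removed, n) < K) { v = u; break; }
            if (v < 0) return 0;                 /* would have to spill */
            removed[v] = 1;
            stack[top++] = v;
        }

        /* Select: pop nodes and give each the lowest colour no neighbour uses. */
        while (top > 0) {
            int v = stack[--top], used[K] = {0};
            removed[v] = 0;
            for (int u = 0; u < n; u++)
                if (!removed[u] && interferes[v][u] && color[u] >= 0)
                    used[color[u]] = 1;
            for (int c = 0; c < K; c++)
                if (!used[c]) { color[v] = c; break; }
        }
        return 1;
    }

Because every node pushed during simplify had fewer than K live neighbours at
the time, a free colour is guaranteed to exist when it is popped.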
Register file - large (around 128 registers, or more)
Most compilers do not get enough benefit from these to justify
the extra hardware, or the slowed down register access.
Multiple, hierarchical register sets
Like Cray's fast S registers and slower T registers.
It took a very long time for compilers to take advantage
of these multiple register sets.
DIFFICULTY: complex.
Separate floating point register file
Most compilers can take advantage of this.
DIFFICULTY: easy
Heterogeneous register file
Few compilers have been developed that can take advantage of a truly
heterogeneous register file, one in which, for example, the divide unit
writes to registers D1..D2, the add unit writes to registers A1..A16, and
the shift unit writes to registers S1..S4 -- even though such hardware
conceivably has a cycle-time advantage over homogeneous registers, even on
VLIW machines where data can easily be moved to generic registers when
necessary.
DIFFICULTY: hard.
Instruction cache
Many groups have reported compilers that can take advantage of small
instruction caches by rearranging code to minimize I-cache misses, e.g.
HP and UIUC IMPACT.
DIFFICULTY: moderately complex - a good CS grad can do it, but it requires
tools and background that take a while to develop, and are not widely
available. One of the complicating factors is that this sort of
optimization is most useful when it is done in cooperation with the
linker/loader, which is something that your average compiler hacker
doesn't have background in (although it isn't too hard).
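A crude sketch of what the code-rearrangement part looks like, assuming
profile counts are available (block ids and counts are made up): sort the
blocks so that the hot path is packed into as few I-cache lines as possible
and cold blocks (error handling and the like) are pushed to the end. A
production pass, e.g. in the Pettis and Hansen style, would instead chain
blocks along hot control-flow edges and coordinate with the linker, but the
flavour is the same.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int id; long exec_count; } Block;

    static int hotter_first(const void *a, const void *b)
    {
        const Block *x = a, *y = b;
        return (y->exec_count > x->exec_count) - (y->exec_count < x->exec_count);
    }

    int main(void)
    {
        Block blocks[] = { {0, 10}, {1, 100000}, {2, 99000}, {3, 12}, {4, 98000} };
        int n = sizeof blocks / sizeof blocks[0];

        qsort(blocks, n, sizeof blocks[0], hotter_first);

        printf("emit order:");
        for (int i = 0; i < n; i++) printf(" B%d", blocks[i].id);
        printf("\n");              /* -> B1 B2 B4 B3 B0: hot blocks contiguous */
        return 0;
    }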
Data cache - software managed consistency
This reportedly has been done, but certainly isn't run-of-the-mill.
DIFFICULTY: needs a skilled compiler expert.
Micro-scheduling parallelism (like CONVEX's ASAP)
Apparently it isn't too hard to develop compiler support for this.
Hardware is rather expensive, though.
DIFFICULTY: moderately complex
Vectorizing
Done in lots of places, but high performance vectorizing compilers
still seem to be a problem for a lot of companies.
DIFFICULTY: complex. The easy things are easy, but milking the
last few FLOPs out of your vector unit requires compiler experts.
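The difference between the easy cases and the hard ones is mostly dependence
analysis. The first loop below has no loop-carried dependence and vectorizes
straightforwardly; the second carries a value from iteration i-1 to iteration
i, so naive vectorization would change its result (function names are mine,
for illustration).

    /* No loop-carried dependence: each iteration is independent,
     * so the loop maps directly onto vector operations. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Recurrence: iteration i reads the value iteration i-1 just wrote,
     * so this cannot be vectorized directly (it needs a scan/recurrence
     * scheme, which is where the compiler experts come in). */
    void prefix_sum(int n, float *x)
    {
        for (int i = 1; i < n; i++)
            x[i] = x[i] + x[i - 1];
    }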
Parallelizing - fine grain, small numbers of processors
(like Alliant's concurrent hardware)
Done, e.g. by Alliant and Cray.
DIFFICULTY: to be competitive with this sort of hardware, you
need real compiler experts on your team. Harder than vectorizing.
Parallelizing - fine grain, large numbers of processors
What Tera is trying to do.
DIFFICULTY: very complex. Leading edge.
Good luck, Tera.
Multiple functional units - heterogeneous - VLIW or superscalar
DIFFICULTY: complex.
Complexity of the packing algorithms grows rapidly as
you try to go beyond basic-block parallelism.
Multiple functional units - homogeneous - VLIW or superscalar
DIFFICULTY: moderately complex
Easier than the heterogeneous case; the packing algorithms in
particular are considerably simpler.
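For the homogeneous case, the core of the packing problem within a basic
block can be sketched in a few lines: walk the block in order and put an
operation into the current issue group unless it depends on something already
in the group or the group is full. The toy IR, the WIDTH, and the
conservative dependence test are all invented for illustration; a real VLIW
or superscalar compiler would do list scheduling with latencies and resource
tables, and that is where the complexity beyond basic blocks comes from.

    #include <stdio.h>

    #define WIDTH 4          /* operations per issue group / long instruction word */

    typedef struct {
        const char *op;
        int dest, src1, src2;    /* registers, -1 = none */
    } Op;

    /* Conservative: any flow, output, or anti dependence keeps the two
     * operations in separate groups. */
    static int depends_on(const Op *later, const Op *earlier)
    {
        return (earlier->dest >= 0 &&
                (later->src1 == earlier->dest ||
                 later->src2 == earlier->dest ||
                 later->dest == earlier->dest)) ||
               (later->dest >= 0 &&
                (earlier->src1 == later->dest || earlier->src2 == later->dest));
    }

    static void flush(const Op *ops, const int *group, int fill)
    {
        printf("issue:");
        for (int g = 0; g < fill; g++) printf(" %s", ops[group[g]].op);
        printf("\n");
    }

    void pack(const Op *ops, int n)
    {
        int group[WIDTH], fill = 0;

        for (int i = 0; i < n; i++) {
            int conflict = 0;
            for (int g = 0; g < fill; g++)
                if (depends_on(&ops[i], &ops[group[g]])) conflict = 1;
            if (conflict || fill == WIDTH) {     /* close the current group */
                flush(ops, group, fill);
                fill = 0;
            }
            group[fill++] = i;
        }
        if (fill) flush(ops, group, fill);
    }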
Special hardware instructions - scalar
Taking advantage of simple instructions like abs(), conditional exchange, etc.
DIFFICULTY:
(1) When treated not as a compiler problem, but as a problem of simply
writing libraries that inline optimized machine code: EASY.
Requires inlining support.
(2) When treated as a compiler problem, e.g. the compiler should recognize
IF a < 0 THEN a = -a
and automatically convert it to a = ABS(a): MODERATELY COMPLEX.
I expect Herman Rubin to comment on this.
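For concreteness, the two treatments look like this in C. The identifiers are
mine, and fabsf() simply stands in for whatever single hardware instruction
the machine provides.

    #include <math.h>

    /* (1) Library/inlining approach: the programmer calls a routine that
     * the compiler inlines down to the hardware abs instruction. */
    float via_library(float a)
    {
        return fabsf(a);
    }

    /* (2) Compiler approach: the optimizer recognizes the branching idiom
     * itself and rewrites it into the same single branch-free instruction.
     *        IF a < 0 THEN a = -a   ==>   a = ABS(a)                      */
    float via_idiom(float a)
    {
        if (a < 0)
            a = -a;
        return a;
    }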
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]