Re: Compilers taking advantage of architectural enhancements (Andy Glew)
12 Oct 90 03:28:01 GMT

          From comp.compilers

Newsgroups: comp.compilers
From: (Andy Glew)
In-Reply-To:'s message of 11 Oct 90 22:32:24 GMT
Keywords: design
Organization: Center for Reliable and High-Performance Computing University of Illinois at Urbana Champaign
References: <1990Oct9> <> <> <>
Date: 12 Oct 90 03:28:01 GMT

[Copied from comp.arch -John]

>>Perhaps we can start
>>a discussion that will lead to a list of possible hardware
>>architectural enhancements that a compiler can/cannot take advantage of?
>I'll comment on a couple of features from the list

Thanks. Do you have any features of your own to add?

>>Register file - large (around 128 registers, or more)
>> Most compilers do not get enough benefit from these to justify
>> the extra hardware, or the slowed down register access.
>In the proceedings of Sigplan 90, there's a paper about how to chew
>lots of registers.
> Improving Register Allocation for Subscripted Variables
> Callahan, Carr, Kennedy
>I suggested the subtitle "How to use all of those FP registers"
>but nobody was impressed. Also, there's a limit to how many registers
>you need, at least for scientific fortran. It depends on the speed
>of memory and cache, speed of the FPU, and the actual applications.
>The idea is that once the FPU is running at full speed,
>more registers are wasted.

I'm pretty sure that being able to put elements of aggregates into
registers is the next big step - subscripted variables and structure
fields.

However, if it's only just now being published, that probably means
it's leading edge, so a company should not design for this unless it
has guys in its compiler group actually working on it.
Any comment on this way of looking at it?
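
To make "putting elements of aggregates into registers" concrete, here is
a minimal sketch of the idea for a simple recurrence. It is an
illustration only, not the algorithm from the Callahan/Carr/Kennedy
paper. The CountingArray class and the function names are hypothetical;
a Python local stands in for a register, and counted element reads stand
in for memory loads.

```python
class CountingArray:
    """List wrapper that counts element reads (a stand-in for memory loads)."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __setitem__(self, i, v):
        self.values[i] = v


def recurrence_naive(a, b):
    # a[i] = a[i-1] + b[i]: a[i-1] is reloaded from memory every iteration,
    # even though it was stored on the previous iteration.
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]


def recurrence_scalar_replaced(a, b):
    # Keep the value just stored in a local ("register") and reuse it,
    # so the array a is read only once, before the loop.
    prev = a[0]
    for i in range(1, len(a)):
        prev = prev + b[i]
        a[i] = prev


def demo(n=8):
    a1, b1 = CountingArray(range(n)), CountingArray([1] * n)
    a2, b2 = CountingArray(range(n)), CountingArray([1] * n)
    recurrence_naive(a1, b1)
    recurrence_scalar_replaced(a2, b2)
    # Same results, but far fewer reads of the array a in the second form.
    return a1.values, a2.values, a1.loads, a2.loads
```

Both versions compute the same array; the second needs one more live
register per value carried across iterations, which is where a large
register file would start to pay off.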

>>Heterogenous register file
>> Few compilers have been developed that can take advantage of a
>> truly heterogenous register file, one in which, for example, the
>> divide unit writes to registers D1..D2, the add unit writes to
>> registers A1..A16, the shift unit writes to registers S1..S4 --
>> even though such hardware conceivably has a cycle time advantage
>> over homogenous registers, even on VLIW machines where data can
>> easily be moved to generic registers when necessary.
>> DIFFICULTY: hard.
>At first glance, the problem seems susceptible to coloring.
>Perhaps I'm missing something.

I agree with you --- I really don't understand why heterogenous
register files are so hard to handle. But homogenous register files
are one thing that compiler people have gone into rhapsodies wrt. RISC.

Here's one example: the Intel 80x86 is basically a heterogenous
register file machine. Specific registers were tied to the outputs and
inputs of specific functional units in the original hardware.
Compiler people hated targeting this architecture, and there are very
few compilers that can produce machine code comparable to hand-coded
assembly on this architecture.

But heterogenous register files are much easier to make fast. I think
that two things can be done to make them easier for the compilers to
handle:
        (1) Provide more than one register at the output of any specific
            functional unit. This avoids the need to immediately move
            the result away.
            (if my spelling is bad it's because my keyboard just went flakey)
        (2) Make the register file heterogenous for writes, but homogenous
            for reads (it's easier to provide read multiporting than
            write multiporting).
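
The "susceptible to coloring" remark can be sketched as ordinary graph
coloring in which each value may only take registers attached to the
unit that produces it. This is a toy greedy colorer under assumed
inputs, not a production allocator: the function name, the register
names, and the choice to report uncolorable values as "spilled" are all
assumptions made for illustration (a real compiler might instead insert
a move to a generic register).

```python
def color_with_classes(interference, allowed):
    """Greedy coloring with per-value register classes.

    interference: value -> set of values live at the same time
    allowed:      value -> list of physical registers its producing
                  unit can write (the heterogenous-write constraint)
    Returns (assignment, spilled).
    """
    assignment, spilled = {}, []
    # Color the most constrained values first (fewest allowed registers);
    # ties are broken by name so the result is deterministic.
    order = sorted(allowed, key=lambda v: (len(allowed[v]), v))
    for v in order:
        taken = {assignment[n] for n in interference.get(v, ())
                 if n in assignment}
        free = [r for r in allowed[v] if r not in taken]
        if free:
            assignment[v] = free[0]
        else:
            spilled.append(v)
    return assignment, spilled


# Three divide results live at once, but the divide unit only writes D1..D2.
interference = {
    "q1": {"q2", "q3", "s1"},
    "q2": {"q1", "q3"},
    "q3": {"q1", "q2"},
    "s1": {"q1"},
}
allowed = {
    "q1": ["D1", "D2"],
    "q2": ["D1", "D2"],
    "q3": ["D1", "D2"],
    "s1": ["A1", "A2"],
}
assignment, spilled = color_with_classes(interference, allowed)
```

In this example q3 cannot be colored because only two registers hang off
the divide unit, which is the pressure that point (1) is meant to
relieve: giving each unit more than one output register gives the
colorer more slack.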

>>Data cache - software managed consistency
>> This reportedly has been done, but certainly isn't run-of-the-mill.
>> DIFFICULTY: needs a skilled compiler expert.
>At Rice (and other places), people are considering the perhaps
>related problems of trying to manage cache usage for a single
>processor. I'm personally turned on by the topic because of
>big performance gains possible and the possible impact on
>architecture. Questions like: Can we get away with no D-cache?
>Perhaps we don't need cache for FP only?
>Can we get away with only direct mapped cache? What does a compiler
>do with set associativity? How can we do prefetches to cache?
>Porterfield did a thesis here that talks some about these questions.
>Additionally, Callahan and Porterfield (both at Tera) have a paper
>in Supercomputing 90 on (perhaps) similar topics.

Again - it's current research, so a company should not "bet the farm"
on this sort of thing, unless it has people actively working in the field.
A company should not assume that it can just go out and buy this sort
of compiler technology, or that it can hire a generic CS grad to
implement this sort of optimization within a year.

>>Multiple functional units - heterogenous - VLIW or superscalar
>> DIFFICULTY: complex.
>>Multiple functional units - homogenous - VLIW or superscalar
>> DIFFICULTY: moderately complex
>> Easier than the heterogenous case, and the packing algorithms
>> are considerably easier.
>I had never thought to distinguish the two cases and
>I'm not sure why the scheduling algorithms should be much different.

Neither am I. I just put this up because it seemed to be what Gillies was
referring to.
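
For what it's worth, here is a minimal sketch of why the two cases can
share one scheduling algorithm. It is a toy cycle-by-cycle list
scheduler under simplifying assumptions (every operation takes one
cycle, the dependence graph is acyclic, priority is just name order);
the function name and the data layout are hypothetical. The only thing
that differs between the heterogenous and homogenous cases is the
resource table passed in.

```python
def list_schedule(ops, deps, units):
    """Greedy list scheduling.

    ops:   op name -> unit class it needs
    deps:  op name -> set of ops that must complete first
    units: unit class -> issue slots available per cycle
    Returns a list of cycles, each a list of op names issued that cycle.
    """
    done, schedule = set(), []
    remaining = sorted(ops)
    while remaining:
        free = dict(units)          # fresh slots every cycle
        issued = []
        for op in remaining:
            ready = deps.get(op, set()) <= done
            if ready and free.get(ops[op], 0) > 0:
                free[ops[op]] -= 1
                issued.append(op)
        if not issued:
            raise ValueError("nothing can issue: cyclic deps or missing unit")
        done.update(issued)         # results become visible next cycle
        remaining = [op for op in remaining if op not in issued]
        schedule.append(issued)
    return schedule


deps = {"d": {"a", "c"}}

# Heterogenous machine: one adder and one multiplier.
hetero = list_schedule({"a": "add", "b": "add", "c": "mul", "d": "add"},
                       deps, {"add": 1, "mul": 1})

# Homogenous machine: two identical units that can run anything.
homo = list_schedule({"a": "any", "b": "any", "c": "any", "d": "any"},
                     deps, {"any": 2})
```

The packing differs between the two machines, but the scheduler itself
does not change.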

>>Special hardware instructions - scalar
>> Taking advantage of simple instructions like abs(), conditional
>> exchange, etc.
>> (1) When treated not as a compiler problem, but as a problem of simply
>> writing libraries to inline optimized machine code, EASY
>> Requires inlining support.
>For intrinsics, I follow the PL.8 example.
>That is, have intermediate language instructions
>for ABS etc. so the optimizer can try and hoist them or perhaps strength
>reduce them (e.g. SIN). Then expand to a simple form (perhaps with branches
>and so forth), and let the optimizer get at the guts of each operation.
>Some like ABS might be available as basic instructions and so need not
>be expanded to a lower level form. This seems to require that the
>front-end recognize certain calls as intrinsics. Naturally, this
>works fine with Fortran, but compilers for other languages could
>easily adopt the same approach. Probably have for C.
>This isn't wonderfully extensible, but people have worked on
>variations that might be worth exploring. In particular,
>the Experimental Compiler System (ECS) project at IBM hoped to
>achieve the same effect in a more extensible fashion.

I'm a fan of GNU EMACS style inlining. I don't think that your
intermediate language should even attempt to represent all of the
special hardware instructions that are likely to be useful - and it is
very frustrating when you have the operation that nobody thought of.
PS. comp.compilers has been talking about this recently.
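
The intermediate-language approach quoted above can be sketched as a
lowering pass: ABS stays a single IL operation while the optimizer works
on it, and is expanded only if the target lacks a native instruction.
This is a toy illustration, not PL.8 or ECS; the tuple format and the op
names (MOV, SKIP_IF_GE0, NEG) are invented for the example.

```python
def lower_intrinsics(code, native):
    """Expand intrinsics that the target does not implement directly.

    code:   list of (op, dest, src) tuples in a toy IL
    native: set of intrinsic op names the target has as real instructions
    """
    out = []
    for op, dest, src in code:
        if op == "ABS" and "ABS" not in native:
            # dest = src; if dest < 0: dest = -dest
            out.append(("MOV", dest, src))
            # Hypothetical op: skip the next instruction when dest >= 0.
            out.append(("SKIP_IF_GE0", None, dest))
            out.append(("NEG", dest, dest))
        else:
            out.append((op, dest, src))
    return out


program = [("ABS", "r1", "r2")]
kept = lower_intrinsics(program, native={"ABS"})
expanded = lower_intrinsics(program, native=set())
```

The extensibility complaint shows up directly here: every new special
instruction needs its own case in the pass, whereas inlined machine code
supplied by a library would not.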

Andy Glew, [get ph nameserver from]
