Related articles |
---|
SPARC compiler optimisation gregw@highland.oz.au (1992-02-13) |
Re: SPARC compiler optimisation casper@fwi.uva.nl (1992-02-14) |
Re: SPARC compiler optimisation how@leland.stanford.edu (1992-02-14) |
Re: SPARC compiler optimisation ucsd!math.ucla.edu!pmontgom@uunet.uu.net (1992-02-15) |
Re: SPARC compiler optimisation grunwald@foobar.cs.colorado.edu (1992-02-22) |
Re: SPARC compiler optimisation andrew@highland.oz.au (1992-02-26) |
Re: SPARC compiler optimisation dmk@craycos.com (1992-02-27) |
Re: SPARC compiler optimisation nickh@CS.CMU.EDU (1992-02-28) |
Re: SPARC compiler optimisation nickh@CS.CMU.EDU (1992-03-02) |
Re: SPARC compiler optimisation preston@dawn.cs.rice.edu (1992-03-02) |
[1 later articles] |
Newsgroups: | comp.compilers,comp.sys.sun.misc |
From: | ucsd!math.ucla.edu!pmontgom@uunet.uu.net (Peter Montgomery) |
Keywords: | optimize, sparc |
Organization: | UCLA Mathematics Department |
References: | 92-02-062 92-02-072 |
Date: | Sat, 15 Feb 92 03:56:48 GMT |
In article 92-02-072 how@leland.stanford.edu
(Dana How) writes:
> And yes, I have NEVER seen the optimizer use [%a+%b].
>
>This is frequently a big pain for me, since i always try to write loops
>where everything is indexed by one loop variable from different bases.
>Frequently the optimizer will turn this into SEVERAL combined indices to
>avoid the imaginary ADD instruction, and then have to maintain EACH
>index...!
I like to write that way too. Reducing the number of iteratively
defined variables in our codes makes it easier to state loop invariants
and improves readability, in my opinion.
If our loops have items of different sizes per element (say we
reference an array of structures and an array of shorts with the same
index), then the compiler will need separate index registers for each
stride.
Even if a loop resembles a[i] = b[i] + c[i], where a, b, c, are
all 4 bytes/element and i increases by 1 per iteration, the optimizer may
have problems. You might expect
initialize ra, rb, rc to addresses of a[0], b[0], c[0]
initialize ri to 4*i
loop:
load r1 from [rb + ri]
load r2 from [rc + ri]
add r3 = r1 + r2
store r3 at [ra + ri]
add ri = ri + 4
compare ri and 4*(bound on i)
possibly branch back to start of loop
The instruction ordering is very restricted; in particular, nothing fits
in the branch delay slot. If the data is floating point, then we also
want more delay after the add. To overcome this, the optimizer can
replace ra by ra-4, effectively using (ra-4) + (ri+4) for the store, with
something like:
initialize ra, rb, rc to addresses of a[-1], b[0], c[0]
initialize ri to 4*i
loop:
load r1 from [rb + ri]
load r2 from [rc + ri]
add ri = ri + 4
compare ri and 4*(bound on i)
add r3 = r1 + r2
possibly branch back to start of loop
store r3 at [ra + ri] (delay slot)
But if ra is used elsewhere in the loop (e.g., a[i] = a[i] + b[i]), it may
not be easy to make the change. And suppose we unroll our source to
a[i ] = b[i ] + c[i ];
a[i+1] = b[i+1] + c[i+1];
a[i+2] = b[i+2] + c[i+2];
i += 3;
Now it is probably cheapest to assign addresses of a[i], b[i], c[i] to
separate registers, each incremented on every iteration; two alternatives
are to
(1) assign the addresses of a[0], a[1], a[2], b[0], b[1], b[2],
c[0], c[1], c[2] all to separate registers, or
(2) assign the addresses of a[0], b[0], c[0] to three registers,
and the offsets 4*i, 4*(i+1), 4*(i+2) to another three registers.
A compiler algorithm (untested) which compromises between these is
1) Whenever two different displacements are needed from
a single (base, index) pair (e.g., both a[i] and a[i+1]
referenced within a loop, or two different fields
referenced within an indexed structure), assign
base+index to a register. This also applies
if both a[i] and b[i] appear, where the
difference a-b is known at compile time and small enough
to be used as a displacement.
2) If no existing register has the right stride,
then establish a new register (e.g., both 32-bit
and 64-bit data indexed by same variable).
Prefer the element referenced last in the loop
if you have a choice of several with the same stride.
3) Otherwise compute difference in bases, and index
off the established register.
For the earlier a[i] = b[i] + c[i] example, if it chooses to
assign &a[i] to a register, then it can assign the address differences b-a
and c-a to other registers (where the difference remains in bytes or other
storage units). The code might be
initialize rba to b-a, rca to c-a
initialize rai to &a[i]
loop:
load r1 from [rba + rai]
load r2 from [rca + rai]
add rai = rai + 4
compare rai and (bound on &a[i])
add r3 = r1 + r2
possibly branch back to start of loop
store r3 at [rai - 4] (delay slot)
Yes, I have found register+register addressing useful in my SPARC
assembly code. I have seen gcc use it. But these compiler difficulties
may be one reason why it was omitted from the MIPS R2000/R3000 RISC
architecture.
--
Peter L. Montgomery Internet: pmontgom@math.ucla.edu
Department of Mathematics, UCLA, Los Angeles, CA 90024-1555 USA
--
Return to the
comp.compilers page.
Search the
comp.compilers archives again.