Re: SPARC compiler optimisation


Newsgroups: comp.compilers,comp.sys.sun.misc
From: ucsd!math.ucla.edu!pmontgom@uunet.uu.net (Peter Montgomery)
Keywords: optimize, sparc
Organization: UCLA Mathematics Department
References: 92-02-062 92-02-072
Date: Sat, 15 Feb 92 03:56:48 GMT

In article 92-02-072 how@leland.stanford.edu (Dana How) writes:
> And yes, I have NEVER seen the optimizer use [%a+%b].
>
> This is frequently a big pain for me, since I always try to write loops
> where everything is indexed by one loop variable from different bases.
> Frequently the optimizer will turn this into SEVERAL combined indices to
> avoid the imaginary ADD instruction, and then have to maintain EACH
> index...!


                I like to write that way too. Reducing the number of iteratively
defined variables in our codes makes it easier to state loop invariants
and improves readability, in my opinion.


                If a loop references arrays with different element sizes
(say, an array of structures and an array of shorts indexed by the same
variable), then the compiler needs a separate index register for each
stride.
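
                As a concrete (hypothetical) illustration, the loop below
walks an array of structures and an array of shorts with one index; the
byte strides differ (typically 16 versus 2 here), so a single scaled
index register cannot serve both references:


        /* Hypothetical example: two arrays with different element
           sizes share one loop index, so the byte strides differ. */
        struct item { double key; int count; };

        long total(struct item *tab, short *weight, long n)
        {
            long i, sum = 0;
            for (i = 0; i < n; i++)
                sum += tab[i].count * weight[i];
            return sum;
        }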


                Even if a loop resembles a[i] = b[i] + c[i], where a, b, c are
all 4 bytes/element and i increases by 1 per iteration, the optimizer may
have problems. You might expect something like:


                                initialize ra, rb, rc to addresses of a[0], b[0], c[0]
                                initialize ri to 4*i
                loop:
                                load r1 from [rb + ri]
                                load r2 from [rc + ri]
                                add r3 = r1 + r2
                                store r3 at [ra + ri]
                                add ri = ri + 4
                                compare ri and 4*(bound on i)
                                possibly branch back to start of loop


The instruction ordering is very restricted; in particular, nothing fits
in the branch delay slot. If the data is floating point, then we also
want more distance between the add and the dependent store. To overcome
this, the optimizer can replace ra by ra-4, effectively using
(ra-4) + (ri+4) for the store, with something like:


                                initialize ra, rb, rc to addresses of a[-1], b[0], c[0]
                                initialize ri to 4*i
                loop:
                                load r1 from [rb + ri]
                                load r2 from [rc + ri]
                                add ri = ri + 4
                                compare ri and 4*(bound on i)
                                add r3 = r1 + r2
                                possibly branch back to start of loop
                                store r3 at [ra + ri] (delay slot)


But if ra is used elsewhere in the loop (e.g., a[i] = a[i] + b[i]), it may
not be easy to make the change. And suppose we unroll our source to


                                a[i ] = b[i ] + c[i ];
                                a[i+1] = b[i+1] + c[i+1];
                                a[i+2] = b[i+2] + c[i+2];
                                i += 3;


Now it is probably cheapest to assign the addresses of a[i], b[i], c[i] to
separate registers, each incremented on every iteration; two alternatives
(the second is sketched in C after the list) are to


(1) assign the addresses of a[0], a[1], a[2], b[0], b[1], b[2],
        c[0], c[1], c[2] all to separate registers, or
(2) assign the addresses of a[0], b[0], c[0] to three registers,
        and the offsets 4*i, 4*(i+1), 4*(i+2) to another three registers.
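
                A source-level sketch of alternative (2) might look like
the (hypothetical) C below: three base pointers stay fixed while three
byte offsets each advance by 12 per iteration, so every reference is a
register+register address (the cleanup loop for a remainder is omitted):


        /* Assumes 4-byte int, as in the discussion above. */
        void vadd3(int *a, int *b, int *c, long i, long n)
        {
            long o0 = 4*i, o1 = 4*(i+1), o2 = 4*(i+2);   /* byte offsets */
            char *pa = (char *)a, *pb = (char *)b, *pc = (char *)c;

            while (i + 2 < n) {
                *(int *)(pa+o0) = *(int *)(pb+o0) + *(int *)(pc+o0);
                *(int *)(pa+o1) = *(int *)(pb+o1) + *(int *)(pc+o1);
                *(int *)(pa+o2) = *(int *)(pb+o2) + *(int *)(pc+o2);
                o0 += 12;  o1 += 12;  o2 += 12;   /* 3 elements = 12 bytes */
                i += 3;
            }
        }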


                A compiler algorithm (untested) which compromises between
these is (a rough C sketch follows the list):


                1) Whenever two different displacements are needed from
                        a single (base, index) pair (e.g., both a[i] and a[i+1]
                        referenced within a loop, or two different fields
                        referenced within an indexed structure), assign
                        base+index to a register. This also applies
                        if both a[i] and b[i] appear, where the
                        difference a-b is known at compile time and small enough
                        to be used as a displacement.


                2) If no existing register has the right stride,
                        then establish a new register (e.g., both 32-bit
                        and 64-bit data indexed by the same variable).
                        Prefer the element referenced last in the loop
                        if you have a choice of several with the same stride.


                3) Otherwise compute the difference in bases, and index
                        off the established register.
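
                An untested C sketch of that heuristic follows; every name
in it (Ref, Reg, base_diff, and so on) is invented for illustration, and a
real compiler would of course work on its own intermediate representation:


        #include <stdio.h>
        #include <stddef.h>

        typedef struct {       /* one memory reference inside the loop     */
            int  base;         /* symbol id of the array or structure base */
            int  index;        /* id of the loop induction variable        */
            long disp;         /* compile-time displacement, in bytes      */
            long stride;       /* bytes the address advances per iteration */
        } Ref;

        typedef struct {       /* one address register carried by the loop */
            int  base, index;
            long stride;
        } Reg;

        /* SPARC displacements are signed 13-bit immediates. */
        static int fits_disp(long d) { return d >= -4096 && d <= 4095; }

        /* Stub: is the byte difference of two bases a compile-time
           constant?  Only trivially so in this sketch. */
        static int base_diff(int a, int b, long *d)
        {
            if (a == b) { *d = 0; return 1; }
            return 0;
        }

        void assign(const Ref *refs, size_t n)
        {
            Reg regs[16];
            size_t nregs = 0, i, k;

            for (i = 0; i < n; i++) {
                const Ref *r = &refs[i];
                long d = 0;

                /* Rules 1 and 3: reuse a register holding base+index when
                   the stride matches and the displacement (plus any known
                   base difference) is small enough to encode. */
                for (k = 0; k < nregs; k++)
                    if (regs[k].index == r->index &&
                        regs[k].stride == r->stride &&
                        base_diff(r->base, regs[k].base, &d) &&
                        fits_disp(d + r->disp))
                        break;
                if (k < nregs) {
                    printf("ref %lu: reuse register %lu + displacement\n",
                           (unsigned long)i, (unsigned long)k);
                    continue;
                }

                /* Rule 2: no register has the right stride; establish one.
                   (The "prefer the reference made last" tie-break is left
                   out, since references are visited in program order.) */
                if (nregs < 16) {
                    regs[nregs].base   = r->base;
                    regs[nregs].index  = r->index;
                    regs[nregs].stride = r->stride;
                    printf("ref %lu: establish register %lu\n",
                           (unsigned long)i, (unsigned long)nregs);
                    nregs++;
                }
            }
        }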


                For the earlier a[i] = b[i] + c[i] example, if it chooses to
assign &a[i] to a register, then it can assign the address differences b-a
and c-a to other registers (with the differences measured in bytes or other
storage units). The code might be


                                initialize rba to b-a, rca to c-a
                                initialize rai to &a[i]
                loop:
                                load r1 from [rba + rai]
                                load r2 from [rca + rai]
                                add rai = rai + 4
                                compare rai and (bound on &a[i])
                                add r3 = r1 + r2
                                possibly branch back to start of loop
                                store r3 at [rai - 4] (delay slot)
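
                At the source level, the same scheme looks roughly like the
(hypothetical) C below. Forming b-a and c-a across unrelated arrays is
formally undefined in ISO C; the code is only meant to mirror what the
compiler does with the addresses (again assuming 4-byte ints):


        void vadd(int *a, int *b, int *c, long n)
        {
            char *pa  = (char *)a;             /* rai, +4 bytes per trip */
            long  ba  = (char *)b - (char *)a; /* rba, constant          */
            long  ca  = (char *)c - (char *)a; /* rca, constant          */
            char *end = (char *)(a + n);

            while (pa < end) {
                int r1 = *(int *)(pa + ba);    /* load via [rba + rai]   */
                int r2 = *(int *)(pa + ca);    /* load via [rca + rai]   */
                *(int *)pa = r1 + r2;          /* store a[i]             */
                pa += 4;
            }
        }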


                Yes, I have found register+register addressing useful in my SPARC
assembly code. I have seen gcc use it. But these compiler difficulties
may be one reason why it was omitted from the MIPS R2000/R3000 RISC
architecture.
--
                Peter L. Montgomery Internet: pmontgom@math.ucla.edu
                Department of Mathematics, UCLA, Los Angeles, CA 90024-1555 USA