Related articles:
  Instruction scheduling with gcc on alpha  denk@obelix.cica.es (1997-05-13)
  Re: Instruction scheduling with gcc on alpha  jch@hazel.pwd.hp.com (John Haxby) (1997-05-22)
  Re: Instruction scheduling with gcc on alpha  Robert.Harley@inria.fr (1997-06-13)
  Re: Instruction scheduling: loop unrolling  jacob@jacob.remcomp.fr (1997-06-15)
  Re: Instruction scheduling with gcc on alpha  toon@moene.indiv.nluug.nl (Toon Moene) (1997-06-24)

From: Robert.Harley@inria.fr (Robert Harley)
Newsgroups: comp.compilers
Date: 13 Jun 1997 22:05:46 -0400
Organization: I.N.R.I.A Rocquencourt
Keywords: optimize
Claus Denk (denk@obelix.cica.es) writes:
>I am just looking at the machine code created by gcc. I am interested
>in simple floating vector operations, as for example:
>
> for (i = 0; i< n; i++)
> dy[i] = da*dx[i];
>
>For pipelined architectures like the alpha, loop unrolling is
>essential. [...]
It is indeed, but neither gcc nor any other compiler will do much with
this loop because dx and dy might overlap. If you are sure they don't,
you can qualify dx as 'const', use a 'no-aliasing' flag, etc. If this
were a critical part, I would unroll by hand and load into registers
explicitly, like this:
/*-- scale ----------------------------------------------------------------*/
typedef unsigned long u64;  /* 64-bit unsigned; long is 64 bits on Alpha */

static void scale(double *dy, u64 n, double da, const double *dx)
{
  u64 i;
  double t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14,t15;

  /* Peel off the n mod 16 leading elements, one bit of n at a time. */
  i = 0;
  if (n & 1) {
    dy[0] = da*dx[0];
    i = 1;
  } /* end if (n & 1) */
  if (n & 2) {
    t0 = dx[i]; t1 = dx[i+1];
    dy[i] = da*t0; dy[i+1] = da*t1;
    i += 2;
  } /* end if (n & 2) */
  if (n & 4) {
    t0 = dx[i]; t1 = dx[i+1]; t2 = dx[i+2]; t3 = dx[i+3];
    dy[i] = da*t0; dy[i+1] = da*t1; dy[i+2] = da*t2; dy[i+3] = da*t3;
    i += 4;
  } /* end if (n & 4) */
  if (n & 8) {
    t0 = dx[i  ]; t1 = dx[i+1]; t2 = dx[i+2]; t3 = dx[i+3];
    t4 = dx[i+4]; t5 = dx[i+5]; t6 = dx[i+6]; t7 = dx[i+7];
    dy[i  ] = da*t0; dy[i+1] = da*t1; dy[i+2] = da*t2; dy[i+3] = da*t3;
    dy[i+4] = da*t4; dy[i+5] = da*t5; dy[i+6] = da*t6; dy[i+7] = da*t7;
    i += 8;
  } /* end if (n & 8) */
  /* The remaining n & ~15 elements go sixteen at a time. */
  n &= ~15;
  while (i < n) {
    t0  = dx[i   ]; t1  = dx[i+1 ]; t2  = dx[i+2 ]; t3  = dx[i+3 ];
    t4  = dx[i+4 ]; t5  = dx[i+5 ]; t6  = dx[i+6 ]; t7  = dx[i+7 ];
    t8  = dx[i+8 ]; t9  = dx[i+9 ]; t10 = dx[i+10]; t11 = dx[i+11];
    t12 = dx[i+12]; t13 = dx[i+13]; t14 = dx[i+14]; t15 = dx[i+15];
    dy[i   ] = da*t0;  dy[i+1 ] = da*t1;  dy[i+2 ] = da*t2;  dy[i+3 ] = da*t3;
    dy[i+4 ] = da*t4;  dy[i+5 ] = da*t5;  dy[i+6 ] = da*t6;  dy[i+7 ] = da*t7;
    dy[i+8 ] = da*t8;  dy[i+9 ] = da*t9;  dy[i+10] = da*t10; dy[i+11] = da*t11;
    dy[i+12] = da*t12; dy[i+13] = da*t13; dy[i+14] = da*t14; dy[i+15] = da*t15;
    i += 16;
  } /* end while */
} /* end scale */
OK, so I went a bit overboard... a quick test suggests that this does
up to 270 Mflops, i.e., circa 2 cycles to load-multiply-store each
double, as long as the data fits in L1 or L2 cache. After that, memory
is the bottleneck.
But if you're doing Linpack-type stuff, forget about this and unroll
daxpy instead (in chunks of eight or so, since it needs more registers)!
-- Rob.
.-. Robert.Harley@inria.fr .-.