Re: Instruction scheduling with gcc on alpha



From: Robert.Harley@inria.fr (Robert Harley)
Newsgroups: comp.compilers
Date: 13 Jun 1997 22:05:46 -0400
Organization: I.N.R.I.A Rocquencourt
Keywords: optimize

Claus Denk (denk@obelix.cica.es) writes:
>I am just looking at the machine code created by gcc. I am interested
>in simple floating vector operations, as for example:
>
> for (i = 0; i< n; i++)
> dy[i] = da*dx[i];
>
>For pipelined architectures like the alpha, loop unrolling is
>essential. [...]


It is indeed, but neither gcc nor any other compiler will do much with
this loop, because dx and dy might overlap. If you are sure they
don't, you can qualify dx as 'const', use a 'no-aliasing' flag, etc.
If this were a critical part I would unroll by hand and stick in some
registers explicitly, like this:




/*-- scale ----------------------------------------------------------------*/


typedef unsigned long u64;  /* 64 bits on the Alpha; u64 is not defined elsewhere in this post */

static void scale(double *dy, u64 n, double da, const double *dx) {
    u64 i;
    double t0,t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14,t15;


    i = 0;


    if (n & 1) {
        dy[0] = da*dx[0];
        i = 1;
    } /* end if (n & 1) */


    if (n & 2) {
        t0 = dx[i]; t1 = dx[i+1];
        dy[i] = da*t0; dy[i+1] = da*t1;
        i += 2;
    } /* end if (n & 2) */


    if (n & 4) {
        t0 = dx[i]; t1 = dx[i+1]; t2 = dx[i+2]; t3 = dx[i+3];
        dy[i] = da*t0; dy[i+1] = da*t1; dy[i+2] = da*t2; dy[i+3] = da*t3;
        i += 4;
    } /* end if (n & 4) */


    if (n & 8) {
        t0 = dx[i ]; t1 = dx[i+1]; t2 = dx[i+2]; t3 = dx[i+3];
        t4 = dx[i+4]; t5 = dx[i+5]; t6 = dx[i+6]; t7 = dx[i+7];
        dy[i ] = da*t0; dy[i+1] = da*t1; dy[i+2] = da*t2; dy[i+3] = da*t3;
        dy[i+4] = da*t4; dy[i+5] = da*t5; dy[i+6] = da*t6; dy[i+7] = da*t7;
        i += 8;
    } /* end if (n & 8) */


    n &= ~15;
    while (i < n) {
        t0 = dx[i ]; t1 = dx[i+1]; t2 = dx[i+2]; t3 = dx[i+3];
        t4 = dx[i+4]; t5 = dx[i+5]; t6 = dx[i+6]; t7 = dx[i+7];
        t8 = dx[i+8]; t9 = dx[i+9]; t10 = dx[i+10]; t11 = dx[i+11];
        t12 = dx[i+12]; t13 = dx[i+13]; t14 = dx[i+14]; t15 = dx[i+15];
        dy[i ] = da*t0; dy[i+1] = da*t1; dy[i+2] = da*t2; dy[i+3] = da*t3;
        dy[i+4] = da*t4; dy[i+5] = da*t5; dy[i+6] = da*t6; dy[i+7] = da*t7;
        dy[i+8] = da*t8; dy[i+9] = da*t9; dy[i+10] = da*t10; dy[i+11] = da*t11;
        dy[i+12] = da*t12; dy[i+13] = da*t13; dy[i+14] = da*t14; dy[i+15] = da*t15;
        i += 16;
    } /* end while */


} /* end scale */




OK, so I went a bit overboard... a quick test suggests that this does
up to 270 Mflops, i.e. circa 2 cycles to load, multiply and store each
double, as long as the data fits in L1 or L2 cache. After that, memory
is the bottleneck.
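

[Editor's note: the aliasing obstacle mentioned above can nowadays be stated
directly to the compiler. The sketch below uses the C99 'restrict' qualifier,
which postdates this post; the function name scale_restrict is illustrative,
not from the original.]

```c
#include <stddef.h>

/* A minimal sketch: 'restrict' promises the compiler that dy and dx
   do not overlap, so it may unroll and schedule the simple loop itself
   instead of the programmer doing it by hand. */
static void scale_restrict(double *restrict dy, size_t n,
                           double da, const double *restrict dx)
{
    size_t i;
    for (i = 0; i < n; i++)
        dy[i] = da * dx[i];
}
```

With the no-aliasing promise in place, flags along the lines of
-O2 -funroll-loops let the compiler produce scheduling similar to the
hand-unrolled version.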


But if you're doing Linpack-type stuff, forget about this and unroll
daxpy instead (into chunks of eight or so, since it needs more registers)!
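

[Editor's note: a daxpy (dy[i] += da*dx[i]) unrolled by eight in the same
spirit as scale above might look like the sketch below; the name daxpy8 is
illustrative, and size_t stands in for the u64 used earlier. Register
pressure is higher than in scale because each iteration needs both dx[i]
and dy[i] live at once.]

```c
#include <stddef.h>

/* Sketch of an 8-way unrolled daxpy: dy[i] += da * dx[i]. */
static void daxpy8(double *dy, size_t n, double da, const double *dx)
{
    size_t i = 0;
    double t0, t1, t2, t3, t4, t5, t6, t7;

    /* Peel off n mod 8 elements one at a time. */
    for (; i < (n & 7); i++)
        dy[i] += da * dx[i];

    /* Main loop: eight loads of dx and dy, eight multiply-adds,
       eight stores per iteration. */
    for (; i < n; i += 8) {
        t0 = dy[i  ] + da*dx[i  ]; t1 = dy[i+1] + da*dx[i+1];
        t2 = dy[i+2] + da*dx[i+2]; t3 = dy[i+3] + da*dx[i+3];
        t4 = dy[i+4] + da*dx[i+4]; t5 = dy[i+5] + da*dx[i+5];
        t6 = dy[i+6] + da*dx[i+6]; t7 = dy[i+7] + da*dx[i+7];
        dy[i  ] = t0; dy[i+1] = t1; dy[i+2] = t2; dy[i+3] = t3;
        dy[i+4] = t4; dy[i+5] = t5; dy[i+6] = t6; dy[i+7] = t7;
    }
}
```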


-- Rob.
          .-. Robert.Harley@inria.fr .-.

