Re: reducible loops

preston@dawn.cs.rice.edu (Preston Briggs)
Thu, 27 Feb 1992 07:28:55 GMT

          From comp.compilers

Related articles
reducible loops preston@cs.rice.edu (1992-02-24)
Re: reducible loops rockwell@socrates.umd.edu (Raul Deluth Miller-Rockwell) (1992-02-25)
Re: reducible loops preston@dawn.cs.rice.edu (1992-02-26)
Re: reducible loops lee@dumas.rutgers.edu (1992-02-26)
Re: reducible loops glew@pdx007.intel.com (1992-02-26)
Re: reducible loops preston@dawn.cs.rice.edu (1992-02-27)
Re: reducible loops joel@decwrl.dec.com (1992-02-28)
Re: reducible loops idacrd!desj@uunet.UU.NET (1992-03-09)
Re: reducible loops preston@dawn.cs.rice.edu (1992-03-10)
| List of all articles for this month |

Newsgroups: comp.compilers
From: preston@dawn.cs.rice.edu (Preston Briggs)
Keywords: optimize
Organization: Rice University, Houston
References: <9202220021.AA28153@ucbvax.Berkeley.EDU> 92-02-125
Date: Thu, 27 Feb 1992 07:28:55 GMT

In article 92-02-125 glew@pdx007.intel.com (Andy Glew) writes:


>Duff's device is a technique for unrolling loops with minimal overhead.
> [ buggy version omitted -John]


All the stars are missing!
Here's a less tricky way to accomplish the same effect


        j = i / 8;
        switch( i % 8 ) {
case 7:
         *destptr++ = *srcptr++;
         case 6:
         *destptr++ = *srcptr++;
         case 5:
         *destptr++ = *srcptr++;
         case 4:
         *destptr++ = *srcptr++;
         case 3:
         *destptr++ = *srcptr++;
         case 2:
         *destptr++ = *srcptr++;
         case 1:
         *destptr++ = *srcptr++;
         case 0:
         *destptr++ = *srcptr++;
        }
        while (--j >= 0) {
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
         *destptr++ = *srcptr++;
        }


Same overhead, but better optimization of the loop. Even allowing the
overhead is probably fine, especially if we consider the code growth and
switch overhead.


for (j=0; j<i%8; j++)
dest[j] = src[j];
while (j < i) {
dest[j+0] = src[j+0];
dest[j+1] = src[j+1];
dest[j+2] = src[j+2];
dest[j+3] = src[j+3];
dest[j+4] = src[j+4];
dest[j+5] = src[j+5];
dest[j+6] = src[j+6];
dest[j+7] = src[j+7];
j += 8;
}


Probably runs at least as fast, and the optimizer will do nice things to
the loop (assuming there's something to do). In particular, it's clear
that the 2nd assignment (dest[j+1] = src[j+1]) always follows the first,
allowing (sometimes) common subexpression elimination, value numbering,
strength reduction, and so forth.


In the example above, it ought to be able to manage each iteration with a
single increment and test. All the array accesses should be doable with
constant indices off the single index. Alternatively, if the machine
supports free auto-increment, the optimizer could use it, as suggested by
Andy's code.


In the original Duff'd code, each assignment may be reached from outside
the loop or from the previous assignment. This choice (at every
statement) eliminates many possibilities for optimization.


Of course, with this simple a loop body, there's little to gain from
optimization after unrolling. More interesting loops have more to gain
(or lose, if Duff's device is used).


Preston Briggs
--


Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.