Duff's device

glew@pdx007.intel.com (Andy Glew)
Thu, 27 Feb 1992 18:55:07 GMT

          From comp.compilers

Related articles
Re: reducible loops glew@pdx007.intel.com (1992-02-26)
Duff's device glew@pdx007.intel.com (1992-02-27)
| List of all articles for this month |

Newsgroups: comp.compilers
From: glew@pdx007.intel.com (Andy Glew)
Keywords: optimize, history
Organization: Intel Corp, Hillsboro, Oregon
References: <9202220021.AA28153@ucbvax.Berkeley.EDU> 92-02-125
Date: Thu, 27 Feb 1992 18:55:07 GMT

Alan Watson <alan@uwast.astro.wisc.edu> corrects me about Duff's devive.
Since I've already received several requests for a reference to Duff's
device, I provide it here.

    I believe your post on Duff's device is in error, and I enclose the
    original and one of Duff's comments on its use. Note that Duff's
    device is not a block-to-block copy, but a block-to-location copy. I
    have not submitted a correction to comp.compilers.

(1) Thanks for the original. I'll keep it in my files. I rather thought
that I had the code a bit wrong.

(2) Maybe Tom Duff only originated this as a block-to-location copy, but
I've used it for block to block copies, and to implement fast checksums,
and I've seen other people use it for vector operations. I don't think
it's unreasonable to give Tom Duff credit for any use of the switch into
loop body construct.

    >From research!ucbvax!dagobah!td Sun Nov 13 07:35:46 1983
    Received: by ucbvax.ARPA (4.16/4.13)
    Received: by dagobah.LFL (4.6/4.6b)
    Date: Thu, 10 Nov 83 17:57:56 PST
    From: ucbvax!dagobah!td (Tom Duff)
    Message-Id: <8311110157.AA01034@dagobah.LFL>
    To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr,

    Consider the following routine, abstracted from code which copies an
    array of shorts into the Programmed IO data register of an Evans &
    Sutherland Picture System II:

     send(to, from, count)
     register short *to, *from;
     register count;
     *to = *from++;

    (Obviously, this fails if the count is zero.)
    The VAX C compiler compiles the loop into 2 instructions (a movw and
    a sobleq, I think.) As it turns out, this loop was the bottleneck in
    a real-time animation playback program which ran too slowly by about 50%.
    The standard way to get more speed out of something like this is to unwind
    the loop a few times, decreasing the number of sobleqs. When you do that,
    you wind up with a leftover partial loop. I usually handle this in C with
    a switch that indexes a list of copies of the original loop body. Of
    course, if I were writing assembly language code, I'd just jump into the
    middle of the unwound loop to deal with the leftovers. Thinking about this
    yesterday, the following implementation occurred to me:

     send(to, from, count)
     register short *to, *from;
     register count;
     register n=(count+7)/8;
     case 0: do{ *to = *from++;
     case 7: *to = *from++;
     case 6: *to = *from++;
     case 5: *to = *from++;
     case 4: *to = *from++;
     case 3: *to = *from++;
     case 2: *to = *from++;
     case 1: *to = *from++;

    Disgusting, no? But it compiles and runs just fine. I feel a combination
    of pride and revulsion at this discovery. If no one's thought of it before,
    I think I'll name it after myself.

    It amazes me that after 10 years of writing C there are still little corners
    that I haven't explored fully. (Actually, I have another revolting way to
    use switches to implement interrupt driven state machines but it's too
    horrid to go into.)

    Many people (even bwk?) have said that the worst feature of C is that
    switches don't break automatically before each case label. This code forms
    some sort of argument in that debate, but I'm not sure whether it's for or

     yrs trly

    From: td@alice.UUCP (Tom Duff)
    Date: 29 Aug 88 20:33:51 GMT
    Transformations like this can only be justified by measuring the
    resulting code. Be careful when you use this thing that you don't
    unwind the loop so much that you overflow your machine's instruction
    cache. Don't try to be smarter than an over-clever C compiler that
    recognizes loops that implement block move or block clear and compiles
    them into machine idioms.


Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Pkwy,
Hillsboro, Oregon 97124-6497

Post a followup to this message

Return to the comp.compilers page.
Search the comp.compilers archives again.