Re: Why is using single-precision slower than using double-precision

davidm@Rational.COM (David Moore)
Wed, 23 Nov 1994 22:16:59 GMT

          From comp.compilers


Newsgroups: comp.parallel,comp.arch,comp.compilers
From: davidm@Rational.COM (David Moore)
Status: RO
Originator: rmuise@dragon.acadiau.ca
Organization: Rational Software Corporation
X-Newsreader: NN version 6.5.0 #4 (NOV)
References: <3aqv5k$e27@monalisa.usc.edu>
Date: Wed, 23 Nov 1994 22:16:59 GMT

zxu@monalisa.usc.edu (Zhiwei Xu) writes:


>Can any one explain why a C program using single precision (float) is slower
>that the same code using double precision (double)? Please try the following
>code for computing pi. I have tried it on IBM RS6000/250, IBM SP2, Sun4, and
>Sun SS20, and got the same strange timing.


This is pretty easy to see if you output the assembly code with -S
and lay the code out side by side for the tight loop (modulo some
register renumbering). The second column contains only the instructions
that are extra or different for the single precision case:


        double pi,w;                    float pi,w
        =============                   ==========
L77003:
        st %i5,[%sp+LP37+8]
        ld [%sp+LP37+8],%f15
        fitod %f15,%f16
        ldd [%fp-88],%f18
        ldd [%fp-40],%f30
        fsubd %f16,%f18,%f16
        ldd [%fp-96],%f28
        fmuld %f16,%f30,%f30
        ldd [%fp-80],%f2
        fmuld %f30,%f30,%f26
        ldd [%fp-8],%f6                 ld [%fp-4],%f8
                                        fstod %f8,%f6
        faddd %f28,%f26,%f28
        inc %i5
        cmp %i5,%i4
        fdivd %f2,%f28,%f2
        fmovs %f2,%f2 ! [internal]
        faddd %f6,%f2,%f6
                                        fdtos %f6,%f13
        ble L77003
        std %f6,[%fp-8]                 st %f13,[%fp-4]


The tight loop was:


    for(i=1;i<=N;i=i+1) {
        local = ( ((double) i) - 0.5 ) * w ;
        pi = pi + 4.0 / ( 1.0 + local * local ) ;
    }


Of course, better optimization would remove the speed difference in this
particular loop, but double precision would still in general be faster,
because single precision values have to be converted to double before
they can be manipulated. In the listing above that conversion is the
extra fstod after the load of pi and the extra fdtos before the store.
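
To make the comparison concrete, here is a minimal sketch of the two
variants in C. The original program was not posted in full, so the
function names, the parameter n, the choice w = 1/n and the final
multiplication by w are my assumptions, not the poster's code. The only
difference between the two functions is the type of pi and w; because
the constants 4.0 and 1.0 are double, the float version has to widen pi
on every iteration and narrow the sum again before storing it.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi with float pi and w (hypothetical name). */
float pi_float(int n)
{
    assert(n > 0);
    float w = 1.0f / (float)n;
    float pi = 0.0f;
    double local;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (((double)i) - 0.5) * w;
        /* pi is widened to double for the add, then narrowed on the store */
        pi = pi + 4.0 / (1.0 + local * local);
    }
    return pi * w;
}

/* Same loop with double pi and w: no widening or narrowing is needed. */
double pi_double(int n)
{
    assert(n > 0);
    double w = 1.0 / (double)n;
    double pi = 0.0;
    double local;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (((double)i) - 0.5) * w;
        pi = pi + 4.0 / (1.0 + local * local);
    }
    return pi * w;
}
```

Compiling both with -S and comparing the loop bodies should show whether
your compiler emits the extra conversions for the float case.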


Notice too the first three instructions of this loop:


        st %i5,[%sp+LP37+8]
        ld [%sp+LP37+8],%f15
        fitod %f15,%f16


These perform the (double) i conversion. The SPARC chip is similar to
many other RISC chips in that there is no direct path from the integer
registers to the floating-point register bank; you have to go via memory.
Once again, the optimizer could have applied strength reduction here, but
the programmer can help by keeping a double temporary in step with i
instead of converting i on every iteration:


    double di=1.0;
    for(i=1;i<=N;i=i+1) {
        local = ( di - 0.5 ) * w ;
        pi = pi + 4.0 / ( 1.0 + local * local ) ;
        di += 1.0;
    }
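
As a self-contained sketch of that rewrite, the two functions below
compute the same sum, one converting i on every iteration and one
carrying the double temporary di. The function names, the parameter n
and the scaling by w = 1/n are assumptions of mine for illustration.
Since di only ever holds small whole numbers, di - 0.5 is the same value
as ((double) i) - 0.5, so the two results should agree.

```c
#include <assert.h>

/* Converts i to double on every iteration (hypothetical name). */
double pi_convert(int n)
{
    assert(n > 0);
    double w = 1.0 / (double)n;
    double pi = 0.0;
    double local;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (((double)i) - 0.5) * w;
        pi = pi + 4.0 / (1.0 + local * local);
    }
    return pi * w;
}

/* Keeps a double copy of the loop counter, so the loop body needs no
   integer-to-floating-point conversion (hypothetical name). */
double pi_temporary(int n)
{
    assert(n > 0);
    double w = 1.0 / (double)n;
    double pi = 0.0;
    double local;
    double di = 1.0;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (di - 0.5) * w;
        pi = pi + 4.0 / (1.0 + local * local);
        di += 1.0;
    }
    return pi * w;
}
```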


Of course, one immediately sees that one can use di=0.5 instead and
that the squaring of "local" can be replaced by some adds which can
be scheduled so as to keep the pipeline more full, etc, but let's
not get carried away.
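
For anyone who does want to get carried away, here is one possible
reading of "replaced by some adds", as a sketch only. Writing
x = (k + 0.5) * w for k = 0, 1, ..., successive squares differ by
(2k + 2) * w * w, and that difference itself grows by the constant
2 * w * w, so the square can be carried along with two adds per
iteration. The names pi_adds, sq, d and dd are mine, as is the scaling
by w = 1/n. Note that the running square picks up slightly different
rounding than a direct multiply would.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi in which local*local is maintained
   incrementally by additions (hypothetical name and variables). */
double pi_adds(int n)
{
    assert(n > 0);
    double w = 1.0 / (double)n;
    double sq = 0.25 * w * w;   /* (0.5 * w) squared: first midpoint */
    double d = 2.0 * w * w;     /* difference to the next square */
    double dd = 2.0 * w * w;    /* constant second difference */
    double pi = 0.0;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        pi = pi + 4.0 / (1.0 + sq);
        sq += d;                /* advance to the next midpoint squared */
        d += dd;
    }
    return pi * w;
}
```

The two adds do not depend on the divide, which is what gives a
scheduler something to overlap with it.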


DISCLAIMER: The compiler I used is the cc that came with my machine.
It is certainly out of date. Nor did I attempt to get the best code
possible out of it. The code is an example of the sort of code a compiler
might produce rather than the best code that can be obtained with a
SunOS compiler. The code as presented is fairly typical of the code
that will be produced by compilers on many machines - trying to squeeze
the last drop of optimization out of the Sun compiler would not
have served the didactic purpose.






