Newsgroups: comp.parallel,comp.arch,comp.compilers
From: davidm@Rational.COM (David Moore)
Status: RO
Originator: rmuise@dragon.acadiau.ca
Organization: Rational Software Corporation
X-Newsreader: NN version 6.5.0 #4 (NOV)
References: <3aqv5k$e27@monalisa.usc.edu>
Date: Wed, 23 Nov 1994 22:16:59 GMT

zxu@monalisa.usc.edu (Zhiwei Xu) writes:
>Can any one explain why a C program using single precision (float) is slower
>that the same code using double precision (double)? Please try the following
>code for computing pi. I have tried it on IBM RS6000/250, IBM SP2, Sun4, and
>Sun SS20, and got the same strange timing.
This is pretty easy to see if you output the assembly code with -S
and lay the code out side by side for the tight loop (modulo some
register renumbering). The second column contains only the instructions
that are extra or different for the single precision case:
        double pi,w;                float pi,w
        =============               ==========
L77003:
        st      %i5,[%sp+LP37+8]
        ld      [%sp+LP37+8],%f15
        fitod   %f15,%f16
        ldd     [%fp-88],%f18
        ldd     [%fp-40],%f30
        fsubd   %f16,%f18,%f16
        ldd     [%fp-96],%f28
        fmuld   %f16,%f30,%f30
        ldd     [%fp-80],%f2
        fmuld   %f30,%f30,%f26
        ldd     [%fp-8],%f6         ld      [%fp-4],%f8
                                    fstod   %f8,%f6
        faddd   %f28,%f26,%f28
        inc     %i5
        cmp     %i5,%i4
        fdivd   %f2,%f28,%f2
        fmovs   %f2,%f2 ! [internal]
        faddd   %f6,%f2,%f6
                                    fdtos   %f6,%f13
        ble     L77003
        std     %f6,[%fp-8]         st      %f13,[%fp-4]
The tight loop was:
    for(i=1;i<=N;i=i+1) {
        local = ( ((double) i) - 0.5 ) * w ;
        pi = pi + 4.0 / ( 1.0 + local * local ) ;
    }
Of course, better optimization would remove the speed difference in
this loop, but double precision would still be faster in general,
because single precision values have to be converted to double before
they can be manipulated. Here that shows up as the fstod after each
load of pi and the fdtos before each store.
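
For anyone who wants to repeat the comparison, here is a minimal sketch
of the two variants as functions. The original post shows only the
loop, so the function names, the strip width w = 1/n, the final scaling
by w, and the choice of double for "local" are all my assumptions. A
current optimizing compiler may well produce the same speed for both.

```c
/* Midpoint-rule estimate of pi with a double accumulator.
   Assumption: w = 1/n and the sum is scaled by w at the end. */
double pi_double_acc(int n)
{
    double w = 1.0 / n;
    double pi = 0.0;
    double local;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (((double) i) - 0.5) * w;
        pi = pi + 4.0 / (1.0 + local * local);
    }
    return pi * w;
}

/* Same loop with float pi and w. The constants 4.0 and 1.0 are
   doubles, so the arithmetic is still carried out in double: pi is
   widened each time it is read and narrowed each time it is written. */
float pi_float_acc(int n)
{
    float w = 1.0f / n;
    float pi = 0.0f;
    double local;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (((double) i) - 0.5) * w;
        pi = pi + 4.0 / (1.0 + local * local);
    }
    return pi * w;
}
```

Compiling this file with -S and comparing the two loops should show
whether your compiler emits the extra conversions.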
Notice too the first three instructions of this loop:
st %i5,[%sp+LP37+8]
ld [%sp+LP37+8],%f15
fitod %f15,%f16
These are doing the (double) i. The SPARC chip is similar to many other
RISC chips in that there is no direct path from the integer registers
to the floating point register bank - you have to go via memory. Once
again, the optimizer could have done a strength reduction here, but the
programmer can help by keeping a double temporary instead of converting
i to floating point on every iteration:
    double di=1.0;
    for(i=1;i<=N;i=i+1) {
        local = ( di - 0.5 ) * w ;
        pi = pi + 4.0 / ( 1.0 + local * local ) ;
        di += 1.0;
    }
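
Wrapped up as a complete function, the double-temporary loop might look
like the sketch below. The function name, w = 1/n and the final scaling
by w are assumptions, since only the loop appears above. Adding 1.0 to
a double is exact for any count that fits in an int, so di always holds
the same value that (double) i would.

```c
/* Midpoint-rule estimate of pi that keeps a double copy of the loop
   counter, so no integer-to-double conversion is needed in the loop.
   Assumption: w = 1/n and the sum is scaled by w at the end. */
double pi_double_counter(int n)
{
    double w = 1.0 / n;
    double pi = 0.0;
    double local;
    double di = 1.0;    /* shadows i as a double */
    int i;

    for (i = 1; i <= n; i = i + 1) {
        local = (di - 0.5) * w;
        pi = pi + 4.0 / (1.0 + local * local);
        di += 1.0;
    }
    return pi * w;
}
```

The integer i is kept only to control the loop; nothing in the body
reads it any more.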
Of course, one immediately sees that one can use di=0.5 instead and
that the squaring of "local" can be replaced by some adds which can
be scheduled so as to keep the pipeline more full, etc, but let's
not get carried away.
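
For the curious, here is one way the "some adds" idea could be read; it
is a sketch, not necessarily what a compiler or the original program
would do. With x = (i - 0.5) * w, successive squares differ by
2*w*x + w*w, and those differences themselves grow by the constant
2*w*w, so x*x can be carried along with two additions per iteration.
The function and variable names are hypothetical, as are w = 1/n and
the final scaling by w.

```c
/* Midpoint-rule estimate of pi with local*local replaced by running
   sums (second-order finite differences).
   Assumption: w = 1/n and the sum is scaled by w at the end. */
double pi_incremental(int n)
{
    double w = 1.0 / n;
    double sq = 0.25 * w * w;       /* (0.5*w)^2, the first local*local */
    double step = 2.0 * w * w;      /* (1.5*w)^2 - (0.5*w)^2 */
    double step_inc = 2.0 * w * w;  /* constant second difference */
    double sum = 0.0;
    int i;

    for (i = 1; i <= n; i = i + 1) {
        sum = sum + 4.0 / (1.0 + sq);
        sq += step;         /* next value of local*local */
        step += step_inc;   /* next first difference */
    }
    return sum * w;
}
```

The two adds do not depend on the divide, so they can be scheduled
alongside it. The price is that rounding error in sq now accumulates
from one iteration to the next instead of being recomputed fresh.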
DISCLAIMER: The compiler I used is the cc that came with my machine.
It is certainly out of date. Nor did I attempt to get the best code
possible out of it. The code is an example of the sort of code a compiler
might produce rather than the best code that can be obtained with a
SunOS compiler. The code as presented is fairly typical of the code
that will be produced by compilers on many machines - trying to squeeze
the last drop of optimization out of the Sun compiler would not
have served the didactic purpose.