From: | "jacob navia" <jacob@jacob.remcomp.fr> |
Newsgroups: | comp.compilers |
Date: | 23 May 2002 01:27:47 -0400 |
Organization: | Wanadoo, l'internet avec France Telecom |
References: | 02-04-126 02-04-137 02-04-146 02-04-157 02-05-051 |
Keywords: | arithmetic |
Posted-Date: | 23 May 2002 01:27:47 EDT |
> Well, the biggest gain and the focus of the project was the option
> -fpmath=sse. This replaces all x87 instructions with SSE/SSE2 ones. This
> gives a performance boost even without vectorization, because it makes
> compiler optimizations a lot easier (more RISC like).
>
Well, I did JUST THAT in my compiler system (lcc-win32).
It took me at least a month, over two attempts, to reroute all FP
instructions to SSE2.
FORGET IT!
Speed dropped by 50%!!
It seems that the SSE2 instructions take much longer than the FPU ones.
Besides, all functions return their double results in the FPU
(in st(0)), so after every call you have to move the result from
the FPU to the stack and then into an SSE2 register, since there is
no direct path between the SSE2 registers and the FPU.
Yes, optimizations are easier, but even using registers all the time,
it is slower than the FPU. Please do not ask me why.
Another nice feature is that results are incompatible between SSE2
and the FPU because of the different precision used in calculations:
SSE2 works on 64-bit doubles, while the FPU keeps 80-bit extended
values in its registers. You cannot mix the two and expect the same
results.
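A minimal C sketch of that precision mismatch, assuming a target where long double is the 80-bit x87 extended format (true for GCC on x86, not for every compiler). The function names are made up for illustration, and volatile is used only to force each intermediate to be rounded to its declared type.

```c
#include <float.h>

/* Nonzero when long double carries more mantissa bits than double
   (64 vs 53 on x87). Where this is 0 the two functions below agree. */
int long_double_is_wider(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* Add a tiny term to 1.0 and subtract 1.0 again, rounding every
   intermediate to a 64-bit double, as scalar SSE2 code would. */
double residue_double(double tiny)
{
    volatile double sum = 1.0 + tiny;   /* rounded to 53 mantissa bits */
    volatile double diff = sum - 1.0;
    return diff;
}

/* The same computation with long double intermediates, standing in
   for values that stay in the x87 registers (assumption: 80-bit). */
long double residue_extended(long double tiny)
{
    volatile long double sum = 1.0L + tiny;
    volatile long double diff = sum - 1.0L;
    return diff;
}
```

With tiny = 0x1p-60, the double version returns 0.0, because 1 + 2^-60 needs 61 significant bits and a double has 53. Where long double is the 80-bit format the extended version returns 2^-60 unchanged, so the same source expression gives two different answers depending on which unit evaluated it.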
Yet another problem is that the SSE2 instructions are much longer. Be
prepared for at least 20% code bloat, which then takes longer to load.
And yet another problem is the transcendental functions. Instead of doing
    fldl -24(%ebp)
    fsin
    fstpl -24(%ebp)
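You have to do this (the value is in %xmm0):
    subl $8,%esp          # make room on the stack
    movlpd %xmm0,(%esp)   # spill the double from SSE2 to memory
    fldl (%esp)           # reload it into the FPU
    fsin
    fstpl (%esp)          # store the result back to memory
    movlpd (%esp),%xmm0   # and load it into the SSE2 register
    addl $8,%esp          # release the stack slot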
It was THIS that killed my FFT benchmark.
Another nice feature is that you have to save all the XMM registers
in use before a function call and restore them afterwards. This means
big trouble, since each register is 16 bytes. If you have just 3
registers live across the call, that is 48 bytes of memory traffic
before the call and 48 after...
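The arithmetic behind that figure, as a small C sketch. The names are hypothetical. The 8-byte case assumes the compiler knows that only the low half of each register holds a live scalar double and saves it with a 64-bit store such as movsd; that is an assumption for comparison, not something lcc-win32 is known to do.

```c
#include <stddef.h>

enum {
    XMM_BYTES = 16,           /* full XMM register */
    SCALAR_DOUBLE_BYTES = 8   /* low half only (assumed narrower save) */
};

/* Bytes stored before a call plus bytes reloaded after it when
   live_regs registers are saved at bytes_per_reg bytes each. */
size_t spill_traffic(size_t live_regs, size_t bytes_per_reg)
{
    size_t saved = live_regs * bytes_per_reg;     /* stores before the call */
    size_t restored = live_regs * bytes_per_reg;  /* loads after the call */
    return saved + restored;
}
```

For 3 live registers, spill_traffic(3, XMM_BYTES) is 96 bytes per call (48 out, 48 back), and spill_traffic(3, SCALAR_DOUBLE_BYTES) is 48 if the narrower save is possible.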