Re: MMX/3Dnow!/SSE/SSE2 compilers

"jacob navia" <jacob@jacob.remcomp.fr>
23 May 2002 01:27:47 -0400

          From comp.compilers


From: "jacob navia" <jacob@jacob.remcomp.fr>
Newsgroups: comp.compilers
Date: 23 May 2002 01:27:47 -0400
Organization: Wanadoo, l'internet avec France Telecom
References: 02-04-126 02-04-137 02-04-146 02-04-157 02-05-051
Keywords: arithmetic
Posted-Date: 23 May 2002 01:27:47 EDT

> Well, the biggest gain and the focus of the project was the option
> -fpmath=sse. This replaces all x87 instructions with SSE/SSE2 ones. This
> gives a performance boost even without vectorization, because it makes
> compiler optimizations a lot easier (more RISC like).
>


Well, I did JUST THAT in my compiler system (lcc-win32).


It took me at least a month, over two attempts, to redirect all FP
instructions to SSE2.


FORGET IT!


Speed dropped by 50%!!


It seems that the SSE2 instructions take much longer than the FPU ones.


Besides, you will see that all functions return their double results
in the FPU (on top of the x87 register stack).


You will have to move that result from the FPU to the stack and then
to the SSE2 registers, since there is no direct path between the SSE2
registers and the FPU.


Yes, optimizations are easier, but even using registers all the time,
it is slower than the FPU. Please do not ask me why.


Another nice feature is that the results are incompatible between SSE2
and the FPU due to the different precision used in calculations: SSE2
computes in 64 bits and the FPU in 80 bits. It is then impossible to
mix the two.
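To make the precision point concrete, here is a small C sketch. The function names are made up for illustration, and it assumes that long double maps to the 80-bit x87 format, which is true on many x86 toolchains but not all. It adds 2^-53 to 1.0 and subtracts 1.0 again, once at double precision and once at long double precision.

```c
#include <float.h>

/* 2^-53: half the gap between 1.0 and the next representable double. */
static const double HALF_ULP = 1.0 / 9007199254740992.0;

/* Nonzero when long double carries more mantissa bits than double
   (64 vs 53 for the x87 extended format). */
int extended_is_wider(void)
{
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* Evaluated at 64-bit double precision, as scalar SSE2 code would do.
   The volatile stores force rounding to double even on an x87 build.
   1.0 + 2^-53 is a tie and rounds back to 1.0, so the residue is 0. */
double residue_double(void)
{
    volatile double one = 1.0;
    volatile double sum = one + HALF_ULP;
    return sum - one;
}

/* Evaluated in long double. With a 64-bit mantissa, 1.0 + 2^-53 is
   exactly representable, so the residue survives as 2^-53. */
long double residue_extended(void)
{
    volatile long double one = 1.0L;
    volatile long double sum = one + (long double)HALF_ULP;
    return sum - one;
}
```

Where long double really is wider than double, the two functions disagree on the same arithmetic, which is the kind of mismatch you get when part of a computation runs in SSE2 and part in the FPU.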


Yet another problem is that those instructions are much longer. Be
prepared for at least 20% code bloat, which then takes longer to load.


And yet another problem is the transcendental functions. Instead of doing
        fldl -24(%ebp)
        fsin
        fstpl -24(%ebp)


You have to do this (the data is in %xmm0):
        subl $8,%esp
        movlpd %xmm0,(%esp)
        fldl (%esp)
        fsin
        fstpl (%esp)
        movlpd (%esp),%xmm0
        addl $8,%esp


It was THIS that killed my FFT benchmark.


Another nice feature is that you have to save all the xmm registers in
use before a function call, and restore them afterwards. This means
big trouble, since each register is 16 bytes. If you have just 3
registers in use before the call, it means 48 bytes of I/O on the
system bus before the call and 48 after...
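The arithmetic, as a tiny C helper. The function name and parameters are hypothetical, only there to show the calculation. 16 is the full xmm register width assumed above; 8 would be the cost per register if only the low double is stored, as the movlpd sequence earlier in this post does.

```c
#include <stddef.h>

/* Bytes written before a call (and read back after it) when
   `live_regs` xmm registers hold live values and each save moves
   `bytes_per_reg` bytes. Illustrative helper, not compiler code. */
size_t xmm_save_bytes(size_t live_regs, size_t bytes_per_reg)
{
    return live_regs * bytes_per_reg;
}
```

For example, xmm_save_bytes(3, 16) gives the 48 bytes mentioned above, and the same amount is moved again on the restore side.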

