Re: vectorization in icc

"Kimmo Fredriksson" <kfredrik@saippua.cs.Helsinki.FI>
7 Dec 2002 20:01:17 -0500

          From comp.compilers


From: "Kimmo Fredriksson" <kfredrik@saippua.cs.Helsinki.FI>
Newsgroups: comp.compilers
Date: 7 Dec 2002 20:01:17 -0500
Organization: University of Helsinki
References: 02-11-173 02-12-006 02-12-038
Keywords: parallel
Posted-Date: 07 Dec 2002 20:01:17 EST

> You did not copy and paste the full context of the loop and the resulting
> assembly code, so that I am unable to determine if all arrays are aligned at


Okay, here are some more details. The array t and its length n are
parameters (char * t, int n), and the rest are local variables declared as:


_declspec(align(16)) char b[ 256 ][ 16 ];
_declspec(align(16)) char * B;
_declspec(align(16)) char d[ 16 ];
_declspec(align(16)) char dm[ 16 ];
_declspec(align(16)) char mm[ 16 ];


int i, j, m;


// some preprocessing to initialize the locals, which should not be relevant
// to copy/paste here...
// the actual code:


for( i = 0; i < n; i++ )
{
    B = b[ t[ i ]]; // does not vectorize without this
                    // is this the performance problem?

    #pragma ivdep
    #pragma vector aligned // this speeds up the code somewhat
    for( j = 0; j < 16; j++ )
    {
        d[ j ] = d[ j ] + d[ j ];
        d[ j ] = d[ j ] | B[ j ];

        dm[ j ] = d[ j ] & mm[ j ];
    }

    // the following could be vectorized also, but the
    // result is really slow:
    #pragma novector
    for( j = 0; j < 16; j++ ) if( !dm[ j ] ) m++;
}




This compiles to the following (sorry for the AT&T syntax):


..B3.13:                        # Preds ..B3.12
        testl     %ebp, %ebp                    #169.2
        jle       ..B3.19       # Prob 2%       #169.2
                                # LOE ebp esi edi ah dh ch
..B3.14:                        # Preds ..B3.13
        movdqa    4656(%esp), %xmm1             #178.23
        movdqa    4672(%esp), %xmm0             #181.23
        addl      %esi, %ebp                    #181.23
                                # LOE ebp esi edi xmm0 xmm1
..B3.15:                        # Preds ..B3.17 ..B3.14
        movzbl    (%esi), %ecx                  #171.10
        addl      %ecx, %ecx                    #171.10
        paddb     %xmm1, %xmm1                  #178.14
        por       48(%esp,%ecx,8), %xmm1        #179.14
        movdqa    %xmm1, %xmm3                  #181.14
        xorl      %ecx, %ecx                    #184.8
        pand      %xmm0, %xmm3                  #181.14
        movdqa    %xmm3, 4688(%esp)             #181.4
        .align    4,0x90
                                # LOE ecx ebp esi edi xmm0 xmm1
..B3.16:                        # Preds ..B3.16 ..B3.15
        movsbl    4688(%esp,%ecx), %eax         #184.34
        lea       1(%edi), %edx                 #184.44
        testl     %eax, %eax                    #184.44
        jne       .L3           # Prob 50%      #184.44
        movl      %edx, %edi                    #184.44
.L3:                            #
        addl      $1, %ecx                      #184.23
        cmpl      $16, %ecx                     #184.3
        jl        ..B3.16       # Prob 93%      #184.3
                                # LOE ecx ebp esi edi xmm0 xmm1
..B3.17:                        # Preds ..B3.16
        addl      $1, %esi                      #169.21
        cmpl      %ebp, %esi                    #169.2
        jb        ..B3.15       # Prob 93%      #169.2
                                # LOE ebp esi edi xmm0 xmm1
..B3.19:                        # Preds ..B3.17 ..B3.13




The compiler is icc 7.0 for Linux, and the machine is a 2 GHz Pentium 4.
The code runs in 3.61 seconds. If I add #pragma novector to the first
inner loop too, the code runs in 3.18 seconds, so the vectorized version
is actually the slower one.
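
For comparison, here is a minimal sketch of how the whole loop body,
including the zero-byte count, could be written by hand with SSE2
intrinsics so that dm never has to be stored to memory and re-read one
byte at a time. This is only an illustration, not what icc generates. The
function name count_zero_lanes, the flat 256*16 layout of b, and the use
of unsigned char are assumptions made to keep the example self-contained.
It uses unaligned loads, so it does not rely on the align(16)
declarations, and it falls back to a plain scalar loop when SSE2 is not
available. Whether it is actually faster on a given machine would have to
be measured.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper (not from the original code): runs the kernel over
   t[0..n-1] and returns the total number of zero bytes seen in dm.
   b points to 256 rows of 16 bytes; d is updated in place. */
int count_zero_lanes(const unsigned char *t, int n,
                     const unsigned char *b,
                     unsigned char d[16],
                     const unsigned char mm[16])
{
    int m = 0;
#if defined(__SSE2__)
    __m128i vd  = _mm_loadu_si128((const __m128i *)d);
    __m128i vmm = _mm_loadu_si128((const __m128i *)mm);
    const __m128i zero = _mm_setzero_si128();

    for (int i = 0; i < n; i++) {
        __m128i vb =
            _mm_loadu_si128((const __m128i *)(b + 16 * (size_t)t[i]));
        vd = _mm_or_si128(_mm_add_epi8(vd, vd), vb);  /* d = (d + d) | B */
        __m128i vdm = _mm_and_si128(vd, vmm);         /* dm = d & mm     */

        /* 0xFF in every byte lane where dm is zero, then one bit per lane */
        unsigned mask =
            (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(vdm, zero));

        /* count the set bits, i.e. the zero lanes */
        while (mask) { mask &= mask - 1; m++; }
    }
    _mm_storeu_si128((__m128i *)d, vd);
#else
    /* scalar fallback with the same semantics */
    for (int i = 0; i < n; i++) {
        const unsigned char *B = b + 16 * (size_t)t[i];
        for (int j = 0; j < 16; j++) {
            d[j] = (unsigned char)((d[j] + d[j]) | B[j]);
            if ((d[j] & mm[j]) == 0) m++;
        }
    }
#endif
    return m;
}
```

A caller holding the two-dimensional table from the post would pass
&b[0][0] for the b argument; d and mm must be initialized beforehand,
just as in the original code.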

