From: | "Kimmo Fredriksson" <kfredrik@saippua.cs.Helsinki.FI> |
Newsgroups: | comp.compilers |
Date: | 7 Dec 2002 20:01:17 -0500 |
Organization: | University of Helsinki |
References: | 02-11-173 02-12-006 02-12-038 |
Keywords: | parallel |
Posted-Date: | 07 Dec 2002 20:01:17 EST |
> You did not copy and paste the full context of the loop and the resulting
> assembly code, so that I am unable to determine if all arrays are aligned at
Okay, here are some more details. The array t and its length n are
parameters (char * t, int n), and the rest are local variables declared as:
_declspec(align(16)) char b[ 256 ][ 16 ];
_declspec(align(16)) char * B;
_declspec(align(16)) char d[ 16 ];
_declspec(align(16)) char dm[ 16 ];
_declspec(align(16)) char mm[ 16 ];
int i, j, m;
// some preprocessing to initialize the locals, which should not be relevant
// to copy/paste here...
// the actual code:
for( i = 0; i < n; i++ )
{
    B = b[ t[ i ]]; // does not vectorize without this
                    // is this the performance problem?
#pragma ivdep
#pragma vector aligned // this speeds up the code somewhat
    for( j = 0; j < 16; j++ )
    {
        d[ j ] = d[ j ] + d[ j ];
        d[ j ] = d[ j ] | B[ j ];
        dm[ j ] = d[ j ] & mm[ j ];
    }
    // the following could be vectorized also, but the
    // result is really slow:
#pragma novector
    for( j = 0; j < 16; j++ ) if( !dm[ j ] ) m++;
}
This compiles to the following (sorry for the AT&T syntax):
..B3.13: # Preds ..B3.12
testl %ebp, %ebp #169.2
jle ..B3.19 # Prob 2% #169.2
# LOE ebp esi edi ah dh ch
..B3.14: # Preds ..B3.13
movdqa 4656(%esp), %xmm1 #178.23
movdqa 4672(%esp), %xmm0 #181.23
addl %esi, %ebp #181.23
# LOE ebp esi edi xmm0 xmm1
..B3.15: # Preds ..B3.17 ..B3.14
movzbl (%esi), %ecx #171.10
addl %ecx, %ecx #171.10
paddb %xmm1, %xmm1 #178.14
por 48(%esp,%ecx,8), %xmm1 #179.14
movdqa %xmm1, %xmm3 #181.14
xorl %ecx, %ecx #184.8
pand %xmm0, %xmm3 #181.14
movdqa %xmm3, 4688(%esp) #181.4
.align 4,0x90
# LOE ecx ebp esi edi xmm0 xmm1
..B3.16: # Preds ..B3.16 ..B3.15
movsbl 4688(%esp,%ecx), %eax #184.34
lea 1(%edi), %edx #184.44
testl %eax, %eax #184.44
jne .L3 # Prob 50% #184.44
movl %edx, %edi #184.44
.L3: #
addl $1, %ecx #184.23
cmpl $16, %ecx #184.3
jl ..B3.16 # Prob 93% #184.3
# LOE ecx ebp esi edi xmm0 xmm1
..B3.17: # Preds ..B3.16
addl $1, %esi #169.21
cmpl %ebp, %esi #169.2
jb ..B3.15 # Prob 93% #169.2
# LOE ebp esi edi xmm0 xmm1
..B3.19: # Preds ..B3.17 ..B3.13
The compiler is icc 7.0 for Linux. The computer is a 2 GHz Pentium 4. The
code runs in 3.61 seconds. If I add #pragma novector, then the code
runs in 3.18 seconds.
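For anyone who wants to reproduce the timing without the elided preprocessing, here is a minimal portable sketch of the same loop nest in plain C. It makes several assumptions that are not in the code above: the helper names step, count_zero and scan are hypothetical, the arrays are unsigned char (so that t[ i ] cannot become a negative index), m starts at 0, and the initial contents of b, d and mm are passed in by the caller. The counting loop is written without a branch, which may or may not help a given compiler; it is only one possible formulation, not a measured improvement.

```c
#include <string.h>

#define LANES 16   /* 16 byte lanes, matching the 16-byte arrays above */

/* Hypothetical helper: one pass of the first inner loop,
   d = (d + d) | B and dm = d & mm, byte by byte. */
static void step(unsigned char d[LANES], unsigned char dm[LANES],
                 const unsigned char B[LANES], const unsigned char mm[LANES])
{
    for (int j = 0; j < LANES; j++) {
        d[j]  = (unsigned char)(d[j] + d[j]);
        d[j]  = d[j] | B[j];
        dm[j] = d[j] & mm[j];
    }
}

/* Hypothetical helper: the second inner loop, counting zero bytes.
   The comparison yields 0 or 1, so no conditional jump is required. */
static int count_zero(const unsigned char dm[LANES])
{
    int c = 0;
    for (int j = 0; j < LANES; j++)
        c += (dm[j] == 0);
    return c;
}

/* Hypothetical driver: the outer loop over the text t of length n.
   d0 is the assumed initial state of d; m is assumed to start at 0. */
static int scan(const unsigned char *t, int n,
                unsigned char b[256][LANES],
                const unsigned char d0[LANES],
                const unsigned char mm[LANES])
{
    unsigned char d[LANES], dm[LANES];
    int m = 0;

    memcpy(d, d0, LANES);
    for (int i = 0; i < n; i++) {
        step(d, dm, b[t[i]], mm);   /* b[t[i]] plays the role of B */
        m += count_zero(dm);
    }
    return m;
}
```

Calling scan( t, n, b, d0, mm ) with the preprocessed tables should return the same m as the loop above, provided m was 0 before the loop; the alignment declarations and pragmas are left out on purpose so that the sketch compiles with any C99 compiler.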