Related articles
vectorization in icc kf@iki.fi (2002-11-26)
Re: vectorization in icc skral@mips.complang.tuwien.ac.at (Kral Stefan) (2002-12-01)
vectorization in icc aart.bik@intel.com (Bik, Aart) (2002-12-03)
Re: vectorization in icc kfredrik@saippua.cs.Helsinki.FI (Kimmo Fredriksson) (2002-12-07)
vectorization in icc aart.bik@intel.com (Bik, Aart) (2002-12-07)
Re: vectorization in icc terryg@qwest.net (Terry Greyzck) (2002-12-11)
Re: vectorization in icc kf@iki.fi (2002-12-11)
Re: vectorization in icc kf@iki.fi (2002-12-11)
Re: vectorization in icc kf@iki.fi (2002-12-11)
Re: vectorization in icc nmm1@cus.cam.ac.uk (2002-12-13)
From: "Bik, Aart" <aart.bik@intel.com>
Newsgroups: comp.compilers
Date: 7 Dec 2002 20:06:57 -0500
Organization: Compilers Central
Keywords: parallel, performance
Posted-Date: 07 Dec 2002 20:06:57 EST
Thread-Topic: vectorization in icc
Thanks for the code with full context; that was very helpful.
Let's explore the hotspot of your application:
char *B, d[ 16 ], dm[ 16 ], mm[ 16 ];
int i, j, m;
...
for( i = 0; i < n; i++ ) {
    B = b[ t[ i ]];
#pragma ivdep
#pragma vector aligned
    for( j = 0; j < 16; j++ ) {
        d[ j ] = d[ j ] + d[ j ];     /* doubling, i.e. a left shift by one */
        d[ j ] = d[ j ] | B[ j ];
        dm[ j ] = d[ j ] & mm[ j ];
    }
#pragma novector
    for( j = 0; j < 16; j++ )
        if( !dm[ j ] ) m++;
}
...
Experimentation on my Pentium 4 processor at 2.66GHz confirmed the slowdown
you observed. With a certain input set, I got the following runtimes:
-O2 [sequential]: 9.9s
-QxW [vectorized]: 12.5s
The slowdown is caused by a so-called "wide store feeds narrow load" problem
at the transition between the vectorized loop and the scalar counting loop.
The VTune performance analyzer reveals this with the following clock-tick
counts:
Line   Clock Ticks
  37       872      movdqa XMMWORD PTR [esp+01240h], xmm2
                    /* the last SIMD store of the vector loop */
  40       992      movsx edi, BYTE PTR [esp+01240h]
                    /* immediately followed by a narrow load for the first test */
  40      3605      test edi, edi
                    /* the clock ticks charged to the load are simply too high */
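As an aside, if you want to reproduce the effect in isolation, the sketch
below (my own construction, not code from your application) exhibits the same
wide-store-feeds-narrow-load pattern: one 16-byte SIMD store immediately
followed by one-byte loads from the same memory. Compiled with icc at -O2 it
should show a comparable store-forwarding stall on a Pentium 4:

#include <stdio.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

int main(void)
{
    /* the union keeps the 16 bytes aligned for the SIMD store */
    union { __m128i v; char c[16]; } buf;
    long i;
    int j, m = 0;

    for( i = 0; i < 100000000L; i++ ) {
        /* wide store: one 16-byte SIMD store ... */
        _mm_store_si128( &buf.v, _mm_set1_epi8( (char)(i & 1) ) );
        /* ... immediately followed by narrow one-byte loads from the
           same memory; the wide store cannot be forwarded to the
           narrow loads, so each load stalls (the effect VTune exposed
           above) */
#pragma novector
        for( j = 0; j < 16; j++ )
            if( !buf.c[ j ] ) m++;
    }
    printf( "%d\n", m );   /* keep the loops from being optimized away */
    return 0;
}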
This problem, however, can easily be avoided in your application by adhering
to one of the golden rules of effective SIMD vectorization: use the smallest
possible data type. In the counting loop, an int data type is mixed with a
char data type, which is not very amenable to vectorization (and is probably
why you used #pragma novector). A simple inspection of this loop shows that
the local count in one complete loop execution can never exceed 16. Hence, a
char counter (which nicely matches the data type of the dm array) can be used
inside the loop, after which the result is added back into the full int
counter. This yields a vectorizable loop, as shown below.
{
    char locm = 0;    /* local char counter; can never exceed 16 per pass */
#pragma vector aligned
    for( j = 0; j < 16; j++ )
        if( !dm[ j ] ) locm++;
    m += locm;        /* fold the local count back into the int counter */
}
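For completeness, here is how the vectorizable counting loop slots back into
the loop nest of your hotspot (assembled from the fragments shown above):

for( i = 0; i < n; i++ ) {
    B = b[ t[ i ]];
#pragma ivdep
#pragma vector aligned
    for( j = 0; j < 16; j++ ) {
        d[ j ] = d[ j ] + d[ j ];
        d[ j ] = d[ j ] | B[ j ];
        dm[ j ] = d[ j ] & mm[ j ];
    }
    {   /* counting loop, now vectorizable thanks to the char counter */
        char locm = 0;
#pragma vector aligned
        for( j = 0; j < 16; j++ )
            if( !dm[ j ] ) locm++;
        m += locm;
    }
}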
The runtimes of your application as a whole now look very different:
-O2 [sequential]: 9.9s [no impact from the change]
-QxW [vectorized]: 5.0s [a nice performance boost!]
Hope this helps to speed up your application. Let me know the results.
If you are interested in more background on automatic vectorization in the
Intel compiler, please visit www.intelcompiler.com.
--
Aart Bik, Senior Staff Engineer, SSG, Intel Corporation
2200 Mission College Blvd. SC12-301, Santa Clara CA 95052
email: aart.bik@intel.com URL: http://www.aartbik.com/