vectorization in icc

"Bik, Aart" <aart.bik@intel.com>
7 Dec 2002 20:06:57 -0500

          From comp.compilers

Related articles
vectorization in icc kf@iki.fi (2002-11-26)
Re: vectorization in icc skral@mips.complang.tuwien.ac.at (Kral Stefan) (2002-12-01)
vectorization in icc aart.bik@intel.com (Bik, Aart) (2002-12-03)
Re: vectorization in icc kfredrik@saippua.cs.Helsinki.FI (Kimmo Fredriksson) (2002-12-07)
vectorization in icc aart.bik@intel.com (Bik, Aart) (2002-12-07)
Re: vectorization in icc terryg@qwest.net (Terry Greyzck) (2002-12-11)
Re: vectorization in icc kf@iki.fi (2002-12-11)
Re: vectorization in icc kf@iki.fi (2002-12-11)
Re: vectorization in icc kf@iki.fi (2002-12-11)
Re: vectorization in icc nmm1@cus.cam.ac.uk (2002-12-13)
From: "Bik, Aart" <aart.bik@intel.com>
Newsgroups: comp.compilers
Date: 7 Dec 2002 20:06:57 -0500
Organization: Compilers Central
Keywords: parallel, performance
Posted-Date: 07 Dec 2002 20:06:57 EST

Thanks for the code with full context; that was very helpful.
Let's explore the hotspot of your application:


char *B, d[ 16 ], dm[ 16 ], mm[ 16 ];
int i, j, m;
...
for( i = 0; i < n; i++ ) {
    B = b[ t[ i ] ];
    #pragma ivdep            /* assert no loop-carried dependences */
    #pragma vector aligned   /* assert 16-byte aligned accesses    */
    for( j = 0; j < 16; j++ ) {
        d[ j ] = d[ j ] + d[ j ];      /* doubles d[j], i.e. shift left by one */
        d[ j ] = d[ j ] | B[ j ];
        dm[ j ] = d[ j ] & mm[ j ];
    }
    #pragma novector         /* keep the counting loop scalar */
    for( j = 0; j < 16; j++ ) if( !dm[ j ] ) m++;
}
...
...


Experimentation on my 2.66 GHz Pentium 4 processor confirmed the slowdown you
observed. With a certain input set, I got the following runtimes:


-O2 [sequential]: 9.9s
-QxW [vectorized]: 12.5s


The slowdown is caused by a so-called "wide store feeds narrow load" problem
at the transition between the vectorized loop and the scalar counting loop.
The VTune performance analysis tool reveals this with the following clock-tick
counts:


Line  Clock Ticks
  37        872   movdqa XMMWORD PTR [esp+01240h], xmm2  /* last SIMD store of the vector loop */
  40        992   movsx  edi, BYTE PTR [esp+01240h]      /* immediately followed by a narrow load for the first test */
  40       3605   test   edi, edi                        /* the clock ticks [belonging to the load] are simply too high */
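
To make the pattern concrete, here is a minimal, self-contained sketch (my own
illustration, not code from the original application) of the same
wide-store/narrow-load transition, written with SSE2 intrinsics: a 128-bit
movdqa-style store followed immediately by byte-wide loads from the same 16
bytes.

/* Sketch only: illustrates the "wide store feeds narrow load" pattern,
 * not code from the original program. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i block;                 /* 16 bytes, naturally 16-byte aligned */
    char   *p = (char *)&block;
    int     j, m = 0;

    /* Wide (128-bit) store, like the last movdqa of the vectorized loop. */
    _mm_store_si128(&block, _mm_set1_epi8(3));

    /* Narrow (8-bit) loads from the same 16 bytes right after the wide
     * store: the loads cannot be satisfied quickly from the still-pending
     * 128-bit store, so they stall until the store completes.  This is
     * the transition that shows up as the excessive clock ticks above. */
    for (j = 0; j < 16; j++)
        if (!p[j]) m++;

    printf("m = %d\n", m);
    return 0;
}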


This problem, however, can easily be avoided in your application by adhering
to one of the golden rules of effective SIMD vectorization: use the smallest
possible data type. In the counting loop, an int data type is mixed with a
char data type, which is not very amenable to vectorization (which is probably
why you used the #pragma novector). A simple inspection of this loop shows
that the local count in one complete loop execution can never exceed 16.
Hence, a char counter (which matches the data type of the dm array nicely) can
be used inside the loop, after which the result is added back into the full
int counter. This results in a vectorizable loop, as shown below.


{ char locm = 0;
  #pragma vector aligned
  for( j = 0; j < 16; j++ ) if( !dm[ j ] ) locm++;
  m += locm;
}
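
For completeness, here is a sketch of how the whole hotspot might look after
the change. The declarations of n, b and t are placeholders of my own (the
original code elides them), so treat this only as an illustration of where the
new counting loop goes.

/* Sketch only: n, b and t below are assumed declarations, not taken
 * from the original program. */
char *B, d[ 16 ], dm[ 16 ], mm[ 16 ];
int i, j, m;
int n;                 /* assumed: number of input symbols            */
char **b;              /* assumed: table of 16-byte state/mask blocks */
int *t;                /* assumed: input sequence, indexes into b     */

for( i = 0; i < n; i++ ) {
    B = b[ t[ i ] ];
    #pragma ivdep
    #pragma vector aligned
    for( j = 0; j < 16; j++ ) {
        d[ j ] = d[ j ] + d[ j ];
        d[ j ] = d[ j ] | B[ j ];
        dm[ j ] = d[ j ] & mm[ j ];
    }
    { char locm = 0;                   /* char counter: at most 16 per pass */
      #pragma vector aligned
      for( j = 0; j < 16; j++ ) if( !dm[ j ] ) locm++;
      m += locm;                       /* fold back into the int counter */
    }
}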


The runtimes of your application as a whole now look very different:


-O2 [sequential]: 9.9s [no impact due to change]
-QxW [vectorized]: 5.0s [nice performance boost!]


Hope this helps to speed up your application. Let me know the results.
If you are interested in more background on automatic vectorization in the
Intel compiler, please visit www.intelcompiler.com.
--
Aart Bik, Senior Staff Engineer, SSG, Intel Corporation
2200 Mission College Blvd. SC12-301, Santa Clara CA 95052
email: aart.bik@intel.com URL: http://www.aartbik.com/

