Re: vectorization in icc

kf@iki.fi
11 Dec 2002 22:23:21 -0500

          From comp.compilers


From: kf@iki.fi
Newsgroups: comp.compilers
Date: 11 Dec 2002 22:23:21 -0500
Organization: -
References: 02-12-049
Keywords: performance, parallel
Posted-Date: 11 Dec 2002 22:23:21 EST

Hi again,


I was thinking that the standard 32-bit ALU instructions can be used to
vectorize the code. The code below does just that: it uses 32-bit ints to
compute four 8-bit results in parallel.


    int * dp, * mmp, * Bp, nm, m;

    ...

    // Build nm: all ones except bit 0 of each byte (0xFEFEFEFE):
    nm = ~0;
    for( i = 0; i < 4; i++ ) nm ^= ( 1 << ( i * 8 ));

    dp = d;
    mmp = mm;

    for( i = 0; i < n; i++ )
    {
        Bp = b[ t[ i ]];
        #pragma novector
        for( j = 0; j < 4; j++ )
        {
            // Use nm to zero the 'carry' bits:
            dp[ j ] = (( dp[ j ] << 1 ) & nm ) | Bp[ j ];

            // This is safe, because I know that m is
            // incremented only once in the original loop:
            if( mmp[ j ] != ( dp[ j ] & mmp[ j ] )) m++;
        }
    }


The result is that this code runs in 0.85s, whereas the original
vectorized code, after all the tuning (thanks to Aart), runs in
1.44s. The above loop can also be vectorized (add #pragma ivdep,
#pragma vector aligned), but the code again runs in 0.85s...
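
For anyone who wants to check the masking step in isolation, here is a
minimal sketch. The names step_packed, step_bytewise and LANE_MASK are made
up for this illustration and are not part of the program above; the sketch
also assumes unsigned 32-bit words (uint32_t) rather than plain int, so that
the shift is well defined. It compares the packed update against the same
update done one byte at a time.

```c
#include <stdint.h>

/* All ones except bit 0 of every byte; same value the nm loop builds. */
#define LANE_MASK 0xFEFEFEFEu

/* Packed step: shift four 8-bit lanes left by one inside a 32-bit word.
 * The shift moves bit 7 of each lane into bit 0 of the lane above it;
 * the mask clears exactly those bits before b is ORed in. */
static uint32_t step_packed(uint32_t d, uint32_t b)
{
    return ((d << 1) & LANE_MASK) | b;
}

/* Reference: the same update applied to each byte separately. */
static uint32_t step_bytewise(uint32_t d, uint32_t b)
{
    uint32_t out = 0;
    for (int k = 0; k < 4; k++) {
        uint8_t dk = (uint8_t)(d >> (8 * k));
        uint8_t bk = (uint8_t)(b >> (8 * k));
        uint8_t r  = (uint8_t)((uint8_t)(dk << 1) | bk);
        out |= (uint32_t)r << (8 * k);
    }
    return out;
}
```

Both functions should agree for every input, since the only cross-lane
effect of the 32-bit shift is the bit that the mask removes.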


So, what is the conclusion? Maybe icc should vectorize using the
integer instructions...? This is probably not always so easy. In the
above example the carry-over bits were easy to handle. But I'd like
to see a compiler that achieves the same...
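
As an illustration of a case where the carries take more work than a single
mask, consider adding four packed 8-bit values. This is only a sketch of a
well-known bit trick, not something taken from the program above, and the
name add_packed is made up for the example. The low 7 bits of each lane are
added first, so no carry can leave a lane, and the top bits are folded in
afterwards with XOR.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes, each modulo 256, in one 32-bit word. */
static uint32_t add_packed(uint32_t a, uint32_t b)
{
    /* Sum of the low 7 bits of each lane; the carry lands in bit 7
     * of the same lane and cannot spill into the next one. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);

    /* Bit 7 of each result lane is a7 ^ b7 ^ carry-in; low already
     * holds the carry-in at bit 7, so XOR in a7 ^ b7. */
    return low ^ ((a ^ b) & 0x80808080u);
}
```

A plain 32-bit add gives a different answer whenever a lane overflows, which
is the kind of case a compiler would have to detect and handle on its own.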


kf

