High performance comparison of signed int arrays (using Intel IPP library)

We are trying to compare two arrays of signed int values of the same size using the inequality operators <, <=, > and >=, with high efficiency. The true/false results of the comparisons will be stored in a char array of the same size as the input, where 0x00 means false and 0xff means true.
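For reference, the operation described above can be written as a plain scalar loop. This is only a sketch of the expected semantics for the < case; the function name compare_less_scalar is made up for illustration and is not part of IPP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an IPP function:
   dst[i] = 0xff if a[i] < b[i], otherwise 0x00. */
void compare_less_scalar(const int32_t *a, const int32_t *b,
                         uint8_t *dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* (a[i] < b[i]) is 0 or 1; negating gives 0 or -1,
           which becomes 0x00 or 0xff when stored as a byte. */
        dst[i] = (uint8_t)-(a[i] < b[i]);
    }
}
```

The other three operators follow the same pattern with the comparison swapped.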

For this we use the Intel IPP library. The problem is that the function we found that does this operation, named ippiCompare_*, from the image and video processing lib, only supports the types unsigned char (Ipp8u), signed/unsigned short (Ipp16s/Ipp16u) and float (Ipp32f). It does not directly support signed int (Ipp32s).

I (only) envision two possible ways to address this issue:

  • Casting the array to one of the directly supported types (it then becomes a short array with twice as many elements or a char array with four times as many), performing the comparison in several more complex steps and merging the intermediate results.

  • Using another function that directly supports signed int arrays, from IPP or another library, that can do something equivalent from a performance standpoint.
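To make the first option concrete, here is a scalar sketch of the merge logic when each 32-bit value is split into two 16-bit halves: the high halves are compared as signed values, and the low halves only matter, as unsigned values, when the high halves are equal. The function name less_via_halves is hypothetical, and how these three partial results would map onto actual IPP calls is not shown here.

```c
#include <stdint.h>

/* Hypothetical helper illustrating only the merge logic:
   a < b  <=>  hi(a) < hi(b)  ||  (hi(a) == hi(b) && lo(a) < lo(b))
   with hi compared as signed 16-bit and lo as unsigned 16-bit. */
uint8_t less_via_halves(int32_t a, int32_t b)
{
    /* The conversion to int16_t assumes the usual two's-complement
       wrap-around (true for the common x86 compilers). */
    int16_t  a_hi = (int16_t)((uint32_t)a >> 16);
    int16_t  b_hi = (int16_t)((uint32_t)b >> 16);
    uint16_t a_lo = (uint16_t)((uint32_t)a & 0xffffu);
    uint16_t b_lo = (uint16_t)((uint32_t)b & 0xffffu);

    int lt = (a_hi < b_hi) || (a_hi == b_hi && a_lo < b_lo);
    return lt ? 0xff : 0x00;
}
```

A vectorised version would need three 16-bit comparison passes plus AND/OR passes over the masks, which is why this route is likely to cost noticeably more than a native 32-bit compare.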

But there may be other creative ways ... So I ask you to help with this! :)

PS: The benefit of using Intel IPP is the performance increase for large arrays: it uses processor instructions that operate on multiple values at once (SIMD) and many cores at the same time (and maybe more tricks). So simple looping solutions won't do it as fast AFAIK.

PS2: link to the ippiCompare_* doc

3 answers


You can do the comparison with PCMPEQD (or PCMPGTD for a signed greater-than) followed by PACKSSDW and PACKSSWB. It will be something like

#include <emmintrin.h>  // SSE2

// Requires 16-byte-aligned pointers and a count divisible by 16.
void cmp(const __m128i* a, const __m128i* b, __m128i* result, unsigned count) {
    for (unsigned i = 0; i < count / 16; ++i) {
        __m128i result0 = _mm_cmpeq_epi32(a[0], b[0]);  // each line compares 4 integers
        __m128i result1 = _mm_cmpeq_epi32(a[1], b[1]);
        __m128i result2 = _mm_cmpeq_epi32(a[2], b[2]);
        __m128i result3 = _mm_cmpeq_epi32(a[3], b[3]);
        a += 4; b += 4;

        __m128i wresult0 = _mm_packs_epi32(result0, result1);  // pack 2*4 integer results into 8 words
        __m128i wresult1 = _mm_packs_epi32(result2, result3);

        *result = _mm_packs_epi16(wresult0, wresult1);  // pack 2*8 word results into 16 bytes
        result++;
    }
}

This needs 16-byte-aligned pointers and a count divisible by 16, and probably some debugging of course. The packing steps use the SSE2 intrinsics _mm_packs_epi32 and _mm_packs_epi16 (packssdw / packsswb); their signed saturation keeps an all-ones mask as -1, so true ends up as 0xff in the result bytes.
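Since the question asks about inequalities rather than equality, a variant for a[i] < b[i] might look like the sketch below. It assumes an SSE2-capable x86 target, uses unaligned loads and stores so the pointers need no special alignment, and finishes the remaining elements with a scalar loop. The names lt4 and compare_less_sse2 are made up for illustration.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: signed less-than on 4 ints, one 32-bit mask each. */
static inline __m128i lt4(const int32_t *a, const int32_t *b)
{
    return _mm_cmplt_epi32(_mm_loadu_si128((const __m128i *)a),
                           _mm_loadu_si128((const __m128i *)b));
}

/* Hypothetical helper: dst[i] = 0xff where a[i] < b[i], else 0x00. */
void compare_less_sse2(const int32_t *a, const int32_t *b,
                       uint8_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        /* 16 comparisons, then narrow 32-bit masks to 16-bit and 8-bit.
           Signed saturation maps -1 to -1, so the masks survive. */
        __m128i w0 = _mm_packs_epi32(lt4(a + i,     b + i),
                                     lt4(a + i + 4, b + i + 4));
        __m128i w1 = _mm_packs_epi32(lt4(a + i + 8,  b + i + 8),
                                     lt4(a + i + 12, b + i + 12));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packs_epi16(w0, w1));
    }
    for (; i < n; ++i)  /* scalar tail for the last n % 16 elements */
        dst[i] = (uint8_t)-(a[i] < b[i]);
}
```

Swapping the two operands gives >, and since >= is the complement of <, XORing the final byte mask with all ones would cover the two remaining operators.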



I thought there was an SSE instruction that would compare integers. Have you looked at the intrinsics that can do this?





Stepping outside the box a bit: are you sure this is a performance bottleneck? If your dataset does not fit in L1 cache, you will be limited by cache fills, and the actual cycles you spend on the comparison operations (which are unlikely to be slow, even if done in the most naive way possible) cannot be the limiting factor.
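One rough way to check this is to time the plain loop at different array sizes and see whether the cost per element rises once the data no longer fits in cache. The sketch below is only an assumption about how such a measurement could be set up; the name ns_per_element is hypothetical, clock() is coarse, and an optimising compiler may still reshape the loop.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: approximate nanoseconds per element for the
   naive loop over n ints (n > 0), repeated `repeats` times.
   Returns -1.0 if allocation fails. */
double ns_per_element(size_t n, int repeats)
{
    int32_t *a = malloc(n * sizeof *a);
    int32_t *b = malloc(n * sizeof *b);
    uint8_t *d = malloc(n);
    if (!a || !b || !d) { free(a); free(b); free(d); return -1.0; }
    for (size_t i = 0; i < n; ++i) { a[i] = (int32_t)i; b[i] = (int32_t)(n - i); }

    volatile uint8_t sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < repeats; ++r) {
        for (size_t i = 0; i < n; ++i)
            d[i] = (uint8_t)-(a[i] < b[i]);
        sink = d[(size_t)r % n];  /* read a result so the work is not dropped */
    }
    clock_t t1 = clock();
    (void)sink;

    free(a); free(b); free(d);
    return 1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / ((double)n * repeats);
}
```

Calling it with, say, a few thousand elements and then a few tens of millions, and comparing the two figures, gives a first hint of whether memory traffic rather than the comparison itself dominates.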
