High performance comparison of signed int arrays (using Intel IPP library)
We are trying to compare two equally sized arrays of signed int values using the inequality operators <, <=, > and >=, with high efficiency. When many values are compared, the true/false results are stored in a char array of the same size as the input, where 0x00 means false and 0xff means true.
For this we are using the Intel IPP library. The problem is that the function we found for this operation, named ippiCompare_* and part of the image and video processing library, only supports the types unsigned char (Ipp8u), signed/unsigned short (Ipp16s/Ipp16u) and float (Ipp32f). It does not directly support signed int (Ipp32s).
I can only think of two ways to address this:
- Casting the array to one of the directly supported types and performing the comparison in several more complex steps (it would become a short array twice the size or a char array four times the size), then merging the intermediate results.
- Using another function that directly supports signed int arrays, from IPP or from another library that does something equivalent in terms of performance.
But there may be other creative ways ... So I ask you to help with this! :)
PS: The benefit of using Intel IPP is the performance gain for large arrays: it uses SIMD instructions (one instruction operating on multiple values) and many cores at the same time (and maybe more tricks). So simple looping solutions won't be as fast, AFAIK.
PS2: link to the ippiCompare_* documentation
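You can do the comparison with PCMPEQD (or PCMPGTD for a signed greater-than) followed by PACKSSDW and PACKSSWB. The signed-saturating packs are the ones to use here: they keep an all-ones mask (-1) as all ones, whereas the unsigned-saturating variants would clamp -1 to 0. It would look something like this: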
#include <emmintrin.h>  // SSE2

// a, b and result must be 16-byte aligned; count must be a multiple of 16.
void cmp(const __m128i* a, const __m128i* b, __m128i* result, unsigned count) {
    for (unsigned i = 0; i < count / 16; ++i) {
        __m128i result0 = _mm_cmpeq_epi32(a[0], b[0]); // each line compares 4 integers
        __m128i result1 = _mm_cmpeq_epi32(a[1], b[1]);
        __m128i result2 = _mm_cmpeq_epi32(a[2], b[2]);
        __m128i result3 = _mm_cmpeq_epi32(a[3], b[3]);
        a += 4; b += 4;
        __m128i wresult0 = _mm_packs_epi32(result0, result1); // PACKSSDW: pack 2*4 integer results into 8 words
        __m128i wresult1 = _mm_packs_epi32(result2, result3);
        *result = _mm_packs_epi16(wresult0, wresult1); // PACKSSWB: pack 2*8 word results into 16 bytes
        result++;
    }
}
Treat it as a sketch: it requires 16-byte-aligned pointers and a count divisible by 16, and the vector types and casts probably need some debugging, of course. For the packing steps, <emmintrin.h> provides _mm_packs_epi32 (PACKSSDW) and _mm_packs_epi16 (PACKSSWB); compiler-specific builtins such as GCC's __builtin_ia32_* forms do the same job.
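Since the question asks about inequalities rather than equality, the same compare-and-pack idea can be sketched with the signed greater-than compare. This is a minimal sketch, not a tuned implementation: the function name cmp_gt_s32 is hypothetical, and it assumes unaligned loads are acceptable and handles leftover elements with a scalar tail loop.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dst[i] = 0xFF if a[i] > b[i], else 0x00. */
void cmp_gt_s32(const int32_t *a, const int32_t *b, uint8_t *dst, size_t n)
{
    size_t i = 0;

    /* Main loop: 16 ints in, 16 result bytes out per iteration. */
    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads, so the arrays need no special alignment. */
        __m128i r0 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i *)(a + i)),
                                     _mm_loadu_si128((const __m128i *)(b + i)));
        __m128i r1 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i *)(a + i + 4)),
                                     _mm_loadu_si128((const __m128i *)(b + i + 4)));
        __m128i r2 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i *)(a + i + 8)),
                                     _mm_loadu_si128((const __m128i *)(b + i + 8)));
        __m128i r3 = _mm_cmpgt_epi32(_mm_loadu_si128((const __m128i *)(a + i + 12)),
                                     _mm_loadu_si128((const __m128i *)(b + i + 12)));

        /* Signed saturation keeps each 0 / -1 mask intact while narrowing. */
        __m128i w0 = _mm_packs_epi32(r0, r1);   /* 8 x 16-bit masks */
        __m128i w1 = _mm_packs_epi32(r2, r3);   /* 8 x 16-bit masks */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packs_epi16(w0, w1));
    }

    /* Scalar tail for the last n % 16 elements. */
    for (; i < n; ++i)
        dst[i] = (a[i] > b[i]) ? 0xFF : 0x00;
}
```

The other operators follow from this one: a < b is the same call with the arguments swapped, and a <= b and a >= b are the bitwise complements of a > b and b > a respectively (for example, by XORing each result vector with an all-ones vector before storing).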
I thought there was an SSE instruction that compares integers. Have you looked into the intrinsics that can do this?
Thinking slightly outside the box: are you sure this is a performance issue? If your data set does not fit in the L1 cache, you will be limited by cache and memory bandwidth, and the actual cycles spent on the comparison operations (which are unlikely to be slow, even if done in the most naive way possible) will not be the limiting factor.
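One way to check this is to time a plain loop first and use it as a baseline. Every element reads 8 bytes (two 32-bit ints) and writes 1 byte, so for arrays much larger than the cache the memory traffic of 9 bytes per comparison is the same for any implementation. The sketch below is only a baseline under that assumption; the name cmp_gt_s32_scalar is hypothetical, and whether a given compiler auto-vectorizes the loop depends on its version and flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical baseline: dst[i] = 0xFF if a[i] > b[i], else 0x00.
 * 'restrict' tells the compiler the arrays do not overlap, which makes
 * the loop easier for it to vectorize. */
void cmp_gt_s32_scalar(const int32_t *restrict a,
                       const int32_t *restrict b,
                       uint8_t *restrict dst,
                       size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* (a[i] > b[i]) is 0 or 1; negating gives 0 or -1, i.e. 0x00 or 0xFF. */
        dst[i] = (uint8_t)-(a[i] > b[i]);
    }
}
```

If this loop and a hand-written SIMD version take about the same time on the real data sizes, the comparison is probably not where the time goes.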