Why is it slower than memcmp?
I am trying to compare two arrays of Pixels. A Pixel is defined as a struct containing 4 float (RGBA) values. The reason I am not using memcmp is that I need the position of the first differing pixel, which memcmp does not return.
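For reference, the scalar baseline being beaten here would be something like the following sketch (my own code, not from the question; the Pixel layout is assumed from the description above):

```cpp
#include <cstring>

// Layout assumed from the question: 4 floats (RGBA), 16 bytes.
struct Pixel { float r, g, b, a; };

// Naive baseline: memcmp each 16-byte Pixel, return the index of the
// first mismatch, or -1 if the arrays are equal.
inline int PixelCmpScalar(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
        if (std::memcmp(a + i, b + i, sizeof(Pixel)) != 0) return i;
    return -1;
}
```

A single memcmp over the whole buffer would tell you *whether* they differ, but not *where*, hence the per-pixel loop.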
My first implementation uses SSE intrinsics and is ~30% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128 x = _mm_load_ps((float*)(a + i));
        __m128 y = _mm_load_ps((float*)(b + i));
        __m128 cmp = _mm_cmpeq_ps(x, y);
        if (_mm_movemask_ps(cmp) != 15) return i;
    }
    return -1;
}
Then I found that handling the values as integers rather than floats sped things up a little, and now it is only ~20% slower than memcmp:
inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        __m128i x = _mm_load_si128((__m128i*)(a + i));
        __m128i y = _mm_load_si128((__m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}
From what I've read in other questions, the MS implementation of memcmp is also implemented using SSE. My question is: what other tricks does the MS implementation have up its sleeve that I don't? How is it still faster even though it does a byte comparison?
Is alignment a problem? If a Pixel contains 4 floats, won't the pixel array already be allocated on a 16-byte boundary?
I am compiling with /O2 and all the optimization flags.
I have written optimized strcmp/memcmp routines with SSE (and MMX/3DNow!), and the first step is to ensure that the arrays are as aligned as possible; you may find that you need to handle the first and/or last bytes "one at a time".
If you can align the data before it enters the loop (if your code does the allocation), that's ideal.
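On the alignment point: a struct of four floats has only 4-byte natural alignment, so a heap allocation is not guaranteed to land on a 16-byte boundary. A minimal sketch of forcing and checking the alignment (the Pixel definition is my assumption of the question's layout):

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: Pixel layout assumed from the question (4 floats, 16 bytes).
// alignas(16) makes every Pixel, including array elements, start on a
// 16-byte boundary, so aligned loads like _mm_load_ps are safe on them.
struct alignas(16) Pixel { float r, g, b, a; };

// Check a pointer before using _mm_load_ps / _mm_load_si128 on it.
inline bool IsAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}
```

Note that with alignas(16), heap allocations via plain new are only guaranteed aligned from C++17 onward; before that you need an aligned allocator.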
The second part is to unroll the loop, so you don't pay as much of the "if we're not at the end of the loop, jump back to the start" overhead, assuming the loop is quite long.
You may also find that preloading the next input before the exit condition is checked helps.
Edit: the last paragraph may need an example. This code assumes a loop unrolled by at least two:
__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));
for (int i = 0; i < count; i += 2)
{
    __m128i cmp = _mm_cmpeq_epi32(x, y);
    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    cmp = _mm_cmpeq_epi32(x1, y1);
    // Assign, don't redeclare: shadowing x and y here would leave the outer
    // variables stale. Note the preload reads one element past the end on the
    // last iteration, so the buffers must be sized to allow that.
    x = _mm_load_si128((__m128i*)(a + i + 2));
    y = _mm_load_si128((__m128i*)(b + i + 2));
    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
Something like that.
You can check out this SSE memcmp implementation, specifically the __sse_memcmp function. It starts with some sanity checks and then checks whether the pointers are aligned:
aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );
If they are not aligned, it compares byte by byte until the pointers reach an aligned address:
while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
    if(*a++ != *b++) return -1;
    --len;
}
It then compares the remaining memory with SSE instructions, similar to your code:
if(!len) return 0;
while( len && !(len & 7 ) )
{
    __m128i x = _mm_load_si128( (__m128i*)&a[i]);
    __m128i y = _mm_load_si128( (__m128i*)&b[i]);
    ....
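The elided body presumably follows the same compare-and-movemask pattern as the question's code. A self-contained sketch of such a tail loop (the function name BlockCmp16 and the return convention are my own, not __sse_memcmp's):

```cpp
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_cmpeq_epi8, _mm_movemask_epi8

// Sketch: compare two 16-byte-aligned buffers 16 bytes at a time.
// Returns the index of the first differing 16-byte block, or -1 if equal.
// Assumes len is a multiple of 16 and both pointers are 16-byte aligned.
inline int BlockCmp16(const unsigned char* a, const unsigned char* b, int len)
{
    for (int i = 0; i < len; i += 16)
    {
        __m128i x = _mm_load_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i y = _mm_load_si128(reinterpret_cast<const __m128i*>(b + i));
        // cmpeq_epi8 sets each byte to 0xFF where equal; movemask collects
        // the high bit of each byte, so 0xffff means all 16 bytes matched.
        __m128i cmp = _mm_cmpeq_epi8(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i / 16;
    }
    return -1;
}
```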