Why is my SSE pixel comparison slower than memcmp?

I am trying to compare two arrays of pixels. A Pixel is defined as a struct containing 4 float values (RGBA).

The reason I am not using memcmp is that I need to return the position of the first differing pixel, which memcmp does not do.
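The post does not show the struct or the plain loop being measured against; a minimal sketch of that setup (my own names, assuming the layout described above) could be:

#include <cstring>   // std::memcmp

struct Pixel
{
    float r, g, b, a;   // 16 bytes total
};

// Naive scalar version: returns the index of the first differing pixel,
// or -1 if the arrays are equal.
inline int PixelCmpScalar(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        if (std::memcmp(a + i, b + i, sizeof(Pixel)) != 0) return i;
    }
    return -1;
}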

My first implementation uses SSE intrinsics and is ~30% slower than memcmp:

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Load one 16-byte pixel from each array and compare all 4 floats at once.
        __m128 x = _mm_load_ps((float*)(a + i));
        __m128 y = _mm_load_ps((float*)(b + i));
        __m128 cmp = _mm_cmpeq_ps(x, y);
        // All 4 lanes equal => mask 0b1111 (15); anything else is a mismatch.
        if (_mm_movemask_ps(cmp) != 15) return i;
    }
    return -1;
}

      

Then I found that handling the values as integers rather than floats was a bit faster, and now it is only ~20% slower than memcmp:

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Same 16-byte loads, but compared as four 32-bit integers.
        __m128i x = _mm_load_si128((__m128i*)(a + i));
        __m128i y = _mm_load_si128((__m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        // All 16 bytes equal => mask 0xffff; anything else is a mismatch.
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}
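A side note that is not in the original post: switching from _mm_cmpeq_ps to _mm_cmpeq_epi32 also changes the comparison semantics slightly. The integer version compares raw bit patterns (like memcmp does), whereas the float version treats NaN as unequal to itself and +0.0 as equal to -0.0. A hypothetical usage sketch, assuming the Pixel struct and the function above:

#include <cstdio>
#include <emmintrin.h>   // SSE2 intrinsics

int main()
{
    // 16-byte aligned storage so the _mm_load_* intrinsics are safe.
    alignas(16) Pixel a[4] = { {0,0,0,1}, {1,0,0,1}, {0,1,0,1}, {0,0,1,1} };
    alignas(16) Pixel b[4] = { {0,0,0,1}, {1,0,0,1}, {0,1,1,1}, {0,0,1,1} };

    std::printf("%d\n", PixelMemCmp(a, b, 4));   // prints 2 - first differing pixel
    return 0;
}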

      

From what I've read in other questions, the MS implementation of memcmp is also written with SSE. My question is: what other tricks does the MS implementation have up its sleeve that mine doesn't? How is it still faster even though it does a byte-by-byte comparison?

Is alignment an issue? If a Pixel contains 4 floats, won't the pixel array already be allocated on a 16-byte boundary?
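Side note (my own, not from the original post): a plain struct of four floats only has 4-byte alignment, so operator new and malloc only have to respect the default alignment (8 or 16 bytes depending on the platform), and a 16-byte boundary is not guaranteed everywhere. One way to make the guarantee explicit, assuming C++11 or later, is:

struct alignas(16) Pixel
{
    float r, g, b, a;
};

static_assert(sizeof(Pixel) == 16,  "Pixel must stay 16 bytes");
static_assert(alignof(Pixel) == 16, "Pixel must be 16-byte aligned");

Note that before C++17, new Pixel[n] is still not guaranteed to honour the extended alignment, so an aligned allocator (e.g. _mm_malloc or _aligned_malloc) may still be needed.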

I am compiling with /O2 and all the optimization flags.



3 answers


I have written SSE-optimized (and MMX/3DNow!) versions of strcmp/memcmp, and the first step is to make sure the arrays are as aligned as possible - you may find that you have to handle the first and/or last bytes "one at a time".

If you can align the data before it goes into the loop [if your code does the allocation], then that's ideal.
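The answer does not include code for this, but a minimal sketch of "align the data where it is allocated" might use the _mm_malloc/_mm_free pair (my example; on MSVC these are declared in <malloc.h>, on GCC/Clang they come in via <xmmintrin.h>):

#include <xmmintrin.h>   // SSE intrinsics; GCC/Clang also get _mm_malloc from here
#include <malloc.h>      // MSVC declares _mm_malloc/_mm_free here

// Allocate 'count' pixels on a 16-byte boundary so that _mm_load_ps /
// _mm_load_si128 never fault on alignment.
Pixel* AllocPixels(int count)
{
    return static_cast<Pixel*>(_mm_malloc(count * sizeof(Pixel), 16));
}

void FreePixels(Pixel* p)
{
    _mm_free(p);
}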

The second part is to unroll the loop, so that you don't pay the "if we're not at the end of the loop, jump back to the start" overhead so often - assuming the loop is reasonably long.

You may also find that pre-loading the next chunk of input before the "do we exit now" condition is checked helps.



Edit: the last paragraph may need an example. This code assumes a loop unrolled by at least two:

__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));

for (int i = 0; i < count; i += 2)
{
    __m128i cmp = _mm_cmpeq_epi32(x, y);

    // Pre-load the next pair before testing the previous result.
    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    cmp = _mm_cmpeq_epi32(x1, y1);

    // Reuse the outer x/y (no new declaration) so the next iteration compares
    // fresh data. Note these loads read one element past the end on the last
    // iteration, so pad the arrays or peel the tail.
    x = _mm_load_si128((__m128i*)(a + i + 2));
    y = _mm_load_si128((__m128i*)(b + i + 2));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
// (plus a final "return -1;" for the all-equal case, as in your version)

      

Something like that.



You can take a look at this SSE memcmp implementation, specifically the __sse_memcmp function. It starts with some sanity checks and then checks whether the pointers are aligned:

aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );

      

If they are not aligned, it compares the pointers byte by byte until it reaches the start of an aligned address:



while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
   if(*a++ != *b++) return -1;
   --len;
}

      

And then it compares the remaining memory using SSE instructions, similar to your code:

if(!len) return 0;
while( len && !(len & 7) )
{
    __m128i x = _mm_load_si128( (__m128i*)&a[i]);
    __m128i y = _mm_load_si128( (__m128i*)&b[i]);
    ....
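Tying this back to the pixel case (my addition, not part of the linked implementation): if you can't guarantee that both pixel arrays start on a 16-byte boundary, the simplest fallback is the unaligned-load variant of the loop from the question - _mm_loadu_si128 accepts any address, at some extra cost on older CPUs:

#include <emmintrin.h>   // SSE2 intrinsics

inline int PixelMemCmpUnaligned(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Unaligned loads: no 16-byte alignment requirement on a or b.
        __m128i x = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}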

      



I cannot help you directly because I am using a Mac, but there is an easy way to find out what is going on:

Just step into memcmp in debug mode and switch to the disassembly view. Since memcmp is a small, simple function, you can easily figure out all the tricks of the implementation.







