Why is my SSE pixel comparison slower than memcmp?

I am trying to compare two arrays of pixels. A Pixel is defined as a struct containing 4 float values (RGBA).

The reason I am not using memcmp is that I need to return the position of the first differing pixel, which memcmp does not do.
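The post does not show the struct or the plain loop being measured against; a minimal sketch of that setup (my own names, assuming the layout described above) could be:

#include <cstring>   // std::memcmp

struct Pixel
{
    float r, g, b, a;   // 16 bytes total
};

// Naive scalar version: returns the index of the first differing pixel,
// or -1 if the arrays are equal.
inline int PixelCmpScalar(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        if (std::memcmp(a + i, b + i, sizeof(Pixel)) != 0) return i;
    }
    return -1;
}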

My first implementation uses SSE intrinsics and is ~30% slower than memcmp:

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Load one 16-byte pixel from each array and compare all 4 floats at once.
        __m128 x = _mm_load_ps((float*)(a + i));
        __m128 y = _mm_load_ps((float*)(b + i));
        __m128 cmp = _mm_cmpeq_ps(x, y);
        // All 4 lanes equal => mask 0b1111 (15); anything else is a mismatch.
        if (_mm_movemask_ps(cmp) != 15) return i;
    }
    return -1;
}

      

Then I found that handling the values as integers rather than floats was a bit faster, and now it is only ~20% slower than memcmp:

inline int PixelMemCmp(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Same 16-byte loads, but compared as four 32-bit integers.
        __m128i x = _mm_load_si128((__m128i*)(a + i));
        __m128i y = _mm_load_si128((__m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        // All 16 bytes equal => mask 0xffff; anything else is a mismatch.
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}
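A side note that is not in the original post: switching from _mm_cmpeq_ps to _mm_cmpeq_epi32 also changes the comparison semantics slightly. The integer version compares raw bit patterns (like memcmp does), whereas the float version treats NaN as unequal to itself and +0.0 as equal to -0.0. A hypothetical usage sketch, assuming the Pixel struct and the function above:

#include <cstdio>
#include <emmintrin.h>   // SSE2 intrinsics

int main()
{
    // 16-byte aligned storage so the _mm_load_* intrinsics are safe.
    alignas(16) Pixel a[4] = { {0,0,0,1}, {1,0,0,1}, {0,1,0,1}, {0,0,1,1} };
    alignas(16) Pixel b[4] = { {0,0,0,1}, {1,0,0,1}, {0,1,1,1}, {0,0,1,1} };

    std::printf("%d\n", PixelMemCmp(a, b, 4));   // prints 2 - first differing pixel
    return 0;
}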

      

From what I've read in other questions, the MS implementation of memcmp is also written with SSE. My question is: what other tricks does the MS implementation have up its sleeve that mine doesn't? How is it still faster even though it does a byte-by-byte comparison?

Is alignment an issue? If a Pixel contains 4 floats, won't the pixel array already be allocated on a 16-byte boundary?
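Side note (my own, not from the original post): a plain struct of four floats only has 4-byte alignment, so operator new and malloc only have to respect the default alignment (8 or 16 bytes depending on the platform), and a 16-byte boundary is not guaranteed everywhere. One way to make the guarantee explicit, assuming C++11 or later, is:

struct alignas(16) Pixel
{
    float r, g, b, a;
};

static_assert(sizeof(Pixel) == 16,  "Pixel must stay 16 bytes");
static_assert(alignof(Pixel) == 16, "Pixel must be 16-byte aligned");

Note that before C++17, new Pixel[n] is still not guaranteed to honour the extended alignment, so an aligned allocator (e.g. _mm_malloc or _aligned_malloc) may still be needed.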

I am compiling with /O2 and all the optimization flags.



3 answers


I have written SSE-optimized (and MMX/3DNow!) versions of strcmp/memcmp, and the first step is to make sure the arrays are as aligned as possible - you may find that you have to handle the first and/or last bytes "one at a time".

If you can align the data before it goes into the loop [if your code does the allocation], then that's ideal.
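The answer does not include code for this, but a minimal sketch of "align the data where it is allocated" might use the _mm_malloc/_mm_free pair (my example; on MSVC these are declared in <malloc.h>, on GCC/Clang they come in via <xmmintrin.h>):

#include <xmmintrin.h>   // SSE intrinsics; GCC/Clang also get _mm_malloc from here
#include <malloc.h>      // MSVC declares _mm_malloc/_mm_free here

// Allocate 'count' pixels on a 16-byte boundary so that _mm_load_ps /
// _mm_load_si128 never fault on alignment.
Pixel* AllocPixels(int count)
{
    return static_cast<Pixel*>(_mm_malloc(count * sizeof(Pixel), 16));
}

void FreePixels(Pixel* p)
{
    _mm_free(p);
}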

The second part is to unroll the loop, so that you don't pay the "if we're not at the end of the loop, jump back to the start" overhead so often - assuming the loop is reasonably long.

You may also find that pre-loading the next chunk of input before the "do we exit now" condition is checked helps.



Edit: the last paragraph may need an example. This code assumes a loop unrolled by at least two:

__m128i x = _mm_load_si128((__m128i*)(a));
__m128i y = _mm_load_si128((__m128i*)(b));

for (int i = 0; i < count; i += 2)
{
    __m128i cmp = _mm_cmpeq_epi32(x, y);

    // Pre-load the next pair before testing the previous result.
    __m128i x1 = _mm_load_si128((__m128i*)(a + i + 1));
    __m128i y1 = _mm_load_si128((__m128i*)(b + i + 1));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    cmp = _mm_cmpeq_epi32(x1, y1);

    // Reuse the outer x/y (no new declaration) so the next iteration compares
    // fresh data. Note these loads read one element past the end on the last
    // iteration, so pad the arrays or peel the tail.
    x = _mm_load_si128((__m128i*)(a + i + 2));
    y = _mm_load_si128((__m128i*)(b + i + 2));

    if (_mm_movemask_epi8(cmp) != 0xffff) return i + 1;
}
// (plus a final "return -1;" for the all-equal case, as in your version)

      

Something like that.



You can take a look at this SSE memcmp implementation, specifically the __sse_memcmp function. It starts with some sanity checks and then checks whether the pointers are aligned:

aligned_a = ( (unsigned long)a & (sizeof(__m128i)-1) );
aligned_b = ( (unsigned long)b & (sizeof(__m128i)-1) );

      

If they are not aligned, it compares the pointers byte by byte until it reaches the start of an aligned address:



while( len && ( (unsigned long) a & ( sizeof(__m128i)-1) ) )
{
   if(*a++ != *b++) return -1;
   --len;
}

      

And then it compares the remaining memory using SSE instructions, similar to your code:

if(!len) return 0;
while( len && !(len & 7) )
{
    __m128i x = _mm_load_si128( (__m128i*)&a[i]);
    __m128i y = _mm_load_si128( (__m128i*)&b[i]);
    ....
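Tying this back to the pixel case (my addition, not part of the linked implementation): if you can't guarantee that both pixel arrays start on a 16-byte boundary, the simplest fallback is the unaligned-load variant of the loop from the question - _mm_loadu_si128 accepts any address, at some extra cost on older CPUs:

#include <emmintrin.h>   // SSE2 intrinsics

inline int PixelMemCmpUnaligned(const Pixel* a, const Pixel* b, int count)
{
    for (int i = 0; i < count; i++)
    {
        // Unaligned loads: no 16-byte alignment requirement on a or b.
        __m128i x = _mm_loadu_si128((const __m128i*)(a + i));
        __m128i y = _mm_loadu_si128((const __m128i*)(b + i));
        __m128i cmp = _mm_cmpeq_epi32(x, y);
        if (_mm_movemask_epi8(cmp) != 0xffff) return i;
    }
    return -1;
}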

      



I cannot help you directly because I am using a Mac, but there is an easy way to find out what is going on:

Just step into memcmp in debug mode and switch to the disassembly view. Since memcmp is a small, simple function, you can easily figure out all the tricks of the implementation.







