SSE alignment and strange behavior

I am trying to work with SSE and I am running into some strange behavior.

I wrote simple code to compare two strings with SSE intrinsics, ran it, and it worked. But later I realized that in my code one of the pointers is still not aligned, even though I am using the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary.

#include <immintrin.h> //SSE/AVX intrinsics
#include <stdint.h>    //uintptr_t
#include <stdio.h>     //printf
#include <stddef.h>    //size_t

//Compare two different, non-overlapping pieces of memory
__attribute__((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
    //Compare leading bytes one at a time until [head_1] is 16-byte aligned
    const char* head_1 = (const char*)src_1;
    const char* head_2 = (const char*)src_2;
    size_t tail_n = 0;
    while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
    {                                
        if (*head_1 != *head_2)
            return 0;
        head_1++, head_2++, tail_n++;
    }

    //Vectorized part: check equality of memory with SSE4.1 instructions
    //src1 - aligned, src2 - NOT aligned
    const __m128i* src1 = (const __m128i*)head_1;
    const __m128i* src2 = (const __m128i*)head_2;
    const size_t n = (size - tail_n) / 32;    
    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        printf("src1 align: %d, src2 align: %d\n", align(src1) % 16, align(src2) % 16);
        __m128i mm11 = _mm_load_si128(src1);
        __m128i mm12 = _mm_load_si128(src1 + 1);
        __m128i mm21 = _mm_load_si128(src2);
        __m128i mm22 = _mm_load_si128(src2 + 1);

        __m128i mm1 = _mm_xor_si128(mm11, mm21);
        __m128i mm2 = _mm_xor_si128(mm12, mm22);

        __m128i mm = _mm_or_si128(mm1, mm2);

        if (!_mm_testz_si128(mm, mm))
            return 0;
    }

    //Check tail with scalar instructions
    const size_t rem = (size - tail_n) % 32;
    const char* tail_1 = (const char*)src1;
    const char* tail_2 = (const char*)src2;
    for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
    {
        if (*tail_1 != *tail_2)
            return 0;   
    }
    return 1;
}


I print the alignment of both pointers: one of them is aligned, but the second one is not. And yet the program works fine and fast.

Then I created a synthetic test like this:

//printChars128(...) just prints the 16 byte values from a __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
{
    __m128i A1 = _mm_load_si128(A);
    __m128i B1 = _mm_load_si128(B);
    printChars128(A1);
    printChars128(B1);
}


And it crashes, as expected, on the first iteration, when trying to load from pointer B.

Interestingly, if I switch the target to sse4.2, then my is_equal implementation fails.

Another interesting fact: if I try to align the second pointer instead of the first (so that the first pointer is unaligned and the second is aligned), then is_equal crashes.

So my question is: "Why does the is_equal function work fine when only the first pointer is aligned, if I enable avx instruction generation?"

UPD: This is C++ code. I am compiling it with MinGW64/g++, gcc version 4.9.2, under Windows x86.

Compile line: g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe



1 answer


TL:DR: loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands for other instructions. The AVX versions of vector instructions do not require alignment for memory operands, except for the specifically-aligned load/store instructions such as vmovdqa.


In the legacy SSE encoding of vector instructions (for example, pxor xmm0, [src1]), unaligned 128-bit memory operands will fault, except with the special unaligned load/store instructions (such as movdqu / movups).

The VEX encoding of vector instructions (for example, vpxor xmm1, xmm0, [src1]) does not fault on unaligned memory, except for the alignment-required load/store instructions (such as vmovdqa or vmovntdq).


The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but do not force it to actually emit a stand-alone load instruction. (Or anything at all, if it already has the data in a register, just like dereferencing a scalar pointer.)
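
For example, here is a minimal sketch of that distinction (the function name and the pseudo-asm in the comments are my own illustrative assumptions, not output quoted from any particular compiler):

#include <immintrin.h>

__m128i add_both(const __m128i* aligned_p, const __m128i* unaligned_p)
{
    __m128i a = _mm_load_si128(aligned_p);    //promise to the compiler: aligned_p is 16-byte aligned
    __m128i u = _mm_loadu_si128(unaligned_p); //no alignment promise at all
    //With AVX code generation the whole function could plausibly become
    //(pseudo-asm, operand names kept for readability):
    //  vmovdqu xmm0, [unaligned_p]
    //  vpaddb  xmm0, xmm0, [aligned_p]
    //i.e. the "aligned" load is folded into vpaddb and never checked at run time.
    return _mm_add_epi8(a, u);
}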

The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that does not introduce the risk of a fault. This is advantageous for code-density reasons, and it also means fewer uops to track in some parts of the CPU, thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this is not enabled at -O0, so an unoptimized build of your code probably would have faulted with an unaligned src1.

(Conversely, this means _mm_loadu_* can only fold into a memory operand with AVX, but not with SSE. So even on CPUs where movdqu runs as fast as movdqa when the pointer happens to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end, while pxor xmm0, [rsi] is only 1, and the folded form does not need a separate register for the load. See also Micro fusion and addressing modes.)

The interpretation of the as-if rule in this case is that it is OK for the program not to fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build.)

This is the opposite of the rules for floating-point exceptions, where compiler-generated code must still raise any and all exceptions that would occur on the C abstract machine. That is because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.


Note that since stores cannot be folded into memory operands for ALU instructions, the store (not storeu) intrinsics will compile into code that faults on unaligned pointers even when compiling for an AVX target.
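
For instance, a minimal sketch (the function name is mine; the instructions named in the comments are what a compiler would typically emit for an AVX target such as -mavx):

#include <immintrin.h>

void store_both(void* aligned_dst, void* unaligned_dst, __m128i v)
{
    _mm_store_si128((__m128i*)aligned_dst, v);    //typically vmovdqa: faults if aligned_dst is not 16-byte aligned
    _mm_storeu_si128((__m128i*)unaligned_dst, v); //typically vmovdqu: never faults on misalignment
}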




To be specific: consider this piece of code:

// aligned version:
y = ...;                         // assume it in xmm1
x = _mm_load_si128(Aptr);        // Aligned pointer
res = _mm_or_si128(y, x);

// unaligned version: the same thing with _mm_loadu_si128(Uptr)


When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version has to use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might also do that, if the old value of y is still needed after the OR.

When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128-bit ones) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile into vpor xmm0, xmm1, [ptr]. (And the non-destructive 3-operand feature means this actually happens, except when the value loaded is used multiple times.)

Only one operand of an ALU instruction can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer is not aligned, but does not care about alignment for the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than the other way around.

You can see this happen in practice in your code on the Godbolt compiler explorer. Unfortunately, gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, instead of just branching on the flags from vptest :( clang-3.8 does a significantly better job.

.L36:
        add     rdi, 32
        add     rsi, 32
        cmp     rdi, rcx
        je      .L9
.L10:
        vmovdqa xmm0, XMMWORD PTR [rdi]           # first arg: loads that will fault on unaligned
        xor     eax, eax
        vpxor   xmm1, xmm0, XMMWORD PTR [rsi]     # second arg: loads that don't care about alignment
        vmovdqa xmm0, XMMWORD PTR [rdi+16]        # first arg
        vpxor   xmm0, xmm0, XMMWORD PTR [rsi+16]  # second arg
        vpor    xmm0, xmm1, xmm0
        vptest  xmm0, xmm0
        sete    al                                 # generate a boolean in a reg
        test    eax, eax
        jne     .L36                               # then test&branch on it.  /facepalm


Note that your is_equal is memcmp. I think glibc's memcmp will do better than your implementation in many cases, since its hand-written asm versions for SSE4.1 and others handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that the glibc code is LGPLed, so you cannot just copy it. If your use case has smaller buffers that are typically aligned, your implementation is probably fine. Not needing a VZEROUPPER before calling it from other AVX code is nice as well.

The compiler-generated byte loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It does not matter that you re-compare some bytes you have already checked.
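
A minimal sketch of that trick, assuming size >= 16 and reusing is_equal's src_1 / src_2 / size parameters (this would replace the scalar tail loop):

    //Compare the final 16 bytes of each buffer with one unaligned load each,
    //ending exactly at the last byte; overlapping already-checked bytes is harmless.
    const char* p1 = (const char*)src_1;
    const char* p2 = (const char*)src_2;
    __m128i last1 = _mm_loadu_si128((const __m128i*)(p1 + size - 16));
    __m128i last2 = _mm_loadu_si128((const __m128i*)(p2 + size - 16));
    __m128i diff  = _mm_xor_si128(last1, last2);
    if (!_mm_testz_si128(diff, diff))
        return 0;
    return 1;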

Anyway, definitely benchmark your code against the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it may inline code for.
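
For reference, benchmarking against the library call can be as simple as wrapping it (the wrapper name here is just an example):

#include <cstring>

int is_equal_memcmp(const void* src_1, const void* src_2, size_t size)
{
    return memcmp(src_1, src_2, size) == 0;
}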
