SSE alignment and strange behavior
I am trying to work with SSE and I am running into some strange behavior.
I wrote some simple code to compare two strings with SSE intrinsics, ran it, and it works. But later I realized that in my code one of the pointers is still not aligned, even though I am using the _mm_load_si128 instruction, which requires a pointer aligned on a 16-byte boundary.
//Compare two different, non-overlapping pieces of memory
#include <immintrin.h>
#include <cstdio>
#include <cstdint>
#include <cstddef>

__attribute__((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
    //Skip the head so that pointer [head_1] becomes aligned
    const char* head_1 = (const char*)src_1;
    const char* head_2 = (const char*)src_2;
    size_t tail_n = 0;
    while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
    {
        if (*head_1 != *head_2)
            return 0;
        head_1++, head_2++, tail_n++;
    }

    //Vectorized part: check equality of memory with SSE4.1 instructions
    //src1 - aligned, src2 - NOT aligned
    const __m128i* src1 = (const __m128i*)head_1;
    const __m128i* src2 = (const __m128i*)head_2;
    const size_t n = (size - tail_n) / 32;
    for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
    {
        printf("src1 align: %d, src2 align: %d\n",
               (int)((uintptr_t)src1 % 16), (int)((uintptr_t)src2 % 16));
        __m128i mm11 = _mm_load_si128(src1);
        __m128i mm12 = _mm_load_si128(src1 + 1);
        __m128i mm21 = _mm_load_si128(src2);
        __m128i mm22 = _mm_load_si128(src2 + 1);
        __m128i mm1 = _mm_xor_si128(mm11, mm21);
        __m128i mm2 = _mm_xor_si128(mm12, mm22);
        __m128i mm = _mm_or_si128(mm1, mm2);
        if (!_mm_testz_si128(mm, mm))
            return 0;
    }

    //Check the tail with scalar instructions
    const size_t rem = (size - tail_n) % 32;
    const char* tail_1 = (const char*)src1;
    const char* tail_2 = (const char*)src2;
    for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
    {
        if (*tail_1 != *tail_2)
            return 0;
    }
    return 1;
}
I print the alignment of both pointers: one of them is aligned, but the second is not. And yet the program works fine and fast.
Then I created a synthetic test like this:
//printChars128(...) just prints the 16 byte values from a __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
{
    __m128i A1 = _mm_load_si128(A);
    __m128i B1 = _mm_load_si128(B);
    printChars128(A1);
    printChars128(B1);
}
As expected, it crashes on the first iteration when trying to load from pointer B.
Interestingly, if I switch the target to sse4.2, then my is_equal implementation fails.
Another interesting fact: if I try to align the second pointer instead of the first (so that the first pointer is unaligned and the second is aligned), then is_equal crashes.
So my question is: why does the is_equal function work fine when only the first pointer is aligned, if I enable avx code generation?
UPD: This is C++ code. I compile it with MinGW64/g++, gcc version 4.9.2, under Windows x86.
Compile line: g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe
TL;DR: loads from _mm_load_* intrinsics can be folded (at compile time) into memory operands of other instructions. The AVX versions of vector instructions don't require alignment for memory operands, except for the explicitly-aligned load/store instructions such as vmovdqa.
In the legacy SSE encoding of vector instructions (e.g. pxor xmm0, [src1]), an unaligned 128-bit memory operand will fault, except with the special unaligned load/store instructions (e.g. movdqu / movups).
The VEX encoding of vector instructions (e.g. vpxor xmm1, xmm0, [src1]) doesn't fault on unaligned memory, except with the explicitly-aligned load/store instructions (e.g. vmovdqa or vmovntdq).
The _mm_loadu_si128 vs. _mm_load_si128 (and store/storeu) intrinsics communicate alignment guarantees to the compiler, but don't force it to actually emit a stand-alone load instruction. (Or to emit anything at all if it already has the data in a register, just like when dereferencing a scalar pointer.)
The as-if rule still applies when optimizing code that uses intrinsics. A load can be folded into a memory operand for the vector-ALU instruction that uses it, as long as that doesn't introduce a fault. This is advantageous for code-density reasons, and also means fewer uops to track in some parts of the CPU thanks to micro-fusion (see Agner Fog's microarch.pdf). The optimization pass that does this isn't enabled at -O0, so an un-optimized build of your code would probably have faulted with an unaligned src1.
(Conversely, this means that _mm_loadu_* can only fold into a memory operand with AVX, but not with SSE. So even on CPUs where movdqu runs as fast as movdqa when the pointer happens to be aligned, _mm_loadu can hurt performance, because movdqu xmm1, [rsi] / pxor xmm0, xmm1 is 2 fused-domain uops for the front-end, while pxor xmm0, [rsi] is only 1. And no scratch register is needed. See also Micro fusion and addressing modes.)
The interpretation of the as-if rule in this case is that the program is allowed not to fault in some cases where the naive translation into asm would have faulted. (Or for the same code to fault in an un-optimized build but not fault in an optimized build.)
This is the opposite of the rules for floating-point exceptions, where compiler-generated code must still raise any and all exceptions that would occur on the C abstract machine. That's because there are well-defined mechanisms for handling FP exceptions, but not for handling segfaults.
Note that since stores can't be folded into memory operands for ALU instructions, the store (not storeu) intrinsics will compile to code that faults on unaligned pointers even when compiling for an AVX target.
To be specific: consider this piece of code:
// aligned version:
y = ...;                     // assume it's in xmm1
x = _mm_load_si128(Aptr);    // aligned pointer
res = _mm_or_si128(y, x);

// unaligned version: the same thing with _mm_loadu_si128(Uptr)
When targeting SSE (code that can run on CPUs without AVX support), the aligned version can fold the load into por xmm1, [Aptr], but the unaligned version must use movdqu xmm0, [Uptr] / por xmm0, xmm1. The aligned version might also do that, if the old value of y is still needed after the OR.
When targeting AVX (gcc -mavx, or gcc -march=sandybridge or later), all vector instructions emitted (including 128-bit ones) will use the VEX encoding. So you get different asm from the same _mm_... intrinsics. Both versions can compile to vpor xmm0, xmm1, [ptr]. (And the non-destructive 3-operand feature means this actually happens, except when the loaded value is reused multiple times.)
Only one operand of an ALU instruction can be a memory operand, so in your case one has to be loaded separately. Your code faults when the first pointer isn't aligned but doesn't care about the alignment of the second, so we can conclude that gcc chose to load the first operand with vmovdqa and fold the second, rather than the other way around.
You can see this happen in practice with your code on the Godbolt compiler explorer. Unfortunately, gcc 4.9 (and 5.3) compile it to somewhat sub-optimal code that generates the return value in al and then tests it, rather than just branching on the flags from vptest. :( clang-3.8 does a significantly better job.
.L36:
        add     rdi, 32
        add     rsi, 32
        cmp     rdi, rcx
        je      .L9
.L10:
        vmovdqa xmm0, XMMWORD PTR [rdi]           # first arg: loads that will fault on unaligned
        xor     eax, eax
        vpxor   xmm1, xmm0, XMMWORD PTR [rsi]     # second arg: loads that don't care about alignment
        vmovdqa xmm0, XMMWORD PTR [rdi+16]        # first arg
        vpxor   xmm0, xmm0, XMMWORD PTR [rsi+16]  # second arg
        vpor    xmm0, xmm1, xmm0
        vptest  xmm0, xmm0
        sete    al                                # generate a boolean in a reg
        test    eax, eax
        jne     .L36                              # then test&branch on it.  /facepalm
Note that your is_equal is memcmp. I think glibc's memcmp will do better than your implementation in many cases, since its hand-written asm versions for SSE4.1 and later handle various cases of the buffers being misaligned relative to each other (e.g. one aligned, one not). Note that the glibc code is LGPLed, so you can't just copy it. If your use case has mostly small buffers that are typically aligned, your implementation is probably fine. Not needing a VZEROUPPER before calling it from other AVX code is also nice.
The compiler-generated byte loop to clean up at the end is definitely sub-optimal. If the size is bigger than 16 bytes, do an unaligned load that ends at the last byte of each src. It doesn't matter that you re-compare some bytes you've already checked.
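Here is a sketch of that idea (my own code, assuming size >= 16): compare one final unaligned 16-byte chunk that ends exactly at the last byte of each buffer, instead of looping over the tail byte by byte.

#include <immintrin.h>
#include <cstddef>

// Assumes size >= 16; unaligned loads are fine here.
__attribute__((target("avx")))
int tail_equal_16(const char* a, const char* b, size_t size)
{
    __m128i va = _mm_loadu_si128((const __m128i*)(a + size - 16));
    __m128i vb = _mm_loadu_si128((const __m128i*)(b + size - 16));
    __m128i diff = _mm_xor_si128(va, vb);
    return _mm_testz_si128(diff, diff);   // 1 if all 16 bytes are equal
}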
Anyway, definitely benchmark your code against the system memcmp. Besides the library implementation, gcc knows what memcmp does and has its own builtin definition that it may inline code for.
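For reference, the memcmp-based version to benchmark against is just this (the wrapper name is my own):

#include <cstring>
#include <cstddef>

int is_equal_memcmp(const void* src_1, const void* src_2, size_t size)
{
    return std::memcmp(src_1, src_2, size) == 0;
}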