SIMD Property - Segmentation Fault
I am running the following code:
#include <emmintrin.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char** argv)
{
    float a[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
    float b[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
    float c[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
    __m128 *v_a = (__m128*)(a+1); // Trying to compute c[i] = a[i+1] * b[i];
    __m128 *v_b = (__m128*)(b);
    __m128 *v_c = (__m128*)(c);
    for (int i=0; i < 1; i++)
    {
        *v_c = _mm_mul_ps(*v_a,*v_b);
        v_a++;
        v_b++;
        v_c++;
    }
    for (int i=0; i<= 9;i++)
    {
        printf("%f\n",c[i]);
    }
    return 0;
}
and I get "Segmentation fault: 11" (on a Mac running OS X Mavericks).
When I remove the +1 from a and declare it like this:
__m128 *v_a = (__m128*)(a);
it works.
Now I am wondering about a few things:
- Why is this happening? I don't see any memory alignment issue that could lead to an access outside the allocated memory. If my understanding is wrong, please let me know what I missed.
- What conversion happens in (__m128 *)(a + 1)?
I am trying to understand how SIMD works, so any information you can share would help me understand why it behaves this way.
To expand on Corey Nelson's answer:
Each type has an alignment. An object of a given type "wants" an address that is a multiple of that alignment. For example, a float has an alignment of 4. This literally means that when you take the address of a float and convert it to an integer, you get a multiple of 4, because the compiler will never give a float an address that is not a multiple of 4.
In 32-bit x86, here are some examples of alignments: char = 1, short = 2, int = 4, long long = 4, float = 4, double = 4, void * = 4, SSE vector = 16. Alignments are always powers of 2.
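To check these numbers on your own machine, here is a minimal sketch, assuming a C11 compiler (for _Alignof) and an x86 target with SSE2. The helper names is_power_of_two and print_alignments are made up for illustration, and the values printed depend on your platform and ABI (on 64-bit x86, for instance, double and pointers are typically 8-byte aligned).

```c
#include <emmintrin.h>  /* __m128; assumes an x86 target with SSE2 */
#include <stddef.h>
#include <stdio.h>

/* Nonzero if n is a power of two, which every valid alignment is. */
int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: print the alignment the compiler uses for a few types. */
void print_alignments(void)
{
    printf("char      %zu\n", _Alignof(char));
    printf("short     %zu\n", _Alignof(short));
    printf("int       %zu\n", _Alignof(int));
    printf("long long %zu\n", _Alignof(long long));
    printf("float     %zu\n", _Alignof(float));
    printf("double    %zu\n", _Alignof(double));
    printf("void *    %zu\n", _Alignof(void *));
    printf("__m128    %zu\n", _Alignof(__m128));
}
```

Calling print_alignments() from main prints one line per type; the __m128 line is the one that matters for the crash discussed here.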
We can end up with a misaligned address if we cast a pointer to a pointer type with stricter (larger) alignment. That's what happens in your code when you cast float * (alignment 4) to __m128 * (alignment 16). The consequences of accessing (reading or writing) an object at a misaligned address can be nothing, a performance penalty, or a crash, depending on the processor architecture.
We can print the addresses of your vectors:
printf("%p %p %p\n", a, b, c);
or for more clarity, just their low 4 bits:
/* intptr_t comes from <stdint.h> */
printf("%ld %ld %ld\n", (long)((intptr_t)a & 0xF), (long)((intptr_t)b & 0xF), (long)((intptr_t)c & 0xF));
On my machine, this outputs 12 4 12, showing that the addresses are not multiples of 16, and therefore are not 16-byte aligned. (But note that they are all multiples of 4, because they are arrays of float, and floats must be 4-byte aligned.)
When you remove +1, your code no longer crashes. This is because you got "lucky" with the addresses: the floats only need to be aligned to a multiple of 4, but they happened to land on a multiple of 16. It's a time bomb! Tweak something in your code (say, introduce another variable) or change the optimization level and it will probably start crashing! You must explicitly align the variables.
So how do you align them? When you declare a variable, the compiler (not you) chooses the memory address where the variable will reside. It tries to pack the variables as close together as possible to avoid wasting space, but it still has to ensure that addresses are correctly aligned for their type.
One of the best ways to increase alignment is to use a union that includes a type with the alignment you need:
union vec {
float f[10];
__m128 v;
};
union vec av = {.f = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0}};
union vec bv = {.f = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0}};
union vec cv = {.f = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0}};
float *a = av.f;
float *b = bv.f;
float *c = cv.f;
printf("%ld %ld %ld\n", (long)((intptr_t)a & 0xF), (long)((intptr_t)b & 0xF), (long)((intptr_t)c & 0xF));
Now printf prints 0 0 0, because the compiler has chosen a 16-byte aligned address for each float[10].
gcc and clang also allow you to request alignment directly:
float a[] __attribute__ ((aligned (16))) = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
float b[] __attribute__ ((aligned (16))) = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
float c[] __attribute__ ((aligned (16))) = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
printf("%ld %ld %ld\n", (long)((intptr_t)a & 0xF), (long)((intptr_t)b & 0xF), (long)((intptr_t)c & 0xF));
This works too, but is less portable.
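If your compiler supports C11, the standard _Alignas specifier is another option that does not rely on a gcc/clang extension. This is only a sketch under that assumption; the array name a16 and the helper low4 are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: low 4 bits of an address (0 means 16-byte aligned). */
unsigned low4(const void *p)
{
    return (unsigned)((uintptr_t)p & 0xF);
}

/* C11 alignment specifier: asks for a 16-byte aligned array without
   using __attribute__. Assumes a C11 (or later) compiler. */
_Alignas(16) float a16[] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f,
                            6.0f, 7.0f, 8.0f, 9.0f, 10.0f};
```

With this declaration low4(a16) should be 0, just like the union and attribute versions above.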
That said, how about your +1:
__m128 *v_a = (__m128*)(a+1);
Assuming that a is 16-byte aligned and has type float*, a+1 adds sizeof(float) (which is 4) to the address, resulting in an address that is only 4-byte aligned. It is a hardware limitation that you cannot load or store between a merely 4-byte aligned address and an SSE register using the normal (aligned) instructions. It faults! Instead, you must use different (slower) instructions, such as those generated by _mm_loadu_ps.
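The address arithmetic can be checked directly. The following is a small sketch, not part of the original code; the helper name misalignment is made up, and it only reports how far an address lies past the previous multiple of a given alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes by which p lies past the
   previous multiple of align (0 means p is aligned). */
size_t misalignment(const void *p, size_t align)
{
    return (size_t)((uintptr_t)p % align);
}
```

For a 16-byte aligned float array a, misalignment(a + i, 16) cycles through 0, 4, 8, 12 as i increases, so only every fourth element is a valid target for an aligned SSE load or store.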
Ensuring proper alignment is one of the challenges of using SIMD instruction sets. You will often see SIMD algorithms process the first few elements using "normal" (scalar) code until they reach an address with the alignment the SIMD instructions require.
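As an illustration of that pattern, here is a minimal sketch that scales an array in place. It is not taken from the question; the function name scale_in_place is made up, and it assumes an x86 target with SSE and a pointer that is at least 4-byte aligned, as any valid float pointer is.

```c
#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Multiply every element of x[0..n) by k, in place. */
void scale_in_place(float *x, size_t n, float k)
{
    size_t i = 0;

    /* Scalar prologue: advance until x + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(x + i) & 0xF) != 0) {
        x[i] *= k;
        i++;
    }

    /* SIMD body: aligned loads and stores, four floats at a time. */
    __m128 vk = _mm_set1_ps(k);
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_load_ps(x + i);
        _mm_store_ps(x + i, _mm_mul_ps(v, vk));
    }

    /* Scalar epilogue: the remaining 0 to 3 elements. */
    for (; i < n; i++) {
        x[i] *= k;
    }
}
```

This works because only one array is involved. With several arrays at different offsets, as in the question, the prologue can align at most one of them, and the others still need unaligned loads.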
Alignment is not about how much space there is, but about where that space sits in memory. When people talk about alignment, they mean that the address must be evenly divisible by some number.
SSE requires load/store addresses to be 16-byte aligned. For example, you need an address like 0, 16, 32, etc., not 4, 20, or 36.
Variables get the appropriate alignment for their type. In this case a, b, and c will be aligned to at least 4 bytes, because that is the alignment float requires on your platform. The compiler could have given them a stricter alignment but (rightfully) did not, so when you cast to __m128* and dereference, you get a segfault.
Instead of dereferencing the pointers, use _mm_loadu_ps and _mm_storeu_ps, which allow unaligned access. Or, for better performance, fix the alignment.
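Applied to the computation in the question, that could look like the sketch below. The function name mul4_shifted is made up for illustration; it assumes a has at least 5 elements and b and c at least 4, and it handles a single group of four floats, like the one-iteration loop in the original code.

```c
#include <emmintrin.h>

/* Computes c[i] = a[i + 1] * b[i] for i = 0..3.
   Unaligned loads and stores are used, so none of the pointers
   needs to be 16-byte aligned. */
void mul4_shifted(const float *a, const float *b, float *c)
{
    __m128 va = _mm_loadu_ps(a + 1);      /* a[1..4] */
    __m128 vb = _mm_loadu_ps(b);          /* b[0..3] */
    _mm_storeu_ps(c, _mm_mul_ps(va, vb)); /* c[0..3] */
}
```

With the arrays from the question, c[0] through c[3] become 2, 6, 12, and 20, and the rest of c is left unchanged.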