AVX data alignment: aligned store crashes, but storeu, load, and loadu don't

I am modifying RNNLM, a neural network for learning language models. However, given the size of my corpus, it is very slow, so I tried to optimize the matrix * vector routine (which accounts for 63% of the total time on a small dataset; I expect it to be worse on larger ones). Right now I am stuck with the intrinsics.

    // compute 8 elements of dest at a time
    for (b=0; b<(to-from)/8; b++) 
    {
        val = _mm256_setzero_ps();
        for (a=from2; a<to2; a++) 
        {
            // broadcast one element of the source vector
            t1 = _mm256_set1_ps(srcvec.ac[a]);
            // load 8 consecutive weights
            t2 = _mm256_load_ps(&(srcmatrix[a+(b*8+from+0)*matrix_width].weight));
            //val = _mm256_fmadd_ps(t1, t2, t3);
            t3 = _mm256_mul_ps(t1, t2);
            val = _mm256_add_ps(val, t3);
        }
        // add the accumulated partial sums into the destination vector
        t4 = _mm256_load_ps(&(dest.ac[b*8+from+0]));
        t4 = _mm256_add_ps(t4, val);
        _mm256_store_ps(&(dest.ac[b*8+from+0]), t4);
    }

      

This example fails:

    _mm256_store_ps(&(dest.ac[b*8+from+0]), t4);

However, if I change it to

    _mm256_storeu_ps(&(dest.ac[b*8+from+0]), t4);

(with u for unaligned, I suppose), everything works as intended. My question is: why does the aligned load work (when it shouldn't, if the data is unaligned) while the aligned store doesn't? (Furthermore, both operate on the same address.)

dest.ac has been allocated with

    // calloc-style allocator that returns memory aligned to `alignment` bytes
    void *_aligned_calloc(size_t nelem, size_t elsize, size_t alignment=64)
    {
        size_t max_size = (size_t)-1;

        // Watch out for overflow
        if(elsize == 0 || nelem >= max_size/elsize)
            return NULL;

        size_t size = nelem * elsize;
        void *memory = _mm_malloc(size+64, alignment);  // 64 bytes of extra padding
        if(memory != NULL)
            memset(memory, 0, size);
        return memory;
    }

and has a length of at least 50 elements. (BTW, with VS2012 I got an illegal instruction on some random assignment, so I am using Linux.)
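To check whether this really is an alignment problem, here is a minimal debugging sketch (the check_avx_alignment helper below is hypothetical, not part of RNNLM) that reports which addresses violate the 32-byte alignment _mm256_store_ps requires. Note that even with a 64-byte-aligned base, &dest.ac[b*8+from] is only 32-byte aligned if from itself is a multiple of 8.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical debugging helper: report addresses that do not meet the
       32-byte alignment required by _mm256_load_ps / _mm256_store_ps. */
    static void check_avx_alignment(const void *p, const char *what)
    {
        if (((uintptr_t)p % 32) != 0)
            fprintf(stderr, "%s (%p) is NOT 32-byte aligned\n", what, p);
    }

    /* usage inside the loop:
       check_avx_alignment(&(dest.ac[b*8+from+0]), "store address"); */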

Thank you in advance, Arkantus.



1 answer


TL:DR: In optimized code, loads will fold into memory operands for other operations, which have no alignment requirements in AVX. Stores won't.


Your code sample doesn't compile on its own, so I can't easily check what instruction _mm256_load_ps compiles to.

I tried a small experiment with gcc 4.9, and it doesn't generate a vmovaps at all for _mm256_load_ps, since I only used the load result as an input to one other instruction. It generates that instruction with a memory operand instead. AVX instructions have no alignment requirements for their memory operands. (There is a performance hit for crossing a cache line, and a bigger one for crossing a page boundary, but your code still works.)
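For illustration, here is a minimal standalone example of that kind (my own sketch, not the code from the question). Built with something like gcc -O2 -mavx, the "aligned" load typically folds into the memory operand of vmulps instead of becoming a separate vmovaps:

    #include <immintrin.h>

    /* The load result feeds exactly one other instruction, so the compiler
       is free to fold it into a memory operand (e.g. vmulps ymm, ymm, [mem]),
       which has no alignment requirement. */
    __m256 scale_row(const float *row, float s)
    {
        __m256 v = _mm256_load_ps(row);      /* may never appear as vmovaps */
        return _mm256_mul_ps(v, _mm256_set1_ps(s));
    }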



The store, on the other hand, does generate a vmov... instruction. Since you used the alignment-required version, it faults on unaligned addresses. Just use the unaligned version; it will be just as fast when the address is aligned, and it still works when it isn't.
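Applied to the loop in the question, only the read-modify-write of dest needs to change; a sketch, assuming the surrounding loop stays as written:

    /* unaligned load/store of dest, in case &dest.ac[b*8+from] is not
       32-byte aligned */
    t4 = _mm256_loadu_ps(&(dest.ac[b*8+from+0]));
    t4 = _mm256_add_ps(t4, val);
    _mm256_storeu_ps(&(dest.ac[b*8+from+0]), t4);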

I didn't check your code to make sure every access that SHOULD be aligned actually is. I assume not, from the way you phrased it to simply ask why you aren't also getting faults for unaligned loads. As I said, most likely your code just didn't compile to any vmovaps load instructions; otherwise the "aligned" AVX loads would also have faulted on unaligned addresses.

Are you running AVX (without AVX2 or FMA) on a Sandy Bridge / Ivy Bridge CPU? I assume that's why your FMA intrinsic is commented out.
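If you do move to a target with FMA (compiled with -mfma, for example), the commented-out line could replace the separate multiply and add; a sketch of the inner-loop body under that assumption:

    /* requires FMA; fuses the _mm256_mul_ps / _mm256_add_ps pair */
    t1 = _mm256_set1_ps(srcvec.ac[a]);
    t2 = _mm256_load_ps(&(srcmatrix[a+(b*8+from+0)*matrix_width].weight));
    val = _mm256_fmadd_ps(t1, t2, val);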
