What's the fastest way to run a multi-threaded SIMD operation?

Using intrinsics is a common way to SIMDize. For example, I can add eight integers with a single _mm256_add_epi32. That takes two _mm256_load_si256 before the add and one _mm256_store_si256 after it, like this:

__m256i vec1 = _mm256_load_si256((__m256i *)&A[0]); // almost 5 cycles
__m256i vec2 = _mm256_load_si256((__m256i *)&B[0]); // almost 5 cycles
__m256i vec3 = _mm256_add_epi32( vec1 , vec2); // almost 1 cycle
_mm256_store_si256((__m256i *)&C[0], vec3); // almost 5 cycles


That executes on a single processor core. My Core i7 has 8 logical cores (4 physical); I want to dispatch the operations to all the cores, like this:

int i_0, i_1, i_2, i_3, i_4, i_5, i_6, i_7; // these specify the offsets into the arrays
//core 0
__m256i vec1_0 = _mm256_load_si256((__m256i *)&A[i_0]);  
__m256i vec2_0 = _mm256_load_si256((__m256i *)&B[i_0]); 
__m256i vec3_0 = _mm256_add_epi32( vec1_0 , vec2_0); 
_mm256_store_si256((__m256i *)&C[i_0], vec3_0);

//core 1
__m256i vec1_1 = _mm256_load_si256((__m256i *)&A[i_1]);
__m256i vec2_1 = _mm256_load_si256((__m256i *)&B[i_1]);
__m256i vec3_1 = _mm256_add_epi32( vec1_1 , vec2_1);
_mm256_store_si256((__m256i *)&C[i_1], vec3_1);

//core 2
__m256i vec1_2 = _mm256_load_si256((__m256i *)&A[i_2]);
__m256i vec2_2 = _mm256_load_si256((__m256i *)&B[i_2]);
__m256i vec3_2 = _mm256_add_epi32( vec1_2 , vec2_2);
_mm256_store_si256((__m256i *)&C[i_2], vec3_2);

//core 3
__m256i vec1_3 = _mm256_load_si256((__m256i *)&A[i_3]);
__m256i vec2_3 = _mm256_load_si256((__m256i *)&B[i_3]);
__m256i vec3_3 = _mm256_add_epi32( vec1_3 , vec2_3);
_mm256_store_si256((__m256i *)&C[i_3], vec3_3);

//core 4
__m256i vec1_4 = _mm256_load_si256((__m256i *)&A[i_4]);
__m256i vec2_4 = _mm256_load_si256((__m256i *)&B[i_4]);
__m256i vec3_4 = _mm256_add_epi32( vec1_4 , vec2_4);
_mm256_store_si256((__m256i *)&C[i_4], vec3_4);

//core 5
__m256i vec1_5 = _mm256_load_si256((__m256i *)&A[i_5]);
__m256i vec2_5 = _mm256_load_si256((__m256i *)&B[i_5]);
__m256i vec3_5 = _mm256_add_epi32( vec1_5 , vec2_5);
_mm256_store_si256((__m256i *)&C[i_5], vec3_5);

//core 6
__m256i vec1_6 = _mm256_load_si256((__m256i *)&A[i_6]);
__m256i vec2_6 = _mm256_load_si256((__m256i *)&B[i_6]);
__m256i vec3_6 = _mm256_add_epi32( vec1_6 , vec2_6);
_mm256_store_si256((__m256i *)&C[i_6], vec3_6);

//core 7
__m256i vec1_7 = _mm256_load_si256((__m256i *)&A[i_7]);
__m256i vec2_7 = _mm256_load_si256((__m256i *)&B[i_7]);
__m256i vec3_7 = _mm256_add_epi32( vec1_7 , vec2_7);
_mm256_store_si256((__m256i *)&C[i_7], vec3_7);
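
(For concreteness, the offsets i_0 ... i_7 above would just be equal, vector-aligned chunks of the arrays; something like the following, where n is the total element count:)

// Illustrative only: split n elements into 8 contiguous chunks, each chunk
// size rounded down to a multiple of 8 ints so the 256-bit accesses stay
// 32-byte aligned (any leftover tail handled separately).
int chunk = (n / 8) & ~7;
int i_0 = 0 * chunk, i_1 = 1 * chunk, i_2 = 2 * chunk, i_3 = 3 * chunk;
int i_4 = 4 * chunk, i_5 = 5 * chunk, i_6 = 6 * chunk, i_7 = 7 * chunk;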


POSIX threads are available, and OpenMP could work here too, but creating and managing threads costs far more than the roughly 5+5+1 cycles of the operation itself. Since all the chunks are independent, I don't need any synchronization on shared memory. What's the fastest explicit way to implement this operation?

I'm working on general-purpose processors (GPPs), so GPUs are not the answer. I also want to implement a library, so a pure compiler-based solution is also a contender. The problem is big enough for multithreading, and for my research I can adjust the problem to fit the concept. I want to implement a library, compare it with other solutions such as OpenMP, and hopefully my library will be faster. Setup: GCC 6.3 / clang 3.8, Linux Mint, Skylake.

Thanks in advance.

+3




2 answers


If the problem is big, you have to multithread.

You can choose OpenMP or pthreads; they will give you similar levels of performance (pthreads may be slightly faster, but it is less portable and harder to maintain).

Your code will be limited by memory bandwidth, not by computation.

To achieve maximum throughput, you need to interleave independent memory operations using multithreading.

A very simple solution like

extern "C" void add(int* a, int* b, int* c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}


will probably give you acceptable performance on every system with any compiler.

In fact, letting the compiler optimize will likely give you good results, and it helps you keep the code readable.

But sometimes even the best compiler does not give satisfactory results (always check the generated assembly for your performance-critical sections).

Compilers sometimes need help, and sometimes you need to write the assembly yourself.

Here is the path I will follow to optimize this loop until I get the results I want.

First, there are the classic optimization tricks you can implement:

  1. Constness and aliasing:

Make the inputs const and prevent aliasing with the __restrict keyword:

extern "C" void add(int* __restrict a, const int* __restrict b, const int* __restrict c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}


This will help the compiler since it will know that a, b and c cannot alias each other.

  2. Alignment information:

Tell the compiler that your pointers are correctly aligned:

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        a[i] = b[i] + c[i];
    }
}


This will also help the compiler generate aligned vector loads (vmovdqa) instead of unaligned ones (vmovdqu).
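
That promise only holds if the buffers really are 32-byte aligned. A minimal sketch of such an allocation, using posix_memalign just like the pthread example further down (alloc_aligned_ints is an illustrative name):

#include <cstdlib>

// Allocate n ints on a 32-byte boundary so the aligned-load promise made to
// the compiler is actually true. Returns NULL on failure.
static int* alloc_aligned_ints(size_t n) {
    void* p = NULL;
    if (posix_memalign(&p, 32, n * sizeof(int)) != 0)
        return NULL;
    return (int*) p;
}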



  3. Unroll inner loops (if possible):

If you know that your problem size is a multiple of 256 bits, you can even unroll the inner loop:

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 8) {
        #pragma unroll
        for(int k = 0; k < 8; ++k)
            a[i+k] = b[i+k] + c[i+k];
    }
}


With this code, clang 4.0 generates pretty neat assembly:

...
 vmovdqu ymm0, ymmword ptr [rdx + 4*rcx]
 vpaddd  ymm0, ymm0, ymmword ptr [rsi + 4*rcx]
 vmovdqu ymmword ptr [rdi + 4*rcx], ymm0
...


For whatever reason, you will need to tweak the attributes and pragmas to get the same result with other compilers.
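
One alternative that both GCC and clang understand (my sketch, not part of the answer above) is __builtin_assume_aligned, which conveys the same alignment promise without the aligned pointer typedef:

extern "C" void add(int* __restrict a, const int* __restrict b, const int* __restrict c, int N) {
    // Tell GCC/clang the pointers are 32-byte aligned without changing their types.
    int* __restrict pa = (int*) __builtin_assume_aligned(a, 32);
    const int* __restrict pb = (const int*) __builtin_assume_aligned(b, 32);
    const int* __restrict pc = (const int*) __builtin_assume_aligned(c, 32);
    #pragma omp parallel for
    for(int i = 0; i < N; ++i) {
        pa[i] = pb[i] + pc[i];
    }
}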

  4. Intrinsics:

If you want to be sure you get exactly the assembly you want, you have to go down to intrinsics / assembly.

Something simple:

#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;

extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (c + i));
        _mm256_store_si256((__m256i*) (a + i), _mm256_add_epi32(vb, vc));
    }
}


  5. Non-temporal stores: As a final optimization, you can use the non-temporal hint on the store instruction, since no later iteration of the loop will read the value you just wrote:
#define RESTRICT __restrict

typedef __attribute__((aligned(32))) int* intptr;
extern "C" void add(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (c + i));
        _mm256_stream_si256((__m256i*) (a + i), _mm256_add_epi32(vb, vc));
    }
}


which gives you this assembly:

.L3:
        vmovdqa ymm0, YMMWORD PTR [rdx+rax]
        vpaddd  ymm0, ymm0, YMMWORD PTR [rsi+rax]
        vmovntdq        YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rcx, rax
        jne     .L3
        vzeroupper


If you are worried about the cmp instruction on every iteration, you can unroll more steps in the loop, but branch prediction does a pretty good job on modern CPUs.
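
If you do want to try it, here is a sketch of a variant that processes two vectors (16 ints) per iteration; it reuses the intptr / RESTRICT definitions from above, assumes N is a multiple of 16, and add_unrolled is just an illustrative name:

extern "C" void add_unrolled(intptr RESTRICT a, const intptr RESTRICT b, const intptr RESTRICT c, int N) {
    #pragma omp parallel for
    for(int i = 0; i < N; i += 16) {
        // Two independent load/add/store chains per iteration halve the loop overhead.
        __m256i vb0 = _mm256_load_si256((__m256i*) (b + i));
        __m256i vc0 = _mm256_load_si256((__m256i*) (c + i));
        __m256i vb1 = _mm256_load_si256((__m256i*) (b + i + 8));
        __m256i vc1 = _mm256_load_si256((__m256i*) (c + i + 8));
        _mm256_stream_si256((__m256i*) (a + i), _mm256_add_epi32(vb0, vc0));
        _mm256_stream_si256((__m256i*) (a + i + 8), _mm256_add_epi32(vb1, vc1));
    }
}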

[EDIT: add pthread] As stated above, pthreads are a bit more painful to manage... Here is a complete working example with pthreads:

#include <pthread.h>
#include <cstdlib>
#include <cstdio>
#include <immintrin.h>

typedef struct AddStruct {
    int *a, *b, *c;
    int N;
} AddStruct_t;

void* add(void* s);

int main() {
    const int N = 1024*1024*32; // out of cache
    int *a, *b, *c;
    int err;
    err = posix_memalign((void**) &a, 32, N*sizeof(int));
    err = posix_memalign((void**) &b, 32, N*sizeof(int));
    err = posix_memalign((void**) &c, 32, N*sizeof(int));
    for(int i = 0; i < N; ++i) {
        a[i] = 0;
        b[i] = 1;
        c[i] = i;
    }
    int slice = N / 8;
    pthread_t threads[8];
    AddStruct_t arguments[8];
    for(int i = 0; i < 8; ++i) {
        arguments[i].a = a + slice * i;
        arguments[i].b = b + slice * i;
        arguments[i].c = c + slice * i;
        arguments[i].N = slice;
    }

    for(int i = 0; i < 8; ++i) {
        if(pthread_create(&threads[i], NULL, add, &arguments[i])) {
            fprintf(stderr, "ERROR CREATING THREAD %d\n", i);
            abort();
        }
    }

    for(int i = 0; i < 8; ++i) {
        pthread_join(threads[i], NULL);
    }

    for(int i = 0; i < N; ++i) {
        if(a[i] != i + 1) {
            fprintf(stderr, "ERROR AT %d: expected %d, actual %d\n", i, i+1, a[i]);
            abort();
        }
    }

    fprintf(stdout, "OK\n");
}

void* add(void* v) {
    AddStruct_t* s = (AddStruct_t*) v;
    for(int i = 0; i < s->N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (s->b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (s->c + i));
        _mm256_stream_si256((__m256i*) (s->a + i), _mm256_add_epi32(vb, vc));
    }
    return NULL;
}
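
One detail worth mentioning (my addition; not strictly required here, since pthread_join already synchronizes): _mm256_stream_si256 issues weakly-ordered stores, so it does not hurt to drain them with a store fence before the worker returns. A variant of the worker above:

void* add_with_fence(void* v) {
    AddStruct_t* s = (AddStruct_t*) v;
    for(int i = 0; i < s->N; i += 8) {
        __m256i vb = _mm256_load_si256((__m256i*) (s->b + i));
        __m256i vc = _mm256_load_si256((__m256i*) (s->c + i));
        _mm256_stream_si256((__m256i*) (s->a + i), _mm256_add_epi32(vb, vc));
    }
    _mm_sfence(); // make the non-temporal stores globally visible before exiting
    return NULL;
}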


This code reaches 34 GB/s on my Xeon E5-1620 v3 with DDR4 at 2133 MHz, while the simple solution at the beginning gets 33 GB/s.

All this effort to save 3% :). But sometimes that 3% can be critical.

Note that the memory initialization should be done by the same core that will later do the computation (this matters especially on NUMA systems) to avoid page migrations.
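
A sketch of such first-touch initialization with OpenMP, reusing a, b, c and N from the example above (and assuming 8 OpenMP threads so the static chunks line up with the 8 pthread slices):

// Each thread initializes (first-touches) the chunk it will later compute on,
// so the pages get physically allocated on that thread's NUMA node.
#pragma omp parallel for schedule(static)
for(int i = 0; i < N; ++i) {
    a[i] = 0;
    b[i] = 1;
    c[i] = i;
}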

+4




The fastest way to implement this:

void add_ints(int *vec1, int *vec2, int *vec3, int n) {
    int i;
    #pragma simd
    for (i = 0; i < n; i++) {
        vec3[i] = vec1[i] + vec2[i];
    }
}


Whether it's better to "roll your own" deserves some investigation. But rolling your own can be more error prone... which slows down getting it implemented.

For simple tasks like this, one would expect the compiler's code generation to be sophisticated enough to figure out the fastest solution, and compilers are often good at finding fast solutions even for complex problems... and using #pragma hints helps them.

Secondly, for IO-bound problems like this one, I rarely find cases where "parallel SIMD" is faster than plain "SIMD" on a single core.
I usually hit just under 1600 MB/s of throughput, which seems pretty good for 1600 MHz memory.
Unless a GPU gives you more IO bandwidth than 1600 MB/s, you may be better off staying on a single host core and using the GPU only when more math per IO is required.



However, you can and should try this and see for yourself. (Yes... the following example is from the icc site.)

#pragma omp parallel for simd schedule(static,10)
for (i = 0; i < N; i++) {
    vec3[i] = vec1[i] + vec2[i];
}


Once you have the easy way working, you can measure how much better your hand-rolled version does against the compiler at -O3, on both single and multiple cores.
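
For example, a rough measurement sketch using omp_get_wtime (function and variable names are only placeholders), reporting effective bandwidth for two input streams plus one output stream:

#include <omp.h>
#include <cstdio>

// Time one call to add_ints and report the effective memory bandwidth.
void measure(int *vec1, int *vec2, int *vec3, int n) {
    double t0 = omp_get_wtime();
    add_ints(vec1, vec2, vec3, n);                 // the routine being measured
    double t1 = omp_get_wtime();
    double gbytes = 3.0 * n * sizeof(int) / 1e9;   // 2 reads + 1 write
    printf("%.2f GB/s\n", gbytes / (t1 - t0));
}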

Another choice for vector code is Cilk Plus. It is especially attractive if you come from a MATLAB or Fortran background, since its vector and matrix/array constructs are very similar.
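
For instance, with Cilk Plus array notation (this is only a sketch; it needs a compiler with Cilk Plus support, such as icc, or a GCC of that era with -fcilkplus), the whole loop collapses to a single array expression:

// Cilk Plus array notation: operate on whole array sections at once.
void add_ints_cilk(int *vec1, int *vec2, int *vec3, int n) {
    vec3[0:n] = vec1[0:n] + vec2[0:n];
}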

Basically, SIMD intrinsics were "in vogue" early on; now that compilers and OpenMP have caught up, using intrinsics seems best reserved for cases where the compiler cannot generate the vector machine code for you.

+2








