Apply this function to a 256-bit vector using the SIMD paradigm

Is there a way to evaluate a function along a __m256d/__m256 vector? Like this:

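#include <immintrin.h>

// Desired: ordinary arithmetic operators applied lane-wise to the vector.
inline __m256d func(__m256d a, __m256d b)
{
    return 1.0 / ((a + b) * (a + b));
}

int main()
{
    // _mm256_set_pd takes doubles, so use double literals (no f suffix).
    __m256d a = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d b = _mm256_set_pd(1.0, 2.0, 3.0, 4.0);
    __m256d c = func(a, b);
    (void)c; // result unused in this minimal example

    return 0;
}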

      

I would like to evaluate any math function using the SIMD paradigm. If that's not possible, wouldn't that be the biggest limitation of SIMD versus GPGPU programming? I ask because I noticed that the peak processing power of CPUs, in terms of FLOPS, is close to that of GPUs. Some comparisons:

  • Nvidia Quadro K6000 ~ 5196 GFLOPS
  • Nvidia Quadro K5000 ~ 2169 GFLOPS
  • Intel Xeon E5-2699 v3 ~ 1728 GFLOPS (18 cores * 32 FLOP / cycle * 3 GHz)

Future guesses:

  • AVX-512 and a likely 20-core Xeon CPU ~ 3840 GFLOPS (20 cores * 64 FLOP / cycle * 3 GHz)

  • Knights Landing ~ 5907 GFLOPS (71 cores * 64 FLOP / cycle * 1.3 GHz)
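
The CPU figures above all come from the same product: cores * FLOP per cycle * clock in GHz. A small sketch of that arithmetic, where the function name peak_gflops is made up for illustration and the result is a theoretical peak, not a measured number:

```cpp
// Theoretical peak in GFLOPS = cores * FLOP per cycle per core * clock (GHz).
// Hypothetical helper for illustration; real throughput is usually lower.
constexpr double peak_gflops(int cores, int flop_per_cycle, double ghz)
{
    return cores * flop_per_cycle * ghz;
}

// Inputs taken from the list above.
constexpr double xeon_e5_2699_v3 = peak_gflops(18, 32, 3.0);
constexpr double avx512_guess    = peak_gflops(20, 64, 3.0);
constexpr double knights_landing = peak_gflops(71, 64, 1.3);
```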

1 answer


Your question is very interesting. What you describe is not something every compiler provides out of the box: GCC and Clang accept arithmetic operators on __m256d as a vector extension, but MSVC does not. If you overload the basic operators for a type that wraps a 256-bit vector, you should be able to get close to the functionality you want.
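
A minimal sketch of that operator-overloading idea, assuming C++ and an AVX-capable CPU. The wrapper type Vec4d and the helper func4 are hypothetical names introduced here for illustration, not part of any library. The #pragma guard at the top is only there so the sketch builds on GCC without -mavx; normally you would pass -mavx (or /arch:AVX on MSVC) instead.

```cpp
// Sketch only: on GCC, enable AVX for this file when -mavx was not passed.
#if defined(__GNUC__) && !defined(__clang__) && !defined(__AVX__)
#pragma GCC target("avx")
#endif
#include <immintrin.h>

// Hypothetical wrapper (not a library type) around four packed doubles.
struct Vec4d {
    __m256d v;
    Vec4d(__m256d x) : v(x) {}
    explicit Vec4d(double s) : v(_mm256_set1_pd(s)) {}  // broadcast a scalar
    void store(double* out) const { _mm256_storeu_pd(out, v); }
};

// Lane-wise arithmetic expressed through the AVX intrinsics.
inline Vec4d operator+(Vec4d a, Vec4d b) { return _mm256_add_pd(a.v, b.v); }
inline Vec4d operator*(Vec4d a, Vec4d b) { return _mm256_mul_pd(a.v, b.v); }
inline Vec4d operator/(Vec4d a, Vec4d b) { return _mm256_div_pd(a.v, b.v); }

// The function from the question, written with ordinary operators.
inline Vec4d func(Vec4d a, Vec4d b)
{
    const Vec4d s = a + b;
    return Vec4d(1.0) / (s * s);
}

// Hypothetical helper: evaluate func on four (a, b) pairs read from memory.
inline void func4(const double* a, const double* b, double* out)
{
    func(Vec4d(_mm256_loadu_pd(a)), Vec4d(_mm256_loadu_pd(b))).store(out);
}
```

The wrapper struct is used because GCC and Clang do not allow overloading operators directly on __m256d, which is not a class type there.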

However, I would not say that this is the biggest limitation of SIMD compared with GPGPU programming. The main advantage of GPGPU is the FLOPS count, but it comes at a cost. For one, GPGPUs don't handle branches very well, don't work well with threads that deal with large local data, and so on. Another limitation is that the GPGPU programming model is quite complex compared to traditional coding.



On the CPU, you can run more general code, and the compiler can often vectorize it automatically without requiring the programmer to write intrinsics by hand.
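
For instance, a plain scalar loop like the following sketch is a typical auto-vectorization candidate. The function name inv_square_sum is made up for illustration, and whether the loop is actually vectorized depends on the compiler and flags (for example -O3 -mavx on GCC or Clang), so check the generated assembly or the compiler's vectorization report rather than assuming it.

```cpp
#include <cstddef>

// Plain scalar code: out[i] = 1 / (a[i] + b[i])^2 for every element.
// No intrinsics are used; an optimizing compiler may turn this loop into
// packed SIMD instructions on its own. Because the pointers could alias,
// it may also emit a runtime overlap check with a scalar fallback.
void inv_square_sum(const double* a, const double* b, double* out,
                    std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        const double s = a[i] + b[i];
        out[i] = 1.0 / (s * s);
    }
}
```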

So I would go as far as to say that simple code is actually an advantage of CPUs. Consider the amount of effort required to port 20 years of Fortran software to GPGPU. By contrast, if you have a good compiler and a good processor (with a good amount of FLOPS), you can get the performance you expect.
