Intel OpenCL compiler: optimizing struct usage

I have a question about using structs in OpenCL on an Intel processor. My current kernel accesses two buffers using a struct like this:

struct pair {
    float first;
    float second;
};

inline float f(const struct pair param) {
    return param.first * param.second;
}

inline struct pair access_func(__global float const * const a, __global float const * const b, const int i) {
    struct pair res = {
            a[i],
            b[i]
    };
    return res;
}

// slow
__kernel ...(__global float const * const a, __global float const * const b)
{
 // ...

 x = f( access_func( a, b, i ) );

 // ...
}


When I change the kernel like this, it is much faster:

// fast
__kernel ...(__global float const * const a, __global float const * const b)
{
 // ...

 x = a[i] * b[i];

 // ...
}


Is there a way to get the Intel compiler to do this optimization? Perhaps the NVIDIA compiler already does it, since I see no difference in runtime between the two versions on the GPU.

Thanks in advance!





1 answer


The compiler cannot optimize the memory layout of your data, because buffers are shared between the OpenCL device and the host, and/or between multiple kernels on the OpenCL device. The most efficient layout depends on a kernel's access patterns, and those can differ from one kernel to the next.



You need to choose the right memory layout for your data; this is one of the hardest parts of GPU programming. Check the OpenCL optimization guide for each target you care about to see which layouts it prefers. Sometimes inefficient access patterns can be masked by copying from __global memory to __local memory and then working from the local copy.
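As an illustration of that last technique, here is a minimal OpenCL C sketch (an assumed kernel, not from the original post) in which each work-group stages its tile of the __global buffers into __local memory before reading it; it assumes the buffer length is a multiple of the work-group size.

```c
// Hypothetical kernel: each work-group copies its tile of a and b
// into local memory, synchronizes, then reads the local copies.
__kernel void mul_local(__global const float *a,
                        __global const float *b,
                        __global float *out,
                        __local float *la,
                        __local float *lb)
{
    const int gid = get_global_id(0);
    const int lid = get_local_id(0);

    // One coalesced load per buffer into local memory.
    la[lid] = a[gid];
    lb[lid] = b[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Subsequent accesses hit the fast local copy.
    out[gid] = la[lid] * lb[lid];
}
```

Note that this trivial kernel reads each element only once, so the staging buys nothing here; the pattern pays off when several work-items in a group reuse the same elements.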









