CUDA: how to return device lambda from host function

I have a virtual function that returns a different lambda depending on the derived class:

class Base
{
public:
    virtual std::function<float()> foo(void) = 0;
};

class Derived : public Base
{
public:
    std::function<float()> foo(void) {
        return [] __device__ (void) {
            return 1.0f;
        };
    }
};


Then I want to pass this lambda to a CUDA kernel and call it from the device. In other words, I want to do this:

template<typename Func>
__global__ void kernel(Func f) {
    f();
}

int main(int argc, char** argv)
{
    Base* obj = new Derived;
    kernel<<<1, 1>>>(obj->foo());
    cudaDeviceSynchronize();
    return 0;
}


The above gives an error like this: calling a __host__ function("std::function<float ()> ::operator ()") from a __global__ function("kernel< ::std::function<float ()> > ") is not allowed

As you can see, I am declaring my lambda as __device__, but the method foo() stores it in a std::function in order to return it. As a result, what gets passed to kernel() is a host address, and of course it doesn't work. This is probably my problem, right? So my questions are:

  • Is it possible to create a __device__ std::function and return it from the method foo()?

  • If this is not possible, is there another way to dynamically select a lambda and pass it to the CUDA kernel? Hardcoding multiple calls to kernel() with all possible lambdas is not an option.

So far, from the quick research I have done, it seems CUDA does not provide the syntax required for a host function to return a device lambda. I just hope I'm wrong. :) Any ideas?

Thank you in advance



2 answers


Before answering this question, I have to wonder whether your question isn't an XY problem. That is, I'm skeptical by default that people have a good reason to execute code via lambdas / function pointers on the device.

But I will not shy away from such a question ...

Is it possible to create a __device__ std::function and return it from the foo() method?

Short answer: No, try something else.

Longer answer: if you were willing to implement a large chunk of the standard library on the device side, perhaps you could write a std::function-like class for device code. But I'm not sure that's even possible (probably not), and in any case it's beyond the abilities of anyone but very experienced library developers. So, do something else.
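
As an aside, here is a rough sketch of what "something else" can look like (my own illustration, not code from the question). The real obstacle is not the __device__ lambda itself but the type erasure into std::function: if the lambda keeps its concrete type all the way into the kernel template, it can be called on the device. This assumes you compile with nvcc's --extended-lambda flag (--expt-extended-lambda on older toolkits):

#include <cstdio>

// Generic kernel: Func is the lambda's concrete, compiler-generated type.
template <typename Func>
__global__ void kernel(Func f) {
    printf("%f\n", f());
}

int main() {
    // Extended __device__ lambda, passed by value to the kernel.
    auto f = [] __device__ () { return 1.0f; };
    kernel<<<1, 1>>>(f);
    cudaDeviceSynchronize();
    return 0;
}

The limitation is exactly the one the question runs into: the lambda's concrete type cannot be returned from a virtual function, because a virtual function needs a single fixed return type such as std::function.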



If this is not possible, is there another way to dynamically select the lambda and pass it to the CUDA kernel? Hardcoding multiple calls to kernel() with all possible lambdas is not an option.

First, remember that lambdas are essentially anonymous classes, and therefore, if they don't capture anything, they boil down to function pointers, since anonymous classes have no data, only an operator().

So if the lambdas have the same signature and capture nothing, you can convert them to a (non-member) function pointer and pass that to the kernel; this definitely works, see this simple example on the nVIDIA forums.
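
To make that concrete, here is a minimal sketch of the device-function-pointer route (my own illustration; the function names are made up). The subtlety is that the address of a __device__ function is only meaningful in device code, so the usual pattern is to store it in a __device__ variable and copy its value back to the host with cudaMemcpyFromSymbol before passing it to the kernel:

#include <cstdio>

// Common signature shared by all candidate device functions.
typedef float (*FloatFunc)();

__device__ float return_one() { return 1.0f; }
__device__ float return_two() { return 2.0f; }

// Initialized in device code, so these hold valid device addresses.
__device__ FloatFunc d_one = return_one;
__device__ FloatFunc d_two = return_two;

__global__ void kernel(FloatFunc f) {
    printf("%f\n", f());   // indirect call on the device
}

int main() {
    FloatFunc h_f;
    // Fetch the device-side pointer value to the host...
    cudaMemcpyFromSymbol(&h_f, d_one, sizeof(FloatFunc));
    // ...and pass it by value to the kernel, which calls it on the device.
    kernel<<<1, 1>>>(h_f);
    cudaDeviceSynchronize();
    return 0;
}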

Another possibility is to use a runtime mapping from type identifiers, or other such keys, to instances of those types, or rather to their constructors; that is, a factory. But I don't want to go into the details of that, so as not to make this answer longer than it already is; and it's probably not a good idea anyway.
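
For completeness, a rough sketch of what such runtime selection could look like, built on the function-pointer sketch above rather than a full object factory (the key names and the select_op helper are made up for illustration):

#include <string>

// Host-side "factory": maps a runtime key to a device function pointer.
// FloatFunc, d_one and d_two are the symbols from the previous sketch.
FloatFunc select_op(const std::string& name) {
    FloatFunc f = nullptr;
    if (name == "one")
        cudaMemcpyFromSymbol(&f, d_one, sizeof(FloatFunc));
    else if (name == "two")
        cudaMemcpyFromSymbol(&f, d_two, sizeof(FloatFunc));
    return f;
}

// Usage, with the operation chosen at run time instead of hardcoded:
//   kernel<<<1, 1>>>(select_op(argv[1]));
//   cudaDeviceSynchronize();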



While I don't think you can achieve what you want using virtual functions that return device lambdas, you can achieve something similar by passing a static device member function as a template parameter to the kernel. An example is shown below. Note that the classes in this example could also be structs if you prefer.



#include <cstdlib>   // rand, RAND_MAX
#include <iostream>

// Operation: Element-wise logarithm
class OpLog {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = logf(x[tid]);
    };
};

// Operation: Element-wise exponential
class OpExp {
    public:
    __device__ static void foo(int tid, float * x) {
        x[tid] = expf(x[tid]);
    }
};

// Generic kernel
template < class Op >
__global__ void my_kernel(float * x) {
    int tid = threadIdx.x;
    Op::foo(tid,x);
}

// Driver
int main() {

    using namespace std;

    // length of vector
    int len = 10;

    // generate data
    float * h_x = new float[len];
    for(int i = 0; i < len; i++) {
        h_x[i] = rand()/float(RAND_MAX);
    }

    // inspect data
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // copy onto GPU
    float * d_x;
    cudaMalloc(&d_x, len*sizeof(float));
    cudaMemcpy(d_x, h_x, len*sizeof(float), cudaMemcpyHostToDevice);

    // Take the element-wise logarithm
    my_kernel<OpLog><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;

    // Take the element-wise exponential
    my_kernel<OpExp><<<1,len>>>(d_x);

    // get result
    cudaMemcpy(h_x, d_x, len*sizeof(float), cudaMemcpyDeviceToHost);
    cout << "h_x = [";
    for(int j = 0; j < len; j++) {
        cout << h_x[j] << " ";
    }
    cout << "]" << endl;


    // clean up
    cudaFree(d_x);
    delete[] h_x;

    return 0;
}

