CUDA C++11: array of lambdas, call function by index, not working

I am having trouble getting a CUDA program to call lambdas from an array by their index. Here is sample code that reproduces the problem:

 #include <cuda.h>
 #include <vector>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <sys/time.h>
 #include <cassert>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
     if (code != cudaSuccess) {
         fprintf(stderr,"GPUassert: %s %s %d\n",
         cudaGetErrorString(code), file, line);
         if (abort) exit(code);
     }   
 }

 template<typename Lambda>
 __global__ void kernel(Lambda f){ 
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: ", t); 
     printf("f() = %i\n", f() );
 }

 int main(int argc, char **argv){
     // arguments
     if(argc != 2){ 
         fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
         exit(EXIT_FAILURE);
     }   
     int i = atoi(argv[1]);


     // lambdas
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };


     // make vector of functions
     std::vector<int(*)()> v;
     v.push_back(lam0);
     v.push_back(lam1);


     // host: calling a function by index
     printf("host: f() = %i\n", (*v[i])() );


     // device: calling a function by index
     kernel<<< 1, 1 >>>( v[i] ); // does not work
     //kernel<<< 1, 1 >>>( lam0 ); // does work
     gpuErrchk( cudaPeekAtLastError() );
     gpuErrchk( cudaDeviceSynchronize() );
     return EXIT_SUCCESS;
 }


Compiling with

nvcc -arch sm_60 -std=c++11 --expt-extended-lambda main.cu -o prog


Running it produces this error:

➜  cuda-lambda ./prog 0
host: f() = 333
device: GPUassert: invalid program counter main.cu 53


It seems that CUDA cannot properly handle the int (*)() function pointer form (while the host C++ code works correctly). On the other hand, each lambda is treated as a distinct data type, even when they are identical in code and have the same signature. How, then, can we achieve function-by-index in CUDA?


1 answer


There are several considerations here.

While you speak of "manipulating an array of lambdas", you are actually relying on the graceful conversion of a lambda to a function pointer (which is possible when the lambda does not capture anything).

When you mark something as __host__ __device__, you are declaring to the compiler that two copies of that item need to be compiled (with two distinctly different entry points): one for the CPU and one for the GPU.

When we take a __host__ __device__ lambda and ask it to decay to a function pointer, we are left with the question: "which function pointer (entry point) should be chosen?" The compiler no longer has the option of carrying the lambda object around, so it must pick one or the other (host or device, CPU or GPU) for your vector. Whichever one it picks, the vector could (and will) break if used in the wrong environment.
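To make this concrete, here is a minimal sketch (my own condensation of the question's failure mode, not code from the question; the kernel name call is mine) showing that the decay happens in host code and therefore yields the host entry point:

 #include <cstdio>

 template <typename F>
 __global__ void call(F f){
     printf("device: f() = %i\n", f());
 }

 int main(){
     auto lam = [] __host__ __device__ (){ return 333; };
     int (*fp)() = lam;       // decay happens in host code: fp holds the host entry point
     printf("host: f() = %i\n", fp());  // fine: we are on the host
     call<<< 1, 1 >>>( fp );  // broken: the kernel dereferences a host address
     cudaDeviceSynchronize(); // will report the invalid-program-counter error
     return 0;
 }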

One takeaway from this is that your two test cases are not the same. In one case (the broken one) you are passing a function pointer to the kernel (so the kernel is templated to accept a function pointer argument), and in the other case (the working one) you are passing a lambda to the kernel (so the kernel is templated to accept a lambda argument).

The problem here, in my view, does not arise simply from the use of a container, but from the type of container you are using. I can demonstrate this in a simple way (see below) by converting your vector to a vector of the actual lambda type. In that case we can make the code "work" (sort of), but since every lambda has a unique type, this is an uninteresting demonstration. We can create a multi-element vector, but the only element we can store in it is one of your two lambdas (not both at once).

If we use a container that can handle dissimilar types (e.g. std::tuple), perhaps we can make some progress here, but I know of no direct method for indexing through the elements of such a container at run time. Even if we could, the templated kernel that accepts a lambda as an argument/template type would have to be instantiated for each lambda, as the sketch below shows.
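Here is a minimal sketch of that tuple route (my own illustration, assuming the index is only known at run time); note how each case instantiates the templated kernel for a different lambda type:

 #include <cstdio>
 #include <cstdlib>
 #include <tuple>

 template <typename Lambda>
 __global__ void kernel(Lambda f){
     printf("device: f() = %i\n", f());
 }

 int main(int argc, char **argv){
     int i = atoi(argv[1]);
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };
     auto t = std::make_tuple(lam0, lam1); // heterogeneous container: holds both types
     // std::get<> needs a compile-time index, so a runtime index forces a branch,
     // and each branch instantiates kernel<> for a distinct lambda type
     switch(i){
         case 0: kernel<<< 1, 1 >>>( std::get<0>(t) ); break;
         case 1: kernel<<< 1, 1 >>>( std::get<1>(t) ); break;
     }
     cudaDeviceSynchronize();
     return 0;
 }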



In my view, function pointers allow us to avoid this particular type "messiness".

Hence, as an answer to this question:

Then how can we achieve the function by index in CUDA?

I would suggest that function-by-index in host code be handled separately (e.g. with two separate containers) from function-by-index in device code, and that for function-by-index in device code you use any of the techniques (which do not use or depend on lambdas) covered in other questions, such as this one.
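For concreteness, here is a minimal sketch of one such lambda-free technique (my own illustration, not taken from the linked question; the names f333, f777, and d_table are mine): a statically initialized table of __device__ function pointers, indexed inside the kernel:

 #include <cstdio>
 #include <cstdlib>

 typedef int (*fp_t)();

 __device__ int f333(){ return 333; }
 __device__ int f777(){ return 777; }

 // table of device function pointers, initialized during device-code compilation
 __device__ fp_t d_table[2] = { f333, f777 };

 __global__ void kernel(int i){
     printf("device: f() = %i\n", d_table[i]()); // the index is resolved on the device
 }

 int main(int argc, char **argv){
     int i = atoi(argv[1]);
     kernel<<< 1, 1 >>>( i );
     cudaDeviceSynchronize();
     return 0;
 }

Because the function addresses are taken during device-code compilation, no host-side conversion (and therefore no ambiguity about entry points) ever arises.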

Here is a worked example (I think) demonstrating the earlier note that we can create a vector of the lambda type, and use the resulting elements of that vector as lambdas in both host and device code:

$ cat t64.cu
 #include <cuda.h>
 #include <vector>
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
 #include <sys/time.h>
 #include <cassert>

 #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
 inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true){
     if (code != cudaSuccess) {
         fprintf(stderr,"GPUassert: %s %s %d\n",
         cudaGetErrorString(code), file, line);
         if (abort) exit(code);
     }
 }


 template<typename Lambda>
 __global__ void kernel(Lambda f){
     int t = blockIdx.x * blockDim.x + threadIdx.x;
     printf("device: thread %i: ", t);
     printf("f() = %i\n", f() );
 }

 template <typename T>
 std::vector<T> fill(T L0, T L1){
   std::vector<T> v;
   v.push_back(L0);
   v.push_back(L1);
   return v;
 }

 int main(int argc, char **argv){
     // arguments
     if(argc != 2){
         fprintf(stderr, "run as ./prog i\nwhere 'i' is function index");
         exit(EXIT_FAILURE);
     }
     int i = atoi(argv[1]);


     // lambdas
     auto lam0 = [] __host__ __device__ (){ return 333; };
     auto lam1 = [] __host__ __device__ (){ return 777; };

     auto v = fill(lam0, lam0); // lam0 and lam1 have different types; this vector can hold only one of them

     // make vector of functions
 //    std::vector< int(*)()> v;
 //    v.push_back(lam0);
 //    v.push_back(lam1);


     // host: calling a function by index
     printf("host: f() = %i\n", (*v[i])() );


     // device: calling a function by index
     kernel<<< 1, 1 >>>( v[i] ); // works now: v[i] is a lambda, not a function pointer
     //kernel<<< 1, 1 >>>( lam0 ); // does work
     gpuErrchk( cudaPeekAtLastError() );
     gpuErrchk( cudaDeviceSynchronize() );
     return EXIT_SUCCESS;
 }

$ nvcc -arch sm_61 -std=c++11 --expt-extended-lambda t64.cu -o t64
$ cuda-memcheck ./t64 0
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$ cuda-memcheck ./t64 1
========= CUDA-MEMCHECK
host: f() = 333
device: thread 0: f() = 333
========= ERROR SUMMARY: 0 errors
$


As mentioned above, this is not sensible code. It is contrived to prove a particular point.
