CUDA kernel with function pointer and variadic templates

I am trying to create a CUDA framework that accepts user-defined functions and forwards them to a kernel through device function pointers. CUDA can work with variadic templates (-std=c++11), and so far that's fine.

However, I ran into a problem when the kernel calls the device function through a pointer. The kernel apparently runs without issue, but GPU utilization is 0%. If I simply replace the callback pointer with the actual function, utilization is 99%. The code below is very simple, and the large loop ranges are there just to make things measurable. I measured the GPU status with:

nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv -lms 100 -f out.txt

IIRC, the user function must be in the same compilation unit as the kernel (possibly via #include) for nvcc to succeed. func_d sits right there in the source, and it compiles and runs fine, but it simply does not work through a function pointer (which is the whole point of this project).

My question is: Why is the kernel with a callback device function pointer not working?

Note that when I print the callback and func_d addresses, they match, as in this example output:

size of Args = 1
callback() address = 4024b0
func_d()   address = 4024b0

Another oddity: if you uncomment the callback() call in kernel(), GPU utilization goes back to 0% even though the func_d() call is still there... The func_d version takes about 4 seconds to run, whereas the callback version takes essentially no time (well, ~0.1 s).

The system specifications and compilation command are at the beginning of the code below.

Thanks!

// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014


#include <stdio.h>

__device__
void func_d(double* vol)
{
  *vol += 5.4321f;
}


// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
  double val0 = 1.2345f;

//  // does not use gpu (0% gpu utilization)
//  for ( int i = 0; i < 1000000; i++ ) {
//    callback( &val0 );
//  }

  // uses gpu (99% gpu utilization)
  for ( int i = 0; i < 10000000; i++ ) {
    func_d( &val0 );
  }
}


// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
  // get user kernel number of arguments.
  constexpr int I = sizeof...(Types);
  printf("size of Args = %d\n",I);

  printf("callback() address = %x\n",callback);
  printf("func_d()   address = %x\n",func_d);

  dim3 nblocks = 100;
  int nthread = 100;
  kernel<Types...><<<nblocks,nthread>>>( callback );
}


__host__
int main(int argc, char** argv)
{
  host_func(func_d);
}

      



1 answer


My question is, why is the kernel with a callback device function pointer not working?

There are probably several problems to address, but the simplest answer is that it is illegal to take the address of a device entity in host code. This is true for device variables as well as device functions. You can take the address of such an entity in host code, but the address is garbage: it is usable neither on the host nor on the device. If you attempt to use it anyway, you get undefined behavior on the device, which will usually cause your kernel to terminate abnormally.

Host addresses may be observed in host code, and device addresses may be observed in device code. Any other behavior requires API intervention.
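
One such API route, as a minimal sketch only (the __device__ variable func_d_ptr is illustrative and not part of the original code): publish the device function's address in a __device__ function-pointer variable, which is initialized in device code, then copy it to the host with cudaMemcpyFromSymbol. The value obtained on the host is a device address, meaningful only when handed back to device code, e.g. as a kernel parameter:

#include <stdio.h>

__device__ void func_d(double* vol)
{
  *vol += 5.4321;
}

// initialized at module load, i.e. in device code, so it holds a valid device address
__device__ void (*func_d_ptr)(double*) = func_d;

int main()
{
  void (*h_ptr)(double*) = NULL;
  // copy the device-side function pointer value into host memory
  cudaMemcpyFromSymbol(&h_ptr, func_d_ptr, sizeof(h_ptr));
  printf("device address of func_d = %p\n", (void*)h_ptr);
  // h_ptr can now be passed as a kernel argument and called in device code
  return 0;
}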



  • You seem to be using the nvidia-smi utilization query as a measure of whether the code executes correctly or not. I would suggest doing proper CUDA error checking instead, and you can also run your code with cuda-memcheck (a minimal error-checking sketch appears after this list).

  • "Why, then, does the address func_d

    match the address callback

    ?" As you are taking both addresses in host code and both addresses are garbage. To verify this, add a line to this very beginning at the very end of your kernel:

    if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d()   address = %x\n",func_d);
    
          

    and you will see that it prints something different from what is printed on the host.

  • "How about using your device?" As soon as the device encounters an error, the kernel exits and usage goes to zero. Hopefully this will explain this expression to you: "Another oddity is that if one of them cancels the callback () in the kernel (), then the GPU usage goes back to 0% even if the call to func_d () is still there ... "

  • "How can I fix this?" I don't know how to fix this. If you have a limited number of CUDA functions known at compile time that you want the user to be able to select, then the appropriate thing is probably to simply create an appropriate index and use that to select the function. If you really want to, you can run a pre / install kernel that will take the address of the functions you care about, and then you can pass those addresses back into the host code and use them in subsequent kernel calls as parameters, and that should let your mechanism work. But I don't see how this prevents the need for indexing through a set of predefined functions known at compile time. If the direction you're in iswhat do you want the user to be able to provide user defined functions at runtime i think it will be difficult for you to docurrently with the CUDA runtime API (I suspect this may change in the future.) I presented a rather mangled mechanism to try and do this here (read the whole question and answer; also informative). If, on the other hand, you are willing to use the CUDA driver API, then it should be possible, albeit to some extent involved, as this is exactly what is done very elegantly in PyCUDA, for example.

  • In the future, please check your code with proper error checking before drawing conclusions from utilization figures alone.
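
A minimal error-checking sketch of the sort suggested in the first bullet above (the macro name cudaCheck is illustrative, not from the original code):

#include <stdio.h>
#include <stdlib.h>

// wrap every CUDA runtime call and abort with a readable message on failure
#define cudaCheck(call)                                          \
  do {                                                           \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",            \
              cudaGetErrorString(err), __FILE__, __LINE__);      \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

// typical usage around a kernel launch:
//   kernel<Types...><<<nblocks,nthread>>>( callback );
//   cudaCheck(cudaGetLastError());        // catches launch errors
//   cudaCheck(cudaDeviceSynchronize());   // catches execution errors

And a minimal sketch of the compile-time indexing idea from the "How can I fix this?" bullet (func_a, func_b and the enum are illustrative, not from the original answer): the host passes a plain integer, and the function pointer is formed in device code, where taking a device function's address is legal:

#include <stdio.h>

__device__ void func_a(double* vol) { *vol += 1.0; }
__device__ void func_b(double* vol) { *vol *= 2.0; }

enum FuncId { FUNC_A = 0, FUNC_B = 1 };

__global__ void kernel(int func_id)
{
  // select among predefined functions known at compile time
  void (*f)(double*) = (func_id == FUNC_A) ? func_a : func_b;
  double val0 = 1.2345;
  f(&val0);
  if ((!threadIdx.x) && (!blockIdx.x)) printf("result = %f\n", val0);
}

int main()
{
  kernel<<<1,1>>>(FUNC_B);   // the user selects func_b by index
  cudaDeviceSynchronize();
  return 0;
}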

Here's a complete example demonstrating some of the ideas above. In particular, I am demonstrating, rather crudely, that the address of func_d can be taken in device code, passed back to the host, and then used as a kernel parameter to successfully select/call that device function.

$ cat t595.cu
// compiled with:
// nvcc -g -G -O0 -std=c++11 -arch=sm_20 -x cu sample.cpp
//
// Nvidia Quadro 6000 (compute capability 2.0)
// CUDA 6.5 (V6.5.12),
// Arch Linux, Nvidia driver 343.22-4, gcc 4.9.1
// Nov, 2014


#include <stdio.h>

__device__
void func_d(double* vol)
{
  if ((!threadIdx.x) && (!blockIdx.x)) printf("value = %f\n", *vol);
  *vol += 5.4321f;
}

template <typename... Types>
__global__ void setup_kernel(void (**my_callback)(Types*...)){
  *my_callback = func_d;
}

// CUDA kernel function
template <typename... Types>
__global__ void kernel( void (*callback)(Types*...) )
{
  double val0 = 1.2345f;

//  // does not use gpu (0% gpu utilization)
//  for ( int i = 0; i < 1000000; i++ ) {
  callback( &val0 );
//  }

  val0 = 0.0f;
// uses gpu (99% gpu utilization)
//  for ( int i = 0; i < 10000000; i++ ) {
    func_d( &val0 );
//  }
  if ((!threadIdx.x)&&(!blockIdx.x)) printf("in-kernel func_d()   address = %x\n",func_d);
}


// host function
template <typename... Types>
void host_func( void (*callback)(Types*...) )
{
// get user kernel number of arguments.
  constexpr int I = sizeof...(Types);
  printf("size of Args = %d\n",I);

  printf("callback() address = %x\n",callback);
  printf("func_d()   address = %x\n",func_d);

  dim3 nblocks = 100;
  int nthread = 100;
  unsigned long long *d_callback, h_callback;
  cudaMalloc(&d_callback, sizeof(unsigned long long));
  setup_kernel<<<1,1>>>((void (**)(Types*...))d_callback);
  cudaMemcpy(&h_callback, d_callback, sizeof(unsigned long long), cudaMemcpyDeviceToHost);
  kernel<Types...><<<nblocks,nthread>>>( (void (*)(Types*...))h_callback );
  cudaDeviceSynchronize();
}


__host__
int main(int argc, char** argv)
{
  host_func(func_d);
}
$ nvcc -std=c++11 -arch=sm_20 -o t595 t595.cu
$ cuda-memcheck ./t595
========= CUDA-MEMCHECK
size of Args = 1
callback() address = 4025dd
func_d()   address = 4025dd
value = 1.234500
value = 0.000000
in-kernel func_d()   address = 4
========= ERROR SUMMARY: 0 errors
$
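
Finally, a minimal sketch of the driver-API route mentioned in the "How can I fix this?" bullet, which is the approach PyCUDA automates. Assumptions (not part of the original answer): a PTX file named user_kernel.ptx containing a kernel named user_kernel exists, the program is linked against the driver library (-lcuda), and error checking is elided for brevity:

#include <cuda.h>
#include <stdio.h>

int main()
{
  cuInit(0);
  CUdevice dev;   cuDeviceGet(&dev, 0);
  CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

  // load code supplied at runtime, e.g. PTX generated on the fly
  CUmodule mod;   cuModuleLoad(&mod, "user_kernel.ptx");
  CUfunction fn;  cuModuleGetFunction(&fn, mod, "user_kernel");

  double val = 1.2345;
  CUdeviceptr d_val;
  cuMemAlloc(&d_val, sizeof(double));
  cuMemcpyHtoD(d_val, &val, sizeof(double));

  void* args[] = { &d_val };
  cuLaunchKernel(fn, 1,1,1, 1,1,1, 0, 0, args, 0);  // 1 block, 1 thread
  cuCtxSynchronize();

  cuMemcpyDtoH(&val, d_val, sizeof(double));
  printf("result = %f\n", val);

  cuMemFree(d_val);
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}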

      
