Compile a single CUDA function at runtime

I have a function in my program called float valueAt(float3 v). It is supposed to return the value of a function at the given point. The function is user-specified. At the moment I have an interpreter for this function, but others have recommended compiling the function at runtime so that it runs as machine code and is faster.

How do I do this? I believe I know how to load the function once I have the PTX, but I have no idea how to generate the PTX.



2 answers

CUDA does not provide runtime compilation capabilities for non-PTX code.

What you want can be done, but not with the standard CUDA APIs. PyCUDA provides an elegant just-in-time compilation method for CUDA C code: behind the scenes it forks the toolchain to compile the device code and then loads it for you. The downside is that you need to use Python for the top level of your application, and if you ship code to third parties, you may need to ship a working Python distribution as well.

The only alternative I can think of is OpenCL, which supports runtime compilation (in fact, that was all it supported until recently). Its C99 language base is much more restrictive than what CUDA offers, and I find the APIs very verbose, but the runtime compilation model works well.



I've been thinking about this problem for a while, and although I don't think this is a "great" solution, it does seem to work, so I decided to share it.

The basic idea is to use Linux to spawn processes that compile the code and then run the compiled code. I think this is a bit tricky, but since I put the pieces together, I'll post the instructions here in case they help anyone else.

The problem posed in the question is to take a file that contains a user-defined function. Let's assume it is a function of a single variable f(x), i.e. y = f(x), and that x and y can be represented by float quantities.

The user edits a file named fx.txt that contains the desired function. This file must follow C syntax rules.




This file is then included in the __device__ function that will hold it (user_testfunc.cuh):

__device__ float fx(float x){
  float y;
#include "fx.txt"
  return y;
}


which is included in the kernel source file, where the kernel is called through a wrapper:

#include <math.h>
#include "cudalib.h"
#include "user_testfunc.cuh"

// Single-thread kernel: evaluates the user-defined fx() on the device.
__global__ void my_kernel(float x, float *y){

  *y = fx(x);
}

// Host wrapper: launches the kernel and copies the result back.
float cudalib_compute_fx(float x){
  float *d, h_d;
  cudaMalloc(&d, sizeof(float));
  my_kernel<<<1,1>>>(x, d);
  cudaMemcpy(&h_d, d, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);
  return h_d;
}



The wrapper is declared in the header cudalib.h:

float cudalib_compute_fx(float x);


The above files are built into a shared library:

nvcc -arch=sm_20 -Xcompiler -fPIC -shared -o


We need a main application to use this shared library.

#include <stdio.h>
#include <stdlib.h>
#include "cudalib.h"

int main(int argc, char* argv[]){

  if (argc == 1){
    // no arguments: recompile the lib, then spawn a new process that uses it
    int retval = system("nvcc -arch=sm_20 -Xcompiler -fPIC -shared -o");
    char scmd[128];
    snprintf(scmd, sizeof(scmd), "%s skip", argv[0]);
    retval = system(scmd);}
  else { // any argument: compute f(x) at x = 2.0
    printf("Result is: %f\n", cudalib_compute_fx(2.0));}
  return 0;
}


Which is compiled like this:

nvcc -arch=sm_20 -o t452 -L. -lmycudalib


At this point, the main application (t452) can be executed, and it will produce the result of f(2.0), which is 0.5 in this case:

Result is: 0.500000


The user can then modify the file fx.txt:


$ vi fx.txt
$ cat fx.txt
y = 5/x;


And just re-run the application and the new functional behavior will be used:

Result is: 2.500000


This method takes advantage of the fact that when a shared library is recompiled or replaced, a new Linux process will pick up the new shared library. Also note that I have omitted several kinds of error checking for clarity. At a minimum, I would check for CUDA errors, and I would also probably delete the shared object (.so) library before recompiling it, then test for its existence after compilation, as a basic check that the compilation succeeded.
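The delete-then-verify step described above could be sketched in C roughly as follows. The helper name rebuild_library and its two parameters are assumptions made for illustration, it relies on POSIX access(), and it does not cover CUDA error checking.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper (not part of the original answer).
 * Deletes a stale library, runs the given compile command, and checks
 * that the library file exists afterwards.
 * Returns 0 on success, -1 on any failure. */
int rebuild_library(const char *compile_cmd, const char *lib_path)
{
    /* Remove the old library. A file that was already missing is fine;
     * a file that still exists after a failed remove() is an error. */
    if (remove(lib_path) != 0 && access(lib_path, F_OK) == 0) {
        return -1;
    }

    /* Run the compiler; a non-zero status means the command failed. */
    if (system(compile_cmd) != 0) {
        return -1;
    }

    /* Basic check that compilation actually produced the library. */
    return access(lib_path, F_OK) == 0 ? 0 : -1;
}
```

In the main application, the bare system() call for nvcc could be replaced by a call to such a helper, spawning the second process only when it returns 0.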

This method relies entirely on the runtime API, so the user has to have the CUDA toolkit installed on their machine and set up so that nvcc is available in the PATH. Using the driver API with PTX code would make the process cleaner (and would not require the toolkit on the user's machine), but AFAIK there is no way to generate PTX from CUDA C without using nvcc or a user-created toolchain built on the NVIDIA LLVM compiler tools. In the future, there may be a more "integrated" approach available in the "standard" CUDA C toolchain, or perhaps even in the driver.

A similar approach can be arranged using separate compilation and linking of device code, so that the only source code that needs to be exposed to the user is the file that defines fx (and fx.txt).


EDIT: There is now a CUDA runtime compiler that should be used instead of the above.


