CUDA cudaMemcpy with a struct of arrays

I would like to clean up the arguments of the CUDA kernels in my project.


Right now the kernel requires three uint32_t arrays, which leads to some pretty ugly code (id means the global thread id and valX is an arbitrary value):

__global__ void some_kernel(uint32_t *arr1, uint32_t *arr2, uint32_t *arr3)
{
    arr1[id] = val1;
    arr2[id] = val2;
    arr3[id] = val3;
}

I would like to combine all these arrays with a structure:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3;
    uint32_t size;
} S;

where size denotes the length of each arrX within the structure.

What I would like to have is something like:

__global__ void some_kernel(S *s)
{
    s->arr1[id] = val1;
    s->arr2[id] = val2;
    s->arr3[id] = val3;
}


What would the cudaMalloc and cudaMemcpy calls look like for such a structure? Are there performance drawbacks that I am not seeing yet?

Thanks in advance!



1 answer


You have at least two options. An excellent one has already been suggested, but I will introduce you to the "learn it the hard way" approach.
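
For reference, here is a sketch of that first option, assuming the suggestion was to pass the structure by value as a kernel argument. The device pointers live inside the structure, and the structure itself travels through the kernel's parameter space, so you still cudaMalloc the three arrays, but you need no cudaMalloc and no pointer "binding" for S itself:

__global__ void some_kernel(S s) // note: by value, not S *
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < s.size)
    {
        s.arr1[id] = 1; // val1
        s.arr2[id] = 2; // val2
        s.arr3[id] = 3; // val3
    }
}

Now for the hard way.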

First, your structure definition:

typedef struct S {
    uint32_t *arr1;
    uint32_t *arr2;
    uint32_t *arr3; 
    uint32_t size;
} S;


... and a kernel definition (it uses a global constant for the size, but you don't have to follow this pattern):

const int size = 10000;

__global__ void some_kernel(S *s)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < size)
    {
        s->arr1[id] = 1; // val1
        s->arr2[id] = 2; // val2
        s->arr3[id] = 3; // val3
    }
}


Note that the if guard protects you from out-of-bounds accesses when the grid contains more threads than there are array elements.

Then we write a function that prepares the data, executes the kernel, and prints some results. The first part is memory allocation:



uint32_t *host_arr1, *host_arr2, *host_arr3;
uint32_t *dev_arr1, *dev_arr2, *dev_arr3;

// Allocate host data (zero-initialized by the () in new[])
host_arr1 = new uint32_t[size]();
host_arr2 = new uint32_t[size]();
host_arr3 = new uint32_t[size]();

// Allocate device data   
cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1));
cudaMalloc((void **) &dev_arr2, size * sizeof(*dev_arr2));
cudaMalloc((void **) &dev_arr3, size * sizeof(*dev_arr3));

// Allocate helper struct on the device
S *dev_s;
cudaMalloc((void **) &dev_s, sizeof(*dev_s));


It's nothing special: you just allocate three arrays and one structure. What's more interesting is how to copy such data to the device:

// Copy data from host to device
cudaMemcpy(dev_arr1, host_arr1, size * sizeof(*dev_arr1), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr2, host_arr2, size * sizeof(*dev_arr2), cudaMemcpyHostToDevice);
cudaMemcpy(dev_arr3, host_arr3, size * sizeof(*dev_arr3), cudaMemcpyHostToDevice);

// NOTE: Binding pointers with dev_s
cudaMemcpy(&(dev_s->arr1), &dev_arr1, sizeof(dev_s->arr1), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr2), &dev_arr2, sizeof(dev_s->arr2), cudaMemcpyHostToDevice);
cudaMemcpy(&(dev_s->arr3), &dev_arr3, sizeof(dev_s->arr3), cudaMemcpyHostToDevice);


Besides the usual array copies, which you already noticed, you also need to "bind" the arrays to the structure. To do this, you pass the address of each structure field as the destination and the address of the corresponding host-side pointer variable as the source. As a result, only the pointer values are copied, not the arrays they point to.
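
As an aside, a sketch of an equivalent one-shot approach: fill a host-side copy of S with the device pointers and copy the whole structure in a single transfer (host_s is my own name, not from the code above):

S host_s;
host_s.arr1 = dev_arr1;
host_s.arr2 = dev_arr2;
host_s.arr3 = dev_arr3;
host_s.size = size;
cudaMemcpy(dev_s, &host_s, sizeof(host_s), cudaMemcpyHostToDevice);

Either way, the array contents stay where they are; only the pointer values end up inside the device-side structure.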

Next come the kernel call, copying the results back to the host, and printing them:

// Call kernel
some_kernel<<<size/256 + 1, 256>>>(dev_s); // enough blocks of 256 threads to cover all size elements

// Copy result to host:
cudaMemcpy(host_arr1, dev_arr1, size * sizeof(*host_arr1), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr2, dev_arr2, size * sizeof(*host_arr2), cudaMemcpyDeviceToHost);
cudaMemcpy(host_arr3, dev_arr3, size * sizeof(*host_arr3), cudaMemcpyDeviceToHost);

// Print some result
std::cout << host_arr1[size-1] << std::endl;
std::cout << host_arr2[size-1] << std::endl;
std::cout << host_arr3[size-1] << std::endl;
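
For completeness, here is the matching cleanup, which the walkthrough above omits (a sketch, using the names declared earlier):

// Free device and host memory
cudaFree(dev_arr1);
cudaFree(dev_arr2);
cudaFree(dev_arr3);
cudaFree(dev_s);
delete[] host_arr1;
delete[] host_arr2;
delete[] host_arr3;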


Keep in mind that in any serious code you should always check the return value of every CUDA API call.
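
For example, a minimal checking macro (a common pattern; the CUDA_CHECK name is mine, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMalloc((void **) &dev_arr1, size * sizeof(*dev_arr1)));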
