OpenCL: duplicate a memory object on the device

Background: I have a kernel called "buildlookuptable" that does some calculations and stores its result in an int array called "dense_id".

Creating the cl_mem object:

cl_mem dense_id = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1) * sizeof(int), NULL, &err);
errWrapper("create Buffer", err);

      

Setting the kernel argument:

errWrapper("setKernel", clSetKernelArg(kernel_buildLookupTable, 5, sizeof(cl_mem), &dense_id));

      

dense_id is then used in other kernels. Due to the terrible memory allocation, I have a huge performance hit.

A subsequent kernel accesses dense_id as follows:

result_tuples += (dense_id[bucket+1] - dense_id[bucket]);

      

Runtime: 66 ms, no compiler-based vectorization.

However, if I change the line to:

result_tuples += (dense_id[bucket] - dense_id[bucket]);

      

Runtime: 2 ms, vectorized (4) by the compiler. Both kernels were run on a GeForce 660 Ti server.

So, if I remove the overlapping memory access, the speed increases significantly. Thread N then accesses only element N, so the accesses do not overlap.

To get correct results, I would like to duplicate the cl_mem object dense_id. The line in the next kernel would then be:

result_tuples += (dense_id1[bucket+1] - dense_id2[bucket]);

      

Here dense_id1 and dense_id2 are identical. Another idea would be to shift the contents of dense_id1 by one element, so that the kernel line would be:

result_tuples += (dense_id1[bucket] - dense_id2[bucket]);

      

Since dense_id is a small memory object, I'm sure I could improve the runtime by copying it, at the expense of some memory.

Question: After running the "buildlookuptable" kernel, I would like to duplicate the dense_id result array on the device side. The straightforward way would be to use clEnqueueReadBuffer on the host side to fetch dense_id, create a new cl_mem object, and write it back to the device. Is there a way to duplicate dense_id after "buildlookuptable" finishes without copying it to the host?

I can add more code here if needed. I've tried to include only the relevant parts, as I don't want to drown you in irrelevant code.



1 answer


I tried a solution using the clEnqueueCopyBuffer command, which works as desired. The solution to my problem:

clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 1, 0, (inCount1 + 1) * sizeof(int), 0, NULL, NULL);

      

This duplicates a memory object entirely on the device, without using another kernel. Note that the two offset arguments and the size are given in bytes, not in elements.

To do this, you must first create another cl_mem object from the host side:

cl_mem count_buffer3 = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);

      

As I had to wait for the copy to finish, I used



clFinish(command_queue);

      

to make the program wait for its completion.
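Because the offsets are byte counts, the shifted-copy idea from the question is easy to get wrong. The following is only a host-side sketch in plain C, not OpenCL code: `copy_bytes_model` and `make_shifted_copy` are hypothetical names, and `memcpy` stands in for the device copy. It assumes the same argument order as clEnqueueCopyBuffer (source offset, destination offset, size, all in bytes) and shows that shifting by one element means a source offset of `sizeof(int)` and a size that is one element shorter than the buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for a device-to-device buffer copy.
 * Copies `size` bytes from src + src_offset to dst + dst_offset.
 * All offsets and sizes are in bytes.
 * Returns 0 on success, -1 if the range would leave either buffer. */
static int copy_bytes_model(const void *src, size_t src_bytes,
                            void *dst, size_t dst_bytes,
                            size_t src_offset, size_t dst_offset,
                            size_t size)
{
    if (src_offset > src_bytes || size > src_bytes - src_offset) return -1;
    if (dst_offset > dst_bytes || size > dst_bytes - dst_offset) return -1;
    memcpy((char *)dst + dst_offset, (const char *)src + src_offset, size);
    return 0;
}

/* Build a copy shifted by one element, so that
 * shifted[i] == table[i + 1] for i in [0, count - 1).
 * The last element of `shifted` is left untouched. */
static int make_shifted_copy(const int *table, int *shifted, size_t count)
{
    if (count == 0) return -1;
    return copy_bytes_model(table, count * sizeof(int),
                            shifted, count * sizeof(int),
                            sizeof(int),                 /* skip one int */
                            0,
                            (count - 1) * sizeof(int));  /* one int fewer */
}
```

With a shifted copy built this way, a kernel line of the form `dense_id1[bucket] - dense_id2[bucket]` reads the same values as `dense_id[bucket+1] - dense_id[bucket]`. Asking for a full-length copy starting at a non-zero source offset runs past the end of the source, which the model rejects.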

As DarkZeros pointed out, the performance gain was zero, because the compiler had optimized the line

result_tuples += (dense_id[bucket] - dense_id[bucket]);

      

to the constant 0.
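To illustrate why the two timings are not comparable, here is a small plain-C sketch (not kernel code; the function names are hypothetical). The same-index expression is zero for any data, so a compiler is free to drop the memory loads entirely, whereas the adjacent-index version has to read the table.

```c
#include <stddef.h>

/* Sum of dense_id[b] - dense_id[b]: every term is zero regardless of
 * the table contents, so no load from dense_id is actually required. */
static int sum_same_index(const int *dense_id, size_t buckets)
{
    int result_tuples = 0;
    for (size_t b = 0; b < buckets; ++b)
        result_tuples += dense_id[b] - dense_id[b];
    return result_tuples;
}

/* Sum of dense_id[b + 1] - dense_id[b]: depends on the data.
 * dense_id must hold at least buckets + 1 elements. */
static int sum_adjacent(const int *dense_id, size_t buckets)
{
    int result_tuples = 0;
    for (size_t b = 0; b < buckets; ++b)
        result_tuples += dense_id[b + 1] - dense_id[b];
    return result_tuples;
}
```

For a table such as {0, 3, 7, 7, 12} with four buckets, the first function yields 0 and the second yields 12, so the fast variant was not computing the same thing.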

Thank you for your understanding!
