OpenCL: duplicating a memory object on the device
Background: I have a kernel called "buildlookuptable" that does some calculations and stores its result in an int array called "dense_id".
Creating the cl_mem object:
cl_mem dense_id = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
Kernel argument setting:
errWrapper("setKernel", clSetKernelArg(kernel_buildLookupTable, 5, sizeof(cl_mem), &dense_id));
dense_id is then used in other kernels. Because of the way those kernels access the buffer, I get a huge performance hit.
The next kernel accesses dense_id as follows:
result_tuples += (dense_id[bucket+1] - dense_id[bucket]);
Runtime: 66 ms, no compiler-based vectorization.
However, if I change the line to:
result_tuples += (dense_id[bucket] - dense_id[bucket]);
Runtime: 2 ms, vectorized (4) by the compiler. Both kernels were run on a server with a GeForce 660 Ti.
So if I remove the overlapping memory access, the speed increases significantly: thread N only accesses element N, so the accesses do not overlap.
To still get correct results, I would like to duplicate the cl_mem object dense_id. The line in the next kernel would then be:
result_tuples += (dense_id1[bucket+1] - dense_id2[bucket]);
Here dense_id1 and dense_id2 are identical. Another idea would be to shift the contents of dense_id1 by one element, so that the kernel line becomes:
result_tuples += (dense_id1[bucket] - dense_id2[bucket]);
Since dense_id is a small memory object, I'm sure I could improve the runtime at the expense of memory by copying it.
Question: After running the "buildlookuptable" kernel, I would like to duplicate the dense_id result array on the device side. The direct way would be to call clEnqueueReadBuffer on the host side to fetch dense_id, create a new cl_mem object, and write it back to the device. Is there a way to duplicate dense_id after "buildlookuptable" finishes without copying it to the host?
I can add more code here if needed. I've tried to post only the relevant parts, as I don't want to drown you in irrelevant code.
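To make the "shift by one element" idea concrete, here is a minimal host-side sketch in plain C (not OpenCL kernel code). It only checks that reading a shifted copy at index bucket gives the same sum as reading dense_id at bucket+1; it says nothing about performance on the GPU. The function names sum_adjacent, make_shifted and sum_shifted are made up for this illustration, and dense_id is assumed to hold buckets + 1 elements.

```c
#include <assert.h>
#include <stddef.h>

/* Original access pattern: each bucket reads two neighbouring elements.
   dense_id must hold buckets + 1 elements. */
int sum_adjacent(const int *dense_id, size_t buckets)
{
    int result_tuples = 0;
    for (size_t bucket = 0; bucket < buckets; ++bucket)
        result_tuples += dense_id[bucket + 1] - dense_id[bucket];
    return result_tuples;
}

/* Build the shifted copy so that shifted[i] == dense_id[i + 1]. */
void make_shifted(const int *dense_id, int *shifted, size_t buckets)
{
    assert(dense_id != NULL && shifted != NULL);
    for (size_t i = 0; i < buckets; ++i)
        shifted[i] = dense_id[i + 1];
}

/* Proposed access pattern: both reads use the same index. */
int sum_shifted(const int *shifted, const int *dense_id, size_t buckets)
{
    int result_tuples = 0;
    for (size_t bucket = 0; bucket < buckets; ++bucket)
        result_tuples += shifted[bucket] - dense_id[bucket];
    return result_tuples;
}
```

Both functions return the same value for the same input, so the shifted buffer could stand in for the bucket+1 read without changing the result.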
I tried the solution using the clEnqueueCopyBuffer command, which works as desired. Solution to my first problem:
clEnqueueCopyBuffer(command_queue, count_buffer, count_buffer3, 1, 0, (inCount1 + 1) * sizeof(int), 0, NULL, NULL);
This duplicates a memory object on the device without using another kernel.
To do this, you must first create the destination cl_mem object from the host side:
cl_mem count_buffer3 = clCreateBuffer(context, CL_MEM_READ_WRITE, (inCount1 + 1) * sizeof(int), NULL, &err); errWrapper("create Buffer", err);
Because I had to wait for the copy to finish, I called
clFinish(command_queue);
which blocks until all commands queued so far have completed.
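One detail worth checking in the copy call: the src_offset, dst_offset and size arguments of clEnqueueCopyBuffer are given in bytes, not elements, and the call fails with CL_INVALID_VALUE if the region runs past the end of either buffer. The following is a small plain-C sketch of how the byte region for a copy shifted by whole elements might be computed and checked on the host before enqueueing. The copy_region struct and shifted_copy_region function are hypothetical helpers, not part of OpenCL, and the sketch ignores size_t overflow.

```c
#include <assert.h>
#include <stddef.h>

/* Byte region for a buffer-to-buffer copy. All fields are in bytes. */
typedef struct {
    size_t src_offset;  /* where reading starts in the source buffer */
    size_t dst_offset;  /* where writing starts in the destination   */
    size_t size;        /* number of bytes to copy                   */
    int    valid;       /* 1 if the region fits inside both buffers  */
} copy_region;

/* Hypothetical helper: region that copies count_elems elements,
   starting shift_elems elements into the source, to the start of
   the destination. src_bytes and dst_bytes are the buffer sizes. */
copy_region shifted_copy_region(size_t shift_elems, size_t count_elems,
                                size_t elem_size,
                                size_t src_bytes, size_t dst_bytes)
{
    assert(elem_size > 0);

    copy_region r;
    r.src_offset = shift_elems * elem_size;
    r.dst_offset = 0;
    r.size       = count_elems * elem_size;
    /* Same bounds a device-side copy has to satisfy. */
    r.valid = (r.src_offset + r.size <= src_bytes) &&
              (r.dst_offset + r.size <= dst_bytes);
    return r;
}
```

If valid is 1, the three fields could be passed as the src_offset, dst_offset and size arguments of clEnqueueCopyBuffer. For a shift of one int the source offset is sizeof(int) bytes, and the number of elements that can be copied is one less than the source holds.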
As DarkZeros pointed out, the actual performance gain was 0: the earlier speedup only appeared because the compiler optimized the line
result_tuples += (dense_id[bucket] - dense_id[bucket]);
to 0.
Thank you for your understanding!