Nvidia OpenCL hangs on blocking buffer access

I have an OpenCL program that copies a set of values into an input buffer, processes those values on the device, and copies the results back.

// map input data buffer, has CL_MEM_ALLOC_HOST_PTR
cl_float* data = (cl_float*) clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE, 0, data_size, 0, NULL, NULL, NULL);

// set input values
for(size_t i = 0; i < n; ++i)
    data[i] = values[i];

// unmap input buffer
clEnqueueUnmapMemObject(queue, data_buffer, data, 0, NULL, NULL);

// run kernels
...

// map results buffer, has CL_MEM_ALLOC_HOST_PTR
cl_float* results = (cl_float*) clEnqueueMapBuffer(queue, results_buffer, CL_TRUE, CL_MAP_READ, 0, results_size, 0, NULL, NULL, NULL);

// processing
...

// unmap results buffer
clEnqueueUnmapMemObject(queue, results_buffer, results, 0, NULL, NULL);


(In real code, I check for errors, etc.)
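
The error handling around the map calls looks roughly like this (simplified sketch using the errcode_ret parameter of clEnqueueMapBuffer; err is not shown in the code above):

cl_int err = CL_SUCCESS;
cl_float* data = (cl_float*) clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE,
                                                0, data_size, 0, NULL, NULL, &err);
if(err != CL_SUCCESS || data == NULL)
{
    // handle/report the error and abort the run
}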

This works great on AMD and Intel hardware (both CPU and GPU). On Nvidia GPUs, however, the same code is incredibly slow: a run that normally takes about 10 seconds (5 seconds host, 5 seconds device) takes over two and a half minutes on Nvidia cards.

However, I found that this is not a simple optimization problem or a zero-copy speed difference. Using a profiler, I can see that the host time is 5 seconds, as usual. And using OpenCL profiling events, I can see that the device time is also 5 seconds, as usual!
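
(For reference, the device times come from event profiling along these lines; a simplified sketch where kernel and global_size stand in for the real launch, and the queue is created with CL_QUEUE_PROFILING_ENABLE:)

// attach an event to the command and read its start/end timestamps
cl_event ev;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);

cl_ulong t_start = 0, t_end = 0;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
clReleaseEvent(ev);
// (t_end - t_start) is the device execution time in nanoseconds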

So I used the poor man's profiler to figure out where the program spends its time on Nvidia GPUs, and it shows that the program is simply waiting idly in both clEnqueueMapBuffer calls. I find this especially puzzling in the first case, since at that point the queue is empty.
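
As a cross-check, I can also split the map into a non-blocking enqueue plus an explicit wait, so that any stall shows up in the wait rather than in the enqueue call itself. Roughly:

// enqueue the map non-blockingly, then time the wait on its event separately
cl_event map_ev;
cl_int err = CL_SUCCESS;
cl_float* data = (cl_float*) clEnqueueMapBuffer(queue, data_buffer, CL_FALSE, CL_MAP_WRITE,
                                                0, data_size, 0, NULL, &map_ev, &err);
// ... any blocking should now happen here ...
clWaitForEvents(1, &map_ev);
clReleaseEvent(map_ev);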

Again, I've profiled every map/unmap and kernel call, and the extra time doesn't show up there, so it is not spent on the device, nor in host code. From the stack trace I can see that the program is instead blocked waiting on a semaphore. Does anyone know what is causing this?
