How to allocate memory for OpenCL result data?

What is the best way (in some reasonable sense) to allocate memory for OpenCL output data? Is there a solution that works reasonably well with both discrete and integrated graphics?

As a simplified example, consider the following C++ (host) code:

std::vector<float> generate_stuff(size_t num_elements) {
    std::vector<float> result(num_elements);
    for(size_t i = 0; i < num_elements; ++i)
        result[i] = static_cast<float>(i);
    return result;
}


The same can be done with an OpenCL kernel:

__kernel void gen_stuff(__global float *result) {
    result[get_global_id(0)] = get_global_id(0);
}


The simplest solution is to allocate the array on both the device and the host, and then copy the data over after the kernel finishes:

std::vector<float> generate_stuff(size_t num_elements) {
    // global context/kernel/queue objects set up appropriately
    cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                       num_elements * sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);
    std::vector<float> result(num_elements);
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0, num_elements * sizeof(float),
                        result.data(), 0, nullptr, nullptr);
    clReleaseMemObject(result_dev);
    return result;
}


This works reasonably well with discrete cards. But with shared-memory graphics it means allocating the array twice and doing an extra copy. How can that be avoided? One way, of course, is to give up clEnqueueReadBuffer and use clEnqueueMapBuffer / clEnqueueUnmapMemObject instead.
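
For reference, the map/unmap path in place of the read above would look roughly like this (a sketch only, reusing the hypothetical context/kernel/queue objects from the example; error checking omitted):

cl_int err = CL_SUCCESS;
// Blocking map for reading; on integrated GPUs this should avoid the extra copy.
float *mapped = static_cast<float*>(
    clEnqueueMapBuffer(queue, result_dev, CL_TRUE, CL_MAP_READ,
                       0, num_elements * sizeof(float),
                       0, nullptr, nullptr, &err));

// ... use the results through 'mapped' on the host ...

// Unmap when done; result_dev must stay alive until this point.
clEnqueueUnmapMemObject(queue, result_dev, mapped, 0, nullptr, nullptr);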

Some alternative scenarios:

  • Deal with the extra memory copy. Acceptable if memory bandwidth is not an issue.
  • Allocate a normal array on the host and pass CL_MEM_USE_HOST_PTR when creating the buffer. The allocation has to satisfy device-specific alignment and size requirements - it is 4k alignment for Intel HD Graphics: https://software.intel.com/en-us/node/531272 (I don't know whether this can be queried from the OpenCL runtime). The buffer has to be mapped (with CL_MAP_READ) after the kernel finishes to flush caches. But when can it be unmapped? Immediately after the map completes (this does not seem to work with AMD discrete cards)? Allocating an aligned array also requires changing the client code on Windows (since _aligned_free is different from free). A sketch of this option follows the list.
  • Allocate with CL_MEM_ALLOC_HOST_PTR and map after the kernel finishes. The cl_mem object must be kept alive until the buffer has been consumed (and perhaps even kept mapped?), so this pollutes the client code. It also keeps the array in pinned memory, which might be undesirable.
  • Allocate on the device without any CL_MEM_*_HOST_PTR flag and map it after the kernel completes. This is the same as option 2 in terms of deallocation; it just avoids pinned memory. (Actually, I am not sure the mapped memory is not pinned.)
  • ???
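
As a sketch of the CL_MEM_USE_HOST_PTR option above (the 4096-byte alignment comes from the Intel page linked above and is only an assumption for other devices; std::aligned_alloc is from <cstdlib> and needs C++17, and on MSVC it would be _aligned_malloc/_aligned_free; error checking omitted):

const size_t bytes  = num_elements * sizeof(float);
const size_t padded = ((bytes + 4095) / 4096) * 4096;  // aligned_alloc wants a multiple of the alignment
float *host_ptr = static_cast<float*>(std::aligned_alloc(4096, padded));

cl_int err = CL_SUCCESS;
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                                   bytes, host_ptr, &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);

// Map to make the device's writes visible on the host, use the data, then unmap.
float *mapped = static_cast<float*>(
    clEnqueueMapBuffer(queue, result_dev, CL_TRUE, CL_MAP_READ,
                       0, bytes, 0, nullptr, nullptr, &err));
// ... consume the results through 'mapped' (usually the same address as host_ptr) ...
clEnqueueUnmapMemObject(queue, result_dev, mapped, 0, nullptr, nullptr);
clFinish(queue);

clReleaseMemObject(result_dev);
std::free(host_ptr);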

How do you deal with this problem? Are there any vendor-specific solutions?



2 answers


You can do this with a single buffer, for both discrete and integrated hardware:

  • Allocate with CL_MEM_WRITE_ONLY (since your kernel only writes to the buffer). Optionally also use CL_MEM_ALLOC_HOST_PTR or vendor-specific (e.g. AMD) flags if it helps performance on specific platforms (read the vendor documentation and benchmark).
  • Enqueue your kernel that writes to the buffer.
  • clEnqueueMapBuffer with CL_MAP_READ and blocking set. On discrete hardware this copies across PCIe; on integrated hardware it is "free".
  • Use the results on the CPU through the returned pointer.
  • clEnqueueUnmapMemObject.
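
A minimal sketch of the steps above, assuming the context/kernel/queue objects from the question are already set up (error handling omitted):

const size_t bytes = num_elements * sizeof(float);
cl_int err = CL_SUCCESS;

// Single buffer; CL_MEM_ALLOC_HOST_PTR is optional and platform-dependent.
cl_mem result_dev = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                   bytes, nullptr, &err);

// Enqueue the kernel that writes into the buffer.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_dev);
clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &num_elements, nullptr, 0, nullptr, nullptr);

// Blocking map for reading: a PCIe copy on discrete hardware, essentially free on integrated.
float *result = static_cast<float*>(
    clEnqueueMapBuffer(queue, result_dev, CL_TRUE, CL_MAP_READ,
                       0, bytes, 0, nullptr, nullptr, &err));

// ... use 'result' on the CPU ...

// Unmap, then release the buffer once it is no longer needed.
clEnqueueUnmapMemObject(queue, result_dev, result, 0, nullptr, nullptr);
clReleaseMemObject(result_dev);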


Depends on the use case:



  • For minimum memory usage and maximum I/O efficiency: (Dithermaster's answer above)
    • Create the buffer with CL_MEM_WRITE_ONLY, possibly combined with CL_MEM_ALLOC_HOST_PTR (depending on the platform). Blocking-map it for reading, use the data, then unmap it. This option requires the data consumer to know about the existence of CL and to release the data via CL calls.
  • For situations where you have to hand the buffer data to a third party (i.e. libraries that expect a plain C pointer or vector-like buffer and are CL-agnostic):
    • In this case it can be bad to hand out mapped memory, since mapped memory access times are generally longer than for regular CPU memory. Instead of mapping, then memcpy() and unmapping, it is simpler to clEnqueueReadBuffer() directly into the CPU address where the output should end up (see the sketch after this list). With some vendors this path does not use pinned memory and the copy is slow, so it is better to fall back to option 1; but in some other cases where pinned memory is not involved, I found it faster.
  • Any other scenario for reading kernel output? I think not...
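
A sketch of the second case: a hypothetical helper that copies the result straight into a caller-owned, CL-agnostic buffer (queue and result_dev are assumed to exist as in the question; error checking omitted):

// Blocking read directly into the consumer's memory; no map/memcpy/unmap round trip.
void read_results_into(float *dst, size_t num_elements) {
    clEnqueueReadBuffer(queue, result_dev, CL_TRUE, 0,
                        num_elements * sizeof(float), dst,
                        0, nullptr, nullptr);
}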