Optimizing GPU-to-CPU data transfer

I'm a little out of my depth here (there may be a better way to think about this), but I'm looking for ways to optimize GPU-to-CPU data transfer in my application.

I have an application that modifies vertex data on the GPU. Occasionally the CPU has to read back parts of the modified vertex data and compute some parameters, which are then passed back to the GPU shader through uniforms, forming a loop.

Moving all of the vertex data back to the CPU and then sifting through it on the CPU takes too long (there are millions of points), so I use a "hack" to reduce the workload, albeit a sub-optimal one.

What I am doing:

  • CPU: read the image
  • CPU: generate one vertex per pixel, with Z based on color/filter information, etc.
  • CPU: pass all of the vertex data to the GPU
  • GPU: a transform feedback pass updates the GL_POINTS vertex coordinates in real time, based on uniform parameters set from the CPU (sketched just below this list).
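For context, the update pass is conceptually something like the sketch below; the program, VAO/VBO and uniform names are just placeholders, not my exact code.

    glUseProgram(updateProgram);                        // program linked with
                                                        // glTransformFeedbackVaryings() for "outPosition"
    glUniform1f(brushStrengthLoc, brushStrength);       // example uniform set from the CPU
    glBindVertexArray(srcVao);                          // reads the current vertex buffer
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, dstVbo);
    glEnable(GL_RASTERIZER_DISCARD);                    // capture vertices only, no rasterization
    glBeginTransformFeedback(GL_POINTS);
    glDrawArrays(GL_POINTS, 0, pointCount);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
    // swap srcVbo/dstVbo (and their VAOs) for the next pass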

When I only want to read back a rectangular "section", I use glMapBufferRange to map all of the rows that contain the rectangle I want (bad diagram warning):

[Diagram: the grid of vertices on the GPU; the red cells are the rectangle of interest and the blue cells are the extra vertices in the same rows.]

This represents the image/vertex set on the GPU. My "hack" involves reading back all of the blue and red vertices, because I can only specify a single contiguous data range to map.
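In code, the hack looks roughly like this (a simplified sketch; vbo, imageWidth, vertexStride and the rect fields are placeholders for my actual layout):

    // Map every full row that overlaps the rectangle, since glMapBufferRange
    // only accepts a single contiguous offset/length.
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    GLintptr   offset = (GLintptr)rect.y0 * imageWidth * vertexStride;
    GLsizeiptr length = (GLsizeiptr)(rect.y1 - rect.y0) * imageWidth * vertexStride;
    const char* rows  = (const char*)glMapBufferRange(GL_ARRAY_BUFFER, offset, length,
                                                      GL_MAP_READ_BIT);
    // ... walk the mapped rows, skipping the "blue" vertices outside
    //     [rect.x0, rect.x1) on each row ...
    glUnmapBuffer(GL_ARRAY_BUFFER);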

Does anyone know of a clever way to efficiently get at the red vertices without the blue ones (without having to issue a series of glMapBufferRange calls)?

EDIT -

The use case is that I turn the image into a 3D world of GL_POINTS, colored and offset in Z by an amount based on the color information (and sized by distance, etc.). The user can then modify the vertex Z data with a mouse-cursor brush. The brush application code needs to know about the Z values under the brush circle, e.g. the min/max/average, so that the CPU can control the modification by setting a series of uniforms that are fed into the shader. So, for example, the user can say "set all points under the cursor to the average". Maybe the whole thing could be done entirely on the GPU, but the idea is that once I have the CPU-GPU "loopback" (optimized as far as I can get it), I can then extend the min/max/avg stuff to do interesting things on the CPU that would (perhaps) be cumbersome to do entirely on the GPU.

Cheers, Laythe



2 answers


To get any data from the GPU to the CPU you need to map GPU memory anyway, which means the OpenGL implementation has to use something like mmap under the hood. I have checked implementations of this for both x86 and ARM and they appear to be page-aligned, so you cannot map less than one contiguous page of GPU memory at a time. Even if you could ask to map only the red regions, you would most likely get the blue ones as well (depending on your page size and vertex stride).

Solution 1: Just use glReadPixels, since it lets you select a rectangular window of the framebuffer. I would guess that a GPU vendor like Intel optimizes the driver so that it maps as few pages as possible, however this is not guaranteed, and in some cases you may still end up mapping 2 pages for just 2 pixels.
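As a rough sketch of what I mean, assuming you render (or copy) the Z values into a float texture attached to an FBO; the names are placeholders:

    // Read back only the rectangle of interest from the FBO attachment.
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);        // FBO with a GL_R32F color attachment
    glReadBuffer(GL_COLOR_ATTACHMENT0);
    std::vector<float> zValues((size_t)rect.w * rect.h);
    glReadPixels(rect.x, rect.y, rect.w, rect.h,
                 GL_RED, GL_FLOAT, zValues.data());     // one float per point
    glBindFramebuffer(GL_READ_FRAMEBUFFER, 0);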

Solution 2: Use a compute shader, or multiple glCopyBufferSubData calls, to copy the ROI into a separate, contiguous buffer in GPU memory. If you know the width and height you want, you can then map that buffer and reconstruct the 2D region on the CPU side.
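For the glCopyBufferSubData variant, something along these lines (one copy per row of the rectangle; the buffer names and strides are placeholders):

    // Pack the ROI rows into a small staging buffer, then map only that buffer.
    glBindBuffer(GL_COPY_READ_BUFFER,  vbo);            // the full vertex buffer
    glBindBuffer(GL_COPY_WRITE_BUFFER, staging);        // sized rowBytes * roiHeight
    GLsizeiptr rowBytes  = (GLsizeiptr)(rect.x1 - rect.x0) * vertexStride;
    int        roiHeight = rect.y1 - rect.y0;
    for (int row = 0; row < roiHeight; ++row) {
        GLintptr src = ((GLintptr)(rect.y0 + row) * imageWidth + rect.x0) * vertexStride;
        GLintptr dst = (GLintptr)row * rowBytes;
        glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER, src, dst, rowBytes);
    }
    const char* roi = (const char*)glMapBufferRange(GL_COPY_WRITE_BUFFER, 0,
                                                    rowBytes * roiHeight, GL_MAP_READ_BIT);
    // roi is now a tightly packed (rect.x1 - rect.x0) x roiHeight block
    glUnmapBuffer(GL_COPY_WRITE_BUFFER);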



Which of the above solutions works best depends on your hardware and driver implementation. If GPU->CPU transfer is the bottleneck and GPU->GPU copies are fast, then the second solution may work well, but you will have to experiment.

Solution 3: As suggested in the comments, do everything on the GPU. This largely depends on how well the job parallelizes, but if copying the memory back is too slow for you, then you have little other choice.



I assume you are asking because you cannot do all of the work in shaders, right?



If you render into a framebuffer object and then bind it as GL_READ_FRAMEBUFFER, you can read a block of it with glReadPixels.
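Something like this, for example (the texture format and sizes are only an illustration):

    // Create a float texture, attach it to an FBO, and read a block back.
    GLuint tex = 0, fbo = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R32F, imageWidth, imageHeight, 0,
                 GL_RED, GL_FLOAT, nullptr);
    glGenFramebuffers(1, &fbo);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);
    // ... render the point Z data into this FBO ...
    glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo);
    glReadPixels(x, y, w, h, GL_RED, GL_FLOAT, out); // read just the block you need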







