GPGPU: an efficient way to handle "irregular" transformations?

In a regular transformation, every GPU thread is expected to do roughly the same amount of work (the same time complexity). For example:

for i = 0 to 10:
    c[i] = a[i] * b[i]

An irregular transformation, by contrast, does not have this property:

for i = 0 to len(arr):
    for k = 0 to random() % 100:
        arr[i] += 1

This produces an array like [2, 50, 32, 77, 1, 5, 66, ...], where each element roughly indicates the computational cost of the corresponding work item.

GPGPU programming is well suited to regular transformations such as element-wise addition, matrix multiplication, or convolution. But what about irregular transformations? How do you distribute GPU threads well? How do you write a good kernel for this? Is there a common methodology?
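To make the imbalance concrete, here is a small Python cost model (purely illustrative, not GPU code; the group size of 32 is an assumed SIMD width, like a warp or wavefront). Since lanes in such a group execute in lockstep, a group's cost is the cost of its slowest item, so cheap items sit idle waiting for expensive ones:

```python
import random

random.seed(0)
GROUP = 32  # hypothetical SIMD width (e.g. a warp/wavefront)

# Irregular per-item cost, like the `random() % 100` loop above.
costs = [random.randrange(100) for _ in range(1024)]

def group_cost(items):
    # In lockstep execution, every lane waits for the slowest one.
    return max(items)

total = sum(group_cost(costs[i:i+GROUP]) for i in range(0, len(costs), GROUP))
useful = sum(costs)
print(f"lane-cycles spent: {total * GROUP}, useful work: {useful}")
print(f"utilization: {useful / (total * GROUP):.0%}")
```

With uniformly random costs, utilization is well below 100%, which is exactly the problem the question is asking about.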



1 answer


If the hardware is not Vega or Volta (both of which can execute commands almost independently per work item), then it is best to group work items of similar cost together. For example, a Mandelbrot image generator (with varying amounts of work per item) can be faster with 2-D tiled generation, because all items in one group then have neighbors with roughly similar workloads, making the groups better balanced than with 1-D (scanline) generation, where the cost varies much more within each group. Either reorder the items based on their last iteration count, or rely on spatial grouping.
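The reordering idea can be sketched with the same kind of Python cost model (illustrative only; the group size and iteration counts are assumed numbers, and the "last iteration" stands in for, say, the previous frame's Mandelbrot iteration counts). Sorting items by their previously observed cost makes each group far more uniform, which lowers the sum of per-group maxima:

```python
import random

random.seed(1)
GROUP = 64  # hypothetical wavefront size

def total_cycles(costs):
    # Lockstep model: each group costs as much as its slowest item.
    return sum(max(costs[i:i+GROUP]) for i in range(0, len(costs), GROUP))

# Per-item cost observed in the previous pass (e.g. last frame's
# Mandelbrot iteration counts).
last_iters = [random.randrange(1, 1000) for _ in range(4096)]

unsorted_cost = total_cycles(last_iters)
sorted_cost = total_cycles(sorted(last_iters))
# Sorted grouping should never cost more, and usually costs much less.
print(unsorted_cost, sorted_cost)
```

Spatial grouping (2-D tiles) gets a weaker version of the same effect for free, since neighboring pixels tend to have similar iteration counts, without paying for a sort or a reordering pass.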



In the worst case, the maximum number of cycles per compute unit (each with 8, 64, 128, or 192 cores) determines the resulting performance, so more compute units means faster execution. All the other work items are still hidden behind this maximum, and the GPU remains more efficient than the CPU.







