Accessing the GPU buffer by index

Note: my question is about the Apple Metal API, but I think the concept is general enough to translate to other GPU platforms.

My goal is to add a 1 x N row vector b to every row of an M x N matrix A.

My kernel, boiled down to the part I'm having problems with:

kernel void vmadd(const device float* A [[ buffer(0) ]],
                  const device float* b [[ buffer(1) ]],
                  device float* C [[ buffer(2) ]],
                  constant ushort& aWidth [[ buffer(3) ]],
                  ushort2 gid [[ thread_position_in_grid ]]) {

    int idx = gid.y * aWidth + gid.x; // Compute absolute index in C
    C[idx] = A[idx] + b[gid.x];

}
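
For reference, here is a CPU-side sketch in Swift of what the kernel is meant to compute (the function and parameter names are illustrative, not from my actual code):

// CPU reference: add the row vector b (length aWidth) to every row of A.
// A and C are stored row-major, exactly as the kernel indexes them.
func vmaddReference(A: [Float], b: [Float], aWidth: Int, aHeight: Int) -> [Float] {
    var C = [Float](repeating: 0, count: aWidth * aHeight)
    for y in 0..<aHeight {
        for x in 0..<aWidth {
            let idx = y * aWidth + x   // same flattening as idx in the kernel
            C[idx] = A[idx] + b[x]
        }
    }
    return C
}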


Assumption: I understand that gid is the position of one item in C: gid.x is the column, gid.y is the row. If this is not true, someone please correct me.

Now if I fill an 8 x 8 A with zeros:

A = 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0


and b as follows:

b = 1 2 3 4 5 6 7 8


then after execution, C should be an 8 x 8 matrix where each row is 1 2 3 4 5 6 7 8.

Instead, I get this:

C = 1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    0 0 0 0 0 0 0 0
    1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    1 2 3 4 5 6 7 8
    0 0 0 0 0 0 0 0


I compute the thread sizes as recommended by Apple here:

let w = computePipeline.threadExecutionWidth
let h = computePipeline.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSizeMake(w, h, 1)
let threadgroupsPerGrid = MTLSize(width: (cWidth + w - 1) / w,
                                  height: (cHeight + h - 1) / h,
                                  depth: 1)
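
For context, this is roughly how those sizes are handed to the encoder (a sketch; commandEncoder is assumed to be my MTLComputeCommandEncoder, set up elsewhere):

commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)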


On my machine, that works out to a threadsPerThreadgroup of (64, 16, 1) and a threadgroupsPerGrid of (1, 1, 1).

However, if I manually set threadsPerThreadgroup to (1, 1, 1) and threadgroupsPerGrid to (8, 8, 1), I get the correct result in C.
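
For completeness, that manual configuration looks something like this (again a sketch, with commandEncoder as above):

let threadsPerThreadgroup = MTLSizeMake(1, 1, 1)
let threadgroupsPerGrid = MTLSizeMake(8, 8, 1)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid,
                                    threadsPerThreadgroup: threadsPerThreadgroup)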

Question:

I'm fairly sure my problem is related to the thread sizes and buffer configuration, but I'm new to GPU programming, so I don't fully understand what's going on.

Why does reducing the threadgroup size produce the correct result? More generally, why does the threadgroup size affect this computation at all?

It seems to me that if gid always corresponds to an index in C, and I read b[gid.x], then that value should be available. I understand that data tends to be chunked to fit into threadgroup caches; if that is what is happening here, what rules does it follow, and how can I account for it?



1 answer


I understand that gid is the position of one item in C: gid.x is the column, gid.y is the row. If this is not true, someone please correct me.

This is not entirely true. gid is the position in the grid.

Since the grid happens to be 64x16, the compute function will be invoked for positions that fall outside your 8x8 matrices (A and C) and your 8-element vector (b). When that happens, a read of A can hit the wrong row, or even run past the end of A. Likewise, a read of b can run past its end.

For example, consider when gid is (8, 0). idx will be 8, so you will read A[8], which is actually the element at (0, 1). You will read b[8], which is past the end of b; that is technically undefined, although in practice it is likely to be 0 for a buffer this short. You will then write to C[8], which is also the element at (0, 1). That write races with the invocation that is supposed to write to (0, 1), and whichever happens to run last wins.
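
To make that concrete, here is a small Swift sketch of the same index arithmetic, using the values from this example:

let aWidth = 8
let gid = (x: 8, y: 0)                  // an out-of-bounds invocation on the 64x16 grid
let idx = gid.y * aWidth + gid.x        // 0 * 8 + 8 = 8
// Linear index 8 is the same slot as the in-bounds position (x: 0, y: 1),
// so this invocation races with the one that is supposed to write C[8].
print(idx % aWidth, idx / aWidth)       // prints "0 1"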

Your function should, near the beginning, test whether gid is out of bounds and, if so, return early:

// Skip invocations that fall outside the matrix (the grid may be larger).
if (any(gid >= aWidth))
    return;


(This assumes that A and C will always be square, so both coordinates can be tested against the single value aWidth.)

You could try to tweak the threadsPerThreadgroup and threadgroupsPerGrid calculations so that the grid exactly matches the size of your matrices, but it can be tedious to get that right for all cases. You can, however, keep threadsPerThreadgroup from being too large:

let w = min(computePipeline.threadExecutionWidth, cWidth)
let h = min(computePipeline.maxTotalThreadsPerThreadgroup / w, cHeight)


But you still need the check in the compute function, because the overall grid can still be too large. For example, suppose computePipeline.threadExecutionWidth is at least 8 and computePipeline.maxTotalThreadsPerThreadgroup is 60. Then w will be 8 and h will be 7, so threadgroupsPerGrid will be (1, 2, 1) and the total grid size will be 8x14x1, which is larger than your matrices.
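
A quick Swift sketch of that arithmetic, plugging in the hypothetical values from the paragraph above:

let threadExecutionWidth = 8            // hypothetical value for this example
let maxTotalThreadsPerThreadgroup = 60  // hypothetical value for this example
let cWidth = 8, cHeight = 8

let w = min(threadExecutionWidth, cWidth)                // 8
let h = min(maxTotalThreadsPerThreadgroup / w, cHeight)  // 60 / 8 = 7
let groupsX = (cWidth + w - 1) / w                       // 1
let groupsY = (cHeight + h - 1) / h                      // (8 + 6) / 7 = 2
print(w * groupsX, h * groupsY)                          // 8 14 -> the grid is 8x14, larger than 8x8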
