Accessing the GPU buffer by index
Note: my question is about the Apple Metal API, but I think the concept is general enough to translate to other GPU platforms.
My goal is to add a 1 x N row vector b to every row of an M x N matrix A.
My kernel, boiled down to the part I'm having problems with:
kernel void vmadd(const device float* A [[ buffer(0) ]],
const device float* b [[ buffer(1) ]],
device float* C [[ buffer(2) ]],
constant ushort& aWidth [[ buffer(3) ]],
ushort2 gid [[ thread_position_in_grid ]]) {
int idx = gid.y * aWidth + gid.x; // Compute absolute index in C
C[idx] = A[idx] + b[gid.x];
}
Assumption: I understand that gid is the position of one item in C: gid.x is the column and gid.y is the row. If this is not true, please correct me.
Now if I fill an 8 x 8 A with zeros:
A = 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
and b as follows:
b = 1 2 3 4 5 6 7 8
then after execution, C should be an 8 x 8 matrix where each row is 1 2 3 4 5 6 7 8.
Instead, I get this:
C = 1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
1 2 3 4 5 6 7 8
0 0 0 0 0 0 0 0
I define the thread sizes as recommended by Apple here:
let w = computePipeline.threadExecutionWidth
let h = computePipeline.maxTotalThreadsPerThreadgroup / w
let threadsPerThreadgroup = MTLSizeMake(w, h, 1)
let threadgroupsPerGrid = MTLSize(width: (cWidth + w - 1) / w,
height: (cHeight + h - 1) / h,
depth: 1)
On my machine, this gives (64, 16, 1) threads per threadgroup and (1, 1, 1) threadgroups per grid.
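For completeness, the dispatch itself is just the standard threadgroup dispatch; the sketch below assumes the pipeline state, command encoder, and buffers are created elsewhere, and the helper name is mine:

import Metal

// Sketch of the encoding step. `encoder` is an MTLComputeCommandEncoder; the
// buffers were allocated and filled elsewhere, and the names are illustrative.
func encodeVmadd(encoder: MTLComputeCommandEncoder,
                 pipeline: MTLComputePipelineState,
                 aBuffer: MTLBuffer, bBuffer: MTLBuffer, cBuffer: MTLBuffer,
                 aWidth: UInt16,
                 threadgroupsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize) {
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(aBuffer, offset: 0, index: 0)   // A -> buffer(0)
    encoder.setBuffer(bBuffer, offset: 0, index: 1)   // b -> buffer(1)
    encoder.setBuffer(cBuffer, offset: 0, index: 2)   // C -> buffer(2)
    var width = aWidth
    encoder.setBytes(&width, length: MemoryLayout<UInt16>.stride, index: 3) // aWidth -> buffer(3)
    encoder.dispatchThreadgroups(threadgroupsPerGrid,
                                 threadsPerThreadgroup: threadsPerThreadgroup)
}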
However, if I manually set threadsPerThreadgroup to (1, 1, 1) and threadgroupsPerGrid to (8, 8, 1), I get the correct result in C.
Question:
I'm fairly sure my problem is related to the thread sizes and buffer settings, but I'm new to GPU programming, so I don't fully understand it.
Why does shrinking the threadgroups produce the correct result? More generally, why does the threadgroup size affect this computation at all?
It seems to me that if gid always corresponds to an index in C, and I read b[gid.x], then that value should be available. I understand that data tends to be chunked to fit into threadgroup caches; if that is what is happening here, what rules govern it and how can I account for it?
I understand that gid is the position of one item in C: gid.x is the column and gid.y is the row. If this is not true, someone please correct me.
This is not entirely true. gid is the thread's position in the grid.
Since the grid happens to be 64 x 16, the compute function is invoked for positions that fall outside your 8 x 8 matrices (A and C) and your 8-element vector (b). When that happens, a read from A can hit the wrong row, or even run past the end of A. Likewise, a read from b can run past its end.
For example, consider when gid is (8, 0). idx will be 8, so you will read A[8], which is actually element (0, 1). You will read b[8], which is past the end of b; that is technically undefined behavior, although in practice it will likely be 0 for a buffer this short. You will then write C[8], which is also element (0, 1). That write happens at roughly the same time as the invocation that is supposed to write to (0, 1), and whichever write wins the race determines the result.
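To make the collision concrete, here is a small plain-Swift sketch of the same index arithmetic (just the numbers, no GPU involved; the helper name is illustrative):

let aWidth = 8

// The flattened index the kernel computes: idx = gid.y * aWidth + gid.x
func flatIndex(x: Int, y: Int) -> Int { y * aWidth + x }

// An out-of-bounds grid position and a legitimate one map to the same element of C:
print(flatIndex(x: 8, y: 0))   // 8 -- grid position (8, 0), outside the 8 x 8 matrix
print(flatIndex(x: 0, y: 1))   // 8 -- element (0, 1) of C
// Both invocations write C[8]; which value survives depends on which write lands last.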
Your function should, at the start, test whether gid is out of bounds and, if so, return early:
if (any(gid >= aWidth))
    return;
(This assumes A and C will always be square, so both coordinates can be tested against the single width value; if they were not, you would also pass the height and test gid.y against it.)
You could try to tweak the threadsPerThreadgroup and threadgroupsPerGrid calculations so the grid is exactly the size of your matrices, but it can be tedious to get that right in all cases. You can, however, keep threadsPerThreadgroup from being larger than necessary:
let w = min(computePipeline.threadExecutionWidth, cWidth)
let h = min(computePipeline.maxTotalThreadsPerThreadgroup / w, cHeight)
But you still need the check in the compute function, because the overall grid can still be too big. For example, suppose computePipeline.threadExecutionWidth is at least 8 and computePipeline.maxTotalThreadsPerThreadgroup is 60. Then w will be 8, but h will be 7 (60 / 8 rounded down). threadgroupsPerGrid will be (1, 2, 1), and the total grid size will be 8 x 14 x 1, which is larger than your matrices.
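Plugging those hypothetical numbers into the host-side calculation shows the mismatch (a plain Swift sketch using the assumed values rather than querying a real device):

// Hypothetical pipeline limits from the example above:
let threadExecutionWidth = 8
let maxTotalThreadsPerThreadgroup = 60
let cWidth = 8, cHeight = 8

let w = min(threadExecutionWidth, cWidth)                 // 8
let h = min(maxTotalThreadsPerThreadgroup / w, cHeight)   // 60 / 8 = 7
let groupsWide = (cWidth + w - 1) / w                     // 1
let groupsHigh = (cHeight + h - 1) / h                    // (8 + 6) / 7 = 2
print("grid: \(groupsWide * w) x \(groupsHigh * h)")      // grid: 8 x 14, larger than 8 x 8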