CUDA Warps and Thread Divergence

I am trying to understand CUDA warps and thread divergence. Suppose I have a naive kernel that multiplies two n x n matrices.

__global__ void matrix_multiply(float* a, float* b, float* c, int n)
{
    // Global index = block index * block size + index within the block.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if(row < n && col < n) {
        float tmp = 0.0f;
        for(int i = 0; i < n; ++i)
            tmp += a[row * n + i] * b[i * n + col];
        c[row * n + col] = tmp;
    }
}

      

If I launch the kernel with a grid size of 32 by 32 and a block size of 16 by 16, and the matrices are 500 by 500, how many warps have threads that will encounter thread divergence?

Since each thread block on the right edge of the matrix will have thread divergence, shouldn't the number of warps with thread divergence be 256?



1 answer


There are two potential divergence points in your code. The first can be created by the if statement, and the second by the loop condition of the for loop. The second is harmless from the perspective of warp divergence, since the input n is uniform across the threads.

For the first, the threads that do not satisfy the condition exit immediately. If n is 500, the number of these early-exiting threads is (16 * 16) * (32 * 32) - (500 * 500) = 12144. Keeping in mind the answer to this question, there are 250 warps facing divergence, each of which consists of two rows of a 16 * 16 block that crosses the right edge. The rightmost column of blocks spans 512 rows, which is 256 warps, but only the 250 warps covering rows 0 to 499 diverge. In each of them, the lanes with IDs 0, 1, 2, 3, 16, 17, 18 and 19 satisfy the condition and enter the if block, and the rest are disabled. There are also 6 * (512 / 16) = 192 warps, covering rows 500 to 511 in the bottom row of blocks, for which the if condition is false in all lanes, so they do not encounter divergence.
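The counts above can be checked with a small host-only simulation of the index arithmetic. This is a rough sketch rather than anything from the original answer: it assumes a warp size of 32 and that threads in a block are linearized x-fastest (tid = threadIdx.y * blockDim.x + threadIdx.x). The names WarpTally, active_mask, count_bits and classify_warps are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kWarpSize = 32;  // assumption: 32 lanes per warp

// How the warps of one launch behave at the `if (row < n && col < n)` guard.
struct WarpTally {
    long divergent = 0;        // some lanes pass the guard, some fail
    long fully_inactive = 0;   // every lane fails the guard
    long fully_active = 0;     // every lane passes the guard
    long exiting_threads = 0;  // total threads that fail the guard
};

// Bitmask of the lanes in one warp that pass the guard (bit i set = lane i active).
std::uint32_t active_mask(int block_idx_x, int block_idx_y, int warp,
                          int block_dim_x, int block_dim_y, int n) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        int tid = warp * kWarpSize + lane;  // linear thread id within the block
        int tx = tid % block_dim_x;         // plays the role of threadIdx.x
        int ty = tid / block_dim_x;         // plays the role of threadIdx.y
        int row = block_idx_y * block_dim_y + ty;
        int col = block_idx_x * block_dim_x + tx;
        if (row < n && col < n) mask |= (1u << lane);
    }
    return mask;
}

// Number of set bits in the mask.
int count_bits(std::uint32_t mask) {
    int bits = 0;
    for (; mask != 0; mask &= mask - 1) ++bits;  // clears the lowest set bit each pass
    return bits;
}

// Walks every warp of every block and classifies it by its active-lane count.
WarpTally classify_warps(int grid_x, int grid_y,
                         int block_dim_x, int block_dim_y, int n) {
    int threads_per_block = block_dim_x * block_dim_y;
    assert(threads_per_block % kWarpSize == 0);  // sketch handles whole warps only
    int warps_per_block = threads_per_block / kWarpSize;

    WarpTally tally;
    for (int by = 0; by < grid_y; ++by) {
        for (int bx = 0; bx < grid_x; ++bx) {
            for (int w = 0; w < warps_per_block; ++w) {
                int active = count_bits(
                    active_mask(bx, by, w, block_dim_x, block_dim_y, n));
                tally.exiting_threads += kWarpSize - active;
                if (active == 0) ++tally.fully_inactive;
                else if (active == kWarpSize) ++tally.fully_active;
                else ++tally.divergent;
            }
        }
    }
    return tally;
}
```

Calling classify_warps(32, 32, 16, 16, 500) gives 250 divergent warps, 192 fully inactive warps and 12144 exiting threads. active_mask(31, 0, 0, 16, 16, 500) returns 0x000F000F for the first warp of the top-right block, which corresponds to lanes 0 to 3 and 16 to 19.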



The snapshot below shows what is happening in the lower right corner.

[Image: snapshot of the bottom-right corner of the grid]
