What is the scheduling behavior of thread blocks for a particular SM after the CUDA kernel has started?

My question concerns the scheduling of thread blocks in CUDA (specifically on Kepler or newer NVIDIA architectures) after kernel execution has already started.

From my understanding of the Kepler architecture (which may not be correct), there is a limit on the number of blocks that can be resident on one SM at any given time (16 blocks, if I remember correctly). Also from my understanding, a block cannot migrate once it has been scheduled on a particular SM.

What I'm interested in is the block scheduling and execution behavior after the initial set of blocks has been selected and has started executing on the device (assuming the kernel has more thread blocks than can be resident across all SMs).

Are new blocks launched as soon as one active block finishes on an SM? Or is the next set of blocks launched only after an SM completes all of its active blocks? Or are they launched only after all SMs have finished all of their currently active blocks?

Also, I've heard that block scheduling is "fixed" to a single SM. I'm assuming a block is bound to a single SM only after it becomes active. Is this true?



1 answer


New blocks can be scheduled on an SM as soon as that SM has enough unused resources to support the new block. It is not necessary for an SM to be completely drained of blocks before new blocks are scheduled.

As pointed out in the comments, if you ask for public documentation to support this claim, I'm not sure I can point to any. However, you can construct a test case and prove it to yourself.

In short, you would create a block-specialized kernel that launches many blocks. The first block on each SM discovers and declares itself using atomics. These blocks then "persist" until all other blocks have completed, tracked via a block-completed counter (again using atomics, similar to a stream-compaction code pattern). All other blocks, which are not the first to launch on their SM, simply exit. The completion of such code, as opposed to a hang, is proof that other blocks can be scheduled even while some blocks are still resident.

Here's a fully worked example:



$ cat t743.cu
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define NB 1000
// increase array length here if your GPU has more than 32 SMs
#define MAX_SM 32
// set HANG_TEST to 1 to demonstrate a hang for test purposes
#define HANG_TEST 0

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

static __device__ __inline__ uint32_t __smid(){
    uint32_t smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;}

__device__ volatile int blocks_completed = 0;
// increase array length here if your GPU has more than 32 SMs
__device__ int first_SM[MAX_SM];

// launch with one thread per block only
__global__ void tkernel(int num_blocks, int num_SMs){

  int my_SM = __smid();
  // atomicCAS returns the old value: 0 only for the first block to arrive on this SM
  int im_not_first = atomicCAS(first_SM+my_SM, 0, 1);
  if (!im_not_first){
    // the first block on each SM spins until all non-first blocks have exited;
    // with HANG_TEST set to 1 it also waits on the first blocks themselves, causing a hang
    while (blocks_completed < (num_blocks-num_SMs+HANG_TEST));
  }
  atomicAdd((int *)&blocks_completed, 1);
}

int main(int argc, char *argv[]){
  unsigned my_dev = 0;
  if (argc > 1) my_dev = atoi(argv[1]);
  cudaSetDevice(my_dev);
  cudaCheckErrors("invalid CUDA device");
  int tot_SM = 0;
  cudaDeviceGetAttribute(&tot_SM, cudaDevAttrMultiProcessorCount, my_dev);
  cudaCheckErrors("CUDA error");
  if (tot_SM > MAX_SM) {printf("program configuration error\n"); return 1;}
  printf("running on device %d, with %d SMs\n", my_dev, tot_SM);
  int temp[MAX_SM];
  for (int i = 0; i < MAX_SM; i++) temp[i] = 0;
  cudaMemcpyToSymbol(first_SM, temp, MAX_SM*sizeof(int));
  cudaCheckErrors("cudaMemcpyToSymbol fail");
  tkernel<<<NB, 1>>>(NB, tot_SM);
  cudaDeviceSynchronize();
  cudaCheckErrors("kernel error");
}

$ nvcc -o t743 t743.cu
$ ./t743 0
running on device 0, with 15 SMs
$ ./t743 1
running on device 1, with 1 SMs
$ ./t743 2

I tested the above code on Linux with CUDA 7, on K40c, C2075, and Quadro NVS 310 GPUs. It doesn't hang.

To answer your second question, a block remains on the SM on which it was first scheduled. One possible exception is the case of CUDA dynamic parallelism.
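As an aside, regarding the per-SM resident-block limit mentioned in the question: rather than relying on the architecture tables, since CUDA 6.5 you can query that limit for a specific kernel with the occupancy API. A minimal sketch (the empty kernel and the 128-thread block size here are just placeholders; the limit reported depends on the kernel's register/shared-memory usage and your block size):

```cuda
#include <stdio.h>

// trivial kernel used only as the subject of the occupancy query
__global__ void dummy_kernel() {}

int main(){
  int num_blocks = 0;
  // ask the runtime how many blocks of dummy_kernel (128 threads each,
  // 0 bytes of dynamic shared memory) can be resident on one SM at once
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&num_blocks, dummy_kernel, 128, 0);
  printf("max resident blocks per SM for this kernel: %d\n", num_blocks);
  return 0;
}
```

For a trivial kernel like this, the result is bounded by the hardware block limit (16 per SM on Kepler); for real kernels it is often lower.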
