How are blocks scheduled onto SMs in CUDA when there are fewer blocks than available SMs?

This question arises from the difference between theoretical and achieved occupancy observed for a kernel. I am aware of the question about the different occupancy reported by the occupancy calculator and nvprof, as well as of a question about the details of block-to-SM assignment in CUDA.

Consider a GPU with compute capability 6.1 and 15 SMs (GTX 1070, Pascal architecture, GP104 chip). And consider a small problem size of 2304 elements.

If we configure the kernel with 512 threads per block, each thread processing one element, we need 5 blocks to cover all the data. And the kernel is small enough that there are no restrictions on resource usage in terms of registers or shared memory.
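The launch configuration above follows from a simple ceiling division; a minimal sketch (the numbers are those from the question):

```python
# Problem and launch parameters from the question.
n_elements = 2304
threads_per_block = 512

# Ceiling division: launch enough blocks so every element gets a thread.
blocks = (n_elements + threads_per_block - 1) // threads_per_block
print(blocks)                      # 5 blocks
print(blocks * threads_per_block)  # 2560 threads launched, 256 of them idle
```

The last block therefore has only 2304 - 4*512 = 256 threads with useful work, which matters for the achieved-occupancy estimates below.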

So the theoretical occupancy is 1, because four blocks can reside concurrently on one SM (2048 threads), resulting in 2048/32 = 64 active warps (the maximum).
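The theoretical figure can be checked from the per-SM limits of compute capability 6.1 (2048 resident threads, warp size 32):

```python
# Per-SM limits for compute capability 6.1.
max_threads_per_sm = 2048
warp_size = 32
max_warps_per_sm = max_threads_per_sm // warp_size        # 64

threads_per_block = 512
blocks_per_sm = max_threads_per_sm // threads_per_block   # 4 resident blocks
active_warps = blocks_per_sm * threads_per_block // warp_size  # 64
theoretical_occupancy = active_warps / max_warps_per_sm
print(theoretical_occupancy)  # 1.0
```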

However, the achieved occupancy (reported by NVIDIA's visual profiler) is ~0.215, and this is probably due to how blocks are scheduled onto SMs. So how are blocks scheduled onto SMs in CUDA when there are fewer blocks than available SMs?

Option 1. Schedule 4 blocks of 512 threads on one SM and 1 block of 512 on another SM. In this case the occupancy would be (1 + 0.125) / 2 = 0.5625. I assume the last block has only 256 of its 512 threads active, to cover the final 256 elements of the array, and that it is allocated to the second SM. Thus, given the warp granularity, only 8 warps are active there.
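Under the question's assumption that only warps with useful work count as active, the Option 1 arithmetic works out as:

```python
warp_size = 32
max_warps_per_sm = 64  # compute capability 6.1

# SM 0: four full blocks of 512 threads -> 64 active warps.
occ_sm0 = (4 * 512 // warp_size) / max_warps_per_sm   # 1.0
# SM 1: one block, only 256 of its 512 threads do useful work -> 8 warps.
occ_sm1 = (256 // warp_size) / max_warps_per_sm       # 0.125
print((occ_sm0 + occ_sm1) / 2)  # 0.5625
```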

Option 2. Schedule each 512-thread block on a different SM. Since we have 15 SMs, why saturate just one SM with many blocks? In this case, we have 512/32 = 16 active warps per SM (except for the last one, which has only 256 active threads). So we have an achieved occupancy of 0.25 on four SMs and 0.125 on the last, which results in (0.25 + 0.25 + 0.25 + 0.25 + 0.125) / 5 = 0.225.
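The same accounting for Option 2, again following the question's assumptions (averaged over the five SMs that received a block):

```python
warp_size = 32
max_warps_per_sm = 64  # compute capability 6.1

# Four SMs each hold one full 512-thread block -> 16 active warps each.
occ_full = (512 // warp_size) / max_warps_per_sm     # 0.25
# The fifth SM's block has only 256 useful threads -> 8 active warps.
occ_partial = (256 // warp_size) / max_warps_per_sm  # 0.125
print((4 * occ_full + occ_partial) / 5)  # 0.225
```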

Option 2 is closer to the occupancy reported by the visual profiler, and in my opinion this is what happens behind the scenes. In any case, it is worth asking: how are blocks scheduled onto SMs in CUDA when there are fewer blocks than available SMs? Is this behavior documented?

- Please note that this is not homework. This is a real scenario in a project that uses various third-party libraries, which have a small number of items to process at some stage of a pipeline consisting of multiple kernels.
