How are registers allocated in CUDA compilation?

The number of registers per thread is said to be important for CUDA optimization, and an upper bound on it can be set with the nvcc option -maxrregcount=N. I assumed the register count could be determined simply by counting the local variables (and possibly the passed parameters) in the kernel, but the count reported by nvcc --ptxas-options=-v is far higher than what I calculated that way. Can anyone shed some light on this?
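
To make this concrete, here is a minimal sketch of what I mean (a hypothetical kernel, not my actual code) together with the compile line:

    // axpy.cu - a trivial kernel with one named local variable and four parameters
    __global__ void axpy(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

Compiled with the verbose ptxas report and a register cap:

    nvcc --ptxas-options=-v -maxrregcount=16 axpy.cu

The "Used N registers" line that ptxas prints is larger than my naive count of the named variables.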

1 answer


The maximum number of registers per thread on devices with Compute Capability 2.x is 63. Each streaming multiprocessor (SM) contains a limited pool of registers, which is divided among the threads of the blocks resident on it. If you have a small number of threads per block, you can almost be sure each thread will receive the maximum number of registers, but if there are many threads, each one receives fewer (it all depends on the total register demand of the threads and the tailoring needed for each application).
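
As a rough sketch of that division (query your own device rather than trusting fixed constants; the per-thread hardware cap of 63 on Compute Capability 2.x still applies on top of this budget), you can read the per-SM register pool from the runtime API and see how the per-thread budget shrinks as the block grows:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // device 0
        printf("32-bit registers available per block: %d\n", prop.regsPerBlock);
        // If one block occupies the whole multiprocessor, each thread's
        // register budget shrinks as the block grows:
        for (int threads = 128; threads <= 1024; threads *= 2)
            printf("%5d threads/block -> at most %d registers per thread\n",
                   threads, prop.regsPerBlock / threads);
        return 0;
    }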

Any variables that cannot be kept in registers, because there are none left, go to local memory, which resides in the device's global memory and therefore has high latency compared to registers. This is called register spilling; you can read about it here: http://www.ece.umn.edu/~wxiao/ee5940/lecture8-2.pdf
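
A minimal way to provoke this and see it in the compiler's report (the kernel is illustrative; exact byte counts depend on the architecture and toolkit version):

    // spill.cu - the dynamically indexed per-thread array cannot live in
    // registers, so the compiler places it in local memory
    __global__ void spilly(const int *idx, float *out)
    {
        float buf[256];  // far more state than the 63-register limit allows
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = 0; i < 256; ++i)
            buf[i] = tid * i;
        out[tid] = buf[idx[tid] % 256];  // index unknown at compile time
    }

Compiling with nvcc -arch=sm_21 --ptxas-options=-v spill.cu (use an -arch your toolkit still supports) makes ptxas report the local-memory and spill byte counts alongside the register count; lowering -maxrregcount has the same effect on kernels that would otherwise fit.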



It is very important to try to keep all variables in registers. New CUDA developers often underestimate the consequences of a register spill. I ran some tests in which I artificially doubled the amount of memory used per thread, causing a register spill with no other computational overhead, and the computation time increased by a factor of 5! In small CUDA applications the available registers are usually sufficient. You can find out how many variables go to local memory by following the instructions in the PDF above.
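
A sketch of how such a test could be set up (this is a reconstruction, not my original benchmark; the constants are illustrative and the measured ratio will vary by device):

    #include <cstdio>
    #include <cuda_runtime.h>

    // N_LOCAL live values per thread: 16 fits in registers on Compute
    // Capability 2.x, 128 exceeds the 63-register cap and spills.
    template <int N_LOCAL>
    __global__ void work(float *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc[N_LOCAL];
        for (int i = 0; i < N_LOCAL; ++i)
            acc[i] = tid + i;
        float s = 0.0f;
        for (int r = 0; r < 1000; ++r)
            for (int i = 0; i < N_LOCAL; ++i) {
                acc[i] = acc[i] * 1.0001f + s;  // keeps all N_LOCAL values live
                s += acc[i];
            }
        out[tid] = s;  // prevent the compiler from discarding the work
    }

    template <int N_LOCAL>
    static float timeKernel(float *out)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        work<N_LOCAL><<<256, 256>>>(out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        float *out;
        cudaMalloc(&out, 256 * 256 * sizeof(float));
        printf("in registers: %.3f ms\n", timeKernel<16>(out));
        printf("spilled:      %.3f ms\n", timeKernel<128>(out));
        cudaFree(out);
        return 0;
    }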
