Why are integer division and modulus not optimized in NVRTC?
I compiled the following kernel with NVRTC:
__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}
I know that integer division and modulo are very expensive on CUDA GPUs. However, I assumed that this kind of division by a power of 2 would be optimized into bit operations, until I discovered that it was not:
__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}
kernel_B seems to be faster. With all other code in the kernel omitted, and launched with 1024 blocks of size 1024, nvprof shows that kernel_A takes 15.2 us on average, while kernel_B takes 7.4 us on average. My guess is that NVRTC did not optimize the integer division and modulo.
The results were obtained on a GeForce 750 Ti with CUDA 8.0, averaged over 100 calls. The compiler options passed to nvrtcCompileProgram() are -arch compute_50.
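For unsigned operands, dividing by 32 and taking the remainder modulo 32 give the same values as shifting right by 5 and masking with 31, which is why an optimizing compiler is free to substitute one for the other. The following host-side C++ sketch checks this for every thread index in a block. The function name splits_agree is hypothetical and is not part of the original kernels.

```cpp
// Hypothetical helper (not from the original post): returns true if the
// division/modulo split used in kernel_A and the shift/mask split used in
// kernel_B produce the same warp and lane index for every idx in
// [0, block_size).
bool splits_agree(unsigned block_size) {
    for (unsigned i = 0; i < block_size; ++i) {
        unsigned short idx = static_cast<unsigned short>(i);

        // Same expressions as in kernel_A.
        unsigned char warp_div = idx / 32;
        unsigned char lane_mod = idx % 32;

        // Same expressions as in kernel_B.
        unsigned char warp_shift = idx >> 5;
        unsigned char lane_mask = idx & 31;

        if (warp_div != warp_shift || lane_mod != lane_mask) {
            return false;
        }
    }
    return true;
}
```

Calling splits_agree(1024) covers the block size used in the measurement above. Since both forms compute identical values, a timing difference between the two kernels points at how the code was compiled rather than at the arithmetic itself.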
Is this expected?
I did a thorough bug sweep of the codebase. It turns out my application was built in DEBUG mode, which causes the additional flags -G and -lineinfo to be passed to nvrtcCompileProgram().
From the nvcc man page:
--device-debug (-G)
Generate debug information for device code. Disables all optimizations. Don't use for profiling; Use -lineinfo instead.
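One way to avoid this is to build the NVRTC option list so that -G is added only when device debugging is explicitly requested. The following is a minimal host-side C++ sketch under stated assumptions: the helper names make_nvrtc_options and as_c_strings are hypothetical, and the architecture is written in the long form --gpu-architecture=compute_50 rather than the -arch compute_50 spelling from the question.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): builds the option list
// for NVRTC. -lineinfo is always present so that profilers can map back to
// source lines; -G is appended only when device debugging is requested,
// because it disables all device-code optimizations.
std::vector<std::string> make_nvrtc_options(bool device_debug) {
    std::vector<std::string> opts = {
        "--gpu-architecture=compute_50",  // assumed target, as in the question
        "-lineinfo",
    };
    if (device_debug) {
        opts.push_back("-G");
    }
    return opts;
}

// Hypothetical helper: converts the option strings into the array of C
// strings that nvrtcCompileProgram(prog, numOptions, options) expects.
// The returned pointers are valid only while `opts` is alive and unchanged.
std::vector<const char*> as_c_strings(const std::vector<std::string>& opts) {
    std::vector<const char*> out;
    out.reserve(opts.size());
    for (const auto& o : opts) {
        out.push_back(o.c_str());
    }
    return out;
}
```

With an existing nvrtcProgram named prog, the result would be passed as nvrtcCompileProgram(prog, static_cast<int>(c_opts.size()), c_opts.data()), where c_opts is the vector returned by as_c_strings. For timing runs, make_nvrtc_options(false) keeps optimizations enabled.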