Why are integer division and modulus not optimized in NVRTC?

I compiled the following kernel with NVRTC:

__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}

      

I know that integer division and modulo are very expensive on CUDA GPUs. However, I thought that such a power-of-2 division would be optimized into bit operations, until I discovered that it apparently was not:

__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}

      

kernel_B seems to be faster. With no other code in the kernel, launched with 1024 blocks of size 1024, nvprof shows that kernel_A takes 15.2 us on average and kernel_B takes 7.4 us on average. My guess is that NVRTC did not optimize the integer division and modulo.
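For unsigned thread indices the two kernels compute the same values, which is why a compiler is allowed to replace the division and modulo with a shift and a mask. A minimal host-side sketch of that equivalence follows; the function name shift_mask_matches_div_mod is hypothetical and used only for illustration, and the check says nothing about what code NVRTC actually emits.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the original post): returns true if, for
// every thread index in [0, block_size), the shift/mask form used in kernel_B
// yields the same warp and lane IDs as the divide/modulo form in kernel_A.
bool shift_mask_matches_div_mod(std::uint16_t block_size) {
    for (std::uint16_t idx = 0; idx < block_size; ++idx) {
        // kernel_A style: division and modulo by 32
        std::uint8_t warp_div = static_cast<std::uint8_t>(idx / 32);
        std::uint8_t lane_mod = static_cast<std::uint8_t>(idx % 32);
        // kernel_B style: shift by 5 and mask with 31
        std::uint8_t warp_shift = static_cast<std::uint8_t>(idx >> 5);
        std::uint8_t lane_mask = static_cast<std::uint8_t>(idx & 31);
        if (warp_div != warp_shift || lane_mod != lane_mask) {
            return false;
        }
    }
    return true;
}
```

Calling shift_mask_matches_div_mod(1024) covers every threadIdx.x value of the 1024-thread blocks used in the measurement.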

The result was obtained on a GeForce 750 Ti with CUDA 8.0, averaged over 100 calls. The compiler options passed to nvrtcCompileProgram() are -arch compute_50.

Is this expected?

1 answer


I did a thorough bug sweep of the codebase. It turns out that my application was built in DEBUG mode, which causes the additional flags -G and -lineinfo to be passed to nvrtcCompileProgram().

From the nvcc man page:



--device-debug (-G)

Generate debug information for device code. Disables all optimizations. Don't use for profiling; use -lineinfo instead.
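One way to avoid this is to make the choice between -G and -lineinfo explicit where the option list is built. The following is only a sketch under assumptions: the helper name make_nvrtc_options and its device_debug parameter are hypothetical, and the architecture is the compute_50 target from the question. It uses the long option spellings --gpu-architecture, --device-debug and --generate-line-info, which NVRTC accepts as equivalents of -arch, -G and -lineinfo.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not from the original post): builds the
// option list for nvrtcCompileProgram() so that --device-debug (-G), which
// disables device optimizations, is only added when explicitly requested.
std::vector<std::string> make_nvrtc_options(bool device_debug) {
    std::vector<std::string> opts;
    // Target architecture taken from the question (-arch compute_50).
    opts.push_back("--gpu-architecture=compute_50");
    if (device_debug) {
        // Full device debug info; do not use this build for profiling.
        opts.push_back("--device-debug");
    } else {
        // Line info only; optimizations stay enabled.
        opts.push_back("--generate-line-info");
    }
    return opts;
}
```

The strings would still have to be converted to an array of const char* before being handed to nvrtcCompileProgram(), which takes the option count and a pointer to that array.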
