Different occupancy reported by the calculator and nvprof

I am using nvprof to measure the achieved occupancy and it is reported as

Achieved Occupancy 0.344031 0.344031 0.344031

but using the occupancy calculator I get 75%.

Results:

Active Threads per Multiprocessor   1536
Active Warps per Multiprocessor 48
Active Thread Blocks per Multiprocessor 6
Occupancy of each Multiprocessor    75%


I am using 33 registers, 144 bytes of shared memory, 256 threads per block, on a device of compute capability 3.5.
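For reference, the calculator's 75% can be reproduced by hand. The sketch below is a simplified model of the per-SM limits for compute capability 3.5 (the limits and allocation granularities used here are assumptions for illustration; the occupancy calculator spreadsheet is the authoritative source and handles more cases):

```python
import math

# Assumed per-SM limits for compute capability 3.5 (Kepler)
MAX_WARPS_PER_SM = 64
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 16
REGS_PER_SM = 65536
REG_ALLOC_GRANULARITY = 256      # registers allocated per warp in units of 256
SMEM_PER_SM = 48 * 1024
SMEM_ALLOC_GRANULARITY = 256     # shared memory allocated in 256-byte chunks
WARP_SIZE = 32

def theoretical_occupancy(regs_per_thread, smem_per_block, threads_per_block):
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)

    # Register limit: round each warp's register footprint up to the allocation unit
    regs_per_warp = (math.ceil(regs_per_thread * WARP_SIZE / REG_ALLOC_GRANULARITY)
                     * REG_ALLOC_GRANULARITY)
    blocks_by_regs = (REGS_PER_SM // regs_per_warp) // warps_per_block

    # Shared memory limit
    smem_alloc = math.ceil(smem_per_block / SMEM_ALLOC_GRANULARITY) * SMEM_ALLOC_GRANULARITY
    blocks_by_smem = SMEM_PER_SM // smem_alloc

    # Thread-count limit
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block

    blocks = min(blocks_by_regs, blocks_by_smem, blocks_by_threads, MAX_BLOCKS_PER_SM)
    active_warps = blocks * warps_per_block
    return blocks, active_warps, active_warps / MAX_WARPS_PER_SM

print(theoretical_occupancy(33, 144, 256))  # -> (6, 48, 0.75)
```

With 33 registers per thread, each warp's allocation rounds up to 1280 registers, which caps the SM at 6 blocks of 8 warps, i.e. 48 of 64 warps, matching the 75% above.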

EDIT:

Also, I want to clarify something. The documentation at http://docs.nvidia.com/cuda/profiler-users-guide/#axzz30pb9tBTN says for

gld_efficiency

The ratio of the requested global memory bandwidth to the required global memory bandwidth, expressed as a percentage

So if it is 0%, does that mean my kernel performs no global memory transfers?



1 answer


You need to understand that the occupancy calculator provides the maximum theoretical occupancy that a particular kernel can achieve, based only on the resource requirements of that kernel. It does not (and cannot) say anything about how much of that theoretical occupancy the code will actually achieve at runtime.

The profiling tools, on the other hand, infer the achieved occupancy from measured hardware counters. According to this document, the achieved occupancy number you are asking about is calculated as

(active_warps / active_cycles) / MAX_WARPS_PER_SM

i.e. it measures the number of warps actually active on the SMs during the kernel's execution and computes the achieved occupancy from that.
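As an illustration of the formula (the counter values below are made up, not the OP's actual data), a sustained average of about 22 active warps per cycle on a device with a 64-warp-per-SM limit gives roughly the 0.344 that nvprof reported:

```python
# Hypothetical counter values, chosen only to illustrate the calculation
MAX_WARPS_PER_SM = 64           # hardware limit for compute capability 3.5
active_warps = 22_018_000       # sum over cycles of warps active each cycle
active_cycles = 1_000_000       # cycles with at least one active warp

achieved = (active_warps / active_cycles) / MAX_WARPS_PER_SM
print(f"{achieved:.6f}")        # ~0.344031
```

In other words, a kernel with 75% theoretical occupancy may still average far fewer resident-and-active warps per cycle than the theoretical limit allows.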

There can be many reasons why a kernel does not reach its theoretical occupancy, and (before you ask), no, I cannot tell you why your kernel is not reaching its theoretical occupancy. But the Visual Profiler can. If this is important to you, I suggest you take a look at the automated performance analysis features available in the CUDA 5/6 Visual Profiler to better understand the performance of your code.
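If you only want the headline numbers from the command line rather than the full Visual Profiler analysis, nvprof can report the relevant metrics directly (the binary name here is a placeholder):

```shell
# Report achieved occupancy and global load efficiency for every kernel launch
nvprof --metrics achieved_occupancy,gld_efficiency ./myapp
```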

It is also worth noting that occupancy should only be viewed as a rough indicator of potential code performance, and high theoretical occupancy does not always translate into high performance. Instruction level parallelism and latency hiding strategies can also be very effective at achieving high performance even at low occupancy. There is a considerable body of work on this, most notably Vasily Volkov's seminal GTC 2010 talk "Better Performance at Lower Occupancy".
