Understanding CUDA Profiler Output (nvprof)
I'm looking at the following nvprof output and trying to make sense of the numbers:
==2906== Profiling result:
Time(%) Time Calls Avg Min Max Name
23.04% 10.9573s 16436 666.67us 64.996us 1.5927ms sgemm_sm35_ldg_tn_32x16x64x8x16
22.28% 10.5968s 14088 752.18us 612.13us 1.6235ms sgemm_sm_heavy_nt_ldg
18.09% 8.60573s 14088 610.86us 513.05us 1.2504ms sgemm_sm35_ldg_nn_128x8x128x16x16
16.48% 7.84050s 68092 115.15us 1.8240us 503.00us void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
...
0.25% 117.53ms 4744 24.773us 896ns 11.803ms [CUDA memcpy HtoD]
0.23% 107.32ms 37582 2.8550us 1.8880us 8.9556ms [CUDA memcpy DtoH]
...
==2906== API calls:
Time(%) Time Calls Avg Min Max Name
83.47% 41.8256s 42326 988.18us 16.923us 13.332ms cudaMemcpy
9.27% 4.64747s 326372 14.239us 10.846us 11.601ms cudaLaunch
1.49% 745.12ms 1502720 495ns 379ns 1.7092ms cudaSetupArgument
1.37% 688.09ms 4702 146.34us 879ns 615.09ms cudaFree
...
When it comes to optimizing memory access, which numbers do I actually need to consider when comparing different implementations? The profiling results suggest that memcpy takes only 117.53 + 107.32 ms (both directions combined), but then the API call cudaMemcpy shows 41.8256 s, which is much more. Also, the min/avg/max columns don't add up between the top and bottom output blocks.
Why is there a difference, and which is the "true" number that matters when optimizing memory transfers?
EDIT: Second question: is there a way to find out what is calling, e.g., axpy_kernel_val (and how many times)?
The difference in total time is because work is launched asynchronously on the GPU. If you have a long-running kernel (or a set of kernels) without explicit synchronization to the host and follow it with a call to cudaMemcpy, the cudaMemcpy call will be issued long before the kernels have finished. The total time of an API call is measured from the moment it is issued to the moment it completes, so it overlaps with the execution of the kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).
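A minimal sketch of the effect (the kernel name, sizes, and timings here are purely illustrative, not taken from your application): the kernel launch returns immediately, so a cudaMemcpy issued right after it blocks until the kernel finishes as well as the copy, and its API time covers both.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical long-running kernel, for illustration only.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *d_data, *h_data = new float[n];
    cudaMalloc(&d_data, n * sizeof(float));

    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);  // returns immediately

    // Without a sync, this cudaMemcpy blocks until the kernel finishes
    // AND the copy completes -- its API time includes both.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::high_resolution_clock::now();

    // To time only the transfer, synchronize first:
    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();  // wait for the kernel before timing the copy
    auto t2 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    auto t3 = std::chrono::high_resolution_clock::now();

    printf("memcpy after kernel (no sync): %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("memcpy alone (after sync):     %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count());

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```

The first timing corresponds to what the API-calls section reports for cudaMemcpy; the second corresponds to what the "[CUDA memcpy DtoH]" row in the profiling results measures.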
The difference in minimum time is due to launch overhead. While the API profiling accounts for all of the launch overhead, the kernel timing contains only a small fraction of it. The launch overhead can be ~10-20 us, as you can see here.
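A rough way to observe that launch overhead yourself (a sketch; the exact figure depends on GPU, driver, and OS): launch an empty kernel many times and time the host-side round trip. nvprof would report near-zero GPU time for this kernel, while the host still pays per-launch overhead.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}  // does nothing on the GPU

int main() {
    // Warm up: the first launch pays one-time initialization costs.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int iters = 1000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // include the full launch round trip
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                / (double)iters;
    printf("avg launch + sync overhead: %.2f us\n", us);
    return 0;
}
```

This is the gap you see between the minimum kernel time in the profiling results and the minimum cudaLaunch time in the API-calls section.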
In general, the API calls section tells you what the CPU is doing, while the profiling results tell you what the GPU is doing. In this case I'd argue you are underutilizing the CPU, since the cudaMemcpy is arguably issued too early and CPU cycles are wasted blocking in it. In practice, however, it is often hard or impossible to get anything useful out of these spare cycles.