Understanding CUDA Profiler Output (nvprof)
I'm looking at the following nvprof output and trying to make sense of the numbers:
==2906== Profiling result:
Time(%) Time Calls Avg Min Max Name
23.04% 10.9573s 16436 666.67us 64.996us 1.5927ms sgemm_sm35_ldg_tn_32x16x64x8x16
22.28% 10.5968s 14088 752.18us 612.13us 1.6235ms sgemm_sm_heavy_nt_ldg
18.09% 8.60573s 14088 610.86us 513.05us 1.2504ms sgemm_sm35_ldg_nn_128x8x128x16x16
16.48% 7.84050s 68092 115.15us 1.8240us 503.00us void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
...
0.25% 117.53ms 4744 24.773us 896ns 11.803ms [CUDA memcpy HtoD]
0.23% 107.32ms 37582 2.8550us 1.8880us 8.9556ms [CUDA memcpy DtoH]
...
==2906== API calls:
Time(%) Time Calls Avg Min Max Name
83.47% 41.8256s 42326 988.18us 16.923us 13.332ms cudaMemcpy
9.27% 4.64747s 326372 14.239us 10.846us 11.601ms cudaLaunch
1.49% 745.12ms 1502720 495ns 379ns 1.7092ms cudaSetupArgument
1.37% 688.09ms 4702 146.34us 879ns 615.09ms cudaFree
...
When it comes to optimizing memory access, which numbers do I actually need to consider when comparing different implementations? The profiling results suggest that memcpy takes only 117.53 + 107.32 ms (both directions combined), but then the API call cudaMemcpy shows 41.8256 s, which is much more. Also, the min/avg/max columns don't add up between the top and bottom output blocks.
Why is there a difference, and which is the "true" number that matters when optimizing memory transfers?
EDIT: Second question: is there a way to find out what is calling, e.g., axpy_kernel_val (and how many times)?
The difference in total time is because work is launched asynchronously on the GPU. If you have a long-running kernel (or a set of kernels) without explicit synchronization to the host and follow it with a call to cudaMemcpy, the cudaMemcpy call will be issued long before the kernels have finished. The total time of an API call is measured from the moment it is issued to the moment it completes, so it overlaps with the execution of the kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).
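A minimal sketch of the effect (the kernel name, sizes, and timings here are purely illustrative, not taken from your application): the kernel launch returns immediately, so a cudaMemcpy issued right after it blocks until the kernel finishes as well as the copy, and its API time covers both.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical long-running kernel, for illustration only.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 10000; ++k)
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main() {
    const int n = 1 << 20;
    float *d_data, *h_data = new float[n];
    cudaMalloc(&d_data, n * sizeof(float));

    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);  // returns immediately

    // Without a sync, this cudaMemcpy blocks until the kernel finishes
    // AND the copy completes -- its API time includes both.
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    auto t1 = std::chrono::high_resolution_clock::now();

    // To time only the transfer, synchronize first:
    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();  // wait for the kernel before timing the copy
    auto t2 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    auto t3 = std::chrono::high_resolution_clock::now();

    printf("memcpy after kernel (no sync): %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    printf("memcpy alone (after sync):     %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count());

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}
```

The first timing corresponds to what the API-calls section reports for cudaMemcpy; the second corresponds to what the "[CUDA memcpy DtoH]" row in the profiling results measures.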
The difference in minimum time is due to launch overhead. While the API profiling accounts for all of the launch overhead, the kernel timing contains only a small fraction of it. The launch overhead can be ~10-20 us, as you can see here.
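A rough way to observe that launch overhead yourself (a sketch; the exact figure depends on GPU, driver, and OS): launch an empty kernel many times and time the host-side round trip. nvprof would report near-zero GPU time for this kernel, while the host still pays per-launch overhead.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}  // does nothing on the GPU

int main() {
    // Warm up: the first launch pays one-time initialization costs.
    empty_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    const int iters = 1000;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();  // include the full launch round trip
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                / (double)iters;
    printf("avg launch + sync overhead: %.2f us\n", us);
    return 0;
}
```

This is the gap you see between the minimum kernel time in the profiling results and the minimum cudaLaunch time in the API-calls section.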
In general, the API calls section tells you what the CPU is doing, while the profiling results tell you what the GPU is doing. In this case I'd argue you are underutilizing the CPU, since the cudaMemcpy is arguably issued too early and CPU cycles are wasted blocking in it. In practice, however, it is often hard or impossible to get anything useful out of these spare cycles.