How do I initialize CUDA so I can make correct runtime measurements?

In my application, I have implemented the same algorithm for the CPU and, with CUDA, for the GPU, and I need to measure the time it takes to execute the algorithm on each. I noticed that some time was being spent initializing CUDA in the GPU version of the algorithm, so I added cudaFree(0);

at the beginning of the program code, as recommended here for CUDA initialization, but the first run of the GPU algorithm still takes longer to execute than the second.

Are there any other CUDA related things that need to be initialized at the beginning in order to correctly measure the actual execution time of the algorithm?
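For reference, a minimal sketch of the setup described above. The kernel name and the sizes are hypothetical stand-ins for the actual algorithm; the point is the placement of cudaFree(0) before any timing:

```cuda
#include <cstdio>
#include <chrono>

// Hypothetical kernel standing in for the GPU version of the algorithm.
__global__ void myAlgorithmKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaFree(0); // trigger lazy context creation before any timing

    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    auto start = std::chrono::high_resolution_clock::now();
    myAlgorithmKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize(); // kernel launches are asynchronous
    auto stop = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms = stop - start;
    printf("first run: %f ms\n", ms.count());

    cudaFree(d_data);
    return 0;
}
```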



1 answer


The lazy context initialization heuristic in the CUDA runtime API has changed subtly since the answer you linked to was written, in two ways that I know of:

  • cudaSetDevice() now initiates a context where it previously did not (hence the need for the cudaFree() call discussed in that answer)
  • Some device-related initialization that the runtime API used to perform when the context was initialized is now deferred until the first kernel launch


The only solution I know of for the second item is to launch the CUDA kernel you want to benchmark once as a "warm-up", to absorb the setup latency, and then time subsequent runs for benchmarking purposes.
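The warm-up pattern might look like the following sketch, which assumes a hypothetical kernel myAlgorithmKernel and already-allocated device data, and uses CUDA events so the measurement covers GPU execution only:

```cuda
// Launch once untimed to absorb the one-time setup latency,
// then time a subsequent launch with CUDA events.
myAlgorithmKernel<<<grid, block>>>(d_data, n);   // warm-up, result discarded
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myAlgorithmKernel<<<grid, block>>>(d_data, n);   // the run you actually measure
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```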

Alternatively, you can use the driver API, which gives you much finer control over when the initialization overhead is incurred during application startup.
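With the driver API, context creation is explicit rather than lazy, so all of the startup cost can be paid up front, before any timing begins. A minimal sketch (error checking omitted for brevity):

```cuda
#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;

    // All initialization cost is paid here, explicitly, at startup --
    // before any benchmarking begins.
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // ... load a module with cuModuleLoad(), launch with cuLaunchKernel(),
    // and time only the launches ...

    cuCtxDestroy(ctx);
    return 0;
}
```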
