How do I initialize CUDA so I can make correct runtime measurements?

In my application, I have implemented the same algorithm for the CPU and, with CUDA, for the GPU, and I need to measure the time it takes to execute the algorithm on each. I noticed that some time was being spent initializing CUDA in the GPU version of the algorithm, so I added cudaFree(0);

at the beginning of the program code, as recommended here for CUDA initialization, but the first run of the GPU algorithm still takes longer to execute than the second.

Are there any other CUDA related things that need to be initialized at the beginning in order to correctly measure the actual execution time of the algorithm?
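For reference, a minimal sketch of the setup described above. The kernel name and the sizes are hypothetical stand-ins for the actual algorithm; the point is the placement of cudaFree(0) before any timing:

```cuda
#include <cstdio>
#include <chrono>

// Hypothetical kernel standing in for the GPU version of the algorithm.
__global__ void myAlgorithmKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    cudaFree(0); // trigger lazy context creation before any timing

    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    auto start = std::chrono::high_resolution_clock::now();
    myAlgorithmKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize(); // kernel launches are asynchronous
    auto stop = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> ms = stop - start;
    printf("first run: %f ms\n", ms.count());

    cudaFree(d_data);
    return 0;
}
```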



1 answer


The lazy context initialization heuristic in the CUDA runtime API has changed subtly since the answer you linked to was written, in two ways that I know of:

  • cudaSetDevice() now initiates a context where it previously did not (hence the need for the cudaFree() call discussed in that answer)
  • Some device-related initialization that the runtime API used to perform when the context was initialized is now deferred until the first kernel launch


The only solution I know of for the second item is to launch the CUDA kernel you want to benchmark once as a "warm-up", to absorb the setup latency, and then time subsequent runs for benchmarking purposes.
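The warm-up pattern might look like the following sketch, which assumes a hypothetical kernel myAlgorithmKernel and already-allocated device data, and uses CUDA events so the measurement covers GPU execution only:

```cuda
// Launch once untimed to absorb the one-time setup latency,
// then time a subsequent launch with CUDA events.
myAlgorithmKernel<<<grid, block>>>(d_data, n);   // warm-up, result discarded
cudaDeviceSynchronize();

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myAlgorithmKernel<<<grid, block>>>(d_data, n);   // the run you actually measure
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```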

Alternatively, you can use the driver API, which gives you much finer control over when the initialization overhead is incurred during application startup.
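With the driver API, context creation is explicit rather than lazy, so all of the startup cost can be paid up front, before any timing begins. A minimal sketch (error checking omitted for brevity):

```cuda
#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;

    // All initialization cost is paid here, explicitly, at startup --
    // before any benchmarking begins.
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // ... load a module with cuModuleLoad(), launch with cuLaunchKernel(),
    // and time only the launches ...

    cuCtxDestroy(ctx);
    return 0;
}
```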
