CUDA hangs randomly on cudaDeviceSynchronize()

I have a piece of GPU code that has been working for a while. I recently made a few minor algorithmic changes, but they did not touch the CUDA part.

I am running production runs on a set of three Xeon machines, each with a GTX 780 Ti. Each run takes about three minutes, but so far there have been two cases (out of about 5000 runs) where the application hung for several hours (until it was killed). Both hangs occurred on the same machine.

The second time it happened, I attached GDB to the running process and got a backtrace that looks like this:

#0  0x00007fff077ffa01 in clock_gettime ()
#1  0x0000003e1ec03e46 in clock_gettime () from /lib64/librt.so.1
#2  0x00002b5b5e302a1e in ?? () from /usr/lib64/libcuda.so
#3  0x00002b5b5dca2294 in ?? () from /usr/lib64/libcuda.so
#4  0x00002b5b5dbbaa4f in ?? () from /usr/lib64/libcuda.so
#5  0x00002b5b5dba8cda in ?? () from /usr/lib64/libcuda.so
#6  0x00002b5b5db94c4f in cuCtxSynchronize () from /usr/lib64/libcuda.so
#7  0x000000000041cd8d in cudart::cudaApiDeviceSynchronize() ()
#8  0x0000000000441269 in cudaDeviceSynchronize ()
#9  0x0000000000408124 in main (argc=11, argv=0x7fff076fa1d8) at src/fraps3d.cu:200
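
For context, the code around that line looks roughly like this (a simplified sketch; the real kernel, sizes, and names are different, but the per-iteration launch-then-synchronize pattern is the same):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel: the real one does the actual per-iteration work.
__global__ void step_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   // placeholder for the real computation
}

int main(int argc, char **argv)
{
    const int n = 1 << 20;                       // illustrative problem size
    float *d_data = NULL;
    cudaMalloc(&d_data, n * sizeof(float));

    for (int iter = 0; iter < 20000; ++iter)     // the main cycle
    {
        step_kernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // This is the call (src/fraps3d.cu:200 in the real code)
        // that occasionally never returns.
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            fprintf(stderr, "iteration %d: %s\n", iter, cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return 0;
}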


I manually did "frame 8" and then "return" in GDB to force cudaDeviceSynchronize() to return, which just made the program get stuck on the next call to cudaDeviceSynchronize(). After that, it got stuck again on the following sync call, each time with the same frames 0 to 8. Oddly, the failure occurred in the middle of the main loop, on the 5000th iteration.
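
Concretely, the GDB sequence was roughly the following (reconstructed from memory; the pid is illustrative, and GDB's return command forces the selected frame to return to its caller immediately):

(gdb) attach 12345     # attach to the hung process (pid is illustrative)
(gdb) bt               # shows the backtrace above
(gdb) frame 8          # select the cudaDeviceSynchronize() frame
(gdb) return           # force that frame to return to main()
(gdb) continue         # the program then hangs on the next sync call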

After killing the hung process, the following jobs start and run correctly on the same machine, so it is not a systemic failure of the execution host.

Any ideas on what might be causing a random hang like this?

I am compiling and running with CUDA toolkit V6.0.1 and driver version 331.62.
