Failed to create handle cudnn: CUDNN_STATUS_INTERNAL_ERROR

I have installed the TensorFlow 1.0.1 GPU version on my MacBook Pro with a GeForce GT 750M. CUDA 8.0.71 and cuDNN 5.1 are also installed. I am running TF code that works fine with the CPU-only build, but with the GPU version I get this error (from time to time it does work):

name: GeForce GT 750M
major: 3 minor: 0 memoryClockRate (GHz) 0.9255
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 67.48MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 67.48M (70754304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Training...

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Abort trap: 6

      

What's going on here? Is this a bug in TensorFlow? Please help.

Here is the GPU memory space when I run the python code:

Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 83.477 of 2047.6 MB (i.e. 4.08%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 1.1016 of 2047.6 MB (i.e. 0.0538%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 91.477 of 2047.6 MB (i.e. 4.47%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 22.852 of 2047.6 MB (i.e. 1.12%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 36.121 of 2047.6 MB (i.e. 1.76%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 71.477 of 2047.6 MB (i.e. 3.49%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free
MacBook-Pro:cuda-smi-master xxxxxx$ ./cuda-smi
Device 0 [PCIe 0:1:0.0]: GeForce GT 750M (CC 3.0): 67.477 of 2047.6 MB (i.e. 3.3%) Free

      

+9




13 replies


I managed to get it to work by deleting the .nv folder in my home folder:



sudo rm -rf ~/.nv/

      

+15




As strange as it sounds, try restarting your computer and running the model again. If the model then runs fine, the problem is with your GPU's memory allocation and TensorFlow's management of that available memory. On Windows 10 I had two terminals open, and closing one of them solved my problem. There may be open (zombie) processes that still hold memory.



+5




In my case, after checking the cuDNN and CUDA versions, I found that my GPU was running out of memory. Watching watch -n 0.1 nvidia-smi in another terminal, the moment the error

2019-07-16 19:54:05.122224: E tensorflow/stream_executor/cuda/cuda_dnn.cc:334] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

appears is exactly when the GPU memory is almost full.

So I set a limit on how much GPU memory TensorFlow may use. Since I am using tf.keras, I add the following code at the beginning of my program:

config = tf.ConfigProto()
# Let this process use at most 90% of the GPU memory
config.gpu_options.per_process_gpu_memory_fraction = 0.9
tf.keras.backend.set_session(tf.Session(config=config))

      

Then the problem is solved!

You can also reduce your batch_size or use smarter ways to feed your training data (for example tf.data.Dataset with caching), as in the sketch below. I hope my answer can help someone else.
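A minimal sketch of the tf.data idea, assuming a TF 1.x version whose tf.keras accepts tf.data datasets in model.fit (the array shapes, layer sizes and batch size here are made-up placeholders):

import numpy as np
import tensorflow as tf

# Toy in-memory data; replace with your real features and labels.
features = np.random.rand(1000, 32).astype(np.float32)
labels = np.random.randint(0, 10, size=(1000,)).astype(np.int64)

dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .cache()     # keep the parsed data in memory after the first pass
           .shuffle(1000)
           .batch(32)   # a smaller batch size also lowers peak GPU memory
           .repeat())

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# tf.keras in TF 1.x accepts a tf.data.Dataset directly when steps_per_epoch is given.
model.fit(dataset, steps_per_epoch=1000 // 32, epochs=2)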

+5




In my case, it seems the problem was caused by a mismatch between the TensorFlow and cuDNN versions. The following worked for me (I was working on Ubuntu 16.04 with an NVIDIA Tesla K80 on Google Cloud; TensorFlow 1.5 finally worked with cuDNN 7.0.4 and CUDA 9.0):

  • Remove cuDNN completely:

    sudo rm /usr/local/cuda/include/cudnn.h
    sudo rm /usr/local/cuda/lib64/libcudnn*
    
          

    After that, import tensorflow should throw an error.

  • Download the appropriate version of cuDNN. Note that there is cuDNN 7.0.4 for CUDA 9.0 and cuDNN 7.0.4 for CUDA 8.0. You should choose the one that matches your CUDA version. Be careful during this step, otherwise you will get a similar problem again. Install cuDNN as usual:

    tar -xzvf cudnn-9.0-linux-x64-v7.tgz
    cd cuda
    sudo cp -P include/cudnn.h /usr/include
    sudo cp -P lib64/libcudnn* /usr/lib/x86_64-linux-gnu/
    sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcudnn*
    
          

    In this example, I've installed cuDNN 7.0.x for CUDA 9.0 (the exact x doesn't really matter). Mind your own CUDA version.

  • Restart your computer. In my case, the problem went away. If the error still occurs, consider installing a different version of TensorFlow.

Hope this helps someone.
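To double-check which cuDNN version actually ended up installed, a small sketch like this can read the version macros straight from the header (adjust CUDNN_HEADER to wherever you copied cudnn.h; for cuDNN 7.x the macros live in cudnn.h itself):

import re

# Path assumption: cudnn.h copied to /usr/include as in the steps above.
CUDNN_HEADER = "/usr/include/cudnn.h"

version = {}
with open(CUDNN_HEADER) as f:
    for line in f:
        m = re.match(r"#define CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", line)
        if m:
            version[m.group(1)] = m.group(2)

print("cuDNN version: {MAJOR}.{MINOR}.{PATCHLEVEL}".format(**version))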

+3




I also got the same error, and I solved the problem. My system properties were as follows:

  • Operating system: Ubuntu 14.04
  • GPU: GTX 1050Ti
  • Nvidia Driver: 375.66
  • Tensorflow: 1.3.0
  • Cudnn: 6.0.21 (cudnn-8.0-linux-x64-v6.0.deb)
  • Cuda: 8.0.61
  • Keras: 2.0.8

How I solved the problem:

  • I copied the cuDNN files to their respective locations (/usr/local/cuda/include and /usr/local/cuda/lib64)
  • I set the environment variables:

    * export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
    * export CUDA_HOME=/usr/local/cuda
    
          

  • I also ran sudo ldconfig -v to cache the shared libraries for the runtime linker.

I hope these steps help those who are about to go crazy too.
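As a quick sanity check that the dynamic linker can now find cuDNN, a tiny sketch like this (the .so.6 suffix matches the cuDNN 6 install above; adjust it for other versions) loads the library the same way TensorFlow would:

import ctypes

# Succeeds silently if LD_LIBRARY_PATH / ldconfig are set up correctly,
# and raises OSError if the library still cannot be found.
ctypes.CDLL("libcudnn.so.6")
print("libcudnn.so.6 found and loadable")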

+2




This is a cuDNN compatibility problem. Check what you installed for the GPU, for example tensorflow-gpu: which version is it? Is that version compatible with your cudnn version, and is the installed cudnn the correct version for your cuda?

I noticed, for example:

  • cuDNN v7.0.3 for Cuda 7.*
  • cuDNN v7.1.2 for Cuda 9.0
  • cuDNN v7.3.1 for Cuda 9.1

and so on.

So also check that your TensorFlow version is correct for your CUDA configuration. For example, using tensorflow-gpu:

  • TF v1.4 for cudnn 7.0.*
  • TF v1.7 and above for cudnn 9.0.*

etc.

Therefore, all you have to do is reinstall the appropriate version of cudnn. Hope it helps!
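A quick way to see the TensorFlow side of the equation (a sketch, assuming TensorFlow 1.x; the CUDA and cuDNN versions themselves are easiest to read from nvcc --version and cudnn.h):

import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
# True only if a GPU can actually be initialised with the installed CUDA/cuDNN.
print("GPU available:", tf.test.is_gpu_available())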

+1




For anyone else who ran into this problem in a Jupyter notebook:

I had two Jupyter notebooks running. After closing one of them, the problem was resolved.

+1




Adding the following code worked for me:

config = tf.ConfigProto()
# Allocate GPU memory gradually instead of grabbing almost all of it at session creation
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

      

In my environment there is no mismatch between the cuDNN and CUDA versions. OS: Ubuntu 18.04; TensorFlow: 1.14; cuDNN: 7.6; CUDA: 10.1 (driver 418.87.00).

+1




I faced the same problem too:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1050
major: 6 minor: 1 memoryClockRate (GHz) 1.493 pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.60GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:532] Check failed:  stream->parent()->GetConvolveAlgorithms(&algorithms)

Aborted (core dumped)

      

But in my case, running the command with sudo worked perfectly.

0




I ran into this problem when I accidentally installed the CUDA 9.2 build of libcudnn7 (libcudnn7_7.2.1.38-1+cuda9.2_amd64.deb) instead of libcudnn7_7.0.5.15-1+cuda9.0_amd64.deb on a system with CUDA 9.0 installed.

I got there because I had previously had CUDA 9.2 installed and then downgraded to CUDA 9.0, and libcudnn is obviously version specific.

0




For me, re-running the CUDA installation as described here solved the problem:

# Add NVIDIA package repository
sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
sudo apt install ./cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt update

# Install CUDA and tools. Include optional NCCL 2.x
sudo apt install cuda9.0 cuda-cublas-9-0 cuda-cufft-9-0 cuda-curand-9-0 \
    cuda-cusolver-9-0 cuda-cusparse-9-0 libcudnn7=7.2.1.38-1+cuda9.0 \
    libnccl2=2.2.13-1+cuda9.0 cuda-command-line-tools-9-0


      

During the setup, apt-get downgraded cudnn7, which I think is the culprit here. It had probably been updated by accident (by an apt-get upgrade) to a version that is incompatible with some other part of the system.

0




Please remember to close your TensorBoard terminal/cmd or any other terminals that are interacting with the directory. Then you can resume your training; everything should work.

0




This is related to the fraction of GPU memory made available for loading GPU resources and creating the cuDNN handle, controlled by per_process_gpu_memory_fraction. Reducing this memory fraction will resolve the error on its own.

sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
    allow_soft_placement=True)

with tf.Session(config=sess_config) as sess:
    sess.run([whatever])

      

Use as small a memory fraction as works for you. (In the code above I use 0.7; you can start with 0.3 or less and then increase until you hit the same error again, which gives you your limit.) Pass it to your tf.Session(), tf.train.MonitoredTrainingSession(), or the supervisor's sv.managed_session() as the config.

This should allow your GPU to create a cuDNN handle for your TensorFlow code.
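For example, with tf.train.MonitoredTrainingSession the same config object is passed the same way (a minimal sketch; the tiny counter variable only stands in for your real training op):

import tensorflow as tf

sess_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.7),
    allow_soft_placement=True)

# Hypothetical stand-in for a real model's training op.
counter = tf.Variable(0.0, name="counter")
train_op = counter.assign_add(1.0)

# MonitoredTrainingSession takes the same config argument as tf.Session
# and handles variable initialisation itself.
with tf.train.MonitoredTrainingSession(config=sess_config) as sess:
    print(sess.run(train_op))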

0

