All Kubernetes pods go down periodically

I've been running a Kubernetes cluster for a while now, but I haven't been able to keep it stable. My cluster consists of four nodes: two masters and two workers. All nodes run on a single physical server, which in turn runs VMware vSphere 6.5. Each node runs CoreOS stable (1353.7.0), and I am running Kubernetes/Hyperkube v1.6.4 with Calico for networking. I followed the steps in this tutorial.
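For reference, this is roughly how I gathered the version information above (the output is specific to my nodes, of course):

kubectl version                      # client and server (Hyperkube v1.6.4) versions
kubectl get nodes -o wide            # node status and addresses for all four nodes
docker version                       # Docker release that ships with this CoreOS image
cat /etc/os-release                  # CoreOS 1353.7.0 on every node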

What happens is that the cluster will run smoothly for a few hours or days. Then, all of a sudden (for no discernible reason, as far as I can tell), all my pods go to Pending and stay that way. Any hosted services are no longer available. After a while (usually 5 to 10 minutes) it seems to recover, after which it starts recreating all my pods and tries (but fails) to terminate all of the previously running pods. Some of the newly created containers come up, but initially have no Internet connection.
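In case it helps, these are the kinds of commands I run while the cluster is in that state ("my-pod" and "default" are just placeholders, not my actual workloads):

kubectl get pods --all-namespaces -o wide     # everything sits in Pending at this point
kubectl describe pod my-pod -n default        # the Events section for scheduling errors
kubectl get events --all-namespaces           # cluster-wide events around the incident
kubectl get componentstatuses                 # scheduler / controller-manager / etcd health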

I have been running into this issue intermittently for a couple of weeks now, and it is preventing me from using Kubernetes in production. I would really like to find out what is causing it.

Oddly enough, when I tried to diagnose the problem by checking the logs, I noticed that on both of my worker nodes the systemd journal gets corrupted! On the master nodes the journal remains readable, but it is not very informative.
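This is how I have been checking the journal on each node (run as root; the kubelet unit name may differ depending on how it is started):

journalctl --verify                           # reports which journal files fail validation
journalctl -u kubelet.service --no-pager | tail -n 200
ls -lh /var/log/journal/*/                    # corrupted files get renamed with a trailing ~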

Even right after startup, the kubelet keeps emitting errors in its logs. On all nodes, this is what gets logged about once a minute:

May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.012890   24228 cni.go:275] Error deleting network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014762   24228 remote_runtime.go:109] StopPodSandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:37:14 kube-master1 kubelet-wrapper[24228]: E0526 09:37:14.014818   24228 kuberuntime_gc.go:138] Failed to stop sandbox "3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "logstash-s3498_default" network: open /var/lib/cni/flannel/3975179a14dac15cd41881266c9bfd6b8763c0a48934147582cb55d5618a9233: no such file or directory
May 26 09:38:07 kube-master1 kubelet-wrapper[24228]: I0526 09:38:07.422341   24228 operation_generator.go:597] MountVolume.SetUp succeeded for volume "kubernetes.io/secret/9a378211-3597-11e7-a7ec-000c2958a0d7-default-token-0p3gf" (spec.Name: "default-token-0p3gf") pod "9a378211-3597-11e7-a7ec-000c2958a0d7" (UID: "9a378211-3597-11e7-a7ec-000c2958a0d7").
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]: W0526 09:38:14.037553   24228 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "logstash-s3498_default": Unexpected command output nsenter: cannot open : No such file or directory
May 26 09:38:14 kube-master1 kubelet-wrapper[24228]:  with error: exit status 1
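To sanity-check these errors, the obvious things to look at are whether the CNI state files the kubelet is complaining about actually exist, and whether the corresponding sandbox ("pause") container is still around. The container ID below is the one from the log, and /var/lib/cni/flannel is the path from the error; the host-local IPAM directory is an assumption about the setup:

ls -l /var/lib/cni/flannel/                   # the directory named in the error above
ls -l /var/lib/cni/networks/ 2>/dev/null      # host-local IPAM allocations, if present
docker ps -a --filter "name=k8s_POD"          # sandbox/pause containers known to Docker
docker inspect 3975179a14da --format '{{.State.Status}}'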


I searched for this error and ran into this issue, but it was closed, and people there point out that using v1.6.0 or newer should resolve it, but that definitely does not solve it in my case, since I am already on v1.6.4 ...

Can anyone point me in the right direction?!

Thanks!





1 answer


We are seeing this too. The problem seems to go away if you downgrade CoreOS to an older version that ships docker 1.12.3.
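To check the Docker version on a node and, if the previous (pre-update) CoreOS image is still present on the other USR partition, roll back to it, something like the following should work (a sketch only; double-check which partition is active before rebooting anything important):

docker version --format '{{.Server.Version}}'        # confirm which Docker the node is running
findmnt -no SOURCE /usr                              # which USR partition is currently booted
sudo cgpt prioritize /dev/disk/by-partlabel/USR-B    # prefer the other partition (use USR-A if B is the current one)
sudo systemctl stop update-engine                    # keep update-engine from re-upgrading right away
sudo reboot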



Docker is a regression nightmare in every version they release :(









