CoreOS Partial Cluster Upgrade

I am trying to set up a small CoreOS cluster on AWS EC2 instances in a VPC. For this exercise I am using two auto-scaling groups: one of three machines that forms the main etcd and consul cluster, and a second auto-scaling group, currently with a single node, that will actually scale as the application grows. All of the machines join a common etcd cluster.
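As a sanity check that all four machines actually joined the shared cluster, I run something like the following from one of the nodes (a rough sketch; assumes the etcd2 tooling that ships on the box, and the point is simply that every machine appears exactly once):

$ etcdctl cluster-health
$ fleetctl list-machines -no-legend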

This week coreos.com released build 681 on the stable channel. One of the nodes updated to 681.0.0 almost immediately, but 48 hours later the other three nodes are still sitting on 647.2.0. When I check the logs I see the following:

Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(48)] Starting/Resuming transfer
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(164)] Setting up curl options for HTTPS
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(427)] Setting up timeout source: 1 seconds.
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(240)] HTTP response code: 200
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:libcurl_http_fetcher.cc(297)] Transfer completed (200), 267 bytes downloaded
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: [0611/142517:INFO:omaha_request_action.cc(574)] Omaha request response: <?xml version="1.0" encoding="UTF-8"?>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <response protocol="3.0" server="update.core-os.net">
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <daystart elapsed_seconds="0"></daystart>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <app appid="e96281a6-xxxx-xxxx-xxxx-xxxxxxxxxxxx" status="ok">
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: <updatecheck status="noupdate"></updatecheck>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: </app>
Jun 11 14:25:17 ip-10-0-68-116.ec2.internal update_engine[477]: </response>


So the nodes are getting a response telling them there is no update available.

Is this how the CoreOS team load-balances their update servers, or is there some additional throttling going on? Is this CoreOS trying to push me towards their paid services? My understanding of the update process was that the nodes would update one by one, like dominoes.
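That "one by one" expectation comes from locksmith's etcd-lock reboot strategy, which is what my cloud-config sets. Here is roughly how to confirm what a node is actually using (a sketch; the file contents shown are what I would expect from a cloud-config that sets reboot-strategy to etcd-lock, not verified output):

$ cat /etc/coreos/update.conf
GROUP=stable
REBOOT_STRATEGY=etcd-lock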

This is my current cluster state:

for m in $(fleetctl list-machines -fields="machine" -full -no-legend); do fleetctl ssh $m cat /etc/lsb-release; done
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=647.2.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 647.2.0"
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=681.0.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 681.0.0"
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=647.2.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 647.2.0"
DISTRIB_ID=CoreOS
DISTRIB_RELEASE=647.2.0
DISTRIB_CODENAME="Red Dog"
DISTRIB_DESCRIPTION="CoreOS 647.2.0"


Update, a week later: the cluster is still stuck in this half-upgraded state. I would love to know how I can debug this issue if anyone has experience.
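For anyone who wants to dig in with me, this is what I have been running on the stuck nodes so far (output trimmed; these are the standard update_engine and systemd tools on CoreOS):

$ update_engine_client -status
$ journalctl -u update-engine --no-pager | tail -n 50
$ systemctl status locksmithd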





1 answer


As mentioned in the comments, in situations like this it is possible that a machine received the update and, after looking at the failure numbers, the CoreOS team decided to pause the rollout to additional hosts to avoid breaking more of them.

If you want to force a check for updates, you can run:



$ update_engine_client -check_for_update
[0123/220706:INFO:update_engine_client.cc(245)] Initiating update check and install.


See https://coreos.com/os/docs/latest/update-strategies.html for details.
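If a node has actually downloaded the update but never reboots, it may also be stuck waiting on the shared reboot lock rather than on the download. You can check who holds the lock with locksmith (the output layout below is approximate, and the machine ID is a placeholder):

$ locksmithctl status
Available: 0
Max: 1

MACHINE ID
69d27b356a94476da859461d3a3bc6fd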
