Amazon ElastiCache Failover

We have been using AWS ElastiCache for about six months without any problems. Every night a Java application launches that flushes DB 0 of our Redis cache and then re-writes it with updated data. However, between July 31st and August 5th we hit three cases where the DB was successfully flushed but we were then unable to write new data to it.
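For context, the nightly job boils down to something like the following Jedis sketch (the hostname is a placeholder and freshData stands in for our real data loader, not our actual code):

    import java.util.Map;
    import redis.clients.jedis.Jedis;

    public class NightlyReload {
        // Simplified sketch of the nightly job: flush DB 0, then re-write it.
        static void reload(Map<String, String> freshData) {
            try (Jedis jedis = new Jedis("prod-redis.xxxxxx.cache.amazonaws.com", 6379)) {
                jedis.select(0);  // we only use DB 0
                jedis.flushDB();  // wipe the old data set
                for (Map.Entry<String, String> entry : freshData.entrySet()) {
                    jedis.set(entry.getKey(), entry.getValue()); // re-write updated data
                }
            }
        }
    }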

In our application, we got the following exception:

redis.clients.jedis.exceptions.JedisDataException: redis.clients.jedis.exceptions.JedisDataException: READONLY You can't write against a read only slave.

When we look at the cache events in ElastiCache, we can see:

Failover from master node prod-redis-001 to replica node prod-redis-002 completed

We weren't able to diagnose the issue, and since the app had been working fine for the previous six months, I'm wondering if this is related to the ElastiCache release that went out on June 30th: https://aws.amazon.com/releasenotes/Amazon-ElastiCache

We have always written to our master node, and we have only one replica node.

If anyone can offer any insight, it would be greatly appreciated.

EDIT: This seems to be an intermittent problem. Some days it fails; other days it works fine.



1 answer


We've been working with AWS support over the past few weeks, and this is what we found.

Most Redis commands are synchronous, including FLUSHDB, so a flush blocks all other requests. In our case, we are flushing about 19 million keys, and this takes over 30 seconds.
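One simple way to confirm how long the flush blocks is to time the call from the client side. This is only a sketch (the hostname is a placeholder):

    import redis.clients.jedis.Jedis;

    public class FlushTiming {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("prod-redis.xxxxxx.cache.amazonaws.com", 6379)) {
                long start = System.nanoTime();
                jedis.flushDB(); // synchronous: the server is busy until every key is freed
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("FLUSHDB took " + elapsedMs + " ms");
            }
        }
    }

Any other request issued while that call is in flight has to wait, which is presumably what trips the health check.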

ElastiCache periodically runs a health check, and once the flush starts, the health check is blocked, which triggers a failover.

We asked the support team how often the health check runs, so we could understand why our flush only triggers a failover 3-4 times a week. The best answer we could get was "We think it is every 30 seconds." However, our flush consistently takes over 30 seconds, yet it does not always trigger a failover.



They said they could implement the ability to customize the health check interval, but that it won't be done anytime soon.

The best advice they could give us was:

1) Create a completely new cluster and load the new data into it; then, instead of flushing the previous cluster, repoint your application(s) to the new cluster and delete the old one.

2) If the data you are loading is an updated version of the existing data, consider not flushing at all, and instead overwrite the existing keys with the new values.

3) Instead of flushing the data, set the items' expiration to the time when you would normally flush, and let the keys expire (ideally at randomized times to avoid thundering herd problems), then reload the data (see the sketch after this list).
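As a rough illustration, options 2 and 3 can be combined: overwrite each key in place and give it a randomized TTL instead of flushing. This is only a sketch; the hostname, TTL window, and jitter range are assumptions, not values support gave us:

    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;
    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Pipeline;

    public class ReloadWithoutFlush {
        // Overwrite each key with its new value and a TTL near the next reload,
        // plus random jitter so the keys don't all expire at the same moment.
        static void reload(Map<String, String> freshData) {
            int baseTtlSeconds = 24 * 60 * 60; // roughly until the next nightly run
            try (Jedis jedis = new Jedis("prod-redis.xxxxxx.cache.amazonaws.com", 6379)) {
                Pipeline pipeline = jedis.pipelined();
                for (Map.Entry<String, String> entry : freshData.entrySet()) {
                    int jitter = ThreadLocalRandom.current().nextInt(30 * 60); // up to 30 min
                    pipeline.setex(entry.getKey(), baseTtlSeconds + jitter, entry.getValue());
                }
                pipeline.sync(); // send the batched commands and wait for the replies
            }
        }
    }

Keys that disappear from the data set simply age out when their TTL fires, instead of being removed by a blocking flush.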

Hope it helps :)
