Celery SQS + Job Duplication + SQS Visibility Timeout
Most of my Celery tasks have ETAs greater than the maximum visibility timeout defined by Amazon SQS.
Celery documentation says:
This causes problems with ETA / countdown / retry tasks where the time to execute exceeds the visibility timeout; in fact, if that happens the task will be executed again and again in a loop.
Therefore, you must increase the visibility timeout to match the time of the longest ETA you plan to use.
At the same time, it also says:
The maximum visibility timeout supported by AWS at the time of this writing is 12 hours (43,200 seconds).
What can I do to avoid having my workers run tasks over and over again if I use SQS?
It is generally not a good idea to have tasks with very long ETAs.
First of all, there is the "visibility_timeout" problem. You probably don't want a very large visibility timeout, because if a worker dies 1 minute before the task is due to start, the queue will still wait for the visibility timeout to expire before delivering the task to another worker, and I guess you don't want that to take another month.
From the celery docs:
Note that Celery will redeliver messages at worker shutdown, so having a long visibility timeout will only delay the redelivery of "lost" tasks in the event of a power failure or forcefully terminated workers.
In addition, SQS only allows a limited number of messages to be received but not yet deleted at any one time.
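For reference, this is roughly where the visibility timeout is set when Celery uses the SQS transport. It is only a sketch: the one-hour value is an arbitrary example, and the constant name SQS_MAX_VISIBILITY_TIMEOUT is mine, not part of Celery.

```python
# Celery configuration module (sketch; values are illustrative).

# 12 hours, the AWS maximum quoted in the question. The constant name
# is an assumption made for this example, not a Celery setting.
SQS_MAX_VISIBILITY_TIMEOUT = 43200

broker_url = "sqs://"

broker_transport_options = {
    # Seconds a received message stays hidden from other workers.
    # A smaller value means a lost task is redelivered sooner, but no
    # ETA should then exceed it.
    "visibility_timeout": 3600,
}
```

Whatever value you pick, it has to stay at or below the 12-hour limit, which is why very long ETAs cannot be fixed by configuration alone.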
SQS calls these tasks "in-flight messages". From http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html :
A message is considered to be in flight after it has been received from a queue by a consumer, but not yet deleted from the queue.
For standard queues, there can be a maximum of 120,000 in-flight messages per queue. If you reach this limit, Amazon SQS returns the OverLimit error message. To avoid reaching the limit, you should delete messages from the queue after they have been processed. You can also increase the number of queues that you use to process your messages.
For FIFO queues, there can be a maximum of 20,000 in-flight messages per queue. If you reach this limit, Amazon SQS returns no error messages.
I see two possible solutions: you can use RabbitMQ instead, which does not rely on visibility timeouts (there are "RabbitMQ as a service" offerings if you don't want to manage your own), or change your code to use really small ETAs (best practice).
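One way to get "really small ETAs" is to reach a distant target time in short hops: the task re-enqueues itself with a countdown that never exceeds a cap, and only does the real work once the target time has passed. The following is a minimal sketch of that idea, not something taken from Celery itself. The names MAX_HOP, next_countdown, run_at and do_work are made up for the example, and the 30-minute cap is an assumption you would tune to sit well below your visibility timeout.

```python
import math
from datetime import datetime, timedelta, timezone

# Assumption: each hop stays well below the queue's visibility timeout.
MAX_HOP = timedelta(minutes=30)


def next_countdown(target, now, max_hop=MAX_HOP):
    """Return the seconds to wait before the next hop (0 means run now)."""
    remaining = target - now
    if remaining <= timedelta(0):
        return 0
    # Never wait longer than one hop; round up so we do not fire early.
    return math.ceil(min(remaining, max_hop).total_seconds())


# Hypothetical Celery usage (not executed here; `app` and `do_work`
# are placeholders):
#
# @app.task
# def run_at(target_iso, payload):
#     target = datetime.fromisoformat(target_iso)
#     delay = next_countdown(target, datetime.now(timezone.utc))
#     if delay > 0:
#         # Not due yet: enqueue a fresh message with a short countdown.
#         run_at.apply_async((target_iso, payload), countdown=delay)
#     else:
#         do_work(payload)
```

Each hop is a new message, so no single message stays in flight for longer than the cap. The cost is extra messages: a task due in 3 days with 30-minute hops is re-enqueued well over a hundred times.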
These are my 2 cents, maybe @asksol can provide some additional ideas.
In SQS, you can change the visibility timeout of a message after you have received it. This is documented here. So what you need to do is as follows: while you process a message, update its visibility timeout regularly, and delete the message when you are finished.
To extend the visibility timeout regularly, if you are processing in some kind of loop, you can extend it at the end of each iteration or every x iterations, depending on how long one iteration takes. Here is some pseudocode for what I mean.
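process_message() {
    for (i = 1; ..; i++) {
        // ... one unit of work ...
        if (i % 5 == 0) {
            // every 5th iteration, push the visibility timeout out again
            extendVisibilityTimeOut(..)
        }
    }
    // all work done: delete the message from the queue
}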
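In Python, the loop above could look roughly like this. It is a sketch under assumptions: `sqs` is expected to be a boto3 SQS client (or anything with the same two methods), and `process_message`, `handle_item`, `extend_every` and the 60-second timeout are names and values chosen for the example. Only `change_message_visibility` and `delete_message` are real boto3 client calls.

```python
def process_message(sqs, queue_url, receipt_handle, items, handle_item,
                    extend_every=5, visibility_timeout=60):
    """Process `items`, periodically extending the message's visibility."""
    for i, item in enumerate(items, start=1):
        handle_item(item)
        if i % extend_every == 0:
            # The new timeout counts from now; it is not added to the
            # time that was left on the old one.
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=visibility_timeout,
            )
    # All work finished: delete the message so it is not redelivered.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```

With a real client you would call it as process_message(boto3.client("sqs"), queue_url, receipt_handle, items, handle_item). Keep in mind that SQS still caps the total visibility of a message at 12 hours from the moment it was first received, so this keeps a long-running job from being redelivered but does not lift the limit mentioned in the question.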