Celery, Resque, or a custom solution for handling jobs on machines in my cloud?
My company has thousands of server instances running application code - some instances running databases, others running web applications, others running APIs or Hadoop work orders. All servers run Linux.
In this cloud, developers usually want to do one of two things for an instance:
1. Update the version of the application running on this instance. Typically this includes a) tagging the code in the appropriate Subversion repository, b) creating an RPM from this tag, and c) installing that RPM on the appropriate application server. Note that this operation touches four instances: the SVN server, the build host (where the build takes place), the YUM host (where the RPM is stored), and the instance that runs the application.
Today, deploying a new version of an application can involve up to 500 instances.
2. Run an arbitrary script on an instance. The script can be written in any language as long as an interpreter for it exists on that instance. For example, a UI designer wants to run his "check_memory.php" script, which does x, y, z on 10 UI instances and then restarts the web server if some conditions are met.
What tools should I use to build this system? I've looked at Celery, Resque, and delayed_job, but they seem to be built for processing a high volume of tasks. This system is under much less load: on a big day, perhaps a thousand update jobs and several hundred arbitrary scripts might run. Also, they do not support tasks written in arbitrary languages.
How should the central "worker processor" communicate with the instances? SSH, message queues (which one), something else?
Thank you for your help.
NOTE. This cloud is proprietary, so EC2 tools are not an option.
I can imagine two approaches:
1. Set up passwordless SSH between the central server and the instances, keep a file listing all machines in the cluster, and run your scripts directly over SSH. For example: ssh email@example.com "ls -la". This is the same approach used by the Hadoop cluster startup and shutdown scripts. If you want to assign tasks dynamically, you can pick nodes at random.
2. Use something like Torque or Sun Grid Engine to manage your cluster.
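To make the SSH approach more concrete, here is a minimal sketch in Python. It assumes key-based (passwordless) SSH already works from the central machine and that the host list is a plain file with one hostname per line. The function names, the file layout, and the parallelism limit are all made up for illustration, not part of any existing tool.

```python
import random
import subprocess
from concurrent.futures import ThreadPoolExecutor


def load_hosts(path):
    """Read one hostname per line, skipping blank lines and # comments."""
    with open(path) as fh:
        lines = (ln.strip() for ln in fh)
        return [ln for ln in lines if ln and not ln.startswith("#")]


def build_ssh_command(host, remote_command, user=None):
    """Build the argv for running remote_command on host over SSH."""
    target = f"{user}@{host}" if user else host
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a host with broken key setup cannot hang the whole run.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            target, remote_command]


def run_on_host(host, remote_command, runner=subprocess.run):
    """Run the command on one host; return (host, exit code, stdout, stderr)."""
    result = runner(build_ssh_command(host, remote_command),
                    capture_output=True, text=True)
    return host, result.returncode, result.stdout, result.stderr


def run_on_hosts(hosts, remote_command, max_parallel=20, runner=subprocess.run):
    """Fan the command out to many hosts, at most max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(
            lambda h: run_on_host(h, remote_command, runner), hosts))


def pick_random_hosts(hosts, count):
    """Choose nodes at random, e.g. for dynamically assigned tasks."""
    return random.sample(hosts, min(count, len(hosts)))
```

Something like run_on_hosts(load_hosts("hosts.txt"), "ls -la") would then return one result tuple per host, in the same order as the host list, so failed hosts can be found by checking for a non-zero exit code.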
Package installation can itself be done from inside a script, so you only need to solve the second problem (running arbitrary scripts) and can then use that solution for the first one :)
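As a rough illustration of that point, the sketch below generates a small install script and feeds it to an interpreter on the remote machine over SSH. It assumes passwordless SSH is set up and that yum on the target can see the repository holding the RPM. The package name, version, and function names are hypothetical.

```python
import shlex
import subprocess


def make_install_script(package, version):
    """Return a shell script that installs one specific package version."""
    spec = shlex.quote(f"{package}-{version}")
    return "\n".join([
        "#!/bin/sh",
        "set -e",                    # stop at the first failing command
        f"yum install -y {spec}",
        "",
    ])


def run_script_over_ssh(host, script_text, interpreter=("sh", "-s"),
                        runner=subprocess.run):
    """Pipe script_text to an interpreter on host via its standard input.

    The script never has to be copied to the instance. Any interpreter
    that can read a program from stdin should work, for example
    ("python3", "-") for a Python script.
    """
    cmd = ["ssh", "-o", "BatchMode=yes", host, *interpreter]
    result = runner(cmd, input=script_text, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

With this, an update job becomes just another script run: run_script_over_ssh("app-host", make_install_script("myapp", "1.2.3")), where the host and package names are placeholders.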