Parallel cluster R worker process never returns

I am using the doParallel package to parallelize jobs across multiple Linux machines with the following syntax:

    cl <- makePSOCKcluster(machines, outfile = '', master = system('hostname -i', intern = T))
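
For context, the cluster is registered with doParallel and used through foreach, roughly like this (a simplified sketch; the hostnames, the number of tasks, and simulate_one() are placeholders, not the real code):

    library(doParallel)

    # Placeholder hostnames standing in for the real 'machines' vector
    machines <- rep(c("node01", "node02"), each = 4)
    cl <- makePSOCKcluster(machines, outfile = '',
                           master = system('hostname -i', intern = T))
    registerDoParallel(cl)

    # Each task is one small simulation; simulate_one() is a placeholder
    results <- foreach(i = 1:1000) %dopar% simulate_one(i)

    stopCluster(cl)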

It usually takes less than 10 minutes for each job to finish on one machine. However, every so often one worker will "run away" and keep running for hours without ever returning to the main process. I can see the process running with top, but it looks like it is somehow stuck and not making any progress. The outfile = '' option does not produce anything useful because the worker never finishes.

This happens quite often, but in a very random way. Sometimes I can simply rerun the job and it finishes fine, so I cannot provide a reproducible example. Does anyone have general suggestions on how to troubleshoot this issue, or what to look for the next time it happens?

EDIT:

Adding more details in response to the comments. I run thousands of small simulations on 10 machines; I/O and memory usage are minimal. The runaway worker appears on different machines at random, with no obvious pattern, and not necessarily on the busiest ones. I do not have permission to view the syslog, but based on the CPU/RAM history there is nothing unusual.

This happens often enough that it is fairly easy to catch the runaway process in action. top shows the process running at almost 100% CPU with status R, so it is definitely running, not waiting. But I am also quite sure that each simulation should only take a few minutes, and somehow the runaway worker just keeps running non-stop.

So far doParallel is the only package I have tried. I am exploring other options, but it is difficult to make an informed decision without knowing the cause.

1 answer


This kind of problem is not uncommon on large compute clusters. Even though the hung worker may not produce an error message, you should check the syslog on the node where the worker ran to see whether any system problem was reported. There could be disk or memory errors, or the machine may have been running low on memory. If it is a node problem, your issue may be solved simply by not using that node.

This is one of the reasons that batch queueing systems are useful: jobs that take too long can be killed and automatically resubmitted. Unfortunately, they are often resubmitted to the same bad node, so it is important to detect bad nodes and keep the scheduler from using them for subsequent jobs.
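
If a batch system is not available, a rough in-R approximation of the same idea is to give each task an elapsed-time budget and retry it once. This is purely a sketch (run_with_timeout(), simulate_one(), and nsim are placeholders, not part of your code), and note that setTimeLimit() can only interrupt R-level code, so a worker stuck inside a long C call may not be stopped this way:

    library(doParallel)

    # Run one task with an elapsed-time budget; return the error condition
    # instead of hanging forever if the limit is exceeded.
    run_with_timeout <- function(task, limit = 600) {
      setTimeLimit(elapsed = limit, transient = TRUE)
      on.exit(setTimeLimit(elapsed = Inf), add = TRUE)
      tryCatch(task(), error = function(e) e)
    }

    results <- foreach(i = seq_len(nsim)) %dopar% {
      out <- run_with_timeout(function() simulate_one(i))
      if (inherits(out, "error"))                      # one retry on timeout/error
        out <- run_with_timeout(function() simulate_one(i))
      out
    }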



You might also want to consider adding checkpointing capabilities to your program. Unfortunately, that is usually difficult, and it is especially difficult with the doParallel backend because the parallel package has no fault tolerance. You may want to look into the doRedis backend, since I believe its author was interested in supporting fault-tolerance features.
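
For example, a minimal checkpointing sketch (not specific to doParallel or doRedis; the directory name, nsim, and simulate_one() are placeholders) has each task write its result to disk, so completed work survives a hung worker or a killed job and is skipped on the next run:

    library(doParallel)

    # Assumes a filesystem shared by all nodes
    checkpoint_dir <- "checkpoints"
    dir.create(checkpoint_dir, showWarnings = FALSE)

    results <- foreach(i = seq_len(nsim)) %dopar% {
      f <- file.path(checkpoint_dir, sprintf("task_%05d.rds", i))
      if (file.exists(f)) {
        readRDS(f)                     # already finished in a previous run
      } else {
        res <- simulate_one(i)         # placeholder for one simulation
        saveRDS(res, f)
        res
      }
    }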

Finally, if you do catch a hanging worker in action, you should gather as much information about it as possible using ps or top. The process state is important, since it can help you determine whether the process is stuck doing I/O, for example. Even better, you could try attaching gdb to it and getting a backtrace to determine what it is actually doing.
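
For example, from another session on the affected node (the PID below is a placeholder for whatever top reports for the stuck worker):

    # Inspect the stuck worker from a separate R session on the same node
    pid <- 12345   # placeholder: PID of the hung worker as shown by top

    system(sprintf("ps -o pid,stat,wchan:20,etime,%%cpu -p %d", pid))  # state, wait channel, run time
    system(sprintf("cat /proc/%d/status", pid))                        # memory and thread details

    # With gdb installed and sufficient permissions, get a C-level backtrace:
    # system(sprintf("gdb -p %d -batch -ex bt", pid))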
