Ensuring one job per node on StarCluster / SunGridEngine (SGE)
When jobs are submitted with qsub
on a StarCluster / SGE cluster, is there an easy way to ensure that each node receives at most one job at a time? I am having problems where multiple jobs land on the same node, resulting in out-of-memory (OOM) issues.
I tried using -l cpu=8
, but I believe that checks the total number of cores on the box itself rather than the number of cores currently in use.
I also tried -l slots=8
, but then I get:
Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
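That error message is SGE's way of saying that slot counts must be requested through a parallel environment rather than via the slots resource. A minimal sketch of what that looks like, assuming the cluster defines a suitable PE (the name orte and the script myjob.sh are placeholders; StarCluster typically ships a PE named orte, but yours may differ):

```shell
# List the parallel environments actually configured on the cluster
qconf -spl

# Request 8 slots through a PE instead of "-l slots=8"
# ("orte" and "myjob.sh" are assumed names)
qsub -pe orte 8 myjob.sh
```

Whether all 8 slots land on a single node depends on the PE's allocation_rule (for example, $pe_slots keeps them on one host), so this only pins a job to one node if the PE is configured that way.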
This depends heavily on how the cluster's resources are configured (memory limits, etc.). One option is to request a large amount of memory for each job:
-l h_vmem=xxG
This has the side effect of excluding other jobs from the node, since most of that node's memory will already have been requested by a previously started job.
Just make sure the requested memory does not exceed the configured limit for the node. You can see whether a job is being held back by this limit by checking the output of qstat -j <jobid>
for errors.
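Putting that together, a sketch of the submission and the follow-up check (the 30G value and the script name myjob.sh are placeholders; pick a value just under your nodes' actual memory limit):

```shell
# Request most of a node's memory so the scheduler will not
# co-locate a second job on it (adjust 30G to just under the
# node's configured limit)
qsub -l h_vmem=30G myjob.sh

# If the job stays queued, inspect its scheduling diagnostics
# for unmet resource requests (replace <jobid> with the real ID)
qstat -j <jobid>
```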