Qsub returns an error when submitting jobs from a compute node

I have a complex Fortran MPI application running on a Torque/Maui system. When I run it, it produces a single huge output file (~20 GB). To avoid this, I wrote a RunJob script that breaks the run into 5 parts, each producing a smaller output that is much easier to handle.

At the moment, my RunJob script finishes the first part correctly and produces correct output. However, when it tries to submit the next part, I get the following error message:

qsub: Bad UID for job execution MSG=ruserok failed validating username/username from compute-0-0.local

I know this problem occurs because, by default, a Torque/Maui system does not allow compute nodes to submit jobs.

In fact, when I type this:

qmgr -c 'ls' | grep allow_node_submit

I get:

allow_node_submit = False

I have only an ordinary user account, not an admin account.

My questions:

  • Is it possible to set allow_node_submit = true via qmgr as an ordinary user? If so, how? (I suspect not.)
  • If the answer to question 1 is no, is there another way around this? How?

All the best.



1 answer


No, an unprivileged user cannot change the queue system's settings. The usual reason for disallowing submission from compute nodes is a pretty good one: it protects the cluster and all of its users from someone accidentally (or otherwise) submitting a script that finishes quickly and resubmits itself once (or, much worse, more than once), rapidly flooding the scheduler and queues with the equivalent of a fork bomb. Even with such restrictions in place, we have accidentally submitted several tens of thousands of jobs at once due to scripting errors.
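For reference, changing this setting does require administrator access on the server host; an admin would enable node submission with something like the following (a sketch of the standard qmgr invocation, not something a regular user can run):

```shell
# Must be run as root (or a qmgr operator/manager) on the pbs_server host.
qmgr -c "set server allow_node_submit = True"

# Verify the change took effect:
qmgr -c "list server allow_node_submit"
```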

A common workaround is to ssh to the head node (or another node that is allowed to submit jobs) and run qsub from there, e.g., at the end of your job script:

ssh queue-head-node qsub /path/to/new/submission/script




This is how we suggest our users handle it (e.g., here). Obviously this only works if passwordless ssh is enabled within the cluster, which is common (but not universal) practice.
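Putting that together, the tail end of each part's job script might look like the sketch below. The host name headnode, the PART variable, the 5-part limit, and the script path are all placeholders you would adapt to your site:

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=02:00:00

# ... run this part of the computation ...
mpirun ./my_fortran_app --part "${PART}"

# Compute nodes cannot run qsub directly, so submit the next
# part via ssh to the head node. qsub's -v flag passes the
# PART variable into the next job's environment.
NEXT=$((PART + 1))
if [ "${NEXT}" -le 5 ]; then
    ssh headnode "qsub -v PART=${NEXT} /path/to/RunJob.pbs"
fi
```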

Alternatively, if this is the fairly common case of simply submitting a series of jobs that continue a run, you could look into how job dependencies are handled at your site and submit the whole chain of jobs up front, each depending on the successful completion of the previous one; that would then work fine.
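With Torque this is done through qsub's -W depend= attribute. A sketch of submitting all five parts up front from the head node (the part*.pbs script names are placeholders):

```shell
# Submit part 1 and capture its job ID, then chain the rest with
# afterok so each part starts only if the previous one exited
# successfully.
PREV=$(qsub part1.pbs)
for i in 2 3 4 5; do
    PREV=$(qsub -W depend=afterok:${PREV} part${i}.pbs)
done
```

Until its dependency is satisfied, each queued job simply sits in a held state, so the scheduler is never flooded.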







