Stopping a dask-ssh-created scheduler from the client interface
I am running Dask on a cluster running SLURM.
dask-ssh --nprocs 2 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10
# We need to tell dask Client (inside python) where the scheduler is running
scheduler="`hostname`:8786"
echo "Scheduler is running at ${scheduler}"
export ARL_DASK_SCHEDULER=${scheduler}
echo "About to execute $CMD"
eval $CMD
# Wait for dask-ssh to be shut down from the Python code
wait %1
I create a Client inside my Python code and, when I am finished, I close it:
c = Client(scheduler_id) ... c.shutdown()
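Spelled out a little more (the helper names are mine; the only assumption taken from the script above is that the Python side reads the exported ARL_DASK_SCHEDULER variable), the client-side pattern looks like:

```python
import os


def scheduler_address(default="127.0.0.1:8786"):
    # ARL_DASK_SCHEDULER is the variable the batch script above exports;
    # the default is only a fallback for local testing.
    return os.environ.get("ARL_DASK_SCHEDULER", default)


def run(computation):
    # Sketch of the client lifecycle; `computation` is a placeholder
    # callable standing in for the real workload.
    from distributed import Client  # imported here so scheduler_address()
                                    # stays usable without distributed
    c = Client(scheduler_address())
    try:
        return computation(c)
    finally:
        c.shutdown()  # per the docstring, should stop workers + scheduler
```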
My reading of the dask-ssh help is that shutdown stops all workers and then the scheduler, but it does not stop the background dask-ssh process, so that process eventually times out.
I've tried this interactively in the shell. I can't see how to stop the scheduler.
Any help would be appreciated.
Thanks, Tim
Recommendation: use --scheduler-file
First, when deploying with SLURM you can use the --scheduler-file option, which lets you coordinate the scheduler address through a shared network file system (which I assume you have, given that you are using SLURM). I recommend reading this section of the docs: http://distributed.readthedocs.io/en/latest/setup.html#using-a-shared-network-file-system-and-a-job-scheduler
dask-scheduler --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
>>> client = Client(scheduler_file='/path/to/scheduler.json')
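Because the scheduler writes this file shortly after it starts, a batch job may need to wait for the file to appear before connecting. A minimal sketch (the helper name and the timeout defaults are my own; the file does contain the scheduler's connection info, including an "address" entry, in current dask versions, but verify against your deployment):

```python
import json
import os
import time


def wait_for_scheduler_file(path, timeout=60.0, poll=0.5):
    # Block until the scheduler has written its connection file,
    # then return the parsed contents (which include "address").
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        time.sleep(poll)
    raise TimeoutError("no scheduler file at %s" % path)
```

The returned dict (or simply the path itself, via Client(scheduler_file=...)) can then be used to connect.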
This also makes it easy to launch everything with sbatch or qsub. Here is an example with SGE's qsub:
# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
Client.shutdown
It looks like client.shutdown only shuts down the client. You are correct that this is inconsistent with the docstring. I have raised an issue here to track further developments: https://github.com/dask/distributed/issues/1085
Meanwhile
In the meantime, these three commands should suffice to retire the workers, close the scheduler, and stop the scheduler process:
# Retire every worker, closing their processes
client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
# Ask the scheduler itself to terminate
client.loop.add_callback(client.scheduler.terminate)
# Stop the scheduler's event loop so its process can exit
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
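These can be bundled into a small helper that also closes the local client at the end. This is a workaround sketch relying on the same internal attributes (client.loop, client.scheduler) used above, not a public API:

```python
def hard_shutdown(client):
    """Tear down workers and scheduler through the client, then close it."""
    client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
    client.loop.add_callback(client.scheduler.terminate)
    client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())
    client.close()  # finally close the local client as well
```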
What do people usually do?
Usually people stop clusters by whatever means they used to start them; with SLURM that can include the job-cancellation command (scancel). You are right that the client-focused approach should be made more consistent, but currently it is not.