Stopping created dask-ssh scheduler from client interface

I am running Dask on a cluster managed by SLURM. My batch script starts dask-ssh in the background, waits for it to come up, and then runs my program:

dask-ssh --nprocs 2 --nthreads 1 --scheduler-port 8786 --log-directory `pwd` --hostfile hostfile.$JOBID &
sleep 10

# We need to tell the dask Client (inside Python) where the scheduler is running
scheduler="`hostname`:8786"
echo "Scheduler is running at ${scheduler}"
export ARL_DASK_SCHEDULER=${scheduler}

echo "About to execute $CMD"

eval $CMD

# Wait for dask-ssh to be shut down from the Python code
wait %1

I create a Client inside my Python code, and when I am finished I close it:

from distributed import Client

c = Client(scheduler_id)  # scheduler_id is the scheduler address, e.g. read from ARL_DASK_SCHEDULER
...
c.shutdown()

My reading of the documentation is that shutdown() shuts down all workers and then the scheduler. But this does not stop the background dask-ssh process, so it eventually times out.

I've tried this interactively in the shell. I can't see how to stop the scheduler.

Any help would be appreciated.

Thanks, Tim



1 answer


Recommendation: use --scheduler-file

First, when setting things up with SLURM you can use the --scheduler-file option, which lets you coordinate the scheduler address through your network file system (which I assume you have, given that you are using SLURM). I recommend reading this section of the docs: http://distributed.readthedocs.io/en/latest/setup.html#using-a-shared-network-file-system-and-a-job-scheduler

dask-scheduler --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json
dask-worker --scheduler-file /path/to/scheduler.json

>>> client = Client(scheduler_file='/path/to/scheduler.json')
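
In a batch script like the one in the question, this also removes the need for the fixed sleep 10: instead of guessing how long startup takes, the script can wait until the scheduler has written its connection file. A minimal sketch, where /path/to/scheduler.json is a placeholder for a path on the shared filesystem:

# Remove any stale file from a previous run, start the scheduler in the
# background, then block until it has written its connection file
rm -f /path/to/scheduler.json
dask-scheduler --scheduler-file /path/to/scheduler.json &
until [ -f /path/to/scheduler.json ]; do
    sleep 1
done
echo "Scheduler file is ready"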


This also makes it easier to use sbatch or qsub. Here is an example with SGE's qsub:

# Start a dask-scheduler somewhere and write connection information to file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json
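
Since the question uses SLURM rather than SGE, a rough sbatch equivalent might look like the following. This is a sketch using sbatch's --wrap and --array options; site-specific resource flags are omitted:

# Start a dask-scheduler in its own job; it writes its address to the file
sbatch --wrap "dask-scheduler --scheduler-file /path/to/scheduler.json"

# Start 100 dask-worker processes in an array job pointing to the same file
sbatch --array=1-100 --wrap "dask-worker --scheduler-file /path/to/scheduler.json"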


Client.shutdown



It looks like client.shutdown only shuts down the client, not the scheduler. You are correct that this is inconsistent with the docstring. I have raised an issue to track this: https://github.com/dask/distributed/issues/1085

Meanwhile

These three commands should be sufficient to retire the workers, terminate the scheduler, and stop the scheduler's event loop, which ends the scheduler process:

# Retire (and close) all workers
client.loop.add_callback(client.scheduler.retire_workers, close_workers=True)
# Tell the scheduler to terminate
client.loop.add_callback(client.scheduler.terminate)
# Stop the scheduler's event loop, ending its process
client.run_on_scheduler(lambda dask_scheduler: dask_scheduler.loop.stop())


What people usually do

Usually people stop a cluster by whatever means they started it; with SLURM that can include SLURM's kill command, scancel. The client-focused approach should be made more consistent, though; at the moment it is not.
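
For the setup in the question, that could mean killing the backgrounded dask-ssh from the batch script, or cancelling the whole allocation. A sketch, assuming $JOBID holds the SLURM job ID as in the question's script:

# Kill the backgrounded dask-ssh (job %1 in the batch script) ...
kill %1

# ... or cancel the entire SLURM job with SLURM's kill command
scancel $JOBID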
