Running an MPI call queue in parallel with SLURM and limited resources

I am trying to run a particle swarm optimization task on a cluster using SLURM, with the optimization algorithm driven by a single-core MATLAB process. Each particle evaluation requires multiple MPI calls that alternate between two Python programs until the result converges. Each MPI call can take up to 20 minutes.

At first I naively submitted each MPI call as a separate SLURM job, but the resulting queue wait times made this slower than running each call sequentially on a local machine. I am now trying to find a way to submit one N-node job that keeps running MPI tasks continuously so the allocated resources stay busy. The MATLAB process would coordinate this through text-file flags.

Here is a bash pseudocode file that might help illustrate what I am trying to do on a smaller scale:

#!/bin/bash

#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of processor cores in this job

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while [ "$(cat KeepRunning.txt)" == "1" ]
do
  for i in {0..40}
  do
    if [ "$(cat RunJob_${i}.txt)" == "1" ]
    then
      mpirun -np 8 -rr -f ${PBS_NODEFILE} <job_i> &
    fi
  done
done

wait

This approach doesn't work (it just crashes), but I don't understand why (perhaps a resource conflict?). Some of my peers suggested using parallel with srun, but as far as I can tell that requires launching the MPI calls in fixed batches. That would waste a lot of resources, because a significant proportion of launches either crash or finish quickly (this is expected behavior). A concrete example of the problem: if I launch a batch of five 8-core jobs and four of them crash immediately, 32 cores then sit idle for up to 20 minutes while waiting for the fifth job to complete.
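To make concrete what I mean by batches, here is a rough sketch of the batch-and-wait pattern as I understand the suggestion (the batch size of 5 and the <job_i> placeholders are only for illustration):

#!/bin/bash
# launch one fixed batch of 8-core MPI job steps...
for i in 0 1 2 3 4
do
  srun -n 8 --exclusive <job_i> &
done
# ...then block until the slowest member of the batch finishes;
# if most of the batch crashed or finished early, their cores sit idle here
wait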

Since the optimization is likely to require over 5000 MPI calls, any gain in efficiency will make a huge difference in absolute wall time. Does anyone have advice on how I can run a continuous stream of MPI calls inside a single large SLURM job? I would really appreciate any help.

1 answer


A couple of things: first, under SLURM you should use srun, not mpirun. Second, the pseudocode you provided launches an unbounded number of jobs because the while loop never waits for anything to finish. You should include a wait after the inner loop (inside the while loop), so that you start one set of jobs, wait for them to complete, evaluate the state, and then possibly start the next set:

#!/bin/bash
#SBATCH -t 4:00:00 # walltime
#SBATCH -N 2 # number of nodes in this job
#SBATCH -n 32 # total number of tasks (MPI ranks) in this job
#SBATCH -c 1 # number of processor cores for each task

# Set required modules
module purge
module load intel/16.0
module load gcc/6.3.0

# Job working directory
echo Working directory is $SLURM_SUBMIT_DIR
cd $SLURM_SUBMIT_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`

# Run Command
while [ "$(cat KeepRunning.txt)" == "1" ]
do
  for i in {0..40}
  do
    if [ "$(cat RunJob_${i}.txt)" == "1" ]
    then
      # each step gets 8 dedicated cores; extra steps queue inside the allocation
      srun -n 8 --exclusive <job_i> &
    fi
  done
  wait
  # <update KeepRunning.txt based on the results of this round>
done



Be careful about the difference between tasks and cores: -n sets how many tasks will be allocated, and -c sets how many CPUs each task gets.
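For instance, these two requests both occupy 8 cores but mean different things (the program names are just placeholders):

# 8 tasks with 1 CPU each: srun launches 8 MPI ranks
srun -n 8 -c 1 --exclusive ./my_mpi_program

# 1 task with 8 CPUs: a single process that can run 8 threads (OpenMP, etc.)
srun -n 1 -c 8 --exclusive ./my_threaded_program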

The code I wrote launches up to 41 job steps in the background (i from 0 to 40 inclusive), but with --exclusive each step only starts once enough cores are free, waiting while they are busy. Each step uses 8 cores. The wait then blocks until all of them have completed, and I assume that after each round you update KeepRunning.txt.
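As a rough sketch of the flag-file side (the file names come from your question; the 0/1 convention is my assumption), the MATLAB driver would do the equivalent of:

# Hypothetical controller-side commands (MATLAB would use fopen/fprintf to do the same).
# Request evaluations for, say, particles 3 and 7 in the next round:
echo 1 > RunJob_3.txt
echo 1 > RunJob_7.txt

# Once the optimization has converged, stop the while loop in the SLURM script:
echo 0 > KeepRunning.txt

Note that, as written, the loop will relaunch any job whose RunJob_${i}.txt still contains 1 on the next pass, so either the driver or the job script should reset the flag to 0 once the corresponding step has been launched.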
