In Slurm 20.11, SchedMD has made some changes that we know will affect some NSC users. As far as NSC has been able to determine, these changes are intentional and permanent, i.e. we now need to update our jobs to match how Slurm works in the new version.
The semantics of srun have changed, and if you use srun today to launch job steps from within a job, you may need to change your job scripts to make them work with Slurm 20.11.
If you use mpprun or mpiexec.hydra to launch an MPI job and do not use srun in your job script for anything else, you are not affected by this change.
If your jobs only run on a single node and do not use srun, you are not affected by this change.
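For reference, a job script like the sketch below (the application path is just a placeholder) is not affected, since the MPI application is started with mpprun and srun is not used at all:
#!/bin/bash
#SBATCH -N2 --exclusive
# mpprun starts the MPI application; no srun job steps are launched,
# so this script behaves the same before and after the upgrade.
mpprun /somewhere/mympiapp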
In most cases, jobs using srun will still run, but fewer job steps will run concurrently, so you will lose performance. Often, this results in only one CPU core being used on each node (a slowdown of 97%).
For an application that uses mpprun to launch the main application and then uses srun to start e.g. a monitoring task on each node, the monitoring task might not start at all, or it might start before the main application and then block it from starting or from using all CPU cores.
Summary: if your application does not work, or runs slower, after the upgrade to Slurm 20.11, you are probably affected by this change and will need to modify your job. In this case, please read the rest of this page for hints on how to modify your job.
If you need help modifying your jobs to work with Slurm 20.11, please contact NSC Support.
In earlier Slurm versions, this would work as expected (run 64 concurrent tasks on the two assigned nodes until all 256 tasks have completed):
#!/bin/bash
#SBATCH -N2 --exclusive
#
for task in $(seq 1 256); do
    srun -n1 -N1 --exclusive /somewhere/myapp $task &
done
wait
With Slurm 20.11, the above script will only run two concurrent tasks (one on each node), leaving 62 of the 64 allocated CPUs idle!
With Slurm 20.11, you can instead do:
#!/bin/bash
#SBATCH -N2 --exclusive
#
for task in $(seq 1 256); do
    srun -n1 -N1 --exact /somewhere/myapp $task &
done
wait
Another option is to skip srun and use parallel and jobsh:
#!/bin/bash
#SBATCH -N2 --exclusive
#
module load parallel/20181122-nsc1
seq 1 256 | parallel --ssh=jobsh -S $(hostlist -e -s',' -d -p "$SLURM_CPUS_ON_NODE/" $SLURM_JOB_NODELIST) /somewhere/myapp {}
In this example we use GNU Parallel and ask it to run as many tasks per node as there are CPU cores on the node ($SLURM_CPUS_ON_NODE).
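If your task arguments are stored in a file rather than generated by seq, the same approach works; a sketch, assuming a hypothetical file tasklist.txt with one argument per line:
#!/bin/bash
#SBATCH -N2 --exclusive
#
module load parallel/20181122-nsc1
# Run myapp once per line in tasklist.txt, spread over all CPU cores in the job.
parallel --ssh=jobsh -S $(hostlist -e -s',' -d -p "$SLURM_CPUS_ON_NODE/" $SLURM_JOB_NODELIST) /somewhere/myapp {} :::: tasklist.txt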
Sometimes we want to launch a monitoring task, a debugger or something similar on all nodes in a job, but we don’t want CPUs to be allocated to those tasks and unavailable to the real application.
To do this, you can either use jobsh (which is designed to mimic ssh as far as possible while still using Slurm internally) or srun. If you use srun, you need to use certain options to ensure that it does not attempt to allocate CPU or memory for the task.
Example 1: use jobsh and loop over all nodes in the job
#!/bin/bash
#SBATCH -N2 --exclusive
# Start one instance of monitorapp per node in the job, but
# allocate no resources.
for node in $(hostlist -e "$SLURM_JOB_NODELIST"); do
    jobsh $node /somepath/monitorapp &
done
# Start the main application
mpprun /somepath/myapp
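If you prefer to stop the monitoring tasks yourself once the main application has finished (rather than letting them be terminated when the job ends), one possible variant of the loop above, assuming monitorapp runs until it is killed:
#!/bin/bash
#SBATCH -N2 --exclusive
# Start one instance of monitorapp per node, remember its PID, and
# terminate the monitors after the main application has finished.
pids=()
for node in $(hostlist -e "$SLURM_JOB_NODELIST"); do
    jobsh $node /somepath/monitorapp &
    pids+=($!)
done
mpprun /somepath/myapp
kill "${pids[@]}" 2>/dev/null
wait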
Example 2: use srun (with the same options jobsh would use) to launch one task per node in the job
#!/bin/bash
#SBATCH -N2 --exclusive
# Start one instance of monitorapp per node in the job, but
# allocate no resources.
srun --whole --mem-per-cpu=0 /somepath/monitorapp &
# Start the main application
mpprun /somepath/myapp
If you use srun to launch your main MPI application, you should probably switch to mpprun or mpiexec.hydra instead. Contact NSC Support for more information.
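As an illustration only (the application path is a placeholder), the change is typically a one-line swap in the job script:
#!/bin/bash
#SBATCH -N2 --exclusive
# Before: the MPI application was started with srun:
#   srun /somewhere/mympiapp
# After: start it with mpprun instead:
mpprun /somewhere/mympiapp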
It’s not all bad news… Slurm 20.11 also fixes various bugs, especially one that sometimes prevented GUI windows from being displayed when run on a compute node.
We also need to run a supported version to get security fixes, so staying at Slurm 20.02 long-term is not an option.