The hidden cost of short jobs and job steps

Summary: avoid running many jobs and job steps that are shorter than a few minutes. If you can, make most of your jobs longer than 10 minutes!

Details:

There is nothing inherently wrong about running a single short job. Feel free to run a short job to test something. In fact, when you start running a new type of job, we strongly encourage you to run a short test job to verify everything is working before submitting longer production jobs.

However, running many short jobs should be avoided! This is why:

  1. The job scheduler (Slurm) has limited performance, and it is not trivial to make it run faster, as most of the scheduling code is not parallelized. When the average job length is too short, we start seeing performance problems in the job scheduler. In extreme cases, it can become so busy that normal commands (e.g. squeue) hang or fail. What is worse, launching MPI jobs can fail if the job scheduler is overloaded (e.g. Intel MPI depends on srun to launch MPI ranks, and srun needs to communicate with the job scheduler to launch a "job step", even if the job itself is already running).

  2. All jobs and job steps are logged in the cluster (Slurm database + files in /var/log/accounting/slurm). Information about jobs is also sent to a SNIC database. The more jobs we run (i.e. the shorter the average job length is), the more difficult (and expensive) it becomes to handle this data.

  3. To ensure that jobs do not fail unnecessarily due to hardware or software problems on the compute nodes, we run an extensive check of the compute node between each job. This takes some time (seconds), but we feel it's time well spent. However, if the job itself also only runs for a handful of seconds, the overhead due to the node health check becomes excessive.

There is no hard limit on what is a "short" job. It depends not only on the job length but also how many such jobs you run. A single 5 second job is OK. Ten thousand such jobs are not OK.

Please note that the job step length is also relevant. A job has a larger overhead than a job step, but running "long" jobs where each job runs a very large number of job steps can also cause problems. Job steps are created when you use e.g. srun, mpprun, mpiexec.hydra, mpirun and jobsh.

Recommendation: if you find that the average runtime of your jobs (or job steps) is less than 10 minutes, please consider making them longer. If the average runtime is less than a minute, we strongly recommend that you make them longer!

There are several ways to increase the average job or job step length, e.g.:

  1. Package several subjobs into larger jobs, and run the subjobs serially within the larger job.

  2. If your short job uses more than one CPU core, it might be possible to run it on fewer cores for a longer time. Often this makes the application more efficient. This is not always possible, e.g. if you need a certain amount of memory per core/node.
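
As an illustration of the first option, here is a minimal sketch of a batch script that runs many short serial subjobs one after another inside a single job. The program name (./my_analysis), the input file names and the resource requests are hypothetical; adjust them to your own case. The sketch assumes each subjob is a serial program that can be started directly, without srun, so that no additional job steps are created.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

# run_subjobs COMMAND INPUT...
# Runs COMMAND once per INPUT, one after another, inside the current job.
# COMMAND is started directly (not through srun), so each subjob does not
# become a separate job step.
run_subjobs() {
    local cmd=$1
    shift
    local failed=0
    local input
    for input in "$@"; do
        if ! "$cmd" "$input"; then
            # Report the failure but continue with the remaining subjobs.
            echo "subjob failed for input: $input" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every subjob succeeded.
    [ "$failed" -eq 0 ]
}

# Hypothetical usage (program and file names are assumptions):
# run_subjobs ./my_analysis input_*.dat
```

Choose the number of subjobs per job so that the whole job runs well above the 10 minute mark, and set the time limit accordingly.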

If you need help in making these changes, please contact NSC Support.

To review the actual runtime of your recent jobs, you can use sacct. The following example shows the job ID, job name, state, number of CPU cores allocated, actual runtime (in seconds) and time limit for jobs that started on or after 2019-01-01 and have completed, failed or timed out:

sacct -S 2019-01-01 --state=COMPLETED,FAILED,TIMEOUT --format=JobID,JobName,State,AllocCPUs,ElapsedRaw,Timelimit -X

If you remove the "-X" option, sacct will show all job steps.
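
To compare your jobs against the 10 minute recommendation, the sacct output can be summarized with a few lines of shell. The following is a sketch only: the function name summarize_runtimes is hypothetical, and it assumes input lines of the form JobID|ElapsedRaw, which is what sacct prints with the --noheader and --parsable2 options and --format=JobID,ElapsedRaw.

```shell
#!/bin/bash

# summarize_runtimes: read "JobID|ElapsedRaw" lines on stdin and print the
# number of jobs, the average runtime in seconds, and how many jobs ran
# for less than 10 minutes (600 seconds).
summarize_runtimes() {
    awk -F'|' '
        # Only count lines whose second field is a whole number of seconds.
        $2 ~ /^[0-9]+$/ { n++; total += $2; if ($2 < 600) nshort++ }
        END {
            if (n == 0) { print "no jobs found"; exit }
            printf "jobs=%d average_seconds=%.0f shorter_than_10_min=%d\n", n, total / n, nshort
        }'
}

# Run the summary only where sacct is available (i.e. on the cluster).
if command -v sacct >/dev/null 2>&1; then
    sacct -S 2019-01-01 --state=COMPLETED,FAILED,TIMEOUT \
          --format=JobID,ElapsedRaw -X --noheader --parsable2 | summarize_runtimes
fi
```

If the reported average is below 600 seconds, or most of your jobs fall in the last column, consider packaging them into longer jobs.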

