Introduction

This page describes the job scheduling policy on Tetralith, i.e. how the system decides which jobs should be started and when.

It also describes how users and projects can interact with the system and adapt the job scheduling to their workflow (or in some cases, adapt their workflow to the job scheduling).

If you don't have time to read all of this, I recommend that you read this page and decide which model works best for your project. Then run your jobs, and come back to this page if you need to.

The policy is complicated, which is why there are so many footnotes [1] on this page. Read them for more in-depth information. There are also a number of short articles on various related subjects which you will find at the bottom of this page.

The basic goals for the scheduling policy are:

  • Allow most projects to easily run at least as many core hours as they have been allocated by SNIC. [2]
  • Keep the system utilization high. [3]
  • Allow some flexibility for projects to adapt the job scheduling to their needs. [4]

A key decision we have made is to never terminate/preempt a running job to let a higher-priority job run. [5]

On Tetralith we use Slurm for job scheduling. [6]

Scheduling policy and limits

  • Jobs are primarily started in priority order. The priority of a job is determined by how much the project associated with the job has run recently in relation to its allocated computing time ("fair-share scheduling"). [7]
  • Lower-priority jobs are sometimes started using "backfill" scheduling if this can be done without affecting higher-priority jobs. [8]
  • Once started, a job is never terminated/preempted to allow a higher-priority job to start.
  • A small number of nodes (currently eight) are reserved for test and development jobs shorter than one hour (the "devel" reservation). [9]
  • A project that has run more than 1.5x its allocated time recently is prevented from starting new jobs ("QOSUsageThreshold" limit). Read more about this limit and how you can sometimes get around it here.
  • A project may not use an unreasonably large part of the system at any one time (the "MAXPS"/"AssocGrpCPURunMinutesLimit" limit). [10]
  • A project may increase the priority of a small number of jobs for added flexibility using boost-tools.
  • A project may increase the time limit of a small number of jobs beyond the normal maximum (7 days) using boost-tools.
  • A project may reserve nodes for a certain time period using boost-tools.
  • You may not run "too flat" jobs. This is a soft limit, but you might be asked by the system administrators to not run certain jobs. See flat jobs for more information.
  • You may not run "too short" jobs. This is a soft limit, but you might be asked by the system administrators to not run certain jobs. See short jobs for more information.
  • When your project ends, no new jobs will be started, but we allow running jobs to finish. [11]
  • The maximum "wall time" (Timelimit) for a job is 7 days. The default (if you do not specify a limit) is 2 hours. Please try to estimate how much time your job will actually need, and only ask for that much plus a reasonable margin. This benefits both you and other users. [12]
  • Interactive jobs are not given priority over batch jobs. [13]
  • Sometimes there will be a large number of idle nodes and no jobs left in the queue that can be started due to the 1.5x limit described above. We will then start some of the blocked jobs as "bonus jobs" to keep the utilization of the system high. Only jobs shorter than 24 hours are eligible to be run as bonus jobs. [14]
  • For special needs, e.g. course projects that have lab sessions during fixed times, or scaling tests that require running many wide and short jobs, NSC can provide guaranteed access to nodes during certain times ("reservations"). Contact NSC Support if you believe reservations are needed by your project.
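The two hard limits in the list above (the 1.5x usage limit and the "MAXPS" limit) can be illustrated with a small Python sketch. This is not how Slurm or NSC implements the checks; it only reproduces the arithmetic described on this page. The names `Job` and `pending_reason` are made up for the illustration, and "recent usage" is taken as a given number, since the length of the accounting window is not stated here.

```python
from dataclasses import dataclass

USAGE_FACTOR = 1.5  # "QOSUsageThreshold": recent usage above 1.5x the allocation
MAXPS_FACTOR = 1.0  # "MAXPS": remaining core hours capped at 1.0x the allocation


@dataclass
class Job:
    """Hypothetical job record, used only for this illustration."""
    cores: int
    hours: float  # time limit of a new job, or time left for a running job

    @property
    def core_hours(self) -> float:
        return self.cores * self.hours


def pending_reason(allocation, recent_usage, running, new_job):
    """Return the Reason that would hold new_job, or None if neither limit applies."""
    if recent_usage > USAGE_FACTOR * allocation:
        return "QOSUsageThreshold"
    remaining = sum(job.core_hours for job in running)
    if remaining + new_job.core_hours > MAXPS_FACTOR * allocation:
        return "AssocGrpCPURunMinutesLimit"
    return None


# Numbers from the footnote on the MAXPS limit: 1000 core hours/month and one
# running job that uses 600 cores with one hour left to run.
running = [Job(cores=600, hours=1)]
print(pending_reason(1000, 800, running, Job(cores=40, hours=10)))  # 400 core hours fit
print(pending_reason(1000, 800, running, Job(cores=40, hours=11)))  # 440 core hours do not
print(pending_reason(1000, 1600, [], Job(cores=1, hours=1)))        # usage above 1.5x
```

The same function also shows why queued jobs in an ended project are reported as "QOSUsageThreshold": with an allocation of zero, any recent usage is more than 1.5x the allocation.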

  1. The scheduling policy and its various limits, tools etc. are complex. We are very aware of this. Unfortunately, all simple scheduling policies (e.g. a plain FIFO queue) have problems, and no matter which one you start with, you tend to end up with many added limitations, exceptions and external tools, and then the simple policy is no longer simple...

  2. There are of course situations in which this is not possible. For example, if a project has run nothing for the first 29 days of the month, it's very difficult to let it run all its core hours on the last day.

  3. Even if idle compute nodes use less power, the depreciation cost is a very large part of the total system cost, so it makes sense to use the system as much as possible.

  4. Some projects want nothing except to run as many jobs as possible and do not care how long an individual job has to wait in the queue. Other projects might need to get access to compute nodes quickly, but care less about the total throughput.

  5. In many ways, terminating running low-priority jobs to allow new high-priority jobs to start would simplify job scheduling and provide a lot of flexibility, but it has a high cost to users: all jobs must be made restartable (and many applications cannot easily be restarted). Making jobs preemptable would be unfair to the many users whose applications cannot be restarted, as their jobs would then randomly fail whenever a higher-priority job was submitted.

  6. As far as possible, we use Slurm's built-in features to realize the scheduling policy, but we have also developed tools (e.g. boost-tools and bonus job scheduling) that work outside Slurm.

  7. Within a project, fair-share is also used so that users that have run a lot recently have lower priority than users that have run little recently.

  8. Backfilling is the process of scheduling jobs on compute nodes that would otherwise be idle while waiting for enough nodes to become available to start a large job. If there are idle nodes available and a lower-priority job can be started without affecting the estimated start time of the highest-priority job, the lower-priority job is started. If more than one low-priority job could be started using backfill, the highest-priority one is selected. In general, jobs shorter than a few hours have a good chance of being started using backfill. However, please don't make your jobs too short; see this page for why.

  9. To use the development nodes, add "--reservation=devel" to your sbatch/interactive options and request one hour or less of walltime. Example: interactive --reservation=devel -N1 -t 01:00:00. A single user can only use a total of two devel nodes (64 cores) at any one time. We encourage you to use the devel nodes to test new jobs before submitting them to the normal queue. That way you can quickly find simple problems like syntax errors in the job script, etc.

  10. This limit (sometimes referred to as the "MAXPS" limit) prevents a project from having more remaining core hours in running jobs than 1.0x its monthly allocation. Without this limit, if the queue was empty and there were many idle nodes, a single project (even a very small one) could fill the entire system with 7-day jobs, running much more than its allocated time and making it difficult for other projects to get any work done in the next 7 days. Example: if a project with an allocation of 1000 core hours/month already has one job running that uses 600 cores and has one hour left to run (600 core hours remaining), the biggest job that can be started using that project is 400 core hours (e.g. a 40-core job for 10 hours). Note that this also puts an upper limit on how large a job a project can run, as a job larger than 1.0x the monthly allocation violates this rule and will never start. Jobs blocked by this rule have the Reason "AssocGrpCPURunMinutesLimit" set (shown by squeue).

  11. When your project has ended, any jobs still left in the queue will be shown with the Reason "QOSUsageThreshold" set (shown by squeue). This is a little confusing, but it happens because your project size is now zero, and you have used more than 1.5x zero core hours recently. If you do not remove such jobs from the queue, NSC might do so.

  12. You can under some conditions run longer jobs; see boost-tools. If you ask for a much larger Timelimit than your job actually needs, it becomes harder to predict the start time for queued jobs (which is bad for both you and other users) and your job might start later than it otherwise would (due to not being eligible for backfill or bonus).

  13. The reason why we don't prioritize interactive jobs over batch jobs is that many batch jobs are urgent, and that many interactive jobs aren't (e.g. starting a Matlab GUI in an interactive session and then running it for 7 days). We also believe there would be some abuse of the system if we prioritized interactive jobs.

  14. Bonus jobs run "for free" (they don't affect the project's future fair-share priority), but they show up in e.g. projinfo and SUPR. Bonus jobs are only started when too many nodes are idle, and we never fill all idle nodes with bonus jobs (as that would prevent new high-priority jobs from being started quickly). Bonus jobs are started in priority order (i.e. a project that has run 1.5x its allocation will get its jobs started as bonus jobs before a project that has run 1.7x its allocation). The 24 hour limit is set to limit how far into the future bonus jobs can affect normal jobs. It's common to see bonus jobs start during weekends when demand is lower, and allowing them to run for e.g. 7 days would cause them to prevent higher-priority jobs from starting on Monday. We will sometimes allow bonus jobs longer than 24h if there is more than 24h remaining until the start of the next workday. Bonus jobs should be considered an extra bonus, not a right! Remember, you can still run up to 1.5x your allocation before you're blocked from running normal jobs! The command projinfo --qos=bonus can be used to see how much CPU time has been used by bonus jobs. Jobs may still be shown as "QOSUsageThreshold" even if they might be started as bonus jobs.
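As a rough illustration of the bonus job rules in footnote 14, the following Python sketch picks bonus jobs from a list of blocked jobs. It is only a guess at the general shape of the selection, not NSC's actual tool. `QueuedJob`, `pick_bonus_jobs` and the `keep_idle` parameter are hypothetical names, the number of nodes deliberately left idle is not stated on this page, and skipping a job that does not fit (rather than stopping) is an assumption.

```python
from dataclasses import dataclass

BONUS_MAX_HOURS = 24  # only jobs shorter than 24 hours are eligible


@dataclass
class QueuedJob:
    """Hypothetical record for a job blocked by the 1.5x limit."""
    name: str
    nodes: int
    hours: float        # requested time limit
    usage_ratio: float  # project's recent usage divided by its allocation


def pick_bonus_jobs(queue, idle_nodes, keep_idle):
    """Choose blocked jobs to start as bonus jobs, lowest usage ratio first."""
    budget = idle_nodes - keep_idle  # never hand out every idle node
    eligible = [job for job in queue if job.hours < BONUS_MAX_HOURS]
    chosen = []
    for job in sorted(eligible, key=lambda job: job.usage_ratio):
        if job.nodes <= budget:  # assumption: jobs that do not fit are skipped
            chosen.append(job.name)
            budget -= job.nodes
    return chosen


queue = [
    QueuedJob("a", nodes=4, hours=12, usage_ratio=1.7),
    QueuedJob("b", nodes=4, hours=12, usage_ratio=1.5),
    QueuedJob("c", nodes=2, hours=168, usage_ratio=1.6),  # 7 days: too long
    QueuedJob("d", nodes=2, hours=6, usage_ratio=1.6),
]
print(pick_bonus_jobs(queue, idle_nodes=12, keep_idle=4))
```

With eight nodes available for bonus jobs, the project at 1.5x is served first, then the 6-hour job from the project at 1.6x; the job from the project at 1.7x no longer fits, and the 7-day job is never considered.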
