This page describes the job scheduling policy on Tetralith, i.e. how the system decides which jobs should be started and when.
It also describes how users and projects can interact with the system and adapt the job scheduling to their workflow (or in some cases, adapt their workflow to the job scheduling).
If you don't have time to read all of this, I recommend that you read this page and decide which model works best for your project. Then run your jobs, and come back to this page if you need to.
The policy is complicated, which is why there are so many footnotes [1] on this page. Read them for more in-depth information. There are also a number of short articles on various related subjects which you will find at the bottom of this page.
The basic goals for the scheduling policy are:
A key decision we have made is to never terminate/preempt a running job to let a higher priority job run. [5]
The scheduling policy and its various limits, tools, etc. are complex. We are very aware of this. Unfortunately, all simple scheduling policies (e.g. a plain FIFO queue) have problems, and no matter which one you start with, you tend to end up with many added limitations, exceptions and external tools, and then the simple policy is no longer simple...↩
There are of course situations in which this is not possible. For example, if a project has run nothing for the first 29 days of the month, it's very difficult to let it run all its core hours on the last day.↩
Even if idle compute nodes use less power, the depreciation cost is a very large part of the total system cost, so it makes sense to use the system as much as possible.↩
Some projects want nothing except to run as many jobs as possible but do not care about how long an individual job has to wait in the queue. Another project might need to get access to compute nodes quickly, but cares less about the total throughput.↩
In many ways, terminating running low-priority jobs to allow new high-priority jobs to start would simplify job scheduling and provide a lot of flexibility, but it has a high cost to users: all jobs must be made restartable (and many applications cannot even easily be restarted). Making jobs preemptable would be unfair to the many users whose applications cannot be restarted, as their jobs would then randomly fail if a higher-priority job were submitted.↩
As far as possible, we use Slurm's built-in features to realize the scheduling policy, but we have also developed tools (e.g. boost-tools and bonus job scheduling) that work outside Slurm.↩
Within a project, fair-share is also used so that users that have run a lot recently have lower priority than users that have run little recently.↩
Backfilling is the process of scheduling jobs on compute nodes that would otherwise be idle while waiting for enough nodes to become available to start a large job. If there are idle nodes available and a lower priority job can be started without affecting the estimated start time of the highest priority job, the lower priority job is started. If more than one low-priority job could be started using backfill, the highest priority one is selected. In general, jobs shorter than a few hours have a good chance of being started using backfill. However, please don't make your jobs too short, see this page for why.↩
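The backfill rule described in this footnote can be illustrated with a small Python sketch. This is a deliberately simplified model and not how Slurm implements backfill: it assumes a candidate job is "safe" only if it fits on the currently idle nodes and its requested time limit ends before the estimated start of the highest priority job. The names `QueuedJob` and `pick_backfill_job` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueuedJob:
    name: str
    priority: int        # higher value = higher priority (assumption)
    nodes: int           # number of nodes requested
    limit_hours: float   # requested time limit in hours


def pick_backfill_job(candidates: List[QueuedJob],
                      idle_nodes: int,
                      hours_until_top_job_starts: float) -> Optional[QueuedJob]:
    """Return the highest priority job that can run without delaying the top job.

    Simplification: a job qualifies only if it fits on the idle nodes and
    its time limit expires before the top job's estimated start time.
    """
    fitting = [
        job for job in candidates
        if job.nodes <= idle_nodes
        and job.limit_hours <= hours_until_top_job_starts
    ]
    # Among the jobs that fit, the highest priority one is selected.
    return max(fitting, key=lambda job: job.priority, default=None)


queue = [
    QueuedJob("long", priority=90, nodes=2, limit_hours=48.0),
    QueuedJob("short", priority=50, nodes=2, limit_hours=3.0),
    QueuedJob("tiny", priority=10, nodes=1, limit_hours=1.0),
]
chosen = pick_backfill_job(queue, idle_nodes=4, hours_until_top_job_starts=6.0)
print(chosen.name if chosen else "nothing fits")
```

In this model the 48-hour job is skipped despite its higher priority, because it would still be running when the top job is due to start. It also shows why asking for a realistic time limit helps: a job with a shorter limit fits into more gaps.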
To use the development nodes, add "--reservation=devel" to your sbatch/interactive options and request one hour or less of walltime. Example:
interactive --reservation=devel -N1 -t 01:00:00
A single user can only use a total of two devel nodes (64 cores) at any one time. We encourage you to use the devel nodes to test new jobs before submitting them to the normal queue. That way you can quickly find simple problems like syntax errors in the job script, etc.↩
This limit (sometimes referred to as the "MAXPS" limit) prevents a project from having more remaining core hours in running jobs than 1.0x its monthly allocation. Without this limit, if the queue were empty and there were many idle nodes, a single project (even a very small one) could fill the entire system with 7-day jobs, running much more than its allocated time and making it difficult for other projects to get any work done in the next 7 days. Example: if a project with an allocation of 1000 core hours/month already has one job running that uses 600 cores and has one hour left to run (i.e. 600 core hours remaining), the biggest job that can be started using that project is 400 core hours (e.g. a 40-core job for 10 hours). Note that this also puts an upper limit on how large a job a project can run, as a job larger than 1.0x the monthly allocation violates this rule and will never start. Jobs blocked by this rule have the Reason "AssocGrpCPURunMinutesLimit" set (shown by squeue).↩
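The arithmetic behind this limit can be written out as a short Python sketch. It only reproduces the calculation from the example in this footnote; the function names and the representation of running jobs as `(cores, hours_left)` pairs are assumptions made for illustration, not part of any NSC or Slurm tool.

```python
def remaining_core_hours(running_jobs):
    """Sum cores * hours left over all running jobs of a project.

    running_jobs is a list of (cores, hours_left) pairs (hypothetical format).
    """
    return sum(cores * hours_left for cores, hours_left in running_jobs)


def largest_startable_job(monthly_allocation, running_jobs, factor=1.0):
    """Core hours still available for a new job under the 1.0x limit."""
    headroom = factor * monthly_allocation - remaining_core_hours(running_jobs)
    return max(0.0, headroom)


def can_start(monthly_allocation, running_jobs, cores, hours):
    """True if a new job of cores * hours fits within the remaining headroom."""
    return cores * hours <= largest_startable_job(monthly_allocation, running_jobs)


# The example from the text: 1000 core hours/month, one running job
# on 600 cores with one hour left.
running = [(600, 1)]
print(largest_startable_job(1000, running))   # headroom in core hours
print(can_start(1000, running, 40, 10))       # the 40-core, 10-hour job
print(can_start(1000, [], 64, 168))           # a 7-day job on 64 cores
```

The last call illustrates the upper limit on job size: even with nothing else running, a job that asks for more core hours than the monthly allocation can never pass the check.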
When your project has ended, any jobs still left in the queue will be shown with the Reason "QOSUsageThreshold" set (shown by squeue). This is a little confusing, but it is due to the fact that your project size is now zero, and you have therefore used more than 1.5x zero core hours recently. If you do not remove such jobs from the queue, NSC might do so.↩
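Why an ended project trips this limit is easiest to see as a comparison. The Python sketch below is only an illustration of the rule as stated in this footnote. How "recent" usage is actually measured is not covered here, so `recent_usage` is simply taken as a given number, and the function name is hypothetical.

```python
def blocked_by_usage_threshold(recent_usage, monthly_allocation, factor=1.5):
    """True if recent usage exceeds factor times the monthly allocation.

    recent_usage and monthly_allocation are both in core hours. How the
    recent usage is computed is an input assumption of this sketch.
    """
    return recent_usage > factor * monthly_allocation


# An active project with 1000 core hours/month:
print(blocked_by_usage_threshold(1400, 1000))
print(blocked_by_usage_threshold(1600, 1000))

# An ended project has allocation zero, so any recent usage exceeds 1.5 * 0:
print(blocked_by_usage_threshold(1, 0))
```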
You can under some conditions run longer jobs, see boost-tools. If you ask for a much larger Timelimit than your job actually needs, it becomes harder to predict the start time for queued jobs (which is bad for both you and other users) and your job might start later than it otherwise would (due to not being eligible for backfill or bonus).↩
The reason why we don't prioritize interactive jobs over batch jobs is that many batch jobs are urgent, and that many interactive jobs aren't (e.g. starting a Matlab GUI in an interactive session and then running it for 7 days). We also believe there would be some abuse of the system if we prioritized interactive jobs.↩
Bonus jobs run "for free" (they don't affect the project's future fair-share priority), but they show up in e.g. projinfo and SUPR. Bonus jobs are only started when too many nodes are idle, and we never fill all idle nodes with bonus jobs (as that would prevent new high-priority jobs from being started quickly). Bonus jobs are started in priority order (i.e. a project that has run 1.5x its allocation will get its jobs started as bonus jobs before a project that has run 1.7x its allocation). The 24 hour limit is set to limit how far into the future bonus jobs can affect normal jobs. It's common to see bonus jobs start during weekends when demand is lower, and allowing them to run for e.g. 7 days would cause them to prevent higher-priority jobs from starting on Monday. We will sometimes allow bonus jobs longer than 24h if there is more than 24h remaining until the start of the next workday. Bonus jobs should be considered an extra bonus, not a right! Remember, you can still run up to 1.5x your allocation before you're blocked from running normal jobs! The command projinfo --qos=bonus can be used to see how much CPU time has been used by bonus jobs. Jobs may still be shown as "QOSUsageThreshold" even if they might be started as bonus jobs.↩
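The ordering of bonus jobs between projects can be sketched in Python as a sort on how far each project has run relative to its allocation. This is a simplified model of the rule stated in this footnote, not the actual bonus job scheduler. The function name and the input format are hypothetical, and projects with a zero allocation are skipped in the sketch only to avoid dividing by zero.

```python
def bonus_start_order(projects):
    """Return project names ordered by usage ratio, lowest ratio first.

    projects maps a project name to (recent_usage, monthly_allocation),
    both in core hours (hypothetical input format).
    """
    ratios = {
        name: used / allocation
        for name, (used, allocation) in projects.items()
        if allocation > 0  # assumption of this sketch: skip zero allocations
    }
    # A project at 1.5x its allocation comes before one at 1.7x.
    return sorted(ratios, key=ratios.get)


print(bonus_start_order({
    "project_a": (1700, 1000),
    "project_b": (1500, 1000),
}))
```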
How you can adapt the job scheduling to your workflow, or vice versa
A more detailed description of fair-share scheduling and settings
Flat jobs and their hidden cost
Tips and tricks for monitoring queued and running jobs
Short jobs and job steps and their hidden cost
The hard limit for how much a project may run. Also known as the QOSUsageThreshold limit
Some reasons for why your job won't start, and what you can do about it