A project is allowed to use more than its allocated computing time. Not all projects use up their allocation, and unused computing time is wasted, since most of the Tetralith cost is fixed (e.g. hardware depreciation).
However, we have had to put a limit in place for how much more a project can run. There are several reasons for this, but the most important one is the "weekend" effect. This is when a low-priority project gets many long jobs (up to 7 days on Tetralith) started during a period of low demand (e.g. Sunday evening). Those jobs then run well into a period of much higher demand (e.g. Monday-Friday) and prevent higher-priority jobs, which were not submitted until the period of high demand, from starting.
The limit is currently set to 1.5 times the project's allocation.
The limit is compared to the project's recent usage. "Recent" is not a moving 30-day window! Because Slurm already had support for doing it this way, the limit is implemented by setting UsageThreshold to 0.66 and enabling EnforceUsageThreshold. This means that Slurm will not start jobs for projects whose "Fairshare Usage" is below 0.66. If you run jobs evenly spread out over time, UsageThreshold=0.66 corresponds to roughly 1.5 times your allocation (1/1.5 ≈ 0.67).
"Fairshare Usage" is the same counter that is used to determine a project's priority (shown by sshare as LevelFS). It differs from a moving window in that jobs run recently have a greater weight than jobs run a long time ago. This means it is possible to hit the limit even if your usage in the last 30 days is lower than your monthly allocation, if most of that usage was very recent. On the other hand, if the bulk of your usage was a long time ago, you will not hit the limit despite having used more than your allocated time in the last 30 days.
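To illustrate why a decayed usage counter behaves differently from a moving 30-day window, here is a minimal Python sketch. It is not how Slurm computes LevelFS internally. It only assumes that each day's usage is weighted by an exponential decay with a half-life (in Slurm this is governed by PriorityDecayHalfLife). The 14-day half-life, the usage numbers and the function names are all made up for illustration, since the value used on Tetralith is not stated here.

```python
# Illustrative only: compares a plain 30-day sum with an exponentially
# decayed sum. The half-life below is an assumed value, not Tetralith's.
HALF_LIFE_DAYS = 14      # assumption for illustration
USAGE_THRESHOLD = 0.66   # value quoted in the text above


def decayed_usage(daily_core_hours, half_life_days):
    """Sum daily usage, weighting each day by 0.5 ** (age / half_life).

    daily_core_hours[0] is today, daily_core_hours[1] is yesterday, etc.
    """
    return sum(
        hours * 0.5 ** (age / half_life_days)
        for age, hours in enumerate(daily_core_hours)
    )


def window_usage(daily_core_hours, window_days=30):
    """Plain moving-window sum over the most recent window_days days."""
    return sum(daily_core_hours[:window_days])


def is_blocked(level_fs, threshold=USAGE_THRESHOLD):
    """A project is blocked when its fairshare value is below the threshold."""
    return level_fs < threshold


# Two hypothetical projects with the same 30-day total (5000 core hours):
recent_burst = [1000.0] * 5 + [0.0] * 25   # everything in the last 5 days
old_burst = [0.0] * 25 + [1000.0] * 5      # everything 25-29 days ago

for name, usage in (("recent burst", recent_burst), ("old burst", old_burst)):
    print(
        name,
        "window:", window_usage(usage),
        "decayed:", round(decayed_usage(usage, HALF_LIFE_DAYS), 1),
    )

# The allocation multiple implied by the threshold (about 1.5):
print("implied multiple:", round(1 / USAGE_THRESHOLD, 2))
```

Both projects have identical 30-day totals, but the recent burst yields a much larger decayed sum, so it would push the project toward the limit sooner.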
The projinfo tool will tell you if your project is blocked by this limit. It will also attempt to predict when your project will be able to run normal jobs again. Please note that it is difficult to say with certainty when a project will go under the limit, especially if the project still has running jobs (which will affect LevelFS), so take the prediction with a grain of salt.
When a project hits the limit, no normal jobs can start until recent usage has fallen below the limit.
Jobs belonging to projects blocked by this limit will be shown by squeue with the Reason set to "QOSUsageThreshold".
Blocked projects can still run test/development jobs in the "devel" reservation.
If there are enough idle nodes that the blocked project's jobs are unlikely to affect anyone else, those jobs are started anyway as "bonus" jobs. Only jobs with a time limit of less than 24 hours are eligible to be started as bonus jobs. Based on experience from 2019, bonus jobs only run a handful of times per year (summer, Christmas, long weekends, ...), so don't rely on them being available.
Blocked projects can also force a small number of jobs to start using boost-tools. Since usage of boost-tools is limited by how many tokens the project has left, the number of jobs you can run this way is also limited (to a maximum of 10% of your monthly allocation).
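If you want to check from a script which of your pending jobs are held back by the limit, the squeue reason mentioned above can be filtered for. The following is a rough sketch, not an official tool. The function names are hypothetical, and it assumes squeue is asked for the two-column output "%i %r" (job ID and reason) with the header suppressed.

```python
import subprocess

BLOCKED_REASON = "QOSUsageThreshold"


def blocked_job_ids(squeue_output):
    """Return job IDs whose reason is BLOCKED_REASON.

    Expects one job per line in the form "<jobid> <reason>".
    """
    blocked = []
    for line in squeue_output.splitlines():
        fields = line.split(None, 1)  # split into job ID and the rest
        if len(fields) == 2 and fields[1].strip() == BLOCKED_REASON:
            blocked.append(fields[0])
    return blocked


def query_squeue(user):
    """Run squeue for one user's pending jobs (requires Slurm on PATH)."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states", "PENDING",
         "--user", user, "--format", "%i %r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a login node, blocked_job_ids(query_squeue("your_username")) would then list the affected job IDs. The projinfo tool remains the better source for project-level status.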