Scheduling changes 2017-10-03

Several changes were made to the scheduling on Bi on 2017-10-03:

  • Use fairshare to prioritise queued jobs based on the groups' recent usage compared to allocated shares
  • Double the group node limits
  • Remove the shared node limit and node overallocation that affected groups sm_a, sm_foua, sm_fouh, sm_foul and sm_foup
  • Add high priority jobs
  • Add low priority jobs
  • Allow normal jobs to use fat nodes when no other nodes are available

Scheduling policy on Bi

Bi uses fairshare to prioritise between jobs from various groups. Each user group on Bi is assigned a number of shares of the cluster. The more resources a group has used, relative to its assigned share of the total available resources, the lower priority the group's jobs get in the queue.

To improve utilisation the scheduler also uses backfill, which allows jobs to be started out of priority order when this does not delay the predicted start time of higher priority jobs.

Exception for the fm group

The fm user group is handled separately from the other groups and always has 44 nodes available for normal jobs.

Number of shares for user groups

The number of shares for each group is based on the node limits used before 2017-10-03. The number of shares is also used to calculate node limits for high priority jobs. Values as of 2017-10-03:

Group (Slurm Account)   Shares
rossby                      69
sm_a                         3
sm_foua                      2
sm_fouh                      2
sm_foul                     21
sm_fouo                     69
sm_foup                     45

Job types

There are four types of jobs on Bi:

Normal jobs

The vast majority of jobs should be submitted as normal jobs.

Jobs are prioritised using fairshare.

Each group has a node limit that is twice the group's number of shares (but at least 20 nodes). This limit is to protect against a single group using all the nodes, for example after a planned downtime when the machine is empty.
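As a sketch, the limit rule above can be expressed as follows (node_limit is an illustrative helper, not a Slurm command; share values are from the table earlier on this page):

```shell
# Illustrative calculation of a group's normal-job node limit:
# twice the group's shares, but never below 20 nodes.
node_limit() {
    local shares=$1
    local limit=$((shares * 2))
    if [ "$limit" -lt 20 ]; then
        limit=20
    fi
    echo "$limit"
}

node_limit 69   # rossby: 138 nodes
node_limit 2    # sm_foua: the 20 node floor applies
```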

High priority jobs

High priority jobs have higher queue priority than all other job types. They are not affected by the CPU limits affecting normal jobs, but have their own smaller limits (mainly to prevent mistakes and abuse). Unless the cluster is completely full, it should be possible to start high priority jobs with very short queue times.

Intended use cases are:

  • interactive work
  • short test jobs
  • short development jobs
  • other urgent jobs that for some reason need to start right away

Usage by high priority jobs is included when calculating fairshare priority for both normal and high priority jobs.

To submit high priority jobs, use --qos=high.
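For example (job sizes, time limits and the script name are placeholders, to be adapted to your work):

```shell
# Interactive session with high priority (1 core, 1 hour):
interactive --qos=high -n 1 -t 01:00:00

# Short test job with high priority:
sbatch --qos=high -n 16 -t 00:15:00 test_job.sh
```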

Low priority jobs

Low priority jobs have lower priority than normal and high priority jobs, and will only be started if no other jobs need the requested resources.

Low priority jobs have a max allowed walltime of 2 hours.

Usage of low priority jobs is "free" and is NOT included when calculating fairshare priority for normal jobs.

To submit low priority jobs, use --qos=low.
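For example (job size and script name are placeholders; note that the time limit cannot exceed the 2 hour maximum for --qos=low):

```shell
sbatch --qos=low -n 32 -t 02:00:00 postprocess.sh
```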

Risk jobs

Risk jobs are low priority jobs that are preemptable, without the short walltime limit. Low priority jobs and risk jobs share the same low queue priority.

Risk jobs have no node limits and can use all nodes in the cluster. They are preemptable, and will be killed if the nodes they run on are needed to run a normal or high priority job.

For a risk batch job to be re-queued automatically when preempted, use the --requeue option to sbatch. Without this the job will be cancelled when preempted. Note that requeueing will cause the same batch script to be executed multiple times. It is your responsibility to ensure that this does not cause input/output files to be overwritten (or similar problems).

Usage of risk jobs is "free" and is NOT included when calculating fairshare priority for normal jobs.

To submit risk jobs, use --qos=risk.
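A minimal risk batch script might look like this (the program and file names are placeholders; SLURM_RESTART_COUNT is an environment variable that Slurm sets when a batch job is re-queued):

```shell
#!/bin/bash
#SBATCH --qos=risk
#SBATCH --requeue           # re-queue automatically if preempted
#SBATCH -n 32
#SBATCH -t 1-00:00:00

# The script can run more than once if the job is preempted and re-queued,
# so avoid overwriting earlier output (the restart count is unset or 0 on
# the first run and increments on each requeue):
./my_program > "output.${SLURM_JOB_ID}.${SLURM_RESTART_COUNT:-0}.txt"
```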

Specifying Slurm Account

If you are a member of more than one group, you should always use an option such as -A rossby or -A sm_fouo with sbatch/interactive to tell Slurm which account to run under.

If you are only part of one group you do not need to use the -A option for normal job submission. You might have to use it under special circumstances, such as cron jobs.
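For example (the script name is a placeholder; account names are from the table earlier on this page):

```shell
sbatch -A rossby job.sh
interactive -A sm_fouo -n 1 -t 01:00:00
```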

Time limits

The maximum wall time for a job is 7 days (except for low priority jobs, which have a 2 hour limit). The default time limit (if you do not use a "-t" flag) is 2 hours. Please use the "-t" flag to set a time limit that is appropriate for each job!

Avoid running long jobs if the work can be split into several shorter jobs without losing performance. Several shorter jobs can improve the overall scheduling on the cluster. However, there are limits as Bi is not optimised for very short jobs. For example, splitting a 30 minute job into 30 1-minute jobs is not recommended.
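Slurm's "-t" flag accepts several time formats, for example (the script name is a placeholder):

```shell
sbatch -t 30 job.sh          # 30 minutes
sbatch -t 04:00:00 job.sh    # 4 hours
sbatch -t 2-12:00:00 job.sh  # 2 days and 12 hours
```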

Fat nodes

Bi has 12 fat nodes with extra memory. To use them, add -C fat to your job specification.

Use of the fat nodes counts towards fairshare usage like any other jobs. Jobs not requesting fat nodes can be scheduled on fat nodes if no other nodes are available.

All job types can request fat nodes.
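For example, a batch script requesting a whole fat node (the program name is a placeholder):

```shell
#!/bin/bash
#SBATCH -C fat        # only run on nodes with extra memory
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -t 06:00:00

./my_memory_hungry_program
```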

Node sharing

Node sharing is available on Bi. The idea behind node sharing is that you do not have to allocate a full compute node in order to run a small job using, say, 1 or 2 cores. Thus, if you request a job like sbatch -n 1 ... the job may share the node with other jobs smaller than 1 node. Jobs using a full node or more will not experience this (that is, we will not pack two 24-core jobs into 3 nodes). You can turn off node-sharing for otherwise eligible jobs using the --exclusive flag.

Warning: If you do not include -n, -N or --exclusive to commands like sbatch and interactive, you will get a single core, not a full node.

When you allocate less than a full node, you get a proportional share of the node's memory. On a thin node with 64 GiB, that means that you get 2 GiB per allocated hyperthread which is the same as 4 GiB per allocated core.
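The proportional share works out as follows, assuming a 64 GiB thin node with 16 cores and 32 hyperthreads (the core and hyperthread counts are inferred from the per-core and per-hyperthread figures above, not stated explicitly):

```shell
node_mem_gib=64
hyperthreads=32   # assumed layout: 16 cores x 2 hyperthreads
cores=16

echo "$((node_mem_gib / hyperthreads)) GiB per hyperthread"   # 2 GiB
echo "$((node_mem_gib / cores)) GiB per core"                 # 4 GiB
```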

If you need more memory you need to declare that using an option like --mem-per-cpu=MEM, where MEM is the memory in MiB per hyperthread (even if you do not allocate your tasks on the hyperthread level).

Example: to run a process that needs approximately 32 GiB on one core, you can use -n1 --mem-per-cpu=16000. As you have not turned on hyperthreading, you allocate a whole core, but the memory is still specified per hyperthread.

As a comparison, -n2 --ntasks-per-core=2 --mem-per-cpu=16000 allocates two hyperthreads (on a core). Together, they will also have approximately 32 GiB of memory to share.

Note: you cannot request a fat node on Bi by passing a --mem or --mem-per-cpu option too large for thin nodes. You need to use the -C fat option discussed above.

Job private directories

Each compute node has a local hard disk with approximately 420 GiB available for user files. The environment variable $SNIC_TMP in the job script environment points to a writable directory on the local disk that you can use. A difference on Bi vs Krypton is that each job has private copies of the following directories used for temporary storage:

/scratch/local ($SNIC_TMP)
/tmp
/var/tmp

This means that one job cannot read files written by another job running on the same node. This applies even if it is two of your own jobs running on the same node!

Please note that anything stored on the local disk is deleted when your job ends. If temporary or output files stored there need to be preserved, copy them to project storage at the end of your job script.
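For example, a batch script that works on the local disk and saves its results (the project paths and program name are placeholders):

```shell
#!/bin/bash
#SBATCH -n 16
#SBATCH -t 04:00:00

# Work in the job-private directory on the node-local disk:
cd "$SNIC_TMP"
cp /proj/myproject/input.dat .
./my_program input.dat > output.dat

# Everything under $SNIC_TMP is deleted when the job ends,
# so copy results back to project storage:
cp output.dat /proj/myproject/results/
```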

