Triolith scheduling policy

Triolith uses a fairshare scheduler operating on the project level, i.e the more CPU time a project has used (as a percentage of its monthly allocation), the lower the priority of all queued jobs in that project will be.

The maximum wall time limit for a job is 7 days (168 hours). The default wall time limit (if you don't use the -t option to sbatch/interactive) for a job is 2 hours.
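
For example, requesting the maximum 7 days of wall time for a batch job (job.sh is a placeholder for your own job script): sbatch -t 7-00:00:00 job.sh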

The maximum wall time limit for a job on a node reserved for development and testing (--reservation=devel) is 1 hour. There are currently eight nodes reserved for test and development jobs.

Please note that the test and development nodes are expected to be available with little or no queue time to all users, so it is not acceptable for a single user to use all of them. Please use common sense.

Fairshare scheduling on Triolith

How does fairshare scheduling work on Triolith?

Fairshare scheduling on Triolith attempts to give each project (not user) a "fair share" of the available computing time of the system over time.

A "fair share" is not an equal share. A project's share is the time SNIC allocated to the project (e.g 100000 core hours per month) divided by the total capacity of the system (18 million core hours per month).

A project that makes a reasonable effort to use its allocated time (i.e keeps some jobs in the queue most of the time) can expect to be able to run approximately as many core hours as allocated by SNIC, or more.

The fairshare scheduler tries to achieve this by adjusting the priority of queued jobs. Since the queue is continuously re-sorted by priority, this generally results in short queue times for jobs submitted by projects with high priority, and long queue times for jobs submitted by projects with low priority.

The priority of a queued job is determined by how much the project has run recently compared to its allocation. The higher the usage is (as a percentage of the allocation), the lower the priority is.

There is no limit on how much a project can run in a month. But the more you run, the lower your priority will be, so the harder it will be to run the next job.

If you are interested in the gory details of how this is implemented: we use the SLURM multifactor plugin (https://computing.llnl.gov/linux/slurm/priority_multifactor.html)

The running configuration settings for the multifactor plugin can be seen by running "scontrol show config". As of 2012-10-24, the most interesting ones are:

PriorityDecayHalfLife   = 21-00:00:00
PriorityWeightFairShare = 1000000
PriorityWeightAge       = 1000

This means that the job priority is almost entirely determined by the FairShare of the project. The PriorityWeightAge is so much smaller that the age of a job will never affect the ordering of projects, only the ordering of jobs belonging to the same project.
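
If you want to see how these settings translate into actual priorities, the standard SLURM commands sprio and sshare can be used. A minimal sketch (the exact columns shown depend on the SLURM version and configuration):

sprio -l     # priority of pending jobs, broken down into components such as AGE and FAIRSHARE
sshare -a    # fairshare usage and the resulting FairShare factor per project and user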

Backfill - unfair handling of small jobs!

In addition to the main fairshare scheduler, which will always try to start the highest priority job, we also use a backfill scheduler.

Backfilling is the process of scheduling jobs into the holes created from large jobs that are waiting for nodes to become available.

If there are idle nodes available and a lower priority job can be started without affecting the start time of the highest priority job, the lower priority job is started. If more than one low-priority job could be started using backfill, the highest priority one is selected.

How can I adapt the scheduling to my workflow?

Two examples of how you can adapt your usage to get the scheduling behaviour your workflow needs:

Group A: is allocated 50000 core hours per month. They only care about getting as much work done as possible, so they submit many jobs and make sure that some are always waiting in the queue. This group might be able to run significantly more than 50000 core hours per month, but their queue priority will be low, and each job will on average wait a long time in the queue.

Group B: is allocated 50000 core hours per month. They need to run a limited number of jobs, but need their jobs to start quickly. As long as they run significantly less than their allocation per month (say 30000 hours), they will have a high priority, and their jobs will start immediately or as soon as nodes become available.

Note that the scheduler responds fairly slowly to changes in behaviour. If you change your behaviour (e.g stop running jobs) it will still take days or a few weeks until the full effect of this is seen in your queue waiting times.

Why fairshare scheduling on Triolith?

SNIC allocates a certain number of "core hours" to each project that is allocated time on Triolith. SNIC also wants utilization of its systems to be high.

NSC has decided to use fairshare scheduling on Triolith (and our other academic systems) because we believe it is the best way to share the system "fairly" (guided by how much time a project was allocated by SNIC) in a way that keeps utilization high, while still allowing different research groups to use the system in different ways that suits their workflow.

Idle nodes but your job won't start?

Sometimes you might see many idle nodes but your job still won't start. Here are some common reasons for this:

  1. The system is gathering nodes for a wide job. If a wide job (i.e one that needs many nodes) is the highest priority job, but there are not enough idle nodes to start it on, the scheduler has to wait until enough jobs have ended to free the required number of nodes. I.e if a 128-node job is waiting to be started, you might see 127 idle nodes, and your job would still not be started (unless it was short enough to be run using backfill).

  2. Nodes are reserved. Sometimes compute nodes are reserved for a particular purpose, and not available to normal jobs. You can view all reservations using the command "scontrol show reservations".

  3. A scheduled service stop is coming up. When we need to perform maintenance on the system, we notify users via email and then reserve all compute nodes from a particular date and time. When this time is approaching, jobs will not be started if they cannot finish before the service stop reservation starts. E.g if the service stop starts Monday at 08:00, on Saturday at 08:00, only jobs with a wall time limit of less than 48 hours will be started.

  4. The "MAXPS" limit. In order to prevent a single project from using a very large part of the system, there is a hard limit on how much outstanding work a project may have running at any one time. The amount of outstanding work is defined as the sum of (number_of_cores * remaining runtime) for all running jobs in the project. The amount of outstanding work is limited to the monthly allocation of the project. E.g a project that is allocated 100000 core hours per month can start 130 single-node jobs with 48h walltime (100000 core hours / (48 h * 16 cores/node)). If your project has hit this limit your jobs will be shown by squeue with "Reason" set to "AssociationResourceLimit".

  5. Too-long walltime. If your job requests more than the allowed wall time limit, it will not start. You will also get an email notifying you of this. The job will be shown by squeue as "PartitionTimeLimit".

  6. Your project has expired. If your project's allocation has ended, your running jobs will finish but no new ones will start. The command "projinfo" will show a "Current allocation" of "-" for that project.
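
To check which of these reasons applies to your own pending jobs, squeue can display the scheduler's reason for each of them. A minimal sketch (adjust the format string to taste; see the squeue man page for the available fields):

squeue -u $USER --state=PD -o "%.10i %.10j %.12l %r"

The last column (%r) shows the reason, e.g "Resources", "AssociationResourceLimit" or "PartitionTimeLimit".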

Running jobs

Node sharing is enabled

Node sharing is enabled on Triolith. This means that jobs that request less than a full node (e.g sbatch -n2) might share that node with other jobs.

On Triolith a job will be allocated only the resources that it actually requests. If you request one core you will get one core, etc. On some older systems (e.g Kappa and Matter) your job would always get a complete node, even if you only requested a single CPU core.

If you want to ensure that your job always gets whole nodes, add the flag --exclusive to sbatch/interactive/salloc/srun. You will then get the same behaviour as on Kappa/Matter. Note that we sometimes automatically add --exclusive in order to be compatible with older job scripts, see below.

Why node sharing? Some reasons:

  1. Running single-core jobs becomes easier. There is no longer any need to package those jobs into bigger packages that use a whole node. You can submit each single-core task as a separate job and let the scheduling system figure out which ones run together.

  2. Triolith has 16 cores per node. Some applications cannot utilize 16 cores. Without node sharing, the unused cores would be wasted. One example of this is development and testing jobs. If test jobs only use one core instead of a whole node, many more users can share a single development node.

  3. "Fat" nodes (with lots of RAM) can be shared between jobs. E.g two 64GB jobs can fit into one 128GB "fat" node.

Backwards compatibility

In order to avoid causing problems for users who are used to Neolith-style behaviour (i.e no node sharing), we have added a few extra rules to the scheduler configuration:

Note that these are just fallback mechanisms; we recommend that you always specify exactly what resources you want (e.g -N2 --exclusive).

  • If you specify -N or --nodes but not -n or --ntasks, the system will automatically add --exclusive.

  • If your request requires more than one node (e.g -n22), the system will automatically add --exclusive.
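
For example, with these fallback rules, sbatch -N2 job.sh (where job.sh is a placeholder for your job script) is treated as if you had written sbatch -N2 --exclusive job.sh, since -N was given without -n/--ntasks.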

Examples

(In the examples below we use "interactive", but the same options can be used for sbatch, srun and salloc).

Without sharing nodes

To get "Neolith-style" behaviour, add --exclusive.

Requesting two full nodes (32 cores) for 24 hours: interactive -N2 --exclusive -t 24:00:00

Requesting 32 cores (2 nodes) for 24 hours: interactive -n32 --exclusive -t 24:00:00

Request 2 full nodes, but tell SLURM to only launch one task per node (e.g for a hybrid MPI/OpenMP application): interactive -N2 --exclusive --cpus-per-task=16 -t 24:10:00

To use the development nodes that are reserved for short development and testing jobs, add --reservation=devel and request a walltime of less than one hour.

One development node for interactive use for 10 minutes: interactive -N1 --exclusive --reservation=devel -t 00:10:00

With node sharing

Request 4 cores for 24h, allow node sharing: interactive -n4 -t 24:00:00

One CPU core on a development node for 10 minutes: interactive -n1 --reservation=devel -t 00:10:00

You can also request a certain amount of RAM:

One CPU core and 16GB RAM for 10 minutes on a development node: interactive -n1 --mem=16000 --reservation=devel -t 00:10:00

Using the "fat" nodes

If you need more than 32GiB RAM per node, you can request that your job runs on the "fat" nodes, which have 128GiB RAM, by adding -C fat or by specifying how much RAM you need with --mem=xxxGB.
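
For example, requesting one full "fat" node for 4 hours (a sketch combining options shown elsewhere on this page): interactive -N1 --exclusive -C fat -t 04:00:00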

Please note that this will usually give you longer waiting times in the queue, since there are only 56 "fat" nodes in the system.

Node sharing limitations

Currently there is no quota on the /scratch/local disk, so if you use /scratch/local and want to make sure that no other job can use up all the space there, always use --exclusive.

Monitoring your jobs

The usual SLURM commands are available, e.g sinfo, squeue.

To cancel a queued or running job, use scancel.

The NSC projinfo command is available. It will display your projects and their usage, giving you a rough idea of what priority your jobs will have in the queue. If your project has used a large percentage of its allocation, your priority will be low.

There is also a graphical display of the node utilization, the number of jobs in the queue etc on https://www.nsc.liu.se/status/.

By using suitable options to squeue, you can get an overview of the jobs in the queue. This will give you an idea of how long your jobs might have to wait.

To display quite a lot of detail about each queued job, sorted by priority, run squeue like this: squeue -o "%.12Q %.7i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.6h %R" --state=PD -S "-p" | less

NOTE: "START_TIME" is just an estimate by the scheduler, based on the current jobs in the queue. Due to the fact that any job submitted in the future with a higher priority than your job will skip ahead of you in the queue, the estimated start time is very unreliable if your priority is low. if jobs end ahead of schedule, the opposite can happen - your job might start earlier than the estimated time.

Example:

[kronberg@triolith1 ~]$ squeue -o "%.12Q %.7i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.6h %R" --state=PD -S "-p" | less
    PRIORITY   JOBID     NAME     USER         ACCOUNT    TIMELIMIT          START_TIME NODE CPUS SHARED NODELIST(REASON)
  1000000153   75956 tt4_bdis    raber             nsc   1-00:00:00 2012-10-25T16:08:24    4    4     no (Resources)
     1000132   76091 ScZrNiCo  x_robjo  snic001-11-241     10:00:00 2012-10-25T16:08:24    1    1     no (Resources)
      995105   76217   ISDAC1  x_julsa   snic002-12-16   1-00:00:00 2012-10-25T16:08:24    1    8 unknwn (Resources)
      995067   77248 x_FOTO_x  x_laubr   snic002-12-16      3:00:00 2012-10-25T16:08:24    1    1     no (Resources)
      989612   76208       MG  x_kanil  snic001-12-100      3:00:00 2012-10-25T16:08:24    4   64     no (Resources)
      989611   76211       FG  x_kanil  snic001-12-100   2-22:00:00 2012-10-25T16:08:24    4   64     no (Resources)
      976515   77119 001_2L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77128 001_3L_C  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77133 001_3L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77149 001_4L_c  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77157 001_4L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77193 010_2L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976513   77195 010_4L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976513   77196 010_3L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77197 011_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77198 011_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77199 011_2L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77214 100_4L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77217 100_4L_c  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77219 100_2L_C  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77222 100_2L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77224 100_3L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77227 100_3L_C  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976511   77309 101_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976511   77310 101_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976511   77311 101_2L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77317 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77318 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77319 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77320 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77321 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77322 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77323 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77324 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77325 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77326 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77327 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77328 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77329 111_2L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77330 111_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
[...]