Why your job is not starting

Since the job scheduling policy is complex, it’s not easy to figure out why a job has not started, or when it might start. You can always contact NSC Support and ask us why your job won’t start (remember to provide the job ID(s)) and we will explain why.

Here are some reasons why a job might not start when you expect it to.

The system is gathering nodes to start another job. If a wide job (i.e. one that needs many nodes) is the highest-priority job, but there are not enough idle nodes to start it, the scheduler has to wait until enough jobs have ended to free the required number of nodes. For example, if a 128-node job is waiting to be started, you might see 127 idle nodes, and your job will still not start (unless it is short enough to be started using backfill).
Nodes are reserved. Sometimes compute nodes are reserved for a particular purpose (e.g. for maintenance or a course) and are not available to normal jobs. Reserved nodes are shown as “resv” by the sinfo command. You can view all reservations using the command sinfo -Tl.
A scheduled service stop is coming up. When we need to perform maintenance on the system, we notify users via email and then reserve all compute nodes from a particular date and time. As this time approaches, jobs will not be started if they cannot finish before the service stop reservation starts. For example, if the service stop starts Monday at 08:00, then on the preceding Saturday at 08:00 only jobs with a wall time limit of less than 48 hours will be started.
Your project has run much more than its allocated time recently, and is temporarily blocked from starting new jobs. Such jobs will be shown by squeue as “QOSUsageThreshold”.
Your project has hit the “MAXPS” limit, which prevents a single project from using too large a part of the system at any one time. Such jobs will be shown by squeue as “AssocGrpCPURunMinutesLimit”.
Your job has requested too high a Timelimit. If your job requests more than the allowed limit, it will never start. Such jobs will be shown by squeue as “PartitionTimeLimit”.
Your project has ended. If your project’s allocation has ended, your running jobs will finish but no new ones will start. If a project is no longer displayed by the “projinfo” command, it has expired. Such jobs will be listed by squeue as “QOSUsageThreshold”.
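
The reason names in the list above can be looked up mechanically. The following is a minimal Python sketch, not an NSC tool: the function name explain_pending and the "jobid|reason" input layout are assumptions made for this illustration. It expects text in the form printed by squeue -h -u $USER -o "%i|%r" (job ID and reason separated by a bar) and maps each reason mentioned above to a short explanation.

```python
# Explanations are condensed from the list above; the wording is ours.
REASONS = {
    "QOSUsageThreshold": "project has used much more than its allocation recently, or the project has ended",
    "AssocGrpCPURunMinutesLimit": "project has hit the MAXPS limit",
    "PartitionTimeLimit": "job requests more time than allowed and will never start",
}


def explain_pending(squeue_output):
    """Return {job_id: explanation} for text with one "jobid|reason" pair per line.

    The input layout is an assumption: it matches what
    squeue -h -u $USER -o "%i|%r" prints (hypothetical helper, not an NSC tool).
    """
    result = {}
    for line in squeue_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, _, reason = line.partition("|")
        # Reasons not covered in the list above get a generic pointer to support.
        result[job_id] = REASONS.get(reason, "not covered here; ask NSC Support")
    return result
```

Any reason that is not in the table (for example an ordinary priority wait) falls through to the generic message, so the sketch never guesses at an explanation the list above does not give.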

Things that might be a problem but usually are not

`ReqNodeNotAvail`

Technically, this means “when the cluster job scheduler last looked at your job, there were no nodes available in the cluster on which the job could be started”.

There are several possible reasons why this might be the case.

1: The cluster has scheduled maintenance (when no jobs can run) coming up, and there is less time remaining until the start of the maintenance period than the amount of time the job needs. Planned maintenance is announced via email, but you can also check our system status page.

2: The job requests an impossible combination of node type, memory, etc.

3: We have just started a “rolling upgrade” (an automatic reboot of all compute nodes when their current job finishes). For a short while, until the first nodes have completed their reboot (no more than 30 minutes), there are no nodes available to start any job on.

4: At some point in the past, one of the above conditions 1-3 was true, but the (always very busy) job scheduler has not looked at your job again since then, so an outdated Reason is shown.

What you can do:

Check the system status page. If there is downtime scheduled and your job asks for more time (Timelimit, the “-t” option to sbatch) than is available before the start of the downtime, the job will not be started until after the downtime. If you need to run the job before the downtime, you need to ask for less time (“-t” option).
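
To work out how much time you can ask for, subtract the current time from the start of the downtime. The Python sketch below illustrates the arithmetic only; the function names and the 10-minute safety margin are assumptions made for this example, not NSC policy. It returns the longest time limit that still ends before the downtime and formats it as days-hours:minutes:seconds, a form accepted by the “-t” option.

```python
from datetime import datetime, timedelta


def longest_time_limit(now, downtime_start, margin=timedelta(minutes=10)):
    """Longest time limit that ends before the downtime starts.

    The margin is a hypothetical safety buffer chosen for this sketch.
    Returns a zero duration if the downtime is too close.
    """
    remaining = downtime_start - now - margin
    return max(remaining, timedelta(0))


def slurm_time(duration):
    """Format a timedelta as D-HH:MM:SS for use with sbatch -t."""
    total = int(duration.total_seconds())
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: it is Saturday 08:00 and the downtime starts Monday 08:00.
now = datetime(2024, 1, 6, 8, 0)       # a Saturday
downtime = datetime(2024, 1, 8, 8, 0)  # the following Monday
limit = slurm_time(longest_time_limit(now, downtime))
print(limit)  # 1-23:50:00, i.e. 48 hours minus the 10-minute margin
```

With these example dates the job could be submitted with -t 1-23:50:00. Whether a job with such a limit starts in time still depends on priority and on nodes being free.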

If this is a new type of job and you ask for an unusual amount of memory, number of GPUs, etc., the reason might be that no suitable node exists. See this page to see which node types are available.

If you can’t figure out the reason yourself, ask NSC Support (remember to provide the job ID) and we can explain why that particular job is listed as ReqNodeNotAvail.