Why your job is not starting

Because the job scheduling policy is complex, it is not easy to work out why a job has not started or when it might start. You can always contact NSC Support and ask why your job won't start (remember to include the job ID or IDs), and we will explain.

Here are some reasons why a job might not start when you expect it to.

  • The system is gathering nodes to start another job. If a wide job (i.e. one that needs many nodes) is the highest priority job, but there are not enough idle nodes to start it on, the scheduler has to wait until enough jobs have ended to free the required number of nodes. For example, if a 128-node job is waiting to be started, you might see 127 idle nodes and your job will still not start (unless it is short enough to be started using backfill).

  • Nodes are reserved. Sometimes compute nodes are reserved for a particular purpose (e.g. for maintenance or a course) and are not available to normal jobs. Reserved nodes are shown as "resv" by the sinfo command. You can view all reservations using the command sinfo -Tl.

  • A scheduled service stop is coming up. When we need to perform maintenance on the system, we notify users via email and then reserve all compute nodes from a particular date and time. As this time approaches, jobs will not be started if they cannot finish before the service stop reservation starts. For example, if the service stop starts on Monday at 08:00, then on Saturday at 08:00 only jobs with a wall time limit of less than 48 hours will be started.

  • Your project has recently used much more than its allocated time, and is temporarily blocked from starting new jobs. Such jobs will be shown by squeue as "QOSUsageThreshold".

  • Your project has hit the "MAXPS" limit, which prevents a single project from using too large a part of the system at any one time. Such jobs will be shown by squeue as "AssocGrpCPURunMinutesLimit".

  • Your job has requested too long a time limit. If your job requests more than the allowed limit, it will never start. Such jobs will be shown by squeue as "PartitionTimeLimit".

  • Your project has ended. If your project's allocation has ended, your running jobs will finish but no new ones will start. If a project is no longer displayed by the "projinfo" command, it has expired. Such jobs will be listed by squeue as "QOSUsageThreshold".
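
As a rough illustration of the list above, the following Python sketch maps the squeue reason strings quoted on this page to short explanations, and reproduces the service stop arithmetic from the example. It is only a sketch under stated assumptions: the function names explain_reason and fits_before_stop are hypothetical, the table covers only the reasons mentioned here, and the strict "less than" comparison simply follows the wording of the example rather than the scheduler's actual implementation.

```python
from datetime import datetime, timedelta

# Reason strings exactly as quoted on this page. The explanations are
# paraphrases of the bullets above, not output from any tool.
REASONS = {
    "QOSUsageThreshold": "project has recently overused its allocation, or has ended",
    "AssocGrpCPURunMinutesLimit": "project has hit the MAXPS limit",
    "PartitionTimeLimit": "requested time limit is above the allowed limit",
}


def explain_reason(reason: str) -> str:
    """Return a short explanation for a reason string shown by squeue.

    Hypothetical helper: reasons not listed on this page get a generic answer.
    """
    return REASONS.get(reason, "not covered on this page; contact NSC Support")


def fits_before_stop(now: datetime, stop_start: datetime, wall_limit: timedelta) -> bool:
    """Return True if a job started now would end before the service stop.

    Assumption: the comparison is strict, following the "less than 48 hours"
    wording of the example above.
    """
    return now + wall_limit < stop_start


if __name__ == "__main__":
    saturday = datetime(2024, 1, 6, 8, 0)  # a Saturday at 08:00
    monday = datetime(2024, 1, 8, 8, 0)    # the following Monday at 08:00
    print(explain_reason("PartitionTimeLimit"))
    print(fits_before_stop(saturday, monday, timedelta(hours=47)))
    print(fits_before_stop(saturday, monday, timedelta(hours=48)))
```

With the dates used here, a job with a 47 hour wall time limit fits before the service stop, while a job requesting 48 hours does not. Note that jobs can also be held back by the other causes in the list (node gathering, other reservations), which this sketch does not model.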

