Monitoring your jobs

Monitoring your jobs and the job queue

The usual Slurm commands are available, e.g. sinfo and squeue. To cancel a queued or running job, use scancel.

The command lastjobs shows your 10 most recently ended jobs from the past 30 days. Run it with ‘-h’ for help. More information is available from the Slurm sacct database (run man sacct for help).

There is also a log of all completed or cancelled jobs in /var/log/slurm/accounting. Sometimes this is easier to use than sacct, but it contains less information.

The NSC projinfo command is available. It will display your projects and their usage, giving you a rough idea of what priority your jobs will have in the queue. If your project has used a large percentage of its allocation recently, your priority will be low.

Please note that projinfo shows how much CPU time has been used in a certain time window (by default 30 days), but Slurm does not use a 30-day window to determine a project’s priority. Instead, it uses a metric based on a half-life formula that favors the most recent usage, i.e. jobs you ran yesterday affect your priority more than jobs you ran two weeks ago. The actual number used by Slurm to determine fair-share priority is “LevelFS” (which can be seen using the command sshare -l -A <your project>).
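To illustrate why recent usage weighs more under a half-life formula, here is a minimal sketch of exponential decay. It is not Slurm's actual implementation: the 7-day half-life, the function names, and the core-hour figures are assumptions chosen purely for illustration, and the half-life configured on the cluster may differ.

```python
def decay_weight(age_days: float, half_life_days: float) -> float:
    """Weight given to usage that is `age_days` old under half-life decay."""
    return 0.5 ** (age_days / half_life_days)


def decayed_usage(usage_by_age, half_life_days: float) -> float:
    """Sum (age_days, core_hours) pairs, discounting each by its age."""
    return sum(
        hours * decay_weight(age, half_life_days)
        for age, hours in usage_by_age
    )


# Assumed half-life, for illustration only (not taken from the cluster config).
HALF_LIFE_DAYS = 7.0

# The same hypothetical 1000 core-hours, used yesterday vs. two weeks ago.
recent = decayed_usage([(1, 1000.0)], HALF_LIFE_DAYS)
older = decayed_usage([(14, 1000.0)], HALF_LIFE_DAYS)

print(f"1000 core-hours used yesterday count as {recent:.0f}")
print(f"1000 core-hours used two weeks ago count as {older:.0f}")
```

With these assumed numbers, usage from two weeks ago has been halved twice, so it counts for a quarter of its original value, while usage from yesterday has hardly decayed at all.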

You can also use projinfo to see how much extra computing time you have been able to use in the form of “bonus” jobs: projinfo --qos=bonus.

A simple graphical display of the node utilization, the number of jobs in the queue, etc. is available at https://www.nsc.liu.se/status/. A graphical display of the node status and the queue is available at https://www.nsc.liu.se/cgi-bin/tetralithstatus.

By using suitable options to squeue, you can get an overview of the jobs in the queue. This will give you an idea of how long your jobs might have to wait.

To display quite a lot of detail about each queued job, sorted by priority, run squeue like this: squeue -o "%.12Q %.10i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.12v %r" --state=PD -S "-p" | less

Some hints on how to read the output:

“START_TIME” is just an estimate by the scheduler, based on the current jobs in the queue. Since jobs can end earlier than expected, and new high-priority jobs can be submitted at any time, the estimated start time is very unreliable except for the highest-priority jobs.
Jobs that request a specific RESERVATION can be ignored, since they won’t affect the start time of normal jobs.
Jobs with REASON=Dependency can be ignored, since they are waiting for another job to finish before they can start.
Jobs with REASON=AssocGrpCPURunMinutesLimit are blocked from starting and can thus be ignored.
Jobs with REASON=QOSUsageThreshold are blocked from starting (but may start as bonus jobs if shorter than 24h).

Looking at the number of unblocked jobs ahead of yours in the queue, and how many nodes they request, can give you some idea of when your job might start. But keep in mind that the priority of queued jobs changes all the time (if a project has many running jobs, its priority will drop rapidly, and that will place its waiting jobs further down the queue).
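The counting described above can be sketched in Python. This is only an illustration under stated assumptions: it expects text produced by something like squeue --state=PD -S "-p" -h -o "%i|%u|%D|%v|%r" (priority-sorted, no header, fields separated by "|"), the function name jobs_ahead and the sample data are made up, and the strings that mean "no reservation" are a guess that may need adjusting for your Slurm version.

```python
# Reasons from the hints above that mean a job will not start normally.
# QOSUsageThreshold jobs may still start as bonus jobs, so drop that entry
# if you want to count them.
BLOCKED_REASONS = {"Dependency", "AssocGrpCPURunMinutesLimit", "QOSUsageThreshold"}

# Assumption: how squeue prints "no reservation" in the %v column.
NO_RESERVATION = {"", "(null)", "N/A"}


def jobs_ahead(squeue_text, my_job_id, blocked_reasons=BLOCKED_REASONS):
    """Return (job_count, node_count) for unblocked jobs listed before my_job_id."""
    count = 0
    nodes = 0
    for line in squeue_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split into at most five fields so a reason containing "|" stays intact.
        job_id, _user, num_nodes, reservation, reason = (
            field.strip() for field in line.split("|", 4)
        )
        if job_id == my_job_id:
            return count, nodes
        if reservation not in NO_RESERVATION:
            continue  # jobs in a reservation do not delay normal jobs
        if reason in blocked_reasons:
            continue  # blocked jobs cannot start ahead of yours
        count += 1
        nodes += int(num_nodes)
    raise ValueError(f"job {my_job_id} not found in the pending queue")


# Made-up sample output, highest priority first.
SAMPLE = """\
1001|alice|16|(null)|Resources
1002|bob|8|(null)|Dependency
1003|carol|2|maint|Reservation
1004|dave|4|(null)|Priority
1005|erin|1|(null)|AssocGrpCPURunMinutesLimit
1006|me|2|(null)|Priority
"""

ahead, nodes_requested = jobs_ahead(SAMPLE, "1006")
print(f"{ahead} unblocked jobs ahead, requesting {nodes_requested} nodes")
```

Because priorities change continuously, the result is a snapshot rather than a prediction.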

It is possible to only display the information for your current jobs (or a specific user) by using: squeue -u <username>

To display the job ID and the working directory where a current job was submitted: squeue -u <username> -o "%A %Z"

To get detailed information for a job: scontrol show job <jobid>
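The output of scontrol show job consists of whitespace-separated Key=Value pairs, so it is easy to pick out individual fields in a script. The following is a rough sketch, not an official parser: the helper name parse_scontrol and the sample text are hypothetical, and values that contain spaces (for example a working directory with a space in its name) will be cut off at the first space.

```python
def parse_scontrol(text):
    """Parse 'Key=Value' tokens from scontrol show job output into a dict.

    Simplification: tokens are split on whitespace, so a value that
    contains spaces is truncated at the first space.
    """
    info = {}
    for token in text.split():
        if "=" in token:
            # Split on the first "=" only; values such as TRES lists
            # contain further "=" characters.
            key, value = token.split("=", 1)
            info[key] = value
    return info


# Made-up sample resembling the layout of scontrol show job output.
SAMPLE_JOB = """\
JobId=12345 JobName=relax
   UserId=x_abcde(1234) GroupId=x_abcde(1234)
   JobState=PENDING Reason=Priority Dependency=(null)
   NumNodes=4 NumCPUs=128 NumTasks=128 CPUs/Task=1
   WorkDir=/home/x_abcde/run1
"""

job = parse_scontrol(SAMPLE_JOB)
print(job["JobState"], job["Reason"], job["NumNodes"], job["WorkDir"])
```

In a real script the text would come from running scontrol show job <jobid>, for example via subprocess.run with capture_output=True and text=True.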
