A common cause for a job to fail is exhausting the memory on one or more compute nodes in the job.
Some signs of hitting an out-of-memory (OOM) condition are:
The job stops prematurely (or immediately). This happens if the application dies by itself when running out of memory, or if the Linux kernel kills the process. Often you will find signs of this in the application output (usually in the slurm-JOBID.out file if you have not redirected it elsewhere).
The job stops making progress, but continues running until it hits the walltime limit.
The queue system stops a running job. This happens when one or more of the nodes become so slow due to the lack of available memory that the scheduler takes them offline and kills the whole job. In this case, you might get an email from NSC informing you of what happened.
You can try following the suggestions below, or contact NSC support for help in determining if your job ran out of memory. Please remember to include the job ID. We can check system logs on the compute nodes that are not available to you, and these logs will usually tell us whether the node ran out of memory.
These are some of your options:
Check actual memory usage while the job is running. Use jobsh (see this page for details) to log in to one or more compute nodes in your job, then monitor memory usage in real time using e.g. the top command. Also see the jobstats tool at the bottom of this page.
Use nodes with more memory. Most clusters have "fat" nodes (see the hardware information) with more memory than normal nodes. You can use the "--mem" option to sbatch/interactive to request nodes with more memory. Please note, though, that most clusters only have a small number of fat nodes, so your job might have to wait longer than usual in the queue when you request them.
Use fewer cores per compute node. If you run an MPI application, you can usually try running fewer ranks per node, and either run on more nodes or accept a longer runtime. On Triolith, you could try e.g.:
which will run 8 MPI ranks per node instead of 16. This gives each rank twice the effective memory at the expense of a longer runtime. Many programs run a lot faster than you would expect with just half the number of cores, so this can be economical compared to e.g. the next option.
Use more compute nodes. Some MPI-parallel programs distribute their data (more or less) evenly over all compute nodes. In such a case, you might be able to fit your calculation within the available node memory by running on more nodes. The trade-off is that your program might not scale very well to many compute nodes, so you will run with low efficiency and use up to twice the amount of core hours for the job.
Limit application memory. If your application has a configuration option for how much memory to use per node, try lowering that. Gaussian, for example, has such a switch. Remember that even if a compute node has e.g. 32 GiB RAM, you cannot use all of that for your application; some room must be left for the operating system, disk cache etc. A value of around 30 GiB for a 32 GiB node is usually OK. Some programs are also notoriously bad at estimating memory use, so you might need to set a large safety margin.
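The arithmetic behind the options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the 2 GiB reserve comes from the "around 30 GiB for a 32 GiB node" rule of thumb, and memory is assumed to be shared evenly between the MPI ranks on a node, which real applications only approximate.

```python
def usable_memory_gib(node_ram_gib: float, reserved_gib: float = 2.0) -> float:
    """Memory left for the application after reserving room for the OS and disk cache.

    The default reserve of 2 GiB is an assumption based on the 30-of-32 GiB rule of thumb.
    """
    if reserved_gib >= node_ram_gib:
        raise ValueError("reserved memory must be smaller than node RAM")
    return node_ram_gib - reserved_gib


def memory_per_rank_gib(node_ram_gib: float, ranks_per_node: int,
                        reserved_gib: float = 2.0) -> float:
    """Even share of the usable node memory for each MPI rank on the node."""
    if ranks_per_node < 1:
        raise ValueError("ranks_per_node must be at least 1")
    return usable_memory_gib(node_ram_gib, reserved_gib) / ranks_per_node


if __name__ == "__main__":
    # Compare a fully packed 32 GiB node with one running half as many ranks.
    for ranks in (16, 8):
        share = memory_per_rank_gib(32, ranks)
        print(f"{ranks} ranks per node: about {share} GiB per rank")
```

With these assumptions, a 32 GiB node gives each of 16 ranks 1.875 GiB, and each of 8 ranks 3.75 GiB. If your application needs more than that per rank, fat nodes or more compute nodes are the remaining options.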
The slurm-[jobid].out file, which is created in the directory from which you submitted the job. Unless you have redirected the output somewhere else, this is where the output from your job script will end up.
Any log files written by your application.
The NSC accounting logs in /var/log/slurm/accounting/YYYY-MM-DD on the login node. All jobs that have ended are listed there. In this file you can look at the "jobstate". Some common states are:
COMPLETED: the job script exited normally (i.e. with exit status == 0). This does not necessarily mean that the job was successful, only that the job script did not return an error to the scheduler.
FAILED: the job script returned a non-zero exit status. This usually means that something went wrong, but it does not necessarily mean that the application itself failed. It might be e.g. a failed "cp" command that was run as the last command in the job script.
CANCELLED: you (or a system administrator) cancelled the job (using scancel).
NODE_FAIL: one or more of the compute nodes in the job failed in such a way that the scheduling system decided to take it offline.
TIMEOUT: the job ran until it had used all the walltime requested by it, and was terminated by the scheduler.
NODE_FAIL is commonly associated with out-of-memory conditions.
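As a rough aid for reading the "jobstate" field, the states above can be put into a small lookup table. The Python sketch below is illustrative only: the function name and the three-level hint are hypothetical, and treating FAILED and TIMEOUT as "possible" out-of-memory cases is an assumption drawn from the symptoms listed earlier (a job that stops prematurely, or one that stops making progress until the walltime limit). It does not parse the accounting log itself, since the file format is not described here.

```python
# Job states and their meaning, taken from the list above.
JOB_STATES = {
    "COMPLETED": "job script exited with status 0",
    "FAILED": "job script returned a non-zero exit status",
    "CANCELLED": "job was cancelled with scancel",
    "NODE_FAIL": "a compute node failed and was taken offline",
    "TIMEOUT": "job used all requested walltime and was terminated",
}

# Assumed grouping: how strongly each state suggests an out-of-memory problem.
OOM_HINT = {
    "NODE_FAIL": "common",   # commonly associated with out-of-memory conditions
    "FAILED": "possible",    # the application may have died when memory ran out
    "TIMEOUT": "possible",   # the job may have stalled on a node short of memory
}


def oom_hint(jobstate: str) -> str:
    """Return 'common', 'possible' or 'not indicated' for a jobstate string."""
    key = jobstate.strip().upper()
    if key not in JOB_STATES:
        raise ValueError(f"state not covered here: {jobstate!r}")
    return OOM_HINT.get(key, "not indicated")


if __name__ == "__main__":
    for state, meaning in JOB_STATES.items():
        print(f"{state}: {meaning} (out of memory: {oom_hint(state)})")
```

A hint of "possible" or "common" is only a reason to check the job output and memory usage more closely; the state alone does not prove that the job ran out of memory.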
On Triolith you can add the jobstats tool to your job script for fairly lightweight monitoring of the CPU and memory usage of your job.
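To illustrate what lightweight memory monitoring on a compute node involves, here is a minimal Python sketch that reads the Linux /proc/meminfo file. It is not jobstats and makes no claim about how jobstats works; the function names are hypothetical. It assumes a Linux node, and it falls back to MemFree + Buffers + Cached on older kernels that do not report MemAvailable.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def used_fraction(meminfo: dict) -> float:
    """Fraction of node memory that is not available to new allocations."""
    total = meminfo["MemTotal"]
    if "MemAvailable" in meminfo:
        available = meminfo["MemAvailable"]
    else:
        # Older kernels: approximate available memory from free memory and caches.
        available = (meminfo["MemFree"]
                     + meminfo.get("Buffers", 0)
                     + meminfo.get("Cached", 0))
    return (total - available) / total
```

On a compute node you could call used_fraction(parse_meminfo(open("/proc/meminfo").read())) at intervals from within your job. A value that climbs towards 1.0 shortly before the job dies or stalls is a strong sign that the node is running out of memory.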