A common cause for a job to fail is exhausting the memory on one or more compute nodes in the job.
Some signs of hitting an out-of-memory (OOM) condition are:
The job stops prematurely (or immediately). This happens if the application dies by itself when running out of memory, or if the Linux kernel kills the process. Often you will find signs of this in the application output (usually in the slurm-JOBID.out file if you have not redirected it elsewhere). The message "slurmstepd: Exceeded step memory limit at some point" is a typical sign of having run out of memory.
The job stops making progress, but continues running until it hits the walltime limit.
The queue system stops a running job. This happens when one or more of the nodes become so slow due to the lack of available memory that the scheduler takes them offline and kills the whole job. In this case, you might get an email from NSC informing you of what happened.
You can try following the suggestions below, or contact NSC Support for help in determining if your job ran out of memory. Please remember to include the job ID. We can check system logs on the compute nodes that are not available to you, and these logs will usually tell us if the node ran out of memory or not.
These are some of your options:
Check actual memory usage while the job is running. Use jobsh (see this page for details) to log in to one or more compute nodes in your job; you can then monitor memory usage in real time using e.g. the top command. Also see the jobstats tool at the bottom of this page.
Use the seff command when the job has ended. This will show you (among other things) how much memory was used. See below for an example.
Use nodes with more memory. Most clusters have "fat" nodes (see the hardware information) with more memory than normal nodes. You can use the "--mem" option to sbatch/interactive to request nodes with more memory. Please note, though, that most clusters only have a small number of fat nodes, so your job might need to wait for longer than usual in the queue when you request fat nodes.
Use fewer cores per compute node. If you run an MPI application, you can usually try running fewer ranks per node, and either run on more nodes or accept a longer runtime. On Tetralith, you could for example run 16 MPI ranks per node instead of 32. This gives each rank twice the effective memory at the expense of a longer runtime. Many programs run a lot faster than you would expect with just half the number of cores, so this can be economical compared to e.g. the next option.
Use more compute nodes. Some MPI-parallel programs will distribute their data (more or less) evenly over all compute nodes. In such a case, you might be able to fit your calculation within the available node memory by running on more nodes. The trade-off is that your program might not scale very well to many compute nodes, so you will run with low efficiency and use up to twice the amount of core hours for the job.
Limit application memory. If your application has a configuration option for how much memory to use per node, try lowering that. Gaussian, for example, has such a switch. Remember that even if the compute nodes have e.g. 32 GiB RAM, you cannot use all of that for your application; some room must be left for the operating system, disk cache etc. A value of around 30 GiB for a 32 GiB node is usually OK. Some programs are also notoriously bad at estimating memory use, so you might need to set a large safety margin.
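To make the effect of running fewer ranks per node concrete, here is a minimal sketch of the arithmetic. It assumes a node where about 30 GiB is usable by the application (the 32 GiB example above) and that memory is shared evenly between ranks; mem_per_rank_mib is a hypothetical helper, not an NSC tool. In a Slurm batch script the number of ranks per node is normally requested with the standard --ntasks-per-node option (e.g. --ntasks-per-node=16), but check the cluster documentation for how your launcher treats it.

```shell
#!/bin/bash
# Sketch only: the 30 GiB usable figure follows the 32 GiB node example in the
# text, and mem_per_rank_mib is a hypothetical helper, not an NSC tool.

# Print the memory available to each MPI rank in MiB (integer division),
# assuming the usable node memory is split evenly between the ranks.
mem_per_rank_mib() {
    local usable_gib=$1 ranks_per_node=$2
    echo $(( usable_gib * 1024 / ranks_per_node ))
}

mem_per_rank_mib 30 32   # all 32 cores in use: prints 960
mem_per_rank_mib 30 16   # half the ranks per node: prints 1920
```

Halving the number of ranks doubles the per-rank figure, which is the trade-off described in the "fewer cores per compute node" option.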
The slurm-[jobid].out file is created in the directory from which you submitted the job. Unless you have redirected the output somewhere else, this is where the output from your job script will end up.
Any log files written by your application.
The NSC accounting logs in /var/log/slurm/accounting/YYYY-MM-DD on the login node. All jobs that have ended are listed there. In this file you can look at the "jobstate". Some common states are:
COMPLETED: the job script exited normally (i.e. with exit status 0). This does not necessarily mean that the job was successful, only that the job script did not return an error to the scheduler.
FAILED: the job script returned a non-zero exit status. This usually means that something went wrong, but it does not necessarily mean that the application itself failed; it might be e.g. a failed "cp" command that was run as the last command in the job script.
CANCELLED: you (or a system administrator) cancelled the job (using scancel).
NODE_FAIL: one or more of the compute nodes in the job failed in such a way that the scheduling system decided to take it offline.
TIMEOUT: the job ran until it had used all the walltime requested by it, and was terminated by the scheduler.
NODE_FAIL is commonly associated with out of memory conditions.
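A minimal sketch of looking up a job in one of these accounting files is shown here. The layout of the file is not described in this text, so the sketch only prints whole matching lines and leaves reading the "jobstate" field to you; find_job_records is a hypothetical helper name, and the date part of the path must be replaced with the day the job ended.

```shell
#!/bin/bash
# Sketch only: find_job_records is a hypothetical helper, and the layout of the
# accounting file is an unknown here, so this just prints whole lines.

# Print every line in an accounting log that mentions the given job ID.
find_job_records() {
    local jobid=$1 logfile=$2
    # -w matches the ID as a whole word, so 1234 does not also match 12345.
    grep -w -- "$jobid" "$logfile"
}

# Example call (replace YYYY-MM-DD with the date the job ended):
# find_job_records 12345678 /var/log/slurm/accounting/YYYY-MM-DD
```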
On some NSC systems (e.g. Tetralith and Sigma) you can use "jobstats" to monitor the CPU and memory use of a job.
Load the module and wrap the mpprun command in your batch script between "jobstats start" and "jobstats stop". Example:
    module load jobstats/0.5
    jobstats start
    mpprun /path/to/binary
    jobstats stop
    ...
    jobstats report
You will then get a summary in the form of a jobstats.txt file. Example:
    CPU usage per node
    -----------------------------
    Minimum amount  :   0.2 %
    First quartile  :  79.3 %
    Median          :  88.6 %
    Third quartile  :  98.6 %
    Maximum amount  : 100.0 %
    Average(trimean):  88.8 %

    Memory usage per node
    -----------------------------
    Minimum amount  :   886 MB
    First quartile  :  9102 MB
    Median          :  9213 MB
    Third quartile  :  9238 MB
    Maximum amount  : 10131 MB
    Average(trimean):  9191 MB
The most interesting data points are usually the average CPU usage and the maximum amount of memory. Average CPU may not always be 100% if your job spends a lot of time waiting for disk input/output or network traffic.
The statistics are aggregated in the following way. First, samples from all nodes are averaged into a time series of average node values; statistics over time are then computed from that series. The minimum and maximum values, however, are not node averages, but the actual min/max observed in the full node set. I chose the trimean as the main statistic, as it is more sensitive to biased distributions, which is usually what we are looking for.
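For reference, the trimean can be recomputed from the quartiles in the report. The sketch below uses the standard Tukey definition, (Q1 + 2 × median + Q3) / 4. That jobstats uses exactly this formula is an assumption, although the sample values above are consistent with it. The function name trimean is a hypothetical helper, not part of jobstats.

```shell
#!/bin/bash
# Sketch only: trimean is a hypothetical helper, not part of jobstats.

# Tukey's trimean: (Q1 + 2*median + Q3) / 4, printed with one decimal.
trimean() {
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v q1="$1" -v med="$2" -v q3="$3" \
        'BEGIN { printf "%.1f\n", (q1 + 2 * med + q3) / 4 }'
}

trimean 79.3 88.6 98.6    # CPU quartiles from the report: prints 88.8
trimean 9102 9213 9238    # memory quartiles (MB): prints 9191.5
```

The CPU value matches the report exactly, and the memory value matches the reported 9191 MB up to rounding.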
The monitoring is done by the nmon program, which runs locally on each node and writes log data to /scratch/local every 10 seconds. A preliminary test indicates that the runtime overhead, as measured on a 32-node VASP job, is 0.5%. The logs consume about 300 KB per hour per node uncompressed, and are currently copied to your current working directory when you stop sampling.
The seff command displays data that the resource manager (Slurm) collected while the job was running. Please note that the data is sampled at regular intervals and might miss short peaks in memory usage.
If your job failed and seff shows memory utilization close to 100%, you can assume that the job ran out of memory. If you need to know for sure, contact NSC Support and ask us to analyze the logs for the job in question.
Example (in this case job 12345678 ran out of memory and was killed, and seff shows very close to 100% memory utilization):
    $ seff 12345678
    Job ID: 12345678
    Cluster: tetralith
    User/Group: x_makro/x_makro
    State: FAILED (exit code 9)
    Nodes: 1
    Cores per node: 32
    CPU Utilized: 00:23:37
    CPU Efficiency: 11.84% of 03:19:28 core-walltime
    Job Wall-clock time: 00:06:14
    Memory Utilized: 88.20 GB
    Memory Efficiency: 97.19% of 90.75 GB
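If you check many jobs, the "Memory Efficiency" line can be picked out of the seff output with a short script. The following is a minimal sketch that assumes the output format shown in the example above. The helper names seff_mem_efficiency and likely_oom are hypothetical, and the 95% threshold is an arbitrary illustration of "close to 100%", not an NSC recommendation.

```shell
#!/bin/bash
# Sketch only: both helpers are hypothetical and assume the seff output format
# shown in the example above; the default 95 % threshold is an arbitrary choice.

# Read seff output on stdin and print the memory efficiency as a bare number.
seff_mem_efficiency() {
    LC_ALL=C awk -F': ' '/^Memory Efficiency:/ { sub(/%.*/, "", $2); print $2 }'
}

# Read seff output on stdin; succeed if memory efficiency >= threshold (percent).
likely_oom() {
    local threshold=${1:-95}
    local eff
    eff=$(seff_mem_efficiency)
    [ -n "$eff" ] && LC_ALL=C awk -v e="$eff" -v t="$threshold" \
        'BEGIN { exit !(e >= t) }'
}

# Example use on a cluster where seff is available:
# seff 12345678 | likely_oom 95 && echo "memory use was close to the limit"
```

Because seff data is sampled, a value below the threshold does not rule out a short memory peak, so treat the result only as a hint.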