The jobstats program monitors memory and CPU usage on the nodes of a running job, and writes a report to disk when the job finishes. You can use this tool to diagnose the resource usage of your jobs.
Wrap the mpprun command in your batch script with "jobstats start" and "jobstats stop":
    module load jobstats/0.1
    jobstats start
    mpprun /path/to/binary
    jobstats stop
    ...
    jobstats report
You will then get a summary in a file named jobstats.txt. Example:
    CPU usage per node
    -----------------------------
    Minimum amount  :   0.2 %
    First quartile  :  79.3 %
    Median          :  88.6 %
    Third quartile  :  98.6 %
    Average(trimean):  88.8 %
    Maximum amount  : 100.0 %

    Memory usage per node
    -----------------------------
    Minimum amount  :   886 MB
    First quartile  :  9102 MB
    Median          :  9213 MB
    Third quartile  :  9238 MB
    Average(trimean):  9191 MB
    Maximum amount  : 10131 MB
The most interesting data points are usually the average CPU usage and the maximum memory usage. The average CPU usage may stay well below 100% if your job spends a lot of time waiting for disk input/output or network traffic.
The statistics are aggregated as follows. First, the samples from all nodes are averaged into a single time series of per-node averages, and the statistics are then computed over that time series. The minimum and maximum values, however, are not node averages, but the actual extremes observed across the full node set. I chose the trimean as the main statistic, as it is more sensitive to skewed distributions, which is usually what we are looking for.
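The aggregation and the trimean can be sketched in Python (the per-node sample data below is made up for illustration; the actual jobstats implementation may differ in detail):

```python
def trimean(q1, median, q3):
    # Tukey's trimean: the median is weighted twice as heavily as the quartiles.
    return (q1 + 2 * median + q3) / 4

# Hypothetical CPU samples: one row per node, one column per 10-second interval.
node_samples = [
    [90.0, 80.0, 100.0],  # node 0
    [70.0, 90.0,  60.0],  # node 1
]

# Step 1: average across nodes at every time step.
averaged = [sum(col) / len(col) for col in zip(*node_samples)]  # [80.0, 85.0, 80.0]

# Step 2: quartile-based statistics are computed over this averaged series,
# while min/max are taken over every raw sample from every node:
overall_min = min(min(row) for row in node_samples)  # 60.0
overall_max = max(max(row) for row in node_samples)  # 100.0

# The quartiles in the example report reproduce its trimean:
print(trimean(79.3, 88.6, 98.6))   # 88.775, reported as 88.8 %
print(trimean(9102, 9213, 9238))   # 9191.5, reported as 9191 MB
```

Note that the trimean of the example report's quartiles matches the reported averages, which is a handy sanity check when reading jobstats.txt.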
The monitoring is done by the nmon program, which runs locally on each node and writes log data to /scratch/local every 10 seconds. A preliminary test indicates that the runtime overhead, measured on a 32-node VASP job, is about 0.5%. The logs consume roughly 300 KB per hour per node uncompressed, and are currently copied to your current working directory when you stop sampling.
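To get a feel for the log volume, a quick estimate using the ~300 KB per hour per node figure above (the job size here is hypothetical):

```python
kb_per_node_hour = 300             # uncompressed nmon log rate, from the text above
nodes, hours = 32, 24              # hypothetical job: 32 nodes for 24 hours
total_mb = kb_per_node_hour * nodes * hours / 1024
print(f"~{total_mb:.0f} MB of nmon logs")  # ~225 MB
```

So even a large, day-long job produces a few hundred megabytes of logs at most, which is easily absorbed by the working directory.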