NSC Jobstats


Level of support

Tier 3: NSC will not be able to help you much with this program, either because we lack in-house experience or because it is a test installation. In general, installations of this type are untested and will not be updated unless you send a request to NSC.

Please see the page describing our software support categories for more information, or contact support@nsc.liu.se.

The jobstats program monitors memory and CPU usage on the nodes of a running job and writes a report to disk when the job finishes. You can use this tool to diagnose your jobs.

How to run

Wrap the mpprun command in your batch script with jobstats start and jobstats stop:

module load jobstats/0.1
jobstats start
mpprun /path/to/binary
jobstats stop
...
jobstats report
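
For context, a complete batch script might look like the following minimal sketch. NSC systems use Slurm, so #SBATCH directives are assumed; the job name, node count, time limit, and binary path are placeholders, not part of jobstats:

#!/bin/bash
#SBATCH -J myjob            # hypothetical job name
#SBATCH -N 4                # hypothetical node count
#SBATCH -t 01:00:00         # hypothetical time limit

module load jobstats/0.1

jobstats start              # begin sampling on all nodes
mpprun /path/to/binary      # your parallel application
jobstats stop               # stop sampling; logs are copied back

jobstats report             # write the jobstats.txt summary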

You will then get a summary in a jobstats.txt file. Example output:

CPU usage per node
-----------------------------
Minimum amount  : 0.2 %
First quartile  : 79.3 %
Median          : 88.6 %
Third quartile  : 98.6 %
Maximum amount  : 100.0 %
Average(trimean): 88.8 %

Memory usage per node
-----------------------------
Minimum amount  : 886 MB
First quartile  : 9102 MB
Median          : 9213 MB
Third quartile  : 9238 MB
Maximum amount  : 10131 MB
Average(trimean): 9191 MB

The most interesting data points are usually the average CPU usage and the maximum memory usage. The average CPU usage may not reach 100% if your job spends a lot of time waiting for disk input/output or network traffic.

Technical info

The statistics are aggregated in the following way: first, samples from all nodes are averaged into a time series of node-average values, then the statistics are computed over this time series. The minimum and maximum values, however, are not node averages, but the actual min/max observed across the full node set. I chose the trimean as the main statistic, as it is more sensitive to skewed distributions, which is usually what we are looking for.
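
For reference, the trimean is (first quartile + 2 x median + third quartile) / 4. Applied to the CPU example above: (79.3 + 2 x 88.6 + 98.6) / 4 = 88.8 %, which is the reported average.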

The monitoring is done by the nmon program, which runs locally on each node and writes log data to /scratch/local every 10 seconds. A preliminary test indicates that the runtime overhead, as measured on a 32-node VASP job, is 0.5%. The logs consume about 300 KB per hour per node uncompressed, and are currently copied to your current working directory when you stop sampling.
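
If you want to inspect the raw data yourself, nmon logs are plain comma-separated text where each line starts with a record tag (for example MEM for memory snapshots). A hedged sketch, assuming the copied logs keep nmon's usual .nmon file extension; the file name below is a placeholder:

grep '^MEM,' n123_130401_1200.nmon | head    # show the first memory records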