What is Triolith being used for? We have some idea from the computer time applications we get through SNAC, and the support cases we work on also tell us something about what people are doing on Triolith, but the most comprehensive picture is likely to be painted by actually analyzing, in real time, what is running on the compute nodes. During the last service stop, I had the opportunity to examine the low-level logging data of a sizeable set of Triolith compute nodes. I managed to collect a sample of one month of log data from 186 nodes. To get a sense of the scale, it expands to about 2 TB of uncompressed data. Fortunately, when you have access to a supercomputer, you can attack the problem in parallel, so with 186 compute nodes unleashed on the task, it took just 10 minutes.
What you see below is an estimate of the fraction of time that the compute nodes spent running different applications. The time period is August 17th to September 17th, but the relative distribution has been rather stable over the previous months.
|Application||Share of core hours (%)|
Unsurprisingly, we find VASP at the top, accounting for about a third of the computing time. This is the reason why I spend so much time optimizing and investigating VASP: each per cent of performance improvement is worth a lot of core hours in the cluster. We also have a good deal of molecular dynamics jobs (Gromacs, LAMMPS, NAMD, CPMD, ca 18%) and a steady portion of computational fluid dynamics jobs (Fluent + NEK5000 + OpenFOAM, ca 12%). Quantum chemistry programs, such as Gaussian, GAMESS, and Dalton (8%), also catch the eye in the list, as expected. This was a low month for Gaussian (3%), though; its usage is often higher (6-7%), competing for a place in the top 5.
It would be interesting to compare this to other supercomputing sites. When talking to people at the SC conference, I get the impression that VASP is a major workload at basically all academic sites, although perhaps not as much as 30%. In any case, statistics like these are going to be useful for planning application support and for designing the future clusters that we buy.
Below follow some technical observations for people interested in the details behind the numbers above.
The data is based on collectl process data, but at the logging level, you only see the file name of the binary, so you have to identify a certain software package just by the name of its running binaries. This is easy for certain programs, such as VASP, which are always called vasp-something, but more difficult for others. You can, for example, find the notorious a.out in the list above, which could be any kind of code compiled by the users themselves.
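To make the name-matching idea concrete, here is a minimal sketch of how binary names could be mapped to applications and summed into shares of core hours. It is not the script used for the numbers above. Only the vasp- prefix and a.out come from the text; the other patterns, the function names `classify` and `share_of_core_hours`, and the `(binary_name, core_hours)` record layout are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern table. Only the "vasp" prefix and "a.out" are taken
# from the text; the remaining patterns are illustrative guesses at common
# binary names and would need checking against the real logs.
PATTERNS = [
    ("VASP", re.compile(r"^vasp")),
    ("Gromacs", re.compile(r"^(mdrun|gmx)")),
    ("LAMMPS", re.compile(r"^lmp")),
    ("NAMD", re.compile(r"^namd")),
    ("a.out", re.compile(r"^a\.out$")),
]


def classify(binary_name):
    """Return the application label for a binary file name."""
    for app, pattern in PATTERNS:
        if pattern.search(binary_name):
            return app
    return "unknown"


def share_of_core_hours(records):
    """Aggregate (binary_name, core_hours) records into percentages per application."""
    totals = defaultdict(float)
    for binary_name, core_hours in records:
        totals[classify(binary_name)] += core_hours
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {app: 100.0 * hours / grand_total for app, hours in totals.items()}
```

Anything that matches no pattern ends up under "unknown", which makes it easy to see how much of the sample still needs a rule.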
A simple check of the comprehensiveness of the logging is to aggregate all the core hours encountered in the sample and compare the sum with the maximum amount possible (186 nodes running for 24 hours for 30 days). The ratio is around 75-85% with my current approach, which means that something might be missing, as Triolith is almost always utilized to >90%. I suspect it is a combination of the sampling resolution at the collectl level and the fact that I filter out short-running processes (less than 6 minutes) in the data processing stage to reduce noise from background system processes. Certain software packages (like Wien2k and RSPT) run iteratively by launching a new process for each iteration, creating many short-lived processes inside a single job. Many of these are probably not included in the statistics above, which could account for the shortfall.
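The coverage check can be sketched in a few lines. The node count, the 30-day window, and the 6-minute cutoff are taken from the text; the figure of 16 cores per node and the `(runtime_seconds, cores)` record layout are assumptions for this example, as are the function names.

```python
NODES = 186                   # from the text
DAYS = 30                     # from the text
HOURS_PER_DAY = 24
CORES_PER_NODE = 16           # assumption, not stated in the text
MIN_RUNTIME_SECONDS = 6 * 60  # processes shorter than this are filtered out


def max_core_hours(nodes=NODES, cores_per_node=CORES_PER_NODE, days=DAYS):
    """Upper bound on core hours if every core were busy for the whole period."""
    return nodes * cores_per_node * HOURS_PER_DAY * days


def coverage(processes, min_runtime=MIN_RUNTIME_SECONDS):
    """Fraction of the maximum core hours accounted for by the kept processes.

    processes: iterable of (runtime_seconds, cores) tuples (hypothetical layout).
    """
    kept_core_hours = sum(
        runtime * cores for runtime, cores in processes if runtime >= min_runtime
    ) / 3600.0
    return kept_core_hours / max_core_hours()
```

Under these assumptions the ceiling is 186 × 16 × 24 × 30 = 2,142,720 core hours. Running `coverage` once with the 6-minute cutoff and once with `min_runtime=0` would show how much of the shortfall comes from the filter rather than from the sampling resolution.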