Peter Larsson

Electronic Structure Workshop in Linköping

2017-02-24T00:00:00+01:00

There will be an workshop in Linköping on March 29th focusing on electronic structure calculations and software. Application experts from SNIC centers across Sweden will present software packages and give some best practice advice on how to run efficiently on the available supercomputing resources.

More information is available on NSC’s home page. Please register if you intend to show up, as it helps us in planning the event.

ISC'16 Conference

2016-06-16T00:00:00+02:00

Next week, I will be at the ISC High Performance conference in Frankfurt. I will also be at HPE’s CAST. I am always interested in meeting other people working with ab initio/electronic structure software. Send me an email if you want to meet up.

Paper highlight: Reproducibility of DFT calculations of solids

2016-06-15T00:00:00+02:00

In a previous post “How accurate are different DFT codes?” I looked at the preliminary outcome of a study comparing the accuracy and precision of various DFT codes using a single measure called the “delta gauge”. A comprehensive study using this method has now been published in the journal Science. It features a long author list, as the work is a joint effort by many of the big research groups that employs or develops DFT.

The main purpose of the paper is to show that today’s DFT calculations are precise and also reasonable accurate. Given, sufficient (sometimes very high) convergence settings, DFT calculations performed using different software implementations do in fact arrive at the same answer. There is a certain error margin, but it is shown to be comparable to experimental uncertainties.

My observations from a quick read of the paper are:

It confirms my old hypothesis that high-quality PAW calculations are as precise as all-electron calculations in practice. The delta gauge for the best all-electron codes (LAPW methods) are 0.5-0.6, which is very close to what you can achieve with VASP and Abinit using the most recent PAW libraires.
A very important practical aspect that is not investigated in this study would be the computer resources required to arrive at the results. Even rough estimates would have been interesting to see, both from a user perspective and a technical HPC perspective.
Among the all-electron codes, RSPt, which is based on the FP-LMTO method does not fare as well. I asked one of the authors who ran the RSPt calculations, Torbjörn Björkman. He believed that the results could be improved to some degree. These were one of the first sets that were run, and the resulting delta values were deemed sufficiently good to not warrant further improvement, when compared with the preliminary Wien2K results available then. He believed that the RSPt results could be probably be improved further with the hindsight of the more recent results, but there were still some outliers in the data set which would prevent the delta vs Wien2k to reach zero.

I have a few reservations, though, whether this study finally settles the debate for reproducibility of modern computational material science:

The paper shows what is possible in the hands of an expert user or a developer of the software. That represents a best case scenario, because in everyday scientific practice, calculations are often produced by either relatively uneperienced users such as PhD students, or in a completely unsupervised process by a computer algorithm that itself runs and analyzes the calculations. In my opinion, the ultimate goal of reproducibility should be to arrive at a simulation process that can be automated and specified to such a level that a computer program can perform the calculations with the same accuracy as an expert, but I think we are not there yet, perhaps not until we see strong artificial intelligence.
The numerical settings used in the different programs are not shown in the paper, but is available in the supplementary information. They are in general very high and not representative of many research calculations. I think it cannot be assumed a priori that the predictions of all software package degrade equally gracefully when the settings are decreased. I believe that would be an interesting topic for further investigation.

Running VASP on Nvidia GPUs

2015-11-16T00:00:00+01:00

Today at Supercomputing 15, the coming release of an official GPU version of VASP 5.4.1 was announced by Nvidia. Both standard DFT and hybrid DFT with Hartree-Fock is GPU accelerated, but there is no GPU-support for GW calculations yet. I have played around with the beta release of GPU-VASP and the speed-up I see when adding 2 K40 GPUs to a dual-socket Sandy Bridge compute node varies between 1.4x to 8.0x, depending on the cell size and the choice of algorithm.

For a long time, VASP was shown in Nvidia’s marketing information as already ported to GPU, despite not being generally available. In fact, I often got questions about it, but had to explain to our users that there were several independently developed prototype versions of VASP with code that had not yet been accepted into the main VASP codebase. But now, an official GPU version is finally happening, and the goal is that it will be generally available to users by the end of the year. No information is available on the VASP home page yet, but I assume that more information will come eventually.

The GPU version is a collaborative effort involving people from several research groups and companies. The list of contributors includes University of Wien, University of Chicago, ENS-Lyon, IFPEN, CMU, RWTH Aachen, ORNL, Materials Design, URCA and NVIDIA. The three key papers, that should be cited when using the GPU version are:

Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)
VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)
Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)

The history of GPU-VASP, as I have understood it, is that after the initial porting work by research groups mentioned above, Nvidia got involved and worked on optimizing the GPU parts, which eventually lead to the acceptance of the GPU code into the main codebase by the VASP developers and subsequently to the launch of the beta testing program coordinated by Nvidia. It is encouraging to see the involvement by Nvidia and I think this is an excellent example of community outreach and industry-academia collaboration. I hope we will see more of this in the future with involvement from other companies. Electronic structure software is, after all, a major workload at many HPC centers. For Intel’s Xeon Phi, VASP is listed as a “work in progress” with involvement from Zuse Institute in Berlin, so we will likely see further vectorization and OpenMP parallelization aimed at manycore architectures as well. I think the fact the GPU version performs as well as it does (see more below) is an indication that there is much potential out there for the CPU version too, in terms of optimization.

Beta-testing GPU-VASP

I have been part of the beta testing program for GPU-VASP. The analysis in this post will approach the subject from two perspectives. The first one is the buyer’s perspective. Does it make economical sense to start looking at GPUs for running VASP? This is the question that we at NSC face as an academic HPC center when we are buying systems for our users. The second perspective is the experience from the user perspective. Does it work? How does it differ from the regular VASP version?

The short answers for the impatient are: 1) possibly, the price/performance might be there given aggressive GPU pricing 2) yes, for a typical DFT calculation, you only need to adjust some parameters in the INCAR file, most importantly NSIM, and then launch VASP as usual.

Hardware setup

The tests were performed on the upcoming GPU partition of NSC’s Triolith system. The compute nodes there have dual-socket Intel Xeon E5-2660 “Sandy Bridge” processors, 64 GB of memory and Nvidia K20 or K40 GPUs. The main difference between the K20 and the K40 is the amount of memory on the card: the K20 has 6 GB and the K40 has 12 GB. VASP uses quite a lot of GPU memory, so with only 6 GB of memory you might see some limitations. For example, the GaAsBi 256 atom test job below used up about 9300 MB per card when running on a single node. It was possible to run smaller jobs on the K20s, though.

I ran most of the tests with the default GPU clock speed of 745 Mhz, but out of curiosity, I also tried to clock up the cards to 875 Mhz with the nvidia-smi utility

$ nvidia-smi -ac 3004,875

It didn’t seem to cause any problems with cooling or stability, and produced a nice 10 % gain in speed. The GPUs are rated for up to 235 Watt, but I never saw them use more than ca 180 W on average during VASP jobs.

Compiling the GPU version

A new build system was introduced with VASP 5.4. When the makefile.include is set up, you can compile the different versions of VASP (regular, gamma-point only, noncollinear) by giving arguments to the make command, e.g.

make std

With GPU-VASP, there is new kind of VASP executable defined in the makefile, called gpu, so the command to compile the GPU version is simply

make gpu

I would recommend sticking to compilation in serial mode. I tried using my old trick of running make -j4 repeatedly to resolve all dependencies, but the new build process does not work as well in parallel, you can get errors during the rsync stages when files are copied between directories.

To compile any program with CUDA, such as GPU-VASP, you need to have the CUDA developer tools installed. They are typically found in /usr/local/cuda-{version} and that is also where you will find them on the Triolith compute nodes. If there are no module files, you can add the relevant directory to your PATH yourself. In this case, CUDA version 7.5:

export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH

I tested with CUDA 6.5 in the beginning, and that seemed to work too, but VASP ran significantly faster when I reran the benchmarks with CUDA 7.5 later. Once you have CUDA set up, the critical command to look for is the Nvidia CUDA compiler, called nvcc. The makefiles will use that program to compile and link the CUDA kernels.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

There is a configuration file for the GPU version in the arch/ directory called makefile.include.linux_intel_cuda which you can use a starting point for configuration. I did not have to make much changes to compile on Triolith. In addition to the standard things like compiler names and flags, one should point out the path to the CUDA tools.

CUDA_ROOT  := /usr/local/cuda-7.5

Running

When you log in to a GPU compute node, it is not obvious where to “find” the GPUs and how many there are. There is a utility called nvidia-smi which can be used to inspect the state of the GPUs. Above, I used it for overclocking, but you can also do other things, such as listing the GPUs attached to the system:

[pla@n1593 ~]$ nvidia-smi -L
GPU 0: Tesla K40m (UUID: GPU-f4e02ffa-b01c-1e3e-ebdb-46e1fef83ce6)
GPU 1: Tesla K40m (UUID: GPU-35bee978-8707-1957-12e2-bddda324da88)

And to look at a running job and see how much power and GPU memory that is being used there is nvidia-smi -l

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 0000:08:00.0     Off |                    0 |
| N/A   39C    P0   124W / 235W |    558MiB / 11519MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 0000:27:00.0     Off |                    0 |
| N/A   40C    P0   133W / 235W |    558MiB / 11519MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

The most important thing to make VASP run efficiently is to make sure that you have the MPS system active on the node. MPS is the “Multi Process Service” – it virtualizes the GPU so that many MPI ranks can access the GPU independently without of having to wait for each other. Nvidia has an overview (PDF file) on their web site describing MPS and how set it up. Basically, what you have to do as user is to check if the nvidia-cuda-mps-control process is running. If it is not, you have to start it yourself, before starting your VASP job.

$ mkdir /tmp/nvidia-mps
$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps 
$ mkdir /tmp/nvidia-log
$ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log 
$ nvidia-cuda-mps-control -d

Except for the initialization above, which you only need to do once on the node, you run VASP as usual, using mpiexec.hydra or a similar command, to start an MPI job.

mpiexec.hydra -n 8 vasp_gpu

You should see messages in the beginning of the program output telling you that GPU has been initialized:

Using device 1 (rank 3) : Tesla K40m
Using device 1 (rank 2) : Tesla K40m
Using device 0 (rank 1) : Tesla K40m
Using device 0 (rank 0) : Tesla K40m
running on    4 total cores
distrk:  each k-point on    4 cores,    1 groups
distr:  one band on    1 cores,    4 groups
using from now: INCAR
...
creating 32 CUDA streams...
creating 32 CUFFT plans with grid size 36 x 40 x 48...

(Please note that these might go away or look differently in the final release)

VASP Test Suite

I attempted to run through my VASP test suite with GPU version, but many of the test cases require running with LREAL=.FALSE., so the total energies are different. Despite that, and comparing a few other runs, I did not see any significant discrepancies between the CPU and GPU versions. More than 100 different test cases have been used during the acceptance testing, in addition to the testing done by the beta testers, so we can be certain that the major bugs have been found at this stage.

Performance of standard DFT calculations

I have only performed single-node benchmarks so far, so the focus is comparing the speed when running with CPUs only vs. CPUs+GPUs. The K40 node has 2 GPUs and 2 CPU sockets, with 1 GPU attached to each socket, so the comparison is 16 cores vs (any number of cores) and 2 GPUs. Typically, I found that using 8 MPI ranks (i.e. 8 out of 16 cores on a Triolith node) sharing 2 GPUs using MPS was the fastest combination for regular DFT jobs.

I tested regular DFT by using the old test case of GaAsBi 256 atoms (9 k-points) on which I have collected lots of data on for Triolith (Intel “Sandy Bridge”) and Beskow (Cray XC40 with Intel “Haswell”). As mentioned above, that was close to the biggest job I could run on a single compute node with 2 GPUs due to memory limitations. The reference run with CPUs completed in 7546 seconds on 1 node. With 8 cores and 2 K40 GPUs, it runs remarkably faster and finishes in 1242 seconds, which is around 6 times faster. Here, I used overclocked GPUs (875 Mhz), so at base frequency it is around 10% slower. For reference, with 8 compute nodes using no GPUs, the GaAsBi-256 job completes in 900 seconds on Triolith and in 400 seconds on Beskow.

Interestingly, GPU-VASP is especially strong on the Davidson algorithm. ALGO=fast runs about 50% faster in CPU mode, but with GPUs, there is very little difference, so if you rely on ALGO=normal for getting convergence, there is good news. If you calculate the speed-up for ALGO=normal, it is therefore higher, around 8x faster.

The NSIM parameter is now very important for performance. Theoretically, the GPU calculations will run faster the higher the value of NSIM you set, with the drawback being that the memory consumption on the GPUs increase with higher NSIM as well. The recommendation from the developers is that you should increase NSIM as much you can until you run out of memory. This can require some experimentation, where you have to launch your VASP job and the follow the memory use with the nvidia-smi -l command. I generally had to stop at NSIM values of 16-32.

Packing a relatively big cell like this (256 atoms) on a single node is probably the poster case for the GPU version, as you need sufficiently large chunks of work offloaded onto the GPU in order to account for the overhead of sending data back and forth between the main memory and the GPU. As an example of what happens when there is too little work to parallelize, we can consider the 128-atom Li2FeSiO4 test case in the gamma point only. I can run that one on a CPU node with the gamma point version of VASP in around 260 seconds. The GPU version, which has no gamma point optimizations, clocks in 185 seconds, even with 2 K40s GPUs, for an effective improvement of only 40%.

Of course, one can argue that this is an apples to oranges comparison, and that you should compare with the slower VASP NGZhalf runtime for the CPU-version (which is 428 seconds or half the speed of the GPU run), but the gamma-point only optimization is available in the CPU version today, and my personal opinion is that the symmetries enforced in the gamma point version produces more accurate results, even if they might be different.

Performance of hybrid DFT calculations

Hybrid DFT calculation incorporating Hartree-Fock, perhaps screened such as in HSE06, are becoming close to the standard nowadays, as regular DFT is no longer considered state of the art. Part of the reason is the availability of more computing resources, as an HSE06 calculation can easily take 100 times longer to run. Speeding up hybrid calculation was one of the original motivations for GPU-accelerating VASP, so I was curious to test this out. Unfortunately, I had lots of problems with memory leaks and crashes in the early beta versions, so I had a hard time getting any interesting test cases to run. Eventually, though, these bugs were ironed out in the last beta release, enabling me to start testing HSE06 calculations, but the findings here should be considered preliminary for now.

In my tests, I found that 4-8 MPI ranks was optimal for hybrid DFT. The reason for hybrid jobs being able to get along with less CPU cores is that the Hartree-Fock part is the dominant part, and it runs completely on the GPU, so there should be some efficiency gain by having less MPI ranks competing for the GPU resources. For really big jobs, I was told by Maxwell Hutchinson, one of the exact-exchange GPU developers, that having 1 GPU per MPI rank should be the best way to run, but I have not been able to confirm that yet.

Setting NSIM is even more important here, the recommendation is

NSIM = NBANDS / (2*cores)

So you need to have lots of bands in order to fully utilize the GPUs.

The test case here was an MgO cell with 63 atoms, 192 bands and 4 k-points. These are quite heavy calculations so I had to resort to timing individual SCF iterations to get some results quickly. With 16 CPU cores only, using ALGO=all one SCF iteration requires around 900 seconds. When switching to 4 cores and 2 GPU:s (2 cores per GPU), I get the time for one SCF iteration down to around 640 seconds, which is faster, but not a spectacular improvement (40-50%). But note that the same SCF algorithm is not being used here, so actual number of iterations required to converge might be different, which affects the total runtime. I have actually not tested running HSE06 calculations with ALGO=normal before, as it is not the standard way to run, so I cannot say right now whether to expect faster convergence with ALGO=normal. The underlying problem, as I have understood it, is that a job like this launches lots of small CUDA kernels, and although there are many of them, they cannot effectively saturate the GPU. The situation should be better for larger cells, but I have not been able to run these test yet, as I only had a few nodes to play with.

Summary and reflections

It is well known that making a fair comparison between CPU and GPU computing is very challenging. The conclusion you come to is to a large extent dependent on which question you ask and what you are measuring. The whole issue is also complicated by the fact that many of the improvements made to the code during the GPU porting can be back-ported to the CPU version, so the process of porting a code to GPU might itself, paradoxically, weaken the (economical) case for running on GPUs, as the associated gain might make the CPU performance just good enough to compete with GPUs.

From a price/performance point of view, one should remember that a GPU-equipped node is likely to be much more expensive and use more power when running. In a big procurement of an HPC resource, the final pricing is always subject to some negotiation and discounting, but if one looks at the list prices, a standard 2-socket compute node with 2 K40 GPUs is likely to cost at least 2-3 times as much as without the GPUs. One must also take into consideration that such a GPU node might not always run GPU-accelerated codes and/or be idling some part of the time due to a lack of appropriate jobs. Consequently, the average GPU workload that runs on a GPU partition in an HPC cluster must run a lot faster to make up for the higher cost. In practice, a 2-3x speedup for a breakeven with respect to pricing is probably not enough to make it economically viable, instead we are looking at maybe 4-6x. The good news is, of course, that certain VASP workloads (such as normal DFT and molecular dynamics on big cells) do meet this requirement.

Another perspective that should not be forgotten is that the compute power of a single workstation running VASP can be improved a lot with GPUs. This is perhaps not as relevant for “big” HPC, but it significantly increases the total compute power and job capacity that a VASP user without HPC access can easily acquire. From a maintenance and system administration perspective, there is a big jump in moving from a single workstation to a full-fledged multi-node cluster. A cluster needs rack space, a queue system, some kind of shared storage etc. The typical scientist, will not be able to set up such a system easily. The 2-socket workstation of yesterday was probably not sufficient for state of the art VASP calculations, but with, let us say, an average 4x improvement with GPUs, it might be viable for certain kinds of research level calculations.

From a user and scientific point of view, the GPU version of VASP seems ready for wider adoption. It works and is able to reproduce the output of the regular VASP. Running it requires making some change of settings in the input files, which unfortunately can make the job suboptimal when running on CPUs. But it has always been the case, that you need to adjust the INCAR parameters to get the best out of a parallel run, so that is nothing new.

In conclusion, it would not surprise me if the availability of VASP with CUDA support might be a watershed event for the adoption of Nvidia’s GPUs, since VASP is such a big workload on many HPC centers around the world. For us, for example, it has definitely made us consider GPUs for the next generation cluster that will eventually replace Triolith.

P.S. 2015-11-23: The talk by Max Hutchinson at SC15 about VASP for GPU is now available online.

SC15 conference

2015-11-11T00:00:00+01:00

Next week, I am going to Austin to participate in SC15 (the supercomputing conference). I will also be at Intel’s HPC developer conference. I am always interested in meeting other people working with ab initio/electronic structure software. Send me an email if you want to meet up.

New version of VASP - 5.4.1

2015-09-21T00:00:00+02:00

A new version of VASP, denoted vasp.5.4.1 24Jun15 was released during the summer. The release was a bit stealthy, because there was no mention of it on the VASP home page until announcements of “bugfixes for vasp.5.4.1” showed up. There seems to be no official release notes published either, but the announcement email contains the following list of improvements and changes:

Interface to the solvation model code VASPsol of Mathew and Hennig.
Bugfix in the symmetry detection
Support for NCORE≠1 for hybrid functional calculations.
Bugfix in pead.F: macroscopic dielectric properties (LCALCEPS=.TRUE.) didn’t work with LREAL≠.FALSE.
Improvements to hybrid calculations: less memory use when run at large scale.
A new build system which simplifies the compilation of VASP. There is now a separate build directory and you can compile the three usual versions with distinct make commands.

Since the original 5.4.1 release, there has also been two patches released:

Patch #1: 2015-07-08 fixing “several bugs”.
Patch #2: 2015-08-27 which fixed a problem with analyzing the symmetry of collinear magnetic structures.

The first installations of VASP 5.4.1 binaries at NSC is available in the usual place, so you can put the following in your job scripts:

mpprun /software/apps/vasp/5.4.1-24Jun15/build04/vasp

Build04 includes both of the patches, but build01-02 only have the first patch. You can also do module load vasp/5.4.1-24Jun15, if you prefer to use modules. That command currently loads build04.

I believe that the new version is safe to use. I ran through the test suite that I have and saw identical output in most tests and small deviations for Cu-fcc and Si with GW. I especially recommend to try out 5.4.1 if you are struggling with large hybrid functional calculations and employ many compute nodes in the process. In repeating some of the GaAsBi benchmark runs with 5.4.1, I found that HSE06 is now significantly faster, and uses less memory per MPI rank, so you might be able use more cores per compute node without running out of memory. To exemplify, with 5.3.5, I was able to run my GaAsBi-512 atom test system on 96 Triolith compute nodes with 8 cores per node in 9812 seconds, but with 5.4.1, it completes in 6719 seconds using the same configuration. That is a 46% improvement in speed! Furthermore, it possible to scale up the calculation further to 128 nodes and 12c/node, which I was not able to do before due to memory shortage. I think this is good news for the people running large calculations, especially on Beskow.

P.S. A comment on the lack of updates to the blog:

Over the year, I have gradually moved on to another position at NSC. Today, I work as partner manager, developing NSC’s external collaborations with partners such as SMHI and Saab Group. I intend to keep publishing benchmarks and recommendations on the blog, as we move into the process of replacing Triolith, but the update frequency will likely be lower in the future. Weine Olovsson at NSC is taking over most of the actual support duties for VASP.

VASP seminar at Uppsala University

2015-01-14T00:00:00+01:00

Tomorrow (January 15th), I will be visiting Uppsala University and give some lectures on running VASP efficiently on the clusters available in Sweden. The seminar is being organized by NSC and UPPMAX, mainly for SNIC users, but everyone interested is welcome. It is not necessary to register in advance. You can find more information on the SNIC web page.

Most of the content that I will present is already available here, scattered over many blog posts. If you check out the blog post archives, you will find posts covering VASP and how certain input parameters affect performance. For information about compiling VASP, I have a special section on the web page called Compile VASP. It now has a new guide for the Beskow system at PDC in Stockholm. I will show some preliminary benchmark results from there and we will talk about what kind of simulations that are suitable for running at large scale on such a computer.

There is also a dedicated time slot for a Q&A session. It is a good time to ask questions, not only about VASP, as several of the SNIC application experts will be there.

I am looking forward to seeing you there!

Update 2015-01-16: Thank you to everyone who participated. It is always nice to return to Uppsala. By request, here are the slides from the sessions:

Session 1: General performance aspects of electronic structure software
Session 2: VASP runtime settings: general recommendations
Session 3: Running VASP on Beskow: benchmark results

VASP on Cray XC-40 Beskow: preliminary benchmark results

2015-01-13T00:00:00+01:00

During the Christmas holidays, I had the oppurtunity to run some VASP benchmarks on Beskow, the Cray XC40 supercomputer currently being installed at PDC in Stockholm. The aim was to develop guidelines for VASP users with time allocations there. Beskow is a significant addition in terms of aggregated core hours available to Swedish researchers, so many of heavy users of supercomputing in Sweden, like the electronic structure community, were granted time there.

For this benchmarking round, I developed a new set of tests to gather data on the relationship between simulation cell size and the appropriate number of cores. There has been concerns that almost no ordinary VASP jobs would be meaningful to run on the Cray machine, because they would be too small to fit into the minimal allocation of 1024 cores (or 32 compute nodes, in my interpretation). Fortunately, my initial results show that this is not case, especially if you use k-point parallelization.

The tests consisted of GaAs supercells of varying sizes, doped with a single Bi atom. The cells and many of the settings are picked from a research paper, to make it more realistic. The supercell sizes were successive doublings of 64, 128, 256, and finally 512 atoms, with a decreasing number of k-points in the Monkhorst-Pack grids (36, 12, 9, 4). I think it is a quite realistic example of a supercell convergence study.

First, please note the scales on the axes. The y-axis is reversed time in logarithmic scale, so upwards represents faster speed. Similarly, the x-axis is the number of compute nodes (with 32 cores each) in log-2 scale. A 256-node calculation is potentially 8192 cores! But in this case, I used 24 cores/node, except for the biggest HSE06 cells, where I had to reduce the number cores per node to gain more memory per MPI rank. The solid lines are PBE calculations and the dashed lines HSE06 calculations. Note the very consistent picture of parallel displacement of the scaling curves: bigger cells take longer time to run and scale to a larger number of compute nodes (although the increase is surprisingly small). The deviations from the straight lines comes from outlier cases where I had to use a sub-optimal value of KPAR, for example, with 128 atoms, 32 nodes, and 12 k-points, I had to use KPAR=4 instead of KPAR=12 to balance the number of k-points. For real production calculations, you could easily avoid such combinations.

The influential settings in the INCAR file were:

NCORE = cores/node (typically 24)
KPAR = MIN(number of nodes,number of k-points)
NSIM = 2
NBANDS = 192, 384, 768, 1536 (which are all divisible by 24)
LREAL = Auto
LCHARG = .FALSE.
LWAVE = .FALSE.
ALGO = Fast (Damped for HSE06)

There were no other special tricks involved, such as setting MPI environment variables or custom MPI rank placement. It is just standard VASP calculations with minimal input files and careful settings of the key parameters. I actually have more data points for even bigger runs, but I have chosen to cut off the curves where the parallel efficiency fell too much, usually to less than 50%. In my opinion, it is difficult to motivate a lower efficiency target than that. So what you see is the realistic range of compute nodes you should employ to simulate a cell of a given size.

If we apply the limit of 32 nodes, we see that a 64-atom GGA calculation might be a borderline case, which is simply to small to run on Beskow, but 128 atoms and more scale well up to 32 compute nodes, which is half a chassis on the Cray XC40. If you use hybrid DFT, such as HSE06, you should be able to run on 64 nodes (1 chassis) without problem, perhaps even up to 4 chassis with big supercells. In that case, though, you will run into problems with memory, because it seems that the memory use of an HSE06 calculation increase linearly by the number of cores I use. I don’t know if it is a bug, or if the algorithm is actually designed that way, but it is worth keeping in mind when using hybrid functionals in VASP. Sometimes, the solution to an out of memory problem is to decrease the number of nodes.

In addition to parallel scaling, we are also interested in the actual runtime. In some ways, it is impressive. Smaller GGA calculations can complete 1 SCF cycle in less than 1 minute, and even relatively large hybrid-DFT jobs can generally be tuned to complete 1 SCF cycle in per hour. In other ways it is less impressive. While we can employ more compute nodes to speed up bigger cells, in general, we cannot always make a bigger system run as fast as a smaller by just adding more compute nodes. For example, an HSE06 calculation will take about two orders of magnitude longer time to run than a GGA calculation, but unfortunately, it cannot also make use of two orders of magnitude more compute nodes efficiently. Therefore, large hybrid calculations will remain a challenge to run until the parallelization in VASP is improved, especially with regards to memory consumption.

Selecting the right number of cores for a VASP calculation

2015-01-12T00:00:00+01:00

A frequent question I encounter supporting VASP users is: “I have a cell with X atoms and Y electrons. How many compute nodes (or cores) should I choose for my simulation?”

It is an important question because using too many cores will be inefficient and the result is less number of jobs completed in a given computer time allocation.

Currently, there is only MPI parallelization in VASP, so by “cores”, I mean the number of MPI ranks or processes, i.e. the number you give to the mpirun command using the -n command line flag or the number of cores you request in the queue system.

Besides the suggestion of actually testing it out and finding a good number of cores, the main rule of thumb that I have been telling people is:

number of cores = number of atoms

This is almost always safe, and will not waste computer time. Typically, it will ensure a parallel efficiency of at least 80%. This is of course a very unscientific and handwavy rule, but it has a certain pedagogical elegance, because it is easy to remember, and you don’t need to look up any other technical parameters.

Let’s now look into how you could make a more accurate estimate. VASP has three levels of parallelization: over k-points, over bands, and over plane-wave coefficients (or equivalently Fast-Fourier transforms). You need to ensure that when the work is split up over several compute nodes, there is a sufficient amount of work allocated to each processor core, otherwise, they will just spend time waiting for more work to arrive. The fundamental numbers to be aware of are therefore:

The number of k-points
The number of bands (determined indirectly by the number of atoms and electrons)
The size of the basis set (i.e. number of plane waves, which corresponds to the number of grid points in the FFTs).

If you can estimate, or derive these numbers for your calculation, you can more precisely guess a suitable number of cores to use.

Bands and cores

The first step is to consider the number of bands (NBANDS). VASP has parallelization over bands (controlled by the NPAR tag). The ultimate limit is 1 band per core. So, for example, if you have 100 bands, you cannot run on more than 100 cores and expect it to work well. What I have seen, in my scaling tests, though, is that 1 band per core is too little work for a modern processor. You need to have at least 2 bands per core to reach more than 50% efficiency. A conservative choice is 8 bands/core. That will give you closer to 90% efficiency.

number of cores = NBANDS / 8

So how does this relate to the rule of thumb above? By apply it, you will arrive at a number of bands per core equal to the average number of valence electrons per atom in your calculation. If we assume that the typical VASP calculation has about 4-8 valence electrons per atom, this will land us in the ballpark of 4-8 bands/core, which is usually ok.

Let’s now try to apply this principle:

Example 1: We have a cell with 500 bands and a cluster with compute nodes having 16 cores per node. We aim for 8 bands/core, which unfortunately means 62.5 cores. It is better to have even numbers, so we increase the number of bands to 512 by setting NBANDS=512 in the INCAR file and allocate 64 cores, or 4 compute nodes.

Example 2: Suppose that you want to speed up the calculation in the previous example. You need the results fast, and care less about efficiency in terms of the number of core hours spent. You could drop down to 1 band/core (512 cores), but there is really not that much improvement compared to 2 bands/core (256 cores). So it seems like 256 cores is the maximum number possible. But what you can do is to take these 256 MPI processes and spread them out over more compute nodes. This will improve the memory bandwidth available to each MPI process, which usually speed things up. So you can try running on 32 nodes, but using 8 cores/node instead. It could be faster, if the extra communication overhead is not too large.

K-points and KPAR

The next step is to consider the number of k-points. VASP can treat each k-point independently. The number of k-point groups that run in parallel is controlled by the KPAR parameter. The upper limit of KPAR is obviously the number of k-points in your calculation. In theory, the maximum number of cores you can run on using combined k-point- and band-parallelization is NBANDS * KPAR. So for example, 500 bands and 10 k-points would allow up to 5000 cores, in principle. In practice, though, k-point parallelization does not scale that well. What I have found on the Triolith and Beskow systems is that supplying KPAR=compute nodes usually allows you to run on twice as many cores as you determined in the previous step, regardless of the actual value of KPAR. I would not recommend attempting run with KPAR>compute nodes, even though you may have more k-points than compute nodes.

(Note: A side of effect of this is that the most effective number of bands/core when using k-point parallelization is higher than without. This is likely due to the combined overhead of using two parallelization methods.)

Example 3: Consider the 500 bands cell above. 64 cores was a good choice when using just band parallelization. But you also have 8 k-points. So set KPAR to 8 and double the number of cores to 128 cores (or 8 compute nodes). In this case, we end up with 1 k-point per node, which is a very balanced setup. Note that this may increase the required memory per compute node, as k-point parallelization replicates a lot of data inside each k-point group. If you would run out of memory, the next step would be to lower KPAR to 2 or 4.

Basis set size, LPLANE and NGZ

As a last step, it might be worth considering what the load-balancing of your FFTs will look like. This is covered in the VASP manual in section 8.1. VASP, by default, works with the FFTs in a plane-wise manner (meaning LPLANE=.TRUE.), which reduces the amount of communication needed between MPI ranks. In general, you want to use this feature, as it is typically faster. The 3D FFT:s are split up into 2D planes, where each group (as determined by NPAR) works on a number of planes. It means that ideally, you want NGZ (the number of grid points in the Z direction) to be evenly divisible by NPAR. That will ensure good load-balance.

NGZ=n*NPAR

The second thing to consider is, according to the manual, that NGZ should be sufficiently big for the LPLANE approach to work:

NGZ ≥ 3*(number of cores)/NPAR = 3*NCORE

Since NCORE will be of the same magnitude as the number of cores per compute node, it means that NGZ should be at least 24-96, depending on the node configuration. More concretely, for the following clusters, you should check that the conditions below hold:

NSC Kappa/Matter: NGZ ≥ 24
NSC Triolith: NGZ ≥ 48
PDC Beskow: NGZ ≥ 72 (using 24c/node)

Typically, this is not a big problem. As an example of what NGZ can be, consider a 64-atom supercell of GaAs (11³ Å) with a cut-off of 313 eV. Then the small FFT grid is 70x70x70 points. So that is approximately the smallest cell that you can run on many nodes without suffering from excessive load imbalance on Beskow. For bigger cells, with more than 100 atoms, NGZ is also usually larger than 100, so there will be no problem in this regard, as long as you stick to the rule of using NPAR=compute nodes or NCORE=cores/node. But you should still check that NGZ is an even number and not, for example, a prime number.

In order to tune NGZ, you have two choices, either adjust ENCUT to a more appropriate number and let VASP recalculate the values of NGX, NGY and NGZ, or switch from specifying the basis set size in terms of an energy cut-off and set the NG{X,Y,Z} parameters yourself directy in the INCAR file instead. For a very small system, with NGZ falling below the threshold above, you can also consider lowering the number of cores per node and adjusting NCORE accordingly. For example, on Triolith, using 12 cores/node and NCORE=12 would lower the threshold for NGZ to 36, which enables you to run a small system over many nodes.

Summary

Check the number of bands (NBANDS). The number of bands divided by 8 is a good starting guess for the number of cores to employ in your calculation.
If you have more than one k-point, set KPAR to the number of compute nodes or the number of k-points, whichever is the smallest number. Then double the amount of cores determined in the previous step.
Make a test run and check the value of NGZ, it should be an even number and sufficiently big (larger than 3*cores/node). Adjust either the basis set size or the number of cores/node.

How to compile VASP on the Cray XC40

2015-01-07T00:00:00+01:00

Recently, I have been working on making VASP installations for the new Cray XC40 (“Beskow”) at PDC in Stockholm. Here are some instructions for making a basic installation of VASP 5.3.5 using the Intel compiler. Some of it might be specific to the Cray at PDC, but Cray has a similar environment on all machines, so I expect it to be generally useful as well. My method of compiling VASP produces binaries which are around 30-50% faster than the ones that were provided to us by Cray, so I really recommend making the effort to recompile if you are a heavy VASP user.

If you have an account on Beskow, my binaries are available in the regular VASP module:

module load vasp/5.3.5-31Mar14

The installation path is (as of now, it might change when the system becomes publically available):

/pdc/vol/vasp/5.3.5-31Mar14/build04/

You can find the makefiles and some README files too.

Summary of the findings

VASP compiles fine with module PrgEnv/intel and MKL on Cray XC-40.
Using MKL is still signficantly better, especially for the FFTW routines.
Optimization level -O2 -xCORE-AVX2 is enough to get good speed.
VASP does not seem to be helped much by AVX2 instructions (small matrices and limited by memory bandwidth).
A SCALAPACK blocking factor NB of 64 seems appropriate.
MPI_BLOCK should be increased as usual, 64kb is a good number.
Enabling MKL’s conditional bitwise reproducibility at the AVX2 level does not hurt performance, it may even be faster than running in automatic mode.
Memory “hugepages” does not seem to improve performance of VASP.
The compiler flags -DRPROMO_DGEMV and -DRACCMU_DGEMV have very little effect on speed.
Hyper-threading (symmetric multi-threading) does not improve performance, the overhead of running twice as many MPI ranks is too high.
Multithreading in MKL does not improve performance either.

Preparations for compiling

First, download the prerequisite source tarballs from the VASP home page:

http://www.vasp.at/

You need both the regular VASP source code, and the supporting “vasp 5” library:

vasp.5.3.5.tar.gz
vasp.5.lib.tar.gz

I suggest to make a new directory called e.g. vasp.5.3.5, where you download and expand them. You would type commands approximately like this:

mkdir 5.3.5
cd 5.3.5
(download)
tar zxvf vasp.5.3.5.tar.gz
tar zxvf vasp.5.lib.tar.gz

This will set you up with the source code for VASP.

Load modules for compilers and libraries

The traditional compiler for VASP is Intel’s Fortran compiler (ifort command), so we will stick with Intel’s Fortran compiler in this guide. In the Cray environment, this module is called “PrgEnv-Intel”. Typically, PGI or Cray is the default preloaded compiler, so we have to swap compiler modules.

module swap PrgEnv-cray PrgEnv-intel/5.2.40

Check which version of the compiler you have by typing “ifort -v”:

$ ifort -v
ifort version 14.0.4

If you have the PrgEnv-intel/5.2.40 module loaded, it should state 14.0.4. This version can compile VASP with some special rules in the makefile (see compiler status for more information). Please note that the Fortran compiler command you should use to compile is always called ftn on the Cray (regardless of the module loaded).

We are going to use Intel’s math kernel library (MKL) for BLAS, LAPACK and FFTW, so we unload Cray’s LibSci, to be on the safe side.

module unload cray-libsci

Then I add these modules, to nail everything down:

module load cray-mpich/7.0.4
module load craype-haswell

This select Cray’s MPI library, which should be default, and the sets up the environment to compile for XC-40.

VASP 5 lib

Compiling the VASP 5 library is straightforward. It contains some timing and IO routines, necessary for VASP, and LINPACK. Just download my makefile for the VASP library into the vasp.5.lib directory and run the make command.

cd vasp.5.lib
wget http://www.nsc.liu.se/~pla/downloads/makefile.vasp5lib.crayxc40
make

When it is finished, there should be a file called libdmy.a in the directory. Leave it there, as the main VASP compilation picks it up automatically.

Editing the main VASP makefile

Go to the vasp.5.3 directory and download the main makefile.

cd vasp.5.3
wget http://www.nsc.liu.se/~pla/downloads/makefile.vasp535.crayxc40

I recommend that you edit the -DHOST variable in the makefile to something that you will recognize, like the machine name. The reason is that this piece of text is written out in the top of OUTCAR files.

CPP    = $(CPP_) -DMPI  -DHOST=\"MACHINE-VERSION\" -DIFC \
   -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
   -DMPI_BLOCK=65536 -Duse_collective -DscaLAPACK \
   -DRPROMU_DGEMV  -DRACCMU_DGEMV -DnoSTOPCAR

You will usually need three different versions of VASP: the regular one, the gamma-point only version, and one for spin-orbit and/or non-collinear calculations. These are produced by the following combinations of precompiler flags that you have to put into the CPP line in the makefile:

regular:       -DNGZhalf
gamma-point:   -DNGZhalf -DwNGZhalf
non-collinear: (nothing)

At the Swedish HPC sites, we install and name the different binaries vasp, vasp-gamma, and vasp-noncollinear, but this is optional.

Compiling

VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:

make -f makefile.vasp5lib.crayxc40

If you really want to speed it up, you can try something like:

nice make -j4; make -j4; make -j4; make -j4;

Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”. Rename it immediately after you are finished, otherwise it will get destroyed when you type make clean to compile the other VASP versions.

Running

The Cray compiler environment produces statically linked binaries by default, since this is the most convenient way to run on the Cray compute nodes, so we just have to put e.g. the following in the job script to run on 2048 cores using 64 compute nodes with 32 cores per node and 16 cores per socket:

aprun -n 2048 -N 32 -S 16 /path/to/vasp

Normally, I would recommend lowering the number of cores per compute node. This will often make the calculation run faster. In the example below, I run with 24 cores per node (12 per socket), which is typically a good choice:

aprun -n 1536 -N 24 -S 12 /path/to/vasp

When running on the Cray XC-40, keeping in mind the basic topology of the fast network connecting the compute nodes. 4 nodes sit together on 1 board, and 16 boards connect to the same chassis (for a total of 64 compute nodes), while any larger job will have to span more than one chassis and/or physical rack, which slows down network communications. Therefore, it is best to keep the number of compute nodes to 64 at most, as few VASP jobs will run efficiently using more nodes than that.

Initial test of VASP on Intel's Xeon E5 v3 processors ("Haswell")

2014-12-16T00:00:00+01:00

I finally got around to run some VASP benchmarks on the recently released Intel Xeon E5 v3-processors (see previous post for an overview of the differences vs older Xeon models). I have tested two different configurations:

A 16-core node from the coming weather forecasting cluster at NSC to be named “BiFrost”. It is equipped with the Xeon E5-2640v3 processor at 2.6 Ghz, together with 64 GB of 1866 MHz DDR4 memory.
The 32-core nodes in the Cray XC40 “Beskow” at machine at PDC. The processor model is Xeon E5-2698-v3. Beskow also has two sockets per node with 64 GB memory, but the speed of the memory is faster: 2133 MHz.

I did a quick compile of VASP with Intel Fortran (version 14 and 15) and ran some single node benchmarks of the 24-atom PbSO4 cell I traditionally use for initial testing.

The initial results are basically along my expectations. The single core performance is very strong thanks to Turbo Boost and the improved memory bandwidth. It is the best I have measured so far: 480 seconds vs 570 seconds earlier for a 3.4 Ghz Sandy Bridge CPU.

When running on all 16 cores and comparing to a Triolith node, the absolute speed is up 30%, which is also approximately +30% improved performance per core, as the number of cores are the same and the effective clock frequency is also very close in practice. The intra-node parallel scaling has improved as well, which is important, because it hints that we will be able to run VASP on more than 16 cores per node without degrading performance too much on multi-core nodes such as the ones in Beskow. In fact, when running the above cell on a single Cray XC-40 node with 32-cores, I do see improvement in speed all the way up to 32 cores:

This is a good result, since the paralllel scaling for such a small cell is not that good in the first place.

So overall, you can expect a Beskow compute node to be about twice as fast as a Triolith node when fully loaded. That is logical, as it has twice as many cores, but not something that you can take for granted. However, when running wide parallel jobs, I expect you are likely to find that using 24 cores/node is better, because 24-cores brings 90+% of the potential node performance, while at the same significantly lowering the communication overhead due to having many MPI processes.

A first case study: a 512-atom supercell

I will publish more comprehensive benchmarks later, but here is an indication of what kind of improvement you can expect on Beskow vs Triolith for more realistic production jobs. The test system is a 512-atom GaAs supercell with one Bi defect atom. A cell like this is something you would typically run as part of a convergence study to determine the necessary supercell size.

Using 64 compute nodes on Triolith (1024 cores in total) it takes 340s to run a full SCF cycle to convergence (13 iterations)
On Beskow, using 64 compute nodes and 1536 cores (24c/node), it takes 160s.

Again, about 2.0x faster per compute node. If you compare it to the old Cray XE6 “Lindgren” at PDC, the difference will be even bigger, between 3-4x faster. But please keep in mind that allocations in SNIC are accounted in core hours, and not node hours, so while your job will run approximately twice as fast on the same number of nodes, the “cost” in core hours is the same.

Visiting University of Antwerp

2014-10-22T00:00:00+02:00

On Thursday and Friday (23-24 October), I will be visiting the University of Antwerp in Belgium and give some lectures as part of the Specialist course on efficient use of VASP that is being organized by CalcUA and the Flemish Supercomputing Centrum. You can find more information on the course web page. It seems like you need to register in advance to attend the course.

Most of the content that I will present is already available here, as part of several blog posts. If you check out the blog post archives, you will find many posts covering VASP and how certain input parameters affect performance. For information about compiling VASP, I have a special section on the web page called Compile VASP. The system guide for NSC’s Triolith is probably the best to start out with, as the hardware is very similar to what is currently available in Belgium. After the course, I plan to add a few articles based on the new material I prepared for this event.

I am looking forward to seeing you in Antwerp!

Update 2014-10-29: Thank you to everyone who participated. I had a pleasant stay in Antwerp and much enjoyed the discussions. It gave me some new ideas of things to test out. By request, here are the slides from the sessions:

General aspects of DFT codes (not only VASP).
Introduction to VASP. Might be useful for people not having used the VASP program before at all.
Influential settings part 1. General VASP settings such as NPAR/NCORE.
Influential settings part 2. What to think about when running big parallel jobs.

Application statistics for Triolith

2014-09-26T00:00:00+02:00

What is Triolith being used for? We have some idea by looking at the computer time applications we get through SNAC, and the support cases we work on also tells us something about what people are doing on Triolith, but the most comprehensive picture is likely to be painted by actually analyzing, in real time, what is running on the compute nodes. During the last service stop, I had the opportunity to examine the low-level logging data of a sizeable set of Triolith compute nodes. I managed to collect a sample of one month of log data from 186 nodes. To get a sense of the scale, it expands to about 2 TB of data uncompressed. Fortunately, when you have access to a supercomputer, you can attack the problem in parallel, so with 186 compute nodes unleashed at the task, it took just 10 minutes.

What you see below is an estimate of the fraction of time that the compute nodes spent running different applications. The time period is August 17th to September 17th, but the relative distribution has been rather stable over the previous months.

Application	Share of core hours (%)
VASP	36.6%
Gromacs	10.0%
NEK5000	8.2%
[Not recognized]	6.7%
LAMMPS	5.0%
Ansys (Fluent)	3.3%
Gaussian	3.3%
Dalton	3.1%
C2-Ray	2.4%
Nemo	2.2%
NAMD	2.2%
Python	1.8%
a.out	1.5%
GAMESS	1.5%
OpenFOAM	1.1%
STAR-CCM	1.1%
UPPASD	1.1%
CPMD	0.9%
EC-Earth	0.9%
KKR	0.8%
Spectral	0.8%
Rosetta	0.7%
CP2K	0.6%
Octopus	0.6%
RSPt	0.5%

Unsurprisingly, we find VASP at the top, accounting for about a third of the computing time. This is the reason why I spend so much time optimizing and investigating VASP – each of per cent of performance improvement is worth a lot of core hours in the cluster. We also have a good deal of molecular dynamics jobs (Gromacs, LAMMPS, NAMD, CPMD, ca 18%) and a steady portion of computational fluids dynamics jobs (Fluent + NEK5000 + OpenFOAM , ca 12%). Quantum chemistry programs, such as Gaussian, GAMESS, and Dalton (8%) catch the eye in the list, as expected, although this was a low month for Gaussian (3%), the usage is often higher (6-7%), competing for the top-5.

It would be interesting to compare this to other supercomputing sites. When talking to people at the SC conference, I get the impression that VASP is major workload at basically all academic sites, although perhaps not as much as 30%. In any case, getting statistics like this is going to be useful to plan application support and the design of future clusters that we buy.

Methodology

Below follows some technical observations for people interested in the details behind getting the numbers above.

The data is based on collectl process data, but at the logging level, you only see the file name of the binary, so you have to identify a certain software package just by the name of its running binaries. This is easy for certain programs, such as VASP, which are always called vasp-something, but more difficult for others. You can, for example, find the notorious a.out in the list above, which could be any kind of code compiled by the users themselves.

A simple check of the comprehensiveness of the logging is to aggregate all the core hours encountered in the sample, and compare with the maximum amount possible (186 nodes running for 24 hours for 30 days). This number is around 75-85% with my current approach, which means that something might be missing, as Triolith is almost always utilized to >90%. I suspect it is a combination of the sampling resolution at the collectl level, and the fact that I filter out short-running processes (less than 6 minutes) in the data processing stage to reduce noise from background system processes. Certain software packages (like Wien2k and RSPT) run iteratively by launching a new process for each iteration, creating many short-lived processes inside a single job. Many of these are probably not included in the statistics above, which could account for the shortfall.

Intel releases new "Haswell" Xeon server processors

2014-09-10T00:00:00+02:00

On September 8, Intel finally lifted the veil and revealed the new Xeon E5 server processors based on the “Haswell” architecture. These are the processers that you are likely to find in new supercomputer and cluster installations during the next years.

The main improvements are:

Up to 18-cores per processor socket. I expect that the mainstream configuration will be 10-14 cores/socket, so your typical 2-socket compute node will have 20-28 cores, and twice that number of threads with hyperthreading enabled.
Faster memory bandwidth with up to 2133 MHz DDR4-memory. Early benchmarks suggest a 40% improvement in bandwidth vs Triolith-style hardware (based on the “Sandy Bridge” platform). This is especially important for electronic structure codes, which tend to be limited by memory bandwidth.
Improved vectorization with AVX2 instructions. This can theoretically double the floating point arithmetics performance, but it reality there is diminishing return beyond some point for longer vectors. We expect +25% out of it at most. You will need to recompile your codes, or link to AVX2-enabled libraries such as Intel’s MKL, to use this feature.
Faster single core performance. Fortunately, the processor cores are still getting faster. Clock frequencies are not increasing, but according to Intel, the Haswell cores have about 10% better throughput of instructions per clock cycle. This is mainly from improvements in caches and better branch predictions, so it might not necessarily improve an already well-tuned and vectorized code.

Further reading: A longer technical overview of the Xeon E5 v3 series processors is available at enterprisetech.com and the old review of the Haswell microarchitecture on realwordtech.com is still relevant.

Upcoming Haswell-based systems in Sweden

So when can you get access to hardware like this as a supercomputing user in Sweden?

PDC in Stockholm has just announced that they will be installing a new 1+ petaflops Cray XC30 system to replace the “Lindgren” Cray XE6 system. It will be based on the 16-core variant of these new processors for a total of 32 cores per node. The system will be available for SNIC users from January 1st 2015.
NSC will install a new cluster dedicated to weather forecasting in late 2014 based on the 8-core variant. This system belongs to the SMHI and will not be available for SNIC users, but it will be an interesting configuration, with very good balance between compute power, memory bandwidth, parallel interconnect performance and storage. While being optimized for weather forecasting, it could also perform very well on electronic structure workloads.

I expect to be able to work on VASP installations and run benchmarks on both of these systems during fall/winter, so please check in here later.

Peak VASP calculations ahead?

2014-07-18T00:00:00+02:00

In June, I attended the International Supercomputing Conference in Leipzig, Germany. ISC is the second largest conference in the high performance computing field. Attending the big supercomputing conferences is always a good time to meditate on the state of the field and the future.

Instead of starting from the hardware point of view, I would like to begin from the other end: the scientific needs and how we can deliver more high-performance computing. At NSC, we believe in the usefulness of high-performance computing (HPC). So do our users, judging from the amount of computer time being applied for by Swedish scientists. When we compile the statistics, about three times more resources are asked for then we can provide. Clearly, the only way we can meet the need is to deliver more HPC capability in the future. The question, then, is how. There is always the possibility of increased funding for buying computers. Within our current facilities, we could accommodate many more systems of Triolith’s size, but realistically, I do not see funding for HPC systems increasing manyfold over the coming years, even though the potential benefits are great (see for example the recent report on e-infrastructure from the Swedish Science Council).

The traditional way has rather been to rely on new technology to bring us more compute power for the same amount of money. The goal is better price/performance, or more compute with the same amount of energy, which is related to the former. Fortunately, that approach has historically been very successful. Over the years, we have seen a steady stream of higher clocked CPU cores, multi-core servers, better memory bandwidth, and lower-latency networks being introduced. Each time we installed a new machine, our users could count on noticeable performance improvements, using the same simulation software as before, sometimes without even changing the underlying source code at all.

Thus, for a long time, the performance improvements have been essentially for free for our HPC users. I suspect, though, that this is a luxury that will come to an end. Why? Because currently, the way forward to more cost-effective computing, as envisioned by the HPC community, is:

Many-core architectures, such as IBM’s BlueGene and Intel’s Xeon Phi processors.
Vectorization, such as computing on GPU:s or with SIMD processors .
Special purpose hardware, such as custom SoC’s, FPGA:s, and ASICs.

Usually, such technologies are mentioned in the context of exascale computing, but it is important to realize that we would have to use the same technology if we wanted to build a smaller supercomputer, but for a fraction of the current cost. More concretely, what could happen in the coming years, is that there will be a new cluster with maybe ten times the floating point capability of today, but in the form of compute nodes with e.g. 256 cores and as much as 1000 threads. The key point, though, is that the speed of an individual core will most likely be less than on our current clusters. Thus, to actually get better performance out it, you will need excellent parallel scalability just to fully use a single compute node. The problem is that: 1) there are few mass market codes today that have this kind of scalability 2) Many current scientific models are simply not meaningful to run at the scale required to fully utilize such a machine.

In such a scenario, we could arrive at a “peak-VASP” situation, where traditional HPC algorithms, such as dense linear algebra operations and fast Fourier transforms, simply will not run any faster on the new hardware, which would essentially halt what has so far been seen as natural speed development. This could happen before any end of Moore’s law comes into play. It makes me think that there might be trouble ahead for traditional electronic structure calculations based on DFT unless there is a concerted effort to move to new hardware architectures. (This is also one of the conclusions in the Science Council report mentioned earlier.)

So what could such an effort look like?

Putting resources into code optimization and parallelization is one obvious approach. SeRC and the application experts in the Swedish e-science community have been very active in this domain. There is clearly potential here, but my understanding is that it has always been difficult to get it done in practice due to the funding structure in science (you get money for doing science, not “IT”), and also personnel shortage in the case that you actually have the funding available. There is also a limit to how much you can parallelize, with diminishing returns as you put more effort into it. So I think it can only be part of the solution.
Changing to scientific models that are more suitable to large-scale computations should also be considered, as this would amplify the results of the work done under point #1. This is something that has to be initiated from within the science community itself. There might be other methods to attack electronic structure problems which would be inconceivable at a smaller scale, but competitive at a large scale. I think the recent resurgence of interest in Monte Carlo and configuration-interaction methods is a sign of this development.
In the cases where models cannot be changed, the hardware itself has to change. That means bigger awareness from funding agencies of the importance of high-throughput computing. By now, computational material science is a quite mature field, and perhaps the challenge today is not only to simulate bigger systems, but rather to apply it in a large scale, which could mean running millions of jobs instead of one job with a million cores. The insight that this can produce just as valuable science as big MPI-parallel calculations could open up for investments in new kinds of computational resources.

Looking at the OpenMX DFT code

2014-06-11T00:00:00+02:00

OpenMX is an “order-N” DFT code with numerical pseudo-atomic basis sets and pseudopotentials. It takes a similar approach to solving the DFT problem as the SIESTA, ONETEP, and CONQUEST codes. What directed my attention towards OpenMX was a presentation at the SC13 conference showing excellent parallel scaling on the K computer, and the possibility to get very accurate solutions with the LCAO basis if necessary, as evident in the delta-code benchmark, which I wrote about earlier. It is also available under an open-source GPL license, unlike the other codes.

Installing

OpenMX is now available on Triolith. The code is written in straight C and was easy to compile and install. My changes to the makefile were:

CC = mpiicc -I$(MKLROOT)/include/fftw -O2 -xavx -ip -no-prec-div -openmp 
FC = mpiifort -I$(MKLROOT)/include/fftw -O2 -xavx -ip -no-prec-div -openmp
LIB = -mkl=parallel -lifcoremt

I used the following combination of C compiler, BLAS/LAPACK and MPI modules:

module load intel/14.0.1 mkl/11.1.1.106 impi/4.1.1.036

Initial tests

To get a feel of how easy it would be work with OpenMX, I first tried to set up my trusty 16-atom lithium iron silicate cell and calculate the partial lithium intercalation energy (or the cell voltage). This requires calculating the full cell, Li in bcc structure, and a partially lithiated cell; with spin polarization. To get the cell voltage right, you need a good description of both metallic, semiconducting, and insulating states, and an all-electron treatment of lithium. The electronic structure is a bit pathologic, so you cannot expect to get e.g. SCF convergence without some massaging in most codes. For example, so far, I have been able to successfully run this system with VASP and Abinit, but not with Quantum Espresso. This is not really the kind of calculation where I expect OpenMX to shine (due to known limitations in the LCAO method), but it is a useful benchmark, because it tells you something about how the code will behave in less than ideal conditions.

With the help of the OpenMX manual, I was able to prepare a working input file without too much problems. I found the atomic coordinate format a bit too verbose, e.g. you have to specify the spin up and spin down values for each atom individually, but that is a relatively minor point. As expected, I immediately ran into SCF convergence problems. After playing around with smearing, mixing parameters and algorithms, I settled with the rmm-diisk algorithm with a high Kerker factor, long history, and 1000K electronic temperature. This lead to convergence in about 62 SCF steps. For comparison, with VASP, I got convergence in around 53 SCF steps with ALGO=fastand linear mixing. From the input file:

scf.Mixing.Type rmm-diisk
scf.maxIter 100
scf.Init.Mixing.Weigth 0.01
scf.Min.Mixing.Weight 0.001
scf.Max.Mixing.Weight 0.100
scf.Kerker.factor 10.0
scf.Mixing.StartPulay 15
scf.Mixing.History 30
scf.ElectronicTemperature 1000.0

The choice of basis functions is critical in OpenMX. I recommend the paper by M. Gusso in J. Chem. Phys. for a detailed insight into the quality of the OpenMX basis set vs plane-waves. From quantum chemistry codes, I know that double-zeta basis will get you qualitatively correct results, but triple-zeta basis and above is required for quantitative results. It also seems imperative that you always have at least one d-component. I tried three combinations: s2p1+d (SVP quality), s2p2d1+f (DZP quality), and s3p3d3+f (TZP quality). The resulting cell voltages are shown below. The voltages are expected to decrease as the basis set size increases due overbinding from basis set superposition errors.

s1p1+d:   3.85 V
s1p1+d:   3.40 V (counterpoise correction)
s2p2d1+f: 3.14 V 
s2p2d1+f: 3.00 V (counterpoise correction) 
s3p3d3+f: 2.77 V (counterpoise correction)

For comparison, the converged VASP result with 550 eV and PREC=accurate is 2.80 V, meaning that the s3p3d3-level calculation is quite accurate. This confirms the delta-code benchmark study, where OpenMX calculations were shown to be as accurate as those of VASP, provided that a big basis set is used. In terms of speed, however, the results are not that impressive, on one Triolith compute node with 16 cores, VASP runs this 16-atom cell in 18 seconds at 500eV, whereas OpenMX with s3p3d3 basis takes 700 seconds! We will see, however, in the next section, that the outcome is different for large systems.

512-atom carbon nanotube

I think that carbon nanotubes (and other nanostructures) are a better fit for order-N approaches. For this purpose, I set up a (16,0) nanotube with 512 atoms including terminating hydrogens. It is stated in the OpenMX manual that you need at least 1 atom per MPI process, so ideally we could scale up to 512 cores with MPI, and possibly more with OpenMP multithreading.

Here, I choose DZP level basis set for OpenMX:

I believe that this is a slightly weaker basis set than what you would get in a standard VASP approach, so for the corresponding VASP calculation, a “rough” basis set of ENCUT=300 and PREC=Normal was chosen. For reference, the ENMAX value in the carbon POTCAR is 400 eV, which is what you would normally use. The calculations are spin-polarized. OpenMX reaches SCF convergence in 46 cycles using a similar scheme as above, and the VASP calculation converges in 34 cycles with the standard ALGO=fast. Both programs agree on a magnetic moment of 9.5-10.0.

In terms of performance, regular order-N(3) DFT in OpenMX is about as fast as in VASP, at least for wider parallel jobs:

We reach a peak speed of about 10 jobs/hour (or 10 geometry optimization steps per hour) on NSC’s Triolith. Interestingly, for the narrowest job with 4 compute nodes, OpenMX was almost twice as fast as the corresponding VASP calculation, implying that the intra-node performance is very strong for OpenMX, perhaps due to the use of hybrid OpenMP/MPI. Unfortunately, the calculation runs out of memory on less than 4 compute nodes, so I couldn’t test anything smaller (see more below).

The main benefit of using OpenMX is, however, the availability of linear scaling methods. The green line in the graph above shows the speed achieved with the order-N divide-conquer method (activated by scf.EigenvalueSolver DC in the input file). It cuts the time to solution by half, reaching more than 20 jobs/hour, but please note that the speed for narrow jobs is the same as regular DFT, so the main strength of the DC method seems not to be in improving serial (or intra-node performance) but rather in enabling better parallel scaling.

The impact on parallel scalling is more evident if we normalize the performance graph to relative speeds vs the 4-node jobs:

Regular DFT in VASP and OpenMX flattens out after 16 compute nodes, which is equivalent to 256 cores in total or 2 atoms/core, whereas with the linear scaling method, it is possible to beyond that.

Memory consumption

The main issue I had with OpenMX was memory consumption. OpenMX seems to replicate a lot of data in each MPI rank, so it is essential to use the hybrid MPI/OpenMP approach to conserve memory. For example, on 4 compute nodes, the memory usage looks like this for the calculation above:

64 MPI ranks without OpenMP threads = OOM (more than 32 GB/node)
32 MPI ranks with 2x OpenMP threads = 25 GB/node
16 MPI ranks with 4x OpenMP threads = 17 GB/node
 8 MPI ranks with 8x OpenMP threads = 13 GB/node

with 2x OpenMP giving the best speed. For wider jobs, 4x OpenMP was optimal. This was a quite small job with s2p2d1 basis and moderate convergence settings, so I imagine that it might be challenging to run very accurate calculations on Triolith, since there is only 32 GB memory in most compute nodes.

Of course, adding more nodes also helps, but the required amount of memory per node does not strictly decrease in proportion to the number of nodes used:

  4 nodes (2x OpenMP) = 25 GB/node
  8 nodes (2x OpenMP) = 21 GB/node
 16 nodes (2x OpenMP) = 22 GB/node
 32 nodes (2x OpenMP) = 18 GB/node

So when you are setting up an OpenMX calculation, you need to find the right balance between the number of MPI ranks and OpenMP threads in order to get good speed without running out of memory.

Conclusion

In summary, it was a quite pleasant experience to play around with OpenMX. The performance is competitive, and there is adequate documentation, and available example calculations, so it is possible to get started without a “master practitioner” nearby. For actual research problems, good speed and basic functionality is not enough, though, as usually, you need to be able to calculate specific derived properties, and visualize the results in a specific way. I noticed that there are some post-processing utilities available, even inside OpenMX itself, and among the higher level functionality, there is support for relativistic effects, LDA+U, calculation of exchange coupling, electronic transport, NEB and MLWFs, so I think many of the modern needs are, in fact, satisfied.

On long queue times

2014-05-15T00:00:00+02:00

One common question we get is “Why have my jobs been waiting in the queue for so long?” Upon investigation, it usually transpires that that the user is a member of a project that has run a lot of jobs recently. As a result, the job scheduler will suppress the priority of the newly queued jobs in order to let other projects have a chance of running. We have elaborated a bit on how Triolith’s queue system works on the NSC web page. I specifically recommend reading the section titled “How can I adapt the job scheduling to my workflow?”. The essence of it is:

There is no limit on how much a project can run in a month. But the more you run, the lower your priority will be, so the harder it will be to run the next job.

The relationship is actually non-linear, which is illustrated below:

The queue time in the sketch is a normalized number, but you can imagine it as being days. The percentage representing the project usage is the actual core hours used during the last 30-day period divided by the given allocation, i.e. running 15,000 core hours when having been allocated 10,000 core hours/month by SNAC is 150% usage.

The key insight here is that running more than one’s allocation results in queue times approaching infinity. But note that the allocation, in terms of core hours/month is not a hard limit. It is possible to run more jobs, for example reaching 150% or 200% of your allocated hours for a month. It is even possible to do this consistently over several months if other projects relinquish their core hours. But borrowing from other projects comes at a cost: if your project is always above the allocated usage, the job priority is also always low as a direct result, implying long queue times, as shown in the sketch above.

This means that it is critical for a PI to manage the project to ensure that adequate resources are available when the participants need them. For example, if a PhD student needs to perform a large set of calculations next month in order to finalize their thesis, the PI must prevent the other project members from overusing the project during the current month and accumulating a low priority for everyone as a result. In certain cases, it might even be prudent to underuse the project in order to save up for a priority boost later.

Below, I will show two ways to monitor how the core hours are being used. That will hopefully be helpful in doing capacity planning and manage resource use over time.

The projinfo command

On Triolith (and all other SNIC systems), there is command called projinfo, which reveals the current status of the projects that you are a member of. Here is the output of a hypothetical SNIC project that would suffer from low priority and long queue times:

[x_secun@triolith1 ~]$ projinfo
Project             Used[h]   Current allocation [h/month]
   User
-----------------------------------------------------
snic2014-XX-YY         320807.82              250000
   x_prima               5877.88
   x_secun                  0.34
   x_terti              12229.15
   x_quart              22312.09
   x_quint              23944.71
   x_sextu              75189.68
   x_septi             181253.97

The two most important numbers are the ones in the top, which accounts for the actual usage (320k core hours) over the last 30 days and the target allocation (250k). Note that the first number is an average over a 30-day sliding window, not the number of hours used since the start of the month. Again, it is the relation between the core hours used vs the project’s allocation that controls the priority of jobs. In this case, the priority for all users in the project will be very low. The reason is that the project, as a whole, has overused its allocation. In order to balance the books, this project would need to run less than their allocation during the following month, so that the average value becomes closer to 250k core hours/month. It is the job of the queue system to enforce this through job priorities.

Inspecting project use over time

The projinfo command shows only the most recent 30-day period. For long-term statistics, you can log in to NSC Express. There, it is possible to inspect the historical use of resources for a project per month. The graphs are accessed by clicking on a project name in the table in the “Projects” section of your personal NSC Express page. For example:

Here, one could imagine that around November-December, the queue time situation must have been particularly bad, whereas the other months were more in line with the allocated use and most likely more tolerable.

New version of VASP - 5.3.5

2014-04-30T00:00:00+02:00

A new version of VASP, denoted vasp.5.3.5 31Mar14 was released in the beginning of April. Swedish HPC users can find 5.3.5 installed on NSC’s Triolith and Matter clusters, and at PDC’s Lindgren. So what is new? The release notes on the VASP community page mentions a few new functionals (MSx family meta-GGAs, BEEF and Grimme’s D3) together with many minor changes and bug fixes.

The first installation of VASP 5.3.5 binaries at NSC is available in the expected place, so you can do the following in your job scripts:

mpprun /software/apps/vasp/5.3.5-31Mar14/default/vasp

You can also do module load vasp/5.3.5-31Mar14, if you prefer to use modules.

The installation and compilation was straightforward with Intel’s compilers and MKL, but I did not have much success with gcc/gfortran (4.7.2) as usual. Even after applying my previous patches for gfortran, the compiled binary crashed due to numerical errors.

It is also worth mentioning that some recent MPI-libraries now assume MPI version 2.2 standards compliance by default. This is the case with e.g. Intel MPI 4.1, which we use on Triolith. Unfortunately, VASP is not fully compliant with the MPI standard, as there are places in the code where memory buffers overlap, which results in undefined behavior. You can see errors like this when running VASP:

Fatal error in PMPI_Allgatherv: Internal MPI error!, error stack:
...
MPIR_Localcopy(381).......: memcpy arguments alias each other, dst=0xa57e9c0 src=0xa57e9c0 len=49152

Some of the problems can be alleviated by instructing the MPI runtime to assume a different MPI standard. For Intel MPI, one can set

export I_MPI_COMPATIBILITY=4

to force the same behavior as with Intel MPI 4.0. This seems to help with VASP. If we get reports of much problems like this, I will install a new version of VASP 5.3.5 with the old Intel MPI as a stopgap solution.

The Intel-compiled version of 5.3.5 ran through the test suite that I have without problems, implying that the results of 5.3.5 remain unchanged vs 5.3.3 for basic properties, as we expect. The overall performance appears unchanged for regular DFT calculations, but hybrid calculations run slightly faster now. There is also preliminary support for NPAR for Hartree-Fock-type calculations. I played around with it using a 64-atom cell on 64 cores, but setting NPAR actually made it run slower on Triolith, so I suppose k-point parallelization is still much more efficient for hybrid calculations.

How accurate are different DFT codes?

2014-02-21T00:00:00+01:00

How accurate is DFT in theory and in practice? There has been some reviews on the former, comparing calculations of a given DFT program with experiments, but not as much of the latter – comparing the numerical approximations inherent in different DFT codes. I came across a paper taking both of these aspects into account. The paper is titled “Error Estimates for Solid-State Density-Functional Theory Predictions: An Overview by Means of the Ground-State Elemental Crystals”, written by K. Lejaeghere et al. More information about their project to compare DFT codes can be found at their page at the Center for Molecular Modeling at the University of Gent.

Their approach to compare DFT codes is to look at the root mean square error of the equations of state w.r.t. the ones from Wien2K. They called this number the “delta-factor”. The sample set is the ground-state crystal structures of the elements H-Rn in the periodic table. I have plotted the outcome below, which is to be interpreted as the deviation from a full-potential APW+lo calculation, which is considered as the exact solution. Please note the logarithmic scale on the horizontal axis.

My observations are:

Well-converged PAW calculations with good atomic setups are very accurate. Abinit with the JTH library achieves a delta value of 0.5 meV/atom vs Wien2K. As the authors put it in the paper: “predictions by APW+lo and PAW are for practical purposes identical”.
Norm-conserving pseudopotentials (NC) with plane-wave basis set are an order of magnitude worse than PAW. The numerical error is of the same magnitude as the intrinsic error vs experiments for the PBE exchange-correlation potential (23.5 meV/atom).
VASP is no longer the most accurate PAW solution. Similar, or better, quality results can now be arrived at with Abinit and GPAW.
The quality of the PAW atomic setups matters a lot. Compare the results for Abinit (blue bars in the graph) with different PAW libraries. I think this explains why VASP has remained so popular – only recently did PAW-libraries which surpass VASP’s built-in one become available.
The PAW setups for GPAW are of comparable quality to VASP’s, but GPAW’s grid approach seems to be detrimental to numerical precision. GPAW with plane-wave (PW) basis gets 1.7 meV/atom vs 3.3 meV/atom using finite differences.
OpenMX (pseudo-atomic orbitals + norm-conserving PPs) performs surprisingly well, matching the PAW results. I noticed that the calculations employed very large basis sets, though, which should slow down the speed significantly.

Another relevant aspect is the relative speed of the different codes. Do you have to trade speed for precision? The paper does not mention the accumulated runtime for the different data sets, which would otherwise have made an interesting “price/performance” analysis possible.

Before, I tried to compare the absolute performance and the parallel caling of Abinit and VASP, reaching the conclusion that Abinit was significant slower. Perhaps the improved precision is the reason why? Regarding GPAW, I know, from unpublished results, that GPAW exhibits similar parallel scaling to VASP and matches the per core performance, but SCF convergence can be an issue. OpenMX can be extremely fast compared to plane-wave codes, but the final outcome critically depends on the choice of the basis set.

I am putting GPAW and OpenMX on my list of codes to benchmark this year.

Live profiling on Triolith with "perf"

2014-02-06T00:00:00+01:00

On Triolith, we have the Perf profiler tool installed on all computer nodes. It is pretty neat, because it allows you to look into your running jobs and see what they are doing, without recompiling or doing a special profiling run. This can be quite useful for locating bottlenecks in the code and to quickly check whether jobs appears to be running efficiently.

Here is a rundown on how do it. Suppose we are running a job on Triolith. First, we need to find out on which nodes the job is running on. This information is availble in the squeue output in the “NODELIST” column.

[pla@triolith1 ~]$ squeue -u pla
 JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 1712173  triolith _interac      pla   R       0:17      2 n[2-3]

If you are running a job on a node, you are allowed to use ssh to log in there and check what is going on. Do that!

[pla@triolith1 ~]$ ssh n2
....(login message)...
[pla@n2 ~]$

Now, running the top command on the node, will show us that they we are busy running VASP here, as expected.

The next step is to run perf top instead. It will show us a similar “top view”, but of the subroutines running inside all of the processes running on the node. Once you have started perf top, you will have to wait at least a few seconds to allow the monitor to collect some samples before you get something representative.

If your program is compiled to preserve subroutine names, you will see a continuously updating list of the “hot” subroutines in your program (like above) even including calls to external libraries such as MKL and MPI. The leftmost percentage number is the approximate amount of time that VASP, in this case, is spending in that particular subroutine. This specific profile looks ok, and is what I would expect for a properly sized VASP run. The program is spending most of the time inside libmkl_avx.so doing BLAS, LAPACK, and FFTWs operations, and we see some a moderate amount of time (about 10% in total) in libmpi.so doing and waiting for network communications.

For something more pathological, we can look at a Quantum Espresso phonon calculation, which I am deliberately running on too many cores.

Here, something is wrong, because almost 70% of the time seems to be spent inside the MPI communications library. There is actually very little computation being done – these compute nodes are just passing data back and forth. This is usually an indication that the job is not parallelizing well, and that you should run it on less nodes, or at least use less cores per node. In fact, here I was running a phonon job of a simple metal on 32 cores on 2 compute nodes. The runtime was 1m29s, but it would have run just as fast (1m27s) on a single compute node with just 4 cores. The serial runtime, for comparison, was 4m20s. Now, 1 minute on 1 compute is not much time saved, but imagine the effect if this was a job running on 16 compute nodes for one week. That is a saving of 20,000 core hours.

There are much more things you can do with perf, for example, gathering statistics from processor performance counters using perf stat, but for starters, I would suggest using it as a routine check when preparing new jobs to run on the cluster. For big jobs using hundreds of thousands of cores, I would always recommend doing a real parallel scaling study, but for small jobs, it might not be worth it. That is when perf top comes in handy.