Running VASP on Nvidia GPUs

Today at Supercomputing 15, the coming release of an official GPU version of VASP 5.4.1 was announced by Nvidia. Both standard DFT and hybrid DFT with Hartree-Fock is GPU accelerated, but there is no GPU-support for GW calculations yet. I have played around with the beta release of GPU-VASP and the speed-up I see when adding 2 K40 GPUs to a dual-socket Sandy Bridge compute node varies between 1.4x to 8.0x, depending on the cell size and the choice of algorithm.

For a long time, VASP was shown in Nvidia’s marketing information as already ported to GPU, despite not being generally available. In fact, I often got questions about it, but had to explain to our users that there were several independently developed prototype versions of VASP with code that had not yet been accepted into the main VASP codebase. But now, an official GPU version is finally happening, and the goal is that it will be generally available to users by the end of the year. No information is available on the VASP home page yet, but I assume that more information will come eventually.

The GPU version is a collaborative effort involving people from several research groups and companies. The list of contributors includes University of Wien, University of Chicago, ENS-Lyon, IFPEN, CMU, RWTH Aachen, ORNL, Materials Design, URCA and NVIDIA. The three key papers, that should be cited when using the GPU version are:

  • Speeding up plane-wave electronic-structure calculations using graphics-processing units. Maintz, Eck, Dronskowski. (2011)
  • VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron. Hutchinson, Widom. (2011)
  • Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units. Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard. (2012)

The history of GPU-VASP, as I have understood it, is that after the initial porting work by research groups mentioned above, Nvidia got involved and worked on optimizing the GPU parts, which eventually lead to the acceptance of the GPU code into the main codebase by the VASP developers and subsequently to the launch of the beta testing program coordinated by Nvidia. It is encouraging to see the involvement by Nvidia and I think this is an excellent example of community outreach and industry-academia collaboration. I hope we will see more of this in the future with involvement from other companies. Electronic structure software is, after all, a major workload at many HPC centers. For Intel’s Xeon Phi, VASP is listed as a “work in progress” with involvement from Zuse Institute in Berlin, so we will likely see further vectorization and OpenMP parallelization aimed at manycore architectures as well. I think the fact the GPU version performs as well as it does (see more below) is an indication that there is much potential out there for the CPU version too, in terms of optimization.

Beta-testing GPU-VASP

I have been part of the beta testing program for GPU-VASP. The analysis in this post will approach the subject from two perspectives. The first one is the buyer’s perspective. Does it make economical sense to start looking at GPUs for running VASP? This is the question that we at NSC face as an academic HPC center when we are buying systems for our users. The second perspective is the experience from the user perspective. Does it work? How does it differ from the regular VASP version?

The short answers for the impatient are: 1) possibly, the price/performance might be there given aggressive GPU pricing 2) yes, for a typical DFT calculation, you only need to adjust some parameters in the INCAR file, most importantly NSIM, and then launch VASP as usual.

Hardware setup

The tests were performed on the upcoming GPU partition of NSC’s Triolith system. The compute nodes there have dual-socket Intel Xeon E5-2660 “Sandy Bridge” processors, 64 GB of memory and Nvidia K20 or K40 GPUs. The main difference between the K20 and the K40 is the amount of memory on the card: the K20 has 6 GB and the K40 has 12 GB. VASP uses quite a lot of GPU memory, so with only 6 GB of memory you might see some limitations. For example, the GaAsBi 256 atom test job below used up about 9300 MB per card when running on a single node. It was possible to run smaller jobs on the K20s, though.

I ran most of the tests with the default GPU clock speed of 745 Mhz, but out of curiosity, I also tried to clock up the cards to 875 Mhz with the nvidia-smi utility

$ nvidia-smi -ac 3004,875

It didn’t seem to cause any problems with cooling or stability, and produced a nice 10 % gain in speed. The GPUs are rated for up to 235 Watt, but I never saw them use more than ca 180 W on average during VASP jobs.

Compiling the GPU version

A new build system was introduced with VASP 5.4. When the makefile.include is set up, you can compile the different versions of VASP (regular, gamma-point only, noncollinear) by giving arguments to the make command, e.g.

make std

With GPU-VASP, there is new kind of VASP executable defined in the makefile, called gpu, so the command to compile the GPU version is simply

make gpu

I would recommend sticking to compilation in serial mode. I tried using my old trick of running make -j4 repeatedly to resolve all dependencies, but the new build process does not work as well in parallel, you can get errors during the rsync stages when files are copied between directories.

To compile any program with CUDA, such as GPU-VASP, you need to have the CUDA developer tools installed. They are typically found in /usr/local/cuda-{version} and that is also where you will find them on the Triolith compute nodes. If there are no module files, you can add the relevant directory to your PATH yourself. In this case, CUDA version 7.5:

export PATH=/usr/local/cuda-7.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-7.5/lib64:$LD_LIBRARY_PATH

I tested with CUDA 6.5 in the beginning, and that seemed to work too, but VASP ran significantly faster when I reran the benchmarks with CUDA 7.5 later. Once you have CUDA set up, the critical command to look for is the Nvidia CUDA compiler, called nvcc. The makefiles will use that program to compile and link the CUDA kernels.

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

There is a configuration file for the GPU version in the arch/ directory called makefile.include.linux_intel_cuda which you can use a starting point for configuration. I did not have to make much changes to compile on Triolith. In addition to the standard things like compiler names and flags, one should point out the path to the CUDA tools.

CUDA_ROOT  := /usr/local/cuda-7.5


When you log in to a GPU compute node, it is not obvious where to “find” the GPUs and how many there are. There is a utility called nvidia-smi which can be used to inspect the state of the GPUs. Above, I used it for overclocking, but you can also do other things, such as listing the GPUs attached to the system:

[pla@n1593 ~]$ nvidia-smi -L
GPU 0: Tesla K40m (UUID: GPU-f4e02ffa-b01c-1e3e-ebdb-46e1fef83ce6)
GPU 1: Tesla K40m (UUID: GPU-35bee978-8707-1957-12e2-bddda324da88)

And to look at a running job and see how much power and GPU memory that is being used there is nvidia-smi -l

| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K40m          Off  | 0000:08:00.0     Off |                    0 |
| N/A   39C    P0   124W / 235W |    558MiB / 11519MiB |     97%      Default |
|   1  Tesla K40m          Off  | 0000:27:00.0     Off |                    0 |
| N/A   40C    P0   133W / 235W |    558MiB / 11519MiB |     98%      Default |

The most important thing to make VASP run efficiently is to make sure that you have the MPS system active on the node. MPS is the “Multi Process Service” – it virtualizes the GPU so that many MPI ranks can access the GPU independently without of having to wait for each other. Nvidia has an overview (PDF file) on their web site describing MPS and how set it up. Basically, what you have to do as user is to check if the nvidia-cuda-mps-control process is running. If it is not, you have to start it yourself, before starting your VASP job.

$ mkdir /tmp/nvidia-mps
$ export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps 
$ mkdir /tmp/nvidia-log
$ export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log 
$ nvidia-cuda-mps-control -d

Except for the initialization above, which you only need to do once on the node, you run VASP as usual, using mpiexec.hydra or a similar command, to start an MPI job.

mpiexec.hydra -n 8 vasp_gpu

You should see messages in the beginning of the program output telling you that GPU has been initialized:

Using device 1 (rank 3) : Tesla K40m
Using device 1 (rank 2) : Tesla K40m
Using device 0 (rank 1) : Tesla K40m
Using device 0 (rank 0) : Tesla K40m
running on    4 total cores
distrk:  each k-point on    4 cores,    1 groups
distr:  one band on    1 cores,    4 groups
using from now: INCAR
creating 32 CUDA streams...
creating 32 CUFFT plans with grid size 36 x 40 x 48...

(Please note that these might go away or look differently in the final release)

VASP Test Suite

I attempted to run through my VASP test suite with GPU version, but many of the test cases require running with LREAL=.FALSE., so the total energies are different. Despite that, and comparing a few other runs, I did not see any significant discrepancies between the CPU and GPU versions. More than 100 different test cases have been used during the acceptance testing, in addition to the testing done by the beta testers, so we can be certain that the major bugs have been found at this stage.

Performance of standard DFT calculations

I have only performed single-node benchmarks so far, so the focus is comparing the speed when running with CPUs only vs. CPUs+GPUs. The K40 node has 2 GPUs and 2 CPU sockets, with 1 GPU attached to each socket, so the comparison is 16 cores vs (any number of cores) and 2 GPUs. Typically, I found that using 8 MPI ranks (i.e. 8 out of 16 cores on a Triolith node) sharing 2 GPUs using MPS was the fastest combination for regular DFT jobs.

I tested regular DFT by using the old test case of GaAsBi 256 atoms (9 k-points) on which I have collected lots of data on for Triolith (Intel “Sandy Bridge”) and Beskow (Cray XC40 with Intel “Haswell”). As mentioned above, that was close to the biggest job I could run on a single compute node with 2 GPUs due to memory limitations. The reference run with CPUs completed in 7546 seconds on 1 node. With 8 cores and 2 K40 GPUs, it runs remarkably faster and finishes in 1242 seconds, which is around 6 times faster. Here, I used overclocked GPUs (875 Mhz), so at base frequency it is around 10% slower. For reference, with 8 compute nodes using no GPUs, the GaAsBi-256 job completes in 900 seconds on Triolith and in 400 seconds on Beskow.

Interestingly, GPU-VASP is especially strong on the Davidson algorithm. ALGO=fast runs about 50% faster in CPU mode, but with GPUs, there is very little difference, so if you rely on ALGO=normal for getting convergence, there is good news. If you calculate the speed-up for ALGO=normal, it is therefore higher, around 8x faster.

The NSIM parameter is now very important for performance. Theoretically, the GPU calculations will run faster the higher the value of NSIM you set, with the drawback being that the memory consumption on the GPUs increase with higher NSIM as well. The recommendation from the developers is that you should increase NSIM as much you can until you run out of memory. This can require some experimentation, where you have to launch your VASP job and the follow the memory use with the nvidia-smi -l command. I generally had to stop at NSIM values of 16-32.

Packing a relatively big cell like this (256 atoms) on a single node is probably the poster case for the GPU version, as you need sufficiently large chunks of work offloaded onto the GPU in order to account for the overhead of sending data back and forth between the main memory and the GPU. As an example of what happens when there is too little work to parallelize, we can consider the 128-atom Li2FeSiO4 test case in the gamma point only. I can run that one on a CPU node with the gamma point version of VASP in around 260 seconds. The GPU version, which has no gamma point optimizations, clocks in 185 seconds, even with 2 K40s GPUs, for an effective improvement of only 40%.

Of course, one can argue that this is an apples to oranges comparison, and that you should compare with the slower VASP NGZhalf runtime for the CPU-version (which is 428 seconds or half the speed of the GPU run), but the gamma-point only optimization is available in the CPU version today, and my personal opinion is that the symmetries enforced in the gamma point version produces more accurate results, even if they might be different.

Performance of hybrid DFT calculations

Hybrid DFT calculation incorporating Hartree-Fock, perhaps screened such as in HSE06, are becoming close to the standard nowadays, as regular DFT is no longer considered state of the art. Part of the reason is the availability of more computing resources, as an HSE06 calculation can easily take 100 times longer to run. Speeding up hybrid calculation was one of the original motivations for GPU-accelerating VASP, so I was curious to test this out. Unfortunately, I had lots of problems with memory leaks and crashes in the early beta versions, so I had a hard time getting any interesting test cases to run. Eventually, though, these bugs were ironed out in the last beta release, enabling me to start testing HSE06 calculations, but the findings here should be considered preliminary for now.

In my tests, I found that 4-8 MPI ranks was optimal for hybrid DFT. The reason for hybrid jobs being able to get along with less CPU cores is that the Hartree-Fock part is the dominant part, and it runs completely on the GPU, so there should be some efficiency gain by having less MPI ranks competing for the GPU resources. For really big jobs, I was told by Maxwell Hutchinson, one of the exact-exchange GPU developers, that having 1 GPU per MPI rank should be the best way to run, but I have not been able to confirm that yet.

Setting NSIM is even more important here, the recommendation is

NSIM = NBANDS / (2*cores)

So you need to have lots of bands in order to fully utilize the GPUs.

The test case here was an MgO cell with 63 atoms, 192 bands and 4 k-points. These are quite heavy calculations so I had to resort to timing individual SCF iterations to get some results quickly. With 16 CPU cores only, using ALGO=all one SCF iteration requires around 900 seconds. When switching to 4 cores and 2 GPU:s (2 cores per GPU), I get the time for one SCF iteration down to around 640 seconds, which is faster, but not a spectacular improvement (40-50%). But note that the same SCF algorithm is not being used here, so actual number of iterations required to converge might be different, which affects the total runtime. I have actually not tested running HSE06 calculations with ALGO=normal before, as it is not the standard way to run, so I cannot say right now whether to expect faster convergence with ALGO=normal. The underlying problem, as I have understood it, is that a job like this launches lots of small CUDA kernels, and although there are many of them, they cannot effectively saturate the GPU. The situation should be better for larger cells, but I have not been able to run these test yet, as I only had a few nodes to play with.

Summary and reflections

It is well known that making a fair comparison between CPU and GPU computing is very challenging. The conclusion you come to is to a large extent dependent on which question you ask and what you are measuring. The whole issue is also complicated by the fact that many of the improvements made to the code during the GPU porting can be back-ported to the CPU version, so the process of porting a code to GPU might itself, paradoxically, weaken the (economical) case for running on GPUs, as the associated gain might make the CPU performance just good enough to compete with GPUs.

From a price/performance point of view, one should remember that a GPU-equipped node is likely to be much more expensive and use more power when running. In a big procurement of an HPC resource, the final pricing is always subject to some negotiation and discounting, but if one looks at the list prices, a standard 2-socket compute node with 2 K40 GPUs is likely to cost at least 2-3 times as much as without the GPUs. One must also take into consideration that such a GPU node might not always run GPU-accelerated codes and/or be idling some part of the time due to a lack of appropriate jobs. Consequently, the average GPU workload that runs on a GPU partition in an HPC cluster must run a lot faster to make up for the higher cost. In practice, a 2-3x speedup for a breakeven with respect to pricing is probably not enough to make it economically viable, instead we are looking at maybe 4-6x. The good news is, of course, that certain VASP workloads (such as normal DFT and molecular dynamics on big cells) do meet this requirement.

Another perspective that should not be forgotten is that the compute power of a single workstation running VASP can be improved a lot with GPUs. This is perhaps not as relevant for “big” HPC, but it significantly increases the total compute power and job capacity that a VASP user without HPC access can easily acquire. From a maintenance and system administration perspective, there is a big jump in moving from a single workstation to a full-fledged multi-node cluster. A cluster needs rack space, a queue system, some kind of shared storage etc. The typical scientist, will not be able to set up such a system easily. The 2-socket workstation of yesterday was probably not sufficient for state of the art VASP calculations, but with, let us say, an average 4x improvement with GPUs, it might be viable for certain kinds of research level calculations.

From a user and scientific point of view, the GPU version of VASP seems ready for wider adoption. It works and is able to reproduce the output of the regular VASP. Running it requires making some change of settings in the input files, which unfortunately can make the job suboptimal when running on CPUs. But it has always been the case, that you need to adjust the INCAR parameters to get the best out of a parallel run, so that is nothing new.

In conclusion, it would not surprise me if the availability of VASP with CUDA support might be a watershed event for the adoption of Nvidia’s GPUs, since VASP is such a big workload on many HPC centers around the world. For us, for example, it has definitely made us consider GPUs for the next generation cluster that will eventually replace Triolith.

P.S. 2015-11-23: The talk by Max Hutchinson at SC15 about VASP for GPU is now available online.

SC15 Conference

Next week, I am going to Austin to participate in SC15 (the supercomputing conference). I will also be at Intel’s HPC developer conference. I am always interested in meeting other people working with ab initio/electronic structure software. Send me an email if you want to meet up.

New Version of VASP - 5.4.1

A new version of VASP, denoted vasp.5.4.1 24Jun15 was released during the summer. The release was a bit stealthy, because there was no mention of it on the VASP home page until announcements of “bugfixes for vasp.5.4.1” showed up. There seems to be no official release notes published either, but the announcement email contains the following list of improvements and changes:

  • Interface to the solvation model code VASPsol of Mathew and Hennig.
  • Bugfix in the symmetry detection
  • Support for NCORE≠1 for hybrid functional calculations.
  • Bugfix in pead.F: macroscopic dielectric properties (LCALCEPS=.TRUE.) didn’t work with LREAL≠.FALSE.
  • Improvements to hybrid calculations: less memory use when run at large scale.
  • A new build system which simplifies the compilation of VASP. There is now a separate build directory and you can compile the three usual versions with distinct make commands.

Since the original 5.4.1 release, there has also been two patches released:

The first installations of VASP 5.4.1 binaries at NSC is available in the usual place, so you can put the following in your job scripts:

mpprun /software/apps/vasp/5.4.1-24Jun15/build04/vasp

Build04 includes both of the patches, but build01-02 only have the first patch. You can also do module load vasp/5.4.1-24Jun15, if you prefer to use modules. That command currently loads build04.

I believe that the new version is safe to use. I ran through the test suite that I have and saw identical output in most tests and small deviations for Cu-fcc and Si with GW. I especially recommend to try out 5.4.1 if you are struggling with large hybrid functional calculations and employ many compute nodes in the process. In repeating some of the GaAsBi benchmark runs with 5.4.1, I found that HSE06 is now significantly faster, and uses less memory per MPI rank, so you might be able use more cores per compute node without running out of memory. To exemplify, with 5.3.5, I was able to run my GaAsBi-512 atom test system on 96 Triolith compute nodes with 8 cores per node in 9812 seconds, but with 5.4.1, it completes in 6719 seconds using the same configuration. That is a 46% improvement in speed! Furthermore, it possible to scale up the calculation further to 128 nodes and 12c/node, which I was not able to do before due to memory shortage. I think this is good news for the people running large calculations, especially on Beskow.

P.S. A comment on the lack of updates to the blog:

Over the year, I have gradually moved on to another position at NSC. Today, I work as partner manager, developing NSC’s external collaborations with partners such as SMHI and Saab Group. I intend to keep publishing benchmarks and recommendations on the blog, as we move into the process of replacing Triolith, but the update frequency will likely be lower in the future. Weine Olovsson at NSC is taking over most of the actual support duties for VASP.

VASP Seminar at Uppsala University

Tomorrow (January 15th), I will be visiting Uppsala University and give some lectures on running VASP efficiently on the clusters available in Sweden. The seminar is being organized by NSC and UPPMAX, mainly for SNIC users, but everyone interested is welcome. It is not necessary to register in advance. You can find more information on the SNIC web page.

Most of the content that I will present is already available here, scattered over many blog posts. If you check out the blog post archives, you will find posts covering VASP and how certain input parameters affect performance. For information about compiling VASP, I have a special section on the web page called Compile VASP. It now has a new guide for the Beskow system at PDC in Stockholm. I will show some preliminary benchmark results from there and we will talk about what kind of simulations that are suitable for running at large scale on such a computer.

There is also a dedicated time slot for a Q&A session. It is a good time to ask questions, not only about VASP, as several of the SNIC application experts will be there.

I am looking forward to seeing you there!

Update 2015-01-16: Thank you to everyone who participated. It is always nice to return to Uppsala. By request, here are the slides from the sessions:

VASP on Cray XC-40 Beskow: Preliminary Benchmark Results

During the Christmas holidays, I had the oppurtunity to run some VASP benchmarks on Beskow, the Cray XC40 supercomputer currently being installed at PDC in Stockholm. The aim was to develop guidelines for VASP users with time allocations there. Beskow is a significant addition in terms of aggregated core hours available to Swedish researchers, so many of heavy users of supercomputing in Sweden, like the electronic structure community, were granted time there.

For this benchmarking round, I developed a new set of tests to gather data on the relationship between simulation cell size and the appropriate number of cores. There has been concerns that almost no ordinary VASP jobs would be meaningful to run on the Cray machine, because they would be too small to fit into the minimal allocation of 1024 cores (or 32 compute nodes, in my interpretation). Fortunately, my initial results show that this is not case, especially if you use k-point parallelization.

The tests consisted of GaAs supercells of varying sizes, doped with a single Bi atom. The cells and many of the settings are picked from a research paper, to make it more realistic. The supercell sizes were successive doublings of 64, 128, 256, and finally 512 atoms, with a decreasing number of k-points in the Monkhorst-Pack grids (36, 12, 9, 4). I think it is a quite realistic example of a supercell convergence study.

Chart of GaAsBi supercell timings on Beskow

First, please note the scales on the axes. The y-axis is reversed time in logarithmic scale, so upwards represents faster speed. Similarly, the x-axis is the number of compute nodes (with 32 cores each) in log-2 scale. A 256-node calculation is potentially 8192 cores! But in this case, I used 24 cores/node, except for the biggest HSE06 cells, where I had to reduce the number cores per node to gain more memory per MPI rank. The solid lines are PBE calculations and the dashed lines HSE06 calculations. Note the very consistent picture of parallel displacement of the scaling curves: bigger cells take longer time to run and scale to a larger number of compute nodes (although the increase is surprisingly small). The deviations from the straight lines comes from outlier cases where I had to use a sub-optimal value of KPAR, for example, with 128 atoms, 32 nodes, and 12 k-points, I had to use KPAR=4 instead of KPAR=12 to balance the number of k-points. For real production calculations, you could easily avoid such combinations.

The influential settings in the INCAR file were:

NCORE = cores/node (typically 24)
KPAR = MIN(number of nodes,number of k-points)
NSIM = 2
NBANDS = 192, 384, 768, 1536 (which are all divisible by 24)
LREAL = Auto
ALGO = Fast (Damped for HSE06)

There were no other special tricks involved, such as setting MPI environment variables or custom MPI rank placement. It is just standard VASP calculations with minimal input files and careful settings of the key parameters. I actually have more data points for even bigger runs, but I have chosen to cut off the curves where the parallel efficiency fell too much, usually to less than 50%. In my opinion, it is difficult to motivate a lower efficiency target than that. So what you see is the realistic range of compute nodes you should employ to simulate a cell of a given size.

If we apply the limit of 32 nodes, we see that a 64-atom GGA calculation might be a borderline case, which is simply to small to run on Beskow, but 128 atoms and more scale well up to 32 compute nodes, which is half a chassis on the Cray XC40. If you use hybrid DFT, such as HSE06, you should be able to run on 64 nodes (1 chassis) without problem, perhaps even up to 4 chassis with big supercells. In that case, though, you will run into problems with memory, because it seems that the memory use of an HSE06 calculation increase linearly by the number of cores I use. I don’t know if it is a bug, or if the algorithm is actually designed that way, but it is worth keeping in mind when using hybrid functionals in VASP. Sometimes, the solution to an out of memory problem is to decrease the number of nodes.

In addition to parallel scaling, we are also interested in the actual runtime. In some ways, it is impressive. Smaller GGA calculations can complete 1 SCF cycle in less than 1 minute, and even relatively large hybrid-DFT jobs can generally be tuned to complete 1 SCF cycle in per hour. In other ways it is less impressive. While we can employ more compute nodes to speed up bigger cells, in general, we cannot always make a bigger system run as fast as a smaller by just adding more compute nodes. For example, an HSE06 calculation will take about two orders of magnitude longer time to run than a GGA calculation, but unfortunately, it cannot also make use of two orders of magnitude more compute nodes efficiently. Therefore, large hybrid calculations will remain a challenge to run until the parallelization in VASP is improved, especially with regards to memory consumption.

Selecting the Right Number of Cores for a VASP Calculation

A frequent question I encounter supporting VASP users is: “I have a cell with X atoms and Y electrons. How many compute nodes (or cores) should I choose for my simulation?”

It is an important question because using too many cores will be inefficient and the result is less number of jobs completed in a given computer time allocation.

Currently, there is only MPI parallelization in VASP, so by “cores”, I mean the number of MPI ranks or processes, i.e. the number you give to the mpirun command using the -n command line flag or the number of cores you request in the queue system.

Besides the suggestion of actually testing it out and finding a good number of cores, the main rule of thumb that I have been telling people is:

number of cores = number of atoms

This is almost always safe, and will not waste computer time. Typically, it will ensure a parallel efficiency of at least 80%. This is of course a very unscientific and handwavy rule, but it has a certain pedagogical elegance, because it is easy to remember, and you don’t need to look up any other technical parameters.

Let’s now look into how you could make a more accurate estimate. VASP has three levels of parallelization: over k-points, over bands, and over plane-wave coefficients (or equivalently Fast-Fourier transforms). You need to ensure that when the work is split up over several compute nodes, there is a sufficient amount of work allocated to each processor core, otherwise, they will just spend time waiting for more work to arrive. The fundamental numbers to be aware of are therefore:

  • The number of k-points
  • The number of bands (determined indirectly by the number of atoms and electrons)
  • The size of the basis set (i.e. number of plane waves, which corresponds to the number of grid points in the FFTs).

If you can estimate, or derive these numbers for your calculation, you can more precisely guess a suitable number of cores to use.

Bands and cores

The first step is to consider the number of bands (NBANDS). VASP has parallelization over bands (controlled by the NPAR tag). The ultimate limit is 1 band per core. So, for example, if you have 100 bands, you cannot run on more than 100 cores and expect it to work well. What I have seen, in my scaling tests, though, is that 1 band per core is too little work for a modern processor. You need to have at least 2 bands per core to reach more than 50% efficiency. A conservative choice is 8 bands/core. That will give you closer to 90% efficiency.

number of cores = NBANDS / 8

So how does this relate to the rule of thumb above? By apply it, you will arrive at a number of bands per core equal to the average number of valence electrons per atom in your calculation. If we assume that the typical VASP calculation has about 4-8 valence electrons per atom, this will land us in the ballpark of 4-8 bands/core, which is usually ok.

Let’s now try to apply this principle:

Example 1: We have a cell with 500 bands and a cluster with compute nodes having 16 cores per node. We aim for 8 bands/core, which unfortunately means 62.5 cores. It is better to have even numbers, so we increase the number of bands to 512 by setting NBANDS=512 in the INCAR file and allocate 64 cores, or 4 compute nodes.

Example 2: Suppose that you want to speed up the calculation in the previous example. You need the results fast, and care less about efficiency in terms of the number of core hours spent. You could drop down to 1 band/core (512 cores), but there is really not that much improvement compared to 2 bands/core (256 cores). So it seems like 256 cores is the maximum number possible. But what you can do is to take these 256 MPI processes and spread them out over more compute nodes. This will improve the memory bandwidth available to each MPI process, which usually speed things up. So you can try running on 32 nodes, but using 8 cores/node instead. It could be faster, if the extra communication overhead is not too large.

K-points and KPAR

The next step is to consider the number of k-points. VASP can treat each k-point independently. The number of k-point groups that run in parallel is controlled by the KPAR parameter. The upper limit of KPAR is obviously the number of k-points in your calculation. In theory, the maximum number of cores you can run on using combined k-point- and band-parallelization is NBANDS * KPAR. So for example, 500 bands and 10 k-points would allow up to 5000 cores, in principle. In practice, though, k-point parallelization does not scale that well. What I have found on the Triolith and Beskow systems is that supplying KPAR=compute nodes usually allows you to run on twice as many cores as you determined in the previous step, regardless of the actual value of KPAR. I would not recommend attempting run with KPAR>compute nodes, even though you may have more k-points than compute nodes.

(Note: A side of effect of this is that the most effective number of bands/core when using k-point parallelization is higher than without. This is likely due to the combined overhead of using two parallelization methods.)

Example 3: Consider the 500 bands cell above. 64 cores was a good choice when using just band parallelization. But you also have 8 k-points. So set KPAR to 8 and double the number of cores to 128 cores (or 8 compute nodes). In this case, we end up with 1 k-point per node, which is a very balanced setup. Note that this may increase the required memory per compute node, as k-point parallelization replicates a lot of data inside each k-point group. If you would run out of memory, the next step would be to lower KPAR to 2 or 4.

Basis set size, LPLANE and NGZ

As a last step, it might be worth considering what the load-balancing of your FFTs will look like. This is covered in the VASP manual in section 8.1. VASP, by default, works with the FFTs in a plane-wise manner (meaning LPLANE=.TRUE.), which reduces the amount of communication needed between MPI ranks. In general, you want to use this feature, as it is typically faster. The 3D FFT:s are split up into 2D planes, where each group (as determined by NPAR) works on a number of planes. It means that ideally, you want NGZ (the number of grid points in the Z direction) to be evenly divisible by NPAR. That will ensure good load-balance.


The second thing to consider is, according to the manual, that NGZ should be sufficiently big for the LPLANE approach to work:

NGZ ≥ 3*(number of cores)/NPAR = 3*NCORE

Since NCORE will be of the same magnitude as the number of cores per compute node, it means that NGZ should be at least 24-96, depending on the node configuration. More concretely, for the following clusters, you should check that the conditions below hold:

NSC Kappa/Matter: NGZ ≥ 24
NSC Triolith: NGZ ≥ 48
PDC Beskow: NGZ ≥ 72 (using 24c/node)

Typically, this is not a big problem. As an example of what NGZ can be, consider a 64-atom supercell of GaAs (113 Å) with a cut-off of 313 eV. Then the small FFT grid is 70x70x70 points. So that is approximately the smallest cell that you can run on many nodes without suffering from excessive load imbalance on Beskow. For bigger cells, with more than 100 atoms, NGZ is also usually larger than 100, so there will be no problem in this regard, as long as you stick to the rule of using NPAR=compute nodes or NCORE=cores/node. But you should still check that NGZ is an even number and not, for example, a prime number.

In order to tune NGZ, you have two choices, either adjust ENCUT to a more appropriate number and let VASP recalculate the values of NGX, NGY and NGZ, or switch from specifying the basis set size in terms of an energy cut-off and set the NG{X,Y,Z} parameters yourself directy in the INCAR file instead. For a very small system, with NGZ falling below the threshold above, you can also consider lowering the number of cores per node and adjusting NCORE accordingly. For example, on Triolith, using 12 cores/node and NCORE=12 would lower the threshold for NGZ to 36, which enables you to run a small system over many nodes.


  • Check the number of bands (NBANDS). The number of bands divided by 8 is a good starting guess for the number of cores to employ in your calculation.
  • If you have more than one k-point, set KPAR to the number of compute nodes or the number of k-points, whichever is the smallest number. Then double the amount of cores determined in the previous step.
  • Make a test run and check the value of NGZ, it should be an even number and sufficiently big (larger than 3*cores/node). Adjust either the basis set size or the number of cores/node.

How to Compile VASP on the Cray XC40

Recently, I have been working on making VASP installations for the new Cray XC40 (“Beskow”) at PDC in Stockholm. Here are some instructions for making a basic installation of VASP 5.3.5 using the Intel compiler. Some of it might be specific to the Cray at PDC, but Cray has a similar environment on all machines, so I expect it to be generally useful as well. My method of compiling VASP produces binaries which are around 30-50% faster than the ones that were provided to us by Cray, so I really recommend making the effort to recompile if you are a heavy VASP user.

If you have an account on Beskow, my binaries are available in the regular VASP module:

module load vasp/5.3.5-31Mar14 

The installation path is (as of now, it might change when the system becomes publically available):


You can find the makefiles and some README files too.

Summary of the findings

  • VASP compiles fine with module PrgEnv/intel and MKL on Cray XC-40.
  • Using MKL is still signficantly better, especially for the FFTW routines.
  • Optimization level -O2 -xCORE-AVX2 is enough to get good speed.
  • VASP does not seem to be helped much by AVX2 instructions (small matrices and limited by memory bandwidth).
  • A SCALAPACK blocking factor NB of 64 seems appropriate.
  • MPI_BLOCK should be increased as usual, 64kb is a good number.
  • Enabling MKL’s conditional bitwise reproducibility at the AVX2 level does not hurt performance, it may even be faster than running in automatic mode.
  • Memory “hugepages” does not seem to improve performance of VASP.
  • The compiler flags -DRPROMO_DGEMV and -DRACCMU_DGEMV have very little effect on speed.
  • Hyper-threading (symmetric multi-threading) does not improve performance, the overhead of running twice as many MPI ranks is too high.
  • Multithreading in MKL does not improve performance either.

Preparations for compiling

First, download the prerequisite source tarballs from the VASP home page: 

You need both the regular VASP source code, and the supporting “vasp 5” library:


I suggest to make a new directory called e.g. vasp.5.3.5, where you download and expand them. You would type commands approximately like this:

mkdir 5.3.5
cd 5.3.5
tar zxvf vasp.5.3.5.tar.gz
tar zxvf vasp.5.lib.tar.gz

This will set you up with the source code for VASP.

Load modules for compilers and libraries

The traditional compiler for VASP is Intel’s Fortran compiler (ifort command), so we will stick with Intel’s Fortran compiler in this guide. In the Cray environment, this module is called “PrgEnv-Intel”. Typically, PGI or Cray is the default preloaded compiler, so we have to swap compiler modules.

module swap PrgEnv-cray PrgEnv-intel/5.2.40

Check which version of the compiler you have by typing “ifort -v”:

$ ifort -v
ifort version 14.0.4

If you have the PrgEnv-intel/5.2.40 module loaded, it should state 14.0.4. This version can compile VASP with some special rules in the makefile (see compiler status for more information). Please note that the Fortran compiler command you should use to compile is always called ftn on the Cray (regardless of the module loaded).

We are going to use Intel’s math kernel library (MKL) for BLAS, LAPACK and FFTW, so we unload Cray’s LibSci, to be on the safe side.

module unload cray-libsci

Then I add these modules, to nail everything down:

module load cray-mpich/7.0.4
module load craype-haswell

This select Cray’s MPI library, which should be default, and the sets up the environment to compile for XC-40.

VASP 5 lib

Compiling the VASP 5 library is straightforward. It contains some timing and IO routines, necessary for VASP, and LINPACK. Just download my makefile for the VASP library into the vasp.5.lib directory and run the make command.

cd vasp.5.lib

When it is finished, there should be a file called libdmy.a in the directory. Leave it there, as the main VASP compilation picks it up automatically.

Editing the main VASP makefile

Go to the vasp.5.3 directory and download the main makefile.

cd vasp.5.3

I recommend that you edit the -DHOST variable in the makefile to something that you will recognize, like the machine name. The reason is that this piece of text is written out in the top of OUTCAR files.

   -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
   -DMPI_BLOCK=65536 -Duse_collective -DscaLAPACK \

You will usually need three different versions of VASP: the regular one, the gamma-point only version, and one for spin-orbit and/or non-collinear calculations. These are produced by the following combinations of precompiler flags that you have to put into the CPP line in the makefile:

regular:       -DNGZhalf
gamma-point:   -DNGZhalf -DwNGZhalf
non-collinear: (nothing)

At the Swedish HPC sites, we install and name the different binaries vasp, vasp-gamma, and vasp-noncollinear, but this is optional.


VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:

make -f makefile.vasp5lib.crayxc40

If you really want to speed it up, you can try something like:

nice make -j4; make -j4; make -j4; make -j4;

Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”. Rename it immediately after you are finished, otherwise it will get destroyed when you type make clean to compile the other VASP versions.


The Cray compiler environment produces statically linked binaries by default, since this is the most convenient way to run on the Cray compute nodes, so we just have to put e.g. the following in the job script to run on 2048 cores using 64 compute nodes with 32 cores per node and 16 cores per socket:

aprun -n 2048 -N 32 -S 16 /path/to/vasp

Normally, I would recommend lowering the number of cores per compute node. This will often make the calculation run faster. In the example below, I run with 24 cores per node (12 per socket), which is typically a good choice:

aprun -n 1536 -N 24 -S 12 /path/to/vasp

When running on the Cray XC-40, keeping in mind the basic topology of the fast network connecting the compute nodes. 4 nodes sit together on 1 board, and 16 boards connect to the same chassis (for a total of 64 compute nodes), while any larger job will have to span more than one chassis and/or physical rack, which slows down network communications. Therefore, it is best to keep the number of compute nodes to 64 at most, as few VASP jobs will run efficiently using more nodes than that.

Initial Test of VASP on Intel’s Xeon E5 V3 Processors (“Haswell”)

I finally got around to run some VASP benchmarks on the recently released Intel Xeon E5 v3-processors (see previous post for an overview of the differences vs older Xeon models). I have tested two different configurations:

  • A 16-core node from the coming weather forecasting cluster at NSC to be named “BiFrost”. It is equipped with the Xeon E5-2640v3 processor at 2.6 Ghz, together with 64 GB of 1866 MHz DDR4 memory.
  • The 32-core nodes in the Cray XC40 “Beskow” at machine at PDC. The processor model is Xeon E5-2698-v3. Beskow also has two sockets per node with 64 GB memory, but the speed of the memory is faster: 2133 MHz.

I did a quick compile of VASP with Intel Fortran (version 14 and 15) and ran some single node benchmarks of the 24-atom PbSO4 cell I traditionally use for initial testing.

Relative speed: Haswell vs Sandy Bridge

The initial results are basically along my expectations. The single core performance is very strong thanks to Turbo Boost and the improved memory bandwidth. It is the best I have measured so far: 480 seconds vs 570 seconds earlier for a 3.4 Ghz Sandy Bridge CPU.

When running on all 16 cores and comparing to a Triolith node, the absolute speed is up 30%, which is also approximately +30% improved performance per core, as the number of cores are the same and the effective clock frequency is also very close in practice. The intra-node parallel scaling has improved as well, which is important, because it hints that we will be able to run VASP on more than 16 cores per node without degrading performance too much on multi-core nodes such as the ones in Beskow. In fact, when running the above cell on a single Cray XC-40 node with 32-cores, I do see improvement in speed all the way up to 32 cores:

VASP Intra-node scaling on Beskow

This is a good result, since the paralllel scaling for such a small cell is not that good in the first place.

So overall, you can expect a Beskow compute node to be about twice as fast as a Triolith node when fully loaded. That is logical, as it has twice as many cores, but not something that you can take for granted. However, when running wide parallel jobs, I expect you are likely to find that using 24 cores/node is better, because 24-cores brings 90+% of the potential node performance, while at the same significantly lowering the communication overhead due to having many MPI processes.

A first case study: a 512-atom supercell

I will publish more comprehensive benchmarks later, but here is an indication of what kind of improvement you can expect on Beskow vs Triolith for more realistic production jobs. The test system is a 512-atom GaAs supercell with one Bi defect atom. A cell like this is something you would typically run as part of a convergence study to determine the necessary supercell size.

  • Using 64 compute nodes on Triolith (1024 cores in total) it takes 340s to run a full SCF cycle to convergence (13 iterations)
  • On Beskow, using 64 compute nodes and 1536 cores (24c/node), it takes 160s.

Again, about 2.0x faster per compute node. If you compare it to the old Cray XE6 “Lindgren” at PDC, the difference will be even bigger, between 3-4x faster. But please keep in mind that allocations in SNIC are accounted in core hours, and not node hours, so while your job will run approximately twice as fast on the same number of nodes, the “cost” in core hours is the same.

Visiting University of Antwerp

On Thursday and Friday (23-24 October), I will be visiting the University of Antwerp in Belgium and give some lectures as part of the Specialist course on efficient use of VASP that is being organized by CalcUA and the Flemish Supercomputing Centrum. You can find more information on the course web page. It seems like you need to register in advance to attend the course.

Most of the content that I will present is already available here, as part of several blog posts. If you check out the blog post archives, you will find many posts covering VASP and how certain input parameters affect performance. For information about compiling VASP, I have a special section on the web page called Compile VASP. The system guide for NSC’s Triolith is probably the best to start out with, as the hardware is very similar to what is currently available in Belgium. After the course, I plan to add a few articles based on the new material I prepared for this event.

I am looking forward to seeing you in Antwerp!

Update 2014-10-29: Thank you to everyone who participated. I had a pleasant stay in Antwerp and much enjoyed the discussions. It gave me some new ideas of things to test out. By request, here are the slides from the sessions:

Application Statistics for Triolith

What is Triolith being used for? We have some idea by looking at the computer time applications we get through SNAC, and the support cases we work on also tells us something about what people are doing on Triolith, but the most comprehensive picture is likely to be painted by actually analyzing, in real time, what is running on the compute nodes. During the last service stop, I had the opportunity to examine the low-level logging data of a sizeable set of Triolith compute nodes. I managed to collect a sample of one month of log data from 186 nodes. To get a sense of the scale, it expands to about 2 TB of data uncompressed. Fortunately, when you have access to a supercomputer, you can attack the problem in parallel, so with 186 compute nodes unleashed at the task, it took just 10 minutes.

What you see below is an estimate of the fraction of time that the compute nodes spent running different applications. The time period is August 17th to September 17th, but the relative distribution has been rather stable over the previous months.

ApplicationShare of core hours (%)
VASP 36.6%
Gromacs 10.0%
NEK5000 8.2%
[Not recognized] 6.7%
Ansys (Fluent) 3.3%
Gaussian 3.3%
Dalton 3.1%
C2-Ray 2.4%
Nemo 2.2%
NAMD 2.2%
Python 1.8%
a.out 1.5%
OpenFOAM 1.1%
CPMD 0.9%
EC-Earth 0.9%
KKR 0.8%
Spectral 0.8%
Rosetta 0.7%
CP2K 0.6%
Octopus 0.6%
RSPt 0.5%

Unsurprisingly, we find VASP at the top, accounting for about a third of the computing time. This is the reason why I spend so much time optimizing and investigating VASP – each of per cent of performance improvement is worth a lot of core hours in the cluster. We also have a good deal of molecular dynamics jobs (Gromacs, LAMMPS, NAMD, CPMD, ca 18%) and a steady portion of computational fluids dynamics jobs (Fluent + NEK5000 + OpenFOAM , ca 12%). Quantum chemistry programs, such as Gaussian, GAMESS, and Dalton (8%) catch the eye in the list, as expected, although this was a low month for Gaussian (3%), the usage is often higher (6-7%), competing for the top-5.

It would be interesting to compare this to other supercomputing sites. When talking to people at the SC conference, I get the impression that VASP is major workload at basically all academic sites, although perhaps not as much as 30%. In any case, getting statistics like this is going to be useful to plan application support and the design of future clusters that we buy.


Below follows some technical observations for people interested in the details behind getting the numbers above.

The data is based on collectl process data, but at the logging level, you only see the file name of the binary, so you have to identify a certain software package just by the name of its running binaries. This is easy for certain programs, such as VASP, which are always called vasp-something, but more difficult for others. You can, for example, find the notorious a.out in the list above, which could be any kind of code compiled by the users themselves.

A simple check of the comprehensiveness of the logging is to aggregate all the core hours encountered in the sample, and compare with the maximum amount possible (186 nodes running for 24 hours for 30 days). This number is around 75-85% with my current approach, which means that something might be missing, as Triolith is almost always utilized to >90%. I suspect it is a combination of the sampling resolution at the collectl level, and the fact that I filter out short-running processes (less than 6 minutes) in the data processing stage to reduce noise from background system processes. Certain software packages (like Wien2k and RSPT) run iteratively by launching a new process for each iteration, creating many short-lived processes inside a single job. Many of these are probably not included in the statistics above, which could account for the shortfall.