How Accurate Are Different DFT Codes?

How accurate is DFT in theory and in practice? There have been some reviews of the former, comparing calculations of a given DFT program with experiments, but not as many of the latter – comparing the numerical approximations inherent in different DFT codes. I came across a paper taking both of these aspects into account. The paper is titled “Error Estimates for Solid-State Density-Functional Theory Predictions: An Overview by Means of the Ground-State Elemental Crystals”, written by K. Lejaeghere et al. More information about their project to compare DFT codes can be found on their page at the Center for Molecular Modeling at Ghent University.

Their approach to comparing DFT codes is to look at the root-mean-square error of the equations of state with respect to those from Wien2K. They call this number the “delta factor”. The sample set is the ground-state crystal structures of the elements H–Rn in the periodic table. I have plotted the outcome below, which is to be interpreted as the deviation from a full-potential APW+lo calculation, considered here to be the exact solution. Please note the logarithmic scale on the horizontal axis.

Delta factors for different DFT codes
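As a rough illustration of the metric, here is a small self-contained sketch of how a delta-style number can be computed: take two equation-of-state curves E(V) for the same element from two codes, and evaluate the RMS energy difference over a volume interval around equilibrium. The toy quadratic E(V) curves, all coefficients, and the ±6% interval below are stand-ins for illustration; the paper fits proper equations of state to calculated data points.

```python
import math

# Toy E(V) curves (eV/atom vs A^3/atom) standing in for the fitted
# equations of state from two different codes. The functional form and
# numbers are invented for illustration only.
def eos_code_a(v):
    return 0.05 * (v - 20.0) ** 2

def eos_code_b(v):
    return 0.052 * (v - 20.1) ** 2 + 0.0005

# RMS energy difference over a volume interval around the equilibrium
# volume V0 (here +/- 6% of V0), computed with the trapezoidal rule.
v0, n = 20.0, 2000
lo, hi = 0.94 * v0, 1.06 * v0
h = (hi - lo) / n
acc = 0.0
for i in range(n + 1):
    v = lo + i * h
    w = 0.5 if i in (0, n) else 1.0
    acc += w * h * (eos_code_a(v) - eos_code_b(v)) ** 2
delta = math.sqrt(acc / (hi - lo))  # eV/atom
print(f"delta = {delta * 1000:.2f} meV/atom")
```

With these toy curves the number comes out at a few meV/atom, i.e. the same scale as the differences between real codes in the graph below.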

My observations are:

  • Well-converged PAW calculations with good atomic setups are very accurate. Abinit with the JTH library achieves a delta value of 0.5 meV/atom vs Wien2K. As the authors put it in the paper: “predictions by APW+lo and PAW are for practical purposes identical”.
  • Norm-conserving pseudopotentials (NC) with plane-wave basis set are an order of magnitude worse than PAW. The numerical error is of the same magnitude as the intrinsic error vs experiments for the PBE exchange-correlation potential (23.5 meV/atom).
  • VASP is no longer the most accurate PAW solution. Similar, or better, quality results can now be arrived at with Abinit and GPAW.
  • The quality of the PAW atomic setups matters a lot. Compare the results for Abinit (blue bars in the graph) with different PAW libraries. I think this explains why VASP has remained so popular – only recently did PAW-libraries which surpass VASP’s built-in one become available.
  • The PAW setups for GPAW are of comparable quality to VASP’s, but GPAW’s grid approach seems to be detrimental to numerical precision. GPAW with plane-wave (PW) basis gets 1.7 meV/atom vs 3.3 meV/atom using finite differences.
  • OpenMX (pseudo-atomic orbitals + norm-conserving PPs) performs surprisingly well, matching the PAW results. I noticed that the calculations employed very large basis sets, though, which should slow them down significantly.

Another relevant aspect is the relative speed of the different codes. Do you have to trade speed for precision? The paper does not mention the accumulated runtime for the different data sets, which would otherwise have made an interesting “price/performance” analysis possible.

Before, I tried to compare the absolute performance and the parallel scaling of Abinit and VASP, reaching the conclusion that Abinit was significantly slower. Perhaps the improved precision is the reason why? Regarding GPAW, I know, from unpublished results, that GPAW exhibits similar parallel scaling to VASP and matches the per core performance, but SCF convergence can be an issue. OpenMX can be extremely fast compared to plane-wave codes, but the final outcome critically depends on the choice of the basis set.

I am putting GPAW and OpenMX on my list of codes to benchmark this year.

Live Profiling on Triolith With “Perf”

On Triolith, we have the Perf profiler tool installed on all compute nodes. It is pretty neat, because it allows you to look into your running jobs and see what they are doing, without recompiling or doing a special profiling run. This can be quite useful for locating bottlenecks in the code and for quickly checking whether jobs appear to be running efficiently.

Here is a rundown on how to do it. Suppose we are running a job on Triolith. First, we need to find out which nodes the job is running on. This information is available in the squeue output in the “NODELIST” column.

[pla@triolith1 ~]$ squeue -u pla
 JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 1712173  triolith _interac      pla   R       0:17      2 n[2-3]

If you are running a job on a node, you are allowed to use ssh to log in there and check what is going on. Do that!

[pla@triolith1 ~]$ ssh n2
....(login message)...
[pla@n2 ~]$ 

Now, running the top command on the node will show us that we are busy running VASP here, as expected.

Output of top command showing vasp processes.

The next step is to run perf top instead. It will show us a similar “top view”, but of the subroutines running inside all of the processes running on the node. Once you have started perf top, you will have to wait at least a few seconds to allow the monitor to collect some samples before you get something representative.

Output of perf top sampling of a vasp job.

If your program is compiled to preserve subroutine names, you will see a continuously updating list of the “hot” subroutines in your program (like above), even including calls to external libraries such as MKL and MPI. The leftmost percentage number is the approximate amount of time that VASP, in this case, is spending in that particular subroutine. This specific profile looks ok, and is what I would expect for a properly sized VASP run. The program is spending most of the time inside libmkl_avx.so doing BLAS, LAPACK, and FFT operations, and we see a moderate amount of time (about 10% in total) in libmpi.so doing, and waiting for, network communications.

For something more pathological, we can look at a Quantum Espresso phonon calculation, which I am deliberately running on too many cores.

Output of perf top sampling of a QE Phonon job.

Here, something is wrong, because almost 70% of the time seems to be spent inside the MPI communications library. There is actually very little computation being done – these compute nodes are just passing data back and forth. This is usually an indication that the job is not parallelizing well, and that you should run it on fewer nodes, or at least use fewer cores per node. In fact, here I was running a phonon job of a simple metal on 32 cores on 2 compute nodes. The runtime was 1m29s, but it would have run just as fast (1m27s) on a single compute node with just 4 cores. The serial runtime, for comparison, was 4m20s. Now, 1 minute on 1 compute node is not much time saved, but imagine the effect if this was a job running on 16 compute nodes for one week. That is a saving of 20,000 core hours.

There is much more you can do with perf – for example, gathering statistics from processor performance counters using perf stat – but for starters, I would suggest using it as a routine check when preparing new jobs to run on the cluster. For big jobs using hundreds or thousands of cores, I would always recommend doing a real parallel scaling study, but for small jobs, it might not be worth it. That is when perf top comes in handy.

Running VASP on 9,216 Cores

When I attended the supercomputing conference last year, I talked to many people in the booths and discovered that despite all the talk about petascale and exascale computing, VASP calculations are still a big part of what most supercomputing sites in the world are serving to their users. This seems to apply everywhere, not only in Sweden. But surprisingly, some described VASP as being “hugely problematic”, claiming that the parallel scalability was extremely bad – 64 cores maximum under any circumstances. In my experience, though, there is no problem running VASP on thousands of CPU cores, provided that you have a sufficiently large cell. My rule of thumb is that you need at least one atom per core, so if you have a supercell with a few thousand atoms, you can in fact run it with acceptable speed on current clusters by running it in a massively parallel fashion. It is true that a typical electronic structure calculation is not going to be that big, but I believe there are certain use cases for cells that big, for example when studying very low dopant concentrations, or simulations of nanostructures.

To show what would be possible for our users, provided they had a sufficiently large allocation of core hours, I decided to test the limits by setting up a new benchmark calculation: a 5900-atom supercell of Si-doped GaAs. It has dopants in random positions, and a big void in the middle, so there is no symmetry in this structure. In total, there are 5,898 atoms, 23,564 electrons, and about 14,700 bands (depending on the number of cores employed).

GaAs supercell

The questions at issue are: would you be able to run this cell with VASP, given a big enough allocation on a Swedish HPC resource like Triolith or Lindgren; and from a technical point of view, can you even run VASP on that many cores in the first place? The answer is “yes” to both questions, as is evident from the graph below. Employing 256 nodes and 3072 cores of Triolith gives us a speed of roughly 30 SCF iterations per hour, and with Lindgren, the number is 20 SCF steps/hour using 384 nodes and 9216 cores (i.e. 25% of the full machine).

GaAs parallel scaling

I would like to emphasize that no special magic is required to get a parallel scaling like that. It is the standard VASP source code compiled with the Intel toolchain, as described in the guides I have published. There were no special runtime settings other than NPAR=number of nodes and switching to RMM-DIIS exclusively (IALGO=48).
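For concreteness, the two settings mentioned above translate into INCAR lines like these (NPAR shown for the 256-node Triolith run; the rest of the INCAR is just the usual setup for the system):

```
NPAR  = 256   ! set to the number of compute nodes in the run
IALGO = 48    ! RMM-DIIS only for the electronic minimization
```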

To illuminate in more detail what is going on below the surface, we can look at the time profile of different subroutines. A regular VASP calculation will spend most of its time doing electronic minimization (subroutine RMM-DIIS or equivalent), but that is not what we find here. A breakdown from the 384-node run on Lindgren shows:

ORTCH (orthogonalization of wavefunctions) 40%
EDIAG (subspace diagonalization)           50%
RMM-DIIS (electronic minimization)         10%

Most of the time is spent orthogonalizing and deconstructing linear combinations of Kohn-Sham orbitals! At first, I suspected that this was a parallel scaling issue with SCALAPACK, similar to what I found for smaller systems, but the time profile is almost the same on 32 nodes, so evidently these are real, new bottlenecks that come into play. The reason, I believe, is that ORTCH and EDIAG formally scale as N³, and that they have finally overtaken the other terms, which scale as N² and N² log N. During the DFT world record (?) featuring 107,292 atoms on the K computer, Hasegawa et al. observed the same effect, with the conjugate-gradient minimization part accounting for only 1% of the computational cost.
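The crossover is easy to see in a back-of-the-envelope cost model: pit a cubic-scaling term (orthogonalization and subspace diagonalization) against an N² log N term for everything else. The prefactors below are pure guesses for illustration, not measured VASP timings; only the qualitative trend matters.

```python
import math

# Toy cost model: t_total(N) = a*N^2*log(N) + b*N^3, where N is the
# number of atoms. The prefactors a and b are invented for illustration.
def cubic_fraction(n, a=1.0, b=5e-3):
    n2_term = a * n * n * math.log(n)   # FFTs, local potential, etc.
    n3_term = b * n ** 3                # orthogonalization + subspace diag.
    return n3_term / (n2_term + n3_term)

for n in (100, 1000, 6000):
    print(f"{n:5d} atoms: {cubic_fraction(n):.0%} of time in N^3 terms")
```

For small cells the N³ work is a minor correction, but by the time the cell reaches thousands of atoms it dominates the profile, which is qualitatively what the Lindgren breakdown above shows.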

To me, this proves that the computationally intensive parts of VASP are very well parallelized and that there are no serial parts that overtake the computations as the calculation is scaled up. The trick is rather that you need to scale up your systems as well, should you want to run big.

New Version of the VASP Test Suite

A new version (v4) of my test suite for VASP is now available. You can find it on GitHub:

https://github.com/egplar/vasptest

This time, I decided to rewrite it from scratch, and drop the beetest testing framework in favor of using Python Behave instead. The extra work involved in maintaining my own testing software was not worth it when it was possible to get the same, or better, flexibility with an existing solution.

With behave, the tests are now written in plain English, which is much easier to read compared to the earlier XML format. The actual test code now looks like this:

...
Scenario: Fe-bcc
When I run VASP with a maximum of 8 ranks
Then the total energy should be -8.231456 +/- 1.0e-5 eV
and self consistency should be reached in 16 iterations
and the Fermi energy should be 9.629837 +/- 0.01 eV
and the pressure should be -39.29 +/- 0.1 kB
and the xx component of the stress tensor should be -39.28618 +/- 0.1 kB
and the yy component of the stress tensor should be -39.28618 +/- 0.1 kB
and the zz component of the stress tensor should be -39.28618 +/- 0.1 kB
and the xy component of the stress tensor should be 0.0 +/- 0.01 kB
and the magnetic moment should be 2.2095 +/- 0.01 uB
and the point group symmetry should be O_h
and the XML output should be valid
...

When the tests are run, the text above is translated into Python code which does the actual inspection and checking of the output files. Hopefully, this change will make it easier for other people to make changes and write their own test cases.
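To sketch what happens under the hood: behave matches each plain-English step against a Python function registered with a decorator such as @then('the total energy should be {energy} +/- {tol} eV'), and the check itself boils down to pulling a number out of the output files and asserting on it. The snippet below shows that core check using only the standard library; the function names and the OSZICAR-style sample line are simplified stand-ins, not the real vasptest code.

```python
import re

# Core of a check like:
#   Then the total energy should be -8.231456 +/- 1.0e-5 eV
# In the real suite this sits behind a behave @then step; here it is a
# plain function, and the OSZICAR-style line below is a made-up sample.
def final_total_energy(oszicar_text):
    """Return the last E0= value, i.e. the converged total energy."""
    values = re.findall(r"E0=\s*([-+.\dEe]+)", oszicar_text)
    return float(values[-1])

def assert_total_energy(oszicar_text, expected, tol):
    energy = final_total_energy(oszicar_text)
    assert abs(energy - expected) <= tol, f"{energy} != {expected} +/- {tol}"

sample = "1 F= -.82314560E+01 E0= -.82314560E+01  d E =-.821070E-08"
assert_total_energy(sample, -8.231456, 1.0e-5)
print("ok")
```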

In addition to the code rewrite, there are several new test cases in version 4. They are aimed at testing the GW, hybrid-DFT, and density of states functionality. There is more information on the vasptest page.

Quantum Espresso vs VASP (Round 3)

I promised a third round of Quantum Espresso (QE) benchmarking vs VASP, where I would try out some large supercells. This is where QE is supposed to shine, judging from the benchmarks reported on their home page. They show a 1500-atom system consisting of a carbon nanotube with 8 porphyrin groups attached, with good scaling up to 4096 cores on a Cray XT4 system.

Carbon nanotube with porphyrin groups

My endeavor did not produce beautiful scaling graphs, though. I tried several big supercells in QE with the PAW method, but was unable to get them to run reliably: they either ran out of memory, or crashed due to numerical instabilities. In the end, I decided to just pick the same 1532-atom carbon nanotube benchmark displayed above. It is a calculation with ultrasoft pseudopotentials, which would be unfair to compare with a VASP calculation using PAW. But since there is a special mode in VASP that emulates ultrasoft potentials, activated by LMAXPAW=-1, we can use it to make the comparison more relevant.

In terms of numerical settings, we have 5232 electrons, and the plane wave cutoff ecutwfc in the QE reference calculation is 25 Ry (340 eV), with ecutrho 200 Ry. The memory requirements are steep: VASP runs out of memory on 8 nodes, but manages to run the simulation on 16 nodes, so the total memory requirement is between 256 GB and 512 GB. QE, on the other hand, cannot run the simulation even on 50 nodes, and it is not until I reduce ecutwfc to 20 Ry and run with 50 nodes using 8 cores/node that I am able to fit it on Triolith with 32 GB/node. This means that the memory requirements are significantly higher for QE than for VASP. The “per-process dynamical memory” is reported as ca 1.1 GB in the output files, but in reality, it is using closer to 3 GB per process on 50 nodes.

Now, to the performance results. The good news is that this system scales beautifully with VASP; the bad news is that it does not look as great with QE. With VASP, I used no other tricks than the tried and tested NPAR=nodes, and for QE, I tested -ntg=[1-4] and used similar SCALAPACK setups (-ndiag 100 and -ndiag 196) as in the reference runs. -ntg=1 turned out to be optimal here, as expected (400-800 cores vs roughly 500 grid points in the z direction).

Parallel scaling comparison of CNT+porphyrin system

When looking at the scaling graph, we have near linear scaling in a good part of the range for VASP. It is quite remarkable that you can get ca 10 geometry optimization steps per hour on such a large system using just 4% of Triolith. This means that doing ab initio molecular dynamics on this system would be possible on Triolith, provided that you had a sufficiently large project allocation (several million core hours per month).

The high memory demands and instability of QE prevented me from doing a proper scaling study, but we have two reference points at 50 and 100 compute nodes. There is no speedup from 50 to 100 nodes. This is unlike the study on the old Cray XT4 machine, where the improvement was on the order of 1.5x going from 1024 to 2048 cores. So I am not really able to reproduce these results on modern hardware. I suspect that what we are seeing is an effect of faster processors. In general, the slower the compute node, the better the scaling, because there is more work to be done relative to the communication. An alternative theory is that I am missing something fundamental in running PWscf in parallel, despite having perused the manual. Any suggestions from readers are welcome!

In conclusion, the absolute speed of Quantum Espresso using 50 compute nodes with a large simulation cell is less than half of that of VASP, which further confirms that it does not look attractive to run large supercells with QE. You are also going to need much more memory per core, which is a limitation on many big clusters today.

Update 2013-12-19: A reader asked about the effective size of the fast Fourier grids used, which is what actually matters, rather than the specified cutoff (at least for VASP). In the first results I presented, VASP actually used a 320x320x320 FFT grid vs the 375x375x375 in QE. To make the comparison fairer, I reran the data points for 50 and 100 nodes with PREC=Accurate in VASP, giving a 432x432x432 grid, which is what you are currently seeing in the graph above. The conclusion is still the same, though.

To further elaborate, I think that one of the main reasons for the difference in absolute speed (but not parallel scalability) is the lack of RMM-DIIS for matrix diagonalization in QE. In the VASP calculations, I used IALGO=48, which is RMM-DIIS only, but for QE I had to use Davidson iterative diagonalization. In the context of VASP, I have seen that RMM-DIIS can be 2x faster than Davidson for wide parallel runs, so something similar could apply for QE as well.

SC13 Conference

Next week, I am off to Denver, Colorado to participate in The Supercomputing Conference (SC13). I will also attend HP-CAST and the Intel HPC Roundtable. I am always interested in meeting other people working with ab initio/electronic structure software. Send me an email if you want to meet up.

How to Compile VASP 5.3.3 on Ubuntu 13.10

Here comes a more generic recipe for installing and compiling VASP using only open-source tools (i.e. without Intel’s Fortran compiler and MKL). This could be useful if you want to run smaller calculations on a laptop or an office machine. Below follows how I did it on Ubuntu 13.10 with GCC/Gfortran, OpenMPI, OpenBLAS, FFTW and Netlib SCALAPACK. Please note that compiling VASP with gfortran is not recommended or supported by the VASP developers. From what I can tell, it appears to work, but I have only done limited testing.

Prerequisites

First of all you need the VASP source code, which you get from the VASP home page:

http://www.vasp.at/ 

Then we need to install some Ubuntu packages. Install either through the Synaptic program or apt-get in the terminal.

  • build-essential
  • gfortran
  • openmpi1.6-bin
  • libopenmpi1.6-dev
  • libfftw-double3
  • libfftw-single3
  • libfftw-dev

This is starting from a completely new Ubuntu installation. If you have done any programming on your machine before, some of these packages could already be installed. For other Linux distributions, you will need to find out the names of the corresponding packages. They should be similar, except for “build-essential”, which is specific to Debian.

I did not have much success using Ubuntu’s BLAS/LAPACK/ATLAS, so we will need to download the latest OpenBLAS and compile it ourselves from source. The same applies to SCALAPACK, which we have to tie together with our OpenBLAS and the system OpenMPI installation.

OpenBLAS

Download the latest OpenBLAS tarball from

http://www.openblas.net/

After decompressing it, you will have a directory called “xianyi-OpenBLAS-….”. Go inside and check the TargetList.txt file. You will have to decide which processor architecture target is appropriate for your processor. For a new Intel processor, “SANDYBRIDGE” should be best, and for a new AMD processor, “BULLDOZER”. Here, I choose the safe and conservative option “CORE2”, which should work on any recent processor. Then we compile with make.

make FC=gfortran CC=gcc USE_THREAD=0 TARGET=CORE2

This should produce a library called libopenblas_core2-r0.2.8.a (or similar). Make note of the directory in which you compiled OpenBLAS; you will need it later. Mine was “/home/pla/build/xianyi-OpenBLAS-9c51cdf”.

SCALAPACK

Download the latest SCALAPACK tarball from Netlib.org. To compile it, we need to set up a SLmake.inc file containing some configuration parameters. Start by copying the SLmake.inc.example file. You need to update the BLASLIB and LAPACKLIB variables and insert a direct reference to your OpenBLAS compilation.

CDEFS         = -DAdd_
FC            = mpif90
CC            = mpicc 
NOOPT         = -O0
FCFLAGS       = -O3
CCFLAGS       = -O3
FCLOADER      = $(FC)
CCLOADER      = $(CC)
FCLOADFLAGS   = $(FCFLAGS)
CCLOADFLAGS   = $(CCFLAGS)
ARCH          = ar
ARCHFLAGS     = cr
RANLIB        = ranlib
SCALAPACKLIB  = libscalapack.a
BLASLIB       = -L/home/pla/build/xianyi-OpenBLAS-9c51cdf -lopenblas
LAPACKLIB     = $(BLASLIB)
LIBS          = $(LAPACKLIB) $(BLASLIB)

This should be enough to get SCALAPACK to compile by typing “make”. In the end, you should get a libscalapack.a file.

Compiling VASP

Proceed to compile VASP with gfortran according to the previous guide. You need to apply the source code patches described there, otherwise it is straightforward. If you have never compiled VASP before, looking through one of the more detailed system specific guides in the VASP compile section might help.

The makefiles and the source code patch I used are available for download: vasp-ubuntu.tar.gz.

Some highlights (update the paths if necessary):

FFLAGS = -ffree-form -ffree-line-length-0  -fno-second-underscore -I/usr/include

We need to include -I/usr/include to pick up the FFTW header file.

BLAS= ../../xianyi-OpenBLAS-9c51cdf/libopenblas-core2.a

And refer to the BLAS/LAPACK library from our OpenBLAS installation.

CPP    = $(CPP_) -DMPI  -DHOST=\"LinuxGfort\" \
     -DCACHE_SIZE=4000 -Davoidalloc -DNGZhalf \
     -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK -DMINLOOP=1  

And set the precompiler flags. In the MPI section of the makefile, there should be a reference to our compiled SCALAPACK:

SCA=/home/pla/build/scalapack-2.0.2/libscalapack.a

Running VASP

The binaries you compile are MPI-enabled, so they should be launched with mpirun. For example:

mpirun -np 4 ~/build/vasp-5.3.3/vasp.5.3/vasp

You will probably find that the --bind-to-core option will help performance.

mpirun -np 4 --bind-to-core ~/build/vasp-5.3.3/vasp.5.3/vasp

If you have a dual-socket workstation, similar to a compute cluster node, I recommend trying:

mpirun -np 16 --bind-to-core --npersocket 8 ~/build/vasp-5.3.3/vasp.5.3/vasp

Exploring Quadruple Precision Floating Point Numbers in GCC and ICC

When doing standard double precision floating point operations in C or Fortran, you can expect 15-17 significant digits. But what do you do when 15 decimals are not enough? The GNU multiple precision arithmetic library is the go-to library when you need not 15, but thousands of decimals; for more modest needs, quadruple precision (128 bits) might be enough. In Fortran, 128 bits is REAL*16, and in C we can access it through the type __float128. These may not be available in all compilers, but GCC (4.6 and newer) and Intel C/Fortran do support it. According to the GCC manual, we can expect it to be “an order of magnitude or two slower” than double precision.

I decided to test the quad precision by doing a simple matrix-matrix multiplication test. Two 256x256 matrices are initialized with trigonometric expressions and then multiplied by each other. It is straightforward to change the program to quad precision: the double arrays should be declared as __float128, the trig functions are called cosq/sinq, and we need to use the special function quadmath_snprintf to print the numbers to the screen. (In Fortran, you don’t even need to use a different print function.)

For simplicity, I look specifically at the decimal expansion of one matrix element (0,0). With gcc, I get

$ gcc -o matmul_double -O3 -mavx -std=c99 matmul_double.c -lm
$ ./matmul_double
Matrix size: 256 by 256 (double precision).
Time: 0.010 seconds (3355.4 MFLOP/s)
C(0,0)-diagnostic: -0.0939652685936642

gcc -o matmul_quad -std=c99 -O3 -mavx matmul_quad.c -lquadmath -lm
./matmul_quad
Matrix size: 256 by 256 (quad precision).
Time: 4.140 seconds (8.1 MFLOP/s)
C(0,0)-diagnostic: -0.093965268593662358578620940776

which confirms that __float128 works in gcc, but with a significant speed penalty. In this case, the difference in runtime is around 400x.

Does Intel’s C compiler fare better?

$ icc -o matmul_double -std=c99 -O3 -xavx mmul_double.c -lm
$ ./matmul_double
Matrix size: 256 by 256 (double precision).
0.000 seconds (inf MFLOP/s)
C(0,0)-diagnostic: -0.0939652685936624

$ icc -o matmul_quad -std=c99 -O3 -mavx -I/software/apps/gcc/4.7.2/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/include -L/software/apps/gcc/4.7.2/lib64/ matmul_quad.c -lquadmath -lm
$ ./matmul_quad
Matrix size: 256 by 256 (quad precision).
0.880 seconds (38.1 MFLOP/s)
C(0,0)-diagnostic: -0.093965268593662358578620940776

Yes, it is evident that the quad precision floating-point math runs a few times faster with ICC, even though the same underlying library is used.

If we actually look at the decimals and compare the results, we find something interesting. The quad precision results from GCC and ICC are identical, which is reassuring, but the double precision binary compiled by ICC delivers higher precision, despite using very high optimization.

GCC(double) -0.09396526859366|42
ICC(double) -0.09396526859366|24
GCC(quad.)  -0.09396526859366|2358578620940776
ICC(quad.)  -0.09396526859366|2358578620940776

Lowering GCC’s optimization did not help, and gcc is not supposed to introduce any unsafe mathematical operations unless they are explicitly enabled by -ffast-math or -Ofast in the first place, so it is unclear to me where the difference comes from. Typically, fluctuations in the last decimal are not a problem in practice; many codes run fine with e.g. gcc’s -ffast-math. But if you are struggling with ill-conditioned matrices, every decimal could count.

Shared Memory Communication vs Infiniband

Are bigger servers better for high-performance computing? It is often assumed that communication between processors within a compute node must be faster than using Infiniband networking in a cluster. Consequently, I come across scientists asking for big shared memory servers, believing that their simulations would run much faster there. In reality, it is not always so. In the previous post, I wrote about how the VASP application is bottlenecked by memory bandwidth. In such cases, compute work and communication will compete with each other for precious resources, with severe performance degradation as a result.

Consider this experiment. Let us first run a VASP compute job with 512 bands on a single compute node using 1 to 16 cores. This will give us a scaling graph showing what kind of improvement you can get by using more cores in a server. Now, for comparison, we run the same job but with only one core per compute node. This means that the 16-core job uses 16 compute nodes and only communicates over Infiniband. Which scenario will be faster?

VASP SMP vs Infiniband scaling

It turns out that communicating only over Infiniband is superior to shared memory. With 16 cores, the calculation runs twice as fast. The reason is simply that we throw more hardware at the problem: our processors can now use all the memory bandwidth for computations, while exchanging data over the Infiniband network instead.

The graph above shows that there is no inherent benefit to running an MPI-parallelized application such as VASP on a big server vs smaller servers in a cluster connected by Infiniband. The only advantage you get is the increased total amount of available memory per server.

As a user, you can apply techniques like this to speed up your calculations – for example, by using twice as many nodes, but with half the number of cores on each node. On Triolith, you can do it like this:

#SBATCH -N 16
#SBATCH --ntasks-per-node 8

mpprun /software/apps/vasp/5.3.3-18Dec12/default/vasp

This will run your application with 8 cores per compute node, for a total of 128 cores. The improvement in speed compared to using 8 full nodes can be as much as 50%, and you will also have twice as much memory available per processor. The drawback is, of course, that you spend twice the core hours on your calculation. But if it is important to get the results quickly, it might be worth it.

Hardware Recommendations for VASP

I was recently asked what kind of hardware you should buy for running VASP calculations. I recommend looking at the configuration of the Triolith cluster at NSC. It was designed with VASP as a big part of the workload, and we did extensive benchmarking to optimize price/performance. An alternative is to look through the most recent entries in the Top500 supercomputing list, to get a feel for what the supercomputing centers are buying at the moment. At Top500, they also have a statistics section where you can follow trends in hardware over time. It is obvious, for example, that big shared memory systems have fallen out of favor.

I recommend dual-socket servers with low clock frequency Intel Xeon E5 processors, connected by Infiniband networking. Why?

  • Type of CPU: At the moment, Intel’s server processors are significantly ahead of AMD’s in terms of performance and energy efficiency. VASP is also very reliant on Intel’s Fortran compiler. This makes Intel processors an optimal choice. It is possible to run VASP on POWER and SPARC processors, but these server platforms are generally not cost efficient for high-performance computing.

  • CPU model: VASP is very dependent on high memory bandwidth. It is one of the most memory intensive applications in high-performance computing. This means that you do not need processors with high clock frequency, because they will spend most of their time waiting for data to arrive from memory anyhow. What you need is the cheapest processor model that still comes with maximum memory bandwidth. In the current Xeon E5-2600 series, that is likely to be the 2650 and 2660 models. The quad-core 2643 model could also be interesting, because you typically gain only 25% when going from 8 to 16 cores per node.

  • Memory: It is crucial to have 1600 MHz DDR3 memory, which is currently the fastest memory (1866 MHz will come soon). All four memory channels should be occupied. Try to get dual rank memory modules if you have only 1 DIMM per memory channel – it will improve performance by about 5%. Typically, 32 GB of memory is enough (2 GB/core), the reason being that you can easily get an equivalent of 4 GB per core by running with half the number of MPI ranks per node without losing too much performance. But if the majority of jobs are of hybrid type or GW, I would go for 64 GB per server instead.

  • Network: A fast network, like Infiniband, is necessary to run VASP in parallel on more than a few nodes. It is difficult to do comparative studies of different network setups due to the amount of hardware required, but in general VASP scales almost perfectly up to 16 compute nodes using both Mellanox and Qlogic (now Intel) Infiniband, so there is no given winner. FDR Infiniband does not significantly improve performance over QDR for VASP in the few tests I was able to do (+16% on a 16 node job spanning two switches), so I would look at it mostly from a price/performance perspective.