New Version of VASP - 5.4.1

A new version of VASP, denoted vasp.5.4.1 24Jun15, was released during the summer. The release was a bit stealthy, as there was no mention of it on the VASP home page until announcements of “bugfixes for vasp.5.4.1” showed up. There seem to be no official release notes published either, but the announcement email contains the following list of improvements and changes:

  • Interface to the solvation model code VASPsol of Mathew and Hennig.
  • Bugfix in the symmetry detection.
  • Support for NCORE≠1 for hybrid functional calculations.
  • Bugfix in pead.F: macroscopic dielectric properties (LCALCEPS=.TRUE.) didn’t work with LREAL≠.FALSE.
  • Improvements to hybrid calculations: less memory use when run at large scale.
  • A new build system which simplifies the compilation of VASP. There is now a separate build directory and you can compile the three usual versions with distinct make commands.

Since the original 5.4.1 release, two patches have also been released.

The first installations of VASP 5.4.1 binaries at NSC are available in the usual place, so you can put the following in your job scripts:

mpprun /software/apps/vasp/5.4.1-24Jun15/build04/vasp

Build04 includes both of the patches, but build01-02 only have the first patch. You can also do module load vasp/5.4.1-24Jun15, if you prefer to use modules. That command currently loads build04.
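For reference, a complete job script could look something like the minimal sketch below. The Slurm directives (job name, node count, wall time) are placeholders that you would adjust to your own project and job size; only the mpprun line is taken from above.

#!/bin/bash
#SBATCH -J vasp-job          # job name (placeholder)
#SBATCH -N 8                 # number of compute nodes (placeholder)
#SBATCH -t 12:00:00          # wall time limit (placeholder)

mpprun /software/apps/vasp/5.4.1-24Jun15/build04/vasp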

I believe that the new version is safe to use. I ran through the test suite that I have and saw identical output in most tests, with small deviations for Cu-fcc and Si with GW. I especially recommend trying out 5.4.1 if you are struggling with large hybrid functional calculations and employ many compute nodes in the process. In repeating some of the GaAsBi benchmark runs with 5.4.1, I found that HSE06 is now significantly faster and uses less memory per MPI rank, so you might be able to use more cores per compute node without running out of memory. To exemplify, with 5.3.5, I was able to run my GaAsBi 512-atom test system on 96 Triolith compute nodes with 8 cores per node in 9812 seconds, but with 5.4.1, it completes in 6719 seconds using the same configuration. That is a 46% improvement in speed! Furthermore, it is possible to scale up the calculation further to 128 nodes and 12 cores/node, which I was not able to do before due to memory shortage. I think this is good news for people running large calculations, especially on Beskow.

P.S. A comment on the lack of updates to the blog:

Over the past year, I have gradually moved on to another position at NSC. Today, I work as partner manager, developing NSC’s external collaborations with partners such as SMHI and Saab Group. I intend to keep publishing benchmarks and recommendations on the blog, as we move into the process of replacing Triolith, but the update frequency will likely be lower in the future. Weine Olovsson at NSC is taking over most of the actual support duties for VASP.

VASP Seminar at Uppsala University

Tomorrow (January 15th), I will be visiting Uppsala University to give some lectures on running VASP efficiently on the clusters available in Sweden. The seminar is being organized by NSC and UPPMAX, mainly for SNIC users, but everyone interested is welcome. It is not necessary to register in advance. You can find more information on the SNIC web page.

Most of the content that I will present is already available here, scattered over many blog posts. If you check out the blog post archives, you will find posts covering VASP and how certain input parameters affect performance. For information about compiling VASP, I have a special section on the web page called Compile VASP. It now has a new guide for the Beskow system at PDC in Stockholm. I will show some preliminary benchmark results from there, and we will talk about what kinds of simulations are suitable for running at large scale on such a computer.

There is also a dedicated time slot for a Q&A session. It is a good time to ask questions, not only about VASP, as several of the SNIC application experts will be there.

I am looking forward to seeing you there!

Update 2015-01-16: Thank you to everyone who participated. It is always nice to return to Uppsala. By request, here are the slides from the sessions:

VASP on Cray XC-40 Beskow: Preliminary Benchmark Results

During the Christmas holidays, I had the opportunity to run some VASP benchmarks on Beskow, the Cray XC40 supercomputer currently being installed at PDC in Stockholm. The aim was to develop guidelines for VASP users with time allocations there. Beskow is a significant addition in terms of aggregated core hours available to Swedish researchers, so many of the heavy users of supercomputing in Sweden, like the electronic structure community, were granted time there.

For this benchmarking round, I developed a new set of tests to gather data on the relationship between simulation cell size and the appropriate number of cores. There have been concerns that almost no ordinary VASP jobs would be meaningful to run on the Cray machine, because they would be too small to fit into the minimal allocation of 1024 cores (or 32 compute nodes, in my interpretation). Fortunately, my initial results show that this is not the case, especially if you use k-point parallelization.

The tests consisted of GaAs supercells of varying sizes, doped with a single Bi atom. The cells and many of the settings are picked from a research paper, to make it more realistic. The supercells were successive doublings in size: 64, 128, 256, and finally 512 atoms, with a decreasing number of k-points in the Monkhorst-Pack grids (36, 12, 9, and 4, respectively). I think it is a quite realistic example of a supercell convergence study.

Chart of GaAsBi supercell timings on Beskow

First, please note the scales on the axes. The y-axis is reversed time on a logarithmic scale, so upwards represents faster speed. Similarly, the x-axis is the number of compute nodes (with 32 cores each) on a log-2 scale. A 256-node calculation is potentially 8192 cores! In this case, however, I used 24 cores/node, except for the biggest HSE06 cells, where I had to reduce the number of cores per node to gain more memory per MPI rank. The solid lines are PBE calculations and the dashed lines HSE06 calculations. Note the very consistent picture of parallel displacement of the scaling curves: bigger cells take longer to run and scale to a larger number of compute nodes (although the increase is surprisingly small). The deviations from the straight lines come from outlier cases where I had to use a sub-optimal value of KPAR; for example, with 128 atoms, 32 nodes, and 12 k-points, I had to use KPAR=4 instead of KPAR=12 to balance the number of k-points. For real production calculations, you could easily avoid such combinations.

The influential settings in the INCAR file were:

NCORE = cores/node (typically 24)
KPAR = MIN(number of nodes,number of k-points)
NSIM = 2
NBANDS = 192, 384, 768, 1536 (which are all divisible by 24)
LREAL = Auto
ALGO = Fast (Damped for HSE06)

There were no other special tricks involved, such as setting MPI environment variables or custom MPI rank placement. These are just standard VASP calculations with minimal input files and careful settings of the key parameters. I actually have more data points for even bigger runs, but I have chosen to cut off the curves where the parallel efficiency fell too much, usually to less than 50%. In my opinion, it is difficult to motivate a lower efficiency target than that. So what you see is the realistic range of compute nodes you should employ to simulate a cell of a given size.

If we apply the limit of 32 nodes, we see that a 64-atom GGA calculation might be a borderline case, simply too small to run on Beskow, but 128 atoms and more scale well up to 32 compute nodes, which is half a chassis on the Cray XC40. If you use hybrid DFT, such as HSE06, you should be able to run on 64 nodes (1 chassis) without problems, perhaps even up to 4 chassis with big supercells. In that case, though, you will run into problems with memory, because it seems that the memory use of an HSE06 calculation increases linearly with the number of cores used. I don’t know if it is a bug, or if the algorithm is actually designed that way, but it is worth keeping in mind when using hybrid functionals in VASP. Sometimes, the solution to an out-of-memory problem is to decrease the number of nodes.

In addition to parallel scaling, we are also interested in the actual runtime. In some ways, it is impressive. Smaller GGA calculations can complete one SCF cycle in less than a minute, and even relatively large hybrid-DFT jobs can generally be tuned to complete one SCF cycle per hour. In other ways, it is less impressive. While we can employ more compute nodes to speed up bigger cells, we cannot always make a bigger system run as fast as a smaller one just by adding more compute nodes. For example, an HSE06 calculation will take about two orders of magnitude longer to run than a GGA calculation, but unfortunately, it cannot also make use of two orders of magnitude more compute nodes efficiently. Therefore, large hybrid calculations will remain a challenge to run until the parallelization in VASP is improved, especially with regard to memory consumption.

Selecting the Right Number of Cores for a VASP Calculation

A frequent question I encounter supporting VASP users is: “I have a cell with X atoms and Y electrons. How many compute nodes (or cores) should I choose for my simulation?”

It is an important question, because using too many cores is inefficient, and the result is fewer jobs completed within a given computer time allocation.

Currently, there is only MPI parallelization in VASP, so by “cores” I mean the number of MPI ranks or processes, i.e. the number you give to the mpirun command with the -n flag, or the number of cores you request from the queue system.

Besides the suggestion of actually testing it out and finding a good number of cores, the main rule of thumb that I have been telling people is:

number of cores = number of atoms

This is almost always safe, and will not waste computer time. Typically, it will ensure a parallel efficiency of at least 80%. This is of course a very unscientific and handwavy rule, but it has a certain pedagogical elegance, because it is easy to remember, and you don’t need to look up any other technical parameters.

Let’s now look into how you could make a more accurate estimate. VASP has three levels of parallelization: over k-points, over bands, and over plane-wave coefficients (or equivalently, fast Fourier transforms). You need to ensure that when the work is split up over several compute nodes, there is a sufficient amount of work allocated to each processor core; otherwise, they will just spend time waiting for more work to arrive. The fundamental numbers to be aware of are therefore:

  • The number of k-points
  • The number of bands (determined indirectly by the number of atoms and electrons)
  • The size of the basis set (i.e. number of plane waves, which corresponds to the number of grid points in the FFTs).

If you can estimate or derive these numbers for your calculation, you can more precisely guess a suitable number of cores to use.

Bands and cores

The first step is to consider the number of bands (NBANDS). VASP has parallelization over bands (controlled by the NPAR tag). The ultimate limit is 1 band per core. So, for example, if you have 100 bands, you cannot run on more than 100 cores and expect it to work well. What I have seen, in my scaling tests, though, is that 1 band per core is too little work for a modern processor. You need to have at least 2 bands per core to reach more than 50% efficiency. A conservative choice is 8 bands/core. That will give you closer to 90% efficiency.

number of cores = NBANDS / 8

So how does this relate to the rule of thumb above? By applying it, you will arrive at a number of bands per core roughly equal to the average number of valence electrons per atom in your calculation. If we assume that the typical VASP calculation has about 4-8 valence electrons per atom, this lands us in the ballpark of 4-8 bands/core, which is usually ok.

Let’s now try to apply this principle:

Example 1: We have a cell with 500 bands and a cluster with compute nodes that have 16 cores each. We aim for 8 bands/core, which unfortunately means 62.5 cores. It is better to have even numbers, so we increase the number of bands to 512 by setting NBANDS=512 in the INCAR file and allocate 64 cores, or 4 compute nodes.
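If you are unsure what NBANDS will be, a short test run is enough to find out, since VASP prints the value in the OUTCAR. A minimal sketch (assuming the usual “NBANDS=” string in the OUTCAR; the exact formatting can differ between VASP versions):

# read NBANDS from a test run's OUTCAR and aim for ~8 bands per core
nbands=$(grep -m1 "NBANDS=" OUTCAR | awk -F"NBANDS=" '{print $2}' | awk '{print $1}')
echo "NBANDS = $nbands, suggested number of cores = $((nbands / 8))"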

Example 2: Suppose that you want to speed up the calculation in the previous example. You need the results fast and care less about efficiency in terms of the number of core hours spent. You could drop down to 1 band/core (512 cores), but there is really not much improvement compared to 2 bands/core (256 cores). So it seems like 256 cores is the maximum number possible. What you can do, however, is to take these 256 MPI processes and spread them out over more compute nodes. This improves the memory bandwidth available to each MPI process, which usually speeds things up. So you can try running on 32 nodes, but using 8 cores/node instead. It could be faster, if the extra communication overhead is not too large.
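In a Slurm-type batch system, spreading the ranks out could be expressed roughly as in the sketch below (the exact directives and launcher depend on your site, and /path/to/vasp is just a placeholder):

#SBATCH -N 32                  # 32 compute nodes
#SBATCH --ntasks-per-node=8    # but only 8 MPI ranks on each node
mpprun /path/to/vasp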

K-points and KPAR

The next step is to consider the number of k-points. VASP can treat each k-point independently. The number of k-point groups that run in parallel is controlled by the KPAR parameter. The upper limit of KPAR is obviously the number of k-points in your calculation. In theory, the maximum number of cores you can run on using combined k-point and band parallelization is NBANDS * KPAR. So, for example, 500 bands and 10 k-points would allow up to 5000 cores, in principle. In practice, though, k-point parallelization does not scale that well. What I have found on the Triolith and Beskow systems is that setting KPAR equal to the number of compute nodes usually allows you to run on twice as many cores as you determined in the previous step, regardless of the actual value of KPAR. I would not recommend attempting to run with KPAR greater than the number of compute nodes, even if you have more k-points than compute nodes.

(Note: a side effect of this is that the most effective number of bands/core when using k-point parallelization is higher than without it. This is likely due to the combined overhead of using two parallelization methods.)

Example 3: Consider the 500-band cell above. 64 cores was a good choice when using just band parallelization. But you also have 8 k-points, so set KPAR to 8 and double the number of cores to 128 (or 8 compute nodes). In this case, we end up with 1 k-point per node, which is a very balanced setup. Note that this may increase the required memory per compute node, as k-point parallelization replicates a lot of data inside each k-point group. If you run out of memory, the next step would be to lower KPAR to 2 or 4.
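Translated into INCAR tags, the setup in this example would be something along these lines (a sketch; NCORE is set to the 16 cores per node assumed above, and the values are only those of the example):

KPAR = 8       ! one k-point group per compute node
NCORE = 16     ! cores per compute node (site dependent)
NBANDS = 512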

Basis set size, LPLANE and NGZ

As a last step, it might be worth considering what the load balancing of your FFTs will look like. This is covered in the VASP manual in section 8.1. By default, VASP works with the FFTs in a plane-wise manner (meaning LPLANE=.TRUE.), which reduces the amount of communication needed between MPI ranks. In general, you want to use this feature, as it is typically faster. The 3D FFTs are split up into 2D planes, where each group (as determined by NPAR) works on a number of planes. Ideally, you therefore want NGZ (the number of grid points in the Z direction) to be evenly divisible by NPAR, which ensures good load balance.


The second thing to consider, according to the manual, is that NGZ should be sufficiently big for the LPLANE approach to work:

NGZ ≥ 3*(number of cores)/NPAR = 3*NCORE

Since NCORE will be of the same magnitude as the number of cores per compute node, it means that NGZ should be at least 24-96, depending on the node configuration. More concretely, for the following clusters, you should check that the conditions below hold:

NSC Kappa/Matter: NGZ ≥ 24
NSC Triolith: NGZ ≥ 48
PDC Beskow: NGZ ≥ 72 (using 24c/node)

Typically, this is not a big problem. As an example of what NGZ can be, consider a 64-atom supercell of GaAs (11.3 Å) with a cut-off of 313 eV. The small FFT grid is then 70x70x70 points, so that is approximately the smallest cell that you can run on many nodes without suffering from excessive load imbalance on Beskow. For bigger cells, with more than 100 atoms, NGZ is usually larger than 100 as well, so there will be no problem in this regard, as long as you stick to the rule of using NPAR=compute nodes or NCORE=cores/node. But you should still check that NGZ is an even number and not, for example, a prime number.
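A quick way to check is to grep for the FFT grid in the OUTCAR of a test run (the exact wording of the line may vary slightly between VASP versions):

grep "dimension x,y,z NG" OUTCAR

The first line printed typically corresponds to the wave-function grid (NGX, NGY, NGZ), and the second one to the finer grid (NGXF, NGYF, NGZF).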

In order to tune NGZ, you have two choices: either adjust ENCUT to a more appropriate number and let VASP recalculate the values of NGX, NGY and NGZ, or stop specifying the basis set size in terms of an energy cut-off and instead set the NG{X,Y,Z} parameters yourself directly in the INCAR file. For a very small system, with NGZ falling below the threshold above, you can also consider lowering the number of cores per node and adjusting NCORE accordingly. For example, on Triolith, using 12 cores/node and NCORE=12 would lower the threshold for NGZ to 36, which enables you to run a small system over many nodes.
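As a sketch, overriding the grid directly in the INCAR would look like this (the numbers are examples only; pick even, non-prime values at or above what VASP chose automatically):

NGX = 72
NGY = 72
NGZ = 72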


Summary

  • Check the number of bands (NBANDS). The number of bands divided by 8 is a good starting guess for the number of cores to employ in your calculation.
  • If you have more than one k-point, set KPAR to the number of compute nodes or the number of k-points, whichever is smaller. Then double the number of cores determined in the previous step.
  • Make a test run and check the value of NGZ; it should be an even number and sufficiently big (larger than 3*cores/node). If not, adjust either the basis set size or the number of cores/node.

How to Compile VASP on the Cray XC40

Recently, I have been working on making VASP installations for the new Cray XC40 (“Beskow”) at PDC in Stockholm. Here are some instructions for making a basic installation of VASP 5.3.5 using the Intel compiler. Some of it might be specific to the Cray at PDC, but Cray provides a similar environment on all its machines, so I expect the instructions to be generally useful as well. My method of compiling VASP produces binaries that are around 30-50% faster than the ones that were provided to us by Cray, so I really recommend making the effort to recompile if you are a heavy VASP user.

If you have an account on Beskow, my binaries are available in the regular VASP module:

module load vasp/5.3.5-31Mar14 

The installation path is (as of now; it might change when the system becomes publicly available):


You can also find the makefiles and some README files there.

Summary of the findings

  • VASP compiles fine with the PrgEnv-intel module and MKL on the Cray XC-40.
  • Using MKL is still significantly better, especially for the FFTW routines.
  • Optimization level -O2 -xCORE-AVX2 is enough to get good speed.
  • VASP does not seem to be helped much by AVX2 instructions (small matrices and limited by memory bandwidth).
  • A ScaLAPACK blocking factor NB of 64 seems appropriate.
  • MPI_BLOCK should be increased as usual; 64 kB is a good number.
  • Enabling MKL’s conditional bitwise reproducibility at the AVX2 level does not hurt performance, it may even be faster than running in automatic mode.
  • Memory “hugepages” do not seem to improve the performance of VASP.
  • The compiler flags -DRPROMO_DGEMV and -DRACCMU_DGEMV have very little effect on speed.
  • Hyper-threading (simultaneous multithreading) does not improve performance; the overhead of running twice as many MPI ranks is too high.
  • Multithreading in MKL does not improve performance either.

Preparations for compiling

First, download the prerequisite source tarballs from the VASP home page.

You need both the regular VASP source code (vasp.5.3.5.tar.gz) and the supporting “vasp 5” library (vasp.5.lib.tar.gz).


I suggest making a new directory called e.g. vasp.5.3.5, where you download and expand them. You would type commands approximately like this:

mkdir 5.3.5
cd 5.3.5
tar zxvf vasp.5.3.5.tar.gz
tar zxvf vasp.5.lib.tar.gz

This will set you up with the source code for VASP.

Load modules for compilers and libraries

The traditional compiler for VASP is Intel’s Fortran compiler (the ifort command), so we will stick with it in this guide. In the Cray environment, the corresponding module is called “PrgEnv-intel”. Typically, PGI or Cray is the default preloaded compiler, so we have to swap compiler modules.

module swap PrgEnv-cray PrgEnv-intel/5.2.40

Check which version of the compiler you have by typing “ifort -v”:

$ ifort -v
ifort version 14.0.4

If you have the PrgEnv-intel/5.2.40 module loaded, it should state 14.0.4. This version can compile VASP with some special rules in the makefile (see compiler status for more information). Please note that the Fortran compiler command you should use to compile is always called ftn on the Cray (regardless of the module loaded).
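In practice, this means that the compiler variables in the makefiles point to the ftn wrapper rather than directly to ifort, roughly like this (a sketch of the relevant makefile lines; the exact flags are in the downloadable makefiles):

FC = ftn
FCL = $(FC)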

We are going to use Intel’s math kernel library (MKL) for BLAS, LAPACK and FFTW, so we unload Cray’s LibSci, to be on the safe side.

module unload cray-libsci

Then I add these modules, to nail everything down:

module load cray-mpich/7.0.4
module load craype-haswell

This selects Cray’s MPI library, which should be the default, and sets up the environment to compile for the XC-40.

VASP 5 lib

Compiling the VASP 5 library is straightforward. It contains some timing and I/O routines necessary for VASP, as well as LINPACK. Just download my makefile for the VASP library into the vasp.5.lib directory and run the make command.

cd vasp.5.lib
make -f makefile.vasp5lib.crayxc40

When it is finished, there should be a file called libdmy.a in the directory. Leave it there, as the main VASP compilation picks it up automatically.

Editing the main VASP makefile

Go to the vasp.5.3 directory and download the main makefile.

cd vasp.5.3

I recommend that you edit the -DHOST variable in the makefile to something that you will recognize, like the machine name, since this piece of text is written out at the top of the OUTCAR files.

   -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
   -DMPI_BLOCK=65536 -Duse_collective -DscaLAPACK \

You will usually need three different versions of VASP: the regular one, the gamma-point only version, and one for spin-orbit and/or non-collinear calculations. These are produced by the following combinations of precompiler flags that you have to put into the CPP line in the makefile:

regular:       -DNGZhalf
gamma-point:   -DNGZhalf -DwNGZhalf
non-collinear: (nothing)
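For example, the flag fragment shown earlier belongs to the regular version; for the gamma-point-only build you would simply append -DwNGZhalf to the same CPP line (a sketch, with the surrounding flags unchanged):

   -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf -DwNGZhalf \
   -DMPI_BLOCK=65536 -Duse_collective -DscaLAPACK \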

At the Swedish HPC sites, we install and name the different binaries vasp, vasp-gamma, and vasp-noncollinear, but this is optional.


VASP does not have a makefile that supports parallel compilation. So, with the downloaded makefile in place as “makefile”, in order to compile we just do:

make

If you really want to speed it up, you can try something like:

nice make -j4; make -j4; make -j4; make -j4;

Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”. Rename it immediately, otherwise it will be deleted when you run make clean to compile the other VASP versions.
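For example (the target name here is just the naming convention mentioned above; pick whichever matches the flags you compiled with):

mv vasp vasp-gamma     # or vasp / vasp-noncollinear, depending on the build
make clean             # then edit the CPP flags and build the next variant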


The Cray compiler environment produces statically linked binaries by default, since this is the most convenient way to run on the Cray compute nodes. To run on, for example, 2048 cores using 64 compute nodes with 32 cores per node and 16 cores per socket, we just put the following in the job script:

aprun -n 2048 -N 32 -S 16 /path/to/vasp

Normally, I would recommend lowering the number of cores per compute node, as this will often make the calculation run faster. In the example below, I run with 24 cores per node (12 per socket), which is typically a good choice:

aprun -n 1536 -N 24 -S 12 /path/to/vasp

When running on the Cray XC-40, keep in mind the basic topology of the fast network connecting the compute nodes: 4 nodes sit together on 1 board, and 16 boards connect to the same chassis (for a total of 64 compute nodes), while any larger job will have to span more than one chassis and/or physical rack, which slows down network communication. Therefore, it is best to keep the number of compute nodes to 64 at most, as few VASP jobs will run efficiently using more nodes than that.

Initial Test of VASP on Intel’s Xeon E5 V3 Processors (“Haswell”)

I finally got around to running some VASP benchmarks on the recently released Intel Xeon E5 v3 processors (see the previous post for an overview of the differences versus older Xeon models). I have tested two different configurations:

  • A 16-core node from the coming weather forecasting cluster at NSC, to be named “BiFrost”. It is equipped with the Xeon E5-2640v3 processor at 2.6 GHz, together with 64 GB of 1866 MHz DDR4 memory.
  • The 32-core nodes in the Cray XC40 “Beskow” machine at PDC. The processor model is the Xeon E5-2698 v3. Beskow also has two sockets per node with 64 GB of memory, but the memory is faster: 2133 MHz.

I did a quick compile of VASP with Intel Fortran (versions 14 and 15) and ran some single-node benchmarks of the 24-atom PbSO4 cell I traditionally use for initial testing.

Relative speed: Haswell vs Sandy Bridge

The initial results are basically in line with my expectations. The single-core performance is very strong thanks to Turbo Boost and the improved memory bandwidth. It is the best I have measured so far: 480 seconds, versus 570 seconds earlier for a 3.4 GHz Sandy Bridge CPU.

When running on all 16 cores and comparing to a Triolith node, the absolute speed is up 30%, which is also approximately +30% performance per core, as the number of cores is the same and the effective clock frequency is also very close in practice. The intra-node parallel scaling has improved as well, which is important, because it hints that we will be able to run VASP on more than 16 cores per node without degrading performance too much on multi-core nodes such as the ones in Beskow. In fact, when running the above cell on a single Cray XC-40 node with 32 cores, I see improvement in speed all the way up to 32 cores:

VASP Intra-node scaling on Beskow

This is a good result, since the parallel scaling for such a small cell is not that good in the first place.

So overall, you can expect a Beskow compute node to be about twice as fast as a Triolith node when fully loaded. That is logical, as it has twice as many cores, but not something that you can take for granted. However, when running wide parallel jobs, I expect that you will find 24 cores/node to be better, because 24 cores gives you 90+% of the potential node performance, while at the same time significantly lowering the communication overhead from having many MPI processes.

A first case study: a 512-atom supercell

I will publish more comprehensive benchmarks later, but here is an indication of what kind of improvement you can expect on Beskow vs Triolith for more realistic production jobs. The test system is a 512-atom GaAs supercell with one Bi defect atom. A cell like this is something you would typically run as part of a convergence study to determine the necessary supercell size.

  • Using 64 compute nodes on Triolith (1024 cores in total), it takes 340 s to run a full SCF cycle to convergence (13 iterations).
  • On Beskow, using 64 compute nodes and 1536 cores (24 cores/node), it takes 160 s.

Again, about 2.0x faster per compute node. If you compare it to the old Cray XE6 “Lindgren” at PDC, the difference is even bigger: between 3x and 4x faster. But please keep in mind that allocations in SNIC are accounted in core hours, not node hours, so while your job will run approximately twice as fast on the same number of nodes, the “cost” in core hours is the same.

Visiting University of Antwerp

On Thursday and Friday (23-24 October), I will be visiting the University of Antwerp in Belgium to give some lectures as part of the Specialist course on efficient use of VASP that is being organized by CalcUA and the Flemish Supercomputing Centrum. You can find more information on the course web page. It seems that you need to register in advance to attend the course.

Most of the content that I will present is already available here, as part of several blog posts. If you check out the blog post archives, you will find many posts covering VASP and how certain input parameters affect performance. For information about compiling VASP, I have a special section on the web page called Compile VASP. The system guide for NSC’s Triolith is probably the best to start out with, as the hardware is very similar to what is currently available in Belgium. After the course, I plan to add a few articles based on the new material I prepared for this event.

I am looking forward to seeing you in Antwerp!

Update 2014-10-29: Thank you to everyone who participated. I had a pleasant stay in Antwerp and much enjoyed the discussions. It gave me some new ideas of things to test out. By request, here are the slides from the sessions:

Application Statistics for Triolith

What is Triolith being used for? We have some idea by looking at the computer time applications we get through SNAC, and the support cases we work on also tell us something about what people are doing on Triolith, but the most comprehensive picture is likely to be painted by actually analyzing, in real time, what is running on the compute nodes. During the last service stop, I had the opportunity to examine the low-level logging data of a sizeable set of Triolith compute nodes. I managed to collect a sample of one month of log data from 186 nodes. To get a sense of the scale: it expands to about 2 TB of data uncompressed. Fortunately, when you have access to a supercomputer, you can attack the problem in parallel, so with 186 compute nodes unleashed at the task, it took just 10 minutes.

What you see below is an estimate of the fraction of time that the compute nodes spent running different applications. The time period is August 17th to September 17th, but the relative distribution has been rather stable over the previous months.

Application          Share of core hours (%)
VASP 36.6%
Gromacs 10.0%
NEK5000 8.2%
[Not recognized] 6.7%
Ansys (Fluent) 3.3%
Gaussian 3.3%
Dalton 3.1%
C2-Ray 2.4%
Nemo 2.2%
NAMD 2.2%
Python 1.8%
a.out 1.5%
OpenFOAM 1.1%
CPMD 0.9%
EC-Earth 0.9%
KKR 0.8%
Spectral 0.8%
Rosetta 0.7%
CP2K 0.6%
Octopus 0.6%
RSPt 0.5%

Unsurprisingly, we find VASP at the top, accounting for about a third of the computing time. This is the reason why I spend so much time optimizing and investigating VASP: each per cent of performance improvement is worth a lot of core hours on the cluster. We also have a good deal of molecular dynamics jobs (Gromacs, LAMMPS, NAMD, CPMD, ca 18%) and a steady portion of computational fluid dynamics jobs (Fluent + NEK5000 + OpenFOAM, ca 12%). Quantum chemistry programs, such as Gaussian, GAMESS, and Dalton (8%), also catch the eye in the list, as expected. This was a low month for Gaussian (3%); its usage is often higher (6-7%), competing for a place in the top 5.

It would be interesting to compare this with other supercomputing sites. When talking to people at the SC conference, I get the impression that VASP is a major workload at basically all academic sites, although perhaps not as much as 30%. In any case, statistics like these are going to be useful for planning application support and the design of future clusters that we buy.


Below follow some technical observations for people interested in the details behind getting the numbers above.

The data is based on collectl process data, but at the logging level you only see the file name of the binary, so you have to identify a given software package just by the names of its running binaries. This is easy for certain programs, such as VASP, whose binaries are always called vasp-something, but more difficult for others. You can, for example, find the notorious a.out in the list above, which could be any kind of code compiled by the users themselves.

A simple check of the comprehensiveness of the logging is to aggregate all the core hours encountered in the sample and compare with the maximum amount possible (186 nodes running for 24 hours a day for 30 days). This number is around 75-85% with my current approach, which means that something might be missing, as Triolith is almost always utilized to >90%. I suspect it is a combination of the sampling resolution at the collectl level and the fact that I filter out short-running processes (less than 6 minutes) in the data-processing stage to reduce noise from background system processes. Certain software packages (like Wien2k and RSPt) run iteratively by launching a new process for each iteration, creating many short-lived processes inside a single job. Many of these are probably not included in the statistics above, which could account for the shortfall.
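To illustrate the kind of post-processing involved, suppose the per-process records extracted from the logs have already been flattened into lines of “binary-name core-seconds” in a file processes.txt (a hypothetical intermediate format, not what collectl writes directly). The aggregation step is then essentially a one-liner:

awk '{sum[$1] += $2} END {for (app in sum) printf "%-20s %10.1f\n", app, sum[app]/3600}' processes.txt | sort -k2 -rn

which prints the total core hours attributed to each binary, sorted from largest to smallest.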

Intel Releases New “Haswell” Xeon Server Processors

On September 8, Intel finally lifted the veil and revealed the new Xeon E5 server processors based on the “Haswell” architecture. These are the processors that you are likely to find in new supercomputer and cluster installations during the next few years.

The main improvements are:

  • Up to 18 cores per processor socket. I expect that the mainstream configuration will be 10-14 cores/socket, so your typical 2-socket compute node will have 20-28 cores, and twice that number of threads with hyper-threading enabled.
  • Higher memory bandwidth with up to 2133 MHz DDR4 memory. Early benchmarks suggest a 40% improvement in bandwidth versus Triolith-style hardware (based on the “Sandy Bridge” platform). This is especially important for electronic structure codes, which tend to be limited by memory bandwidth.
  • Improved vectorization with AVX2 instructions. This can theoretically double the floating-point arithmetic performance, but in reality there are diminishing returns for longer vectors beyond some point. We expect +25% out of it at most. You will need to recompile your codes, or link to AVX2-enabled libraries such as Intel’s MKL, to use this feature.
  • Faster single-core performance. Fortunately, the processor cores are still getting faster. Clock frequencies are not increasing, but according to Intel, the Haswell cores have about 10% better throughput of instructions per clock cycle. This mainly comes from improvements in caches and better branch prediction, so it might not necessarily improve an already well-tuned and vectorized code.

Further reading: a longer technical overview of the Xeon E5 v3 series processors is available online, and the older review of the Haswell microarchitecture is still relevant.

Upcoming Haswell-based systems in Sweden

So when can you get access to hardware like this as a supercomputing user in Sweden?

  • PDC in Stockholm has just announced that they will be installing a new 1+ petaflops Cray XC40 system to replace the “Lindgren” Cray XE6 system. It will be based on the 16-core variant of these new processors, for a total of 32 cores per node. The system will be available to SNIC users from January 1st, 2015.
  • NSC will install a new cluster dedicated to weather forecasting in late 2014, based on the 8-core variant. This system belongs to SMHI and will not be available to SNIC users, but it will be an interesting configuration, with a very good balance between compute power, memory bandwidth, interconnect performance and storage. While optimized for weather forecasting, it could also perform very well on electronic structure workloads.

I expect to be able to work on VASP installations and run benchmarks on both of these systems during fall/winter, so please check in here later.

Peak VASP Calculations Ahead?

In June, I attended the International Supercomputing Conference in Leipzig, Germany. ISC is the second largest conference in the high performance computing field. Attending the big supercomputing conferences is always a good time to meditate on the state of the field and the future.

Instead of starting from the hardware point of view, I would like to begin from the other end: the scientific needs and how we can deliver more high-performance computing. At NSC, we believe in the usefulness of high-performance computing (HPC). So do our users, judging from the amount of computer time being applied for by Swedish scientists. When we compile the statistics, about three times more resources are asked for than we can provide. Clearly, the only way we can meet the need is to deliver more HPC capability in the future. The question, then, is how. There is always the possibility of increased funding for buying computers. Within our current facilities, we could accommodate many more systems of Triolith’s size, but realistically, I do not see funding for HPC systems increasing manyfold over the coming years, even though the potential benefits are great (see for example the recent report on e-infrastructure from the Swedish Research Council).

The traditional way has rather been to rely on new technology to bring us more compute power for the same amount of money. The goal is better price/performance, or more compute with the same amount of energy, which is related to the former. Fortunately, that approach has historically been very successful. Over the years, we have seen a steady stream of higher clocked CPU cores, multi-core servers, better memory bandwidth, and lower-latency networks being introduced. Each time we installed a new machine, our users could count on noticeable performance improvements, using the same simulation software as before, sometimes without even changing the underlying source code at all.

Thus, for a long time, the performance improvements have been essentially for free for our HPC users. I suspect, though, that this is a luxury that will come to an end. Why? Because currently, the way forward to more cost-effective computing, as envisioned by the HPC community, is:

  • Many-core architectures, such as IBM’s BlueGene and Intel’s Xeon Phi processors.
  • Vectorization, such as computing on GPUs or with SIMD processors.
  • Special-purpose hardware, such as custom SoCs, FPGAs, and ASICs.

Usually, such technologies are mentioned in the context of exascale computing, but it is important to realize that we would have to use the same technology if we wanted to build a smaller supercomputer for a fraction of the current cost. More concretely, what could happen in the coming years is that there will be a new cluster with maybe ten times the floating-point capability of today, but in the form of compute nodes with e.g. 256 cores and as many as 1000 threads. The key point, though, is that the speed of an individual core will most likely be lower than on our current clusters. Thus, to actually get better performance out of it, you will need excellent parallel scalability just to fully use a single compute node. The problem is that 1) there are few mass-market codes today that have this kind of scalability, and 2) many current scientific models are simply not meaningful to run at the scale required to fully utilize such a machine.

In such a scenario, we could arrive at a “peak-VASP” situation, where traditional HPC algorithms, such as dense linear algebra operations and fast Fourier transforms, simply will not run any faster on the new hardware, which would essentially halt what has so far been seen as a natural speed development. This could happen before any end of Moore’s law comes into play. It makes me think that there might be trouble ahead for traditional electronic structure calculations based on DFT unless there is a concerted effort to move to new hardware architectures. (This is also one of the conclusions in the Research Council report mentioned earlier.)

So what could such an effort look like?

  1. Putting resources into code optimization and parallelization is one obvious approach. SeRC and the application experts in the Swedish e-science community have been very active in this domain. There is clearly potential here, but my understanding is that it has always been difficult to get done in practice, due to the funding structure in science (you get money for doing science, not “IT”), and also due to shortages of personnel even when funding is actually available. There is also a limit to how much you can parallelize, with diminishing returns as you put more effort into it. So I think it can only be part of the solution.
  2. Changing to scientific models that are more suitable to large-scale computations should also be considered, as this would amplify the results of the work done under point #1. This is something that has to be initiated from within the science community itself. There might be other methods to attack electronic structure problems which would be inconceivable at a smaller scale, but competitive at a large scale. I think the recent resurgence of interest in Monte Carlo and configuration-interaction methods is a sign of this development.
  3. In the cases where the models cannot be changed, the hardware itself has to change. That means greater awareness from funding agencies of the importance of high-throughput computing. By now, computational materials science is a quite mature field, and perhaps the challenge today is not only to simulate bigger systems, but rather to apply the methods at a large scale, which could mean running millions of jobs instead of one job with a million cores. The insight that this can produce just as valuable science as big MPI-parallel calculations could open the door to investments in new kinds of computational resources.