How to Compile VASP on NSC’s Triolith

These instructions are for the 5.3.3 version, but I expect the instructions to be applicable to the minor versions preceding and following 5.3.3.

First, download the prerequisite source tarballs from the VASP home page:

http://www.vasp.at/ 

You need both the regular VASP source code, and the supporting “vasp 5” library:

vasp.5.3.3.tar.gz
vasp.5.lib.tar.gz

I suggest making a new directory, called e.g. 5.3.3, where you download and expand them. You would type commands approximately like this:

mkdir 5.3.3
cd 5.3.3
(download)
tar zxvf vasp.5.3.3.tar.gz
tar zxvf vasp.5.lib.tar.gz
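
After unpacking, a quick check that the two source directories are in place (they should be called vasp.5.3 and vasp.5.lib):

ls -d vasp.5.3 vasp.5.lib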

Currently, you want to load these modules:

intel/12.1.4
impi/4.0.3.008
mkl/10.3.10.319

These are bundled in the following module:

module load build-environment/nsc-recommended
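
If you prefer to load the modules individually instead of using the bundle, the equivalent commands would be:

module load intel/12.1.4
module load impi/4.0.3.008
module load mkl/10.3.10.319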

VASP 5 lib

Compiling the VASP 5 library is straightforward. It contains some timing and I/O routines needed by VASP, together with LINPACK. My heavily edited makefile looks like this:

.SUFFIXES: .inc .f .F
#-----------------------------------------------------------------------
# Makefile for VASP 5 library on Triolith
#-----------------------------------------------------------------------

# C-preprocessor
CPP     = gcc -E -P -C -DLONGCHAR $*.F >$*.f
FC= ifort

CFLAGS = -O
FFLAGS = -Os -FI
FREE   =  -FR

DOBJ =  preclib.o timing_.o derrf_.o dclock_.o  diolib.o dlexlib.o drdatab.o


#-----------------------------------------------------------------------
# general rules
#-----------------------------------------------------------------------

libdmy.a: $(DOBJ) linpack_double.o
    -rm libdmy.a
    ar vq libdmy.a $(DOBJ)

linpack_double.o: linpack_double.f
    $(FC) $(FFLAGS) $(NOFREE) -c linpack_double.f

.c.o:
    $(CC) $(CFLAGS) -c $*.c
.F.o:
    $(CPP) 
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
.F.f:
    $(CPP) 
.f.o:
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f

Note the addition of the “-DLONGCHAR” flag on the CPP line. It activates the longer input format for INCAR files, e.g. you can have MAGMOM lines with more than 256 characters. Now compile the library with the “make” command and check that you have the “libdmy.a” output file. Leave the file here, as the main VASP makefile will include it directly from here.
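
To summarize, the library build boils down to something like this:

cd vasp.5.lib
(edit the makefile as above)
make
ls -l libdmy.a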

VASP 5 binary

Preparations

I only show how to build the parallel version with MPI and SCALAPACK here, as that is what you should run on Triolith. Navigate to the “vasp.5.3” directory where the main source code is:

cd ..
cd vasp.5.3

Before we start, we want to think about how to find the external libraries that we need. These are:

  • BLAS/LAPACK (for basic linear algebra)
  • FFT library (for fast Fourier transform from reciprocal to real space)
  • MPI (for parallel communication)
  • SCALAPACK (for parallel linear algebra, e.g. orthogonalization of states)

For BLAS/LAPACK, we are going to use Intel’s Math Kernel Library (“MKL” henceforth). The easiest way to link to MKL at NSC is by adding the two following flags to the compiler command:

ifort -Nmkl -mkl=sequential ...

For fast Fourier transforms, we could use the common FFTW library with VASP, but MKL actually contains its own optimized FFTs together with an FFTW interface, so we can use these instead. Provided that we link with MKL, which we are already doing in order to get BLAS/LAPACK, we do not need to do anything more. The linker should pick up the FFTW subroutines automatically.
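
MKL ships the FFTW wrapper interface alongside its own FFTs; a quick way to convince yourself that it is there is to list the wrapper headers (this assumes the mkl module sets the MKL_ROOT variable, which the makefile edits below also rely on):

ls $MKL_ROOT/include/fftw/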

For MPI, we are going to use Intel’s MPI library. We have already loaded the “impi/4.0.3.008” module, so all we have to do is add the “-Nmpi” flag to the compiler command:

ifort -Nmpi ...

We don’t need to add explicit paths to any MPI libraries, or use the special “mpif90” compiler wrapper.

Editing the makefile

I suggest that you start from the Linux/Intel Fortran makefile:

cp makefile.linux_ifc_P4 makefile

It is important to realize that the makefile is split into two parts and is intended to be used in an overriding fashion: the first half holds the settings for the serial version, and the second half holds the parallel settings. Since we want the parallel version, enable the definitions of FC, CPP, etc. in the second half of the makefile; these will then override the settings for the serial version.

Start by editing the Fortran compiler and its flags:

FC=ifort -I$(MKL_ROOT)/include/fftw 
FFLAGS =  -FR -lowercase -assume byterecl -Nmpi 

We need to add “-Nmpi” to get proper linking with Intel MPI at NSC. Then, we change the optimization flags:

OFLAG=-O2 -ip -xavx 

This keeps the optimization level on the safe side, while making sure that we get AVX optimizations. Include MKL with FFTW like this:

BLAS = -mkl=sequential
LAPACK = 

We use the serial version of MKL, without any multithreading, since VASP already runs one MPI rank on every core. Set the NSC-specific linking options for MKL and MPI:

LINK    = -Nmkl -Nmpi 

Uncomment the CPP section for the MPI parallel VASP:

CPP    = $(CPP_) -DMPI  -DHOST=\"LinuxIFC\" -DIFC \
     -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
     -DMPI_BLOCK=8000 -Duse_collective -DscaLAPACK \
     -DRPROMU_DGEMV  -DRACCMU_DGEMV

Change it to something like this:

CPP     = $(CPP_) -DMPI -DHOST=\"TRIOLITH-BUILD01\" -DIFC \
          -DCACHE_SIZE=4000  -DPGF90 -Davoidalloc -DNGZhalf \
          -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK \
          -DRPROMU_DGEMV  -DRACCMU_DGEMV  -DnoSTOPCAR

CACHE_SIZE is only relevant for the built-in Furth FFTs, which we do not use. The HOST variable is written out at the top of the OUTCAR file. It can be anything that helps you identify this particular build of VASP. The MPI_BLOCK variable needs to be set higher for best performance on Triolith. And finally, “-DnoSTOPCAR” disables the ability to stop a calculation by using the STOPCAR file. We do this to reduce the file I/O load on the global file systems; otherwise, each VASP process has to check this file in every SCF iteration.

Finally, we enable SCALAPACK from MKL:

SCA= -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64

And the parallelized version of the fast Fourier transforms with FFTW bindings:

FFT3D   = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

Note that we do not need to link to FFTW explicitly, since it is included in MKL. Finally, we uncomment the last library section:

LIB     = -L../vasp.5.lib -ldmy  \
      ../vasp.5.lib/linpack_double.o \
      $(SCA) $(LAPACK) $(BLAS)

We have to do this to include the “$(SCA)” variable. The full makefile can be found here on Triolith:

/software/apps/vasp/5.3.3-18Dec12/build01/makefile
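
If you want to double-check your own edits, you can simply diff against that reference copy:

diff makefile /software/apps/vasp/5.3.3-18Dec12/build01/makefile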

Compiling

VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:

make

If you really want to speed it up, you can try something like:

make -j4; make -j4; make -j4; make -j4;

Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”.
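
A minimal bash sketch of such a loop could look like this (the number of passes is arbitrary):

for i in 1 2 3 4; do make -j4; done
make    # a final serial pass to catch anything left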

Running

When you compile according to these instructions, there is no need to set LD_LIBRARY_PATHs and such. Instead, the ifort compiler will hard-code all library paths by using the RPATH mechanism and write information into the binary file about which MPI version you used. This means that you can launch VASP directly like this in a job shell:

mpprun /path/to/vasp

Mpprun will automatically pick up the correct number of processor cores from the queue system and launch your vasp binary using Intel’s MPI launcher.
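
A minimal job script could therefore look something like this (node count, wall time, and the binary path are placeholders for your own values):

#!/bin/bash
#SBATCH -N 4
#SBATCH --exclusive
#SBATCH -t 12:00:00

mpprun /path/to/vasp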

K-point Parallelization in VASP, Part 2

Previously, I tested the k-point parallelization scheme in VASP 5.3 for a small system with hundreds of k-points. The outcome was acceptable, but less than stellar. Paul Kent (who implemented the scheme in VASP) suggested that it would be more instructive to benchmark medium to large hybrid calculations with just a few k-points, since this was the original use case, and consequently where you would see the most benefit. To investigate this, I ran a 63-atom MgO cell with the HSE06 functional and 4 k-points over 4 to 24 nodes:

K-point parallelization for MgO system

A suitable number of bands here is 192, so the maximum number of nodes we could expect to use with standard parallelization is 12, due to the fact that 12 nodes x 16 cores/node = 192 cores. And we do see that KPAR=1 flattens out at 1.8 jobs/h on 12 nodes. But with k-point parallelization, the calculation can be split into “independent” groups, each running on 192 cores. This enables us, for example, to run the job on 24 nodes using KPAR>=2, which in this case translates into a doubling of speed (4.0 jobs/h), compared to the best case scenario without k-point parallelization.

So there is indeed a real benefit for hybrid calculations of cells that are small enough to need a few k-points. And remember that in order for the k-point parallelization to work correctly with hybrids, you should set:

NPAR = total number of cores / KPAR.
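
For example, a hypothetical run on 24 Triolith nodes (24 x 16 = 384 cores) with KPAR=2 would then use NPAR = 384 / 2 = 192.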

VASP, ELPA, Lindgren and Triolith

So, can the ELPA library improve upon VASP’s SCALAPACK bottleneck?

Benchmarking of the ELPA-enabled version of VASP was performed on PDC’s Lindgren (a Cray XE6) and Phase 1 of Triolith at NSC (an HP SL6500-based cluster with Xeon E5 + FDR Infiniband). For this occasion, I developed a new test case consisting of a MgH2 supercell with 1269 atoms. The structure is an experimentally determined crystal structure, but with a few per cent of hydrogen vacancies. I feel this is a more realistic test case than the NiSi-1200 cell used before. Ideally, we should see decent scaling up to about 1000 cores / 64 nodes for this simulation. As usual, we expect the “EDDAV” subroutine to eventually become dominant. The number of bands is 1488, which creates a 1488x1488 matrix that needs to be diagonalized in the bottleneck phase. This matrix size is actually far smaller than what ELPA was intended for, which seems to be on the order of 10,000-100,000. So perhaps we will not see the true strength of ELPA here, but hopefully it can alleviate some of the pathological behavior of SCALAPACK.

Triolith

First out is Triolith, with benchmarks for 4-64 compute nodes using both 8 and 16 cores per node. I keep NPAR=nodes/2, according to earlier findings. The recommended way to run with 8c/node at NSC is to invoke a special SLURM option – that way you don’t have to give the number of cores explicitly to mpprun:

#SBATCH --ntasks-per-node 8

Scaling of MgH2 on Triolith with and without ELPA

We find that the standard way of running VASP, with 16c/node and SCALAPACK, produces a top speed of about 16 jobs/h using 48 compute nodes, and going further actually degrades performance. The ELPA version, however, is able to maintain scaling to at least 64 nodes. In fact, the scaling curve looks very much like what you get when running VASP with SCALAPACK and 8c/node. Fortunately, the benefits of ELPA and 8c/node seem to be additive, meaning that ELPA wins over SCALAPACK on 48-64 nodes even at 8c/node. In the end, the overall performance improvement is around 13% for the 64-node job. (While not shown here, I also ran with 96-128 nodes, and the difference there with ELPA is a stunning +30-50% in speed, but I consider the total efficiency too low to be useful.)

Lindgren

Now, let’s look at Lindgren, 8-64 compute nodes, using either 12 cores per node or the full 24 cores. In the 12c case, I allocated three cores per NUMA node, using

aprun  -N 12 -S 3 ...

I used NPAR=compute nodes here, like before.

Scaling of MgH2 on Lindgren with and without ELPA

On the Cray machine, we do not benefit as much from ELPA as on Triolith. The overall increase in speed on 64 nodes is 5%. Instead, it is essential to drop down to 12c/node to get good scaling beyond 32 nodes for this job. Also, note the difference in scale on the vertical axis. Triolith has much faster compute nodes! Employing 64 nodes gives a speed of 24.3 jobs/h vs 14.2 jobs/h, that is, a 1.7x speed-up per node, or a 2.5x speed-up on a per-core basis.

Parallel scaling efficiency

Finally, it is instructive to compare the parallel scaling of Lindgren and Triolith. One of the strengths of the Cray system is the custom interconnect, and since the compute nodes are also slower than on Triolith, there is potential to realize better parallel scaling, when we normalize the absolute speeds.

Comparing scaling of MgH2 on Lindgren and Triolith

We find, however, that the scaling curves are almost completely overlapping in the range where it is reasonable to run this job (4 to 64 nodes). The FDR Infiniband network is more than capable of handling this load, and the Cray interconnect offers no special advantage at this relatively low-end scale.

Compiling VASP With the ELPA Library

Previously, I showed how SCALAPACK is a limiting factor in the parallel scaling of VASP. VASP 5.3.2 introduced support for the ELPA library, which can now be enabled in the subspace rotation phase of the program. You do this by compiling with the “-DELPA” preprocessor flag. In the VASP makefiles, there is a variable called CPP where this flag can be added:

CPP     = $(CPP_)  -DHOST=\"NSC-ELPATEST-B01\" -DMPI -DELPA \
...

In addition, you need to get access to ELPA (by registering on their site) and add the source files to the makefile. I did it like this:

ELPA = elpa1.o elpa2.o elpa2_kernels.o

vasp: $(ELPA) $(SOURCE) $(FFT3D) $(INC) main.o 
  rm -f vasp
  $(FCL) -o vasp main.o  $(ELPA) $(SOURCE) $(FFT3D) $(LIB) $(LINK)

The ELPA developers recommend that you compile with “-O3” and full SSE support, so I put these special rules at the end of the makefile.

# ELPA rules
elpa1.o : elpa1.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
elpa2.o : elpa2.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
elpa2_kernels.o : elpa2_kernels.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)

(Here, -xavx optimizes for Triolith’s Sandy Bridge CPUs.)

With this procedure, I was able to compile VASP with ELPA support. As far as I can see, there is no visible confirmation of ELPA being used in the OUTCAR file or on stdout. The output looks like regular VASP, but with small fluctuations in the last decimals. I also saw some crashes when running on just a few nodes (< 4). Perhaps ELPA is not as robust in this regime, since it is not the intended usage scenario.

(Benchmarks of ELPA on Lindgren and Triolith will follow in the next post.)

Testing the K-point Parallelization in VASP

VASP 5.3.2 finally introduced official support for k-point parallelization. What can we expect from this new feature in terms of performance? In general, you only need many k-points in relatively small cells, so up front we would expect k-point parallelization to improve time-to-solution for small cells with hundreds or thousands of k-points. We do have a subset of users at NSC, running big batches of these jobs, so this may be a real advantage in the prototyping stage of simulations, when the jobs are set up. In terms of actual job throughput for production calculations, however, k-point parallelization should not help much, as the peak efficiency is reached already with 8-16 cores on a single node.

So let’s put this theory to the test. Previously, I benchmarked the 8-atom FeH system with 400 k-points for this scenario. The maximum throughput was achieved with two 8-core jobs running on the same node, and the best time-to-solution was 3 minutes (20 jobs/h) with 16 cores on one compute node. What can k-point parallelization do here?

K-point parallelization for FeH system

KPAR is the new parameter which controls the number of k-point parallelized groups. KPAR=1 means no k-point parallelization, i.e. the default behavior of VASP. For each bar in the chart, the NPAR value has been individually optimized (and is therefore different for each number of cores). Previously, this calculation did not scale at all beyond one compute node (blue bars), but with KPAR=8 (purple bars), we can get close to linear (1.8x) speed-up going from 1 to 2 nodes, cutting the time-to-solution in half. As suspected, in terms of efficiency, the current k-point parallelization is not more efficient than the old scheme when running on a single node, which means that peak throughput remains the same at roughly 24 jobs/h per compute node. This is a little surprising, given that there should be overhead associated with running two jobs simultaneously on a node, compared to using k-point parallelization.
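
As a rough illustration: running on two nodes (32 cores) with KPAR=8 means the MPI ranks are split into 8 groups of 4 cores, each group working through roughly 400/8 = 50 of the k-points.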

What must be remembered, though, is that it is considerably easier to handle the file and job management for a series of sequential runs using KPAR than to juggle several jobs per node spread over many directories, so in this sense KPAR seems like a great addition with respect to workflow optimization.

New Version of VASP - 5.3.2

A new version of VASP was released recently. There are many important improvements in this version and I encourage all VASP users to check the full release notes on the VASP community page.

Among the highlights are:

  • K-point parallelization (this should improve “scaling” for small jobs)
  • Molecular dynamics at constant pressure
  • Spin-orbit coupling calculation with symmetry
  • Subspace diagonalization by means of the ELPA library (this may improve scaling for wide parallel jobs running on e.g. PDC’s Lindgren).

The first installation of VASP 5.3.2 binaries on NSC is available in:

/software/apps/vasp/5.3.2-13Sep12/default/

Installations for Lindgren at PDC will follow shortly. The binaries are called vasp-[gamma,half,full] as usual. They ran through my test suite without problems, but I noticed that on Triolith, some other calculations converged to different solutions with the high optimization settings previously used to compile 5.2.12, so I have dropped the global optimization level down to -O1 for the Triolith installation until things get sorted out. The overall performance drop is only 5%, at least for standard PBE-type calculations.
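
Launching one of them in a job script works as before, e.g. for the “half” version:

mpprun /software/apps/vasp/5.3.2-13Sep12/default/vasp-half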

The plan for 5.3.2 is to produce two more versions:

  • A “stable”, alternative build, based on OpenMPI, and possibly a different numerical library, that can be used for comparison if you suspect trouble with your calculations.
  • A “fast” version tuned for maximum parallel performance, including ELPA support.

There have also been requests for versions with cell optimization restricted to certain directions, like z-only or xy-only. Apparently, this is an established “hack”, outlined on the VASP forums. To me, however, it seems better to implement this in the code by a set of new INCAR tags. This way, you can cover all combinations: x, xy, z, etc., without producing six different binaries. Hopefully, it will not be too difficult to make the changes.

VASP Hybrid Calculations on Triolith

VASP 5 introduced DFT hybrid functionals like PBE0. The Hartree-Fock calculations add a significant amount of computational time, however, and in addition, these algorithms require parallelization with NPAR = number of cores, which is not as effective. In my experience, we are also haunted by SCF convergence problems, so you need to experiment with the other SCF algorithms. So what can we expect from Triolith here?

Parallel scaling MgO hybrid calculation

The chart above shows benchmark runs for a 63-atom MgO cell with Hartree-Fock turned on (corresponding to PBE0). ALGO=All is used, and NPAR (equal to the number of cores) had to be set separately for each case. We find, not surprisingly, that we have good parallel scaling up to 4 compute nodes (roughly 1 atom per core). It is possible to crank up the speed by employing more compute nodes, but only by using 8-12 MPI ranks per node and idling half the cores. We have 192 bands in this calculation, so the maximum speed should be achieved with 16 nodes (16 x 12 ranks/node = 192 ranks), which is also what we find (2.5 jobs/h).

These results should be compared with running the same job on Neolith, where a 16-node run (128 cores) reached 0.44 jobs/h, so Triolith is again close to 6 times faster on a node-by-node basis.

Small VASP Jobs on Triolith

I have gotten requests about benchmarks of smaller simulations, rather than big supercells with hundreds of atoms. Below are the results for an 8-atom FeH cell with 64 bands and 400 k-points. Remember that VASP does not have parallelization over k-points, so it is very challenging to get good parallel performance in this case. Based on the good rule of thumb of using no more than 1 core per atom, or 4 bands per core, one should expect parallel scaling only within a compute node with 16 cores, but nothing beyond that.

Parameter study of FeH cell

This is also what I see when running full NPAR/NSIM tests with 4-32 ranks, as seen in the chart. Peak performance is achieved with 16 cores on one compute node, using NPAR=4. Using two compute nodes is actually slower, even when using the same number of MPI ranks. This implies that the performance is limited by communication and synchronization costs, and not by memory bandwidth (otherwise we would have seen an improvement in speed when spreading 16 ranks over two nodes instead of one). An interesting finding is that if you are submitting many jobs like this to the queue and are mainly interested in throughput rather than time to solution, then the optimal approach is to run two 8-core jobs on the same compute node.

The NSIM parameter does not seem to be as influential here, because we have so few bands. The full table is shown below:

Parameter study of FeH cell

I also checked the influence of LPLANE=.FALSE. These results are not shown, but the difference was within 1%, so it was likely just statistical noise.

Running VASP on Triolith

The test pilot phase of our new Triolith cluster has now started, and our early users are on the system compiling and running codes. The hardware has been surprisingly stable so far, but we still have a lot to do in terms of software. Don’t expect all software presently found on Matter and Kappa to be available immediately, because we have to recompile everything for the new Xeon E5 processors.

Regarding materials science codes, I have put up preliminary versions of VASP, based both on the original source and on our collection of SNIC patches. I am also working on putting together a good build of Quantum Espresso. We are seeing performance gains as expected, but it will remain a formidable challenge to make many codes scale properly to 16 cores per node and hundreds of compute nodes.

These are my quick recommendations for VASP based on initial testing:

Nodes    NPAR   Cores/node
1        2      16
2        2      16
4        2      16
8        4      8
16       8      8
32       16     8
64-128   32     8

(Wider jobs remain to be tested…)

NPAR, NSIM, and LPLANE

It looks like the same rules for NPAR apply as on our previous systems. The quick and easy rule of NPAR=compute nodes can be used, but you should see a slight improvement when decreasing NPAR somewhat from this value. For NSIM, however, there is a difference compared to our previous systems: you should set NSIM=1, which gains a few percent of extra speed, especially for smaller jobs (1-4 nodes). Finally, I looked at the LPLANE tag, but saw no detectable performance increase from setting LPLANE=.TRUE., presumably because the bandwidth of the FDR Infiniband network is more than sufficient to support the FFT operations that VASP does.
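
Put together, a hypothetical job on 8 Triolith nodes following the table above would get its parallelization tags set like this (just a sketch; NPAR=4 is taken from the table and NSIM=1 from the paragraph above):

cat >> INCAR << EOF
NPAR = 4
NSIM = 1
EOF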

Number of cores per node

With Neolith, Kappa and Matter, it was always advantageous to run with 8 MPI ranks on each node, so that you used all available cores. On Triolith, however, going from 8 to 16 cores per node often gives you little extra performance: on a single compute node, 8 to 16 cores gives +30-50%, but this drops to around 10% using 4 nodes, and to nothing when running on more than 8 nodes. For really wide jobs (>16 nodes), performance might even increase when reducing the number of cores used from 16 per node to 8 per node. To run this way, you can use the “--nranks” flag when launching VASP with “mpprun”, or simply ask the queue system for 8 tasks per node, like this:

#SBATCH -N 32
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
...
mpprun /software/apps/vasp/5.2.12.1/default/vasp-gamma

Note that we have asked for 32 compute nodes (meaning 32*16 = 512 cores), but we are actually running on only 256 cores spread out over all 32 nodes, because the queue system automatically places the job so that each node gets 8 MPI ranks.

The reason why we see this behavior is a combination of three factors:

  • VASP calculations are limited by the available memory bandwidth, not the number of FLOPS.
  • The effective memory bandwidth per core has decreased with the “Sandy Bridge” processor architecture, since each FPU can potentially do twice as many FLOPS per cycle.
  • Adding more cores creates overhead in the MPI communication layer.

So 8-12 cores/node is enough to max out the memory bandwidth in most scenarios. And since the overhead associated with using many MPI ranks increases nonlinearly with the number of ranks, there should logically be a crossover point where running on fewer cores/node gives you better parallel performance. My studies of big NiSi supercells (504-1200 atoms) suggest that this happens around 32 nodes. For calculations with hybrid functionals, it happens earlier, around 8 nodes. I plan to make further investigations to find out if this applies to all types of VASP jobs.

Triolith Visualized

Everyone is anxiously waiting for delivery of our new clusters: Triolith (for SNIC), Krypton (for SMHI), and Skywalker (for SAAB). Triolith will be the new capability cluster for academic users, which we hope will be the fastest supercomputer in Sweden once it is fully online. Yesterday, the smallest system for SAAB arrived. Unfortunately, Krypton (for SMHI) and Triolith are delayed and will arrive later.

In pictures, this is how Triolith relates to Neolith, the system it will replace.

Number of cores in Triolith vs Neolith

Each dot in this picture is a processor core. Triolith will have 1200 compute nodes with 19200 cores – compare this to the gray area corresponding to Neolith (6400 cores). However, this picture does not capture the true performance improvement, because each core/compute node is also much faster. Taking this into account, the difference in compute power when running a mix of big VASP jobs is 9.6x per node, which in total equals a 14.4x improvement in throughput for the whole cluster:

Compute power Triolith vs Neolith

Other codes might not see as big improvements, but we expect at least a factor of 3x on a per node basis, by combining general improvements in IPC, AVX vector instructions, and better memory bandwidth.