How to Compile VASP on Cray XE6

Here are some instructions for making a basic installation of VASP 5.3.3 on the Cray XE6. They apply specifically to the Cray XE6 at PDC called “Lindgren”, but Cray provides a similar environment on all of its machines, so they might be helpful for other Cray sites as well.

First, download the prerequisite source tarballs from the VASP home page:

http://www.vasp.at/ 

You need both the regular VASP source code, and the supporting “vasp 5” library:

vasp.5.3.3.tar.gz
vasp.5.lib.tar.gz

I suggest making a new directory called e.g. vasp.5.3.3, where you download and expand them. The commands would look approximately like this:

mkdir 5.3.3
cd 5.3.3
(download)
tar zxvf vasp.5.3.3.tar.gz
tar zxvf vasp.5.lib.tar.gz

Which compiler?

The traditional compiler for VASP is Intel’s Fortran compiler (“ifort”). Version 12.1.5 of ifort is now the “official” compiler, the one which the VASP developers use to compile the program. Unfortunately, ifort is one of the few compilers which can compile the VASP source unmodified, since the code contains non-standard Fortran constructs. To compile with e.g. gfortran, pgi, or pathscale ekopath, which theoretically could generate better code for AMD processors, source code modifications are necessary. So we will stick with Intel’s Fortran compiler in this guide. On the Cray machine, the corresponding module is called “PrgEnv-intel”. Typically, PGI is the default preloaded compiler, so we have to swap compiler modules:

module swap PrgEnv-pgi PrgEnv-intel

Check which version of the compiler you have by typing “ifort -v”:

$ ifort -v
ifort version 12.1.5

If you have the “PrgEnv-intel/4.0.46” module loaded, it should state “12.1.5”.

Which external libraries?

For VASP, we need BLAS, LAPACK, SCALAPACK and the FFTW library. On the Cray XE6, these are usually provided by Cray’s own “libsci” library. This library is supposed to be specifically tuned for the Cray XE6 machine and should offer good performance.

Check that the libsci module is loaded:

$ module list
...
xt-libsci/11.1.00
...

Normally, you combine libsci with the FFTW library. But I would recommend using the FFT routines from MKL instead, since they gave 10-15% faster overall speed in my benchmarks. Recent versions of MKL come with FFTW3-compatible wrappers built in (you don’t need to compile them separately), so by linking with MKL and libsci in the correct order, you get the best of both worlds.

VASP 5 lib

Compiling the VASP 5 library is straightforward. It contains some timing and I/O routines needed by VASP, as well as LINPACK. My heavily edited makefile looks like this:

.SUFFIXES: .inc .f .F
#-----------------------------------------------------------------------
# Makefile for VASP 5 library on Cray XE6 Lindgren at PDC
#-----------------------------------------------------------------------

# C-preprocessor
CPP     = gcc -E -P -C -DLONGCHAR $*.F >$*.f
FC= ftn

CFLAGS = -O
FFLAGS = -O1 -FI
FREE   =  -FR

DOBJ =  preclib.o timing_.o derrf_.o dclock_.o  diolib.o dlexlib.o drdatab.o


#-----------------------------------------------------------------------
# general rules
#-----------------------------------------------------------------------

libdmy.a: $(DOBJ) linpack_double.o
    -rm libdmy.a
    ar vq libdmy.a $(DOBJ)

linpack_double.o: linpack_double.f
    $(FC) $(FFLAGS) $(NOFREE) -c linpack_double.f

.c.o:
    $(CC) $(CFLAGS) -c $*.c
.F.o:
    $(CPP) 
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
.F.f:
    $(CPP) 
.f.o:
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f

Note that the Fortran compiler is always called “ftn” on the Cray (regardless of which compiler module is loaded), and note the addition of the “-DLONGCHAR” flag on the CPP line. It activates the longer input format for INCAR files, e.g. you can have MAGMOM lines with more than 256 characters. Now compile the library with the “make” command and check that you have the “libdmy.a” output file. Leave the file here, as the main VASP makefile will include it directly from this directory.
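For reference, the whole library build then boils down to something like this (a minimal sketch, assuming you are standing in the top directory where the tarballs were unpacked):

cd vasp.5.lib
make
ls -l libdmy.a linpack_double.o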

Editing the main VASP makefile

I suggest that you start from the Linux/Intel Fortran makefile:

cp makefile.linux_ifc_P4 makefile

It is important to realise that the makefile is split into two parts, and is intended to be used in an overriding fashion. If you don’t want to compile the serial version, you should uncomment the definitions of FC, CPP, etc. in the second half of the makefile to enable parallel compilation. These will then override the settings for the serial version.

Start by editing the Fortran compiler and its flags:

FC=ftn -I$(MKL_ROOT)/include/fftw 
FFLAGS =  -FR -lowercase -assume byterecl

On Lindgren, I don’t get MKL_ROOT set by default when I load the PrgEnv-intel module, so you might have to set it yourself too:

MKL_ROOT=/pdc/vol/i-compilers/12.1.5/composer_xe_2011_sp1.11.339/mkl

The step above is site-specific, so you should check where the Intel compilers are actually installed. If you cannot find any documentation, try inspecting the PATH after loading the module.
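One way to do this is to ask the shell where ifort itself lives; in a standard Composer XE installation the mkl directory sits next to the compiler’s bin directory (this layout is an assumption on my part, so adjust to what you actually find):

which ifort
# prints something like .../composer_xe_2011_sp1.11.339/bin/intel64/ifort
ls -d /pdc/vol/i-compilers/12.1.5/composer_xe_2011_sp1.11.339/mkl
# if that directory exists, use it as MKL_ROOT in the makefile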

Then, we change the optimisation flags:

OFLAG=-O2 -ip 

Note that we leave out any SSE/architectural flags, since these are provided automatically by the “xtpe-mc12” module (make sure that it is loaded by checking the output of module list).
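A quick way to verify this (“module list” usually writes to stderr, hence the redirection):

module list 2>&1 | grep xtpe
# if it is not listed:
module load xtpe-mc12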

We do a similar trick for BLAS/LAPACK by providing empty definitions for them:

# BLAS/LAPACK should be linked automatically by libsci module
BLAS=
LAPACK=

We need to edit the LINK variable to include Intel’s MKL. I also like to make the linking step more verbose, to check that I am linking against the correct library. One simple way to do this is to ask the linker to report where it picks up the ZGEMM subroutine (it should come from Cray’s libsci, not MKL).

LINK = -mkl=sequential -Wl,-yzgemm_
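The -Wl,-yzgemm_ option makes the linker print a trace of where zgemm_ is referenced and defined during the final link. Since this easily drowns in the rest of the build output, one optional trick is to capture the log and grep it afterwards:

make 2>&1 | tee build.log
grep -i zgemm_ build.log
# the definition of zgemm_ should come from a libsci library, not from MKL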

Now, move further down in the makefile, to the MPI section, and edit the preprocessor flags:

CPP    = $(CPP_) -DMPI  -DHOST=\"PDC-REGULAR-B01\" -DIFC \
   -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
   -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK \
   -DRPROMU_DGEMV  -DRACCMU_DGEMV -DnoSTOPCAR

CACHE_SIZE is only relevant for the Furth FFTs, which we do not use. The HOST variable is written out at the top of the OUTCAR file. It can be anything which helps you identify this compilation of VASP. The MPI_BLOCK variable needs to be set higher for best performance on the Cray XE6 interconnect. And finally, “noSTOPCAR” will disable the ability to stop a calculation by using the STOPCAR file. We do this to reduce the file I/O load on the global file systems. (Otherwise, each VASP process would have to check this file in every SCF iteration.)

We will get Cray’s SCALAPACK linked in via libsci, so we set an empty SCA variable:

# SCALAPACK is linked automatically by libsci module
SCA= 

Then activate the parallelized version of the fast Fourier transforms with FFTW bindings:

FFT3D   = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

Note that we do not need to link to FFTW explicitly, since it is included in MKL.

Finally, we uncomment the last library section for completeness:

LIB     = -L../vasp.5.lib -ldmy  \
      ../vasp.5.lib/linpack_double.o \
      $(SCA) $(LAPACK) $(BLAS)

The full makefile is provided here.

Compiling

VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:

make

If you really want to speed it up, you can try something like:

make -j4; make -j4; make -j4; make -j4;

Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”.
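Such a bash loop could look like the minimal sketch below; it simply retries the parallel build until a full pass of make succeeds, and gives up after a handful of attempts:

for i in 1 2 3 4 5 6; do
    make -j4 && break
done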

Running

The Cray compiler environment produces statically linked binaries by default, since this is the most convenient way to run on the Cray compute nodes. To run on 384 cores (16 compute nodes), we just have to put something like the following in the job script:

aprun -n 384 /path/to/vasp

I recommend setting NPAR equal to the number of compute nodes on the Cray XE6. As I have shown before, it is possible to run really big VASP simulations on the Cray with decent scaling over 1000-2000 cores. If you go beyond 32 compute nodes, it is worth trying to run on only half the number of cores per node. So on a machine with 24 cores per node, you would ask for 768 cores (32 nodes), but actually run like this:

aprun -n 384 -N 12 -S 3 /path/to/vasp

Tuning VASP: Fast Fourier Transforms

This is the first post in a series about VASP tuning. Optimized Fast Fourier transform subroutines are one of the keystones of getting a fast VASP installation. When I did a similar study for the Matter cluster at NSC (which has Intel “Nehalem” processors) in 2011, I found that MKL was superior. Now, it is time to look at Triolith, which has processors of “Sandy Bridge” architecture. These processors have new 256-bit vector instructions (called “AVX”), which need to be exploited for maximum floating-point performance.

Basically, we have three choices of FFTs:

  • VASP’s built-in library by Jürgen Furthmüller, called “FURTH”. It is quite old now, but has the advantage that it comes with the VASP code, so we don’t have to rely on an external library; we can also recompile it for new architectures. For best performance, one has to tune the CACHE_SIZE preprocessor flag (usually a value between 0 and 32000).
  • The classical FFTW library. It can be optimized for many architectures by an automatic procedure. FFTW has support for AVX since version 3.3.1. On Triolith, we currently have version 3.3.2.
  • Intel’s own Math Kernel Library (“MKL”). Presumably, no one should be better at optimizing for Intel processors than Intel themselves? Intel is also very aggressive with processor support, and MKL often has support for unreleased processors. MKL gained AVX support in version 10.2, but versions 10.3 and higher use AVX instructions automatically.

I chose the PbSO4 cell with 24 atoms as the test system, as it is quite small and therefore more reliant on good FFT performance. Here are the results, without much further ado:

Runtimes of VASP with different FFT libraries

We can see that MKL 10.3 is the best choice here, with an average runtime of 61 seconds, 45% faster than FFTW 3.3.2. The results for FFT-FURTH do not come out well. I think one reason is that this library does not fully utilize AVX instructions on Sandy Bridge. The default optimization options in the makefile are very conservative (-O1/-O2), so we will not get the full benefit. It might be possible to compile it more aggressively and get better speed.

How to Compile VASP on NSC’s Triolith

These instructions are for the 5.3.3 version, but I expect them to be applicable to the minor versions preceding and following 5.3.3.

First, download the prerequisite source tarballs from the VASP home page:

http://www.vasp.at/ 

You need both the regular VASP source code, and the supporting “vasp 5” library:

vasp.5.3.3.tar.gz
vasp.5.lib.tar.gz

I suggest making a new directory called e.g. vasp.5.3.3, where you download and expand them. The commands would look approximately like this:

mkdir 5.3.3
cd 5.3.3
(download)
tar zxvf vasp.5.3.3.tar.gz
tar zxvf vasp.5.lib.tar.gz

Currently, you want to load these modules:

intel/12.1.4
impi/4.0.3.008
mkl/10.3.10.319

You can get them bundled in the following module:

module load build-environment/nsc-recommended

VASP 5 lib

Compiling the VASP 5 library is straightforward. It contains some timing and I/O routines needed by VASP, as well as LINPACK. My heavily edited makefile looks like this:

.SUFFIXES: .inc .f .F
#-----------------------------------------------------------------------
# Makefile for VASP 5 library on Triolith
#-----------------------------------------------------------------------

# C-preprocessor
CPP     = gcc -E -P -C -DLONGCHAR $*.F >$*.f
FC= ifort

CFLAGS = -O
FFLAGS = -Os -FI
FREE   =  -FR

DOBJ =  preclib.o timing_.o derrf_.o dclock_.o  diolib.o dlexlib.o drdatab.o


#-----------------------------------------------------------------------
# general rules
#-----------------------------------------------------------------------

libdmy.a: $(DOBJ) linpack_double.o
    -rm libdmy.a
    ar vq libdmy.a $(DOBJ)

linpack_double.o: linpack_double.f
    $(FC) $(FFLAGS) $(NOFREE) -c linpack_double.f

.c.o:
    $(CC) $(CFLAGS) -c $*.c
.F.o:
    $(CPP) 
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
.F.f:
    $(CPP) 
.f.o:
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f

Note the addition of the “-DLONGCHAR” flag on the CPP line. It activates the longer input format for INCAR files, e.g. you can have MAGMOM lines with more than 256 characters. Now compile the library with the “make” command and check that you have the “libdmy.a” output file. Leave the file here, as the main VASP makefile will include it directly from here.

VASP 5 binary

Preparations

I only show how to build the parallel version with MPI and SCALAPACK here, as that is what you should run on Triolith. Navigate to the “vasp.5.3” directory where the main source code is:

cd ..
cd vasp.5.3

Before we start, we want to think about how to find the external libraries that we need. These are:

  • BLAS/LAPACK (for basic linear algebra)
  • FFT library (for fast Fourier transform from reciprocal to real space)
  • MPI (for parallel communication)
  • SCALAPACK (for parallel linear algebra, e.g. orthogonalization of states)

For BLAS/LAPACK, we are going to use Intel’s Math Kernel Library (“MKL” henceforth). The easiest way to link to MKL at NSC is to add the following two flags to the compiler command:

ifort -Nmkl -mkl=sequential ...

For fast Fourier transforms, we could use the common FFTW library with VASP, but MKL actually contains its own optimized FFTs together with an FFTW interface, so we can use these instead. Provided that we link with MKL, which we are already doing in order to get BLAS/LAPACK, we do not need to do anything more. The linker should pick up the FFTW subroutines automatically.

For MPI, we are going to use Intel’s MPI library. We have already loaded the “impi/4.0.3.008” module, so all we have to do is add the “-Nmpi” flag to the compiler command:

ifort -Nmpi ...

We don’t need to add explicit paths to any MPI libraries, or use the special “mpif90” compiler wrapper.

Editing the makefile

I suggest that you start from the Linux/Intel Fortran makefile:

cp makefile.linux_ifc_P4 makefile

It is important to realize that the makefile is split into two parts, and is intended to be used in an overriding fashion. If you don’t want to compile the serial version, you should uncomment the definitions of FC, CPP, etc. in the second half of the makefile to enable parallel compilation. These will then override the settings for the serial version.

Start by editing the Fortran compiler and its flags:

FC=ifort -I$(MKL_ROOT)/include/fftw 
FFLAGS =  -FR -lowercase -assume byterecl -Nmpi 

We need to add “-Nmpi” to get proper linking with Intel MPI at NSC. Then, we change the optimization flags:

OFLAG=-O2 -ip -xavx 

The explicit -xavx flag is there to be on the safe side, so that we are sure to get AVX optimizations. Include MKL, with its FFTW interface, like this:

BLAS = -mkl=sequential
LAPACK = 

We use the serial version of MKL, without any multithreading, as VASP already runs one MPI rank per core with great success. Set the NSC-specific linking options for MKL and MPI:

LINK    = -Nmkl -Nmpi 

Uncomment the CPP section for the MPI parallel VASP:

CPP    = $(CPP_) -DMPI  -DHOST=\"LinuxIFC\" -DIFC \
     -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
     -DMPI_BLOCK=8000 -Duse_collective -DscaLAPACK \
     -DRPROMU_DGEMV  -DRACCMU_DGEMV

Change it to something like this:

CPP     = $(CPP_) -DMPI -DHOST=\"TRIOLITH-BUILD01\" -DIFC \
          -DCACHE_SIZE=4000  -DPGF90 -Davoidalloc -DNGZhalf \
          -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK \
          -DRPROMU_DGEMV  -DRACCMU_DGEMV  -DnoSTOPCAR

CACHE_SIZE is only relevant for the Furth FFTs, which we do not use. The HOST variable is written out at the top of the OUTCAR file. It can be anything which helps you identify this compilation of VASP. The MPI_BLOCK variable needs to be set higher for best performance on Triolith. And finally, “noSTOPCAR” will disable the ability to stop a calculation by using the STOPCAR file. We do this to reduce the file I/O load on the global file systems. (Otherwise, each VASP process would have to check this file in every SCF iteration.)

Finally, we enable SCALAPACK from MKL:

SCA= -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64

And the parallelized version of the fast Fourier transforms with FFTW bindings:

FFT3D   = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

Note that we do not need to link to FFTW explicitly, since it is included in MKL. Finally, we uncomment the last library section:

LIB     = -L../vasp.5.lib -ldmy  \
      ../vasp.5.lib/linpack_double.o \
      $(SCA) $(LAPACK) $(BLAS)

We have to do this to include the “$(SCA)” variable. The full makefile can be found here on Triolith:

/software/apps/vasp/5.3.3-18Dec12/build01/makefile

Compiling

VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:

make

If you really want to speed it up, you can try something like:

make -j4; make -j4; make -j4; make -j4;

Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”.

Running

When you compile according to these instructions, there is no need to set LD_LIBRARY_PATH and the like. Instead, the ifort compiler will hard-code all library paths by using the RPATH mechanism and write information into the binary about which MPI version you used. This means that you can launch VASP directly like this in a job script:

mpprun /path/to/vasp

Mpprun will automatically pick up the correct number of processor cores from the queue system and launch your vasp binary using Intel’s MPI launcher.
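For completeness, a minimal Triolith job script could then look something like the sketch below; the SBATCH lines (job name, number of nodes, walltime) are placeholders that you need to adapt to your own project and job:

#!/bin/bash
#SBATCH -J vasp-test
#SBATCH -N 4
#SBATCH --exclusive
#SBATCH -t 12:00:00

mpprun /path/to/vasp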

K-point Parallelization in VASP, Part 2

Previously, I tested the k-point parallelization scheme in VASP 5.3 for a small system with hundreds of k-points. The outcome was acceptable, but less than stellar. Paul Kent (who implemented the scheme in VASP) suggested that it would be more instructive to benchmark medium to large hybrid calculations with just a few k-points, since this was the original use case, and consequently where you would see the most benefit. To investigate this, I ran a 63-atom MgO cell with the HSE06 functional and 4 k-points over 4 to 24 nodes:

K-point parallelization for MgO system

A suitable number of bands here is 192, so the maximum number of nodes we could expect to use with standard parallelization is 12, since 12 nodes x 16 cores/node = 192 cores. And we do see that KPAR=1 flattens out at 1.8 jobs/h on 12 nodes. But with k-point parallelization, the calculation can be split into “independent” groups, each running on 192 cores. This enables us, for example, to run the job on 24 nodes using KPAR>=2, which in this case translates into a doubling of speed (4.0 jobs/h) compared to the best-case scenario without k-point parallelization.

So there is indeed a real benefit for hybrid calculations of cells that are small enough to need a few k-points. And remember that in order for the k-point parallelization to work correctly with hybrids, you should set:

NPAR = total number of cores / KPAR.
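As a concrete (hypothetical) example for the 24-node case above, with 24 x 16 = 384 cores and two k-point groups, the relevant INCAR lines would be something like:

KPAR = 2     ! two k-point groups, 192 cores each
NPAR = 192   ! total number of cores / KPAR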

VASP, ELPA, Lindgren and Triolith

So, can the ELPA library improve upon VASP’s SCALAPACK bottleneck?

Benchmarking of the ELPA-enabled version of VASP was performed on PDC’s Lindgren (a Cray XE6) and Phase 1 of Triolith at NSC (an HP SL6500-based cluster with Xeon E5 + FDR Infiniband). For this occasion, I developed a new test case consisting of a MgH2 supercell with 1269 atoms. The structure is an experimentally determined crystal structure, but with a few per cent of hydrogen vacancies. I feel this is a more realistic test case than the NiSi-1200 cell used before. Ideally, we should see decent scaling up to about 1000 cores / 64 nodes for this simulation. As usual, we expect the “EDDAV” subroutine to eventually become dominant. The number of bands is 1488, which creates a 1488x1488 matrix that needs to be diagonalized in the bottleneck phase. Actually, this matrix size is far smaller than what ELPA was intended for, which seems to be on the order of 10,000-100,000. So perhaps we will not see the true strength of ELPA here, but hopefully it can alleviate some of the pathological behavior of SCALAPACK.

Triolith

First out is Triolith, with benchmarks for 4-64 compute nodes using both 8 and 16 cores per node. I keep NPAR=nodes/2, according to earlier findings. The recommended way to run with 8c/node at NSC is to use a special SLURM option, so that you don’t have to give the number of cores explicitly to mpprun:

#SBATCH --ntasks-per-node 8

Scaling of MgH2 on Triolith with and without ELPA

We find that the standard way of running VASP, with 16c/node and SCALAPACK, produces a top speed of about 16 jobs/h using 48 compute nodes, and going further actually degrades performance. The ELPA version, however, is able to maintain scaling to at least 64 nodes. In fact, the scaling curve looks very much like what you get when running VASP with SCALAPACK and 8c/node. Fortunately, the benefits of ELPA and 8c/node seem to be additive, meaning that ELPA wins over SCALAPACK on 48-64 nodes even with 8c/node. In the end, the overall performance improvement is around 13% for the 64-node job. (While not shown here, I also ran with 96-128 nodes, and the difference there with ELPA is a stunning +30-50% in speed, but I consider the total efficiency too low to be useful.)

Lindgren

Now, let’s look at Lindgren, 8-64 compute nodes, using either 12 cores per node or the full 24 cores. In the 12c case, I spread the ranks out with three per NUMA node, using

aprun  -N 12 -S 3 ...

I used NPAR=compute nodes here, like before.

Scaling of MgH2 on Lindgren with and without ELPA

On the Cray machine, we do not benefit as much from ELPA as on Triolith. The overall increase in speed on 64 nodes is 5%. Instead, it is essential to drop down to 12c/node to get good scaling beyond 32 nodes for this job. Also, note the difference in scale on the vertical axis. Triolith has much faster compute nodes! Employing 64 nodes gives us a speed of 24.3 jobs/h vs 14.2 jobs/h, that is, a 1.7x speed-up per node or a 2.5x speed-up on a per-core basis.

Parallel scaling efficiency

Finally, it is instructive to compare the parallel scaling of Lindgren and Triolith. One of the strengths of the Cray system is the custom interconnect, and since the compute nodes are also slower than on Triolith, there is potential to realize better parallel scaling, when we normalize the absolute speeds.

Comparing scaling of MgH2 on Lindgren and Triolith

We find, however, that the scaling curves are almost completely overlapping in the range where it is reasonable to run this job (4 to 64 nodes). The FDR Infiniband network is more than capable of handling this load, and the Cray interconnect is not so special at this relatively low-end scale.

Compiling VASP With the ELPA Library

Previously, I showed how SCALAPACK is a limiting factor in the parallel scaling of VASP. VASP 5.3.2 introduced support for the ELPA library, which can now be enabled in the subspace rotation phase of the program. You do this by compiling with the “-DELPA” preprocessor flag. In the VASP makefiles, there is a variable called CPP where this flag can be added:

CPP     = $(CPP_)  -DHOST=\"NSC-ELPATEST-B01\" -DMPI -DELPA \
...

In addition, you need to get access to ELPA (by registering on their site) and add the source files to the makefile. I did it like this:

ELPA = elpa1.o elpa2.o elpa2_kernels.o

vasp: $(ELPA) $(SOURCE) $(FFT3D) $(INC) main.o 
  rm -f vasp
  $(FCL) -o vasp main.o  $(ELPA) $(SOURCE) $(FFT3D) $(LIB) $(LINK)

The ELPA developers recommend that you compile with “-O3” and full SSE support, so I put these special rules at the end of the makefile.

# ELPA rules
elpa1.o : elpa1.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
elpa2.o : elpa2.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
elpa2_kernels.o : elpa2_kernels.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)

(Here, -xavx optimizes for Triolith’s Sandy Bridge CPUs.)

With this procedure, I was able to compile VASP with ELPA support. As far as I can tell, there is no visual confirmation of ELPA being used in the OUTCAR file or on stdout. The output looks like regular VASP, but with some fluctuations in the decimals. I also saw some crashes when running on just a few nodes (< 4). Perhaps ELPA is not as robust in this case, since it is not the intended usage scenario.
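One way to at least convince yourself that the ELPA objects were linked into the binary is to look for ELPA symbols in it (assuming the binary has not been stripped):

nm vasp | grep -i elpa | head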

(Benchmarks of ELPA on Lindgren and Triolith will follow in the next post.)

Testing the K-point Parallelization in VASP

VASP 5.3.2 finally introduced official support for k-point parallelization. What can we expect from this new feature in terms of performance? In general, you only need many k-points in relatively small cells, so up front we would expect k-point parallelization to improve time-to-solution for small cells with hundreds or thousands of k-points. We do have a subset of users at NSC running big batches of these jobs, so this may be a real advantage in the prototyping stage of simulations, when the jobs are set up. In terms of actual job throughput for production calculations, however, k-point parallelization should not help much, as peak efficiency is reached already with 8-16 cores on a single node.

So let’s put this theory to the test. Previously, I benchmarked the 8-atom FeH system with 400 k-points for this scenario. The maximum throughput was achieved with two 8-core jobs running on the same node, and the best time-to-solution was 3 minutes (20 jobs/h) with 16 cores on one compute node. What can k-point parallelization do here?

K-point parallelization for FeH system

KPAR is the new parameter which controls the number of k-point parallelized groups. KPAR=1 means no k-point parallelization, i.e. the default behavior of VASP. For each bar in the chart, the NPAR value has been individually optimized (and is thereby different for each number of cores). Previously, this calculation did not scale at all beyond one compute node (blue bars), but with KPAR=8 (purple bars), we get close to linear (1.8x) speed-up going from 1 to 2 nodes, nearly cutting the time-to-solution in half. As suspected, in terms of efficiency, the current k-point parallelization is no more efficient than the old scheme when running on a single node, which means that peak throughput remains the same at roughly 24 jobs/h per compute node. This is a little surprising, given that there should be overhead associated with running two jobs simultaneously on a node, compared to using k-point parallelization.

What must be remembered, though, is that it is considerably easier to handle the file and job management for several sequential KPAR runs than to juggle several jobs per node across many directories, so in this sense, KPAR seems like a great addition with respect to workflow optimization.

New Version of VASP - 5.3.2

A new version of VASP was released recently. There are many important improvements in this version and I encourage all VASP users to check the full release notes on the VASP community page.

Among the highlights are:

  • K-point parallelization (this should improve “scaling” for small jobs)
  • Molecular dynamics at constant pressure
  • Spin-orbit coupling calculation with symmetry
  • Subspace diagonalization by means of the ELPA library (this may improve scaling for wide parallel jobs running on e.g. PDC’s Lindgren)

The first installation of VASP 5.3.2 binaries on NSC is available in:

/software/apps/vasp/5.3.2-13Sep12/default/

Installations for Lindgren at PDC will follow shortly. The binaries are called vasp-[gamma,half,full] as usual. They ran through my test suite without problems, but I noticed that on Triolith, some other calculations converged to different solutions with the set of high optimizations previously used to compile 5.2.12, so I have dropped the global optimization level down to -O1 for the Triolith installation until things get sorted out. The overall performance drop is only 5%, at least for standard PBE-type calculations.

The plan for 5.3.2 is to produce two more versions:

  • A “stable” alternative build, based on OpenMPI and possibly a different numerical library, that can be used for comparison if you suspect trouble with your calculations.
  • A “fast” version tuned for maximum parallel performance, including ELPA support.

There have also been requests for versions with cell optimization restricted to certain directions, like z-only or xy-only. Apparently, this is an established “hack”, outlined on the VASP forums. To me, however, it seems better to implement this in the code by a set of new INCAR tags. This way, you can cover all combinations (x, xy, z, etc.) without producing six different binaries. Hopefully, it will not be too difficult to make the changes.

VASP Hybrid Calculations on Triolith

VASP 5 introduced hybrid DFT functionals like PBE0. The Hartree-Fock part adds a significant amount of computational time, however, and in addition, these algorithms require parallelization with NPAR=number of cores, which is not as effective. In my experience, hybrid calculations are also haunted by SCF convergence problems, and you need to experiment with the other SCF algorithms. So what can we expect from Triolith here?

Parallel scaling MgO hybrid calculation

The chart above shows benchmark runs for a 63-atom MgO cell with Hartree-Fock turned on (corresponding to PBE0). ALGO=All is used, and NPAR=cores had to be set for each case separately. We find, not surprisingly, that we have good parallel scaling up to 4 compute nodes (equalling 1 atom per core). It is possible to crank up the speed by employing more compute nodes, but only by using 8-12 MPI ranks per node and idling half the cores. We have 192 bands in this calculation, so the maximum speed should be achieved with 16 nodes (16x12c/node = 192 ranks), which is also what we find (2.5 jobs/h).

These results should be compared with running the same job on Neolith, where a 16-node run (128 cores) reached 0.44 jobs/h, so Triolith is again close to 6 times faster on a node-by-node basis.

Small VASP Jobs on Triolith

I have gotten requests for benchmarks of smaller simulations, rather than big supercells with hundreds of atoms. Below are the results for an 8-atom FeH cell with 64 bands and 400 k-points. Remember that VASP does not have parallelization over k-points, so it is very challenging to get good parallel performance in this case. Based on the rule of thumb of using no more than 1 core per atom, or 4 bands per core, one should expect parallel scaling only within a single compute node with 16 cores, but nothing beyond that.

Parameter study of FeH cell

This is also what I see when running full NPAR/NSIM tests with 4-32 ranks, as shown in the chart. Peak performance is achieved with 16 cores on one compute node, using NPAR=4. Using two compute nodes is actually slower, even when using the same number of MPI ranks. This implies that the performance is limited by communication and synchronization costs, and not by memory bandwidth (otherwise we would have seen an improvement in speed when using 16 ranks on two nodes instead of one). An interesting finding is that if you are submitting many jobs like this to the queue and are mainly interested in throughput rather than time to solution, then the optimal solution is to run two 8-core jobs on the same compute node.

The NSIM parameter does not seem to be as influential here, because we have so few bands. The full table is shown below:

Parameter study of FeH cell

I also checked the influence of LPLANE=.FALSE. These results are not shown, but the difference was within 1%, so it was likely just statistical noise.