Recently, I have been working on making VASP installations for the new Cray XC40 (“Beskow”) at PDC in Stockholm. Here are some instructions for making a basic installation of VASP 5.3.5 using the Intel compiler. Some of it might be specific to the Cray at PDC, but Cray has a similar environment on all machines, so I expect it to be generally useful as well. My method of compiling VASP produces binaries which are around 30-50% faster than the ones that were provided to us by Cray, so I really recommend making the effort to recompile if you are a heavy VASP user.
If you have an account on Beskow, my binaries are available in the regular VASP module:
module load vasp/5.3.5-31Mar14
The installation path is (as of now; it might change when the system becomes publicly available):
There you can also find the makefiles and some README files.
Summary of the findings
- VASP compiles fine with the PrgEnv-intel module and MKL on the Cray XC-40.
- Using MKL is still significantly faster than Cray's LibSci, especially for the FFTW routines.
- Optimization level -O2 -xCORE-AVX2 is enough to get good speed.
- VASP does not seem to be helped much by AVX2 instructions (small matrices and limited by memory bandwidth).
- A SCALAPACK blocking factor NB of 64 seems appropriate.
- MPI_BLOCK should be increased as usual; 64 kB is a good number.
- Enabling MKL’s conditional bitwise reproducibility at the AVX2 level does not hurt performance; it may even be faster than running in automatic mode.
- Memory “hugepages” do not seem to improve the performance of VASP.
- The compiler flags -DRPROMU_DGEMV and -DRACCMU_DGEMV have very little effect on speed.
- Hyper-threading (symmetric multi-threading) does not improve performance, the overhead of running twice as many MPI ranks is too high.
- Multithreading in MKL does not improve performance either.
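For reference, the conditional bitwise reproducibility mentioned above is controlled through MKL's MKL_CBWR environment variable, which you could set in the job script. A minimal sketch (AVX2 is one of the documented values):

```shell
# Ask MKL for bitwise-reproducible results using AVX2 code paths
export MKL_CBWR=AVX2
```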
Preparations for compiling
First, download the prerequisite source tarballs from the VASP home page. You need both the regular VASP source code and the supporting “vasp 5” library.
I suggest making a new directory called e.g. vasp.5.3.5, where you download and expand them. You would type commands approximately like this:
tar zxvf vasp.5.3.5.tar.gz
tar zxvf vasp.5.lib.tar.gz
This will set you up with the source code for VASP.
Load modules for compilers and libraries
The traditional compiler for VASP is Intel’s Fortran compiler (the ifort command), so we will stick with it in this guide. In the Cray environment, this module is called “PrgEnv-intel”. Typically, PGI or Cray is the default preloaded compiler, so we have to swap compiler modules.
module swap PrgEnv-cray PrgEnv-intel/5.2.40
Check which version of the compiler you have by typing “ifort -v”:
$ ifort -v
ifort version 14.0.4
If you have the PrgEnv-intel/5.2.40 module loaded, it should state 14.0.4. This version can compile VASP with some special rules in the makefile (see the compiler status page for more information). Please note that the Fortran compiler command you should use to compile is always called ftn on the Cray (regardless of the module loaded).
We are going to use Intel’s math kernel library (MKL) for BLAS, LAPACK and FFTW, so we unload Cray’s LibSci, to be on the safe side.
module unload cray-libsci
Then I add these modules, to nail everything down:
module load cray-mpich/7.0.4
module load craype-haswell
This selects Cray’s MPI library (which should be the default) and sets up the environment to compile for the XC-40.
VASP 5 lib
Compiling the VASP 5 library is straightforward. It contains some timing and IO routines necessary for VASP, together with LINPACK. Just download my makefile for the VASP library into the vasp.5.lib directory and run the make command.
When it is finished, there should be a file called libdmy.a in the directory. Leave it there, as the main VASP compilation picks it up automatically.
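Putting the library steps together, the build might look like this (a sketch; the makefile name follows the naming pattern used for my makefiles, so adjust it to whatever you downloaded):

```shell
cd vasp.5.lib
# Build the support library with the downloaded makefile
make -f makefile.vasp5lib.crayxc40
# Verify that the library was produced
ls -l libdmy.a
```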
Editing the main VASP makefile
Go to the vasp.5.3 directory and download the main makefile.
I recommend that you edit the -DHOST variable in the makefile to something that you will recognize, like the machine name. The reason is that this piece of text is written out at the top of OUTCAR files.
CPP = $(CPP_) -DMPI -DHOST=\"MACHINE-VERSION\" -DIFC \
-DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
-DMPI_BLOCK=65536 -Duse_collective -DscaLAPACK \
-DRPROMU_DGEMV -DRACCMU_DGEMV -DnoSTOPCAR
You will usually need three different versions of VASP: the regular one, the gamma-point only version, and one for spin-orbit and/or non-collinear calculations. These are produced by the following combinations of precompiler flags that you have to put into the CPP line in the makefile:
regular: -DNGZhalf
gamma-point only: -DNGZhalf -DwNGZhalf
spin-orbit/non-collinear: remove both -DNGZhalf and -DwNGZhalf
At the Swedish HPC sites, we install the different binaries under names like vasp-noncollinear, but this naming is optional.
VASP does not have a makefile that supports parallel compilation, so in order to compile we just do:
make
If you really want to speed it up, you can try something like:
nice make -j4; make -j4; make -j4; make -j4;
Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called “vasp”. Rename it immediately, otherwise it will be deleted when you type make clean to compile the other VASP versions.
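The bash loop mentioned above can be sketched like this; run_make is a hypothetical stand-in for "nice make -j4" (here simulating a build that fails twice before succeeding), so the control flow can be seen without an actual build:

```shell
# Retry loop: keep re-running the build until it exits successfully.
# run_make stands in for "nice make -j4"; it fails twice, then succeeds.
attempts=0
run_make() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

until run_make; do
    echo "make failed (attempt $attempts), retrying..."
done
echo "build finished after $attempts attempts"
```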
The Cray compiler environment produces statically linked binaries by default, since this is the most convenient way to run on the Cray compute nodes. To run on 2048 cores using 64 compute nodes, with 32 cores per node and 16 cores per socket, we just put e.g. the following in the job script:
aprun -n 2048 -N 32 -S 16 /path/to/vasp
Normally, I would recommend lowering the number of cores per compute node. This will often make the calculation run faster. In the example below, I run with 24 cores per node (12 per socket), which is typically a good choice:
aprun -n 1536 -N 24 -S 12 /path/to/vasp
When running on the Cray XC-40, keep in mind the basic topology of the fast network connecting the compute nodes: 4 nodes sit together on 1 board, and 16 boards connect to the same chassis (for a total of 64 compute nodes), while any larger job has to span more than one chassis and/or physical rack, which slows down network communication. Therefore, it is best to keep the number of compute nodes to 64 at most; few VASP jobs will run efficiently using more nodes than that.