First of all, VASP is licensed software, your name needs to be included on a VASP license in order to use NSC's centrally installed VASP binaries. Read more about how we handle licensing of VASP at NSC.
Some problems which can be encountered running VASP are described at the end of this page.
A minimum batch script for running VASP looks like this:
#!/bin/bash
#SBATCH -J jobname
#SBATCH -N 4
#SBATCH --ntasks-per-node=32
#SBATCH -t 4:00:00
#SBATCH -A SNIC-xxx-yyy

module add VASP/5.4.4.16052018-nsc1-intel-2018b-eb
mpprun vasp_[std/gam/ncl]
This script allocates 4 compute nodes with 32 cores each, for a total of 128 cores (or MPI ranks) and runs VASP in parallel using MPI. Note that you should edit the jobname and the account number before submitting.
For best performance, use these settings in the INCAR file for regular DFT calculations:
NCORE=32 (or the same number as you use for --ntasks-per-node)
NSIM=16 (or higher)
For hybrid-DFT calculations, use:
NCORE=1
NSIM=16 (the value is not influential when using e.g. ALGO=damped)
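As a minimal illustration of where these settings go (the non-parallelization tags below, such as ENCUT and ISMEAR, are arbitrary placeholders and not NSC recommendations), an INCAR for a regular DFT run could contain:

SYSTEM = example
ENCUT  = 400        ! placeholder plane-wave cutoff
ISMEAR = 0          ! placeholder smearing scheme
SIGMA  = 0.05
NCORE  = 32         ! match --ntasks-per-node
NSIM   = 16         ! or higher
KPAR   = 2          ! optional, see the discussion of k-point parallelization below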
PAW potential files, POTCARs, can be found here:
NSC's VASP binaries have hard-linked paths to their run-time libraries, so you do not need to load a VASP module or set
LD_LIBRARY_PATH and such for them to work. We mainly provide modules for convenience, so that you do not need to remember the full paths to the binaries. However, if you run the binaries directly from the non-vanilla installations built with
intel-2018a, you need to set the following in the job script (see details in the section "Problems" below):
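export I_MPI_ADJUST_REDUCE=3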
This is not needed for other modules built with e.g. intel-2018b.
There is generally only one module of VASP per released version directly visible in the module system. It is typically the recommended standard installation of VASP without source code modifications. Load the VASP module corresponding to the version you want to use.
module add VASP/5.4.4.16052018-nsc1-intel-2018b-eb
Then launch the desired VASP binary with "mpprun":
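mpprun vasp_std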
The VASP installations are found under
Each nscX directory corresponds to a separate installation. Typically, the highest build number with the most recent toolchain is the preferred version. The builds may differ by the compiler flags used, the level of optimization applied, and the number of third-party patches and bug-fixes added on.
Each installation contains three different VASP binaries: vasp_std, vasp_gam, and vasp_ncl.
We recommend using vasp_gam if possible, in order to decrease the memory usage and increase the numerical stability. Mathematically, the half and gamma versions should give identical results to the full versions, for applicable systems. Sometimes, you can see a disagreement due to the way floating-point arithmetic works in a computer. In such cases, we would be more inclined to believe the gamma/half results, since they preserve symmetries to a higher degree.
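As an illustration (this KPOINTS file is a generic example, not taken from NSC's documentation), vasp_gam is applicable when the k-point sampling contains only the Gamma point, e.g.:

Gamma-point only
0
Gamma
1 1 1
0 0 0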
Binaries for constrained structure relaxation are available for some of the modules. The naming scheme is as follows, e.g.:
OBS: This naming scheme gives the relaxation direction.
nsc1-intel-2018b: recommended installation (most recent). VASP built from the original source with minimal modifications using conservative optimization options (
-O2 -xCORE-AVX512). The VASP binaries enforce conditional numerical reproducibility at the AVX-512 level in Intel's MKL library, which we believe improves numerical stability with no cost to performance. The binaries also contain a hard-coded path to the
vdW_kernel.bindat file located in the
/software file system, so that you do not need to generate it from scratch.
module add VASP/5.4.4.16052018-nsc1-intel-2018b-eb
mpprun vasp_std
nsc2-intel-2018a: Due to spurious problems when using
vasp_gam, it is compiled differently compared to the nsc1-intel-2018a build.
module add VASP/5.4.4.16052018-nsc2-intel-2018a-eb
mpprun vasp_std
nsc1-intel-2018a: A special debug installation is available. VASP built with debugging information and lower optimization. Mainly intended for troubleshooting and running with a debugger. Do not use for regular calculations, e.g.:
module add VASP/5.4.4.16052018-vanilla-nsc1-intel-2018a-eb
mpprun vasp_std
If you see lots of "BRMIX: very serious problems" error messages, this module might provide a solution. OBS: This module doesn't actually include the latest patch 16052018.
nsc1-intel-2018a: VASP built for use together with Wannier90 2.1.0. Load and launch e.g. with:
module add VASP/5.4.4.16052018-wannier90-nsc1-intel-2018a-eb
mpprun vasp_std
module add VASP-VTST/3.2-sol-5.4.4.16052018-nsc2-intel-2018a-eb
mpprun vasp_std
nsc1-intel-2018a: A special debug installation is available.
module add VASP-VTST/3.2-sol-5.4.4.16052018-vanilla-nsc1-intel-2018a-eb
mpprun vasp_std
module add VASP-OMC/5.4.4.16052018-nsc1-intel-2018a-eb
mpprun vasp_std
In general, you can expect about 2-3x faster VASP speed per compute node vs Triolith, provided that your calculation can scale up to using more cores. In many cases, it cannot, so we recommend that if you ran on X nodes on Triolith (
#SBATCH -N X), use X/2 nodes on Tetralith, but change to
NCORE=32. The new processors are about 1.0-1.5x faster on a per-core basis, so you will still enjoy some speed-up even when using the same number of cores.
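As a hypothetical illustration (the job size below is an assumption for the example, not a recommendation): a job that used 8 Triolith nodes with 16 cores per node could be moved to Tetralith like this, keeping the total core count at 128:

# Tetralith job script (was: #SBATCH -N 8, --ntasks-per-node=16 on Triolith)
#SBATCH -N 4
#SBATCH --ntasks-per-node=32
# and in INCAR: NCORE = 32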
Initial benchmarking showed that the parallel scaling of VASP on Tetralith is equal to, or better than, that of Triolith. This means that while you can run calculations close to the parallel scaling limit of VASP (1 electronic band per CPU core), it is not recommended from an efficiency point of view. You can easily end up wasting 2-3 times more core hours than you need to. A rule of thumb is that 6-12 bands/core gives you 90% efficiency, whereas scaling all the way out to 2 bands/core will give you 50% efficiency. 1 band/core typically results in < 50% efficiency, so we recommend against it. If you use k-point parallelization, which we also recommend, you can potentially multiply the number of nodes by up to the number of k-points (check
NKPT in OUTCAR and set
KPAR in INCAR). A good guess for how many compute nodes to allocate is therefore:
number of nodes = KPAR * NBANDS / [200-400]
Example: suppose we want to do regular DFT molecular dynamics on a 250-atom metal cell with 16 valence electrons per atom. There will be at least 2000 + 250/2 = 2125 bands in VASP. Thus, this calculation can be run with up to (2125/32) = ca 66 Tetralith compute nodes, but it will be very inefficient. Instead, a suitable choice might be ca 10 bands per core, or 2125/10 = ca 213 cores, which corresponds to around 6-7 compute nodes. To avoid prime numbers (7), we would likely run three test jobs with 6, 8 and 12 Tetralith compute nodes to check the parallel scaling.
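As a rough sketch of how to apply the rule of thumb (the divisor 300 is simply the middle of the 200-400 range above, and the awk patterns assume the standard OUTCAR layout), you can extract NBANDS and NKPTS from the OUTCAR of a short test run:

# Estimate a node count from the OUTCAR of a short test run
nbands=$(awk '/NBANDS=/ {print $NF; exit}' OUTCAR)
nkpts=$(awk '/NKPTS =/ {print $4; exit}' OUTCAR)
echo "NBANDS=$nbands  NKPTS=$nkpts"
echo "Suggested number of nodes (with KPAR=$nkpts): $(( nkpts * nbands / 300 ))"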
A more in-depth explanation with examples can be found in the blog post "Selecting the right number of cores for a VASP calculation".
To show the capability of Tetralith, and to provide some guidance on what kind of jobs can be run and how long they would take, we have re-run the test battery used for profiling the Cray XC-40 "Beskow" machine in Stockholm (more info here). It consists of doped GaAs supercells of varying sizes with the number of k-points adjusted correspondingly.
Fig. 1: Parallel scaling on Tetralith of GaAs supercells with 64 atoms / 192 bands, 128 atoms / 384 bands, 256 atoms / 768 bands, and 512 atoms / 1536 bands. All calculations used k-point parallelization to the maximum extent possible, typically KPAR=NKPT. The measured time is the time taken to complete one full SCF cycle.
The tests show that small DFT jobs (< 100 atoms) run very fast using k-point parallelization, even with a modest number of compute nodes. The time for one SCF cycle can often be less than 1 minute, or 60 geometry optimization steps per hour. In contrast, hybrid-DFT calculations (HSE06) take ca 50x longer to finish, regardless of how many nodes are thrown at the problem. They scale somewhat better, so typically you can use twice the number of nodes at the same efficiency, but it is not enough to make up the difference in run-time. This is something that you must budget for when planning the calculation.
As a practical example, let us calculate how many core hours would be required to run 10,000 full SCF cycles (say, 100 geometry optimizations, or a few molecular dynamics simulations). The number of nodes has been chosen so that the parallel efficiency is > 90%:
The same table for 10,000 SCF cycles of HSE06 calculations looks like:
For comparison, a typical large SNAC project might have an allocation of 100,000-1,000,000 core hours per month with several project members, while a smaller personal allocation might be 5,000-10,000 core hours/month. Thus, while it is technically possible to run very large VASP calculations quickly on Tetralith, careful planning of core hour usage is necessary, or you will exhaust your project allocation.
Another important difference vs Triolith is the improved memory capacity. Tetralith has 96 GB of RAM per node (or about 3 GB/core vs 2 GB/core on Triolith). This allows you to run larger calculations using fewer compute nodes, which is typically more efficient. In the example above, the 512-atom GaAsBi supercell with HSE06 was not really possible to run efficiently on Triolith due to limited memory.
Finally, some notes and observations on compiling VASP on Tetralith. You can find makefiles in the VASP installation directories under
We use the buildenv-intel/2018a-eb module on Tetralith to compile. You can use it even if you compile by hand and do not use EasyBuild.
We compile with the -O2 -xCORE-AVX512 optimization flags for better performance on modern hardware.
Avoid -O3 with Intel's 2018 compiler; stay with
-O2. When we tested, -O3 was not faster, but it produced binaries which have random, but repeatable, convergence problems (1 out of 100 calculations or so).
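For reference, a sketch of how these flags could be applied in makefile.include when building by hand, based on the standard makefile.include.linux_intel template shipped with VASP 5.4.4 (not necessarily identical to NSC's own makefiles):

# Relevant lines of makefile.include (Intel toolchain)
FC     = mpiifort
FCL    = mpiifort -mkl=sequential
FFLAGS = -assume byterecl -w
OFLAG  = -O2 -xCORE-AVX512   # keep -O2; add the CPU target here (or -xCORE-AVX2 to test AVX2)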
You may have to run make -j4 std or similar repeatedly if the build does not succeed on the first attempt.
Compiling with -xCORE-AVX512 and letting Intel's MKL library use AVX-512 instructions (which it does by default) seems to help in most cases, giving ca 5-10% better performance. We have seen cases where VASP runs faster with AVX2 only, so it is worth trying. It might depend on a combination of Intel Turbo Boost frequencies and NSIM, but this remains to be investigated. If you want to try, you can compile with
-xCORE-AVX2, and then set the environment variable
MKL_ENABLE_INSTRUCTIONS=AVX2 to force AVX2 only. This should make the CPU cores clock a little bit higher at the expense of fewer FLOPS/cycle.
We also set the MKL_CBWR=AVX512 environment variable. This is more of an old habit, as we haven't seen any explicit problems with reproducibility so far, but it does not hurt performance. The fluctuations are typically in the 15th decimal or so. Please note that setting
MKL_CBWR=AVX or similar severely impacts performance (-20%).
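If you want to experiment with these run-time settings yourself, a job-script fragment could look like the following sketch (use one of the two exports, not both):

# Either force MKL to use AVX2 kernels only (slightly higher clocks, fewer FLOPS/cycle):
export MKL_ENABLE_INSTRUCTIONS=AVX2
# ... or keep AVX-512 and enforce conditional numerical reproducibility at that level:
# export MKL_CBWR=AVX512
mpprun vasp_std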
If you encounter the error
BRMIX: very serious problems the old and the new charge density differ written in the slurm output for VASP calculations on Tetralith / Sigma, in cases which typically work on other clusters and worked on Triolith / Gamma, it might be related to a bug which was traced back to
MPI_REDUCE calls. These problems were transient, meaning that out of several identical jobs, some go through, while others fail. Our VASP modules are now updated to use another reduction algorithm by setting
I_MPI_ADJUST_REDUCE=3, which shouldn't affect performance. If you don't load the modules, but run the binaries directly, set this in the job script:
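export I_MPI_ADJUST_REDUCE=3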
More details for the interested: the problem was further traced down to our setting of the NB blocking factor for the distribution of matrices (NB=96 in
scala.F). The VASP default of
NB=16 seems to work fine, while
NB=96 also worked fine on Triolith. Switching off ScaLAPACK in INCAR with
LSCALAPACK = .FALSE. also works. Furthermore, the problem didn't appear for gcc + OpenBLAS + OpenMPI builds.
Newer VASP modules built with e.g.
intel-2018b use the VASP default
NB=16, while the non-vanilla modules built with intel-2018a use NB=96.
If you find the problem
internal error in SETUP_DEG_CLUSTERS: NB_TOT exceeds NMAX_DEG, typically encountered for phonon calculations, you can try the specially compiled versions with higher values of NMAX_DEG:
mpprun /software/sse/manual/vasp/5.4.4.16052018/intel-2018b_NMAX_DEG/nsc1_128/vasp_std
mpprun /software/sse/manual/vasp/5.4.4.16052018/intel-2018b_NMAX_DEG/nsc1_256/vasp_std