Next week, I am off to Denver, Colorado to participate in The Supercomputing Conference (SC13). I will also attend HP-CAST and the Intel HPC Roundtable. I am always interested in meeting other people working with ab initio/electronic structure software. Send me an email if you want to meet up.
Here is a more generic recipe for installing and compiling VASP using only open-source tools (i.e. without Intel’s Fortran compiler and MKL). This should be useful if you want to run smaller calculations on a laptop or an office machine. Below, I describe how I did it on Ubuntu 13.10 with GCC/Gfortran, OpenMPI, OpenBLAS, FFTW and Netlib SCALAPACK. Please note that compiling VASP with gfortran is neither recommended nor supported by the VASP developers. From what I can tell, it appears to work, but I have only done limited testing.
First of all you need the VASP source code, which you get from the VASP home page:
Then we need to install some Ubuntu packages, either through the Synaptic program or with apt-get in the terminal.
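On Ubuntu 13.10, the set of packages would be something like the following (the exact package names are my assumption; build-essential pulls in gcc, make and friends):

```shell
# compilers and build tools, plus MPI and FFTW development packages
sudo apt-get install build-essential gfortran \
    openmpi-bin libopenmpi-dev libfftw3-dev
```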
This is starting from a completely new Ubuntu installation. If you have done any programming on your machine before, some of these packages could already be installed. For other Linux distributions, you will need to find out the names of the corresponding packages. They should be similar, except for “build-essential”, which is specific to Debian.
I did not have much success using Ubuntu’s BLAS/LAPACK/ATLAS, so we will need to download the latest OpenBLAS and compile it ourselves from source. The same applies to SCALAPACK, which we have to tie together with our OpenBLAS and the system OpenMPI installation.
Download the latest OpenBLAS tarball from the OpenBLAS project page on GitHub.
After decompressing it, you will have a directory called “xianyi-OpenBLAS-….”. Go inside and check the TargetList.txt file. You will have to decide which processor architecture target is appropriate for your processor. For a new Intel processor, “SANDYBRIDGE” should be best, and for a new AMD processor, “BULLDOZER”. Here, I choose the safe and conservative option “CORE2”, which should work on any recent processor. Then we compile with make:
make FC=gfortran CC=gcc USE_THREAD=0 TARGET=CORE2
This should produce a library called libopenblas_core2-r0.2.8.a (or similar). Make a note of the directory in which you compiled OpenBLAS; you will need it later. Mine was “/home/pla/build/xianyi-OpenBLAS-9c51cdf”.
Download the latest SCALAPACK tarball from Netlib.org. To compile it, we need to set up an SLmake.inc file containing some configuration parameters. Start by copying the SLmake.inc.example file. You need to update the BLASLIB and LAPACKLIB variables and insert a direct reference to your OpenBLAS compilation:
CDEFS = -DAdd_
FC = mpif90
CC = mpicc
NOOPT = -O0
FCFLAGS = -O3
CCFLAGS = -O3
FCLOADER = $(FC)
CCLOADER = $(CC)
FCLOADFLAGS = $(FCFLAGS)
CCLOADFLAGS = $(CCFLAGS)
ARCH = ar
ARCHFLAGS = cr
RANLIB = ranlib
SCALAPACKLIB = libscalapack.a
BLASLIB = -L/home/pla/build/xianyi-OpenBLAS-9c51cdf -lopenblas
LAPACKLIB = $(BLASLIB)
LIBS = $(LAPACKLIB) $(BLASLIB)
This should be enough to get SCALAPACK to compile by typing “make”. In the end, you should get a libscalapack.a library file.
Proceed to compile VASP with gfortran according to the previous guide. You need to apply the source code patches described there, otherwise it is straightforward. If you have never compiled VASP before, looking through one of the more detailed system specific guides in the VASP compile section might help.
The makefiles and the source code patch I used are available for download: vasp-ubuntu.tar.gz.
Some highlights (update the paths if necessary):
FFLAGS = -ffree-form -ffree-line-length-0 -fno-second-underscore -I/usr/include
We need to include -I/usr/include to pick up the FFTW header file. The makefile should also refer to the BLAS/LAPACK library from our OpenBLAS installation.
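In the makefile, that reference would look something along these lines. The BLAS and LAPACK variables exist in the stock VASP makefiles, but the path below is from my OpenBLAS build and must be adjusted to yours:

```makefile
# OpenBLAS provides both BLAS and LAPACK, so LAPACK can be left empty
BLAS = -L/home/pla/build/xianyi-OpenBLAS-9c51cdf -lopenblas
LAPACK =
```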
And set the preprocessor flags:

CPP = $(CPP_) -DMPI -DHOST=\"LinuxGfort\" \
      -DCACHE_SIZE=4000 -Davoidalloc -DNGZhalf \
      -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK -DMINLOOP=1

In the MPI section of the makefile, there should be a reference to our compiled SCALAPACK:
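Something like this; SCA is the variable used for the SCALAPACK library in the VASP makefiles, and the path below is a placeholder you should point at your own SCALAPACK build directory:

```makefile
# the libscalapack.a we compiled earlier (adjust to your build directory)
SCA = /path/to/scalapack/libscalapack.a
```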
The binaries you compile are MPI-enabled, so they should be launched with
mpirun. For example:
mpirun -np 4 ~/build/vasp-5.3.3/vasp.5.3/vasp
You will probably find that the --bind-to-core option helps performance.
mpirun -np 4 --bind-to-core ~/build/vasp-5.3.3/vasp.5.3/vasp
If you have a dual-socket workstation, similar to a compute cluster node, I recommend trying:
mpirun -np 16 --bind-to-core --npersocket 8 ~/build/vasp-5.3.3/vasp.5.3/vasp
When doing standard double precision floating-point operations in C or Fortran, you can expect 15-17 significant digits. But what do you do when 15 decimals are not enough? The GNU multiple precision arithmetic library is the go-to library when you need not 15 but thousands of decimals; for more modest needs, quadruple precision (128 bits) might be enough. In Fortran, 128 bits is
REAL*16, and in C we can access it by the type
__float128. These may not be available in all compilers, but GCC (4.6 and newer) and Intel C/Fortran do support it. According to the GCC manual, we can expect it to be “an order of magnitude or two slower” than double precision.
I decided to test the quad precision by doing a simple matrix-matrix multiplication test. Two 256x256 matrices are initialized with trigonometric expressions and then multiplied by each other. It is straightforward to change the program into quad precision: the
double arrays should be declared as
__float128, the trig. functions are called
sinq, and we need to use the special function
quadmath_snprintf to print the numbers to the screen. (In Fortran, you don’t even need to use a different print function.)
For simplicity, I look specifically at the decimal expansion of one matrix element (0,0). With gcc, I get
$ gcc -o matmul_double -O3 -mavx -std=c99 matmul_double.c -lm
$ ./matmul_double
Matrix size: 256 by 256 (double precision).
Time: 0.010 seconds (3355.4 MFLOP/s)
C(0,0)-diagnostic: -0.0939652685936642
$ gcc -o matmul_quad -std=c99 -O3 -mavx matmul_quad.c -lquadmath -lm
$ ./matmul_quad
Matrix size: 256 by 256 (quad precision).
Time: 4.140 seconds (8.1 MFLOP/s)
C(0,0)-diagnostic: -0.093965268593662358578620940776
which confirms that __float128 works in gcc, but with a significant speed penalty. In this case, the difference in runtime is around 400x.
Does Intel’s C compiler fare better?
$ icc -o matmul_double -std=c99 -O3 -xavx mmul_double.c -lm
$ ./matmul_double
Matrix size: 256 by 256 (double precision).
0.000 seconds (inf MFLOP/s)
C(0,0)-diagnostic: -0.0939652685936624
$ icc -o matmul_quad -std=c99 -O3 -mavx -I/software/apps/gcc/4.7.2/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/include -L/software/apps/gcc/4.7.2/lib64/ matmul_quad.c -lquadmath -lm
$ ./matmul_quad
Matrix size: 256 by 256 (quad precision).
0.880 seconds (38.1 MFLOP/s)
C(0,0)-diagnostic: -0.093965268593662358578620940776
Yes, it is evident that quad precision floating-point math runs a few times faster with ICC, even though the same underlying library is used.
If we actually look at the decimals and compare the results, we find something interesting. The quad precision results from GCC and ICC are identical, which is reassuring, but the double precision binary compiled by ICC is closer to the quad precision result, despite the high optimization level.
GCC (double) -0.09396526859366|42
ICC (double) -0.09396526859366|24
GCC (quad.)  -0.09396526859366|2358578620940776
ICC (quad.)  -0.09396526859366|2358578620940776
Lowering GCC’s optimization did not help, and gcc is not supposed to introduce any unsafe mathematical operations unless they are explicitly enabled by e.g. -Ofast in the first place, so it is unclear to me where the difference comes from. Typically, fluctuations in the last decimal are not a problem in practice for many codes. Many programs run fine with e.g. gcc’s -ffast-math, but if you are struggling with ill-conditioned matrices, every decimal could count.
Are bigger servers better for high-performance computing? It is often assumed that communication between processors within a compute node must be faster than using Infiniband networking in a cluster. Consequently, I come across scientists asking for big shared memory servers, believing that their simulations would run much faster there. In reality, it is not always so. In the previous post, I wrote about how the VASP application is bottlenecked by memory bandwidth. In such cases, compute work and communication will compete with each other for precious resources, with severe performance degradation as a result.
Consider this experiment. Let us first run a VASP compute job with 512 bands on a single compute node using 1 to 16 cores. This will give us a scaling graph showing what kind of improvement you can get by using more cores in a server. Now, for comparison, we run the same job but with only one core per compute node. This means that the 16-core job uses 16 compute nodes and only communicates over Infiniband. Which scenario will be faster?
It turns out that communicating only over Infiniband is superior to shared memory. With 16 cores, the calculations run twice as fast. The reason is simply that we throw more hardware at the problem: our processors can now use all the memory bandwidth for computations, while exchanging data over the Infiniband network instead.
The graph above shows that there is no inherent benefit to running an MPI-parallelized application such as VASP on a big server versus smaller servers in a cluster connected by Infiniband. The only advantage you get is the increased total amount of available memory per server.
As a user, you can apply techniques like this to speed up your calculations, for example, by using twice as many nodes but with half the number of cores on each node. On Triolith, you can do it like this:
#SBATCH -N 16
#SBATCH --ntasks-per-node 8

mpprun /software/apps/vasp/5.3.3-18Dec12/default/vasp
This will run your application with 8 cores per compute node, for a total of 128 cores. The improvement in speed compared to using 8 full nodes can be as much as 50%, and you will also have twice as much memory available per processor. The drawback is, of course, that you spend twice the core hours on your calculation. But if it is important to get the results quickly, it might be worth it.
I was recently asked what kind of hardware you should buy for running VASP calculations. I recommend looking at the configuration of the Triolith cluster at NSC. It was designed with VASP as a big part of the workload, and we did extensive benchmarking to optimize price/performance. An alternative is to look through the most recent entries in the supercomputing Top500 list, to get a feel for what the supercomputing centers are buying at the moment. Top500 also has a statistics section where you can follow trends in hardware over time. It is obvious, for example, that big shared-memory systems have fallen out of favor.
What I recommend is dual-socket servers with low clock frequency Intel Xeon E5 processors, connected by Infiniband networking. Why?
Type of CPU: At the moment, Intel’s server processors are significantly ahead of AMD’s in terms of performance and energy efficiency. VASP is also very reliant on Intel’s Fortran compiler. This makes Intel processors an optimal choice. It is possible to run VASP on POWER and SPARC processors, but these server platforms are generally not cost efficient for high-performance computing.
CPU model: VASP is very dependent on high memory bandwidth. It is one of the most memory intensive applications in high-performance computing. This means that you do not need processors with high clock frequency, because they will spend most of their time waiting for data to arrive from memory anyhow. What you need is the cheapest processor model that still comes with maximum memory bandwidth. In the current Xeon E5-2600 series, that is likely to be the 2650 and 2660 models. The quad-core 2643 model could also be interesting, because you typically gain only 25% when going from 8 to 16 cores per node.
Memory: It is crucial to have 1600 MHz DDR3 memory, which is currently the fastest memory available (1866 MHz will come soon). All four memory channels should be occupied. Try to get dual-rank memory modules if you have only 1 DIMM per memory channel – it will improve performance by ca 5%. Typically, 32 GB of memory is enough (2 GB/core), the reason being that you can easily get the equivalent of 4 GB per core by running with half the number of MPI ranks per node without losing too much performance. But if the majority of jobs are of hybrid type or GW, I would go for 64 GB per server instead.
Network: A fast network, like Infiniband, is necessary to run VASP in parallel on more than a few nodes. It is difficult to do comparative studies of different network setups due to the amount of hardware required, but in general VASP scales almost perfectly up to 16 compute nodes using both Mellanox and Qlogic (now Intel) Infiniband, so there is no given winner. FDR Infiniband does not significantly improve performance over QDR for VASP in the few tests I was able to do (+16% on a 16 node job spanning two switches), so I would look at it mostly from a price/performance perspective.
Occasionally, I get enquiries about various kinds of scripts for pre- and post-processing of VASP calculations. A comprehensive set of scripts covering the most common tasks is available within the Aflow high-throughput framework for VASP, developed by the Curtarolo group at Duke University.
To use Aflow on Triolith, just load the “vasptools/0.2” module; it will put the
aconvasp binary and other scripts such as “vasp2cif” and “vaspcheck” into your PATH.
module load vasptools/0.2
Here are some examples of what you can do with Aflow:
aconvasp --cart: convert POSCAR from direct to Cartesian coordinates
aconvasp --data: show basic structure data such as volume, alpha, beta, gamma, etc.
aconvasp --volume 170.0: change the volume of the cell to 170 Å^3.
aconvasp --clat 5.0 5.0 7.0 90.0 90.0 120.0: convert (a,b,c,alpha,beta,gamma) to Cartesian basis vectors which can be copy-pasted into POSCAR.
aconvasp --chgdiff CHGCAR.1 CHGCAR.2: subtract charge densities in CHGCAR files. (But it seems to be broken when I test it.)
aconvasp --supercell 2 2 2: make supercell out of an existing POSCAR.
aconvasp --swap 0 1: swap coordinates of atomic species 0 and 1 in the POSCAR file.
aconvasp --spacegroup: spacegroup and symmetry detection.
aconvasp --cif: generate a CIF file from POSCAR.
aconvasp --xyz: generate an xyz file from POSCAR.
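For example, to build a 2x2x2 supercell of an existing structure (this assumes, as in the Aflow documentation, that aconvasp reads the POSCAR from standard input and writes the result to standard output):

```shell
module load vasptools/0.2
aconvasp --supercell 2 2 2 < POSCAR > POSCAR.super
```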
You can find more information in the full documentation. If you use aconvasp or aflow, don’t forget to cite their paper:
S. Curtarolo, W. Setyawan, G. L. W. Hart, M. Jahnatek, R. V. Chepulskii, R. H. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M. Mehl, H. T. Stokes, D. O. Demchenko, and D. Morgan, AFLOW: an automatic framework for high-throughput materials discovery, Comp. Mat. Sci. 58, 218-226 (2012).
This week we are running a course in parallel programming with OpenMP at NSC. Joachim Hein from LUNARC is teaching and a few of us from NSC are helping out with the programming labs.
It is often said that parallel programming can be incredibly hard, and that there is currently no reliable way to automatically parallelize an existing serial program. This statement is still true in general, but sometimes, parallel programming can also be embarrassingly easy. Why? Because while automatic parallelization is not perfect, it can still give you some improvement. There are also many subroutines in BLAS, LAPACK and FFTW that are already parallelized, and since many programs rely on these libraries, they can see speed-up on multicore processors by just linking the right library version and setting
OMP_NUM_THREADS=X in the shell.
Let us consider the Elk FP-LAPW code. It is written in Fortran90 and has already been parallelized using both OpenMP and MPI. But what could we have done in the hypothetical case of starting out with a serial version of Elk? How good is automatic parallelization? It will surely not get us all the way, but every percent counts, because you essentially get it for free. It is merely a question of finding the relevant compiler flags.
To establish a baseline, I have Elk compiled without any special compiler flags or machine-optimized numerical libraries. This may seem naive and unrealistic, but in reality, it is not uncommon to come across scientific software built without any compiler optimization flags or optimized linear algebra libraries such as GotoBLAS, MKL, or ATLAS. (In my experience, it is not so much a result of ignorance, but rather technical problems with compilation and/or lack of time for tinkering.)
The test case I am using is the YBCO example distributed with Elk (13 atoms) with the
rgkmax parameter increased to 7.0 to get longer runtime.
The first step in our hypothetical example is to simply add “-O3” optimization. This gives us a 9% speed boost. The next crucial step is to replace the bundled BLAS, LAPACK and FFT libraries with Intel’s MKL libraries, which improves the speed by 27%. And finally, we activate the compiler’s automatic threaded parallelization, which gives us +80%. The results can then be compared with the production version of Elk for Triolith, which uses aggressive compiler optimizations, MKL libraries, and manual OpenMP parallelization.
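With Intel’s compiler, these steps correspond roughly to the following flags (a sketch, not the exact Elk build lines; mysource.f90 is a placeholder, -parallel enables ifort’s auto-parallelizer, and -mkl links Intel MKL):

```shell
# step 1: basic optimization
ifort -O3 mysource.f90
# step 2: replace the bundled BLAS/LAPACK/FFT with MKL
ifort -O3 -mkl=sequential mysource.f90
# step 3: auto-parallelization plus the threaded MKL
ifort -O3 -parallel -mkl=parallel mysource.f90
export OMP_NUM_THREADS=16
```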
We can see that automatic parallelization gives a modest speed-up of 1.7x using 16 cores on a Triolith compute node. Still, this is not too bad compared with the OpenMP parallelized version which gets 5.0x over the serial version in total, but only 2x of that is actually due to the OpenMP constructs in the code. So essentially, we get half of the parallelization done automatically for free, without having to change any Fortran code.
Another way of looking at this graph is that it can really pay off to spend some time looking into the best way to compile and link a program. The optimized auto-parallel version is 2.5x faster than the naive version built just from the Elk source with integrated numerical libraries.
Most of the tricks I used to compile Elk in this example are listed in the Triolith User Guide. If you encounter problems compiling your own program on Triolith, or need help with choosing the best libraries, please don’t hesitate to contact firstname.lastname@example.org.
Yesterday, I installed a new version of the Atomic Simulation Environment on Triolith. ASE allows you to script your ab initio calculations using Python. Here comes a tutorial on how you can run VASP calculations with ASE on Triolith. It is a little bit more elaborate than the official documentation of the VASP module, which can be found on the ASE web site.
First, we need to load the Python and ASE modules. I recommend the following versions:
module load python/2.7.4-snic-1 ase/3.7.1
Next, we need to tell ASE’s VASP module where to find the VASP binaries and the POTCARs files.
export VASP_COMMAND="mpprun /software/apps/vasp/5.3.3-18Dec12/default/vasp"
export VASP_PP_PATH=/software/apps/vasp/POTCARs
Note that ASE requires directories with specific names, such as potpaw_PBE, below the VASP_PP_PATH, so you may have to create symlinks with these names if you don’t have them in your own database.
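For example, if your PBE POTCAR files are collected in a directory called ~/potcars/PBE (a hypothetical layout), a symlink gives ASE the name it expects:

```shell
# give ASE the directory name it looks for, pointing at the real POTCAR tree
mkdir -p ~/ase-potcars
ln -sfn ~/potcars/PBE ~/ase-potcars/potpaw_PBE
export VASP_PP_PATH=~/ase-potcars
```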
Now, we need to create the actual ASE Python script that sets up the calculation. The idea in this little example is to calculate the equilibrium volume and bulk modulus of hcp Mg.
The script works like this: first, we create cells for 10 different volumes and at the same time attach VASP settings to them, then we call
get_potential_energy() for each of these cells and collect the energies and volumes, and finally, we use ASE’s built-in equation of state subroutine to calculate the equilibrium volume and bulk modulus.
To run this script on a Triolith compute node, we need to create a job script. In the job script, we just launch the Python script, which will start ASE and then launch VASP for us.
#!/bin/bash
#SBATCH -J mg
#SBATCH -N 1
#SBATCH --exclusive
#SBATCH -t 0:10:00

python mg-hcp.py
When the job has completed (it should not take more than a few minutes), look in the slurm.out file for the output from ASE.
[pla@triolith1 asetest]$ cat slurm-693921.out
Calculation output
------------------
Volumes: [30.469003876435877, 32.782153276649034, 35.209509626875203, 37.753824901813466, 40.417851076162975, 43.204340124622782, 46.116044021892051, 49.155714742669836, 52.326104261655324, 55.629964553547552]
Energies: [-1.86460718, -2.2747034, -2.5757448, -2.78654685, -2.92182549, -2.99416931, -3.01374966, -2.99022215, -2.93151648, -2.84418079]

Equation of state parameters
----------------------------
E0: -3.013844 eV
V0: 45.9 A^3
B: 36.5 GPa
The experimental bulk modulus of Mg is ca 45 GPa, so the result seems reasonable.
For some time, VASP has been centered on x86 processors and Intel’s Fortran compiler. Inside the VASP source distribution, you can find some makefiles for other compilers, but they seldom work nowadays, and in many cases you need to modify the source code to make it work with other compilers.
In particular, recent versions of Gfortran cannot compile VASP out of the box. If you try, the compiler will stop at errors concerning a circular dependency of a module (i.e. the module includes itself), and some output formatting errors.
From what I can understand, these problems are actually related to violations of the Fortran language standard, which are allowed by the Intel compiler. There are no compiler flags for gfortran that let you “relax” the standard like this to let it compile VASP, so you need to modify the source to make it compliant.
When I tested with gcc 4.7.2 and gcc 4.8.0, four files needed to be modified: us.F, vdwforcefield.F, finite_diff.F, and spinsym.F. I have prepared the patches as a “patch file” which you can download. To apply the patches to the source code, locate your VASP 5.3.3 source code directory and do
cd vasp.5.3
patch -p0 < vasp533gcc.patch
In the makefile, you need to set the following compiler flags for gfortran.
FC = mpif90   (or similar, depending on the MPI)
FFLAGS = -ffree-form -ffree-line-length-0 -fno-second-underscore
OFLAG = -O3 -march=corei7-avx -mtune=corei7-avx
Global -O3 optimization seems to work for me on Triolith (Xeon E5 processors), but I haven’t tested all functionality of the gfortran version yet. As with the Intel compiler, you may have to decrease the optimization or disable aggressive inlining in certain files.
In the preprocessor section, put something like this. Note that you should not use the
-DPGF90 flag when compiling with gfortran.
CPP = $(CPP_) -DHOST=\"NSC-GFORTRAN-B01\" -DMPI -DMPI_BLOCK=262144 \
      -Duse_collective -DCACHE_SIZE=12000 -Davoidalloc -DNGZhalf
These tricks did it for me, and I now have a reference version of VASP compiled with Gfortran on Triolith. The speed seems to be about the same as when compiled with Intel Fortran, since VASP relies heavily on FFTW and BLAS calls, and I still link with MKL and Intel’s MPI.
Later, I will try to put together a longer guide on how to compile VASP with a fully free software stack, and compare performance and stability.
I promised some multi-node scaling tests of the LiFeSiO4 128-atom job in the previous post. Here they come! The choice of NPAR is of particular interest. Do the old rules of
NPAR=sqrt(number of MPI ranks) still apply here?
To recap: when running on one node, I found that
NPAR=3 with 24 cores per compute node and a special MPI process binding scheme (round-robin over NUMA zones) gave the best performance. To check if it still applies across nodes, I ran a full characterization again, but this time with 2 compute nodes. In total, this was 225 calculations!
Inspecting the data points shows us that the same approach comes out winning again. Using 24 cores/compute node is still much more effective (+30%) than using all the cores, and
NPAR=6 is the best choice. Specifying process binding is essential, but the choice of a particular scheme does not matter as much as in the single-node case, presumably because some of the load imbalance now happens between nodes, which we cannot address this way.
From this I conclude that a reasonable scheme for choosing NPAR indeed seems to be:
NPAR = 3 * compute nodes
Or, if we have a recent version of VASP:
NCORE = 8
The “RR-NUMA” process binding has to be specified explicitly when you start VASP on Abisko:
srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 /path/to/vasp
When using these settings, the parallel scaling for 1-8 compute nodes looks decent up to 4 compute nodes:
Remember that each node has 48 cores, of which we are using 24, so 4 nodes = 96 MPI ranks. We get a top speed of about 30 jobs/h. But what does this mean? It seems appropriate to elaborate on the choice of units here, as I have gotten questions about why I measure speed like this instead of using wall time as a proxy for speed. The reason is that you can interpret the “Speed” value on the y-axis as the number of geometry optimization steps you could run in one hour of wall time on the cluster. This is something which is directly relevant when doing production calculations.
For reference, we can compare the speeds above with Triolith. On Triolith, the same job (but with 512 bands instead of 480) tops out at about 38 jobs/h with 16 compute nodes and 256 ranks. So the parallel scaling looks a bit weak compared to Triolith, but the absolute time to solution is still good.