<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

  <title><![CDATA[Peter Larsson]]></title>
  <link href="http://www.nsc.liu.se/~pla/atom.xml" rel="self"/>
  <link href="http://www.nsc.liu.se/~pla/"/>
  <updated>2013-05-14T12:26:56+02:00</updated>
  <id>http://www.nsc.liu.se/~pla/</id>
  <author>
    <name><![CDATA[Peter Larsson]]></name>
    
  </author>
  <generator uri="http://octopress.org/">Octopress</generator>

  
  <entry>
    <title type="html"><![CDATA[Compiling VASP with Gfortran]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/05/14/vasp-gcc/"/>
    <updated>2013-05-14T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/05/14/vasp-gcc</id>
    <content type="html"><![CDATA[<p>For some time, VASP has been centered on the x86 processors and Intel’s Fortran compiler. Inside the VASP source distribution, you can find some makefiles for other compilers, but they seldom work nowadays, and in many cases you need to make modifications to the source code to make it work with other compilers.</p>

<p>In particular, recent versions of <a href="http://gcc.gnu.org/wiki/GFortran">Gfortran</a> cannot compile VASP. If you try, the compiler stops with errors about a circular dependency of a module (i.e. the module includes itself) and some output formatting errors.</p>

<p>From what I can understand, these problems are actually related to violations of the Fortran language standard, which are allowed by the Intel compiler. There are no compiler flags for gfortran that let you “relax” the standard like this to let it compile VASP, so you need to modify the source to make it compliant.</p>

<p>When I tested with gcc 4.7.2 and gcc 4.8.0, four files needed to be modified: us.F, vdwforcefield.F, finite_diff.F, and spinsym.F. I have prepared the patches as a “patch file” which you can <a href="http://www.nsc.liu.se/~pla/software/vasp533gcc.patch">download</a>. To apply the patches to the source code, locate your VASP 5.3.3 source code directory and do</p>

<pre><code>cd vasp.5.3
patch -p0 &lt; vasp533gcc.patch
</code></pre>

<p>In the makefile, you need to set the following compiler flags for gfortran.</p>

<pre><code>FC = mpif90 (or similar depending on the MPI)
FFLAGS = -ffree-form -ffree-line-length-0  -fno-second-underscore
OFLAG=-O3 -march=corei7-avx -mtune=corei7-avx
</code></pre>

<p>Global -O3 optimization seems to work for me on Triolith (Xeon E5 processors), but I haven’t tested all functionality of the gfortran version yet. As with the Intel compiler, you may have to decrease the optimization or disable aggressive inlining in certain files.</p>

<p>In the preprocessor section, put something like this. Note that you should not use the <code>-DPGF90</code> flag when compiling with gfortran.</p>

<pre><code>CPP     = $(CPP_)  -DHOST=\"NSC-GFORTRAN-B01\" -DMPI 
    -DMPI_BLOCK=262144 \
    -Duse_collective -DCACHE_SIZE=12000 -Davoidalloc -DNGZhalf\
</code></pre>

<p>These changes did the trick for me, and I now have a reference version of VASP compiled with Gfortran on Triolith. The speed seems to be about the same as when compiled with Intel Fortran, since VASP relies heavily on FFT and BLAS calls, and I still link with MKL and Intel’s MPI.</p>

<p>Later, I will try to write a longer guide on how to compile VASP with a fully free software stack, and compare performance and stability.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Scaling of small jobs on the Abisko Interlagos cluster]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/04/05/vaspabisko2/"/>
    <updated>2013-04-05T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/04/05/vaspabisko2</id>
    <content type="html"><![CDATA[<p>I promised some multi-node scaling tests of the Li2FeSiO4 128-atom job in the <a href="http://www.nsc.liu.se/~pla/blog/2013/03/25/vaspabisko/">previous post</a>. Here they come! The choice of NPAR is of particular interest. Do the old rules of <code>NPAR=compute nodes</code> or <code>NPAR=sqrt(number of MPI ranks)</code> still apply here?</p>

<p>To recap: when running on one node, I found that <code>NPAR=3</code> with 24 cores per compute node and a special MPI process binding scheme (round-robin over NUMA zones) gave the best performance. To check if it still applies across nodes, I ran a full characterization again, but this time with 2 compute nodes. In total, this was 225 calculations!</p>

<p><img src="http://www.nsc.liu.se/~pla/images/Abisko2nodes.png" alt="Speed of VASP on Abisko as a function of NPAR and binding" /></p>

<p>Inspecting the data points shows us that the same approach comes out winning again. Using 24 cores/compute node is still much more effective (+30%) than using all the cores, and <code>NPAR=6</code> is the best choice. Specifying process binding is essential, but the choice of a particular scheme matters less than in the single-node case, presumably because some of the load imbalance now happens between nodes, which we cannot address this way.</p>

<p>From this I conclude that a reasonable scheme for choosing NPAR indeed seems to be:</p>

<pre><code>NPAR = 3 * compute nodes
</code></pre>

<p>Or, if we have a recent version of VASP:</p>

<pre><code>NCORE = 8
</code></pre>
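<p>As a rough illustration of how the two settings relate, here is a small Python sketch. It assumes the usual VASP relation NCORE = (MPI ranks) / NPAR, and the helper names are made up for this example.</p>

```python
def npar_for_abisko(nodes, npar_per_node=3):
    """Rule of thumb from this post: NPAR = 3 * compute nodes."""
    return npar_per_node * nodes


def ncore_from_npar(total_ranks, npar):
    """Ranks per band group, assuming NCORE = total ranks / NPAR."""
    if total_ranks % npar != 0:
        raise ValueError("NPAR should divide the number of MPI ranks evenly")
    return total_ranks // npar


if __name__ == "__main__":
    ranks_per_node = 24  # one rank per Interlagos module
    for nodes in (1, 2, 4, 8):
        npar = npar_for_abisko(nodes)
        ncore = ncore_from_npar(nodes * ranks_per_node, npar)
        print(f"{nodes} node(s): NPAR={npar}, NCORE={ncore}")
```

<p>With 24 ranks per node, every node count gives groups of 8 ranks, which is why a fixed <code>NCORE = 8</code> expresses the same rule.</p>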

<p>The &#8220;RR-NUMA&#8221; process binding has to be specified explicitly when you start VASP on Abisko:</p>

<pre><code>srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 /path/to/vasp
</code></pre>

<p>With these settings, I tested the parallel scaling on 1-8 compute nodes. It looks decent up to 4 compute nodes:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/Li128abiskoXN.png" alt="Inter-node scaling on Abisko" /></p>

<p>Remember that each node has 48 cores, of which we are using 24 cores, so 4 nodes = 96 MPI ranks. We get a top speed of about 30 Jobs/h. But what does this mean? It seems appropriate to elaborate on the choice of units here, as I have gotten questions about why I measure the speed like this instead of using wall time as a proxy for speed. The reason is that you can interpret the &#8220;Speed&#8221; value on the y-axis as <strong>the number of geometry optimization steps you could run in one hour of wall time</strong> on the cluster. This is something which is directly relevant when doing production calculations.</p>
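<p>The conversion between wall time and the speed unit is simple enough to sketch in a few lines of Python. The function names are only illustrative, and one &#8220;job&#8221; is taken to mean one complete run of the test case.</p>

```python
def jobs_per_hour(wall_seconds):
    """Convert the wall time of one job (in seconds) to a speed in jobs/h."""
    return 3600.0 / wall_seconds


def wall_seconds(speed_jobs_per_hour):
    """Inverse conversion: wall time in seconds for one job."""
    return 3600.0 / speed_jobs_per_hour


if __name__ == "__main__":
    # A speed of about 30 jobs/h corresponds to 120 s per job.
    print(wall_seconds(30.0))
```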

<p>For reference, we can compare the speeds above with Triolith. On Triolith, the same job (but with 512 bands instead of 480) tops out at about 38 Jobs/h with 16 compute nodes and 256 ranks. So the parallel scaling looks a bit weak compared to Triolith, but the absolute time to solution is still good.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[VASP on the Abisko cluster (with AMD Interlagos)]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/03/25/vaspabisko/"/>
    <updated>2013-03-25T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/03/25/vaspabisko</id>
    <content type="html"><![CDATA[<p>When Triolith had a service stop recently, I took the opportunity to explore the <a href="http://www.hpc2n.umu.se/resources/abisko">Abisko cluster</a> at HPC2N in Umeå. Abisko has 4-socket nodes with 12-core AMD Opteron &#8220;Interlagos&#8221; processors and lots of memory. Each compute node on Abisko features an impressive 500 gigaflop/s theoretical compute capability (compared to e.g. Triolith&#8217;s Xeon E5 nodes with 280 gigaflop/s). The question is how much of this performance we can get in practice when running VASP, and how to set important parallelization parameters such as NPAR and NSIM.</p>

<h1>Summary of findings</h1>

<ul>
<li>One Abisko node is about 20% faster than one Triolith node, but you have to use 50% more cores and twice the memory bandwidth to get the job done.</li>
<li>You should run with 24 cores per node. More specifically, one core per Interlagos “module”.</li>
<li>It is imperative that you specify MPI process binding explicitly either with <code>mpirun</code> or <code>srun</code> to get good speed.</li>
<li>Surprisingly, process binding in a round-robin scheme over NUMA zones is preferable to straight sequential binding of 1 rank per module. (Please see below for the binding masks I used)</li>
<li><strong>NPAR</strong>: MPI ranks should be in groups of 8, which means <code>NPAR=3*nodes</code>.</li>
<li><strong>NSIM</strong>: 8, brings you +10% performance vs the default choice.</li>
</ul>


<h1>Background on AMD Interlagos</h1>

<p>To understand the results, we first need to have some background knowledge about the Interlagos processors and how they differ from earlier models.</p>

<p>The first aspect is the number of cores vs the number of floating-point units (FPUs). The processors in Abisko are marketed by AMD as having 12 cores, but in reality there are only 6 FPUs, each of which is shared between 2 cores (called a &#8220;module&#8221;). So I consider them more like 6-core processors capable of running in two modes: either a &#8220;fat mode&#8221; with 1 thread with 8 flops/cycle or a &#8220;thin mode&#8221; with 2 threads with 4 flops/cycle each. Which one is better will depend on the mix of integer and floating point instructions. In a code like VASP, which is heavily dependent on floating point calculations and memory bandwidth, I would expect that running with 6 threads is better because there is always some overhead involved with using more threads.</p>
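<p>The theoretical peak numbers quoted in the introduction can be reproduced, approximately, from the FPU count. The sketch below rests on assumptions that are not stated in this post: a 2.6 GHz clock for the Abisko processors and 8 double-precision flops/cycle per Triolith core. The 2.2 GHz Triolith clock is the listed frequency.</p>

```python
def peak_gflops(fpus, flops_per_cycle, clock_ghz):
    """Theoretical peak in gigaflop/s: FPUs x flops/cycle x clock in GHz."""
    return fpus * flops_per_cycle * clock_ghz


# Abisko: 4 sockets x 6 FPUs (modules), 8 flops/cycle per FPU.
# The 2.6 GHz clock is an assumption, not a number from this post.
ABISKO_PEAK = peak_gflops(4 * 6, 8, 2.6)

# Triolith: 16 cores, assumed 8 flops/cycle each, 2.2 GHz listed clock.
TRIOLITH_PEAK = peak_gflops(16, 8, 2.2)

if __name__ == "__main__":
    print(f"Abisko:   {ABISKO_PEAK:.0f} gigaflop/s")
    print(f"Triolith: {TRIOLITH_PEAK:.0f} gigaflop/s")
```

<p>Under these assumptions, the estimates land close to the 500 and 280 gigaflop/s figures, and they show that the peak is set by the 24 FPUs per node, not by the 48 cores.</p>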

<p>The second aspect to be aware of is the <a href="http://www.hpc2n.umu.se/resources/abisko/cpuarch">memory topology</a>. Each node on Abisko has 48 cores, but they are separated into 8 groups, each of which has its own local memory. Each core can still access memory from everywhere, but it is much slower to read and write memory from a distant group. These groups are usually called NUMA zones (or nodes, or islands). We would expect that we need to group the MPI processes by tweaking the NPAR parameter to reflect the NUMA zone configuration. Specifically, this means 8 groups of 6 MPI ranks per compute node on Abisko, but more about that later.</p>

<h1>Test setup</h1>

<p>Here, we will be looking at the Li2FeSiO4 supercell test case with 128 atoms. I am running a standard spin-polarized DFT calculation (no hybrid), which I run to self-consistency with ALGO=fast. I adjusted the number of bands to 480 to better match the number of cores per node.</p>

<h1>Naive single node test</h1>

<p>A first (naive) test is to characterize the parallel scaling in a single compute node, without doing anything special such as process binding. This produced an intra-node scaling that looks like this after having tuned the NPAR values:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/Li128abisko1N.png" alt="Intra-node scaling on Abisko" /></p>

<p>Basically, this is what you get when you ask the queue system for 12, 16, 24, 36, or 48 cores on 1 node with exclusive access (no other jobs on the same node), and you just launch VASP with <code>srun $VASP</code> in the job script. We see that we get nice intra-node scaling. In fact, it is much better than expected, but we will see in the next section that this is an illusion.</p>

<p>The optimal choice of <code>NPAR</code> turned out to be:</p>

<pre><code>12 cores NPAR=1
16 cores NPAR=1
24 cores NPAR=3
36 cores NPAR=3
48 cores NPAR=6
</code></pre>

<p>This was also surprising, since I had expected NPAR=8 to be optimal. With that setting, there would be MPI process groups of 6 ranks which exactly fit in a NUMA zone. Unexpectedly, NPAR=6 seems optimal when using all 48 cores, and either NPAR=1 or NPAR=3 for the other cases. This does not fit the original hypothesis, but a weakness in our analysis is that we don’t actually know where the processes end up in this scenario, since there is no binding. The only way that you can get a symmetric communication pattern with NPAR=6 is to place ranks in a round robin scheme around each NUMA zone or socket. Perhaps this is what the Linux kernel is doing? An alternative hypothesis is that the unconventional choice of NPAR creates a load imbalance that may actually be beneficial because it allows for better utilization of the second core in each module. To explore this, I decided to test different binding scenarios.</p>

<h1>The importance of process binding</h1>

<p>To bind MPI processes to a physical core and prevent the operating system from moving them around inside the compute node, you need to give extra flags to either <code>srun</code> or your MPI launching command such as <code>mpirun</code>. On Abisko, we use <code>srun</code>, where binding is controlled through SLURM by setting e.g. the following in the job script:</p>

<pre><code>srun --cpu_bind=rank ...
</code></pre>

<p>This binds the MPI rank #1 to core #1, and so on in a sequential manner. It is also possible to explicitly specify where each rank should go. The following example binds 24 ranks to alternating cores, so that there is one rank running per Interlagos module:</p>

<pre><code>srun --cpu_bind=map_cpu=0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 ...
</code></pre>

<p>In this scheme, neighboring ranks are close to each other: i.e. rank #1 and #2 are in the same NUMA zone. The aim is to maximize the NUMA locality.</p>

<p>The third type of binding I tried was to distribute the ranks in a round-robin scheme in steps of 6 cores. The aim is to minimize NUMA locality, since neighboring ranks are far apart from each other, i.e. rank #1 and #2 are in different NUMA zones.</p>

<pre><code>srun --cpu_bind=map_cpu=0,6,12,18,24,30,36,42,2,8,14,20,26,32,38,44,4,10,16,22,28,34,40,46 ...
</code></pre>
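<p>Typing these core lists by hand is error-prone, so here is a small Python sketch that generates both maps. It assumes that cores are numbered consecutively within each NUMA zone (zone 0 holds cores 0-5, zone 1 holds cores 6-11, and so on), which is what the lists above imply. The function names are my own.</p>

```python
def alternate_map(cores=48, step=2):
    """One rank per module: every second core, in sequential order."""
    return list(range(0, cores, step))


def rr_numa_map(numa_zones=8, cores_per_zone=6, step=2):
    """Round-robin over NUMA zones, still using one core per module."""
    return [zone * cores_per_zone + offset
            for offset in range(0, cores_per_zone, step)
            for zone in range(numa_zones)]


def as_cpu_bind(cpu_map):
    """Format a list of core numbers as an srun --cpu_bind option."""
    return "--cpu_bind=map_cpu=" + ",".join(str(core) for core in cpu_map)


if __name__ == "__main__":
    print("srun", as_cpu_bind(alternate_map()), "...")
    print("srun", as_cpu_bind(rr_numa_map()), "...")
```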

<p>Below are the results when comparing the speed of running with 48 cores and 24 cores with different kinds of process binding. The 48 core runs are with NPAR=6 and the 24 cores with NPAR=3.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/Abisko1Nbinding.png" alt="Speed vs process binding on Abisko" /></p>

<p>It turns out that you can get all of the performance, and even more, by running with 24 cores in the “fat” mode. The trick is, however, that we need to enable the process binding ourselves. It does not happen by default when you run with half the number of cores per node (the &#8220;None&#8221; section in the graph).</p>

<p>We can further observe that straight sequential process binding actually worsens performance in the 48 core scenario. Only in the round-robin NUMA scheme (&#8220;RR-NUMA&#8221;) can we reproduce the performance of the unbound case. This leads me to believe that running with no binding puts you in a similar situation with broken NUMA locality, which explains why NPAR=3/6 is optimal, and not NPAR=4.</p>

<p>The most surprising finding, however, is that the top speed was achieved not with the &#8220;alternate&#8221; binding scheme, which emphasizes NUMA memory locality, but rather with the round-robin scheme, which breaks memory locality of NPAR groups. The difference in speed is small (about 3%), but statistically significant. There are few scenarios where this kind of interleaving over NUMA zones is beneficial, so I suspect that it is not actually a NUMA issue, but rather related to memory caches. The L3 cache is shared between all cores in a NUMA zone, so perhaps the L3 cache is being thrashed when all the ranks in an NPAR group are accessing it? It would be interesting to try to measure this effect with hardware counters&#8230;</p>

<h1>NSIM</h1>

<p>Finally, I also made some tests with varying NSIM:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/InterlagosNSIM.png" alt="Speed on Abisko with different NSIM values" /></p>

<p>NSIM=4 is the default setting in VASP. It usually gives good performance in many different scenarios. NSIM=4 works on Abisko too, but I gained about 7% by using NSIM=8 or 12. An odd finding was that NSIM=16 completely crippled the performance, doubling the wall time compared to NSIM=4. I have no good explanation, but it seems that one should be careful with too high NSIM values on Abisko.</p>

<h1>Conclusion and overall speed</h1>

<p>In terms of absolute speed, we can compare with Triolith, where one node with 16 cores can run this example in 380s (9.5 jobs/h) with 512 bands, using the optimal settings of NPAR=2 and NSIM=1. <strong>So the overall conclusion is that one Abisko node is roughly 20% faster than one Triolith node.</strong> You can easily become disappointed by this when comparing the performance per core, which is 2.5x higher on Triolith, but I think it is not a fair comparison. In reality, the performance difference per FPU is more like 1.3x, and if you compensate for the fact that the Triolith processors in reality run at a much higher frequency than the listed 2.2 GHz, the true difference in efficiency per core-GHz is closer to 1.2x.</p>
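<p>The per-core and per-FPU ratios follow directly from the node-level numbers. Here is a minimal Python sketch of the arithmetic, using only the figures quoted in this post (380 s on Triolith, Abisko roughly 20% faster per node). Since the 20% is a rounded figure, the results are approximate.</p>

```python
TRIOLITH_WALL_S = 380.0   # wall time for one job on one Triolith node
ABISKO_ADVANTAGE = 1.2    # one Abisko node is roughly 20% faster


def speed_ratio_per_unit(units_triolith, units_abisko):
    """Triolith speed per unit divided by Abisko speed per unit."""
    triolith = 3600.0 / TRIOLITH_WALL_S     # jobs/h per node
    abisko = ABISKO_ADVANTAGE * triolith    # jobs/h per node
    return (triolith / units_triolith) / (abisko / units_abisko)


if __name__ == "__main__":
    # 16 cores per Triolith node vs 48 cores per Abisko node.
    print(f"per core: {speed_ratio_per_unit(16, 48):.2f}x")
    # 16 FPUs per Triolith node vs 24 FPUs (modules) per Abisko node.
    print(f"per FPU:  {speed_ratio_per_unit(16, 24):.2f}x")
```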

<p>Hopefully, I can make some multi-node tests later and determine whether running with 24 cores per node and round-robin binding is the best thing there as well.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Making supercells for VASP]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/02/26/vaspsupercells/"/>
    <updated>2013-02-26T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/02/26/vaspsupercells</id>
    <content type="html"><![CDATA[<p>I often get questions on how to make supercells for VASP calculations. The problem is typically that you have a structure in a POSCAR file and then want to expand it to a bigger supercell to study e.g. defects. There are many programs available that can perform this task, like <a href="http://www.homepages.ucl.ac.uk/~ucfbdxa/phon/">PHON</a> and <a href="http://phonopy.sourceforge.net/">Phonopy</a>, and if you google, you can find many scripts, usually called &#8220;vasputil&#8221; that can do this task specifically. Many people in your research group have probably written their own program as well.</p>

<p>What I really recommend, however, is to not reinvent the wheel and instead use already available libraries to analyze and work with your ab initio calculations. One such library is the Atomic Simulation Environment (<a href="https://wiki.fysik.dtu.dk/ase/">ASE</a>) for Python, which supports many programs, including VASP. With ASE, you can do really cool stuff like making small Python programs which read your VASP input/output and then work on them programmatically. In fact, with ASE, it is almost trivial to make supercells. You can do it with 3 lines of Python code:</p>

<pre><code>import ase.io.vasp
cell = ase.io.vasp.read_vasp("POSCAR")
ase.io.vasp.write_vasp("POSCAR.4x4x4",cell*(4,4,4), label='444supercell',direct=True,sort=True)
</code></pre>

<p>The code above reads a POSCAR file in the current working directory, transforms it to a 4x4x4 supercell and writes the results to disk as &#8220;POSCAR.4x4x4&#8221;. Note that to run this on Triolith, you need to have the ASE module loaded:</p>

<pre><code>module load ase/3.6.0
</code></pre>

<p>I thought this trick might be useful, so I made a script that can perform this procedure available in the <code>vasptools</code> module on Triolith. The script is called <code>supersize</code> and takes a POSCAR file as the first argument, and the supercell sizing as the second:</p>

<pre><code>$ supersize POSCAR 4x4x4
</code></pre>

<p>The output is a new POSCAR file called &#8220;POSCAR.4x4x4&#8221; containing the supercell repeated 4 times in a,b,c directions.</p>
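<p>For those who want something similar on another system, a script along these lines could look roughly like the sketch below. This is not the actual <code>supersize</code> script on Triolith, only an illustration built from the same ASE calls as above. The function names are hypothetical.</p>

```python
import sys


def parse_supercell(spec):
    """Turn a string such as '4x4x4' into the tuple (4, 4, 4)."""
    parts = spec.lower().split("x")
    if len(parts) != 3:
        raise ValueError("expected a size like 4x4x4, got %r" % spec)
    return tuple(int(part) for part in parts)


def supersize(poscar, spec):
    """Write '<poscar>.<spec>' with the cell repeated along a, b and c."""
    import ase.io.vasp  # imported here so the parser works without ASE

    repeat = parse_supercell(spec)
    cell = ase.io.vasp.read_vasp(poscar)
    ase.io.vasp.write_vasp(poscar + "." + spec, cell * repeat,
                           direct=True, sort=True)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        supersize(sys.argv[1], sys.argv[2])
    else:
        print("usage: supersize POSCAR 4x4x4")
```

<p>Keeping the size parsing in its own function makes it easy to support non-cubic repetitions such as 2x2x1 without touching the ASE part.</p>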
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Quantum Espresso vs VASP (round 2)]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/02/19/qevasp-part2/"/>
    <updated>2013-02-19T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/02/19/qevasp-part2</id>
    <content type="html"><![CDATA[<p>In the second round, I wanted to do a medium-sized, more complicated system. The original plan was to run the Li2FeSiO4 supercell with spin polarization, which I have run extensively in VASP before. It is a nontrivial example from my previous research, and it can be tricky to get fast convergence to the right ground state. Unfortunately, I failed at getting the Li2FeSiO4 system to run. PWscf kept crashing, despite much tinkering, and all I got was the following error message:</p>

<pre><code>Error in routine rdiaghg (1539): S matrix not positive definite
</code></pre>

<p>The QE documentation does mention these kinds of errors, saying that they are related to negative charge densities in the cores, which is basically either due to an unreasonable crystal structure or a poor choice of pseudopotentials. Standard tricks like increasing the <code>ecutrho</code> parameter or changing the diagonalization algorithm did not help either, so my guess is that something was wrong with the available PAW datasets. The all-electron Li PAW, for example, comes with a suggested plane wave cutoff of 1100 eV, compared with 272 eV in VASP. I am not sure if it is numerically sane to mix it with the other ones with much lower cutoff. I have seen this give rise to numerical instabilities in VASP, for example.</p>

<h2>New test case: Fe-N-doped graphene</h2>

<p>Instead, I constructed a 128-atom supercell of graphene. I inserted an Fe cation site coordinated by four pyridinic sites, to make it a little bit more exciting and also have a reason to do spin polarization.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/NFeGraphene.png" alt="N-Fe-doped graphene" /></p>

<p>The PAW datasets were:</p>

<pre><code>Fe.pbe-spn-kjpaw_psl.0.2.1.UPF (16 valence)
C.pbe-n-kjpaw_psl.0.1.UPF (4 valence)
N.pbe-n-kjpaw_psl.0.1.UPF (5 valence)
</code></pre>

<p>In total, this system has 127 atoms and 524 electrons, so 512 bands per spin channel is a nice and even number here. There is a similar issue with plane-wave cutoff as in the previous example. I set it to 500 eV to compare with the VASP calculation. It could be argued that it is artificially low, and that a real production calculation with QE using the PAW potentials would need to have a bigger cutoff.</p>

<p>A 500 eV basis requires an FFT grid of 144x144x72 points, which in the case of VASP means that an optimal plane-wise decomposition of the FFTs can be achieved for 1,2,4,8, and 12 compute nodes by using NPAR=1,2,4,8,12, respectively. If I understand the PWscf documentation correctly, 72 FFT planes in the Z-direction means that we should be able to scale up to 2x72 MPI ranks, since we have spin polarization (2 &#8220;effective&#8221; k-points), and that we are also likely to be helped by FFT task group parallelization using the <code>-ntg</code> command line flag.</p>

<p>I ran the jobs on 1,2,4,8,12, and 16 Triolith compute nodes, using all available cores (16 cores per node). For VASP, the optimal setting is band parallelization using <code>NPAR=compute nodes</code>, and <code>LPLANE=.TRUE.</code>. All PWscf calculations were run with <code>-npool 2</code>, which activates k-point parallelization, together with several combinations of <code>-ntg</code> and <code>-ndiag</code>, which control FFT task parallelization and SCALAPACK linear algebra parallelization, respectively. There is experimental support for band parallelization in QE (<code>-nband</code> flag), but it either crashed the program or ran horribly slowly, so the results below are using the standard parallelization options.</p>

<h2>Results</h2>

<p><img src="http://www.nsc.liu.se/~pla/images/vaspqe2.png" alt="Fe-N-doped graphene QE vs VASP speed" /></p>

<p>VASP scales acceptably up to 12 nodes / 192 cores, whereas QE only has decent scaling from 1 to 2 nodes. I believe that the reason is that VASP has band parallelization, but QE does not. To test my theory, I ran the VASP jobs with as low NPAR as possible, which is shown as the blue dotted line. This meant NPAR=1 (no band parallelization) for 1-8 nodes, and NPAR=2 for 12-16 nodes. The parallel scaling is much worse then, and essentially flat from 8 nodes and upwards, which is similar to the QE results.</p>

<p>In terms of absolute performance, VASP and QE are tied again when running on 16 and 32 cores, with PWscf actually being about 10% faster on 32 cores. But when comparing the top speed, VASP achieves at least 25 Jobs/h with 16 nodes vs. 10 Jobs/h with PWscf on 8 nodes. So we are looking at less than half the time to solution with VASP.</p>

<p>Another purpose of this study was to characterize the parallelization settings for QE when running on Triolith. The best parallelization settings for this system turned out to be:</p>

<pre><code>Nodes -ntg  -ndiag
1        1      16
2        1      16
4        2      16
8        2      16
12       4      16
16       4      16
</code></pre>

<p>FFT task groups (<code>-ntg</code>) seem to be necessary for higher core counts, just as suggested in the QE manual. The rule of thumb in the manual is to enable <code>-ntg</code> when the number of cores exceeds the number of FFT mesh points in the z direction, which seems accurate in this case.</p>

<p>I found the performance curve for SCALAPACK parallelization very flat for <code>ndiag=16/25/36</code>, so I was unable to resolve any difference with just 3-5 samples per point, but it seems like the performance flattens out above 16 cores for this system. Diagonalizing a 512x512 matrix is not that big of a task in the context of SCALAPACK, so this is not surprising.</p>

<p>Mixing and SCF stability turned out to be an influential factor when making the comparison between VASP and PWscf. The default mixing scheme in VASP is very good and can converge the graphene system studied here in ca 32 SCF iterations using default settings, but getting down to that level with PWscf required tuning of beta (0.2) and change of the mixing mode from <code>plain</code> to <code>local-TF</code>.</p>

<h2>Conclusion</h2>

<p>When it comes to bigger, more realistic calculations, PWscf is not as straightforward to work with as VASP. This is a combination of the robustness and availability of PAW datasets, and the increased need of parameter tuning necessary to get decent performance. The speed is on par with VASP for 1-2 compute nodes, but VASP has a much faster and more predictable parallel scaling beyond that. It was a surprise to me to not find working band parallelization in PWscf.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Quantum Espresso vs VASP (round 1)]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/02/04/qevasp-part1/"/>
    <updated>2013-02-04T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/02/04/qevasp-part1</id>
    <content type="html"><![CDATA[<p>There are just a few implementations of the PAW method: <a href="http://users.wfu.edu/natalie/papers/pwpaw/man.html">PWPAW</a>, <a href="http://www.abinit.org/">ABINIT</a>, <a href="http://www.vasp.at/">VASP</a>, <a href="https://wiki.fysik.dtu.dk/gpaw/">GPAW</a>, and the PWscf program in <a href="http://www.quantum-espresso.org/">Quantum Espresso</a> (&#8220;QE&#8221; from now on). VASP is frequently held up as the fastest implementation, and I concluded in <a href="http://www.nsc.liu.se/~pla/blog/2012/04/18/abinitvasp-part2">earlier tests</a> that standard DFT in ABINIT is too slow compared to VASP to be useful for running large supercells. But how does QE compare to VASP? There has been extensive work on QE in recent years, such as adding GPU support and mixed OpenMP/MPI parallelism, and you can find papers showing good parallel scalability such as <a href="http://www.prace-ri.eu/IMG/pdf/enabling_of_quantum_espresso_to_petascale_scientific_challenges.pdf">(ref1)</a> and <a href="http://www.fisica.uniud.it/~giannozz/Papers/rimini08.pdf">(ref2)</a>. By request, I will therefore perform and publish some comparisons of VASP and QE, in the context of PAW calculations. Primarily, I am interested in:</p>

<ul>
<li>Raw speed, measured as performance on a single 16 core compute node.</li>
<li>Parallel scalability, measured as the time to solution and computational efficiency for very large supercell calculations.</li>
<li>General robustness and production readiness, in particular numerical stability, sane defaults, and quality of documentation.</li>
</ul>


<p>So as the first step, before going into parallel scaling studies, it is useful to know the performance on the level of a single compute node. In the ABINIT vs VASP study, I used a silicon supercell with 31 atoms and one vacancy for basic testing. I will use it here as well to provide a point of reference.</p>

<h2>Methods</h2>

<p>Access to prepared PAW datasets is crucial for a PAW code to be useful, as most users prefer to not generate pseudopotentials themselves (although it can be discussed whether this is a wise approach). Fortunately, there are now more <a href="http://www.quantum-espresso.org/?page_id=190">PAW datasets</a> available for QE. I found &#8220;Si.pbe-n-kjpaw_psl.0.1.UPF&#8221; to be the most similar one to VASP&#8217;s standard silicon POTCAR. It is a scalar relativistic setup with 4 valence electrons for the PBE exchange-correlation functional. It differs in the suggested plane-wave cutoff, though, where the QE value is much higher (ca 37 Ry, or about 500 eV) compared to the VASP one (250 eV). I decided to use 250 eV in both programs for benchmarking purposes, but whether you get the same physical results at this cutoff is a highly relevant question. I will postpone the discussion of differences between PAW potentials for now, but I expect to return to it at a later time.</p>

<p>As usual, I put some effort into making sure that I am running as similar calculations as possible from a numerical point of view:</p>

<ul>
<li>The plane-wave cutoff was set to 18.4 Ry/250 eV. This led to a 72x72x72 coarse FFT grid in both programs.</li>
<li>The fine FFT grids were 144x144x144.</li>
<li>80 bands, set manually</li>
<li>6 k-points (automatically generated)</li>
<li>6 symmetry operations detected in the crystal lattice</li>
<li>SCF convergence thresholds were 1.0e-6 Ry and 1.0e-5 eV, respectively.</li>
<li>Minimal disk I/O (i.e. no writing of WAVECAR and wfc files).</li>
</ul>
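<p>Since QE works in Rydberg and VASP in eV, it is easy to slip on the unit conversion when matching settings. Below is a small Python sketch of the conversion used for the numbers above. The constant is the Rydberg energy rounded to six decimals, and the function names are only illustrative.</p>

```python
RYDBERG_IN_EV = 13.605693  # 1 Ry expressed in eV (rounded)


def ev_to_ry(energy_ev):
    """Convert an energy from eV to Ry."""
    return energy_ev / RYDBERG_IN_EV


def ry_to_ev(energy_ry):
    """Convert an energy from Ry to eV."""
    return energy_ry * RYDBERG_IN_EV


if __name__ == "__main__":
    # Plane-wave cutoff: 250 eV in VASP corresponds to about 18.4 Ry in QE.
    print(f"{ev_to_ry(250.0):.1f} Ry")
    # SCF threshold: 1.0e-6 Ry is about 1.4e-5 eV.
    print(f"{ry_to_ev(1.0e-6):.1e} eV")
```

<p>The converted SCF threshold shows that the two convergence criteria are of comparable size, although not identical.</p>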


<p>A notable difference is that PWscf does not have the RMM-DIIS algorithm, so I used the default Davidson iterative diagonalization instead. In the VASP calculations, I used the hybrid Davidson/RMM-DIIS approach (<code>ALGO=fast</code>), which is what I would use in a real-life scenario.</p>

<p>VASP was compiled with the <code>-DNGZhalf</code> preprocessor option for improved FFT performance. I could not find a clear answer in the PWscf documentation about this feature, but it does use the real FFT representation for gamma point calculations, so I presume that the &#8220;half&#8221; representation is also employed for normal calculations.</p>

<h2>Results</h2>

<p>Results when running the Si 31 atom cell on 1 Triolith compute node (16 Xeon E5 cores clocked at 2.2-3.0 GHz):</p>

<p><img src="http://www.nsc.liu.se/~pla/images/vaspqe1.png" alt="Si31 cell QE vs VASP speed" /></p>

<p>It is great to find VASP and QE pretty much tied for this test case. VASP is around 6% faster, but the QE calculation also required 20 SCF steps instead of 15 steps in VASP, so it is possible that a more experienced QE user would be able to tune the SCF convergence better to achieve at least parity in speed.</p>

<p>I ran with both 8 cores and 16 cores per node to get a feel for the intra-node parallel scaling. QE actually scales better with 1.6x speed-up from 8 to 16 cores, vs 1.3x for VASP. This indicates to me that QE puts less pressure on the memory system and could scale better on big multicore nodes.</p>
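<p>To put the speed-up numbers in perspective, they can be turned into a parallel efficiency, i.e. the fraction of the ideal speed-up that was obtained. This is a minimal Python sketch with a function name of my own choosing.</p>

```python
def parallel_efficiency(speedup, cores_before, cores_after):
    """Measured speed-up divided by the ideal speed-up for the added cores."""
    ideal = cores_after / cores_before
    return speedup / ideal


if __name__ == "__main__":
    # Going from 8 to 16 cores on one Triolith node.
    print(f"QE:   {parallel_efficiency(1.6, 8, 16):.0%}")
    print(f"VASP: {parallel_efficiency(1.3, 8, 16):.0%}")
```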

<p>I also tested the OpenMP enabled version of QE, but for this test case there was no real benefit in terms of speed: 8 MPI ranks with 2 OpenMP threads each did run faster (+27%) than 8 ranks with only MPI, but not as fast as 16 MPI ranks. But there was a slight reduction in memory: 16 MPI ranks with <code>-npool 2</code> required a maximum of 678 MB of memory, whereas 8 MPI ranks with 2 OpenMP threads used only 605 MB. So by using this approach, you could save some memory at the expense of speed. VASP, for comparison, used 1205 MB with k-point parallelization with 16 ranks, but 706 MB with band parallelization (which was the fastest option), so the memory usage of QE and VASP is very similar in practice.</p>

<h2>Summary</h2>

<p>In conclusion, these were my first serious calculations with PWscf, and my overall impression is that the PAW speed is promising, and that it was relatively painless to set up the calculation and get going. I found the documentation comprehensive, but somewhat lacking in organization and mostly of the reference kind. There are no explicit examples or practical notes on what you actually need to do to run a typical calculation. In fact, in my first attempt, I completely failed at getting PWscf to read my input files. I had to consult with an experienced QE user to understand why QE would not read my atomic positions (it turned out that the ATOMIC_POSITIONS section is only read if “nat” is set in the SYSTEM section). Once over the initial hurdle, it was quite a smooth ride, though. SCF convergence was achieved using the default settings, and all symmetries were detected correctly: that is something you cannot take for granted in all electronic structure codes.</p>

<p>Next up is a medium-sized calculation, the Li2FeSiO4 supercell with 128 atoms.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Tuning VASP: BLAS and FFT on Cray XE6]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/01/30/tuning-cray/"/>
    <updated>2013-01-30T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/01/30/tuning-cray</id>
    <content type="html"><![CDATA[<p>I did some more testing of BLAS and FFT libraries for VASP on Cray XE6, while working on VASP 5.3.3 for PDC&#8217;s Lindgren. Before, I always prepared VASP with Intel MKL and Cray MPI. This was mostly for compatibility reasons, but benchmarking also showed that the MKL version was much faster (about 10-20%) than the LibSci version. This is counterintuitive, since Cray has optimized BLAS routines in LibSci (in addition to the standard ones from GotoBLAS). Why was MKL so much faster? Could it be the FFT subroutines, just like what I saw on <a href="http://www.nsc.liu.se/~pla/blog/2013/01/10/tuning-ffts">Sandy Bridge</a> CPUs? I decided to build a version of VASP with BLAS/LAPACK from LibSci and FFTs from Intel&#8217;s MKL to test this hypothesis. For reference, the LibSci version was 11.1, and the MKL version was the one included in PrgEnv-intel/4.0.46, i.e. 10.3.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/mkl-libsci.png" alt="Speed Cray LibSci vs MKL" /></p>

<p>I ran three test systems:</p>

<ul>
<li>Li2FeSiO4 cell with 128 atoms, standard DFT, with 2 compute nodes.</li>
<li>MgO cell with 63 atoms using HSE06 and k-point parallelization over 3 compute nodes.</li>
<li>NiSi cell with 504 atoms over 16 compute nodes (384 cores), standard DFT.</li>
</ul>


<p>The MKL version is indeed faster than the LibSci version, +7-12%, but it is possible to squeeze out a few percent more performance by <strong>combining LibSci with MKL</strong>. The system jitter on Lindgren is normally very low, so the differences here are statistically significant. So, for the 5.3.3 version, I decided to deploy the LibSci+MKL version on Lindgren.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[How to compile VASP on Cray XE6]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/01/23/compile-vasp-on-lindgren/"/>
    <updated>2013-01-23T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/01/23/compile-vasp-on-lindgren</id>
    <content type="html"><![CDATA[<p>Here are some instructions for making a basic installation of VASP 5.3.3 on Cray XE6. It applies specifically to the Cray XE6 at PDC called <a href="http://www.pdc.kth.se/resources/computers/lindgren">&#8220;Lindgren&#8221;</a>, but Cray has a similar environment on all machines, so it might be helpful for other Cray sites as well.</p>

<p>First, download the prerequisite source tarballs from the VASP home page:</p>

<pre><code>http://www.vasp.at/ 
</code></pre>

<p>You need both the regular VASP source code, and the supporting &#8220;vasp 5&#8221; library:</p>

<pre><code>vasp.5.3.3.tar.gz
vasp.5.lib.tar.gz
</code></pre>

<p>I suggest making a new directory called e.g. 5.3.3, where you download and expand them. You would type commands approximately like this:</p>

<pre><code>mkdir 5.3.3
cd 5.3.3
(download)
tar zxvf vasp.5.3.3.tar.gz
tar zxvf vasp.5.lib.tar.gz
</code></pre>

<h2>Which compiler?</h2>

<p>The traditional compiler for VASP is Intel&#8217;s Fortran compiler (&#8220;ifort&#8221;). Version 12.1.5 of ifort is now the &#8220;official&#8221; compiler, the one which the VASP developers use to compile the program. Unfortunately, ifort is one of the few compilers which can compile the VASP source unmodified, since the code contains non-standard Fortran constructs. To compile with e.g. gfortran, pgi, or pathscale ekopath, which theoretically could generate better code for AMD processors, source code modifications are necessary. So we will stick with Intel&#8217;s Fortran compiler in this guide. On the Cray machine, this module is called &#8220;PrgEnv-intel&#8221;. Typically, PGI is the default preloaded compiler, so we have to swap compiler modules:</p>

<pre><code>module swap PrgEnv-pgi PrgEnv-intel
</code></pre>

<p>Check which version of the compiler you have by typing &#8220;ifort -v&#8221;:</p>

<pre><code>$ ifort -v
ifort version 12.1.5
</code></pre>

<p>If you have the &#8220;PrgEnv-intel/4.0.46&#8221; module loaded, it should state &#8220;12.1.5&#8221;.</p>

<h2>Which external libraries?</h2>

<p>For VASP, we need BLAS, LAPACK, SCALAPACK and the FFTW library. On the Cray XE6, these are usually provided by Cray&#8217;s own &#8220;libsci&#8221; library. This library is supposedly specifically tuned for the Cray XE6 machine, and should offer good performance.</p>

<p>Check that the libsci module is loaded:</p>

<pre><code>$ module list
...
xt-libsci/11.1.00
...
</code></pre>

<p>Normally, you combine libsci with the FFTW library. But I would recommend using the FFT routines from MKL instead, since they result in 10-15% faster overall speed in my benchmarks. Recent versions of MKL come with FFTW3-compatible wrappers built in (you don&#8217;t need to compile them separately), so <strong>by linking with MKL and libsci in the correct order</strong>, you get the best of both worlds.</p>

<h2>VASP 5 lib</h2>

<p>Compiling the VASP 5 library is straightforward. It contains some timing and IO routines, necessary for VASP, and LINPACK. My heavily edited makefile looks like this:</p>

<pre><code>.SUFFIXES: .inc .f .F
#-----------------------------------------------------------------------
# Makefile for VASP 5 library on Cray XE6 Lindgren at PDC
#-----------------------------------------------------------------------

# C-preprocessor
CPP     = gcc -E -P -C -DLONGCHAR $*.F &gt;$*.f
FC= ftn

CFLAGS = -O
FFLAGS = -O1 -FI
FREE   =  -FR

DOBJ =  preclib.o timing_.o derrf_.o dclock_.o  diolib.o dlexlib.o drdatab.o


#-----------------------------------------------------------------------
# general rules
#-----------------------------------------------------------------------

libdmy.a: $(DOBJ) linpack_double.o
    -rm libdmy.a
    ar vq libdmy.a $(DOBJ)

linpack_double.o: linpack_double.f
    $(FC) $(FFLAGS) $(NOFREE) -c linpack_double.f

.c.o:
    $(CC) $(CFLAGS) -c $*.c
.F.o:
    $(CPP) 
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
.F.f:
    $(CPP) 
.f.o:
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
</code></pre>

<p>Note that the Fortran compiler is always called “ftn” on the Cray, regardless of the module loaded. Also note the addition of the &#8220;-DLONGCHAR&#8221; flag on the CPP line. It activates the longer input format for INCAR files, so that you can have e.g. MAGMOM lines with more than 256 characters. Now compile the library with the &#8220;make&#8221; command and check that you have the &#8220;libdmy.a&#8221; output file. Leave the file here, as the main VASP makefile will include it directly from here.</p>

<h2>Editing the main VASP makefile</h2>

<p>I suggest that you start from the Linux/Intel Fortran makefile:</p>

<pre><code>cp makefile.linux_ifc_P4 makefile
</code></pre>

<p>It is important to realise that the makefile is split in two parts, and is intended to be used in an overriding fashion. If you don&#8217;t want to compile the serial version, you should enable the definitions of FC, CPP etc in the second half of the makefile to enable parallel compilation. These will then override the settings for the serial version.</p>

<p>Start by editing the Fortran compiler and its flags:</p>

<pre><code>FC=ftn -I$(MKL_ROOT)/include/fftw 
FFLAGS =  -FR -lowercase -assume byterecl
</code></pre>

<p>On Lindgren, I don’t get MKL_ROOT set by default when I load the PrgEnv-intel module, so you might have to set it yourself too:</p>

<pre><code>MKL_ROOT=/pdc/vol/i-compilers/12.1.5/composer_xe_2011_sp1.11.339/mkl
</code></pre>

<p>The step above is site-specific, so you should check where the Intel compilers are actually installed. If you cannot find any documentation, try inspecting the PATH after loading the module.</p>
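
<p>One way to inspect the PATH is sketched below. It assumes a POSIX shell; the helper name <code>find_in_path</code> and the search pattern are only examples. In the path shown above, the mkl directory sits inside the compiler installation directory, so finding the compiler&#8217;s entry in PATH may give a good hint, but this is not guaranteed at every site:</p>

<pre><code># List the entries of a colon-separated path variable that match a pattern.
# Usage: find_in_path PATTERN [PATHLIST]   (PATHLIST defaults to $PATH)
find_in_path() {
    echo "${2:-$PATH}" | tr ':' '\n' | grep -i -e "$1"
}

# Example, after loading the compiler module:
#   find_in_path composer
</code></pre>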

<p>Then, we change the optimisation flags:</p>

<pre><code>OFLAG=-O2 -ip 
</code></pre>

<p>Note that we leave out any SSE/architectural flags, since these are provided automatically by the “xtpe-mc12” module (make sure that it is loaded by checking the output of module list).</p>

<p>We do a similar trick for BLAS/LAPACK by providing empty definitions for them:</p>

<pre><code># BLAS/LAPACK should be linked automatically by libsci module
BLAS=
LAPACK=
</code></pre>

<p>We need to edit the LINK variable to include Intel’s MKL. I also like to add more verbose output of the linking process, to check that I am linking the correct library. One simple way to do this is to ask the linker to say from where it picks up the ZGEMM subroutine (it should be from Cray’s libsci, not MKL).</p>

<pre><code>LINK = -mkl=sequential -Wl,-yzgemm_
</code></pre>

<p>Now, move further down in the makefile, to the MPI section, and edit the preprocessor flags:</p>

<pre><code>CPP    = $(CPP_) -DMPI  -DHOST=\"PDC-REGULAR-B01\" -DIFC \
   -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
   -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK \
   -DRPROMU_DGEMV  -DRACCMU_DGEMV -DnoSTOPCAR
</code></pre>

<p>CACHE_SIZE is only relevant for Furth FFTs, which we do not use. The HOST variable is written out at the top of the OUTCAR file. It can be anything which helps you identify this compilation of VASP. The MPI_BLOCK variable needs to be set higher for best performance on the Cray XE6 interconnect. And finally, &#8220;noSTOPCAR&#8221; will disable the ability to stop a calculation by using the STOPCAR file. We do this to improve file I/O against the global file systems. (Otherwise, each VASP process will have to check this file for every SCF iteration.)</p>

<p>We will get Cray’s SCALAPACK linked in via libsci, so we set an empty SCA variable:</p>

<pre><code># SCALAPACK is linked automatically by libsci module
SCA= 
</code></pre>

<p>Then activate the parallelized version of the fast Fourier transforms with FFTW bindings:</p>

<pre><code>FFT3D   = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
</code></pre>

<p>Note that we do not need to link to FFTW explicitly, since it is included in MKL.</p>

<p>Finally, we uncomment the last library section for completeness:</p>

<pre><code>LIB     = -L../vasp.5.lib -ldmy  \
      ../vasp.5.lib/linpack_double.o \
      $(SCA) $(LAPACK) $(BLAS)
</code></pre>

<p>The full makefile is provided <a href="makefile.pdc533">here</a>.</p>

<h3>Compiling</h3>

<p>VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:</p>

<pre><code>make
</code></pre>

<p>If you really want to speed it up, you can try something like:</p>

<pre><code>make -j4; make -j4; make -j4; make -j4;
</code></pre>

<p>Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called &#8220;vasp&#8221;.</p>
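
<p>The loop mentioned above could look something like the following sketch. The function name <code>retry</code> and the limit of 10 attempts are arbitrary choices for illustration; the limit is there so that a genuinely broken makefile does not loop forever:</p>

<pre><code># Retry a command until it succeeds, giving up after MAX_TRIES attempts.
# Usage: retry MAX_TRIES COMMAND [ARGS...]
retry() {
    max_tries=$1
    shift
    n=1
    until "$@"; do
        if [ "$n" -ge "$max_tries" ]; then
            echo "giving up after $n attempts"
            return 1
        fi
        n=$((n + 1))
    done
    echo "succeeded after $n attempts"
}

# Example:
#   retry 10 make -j4
</code></pre>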

<h3>Running</h3>

<p>The Cray compiler environment produces statically linked binaries by default, since this is the most convenient way to run on the Cray compute nodes, so we just have to put e.g. the following in the job script to run on 384 cores (16 compute nodes).</p>

<pre><code>aprun -n 384 /path/to/vasp
</code></pre>

<p>I recommend setting NPAR equal to the number of compute nodes on the Cray XE6. <a href="http://www.nsc.liu.se/~pla/blog/2012/03/23/scalinglindgren">As I have shown before</a>, it is possible to run really big VASP simulations on the Cray with decent scaling over 1000-2000 cores. If you go beyond 32 compute nodes, it is worth <a href="http://www.nsc.liu.se/~pla/blog/2012/10/26/elpabench">trying to run on only half the number of cores per node</a>. So on a machine with 24 cores per node, you would ask for 768 cores, but actually run like this:</p>

<pre><code>aprun -n 384 -N 12 -S 3 /path/to/vasp
</code></pre>
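
<p>The numbers in the aprun line above can be derived from the allocation. Here is a minimal sketch, assuming 32 allocated nodes and that each 24-core XE6 node is divided into 4 NUMA domains (which is what <code>-N 12 -S 3</code> implies). The variable names are only for illustration:</p>

<pre><code>nodes=32             # compute nodes allocated (768 cores in total)
cores_per_node=24    # Cray XE6 node
numa_per_node=4      # assumption: 4 NUMA domains per node

ranks_per_node=$((cores_per_node / 2))               # use half of the cores
ranks=$((nodes * ranks_per_node))                    # total MPI ranks
ranks_per_numa=$((ranks_per_node / numa_per_node))   # spread evenly

echo "aprun -n $ranks -N $ranks_per_node -S $ranks_per_numa /path/to/vasp"
</code></pre>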
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Tuning VASP: fast Fourier transforms]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/01/10/tuning-ffts/"/>
    <updated>2013-01-10T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/01/10/tuning-ffts</id>
    <content type="html"><![CDATA[<p>This is the first post in a series about VASP tuning. Optimized fast Fourier transform subroutines are one of the keystones of getting a fast VASP installation. When I did a similar study for the Matter cluster at NSC (which has Intel &#8220;Nehalem&#8221; processors) in 2011, I found that MKL was superior. Now, it is time to look at Triolith, which has processors of &#8220;Sandy Bridge&#8221; architecture. These processors have new 256-bit vector instructions (called &#8220;AVX&#8221;), which need to be exploited for maximum floating-point performance.</p>

<p>Basically, we have three choices of FFTs:</p>

<ul>
<li>VASP&#8217;s built-in library by Jürgen Furthmüller, called &#8220;FURTH&#8221;. It is quite old now, but has the advantage that it comes with the VASP code, so that we don&#8217;t have to rely on an external library; we can also recompile it for new architectures. For best performance, one has to optimize the CACHE_SIZE preprocessor flag (usually a value between 0 and 32000).</li>
<li>The classical <a href="http://www.fftw.org/">FFTW library</a>. It can be optimized for many architectures by an automatic procedure. FFTW has support for AVX since version 3.3.1. On Triolith, we currently have version 3.3.2.</li>
<li>Intel&#8217;s own <a href="http://software.intel.com/en-us/intel-mkl">Math Kernel Library</a> (&#8220;MKL&#8221;). Presumably, no one should be better at optimizing for Intel processors than Intel themselves? Intel is also very aggressive with processor support, and many times MKL has support for unreleased processors. MKL gained AVX support in version 10.2, but versions 10.3 and higher use AVX instructions automatically.</li>
</ul>


<p>I chose the PbSO4 cell with 24 atoms as the test system, as it is quite small and more reliant on good FFT performance. Here are the results, without much further ado:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/trioffts.png" alt="Runtimes of VASP with different FFT libraries" /></p>

<p>We can see that MKL 10.3 is the best choice here, with an average runtime of 61 seconds, 45% faster than FFTW 3.3.2. The results for FFT-FURTH do not come out well. I think one reason is that this library does not utilize AVX instructions fully on Sandy Bridge. The default optimization options in the makefile are very conservative (-O1/-O2), so we will not get the full benefit. It might be possible to compile it more aggressively and get better speed.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[How to compile VASP on NSC's Triolith]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2013/01/09/compile-vasp-on-triolith/"/>
    <updated>2013-01-09T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2013/01/09/compile-vasp-on-triolith</id>
    <content type="html"><![CDATA[<p>These instructions are for the 5.3.3 version, but I expect the instructions to be applicable to the minor versions preceding and following 5.3.3.</p>

<p>First, download the prerequisite source tarballs from the VASP home page:</p>

<pre><code>http://www.vasp.at/ 
</code></pre>

<p>You need both the regular VASP source code, and the supporting &#8220;vasp 5&#8221; library:</p>

<pre><code>vasp.5.3.3.tar.gz
vasp.5.lib.tar.gz
</code></pre>

<p>I suggest making a new directory called e.g. 5.3.3, where you download and expand them. You would type commands approximately like this:</p>

<pre><code>mkdir 5.3.3
cd 5.3.3
(download)
tar zxvf vasp.5.3.3.tar.gz
tar zxvf vasp.5.lib.tar.gz
</code></pre>

<p>Currently, you want to load these modules:</p>

<pre><code>intel/12.1.4
impi/4.0.3.008
mkl/10.3.10.319
</code></pre>

<p>You can get these bundled in the following module:</p>

<pre><code>module load build-environment/nsc-recommended
</code></pre>

<h2>VASP 5 lib</h2>

<p>Compiling the VASP 5 library is straightforward. It contains some timing and IO routines, necessary for VASP, and LINPACK. My heavily edited makefile looks like this:</p>

<pre><code>.SUFFIXES: .inc .f .F
#-----------------------------------------------------------------------
# Makefile for VASP 5 library on Triolith
#-----------------------------------------------------------------------

# C-preprocessor
CPP     = gcc -E -P -C -DLONGCHAR $*.F &gt;$*.f
FC= ifort

CFLAGS = -O
FFLAGS = -Os -FI
FREE   =  -FR

DOBJ =  preclib.o timing_.o derrf_.o dclock_.o  diolib.o dlexlib.o drdatab.o


#-----------------------------------------------------------------------
# general rules
#-----------------------------------------------------------------------

libdmy.a: $(DOBJ) linpack_double.o
    -rm libdmy.a
    ar vq libdmy.a $(DOBJ)

linpack_double.o: linpack_double.f
    $(FC) $(FFLAGS) $(NOFREE) -c linpack_double.f

.c.o:
    $(CC) $(CFLAGS) -c $*.c
.F.o:
    $(CPP) 
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
.F.f:
    $(CPP) 
.f.o:
    $(FC) $(FFLAGS) $(FREE) $(INCS) -c $*.f
</code></pre>

<p>Note the addition of the &#8220;-DLONGCHAR&#8221; flag on the CPP line. It activates the longer input format for INCAR files, e.g. you can have MAGMOM lines with more than 256 characters. Now compile the library with the &#8220;make&#8221; command and check that you have the &#8220;libdmy.a&#8221; output file. Leave the file here, as the main VASP makefile will include it directly from here.</p>

<h2>VASP 5 binary</h2>

<h3>Preparations</h3>

<p>I only show how to build the parallel version with MPI and SCALAPACK here, as that is what you should run on Triolith. Navigate to the &#8220;vasp.5.3&#8221; directory where the main source code is:</p>

<pre><code>cd ..
cd vasp.5.3
</code></pre>

<p>Before we start, we want to think about how to find the external libraries that we need. These are:</p>

<ul>
<li>BLAS/LAPACK (for basic linear algebra)</li>
<li>FFT library (for fast Fourier transform from reciprocal to real space)</li>
<li>MPI (for parallel communication)</li>
<li>SCALAPACK (for parallel linear algebra, e.g. orthogonalization of states)</li>
</ul>


<p>For <strong>BLAS/LAPACK</strong>, we are going to use Intel&#8217;s Math Kernel Library (&#8220;MKL&#8221; henceforth). The easiest way to link to MKL at NSC is by adding the following two flags to the compiler command:</p>

<pre><code>ifort -Nmkl -mkl=sequential ...
</code></pre>

<p>For <strong>fast Fourier transforms</strong>, we could use the common FFTW library with VASP, but MKL actually contains its own optimized FFTs together with an FFTW interface, so we can use these instead. Provided that we link with MKL, which we are already doing in order to get BLAS/LAPACK, we do not need to do anything more. The linker should pick up the FFTW subroutines automatically.</p>

<p>For <strong>MPI</strong>, we are going to use Intel&#8217;s MPI library. We have already loaded the &#8220;impi/4.0.3.008&#8221; module, so all we have to do is to add the &#8220;-Nmpi&#8221; flag to the compiler command:</p>

<pre><code>ifort -Nmpi ...
</code></pre>

<p>We don&#8217;t need to add explicit paths to any MPI libraries, or use the special &#8220;mpif90&#8221; compiler wrapper.</p>

<h3>Editing the makefile</h3>

<p>I suggest that you start from the Linux/Intel Fortran makefile:</p>

<pre><code>cp makefile.linux_ifc_P4 makefile
</code></pre>

<p>It is important to realize that the makefile is split in two parts, and is intended to be used in an overriding fashion. If you don&#8217;t want to compile the serial version, you should enable the definitions of FC, CPP etc in the second half of the makefile to enable parallel compilation. These will then override the settings for the serial version.</p>

<p>Start by editing the Fortran compiler and its flags:</p>

<pre><code>FC=ifort -I$(MKL_ROOT)/include/fftw 
FFLAGS =  -FR -lowercase -assume byterecl -Nmpi 
</code></pre>

<p>We need to add &#8220;-Nmpi&#8221; to get proper linking with Intel MPI at NSC. Then, we change the optimization flags:</p>

<pre><code>OFLAG=-O2 -ip -xavx 
</code></pre>

<p>This is to be on the safe side, so that we get AVX optimizations. Include MKL with FFTW like this:</p>

<pre><code>BLAS = -mkl=sequential
LAPACK = 
</code></pre>

<p>We use the serial version of MKL, without any multithreading, as VASP runs MPI on all cores with great success. Set the NSC specific linking options for MKL and MPI:</p>

<pre><code>LINK    = -Nmkl -Nmpi 
</code></pre>

<p>Uncomment the CPP section for the MPI parallel VASP:</p>

<pre><code>CPP    = $(CPP_) -DMPI  -DHOST=\"LinuxIFC\" -DIFC \
     -DCACHE_SIZE=4000 -DPGF90 -Davoidalloc -DNGZhalf \
     -DMPI_BLOCK=8000 -Duse_collective -DscaLAPACK \
     -DRPROMU_DGEMV  -DRACCMU_DGEMV
</code></pre>

<p>Change it to something like this:</p>

<pre><code>CPP     = $(CPP_) -DMPI -DHOST=\"TRIOLITH-BUILD01\" -DIFC \
          -DCACHE_SIZE=4000  -DPGF90 -Davoidalloc -DNGZhalf \
          -DMPI_BLOCK=262144 -Duse_collective -DscaLAPACK \
          -DRPROMU_DGEMV  -DRACCMU_DGEMV  -DnoSTOPCAR
</code></pre>

<p>CACHE_SIZE is only relevant for Furth FFTs, which we do not use. The HOST variable is written out at the top of the OUTCAR file. It can be anything which helps you identify this compilation of VASP. The MPI_BLOCK variable needs to be set higher for best performance on Triolith. And finally, &#8220;noSTOPCAR&#8221; will disable the ability to stop a calculation by using the STOPCAR file. We do this to improve file I/O against the global file systems. (Otherwise, each VASP process will have to check this file for every SCF iteration.)</p>

<p>Finally, we enable SCALAPACK from MKL:</p>

<pre><code>SCA= -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
</code></pre>

<p>And the parallelized version of the fast Fourier transforms with FFTW bindings:</p>

<pre><code>FFT3D   = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
</code></pre>

<p>Note that we do not need to link to FFTW explicitly, since it is included in MKL. Finally, we uncomment the last library section:</p>

<pre><code>LIB     = -L../vasp.5.lib -ldmy  \
      ../vasp.5.lib/linpack_double.o \
      $(SCA) $(LAPACK) $(BLAS)
</code></pre>

<p>We have to do this to include the &#8220;$(SCA)&#8221; variable. The full makefile can be found here on Triolith:</p>

<pre><code>/software/apps/vasp/5.3.3-18Dec12/build01/makefile
</code></pre>

<h3>Compiling</h3>

<p>VASP does not have a makefile that supports parallel compilation. So in order to compile we just do:</p>

<pre><code>make
</code></pre>

<p>If you really want to speed it up, you can try something like:</p>

<pre><code>make -j4; make -j4; make -j4; make -j4;
</code></pre>

<p>Run these commands repeatedly until all the compiler errors are cleared (or write a loop in the bash shell). Obviously, this approach only works if you have a makefile that you know works from the start. When finished, you should find a binary called &#8220;vasp&#8221;.</p>

<h3>Running</h3>

<p>When you compile according to these instructions, there is no need to set LD_LIBRARY_PATHs and such. Instead, the ifort compiler will hard-code all library paths by using the RPATH mechanism and write information into the binary file about which MPI version you used. This means that you can launch VASP directly like this in a job shell:</p>

<pre><code>mpprun /path/to/vasp
</code></pre>

<p>Mpprun will automatically pick up the correct number of processor cores from the queue system and launch your vasp binary using Intel’s MPI launcher.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[K-point parallelization in VASP, part 2]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/11/26/vaspkpar2/"/>
    <updated>2012-11-26T00:00:00+01:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/11/26/vaspkpar2</id>
    <content type="html"><![CDATA[<p>Previously, I tested the <a href="http://www.nsc.liu.se/~pla/blog/2012/09/26/vaspkpar/">k-point parallelization scheme in VASP 5.3</a> for a small system with hundreds of k-points. The outcome was acceptable, but less than stellar. Paul Kent (who implemented the scheme in VASP) suggested that it would be more instructive to benchmark medium to large hybrid calculations with just a few k-points, since this was the original use case, and consequently where you would be able to see the most benefit. To investigate this, I ran a 63-atom MgO cell with HSE06 functional and 4 k-points over 4 to 24 nodes:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/vaspkpar2.png" alt="K-point parallelization for MgO system" /></p>

<p>A suitable number of bands here is 192, so the maximum number of nodes we could expect to use with standard parallelization is 12, due to the fact that 12 nodes x 16 cores/node = 192 cores. And we do see that KPAR=1 flattens out at 1.8 jobs/h on 12 nodes. But with k-point parallelization, the calculation can be split into &#8220;independent&#8221; groups, each running on 192 cores. This enables us, for example, to run the job on 24 nodes using KPAR>=2, which in this case translates into a doubling of speed (4.0 jobs/h), compared to the best case scenario without k-point parallelization.</p>

<p>So there is indeed a real benefit for hybrid calculations of cells that are small enough to need a few k-points. And remember that in order for the k-point parallelization to work correctly with hybrids, you should set:</p>

<pre><code>NPAR = total number of cores / KPAR.
</code></pre>
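
<p>As a worked example of this rule, here is a small sketch for the 24-node case discussed above (16 cores per node on Triolith, KPAR=2). The variable names are only for illustration:</p>

<pre><code>nodes=24
cores_per_node=16    # Triolith
kpar=2

total_cores=$((nodes * cores_per_node))
npar=$((total_cores / kpar))

# Print the two lines to put in the INCAR file
printf 'KPAR = %d\nNPAR = %d\n' "$kpar" "$npar"
</code></pre>

<p>This gives NPAR = 192, i.e. each of the two k-point groups runs on 192 cores, which matches the number of bands.</p>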
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[VASP, ELPA, Lindgren and Triolith]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/10/26/elpabench/"/>
    <updated>2012-10-26T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/10/26/elpabench</id>
    <content type="html"><![CDATA[<p>So, can the <a href="http://elpa-lib.fhi-berlin.mpg.de/wiki/index.php/Main_Page">ELPA</a> library improve upon VASP&#8217;s SCALAPACK bottleneck?</p>

<p>Benchmarking of the ELPA-enabled version of VASP was performed on PDC&#8217;s Lindgren (a Cray XE6) and Phase 1 of Triolith at NSC (an HP SL6500-based cluster with Xeon E5 + FDR Infiniband). For this occasion, I developed a new test case consisting of a MgH2 supercell with 1269 atoms. The structure is an experimentally determined crystal structure, but with a few per cent of hydrogen vacancies. I feel this is a more realistic test case than the NiSi-1200 cell used before. Ideally, we should see decent scaling up to about 1000 cores / 64 nodes for this simulation. As usual, we expect the &#8220;EDDAV&#8221; subroutine to eventually become dominant. The number of bands is 1488, which creates a 1488x1488 matrix that needs to be diagonalized in the bottleneck phase. Actually, this matrix size is far smaller than what ELPA was intended for, which seems to be on the order of 10,000-100,000. So perhaps we will not see the true strength of ELPA here, but hopefully, it can alleviate some of the pathological behavior of SCALAPACK.</p>

<h2>Triolith</h2>

<p>First out is Triolith, with benchmarks for 4-64 compute nodes using both 8 and 16 cores per node. I keep <code>NPAR=nodes/2</code>, according to earlier findings. The recommended way to run with 8c/node at NSC is to invoke a special SLURM option &#8211; that way you don&#8217;t have to give the number of cores explicitly to mpprun:</p>

<pre><code>#SBATCH --ntasks-per-node 8
</code></pre>
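
<p>For context, a complete job script using this option might look like the sketch below. The node count and wall time are arbitrary example values, and the path to the binary is a placeholder:</p>

<pre><code>#!/bin/bash
#SBATCH -N 16
#SBATCH --ntasks-per-node 8
#SBATCH -t 01:00:00

# mpprun picks up the number of ranks (16 nodes x 8 tasks) from SLURM
mpprun /path/to/vasp
</code></pre>

<p>With <code>NPAR=nodes/2</code>, the INCAR file for this example would contain NPAR = 8.</p>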

<p><img src="http://www.nsc.liu.se/~pla/images/mg-elpa-trio.png" alt="Scaling of MgH2 on Triolith with and without ELPA" /></p>

<p>We find that the standard way of running VASP, with 16c/node and SCALAPACK, produces a top speed of about 16 jobs/h using 48 compute nodes, and going further actually degrades performance. The ELPA version, however, is able to maintain scaling to at least 64 nodes. In fact, the scaling curve looks very much like what you get when running VASP with SCALAPACK and 8c/node. Fortunately, the benefits of ELPA and 8c/node seem to be additive, meaning that ELPA wins over SCALAPACK on 48-64 nodes, even with 8c/node. In the end, the overall performance improvement is around 13% for the 64-node job. (<em>While not shown here, I also ran with 96-128 nodes, and the difference there with ELPA is a stunning +30-50% in speed, but I consider the total efficiency too low to be useful.</em>)</p>

<h2>Lindgren</h2>

<p>Now, let&#8217;s look at Lindgren, 8-64 compute nodes, using either 12 cores per node, or the full 24 cores. In the 12c case, I allocated three cores per socket, using</p>

<pre><code>aprun  -N 12 -S 3 ...
</code></pre>

<p>I used <code>NPAR=compute nodes</code> here, like before.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/mg-elpa-lindgren.png" alt="Scaling of MgH2 on Lindgren with and without ELPA" /></p>

<p>On the Cray machine, we do not benefit as much from ELPA as on Triolith. The overall increase in speed on 64 nodes is 5%. Instead, it is essential to drop down to 12c/node to get good scaling beyond 32 nodes for this job. Also, note the difference of scale on the vertical axis. Triolith has much faster compute nodes! Employing 64 nodes gives us a speed of 24.3 jobs/h vs 14.2 jobs/h, that is, <strong>a 1.7x speed-up per node</strong> or a <strong>2.5x speed-up on a per core basis</strong>.</p>

<h2>Parallel scaling efficiency</h2>

<p>Finally, it is instructive to compare the parallel scaling of Lindgren and Triolith. One of the strengths of the Cray system is the custom interconnect, and since the compute nodes are also slower than on Triolith, there is potential to realize better parallel scaling, when we normalize the absolute speeds.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/mg-lindgren-trio.png" alt="Comparing scaling of MgH2 on Lindgren and Triolith" /></p>

<p>We find, however, that the scaling curves are almost completely overlapping in the range where it is reasonable to run this job (4 to 64 nodes). The FDR Infiniband network is more than capable of handling this load, and the Cray interconnect is not so special at this relatively low-end scale.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Compiling VASP with the ELPA library]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/10/25/elpatest/"/>
    <updated>2012-10-25T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/10/25/elpatest</id>
    <content type="html"><![CDATA[<p>Previously, I showed how <a href="http://www.nsc.liu.se/~pla/blog/subroutines/">SCALAPACK is a limiting factor in the parallel scaling of VASP</a>. VASP 5.3.2 introduced support for the <a href="http://elpa-lib.fhi-berlin.mpg.de/wiki/index.php/Main_Page">ELPA</a> library, which can now be enabled in the subspace rotation phase of the program. You do this by compiling with the &#8220;-DELPA&#8221; preprocessor flag. In the VASP makefiles, there is a variable called CPP where this flag can be added:</p>

<pre><code>CPP     = $(CPP_)  -DHOST=\"NSC-ELPATEST-B01\" -DMPI -DELPA \
...
</code></pre>

<p>In addition, you need to get access to ELPA (by registering on their site) and add the source files to the makefile. I did it like this:</p>

<pre><code>ELPA = elpa1.o elpa2.o elpa2_kernels.o

vasp: $(ELPA) $(SOURCE) $(FFT3D) $(INC) main.o 
  rm -f vasp
  $(FCL) -o vasp main.o  $(ELPA) $(SOURCE) $(FFT3D) $(LIB) $(LINK)
</code></pre>

<p>The ELPA developers recommend that you compile with &#8220;-O3&#8221; and full SSE support, so I put these special rules at the end of the makefile.</p>

<pre><code># ELPA rules
elpa1.o : elpa1.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
elpa2.o : elpa2.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
elpa2_kernels.o : elpa2_kernels.f90
        $(FC) $(FFLAGS) -O3 -xavx -c $*$(SUFFIX)
</code></pre>

<p>(Here, -xavx optimizes for Triolith with its Sandy Bridge CPUs.)</p>

<p>With this procedure, I was able to compile VASP with ELPA support. As far as I can see, there is no visible confirmation in the OUTCAR file or on stdout that ELPA is being used. The output looks like that of regular VASP, but with small fluctuations in the decimals. I also saw some crashes when running on just a few nodes (&lt; 4). Perhaps ELPA is not as robust in this case, since it is not the intended usage scenario.</p>

<p>(Benchmarks of ELPA on Lindgren and Triolith will follow in the next post.)</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Testing the k-point parallelization in VASP]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/09/26/vaspkpar/"/>
    <updated>2012-09-26T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/09/26/vaspkpar</id>
    <content type="html"><![CDATA[<p>VASP 5.3.2 finally introduced official support for k-point parallelization. What can we expect from this new feature in terms of performance? In general, you only need many k-points in relatively small cells, so up front we would expect k-point parallelization to improve <strong>time-to-solution</strong> for small cells with hundreds or thousands of k-points. We do have a subset of users at NSC, running big batches of these jobs, so this may be a real advantage in the prototyping stage of simulations, when the jobs are set up. In terms of actual job throughput for production calculations, however, k-point parallelization should not help much, as the peak efficiency is reached already with 8-16 cores on a single node.</p>

<p>So let&#8217;s put this theory to the test. Previously, I benchmarked the <a href="http://www.nsc.liu.se/~pla/blog/2012/07/17/triobench-part2/">8-atom FeH system</a> with 400 k-points for this scenario. The maximum throughput was achieved with two 8-core jobs running on the same node, and the best time-to-solution was 3 minutes (20 jobs/h) with 16 cores on one compute node. What can k-point parallelization do here?</p>

<p><img src="http://www.nsc.liu.se/~pla/images/vaspkpar.png" alt="K-point parallelization for FeH system" /></p>

<p>KPAR is the new parameter which controls the number of k-point parallelized groups. KPAR=1 means no k-point parallelization, i.e. the default behavior of VASP. For each bar in the chart, the NPAR value has been individually optimized (and is thereby different for each number of cores). Previously, this calculation did not scale at all beyond one compute node (blue bars), but with KPAR=8 (purple bars), we can get close to linear (1.8x) speed-up going from 1 to 2 nodes, cutting the time-to-solution in half. As suspected, in terms of efficiency, the current k-point parallelization is not more efficient than the old scheme when running on a single node, which means that peak throughput remains the same at roughly 24 jobs/h per compute node. This is a little surprising, given that there should be overhead associated with running two jobs simultaneously on a node, compared to using k-point parallelization.</p>
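
<p>To make the bookkeeping concrete, here is a minimal Python sketch of how cores and k-points divide over the KPAR groups. It assumes that the total number of cores should be divisible by KPAR and that k-points are spread evenly over the groups; the function name <code>kpar_layout</code> is made up for this illustration and is not part of VASP.</p>

<pre><code>def kpar_layout(total_cores, kpar, n_kpoints):
    """Split cores and k-points evenly over KPAR groups (illustrative only)."""
    if total_cores % kpar != 0:
        raise ValueError("total_cores should be divisible by KPAR")
    cores_per_group = total_cores // kpar
    # Ceiling division: the group with the most k-points sets the pace.
    kpoints_per_group = -(-n_kpoints // kpar)
    return cores_per_group, kpoints_per_group

# Two 16-core nodes with KPAR=8 and the 400 k-points of the FeH case.
print(kpar_layout(32, 8, 400))  # (4, 50)
</code></pre>

<p>In this layout, each group of 4 cores works through 50 k-points, instead of all 32 cores sharing every k-point.</p>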

<p>What must be remembered, though, is that it is considerably easier to handle the file and job management for several sequential KPAR runs vs juggling several jobs per node with many directories, so in this sense, KPAR seems like a great addition with respect to workflow optimization.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[New version of VASP - 5.3.2]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/09/25/vasp532/"/>
    <updated>2012-09-25T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/09/25/vasp532</id>
    <content type="html"><![CDATA[<p>A new version of VASP was released recently. There are many important improvements in this version and I encourage all VASP users to check the full <a href="https://cms.mpi.univie.ac.at/marsweb/index.php?option=com_content&amp;view=article&amp;id=102:new-release-vasp532&amp;catid=44:administrative&amp;Itemid=55">release notes</a> on the VASP community page.</p>

<p>Among the highlights are:</p>

<ul>
<li><strong>K-point parallelization</strong>  (this should improve &#8220;scaling&#8221; for small jobs)</li>
<li>Molecular dynamics at constant pressure</li>
<li>Spin-orbit coupling calculation with symmetry</li>
<li>Subspace diagonalization by means of the <a href="http://elpa-lib.fhi-berlin.mpg.de/wiki/index.php/Main_Page">ELPA</a> library (this may improve scaling for wide parallel jobs running on e.g. PDC&#8217;s Lindgren).</li>
</ul>


<p>The first installation of VASP 5.3.2 binaries on NSC is available in:</p>

<pre><code>/software/apps/vasp/5.3.2-13Sep12/default/
</code></pre>

<p>Installations for Lindgren at PDC will follow shortly. The binaries are called <code>vasp-[gamma,half,full]</code> as usual. They ran through the test suite that I had without problems, but I noticed that on Triolith, some other calculations converged to different solutions when using the previous set of high optimizations used to compile 5.2.12, so I dropped the global optimization level down to -O1 for the Triolith installation, until things get sorted out. The overall performance drop is only 5%, at least for standard PBE-type calculations.</p>

<p>The plan for 5.3.2 is to produce two more versions:</p>

<ul>
<li>A &#8220;stable&#8221;, alternative build, based on OpenMPI, and possibly a different numerical library, that can be used for comparison if you suspect trouble with your calculations.</li>
<li>A &#8220;fast&#8221; version tuned for maximum parallel performance, including ELPA support.</li>
</ul>


<p>There have also been requests for versions with cell optimization restricted in different directions, like z-only, or xy-only. Apparently, this is an established &#8220;hack&#8221;, outlined on the VASP forums. To me, however, it seems better to implement this in the code by a set of new INCAR tags. This way, you can cover all combinations: x, xy, z, etc., without producing six different binaries. Hopefully, it will not be too difficult to make the changes.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[VASP hybrid calculations on Triolith]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/07/27/triobench-part3/"/>
    <updated>2012-07-27T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/07/27/triobench-part3</id>
    <content type="html"><![CDATA[<p>VASP 5 introduced DFT hybrid functionals like PBE0. The Hartree-Fock calculations add a significant amount of computational time, however, and in addition, these algorithms require parallelization using <code>NPAR=number of cores</code> which is not as effective. In my experience, we are also haunted by SCF convergence problems, and you need to experiment with the other SCF algorithms. So what can we expect from Triolith here?</p>

<p><img src="http://www.nsc.liu.se/~pla/images/MgOchart.png" alt="Parallel scaling MgO hybrid calculation" /></p>

<p>The chart above shows benchmark runs for a 63-atom MgO cell with Hartree-Fock turned on (corresponding to PBE0). <code>ALGO=All</code> is used, and <code>NPAR=cores</code> had to be set for each case separately. We find, not surprisingly, that we have good parallel scaling up to 4 compute nodes (equalling 1 atom per core). It is possible to crank up the speed by employing more compute nodes, but only by using 8-12 MPI ranks per node and idling half the cores. We have 192 bands in this calculation, so the maximum speed should be achieved with 16 nodes (16x12c/node = 192 ranks), which is also what we find (2.5 jobs/h).</p>
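
<p>The rank counting behind this argument can be sketched in a few lines of Python. This is only an illustration under the assumption that, with <code>NPAR=cores</code>, the bands are shared out over the MPI ranks, so that there is nothing left to distribute once each rank holds a single band. The helper name and the 8-node layout are hypothetical; the 4-node and 16-node layouts are the ones discussed above.</p>

<pre><code>def ranks_and_bands(nodes, ranks_per_node, n_bands):
    """Return the total number of MPI ranks and the bands available per rank."""
    ranks = nodes * ranks_per_node
    return ranks, n_bands / ranks

# 192 bands in the MgO calculation; the (8, 12) layout is a made-up example.
for nodes, ranks_per_node in [(4, 16), (8, 12), (16, 12)]:
    ranks, bands_per_rank = ranks_and_bands(nodes, ranks_per_node, 192)
    print(f"{nodes:2d} nodes x {ranks_per_node} ranks/node = {ranks:3d} ranks, "
          f"{bands_per_rank:.1f} bands/rank")
</code></pre>

<p>With 16 nodes and 12 ranks per node, the count lands on exactly one band per rank.</p>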

<p>These results should be compared with running the same job on Neolith, where a 16-node run (128 cores) reached 0.44 jobs/h, so Triolith is again close to 6 times faster on a node-by-node basis.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Small VASP jobs on Triolith]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/07/17/triobench-part2/"/>
    <updated>2012-07-17T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/07/17/triobench-part2</id>
    <content type="html"><![CDATA[<p>I have gotten requests about benchmarks of smaller simulations, rather than big supercells with hundreds of atoms. Below are the results for an 8-atom FeH cell with 64 bands and 400 k-points. Remember that VASP does not have parallelization over k-points, so it is very challenging to get good parallel performance in this case. Based on the good rule of thumb of using no more than 1 core per atom, or 4 bands per core, one should expect parallel scaling only within a compute node with 16 cores, but nothing beyond that.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/FeHchart.png" alt="Parameter study of FeH cell" /></p>

<p>This is also what I see when running full NPAR/NSIM tests with 4-32 ranks, as seen in the chart. <strong>Peak performance is achieved with 16 cores on one compute node, using NPAR=4</strong>. Using two compute nodes is actually slower, even when using the same number of MPI ranks. This implies that the performance is limited by communication and synchronization costs, and not by memory bandwidth (otherwise we would have seen an improvement in speed when using 16 ranks on two nodes instead of one). An interesting finding is that if you are submitting many jobs like this to the queue and are mainly interested in throughput rather than time to solution, then the <strong>optimal solution is to run two 8-core jobs on the same compute node</strong>.</p>

<p>The NSIM parameter does not seem to be as influential here, because we have so few bands. The full table is shown below:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/FeHstudy.png" alt="Parameter study of FeH cell" /></p>

<p>I also checked the influence of LPLANE=.FALSE. These results are not shown, but the difference was within 1%, so it was likely just statistical noise.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Running VASP on Triolith]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/07/10/triobench-part1/"/>
    <updated>2012-07-10T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/07/10/triobench-part1</id>
    <content type="html"><![CDATA[<p>The test pilot phase of our new Triolith has now started, and our early users are on the system compiling and running codes. The hardware has been surprisingly stable so far, but we still have a lot to do in terms of software. Don&#8217;t expect all software presently found to on Matter, and Kappa to be available immediately, because we have to recompiled them for the new Xeon E5 processors.</p>

<p>Regarding material science codes, I have put up preliminary versions of VASP, based on both the original source, and our collection of SNIC patches. I am also working on getting up a good compilation of Quantum Espresso. We are seeing performance gains as expected, but it will remain a formidable challenge to make many codes scale properly to 16 cores per node and 100s of compute nodes.</p>

<p>These are my quick recommendations for VASP based on initial testing:</p>

<table>
<thead>
<tr>
<th></th>
<th align="left"> Nodes </th>
<th align="right"> NPAR </th>
<th align="right"> Cores/node </th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td align="left"> 1 </td>
<td align="right"> 2 </td>
<td align="right"> 16 </td>
</tr>
<tr>
<td></td>
<td align="left"> 2 </td>
<td align="right"> 2 </td>
<td align="right"> 16 </td>
</tr>
<tr>
<td></td>
<td align="left"> 4 </td>
<td align="right"> 2 </td>
<td align="right"> 16 </td>
</tr>
<tr>
<td></td>
<td align="left"> 8 </td>
<td align="right"> 4 </td>
<td align="right"> 8</td>
</tr>
<tr>
<td></td>
<td align="left"> 16 </td>
<td align="right"> 8 </td>
<td align="right"> 8 </td>
</tr>
<tr>
<td></td>
<td align="left"> 32 </td>
<td align="right"> 16 </td>
<td align="right"> 8 </td>
</tr>
<tr>
<td></td>
<td align="left"> 64-128 </td>
<td align="right"> 32 </td>
<td align="right"> 8 </td>
</tr>
</tbody>
</table>


<p>(Wider jobs remain to be tested&#8230;)</p>

<h2>NPAR, NSIM, and LPLANE</h2>

<p>It looks like the same rules for NPAR apply as on our previous systems. The quick and easy rule of NPAR=compute nodes can be used, but you should see a slight improvement by decreasing NPAR somewhat from this value. For NSIM, however, there is a difference compared to our previous systems: <strong>you should set NSIM = 1</strong>, and gain a few percent extra speed, especially for smaller jobs (1-4 nodes). Finally, I looked at the LPLANE tag, but saw no detectable performance increase by setting LPLANE=.TRUE., presumably because the bandwidth in the FDR Infiniband network is more than sufficient to support the FFT operations that VASP does.</p>
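
<p>For scripting job submissions, the recommendations in the table can be wrapped in a small lookup. The following Python sketch simply transcribes the table rows; the function name <code>recommend</code> is hypothetical, and rounding node counts that fall between rows up to the next row is an assumption of this example, not something that was benchmarked.</p>

<pre><code>import bisect

# Rows transcribed from the table above: (nodes, NPAR, ranks per node).
# The last row stands for the 64-128 node range.
TABLE = [
    (1, 2, 16),
    (2, 2, 16),
    (4, 2, 16),
    (8, 4, 8),
    (16, 8, 8),
    (32, 16, 8),
    (128, 32, 8),
]
NODE_LIMITS = [row[0] for row in TABLE]

def recommend(nodes):
    """Return (NPAR, ranks per node) for a job running on this many nodes."""
    if nodes not in range(1, NODE_LIMITS[-1] + 1):
        raise ValueError("outside the tested range of 1-128 nodes")
    # Assumption: node counts between two rows are rounded up to the next row.
    index = bisect.bisect_left(NODE_LIMITS, nodes)
    _, npar, ranks_per_node = TABLE[index]
    return npar, ranks_per_node

for n in (1, 8, 64):
    npar, ranks_per_node = recommend(n)
    print(f"{n:3d} nodes: NPAR = {npar}, NSIM = 1, {ranks_per_node} ranks/node")
</code></pre>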

<h2>Number of cores per node</h2>

<p>With Neolith, Kappa and Matter, it was always advantageous to run with 8 MPI ranks on each node, so that you would use all available cores. On Triolith, however, going from 8 to 16 cores per node gives you very little extra performance. On a single compute node, 8 to 16 gives +30%-50%, but this drops to around 10% using 4 nodes, and nothing when running on > 8 nodes. For really wide jobs (>16 nodes), performance might <strong>increase</strong> when reducing the number of cores used from 16 cores per node to 8 cores per node. To test this way of running, request 8 tasks per node with &#8220;--ntasks-per-node&#8221; in the job script before launching VASP with &#8220;mpprun&#8221;, like this:</p>

<pre><code>#SBATCH -N 32
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
...
mpprun /software/apps/vasp/5.2.12.1/default/vasp-gamma
</code></pre>

<p>Note that we have asked for 32 compute nodes (meaning 32*16=512 cores), but we are actually running on only 256 cores <em>spread out over all the 32 nodes</em> because the queue system automatically spreads out the job so that each node gets 8 MPI ranks.</p>
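
<p>The same arithmetic as a short Python sketch, in case you want to check other job sizes. The helper name <code>job_footprint</code> is made up for this example; the 16 cores per node are those of Triolith.</p>

<pre><code>def job_footprint(nodes, ranks_per_node, cores_per_node=16):
    """Return (allocated cores, cores running MPI ranks, idle cores)."""
    allocated = nodes * cores_per_node
    used = nodes * ranks_per_node
    return allocated, used, allocated - used

# The 32-node job from the script above, with 8 ranks per node.
print(job_footprint(32, 8))  # (512, 256, 256)
</code></pre>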

<p>The reason why we see this behavior is a combination of three factors:</p>

<ul>
<li>VASP calculations are limited by the available memory bandwidth, not the number of FLOPS.</li>
<li>The effective memory bandwidth per core has decreased with the &#8220;Sandy Bridge&#8221; processor architecture, since each FPU can potentially do twice as many FLOPS per cycle.</li>
<li>Adding more cores creates overhead in the MPI communication layer.</li>
</ul>


<p>So 8-12 cores/node is enough to max out the memory bandwidth in most scenarios. And since the overhead associated with using many MPI ranks increases nonlinearly with the number of ranks, there should logically be a crossover point where running on fewer cores/node gives you better parallel performance. My studies of big NiSi supercells (504-1200 atoms) suggest that this happens around 32 nodes. For calculations with hybrid functionals, it happens earlier, around 8 nodes. I plan to make further investigations to find out if this applies to all types of VASP jobs.</p>
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Triolith visualized]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/05/29/triolithvisualized/"/>
    <updated>2012-05-29T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/05/29/triolithvisualized</id>
    <content type="html"><![CDATA[<p>Everyone is anxiously waiting for delivery of our new clusters: Triolith (for SNIC), Krypton (for SMHI), and Skywalker (for SAAB). Triolith will be the new capability cluster for academic users, which we hope will be the fastest supercomputer in Sweden once it is fully online. Yesterday, the smallest system for SAAB arrived. Unfortunately, Krypton (for SMHI) and Triolith are delayed and will arrive later.</p>

<p>In pictures, this is how Triolith relates to Neolith, the system it will replace.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/neotriocores.png" alt="Number of cores in Triolith vs Neolith" /></p>

<p>Each dot in this picture is a processor core. Triolith will have 1200 compute nodes with 19200 cores &#8211; compare this to the gray area corresponding to Neolith (6400 cores). However, this picture does not take into account the true performance improvement, because each core/compute node is also much faster. Taking this into account, the difference in compute power when running a mix of big VASP jobs is 9.6x per node which in total equals 14.4x improved throughput for the whole cluster:</p>

<center>
<img src="http://www.nsc.liu.se/~pla/images/neotriovasp.png" alt="Compute power Triolith vs Neolith" />
</center>


<p>Other codes might not see as big improvements, but we expect at least a factor of 3x on a per node basis, by combining general improvements in IPC, AVX vector instructions, and better memory bandwidth.</p>
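
<p>The whole-cluster figure for VASP follows from the per-node speed-up and the ratio of node counts. Here is a minimal Python sketch of that calculation, assuming 8 cores per Neolith node (this number is not stated in this post, so treat it as an assumption); the function name is hypothetical.</p>

<pre><code>def cluster_speedup(per_node_speedup, new_nodes, old_nodes):
    """Scale a per-node speed-up by the change in the number of nodes."""
    return per_node_speedup * new_nodes / old_nodes

NEOLITH_CORES = 6400
NEOLITH_CORES_PER_NODE = 8  # assumption, not stated in this post
neolith_nodes = NEOLITH_CORES // NEOLITH_CORES_PER_NODE  # 800 nodes

total = cluster_speedup(9.6, new_nodes=1200, old_nodes=neolith_nodes)
print(f"{total:.1f}x")  # 14.4x
</code></pre>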
]]></content>
  </entry>
  
  <entry>
    <title type="html"><![CDATA[Understanding the limits of VASP parallel scaling]]></title>
    <link href="http://www.nsc.liu.se/~pla/blog/2012/05/14/subroutines/"/>
    <updated>2012-05-14T00:00:00+02:00</updated>
    <id>http://www.nsc.liu.se/~pla/blog/2012/05/14/subroutines</id>
    <content type="html"><![CDATA[<p>In a previous post, <a href="http://www.nsc.liu.se/~pla/blog/scalinglindgren">Running big on Lindgren</a>, we saw how VASP could be run with up to 2000 cores if a sufficiently big supercell was used. But where does this limit come from?</p>

<p>What happens is that the fraction of time that a program spends waiting for network communication grows as you increase the number of MPI ranks. When there are too many ranks, there is no gain anymore, and adding more cores only creates communication overhead. This is where our calculation &#8220;stops scaling&#8221; in everyday speak. In many cases, performance might even go down when you add more cores.</p>
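
<p>A toy model can illustrate the shape of this behavior. The Python sketch below assumes that the compute work divides perfectly over the ranks while the communication cost grows linearly with the number of ranks. Both the functional form and the parameter values (<code>work</code>, <code>comm_cost</code>) are invented for illustration and are not fitted to any VASP measurement.</p>

<pre><code>def runtime(ranks, work=1000.0, comm_cost=0.01):
    """Toy model: divisible compute work plus overhead that grows with ranks."""
    return work / ranks + comm_cost * ranks

rank_counts = [24, 48, 96, 192, 384, 768, 1536]
for ranks in rank_counts:
    print(f"{ranks:5d} ranks: {runtime(ranks):6.2f} time units")

# The fastest run is not the widest one.
fastest = min(rank_counts, key=runtime)
print("fastest run:", fastest, "ranks")
</code></pre>

<p>In this made-up example, the runtime bottoms out at 384 ranks and gets worse beyond that, which is the &#8220;stops scaling&#8221; point described above.</p>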

<p>We can see this effect by measuring the amount of time spent in MPI calls. One tool that can do this is the <a href="http://mpip.sourceforge.net/">mpiP library</a>. mpiP is a simple library that you need to link to your program. It will then intercept any MPI calls from your program and collect statistics. When increasing the number of ranks in a VASP calculation, it can look like this:</p>

<p><img src="http://www.nsc.liu.se/~pla/images/Li128mpip.png" alt="MPI communications overhead in VASP" /></p>

<p>(The graph above was generated from runtimes on the Matter cluster at NSC, but the picture should be the same on Lindgren.) It is clear that this problem cannot be subdivided to an arbitrary degree. With 256 or more cores, the individual chunk of work that each core has to do is so small that it takes longer to send the results to other cores than to actually calculate them.</p>

<p>Another way to look at it is to pinpoint where in the program the bad scaling arises. It is often a performance-critical serial routine, or an MPI-parallelized subroutine that is heavily dependent on network performance. Conveniently, VASP has built-in timing routines that measure the contributions of several important subroutines in the program. As far as I know, the timings are accurate.</p>

<p><img src="http://www.nsc.liu.se/~pla/images/subroutinebytime.png" alt="Timings of subroutines in VASP" /></p>

<p>This graph shows the share of the total CPU time spent in each subroutine. &#8220;24c&#8221; means 24 CPU cores, or one compute node in Lindgren. The maximum number of 1008 cores is equal to 42 compute nodes.</p>

<p>To understand this data, first note that each column is normalized to 100% runtime, because we are not interested in the runtime per se, but rather what happens to individual parts when we run a wide parallel job. Ideally, all the bars should be level. This is because if there is no (or constant) communication overhead, a certain subroutine will always require the same amount of aggregated compute time, because the actual compute work (without communication) is the same regardless of how many cores we are running on. Instead, we see that some subroutines (EDDIAG/ORTHCH) increase their share of time as we run on more ranks. These are the ones that exhibit poor parallel scaling.</p>
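
<p>The normalization step itself is simple. Here is a minimal Python sketch of it, using made-up timings (the numbers below are hypothetical and are not taken from the graph):</p>

<pre><code>def normalize(timings):
    """Convert absolute subroutine timings into percentage shares of the total."""
    total = sum(timings.values())
    return {name: 100.0 * value / total for name, value in timings.items()}

# Hypothetical timings in arbitrary units, NOT measured values.
runs = {
    "24c": {"EDDIAG": 10.0, "ORTHCH": 5.0, "other": 85.0},
    "1008c": {"EDDIAG": 3.0, "ORTHCH": 1.0, "other": 4.0},
}

for label, timings in runs.items():
    shares = normalize(timings)
    summary = ", ".join(f"{name} {share:.1f}%" for name, share in shares.items())
    print(f"{label}: {summary}")
</code></pre>

<p>In this invented example, the EDDIAG share grows from 10% to 37.5% between the narrow and the wide run, which is the kind of signature to look for in the real graph.</p>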

<p>What do EDDIAG and ORTHCH do? They are involved in the so-called &#8220;sub-space&#8221; diagonalization step. ORTHCH does the LU decomposition and EDDIAG diagonalizes the Hamiltonian in a space spanned by the trial Kohn-Sham orbitals. This procedure is necessary because the RMM-DIIS algorithm in VASP does not give the exact orbitals, but rather a linear combination of the eigenfunctions with the lowest energy. The problem is that this procedure is usually trivial and fast for a small system (the matrix size is NBANDS x NBANDS), but it becomes a bottleneck for large cells.</p>

<p>There is, however, a way to parallelize this operation with SCALAPACK, which VASP also employs, but it is well-known that the scalability of SCALAPACK in this sense is limited. It is also mentioned in the VASP manual in several places <a href="http://cms.mpi.univie.ac.at/vasp/vasp/Performance_parallel_code_on_various_machines.html">(link)</a>. We can see this effect in the graph, where the yellow fraction stays relatively flat up to 192 cores, but then starts to eat up computational time. So, for now, the key to running VASP in parallel on > 1000 cores is to have many bands. A more long-term solution would be to find an alternative to using SCALAPACK for the sub-space diagonalization.</p>
]]></content>
  </entry>
  
</feed>
