Running VASP on Triolith

The test pilot phase of our new Triolith system has now started, and our early users are on the system compiling and running codes. The hardware has been surprisingly stable so far, but we still have a lot to do in terms of software. Don’t expect all software presently found on Matter and Kappa to be available immediately, because we have to recompile everything for the new Xeon E5 processors.

Regarding materials science codes, I have put up preliminary versions of VASP, based both on the original source and on our collection of SNIC patches. I am also working on putting together a good compilation of Quantum Espresso. We are seeing the expected performance gains, but it will remain a formidable challenge to make many codes scale properly to 16 cores per node and hundreds of compute nodes.

These are my quick recommendations for VASP based on initial testing:

Nodes    NPAR    Cores/node
1        2       16
2        2       16
4        2       16
8        4       8
16       8       8
32       16      8
64-128   32      8

(Wider jobs remain to be tested…)

NPAR, NSIM, and LPLANE

It looks like the same rules for NPAR apply as on our previous systems. The quick and easy rule of NPAR = number of compute nodes can be used, but you should see a slight improvement by decreasing NPAR somewhat from this value. For NSIM, however, there is a difference compared to our previous systems: you should set NSIM = 1, which gains a few percent extra speed, especially for smaller jobs (1-4 nodes). Finally, I looked at the LPLANE tag, but saw no detectable performance increase from setting LPLANE=.TRUE., presumably because the bandwidth of the FDR Infiniband network is more than sufficient to support the FFT operations that VASP does.
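As a concrete illustration, the relevant INCAR lines for, say, a 16-node job would look something like this, following the table above (the rest of the INCAR is job-specific and omitted):

NPAR = 8    ! parallelization over bands; value taken from the table for 16 nodes
NSIM = 1    ! gave a few percent extra speed in my tests, mostly for smaller jobs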

Number of cores per node

With Neolith, Kappa, and Matter, it was always advantageous to run with 8 MPI ranks on each node, so that you would use all available cores. On Triolith, however, going from 8 to 16 cores per node gives you very little extra performance. On a single compute node, moving from 8 to 16 cores gives +30-50%, but the gain drops to around 10% on 4 nodes, and to nothing when running on more than 8 nodes. For really wide jobs (more than 16 nodes), performance might even increase when reducing the number of cores used per node from 16 to 8. To test this way of running, you can use the “--nranks” flag when launching VASP with “mpprun”, or simply ask the queue system for 8 tasks per node, like this:

#SBATCH -N 32
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
...
mpprun /software/apps/vasp/5.2.12.1/default/vasp-gamma

Note that we have asked for 32 compute nodes (meaning 32*16 = 512 cores), but we are actually running on only 256 cores, spread out over all 32 nodes, because the queue system automatically places 8 MPI ranks on each node.
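For a 32-node job like this, the table above would pair the job script with NPAR = 16 in the INCAR (this is simply the table entry restated, not an independent benchmark):

NPAR = 16    ! 32 nodes at 8 cores/node, per the recommendations above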

The reason why we see these diminishing returns from using more cores per node is a combination of three factors:

  • VASP calculations are limited by the available memory bandwidth, not the number of FLOPS.
  • The effective memory bandwidth per core has decreased with the “Sandy Bridge” processor architecture, since each FPU can potentially do twice as many FLOPS per cycle.
  • Adding more cores creates overhead in the MPI communication layer.

So 8-12 cores per node is enough to max out the memory bandwidth in most scenarios. And since the overhead associated with using many MPI ranks increases nonlinearly with the number of ranks, there should logically be a crossover point where running on fewer cores per node gives you better parallel performance. My studies of big NiSi supercells (504-1200 atoms) suggest that this happens at around 32 nodes. For calculations with hybrid functionals, it happens earlier, at around 8 nodes. I plan to investigate further to find out whether this applies to all types of VASP jobs.