A frequent question I encounter supporting VASP users is: *“I have a cell with X atoms and Y electrons. How many compute nodes (or cores) should I choose for my simulation?”*

It is an important question, because using too many cores is inefficient, and the result is fewer jobs completed within a given compute-time allocation.

Currently, there is only MPI parallelization in VASP, so by "cores", I mean the number of MPI ranks or processes, i.e. the number you give to the `mpirun` command using the `-n` command line flag, or the number of cores you request in the queue system.

Besides the suggestion of actually testing it out and finding a good number of cores, **the main rule of thumb** that I have been telling people is:

```
number of cores = number of atoms
```

This is almost always safe, and will not waste computer time. Typically, it will ensure a parallel efficiency of at least 80%. This is of course a very unscientific and handwavy rule, but it has a certain pedagogical elegance, because it is easy to remember, and you don’t need to look up any other technical parameters.

Let’s now look into how you could make **a more accurate estimate**. VASP has three levels of parallelization: over k-points, over bands, and over plane-wave coefficients (or equivalently Fast-Fourier transforms). You need to ensure that when the work is split up over several compute nodes, there is a sufficient amount of work allocated to each processor core, otherwise, they will just spend time waiting for more work to arrive. The fundamental numbers to be aware of are therefore:

- The **number of k-points**
- The **number of bands** (determined indirectly by the number of atoms and electrons)
- The **size of the basis set** (i.e. the number of plane waves, which corresponds to the number of grid points in the FFTs)

If you can estimate or derive these numbers for your calculation, you can make a more precise guess at a suitable number of cores to use.

## Bands and cores

The first step is to consider **the number of bands** (`NBANDS`). VASP has parallelization over bands (controlled by the `NPAR` tag). The ultimate limit is 1 band per core. So, for example, if you have 100 bands, you cannot run on more than 100 cores and expect it to work well. What I have seen in my scaling tests, though, is that 1 band per core is too little work for a modern processor. You need at least 2 bands per core to reach more than 50% efficiency. **A conservative choice is 8 bands/core**, which will give you closer to 90% efficiency.

```
number of cores = NBANDS / 8
```

So how does this relate to the rule of thumb above? By applying it, you will arrive at a number of bands per core equal to the average number of valence electrons per atom in your calculation. If we assume that the typical VASP calculation has about 4-8 valence electrons per atom, this lands us in the ballpark of 4-8 bands/core, which is usually OK.

Let’s now try to apply this principle:

**Example 1**: We have a cell with 500 bands and a cluster with compute nodes having 16 cores per node. We aim for 8 bands/core, which unfortunately means 62.5 cores. It is better to have even numbers, so we increase the number of bands to 512 by setting `NBANDS=512` in the INCAR file and allocate 64 cores, or 4 compute nodes.
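To make this arithmetic repeatable, here is a minimal Python sketch of the estimate. The function name and its defaults are my own for illustration, not anything from VASP itself; it simply rounds `NBANDS / 8` up to a whole number of compute nodes:

```python
import math

def suggest_cores(nbands, cores_per_node, bands_per_core=8):
    """Round nbands/bands_per_core up to a whole number of nodes.

    Returns (cores, nodes). Illustrative helper, not a VASP tool.
    """
    nodes = math.ceil(nbands / (bands_per_core * cores_per_node))
    return nodes * cores_per_node, nodes

# Example 1: 500 bands on a cluster with 16 cores per node
cores, nodes = suggest_cores(500, 16)
print(cores, nodes)  # 64 4
```

Note that rounding up to whole nodes is what pushes 62.5 cores to 64; you would then bump `NBANDS` to 512 as in the example so the bands divide evenly.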

**Example 2**: Suppose that you want to speed up the calculation in the previous example. You need the results fast, and care less about efficiency in terms of the number of core hours spent. You could drop down to 1 band/core (512 cores), but there is really not that much improvement compared to 2 bands/core (256 cores). So it seems like 256 cores is the maximum number possible. But what you can do is take these 256 MPI processes and spread them out over more compute nodes. This improves the memory bandwidth available to each MPI process, which usually speeds things up. So you can try running on 32 nodes, but using 8 cores/node instead. It could be faster, if the extra communication overhead is not too large.

## K-points and KPAR

The next step is to consider **the number of k-points**. VASP can treat each k-point independently. The number of k-point groups that run in parallel is controlled by the `KPAR` parameter. The upper limit of `KPAR` is obviously the number of k-points in your calculation. In theory, the maximum number of cores you can run on using combined k-point and band parallelization is NBANDS * KPAR. So, for example, 500 bands and 10 k-points would allow up to 5000 cores, in principle. In practice, though, k-point parallelization does not scale that well. What I have found on the Triolith and Beskow systems is that supplying `KPAR=compute nodes` usually allows you to run on **twice as many cores** as you determined in the previous step, *regardless of the actual value of KPAR*. I would not recommend attempting to run with `KPAR>compute nodes`, even though you may have more k-points than compute nodes.

(Note: a side effect of this is that the most effective number of bands/core when using k-point parallelization is higher than without it. This is likely due to the combined overhead of using two parallelization methods.)

**Example 3:** Consider the 500-band cell above. 64 cores was a good choice when using just band parallelization. But you also have 8 k-points. So set KPAR to 8 and double the number of cores to 128 cores (or 8 compute nodes). In this case, we end up with 1 k-point per node, which is a very balanced setup. Note that this may increase the required memory per compute node, as k-point parallelization replicates a lot of data inside each k-point group. If you run out of memory, the next step would be to lower KPAR to 2 or 4.
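The doubling rule from this step can be bolted onto the earlier estimate. Again, this is a sketch with made-up function names, encoding my observations (double the node count when there is more than one k-point, and cap `KPAR` at the number of nodes):

```python
import math

def suggest_cores_kpar(nbands, nkpoints, cores_per_node, bands_per_core=8):
    """Band-parallel node count, doubled when k-point parallelization
    applies; KPAR is capped at the node count. Illustrative sketch only."""
    nodes = math.ceil(nbands / (bands_per_core * cores_per_node))
    if nkpoints > 1:
        nodes *= 2
    kpar = min(nkpoints, nodes)
    return nodes * cores_per_node, nodes, kpar

# Example 3: 512 bands, 8 k-points, 16 cores/node
print(suggest_cores_kpar(512, 8, 16))  # (128, 8, 8)
```

With 512 bands, 8 k-points, and 16 cores/node this reproduces the example above: 128 cores on 8 nodes with `KPAR=8`, i.e. one k-point group per node.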

## Basis set size, LPLANE and NGZ

As a last step, it might be worth considering what the load balancing of your FFTs will look like. This is covered in section 8.1 of the VASP manual. By default, VASP works with the FFTs in a plane-wise manner (meaning `LPLANE=.TRUE.`), which reduces the amount of communication needed between MPI ranks. In general, you want to use this feature, as it is typically faster. The 3D FFTs are split up into 2D planes, where each group (as determined by NPAR) works on a number of planes. This means that, ideally, you want `NGZ` (the number of grid points in the Z direction) to be evenly divisible by `NPAR`, as that will ensure good load balance:

```
NGZ=n*NPAR
```

The second thing to consider, according to the manual, is that `NGZ` should be sufficiently big for the LPLANE approach to work:

```
NGZ ≥ 3*(number of cores)/NPAR = 3*NCORE
```

Since `NCORE` will be of the same magnitude as the number of cores per compute node, this means that `NGZ` should be at least 24-96, depending on the node configuration. More concretely, for the following clusters, you should check that the conditions below hold:

```
NSC Kappa/Matter: NGZ ≥ 24
NSC Triolith: NGZ ≥ 48
PDC Beskow: NGZ ≥ 72 (using 24c/node)
```

Typically, this is not a big problem. As an example of what NGZ can be, consider a 64-atom supercell of GaAs (11×11×11 Å³) with a cut-off of 313 eV. The small FFT grid is then 70x70x70 points. So that is approximately the smallest cell that you can run on many nodes without suffering from excessive load imbalance on Beskow. For bigger cells, with more than 100 atoms, NGZ is usually also larger than 100, so there will be no problem in this regard, as long as you stick to the rule of using `NPAR=compute nodes` or `NCORE=cores/node`. But you should still check that NGZ is an even number and not, for example, a prime number.

In order to tune NGZ, you have two choices: either adjust ENCUT to a more appropriate number and let VASP recalculate the values of NGX, NGY, and NGZ, or stop specifying the basis set size in terms of an energy cut-off and set the NG{X,Y,Z} parameters yourself directly in the INCAR file instead. For a very small system, with NGZ falling below the threshold above, you can also consider lowering the number of cores per node and adjusting NCORE accordingly. For example, on Triolith, using 12 cores/node and NCORE=12 would lower the threshold for NGZ to 36, which enables you to run a small system over many nodes.
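The NGZ conditions discussed above are easy to check with a few lines of Python. This is a sketch with a hypothetical function name; the threshold follows the `NGZ >= 3*NCORE` rule from the manual, assuming `NCORE` equals the number of cores per node:

```python
def ngz_warnings(ngz, ncore):
    """Flag NGZ values that violate the guidelines discussed above.

    ncore is assumed to equal cores per compute node. Illustrative only.
    """
    warnings = []
    if ngz < 3 * ncore:
        warnings.append(f"NGZ={ngz} < 3*NCORE={3 * ncore}: too small for LPLANE")
    if ngz % 2 != 0:
        warnings.append(f"NGZ={ngz} is odd; prefer an even value")
    return warnings

# The 64-atom GaAs cell (NGZ=70) on a 24-core Beskow node: just below 72
print(ngz_warnings(70, 24))
# Underpopulating to 12 cores/node drops the threshold to 36
print(ngz_warnings(70, 12))  # []
```

As the second call shows, lowering the cores per node (and NCORE) is what brings a small grid back above the threshold.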

## Summary

- Check the number of bands (`NBANDS`). The number of bands divided by 8 is a good starting guess for the number of cores to employ in your calculation.
- If you have more than one k-point, set `KPAR` to the number of compute nodes or the number of k-points, whichever is smaller. Then double the number of cores determined in the previous step.
- Make a test run and check the value of `NGZ`. It should be an even number and sufficiently big (larger than 3*cores/node). Otherwise, adjust either the basis set size or the number of cores/node.