Berzelius - Getting Started Guide

Introduction

Berzelius is an AI/ML focused compute cluster permitting scale-out compute jobs aggregating the computational power of up to 480 NVIDIA A100 GPUs. The interconnect fabric allows RDMA, non-blocking connection between all of these GPUs with a bandwidth of 200 GB/s and µs order latencies between any two endpoints. This makes several hundred (AI) petaflops available to individual jobs for certain workloads. The resource is available to Swedish academic researchers as described in Project applications and Resource Allocations on Berzelius.

Compute Cluster Basics

At its core the Berzelius SuperPOD is a compute cluster running the Linux operating system, specifically Red Hat Enterprise Linux 8 (RHEL8). As such, most examples in the Berzelius documentation at NSC use the command line interface (CLI) since CLI instructions can be copied verbatim by users and examples are easy to follow with less room for mistakes. The CLI is also an extremely powerful tool enabling high productivity for users, and is an inescapable part of any HPC environment. Note that there are some differences between a typical desktop Linux and HPC environment, e.g. you can't use sudo to install things.

What is a Compute Cluster

Basically, a compute cluster enables parallel computations spanning interconnected compute units (nodes), i.e. compute servers and a messaging interconnect, by providing the means for the compute units' work to be orchestrated via some software framework(s) supported by the cluster.

The premise for this to work is that your software has already been adapted to one of these frameworks, i.e. that a parallel solution to the computational problem studied has been formulated using the framework. Some common parallel frameworks are MPI, NCCL and Apache Hadoop. On Berzelius, MPI and NCCL are supported, as these are the dominant parallel frameworks for the architecture and fit the purpose of the cluster. There are currently no plans to support other frameworks.

Within a single compute server there is also a different parallel paradigm available on modern clusters, as the many CPU cores (and GPUs, as on Berzelius) are usable via POSIX or OpenMP threads. Because the compute servers of Berzelius are very high-end, a very large category of problems will never need to scale out beyond a single compute node and can make effective use of this level of parallelism without the additional complexities of multi-node parallelism.

On a higher level, there are specialised nodes of a cluster serving different roles in it. There are:

Login (or access) nodes

Typically a few nodes accessible from the internet. This is where you end up when accessing the cluster via SSH or ThinLinc for instance, and where you transfer files to. These nodes are not meant for heavy computations as they are shared by all users accessing the cluster, but are used to request compute resources for compute intensive jobs (and to build code, edit scripts etc). On Berzelius these nodes are called berzelius1.nsc.liu.se and berzelius2.nsc.liu.se.

System (or master) nodes

This is the nexus of control on the cluster and is not user accessible. Typically these are one or two servers (in an HA arrangement). Compute resource requests made by users on the login or compute nodes are relayed here and are handled by the resource manager, on Berzelius SLURM, which allocates compute resources to jobs and prioritizes resource requests in a queue whenever there are more jobs lined up than there are free resources available.

Compute nodes

These nodes constitute the bulk of the cluster, and are where compute jobs requested from the resource manager are carried out. On Berzelius these nodes are named node001 to node060.

In addition, it is very common for compute clusters to have a shared storage file system served by a separate specialized cluster of storage servers. Berzelius' file systems /proj and /home are served by such a storage cluster.

From a cluster functionality perspective, the prototypical workflow using Berzelius thus consists of

  1. Transferring your datasets to the cluster
  2. Accessing the cluster login nodes via SSH/ThinLinc
  3. Requesting compute resources to
    • Perform exploratory, fast feedback type work interactively and/or
    • Execute scripted batch jobs for longer/heavier calculations
  4. Monitoring/supervising the progress of batch jobs (optional)

General Information

Berzelius is currently to a large degree a work in progress in terms of user environment and feature maturity. Some general gotchas identified in the period leading up to Berzelius' general availability to be aware of are

  • SLURM resource allocation.
    • There are many conflicting sbatch/salloc directives, in particular when it comes to GPUs. The examples provided here are known to work and are adapted to what we think will fit the most common use cases, but if your needs deviate substantially from these and you can't make it work on your own, please contact support (see below).
    • The currently installed version of SLURM does not properly set the CUDA_VISIBLE_DEVICES variable in parallel jobs launched with srun, including the NSC provided interactive tool. This is fixed in a later SLURM release which we will roll out as soon as it is available. If the CUDA_VISIBLE_DEVICES variable is needed, try setting it manually and exporting it to any srun-launched processes using the switch --export=ALL,CUDA_VISIBLE_DEVICES. Alternatively, try launching your job with mpirun/mpiexec, in which case you will need to load a buildenv module to make the command available.
  • CUDA driver version. The current CUDA driver on the compute nodes is from the CUDA 11.2 release, and software requiring a CUDA 11.3 (or later) driver and runtime will not work on Berzelius until the driver is updated. Containers from the 21.04+ series of NVIDIA NGC are based on CUDA 11.3 and are not expected to work. Use earlier NGC containers; for reference see https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html.
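
A minimal sketch of the CUDA_VISIBLE_DEVICES workaround mentioned in the list above. The helper gpu_list is hypothetical and not an NSC tool, and the sketch assumes that the GPUs visible inside your job are numbered from 0; please verify this with nvidia-smi before relying on it.

```shell
# Hypothetical helper: print "0,1,...,N-1" for N allocated GPUs.
# Assumption: the GPUs visible inside the job are numbered from 0.
gpu_list() {
    n=$1
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    echo "$list"
}

# Set the variable manually for a 2 GPU allocation ...
export CUDA_VISIBLE_DEVICES="$(gpu_list 2)"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# ... and pass it on to processes launched with srun inside the job:
#   srun --export=ALL,CUDA_VISIBLE_DEVICES <your_executable>
```

The srun line is left as a comment since it only makes sense inside a SLURM allocation on a compute node.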

Support

Mail any support issues to berzelius-support@nsc.liu.se or use the interface available in SUPR. Please report the problems and obstacles you face when you encounter them, and provide as much detail on the issue as you can so we may reproduce and fix the issues. A very important piece of information here is the SLURM JobID of any job having a problem, as this will allow us to track which nodes have been used and when.

The support mail address is also the interface to make feature requests to add to Berzelius, and we also have the possibility to bring in the Berzelius vendor Atos or NVIDIA, should there be issues where extra support is needed.

Quick start guide

Assuming you have received your Berzelius account credentials, this is the process for logging in to a Berzelius login node and allocating compute resources from SLURM for interactive or batch work:

  1. SSH to berzelius1.nsc.liu.se (substitute x_abcde for your real user name)

    [you@your_computer ~]$ ssh x_abcde@berzelius1.nsc.liu.se
    [x_abcde@berzelius001 ~]$
  2. From the login node (berzelius1), allocate resources for interactive or batch work on a compute node. The default resource allocation per GPU is 1 task, 16 cores per task, 125 GB RAM per task and a wall time of 2h. Override the defaults as you see fit.

Resource Allocation Examples

Example 1: Interactive work using a single GPU and the default number of tasks, cores per task, RAM and wall time

[x_abcde@berzelius001 ~]$ interactive --gpus=1
[x_abcde@node001 ~]$ nvidia-smi # Check that the GPU resources are allocated
Mon Jun  7 15:51:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                                
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Example 2: Interactive work on 2 GPUs, with default resources per GPU for 30 minutes

[x_abcde@berzelius001 ~]$ interactive --gpus=2 -t 30
[x_abcde@node001 ~]$ 

Example 3: Interactive work on 1 DGX-A100 with all its resources (GPU, CPU and memory) exclusively for 6h

[x_abcde@berzelius001 ~]$ interactive -N 1 --exclusive -t 6:00:00
[x_abcde@node001 ~]$ 

Example 4: Submitting a batch script using four GPUs for 3 days

[x_abcde@berzelius001 ~]$ cat << 'EOF' > my_example_batch_script.sh
#!/bin/bash
#SBATCH --gpus 4
#SBATCH -t 3-00:00:00
#SBATCH -A <your-project-account>

# The '-A' SBATCH switch above is only necessary if you are member of several
# projects on Berzelius, and can otherwise be left out.

# Singularity images can only be used outside /home. In this example the
# image is located here
cd /proj/my_project_storage/users/$(id -un)

# Execute my Singularity image binding in the current working directory
# containing the Python script I want to execute
singularity exec --nv -B $(pwd) Example_Image.sif python train_my_DNN.py

EOF
[x_abcde@berzelius001 ~]$ sbatch my_example_batch_script.sh
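
Since the SLURM JobID is the most important detail to include in support requests, it can be worth capturing it when submitting. The following is a minimal sketch; the helper job_id_from_sbatch is hypothetical, and a made-up sample line stands in for the real sbatch output so that the snippet can be tried anywhere.

```shell
# Extract the job ID from the line sbatch prints on success,
# "Submitted batch job <jobid>".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# On the login node you would pipe the real sbatch output through it:
#   jobid=$(sbatch my_example_batch_script.sh | job_id_from_sbatch)
#   squeue -j "$jobid"     # check the state of the job in the queue
#   scancel "$jobid"       # cancel the job if something is wrong
# Here a sample line (made-up job ID) stands in for sbatch:
jobid=$(echo "Submitted batch job 123456" | job_id_from_sbatch)
echo "JobID to quote in support requests: $jobid"
```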

Access and Data transfer to Berzelius

Berzelius uses SSH for (CLI) access and the VNC solution ThinLinc for remote desktop access (including terminal CLI access). Access Berzelius via the host name berzelius1.nsc.liu.se using SSH or the ThinLinc client with your user name provided by SUPR. Example:

[you@your_computer ~]$ ssh x_abcde@berzelius1.nsc.liu.se

For data transfers to Berzelius, please use scp or rsync. Other file transfer tools (e.g. FileZilla, WinSCP) that use the SCP or SFTP protocols should work as well, but have not been tested. Examples:

[you@your_computer ~]$ # SCP an archive file to Berzelius
[you@your_computer ~]$ scp my_dataset.tar \
    x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde
[you@your_computer ~]$ # rsync an unpacked directory containing a dataset
[you@your_computer ~]$ rsync -av my_unpacked_dataset \
    x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde/

Always upload large datasets to your /proj directory and never to /home, since the /home quota is only 20 GB, see below.

Usability Features of Berzelius

For AI/ML work on Berzelius, where highly complex production environments and high user customizability are more or less required, NSC strongly recommends using a container environment, where Singularity or ENROOT are the supported options. Docker is not supported for security reasons.

Using a container environment will allow a highly portable workflow and reproducible results between systems as diverse as a laptop, Berzelius or EuroHPC resources such as LUMI for instance. It will also bring a level of familiarity of use where a user is free to choose their operating system independently of the host environment.

There are, however, many options other than containers for managing your compute environment on Berzelius. The most prominent ones are listed below.

Modules

All software external to the RHEL OS is installed under the /software directory, and is made conveniently available via the module system. Handle your environment by loading modules, e.g. module load Anaconda/2021.05-nsc1, instead of manually adding paths and similar in your ~/.bashrc file. Check module availability with module avail.

Only basic modules are available currently, but the selection will be expanded as time progresses. You are very welcome to make feature requests via support, so that we can customize the Berzelius user environment to be most effective for you.

Singularity

Using Singularity on Berzelius should not differ appreciably from using it in other settings, except that you will not have root access on Berzelius. Whenever root privileges are required for your use of Singularity, build or adapt your images in advance where you have such privileges and upload the image to your /proj directory (it won't launch under /home). The canonical reference and user guide for Singularity can be found at https://sylabs.io/guides/3.8/user-guide/, which should apply directly to Singularity's use on Berzelius.

We are very keen to find out if there are any significant differences from using it in other settings. Please report such differences to support.

Points of note about Singularity on Berzelius:

  • Remember to use the --nv switch to import the GPU devices into your container. This is critical to using NVIDIA GPUs in a Singularity container.
  • Available on the compute nodes only, so you need to work in a SLURM environment on a node to use it.
  • Available on the compute nodes without loading a module.
  • Should work as you would expect it to on any platform. In particular, parallel runs launched from within a container on a single node should work out of the box.
  • Containers pulled from NVIDIA NGC are especially built and QA tested by NVIDIA to work on an NVIDIA SuperPOD like Berzelius.
  • Potentially you can launch multi-node Singularity jobs using mpirun from the build environment buildenv-gcccuda/11.2-8.3.1-bare. Otherwise, for multi-node container jobs, you may need to shift focus to the NVIDIA Enroot container solution, see https://github.com/NVIDIA/enroot, which can be used via srun (plus additional switches) on Berzelius.

Conda

Anaconda installations are available via the module system, e.g. Anaconda/2021.05-nsc1. They are special NSC installations which only make the conda command available (and don't mess up your ~/.bashrc file and login environment). Use this conda to set up your conda environments, and put these environments under the /proj file system, either by making ~/.conda a symlink to a directory under /proj/<your_project_dir>/users/$(id -un) or by using the -p switch to conda create, for example

$ conda create -p /proj/<your_project_dir>/users/$(id -un)/mycondaenv \
    python=3.8 scipy=1.5.2

Other than this, using Anaconda on Berzelius should not differ appreciably from using it in other settings.
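
The symlink alternative mentioned above could be set up along the following lines. This is only a sketch: the function link_conda_to_proj is hypothetical, and it deliberately refuses to touch an existing ~/.conda, which you would have to move to /proj yourself first.

```shell
# Hypothetical helper: make <home>/.conda a symlink to a directory under /proj.
#   $1  target directory, e.g. /proj/<your_project_dir>/users/x_abcde/.conda
#   $2  home directory (optional, defaults to $HOME)
link_conda_to_proj() {
    target=$1
    home_dir=${2:-$HOME}
    # Do not overwrite an existing directory or link
    if [ -e "$home_dir/.conda" ] || [ -L "$home_dir/.conda" ]; then
        echo "$home_dir/.conda already exists, move it aside first" >&2
        return 1
    fi
    mkdir -p "$target" && ln -s "$target" "$home_dir/.conda"
}

# Usage on Berzelius (substitute your own project directory):
#   link_conda_to_proj "/proj/<your_project_dir>/users/$(id -un)/.conda"
```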

It should be noted that the CUDA driver and runtime on the nodes (currently version 11.2) must be supported by whatever you install in your conda env. Typically, an installed driver and runtime support only environments using up to that particular version, and nothing newer.
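
The version rule above can be expressed as a small check. This is a sketch only: the function cuda_supported is hypothetical, it relies on the version sort (sort -V) of GNU coreutils, and the driver version 11.2 is simply the one quoted in this guide, so check nvidia-smi for the current value.

```shell
# Hypothetical helper: succeed if software needing CUDA version $1 can be
# expected to run on a driver from CUDA release $2 (required <= driver).
cuda_supported() {
    required=$1
    driver=$2
    # The lower of the two versions must be the required one
    lowest=$(printf '%s\n%s\n' "$required" "$driver" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

DRIVER_CUDA=11.2    # as stated in this guide at the time of writing
for required in 10.2 11.2 11.3; do
    if cuda_supported "$required" "$DRIVER_CUDA"; then
        echo "CUDA $required: expected to work"
    else
        echo "CUDA $required: not expected to work"
    fi
done
```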

System Python

Using the RHEL 8 provided Python in virtualenvs could also be a viable approach in many cases. Make sure you're working with Python 3.6. Python 3.6.8 should be the system default on Berzelius, but please verify for yourself. Also, put your virtualenvs under /proj and not /home, as we want to avoid filling your /home directory up with non-precious data and backing it up to tape.

Install packages with pip install <packagename> in your virtualenv, and build packages using a build environment, for instance 'buildenv-gcccuda/11.2-8.3.1-bare'. It should be noted that the CUDA driver and runtime on the nodes must be supported by whatever software you install in your virtualenv.

Example:

$ cd /proj/example_project/users/$(id -un)
$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ python --version
Python 3.6.8
(myenv) $ pip install --upgrade pip
... (snip) ...
(myenv) $ module load buildenv-gcccuda/11.2-8.3.1-bare
(myenv) $ pip3 install mpi4py
...

Build environment

A basic build environment, buildenv-gcccuda/11.2-8.3.1-bare, is available for those who may need to build software for the RHEL 8 environment on Berzelius. For instance, if you need to build the mpi4py Python package or CUDA dependent Python packages, have this module loaded when building.

The build environment is based on the system GCC (8.3), CUDA 11.2, OpenMPI 4.1.1, OpenBLAS 0.3.15, FFTW 3.3.9 and ScaLAPACK 2.1.0. Please report any problems using it to support.

Running Graphical Applications

Inevitably there comes a need to use graphical applications on the cluster. The NSC recommended way to run graphical applications on Berzelius is via the ThinLinc VNC solution (TL), which is leveraged to provide a remote desktop interface on Berzelius, see Running graphical applications for more information. The TL remote desktop presented is XFCE due to it being light on server resources. It will provide a much better user experience for modern GUIs than X-forwarding, although X-forwarding is not prohibited. In addition, it provides session management, allowing users to disconnect from the session while running processes are kept running, much like a GUI version of a terminal multiplexer such as GNU screen.

The ThinLinc client is available free of charge from Cendio at https://www.cendio.com/thinlinc/download, with packages for the major OS platforms (Linux, macOS and Windows). Applications of particular interest to Berzelius users which benefit from use via TL include

  • Firefox and consequently
    • Jupyter notebooks
    • Tensorboard
  • Integrated Development Environments, for instance PyCharm
  • Debuggers and Profilers like NVIDIA Nsight-systems

Submitting and Running Jobs

Berzelius is a SuperPOD compute cluster using SLURM as its resource manager. General SLURM documentation should be valid for making resource allocations on Berzelius. Check the man pages or documentation at https://slurm.schedmd.com/. Other NSC SLURM documentation is likely useful also in this context and can be found at https://www.nsc.liu.se/support/batch-jobs/, but may not apply in all parts.

The wall time limit has been set to 3 days (72h) to ensure that there is reasonable turnover of jobs. If this wall time limit is a show-stopper for you, please contact support explaining why, and we can extend the wall time limit for select jobs.

There are default allocation settings in place based on the number of GPUs allocated. Currently these are 1 task using 16 CPU cores and 125 GB RAM for every GPU allocated. This allows you to specify only the number of GPUs your job requires and still get a sensible allocation for most circumstances, e.g. --gpus=2 gets you 32 CPU cores and 250 GB RAM as well as the 2 GPUs.

In addition, the default wall time allocated for jobs is 2h. Remember to override this with -t <timestring> when needed. All of these defaults are overridden by user switches when provided.
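
The time strings used in this guide (30, 6:00:00 and 3-00:00:00) follow the SLURM formats minutes, hours:minutes:seconds and days-hours:minutes:seconds. As an illustration, the sketch below converts these three forms to minutes and compares them with the 72h limit. The function slurm_time_to_minutes is hypothetical, handles only these three forms and ignores the seconds field.

```shell
# Hypothetical helper: convert a SLURM time string to whole minutes.
# Handles "M", "H:M:S" and "D-H:M:S" only; seconds are ignored.
slurm_time_to_minutes() {
    echo "$1" | awk -F'[-:]' '
        /^[0-9]+-[0-9]+:[0-9]+:[0-9]+$/ { print $1 * 1440 + $2 * 60 + $3; next }
        /^[0-9]+:[0-9]+:[0-9]+$/        { print $1 * 60 + $2; next }
        /^[0-9]+$/                      { print $1 + 0; next }
        { print "unsupported time string: " $0 > "/dev/stderr"; exit 1 }'
}

WALL_LIMIT_MIN=$(( 72 * 60 ))    # the 3 day wall time limit, in minutes
for t in 30 6:00:00 3-00:00:00; do
    minutes=$(slurm_time_to_minutes "$t")
    if [ "$minutes" -le "$WALL_LIMIT_MIN" ]; then
        echo "-t $t = $minutes minutes (within the limit)"
    else
        echo "-t $t = $minutes minutes (exceeds the limit)"
    fi
done
```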

Submitting Jobs

Allocation of resources in general follows the SLURM documentation. One thing to be aware of is that there are very many ways to give conflicting directives when allocating resources, and these will result in either not getting the allocation or getting the wrong allocation. Before submitting a resource intensive batch job, it's worthwhile to try out the settings in an interactive session and verify that the resources are properly allocated.

In general, if you need to specify more than solely the number of GPUs your job requires, a recommended way to allocate is via the switch combination -n X --gpus=Y (plus other needed switches, like -t), where X is the number of "tasks" (CPU cores) and Y is the number of GPUs available to your X tasks. This pattern seems to work in most circumstances to get you what you expect. In some circumstances you may also need to specify -c Z, where Z is the number of hyperthreads allocated to each task (there are two hyperthreads per physical CPU core), but this should not normally be required.

When allocating a partial node, NSC's recommendation is to allocate tasks (CPU cores) in proportion to how large a fraction of the node's total GPUs you allocate, i.e. for every GPU (1/8 of a node) you allocate, you should also allocate 16 tasks (1/8 = 16/128). The default memory allocation follows that of the cores, i.e. for every CPU core allocated, 7995 MB of RAM (slightly less than 1/128th of the node's total RAM) is allocated. This is automatically taken care of when using only the switch --gpus=X as in the examples of the quick start guide.
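
The proportionality rule above amounts to simple arithmetic, sketched below using the figures quoted in this section (16 cores per GPU, 7995 MB per core). The function partial_node_allocation is hypothetical and only prints numbers; it does not talk to SLURM.

```shell
CORES_PER_GPU=16     # 128 cores / 8 GPUs per node
MB_PER_CORE=7995     # default memory per allocated CPU core

# Hypothetical helper: print "<cores> <memory in MB>" for a number of GPUs
partial_node_allocation() {
    gpus=$1
    cores=$(( gpus * CORES_PER_GPU ))
    mem_mb=$(( cores * MB_PER_CORE ))
    echo "$cores $mem_mb"
}

partial_node_allocation 1    # 16 127920
partial_node_allocation 2    # 32 255840
```

For 2 GPUs this corresponds to the switch combination -n 32 --gpus=2 when the tasks are spelled out explicitly.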

Please note that when allocating more than 8 GPUs (more than one node), you will end up on several nodes and your software setup will need multi-node capabilities to make use of all allocated GPUs.

A note about more advanced and multi-node allocations: This version of SLURM has a bug when allocating resources using the flag --gpus-per-task, which is simply not working as intended. This bug makes for instance the switch combination -n 1 --gpus-per-task=1 allocate all GPUs on a node for your job, whether you use them or not. A working switch combination which can accomplish the same intended thing as above is -n 1 --ntasks-per-gpu=1. You can use this switch combination with different number of tasks and tasks per GPU, for instance -n 32 --ntasks-per-gpu=2, which will allocate a total of two nodes (16 GPUs) with two tasks per GPU. The bug appears to have been fixed in later versions of SLURM and we will upgrade as soon as possible.
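
To see what a given -n X --ntasks-per-gpu=Y combination asks for, the arithmetic can be sketched as below. The function gpus_and_nodes is hypothetical, assumes 8 GPUs per node as on Berzelius, and rounds up when the numbers do not divide evenly, which may not be how SLURM treats such a request.

```shell
GPUS_PER_NODE=8

# Hypothetical helper: print "<gpus> <nodes>" implied by a number of tasks
# and a number of tasks per GPU (rounding up, which is an assumption).
gpus_and_nodes() {
    tasks=$1
    tasks_per_gpu=$2
    gpus=$(( (tasks + tasks_per_gpu - 1) / tasks_per_gpu ))
    nodes=$(( (gpus + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))
    echo "$gpus $nodes"
}

gpus_and_nodes 1 1     # 1 1
gpus_and_nodes 32 2    # 16 2, the two node example above
```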

Interactive Work

Interactive work in a shell on the allocated resources can be performed via the NSC provided script interactive which is a wrapper around the SLURM commands salloc and srun, and as such accepts all switches available to salloc. For reference see https://www.nsc.liu.se/support/running-applications/ under the "Interactive jobs" heading.

For specific examples of the use of interactive, see for instance the Quick start guide above.

Batch Jobs

Batch jobs are supported in the standard SLURM way. For a guide see https://www.nsc.liu.se/support/batch-jobs/.

Single node jobs

Jobs run on a single node should be straightforward in both Singularity container environments and the host OS. Make sure you've got the allocated resources and, for Singularity containers, remember to import the GPUs into the container using the --nv switch.

Multi-node jobs

Multi-node jobs for regular MPI-parallel applications should be pretty standard on the cluster, and can use the common mpirun (or mpiexec) launcher available via the buildenv-gcccuda/11.2-8.3.1-bare module or srun --mpi=X (supported X by SLURM are pmi2, pmix and pmix_v3). If your application has been built with NSC provided toolchain(s) you should also be able to launch it with mpprun in standard NSC fashion.

Multi-node jobs using GPUs can be challenging when running in a Singularity container. For NVIDIA NGC containers used with Singularity, you can possibly launch your job using the mpirun provided by loading the buildenv-gcccuda/11.2-8.3.1-bare module. For reference see https://sylabs.io/guides/3.7/user-guide/mpi.html.

Otherwise, SLURM on Berzelius has support for launching ENROOT containers directly using srun, see https://github.com/NVIDIA/enroot, which should work for at least NVIDIA NGC containers of recent date used to build your ENROOT container. ENROOT containers are similar to Singularity containers but don't require superuser privileges to build or modify them. ENROOT containers can be built on the compute nodes of the cluster but not on the login nodes.

Singularity Job Example

Singularity interactively:

2 GPUs (plus 32 CPU cores implied) for one hour

[x_abcde@berzelius1 ~]$ interactive --gpus=2 -t 60
[x_abcde@node001 ~]$ cd /proj/example_project_dir/x_abcde
[x_abcde@node001 x_abcde]$ singularity shell --nv my_singularity_image.sif
Singularity> nvidia-smi #Checking for GPUs
... (snip) ...
Singularity> mpirun <args> <your_executable> <exeargs> #Requires an image with MPI installed

Data storage

The shared storage and data transport fabric on Berzelius are very high performance, and should suffice for most IO loads on it, specifically data intensive AI/ML loads.

This is especially the case when the data sets are well formatted. Examples of such formats are TFRecords (from TensorFlow), RecordIO (from MXNet) and Petastorm (from Uber).

The use of datasets in these formats can greatly reduce IO-wait time on the GPU compared to raw file system access, and will also reduce load on the shared storage. NSC highly recommends that you store and use your data sets using some such format.

Shared storage

There are two shared storage areas set up for use: /home/$USER and /proj/<your_project_dir>/users/$USER. The /home/$USER area is backed up nightly and small (20 GB quota per user), and is only meant for data you cannot put under /proj. The standard quota for the /proj directory is 5,000 GiB and 5 million files, but this can be increased, either at the time you apply for the project or as a complementary application at a later stage.

Node Local Storage

High performance NVMe SSD node local storage is available on each compute node.

There are a few points to note with respect to the available node local storage

  • For every job, node local scratch space is mounted under /scratch/local.
  • Separate jobs can't access another job's /scratch/local when several jobs are sharing a node.
  • Each job's /scratch/local is erased between jobs. Data not saved (e.g. moved to somewhere under /proj) at the end of a job is lost, with no getting back.
  • In case you need to use it for your datasets, try to store your dataset as uncompressed tar archives, preferably split into many parts, and unpack them in parallel. This will increase your data transfer speed tremendously compared to a single process. Example:

      # 144 GB ILSVRC 2012 data set in TFRecord format split in 128 tar archives
      # unpacked with 16 parallel workers to /scratch/local. A single worker takes
      # 106s to do the same task.
      [raber@node001 ILSVRC2012]$ time ls *.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f
      real  0m16.763s
      user  0m3.192s
      sys   8m4.740s
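
To prepare a dataset for the parallel unpacking shown above, it first has to be packed into several archives. The following is one possible sketch: the function pack_in_parts and the directory names in the usage comment are hypothetical, and it assumes file names without whitespace and an output directory outside the source directory.

```shell
# Hypothetical helper: pack the files in a source directory into a given
# number of uncompressed tar archives, distributing the files round-robin.
#   $1 source directory, $2 output directory, $3 number of archives
pack_in_parts() {
    src_dir=$1
    parts=$3
    mkdir -p "$2" || return 1
    out_dir=$(cd "$2" && pwd)    # absolute path, since we cd below

    # Write one file list per archive
    ls "$src_dir" | awk -v n="$parts" -v d="$out_dir" \
        '{ f = sprintf("%s/part_%03d.list", d, (NR - 1) % n); print > f }'

    # Create one archive per list, with member names relative to src_dir
    for list in "$out_dir"/part_*.list; do
        ( cd "$src_dir" && tar -cf "${list%.list}.tar" -T "$list" ) || return 1
        rm "$list"
    done
}

# Usage (directory names are placeholders):
#   pack_in_parts my_tfrecords_dir my_dataset_parts 128
#   ls my_dataset_parts/*.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f
```

Round-robin distribution keeps the archives roughly equal in size when the files are of similar size, which helps the parallel workers finish at about the same time.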

Quotas

Quotas and your current use of them can be checked with the command nscquota. Complementary requests for increases in storage allocation can be made in SUPR if you find that you need them. If in doubt on how to do this, please contact support.

