Berzelius is an AI/ML focused compute cluster permitting scale-out compute jobs aggregating the computational power of up to 480 NVIDIA A100 GPUs. The interconnect fabric provides RDMA-capable, non-blocking connections between all of these GPUs with a bandwidth of 200 GB/s and latencies on the order of microseconds between any two endpoints. This makes several hundred (AI) petaflops available to individual jobs for certain workloads. The resource is available to Swedish academic researchers as described in Project applications and Resource Allocations on Berzelius.
At its core the Berzelius SuperPOD is a compute cluster running the Linux operating system, specifically Red Hat Enterprise Linux 8 (RHEL8). As such, most examples in the Berzelius documentation at NSC use the command line interface (CLI), since CLI instructions can be copied verbatim by users and examples are easy to follow with less room for mistakes. The CLI is also an extremely powerful tool enabling high productivity for users, and is an inescapable part of any HPC environment. Note that there are some differences between a typical desktop Linux environment and an HPC environment, e.g. you cannot use sudo to install things.
Basically, a compute cluster enables parallel computations spanning interconnected compute units (nodes), i.e. compute servers and a messaging interconnect, by providing the means for the compute units' work to be orchestrated via some software framework(s) supported by the cluster.
The premise for this to work is that your software has already been adapted to one of these frameworks, in essence that a parallel solution to the computational problem studied has been formulated using the parallel computation framework. Some common parallel frameworks are MPI, NCCL and Apache Hadoop. On Berzelius, MPI and NCCL are supported, as these are the completely dominant parallel frameworks for the architecture and purpose of the cluster. There are currently no plans to support other frameworks.
On the level of a single compute server, a different parallel paradigm is also available on modern clusters, since there are many CPU compute cores (and GPUs, as on Berzelius) usable via POSIX or OpenMP threads. Since Berzelius has very high-end compute servers, a very large category of problems will never need to scale out beyond a single compute node and can make effective use of this level of parallelism without involving the additional complexities of multi-node parallelism.
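As a minimal sketch of this single-node thread parallelism (the executable name is a placeholder, and the fallback of 16 threads is only an assumption matching the per-GPU defaults described later):

# Run an OpenMP-threaded program on the CPU cores of the current allocation.
# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is in effect; otherwise
# fall back to 16 threads (placeholder value).
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-16}
./my_threaded_app    # placeholder for your OpenMP-threaded executable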
On a higher level, there are specialised nodes in a cluster serving different roles. There are:
Login nodes: Typically a few nodes accessible from the internet. This is where you end up when accessing the cluster via SSH or ThinLinc, for instance, and where you transfer files to. These nodes are not meant for heavy computations, as they are shared by all users accessing the cluster, but are used to request compute resources for compute intensive jobs (and to build code, edit scripts etc.). On Berzelius these nodes are called berzelius1.nsc.liu.se and berzelius2.nsc.liu.se. They can also be reached at berzelius.nsc.liu.se, from which you will be assigned to one of the two.
Management nodes: This is the nexus of control of the cluster and is not user accessible. Typically this is one or two servers (in an HA arrangement). Compute resource requests made by users on the login or compute nodes are relayed here and handled by the resource manager, on Berzelius SLURM, which allocates compute resources to jobs and prioritizes resource requests in a queue whenever more jobs are lined up than there are free resources available.
Compute nodes: These nodes constitute the bulk of the cluster, and are where compute jobs requested from the resource manager are carried out. On Berzelius these nodes are named node001 to node060.
Storage: In addition, it is very common for compute clusters to have a shared storage file system served by a separate specialised cluster of storage servers. Berzelius' file systems /proj and /home are served by such a storage cluster.
From a cluster functionality perspective, the prototypical workflow on Berzelius thus consists of logging in to a login node, transferring your data and code there, requesting compute resources from the resource manager, and running your computation on the allocated compute nodes, with input and output data on the shared storage.
Berzelius is currently to a large degree a work in progress in terms of user environment and feature maturity. Some general gotchas identified in the period leading up to Berzelius' general availability to be aware of are:

The CUDA_VISIBLE_DEVICES environment variable is currently not set in jobs launched with srun, including via the NSC provided interactive tool. This is fixed in a later SLURM release, which we will roll out as soon as it is available. If the CUDA_VISIBLE_DEVICES variable is needed, please try setting the variable manually and exporting it to any srun-launched processes using the switch --export=ALL,CUDA_VISIBLE_DEVICES, or you can try launching your job with mpirun/mpiexec, in which case you will need to load a buildenv module to make the command available.
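A sketch of this workaround inside a job (the device indices and program name are placeholders; adapt them to your allocation):

[x_abcde@node001 ~]$ export CUDA_VISIBLE_DEVICES=0,1    # e.g. for a 2-GPU allocation
[x_abcde@node001 ~]$ srun --export=ALL,CUDA_VISIBLE_DEVICES ./my_gpu_program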
Mail any support issues to berzelius-support@nsc.liu.se or use the interface available in SUPR. Please report the problems and obstacles you face when you encounter them, and provide as much detail on the issue as you can, so we may reproduce and fix it. A very important piece of information here is the SLURM JobID of any job having a problem, as this will allow us to track which nodes were used and when.
The support mail address is also the interface for making feature requests for Berzelius, and we also have the possibility of bringing in the Berzelius vendor Atos, or NVIDIA, should there be issues where extra support is needed.
Assuming you have received your Berzelius account credentials, here is the process to log in to a Berzelius login node and allocate compute resources from SLURM for interactive or batch work on the compute nodes:
SSH to berzelius1.nsc.liu.se (substitute x_abcde with your real user name):
[you@your_computer ~]$ ssh x_abcde@berzelius1.nsc.liu.se
[x_abcde@berzelius001 ~]$
From the login node (berzelius1), allocate resources for interactive or batch work on a compute node. Resource allocation defaults per GPU are: 1 task, 16 cores per task, 125 GB RAM per task, and a default wall time of 2 h. Override the defaults as you see fit.
Interactive work using a single GPU and the default number of tasks, cores per task, RAM and wall time:
[x_abcde@berzelius001 ~]$ interactive --gpus=1
[x_abcde@node001 ~]$ nvidia-smi # Check that the GPU resources are allocated
Fri Mar 3 10:30:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:BD:00.0 Off | 0 |
| N/A 29C P0 51W / 400W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The A100 GPUs have a feature, Multi-Instance GPU (MIG), for splitting the GPU into multiple independent instances. Nodes in the reservation 3g.20gb have this feature enabled, splitting the GPUs in half. All jobs in this reservation receive a fixed amount of resources and cost half as much as a normal job using one A100 GPU. If your workload fits in one of these jobs, it is an excellent use of resources.
To achieve the fixed resource allocation per job, some options are overridden during job submission. For example, you will always get 1 MIG GPU and 8 CPU cores, regardless of what you asked for.
[x_abcde@berzelius001 ~]$ interactive --reservation=3g.20gb
[x_abcde@node059 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4bc10a41-11e4-0c97-4027-190e9f305d52)
MIG 3g.20gb Device 0: (UUID: MIG-37769be6-b4f5-5a17-bd14-6ad61e7c3d62)
...
Interactive work using two GPUs and a 30 minute wall time:
[x_abcde@berzelius001 ~]$ interactive --gpus=2 -t 30
[x_abcde@node001 ~]$
Interactive work using a full node exclusively for six hours:
[x_abcde@berzelius001 ~]$ interactive -N 1 --exclusive -t 6:00:00
[x_abcde@node001 ~]$
Batch work example, creating and submitting a batch script requesting four GPUs for three days (the quoted 'EOF' prevents $(...) from being expanded when the script file is created, so it is evaluated at job runtime instead):
[x_abcde@berzelius001 ~]$ cat << 'EOF' > my_example_batch_script.sh
#!/bin/bash
#SBATCH --gpus 4
#SBATCH -t 3-00:00:00
#SBATCH -A <your-project-account>
# The '-A' SBATCH switch above is only necessary if you are a member of several
# projects on Berzelius, and can otherwise be left out.
# Singularity images can only be used outside /home. In this example the
# image is located here
cd /proj/my_project_storage/users/$(id -un)
# Execute my Singularity image binding in the current working directory
# containing the Python script I want to execute
singularity exec --nv -B $(pwd) Example_Image.sif python train_my_DNN.py
EOF
[x_abcde@berzelius001 ~]$ sbatch my_example_batch_script.sh
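After submission, you can follow your job with the standard SLURM commands, for example:

[x_abcde@berzelius001 ~]$ squeue -u $(id -un)    # list your queued and running jobs
[x_abcde@berzelius001 ~]$ scancel <jobid>        # cancel a job, if needed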
Berzelius uses SSH for (CLI) access and the VNC solution ThinLinc for remote desktop access (including terminal CLI access). Access Berzelius via the host name berzelius1.nsc.liu.se using SSH or the ThinLinc client, with the user name provided by SUPR. Example:
[you@your_computer ~]$ ssh x_abcde@berzelius1.nsc.liu.se
For data transfers to Berzelius, please use scp and rsync. Other file transfer tools (e.g. FileZilla, WinSCP) using the SCP or SFTP protocol should likely work as well, but have not been tested. Examples:
[you@your_computer ~]$ # SCP an archive file to Berzelius
[you@your_computer ~]$ scp my_dataset.tar \
x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde
[you@your_computer ~]$ # rsync an unpacked directory containing a dataset
[you@your_computer ~]$ rsync -av my_unpacked_dataset \
x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde/
Always upload large datasets to your /proj directory and never to /home, since the /home quota is only 20 GB; see below.
For AI/ML work on Berzelius, where highly complex production environments and a high degree of user customizability are more or less required, NSC strongly recommends using a container environment; Singularity and ENROOT are the supported options. Docker is not supported for security reasons.
Using a container environment allows a highly portable workflow and reproducible results between systems as diverse as a laptop, Berzelius, or EuroHPC resources such as LUMI. It also brings a level of familiarity, as a user is free to choose their operating system environment independently of the host environment.
There are, however, many options other than containers for managing your compute environment on Berzelius; the most prominent ones are listed below.
All software external to the RHEL OS is installed under the /software directory and is made conveniently available via the module system. Handle your environment by loading modules, e.g. module load Anaconda/2021.05-nsc1, instead of manually adding paths and similar in your ~/.bashrc file. Check module availability with module avail.
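A typical module workflow looks like this:

[x_abcde@berzelius001 ~]$ module avail                        # list available modules
[x_abcde@berzelius001 ~]$ module load Anaconda/2021.05-nsc1   # load a module
[x_abcde@berzelius001 ~]$ module list                         # show currently loaded modules
[x_abcde@berzelius001 ~]$ module purge                        # unload all loaded modules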
Only basic modules are available currently, but the selection will be expanded as time progresses. You are very welcome to make feature requests via support, so that we can customize the Berzelius user environment to be as effective as possible for you.
Using Singularity on Berzelius should not differ appreciably from using it in other settings, except that you will not have root access on Berzelius, i.e. the privilege escalating sudo command does not work, and singularity build features relying on user namespaces (but not formally on sudo) are not available.
Whenever root privileges (such as sudo) are required for your use of Singularity, build or adapt your images in advance on a system where you have such privileges and upload the image to your /proj directory (it won't launch under /home). The canonical reference and user guide for Singularity can be found at https://sylabs.io/guides/3.8/user-guide/, which should apply directly to Singularity's use on Berzelius.
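A sketch of that workflow, assuming you have root access on your own machine and a definition file my_image.def (the file and project names are placeholders):

[you@your_computer ~]$ sudo singularity build my_image.sif my_image.def
[you@your_computer ~]$ scp my_image.sif \
    x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde/
# Then, from a job on a Berzelius compute node:
[x_abcde@node001 ~]$ singularity exec --nv \
    /proj/<your_project_name>/users/x_abcde/my_image.sif nvidia-smi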
We are very keen to find out if there are any significant differences to using it in other settings. Please report such differences to support.
Points of note about Singularity on Berzelius:
- Use the --nv switch to import the GPU devices into your container. This is critical to using NVIDIA GPUs in a Singularity container.
- The --fakeroot and --userns switches for building Singularity images will not work, because they rely on user namespaces being enabled.
- Multi-node container jobs can possibly be launched using mpirun from the build environment buildenv-gcccuda/11.4-8.3.1-bare. Otherwise, for multi-node container jobs, you may need to shift focus to the NVIDIA Enroot container solution, see https://github.com/NVIDIA/enroot, which can be used via srun (plus additional switches) on Berzelius.

Please read the Berzelius Apptainer Guide for more details.
Enroot is a simple yet powerful tool to turn container images into unprivileged sandboxes. Enroot is targeted at HPC environments, with integration with the Slurm scheduler, but can also be used as a standalone tool to run containers as an unprivileged user. Enroot is similar to Singularity, but with the added benefit of allowing users to read/write in the container and also to appear as the root user within the container environment. Please read the Berzelius Enroot Guide for more details.
Anaconda installations are available via the module system, e.g. Anaconda/2021.05-nsc1. These are special NSC installations which only make the conda command available (and don't mess up your ~/.bashrc file and login environment).
The default location for conda environments is ~/.conda in your home directory. This location can be problematic, since these environments can become very large. It is therefore suggested to redirect this directory to your project directory using a symbolic link:
mv ~/.conda /path/to/your/proj/
ln -s /path/to/your/proj/.conda ~/.conda
Alternatively, you can use the -p switch to conda create, for example:
$ conda create -p /proj/<your_project_dir>/users/$(id -un)/mycondaenv \
python=3.8 scipy=1.5.2
Other than this, using Anaconda on Berzelius should not differ appreciably from using it in other settings.
It should be noted that whatever you install in your conda environment must be compatible with the CUDA driver and runtime on the nodes (currently version 11.4). Typically, the installed driver and runtime support only environments using CUDA up to that particular version, not anything newer.
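As an illustration only (the package selection and versions are assumptions, not a tested recipe), a PyTorch environment whose CUDA toolkit stays at or below the driver's 11.4 could be created like this:

$ conda create -p /proj/<your_project_dir>/users/$(id -un)/torch_env \
    python=3.8 pytorch torchvision cudatoolkit=11.3 -c pytorch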
Using the RHEL 8 provided Python in virtualenvs can also be a viable approach in many cases. Make sure you are working with Python 3.6; Python 3.6.8 should be the system default on Berzelius, but please verify this for yourself. Also, put your virtualenvs under /proj and not /home, as we want to avoid filling your /home directory with non-precious data and backing it up to tape.
Install packages with pip install <packagename> in your virtualenv, and build packages using a build environment, for instance buildenv-gcccuda/11.4-8.3.1-bare. It should be noted that the software you install in your virtualenv must be compatible with the CUDA driver and runtime on the nodes.
Example:
$ cd /proj/example_project/users/$(id -un)
$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ python --version
Python 3.6.8
(myenv) $ pip install --upgrade pip
... (snip) ...
(myenv) $ module load buildenv-gcccuda/11.4-8.3.1-bare
(myenv) $ pip3 install mpi4py
...
Assume that you have installed Jupyter Notebook in your conda environment. On a compute node, load the Anaconda module and activate your environment:
module load Anaconda/2021.05-nsc1
conda activate myenv_example
Start a Jupyter notebook with the --no-browser flag:
(myenv_example) [x_abcde@node021 ~]$ jupyter-notebook --no-browser --ip=node021 --port=9988
Please use the --ip flag to specify the node that you are working on.
You will see the following info printed out on your terminal:
[I 2023-02-20 12:36:50.238 LabApp] JupyterLab extension loaded from /home/x_abcde/.conda/envs/myenv_example/lib/python3.10/site-packages/jupyterlab
[I 2023-02-20 12:36:50.238 LabApp] JupyterLab application directory is /proj/proj_name/x_abcde/.conda/envs/myenv_example/share/jupyter/lab
[I 12:36:50.243 NotebookApp] Serving notebooks from local directory: /home/x_abcde
[I 12:36:50.243 NotebookApp] Jupyter Notebook 6.5.2 is running at:
[I 12:36:50.243 NotebookApp] http://node021:9988/?token=xxxx
[I 12:36:50.243 NotebookApp] or http://127.0.0.1:9988/?token=xxxx
[I 12:36:50.243 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:36:50.256 NotebookApp]
To access the notebook, open this file in a browser:
file:///home/x_abcde/.local/share/jupyter/runtime/nbserver-511678-open.html
Or copy and paste one of these URLs:
http://node021:9988/?token=xxxx
or http://127.0.0.1:9988/?token=xxxx
In a second terminal on your local computer, set up an SSH tunnel to the compute node via the login node:
ssh -N -L localhost:9988:node021:9988 x_abcde@berzelius1.nsc.liu.se
Start your favorite browser on your local computer and paste the URL given by the Jupyter notebook on Berzelius, with your own token:
http://localhost:9988/?token=xxxx
or
http://127.0.0.1:9988/?token=xxxx
Note: The port 9988 is arbitrary. If 9988 is already in use, then just try 9989, etc.
A basic build environment, buildenv-gcccuda/11.4-8.3.1-bare, is available for those who may need to build software for the RHEL 8 environment on Berzelius. For instance, if you need to build the mpi4py Python package or CUDA dependent Python packages, have this module loaded when building.
The build environment is based on the system GCC (8.3.1), CUDA 11.4, OpenMPI 4.1.1, OpenBLAS 0.3.15, FFTW 3.3.9 and ScaLAPACK 2.1.0. Please report any problems using it to support.
Inevitably, there comes a need to use graphical applications on the cluster. The NSC recommended way to run graphical applications on Berzelius is via the ThinLinc VNC solution (TL), which is leveraged to provide a remote desktop interface on Berzelius; see Running graphical applications for more information. The TL remote desktop presented is XFCE, chosen because it is light on server resources. It will provide a much better user experience for modern GUIs than X-forwarding, although X-forwarding is not prohibited. In addition, it provides session management, allowing users to disconnect from a session while running processes are kept running, like a GUI version of terminal multiplexers such as GNU screen.
The ThinLinc client is available free of charge, with packages for the major OS platforms (Linux, MacOS and Windows), from Cendio at https://www.cendio.com/thinlinc/download. Applications of particular interest to Berzelius users which benefit from use via TL include
Berzelius is a SuperPOD compute cluster using SLURM as its resource manager. General SLURM documentation should be valid for making resource allocations on Berzelius; check the man pages or the documentation at https://slurm.schedmd.com/. Other NSC SLURM documentation is likely also useful in this context and can be found at https://www.nsc.liu.se/support/batch-jobs/, but may not apply in all parts.
The wall time limit has been set to 3 days (72h) to ensure that there is reasonable turnover of jobs. You can extend the wall time of a running job using boost-tools.
There are default allocation settings in place based on the number of GPUs allocated. Currently these are 1 task using 16 CPU cores and 125 GB RAM for every GPU allocated. This allows you to specify only the number of GPUs your job requires, e.g. --gpus=2 (which gets you 32 CPU cores and 250 GB RAM as well as the 2 GPUs), and you will get a sensible allocation for most circumstances.
In addition, the default wall time allocated for jobs is 2 h. Remember to override this with -t <timestring> when needed. All of these defaults are overridden by user switches when provided.
Allocation of resources in general follows the SLURM documentation. One thing to be aware of is that there are very many ways to give conflicting directives when allocating resources, which will result in either not getting an allocation at all or getting the wrong allocation. Before submitting a resource intensive batch job, it is worthwhile to check the settings in an interactive session and verify that the resources are properly allocated.
In general, if you need to specify more than just the number of GPUs your job requires, a recommended way to allocate is via the switch combination -n X --gpus=Y (plus other needed switches, like -t), where X and Y are the number of tasks (CPU cores) and the number of GPUs available to your X tasks, respectively. This pattern seems to work in most circumstances to get you what you expect. In some circumstances you may also need to specify -c Z, where Z is the number of hyperthreads allocated to each task (there are two hyperthreads per physical CPU core), but this should not normally be required.
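For example, a sketch of such an allocation (the numbers are chosen purely for illustration):

[x_abcde@berzelius001 ~]$ interactive -n 4 --gpus=2 -t 4:00:00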
When allocating a partial node, NSC's recommendation is to allocate tasks (CPU cores) in proportion to how large a fraction of the node's total GPUs you allocate, i.e. for every GPU (1/8 of a node) you allocate, you should also allocate 16 tasks (1/8 = 16/128). The default memory allocation follows that of the cores, i.e. for every CPU core allocated, 7995 MB of RAM (a little less than 1/128th of the node's total RAM) is allocated. This is automatically taken care of when using only the switch --gpus=X, as in the examples of the quick start guide.
Jobs using the reservation with MIG GPUs (--reservation=3g.20gb) will always receive a fixed amount of resources per job (e.g. only 1 MIG GPU per job). Even if you request a particular amount of resources, the job will start with the fixed amount. If your job requires more resources than the fixed amount, you should not use this reservation.
Please note that when allocating more than 8 GPUs (more than one node), your job will span several nodes and your software setup will require multi-node capabilities to make use of all allocated GPUs.
A note about more advanced and multi-node allocations: this version of SLURM has a bug when allocating resources using the flag --gpus-per-task, which simply does not work as intended. This bug makes, for instance, the switch combination -n 1 --gpus-per-task=1 allocate all GPUs on a node for your job, whether you use them or not. A working switch combination which accomplishes the same thing is -n 1 --ntasks-per-gpu=1. You can use this switch combination with different numbers of tasks and tasks per GPU, for instance -n 32 --ntasks-per-gpu=2, which will allocate a total of two nodes (16 GPUs) with two tasks per GPU. The bug appears to have been fixed in later versions of SLURM and we will upgrade as soon as possible.
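In a batch script, the working combination from the example above would look like this (a sketch; adjust the task counts to your job):

#SBATCH -n 32
#SBATCH --ntasks-per-gpu=2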
Interactive work in a shell on the allocated resources can be performed via the NSC provided script interactive, which is a wrapper around the SLURM commands salloc and srun and as such accepts all switches available to salloc. For reference, see https://www.nsc.liu.se/support/running-applications/ under the "Interactive jobs" heading.
For specific examples of the use of interactive, see for instance the Quick start guide above.
Batch jobs are supported in the standard SLURM way. For a guide see https://www.nsc.liu.se/support/batch-jobs/.
Jobs run on a single node should be straightforward in both Singularity container environments and the host OS. Make sure you have the allocated resources and, for Singularity containers, that you remembered to import the GPUs (using the --nv switch) into the container.
Multi-node jobs for regular MPI-parallel applications should be fairly standard on the cluster, and can use the common mpirun (or mpiexec) launcher available via the buildenv-gcccuda/11.4-8.3.1-bare module, or srun --mpi=X (the values of X supported by SLURM are pmi2, pmix and pmix_v3). If your application has been built with NSC provided toolchain(s), you should also be able to launch it with mpprun in standard NSC fashion.
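A minimal multi-node batch sketch, assuming an MPI application my_mpi_app built against the buildenv module (the executable name and resource numbers are placeholders):

#!/bin/bash
#SBATCH -N 2
#SBATCH --gpus=16
#SBATCH -t 01:00:00
# Make mpirun available and launch the MPI application across both nodes.
module load buildenv-gcccuda/11.4-8.3.1-bare
mpirun ./my_mpi_app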
Multi-node jobs using GPUs can be challenging when running in a Singularity container. For NVIDIA NGC containers used with Singularity, you can possibly launch your job using the mpirun provided by loading the buildenv-gcccuda/11.4-8.3.1-bare module. For reference, see https://sylabs.io/guides/3.7/user-guide/mpi.html.
Otherwise, SLURM on Berzelius has support for launching ENROOT containers directly using srun, see https://github.com/NVIDIA/enroot, which should work for at least NVIDIA NGC containers of recent date used to build your ENROOT container. ENROOT containers are similar to Singularity containers but don't require superuser privileges to build or modify them. ENROOT containers can be built on the compute nodes of the cluster, but not on the login nodes.
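As an illustration only, assuming the "additional switches" are the pyxis-style container switches to srun, and using an NGC image tag purely as a placeholder (please check the Berzelius Enroot Guide for the exact syntax supported on Berzelius):

[x_abcde@berzelius001 ~]$ srun --gpus=1 -t 10 \
    --container-image=nvcr.io#nvidia/pytorch:21.09-py3 nvidia-smi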
Example of using Singularity interactively, with 2 GPUs (plus 32 CPU cores implied) for one hour:
[x_abcde@berzelius1] ~$ interactive --gpus=2 -t 60
[x_abcde@node001] ~$ cd /proj/example_project_dir/x_abcde
[x_abcde@node001] $ singularity shell --nv my_singularity_image.sif
Singularity> nvidia-smi #Checking for GPUs
... (snip) ...
Singularity> mpirun <args> <your_executable> <exeargs> #Requires an image with MPI installed
NSC boost-tools are an attempt to add more flexibility to the job scheduling.
We currently provide three tools; please refer to the NSC boost-tools page for usage instructions.
The shared storage and the data transport fabric on Berzelius are very high performance and should suffice for most IO loads, specifically data intensive AI/ML loads. This is especially the case when the data sets are well formatted. Examples of good such formats are TFRecords (from TensorFlow), RecordIO (from MXNet) and Petastorm (from Uber).
Using datasets in these formats can greatly reduce IO-wait time on the GPU compared to raw file system access, and will also reduce the load on the shared storage. NSC highly recommends that you store and use your data sets in some such format.
There are two shared storage areas set up for use: /home/$USER and /proj/<your_project_dir>/users/$USER. The /home/$USER area is backed up (nightly) and small (20 GB quota per user), and is only meant for data you cannot put under /proj. The standard quota for the /proj directory is 5,000 GiB and 5 million files, but this can be increased, either at the time you apply for the project or via a complementary application at a later stage.
High performance NVMe SSD node local storage is available on each compute node.
There are a few points to note with respect to the available node local storage: it is available under /scratch/local on the compute nodes, and anything not moved elsewhere (e.g. to /proj) by the end of the job is lost, with no getting it back.
In case you need to use it for your datasets, try to store your dataset as uncompressed tar archives, preferentially split in many parts, and unpack them in parallel; this will increase your data transfer speed tremendously compared to a single process. Example:
# 144 GB ILSVRC 2012 data set in TFRecord format split in 128 tar archives
# unpacked with 16 parallel workers to /scratch/local. A single worker takes
# 106s to do the same task.
[raber@node001 ILSVRC2012]$ time ls *.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f
real 0m16.763s
user 0m3.192s
sys 8m4.740s
Quotas and your current use of them can be checked with the command nscquota. Complementary requests for increases in storage allocation can be made in SUPR if you find that you need more. If in doubt on how to do this, please contact support.