Berzelius is an AI/ML focused compute cluster permitting scale-out compute jobs aggregating the computational power of up to 480 NVIDIA A100 GPUs. The interconnect fabric allows RDMA, non-blocking connection between all of these GPUs with a bandwidth of 200 GB/s and µs order latencies between any two endpoints. This makes several hundred (AI) petaflops available to individual jobs for certain workloads. The resource is available to Swedish academic researchers as described in Project applications and Resource Allocations on Berzelius.
At its core the Berzelius SuperPOD is a compute cluster running the Linux operating system, specifically Red Hat Enterprise Linux 8 (RHEL8). As such, most examples in the Berzelius documentation at NSC use the command line interface (CLI), since CLI instructions can be copied verbatim by users and examples are easy to follow with less room for mistakes. The CLI is also an extremely powerful tool enabling high productivity for users, and is an inescapable part of any HPC environment. Note that there are some differences between a typical desktop Linux environment and an HPC environment, e.g. you can't use sudo to install things.
Basically, a compute cluster enables parallel computations spanning interconnected compute units (nodes), i.e. compute servers and a messaging interconnect, by providing the means for the compute units' work to be orchestrated via some software framework(s) supported by the cluster.
The premise for this to work is that your software has already been adapted using one of these frameworks, in essence having formulated a parallel solution to the computational problem studied through the use of the parallel computation software framework. Some common parallel frameworks are MPI, NCCL and Apache Hadoop. On Berzelius MPI and NCCL are supported as these are the completely dominant parallel frameworks for the architecture and fitting the purpose of the cluster. There are currently no plans to support other frameworks.
On a single compute server level, a different parallel paradigm is also available on modern clusters, as there are many CPU cores (and GPUs, as on Berzelius) usable via POSIX or OpenMP threads. Since Berzelius has very high-end compute servers, a very large category of problems will never need to scale out beyond a single compute node and can make effective use of this level of parallelism without the additional complexities of multi-node parallelism.
On a higher level, the nodes of a cluster are specialised to serve different roles. There are:
Login nodes: Typically a few nodes accessible from the internet. This is where you end up when accessing the cluster via SSH or ThinLinc, for instance, and it is where you transfer files to. These nodes are not meant for heavy computations, as they are shared by all users accessing the cluster, but are used to request compute resources (and to build code, edit scripts etc.) for compute intensive jobs. On Berzelius these nodes are called berzelius1.nsc.liu.se and berzelius2.nsc.liu.se. They can also be reached at berzelius.nsc.liu.se, from which you will be assigned to one of the two.
Resource manager nodes: This is the nexus of control on the cluster and is not user accessible. Typically these are one or two servers (in an HA arrangement). Compute resource requests made by users on the login or compute nodes are relayed here and are handled by the resource manager, on Berzelius SLURM, which allocates compute resources to jobs and prioritizes resource requests in a queue whenever there are more jobs lined up than there are free resources available.
Compute nodes: These nodes constitute the bulk of the cluster, and are where compute jobs requested from the resource manager are carried out. On Berzelius these nodes have names such as node001.
In addition, it is very common for compute clusters to have a shared storage file system served by a separate specialized cluster of storage servers. Berzelius' file systems /home and /proj are served by such a storage cluster.
From a cluster functionality perspective, the prototypical workflow using Berzelius thus consists of
srun, including the NSC provided interactive tool. This is fixed in a later SLURM release which we will roll out as soon as it is available. If the CUDA_VISIBLE_DEVICES variable is needed, please try setting the variable manually and exporting it to any srun-launched processes using the switch --export=ALL,CUDA_VISIBLE_DEVICES, or try launching your job with mpiexec, in which case you will need to load a buildenv module to make the command available.
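The manual workaround described above could look roughly like the following sketch. It assumes that the GPUs visible inside a job are numbered from 0, which is not stated in this guide and should be checked with nvidia-smi inside your allocation. The function name gpu_index_list and the program name ./my_program are hypothetical.

```shell
# Build a comma-separated GPU index list "0,1,...,N-1" for N allocated GPUs.
# Assumption: GPUs visible inside the job are numbered from 0.
gpu_index_list() {
    local n i list
    n="$1"
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        list="${list:+$list,}$i"
        i=$((i + 1))
    done
    printf '%s\n' "$list"
}

# Example for a job that was allocated 2 GPUs.
CUDA_VISIBLE_DEVICES="$(gpu_index_list 2)"
export CUDA_VISIBLE_DEVICES
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Inside a real allocation the variable would then be passed on with:
#   srun --export=ALL,CUDA_VISIBLE_DEVICES ./my_program
```

The srun line is left as a comment so that the snippet can be tried outside of a job allocation.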
Mail any support issues to email@example.com or use the interface available in SUPR. Please report the problems and obstacles you face when you encounter them, and provide as much detail on the issue as you can so we may reproduce and fix the issues. A very important piece of information here is the SLURM JobID of any job having a problem, as this will allow us to track which nodes have been used and when.
The support mail address is also the interface to make feature requests to add to Berzelius, and we also have the possibility to bring in the Berzelius vendor Atos or NVIDIA, should there be issues where extra support is needed.
Assuming you have received your Berzelius account credentials, here is the process to log in to the Berzelius login node and allocate compute resources from SLURM for interactive or batch work:
SSH to berzelius1.nsc.liu.se (replace x_abcde with your real user name)
[you@your_computer ~]$ ssh x_abcde@berzelius1.nsc.liu.se
[x_abcde@berzelius001 ~]$
From the login node (berzelius1), allocate resources for interactive or batch work on a compute node. The resource allocation defaults per GPU are: 1 task, 16 cores per task, 125 GB RAM per task and a default wall time of 2h. Override the defaults as you see fit.
Interactive work using a single GPU and default number of tasks, cores per task, RAM and wall time.
[x_abcde@berzelius001 ~]$ interactive --gpus=1
[x_abcde@node001 ~]$ nvidia-smi # Check that the GPU resources are allocated
Mon Jun  7 15:51:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM4-40GB      On   | 00000000:07:00.0 Off |                    0 |
| N/A   25C    P0    50W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The A100 GPUs have a feature for splitting the GPU into multiple independent instances. Nodes in the reservation 3g.20gb have this feature enabled, splitting the GPUs in half. All jobs in this reservation receive a fixed amount of resources and cost half as much as a normal job using 1 A100 GPU. If your workload fits in one of these jobs it is an excellent use of resources.
To achieve the fixed resource allocation per job, some options are overridden during job submission. For example, you will always get 1 MIG GPU and 8 CPU cores, regardless of what you asked for.
[x_abcde@berzelius001 ~]$ interactive --reservation=3g.20gb
[x_abcde@node059 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4bc10a41-11e4-0c97-4027-190e9f305d52)
  MIG 3g.20gb Device 0: (UUID: MIG-37769be6-b4f5-5a17-bd14-6ad61e7c3d62)
...
[x_abcde@berzelius001 ~]$ interactive --gpus=2 -t 30
[x_abcde@node001 ~]$

[x_abcde@berzelius001 ~]$ interactive -N 1 --exclusive -t 6:00:00
[x_abcde@node001 ~]$
[x_abcde@berzelius001 ~]$ cat << 'EOF' > my_example_batch_script.sh
#!/bin/bash
#SBATCH --gpus 4
#SBATCH -t 3-00:00:00
#SBATCH -A <your-project-account>

# The '-A' SBATCH switch above is only necessary if you are a member of several
# projects on Berzelius, and can otherwise be left out.

# Singularity images can only be used outside /home. In this example the
# image is located here
cd /proj/my_project_storage/users/$(id -un)

# Execute my Singularity image, binding in the current working directory
# containing the Python script I want to execute
singularity exec --nv -B "$(pwd)" Example_Image.sif python train_my_DNN.py
EOF
[x_abcde@berzelius001 ~]$ sbatch my_example_batch_script.sh
Berzelius uses SSH for CLI access and the VNC solution ThinLinc for remote desktop access (including terminal CLI access). Access Berzelius via the host name berzelius1.nsc.liu.se using SSH or the ThinLinc client with your user name provided by SUPR. Example:
[you@your_computer ~]$ ssh x_abcde@berzelius1.nsc.liu.se
For data transfers to Berzelius, please use scp or rsync. Other file transfer tools (e.g. FileZilla, WinSCP) using the SCP or SFTP protocols will likely work as well, but have not been tested. Examples:
[you@your_computer ~]$ # SCP an archive file to Berzelius
[you@your_computer ~]$ scp my_dataset.tar \
    x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde
[you@your_computer ~]$ # rsync an unpacked directory containing a dataset
[you@your_computer ~]$ rsync -av my_unpacked_dataset \
    x_abcde@berzelius1.nsc.liu.se:/proj/<your_project_name>/users/x_abcde/
Always upload large datasets to your /proj directory and never to /home, since the /home quota is only 20 GB, see below.
For AI/ML work on Berzelius, where highly complex production environments and high user customizability are more or less required, NSC strongly recommends using a container environment, where Singularity or ENROOT are the supported options. Docker is not supported for security reasons.
Using a container environment will allow a highly portable workflow and reproducible results between systems as diverse as a laptop, Berzelius or EuroHPC resources such as LUMI for instance. It will also bring a level of familiarity of use where a user is free to choose their operating system independently of the host environment.
There are, however, many options other than containers that you can use on Berzelius to manage your compute environment. The most prominent ones are listed below.
All software external to the RHEL OS is installed under the /software directory, and is made conveniently available via the module system. Handle your environment by loading modules, e.g. module load Anaconda/2021.05-nsc1, instead of manually adding paths and similar in your ~/.bashrc file. Check module availability with module avail.
Only basic modules are available currently, but the selection will be expanded as time progresses. You are very welcome to make feature requests via support, so that we can customize the Berzelius user environment to be most effective for you.
Using Singularity on Berzelius should not differ appreciably from using it in other settings, except that you will not have root access on Berzelius, i.e. the privilege escalating sudo command does not work, and singularity build features relying on user namespaces (but not formally sudo) are not available.
Whenever root privileges are required (like using sudo) for your use of Singularity, build or adapt your images in advance where you have such privileges and upload the image to your /proj directory (it won't launch under /home). The canonical reference and user guide for Singularity can be found at https://sylabs.io/guides/3.8/user-guide/, which should apply directly to Singularity's use on Berzelius.
We are very keen to find out if there are any significant differences to using it in other settings. Please report such differences to support.
Points of note about Singularity on Berzelius:
Use the --nv switch to import the GPU devices into your container. This is critical to using NVIDIA GPUs in a Singularity container.
Switches such as --userns for building Singularity images will not work, because they rely on user namespaces being enabled.
For multi-node jobs you can possibly launch your job using mpirun from the build environment buildenv-gcccuda/11.2-8.3.1-bare. Otherwise, for multi-node container jobs, you may need to shift focus to the NVIDIA Enroot container solution, see https://github.com/NVIDIA/enroot, which can be used via srun (plus additional switches) on Berzelius.
Anaconda installations are available via the module system, e.g. Anaconda/2021.05-nsc1. They are special NSC installations which only make the conda command available (and do not modify your ~/.bashrc file and login environment). Use this conda to set up your conda environments, and put these environments under the /proj file system, either by making ~/.conda a symlink to a directory under /proj/<your_project_dir>/users/$(id -un) or by using the -p switch to conda create, for example
$ conda create -p /proj/<your_project_dir>/users/$(id -un)/mycondaenv \
    python=3.8 scipy=1.5.2
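The symlink alternative mentioned above could be set up along the lines of the following sketch. The function name link_conda_dir and the conda subdirectory name are hypothetical choices made for illustration, and any existing ~/.conda content is copied to the new location before the link is created.

```shell
# Make ~/.conda a symlink into project storage so that conda environments
# and package caches do not count against the small /home quota.
# Usage: link_conda_dir <target_dir_under_proj> [home_dir]
link_conda_dir() {
    local target home_dir link
    target="$1"
    home_dir="${2:-$HOME}"
    link="$home_dir/.conda"

    mkdir -p "$target" || return 1
    if [ -L "$link" ]; then
        echo "$link is already a symlink to $(readlink "$link")"
        return 0
    fi
    if [ -d "$link" ]; then
        # Copy existing content to the new location before removing it.
        cp -a "$link/." "$target/" || return 1
        rm -rf "$link" || return 1
    fi
    ln -s "$target" "$link"
}

# Hypothetical call; replace <your_project_dir> with your real project directory:
#   link_conda_dir "/proj/<your_project_dir>/users/$(id -un)/conda"
```

With the link in place, environments created without the -p switch should end up under /proj rather than /home.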
Other than this, using Anaconda on Berzelius should not differ appreciably to using it in other settings.
It should be noted that the CUDA driver and runtime on the nodes (currently version 11.2) must support whatever you install in your conda environment. Typically, an installed driver and runtime support only software built for CUDA versions up to and including that version, and nothing newer.
Using the RHEL 8 provided Python in virtualenvs could also be a viable approach in many cases. Make sure you're working with Python 3.6. Python 3.6.8 should be the system default on Berzelius, but please verify for yourself. Also, put your virtualenvs under /proj and not /home, since we want to avoid filling your /home directory up with non-precious data and backing it up to tape.
Install packages with pip install <packagename> in your virtualenv, and build packages using a build environment, for instance buildenv-gcccuda/11.2-8.3.1-bare. It should be noted that the CUDA driver and runtime on the nodes must be supported by whatever software you install in your virtualenv.
$ cd /proj/example_project/users/$(id -un)
$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ python --version
Python 3.6.8
(myenv) $ pip install --upgrade pip
... (snip) ...
(myenv) $ module load buildenv-gcccuda/11.2-8.3.1-bare
(myenv) $ pip3 install mpi4py
...
A basic build environment, buildenv-gcccuda/11.2-8.3.1-bare, is available for those who may need to build software for the RHEL 8 environment on Berzelius. For instance, if you need to build the mpi4py Python package or CUDA dependent Python packages, have this module loaded when building.
The build environment is based on the system GCC (8.3), CUDA 11.2, OpenMPI 4.1.1, OpenBLAS 0.3.15, FFTW 3.3.9 and ScaLAPACK 2.1.0. Please report any problems using it to support.
Inevitably there comes a need to use graphical applications on the cluster. The NSC recommended way to run graphical applications on Berzelius is via the ThinLinc VNC solution (TL), which is leveraged to provide a remote desktop interface on Berzelius, see Running graphical applications for more information. The TL remote desktop presented is XFCE, due to it being light on server resources. It will provide a much better user experience for modern GUIs than X-forwarding, although X-forwarding is not prohibited. In addition, it provides session management, allowing users to disconnect from the session while running processes are kept running, much like a GUI version of a terminal multiplexer such as GNU screen.
The ThinLinc client is available free of charge and has packages available for the major OS platforms (Linux, MacOS and Windows) from Cendio at URL https://www.cendio.com/thinlinc/download. Applications of particular interest to the Berzelius users, which benefit from use via TL include
Berzelius is a SuperPOD compute cluster using SLURM as its resource manager. General SLURM documentation should be valid for making resource allocations on Berzelius. Check the man pages or the documentation at https://slurm.schedmd.com/. Other NSC SLURM documentation is likely also useful in this context and can be found at https://www.nsc.liu.se/support/batch-jobs/, but may not apply in all parts.
The wall time limit has been set to 3 days (72h) to ensure that there is reasonable turnover of jobs. You can extend the wall time of a running job using boost-tools.
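As a small illustration of the limit above, the following sketch converts a number of whole hours into the days-hours:minutes:seconds time string accepted by the SLURM -t switch, and rejects values above the 72h limit. The function name slurm_timestring is hypothetical, and the restriction to whole hours is a simplification of this sketch, not a SLURM rule.

```shell
# Convert whole hours into a SLURM time string of the form D-HH:MM:SS.
# This helper only handles whole hours from 1 to 72 (the wall time limit).
slurm_timestring() {
    local hours
    hours="$1"
    if [ "$hours" -lt 1 ] || [ "$hours" -gt 72 ]; then
        echo "this helper only handles whole hours from 1 to 72" >&2
        return 1
    fi
    printf '%d-%02d:00:00\n' "$((hours / 24))" "$((hours % 24))"
}

slurm_timestring 30   # prints 1-06:00:00
```

The resulting string could then be passed as, for instance, -t 1-06:00:00 to interactive or sbatch.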
There are default allocation settings in place based on the number of GPUs allocated. Currently these are 1 task using 16 CPU cores and 125 GB RAM for every GPU allocated. This allows you to specify only the number of GPUs your job requires (e.g. --gpus=2, getting you 32 CPU cores and 250 GB RAM as well as the 2 GPUs) and still get a sensible allocation for most circumstances.
In addition, the default wall time allocated for jobs is 2h. Remember to override this with -t <timestring> when needed. All of these defaults are overridden by user switches when provided.
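The default scaling described above can be written out as simple arithmetic. The following is only an illustration of the stated defaults (1 task, 16 CPU cores and 125 GB RAM per GPU); the function name default_allocation is hypothetical and the snippet does not query SLURM.

```shell
# Print the default resources implied by requesting a given number of GPUs,
# using the per-GPU defaults stated in this guide.
default_allocation() {
    local gpus
    gpus="$1"
    echo "tasks=$gpus cores=$((gpus * 16)) ram_gb=$((gpus * 125))"
}

default_allocation 2   # prints tasks=2 cores=32 ram_gb=250
```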
Allocation of resources in general follows the SLURM documentation. One thing to be aware of is that there are very many ways to give conflicting directives when allocating resources, and they will result in either not getting the allocation or getting the wrong allocation. Before submitting a resource intensive batch job, it is worthwhile to try the settings in an interactive session and verify that the resources are properly allocated.
In general, if you need to specify more than solely the number of GPUs your job requires, a recommended way to allocate is via the switch combination -n X --gpus=Y (plus other needed switches, like -t), where X is the number of tasks (CPU cores) and Y is the number of GPUs available to your X tasks. This pattern seems to work in most circumstances to get you what you expect. In some circumstances you may also need to specify -c Z, where Z is the number of hyperthreads allocated to each task (there are two hyperthreads per physical CPU core), but this should not normally be required.
When allocating a partial node, NSC's recommendation is to allocate tasks (CPU cores) in proportion to how large a fraction of the node's total GPUs you allocate, i.e. for every GPU (1/8 of a node) you allocate, you should also allocate 16 tasks (1/8 = 16/128). The default memory allocation follows that of the cores, i.e. for every CPU core allocated, 7995 MB of RAM (slightly less than 1/128th of the node's total RAM) is allocated. This is automatically taken care of when using only the switch --gpus=X as in the examples of the quick start guide.
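The proportional rule above can be sketched as a small helper that prints the matching -n and --gpus switches for a partial node together with the default memory that follows from 7995 MB per core. The function name partial_node_request is hypothetical, and the snippet only performs the arithmetic, it does not submit anything.

```shell
# For a single-node request of 1 to 8 GPUs, print the proportional task
# count (16 tasks per GPU) and the default memory (7995 MB per core).
partial_node_request() {
    local gpus tasks mem_mb
    gpus="$1"
    if [ "$gpus" -lt 1 ] || [ "$gpus" -gt 8 ]; then
        echo "this sketch only covers 1 to 8 GPUs (a single node)" >&2
        return 1
    fi
    tasks=$((gpus * 16))
    mem_mb=$((tasks * 7995))
    echo "switches: -n $tasks --gpus=$gpus; default memory: $mem_mb MB"
}

partial_node_request 3   # prints switches: -n 48 --gpus=3; default memory: 383760 MB
```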
Jobs using the reservation with MIG GPUs (--reservation=3g.20gb) will always receive a fixed amount of resources per job (e.g. only 1 MIG GPU per job). Even if you request a particular amount of resources, the job will start with the fixed amount. If your job requires more resources than the fixed amount, you should not use this reservation.
Please note that when allocating more GPUs than 8 (more than one node), you will end up on several nodes and will require multi-node capabilities on your software setup to make use of all allocated GPUs.
A note about more advanced and multi-node allocations: This version of SLURM has a bug when allocating resources using the flag --gpus-per-task, which does not work as intended. For instance, this bug makes the switch combination -n 1 --gpus-per-task=1 allocate all GPUs on a node for your job, whether you use them or not. A working switch combination which accomplishes the intended result is -n 1 --ntasks-per-gpu=1. You can use this switch combination with different numbers of tasks and tasks per GPU, for instance -n 32 --ntasks-per-gpu=2, which will allocate a total of two nodes (16 GPUs) with two tasks per GPU. The bug appears to have been fixed in later versions of SLURM and we will upgrade as soon as possible.
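The arithmetic behind the -n and --ntasks-per-gpu combination above can be sketched as follows. The sketch assumes 8 GPUs per node, as stated elsewhere in this guide, and that partial GPUs and nodes are rounded up; the function name gpus_and_nodes is hypothetical and nothing is requested from SLURM.

```shell
# Given a task count and a tasks-per-GPU value, print the number of GPUs
# and the number of 8-GPU nodes the request corresponds to (rounding up).
gpus_and_nodes() {
    local ntasks tasks_per_gpu gpus nodes
    ntasks="$1"
    tasks_per_gpu="$2"
    gpus=$(( (ntasks + tasks_per_gpu - 1) / tasks_per_gpu ))
    nodes=$(( (gpus + 7) / 8 ))
    echo "gpus=$gpus nodes=$nodes"
}

gpus_and_nodes 32 2   # prints gpus=16 nodes=2
```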
Interactive work in a shell on the allocated resources can be performed via the NSC provided script interactive, which is a wrapper around the SLURM commands salloc and srun, and as such accepts all switches available to salloc. For reference see https://www.nsc.liu.se/support/running-applications/ under the "Interactive jobs" heading.
For specific examples of the use of interactive, see for instance the Quick start guide above.
Batch jobs are supported in the standard SLURM way. For a guide see https://www.nsc.liu.se/support/batch-jobs/.
Jobs run on a single node should be straightforward in both Singularity container environments and the host OS. Make sure you have got the allocated resources and, for Singularity containers, remember to import the GPUs (using the --nv switch) into the container.
Multi-node jobs for regular MPI-parallel applications should be pretty standard on the cluster, and can use the common mpirun (or mpiexec) launcher available via the buildenv-gcccuda/11.2-8.3.1-bare module, or srun --mpi=X (the values of X supported by SLURM here are pmi2, pmix and pmix_v3). If your application has been built with NSC provided toolchain(s) you should also be able to launch it with mpprun in standard NSC fashion.
Multi-node jobs using GPUs can be challenging when running in a Singularity container. For NVIDIA NGC containers used with Singularity, you can possibly launch your job using the mpirun provided by loading the buildenv-gcccuda/11.2-8.3.1-bare module. For reference see https://sylabs.io/guides/3.7/user-guide/mpi.html.
Otherwise, SLURM on Berzelius has support for launching ENROOT containers directly using srun, see https://github.com/NVIDIA/enroot, which should work at least for ENROOT containers built from NVIDIA NGC containers of recent date. ENROOT containers are similar to Singularity containers but don't require superuser privileges to build or modify them. ENROOT containers can be built on the compute nodes of the cluster but not on the login nodes.
2 GPUs (plus 32 CPU cores implied) for one hour
[x_abcde@berzelius1] ~$ interactive --gpus=2 -t 60
[x_abcde@node001] ~$ cd /proj/example_project_dir/x_abcde
[x_abcde@node001] $ singularity shell --nv my_singularity_image.sif
Singularity> nvidia-smi #Checking for GPUs
... (snip) ...
Singularity> mpirun <args> <your_executable> <exeargs> #Requires an image with MPI installed
The shared storage and data transport fabric on Berzelius are very high performance, and should suffice for most IO loads on it, specifically data intensive AI/ML loads.
This is especially the case when the data sets are well formatted. Examples of good formats are TFRecords (from TensorFlow), RecordIO (from MXNet) and Petastorm (from Uber).
The use of datasets in these formats can greatly reduce IO-wait time on the GPU compared to raw file system access, and will also reduce load on the shared storage. NSC highly recommends that you store and use your data sets using some such format.
There are two shared storage areas set up for use: /home/$USER and /proj/<your_project_dir>. The /home/$USER area is backed up (nightly) and small, with a 20 GB quota per user, and is only meant for data you cannot put under /proj. The standard quota for the /proj directory is 5,000 GiB and 5 M files, but this can be increased, either at the time you apply for the project or as a complementary application at a later stage.
High performance NVMe SSD node local storage is available on each compute node.
There are a few points to note with respect to the available node local storage
Data on node local storage which has not been saved to shared storage (e.g. /proj) at the end of a job is lost and cannot be recovered.
In case you need to use it for your datasets, try to store your dataset as uncompressed tar archives, preferably split into many parts, and unpack them in parallel. This will increase your data transfer speed tremendously compared to using a single process. Example:
# 144 GB ILSVRC 2012 data set in TFRecord format split in 128 tar archives
# unpacked with 16 parallel workers to /scratch/local. A single worker takes
# 106s to do the same task.
[raber@node001 ILSVRC2012]$ time ls *.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f

real    0m16.763s
user    0m3.192s
sys     8m4.740s
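One possible way to prepare such a set of tar archives, and to unpack them in parallel, is sketched below. The function names pack_in_parts and unpack_parallel are hypothetical, the sketch assumes a tar that supports the -T (files-from) option such as GNU tar, and it does not handle file names containing newlines.

```shell
# pack_in_parts <src_dir> <out_dir> <files_per_archive>
# Split the files under src_dir into uncompressed tar archives in out_dir,
# with at most files_per_archive files in each archive.
pack_in_parts() {
    local src out per list
    src="$1"
    out="$2"
    per="$3"
    mkdir -p "$out" || return 1
    out="$(cd "$out" && pwd)"
    # Write file lists of at most $per entries each, with paths relative to src.
    ( cd "$src" && find . -type f | sort | split -l "$per" - "$out/list." ) || return 1
    for list in "$out"/list.*; do
        [ -e "$list" ] || continue
        tar -cf "$out/part.${list##*.}.tar" -C "$src" -T "$list" || return 1
        rm -f "$list"
    done
}

# unpack_parallel <archive_dir> <dest_dir> <workers>
# Unpack all tar archives in archive_dir into dest_dir using parallel workers.
unpack_parallel() {
    local archives dest workers
    archives="$1"
    dest="$2"
    workers="$3"
    mkdir -p "$dest" || return 1
    find "$archives" -name '*.tar' -print0 |
        xargs -0 -n 1 -P "$workers" tar -x -C "$dest" -f
}

# Hypothetical usage inside a job, unpacking to node local storage:
#   unpack_parallel /proj/<your_project_dir>/users/$(id -un)/parts /scratch/local 16
```

Keeping the archives uncompressed means that the unpacking is limited by IO rather than by decompression, which is what makes the parallel workers effective.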
Your quotas and current usage can be checked with the command nscquota. Complementary requests for increases in storage allocation can be made in SUPR if you find that you need more. If in doubt about how to do this, please contact support.