Berzelius - Getting Started Guide


Berzelius is an AI/ML-focused compute cluster permitting scale-out compute jobs that aggregate the computational power of up to 752 NVIDIA A100 GPUs. The interconnect fabric provides non-blocking RDMA connections between all of these GPUs, with a bandwidth of 200 GB/s and latencies on the order of microseconds between any two endpoints. This makes several hundred (AI) petaflops available to individual jobs for certain workloads. The resource is available to Swedish academic researchers as described in Project applications and Resource Allocations on Berzelius.

Compute Cluster Basics

At its core, the Berzelius SuperPOD is a compute cluster running the Linux operating system, specifically Red Hat Enterprise Linux 8 (RHEL8). As such, most examples in the Berzelius documentation at NSC use the command line interface (CLI), since CLI instructions can be copied verbatim by users and examples are easy to follow with less room for mistakes. The CLI is also an extremely powerful tool enabling high productivity, and is an inescapable part of any HPC environment. Note that there are some differences between a typical desktop Linux environment and an HPC environment, e.g. you cannot use sudo to install things.

What is a Compute Cluster

A compute cluster enables parallel computations spanning interconnected compute units (nodes), i.e. compute servers joined by a messaging interconnect, by providing the means to orchestrate the compute units' work via some software framework(s) supported by the cluster.

The premise for this to work is that your software has already been adapted to one of these frameworks, i.e. that a parallel solution to the computational problem under study has been formulated using the parallel computation framework. Some common parallel frameworks are MPI, NCCL and Apache Hadoop. On Berzelius, MPI and NCCL are supported, as these are the completely dominant parallel frameworks for the architecture and purpose of the cluster. There are currently no plans to support other frameworks.

At the single compute server level, there is also a different parallel paradigm available on modern clusters, since each server has many CPU compute cores (and GPUs, as on Berzelius) usable via POSIX or OpenMP threads. Since Berzelius has very high-end compute servers, a very large category of problems will never need to scale out beyond a single compute node, and can make effective use of this level of parallelism without the additional complexities of multi-node parallelism.

Login nodes and compute nodes

At a higher level, the nodes of a cluster are specialised to serve different roles:

Login nodes
Typically a few nodes accessible from the internet. This is where you end up when accessing the cluster via SSH or ThinLinc, for instance, and where you transfer files to. These nodes are not meant for heavy computations, as they are shared by all users accessing the cluster, but are used to request compute resources for compute-intensive jobs (and to build code, edit scripts, etc.). On Berzelius there are two login nodes; they can also be reached via a common address, from which you will be assigned to one of the two.
Compute nodes
These nodes constitute the bulk of the cluster, and are where compute jobs requested from the resource manager are carried out. On Berzelius these nodes are named node001-node094. Nodes node061-node094 are the newer "fat" nodes with more HBM2 VRAM.

In addition, it is very common for compute clusters to have a shared storage file system served by a separate specialized cluster of storage servers. Berzelius' file systems /proj and /home are served by such a storage cluster.

General Information

Berzelius is currently, to a large degree, a work in progress in terms of user environment and feature maturity. Some general gotchas identified in the period leading up to Berzelius' general availability are:

  • SLURM resource allocation.
    • There are many conflicting sbatch/salloc directives, in particular when it comes to GPUs. The examples provided here are known to work and are adapted to what we think will fit the most common use cases, but if your needs deviate substantially from these and you can't make it work on your own, please contact support (see below).
    • The currently installed version of SLURM does not properly set the CUDA_VISIBLE_DEVICES variable in parallel jobs launched with srun, including the NSC-provided interactive tool. This is fixed in a later SLURM release, which we will roll out as soon as it is available. If the CUDA_VISIBLE_DEVICES variable is needed, please try setting the variable manually and exporting it to any srun-launched processes using the switch --export=ALL,CUDA_VISIBLE_DEVICES, or you can try launching your job with mpirun/mpiexec, in which case you will need to load a buildenv module to make the command available.
  • CUDA driver version. The current CUDA driver on the compute nodes is from the CUDA 12.0 release with compatibility packages for 12.2.
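The CUDA_VISIBLE_DEVICES workaround above can be sketched as follows; the GPU indices and program name are placeholders to adjust to your actual allocation:

```shell
# Manually set the variable for the GPUs of your allocation (placeholder indices)
export CUDA_VISIBLE_DEVICES=0,1
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# In a job step you would then propagate it explicitly, e.g.:
#   srun --export=ALL,CUDA_VISIBLE_DEVICES ./my_gpu_program
```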


Mail any support issues to the Berzelius support address, or use the interface available in SUPR. Please include the following information when you report problems and obstacles:

  • A general description of the problem
  • Job IDs
  • Error messages
  • Commands to reproduce the error messages

The support mail address is also the interface for making feature requests for Berzelius, and we also have the option of bringing in the Berzelius vendor Atos, or NVIDIA, should there be issues where extra support is needed.


We kindly ask that you acknowledge Berzelius and NSC in scientific publications that required the use of Berzelius resources, services or expertise. This helps us ensure continued funding and maintain service levels. Please see Acknowledgment suggestion for examples.

Quick Start Guide

From a cluster functionality perspective, the prototypical workflow on Berzelius consists of:

  1. Transfer your datasets to the cluster
  2. Access the cluster login nodes via SSH/ThinLinc
  3. Request compute resources to
    • perform exploratory, fast-feedback work interactively, and/or
    • execute scripted batch jobs for longer/heavier calculations
  4. Monitor/supervise batch jobs' progress

Login to Berzelius

Assuming you have received your Berzelius account credentials, here is how to log in to a Berzelius login node.

SSH to one of the login nodes (substitute x_abcde with your real user name):

[you@your_computer ~]$ ssh
[x_abcde@berzelius001 ~]$

Resource Allocations Examples

From the login node, we can now allocate compute resources from SLURM for interactive or batch work on a compute node. Resource allocation defaults per GPU are: 1 task, 16 cores per task, 125 GB RAM per task and a default wall time of 2 h. Override the defaults as you see fit.

Example 1: Interactive work

Interactive work using a single GPU and default number of tasks, cores per task, RAM and wall time.

[x_abcde@berzelius001 ~]$ interactive --gpus=1

You can check the GPU resources allocated:

[x_abcde@node001 ~]$ nvidia-smi # Check that the GPU resources are allocated
Fri Mar  3 10:30:59 2023       
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

Example 2: Work on a MIG GPU

The A100 GPUs have a Multi-Instance GPU (MIG) feature for splitting a GPU into multiple independent instances. Nodes in the reservation 3g.20gb have this feature enabled, splitting the GPUs in half. All jobs in this reservation receive a fixed amount of resources and cost half as much as a normal job using 1 A100 GPU. If your workload fits in one of these jobs, it is an excellent use of resources.

To achieve the fixed resource allocation per job, some options are overridden during job submission. For example, you will always get 1 MIG GPU and 8 CPU cores, regardless of what you asked for.

[x_abcde@berzelius001 ~]$ interactive --reservation=3g.20gb
[x_abcde@node059 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4bc10a41-11e4-0c97-4027-190e9f305d52)
  MIG 3g.20gb     Device  0: (UUID: MIG-37769be6-b4f5-5a17-bd14-6ad61e7c3d62)

Example 3: Interactive work on 2 GPUs, with default resources per GPU for 30 minutes

[x_abcde@berzelius001 ~]$ interactive --gpus=2 -t 30
[x_abcde@node001 ~]$ 

Example 4: Interactive work on 1 DGX-A100 with all its resources (GPU, CPU and memory) exclusively for 6h

[x_abcde@berzelius001 ~]$ interactive -N 1 --exclusive -t 6:00:00
[x_abcde@node001 ~]$ 

Example 5: Submitting a batch script using four GPUs for 3 days

Create a batch script containing:

#!/bin/bash
#SBATCH --gpus 4
#SBATCH -t 3-00:00:00
#SBATCH -A <your-project-account>

# The '-A' SBATCH switch above is only necessary if you are a member of several
# projects on Berzelius, and can otherwise be left out.

# Apptainer images can only be used outside /home. In this example the
# image is located here
cd /proj/my_project_storage/users/$(id -un)

# Execute my Apptainer image, binding in the current working directory
# containing the Python script I want to execute
apptainer exec --nv -B $(pwd) Example_Image.sif python

You can submit the batch work by:

[x_abcde@berzelius001 ~]$ sbatch

Example 6: Interactive work on a "fat" GPU, with default resources per GPU for 30 minutes

Using the feature flag -C "fat", only "fat" nodes will be considered for the job. The corresponding flag for "thin" nodes is -C "thin". If no feature flag is specified, any node can be used. When the "fat" feature is used, the job is assigned 254 GB of system memory per GPU. The GPUh cost of a job on a "fat" GPU is twice that of a "thin" GPU, even if the job was not launched with -C "fat".

[x_abcde@berzelius001 ~]$ interactive --gpus=1 -C "fat" -t 30
[x_abcde@node063 ~]$ 

Resource Allocations Costs

Depending on the type of resources allocated to a job, the cost in GPUh will vary. Using feature flags it is possible to select either "thin" (A100 40GB) or "fat" (A100 80GB) GPUs for a job; a job specifying neither can use either type. MIG GPUs are accessed through the MIG reservation.

GPU          Internal SLURM cost   GPUh cost per hour   Accessed through
MIG 3g.20gb  8                     0.5                  --reservation=3g.20gb
A100 40GB    16                    1                    -C "thin", or no flag
A100 80GB    32                    2                    -C "fat", or no flag

Data Transfer to Berzelius

For data transfers between Berzelius and your local computer, please use scp or rsync. Other file transfer tools (e.g. FileZilla, WinSCP) using SCP or SFTP protocol should likely work as well, but have not been tested.

  • To upload from your local computer to Berzelius
scp /your_local_dir/my_dataset.tar
rsync -av /your_local_dir/unpacked_dataset

Always upload large datasets to your /proj directory and never to /home, since the /home quota is only 20 GB, see Data Storage.

  • To download from Berzelius to your local computer
scp /your_local_dir
rsync -av /your_local_dir

Usability Features of Berzelius

For AI/ML work on Berzelius, where highly complex production environments and high user customizability is more or less required, NSC strongly recommends using a container environment, where Apptainer or Enroot are the supported options. Docker is not supported for security reasons.

Using a container environment will allow a highly portable workflow and reproducible results between systems as diverse as a laptop, Berzelius or EuroHPC resources such as LUMI for instance. It will also bring a level of familiarity of use where a user is free to choose their operating system independently of the host environment.

There are, however, many other options besides containers for managing your compute environment on Berzelius; the most prominent ones are listed below.


All software external to the RHEL OS is installed under the /software directory and is made conveniently available via the module system. Handle your environment by loading modules, e.g. module load Anaconda/2021.05-nsc1, instead of manually adding paths and the like in your ~/.bashrc file. Check module availability with module avail.

Only basic modules are available currently, but more will be added as time progresses. You are very welcome to make feature requests via support, so that we can customize the Berzelius user environment to be most effective for you.


Using Apptainer on Berzelius should not differ appreciably from using it in other settings, except that you will not have root access on Berzelius, i.e. the privilege-escalating sudo command does not work, and apptainer build features relying on user namespaces (but not formally on sudo) are not available.

Whenever root privileges are required (such as using sudo) for your use of Apptainer, build or adapt your images in advance on a system where you have such privileges, and upload the image to your /proj directory (it won't launch under /home). The canonical reference and user guide on the Apptainer website should apply directly to Apptainer's use on Berzelius.

We are very keen to find out if there are any significant differences to using it in other settings. Please report such differences to support.

Points of note about Apptainer on Berzelius:

  • Remember to use the --nv switch to import the GPU devices into your container. This is critical to using NVIDIA GPUs in an Apptainer container.
  • Available on the compute nodes only, so you need to work in a SLURM environment on a node to use it.
  • User namespaces are not enabled on Berzelius for security reasons. This means that the --fakeroot and --userns switches for building apptainer images will not work because they rely on user namespaces being enabled.
  • Available on the compute nodes without loading a module.
  • Should work as you would expect it to on any platform. In particular, parallel runs launched from within a container on a single node should work out of the box.
  • Containers pulled from NVIDIA NGC are especially built and QA tested by NVIDIA to work on an NVIDIA SuperPOD like Berzelius.
  • Potentially you can launch multi-node Apptainer jobs using mpirun from the build environment buildenv-gcccuda/11.4-8.3.1-bare. Otherwise, for multi-node container jobs, you may need to shift focus to the NVIDIA Enroot container solution, which can be used via srun (plus additional switches) on Berzelius.

Please read the Berzelius Apptainer Guide for more details.


Enroot is a simple, yet powerful tool to turn container images into unprivileged sandboxes. Enroot is targeted for HPC environments with integration with the Slurm scheduler, but can also be used as a standalone tool to run containers as an unprivileged user. Enroot is similar to Apptainer, but with the added benefit of allowing users to read/write in the container and also to appear as a root user within the container environment. Please read the Berzelius Enroot Guide for more details.


Anaconda installations are available via the module system, e.g. Anaconda/2021.05-nsc1. They are special NSC installations which only make the conda command available (and don't mess up your ~/.bashrc file and login environment).

The default location for conda environments is ~/.conda in your home directory. This location can be problematic, since these environments can become very large. It is therefore suggested to redirect this directory to your project directory using a symbolic link:

mv ~/.conda /proj/your_proj/users/your_username
ln -s /proj/your_proj/users/your_username/.conda ~/.conda

A basic example for creating a conda environment called myenv with Python 3.8 with the pandas and seaborn packages:

module load Anaconda/2021.05-nsc1
conda create -n myenv python=3.8 pandas seaborn
conda activate myenv

Please see the conda cheatsheet for the most important commands about using conda.

Mamba is a reimplementation of the conda package manager in C++, offering the following benefits:

  • much faster dependency solving
  • parallel downloading of repository data and package files

Berzelius makes mamba (and conda) available via the Mambaforge miniforge distribution, with which users can create, alter and activate conda environments. For example, to set up a customized environment and run a Python program in it:

module load Mambaforge/23.3.1-1
mamba create -n myenv python=3.8 scipy=1.5.2
mamba activate myenv

It should be noted that the CUDA driver and runtime on the nodes (currently version 12.0) must be supported by whatever you install in your conda env. Typically, installed drivers and runtimes support only environments using up to that particular CUDA version, but nothing above it.

System Python

Using the RHEL 8-provided Python in virtualenvs can also be a viable approach in many cases. Make sure you are working with Python 3.6; Python 3.6.8 should be the system default on Berzelius, but please verify for yourself. Also, put your virtualenvs under /proj and not /home; we want to avoid filling your /home directory with non-precious data and backing it up to tape.

Install packages with pip install <packagename> in your virtualenv, and build packages using a build environment, for instance 'buildenv-gcccuda/11.4-8.3.1-bare'. It should be noted that the CUDA driver and runtime on the nodes must be supported by whatever software you install in your virtualenv.


$ cd /proj/your_proj_dir/users/your_username
$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ python --version
Python 3.6.8
(myenv) $ pip install --upgrade pip
... (snip) ...
(myenv) $ module load buildenv-gcccuda/11.4-8.3.1-bare
(myenv) $ pip3 install mpi4py

Jupyter Notebook

Jupyter notebooks can be used via ThinLinc (see Running Graphical Applications) or SSH port forwarding.

Assume that you have installed Jupyter Notebook in your conda environment. On a compute node, load the Anaconda module and activate your environment:

module load Anaconda/2021.05-nsc1
conda activate myenv_example

Start a Jupyter notebook with the no-browser flag:

(myenv_example) [x_abcde@node021 ~]$ jupyter-notebook --no-browser --ip=node021 --port=9988

Please use the --ip flag to specify the node that you are working on.

You will see the following info printed out on your terminal:

[I 2023-02-20 12:36:50.238 LabApp] JupyterLab extension loaded from /home/x_abcde/.conda/envs/myenv_example/lib/python3.10/site-packages/jupyterlab
[I 2023-02-20 12:36:50.238 LabApp] JupyterLab application directory is /proj/proj_name/x_abcde/.conda/envs/myenv_example/share/jupyter/lab
[I 12:36:50.243 NotebookApp] Serving notebooks from local directory: /home/x_abcde
[I 12:36:50.243 NotebookApp] Jupyter Notebook 6.5.2 is running at:
[I 12:36:50.243 NotebookApp] http://node021:9988/?token=xxxx
[I 12:36:50.243 NotebookApp]  or
[I 12:36:50.243 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:36:50.256 NotebookApp] 
    To access the notebook, open this file in a browser:
    Or copy and paste one of these URLs:

In a second terminal, open an SSH tunnel:

ssh -N -L localhost:9988:node021:9988

Start your favorite browser on your local computer and paste the URL given by jupyter-notebook on Berzelius, with your own token.



Note: The port 9988 is arbitrary. If 9988 is already in use, then just try 9989, etc.
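That trial-and-error search can be automated. A small helper (hypothetical, not an NSC-provided tool) that prints the first free TCP port at or above a chosen start port:

```shell
# Find the first TCP port >= the given start port that is free on this host
find_free_port() {
    python3 - "$1" <<'EOF'
import socket, sys

port = int(sys.argv[1])
while True:
    with socket.socket() as s:
        try:
            s.bind(("", port))   # succeeds only if the port is free
            print(port)
            break
        except OSError:
            port += 1            # in use; try the next one
EOF
}
PORT=$(find_free_port 9988)
echo "Use --port=$PORT for jupyter-notebook and the SSH tunnel"
```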

Build Environment

A basic build environment, buildenv-gcccuda/11.4-8.3.1-bare, is available for those who may need to build software for the RHEL 8 environment on Berzelius. For instance, if you need to build the mpi4py Python package or CUDA dependent Python packages, have this module loaded when building.

The build environment is based on the system GCC (8.3), CUDA 11.4, OpenMPI 4.1.1, OpenBLAS 0.3.15, FFTW 3.3.9 and ScaLAPACK 2.1.0. Please report any problems using it to support.

Running Graphical Applications

Inevitably there comes a need to use graphical applications on the cluster. The NSC recommended way to run graphical applications on Berzelius is via the ThinLinc VNC solution (TL), which is leveraged to provide a remote desktop interface on Berzelius, see Running graphical applications for more information. The TL remote desktop presented is XFCE due to it being light on server resources. It will provide a much better user experience for modern GUIs than X-forwarding, although X-forwarding is not prohibited. In addition, it provides session management, allowing the users to disconnect from the session while running processes are kept running, like a GUI version of terminal multiplexers like GNU screen.

The ThinLinc client is available free of charge from Cendio, with packages for the major OS platforms (Linux, macOS and Windows). Applications of particular interest to Berzelius users which benefit from use via TL include:

  • Firefox and consequently
    • Jupyter notebooks
    • Tensorboard
  • Integrated Development Environments, for instance PyCharm
  • Debuggers and Profilers like NVIDIA Nsight-systems

Submitting and Running Jobs

Berzelius is a SuperPOD compute cluster using SLURM as its resource manager. General SLURM documentation or the man pages should be valid for making resource allocations on Berzelius. Other NSC SLURM documentation is likely useful but may not apply in all parts.

The wall time limit has been set to 3 days (72h) to ensure that there is reasonable turnover of jobs. You can extend the wall time of a running job using boost-tools.

There are default allocation settings in place based on the number of GPUs allocated. Currently these are: 1 task using 16 CPU cores and 127 GB RAM for every GPU allocated. This allows you to specify only the number of GPUs your job requires (e.g. --gpus=2, which gets you 32 CPU cores and 250 GB RAM as well as the 2 GPUs), and you will get a sensible allocation for most circumstances.

In addition, the default wall time allocated for jobs is 2h. Remember to override this with -t <timestring> when needed. All of these defaults are overridden by user switches when provided.

If "thin" or "fat" nodes are desired this can be specified with the feature flags -C "fat" and -C "thin". If this is not specified in the submission both types of nodes may be used by the job.

Submitting Jobs

Be aware that there are many ways to give conflicting directives when allocating resources, which will result in either not getting the allocation or getting the wrong allocation. Before submitting a resource-intensive batch job, it is worthwhile to check the settings in an interactive session, verifying that the resources are properly allocated.

In general, if you need to specify more than just the number of GPUs your job requires, a recommended way to allocate is via the switch combination -n X --gpus=Y (plus other needed switches, like -t), where X is the number of tasks (CPU cores) and Y the number of GPUs available to your X tasks. This pattern seems to work in most circumstances to get you what you expect. In some circumstances you may also need to specify -c Z, where Z is the number of hyperthreads allocated to each task (there are two hyperthreads per physical CPU core), but this should not normally be required.
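A minimal job-script header following that -n X --gpus=Y pattern (the task and GPU counts, wall time and program name are placeholders; adjust them to your job):

```shell
#!/bin/bash
#SBATCH -n 2          # X: number of tasks
#SBATCH --gpus=2      # Y: GPUs available to those tasks
#SBATCH -t 01:00:00   # wall time; the default is only 2 h
# Uncomment if you also need to pin hyperthreads per task (normally not needed):
##SBATCH -c 32

srun ./my_program     # placeholder for your own executable
```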

When allocating a partial node, NSC's recommendation is to allocate tasks (CPU cores) in proportion to the fraction of the node's total GPUs you allocate, i.e. for every GPU (1/8 of a node) you allocate, you should also allocate 16 tasks (1/8 = 16/128). The default memory allocation follows that of the cores, i.e. for every CPU core allocated, 7995 MB of RAM (a little less than 1/128th of the node's total RAM) is allocated. This is automatically taken care of when using only the switch --gpus=X, as in the examples of the quick start guide. On "fat" nodes, twice the amount of memory is allocated if the job used the -C "fat" feature flag.
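The proportional defaults quoted above can be checked with a quick calculation, here for a hypothetical 2-GPU job:

```shell
# Default CPU and memory allocation implied by a 2-GPU job
GPUS=2
CORES=$((GPUS * 16))        # 16 cores per GPU (1/8 of the node's 128 cores)
MEM_MB=$((CORES * 7995))    # 7995 MB RAM per core
echo "$GPUS GPUs -> $CORES cores, $MEM_MB MB RAM"   # roughly 250 GB for 2 GPUs
```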

Jobs using the reservation with MIG GPUs (--reservation=3g.20gb) will always receive a fixed amount of resources per job (e.g. only 1 MIG GPU per job). Even if you request a particular amount of resources, the job will start with the fixed amount. If your job requires more resources than the fixed amount, you should not use this reservation.

Please note that when allocating more than 8 GPUs (more than one node), you will end up on several nodes and will need multi-node capabilities in your software setup to make use of all allocated GPUs.

A note about more advanced and multi-node allocations: this version of SLURM has a bug when allocating resources using the flag --gpus-per-task, which does not work as intended. This bug makes, for instance, the switch combination -n 1 --gpus-per-task=1 allocate all GPUs on a node for your job, whether you use them or not. A working switch combination which accomplishes the same thing is -n 1 --ntasks-per-gpu=1. You can use this switch combination with different numbers of tasks and tasks per GPU, for instance -n 32 --ntasks-per-gpu=2, which will allocate a total of two nodes (16 GPUs) with two tasks per GPU. The bug appears to have been fixed in later versions of SLURM, and we will upgrade as soon as possible.

Interactive Work

Interactive work in a shell on the allocated resources can be performed via the NSC-provided script interactive, which is a wrapper around the SLURM commands salloc and srun, and as such accepts all switches available to salloc. For reference, see the NSC documentation under the "Interactive jobs" heading.

For specific examples of the use of interactive, see for instance the Quick start guide above.

Batch Jobs

Batch jobs are supported in the standard SLURM way; see the SLURM sbatch documentation for a guide.

Single Node Jobs

Jobs run on a single node should be straightforward in both Apptainer container environments and the host OS. Make sure you have the allocated resources and, for Apptainer containers, that you remembered to import the GPUs into the container (using the --nv switch).

Multi-node Jobs

Multi-node jobs for regular MPI-parallel applications should be pretty standard on the cluster, and can use the common mpirun (or mpiexec) launcher available via the buildenv-gcccuda/11.4-8.3.1-bare module or srun --mpi=X (supported X by SLURM are pmi2, pmix and pmix_v3). If your application has been built with NSC provided toolchain(s) you should also be able to launch it with mpprun in standard NSC fashion.

Multi-node jobs using GPUs can be challenging when running in an Apptainer container. For NVIDIA NGC containers used with Apptainer, you can possibly launch your job using the mpirun provided by loading the buildenv-gcccuda/11.4-8.3.1-bare module.

Otherwise, SLURM on Berzelius has support for launching Enroot containers directly using srun, which should work for at least NVIDIA NGC containers of recent date used to build your Enroot container. Enroot containers are similar to Apptainer containers but don't require superuser privileges to build or modify. Enroot containers can be built on the compute nodes of the cluster, but not on the login nodes.

NSC boost-tools

NSC boost-tools are an attempt to add more flexibility to the job scheduling.

We currently provide three tools:

  • nsc-boost-priority: Increase the priority of a job.
  • nsc-boost-timelimit: Extend the time limit of a job.
  • nsc-boost-reservation: Reserve nodes for a specific time period.

Please refer to the NSC boost-tools page for the usage instructions.


As the demand for time on Berzelius is high, we need to ensure that allocated time is used efficiently. The efficiency of running jobs is monitored continuously by automated systems. Users can, and should, monitor how well their jobs are performing. Users working via ThinLinc can use the tool jobgraph to check how individual jobs are performing. For example, jobgraph -j 12345678 will generate a png file with data on how job 12345678 is running. Please note that for job arrays you need to look at the "raw" job ID. It is also possible to log onto a node running a job using jobsh -j $jobid and then use tools such as nvidia-smi to look at statistics in more detail.

In normal use, we expect jobs properly utilizing the GPUs to draw 200 W, with most AI/ML workloads above 300 W. The average idle power for the GPUs is about 52 W.

Automatic termination of jobs

Particularly inefficient jobs will be automatically terminated. The following criteria are used to determine if a job should be canceled:

  • Average power utilization per GPU from the start of the job is below 60 W. This is slightly above our measured idle level of 52 W. In normal use, we expect jobs properly utilizing the GPUs to draw 200 W, with most AI/ML workloads above 300 W.
  • The job is scheduled without any GPU.

Exceptions from this are

  • Jobs that have not yet run for one hour, to allow for some preprocessing at the beginning of jobs,
  • Interactive jobs, started with the NSC interactive tool,
  • Jobs running within reservations, including devel,
  • Explicitly whitelisted projects (please contact us if you believe this applies to you).

These criteria are slightly simplified and will be made stricter over time. In particular, we expect to raise the power utilization criterion above 60 W at some point in the future. Users are informed about canceled jobs hourly, to avoid sending excessive mail to users whose job arrays are canceled.

Data Storage

The shared storage and data transport fabric on Berzelius are very high performance, and should suffice for most IO loads, specifically data-intensive AI/ML loads.

This is especially the case when the data sets are well formatted. Examples of good such formats are TFRecords (from TensorFlow), RecordIO (from MXNet) and Petastorm (from Uber).

The use of datasets in these formats can greatly reduce IO-wait time on the GPU compared to raw file system access, and will also reduce load on the shared storage. NSC highly recommends that you store and use your data sets using some such format.

Shared Storage

There are two shared storage areas set up for use: /home/$USER and /proj/<your_project_dir>/users/$USER. The /home/$USER area is backed up (nightly) and small, with a 20 GB quota per user, and is only meant for data you cannot put under /proj. The standard quota for the /proj directory is 5,000 GiB and 5 million files, but this can be increased, either at the time you apply for the project or in a complementary application at a later stage.

Node Local Storage

High performance NVMe SSD node local storage is available on each compute node.

There are a few points to note with respect to the available node local storage:

  • For every job, node local scratch space is mounted under /scratch/local.
  • Jobs cannot access another job's /scratch/local when several jobs are sharing a node.
  • Each job's /scratch/local is erased between jobs. Data not saved (e.g. moved to somewhere under /proj) by the end of a job is lost, with no way to get it back.
  • If you need to use it for your datasets, store your dataset as uncompressed tar archives, preferably split into many parts, and unpack them in parallel; this will increase your data transfer speed tremendously compared to a single process. Example:

      # 144 GB ILSVRC 2012 data set in TFRecord format split in 128 tar archives
      # unpacked with 16 parallel workers to /scratch/local. A single worker takes
      # 106s to do the same task.
      [raber@node001 ILSVRC2012]$ time ls *.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f
      real  0m16.763s
      user  0m3.192s
      sys   8m4.740s
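Since /scratch/local is wiped when the job ends, it is good practice to make stage-out of results an explicit final step of your job script. Below is a minimal sketch of that pattern; the temporary directories stand in for /scratch/local and your /proj area, so the paths here are placeholders rather than actual Berzelius paths.

```shell
# Sketch of the stage-out pattern: compute in fast node local scratch, then
# copy results to persistent storage before the job ends. SCRATCH and PROJ
# are stand-ins for /scratch/local and /proj/<your_project_dir>/users/$USER.
SCRATCH=$(mktemp -d)
PROJ=$(mktemp -d)

# ... your training/computation writes its output under $SCRATCH ...
mkdir -p "$SCRATCH/results"
echo "model checkpoint" > "$SCRATCH/results/checkpoint.txt"

# Stage out: copy everything you want to keep before the job finishes.
cp -r "$SCRATCH/results" "$PROJ/"

ls "$PROJ/results"   # the copy survives; $SCRATCH is wiped after the job
```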


Quotas and your current usage can be checked with the command nscquota. Complementary requests for increased storage allocation can be made in SUPR if you find you need it. If in doubt about how to do this, please contact support.

Efficient dataset transfer from shared storage to node local storage

You may occasionally need to use node local storage under /scratch/local to avoid starving the GPUs of data to work with, and hence improve the efficiency of your job. There may also be situations where you cannot work with efficient data formats in your preferred framework/application, or where you need to reduce the number of files stored on the Berzelius shared storage by using archive formats such as .tar, therefore requiring transfer and unpacking to node local storage to conduct your work.

In these situations, an efficient and well performing means of handling the data transfer from the shared project storage to local disc is essential to avoid excessive startup times of your job. This can be accomplished using .tar archive files (other archive formats can be used as well) comprising your data set paired with parallel unpacking. There are only two steps needed to do this on Berzelius:

  1. Partition and pack your dataset into multiple .tar archives, balanced in terms of size and number of files. For performance reasons, it is important not to use compression here, as decompression becomes a serious bottleneck when unpacking in the next step.
  2. On an allocated compute node, unpack these multiple .tar files using parallel threads directly to the local scratch disc.
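If you want to rehearse these two steps before running them on a large dataset (or prefer plain tar over Fpart for a small one), the same partition-then-unpack flow can be sketched with standard tools only. Everything below is self-contained in temporary directories; on Berzelius the unpack target would be /scratch/local instead.

```shell
# Step 0: create a small synthetic "dataset" of 64 files in 8 subdirectories.
DATASET=$(mktemp -d); TARS=$(mktemp -d); DEST=$(mktemp -d)
for d in $(seq 1 8); do
  mkdir "$DATASET/part$d"
  for f in $(seq 1 8); do echo "data $d-$f" > "$DATASET/part$d/file$f"; done
done

# Step 1: partition into multiple uncompressed .tar archives, here simply one
# per subdirectory. For real datasets, balance archives by size and file count.
cd "$DATASET"
for d in part*; do tar -c -f "$TARS/$d.tar" "$d"; done

# Step 2: unpack all archives in parallel (4 workers) to the destination,
# standing in for /scratch/local on a compute node.
ls "$TARS"/*.tar | xargs -n 1 -P 4 tar -x -C "$DEST" -f

find "$DEST" -type f | wc -l   # all 64 files restored
```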


In this example we use a synthetic data set of 579 GB comprising around 4.7 million files of 128 kB each. The performance figures reported below should serve as a good indicator of what you can expect in your individual case if you scale appropriately, e.g. a twice as large dataset should take roughly twice as long to transfer.

# Partition and create the .tar archives

$ module load Fpart/1.5.1

# Trying to get a reasonable number of .tar archives with our data set, we here
# use a .tar archive max size 2350MB and max number of files of 19000. Test
# parameters out for your own situation, don't blindly use these. In the
# example case the command finished in 22 minutes using 8 parallel processes.

$ fpsync -n 8 -m tarify -s 2350M -f 19000 /absolute/path/to/dataset/ /absolute/path/to/tar/files/

On an allocated compute node you can then unpack the archive files in parallel from shared storage to node local disc using the command below.

# Pipe a listing of the .tar files to xargs, which here uses 8 parallel worker
# threads unpacking different individual .tar archive files. For the example,
# this command finished in just over 2 minutes

you@node0XX $ ls /absolute/path/to/tar/files/*.tar | xargs -n 1 -P 8 tar -x -C /scratch/local/ -f

Depending on the size and number of files in the dataset .tar archives, as well as the workload of the involved file systems, you are likely to see some variation in unpack times. On a quiescent (i.e. only one user) compute node using the example dataset we measured the following unpack times with various numbers of parallel workers:

Workers (N)   Unpack time
2             5m14s
4             3m8s
8             2m2s
16            2m12s
32            2m26s
64            2m39s
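To find the sweet spot for your own dataset, you can time the unpack at a few worker counts and stop increasing -P once the times level off, as in the table above. A self-contained sketch using small dummy archives follows; in practice you would substitute your own .tar directory and /scratch/local.

```shell
# Create 16 small dummy .tar archives to stand in for a real archive set.
TARS=$(mktemp -d)
for i in $(seq 1 16); do
  D=$(mktemp -d); echo "payload $i" > "$D/file$i"
  tar -c -f "$TARS/a$i.tar" -C "$D" "file$i"
done

# Time the parallel unpack at increasing worker counts to find the point
# of diminishing returns for this archive set.
for P in 2 4 8; do
  DEST=$(mktemp -d)
  echo "workers: $P"
  time ls "$TARS"/*.tar | xargs -n 1 -P "$P" tar -x -C "$DEST" -f
done
```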

While it is of course possible to transfer your data set in other ways, we recommend this approach as it is highly performant as well as "nice" to the shared file system: keeping track of fewer files conserves resources on the shared file system servers.

Verifying data integrity

Before removing the original unpacked data set from Berzelius shared storage, it is prudent to verify the data integrity of the .tar archives. The approach is to 1) create a checksum file of the dataset in its top-level directory, and 2) unpack the .tar archives somewhere and verify the unpacked files against the original checksums. For our purposes, the md5sum tool produces sufficiently good checksums. Here's one way of doing it:

Create an md5sum listing on the original data set

$ cd /absolute/path/to/dataset/
$ find . -type f ! -name my_dataset_md5sums.txt -exec md5sum '{}' \; | tee my_dataset_md5sums.txt

Assuming the my_dataset_md5sums.txt file is included at the top level of the unpacked dataset as above, check its data integrity by, for instance, doing:

$ cd /path/to/unpacked/dataset/
$ md5sum --quiet --check my_dataset_md5sums.txt && echo "Verification SUCCESSFUL" || echo "Verification FAILED"
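The whole checksum cycle can be rehearsed end to end on a throwaway dataset before you trust it with real data. This self-contained sketch creates a small dataset in a temporary directory, checksums it, copies it (standing in for transfer and unpacking), and verifies the copy:

```shell
# Rehearse the checksum roundtrip: checksum the original dataset, "transfer"
# a copy of it, then verify the copy against the checksum list.
ORIG=$(mktemp -d); COPY=$(mktemp -d)
echo "sample A" > "$ORIG/a.txt"
echo "sample B" > "$ORIG/b.txt"

# Create the checksum list, excluding the list file itself.
cd "$ORIG"
find . -type f ! -name my_dataset_md5sums.txt -exec md5sum '{}' \; > my_dataset_md5sums.txt

# "Transfer" the dataset (stand-in for packing, moving and unpacking it).
cp -r "$ORIG"/. "$COPY"/

# Verify the copy against the original checksums.
cd "$COPY"
md5sum --quiet --check my_dataset_md5sums.txt && echo "Verification SUCCESSFUL"
```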
