Berzelius Getting Started

Introduction

Berzelius is an AI/ML-focused compute cluster permitting scale-out compute jobs that aggregate the computational power of up to 752 NVIDIA A100 GPUs. The interconnect fabric provides RDMA-capable, non-blocking connections between all of these GPUs, with a bandwidth of 200 GB/s and latencies on the order of microseconds between any two endpoints. This makes several hundred (AI) petaflops available to individual jobs for certain workloads. Berzelius currently runs Red Hat Enterprise Linux release 8.8.

The Berzelius User Guides mainly consist of four parts:

  1. Berzelius Getting Started (the current page)
  2. Berzelius Resource Allocations
  3. Berzelius GPU User Guide
  4. Berzelius Software Guide

Getting Access to Berzelius

Berzelius Resource Allocation Policy

Berzelius is available to Swedish academic researchers as described in Project Applications and Resource Allocations on Berzelius.

Submitting Project Proposals in SUPR

Project applications are made in SUPR (Swedish User and Project Repository) in an application round called "LiU Berzelius", available via the menu drill-down "Rounds" --> "AI/ML" --> "LiU Berzelius". Project membership can be added or requested using the "Add Project Members" and "Request Membership in Project" functions in SUPR.

Getting a Login Account

To log in to Berzelius, you need a login account. Please refer to Getting a login account for how to get one.

Login to Berzelius

On Berzelius, there are two types of specialised nodes serving different roles.

Login nodes

A login node is a server specifically designated for user access and interaction with the cluster. It serves as the entry point for users who want to submit jobs, access data, compile code, and perform other tasks related to HPC computing.

It's important to note that the login node is typically not intended for computationally intensive tasks. The actual heavy computation is offloaded to the compute nodes, which are dedicated to running user jobs.

On Berzelius, the login nodes are called berzelius1.nsc.liu.se and berzelius2.nsc.liu.se. They can also be reached via berzelius.nsc.liu.se, which will assign you to one of the two.

Compute nodes

Compute nodes are the workhorses of an HPC cluster. They are dedicated server nodes designed for executing computationally intensive tasks, simulations, data analysis, and other high-performance computing workloads. Compute nodes typically make up the bulk of the resources in an HPC cluster and are responsible for performing the actual computations requested by users.

These nodes on Berzelius are named node001-node094. Nodes node001-node060 are the "thin" nodes, and nodes node061-node094 are the "fat" nodes.

Node Type   GPUs               CPUs                 RAM    VRAM/GPU   Local SSD
Thin        8 x NVIDIA A100    2 x AMD Epyc 7742    1 TB   40 GB      15 TB
Fat         8 x NVIDIA A100    2 x AMD Epyc 7742    2 TB   80 GB      30 TB

Login via SSH

Assuming you have received your Berzelius account credentials, you can use SSH (Secure Shell) to log in to a Berzelius login node from your local computer. You will be prompted for a 2FA verification code after entering your password.

ssh username@berzelius1.nsc.liu.se
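
If you log in frequently, you can optionally add a host alias on your local computer. The snippet below is a minimal sketch of an entry in ~/.ssh/config; the alias name berzelius and the user name username are placeholders to replace with your own choices.

# Example entry in ~/.ssh/config on your local computer
Host berzelius
    HostName berzelius1.nsc.liu.se
    User username

With this in place, "ssh berzelius" is equivalent to the command above; you will still be asked for your password and 2FA code.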

Login via ThinLinc

ThinLinc is a remote desktop server software designed to provide secure and efficient remote access to Linux and UNIX desktop environments and applications. ThinLinc is the recommended way to run graphical applications on Berzelius, and it provides a much better user experience for modern GUIs than X-forwarding. In addition, ThinLinc provides session management, allowing users to disconnect from a session while their running processes keep running.

The ThinLinc client is available free of charge and has packages for the major OS platforms (Linux, macOS and Windows). See Running graphical applications for more information.

Data Storage on Berzelius

The shared storage and the data transport fabric on Berzelius are high performance and should suffice for most I/O loads, in particular data-intensive AI/ML workloads.

This is especially the case when the datasets are stored in suitable formats. Examples of such formats are TFRecords (from TensorFlow), RecordIO (from MXNet) and Petastorm (from Uber).

Using datasets in these formats can greatly reduce I/O wait time on the GPUs compared to raw file system access, and also reduces the load on the shared storage. NSC highly recommends that you store and use your datasets in one of these formats.

Shared Storage

There are two shared storage areas set up for use:

  • the home directory /home/$USER, backed up nightly and small (20 GB quota per user)
  • the project directory /proj/<your_project_dir>/users/$USER

Node Local Storage

High performance NVMe SSD node local storage is available on each compute node. There are a few points to note with respect to the available node local storage.

  • For every job, node local scratch space is mounted under /scratch/local.
  • Separate jobs can't access another job's /scratch/local when several jobs are sharing a node.
  • Each job's /scratch/local is erased between jobs. Data not saved elsewhere (e.g. moved to somewhere under /proj) before the end of a job is lost and cannot be recovered.
  • If you need to use it for your datasets, store the dataset as uncompressed tar archives, preferably split into many parts, and unpack them in parallel; this increases transfer speed tremendously compared to a single process. Example:

      # 144 GB ILSVRC 2012 data set in TFRecord format split in 128 tar archives
      # unpacked with 16 parallel workers to /scratch/local. A single worker takes
      # 106s to do the same task.
      [raber@node001 ILSVRC2012]$ time ls *.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f
      real  0m16.763s
      user  0m3.192s
      sys   8m4.740s

Efficient Dataset Transfer from Shared Storage to Node Local Storage

Occasionally you need to use node local storage under /scratch/local to avoid starving the GPUs of data to work with, and hence bring up the efficiency of your job. There may also be situations where you cannot work with efficient data formats in your preferred framework or application, and where you need to reduce the number of files stored on the Berzelius shared storage by using archive formats such as .tar, which then requires transferring and unpacking the data to node local storage to conduct your work.

In these situations, an efficient and well-performing way of transferring data from the shared project storage to local disk is essential to avoid excessive job startup times. This can be accomplished using .tar archive files (other archive formats work as well) containing your dataset, paired with parallel unpacking. Only two steps are needed on Berzelius:

  1. Partition and pack your dataset into multiple .tar archives, balanced in terms of size and number of files. For performance reasons, it is important not to use compression here, as decompression becomes a serious bottleneck when unpacking in the next step.
  2. On an allocated compute node, unpack these .tar files in parallel threads directly to the local scratch disk.

Example

In this example we use a synthetic dataset of 579 GB comprising around 4.7 million files of 128 kB each. The performance figures reported below for this dataset should serve as a good indicator of what you can expect in your own case if you scale them appropriately, e.g. a twice as large dataset should take roughly twice as long to transfer.

# Partition and create the .tar archives

$ module load Fpart/1.5.1-gcc-8.5.0

# Trying to get a reasonable number of .tar archives with our data set, we here
# use a .tar archive max size 2350MB and max number of files of 19000. Test
# parameters out for your own situation, don't blindly use these. In the
# example case the command finished in 22 minutes using 8 parallel processes.

$ fpsync -n 8 -m tarify -s 2350M -f 19000 /absolute/path/to/dataset/ /absolute/path/to/tar/files/
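
As an optional sanity check (plain coreutils commands; the paths are the same placeholders as above), you can verify that the partitioning produced a reasonable number of archives of roughly even size:

# Count the archives and inspect their total and largest individual sizes
$ ls /absolute/path/to/tar/files/*.tar | wc -l
$ du -sh /absolute/path/to/tar/files/
$ du -h /absolute/path/to/tar/files/*.tar | sort -h | tail -n 5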

On an allocated compute node you can then unpack the archive files in parallel from shared storage to node local disk using the command below.

# Pipe a listing of the .tar files to xargs, which here uses 8 parallel worker
# threads unpacking different individual .tar archive files. For the example,
# this command finished in just above 2 mins

you@node0XX $ ls /absolute/path/to/tar/files/*.tar | xargs -n 1 -P 8 tar -x -C /scratch/local/ -f

Depending on the size and number of files in the dataset .tar archives, as well as the load on the involved file systems, you are likely to see some variation in unpack times. On a quiescent (i.e. single-user) compute node using the example dataset, we measured the following unpack times with various numbers of parallel workers:

Parallel workers (N)   Unpack time
 2                     5m14s
 4                     3m8s
 8                     2m2s
16                     2m12s
32                     2m26s
64                     2m39s

While it is of course possible to transfer your dataset in other ways, we recommend this approach as it is highly performant as well as "nice" to the shared file system: keeping track of fewer files conserves resources on the shared file system servers.

Verifying Data Integrity

Before removing the original unpacked dataset from Berzelius shared storage, it is prudent to verify the integrity of the .tar archives. The approach is to 1) create a checksum file of the dataset in its top-level directory, and 2) unpack the .tar archives somewhere and verify the unpacked files against the original checksums. For this purpose, the md5sum tool produces sufficiently good checksums. Here is one way of doing it:

Create an md5sum listing on the original data set

$ cd /absolute/path/to/dataset/
$ find . -type f -exec md5sum '{}' \; | tee my_dataset_md5sums.txt
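
For a dataset with millions of files, a single md5sum process can take a long time. One possible alternative, preferably run on an allocated compute node, is to checksum chunks of the file list in parallel. This is a sketch assuming GNU split and xargs; the chunk count 8 and the /tmp paths are arbitrary choices, and each worker writes to its own output file to avoid interleaved lines:

$ cd /absolute/path/to/dataset/
$ find . -type f > /tmp/filelist.txt
$ split -n l/8 /tmp/filelist.txt /tmp/filelist.part.
$ for part in /tmp/filelist.part.??; do xargs -r -d '\n' -a "$part" md5sum > "$part.md5" & done; wait
$ cat /tmp/filelist.part.??.md5 > my_dataset_md5sums.txt

The resulting my_dataset_md5sums.txt can then be used for verification exactly as below.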

Assuming the my_dataset_md5sums.txt file is included at the top level of the unpacked dataset as above, check its integrity by, for instance, running:

$ cd /path/to/unpacked/dataset/
$ md5sum --quiet --check my_dataset_md5sums.txt && echo "Verification SUCCESSFUL" || echo "Verification FAILED"

Quotas

Quotas and your current usage can be checked with the command nscquota.

The standard quota is as follows:

  • The /home/$USER directory: 20 GB, 1 million files.
  • The /proj/your_project_dir directory: 2 TB, 2 million files.

The quota for the project directory can be increased, either at the time you apply for the project or as a complementary application at a later stage in SUPR.

Data Transfer from/to Berzelius

For data transfers between Berzelius and your local computer, please use scp or rsync. Other file transfer tools (e.g. FileZilla, WinSCP) using the SCP or SFTP protocols should work as well.

Always upload large datasets to your /proj directory and never to /home, since the /home quota is only 20 GB, see Data Storage.

Transfer using the Command Line

  • Transfer from your local computer to Berzelius
# To transfer a file
scp /your_local_dir/dataset.tar username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/
rsync -av /your_local_dir/dataset.tar username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/


# To transfer a directory
scp -r /your_local_dir/dataset username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/
rsync -av /your_local_dir/dataset username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/
  • Transfer from Berzelius to your local computer
# To transfer a file
scp username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/results.tar /your_local_dir/
rsync -av username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/results.tar /your_local_dir/

# To transfer a directory
scp -r username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/results /your_local_dir/ 
rsync -av username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/results /your_local_dir/ 
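
For large transfers over less reliable connections, it can help to let rsync keep partially transferred files so that an interrupted transfer can simply be re-run and resumed. A minimal sketch using standard rsync options (paths as in the examples above):

# -P is shorthand for --partial --progress: show progress and keep partial files,
# so re-running the same command continues roughly where the transfer stopped
rsync -avP /your_local_dir/dataset.tar username@berzelius1.nsc.liu.se:/proj/your_proj/users/username/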

Transfer using FileZilla

FileZilla is a popular open-source FTP (File Transfer Protocol), FTPS (FTP Secure), and SFTP (SSH File Transfer Protocol) client that allows you to transfer files between your local computer and remote servers. It provides a user-friendly graphical interface for managing file transfers and is available for Windows, macOS, and Linux. Please refer to the FileZilla User Guide.

Modules and Build Environment

Modules

All software external to the RHEL OS is installed under the /software directory and is made conveniently available via the module system. You are very welcome to make feature requests via support, so that we can tailor the Berzelius user environment to be most effective for you.

  • Check module availability

    module avail
  • To load a module:

    module load Anaconda/2021.05-nsc1
  • To remove a module:

    module rm Anaconda/2021.05-nsc1
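
Two further module commands are often useful; these are generic commands in common module systems rather than anything Berzelius-specific:

  • To list the modules you currently have loaded:

    module list
  • To unload all loaded modules:

    module purge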

Build Environment

A basic build environment is available via the module buildenv-gcccuda/11.4-8.3.1-bare for those who may need to build software for the RHEL 8 environment on Berzelius. The build environment is based on the system GCC (8.3), CUDA 11.4, OpenMPI 4.1.1, OpenBLAS 0.3.15, FFTW 3.3.9 and ScaLAPACK 2.1.0. For instance, if you need to build the mpi4py Python package or CUDA dependent Python packages, have this module loaded when building.
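
As a hedged sketch of such a build (assuming the buildenv module puts the MPI compiler wrappers on your PATH, as its OpenMPI component suggests, and using a plain Python virtual environment; the paths and the choice of mpi4py are only illustrative), the workflow could look like:

# Load compilers, CUDA and OpenMPI for the build
module load buildenv-gcccuda/11.4-8.3.1-bare

# Build and install mpi4py from source into a personal virtual environment
python3 -m venv ~/envs/mpi4py-env
source ~/envs/mpi4py-env/bin/activate
pip install --upgrade pip
pip install --no-binary :all: mpi4py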

System Status

Please check Berzelius system status on this page. If something is not working, please don't hesitate to contact us.

User Support

Mail any support issues to berzelius-support@nsc.liu.se or use the interface available in SUPR. Please report the following information when you encounter problems and obstacles:

  • A general description of the problems
  • Job IDs
  • Error messages
  • Commands to reproduce the error messages

The support mail address is also the interface for requesting new features on Berzelius, and we can bring in the Berzelius vendor Atos, or NVIDIA, should an issue require extra support.


Research Projects on Berzelius

Berzelius hosts research projects in many different fields. If you are a Berzelius user and want your research to be featured on our website, please contact berzelius-support@nsc.liu.se and we'll be happy to have you.

Frequently Asked Questions

Here is a FAQ page for common questions.

Our FAQ is a collaborative effort. If you have a question that you believe should be included in our FAQ, or if you'd like to contribute an answer, please contact us. We value your input and are dedicated to making this resource as helpful as possible.

Acknowledgement

We kindly ask that you acknowledge Berzelius and NSC in scientific publications that required the use of Berzelius resources, services or expertise. This helps us ensure continued funding and maintain service levels. Please see Acknowledgment suggestion for examples.
