Berzelius - Getting Started Guide


Berzelius is an AI/ML-focused compute cluster permitting scale-out compute jobs that aggregate the computational power of up to 752 NVIDIA A100 GPUs. The interconnect fabric provides non-blocking RDMA connections between all of these GPUs, with a bandwidth of 200 GB/s and latencies on the order of microseconds between any two endpoints. This makes several hundred (AI) petaflops available to individual jobs for certain workloads. The resource is available to Swedish academic researchers as described in Project applications and Resource Allocations on Berzelius.

Compute Cluster Basics

At its core, the Berzelius SuperPOD is a compute cluster running the Linux operating system, specifically Red Hat Enterprise Linux 8 (RHEL8). As such, most examples in the Berzelius documentation at NSC use the command line interface (CLI), since CLI instructions can be copied verbatim by users and examples are easy to follow with less room for mistakes. The CLI is also an extremely powerful tool enabling high productivity, and is an inescapable part of any HPC environment. Note that there are some differences between a typical desktop Linux environment and an HPC environment, e.g. you cannot use sudo to install things.

What is a Compute Cluster

A compute cluster enables parallel computations spanning interconnected compute units (nodes), i.e. compute servers joined by a messaging interconnect, by providing the means to orchestrate the compute units' work via some software framework(s) supported by the cluster.

The premise for this to work is that your software has already been adapted to one of these frameworks, i.e. that a parallel solution to the computational problem under study has been formulated using the parallel computation framework. Some common parallel frameworks are MPI, NCCL and Apache Hadoop. On Berzelius, MPI and NCCL are supported, as these are the completely dominant parallel frameworks for the architecture and purpose of the cluster. There are currently no plans to support other frameworks.

At the single compute server level, there is also a different parallel paradigm available on modern clusters, since each server has many CPU compute cores (and GPUs, as on Berzelius) usable via POSIX or OpenMP threads. Since Berzelius has very high-end compute servers, a very large category of problems will never need to scale out beyond a single compute node, and can make effective use of this level of parallelism without the additional complexities of multi-node parallelism.

Login nodes and compute nodes

At a higher level, the nodes of a cluster are specialised to serve different roles:

Login nodes
Typically a few nodes accessible from the internet. This is where you end up when accessing the cluster via SSH or ThinLinc, for instance, and where you transfer files to. These nodes are not meant for heavy computations, as they are shared by all users accessing the cluster, but are used to request compute resources for compute-intensive jobs (and to build code, edit scripts, etc.). On Berzelius there are two login nodes; they can also be reached via a common address, from which you will be assigned to one of the two.
Compute nodes
These nodes constitute the bulk of the cluster, and are where compute jobs requested from the resource manager are carried out. On Berzelius these nodes are named node001-node094. Nodes node061-node094 are the newer "fat" nodes with more HBM2 VRAM.

In addition, it is very common for compute clusters to have a shared storage file system served by a separate specialized cluster of storage servers. Berzelius' file systems /proj and /home are served by such a storage cluster.

General Information

Berzelius is currently, to a large degree, a work in progress in terms of user environment and feature maturity. Some general gotchas identified in the period leading up to Berzelius' general availability are:

  • SLURM resource allocation.
    • There are many conflicting sbatch/salloc directives, in particular when it comes to GPUs. The examples provided here are known to work and are adapted to what we think will fit the most common use cases, but if your needs deviate substantially from these and you can't make it work on your own, please contact support (see below).
    • The currently installed version of SLURM does not properly set the CUDA_VISIBLE_DEVICES variable in parallel jobs launched with srun, including the NSC-provided interactive tool. This is fixed in a later SLURM release, which we will roll out as soon as it is available. If the CUDA_VISIBLE_DEVICES variable is needed, please try setting the variable manually and exporting it to any srun-launched processes using the switch --export=ALL,CUDA_VISIBLE_DEVICES, or you can try launching your job with mpirun/mpiexec, in which case you will need to load a buildenv module to make the command available.
  • CUDA driver version. The current CUDA driver on the compute nodes is from the CUDA 12.0 release with compatibility packages for 12.2.
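The CUDA_VISIBLE_DEVICES workaround above can be sketched as follows; the GPU indices and program name are placeholders to adjust to your actual allocation:

```shell
# Manually set the variable for the GPUs of your allocation (placeholder indices)
export CUDA_VISIBLE_DEVICES=0,1
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# In a job step you would then propagate it explicitly, e.g.:
#   srun --export=ALL,CUDA_VISIBLE_DEVICES ./my_gpu_program
```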


Mail any support issues to the Berzelius support address, or use the interface available in SUPR. Please include the following information when you report problems and obstacles:

  • A general description of the problem
  • Job IDs
  • Error messages
  • Commands to reproduce the error messages

The support mail address is also the interface for making feature requests for Berzelius, and we also have the option of bringing in the Berzelius vendor Atos, or NVIDIA, should there be issues where extra support is needed.


We kindly ask that you acknowledge Berzelius and NSC in scientific publications that required the use of Berzelius resources, services or expertise. This helps us ensure continued funding and maintain service levels. Please see Acknowledgment suggestion for examples.

Quick Start Guide

From a cluster functionality perspective, the prototypical workflow on Berzelius consists of:

  1. Transfer your datasets to the cluster
  2. Access the cluster login nodes via SSH/ThinLinc
  3. Request compute resources to
    • perform exploratory, fast-feedback work interactively, and/or
    • execute scripted batch jobs for longer/heavier calculations
  4. Monitor/supervise batch jobs' progress

Login to Berzelius

Assuming you have received your Berzelius account credentials, here is how to log in to a Berzelius login node.

SSH to one of the login nodes (substitute x_abcde with your real user name):

[you@your_computer ~]$ ssh
[x_abcde@berzelius001 ~]$

Resource Allocations Examples

From the login node, we can now allocate compute resources from SLURM for interactive or batch work on a compute node. Resource allocation defaults per GPU are: 1 task, 16 cores per task, 125 GB RAM per task and a default wall time of 2 h. Override the defaults as you see fit.

Example 1: Interactive work

Interactive work using a single GPU and default number of tasks, cores per task, RAM and wall time.

[x_abcde@berzelius001 ~]$ interactive --gpus=1

You can check the GPU resources allocated:

[x_abcde@node001 ~]$ nvidia-smi # Check that the GPU resources are allocated
Fri Mar  3 10:30:59 2023       
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:BD:00.0 Off |                    0 |
| N/A   29C    P0    51W / 400W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

Example 2: Work on a MIG GPU

The A100 GPUs have a Multi-Instance GPU (MIG) feature for splitting a GPU into multiple independent instances. Nodes in the reservation 3g.20gb have this feature enabled, splitting the GPUs in half. All jobs in this reservation receive a fixed amount of resources and cost half as much as a normal job using 1 A100 GPU. If your workload fits in one of these jobs, it is an excellent use of resources.

To achieve the fixed resource allocation per job, some options are overridden during job submission. For example, you will always get 1 MIG GPU and 8 CPU cores, regardless of what you asked for.

[x_abcde@berzelius001 ~]$ interactive --reservation=3g.20gb
[x_abcde@node059 ~]$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-4bc10a41-11e4-0c97-4027-190e9f305d52)
  MIG 3g.20gb     Device  0: (UUID: MIG-37769be6-b4f5-5a17-bd14-6ad61e7c3d62)

Example 3: Interactive work on 2 GPUs, with default resources per GPU for 30 minutes

[x_abcde@berzelius001 ~]$ interactive --gpus=2 -t 30
[x_abcde@node001 ~]$ 

Example 4: Interactive work on 1 DGX-A100 with all its resources (GPU, CPU and memory) exclusively for 6h

[x_abcde@berzelius001 ~]$ interactive -N 1 --exclusive -t 6:00:00
[x_abcde@node001 ~]$ 

Example 5: Submitting a batch script using four GPUs for 3 days

Create a batch script containing:

#!/bin/bash
#SBATCH --gpus 4
#SBATCH -t 3-00:00:00
#SBATCH -A <your-project-account>

# The '-A' SBATCH switch above is only necessary if you are a member of several
# projects on Berzelius, and can otherwise be left out.

# Apptainer images can only be used outside /home. In this example the
# image is located here
cd /proj/my_project_storage/users/$(id -un)

# Execute my Apptainer image, binding in the current working directory
# containing the Python script I want to execute
apptainer exec --nv -B $(pwd) Example_Image.sif python

You can submit the batch work by:

[x_abcde@berzelius001 ~]$ sbatch

Example 6: Interactive work on a "fat" GPU, with default resources per GPU for 30 minutes

Using the feature flag -C "fat", only "fat" nodes will be considered for the job. The corresponding flag for "thin" nodes is -C "thin". If no feature flag is specified, any node can be used. When the "fat" feature is used, the job is assigned 254 GB of system memory per GPU. The GPUh cost of a job on a "fat" GPU is twice that of a "thin" GPU, even if the job was not launched with -C "fat".

[x_abcde@berzelius001 ~]$ interactive --gpus=1 -C "fat" -t 30
[x_abcde@node063 ~]$ 

Resource Allocations Costs

Depending on the type of resources allocated to a job, the cost in GPUh will vary. Using feature flags it is possible to select either "thin" (A100 40GB) or "fat" (A100 80GB) GPUs for a job; a job specifying neither can use either type. MIG GPUs are accessed through the MIG reservation.

GPU          Internal SLURM cost   GPUh cost per hour   Accessed through
MIG 3g.20gb  8                     0.5                  --reservation=3g.20gb
A100 40GB    16                    1                    -C "thin", or no flag
A100 80GB    32                    2                    -C "fat", or no flag

Data Transfer to Berzelius

For data transfers between Berzelius and your local computer, please use scp or rsync. Other file transfer tools (e.g. FileZilla, WinSCP) using SCP or SFTP protocol should likely work as well, but have not been tested.

  • To upload from your local computer to Berzelius
scp /your_local_dir/my_dataset.tar
rsync -av /your_local_dir/unpacked_dataset

Always upload large datasets to your /proj directory and never to /home, since the /home quota is only 20 GB, see Data Storage.

  • To download from Berzelius to your local computer
scp /your_local_dir
rsync -av /your_local_dir

Usability Features of Berzelius

For AI/ML work on Berzelius, where highly complex production environments and high user customizability is more or less required, NSC strongly recommends using a container environment, where Apptainer or Enroot are the supported options. Docker is not supported for security reasons.

Using a container environment will allow a highly portable workflow and reproducible results between systems as diverse as a laptop, Berzelius or EuroHPC resources such as LUMI for instance. It will also bring a level of familiarity of use where a user is free to choose their operating system independently of the host environment.

There are, however, many other options besides containers for managing your compute environment on Berzelius; the most prominent ones are listed below.


All software external to the RHEL OS is installed under the /software directory and is made conveniently available via the module system. Handle your environment by loading modules, e.g. module load Anaconda/2021.05-nsc1, instead of manually adding paths and the like in your ~/.bashrc file. Check module availability with module avail.

Only basic modules are available currently, but more will be added as time progresses. You are very welcome to make feature requests via support, so that we can customize the Berzelius user environment to be most effective for you.


Using Apptainer on Berzelius should not differ appreciably from using it in other settings, except that you will not have root access on Berzelius, i.e. the privilege-escalating sudo command does not work, and apptainer build features relying on user namespaces (but not formally on sudo) are not available.

Whenever root privileges are required (such as using sudo) for your use of Apptainer, build or adapt your images in advance on a system where you have such privileges, and upload the image to your /proj directory (it won't launch under /home). The canonical reference and user guide on the Apptainer website should apply directly to Apptainer's use on Berzelius.

We are very keen to find out if there are any significant differences to using it in other settings. Please report such differences to support.

Points of note about Apptainer on Berzelius:

  • Remember to use the --nv switch to import the GPU devices into your container. This is critical to using NVIDIA GPUs in an Apptainer container.
  • Available on the compute nodes only, so you need to work in a SLURM environment on a node to use it.
  • User namespaces are not enabled on Berzelius for security reasons. This means that the --fakeroot and --userns switches for building apptainer images will not work because they rely on user namespaces being enabled.
  • Available on the compute nodes without loading a module.
  • Should work as you would expect it to on any platform. In particular, parallel runs launched from within a container on a single node should work out of the box.
  • Containers pulled from NVIDIA NGC are especially built and QA tested by NVIDIA to work on an NVIDIA SuperPOD like Berzelius.
  • Potentially you can launch multi-node Apptainer jobs using mpirun from the build environment buildenv-gcccuda/11.4-8.3.1-bare. Otherwise, for multi-node container jobs, you may need to shift focus to the NVIDIA Enroot container solution, which can be used via srun (plus additional switches) on Berzelius.

Please read the Berzelius Apptainer Guide for more details.


Enroot is a simple, yet powerful tool to turn container images into unprivileged sandboxes. Enroot is targeted for HPC environments with integration with the Slurm scheduler, but can also be used as a standalone tool to run containers as an unprivileged user. Enroot is similar to Apptainer, but with the added benefit of allowing users to read/write in the container and also to appear as a root user within the container environment. Please read the Berzelius Enroot Guide for more details.


Anaconda installations are available via the module system, e.g. Anaconda/2021.05-nsc1. They are special NSC installations which only make the conda command available (and don't mess up your ~/.bashrc file and login environment).

The default location for conda environments is ~/.conda in your home directory. This location can be problematic, since these environments can become very large. It is therefore suggested to redirect this directory to your project directory using a symbolic link:

mv ~/.conda /proj/your_proj/users/your_username
ln -s /proj/your_proj/users/your_username/.conda ~/.conda

A basic example for creating a conda environment called myenv with Python 3.8 with the pandas and seaborn packages:

module load Anaconda/2021.05-nsc1
conda create -n myenv python=3.8 pandas seaborn
conda activate myenv

Please see the conda cheatsheet for the most important commands about using conda.

Mamba is a reimplementation of the conda package manager in C++, offering the following benefits:

  • much faster dependency solving
  • parallel downloading of repository data and package files

Berzelius makes mamba (and conda) available via the Mambaforge miniforge distribution, with which users can create, alter and activate conda environments. For example, to set up a customized environment and run a Python program in it:

module load Mambaforge/23.3.1-1
mamba create -n myenv python=3.8 scipy=1.5.2
mamba activate myenv

It should be noted that the CUDA driver and runtime on the nodes (currently version 12.0) must be supported by whatever you install in your conda env. Typically, installed drivers and runtimes support only environments using up to that particular CUDA version, but nothing above it.

System Python

Using the RHEL 8-provided Python in virtualenvs can also be a viable approach in many cases. Make sure you are working with Python 3.6; Python 3.6.8 should be the system default on Berzelius, but please verify for yourself. Also, put your virtualenvs under /proj and not /home; we want to avoid filling your /home directory with non-precious data and backing it up to tape.

Install packages with pip install <packagename> in your virtualenv, and build packages using a build environment, for instance 'buildenv-gcccuda/11.4-8.3.1-bare'. It should be noted that the CUDA driver and runtime on the nodes must be supported by whatever software you install in your virtualenv.


$ cd /proj/your_proj_dir/users/your_username
$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ python --version
Python 3.6.8
(myenv) $ pip install --upgrade pip
... (snip) ...
(myenv) $ module load buildenv-gcccuda/11.4-8.3.1-bare
(myenv) $ pip3 install mpi4py

Jupyter Notebook

Jupyter notebooks can be used via ThinLinc (see Running Graphical Applications) or SSH port forwarding.

Assume that you have installed Jupyter Notebook in your conda environment. On a compute node, load the Anaconda module and activate your environment:

module load Anaconda/2021.05-nsc1
conda activate myenv_example

Start a Jupyter notebook with the no-browser flag:

(myenv_example) [x_abcde@node021 ~]$ jupyter-notebook --no-browser --ip=node021 --port=9988

Please use the --ip flag to specify the node that you are working on.

You will see the following info printed out on your terminal:

[I 2023-02-20 12:36:50.238 LabApp] JupyterLab extension loaded from /home/x_abcde/.conda/envs/myenv_example/lib/python3.10/site-packages/jupyterlab
[I 2023-02-20 12:36:50.238 LabApp] JupyterLab application directory is /proj/proj_name/x_abcde/.conda/envs/myenv_example/share/jupyter/lab
[I 12:36:50.243 NotebookApp] Serving notebooks from local directory: /home/x_abcde
[I 12:36:50.243 NotebookApp] Jupyter Notebook 6.5.2 is running at:
[I 12:36:50.243 NotebookApp] http://node021:9988/?token=xxxx
[I 12:36:50.243 NotebookApp]  or
[I 12:36:50.243 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 12:36:50.256 NotebookApp] 
    To access the notebook, open this file in a browser:
    Or copy and paste one of these URLs:

In a second terminal, open an SSH tunnel:

ssh -N -L localhost:9988:node021:9988

Start your favorite browser on your local computer and paste the URL given by jupyter-notebook on Berzelius, with your own token.



Note: The port 9988 is arbitrary. If 9988 is already in use, then just try 9989, etc.
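That trial-and-error search can be automated. A small helper (hypothetical, not an NSC-provided tool) that prints the first free TCP port at or above a chosen start port:

```shell
# Find the first TCP port >= the given start port that is free on this host
find_free_port() {
    python3 - "$1" <<'EOF'
import socket, sys

port = int(sys.argv[1])
while True:
    with socket.socket() as s:
        try:
            s.bind(("", port))   # succeeds only if the port is free
            print(port)
            break
        except OSError:
            port += 1            # in use; try the next one
EOF
}
PORT=$(find_free_port 9988)
echo "Use --port=$PORT for jupyter-notebook and the SSH tunnel"
```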

Build Environment

A basic build environment, buildenv-gcccuda/11.4-8.3.1-bare, is available for those who may need to build software for the RHEL 8 environment on Berzelius. For instance, if you need to build the mpi4py Python package or CUDA dependent Python packages, have this module loaded when building.

The build environment is based on the system GCC (8.3), CUDA 11.4, OpenMPI 4.1.1, OpenBLAS 0.3.15, FFTW 3.3.9 and ScaLAPACK 2.1.0. Please report any problems using it to support.

Running Graphical Applications

Inevitably there comes a need to use graphical applications on the cluster. The NSC recommended way to run graphical applications on Berzelius is via the ThinLinc VNC solution (TL), which is leveraged to provide a remote desktop interface on Berzelius, see Running graphical applications for more information. The TL remote desktop presented is XFCE due to it being light on server resources. It will provide a much better user experience for modern GUIs than X-forwarding, although X-forwarding is not prohibited. In addition, it provides session management, allowing the users to disconnect from the session while running processes are kept running, like a GUI version of terminal multiplexers like GNU screen.

The ThinLinc client is available free of charge from Cendio, with packages for the major OS platforms (Linux, macOS and Windows). Applications of particular interest to Berzelius users which benefit from use via TL include:

  • Firefox and consequently
    • Jupyter notebooks
    • Tensorboard
  • Integrated Development Environments, for instance PyCharm
  • Debuggers and Profilers like NVIDIA Nsight-systems

Submitting and Running Jobs

Berzelius is a SuperPOD compute cluster using SLURM as its resource manager. General SLURM documentation or the man pages should be valid for making resource allocations on Berzelius. Other NSC SLURM documentation is likely useful but may not apply in all parts.

The wall time limit has been set to 3 days (72h) to ensure that there is reasonable turnover of jobs. You can extend the wall time of a running job using boost-tools.

There are default allocation settings in place based on the number of GPUs allocated. Currently these are: 1 task using 16 CPU cores and 127 GB RAM for every GPU allocated. This allows you to specify only the number of GPUs your job requires (e.g. --gpus=2, which gets you 32 CPU cores and 250 GB RAM as well as the 2 GPUs), and you will get a sensible allocation for most circumstances.

In addition, the default wall time allocated for jobs is 2h. Remember to override this with -t <timestring> when needed. All of these defaults are overridden by user switches when provided.

If "thin" or "fat" nodes are desired this can be specified with the feature flags -C "fat" and -C "thin". If this is not specified in the submission both types of nodes may be used by the job.

Submitting Jobs

Be aware that there are many ways to give conflicting directives when allocating resources, which will result in either not getting the allocation or getting the wrong allocation. Before submitting a resource-intensive batch job, it is worthwhile to check the settings in an interactive session, verifying that the resources are properly allocated.

In general, if you need to specify more than just the number of GPUs your job requires, a recommended way to allocate is via the switch combination -n X --gpus=Y (plus other needed switches, like -t), where X is the number of tasks (CPU cores) and Y the number of GPUs available to your X tasks. This pattern seems to work in most circumstances to get you what you expect. In some circumstances you may also need to specify -c Z, where Z is the number of hyperthreads allocated to each task (there are two hyperthreads per physical CPU core), but this should not normally be required.
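A minimal job-script header following that -n X --gpus=Y pattern (the task and GPU counts, wall time and program name are placeholders; adjust them to your job):

```shell
#!/bin/bash
#SBATCH -n 2          # X: number of tasks
#SBATCH --gpus=2      # Y: GPUs available to those tasks
#SBATCH -t 01:00:00   # wall time; the default is only 2 h
# Uncomment if you also need to pin hyperthreads per task (normally not needed):
##SBATCH -c 32

srun ./my_program     # placeholder for your own executable
```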

When allocating a partial node, NSC's recommendation is to allocate tasks (CPU cores) in proportion to the fraction of the node's total GPUs you allocate, i.e. for every GPU (1/8 of a node) you allocate, you should also allocate 16 tasks (1/8 = 16/128). The default memory allocation follows that of the cores, i.e. for every CPU core allocated, 7995 MB of RAM (a little less than 1/128th of the node's total RAM) is allocated. This is automatically taken care of when using only the switch --gpus=X, as in the examples of the quick start guide. On "fat" nodes, twice the amount of memory is allocated if the job used the -C "fat" feature flag.
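The proportional defaults quoted above can be checked with a quick calculation, here for a hypothetical 2-GPU job:

```shell
# Default CPU and memory allocation implied by a 2-GPU job
GPUS=2
CORES=$((GPUS * 16))        # 16 cores per GPU (1/8 of the node's 128 cores)
MEM_MB=$((CORES * 7995))    # 7995 MB RAM per core
echo "$GPUS GPUs -> $CORES cores, $MEM_MB MB RAM"   # roughly 250 GB for 2 GPUs
```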

Jobs using the reservation with MIG GPUs (--reservation=3g.20gb) will always receive a fixed amount of resources per job (e.g. only 1 MIG GPU per job). Even if you request a particular amount of resources, the job will start with the fixed amount. If your job requires more resources than the fixed amount, you should not use this reservation.

Please note that when allocating more than 8 GPUs (more than one node), you will end up on several nodes and will need multi-node capabilities in your software setup to make use of all allocated GPUs.

A note about more advanced and multi-node allocations: this version of SLURM has a bug when allocating resources using the flag --gpus-per-task, which does not work as intended. This bug makes, for instance, the switch combination -n 1 --gpus-per-task=1 allocate all GPUs on a node for your job, whether you use them or not. A working switch combination which accomplishes the same thing is -n 1 --ntasks-per-gpu=1. You can use this switch combination with different numbers of tasks and tasks per GPU, for instance -n 32 --ntasks-per-gpu=2, which will allocate a total of two nodes (16 GPUs) with two tasks per GPU. The bug appears to have been fixed in later versions of SLURM, and we will upgrade as soon as possible.

Interactive Work

Interactive work in a shell on the allocated resources can be performed via the NSC-provided script interactive, which is a wrapper around the SLURM commands salloc and srun, and as such accepts all switches available to salloc. For reference, see the NSC documentation under the "Interactive jobs" heading.

For specific examples of the use of interactive, see for instance the Quick start guide above.

Batch Jobs

Batch jobs are supported in the standard SLURM way; see the SLURM sbatch documentation for a guide.

Single Node Jobs

Jobs run on a single node should be straightforward in both Apptainer container environments and the host OS. Make sure you have the allocated resources and, for Apptainer containers, that you remembered to import the GPUs into the container (using the --nv switch).

Multi-node Jobs

Multi-node jobs for regular MPI-parallel applications should be pretty standard on the cluster, and can use the common mpirun (or mpiexec) launcher available via the buildenv-gcccuda/11.4-8.3.1-bare module or srun --mpi=X (supported X by SLURM are pmi2, pmix and pmix_v3). If your application has been built with NSC provided toolchain(s) you should also be able to launch it with mpprun in standard NSC fashion.

Multi-node jobs using GPUs can be challenging when running in an Apptainer container. For NVIDIA NGC containers used with Apptainer, you can possibly launch your job using the mpirun provided by loading the buildenv-gcccuda/11.4-8.3.1-bare module.

Otherwise, SLURM on Berzelius has support for launching Enroot containers directly using srun, which should work for at least NVIDIA NGC containers of recent date used to build your Enroot container. Enroot containers are similar to Apptainer containers but don't require superuser privileges to build or modify. Enroot containers can be built on the compute nodes of the cluster, but not on the login nodes.

NSC boost-tools

NSC boost-tools are an attempt to add more flexibility to the job scheduling.

We currently provide three tools:

  • nsc-boost-priority: Increase the priority of a job.
  • nsc-boost-timelimit: Extend the time limit of a job.
  • nsc-boost-reservation: Reserve nodes for a specific time period.

Please refer to the NSC boost-tools page for the usage instructions.


As the demand for time on Berzelius is high, we need to ensure that allocated time is used efficiently. The efficiency of running jobs is monitored continuously by automated systems. Users can, and should, monitor how well their jobs are performing. Users working via ThinLinc can use the tool jobgraph to check how individual jobs are performing. For example, jobgraph -j 12345678 will generate a png file with data on how job 12345678 is running. Please note that for job arrays you need to look at the "raw" job ID. It is also possible to log onto a node running a job using jobsh -j $jobid and then use tools such as nvidia-smi to look at statistics in more detail.

In normal use, we expect jobs properly utilizing the GPUs to draw 200 W, with most AI/ML workloads above 300 W. The average idle power for the GPUs is about 52 W.

Automatic termination of jobs

Particularly inefficient jobs will be automatically terminated. The following criteria are used to determine if a job should be canceled:

  • Average power utilization per GPU from the start of the job is below 60 W. This is slightly above our measured idle level of 52 W. In normal use, we expect jobs properly utilizing the GPUs to draw 200 W, with most AI/ML workloads above 300 W.
  • The job is scheduled without any GPU.

Exceptions from this are

  • Jobs that have not yet run for one hour, to allow for some preprocessing at the beginning of jobs,
  • Interactive jobs, started with the NSC interactive tool,
  • Jobs running within reservations, including devel,
  • Explicitly whitelisted projects (please contact us if you believe this applies to you).

These criteria are slightly simplified and will be made stricter over time. In particular, we expect to raise the power utilization criterion above 60 W at some point in the future. Users are informed about canceled jobs hourly, to avoid sending excessive mail to users whose job arrays are canceled.

Data Storage

The shared storage and data transport fabric on Berzelius are very high performance, and should suffice for most IO loads, specifically data-intensive AI/ML loads.

This is especially the case when the data sets are well formatted. Examples of good such formats are TFRecords (from TensorFlow), RecordIO (from MXNet) and Petastorm (from Uber).

The use of datasets in these formats can greatly reduce IO-wait time on the GPU compared to raw file system access, and will also reduce load on the shared storage. NSC highly recommends that you store and use your data sets using some such format.

Shared Storage

There are two shared storage areas set up for use: /home/$USER and /proj/<your_project_dir>/users/$USER. The /home/$USER area is backed up (nightly) and small, with a 20 GB quota per user, and is only meant for data you cannot put under /proj. The standard quota for the /proj directory is 5,000 GiB and 5 million files, but this can be increased, either at the time you apply for the project or in a complementary application at a later stage.

Node Local Storage

High performance NVMe SSD node local storage is available on each compute node.

There are a few points to note with respect to the available node local storage:

  • For every job, node local scratch space is mounted under /scratch/local.
  • Jobs cannot access another job's /scratch/local when several jobs are sharing a node.
  • Each job's /scratch/local is erased between jobs. Data not saved (e.g. moved to somewhere under /proj) by the end of a job is lost, with no way to get it back.
  • If you need to use it for your datasets, store your dataset as uncompressed tar archives, preferably split into many parts, and unpack them in parallel; this will increase your data transfer speed tremendously compared to a single process. Example:

      # 144 GB ILSVRC 2012 data set in TFRecord format split in 128 tar archives
      # unpacked with 16 parallel workers to /scratch/local. A single worker takes
      # 106s to do the same task.
      [raber@node001 ILSVRC2012]$ time ls *.tar | xargs -n 1 -P 16 tar -x -C /scratch/local/ -f
      real  0m16.763s
      user  0m3.192s
      sys   8m4.740s
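Since /scratch/local is wiped when the job ends, it is good practice to make stage-out of results an explicit final step of your job script. Below is a minimal sketch of that pattern; the temporary directories stand in for /scratch/local and your /proj area, so the paths here are placeholders rather than actual Berzelius paths.

```shell
# Sketch of the stage-out pattern: compute in fast node local scratch, then
# copy results to persistent storage before the job ends. SCRATCH and PROJ
# are stand-ins for /scratch/local and /proj/<your_project_dir>/users/$USER.
SCRATCH=$(mktemp -d)
PROJ=$(mktemp -d)

# ... your training/computation writes its output under $SCRATCH ...
mkdir -p "$SCRATCH/results"
echo "model checkpoint" > "$SCRATCH/results/checkpoint.txt"

# Stage out: copy everything you want to keep before the job finishes.
cp -r "$SCRATCH/results" "$PROJ/"

ls "$PROJ/results"   # the copy survives; $SCRATCH is wiped after the job
```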


Quotas and your current usage can be checked with the command nscquota. Complementary requests for increased storage allocation can be made in SUPR if you find you need it. If in doubt about how to do this, please contact support.

Efficient dataset transfer from shared storage to node local storage

You may occasionally need to use node local storage under /scratch/local to avoid starving the GPUs of data to work with, and hence improve the efficiency of your job. There may also be situations where you cannot work with efficient data formats in your preferred framework/application, or where you need to reduce the number of files stored on the Berzelius shared storage by using archive formats such as .tar, therefore requiring transfer and unpacking to node local storage to conduct your work.

In these situations, an efficient and well performing means of handling the data transfer from the shared project storage to local disc is essential to avoid excessive startup times of your job. This can be accomplished using .tar archive files (other archive formats can be used as well) comprising your data set paired with parallel unpacking. There are only two steps needed to do this on Berzelius:

  1. Partition and pack your dataset into multiple .tar archives, balanced in terms of size and number of files. For performance reasons, it is important not to use compression here, as decompression becomes a serious bottleneck when unpacking in the next step.
  2. On an allocated compute node, unpack these multiple .tar files using parallel threads directly to the local scratch disc.
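If you want to rehearse these two steps before running them on a large dataset (or prefer plain tar over Fpart for a small one), the same partition-then-unpack flow can be sketched with standard tools only. Everything below is self-contained in temporary directories; on Berzelius the unpack target would be /scratch/local instead.

```shell
# Step 0: create a small synthetic "dataset" of 64 files in 8 subdirectories.
DATASET=$(mktemp -d); TARS=$(mktemp -d); DEST=$(mktemp -d)
for d in $(seq 1 8); do
  mkdir "$DATASET/part$d"
  for f in $(seq 1 8); do echo "data $d-$f" > "$DATASET/part$d/file$f"; done
done

# Step 1: partition into multiple uncompressed .tar archives, here simply one
# per subdirectory. For real datasets, balance archives by size and file count.
cd "$DATASET"
for d in part*; do tar -c -f "$TARS/$d.tar" "$d"; done

# Step 2: unpack all archives in parallel (4 workers) to the destination,
# standing in for /scratch/local on a compute node.
ls "$TARS"/*.tar | xargs -n 1 -P 4 tar -x -C "$DEST" -f

find "$DEST" -type f | wc -l   # all 64 files restored
```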


In this example we use a synthetic data set of 579 GB comprising around 4.7 million files of 128 kB each. The performance figures reported below should serve as a good indicator of what you can expect in your individual case if you scale appropriately, e.g. a twice as large dataset should take roughly twice as long to transfer.

# Partition and create the .tar archives

$ module load Fpart/1.5.1

# Trying to get a reasonable number of .tar archives with our data set, we here
# use a .tar archive max size 2350MB and max number of files of 19000. Test
# parameters out for your own situation, don't blindly use these. In the
# example case the command finished in 22 minutes using 8 parallel processes.

$ fpsync -n 8 -m tarify -s 2350M -f 19000 /absolute/path/to/dataset/ /absolute/path/to/tar/files/

On an allocated compute node you can then unpack the archive files in parallel from shared storage to node local disc using the command below.

# Pipe a listing of the .tar files to xargs, which here uses 8 parallel worker
# threads unpacking different individual .tar archive files. For the example,
# this command finished in just over 2 minutes

you@node0XX $ ls /absolute/path/to/tar/files/*.tar | xargs -n 1 -P 8 tar -x -C /scratch/local/ -f

Depending on the size and number of files in the dataset .tar archives, as well as the workload of the involved file systems, you are likely to see some variation in unpack times. On a quiescent (i.e. only one user) compute node using the example dataset we measured the following unpack times with various numbers of parallel workers:

Workers (N)   Unpack time
2             5m14s
4             3m8s
8             2m2s
16            2m12s
32            2m26s
64            2m39s
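To find the sweet spot for your own dataset, you can time the unpack at a few worker counts and stop increasing -P once the times level off, as in the table above. A self-contained sketch using small dummy archives follows; in practice you would substitute your own .tar directory and /scratch/local.

```shell
# Create 16 small dummy .tar archives to stand in for a real archive set.
TARS=$(mktemp -d)
for i in $(seq 1 16); do
  D=$(mktemp -d); echo "payload $i" > "$D/file$i"
  tar -c -f "$TARS/a$i.tar" -C "$D" "file$i"
done

# Time the parallel unpack at increasing worker counts to find the point
# of diminishing returns for this archive set.
for P in 2 4 8; do
  DEST=$(mktemp -d)
  echo "workers: $P"
  time ls "$TARS"/*.tar | xargs -n 1 -P "$P" tar -x -C "$DEST" -f
done
```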

While it is of course possible to transfer your data set in other ways, we recommend this approach as it is highly performant as well as "nice" to the shared file system: keeping track of fewer files conserves resources on the shared file system servers.

Verifying data integrity

Before removing the original unpacked data set from Berzelius shared storage, it is prudent to verify the data integrity of the .tar archives. The approach is to 1) create a checksum file of the dataset in its top-level directory, and 2) unpack the .tar archives somewhere and verify the unpacked files against the original checksums. For our purposes, the md5sum tool produces sufficiently good checksums. Here's one way of doing it:

Create an md5sum listing on the original data set

$ cd /absolute/path/to/dataset/
$ find . -type f ! -name my_dataset_md5sums.txt -exec md5sum '{}' \; | tee my_dataset_md5sums.txt

Assuming the my_dataset_md5sums.txt file is included at the top level of the unpacked dataset as above, check its data integrity by, for instance, doing:

$ cd /path/to/unpacked/dataset/
$ md5sum --quiet --check my_dataset_md5sums.txt && echo "Verification SUCCESSFUL" || echo "Verification FAILED"
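The whole checksum cycle can be rehearsed end to end on a throwaway dataset before you trust it with real data. This self-contained sketch creates a small dataset in a temporary directory, checksums it, copies it (standing in for transfer and unpacking), and verifies the copy:

```shell
# Rehearse the checksum roundtrip: checksum the original dataset, "transfer"
# a copy of it, then verify the copy against the checksum list.
ORIG=$(mktemp -d); COPY=$(mktemp -d)
echo "sample A" > "$ORIG/a.txt"
echo "sample B" > "$ORIG/b.txt"

# Create the checksum list, excluding the list file itself.
cd "$ORIG"
find . -type f ! -name my_dataset_md5sums.txt -exec md5sum '{}' \; > my_dataset_md5sums.txt

# "Transfer" the dataset (stand-in for packing, moving and unpacking it).
cp -r "$ORIG"/. "$COPY"/

# Verify the copy against the original checksums.
cd "$COPY"
md5sum --quiet --check my_dataset_md5sums.txt && echo "Verification SUCCESSFUL"
```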
