CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. On Berzelius, both the CUDA driver and CUDA toolkit are installed, but their versions may differ — and this is normal.
The CUDA driver version is reported by the nvidia-smi tool. This version corresponds to the GPU driver installed on the system, which manages the GPU hardware and enables GPU acceleration for compatible software.
Note: The driver version does not have to match the CUDA Toolkit version, but it must be new enough to support the toolkit version you plan to use.
The CUDA Toolkit includes compilers (nvcc), libraries, and development tools for building and running GPU-accelerated applications. You can check the version of the currently loaded toolkit with nvcc -V.
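As a quick sketch of checking both versions on a node (the CUDA module name below is a placeholder; use module avail to see what is actually installed):

nvidia-smi                    # the "CUDA Version" field shows the driver-supported CUDA version
module avail CUDA             # list available CUDA toolkit modules
module load <CUDA-module>     # placeholder: pick a module from the list above
nvcc -V                       # toolkit version of the loaded module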
Many Python libraries or frameworks (like PyTorch, TensorFlow, etc.) typically include the necessary CUDA runtime libraries as part of the installation. This means you do not need to install or load the full CUDA Toolkit module just to use these frameworks.
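For example, a quick way to confirm that a framework's bundled CUDA runtime can see the GPU (assuming a Python environment with PyTorch is already activated) is:

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"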
At the time of writing, the CUDA driver on the compute nodes corresponds to the CUDA 12.0 release, but it includes compatibility support for CUDA 12.2. Here’s a sample nvidia-smi output:
Fri Apr 18 21:15:34 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   25C    P0              59W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
SLURM (Simple Linux Utility for Resource Management) is an open-source, highly configurable workload manager and job scheduler widely used on high-performance computing (HPC) systems. On Berzelius, SLURM is used to allocate compute resources—such as GPUs, CPU cores, and memory—and to manage both batch and interactive jobs.
When allocating a partial node, NSC recommends keeping the allocation of CPU cores proportional to the number of GPUs you request. Each compute node has 8 GPUs and 128 CPU cores, so the recommended ratio is 16 CPU cores per GPU (1/8 of a node's cores per GPU).
For example, a job requesting 2 GPUs should also request 32 CPU cores (2 × 16). The default memory allocation follows the CPU core count, so RAM scales proportionally as well. If you simply use the --gpus=X flag, SLURM will automatically allocate a proportional number of CPU cores and RAM for you, according to these defaults.
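As a minimal sketch of making this explicit in a batch script (the values simply restate the 16-cores-per-GPU policy above):

#SBATCH --gpus=2            # request 2 GPUs
#SBATCH --cpus-per-gpu=16   # optionally state the 16-cores-per-GPU ratio explicitly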
There are CPU nodes available in the partition "berzelius-cpu". To use a CPU node, the flag --partition=berzelius-cpu must be set in the job description.
If your job requests more than 8 GPUs, it will span multiple nodes. In such cases, your software must support multi-node execution (e.g., through MPI or distributed frameworks like PyTorch DDP or Horovod). Be sure to verify this capability in your environment before submitting large-scale jobs.
There are many ways to unintentionally provide conflicting SLURM directives (e.g., requesting 4 GPUs but only 8 cores), which can result in job failures or incorrect resource allocation.
Recommendation: Before submitting a large or long-running batch job, start with an interactive session to verify that your SLURM settings behave as expected. Use the interactive tool to launch a temporary job and inspect the allocated resources.
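For instance, a quick check inside such a session might look like this (all commands shown are standard SLURM, NVIDIA, and Linux tools; the account name is a placeholder):

interactive --gpus=1 -t 30 -A <your-project-account>
# inside the session:
nvidia-smi -L        # list the GPUs actually allocated to the job
nproc                # CPU cores visible to the job
echo $SLURM_JOB_ID   # the job ID, useful later with jobgraph and jobsh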
An interactive session allows you to run commands on the cluster in real time, making it ideal for tasks like development, testing, debugging, and exploring data.
Running a basic interactive command like:
interactive --gpus=1
will by default allocate 1 GPU, 16 CPU cores, and a proportional share of the node's RAM. These defaults follow the proportional resource allocation policy (1/8 of a node per GPU).
Further examples:
interactive --gpus=1                               # single GPU with default time limit
interactive --gpus=1 --nodelist=node045            # request a specific node
interactive --gpus=1 -A <your-project-account>     # charge a specific project account
interactive --gpus=2 -t 00-00:30:00                # two GPUs for 30 minutes
interactive -N 1 --exclusive -t 6:00:00            # a full node, exclusively, for 6 hours
interactive --gpus=1 -C fat -t 30                  # one GPU on a fat node for 30 minutes
The feature flag -C fat restricts job placement to fat nodes (2 TB RAM). The equivalent for thin nodes is -C thin. If no flag is specified, SLURM may place your job on either type.
Note on GPUh Cost: When using -C fat, your job is assigned 254 GB of system memory per GPU. However, fat node GPUs are charged at double the GPU-hour rate, even if your job was not explicitly submitted with -C fat but landed there.
interactive -p berzelius-cpu -n1 -c12 -t 180
Note on GPUh Cost: When using -p berzelius-cpu, your job is assigned to a node without GPUs but with substantially more performant CPUs than the GPU nodes. The cost for 16 CPU cores is the same as for 1 thin node GPU. Your jobs cannot land on a CPU node by chance; you must specify the partition with the flag -p berzelius-cpu or --partition=berzelius-cpu.
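A minimal batch-script equivalent of the interactive CPU example might look like the following sketch (core count and wall time are illustrative):

#SBATCH -A <your-project-account>
#SBATCH -p berzelius-cpu   # CPU partition; jobs never end up here by accident
#SBATCH -n 1               # one task
#SBATCH -c 16              # 16 CPU cores, billed like one thin-node GPU
#SBATCH -t 03:00:00        # 3 hours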
In the context of HPC clusters, batch jobs are non-interactive computational tasks submitted to a job scheduler for deferred execution. Batch submission is the standard way to efficiently manage long-running or resource-intensive workloads on Berzelius.
To submit a batch job, you first need to create a job script, typically named something like batch_script.sh. Here's a basic example:
#!/bin/bash
# SLURM batch job script for Berzelius
#SBATCH -A <your-project-account> # Replace with your project account name
#SBATCH --gpus=4 # Request 4 GPUs
#SBATCH -t 3-00:00:00 # Wall time: 3 days (72h)
# Load your environment
module load Miniforge3/24.7.1-2-hpc1-bdist
mamba activate pytorch-2.6.0
# Execute your code
python train_model.py
A more detailed introduction to batch jobs can be found here.
Once your script is ready, submit it using:
sbatch batch_script.sh
You can monitor the job with:
squeue -u <your username>
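A few other stock SLURM commands are useful at this point (shown as a sketch):

scontrol show job <jobID>   # detailed information about a specific job
scancel <jobID>             # cancel a job you no longer need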
The maximum wall time for jobs on Berzelius is 3 days (72 hours). This limit ensures fair scheduling and reasonable job turnover for all users. If your work requires more time, consider splitting it into multiple shorter runs or using checkpointing.
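If you split long work into shorter, checkpointed runs, one common pattern (standard SLURM, sketched here with the example script above) is to chain jobs with dependencies so each run starts only after the previous one has finished successfully:

jobid=$(sbatch --parsable batch_script.sh)             # first run; --parsable prints only the job ID
sbatch --dependency=afterok:$jobid batch_script.sh     # next run starts only if the first succeeds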
To provide greater flexibility when working with SLURM on Berzelius, NSC offers a set of utilities collectively known as NSC boost-tools. We currently provide three tools:
NVIDIA Multi-Instance GPU (MIG) is a feature which allows a single GPU to be partitioned into multiple smaller GPU instances, each of which can be allocated to different tasks or users. This technology helps improve GPU utilization and resource allocation in multi-user and multi-workload environments.
Nodes in the reservation 1g.10gb have the MIG feature enabled. Each 1g.10gb instance provides a fixed slice of an A100 GPU with 10 GB of GPU memory. If your job requires more resources than this fixed amount, you should not use this reservation.
interactive --reservation=1g.10gb
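The reservation can also be used in a batch script; a minimal sketch, assuming a MIG instance is requested like a regular GPU:

#SBATCH --reservation=1g.10gb   # place the job on MIG-enabled nodes
#SBATCH --gpus=1                # one MIG instance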
Running multi-node jobs on Berzelius is fully supported and follows standard practices for MPI-parallel applications and distributed GPU workloads.
For traditional MPI applications, you can use:
- mpirun or mpiexec, available via the module: module load buildenv-gcccuda/11.8.0-gcc11.3.0
- srun --mpi=<type>, where <type> is one of pmi2, pmix, or pmix_v3 (all supported by SLURM)
- mpprun, the standard NSC wrapper for MPI job launching, compatible with applications built using NSC's toolchains.

Running multi-node jobs with GPUs inside Apptainer containers (formerly Singularity) can be more complex due to MPI and GPU passthrough requirements. If you're using NVIDIA NGC containers, you may still be able to use mpirun with proper environment setup.
Steps:
module load buildenv-gcccuda/11.8.0-gcc11.3.0
We have a few examples of multi-node jobs available for your reference.
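As a rough sketch of a multi-node MPI batch job (the application binary is hypothetical, and the node/GPU counts and wall time are only illustrative):

#!/bin/bash
#SBATCH -A <your-project-account>
#SBATCH -N 2                  # two full nodes
#SBATCH --gpus=16             # 8 GPUs per node
#SBATCH -t 12:00:00
module load buildenv-gcccuda/11.8.0-gcc11.3.0
mpprun ./my_mpi_app           # hypothetical MPI-built application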
Please refer to the NSC boost-tools for how to reserve GPUs/nodes for a specific time period.
Depending on the type of resources allocated to a job, the cost in GPUh will vary. Using feature flags, it is possible to select either "thin" (A100 40GB) or "fat" (A100 80GB) nodes for a job; a job that specifies neither flag may be placed on either type. MIG GPUs are accessed through the MIG reservation.
| GPU | Internal SLURM cost | GPUh Cost per hour | Accessed through |
|---|---|---|---|
| MIG 1g.10gb | 4 | 0.25 | --reservation=1g.10gb |
| A100 40GB | 16 | 1 | -C "thin", or no flag |
| A100 80GB | 32 | 2 | -C "fat", or no flag |
CPU nodes are accessible in the berzelius-cpu partition and the cost for 16 CPU cores is the same as for one A100 40GB.
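As a worked example based on the table above: a job using 2 A100 80GB GPUs for 6 hours is charged 2 × 2 × 6 = 24 GPUh; the same job on A100 40GB GPUs costs 2 × 1 × 6 = 12 GPUh; and a single MIG 1g.10gb instance for 6 hours costs 0.25 × 6 = 1.5 GPUh.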
As the demand for time on Berzelius is high, we must ensure that allocated GPU resources are used efficiently. The performance of all running jobs is continuously monitored by automated systems, and users are encouraged to monitor their own jobs as well.
You can use the tool jobgraph to visualize GPU usage:
jobgraph -j <jobID>
This command generates a .png file showing how the job is utilizing resources. For job arrays, make sure to use the raw job ID (i.e., the base job ID, not the individual array task ID).
You can also log into the node running your job using:
jobsh -j <jobID>
Inside the job environment, tools such as nvidia-smi and nvtop provide detailed, real-time GPU statistics.
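For example, once attached to the node (a brief sketch using standard NVIDIA tools):

jobsh -j <jobID>
nvidia-smi dmon -s u   # stream GPU and memory utilization samples
nvtop                  # interactive, top-like GPU monitor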
Jobs that utilize their GPUs properly typically draw 200W or more per GPU, with many AI/ML workloads reaching 300W+. In contrast, idle GPUs consume approximately 50–60W. Jobs with consistently low GPU usage are likely not utilizing the allocated resources effectively.
To maintain fair usage and cluster efficiency, jobs that fall below certain thresholds may be automatically canceled by the system. The key criteria are:
The following job types are exempt from this automatic cancellation policy:
- Jobs started with the interactive tool, for up to 8 hours.
- Jobs using the devel and safe reservations.

Note: These criteria are intentionally simplified and will become stricter over time. For example, the grace period for interactive jobs may be shortened in future policy updates.
To avoid excessive notifications, especially when many jobs in a job array are affected, users are informed about jobs canceled due to inefficiency in hourly batches. We recommend reviewing the details of canceled jobs to identify and fix potential inefficiencies in your workload.
In some cases, individual tasks do not fully utilize a GPU (e.g., low power usage), resulting in poor resource efficiency. One way to increase throughput is to run multiple such tasks concurrently within the same job. This strategy can make better use of allocated GPU time, even if each task runs slightly slower on its own.
This method is useful for:
We create a simple file data.txt, with one word per line:
[user@berzelius1 xargs-example]$ cat data.txt | wc -l
24
[user@berzelius1 xargs-example]$ head -n 3 data.txt
anniversary
annotated
annotation
We have a script poet.sh that generates a poem from a single word:
[user@berzelius1 xargs-example]$ srun --gpus=1 ./poet.sh "hello" 2>/dev/null
Hello,
I'm a simple greeting
...
A wrapper script poet_wrapper.sh runs poet.sh and saves the output to a file:
[user@berzelius1 xargs-example]$ cat poet_wrapper.sh
#!/bin/bash
mkdir -p results                    # Create output directory
./poet.sh "$1" > "results/$1.txt"   # Run task and redirect output
We now submit a job with a script like concurrent_poet.sh, which launches multiple tasks using xargs:
[user@berzelius1 xargs-example]$ cat concurrent_poet.sh
#!/bin/bash
#SBATCH -J concurrent_poet
#SBATCH --gpus=1
CONCURRENT_TASKS=4
# -P $CONCURRENT_TASKS: Run this many tasks in parallel.
# -I {}: Placeholder for each line of input.
# -d '\n': Treat input lines as newline-delimited.
cat data.txt | \
xargs -d '\n' -I {} -P $CONCURRENT_TASKS ./poet_wrapper.sh "{}"
After the job finishes, all output is available under the results/ directory:
[user@berzelius1 xargs-example]$ ls results/ | wc -l
24
[user@berzelius1 xargs-example]$ ls results/ | head -n 3
allergy.txt
anniversary.txt
annotated.txt
[user@berzelius1 xargs-example]$ head -n 5 results/annotated.txt
And as I go through life, with each passing year,
I find that there's so much more to learn and share.
Each memory becomes an annotation,
A marking of my thoughts and emotions,
As I explore and grow along the way.
If your job benefits from an exclusive node (e.g., large scratch usage) and you want to utilize all 8 GPUs:
- Request the full node with #SBATCH --exclusive and #SBATCH -N 1.
- Distribute tasks across the GPUs, for example by setting CUDA_VISIBLE_DEVICES per task or by running a separate xargs block per GPU, as sketched below.