Differences between Freja and Bi

The Freja cluster is the replacement for Bi. This page outlines the key differences between Freja and Bi. It also documents some of the experiences from the pilot testing phase. If you have been using Bi before, the information here might help you in migrating your jobs to Freja.

Hardware differences

Freja has 64 cores per compute node, four times as many as Bi. There are also only 78 compute nodes in Freja, compared with the 641 nodes Bi had during the latter half of its lifetime. If you have a working job configuration for Bi, take this into account by scheduling your smaller jobs on an appropriate number of cores instead of whole nodes.
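As a sketch of that adjustment: a job that filled a whole 16-core Bi node can instead request the same number of cores on Freja and share the node with other jobs. The job name, time limit, and binary below are placeholders, not a prescribed configuration:

```shell
#!/bin/bash
# Request 16 cores (not a whole 64-core node) so the rest of the
# node stays available to other jobs. Placeholders: jobname, time, binary.
#SBATCH -J smalljob
#SBATCH -t 01:00:00
#SBATCH -n 16
mpprun binary.x
```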

Freja does not have hyper-threading, so there are no virtual cores. Bi did have hyper-threads, which were enabled with --ntasks-per-core=2. That option should NOT be used on Freja.

Operating System differences

Freja runs Rocky Linux 9 (equivalent to RHEL 9), while Bi ran CentOS 7 (equivalent to RHEL 7). These were released eight years apart, and there are more changes than can easily be enumerated, but on the surface it should feel very similar.

Software/configuration differences

Node sharing is available, so more than one job can run on a node. Considering the number of cores available, you should do that more often than not. See Scheduling policy on Freja.

You cannot use a normal ssh NODENAME to log in to a node where you have a running job. Use jobsh -j JOBID NODENAME instead.
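A minimal sketch of that workflow, assuming a running job; the job ID and node name below are made-up examples:

```shell
# Find the job ID and the node(s) your job is running on.
# squeue is standard Slurm; the jobsh tool is described above.
squeue -u $USER

# Attach a shell on one of your job's nodes (example values).
jobsh -j 1234567 n42
```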

For you as a user, the compiler environment should be very similar to Bi's. However, Freja runs a much newer operating system, so the available software will differ somewhat, and you should recompile your own software.

Examples of how to launch jobs

Freja uses the Slurm job scheduling system, like earlier clusters at NSC. Below we present some examples of how to launch parallel jobs with different kinds of parallelization.

Pure MPI

This is the simplest way of running. The job script below launches the job on 4 compute nodes, and you will get 64 MPI ranks per node (one per core).

#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 4
...
mpprun binary.x
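To submit a script like the one above, use the standard Slurm sbatch command; the script filename here is a placeholder:

```shell
# Submit the job script to Slurm; it prints the assigned job ID.
sbatch jobscript.sh

# Check its state in the queue afterwards.
squeue -u $USER
```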

Scheduling differences

Use of the fat nodes counts towards fairshare usage at double the cost of normal nodes. Jobs not requesting fat nodes can be scheduled on fat nodes if no other nodes are available, but are then not charged the extra cost. This is in contrast with Bi, where requesting a fat node carried no extra cost.

The "high" and "risk" qos classes no longer exist. Users of "risk" should use "low" that now have a 4h timelimit. Users of "high" are encouraged to test the boost-tools described below.

There is a new tool, boost-tools, available to all users for changing the priority of their own jobs:

  • A project may increase the priority of a small number of jobs for added flexibility using boost-tools.
  • A project may increase the time limit of a small number of jobs beyond the normal maximum (7 days) using boost-tools.
  • A project may reserve nodes for a certain time period using boost-tools.

Suggestions of what to test

  • Try to recompile your software on Freja with the new compilers. Use the module buildenv-intel/2023a-eb.
  • Run the job like you did on Bi (using 64 cores/node). Check the output for correctness and then look at the speed. The job should run faster on Freja.
  • Next, try reducing your jobs to run on a specified number of cores instead of full nodes.
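A sketch of the recompile step, using the buildenv-intel/2023a-eb module named above; the compiler command, flags, and source file are assumptions for illustration, not a confirmed recipe:

```shell
# Load the build environment named in this guide.
module load buildenv-intel/2023a-eb

# Recompile your own code (example compiler invocation; adjust
# the compiler, flags, and source files to your application).
mpiifort -O2 -o binary.x mycode.f90
```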

Known problems

/esgf is not yet mounted

If you rely on the /esgf filesystem you will have to keep using it from Bi for a little while yet.

No Publisher

It is not yet possible to use Publisher from Freja. You can do the work on Freja and then publish the result from Bi for now.
