The Bi cluster is the replacement for Krypton. This page outlines the key differences between Bi and Krypton. It also documents some of the experiences from the pilot testing phase. If you have been using Krypton before, the information here might help you in migrating your jobs to Bi.
Bi has 16 cores per compute node, just like Krypton. If you have a working job configuration for Krypton, you should be able to run exactly the same job on Bi -- it will just run much faster (typical improvement 50%).
Bi has hyper-threading available, making each physical core capable of appearing as two virtual cores. You need to enable it explicitly, for example by requesting --ntasks-per-core=2 in your job script; mpprun then automatically starts 32 MPI ranks per compute node. If you do not do that, you get 16 MPI ranks per compute node (as long as you don't change that using other parameters). Hyper-threading makes some applications like Arome run faster (about 10%). See below for more information about hyper-threading and Slurm. Note: during the pilot phase until 2015-02-25, hyper-threading was on by default.
Bi has Intel Xeon E5v3 processors of the "Haswell" generation. Haswell CPUs have improved vectorization with AVX2 instructions. In theory, up to 8 floating-point instructions can be handled per clock cycle (up from 4 using AVX). To benefit from this, you need to recompile your software with high optimization (like -O2 -xCORE-AVX2) or at least link with an external library that has AVX2 support (like Intel's MKL).
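Before rebuilding with -xCORE-AVX2, it can be useful to confirm that the node you are on actually reports AVX2. The following is a minimal sketch, assuming a Linux node where the CPU flags are listed in /proc/cpuinfo; the function name has_avx2 is made up for this illustration.

```shell
#!/bin/bash
# has_avx2 [FILE]
# Prints "yes" if the cpuinfo-style FILE lists the avx2 flag, otherwise "no".
# FILE defaults to /proc/cpuinfo (assumption: a Linux system).
has_avx2() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    # -w matches avx2 as a whole word, so a plain "avx" flag does not count.
    if grep -qw avx2 "$cpuinfo" 2>/dev/null; then
        echo yes
    else
        echo no
    fi
}

echo "AVX2 reported by this CPU: $(has_avx2)"
```

A binary built with -xCORE-AVX2 is only expected to run on CPUs where this check prints "yes".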
Bi has 64 GB of memory in the thin compute nodes. This is twice the amount on Krypton. The memory speed has also improved: Bi has 1866 MHz DDR4 memory. In low-level memory benchmarks like STREAM, we can see up to 30% improvement. For certain applications, this can lead to substantial speed-up, even without recompiling them.
Bi has Intel Truescale Infiniband (previously known as Qlogic Truescale) -- earlier clusters at NSC have had Infiniband from Mellanox. As a user, you will probably not notice this, but if you are using your own MPI library, you may have to supply special flags or recompile it with "PSM" or "TMI" support to get the best performance. In low-level benchmarks, we have seen that Truescale Infiniband is especially strong at small messages (high "packet rate").
Node sharing is available, so you can run more than one job on a node. See Scheduling policy on Bi.
You cannot use normal ssh NODENAME to log in to a node where you are running a job. Use jobsh -j JOBID NODENAME instead.
The newer compiler wrappers and module system (same as on Triolith) are used.
A new implementation of interactive is used.
Bi has the Slurm job scheduling system, like earlier clusters at NSC. Below, we present some examples of how to launch parallel jobs with different kinds of parallelization.
This is the simplest way of running. The job script below will launch the job on e.g. 8 compute nodes and you will get 16 MPI ranks per node (1 per core). Run like this if you want everything to be as similar as possible to Krypton.
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    ...
    mpprun binary.x
If you want to activate hyperthreading and run using MPI only, you need to tell Slurm that you want 2 MPI ranks per core. Mpprun will then launch 32 MPI ranks per node automatically. You also need to pass a special option to the underlying MPI (PSM_RANKS_PER_CONTEXT=auto). Please note that this is not a recommended way of running. It is better to use MPI+OpenMP parallelization (see below) with hyper-threading.
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    #SBATCH --ntasks-per-core=2
    ...
    export PSM_RANKS_PER_CONTEXT=auto
    mpprun binary.x
In this case, each MPI rank will spawn a number of OpenMP threads. You can have up to 2 OpenMP threads per core. There are many possible combinations. We expect that the following combinations are likely to run well:
16 MPI ranks x 2 OpenMP threads = 1 MPI rank per physical core and 2 OpenMP threads per physical core (one per virtual core). Job script:
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    #SBATCH --ntasks-per-node=16
    ...
    export OMP_NUM_THREADS=2
    mpprun binary.x
2 MPI ranks x 16 OpenMP threads = 1 MPI rank per socket and 16 OpenMP threads on each socket. Job script:
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    #SBATCH --ntasks-per-node=2
    ...
    export OMP_NUM_THREADS=16
    mpprun binary.x
Instead of giving the flag --ntasks-per-node, you can also affect the number of tasks per node indirectly by giving e.g. --ntasks-per-core=2. This effectively enables hyperthreading and starts 32 MPI ranks per node.
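The combinations above all multiply out to the 32 virtual cores of a node (16 physical cores with 2 hardware threads each). As a rough sanity check before submitting, the arithmetic can be sketched as below. The function name check_layout and the two constants are illustrative only and are not part of Slurm or mpprun.

```shell
#!/bin/bash
# Hardware figures taken from the text above: 16 physical cores per node,
# 2 hardware threads per core when hyper-threading is used.
CORES_PER_NODE=16
THREADS_PER_CORE=2

# check_layout RANKS_PER_NODE OMP_THREADS
# Prints the number of hardware threads the layout uses per node and
# returns non-zero if that is more than one node offers.
check_layout() {
    local ranks=$1 threads=$2
    local used=$(( ranks * threads ))
    local available=$(( CORES_PER_NODE * THREADS_PER_CORE ))
    echo "$used"
    [ "$used" -le "$available" ]
}

# The two layouts suggested above both use all 32 virtual cores.
if check_layout 16 2 >/dev/null; then echo "16 ranks x 2 threads fits"; fi
if check_layout 2 16 >/dev/null; then echo "2 ranks x 16 threads fits"; fi
# An oversubscribed layout is rejected.
if ! check_layout 32 2 >/dev/null; then echo "32 ranks x 2 threads does not fit"; fi
```

This only checks the count per node; it says nothing about how ranks and threads are pinned to cores.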
mpiexec.hydra. The startup time can be improved by setting export I_MPI_HYDRA_PMI_CONNECT=alltoall in the job script. Please note that the IntelMPI module and the mpprun program do this automatically for you.
export KMP_AFFINITY=scatter to change thread affinity.
export MKL_CBWR=AVX2 or export MKL_CBWR=AVX.
These are some specific tips for the Nemo code supplied by Torgny and the vendor's own testing. Suitable compiler options are:
    %FC      ifort -c -cpp -Nmpi
    %FCFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
    %FFLAGS  -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
    %LD      ifort -O3 -fp-model precise -assume byterecl -convert big_endian -Nmpi
Example batch script for a 16-node Nemo run. Here, we are not using hyperthreading, as Nemo does not benefit from it. There is also no OpenMP usage.
    #!/bin/sh
    #SBATCH -N 16
    #SBATCH -t 01:00:00
    .......................................
    time mpprun -np 255 ./nemo.exe
    ...........................................................
Some early experiences from the Arome benchmarking.
See example Arome "makeup" file below.
Suppose we want to run a 48 node Arome job using Intel MPI. In this case, we want to:
The script would look like:
    #!/bin/sh
    #SBATCH -J Forecast
    #SBATCH -N 49
    #SBATCH --ntasks-per-node=16
    #SBATCH -t 01:00:00
    .................
    export NPROCX=16
    export NPROCY=48
    export NPROC_IO=16
    export NPROC=$(( $NPROCX * $NPROCY ))
    export TOTPROC=$(( $NPROCX * $NPROCY + $NPROC_IO ))
    export NSTRIN=$NPROC
    export NSTROUT=$NPROC
    export OMP_NUM_THREADS=2
    export KMP_STACKSIZE=128m
    ........................................................................NAMELIST etc....................
    time mpprun LINK_TO_MASTERODB -maladin -vmeteo -eHARM -c001 -t$TSTEP -fh$FCLEN -asli || exit
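The script requests 49 nodes for what is described as a 48-node job. The arithmetic below is a sketch of how the node count follows from the decomposition, assuming the 16 I/O ranks are simply counted together with the compute ranks at 16 ranks per node. RANKS_PER_NODE and NODES are names introduced here for illustration.

```shell
#!/bin/bash
# Values copied from the Arome job script above.
NPROCX=16
NPROCY=48
NPROC_IO=16
RANKS_PER_NODE=16   # matches --ntasks-per-node=16

NPROC=$(( NPROCX * NPROCY ))        # compute ranks
TOTPROC=$(( NPROC + NPROC_IO ))     # compute ranks plus I/O ranks
# Round up so that a partially filled node still counts as a whole node.
NODES=$(( (TOTPROC + RANKS_PER_NODE - 1) / RANKS_PER_NODE ))

echo "compute ranks: $NPROC"
echo "total ranks:   $TOTPROC"
echo "nodes needed:  $NODES"
```

With these values the compute ranks fill 48 nodes and the I/O ranks account for one more, which is consistent with -N 49.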
NPROMA=-32 seems to work fine
Speed up the launching of MPI jobs:
Improve MPI performance by selecting alternative algorithms for some of the MPI routines:
    export I_MPI_ADJUST_ALLREDUCE=6
    export I_MPI_ADJUST_BARRIER=1
    export I_MPI_ADJUST_ALLTOALLV=2
Improve dynamic memory allocation:
    export MALLOC_MMAP_MAX_=0
    export MALLOC_TRIM_THRESHOLD_=-1
Improve performance for larger values of OMP_NUM_THREADS (4 and bigger):
    export KMP_AFFINITY=compact
    export I_MPI_PIN_DOMAIN=omp:platform
Sometimes it can be beneficial to reduce the number of ranks per node. For example, running 15 ranks on each node, each with 2 OpenMP threads, seems to reduce the variability of the runtime. See the example in the table below for 96 nodes.
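To make the 15-rank layout concrete, the sketch below works out what it uses on each node. It assumes the 32 hardware threads per node described earlier and ignores any extra I/O ranks; the variable names are illustrative only.

```shell
#!/bin/bash
NODES=96
RANKS_PER_NODE=15          # would correspond to --ntasks-per-node=15
THREADS_PER_RANK=2         # would correspond to OMP_NUM_THREADS=2
HW_THREADS_PER_NODE=32     # 16 physical cores x 2 hardware threads

USED=$(( RANKS_PER_NODE * THREADS_PER_RANK ))
IDLE=$(( HW_THREADS_PER_NODE - USED ))
TOTAL_RANKS=$(( NODES * RANKS_PER_NODE ))

echo "hardware threads used per node: $USED of $HW_THREADS_PER_NODE"
echo "hardware threads left idle:     $IDLE"
echo "MPI ranks over $NODES nodes:        $TOTAL_RANKS"
```

The idle hardware threads amount to the equivalent of one physical core per node, which may be part of why the runtime varies less, although the source does not state the cause.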
To enable reproducible output, independent of MPI-rank distribution and number of OpenMP-threads:
For very large numbers of MPI ranks (about 2500 and more), there is an additional overhead for each I/O step. It is not yet clear why this happens.
|Total number of nodes||49||65||97||97||145||194|
|Total number of nodes||48||64||96||96||144||192|
    MOD=mod
    FOPT=-noauto -convert big_endian -assume byterecl -openmp -openmp-threadprivate=compat -O3 -fpe0 -fp-model precise -fp-speculation=safe -ftz
    COPT=-O2 -fp-model precise -openmp -fp-speculation=safe -openmp-threadprivate=compat
    DEFS=-DLINUX -DLITTLE -DLITTLE_ENDIAN -DHIGHRES -DADDRESS64 -DPOINTER_64 -D_ABI64 -DBLAS \
      -DSTATIC_LINKING -DINTEL -D_RTTOV_DO_DISTRIBCOEF -DINTEGER_IS_INT \
      -DREAL_8 -DREAL_BIGGER_THAN_INTEGER -DUSE_SAMIO -D_RTTOV_DO_DISTRIBCOEF -DNO_CURSES \
      -DFA=fa -DLFI=lfi -DARO=aro -DOL=ol -DASC=asc -DTXT=txt
    CC=icc -g -traceback -Nmpi
    CCFLAGS=$(COPT) $(DEFS) -Dlinux -DFOPEN64
    FC=ifort -Nmpi -g -traceback
    FCFLAGS=$(FOPT) $(DEFS)
    FCFREE=-free
    FCFIXED=-nofree
    AUTODBL=-r8
    LD=ifort -Nmpi -O3 -g -traceback -fp-model precise -fpe0 -ftz
    LDFLAGS=-pc 64 -openmp
    MKLROOT=/software/apps/intel/composer_xe_2015.1.133/mkl
    # System-dependent libraries - ALWAYS LOADED - (absolute filename or short name) :
    LD_SYS01 = -lpthread -lm
    # INTEL Math Kernel Library
    LD_LANG01 = $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
    LD_LANG02 = $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a
    LD_LANG03 = -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64
    LD_LANG04 = -lmkl_core
    LD_LANG05 = -lmkl_intel_thread
    # MPI:
    LD_MPI01 = -L$(I_MPI_ROOT)/intel64/lib -ldl -lrt -lpthread
    SYSLIBS= $(LD_SYS01) \
      $(LD_LANG01) $(LD_LANG02) $(LD_LANG03) $(LD_LANG04) $(LD_LANG05) $(LD_MPI01) \
      $(GRIB_API_LIB)
    #INCLDIRS=$(GRIB_API_INCLUDE) -I$(NETCDFINCLUDE)
    INCLDIRS=$(GRIB_API_INCLUDE)
    RANLIB=ls -l
    PRESEARCH=-Wl,--start-group
    POSTSEARCH=-Wl,--end-group
    MPIDIR=/software/apps/intel/impi/5.0.2.044/intel64//lib
    MPIDIR_INCL=/software/apps/intel/impi/5.0.2.044/intel64/include
    YACCLEX_LIBS=-lm
    LDCC=icc -Nmpi -O3 -DLINUX -w -lifcore $(LD_MPI01)
    NPES=1
    AUXSOURCES=sources.linux
    # comma-separated list of external module references
    EXTMODS=hdf5