Bi has 16 cores per compute node, just like Krypton. If you have a working job configuration for Krypton, you should be able to run exactly the same job on Bi -- it will just run much faster (typical improvement 50%).
Bi has hyper-threading available, making each physical core appear as two virtual cores. Hyper-threading is not enabled by default: you need to enable it using --ntasks-per-core=2, in which case mpprun starts 32 MPI ranks per compute node. If you do not do that, you get 16 MPI ranks per compute node (as long as you don't change that using other parameters). Hyper-threading makes some applications like Arome run faster (about 10%). See below for more information about hyper-threading and Slurm. Note: during the pilot phase until 2015-02-25, hyper-threading was on by default.
Bi has Intel Xeon E5v3 processors of the "Haswell" generation. Haswell CPUs have improved vectorization with AVX2 instructions. In theory, up to 8 floating-point instructions can be handled per clock cycle (up from 4 using AVX). To benefit from this, you need to recompile your software with high optimization (like -O2 -xCORE-AVX2) or at least link with an external library that has AVX2 support (like Intel's MKL).
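Before recompiling with -xCORE-AVX2, it can be useful to confirm that the node you are on actually reports AVX2. The sketch below is one way to do that on Linux; the helper name has_cpu_flag is hypothetical, and the sketch assumes the usual "flags" line in /proc/cpuinfo.

```shell
#!/bin/sh
# Hypothetical helper (not an NSC or Intel tool): succeeds if the CPU flag
# named in $1 is listed in a cpuinfo-style file ($2, default /proc/cpuinfo).
has_cpu_flag() {
    grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw "$1"
}

if has_cpu_flag avx2; then
    echo "AVX2 supported"
else
    echo "AVX2 not reported"
fi
```

The word match (-w) keeps "avx" from matching "avx2", so the same helper can be used to test for either flag.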
Bi has 64 GB of memory in the thin compute nodes. This is twice as much as on Krypton. The memory speed has also improved: Bi has 1866 MHz DDR4 memory. In low-level memory benchmarks like STREAM, we can see up to 30% improvement. For certain applications, this can lead to substantial speed-up, even without recompiling them.
Bi has Intel Truescale Infiniband (previously known as Qlogic Truescale) -- earlier clusters at NSC have had Infiniband from Mellanox. As a user, you will probably not notice this, but if you are using your own MPI library, you may have to supply special flags or recompile it with "PSM" or "TMI" support to get the best performance. In low-level benchmarks, we have seen that Truescale Infiniband is especially strong at small messages (high "packet rate").
Bi has the Slurm job scheduling system, like earlier clusters at NSC. Below, we present some examples of how to launch parallel jobs with different kinds of parallelization.
This job script will launch e.g. 8 nodes with 16 cores/node. Run like this if you want everything to be as similar as possible to Krypton:
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    ...
    mpprun binary.x
With hyper-threading enabled via --ntasks-per-core=2, mpprun will launch 32 MPI ranks per node:
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    #SBATCH --ntasks-per-core=2
    ...
    mpprun binary.x
In this case, each MPI rank will spawn a number of OpenMP threads. You can have up to 2 OpenMP threads per core. There are many possible combinations. We expect that the following combinations are likely to run well:
16 MPI ranks x 2 OpenMP threads = 1 MPI rank per physical core and 1 OpenMP thread per virtual core. Job script:
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    #SBATCH --ntasks-per-node=16
    ...
    export OMP_NUM_THREADS=2
    mpprun binary.x
2 MPI ranks x 16 OpenMP threads = 1 MPI rank per socket and 16 OpenMP threads on each socket. Job script:
    #!/bin/bash
    #SBATCH -J jobname
    #SBATCH -t HH:MM:SS
    #SBATCH -N 8
    #SBATCH --ntasks-per-node=2
    ...
    export OMP_NUM_THREADS=16
    mpprun binary.x
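Both combinations above fill the 32 virtual cores of a node. As a rough aid for trying other splits, the following sketch derives the OMP_NUM_THREADS value that matches a given --ntasks-per-node. The function name threads_per_rank and the variable VIRTUAL_CORES_PER_NODE are illustrative and not part of Slurm or mpprun; the sketch assumes 16 physical cores with two hyper-threads each.

```shell
#!/bin/sh
# Illustrative helper, not part of Slurm or mpprun.
VIRTUAL_CORES_PER_NODE=32   # assumption: 16 physical cores x 2 hyper-threads

# Print the number of OpenMP threads per MPI rank that fills one node,
# given the number of MPI ranks per node in $1.
threads_per_rank() {
    ranks_per_node=$1
    if [ "$ranks_per_node" -lt 1 ] ||
       [ $(( VIRTUAL_CORES_PER_NODE % ranks_per_node )) -ne 0 ]; then
        echo "ranks per node must divide $VIRTUAL_CORES_PER_NODE evenly" >&2
        return 1
    fi
    echo $(( VIRTUAL_CORES_PER_NODE / ranks_per_node ))
}

threads_per_rank 16   # prints 2
threads_per_rank 2    # prints 16
```

A value that does not divide 32 evenly is rejected rather than rounded, since a rounded split would leave virtual cores idle or oversubscribed.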
Instead of giving the flag --ntasks-per-node, you can also affect the number of tasks per node indirectly by giving e.g. --ntasks-per-core=1. This effectively disables hyper-threading and starts 16 MPI ranks per node. Update: --ntasks-per-core=1 is now the default. Use --ntasks-per-core=2 to enable hyper-threading.
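The effect of --ntasks-per-core on the total rank count can be checked with simple arithmetic. The sketch below assumes 16 physical cores per node and no other flags that change the task layout; total_ranks and CORES_PER_NODE are illustrative names, not Slurm or mpprun features.

```shell
#!/bin/sh
# Illustrative helper, not part of Slurm or mpprun.
CORES_PER_NODE=16   # physical cores per Bi compute node

# Print the number of MPI ranks for $1 nodes with $2 tasks per core.
total_ranks() {
    nodes=$1
    tasks_per_core=$2
    echo $(( nodes * CORES_PER_NODE * tasks_per_core ))
}

total_ranks 8 1   # prints 128 (hyper-threading off, the default)
total_ranks 8 2   # prints 256 (hyper-threading on)
```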
mpiexec.hydra. The startup time can be improved by setting export I_MPI_HYDRA_PMI_CONNECT=alltoall in the job script. Please note that the IntelMPI module and the mpprun program do this automatically for you.
export KMP_AFFINITY=scatter to change thread affinity.
The /software/apps directory with precompiled software is a work in progress and may not be available from day one of pilot testing.
export MKL_CBWR="AVX2" or export MKL_CBWR="AVX".
These are some specific tips for NEMO, supplied by Torgny and based on the vendor's own testing. Suitable compiler options are:
    %FC ifort -c -cpp -Nmpi
    %FCFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
    %FFLAGS -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive
    %LD ifort -O3 -fp-model precise -assume byterecl -convert big_endian -Nmpi
Example batch script for a 16-node NEMO run. Here, we are not using hyper-threading, as NEMO does not benefit from it. There is also no OpenMP usage.
    #!/bin/sh
    #SBATCH -N 16
    #SBATCH -t 01:00:00
    ...
    time mpprun -np 255 ./nemo.exe
    ...
Some early experiences from the Arome benchmarking.
See example Arome "makeup" file below.
Suppose we want to run a 48 node Arome job using Intel MPI. In this case, we want to:
The script would look like:
    #!/bin/sh
    #SBATCH -J Forecast
    #SBATCH -N 49
    #SBATCH --ntasks-per-node=16
    #SBATCH -t 01:00:00
    ...
    export NPROCX=16
    export NPROCY=48
    export NPROC_IO=16
    export NPROC=$(( $NPROCX * $NPROCY ))
    export TOTPROC=$(( $NPROCX * $NPROCY + $NPROC_IO ))
    export NSTRIN=$NPROC
    export NSTROUT=$NPROC
    export OMP_NUM_THREADS=2
    export KMP_STACKSIZE=128m
    ... NAMELIST etc. ...
    time mpprun LINK_TO_MASTERODB -maladin -vmeteo -eHARM -c001 -t$TSTEP -fh$FCLEN -asli || exit
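The script requests 49 nodes for what is described as a 48-node job. The arithmetic behind this can be sketched as follows, using the values from the script: the 16 x 48 compute ranks fill 48 nodes at 16 ranks per node, and the 16 extra I/O ranks occupy one more node. RANKS_PER_NODE and NODES are names introduced here for illustration only.

```shell
#!/bin/sh
# Values taken from the Arome job script.
NPROCX=16
NPROCY=48
NPROC_IO=16
RANKS_PER_NODE=16   # illustrative name; matches --ntasks-per-node=16

NPROC=$(( NPROCX * NPROCY ))       # compute ranks
TOTPROC=$(( NPROC + NPROC_IO ))    # compute ranks plus I/O ranks
# Round up to whole nodes.
NODES=$(( (TOTPROC + RANKS_PER_NODE - 1) / RANKS_PER_NODE ))

echo "compute ranks: $NPROC, total ranks: $TOTPROC, nodes: $NODES"
# prints: compute ranks: 768, total ranks: 784, nodes: 49
```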
NPROMA=-32 seems to work fine.
Speed up the launch of MPI jobs:
Improve MPI performance by selecting alternative algorithms for some of the MPI routines:
    export I_MPI_ADJUST_ALLREDUCE=6
    export I_MPI_ADJUST_BARRIER=1
    export I_MPI_ADJUST_ALLTOALLV=2
Improve dynamic memory allocation:
    export MALLOC_MMAP_MAX_=0
    export MALLOC_TRIM_THRESHOLD_=-1
Improve performance for larger values of OMP_NUM_THREADS (4 and bigger):
    export KMP_AFFINITY=compact
    export I_MPI_PIN_DOMAIN=omp:platform
Sometimes it can be beneficial to reduce the number of ranks. For example, running 15 ranks on each node, each with 2 OpenMP threads, seems to reduce the variability of the runtime. See the example in the table below for 96 nodes.
To enable reproducible output, independent of MPI-rank distribution and number of OpenMP-threads:
For a very large number of MPI ranks (about 2500 or more), there is an additional overhead for each I/O step; it is not yet clear why.
|Total number of nodes||49||65||97||97||145||194|
|Total number of nodes||48||64||96||96||144||192|
    MOD=mod
    FOPT=-noauto -convert big_endian -assume byterecl -openmp -openmp-threadprivate=compat -O3 -fpe0 -fp-model precise -fp-speculation=safe -ftz
    COPT=-O2 -fp-model precise -openmp -fp-speculation=safe -openmp-threadprivate=compat
    DEFS=-DLINUX -DLITTLE -DLITTLE_ENDIAN -DHIGHRES -DADDRESS64 -DPOINTER_64 -D_ABI64 -DBLAS \
         -DSTATIC_LINKING -DINTEL -D_RTTOV_DO_DISTRIBCOEF -DINTEGER_IS_INT \
         -DREAL_8 -DREAL_BIGGER_THAN_INTEGER -DUSE_SAMIO -D_RTTOV_DO_DISTRIBCOEF -DNO_CURSES \
         -DFA=fa -DLFI=lfi -DARO=aro -DOL=ol -DASC=asc -DTXT=txt
    CC=icc -g -traceback -Nmpi
    CCFLAGS=$(COPT) $(DEFS) -Dlinux -DFOPEN64
    FC=ifort -Nmpi -g -traceback
    FCFLAGS=$(FOPT) $(DEFS)
    FCFREE=-free
    FCFIXED=-nofree
    AUTODBL=-r8
    LD=ifort -Nmpi -O3 -g -traceback -fp-model precise -fpe0 -ftz
    LDFLAGS=-pc 64 -openmp
    MKLROOT=/software/apps/intel/composer_xe_2015.1.133/mkl
    # System-dependent libraries - ALWAYS LOADED - (absolute filename or short name) :
    LD_SYS01 = -lpthread -lm
    # INTEL Math Kernel Library
    LD_LANG01 = $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
    LD_LANG02 = $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a
    LD_LANG03 = -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64
    LD_LANG04 = -lmkl_core
    LD_LANG05 = -lmkl_intel_thread
    # MPI:
    LD_MPI01 = -L$(I_MPI_ROOT)/intel64/lib -ldl -lrt -lpthread
    SYSLIBS= $(LD_SYS01) \
             $(LD_LANG01) $(LD_LANG02) $(LD_LANG03) $(LD_LANG04) $(LD_LANG05) $(LD_MPI01) \
             $(GRIB_API_LIB)
    #INCLDIRS=$(GRIB_API_INCLUDE) -I$(NETCDFINCLUDE)
    INCLDIRS=$(GRIB_API_INCLUDE)
    RANLIB=ls -l
    PRESEARCH=-Wl,--start-group
    POSTSEARCH=-Wl,--end-group
    MPIDIR=/software/apps/intel/impi/5.0.2.044/intel64//lib
    MPIDIR_INCL=/software/apps/intel/impi/5.0.2.044/intel64/include
    YACCLEX_LIBS=-lm
    LDCC=icc -Nmpi -O3 -DLINUX -w -lifcore $(LD_MPI01)
    NPES=1
    AUXSOURCES=sources.linux
    # comma-separated list of external module references
    EXTMODS=hdf5