Gimle User Guide

Short Description

Bore/Gimle is a Linux-based cluster with 140 HP ProLiant DL160 G5 and 128 HP ProLiant DL170h G6 compute servers, with a combined peak performance of 20 Tflops. Each DL160 compute server is equipped with two quad-core Intel® Xeon® E5462 processors, while each DL170h compute server contains two quad-core Intel® Xeon® E5520 processors. The installation also includes a total of 7 ProLiant DL380 G5 system servers that handle cluster storage and administration tasks. In total, the cluster has over 4.5 TiB of main memory. The compute nodes communicate over a high-speed network based on InfiniBand equipment from Cisco and Voltaire.

The compute servers are split between the Bore and Gimle parts of the cluster. The Bore part of the cluster is dedicated to weather forecast production. The Gimle part of the cluster, the topic of the rest of this guide, is used for research and development.

As of this writing (May 30, 2011), the Gimle part has 108 + 120 nodes, with the rest assigned to Bore.

The environment on Gimle is based on the modern environment developed for the SNIC cluster Neolith. Please take the time to learn more about Gimle from the information in this user guide.

Hardware

Processor       DL160 G5: Intel Xeon E5462, quad-core, 2.80 GHz, 6 MB L2 cache
                DL170h G6: Intel Xeon E5520, quad-core, 2.26 GHz, 8 MB L3 cache
Interconnect    InfiniBand ConnectX (both node types)
Node memory     DL160 G5: 16 GiB (32 GiB on two "fat" nodes)
                DL170h G6: 24 GiB

Software

Operating system CentOS 5 x86_64
Resource Manager SLURM
Scheduler Moab
Compilers Intel compiler collection
Math libraries Intel Math Kernel Library (MKL)
MPI Scali MPI, OpenMPI
Applications See Application Software
[top]

Quickstart Guide

  1. Use ssh to access the system

    When you have received a username and a password from NSC, log in to Gimle using ssh:
          $ ssh username@gimle.nsc.liu.se
       
    
  2. Change your password

    As soon as possible after receiving the username and initial password, log in and change your password. The system should prompt you for a new password automatically. You may change it again later with the command:
          $ passwd
    
    See more details on security.
  3. Compile a program

    To compile a parallel (MPI) program, load the appropriate MPI module and add the "-Nmpi" compiler flag [more details]. To compile a Fortran program, do:
          $ module add scampi
          $ ifort -Nmpi mpiprog.f
              
    
    or, when compiling a C-program, do:
          $ module add scampi
          $ icc -Nmpi mpiprog.c
              
    
  4. Run an application

    Run the application as a batch job [more details]:
    1. Create a submit script. This file contains information about how many nodes you wish to use, how long you expect the job to run, how to start the application, etc.
    2. Submit the job:
            $ sbatch script.sh   
      
      Note: the maximum walltime for jobs is 7 days
  5. Get support

    If you need support, please contact smhi-support@nsc.liu.se.
[top]

Security and Accessing the System

Accessing the System

Log in to Gimle with ssh

To log into the system, use the username provided to you by NSC, and issue

         $ ssh username@gimle.nsc.liu.se
  

Unix:
ssh (OpenSSH) is most likely installed on your Linux, Solaris, or Mac OS X machine.

Windows:
PuTTY is a commonly used free SSH implementation (there are also other alternatives).

Both OpenSSH and PuTTY can be used for "X forwarding": With ssh add the command line flag -X, or with PuTTY toggle "Enable X11 forwarding" in the preferences. Note that using X forwarding may require additional configuration of your local machine, e.g. you will need an X server. Please consult your local system administrator if you run into trouble.
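For example, to log in with X forwarding enabled using OpenSSH:

         $ ssh -X username@gimle.nsc.liu.se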

File transfer is available using scp, sftp, or sshfs

  • scp is a tool useful for copying single (or a few) files to or from a remote system. To copy a local file named local-file to your home directory on Gimle, issue
            $ scp local-file username@gimle.nsc.liu.se:
          
    See the scp man pages for further information.
  • sftp is an interactive file transfer program, similar to ftp. Example:
            $ sftp username@gimle.nsc.liu.se:testdir
            Connecting to gimle.nsc.liu.se...
            Changing to: /home/username/testdir
            sftp> ls
            file-1  file-2
            sftp> get file-2
            Fetching /home/username/testdir/file-2 to file-2
      
    
    For additional information about sftp, see the sftp man page.
  • sshfs is a "user space file system" which allows for transparent file system access to remote machines. Example:
            $ mkdir mnt
            $ ls mnt
            $ sshfs username@gimle.nsc.liu.se:testdir mnt
            $ ls mnt
            file-1  file-2
            $ fusermount -u mnt
            $ ls mnt
      
    
    The use of sshfs can be very convenient, but is often not available by default. Consult your local system administrator to see if sshfs is available for your desktop machine.

Security

When a system is compromised and passwords are stolen, the greatest damage is done when a stolen password can be used on more than one system. If a user who has accounts on many different computers gets a shared password stolen, intruders can easily cross administrative domains and compromise further systems.

  • DO NOT use a trivial password based on your name, account, dog's name, etc.
  • DO NOT share passwords between different systems.

Logging into a system and traversing from that system to another one in a chain (as illustrated below) should be avoided.

[Figure: login_recommendation – illustration of chained ssh logins between systems, which should be avoided]

When logging into a system, please check the “last login” information shown. If you can't verify the information, contact smhi-support@nsc.liu.se as soon as possible.

Checklist:

  • Use different passwords for different systems.

  • Do not use weak passwords.

  • Avoid chains of ssh sessions.

  • Check: “Last login: DATE from MACHINE”

SSH Public-key Authentication

There is an alternative to traditional passwords. This method of authentication is known as key-pair or public-key authentication. While a password is simple to understand (the secret is in your head until you give it to the ssh server which grants or denies access), a key-pair is somewhat more complicated.

A key-pair is, as the name suggests, a pair of cryptographic keys: a private key (which should be kept secure and protected with a pass phrase) and a public key (which, as the name suggests, can be passed around freely).

After you have created the pair, you have to copy the public key to all systems to which you wish to establish an ssh connection. The private key is kept as secure as possible and protected with a good pass phrase. On your laptop/workstation you use a key-agent to hold the private key while you work. Benefits and drawbacks:

  • Can be much more secure than regular password authentication.

  • Can be less secure if used incorrectly (understand before use).

  • Allows multiple logins without reentering password/pass phrase.

  • Allows safer use of ssh chains when they are necessary.

Short description of SSH public-key authentication (see also Chapter 4 in SSH tips, tricks & protocol tutorial by Damien Miller):

  • Generate a key-pair (ssh-keygen with OpenSSH), choose a good pass phrase, and make sure the private key is secure (once).

  • Put your public key into the ~/.ssh/authorized_keys file on systems you want to access in this manner.

  • Load your private key into your local key-agent (ssh-add with OpenSSH).

  • Run ssh, scp, or sshfs all you want without reentering your pass phrase, without the risk of anyone stealing your password.
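
With OpenSSH, the steps above might look roughly like the following (the key type and file names are only examples; if ssh-copy-id is not available on your machine, you can instead append the contents of your public key file to ~/.ssh/authorized_keys on Gimle manually):

    $ ssh-keygen -t rsa                        # generate the key-pair; choose a good pass phrase
    $ ssh-copy-id username@gimle.nsc.liu.se    # install the public key in ~/.ssh/authorized_keys on Gimle
    $ ssh-add                                  # load the private key into your local key-agent
    $ ssh username@gimle.nsc.liu.se            # log in without retyping the pass phrase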

[top]

Storage

Available file systems

Users have access to different file systems on Gimle. Below is a list of available file systems and their respective total sizes. Note, however, that the available size per user may be limited by quotas. Use the command
$ quota -s
to see your own quotas.

Mount point                                            Size              Comment
/home                                                  ~4 TiB            Backed up
/nobackup/rossbyN, /nobackup/fouoN, /nobackup/smhidN   ~8–320 TiB each   Not backed up. Storage shared with the Vagn cluster.
/scratch/local                                         ~35–190 GiB       Not backed up, automatically cleared after each job
/software                                              ~40 GiB           Read-only access, software installed by NSC

/home, used for important data

The home file system is mounted at /home on each machine in the cluster and is backed up on a daily basis. Each user has their own home directory (see the environment variable HOME).

/nobackup, used for scratch data

The nobackup file systems are mounted on subdirectories of /nobackup/ on each machine in the cluster and are not backed up. Each user has their own directory /nobackup/filesystem/$USER (where $USER is the username of the corresponding user).

/scratch/local, used for local scratch data

On each compute node, there is a node-local file system mounted at /scratch/local. This can be useful for certain applications; a sketch of how a batch script might use it is shown below.
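
A hypothetical sketch of a batch script using the node-local scratch area (the file names and the binary myprog.x are placeholders, not an actual application on Gimle):

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 01:00:00

# Placeholder names: indata.dat, result.dat, and myprog.x stand for your own files and binary.
cp indata.dat /scratch/local/
cd /scratch/local
mpprun $SLURM_SUBMIT_DIR/myprog.x
# Copy results back before the job ends; /scratch/local is cleared after each job.
cp result.dat $SLURM_SUBMIT_DIR/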

/software, contains applications

Common applications installed by NSC are found on the /software file system, which is accessible from every machine in the cluster. This file system is not user writable.

Publishing data to non-Gimle users

Gimle is connected to the SMHI Publisher system, which allows Gimle users to copy data to a publishing server, from where it can be downloaded by users without the need for a Gimle account.

Please read the Publisher User Guide for more information.

[top]

Environment

We use cmod (module) to handle the environment when several versions of the same software are installed. This application sets up the correct paths to the binaries, man pages, libraries, etc. for the currently selected module.

The correct environment is set up by using the module command. A list of some subcommands to module includes:

module

lists the available subcommands

module list

lists currently loaded modules

module avail

lists the modules available for use

module load example

loads the environment specified in the module named example

module unload example

unloads the environment specified in the module named example

A default environment is automatically declared when you log in. The default modules are:

[username@gimle ~]$ module list
Currently loaded modules:
  1) ifort
  2) icc
  3) idb
  4) dotmodules
  5) base-config
  6) default

To find out which version of the compiler the module ifort refers to, you may list all modules:

[username@gimle ~]$ module avail

In directory /etc/cmod/modulefiles:

  -base-config/1 (def)           -ifort/9.1                   
  -base-config/default           -ifort/9.1.052               
  +default                       -ifort/default               
  +dotmodules                    -intel/10.1                  
  -icc/10.1 (def)                -intel/9.1                   
  -icc/10.1.011                  -intel/default               
  -icc/10.1.017                  -mkl/10.0.3.020 (def)        
  -icc/9.1                       -mkl/9.1.023                 
  -icc/9.1.052                   -mkl/default                 
  -icc/default                   -openmpi/1.2.3-g411          
  -idb/10.1 (def)                -openmpi/1.2.3-i100025       
  -idb/10.1.011                  -openmpi/1.2.4-i100026       
  -idb/10.1.017                  -openmpi/1.2.5-i101011 (def) 
  -idb/9.1                       -openmpi/default             
  -idb/9.1.052                   -pyenv/default               
  -idb/default                   -pyenv/nsc1 (def)            
  -ifort/10.1 (def)              -scampi/3.12.0-1 (def)       
  -ifort/10.1.011                -scampi/default              
  -ifort/10.1.017              

The note "(def)" indicates which version that is the default, and, in case of the Fortran compiler, it is thus version 10.1. Please note, however, that the choice of default module may change over time. Therefore, if you wish to re-compile part of a program and link a new executable, you may need to ensure that you are using the same version of the compiler that you had at the time of the first built. You can switch to another version of the compiler as follows:

[username@gimle ~]$ module list        
Currently loaded modules:
  1) ifort
  2) icc
  3) idb
  4) dotmodules
  5) base-config
  6) default
[username@gimle ~]$ module unload ifort
[username@gimle ~]$ module list
Currently loaded modules:
  1) icc
  2) idb
  3) dotmodules
  4) base-config
  5) default
[username@gimle ~]$ module load ifort/9.1.052
[username@gimle ~]$ module list
Currently loaded modules:
  1) icc
  2) idb
  3) dotmodules
  4) base-config
  5) default
  6) ifort/9.1.052

Hint: The environment is specified in the files located under /etc/cmod/modulefiles.

Resource Name Environment Variable

If you are using several NSC resources and copying scripts between them, it can be useful for a script to have a way of knowing what resource it is running on. You can use the NSC_RESOURCE_NAME variable for that:

[username@gimle ~]$ echo "Running on $NSC_RESOURCE_NAME"
Running on gimle
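
For example, a job script shared between several NSC resources could branch on the variable (a hypothetical sketch; the module choice is only a placeholder):

# Sketch: load Gimle-specific modules only when the script runs on Gimle.
if [ "$NSC_RESOURCE_NAME" = "gimle" ]; then
    module add scampi
fi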
[top]

Compiling

We recommend using the Intel compilers: ifort (Fortran), icc (C), and icpc (C++).

Compiling OpenMP Applications

Example: compiling the OpenMP-program, openmp.f with ifort:

        $ ifort -openmp openmp.f

Example: compiling the OpenMP-program, openmp.c with icc:

        $ icc -openmp openmp.c

Compiling MPI Applications

Before compiling an MPI application you should load an MPI module. We recommend the Scali MPI, which is added to your environment with the command:

        $ module add scampi

Example: compiling the MPI-program, mpiprog.f with ifort:

        $ ifort -Nmpi mpiprog.f 
Where mpiprog.f is:
      program mpiprog
      implicit none
      include "mpif.h"
C
      integer error, rank, size
C     
      call mpi_init(error)
      call mpi_comm_rank(mpi_comm_world,rank,error)
      call mpi_comm_size(mpi_comm_world,size,error)
C
      print *, "Rank number", rank, " of total", size, "."
C
      call mpi_finalize(error)
C
      end program mpiprog

Example: compiling the MPI-program, mpiprog.c with icc:

        $ icc -Nmpi mpiprog.c

Compiler Wrappers

When invoking any of the Intel compilers (icc, ifort, or icpc), a wrapper script looks for Gimle-specific options. Options starting with -N are used by the wrapper to affect the compilation and/or linking process, but these options are not passed to the compiler itself.

-Nhelp
Write wrapper-help
-Nverbose
Let the wrapper be more verbose
-Nmkl
Make the compiler compile and link against the currently loaded MKL-module
-Nmpi
Make the compiler compile and link against the currently loaded MPI-module
-Nmixrpath
Make the compiler link a program built with both icc/icpc and ifort

For example:

$ module load mkl
$ ifort -Nverbose -Nmkl -o example example.F -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -openmp -lpthread
ifort INFO: Linking with MKL mkl/10.0.3.020.
ifort INFO: -Nmkl resolved to: -I/software/intel/mkl/10.0.3.020/include -L/software/intel/mkl/10.0.3.020/lib/em64t -Wl,--rpath,/software/intel/mkl/10.0.3.020/lib/em64t

The wrappers add tags to the executables with information regarding the compilation and linking. You may use the dumptag command to get a list of these labels:

[user@gimle ~]$ dumptag mpiprog.x 
-- NSC-tag ----------------------------------------------------------
File name:              /home/kent/mpiprog.x

Properly tagged:        yes
Tag version:            4
Build date:             080702
Build time:             142958
Built with MPI:         scampi 3_12_0_1
Built with MKL:         no (or build in an unsupported way)
Linked with:            ifort 10_1_011
---------------------------------------------------------------------
[user@gimle ~]$ 

Useful Options for the Intel Compilers

Below is a short list of useful compiler options.
The manual pages "man ifort" and "man icc" contain more details, and further information is also found at the Intel homepage [here].

Optimization

There are three different optimization levels in Intel's compilers and then some more knobs to turn:
-O0

Disable optimizations.

-O1,-O2 

Enable optimizations (DEFAULT).

-O3

Enable -O2 plus more aggressive optimizations that may not improve performance for all programs.

-ip

Enables interprocedural optimizations for single file compilation.

-ipo

Enables multifile interprocedural (IP) optimizations (between files).
Hint: If your build process uses ar to create .a archives, you need to use xiar (Intel's implementation) instead of the system's /usr/bin/ar for an IPO build to work.

-xS

Optimize for the processors in Gimle. This can generate Streaming SIMD Extensions 4 (SSE4) Vectorizing Compiler and Media Accelerators instructions.

-xH

Optimize for the processors in the new nehalem partition of Gimle. Code compiled with this option cannot run in the old harpertown partition. (Note: for version 12 compilers, use -xSSE4.2 instead.)

Recommended optimization options

-O2 -mp

Safe

-O2 -xS

Default

-O3 -xS [ -ip | -ipo]

Aggressive
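
For example, an aggressively optimized build might be compiled as follows (myprog.f90 is just a placeholder file name):

        $ ifort -O3 -xS -ipo -o myprog myprog.f90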

Debugging

-g

Generate symbolic debug information.

-traceback

Generate extra information in the object file to allow the display of source file traceback information at runtime when a severe error occurs.

-fpe<n>

Specifies floating-point exception handling at run-time.

-mp

Maintains floating-point precision (while disabling some optimizations).

Profiling

-p

Compile and link for function profiling with UNIX gprof tool.

Options that only apply to Fortran programs

-assume byterecl

Specifies (for unformatted data files) that the units for the OPEN statement RECL specifier (record length) value are in bytes, not longwords (four-byte units). For formatted files, the RECL unit is always in bytes.

-r8

Set default size of REAL to 8 bytes.

-i8

Set default size of integer variables to 8 bytes.

-zero 

Implicitly initialize all data to zero.

-save

Save variables (static allocation) except local variables within a recursive routine; opposite of -auto.

-CB

Performs run-time checks on whether array subscript and substring references are within declared bounds.
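
Putting some of these options together, a typical debug build of a Fortran program might look like the following (myprog.f is a placeholder file name):

        $ ifort -O0 -g -traceback -CB -o myprog.debug myprog.f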

Miscellaneous

Little-endian to big-endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When it is set, the following operations are performed:
  • The WRITE operation converts little endian format to big endian format.
  • The READ operation converts big endian format to little endian format.
F_UFMTENDIAN = big 

Convert all files.

F_UFMTENDIAN ="big;little:8" 

All files except those connected to unit 8 are converted.
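
For example, to run a Fortran program so that all unformatted files are converted (myprog.x is a placeholder):

$ export F_UFMTENDIAN=big
$ ./myprog.x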

[top]

Math libraries

MKL, Intel Math Kernel Library

The Intel Math Kernel Library (MKL) is available, and we strongly recommend using it. Several versions of MKL may exist; you can see which versions are available with the "module avail" command. The instructions here are valid for MKL 10.0 and newer; older versions worked differently.

The library includes the following groups of routines:

  • Basic Linear Algebra Subprograms (BLAS):

    • vector operations

    • matrix-vector operations

    • matrix-matrix operations

  • Sparse BLAS (basic vector operations on sparse vectors)

  • Fast Fourier transform routines (with Fortran and C interfaces). There exist wrappers for FFTW 2.x and FFTW 3.x compatibility.

  • LAPACK routines for solving systems of linear equations

  • LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations

  • ScaLAPACK routines including a distributed memory version of BLAS (PBLAS or Parallel BLAS) and a set of Basic Linear Algebra Communication Subprograms (BLACS) for inter-processor communication.

  • Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces).

Full documentation can be found online at http://www.intel.com/software/products/mkl/ and in ${MKL_ROOT}/doc on Gimle.

Library structure

The Intel MKL is located in the /software/intel/mkl/ directory. The MKL consists of two parts: a linear algebra package and processor-specific kernels. The former part contains LAPACK and ScaLAPACK routines and drivers that are optimized without regard to processor type, so that they can be used effectively on different processors. The latter part contains processor-specific kernels such as BLAS, FFT, BLACS, and VML that are optimized for the specific processor.

Linking with MKL

To use the LAPACK and BLAS software you must link several libraries: MKL LAPACK and either the threaded or the sequential kernel. The required MKL path is automatically added by the compiler wrapper if the option -Nmkl is given and the appropriate MKL module is loaded.

This table lists the most common MKL link options. See the following chapter for examples.

-Nmkl

Add required paths corresponding to the loaded MKL module.

-lmkl_lapack

Use MKL LAPACK and BLAS

-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread

Use threaded MKL

-lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Use sequential MKL

MKL and threading

Whether threaded or sequential MKL gives the best performance varies between applications. MPI applications typically launch one MPI rank on each processor core on each node; in this case threads are not needed, as all cores are already in use. However, if you use threaded MKL you can start fewer ranks per node and increase the number of threads per rank accordingly.

The threading of MKL can be controlled at run time through the use of a few special environment variables.

  • OMP_NUM_THREADS controls how many OpenMP threads should be started by default. This variable affects all OpenMP programs, including the MKL library.
  • MKL_NUM_THREADS controls how many threads MKL routines should spawn by default. This variable affects only the MKL library and takes precedence over any OMP_NUM_THREADS setting.
  • MKL_DOMAIN_NUM_THREADS lets the user control individual parts of the MKL library. E.g. MKL_DOMAIN_NUM_THREADS="MKL_ALL=1;MKL_BLAS=2;MKL_FFT=4" would instruct MKL to use one thread by default, two threads for BLAS calculations, and four threads for FFT routines. MKL_DOMAIN_NUM_THREADS also takes precedence over OMP_NUM_THREADS.
If the OpenMP environment variable controlling the number of threads is unset when launching an MPI application with mpprun, mpprun will by default set OMP_NUM_THREADS=1.
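
As a sketch of the "fewer ranks, more threads" approach described above, a batch script for a binary linked with threaded MKL might look like the following (myprog.x is a placeholder):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
#SBATCH -t 00:30:00

# Two nodes, two tasks in total: one MPI rank per node.
# Let the threaded MKL routines use the node's eight cores.
export MKL_NUM_THREADS=8
mpprun ./myprog.x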

Example, dynamic linking using ifort and lapack


Use MKL LAPACK and threaded MKL:
$ module load mkl
$ ifort -Nmkl -o example example.o -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -openmp -lpthread
ifort INFO: Linking with MKL mkl/10.0.3.020.

Use MKL LAPACK and sequential MKL:
$ module load mkl
$ ifort -Nmkl -o example example.o -lmkl_intel_lp64 -lmkl_sequential -lmkl_lapack -lmkl_core
ifort INFO: Linking with MKL mkl/10.0.3.020.

Example, linking with MKL ScaLAPACK and OpenMPI

ScaLAPACK depends on BLACS, LAPACK, and BLAS (in that order), where the BLACS library also depends on an underlying MPI. Therefore, it is important to choose the correct combination of libraries in the right order when linking a program with ScaLAPACK. MKL is shipped with BLACS-libraries which are precompiled for OpenMPI and IntelMPI (the latter is not installed on Gimle). To link a program with ScaLAPACK and OpenMPI:
$ module load mkl
$ module load openmpi
$ ifort -Nmkl -Nmpi -o my_binary my_code.f90 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -openmp -lpthread 
ifort INFO: Linking with MPI openmpi/1.2.5-i101011.
ifort INFO: Linking with MKL mkl/10.0.2.018.

Example, linking with ScaLAPACK, alternatives to MKL and OpenMPI

By default we recommend using the above combination (OpenMPI + MKL), but there are alternatives. Both mvapich2 and IntelMPI are derived from the same code base (mpich2), and mvapich2 can usually be used as a drop-in replacement for IntelMPI. Compared to the OpenMPI + MKL example above, use blacs_intelmpi instead of blacs_openmpi. I.e.:
$ module load mkl
$ module load mvapich2
$ ifort -Nmkl -Nmpi -o my_binary my_code.f90 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 \
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -openmp -lpthread 
ifort INFO: Linking with MPI mvapich2/1.0.2-i101011.
ifort INFO: Linking with MKL mkl/10.0.2.018.
It is also possible to use Scali MPI by using the "vanilla" netlib ScaLAPACK and BLACS and linking them against your LAPACK/BLAS of choice. If your choice of LAPACK/BLAS is MKL (generally the best choice):
$ module load mkl
$ module load scampi
$ sppath=/software/libs/scalapack/1.8.0/i101011
$ blpath=/software/libs/BLACS/i101011/LIB-scamp
$ ifort -Nmkl -Nmpi -o my_binary my_code.f90 $sppath/libscalapack.a \
$blpath/blacsF77init_MPI-Gimle-0.a $blpath/blacs_MPI-Gimle-0.a \
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_lapack -lmkl_core -openmp -lpthread
[top]

Executing Parallel Jobs

There are two main alternatives for developing program code that can execute on multiple processor cores: OpenMP and MPI. OpenMP can be used to parallelize code that runs within a single node (with up to 8 cores), whereas MPI is used to parallelize code that can run on a single node as well as on multiple nodes. The two types of applications are executed differently.

Executing an MPI application

An MPI application is started with the command:
$ mpprun mpiprog.x

Use "mpprun --help" to get a list of options and a brief description.

Note:
  • mpprun has to be started from a SLURM job. Either write a batch script and submit it with sbatch, or start an interactive shell using the command interactive [more details].
  • mpprun will launch a number of ranks determined from the SLURM environment variables [more details].
  • mpprun requires an MPI binary built according to NSC recommendations in order to automatically choose the correct MPI implementation [more details].
  • In order to explicitly choose an MPI implementation to use, invoke mpprun with the flag
    --force-mpi=<MPI module>.
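
For example, to explicitly select the Scali MPI module (here simply giving the module name scampi as the argument):

$ mpprun --force-mpi=scampi ./mpiprog.x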
    

Executing an OpenMP application

The number of threads to be used by the application must be defined and should be less than or equal to eight. You can set the number of threads in two ways: either by defining a shell environment variable before starting the application, or by calling an OpenMP library routine in the serial portion of the code.
  1. Environment variable:
    export OMP_NUM_THREADS=N
    time openmp.x
    
  2. Library routine:

    In Fortran:

    SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expression)
    
    In C/C++:
    #include <omp.h>
    void omp_set_num_threads(int num_threads)
    
Note:
  • The maximum number of threads can be queried in your application by use of the external integer function:

    In Fortran:

    INTEGER FUNCTION OMP_GET_MAX_THREADS()
    
    In C/C++:
    #include <omp.h>
    int omp_get_max_threads(void)
    
[top]

Submitting Jobs

The batch queue system consists of two parts: (i) the SLURM resource manager and (ii) the Moab scheduler.

There are two ways to submit jobs to the batch queue system: as an interactive job or as a batch job. Interactive jobs are most useful for debugging, as you get interactive access to the input and output of the job while it is running. The normal way to run applications, however, is to submit them as batch jobs.

Interactive job submission

Interactive access to the compute nodes is provided by the command interactive. This command accepts the same options as the sbatch command described below.

To start an interactive job allocating 2 nodes and 10 cores for 10 minutes, you type

$ interactive -N 2 -n 10 -t 00:10:00

Note: If you leave out the "-n 10" argument in the command, you will by default be given all available cores (in this case 16).

Once your interactive job has started, you are logged in to the first node in the list of nodes assigned to the job. An environment has been created for you that, in addition to ordinary variables, also contains a number of SLURM environment variables:

[user@n212 ~]$ env | grep -i slurm
SLURM_NODELIST=n[212-213]
SLURMD_NODENAME=n212
SLURM_PRIO_PROCESS=0
SLURM_NNODES=2
SLURM_JOBID=5341
SLURM_TASKS_PER_NODE=8(x2)
STY=1755.slurm5341
SLURM_JOB_ID=5341
SLURM_UMASK=0022
SLURM_NODEID=0
SLURM_TASK_PID=1755
SLURM_NPROCS=10
SLURM_PROCID=0
SLURM_JOB_NODELIST=n[212-213]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_GTIDS=0
SLURM_JOB_NUM_NODES=2
[user@n212 ~]$ 

Let us now run the trivial MPI Fortran application given above [mpiprog.f]:

[user@n212 ~]$ mpprun mpiprog.x
mpprun: INFO: using job specified number of tasks
mpprun: INFO: starting scampi run on 2 nodes (10 tasks)
Taking nodenames from "/tmp/tmp.hIniRn1821", number of nodes specified 
by -np /opt/scali/bin/mpimon -stdin all  mpiprog.x  --  n212 5 n213 5
 Rank number           8  of total          10 .
 Rank number           1  of total          10 .
 Rank number           5  of total          10 .
 Rank number           6  of total          10 .
 Rank number           3  of total          10 .
 Rank number           7  of total          10 .
 Rank number           9  of total          10 .
 Rank number           0  of total          10 .
 Rank number           2  of total          10 .
 Rank number           4  of total          10 .
[user@n212 ~]$

Batch job submission

The two main commands for handling job submissions are:
sbatch

Submits a job to the queue system.

scancel JOBID

Deletes a job from the queue system.

Batch jobs are submitted to the queue system with the command sbatch:

$ sbatch -J jobname submit.sh

A minimal submit script that requires 2 nodes and 16 cores for 10 minutes may look like:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00

mpprun ./mpiprog.x

# End of script

We note the use of "#SBATCH" lines in the script. This is an alternative way of specifying options to the sbatch command. We could thus have specified the jobname in the script with an additional line reading

#SBATCH -J jobname

Let us submit the above script:

[user@gimle ~]$ sbatch -J mpiprog submit.sh
sbatch: Submitted batch job 5351
[user@gimle ~]$

After the job has completed, the output to standard out and standard error (if not re-directed) is returned from the system in a file called

slurm-JOBID.out

In this case this is where we find the output from our program:

[user@gimle paralllel_program_test]$ cat slurm-5351.out
mpprun: INFO: number of tasks set to all cores on allocated nodes
mpprun: INFO: starting scampi run on 2 nodes (16 tasks)
Taking nodenames from "/tmp/tmp.IieKGI4556", number of nodes specified by -np
/opt/scali/bin/mpimon -stdin all  ./mpiprog.x  --  n212 8 n213 8
 Rank number           8  of total          16 .
 Rank number          11  of total          16 .
 Rank number          13  of total          16 .
 Rank number          10  of total          16 .
 Rank number          14  of total          16 .
 Rank number          15  of total          16 .
 Rank number           2  of total          16 .
 Rank number          12  of total          16 .
 Rank number           9  of total          16 .
 Rank number           1  of total          16 .
 Rank number           0  of total          16 .
 Rank number           4  of total          16 .
 Rank number           3  of total          16 .
 Rank number           6  of total          16 .
 Rank number           5  of total          16 .
 Rank number           7  of total          16 .
[user@gimle paralllel_program_test]$

Useful options to sbatch are listed with the command

$man sbatch

The most useful options are listed below. They work for the interactive command too.

-N nodes

The number of nodes to run the job on; each node has 8 cores.

-n tasks

The total number of tasks (MPI ranks).

--tasks-per-node tasks

The number of tasks (MPI ranks) per node.

-J jobname

Name of the job.

-t hh:mm:ss

The maximum execution time for the job.

-t days-hh

An alternative specification of the maximum execution time for the job.

-d JOBID

Defer the start of this job until the specified jobid has completed.

--mem MiB

Specify the minimum amount of memory in MiB for the job. If this number exceeds 16000 MiB (16 GiB minus some overhead) your job will be scheduled for execution on the fat (32 GiB) memory nodes.

-p partition

The partition this job should run in (see below). Instead of specifying this, you could set the SBATCH_PARTITION environment variable.
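
For example, to submit the script submit.sh as a job named "myjob" (a placeholder name) on 4 nodes with a maximum walltime of 2 days:

$ sbatch -J myjob -N 4 -t 2-00 submit.sh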

Partitions

As Gimle now contains two types of nodes (which are also connected to separate InfiniBand interconnects), the two types need to be kept apart. That is done using the SLURM partition concept.

To run on the "old" nodes, specify partition harpertown (the codename of that processor generation).

To run on the "new" nodes, specify partition nehalem (the codename of that processor generation).

As of December 2009, different groups at SMHI are assigned to either the "old" or the "new" nodes. We try to set the SBATCH_PARTITION environment variable automatically to make sure that your jobs end up in the right partition. If that does not work, please set SBATCH_PARTITION yourself or use the -p flag to the sbatch and interactive commands, as shown below.
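
For example, to direct a job to the nehalem partition you can either set the environment variable or use the -p flag:

$ export SBATCH_PARTITION=nehalem
$ sbatch submit.sh

or

$ sbatch -p nehalem submit.sh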

Opportunistic jobs ("riskjobb")

Sometimes, all nodes of the system are not running regular jobs, because of project restrictions or system reservations.

To fill them up, you may use opportunistic jobs (our Swedish translation is "riskjobb"). These are able to bypass project and system restrictions, but have two drawbacks:

  • They have a very low queue priority.
  • When a regular job is submitted, it can automatically cancel a running opportunistic job.
There are two variants of opportunistic jobs:
  • The ordinary opportunistic job that needs to be resubmitted when it is cancelled, if you want to run it again. It is submitted with the "-p r_harpertown" or "-p r_nehalem" flag depending on what partition you want to run on. Example:

    sbatch -p r_harpertown script
  • The requeueable opportunistic job that will automatically requeue itself when cancelled, i.e. stay within the batch queue system. It is submitted with the additional flag "--requeue ". Example:

    sbatch -p r_harpertown --requeue script

When using a requeueable opportunistic job, please note that it may be interrupted at any point during execution and later rerun from the start. This works for many applications and scripts, but not for all. You will have to save restart information repeatedly within your job script, and you must be aware that the script might be cancelled in the middle of the saving. A sketch of such a script is shown below.
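
A hypothetical sketch of a job script suitable for requeueable opportunistic use, submitted as shown above (the binary myprog.x, its --restart flag, and the file restart.dat are placeholders; adapt this to how your own application saves and reads restart data):

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 06:00:00

# Resume from the latest restart file if one exists, otherwise start from scratch.
if [ -f restart.dat ]; then
    mpprun ./myprog.x --restart restart.dat
else
    mpprun ./myprog.x
fi
# The application is assumed to write restart.dat periodically, so that a
# cancelled and requeued job can continue from the last saved state.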

An opportunistic job that is cancelled by the system will get a line like the one below in the SLURM output file:

*** JOB 297014 CANCELLED AT 08/28-09:12:40 ***

(You will get the same kind of message if you cancel the job yourself using scancel, but not if it completes or crashes.)

[top]

Supervising Jobs

In many cases it is desirable to supervise your running and scheduled jobs in order to find out if jobs have started or completed, how much remains of the allocated wall clock time, if a job produces sensible results, if a job makes efficient use of the cores, etc.

Get a Quick Overview via the Web

If you need a quick overview of the scheduling status of the cluster, please look at the Scheduling Status for Gimle web page.

Monitor the queue

Useful commands to monitor the queue are:
squeue

Monitor jobs in the queue system (SLURM).

showq

List all jobs visible to the scheduler (Moab).

checkjob

Display numerous scheduling details for a job (Moab).

sinfo

Show node information (SLURM).

sinfo -R

Show reasons for nodes that are drained etc. (SLURM).

Information about a specific user's jobs is obtained with the "squeue" command:

[user@gimle ~]$ squeue -u panor
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   5351   gimle  mpiprog    panor   R       0:01      2 n[91-92]
[user@gimle ~]$

We note that the output from "squeue" includes information about which nodes your application is running on. This information (plus other details) is also available with use of the "checkjob" command:

[user@gimle ~]$ checkjob 28905
job 28905

AName: "cf3cl"
State: Running 
Creds:  user:panor  group:nsc  account:nsc  class:slabanja  qos:Normal
WallTime:   12:11:25:22 of 20:16:00:00
SubmitTime: Mon Feb 18 13:54:07
  (Time Queued  Total: 1:21:50:49  Eligible: 3:30:40)

StartTime: Wed Feb 20 11:44:56
Total Requested Tasks: 8

Req[0]  TaskCount: 8  Partition: slurm  
Memory >= 1M  Disk >= 1M  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
NodeCount:  1

Allocated Nodes:
[n650:8]


StartCount:     3
Partition Mask: [slurm]
StartPriority:  7730245
Reservation '28905' (  - -13days -> 7:13:26:52  Duration: 20:16:00:00)

[user@gimle ~]$ 

Monitor a running job

Applications have various ways to return output from the calculations; some write to standard output (which may be re-directed) whereas others write specific output files that often reside in the scratch directory. In order to list the output of a running calculation in the latter case, you may need to access the local file systems of the compute nodes named "/scratch/local/". This is possible since you are allowed to log in with "ssh" to all compute nodes where you have running applications:

[user@gimle ~]$ ssh n650
Last login: Mon Mar  3 10:28:03 2008 from l1
[user@n650 ~]$ df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sda1                 9844      1496      7848  17% /
tmpfs                     8028         0      8028   0% /dev/shm
/dev/sda3               226365     36184    190181  16% /scratch/local
d1:/home               4194172   1602713   2591460  39% /home
s1:/software             95834     10259     85575  11% /software
[user@n650 ~]$ 

Once logged in to a compute node with a running application, you may monitor the performance of your application with e.g. the "top" command:

[user@n650 ~]$ top -u panor
top - 14:35:09 up 14 days, 23:56,  1 user,  load average: 1.73, 1.69, 1.60
Tasks: 170 total,   2 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.2%us,  3.4%sy,  0.0%ni, 87.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16439708k total, 16353084k used,    86624k free,      880k buffers
Swap:  2047840k total,      180k used,  2047660k free, 14840652k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7615 panor     25   0 1855m 928m 7768 R   99  5.8   6661:50 dalton.x           
 3350 panor     15   0 12712 1164  832 R    0  0.0   0:00.09 top                
 3249 panor     15   0 87504 1668  964 S    0  0.0   0:00.00 sshd               
 3250 panor     16   0 68240 1768 1312 S    0  0.0   0:00.03 bash               
 7596 panor     17   0 65872 1192 1004 S    0  0.0   0:00.00 script             
 7597 panor     23   0 65876 1288 1056 S    0  0.0   0:00.00 dalton             

You can also run a command on each node in a job using srun from the login node as shown in the example below (where uptime is run on every node belonging to job 22684):

[user@gimle ~]$ srun --jobid=22684 uptime
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.45, 7.42
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.44, 7.41
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.43, 7.40
 16:11:32 up 23:35,  0 users,  load average: 7.75, 7.46, 7.43
 16:11:32 up 23:35,  0 users,  load average: 7.79, 7.54, 7.48
 16:11:32 up 23:35,  0 users,  load average: 7.75, 7.46, 7.41
 16:11:32 up 23:35,  0 users,  load average: 7.79, 7.57, 7.50
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.45, 7.41

[top]

Job Scheduling

The priority of your queued job is calculated as the number of minutes your job has been eligible/idle in the queue, ready to run. "An early bird catches the worm." [top]

Debugging and tracing

Standard debugging tools like the GNU debugger gdb and Intel debugger idb are installed on Gimle. There are also a few special programs available to help trace and debug parallel applications.

Intel Trace Analyzer and Collector

This tool was previously named Vampir. It can be used to trace the communication patterns of an MPI application. This is accomplished by recompiling your application and linking it against the trace libraries. The application then writes trace files when it is executed. These files can then be analyzed using the graphical trace analyzer from the login node.

ITAC has several features not described here; full documentation is available in the directory /software/intel/itac/7.1/doc

How to use:
1. Use with Intel MPI. Other implementations might work but are not as well tested.

  $ module add impi
  

2. Load the Intel Trace Analyzer module:

  $ module add itac
  

3. Compile and link the MPI program with the extra CFLAGS "-lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS" and "-Nmpi":

  $ icc mpiprog.c -o mpiprog -Nmpi -lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS
  

4. Run the program with mpprun as usual. This will write trace files in the work directory.

  $ mpprun ./mpiprog
  

5. Open the trace files using the trace analyzer on the login node.

  [faxen@gimle ~]$ traceanalyzer mpiprog-0(mpi:24646@n8).stf
  

TotalView Parallel Debugger

Full documentation for TotalView, including a User Guide is available in the directory /software/apps/toolworks/totalview.8.7.0-7/doc/pdf or at the vendor's website.

License information: There is currently only a single TotalView license installed. If you encounter license availability problems, please contact support@nsc.liu.se so we can consider purchasing more licenses.

Recipe for running TotalView:

1. Make sure that you can run X11 applications on the login node. (start an xterm or something similar to verify)

2. Load the MPI module you use. At the moment, Scali MPI, Intel MPI and OpenMPI (version 1.4.1 and higher) work:

  $ module add scampi

3. Load the TotalView module:

  $ module add totalview 

4. Compile your application with -Nmpi -g to get MPI support and debug information in the binary (of course you need to use ifort instead of icc if your program is using Fortran):

  $ icc -Nmpi -g -o myapp myapp.c

5. Start an interactive job:

  $ interactive -N 1 -t 01:00:00

6. Launch the MPI program with TotalView in the interactive job shell by adding --totalview to the rest of the flags you use with mpprun:

  $ mpprun --totalview ./myapp 

7. Quick Start:

  • Click "OK" in the "Startup Parameters - mpimon" dialog.
  • Click the "Go" button.
  • TotalView detects that you are starting a parallel program, click "Yes" to stop it.
  • It is time to set breakpoints etc.; you are now debugging your MPI program!
  • Reading the TotalView manual is highly recommended!
[top]

List of Acronyms

GiB       gibibyte, 1024**3 bytes
MiB       mebibyte, 1024**2 bytes
MKL       Math Kernel Library
MPI       Message Passing Interface
OpenMP    Open Multi-Processing
scp       secure copy
SLURM     Simple Linux Utility for Resource Management
ssh       secure shell
TiB       tebibyte, 1024**4 bytes
[top]

Frequently Asked Questions

This part will be filled as needed.

[top]




Page last modified: 2012-07-13 09:41
For more information contact us at info@nsc.liu.se.