
Neolith User Guide

Short Description

Neolith is a Linux-based cluster with 805 HP ProLiant DL140 G3 compute servers and a combined peak performance of 60 Tflops. Each compute server is equipped with two quad-core Intel® Xeon® E5345 processors. The installation also includes 13 ProLiant DL380 G5 system servers, which handle cluster storage and administration tasks. In total, the cluster has 14 TB of main memory. The compute nodes communicate over a high-speed network based on Infiniband equipment from Cisco® with a total network bandwidth of more than 32 Tbit/s.

The resource is intended for Swedish academic users and it is equipped with application software that reflects the needs among the Swedish research community in natural sciences. Please take your time and learn more about Neolith from the information in this user guide, and see whether the computer can expand the scope of your scientific work.

Hardware

Processor       Intel Xeon E5345 Quad Core Processor 2.33 GHz, 4 MB Level 2 cache
Interconnect    Infiniband ConnectX interconnect
Node memory     16 GiB or 32 GiB

Software

Operating system    CentOS 5 x86_64
Resource Manager    SLURM
Scheduler           Moab
Compilers           Intel compiler collection
Math libraries      Intel Math Kernel Library (MKL)
MPI                 Scali MPI, OpenMPI
Applications        See Application Software

Quickstart Guide

  1. Use ssh to access the system

    When you have received a username and a password from NSC, log in to Neolith using ssh:
          $ ssh username@neolith.nsc.liu.se
       
    
  2. Change your password

    As soon as possible after receiving the username and initial password, log in and change your password with the command:
          $ passwd
    
    See more details on security.
  3. Compile a program

    To compile a parallel (MPI) program, load the appropriate MPI module and add the "-Nmpi" compiler flag (see "Compiler wrappers"). When compiling a Fortran program, do:
          $ module add scampi
          $ ifort -Nmpi mpiprog.f
              
    
    or, when compiling a C program, do:
          $ module add scampi
          $ icc -Nmpi mpiprog.c
              
    
  4. Run an application

    Run the application as a batch job (see "Submitting jobs"):
    1. Create a submit script. This file contains information about which project the job should be accounted on, how many processors you wish to use, how long you expect the job to run, how to start the application, etc.
    2. Submit the job:
            $ sbatch script.sh   
      
      Note: the current maximum walltime for jobs is set to 3 days.

Security and Accessing the System

Accessing the System

Log in to Neolith with ssh

To log into the system, use the username provided to you by NSC, and issue
         $ ssh username@neolith.nsc.liu.se
  

Unix:
ssh (OpenSSH) is most likely installed on your Linux, Solaris, or Mac OS X machine.

Windows:
PuTTY is a commonly used free SSH implementation (there are also other alternatives). Both OpenSSH and PuTTY can be used for X-forwarding: with ssh, add the command line flag -X; with PuTTY, toggle "Enable X11 forwarding" in the preferences. Note that X-forwarding may require additional configuration of your local machine, e.g. you need an X server. Please consult your local system administrator if you run into trouble.

File transfer is available using scp, sftp, or sshfs

  • scp is a tool for copying a single file or a few files to or from a remote system. To copy a local file named local-file to your home directory on Neolith, issue
            $ scp local-file username@neolith.nsc.liu.se:
      
    
    See the scp man pages for further information.
  • sftp is an interactive file transfer program, similar to ftp. For example:
            $ sftp username@neolith.nsc.liu.se:testdir
            Connecting to neolith.nsc.liu.se...
            Changing to: /home/username/testdir
            sftp> ls
            file-1  file-2
            sftp> get file-2
            Fetching /home/username/testdir/file-2 to file-2
      
    
    For additional information about sftp, see the sftp man page.
  • sshfs is a "user space file system" which allows for transparent file system access to remote machines. Example
            $ mkdir mnt
            $ ls mnt
            $ sshfs username@neolith.nsc.liu.se:testdir mnt
            $ ls mnt
            file-1  file-2
      
    
    The use of sshfs can be very convenient, but is often not available by default. Consult your local system administrator to see if sshfs is available for your desktop machine.

Security

When a system is compromised and passwords are stolen, the most damage is done when a stolen password also works on other systems. If a user with accounts on many different computers has a shared password stolen, intruders can easily cross administrative domains and compromise further systems.

  • DO NOT use a trivial password based on your name, account, dog's name, etc.
  • DO NOT share passwords between different systems.

Avoid logging in to one system and then continuing from that system to a third (as illustrated below).

[Figure: login recommendation]

When logging into a system, read the “last login” information. If you can't verify the information, contact support@nsc.liu.se as soon as possible.

Checklist:

  • Use different passwords for different systems.

  • Do not use weak passwords.

  • Avoid chains of ssh sessions.

  • Check: “Last login: DATE from MACHINE”

SSH public-key authentication

There is an alternative to traditional passwords. This method of authentication is known as key-pair or public-key authentication. While a password is simple to understand (the secret is in your head until you give it to the ssh server which grants or denies access), a key-pair is somewhat more complicated.

A key-pair is, as the name suggests, a pair of cryptographic keys. One of the keys is called the private key (this one should be kept secure and protected with a pass phrase) and the other is called the public key (this one can be passed around freely, as the name suggests).

After you have created the pair, you have to copy the public key to all systems to which you wish to establish an ssh connection. The private key is kept as secure as possible and protected with a good pass phrase. On your laptop/workstation you use a key-agent to hold the private key while you work.

  • Can be much more secure than regular password authentication.

  • Can be less secure if used incorrectly (understand before use).

  • Allows multiple logins without reentering password/pass phrase.

  • Allows safer use of ssh chains.

  • Enables message passing (with e.g. MPI and Linda) between nodes.

Short description of SSH public-key authentication (see also Chapter 4 in SSH tips, tricks & protocol tutorial by Damien Miller):

  • Generate a key-pair on your computer, choose a good pass phrase, and make sure the private key is secure (done once). Use the command ssh-keygen for this.

  • Put your public key into ~/.ssh/authorized_keys on the desired systems. (The script ssh-copy-id can help with this, for example: ssh-copy-id neolith.nsc.liu.se)

  • Load your private key into your key-agent (ssh-add with OpenSSH). NSC recommends using "ssh-add -c", which asks for confirmation every time the key is used and thereby increases security.

  • Run ssh, scp, or sshfs all you want without reentering your pass phrase and without the risk of anyone stealing your password. (If you used ssh-add -c as suggested above, you have to hit enter in the confirmation dialog every time your key is used.)
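
The steps above can be sketched as a short shell function. This is a minimal sketch under stated assumptions, not an NSC-provided script: the function name neolith_key_setup, the helper run, the DRY_RUN switch and the key file name id_rsa_neolith are hypothetical. Only ssh-keygen, ssh-copy-id and "ssh-add -c" come from the text above, and the sketch assumes that a key-agent (ssh-agent) is already running in your session.

```shell
#!/bin/bash
# Sketch of the key-pair setup steps. Names marked "hypothetical" are
# illustrative assumptions, not part of any NSC tooling.

# Hypothetical helper: print the command when DRY_RUN is set, else run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical function: set up public-key login for the given username.
neolith_key_setup() {
    local user="$1"
    local host="neolith.nsc.liu.se"
    local key="$HOME/.ssh/id_rsa_neolith"   # hypothetical key file name

    # 1. Generate the key-pair (once); ssh-keygen prompts for a pass phrase.
    run ssh-keygen -t rsa -b 4096 -f "$key" || return 1

    # 2. Append the public key to ~/.ssh/authorized_keys on Neolith.
    run ssh-copy-id -i "$key.pub" "$user@$host" || return 1

    # 3. Load the private key into the key-agent; -c asks for confirmation
    #    every time the key is used.
    run ssh-add -c "$key"
}
```

Calling "DRY_RUN=1 neolith_key_setup username" prints the three commands without executing them, which is a convenient way to review them before running them for real.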


Storage

Available file systems

Users have access to different file systems on Neolith. Below is a list of available file systems and their respective total sizes. Note, however, that the available size per user may be limited by quotas. Use the nscquota command to see your own quotas and usage:
$ nscquota
FILE SYSTEM                  USED        QUOTA        LIMIT        GRACE
--                           ----         ----         ----        -----
/home                     3.1 GiB     20.0 GiB     30.0 GiB
/nobackup/global         32.0 KiB    200.0 GiB    250.0 GiB

Mount point         Size        Comment
/home               ~4 TiB      Backed up
/nobackup/global    ~15 TiB     Not backed up
/scratch/local      ~220 GiB    Not backed up, automatically cleared after each job
/software           < 1 TiB     Read-only access

home, used for important data

The home file system is mounted at /home on each machine in the cluster, and is backed up on a daily basis. Each user has their own home directory (see the environment variable HOME). Currently (January 2008), the home file system is accessed via NFS, but will be migrated to GPFS in the near future.

nobackup, used for scratch data

The nobackup file system is mounted at /nobackup/global on each machine in the cluster, and is not backed up. Each user has their own directory, /nobackup/global/$USER (where $USER is the user's username). Nobackup is a GPFS file system.

Please use the nobackup file system for files that can be recreated by rerunning computation jobs. Do not store this type of data on /home, as this wastes space in our backup systems.

scratch/local, used as a local scratch dir

On each compute node, there is a node-local file system mounted at /scratch/local. This can be useful for certain types of calculations.
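
Because /scratch/local is cleared automatically after each job, anything worth keeping has to be copied to a persistent file system before the job ends. The following is a minimal sketch of that pattern. The function name run_in_scratch and the SCRATCH_ROOT override are hypothetical and not part of the Neolith environment.

```shell
#!/bin/bash
# Hypothetical helper: run a command in a private directory on node-local
# scratch and copy everything it produced to a persistent directory.
# Usage: run_in_scratch RESULT_DIR COMMAND [ARGS...]
run_in_scratch() {
    local result_dir="$1"; shift
    # SCRATCH_ROOT is an assumed override hook; the default is the
    # node-local file system described above.
    local scratch_root="${SCRATCH_ROOT:-/scratch/local}"
    local workdir
    local status=0

    workdir=$(mktemp -d "$scratch_root/job.XXXXXX") || return 1
    mkdir -p "$result_dir" || return 1

    # Run the command inside the scratch directory (in a subshell, so the
    # caller's working directory is untouched) and remember its exit status.
    ( cd "$workdir" && "$@" ) || status=$?

    # Copy results back before the job ends; scratch is cleared afterwards.
    cp -r "$workdir"/. "$result_dir"/ || status=1
    rm -rf "$workdir"
    return "$status"
}
```

In a submit script this could be used as, for example, run_in_scratch "/nobackup/global/$USER/run1" "$HOME/bin/myprog.x" (a hypothetical program path). Note that relative paths in the command are resolved inside the scratch directory, so the program should be given with an absolute path.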

software, contains applications

Common applications installed by NSC are found on the software file system, which is accessible from every machine in the cluster. This file system is not user writable.

Note:
General Parallel File System (GPFS) is a proprietary cluster file system developed by IBM. The advantages of GPFS compared to NFS are higher performance and better scalability. On Neolith, we are using GPFS in a "typical" cluster configuration; all disks used for GPFS are connected to a storage area network (SAN), and the SAN is accessed through eight dedicated disk servers (NSD-servers in GPFS terminology).

Environment

We use cmod (module) to handle the environment when there exist several installed versions of the same software. This application sets up the correct paths to the binaries, man-pages, libraries, etc. for the currently selected module.

The correct environment is set up by using the module command. Some of the arguments to module are:

module

lists the available arguments

module list

lists currently loaded modules

module avail

lists the available modules for use

module load example

loads the environment specified in the module named example

module unload example

unloads the environment specified in the module named example

A default environment is automatically declared when you log in. The default modules are:

[username@neolith1 ~]$ module list
Currently loaded modules:
  1) ifort
  2) icc
  3) idb
  4) dotmodules
  5) base-config
  6) default

In order to find out which version of the compiler the module ifort refers to, you may list all modules:

[username@neolith1 ~]$ module avail

In directory /etc/cmod/modulefiles:

  -base-config/1 (def)           -ifort/10.0 (def)            
  -base-config/default           -ifort/10.0.025              
  +default                       -ifort/9.1                   
  +dotmodules                    -ifort/9.1.039               
  -icc/10.0 (def)                -ifort/9.1.051               
  -icc/10.0.025                  -ifort/default               
  -icc/9.1                       -intel/10.0                  
  -icc/9.1.051                   -intel/9.1                   
  -icc/default                   -intel/default               
  -idb/10.0 (def)                -openmpi/1.2.3-g411          
  -idb/10.0.025                  -openmpi/1.2.3-i100025 (def) 
  -idb/9.1                       -openmpi/default             
  -idb/9.1.051                   -scampi/3.12.0-1 (def)       
  -idb/default                   -scampi/default              

The note "(def)" indicates which version is the default; in the case of the Fortran compiler, it is thus version 10.0. Please note, however, that the choice of default module may change over time, e.g., the default ifort module in January 2008 is ifort/10.1. Therefore, if you wish to re-compile part of a program and link a new executable, you may need to ensure that you are using the same version of the compiler that you used for the first build. You can switch to another version of the compiler as follows:

[username@neolith1 ~]$ module list        
Currently loaded modules:
  1) ifort
  2) icc
  3) idb
  4) dotmodules
  5) base-config
  6) default
[username@neolith1 ~]$ module unload ifort
[username@neolith1 ~]$ module list
Currently loaded modules:
  1) icc
  2) idb
  3) dotmodules
  4) base-config
  5) default
[username@neolith1 ~]$ module load ifort/9.1.051
[username@neolith1 ~]$ module list
Currently loaded modules:
  1) icc
  2) idb
  3) dotmodules
  4) base-config
  5) default
  6) ifort/9.1.051

Tip: The environment is specified in the files located under /etc/cmod/modulefiles.

Resource Name Environment Variable

If you are using several NSC resources and copying scripts between them, it can be useful for a script to have a way of knowing what resource it is running on. You can use the NSC_RESOURCE_NAME variable for that:

[username@neolith1 ~]$ echo "Running on $NSC_RESOURCE_NAME"
Running on neolith
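
As a minimal sketch of how a script shared between resources might branch on this variable: the function name pick_workdir is hypothetical, only the "neolith" branch reflects this guide, and the fallback branch is an illustrative assumption.

```shell
#!/bin/bash
# Hypothetical helper: choose a working directory based on the resource
# the script is running on.
pick_workdir() {
    case "${NSC_RESOURCE_NAME:-unknown}" in
        neolith)
            # On Neolith, scratch data belongs on the nobackup file system.
            echo "/nobackup/global/$USER"
            ;;
        *)
            # Unknown or unset resource: fall back to the home directory.
            echo "$HOME"
            ;;
    esac
}
```

A script would then use, for example, cd "$(pick_workdir)" before starting its calculation.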

Compiling

We recommend using the Intel compilers: ifort (Fortran), icc (C), and icpc (C++).

Compiling OpenMP applications

Example: compiling the OpenMP-program, openmp.f with ifort:

        $ ifort -openmp openmp.f

Example: compiling the OpenMP-program, openmp.c with icc:

        $ icc -openmp openmp.c

Compiling MPI applications

Before compiling an MPI application you should load an MPI module. We recommend Scali MPI, which is added to your environment with the command:

        $ module add scampi

Example: compiling the MPI-program, mpiprog.f with ifort:

        $ ifort -Nmpi mpiprog.f 
where mpiprog.f is:
      program mpiprog
      implicit none
      include "mpif.h"
C
      integer error, rank, size
C     
      call mpi_init(error)
      call mpi_comm_rank(mpi_comm_world,rank,error)
      call mpi_comm_size(mpi_comm_world,size,error)
C
      print *, "Rank number", rank, " of total", size, "."
C
      call mpi_finalize(error)
C
      end program mpiprog

Example: compiling the MPI-program, mpiprog.c with icc:

        $ icc -Nmpi mpiprog.c

Compiler wrappers

When invoking any of the Intel compilers (icc, ifort, or icpc), there is a wrapper script that looks for Neolith-specific options. Options starting with -N are used by the wrapper to affect the compilation and/or linking processes, but these options are not passed to the compiler itself.

-Nhelp
Write wrapper-help
-Nverbose
Let the wrapper be more verbose
-Nmkl
Make the compiler compile and link against the currently loaded MKL-module
-Nmpi
Make the compiler compile and link against the currently loaded MPI-module
-Nmixrpath
Make the compiler link a program built with both icc/icpc and ifort

For example:

$ module load mkl
$ ifort -Nverbose -Nmkl -o example example.F -lmkl_lapack -lmkl -lguide -lpthread
ifort INFO: Linking with MKL mkl/9.1.023.
ifort INFO: -Nmkl resolved to: -I/software/intel/cmkl/9.1.023/include 
-L/software/intel/cmkl/9.1.023/lib/em64t 
-Wl,--rpath,/software/intel/cmkl/9.1.023/lib/em64t

The wrappers add tags to the executables with information regarding the compilation and linking. You may use the dumptag command to get a list of these labels:

[panor@neolith1 ~]$ dumptag mpiprog.x 
-- NSC-tag ----------------------------------------------------------
File name:              /home/panor/calc/paralllel_program_test/mpiprog.x

Properly tagged:        yes
Tag version:            4
Build date:             080114
Build time:             143824
Built with MPI:         scampi 3_12_0_1
Built with MKL:         no (or build in an unsupported way)
Linked with:            ifort 10_1_011
---------------------------------------------------------------------
[panor@neolith1 ~]$ 

Intel compiler, useful compiler options

Below is a short list of useful compiler options.
The manual pages "man ifort" and "man icc" contain more details, and further information is available on the Intel website.

Optimization

The Intel compilers provide the following optimization levels and related options:
-O0

Disable optimizations.

-O1, -O2

Enable optimizations (-O2 is the default).

-O3

Enable -O2 plus more aggressive optimizations that may not improve performance for all programs.

-ip

Enables interprocedural optimizations for single file compilation.

-ipo

Enables multifile interprocedural (IP) optimizations (between files).
Tip: If your build process uses ar to create .a archives, you need to use xiar (Intel's implementation) instead of the system's /usr/bin/ar for an IPO build to work.

-xT

Optimize for the processors in Neolith, Intel Core(TM)2 Duo family. This can generate SSSE3, SSE3, SSE2, and SSE instructions.

A recommended flag in general is "-O2", and for best performance "-O3 -ipo -xT" or "-O3 -ip -xT". As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.

Debugging

-g

Generate symbolic debug information.

-traceback

Generate extra information in the object file to allow the display of source file traceback information at runtime when a severe error occurs.

-fpe<n>

Specifies floating-point exception handling at run-time.

-mp

Maintains floating-point precision (while disabling some optimizations).

Profiling

-p

Compile and link for function profiling with UNIX gprof tool.

Options that only apply to Fortran programs

-assume byterecl

Specifies (for unformatted data files) that the units for the OPEN statement RECL specifier (record length) value are in bytes, not longwords (four-byte units). For formatted files, the RECL unit is always in bytes.

-r8

Set default size of REAL to 8 bytes.

-i8

Set default size of integer variables to 8 bytes.

-zero 

Implicitly initialize all data to zero.

-save

Save variables (static allocation) except local variables within a recursive routine; opposite of -auto.

-CB

Performs run-time checks on whether array subscript and substring references are within declared bounds.

Miscellaneous

Little endian to Big endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When set, the following operations are done:
  • The WRITE operation converts little endian format to big endian format.
  • The READ operation converts big endian format to little endian format.
F_UFMTENDIAN = big 

Convert all files.

F_UFMTENDIAN ="big;little:8" 

All files except those connected to unit 8 are converted.
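
In a shell or submit script the variable is set before the program starts. A minimal sketch for bash follows; the program name in the comment is hypothetical.

```shell
#!/bin/bash
# Convert all unformatted files except those connected to unit 8.
# The quotes are required: an unquoted ';' would end the shell command.
export F_UFMTENDIAN="big;little:8"

# The exported setting applies to every program started afterwards, e.g.
# (hypothetical program name):
#   ./convert_data.x
```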


Math libraries

MKL, Intel Math Kernel Library

The Intel Math Kernel Library (MKL) is available, and we strongly recommend using it. Several versions of MKL may exist; you can see which versions are available with the "module avail" command. The library includes the following groups of routines:

  • Basic Linear Algebra Subprograms (BLAS):

    • vector operations

    • matrix-vector operations

    • matrix-matrix operations

  • Sparse BLAS (basic vector operations on sparse vectors)

  • Fast Fourier transform routines (with Fortran and C interfaces). There exist wrappers for FFTW 2.x and FFTW 3.x compatibility.

  • LAPACK routines for solving systems of linear equations

  • LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations

  • ScaLAPACK routines including a distributed memory version of BLAS (PBLAS or Parallel BLAS) and a set of Basic Linear Algebra Communication Subprograms (BLACS) for inter-processor communication.

  • Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces).

Full documentation can be found online at http://www.intel.com/software/products/mkl/ and in ${MKL_ROOT}/doc on Neolith.

Library structure

The Intel MKL is located in the /software/intel/mkl/ directory. The MKL consists of two parts: a linear algebra package and processor-specific kernels. The former part contains LAPACK and ScaLAPACK routines and drivers that were optimized without regard to processor, so that they can be used effectively on different processors. The latter part contains processor-specific kernels such as BLAS, FFT, BLACS, and VML that were optimized for the specific processor.

Linking with MKL

To use LAPACK and BLAS software you must link two libraries: MKL LAPACK and the threaded or sequential kernel. The required MKL-path is automatically added by the compiler wrapper if the option -Nmkl is added, and the appropriate MKL-module is loaded.

This table lists the most common MKL link options. See the following chapter for examples.

-Nmkl

Add required paths corresponding to the loaded MKL module.

-lmkl_lapack

Use MKL LAPACK and BLAS

-lmkl -lguide -lpthread

Use threaded MKL

-lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Use sequential MKL (see next chapter).

MKL and threading

The MKL is threaded by default, but there is also a non-threaded "sequential" version available. (The instructions here are valid for MKL 10.0 and newer, older versions worked differently.)

Whether threaded or sequential MKL gives the best performance varies between applications. MPI applications will typically launch one MPI rank on each processor core on each node; in this case threads are not needed, as all cores are already used. However, if you use threaded MKL, you can start fewer ranks per node and increase the number of threads per rank accordingly.

The threading of MKL can be controlled at run time through the use of a few special environment variables.

  • OMP_NUM_THREADS controls how many OpenMP threads are started by default. This variable affects all OpenMP programs, including the MKL library.
  • MKL_NUM_THREADS controls how many threads MKL routines spawn by default. This variable affects only the MKL library, and takes precedence over any OMP_NUM_THREADS setting.
  • MKL_DOMAIN_NUM_THREADS lets the user control individual parts of the MKL library. E.g. MKL_DOMAIN_NUM_THREADS="MKL_ALL=1;MKL_BLAS=2;MKL_FFT=4" would instruct MKL to use one thread by default, two threads for BLAS calculations, and four threads for FFT routines. MKL_DOMAIN_NUM_THREADS also takes precedence over OMP_NUM_THREADS.
If the OpenMP environment variable controlling the number of threads is unset when launching an MPI application with mpprun, mpprun will by default set OMP_NUM_THREADS=1.
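
The trade-off between ranks per node and threads per rank can be sketched as a small calculation. The helper name mkl_threads_per_rank is hypothetical, and the commented submit-script lines assume that mpprun honours the tasks-per-node setting of the job; this is a sketch, not a tested recipe.

```shell
#!/bin/bash
# Hypothetical helper: given the number of MPI ranks per node, print the
# number of MKL threads per rank that fills the 8 cores of a Neolith node.
mkl_threads_per_rank() {
    local ranks_per_node="$1"
    local cores_per_node=8   # each compute node has 8 cores

    if [ "$ranks_per_node" -lt 1 ] || [ "$ranks_per_node" -gt "$cores_per_node" ]; then
        echo "ranks per node must be between 1 and $cores_per_node" >&2
        return 1
    fi
    # Integer division; any remainder cores are left idle.
    echo $(( cores_per_node / ranks_per_node ))
}

# Example use in a submit script (2 ranks per node, 4 MKL threads each):
#   #SBATCH --tasks-per-node 2
#   export MKL_NUM_THREADS=$(mkl_threads_per_rank 2)
#   mpprun ./my_binary
```

Setting MKL_NUM_THREADS rather than OMP_NUM_THREADS limits the effect to MKL routines, in line with the precedence rules listed above.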

Example, dynamic linking using ifort and lapack


Use MKL LAPACK and threaded MKL:
$ module load mkl
$ ifort -Nmkl -o example example.o -lmkl_lapack -lmkl -lguide -lpthread
ifort INFO: Linking with MKL mkl/10.0.2.018.

Use MKL LAPACK and sequential MKL:
$ module load mkl
$ ifort -Nmkl -o example example.o -lmkl_lapack -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
ifort INFO: Linking with MKL mkl/10.0.2.018.

Example, linking with MKL ScaLAPACK and OpenMPI

ScaLAPACK depends on BLACS, LAPACK, and BLAS (in that order), where the BLACS library also depends on an underlying MPI. Therefore, it is important to choose the correct combination of libraries in the right order when linking a program with ScaLAPACK. MKL is shipped with BLACS-libraries which are precompiled for OpenMPI and IntelMPI (the latter is not installed on Neolith). To link a program with ScaLAPACK and OpenMPI:
$ module load mkl
$ module load openmpi
$ ifort -Nmkl -Nmpi -o my_binary my_code.f90 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64 \
-lmkl_lapack -lmkl -lguide -lpthread 
ifort INFO: Linking with MPI openmpi/1.2.5-i101011.
ifort INFO: Linking with MKL mkl/10.0.2.018.

Example, linking with ScaLAPACK, alternatives to MKL and OpenMPI

By default we recommend using the above combination (OpenMPI + MKL), but there are alternatives. Both mvapich2 and IntelMPI are derived from the same code base (mpich2), and mvapich2 can (usually) be used as a drop-in replacement for IntelMPI. Compared to the OpenMPI + MKL example above, use blacs_intelmpi instead of blacs_openmpi, i.e.:
$ module load mkl
$ module load mvapich2
$ ifort -Nmkl -Nmpi -o my_binary my_code.f90 -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 \
-lmkl_lapack -lmkl -lguide -lpthread 
ifort INFO: Linking with MPI mvapich2/1.0.2-i101011.
ifort INFO: Linking with MKL mkl/10.0.2.018.
It is also possible to use ScaliMPI by using the "vanilla" netlib ScaLAPACK and BLACS, and link them against your LAPACK/BLAS of choice. If your choice of LAPACK/BLAS is MKL (generally the best choice):
$ module load mkl
$ module load scampi
$ sppath=/software/libs/scalapack/1.8.0/i101011
$ blpath=/software/libs/BLACS/i101011/LIB-scamp
$ ifort -Nmkl -Nmpi -o my_binary my_code.f90 $sppath/libscalapack.a \
$blpath/blacsF77init_MPI-Neolith-0.a $blpath/blacs_MPI-Neolith-0.a \
-lmkl_lapack -lmkl -lguide -lpthread

Executing parallel jobs

There are two main alternatives for developing programs that can be executed on multiple processor cores: OpenMP and MPI. OpenMP parallelization can be used for code that runs within a single node (with up to 8 cores), whereas MPI is used for code that can run on a single node as well as on multiple nodes. The two types of applications are executed differently.

Executing an MPI application

An MPI application is started with the command:
$ mpprun mpiprog.x

Use "mpprun --help" to get a list of options and a brief description.

Note:
  • mpprun has to be started from a SLURM job. Either write a batch script and submit it with sbatch, or start an interactive shell using the command interactive (see "Submitting jobs").
  • mpprun will launch a number of ranks determined from the SLURM environment variables.
  • mpprun requires an MPI binary built according to NSC recommendations in order to automatically choose the correct MPI implementation (see "Compiling MPI applications").
  • In order to explicitly choose an MPI implementation to use, invoke mpprun with the flag
    --force-mpi=<MPI module>.
    

Executing an OpenMP application

The number of threads to be used by the application must be defined, and should be less than or equal to eight. You can set the number of threads in two ways: either by defining a shell environment variable before starting the application, or by calling an OpenMP library routine in the serial portion of the code.
  1. Environment variable:
    export OMP_NUM_THREADS=N
    time openmp.x
    
  2. Library routine:

    In Fortran:

    SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expression)
    
    In C/C++:
    #include <omp.h>
    void omp_set_num_threads(int num_threads)
    
Note:
  • The maximum number of threads can be queried in your application by use of the external integer function:

    In Fortran:

    INTEGER FUNCTION OMP_GET_MAX_THREADS()
    
    In C/C++:
    #include <omp.h>
    int omp_get_max_threads(void)
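
For batch use, the environment-variable approach can be put into a submit script. The following is a minimal sketch that generates such a script. The function name write_openmp_job, the job name openmp_test and the 10-minute time limit are hypothetical choices; the #SBATCH options themselves are the ones described under "Batch job submission".

```shell
#!/bin/bash
# Hypothetical helper: write a submit script for a single-node OpenMP job.
# Usage: write_openmp_job SCRIPT_PATH NUM_THREADS
write_openmp_job() {
    local script="$1"
    local threads="$2"

    # OpenMP runs within one node, so at most 8 threads are useful.
    if [ "$threads" -lt 1 ] || [ "$threads" -gt 8 ]; then
        echo "number of threads must be between 1 and 8" >&2
        return 1
    fi

    cat > "$script" <<EOF
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:10:00
#SBATCH -J openmp_test

# Set the thread count before the application starts.
export OMP_NUM_THREADS=$threads
time ./openmp.x
EOF
}
```

For example, "write_openmp_job openmp.sh 8" followed by "sbatch openmp.sh" would request one node and run openmp.x with eight threads.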
    

Submitting jobs

The batch queue system is comprised of two parts: (i) the SLURM resource manager and (ii) the Moab scheduler.

There are two ways to submit jobs to the batch queue system, either as an interactive job or as a batch job. Interactive jobs are most useful for debugging as you get interactive access to the input and the output of the job when it is running. But the normal way to run the applications is by submitting them as batch jobs.

Interactive job submission

Interactive access to the compute nodes is provided with the command interactive. This command accepts the same options as the sbatch command described below.

In order to start an interactive job allocating 2 nodes and 10 cores for 10 minutes, you type

$ interactive -N 2 -n 10 -t 00:10:00

Note: If you leave out the "-n 10" argument in the command, you will by default be given all available cores (in this case 16).

Once your interactive job has started, you are logged in to the first node in the list of nodes that has been assigned to the job. An environment has been created for you that, in addition to ordinary variables, also contains a number of SLURM environment variables:

[panor@n212 ~]$ env | grep -i slurm
SLURM_NODELIST=n[212-213]
SLURMD_NODENAME=n212
SLURM_PRIO_PROCESS=0
SLURM_NNODES=2
SLURM_JOBID=5341
SLURM_TASKS_PER_NODE=8(x2)
STY=1755.slurm5341
SLURM_JOB_ID=5341
SLURM_UMASK=0022
SLURM_NODEID=0
SLURM_TASK_PID=1755
SLURM_NPROCS=10
SLURM_PROCID=0
SLURM_JOB_NODELIST=n[212-213]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_GTIDS=0
SLURM_JOB_NUM_NODES=2
[panor@n212 ~]$ 

Let us now run the trivial MPI Fortran application given above (mpiprog.f):

[panor@n212 ~]$ mpprun mpiprog.x
mpprun: INFO: using job specified number of tasks
mpprun: INFO: starting scampi run on 2 nodes (10 tasks)
Taking nodenames from "/tmp/tmp.hIniRn1821", number of nodes specified by -np
/opt/scali/bin/mpimon -stdin all  mpiprog.x  --  n212 5 n213 5
 Rank number           8  of total          10 .
 Rank number           1  of total          10 .
 Rank number           5  of total          10 .
 Rank number           6  of total          10 .
 Rank number           3  of total          10 .
 Rank number           7  of total          10 .
 Rank number           9  of total          10 .
 Rank number           0  of total          10 .
 Rank number           2  of total          10 .
 Rank number           4  of total          10 .
[panor@n212 ~]$

Batch job submission

The two main commands for handling job submissions are:
sbatch

Submits a job to the queue system.

scancel JOBID

Deletes a job from the queue system.

Batch jobs are submitted to the queue system with the command sbatch:

$ sbatch -J jobname submit.sh

A minimal submit script that requests 2 nodes and 16 cores for 10 minutes may look like:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00

mpprun ./mpiprog.x

# End of script

We note the use of "#SBATCH" lines in the script. This is an alternative way of specifying options to the sbatch command. We could thus have specified the jobname in the script with an additional line reading

#SBATCH -J jobname

Let us submit the above script:

[panor@neolith1 ~]$ sbatch -J mpiprog submit.sh
sbatch: Submitted batch job 5351
[panor@neolith1 ~]$

After the job has completed, the output to standard out and standard error (if not re-directed) is returned from the system in a file called

slurm-JOBID.out

In this case this is where we find the output from our program:

[panor@neolith1 paralllel_program_test]$ cat slurm-5351.out
mpprun: INFO: number of tasks set to all cores on allocated nodes
mpprun: INFO: starting scampi run on 2 nodes (16 tasks)
Taking nodenames from "/tmp/tmp.IieKGI4556", number of nodes specified by -np
/opt/scali/bin/mpimon -stdin all  ./mpiprog.x  --  n212 8 n213 8
 Rank number           8  of total          16 .
 Rank number          11  of total          16 .
 Rank number          13  of total          16 .
 Rank number          10  of total          16 .
 Rank number          14  of total          16 .
 Rank number          15  of total          16 .
 Rank number           2  of total          16 .
 Rank number          12  of total          16 .
 Rank number           9  of total          16 .
 Rank number           1  of total          16 .
 Rank number           0  of total          16 .
 Rank number           4  of total          16 .
 Rank number           3  of total          16 .
 Rank number           6  of total          16 .
 Rank number           5  of total          16 .
 Rank number           7  of total          16 .
[panor@neolith1 paralllel_program_test]$

Useful options to sbatch are listed with the command

$ man sbatch

A selection of the most useful options includes:

-U account_string

The project the job should be accounted on.

Large and medium-scale projects have a project id of the form "SNIC xxx/yy-zz". The corresponding account string is obtained by taking the project id, removing all blanks " " and replacing every "/" with "-". For example, to account on the SNAC project "SNIC 005/06-98", the string "SNIC005-06-98" should be used.

Small-scale projects (also known as test projects) have a project id of the form xxxxxxx. The corresponding account string is pxxxxxxx.

A user who is a member of only one project may omit this option.

-N nodes

The number of nodes to run the job on. Each node has 8 cores.

-n tasks

The total number of tasks (MPI ranks).

--tasks-per-node tasks

The number of tasks (MPI ranks) per node.

-J jobname

Name of the job.

-t hh:mm:ss

The maximum execution time for the job.

-t days-hh

An alternative specification of the maximum execution time for the job.

-d JOBID

Defer the start of this job until the job with the specified job ID has completed.

--mem MB

Specify the minimum amount of memory, in megabytes, for the job. If this number exceeds 16384 MiB (= 16 GiB), your job will be scheduled for execution on the fat memory nodes.
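Two of the rules in the option list are mechanical enough to script: the conversion of a project id into an account string for -U, and the memory threshold above which a job goes to the fat memory nodes. The sketch below uses only standard shell tools. The function names are hypothetical helpers, not commands installed on Neolith.

```shell
#!/bin/sh
# Turn a project id such as "SNIC 005/06-98" into an account string:
# remove all blanks, then replace every "/" with "-".
account_string() {
    printf '%s\n' "$1" | tr -d ' ' | tr '/' '-'
}

# Succeeds (exit status 0) if a --mem request, given in MiB, exceeds
# 16384 MiB and would therefore be scheduled on the fat memory nodes.
needs_fat_nodes() {
    [ "$1" -gt 16384 ]
}

account_string "SNIC 005/06-98"
if needs_fat_nodes 20000; then
    echo "20000 MiB: fat memory nodes"
fi
```

The first call prints SNIC005-06-98, which could then be passed to sbatch as the argument of -U.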


Supervising jobs

In many cases it is desirable to supervise your running and scheduled jobs in order to find out if jobs have started or completed, how much remains of the allocated wall clock time, if a job produces sensible results, if a job makes efficient use of the cores, etc.

Monitor the queue

Useful commands to monitor the queue are:

squeue

Monitor jobs in the queue system.

showq

List all jobs visible to the scheduler.

checkjob

Display numerous scheduling details for a job.

Information restricted to a single user is obtained with the "-u" option to "squeue":

[panor@neolith1 ~]$ squeue -u panor
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   5351   neolith  mpiprog    panor   R       0:01      2 n[212-213]
[panor@neolith1 ~]$

Note that the output from "squeue" shows which nodes your application is running on. This information (plus other details) is also available from the "checkjob" command, shown here for a different job:

[panor@neolith1 ~]$ checkjob 28905
job 28905

AName: "cf3cl"
State: Running 
Creds:  user:panor  group:nsc  account:nsc  class:slabanja  qos:Normal
WallTime:   12:11:25:22 of 20:16:00:00
SubmitTime: Mon Feb 18 13:54:07
  (Time Queued  Total: 1:21:50:49  Eligible: 3:30:40)

StartTime: Wed Feb 20 11:44:56
Total Requested Tasks: 8

Req[0]  TaskCount: 8  Partition: slurm  
Memory >= 1M  Disk >= 1M  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
NodeCount:  1

Allocated Nodes:
[n650:8]


StartCount:     3
Partition Mask: [slurm]
StartPriority:  7730245
Reservation '28905' (  - -13days -> 7:13:26:52  Duration: 20:16:00:00)

[panor@neolith1 ~]$ 
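The "WallTime" line of the checkjob output gives the used and the allocated wall clock time, so the time remaining is the difference between the two. The following sketch computes it in seconds, assuming both times are written as [DD:]HH:MM:SS as in the output above. The helper names are illustrative only.

```shell
#!/bin/sh
# Convert a time written as [DD:]HH:MM:SS into seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{
        n = NF
        s = $n + 60 * $(n - 1) + 3600 * $(n - 2)
        if (n > 3) s += 86400 * $(n - 3)
        print s
    }'
}

# Given a line of the form "WallTime: USED of LIMIT",
# print the number of seconds that remain.
walltime_left() {
    used=$(printf '%s\n' "$1" | awk '{ print $2 }')
    limit=$(printf '%s\n' "$1" | awk '{ print $4 }')
    echo $(( $(to_seconds "$limit") - $(to_seconds "$used") ))
}

walltime_left 'WallTime:   12:11:25:22 of 20:16:00:00'
```

For the line shown above this prints 707678, a little over 8 days. On the login node the input line could be taken from the output of checkjob, for example by filtering it with grep.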

Monitor a running job

Applications have various ways to return output from the calculations; some write to standard output (which may be re-directed) whereas others write specific output files that often reside in the scratch directory. In order to list the output of a running calculation in the latter case, you may need to access the local file systems of the compute nodes named "/scratch/local/". This is possible since you are allowed to log in with "ssh" to all compute nodes where you have running applications:

[panor@neolith1 ~]$ ssh n650
Last login: Mon Mar  3 10:28:03 2008 from l1
[panor@n650 ~]$ df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sda1                 9844      1496      7848  17% /
tmpfs                     8028         0      8028   0% /dev/shm
/dev/sda3               226365     36184    190181  16% /scratch/local
d1:/home               4194172   1602713   2591460  39% /home
s1:/software             95834     10259     85575  11% /software
[panor@n650 ~]$ 
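For a job that runs on several nodes, "squeue" prints the nodes in a compact form such as n[212-213], which has to be expanded before each node can be visited with "ssh". The sketch below handles only the simple cases seen in this guide: a single node name, or a prefix followed by one numeric range without zero padding. The function name is hypothetical.

```shell
#!/bin/sh
# Expand a compact node list such as "n[212-213]" into one name per line.
# Only a single "[first-last]" range is handled; anything else is
# printed unchanged.
expand_nodelist() {
    printf '%s\n' "$1" | awk '{
        if (match($0, /\[[0-9]+-[0-9]+\]/)) {
            prefix = substr($0, 1, RSTART - 1)
            range = substr($0, RSTART + 1, RLENGTH - 2)
            split(range, bounds, "-")
            for (i = bounds[1] + 0; i <= bounds[2] + 0; i++)
                print prefix i
        } else {
            print $0
        }
    }'
}

expand_nodelist 'n[212-213]'
```

Each printed name can then be used as the argument to "ssh", for instance in a for loop that runs "df -m /scratch/local" on every node of the job.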

Once logged in to a compute node with a running application, you may monitor the performance of your application with e.g. the "top" command:

[panor@n650 ~]$ top -u panor
top - 14:35:09 up 14 days, 23:56,  1 user,  load average: 1.73, 1.69, 1.60
Tasks: 170 total,   2 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.2%us,  3.4%sy,  0.0%ni, 87.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16439708k total, 16353084k used,    86624k free,      880k buffers
Swap:  2047840k total,      180k used,  2047660k free, 14840652k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7615 panor     25   0 1855m 928m 7768 R   99  5.8   6661:50 dalton.x           
 3350 panor     15   0 12712 1164  832 R    0  0.0   0:00.09 top                
 3249 panor     15   0 87504 1668  964 S    0  0.0   0:00.00 sshd               
 3250 panor     16   0 68240 1768 1312 S    0  0.0   0:00.03 bash               
 7596 panor     17   0 65872 1192 1004 S    0  0.0   0:00.00 script             
 7597 panor     23   0 65876 1288 1056 S    0  0.0   0:00.00 dalton             

Job Scheduling

The priority of your queued job is roughly proportional to the percentage of your project's monthly allocation that is unused. This "fairshare" priority is based on an approximation of the time used during the last 30 days. If your project has exceeded its allocation, your job priority turns negative.

Use the command "nscjobinfo" to get a more detailed description of the current scheduling policy as well as batch queue limits.
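As a rough illustration of the fairshare idea, the quantity that the priority follows is the unused share of the allocation, which becomes negative once the allocation is exceeded. The sketch below only computes that percentage. The function name, the unit (core hours) and the numbers are assumptions made for the example, and the actual priority calculation is the one reported by "nscjobinfo".

```shell
#!/bin/sh
# Percentage of the monthly allocation that is unused.
# $1 = monthly allocation, $2 = usage during the last 30 days
# (both in the same unit, assumed here to be core hours).
unused_percent() {
    awk -v alloc="$1" -v used="$2" \
        'BEGIN { printf "%.1f\n", 100 * (alloc - used) / alloc }'
}

unused_percent 10000 2500     # a quarter of the allocation used
unused_percent 10000 12000    # allocation exceeded, result is negative
```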


Debugging and tracing

Standard debugging tools like the GNU debugger gdb and Intel debugger idb are installed on Neolith. There are also a few special programs available to help trace and debug parallel applications.

Intel Trace Analyzer and Collector

This tool was previously named Vampir. It can be used to trace the communication patterns of an MPI application. This is accomplished by recompiling your application and linking it against the trace libraries. The application then writes trace files when it is executed. These files can then be analyzed using the graphical trace analyzer on the login node.

ITAC has several features not described here; full documentation is available in the directory /software/intel/itac/7.1/doc

How to use:
1. Use ScaliMPI. Other implementations might work but are untested.

  $ module add scampi
  

2. Load the Intel Trace Analyzer module:

  $ module add itac
  

3. Compile the MPI program with the extra CFLAGS "-lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS" and -Nmpi

  $ icc mpiprog.c -o mpiprog -Nmpi -lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS
  

4. Run the program with mpprun as usual. This will write trace files in the working directory.

  $ mpprun ./mpiprog
  

5. Open the trace files using the trace analyzer on the login node.

  [paran@neolith1 ~]$ traceanalyzer 'mpiprog-0(mpi:24646@n8).stf'
  

TotalView Parallel Debugger

Full documentation for TotalView, including a User Guide, is available in the directory /software/apps/toolworks/totalview.8.4.0-0/doc/pdf/

License information: There is currently only a single license for TotalView installed. If you encounter license availability problems, please contact support@nsc.liu.se so we can consider purchasing more licenses.

TotalView is not yet integrated with the normal MPI launcher mpprun on Neolith, so you have to use the MPI implementation's launcher programs manually. The following guide shows how to run an MPI program on a single node using ScaliMPI. TotalView supports several other MPI implementations, but these have not been tested by NSC on Neolith. Feel free to try them if you want.

0. Make sure that you can run X11 applications on the login node. (start an xterm or something similar to verify)

1. Load the ScaliMPI module.

  $ module add scampi

2. Set up your PATH to include the TotalView directory.

  $ export PATH=$PATH:/software/apps/toolworks/totalview.8.6.0-2/bin/

3. Compile your application with -Nmpi -g to use ScaliMPI and get debug information in the binary. (Use ifort instead of icc if your program is written in Fortran.)

  $ icc -Nmpi -g -o myapp myapp.c

4. Start an interactive job on a single node.

  $ interactive -N 1 -t 01:00:00 -U YOUR-SNIC-PROJECT

5. When the interactive job has started, you can launch the MPI program with TotalView in the interactive job shell. (Change the 8 to a lower number to use fewer ranks.)

  $ tvmpimon ./myapp -- $SLURM_NODELIST 8 

6. Quick Start:

  • Click "OK" in the "Startup Parameters - mpimon" dialog.
  • Click the "Go" button.
  • TotalView detects that you are starting a parallel program. Click "Yes" to stop it.
  • You are now debugging your MPI program and can set breakpoints etc.
  • Reading the TotalView manual is highly recommended!

List of acronyms

GiB       gibibyte
MiB       mebibyte
MKL       Math Kernel Library
MPI       Message Passing Interface
OpenMP    Open Multi-Processing
scp       secure copy
SLURM     Simple Linux Utility for Resource Management
ssh       secure shell
TiB       tebibyte

Frequently asked questions

Questions:
  1. How do I find out the acceptance limits for job submission?
  2. How much data am I allowed to store on the various file systems?
  3. Why does my job not start?
  4. How many hours have I consumed this month?
  5. How do I know if my job was killed due to exceeded wall clock time?
Answers:
  1. Run the command "nscjobinfo" (without options and arguments). See also above.
  2. Run the command "nscquota". See also above.
  3. A basic description of the queueing priorities is given in the output from the command "nscjobinfo" (without options and arguments, or with a specific JOBID as argument). Your jobs will be listed together with information on load averages on nodes for running jobs and hints as to why blocked jobs are blocked. See also above.
  4. Run the command "projinfo" (takes no options or arguments). See also above.
  5. At the moment this information is only recorded in log files accessible to the NSC staff. We are working on a solution to make the information available to our users.




Page last modified: 2009-10-09 14:08
For more information contact us at info@nsc.liu.se.