User Guide

What are the SNIC computing resources?

This User Guide covers the SNIC computing resources Kappa and Matter. "SNIC computing resources" is the name we use at NSC for several separate, but very similar, HPC clusters with an associated shared Centre Storage system. Currently, the SNIC computing resources at NSC are:

Resource         Type      Cores   Nodes/servers   Performance/capacity   In use since
Kappa            Cluster    2912   364             27 TFLOPS              April 2010
Matter           Cluster    4128   516             37 TFLOPS              October 2010
Centre Storage   Storage       -   8               65 TiB                 2007
Triolith         Cluster   19200   1200            338 TFLOPS             July 2012

Note: Triolith is a regular SNIC system, but it is not documented here as its software is of a later generation. Please read the Triolith User Guide for information on Triolith.

Note: Matter is not a regular SNIC system: it was funded by KAW via SNIC, and time on it is not allocated by SNIC. But from a user perspective it works in the same way as Kappa, so it makes sense to use the same User Guide for Matter.

These resources are intended for Swedish academic users and are equipped with application software that reflects the needs of the Swedish research community in the natural sciences. Please take the time to learn more about the systems from the information in this user guide, and see whether they can expand the scope of your scientific work.

Most things work in the same way on both Kappa and Matter, but where they differ we describe and highlight the differences, e.g by a different colour (like this paragraph).

System description: Kappa

Kappa is a Linux-based cluster with 364 HP ProLiant DL170h G6 compute servers and 2 HP ProLiant DL980 G7 "huge" nodes with large amounts of memory, giving a combined peak performance of 27 TFLOPS.

Each compute server (DL170h) is equipped with two quad-core Intel® Xeon® E5520 processors and 24 or 72 GB RAM. Each huge node (DL980) has 1 TiB RAM and eight Intel® Xeon® E7-2800 processors with eight cores each, giving 64 cores in total. For more information about using the large memory nodes, see the batch job submission notes.

The installation also includes a total of 4 HP ProLiant DL180 G6 system servers which handle cluster storage and administration tasks.

See the table below for more information.

In total, the cluster has 13 TiB of main memory.

The compute nodes communicate over a high-speed network based on InfiniBand equipment from Voltaire®.

  • 26 nodes are owned solely by the Department of Physics, Chemistry and Biology within the centre for "Advanced Functional Materials" (AFM) at Linköping University.

  • The largest part of Kappa (322 nodes, including all the fat nodes) is funded by SNIC. Half of these nodes and the two huge nodes are dedicated to local research groups at Linköping University, and the other half is dedicated to all Swedish academic users via SNAC.

System description: Matter

Matter is a Linux-based cluster with 512 HP ProLiant SL2x170z and 4 HP ProLiant DL160 G6 compute servers, with a combined peak performance of 37 TFLOPS.

All SL170z compute servers are equipped with two quad-core Intel® Xeon® E5520 processors and 36 GB RAM. The DL160 compute servers are equipped with two quad-core Intel® Xeon® X5570 processors and 144 GB RAM. For more information about using the large memory nodes, see the batch job submission notes.

See the table below for more information.

In total, the cluster has 19 TiB of main memory.

Matter is financed by the Knut and Alice Wallenberg Foundation and is dedicated to calculations within material science.

System description: Triolith

Triolith is NSC's current big system, a capability cluster with 1200 nodes. Please read the Triolith User Guide for information on Triolith.

System description: Centre Storage

The shared storage system for the SNIC computing resources consists of 8 HP ProLiant DL380 G5 servers, connected to 5 SATABeast RAID arrays. The total usable capacity is 65 TiB.

More information

                   Kappa                                                    Matter
Compute nodes      364 HP ProLiant DL170h G6 and 2 HP ProLiant DL980 G7    512 HP ProLiant SL170z and 4 HP ProLiant DL160 G6
CPU                Intel Xeon E5520 and Intel Xeon E7-2800                  Intel Xeon E5520 and Intel Xeon X5570
Memory             2 "huge" nodes with 1 TiB RAM, 56 "fat" nodes with       4 "fat" nodes with 144 GB RAM,
                   72 GB RAM, 308 nodes with 24 GB RAM                      512 nodes with 36 GB RAM
Interconnect       InfiniBand, with equipment from Mellanox and Voltaire    InfiniBand, with equipment from Mellanox and Voltaire
Operating system   CentOS 5.x x86_64                                        CentOS 5.x x86_64
Resource Manager   SLURM 2                                                  SLURM 2
Scheduler          SLURM 2.2 with Multifactor Priority Plugin               SLURM 2.2 with Multifactor Priority Plugin
Compilers          Intel compiler collection                                Intel compiler collection
Math libraries     Intel Math Kernel Library (MKL)                          Intel Math Kernel Library (MKL)
MPI                Open MPI, Intel MPI                                      Open MPI, Intel MPI
Applications       Most applications exist on all SNIC clusters, see Applications

Quickstart Guide - from login to your first MPI application

Use SSH to access the system

When you have received a username and a password from NSC, log in to the correct cluster using SSH. Example using a Linux SSH client:

$ ssh x_abcde@kappa.nsc.liu.se
or
$ ssh x_abcde@matter.nsc.liu.se

Change your password

The first time you log in, you will be prompted to change your password. If you want to change your password later, you can do so with the passwd command:

[x_makro@kappa ~]$ passwd
Changing password for user x_makro.
Changing password for x_makro
(current) UNIX password: 
New UNIX password: 
Retype new UNIX password: 
passwd: all authentication tokens updated successfully.

See the Security section for more details. We recommend that you set up public-key authentication for SSH.
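As an illustration only, a minimal key setup from a Linux or MacOS X client could look like the following (the username x_abcde is just an example; the Security Guide is the authoritative reference for how NSC wants keys to be handled):

$ ssh-keygen -t rsa -b 4096
$ ssh-copy-id x_abcde@kappa.nsc.liu.se

The first command generates a key pair on your local machine (protect it with a good passphrase), and the second installs the public key on the cluster so that later logins can use the key instead of your cluster password.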

The login node

After logging in to the cluster using SSH, you are logged in to the "login node". This is the main user-accessible frontend for the cluster.

Acceptable things to do on the login node:

  • Compiling your application
  • Limited testing of your application (if you can do so without disturbing other login node users)
  • Transferring data to and from the cluster (scp, sshfs etc)
  • Preparing data, job scripts etc
  • Managing your jobs (start, monitor, cancel, ...)

Things NOT to do on the login node:

  • Run applications that use a lot of resources (CPU cores, CPU time, memory, I/O, ...). There is no hard limit, but you may not run anything that makes the login node noticeably slow for other users. Use common sense (and "top")...
  • Run "real" jobs - this is what the compute nodes are for. Note: you can run interactive commands on the compute nodes using the "interactive" command, as well as batch jobs.
  • Run jobs that need to run for a long time. The login node might be restarted to apply urgent security fixes anywhere from once a month to several times a week (NSC has no control over the release of security vulnerabilities and patches, so we cannot plan this in advance). If you have applications (e.g your own scheduler or license manager) that run on the login node and need to run all the time, please contact support@nsc.liu.se to discuss a suitable solution for you.

Compiling a simple MPI application

To compile a parallel (MPI) program, load the appropriate MPI module and add the "-Nmpi" compiler flag [more details].

The -Nmpi flag is an NSC-specific modification to the compiler wrapper script. It will add the compiler options needed to build your binary using the MPI type and version that is currently loaded (e.g by "module add openmpi/1.2.3"). It will also tag the binary with the information needed so that "mpprun" can start the binary using the correct MPI and version. Run "icc -Nhelp" to see more things that the NSC compiler wrapper can do.

On Kappa and Matter, we recommend Intel MPI or Open MPI (Intel MPI is in most cases faster):

$ module add openmpi
or
$ module add impi
FORTRAN example (mpitest_f77.f):
C
C Hello World in F77
C
        program main
        implicit none
        include 'mpif.h'
        integer ie, rank, size

        call MPI_INIT(ie)
        call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ie)
        call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ie)
        print *, "Hello, world, I am ", rank, " of ", size
        call MPI_FINALIZE(ie)

        end
[x_makro@kappa ~]$ module add openmpi
[x_makro@kappa ~]$ ifort -Nmpi -o mpitest_f77 mpitest_f77.f
ifort INFO: Linking with MPI openmpi/1.4.1-i101011.
[x_makro@kappa ~]$ ls -l mpitest_f77
-rwxrwxr-x 1 x_makro x_makro 12830 Apr 28 13:45 mpitest_f77
C example (mpitest_c.c):
/*                                                                              
 * Hello World in C                                                           
 */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();

    return 0;
}
[x_makro@kappa ~]$ module add openmpi
[x_makro@kappa ~]$ icc -Nmpi -o mpitest_c mpitest_c.c
icc INFO: Linking with MPI openmpi/1.4.1-i101011.
[x_makro@kappa ~]$ ls -l mpitest_c
-rwxrwxr-x 1 x_makro x_makro 10024 Apr 28 13:46 mpitest_c
C++ example (mpitest_cpp.cpp) using OpenMPI:
//                                                                              
// Hello World in C++                                                           
//                                                                              
#include "mpi.h"
#include <iostream>

int main(int argc, char **argv)
{
    int rank, size;

    MPI::Init();
    rank = MPI::COMM_WORLD.Get_rank();
    size = MPI::COMM_WORLD.Get_size();
    std::cout << "Hello, world!  I am " << rank << " of " << size << std::endl;
    MPI::Finalize();

    return 0;
}
[x_makro@kappa ~]$ module add openmpi
[x_makro@kappa ~]$ icpc -Nmpi -o mpitest_cpp mpitest_cpp.cpp
icpc INFO: Linking with MPI openmpi/1.4.1-i101011.
[x_makro@kappa ~]$ ls -l mpitest_cpp
-rwxrwxr-x 1 x_makro x_makro 13229 Apr 28 13:50 mpitest_cpp

Note: "icc" is the Intel C compiler. The Intel C++ compiler is named "icpc".

Running an MPI application

Running in batch mode

This is the normal way of submitting jobs. [more details]:

First, you need to create a submit script. This script must contain information on how to start your job. It may also contain information about which project the job should be accounted on, how many processors you wish to use, how long you expect the job to run, how to start the application, etc.

In the example we use the batch script to indicate that we want two nodes, and 10 minutes of wall time. We then submit the job, using command line options to select a name for the job.

Example (submit_mpitest_c.sh):

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00
mpprun ./mpitest_c
[andjo@kappa ~]$ sbatch -J myjobname submit_mpitest_c.sh
Submitted batch job 708486
[andjo@kappa ~]$ squeue -u $USER
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
 708486     kappa myjobnam    andjo   R       0:00      2 n[282,287]

Once the job has been run (it will then disappear from the squeue output), you can check the result. The output (stdout/stderr) from the job is normally written to a file named slurm-JOBID.out (e.g slurm-708486.out) in the directory where you submitted the job.

Example output:

mpprun INFO: Starting openmpi run on 2 nodes (16 ranks)...
Hello, world, I am 3 of 16
Hello, world, I am 4 of 16
Hello, world, I am 2 of 16
Hello, world, I am 7 of 16
Hello, world, I am 6 of 16
Hello, world, I am 9 of 16
Hello, world, I am 14 of 16
Hello, world, I am 15 of 16
Hello, world, I am 5 of 16
Hello, world, I am 13 of 16
Hello, world, I am 11 of 16
Hello, world, I am 0 of 16
Hello, world, I am 10 of 16
Hello, world, I am 8 of 16
Hello, world, I am 12 of 16
Hello, world, I am 1 of 16
Running in interactive mode

Interactive mode can be useful for development, very short jobs, or jobs where you need to interact with the application (e.g it has a GUI).

In this example we will start an interactive shell using two nodes and 5 minutes of walltime. We use the test program we compiled earlier.

The "interactive" command will request a certain number of nodes from the queue system, just a a batch job would, but you will end up with a terminal window on the first node, from where you can start your application. When you exit the terminal window, your interactive session ends.

[x_makro@neolith1 ~]$ interactive -N2 -t 00:05:00
Waiting for JOBID 871772 to start
....
[x_makro@n781 ~]$ mpprun ./mpitest_c
mpprun INFO: Starting openmpi run on 2 nodes (16 ranks)...
Hello, world, I am 14 of 16
Hello, world, I am 1 of 16
Hello, world, I am 5 of 16
Hello, world, I am 7 of 16
Hello, world, I am 8 of 16
Hello, world, I am 10 of 16
Hello, world, I am 12 of 16
Hello, world, I am 13 of 16
Hello, world, I am 11 of 16
Hello, world, I am 0 of 16
Hello, world, I am 9 of 16
Hello, world, I am 4 of 16
Hello, world, I am 6 of 16
Hello, world, I am 15 of 16
Hello, world, I am 2 of 16
Hello, world, I am 3 of 16
[x_makro@n781 ~]$ exit
.............[screen is terminating]
Connection to n781 closed.
[x_makro@neolith1 ~]$ 

Note: On Kappa you might want to use the "devel" partition if you are going to run a short interactive job on just a few nodes (e.g "interactive -p devel ..."). Matter currently has no development nodes.


Important details and differences

Software differences

We try to keep the software on the clusters identical, to make it easier to run the same types of jobs on different clusters. However, sometimes licensing terms or other reasons mean that not all software is available everywhere.

Starting and managing jobs

Note: CPU time on the SNIC resources is always allocated on a particular cluster. E.g: if your project only has time on Kappa, you can only run jobs on Kappa. If you for some reason need to move time to another cluster (e.g if your application does not work where you have been allocated time), you must contact support@nsc.liu.se.

User accounts are not automatically closed, so just because you can login to a cluster does not mean you can run jobs there.

Note: If you are a member of more than one project, you must specify (e.g "sbatch -A project") a project when submitting a job, otherwise your job will never start. Note: this also applies to "interactive".

Note: there is a limit on how long (walltime) a job can run. On Kappa and Matter, this limit is currently 7 days (168 hours). Risk jobs on Kappa are limited to 24h walltime. If you submit a job with a longer walltime than the limit, it will never start. If you need to run a job for longer than the maximum walltime, please contact support@nsc.liu.se

Note: if you do not choose a wall time limit for your job, you will get the system default limit, which is usually quite low (e.g 2h).

Your files

Note: Kappa and Matter use shared file systems (/home, /nobackup/global and /software). It is important to remember that any changes you make to files in /home or /nobackup/global (e.g shell startup files (.bashrc etc) and job scripts) will affect all SNIC resources.


Accessing the System

To access the cluster (to start jobs, move data etc), you need to use SSH (Secure Shell).

To log into the system, use the username provided to you by NSC, and issue this command (replace CLUSTER with kappa or matter):

$ ssh username@CLUSTER.nsc.liu.se

A typical first-time login session:

kronberg@ming:~$ ssh x_makro@kappa.nsc.liu.se
The authenticity of host 'kappa.nsc.liu.se (130.236.100.31)' can't be established.
RSA key fingerprint is c9:88:68:1e:c0:38:4c:52:e3:45:a9:83:8d:04:cb:b5.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'kappa.nsc.liu.se' (RSA) to the list of known hosts.
x_makro@kappa.nsc.liu.se's password: 
Last login: Wed Apr 28 10:40:58 2010 from kubb
<<< PLEASE READ >>>

[...very long informational message removed...]

[x_makro@kappa ~]$ 

If you are using Windows

If your Windows computer does not already have an SSH client installed, contact your system administrator, or install one yourself. A good free SSH client is PuTTY (there are also other alternatives).

Both OpenSSH and PuTTY can be used for X-forwarding; with OpenSSH use "ssh -X", and with PuTTY toggle "Enable X11 forwarding" in the preferences. Note that using X-forwarding may require additional configuration of your local machine, e.g. you need an X-server; please consult your local system administrator if you run into trouble.
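For example, to log in to Kappa with X-forwarding enabled from a Linux client (example username):

$ ssh -X x_abcde@kappa.nsc.liu.se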

If you are using Linux/Solaris/*BSD or MacOS X

A command-line SSH client is most likely already installed if you use Linux, Solaris, any BSD variant or MacOS X.

Other platforms

Contact your local system administrator. You need an SSH client that can use protocol version 2 to use the SNIC computing resources.

File transfer

File transfer is available using scp, sftp, or sshfs.

  • scp is a tool useful for copying a single file, or a few files, to or from a remote system. To copy a local file named local-file to your home directory on a SNIC cluster, issue
    scp local-file username@CLUSTER.nsc.liu.se:
    
    See the scp man pages for further information.
  • sftp is an interactive file transfer program, similar to ftp. For example:
    sftp username@CLUSTER.nsc.liu.se:testdir
    Connecting to CLUSTER.nsc.liu.se...
    Changing to: /home/username/testdir
    sftp> ls
    file-1  file-2
    sftp>get file-2
    Fetching /home/username/testdir/file-2 to file-2
    
    For additional information about sftp, see the sftp man page.
  • sshfs is a "user space file system" which allows for transparent file system access to remote machines. Example
    $ mkdir mnt
    $ ls mnt
    $ sshfs username@CLUSTER.nsc.liu.se:testdir mnt
    $ ls mnt
    file-1  file-2
    $ fusermount -u mnt
    $ ls mnt
    
    The use of sshfs can be very convenient, but is often not available by default. Consult your local system administrator to see if sshfs is available for your desktop machine.

There are several graphical SCP/SFTP clients available if you do not want to use command-line tools, e.g WinSCP, Filezilla.


Security

Security is something we take seriously. The cost of investigating and cleaning up a system after an intrusion is very high, both in man-hours for NSC staff and in lost computing time for users.

Please read the Security Guide carefully; it explains what you must do to keep our systems safe, and it also contains useful tips on how you can be both safe and productive at the same time.

Storage

Quota - how much data can I store?

Users have access to different file systems on the SNIC computing resources. The main file systems use quotas to limit how much data each user can store. Use the nscquota command to see your own quotas and usage:

[x_makro@neolith1 ~]$ nscquota
FILE SYSTEM                  USED        QUOTA        LIMIT        GRACE
--                           ----         ----         ----        -----
/home                   400.0 KiB     20.0 GiB     30.0 GiB             
/nobackup/global         32.0 KiB    250.0 GiB    300.0 GiB             

The default quota is currently 20 GiB on /home, and 250 GiB on /nobackup/global.

The policy for increased quota is simple: If space is available, you can get more quota, provided you explain to us how much you need, why, and for how long. An example:

  • How much: "I need a total quota of 500GB on /nobackup/global"
  • Why: "I expect to run up to 10 jobs at the same time in my new project, and each job needs 50 GB of storage space for its output files."
  • For how long: "I need this space for the duration of my project (until 2011-06-01)"

Before requesting more quota, make sure that you store your data in the correct location (e.g /nobackup for data that can be recreated and does not need expensive daily tape backups).

Note: quota is shared between Matter, Kappa and Triolith, so the total amount of quota you ask for must be sufficient for all your needs on these three systems.

Send your requests for more quota to support@nsc.liu.se

Available file systems

Mount point        Size       Availability/sharing                      Comment
/home              ~5 TiB     Shared between Kappa, Matter, Triolith    Backed up to tape daily
/nobackup/global   ~60 TiB    Shared between Kappa, Matter, Triolith    NO BACKUP! (Yes, this means that if you delete a file, it is really gone forever...)
/scratch/local     ~150 GiB   Not shared, local on each compute node    NO BACKUP, automatically cleared after each job (~150 GiB on Kappa, ~380 GiB on Matter)
/software          <1 TiB     Shared between Kappa, Matter, Triolith    Not writable for users

/home, used for important data

The home file system is mounted at /home on each machine in the cluster, and is backed up on a daily basis. Each user has their own home directory (the environment variable HOME is always set to the home directory).

/home is a GPFS file system with 256 KiB blocksize, optimized for small files.

/nobackup, used for scratch data

The nobackup file system is mounted at /nobackup/global on each machine in the cluster, and is not backed up. Each user has their own directory /nobackup/global/$USER (where $USER is your username).

Please use the nobackup file system for files that can be recreated by rerunning computation jobs. Do not store this type of data on /home, as this wastes space in our (expensive) tape backup system.

This is a GPFS filesystem with 1 MiB blocksize, optimized for large files.

/scratch/local, used as a local scratch dir

On each compute node, there is a node-local file system mounted at /scratch/local. If your job performs a lot of disk I/O to files that do not need to be shared between nodes, please use this file system. This takes load off the central disk servers, and most of the time it also makes your job run faster.

/scratch/local is an XFS file system, typically 150 GiB in size per node on Kappa and 380 GiB on Matter.
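A minimal sketch of this staging pattern for a single-node job is shown below. The program and file names are just examples, and the standard SLURM variable SLURM_SUBMIT_DIR is used to refer to the directory the job was submitted from. Remember that /scratch/local is not shared, so in a multi-node job each node sees its own copy, and that it is cleared when the job ends.

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 01:00:00
# Copy the input to the node-local scratch disk and run there
cp input.dat /scratch/local/
cd /scratch/local
/path/to/myprog input.dat > output.dat
# Copy the results back before the job ends (/scratch/local is cleared afterwards)
cp output.dat $SLURM_SUBMIT_DIR/
# End of script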

/software, contains applications

Common applications installed by NSC are found on the software file system, which is accessible from every machine in the cluster. This file system is not writable by users.

Note:
General Parallel File System (GPFS) is a proprietary cluster file system developed by IBM. The advantages of GPFS compared to NFS are higher performance and better scalability.
All disks are connected to eight dedicated disk servers (NSD servers in GPFS terminology). These servers form a GPFS cluster that exports the file systems to Kappa, Matter and Triolith.


Environment

We use cmod (module) to handle the environment when several installed versions of the same software exist. This application sets up the correct paths to the binaries, man pages, libraries, etc. for the currently selected module.

The correct environment is set up by using the module command. A list of some arguments to module includes:

module

lists the available arguments

module list

lists currently loaded modules

module avail

lists the available modules for use

module load example

loads the environment specified in the module named example

module unload example

unloads the environment specified in the module named example

module try-add example

loads the environment specified in the module named example, but does not complain if the module does not exist.

A default environment is automatically declared when you log in. The default modules can be listed:

[username@neolith1 ~]$ module list
Currently loaded modules:
  1) snic
  2) neolith
  3) slurm
  4) intel
  5) dotmodules
  6) default

In order to find out which version of the compiler the intel module refers to, you may list all modules:

[username@neolith1 ~]$ module avail
R/2.13.1 (def)
base-config/1 (def)
blast+/2.2.24 (def)

[...many more modules...]

vmd/1.8.7 (def)
vtune-amplifier-xe/2011.u3 (def)
xmgr/4.1.2 (def)
[username@neolith1 ~]$

[username@neolith1 ~]$ module avail | grep intel
intel/10.0
intel/10.1
intel/11.1
intel/12.0.3
intel/12.1.0 (def)
intel/9.1
[username@neolith1 ~]$

The note "(def)" indicates which version that is the default, and, in case of the intel compiler suiter, it is thus version 12.1.0. Please note, however, that the choice of default module may change over time. Therefore, if you wish to re-compile part of a program and link a new executable, you may need to ensure that you are using the same version of the compiler that you had at the time of the first built. You can switch to another version of the compiler as follows:

[username@neolith1 ~]$ module list        
Currently loaded modules:
snic
matter
kappa
slurm
intel
dotmodules
default
[username@neolith1 ~]$ module add intel/12.0.3
Conflicting modules warning: Unloading intel before loading intel/12.0.3
[username@neolith1 ~]$ module list
Currently loaded modules:
snic
matter
kappa
slurm
dotmodules
default
intel/12.0.3

Tip: If you want to see what a particular module contains, read the corresponding file under /etc/cmod/modulefiles:

[x_makro@kappa ~]$ cat /etc/cmod/modulefiles/openmpi/1.4.1-i101011
$(/etc/cmod/modulegroups mpi openmpi/1.4.1-i101011)
#MDB path /software/mpi/openmpi/1.4.1/i101011/bin
#MDB libpath /software/mpi/openmpi/1.4.1/i101011/lib
#MDB incpath /software/mpi/openmpi/1.4.1/i101011/include
append-path PATH /software/mpi/openmpi/1.4.1/i101011/bin
append-path MANPATH /software/mpi/openmpi/1.4.1/i101011/share/man/

Useful Environment Variables

NSC_RESOURCE_NAME - If you are using several NSC resources and copying scripts between them, it can be useful for a script to have a way of knowing what resource it is running on. You can use the environment variable NSC_RESOURCE_NAME for that. It will be set to the cluster name in lowercase, e.g "kappa".
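For example, a job script that is copied between resources could branch on the variable (a sketch; the module choices are just an illustration):

if [ "$NSC_RESOURCE_NAME" = "kappa" ]; then
    module add openmpi
elif [ "$NSC_RESOURCE_NAME" = "matter" ]; then
    module add impi
fi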

SNIC Resource Name Environment Variables

There are a number of environment variables available on all Swedish SNIC resources, pointing to different types of storage. It is recommended that you refer to these variables instead of hardcoding the file system names in your job scripts.

SNIC_BACKUP      The user's primary directory at the centre (the part of the Centre Storage that is backed up). Default value: /home/$USER
SNIC_NOBACKUP    Recommended directory for project storage without backup (also on the Centre Storage). Default value: /nobackup/global/$USER
SNIC_TMP         Recommended directory for best performance during a job (local disk on nodes if applicable). Default value: /scratch/local
SNIC_SITE        At what SNIC site am I running? Default value: nsc
SNIC_RESOURCE    What resource am I using at this SNIC site? Default value: triolith, kappa or matter
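For example, the staging pattern described in the Storage section can be written against these variables instead of hard-coded paths, which keeps the script portable between SNIC resources (the program and file names are just examples):

# Stage the input on the fast node-local disk, run there, and save the
# results to the no-backup project storage
mkdir -p $SNIC_NOBACKUP/results
cp $SNIC_NOBACKUP/input.dat $SNIC_TMP/
cd $SNIC_TMP
/path/to/myprog input.dat > output.dat
cp output.dat $SNIC_NOBACKUP/results/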

Compiling

We recommend using the Intel compilers: ifort (Fortran), icc (C), and icpc (C++).

Compiling OpenMP applications

Example: compiling the OpenMP-program, openmp.f with ifort:

$ ifort -openmp openmp.f

Example: compiling the OpenMP-program, openmp.c with icc:

$ icc -openmp openmp.c

Compiling MPI applications

For a few examples of how to build simple MPI applications, see the Quickstart.

Compiler wrappers

When invoking any of the Intel compilers (icc, ifort, or icpc), there is a wrapper script that looks for NSC/SNIC-specific options. Options starting with -N are used by the wrapper to affect the compilation and/or linking process, but these options are not passed to the compiler itself.

-Nhelp
Write wrapper-help
-Nverbose
Let the wrapper be more verbose
-Nmkl
Make the compiler compile and link against the currently loaded MKL-module. Note: you will also need to add e.g -mkl=parallel
-Nmpi
Make the compiler compile and link against the currently loaded MPI-module
-Nmixrpath
Make the compiler link a program built with both icc/icpc and ifort

For example:

$ module load mkl
$ icc -Nverbose -Nmkl -o example example.c -mkl=sequential
icc INFO: Linking with MKL mkl/10.3.6.233.
icc INFO: -Nmkl resolved to: -I/software/intel/composer_xe_2011_sp1.6.233/mkl/include -L/software/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64 -Wl,--rpath,/software/intel/composer_xe_2011_sp1.6.233/mkl/lib/intel64

The wrappers add tags to the executables with information regarding the compilation and linking. You may use the dumptag command to display this information:

[x_makro@neolith1 ~]$ dumptag mpitest_f77
-- NSC-tag ----------------------------------------------------------
File name:              /home/x_makro/mpitest_f77

Properly tagged:        yes
Tag version:            4
Build date:             100428
Build time:             134552
Built with MPI:         openmpi 1_4_1_i101011
Built with MKL:         no (or build in an unsupported way)
Linked with:            ifort 10_1_017
---------------------------------------------------------------------

Intel compiler, useful compiler options

Below is a short list of useful compiler options.
The manual pages "man ifort" and "man icc" contain more details, and further information is also found at the Intel homepage [here].

Optimization

There are three different optimization levels in Intel's compilers:

-O0

Disable optimizations.

-O1,-O2 

Enable optimizations (DEFAULT).

-O3

Enable -O2 plus more aggressive optimizations that may not improve performance for all programs.

-ip

Enables interprocedural optimizations for single file compilation.

-ipo

Enables multifile interprocedural (IP) optimizations (between files).
Tip: If your build process uses ar to create .a archives, you need to use xiar (Intel's implementation) instead of the system's /usr/bin/ar for an IPO build to work.

-xH

Optimize for the processors in Kappa/Matter.

For general use we recommend "-O2".

For best performance on Kappa/Matter we recommend "-O3 -ipo -xH" or "-O3 -ip -xH"
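For example, to compile a hypothetical Fortran source file myprog.f90 with the recommended performance flags:

$ ifort -O3 -ipo -xH -o myprog myprog.f90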

Note: Using aggressive optimisation flags runs a higher risk of encountering compiler limitations. If you experience problems with your code when using high optimization settings, try lowering them (e.g to "-O2").

Debugging

-g

Generate symbolic debug information.

-traceback

Generate extra information in the object file to allow the display of source file traceback information at runtime when a severe error occurs.

-fpe<n>

Specifies floating-point exception handling at run-time.

-mp

Maintains floating-point precision (while disabling some optimizations).

Profiling

-p

Compile and link for function profiling with UNIX gprof tool.

Options that only apply to Fortran programs

-assume byterecl

Specifies (for unformatted data files) that the units for the OPEN statement RECL specifier (record length) value are in bytes, not longwords (four-byte units). For formatted files, the RECL unit is always in bytes.

-r8

Set default size of REAL to 8 bytes.

-i8

Set default size of integer variables to 8 bytes.

-zero 

Implicitly initialize all data to zero.

-save

Save variables (static allocation) except local variables within a recursive routine; opposite of -auto.

-CB

Performs run-time checks on whether array subscript and substring references are within declared bounds.

Miscellaneous

Little endian to Big endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When set, the following operations are done:

  • The WRITE operation converts little endian format to big endian format.
  • The READ operation converts big endian format to little endian format.
F_UFMTENDIAN = big 

Convert all files.

F_UFMTENDIAN ="big;little:8" 

All files except those connected to unit 8 are converted.
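For example, to run a hypothetical program prog.x that reads and writes big endian unformatted files:

$ export F_UFMTENDIAN=big
$ ./prog.x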


Math libraries

MKL, Intel Math Kernel Library

The Intel Math Kernel Library (MKL) is available, and we strongly recommend using it. Several versions of MKL may exist; you can see which versions are available with the "module avail" command. The library includes the following groups of routines:

  • Basic Linear Algebra Subprograms (BLAS):

    • vector operations

    • matrix-vector operations

    • matrix-matrix operations

  • Sparse BLAS (basic vector operations on sparse vectors)

  • Fast Fourier transform routines (with Fortran and C interfaces). There exist wrappers for FFTW 2.x and FFTW 3.x compatibility.

  • LAPACK routines for solving systems of linear equations

  • LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations

  • ScaLAPACK routines including a distributed memory version of BLAS (PBLAS or Parallel BLAS) and a set of Basic Linear Algebra Communication Subprograms (BLACS) for inter-processor communication.

  • Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces).

Full documentation can be found online at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/.

Library structure

The Intel MKL is located in the /software/intel/mkl/ directory. When you have loaded an mkl module, the environment variable $MKL_ROOT will point to the MKL installation directory for that version (e.g /software/intel/composer_xe_2011_sp1.6.233/mkl).

The MKL consists of two parts: a linear algebra package and processor-specific kernels. The former contains LAPACK and ScaLAPACK routines and drivers that are optimized without regard to processor type, so that they can be used effectively on different processors. The latter contains processor-specific kernels such as BLAS, FFT, BLACS, and VML that are optimized for the specific processor.

Linking with MKL

If you want to build an application using MKL with the Intel compilers at NSC, we recommend using the flags "-Nmkl" (to get your application correctly tagged) and "-mkl=MKLTYPE"; see the example after the list below. The "-mkl" flag is available in Intel compilers from version 11 (so it will be available unless you for some reason need to use a really old compiler).

  • "-mkl=parallel" will link the with the (default) threaded Intel MKL.
  • "-mkl=sequential" will link with the sequential version of Intel MKL.
  • "-mkl=cluster" will link with Intel MKL cluster components (sequential) that use Intel MPI. If you use this option you should also load an MPI module (e.g "module load impi").

If for some reason you cannot use the "-mkl" flag, please read the Intel documentation to find out what linker flags you need. You might also find this Intel page useful.

MKL and threading

The MKL is threaded by default, but there is also a non-threaded "sequential" version available. (The instructions here are valid for MKL 10.0 and newer, older versions worked differently.)

Whether threaded or sequential MKL gives the best performance varies between applications. MPI applications will typically launch one MPI rank on each processor core on each node; in this case threads are not needed, as all cores are already used. However, if you use threaded MKL, you can start fewer ranks per node and increase the number of threads per rank accordingly.

The threading of MKL can be controlled at run time through the use of a few special environment variables.

  • OMP_NUM_THREADS controls how many OpenMP threads should be started by default. This variable affects all OpenMP programs, including the MKL library.
  • MKL_NUM_THREADS controls how many threads MKL-routines should spawn by default. This variable affects only the MKL library, and takes precedence over any OMP_NUM_THREADS setting.
  • MKL_DOMAIN_NUM_THREADS lets the user control individual parts of the MKL library. E.g. MKL_DOMAIN_NUM_THREADS="MKL_ALL=1;MKL_BLAS=2;MKL_FFT=4" would instruct MKL to use one thread by default, two threads for BLAS calculations, and four threads for FFT routines. MKL_DOMAIN_NUM_THREADS also takes precedence over OMP_NUM_THREADS.

If the OpenMP environment variable controlling the number of threads is unset when launching an MPI application with mpprun, mpprun will by default set OMP_NUM_THREADS=1.
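As a sketch of the "fewer ranks, more threads" approach mentioned above (the program name is just an example), a job script could look like this:

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=2
#SBATCH -t 01:00:00
# 2 MPI ranks per node, 4 MKL threads per rank = 8 cores per node in use
export MKL_NUM_THREADS=4
mpprun ./myprog
# End of script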

Executing parallel jobs

There are two main alternatives for developing program code that can be executed on multiple processor cores: OpenMP and MPI. OpenMP parallelization can be used for code that runs within a single node (i.e. only up to 8 cores on the current hardware), whereas MPI is used for code that can run on single as well as multiple nodes. The two types of applications are executed differently.

Executing an MPI application

An MPI application is started with the command:
$ mpprun mpiprog.x

Use "mpprun --help" to get a list of options and a brief description.

Note:
  • mpprun has to be started from a SLURM job. Either write a batch script and submit it with sbatch, or start an interactive shell using the command interactive [more details].
  • mpprun will launch a number of ranks determined from the SLURM environment variables [more details].
  • mpprun requires an MPI binary built according to NSC recommendations in order to automatically choose the correct MPI implementation [more details].
  • Warning, not recommended: In order to explicitly choose an MPI implementation to use, invoke mpprun with the flag
    --force-mpi=<MPI module>.
    

Executing an OpenMP application

The number of threads to be used by the application must be defined, and should be less than or equal to eight. You can set the number of threads in two ways: either by defining a shell environment variable before starting the application, or by calling an OpenMP library routine in the serial portion of the code. (A complete batch script sketch is shown after the notes below.)

  1. Environment variable:
    export OMP_NUM_THREADS=N
    time openmp.x
    
  2. Library routine:

    In Fortran:

    SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expression)
    
    In C/C++:
    #include <omp.h>
    void omp_set_num_threads(int num_threads)
    
Note:
  • The maximum number of threads can be queried in your application by use of the external integer function:

    In Fortran:

    INTEGER FUNCTION OMP_GET_MAX_THREADS()
    
    In C/C++:
    #include <omp.h>
    int omp_get_max_threads(void)
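A minimal batch script sketch for running the OpenMP program openmp.x from the examples above on one full node:

#!/bin/bash
#SBATCH -N 1
#SBATCH -t 00:30:00
# Use all 8 cores in the node with OpenMP threads
export OMP_NUM_THREADS=8
./openmp.x
# End of script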
    

Executing serial jobs

On Kappa and Matter, the smallest part of the system you can allocate is one compute node (8 CPU cores). Your project will always be charged for the use of the whole compute node regardless of how you use it.

To get the most out of your allocated computing time, you should always try to use all cores in the node if possible. Some exceptions: short test jobs, or if your application needs all the RAM in a node, but only a single CPU core.

Below we describe two methods you can use to pack several single-threaded jobs (from here on we will refer to them as "job steps", since this is the term SLURM uses for "jobs within a job") into a single 1-node job.

We recommend using method 1 if all your job steps will run for approximately the same length of time.

If your job steps have very different run times, a lot of time will be spent "wait"-ing on the longest running job. In this case method 2 is better.

Method 1 - start 8 jobs in the background, then wait for them to finish

Sample job script:

#!/bin/bash
#SBATCH -N 1
#
# Change directory to job1, start job1 and send all output to a log file
# Repeat for all 8 jobs.
cd /path/to/job1
./job1 > job1.log 2>&1 &

cd /path/to/job2
./job2 > job2.log 2>&1 &

cd /path/to/job3
./job3 > job3.log 2>&1 &

cd /path/to/job4
./job4 > job4.log 2>&1 &

cd /path/to/job5
./job5 > job5.log 2>&1 &

cd /path/to/job6
./job6 > job6.log 2>&1 &

cd /path/to/job7
./job7 > job7.log 2>&1 &

cd /path/to/job8
./job8 > job8.log 2>&1 &
# Now, wait for all jobs to finish
wait
# End of script

If your job steps are short, you can pack even more of them into a single job by repeating the "start jobs, wait" cycle:

#!/bin/bash
#SBATCH -N 1
#
cd /path/to/job1
./job1 > job1.log 2>&1 &

[...repeat for job 2-7...]

cd /path/to/job8
./job8 > job8.log 2>&1 &
# Now, wait for all jobs to finish
wait

# Start 8 more jobs
cd /path/to/job9
./job9 > job9.log 2>&1 &

[...repeat for job 10-15...]

cd /path/to/job16
./job16 > job16.log 2>&1 &
# Now, wait for all jobs to finish
wait
# End of script

Method 2 - use srun to schedule job steps within a job

In this method you use srun to queue all your job steps inside the job. SLURM will then figure out when to start new job steps, keeping the total number running at 8 at all times.

While the job is running, you can monitor the job steps using e.g "squeue -u $USER -s".

Sample submit script:

#!/bin/bash
#SBATCH -N1
#
cd /path/to/job1
srun --quiet -n 1 --exclusive --nodes=1-1 ./job1 > job1.log 2>&1 &
cd /path/to/job2
srun --quiet -n 1 --exclusive --nodes=1-1 ./job2 > job2.log 2>&1 &

[...repeat for jobs 3-99...]

cd /path/to/job100
srun --quiet -n 1 --exclusive --nodes=1-1 ./job100 > job100.log 2>&1 &
# Wait for all queued job steps to complete
wait
# End of script

Submitting jobs

You submit jobs in the same way on all systems (interactive, sbatch).

There are two ways to submit jobs to the batch queue system, either as an interactive job or as a batch job. Interactive jobs are most useful for debugging as you get interactive access to the input and the output of the job when it is running. But the normal way to run the applications is by submitting them as batch jobs.

Interactive job submission

Interactive access to the compute nodes is provided by the command interactive. This command accepts the same options as the sbatch command described below.

In order to start an interactive job allocating 2 nodes and 10 cores for 10 minutes, you type

$ interactive -N 2 -n 10 -t 00:10:00

Note: If you leave out the "-n 10" argument in the command, you will by default be given all available cores (in this case 16).

Once your interactive job has started, you are logged in to the first node in the list of nodes that has been assigned to the job. An environment has been created for you that, in addition to ordinary variables, also contains a number of SLURM environment variables:

[x_makro@n1 ~]$ env | grep -i slurm
SLURM_NODELIST=n[1-2]
SLURM_JOB_NAME=_interactive
SLURMD_NODENAME=n1
SLURM_PRIO_PROCESS=0
SLURM_NO_REQUEUE=
SLURM_NNODES=2
SLURM_JOBID=871896
SLURM_TASKS_PER_NODE=5(x2)
STY=20242.slurm871896
SLURM_JOB_ID=871896
SLURM_UMASK=0002
SLURM_NODEID=0
SLURM_NPROCS=10
SLURM_TASK_PID=20242
SLURM_CPUS_ON_NODE=8
SLURM_PROCID=0
SLURM_JOB_NODELIST=n[1-2]
SLURM_LOCALID=0
SLURM_JOB_CPUS_PER_NODE=8(x2)
SLURM_GTIDS=0
SLURM_JOB_NUM_NODES=2

Let us now run the trivial MPI Fortran application given above [mpitest_f77.f]:

[andjo@n282 ~]$ mpprun mpitest_f77
mpprun INFO: Starting openmpi run on 2 nodes (10 ranks)...
 Hello, world, I am            5  of           10
 Hello, world, I am            9  of           10
 Hello, world, I am            4  of           10
 Hello, world, I am            6  of           10
 Hello, world, I am            7  of           10
 Hello, world, I am            0  of           10
 Hello, world, I am            1  of           10
 Hello, world, I am            2  of           10
 Hello, world, I am            8  of           10
 Hello, world, I am            3  of           10

Batch job submission

The two main commands for handling job submissions are:

sbatch

Submits a job to the queue system.

scancel JOBID

Deletes a job from the queue system.

Batch jobs are submitted to the queue system with the command sbatch:

$ sbatch -J jobname submit.sh

A minimal submit script that requires 2 nodes and 16 cores for 10 minutes may look like:

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 00:10:00

mpprun ./mpiprog.x

# End of script

We note the use of "#SBATCH" lines in the script. This is an alternative way of specifying options to the sbatch command. We could thus have specified the jobname in the script with an additional line reading

#SBATCH -J jobname

Let us submit the above script:

[andjo@kappa ~]$ sbatch -J mpitest_f77 submit.sh
Submitted batch job 708487

After the job has completed, the output to standard out and standard error (if not re-directed) is returned from the system in a file called

slurm-JOBID.out

In this case this is where we find the output from our program:

[andjo@kappa ~]$ cat slurm-708487.out
mpprun INFO: Starting openmpi run on 2 nodes (16 ranks)...
 Hello, world, I am            5  of           16
 Hello, world, I am            0  of           16
 Hello, world, I am            1  of           16
 Hello, world, I am            2  of           16
 Hello, world, I am            3  of           16
 Hello, world, I am            7  of           16
 Hello, world, I am           10  of           16
 Hello, world, I am           11  of           16
 Hello, world, I am            9  of           16
 Hello, world, I am            4  of           16
 Hello, world, I am           13  of           16
 Hello, world, I am           12  of           16
 Hello, world, I am            8  of           16
 Hello, world, I am           14  of           16
 Hello, world, I am           15  of           16
 Hello, world, I am            6  of           16

Useful options to sbatch are listed with the command

$ man sbatch

Please read this man page, as the options are not exactly the same in all SLURM versions.

A selection of the most useful options includes:

-A account_string 

The project the job should be accounted on.

Large and medium scale projects have a project id of the form "SNIC xxx/yy-zz". The corresponding account string is obtained by taking the project id, removing all blanks " " and replacing all "/" with "-". For example, to account on the SNAC project "SNIC 005/06-98", the string "SNIC005-06-98" should be used.

Small scaled projects (also known as test projects) have a project id of the form LiU-20NN-NNNNN-N. The corresponding account string is liu-20NN-NNNNN-N.

A person who is a member of a single project may omit this argument.

-N nodes

The number of nodes to run the job on, each node has 8 cores.

-n tasks

The total number of tasks (mpi ranks).

--ntasks-per-node tasks

The number of tasks (mpi ranks) per node.

-c cpus

Number of CPUs per task. Use this for running multi-threaded tasks, for example applications using OpenMP.

-J jobname

Name of the job.

-t hh:mm:ss

The maximum execution time for the job.

-t days-hh

An alternative specification of the maximum execution time for the job.

-d JOBID

Defer the start of this job until the specified jobid has completed.

--mem MB

Specify the minimum amount of memory in MiB ("MB") for the job. Your job will be run on nodes with at least this much RAM. Note: don't specify this unless you need to force your job to run on the "fat" nodes. E.g to force your Kappa job to run on the fat nodes, specify a number greater than 24576.

-C fat

An alternative to specifying the minimum amount of memory per node is to specify that you only want to run on "fat" nodes using a constraint. The constraint for using "fat" nodes is named "fat" on Kappa and Matter, but the amount of memory in a "fat" node is different (72GB on Kappa and 144GB on Matter)

-d, --dependency=<dependency_list>

Defer the start of this job until the specified dependencies have been satisfied (please read the man page for sbatch). This can be useful if you have a long task that is divided into several different jobs that need to be run serially. The "afterok" dependency type is recommended; see the sketch below.
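A sketch of chaining two jobs with afterok (the job scripts and the job id are of course just examples; use the id printed by the first sbatch command):

$ sbatch part1.sh
Submitted batch job 708500
$ sbatch --dependency=afterok:708500 part2.sh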


Supervising jobs

In many cases it is desirable to supervise your running and scheduled jobs in order to find out if jobs have started or completed, how much remains of the allocated wall clock time, if a job produces sensible results, if a job makes efficient use of the cores, etc.

Monitor the queue

Useful commands to monitor the queue are:

squeue

Monitor jobs in the queue system.

sprio

List pending jobs including their current priority.

sview

X-application for monitoring jobs in the queue system (remember to login to the system with X-forwarding enabled). Visually displays allocated nodes and also any available reservations and partitions.

smap

Another way of monitoring jobs in the queue system. (typical use: "smap -c").

sshare

View fair-share information.

User-specific information is obtained with the "squeue" command:

[panor@neolith1 ~]$ squeue -u panor
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
   5351   neolith  mpiprog    panor   R       0:01      2 n[212-213]
[panor@neolith1 ~]$

We note that the output from "squeue" includes information about which nodes your application is running on.

Monitor a running job

Applications have various ways to return output from the calculations; some write to standard output (which may be re-directed) whereas others write specific output files that often reside in the scratch directory. In order to list the output of a running calculation in the latter case, you may need to access the local file systems of the compute nodes named "/scratch/local/". This is possible since you are allowed to log in with "ssh" to all compute nodes where you have running applications:

[panor@neolith1 ~]$ ssh n650
Last login: Mon Mar  3 10:28:03 2008 from l1
[panor@n650 ~]$ df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sda1                 9844      1496      7848  17% /
tmpfs                     8028         0      8028   0% /dev/shm
/dev/sda3               226365     36184    190181  16% /scratch/local
d1:/home               4194172   1602713   2591460  39% /home
s1:/software             95834     10259     85575  11% /software
[panor@n650 ~]$ 

Once logged in to a compute node with a running application, you may monitor the performance of your application with e.g. the "top" command:

[panor@n650 ~]$ top -u panor
top - 14:35:09 up 14 days, 23:56,  1 user,  load average: 1.73, 1.69, 1.60
Tasks: 170 total,   2 running, 168 sleeping,   0 stopped,   0 zombie
Cpu(s):  9.2%us,  3.4%sy,  0.0%ni, 87.3%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  16439708k total, 16353084k used,    86624k free,      880k buffers
Swap:  2047840k total,      180k used,  2047660k free, 14840652k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 7615 panor     25   0 1855m 928m 7768 R   99  5.8   6661:50 dalton.x           
 3350 panor     15   0 12712 1164  832 R    0  0.0   0:00.09 top                
 3249 panor     15   0 87504 1668  964 S    0  0.0   0:00.00 sshd               
 3250 panor     16   0 68240 1768 1312 S    0  0.0   0:00.03 bash               
 7596 panor     17   0 65872 1192 1004 S    0  0.0   0:00.00 script             
 7597 panor     23   0 65876 1288 1056 S    0  0.0   0:00.00 dalton             

You can also run a command on each node in a job using srun from the login node, as shown in the example below (where uptime is run on every node belonging to the job):

$ srun --jobid=284 uptime
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.45, 7.42
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.44, 7.41
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.43, 7.40
 16:11:32 up 23:35,  0 users,  load average: 7.75, 7.46, 7.43
 16:11:32 up 23:35,  0 users,  load average: 7.79, 7.54, 7.48
 16:11:32 up 23:35,  0 users,  load average: 7.75, 7.46, 7.41
 16:11:32 up 23:35,  0 users,  load average: 7.79, 7.57, 7.50
 16:11:32 up 23:35,  0 users,  load average: 7.74, 7.45, 7.41

Job Scheduling

The priority of your queued job is basically proportional to the percentage of your project's monthly allocation that is unused. This is called fair-share.

This "fair-share" priority is based on an approximation of the used time during the last 30 days. If your project has exceeded its allocation, your job priority vill be very low. If your project has used very little of your allocation, your job priority will be very high.

No matter how much time you have used, you can still submit jobs, and they will run if no higher-priority jobs are available.

Fair-share details on Kappa

Fair-share on Kappa is computed hierarchically: your percentage share within your parent account, multiplied by your parent account's percentage share among its sibling accounts, multiplied by your grandparent account's percentage share among its sibling accounts, and so on.

Here is the fair-share configuration on Kappa at the time this guide was written, displayed as a hierarchical tree:

                Account FairShare
   -------------------- ---------
   root                         1
    liu                       485
     liu1                  160000
     liu2                  160000
     liu3                  160000
     liu4                  160000
     liu5                  160000
    nsc                         1
    pilot                       1
    snac                      485
     p2010505                5000
     snic001-09-128         20000
     snic001-09-145         40000
     snic001-09-162         30000
     snic001-09-163         40000
     snic001-09-169         30000
     snic001-09-177         40000
     snic001-10-12          40000
     snic001-10-13          20000
     snic001-10-3           20000
     snic001-10-36          30000
     snic001-10-7           20000
     snic022-09-17          50000
     snic022-09-5          125000
     snic022-09-9          100000
    theophys                    1

root is always the top-level parent account for all other accounts and always has a share of 100%.

The account liu has a percentage share of 485 / (485 + 1 + 1 + 485 + 1) = 49.8%, liu1 has a share of (1/5) * 49.8% = 10%, etc. If there is only one user in liu1, then that user has a 10% share of Kappa; if there are 2 users in liu1 with equal shares within liu1, then they each have a 5% share of Kappa.

When a job has finished, its used time is accounted on all its parent accounts and therefore affects the priority for all other jobs running below those accounts.

Note: fair-share only influences a job's priority in the queue; if some accounts are not using their allotted shares, fair-share allows other accounts to use more than their share.

Jobs on Kappa - things you need to know

The configuration of Kappa is more complex than the one on Matter. This is due both to the shared funding (some nodes are directly owned by IFM), and to the different needs of some of the groups that use Kappa.

As of 2013-07-01, the configuration of Kappa is as follows:

Development nodes (4 nodes)
  Who can use this part: All Kappa projects except afm, liu1 and liu5
  Configuration: A separate partition that is active during office hours. Max walltime is one hour.
  How to use this part: sbatch -p devel --reservation=devel -A <account>

AFM nodes (32 nodes)
  Who can use this part: Members of the "afm" project (PI Sergey Simak)
  Configuration: A separate partition.
  How to use this part: sbatch -p afm -A afm

Green-type part (79 nodes)
  Who can use this part: Members of the "liu1" and "liu5" projects
  Configuration: A separate partition. 2 guaranteed nodes per user for liu1, 3 guaranteed nodes for liu5.
  How to use this part: sbatch -p green (or green_risk) -A <account>

Huge nodes (2 nodes)
  Who can use this part: Local research groups at Linköping University
  Configuration: A separate partition. Note: as SLURM account, use the project name with the suffix -huge appended to it.
  How to use this part: sbatch -p huge -A <account> (e.g. sbatch -p huge -A liu0-huge)

The development nodes are available between 08:00 and 18:00 local time (CET/CEST) every day. The purpose of these nodes is to ensure that nodes are available for quick testing of small jobs, before submitting your full-size jobs to the kappa partition. The maximum walltime of a job on the development nodes is one hour. AFM or Green-type users should use the afm and green partitions for development. Do not abuse the devel nodes by submitting many short "real jobs" to them!

The AFM partition is allocated by the PI according to the current needs within that group. I.e - check with the PI before using it.

"Green-type" scheduling is very different from our normal fair-share scheduling. Users of this part of Kappa always have guaranteed access to a certain number of nodes (see table). Since this setup can lead to a lot of wasted computing time when the guaranteed node are not used, "Green-type" scheduling also includes the possibility of running "risk jobs" to use that idle time. "Risk jobs" are jobs that might be killed at any time if a regular job needs to run. To avoid wasting computing time, you should make sure that all risk jobs use application-level checkpointing so that they can be restarted and continue where they left off. Risk jobs that are killed are not restarted by default. It is currently not possible to enable automatic restarting of killed jobs.

The huge nodes are currently configured to allow only one job at a time to run, this will most probably change in the future. Their usage is controlled by a separate fair-share, which is not affected by the usage of other parts of Kappa. All projects having an allocation on the huge nodes must specify a slurm account which is named after the project name with the suffix -huge appended to it, e.g. liu17-huge.

Trivia: The name "Green-type" comes from the old cluster "Green" which was used by some of the Kappa user groups before Kappa was available.

You should always specify which account/project to use for each job (sbatch -A account). If you don't, the scheduling system will use your default account, even if that account is not allowed to use the partition you have specified. Note: a specific account can normally only be used in a single part of Kappa (e.g. you cannot run in green_risk using a SNIC* account).

Interactive jobs: use the same options to the "interactive" command as are listed for sbatch in the table above.
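
For example (replace <account> with an account that is allowed to use the partition in question):

  # Regular interactive job, explicitly specifying the account:
  $ interactive -N 1 -t 00:30:00 -A <account>
  # Interactive job on the development nodes:
  $ interactive -N 1 -t 00:30:00 -p devel --reservation=devel -A <account>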

[top]

Debugging and tracing

Standard debugging tools like the GNU debugger gdb and Intel debugger idb are installed. There are also a few special programs available to help trace and debug parallel applications.

Intel Trace Analyzer and Collector

Note: we have only verified that ITAC works when using Intel MPI.

This tool was previously named Vampir. It can be used to trace the communication patterns of an MPI application. This is accomplished by recompiling your application and linking it against the trace libraries. The application then writes trace files when it is executed. These files can then be analyzed using the graphical trace analyzer on the login node.

ITAC has several features not described here; full documentation is available in the directory /software/intel/itac/$VERSION/doc

How to use:
1. Load the Intel MPI and ITAC modules.

  $ module add impi itac

2. Compile the MPI program with -Nmpi and the extra compile/link flags "-lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS"

  $ icc mpiprog.c -o mpiprog -Nmpi -lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS

3. Run the program with mpprun as usual. This will write trace files to the working directory.

  $ mpprun ./mpiprog

4. Open the trace files using the trace analyzer on the login node.

  [paran@neolith1 ~]$ traceanalyzer mpiprog.stf

Sample session:

[kronberg@neolith1 neolith]$ module add impi itac
[kronberg@neolith1 neolith]$ icc mpiprog.c -o mpiprog -Nmpi -lVT -I$VT_ROOT/include -L$VT_LIB_DIR $VT_ADD_LIBS
icc INFO: Linking with MPI impi/4.0.3.008.
[kronberg@neolith1 neolith]$ interactive -N2 -t 00:10:00 -A nsc
Waiting for JOBID 1450192 to start
...
[kronberg@n781 neolith]$ mpprun ./mpiprog
mpprun INFO: Starting impi run on 2 nodes (16 ranks)...
Hello, world, I am 0 of 16
Hello, world, I am 2 of 16
Hello, world, I am 9 of 16
Hello, world, I am 10 of 16
Hello, world, I am 4 of 16
Hello, world, I am 3 of 16
Hello, world, I am 12 of 16
Hello, world, I am 6 of 16
Hello, world, I am 15 of 16
Hello, world, I am 1 of 16
Hello, world, I am 13 of 16
Hello, world, I am 7 of 16
Hello, world, I am 11 of 16
Hello, world, I am 14 of 16
Hello, world, I am 8 of 16
Hello, world, I am 5 of 16
[0] Intel(R) Trace Collector INFO: Writing tracefile mpiprog.stf in /home/kronberg/work/rt/67774/neolith
[kronberg@n781 neolith]$ exit
Connection to n781 closed.
[kronberg@neolith1 neolith]$ traceanalyzer mpiprog.stf

TotalView Parallel Debugger

Full documentation for TotalView, including a User Guide, is available in the directory /software/apps/toolworks/totalview.8.7.0-7/doc/pdf/

License information: There is currently only a single TotalView license installed. If you encounter license availability problems, please contact support@nsc.liu.se so that we can consider purchasing more licenses.

The mpprun script has TotalView support, so you can start TotalView directly from mpprun. The following guide shows how to run and debug an MPI program on two nodes using ScaliMPI. TotalView supports several other MPI implementations, but these have not been verified by NSC on Neolith; feel free to try them if you want.

Note: TotalView has been successfully run on Kappa with OpenMPI, using the module versions openmpi/1.4-i111059 and icc/11.1.069. It should also work on Matter.

Simple TotalView example (from Neolith using OpenMPI):

  1. Load an MPI module and the totalview module, and if needed a non-default version of ICC:
    [kronberg@neolith1 tview2]$ module add openmpi/1.4-i111059
    [kronberg@neolith1 tview2]$ module add icc/11.1.069
    [kronberg@neolith1 tview2]$ module add totalview
    
  2. Make sure that you can run X11 applications on the login node. (start an xterm or something similar to verify).
  3. Build your application, e.g:
    [kronberg@neolith1 tview2]$ icc -Nmpi -g -o mpitest_c mpitest_c.c 
    icc INFO: Linking with MPI openmpi/1.4-i111059.
    [kronberg@neolith1 tview2]$
    
  4. Start an interactive job (in this case, 2 nodes and 10 minutes runtime).

    Note: it's recommended that you allocate two or more nodes when testing TotalView, so that you can properly see the cool feature of debugging processes on several nodes from one window. :)

    [kronberg@neolith1 tview2]$ interactive -N 2 -t 00:10:00 -A YOUR_PROJECT
    Waiting for JOBID 875869 to start
    ...
    
    Wait for the interactive job to start.
  5. Start your application under TotalView control.
    [kronberg@n126 tview2]$ mpprun --totalview ./mpitest_c
    
  6. TotalView Quick Start:
    • Click "OK" in the "Startup Parameters - mpimon" dialog.
    • Click the "Go" button.
    • TotalView detects that you are starting a parallel program, click "Yes" to stop it.
    • It is time to set breakpoints etc. You are now debugging your MPI program!
    • Reading the TotalView manual is highly recommended!
[top]

Applications

Most, but not all, applications have a corresponding module.

Also see the complete NSC Application Software List.

A selection of some of the most relevant applications:

Application Module(s) Kappa Matter
Intel C/C++ compiler icc see "module avail" see "module avail"
Intel Fortran compiler ifort see "module avail" see "module avail"
Intel debugger idb see "module avail" see "module avail"
Intel Trace Analyzer and Collector itac see "module avail" see "module avail"
Gaussian gaussian see "module avail" see "module avail"
Intel MPI impi see "module avail" see "module avail"
Open MPI openmpi see "module avail" see "module avail"
TotalView debugger totalview see "module avail" see "module avail"
Matlab matlab see "module avail" see "module avail"
Octave octave see "module avail" see "module avail"
Intel Math Kernel Library (MKL) mkl see "module avail" see "module avail"
Molden molden see "module avail" see "module avail"

Note: Not all applications are listed here. Run "module avail" on the cluster to get the most recent information. Also remember that some applications might not have a module. If you have questions about software, e-mail support@nsc.liu.se.
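
For example, to see what is available and load a module (matlab is just used as an example name here; use whatever application you need):

  $ module avail            # list all available modules
  $ module avail matlab     # list only modules matching "matlab"
  $ module add matlab       # load the default version
  $ module list             # show currently loaded modules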

[top]

VASP licensing

Due to license restrictions, NSC is only allowed to give access to our VASP binaries to users that are already covered by a VASP license.

In order to access NSC's VASP binaries on Kappa and Matter, you need to be a member of the vasp4 or vasp5 group (you can check this with the "groups" command).
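
For example (the username and group names in the output below are made up; the point is simply to look for vasp4 or vasp5 in the list):

  $ groups
  x_abcde snic vasp5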

There are three ways to get access to NSC VASP binaries:

  • You can be added to one of the licenses that we have on file by referring us to the license number that you are covered by. We will then confirm this with the holder of that license.
  • If we don't have the license that you are covered by on file, then you must provide a photocopy of the license agreement and the license number. This number is printed on the invoice, so people generally send us a photocopy of the invoice as well.
  • If you know that you are covered by a license, but we don't have it on file and you don't have access to the license contract, then we can confirm that you are a registered user with the VASP developers. This will usually take a couple of days since we have to email the VASP developers and get a reply from them.

To get access, send a request to support@nsc.liu.se. Remember to tell us which of the three methods applies to you, and include any details needed (e.g. your license number).

List of acronyms

GiB
gibibyte, 1024**3 bytes
MiB
mebibyte, 1024**2 bytes
MKL
Math Kernel Library
MPI
Message Passing Interface
OpenMP
Open Multi-Processing
scp
Secure Copy
SLURM
Simple Linux Utility for Resource Management
ssh
Secure Shell
TiB
tebibyte, 1024**4 bytes
[top]

Frequently asked questions

Questions:

  1. How much data am I allowed to store on the various file systems?
  2. Why doesn't my job start?
  3. How many hours have I consumed this month?
  4. How do I know if my job was killed due to exceeded wall clock time?
  5. How do I copy files from a compute node when my job exceeds its timelimit?

Answers:

  1. Run the command "nscquota". See also above.

  2. See the reason field from the command "sinfo".

    See also above.

  3. Run the command "projinfo" (takes no options or arguments). See also above.

  4. At the moment this information is only recorded in log files accessible to the NSC staff. We are working on a solution to make the information available to our users.

  5. When a job exceeds its timelimit, SLURM sends the job a SIGTERM signal. This signal can be trapped by the job script in order to clean up. 300 seconds after the SIGTERM, a SIGKILL is sent, which can't be trapped. (The limit is configurable; please contact NSC support if you think 300 seconds is too short.)

    Here is an example of how to trap SIGTERM in a job script, and then copy the file MYDATA from SNIC_TMP to your HOME directory:

    # Copy MYDATA from node local scratch to HOME
    function cleanup() {
        cp ${SNIC_TMP}/MYDATA ${HOME}/
    }
    # Trap SIGTERM, sent when the timelimit is exceeded.
    trap cleanup SIGTERM
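
    Note that bash normally runs a trapped signal handler only after the currently running foreground command has finished, so a common pattern is to start the main program in the background and wait for it. Below is a minimal sketch of a complete job script using that pattern (my_program, the account and the resource requests are placeholders):

    #!/bin/bash
    #SBATCH -N 1
    #SBATCH -t 00:30:00
    #SBATCH -A <account>

    # Copy MYDATA from node local scratch back to HOME on SIGTERM.
    function cleanup() {
        cp ${SNIC_TMP}/MYDATA ${HOME}/
    }
    trap cleanup SIGTERM

    # Run the main program in the background and wait for it, so that the
    # trap can run as soon as SLURM sends SIGTERM at the timelimit.
    cd ${SNIC_TMP}
    mpprun ${HOME}/my_program &
    wait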
    
[top]

If you need help

For more details about support, please see the NSC support page.

For the SNIC resources, you should use the email address support@nsc.liu.se. Don't forget to tell us which system you are having trouble with, and describe the problem in as much detail as possible.

You can also send more general questions to the same address, e.g. "how do I use application X?", "can you install application Y?", "how can I make my Z application scale better?".






Page last modified: 2014-03-28 14:32
For more information contact us at info@nsc.liu.se.