Monolith User Guide

The Monolith cluster

The Linux cluster Monolith at NSC is built from 206 rack-mounted nodes. Each node is a PC with dual Intel 2.4 GHz Xeon processors and 2 GB of memory. The nodes are divided into login, service, storage, and compute nodes. Currently there are 3 login nodes, 1 service node, 4 storage nodes, and 198 compute nodes.

Only the login nodes are connected to the rest of the world.

Monolith has an internal 100 Mbps Ethernet, used for file transfer, control traffic, and user-level communication, and a high-bandwidth, low-latency SCI network, used for MPI communication.

The operating system is Linux. We use the Red Hat distribution, currently version 7.3, running a 2.4 Linux kernel.

The following chapters summarize how to use Monolith.

Key features of Monolith

  • 408 Intel Xeon 2.4GHz processors.

  • 3 front end nodes, 198 compute nodes (396 processors).

  • Each node has 2 processors. Allocation is always on a per node basis.

  • Peak performance of 4.8 Gflops per processor, a total of 1.9 Tflops for the whole system.

  • MPI over two different networks:

    1. SCI-based network, using ScaMPI.

    2. Fast Ethernet, using MPICH or LAM.

  • Linux Red Hat.

System information

Login procedure

ssh to the login node that you have been assigned to, one of the following:

  • login-1.monolith.nsc.liu.se (or just monolith.nsc.liu.se)
  • login-2.monolith.nsc.liu.se
  • login-3.monolith.nsc.liu.se

When you log in to Monolith, you end up on a front end node. Use this node when you compile and run very short non-parallel jobs (less than 1 minute). For computation, use the compute nodes through the batch system.

Accessing Nodes

For normal use, you never have to log in to the compute nodes. Most things can be done from the front end. Use "rlogin" to a compute node only when something cannot be handled from the front end, and log out as soon as you are done on a node.

A node has 2 processors sharing memory and interconnect. Nodes are never shared between users; they are always allocated on a per-node basis. Even if you only want to use a single processor, you have to allocate a whole node and will be charged for both processors! Running two single-processor runs in the same job is therefore a better idea, as sketched below.
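
For example, a minimal batch-script fragment (the program and file names are placeholders) that keeps both processors of an allocated node busy could look like this:

# Run two independent serial programs, one per processor, and wait for both.
./serial_a < input_a > output_a &
./serial_b < input_b > output_b &
wait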

Monitoring the system

The current status of Monolith (running, reserved and idle nodes) is graphically displayed in real time at http://status.nsc.liu.se/monolith

Login shell

The following shells are available: sh, bash, csh, tcsh. To see which shell you are currently running, type 'echo $SHELL'. To change your default login shell, use the chsh command. Your default $PATH is initialized at login by the system so that most tools (compilers, debuggers and performance tools) are available without having to give the absolute path.

On line data storage

Three types of file systems can be used for file storage: /home, /disk/global, and /disk/local.

/home


There are several "home" file systems (/home, /home2 etc ) and you will be assigned to one when your account is set up. "home" is of limited size and should be used for files that are not easily reproduced such as dot-files, init files, edited source code, and manually created input files. The file system is exported and NFS-mounted on all nodes. It is backed up nightly on another disk. The backups are stored for one week.

/disk/global


There are several "global" file systems (/disk/global, /disk/global2, /disk/global3 etc ) and you will get access to one when your account is set up. "global" is larger than "home" and can be used for voluminous and automatically reproducible data. The file system is exported and NFS-mounted on all nodes. No backup is taken of these files.

/disk/local


This is a fairly large file system (currently 60 GB) that is local to each node. Use it for storing temporary files during computation. All files in this file system are removed upon completion of the job.

Interactive usage

Interactive usage of the system can be done either on the front end node or through the batch system.

  • The front end node should only be used when you compile and run very short non-parallel jobs (less than 1 minute).

  • For computation, use the computing nodes through the batch system. This can be done either interactively or by submitting a batch job script.

Backups

Backup of home directories is taken every night.

File transfers

Use scp or sftp to transfer files to and from the system.
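
For example (the user name and file names are placeholders), copying a file to and from your Monolith home directory could look like this:

# Copy a local file to Monolith.
scp inputfile my_user@monolith.nsc.liu.se:
# Copy a result file back to the current local directory.
scp my_user@monolith.nsc.liu.se:outputfile .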

Email list

NSC maintains an email list monolith-users that is open for all users at NSC. Each user is automatically enrolled. You can manage your account from the web page: http://www.nsc.liu.se/mailman/listinfo/monolith-users

Editors

vi and emacs are available.

The modules system for maintaining system software

The modules environment enables dynamic configuration of your environment; system and application software can be added, deleted or switched with one simple command.

Currently the following default modules are assigned at login:

  • intel: Intel compiler version 7.1

  • pgi: Portland Group compiler

  • mkl: Intel math library

  • totalview: Totalview debugger

  • vampir: Vampir/Vampirtrace MPI analyzing tool

  • pbs: the batch System

  • maui: the batch system scheduler

Type "module avail" to see all available modules and "module list" to see a list of all loaded modules.

"module load module_name" will load the module "module_name".

"module unload old_module_name; module load new_module_name" will switch module.

To automatically load a module that is not a default, put the module name in a file called ".modules" in your home directory.
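
As an illustration, a typical module session could look like the sketch below (the intel/8.0 module is described later in this guide; treat the exact module names as examples):

module list                    # show the modules currently loaded
module avail                   # show all modules installed on the system
module unload intel            # remove the default Intel 7.1 module ...
module load intel/8.0          # ... and load the 8.0 version instead
echo intel/8.0 >> ~/.modules   # load it automatically at every login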

Running jobs on Monolith

Batch queue system

The batch system on Monolith is PBS Pro. General information about PBS is available through "man pbs". Use the batch queue system for all jobs, interactive as well as non-interactive.

The batch system has two major parts:

  • The PBS batch server.

  • The Maui job scheduler.

The next pages give condensed information on how to submit and monitor a batch job, followed by more detailed information about the various parts of the Monolith batch system.

Submitting a batch job

Batch jobs are submitted with the qsub command. PBS directives are specified either as comments in the script or options to the qsub command.

To run a job, the following parameters are required:


Required parameter                          Maximum value allowed   PBS name
Project to account used CPU-hours on        -                       account_string
Number of processors                        396                     nodes
Total job run time (wallclock)              144 hours               walltime
Number of processors to use on each node    2                       ppn

account_string

The account should be specified as given by the command projinfo. That is:

For SNAC projects, all blanks (" ") should be removed and all "/" replaced with "-". For example, to account on the SNAC project "SNIC 005/06-98", the string "SNIC005-06-98" should be used.

A letter "p" should be added in front of NSC project numbers. For example, to account on the NSC project "2006599", the string "p2006599" should be used.

Users who are members of only a single project with a valid allocation may omit the account string argument.

Maximum amount of running (non-completed) jobs: 6912 CPU-hours.

Jobs are automatically routed to the appropriate queue based on these parameters.

Here is a sample PBS script for running a MPI job on 16 processors (8 nodes) and accounting the job on the SNAC project "SNIC 005/06-98":

#!/bin/sh

# Account the job on the SNAC project "SNIC 005/06-98"
#PBS -A SNIC005-06-98

# Request 8 nodes with 2 processors-per-node (ppn=2), total of 16 processors.
#PBS -l nodes=8:ppn=2

# Request 6 hours and 10 minutes of wall-clock time.
#PBS -l walltime=6:10:00

# Request regular output (stdout) and error output (stderr) to the same file.
#PBS -j oe

# Send mail when the job aborts (a) or exits (e).
#PBS -m ae
#PBS -M user@some.where 

# Goto the directory from which you submitted the job.
cd $PBS_O_WORKDIR 

# Start the job with mpirun on the nodes that the batch queue system has allocated
# for your job (see PBS -l nodes above).
/usr/local/bin/mpirun ./a.out < inputfile

Submit by doing "qsub batch-script".

See "man qsub" for an explanation of the submit options used in the script.

Queue limits are subject to change, please check the web for the latest information!

Monitoring your job

Scheduler commands:

showq

List all jobs visible to the scheduler.

showstart

Makes a qualified guess about when a job will start.

PBS commands:

qstat -u "my_user_name" 

Show information about all of my jobs.

qstat -f job_id

Show detailed information about job "job_id".

Interactive Access

You can run interactively (e.g. for debugging) by adding '-I' to the qsub command. If there are idle nodes available and your request is within the limits, the scheduler will allocate the nodes and return a prompt to you on one of the allocated nodes.

Example: Allocate two nodes (four processors) for one hour of interactive access accounting the job on the SNAC project "SNIC 005/06-98":

moonwatch$ qsub -I -A SNIC005-06-98 -lwalltime=1:00:00,nodes=2:ppn=2
qsub: waiting for job 2561.moonwatch to start
qsub: job 2561.moonwatch ready

n25$ _

To run your MPI program interactively, you can also use /usr/local/bin/mpirun directly from the front-end:

/usr/local/bin/mpirun -A <account_string> -np <NN> <program> <args ...> 

It automatically uses PBS to allocate an interactive job on NN processors for one hour. Your terminal will be attached to the I/O of the job.

PBS environment variables

When the job starts, two environment variables assigned by PBS are of special interest:

$PBS_NODEFILE

The name of the file in which the allocated nodes are listed.

$PBS_O_WORKDIR

The directory from which the job was submitted.

For more environment variables see the man page for "qsub".
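
As a sketch of how these variables are typically used inside a batch script (the variables are set by PBS; the surrounding script is an example only):

cd $PBS_O_WORKDIR                 # run in the directory the job was submitted from
NPROCS=`wc -l < $PBS_NODEFILE`    # count the processor slots allocated to the job
echo "Running on $NPROCS processors:"
cat $PBS_NODEFILE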

Accessing the output of a running job

The batch queue system handles and keeps standard output (stdout) and standard error (stderr) from all jobs. When the job is finished, the output is delivered to the directory in which the job was started, unless otherwise specified.

With the command pbspeek you can take a peek at output files of your own jobs even when they are still running.

Usage: pbspeek [-o|-e][-h] <jobid> 
-o

Show stdout.

-e 

Show stderr.

-h 

Show this help.

You can also explicitly redirect the output to a file with the ">" redirect symbol, e.g. "mpprun a.out > output" in the batch script.

Job cleanup

Automatic cleanup is performed when a job is finished. This includes killing all the user processes and removing everything from /disk/local on the compute nodes that participated in the batch job.

Saving /disk/local data after a job crash

To prevent losing data that is stored in /disk/local in the event of a job crash, the use of the PBS stage-out facility is recommended.

Example of PBS stage-out facility:

#!/bin/sh
#PBS -A account_string
#PBS -lwalltime=1
#PBS -lnodes=1:ppn=2
#PBS -W stageout=/disk/local/file1@localhost:/disk/global/my_user/file1
#PBS -W stageout=/disk/local/file2@localhost:/disk/global/my_user/file2
 
cat >/disk/local/file1 <<EOF
This is file one
EOF
cat >/disk/local/file2 <<EOF
This is file two
EOF
sleep 10000
#

In this example /disk/local/{file1,file2} will be copied to /disk/global/my_user/ when the job is finished (or aborted because it exceeded the time limit). There is a corresponding stage-in facility. More information is available in the man page for "qsub" on Monolith.

Bonus

The bonus system that NSC successfully uses on its other supercomputer systems to achieve a fair distribution of resources among users is also running on Monolith. Its purpose is to lower the priority of projects and users that have consumed their allotted time. Jobs from bonus users are only scheduled when there is no other job with normal priority to run.

Frequently used PBS user commands:


qstat

Show status of PBS batch jobs.


qsub

Submits a job to the PBS queuing system.


qdel

Delete a PBS job from the queue.

Less frequently used PBS user commands:


qalter

Modifies the attributes of a job.


qhold

Requests that the PBS server place a hold on a job.


qrerun

Reruns a PBS batch job.


qrls

Release hold on PBS batch job.


qsig

Requests that a signal be sent to the session leader of a batch job.

For more information, please see the corresponding man page.

Maui Job Scheduler

The Maui Scheduler is used to schedule batch jobs. It creates advance reservations for jobs that are considered possible to run. This allows large jobs (many nodes) to start in a reasonable time and avoids starvation caused by smaller jobs (fewer nodes) overtaking them. Better control of quality of service is also achieved, since the priority of a job has more impact in this reservation scheme than in other schedulers.

Commands for extracting information from the scheduler currently available to users:


showq

List all jobs visible to the scheduler.


showbf

Show resources available for immediate access.


showstart

Makes a qualified guess about when a job will start.


checkjob

Display numerous scheduling details for a job.


jobstate

Shows what state the job is in: running, idle or hold.

Configuration

Currently, there is one run queue, "dque", which has the limits of 144 hours of wall-clock time and 396 processors (198 nodes) per job. Every user who submits a job within these limits and is a member of a granted project will end up in this queue. Other users are routed to the queue "wait" which is stopped. Jobs from this queue can be started by the system administrator if needed.

To facilitate interactive development and achieve tolerable turnaround time for very short test runs, a standing advance reservation of 16 processors, 09:00 - 17:00, Monday - Friday, has been created. This reservation accepts jobs with limits of at most 8 nodes and 1 hour.

No upper limit on the number of submitted jobs exists. Instead, limits set in the scheduler prohibit users from getting high priority due to extensive queueing times.

Currently, the Maui scheduler performs a full scheduling cycle each minute.

For more information about the user commands and job priority and rating, see http://www.nsc.liu.se/systems/monolith/maui.html

Programming environment

There are three compiler suites available:

1. Intel's Compiler Suite

   Language     Version 7.1   Version 8.0
   C            icc           icc
   C++          icc           icc
   Fortran 77   ifc           ifort
   Fortran 90   ifc           ifort

2. Portland Group's (PGI) Compiler Suite

   C            pgcc
   C++          pgCC
   Fortran 77   pgf77
   Fortran 90   pgf90
   HPF          pghpf
   Debugger     pgdbg
   Profiler     pgprof

3. GNU Compiler Collection

   C            gcc or cc
   C++          gxx or g++
   Fortran 77   g77
   Debugger     gdb
   Profiler     gprof

The Intel or PGI suites are recommended. They produce, generally speaking, more efficient code and are also more integrated with the MPI run time environment. Your default $PATH is initialized at login by the system so that most tools (compilers, debuggers and performance tools) are available without having to give the absolute path.

Intel compilers

There are two versions of the Intel compiler available: 7.1 and 8.0. They are separate, mutually incompatible compilers with different syntax.

At login you are given the 7.1 version, which is the version with full support in terms of external libraries. The 8.0 version generally gives better performance, however, and also has an extended set of functionalities. ScaMPI is supported for 8.0.

To move from the Intel 7.1 to the 8.0 environment, give the command sequence: module unload intel; module load intel/8.0

To move from the Intel 8.0 to the 7.1 environment, give the command sequence: module unload intel/8.0; module load intel

Intel 7.1 compiler, useful compiler options

Below are some useful compiler options, please do "man ifc" or "man icc" for more!

a) Optimisation

There are three different optimization levels in Intel's compilers:


-O0 

Disable optimizations.


-O1,-O2 

Enable optimizations (DEFAULT).


-O3 

Enable -O2 plus more aggressive optimizations that may not improve performance for all programs.

A recommended flag for general code is -O2; for best performance use "-O3 -xW -tpp7", which enables software vectorisation. As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.
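
For instance, compiling a Fortran program (the file name is a placeholder) with these flags could look like:

ifc -O2 -o myprog myprog.f90               # safe default optimization
ifc -O3 -xW -tpp7 -o myprog myprog.f90     # aggressive optimization with vectorisation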

b) Debugging


-g

Generate symbolic debug information.

c) Profiling

-p

Compile and link for function profiling with UNIX gprof tool.

d) Options that only apply to Fortran programs


-r8

Set default size of REAL to 8 bytes.


-i8

Set default size of integer variables to 8 bytes.


-zero

Implicitly initialize all data to zero.


-save

Save variables (static allocation) except local variables within a recursive routine; opposite of -auto.


-C

Enable extensive runtime error checking (-CA, -CB, -CS, -CU, -CV) with tracing of errors.

e) Linking

"-xW" is required when linking object code that was compiled with that option.

Other libraries that are not default but that you might need:

  • Vaxlib (Vax compatible routines)

  • posixlib

  • C90 (I/O with C)

f) Large File Support

To read/write files larger than 2GB you need to specify some flags at compilation:

Fortran: no additional flags needed.

C/C++: LFS is obtained by specifying the following flags when compiling and linking: -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE
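
For example, a C compile-and-link line with large file support (the file name is a placeholder):

icc -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -o bigio bigio.c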

g) Special MPI options

There are three command-line options (locally implemented, not mentioned in the man pages) to all of Intel's and PGI's compilers that make compiling and linking programs with MPI easier:


-Nscampi

Use include files and libraries from the ScaMPI implementation of MPI.


-Nmpich

Use include files and libraries from the MPICH implementation of MPI.


-Nlam

Use include files and libraries from the LAM implementation of MPI.

The options should be used both at compile-time (to specify the path to the include files) and at link-time (to specify the correct libraries). For more information about the MPI implementations and how to run MPI programs see the "Parallelisation" section.
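
As a sketch (program and file names are placeholders), building the same MPI source for the SCI network and for Fast Ethernet could look like:

ifc -Nscampi -O2 -o myprog_sci myprog.f    # compile and link against ScaMPI
ifc -Nmpich  -O2 -o myprog_eth myprog.f    # compile and link against MPICH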

h) Miscellaneous options

Little endian to Big endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When set, the following operations are done:

  • The WRITE operation converts little endian format to big endian format.

  • The READ operation converts big endian format to little endian format.

Examples:


F_UFMTENDIAN=big

Convert all files.


F_UFMTENDIAN="big;little:8"

All files except those connected to unit 8 are converted.
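
In an sh-style batch script the variable would be set before the program is started, for example (the program name is a placeholder):

# Treat all unformatted files as big endian, except those opened on unit 8.
F_UFMTENDIAN="big;little:8"
export F_UFMTENDIAN
./a.out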

For more options, please read the man page for the specific compiler on the system, or read the Intel Fortran/C Compiler User's Guide (see the links in the "Monolith documentation" section at the end of this guide).

Intel 8.0 compiler, useful compiler options

Below are some useful compiler options, please do "man ifort" or "man icc" for more!

a) Optimisation

There are three different optimization levels in Intel's compilers:


-O0

Disable optimizations.


-O1,-O2 

Enable optimizations (DEFAULT).


-O3

Enable -O2 plus more aggressive optimizations that may not improve performance for all programs.

A recommended flag for general code is -O2; for best performance use "-O3 -xW -tpp7", which enables software vectorisation. As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.
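
For instance, with the 8.0 Fortran driver ifort (the file name is a placeholder), the flags above could be combined with -traceback as follows:

ifort -O3 -xW -tpp7 -traceback -o myprog myprog.f90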

b) Debugging


-g

Generate symbolic debug information.


-traceback

Generate extra information in the object file to allow the display of source file traceback information at runtime when a severe error occurs.


-fpe<n>

Specifies floating-point exception handling at run-time.


-mp

Maintains floating-point precision (while disabling some optimizations).

c) Profiling


-p

Compile and link for function profiling with UNIX gprof tool.

d) Options that only apply to Fortran programs


-assume byterecl

Specifies (for unformatted data files) that the units for the OPEN statement RECL specifier (record length) value are in bytes, not longwords (four-byte units). For formatted files, the RECL unit is always in bytes.


-r8

Set default size of REAL to 8 bytes.


-i8

Set default size of integer variables to 8 bytes.


-zero 

Implicitly initialize all data to zero.


-save

Save variables (static allocation) except local variables within a recursive routine; opposite of -auto.


-CB

Performs run-time checks on whether array subscript and substring references are within declared bounds.

f) Large File Support (LFS).

To read/write files larger than 2GB you need to specify some flags at compilation:

Fortran: no additional flags needed.

C/C++: LFS is obtained by specifying the flags below when compiling and linking:


-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE 

g) Special MPI options

There are three command-line options (locally implemented, not mentioned in the man pages) to all of Intel's and PGI's compilers that make compiling and linking programs with MPI easier:


-Nscampi

Use include files and libraries from the ScaMPI implementation of MPI.


-Nmpich

Not implemented yet for the Intel 8.0 compilers.


-Nlam

Not implemented yet for the Intel 8.0 compilers.

The options should be used both at compile-time (to specify the path to the include files) and at link-time (to specify the correct libraries). For more information about the MPI implementations and how to run MPI programs see the "Parallelisation" section.

h) Miscellaneous options

Little endian to Big endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When set, the following operations are done:

  • The WRITE operation converts little endian format to big endian format.

  • The READ operation converts big endian format to little endian format.

Examples:


F_UFMTENDIAN=big

Convert all files.


F_UFMTENDIAN="big;little:8"

All files except those connected to unit 8 are converted.

For more options, please read the man page for the specific compiler on the system, or read the Intel Fortran/C Compiler User's Guide (see the links in the "Monolith documentation" section at the end of this guide).

PGI compilers, useful compiler options

Below are some useful compiler options, please do "man pgf90" or "man pgcc" for more!

a) Optimization

There are three different optimization levels in PGI's compilers:


-O0

No optimization.


-O1

Local optimization (default).


-O,-O2

Aggressive optimization.

A recommended flag for general code is -O2; for best performance use -fast, which is equivalent to "-O2 -Munroll -Mnoframe". As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.
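
For instance (file names are placeholders):

pgf90 -fast -o myprog myprog.f90    # equivalent to -O2 -Munroll -Mnoframe
pgcc  -O2   -o myprog myprog.c      # conservative choice for general C code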

b) Debugging


-g

Generate symbolic debug information.

c) Porting

These options only apply to Fortran programs.


-r8

Interpret REAL variables as DOUBLE PRECISION.


-i8

Treat INTEGER variables as eight bytes.


-byteswapio

Swap bytes from big-endian to little-endian or vice versa on input/output of unformatted FORTRAN data. Use of this option enables reading/writing of FORTRAN unformatted data files compatible with those produced on Sun or SGI systems.

d) Profiling


-Mprof=[option[,option,...]]

Set profile options.


func

Perform PGI-style function level profiling.


lines

Perform PGI-style line level profiling.

e) Large File Support

To read/write files larger than 2GB you need to specify some flags at compilation:

Fortran: add the flag "-Mlfs" to your compile and link command.

C/C++: LFS is obtained by specifying the flags below when compiling and linking:


-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE

f) Special MPI options

There are three command-line options (locally implemented, not mentioned in the man pages) to all of Intel's and PGI's compilers that make compiling and linking programs with MPI easier:


-Nscampi

Use include files and libraries from the ScaMPI implementation of MPI.


-Nmpich

Use include files and libraries from the MPICH implementation of MPI.


-Nlam

Use include files and libraries from the LAM implementation of MPI.

The options should be used both at compile-time (to specify the path to the include files) and at link-time (to specify the correct libraries). For more information about the MPI implementations and how to run MPI programs, see the "Parallelisation" section.

g) Miscellaneous options

The IA-32 architecture used on Monolith uses 80-bit registers for floating-point operations. This extended format can lead to answers that, when rounded, do not match the expected result. The option -pc 64 can be used to explicitly set the precision to standard IEEE double precision using 64 bits.

For more options, please read the man page on the specific compiler on the system or read the PGI User's Guide: http://www.nsc.liu.se/pgi/

Parallelization

Message passing: MPI and PVM are available on the system.

Message Passing Interface (MPI)

There are three different MPI implementations available on Monolith: ScaMPI, MPICH, and LAM. ScaMPI uses the high-performance SCI network, while MPICH and LAM use Fast Ethernet.

Fast, easy-to-use guide for the impatient

  1. Compile your program using the Intel or PGI compilers. There are compilers for C, C++, F77, F90, and HPF (PGI only). Do not specify any include paths or libraries related to MPI, just supply the appropriate MPI option -Nscampi, -Nlam, or -Nmpich.

  2. Start your program with

     /usr/local/bin/mpirun <your_program>

     in your batch job script. The appropriate method of running your program (ScaMPI, LAM, MPICH) will automatically be used. Also, the program will automatically start on all the nodes the batch queue system has allocated for your job.

/usr/local/bin/mpirun is a locally supplied, generic script that finds out which MPI implementation is used and starts the appropriate daemons and monitors. To be sure it recognizes your program, please try:


/usr/local/bin/mpirun -q <your_program> 

on the command-line on the front-end (i0) before you use it in a batch script.

If /usr/local/bin/mpirun does not recognize your program or you want to start a different number of instances of your program, the following options are available:

Usage: mpirun [-h][-q][-np <procs> | -s][-Nscampi|-Nlam|-Nmpich|-Npvm] <program> ...


-Nscampi <program>

is linked with ScaMPI


-Nlam <program>

is linked with LAM


-Nmpich <program>

is linked with MPICH


-Npvm <program>

is linked with PVM


-Norder=block

nodes are sorted in block order


-core

enable core dump for all processes


-np <nproc>

specify the number of processes to spawn (default: number of hosts in $PBS_NODEFILE)


-s 

one process per node.


-q

show which MPI mpprun will use and then exit.


-h

show help and exit.

/usr/local/bin/mpirun can also be used directly from the front-end. See Interactive Access for more information.

Description of the Different MPI Implementations: ScaMPI, MPICH and LAM

If you need to use another compiler, or choose to use the options below, the -Nscampi, -Nlam, and -Nmpich options available with the Intel and PGI compilers cannot be used.

ScaMPI

To use the fast SCI network, you have to compile and link with ScaMPI, a proprietary MPI implementation from SCALI. It is installed in /opt/scali. Furthermore, you must use /opt/scali/bin/mpirun to start your application.

To link against the MPI library from SCALI you should use one of the following lines:


C

-L/opt/scali/lib -lmpi -lpthread 


Fortran

-L/opt/scali/lib -lfmpi -lmpi -lpthread

Do not forget to include -lpthread when using ScaMPI. Even though the object files might link without errors, the resulting executable may hang when started.
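
Putting the pieces together, a sketch of compiling a Fortran MPI program directly against ScaMPI with the Intel compiler (without the -Nscampi shortcut; the file name is a placeholder):

ifc -I/opt/scali/include myprog.f -L/opt/scali/lib -lfmpi -lmpi -lpthread -o myprog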

MPICH

MPICH 1.2.4 is installed in /usr/local/mpich-1.2.4/<compiler>, where <compiler> is one of the following three compiler versions:

  • gcc-2.96

  • intel-6.0

  • pgi-4.0

Choose the appropriate version depending on the compiler you use. The main compatibility difference is the number of suffix underscores in the identifiers of compiled Fortran code: GNU uses two underscores while Intel and PGI use only one. For general information on how to use MPICH, see http://www-unix.mcs.anl.gov/mpi/mpich. There are also man pages available on the Monolith system.

To link against the MPICH MPI library you should use one of the following lines:


C, C++

-L/usr/local/mpich-1.2.4/<compiler> -lmpich 


Fortran

-L/usr/local/mpich-1.2.4/<compiler> -lfmpich -lmpich 

Include files are in /usr/local/mpich/include and in /usr/local/mpich/include/mpi2c++ (C++).

To run your MPI(CH) program with e.g. 10 processes, do:


/usr/local/mpich/bin/mpirun -np 10 <program> <arguments> 

mpirun is modified and adapted to work properly in

  1. PBS batch queue scripts

  2. NSC's environment.

The nodes used for programs started with mpirun are extracted from the file specified by the environment variable $PBS_NODEFILE.

The option -np defaults to 1 if not given.

LAM

LAM 6.5.6 is installed in /usr/local/lam-6.5.6/<compiler>, where <compiler> is one of the following three compiler versions:

  • gcc-2.96

  • intel-6.0

  • pgi-4.0

Choose the appropriate version depending on the compiler you use. The main compatibility difference is the number of suffix underscores in the identifiers of compiled Fortran code: GNU uses two underscores while Intel and PGI use only one. For general information on how to use LAM, see http://www.lam-mpi.org. There are also man pages available on the Monolith system.

To link against the LAM MPI library you should use one of the following lines:


C

-L/usr/local/lam-6.5.6/<compiler> -lmpi -llam 


C++

-L/usr/local/lam-6.5.6/<compiler> -llammpi++ -lmpi -llam 


Fortran

-L/usr/local/lam-6.5.6/<compiler> -llamf77mpi -lmpi -llam 

Include files are in /usr/local/lam/include and in /usr/local/lam/include/mpi2c++ (C++).

Launching a LAM program requires a little more effort than launching an MPICH or ScaMPI program (unless you use the /usr/local/bin/mpprun described above). LAM requires daemons to be started on each node before the job is launched and stopped after the job has finished. Here is how it can be done (typically in a PBS script):

/usr/local/lam/bin/lamboot -v $PBS_NODEFILE   
sleep 1
/usr/local/lam/bin/mpirun -np 10 <program> <arguments>
/usr/local/lam/bin/lamhalt -v
sleep 1 

The two sleep commands are used to ensure that LAM manages to start/stop the daemons. Note that this is NOT the way NSC recommends launching LAM applications. The recommended way to launch any parallel application is to use /usr/local/bin/mpprun.

Performance analysis and debugging

Profiling

If you use the PGI compilers, pgprof is a tool that analyzes data generated during execution of specially compiled programs (using the -Mprof=func or -Mprof=lines compiler command-line options).

Example: profile and analyze the executable a.out

  1. Compile the code using the "-Mprof=func" compiler option.

  2. Run the executable. This will generate performance data in the file pgprof.out.

  3. Generate the profiling output by running pgprof. You will now get a profile of the time spent in various subroutines.

See "man pgprof" for more details.

For Intel and GNU compilers, compiling with "-p" or "-pg" and using the "prof" or "gprof" utilities provides similar functionality.
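
A sketch of that gprof workflow with the Intel compiler (file names are placeholders):

ifc -p -O2 -o myprog myprog.f     # compile and link with profiling support
./myprog                          # the run writes profile data to gmon.out
gprof ./myprog gmon.out > profile.txt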

Profiling a MPI application

There is no support for profiling MPI applications in the tools listed above. We are in the process of evaluating the Intel VTune profiler to see if it can provide this functionality.

Meanwhile, you need to view your application as separate processes, each generating its own profile output. Since all processes will use the same file name ("pgprof.out" or "gmon.out"), the best way to distinguish them is to run your application using only one processor per node (ppn=1) and from the /disk/local directory, thus providing a separate location for the output from each process. At the end of the batch script (after the execution), collect and rename the different output files, for example with the command:


jpdsh 'cp /disk/local/gmon.out  /disk/global/$USER/prof_dir/gmon.out.$HOST' 

Core dump

When using ScaMPI with NSC's /usr/local/bin/mpirun you can enable core dumps by supplying the option '-core' to mpirun:


/usr/local/bin/mpirun -core your_binary

Please don't use this option unless you really need the core dumps for debugging. When enabled, running a parallel application can quickly generate many core dumps, consuming a lot of disk space. Please refrain from using /home when this option is enabled!

Debugging

Several debuggers are available:

  • totalview is the most feature rich debugger with full GUI support for debugging of parallel programs. See below for a more detailed instruction.

  • pgdbg is the PGI source-level debugger for the Monolith cluster.

  • gdb is the GNU Debugger.

You can do live debugging or "post mortem" debugging with all debuggers. See "man pgdbg" and "man gdb" for more information.

Totalview debugger

The TotalView debugger is a source-level debugger with a graphical user interface and features for debugging distributed, multiprocess, and multithreaded programs. TotalView can be used to debug "live" programs as well as to do postmortem debugging on core files:


totalview [ filename [ corefile ]] [ options ] 

where "filename" specifies the name of an executable to be debugged and "corefile"specifies the name of a core file. The executable must be compiled with source line information (usually the -g compiler switch) in order to give full debug capabilities.

On Monolith please note the following:

  • Postmortem debug can be used from the login node.

  • Live debug is only supported for interactive jobs and requires a special procedure since X-forwarding is not supported through PBS:

    1. Start an interactive job, for example: "qsub -I -lwalltime=1:00:00,nodes=4". This will create an interactive job, and the environment variable PBS_NODEFILE will contain a list of the nodes your interactive job is running on.

    2. In a separate window, do a "ssh" to one of the nodes in the interactive job. From this window you can now start "totalview executable" and do a live debug.

Options to the totalview command are described in the TotalView User's Guide. Online documentation is located at http://www.etnus.com/Support/docs/

Vampir

VAMPIR is a graphical tool for analyzing the performance and message passing characteristics of parallel programs that use the MPI message passing library.

The VAMPIR package has two parts:

  1. Vampirtrace, which is a library that you link into your application. This produces a trace file.

  2. Vampir, which is used to analyze the trace file.

The full user documentation can be found at: /usr/local/tools/vampir/3.0/doc/

  • Vampir-userguide.pdf for Vampir.

  • Vampirtrace-userguide.pdf for Vampir-trace.

There are also man pages for Vampir and the Vampir-trace library routines.

Follow these steps to start using Vampir:

  1. Compile and link your MPI code for tracing. Be sure to obey the order of the various libraries: libVT.a must be linked before the MPI library.

     Examples of linking with Vampirtrace using Fortran:

     a) With ScaMPI and the Intel compiler:

        ifc code.f -L$PAL_ROOT/lib -I/opt/scali/include -L/opt/scali/lib -lfmpi -lVT -lmpi -lpthread -lPEPCF90 -ldwarf -lelf

     b) With MPICH and the Portland Group compiler:

        pgf90 simple.f -I/usr/local/mpich-1.2.4/pgi-4.0/include -L$PAL_ROOT/lib -L/usr/local/mpich-1.2.4/pgi-4.0/lib -lfmpich -lVT -lmpich -ldwarf -lelf

  2. Define the environment variable VT_PROGNAME to be the name of the executable.

  3. Run the executable as usual. In addition to the usual output, this will generate a VAMPIRtrace output file with the extension ".bvt".

  4. Analyze the resulting VAMPIRtrace output file by running "vampir", specifying the .bvt file: "vampir a.out.bvt".

A very good VAMPIR tutorial is available at http://www.arsc.edu/support/howtos/usingvampir.html

If you are using LAM MPI instead of ScaMPI or MPICH you need to run a different version of Vampir. By doing the commands: "module unload vampir; module load vampir/4.0.lam" you will get the Vampir version that supports LAM MPI.

Math libraries

Intel compilers

The Intel math kernel library "mkl" is recommended. The Math Kernel Library includes the following groups of routines:

  • Basic Linear Algebra Subprograms (BLAS):

    • vector operations

    • matrix-vector operations

    • matrix-matrix operations

  • Sparse BLAS (basic vector operations on sparse vectors)

  • Fast Fourier transform routines (with Fortran and C interfaces)

  • LAPACK routines for solving systems of linear equations

  • LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations

  • Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces).

Full documentation can be found at http://www.intel.com/software/products/mkl/

Directory Structure

mkl is located in $MKL_ROOT, defined at login. Semantically, MKL consists of two parts: LAPACK and processor-specific kernels. The LAPACK library contains LAPACK routines and drivers that were optimized without regard to processor, so that it can be used effectively on processors from Pentium to Pentium 4. The processor-specific kernels contain BLAS, FFTs, CBLAS, and VML optimized for the specific processor. Threading software is supplied as a separate dynamic link library (.so), libguide.so, when linking dynamically to MKL.

The information below indicates the library's directory structure.


mkl/lib

Contains all libraries


mkl/lib/32

Contains all libraries for 32-bit applications


libmkl_lapack.a

LAPACK routines and drivers


libmkl_def.a

default kernel (Pentium, Pentium Pro, Pentium II processors)


libmkl_p3.a

Pentium III processor kernel


libmkl_p4.a

Pentium 4 processor kernel


libmkl_lapack32.so

LAPACK routines and drivers, single precision data types


libmkl_lapack64.so

LAPACK routines and drivers, double precision data types


libmkl.so

library dispatcher for appropriate kernel loading


libmkl_def.so

default kernel (Pentium, Pentium Pro, Pentium II processors)


libmkl_p3.so

Pentium III processor kernel


libmkl_p4.so

Pentium 4 processor kernel


libvml.so

library dispatcher for appropriate VML kernel loading


libmkl_vml_def.so

VML part of default kernel (Pentium, Pentium Pro, Pentium II processors)


libmkl_vml_p3.so

VML part of Pentium III processor kernel


libmkl_vml_p4.so

VML part of Pentium 4 processor kernel


libguide.so

KAI threading software

Linking with MKL

To use LAPACK and BLAS software you must link two libraries: LAPACK and one of the processor specific kernels. Some possible variants:

a) LAPACK library, Pentium 4 processor kernel:

"ld myprog.o -L$MKL_ROOT -lmkl_lapack -lmkl_p4" 

b) Dynamic linking. The dispatcher library will load the appropriate dynamic kernel for the processor:

"ld myprog.o -L$MKL_ROOT -lmkl -lguide -lpthread"  

Using MKL Parallelism

The Math Kernel Library is threaded in a number of places: LAPACK (*GETRF, *POTRF, *GBTRF routines), Level 3 BLAS, and FFTs. MKL 5.2 uses KAI OpenMP threading software.

Setting the number of threads: the OpenMP runtime responds to the environment variable OMP_NUM_THREADS. To change the number of threads, enter the following in the command shell in which the program is going to run:

export OMP_NUM_THREADS=<number of threads to use>

If the variable OMP_NUM_THREADS is not set, MKL software will run on the number of threads equal to the number of processors. We recommend always setting OMP_NUM_THREADS.

KMP_STACK_SIZE environment variable should be set to 2m or more if MKL functions are called from OMP parallel regions.
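
In a batch script this is typically done just before the program is started, for example (the program name is a placeholder):

OMP_NUM_THREADS=2        # use both processors of the node for threaded MKL routines
export OMP_NUM_THREADS
./myprog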

Performance

To obtain the best performance with MKL, make sure the following conditions are fulfilled: arrays must be aligned on a 16-byte boundary, and the leading dimension values (n*element_size) of two-dimensional arrays must be divisible by 16. There are additional conditions for the FFT functions; see the full documentation for details.

PGI compilers

LAPACK and BLAS are included with the PGI software.

LAPACK: Link with "-llapack"

BLAS: Link with "-lblas"

The best performance, however, is obtained with the ATLAS-tuned BLAS libraries, which are available for C/C++ and Fortran.

Fortran: link with "-L/usr/local/lib -lf77blas -latlas"

C/C++: link with "-L/usr/local/lib -lcblas -latlas"

Porting code to/from Monolith

General

  1. Compile the code with modest optimization (-O0 or -O1).

  2. Execute the code.

  3. Verify correct answers.

  4. In case of problems, see the useful compiler options in the Fortran and C/C++ compiler chapters. The debugger is described in a previous chapter.

  5. Turn on more aggressive optimization (-O2, -fast, etc.).

  6. Common problems:

     a) Reading Fortran binary files from workstations (SGI, Sun, etc.). You need:

        PGI compilers: the "-byteswapio" option to swap bytes from big-endian to little-endian.

        Intel compilers: define the environment variable F_UFMTENDIAN to "big".

     b) Double precision uses 80 bits by default with the PGI compilers, which means that results can differ slightly from other platforms. You can change the precision to 64 bits yourself; see "/usr/include/fpu_control.h".

Data types and corresponding bitsizes

FORTRAN

   Type               Size (bits)
   REAL               32
   REAL*4             32
   REAL*8             64
   DOUBLE PRECISION   64
   COMPLEX            64
   COMPLEX*8          64
   COMPLEX*16         128
   INTEGER            32
   INTEGER*8          64
   LOGICAL*1          8
   LOGICAL*8          64

C/C++

   Data type          Size (bits)
   char               8
   short              16
   int                32
   long int           32
   float              32
   double             64
   long double        64
   pointer            32

Monolith documentation

There are various ways to get information about the system. The most important are:

1) NSC maintains a web-document for the cluster where you will find detailed information about the system as well as the most up to date information about the current status:

http://www.nsc.liu.se/monolith for a general description.

2) Portland Group Documentation: http://www.nsc.liu.se/pgi/

3) Intel compiler documentation. Intel Fortran/C Compiler User's Guide are available at:

- http://www.intel.com/software/products/compilers/flin/ for Fortran

- http://www.intel.com/software/products/compilers/clin/ for C/C++

4) Intel Math Kernel Library (mkl) documentation is available at: http://www.intel.com/software/products/mkl/

5) SCAMPI user documentation can be downloaded from http://www.scali.com/download/documents.html

6) The "man" and "apropos" utility. If you are uncertain about a command or function on the system, try these!





