Monolith User Guide
The Monolith cluster

The Linux cluster Monolith at NSC is built from 206 rack-mounted nodes. Each node is a PC with dual Intel 2.4 GHz Xeon processors and 2 GB of memory. Nodes are divided into login, service, storage and compute nodes. Currently there are 3 login nodes, 1 service node, 4 storage nodes and 198 compute nodes. Only the login nodes are connected to the outside world. Monolith has an internal 100 Mbps Ethernet, which is used for file transfer, transfer of control and user-level communication, and a high-bandwidth, low-latency SCI network, which is used for MPI communication.

The operating system is Linux. The distribution is Red Hat, currently version 7.3, and the currently running Linux kernel is 2.4. The following chapters summarize the usage of Monolith.

Key features of Monolith
System information

Login procedure

ssh to the login node that you have been assigned to, one of the following:
When you log in to Monolith, you end up on a front end node. Use this node when you compile and when you run very short non-parallel jobs (less than 1 minute). For computation, use the compute nodes through the batch system.

Accessing Nodes

For normal use, you never have to log in to the compute nodes. Most things can be done from the front end. Use "rlogin" to the compute nodes only when something cannot be handled from the front end, and log out as soon as you are done on a node.

A node has 2 processors, sharing memory and interconnect. Nodes are never shared between users; they are always allocated on a per-node basis. Even if you only want to use a single processor, you have to allocate one node and will be charged for both processors! Thus, running two single-processor runs in the same job is a better idea.

Monitoring the system

The current status of Monolith (running, reserved and idle nodes) is graphically displayed in real time at http://status.nsc.liu.se/monolith
Login shell

The following shells are available: sh, bash, csh, tcsh. To see which shell you are currently running, type 'echo $SHELL'. To change your default login shell, use the chsh command. Your default $PATH is initialized at login by the system so that most tools (compilers, debuggers and performance tools) are available without having to give the absolute path.

On line data storage

Three types of file systems can be used for file storage: /home, /disk/global, and /disk/local.

/home
/disk/global
/disk/local
Interactive usage

Interactive usage of the system can be done either on the front end node or through the batch system.
Backups

Backups of home directories are taken every night.

File transfers

Use scp or sftp to transfer files to and from the system.

Email list

NSC maintains an email list, monolith-users, that is open to all users at NSC. Each user is automatically enrolled. You can manage your account from the web page: http://www.nsc.liu.se/mailman/listinfo/monolith-users

Editors

vi and emacs are available.

The modules system for maintaining system software

The modules environment enables dynamic configuration of your environment; system and application software can be added, deleted or switched with one simple command. Currently the following default modules are assigned at login:
Type "module avail" to see all available modules and "module list" to see a list of all loaded modules. "module load module_name" will load the module "module_name". "module unload old_module_name; module load new_module_name" will switch modules. To automatically load a module that is not a default, put the module name in a file called ".modules" in your home directory.

Running jobs on Monolith

Batch queue system

The batch system on Monolith is PBS-Pro. General information about PBS is available through "man pbs". Use the batch queue system for all jobs, interactive as well as noninteractive.
As the picture above shows, the batch system has two major parts:
The next pages give condensed information on how to submit and monitor a batch job. The pages after that describe the various parts of the Monolith batch system in more detail.

Submitting a batch job

Batch jobs are submitted with the qsub command. PBS directives are specified either as comments in the script or as options to the qsub command. To run a job, the following parameters are required:
Maximal running (non-completed) jobs: 6912 CPU-hours

Jobs are automatically routed to the appropriate queue based on these parameters. Here is a sample PBS script for running an MPI job on 16 processors (8 nodes) and accounting the job on the SNAC project "SNIC 005/06-98":
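A minimal sketch of such a script, written as a shell fragment that generates the batch file. The nodes and ppn values follow from the two processors per node described earlier, and mpprun is the launcher named later in this guide. The file name job.sh, the program name ./a.out, the one hour wall-clock limit and the use of the PBS -A option for the project are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: generate a PBS batch script for an MPI job.
# Assumptions: file name job.sh, program ./a.out, a 1 hour wall-clock
# limit, and the project being passed with the PBS -A (account) option.
procs=16
nodes=$(( (procs + 1) / 2 ))      # Monolith nodes have 2 processors each
project="SNIC 005/06-98"

cat > job.sh <<EOF
#!/bin/sh
#PBS -l nodes=${nodes}:ppn=2
#PBS -l walltime=01:00:00
#PBS -A "${project}"

# Start in the directory from which the job was submitted
cd "\$PBS_O_WORKDIR"
mpprun ./a.out
EOF

echo "Wrote job.sh for ${procs} processors on ${nodes} nodes"
```

Rounding up in the node calculation mirrors the allocation rule: a request for an odd number of processors still occupies, and is charged for, whole nodes.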
Submit by doing "qsub batch-script". See "man qsub" for an explanation of the submit options used in the script. Queue limits are subject to change; please check the web for the latest information!

Monitoring your job

Scheduler commands:
PBS commands:
Interactive Access

You can run interactively (e.g. for debugging) by adding '-I' to the qsub command. If idle nodes are available and your request is within the limits, the scheduler will allocate the nodes and return a prompt to you on one of the allocated nodes. Example: Allocate two nodes (four processors) for one hour of interactive access, accounting the job on the SNAC project "SNIC 005/06-98":
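A request of this kind could be assembled as sketched below. The -I and -l options are standard qsub options; passing the project with -A is an assumption. The fragment only prints the command line so that it can be checked before use.

```shell
#!/bin/sh
# Sketch only: assemble the interactive request and print it for inspection.
# Assumption: the project is given with the PBS -A (account) option.
nodes=2
walltime="01:00:00"
project="SNIC 005/06-98"

cmd="qsub -I -l nodes=${nodes}:ppn=2,walltime=${walltime} -A \"${project}\""
echo "$cmd"
# On a Monolith login node, type the printed line to submit the request.
```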
To run your MPI program interactively, you can also use /usr/local/bin/mpirun directly from the front-end:
It automatically uses PBS to allocate an interactive job on NN processors for one hour. The terminal will retain the I/O to the job.

PBS environment variables

When the job starts, two environment variables assigned by PBS are of special interest:
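As an illustration, the fragment below uses PBS_O_WORKDIR and PBS_NODEFILE, two standard PBS variables (the latter is referred to again in the MPICH section). That these are the two variables meant here is an assumption. The helper name count_nodes is hypothetical, and the fallbacks only exist so that the fragment also runs outside a batch job.

```shell
#!/bin/sh
# Sketch only. PBS_O_WORKDIR is the directory qsub was run from, and
# PBS_NODEFILE names a file that lists the allocated nodes.

count_nodes() {
    # $1: node file; prints the number of distinct host names in it,
    # or 1 if no readable file was given
    if [ -n "$1" ] && [ -r "$1" ]; then
        sort -u "$1" | wc -l | tr -d ' '
    else
        echo 1
    fi
}

# Fall back to the current directory when not running under PBS
workdir=${PBS_O_WORKDIR:-$PWD}
cd "$workdir" || exit 1

nnodes=$(count_nodes "${PBS_NODEFILE:-}")
echo "Running in $workdir on $nnodes node(s)"
```

The node file may list a host once per requested processor, which is why duplicates are removed before counting.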
For more environment variables see the man page for "qsub".

Accessing the output of a running job

The batch queue system handles and keeps standard output (stdout) and standard error (stderr) from all jobs. When the job is finished, the output is delivered to the directory in which the job was started unless otherwise specified. With the command pbspeek you can take a peek at the output files of your own jobs even while they are still running.

Usage: pbspeek [-o|-e][-h] <jobid>
You can also explicitly redirect the output to a file with the ">" redirect symbol, e.g. "mpprun a.out > output" in the batch script.

Job cleanup

Automatic cleanup is performed when a job is finished. This includes killing all the user processes and removing everything from /disk/local on the compute nodes that participated in the batch job.

Saving /disk/local data after a job crash

To prevent losing data that is stored in /disk/local in the event of a job crash, the use of the PBS stage-out facility is recommended. Example of PBS stage-out facility:
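A sketch of what such a directive could look like, using the standard PBS form -W stageout=local_file@hostname:remote_file. The destination host name "frontend" is a placeholder, not a real Monolith host, and the fragment only prints the directive line that would go into the batch script.

```shell
#!/bin/sh
# Sketch only: build a PBS stage-out directive for two files.
# Assumption: "frontend" is a hypothetical destination host name.
dest_host="frontend"
stageout=""

for f in file1 file2; do
    pair="/disk/local/$f@$dest_host:/disk/global/my_user/$f"
    # Join the entries with commas, without a leading comma
    stageout="${stageout:+$stageout,}$pair"
done

echo "#PBS -W stageout=$stageout"
```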
In this example /disk/local/{file1,file2} will be copied to /disk/global/my_user/ when the job is finished (or aborted because it exceeds the time limit). There is a corresponding stage-in facility. There is more information in the man page for "qsub" on Monolith.

Bonus

The bonus system that NSC successfully uses on its other supercomputer systems to achieve a fair distribution of resources among users is also running on Monolith. Its purpose is to lower the priority of projects and users that have consumed their allotted time. Jobs from bonus users are only scheduled when there is no other, normal-priority job to run.

Frequently used PBS user commands:
Less frequently used PBS user commands:
For more information, please see the corresponding man page.

Maui Job Scheduler

The Maui Scheduler is used to schedule batch jobs. It creates advance reservations for jobs which are considered possible to run. This allows large jobs (many nodes) to start in a reasonable time and avoids starvation due to overtaking by smaller jobs (fewer nodes). Also, better control of quality of service is achieved, since the priority of a job has more impact in this reservation scheme than in other schedulers.

Commands for extracting information from the scheduler currently available to users:
Configuration

Currently, there is one run queue, "dque", which has limits of 144 hours of wall-clock time and 396 processors (198 nodes) per job. Every user who submits a job within these limits and is a member of a granted project will end up in this queue. Other users are routed to the queue "wait", which is stopped. Jobs from this queue can be started by the system administrator if needed.

To facilitate interactive development and achieve tolerable turnaround time for very short test runs, a standing advance reservation of 16 processors, 09:00 - 17:00, Monday - Friday, has been created. This reservation accepts jobs with limits less than or equal to 8 nodes and 1 hour.

No upper limit on the number of submitted jobs exists. Instead, limits set in the scheduler prohibit users from getting high priority due to extensive queueing times. Currently, the Maui scheduler performs a full scheduling cycle each minute. For more information about the user commands and job priority and rating, see http://www.nsc.liu.se/systems/monolith/maui.html

Programming environment

There are three compiler suites available:
The Intel or PGI suites are recommended. They generally produce more efficient code and are also better integrated with the MPI run time environment. Your default $PATH is initialized at login by the system so that most tools (compilers, debuggers and performance tools) are available without having to give the absolute path.

Intel compilers

There are two versions of the Intel compiler available: 7.1 and 8.0. The two versions are separate compilers that are incompatible and have different syntax. At login you are given the 7.1 version, which is the version with full support in terms of external libraries. The 8.0 version, however, generally results in better performance and also has an extended set of features. ScaMPI is supported for 8.0.

To move from the Intel 7.1 to the 8.0 environment, give the command sequence: module unload intel; module load intel/8.0

To move from the Intel 8.0 to the 7.1 environment, give the command sequence: module unload intel/8.0; module load intel

Intel 7.1 compiler, useful compiler options

Below are some useful compiler options; please do "man ifc" or "man icc" for more!

a) Optimisation

There are three different optimization levels in Intel's compilers:
A recommended flag for general code is -O2, and for best performance "-O3 -xW -tpp7", which will enable software vectorisation. As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.

b) Debugging
c) Profiling
d) Options that only apply to Fortran programs
e) Linking

"-xW" is required when linking object code that was compiled with that option. Other libraries that are not default but that you might need:
f) Large File Support

To read/write files larger than 2 GB you need to specify some flags at compilation:

Fortran: no additional flags needed.
C/C++: LFS is obtained by specifying the flags below when compiling and linking: -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE

g) Special MPI options

There are three command-line options (locally implemented, not mentioned in the man pages) to all of Intel's and PGI's compilers to make compiling and linking programs with MPI easier:
The options should be used both at compile time (to specify the path to the include files) and at link time (to specify the correct libraries). For more information about the MPI implementations and how to run MPI programs, see the "Parallelisation" section.

h) Miscellaneous options

Little endian to big endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When set, the following operations are done:
Examples:
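A few typical settings, sketched for a Bourne-type shell. The forms follow the F_UFMTENDIAN syntax documented by Intel (a global mode, optionally followed by per-unit exceptions). The unit numbers 8, 10 and 20 are arbitrary, and the three assignments are alternatives: only the last one is in effect when the fragment ends.

```shell
#!/bin/sh
# Sketch only: unit numbers 8, 10 and 20 are arbitrary examples.

# Alternative 1: treat unformatted files on all units as big endian
F_UFMTENDIAN=big

# Alternative 2: big endian on all units except unit 8, which stays
# little endian
F_UFMTENDIAN="big;little:8"

# Alternative 3: only units 10 and 20 are big endian; all other units
# keep the native little endian format
F_UFMTENDIAN=10,20

# Export the chosen setting so that the Fortran run time sees it
export F_UFMTENDIAN
echo "F_UFMTENDIAN=$F_UFMTENDIAN"
```

The variable is read when the program runs, so the same executable can be used with different settings without recompiling.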
For more options, please read the man page on the specific compiler on the system or read the Intel Fortran/C Compiler User's Guide that is available at
Intel 8.0 compiler, useful compiler options

Below are some useful compiler options; please do "man ifort" or "man icc" for more!

a) Optimisation

There are three different optimization levels in Intel's compilers:
A recommended flag for general code is -O2, and for best performance "-O3 -xW -tpp7", which will enable software vectorisation. As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.

b) Debugging
c) Profiling
d) Options that only apply to Fortran programs
f) Large File Support (LFS)

To read/write files larger than 2 GB you need to specify some flags at compilation:

Fortran: no additional flags needed.
C/C++: LFS is obtained by specifying the flags below when compiling and linking:
g) Special MPI options

There are three command-line options (locally implemented, not mentioned in the man pages) to all of Intel's and PGI's compilers to make compiling and linking programs with MPI easier:
The options should be used both at compile time (to specify the path to the include files) and at link time (to specify the correct libraries). For more information about the MPI implementations and how to run MPI programs, see the "Parallelisation" section.

h) Miscellaneous options

Little endian to big endian conversion in Fortran is done through the F_UFMTENDIAN environment variable. When set, the following operations are done:
Examples:
For more options, please read the man page on the specific compiler on the system or read the Intel Fortran/C Compiler User's Guide that is available at
PGI compilers, useful compiler options

Below are some useful compiler options; please do "man pgf90" or "man pgcc" for more!

a) Optimization

There are three different optimization levels in PGI's compilers:
A recommended flag for general code is -O2, and for best performance -fast, which is equivalent to "-O2 -Munroll -Mnoframe". As always, however, aggressive optimisation runs a higher risk of encountering compiler limitations.

b) Debugging
c) Porting

These options only apply to Fortran programs.
d) Profiling
e) Large File Support

To read/write files larger than 2 GB you need to specify some flags at compilation:

Fortran: add the flag "-Mlfs" to your compile and link command.
C/C++: LFS is obtained by specifying the flags below when compiling and linking:
f) Special MPI options

There are three command-line options (locally implemented, not mentioned in the man pages) to all of Intel's and PGI's compilers to make compiling and linking programs with MPI easier:
The options should be used both at compile time (to specify the path to the include files) and at link time (to specify the correct libraries). For more information about the MPI implementations and how to run MPI programs, see the "Parallelisation" section.

g) Miscellaneous options

The IA-32 architecture on Monolith uses 80-bit registers for floating-point operations. This extended format can lead to answers that, when rounded, do not match the expected result. The option -pc 64 can be used to explicitly set the precision to standard IEEE double precision using 64 bits. For more options, please read the man page on the specific compiler on the system or read the PGI User's Guide: http://www.nsc.liu.se/pgi/

Parallelization

Message passing: MPI and PVM are available on the system.

Message Passing Interface (MPI)

There are three different MPI implementations available on Monolith: ScaMPI, MPICH, and LAM. ScaMPI uses the high performance SCI network, while MPICH and LAM use Fast Ethernet.

Fast, easy-to-use, Guide for the Impatient
/usr/local/bin/mpirun is a locally supplied, generic script that finds out which MPI implementation is used and starts the appropriate daemons and monitors. To be sure it recognizes your program, please try:
on the command line on the front-end (i0) before you use it in a batch script. If /usr/local/bin/mpirun does not recognize your program or you want to start a different number of instances of your program, the following options are available:

Usage: mpirun [-h][-q][-np <procs> | -s][-Nscampi|-Nlam|-Nmpich|-Npvm] <program> ...
/usr/local/bin/mpirun can also be used directly from the front-end. See Interactive Access for more information.

Description of the Different MPI Implementations: ScaMPI, MPICH and LAM

If you need to use another compiler and/or you choose to use the options below, none of the options -Nscampi, -Nlam, or -Nmpich available to the Intel and PGI compilers can be used.

ScaMPI

To use the fast SCI network, you have to compile and link with ScaMPI, a proprietary MPI implementation from SCALI. It is installed in /opt/scali. Furthermore, you must use /opt/scali/bin/mpirun to start your application. To link against the MPI library from SCALI you should use one of the following lines:
Do not forget to include -lpthread when using ScaMPI. Even though the object files might link without errors, the resulting executable may hang when started.

MPICH

MPICH 1.2.4 is installed in /usr/local/mpich-1.2.4/<compiler>, where <compiler> stands for the compiler suite. There are three compiler versions available:
Choose the appropriate version depending on the compiler you use. The main compatibility difference is the number of suffix underscores in the identifiers of compiled Fortran code: GNU uses two underscores while Intel and PGI use only one. For general information on how to use MPICH, see http://www-unix.mcs.anl.gov/mpi/mpich
There are also man pages available on the Monolith system. To link against the MPICH MPI library you should use one of the following lines:
Include files are in /usr/local/mpich/include and in /usr/local/mpich/include/mpi2c++ (C++). To run your MPI(CH) program on e.g. 10 nodes do:
mpirun is modified and adapted to work properly in
The nodes used for programs started with mpirun are extracted from the file specified by the environment variable $PBS_NODEFILE. The option -np defaults to 1 if not given.

LAM

LAM 6.5.6 is installed in /usr/local/lam-6.5.6/<compiler>, where <compiler> stands for the compiler suite. There are three compiler versions available:
Choose the appropriate version depending on the compiler you use. The main compatibility difference is the number of suffix underscores in the identifiers of compiled Fortran code: GNU uses two underscores while Intel and PGI use only one. For general information on how to use LAM, see http://www.lam-mpi.org
There are also man pages available on the Monolith system. To link against the LAM MPI library you should use one of the following lines:
Include files are in /usr/local/lam/include and in /usr/local/lam/include/mpi2c++ (C++). Launching a LAM program requires a little more effort than launching an MPICH or ScaMPI program (unless you use the /usr/local/bin/mpprun described above). LAM requires daemons to be started on each node before the job is launched and stopped after the job has finished. Here is how it can be done (typically in a PBS script):
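A sketch of such a sequence, using the standard LAM commands lamboot and lamhalt together with LAM's own mpirun. The 5 second pauses, the process count and the program name ./a.out are assumptions, and the helper names run and lam_job are hypothetical. By default the fragment only prints the commands; set DRY_RUN=0 inside a real PBS job to execute them.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default here) prints each command instead
# of running it.
# Assumptions: 5 second pauses, 4 processes, program name ./a.out, and
# LAM's mpirun being the first mpirun found in $PATH.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

lam_job() {
    run lamboot "${PBS_NODEFILE:-nodefile}"   # start the LAM daemons on the allocated nodes
    run sleep 5                               # give the daemons time to start
    run mpirun -np 4 ./a.out                  # launch the MPI program
    run lamhalt                               # stop the LAM daemons
    run sleep 5                               # give the daemons time to stop
}

lam_job
```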
The two sleep commands are used to ensure that LAM manages to start/stop the daemons. Note that this is NOT the way NSC recommends launching LAM applications. The recommended way to launch any parallel application is using /usr/local/bin/mpprun.

Performance analysis and debugging

Profiling

If you use the PGI compilers, pgprof is a tool which analyzes data generated during execution of specially compiled programs (using the -Mprof=func or -Mprof=lines compiler command line options). Example: profile and analyze the executable a.out
See "man pgprof" for more details. For the Intel and GNU compilers, compiling with "-p" or "-pg" together with the "prof" or "gprof" utilities provides similar functionality.

Profiling an MPI application

There is no support for profiling MPI applications in the tools listed above. We are in the process of evaluating the Intel VTune profiler to see if it can provide this functionality. Meanwhile you need to view your application as separate processes, each generating its own profile output. Since all processes will use the same file name ("pgprof.out" or "gmon.out"), the best way to distinguish them is to run your application using only one processor per node (ppn:1) and from the /disk/local directory, thus providing a separate location for the output from each process. At the end of the batch script (after the execution), collect and rename the different output files, for example with the command:
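One possible way to do this is sketched below. It assumes that scp works between the nodes, that the profile file is gmon.out (pgprof.out for PGI) and that the program was run from /disk/local. The helper name collect_cmds is hypothetical. The function prints one copy command per allocated node so that the list can be checked before it is executed.

```shell
#!/bin/sh
# Sketch only: print one copy command per allocated node.
# Assumptions: scp works between nodes, the profile file is gmon.out,
# and the program was run from /disk/local on every node.
collect_cmds() {
    # $1: node file (one host name per line, duplicates allowed)
    sort -u "$1" | while read -r node; do
        echo "scp $node:/disk/local/gmon.out gmon.out.$node"
    done
}

collect_cmds "${PBS_NODEFILE:-/dev/null}"
# To perform the copies, pipe the output to sh:
#   collect_cmds "$PBS_NODEFILE" | sh
```

Appending the node name to each copy keeps the files apart once they are gathered in one directory.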
Core dump

When using ScaMPI with NSC's /usr/local/bin/mpirun you can now enable core dumps by supplying the option '-core' to mpirun:
Please don't use this option unless you really need the core dump for debugging. When it is enabled, running a parallel application can in a short time generate many core dumps, consuming a lot of disk space. Please refrain from using /home when this option is enabled!

Debugging

Several debuggers are available:
You can do live debugging or "post mortem" debugging with all debuggers. See "man pgdbg" and "man gdb" for more information.

Totalview debugger

The TotalView debugger is a source-level debugger with a graphical user interface and features for debugging distributed, multiprocess and multithreaded programs. TotalView can be used to debug "live" programs as well as for post mortem debugging of core files:
where "filename" specifies the name of an executable to be debugged and "corefile" specifies the name of a core file. The executable must be compiled with source line information (usually the -g compiler switch) in order to give full debug capabilities. On Monolith please note the following:
Options to the totalview command are described in the TotalView User's Guide. Online documentation is located at http://www.etnus.com/Support/docs/

Vampir

VAMPIR is a graphical tool for analyzing the performance and message passing characteristics of parallel programs that use the MPI message passing library. The VAMPIR package has two parts:
The full user documentation can be found at: /usr/local/tools/vampir/3.0/doc/
There are also man pages for Vampir and the Vampir-trace library routines. Follow these steps to start using Vampir:
A very good VAMPIR tutorial is available at http://www.arsc.edu/support/howtos/usingvampir.html

If you are using LAM MPI instead of ScaMPI or MPICH you need to run a different version of Vampir. The commands "module unload vampir; module load vampir/4.0.lam" give you the Vampir version that supports LAM MPI.

Math libraries

Intel compilers

The Intel math kernel library "mkl" is recommended. The Math Kernel Library includes the following groups of routines:
Full documentation can be found at http://www.intel.com/software/products/mkl/

Directory Structure

mkl is located in $MKL_ROOT, defined at login. Semantically, MKL consists of two parts: LAPACK and processor specific kernels. The LAPACK library contains LAPACK routines and drivers that were optimized without regard to processor, so that it can be used effectively on processors from Pentium to Pentium 4. The processor specific kernels contain BLAS, FFT, CBLAS and VML routines that were optimized for the specific processor. Threading software is supplied as a separate dynamic link library (so), libguide.so, when linking dynamically to MKL. The information below indicates the library's directory structure.
Linking with MKL

To use LAPACK and BLAS software you must link two libraries: LAPACK and one of the processor specific kernels. Some possible variants:

a) LAPACK library, Pentium 4 processor kernel: "ld myprog.o -L$MKL_ROOT -lmkl_lapack -lmkl_p4"
b) Dynamic linking. The DLL dispatcher will load the appropriate processor specific dynamic kernel: "ld myprog.o -L$MKL_ROOT -lmkl -lguide -lpthread"

Using MKL Parallelism

The Math Kernel Library is threaded in a number of places: LAPACK (*GETRF, *POTRF, *GBTRF routines), Level 3 BLAS, and FFTs. MKL 5.2 uses KAI OpenMP threading software.

Setting the number of threads: The OMP software responds to the environment variable OMP_NUM_THREADS. To change the number of threads, enter the following in the command shell in which the program is going to run:

export OMP_NUM_THREADS=<number of threads to use>

If the variable OMP_NUM_THREADS is not set, MKL will run on a number of threads equal to the number of processors. We recommend always setting OMP_NUM_THREADS. The KMP_STACK_SIZE environment variable should be set to 2m or more if MKL functions are called from OMP parallel regions.

Performance

To obtain the best performance with MKL, make sure the following conditions are fulfilled: arrays must be aligned on a 16-byte boundary, and leading dimension values (n*element_size) of two-dimensional arrays must be divisible by 16. There are additional conditions for the FFT functions; see the full documentation for details.

PGI compilers

LAPACK and BLAS are included with the PGI software.

LAPACK: Link with "-llapack"
BLAS: Link with "-lblas"

Best performance, however, is obtained with the ATLAS-tuned BLAS libraries, which are available for C/C++ and Fortran. Link with
" -L/usr/local/lib -lf77blas -latlas" for Fortran
" -L/usr/local/lib -lcblas -latlas" for C/C++

Porting code to/from Monolith

General
Data types and corresponding bit sizes

FORTRAN
C/C++
Monolith documentation

There are various ways to get information about the system. The most important are:

1) NSC maintains a web document for the cluster where you will find detailed information about the system as well as the most up-to-date information about the current status: http://www.nsc.liu.se/monolith for a general description.
2) Portland Group documentation: http://www.nsc.liu.se/pgi/
3) Intel compiler documentation. The Intel Fortran/C Compiler User's Guides are available at:
- http://www.intel.com/software/products/compilers/flin/ for Fortran
- http://www.intel.com/software/products/compilers/clin/ for C/C++
4) Intel Math Kernel Library (mkl) documentation is available at: http://www.intel.com/software/products/mkl/
5) ScaMPI user documentation can be downloaded from http://www.scali.com/download/documents.html
6) The "man" and "apropos" utilities. If you are uncertain about a command or function on the system, try these!