Triolith User Guide

1 About this document

This is the Triolith User Guide. It is written to be a standalone document, but if you don't find what you are looking for here you might also want to check the SNIC User Guide (for Kappa, Neolith and Matter) and other documentation available on the NSC web site.

You will also find much useful information in the Triolith application software documentation.

2 About Triolith

Triolith is a capability cluster with a total of 25600 cores and a theoretical peak performance of 450 Tflop/s.

It is equipped with a fast interconnect for high performance for parallel applications.

The operating system is CentOS 6.x x86_64.

Each of the 1600 HP SL230s compute nodes is equipped with two Intel E5-2660 (2.2 GHz Sandybridge) processors with 8 cores each, i.e 16 cores per node. Hyper-threading is not enabled.

56 of the compute nodes ("fat" nodes) have 128 GiB memory each and the remaining 1144 have 32 GiB each.

The fast interconnect is Infiniband from Mellanox (FDR IB, 56 Gb/s).

Benchmarks have shown that one Triolith core is about 3 to 4.5 times faster than one Neolith core on applications such as VASP.

Each compute node has a 500GB local hard disk, of which ~440GiB is available as temporary scratch space for user jobs.

All 1600 compute nodes are now installed and available to users.

Triolith is a SNIC-funded system, and computing time on Triolith is exclusively allocated to SNIC projects (Large, Medium and Small). See http://www.snic.vr.se/apply-for-resources and http://www.nsc.liu.se/start/apply/ for information on how to apply for a SNIC project.

The NSC Centre Storage file systems (/home and /nobackup/global) are available on Triolith. The contents of the /software directory from Neolith/Kappa/Matter are not available. Since most software will benefit significantly from being recompiled to use AVX, we felt that it was best to start from scratch with an empty software directory.

Job scheduling on Triolith is very similar to previous NSC systems (e.g Neolith). Triolith uses fairshare on the project level, i.e the more CPU time the project has used (as a percentage of its monthly allocation), the lower the priority of all queued jobs in the project will be.

3 Getting an account on Triolith

Everyone who is a member of a project that has been granted computing time on Triolith may apply for a login account.

The process for registering as a new NSC user, applying for membership in a project etc is described in detail on http://www.nsc.liu.se/start/apply/faq.html.

4 Accessing Triolith

Triolith is normally accessed using either Secure Shell (SSH) or ThinLinc (a remote desktop software). File transfers can be done using any method that supports SSH (e.g "scp") or SFTP.

Note: you must use SSH for your very first login to Triolith (when you change your initial temporary password).

To login to Triolith using SSH, connect to "triolith.nsc.liu.se" using your Triolith user name.

E.g ssh -X x_abcde@triolith.nsc.liu.se.

Please read this guide on security on NSC systems, it will show you how to keep your account and the rest of the system secure, and also how to use SSH logins as efficiently as possible (e.g not having to type your password all the time).

When you have received a username and a password from NSC, login to Triolith (triolith.nsc.liu.se) using SSH. SSH client software is available for most operating systems (e.g PuTTY for Windows, OpenSSH for Linux/MacOS). Remember to tell your SSH client to use your Triolith username.

4.1 Logging in for the first time

Please note that your first login to Triolith must be made using regular SSH, i.e not using an SFTP client or ThinLinc. Once you have changed your temporary password to a permanent one, you can log in using any method.

Example (using OpenSSH):

kronberg@ming:~$ ssh x_makro@triolith.nsc.liu.se
x_makro@triolith.nsc.liu.se's password: 
Last login: Thu Sep  8 12:09:17 2011 from ming.nsc.liu.se
[...Triolith welcome message and help text...]

This is the first time you log in to this NSC resource.
You will have to change the initial password we sent you.
Please choose a good, strong password that you do not use at
any other system.

Enter the initial password we sent you: <ENTER YOUR PASSWORD AGAIN HERE>
Enter your new password: <CHOOSE A NEW PASSWORD AND ENTER IT HERE>
Enter the new password again: <...AND AGAIN HERE>

Your password has been changed.

NOTE: Using key-based SSH login in the right way is both
more secure and more convenient than using passwords.
Read more at:
http://www.nsc.liu.se/support/userguides/remote/ssh.html

Press the RETURN key to log out --> 
Your user is now activated and you may log in.
Connection to triolith.nsc.liu.se closed.
kronberg@ming:~$ 

You can use any SSH client that supports the SSH protocol version 2 (all modern SSH clients should do this, e.g OpenSSH, PuTTY).

Note that the first time you log in you will need to change the temporary password we sent to you to a permanent one.

If you do not log in and set a permanent password within 7 days, your account will be locked. If this happens to you, contact support@nsc.liu.se to have it unlocked.

Choose a good password, and do not use that password for anything else except your Triolith account. If you want to change your password, use the "passwd" command on the login node.

4.2 If you cannot log in

If you forget your password, lose your SSH keys etc, please contact support@nsc.liu.se and we will help you get access to the system again.

4.3 Getting data to and from Triolith

You can transfer files to and from Triolith using e.g SFTP, scp and rsync (basically any method that uses SSH to move data).

4.3.1 scp

scp is a simple tool that is useful for copying a single file or a few files to or from a remote system. Example - copy a local file named local-file to your home directory on Triolith:

$ scp local-file username@triolith.nsc.liu.se:
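
To copy a file in the other direction, from Triolith to the current directory on your local computer (remote-file is a placeholder name):

$ scp username@triolith.nsc.liu.se:remote-file .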

4.3.2 sftp

sftp is an interactive file transfer program, similar to ftp. Example:

$ sftp username@triolith.nsc.liu.se:testdir
Connecting to triolith.nsc.liu.se...
Changing to: /home/username/testdir
sftp> ls
file-1  file-2
sftp> get file-2
Fetching /home/username/testdir/file-2 to file-2

There are also graphical SFTP clients available.

4.3.3 rsync

rsync is a file copying tool that can be used both locally and over the network. Its main advantages over scp and sftp are that it handles copying of whole directory trees well, and that rsync transfers can easily be restarted without having to re-transfer data. Example - copy the directory tree named local-tree to Triolith:

$ rsync -av local-tree username@triolith.nsc.liu.se:
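
Correspondingly, to download a directory tree from Triolith to your local computer (remote-tree is a placeholder name):

$ rsync -av username@triolith.nsc.liu.se:remote-tree .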

4.3.4 Swestore

If you need to transfer data to/from SweStore, please read the SNIC knowledge base docs.

4.4 Running graphical applications using SSH and X tunneling

Some applications on Triolith (e.g Matlab) have a graphical user interface. To be able to display windows from an application running on Triolith on your own computer, you need two things:

  1. An X server software installed on your computer.
    • If you run Linux, this is already taken care of
    • If you run MacOS, you might need to install and start X11.app which is included in MacOS but not always installed.
    • If you run Windows, you need to find a third-party X server software (e.g Xming), as this is not normally included in Windows. Ask your local system administrator.
  2. Enable X11 forwarding in your SSH client. This allows windows from Triolith to be displayed on your local computer. If you use OpenSSH this is done using the -X option to ssh, e.g ssh -X username@triolith.nsc.liu.se.

Note: For better performance when running graphical applications, we recommend using ThinLinc - a remote desktop/visualization software.

4.5 Running graphical applications using ThinLinc - a remote desktop/visualization software

ThinLinc is a remote desktop solution from Cendio Systems. See http://www.cendio.com/ for a complete description.

By running the X server on a server in the cluster (i.e closer to your application) and using an efficient method for delivering the image to your local computer (VNC-based), most graphical applications will run significantly better than when using X-forwarding tunneled through SSH.

ThinLinc can also make use of a graphics card in the ThinLinc server to provide hardware acceleration to OpenGL applications (e.g VMD, Maestro, Gaussview).

4.5.1 Why would you want to use ThinLinc?

Here are some use cases:

  1. Using accelerated OpenGL applications

    Perhaps you want to run a graphical user interface (GUI) that is using OpenGL (e.g VMD) to visualize data that is located on Triolith. Rather than moving a large amount of data to your local computer and visualize it there, you can run the GUI directly on Triolith and display the window on your computer with much better results (higher framerate etc) than using traditional X-windows tunneling through SSH.

  2. Modern GUIs that do not run well using X-forwarding

    Certain graphical user interfaces are implemented with no regard for performance when tunneled through SSH on a connection with high latency, and will be more or less unusable. Since ThinLinc presents a local X server to the application (with almost zero latency) and handles the transport of the graphics data transparently to the application, it can perform much better for these types of applications.

4.5.2 Installing the ThinLinc client and connecting to Triolith

The ThinLinc client can be downloaded for free from http://www.cendio.com/downloads/clients/. It is available for Windows, Mac OS X, Linux and Solaris.

To use ThinLinc to connect to Triolith:

  1. Download the client matching your local computer (i.e Windows, Linux, MacOS X or Solaris) and install it.
  2. Start the client
  3. Change the "Server" setting to "triolith-thinlinc.nsc.liu.se"
  4. Change the "Name" setting to your Triolith username (e.g x_abcde).
  5. You do not need to change any other settings
  6. Enter your Triolith password in the "Password" box
  7. Press the "Connect" button.

The first time you connect, you will get a message saying "The server's host key is not cached …". Verify that the server key for triolith-thinlinc.nsc.liu.se is "4d:66:25:46:82:a9:1c:bc:8c:04:77:b9:b0:6b:64:8b", then press Continue.

After a few seconds, a window with a simple desktop session in it will appear. From the Applications menu, start a Terminal Window. You are now logged in to Triolith and can submit jobs, start interactive sessions, start graphical interfaces as usual.

Please note that all Triolith applications are available on the ThinLinc server, not just the ones listed in the Applications menu.

To log out and end your session, click the green "running man" icon to the right of the Applications menu and select Logout.

The default session is a fullscreen session (will cover your entire screen). If this is not what you want, you can change it in the ThinLinc client settings. Click Options, select the Screen tab and deselect Full Screen Mode. You will then get a window with your Triolith desktop inside it, which you can resize to whatever size you want.

In most cases you also want to disable the session option "send system keys". This option is on by default, and it means that "system keys" (e.g Alt-Tab, Cmd-Tab etc) are sent to the ThinLinc server and not to your local computer while the ThinLinc session is running.

4.5.3 Using SSH public key authentication instead of password

If you use SSH public key authentication to log in to Triolith, do the following to use that method with ThinLinc as well:

  1. Start the ThinLinc client
  2. Click "Options"
  3. In the "Security" tab, Change "Password" to "Public key"
  4. Press OK
  5. The "Password" box has now changed to "Key". Click the browse button to the right of the Key field and select your SSH private key file (or enter the path to your key directly)
  6. Press the "Connect" button.
  7. Enter the passphrase for your SSH private key (if you don't have one, you really should…)

Note: you will need to enter your SSH key passphrase each time you log in. This is due to the ThinLinc client not being integrated with the Linux ssh-agent.

4.5.4 Using ssh-agent with ThinLinc (unsupported)

If you want to use SSH keys loaded into ssh-agent to connect to ThinLinc, you can do that by modifying your ThinLinc client.

The method described below is unsupported by Cendio and NSC. Use it at your own risk. However, it's unlikely that anything can go wrong that you cannot fix by reinstalling the ThinLinc client.

This has been tested on Ubuntu Linux running the ThinLinc client version 4.1.1, but it will probably work with other Linux distributions and ThinLinc client versions. Note: you will have to re-apply this fix every time you upgrade the ThinLinc client.

If you use this fix, the ThinLinc client will use any keys from your ssh-agent to log in regardless of whether you have specified password or public key login in the client settings. Recommendation: set client to use password and don't enter anything into the password box.

To apply the fix, run the following as root on your Linux client:

mv /opt/thinlinc/lib/tlclient/ssh /opt/thinlinc/lib/tlclient/ssh.real
cat > /opt/thinlinc/lib/tlclient/ssh <<"EOF"
#! /bin/sh
# SSH wrapper for ThinLinc client to enable using keys loaded into
# ssh-agent to be used
TLSSHARGS="$*"

# Only try to use this method if an ssh-agent is actually running
if [ "$SSH_AGENT_PID" != "" ] ; then
    # Restore the agent socket saved by the tlclient wrapper; export it so
    # that the real ssh client started below can see it.
    export SSH_AUTH_SOCK="$MY_SSH_AUTH_SOCK"
    if [ -S "$SSH_AUTH_SOCK" ] ; then
        TLSSHARGS="`echo "$*" | sed s/\-o\ PubkeyAuthentication=no//`"
    fi
fi

# Run the real SSH client
exec $0.real $TLSSHARGS
EOF
chmod 755 /opt/thinlinc/lib/tlclient/ssh
mv /opt/thinlinc/bin/tlclient /opt/thinlinc/bin/tlclient.real

cat > /opt/thinlinc/bin/tlclient <<"EOF"
#!/bin/bash
# ThinLinc client wrapper to enable using SSH keys loaded into
# ssh-agent for authentication.
#
# Since ThinLinc will unset $SSH_AUTH_SOCK, we need to save the value
# of it into another variable.
export MY_SSH_AUTH_SOCK="$SSH_AUTH_SOCK"

# Launch the real ThinLinc client
exec $0.real $*
EOF
chmod 755 /opt/thinlinc/bin/tlclient

4.5.5 Running accelerated OpenGL applications

In order to make use of hardware-accelerated OpenGL, the application needs to be launched in a certain way.

Some applications have already been modified to do this automatically. The applications listed below will automatically be accelerated when run from ThinLinc, so you just need to start them manually.

  • GaussView (e.g "module add gaussview/5.0.9; gv")
  • VMD (e.g "module add vmd/1.9.1; vmd")
  • Maestro (e.g "module add schrodinger/2012u1-nsc; maestro")
  • VESTA (e.g "module add vesta/3.1.3; vesta")

All other OpenGL applications need to be launched using "vglrun", e.g "vglrun SOME_OPENGL_APPLICATION".

4.5.6 ThinLinc sessions

If you close your ThinLinc client or explicitly disconnect, your session on Triolith will still be running, and you will automatically be reconnected to that session the next time you login to ThinLinc.

If you will not be using ThinLinc for a few days, we recommend logging out (using the green "running man" logout icon in the ThinLinc desktop).

NSC reserves the right to log out sessions that have been idle for a significant time. Also, if a login node is rebooted, all ThinLinc sessions on that node will be logged out.

If you have no current ThinLinc session, your next one will be on the login node with the lowest load. You can not control which server your next session will use. If you need to access the other login node, login to it using SSH (e.g "ssh triolith2").

5 Sharing Triolith with others

5.1 Using the login nodes

When you first login to Triolith (triolith.nsc.liu.se), you reach the "login node" (hostname "triolith1"). This is just a small part of Triolith: a single Linux server that serves as Triolith's connection to the outside world.

Triolith actually has two login nodes. They are identical (128GB RAM and 16 CPU cores of the same type as the compute nodes).

The first one can be reached at the addresses triolith.nsc.liu.se and triolith1.nsc.liu.se. The second one can be reached at triolith2.nsc.liu.se.

During future system upgrades, we will try to always upgrade the login nodes one at a time, to ensure that one of them is always available. You can also try the second login node if the first should fail (e.g due to a hardware failure or running out of memory).

It is important to know that a login node is a resource that is shared with all other Triolith users, and if it is slow or crashes all Triolith users are affected. For this reason we do not allow you to run anything but the most essential things on the login node.

On the login node, you are permitted to:

  • Run file transfers to and from Triolith
  • Manage your files on Triolith (copy, edit, delete files etc)
  • Submit batch and interactive jobs (more about that later)
  • Run small applications if you are certain that they will not use large amounts of memory or CPU. As a guideline, anything using more than 1GB of RAM or that runs on more than one CPU core should probably not be run on the login node. If you are unsure, please contact the Triolith support team (support@nsc.liu.se) and discuss if what you need to do is suitable for the login node.

Anything not permitted to run on the login node should be run on one or more of the compute nodes as an interactive or batch job.

5.2 Interactive jobs

An interactive job is what you use if you "just want to run an application on a compute node". This is what happens under the hood when you use the "interactive" command:

  1. You run "interactive", usually with some extra options to use non-default settings, e.g to request more memory or more CPU cores.
  2. The scheduling system puts your request in the queue, waiting for resources (CPU, memory or a certain node type) to become available.
  3. You wait for the job to start.
  4. The scheduling system starts your job on a suitable compute node, and reserves the amount of memory and CPU cores you requested.
  5. You are automatically logged in to the compute node and can start working.

If your interactive session has not started after 30 seconds, all resources on Triolith are probably already in use and you will have to wait in the queue. You can check the queue status by logging in to Triolith again in another window and using the "squeue" command.

Hint: for small and short interactive sessions, use the nodes reserved for development, e.g: interactive -N1 -t 00:30:00 --reservation=devel

Example interactive session (here I reserve 1 node exclusively for my job for 4 hours):

[kronberg@triolith1 ~]$ interactive -N1 --exclusive -t 4:00:00
Waiting for JOBID 38222 to start
[kronberg@n76 ~]$ module add matlab/R2012a
[kronberg@n76 ~]$ matlab &

[...using Matlab for an hour or two...]

[kronberg@n76 ~]$ exit
[kronberg@triolith1 ~]$

Remember to end your interactive session by typing "exit". When you do that, the node(s) you reserved are released and become available to other users.

Note: the "interactive" command takes the same options as "sbatch", so you can read the sbatch man page to find out all the options that can be used. The most common ones are:

-t HH:MM:SS
choose for how long you want to reserve resources. The default value is 2 hours and the maximum is three days (72h). Choose a reasonable value! If everyone always requests the maximum allowed time, it becomes very difficult to estimate when new jobs can start, and if you forget to end your interactive session, resources will be unavailable to other users until the limit is reached.
-N X --exclusive
reserve X whole nodes
-n X
reserve X CPU cores
--mem X
reserve X megabytes of memory
--reservation=devel
use one of the nodes reserved for short test and development jobs
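
For example, to request four cores and 8000 MB of memory for one hour (an illustrative combination of the options above):

interactive -n4 --mem=8000 -t 01:00:00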

Hint: It is possible to run several terminals "inside" your interactive shell in a way that still stays inside the job. Since the interactive shell is implemented using "screen" (a terminal window multiplexer) you can use all screen features (see the screen man page or the table below).

Table 1: Some common screen commands (read "man screen" for more information):
Command     What it does
Ctrl-a c    Create a new terminal inside screen
Ctrl-a w    List the terminals inside this screen
Ctrl-a "    List the terminals inside this screen as a menu
Ctrl-a K    Close the current terminal
Ctrl-a n    Go to the next terminal
Ctrl-a A    Name the current terminal
Ctrl-a h    Write terminal contents to file ("screendump")
Ctrl-a H    Start/stop logging of terminal to file

5.3 Batch jobs

A batch job is a non-interactive (no user input is possible) job. What happens during the batch job is controlled by the job script (sometimes known as "submit script").

Preparing a batch job:

  1. Copy any needed input files to Triolith.
  2. Write the job script (some examples are included below)

Submitting a batch job:

  1. Load any modules needed to run your job. The environment in the shell where you run "sbatch" will be saved and recreated when starting the job. This includes the current working directory. You can also place the "module load" commands in your job script, and they will then be run automatically when the job starts.
  2. Submit the job to the queue (e.g "sbatch myjob.sh")
    • Job options (e.g amount of memory reserved, number of CPU cores reserved, maximum wall time etc) can either be set in the job script (by adding "#SBATCH <options>" lines) or by giving the same options to sbatch. You can put options in both locations. If an option is present in both places, the sbatch option is used.
    • The environment (current directory, loaded modules, $PATH and other environment variables) is recorded by sbatch and will be restored when the job starts.
  3. The job is now in the queue. How long it will stay there until it is started depends on the priority of your project, what other jobs are in the queue and what new jobs are submitted while your job is waiting in the queue.

Monitoring a batch job:

  • You can monitor all your jobs, both batch and interactive, using the "squeue" command (e.g squeue -u $USER to see your jobs).
  • If you want to cancel (end) a queued or running job, use the "scancel" command and provide the job ID (e.g "scancel 12345").
  • When the job has started, the standard output and standard error from the job script (which will contain output from your application if you have not redirected it elsewhere) will be written to a file named slurm-NNNNN.out in the directory where you submitted the job (NNNNN is replaced with the job ID).

What happens when a job starts?

  1. The environment (current working directory and environment variables such as $PATH) that were set when you submitted the job are recreated on the node where the job will be started.
  2. The job script starts executing on the first node allocated to the job. If you have requested more than one node, your job script is responsible for starting your processes on all nodes in the job, e.g by using srun, ssh or an MPI launcher.
  3. The job ends when your job script ends. All processes started by the job will be terminated if they are still running. The resources allocated to the job are now free to use for other jobs.
    • Note: if you run applications in the background ("application &") from your job script, you have to make sure that the job script does not end until all background applications have ended. This can be accomplished by adding a "wait" line to the script: "wait" will cause the script to stop executing on that line until all background applications have finished (see the sketch after this list).
    • Note: if your job runs for longer than the time you requested (sbatch -t HH:MM:SS), the job will be killed automatically.
  4. You can now fetch the output files generated by your job.
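
A minimal sketch of a job script that runs two single-core tasks in the background and uses "wait" as described above ("mytask" and the input file names are placeholders):

#!/bin/bash
#SBATCH -t 01:00:00
#SBATCH -n 2
#
# Start two tasks in the background, then wait for both to finish so
# that the job script (and therefore the job) does not end too early.
./mytask input1.dat &
./mytask input2.dat &
wait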

Sample job script: run an MPI application on two nodes (32 cores)

#!/bin/bash
#
#SBATCH -J myjobname
#SBATCH -t 00:30:00
#SBATCH --mem=6000
#SBATCH -N 2
#SBATCH --exclusive
#
mpprun ./mympiapp

# Script ends here

Sample job script: run a single-threaded application on a single core and 2GB RAM (the node will be shared with other jobs). Also send email when the job starts and ends.

#!/bin/bash
#
#SBATCH -J myjobname
#SBATCH -t 00:30:00
#SBATCH --mem=2000
#SBATCH -n 1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=your@email.addr.es
#
# Run a single task in the foreground.
./myapp --flags
#
# Script ends here

5.4 Logins outside the interactive/batch system

In order to allow you to monitor and debug running jobs, you can login to a compute node directly from the login node provided that you have an active job running on that node.

(If you try to login to a compute node where you do not have a job running you will get the error message "Access denied: user x_XXXXX (uid=NNNN) has no active jobs".)

This feature is only intended for monitoring and debugging running jobs! Do not start any compute jobs from this type of "direct" login! If you do, you circumvent the normal limitations on job length, memory use etc, and you will likely cause problems for other users (e.g causing the node to run out of memory and stop working).

To use this feature, find out which node your job is using (use e.g squeue -u $USER), then run e.g ssh n123 from the login node to login to that compute node. You can then use normal Unix tools like "top" and "ps" to monitor your job.

[x_makro@triolith1 ~]$ ssh n123
Access denied: user x_makro (uid=3375) has no active jobs.
Connection closed by 192.168.192.2
[x_makro@triolith1 ~]$ squeue -u x_makro
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  48584  triolith _interac  x_makro   R       0:09      1 n234
[x_makro@triolith1 ~]$ ssh n234
Last login: Tue Jan 17 11:56:44 2012 from l1
[x_makro@n234 ~]$ top
...

6 Modifying existing job scripts for Triolith

You should check your existing job scripts before trying to use them on Triolith. Here are some of the changes you might have to make to convert job scripts from e.g Neolith or Kappa to Triolith:

6.1 16 cores per node

If you have hard-coded the number of cores per node, please note that Triolith has 16 cores per compute node, not 8 as on Neolith/Kappa/Matter.

NOTE: It is not recommended to hard-code the number of cores in this way. It will break your jobs if you run on e.g the "huge" Kappa nodes. Please use the relevant SLURM environment variables instead, e.g SLURM_JOB_CPUS_ON_NODE. For more information, read the sbatch man page.
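
For example, instead of hard-coding 16, a job script can pass the number of cores actually allocated on the node to the application ("myapp" and its --threads option are placeholders, not a real program):

./myapp --threads=$SLURM_JOB_CPUS_ON_NODE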

6.2 Change /scratch/local to $SNIC_TMP

There is still a local scratch disk available on each node, but you can no longer write files directly to /scratch/local. Instead use the environment variable $SNIC_TMP, which will be set to a directory that will be created for each job (and deleted when the job ends).

E.g: if your job script looks like this

#!/bin/bash
#SBATCH -t 00:10:00
#
./myapp --tempdir=/scratch/local

then change it to

#!/bin/bash
#SBATCH -t 00:10:00
#
./myapp --tempdir=$SNIC_TMP

Note: --tempdir is just an option used for illustration in the example above, not a magic option that you can use for any application to control where it writes temporary files!

This change is being done to enable sharing of nodes between jobs.

By the way: there are other standardized SNIC_* variables set that you can use to make your job script portable across all SNIC HPC sites:

Table 2: SNIC standardized environment variables
SNIC_SITE      The SNIC site you are running at. On Triolith: nsc
SNIC_RESOURCE  The compute resource at the SNIC site $SNIC_SITE (see above) you are running at. On Triolith: triolith
SNIC_BACKUP    Shared directory with tape backup. On Triolith: /home/$USERNAME
SNIC_NOBACKUP  Shared directory without tape backup. On Triolith: /nobackup/global/$USERNAME
SNIC_TMP       Recommended directory for best performance during a job (local disk on nodes if applicable). On Triolith: set for each individual job, e.g /scratch/local/12345
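
As an illustration, here is a minimal sketch of a job script that uses some of these variables to stage data through the node-local scratch directory ("myapp" and the file names are placeholders):

#!/bin/bash
#SBATCH -t 01:00:00
#SBATCH -N 1
#SBATCH --exclusive
#
# Copy the input to node-local scratch, run there, and save the results
# to shared storage before the job ends ($SNIC_TMP is deleted at job end).
cp input.dat $SNIC_TMP/
cd $SNIC_TMP
~/bin/myapp input.dat > results.dat
cp results.dat $SNIC_NOBACKUP/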

6.3 Jobs will require less wall time

If you move existing jobs from Neolith, Kappa or Matter, the actual run time will almost certainly be shorter if you run them on the same number of nodes.

You can either reduce the number of nodes or request a shorter wall time to accommodate this.

6.4 Applications not installed yet or in another location

Some applications that are installed on Neolith/Kappa/Matter are not yet available on Triolith. Before submitting a job, check that the application is available.

Even if the application is available, the version on Triolith might be different, so you might need to use a different path in your job script. Some module names might also have changed.

If you are missing an application on Triolith, please contact support@nsc.liu.se and ask us to install it.

6.5 The "module" command is now available in job scripts

Sometimes it can be useful to use the "module" command in job scripts. By doing this, you do not have to remember to load certain modules before submitting the job.

Example:

#!/bin/bash 
  
#SBATCH --time=10:00:00
#SBATCH --nodes=2
#SBATCH --exclusive

module load someapp/1.2.3

someapp my_input_file.dat

If you write your job scripts using /bin/bash, /bin/csh or /bin/tcsh, the "module" command is automatically available from your script.

If you use /bin/ksh or /bin/sh, you need to add the line ". /etc/profile.d/cmod.sh" to your script in order to enable "module":

#!/bin/sh 
  
#SBATCH --time=10:00:00
#SBATCH --nodes=2
#SBATCH --exclusive

. /etc/profile.d/cmod.sh
module load someapp/1.2.3

someapp my_input_file.dat

7 Differences in /software and module files

On Triolith, we are doing several changes in how we manage the software installed on the system.

  1. There is now a web page with updated, automatically generated documentation of all applications installed in /software. This documentation is generated from files named README.NSC in the /software directory tree. These files are text files. Try e.g "less /software/apps/vasp/README.NSC".
  2. In the documentation, NSC will clearly state the level of support we provide for the application (e.g has it just been compiled and not tested, or have we run extensive tests and benchmarks?). In the same place, we will also provide notes on how to use the application (e.g sample job scripts).
  3. There will be no default versions of software modules. Background: on earlier systems, NSC provided a default version for most applications. If you ran "module add intel" you got the version of the Intel compilers that we considered the best. However, that version would be changed from time to time, and the behaviour of the application would then change without the user being aware of it. On Triolith, we will only recommend a certain version; it will be up to the user to decide if and when to change the version used. E.g "module add intel" will now display a message stating what the recommended version is, and the user can then load it using e.g "module load intel/12.1.4". In order to prevent users from running very old or broken versions, we will add a warning to certain old modules urging you to stop using them. But the ultimate decision is up to the user; apart from the warning message, these deprecated modules will continue working.
  4. No compilers are loaded when you login to the system. If you do "module load build-environment/nsc-recommended", you will get the recommended versions of the Intel compilers (icc, ifort etc), Intel MPI and Intel MKL loaded. If you want to access the GCC compiler that is bundled with CentOS, you will have to load the gcc/4-centos6 module. The thinking behind "build-environment/nsc-recommended" is that it will contain a core set of development tools that we have tested together (see the example after this list).
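
A minimal illustrative session (the module names are those mentioned above; run "module avail" to see what is actually installed):

module load build-environment/nsc-recommended   # recommended Intel compilers, Intel MPI and MKL
module load gcc/4-centos6                       # the GCC bundled with CentOS, if you need it
module list                                     # show which modules are currently loaded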

8 Software documentation

A fairly complete list of application software installed on Triolith can be found here, along with documentation how to use the software, example batch scripts etc.

9 Software installation policy

If you need to run a particular application and you find that it is not available on Triolith, please contact support@nsc.liu.se.

You are allowed to install software in your home or /nobackup directory, provided that you have a license for the software (if needed) that allows for use on NSC systems.

NSC will provide you with a reasonable amount of help when you are installing your own software. Contact support@nsc.liu.se if you run into problems.

NSC will install and maintain software in a central location if we believe that it will be useful for many of our users, or if the effort to install it is small (e.g prepackaged software available in CentOS or the EPEL package repository).

9.1 Commercial software that you already have a license for

If your license allows you to use the software on NSC's systems, and it is technically possible to use the license from NSC systems (i.e if there is a license server, you must be able to connect to it from Triolith), you can install and run the software on our systems.

9.2 Commercial software that you do not have a license for

NSC will in some cases buy commercial software for use on our systems if we consider the value of that software to be high compared to the cost. Contact support@nsc.liu.se and propose that we buy the application.

If NSC decides not to pay for the software, you or your research group can still pay for it and run it on NSC systems (if the license allows for that).

10 Scheduling policy

Triolith uses a fairshare scheduler operating on the project level, i.e the more CPU time the project has used (as a percentage of its monthly allocation), the lower the priority of all queued jobs in the project will be.

The maximum wall time limit for a job is 7 days (-t 7-00:00:00). The default wall time limit for a job is 2 hours.

The maximum wall time limit for a job on a node reserved for development and testing (--reservation=devel) is 1 hour. There are currently eight nodes reserved for test and development jobs.

Please note that the test and development nodes are expected to be available with little or no queue time to all users, so it is not acceptable for a single user to use all of them. Please use common sense.

10.1 Fairshare scheduling on Triolith

10.1.1 How does fairshare scheduling work on Triolith?

Fairshare scheduling on Triolith attempts to give each project (not user) a "fair share" of the available computing time of the system over time.

A "fair share" is not an equal share. A project's share is the time SNIC allocated to the project (e.g 100000 core hours per month) divided by the total capacity of the system (13.8 million core hours per month).

A project that makes a reasonable effort to use its allocated time (i.e does not wait until the last day of the month before running anything) can expect to be able to run approximately as many core hours as allocated by SNIC, or more.

The fairshare scheduler tries to achieve this by adjusting the priority of queued jobs. Since the queue is continuously re-sorted by priority, this generally results in short queue times for jobs submitted by projects with high priority, and long queue times for jobs submitted by projects with low priority.

The priority of a queued job is determined by how much the project has run recently compared to its allocation. The higher the usage is (as a percentage of the allocation), the lower the priority is.

There is no limit on how much a project can run in a month. But the more you run, the lower your priority will be, so the harder it will be to run the next job.

If you are interested in the gory details of how this is implemented: we use the SLURM multifactor plugin (https://computing.llnl.gov/linux/slurm/priority_multifactor.html)

The running configuration settings for the multifactor plugin can be seen by running "scontrol show config". As of 2012-10-24, the most interesting ones are:

PriorityDecayHalfLife   = 21-00:00:00
PriorityWeightFairShare = 1000000
PriorityWeightAge       = 1000

This means that the job priority is almost entirely determined by the FairShare of the project. The PriorityWeightAge is so much smaller that the age of a job will never affect the ordering of projects, only the ordering of jobs belonging to the same project.

10.1.2 Backfill - unfair handling of small jobs!

In addition to the main fairshare scheduler, which will always try to start the highest priority job, we also use a backfill scheduler.

Backfilling is the process of scheduling jobs into the holes created from large jobs that are waiting for nodes to become available.

If there are idle nodes available and a lower priority job can be started without affecting the start time of the highest priority job, the lower priority job is started. If more than one low-priority job could be started using backfill, the highest priority one is selected.

10.1.3 How can I adapt the scheduling to my workflow?

An example of how you can make sure you get the scheduling needed for your workflow:

Group A: is allocated 50000 core hours per month. They only care about getting as much work done as possible, so they submit many jobs and make sure that some are always waiting in the queue. This group might be able to run significantly more than 50000 core hours per month, but their queue priority will be low, and each job will on average wait a long time in the queue.

Group B: is allocated 50000 core hours per month. They need to run a limited number of jobs, but need their jobs to start quickly. As long as they run significantly less than their allocation per month (say 30000 hours), they will have a high priority, and their jobs will start immediately or as soon as nodes become available.

Note that the scheduler responds fairly slowly to changes in behaviour. If you change your behaviour (e.g stop running jobs) it will still take days or a few weeks until the full effect of this is seen in your queue waiting times.

10.1.4 Why fairshare scheduling on Triolith?

SNIC allocates a certain number of "core hours" to each project that is allocated time on Triolith. SNIC also wants utilization of its systems to be high.

NSC has decided to use fairshare scheduling on Triolith (and our other academic systems) because we believe it is the best way to share the system "fairly" (guided by how much time a project was allocated by SNIC) in a way that keeps utilization high, while still allowing different research groups to use the system in different ways that suits their workflow.

10.2 Idle nodes but your job won't start?

Sometimes you might see many idle nodes but your job still won't start. Here are some common reasons for this:

  1. The system is gathering nodes for a wide job. If a wide (i.e needs many nodes) job is the highest priority job, but there are not enough idle nodes to start it on, the scheduler will need to wait until enough jobs have ended to have the required number of idle nodes. I.e if a 128 node job is waiting to be started, you might see 127 idle nodes, and your job would still not be started (unless it was short enough to be run using backfill).
  2. Nodes are reserved. Sometimes compute nodes are reserved for a particular purpose, and not available to normal jobs. You can view all reservations using the command "scontrol show reservations".
  3. A scheduled service stop is coming up. When we need to perform maintenance on the system, we notify users via email and then reserve all compute nodes from a particular date and time. When this time is approaching, jobs will not be started if they cannot finish before the service stop reservation starts. E.g if the service stop starts Monday at 08:00, on Saturday at 08:00, only jobs with a wall time limit of less than 48 hours will be started.
  4. The "MAXPS" limit. In order to prevent a single project from using a very large part of the system, there is a hard limit on how much outstanding work a project may have running at any one time. The amount of outstanding work is defined as the sum of (number_of_cores * remaining runtime) for all running jobs in the project. The amount of outstanding work is limited to the monthly allocation of the project. E.g a project that is allocated 100000 core hours per month can start 130 single-node jobs with 48h walltime (100000 core hours / (48 h * 16 cores/node)). If your project has hit this limit your jobs will be shown by squeue with "Reason" set to "AssociationResourceLimit".
  5. Too-long walltime. If your job requests more than the allowed wall time limit, it will not start. You will also get an email notifying you of this. The job will be shown by squeue as "PartitionTimeLimit".
  6. Your project has expired. If your project's allocation has ended, your running jobs will finish but no new ones will start. The command "projinfo" will show a "Current allocation" of "-" for that project.

11 Running jobs

11.1 Node sharing is enabled!

Node sharing is enabled on Triolith. This means that jobs that request less than a full node (e.g sbatch -n2) might share that node with other jobs.

On Triolith a job will be allocated only the resources that it actually requests. If you request one core you will get one core, etc. On Neolith, Kappa and Matter your job would always get a complete node, even if you only requested a single CPU core.

If you want to ensure that your job always gets whole nodes, add the flag --exclusive to sbatch/interactive/salloc/srun. You will then get the same behaviour as on Neolith/Kappa/Matter. Note that we sometimes automatically add --exclusive in order to be compatible with older job scripts, see below.

Why node sharing? Some reasons:

  1. Running single-core jobs becomes easier. There is no longer any need to package those jobs into bigger packages that use a whole node. You can submit each single-core task as a separate job and let the scheduling system figure out which ones run together.
  2. Triolith has 16 cores per node. Some applications cannot utilize 16 cores. Without node sharing, the unused cores would be wasted. One example of this is development and testing jobs. If test jobs only use one core instead of a whole node, many more users can share a single development node.
  3. "Fat" nodes (with lots of RAM) can be shared between jobs. E.g two 64GB jobs can fit into one 128GB "fat" node.

11.1.1 Backwards compatibility

In order to avoid causing problems for users who are used to Neolith-style behaviour (i.e no node sharing), we have added a few extra rules to the scheduler configuration:

Note that these are just fallback mechanisms, we recommend that you always specify exactly what resources you want (e.g -N2 --exclusive).

  • If you specify -N or --nodes but not -n or --ntasks, the system will automatically add --exclusive.
  • If you request more than one node's worth of cores (e.g -n22), the system will automatically add --exclusive.

11.2 Examples

(In the examples below we use "interactive", but the same options can be used for sbatch, srun and salloc).

11.2.1 Without sharing nodes

To get "Neolith-style" behaviour, add --exclusive.

Requesting two full nodes (32 cores) for 24 hours:

interactive -N2 --exclusive -t 24:00:00

Requesting 32 cores (2 nodes) for 24 hours:

interactive -n32 --exclusive -t 24:00:00

Request 2 full nodes, but tell SLURM to only launch one task per node (e.g for a hybrid MPI/OpenMP application):

interactive -N2 --exclusive --cpus-per-task=16 -t 24:10:00

To use the development nodes that are reserved for short development and testing jobs, add --reservation=devel and request a walltime of less than one hour.

One development node for interactive use for 10 minutes

interactive -N1 --exclusive --reservation=devel -t 00:10:00

11.2.2 With node sharing

Request 4 cores for 24h, allow node sharing:

interactive -n4 -t 24:00:00

One CPU core on a development node for 10 minutes

interactive -n1 --reservation=devel -t 00:10:00

You can also request a certain amount of RAM:

One CPU core and 16GB RAM for 10 minutes on a development node:

interactive -n1 --mem=16000 --reservation=devel -t 00:10:00

11.3 Using the "fat" nodes

If you need more than 32GiB RAM per node, you can request that your job be run on the "fat" nodes, which have 128GiB RAM, by adding -C fat or by specifying how much RAM you need with --mem.

Please note that this will usually give you longer waiting times in the queue, since there are only 56 "fat" nodes in the system.
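
For example, to request one "fat" node exclusively for four hours:

interactive -N1 --exclusive -C fat -t 4:00:00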

11.4 Node sharing limitations

Currently there is no quota on the local scratch disk (/scratch/local), so if your job uses local scratch ($SNIC_TMP) and you want to make sure that no other job can use up all the space there, always use --exclusive. We will probably develop a solution for this.

11.5 Monitoring your jobs

The usual SLURM commands are available, e.g "sinfo", "squeue".

To cancel a queued or running job, use "scancel".

The NSC "projinfo" command is available. It will display your projects and their usage, giving you a rough idea of what priority your jobs will have in the queue. If your project has used a large percentage of its allocation, your priority will be low.

There is also a graphical display of the node utilization, the amount of jobs in the queue etc on http://www.nsc.liu.se/status/.

By using suitable options to squeue, you can get an overview of the jobs in the queue. This will give you an idea of how long your jobs might have to wait.

To display quite a lot of detail about each queued job, sorted by priority, run squeue like this:

squeue -o "%.12Q %.7i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.6h %R" --state=PD -S "-p" | less

NOTE: "START_TIME" is just an estimate by the scheduler, based on the current jobs in the queue. Due to the fact that any job submitted in the future with a higher priority than your job will skip ahead of you in the queue, the estimated start time is very unreliable if your priority is low. if jobs end ahead of schedule, the opposite can happen - your job might start earlier than the estimated time.

Example:

[kronberg@triolith1 ~]$ squeue -o "%.12Q %.7i %.8j %.8u %.15a %.12l %.19S %.4D %.4C %.6h %R" --state=PD -S "-p" | less
    PRIORITY   JOBID     NAME     USER         ACCOUNT    TIMELIMIT          START_TIME NODE CPUS SHARED NODELIST(REASON)
  1000000153   75956 tt4_bdis    raber             nsc   1-00:00:00 2012-10-25T16:08:24    4    4     no (Resources)
     1000132   76091 ScZrNiCo  x_robjo  snic001-11-241     10:00:00 2012-10-25T16:08:24    1    1     no (Resources)
      995105   76217   ISDAC1  x_julsa   snic002-12-16   1-00:00:00 2012-10-25T16:08:24    1    8 unknwn (Resources)
      995067   77248 x_FOTO_x  x_laubr   snic002-12-16      3:00:00 2012-10-25T16:08:24    1    1     no (Resources)
      989612   76208       MG  x_kanil  snic001-12-100      3:00:00 2012-10-25T16:08:24    4   64     no (Resources)
      989611   76211       FG  x_kanil  snic001-12-100   2-22:00:00 2012-10-25T16:08:24    4   64     no (Resources)
      976515   77119 001_2L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77128 001_3L_C  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77133 001_3L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77149 001_4L_c  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77157 001_4L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976514   77193 010_2L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976513   77195 010_4L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976513   77196 010_3L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77197 011_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77198 011_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77199 011_2L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77214 100_4L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77217 100_4L_c  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77219 100_2L_C  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77222 100_2L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77224 100_3L_M  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976512   77227 100_3L_C  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976511   77309 101_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976511   77310 101_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976511   77311 101_2L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77317 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77318 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77319 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77320 110_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77321 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77322 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77323 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77324 110_33L_  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77325 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77326 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77327 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77328 110_4L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77329 111_2L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
      976510   77330 111_3L_T  x_vsera   snic001-12-80     10:59:00 2012-10-25T16:08:24    2    2     no (Resources)
[...]

11.6 If your job failed

11.6.1 Running out of memory (OOM)

A common cause for failed jobs is running out of memory on one or more of the nodes in the job. When this happens, the job might fail in several different ways:

  1. The job might stop immediately if the application dies when running out of memory. Often you will find signs of this in the application output (usually in the slurm-JOBID.out file if you have not redirected it elsewhere)
  2. The job might stop making progress, but continue running until it hits the walltime limit.
  3. One or more of the nodes becomes so slow due to the lack of available memory that the scheduler takes it offline. In this case, you might get an email from NSC informing you of what happened.

Log files that might be useful:

  • The slurm-JOBID.out file is created in the directory from where you submitted the job. Unless you have redirected the output somewhere else, this is where the output from your job script will end up.
  • Any log files written by your application.
  • The NSC accounting logs in /var/log/slurm/accounting/YYYY-MM-DD on the login node (see the example after this list). All jobs that have ended are listed there. In this file you can find e.g:
    • jobstate - some common states:
      • COMPLETED: the job script exited normally (i.e with exit status == 0). This does not necessarily mean that the job was successful, only that the job script did not return an error to the scheduler.
      • FAILED: the job script returned a non-zero exit status. This usually means that something went wrong, but it does not necessarily mean that the application itself failed, it might be e.g a failed "cp" command that was run as the last command in the job script.
      • CANCELLED: you (or a system administrator) cancelled the job (using scancel).
      • NODE_FAIL: one or more of the compute nodes in the job failed in such a way that the scheduling system decided to take it offline. Common causes: the application caused the node to run out of memory, or a hardware failure.
      • TIMEOUT: the job ran until it had used all the walltime requested by it, and was terminated by the scheduler.
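
For example, to look up a job that ended on a given day in the accounting log (the job ID and the date are placeholders):

grep 12345 /var/log/slurm/accounting/2012-10-24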

Jobs that have not ended can be viewed using the squeue command, run e.g squeue -l -u $USER.

You can also ask NSC (support@nsc.liu.se, remember to include the job ID) for assistance in determining if your job ran out of memory. We can check system logs on the compute nodes that are not available to you, and these logs will usually tell us if the node ran out of memory or not.

If your job ran out of memory, please do not resubmit it. Unless you modify the job to use less memory, it will just fail again.

If your job runs out of memory when submitted normally, these are some of your options:

  • Use nodes with more memory
    • There are 56 "fat" nodes in Triolith with 128GiB RAM (the other 1144 "thin" nodes have 32GiB RAM). You can request "fat" nodes by adding the option "-C fat" to your sbatch or interactive command.
    • Note: since there are relatively few fat nodes, your job might need to wait for longer than usual in the queue if demand for fat nodes is high.
  • Use less memory per node
    • If you run an MPI application, you can usually try running fewer ranks per node, and either run on more nodes or accept a longer runtime. E.g sbatch --ntasks-per-node=8 will run 8 ranks per node instead of 16 (see the sketch after this list).
    • If your application has a configuration option for how much memory to use per node, try lowering that. E.g Gaussian has such a switch. Remember that even if the compute nodes have 32GiB RAM, you cannot use all of that for your application, some room must be left for the operating system, disk cache etc. A value of around 30GiB is usually OK.
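
A minimal sketch of such a job script, assuming an MPI application started with mpprun as elsewhere in this guide ("mympiapp" is a placeholder):

#!/bin/bash
#SBATCH -J lowmem
#SBATCH -t 12:00:00
#SBATCH -N 4
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
#
# Running 8 ranks per node instead of 16 leaves twice as much memory per rank.
mpprun ./mympiapp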

12 Optimizing your code for Triolith

Recompile your own applications! If you have previously compiled your own software we definitely recommend recompiling it on Triolith. See instructions on this page for how to build your applications on Triolith.

There are basically two compiler suites supported by NSC on Triolith, Intel (intel) and the GNU compiler collection (gcc). Other compilers may be installed and made accessible in due course, but support will be restricted to intel and gcc for the foreseeable future. The optimization instructions below only refer to these.

12.1 Intel Compilers

Support for the Sandybridge type of processor in Triolith is enabled by the -xAVX switch to all Intel compilers (icc, ifort and icpc). Alternatively, if you compile on Triolith you can use the -xHost switch, which makes the compilation default to the highest instruction set available on the compilation host machine, effectively -xAVX on Triolith.

Code compiled this way on Triolith can only be run on Intel processors supporting the AVX instruction set; at the time of writing, only Intel Sandybridge and Ivybridge based CPUs (the latter presently only available as consumer-level processors) support AVX. If support for generic x86 processors (earlier Intel CPUs and AMD CPUs) is desired, the -axAVX switch can be used instead. There may be a performance penalty on Triolith when using this option, which will have to be checked on a case by case basis.

Regarding the global optimization level switch -O<X>, where X is between 0 and 3 for the Intel compilers, it is tempting to turn this all the way up to 3. However, this will not unequivocally yield better performing binaries - often they will perform worse than those built with the default -O2 - and it will unconditionally lead to more compilation trouble for any code of significant size. If you still choose to try the -O3 switch, it is good practice to also add the -no-ipo switch, which removes many problems related to the use of -O3.

If you have OpenMP code to compile, you also need to add the -openmp switch to enable OpenMP in the Intel compilers.

Examples:

ifort -O2 -xAVX -o mybinary mycode.f90
icc -O3 -no-ipo -xAVX -o mybinary mycode.c
icpc -xHost -openmp -o mybinary mycode.cpp #default global optimization level is "-O2"

12.2 GNU Compiler Collection

The GNU compilers shipped with CentOS 6 (the operating system on Triolith) were released well before the Intel Sandybridge line of processors. The support for AVX is therefore not as well developed in these compilers as in the Intel compilers. There is some support, however, and later compiler releases can be expected to produce better-performing binaries.

A good choice of GCC compiler flags on Triolith is -O3 -mavx -march=native for any installed version of GCC, either those shipped with CentOS 6 or those installed by NSC and accessible via the module system. A binary built this way will run only on AVX-capable CPUs. The choice of -O3 is safe for the GCC compilers in general, as the developers are more conservative with respect to numerically less precise code generation.

If you instead want a binary capable of running on generic x86 CPUs while retaining some tuning, similar to the Intel -axAVX switch, you could consider the switches -O3 -mtune=native -msse<X> with a suitable value for <X>; e.g. -msse3 should let the binary run on the vast majority of current HPC CPUs from both AMD and Intel. If your code uses OpenMP, add the -fopenmp switch to make use of this feature.

Examples:

gfortran -O3 -mavx -march=native -o mybinary mycode.f90
gcc -O3 -mtune=native -msse3 -o mybinary mycode.c
g++ -O3 -mavx -march=native -fopenmp -o mybinary mycode.cpp

The normal NSC compiler wrappers and mpprun are available, so to build and run an MPI application you only need to load a module containing an MPI (e.g build-environment/nsc-recommended) and add the -Nmpi flag when compiling.

Example:

module add build-environment/nsc-recommended
icc -Nmpi -o myapp myapp.c

To run such an application you only need to use mpprun to start it, e.g

mpprun ./myapp

The compiler wrapper handles the compiler options needed to build against the loaded MPI version (Intel MPI in this case; it is part of build-environment/nsc-recommended), and mpprun knows how to launch an MPI application built against that MPI.

13 MPI

Currently both Intel MPI and OpenMPI are installed and supported on Triolith.

NSC recommends Intel MPI, as it has shown the best performance for most applications. However, if your application does not work/compile with Intel MPI or gets better performance when using OpenMPI, please use that instead.

Note to Neolith users: Scali MPI is not available on Triolith.

You can see which versions of Intel MPI and OpenMPI are installed by running "module avail" (look for "impi" and "openmpi").

Intel MPI is loaded in the build-environment/nsc-recommended module. To use Intel MPI with the Intel compilers, just load build-environment/nsc-recommended. To use OpenMPI, load build-environment/nsc-recommended and then load an openmpi module (which will then unload the impi module).

The recommended way to build MPI binaries at NSC is to use the NSC-specific compiler flag -Nmpi, then use mpprun to launch the resulting binary.

You do not need to specify which MPI to use, the compiler figures that out from the module you have loaded.

When launching the binary using mpprun, you do not need to specify how many ranks to start or which MPI should be used, mpprun will figure that out from the binary and the job environment.

Important note regarding OpenMPI performance

The currently (2012-10-24) installed OpenMPI version is close to Intel MPI performance-wise only if you set the core binding yourself using extra flags to mpprun (unlike Intel MPI, where this is done by default). We are working on incorporating this into mpprun, but there are many corner cases to work out. Until this is done, we recommend that you launch your OpenMPI applications like this:

mpprun --pass="--bind-to-core --bysocket" /software/apps/vasp/5.3.2-13Sep12/openmpi/vasp-gamma

13.1 Example: building and launching a simple MPI application

This is our source code:

/*                                                                              
 * Hello World in C                                                           
 */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
  int rank, size;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello, world, I am %d of %d\n", rank, size);
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();

  return 0;
}

Building the binaries:

[kronberg@triolith1 mpi]$ module add build-environment/nsc-recommended
Conflicting modules warning: Unloading no-compilers-loaded/1 before loading intel/12.1.4
[kronberg@triolith1 mpi]$ icc -Nmpi -o mpitest_c mpitest_c.c 
icc INFO: Linking with MPI impi/4.0.3.008.
[kronberg@triolith1 mpi]$ module add openmpi/1.6.2-build1
Conflicting modules warning: Unloading impi/4.0.3.008 before loading openmpi/1.6.2-build1
*** WARNING: this is a test version of OpenMPI and it might be removed or changed without warning! ***
[kronberg@triolith1 mpi]$ icc -Nmpi -o mpitest_c_openmpi mpitest_c.c 
icc INFO: Linking with MPI openmpi/1.6.2-build1.
[kronberg@triolith1 mpi]$

Examining the binaries with "dumptag", which shows information about how the binaries were built:

[kronberg@triolith1 mpi]$ dumptag mpitest_c
-- NSC-tag ----------------------------------------------------------
File name:              /home/kronberg/mpi/mpitest_c

Properly tagged:        yes
Tag version:            4
Build date:             121024
Build time:             131142
Built with MPI:         impi 4_0_3_008
Built with MKL:         no (or build in an unsupported way)
Linked with:            intel 12_1_4
---------------------------------------------------------------------
[kronberg@triolith1 mpi]$ dumptag mpitest_c_openmpi
-- NSC-tag ----------------------------------------------------------
File name:              /home/kronberg/mpi/mpitest_c_openmpi

Properly tagged:        yes
Tag version:            4
Build date:             121024
Build time:             131446
Built with MPI:         openmpi 1_6_2_build1
Built with MKL:         no (or build in an unsupported way)
Linked with:            intel 12_1_4
---------------------------------------------------------------------
[kronberg@triolith1 mpi]$

Running the binaries using mpprun in an interactive session on two nodes:

[kronberg@triolith1 mpi]$ interactive -N2 --exclusive -t 00:10:00 --reservation=devel
Waiting for JOBID 77079 to start
...
[kronberg@n1137 mpi]$ mpprun mpitest_c
mpprun INFO: Starting impi run on 2 nodes (32 ranks)...
Hello, world, I am 16 of 32
[...]
Hello, world, I am 31 of 32
Hello, world, I am 6 of 32
[kronberg@n1137 mpi]$ mpprun mpitest_c_openmpi 
mpprun INFO: Starting openmpi run on 2 nodes (32 ranks)...
Hello, world, I am 18 of 32
Hello, world, I am 19 of 32
[...]
Hello, world, I am 8 of 32
Hello, world, I am 10 of 32
[kronberg@n1137 mpi]$ 
[kronberg@n1137 mpi]$ exit
[screen is terminating]
Connection to n1137 closed.
[kronberg@triolith1 mpi]$ 

14 MKL

MKL, Intel Math Kernel Library

The Intel Math Kernel Library (MKL) is available, and we strongly recommend using it. Several versions of MKL may exist; you can see which versions are available with the "module avail" command. The library includes the following groups of routines:

  • Basic Linear Algebra Subprograms (BLAS):
    • vector operations
    • matrix-vector operations
    • matrix-matrix operations
  • Sparse BLAS (basic vector operations on sparse vectors)
  • Fast Fourier transform routines (with Fortran and C interfaces). There exist wrappers for FFTW 2.x and FFTW 3.x compatibility.
  • LAPACK routines for solving systems of linear equations
  • LAPACK routines for solving least-squares problems, eigenvalue and singular value problems, and Sylvester's equations
  • ScaLAPACK routines including a distributed memory version of BLAS (PBLAS or Parallel BLAS) and a set of Basic Linear Algebra Communication Subprograms (BLACS) for inter-processor communication.
  • Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments (with Fortran and C interfaces).

Full documentation can be found online at http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/.

14.1 MKL Library structure

The Intel MKL installations are located in the /software/intel directory, usually as part of an Intel Composer installation (compiler + MKL and other tools).

When you have loaded an mkl module (or the build-environment/nsc-recommended module, which contains MKL), the environment variable $MKL_ROOT will point to the MKL installation directory for that version (e.g /software/intel/composer_xe_2011_sp1.10.319/mkl).

The MKL consists of two parts: a linear algebra package and processor-specific kernels. The former part contains LAPACK and ScaLAPACK routines and drivers that were optimized without regard to processor, so that they can be used effectively on different processors. The latter part contains processor-specific kernels such as BLAS, FFT, BLACS, and VML that were optimized for the specific processor.

14.2 Linking with MKL

If you want to build an application using MKL with the Intel compilers at NSC, we recommend using the flag -Nmkl (to get your application correctly tagged) and -mkl=MKLTYPE. The -mkl flag is available in Intel compilers from version 11 (so it will be available unless you for some reason need to use a really old compiler).

-mkl=parallel
will link with the (default) threaded Intel MKL.
-mkl=sequential
will link with the sequential version of Intel MKL.
-mkl=cluster
will link with Intel MKL cluster components (sequential) that use Intel MPI. If you use this option you should also load an MPI module (e.g "module load impi").
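
A minimal linking sketch combining these flags (the source file myapp.f90 and the binary name are hypothetical placeholders):

module add build-environment/nsc-recommended
ifort -Nmkl -mkl=sequential -o myapp myapp.f90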

If for some reason you cannot use the "-mkl" flag, please read the Intel documentation to find out what linker flags you need. You might also find this Intel page useful.

14.3 MKL and threading

The MKL is threaded by default, but there is also a non-threaded "sequential" version available. (The instructions here are valid for MKL 10.0 and newer, older versions worked differently.)

Whether threaded or sequential MKL gives the best performance varies between applications. MPI applications will typically launch one MPI rank on each processor core on each node; in this case threads are not needed, as all cores are already in use. However, if you use threaded MKL you can start fewer ranks per node and increase the number of threads per rank accordingly.

The threading of MKL can be controlled at run time through the use of a few special environment variables.

  • OMP_NUM_THREADS controls how many OpenMP threads should be started by default. This variable affects all OpenMP programs, including the MKL library.
  • MKL_NUM_THREADS controls how many threads MKL-routines should spawn by default. This variable affects only the MKL library, and takes precedence over any OMP_NUM_THREADS setting.
  • MKL_DOMAIN_NUM_THREADS lets the user control individual parts of the MKL library. E.g. MKL_DOMAIN_NUM_THREADS="MKL_ALL=1;MKL_BLAS=2;MKL_FFT=4" would instruct MKL to use one thread by default, two threads for BLAS calculations, and four threads for FFT routines. MKL_DOMAIN_NUM_THREADS also takes precedence over OMP_NUM_THREADS.

If the OpenMP environment variable controlling the number of threads is unset when launching an MPI application with mpprun, mpprun will by default set OMP_NUM_THREADS=1.
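
As a hedged illustration of the fewer-ranks-plus-threads approach mentioned above (the binary name ./myapp and the exact counts are hypothetical placeholders): submit the job with e.g sbatch --ntasks-per-node=8, and set the thread count in the job script before launching:

export MKL_NUM_THREADS=2
mpprun ./myapp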

15 OpenMP

Note: OpenMP is NOT the same as OpenMPI!

Example: compiling the OpenMP program openmp.f with ifort:

$ ifort -openmp openmp.f

Example: compiling the OpenMP program openmp.c with icc:

$ icc -openmp openmp.c

IMPORTANT: Please see MKL and threading for how to use e.g OMP_NUM_THREADS to make your OpenMP application use all CPU cores!
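
A minimal sketch of running such a binary (a.out is the default output name from the compilations above; 16 threads matches the number of cores in a Triolith compute node):

$ export OMP_NUM_THREADS=16
$ ./a.out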

16 Storage

Triolith users can store files in several different locations. Each location has its own characteristics. Some locations (/home, /nobackup/global) are located on NSC's Centre Storage system and are shared with other systems (e.g Kappa and Matter).

There are limits to how much data you can store in each location. On /home and /nobackup/global, a quota system limits how much you can use. On /scratch/local you are limited by the physical size of the disk in each compute node.

  • /home: for important data. Default quota: 20 GiB. Shared between all nodes in Kappa, Matter and Triolith. All files are backed up to tape daily. If you need to restore a file from backup, contact NSC support.
  • /nobackup/global: for less important data or data that can be restored by re-running calculations. Default quota: 250 GiB. Shared between all nodes in Kappa, Matter and Triolith. NOT backed up to tape.
  • /scratch/local: for local scratch data during the running of a job. Default quota: none (but see below). Not shared between nodes. NOT backed up to tape. Contents deleted after each job.
  • /software: not writable by users (no quota). This file system contains software provided by NSC.

Please do not store large amounts of data in other writable locations (e.g on /tmp, /var/tmp, …), since the space there is very limited and shared by all users.

16.1 Where should I store my data of type X?

Some rough guidelines:

  • Temporary files created during a job that are not needed after the job ends: use /scratch/local ($SNIC_TMP). Using the local node disk for temporary files puts less load on the shared filesystems, giving better performance for everyone.
  • Output files from jobs that can be recreated by re-running jobs: use /nobackup/global. Tape backup is expensive; re-running jobs is cheap.
  • Small important files where disk performance is less important: use /home. We perform tape backups of /home, so if you accidentally delete a file you can usually get it back. /nobackup/global has better performance than /home for most workloads.
  • Files that are read several times during each job (e.g large databases): store on /nobackup, but copy to /scratch/local at the beginning of the job. The shared file system (GPFS) has a very limited read cache compared to local disk, so if you e.g read a 4GB file ten times, it is read from the shared file servers ten times. If you read the same file from local disk, the contents will be cached in memory after the first read, giving much better performance for you and less load on the shared servers.

Note that these are just recommendations. For example, if you have run a very long series of jobs that generate only a small amount of output data, that data is very valuable compared to its size, and it makes sense to use backed-up disk storage (/home) for it.

Please contact support@nsc.liu.se if you need advice on where to store your files.

16.2 Using the local disk in compute nodes (/scratch/local/NNNNNN)

Note: in Triolith you cannot write directly to /scratch/local (since the compute node might be shared with other jobs and users). Instead you need to use the location specified in the environment variable $SNIC_TMP. This variable will point to a directory (e.g /scratch/local/12345) that is created for your job, and which will be deleted after your job ends. This variable is not defined until the job starts, but is available in the environment where your job script runs.

Example job script:

#!/bin/bash
# Copy an input file to local disk
cp /home/x_makro/inputs/foo.dat $SNIC_TMP/input.dat

# Change working directory to local disk
cd $SNIC_TMP

# Run application. Let us assume that it creates some large
# temporary files in the current directory, and one output
# file that we need to keep after the job
./myapplication

# Copy files we want to keep after the job to a safe location
cp output.dat /nobackup/global/x_makro/output_from_job_x.dat

# There is no need to delete the temporary files, since
# $SNIC_TMP will be deleted after the job ends.

16.3 How much storage space am I using, and what are the limits? (Quota)

The command "snicquota" will tell you how much space you are using on filesystems that use quota (/home and /nobackup/global), and what your current limits are.

[x_makro@triolith1 ~]$ snicquota
FILE SYSTEM                  USED        QUOTA        LIMIT        GRACE
--                           ----         ----         ----        -----
/home                    10.6 MiB     20.0 GiB     30.0 GiB             
/nobackup/global         64.0 KiB    250.0 GiB    300.0 GiB             
[x_makro@triolith1 ~]$ 

Details: Quota in this system has two limits. The "QUOTA" limit is the long-term limit; as long as you are below this, you can work normally. If you go above the "QUOTA" limit, you can still write more data to the file system until you reach the hard limit ("LIMIT"), but a timer is started (the "GRACE" column). When the grace timer displays "expired" (after a week), you cannot write any more data to the file system until you delete or move enough files to get below the "QUOTA" limit.

16.4 If you need more storage space

The policy for increased storage quota is simple: If space is available, you can get more quota, provided you explain to us how much you need, why, and for how long. An example:

  • How much: "I need a total quota of 500GB on /nobackup/global"
  • Why: "I expect to run up to 10 jobs at the same time in my new project, and each job needs 50 GB of storage space for its output files."
  • For how long: "I need this space for the duration of my project (until 2011-06-01)"

Before requesting more quota, make sure that you store your data in the correct location (e.g /nobackup for data that can be recreated and does not need expensive daily tape backups).

Note: quota is shared between several systems (e.g Matter, Kappa and Triolith), so the total amount of quota you ask for must be sufficient for all your needs on these three systems.

Send your requests for more quota to support@nsc.liu.se.

17 Common problems

17.1 Not specifying how many cores you want

Note: if you have used the "-N" option (e.g -N2) on e.g Neolith only to get a number of full nodes, on Triolith you also need to add --exclusive (or some other way of specifying the number of cores to allocate).

sbatch -N2 will give you a total of two cores spread out over two nodes, which is probably not what you want.
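
A hedged sketch of requesting two full nodes (the job script name jobscript.sh and the time limit are hypothetical placeholders):

sbatch -N2 --exclusive -t 01:00:00 jobscript.sh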

17.2 Gaussian: GAUSS_SCRDIR set to /scratch/local

Do not set GAUSS_SCRDIR in your environment. If it is set, $g09root/g09/bsd/g09.profile will not change it!
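
If GAUSS_SCRDIR is already set in your environment (e.g from an old .bashrc), a minimal sketch of clearing it before the job sources g09.profile is:

unset GAUSS_SCRDIR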

See /software/apps/gaussian/README.NSC for more Gaussian information.

17.3 Complaints about non-existing modules when logging in

Since many of the software modules present on Neolith/Kappa/Matter do not exist on Triolith or have other names, trying to load them will result in an error message.

If you have previously loaded modules from e.g .bashrc or .modules, some of those modules may generate an error message due to this.

Workaround: don't load modules at login, load them when they are needed, or in the job script.

17.4 Matlab not available

Due to the terms of the Linköping University Matlab license, NSC is not allowed to make this Matlab license available to anyone not affiliated (student, researcher etc) with Linköping University.

If you are a LiU user and cannot access Matlab, please contact support@nsc.liu.se and we will add you to the Matlab license group ("liu"), which will give you access.

However, due to a recent license change, we can in many cases allow users to "bring their own license". If you want to use Matlab on Triolith with your home university's Matlab license, please contact support@nsc.liu.se and ask about this.

17.5 I cannot access VASP

Note: NSC will not buy VASP licenses for our users.

The terms of the VASP license require that we verify that a user is covered by a valid VASP license before we can give access to our VASP binaries.

There are three ways in which you can show us that you are covered by a VASP license:

  • You can be added to one of the licenses that we have on file by referring us to the license number that you are covered by. We will then confirm this with the holder of that license. It is also possible for the holder of that license to add you directly from the SNIC portal SUPR.
  • If we don't have the license that you are covered by on file, then you must provide a photocopy of the license agreement and the license number. This number is printed on the invoice, so people generally send us a photocopy of the invoice as well.
  • If you know that you are covered by a license, but we don't have it on file and you don't have access to the license contract, then we can confirm that you are a registered user with the VASP developers. This will usually take a couple of days since we have to email the VASP developers and get a reply from them.

To get access, send a request to support@nsc.liu.se. Remember to tell us which of the three methods apply to you, and any details needed (e.g your license number).

17.6 Cannot submit jobs when you are a member of multiple projects?

If you are a member of a single project, the scheduler assumes that you will run all your jobs using that project.

When you are a member of multiple projects, however, the scheduler cannot decide for you which project to use, so you need to specify which project to use for each job.

If you don't specify a project when you're a member of multiple projects, you will get the following error:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Note: since the PI of a project may add new members at any time, you might suddenly find yourself a member of multiple projects, so if you want to be on the safe side, always specify which project to use, even if you are currently a member of just one.

You can specify which project to use for a job in several ways:

  1. Add "-A PROJECT_NAME" or "–account=PROJECT_NAME" as an option to sbatch or interactive on the command line (e.g "interactive -A snic-123-456")
  2. Add "#SBATCH -A PROJECT_NAME" or "#SBATCH –account=PROJECT_NAME" to your job script
  3. Set the environment variable $SBATCH_ACCOUNT to PROJECT_NAME

Note: replace PROJECT_NAME in the examples with your actual project name, which you can find in NSC Express (the Resource Manager Name line when you view a project) or by running "projinfo" on Triolith.
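
A minimal job script sketch for option 2 (the project name snic-123-456 is the same placeholder as above; the node count, time limit and binary name are hypothetical):

#!/bin/bash
#SBATCH -A snic-123-456
#SBATCH -N 2
#SBATCH --exclusive
#SBATCH -t 01:00:00

mpprun ./myapp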

18 Getting help

You can contact the Triolith support team using the email address support@nsc.liu.se. You can use this address for anything related to Triolith, e.g

  • Asking a question
  • Telling us that something is wrong
  • Start a discussion regarding some long-term issue or future needs
  • Requesting the installation of a software package

When reporting a problem, please include all relevant information in your initial email, such as:

  • A relevant subject line (e.g "I cannot start Matlab on Triolith").
  • Your Triolith username.
  • That your question is regarding the Triolith system.
  • Which software you are using, including compilers (for example "ifort 9.0.032") and switches (for example -apo).
  • A short description of the problem, specifying what actions you have performed, which results you got, and which results you expected to get.
  • For a communication problem, please include details of your own computer and network.
  • If possible, include specific information to nail down when and where the problem arose, e.g. job number, nodes, or point in time. This makes it easier for us to dig out logged information and possibly correlate your problem with other activities on the system.

By providing as much information as possible in your initial email, you make it possible for us to get started on the problem immediately without having to ask follow-up questions.

If you have more than one separate question/problem, please send one email for each.

You may use English or Swedish. We will try to reply in the same language. Please note that as we have some staff that are not fluent in Swedish, you may sometimes get an answer in English regardless of the language of your original question.

We read email to the support address during normal office hours (approximately 08-17 local Swedish time: CET/CEST). We try to always give you some kind of answer (not counting the automated one) within two working days (but you will usually hear from us sooner than that).

You will get an automated reply from our support ticket system within a few minutes. If you want to add more information about your problem, reply to that email, that way the additional information will automatically be added to our database.

When you have a new question, please send a new email, do not reply to an old conversation. A reply to an old email might only reach the person who handled that problem, and that person could be busy, on leave etc. Sending a new email ensures that your request is seen by all support staff as soon as possible.

19 Document history

Table 3: Document history (major changes only)
Date        Version  Changes
2012-06-29  1.0      Initial version
2012-07-02  1.1      Added information about node sharing
2012-07-04  1.2      Added section on optimizing/compiling your code for Triolith
2012-07-07  1.3      Module command available in job scripts, some notes on monitoring jobs
2012-09-13  1.4      Minor refresh (wall time limit changed, pilot test ended etc)
2012-10-24  1.5      Major refresh
2013-01-17  1.6      Added information about disk storage
2013-03-01  1.7      Added link to Triolith software documentation
2013-10-29  1.8      Added ThinLinc documentation, updated for expansion to 1600 nodes

Last updated: 2014-02-14
