Cray T3E

The Cray T3E service was discontinued January 31st, 2003.


Basic system information

The T3E is a massively parallel system consisting of processors with local memory (PEs) connected through a high-bandwidth, low-latency network arranged in a three-dimensional torus. The memory is physically distributed but can be accessed as logically shared.

The processors are superscalar microprocessors (DEC Alpha EV5) with a clock frequency of 300 MHz. Since each processor can execute two floating-point operations per cycle (multiply and add), the peak performance is 600 MFLOPS per processor.

The configuration of the T3E:

  • 256 Application (APP) PEs (where your parallel applications run)
  • 13 Command (CMD) PEs (where you log in, compile and edit - interactive use)
  • 3 OS PEs (used by the operating system)
  • A total of 45.6 GB of memory (100 nodes with 256 MB and 156 nodes with 128 MB of memory)
  • Current system software

You can access the T3E by doing telnet to "t3e.nsc.liu.se". You need to apply for computer time to get an account on the system. To check how much time is left on your project, type the command "/usr/local/bin/project".

Scheduled maintenance time for the T3E is Monday 7-9 (am) and Thursday 16-20 (4-8 pm).


User environment

File systems

Currently the T3E has three main user file systems:

  • home directories. Use them for small files only!
  • semi-permanent directories: /nsc/stor/username. Files not used for 14 days are automatically removed. This directory is created with read access for all, so you should tighten its access privileges yourself (see the example below the list).
  • a large temporary storage area (/tmp)
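
As a minimal sketch (assuming your semi-permanent directory is /nsc/stor/username), the access privileges can be tightened like this:

    chmod go-rwx /nsc/stor/username       # remove all access for group and others
    ls -ld /nsc/stor/username             # verify the new permissions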

All three directories can be accessed from any node on the T3E. User directories are backed up, but the backups are kept for one week only.

For each login the system will create a temporary directory (accessed as $TMPDIR or $TMP) that can be used while you are logged in. This directory will be removed when you log out!

The disk organization described above applies to NSC's academic users. Slightly different rules apply to users from SAAB and SMHI.

Batch queues

The batch system on the T3E is NQE (Network Queuing Environment).

The following commands are useful:

  • qsub (submit a job, see below for a short example!)
  • qdel (delete a job)
  • qstat (general information on status of the queuing system)
  • qstat -am (show status of submitted jobs)
  • qstat -m (information about queue limits)

The system monitor xlotto displays the state of all queues as well as the processor map. There is also a command called cstat that prints a summary of the batch queue status. Finally, there is a summary of the T3E queues on the Web: http://www.nsc.liu.se/cgi-bin/xusage/

When you submit a batch job you must specify the maximum number of PEs and the maximum time that your job requires; see the example below. Based on the number of PEs and the time limit you request, the system will choose the correct queue.

The basic ideas behind the batch queueing system are:

  • Short jobs are prioritized during daytime (weekdays 08:00-18:00).
  • Large jobs are prioritized.
  • 20 PEs are reserved during daytime for test (<60 minutes, max 65 PEs), test.256mb (<60 minutes, max 48 PEs) and interactive (<15 minutes, max 16 PEs) jobs.
  • Very long and/or large jobs (>48 hours and/or >130 PEs) are routed to a special queue where jobs have to be started manually by an operator. You also need to send mail to support@nsc.liu.se and request that your job is started.
  • A maximum of three simultaneous jobs per user is allowed, queued or running!
  • All running jobs are checkpointed every three hours and when there is an operator-initiated shutdown of the system. Whenever the system is rebooted, all running jobs are restarted from their latest checkpoint.

Further information

Example batch script:

Put the script below in a file "batch_script" and submit the job by doing "qsub batch_script". The system will choose the correct queue based on the number of PEs that you are requesting. The system will send mail to the specified address (replace "mname" with your email address) when the job is finished or if there is an error.

-----------------------------------------------------------------------------------------
#QSUB -s /bin/csh                             #Specify C Shell for 'set echo'
#QSUB -eo -o output_file                      #Write NQS error and output to single file.
#QSUB -r my_job                               #Job name
#QSUB -l mpp_t=7200                           #Maximum CPU Time For All PEs (Required).
#QSUB -l mpp_p=8                              #Maximum PEs Needed (Required).
#QSUB -mu  mname                              #Mail from the system will be sent to "mname".
                                              #This can be any Internet e-mail address. 
#QSUB -me                                     #Sends mail when the request ends execution.
#
set echo
cd $TMPDIR                                    #move to temporary directory
cp $HOME/my_dir/a.out .                       #copy executable into temporary directory
cp $HOME/my_dir/data.in .                     #copy data file into temporary directory
#
(time mpprun -n 8 ./a.out < data.in ) >& out  #Execute on 8 PEs reading data.in
cp out $HOME/my_dir                           #save output data
#
rm -rf out data.in                            #clean up
#
exit
-----------------------------------------------------------------------------------------

$TMPDIR (or $TMP) is a temporary directory that the system assigns for each batch job. It is unique for each job and will be deleted after the job is finished.

Mail from NQE can only be sent to the system where the job was submitted, in this case the T3E. To receive mail from NQE somewhere else you need to re-route the mail. For example, if you would like to receive the mail at your own local workstation, just put that mail address in a file called ".forward" in your home directory.
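
A minimal sketch (the address below is only a placeholder for your real one):

    cd $HOME
    echo "first.last@your.site.se" > .forward    # NQE mail is now forwarded to this address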

The log files will be saved in sequentially numbered files "job name".number

Other useful QSUB options:

  • "QSUB -nr" Specifies that the batch request cannot be rerun. This can be useful to prevent the job from being restarted after a system restart.


Using the T3E

Programming environment/modules

The environment on the T3E is set up using modules. Modules enable dynamic configuration of your environment: system and application software can be added, deleted or switched with one simple command.

Main modules are:

  • PrgEnv. Programming Environment. Compilers, libraries, and program development tools.
  • nqe. Network Queuing Environment. The batch system.
  • mpt. Message Passing Toolkit. MPI and PVM tuned for the T3E.
  • hpf. High Performance Fortran from Portland Group.

Initialization of modules

The PrgEnv and nqe modules are automatically loaded when you log in. To get access to additional modules you need to add them, preferably in your shell initialization file (".cshrc" for C-shell,".profile" for ksh):
  • C-shell (csh):
    module add mpt hpf
    setenv TARGET cray-t3e

  • Korn Shell (ksh):
    module add mpt hpf
    export TARGET=cray-t3e
Type "module avail" to see all available modules and "module list" to see a list of all loaded modules.


Compile and execute

  • Fortran 90 Compiler

    Fortran programs are compiled with the Cray Fortran 90 compiler - f90:

    Compile and link: f90 [options] file.f (or file.f90 )
    Compile only: f90 -c [options] file.f (or file.f90 )

  • C/C++ Compiler

    Compile C programs with the cc command and C++ programs with CC.

    Compile and link: cc [options] file.c
    Compile only : cc -c [options] file.c

  • Execute

    To compile for a fixed number (N) nodes, use "-XN" when linking (e.g., "-X64" to compile for 64 nodes).

    If you want to be able to run the program on different numbers of nodes, omit this option to create a malleable executable. To run a malleable program:

    mpprun -n N ./program_name
    where "N" is the number of nodes you want to run on.


System monitor

To monitor what goes on in the system you can use a tool called "xlotto" (/usr/local/bin/xlotto), developed at NSC. Just type "xlotto" and you will get a map of processor usage as well as the status of all batch queues and interactive jobs.

When you move the mouse over an application name, the corresponding PEs are highlighted. The picture is updated every 30 seconds. Make sure you have your DISPLAY environment variable correctly defined. For a full description of "xlotto" and its usage, visit the Xlotto homepage!

You can also use "ps" or "grmview" to get additional information.


Parallelization

There are several different programming models available for parallelization:
  • Message Passing Toolkit - mpt

    The mpt module contains implementations of the two most popular message passing libraries: Message Passing Interface (MPI) and Parallel Virtual Machine (PVM).

    mpt is loaded by doing "module add mpt"; a short build-and-run sketch is given after this list.

    The MPI and PVM libraries are callable from Fortran, C and C++. Both point-to-point and collective communication are supported.

    For the Cray manual on mpt, see: Message Passing Toolkit.

    For general information see

     - MPI: MPI from Argonne National Laboratory.
     - PVM: The PVM Home Page from Oak Ridge National Laboratory.
    

  • High Performance Fortran - hpf

    The Portland Group HPF/CRAFT compiler is available on the T3E. This compiler includes most elements of the Fortran 90 language, version 1.1 of the HPF standard and most of the CRAFT-77 features.

    "module add hpf" will enable the HPF environment. To compile use the command "pghpf".

    For further information:

     - Edinburgh Parallel Computing Centre: EPCC HPF page
    

    We would like to hear about your experiences with HPF; send email to support@nsc.liu.se!

  • shmem

    Logically shared memory access routines that operate on remote and local memory. They give the highest performance: latency is minimal and bandwidth maximal. Supported routines include:
    - remote data transfer
    - atomic swap
    - atomic increment and add
    - work-shared broadcast and reduction
    - barrier synchronization
    

    shmem is loaded with the programming environment (PrgEnv). Do "man shmem" on the T3E for more info.
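
As a sketch of the MPI build-and-run workflow referred to above (my_mpi_prog.f is only a placeholder; depending on how mpt is set up you may need to add -lmpi explicitly when linking):

    module add mpt                        # make the MPI and PVM libraries available
    f90 -o my_mpi_prog my_mpi_prog.f      # compile and link the MPI code
    mpprun -n 8 ./my_mpi_prog             # run on 8 application PEs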


Profiling and debugging

The MPP Apprentice performance analysis tool:

    Fortran: f90 -eA -lapp file.f
             a.out [options]
             apprentice &

    C:       cc -h apprentice -lapp file.c
             a.out [options]
             apprentice &

The TotalView debugger (replace "npes" with the number of PEs the code is built for):

    Fortran: f90 -g -Xnpes file.f
             totalview [options] &

    C:       cc -g -Xnpes file.c
             totalview [options] &


To enable core dumps (for debugging purposes), execute the following command: "/usr/bin/limit -p 0 -d 0 -v". This will increase the maximum size of your core file from 4 kB to 0.5 GB. Remember to remove unwanted core files; they tend to get very large.


Single processor optimization

  • Compiler switches

    Use the compiler flags "-O3,pipeline3" to get the best performance.

  • The streams feature

    An important feature for memory intensive applications is the streams feature. The T3E hardware automatically detects repeated cache misses and prefetches data from local memory into stream buffers. This can significantly increase available bandwidth from secondary cache to main memory.

    Code that uses STREAMS and at the same time directly uses E-registers is potentially unsafe and can in the worst case crash the system! The use of E-registers is detected by the linker, and STREAMS will then be turned off.

    If you only use MPI and/or PVM nothing will change and STREAMS will be enabled.

    The following types of code use E-registers and will not run with STREAMS enabled:

    -  calls to the SHMEM library
    -  use of the directive "!DIR$ CACHE_BYPASS"
    -  direct use of the E-registers
    -  calls to "benchlib"
    -  HPF (since the HPF-library is built on top of SHMEM)
    

    We recommend that you put the following in your code so that you can keep track of whether your code is running with STREAMS:

       integer get_d_stream,itemp
       itemp=get_d_stream()
    
       if (itemp.eq.1) then
         write (6,*) 'Streams ON!'
       else
         write (6,*) 'Streams OFF!'
       endif
    

    We are in the process of working out a policy to enable simultaneous E-register and STREAMS usage on a per-user basis.


VAMPIR performance analysis tool for MPI programs

    VAMPIR is a graphical tool for analyzing the performance and message passing characteristics of parallel programs that use the MPI message passing library. The full user documentation can be found on the T3E at: /appl/tools/vampir/doc/VT-userguide.ps, see also our short Vampir web page.

    The following steps are necessary to get started with VAMPIR:

    1) Define the following environment variables:

            PAL_ROOT=/appl/tools/vampir/
            PAL_LICENSEFILE=/appl/tools/vampir/etc/license.dat
    

    2) Compile and link your MPI code for tracing:

     f90 ex.f  -I/appl/tools/vampir/include -L/appl/tools/vampir/lib -lVT -lpmpi -lmpi
                  
    
    3) Run the executable on the T3E as usual. In addition to the usual output this will generate a VAMPIRtrace output file which will have the extension ".bpv".

    4) Analyze the resulting VAMPIRtrace output file:
    - Define your DISPLAY environment variable
    - Run "vampir", specifying the .bpv file:

      /appl/tools/vampir/bin/vampir a.out.bpv
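
    The same steps condensed into a csh session sketch (the PE count and the DISPLAY value are placeholders):

        setenv PAL_ROOT /appl/tools/vampir/
        setenv PAL_LICENSEFILE /appl/tools/vampir/etc/license.dat

        f90 ex.f -I/appl/tools/vampir/include -L/appl/tools/vampir/lib -lVT -lpmpi -lmpi

        mpprun -n 4 ./a.out                        # run as usual; writes the trace file a.out.bpv
        setenv DISPLAY my_workstation:0            # point X output to your own display
        /appl/tools/vampir/bin/vampir a.out.bpv    # analyze the trace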
    

Using nodes with 256 MB of local memory

  • Executing code

      There are 100 PEs that each have 256 MB of local memory and 156 PEs that have 128 MB. All 256 MB nodes have been assigned a label, called "256Mb". If you want to run your code on the 256 MB PEs you have to label your binary with a matching label. This can be done with static or dynamic labelling:

    • Dynamic labeling with the "mpprun" command:
        - Type: "mpprun -L   H256Mb -n nr_of_processors my_binary"
        - Add "#QSUB -la   H256Mb" to your batchscript

    • Static labeling:
        - Type: "setlabel -l   H256Mb my_binary"
        - Add "#QSUB -la   H256Mb" to your batchscript
        Static labeling will add the label to the binary so you only need to do this once (and of course each time the code is relinked)

      Adding the option "#QSUB -la H256Mb" to your batchscript has the effect that your job will be queued in a dedicated queue for 256 MB nodes only. If you fail to add this option, your job might appear to be running when you look at the batch queues while in fact it is waiting for available 256 MB nodes, blocking 128 MB jobs from being scheduled. It is therefore required to add this option to the batchscript!
      See also job scheduling on the T3E

  • Compiling code that requires the static memory of 256 MB nodes

      If your code requires more than 128 MB and less than 256 MB of static memory, you have to issue the command:
      "setenv TARGET cray-t3e,memsize=32m"
      before you start the compilation.

      This is because the default maximum static memory available to an application is the same amount of memory that is available on the command PE that does the linking. All our command PEs have 128 MB of memory. Do "man target" for the full details!


Running an application on a single PE

Most of the time, an application on the T3E runs in parallel on 2 or more PEs, started with the "mpprun -n nr_of_processors my_binary" command. Sometimes, however, it is desirable to run an application on a single PE; this is done with the command "mpprun -a my_binary". If you do not use the "mpprun -a" option your job will be scheduled on the command PEs, where it will time-share with other command tasks such as editors, scripts, login shells etc.

The "mpprun -a" option is inherited so if you run a script with this option, all commands and applications within the script will be run on a single application PE as well. If you want to run your application on one 256 MB PE you need to specify "mpprun -a -L H256Mb my_binary"

When submitting a batch job that only requires a single application PE, remember to specify this in the batchscript:
#QSUB -l mpp_p=1

If you want to use a 256 MB PE, remember to add:
#QSUB -la H256Mb


Math libraries

  • LIBSCI

    LIBSCI on the T3E contains ScaLAPACK which provides routines for common linear algebra tasks.

    Also incorporated in LIBSCI is a collection of routines for signal processing, such as Fourier transform and convolution. Documentation is available via "man libsci".

    LIBSCI is loaded by default.


Application software

Currently we have the following application software available on the T3E:

If you would like software for the T3E at NSC, please send email to support@nsc.liu.se.


Training

NSC will be hosting a set of courses.

We will cover the T3E, the SGI and the clusters, and the plan is to give training for new as well as experienced users of our systems.

Part of the course material is available here.


New features and updates


Links to additional T3E information


Feedback

We are very interested in your feedback from using the system. Send us email at support@nsc.liu.se. For performance-related questions you can also send email directly to faxen@nsc.liu.se.






Page last modified: 2006-05-11 11:04
For more information contact us at info@nsc.liu.se.