Cray T3E
The Cray T3E service was discontinued January 31st, 2003.
Basic system information
The T3E is a massively parallel system composed of processing elements (PEs), each a processor
with local memory, connected through a high-bandwidth, low-latency network arranged in a
three-dimensional torus.
The memory is physically distributed but can be accessed as logically shared.
The processors are superscalar microprocessors (DEC Alpha EV5) with
a clock frequency of 300 MHz. Since each processor can execute two floating-point
operations per cycle (one multiply and one add), the peak performance is 600 MFLOPS per processor.
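The peak figure follows directly from the clock rate. A minimal sketch of the arithmetic, written in Python purely for illustration (the variable names are ours, and the aggregate number is a theoretical upper bound obtained by multiplying by the 256 application PEs of this configuration, not a measured result):

```python
# Theoretical peak performance, using only the figures quoted above.
CLOCK_MHZ = 300        # DEC Alpha EV5 clock frequency
FLOPS_PER_CYCLE = 2    # one multiply and one add per cycle
APP_PES = 256          # application PEs in this configuration

peak_per_pe_mflops = CLOCK_MHZ * FLOPS_PER_CYCLE
# Assumption: only the application PEs are counted in the aggregate.
peak_app_gflops = peak_per_pe_mflops * APP_PES / 1000

print(peak_per_pe_mflops, "MFLOPS per PE")
print(peak_app_gflops, "GFLOPS across all APP PEs")
```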
The configuration of the T3E:
- 256 Application (APP) PEs (where your parallel applications run)
- 13 Command (CMD) PEs (where you log in, compile and edit - interactive use)
- 3 OS PEs (used by the operating system)
- A total of 45.6 GB of memory (100 nodes with 256MB and 156 nodes with 128MB of memory)
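The memory total can be checked from the per-node figures. This is a small sketch in Python, used here only for illustration; it assumes that the quoted total counts 1 GB as 1000 MB.

```python
# Check the quoted memory total from the per-node figures above.
nodes_256mb = 100
nodes_128mb = 156

total_mb = nodes_256mb * 256 + nodes_128mb * 128
total_gb = total_mb / 1000   # assumption: 1 GB counted as 1000 MB

print(total_mb, "MB")
print(round(total_gb, 1), "GB")
```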
Current system software
You can access the T3E by using telnet to connect to "t3e.nsc.liu.se". You need to
apply for computer time to get an account on the system. To check how much time is left on your project,
type the command "/usr/local/bin/project".
Scheduled maintenance time for the T3E is Monday 7-9 (am) and Thursday 16-20
(4-8 pm).
User environment
File systems
Currently the T3E has three main user file systems:
- home directories. Use them for small files only!
- semi-permanent directories: /nsc/stor/username.
  Files not used for 14 days are automatically removed.
  These directories are readable by everyone by default, so you
  should change the access permissions yourself if needed.
- a large temporary storage area (/tmp)
All three directories can be accessed from any node on the T3E.
User directories are backed up, but the backups are kept for only one week.
For each login the system creates a temporary directory (accessed as $TMPDIR
or $TMP) that can be used while you are logged in. This directory is
removed when you log out!
The disc organization described above is for NSC's academic users.
Slightly different rules apply to users from SAAB and SMHI.
Batch queues
The batch system on the T3E is NQE.
You need the following to be able to use NQE successfully:
The following commands are useful:
- qsub (submit a job, see below for a short example!)
- qdel (delete a job)
- qstat (general information on status of the queuing system)
- qstat -am (show status of submitted jobs)
- qstat -m (information about queue limits)
The system monitor xlotto displays the state of all
queues as well as the processor map.
There is also a command called cstat that will print out a nice summary
of the batch queue status.
Finally, there is a summary of
the T3E queues on the Web: http://www.nsc.liu.se/cgi-bin/xusage/
When you submit a batch job you are required to specify the maximum
number of PEs and the maximum time that your job requires (see the example below).
Based on the number of PEs and the time limit that you request, the system will
choose the correct queue.
The basic ideas behind the batch queueing system are:
- Short jobs are prioritized during daytime (weekdays 08:00-18:00).
- Large jobs are prioritized.
- 20 PEs are reserved during daytime for test (<60 minutes, max 65 PEs), test.256mb
  (<60 minutes, max 48 PEs) and interactive (<15 minutes, max 16 PEs) jobs.
- Very long and/or large jobs (>48 hours and/or >130 PEs) are routed to a
  special queue where jobs have to be started manually by an operator. You also need
  to send mail to support@nsc.liu.se and request that your job be started.
- A maximum of three simultaneous jobs per user is allowed, counting both queued and running jobs!
- All running jobs are checkpointed every three hours and whenever there is an
  operator-initiated shutdown of the system. Whenever the system is rebooted,
  all running jobs are restarted from their latest checkpoint.
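The thresholds in the list above can be restated as a small decision sketch. It is written in Python purely for illustration; the function names are hypothetical, the time is assumed to be the requested limit expressed in minutes, and the real routing is done by NQE, whose exact rules may differ.

```python
# Illustrative sketch only: the real routing is done by NQE on the T3E.
# Thresholds are taken from the list above; all names are hypothetical.

MANUAL_HOURS = 48   # longer requests must be started by an operator
MANUAL_PES = 130    # larger requests must be started by an operator

# (class name, time limit in minutes, maximum PEs) for the daytime reserve
RESERVED_CLASSES = [
    ("interactive", 15, 16),
    ("test.256mb", 60, 48),
    ("test", 60, 65),
]


def needs_operator_start(pes, minutes):
    """True if the request exceeds the limits for automatic start."""
    return minutes > MANUAL_HOURS * 60 or pes > MANUAL_PES


def reserved_classes(pes, minutes):
    """Names of the reserved daytime classes whose limits the request fits."""
    return [name for name, max_minutes, max_pes in RESERVED_CLASSES
            if minutes < max_minutes and pes <= max_pes]


print(needs_operator_start(pes=8, minutes=120))
print(needs_operator_start(pes=200, minutes=30))
print(reserved_classes(pes=32, minutes=45))
```

The sketch only compares a request against the published limits; it does not model priorities, the three-jobs-per-user limit or the memory label that test.256mb jobs presumably also need.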
Further information
Example batch script:
Put the script below in a file "batch_script" and submit the job by doing
"qsub batch_script". The system will choose the correct queue based on the
number of PEs that you are requesting. The system will send mail to the
specified address (replace "mname" with your email address) when the job is
finished or if there is an error.
-----------------------------------------------------------------------------------------
#QSUB -s /bin/csh #Specify C Shell for 'set echo'
#QSUB -eo -o output_file #Write NQS error and output to single file.
#QSUB -r my_job #Job name
#QSUB -l mpp_t=7200 #Maximum CPU Time For All PEs (Required).
#QSUB -l mpp_p=8 #Maximum PEs Needed (Required).
#QSUB -mu mname #Mail from the system will be sent to "mname".
#This can be any Internet e-mail address.
#QSUB -me #Sends mail when the request ends execution.
#
set echo
cd $TMPDIR #move to temporary directory
cp $HOME/my_dir/a.out . #copy executable into temporary directory
cp $HOME/my_dir/data.in . #copy data file into temporary directory
#
(time mpprun -n 8 ./a.out < data.in ) >& out #Execute on 8 PEs reading data.in
cp out $HOME/my_dir #save output data
#
rm -rf out data.in #clean up
#
exit
-----------------------------------------------------------------------------------------
$TMPDIR (or $TMP) is a temporary directory that the system assigns for each batch job. It is
unique for each job and will be deleted after the job is finished.
Mail from NQE can only be sent to the system where the job was submitted, in this case the
T3E. To receive mail from NQE somewhere else you need to re-route the mail. For example, if
you would like to receive the mail at your own local workstation, just put that mail address in a
file called ".forward" in your home directory.
The log files will be saved in sequentially numbered files named "job name".number.
Other useful QSUB options:
- "QSUB -nr" Specifies that the batch request cannot be rerun. This can be useful
to prevent the job from being restarted after a system restart.
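The required options can also be assembled programmatically. The following sketch, in Python and for illustration only, builds a request file from the #QSUB options that appear in the example above plus "-nr". The name make_batch_script and its parameters are hypothetical (this is not an NSC tool), and the generated script is deliberately shorter than the example: it runs the executable from its original location instead of copying it to $TMPDIR.

```python
# Hypothetical helper that writes an NQE request file using only the
# #QSUB options shown in the example above. Not an NSC-provided tool.

def make_batch_script(job_name, pes, cpu_seconds, mail_to, command,
                      rerunnable=True):
    """Return the text of a csh batch script suitable for 'qsub'."""
    lines = [
        "#QSUB -s /bin/csh",
        "#QSUB -eo -o output_file",
        f"#QSUB -r {job_name}",
        f"#QSUB -l mpp_t={cpu_seconds}",  # required: maximum time
        f"#QSUB -l mpp_p={pes}",          # required: maximum PEs
        f"#QSUB -mu {mail_to}",
        "#QSUB -me",
    ]
    if not rerunnable:
        lines.append("#QSUB -nr")         # do not rerun after a restart
    lines += [
        "#",
        "set echo",
        f"mpprun -n {pes} {command}",
        "exit",
    ]
    return "\n".join(lines) + "\n"


script = make_batch_script("my_job", pes=8, cpu_seconds=7200,
                           mail_to="mname", command="$HOME/my_dir/a.out",
                           rerunnable=False)
print(script)
```

Writing the returned text to a file and submitting it with "qsub" would then follow the same procedure as for a hand-written script.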
Using the T3E
Programming environment/modules
The environment on the T3E is set up using
modules.
The modules environment enables dynamic configuration of your environment;
system and application software can be added, deleted or switched with one
simple command.
Main modules are:
- PrgEnv. Programming Environment. Compilers, libraries, and program development tools.
- nqe. Network Queuing Environment. The batch system.
- mpt. Message Passing Toolkit. MPI and PVM tuned for the T3E.
- hpf. High Performance Fortran from Portland Group.
Initialization of modules
The PrgEnv and nqe modules are automatically loaded when you log in.
To get access to additional modules you need to add them, preferably
in your shell initialization file (".cshrc" for C-shell, ".profile" for ksh):
- C-shell (csh):
  - module add mpt hpf
  - setenv TARGET cray-t3e
- Korn Shell (ksh):
  - module add mpt hpf
  - export TARGET=cray-t3e
Type "module avail" to see all available modules and "module list" to see a list
of all loaded modules.
Compile and execute
Fortran 90 Compiler
Fortran programs are compiled with the Cray Fortran 90 compiler - f90:
- Compile and link: f90 [options] file.f (or file.f90 )
- Compile only: f90 -c [options] file.f (or file.f90 )
C/C++ Compiler
Compile C programs with the cc command and C++ programs with CC.
- Compile and link: cc [options] file.c
- Compile only: cc -c [options] file.c
Execute
To compile for a fixed number (N) of nodes, use "-XN" when linking (e.g.,
"-X64" to compile for 64 nodes).
If you want to be able to run the program on different numbers of nodes,
omit this option to create a malleable executable.
To run a malleable program:
- mpprun -n N ./program_name
where "N" is the number of nodes you want to run on.
System monitor
To monitor what goes on in the system you can use a tool called "xlotto" (/usr/local/bin/xlotto),
developed at NSC. Just type "xlotto" and you will get a map of processor usage as well as the status of
all batch queues and interactive jobs.
When you move the mouse over an application name, the corresponding PEs are
highlighted. The picture is updated every 30 seconds.
Make sure you have your DISPLAY environment variable correctly defined.
For a full description of "xlotto" and its usage, visit the Xlotto
homepage!
You can also use "ps" or "grmview" to get additional information.
Parallelization
There are several different programming models available for parallelization:
Message Passing Toolkit - mpt
The mpt module contains implementations of the two most popular
message passing libraries: Message Passing Interface (MPI) and
Parallel Virtual Machine (PVM).
mpt is loaded with "module add mpt".
The MPI and PVM libraries are callable from Fortran, C and C++.
Both point-to-point and collective communication are supported.
For the Cray manual on mpt, see:
Message passing toolkit
For general information see
- MPI: MPI from Argonne National Laboratory.
- PVM: The PVM Home Page from Oak Ridge National Laboratory.
High Performance Fortran - hpf
The Portland Group HPF/CRAFT compiler is available on the T3E.
This compiler includes most elements of the Fortran 90 language,
version 1.1 of the HPF standard and most of the CRAFT-77 features.
"module add hpf" will enable the HPF environment. To compile use the command "pghpf".
For further information:
- Edinburgh Parallel Computing Centre: EPCC HPF page
We would like to hear about your experience of using hpf;
send email to support@nsc.liu.se!
shmem
Logically shared memory access routines that operate on remote and local
memory. They give the highest performance: latency is minimal and bandwidth maximal. Supported
routines include:
- remote data transfer
- atomic swap
- atomic increment and add
- work-shared broadcast and reduction
- barrier synchronization
shmem is loaded with the programming environment (PrgEnv).
Do "man shmem" on the T3E for more info.
Profiling and debugging
Profiling with apprentice:
- Fortran:
  - f90 -eA -lapp file.f
  - a.out [options]
  - apprentice &
- C:
  - cc -h apprentice -lapp file.c
  - a.out [options]
  - apprentice &
Debugging with totalview:
- Fortran:
  - f90 -g -Xnpes file.f
  - totalview [options] &
- C:
  - cc -g -Xnpes file.c
  - totalview [options] &
To enable core dumps (for debugging purposes), execute the following command:
"/usr/bin/limit -p 0 -d 0 -v".
This raises the maximum size of your core file from 4 kB to 0.5 GB.
Remember to remove unwanted core files; they tend to get very large.
Single processor optimization
The streams feature
An important feature for memory intensive applications is the streams feature.
The T3E hardware automatically detects repeated cache misses and prefetches data from
local memory into stream buffers. This can significantly increase available
bandwidth from secondary cache to main memory.
Code that uses STREAMS and at the same time directly uses E-registers is
potentially unsafe and can in the worst case crash the system! The linker therefore
detects the use of E-registers and turns STREAMS off.
If you only use MPI and/or PVM nothing will change and STREAMS will be enabled.
The following kinds of code use E-registers and will not run with STREAMS enabled:
- calls to the SHMEM library
- use of the directive "!DIR$ CACHE_BYPASS"
- direct use of the E-registers
- calls to "benchlib"
- HPF (since the HPF-library is built on top of SHMEM)
We recommend that you put the following in your code so that you can keep
track of whether your code is using STREAMS:
      integer get_d_stream, itemp
      itemp = get_d_stream()
      if (itemp .eq. 1) then
         write (6,*) 'Streams ON!'
      else
         write (6,*) 'Streams OFF!'
      endif
We are in the process of working out a policy to enable simultaneous E-register
and STREAMS usage on a per-user basis.
VAMPIR performance analysis tool for MPI programs
VAMPIR is a graphical tool for analyzing the performance and message passing
characteristics of parallel programs that use the MPI message passing library.
The full user documentation can be found on the T3E at:
/appl/tools/vampir/doc/VT-userguide.ps, see also our
short Vampir web page.
The following steps are necessary to get started with VAMPIR:
1) Define the following environment variables:
PAL_ROOT=/appl/tools/vampir/
PAL_LICENSEFILE=/appl/tools/vampir/etc/license.dat
2) Compile and link your MPI code for tracing:
f90 ex.f -I/appl/tools/vampir/include -L/appl/tools/vampir/lib -lVT -lpmpi -lmpi
3) Run the executable on the T3E as usual. In addition to the usual output this will
generate a VAMPIRtrace output file which will have the extension
".bpv".
4) Analyze the resulting VAMPIRtrace output file:
- Define your DISPLAY environment variable
- Run "vampir", specifying the .bpv file:
/appl/tools/vampir/bin/vampir a.out.bpv
Using nodes with 256 MB of local memory
Executing code
There are 100 PEs that each have 256 MB of local memory and 156 PEs
that have 128 MB. All 256 MB nodes have been assigned a label, called
"256Mb". If you want to run your code on the 256 MB PEs you have to label
your binary with a matching label. This can be done with static or dynamic labeling:
- Dynamic labeling with the "mpprun" command:
  - Type: "mpprun -L H256Mb -n nr_of_processors my_binary"
  - Add "#QSUB -la H256Mb" to your batch script
- Static labeling:
  - Type: "setlabel -l H256Mb my_binary"
  - Add "#QSUB -la H256Mb" to your batch script
Static labeling adds the label to the binary, so you only need to do this once
(and, of course, again each time the code is relinked).
Adding the option "#QSUB -la H256Mb" to your batch script has the effect that
your job is queued in a dedicated queue for 256 MB nodes only. If you fail
to add this option, your job might appear to be running when you look at the batch
queues while it is in fact waiting for available 256 MB nodes, blocking 128 MB jobs from
being scheduled. It is therefore required to add
this option to the batch script! See also
job scheduling on the T3E.
Compiling code that requires the static memory of 256 MB nodes
If your code requires more than 128 MB and less than 256 MB of static memory, you have
to issue the command "setenv TARGET cray-t3e,memsize=32m" before you start the
compilation.
This is because the default maximum static memory available to
an application is the amount of memory available on the command PE that
does the linking. All our command PEs have 128 MB of memory. Do "man target" for the
full details!
Running an application on a single PE
Most of the time, an application on the T3E runs in parallel on two or more PEs, using
the "mpprun -n nr_of_processors my_binary" command. Sometimes, however, it is desirable to run an
application on one PE, and this is done with the command "mpprun -a my_binary".
If you do not use the "mpprun -a" option your job will be scheduled on the command PEs,
where it will time-share with other command tasks such as editors, scripts, login shells etc.
The "mpprun -a" option is inherited, so if you run a script with this option, all commands and
applications within the script will be run on a single application PE as well.
If you want to run your application on one 256 MB PE you need to specify "mpprun -a -L H256Mb my_binary".
When submitting a batch job that only requires a single application PE, remember to specify
this in the batch script:
#QSUB -l mpp_p=1
If you want to use a 256 MB PE, remember to add:
#QSUB -la H256Mb
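The combinations of "mpprun" options and batch directives described in this and the previous section can be summarized in a short sketch. It is written in Python for illustration only; mpprun_command and qsub_directives are hypothetical helper names, and the option strings are copied from the text above rather than from the mpprun manual page.

```python
# Hypothetical helpers: build the mpprun command line and the matching
# #QSUB directives described above. Not part of any T3E software.

def mpprun_command(binary, pes=1, use_256mb=False):
    """Return the mpprun argument list for a run on 'pes' application PEs."""
    if pes < 1:
        raise ValueError("pes must be at least 1")
    args = ["mpprun"]
    if pes == 1:
        args.append("-a")           # run on a single application PE
    if use_256mb:
        args += ["-L", "H256Mb"]    # dynamic label for the 256 MB PEs
    if pes > 1:
        args += ["-n", str(pes)]
    args.append(binary)
    return args


def qsub_directives(pes, use_256mb=False):
    """Return the batch directives that match the run."""
    lines = [f"#QSUB -l mpp_p={pes}"]
    if use_256mb:
        lines.append("#QSUB -la H256Mb")
    return lines


print(" ".join(mpprun_command("my_binary", pes=1, use_256mb=True)))
print("\n".join(qsub_directives(1, use_256mb=True)))
```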
Math libraries
- LIBSCI
LIBSCI on the T3E contains ScaLAPACK which provides routines for common linear algebra
tasks.
Also incorporated in LIBSCI is a collection of routines for signal
processing, such as Fourier transform and convolution.
Documentation is available via "man libsci".
LIBSCI is loaded by default.
Application software
Currently we have the following application software available on the T3E:
If you would like software for the T3E at NSC, please send email to support@nsc.liu.se.
Training
NSC will be hosting a set of
courses.
We will cover the T3E, the SGI and the clusters, and the plan is to give
training for new as well as experienced users of our systems.
Part of the course material is available
here.
New features and updates
Links to additional T3E information
Feedback
We are very interested in getting your feedback from using the system.
Send us email at support@nsc.liu.se.
For performance related questions you can also send me email directly to
faxen@nsc.liu.se.
Page last modified: 2006-05-11 11:04
For more information contact us at
info@nsc.liu.se.