When you first log in to an NSC compute cluster (e.g. Triolith), you reach the "login node" (some systems have more than one). This is just a small part of the system: a single Linux server that serves as the connection to the outside world.
It is important to know that the login node is a resource shared by all users of the system; if it becomes slow or crashes, all users are affected. For this reason, we do not allow you to run anything but the most essential tasks on the login node.
On the login node, you are only permitted to run light tasks. A very simple rule is: do not run things on the login node that will inconvenience other users.
The more CPU and memory you use, and the longer you use it, the greater the risk that someone else will suffer. Try to use common sense.
If NSC finds what we consider improper use of the login node, either through complaints from other users or through automatic monitoring, we may stop or kill your processes. If this happens, we will notify you.
If your processes use a large portion of the login node's memory, one or more of your processes may get killed automatically by the operating system. This is done to make sure that there is enough memory available for other users as well as system services running on the login node.
If you are unsure whether a certain task can be run on the login node, please contact us and ask.
Anything not permitted to run on the login node should be run on one or more of the compute nodes, either in an "interactive" shell or as a batch job.
An interactive job is what you use when you "just want to run an application", but on a compute node. This is what happens under the hood when you use the "interactive" command.
If your interactive session has not started after 30 seconds, all resources on the system are probably already in use and you will have to wait in the queue. You can check the queue status by logging in to the system again in another window and using the "squeue" command.
Hint: some systems (e.g. Triolith, Gamma) have nodes reserved for small and short interactive sessions. See the system-specific information for how to use the development nodes.
Example interactive session (here I reserve 1 node exclusively for my job for 4 hours on Triolith and start Matlab on it):
[kronberg@triolith1 ~]$ interactive -N1 --exclusive -t 4:00:00
Waiting for JOBID 38222 to start
[kronberg@n76 ~]$ module add matlab/R2012a
[kronberg@n76 ~]$ matlab &
[...using Matlab for an hour or two...]
[kronberg@n76 ~]$ exit
[kronberg@triolith1 ~]$
Remember to end your interactive session by typing "exit". When you do that, the node(s) you reserved are released and become available to other users.
Note: the "interactive" command takes the same options as "sbatch", so you can read the sbatch man page to find out all the options that can be used. The most common ones are:
-t HH:MM:SS: choose how long you want to reserve the resources. Choose a reasonable value! If everyone always uses the maximum allowed time, it becomes very difficult to estimate when new jobs can start, and if you forget to end your interactive session, the resources will be unavailable to other users until the time limit is reached.
-N X --exclusive: reserve X whole nodes
-n X: reserve X CPU cores
--mem X: reserve X megabytes of memory
--reservation=devel: use one of the nodes reserved for short test and development jobs
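As an illustration, the options above can be combined. The following sketch (the time limit, core count, and memory amount are example values, not recommendations) requests four CPU cores and 8000 MB of memory for one hour on a development node:

```shell
# Example values only: adjust cores, memory, and time to your actual needs.
interactive -n 4 --mem 8000 -t 1:00:00 --reservation=devel
```

As with the exclusive-node example above, end the session with "exit" so the resources are released.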
A batch job is non-interactive (no user input is possible). What happens during the batch job is controlled by the job script submitted with the job. The job enters the scheduling queue, where it may have to wait for some time until nodes are available to run it.
Read more about batch jobs and scheduling.
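To illustrate, a minimal batch job script might look like the following sketch. The job name, time limit, and application are placeholders (the Matlab module name is taken from the interactive example above); see the batch job documentation for system-specific details:

```shell
#!/bin/bash
#SBATCH -J myjob          # job name (placeholder)
#SBATCH -t 00:30:00       # requested wall time
#SBATCH -N 1              # number of nodes
#SBATCH --exclusive       # reserve the node(s) exclusively

# Load the software the job needs, then run it non-interactively.
module add matlab/R2012a
matlab -nodisplay < myscript.m
```

You would submit such a script with "sbatch jobscript.sh" and monitor its place in the queue with "squeue".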
In order to allow you to monitor and debug running jobs, you can log in to a compute node directly from the login node, provided that you have an active job running on that node.
(If you try to log in to a compute node where you do not have a job running, you will get an error message like "srun: error: Unable to allocate resources: Invalid job id specified", "srun: error: Unable to create job step: Access/permission denied", "srun: error: Unable to create job step: Requested node configuration is not available" or similar.)
This feature is primarily intended for monitoring and debugging running jobs, not for starting applications in the first place (see the documentation on batch jobs and interactive jobs above.)
To use this feature, find out the job ID and the nodes your job is using (e.g. squeue -u $USER), then run a command like jobsh -j 374242 n525 from the login node to log in to one of the compute nodes in the job. You can then use normal Unix tools like "top" and "ps" to monitor your job. Example:
[kent@bi1 ~]$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
 374242        bi interact     kent  R   8:59      1 n525
[kent@bi1 ~]$ jobsh -j 374242 n525
[kent@n525 ~]$ date
Wed Aug 26 15:42:25 CEST 2015
...
[kent@n525 ~]$ exit
[kent@bi1 ~]$ jobsh -j 374242 n526
srun: error: Unable to create job step: Requested node configuration is not available
[kent@bi1 ~]$ jobsh -j 374241 n525
srun: error: Unable to create job step: Access/permission denied
[kent@bi1 ~]$
You need to specify the job ID because you can have more than one job running on a node if each job does not request all the cores. The job ID is used to place your login on the node under the same resource limits as the job.
When you are inside a job environment on a node (for example in an interactive job, a batch job, or a login session created using jobsh as above), you can use jobsh without the -j flag, as the job ID is picked up from the $SLURM_JOBID environment variable.
It is difficult to give exact numbers. If you use more than one CPU core, more than a few GB of RAM, or run for longer than half an hour, please consider the impact on other users and whether you can run on a compute node instead.