You are always welcome to report slowness to us, but please focus on telling us exactly what problems you experienced rather than what you think the cause is. E.g. was “ls -l” slow? In which directory? Was a certain application slow? Were file transfers slow? Was “squeue” slow? Something else?
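If you want to include something concrete, timing the exact command that felt slow, together with where and when you ran it, is usually enough. A minimal sketch (the directory is just a placeholder, use whichever one felt slow):

    # Note which login node you are on and when the problem occurred,
    # then time the command that felt slow.
    hostname
    date
    time ls -l /proj/your_project_dir   # placeholder: the directory that felt slow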
Unfortunately there is often no link between a process using lots of CPU in “top” and that process being responsible for the login node feeling slow. There are several reasons for this:
The login nodes are configured to give each user that tries to use CPU an equal share, whereas a normal Linux system gives each process/thread an equal share. This means that the impact on other users when one or a few users run applications that want lots of CPU is much smaller than you might expect.
Slowness can come not just from a lack of available CPU, but also from things like storage (/home, /proj) being slow or memory bandwidth being a bottleneck. If you run an application that uses accelerated graphics, the shared GPU in the login node can also become a bottleneck.
Commands that talk to Slurm (e.g. squeue, sinfo, sbatch, projinfo, …) are often slow because the main Slurm service (slurmctld) is overloaded, e.g. by having to start lots of small jobs or by users bombarding it with requests (see below for a way to poll Slurm less aggressively).
Storage slowness is probably both the most common cause and the hardest to diagnose. Due to how the distributed file system works, access can be slow for parts of the file system (e.g. one directory where lots of file operations happen), or just from one node (e.g. if other users do lots of small-file I/O or use lots of bandwidth), or for the entire file system (e.g. if storage servers or disks are overloaded).
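On the Slurm point: if you poll the queue in a tight loop (for example by wrapping squeue in “watch -n 1”), you add to the load on slurmctld. A small sketch of a gentler approach, using squeue’s built-in repeat option; sdiag can also give a rough idea of how busy the controller is:

    # Repeat squeue for your own jobs every 60 seconds using its built-in
    # --iterate option instead of wrapping it in a tight "watch" loop.
    squeue -u $USER --iterate=60

    # Show slurmctld statistics (RPC counts, agent queue size, ...).
    sdiag

On the storage point: a rough way to check whether the shared file system is the bottleneck is to time a metadata-heavy operation on it and compare with node-local storage. This is only indicative, and the paths below are placeholders:

    # Time a metadata-heavy operation on the shared file system ...
    time ls -l /proj/your_project_dir > /dev/null
    # ... and compare with the same operation on node-local storage.
    time ls -l /tmp > /dev/null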
Yes, we monitor the login nodes, and if our monitoring detects e.g. high load, we investigate and do something about it.
But since slowness comes in many forms (as described above), it’s not easy to set up monitoring that reliably detects all forms without too many false alarms.
Regarding memory use: the login nodes have out-of-memory protection that will kill processes when available memory runs low. The largest processes will be killed first, so in some sense this is “fair”. Ideally we would have given each user a personal memory limit, but it was hard to find a setup that worked due to some limitations in the Linux kernel used. Instead, all user processes are placed in the same pool and the largest process(es) are killed when memory runs low.
Unfortunately the “oom-killer” in Linux is sometimes a bit slow in sorting out a low-memory situation, which means the login node can become very sluggish for a while when memory runs low.
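Before starting something memory-hungry on a login node, a quick way to see how much memory is available and which of your own processes use the most:

    # Free and used memory on the login node, in human-readable units.
    free -h

    # Your processes sorted by resident memory use (RSS, in kilobytes).
    ps -u $USER -o pid,rss,comm --sort=-rss | head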
Some things you can do if the login node feels slow besides telling us there’s a problem:
If the cluster has more than one login node (for example Tetralith), use another login node. You can either log in directly to that login node from the Internet (e.g. ssh tetralith2.nsc.liu.se) or from another login node (ssh l2). If you use the internal name “l2” (corresponding to “tetralith2.nsc.liu.se” in this example) you don’t have to enter your password again.
For short tasks you can use a node (or part of one) reserved for test and development if such nodes are available on the cluster.
Run long-running tasks (e.g. archiving data into tar or zip files) on a regular compute node instead of on a login node (see the sketches below).
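For the last two items, this is roughly what it can look like in practice. Both examples are sketches: the reservation name, account/project name, time limit and paths are placeholders and differ between clusters, so check the cluster’s documentation for the exact options.

    # Request a short interactive session (here: 1 task for 30 minutes).
    # Some clusters reach their test/development nodes via a dedicated
    # reservation or partition; the name below is a placeholder.
    salloc -n 1 -t 00:30:00 --reservation=devel

A long-running task such as archiving can instead be wrapped in a small batch script and submitted to a regular compute node:

    #!/bin/bash
    #SBATCH --job-name=archive
    #SBATCH --account=your_project        # placeholder: your project/account
    #SBATCH --ntasks=1
    #SBATCH --time=04:00:00               # placeholder time limit

    # Archive and compress a results directory on a compute node
    # instead of the login node.
    cd /proj/your_project/users/you       # placeholder path
    tar czf results.tar.gz results/

Submit it with “sbatch archive.sh” and check its status with “squeue -u $USER”.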