You are always welcome to report slowness to us, but please focus on telling us exactly what problems you experienced rather than what you think the cause is. E.g. was “ls -l” slow? In which directory? Was a certain application slow? Were file transfers slow? Was “squeue” slow? Something else?
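If you want to include something concrete, timing the exact command that felt slow, together with where and when you ran it, is usually enough. A minimal sketch (the directory is just a placeholder, use whichever one felt slow):

    # Note which login node you are on and when the problem occurred,
    # then time the command that felt slow.
    hostname
    date
    time ls -l /proj/your_project_dir   # placeholder: the directory that felt slow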
Unfortunately there is often no link between a process using lots of CPU in “top” and that process being responsible for the login node feeling slow. There are several reasons for this:
The login nodes are configured to give each user that tries to use CPU an equal share, whereas a normal Linux system gives each process/thread an equal share. This means that the impact on other users when one or a few users run applications that want lots of CPU is much smaller than you might expect.
Slowness can come not just from a lack of available CPU, but also from things like storage (/home, /proj) being slow or memory bandwidth being a bottleneck. If you run an application that uses accelerated graphics, the shared GPU in the login node can also become a bottleneck.
Commands that talk to Slurm (e.g. squeue, sinfo, sbatch, projinfo, …) are often slow because the main Slurm service (slurmctld) is overloaded, e.g. by having to start lots of small jobs or by users bombarding it with requests (see below for a way to poll Slurm less aggressively).
Storage slowness is probably both the most common cause and the hardest to diagnose. Due to how the distributed file system works, access can be slow for parts of the file system (e.g. one directory where lots of file operations happen), or just from one node (e.g. if other users do lots of small-file I/O or use lots of bandwidth), or for the entire file system (e.g. if storage servers or disks are overloaded).
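On the Slurm point: if you poll the queue in a tight loop (for example by wrapping squeue in “watch -n 1”), you add to the load on slurmctld. A small sketch of a gentler approach, using squeue’s built-in repeat option; sdiag can also give a rough idea of how busy the controller is:

    # Repeat squeue for your own jobs every 60 seconds using its built-in
    # --iterate option instead of wrapping it in a tight "watch" loop.
    squeue -u $USER --iterate=60

    # Show slurmctld statistics (RPC counts, agent queue size, ...).
    sdiag

On the storage point: a rough way to check whether the shared file system is the bottleneck is to time a metadata-heavy operation on it and compare with node-local storage. This is only indicative, and the paths below are placeholders:

    # Time a metadata-heavy operation on the shared file system ...
    time ls -l /proj/your_project_dir > /dev/null
    # ... and compare with the same operation on node-local storage.
    time ls -l /tmp > /dev/null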
Yes, we monitor the login nodes, and if our monitoring detects e.g. high load, we investigate and do something about it.
But since slowness comes in many forms (as described above), it’s not easy to set up monitoring that reliably detects all forms without too many false alarms.
Regarding memory use: the login nodes have out-of-memory protection that will kill processes when available memory runs low. The largest processes will be killed first, so in some sense this is “fair”. Ideally we would have given each user a personal memory limit, but it was hard to find a setup that worked due to some limitations in the Linux kernel used. Instead, all user processes are placed in the same pool and the largest process(es) are killed when memory runs low.
Unfortunately the “oom-killer” in Linux is sometimes a bit slow in sorting out a low-memory situation, which means the login node can become very sluggish for a while when memory runs low.
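Before starting something memory-hungry on a login node, a quick way to see how much memory is available and which of your own processes use the most:

    # Free and used memory on the login node, in human-readable units.
    free -h

    # Your processes sorted by resident memory use (RSS, in kilobytes).
    ps -u $USER -o pid,rss,comm --sort=-rss | head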
Some things you can do if the login node feels slow besides telling us there’s a problem:
If the cluster has more than one login node (for example Tetralith), use another login node. You can either log in directly to that login node from the Internet (e.g. ssh tetralith2.nsc.liu.se) or from another login node (ssh l2). If you use the internal name “l2” (corresponding to “tetralith2.nsc.liu.se” in this example) you don’t have to enter your password again.
For short tasks you can use a node (or part of one) reserved for test and development if such nodes are available on the cluster.
Run long-running tasks (e.g. archiving data into tar or zip files) on a regular compute node instead of on a login node (see the sketches below).
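For the last two items, this is roughly what it can look like in practice. Both examples are sketches: the reservation name, account/project name, time limit and paths are placeholders and differ between clusters, so check the cluster’s documentation for the exact options.

    # Request a short interactive session (here: 1 task for 30 minutes).
    # Some clusters reach their test/development nodes via a dedicated
    # reservation or partition; the name below is a placeholder.
    salloc -n 1 -t 00:30:00 --reservation=devel

A long-running task such as archiving can instead be wrapped in a small batch script and submitted to a regular compute node:

    #!/bin/bash
    #SBATCH --job-name=archive
    #SBATCH --account=your_project        # placeholder: your project/account
    #SBATCH --ntasks=1
    #SBATCH --time=04:00:00               # placeholder time limit

    # Archive and compress a results directory on a compute node
    # instead of the login node.
    cd /proj/your_project/users/you       # placeholder path
    tar czf results.tar.gz results/

Submit it with “sbatch archive.sh” and check its status with “squeue -u $USER”.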