Summary: jobs that use a large number of nodes and only run for a short time are not allowed on Tetralith!
In some cases running such jobs is necessary and will be permitted (e.g. for application scaling tests). Please contact NSC Support before queueing any such jobs. There are ways to minimize the impact of your jobs on other users (e.g. node reservations).
As a rule of thumb, any job where the ratio (number of nodes / runtime in minutes) is greater than one should be avoided or discussed with NSC first. This is not a hard rule: a single such test job might be OK, but running dozens of them is not.
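The rule of thumb above can be written as a small check. This is just an illustrative sketch of the guideline, not an NSC tool; the function names are made up for this example.

```python
def wide_job_ratio(nodes: int, runtime_minutes: float) -> float:
    """The rule-of-thumb ratio: number of nodes / runtime in minutes."""
    return nodes / runtime_minutes

def should_contact_nsc(nodes: int, runtime_minutes: float) -> bool:
    """True if the job falls under the 'wide, short job' guideline."""
    return wide_job_ratio(nodes, runtime_minutes) > 1.0

# A 256-node, 10-minute test job: ratio 25.6, well above 1
print(should_contact_nsc(256, 10))   # → True

# A 16-node, 4-hour job: ratio 16/240 ≈ 0.07, no need to ask first
print(should_contact_nsc(16, 240))   # → False
```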
From a technical point of view (interconnect network, etc.), Tetralith is capable of running very "wide" jobs (jobs using many hundreds of nodes).
However, whenever a job gets to the head of the job queue and is about to be started, enough idle nodes must be made available to start it.
The only way for the job scheduler to find those idle nodes is to wait until enough other jobs have ended (as we never kill running jobs to start others).
This means that every job using more than a single compute node has a hidden cost: not only does it consume X core hours while running, it also costs an additional Y core hours because nodes sit idle while waiting for the job to start.
Sometimes it is possible to utilize the idle nodes for productive work by starting short jobs on them that will finish before the estimated start time of the wide job ("backfill"). But on average, there are not enough suitable jobs to fully utilize the idle nodes.
For most jobs, the hidden cost is negligible. But for jobs with a high ratio of nodes to runtime, the hidden cost becomes a problem.
An example to illustrate the problem: a user submits a test job using 256 nodes for 10 minutes. The job itself uses about 680 core hours, but the hidden cost in lost computing time is measured in thousands of core hours (roughly 7500 core hours, assuming that 50% of the idle time can be recovered by backfill and that the average job length is 48 hours).
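The arithmetic behind this example can be sketched as follows. The cores-per-node value and the average idle time per drained node are assumptions chosen here to roughly match the figures in the text; they are not measured values, and the accounting on the actual system may differ.

```python
def direct_cost(nodes: int, cores_per_node: int, runtime_hours: float) -> float:
    """Core hours the job itself consumes while running."""
    return nodes * cores_per_node * runtime_hours

def hidden_cost(nodes: int, cores_per_node: int,
                avg_idle_hours: float, backfill_fraction: float) -> float:
    """Core hours lost while nodes drain, minus what backfill recovers."""
    idle_core_hours = nodes * cores_per_node * avg_idle_hours
    return idle_core_hours * (1.0 - backfill_fraction)

# 256 nodes for 10 minutes; 16 cores per node reproduces the ~680 figure
print(direct_cost(256, 16, 10 / 60))       # ≈ 683 core hours

# If backfill recovers 50% of idle time, an assumed average idle time of
# about 3.7 hours per drained node yields a hidden cost near 7500 core hours
print(hidden_cost(256, 16, 3.7, 0.5))      # ≈ 7600 core hours
```

Note how the hidden cost dwarfs the job's own consumption: the shorter and wider the job, the worse this ratio gets, which is exactly what the rule of thumb is meant to catch.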