Recent power outages - summary

On March 6th, power went out to parts of Linköping including NSC due to a high-voltage cable being severed by construction work going on near campus. This was unfortunate, but unrelated to what happened next.

On March 28th and April 5th there were two power outages that affected one of our data centers ("Kärnhuset", where Triolith, Gamma, Bi and Frost are located). In each case, a protection device in the high-voltage substation feeding Kärnhuset shut the power off after detecting an electrical arc. In both cases, the local power utility failed to find anything wrong and restored power after a short time.

The third time this happened (on the morning of April 6th), the utility was finally able to find some clues to the root cause of the problem. Apparently a bolt deep inside a hard-to-reach part of the station had come loose and caused intermittent arcing.

Once the problem was found (around lunchtime), it was obvious that a repair would take days, so the local utility started planning how to provide another source of power. At 17:30, the first large diesel generator arrived on site. Since Kärnhuset is not designed to be powered by a diesel generator, some creative work was needed to get the cables into the building and connected. At 19:49 power was restored. At 20:09 the first servers were started. At 21:00 users were able to access Triolith, Gamma, Bi and Frost again. At 22:00 we had powered up as many compute nodes as we dared and resumed running jobs on them.

During the weekend, a specialist team from ABB (which built the station) dismantled parts of the station and repaired or replaced the failed components.

On Monday, April 10th the repairs were completed and we switched Kärnhuset back to normal power at 10:02 CEST. At 10:45 all compute nodes were started and running jobs.

In order to physically disconnect the diesel generator we will need to stop most compute nodes in Triolith, Gamma and Bi one more time, on Wednesday April 12th.

Diesel generator arriving and being connected to Kärnhuset

Recently resolved problems

  • 2017-04-05 16:50 - 18:30 (shorter for many systems): Power outage in Linköping grid affecting Triolith, Gamma, Bi and Frost compute nodes. All running jobs failed. See this email thread for more information.

  • 2017-03-28 02:30 - 06:45: Power outage affecting one NSC computer room. At 02:30, power was lost to the computer room where the Triolith, Bi and Frost compute nodes are located. All running jobs failed. All systems available to users from 06:45.

  • 2017-03-06 07:30 - 08:10: Campus-wide power outage affecting both NSC computer rooms and offices. Most systems went down and all running compute jobs failed. Most systems were back up before lunch. Triolith and Gamma were kept offline to perform maintenance that was originally scheduled for Thursday March 9th.

Planned maintenance

  • 2017-04-03 from 07:00 CEST: Gamma unavailable until at least April 4th, see this email thread for details.

Planned hardware changes

  • The part of Triolith available to SNAC projects will shrink from 1536 nodes to 960 nodes on April 1st, 2017. This is a result of the delay in funding a replacement system. All Medium and Large projects on Triolith will have their computing time allocation reduced by 40% from April 1st. By reducing the number of nodes and with some additional funding from SNIC, we currently believe we will be able to keep the rest of the system running at least until July 1st, 2018 or until a replacement system is ready.
  • The LiU-only HPC service using the Gamma cluster will move to a different computer room (still on the LiU Valla Campus) on April 3rd, 2017 to make room for a new MET Norway HPC system.

Queue system status

An overview of the overall system load is available on the status page.

Graphical representations of the current queue on some of our systems:

