No reported problems
Please note: this is a manually updated message.
We will try to update this message when we have major user-visible problems on any of our systems. We also provide information via email to all users of affected systems (via the firstname.lastname@example.org mailing lists).
If something is not working, please don't hesitate to contact us even if this message says that everything is working! Sometimes we forget to update this page when we're busy investigating a major problem...
On March 6th, power went out to parts of Linköping including NSC due to a high-voltage cable being severed by construction work going on near campus. This was unfortunate, but unrelated to what happened next.
On March 28th and April 5th there were two power outages that affected one of our data centers ("Kärnhuset", where Triolith, Gamma, Bi and Frost are located). In each case, a protection device in the high-voltage substation feeding Kärnhuset shut the power off after detecting an electrical arc. In both cases, the local power utility failed to find anything wrong and restored power after a short time.
The third time this happened (on the morning of April 6th), the utility was finally able to find some clues to the root cause of the problem. Apparently a bolt deep inside a hard-to-reach part of the station had come loose and caused intermittent arcing.
Once the problem was found (around lunchtime), it was obvious that a repair would take days, so the local utility started planning how to provide another source of power. At 17:30, the first large diesel generator arrived on site. Since Kärnhuset is not designed to be powered by a diesel generator, some creative work was needed to get the cables into the building and connected. At 19:49 power was restored. At 20:09 the first servers were started. At 21:00 users were able to access Triolith, Gamma, Bi and Frost again. By 22:00 we had powered up as many compute nodes as we dared and resumed running jobs on them.
During the weekend, a specialist team from ABB (which built the station) dismantled parts of the station and repaired or replaced the failed components.
On Monday, April 10th, the repairs were completed and we switched Kärnhuset back to normal power at 10:02 CEST. By 10:45 all compute nodes were started and running jobs.
In order to physically disconnect the diesel generator we will need to stop most compute nodes in Triolith, Gamma and Bi one more time, on Wednesday April 12th.
2017-04-05 16:50 - 18:30 (shorter for many systems): Power outage in the Linköping grid affecting Triolith, Gamma, Bi and Frost compute nodes. All running jobs failed. See this email thread for more information.
2017-03-28 02:30 - 06:45: Power outage affecting one NSC computer room. At 02:30, power was lost to the room where the Triolith, Bi and Frost compute nodes are located. All running jobs failed. All systems were available to users from 06:45.
2017-03-06 07:30 - 08:10: Campus-wide power outage affecting both NSC computer rooms and offices. Most systems went down and all running compute jobs failed. Most systems were back up before lunch. Triolith and Gamma were kept offline to perform maintenance that was originally scheduled for Thursday March 9th.
An overview of the overall system load is available on the status page.
Graphical representations of the current queue on some of our systems: