Down time
You might have noticed that for the past two days or so this website was off-line. The reason for it is a bit curious.
On Friday, 18 July at 17:45, two of my Ethernet switches failed - one in Logatec and one in Ljubljana, separated by around 35 km. They crashed simultaneously, in exactly the same second.
Here are the relevant log entries of two machines connected to them. Both had clocks synchronized via NTP, so the time stamps should be fairly accurate. The machines logged these messages when their Ethernet adapters reported the loss of carrier signal on the cable from their respective switches:
Jul 18 17:45:21 gildedale kernel: [1658547.286187] PHY: sunxi_gmac-0:00 - Link is Down
Jul 18 17:45:21 chandra kernel: [512418.004319] e100 0000:00:0c.0: eth3: NIC Link is Down
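For anyone who wants to check the timestamps rather than eyeball them, here is a minimal sketch in Python that parses the two log lines and computes the difference between them. The function names are made up for illustration. Syslog timestamps carry no year, so the year is an assumption (2014 by default, a year in which 18 July fell on a Friday). The `%b` month parsing also assumes an English or C locale.

```python
from datetime import datetime

# The two log lines quoted above, one per machine.
LOG_LINES = [
    "Jul 18 17:45:21 gildedale kernel: [1658547.286187] PHY: sunxi_gmac-0:00 - Link is Down",
    "Jul 18 17:45:21 chandra kernel: [512418.004319] e100 0000:00:0c.0: eth3: NIC Link is Down",
]


def parse_syslog_time(line, year=2014):
    """Return the wall-clock time of a classic syslog line.

    The year is not part of the log format, so it is supplied by the
    caller (assumption: 2014).
    """
    stamp = " ".join(line.split()[:3])  # e.g. "Jul 18 17:45:21"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def skew_seconds(line_a, line_b, year=2014):
    """Absolute difference between two syslog timestamps, in seconds."""
    delta = parse_syslog_time(line_a, year) - parse_syslog_time(line_b, year)
    return abs(delta.total_seconds())


if __name__ == "__main__":
    print(skew_seconds(LOG_LINES[0], LOG_LINES[1]))
```

Since syslog only has one-second resolution here, a result of zero means the two events landed in the same second, not that they were exactly simultaneous.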
These two pieces of equipment were geographically separated and had nothing in common. As far-fetched as it seems that they would fail for a common reason, this appears to be the case here.
Newspapers reported on Friday afternoon that a small explosion occurred at a transformer station in Ljubljana. The distribution company confirmed the event, but didn't share many details. There is also no official source for the exact time, but the first tweet about it appeared at 17:45.
I guess that whatever occurred at the transformer station caused a transient on the power grid that crashed my switches. It must have been fairly short because two other computers that were connected to the same outlet did not reboot or report any problems.
It's curious that this effect threw both of the switches into a state where they didn't reboot, but didn't function either. One of these is a fairly old, low-cost affair that I have seen behave in this way a few times before. The other one is integrated into a Linksys WRT54GL access point that has been fairly reliable so far.
This is the second time this year that something completely unexpected happened with my servers. It's not that I try to run some kind of high-reliability shop, but it is annoying and makes me appreciate the work real system administrators do each day. I guess that's the cost of trying to stay away from the cloud these days.
Amazing. Did they die or just hang?