Down time

21.07.2014 21:14

You might have noticed that for the past two days or so this website was off-line. The reason for it is a bit curious.

On Friday, 18 July at 17:45, two of my Ethernet switches failed - one in Logatec and one in Ljubljana, separated by around 35 km. They crashed simultaneously at the exactly the same second.

Here are the relevant log entries of two machines connected to them. Both had clocks synchronized via NTP, so the time stamps should be fairly accurate. The machines logged these messages when their Ethernet adapters reported the loss of carrier signal on the cable from their respective switches:

Jul 18 17:45:21 gildedale kernel: [1658547.286187] PHY: sunxi_gmac-0:00 - Link is Down
Jul 18 17:45:21 chandra kernel: [512418.004319] e100 0000:00:0c.0: eth3: NIC Link is Down

These two pieces of equipment were geographically separated and had nothing in common. As far fetched as it seems that they would fail because of a common reason, this appears to be the case here.

Newspapers reported on Friday afternoon that a small explosion occurred at a transformer station in Ljubljana. The distribution company confirms the event, but doesn't share many details. There is also no official source of the exact time, but the first tweet about it appeared at 17:45.

I guess that whatever occurred at the transformer station caused a transient on the power grid that crashed my switches. It must have been fairly short because two other computers that were connected to the same outlet did not reboot or report any problems.

It's curious that this effect threw both of the switches in a state where they didn't reboot, but didn't function either. One of these is a fairly old, low-cost affair that I have seen behave in this way a few times before. The other one is integrated into a Linksys WRT54GL access point that has been fairly reliable so far.

This is the second time this year that something completely unexpected happened with my servers. It's not that I try to run some kind of high-reliability shop, but it is annoying and makes me appreciate the work real system administrators do each day. I guess that's the cost of trying to stay away from the cloud these days.

Posted by Tomaž | Categories: Life

Comments

Amazing. Did they die or just hang?

I just remembered that this was also the time when my NAS went bananas. It suddenly turned off for about a second before coming back to life, at first misreporting disk statuses (heart stops a bit when 3 disks in RAID6 configuration are marked as faulty).

Luckily everything turned out fine, but it was another push that I need to have a better backup solution.

No, they just hung. Disconnecting from power for a minute and plugging back brought both of them back to life.

Actually, I see in the logs that the WRT54GL interface went up every few hours for two seconds or so, then went dead again. The other switch just appeared dead with all LEDs lit on the panel.

Posted by Tomaž

Add a new comment


(No HTML tags allowed. Separate paragraphs with a blank line.)