VESNA reliability and failure modes
As you might know from my previous writings and talks, the Jožef Stefan Institute runs an experimental wireless communications testbed as part of the European FP7 CREW project. The testbed is located in Logatec, a small town around 30 km from Ljubljana, and is unimaginatively called Log-a-tec. It consists of 54 VESNA devices mounted outdoors on street lights.
Each node has a 24-hour power supply, but no wired communication lines to other nodes. Instead it has three separate radios. One of them connects to a ZigBee mesh network that is used for management purposes. The other two are used to set up experimental networks and to measure the usage of the radio frequency spectrum.
The testbed is divided into three separate clusters. One ZigBee coordinator node per cluster provides a gateway from the mesh network to the Internet.
The testbed was deployed in steps around June 2012. It has been operating continuously since then, and while its reliability has been patchy at best, it has nevertheless supported several experiments.
In the near future we are planning the first major maintenance operation. Nodes that have failed since deployment have already been unmounted. They will have their failed components replaced and will eventually be mounted back in their positions on the street lights. I therefore think now is the perfect time to look back at the last year and a half and see how well the testbed has been doing overall.
First, here are some basic reliability indicators for the period between August 2012 and November 2013:
- Average availability of nodes (ping): 44.6%
- Average time between resets (uptime): 26 days
- Nodes never seen on the network: 24% (13 of 54)
The following two graphs show availability and uptime per individual node, colored by cluster. The 13 nodes that have never been seen on the network are not shown (they have 0% availability and zero uptime). Also note that when a cluster's coordinator (node00) was down, the whole cluster was usually unreachable.
I have also been working on diagnosing specific problems with failed nodes. Unfortunately, because the work was sometimes rushed due to impending deadlines, my records are not as good as I would wish. Hence I can't easily give an exact breakdown of how much downtime was due to which problem. If I ever find the time to go through my mail archive and gather all my old notes, I might write a more detailed report.
However, since I am getting a lot of questions regarding what exactly went wrong with the nodes, here is a more or less complete list of the problems I found, divided into those that only occurred once and those that kept recurring.
Recurring failures, ordered roughly by severity:
- Broken boxes. VESNA nodes have been mounted in boxes certified for outdoor use. Nevertheless, a lot of them have cracked since deployment. This often resulted in condensation and, in at least one case, a node submerged in water. Many of the other failures on this list were likely indirectly caused by this.
- I have already written about problems with Atmel ZigBit modules. While intermittent serial line problems have been mostly worked around, persistent corruption of the ZigBit firmware was one of the most common reasons why a node would not be reachable on the network. A corrupted ZigBit module does not join the mesh and requires firmware reprogramming to restore, something that cannot be done remotely.
- There have been some problems with an old version of our network driver that would sometimes fall into an infinite loop while it kept resetting the watchdog (a sketch of the problematic pattern follows this list). Since we have no other means of remotely resetting a node, this bug caused a lot of downtime in the early days of the deployment. It proved so hard to debug that I ended up rewriting the problematic part of the code from scratch.
- Texas Instruments CC-series transceiver degradation. While this has not resulted in node downtime (and is not counted in the statistics above), it has nonetheless rendered several nodes useless for experiments.
- Failed microcontroller flash. Due to an unfortunate design decision in VESNA's bootloader, a block of flash is reprogrammed on each boot. For nodes that were rebooting frequently (often because of other problems), this commonly resulted in stuck bits and a failed node.
- Failed SD card interface. For mass storage, VESNA uses an SD card, and on several nodes the card interface has become inoperable. Since the SD card itself can still be read in another device, I suspect the connector (which was not designed for outdoor use).
- Failed MRAM interface. In addition to the SD card, there is a small amount of non-volatile MRAM on board, and on several nodes it has failed for an unknown reason.
- People unplugging UTP cables and other problems with Internet connectivity at the remote end beyond our control.
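To illustrate why the network driver bug mentioned above was such a problem, here is a minimal sketch of the pattern in C. The function names are made up for illustration and this is not the actual VESNA driver code; it only shows how kicking the watchdog inside an unbounded wait loop defeats the watchdog as a recovery mechanism.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the real driver and watchdog calls;
 * the names are illustrative, not the actual VESNA API. */
static bool radio_response_ready(void) { return false; /* reply never arrives */ }
static void watchdog_kick(void)        { /* reload the hardware watchdog */ }

/* The problematic pattern: wait for a radio response with no timeout,
 * kicking the watchdog on every iteration. If the response never comes,
 * the loop spins forever, the watchdog never fires, and the node stays
 * hung until someone power-cycles it on site. */
static void wait_for_response_buggy(void)
{
    while (!radio_response_ready())
        watchdog_kick();    /* prevents the watchdog from ever resetting us */
}

/* A safer variant: bound the wait, so a missing response eventually
 * surfaces as an error (or lets the watchdog reset the node). */
static bool wait_for_response_bounded(unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if (radio_response_ready())
            return true;
        watchdog_kick();
    }
    return false;           /* caller can reset the driver or reboot */
}

int main(void)
{
    (void)wait_for_response_buggy;  /* not called: it would hang forever */

    if (!wait_for_response_bounded(1000))
        puts("response timed out, resetting driver");
    return 0;
}
```

I can't claim the rewritten driver looks exactly like the second variant, but the general idea is the same: any wait on an external event should be bounded, so that a stuck peripheral ends in an error path or a watchdog reset instead of a silent hang.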
One-time failures:
- Digi Connect ME module SSL implementation bug.
- Failed Ethernet PHY on a Digi Connect ME module. While these two problems only occurred once each, they were responsible for a lot of downtime for the whole City center cluster.
- Failed interrupt request line on a CC1101 transceiver. Unknown reason, could be bad soldering.