I've written before about problems with VESNA deployments that have come to consume large amounts of time and nerves. Several of these have come in turn from two proprietary microprocessor modules we use: Digi Connect ME for Ethernet connectivity and Atmel SerialNet for IEEE 802.15.4 mesh networking.
One of these issues, which now finally seems to be just on the brink of being resolved, has been been dragging on from the late summer last year. We have deployed several Digi Connect ME modules as parts of gateways between IEEE 802.15.4 mesh in clusters of VESNA nodes and the Internet. One of deployments has proved especially problematic. Encrypted SSL connections from the module would randomly get dropped and re-connect only after several hours of downtime.
The issue at first proved impossible to reproduce in a lab environment and since the exact same device worked on other networks the ISP and the firewall performing NAT was blamed. However, several trips to the location and many packet captures later I could find no specific problem with TCP or IP headers I could point my finger to. We replaced a network switch with no effect. Later, by experimenting with Digi Connect TCP keep-alive settings, a colleague found a setting that caused the dropped connection to be re-established immediately instead of causing hours of down-time, making the deployment at least partially useful.
Finally, last week I managed to reproduce the problem on my desk. I noticed that the TCP connections from that location had an unusually low MSS - just 536 bytes. By simulating this I could reliably reproduce connection drops and by experimenting further I found out that SSL data records fragmented in a particular way will cause the module to drop the connection. It was somewhat specific to the Java SSL implementation we used on the other end of connection and very unlikely to happen with other connections that used larger segment sizes.
The cause of the issue was therefore in the Digi Connect module. Before having a reproducible test case I haven't even considered a possibility that a change on the link layer somewhere in the route could trigger a bug at the application layer.
After I had that piece of information, a helpful member of the support forums quickly provided a solution. The issue itself however is not yet resolved since the change in the firmware broke all sorts of other things which now need to be looked into and fixed as well.
I can't say that all of our hard-to-solve bugs came from Digi Connect or Atmel modules. We caused plenty ourselves. But having now experienced working with these two fine products, my opinion is that less time would be wasted if we went for a lower-level solution (just an interface on the physical layer) and then used an open source network stack on top. It would take more time to get to a working solution but I think problems would be much easier to diagnose and solve than with what is essentially a magical black box.
Both Digi Connect and Atmel modules suffer from the fact that they hide some very complex machinery behind a very simplistic interface. Aside from the problem of leaky abstractions, when the machinery itself fails, they provide no information that would help you work around the problem (solving it is out of the question anyway because of proprietary software). Both also come with documentation that is focused on getting a working system as fast as possible, but lacks details on what happens in corner cases. These are mostly left to your imagination and experiments and as experience has shown, behavior can change between firmware revisions. In most cases you can't even practically test against these changes, since that would involve complicated hardware test harnesses.