Remember my recent story about Atmel modules? For the last two days I had a very strong déjà-vu feeling when I was trying to debug an issue that popped up in the last minute and is preventing a somewhat time critical experiment from being performed on one of our VESNA testbeds.
It turned out that while 7-bit ASCII strings were correctly transferred between VESNAs and a client calling a HTTP API, binary data sent down the same pipeline got corrupted somewhere on the way. Plus, to make things just a bit more interesting, it sometimes also made the whole network of VESNAs unreachable.
Now, problems like this are unfortunately quite a common occurrence. Before data from a sensor node reaches a client, it must pass through multiple hops in a (notoriously unreliable) ZigBee mesh network, a coordinator VESNA that serves as a proxy between the mesh and a TCP/IP tunnel, a Java application on the end of the tunnel that translates between a home-grown HTTP-like protocol used between VESNAs and proper HTTP and finally an Apache server acting as a reverse-proxy for the public HTTP API. Leaky abstractions and encapsulations are a plenty and it's not unusual to have some code somewhere along the line assume that the data it is passing is ASCII-only, valid UTF-8 or something else entirely.
I won't bore you with too many details about debugging. I opted for a top-down approach, with the Java layer being the main suspect based on previous experience. After that got a thorough check, I moved to sniffing the tunnel with Wireshark, which has an extra complication that it's SSL encrypted. Luckily, it doesn't seem to use Diffie–Hellman key exchange, meaning that the SSL dissector was quite effective once I figured out how to extract private keys from Java's keystore. That also seemed to look OK, so the next layer down was the SSL tunnel endpoint, which is a Digi Connect ME module.
This is basically a black box that takes a duplex RS-232 connection on one end and tunnels it through an encrypted TCP/IP connection. It's an deceptively simple description though. In fact Digi Connect ME is a miniature computer with an integrated web server for configuration, scripting support and a ton of other features (I have close to 500 pages of documentation for it on my computer and I only downloaded documents I though might be useful in debugging this issue).
Anyway, when I looked closely, the problem was quite apparent. On the RS-232 side the module was set to use XON/XOFF software flow-control. This obviously won't work when sending arbitrary binary data. Not only will the module interpret XON and XOFF bytes as special and so drop them from the pipeline, an XOFF that is not followed by XON will also halt the transmission, leading to hangs and timeouts. The fix looked simple enough: switch to hardware flow control using dedicated CTS and RTS lines.
As you might have guessed, it was not that easy. It turns out that when hardware flow control is enabled in Digi Connect ME, it will randomly duplicate characters when sending them down the serial line. Again, the suspicion first fell on our homegrown UART driver on the VESNA side, but a logic analyzer trace below confirmed that it's the actual Digi Connect module that is the culprit and not some bug on our side.
Now, at this point I'm seriously confused. Digi Connect ME is quite popular and, at least judging from browsing the web, used in a lot of applications. But after the ZigBit module it's also the second piece of hardware that exhibits such broken behavior that I can't believe it has passed any thorough quality check. All of my experience speaks that it must be us that are doing something wrong and in both cases I have tested just about all possibilities where VESNA could be doing something wrong. Actually, I want it to be a mistake on our end because that means I can fix it. But honestly, once you see wrong data being sent on a logic analyzer, I don't think there can be any more doubt. Must really everything turn rotten once you look into it with enough detail?