Don't use Atmel BitCloud/SerialNet ZigBit modules.
With this important public service announcement out of the way, let me start at the beginning.
Atmel makes ZigBit modules that contain an IEEE 802.15.4-compatible integrated radio from their AT86RF2xx family and an AVR-based microcontroller on a small hybrid component. The CPU runs a proprietary mesh-networking stack (BitCloud) built on top of the ZigBee specification and exposes a high-level interface on a serial line they call SerialNet (think "send the following data to this network address"-style interface). The module can be used either as a very simple way of adding mesh networking to some host device or as a stand-alone microcontroller with a built-in radio (Atmel provides a proprietary BitCloud SDK, so you can build your own firmware for the AVR).
At SensorLab we built a sensor node radio board for VESNA using these modules (more specifically, ATZB 900 B0 for 868 MHz and ATZB 24 B0 for 2.4 GHz links) as they appeared to be simple to use and would provide a temporary solution for connecting VESNAs with a wireless mesh until we come up with a working and reliable 6LoWPAN implementation. So far we have deployed well over 50 of these in different VESNA installations.
I can now say that these modules have been nothing but trouble from the start. First there is the issue of documentation. Atmel's documentation has always been superb in my memory. Compare one of their ATmega datasheets with the vague hand-waving STMicroelectronics calls microcontroller documentation and you'll know why. Unfortunately, the SerialNet user guide is an exception to this rule. They leave many corner cases undefined and you are left to your own experimentation to find out how the module behaves. There is almost no timing information. How long can you expect to wait for a response to a command? How long will the module be unresponsive and ignore commands after I change this setting? Even the hardware reset procedure is not described anywhere beyond a "Reset input (active low)".
The problems with this product however go deeper than this. In my experience developers, my self included, tend to be too quick to blame problems on bugs in someone else's code. When colleagues complained how buggy these modules are I said that it's much more likely a problem in our code or hardware design. That is until I started investigating myself the numerous problems we had with networking: the modules would return responses they shouldn't have according to the specification, they would say that they are connected to the network even though no other network node could communicate with them. Modules would even occasionally persistently corrupt themselves, requiring firmware reprogramming before they would start responding to commands again. Believe me, it's annoying to reach for a JTAG connector when the module in question is on a lamp post in some other part of the country.
For most of these bugs I can only offer anecdotal evidence. However I have been investigating one important issue for around two months now and I'm confident that there is something seriously wrong with these modules. I strongly suspect there is a race condition somewhere in Atmel's (proprietary and closed-source, of course) code that causes some kind of buffer corruption when a packet is received over the radio at the same time as the module receives a command over the serial line. This will cause the module to lose bytes on the serial line, making it impossible to reliably decode the communications protocol.
For instance, this is how the communications should look like over the serial line. Host in this case is VESNA and module is Atmel ATZB 900 B0:
→ AT+WNWK\x0d # host asks for network status ← DATA 0000,0,77:(77 bytes of data)\x0d\x0a # module asynchronously reports received data ← OK\x0d\x0a # module answers that network is OK ← DATA 0000,0,77:(77 bytes of data)\x0d\x0a # module asynchronously reports received data
This is how it sometimes looks like:
→ AT+WNWK\x0d ← DATA 0000,0,77:(77 bytes of data)\x0d\x0a ← OK\x0d # note missing \x0a ← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
And sometimes it gets as bad as this:
→ AT+WNWK\x0d ← DATA 0000,0,77:(77 bytes of data)\x0d\x0a ← ODATA 0000,0,77:(77 bytes of data)\x0d\x0a # note only O from OK sent
An inviting explanation for these problems would be that we have a bad implementation of an UART on VESNA. Except that this happens even when the module is connected to a computer via a serial-to-USB converter and I have traces from a big and expensive Tektronix logic analyzer (as well as Sigrok) to prove that corrupted data is indeed present on the hardware serial line and not an artifact of some bug on the host side:
A logic analyzer trace demonstrating a missing line feed character. Click to enlarge.
A logic analyzer trace demonstrating a jumbled-up OK and DATA response. Click to enlarge.
I have seen this happen in the lab under controlled conditions on 10 different modules and have good reasons to suspect the same thing is happening on the deployed 50-plus modules. Also, this bug is present in both BitCloud 1.14 and 1.13 and in both vanilla and security-enabled builds. All of this points to the fact that this problem is not due to some isolated fluke on our side.
For well over a month I have been on the line with Atmel technical support and while they had politely answered all of my mail they had also failed to acknowledge the issue or provide any helpful information even though I sent them a simple test case that reliably reproduces the problem in a few seconds. Of course, without their help there is exactly zero chance of getting to the bottom of this and given all of the above I seriously doubt this is anything else than a bug in their firmware.
At this point I have mostly given up any hopes that this issue will be resolved. During my investigation I did find out that decreasing the amount of chatter on the serial line decreases the probability of errors, so I did manage to work around this bug a bit by switching to non-verbose responses (ATV0) and using packets that are a few bytes shorter than the maximum (say 75 bytes for encrypted frames). This will hopefully improve the reliability of already deployed hardware. For the future, we will be looking into alternatives, as unfortunately 6LoWPAN still seems to be somewhat outside of our grip.