On Atmel SerialNet ZigBit modules

13.08.2012 22:27

Don't use Atmel BitCloud/SerialNet ZigBit modules.

With this important public service announcement out of the way, let me start at the beginning.

Atmel makes ZigBit modules that contain an IEEE 802.15.4-compatible integrated radio from their AT86RF2xx family and an AVR-based microcontroller on a small hybrid component. The CPU runs a proprietary mesh-networking stack (BitCloud) built on top of the ZigBee specification and exposes a high-level interface on a serial line they call SerialNet (think "send the following data to this network address"-style interface). The module can be used either as a very simple way of adding mesh networking to some host device or as a stand-alone microcontroller with a built-in radio (Atmel provides a proprietary BitCloud SDK, so you can build your own firmware for the AVR).

Atmel ZigBit module on a VESNA SNR-MOD board.

At SensorLab we built a sensor node radio board for VESNA using these modules (more specifically, ATZB 900 B0 for 868 MHz and ATZB 24 B0 for 2.4 GHz links) as they appeared to be simple to use and would provide a temporary solution for connecting VESNAs with a wireless mesh until we come up with a working and reliable 6LoWPAN implementation. So far we have deployed well over 50 of these in different VESNA installations.

I can now say that these modules have been nothing but trouble from the start. First there is the issue of documentation. Atmel's documentation has always been superb in my memory. Compare one of their ATmega datasheets with the vague hand-waving STMicroelectronics calls microcontroller documentation and you'll know why. Unfortunately, the SerialNet user guide is an exception to this rule. They leave many corner cases undefined and you are left to your own experimentation to find out how the module behaves. There is almost no timing information. How long can you expect to wait for a response to a command? How long will the module be unresponsive and ignore commands after I change this setting? Even the hardware reset procedure is not described anywhere beyond a "Reset input (active low)".

The problems with this product, however, go deeper than this. In my experience developers, myself included, tend to be too quick to blame problems on bugs in someone else's code. When colleagues complained how buggy these modules are, I said that it's much more likely a problem in our code or hardware design. That was until I started investigating the numerous networking problems myself: the modules would return responses they shouldn't have according to the specification, and they would say that they are connected to the network even though no other network node could communicate with them. Modules would even occasionally persistently corrupt themselves, requiring firmware reprogramming before they would start responding to commands again. Believe me, it's annoying to reach for a JTAG connector when the module in question is on a lamp post in some other part of the country.

For most of these bugs I can only offer anecdotal evidence. However I have been investigating one important issue for around two months now and I'm confident that there is something seriously wrong with these modules. I strongly suspect there is a race condition somewhere in Atmel's (proprietary and closed-source, of course) code that causes some kind of buffer corruption when a packet is received over the radio at the same time as the module receives a command over the serial line. This will cause the module to lose bytes on the serial line, making it impossible to reliably decode the communications protocol.

For instance, this is what the communication should look like over the serial line. The host in this case is VESNA and the module is an Atmel ATZB 900 B0:

→ AT+WNWK\x0d                                # host asks for network status
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a  # module asynchronously reports received data
← OK\x0d\x0a                                 # module answers that network is OK
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a  # module asynchronously reports received data

This is what it sometimes looks like:

→ AT+WNWK\x0d
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
← OK\x0d                                     # note missing \x0a
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a

And sometimes it gets as bad as this:

→ AT+WNWK\x0d
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
← ODATA 0000,0,77:(77 bytes of data)\x0d\x0a # note only O from OK sent
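To make the effect on a host-side decoder concrete, here is a minimal Python sketch of a parser for this kind of stream. It is an illustration only, not the code running on VESNA. The header layout (address, a flag, payload length, then a colon) is inferred purely from the traces above, and the names parse_stream and DATA_HEADER are made up for this example. The resynchronisation step is also a simplifying assumption: it searches for the next "DATA " token, which would misfire if a binary payload happened to contain those bytes.

```python
import re

# Header layout inferred from the traces above: "DATA <addr>,<flag>,<length>:"
# (an assumption for illustration, not taken from the SerialNet user guide).
DATA_HEADER = re.compile(rb"DATA ([0-9A-Fa-f]{4}),(\d+),(\d+):")
EOL = b"\r\n"


def parse_stream(buf):
    """Split a captured byte stream into (kind, value) events.

    kind is "data", "ok" or "junk". Junk marks bytes that do not fit the
    expected framing, for example a truncated "OK" response.
    """
    events = []
    pos = 0
    while pos < len(buf):
        header = DATA_HEADER.match(buf, pos)
        if header:
            length = int(header.group(3))
            end = header.end() + length
            # Accept the frame only if the payload is followed by CR LF.
            if buf[end:end + 2] == EOL:
                events.append(("data", buf[header.end():end]))
                pos = end + 2
                continue
        if buf.startswith(b"OK" + EOL, pos):
            events.append(("ok", b""))
            pos += 4
            continue
        # Resynchronise: skip ahead to the next plausible DATA header.
        nxt = buf.find(b"DATA ", pos + 1)
        if nxt == -1:
            nxt = len(buf)
        events.append(("junk", buf[pos:nxt]))
        pos = nxt
    return events


frame = b"DATA 0000,0,77:" + b"x" * 77 + EOL

good = parse_stream(frame + b"OK\r\n" + frame)
missing_lf = parse_stream(frame + b"OK\r" + frame)
jumbled = parse_stream(frame + b"O" + frame)

print([kind for kind, _ in good])        # ['data', 'ok', 'data']
print([kind for kind, _ in missing_lf])  # ['data', 'junk', 'data']
print(jumbled[1])                        # ('junk', b'O')
```

In both corrupted cases the DATA frames survive, but the host can no longer tell whether the command succeeded. It can only guess what the leftover bytes were meant to be.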

An inviting explanation for these problems would be that we have a bad implementation of a UART on VESNA. Except that this happens even when the module is connected to a computer via a serial-to-USB converter, and I have traces from a big and expensive Tektronix logic analyzer (as well as Sigrok) to prove that the corrupted data is indeed present on the hardware serial line and is not an artifact of some bug on the host side:

Missing Line Feed character from an Atmel ZigBit module.

A logic analyzer trace demonstrating a missing line feed character.

Data corruption on the serial line from an Atmel ZigBit module.

A logic analyzer trace demonstrating a jumbled-up OK and DATA response.

I have seen this happen in the lab under controlled conditions on 10 different modules and have good reasons to suspect the same thing is happening on the deployed 50-plus modules. Also, this bug is present in both BitCloud 1.14 and 1.13 and in both vanilla and security-enabled builds. All of this points to the fact that this problem is not due to some isolated fluke on our side.

For well over a month I have been on the line with Atmel technical support, and while they have politely answered all of my mail, they have also failed to acknowledge the issue or provide any helpful information, even though I sent them a simple test case that reliably reproduces the problem in a few seconds. Of course, without their help there is exactly zero chance of getting to the bottom of this, and given all of the above I seriously doubt this is anything other than a bug in their firmware.

At this point I have mostly given up any hope that this issue will be resolved. During my investigation I did find out that decreasing the amount of chatter on the serial line decreases the probability of errors, so I did manage to work around this bug a bit by switching to non-verbose responses (ATV0) and using packets that are a few bytes shorter than the maximum (say 75 bytes for encrypted frames). This will hopefully improve the reliability of already deployed hardware. For the future, we will be looking into alternatives, as unfortunately 6LoWPAN still seems to be somewhat outside of our grip.
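The packet-size half of that workaround amounts to splitting outgoing payloads before handing them to the module. A minimal Python sketch follows. The function name split_payload is hypothetical, the 75-byte default is simply the figure mentioned above for encrypted frames, and reassembly on the receiving side is left out entirely.

```python
def split_payload(payload, max_len=75):
    """Split payload into chunks of at most max_len bytes each.

    max_len defaults to 75, the size mentioned above for encrypted frames;
    the right value for other configurations is an open assumption.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [payload[i:i + max_len] for i in range(0, len(payload), max_len)]


chunks = split_payload(b"a" * 200)
print([len(c) for c in chunks])  # [75, 75, 50]
```

Each chunk would then be sent as a separate frame. This reduces the probability of hitting the bug but, as noted above, does not eliminate it.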

Posted by Tomaž | Categories: Digital

Comments

I already told you about this one some time ago, but just in case, here it is again:
http://powwow.gforge.inria.fr/

I don't know much about these sensor networks, but it looks close enough to your needs ?

Thanks, I remember that link. We are developing our own VESNA sensor platform and actually we also have radio boards for it with CC2500-series radios, which is what PowWow is using. It's just that early on it was decided that open-source networking stacks (6LoWPAN, Rime) are not yet good enough for us and we went with this proprietary Atmel solution.

Since among other things VESNA itself is the focus of research here we can't replace it with something else. But porting PowWow software stack to it might be an interesting possibility.

Posted by Tomaž

Yes, I was more thinking of the software side of the project. Your hardware seems to be working ok, right ?

The software stack was cleaned up a bit as part of the open sourcing project, to make it easier to port to different hardware. If the radio board is the same, then it should be even easier.
Now, this likely isn't a perfect solution, so take more care in evaluating it and the other solutions before choosing one :). I don't want to make you lose your time on another unfitting solution.

Sure. I think we learned a big lesson here regarding evaluation and testing before deploying things in the field.

Posted by Tomaž

This is very interesting, as I have been messing around with these modules for over a year now. I got the same stability issues as you have seen here. I tested various versions of the BitCloud stack and SerialNet but I was not able to find the root of the problems. We have a simple setup with a coordinator and just one end device. The coordinator is sending some commands to the end device and is polling some sensor data from the end device.
Every few minutes we got an error code 4 in response to an 'atr' command, and then sometimes the serial buffers seem to be messed up and the coordinator answers with the line from the former command. Only 'atz' can help in these cases.
And we have seen various problems in setting up the mesh network. Sometimes it works well, sometimes the coordinator can't find the end device or the other way round, and sometimes the order of power-on (end device first or coordinator first) has an influence.
If we're lucky it is 'just' a software problem and maybe using the new 'Atmel® Lightweight Mesh software Stack' will help. We have >500 modules in the field and they are not used today. But they should be activated in the future, so it would be a bad solution to throw them away...

Bye,

Posted by Oliver

Hi, it's been a while since this post. Any changes? Is it safe to use nowadays?

Posted by p0fi

Hi Tomaz, any update from Atmel on the issue you are having?

Posted by Slobodan

p0fi, Slobodan, to my knowledge Atmel has not released any new firmware versions that would address these issues.

Posted by Tomaž

Sorry for double post above.
I guess the only way to go now is the Freescale solution. Hmm. I don't think I mind, except Freescale IDEs are not free and their solutions are more expensive for hobbyists like me.

Thank you for this article. It helped a lot on deciding which technology to go with.

Posted by Slobodan

Hi!

Since I'm planning to develop my own network, based on Atmel's BitCloud solution, this post of yours makes me rethink my choice.
Have you figured out a solution to these errors you found? I see Atmel hasn't updated the BitCloud since, so I guess the problem is still there...

Do you have an alternative idea? I looked at Freescale, but I find no ready-to-use modules based on Freescale KW20 Kinetis wireless microcontroller. I would not start designing my own module with this tricky SMT pad of this micro, RF components etc...
Do you know whether the Freescale tools are free? As far as I can see they are, but somebody above mentioned that they aren't.

thanks for your answers.

Posted by Miklos

Miklos, there is no solution to all of these bugs. Our current code is now full of workarounds for various problems with Zigbit. It works well enough to be useful in practice if you accept the fact that you will occasionally have data corruption and that sometimes devices will simply fail in mysterious ways. Over the course of a year we have seen maybe 50% node uptime.

After this I wasn't involved in any large new sensor deployment so I haven't researched any other alternatives or touched the Zigbit code much after this post was written. Our plan is still to move to 6LoWPAN using Contiki and AT86RF212 radios eventually, but after all this time it still looks far from being usable in production on our deployed hardware (Contiki has terrible support for ARM-based systems out of the box and we don't have the manpower to keep working on a fork).

I have no experience with the Freescale solution you mention. I know colleagues have also looked into some new low-powered Wi-Fi solutions that have comparable power consumption to Zigbit, but I don't know any specific details.

Posted by Tomaž

Hi Tomasz... and everyone who has had issues with the Serialnet software. I came across this post a while back when developing around SerialNet and had similar issues.
Good news is after many support emails sent back and forth to Atmel they finally sent the SerialNet code to me. It has been released under the same license as BitCloud (however Atmel has specifically said that they do NOT provide support for the code, only the AT commands and BitCloud itself).
Here is the link to the Serialnet source: http://d01.megashares.com/dl/dvx9opU/SerialNet_BC_1_14_0_src.zip

Instructions (received with the source code):
Extract package and place inside Applications directory of Bitcloud SDK for Zigbit v 1.14.0. Then after required modifications and Configuration, you can compile the application. Ensure that the correct path to Makefile is provided in the Project Configuration options in AVR Studio 4.

Posted by Adam

Adam, thank you very much for sharing this.

From a quick look it seems that archive contains platform-independent networking code. I guess the SDK contains the platform-specific parts.

It's unlikely I'll take the time to find the bugs we are seeing with the modules. At this point we are treating hardware using ZigBit as legacy that will eventually be replaced. Still, it's interesting to take a peek inside the black box.

Posted by Tomaž

Yeah there is quite a bit of code in there and I didn't get to go through too much of it.
I didn't get hit that hard with the newline issue you were having (I got the occasional missing new line) but I didn't use the Zigbit boards (I used the RCB board with the 900MHz Zigbee module). Also I only had 4 nodes on the network.

What I think is happening is that SerialNet has to deal with 2 tasks that happen in real time (Data/AT commands from UART and Packet sending/routing).
My guess is the more nodes that are on the network, the worse the issue gets (due to more packet routing). The fix would be to find the bug in the code (likely not that easy)... or perhaps the SoC is not capable of doing both these tasks (I feel the 8 KB RAM in the ATmega is too low for this network stack).

There is also a debug mode switch in the configuration file. I didn't get a chance to try it, but it's there if anyone out there wants to have a go.

Posted by Adam
