Follow up on Atmel ZigBit modules

27.08.2014 12:09

I've ranted before about the problematic Atmel ZigBit modules and the buggy SerialNet firmware. In my back-of-the envelope analysis of failure modes in the Jožef Stefan Institute's sensor networks, one particular problem stood out related to this low-powered mesh networking hardware: a puzzling failure that prevents a module from joining the mesh and can seemingly be fixed by reprogramming the module's firmware.

A week ago Adam posted a link to the SerialNet source in a comment to my old blog post. While I've mostly moved on to other things, this new piece of information gave me sufficient excuse to spend another few hours exploring this problem.

Atmel ATZB_900_B0 module on a VESNA SNR-MOD board.

A quick look around Atmel's source package revealed that it contains only the code for the serial interface to the underlying proprietary ZigBee stack. There are no low-level hardware drivers for the radio and no actual network stack code in there. It didn't seem likely that the bug I was hunting was caused by this thin AT-command interface code. On the other hand, this code could be responsible for dropping out characters in the serial stream. However we have sufficient workarounds in place for that bug and it's not worth spending more time on it.

One thing caught my eye in the source: ATPEEK and ATPOKE commands. ATZB-900-B0 modules consist of an ATmega1281 microcontroller and an AT86TF212 transceiver. These two commands allow for raw access to the radio hardware registers, microcontroller RAM, code flash ROM and configuration EEPROM. I thought that given these, maybe I could find out what gets corrupted in module's non-volatile memories and perhaps fix it through the AT-command interface.

Only after figuring out how to use them from studying the source, I found out that these two commands have been in fact documented in revision 8369B of the SerialNet User Guide. Somehow I overlooked this addition previously.


For the sake of completeness, here is a more detailed description of the problem:

A module that previously worked fine and passed all of my system tests will suddenly no longer respond to an AT+WJOIN command. It will not respond with either OK nor ERROR (or their numeric equivalents). However it will respond to other commands in a normal fashion. This can happen after the module has been deployed for several months or only after a few hours.

A power cycle, reset or restoring factory defaults does not fix this. The only known way of restoring the module is to reprogram its firmware through the serial port using Atmel's Bootloader PC Tool for Windows. This reprogramming invokes a bootloader mode on the module and refreshes the contents of the microcontroller's flash ROM as well as resets the configuration EEPROM contents.

It appears that this manifests more often with sensor nodes that are power-cycled regularly. However, in our setup a node only joins the network once after a power-cycle. Even if the bug is caused by some random event that happens anytime during the uptime of the node, it will not be noticed until the next power cycle. So it is possible that it's not the power cycling itself that causes the problem. Aggressive power-cycling tests as well don't seem to increase the occurrence of the bug.


So, with the new found knowledge of ATPEEK I dumped the contents of the EEPROM and flash ROM from two known-bad modules and a few good ones. Comparing the dumps revealed that both of the bad modules are missing a 256 byte block of code from the flash starting at address 0x00011100:

--- zb_046041_good_flash.hex	2014-08-25 16:41:51.000000000 +0200
+++ zb_046041_bad_flash.hex	2014-08-25 16:41:47.000000000 +0200
@@ -4362,22 +4362,8 @@
 000110d0  88 23 41 f4 0e 94 40 88  86 e0 80 93 d5 13 0e 94  |.#A...@.........|
 000110e0  f0 7b 1c c0 80 91 da 13  88 23 99 f0 82 e2 61 ee  |.{.......#....a.|
 000110f0  73 e1 0e 94 41 0c 81 e0  80 93 e5 13 8e e3 91 e7  |s...A...........|
-00011100  90 93 e7 13 80 93 e6 13  8b ed 93 e1 0e 94 86 14  |................|
-00011110  05 c0 0e 94 9a 70 88 81  0e 94 b5 70 df 91 cf 91  |.....p.....p....|
-00011120  08 95 fc 01 80 81 88 23  29 f4 0e 94 40 88 0e 94  |.......#)...@...|
-00011130  e1 71 08 95 0e 94 b5 70  08 95 a2 e1 b0 e0 e3 ea  |.q.....p........|
-00011140  f8 e8 0c 94 71 f4 80 e0  94 e0 90 93 c7 17 80 93  |....q...........|
-00011150  c6 17 0e 94 0f 78 80 91  d9 13 83 70 83 30 61 f1  |.....x.....p.0a.|
-00011160  88 e2 be 01 6f 5f 7f 4f  0e 94 41 0c 89 81 88 23  |....o_.O..A....#|
-00011170  19 f1 0e 94 a6 9f 6b e1  70 e1 48 e0 50 e0 0e 94  |......k.p.H.P...|
-00011180  47 f5 8c 01 8b e2 be 01  6e 5f 7f 4f 0e 94 41 0c  |G.......n_.O..A.|
-00011190  8a 81 88 23 19 f0 01 15  11 05 71 f4 8e 01 0d 5f  |...#......q...._|
-000111a0  1f 4f 87 e2 b8 01 0e 94  41 0c c8 01 60 e0 0e 94  |.O......A...`...|
-000111b0  22 5c 80 e0 0e 94 8d 5c  80 91 d9 13 81 ff 14 c0  |"\.....\........|
-000111c0  0e 94 40 88 0e 94 68 67  80 91 40 10 87 70 19 f4  |..@...hg..@..p..|
-000111d0  0e 94 e1 71 15 c0 81 50  82 30 90 f4 86 e0 80 93  |...q...P.0......|
-000111e0  d5 13 0e 94 f0 7b 0c c0  80 91 40 10 87 70 19 f4  |.....{....@..p..|
-000111f0  0e 94 bf 71 05 c0 81 50  82 30 10 f4 0e 94 df 70  |...q...P.0.....p|
+00011100  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+*
 00011200  62 96 e4 e0 0c 94 8d f4  80 91 d7 13 90 91 d8 13  |b...............|
 00011210  00 97 d9 f4 84 e5 97 e7  90 93 d8 13 80 93 d7 13  |................|
 00011220  80 91 40 10 87 70 82 30  11 f4 0e 94 04 77 0e 94  |..@..p.0.....w..|

This is puzzling for several reasons.

First of all, it seems unlikely that this is a hardware problem. Both bad modules (with serial numbers apart by several 10000s) had lost the same block of code. If flash lost its contents due to out-of-spec voltage during programming or some other hardware problem I would expect the bad address to be random.

However a software bug causing a failure like that seems highly unlikely as well. I was expecting to see some kind of an EEPROM corruption. EEPROM is used to persistently store module settings and I assume the firmware often writes to it. However flash ROM should be mostly read-only. I find it hard to imagine what kind of a bug could erase a block - reprogramming the flash on a microcontroller is typically a somewhat complicated procedure that is unlikely to come up by chance.

One possibility is that we are maybe somehow unknowingly invoking the bootloader mode during the operation of the sensor node. During my testing however I found out that just invoking the serial bootloader mode without also supplying it with a fresh firmware image corrupts the flash ROM sufficiently that the module does not boot at all. The Bootloader PC Tool seems to suggest that these modules also have some kind of an over-the-air upgrade functionality, but I haven't yet looked into how that works. It's possible we're enabling that somehow.

Unfortunately, the poke functionality does not allow you to actually write to flash (you can write to RAM and EEPROM though). So even if this currently allows me to detect a corrupt module flash while the node is running, that is only good for saying that the module won't come back on-line after a reboot. I can't fix the problem without fully reprogramming the firmware. This means either hooking the module to a laptop or implementing the reprogramming procedure on the sensor node itself. The latter is not trivial, because it involves implementing the programming protocol and somehow arranging for the storage of a complete uncorrupted SerialNet firmware image on the sensor node.

Posted by Tomaž | Categories: Code

Add a new comment


(No HTML tags allowed. Separate paragraphs with a blank line.)