Follow up on Atmel ZigBit modules

27.08.2014 12:09

I've ranted before about the problematic Atmel ZigBit modules and the buggy SerialNet firmware. In my back-of-the-envelope analysis of failure modes in the Jožef Stefan Institute's sensor networks, one particular problem related to this low-powered mesh networking hardware stood out: a puzzling failure that prevents a module from joining the mesh and can seemingly be fixed by reprogramming the module's firmware.

A week ago Adam posted a link to the SerialNet source in a comment to my old blog post. While I've mostly moved on to other things, this new piece of information gave me sufficient excuse to spend another few hours exploring this problem.

Atmel ATZB-900-B0 module on a VESNA SNR-MOD board.

A quick look around Atmel's source package revealed that it contains only the code for the serial interface to the underlying proprietary ZigBee stack. There are no low-level hardware drivers for the radio and no actual network stack code in there. It didn't seem likely that the bug I was hunting was caused by this thin AT-command interface code. On the other hand, this code could be responsible for dropping characters in the serial stream. However, we have sufficient workarounds in place for that bug and it's not worth spending more time on it.

One thing caught my eye in the source: the ATPEEK and ATPOKE commands. ATZB-900-B0 modules consist of an ATmega1281 microcontroller and an AT86RF212 transceiver. These two commands allow raw access to the radio hardware registers, microcontroller RAM, code flash ROM and configuration EEPROM. I thought that, given these, maybe I could find out what gets corrupted in the module's non-volatile memories and perhaps fix it through the AT-command interface.

Only after figuring out how to use them from studying the source did I find out that these two commands have in fact been documented in revision 8369B of the SerialNet User Guide. Somehow I overlooked this addition previously.


For the sake of completeness, here is a more detailed description of the problem:

A module that previously worked fine and passed all of my system tests will suddenly no longer respond to an AT+WJOIN command. It will respond with neither OK nor ERROR (or their numeric equivalents). However, it will respond to other commands in a normal fashion. This can happen after the module has been deployed for several months or after only a few hours.

A power cycle, reset or restoring factory defaults does not fix this. The only known way of restoring the module is to reprogram its firmware through the serial port using Atmel's Bootloader PC Tool for Windows. This reprogramming invokes a bootloader mode on the module and refreshes the contents of the microcontroller's flash ROM as well as resets the configuration EEPROM contents.

It appears that this manifests more often with sensor nodes that are power-cycled regularly. However, in our setup a node only joins the network once after a power cycle. Even if the bug is caused by some random event that happens at any time during the uptime of the node, it will not be noticed until the next power cycle. So it is possible that it's not the power cycling itself that causes the problem. Aggressive power-cycling tests also don't seem to increase the occurrence of the bug.


So, with the newfound knowledge of ATPEEK I dumped the contents of the EEPROM and flash ROM from two known-bad modules and a few good ones. Comparing the dumps revealed that both of the bad modules are missing a 256-byte block of code from the flash starting at address 0x00011100:

--- zb_046041_good_flash.hex	2014-08-25 16:41:51.000000000 +0200
+++ zb_046041_bad_flash.hex	2014-08-25 16:41:47.000000000 +0200
@@ -4362,22 +4362,8 @@
 000110d0  88 23 41 f4 0e 94 40 88  86 e0 80 93 d5 13 0e 94  |.#A...@.........|
 000110e0  f0 7b 1c c0 80 91 da 13  88 23 99 f0 82 e2 61 ee  |.{.......#....a.|
 000110f0  73 e1 0e 94 41 0c 81 e0  80 93 e5 13 8e e3 91 e7  |s...A...........|
-00011100  90 93 e7 13 80 93 e6 13  8b ed 93 e1 0e 94 86 14  |................|
-00011110  05 c0 0e 94 9a 70 88 81  0e 94 b5 70 df 91 cf 91  |.....p.....p....|
-00011120  08 95 fc 01 80 81 88 23  29 f4 0e 94 40 88 0e 94  |.......#)...@...|
-00011130  e1 71 08 95 0e 94 b5 70  08 95 a2 e1 b0 e0 e3 ea  |.q.....p........|
-00011140  f8 e8 0c 94 71 f4 80 e0  94 e0 90 93 c7 17 80 93  |....q...........|
-00011150  c6 17 0e 94 0f 78 80 91  d9 13 83 70 83 30 61 f1  |.....x.....p.0a.|
-00011160  88 e2 be 01 6f 5f 7f 4f  0e 94 41 0c 89 81 88 23  |....o_.O..A....#|
-00011170  19 f1 0e 94 a6 9f 6b e1  70 e1 48 e0 50 e0 0e 94  |......k.p.H.P...|
-00011180  47 f5 8c 01 8b e2 be 01  6e 5f 7f 4f 0e 94 41 0c  |G.......n_.O..A.|
-00011190  8a 81 88 23 19 f0 01 15  11 05 71 f4 8e 01 0d 5f  |...#......q...._|
-000111a0  1f 4f 87 e2 b8 01 0e 94  41 0c c8 01 60 e0 0e 94  |.O......A...`...|
-000111b0  22 5c 80 e0 0e 94 8d 5c  80 91 d9 13 81 ff 14 c0  |"\.....\........|
-000111c0  0e 94 40 88 0e 94 68 67  80 91 40 10 87 70 19 f4  |..@...hg..@..p..|
-000111d0  0e 94 e1 71 15 c0 81 50  82 30 90 f4 86 e0 80 93  |...q...P.0......|
-000111e0  d5 13 0e 94 f0 7b 0c c0  80 91 40 10 87 70 19 f4  |.....{....@..p..|
-000111f0  0e 94 bf 71 05 c0 81 50  82 30 10 f4 0e 94 df 70  |...q...P.0.....p|
+00011100  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+*
 00011200  62 96 e4 e0 0c 94 8d f4  80 91 d7 13 90 91 d8 13  |b...............|
 00011210  00 97 d9 f4 84 e5 97 e7  90 93 d8 13 80 93 d7 13  |................|
 00011220  80 91 40 10 87 70 82 30  11 f4 0e 94 04 77 0e 94  |..@..p.0.....w..|
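
A comparison like the one above can also be done page by page instead of by reading a diff. The following is only a minimal sketch: it assumes the two flash images have already been converted into raw bytes, the function name `differing_pages` is made up for this example, and the 256-byte page size is my assumption based on the ATmega1281 flash page being 128 words.

```python
PAGE_SIZE = 256  # assumed flash page size in bytes (128 words)


def differing_pages(good, bad, page_size=PAGE_SIZE):
    """Yield (address, erased) for every page that differs between two images.

    `erased` is True when the page in `bad` contains nothing but 0xff bytes,
    which is how erased flash reads back.
    """
    for addr in range(0, min(len(good), len(bad)), page_size):
        good_page = good[addr:addr + page_size]
        bad_page = bad[addr:addr + page_size]
        if good_page != bad_page:
            erased = bad_page == b"\xff" * len(bad_page)
            yield addr, erased


if __name__ == "__main__":
    # Synthetic data standing in for real dumps: four pages, one of them erased.
    good = bytes(range(256)) * 4
    bad = bytearray(good)
    bad[256:512] = b"\xff" * 256

    for addr, erased in differing_pages(good, bytes(bad)):
        state = "erased" if erased else "modified"
        print("page at 0x%08x is %s" % (addr, state))
```

Note that unused flash at the end of the image also reads as 0xff, so a check like this only makes sense against a known-good image and not on a single dump in isolation.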

This is puzzling for several reasons.

First of all, it seems unlikely that this is a hardware problem. Both bad modules (with serial numbers several tens of thousands apart) had lost the same block of code. If the flash lost its contents due to out-of-spec voltage during programming or some other hardware problem, I would expect the bad address to be random.

However, a software bug causing a failure like that seems highly unlikely as well. I was expecting to see some kind of EEPROM corruption. EEPROM is used to persistently store module settings and I assume the firmware often writes to it. Flash ROM, on the other hand, should be mostly read-only. I find it hard to imagine what kind of bug could erase a block - reprogramming the flash on a microcontroller is typically a somewhat complicated procedure that is unlikely to come up by chance.

One possibility is that we are somehow unknowingly invoking the bootloader mode during the operation of the sensor node. During my testing, however, I found out that just invoking the serial bootloader mode without also supplying it with a fresh firmware image corrupts the flash ROM sufficiently that the module does not boot at all. The Bootloader PC Tool seems to suggest that these modules also have some kind of over-the-air upgrade functionality, but I haven't yet looked into how that works. It's possible we're enabling that somehow.

Unfortunately, the poke functionality does not allow you to actually write to flash (you can write to RAM and EEPROM though). So even if this currently allows me to detect a corrupt module flash while the node is running, that is only good for saying that the module won't come back on-line after a reboot. I can't fix the problem without fully reprogramming the firmware. This means either hooking the module to a laptop or implementing the reprogramming procedure on the sensor node itself. The latter is not trivial, because it involves implementing the programming protocol and somehow arranging for the storage of a complete uncorrupted SerialNet firmware image on the sensor node.

Posted by Tomaž | Categories: Code | Comments »

jsonmerge

20.08.2014 21:01

As I mentioned in my earlier post, my participation at the Open Contracting code sprint during EuroPython resulted in the jsonmerge library. After the conference I slowly cleaned up the remaining few issues and brought up code coverage of unit tests to 99%. The first release is now available from PyPI under the MIT license.

jsonmerge tries to solve a problem that seems simple at first: given a series of structured JSON documents, how to create a single document that contains an aggregate of all their contents. With simple documents that might be as trivial as calling an update() method on a dict:

>>> a = {'foo': 1}
>>> b = {'bar': 2}

>>> c = dict(a)
>>> c.update(b)
>>> c
{'foo': 1, 'bar': 2}

However, even with just two plain dictionaries, things can quickly get complicated. What should happen if both documents contain a field with the same name? Should a later value overwrite the earlier one? Or should the resulting document have in that place a list that contains both values? Source JSON documents themselves can also contain arrays (or arrays of arrays) and handling those is even less straightforward than dictionaries in this example.
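
To make the two options for conflicting fields concrete, here is a small sketch using only plain dictionaries. The function names `merge_overwrite` and `merge_collect` are made up for this illustration and have nothing to do with the jsonmerge API.

```python
def merge_overwrite(base, head):
    """Later values replace earlier ones (what dict.update() does)."""
    merged = dict(base)
    merged.update(head)
    return merged


def merge_collect(base, head):
    """Keep every value seen for a field in a list, oldest first."""
    merged = {}
    for doc in (base, head):
        for key, value in doc.items():
            merged.setdefault(key, []).append(value)
    return merged


a = {'foo': 1}
b = {'foo': 2, 'bar': 3}

print(merge_overwrite(a, b))
print(merge_collect(a, b))
```

Both results are defensible, which is exactly the problem: the choice is buried in code, and neither function says anything about what to do once the values are themselves lists or nested dictionaries.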

Often I've seen a problem like this solved in application code - it's relatively simple to encode your wishes in several hundred lines of Python. However, JSON is a very flexible format while such code is typically brittle. Change the input document a bit and more often than not your code will start throwing KeyErrors left and right. Another problem with this approach is that it's often not obvious from the code what kind of strategy is taken for merging changes in different parts of the document. If you want to have the behavior well documented you have to write and keep updated a piece of English prose that describes it.

Open Contracting folks are all about making a data standard. Having a piece of code instead of a specification clearly seemed like the wrong approach there. They were already using JSON schema to codify the format of various JSON documents for their procedures. So my idea was to extend the JSON schema format to also encode the information on how to merge consecutive versions of those documents.

The result of this line of thought was jsonmerge. For example, to say that arrays appearing in the bar field should be appended instead of replaced, the following schema can be used:

schema = {
    "properties": {
        "bar": {
            "mergeStrategy": "append"
        }
    }
}

This way, the definition of the merge process is fairly flexible. jsonmerge contains what I hope are sane defaults for when the strategies are not explicitly defined. This means that the merge operation should not easily break when new fields are added to documents. This kind of schema is also a bit more self-explanatory than a pure Python implementation of the same process. If you already have a JSON schema for your documents, adding merge strategies should be fairly straightforward.
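
The idea of a schema driving the merge can be illustrated without the library itself. The following toy function is only a sketch and is not how jsonmerge is implemented: the name `merge_with_schema` is made up, it only knows three strategies, and the defaults it falls back to (merge objects key by key, overwrite everything else) are an assumption made for this example.

```python
def merge_with_schema(base, head, schema):
    """Merge `head` into `base`, guided by "mergeStrategy" keys in `schema`.

    Toy illustration only; `base` is None when the field did not exist before.
    """
    strategy = schema.get("mergeStrategy")
    if strategy is None:
        # Assumed defaults for this sketch, used when the schema is silent.
        if isinstance(base, dict) and isinstance(head, dict):
            strategy = "objectMerge"
        else:
            strategy = "overwrite"

    if strategy == "overwrite":
        return head
    if strategy == "append":
        return list(base or []) + list(head)
    if strategy == "objectMerge":
        properties = schema.get("properties", {})
        merged = dict(base or {})
        for key, value in head.items():
            merged[key] = merge_with_schema(
                merged.get(key), value, properties.get(key, {}))
        return merged
    raise ValueError("unknown merge strategy: %r" % (strategy,))


schema = {"properties": {"bar": {"mergeStrategy": "append"}}}
base = {"foo": 1, "bar": ["one"]}
head = {"bar": ["two"], "baz": "Hello, world!"}

print(merge_with_schema(base, head, schema))
```

Here the field "baz" is not mentioned in the schema at all and is still merged without an error, which is the kind of robustness to new fields that the defaults are meant to provide.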

One more thing that this approach makes possible is that given such an annotated schema for source documents, jsonmerge can automatically produce a JSON schema for the resulting merged document. The merged schema can be used with a schema validator to validate any other implementations of the document merge operation (or as a sanity check to check jsonmerge against itself). Again, this was convenient for Open Contracting since they expect their standards to have multiple implementations.

Since it works on JSON schema documents, the library structure borrows heavily from the jsonschema validator. I believe I managed to make the library general enough so that extending it with additional merge strategies shouldn't be too complicated. The operations performed on the documents are somewhat similar to what version control systems do. Because of that I borrowed terminology from there. jsonmerge documentation and source talks about base and head documents and merge strategies. The meanings are similar to what you would expect from a git man page.

So, if that sounds useful, fetch the latest release from PyPI or get the development version from GitHub. The README should contain further instructions on how to use the library. Consult the docstrings for specific details on the API - there shouldn't be many, as the public interface is fairly limited.

As always, patches and bug reports are welcome.

Posted by Tomaž | Categories: Code | Comments »

On cartoon horses and their lawyers

15.08.2014 19:14

GalaCon is an annual event that is about celebrating pastel colored ponies of all shapes and forms, from animation to traditional art and writing. It's one of the European counterparts to similar events that have popped up on the other side of the Atlantic in recent years. These gatherings were created in the unexpected wake of the amateur creativity that was inspired by Lauren Faust's reimagining of Hasbro's My Little Pony franchise. For the third year in a row GalaCon attracted people from as far away as New Zealand. It's a place where a sizable portion of the attendees wear at least a set of pony ears and a tail, if not a more elaborate equestrian-inspired attire. Needless to say, it can be a somewhat frightful experience at first and definitely not for everyone.

For most people it seems to be a place to get away from the social norms that typically prevent adults from obsessing over stories and imagery meant for an audience half their age and often of the opposite gender. While I find the worshiping of the creative talents behind the show a bit off-putting, I'm still fascinated by the amateur creations of this community. The artists' booths were a bit high on kitsch ("Incredible. Incredibly expensive" was one comment I overheard), but if you look in the right places on-line, there are still enjoyable and thoughtful stories, art and music to be found.

Meeting people I knew from their creations on the web was a fun experience. However for me this year's GalaCon was also a sobering insight into what kind of a strange mix of creativity, marketing psychology and legal matters goes into creating a cartoon for children these days.

GalaCon and Bronies e.V. flags at Forum am Schlosspark.

A highlight of the event was a panel by M. A. Larson, one of the writers behind the cartoon series. By going step by step through a thread of actual emails exchanged between himself, Lauren Faust and the Hasbro office he demonstrated the process behind creating a script for a single episode.

The exact topic of the panel was not announced beforehand, however, and all recording of the screen was prohibited, with staff patrolling the aisles to look for cameras. I don't know how much of that was for dramatic effect and how much was due to real legal requirements. Even before the panel began, though, it gave a strong impression of the kind of atmosphere a project like this is created in, especially considering that the episode he was discussing aired more than three years ago. I'm sure a lot of people in the audience could quote parts of that script by heart. It was transcribed, analyzed to the last pixel, remixed and in general picked apart on the Internet years ago.

My Little Pony has been called the end of the creator-driven era in animation. Until now I thought marketing departments dictated what products should appear on the screen and which characters should be retired to make place for new toy lines. I was surprised to hear that sometimes the Hasbro office gets involved even in details like which scene should appear last before the end of an act and the commercial break. That fact was even more surprising since this apparently happened in one of the earliest episodes, where the general consensus seems to be that the show was not yet ruined by corporate control over creative talent.

A similar amount of thought seemed to go into the possibility of lawsuits. Larson mentioned their self-censorship of the idea to make the characters go paragliding and having them do zip lining instead. Is it really harder to say in a court that some child has been hurt trying to imitate horses sliding along a wire than horses soaring under a parachute?

GalaCon 2014 opening ceremony.

The signs of the absurdity of intellectual property protection these days could also be seen throughout the event. Considering Bronies e.V. paid license fees for the public performance of the show it was ridiculous that they were using low-quality videos from the United States TV broadcasts for projection on the big cinema screen, complete with pop-up advertisements that didn't make sense.

Similarly, the love-hate relationship between copyright holders and non-commercial amateur works is nothing new to report. There were a lot of examples where rabid law firms, tasked with copyright protection and with only tenuous connections back to the mothership, used various extortion tactics to remove remixed content from the web. I still don't understand though what kind of a law justifies cease-and-desist letters for works inspired by unnamed background characters that only appeared for a couple of seconds in the original show.

Evening in front of the Forum am Schlosspark.

In general, GalaCon was a somewhat more chaotic experience than I would wish for and I left it with mixed feelings. Cartoon ponies on the internet are full of contradictions. While the stories they tell are inspiring and a welcome getaway from daily life, the money-grabs behind them are often depressing. I still believe in the good intentions of these events, but the extravagant throwing around of money at the charity auction made me question a lot of things. With extra fees for faster queues, photos and autographs, this year's event felt more like a commercial enterprise than a grassroots community event.

Posted by Tomaž | Categories: Life | Comments »