VESNA reliability and failure modes

23.11.2013 22:08

As you might know from my previous writings and talks, the Jožef Stefan Institute runs an experimental wireless communications testbed as part of the European FP7 CREW project. The testbed is located in Logatec, a small town around 30 km from Ljubljana, and is unimaginatively called Log-a-tec. It consists of 54 VESNA devices mounted outdoors on street lights.

Wireless sensor node in the Log-a-tec testbed.

Each node has a 24-hour power supply, but no wired communication lines to other nodes. Instead, each node has three separate radios. One of them connects to a ZigBee mesh network used for management purposes. The other two are used to set up experimental networks and perform various measurements of radio frequency spectrum usage.

The testbed is divided into three separate clusters. One ZigBee coordinator node per cluster provides a gateway from the mesh network to the Internet.

Combined map of the Log-a-tec testbed.

The testbed was deployed in steps around June 2012. It has been operating continuously since then, and while its reliability has been patchy at best, it has nevertheless supported several experiments.

In the near future we are planning the first major maintenance operation. Nodes that have failed since deployment have already been unmounted. They will have their failed components replaced and will at some point be mounted back in their positions on the street lights. Therefore I think now is the perfect time to look back at the last year and a half and see how well the testbed has been doing overall.

First, here are some basic reliability indicators for the period between August 2012 and November 2013 (a sketch of how figures like these can be derived from raw logs follows the list):

  • Average availability of nodes (ping): 44.6%
  • Average time between resets (uptime): 26 days
  • Number of nodes never seen on the network: 24% (= 13/54)
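The sketch below is not the script that produced the numbers above - it merely illustrates how figures like these can be derived, and the log formats it assumes (one line per ping attempt, plus periodic uptime samples) are hypothetical:

# Illustrative only: average availability and mean uptime between resets
# from two hypothetical log files. The formats are assumptions, not the
# actual monitoring setup.
from collections import defaultdict

def availability(ping_log_path):
    """Ping log: one line per attempt, "<node> <ok|fail>"."""
    attempts = defaultdict(int)
    successes = defaultdict(int)
    with open(ping_log_path) as f:
        for line in f:
            node, status = line.split()
            attempts[node] += 1
            if status == "ok":
                successes[node] += 1
    return {n: 100.0 * successes[n] / attempts[n] for n in attempts}

def mean_uptime_days(uptime_log_path):
    """Uptime log: "<node> <uptime_seconds>" samples; a drop in the counter
    marks a reset, so the last value before the drop is one uptime period."""
    last = {}
    periods = defaultdict(list)
    with open(uptime_log_path) as f:
        for line in f:
            node, uptime = line.split()
            uptime = int(uptime)
            if node in last and uptime < last[node]:
                periods[node].append(last[node])
            last[node] = uptime
    return {n: sum(p) / len(p) / 86400.0 for n, p in periods.items()}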

The following two graphs show availability and uptime per individual node, colored by cluster. The 13 nodes that have never been seen on the network are not shown (they have 0% availability and 0 uptime). Also note that when a coordinator (node00) was down, that usually meant the whole cluster was unreachable.

VESNA outdoor node availability from August 2012 to November 2013

VESNA outdoor node uptime from August 2012 to November 2013

I have also been working on diagnosing specific problems with failed nodes. Unfortunately, because the work was sometimes rushed due to impending deadlines, my records are not as good as I would wish. Hence I can't easily give an exact breakdown of how much downtime was due to which problem. If at some point I find the time to go through my mail archive and gather all my old notes, I might write a more detailed report.

However, since I am getting a lot of questions regarding what exactly went wrong with the nodes, here is a more or less complete list of problems I found, divided into those that occurred only once and those that occurred more frequently.

A box of unmounted VESNA sensor nodes.

Recurring failures, ordered roughly by severity:

  • Broken boxes. VESNA nodes have been mounted in boxes certified for outdoor use. Nevertheless, a lot of them have cracked since deployment. This often resulted in condensation and in at least one case a node that was submerged in water. A lot of other failures on this list were likely indirectly caused by this.
  • I have already written about problems with Atmel ZigBit modules. While intermittent serial line problems have been mostly worked around, the persistent corruption of ZigBit firmware was one of the most common reasons why a node would not be reachable on the network. A corrupted ZigBit module does not join the mesh and requires firmware reprogramming to restore, something that can not be done remotely.
  • There have been some problems with an old version of our network driver that would sometimes fall into an infinite loop while it kept resetting the watchdog. Since we have no means of remotely resetting a node in that case, this bug has caused a lot of downtime in the early days of deployment. It proved so hard to debug that I ended up rewriting the problematic part of the code from scratch.
  • Texas Instruments CC-series transceiver degradation. While this has not resulted in node downtime (and is not counted in the statistics above), it has nonetheless rendered several nodes useless for experiments.
  • Failed microcontroller flash. Due to an unfortunate design of VESNA's bootloader, it reprograms a block of flash on each boot. For nodes that were rebooting frequently (often because of other problems) this feature commonly resulted in stuck bits and a failed node.
  • Failed SD card interface. For mass storage, VESNA uses an SD card and on several nodes it has become inoperable. Since the SD card itself can still be read on another device, I suspect the connector (which was not designed for outdoor use).
  • Failed MRAM interface. In addition to the SD card, there is a small amount of non-volatile MRAM on board, and on several nodes it has failed for an unknown reason.
  • People unplugging UTP cables and other problems with Internet connectivity at the remote end beyond our control.

One-time failures:

  • Digi Connect ME module SSL implementation bug.
  • Failed Ethernet PHY on a Digi Connect ME module. While these two problems only occurred once each, they were responsible for a lot of downtime for the whole City center cluster.
  • Failed interrupt request line on a CC1101 transceiver. Unknown reason, could be bad soldering.
Posted by Tomaž | Categories: Life | Comments »

Masters of Doom

28.10.2013 19:16

Recently I finished reading Masters of Doom by David Kushner. It came on my radar because of a post on Hacker News that said this was a book that inspired Alexis Ohanian and Steve Huffman to make Reddit.

The story follows John Carmack and John Romero from childhood, through the founding of id Software and the later successes and failures of the games they developed, and concludes with the founding of Armadillo Aerospace. Compared to other books I have read about the ups and downs of US start-up companies (like The Facebook Effect or Dreaming in Code), it presents a more personal view of the people in it. It often goes into first-person detail about how someone felt about other people or about the way some project was going. That makes for an interesting description of the dynamics in a team and how its members led their lives. It also makes me wonder how many of these details could genuinely have been learned through interviews and how much was added later to make for a more interesting read.

While this part of the book is quite well written in my opinion, the author fails horribly at describing technical details or any of Carmack's many breakthroughs in computer graphics. Even though I knew many details of id's game engines beforehand, I was constantly baffled by their descriptions in the book and went to Wikipedia several times to check my memory (by the way, the best description of Carmack's smooth-scrolling trick in Commander Keen I could find is on this Wikipedia talk page). Even more puzzling is the wrong explanation of the gamer slang term telefrag. Thankfully only a small part of the content is dedicated to technical topics, but it makes me wonder how such mistakes could have happened when the author describes himself in the epilogue as a hacker who was himself involved in the gaming culture.

In the end I still enjoyed reading the book. It included many events I remember reading about in magazines years back and presented them from the other side's point of view. The story also gives a quite good timeline of events and a genuine impression of the amazing speed at which advances in technology happened at that time.

According to the book, Doom was half-jokingly intended to be the number one cause of decreased productivity in businesses around the world. If the designers of Reddit took that as inspiration for their website, you could certainly say their success was along the same lines.

Posted by Tomaž | Categories: Life | Comments »

Comments closed

19.07.2013 9:38

The number of comments submitted to posts on this blog has gone through the roof recently, as you can see in the graph below. Of course, practically all of these are spam. Unfortunately, moderating this flood of crappy advertisements and link bait is now starting to take more of my time than I am willing to spend on it. Since I want to keep my little corner of the web a clean and friendly place, I'm closing comment submission until I come across some viable solution. In the meantime, if you have a question or want to contribute something to one of my posts, feel free to send me an email.

Number of submitted comments versus time.

While Akismet has been doing a pretty good job of automatically filtering comment spam for me, it's been letting a non-trivial amount of it through in recent months. Considering the increase in volume, that might not even be due to decreased accuracy.

The kind of spam I'm seeing is somewhat surprising. The spammer fetches the blog post that contains the comment submission form, submits a comment and fetches the post again to verify that the comment is visible. After these three HTTP requests the originating IP is never seen hitting my server again, which makes me think this is done via a botnet or some similar distributed operation. There is no obvious sign of crawling, so I don't know how they get the URLs to spam. They use realistic-looking user agent headers and the only obvious difference from a real browser is that they don't fetch any of the resources (images, CSS, ...) referenced in the HTML document.
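This pattern is distinctive enough that it can be picked out of a web server access log. The snippet below only illustrates the heuristic - it is not the filter actually running here, and it assumes a combined-format log and a particular set of static file extensions:

# Flag client IPs that POSTed a comment but never requested a single
# static resource. Log format and extensions are assumptions.
import re
from collections import defaultdict

LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(\S+) (\S+)')
STATIC = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

def suspicious_ips(log_path):
    posts = defaultdict(int)
    static = defaultdict(int)
    with open(log_path) as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m:
                continue
            ip, method, path = m.groups()
            if path.lower().endswith(STATIC):
                static[ip] += 1
            elif method == "POST":
                posts[ip] += 1
    # a real browser rendering the page fetches at least some resources;
    # these bots never do
    return [ip for ip in posts if static[ip] == 0]

if __name__ == "__main__":
    for ip in suspicious_ips("access.log"):
        print(ip)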

The content varies, but a lot of the comments I've been removing manually these days look like bug reports ("the sidebar is not rendering correctly on my browser", "search doesn't work" and such) that are only given away by the obvious keyword stuffing in the author name and the URL (or when they are complaining about bugs in features this website doesn't have). They target both new and old blog posts, so just shutting down comments on old posts doesn't seem to be a solution.

Posted by Tomaž | Categories: Life | Comments »

Missile Gap

13.07.2013 20:58

Recently I read Missile Gap by Charles Stross (the first 9 chapters seem to be freely available on the web). It's a fascinating little hard-science fiction story that mixes cold war era Earth with the completely outrageous premise that the world has suddenly become a flat disc.

I can just imagine this started as a crazy idea in the form of "well, I wonder what would happen if the Earth was flat" and was then brought to its logical conclusion, with the politics of the 70s thrown in to make for a more captivating story. I think Missile Gap shows in the best possible way how a science fiction story can start with a completely unbelievable event and then build a world and extrapolate a line of believable events around it, making for an enjoyable read that doesn't force you to suspend the rational part of your mind. Many stories I come across these days have less outrageous plot devices, but then continue to break known laws of physics like crazy during their course.

What also kept me turning pages is the inclusion and logical continuation of quite real, but obscure, research projects that both superpowers were working on at that time. For someone like me who has spent too many hours reading up on canceled concepts for nuclear-powered airplanes and rockets, this was like icing on the cake.

Visualization of Missile Gap by Charles Stross

Anyway, the other day I needed something to occupy my mind and, having the book handy on my Kindle and an idle Python interpreter on my laptop, I drew the visualization above. It shows Missile Gap's 17 chapters in three colors, marking the three separate personal stories the book revolves around. The length of the boxes is proportional to the number of words, while the cumulative number of words is shown on the scale on the right (click on the image for a larger version).
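For the curious, the idea behind the image fits in a few lines of matplotlib. This is a minimal reconstruction of the approach, not the original script - the word counts and the assignment of chapters to story lines below are made-up placeholders:

# One box per chapter, height proportional to word count, color marking the
# story line, cumulative word count on the right. All numbers are placeholders.
import matplotlib.pyplot as plt

word_counts = [3200, 2800, 4100, 3600, 2900, 3300, 3800, 2700, 3500,
               3100, 2600, 3900, 3400, 2800, 3700, 3000, 4200]
story_line = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1]
colors = ["#4477aa", "#ccbb44", "#ee6677"]

fig, ax = plt.subplots(figsize=(2.5, 8))
bottom = 0
for chapter, (words, story) in enumerate(zip(word_counts, story_line), 1):
    ax.bar(0, words, bottom=bottom, width=0.8, color=colors[story],
           edgecolor="white")
    ax.text(0, bottom + words / 2, str(chapter), ha="center", va="center")
    bottom += words

ax.set_xlim(-0.5, 0.5)
ax.set_xticks([])
ax.yaxis.tick_right()
ax.set_ylabel("cumulative number of words")
ax.yaxis.set_label_position("right")
fig.savefig("missile_gap.png", dpi=150, bbox_inches="tight")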

Posted by Tomaž | Categories: Life | Comments »

SIGINT 2013

10.07.2013 22:47

Last weekend Jure and I visited this year's iteration of the SIGINT conference in Köln, Germany. SIGINT is a conference organized by the Chaos Computer Club and, like the Chaos Communication Congress, has a bit of history. In previous years it felt more like a local event and I didn't consider visiting it. This year, however, I decided to give it a try, since the announcement gave the impression that they were aiming for a more international audience, for instance with the preference for English talks in the call for papers.

SIGINT 2013 logo

At first impression, the event looked much like a summer version of the Congress. Instead of one big hall, the conference was split between two buildings with three lecture rooms, two halls and an obligatory basement hack center with copious amounts of reasonably priced Club Mate. Beyond the Fairy dust you could see some usual suspects from past winter events in Hamburg and Berlin, like the All Colors Are Beautiful blinking IKEA boxes installation, the Rarity hacked Brother embroidery machine and Nick Farr in his trademark suit.

The talks were a mix of social and political topics, computer security and various other curiosities that I came to expect from hacker conferences like this. The society track was unsurprisingly dominated by the recent leaks about United States data collection. From these I can recommend watching the keynote by Meredith L. Patterson and the Politics of Surveillance by Rainey Reitman. On the computer security topic there were perhaps a few more talks by people that can read x86 assembly by heart than you could find at 29C3 (where I believe a lot of this crowd opted to go to BerlinSides instead). Embedded device security nightmares and Car immobilizer hacking rang close to home for me. Also worth watching once the video recordings are published is the Secure Exploit Payload Staging which gives a good impression of how little trace someone can leave after breaking into your server. From the retro-computing scene, I liked the The DRM of Pacman talk about vintage hardware copy protection schemes in game cabinets of old. And finally, I thoroughly enjoyed the Making music with a C compiler lecture, which made me think again about the complex synthesizer I implemented on VESNA. By the way, slides for my lightning talk on that topic are already on-line, although the original blog post is probably more informative.

In conclusion, it was a nice event with an appropriately lazy pace for an extended summer weekend. My only complaint would be that the crowd felt less open than what I'm used to at the Congresses. It was hard to strike up a conversation in English with someone, and looking back I didn't really have any interesting chats at the event beyond asking a few stupid questions regarding projects exhibited in the hallways. I couldn't help overhearing a few comments about how different the event was compared to previous years, so perhaps it's just a sign that most people there were still used to a more local audience. In any case that's a completely subjective feeling and it's perfectly possible that I wasn't in my most sociable mood either. I'm starting to fear that I might have slightly overbooked my travel plans for this summer.

Posted by Tomaž | Categories: Life | Comments »

Decline and fall of Kiberpipa

08.06.2013 16:50

I guess by this time it's a well-known fact around here that Ljubljana's hackerspace Kiberpipa has come to the end of its days, at least in its current place and form. A farewell party has been held, goodbyes have been said and all that remains now is to start unplugging the server rack.

If you haven't been following the news, the simple story is that Kiberpipa's parent organization, twice removed, decided to convert the place into a restaurant. The destruction of a hackerspace is merely collateral damage in a grand scheme of converting an old building full of non-profit student and art organizations into a very for-profit hotel in a sweet spot near the center of the city. As is usual in such cases, there's also a back story that involves removing opposition through legalistic procedures and suspicions of personal interests. It was all done under cover, and the community found out only after the contracts had been signed, through rumors, hearsay and digging through meeting minutes. An official statement was only made when the media started asking questions, even though Kiberpipa had a representative who should have been kept up to date with such things.

Kiberpipa storage room.

I'm not, and never was, involved in the internal politics of Kiberpipa's tenuous relationship with its masters. For me, Kiberpipa was foremost a place to go to after lectures, and later after work, where I could meet the kind of people who enjoyed idly chatting about technology and various other geeky topics instead of sports events and daily politics. Since I was involved in Kiberpipa from the start, I did use to have daily responsibilities there, like administration of servers and taking care of network security. Kiberpipa was also the place of my attempt at running a serious free software project. Many hours were spent at odd times in a cramped server room and I learned a lot from these jobs, but unlike at other hackerspaces, Kiberpipa's attractiveness was rarely about having access to equipment that I wouldn't have otherwise. It was foremost the social aspect that kept me returning to the place. I found many valuable personal connections that later led me to start-ups and other interesting volunteering work.

That said, Kiberpipa never felt like a tightly knit community. People who frequented the place or used its name on projects were always divided into groups that did not communicate well with each other. Contrary to most other hackerspaces, Kiberpipa was from the start tied to a relatively large non-profit that ran several other, mostly artistic, operations under the control of the Student Organization of the University of Ljubljana. For the majority of my time in Kiberpipa they were a kind of fuzzy entity that showed itself only when it exercised its power over some official aspect of the organization or left the place in ruin after an unannounced party. Ties between the more technical hacker crowd and the arts communities were rarely relaxed. Often there was an unusual reversal of roles, where artists were the ones supplying money through various public grants and technical people were perceived as moochers playing with their toys. In its early days there was also a strong political activism side to Kiberpipa, with which I didn't particularly identify either.

Kiberpipa server room.

When a rare project that involved collaboration did happen, it was often set up and discussed outside of general channels like the members' mailing list. It's not surprising, then, that in all the years of its existence, and despite numerous formal and less formal meetings and discussions, it was never possible to come up with a mission statement, or an answer to what Kiberpipa was, that everyone would agree with.

Even with hindsight it's hard to say what could have been done differently. It's doubtful that Kiberpipa would have been this successful without its partnership with the student organization that ultimately led to its destruction. It provided an accessible, rent-free place and connections to government subsidies that removed the need for membership fees. With fees, the place would certainly have attracted a lot fewer people. Kiberpipa's community also showed a lot of flexibility, changing its external face over the years from mostly being a free cybercafé to a place to go to for lectures and workshops on various topics. Part of this probably comes from the fact that the community never learned how to transfer knowledge between generations, but it's still impressive enough that comments can be heard from old members that they never thought the place would survive for 13 years.

Kovchek, Kiberpipa's old mobile video streaming server.

Although I mostly kept myself in the background and had my share of conflicts and grief there, Kiberpipa has been a big part of my life and I'm sad to see it end like this. Things may not be as bad as they look though. The latest generation of Kiberpipa's members are looking for ways of continuing the story in an independent fashion and although I'm not actively involved in that effort I hear that the outlook is good. I'm usually too pessimistic in such writings anyway. For better or worse I am quite certain though that Kiberpipa 2.0 will be quite different from the dark and smelly open source cellar we started 13 years ago.

Posted by Tomaž | Categories: Life | Comments »

World wide wheel, reinventing of

09.05.2013 22:00

The direction browsers and web technology are moving in these days truly baffles me. As usual in the software world, it's all about piling one shiny feature on top of another. Now, I'm not against shiny per se, but it seems that a lot of these innovations come from people who haven't even taken an hour to look at the existing body of knowledge and standards that has accumulated over the years. With the frenzy of rolling releases and implementation-is-the-standard hotness, it's not even surprising that these things are then implemented by browsers before someone with a long enough beard can stand up and shout: Hey! We already thought of that in this here RFC.

Take, for example, all the buzz about finally solving the problem of authentication on the web. Finally, there's a way to securely sign into a website without all the mess of hundreds of hard-for-me-to-remember yet easy-to-guess-by-the-cracker username and password combinations. Wonderful. Except that this exact thing has existed on the web since people did cave paintings and used Netscape to browse it. It's called SSL client-side certificates and, amazingly, it worked well enough for on-line banking and government sites even before the invention of pottery and cloud-based identity providers.

But that's just the most glaring case. Another front where this madness continues is pushing things from the old HTTP headers into the fancy new HTML5. Take, for example, a proposal to add an HTML attribute that defines whether a browser should display something or save it to disk by default. This functionality has existed for ages in the form of an HTTP header, yet this is somehow dismissed as a server-side solution (what does that even mean?).

I wonder how many web developers today are even aware that there exists a mechanism for the browser to tell the server which language the user prefers (but we most certainly need the annoying whole-screen language selection pop-up-and-click-through!). Or that the client can tell the server whether it would rather have a PNG for downloading or a friendly HTML page for viewing in a browser (meh, we'll just fudge that with some magic on-click JavaScript handlers).
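To make this concrete, here is a toy server that uses exactly the headers mentioned above - Accept-Language to pick the language of the response and Content-Disposition to ask the browser to save a resource instead of displaying it. It's a five-minute sketch for illustration, with made-up paths and greetings, not a recommendation for how to do content negotiation properly:

# Toy demonstration of Accept-Language and Content-Disposition.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # crude language negotiation based on the client's Accept-Language
        lang = self.headers.get("Accept-Language", "")
        body = ("Zdravo!\n" if lang.lower().startswith("sl") else "Hello!\n").encode()

        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        if self.path == "/download":
            # the long-standing "server-side" way of saying "save this to disk"
            self.send_header("Content-Disposition",
                             'attachment; filename="greeting.txt"')
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

Requesting / with Accept-Language: sl returns the Slovenian greeting inline, while requesting /download returns the same content but the browser offers to save it - no JavaScript or HTML5 attributes required.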

Now I can see someone laughing, saying how ridiculous this idea is and asking whether I have ever even tried to use one of those ancient features. No, it's not, and I have. It's consistently painful. But it's only so because, for some reason, browsers long ago decided to build the most horrible interface to such functionality imaginable to man and then forgot to ever fix it. Mostly it's hidden 10 levels down in some obscure dialog box, and if banks didn't give you click-by-click instructions on how to import a certificate, 99% of people would give up after a few hours and go back to chiseling clay tablets. Now imagine if a tenth of the time spent reinventing the wheel were spent just improving the usability of existing features. Why can't I go to a web page and get a prompt: Hey! This web page wants you to log in. Do you want me to use one of these existing certificates or generate a new, throw-away one? The world would be just a tiny bit better, believe me.

In the end, I think modern browsers have focused way too much on improving the situation for the remote web page they are displaying and neglected the local part around it. And I believe this direction is bad in the long run. Consider also the European cookie directive. I'm pretty sure this bizarre catch-22 situation where web pages are now required to manage cookie preferences for you would not be needed if browsers provided a sane interface for handling these preferences in the first place. My Firefox has three places (that I know of!) where I can set which websites are allowed to store persistent state on my computer. Plus it manages to regularly lose them, but that's a different story.

Posted by Tomaž | Categories: Life | Comments »

Cost of a Kindle server

01.05.2013 10:48

I was wondering how much running a Kindle as an always-on, underpowered Debian box was costing me in terms of electricity. So I plugged it into one of the Energycount 3000 devices and monitored its power consumption over the last 4 days. This took into account the power consumption of the Kindle as well as the efficiency of a small Nokia cell phone charger I'm using to power it.

ec3k reported an average power of 1.0 W (and a maximum of 2.6 W). Dividing the watt-second count by time also yielded 1 W to three decimal places. This suspiciously round number makes me think it's due to the limited precision of the measurement, but let's consider it accurate for the moment.

1 W adds up to 0.72 kWh per month. At the prices I'm currently paying for electricity this costs me 0.083 € per month. For comparison, a cup of synthetic-tasting coffee from the machine at work costs around twice as much, and running my desktop machine all the time would be around a hundred times as expensive.
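The arithmetic behind those figures, for anyone who wants to plug in their own numbers - note that the price per kWh below is inferred from the stated monthly cost, not quoted from a bill:

# Back-of-the-envelope check of the cost figures above.
power_w = 1.0                          # average power reported by ec3k
hours_per_month = 30 * 24              # roughly one month

kwh_per_month = power_w * hours_per_month / 1000.0
print(kwh_per_month)                   # 0.72 kWh

cost_per_month = 0.083                 # EUR, as stated above
print(cost_per_month / kwh_per_month)  # ~0.115 EUR/kWh implied price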

Posted by Tomaž | Categories: Life | Comments »

Interesting battery failure mode

28.04.2013 13:29

Thanks to my previous posts about Amazon Kindle, I have another broken specimen on my desk now. This one seems to have experienced an interesting battery failure.

Amazon Kindle 3 batteries

The Kindle's battery has 4 terminals: ground, a positive terminal for power, and SDA and SCL pins for I2C communication with the integrated battery management circuit. On a normal battery, the positive terminal is around 3.7 V above ground, depending on the charge level of the Li-ion cell, and the I2C lines are at ground level, because they need external pull-ups.

This broken battery, however, has the positive terminal at 0 V relative to the ground terminal, while the I2C pins are at -2.5 V. I can't imagine what kind of failure mode could cause pins to go lower than ground, unless the polarity of the cell somehow got reversed. I don't see how a failure in the battery management circuit or a loose connection somewhere could cause such readings. I'm pretty sure it's not an artifact of my multimeter either, because the battery can drive some milliamps of current from the ground terminal to one of the I2C pins. For the record, this looks like an original 1830 mAh battery. The date of manufacture is April 2011 and the type is 170-1032-01 Rev. A.

The master I2C interface on the Kindle wasn't damaged though, because it boots and reads out the battery state just fine when attached to a different battery. There does seem to be a problem with a bad connection somewhere on the motherboard, because it crashes if I lightly knock on the CPU package. Possibly a hairline crack in some solder joint. But that's a topic for some other time.

Posted by Tomaž | Categories: Life | Comments »

Contiki and libopencm3 licensing

19.03.2013 18:08

At the beginning of March a discussion started on the Contiki mailing list regarding the merging of a pull request by Jeff Ciesielski that added a port of Contiki to STM32 microcontrollers using the libopencm3 Cortex-M3 peripheral library. The issue raised was the difference in licensing. While Contiki is available under a permissive BSD-style license, libopencm3 uses the GNU Lesser General Public License version 3. Jeff's pull request was later reverted as a result of this discussion; it was similar to my own effort a while ago, which was also rejected due to the libopencm3 license.

Both the thread on contiki-devel and the later one on libopencm3-devel might be an interesting read if you are into open source hardware, because they exposed some valid concerns regarding firmware licensing. Two topics got the most attention: first, if you ship a device with proprietary firmware that uses an LGPL library, what does the license actually require from you; and second, whether the anti-tivoization clause is still justified outside the field of consumer electronics.

I'll try to summarize my understanding of the discussion and add a few comments.


Only the libopencm3-based STM32 port of Contiki would be affected by the LGPL. Builds for other targets would be unaffected by the libopencm3 license and would still be BSD licensed, since their binaries would not be linked in any way with libopencm3. Still, it was seen as a problem that not all builds of Contiki would carry the same license. Apart from the added complexity, I don't see why that would be problematic. FFmpeg is an example of an existing project that has been operating in this way for some time now.

The LGPL requires you to distribute any changes to the library under the same license and to provide a means of using your software with a different (possibly further modified) version of the library. The second requirement is simple to satisfy on systems that support dynamic linking. However, dynamic linking is very rare in microcontroller firmware. In that case, at the very least you have to provide binary object files for the proprietary part and a script that links them with the LGPL library into working, statically-linked firmware.

I can see how this can be hard to comply with from the point of view of the typical firmware developer. Such linking requires an unusual build process that might be hard to set up in IDEs. Additionally, modern visual tools often hide the object files and linking details completely. With proprietary compilers it might even be impossible to produce any kind of portable binary objects. In any case, this is seen by some as enough of a hurdle that reimplementing the LGPL code is easier than complying with the license.

From this point of view, the GPL and LGPL don't seem to differ much in practice (note that libopencm3 already switched from the GPL to the LGPL to address concerns that it should be easier to use in commercial products). The SDCC project solved this problem by adding a special exception to the GPL.


The other issue was the anti-tivoization clause. This clause was added to the third revision of the GNU public licenses to ensure that freedom to modify software can't be restricted by hardware devices that do cryptographic signature verification. This was mostly a response to the practice in consumer electronics where free software was used to enable business models that depended on anti-features, like DRM, and hence required unmodifiable software to be viable. However in microcontroller firmware there might be reasons for locking down firmware reprogramming that are easier to justify from engineering and moral standpoints.

The first such case is where software modification can enable fraud (for instance, in energy meters) or make the device illegal to use (for instance, due to FCC requirements for radio equipment), or both. In a lot of these cases, however, there is a very simple answer: if the user does not own the device (as is usually the case for metering equipment), no license requires the owner to enable software modification or even to disclose the source code. Where that is not the case, the technical means are usually only one part of the story. The user can be bound by a contract not to change particular aspects of the device and can be subject to inspections. The anti-tivoization clause also does not prevent tampering indicators. Still, it might be that in some cases software covered by the anti-tivoization clause simply cannot be used in practice.

The other case is where modified firmware can have harmful effects. Some strong opinions were voiced that people hacking the firmware of certain dangerous devices cannot know enough not to be a danger to their surroundings. This is certainly a valid concern, but the question I see is: why suddenly draw the line at firmware modification?

Search the web and you will find cases where using the wrong driver on a laptop can lead to the thing catching fire, which can certainly lead to injuries. Does that mean that people should not be allowed to modify the operating system on their computers? A similar argument was made years ago in computer security, but I believe it has been proven enough times by now that manufacturers of proprietary software are not always the most knowledgeable about their products. I am sure that any device that can be made harmful with a firmware update can be made harmful much more easily with a screwdriver.

In general, artificially limiting the number of people tinkering with your products will limit the number of people doing harmful things, but it will also limit the number of people doing useful modifications. A lot of hardware that was found to be easily modifiable has been adopted for research purposes in institutions much fancier than your local hackerspace.

I haven't been involved in the design of any truly dangerous product, so perhaps I can't really have an opinion about this. However, I do believe that the responsibility of a designer of such products ends with clear and unambiguous warnings about the dangers of modifying or bypassing safety features.

Posted by Tomaž | Categories: Life | Comments »

Embedded modules

02.03.2013 20:56

I've written before about problems with VESNA deployments that have come to consume large amounts of time and nerves. Several of these have come in turn from two proprietary microprocessor modules we use: Digi Connect ME for Ethernet connectivity and Atmel SerialNet for IEEE 802.15.4 mesh networking.

One of these issues, which now finally seems to be on the brink of being resolved, has been dragging on since late summer last year. We have deployed several Digi Connect ME modules as parts of gateways between the IEEE 802.15.4 meshes in clusters of VESNA nodes and the Internet. One of the deployments has proved especially problematic. Encrypted SSL connections from the module would randomly get dropped and re-connect only after several hours of downtime.

The issue at first proved impossible to reproduce in a lab environment, and since the exact same device worked on other networks, the ISP and the firewall performing NAT were blamed. However, several trips to the location and many packet captures later, I could find no specific problem with the TCP or IP headers that I could point my finger at. We replaced a network switch with no effect. Later, by experimenting with the Digi Connect's TCP keep-alive settings, a colleague found a setting that caused a dropped connection to be re-established immediately instead of causing hours of downtime, making the deployment at least partially useful.

Finally, last week I managed to reproduce the problem on my desk. I noticed that the TCP connections from that location had an unusually low MSS - just 536 bytes. By simulating this I could reliably reproduce the connection drops, and by experimenting further I found out that SSL data records fragmented in a particular way cause the module to drop the connection. It was somewhat specific to the Java SSL implementation we used on the other end of the connection and very unlikely to happen with other connections that used larger segment sizes.

The cause of the issue was therefore in the Digi Connect module. Before having a reproducible test case I hadn't even considered the possibility that a change at the link layer somewhere along the route could trigger a bug at the application layer.

After I had that piece of information, a helpful member of the support forums quickly provided a solution. The issue itself however is not yet resolved since the change in the firmware broke all sorts of other things which now need to be looked into and fixed as well.

I can't say that all of our hard-to-solve bugs came from the Digi Connect or Atmel modules. We caused plenty ourselves. But having now had the experience of working with these two fine products, my opinion is that less time would have been wasted if we had gone for a lower-level solution (just an interface at the physical layer) and then used an open source network stack on top. It would have taken more time to get to a working solution, but I think problems would be much easier to diagnose and solve than with what is essentially a magical black box.

Both the Digi Connect and Atmel modules suffer from the fact that they hide some very complex machinery behind a very simplistic interface. Aside from the problem of leaky abstractions, when the machinery itself fails they provide no information that would help you work around the problem (solving it is out of the question anyway because of the proprietary software). Both also come with documentation that is focused on getting a working system as fast as possible, but lacks details on what happens in corner cases. These are mostly left to your imagination and experiments, and as experience has shown, behavior can change between firmware revisions. In most cases you can't even practically test for these changes, since that would involve complicated hardware test harnesses.

Posted by Tomaž | Categories: Life | Comments »

Visiting the capital

23.02.2013 14:10

I spent the last week in Brussels. The CREW project I'm involved in at IJS organized a couple of events there as well as a plenary meeting, so the past few days have been quite exhausting, not to mention the week leading up to them, spent in worry and preparation.

I haven't been to Belgium in a few years and this was the first time I actually flew in. The prices for flights from Ljubljana have always been unreasonably high, with a kind of urban legend going around that this is an unofficial way for the government to subsidize our national airline by paying high prices for the frequent flights of various government officials.

In any case, my flight landed late at night and the Brussels airport was more or less dark and deserted. Dark, except for big, brightly lit LED boards with advertisements. These were offering optimistic visions of a bright and sustainable future all along the long path you have to walk from the gate to the train station. And the first of them was telling us in large, friendly letters that the European Parliament protects our rights. I can tell you that coming from our small, unfashionable airport, this scene reminded me of certain not-so-optimistic science fiction stories. The fact that my hotel reservation came with a legal disclaimer about assistance to the authorities did not help either.

Vrije Universiteit Brussel

The first order of business in Brussels was a workshop on TV white-spaces for members of the European Commission. As you might know, there is a lot of discussion going on about how to re-use the frequencies that were freed by the transition to digital broadcasts. As a project that also works in that field we presented our view on that topic to people working on spectrum regulation.

The visit to the European Commission offices was actually quite different from what I expected. I was anticipating a dusty, gray place with laser-printed passive-aggressive notices hanging around the hallways, the kind I usually associate with government buildings here. Instead, the part of Beaulieu 33 I saw could probably compete with Hekovnik in the number of colorful and inspiring messages stuck to the walls. Not to mention the various, kind of silly, posters regarding network security (in the fashion of "you don't share your toothbrush, so you shouldn't share passwords either"). The security procedures, while visible, were also pretty unobtrusive and mostly involved displaying a kind of self-destructing badge (it got crossed out with "expired" all by itself after a day, through some kind of chemical process I guess).

Carolina Fortuna giving a tutorial on ProtoStack

The other public event we held was the CREW training days at the Vrije Universiteit Brussel. I gave a tutorial there on how to use the Jožef Stefan Institute's cognitive radio testbed in Logatec and the hardware I developed for various experiments. I'm happy that I received some positive responses to it. At least for me it was a big confirmation that the tools we are developing at the Institute are actually useful to this research community and that we are contributing in a positive way.

While trying to enter the university building on Thursday we found ourselves in front of a crowd of protesters (there was a general strike in Brussels that day), so perhaps not everyone agrees with that. I'm not quite sure, though, whether the university itself or the people employed there were the target of their protest, or whether we just happened to be in the wrong place at the wrong time. I would say the latter, but on the list of offices at that particular address I couldn't find any institution that would, in my opinion, be worth protesting against. Anyway, I was not able to understand any of their complaints over the loud fireworks while we were entering the building under the watch of the local riot police.

Posted by Tomaž | Categories: Life | Comments »

Further adventures in Chromebook-land

28.01.2013 21:33

As you might remember from my previous blog post, I have a disassembled ARM-based Samsung Chromebook lying around, occupying various horizontal surfaces that might otherwise be put to better use. After an initial success with replacing the original, feature-challenged OS with the armhf port of Debian Wheezy, I hit a couple of snags. First, I found out that running the computer with a non-Google-signed OS means having to look at an annoying warning message at each boot and having to press Ctrl-D (while being careful not to turn off developer mode by touching any other keys by mistake) or wait for a minute or so. And second, by carelessly playing around with alsamixer, I managed to get the left built-in speaker to melt through the bottom casing of the laptop.

Naturally, the smart decision would at that point be to return the laptop to the shop and demand my money back. Of course, I chose the other way.

Chromebook left speaker close-up

As far as the speaker is concerned, it's quite beyond repair. While the body (which I guess is kind of a resonance chamber?) and the magnet are quite unharmed, the membrane and the coil (made with a piece of flexible PCB as far as I could see) ended up in a puddle of molten plastic. I suppose the other speaker is still working correctly, but I didn't test it as I have it disconnected from the motherboard for now.

Chromebook motherboard, top side

Once you remove the human interface, the business part of the laptop is a surprisingly small motherboard, containing little more than the Exynos system-on-chip surrounded by a bunch of memory chips and power supply circuits.

Chromebook motherboard, bottom side

The problem with the annoying bootloader turned out to be harder to solve than I thought. As I understand the boot process, the CPU first runs pre-boot code (apparently some proprietary initialization code from Samsung). This then loads the secondary program loader, which in turn loads U-Boot. U-Boot then annoys you and goes on to load anything you want from the SSD, provided you are in developer mode, of course. I'm not sure about the first two parts, but U-Boot is stored on a Winbond 25Q32DW series serial flash chip with an SPI interface.

Serial flash IC on Chromebook motherboard

This chip has an active-low write-protect pin. The pin is pulled low by default somehow, which prevents the main Exynos CPU from writing to the flash. It doesn't seem to be tied to ground though, so I'm guessing it might be controlled from the embedded system controller (or a GPIO pin on the Exynos, but that would be kind of stupid). If you browse Google's documentation there are some mentions of a mysterious servo2 debug board that apparently allows you to overwrite the flash and even boot the computer if the flash is corrupted. I haven't been able to find any details about it, not even where you are supposed to connect it. There are no special debug connectors on the laptop's motherboard as far as I can see, so either it plugs into one of the externally accessible connectors (USB, HDMI, SD card, audio) and does some magic through there (possibly with the help of the ESC), or servo2 refers to a special version of the motherboard that has some additional debug capabilities.

In any case, without the debug board's magic, replacing the bootloader doesn't look simple. I can rewire the write-protect pin, but that will give me exactly one try at programming the flash. If I botch it, the Exynos will crash on boot and I won't get another chance. I'm not sure I'm capable of desoldering the flash chip without destroying it, reprogramming it externally and soldering it back without messing up any of the tiny SMD components around it. There does seem to already be an Arduino-based programmer available for these chips though, so at least I would be spared the task of coding that myself.

I've built the Chromium OS development environment, which can also be used to build the flash images. While everything seems to build without problems, I'm kind of confused as to whether images built in this way still include the annoying warning or not. The build process itself turned out to be quite convoluted, involving surprising amounts of complex Python code (what's wrong with Makefiles?) hidden behind Gentoo's Portage scripts, and it has so far resisted my attempts to find out how the flash image is actually constructed.

Unfortunately, while poking inside the laptop I managed to add a third problem on top of the previous two. While re-attaching the copper cooling plate my screwdriver slipped and shattered one of the tiny bare-die flip-chip packages around the STM32F10086 controller.

Shattered flip-chip component

These seem to be bare silicon soldered directly to the PCB without any kind of packaging, and they are surprisingly brittle. From what's left of it and by looking at similar components on the board, I'm guessing the laser-etched back-side marking originally said 2822HN. I'm not sure what its function was and I can't find any references on the web for these components (the other type of similar component used on the motherboard is 28DCV7). Perhaps a discrete logic gate? In any case, it's quite beyond my capabilities to replace, even if I managed to get a replacement part.

Surprisingly, even with this component broken, the laptop still boots. I haven't yet tested whether any of the peripherals stopped working. One effect seems to be that the power button is now kind of unreliable, taking several presses before the computer turns on. But that might not be related - with the casing open, everything is kind of wobbly and I wouldn't be surprised if the keyboard isn't properly supported in this setup.

In any case, this Chromebook seems to be a failure as far as replacing my EeePC goes. In this broken form it certainly won't become a computer I can rely on when traveling, even if I manage to replace the bootloader. Might eventually turn out useful for some other project though.

Posted by Tomaž | Categories: Life | Comments »

GPG key transition

13.01.2013 20:22

I've been using the same GnuPG key pair for signing and encrypting my mail since 2001. If you are not using an email client that is OpenPGP-aware you might have noticed that all my electronic correspondence seems to have a piece of robot barf appended at the end. I've been stubbornly insisting on at least signing all of my out-going mail, even for recipients that I know don't use public-key cryptography, in a futile attempt to raise awareness about these things.

This secret 1024-bit DSA/ElGamal pair will soon be 12 years old. It has been moved between many machines and, while I'm quite careful about these things, it's at least probable that in all these years it has leaked somewhere. It's also hopelessly outdated by any modern standard and quite within the reach of modern code-breakers. Listening to the RSA factorization in the real world talk at 29C3 finally reminded me to take the plunge and replace it with a modern 4096-bit RSA key. I've also moved to SHA256 digests, as recommended by Debian. And finally, to prevent the new key from getting this far past its best-before date, I've set the expiry date to 5 years.

So, my new key is:

pub   4096R/0A822E7A 2013-01-13 [expires: 2018-01-12]
      Key fingerprint = 4EC1 9BBE DE7A 4AA1 E6EB  A82F 059A 0D2C 0A82 2E7A

I will be immediately switching all signatures to it. I will not revoke my old key for the next 90 days, but if you encrypt your mail, please use my new key instead. Also, if you got one of my Moo cards recently, please note that the GPG fingerprint on the back side refers to my old key.

You can import my new public key into your key chain by using the following command:

$ gpg --keyserver subkeys.pgp.net --recv-key 0A822E7A

I would appreciate it if you would sign my new key to help integrate it into the web of trust. If you meet me in person in the future, I will probably give you the key fingerprint so you can be sure it's the correct one. Otherwise, if you trust my old key, you can check my official key transition statement, which is signed by both my old and my new key.

Posted by Tomaž | Categories: Life | Comments »

Pinkie sense debugging

12.01.2013 18:29

Here's another story about a debugging session that took an embarrassing amount of time and effort, and confirmed once again that a well thought-out design for debuggability will pay for itself many times over in saved stomach acid and general developer well-being.

As you might remember, the UHF receiver I designed a while back has been deployed on VESNA sensor nodes in a cognitive radio testbed. When the first demos of the deployment were presented, however, it became apparent that nodes equipped with it had a problem: experiments would often fail mysteriously and had to be repeated a few times before valid measurements could be retrieved.

In this particular case, the firmware running on VESNA's ARM CPU implements a very simple, home-brew scheduler. An experimenter sends a list of tasks over the management network to the sensor node and the scheduler attempts to execute them roughly at the specified times. After a while, the experimenter asks the node if the given tasks have been completed and, if they have, requests a download of the recorded data. Often, however, nodes would simply keep replying that the tasks had not been completed, well after the last task should have been concluded.

This kind of problem was specific to nodes equipped with the UHF receiver (not that other nodes don't have problems of their own) and seemingly limited to sensor nodes deployed high on light poles, as the one test article on my desk refused to exhibit this bug.

A hint of what might be going on came when I started monitoring the uptime of the sensor nodes with Munin. When the bug manifested itself, the number of seconds since the CPU reboot fell to zero, making it apparent that the node was resetting itself during experiments. Upon reset the scheduler would forget about the scheduled tasks stored in volatile memory, leaving the non-volatile state queried by experimenters in a perpetual running state. This oversight in the scheduler design, and the fact that the node can't signal errors back to the infrastructure over the management network protocol, already made debugging this issue harder than necessary.

The next step was to determine what was causing these resets. Fortunately, the STM32F103 CPU used on VESNA provides a helpful set of flags in the RCC_CSR register that allows you to distinguish between six different CPU reset reasons. Unfortunately, the bootloader on the deployed nodes clobbers the register, leaving no way to determine its value after a reboot.

VESNA with SNE-ISMTV-UHF in a weather-proof box.

Back at square one, I reasoned that the resets might be hardware related. Since the UHF tuner is quite power hungry, I guessed that it might have something to do with the power supply on the deployed nodes. I also suspected hangs in the STM32's I2C interface, which is supposedly notoriously buggy when presented with marginal signals on the bus. Add the fact that the weather-proof plastic boxes used to house the sensor nodes turned out not to be as resistant to rain as we originally hoped, and hardware-related problems did not seem that far-fetched.

However, poking around the circuit did not reveal anything obviously wrong, and with no way to reproduce the problem in the lab I came to another dead end.

The next breakthrough came when I managed to reproduce the problem on a node that had been unmounted and brought back to the lab. It turned out that on this node a specific command would result in a node reset in around 2 cases out of 100. This might not sound like much, but a real-life experiment would typically consist of many such tasks, adding up to a much higher probability of failure. Still, it took around two hours to reliably reproduce a reset this way, which called for careful monitoring of failure probabilities across physical nodes and firmware versions. This monitoring later proved that all nodes exhibited the bug, even the test article I had initially marked as problem-free.

Having a reproducible test case on my desk, however, did little to help by itself. With no JTAG or serial console available in the production configuration, it was impossible to use an on-chip debugger. It did make it possible to upload a fixed bootloader though, and the cause of the resets was revealed to be the hardware watchdog.

VESNA uses the STM32 independent watchdog, a piece of hardware in the microcontroller that resets the CPU unless a specific register write occurs every minute or so. The fact that the watchdog was resetting the CPU pointed to a software problem. Unfortunately, the wise developers at STM did not provide any way to actually determine what the CPU had been doing before the watchdog killed it. There is no way to get the last program counter value or even a pointer to the last stack frame, and hence no sane way of debugging watchdog-related issues (I did briefly toy with the idea of patching the CPU reset vector and dumping memory contents on reset, but soon decided that does not qualify as a sane method).

This led to another fruitless hunt around the source code for functions that might be hanging the CPU for too long.

Then I noticed that one of the newer firmware versions had a much lower chance of failure - 1 failure in 2500 tries. Someone, probably unknowingly, since I didn't see any commit messages that would tell me otherwise, had already fixed, or nearly fixed, the bug. Since such accidental fixes also tend to be accidentally removed, it still made sense to figure out what exactly was causing the bug, to make sure the fix stayed put. I fired up git bisect and after a few days of testing I came up with the following situation:

git-bisect result

Note that git bisect is being run in reverse here. Since I was searching for a commit that fixed the bug, not one that introduced it, bisect/bad marks a revision with the bug fixed, while bisect/good marks a revision with the bug present.

But if you look closely, you can see that the first revision that fixed it was a merge commit of two branches, both of which exhibited the bug. To make things even more curious, this was a straightforward merge with no conflicts. It made no sense that this merge would introduce any timing changes large enough to trip the watchdog timer. However, the fact that the changes had to do with the integrated A/D converter did curiously point in a certain direction.

After carefully testing for and excluding bootloader and programming issues (after all, I did find a bug once where firmware would not be uploaded to flash properly if it was exactly a multiple of 512 bytes long), I came upon this little piece of code:

ADC_Cmd(ADC1, ENABLE);
while (!(ADC_GetFlagStatus(ADC1, ADC_FLAG_EOC)));

Using STM's scarily-licensed firmware library, this triggers a one-shot A/D conversion and waits for the end-of-conversion flag to be set by the hardware. When I carefully added a timeout to this loop, the bug disappeared in all firmware versions that had previously exhibited it. I say carefully because at this point I was not trusting the compiler either, which means I first added timeout code to the loop with a larger-than-realistic timeout value, made sure the bug was still there, and only then changed the timeout value without touching the code itself.

Now it would be most convenient to blame everything on a buggy microcontroller peripheral. Thus far most clues seem to point to the possibility that some minor timing issues during ADC calibration and turn-on may cause the ADC to sporadically hang or take far longer than expected to finish a conversion (turning off the watchdog did not usually result in a hang).

But even if that is the case (and I'm still not completely convinced, although this case is now closed as far as I'm concerned), this journey mostly showed what happens when you debug a 40,000-line embedded code base with little to no internal debug tooling at your disposal. It's a tremendous time sink and takes careful planning and note-taking - I'm left with pages of calculations I made to make sure tests ran long enough to give a good probability of a correct git-bisect result, given the prior probabilities of encountering the bug.
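To give a flavor of those calculations, here is roughly the kind of estimate involved - my own reconstruction, not the original notes: if a revision that still contains the bug fails each run with probability p, how many clean runs are needed before declaring it fixed with reasonable confidence?

# Consecutive clean runs needed before a revision can be declared bug-free
# with the given confidence, assuming a buggy revision fails each run
# independently with probability p.
import math

def runs_needed(p, confidence=0.99):
    # P(no failure in n runs | bug present) = (1 - p)**n <= 1 - confidence
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))

print(runs_needed(2 / 100.0))    # ~228 runs at the 2-in-100 failure rate
print(runs_needed(1 / 2500.0))   # over 11000 runs at the 1-in-2500 rate

At a couple of hours per reliably reproduced reset, numbers like these add up very quickly.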

So, if you made it this far and are writing code, please make sure proper error reporting facilities are your top priority. Make sure you fail as early as possible and give as much feedback as possible to people who might be trying to debug things. Try to be reasonably robust to failures in code and hardware outside of your control. And above all, make absolutely sure you don't interfere with any kind of debugging facilities your platform provides. As limited as they might appear to be, they are still better than nothing.

Posted by Tomaž | Categories: Life | Comments »