Update on CC2500 deterioration

25.03.2013 15:32

Last November I wrote about an issue with 2.4 GHz transceivers that were deployed on street lights as part of Jožef Stefan Institute's testbed for radio communications. Some radios based on the Texas Instruments CC2500 integrated circuit seemed to degrade after they were mounted outdoors. In the most extreme case, receiver sensitivity dropped by almost 30 dB and maximum transmit power fell to insignificant levels.

The leading theory at the time was that some component had deteriorated over time due to environmental effects. After some initial testing it appeared that the coaxial cable between the radio PCB and the connector was to blame. However, having more boards with this issue in the lab showed that in most cases it's not possible to restore the original characteristics just by changing the coaxial cable.

A pile of SNE-ISMTV boards and pigtail cables.

I then fired up the soldering iron and took a brute-force approach: on three bad radio boards I manually replaced components one by one, performing automated tests of the board after each replacement. While replacing the coaxial cable did have a small effect (0.3 dB in the best case), only replacing the CC2500 integrated circuit itself seemed to restore the correct performance. So the suspicion immediately fell on the transceiver chip itself.

Since I did not exhaustively test all radios for sensitivity and transmit power before deployment, I couldn't be sure whether the radios had been bad from the start or whether they had degraded due to the environment. The fact that only radios deployed outdoors had this issue pointed to the second explanation, but since more radios were deployed outdoors than indoors, that could have been a coincidence as well.

As a test, I also brutally heated up one of the newly replaced CC2500 chips with a hot air gun. After this test, the radio board showed symptoms similar to those of the boards that were removed from the testbed as defective: the digital part of the IC still functioned correctly, but the receiver's sensitivity dropped permanently by 33 dB. Unfortunately, I did not record the temperature to see how much heating is actually required.

To test the other theory, that the radios degrade gradually in the field, I set up a long-running experiment on the testbed that recorded received signal strength between neighboring pairs of newly replaced radios 4 times a day between December 2012 and March 2013. If the radios were deteriorating over time, I should be able to see the signal strength drop during these three months.

Here are two typical plots of these measurements over time:

Long term RSSI measurement between nodes 12 and 15.

Long term RSSI measurement between nodes 2 and 25.

As you can see, while there is a lot of variation in the signal strength, there is no obvious downward trend. The variation is probably due to weather and/or large objects moving around the radios (these are mounted in an industrial zone, so moving trucks and other such things are not an uncommon occurrence).
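
The plots above were judged by eye. One way to put a number on "no obvious downward trend" would be to fit a straight line to the RSSI samples and look at its slope. The following is only a minimal sketch of such a fit, not the analysis used for the plots; the function name `trend_slope` and the choice of units (for instance days on the x axis and dB on the y axis) are assumptions made for illustration.

```c
#include <stddef.h>

/* Least-squares slope of y over x (for example dB per day).
 * Returns 0.0 when the slope is undefined, that is with fewer than
 * two samples or when all x values are equal. */
double trend_slope(const double *x, const double *y, size_t n)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;

    for (size_t i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }

    double denom = (double)n * sxx - sx * sx;
    if (n < 2 || denom == 0.0)
        return 0.0;

    return ((double)n * sxy - sx * sy) / denom;
}
```

A slope that stays close to zero compared to the day-to-day variation would support the visual impression, while a steadily deteriorating radio should show up as a clearly negative slope.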

So, it currently looks like we either mounted already defective radios or they were damaged by a one-time event after deployment (which is hard to imagine). The fact that overheating damages radios in a similar way may point to an error in manufacturing, although that's not a popular opinion, since these boards were soldered using leaded solder and radio ICs are supposed to support a RoHS process that involves higher temperatures. It's also theoretically possible that the radios were damaged during summer (the test above was obviously done over the winter months), although again it's very unlikely that the sun would overheat the radios.

Posted by Tomaž | Categories: Analog | Comments »

Blinken Kindle

21.03.2013 21:50

Kindle 3 has a two-color LED in the power button. With original software, you can see it glow briefly when you slide the button or when the battery is charging. A Kindle running Debian of course allows many more interesting possibilities.

The two-color LED is actually composed of two separate LEDs that are controlled separately: an amber and a green one.

The amber LED is lit when the battery is charging. Currently it is not clear to me how exactly it is controlled, but it seems to involve both the hardware-controlled charger LED pin of the MC13892 power management IC and a software-controlled GPIO. The relevant code starts at line 929 in arcotg_udc.c. Note that the pmic_green_led_disable() function forces the MC13892 pin state (which, as I understand it, actually affects the amber LED, not the green one).

In any case, there doesn't seem to be any user-space interface available for controlling this LED. It's turned off by default.

The green LED is much easier to control. It's connected directly to one of the signaling LED drivers on the MC13892 and there's a convenient, if somewhat weird, interface for those in sysfs. The relevant code is in pmic_light.c.

All operations are made through the /sys/devices/platform/pmic_light.1/lit file. You can write statements in the form of "command value" to this file to control the LED drivers. Supported commands are:

  • ch - select channel on which subsequent commands will operate. Kindle's LED uses channel 4, so before issuing any other commands, always write ch 4. Better not touch other channels.
  • cur - set LED current. Can be set from 0 to 7 to control LED brightness. Kindle's built-in software uses 7.
  • dc - set PWM duty-cycle. Can be set from 0 to 32. 0 means constantly off, 32 means constantly on.
  • bp - set PWM frequency. Can be set from 0 to 3. At 0, the frequency is high enough for blinking to be invisible to the human eye and the duty-cycle setting can be used as an additional brightness control. On higher settings, the PWM output is seen as blinking (see the MC13892 datasheet for exact frequencies).
  • ra - Either 0 or 1. If set to 0, PWM duty-cycle changes are immediate. If set to 1, changes happen via a hardware generated ramp, making smooth visual transitions.

As an example, here's what I have in /etc/rc.local to turn off the green LED at boot:

echo "ch 4" > /sys/devices/platform/pmic_light.1/lit
echo "cur 0" > /sys/devices/platform/pmic_light.1/lit
echo "dc 0" > /sys/devices/platform/pmic_light.1/lit
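
The same interface can also be driven from a program. The following is a minimal C sketch that makes the green LED blink using the commands listed above. The helper names `pmic_led_write` and `pmic_led_blink` are hypothetical, and the particular values (`bp 3`, `dc 16`) are just one choice within the ranges described earlier. Like the shell lines, it opens the file and writes one statement per command.

```c
#include <stdio.h>

/* sysfs control file named in the text above. */
#define PMIC_LIT_PATH "/sys/devices/platform/pmic_light.1/lit"

/* Write one "command value" statement to the control file.
 * Returns 0 on success, -1 on failure. */
int pmic_led_write(const char *path, const char *command, int value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%s %d\n", command, value) > 0;
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Make the LED blink at half duty-cycle. Returns 0 on success. */
int pmic_led_blink(const char *path)
{
    if (pmic_led_write(path, "ch", 4) != 0)  return -1; /* Kindle's LED channel */
    if (pmic_led_write(path, "cur", 7) != 0) return -1; /* maximum LED current */
    if (pmic_led_write(path, "bp", 3) != 0)  return -1; /* higher setting, visible blinking */
    if (pmic_led_write(path, "ra", 0) != 0)  return -1; /* no hardware ramp */
    return pmic_led_write(path, "dc", 16);              /* 16 of 32, on half of the time */
}
```

Calling `pmic_led_blink(PMIC_LIT_PATH)` as root should then start the blinking. Taking the path as a parameter keeps the helpers easy to try against an ordinary file first.
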
Posted by Tomaž | Categories: Code | Comments »

Contiki and libopencm3 licensing

19.03.2013 18:08

At the beginning of March a discussion started on the Contiki mailing list about merging a pull request by Jeff Ciesielski that added a port of Contiki to STM32 microcontrollers using the libopencm3 Cortex M3 peripherals library. The issue raised was the difference in licensing. While Contiki is available under a permissive BSD-style license, libopencm3 uses the GNU Lesser General Public License version 3. Jeff's pull request was later reverted as a result of this discussion. It was similar to my own effort a while ago, which was also rejected because of the libopencm3 license.

Both the thread on contiki-devel and the later one on libopencm3-devel might be an interesting read if you are into open source hardware, because they exposed some valid concerns regarding firmware licensing. Two topics got the most attention. First, if you ship a device with proprietary firmware that uses an LGPL library, what does the license actually require from you? And second, is the anti-tivoization clause still justified outside the field of consumer electronics?

I'll try to summarize my understanding of the discussion and add a few comments.

Only the libopencm3-using STM32 port of Contiki would be affected by the LGPL. Builds for other targets would be unaffected by the libopencm3 license and would still be BSD licensed, since their binaries would not be linked in any way with libopencm3. Still, it was seen as a problem that not all builds of Contiki would be covered by the same license. Apart from the added complexity, I don't see why that would be problematic. FFmpeg is an example of an existing project that has been operating in this way for some time now.

The LGPL requires you to distribute any changes to the library under the same license and to provide a means of using your software with a different (possibly further modified) version of the library. The second requirement is simple to satisfy on systems that support dynamic linking. However, dynamic linking is very rare in microcontroller firmware. In that case, at the very least you have to provide binary object files for the proprietary part and a script that links them with the LGPL library into a working, statically-linked firmware.

I can see how this can be hard to comply with from the point of view of a typical firmware developer. Such linking requires an unusual build process that might be hard to set up in IDEs. Additionally, modern visual tools often hide the object files and linking details completely. With proprietary compilers it might even be impossible to have any kind of portable binary objects. In any case, this is seen by some as enough of a hurdle to make reimplementation of LGPL code easier than complying with the license.

From this point of view, the GPL and LGPL licenses don't seem to differ much in practice (note that libopencm3 already switched from GPL to LGPL to address concerns that it should be easier to use in commercial products). The SDCC project solved this problem by adding a special exception to the GPL.

The other issue was the anti-tivoization clause. This clause was added to the third revision of the GNU licenses to ensure that the freedom to modify software can't be restricted by hardware devices that do cryptographic signature verification. This was mostly a response to the practice in consumer electronics where free software was used to enable business models that depended on anti-features, like DRM, and hence required unmodifiable software to be viable. However, in microcontroller firmware there might be reasons for locking down firmware reprogramming that are easier to justify from engineering and moral standpoints.

The first such case was where software modification can enable fraud (for instance in energy meters) or make the device illegal to use (for instance due to FCC requirements for radio equipment) or both. In a lot of these cases, however, there is a very simple answer: if the user does not own the device (as is usually the case for metering equipment), no license requires the owner to enable software modification or even disclose the source code. Where that is not the case, the technical means are usually only one part of the story. The user can be bound by a contract not to change particular aspects of the device and be subject to inspections. The anti-tivoization clause also does not prevent tampering indicators. However, it might be that in some cases software covered by anti-tivoization simply isn't usable in practice.

The other case was where changed firmware can have harmful effects. Some strong opinions were voiced that people hacking firmware on certain dangerous devices cannot know enough not to be a danger to their surroundings. This is certainly a valid concern, but the question I see is, why suddenly draw the line at firmware modification?

Search the web and you will find cases where using the wrong driver on a laptop can lead to the thing catching fire, which can certainly lead to injuries. Does that mean that people should not be allowed to modify the operating system on their computers? A similar argument was made years ago in computer security, but I believe it has been proved enough times by now that manufacturers of proprietary software are not always the most knowledgeable about their products. I am sure that every device that can be made harmful with a firmware update can be made harmful much more easily with a screwdriver.

In general, artificially limiting the number of people tinkering with your products will limit the number of people doing harmful things, but it will also limit the number of people making useful modifications. A lot of hardware that was found to be easily modifiable has been adopted for research purposes in much fancier institutions than your local hackerspace.

I haven't been involved in the design of any truly dangerous product, so perhaps I can't really have an opinion about this. However, I do believe that the responsibility of a designer of such products ends with clear and unambiguous warnings about the dangers of modification or bypassing of safety features.

Posted by Tomaž | Categories: Life | Comments »

On Kindle power supply

15.03.2013 21:22

Kindle's power supply (still talking about the Kindle 3 that's been turned into a light-weight headless Debian box) is centered around the MC13892. That is a power management integrated circuit (often referred to as PMIC in source code) specifically designed for powering Kindle's Freescale i.MX35 processor from a lithium-ion battery.

MC13892 power management circuit in Kindle 3

The chip itself hides under one of the shiny RF shielded enclosures and is connected to the main CPU over an SPI bus. The MC13892 datasheet reveals a very flexible chip that contains several configurable switching and linear power supplies, handles power on a USB bus and can work with a main and a backup battery. It also has peripherals like a real time clock, touch screen, temperature and light sensor interfaces and several programmable LED drivers.

Actually, it's interesting that Kindle 3 already seems to have much of the hardware needed to implement a touch screen and a screen with a front light even though these features only appeared in much later models. Also interesting to note is that the e-ink has its own separate power supply, so a lot of the MC13892 functionality appears unused.

Talking about unused functionality, the MC13892 also has a Coulomb counter that can accurately integrate battery current to predict the remaining battery life. This also appears unused, as the battery module itself seems to integrate a management circuit with an I2C bus. As far as I can see, the built-in software actually uses information from that instead of the MC13892. libgasgauge.so suggests it might be one of the Texas Instruments products.

Apart from curiosity, I also looked into this topic to find the most convenient way to power my Kindle without a battery attached. I'm powering my device from an outlet, so it doesn't make sense to waste a perfectly good Li-ion battery by keeping it constantly connected to a charger.

Kindle 3 with a power supply adapter attached.

However, as many people on the web with dead Kindle batteries found out, Kindle won't boot with no voltage on the battery connector. Looking into the supply tree in the MC13892 datasheet it's apparent that the battery voltage is the central point from which all other parts are powered. The datasheet also explicitly states that the MC13892 will not power up the CPU unless it detects a valid voltage on the battery, even if power is available from the USB interface.

Unfortunately, this check cannot be fooled by a high-impedance voltage source in place of the battery (I tried), which means that the only way to power it up is to provide a proper voltage source capable of around 100 mA at 3.0 to 4.2 V.

Kindle 3 power supply adapter PCB.

This led me to make this tiny power adapter that attaches to the main PCB instead of the battery. It provides all the power for Kindle's main board, which has the added benefit of freeing up the USB connector for any future hacks (switching to USB host mode would be nice).

My random parts bin contained a small Nokia charger (recently donated as broken) that gives somewhere between 5.5 and 6.0 V under load. Unfortunately, that's a bit too high for Kindle's battery input (absolute maximum rating 4.8 V). Instead of tearing the charger apart and adjusting its voltage feedback, I opted for a small low-drop regulator (also salvaged from a random piece of broken electronics) on the adapter itself to lower the voltage to 4.2 V.

I guess using a dissipative regulator like this somewhat ruins the wonderful power efficiency of Kindle's hardware, but at a few tens of milliamps of typical current draw it hardly gets warm to the touch.

Posted by Tomaž | Categories: Analog | Comments »

Standard C and embedded programming

09.03.2013 15:14

Last week I had a few discussions with my colleagues regarding some finer details of variable types in C and how code generated by GCC behaves in various edge cases. This led me to read a few chapters of the actual C standard and research this topic a bit.

It's surprising how much of the embedded C code I see on a day-to-day basis depends on behaviors that are implementation-defined according to the standard. The most common cases seem to be:

  • Assuming how struct and union types are stored in memory - specifically how the struct elements are aligned in memory and how assigning one element in a union type affects other elements.
  • Assuming how integer types are stored - specifically byte order when casting between pointers to differently sized integers.
  • Assuming what the storage size for enum types is.
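
To make the first and last points concrete, here is a small C sketch that reports what the compiler actually did with a struct instead of assuming it. The `sensor_msg` layout and the `sensor_mode` enum are hypothetical, made up for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout, used only for illustration. */
struct sensor_msg {
    uint8_t  type;
    uint32_t value;
    uint16_t checksum;
};

/* Hypothetical enum; its storage size is chosen by the implementation. */
enum sensor_mode { MODE_IDLE, MODE_ACTIVE };

/* Number of bytes the compiler added beyond the sum of the members. */
size_t sensor_msg_padding(void)
{
    return sizeof(struct sensor_msg)
         - (sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint16_t));
}

/* Offset of `value`; 1 would mean no padding after `type`. */
size_t sensor_msg_value_offset(void)
{
    return offsetof(struct sensor_msg, value);
}
```

On common 32-bit ABIs `value` usually ends up at offset 4 and the struct carries 5 bytes of padding, but nothing in the standard promises that. Similarly, `sizeof(enum sensor_mode)` depends on the compiler and its options (with GCC, for instance, on `-fshort-enums`).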

In the world of software for personal computers, it's more or less universally agreed (at least in the free software community) that these offenses are bad. They lead to unportable software and angry package maintainers who have to deal with bugs that only appear when software is compiled on certain architectures.

With embedded software the situation is less clear. First of all, embedded software is by definition more closely tied to the hardware it is running on. It's impossible to make it completely independent of the architecture. A large fraction of firmware is also developed over a relatively short term: compared to desktop software that can be maintained for years, once a firmware image is finished it is often shipped with the physical product and never updated (even if a possibility for that exists). That means that it's unlikely that the CPU architecture or the compiler will change during the development cycle.

In embedded software you often have to deal with serialization and deserialization of types. For instance, you might be sending structures or multi-byte integers over a bus that only processes one byte at a time. Simply casting a char buffer pointer to a complex type at first glance produces shorter, simpler code than doing it the standard-compliant way with arithmetic operations and assignments to individual struct fields.

But is the code really simpler? The emphasis should always be on code that is easy to read, not easy to write. When making a cast from char * to int * you silently assume that the next pair of eyes knows the endianness of the architecture you are working on.
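
As an illustration of the standard-compliant way, here is a minimal sketch of reading and writing a 32-bit integer byte by byte. It assumes the value travels over the bus most significant byte first, and the names `read_u32_be` and `write_u32_be` are made up for this example. The byte order is spelled out in the code, so the result does not depend on the endianness or alignment rules of the CPU.

```c
#include <stdint.h>

/* Decode a 32-bit unsigned integer stored most significant byte first. */
uint32_t read_u32_be(const uint8_t *buf)
{
    return ((uint32_t)buf[0] << 24)
         | ((uint32_t)buf[1] << 16)
         | ((uint32_t)buf[2] << 8)
         |  (uint32_t)buf[3];
}

/* Encode a 32-bit unsigned integer most significant byte first. */
void write_u32_be(uint8_t *buf, uint32_t v)
{
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
}
```

It is a few lines longer than a pointer cast, but the reader no longer needs to know anything about the target to see which byte goes where.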

There's also the question of crossing the fine line between implementation-defined and undefined behaviors. The former depend only on the architecture (and perhaps the minor compiler version), while the latter can change under your feet in more subtle ways. For instance, I have seen a few cases where results of arithmetic operations that depended on undefined behaviors would change with changes to unrelated code. That leads to heisenbugs and other such beasts you most certainly do not want near your project. Granted, these usually involve more cleverness on the developer's side than the memory storage assumptions I mentioned above. In fact, since these seem to be depended on so often, you could say that some of the struct and union storage details are a de facto standard these days.

So what's the verdict here? I admit I'm guilty of using some of these tricks myself. In the most extreme case they can save you hundreds of lines of boilerplate code that just shuffles bytes back and forth. And down in the lonely trenches of proprietary firmware development, code reusability can be such a low concern that it doesn't matter if your code even compiles the day after the deadline.

Before starting a project it makes sense to take a step back and set a policy depending on the predicted lifetime of the code. I do think though that sticking to portable code is a must if you want to publish your work under an open source license. With the ever growing open source hardware community, sharing drivers between platforms is getting more and more common. Even with proprietary projects, I'm sure the person down the line who will be debugging your code will be happier if she doesn't have to needlessly delve into the specifics of the processor architecture you are using.

Posted by Tomaž | Categories: Code | Comments »

Embedded modules

02.03.2013 20:56

I've written before about problems with VESNA deployments that have come to consume large amounts of time and nerves. Several of these have come in turn from two proprietary microprocessor modules we use: Digi Connect ME for Ethernet connectivity and Atmel SerialNet for IEEE 802.15.4 mesh networking.

One of these issues, which now finally seems to be just on the brink of being resolved, has been dragging on since late summer last year. We have deployed several Digi Connect ME modules as parts of gateways between the IEEE 802.15.4 mesh in clusters of VESNA nodes and the Internet. One of the deployments has proved especially problematic. Encrypted SSL connections from the module would randomly get dropped and reconnect only after several hours of downtime.

The issue at first proved impossible to reproduce in a lab environment, and since the exact same device worked on other networks, the ISP and the firewall performing NAT were blamed. However, several trips to the location and many packet captures later I could find no specific problem with TCP or IP headers that I could point my finger at. We replaced a network switch with no effect. Later, by experimenting with Digi Connect TCP keep-alive settings, a colleague found a setting that caused the dropped connection to be re-established immediately instead of causing hours of downtime, making the deployment at least partially useful.

Finally, last week I managed to reproduce the problem on my desk. I noticed that the TCP connections from that location had an unusually low MSS, just 536 bytes. By simulating this I could reliably reproduce connection drops, and by experimenting further I found out that SSL data records fragmented in a particular way cause the module to drop the connection. It was somewhat specific to the Java SSL implementation we used on the other end of the connection and very unlikely to happen with other connections that used larger segment sizes.
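
The post does not say how the low MSS was simulated. One possible way, sketched here under the assumption of a Linux or other POSIX test machine, is to set the standard `TCP_MAXSEG` socket option on the test peer's socket. The helper name `clamp_mss` is made up for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the TCP stack to limit the maximum segment size on this socket.
 * Setting it before the connection is established also lowers the MSS
 * announced to the peer. Returns 0 on success, -1 on error. */
int clamp_mss(int fd, int mss)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, sizeof mss);
}
```

A test peer would call `clamp_mss(fd, 536)` on a freshly created `SOCK_STREAM` socket before `connect()` or `listen()`, so that application data such as SSL records is split into segments as small as those seen at the problematic location.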

The cause of the issue was therefore in the Digi Connect module. Before having a reproducible test case I hadn't even considered the possibility that a change on the link layer somewhere along the route could trigger a bug at the application layer.

After I had that piece of information, a helpful member of the support forums quickly provided a solution. The issue itself, however, is not yet resolved, since the change in the firmware broke all sorts of other things which now need to be looked into and fixed as well.

I can't say that all of our hard-to-solve bugs came from Digi Connect or Atmel modules. We caused plenty ourselves. But having now experienced working with these two fine products, my opinion is that less time would be wasted if we went for a lower-level solution (just an interface on the physical layer) and then used an open source network stack on top. It would take more time to get to a working solution but I think problems would be much easier to diagnose and solve than with what is essentially a magical black box.

Both Digi Connect and Atmel modules suffer from the fact that they hide some very complex machinery behind a very simplistic interface. Aside from the problem of leaky abstractions, when the machinery itself fails, they provide no information that would help you work around the problem (solving it is out of the question anyway because of proprietary software). Both also come with documentation that is focused on getting a working system as fast as possible, but lacks details on what happens in corner cases. These are mostly left to your imagination and experiments and as experience has shown, behavior can change between firmware revisions. In most cases you can't even practically test against these changes, since that would involve complicated hardware test harnesses.

Posted by Tomaž | Categories: Life | Comments »