Update on CC2500 deterioration, 2

25.11.2013 19:53

Continuing the story from my last post about the deterioration of Texas Instruments CC2500 transceivers, here is one new data point I can share.

Sensor node 5, with CC2500 transceiver board serial 01042, was mounted at an outdoor location on 7 November 2012 and unmounted on 18 November 2013. Before it left my office and after it returned, I measured the relationship between the power at the antenna interface indicated by the CC2500 RSSI register and the actual signal power, as reported by a calibrated Rohde & Schwarz SMBV vector signal generator. In both cases +9.50 dB was added to the indicated power, which was the calibration value for this transceiver board. According to my testing, RSSI offsets between +9.50 and +11.50 dB are typical for this series of boards.
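
For reference, this is roughly how the raw RSSI register value translates to power at the antenna input. The conversion formula is from the CC2500 datasheet; the 72 dB datasheet offset is the figure given for 250 kBaud, so treat the constants here as an illustration rather than the exact values used on the testbed:

    #include <stdint.h>
    #include <stdio.h>

    /* Datasheet RSSI offset for the CC2500 depends on the data rate;
     * 72 dB is the figure given for 250 kBaud. */
    #define RSSI_OFFSET_DB  72.0
    /* Per-board calibration, +9.50 dB for transceiver board 01042. */
    #define BOARD_CAL_DB     9.50

    /* Convert the raw 8-bit RSSI register value (two's complement,
     * 0.5 dB steps) to calibrated power at the antenna interface. */
    static double rssi_to_dbm(uint8_t rssi_raw)
    {
        double rssi_dbm;

        if (rssi_raw >= 128)
            rssi_dbm = ((double)rssi_raw - 256.0) / 2.0 - RSSI_OFFSET_DB;
        else
            rssi_dbm = (double)rssi_raw / 2.0 - RSSI_OFFSET_DB;

        return rssi_dbm + BOARD_CAL_DB;
    }

    int main(void)
    {
        /* Two example register readings, printed as calibrated power. */
        printf("0x30 -> %.1f dBm\n", rssi_to_dbm(0x30));
        printf("0x80 -> %.1f dBm\n", rssi_to_dbm(0x80));
        return 0;
    }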

You can see the difference between these two measurements in the graph below. The sensitivity of the receiver dropped by nearly 40 dB sometime during the last year. This directly contradicts the theory that the bad sensitivity was caused by overheating during reflow soldering.

Change in CC2500 indicated power between 2012 and 2013.

Another hint at what happened can be seen in the following log of in-situ RSSI measurements. The graph shows the received signal strength at neighboring node 3, listening to transmissions from this transceiver board. Obviously something happened to this board in the first week of February that drastically decreased its transmission power (although it isn't visible in the graph above, the transmission power also decreased by a similar amount between the two times I had this node on my desk).

Long term RSSI measurement between nodes 5 and 3.

Curiously, similar measurement data for the reverse direction (that is, where transceiver 01042 would listen to transmissions from its neighbor) stops on the same day in February, due to what looks like an SD card failure on sensor node 5.

So, from this new evidence I can conclude that at least in this one case:

  • the failure appeared while the node was mounted on a light pole,
  • the change was not gradual but instantaneous,
  • it likely happened together with the failure of other components.

As Iggy commented on my last post on this topic, a humidity problem would still fit this description (as water could easily break the SD card interface). Unfortunately we don't seem to have any suitable ovens available at the department to try baking the bad boards to see if this change is reversible (I'm a bit partial to trying this in my kitchen). As I am currently focused on UHF receivers this problem is not high enough on my priority list to try and coordinate such an experiment with some other institution. Unless of course somebody else is interested in investigating this issue - in that case I would be happy to provide a pile of bad transceiver boards.

I might actually just try the reverse and plop a few good boards into a glass of water overnight to see if that has a similar effect on them.

Posted by Tomaž | Categories: Analog | Comments »

VESNA reliability and failure modes

23.11.2013 22:08

As you might know from my previous writings and talks, the Jožef Stefan Institute runs an experimental wireless communications testbed as part of the European FP7 CREW project. The testbed is unimaginatively called Log-a-tec, since it is located in Logatec, a small town around 30 km from Ljubljana. It consists of 54 VESNA devices mounted outdoors on street lights.

Wireless sensor node in the Log-a-tec testbed.

Each node has a 24-hour power supply, but no wired communication lines to other nodes. Instead, it has three separate radios. One of them is used to connect to a ZigBee mesh network that is used for management purposes. The other two are used to set up experimental networks and to perform various measurements of radio frequency spectrum usage.

The testbed is divided into three separate clusters. One ZigBee coordinator node per cluster provides a gateway from the mesh network to the Internet.

Combined map of the Log-a-tec testbed.

The testbed was deployed in steps around June 2012. It has been operating continuously since then, and while its reliability has been patchy at best, it has nevertheless supported several experiments.

In the near future we are planning the first major maintenance operation. Nodes that have failed since deployment have already been unmounted. They will have their failed components replaced and will eventually be mounted back in their positions on the street lights. I therefore think now is the perfect time to look back at the last year and a half and see how well the testbed has been doing overall.

First, here are some basic reliability indicators for the period between August 2012 and November 2013:

  • Average availability of nodes (ping): 44.6%
  • Average time between resets (uptime): 26 days
  • Number of nodes never seen on the network: 24% (13 of 54)

The following two graphs show availability and uptime for each individual node, colored by cluster. The 13 nodes that have never been seen on the network are not shown (they have 0% availability and zero uptime). Also note that when a cluster's coordinator (node00) was down, that usually meant the whole cluster was unreachable.

VESNA outdoor node availability from August 2012 to November 2013

VESNA outdoor node uptime from August 2012 to November 2013

I have also been working on diagnosing specific problems with failed nodes. Unfortunately, because the work was sometimes rushed due to impending deadlines, my records are not as good as I would wish. Hence I can't easily give an exact breakdown of how much downtime was due to which problem. If at some point I find the time to go through my mail archive and gather all my old notes, I might write a more detailed report.

However, since I am getting a lot of questions about what exactly went wrong with the nodes, here is a more or less complete list of the problems I found, divided between those that occurred only once and those that recurred.

A box of unmounted VESNA sensor nodes.

Recurring failures, ordered roughly by severity:

  • Broken boxes. VESNA nodes have been mounted in boxes certified for outdoor use. Nevertheless, a lot of them have cracked since deployment. This often resulted in condensation, and in at least one case a node was submerged in water. Many of the other failures on this list were likely indirect consequences of this.
  • I have already written about problems with Atmel ZigBit modules. While the intermittent serial line problems have mostly been worked around, persistent corruption of the ZigBit firmware was one of the most common reasons why a node would not be reachable on the network. A corrupted ZigBit module does not join the mesh and requires firmware reprogramming to restore, something that cannot be done remotely.
  • There have been some problems with an old version of our network driver that would sometimes fall into an infinite loop while still resetting the watchdog (see the sketch after this list). Since we have no means of remotely resetting a node in that case, this bug caused a lot of downtime in the early days of the deployment. It proved so hard to debug that I ended up rewriting the problematic part of the code from scratch.
  • Texas Instruments CC-series transceiver degradation. While this has not resulted in node downtime (and is not counted in the statistics above), it has nonetheless rendered several nodes useless for experiments.
  • Failed microcontroller flash. Due to an unfortunate design decision, VESNA's bootloader reprograms a block of flash on each boot. For nodes that were rebooting frequently (often because of other problems), this feature commonly resulted in stuck bits and a failed node.
  • Failed SD card interface. For mass storage, VESNA uses an SD card and on several nodes it has become inoperable. Since the SD card itself can still be read on another device, I suspect the connector (which was not designed for outdoor use).
  • Failed MRAM interface. In addition to the SD card, there is a small amount of non-volatile MRAM on board, and on several nodes it has failed for an unknown reason.
  • People unplugging UTP cables and other problems with Internet connectivity at the remote end beyond our control.
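
To illustrate the watchdog bug mentioned in the list above, here is a minimal sketch of the anti-pattern and one way to avoid it. The function names are hypothetical stand-ins, not the actual VESNA driver code:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical hardware stubs; the event fires after a few polls
     * here only so that the demo terminates. In the real failure the
     * condition never became true. */
    static int polls;
    static bool radio_event_pending(void) { return ++polls > 5; }
    static void watchdog_reset(void) { /* kick the hardware watchdog */ }

    /* The anti-pattern: if the event never arrives, this spins forever,
     * yet keeps kicking the watchdog, so the node never gets the
     * hardware reset that would otherwise recover it. */
    static void wait_for_event_broken(void)
    {
        while (!radio_event_pending())
            watchdog_reset();   /* defeats the point of the watchdog */
    }

    /* Safer: bound the wait and report failure instead of spinning
     * forever. If this still hangs, the watchdog is free to bite. */
    static int wait_for_event_bounded(uint32_t max_polls)
    {
        while (!radio_event_pending()) {
            if (max_polls-- == 0)
                return -1;      /* let the caller recover or reboot */
            /* note: no watchdog_reset() inside the wait loop */
        }
        return 0;
    }

    int main(void)
    {
        wait_for_event_broken();
        polls = 0;
        printf("bounded wait: %d\n", wait_for_event_bounded(100));
        return 0;
    }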

One-time failures:

  • Digi Connect ME module SSL implementation bug.
  • Failed Ethernet PHY on a Digi Connect ME module. While these two problems only occurred once each, they were responsible for a lot of downtime for the whole City center cluster.
  • Failed interrupt request line on a CC1101 transceiver. The reason is unknown; it could be bad soldering.

Posted by Tomaž | Categories: Life | Comments »

Origin of frequency division

17.11.2013 18:35

The most basic feature of radio communication, practically since its invention, has been the division of the electromagnetic spectrum between different users on the basis of different sine wave frequencies. In fact, the term radio spectrum is basically synonymous with this division, and the first question about any kind of radio is usually what frequency it operates on.

After working in the Department of Communication Systems for most of the past two years, I began to wonder what the original reason behind frequency division actually is. It's one of those fundamental questions that sometimes pop into your head to keep your mind off more depressing topics.

The color spectrum rendered into the sRGB color space.

Image by Spigget CC BY-SA 3.0

The classical electromagnetic field theory gives a wave equation in empty space that does not favor the sine wave over any other kind of wave function. Similarly, a wave shared between transmitters can be decomposed into multiple independent channels based on any one of an infinite number of orthogonal function families. Again, there is no preference for the sine and cosine functions and the Fourier decomposition that is ubiquitous in radio communication theory.
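
To make this concrete, any family of functions that satisfies the usual orthogonality condition works equally well in principle:

    \langle \varphi_m, \varphi_n \rangle = \int \varphi_m(t)\,\varphi_n^*(t)\,dt = \delta_{mn}

    s(t) = \sum_n a_n \varphi_n(t), \qquad a_n = \langle s, \varphi_n \rangle

The Fourier decomposition is simply the special case \varphi_n(t) = e^{j 2 \pi f_n t}.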

In fact, a lot of recent technologies, for example UMTS, the third-generation successor to GSM, sub-divide their channels using orthogonal functions other than sine waves. However, this is done only after first filtering the radio signal based on sine wave frequencies.
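
As a toy illustration, here is a sketch that builds Walsh-Hadamard sequences, the kind of orthogonal codes used for channelization in CDMA-based systems, and prints their pairwise dot products to verify orthogonality:

    #include <stdio.h>

    #define N 8  /* code length, must be a power of two */

    int main(void)
    {
        int h[N][N];
        int i, j, k, dot;

        /* Build the Hadamard matrix by Sylvester's recursive
         * construction: H(2n) = [ H(n) H(n); H(n) -H(n) ].
         * Its rows are the Walsh codes. */
        h[0][0] = 1;
        for (k = 1; k < N; k *= 2) {
            for (i = 0; i < k; i++) {
                for (j = 0; j < k; j++) {
                    h[i + k][j]     =  h[i][j];
                    h[i][j + k]     =  h[i][j];
                    h[i + k][j + k] = -h[i][j];
                }
            }
        }

        /* Any two distinct rows have zero dot product. */
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                dot = 0;
                for (k = 0; k < N; k++)
                    dot += h[i][k] * h[j][k];
                printf("%3d", dot);  /* N on the diagonal, 0 elsewhere */
            }
            printf("\n");
        }
        return 0;
    }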

The electromagnetic field in a practical, Earth-based environment, however, does favor a division of signals based on sine waves. One classical reason is that objects that appear in the path of radio waves only come in a certain range of sizes. Diffraction and other such phenomena are mostly governed by the relationship between wavelength and obstacle size. This means that sine waves of certain frequencies will have more favorable propagation properties than others. Hence it makes sense, for instance, to use a frequency band with better propagation for longer-range applications.
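
The relevant scale here is the wavelength. A quick back-of-the-envelope comparison:

    \lambda = \frac{c}{f} \qquad \lambda(100\,\mathrm{MHz}) = 3\,\mathrm{m} \qquad \lambda(10\,\mathrm{GHz}) = 3\,\mathrm{cm}

A building is many wavelengths across at 10 GHz but only a few at 100 MHz, so the same obstacle diffracts the two bands very differently.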

Another reason why it is natural to treat electromagnetic waves as a sum of sine functions comes from quantum mechanics and the fact that frequency determines the photon energy. The size of the energy quanta determines how the field can interact with matter in its path, and this again affects atmospheric path loss in different frequency bands.
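
As a worked example of the energy scale involved:

    E = h f \qquad E(1\,\mathrm{GHz}) = 6.63 \times 10^{-34}\,\mathrm{J\,s} \cdot 10^{9}\,\mathrm{Hz} \approx 6.6 \times 10^{-25}\,\mathrm{J} \approx 4.1\,\mu\mathrm{eV}

This is far below typical molecular transition energies; only toward the top of the radio spectrum do the quanta start to match absorption lines, such as those of water vapor and oxygen.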

Early radio receiver.

While the physics of radio propagation gives valid reasons why a transmission should be limited to a particular part of the electromagnetic spectrum, it doesn't explain the use of relatively narrow-band transmissions. The radio spectrum generally spans from 3 kHz to 300 GHz, while most communication technologies currently top out at around 100 MHz per channel.

The historical reason frequency division was originally used is that the natural response of most electromagnetic and mechanical systems is a harmonic oscillation. Such oscillators can be conveniently used as signal filters to extract a channel from a shared medium.
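
The simplest example is the resonant LC circuit found in early radio receivers. Its natural response singles out one frequency, with selectivity given by the quality factor:

    f_0 = \frac{1}{2 \pi \sqrt{L C}} \qquad Q = \frac{f_0}{\Delta f}

where \Delta f is the -3 dB bandwidth: a higher-Q resonator passes a narrower slice of the spectrum.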

Modern systems that use other kinds of orthogonal functions for multiplexing rely on digital processing for signal filtering. Only in recent history has digital processing been able to handle signals with significant bandwidth. That left analog filters, and hence frequency division, as the only option for multiplexing in the early days of radio. We are still a long way off from being able to process 300 GHz of radio spectrum digitally.

Another problem with purely digital processing is that passive analog filters can have a much higher dynamic range than A/D converters or even active analog filters. The range between the noise floor and coils melting or quartz crystals shattering is significantly wider than the linear range of transistors. The ability to extract a weak signal in the presence of another transmission with much higher power is crucial in radio technology, where it's not unusual to see power differences well over 10 orders of magnitude. That is why even state-of-the-art software defined radios have front-end signal conditioning implemented in analog technology.
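
To put rough numbers on that claim: 10 orders of magnitude in power is 100 dB, while even an ideal A/D converter falls short of that by the usual quantization-noise estimate:

    10 \log_{10}\!\left(10^{10}\right) = 100\,\mathrm{dB} \qquad \mathrm{SNR}_{\mathrm{ideal}} \approx 6.02\,N + 1.76\,\mathrm{dB} \approx 86\,\mathrm{dB} \quad \mathrm{for}\ N = 14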

The only current technology I know of that largely does away with frequency multiplexing is ultra-wideband (UWB). It is, of course, still frequency band limited, partly because of the propagation physics mentioned above and partly artificially, to minimize interference with other technologies that share the same frequency band. However, with effective bandwidths in the gigahertz range, it depends on frequency division much less than conventional technologies do. Unfortunately I don't know the details of UWB implementations, so I don't know how they overcome the dynamic range problem and other technological limitations.

Posted by Tomaž | Categories: Ideas | Comments »

Automatic gain control in TDA18219HN

05.11.2013 20:11

One of the first things I did at the Jožef Stefan Institute was to design a small, compact UHF receiver around the TDA18219HN chip from NXP. The spectrum sensing requirements at the time only called for precise radiometric measurements of incident signal power. Now, however, it is time to move on to more advanced detection methods, and it would be nice if my hardware could capture the actual signal waveform instead of just its amplitude. Because of that, I have spent quite some time recently working on a new version of the receiver.

As usual, things are not going as well as I hoped. In some cases the output signal from the tuner is badly distorted, something I did not notice when all I was interested in was the signal amplitude. It looks like a problem with automatic gain control, so I dug out as much as possible from the little documentation that is available on this chip and did some measurements of my own.

Automatic gain control stages in TDA18219HN tuner.

Image by NXP

As the diagram in the datasheet shows, this chip has seven stages with variable gain. Coupled with detectors, they form several feedback loops that try to keep the signal level throughout the tuner approximately constant and within the linear region of the analog circuitry. This is important since the tuner is designed to work with both cable networks and wireless terrestrial reception, which means it must handle signal levels that differ by almost 10 orders of magnitude.

Except for AGCK and IF AGC, all of these stages change their gain in discrete steps. AGCK can set its gain continuously and compensates for the step changes in the other stages, giving the illusion of continuous gain variation.
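
A toy model of how a continuous stage can hide the discrete steps. This is just my mental model of the scheme, not NXP's actual implementation, and the 3 dB step size is an arbitrary assumption:

    #include <stdio.h>

    #define STEP_DB 3.0  /* assumed size of one discrete gain step */

    /* Split a requested total gain into a coarse, stepped part (like the
     * AGC1..AGC5 stages) and a continuous fine part (like AGCK) that
     * fills in the remainder, so the sum varies smoothly. */
    static void split_gain(double target_db, double *coarse_db, double *fine_db)
    {
        int steps = (int)(target_db / STEP_DB);
        *coarse_db = steps * STEP_DB;
        *fine_db = target_db - *coarse_db;
    }

    int main(void)
    {
        double coarse, fine, g;

        for (g = 0.0; g <= 12.0; g += 1.0) {
            split_gain(g, &coarse, &fine);
            printf("target %5.1f dB = coarse %5.1f dB + fine %4.1f dB\n",
                   g, coarse, fine);
        }
        return 0;
    }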

The IF AGC gain is set externally via an analog pin and is meant to be controlled by whatever is decoding the signal (in my experiments this pin was always grounded, setting the lowest gain). All the other stages are controlled automatically by integrated logic. NXP doesn't tell you in detail how it works ("the gain is distributed to offer best trade-off between linearity and noise" is about as far as the datasheet goes).

There are some I2C registers that apparently affect the AGC behavior, but apart from the Take-Over-Point setting they are mostly undocumented, and in the end the only thing you can do is follow the register values used in the reference driver implementation. I tried playing a bit with these settings but didn't see any obvious difference in performance.

The I2C control interface does, however, allow you to monitor the current gain of AGC1, AGC2, AGC4 and AGC5, so it's possible to at least observe how the tuner reacts to various input signal power levels.

The following graph shows how the gain of the tuner changes to keep the output level constant as the input power is swept from -100 to 0 dBm. Shown are cumulative gains at the individual stages (e.g. the AGC4 line includes the gains of AGC1, AGC2 and AGC4). The line labeled "other" shows the total gain and includes the stages that don't allow direct monitoring (AGC3, AGCK, IF AGC and possibly others). The total gain was calculated from the signal level measured at the output of the tuner.
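
For reference, the plotted quantities can be derived along these lines. All the values below are hypothetical and only illustrate the bookkeeping:

    #include <stdio.h>

    int main(void)
    {
        /* Example values for one measurement point (hypothetical). */
        double p_in_dbm  = -70.0;  /* set on the signal generator */
        double p_out_dbm = -20.0;  /* measured at the tuner output */

        /* Per-stage gains read back over I2C (hypothetical values). */
        double agc1 = 12.0, agc2 = 15.0, agc4 = 14.0, agc5 = 9.0;

        /* Cumulative gains, as plotted: each line includes the stages
         * before it in the chain. */
        double cum_agc1 = agc1;
        double cum_agc2 = cum_agc1 + agc2;
        double cum_agc4 = cum_agc2 + agc4;
        double cum_agc5 = cum_agc4 + agc5;

        /* Total gain follows from input and output levels; whatever is
         * left over is attributed to the non-monitorable stages. */
        double total = p_out_dbm - p_in_dbm;
        double other = total - cum_agc5;

        printf("cumulative: %.1f %.1f %.1f %.1f dB\n",
               cum_agc1, cum_agc2, cum_agc4, cum_agc5);
        printf("total %.1f dB, other stages %.1f dB\n", total, other);
        return 0;
    }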

TDA18219HN tuner gain versus input power.

In the case of the graph above, the signal was not distorted (although there are still some strange variations in gain, like the dip between -80 and -70 dBm input power). I'll save the case where the signal gets clipped for a later blog post.

While it's interesting to poke around with a stick inside a black box like this, it's not really a productive way to spend time. Unfortunately NXP doesn't offer any kind of design support for these chips, so I'm mostly on my own in solving this (they don't even have a distributor for Europe any more). When I was choosing tuner chips two years ago, this one came out on top on specifications and availability; now, however, the secretive nature of NXP's products and the lack of documentation are becoming a larger and larger obstacle to developing this design further.

Posted by Tomaž | Categories: Analog | Comments »