## Seminar on receiver noise and covariance detection

31.10.2014 19:35

Here are slides of yet another seminar I gave at the School a few weeks ago to an audience of one. Again, I'm also posting them here in case it might be useful beyond merely incrementing my credit point counter. Read below for a short summary or dive directly into the paper if it sounds like fun reading to you. It's only four pages this time - I was warned that nobody has time to read my papers.

Like all analog devices, radio receivers add some noise to the signal that is passing through them. Some of this noise is due to pretty basic laws of physics, like thermal noise or noise due to various quantum effects in semiconductors. Other sources of noise, however, come from purely engineering constraints. These are, for example, crosstalk between parts of the circuit, non-ideal filters and so on. When designing receivers, all these noise sources are usually considered equivalent, since in the end only the total noise power matters. For instance, you might design a filter so that it attenuates unwanted signals until their power is around the thermal noise floor. It doesn't make sense to have more attenuation, since you won't see much improvement in total noise power.

However, when you are using a receiver as a spectrum sensor, very weak spurious signals buried in noise become significant. After all, the purpose of a spectrum sensor is exactly that: to detect very weak signals in the presence of noise. Since you don't know what kind of signal you are detecting, a local oscillator harmonic might look exactly like a valid transmission you want to detect. Modern spectrum sensing methods like covariance- and eigenvalue-based detectors work well in the presence of white noise. Because of this, it might be better for a receiver designer to accept a higher total noise power in exchange for noise that looks more like white noise.

The simulations I describe were actually motivated by the difference I saw between the theoretical performance of such detectors and practical experiments with a USRP when preparing one of my earlier seminars. I had a suspicion that spurious signals and non-white noise from the USRP's front-end could be causing this. To see if that was true, I created a simulation using Python and NumPy that checks the minimal detectable power for two detectors in the presence of different spurious sine signals and of noise colored by digital down-conversion.
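To give a rough feeling for why a periodic spur matters to a covariance-based detector, here is a minimal sketch. It is not the simulation from the paper: the statistic below is only in the spirit of covariance absolute value detection, it uses plain Python instead of NumPy, and the function names, the spur frequency and the -10 dB spur power are all assumptions picked to make the effect easy to see.

```python
import math
import random


def autocorrelation(samples, max_lag):
    """Biased sample autocorrelation for lags 0 .. max_lag-1."""
    n = len(samples)
    return [
        sum(samples[i] * samples[i + lag] for i in range(n - lag)) / n
        for lag in range(max_lag)
    ]


def cav_statistic(samples, max_lag=10):
    """Sum of absolute values of all elements of the (Toeplitz) sample
    covariance matrix divided by the sum of its diagonal elements.
    The ratio stays close to 1 when the samples are white noise."""
    r = autocorrelation(samples, max_lag)
    diagonal = max_lag * abs(r[0])
    # In a max_lag x max_lag Toeplitz matrix, lag k appears 2*(max_lag-k) times.
    off_diagonal = 2 * sum(
        (max_lag - lag) * abs(r[lag]) for lag in range(1, max_lag)
    )
    return (diagonal + off_diagonal) / diagonal


def simulate(spur_power_db=None, n=20000, seed=1):
    """Unit-power white Gaussian noise, optionally with a spurious sine.
    spur_power_db is the spur power relative to the noise power."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if spur_power_db is None:
        return noise
    # Sine power is A^2/2, so A = sqrt(2*P).
    amplitude = math.sqrt(2.0 * 10 ** (spur_power_db / 10.0))
    # Assumed spur frequency: 0.1 cycles per sample.
    return [
        x + amplitude * math.sin(2 * math.pi * 0.1 * i)
        for i, x in enumerate(noise)
    ]


t_white = cav_statistic(simulate())
t_spur = cav_statistic(simulate(spur_power_db=-10.0))
print(f"white noise only:  {t_white:.3f}")
print(f"with -10 dB spur:  {t_spur:.3f}")
```

With no transmission present at all, the spur alone pushes the statistic away from its white-noise value, which is exactly the kind of thing a detector threshold then has to be raised to tolerate.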

In the end, I found out that periodic spurious signals affected the minimal detectable signal power even when they were 30 dB below the thermal noise power, regardless of frequency. Similarly, digital down-conversion alone also affects detector performance because of the correlation it introduces into thermal noise. However, since oversampling ADCs have so many other practical benefits, DDC is most likely a net gain even in a spectrum sensing application. On the other hand, periodic components in receiver noise should be avoided as far as possible.

Posted by | Categories: Analog | Comments »

## On hunting non-deterministic bugs

26.10.2014 14:13

Bugs that don't seem to consistently manifest themselves are one of the most time consuming problems to solve in software development. In multi-tasking operating systems they are typically caused by race conditions between threads or details in memory management. They are perhaps even more common in embedded software. Programs on microcontrollers are typically interfacing with external processes that run asynchronously by their very nature. If you mix software and hardware development, unexpected software conditions may even be triggered by truly random events on improperly designed hardware.

When dealing with such bugs, the first thing you need to realize is that you are in fact looking for a bug that only appears sometimes. I have seen many commits and comments by developers who saw a bug manifest itself, wrote a quick fix and thought they had fixed the bug, since it didn't happen the second time around. These are typically changes that, after closer scrutiny, do not actually have any effect on the process they are supposedly fixing. Often this is connected with incomplete knowledge of the workings of the program or development tools. In other cases, the fact that such a change apparently fixed an application bug is blamed on bugs in compilers or other parts of the toolchain.

You can only approach non-deterministic processes with statistics. And the first requirement of doing any meaningful statistics is a significant sample size. The corollary of this is that automated tests are a must when you suspect a non-deterministic bug. Checking if running a test 100 times resulted in any failures should require no more than checking a single line of terminal output. If your debugging strategy includes manually checking if a particular printf line got hit among hundreds of lines of other debugging output, you won't be able to consistently tell whether the bug happened or not after half a day of debugging, much less run a hundred repetitions and have any kind of confidence in the result.
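As a minimal sketch of what "a single line of terminal output" could look like, the snippet below repeats a test and prints only the failure count. The names `count_failures` and `flaky_test` are made up for illustration, and the flaky test is simulated with a seeded random generator; in practice it would wrap whatever actually exercises the device or program.

```python
import random


def count_failures(test, repetitions):
    """Run test() the given number of times and count failed runs.
    A run fails if test() returns a false value or raises an exception."""
    failures = 0
    for _ in range(repetitions):
        try:
            ok = test()
        except Exception:
            ok = False
        if not ok:
            failures += 1
    return failures


# Hypothetical stand-in for a real test: fails in roughly 1 run out of 10.
_rng = random.Random(42)


def flaky_test():
    return _rng.random() >= 0.1


repetitions = 100
failures = count_failures(flaky_test, repetitions)
print(f"{failures}/{repetitions} runs failed")
```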

Say you've seen a bug manifest itself in 1 run of a test out of 10. You then look at the code, find a possible problem and implement a fix. How many repetitions of the test must you run to be reasonably sure that you have actually fixed the bug and you weren't just lucky the second run around?

In the first approximation, we can assume the probability Pfail of the bug manifesting itself is:

$$P_{fail} = \frac{1}{n} = \frac{1}{10}$$

The question whether your tests passed due to sheer luck then translates to the probability of seeing zero occurrences of an event with probability Pfail after m repetitions. The number of occurrences has a binomial distribution, so the probability of zero occurrences is (1 - Pfail)^m. Requiring this to be no larger than 1 - Ptest, where Ptest is the desired probability of the test giving the correct result, gives the required number of repetitions m:

$$m = \frac{\log{(1 - P_{test})}}{\log{(1 - P_{fail})}} = \frac{\log{(1-P_{test})}}{\log{(1 - \frac{1}{n})}}$$

Since log(1 - 1/n) is approximately -1/n for large n (with log being the natural logarithm), the ratio between m and n is more or less constant for practical values of n (e.g. >10):

$$\frac{m}{n} \approx -\log{(1 - P_{test})}$$

For instance, if you want to be 99% sure that your fix actually worked and that the test did not pass purely by chance, you need to run around 4.6 times more repetitions than those you used initially when discovering the bug.
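The arithmetic is easy to check with a few lines of Python. This is just the formula for m evaluated directly; the function name `repetitions_needed` is made up for this example.

```python
import math


def repetitions_needed(p_test, n):
    """Number of passing runs needed to be p_test sure the bug is gone,
    assuming it originally showed up with probability 1/n per run."""
    p_fail = 1.0 / n
    return math.ceil(math.log(1.0 - p_test) / math.log(1.0 - p_fail))


for n in (10, 100, 1000):
    m = repetitions_needed(0.99, n)
    print(f"n = {n:5d}  m = {m:5d}  m/n = {m / n:.2f}")
```

The ratio m/n creeps up towards -log(0.01), or about 4.6, as n grows.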

This is not the whole story though. If you've seen a bug once in 10 runs, Pfail=0.1 is only the most likely estimate for the probability of its occurrence. The true value might actually be higher or lower, and you may have seen exactly one failure purely by chance.

If you want to also account for the uncertainty in Pfail, the derivation of m gets a bit complex. It involves using the beta distribution for the likelihood of the Pfail estimate, deriving Ptest from the law of total probability and then solving for m. The end result, however, is similarly straightforward and can be summarized in a simple table:

| Ptest [%] | m/n |
|-----------|-----|
| 90.0      | 2.5 |
| 99.0      | 10  |
| 99.9      | 30  |
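For the curious, here is one possible reading of that derivation as code. It is a sketch under my own assumptions, not necessarily the exact calculation behind the table: I assume a uniform prior on Pfail, so that one failure in n runs gives a Beta(2, n) distribution, and then average the probability of m clean runs over that distribution. The function names are made up.

```python
def pass_by_luck(m, n):
    """Probability of m passing runs in a row when the bug is NOT fixed,
    averaged over a Beta(2, n) distribution for Pfail.
    This is E[(1 - p)^m] = B(2, n + m) / B(2, n), which simplifies to
    n (n + 1) / ((n + m) (n + m + 1))."""
    return n * (n + 1) / ((n + m) * (n + m + 1))


def repetitions_needed_beta(p_test, n):
    """Smallest m for which passing by luck is at most 1 - p_test."""
    m = 0
    while pass_by_luck(m, n) > 1.0 - p_test:
        m += 1
    return m


n = 10
for p_test in (0.9, 0.99, 0.999):
    m = repetitions_needed_beta(p_test, n)
    print(f"Ptest = {p_test:5.3f}  m = {m:4d}  m/n = {m / n:.1f}")
```

For n = 10 this gives ratios of roughly 2.3, 9.5 and 32, which is in the same ballpark as the rounded values in the table.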

Even this still assumes the bug basically behaves as a weighted coin, whose flips are independent of each other and whose probability doesn't change with time. This might or might not be a good model. It probably works well for problems in embedded systems where a bug is caused by small physical variations in signal timings. Problems with memory management or heisenbugs on the other hand can behave in a completely different way.

Assuming the analysis above works, a good rule of thumb therefore seems to be that if you discovered a bug using n repetitions of the test, checking whether it has been fixed or not should be done using at least 10·n repetitions. Of course, you can never be absolutely certain. Using a factor of 10 only means that you will on average mark a bug as fixed, when in fact it is not, once in a hundred debugging sessions. It's usually worth understanding why the change fixed the bug in addition to seeing the test suite pass.

Posted by | Categories: Ideas | Comments »

## CubieTruck UDMA CRC errors

18.10.2014 20:07

Last year I bought a CubieTruck, a small, low-powered ARM computer, to host this web site and a few other things. Combined with a Samsung 840 EVO SSD on the SATA bus, it proved to be a relatively decent replacement for my aging Intel box.

One thing that has been bothering me right from the start though is that every once in a while, there were problems with the SATA bus. Occasionally, isolated error messages like these appeared in the kernel log:

```
kernel: ata1.00: exception Emask 0x10 SAct 0x2000000 SErr 0x400100 action 0x6 frozen
kernel: ata1.00: irq_stat 0x08000000, interface fatal error
kernel: ata1: SError: { UnrecovData Handshk }
kernel: ata1.00: failed command: WRITE FPDMA QUEUED
kernel: ata1.00: cmd 61/18:c8:68:0e:49/00:00:02:00:00/40 tag 25 ncq 12288 out
kernel:          res 40/00:c8:68:0e:49/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
kernel: ata1.00: status: { DRDY }
kernel: ata1: hard resetting link
kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: ata1.00: supports DRM functions and may not be fully accessible
kernel: ata1.00: supports DRM functions and may not be fully accessible
kernel: ata1.00: configured for UDMA/133
kernel: ata1: EH complete
```


At the same time, the SSD reported an increased UDMA CRC error count through the SMART interface.

These errors were mostly benign. Apart from the cruft in the log files they did not appear to have any adverse effects. Only once or twice in the last 10 months or so did they cause the kernel to remount filesystems on the SSD as read-only, which required some manual intervention to get the CubieTruck back on-line.

I've seen some forum discussions that suggested this might be caused by a bad power supply. However, checking the power lines with an oscilloscope did not show anything suspicious. On the other hand, I did notice during this test that the errors seemed to occur when I was touching the SATA cable. This made me think that the cable or the connectors on it might be the culprit - something that was also suggested in the forums.

The CubieTruck originally comes with a custom SATA cable that combines both power and data lines for the hard drive and has special connectors (at least compared to what you usually see in SATA cabling) on the motherboard side.

Over the last few weeks the errors seemed to be getting more common, so I decided to try replacing the cable. Instead of ordering a new CubieTruck SSD kit I improvised a bit: I didn't have proper connectors for CubieTruck's power lines at hand, so I just soldered the cables directly to the motherboard. On the SSD side I used the standard 15-pin SATA power connector.

For the data connection, I used an ordinary SATA data cable. The shortest one I could find was about three times as long as necessary, so it looks a bit uglier now. The connector on the motherboard side also needed some work with a scalpel to fit into CubieTruck's socket. The original connector on the cable that came with CubieTruck is thinner than those on standard SATA cables I tried.

So far it seems this fixed the CRC errors. In the past few days since I replaced the cable I haven't seen any new errors pop up, but I guess it will take a month or so to be sure.

Posted by | Categories: Digital | Comments »

## 2.4 GHz band occupancy survey

09.10.2014 19:36

The 100 MHz of spectrum around 2.45 GHz is shared by all sorts of technologies, from wireless LAN and Bluetooth, through video streaming, to yesterday's meatloaf you are heating up in the microwave oven. It's not hard to see the potential for it being overused. Couple this with ubiquitous complaints about non-working Wi-Fi at conferences and overuse is generally taken as a fact.

The assumption that existing unlicensed spectrum, including the 2.4 GHz band, is not enough to support all the igadgets of tomorrow is pretty much central in all sorts of efforts that push for new radio technologies. These try to introduce regulatory changes or develop smarter radios. While I don't have anything against these projects (in fact, some of them pay for my lunch), it seems there's a lack of up-to-date surveys of how much the band is actually used in the real world. It's always nice to double-check the assumptions before building upon them.

Back in April I wrote about using VESNA sensor nodes to monitor the usage of radio spectrum. Since then I have placed my stand-alone sensor at several more locations in or around Ljubljana and recorded spectrogram data for intervals ranging from a few hours to a few months. You might remember the sensor box and my lightning talk about it from WebCamp Ljubljana. Altogether it resulted in a pretty comprehensive dataset that covers some typical indoor environments where you usually hear most complaints about bad quality of service.

(At this point, I would like to thank everyone who ranted about their Wi-Fi and allowed me to put an ugly plastic spy box in their living room for a week. You know who you are.)

A few weeks ago I finally managed to put together a relatively comprehensive report on these measurements. Typically, such surveys are done with professional equipment in the five-digit price range instead of cheap sensor nodes. Because of that, a lot of the paper is dedicated to ensuring that the results are trustworthy. While there are still some unknowns regarding how the spectrum measurement with CC2500 behaves, I'm pretty confident at this point that what's presented is not completely wrong.

To spare you the reading if you are in a hurry, here's the relevant paragraph from the conclusion. Please bear in mind that I'm talking about the physical layer here. Whether or not various upper-layer protocols were able to efficiently use this spectrum is another matter.

According to our study, more than 90% of spectrum is available more than 95% of the time in residential areas in Ljubljana, Slovenia. Daily variations in occupancy exist, but are limited to approximately 2%. In a conference environment, overall occupancy reached at most 40%.
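To make a statement like "90% of spectrum is available 95% of the time" a bit more concrete, here is a minimal sketch of how such numbers could be computed from spectrogram data. The detection threshold, the definition of "available" and the sample data are all assumptions made up for this example; the report may define occupancy differently.

```python
def channel_occupancy(sweeps, threshold_dbm):
    """Fraction of sweeps in which each channel exceeded the threshold.
    sweeps is a list of spectrum sweeps, each a list of per-channel
    power readings in dBm."""
    n_sweeps = len(sweeps)
    n_channels = len(sweeps[0])
    return [
        sum(1 for sweep in sweeps if sweep[ch] > threshold_dbm) / n_sweeps
        for ch in range(n_channels)
    ]


def available_fraction(occupancy, max_occupancy=0.05):
    """Fraction of channels that were occupied at most max_occupancy
    of the time, i.e. free at least 95% of the time by default."""
    free = sum(1 for o in occupancy if o <= max_occupancy)
    return free / len(occupancy)


# Made-up data: 4 sweeps over 5 channels, one busy channel.
sweeps = [
    [-95, -94, -60, -96, -95],
    [-96, -95, -62, -95, -94],
    [-94, -96, -95, -95, -96],
    [-95, -95, -61, -94, -95],
]

occupancy = channel_occupancy(sweeps, threshold_dbm=-85)
print("per-channel occupancy:", occupancy)
print("fraction of channels available:", available_fraction(occupancy))
```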

For another view of this data set, also check out the animated histograms on YouTube.

Posted by | Categories: Life | Comments »

## Checking hygrometer calibration

06.10.2014 22:08

Several years ago I picked an old, wireless temperature and humidity sensor out of the trash. I fixed a bad solder joint on its radio transmitter and then used it many times simply as a dummy AM transmitter when playing with 433 MHz super-regenerative receivers and packet decoders. Recently though, I've been using it for its original purpose: to monitor outside air temperature and humidity. I've thrown together a receiver from some old parts I had lying around, a packet decoder running on an Arduino and a Munin plug-in.

Looking at the relative air humidity measurements I gathered over the past months, however, I was wondering how accurate they are. The hygrometer is now probably close to 10 years old and of course hasn't been calibrated since it left the factory. Considering this is a fairly low-cost product, I doubt it was very precise even when new.

These are the sensors on the circuit board: the green bulb on the right is a thermistor and the big black box on the left is the humidity sensor, probably some kind of a resistive type. There are no markings on it, but the HR202 looks very similar. The sensor reports relative humidity with 1% resolution and temperature with 0.1°C resolution.

Resistive sensors are sensitive to temperature as well as humidity. Since the unit has a thermometer, I'm guessing the on-board controller compensates for the changes in resistance due to temperature variations. It shows the same value on an LCD screen as it sends over the radio, so the compensation definitely isn't left to the receiver.

To check the accuracy of the humidity measurements reported by the sensor, I made two reference environments with known humidity in small, airtight Tupperware containers:

• 75% relative humidity above a saturated solution of sodium chloride, and
• 100% relative humidity above a soaked paper towel.

I don't have a temperature-stabilized oven at home and I wanted to measure at least three different humidity and temperature points. The humidity in my containers took around 24 hours to stabilize after sealing, so I couldn't just heat them up. In the end, I decided to only take the measurements at room temperature (which didn't change a lot) and in the fridge. Surprisingly, the receiver picked up the 433 MHz transmission from within the metal fridge without any special tweaking.

Here are the measurements:

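| T [°C] | Rh reference [%] | Rh measured [%] | ΔRh [%] |
|--------|------------------|-----------------|---------|
| 24     | 75               | 69              | -6      |
| 22     | 75               | 75              | 0       |
| 5      | 75               | 62              | -13     |
| 3      | 75               | 60              | -15     |
| 23     | 100              | 98              | -2      |
| 21     | 100              | 98              | -2      |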

So, from this simple experiment it seems that the measurements are consistently a bit too low.

The 6% step between 22 and 24°C is interesting - it happens abruptly when the temperature sensor reading goes over 23°C. I'm pretty sure it's due to temperature compensation in the controller. Probably it does not do any interpolation between values in its calibration table.

From a quick look into various datasheets it seems these sensors typically have a ±5% accuracy. The range I saw here is +0/-15%, so it's a bit worse. However, considering its age and the fact that the sensor has been sitting on a dusty shelf for a few years without a cover, I would say it's still relatively accurate.
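The error range quoted above follows directly from the measurements. As a quick check, this snippet recomputes the differences from the values in the table (the variable names are just for illustration):

```python
# (temperature in °C, reference Rh in %, measured Rh in %), from the table
measurements = [
    (24, 75, 69),
    (22, 75, 75),
    (5, 75, 62),
    (3, 75, 60),
    (23, 100, 98),
    (21, 100, 98),
]

# Error is measured minus reference, so negative means reading too low.
errors = [measured - reference for _, reference, measured in measurements]

print("errors [%]:", errors)
print(f"range: {max(errors):+d}/{min(errors):+d} %")
```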

I've seen some cheap hygrometer calibration kits for sale that contain salt mixtures for different humidity references. It would be interesting to try that and get a better picture of how the response of the sensor changed, but I think buying a new, better calibrated sensor makes much more sense at this point.

Posted by | Categories: Life | Comments »