Seminar on receiver noise and covariance detection

31.10.2014 19:35

Here are the slides from yet another seminar I gave at the School a few weeks ago to an audience of one. Again, I'm posting them here in case they might be useful beyond merely incrementing my credit point counter. Read below for a short summary or dive directly into the paper if it sounds like fun reading to you. It's only four pages this time - I was warned that nobody has time to read my papers.

Effects of non-Gaussian noise on covariance-based detectors title slide

Like all analog devices, radio receivers add some noise to the signal that passes through them. Some of this noise is due to pretty basic laws of physics, like thermal noise or noise due to various quantum effects in semiconductors. Other sources of noise, however, come from purely engineering constraints: crosstalk between parts of the circuit, non-ideal filters and so on. When designing receivers, all these noise sources are usually considered equivalent, since in the end only the total noise power matters. For instance, you might design a filter so that it attenuates unwanted signals until their power is around the thermal noise floor. Adding more attenuation doesn't make sense, since you won't see much improvement in the total noise power.

However, when you are using a receiver as a spectrum sensor, very weak spurious signals buried in noise become significant. After all, the purpose of a spectrum sensor is exactly that: to detect very weak signals in the presence of noise. Since you don't know what kind of signal you are detecting, a local oscillator harmonic might look exactly like a valid transmission you want to detect. Modern spectrum sensing methods like covariance- and eigenvalue-based detectors work well in the presence of white noise, but their performance suffers when the noise contains correlated or periodic components. Because of this it might be better for a receiver designer to trade low total noise power for noise with a higher power, but one that looks more like white noise.

The simulations I describe were actually motivated by the difference I saw between the theoretical performance of such detectors and practical experiments with a USRP when preparing one of my earlier seminars. I had a suspicion that spurious signals and non-white noise from the USRP's front-end could be causing this. To see whether that was true, I created a simulation using Python and NumPy that checks the minimal detectable power for two detectors in the presence of different spurious sine signals and noise colored by digital down-conversion.
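
To illustrate the mechanism, here is a minimal sketch of this kind of check. It uses the covariance absolute value (CAV) test statistic from the literature (not necessarily one of the two detectors in the paper) and made-up parameters for the smoothing factor, sample count and the spur, so it only shows how a weak periodic spur biases the detector statistic rather than reproducing the simulation results:

import numpy as np
from scipy.linalg import toeplitz

def cav_statistic(x, L=10):
    """Covariance absolute value (CAV) test statistic: build an L x L
    Toeplitz estimate of the sample covariance matrix from the
    autocorrelation of x and return the ratio between the sum of absolute
    values of all entries and the sum of the diagonal. For white noise the
    ratio tends towards 1; any correlation pushes it higher."""
    N = len(x)
    lags = np.array([np.dot(x[:N - l], x[l:]) / (N - l) for l in range(L)])
    R = toeplitz(lags)
    return np.abs(R).sum() / np.abs(np.diag(R)).sum()

rng = np.random.default_rng(0)
N = 200000
noise = rng.normal(size=N)        # unit-power "thermal" noise

# Hypothetical spurious sine, 20 dB below the noise power, at an arbitrary
# normalized frequency.
spur_db, f = -20.0, 0.1234
spur = np.sqrt(2 * 10**(spur_db / 10)) * np.cos(2 * np.pi * f * np.arange(N))

print("noise only  :", cav_statistic(noise))
print("noise + spur:", cav_statistic(noise + spur))

Wrapping something like this in a loop that sweeps the power of a simulated transmission, with and without the spur present, gives an estimate of the minimal detectable power in both cases.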

In the end, I found out that periodic spurious signals affected the minimal detectable signal power even when they were 30 dB below the thermal noise power, regardless of their frequency. Similarly, digital down-conversion alone also affects detector performance because of the correlation it introduces into thermal noise. However, since oversampling ADCs have so many other practical benefits, DDC is most likely a net gain even in a spectrum sensing application. On the other hand, periodic components in receiver noise should be avoided as far as possible.

Posted by Tomaž | Categories: Analog | Comments »

On hunting non-deterministic bugs

26.10.2014 14:13

Bugs that don't seem to consistently manifest themselves are one of the most time consuming problems to solve in software development. In multi-tasking operating systems they are typically caused by race conditions between threads or details in memory management. They are perhaps even more common in embedded software. Programs on microcontrollers are typically interfacing with external processes that run asynchronously by their very nature. If you mix software and hardware development, unexpected software conditions may even be triggered by truly random events on improperly designed hardware.

When dealing with such bugs, the first thing you need to realize is that you are in fact looking for a bug that only appears sometimes. I have seen many commits and comments by developers who saw a bug manifest itself, wrote a quick fix and thought they had fixed it, since it didn't happen the second time around. These are typically changes that, after closer scrutiny, do not actually have any effect on the process they are supposedly fixing. Often this is connected with incomplete knowledge of the workings of the program or development tools. In other cases, the fact that such a change apparently fixed an application bug is blamed on bugs in compilers or other parts of the toolchain.

You can only approach non-deterministic processes with statistics. And the first requirement for doing any meaningful statistics is a significant sample size. The corollary of this is that automated tests are a must when you suspect a non-deterministic bug. Checking whether running a test 100 times resulted in any failures should require no more than checking a single line of terminal output. If your debugging strategy includes manually checking whether a particular printf line got hit among hundreds of lines of other debugging output, you won't be able to consistently tell whether the bug happened or not after half a day of debugging, much less run a hundred repetitions and have any kind of confidence in the result.


Say you've seen a bug manifest itself in 1 run of a test out of 10. You then look at the code, find a possible problem and implement a fix. How many repetitions of the test must you run to be reasonably sure that you have actually fixed the bug and weren't just lucky the second time around?

To a first approximation, we can assume the probability Pfail of the bug manifesting itself is:

P_{fail} = \frac{1}{n} = \frac{1}{10}

The question of whether your tests passed due to sheer luck then translates to the probability of seeing zero occurrences of an event with probability Pfail after m repetitions. The number of occurrences has a binomial distribution. Given the desired probability Ptest of the test giving the correct result, the required number of repetitions m is:

m = \frac{\log{(1 - P_{test})}}{\log{(1 - P_{fail})}} = \frac{\log{(1-P_{test})}}{\log{(1 - \frac{1}{n})}}

It turns out the ratio between m and n is more or less constant for practical values of n (e.g. >10):

\frac{m}{n} \approx -\log{(1 - P_{test})}

Ratio between number of test runs in discovery and number of required verification test runs.

For instance, if you want to be 99% sure that your fix actually worked and that the test did not pass purely by chance, you need to run around 4.6 times more repetitions than those you used initially when discovering the bug.
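
In code form, the calculation above looks something like this (a quick sketch with the example values from this post):

from math import ceil, log

n = 10            # test runs used when the bug was discovered (1 failure seen)
p_fail = 1.0 / n  # estimated probability of the bug manifesting itself
p_test = 0.99     # desired confidence that the fix is real

# Smallest m for which m repetitions would catch a still-present bug with
# probability p_test.
m = ceil(log(1 - p_test) / log(1 - p_fail))

print(m, m / float(n))   # 44 repetitions, m/n = 4.4, close to -ln(0.01) = 4.6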


This is not the whole story though. If you've seen a bug once in 10 runs, Pfail = 0.1 is only the most likely estimate of the probability of its occurrence. It might actually be higher or lower and you've only seen one failure by chance:

PDF for probability of failure when one failure was seen in 10 test runs.

If you want to also account for the uncertainty in Pfail, the derivation of m gets a bit complex. It involves using the beta distribution for the likelihood of the Pfail estimate, deriving Ptest from the law of total probability and then solving for m. The end result, however, is similarly straightforward and can be summarized in a simple table:

Ptest [%]    m/n
90.0         2.5
99.0         10
99.9         30
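
The table can be roughly reproduced numerically. Here is a sketch of the calculation described above, assuming a uniform prior for Pfail (the exact numbers depend a bit on that assumption, so the ratios come out in the same ballpark as the table rather than exactly):

from scipy import integrate, stats

n, failures = 10, 1   # runs and failures seen when the bug was discovered

# Posterior for P_fail with a uniform prior: Beta(failures+1, n-failures+1).
posterior = stats.beta(failures + 1, n - failures + 1)

def p_false_pass(m):
    """Probability that m verification runs all pass even though the bug is
    still there, averaged over the posterior for P_fail (law of total
    probability)."""
    val, _ = integrate.quad(lambda p: posterior.pdf(p) * (1 - p)**m, 0, 1)
    return val

for p_test in (0.90, 0.99, 0.999):
    m = 1
    while p_false_pass(m) > 1 - p_test:
        m += 1
    print("Ptest = %5.1f %%   m = %3d   m/n = %4.1f" % (100 * p_test, m, m / float(n)))

This prints ratios of roughly 2.3, 9.5 and 32.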

Even this still assumes the bug basically behaves as a weighted coin, whose flips are independent of each other and whose probability doesn't change with time. This might or might not be a good model. It probably works well for problems in embedded systems where a bug is caused by small physical variations in signal timings. Problems with memory management or heisenbugs on the other hand can behave in a completely different way.

Assuming the analysis above works, a good rule of thumb therefore seems to be that if you discovered a bug using n repetitions of the test, checking whether it has been fixed should be done using at least 10·n repetitions. Of course, you can never be absolutely certain. Using a factor of 10 only means that you will, on average, mark a bug as fixed when in fact it is not once in a hundred debugging sessions. It's usually worth understanding why the change fixed the bug in addition to seeing the test suite pass.

Posted by Tomaž | Categories: Ideas | Comments »

CubieTruck UDMA CRC errors

18.10.2014 20:07

Last year I bought a CubieTruck, a small, low-powered ARM computer, to host this web site and a few other things. Combined with a Samsung 840 EVO SSD on the SATA bus, it proved to be a relatively decent replacement for my aging Intel box.

One thing that has been bothering me right from the start though is that every once in a while, there were problems with the SATA bus. Occasionally, isolated error messages like these appeared in the kernel log:

kernel: ata1.00: exception Emask 0x10 SAct 0x2000000 SErr 0x400100 action 0x6 frozen
kernel: ata1.00: irq_stat 0x08000000, interface fatal error
kernel: ata1: SError: { UnrecovData Handshk }
kernel: ata1.00: failed command: WRITE FPDMA QUEUED
kernel: ata1.00: cmd 61/18:c8:68:0e:49/00:00:02:00:00/40 tag 25 ncq 12288 out
kernel:          res 40/00:c8:68:0e:49/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
kernel: ata1.00: status: { DRDY }
kernel: ata1: hard resetting link
kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: ata1.00: supports DRM functions and may not be fully accessible
kernel: ata1.00: supports DRM functions and may not be fully accessible
kernel: ata1.00: configured for UDMA/133
kernel: ata1: EH complete

At the same time, the SSD reported increased UDMA CRC error count through the SMART interface:

UDMA CRC weekly error count on CubieTruck.

These errors were mostly benign. Apart from the cruft in the log files they did not appear to have any adverse effects. Only once or twice in the last 10 months or so did they cause the kernel to remount filesystems on the SSD as read-only, which required some manual intervention to get the CubieTruck back on-line.

I've seen some forum discussions that suggested this might be caused by a bad power supply. However, checking the power lines with an oscilloscope did not show anything suspicious. On the other hand, I did notice during this test that the errors seemed to occur when I was touching the SATA cable. This made me think that the cable or the connectors on it might be the culprit - something that was also suggested in the forums.

The CubieTruck originally comes with a custom SATA cable that combines both power and data lines for the hard drive and has special connectors (at least compared to what you usually see in the context of SATA cabling) on the motherboard side.

Over the last few weeks it appeared that the errors were getting increasingly common, so I decided to try replacing the cable. Instead of ordering a new CubieTruck SSD kit I improvised a bit: I didn't have proper connectors for CubieTruck's power lines at hand, so I just soldered the cables directly to the motherboard. On the SSD drive I used the standard 15-pin SATA power connector.

For the data connection, I used an ordinary SATA data cable. The shortest one I could find was about three times as long as necessary, so it looks a bit uglier now. The connector on the motherboard side also needed some work with a scalpel to fit into CubieTruck's socket. The original connector on the cable that came with CubieTruck is thinner than those on standard SATA cables I tried.

Replacement SATA cables for CubieTruck.

So far it seems this fixed the CRC errors. In the past few days since I replaced the cable I haven't seen any new errors pop up, but I guess it will take a month or so to be sure.

Posted by Tomaž | Categories: Digital | Comments »

2.4 GHz band occupancy survey

09.10.2014 19:36

The 100 MHz of spectrum around 2.45 GHz is shared by all sorts of technologies, from wireless LAN and Bluetooth, through video streaming, to yesterday's meatloaf you are heating up in the microwave oven. It's not hard to see the potential for it being overused. Couple this with the ubiquitous complaints about non-working Wi-Fi at conferences and overuse is generally taken as a fact.

The assumption that the existing unlicensed spectrum, including the 2.4 GHz band, is not enough to support all the igadgets of tomorrow is pretty much central to all sorts of efforts that push for new radio technologies. These try to introduce regulatory changes or develop smarter radios. While I don't have anything against these projects (in fact, some of them pay for my lunch), it seems there's a lack of up-to-date surveys of how much the band is actually used in the real world. It's always nice to double-check the assumptions before building upon them.

Back in April I wrote about using VESNA sensor nodes to monitor the usage of radio spectrum. Since then I have placed my stand-alone sensor at several more locations in or around Ljubljana and recorded spectrogram data for intervals ranging from a few hours to a few months. You might remember the sensor box and my lightning talk about it from WebCamp Ljubljana. Altogether it resulted in a pretty comprehensive dataset that covers some typical indoor environments where you usually hear the most complaints about bad quality of service.

(At this point, I would like to thank everyone who ranted about their Wi-Fi and allowed me to put an ugly plastic spy box in their living room for a week. You know who you are.)

In-door occupancy survey of the 2.4 GHz band

A few weeks ago I finally managed to put together a relatively comprehensive report on these measurements. Typically, such surveys are done with professional equipment in the five-digit price range instead of cheap sensor nodes. Because of that, a lot of the paper is dedicated to ensuring that the results are trustworthy. While there are still some unknowns regarding how spectrum measurement with the CC2500 behaves, I'm pretty confident at this point that what's presented is not completely wrong.

To spare you the reading if you are in a hurry, here's the relevant paragraph from the conclusion. Please bear in mind that I'm talking about the physical layer here. Whether or not various upper-layer protocols were able to efficiently use this spectrum is another matter.

According to our study, more than 90% of spectrum is available more than 95% of the time in residential areas in Ljubljana, Slovenia. Daily variations in occupancy exist, but are limited to approximately 2%. In a conference environment, overall occupancy reached at most 40%.

For another view of this data set, check also animated histograms on YouTube.

Posted by Tomaž | Categories: Life | Comments »

Checking hygrometer calibration

06.10.2014 22:08

Several years ago I picked up an old wireless temperature and humidity sensor from the trash. I fixed a bad solder joint on its radio transmitter and then used it many times simply as a dummy AM transmitter when playing with 433 MHz super-regenerative receivers and packet decoders. Recently though, I've been using it for its original purpose: to monitor outside air temperature and humidity. I've thrown together a receiver from some old parts I had lying around, a packet decoder running on an Arduino and a Munin plug-in.

Looking at the relative air humidity measurements I gathered over the past months, however, I was wondering how accurate they are. The hygrometer is now probably close to 10 years old and of course hasn't been calibrated since it left the factory. Considering this is a fairly low-cost product, I doubt it was very precise even when new.

Weather station humidity and temperature sensors.

These are the sensors on the circuit board: the green bulb on the right is a thermistor and the big black box on the left is the humidity sensor, probably some kind of a resistive type. There are no markings on it, but the HR202 looks very similar. The sensor reports relative humidity with 1% resolution and temperature with 0.1°C resolution.

Resistive sensors are sensitive to temperature as well as humidity. Since the unit has a thermometer, I'm guessing the on-board controller compensates for the changes in resistance due to temperature variations. It shows the same value on an LCD screen as it sends over the radio, so the compensation definitely isn't left to the receiver.

Calibrating a hygrometer using a saturated salt solution.

To check the accuracy of the humidity measurements reported by the sensor, I made two reference environments with known humidity in small, airtight Tupperware containers:

  • 75% relative humidity above a saturated solution of sodium chloride, and
  • 100% relative humidity above a soaked paper towel.

I don't have a temperature-stabilized oven at home and I wanted to measure at least three different humidity and temperature points. The humidity in my containers took around 24 hours to stabilize after sealing, so I couldn't just heat them up. In the end, I decided to only take measurements at room temperature (which didn't change a lot) and in the fridge. Surprisingly, the receiver picked up the 433 MHz transmission from within the metal fridge without any special tweaking.

Here are the measurements:

T [°C]    Rh reference [%]    Rh measured [%]    ΔRh [%]
24        75                  69                  -6
22        75                  75                   0
 5        75                  62                 -13
 3        75                  60                 -15
23        100                 98                  -2
21        100                 98                  -2

So, from this simple experiment it seems that the measurements are consistently a bit too low.

The 6% step between 22 and 24°C is interesting - it happens abruptly when the temperature sensor reading goes over 23°C. I'm pretty sure it's due to temperature compensation in the controller. Probably it does not do any interpolation between values in its calibration table.

Rh and temperature readings above a saturated solution at room temperature.

From a quick look into various datasheets it seems these sensors typically have a ±5% accuracy. The range I saw here is +0/-15%, so it's a bit worse. However considering its age and the fact that the sensor has been sitting on a dusty shelf for a few years without a cover, I would say it's still relatively accurate.

I've seen some cheap hygrometer calibration kits for sale that contain salt mixtures for different humidity references. It would be interesting to try that and get a better picture of how the response of the sensor changed, but I think buying a new, better calibrated sensor makes much more sense at this point.

Posted by Tomaž | Categories: Life | Comments »

Seminar on covariance-based spectrum sensing

29.09.2014 20:05

Here are the slides from my seminar on a practical test of covariance-based spectrum sensing methods. It was presented behind closed doors and was worth five science collaboration points.

Covariance-based spectrum sensing methods in practice title slide

The basic idea behind spectrum sensing is for a sensor to detect which frequency channels are occupied and which are vacant. This requires detecting very weak transmissions, typically below the noise floor of the receiver. For practical reasons, you want such a sensor to be small, cheap, robust and capable of detecting a wide range of signals.

Covariance-based and eigenvalue-based detectors are a relatively recent development in this field. Simulations show that they are capable of detecting a wide range of realistic transmissions, are immune to noise power changes and can detect signals at relatively low signal-to-noise ratios. They are also interesting because they are not hard to implement on low-cost hardware with limited capabilities.
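
For illustration, here is a minimal sketch of one well-known detector from this family, the maximum-minimum eigenvalue (MME) test. It is meant to show the principle, not the exact code I used in the experiments, and the smoothing factor, sample count and toy signal model are made up for the example:

import numpy as np

def mme_statistic(x, L=10):
    """Maximum-minimum eigenvalue (MME) test statistic: the ratio between the
    largest and the smallest eigenvalue of an L x L sample covariance matrix.
    For white noise all eigenvalues converge to the noise power and the ratio
    tends towards 1; any correlated signal increases it."""
    N = len(x) - L
    X = np.array([x[l:l + N] for l in range(L)])  # L consecutive samples per row
    R = X @ X.T / N
    w = np.linalg.eigvalsh(R)                     # eigenvalues in ascending order
    return w[-1] / w[0]

rng = np.random.default_rng(0)
N = 50000
noise = rng.normal(size=N)

# A toy "transmission" 10 dB below the noise power: white noise passed through
# a short moving-average filter, i.e. an oversampled, correlated signal.
sig = np.convolve(rng.normal(size=N), np.ones(4), mode="same")
sig *= np.sqrt(10**(-10 / 10) / np.mean(sig**2))

print("noise only    :", mme_statistic(noise))
print("noise + signal:", mme_statistic(noise + sig))

The statistic stays close to 1 for noise alone and rises noticeably when the weak correlated signal is added, without the detector having to know the noise power.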

Over the summer I performed several table-top experiments with an RF signal generator and a few radio receivers. I implemented a few of the methods I found in various published papers and checked how well they perform in practice. I was also interested in what kind of characteristics are important when designing a receiver specifically for such a use case - when using a receiver for sensing, noise properties you mostly don't care about for data reception start to become important.

This work more or less builds upon my earlier seminar on spectrum sensing methods and my work on a new UHF receiver for VESNA. In fact, I have performed similar experiments with the Ettus Research USRP specifically to see how well my receiver would work with such methods before finalizing the design. Since I now finally have a few precious prototypes of SNE-ESHTER on my desk, I was able to actually check its performance. While I don't have conclusive results yet, these latest tests do hint that it does well compared to the bog-standard USRP.

A paper describing the details is yet to be published, so unfortunately I'm told it is under an embargo (I'm happy to share details in person, if anyone is interested though). But the actual code, measurements and a few IPython notebooks with analysis of the measurements are already on GitHub. Feel free to try and replicate my results in your own lab.

Posted by Tomaž | Categories: Life | Comments »

Disassembling Tek P5050 probe

03.09.2014 18:59

We have a big and noisy 4-channel Tektronix TDS 5000 oscilloscope at work that is used in our lab and around the department. Recently, one of its 500 MHz probes stopped working for an unknown reason, as is bound to happen to any equipment that is used daily by a diverse group of people.

Tektronix P5050 oscilloscope probe.

This is an old Tektronix P5050 voltage probe. You can't buy new ones any more and a similar probe will apparently set the taxpayers back around $500. So it seemed reasonable to spend some time looking into fixing it before ordering a replacement.

This specimen doesn't appear to be getting any signal to the scope. The cable is probably fine, since I can see some resistance between the tip and the BNC connector on the other end. My guess is that the problem is most likely in the compensation circuit inside the box at the oscilloscope end.


So, how does one disassemble it? It's not like you want to apply the usual remove-the-labels-and-jam-the-screwdriver-under-the-plastic to this thing. I couldn't find any documentation on the web, so here's a quick guide. It's nothing complicated, but when working with delicate, expensive gadgets (that are not even mine in the first place) I usually feel much better if I see someone else has managed to open it before me.

The first step is to remove the metal cable strain relief and the BNC connector. I used a wrench to unscrew the washer at the back of the BNC connector, while the strain relief was loose enough to remove by hand.

Tek P5050 probe partially disassembled.

The circuit case consists of two plastic halves on the outside and two metal shield halves on the inside that also carry the front and aft windings for the strain relief and the BNC connector. There are no screws or glue. The plastic halves firmly latch into grooves on the two broad sides of the metal ground shield (you can see one of the grooves on the photo above).

You can pry off the plastic shell by carefully lifting the sides. I found that it's best to start at the holes for the windings. Bracing a flat screwdriver against the metal at that point allows you to lift the plastic parts with minimal damage.

After you remove the plastic, the metal parts should come apart by themselves. The cable is not removable without soldering.

Circuit board inside Tek P5050 probe.

Finally, here's the small circuit that is hidden inside the box. Vias suggest that it's a two-sided board. Unfortunately you can't remove it from the shield without cutting off the metal rivets.

The trimmer capacitor in the center is user-accessible through the hole in the casing. The two potentiometers on the side appear to be factory set. From a quick series of pokes with a multimeter it appears one of the ceramic capacitors is shorted, however I want to study this a bit more before I put a soldering iron to it.

Posted by Tomaž | Categories: Life | Comments »

Follow up on Atmel ZigBit modules

27.08.2014 12:09

I've ranted before about the problematic Atmel ZigBit modules and the buggy SerialNet firmware. In my back-of-the-envelope analysis of failure modes in the Jožef Stefan Institute's sensor networks, one particular problem stood out related to this low-powered mesh networking hardware: a puzzling failure that prevents a module from joining the mesh and can seemingly be fixed by reprogramming the module's firmware.

A week ago Adam posted a link to the SerialNet source in a comment to my old blog post. While I've mostly moved on to other things, this new piece of information gave me sufficient excuse to spend another few hours exploring this problem.

Atmel ATZB_900_B0 module on a VESNA SNR-MOD board.

A quick look around Atmel's source package revealed that it contains only the code for the serial interface to the underlying proprietary ZigBee stack. There are no low-level hardware drivers for the radio and no actual network stack code in there. It didn't seem likely that the bug I was hunting was caused by this thin AT-command interface code. On the other hand, this code could be responsible for dropping out characters in the serial stream. However we have sufficient workarounds in place for that bug and it's not worth spending more time on it.

One thing caught my eye in the source: the ATPEEK and ATPOKE commands. ATZB-900-B0 modules consist of an ATmega1281 microcontroller and an AT86RF212 transceiver. These two commands allow raw access to the radio hardware registers, microcontroller RAM, code flash ROM and configuration EEPROM. I thought that, given these, maybe I could find out what gets corrupted in the module's non-volatile memories and perhaps fix it through the AT-command interface.

Only after figuring out how to use them by studying the source did I find out that these two commands are in fact documented in revision 8369B of the SerialNet User Guide. Somehow I overlooked this addition previously.


For the sake of completeness, here is a more detailed description of the problem:

A module that previously worked fine and passed all of my system tests will suddenly no longer respond to the AT+WJOIN command. It will not respond with either OK or ERROR (or their numeric equivalents). However, it will respond to other commands in a normal fashion. This can happen after the module has been deployed for several months or after only a few hours.

A power cycle, reset or restoring factory defaults does not fix this. The only known way of restoring the module is to reprogram its firmware through the serial port using Atmel's Bootloader PC Tool for Windows. This reprogramming invokes a bootloader mode on the module and refreshes the contents of the microcontroller's flash ROM as well as resets the configuration EEPROM contents.

It appears that this manifests more often with sensor nodes that are power-cycled regularly. However, in our setup a node only joins the network once after a power cycle. Even if the bug is caused by some random event that can happen at any time during the uptime of the node, it will not be noticed until the next power cycle. So it is possible that it's not the power cycling itself that causes the problem. Aggressive power-cycling tests don't seem to increase the occurrence of the bug either.


So, with the newfound knowledge of ATPEEK I dumped the contents of the EEPROM and flash ROM from two known-bad modules and a few good ones. Comparing the dumps revealed that both of the bad modules are missing the same 256-byte block of code from the flash, starting at address 0x00011100:

--- zb_046041_good_flash.hex	2014-08-25 16:41:51.000000000 +0200
+++ zb_046041_bad_flash.hex	2014-08-25 16:41:47.000000000 +0200
@@ -4362,22 +4362,8 @@
 000110d0  88 23 41 f4 0e 94 40 88  86 e0 80 93 d5 13 0e 94  |.#A...@.........|
 000110e0  f0 7b 1c c0 80 91 da 13  88 23 99 f0 82 e2 61 ee  |.{.......#....a.|
 000110f0  73 e1 0e 94 41 0c 81 e0  80 93 e5 13 8e e3 91 e7  |s...A...........|
-00011100  90 93 e7 13 80 93 e6 13  8b ed 93 e1 0e 94 86 14  |................|
-00011110  05 c0 0e 94 9a 70 88 81  0e 94 b5 70 df 91 cf 91  |.....p.....p....|
-00011120  08 95 fc 01 80 81 88 23  29 f4 0e 94 40 88 0e 94  |.......#)...@...|
-00011130  e1 71 08 95 0e 94 b5 70  08 95 a2 e1 b0 e0 e3 ea  |.q.....p........|
-00011140  f8 e8 0c 94 71 f4 80 e0  94 e0 90 93 c7 17 80 93  |....q...........|
-00011150  c6 17 0e 94 0f 78 80 91  d9 13 83 70 83 30 61 f1  |.....x.....p.0a.|
-00011160  88 e2 be 01 6f 5f 7f 4f  0e 94 41 0c 89 81 88 23  |....o_.O..A....#|
-00011170  19 f1 0e 94 a6 9f 6b e1  70 e1 48 e0 50 e0 0e 94  |......k.p.H.P...|
-00011180  47 f5 8c 01 8b e2 be 01  6e 5f 7f 4f 0e 94 41 0c  |G.......n_.O..A.|
-00011190  8a 81 88 23 19 f0 01 15  11 05 71 f4 8e 01 0d 5f  |...#......q...._|
-000111a0  1f 4f 87 e2 b8 01 0e 94  41 0c c8 01 60 e0 0e 94  |.O......A...`...|
-000111b0  22 5c 80 e0 0e 94 8d 5c  80 91 d9 13 81 ff 14 c0  |"\.....\........|
-000111c0  0e 94 40 88 0e 94 68 67  80 91 40 10 87 70 19 f4  |..@...hg..@..p..|
-000111d0  0e 94 e1 71 15 c0 81 50  82 30 90 f4 86 e0 80 93  |...q...P.0......|
-000111e0  d5 13 0e 94 f0 7b 0c c0  80 91 40 10 87 70 19 f4  |.....{....@..p..|
-000111f0  0e 94 bf 71 05 c0 81 50  82 30 10 f4 0e 94 df 70  |...q...P.0.....p|
+00011100  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|
+*
 00011200  62 96 e4 e0 0c 94 8d f4  80 91 d7 13 90 91 d8 13  |b...............|
 00011210  00 97 d9 f4 84 e5 97 e7  90 93 d8 13 80 93 d7 13  |................|
 00011220  80 91 40 10 87 70 82 30  11 f4 0e 94 04 77 0e 94  |..@..p.0.....w..|

This is puzzling for several reasons.

First of all, it seems unlikely that this is a hardware problem. Both bad modules (with serial numbers tens of thousands apart) had lost the same block of code. If the flash had lost its contents due to an out-of-spec voltage during programming or some other hardware problem, I would expect the bad address to be random.

However, a software bug causing a failure like that seems highly unlikely as well. I was expecting to see some kind of EEPROM corruption: the EEPROM is used to persistently store module settings and I assume the firmware writes to it often. The flash ROM, however, should be mostly read-only. I find it hard to imagine what kind of a bug could erase a block - reprogramming the flash on a microcontroller is typically a somewhat involved procedure that is unlikely to be triggered by chance.

One possibility is that we are somehow unknowingly invoking the bootloader mode during the operation of the sensor node. During my testing however I found out that just invoking the serial bootloader mode without also supplying it with a fresh firmware image corrupts the flash ROM sufficiently that the module does not boot at all. The Bootloader PC Tool seems to suggest that these modules also have some kind of an over-the-air upgrade functionality, but I haven't yet looked into how that works. It's possible we're enabling that somehow.

Unfortunately, the poke functionality does not allow you to actually write to flash (you can write to RAM and EEPROM though). So even if this currently allows me to detect a corrupt module flash while the node is running, that is only good for saying that the module won't come back on-line after a reboot. I can't fix the problem without fully reprogramming the firmware. This means either hooking the module to a laptop or implementing the reprogramming procedure on the sensor node itself. The latter is not trivial, because it involves implementing the programming protocol and somehow arranging for the storage of a complete uncorrupted SerialNet firmware image on the sensor node.

Posted by Tomaž | Categories: Code | Comments »

jsonmerge

20.08.2014 21:01

As I mentioned in my earlier post, my participation at the Open Contracting code sprint during EuroPython resulted in the jsonmerge library. After the conference I slowly cleaned up the remaining few issues and brought up code coverage of unit tests to 99%. The first release is now available from PyPi under the MIT license.

jsonmerge tries to solve a problem that seems simple at first: given a series of structured JSON documents, how to create a single document that contains an aggregate of all their contents. With simple documents that might be as trivial as calling an update() method on a dict:

>>> a = {'foo': 1}
>>> b = {'bar': 2}

>>> a.update(b)
>>> a
{'foo': 1, 'bar': 2}

However, even with just two plain dictionaries, things can quickly get complicated. What should happen if both documents contain a field with the same name? Should a later value overwrite the earlier one? Or should the resulting document have in that place a list that contains both values? Source JSON documents themselves can also contain arrays (or arrays of arrays) and handling those is even less straightforward than dictionaries in this example.

Often I've seen a problem like this solved in application code - it's relatively simple to encode your wishes in several hundred lines of Python. However, JSON is a very flexible format and such code is typically brittle. Change the input document a bit and more often than not your code will start throwing KeyErrors left and right. Another problem with this approach is that it's often not obvious from the code what kind of strategy is taken for merging changes in different parts of the document. If you want the behavior to be well documented, you have to write, and keep updated, a piece of English prose that describes it.

Open Contracting folks are all about making a data standard. Having a piece of code instead of a specification clearly seemed like the wrong approach there. They were already using JSON schema to codify the format of various JSON documents for their procedures. So my idea was to extend the JSON schema format to also encode the information on how to merge consecutive versions of those documents.

The result of this line of thought was jsonmerge. For example, to say that arrays appearing in the bar field should be appended instead of replaced, the following schema can be used:

schema = {
    "properties": {
        "bar": {
            "mergeStrategy": "append"
        }
    }
}

This way, the definition of the merge process is fairly flexible. jsonmerge contains what I hope are sane defaults for when the strategies are not explicitly defined. This means that the merge operation should not easily break when new fields are added to documents. This kind of schema is also a bit more self-explanatory than a pure Python implementation of the same process. If you already have a JSON schema for your documents, adding merge strategies should be fairly straight-forward.
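
In practice, using the schema above looks roughly like this (a quick sketch based on the README; the base and head documents are made up for the example):

from jsonmerge import Merger

schema = {
    "properties": {
        "bar": {
            "mergeStrategy": "append"
        }
    }
}

merger = Merger(schema)

base = {"foo": 1, "bar": ["a"]}
head = {"foo": 2, "bar": ["b"]}

# "foo" falls back to the default overwrite strategy, while "bar" is appended,
# so the result should be {'foo': 2, 'bar': ['a', 'b']}.
result = merger.merge(base, head)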

One more thing that this approach makes possible is that given such an annotated schema for source documents, jsonmerge can automatically produce a JSON schema for the resulting merged document. The merged schema can be used with a schema validator to validate any other implementations of the document merge operation (or as a sanity check to check jsonmerge against itself). Again, this was convenient for Open Contracting since they expect their standards to have multiple implementations.
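
Continuing the sketch above, the schema for the merged document comes from the same Merger object, something along these lines:

# JSON schema describing the documents produced by merger.merge(), suitable
# for feeding to a validator like jsonschema.
merged_schema = merger.get_schema()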

Since it works on JSON schema documents, the library structure borrows heavily from the jsonschema validator. I believe I managed to make the library general enough that extending it with additional merge strategies shouldn't be too complicated. The operations performed on the documents are somewhat similar to what version control systems do, so I borrowed terminology from there. jsonmerge documentation and source talk about base and head documents and merge strategies. The meanings are similar to what you would expect from a git man page.

So, if that sounds useful, fetch the latest release from PyPi or get the development version from GitHub. The README should contain further instructions on how to use the library. Consult the docstrings for specific details on the API - there shouldn't be many, as the public interface is fairly limited.

As always, patches and bug reports are welcome.

Posted by Tomaž | Categories: Code | Comments »

On cartoon horses and their lawyers

15.08.2014 19:14

GalaCon is an annual event that is about celebrating pastel colored ponies of all shapes and forms, from animation to traditional art and writing. It's one of the European counterparts to similar events that have popped up on the other side of the Atlantic in recent years. These gatherings were created in the unexpected wake of the amateur creativity that was inspired by Lauren Faust's reimagining of Hasbro's My Little Pony franchise. For the third year in a row GalaCon attracted people from as far away as New Zealand. It's a place where a sizable portion of the attendees wear at least a set of pony ears and a tail, if not more elaborate equestrian-inspired attire. Needless to say, it can be a somewhat frightful experience at first and definitely not for everyone.

For most people it seems to be a place to get away from the social norms that typically prevent adults from obsessing over stories and imagery meant for an audience half their age and often of the opposite gender. While I find the worshiping of the creative talents behind the show a bit off-putting, I'm still fascinated by the amateur creations of this community. The artists' booths were a bit high on kitsch ("Incredible. Incredibly expensive" was one comment I overheard), but if you look into the right places on-line, there are still enjoyable and thoughtful stories, art and music to be found.

Meeting people I knew from their creations on the web was a fun experience. However for me this year's GalaCon was also a sobering insight into what kind of a strange mix of creativity, marketing psychology and legal matters goes into creating a cartoon for children these days.

GalaCon and Bronies e.V. flags at Forum am Schlosspark.

A highlight of the event was a panel by M. A. Larson, one of the writers behind the cartoon series. By going step by step through a thread of actual emails exchanged between himself, Lauren Faust and the Hasbro office he demonstrated the process behind creating a script for a single episode.

The exact topic of the panel was not announced beforehand, however, and all recording of the screen was prohibited, with staff patrolling the aisles to look for cameras. I don't know how much of that was for dramatic effect and how much was due to real legal requirements. However, even before the panel began, it gave a strong impression of the kind of atmosphere a project like this is created in, especially considering that the episode he was discussing aired more than three years ago. I'm sure a lot of people in the audience could quote parts of that script by heart. It has been transcribed, analyzed to the last pixel, remixed and in general picked apart on the Internet years ago.

My Little Pony has been called the end of the creator-driven era in animation. So far I thought marketing departments dictated what products should appear on the screen and which characters should be retired to make room for new toy lines. I was surprised to hear that sometimes the Hasbro office gets involved even in details like which scene should appear last before the end of an act and the commercial break. That fact was even more surprising since this apparently happened in one of the earliest episodes, where the general consensus seems to be that the show was not yet ruined by corporate control over creative talent.

A similar amount of thought seemed to go into the possibility of lawsuits. Larson mentioned their self-censorship of the idea to make characters go paragliding, having them do zip lining instead. Is it really harder to argue in court that a child has been hurt trying to imitate horses sliding along a wire than horses soaring under a parachute?

GalaCon 2014 opening ceremony.

The signs of the absurdity of intellectual property protection these days could also be seen throughout the event. Considering Bronies e.V. paid license fees for the public performance of the show it was ridiculous that they were using low-quality videos from the United States TV broadcasts for projection on the big cinema screen, complete with pop-up advertisements that didn't make sense.

Similarly, the love-hate relationship between copyright holders and non-commercial amateur works is nothing new. There were a lot of examples where rabid law firms, tasked with copyright protection and with only tenuous connections back to the mothership, used various extortion tactics to remove remixed content from the web. I still don't understand what kind of law justifies cease-and-desist letters for works inspired by unnamed background characters that only appeared for a couple of seconds in the original show.

Evening in front of the Forum am Schlosspark.

In general, GalaCon was a bit more chaotic experience than I would wish for and I left it with mixed feelings. Cartoon ponies on the internet are full of contradictions. While the stories they tell are inspiring and a welcome getaway from daily life, the money-grabs behind them are often depressing. I still believe in the good intentions of these events but the extravagant money throwing at the charity auction made me question a lot of things. With extra fees for faster queues, photos and autographs this year's event felt more like a commercial enterprise than a grassroots community event.

Posted by Tomaž | Categories: Life | Comments »