CubieTruck UDMA CRC errors

18.10.2014 20:07

Last year I bought a CubieTruck, a small, low-powered ARM computer, to host this web site and a few other things. Combined with a Samsung 840 EVO SSD on the SATA bus, it proved to be a relatively decent replacement for my aging Intel box.

One thing that has been bothering me right from the start though is that every once in a while, there were problems with the SATA bus. Occasionally, isolated error messages like these appeared in the kernel log:

kernel: ata1.00: exception Emask 0x10 SAct 0x2000000 SErr 0x400100 action 0x6 frozen
kernel: ata1.00: irq_stat 0x08000000, interface fatal error
kernel: ata1: SError: { UnrecovData Handshk }
kernel: ata1.00: failed command: WRITE FPDMA QUEUED
kernel: ata1.00: cmd 61/18:c8:68:0e:49/00:00:02:00:00/40 tag 25 ncq 12288 out
kernel:          res 40/00:c8:68:0e:49/00:00:02:00:00/40 Emask 0x10 (ATA bus error)
kernel: ata1.00: status: { DRDY }
kernel: ata1: hard resetting link
kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
kernel: ata1.00: supports DRM functions and may not be fully accessible
kernel: ata1.00: supports DRM functions and may not be fully accessible
kernel: ata1.00: configured for UDMA/133
kernel: ata1: EH complete

At the same time, the SSD reported increased UDMA CRC error count through the SMART interface:

UDMA CRC weekly error count on CubieTruck.

These errors were mostly benign. Apart from the cruft in the log files they did not appear to have any adverse effects. Only once or twice in the last 10 months or so did they cause the kernel to remount filesystems on the SSD as read-only, which required some manual intervention to get the CubieTruck back on-line.

I've seen some forum discussions that suggested this might be caused by a bad power supply. However, checking the power lines with an oscilloscope did not show anything suspicious. On the other hand, I did notice during this test that the errors seemed to occur when I was touching the SATA cable. This made me think that the cable or the connectors on it might be the culprit - something that was also suggested in the forums.

Originally, CubieTruck comes with a custom SATA cable that combines both power and data lines for the hard drive and has special connectors (at least considering what you usually see in the context of the SATA cabling) on the motherboard side.

Last few weeks it appeared that the errors were getting increasingly more common, so I decided to try replacing the cable. Instead of ordering a new CubieTruck SSD kit I improvised a bit: I didn't have proper connectors for CubieTruck's power lines at hand, so I just soldered the cables directly to the motherboard. On the SSD drive I used the standard 15-pin SATA power connector.

For the data connection, I used an ordinary SATA data cable. The shortest one I could find was about three times as long as necessary, so it looks a bit uglier now. The connector on the motherboard side also needed some work with a scalpel to fit into CubieTruck's socket. The original connector on the cable that came with CubieTruck is thinner than those on standard SATA cables I tried.

Replacement SATA cables for CubieTruck.

So far it seems this fixed the CRC errors. In the past few days since I replaced the cable I haven't seen any new errors pop up, but I guess it will take a month or so to be sure.

Posted by Tomaž | Categories: Digital | Comments »

GA 7VT600 lmsensors settings

02.05.2014 17:35

Recently I've put into use a relatively ancient Gigabyte GA 7VT600 1394 motherboard that's been gathering dust on the top shelf of my wardrobe. I used it to replace an even older MSI board which, while still working perfectly, was getting a bit slow.

After replacing the dead lithium battery for RTC and NVRAM, it seems to work just fine with stock Debian Wheezy and passes a few ad-hoc stress tests.

One thing I noticed though is that sensors tool from the lm-sensors package isn't very useful by default.

it87-isa-0290
Adapter: ISA adapter
in0:          +1.70 V  (min =  +0.00 V, max =  +4.08 V)
in1:          +1.33 V  (min =  +0.00 V, max =  +4.08 V)
in2:          +3.25 V  (min =  +0.00 V, max =  +4.08 V)
in3:          +2.86 V  (min =  +0.00 V, max =  +4.08 V)
in4:          +3.23 V  (min =  +0.00 V, max =  +4.08 V)
in5:          +1.89 V  (min =  +0.00 V, max =  +4.08 V)
in6:          +1.89 V  (min =  +0.00 V, max =  +4.08 V)
in7:          +3.01 V  (min =  +0.00 V, max =  +4.08 V)
Vbat:         +0.00 V  
fan1:        3308 RPM  (min =    0 RPM, div = 8)
fan2:           0 RPM  (min =    0 RPM, div = 8)
temp1:        +35.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp2:        +31.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermistor
temp3:        +47.0°C  (low  = +127.0°C, high = +127.0°C)  sensor = thermal diode
intrusion0:  OK

The board obviously has a IT87 series chip that provides some hardware monitoring functionality (you need the it87 kernel module). Apart from the lack of useful labels, some voltages also seem to be divided by voltage dividers before being measured by it87. I would expect at least 5 V and 12 V lines there.

Figuring out which fan is which was trivial. For finding out other things, I compared the printout above with what the BIOS setup utility says. I picked out the most logical divider values for voltages. Since these also seem to fit the order of sensors, I'm relatively confident they are correct.

PC Health Status screen on GA 7VT600 motherboard.

in5 and in6 readings are very unstable and don't seem to be shown in the BIOS screen. I'm guessing they are not connected on this board. temp2 is also not shown, but seems to give reasonable values, so I'm guessing there is a temperature sensor connected there, but I don't know where it is.

So, for future reference, put this into /etc/sensors.d/ga-7vt600 to get a nicely labeled and properly calculated values for this hardware:

chip "it87-isa-0290"
    label temp1 "Sys Temp"
    label temp2 "Aux Temp"
    label temp3 "CPU Temp"

    label fan1 "CPU Fan"
    label fan2 "Sys Fan"

    label in0 "Vcore"
    label in1 "DDR Vtt"
    label in2 "+3.3V"
    label in3 "+5V"
    label in4 "+12V"
    label in7 "5VSB"

    compute in3 @*1.679, @/1.679
    compute in4 @*3.973, @/3.973
    compute in7 @*1.679, @/1.679
Posted by Tomaž | Categories: Digital | Comments »

CubieTruck Perl performance

23.01.2014 22:57

Two months ago I bought a CubieTruck, one of the many cheap, bare-bone ARM-based computers that keep popping-up everywhere these days. My idea was to replace the aging x86 server that is running this website with something more power-efficient. So I was looking for a reasonably powerful board with a proper SATA interface and a decent amount of RAM. Raspberry Pi was out of the question, but the latest incarnation of CubieBoard with a dual-core 1 GHz ARM Cortex-A7, 2 GB of RAM, SATA 2.0 and Gigabit Ethernet seemed to fit the bill.

Unfortunately I could not find any reliable benchmarks I could use to estimate how ARM SoCs perform in comparison with my existing setup. So before I decided to migrate I took a while to do some performance tests and get to know this hardware.

CubieTruck

The software setup I'm interested in benchmarking is somewhat archaic in these days of Node.js and NoSQL. I'm using Perl 5 with HTML::Template doing most of the heavy lifting (at least according to Devel::NYTProf profiler). Most parts are statically generated and some are dynamic using a handful of SpeedyCGI Perl 5 scripts. These are combined into a consistent website you see here with a somewhat convoluted Apache configuration using the threaded worker.

In the following benchmarks I'm comparing:

  • An AMD Duron at 700 MHz, 1.2 GB RAM running stock x86 Debian Squeeze. Root filesystem is mounted from an IDE hard drive.
  • A CubieTruck A20 running armhf Debian Wheezy and the kernel supplied for the CubieTruck Ubuntu Server installation. Root filesystem is mounted from an SD card.

Both machines were connected through a 100 Mb/s Ethernet switch to a laptop which was running the remote end of the benchmarks.


First, to see how fast the static part of the web site is generated, I ran the full (single threaded) HTML rebuild. I measured the required user space CPU time with the time utility. This is the fastest run of three on each machine:

AMD Duron CubieTruck
CPU time to rebuild static pages 45.3 s 61.8 s

Then, to check if network was operating at the bit rate I thought it was, I ran iperf to measure TCP throughput between the server and the laptop:

AMD Duron CubieTruck
iperf throughput test 94.0 Mb/s 94.5 Mb/s

Finally, I ran a suite of tests using the Apache benchmarking tool. I measured how many requests per minute a server can handle for different types of content and different number of concurrent requests. Numbers in parentheses show size of HTTP body (without headers).

CubieTruck requests per second for a static HTML page.

CubieTruck requests per second for a dynamic HTML page.

CubieTruck requests per second for an image.

CubieTruck requests per second for API call.

The site rebuild is somewhat disappointingly almost one-third slower than on a 10 year old PC. However the single threaded Apache performance is on par with it. In the case of more concurrent users the CubieTruck of course has an advantage because of an additional CPU core. Actually in both cases with static content CubieTruck managed to saturate the line when there was more than one concurrent request.

I tried to make these tests in a way that the slow SD card in the CubieTruck would minimally affect their outcome. All of data should fit into the buffer cache, which is why in the first test I only took into account the fastest run and only user space CPU time. However I now suspect that the SD card still affected the numbers somehow (the rebuild operation is the heaviest of the tests regarding filesystem I/O). I don't know for sure how kernel computes the time returned by the time utility.

These results are good enough that I can't dismiss CubieTruck based on performance. If a proper SATA drive wouldn't speed it up, I could probably parallelize the build process with not much work. That should cut down on time if it's really Perl performance on ARM that is slowing it down. On the other hand I'm having some other concerns about using CubieTruck as a personal server so I'm not completely decided yet about putting it on my rack.

Posted by Tomaž | Categories: Digital | Comments »

Repairing the Happy Hacking Keyboard

29.09.2013 15:40

My trusty old Happy Hacking Keyboard has been working pretty reliably for the last four years. After fixing a botched plastic mold and strategically placing a piece of cardboard in its innards that is. Regarding the typing feel it is still my favorite keyboard that doesn't take a lot of space on a crowded desk and I only switch to a regular-sized Logitech when I'm working with EDA programs where I need functions keys a lot.

So I was pretty disappointed when it stopped working a week ago. Checking the kernel log revealed all sorts of random USB bus errors:

usb 6-1.1: USB disconnect, device number 14
usb 6-1: reset full-speed USB device number 13 using uhci_hcd
usb 6-1: device not accepting address 13, error -71
usb 6-1: reset full-speed USB device number 13 using uhci_hcd
usb 6-1: device firmware changed
hub 6-1:1.0: hub_port_status failed (err = -19)
hub 6-1:1.0: hub_port_status failed (err = -19)
hub 6-1:1.0: hub_port_status failed (err = -19)
hub 6-1:1.0: activate --> -19
usb 6-1: USB disconnect, device number 13
usb 6-1: new full-speed USB device number 15 using uhci_hcd
usb 6-1: string descriptor 0 read error: -71
usb 6-1: New USB device found, idVendor=04fe, idProduct=0008
usb 6-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 6-1: can't set config #1, error -71
hub 6-0:1.0: port 1 disabled by hub (EMI?), re-enabling...

This looked like something systemic. Either the controller was resetting continuously or there was something wrong with the USB wiring between the controller and the computer.

A new, identical HHKB model goes for more than $110 today so I opened it up to see if there's anything I can do. After checking the cable with an ohm-meter my suspicion fell on the power supply which seems to consist of a 3.3 V LDO regulator and some capacitors. I could see no obvious transients on the power rails when the controller switched on after the keyboard was plugged in. One interesting thing I did see was that if negotiation with USB host fails the controller switches itself off completely, including its quartz oscillator.

Happy Hacking Keyboard Lite2 USB controller

Since I had some problems with flaky USB cables before, I removed the original cable and soldered a new cable directly to the circuit board. This fixed the problem! After poking around some more, it turned out that after re-soldering the connector to the PCB the original cable worked as well.

I removed the membrane before poking around the controller board since a hot soldering iron and plastics don't mix well. Re-inserting the soft matrix tails into the (non-ZIF) connectors was somewhat tricky. I resorted to using pliers plus a bit of paper to protect the delicate wires.

Re-inserting flexible cables into connectors using pliers

I also noticed that silver wires on the keyboard matrix itself seem to be developing a kind of a dark oxide on the outer edges. I don't remember for sure whether they looked like this from the start though. If something is eating away at the wires that definitely puts a kind of a definitive limit to this keyboard's longevity.

Possible oxidation on the keyboard matrix

In conclusion, just checking with a multimeter doesn't mean there's not a bad solder joint somewhere on a high-speed bus. The wisdom of checking first for bad RoHS soldering on mechanically (and thermally) stressed components confirmed itself again. Also, did you know there are a couple of alternative, open source Happy Hacking Keyboard controllers out there?

Posted by Tomaž | Categories: Digital | Comments »

Some notes about CC chips

20.05.2013 12:31

Here are two unusual things I noticed while working with Texas Instruments (used to be Chipcon) CC2500 and CC1101 integrated transceivers.

It appears that the actual bit rate in continuous synchronous serial mode can differ significantly from what Texas Instrument's SmartRF Studio 7 calculates. See for example the following measurements that were taken on a CC2500 chip using a 27 MHz crystal oscillator as a reference clock.

MDMCFG4 valueRF studio [baud]measured [baud]
0x8a50.038.5
0x8b100.077.2
0x8c200.0154.0
0x8d400.0305.0

These bit rates were measured using an oscilloscope attached to the clock output of the transceiver, so I trust them to be correct. Bit rates I measured on a CC1101 agree with what SmartRF Studio predicts.

Update: I revisited this issue and the problem was a bug in my code that caused MDMCFG3 register (which also affects data rate) not to be properly programmed on CC2500. Accounting for this bug, the data rates are within 1% of those calculated by SmartRF Studio 7 or from the formula given in the datasheet.

The other issue I saw is symbol mapping for 4FSK modulation in CC1101. It looks like it depends on the configured bit rate. For example, with 200 baud, the symbol to frequency mapping appears to be as follows:

symbolΔf
00−0.33 fdev
01−1.00 fdev
10+0.33 fdev
11+1.00 fdev

However, with 45 baud, the mapping is different, with symbol bit order apparently switched around:

symbolΔf
00−0.33 fdev
01+0.33 fdev
10−1.00 fdev
11+1.00 fdev

Update: It's possible this difference has something to do with when exactly the radio samples the data line in relation to the clock. Either I don't understand exactly what is going on or the radio isn't sampling the data when it is supposed to. Also, the factors of fdev in tables were wrong (symbol frequencies are equally spaced, with maximum deviation from central frequency equal fdev).

Of course, this doesn't matter if you are using two identically configured CC1101 chips on both ends of the radio link. But it is important if you want to use it to communicate with some other hardware.

Posted by Tomaž | Categories: Digital | Comments »

VESNA and signal synthesis

26.04.2013 20:49

Experimental radio equipment on VESNA can be used for more than spectrum sensing. Radio boards based on Texas Instruments CC2500 and CC1101 transceivers can also be used for packet transmission and reception and, perhaps somewhat surprisingly, also as flexible signal generators. These can be used in experiments when you want for instance to introduce a controlled interference in some system or to check if your spectrum sensor is working correctly.

Usually when I'm talking with people about what our hardware is capable of the conversation often starts with amazement at how small the radio part is and then inevitably turns to the question whether VESNA has a software defined radio. I have to answer no, it doesn't, and then there's usually an awkward silence because that seems like a dead-end for any serious research work these days.

CC1101 transceiver on SNE-ISMTV-868

True, these transceivers don't provide software access to the undemodulated baseband samples (like for instance USRP does). However they are still amazingly flexible. They offer plenty of reconfigurability and, after you get to know some of their quirks, can be adapted for a lot of weird use cases without hardware changes. If you skip the various high-level digital parts of the chip for proprietary packet format handling there's a flexible front-end in there that allows you to choose between a handful of frequency, amplitude and phase modulators and several options for channel filters and such.

The closest you can come to software defined radio is a transparent continuous transmit mode where you can feed the transceiver arbitrary binary data from the microcontroller and it will simply get modulated and fed into the antenna. There is a catch though, since the chips offer at most 2-bit quantization of the baseband signal before modulation (they were designed for simple digital transmissions after all). This means that you have to get creative if you want to approximate an analog modulation and be ready for plenty of quantization noise.

This is how it looks like on a spectrum analyzer when you run a direct digital synthesis algorithm on the microcontroller that generates a baseband sawtooth frequency sweep and use the amplitude modulator to up-convert it to 2.4 GHz:

RF signal synthesis using VESNA

(Click to watch RF signal synthesis using VESNA video)

Using delta-sigma modulation you can approximate arbitrary waveforms in this way. For instance, you can make a passable simulation of an (analog) wireless microphone transmission using a 4-FSK modulator in CC1101 tuned into the UHF band.

Of course, setting this up takes more work than popping a few blocks into GNU Radio Companion and part of my job is to make it more accessible to people using VESNA and our VESNA-based testbeds. If you're interested in such signal generation using Texas Instruments CC series, some platform-independent code capable of doing this should start hitting the vesna-spectrum-sensor repository on GitHub in the next few weeks.

Posted by Tomaž | Categories: Digital | Comments »

The lowly connector

11.04.2013 20:47

Here's a small addendum to my previous list of lessons regarding easy debugability of microcontroller boards.

It's not a bad idea to pay a bit of extra attention to the connectors that will be used when developing and debugging software. It's likely these will see much more use than any other connector on the board, especially if the board is also meant for teaching or research like VESNA.

The first concern is that it should be hard to connect it in a wrong way. IDC and similar pin header connectors are popular for JTAG. Use a male part that has a shroud so that it's impossible to connect it when displaced by a pin or turned 180 degrees. That might sound obvious, but when you're debugging that tricky race condition and switching the debugger between three systems on your table late in the evening, the last thing you need is a burned out board because of a misplaced connector.

The other thing that is also worth considering is the life time of the connector itself. While the life time of the part on the board might not be problematic, the part that stays with the developer can be. For VESNA one of the most common reasons to make a trip to the soldering station is a torn wire in the connector for the serial debug console. We use something similar to the Berg connector and that one doesn't really take many connect-disconnect cycles before either wires get torn out or little springs break and the metal part falls out of the plastic housing. It's not always obvious that has happened and again it's a pain to realize after a long debugging session that the reason a board is talking garbage is not due to a bug you're trying to catch but rather due to a broken ground line in your debug console.

Posted by Tomaž | Categories: Digital | Comments »

About Digi Connect ME

03.10.2012 23:32

Remember my recent story about Atmel modules? For the last two days I had a very strong déjà-vu feeling when I was trying to debug an issue that popped up in the last minute and is preventing a somewhat time critical experiment from being performed on one of our VESNA testbeds.

It turned out that while 7-bit ASCII strings were correctly transferred between VESNAs and a client calling a HTTP API, binary data sent down the same pipeline got corrupted somewhere on the way. Plus, to make things just a bit more interesting, it sometimes also made the whole network of VESNAs unreachable.

VESNA coordinator with Digi Connect ME module.

Now, problems like this are unfortunately quite a common occurrence. Before data from a sensor node reaches a client, it must pass through multiple hops in a (notoriously unreliable) ZigBee mesh network, a coordinator VESNA that serves as a proxy between the mesh and a TCP/IP tunnel, a Java application on the end of the tunnel that translates between a home-grown HTTP-like protocol used between VESNAs and proper HTTP and finally an Apache server acting as a reverse-proxy for the public HTTP API. Leaky abstractions and encapsulations are a plenty and it's not unusual to have some code somewhere along the line assume that the data it is passing is ASCII-only, valid UTF-8 or something else entirely.

I won't bore you with too many details about debugging. I opted for a top-down approach, with the Java layer being the main suspect based on previous experience. After that got a thorough check, I moved to sniffing the tunnel with Wireshark, which has an extra complication that it's SSL encrypted. Luckily, it doesn't seem to use Diffie–Hellman key exchange, meaning that the SSL dissector was quite effective once I figured out how to extract private keys from Java's keystore. That also seemed to look OK, so the next layer down was the SSL tunnel endpoint, which is a Digi Connect ME module.

This is basically a black box that takes a duplex RS-232 connection on one end and tunnels it through an encrypted TCP/IP connection. It's an deceptively simple description though. In fact Digi Connect ME is a miniature computer with an integrated web server for configuration, scripting support and a ton of other features (I have close to 500 pages of documentation for it on my computer and I only downloaded documents I though might be useful in debugging this issue).

VESNA coordinator hooked up to a logic analyzer.

Anyway, when I looked closely, the problem was quite apparent. On the RS-232 side the module was set to use XON/XOFF software flow-control. This obviously won't work when sending arbitrary binary data. Not only will the module interpret XON and XOFF bytes as special and so drop them from the pipeline, an XOFF that is not followed by XON will also halt the transmission, leading to hangs and timeouts. The fix looked simple enough: switch to hardware flow control using dedicated CTS and RTS lines.

As you might have guessed, it was not that easy. It turns out that when hardware flow control is enabled in Digi Connect ME, it will randomly duplicate characters when sending them down the serial line. Again, the suspicion first fell on our homegrown UART driver on the VESNA side, but a logic analyzer trace below confirmed that it's the actual Digi Connect module that is the culprit and not some bug on our side.

Logic analyzer trace from a Digi Connect ME module.

Now, at this point I'm seriously confused. Digi Connect ME is quite popular and, at least judging from browsing the web, used in a lot of applications. But after the ZigBit module it's also the second piece of hardware that exhibits such broken behavior that I can't believe it has passed any thorough quality check. All of my experience speaks that it must be us that are doing something wrong and in both cases I have tested just about all possibilities where VESNA could be doing something wrong. Actually, I want it to be a mistake on our end because that means I can fix it. But honestly, once you see wrong data being sent on a logic analyzer, I don't think there can be any more doubt. Must really everything turn rotten once you look into it with enough detail?

Posted by Tomaž | Categories: Digital | Comments »

IguanaWorks USB IR transceiver

19.08.2012 22:16

I bookmarked this little gadget a while ago. Having recently solved my problems with scriptable switching of PulseAudio audio outputs I thought it's time to finally order it and try to automate a few other home-theater related operations through it. Over the summer a few other infrared-communication related things also piled up on my desk, so having an universal IR transmitter and receiver within reach seemed like a good idea.

IguanaWorks USB IR transceiver

This is an IguanaWorks USB IR transceiver, the hybrid version. Hybrid meaning it has both an integrated IR LED and detector pair and a 3.5 mm jack for an external transmitter.

From the software side it comes with quite an elaborate framework, free and open source of course. Software also comes in the form of Debian binary and source packages, which is a nice plus. I did have a small problem compiling them though since the build process seems to depend on the iguanair user being present on the system. This user only gets created during installation which makes it kind of a catch-22 situation. Once compiled the packages did work fine on my Debian Squeeze system.

After everything is installed, you get:

  • igdaemon, a daemon that communicates with the actual USB dongle,
  • igclient, a client that exposes daemon functionality through a command-line interface,
  • a patched version of lircd daemon that includes a driver that offloads communication to igdaemon.

lirc is the usual Linux framework for dealing with infrared remotes. It knows how to inject keypresses into the Linux input system when an IR command is received and comes with utilities that can send commands back through the IR transmitter to other devices. This is the first time I'm dealing with it and I'm still a bit confused how it all fits together, but right now it appears some parts of the lirc ecosystem don't currently work with iguanair at all. For instance, xmode2 utility that shows received IR signals in an oscilloscope-like display isn't supported.

As I'm currently mostly interested in using this from my scripts, using igclient directly seems to be simplest option. There are also Python bindings for the client library, but they appear undocumented and I haven't yet took a dive into the source code to figure it out.

The client reports the received signals in the form of space-pulse durations, like this:

$ igclient --receiver-on --sleep 10
received 1 signal(s):
  space: 95573
received 3 signal(s):
  space: 7616
  pulse: 64
  space: 65536

I'm not yet sure what the units for those numbers are. According to the documentation the transmit functionality expects a similarly formatted input, but I have yet to try it out. It seems that if I want to plot the signals on a time line I will have to write my own utility for that.

To be honest I expected using this to be simpler from the computer side. In the end it basically has the same functionality as my 433 MHz receiver. One thing I also overlooked is that it's only capable of transmitting modulated on-off keyed transmissions (25 - 125 kHz carrier), which makes it useless for devices that don't use that, like shutter glasses. But given that I did basically zero research before ordering it I can't really blame anyone else but me for that (and that bookmark must have been at least a year old). Just yesterday I also stumbled on IR toy which appears to be a similar device. It would be interesting to know how it compares with the IguanaWorks one.

Posted by Tomaž | Categories: Digital | Comments »

On Atmel SerialNet ZigBit modules

13.08.2012 22:27

Don't use Atmel BitCloud/SerialNet ZigBit modules.

With this important public service announcement out of the way, let me start at the beginning.

Atmel makes ZigBit modules that contain an IEEE 802.15.4-compatible integrated radio from their AT86RF2xx family and an AVR-based microcontroller on a small hybrid component. The CPU runs a proprietary mesh-networking stack (BitCloud) built on top of the ZigBee specification and exposes a high-level interface on a serial line they call SerialNet (think "send the following data to this network address"-style interface). The module can be used either as a very simple way of adding mesh networking to some host device or as a stand-alone microcontroller with a built-in radio (Atmel provides a proprietary BitCloud SDK, so you can build your own firmware for the AVR).

Atmel ZigBit module on a VESNA SNR-MOD board.

At SensorLab we built a sensor node radio board for VESNA using these modules (more specifically, ATZB 900 B0 for 868 MHz and ATZB 24 B0 for 2.4 GHz links) as they appeared to be simple to use and would provide a temporary solution for connecting VESNAs with a wireless mesh until we come up with a working and reliable 6LoWPAN implementation. So far we have deployed well over 50 of these in different VESNA installations.

I can now say that these modules have been nothing but trouble from the start. First there is the issue of documentation. Atmel's documentation has always been superb in my memory. Compare one of their ATmega datasheets with the vague hand-waving STMicroelectronics calls microcontroller documentation and you'll know why. Unfortunately, the SerialNet user guide is an exception to this rule. They leave many corner cases undefined and you are left to your own experimentation to find out how the module behaves. There is almost no timing information. How long can you expect to wait for a response to a command? How long will the module be unresponsive and ignore commands after I change this setting? Even the hardware reset procedure is not described anywhere beyond a "Reset input (active low)".

The problems with this product however go deeper than this. In my experience developers, my self included, tend to be too quick to blame problems on bugs in someone else's code. When colleagues complained how buggy these modules are I said that it's much more likely a problem in our code or hardware design. That is until I started investigating myself the numerous problems we had with networking: the modules would return responses they shouldn't have according to the specification, they would say that they are connected to the network even though no other network node could communicate with them. Modules would even occasionally persistently corrupt themselves, requiring firmware reprogramming before they would start responding to commands again. Believe me, it's annoying to reach for a JTAG connector when the module in question is on a lamp post in some other part of the country.

For most of these bugs I can only offer anecdotal evidence. However I have been investigating one important issue for around two months now and I'm confident that there is something seriously wrong with these modules. I strongly suspect there is a race condition somewhere in Atmel's (proprietary and closed-source, of course) code that causes some kind of buffer corruption when a packet is received over the radio at the same time as the module receives a command over the serial line. This will cause the module to lose bytes on the serial line, making it impossible to reliably decode the communications protocol.

For instance, this is how the communications should look like over the serial line. Host in this case is VESNA and module is Atmel ATZB 900 B0:

→ AT+WNWK\x0d                                # host asks for network status
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a  # module asynchronously reports received data
← OK\x0d\x0a                                 # module answers that network is OK
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a  # module asynchronously reports received data

This is how it sometimes looks like:

→ AT+WNWK\x0d
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
← OK\x0d                                     # note missing \x0a
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a

And sometimes it gets as bad as this:

→ AT+WNWK\x0d
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
← ODATA 0000,0,77:(77 bytes of data)\x0d\x0a # note only O from OK sent

An inviting explanation for these problems would be that we have a bad implementation of an UART on VESNA. Except that this happens even when the module is connected to a computer via a serial-to-USB converter and I have traces from a big and expensive Tektronix logic analyzer (as well as Sigrok) to prove that corrupted data is indeed present on the hardware serial line and not an artifact of some bug on the host side:

Missing Line Feed character from an Atmel ZigBit module.

A logic analyzer trace demonstrating a missing line feed character. Click to enlarge.

Data corruption on the serial line from an Atmel ZigBit module.

A logic analyzer trace demonstrating a jumbled-up OK and DATA response. Click to enlarge.

I have seen this happen in the lab under controlled conditions on 10 different modules and have good reasons to suspect the same thing is happening on the deployed 50-plus modules. Also, this bug is present in both BitCloud 1.14 and 1.13 and in both vanilla and security-enabled builds. All of this points to the fact that this problem is not due to some isolated fluke on our side.

For well over a month I have been on the line with Atmel technical support and while they had politely answered all of my mail they had also failed to acknowledge the issue or provide any helpful information even though I sent them a simple test case that reliably reproduces the problem in a few seconds. Of course, without their help there is exactly zero chance of getting to the bottom of this and given all of the above I seriously doubt this is anything else than a bug in their firmware.

At this point I have mostly given up any hopes that this issue will be resolved. During my investigation I did find out that decreasing the amount of chatter on the serial line decreases the probability of errors, so I did manage to work around this bug a bit by switching to non-verbose responses (ATV0) and using packets that are a few bytes shorter than the maximum (say 75 bytes for encrypted frames). This will hopefully improve the reliability of already deployed hardware. For the future, we will be looking into alternatives, as unfortunately 6LoWPAN still seems to be somewhat outside of our grip.

Posted by Tomaž | Categories: Digital | Comments »

Microcontroller board design tips

20.06.2012 17:52

I've been working with VESNA for a better half of a year now and over time I have come to know some oversights that were made in the original hardware design that make development of software for VESNA unnecessary complicated. Of course, mistakes like these are only obvious in hindsight, so I'm sharing here a few tips that should prevent them in future designs.

Have a JTAG port always accessible. JTAG is what allows you to do any kind of debugging beyond simple printfs. Together with an on-chip debugger and the GNU debugger it makes it possible to debug firmware running on the microcontroller in much the same way as an executable on your PC. You can inspect parts of the CPU's address space, program new firmware right from the debugger and set break and watchpoints. However a JTAG port takes between 4 and 6 pins and many microcontrollers allow you to turn it off and remap other peripherals to those pins.

Relying on the functionality hidden behind JTAG pins is a bad idea and should only be used as a last resort. Doubly so on a board that is meant for multiple purposes and is going to see a lot of software development. VESNA has several peripherals on the core board that can only be used after JTAG is turned off. Needless to say, these are a pain to develop drivers for. This means that people avoid using those parts of the hardware as far as possible and in the cases when the use is unavoidable, the software is badly tested and bugs tends to linger around for longer than for other parts of the system.

Straightforward physical accessibility is also important. On VESNA, JTAG pins are routed through the general-purpose expansion connector and we had several cases where accessing the connector was a simple geometric impossibility. It also means that all expansion boards have to be designed with the debugability of the underlying core board in mind. Having a separate connector would be a much better choice. When designing a board, think how it will be mounted and what kind of expansions will there be in the future. These should not make the debug port inaccessible.

Use a separate serial port for diagnostics and debug messages. The Cortex M3 microcontroller used on VESNA has several hardware UART peripherals, but only one is accessible on a convenient connector, which means that one serial line is used for data transfer as well as random debug messages used during development. This violates the rule of separation, which means that debug messages jumbled up together with data are not a rare occurrence.

Advice regarding accessibility of the JTAG port applies to the diagnostic port as well. VESNA's serial line is available on the connector that is also used for external power. This means that in some configurations the serial line simply can't be connected, making debugging hard. A much better design choice would be to bundle JTAG and a diagnostic serial console on the same connector and leave the power supply separate.

Actually, I have come to learn that special care should be given to accessibility of the circuit in general. It should be possible to run the basic core board in a way that makes it possible to probe most nets and microcontroller pins with an oscilloscope or a logic analyzer. The positioning of the JTAG on the expansion connector on VESNA means that in order to use a debugger you have to use a special break-out debug board which, until a later redesign, used to cover half of the core board, making a part of the circuit inaccessible to a oscilloscope probe without a soldering iron and some wires. When a bug hits that requires you to use a debugger and an oscilloscope you really don't need any additional complications.

In short, design for debugability, both software and hardware. It will make people happier.

Posted by Tomaž | Categories: Digital | Comments »

Organic display, part 2

02.06.2012 21:06

Here is the follow up to the organic LED display I was reviewing back in February and my recently manufactured Arduino shield for it.

There is not much more to tell about the hardware. I basically went with the design I described in my first post. Display controller is powered from the 3.3 V line supplied by Arduino while the display itself uses a LM2703 step-up converter connected to Arduino's 5 V supply. This micropower switcher is sufficient to power the OLED array, but the reference design proved to be a bit slow to react to fast changes in display current. For instance, with a checkerboard pattern on the display, the supply voltage will drop for almost 1 V when display scans the high-brightness areas and this causes some visible shadowing. If you look closely at the photo below, you can see that the centers of white squares are darker than the corners.

Arduino OLED shield showing a checkerboard pattern

The reference LM2703 design I followed uses a single ceramic 4.7 µF capacitor at the switcher output, so this problem could most likely be fixed by adding some larger output capacitors.

VDDH supply when displaying a checkerboard pattern

Voltage drop on the VDDH line when displaying the 45° checkerboard pattern.

From the software side, SEPS525 controller turned out to be an interesting challenge. Densitron's datasheet warns that you need to correctly program the controller (like pixel driving current and timings) before turning on the high-voltage supply or risk damaging the display. This is somewhat of a gamble, since the SPI interface only allows for one-way communication and except for turning the display on you can't know whether you actually managed to set up SPI correctly. In my case this resulted in a few mishaps where the switcher's coil loudly complained about currents that were probably well beyond display's maximum ratings. Luckily, the display appears to have survived all of them.

SEPS525 datasheet is quite long and vague in places, but once you figure out the SPI protocol and set up the configuration registers correctly the use is quite straightforward: you give it a drawing area (either full screen or a part of it) and burst the pixel data in row-major order while the controller takes care of incrementing the framebuffer pointer and line wrap. Display supports either 6/6/6 or 5/6/5 RGB color formats, however I'm not aware of any microcontroller SPI peripheral that is actually capable of transmitting 18 bit words through SPI. Arduino's certainly isn't, which means you must either sacrifice 2 bits of color depth or bit-bang the SPI protocol which is horribly slow.

Talking about speed, bursting the whole screen worth of data from ATmega328 using hardware SPI through Arduino's library takes around 100 ms. If you want to do some calculations on the pixels (like the checkerboard pattern above) this might quickly become 1000 ms, so don't expect high frame rates. Another problem is that the framebuffer takes 40960 bytes, which is too much to store even a single full-screen image in ATmega328's 32 kB flash. I cooked up a simple RLE-compressor for the example you see below, which approximately halves the size of the stored images for the price of some extra CPU cycles (drawing this particular one takes 180 ms).

Arduino OLED shield showing Preening RD by MochaDelight

Preening RD by MochaDelight

Hardware design and software is available under CC-BY-SA 3.0 and GNU GPL 3 licenses respectively. The repository below contains a SEPS525 driver and an Arduino sketch that demonstrates its use with the three push-buttons that are also on the shield:

https://www.tablix.org/~avian/git/arduino-seps525-oled.git

I have a few of these boards extra. So if you're interested in getting a bare PCB or a kit, please let me know. Be warned though that soldering the OLED connector makes for a good exercise in precision SMD soldering.

Posted by Tomaž | Categories: Digital | Comments »

Carambola

28.05.2012 18:45

Small, very cheap computers running Linux seem to be all the rage these days. Everyone is talking about Raspberry Pies and announcements for similar hardware from various vendors keep appearing in my RSS reader. FreedomBox community has a thread on their mailing list that mentions most of them.

These devices are interesting both from the software and hardware standpoint. Raspberry Pi is often seen as a modern day equivalent of home microcomputers like Sinclair Spectrum, where children can learn to program in an environment where it's hard to destroy something important or expensive. But I think the hardware perspective is also important: most of these computers come with user-accessible, simple hardware interfaces, like GPIO pins, RS-232 ports and even low-level buses like I2C and SPI. Modern desktop computers have dropped these in favor of very complicated interfaces like USB that make it disproportionally hard to attach simple home-made electronics to them. Combine this with the simplicity that a multi-tasking, network operating system like Linux can offer in terms of Internet connectivity and it's not unlikely that the next hugely-popular hardware tinkering platform after Arduino will be running on one of these simple computers.

Carambola from 8devices is a relatively unknown computer from the lower end of this class. Since it is quite similar to the TP-Link TL-WR703N I saw at 28C3 I decided to get one and try it out.

8devices Carambola on a development board, top

By the way, in Slovene language number 8 is pronounced the same as English awesome. I wonder if something similar is the case in Lithuanian and if that is the story behind this company's name

8devices Carambola on a development board, bottom

Carambola comes in an Arduino-like form factor which you can see sitting on top of the larger development board on the picture above. It consists of a Realtek RT3050 system-on-chip that includes a MIPS CPU, 802.11b/g/n radio, 2 wired 100 Mb/s Ethernet interfaces, USB controller and a few other details. Also on board, but in separate packages are 32 MB of RAM and 8 MB of flash ROM. Apart from Wi-Fi which uses an on-board chip antenna, everything is available on two lines of pins. In contrast to Arduino, Carambola uses a male connector. Unfortunately pins are 2 mm apart, which means you won't be able to simply plug it into a standard protoboard. I guess the development board is pretty much a must for any prototyping.

First of all, while this is advertised as an open hardware platform, the schematic of Carambola itself has not been disclosed (development board design is available though). I haven't yet looked deep enough to check what is the status of the software running on Carambola. There does appear to be a datasheet available for RT3050 and 8devices also has a GitHub account.

Speaking of software, my Carambola came preloaded with OpenWRT Linux (bleeding edge, r30001). Default settings configured eth1 with a static IP 192.168.1.1 while eth2 used DHCP auto-configuration. ssh was installed and running, but since there was no password set for root it would not let me in. I connected a serial terminal to the RS232 port (running at 115200 baud), where I found a shell that allowed me to set the root password. After that I could connect to Carambola through one of its LAN ports via ssh.

Here is the output of a few commands that should give you an idea of what kind of system this is:

root@OpenWrt:~# df -h
Filesystem                Size      Used Available Use% Mounted on
rootfs                    4.6M    276.0K      4.3M   6% /
/dev/root                 2.0M      2.0M         0 100% /rom
tmpfs                    14.7M     40.0K     14.7M   0% /tmp
tmpfs                   512.0K         0    512.0K   0% /dev
/dev/mtdblock5            4.6M    276.0K      4.3M   6% /overlay
overlayfs:/overlay        4.6M    276.0K      4.3M   6% /
root@OpenWrt:~# cat /proc/cpuinfo
system type		: Ralink RT3350   id:1 rev:2
machine			: CARAMBOLA
processor		: 0
cpu model		: MIPS 24Kc V4.12
BogoMIPS		: 212.58
wait instruction	: yes
microsecond timers	: yes
tlb_entries		: 32
extra interrupt vector	: yes
hardware watchpoint	: yes, count: 4, address/irw mask: [0x0000, 0x0018, 0x0c90, 0x0810]
ASEs implemented	: mips16
shadow register sets	: 1
kscratch registers	: 0
core			: 0
VCED exceptions		: not available
VCEI exceptions		: not available
root@OpenWrt:~# free
             total         used         free       shared      buffers
Mem:         30176        14236        15940            0         1760
-/+ buffers:              12476        17700
Swap:            0            0            0

The pre-loaded system looks pretty complete, so I didn't bother with building a custom firmware image yet. gpioctl tool was already installed, which meant I was able to play with GPIO ports (nicely exposed on the development board in a small prototyping area) directly from the shell.

For instance, running this:

root@OpenWrt:~# while true; do gpioctl set 1; gpioctl clear 1; done

produces a square wave with approximately 60 Hz on the first GPIO port, which also gives you a rough estimate of the performance of shell scripting for embedded programming.

In conclusion, this is an interesting piece of hardware, but for which I don't currently have a concrete application. The documentation is somewhat chaotic, with useful information appearing on the wiki, forum and 8devices web site.

I wonder how this hype with small Linux computers will turn out. It reminds me of netbooks in some ways and that didn't turn out all that well, with vendors dropping the light-weight, Linux running solid state computers when their customers realized they couldn't run full-blown Windows software on them.

Posted by Tomaž | Categories: Digital | Comments »

Reading STM32F1 real-time clock

11.04.2012 20:22

VESNA is using STM32F1 family of ARM Cortex M3 microcontrollers from ST Microelectronics. These chips have a real-time clock peripheral built-in that can be used to keep track of time and date. In VESNA it uses an external 32.768 kHz tuning-fork quartz oscillator and is running even when the CPU has been power-down to conserve power.

The clock can be used in a number of ways: it can trigger periodic (e.g. system tick) and non-periodic (e.g. alarm) interrupts or you can simply read its value when you need a timestamp in your code. The latter use might appear to be the simplest, but can be especially problematic as the peripheral stores time in no less than 4 16-bit registers spread out over 16 bytes of address space. They can not be read atomically which can lead to subtle race condition bugs where the clock appears to be wrong for duration of one tick. I recently spent quite some time debugging such a bug and would like to share my findings (for best experience, open up reference manual at chapter 18: Real-time clock).

RTC keeps time in two internal registers: the prescaler RTC_DIV counts down periods of the RTC oscillator. Once it reaches zero it is reset and the counter RTC_CNT register gets incremented. These two registers aren't directly accessible - instead each of them has two 16-bit shadow registers on a CPU-accessible bus APB1 that get periodically updated with fresh values synchronously to the CPU bus clock. These are called RTC_DIVH, RTC_DIVL, RTC_CNTH and RTC_CNTL in the documentation.

VESNA uses what is likely the most common configuration: the prescaler is set so that it wraps around each 32768 cycles, making RTC_CNT count seconds while RTC_DIV can be used to keep fractional seconds with around 30 μs precision.

There are two important things to watch out:

  • As mentioned before, you can't read the four values atomically. This means that between reading say RTC_CNTH and RTC_DIVL the values might have changed. In the best case this means you get a value off by one RTC tick. In the worst case, lower registers just overflowed into a RTC_CNTH increment and the value you read is off by 18 hours.
  • RTC_CNT only gets incremented one clock tick after RTV_DIV gets reset.

First, you might be tempted to make the four bus reads atomic by synchronizing the reads with the shadow register update. There is a RTC_CRL_RSF registers synchronized flag that gets set by hardware each time the shadow registers are updated. I have tried this by thinking that if I read the values immediately after it gets set the values won't change for another RTC clock period (which should be plenty, considering RTC runs on the order of 10 kHz and the CPU on the order of 10 MHz). This however does not work reliably for some reason - the documentation only says that this works for the first update of the register anyway. Such synchronization also slows down the clock read-out function and even makes its run time unpredictable.

The second point is actually documented in the datasheet if you look carefully at the timing diagram in the real-time clock chapter. But it is easy to overlook and I wasted more than one day thinking that observing that behavior is due to some problem in my code. It also makes detecting counter overflow somewhat more complicated.

In the end, I went with code like this:

uint16_t divl1 = RTC_DIVL;
uint16_t cnth1 = RTC_CNTH;
uint16_t cntl1 = RTC_CNTL;

uint16_t divl2 = RTC_DIVL;
uint16_t cnth2 = RTC_CNTH;
uint16_t cntl2 = RTC_CNTL;

uint16_t divl, cnth, cntl;

if(cntl1 != cntl2) {
	/* overflow occurred between reads of cntl, hence it
	 * couldn't have occurred before the first read. */
	divl = divl1;
	cnth = cnth1;
	cntl = cntl1;
} else {
	/* no overflow between reads of cntl, hence the
	 * values between the reads are correct */
	divl = divl2;
	cnth = cnth2;
	cntl = cntl2;
}

/* CNT is incremented one RTCCLK tick after the DIV counter
 * gets reset to 32767, so to correct for that increment 
 * the seconds count if DIV just got reset */
uint32_t sec = (((uint32_t)cnth) << 16 | ((uint32_t)cntl));
if(divl == 32767) sec++;

/*
 *        1000000                   15625
 * usec = ------- * (32767 - div) = ----- * (32767 - div)
 *         32768                     512
 */

uint32_t usec = 15625 * (32767 - ((uint32_t)divl)) / 512;

This code makes two assumptions: that RTC_CNTH is unused (i.e. prescaler divides the oscillator frequency by less than 65536) and that the CPU is fast enough to read the four registers in less than one increment of the counter registers. Note that the latter one can be affected by interrupt service routines, so if you have a slow CPU clock, fast running RTC and/or long-running ISRs it might be necessary to disable interrupts while reading the RTC registers.

A version of this function that would work with any prescaler setting would make a nice addition to libopencm3, but I have yet to come up with one elegant enough to warrant a patch.

Note that currently both libopencm3 with rtc_get_counter_val() and rtc_get_prescale_div_val() and STM's FWLIB with RTC_GetCounter() and RTC_GetDivider() get this wrong. Also they don't support getting both values in a consistent way. There is a discussion about this issue on STM32 forums and the solution given there is functionally identical to mine (though I don't like the potential for a goto-induced infinite loop).

Posted by Tomaž | Categories: Digital | Comments »

Clock drift

29.02.2012 19:58

When I was processing the raw measurement results from Munich experiment I noticed that absolute timestamps on spectrograms recorded by two VESNA nodes were differing randomly from zero to three seconds and required manual alignment.

The experimental setup was not expected to give very precise time reference: each VESNA has its own free running real-time clock stabilized by a 32.768 kHz quartz which was providing time relative to the start of the measurement. Absolute time on the other hand was provided by two Linux running laptops to which the sensor nodes were sending the data. So all in all the accuracy of our timestamps depended on four quartz clocks.

Nevertheless this result was surprising to me. I expected these four clocks to drift apart in the three hours of measurements, but this drift should be uniform. I can't explain what could cause the difference between clocks to randomly change between experiments.

To rule out any problems on VESNA I rigged up a simple test where I compared the difference between VESNA's real-time clock and the clock provided by the Linux kernel with a running NTP daemon. This is the result of four test runs with two nodes that we were using in Munich:

Results of a clock drift test for two VESNA nodes.

As you can see, the clocks do drift apart and do so more or less linearly (nodes were turned off for the night before each test to also see if warm-up affected the drift). Node 117 drifts at around 3 ppm and node 113 at 9 ppm. This is actually quite bad, as node 117 would gain a second each four days, even considering that the reference here was a laptop that probably isn't a shining example of accuracy either.

The only weird thing is the strange bump on the graph for node 113, where drift was positive for around 15 minutes and then turned back to negative. I could not reproduce it in any of the later runs and it might be due to NTP adjusting the laptops clock. But even that can't explain the observed deviations that are more than ten times what we see here.

While this is of course not conclusive, it does give me some confidence that it might be the PC software that was causing these problems.

Posted by Tomaž | Categories: Digital | Comments »