Monitoring serial console through JTAG

30.08.2012 18:15

I've written previously that VESNA has an unfortunately placed debug serial console. In the common configuration we have deployed in the field, the connector with the RX and TX pins also provides power to the microcontroller. This means that it's quite hard to connect a terminal emulator to it without resetting the node. While it might seem that debug messages printed on the serial console are only useful in testing and development, it's a common occurrence for deployed VESNAs to stop responding to network commands. Since checking the serial console is the first step in debugging such an issue, it's important that it's accessible even in a production system.

If you have the JTAG port accessible (and enabled) and an in-circuit debugging setup that works with GDB (for instance, a JTAG adapter driven by OpenOCD, which exposes GDB's remote protocol), you can monitor what is being printed to the serial port in the following way.

Load up the debug information for the firmware running on the microcontroller, set up GDB for remote debugging through the in-circuit debugger and enter the following commands:

(gdb) break _write
Breakpoint 1 at 0x80001c8: file hello-world.c, line 43.

This sets a breakpoint on the function that writes to the serial console. I'm using the int _write(int file, char *ptr, int len) syscall hook here, which in turn calls the low-level UART functions.
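
For reference, such a hook typically looks something along these lines. This is just a minimal sketch with a hypothetical uart_send_char() function; the actual low-level transmit routine depends on the firmware:

/* Hypothetical routine that transmits one byte over the debug UART. */
extern void uart_send_char(char c);

/* Newlib-style syscall hook: printf() and friends eventually end up
 * here, so a breakpoint on _write catches all console output. */
int _write(int file, char *ptr, int len)
{
	int i;

	for (i = 0; i < len; i++)
		uart_send_char(ptr[i]);

	return len;
}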

(gdb) commands 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>output *ptr@len
>echo \n
>continue
>end

Here we instruct GDB to skip the usual informative messages about a breakpoint being reached, print the array of characters passed in the function's arguments and continue execution without returning to the command prompt. The *ptr@len expression uses GDB's artificial array syntax to print len characters starting at the address in ptr.

(gdb) set height 0

Finally, we disable the built-in pager, so that GDB doesn't wait for confirmation on every screenful of output. Now we can continue the execution of the program, and GDB will print a line for each invocation of the _write function:

(gdb) c
Continuing.
Note: automatically using hardware breakpoints for read-only addresses.
"Hello, world!\n"
"Hello, world!\n"
"Hello, world!\n"

In a pinch this turns out to be a useful workaround when debugging with no serial console. Be warned though that breaking into GDB this way does seem to incur a significant overhead. Programs that print a lot of output will run significantly slower, although I haven't measured exactly how much. This shouldn't be a problem unless the root of the problem you are debugging is time-sensitive or you are calling printf() in interrupt handlers (but you wouldn't be doing that, would you?).

Posted by Tomaž | Categories: Code

Organic display, part 3

25.08.2012 18:17

I'm still occasionally playing with the OLED display on the Arduino. It's a fun pastime to see what you can squeeze out of the Arduino's 30 kB of ROM in terms of graphics, animation and silly games. Plus, the SEPS525 controller is in some ways similar to what old home computers from the Commodore 64 era had, offering simple hardware features like scan line mirroring and framebuffer windows.

Anyway, while tweaking the software I noticed two problems with the shield: first, a white pixel in a row full of white pixels is somewhat darker than a white pixel in a row that is only half-lit, and second, when displaying some patterns the LM2703 micro-power switcher that provides the high-voltage supply starts emitting a high-pitched sound.

Since it's usually a bad sign that you can hear a switching power supply that is designed to operate in the megahertz range, I suspected that those two issues might be connected.

After hooking up the circuit to an oscilloscope and displaying a few test patterns, the first thing to catch my attention was these spikes on the 14 V power supply provided by the switcher:

Voltage on the OLED power supply output capacitor

The upper, blue trace shows the power supply output voltage and the lower, yellow trace shows the power supply input voltage.

The picture above covers the time it takes the display to scan about 12 rows of pixels. The controller drives one row of pixels at a time and the test pattern I used here was a black screen with every fourth row completely white. You can see when the controller hits a white row, since the power consumption increases, which in turn increases the power supply's switching frequency and causes a voltage drop on the power supply input.

The weird spike happens when the driver goes from a white to a black line. Some further tests confirmed this, with the spike getting even larger if a black line follows several white ones. There is no spike on the black-to-white transition. The maximum voltage drop I saw was around 1 V, which is completely outside the tolerances for this power supply design. What is happening here?

Voltage on the OLED power supply output capacitor, enlarged

Here is a similar picture on a shorter time scale. The blue trace still shows the switcher output. Ignore the yellow trace, which shows the voltage on the switcher's coil. It looks like the OLED display draws a lot of current for a short period of time, causing a large drop in the output capacitor voltage. After it stops, the switcher works continuously to get the voltage back to the correct level, overshooting a bit at the end due to delays in the regulation loop.

I can't measure currents directly on the PCB, but since I know the capacitances and the first derivative of the voltage, I can calculate them from

I = C \frac{dU}{dt}

Given the 4.7 µF output capacitance, plus two smaller 100 nF capacitors at the display pins, these voltage ramps correspond to a peak discharge current of 1.5 A and an average charge current of 200 mA.
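
As a quick sanity check of the arithmetic: taking the nominal total capacitance of 4.7 µF + 2 × 0.1 µF ≈ 4.9 µF, the quoted currents correspond to voltage slopes of

\frac{dU}{dt} = \frac{I}{C} \approx \frac{1.5\,\mathrm{A}}{4.9\,\mathrm{\mu F}} \approx 0.31\,\mathrm{V/\mu s}

for the discharge and

\frac{dU}{dt} \approx \frac{0.2\,\mathrm{A}}{4.9\,\mathrm{\mu F}} \approx 0.04\,\mathrm{V/\mu s}

for the recharge, meaning the roughly 1 V drop is lost in a few microseconds and recovered over a few tens of microseconds.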

These results are weird. The display's datasheet lists the maximum operating current on the 14 V power supply as 32.8 mA (although they might mean the average current). Even more confusing is the second result. The switcher is designed to provide a maximum of 60 mA to the output capacitor and, given the components, it's simply impossible for it to supply more than three times its rated current.

Time to recheck the assumptions. The first suspicion fell on the output capacitor. If its true capacitance was smaller than I thought, then the calculation above would give a result that is too high. Changing the capacitor on the board for an identical one from the same reel didn't significantly change the situation. I don't have equipment to measure the capacitance directly, but by adding some other, presumably known, capacitance in parallel to the suspected output capacitor and repeating the same measurement with the oscilloscope I was able to calculate its true value.

I = C \frac{dU}{dt} = (C + C_k)\frac{dU_k}{dt}
C = \frac{C_k\frac{dU_k}{dt}}{\frac{dU}{dt} - \frac{dU_k}{dt}}
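
For what it's worth, the second equation is also trivial to evaluate in code. Here is a small sketch; the reference capacitance and the slopes read off the oscilloscope are purely illustrative numbers, not my actual measurements:

#include <stdio.h>

/* Estimate an unknown capacitance C from two discharge slope
 * measurements: du_dt with C alone and duk_dt with a known
 * capacitance ck added in parallel. Slopes are in V/s, the result
 * is in F. Follows from I = C dU/dt = (C + Ck) dUk/dt. */
static double estimate_capacitance(double ck, double du_dt, double duk_dt)
{
	return ck * duk_dt / (du_dt - duk_dt);
}

int main(void)
{
	/* Illustrative values: a 4.7 uF reference capacitor in
	 * parallel and made-up slope readings. */
	double c = estimate_capacitance(4.7e-6, 3.1e5, 7.1e4);

	printf("C = %.2f uF\n", c * 1e6);
	return 0;
}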

It turns out that this 4.7 µF chip ceramic capacitor in fact only measures around 1.4 µF (the other one from the same reel that I replaced had 1.2 µF). Using this corrected value in the first equation, the calculated switcher current (an average of around 65 mA instead of 200 mA) also matches the design calculations, which confirms that the measurement is correct. I re-checked the part number and these should indeed be 4.7 µF with a -20% tolerance. The ramps on the picture above have a time scale on the order of a few tens of microseconds and these SMD capacitors should be usable up to 100 MHz, so the frequency response shouldn't be an issue either.

Who is to blame here? A bad case of quality control or a mislabeled component reel? Since I soldered these myself by hand, I can't claim the capacitors have been through the correct temperature profiles, but this would be the first time I consistently destroyed ceramic capacitors like this by soldering. I will order equivalent capacitors from some other manufacturer and see if those fare any better.

With the corrected capacitance value, the current spikes come out at about 450 mA peak. They are probably caused by the OLED pre-charge cycle and the design of the SEPS525 controller. While I can't know exactly how the chip works, I do have some speculations on how a row- and column-driver design could cause this behavior. In any case I can't do anything about it, but a proper output capacitance should even out the power supply current and stop the audible noise. Even with these bad capacitors, I don't think the shadowing effect is caused by the power supply. The voltage regulation is otherwise within tolerances and the drops occur only on transitions to black lines, where they can't cause such visible effects. Much more likely the artifacts are caused by voltage drops in the driver IC itself, which is again outside my control.

Update: the conclusions above are most likely wrong. See my follow-up post.

Posted by Tomaž | Categories: Analog

IguanaWorks USB IR transceiver

19.08.2012 22:16

I bookmarked this little gadget a while ago. Having recently solved my problems with scriptable switching of PulseAudio audio outputs, I thought it was time to finally order it and try to automate a few other home-theater related operations through it. Over the summer a few other infrared-communication related things also piled up on my desk, so having a universal IR transmitter and receiver within reach seemed like a good idea.

IguanaWorks USB IR transceiver

This is an IguanaWorks USB IR transceiver, the hybrid version. Hybrid meaning it has both an integrated IR LED and detector pair and a 3.5 mm jack for an external transmitter.

On the software side it comes with quite an elaborate framework, free and open source of course. The software also comes in the form of Debian binary and source packages, which is a nice plus. I did have a small problem compiling them though, since the build process seems to depend on the iguanair user being present on the system. This user only gets created during installation, which makes it a kind of catch-22 situation. Once compiled, the packages did work fine on my Debian Squeeze system.

After everything is installed, you get:

  • igdaemon, a daemon that communicates with the actual USB dongle,
  • igclient, a client that exposes daemon functionality through a command-line interface,
  • a patched version of the lircd daemon that includes a driver which offloads communication to igdaemon.

lirc is the usual Linux framework for dealing with infrared remotes. It knows how to inject keypresses into the Linux input system when an IR command is received and comes with utilities that can send commands back through the IR transmitter to other devices. This is the first time I'm dealing with it and I'm still a bit confused about how it all fits together, but right now it appears that some parts of the lirc ecosystem don't work with iguanair at all. For instance, the xmode2 utility that shows received IR signals in an oscilloscope-like display isn't supported.

As I'm currently mostly interested in using this from my scripts, using igclient directly seems to be the simplest option. There are also Python bindings for the client library, but they appear to be undocumented and I haven't yet taken a dive into the source code to figure them out.

The client reports the received signals in the form of space-pulse durations, like this:

$ igclient --receiver-on --sleep 10
received 1 signal(s):
  space: 95573
received 3 signal(s):
  space: 7616
  pulse: 64
  space: 65536

I'm not yet sure what the units for those numbers are. According to the documentation the transmit functionality expects similarly formatted input, but I have yet to try it out. It seems that if I want to plot the signals on a timeline I will have to write my own utility for that.
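
Such a utility doesn't need to be much, though. A minimal sketch along these lines (keeping the durations in whatever unknown units igclient reports) would turn the output above into (time, level) pairs suitable for plotting:

#include <stdio.h>

/* Convert igclient's "pulse:/space: duration" lines on stdin into
 * (cumulative time, level) pairs on stdout. Two points are printed
 * per interval so that a line plot comes out as a square wave. */
int main(void)
{
	char line[256];
	long t = 0, d;

	while (fgets(line, sizeof(line), stdin)) {
		int level;

		if (sscanf(line, " pulse: %ld", &d) == 1)
			level = 1;
		else if (sscanf(line, " space: %ld", &d) == 1)
			level = 0;
		else
			continue;	/* skip "received N signal(s):" lines */

		printf("%ld %d\n", t, level);
		t += d;
		printf("%ld %d\n", t, level);
	}

	return 0;
}

Piping the igclient command above through this and plotting the result with gnuplot should give roughly the oscilloscope-like view that xmode2 would otherwise provide.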

To be honest, I expected using this to be simpler from the computer side. In the end it basically has the same functionality as my 433 MHz receiver. One thing I also overlooked is that it's only capable of transmitting modulated on-off keyed signals (25-125 kHz carrier), which makes it useless for devices that don't use that kind of modulation, like shutter glasses. But given that I did basically zero research before ordering it, I can't really blame anyone but myself for that (and that bookmark must have been at least a year old). Just yesterday I also stumbled upon the IR Toy, which appears to be a similar device. It would be interesting to know how it compares with the IguanaWorks one.

Posted by Tomaž | Categories: Digital

On Atmel SerialNet ZigBit modules

13.08.2012 22:27

Don't use Atmel BitCloud/SerialNet ZigBit modules.

With this important public service announcement out of the way, let me start at the beginning.

Atmel makes ZigBit modules that contain an IEEE 802.15.4-compatible integrated radio from their AT86RF2xx family and an AVR-based microcontroller on a small hybrid component. The CPU runs a proprietary mesh-networking stack (BitCloud) built on top of the ZigBee specification and exposes a high-level interface on a serial line they call SerialNet (think "send the following data to this network address"-style interface). The module can be used either as a very simple way of adding mesh networking to some host device or as a stand-alone microcontroller with a built-in radio (Atmel provides a proprietary BitCloud SDK, so you can build your own firmware for the AVR).

Atmel ZigBit module on a VESNA SNR-MOD board.

At SensorLab we built a sensor node radio board for VESNA using these modules (more specifically, the ATZB-900-B0 for 868 MHz and the ATZB-24-B0 for 2.4 GHz links), as they appeared to be simple to use and would provide a temporary solution for connecting VESNAs into a wireless mesh until we come up with a working and reliable 6LoWPAN implementation. So far we have deployed well over 50 of these in different VESNA installations.

I can now say that these modules have been nothing but trouble from the start. First there is the issue of documentation. Atmel's documentation has always been superb in my memory. Compare one of their ATmega datasheets with the vague hand-waving STMicroelectronics calls microcontroller documentation and you'll know why. Unfortunately, the SerialNet user guide is an exception to this rule. It leaves many corner cases undefined and you are left to your own experimentation to find out how the module behaves. There is almost no timing information. How long can you expect to wait for a response to a command? How long will the module be unresponsive and ignore commands after you change a setting? Even the hardware reset procedure is not described anywhere beyond a "Reset input (active low)".

The problems with this product however go deeper than that. In my experience developers, myself included, tend to be too quick to blame problems on bugs in someone else's code. When colleagues complained about how buggy these modules are, I said that it was much more likely a problem in our code or hardware design. That is, until I started investigating the numerous networking problems myself: the modules would return responses they shouldn't have according to the specification, and they would claim to be connected to the network even though no other network node could communicate with them. Modules would even occasionally corrupt themselves persistently, requiring firmware reprogramming before they would start responding to commands again. Believe me, it's annoying to reach for a JTAG connector when the module in question is on a lamp post in some other part of the country.

For most of these bugs I can only offer anecdotal evidence. However I have been investigating one important issue for around two months now and I'm confident that there is something seriously wrong with these modules. I strongly suspect there is a race condition somewhere in Atmel's (proprietary and closed-source, of course) code that causes some kind of buffer corruption when a packet is received over the radio at the same time as the module receives a command over the serial line. This will cause the module to lose bytes on the serial line, making it impossible to reliably decode the communications protocol.

For instance, this is how the communication should look over the serial line. The host in this case is VESNA and the module is an Atmel ATZB-900-B0:

→ AT+WNWK\x0d                                # host asks for network status
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a  # module asynchronously reports received data
← OK\x0d\x0a                                 # module answers that network is OK
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a  # module asynchronously reports received data

This is how it sometimes looks:

→ AT+WNWK\x0d
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
← OK\x0d                                     # note missing \x0a
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a

And sometimes it gets as bad as this:

→ AT+WNWK\x0d
← DATA 0000,0,77:(77 bytes of data)\x0d\x0a
← ODATA 0000,0,77:(77 bytes of data)\x0d\x0a # note only O from OK sent

An inviting explanation for these problems would be a bad UART implementation on VESNA. Except that this happens even when the module is connected to a computer via a serial-to-USB converter, and I have traces from a big and expensive Tektronix logic analyzer (as well as from Sigrok) to prove that the corrupted data is indeed present on the hardware serial line and is not an artifact of some bug on the host side:

Missing Line Feed character from an Atmel ZigBit module.

A logic analyzer trace demonstrating a missing line feed character.

Data corruption on the serial line from an Atmel ZigBit module.

A logic analyzer trace demonstrating a jumbled-up OK and DATA response.

I have seen this happen in the lab under controlled conditions on 10 different modules and have good reason to suspect the same thing is happening on the 50-plus deployed modules. Also, this bug is present in both BitCloud 1.14 and 1.13 and in both vanilla and security-enabled builds. All of this points to the fact that this problem is not some isolated fluke on our side.

For well over a month I have been on the line with Atmel technical support, and while they have politely answered all of my mail, they have also failed to acknowledge the issue or provide any helpful information, even though I sent them a simple test case that reliably reproduces the problem in a few seconds. Of course, without their help there is exactly zero chance of getting to the bottom of this, and given all of the above I seriously doubt this is anything other than a bug in their firmware.

At this point I have mostly given up any hope that this issue will be resolved. During my investigation I did find out that decreasing the amount of chatter on the serial line decreases the probability of errors, so I managed to work around this bug a bit by switching to non-verbose responses (ATV0) and using packets that are a few bytes shorter than the maximum (say, 75 bytes for encrypted frames). This will hopefully improve the reliability of the already deployed hardware. For the future we will be looking into alternatives, as unfortunately 6LoWPAN still seems to be somewhat out of our reach.
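
Another thing that helps on the host side is making the protocol parser itself tolerant of the corruption patterns shown above, for instance by treating a lone \x0d as a line terminator and the following \x0a as optional. This is only a sketch of the idea, not the actual parser running on VESNA:

#include <stddef.h>

#define LINE_MAX_LEN 128

/* Feed received bytes to this function one at a time; it returns a
 * complete line whenever a terminator is seen and NULL otherwise.
 * Treating \x0d alone as end-of-line means that a missing \x0a (as
 * in the traces above) doesn't shift the framing of later responses. */
const char *feed_byte(char c)
{
	static char buf[LINE_MAX_LEN + 1];
	static int len = 0;

	if (c == '\x0a')
		return NULL;	/* optional line feed, ignore */

	if (c == '\x0d') {
		buf[len] = '\0';
		len = 0;
		return buf;	/* complete line */
	}

	if (len < LINE_MAX_LEN)
		buf[len++] = c;

	return NULL;
}

This obviously can't untangle jumbled-up responses like the ODATA case above, but it at least keeps the parser from losing synchronization on the much more common missing line feed.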

Posted by Tomaž | Categories: Digital

Follow up on open-sourcing VESNA

05.08.2012 14:52

In December last year I wrote a post about VESNA and how it would be open sourced soon. However, now, almost 8 months later, I can't say that promise has been fulfilled. The hardware design still isn't public and, apart from some of my isolated experiments, there isn't currently much to show on the software side either. Considering the amount of feedback I got on that blog post, I feel I need to give a follow-up on the status of open sourcing VESNA.

During this whole endeavor I've learned a lot about the process of publicly releasing a non-trivial piece of work that has been developed behind closed doors for a significant amount of time in a large organization. On the web you constantly see discussions on how this or that company should release some piece of software under an open source license (the famous Nvidia drivers come to mind). Once you actually attempt to do something like that, even on a much smaller scale, it becomes obvious just how naive such calls are.

First of all, there are the purely legal issues. For software that you develop for in-house use and only share with some partners, you simply don't need to be that careful about what foreign code you incorporate into your product. This is especially problematic in embedded programming, where a lot of hardware manufacturers will give you documentation, libraries or examples to build upon that are attached to scary-looking non-disclosure agreements, re-distribution licenses and so on. Even if, in my opinion, many of these are not actually enforceable, you can't ignore them without a thorough review and (expensive) legal help.

A bunch of rusty gears

More specifically, almost all of the existing software for VESNA is based on the proprietary STM32 firmware library, which has a scary enough license to motivate people to make a free replacement for it. And no, I can't imagine moving everything to libopencm3 with the limited resources we have. I'm personally guilty of committing code to the same repository that said "Under no circumstances is this software to be exposed to or placed under an Open Source License of any type" in the header. When you have project deadlines looming, clean room design isn't something that you want to mention at a meeting. While I took good care to add warnings about it in the appropriate places, I can't help wondering how many similar gems we have in the code that we don't even know about.

Hardware designs themselves can have similar issues. Chip manufacturers sometimes won't let you release the schematic of a circuit that uses their products, as silly as that sounds. Even though to my knowledge only one part of VESNA might potentially have that problem (my UHF spectrum sensing receiver, of all things), these restrictions are even harder to track. The fact that open hardware licenses are themselves in their infancy makes it even harder to reach any solid decision regarding hardware designs without dedicating an unreasonable amount of time to it.

Then there are also cultural problems. Some of the projects here at the Institute have built upon open-source software, like Contiki. But that was done in isolation from the upstream project. This means that after years of separate development the upstream has moved on and the private fork has accumulated a large number of changes. This is a nightmare if you want to contribute changes back to the community. While I may argue that it's reasonable to expect small drive-by contributions to be accepted upstream, I certainly can't recommend stopping by a mailing list and dropping a 50,000-line patch on the unsuspecting developers.

So, while starting off an open-source project from zero might be relatively effortless, open-sourcing something requires a lot of dedication and time from everyone working on the project. Unfortunately these are hard to justify when you have ongoing projects with their own deadlines and demands for attention. You need a well thought-out strategy and at least some solid ideas on what benefits an open development process will actually bring. Unfortunately the latter is not as straightforward as it might seem in the case of VESNA and SensorLab's academic setting.

While all of the above doesn't mean that I have given up on a free and open source VESNA, I hope it does explain why I was overly optimistic in my initial writings. I still invite you to cautiously stop by SensorLab's GitHub account and web site every now and then, in case we manage another step in that direction.

Posted by Tomaž | Categories: Life