Saving power on an ESP8266 web server using delays

13.08.2022 17:19

ESP8266 core support on Arduino comes with a library that allows you to quickly set up and run a web server from your device. Code using the ESP8266WebServer library commonly looks something like the following:

#include <ESP8266WebServer.h>

ESP8266WebServer server(80);

void setup(void)
{
	... various wi-fi and request handler setup calls here ...
}

void loop(void)
{
	server.handleClient();
}

For a more concrete example, take a look at the "Hello World" example that is included with the library.
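Filled out along those lines, a minimal sketch might look something like this. It is only an illustration: the network credentials and the request handler body are placeholders of my own, not part of the bundled example:

#include <ESP8266WiFi.h>
#include <ESP8266WebServer.h>

ESP8266WebServer server(80);

void setup(void)
{
	// Connect to the Wi-Fi network. The credentials are placeholders.
	WiFi.mode(WIFI_STA);
	WiFi.begin("my-ssid", "my-password");
	while (WiFi.status() != WL_CONNECTED) {
		delay(500);
	}

	// Register a handler for the root URL and start listening.
	server.on("/", []() {
		server.send(200, "text/plain", "hello from esp8266!");
	});
	server.begin();
}

void loop(void)
{
	server.handleClient();
}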

In the following we'll focus on the loop() and ignore the rest. As the name implies, the Arduino framework calls the loop() function in an infinite loop. The handleClient() method checks if any client has connected over the Wi-Fi network and issued an HTTP request. If so, it handles the request and sends a response back to the client. If not, it exits immediately, doing nothing.

In other words, loop() implements a busy-wait for HTTP requests. Even when there is nothing to do, the CPU still runs the busy loop roughly 100000 times per second. In the common case where a page is only requested from the server every once in a while, the system will spend almost all of its time in this state.

Busy loops are common on microcontrollers. On an ATmega microcontroller, another popular target for Arduino code, this is hardly a problem. Unless you're running your hardware off a battery and really counting every microwatt, a busy CPU uses negligible extra power over a CPU that's sleeping. The ESP8266 is a bit more powerful than an 8-bit microcontroller though and correspondingly consumes more power when not sleeping. Hence it makes sense to be smarter about when the CPU runs and when it sleeps.

The best solution would be to not have a busy loop at all. Ideally we could use the underlying RTOS to only schedule a handleClient() task in response to network events. The RTOS is smart enough to put the CPU to sleep when no task needs to run. Unfortunately, the simple Arduino environment does not want us to mess with the RTOS. The ESP8266WebServer library certainly isn't written with such use in mind. This approach would require a lot of refactoring of ESP8266WebServer and the code that uses it.

A simpler way is to slow the busy loop down. There is no need to check for client connections hundreds of thousands of times per second. A more modest rate would be good enough and the CPU can sleep in between. This decreases the power consumption at the cost of increased response time for HTTP requests. Inserting the Arduino-provided delay() function into loop() does exactly what we want:

void loop(void)
{
	server.handleClient();
	delay(...);
}

The question that remains is what value to pass to delay() in this case. What value gives the best trade-off between decreased power consumption and increased response time?

My setup for measuring ESP8266 power consumption.

To find out, I've used this somewhat messy breadboard setup. I've measured the power consumption and response times when running a simple web server Arduino sketch, similar to the "Hello World" included with the ESP8266WebServer library, but with various delays inserted into the loop.

A multimeter with an averaging function measured the current on the 3.3 V supply line of an ESP-01 module. The averaging time was 30 s. I've measured the power consumption while the module was idle (not answering any HTTP requests).

The module was connected through a USB development board to a PC for easy programming from the Arduino IDE. I've used Arduino 1.8.19 and the ESP8266 core 2.4.2.

I've measured HTTP server response times over the Wi-Fi network using Apache Bench (ab -n10). As the representative metric I've chosen the median time from the Waiting row. This is the time it took the ESP8266 to send the first byte of the response.
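For reference, the full invocation was along these lines, with the module's IP address as a placeholder:

$ ab -n10 http://192.168.1.100/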

Supply current versus various delays in loop()

Response time versus various delays in loop()

Here are the results, comparing code with just a call to handleClient() to code that also calls delay() with various argument values. The argument to delay() is the wait time in milliseconds. I also did a measurement with yield() instead of delay(), something I've seen used in some examples.

Even a 1 millisecond delay, the shortest non-zero wait delay() provides, decreases the idle power consumption by about 70%. This is not surprising if you consider that with one loop iteration taking around 10 μs, even this small delay leaves the CPU idle around 99% of the time. The added delay also measurably increases the response time, as expected, but the difference between a 6 ms and an 8 ms response should be negligible for most practical use cases.

Further increasing the delay does not bring any benefits. It only increases the response time without measurably decreasing power consumption. I suspect that the variations in measured current at higher delay values are due to uncontrolled traffic on the network waking up the ESP8266's radio more often in some test runs than in others.

Adding a yield() call to loop() does not show any benefits in this test. In fact, handleClient() itself calls yield() in some cases before returning. Similarly, adding delay(0) has no measurable effect.

In conclusion, I recommend using a loop() with 1 ms delay:

void loop(void)
{
	server.handleClient();
	delay(1);
}

The added delay decreases the idle power consumption at 3.3 V by about 70%, from around 230 mW to around 70 mW. This is significant enough that you can feel the board running cooler to the touch. On a yearly basis it saves around 1.4 kWh per device that's powered on continuously.

For most simple Arduino sketches using ESP8266WebServer, like using the request handler to read a sensor or actuate a relay, this is just a simple one-line change, so I think the power saving is worth the effort. Of course, if you're doing something else in loop() in addition to calling handleClient(), adding a delay might have other side effects. In that case the code running in loop() might need some adjusting to account for the delay, as in the sketch below.
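For instance, the usual millis()-based pattern keeps periodic work independent of the inserted delay. This is only a sketch - read_sensor() and the one second interval are hypothetical:

unsigned long last_run = 0;

void loop(void)
{
	server.handleClient();

	// Hypothetical periodic task that should run every second,
	// regardless of the delay below.
	if (millis() - last_run >= 1000) {
		last_run = millis();
		read_sensor();
	}

	delay(1);
}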

Posted by Tomaž | Categories: Digital | Comments »

Wio RP2040 review

12.08.2021 7:44

I've been following the development of the ecosystem around the new RP2040 microcontroller from the Raspberry Pi Foundation. I've found the microcontroller interesting in combination with MicroPython, since it appeared suitable for development in a high-level language while still offering reasonably good real-time performance. For the common low-level bit banging microcontroller stuff I'm not sure if Python beats C/C++. However, as soon as any kind of networking is involved, I think using a high-level language is significantly easier. With all sorts of necessary error handling and multi-tasking, networking code quickly becomes unreadable in C.

Hence I've been curiously waiting for RP2040 development boards to appear that would integrate some kind of a network interface. The two products that were most prominent on my radar were the Arduino Nano RP2040 Connect and the Seeed Wio RP2040. Both were announced earlier this year, but were more or less unobtainable. In June, however, Seeed reached out to me and offered to send me a free sample of their Wio RP2040 development board in return for a review. Two months later I've finally got one on my desk.

Seeed Wio RP2040 mini dev board on top of its box.

Wio RP2040 itself is a surface-mount module with castellated holes suitable for machine mounting on custom PCBs. It contains just the RP2040 microcontroller, the radio and an integrated inverted-F antenna for connecting to a Wi-Fi network. Voltage supply for the module is 5 V, the GPIO pins use 3.3 V levels. There is also a 3.3 V regulator output pin available on the module, however I could not find any information on how much current you can safely draw for your own use.

To make development easier, Seeed also sells the module already mounted on the Mini Dev Board, which is what I got in the box you see above. The Mini Dev Board adds two LEDs, two buttons, a USB C connector and breaks out all the module pins to two 14 pin 100 mil headers. Schematic and PCB layout files are available for the Mini Dev Board, but not for the module itself.

Also worth noting is the declaration of an EU representative on the box. This is most likely related to the requirements of the European Radio Equipment Directive.

Wio RP2040 compared to Raspberry Pi Pico.

Compared to Raspberry Pi Pico, Wio RP2040 Mini Dev Board is slightly shorter and wider. It has a USB C connector instead of USB Micro for programming and getting power from a PC. In addition to the Pico's BOOT button, the Mini Dev Board also has a RUN button for manually resetting the microcontroller. There's also a power LED that is hard-wired to the power supply line.

Same as on Pico, the Mini Dev Board has space for 100 mil headers on its edge. The headers themselves are not included in the box, so if you want to mount this on a breadboard you need to supply and solder them yourself. GPIO 20, 21 and 22 are not available on the Wio RP2040 headers. They are probably used for communication with the wireless chip inside the module.

Curiously, I found zero information on which 2.4 GHz 802.11 b/g/n radio is used in the module. I have yet to peek under the RF shield can, but I strongly suspect it hides an ESP8285 from Espressif. The ESP8285 is a variant of the popular ESP8266 with built-in flash memory. This guess comes from the fact that the host name the module uses when obtaining an IP address from a DHCP server is espressif. The official firmware image also has a number of strings that mention ESP:

$ strings firmware.uf2|grep -i esp
esp8285
o rp2040] %s | esp8285_ipconfig could'n get ip
[wio rp2040] %s | esp8285_ipconfig could'n get gateway
[wio rp2040] %s | esp8285_ipconfig could'n get netmask
[wio rp2040] %s | esp8285_config could'n get ip
couldn't init nic esp8285 ,try again please
esp8285 power off

Speaking of firmware, Seeed provides a firmware.uf2 file that contains a customized MicroPython interpreter with some added modules related to networking. Unfortunately, it's not clear at the moment what source was used for building this file. Another problem is that the file linked from the Wiki seems to change silently: since July I've seen at least two files being distributed with the same name and URL but different contents.

The procedure for loading the firmware is the same as with the Pico. Power up the module with the BOOT button depressed and then copy the firmware image into the emulated USB storage device. Using rshell, this is how the module presents itself, running the firmware.uf2 downloaded on August 5:

$ rshell
Connecting to /dev/ttyACM0 (buffer-size 512)...
Trying to connect to REPL  connected
Testing if sys.stdin.buffer exists ... Y
Retrieving root directories ... 
Setting time ... Aug 05, 2021 07:43:39
Evaluating board_name ... pyboard
Retrieving time epoch ... Jan 01, 1970
Welcome to rshell. Use Control-D (or the exit command) to exit rshell.
> ls /pyboard
> repl
Entering REPL. Use Control-X to exit.
>
MicroPython v1.15 on 2021-07-06; Seeed Wio with RP2040
Type "help()" for more information.
>>> help('modules')
__main__          machine           uasyncio/funcs    urandom
_boot             math              uasyncio/lock     ure
_onewire          micropython       uasyncio/stream   uselect
_rp2              mqtt              ubinascii         usocket
_thread           network           ucollections      ustruct
_uasyncio         onewire           uctypes           usys
builtins          rp2               uerrno            utime
cmath             uarray            uhashlib          uzlib
ds18x20           uasyncio/__init__ uio
framebuf          uasyncio/core     ujson
gc                uasyncio/event    uos

The embedded flash filesystem is empty by default, however there are some extra importable modules available in the interpreter: network, mqtt and a few others. Again, unfortunately there is very little information on these, apart from a few examples in the Wiki. No source is available as far as I can tell either. The mqtt module seems similar to the umqtt.simple module described here, with some differences - there is no check_msg() method, for example.
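For reference, this is roughly how the upstream umqtt.simple module is used. Whether Seeed's bundled mqtt module accepts exactly the same calls is an assumption on my part, and the broker address and topic below are placeholders:

from umqtt.simple import MQTTClient

def on_message(topic, msg):
    # Called from wait_msg() when a message arrives on a subscribed topic.
    print(topic, msg)

c = MQTTClient("wio-rp2040", "broker.example.com")
c.set_callback(on_message)
c.connect()
c.subscribe(b"test/topic")
c.publish(b"test/topic", b"hello")
c.wait_msg()  # blocks; the bundled module apparently lacks the non-blocking check_msg()
c.disconnect()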

I didn't have much luck with using these networking Python modules. Some examples in the wiki are apparently outdated and I didn't manage to get any of them to a usable state.

Specifically, the firmware I was using seemed to have problems receiving data from the network. I could connect to the Wi-Fi network and successfully open a usocket to another host. Sending data using usocket.send() worked. However as soon as the socket received anything from the other end, the MicroPython interpreter would apparently crash and I could never get anything back using usocket.recv(). The program stopped running and the REPL would not respond. I couldn't connect to the board over USB anymore until I reset the processor using the RUN button.

I had similar problems with Seeed's MQTT example code. After fixing it to account for the fact that WLAN_UART class is not defined, Wio RP2040 connects to my MQTT broker. I can successfully publish messages and subscribe to topics from MicroPython. However as soon as some other client sends a message to the topic that the Wio RP2040 is subscribed to, the interpreter crashes. There's definitely something still alive running on the MCU because the broker keeps getting periodic MQTT pings from Wio RP2040. The Python code doesn't seem to be executing though and neither Thonny nor rshell will connect to it.

I tried to find the problem, but without the source and any kind of debug info I was pretty much stuck. I also asked my Seeed contact about it and after a week I have yet to receive a reply.

Update: On 17 August I received a MicroPython firmware image from Seeed that fixes the interpreter crashes related to the networking I describe above. They say that they will fix the image linked from the wiki at a later date.

Screenshot of Thonny with Wio RP2040 MQTT example.

It's obviously very early in the product cycle. I actually don't know if these modules have shipped in any quantity so far. Each time I check, they are out of stock and the banner on Seeed website currently says they will start shipping in September. Still, I was disappointed to see that networking, the main feature of this module, doesn't seem to be functional at the moment. It seems Seeed's customized MicroPython port still needs some work. There's also support for programming the module in C/C++ using Arduino IDE. I have not tried that, but it seems other people are not having much success with that either.

Apart from fixing the software, I hope Seeed also adds some more documentation in the future. Having examples is great, but the custom Python modules should come with a reference. If the firmware image is open source, instructions on building one would be welcome as well. I'm also missing a proper hardware datasheet with some electrical specifications for the module.

The problems I encountered are even more puzzling since Wio RP2040 seems to be focused more on being a base for a product than a development board for one-off projects. Its bare-bones design doesn't include any extra sensors that Arduino is shipping on their RP2040 boards. This makes it less inviting for playing around compared to the kitchen-sink-included approach of the Arduino. On the other hand, that's obviously a feature when you're designing a custom board with only the peripherals you need. Seeed is also running a promotional campaign and gives you some free modules when using their assembly services.

Another thing worth noting is that with recently introduced EU import regulations, getting these modules shipped in small quantities from China is quite troublesome and expensive. Even when receiving this free sample I had to deal with import customs paperwork and pay approximately 20 EUR in VAT and processing fees. Add shipping costs and the 13 USD base price shown in the Seeed store effectively becomes around 50 EUR. On the other hand, I have noticed that the modules are listed on Mouser, so this might improve in the future.

In summary, this module promises to be a cheap and simple basis for small network-connected sensors and actuators. I like the simplicity of connecting to MQTT using a few lines of Python. Unfortunately, the current software does not deliver on that promise and I can only recommend waiting until the quality improves.

Posted by Tomaž | Categories: Digital | Comments »

Measuring interrupt response times, part 3

01.05.2021 18:17

Around five years ago I performed some measurements of interrupt response times in a Raspberry Pi Zero and an Arduino. My goal was to get some rough estimates of what kind of real-time performance you can expect from these systems. I was not interested in pushing them to their limits. I wanted to compare the most straightforward approaches - code you would find in documentation or in examples that pop up on top of web searches. This year the Raspberry Pi Pico was released and it promises to become just as popular. It brings some interesting new features that I wanted to explore, like MicroPython and the programmable I/O (PIO). I thought it would be interesting to repeat my old measurements and see how well it compares to the other two systems.

I'll only briefly summarize my previous results here. Read my original blog post for a longer introduction, a description of the test setup and a more in-depth discussion of the first batch of measurements. In the follow-up post I also dug a little deeper into the reasons behind some of the more unusual results I got with the Arduino and the Raspberry Pi Zero.

Raspberry Pi Pico connected to the test setup.

For the purpose of this test, the interrupt response time is the time the system takes to change a state of an output GPIO pin in response to the change in an input GPIO pin. In real applications there is usually some kind of processing involved, so this value represents only the best-case scenario of how fast the software can respond to external events.

This response time was measured using a signal generator and an oscilloscope. A square wave generated by the signal generator was connected to the input pin. The two-channel oscilloscope was connected to both the input pin and the output pin. It was set up to measure the interval between the two state changes. The measurement was automated and repeated 500 times for each setup. The exact settings used are noted here.

To perform the test with the RP2040 processor on the Raspberry Pi Pico I installed a MicroPython firmware, as described in the Getting Started guide. I tried two implementations: a pure Python implementation used the machine.Pin built-in class to configure a Python function as an interrupt handler. The PIO implementation used the rp2.asm_pio decorator to program the PIO state machine from Python code (see Section 3.9 in the Python SDK manual). After the state machine was programmed, the input was handled purely inside the PIO and the Python interpreter was not involved. You can find the exact code I used in the GitHub repo.
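To give an idea of the two approaches, here is a minimal sketch of both. The pin numbers are arbitrary, in a real test you would of course use only one approach at a time, and the exact code is in the repo mentioned above:

import rp2
from machine import Pin

out_pin = Pin(3, Pin.OUT)
in_pin = Pin(2, Pin.IN)

# Pure Python: mirror the input onto the output from an interrupt handler.
def handler(pin):
    out_pin.value(pin.value())

in_pin.irq(handler=handler, trigger=Pin.IRQ_RISING | Pin.IRQ_FALLING)

# PIO: the state machine follows the input pin with no CPU involvement.
@rp2.asm_pio(set_init=rp2.PIO.OUT_LOW)
def follow():
    wrap_target()
    wait(1, pin, 0)  # stall until the input pin goes high
    set(pins, 1)     # drive the output high
    wait(0, pin, 0)  # stall until the input pin goes low
    set(pins, 0)     # drive the output low
    wrap()

sm = rp2.StateMachine(0, follow, in_base=in_pin, set_base=out_pin)
sm.active(1)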

Here is how the new measurements with the RP2040 compare with Arduino and the Raspberry Pi Zero:

Histogram of interrupt response time measurements.

The MicroPython implementation on the RP2040 (yellow) has an average response time of around 60 μs. This is around 3.5 times faster than the CPython implementation on the Zero (cyan), which averages around 210 μs. It is also more consistent, with less spread between minimum and maximum response times. A surprising result at first glance, since the Zero has a much more capable CPU running at up to 1000 MHz, while the ARM core in the Pico only runs at 125 MHz.

The difference is very likely due to all the Linux kernel housekeeping and context switching that happens before the interrupt is propagated from the hardware to the Python process. MicroPython, while quite complex, is still a lightweight interpreter compared to the full CPython on the Zero. This is consistent with the fact that a C implementation that runs in the kernel on the Zero (blue) is much faster than MicroPython on the RP2040.

The following figure zooms in on the left end of the histogram:

Zoomed view of the left end of the response time histogram.

Here you can see that the PIO implementation is amazingly fast compared to all previously tested configurations. With the average response time of 0.043 μs it beats both the polling and the interrupt-driven C++ implementation on the Arduino by two orders of magnitude.

This comparison is a bit unfair though. The specialized PIO state machines on the RP2040 are indeed very fast, with only 8 ns per instruction and an instruction set that is optimized for responding to input events. However, the amount of processing you can do with them is very limited compared to all the other approaches I've tested. Each PIO can only hold 32 instructions. Most real-life applications beyond interfacing with a simple bus protocol will need a round-trip to MicroPython. This puts the response time back into the hundred-microsecond range.

Still, investigating PIO performance is interesting. Here is another level of zoom to show only the distribution of response times by the PIO implementation:

Histogram of response times for the RP2040 PIO implementation.

The response times should be in the range of 4 to 5 instruction cycles - 2 cycles for the input synchronizer (see 3.5.6.3 in the RP2040 Datasheet), between 1 and 2 cycles for WAIT and 1 cycle for SET. I did not use any clock dividers and used the default 125 MHz system clock, so each instruction takes 8 ns. This gives a range of response times between 32 and 40 ns.

I measured between 38 and 48 ns. Very likely this is a measurement error. Unfortunately my signal generator has a rise-time of around 10 ns. This means that in the nanosecond range the transition between low and high logic level is not well defined and this introduces an error into my measurement. I verified by other means that one PIO instruction indeed takes exactly 8 ns in my setup. It is also possible that I missed something and there is an additional PIO cycle (or two) needed somewhere before the response propagates to the GPIO pin.

On the oscilloscope screenshot below, the blue trace is the stimulus signal from the signal generator and the yellow trace is the response generated by the PIO on the output pin. You can see that the rise times are not insignificant compared to the measured time interval.

Signals on the input and output pins on the RP2040.

In the end this was an interesting exercise. I was surprised by the performance of MicroPython on the Raspberry Pi Pico and how quick the development setup is. I honestly expected Python code to run slower and I was again reminded that my intuition can be wrong sometimes. Unfortunately I didn't have time to set up the C SDK to also try out a native implementation of the same test on the RP2040. Perhaps some other day.

Programmable I/O is certainly the most interesting part of the RP2040. It took me a while to understand the unusual instruction set and how the FIFO buffers work. I like how the integration of the assembler into MicroPython makes it easily accessible for experimentation. I was impressed by the performance and quick response times. On the other hand, I was also surprised by how limited PIOs are in terms of program size and the choice of instructions. I was expecting something similar to the PRUs on the Sitara SoC. PIOs indeed seem to be very specialized devices for interfacing with digital buses and can't do much more in terms of algorithmic complexity.

Posted by Tomaž | Categories: Digital | Comments »

Another SD card postmortem

16.05.2020 11:28

I was recently restoring a Raspberry Pi at work that was running a Raspbian system off a SanDisk Ultra 8 GB micro SD card. It was powered on continuously and managed to survive almost exactly 6 months since I last set it up. I don't know when this SD card first started showing problems, but when the problem became apparent I couldn't log in and Linux didn't even boot up anymore after a power cycle.

SanDisk Ultra 8 GB micro SD card.

I had a working backup of the system, however I was curious how well ddrescue would be able to recover the contents of the failed card. To my surprise, it did quite well, restoring 99.9% of the data after about 30 hours of run time. I only ran the copy and trim phases (--no-scrape). Approximately 8 MB out of 8 GB of data remained unrecovered.

This was enough that fsck was able to recover the filesystem to a state where it could be mounted. Another interesting thing in the recovered data was the write statistic that is kept in the ext4 superblock. The system only had one partition on the SD card:

$ dumpe2fs /dev/mapper/loop0p2 | grep Lifetime
dumpe2fs 1.43.4 (31-Jan-2017)
Lifetime writes:          823 GB

On one hand, 823 GB of writes after 6 months was more than I was expecting. The system was set up in a way to avoid a lot of writes to the SD card and had a network mount where most of the heavy work was supposed to be done. It did have a running Munin master though and I suspect that's where most of these writes came from.

On the other hand, 823 GB on an 8 GB card is only about 100 write cycles per cell, if the card is any good at wear leveling. That's awfully low.

In addition to a raw data file, ddrescue also creates a log of which parts of the device failed. Very likely a controller in the SD card itself is doing a lot of remapping. Hence a logical address visible from Linux has little to do with where the bits are physically stored in silicon. So regardless of what the log says, it's impossible to say whether errors are related to one failed physical area on a flash chip, or if they are individual bit errors spread out over the entire device. Still, I think it's interesting to look at this visualization:

Visualization of the ddrescue map file.

This image shows the distribution of unreadable sectors reported by ddrescue over the address space of the SD card. The address space has been sliced into 4 MB chunks (8192 blocks of 512 bytes). These slices are stacked horizontally, hence address 0 is on the bottom left and increases up and right in a saw-tooth fashion. The highest address is on the top right. Color shows the percentage of unreadable blocks in that region.
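The numbers behind the image are simple to extract. Here is a sketch of the aggregation, assuming the standard ddrescue map file format, where each data line gives the position, size and status of an area and '-' marks a bad area (the file name is a placeholder):

SLICE = 4 * 1024 * 1024  # 4 MB slices, 8192 blocks of 512 bytes

bad = {}
with open("sdcard.map") as f:
    lines = [l for l in f if not l.startswith("#")]

# The first non-comment line only gives the current position and status
# of the rescue run itself; the following lines describe the device.
for line in lines[1:]:
    pos, size, status = line.split()[:3]
    if status == "-":
        pos, size = int(pos, 0), int(size, 0)
        # Attribute the bad area, block by block, to its 4 MB slice.
        for addr in range(pos, pos + size, 512):
            n = addr // SLICE
            bad[n] = bad.get(n, 0) + 512

for n in sorted(bad):
    print("slice %5d: %5.2f%% unreadable" % (n, 100.0 * bad[n] / SLICE))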

You can see that small errors are more or less randomly distributed over the entire address space. Keep in mind that summed up, unrecoverable blocks only cover 0.10% of the space, so this image exaggerates them. There are a few hot spots though, and one 4 MB slice in particular at around 4.5 GB contains a lot more errors than the other regions. Some horizontal patterns can also be seen - the upper half of the image appears more error-free than the bottom part. I've chosen 4 MB slices exactly because of that. While the internal memory organization is a complete black box, it does appear that 4 MB blocks play some role in it.

Just for comparison, here is the same data plotted using a space-filling curve. The black area on the top-left is part of the graph not covered by the SD card address space (the curve covers 2^24 = 16777216 blocks of 512 bytes while the card only stores 15523840 blocks or 7948206080 bytes). This visualization better shows grouping of errors, but hides the fact that 4 MB chunks seem to play some role:

Visualization of the ddrescue map file using a Hilbert curve.

I also took a quick look at whether failures could be predicted by something like SMART. Even though it appears that some cards do support it, none I tried produced any useful data with smartctl. Interestingly, plugging the SanDisk Ultra into an external USB-connected reader on a laptop does say that the device has a SMART capability:

$ smartctl -d scsi -a /dev/sdb
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-12-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Generic
Product:              STORAGE DEVICE
Revision:             1206
Compliance:           SPC-4
User Capacity:        7 948 206 080 bytes [7,94 GB]
Logical block size:   512 bytes
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Serial number:        000000001206
Device type:          disk
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Local Time is:        Thu May 14 16:36:47 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C

Error Counter logging not supported

scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Device does not support Self Test logging

However I suspect this response comes from the reader, not the SD card. Multiple cards I tried produced the same 1206 serial number. Both a new and a failed card had the "Health Status: OK" line, so that's misleading as well.

This is the second time I've replaced the SD card in this Raspberry Pi. The first time it lasted around a year and a half. This further justifies my opinion that SD cards just aren't suitable for unattended systems or those running continuously. In fact, I suggest avoiding them if at all possible. For example, newer Raspberry Pis support booting from USB-attached storage.

Update: For further reading on SD card internals, Peter mentions an interesting post in his comment below. The flashbench tool appears to be still available here. The Linaro flash survey seems to have been deleted from their Wiki. A copy is available from archive.org.

Posted by Tomaž | Categories: Digital | Comments »

Notes on the general-purpose clock on BCM2835

17.02.2018 20:30

Raspberry Pi boards, or more specifically the Broadcom system-on-chips they are based upon, have the capability to generate a wide range of stable clock signals entirely in hardware. These are called general-purpose clock peripherals (GPCLK) and the clock signals they generate can be routed to some of the GPIO pins as alternate pin functions. I was recently interested in this particular detail of Raspberry Pi and noticed that there is little publicly accessible information about this functionality. I had to distill a coherent picture from various, sometimes conflicting, sources of information floating around various websites and forums. So, in the hope that it will be useful for someone else, I'm posting my notes on Raspberry Pi GPCLKs with links for anyone that needs to dig deeper. My research was focused on Raspberry Pi Compute Module 3, but it should mostly apply to all Raspberry Pi boards.

Here is an example of a clock setup that uses two GPCLK peripherals to produce two clocks that drive components external to the BCM2835 system-on-chip (in my case a 12.288 MHz clock for an external audio interface and a 24 MHz clock for a USB hub). This diagram is often called the clock tree and is a common sight in datasheets. Unfortunately, the publicly-accessible BCM2835 datasheet omits it, so I drew my own from what information I could gather on the web. Only the components relevant to clocking the GPCLKs are shown:

Rough sketch of a clock tree for BCM2835.

The root of the tree is an oscillator on the far left of the diagram. BCM2835 derives all other internal clocks from it by multiplying or dividing its frequency. On a Compute Module 3 the oscillator clock is a 19.2 MHz signal defined by the on-board crystal resonator. The oscillator frequency is fixed and cannot be changed in software.

19.2 MHz crystal resonator on a Compute Module 3.

The oscillator is routed to a number of phase-locked loop (PLL) devices. A PLL is a complex device that allows you to multiply the frequency of a clock by a configurable, rational factor. Practical PLLs necessarily add some amount of jitter into the clock. How much depends on their internal design and is largely independent of the multiplication factor. For BCM2835 some figures can be found in the Compute Module datasheet, under the section Electrical specification. You can see that routing the clock through a PLL increases the jitter by 28 ps.

GPCLK jitter characteristics from RPI CM datasheet.

Image by Raspberry Pi (Trading) Ltd.

The BCM2835 contains 5 independent PLLs - PLLA, PLLB, PLLC, PLLD and PLLH. The system uses most of these for its own purposes, such as clocking the ARM CPU, the VideoCore GPU, the HDMI interface, etc. The PLLs have some default configuration, which however cannot be strongly relied upon. Some default frequencies are listed on this page. Note that it says that PLLC settings depend on overclocking settings. My own experiments show that PLLH settings change between firmware versions and depending on whether a monitor is attached to HDMI or not. On some Raspberry Pi boards, other PLLs are used to clock on-board peripherals like Ethernet or Wi-Fi - search for gp_clk in dt-blob.dts. On the Compute Module, PLLA appears to be turned off by default and free for general-purpose use. This kernel commit suggests that PLLD settings are also stable.

For each PLL, the multiplication factors can be set independently in software using registers in I/O memory. To my knowledge these registers are not publicly documented, but the clk-bcm2835.c file in the Linux kernel offers some insight into what settings are available. The output of each PLL branches off into several channels. Each channel has a small integer divider that can be used to lower the frequency. It is best to leave the settings of the PLL and channel dividers to the firmware by using the vco@PLLA and chan@APER sections in dt-blob.bin. This is described in the Raspberry Pi documentation.

There are three available GPCLK peripherals: GPCLK0, GPCLK1 and GPCLK2. For each you can independently choose a clock source. 5 clock sources are available: the oscillator clock and 4 channels from 4 PLLs (PLLB isn't selectable). Furthermore, each GPCLK peripheral has an independent fractional divider. This divider can again divide the frequency of the selected clock source by an (almost) arbitrary rational number.

Things are somewhat better documented at this stage. The GPCLK clock source and fractional divider are controlled from I/O memory registers that are described in the BCM2835 ARM peripherals document. Note that there is an error in the equation for the average output frequency in Table 6-32. It should be:

f_{GPCLK} = \frac{f_{source}}{\mathrm{DIVI} + \frac{\mathrm{DIVF}}{4096}}

It is perfectly possible to set up a GPCLK by writing directly into registers from Linux user space, for example by mmaping /dev/mem or using a command-line tool like busybox devmem. This way you can hand-tune the integer and fractional parts of the dividers or use the noise-shaping functions. You might want to do this if jitter is important. When experimenting, I find the easiest way to get the register base address is to search the kernel log. In the following case, the CM_GP0CTL register would be at 0x3f101070:

# dmesg|grep gpiomem
gpiomem-bcm2835 3f200000.gpiomem: Initialised: Registers at 0x3f200000
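With the 0x3f000000 peripheral base this output implies, the GPCLK0 control and divider registers can then be read like this (a sketch, assuming busybox is installed; note that writes to clock manager registers additionally require the 0x5A clock manager password in the top byte of the written value):

# busybox devmem 0x3f101070
# busybox devmem 0x3f101074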

A simpler way for taking care of GPCLK settings is again through the dt-blob.bin file. By using the clock@GPCLK0 directives under clock_routing and clock_setup, the necessary register values are calculated and set automatically at boot by the firmware. As far as I can see, these only allow using PLLA and APER. Attempting to set or use other PLL sources has unpredictable results in my experience. I also recommend checking the actual clock output with an oscilloscope, since these automatic settings might not be optimal.

The settings shown on the clock tree diagram on the top were obtained with the following part compiled into the dt-blob.bin:

clock_routing {
	vco@PLLA {
		freq = <1920000000>;
	};
	chan@APER {
		div = <4>;
	};
	clock@GPCLK0 {
		pll = "PLLA";
		chan = "APER";
	};
	clock@GPCLK2 {
		pll = "PLLA";
		chan = "APER";
	};
};

clock_setup {
	clock@GPCLK0 {
		freq = <24000000>;
	};
	clock@GPCLK2 {
		freq = <12288000>;
	};
};
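Working through the clock tree with these settings: the 19.2 MHz oscillator is multiplied by 100 in PLLA to get the 1.92 GHz VCO frequency, the APER channel divides that by 4, giving a 480 MHz source for both GPCLKs, and the fractional dividers then work out to:

f_{GPCLK0} = \frac{480\,\mathrm{MHz}}{20} = 24\,\mathrm{MHz}

f_{GPCLK2} = \frac{480\,\mathrm{MHz}}{39 + \frac{256}{4096}} = 12.288\,\mathrm{MHz}

GPCLK0 gets away with an integer divider (DIVI = 20, DIVF = 0), while GPCLK2 needs the fractional part (DIVI = 39, DIVF = 256).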

If the clocks can be set up automatically, why is all this background important? The rational dividers used by GPCLKs work by switching the output between two frequencies. In contrast to PLLs, the jitter they introduce depends largely on their settings and can be quite severe in some cases. For example, this is how the 12.288 MHz clock looked when set by the firmware from a first-attempt dt-blob.bin:

Clock signal with a lot of jitter produced by GPCLK function.

It's best to keep your clocks divided by integer ratios, because in that case the dividers introduce minimal jitter. If you can't do that, jitter is minimized by maximizing the input clock source frequency. In my case, I wanted to generate two clocks that didn't have an integer ratio, so I was forced to use the fractional part on at least one divider. I opted for a low-jitter, integer-divided clock for the 24 MHz USB clock and a higher-jitter clock (but still well within the acceptable range) for the 12.288 MHz audio clock.

This is how the 12.288 MHz clock looks with the settings shown above. It's a significant improvement over the first attempt:

Low-jitter clock signal produced by GPCLK function.

GPCLKs can be useful in lowering the cost of a system, since they can remove the need for separate, expensive crystal resonators. However, it can sometimes be deceiving how easy they are to set up on a Raspberry Pi. dt-blob.bin offers a convenient way of enabling clocks on boot without any assistance from Linux userland. Unfortunately, the exact mechanism by which the firmware sets the registers is not public, hence it's worth double-checking its work by understanding what is happening behind the scenes, inspecting the registers and checking the actual clock signal with an oscilloscope.

Posted by Tomaž | Categories: Digital | Comments »

Testing Galaksija's memory

26.09.2017 20:13

Before attempting to restore the damaged laminate of Mr Ivetić' Galaksija I wanted to have some more confidence that the major components were still in working order. The fact that the NAND gate in the character generator patch was still working correctly gave me hope that the board was never connected to a wrong power supply, which could have done serious damage to the semiconductors. Still, I wanted to test some of the bigger integrated circuits. Memory chips are relatively straightforward to check. They are also mounted in sockets on this board, so they were easy to remove and test on a breadboard.

Character generator ROM on the Galaksija circuit board.

This is another post in the series about restoration of an original Galaksija microcomputer. Galaksija is a small home microcomputer from former Yugoslavia that was built around the Z80 microprocessor. It used EPROMs, a predecessor to modern flash memory, to store its simple operating system. As was common at the time, Galaksija could not update its system software by itself. In fact, western home computers typically stored such software in mask ROMs. There, code and data were programmed by physically etching a pattern into the metal layer of the chip. Even though Yugoslavia had a semiconductor industry that was capable of making ROMs, producing a custom chip was not economical for Galaksija, which had relatively low production numbers.

EPROMs removed from the Galaksija circuit board.

Galaksija originally came with two EPROMs. The first one, called ROM A in the manual and marked Master EPROM here, contains 4 kB of Z80 CPU machine code and data for basic operations. It includes specialized functions related to the hardware: video driver for generation of the video signal, keyboard read-out as well as modulation and demodulation routines for saving data to an audio cassette. Some higher-level functions are also included. There's a simplistic terminal emulation with a command-line interface, a stack-based floating point calculator and a BASIC interpreter based on the TRS-80. My incomplete Galaksija disassembly contains more details.

The second EPROM is the character generator ROM. Galaksija's video output is designed fundamentally around text. The frame buffer contains only references to characters that are to be drawn on the screen. How these characters look, the actual pixels you see, are stored in the character ROM. This is similar to how text mode worked on old PCs and was done to limit the RAM use. In fact, a bitmapped image of the whole screen would not fit into the 2 kB of Galaksija's RAM. Of course, this means that only very limited graphics can be displayed. By sacrificing a lot of RAM and hacking the video driver, some limitations can be worked around.

Iskra EMS6116 static RAM on a Galaksija computer.

Galaksija uses static RAM for its working memory. Costly static RAM was an obsolete choice even in the early 1980s and is the main reason why Galaksija originally only had 2 kB of RAM (which could be upgraded to 6 kB by inserting up to two more identical 2 kB chips). The much larger dynamic RAM would require more complicated circuitry to interface with the CPU and was only added in the later Galaksija "Plus" upgrade. Interestingly, the Z80 CPU was originally meant to be used with dynamic RAM and includes functionality to perform the required refresh cycles. However, in Galaksija this function was instead used for video generation. This board uses a rare EMS6116 RAM chip made by Iskra Semiconductors.

Galaksija's ROM A connected to Arduino Mega.

After carefully removing all three memory chips from the board I wired them up to an Arduino Mega using a breadboard and a rat's nest of jumper wires. I used a slightly modified version of Oddbloke's RomReader sketch for dumping the EPROM contents. Since the 6116 RAM has an electrical interface that is very similar to 27-series EPROMs, I also used this sketch as the base of my RAM test. The RAM test sketch first wrote a test pattern (bytes 00, FF, AA and 55) to all RAM addresses and then read it out to check for any bad bits. Sources for both Arduino sketches are available here.
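The core of the RAM test reduces to something like the following fragment, where set_address(), data_write() and data_read() are hypothetical helpers standing in for the actual GPIO bookkeeping in the sketch:

const uint8_t patterns[] = { 0x00, 0xff, 0xaa, 0x55 };
const uint16_t RAM_SIZE = 2048;

bool test_ram(void)
{
	for (uint8_t p = 0; p < sizeof(patterns); p++) {
		// Write the pattern to the whole address space first and only
		// then read it back, so that addressing faults show up too.
		for (uint16_t a = 0; a < RAM_SIZE; a++) {
			set_address(a);
			data_write(patterns[p]);
		}
		for (uint16_t a = 0; a < RAM_SIZE; a++) {
			set_address(a);
			if (data_read() != patterns[p]) {
				return false;
			}
		}
	}
	return true;
}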

The first few runs of the EPROM dumper showed that ROM A didn't read out correctly. Its contents differed from what I had on record and consecutive reads yielded somewhat different results. After double-checking my setup, however, it turned out that my Arduino Mega board only puts out around 4.5 V on the +5 V supply. This is at the lower specified limit for these EPROMs, so it could explain the occasional bad bits. After supplying a more stable voltage to the EPROM from a lab power supply, ROM A read correctly. Its contents were exactly the same as what I had on record (and what I use on my Galaksija replica).

Two variants of the Galaksija character set

Similarly, the RAM and the other EPROM also checked out fine. However, in contrast to ROM A, the character ROM contents differed from what I had expected. After a closer look at the binary dump (using the chargendump tool from my Galaksija tools to visualize its contents) it turned out that the difference is in characters 0 and 39 (ASCII hex codes 40 and 27 respectively). These two characters are used to draw the two halves of the logo that is displayed before the distinctive Galaksija READY command prompt.

Galaksija screenshot

The character ROM I use on my replica contains the arrow-like logo of Elektronika inženjering. The ROM image from this Galaksija contains the game-of-life glider logo of Mipro design. Both of these logos are etched into the copper on the solder side of the Galaksija circuit board:

"design mipro" logo on Galaksija PCB.

Elektronika inženjering logo on Galaksija PCB.

I don't know why there are two versions of the character ROM in existence or how old each of them is. As far as I know, both of these companies were involved in the manufacture of the original do-it-yourself kit parts (including the PCB and the keyboard). Wikipedia currently says that the later factory-built computers were made by Elektronika inženjering, so it is possible that the arrow logo version is the more recent one. The blurry screenshot in the original Galaksija manual suggests that the glider logo was used when the screenshot was made. This seems to confirm that the glider logo is the older one.

Figure showing Galaksija's character set from the Galaksija manual.

In any case, both versions of the ROM seem to already float around the web, so this discovery isn't terribly exciting. The Galaksija Emulator for instance comes with the glider logo version. As far as I can remember, I originally obtained my arrow logo ROM images from the Wikipedia page. The article used to contain hex dumps, but they were since then deleted due to copyright and non-encyclopedic content concerns.

In conclusion, everything worked as expected, which is great news as far as the restoration of this Galaksija is concerned and a green light to proceed to fixing the PCB. It's also a testament to the reliability of old integrated circuits. I was pretty sure at least the EPROMs had discharged. The datasheet mentions that normal office fluorescent lighting will discharge an unprotected die in around 3 years. Considering that the chips were most likely programmed more than 30 years ago, it is surprising that the content lasted this long (the bit errors at low supply voltage I've seen might be the first sign of deterioration though). It's also surprising that the Iskra EMS6116 survived and passed all the tests I could throw at it. Domestic chips did not have the best of reputations as far as reliability was concerned, but at least this specimen seems to have survived the test of time just fine.

Posted by Tomaž | Categories: Digital | Comments »

BeagleCore Module eMMC and SD card benchmarks

12.08.2017 13:03

In my experience, slow filesystem I/O is one of the biggest disadvantages of cheap ARM-based single-board computers. It contributes a lot to the general feeling of sluggishness when you work interactively with such systems. Of course, for many applications you might not care much about the filesystem after booting. It all depends on what you want to use the computer for. But it's very rare to see one of these small ARMs that would not be several times slower than a 10-year old Intel x86 box as far as I/O is concerned.

A while ago I did some benchmarking of the eMMC flash on the old Raspberry Pi Compute Module. I compared it with the SD card performance on a Raspberry Pi Zero and the SATA drive on a CubieTruck. In the meantime, the project that brought the Compute Module to my desk back then has pivoted to a BeagleCore module. Since I now have a small working system with the BCM1, I thought I might do the same thing and compare its I/O performance with the other systems I tested earlier.

BeagleCore Module mounted on the SNA-LGTC board.

The BCM1 is a small Linux-running computer that comes in the form of a surface-mount hybrid module. It is built around a Texas Instruments AM335x Sitara system-on-chip with a single-core 1 GHz ARM Cortex-A8 CPU. The module comes with 512 MB of RAM and a 4 GB eMMC chip. It is supported by the software from the BeagleBoard ecosystem and in my case runs Debian Jessie with the 4.4.30-ti-r64 Linux kernel. Our board has a micro SD card socket, so I was able to benchmark the SD card as well as the eMMC flash. I was using a Samsung EVO+ 32 GB card.

To perform the benchmark I used the same script on the BCM1 as in my previous tests. I used hdparm and dd to estimate uncached and cached read and write throughputs. I ran each test 5 times and used the best result.
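In essence the script boils down to invocations like the following, with the device path and the test file name as placeholders:

# hdparm -t /dev/mmcblk1
# hdparm -T /dev/mmcblk1
# dd if=/dev/zero of=testfile bs=1M count=256 conv=fdatasync

hdparm -t estimates uncached reads and -T cached reads, while the dd invocation measures write throughput, with conv=fdatasync making sure the data actually reaches the device before the time is taken.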

Comparison of write performance for ARM systems.

The write performance is better with the SD card than with the eMMC flash on the BCM1. The SD card on the BCM1 is also faster than the SD card on the Raspberry Pi Zero, although this is probably not relevant. It's likely that the Zero's performance was limited by the no-name SD card that came with it. The eMMC on the BCM1 is slower than on the Raspberry Pi CM.

The 16.2 MB/s result for the SD card is somewhat suspect, however. After several repeats, the first run of five was always the fastest, with the later runs only yielding around 12 MB/s. It is as if some caching was involved (even though fdatasync was specified with dd).

Comparison of read performance for ARM systems.

Interestingly, things turn around with read performance. BCM1's eMMC flash is better at reading data than the SD card. In fact, BCM1 eMMC flash reads faster than both Raspberry Pi setups I tested. It is still at least 3 times slower than a SATA drive on the CubieTruck.

Comparison of cached read performance for ARM systems.

Cached read performance is the least interesting of these tests. It's more or less the benchmark of the CPU memory access rather than anything related to the storage devices. Hence both BCM1 results are more or less identical. Interestingly, BCM1 with the 1 GHz CPU does not seem to be significantly better than the Compute Module with the 700 MHz CPU.

My results for the BCM1 eMMC flash are similar to those published here for the BeagleBone Black. This is expected, since BeagleBone Black has the same hardware as BCM1, and gives me some confidence that my results are at least somewhat correct.

Posted by Tomaž | Categories: Digital | Comments »

The Galaksija character generator patch

05.07.2017 19:55

In my first overview of Mr Ivetić' Galaksija I mentioned a curious bundle of components hidden inside a yellowing cocoon of Sellotape. It was obviously not a part of the original kit and I speculated that it was likely a workaround for some timing issue connected with the character generator. In the late 1980s several articles were published in Računari and Moj Mikro magazines that attempted to help Galaksija owners fix various hardware problems. Unfortunately I couldn't find any suggested fixes that would match what I saw, so I decided to investigate this particular hardware patch a bit further.

Galaksija circuit board from Mr. Ivetić.

This is another post in the series about the possible restoration of an original Galaksija computer that I took custody of recently. Galaksija is a small home microcomputer from former Yugoslavia that was built around the Z80 microprocessor. The designs were openly published in a magazine in 1984 with the intention that readers would build their own computers from scratch. It is similar to the Sinclair ZX80 in that it uses the CPU to generate the video signal and is constructed solely from general-purpose logic chips. It is generally considered the most successful of several domestic alternatives to computers that were illegally imported from the west.

Components that were hidden under the tape.

After carefully unwrapping layers of disgusting, decaying sticky tape I found a 74LS10 triple 3-input NAND chip from National Semiconductor, a resistor and two capacitors. The circuit is connected to the rest of the computer with only four wires: a logic input, a logic output, the +5 V supply and ground. The green 150 nF capacitor on top of the chip is only used for decoupling the power supply. The first 3-input NAND is wired as a 2-input NAND, the second NAND is wired as a NOT gate and the third is left unconnected. Together they form the following functionally equivalent logic circuit:

Schematic of the monostable multivibrator circuit.

This circuit acts as a monostable multivibrator. It will take an impulse of an arbitrary length on its input and always output an impulse of a fixed length that is defined by the time constant of the RC circuit.

When the input goes from high to low, the transition is immediately propagated over the NAND gate, the capacitor and the NOT gate to the other NAND input. This latches the output low regardless of any later input changes. Over time, the resistor discharges the capacitor enough that the NOT gate input falls below the logic threshold and the output goes back high. This also unlatches the circuit, allowing another input impulse to trigger it again.

The theoretical output impulse length should be around 80 ns based on the capacitor and resistor values shown above.

Monostable multivibrator demonstration, short impulse.

Monostable multivibrator demonstration, long impulse.

I carefully unsoldered the circuit from Galaksija and connected it to a signal generator using the original lengths of wire. On the screenshots above, the yellow trace is the input and the blue trace is the output. As you can see, the circuit is still working. The output impulse length correctly stays the same regardless of the input impulse length. The circuit has a propagation delay of around 32 ns and the measured output impulse length is around 60 ns. The digital signal is distorted due to ground bounce and other effects of the wires that are quite long for signals this fast.

Location of the monostable on the full schematic.

The monostable is connected in front of the shift/load input to the 74LS166 shift register that generates the video signal. See full schematic here.

Normally the shift register shifts out individual bits on its serial output as the electron beam scans the TV screen. However, once every 8 pixels it must load new data. To do this, the CPU reads out 8 new pixels from the character ROM. During this time, the shift/load signal must go low for exactly one transition of the 6.144 MHz pixel clock to load the register.

Timing diagram for the CPU's M1 cycle with the "M1" detect signals added.

Loading the shift register in Galaksija is quite a tricky operation as several signals must be accurately synchronized. The situation is made even more complex by the original Galaksija design. To avoid using an extra chip, the circuit does not fully decode the required CPU bus states with combinatorial logic. Instead, it generates the shift/load impulse dynamically. A 74LS74 D flip-flop is cleverly wired to the CPU bus, as shown on the timing diagram above, to create the load impulse.

Normally digital circuits are designed to work even with ideal components with zero propagation time. However, this circuit depends on the fact that the pixel data will be loaded into the shift register before the CPU settles after the last clock of the M1 state. It's one of the two parts of Galaksija's circuit where signals race each other like this.

For a more in-depth explanation of the character generator, see my old blog post about the CMOS redesign and sections 3.1.5 and 4.1.2 in my diploma thesis (in Slovene - English machine translation).

Timing detail for the shift/load signal and the pixel clock.

So, why was the monostable circuit added to this Galaksija? The rough timing diagram above shows the shift/load signal in relation to the pixel clock. The specification for the SGS Z8400B (the Z80 variant used on this particular board) only gives a maximum of 100 ns for the settling after the low-to-high transition of the CPU clock. Hence, the time the shift/load signal spends low in the original circuit (without the monostable) is anywhere between 0 and 100 ns. If this time is too short, it will miss the low-to-high transition of the pixel clock and the shift register won't load. With the monostable added into the circuit, however, the shift/load will always be low for 60 ns and will always catch the pixel clock.

It was known that the original Galaksija design doesn't work with all Z80-compatible CPUs. As CPU manufacturers improved their processes, the signal transition times were getting shorter and eventually some chips were settling too fast for the unmodified character generator circuit to work correctly. The CPU on this board has a date code from 1986, around 10 years after the first Z80 CPU was introduced and 2 years after Galaksija was first published. It's not surprising that it caused timing problems.

This patch appears to be one way to make the circuit more resilient to the CPU variations. It is not perfect though. If the transition time is too short, the impulse might be too short to trigger the monostable. A better approach is to fully decode the CPU state. This is the solution I chose when designing my CMOS replica. Of course, this comes at a cost of more logic and would not be simple to add to an existing circuit board.

Posted by Tomaž | Categories: Digital | Comments »

Closer look at the original Galaksija

09.05.2017 20:34

A few weeks ago I met with Mr. Vojislav Ivetić in Maribor. He entrusted me with an old Galaksija computer circuit board. Several years ago he obtained it from Janez Stergar at the Faculty of Electrical Engineering and Computer Science, University of Maribor. He told me that the historical computer was in an unknown condition, very likely not working, and that he was interested in restoring it back to a usable state. This post is the result of my visual inspection of the circuit to estimate the extent of the restoration that would be necessary.

Galaksija is a small home microcomputer that was designed in Belgrade by Voja Antonić around the Z80 microprocessor. The designs were openly published in a magazine in 1984 with the intention that readers would build their own computers from scratch. Do-it-yourself kits could be ordered by mail and eventually also complete, factory made computers. Galaksija was often easier to obtain than similar foreign computers due to heavy import restrictions in the former Yugoslavia. It is generally considered the most successful of several attempts at a domestic home microcomputer.

At first glance, Mr. Ivetić's Galaksija appears to be built from one of the kits. It has a white mechanical keyboard and a factory-made single-layer printed circuit board with green solder mask and white silk screen print on top. The integrated circuits and other components were most likely gathered from various sources and soldered manually (not all are in sockets). All original Galaksija computers I've seen looked very similar to this. Some had black keyboards, but they all shared the same PCB design.

Galaksija circuit board from Mr. Ivetić.

The circuit board has the basic Galaksija configuration. Only the 4 kB ROM A is installed. This ROM contains the BASIC interpreter, video driver and the rest of Galaksija's minimalistic operating system (here marked Master EPROM). The ROM B socket is empty.

The quartz windows on the UV-erasable EPROMs are only covered with a white paper sticker. If the board was stored for a long time exposed to light, it might be that the EPROMs have lost their charge due to ambient UV light and will have to be reprogrammed.

Iskra EMS6116 static RAM on a Galaksija computer.

There is a single 2 kB static RAM chip installed. Interestingly, the logo suggests this is an Iskra EMS6116, a domestic integrated circuit. I was not aware that RAM was produced by Iskra. In fact, the original magazine article that gives instructions for Galaksija builders suggests ordering RAM and other chips by mail from abroad (with suggested distributors that would ship to Yugoslavia and tips on getting the shipments through customs). The sockets for two additional 2 kB RAM chips are empty.

All other chips are foreign made. The Z80 CPU and EPROMs are all from SGS (a former Italian semiconductor company, later merged into STMicroelectronics). These also have the most recent date codes among the identifiable components on the board: first week of 1986. The original Galaksija design was published in January 1984, so this board was built at least 2 years later. The other logic chips I could identify are from TI and SGS. The oldest chip is the 74LS38 from 1979.

Improvised circuit on shift/load line.

There is a small bundle of components wrapped in sticky tape hanging off the PCB on four wires. It looks like it contains an IC in a DIP package and some capacitors. The circuit sits in front of the shift/load input to the 74LS166 shift register that generates the video signal. It's also connected to the ground and the power supply. Since the extra circuit is not connected to any other digital lines, I'm guessing it is most likely a delay to fix some timing problem.

Location of the improvised circuit on the schematic.

Normally, the shift/load input is driven directly by a circuit that detects when the CPU is in the M1 (opcode fetch) cycle. See the full schematic here. I know from my previous research that the M1 detection circuit on the original Galaksija is unreliable, since it depends on signal timings that are not guaranteed by the design of the Z80 CPU. It's possible that this was an attempt to work around that issue.

Two potentiometers for setting sync pulse lengths.

There is no RF modulator installed. The circuit has been modified so that composite video signal is directly present on a pair of improvised screw terminals. I'm guessing this Galaksija was used with a monitor or a TV with composite input. Those were quite rare at the time, but it was not uncommon for people to modify their TV sets to add a composite input.

Two potentiometers are wired in series with R12 and R13. They have been glued down, but are now hanging loose on wires. Potentiometers seem to have been installed to adjust horizontal and vertical sync pulse widths. They are not part of the original design. They affect the time constants of 74LS123 monostable multivibrators that generate synchronization impulses in the composite video signal.

Missing space key on the Galaksija.

The space keycap is missing, but the key itself is present. I guess even if a suitable replacement can't be found, one could be drawn in a CAD program and 3D-printed.

Example of a lifted track on Galaksija PCB.

A look at the bottom side reveals that the condition of the copper laminate is quite bad. Many tracks and annular rings have broken or lifted off the substrate. The PCB shows signs of old repairs to some of the damaged tracks, so at least part of this damage is not due to age. Maybe the soldering was done at too high a temperature, or the quality of the laminate was not particularly good. This Galaksija shows no signs that it was ever mounted in a case, so the damage might also be due to mechanical stress. Many tracks around the EPROM sockets are broken, suggesting that the stress of inserting and removing the EPROMs was at least partially responsible.

Ruined annular rings under a transistor on Galaksija.

I've counted around 40 points on the PCB that would need repair. Some are hairline breaks in traces that seem easy to reliably bridge with solder. Others would require replacing copper areas using foil and epoxy glue to bring them back to their original condition. Fortunately, this PCB has relatively large features compared to modern SMD boards. However, this extent of repair still seems like a lot of delicate work. I'm also not certain that other areas of the laminate that look fine now wouldn't start failing during the repair.

If all else fails, another possibility would be to have a whole replacement PCB made and re-solder the keyboard and other original components. This would obviously decrease the historical authenticity. While the scans of original PCB masks are available on the web, those are not precise enough to make a usable replacement board. They would need to be redrawn before they can be sent to a fab.

In conclusion, all the basic components are there and look fairly well preserved. At the moment I have no reason to believe that any of the chips are bad. However, the PCB should be repaired before attempting to power up this board. The extent of the damage and the amount of fine work with the copper foil would make this repair quite time consuming. It would be nice to somehow check the state of the most critical chips before proceeding down that path: fixing the PCB would be a big waste of time if the CPU or the RAM chip eventually turned out to be bad. On the other hand, replacements for 74LSxx series logic still seem to be relatively easy to come by.

Posted by Tomaž | Categories: Digital | Comments »

ESP8266 humidity monitoring

27.02.2017 20:11

Last year my flat developed a bit of a mold problem, or maybe I just found out about it then. It's possible the fungus had already lived a long, fulfilling life before being discovered. It wouldn't be surprising for a building from an era when thermal insulation and waterproofing were pretty far down the priority list. In any case, it made me want to monitor the relative air humidity and dew point levels a bit more closely. I had the apartment pretty well covered with sensors already, but the room with the mold in particular lacked a hygrometer.

Wireless humidity monitor based on the ESP8266 module.

Not to re-invent too much hot water, I more or less replicated the nicely documented temperature and humidity web server project from Adafruit. It was doubly appealing because I still had a full bag of old ESP8266 modules that I bought for pennies back when they were the exciting new thing. The only problem was that the humidity sensors supported out-of-the-box by that project were only available from their shop, and overseas shipping makes ordering small items relatively expensive. Since I was in a hurry, I bought a few DHT11 modules anyway (which turned out to be a mistake).

There's not much to say about the hardware. It's Adafruit's minimalistic circuit soldered onto a perforated board. For the power supply I used a small 3.3 V switch-mode converter module I had left over from another project. I was positively surprised by how easy ESP8266 support was to install into the Arduino IDE. Another pleasant discovery was that the ESP8266 with the Arduino-based firmware seems to consume much less power than with the stock AT-command firmware.

The Arduino IDE has been updated since Adafruit's tutorial was written, so I had to experiment a bit with the firmware upload settings. The following values seemed to work with my particular modules. Another thing I discovered was that the RST line on the ESP8266 has to be left floating for the firmware upload to work reliably. On my previous ESP8266 project I had tied it to VCC.

Arduino firmware upload settings for ESP8266 modules.

Unfortunately, the DHT11 modules are pretty bad as far as accuracy is concerned. I only discovered Robert's wonderfully in-depth comparison of hygrometer modules after the fact. I played a bit with power supply filtering, but that doesn't seem to be the source of the noise in the data. I ended up modifying Adafruit's firmware so that it reads the sensor every 5 seconds and returns the average of the last 8 readings. This somewhat alleviates the problem, but I definitely recommend using some other sensor to anyone wanting to build this.
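
The averaging logic itself is simple. Here is a minimal sketch of the idea; this is my own illustration rather than the actual firmware modification, and the DHT library class and the pin assignment are assumptions:

#include <DHT.h>

#define DHTPIN  2         // GPIO2 on the ESP-01, adjust to your wiring
#define DHTTYPE DHT11

#define N_READINGS 8      // number of readings to average over

DHT dht(DHTPIN, DHTTYPE);

float readings[N_READINGS];     // ring buffer of past readings
int n_valid = 0;                // number of valid readings so far
int pos = 0;                    // next buffer slot to overwrite
unsigned long last_read = 0;

void setup(void)
{
	dht.begin();
	// ... Wi-Fi and web server setup would go here ...
}

void loop(void)
{
	// take a new reading every 5 seconds, skipping failed reads
	if (millis() - last_read >= 5000) {
		last_read = millis();
		float h = dht.readHumidity();
		if (!isnan(h)) {
			readings[pos] = h;
			pos = (pos + 1) % N_READINGS;
			if (n_valid < N_READINGS)
				n_valid++;
		}
	}
	// ... server.handleClient() would go here, with the HTTP
	// handler reporting average_humidity() instead of a single
	// raw reading ...
}

// mean over however many readings the buffer holds so far
float average_humidity(void)
{
	float sum = 0.0;
	for (int i = 0; i < n_valid; i++)
		sum += readings[i];
	return (n_valid > 0) ? (sum / n_valid) : NAN;
}

Skipping failed reads (readHumidity() returns NaN on a checksum error) keeps a single bad transfer from poisoning the whole average.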

For comparison, here is the daily humidity graph recorded using DHT11 with averaging:

DHT11 example daily humidity record.

And here is humidity recorded at the same time (albeit in a different room) by a TEMPerHUM USB dongle with no extra averaging applied:

TEMPerHUM example daily humidity record.

After several months of running, the Arduino-based ESP8266 turned out to be pretty reliable. I haven't seen any big outages in the log of sensor readings. This is a nice improvement over the stock firmware that I used in my Munin display, which still regularly gets lost to the point that it requires a power cycle.

Posted by Tomaž | Categories: Digital | Comments »

Raspberry Pi Compute Module eMMC benchmarks

03.07.2016 13:52

I have a Raspberry Pi Compute Module development kit on my desk at the moment. I'm doing some testing and prototyping because we're considering using it for a project at the Institute. The Compute Module is basically a small PCB with Broadcom's BCM2835 system-on-chip, 4 GB of eMMC flash and little else. Even providing the power supply at a number of different voltages is left as an exercise for the user.

Raspberry Pi Compute Module

I was wondering how the eMMC flash performs compared to the SD card in the more common Pis. I couldn't find any good benchmarks on the web. Wikipedia says that the latest eMMC standard rivals SATA speeds, but there's not much information around on which kind the Compute Module uses. I used Samsung's ARM Chromebook with eMMC flash a while ago and that felt pretty fast. On the other hand, watching package updates scroll by on the Compute Module gave me the feeling that it's quite sluggish.

To get a more objective benchmark, I decided to compare the I/O performance with my Raspberry Pi Zero. The Zero uses the same BCM2835 SoC, so the results should be somewhat comparable. I used the SD card that originally came with the Zero, preloaded with the Noobs distribution. It only has the raspberry logo printed on it, so I don't know the exact model or manufacturer. Both the Compute Module and the Zero were running the latest Raspbian Jessie.

One surprising discovery during this benchmark was that the CPU on the Zero runs between 700 MHz and 1 GHz, while the Compute Module only runs at 700 MHz. These are the ranges detected at boot by bcm2835-cpufreq with the default /boot/config.txt that came with the Raspbian image (i.e. no special overclocking). Because of this, I performed the benchmarks on the Zero at both 700 MHz and 1 GHz.

For comparison, I also ran the same benchmark on my Cubietruck that has an Allwinner A20 system-on-chip with SATA-connected Samsung EVO 840 SSD and runs vanilla Debian Jessie.

This is the benchmark script I used. For each run, I chose the fastest result out of 5:

#!/bin/sh
# Run each test N times; the fastest result of the N runs is reported.
N=5

DEVICE=/dev/sda
#DEVICE=/dev/mmcblk0

# Buffered read speed directly from the device.
I=0
while [ $I -lt $N ]; do
	hdparm -t $DEVICE
	I=$(($I+1))
done

# Cached read speed; this mostly measures RAM and CPU, not storage.
I=0
while [ $I -lt $N ]; do
	hdparm -T $DEVICE
	I=$(($I+1))
done

# Write speed; conv=fdatasync makes dd wait until data hits storage.
I=0
while [ $I -lt $N ]; do
	dd if=/dev/zero of=tempfile bs=1M count=128 conv=fdatasync 2>&1
	I=$(($I+1))
done

# Uncached file read speed; drop the kernel block device caches first.
I=0
while [ $I -lt $N ]; do
	echo 3 > /proc/sys/vm/drop_caches
	dd if=tempfile of=/dev/null bs=1M count=128 2>&1
	I=$(($I+1))
done

# Cached file read speed (the file is still in the page cache).
I=0
while [ $I -lt $N ]; do
	dd if=tempfile of=/dev/null bs=1M count=128 2>&1
	I=$(($I+1))
done

Here is write performance, as measured by dd. I wonder if dd figures are affected by filesystem fragmentation since it writes an actual file that might not be contiguous. I've been using Zero for a while with this Raspbian image while the Compute Module has been freshly re-imaged. Fragmentation shouldn't be as significant as with spinning disks, but it probably still has some effect.

Comparison of write performance.

Read performance, as measured by hdparm as well as dd. To remove the effect of cache when measuring with dd, I explicitly dropped kernel block device caches before each run.

Comparison of read performance.

From this it seems that the Compute Module's eMMC flash is slightly faster than the SD card, on both reads and writes, when compared to the Zero running at the same CPU clock frequency. It's interesting that the Zero's results change significantly with CPU frequency, which suggests that some part of SD card I/O is CPU bound. That said, performance is roughly within the same order of magnitude. The Cubietruck is significantly faster than both. In light of this result, it's sad that newer versions of the Cubieboard (and cheap ARM SoCs in general) dropped the SATA interface.

Finally, I tested block device cache performance. This more or less shows only RAM and CPU performance and shouldn't depend on storage speed.

Comparison of cached read performance.

Interestingly, the Zero seems to be somewhat faster than the Compute Module at 700 MHz here. /proc/cpuinfo shows a different revision, although it's not clear to me whether that marks the board revision or the SoC revision. It might be that the processors in the Zero and the Compute Module are not identical pieces of silicon.

In the end, I should note that these results are not super accurate. Complexities of I/O benchmarking on Linux aside, there are several things that might have affected the results. I already mentioned different filesystem state. A different SD card in Zero might give very different results (I didn't have a second empty card at hand to try that). While Raspberry Pies were idle during these tests, Cubietruck was running my web server and various other little tidbits that tend to accumulate on such machines.

Posted by Tomaž | Categories: Digital | Comments »

Ultra-narrowband and BPSK on TI CC chips

15.06.2016 20:56

Ultra-narrowband is a fancy new name for an old thing. The idea is to use a phase-modulated carrier to transmit data at a very low bitrate. This saves energy and improves spectral efficiency (bits per second of data throughput per hertz of radio bandwidth), which in turn makes it convenient for battery-powered sensors and the 20 billion Internet-connected toasters of tomorrow. For similar reasons, amateur radio operators have been chatting over PSK31, which is essentially the same thing as ultra-narrowband, for almost two decades now.

Currently SIGFOX seems to be the main commercial operator pushing this technology. They don't publish protocol details, however they have written a 3GPP proposal for the C-UNB standard, which is public. The benefit of ultra-narrowband is that the simple BPSK modulation can be implemented with existing cheap and well-tested integrated transceivers. Compare this with the original Weightless standard, for instance, which required custom silicon for its much more advanced physical layer and seems mostly forgotten these days (although it's not a completely fair comparison, since SIGFOX operates in unlicensed spectrum and Weightless had to deal with the complexities of TV whitespaces, but I digress).

CC1101 transceiver on SNE-ISMTV-868

The CC-series of transceivers from Texas Instruments (like CC1101 and CC1120) has a lot of software-configurable modulation blocks built-in, but a BPSK modulator is not among them. However, you can find some references to ultra-narrowband being implemented with these chips which suggests that people are using them for this purpose. The C-UNB proposal also mentions that it can be easily implemented with modified FSK modulation, but doesn't go into more detail. I wanted to implement ultra-narrowband on CC1101 for a project we're doing at the Institute, so I looked into this possibility.

As any introductory course in telecommunications is quick to point out, frequency and phase modulation are basically the same thing. If you take a frequency modulator and feed it the time-derivative of a signal, the result is identical to a phase modulator fed with the unmodified signal. In practice, however, it's not that simple. BPSK requires that the phase changes by ±180° for each symbol change. The frequency-shift keying block in CC chips does not have a well-defined relation between frequency deviation and symbol rate, which means that it's hard to define how much the signal phase changes during each symbol.
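
In symbols, using the standard textbook definitions (my notation, not anything from the CC1101 documentation), the two modulators produce:

\[ s_\mathrm{PM}(t) = \cos\left(2\pi f_c t + k_p\, m(t)\right) \]

\[ s_\mathrm{FM}(t) = \cos\left(2\pi f_c t + 2\pi k_f \int_0^t m(\tau)\,\mathrm{d}\tau\right) \]

Feeding the frequency modulator \(\mathrm{d}m/\mathrm{d}t\) instead of \(m(t)\) turns the integral back into \(2\pi k_f\, m(t)\), which is exactly the phase modulator with \(k_p = 2\pi k_f\).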

CC1101 does have a minimum-shift keying mode. This is a special form of frequency modulation that has well-defined phase shifts between symbols. Wikipedia says that the carrier phase continuously shifts by ±90° each symbol period, which does not sound useful at first:

Minimum-shift keying illustration.

In this interpretation of phase shifts, the carrier frequency fc is in the middle between frequencies for the two symbols, f0 and f1. This is the usual interpretation for frequency modulation, where you have approximately equal numbers of both symbols in a typical transmission.

However, if you transmit mostly one symbol, say f0, the receiver will consider that frequency to be the carrier f'c. In that case, each occurrence of symbol 1 rotates the phase of the signal relative to f'c by +180°. This is exactly what you need to implement BPSK.

Alternative interpretation of phase in MSK.
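
The 180° figure follows from MSK's modulation index of h = 0.5 (a quick derivation of my own):

\[ f_1 - f_0 = \frac{h}{T} = \frac{R}{2}, \qquad \Delta\varphi = 2\pi (f_1 - f_0)\, T = 2\pi \cdot \frac{R}{2} \cdot \frac{1}{R} = \pi \]

where R = 1/T is the MSK symbol rate. One symbol period spent at f1 therefore accumulates exactly 180° of phase relative to a carrier that stays at f0.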

BPSK requires that phase shifts are fast compared to the symbol rate, so you want to encode each BPSK symbol with many MSK symbols. Ultra-narrowband uses symbol rates on the order of 100 symbols/s while the CC1101 supports up to around 1 Msymbol/s, so the phase changes could be made very fast indeed; in practice 10 MSK symbols per BPSK symbol seems to suffice.

In the end, the bits encoded into MSK symbols look somewhat similar to the theoretical time-derivative I mentioned above. You transmit an impulse of a single f1 symbol each time there is a transition from bit 0 to 1 or vice versa:

Using multiple MSK symbols as one BPSK symbol.
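
In C, the encoding might look something like this. This is my own illustration of the principle, not code from the actual project, and the symbol representation is arbitrary:

#include <stdio.h>
#include <stddef.h>

#define MSK_PER_BIT 10	/* MSK symbols sent per BPSK symbol */

/* Encode a string of BPSK bits ('0'/'1' characters) into a string of
 * MSK symbols: a single f1 symbol ('1') at each bit transition flips
 * the carrier phase by 180 degrees, f0 symbols ('0') keep it steady.
 * Returns the number of MSK symbols written to out. */
size_t bpsk_to_msk(const char *bits, char *out, size_t out_len)
{
	size_t n = 0;
	char prev = bits[0];

	for (const char *p = bits; *p; p++) {
		for (int i = 0; i < MSK_PER_BIT && n < out_len; i++)
			out[n++] = (i == 0 && *p != prev) ? '1' : '0';
		prev = *p;
	}
	return n;
}

int main(void)
{
	char out[128];
	size_t n = bpsk_to_msk("0110", out, sizeof(out) - 1);

	out[n] = '\0';
	printf("%s\n", out);
	return 0;
}

Encoding the bits 0110, for example, produces an f1 symbol at the start of the second and the fourth bit periods, with runs of f0 everywhere else.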

So far this has been all theoretical. How well does it work in practice? The most obvious problem is frequency stability. The local oscillator in the CC1101 is designed to be re-calibrated often, but you cannot calibrate it while you are transmitting. With such low bitrates, packet transmissions last for several seconds. During that time the frequency can drift quite a lot, especially compared to the very limited bandwidth of these transmissions. This is the usual problem with narrowband transmissions, and the CC1101 has no mechanism for compensating for it on reception. That is why I doubt a CC1101-to-CC1101 link would work this way, and I haven't tried it.

Transmission from a CC1101 to a specialized receiver, however, seems to work quite nicely in practice. You just have to use an SDR with a wide enough channel for reception and compensate for the frequency drift in software. I have some lab measurements to share, but those will have to wait for another post.

Posted by Tomaž | Categories: Digital | Comments »

Measuring interrupt response times, part 2

27.04.2016 11:40

Last week I wrote about some typical interrupt response times you get from an Arduino and Raspberry Pi, if you follow basic examples from documentation or whatever comes up on Google. I got some quite unexpected results, like for instance a Python script that responds faster than a compiled C program. To check some of my guesses as to what caused those results, I did another set of measurements.

For Arduino, most response times were grouped around 9 microseconds, but there were a few outliers. I checked the Arduino library source and it indeed always enables the AVR timer/counter0 overflow interrupt. If the timer interrupt happens at the same time as the GPIO interrupt I was measuring, the GPIO interrupt gets delayed. Performing the measurement with the timer interrupt masked out indeed removes the outliers:

Effect of timer interrupt on Arduino response time.
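
Masking the interrupt amounts to clearing a single bit. Below is a sketch of the modified test, specific to the ATmega328 and with my own arbitrary pin choices, not the exact code from the repository. Note that millis() and delay() stop working once the timer/counter0 overflow interrupt is off:

#include <avr/io.h>

void respond(void)
{
	// raise the output line, then reset it for the next measurement
	digitalWrite(13, HIGH);
	digitalWrite(13, LOW);
}

void setup(void)
{
	pinMode(13, OUTPUT);
	attachInterrupt(digitalPinToInterrupt(2), respond, RISING);

	// disable the timer/counter0 overflow interrupt that the
	// Arduino core enables for its timekeeping functions
	TIMSK0 &= ~_BV(TOIE0);
}

void loop(void)
{
}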

With the timer off, all measured response times are between 8.9485 and 9.1986 μs, an interval 0.2501 μs long. This fits perfectly with theory: at a 16 MHz CPU clock and instruction lengths between 1 and 5 cycles, the interrupt latency is uncertain by 4 cycles, or exactly 0.25 μs.

The second weird thing was the aforementioned discrepancy between Python and C on Raspberry Pi. The default Python library uses an ugly hack to bypass the kernel GPIO driver and control GPIO lines directly from user space: it mmaps the range of physical memory containing the GPIO registers into its own process memory space using /dev/mem. This is similar to how X servers on Linux (used to?) access graphics hardware from user space. While this approach is very unportable, it's also much faster, since you don't need a context switch into the kernel for every operation.
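
Stripped of most error handling, the core of the trick looks something like this. This is a minimal sketch for the BCM2835 on the Pi Zero, with register offsets taken from the BCM2835 peripherals datasheet; it needs to run as root, and setting the pin function through the GPFSEL registers is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define GPIO_BASE 0x20200000UL	/* BCM2835 peripheral base + GPIO offset */
#define GPSET0 (0x1c / 4)	/* output set register */
#define GPCLR0 (0x28 / 4)	/* output clear register */
#define GPLEV0 (0x34 / 4)	/* pin level register */

static volatile uint32_t *gpio;

int gpio_init(void)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0)
		return -1;

	/* map one page of GPIO registers into our address space */
	gpio = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, GPIO_BASE);
	close(fd);

	return (gpio == MAP_FAILED) ? -1 : 0;
}

/* from here on, pin accesses are plain memory reads and writes,
 * with no syscalls and no context switches */

int gpio_read(int pin)
{
	return (gpio[GPLEV0] >> pin) & 1;
}

void gpio_set(int pin)
{
	gpio[GPSET0] = 1 << pin;
}

void gpio_clear(int pin)
{
	gpio[GPCLR0] = 1 << pin;
}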

To check just how much faster mmap method is on Raspberry Pi, I copied the GPIO access code from the RPi.GPIO library into my test C program:

Response times using sysfs and mmap methods on Raspberry Pi.

As you can see, the native program is now faster than the interpreted Python script. This also demonstrates just how costly context switches are: the sysfs version is more than two times slower on average. It's worth noting that both RPi.GPIO and my C program still use epoll() or select() on a sysfs file to wait for the interrupt; only the output pin change is done with direct memory accesses.
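
For reference, waiting for the interrupt through sysfs looks roughly like this. This is a sketch of the usual pattern, not the exact code from my test program, and it assumes the pin has already been exported and its edge file set to rising:

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Block until an edge happens on an exported GPIO line. The kernel
 * signals an edge as an exceptional condition (POLLPRI) on the
 * value file, e.g. /sys/class/gpio/gpio17/value. */
int wait_for_edge(const char *value_path)
{
	char buf[8];
	struct pollfd pfd;

	int fd = open(value_path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* dummy read, so that poll() only reports new edges */
	read(fd, buf, sizeof(buf));

	pfd.fd = fd;
	pfd.events = POLLPRI | POLLERR;
	poll(&pfd, 1, -1);

	/* consume the new value */
	lseek(fd, 0, SEEK_SET);
	read(fd, buf, sizeof(buf));

	close(fd);
	return 0;
}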

Finally, Raspberry Pi was faster when the CPU was loaded, which seemed counterintuitive. I tracked this down to automatic CPU frequency scaling. By default, Raspberry Pi Zero seems to be set to run between 700 MHz and 1000 MHz using the ondemand governor. If I switch to the performance governor, it keeps the CPU running at 1 GHz at all times. In that case, as expected, CPU load increases the average response time:

Effect of cpufreq governor on Raspberry Pi response time.

It's interesting to note that the Linux kernel comes with pluggable idle loop implementations (CONFIG_CPU_IDLE). The idle loop can be selected through /sys/devices/system/cpu/cpuidle in a similar way to the CPU frequency governor. The Raspbian Jessie release, however, has that disabled. It uses the default idle loop for ARMv6 processors, although the assembly code has been patched: the ARM Wait For Interrupt (WFI) instruction in the vanilla kernel has been replaced with some mcreq (write to coprocessor?) instructions. I can't find any info on the JIRA ticket referenced in the comment, and the change was added among other BCM-specific changes in a single 6400-line commit. The idle loop implementation is interesting because if it puts the CPU into a power saving mode, it can affect the interrupt latency as well.

As before, source code and raw data is on GitHub.

Posted by Tomaž | Categories: Digital | Comments »

Measuring interrupt response times

18.04.2016 15:13

Embedded systems were traditionally the domain of microcontrollers. You programmed them in C on bare metal, directly poking values into registers and hooking into interrupt vectors. Only if it was really necessary would you include some kind of light-weight operating system. Times are changing though. These days it's becoming more and more common to see full Linux systems and high-level languages in this area. It's not surprising: if I can just pop open a shell, see what exceptions my Python script is throwing and fix them on the fly, I'm not going to bother with microcontrollers and the whole in-circuit debugger thing. Some even say it won't be long before we are all just running web browsers on our devices.

It seems to be common knowledge that the traditional approach really excels at latency. If you're moderately careful with your code, you can get your system to react very quickly and consistently to events. Common embedded Linux systems don't have real-time features. They seem to address this deficiency with some combination of "don't care", "it's good enough" and throwing raw CPU power at the problem. Or as the author of RPi.GPIO library puts it:

If you are after true real-time performance and predictability, buy yourself an Arduino.

I was wondering what kind of performance you could expect from these modern systems. I tend to be very conservative in my work: I have a pile of embedded Linux-running boards, but they are mostly gathering dust while I stick to old-fashioned Cortex M3s and AVRs. So I thought it would be interesting to do some experiments and get some real data about these things.

Measuring interrupt response times on Arduino.

To test how fast a program can respond to an event, I chose a very simple task: raise an output digital line whenever a rising edge happens on an input digital line. This allowed me to measure response times very simply and in an automated fashion, using a USB-connected oscilloscope and a signal generator.

I tested two devices: an Arduino Uno using a 16 MHz ATmega328 microcontroller and a Raspberry Pi Zero using a 1 GHz ARM-based CPU running Raspbian Jessie. I tried several approaches to implementing the task. On Arduino, I implemented it with an interrupt and with a polling loop. On Raspberry Pi, I tried a kernel module, a native binary written in C and a Python program. You can see the exact source code on GitHub.
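
The interrupt version on Arduino is essentially a one-liner. Here is a sketch of the approach from memory, with arbitrary pin choices; the exact implementations are in the GitHub repository:

void respond(void)
{
	// raise the output line; the oscilloscope measures the delay
	// between the input edge and this transition
	digitalWrite(13, HIGH);
	digitalWrite(13, LOW);
}

void setup(void)
{
	pinMode(2, INPUT);
	pinMode(13, OUTPUT);
	attachInterrupt(digitalPinToInterrupt(2), respond, RISING);
}

void loop(void)
{
}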

Measuring interrupt response times on Raspberry Pi.

For all of these, I chose the most obvious approach possible. My implementations were based as much as possible on the preferred libraries mentioned in the documentation, or whatever came up on top of my web searches. This meant that for Arduino, I was using the Arduino IDE and the library that comes with it. For Raspberry Pi, I used the RPi.GPIO Python library, the GPIO sysfs interface for native code in user space and the GPIO consumer interface for the kernel module (based on examples from Stefan Wendler). Many of these could definitely be further hand-optimized, but I was mostly interested in the out-of-the-box performance you get on the first try.

Here is a histogram of 500 measurements for the five implementations:

Histogram of response time measurements.

As expected, Arduino and the Raspberry Pi kernel module were both significantly faster and more consistent than the two Raspberry Pi user space implementations. Somewhat shockingly though, the interpreted Python program was considerably faster than my C program compiled into native code.

If you check the source, the RPi.GPIO library maps the hardware registers directly into its process memory. This means that it does not need any syscalls for controlling the GPIO lines. On the other hand, my C implementation uses the kernel's sysfs interface. This is arguably a cleaner and safer way to do it, but it requires calls into the kernel to change GPIO states, and those involve expensive context switches. This difference is the likely reason why Python was faster.

Histogram of response time measurements (zoomed)

Here is the zoomed-in left part of the histogram. The Raspberry Pi kernel module can be just as fast as the Arduino, but it is less consistent. That is not surprising, since the kernel has many other interrupts to service, and not that impressive considering the 60 times faster CPU clock.

Arduino itself is not that consistent out-of-the-box. While most interrupts are served in around 9 microseconds (around 140 CPU cycles), occasionally they take as long as 15 microseconds. The Arduino library is probably to blame here, since it uses the timer interrupt for its delay functions. This interrupt seems to be always enabled, even when a delay function is not running, and hence competes with the GPIO interrupt I am using.

Also, this shows once again that polling on Arduino can sometimes be faster than interrupts: a tight polling loop avoids the interrupt entry overhead entirely.
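
A minimal polling version might look like this (again my own sketch with arbitrary pins, not the exact code from the repository):

#include <avr/io.h>

void setup(void)
{
	DDRD &= ~_BV(PD2);	// input on PD2 (Arduino pin 2)
	DDRB |= _BV(PB5);	// output on PB5 (Arduino pin 13)
}

void loop(void)
{
	// busy-wait for a rising edge on the input line
	while (!(PIND & _BV(PD2)))
		;

	// raise the output line, then reset it for the next measurement
	PORTB |= _BV(PB5);
	PORTB &= ~_BV(PB5);

	// wait for the input to return low again
	while (PIND & _BV(PD2))
		;
}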

Effect of CPU load on response time.

Another interesting result was the effect of CPU load on Raspberry Pi response times. Somewhat counterintuitively, response times are smaller on average when there is another process consuming CPU cycles. This happens even with the kernel module, which makes me think it has something to do with power saving features. Perhaps this is due to CPU frequency scaling, or maybe the kernel puts an idle CPU into some sleep mode from which it takes longer to wake up.

In conclusion, I was a bit impressed by how well Python scored on this test. While it's an order of magnitude slower than Arduino, 200 microseconds on average is not bad. Of course, there's no hard upper limit on that. In my test, some responses took twice as long, and things really start falling apart if you increase the interrupt load (for instance, with a process that does something with the SD card or the network adapter). Some of the results on Raspberry Pi were quite surprising and they show once again that intuition can be pretty wrong when it comes to software performance.

I will likely be looking into more details regarding some of these results. If you would like to reproduce my measurements, I've put source code, raw data and a notebook with analysis on GitHub.

Posted by Tomaž | Categories: Digital | Comments »

Another hard drive failure

07.02.2015 21:41

Earlier today one of my hard drives died. It was a fairly old 750 GB "Caviar GP" drive from a Western Digital "My Book" external enclosure. All it does now is emit an impressively loud metallic clicking noise.

I should have seen this coming, of course. At this point I have a pile of failed drives stashed in a box somewhere. I remember that this particular one had been unusually slow to start and mount the last couple of times I used it. smartd had also previously reported "2 Currently unreadable (pending) sectors". I ignored both warnings because I assumed this was yet another problem with the power supply. I had a "My Book" 12 V external power supply fail before with similar symptoms.

Recently I only used this drive for backups, so except for some archival copies of machines I no longer own, probably nothing of value was lost. It would have been nice to at least have a listing of its contents from before it failed, though.

Disassembled Western Digital "My Book" external drive.

Of course, I opened it up to see if there was anything obviously wrong with it. The "My Book" USB interface board and the power supply are not the cause, because the drive has the same problem even when it is connected directly to a SATA port. I can hear the platters spinning, and the clicking noise can only be caused by the heads thrashing around, so those are not stuck either.

Corrosion of surface finish on the controller PCB.

The only thing that immediately looks wrong is the unusual amount of corrosion on the hard drive controller PCB. It's bad enough that on some exposed test points both the immersion gold and the copper layer are completely gone. I'm not quite sure what could have caused that. As far as I can remember, this drive was sitting somewhere around my desk the whole time, so it hasn't been exposed to any hostile environments. It might be a manufacturing defect of some sort - maybe the board was not rinsed well enough after processing.

Bottom side of the hard drive controller PCB.

I cleaned the pads where the motor and the head connect to the circuit board, but that didn't make any difference.

The copper below the green solder mask looks fine though. The bottom side of the PCB contains one large BGA chip. Maybe that one developed some bad connections, if the problem is indeed in the controller board. Just as an experiment, I also tried the disk-in-the-freezer trick, but that did not make the disk behave any differently.

Posted by Tomaž | Categories: Digital | Comments »