My experience with Firefox containers

25.07.2020 19:17

For the past few months I've been using the Firefox Multi-Account Containers (MAC) extension. This extension makes it possible to maintain several isolated browser states at a time. By a browser state I mean everything that websites store in the browser: cookies, local storage and so on. In practical terms that means logins on websites with "remember me" functionality, shopping baskets, advertisement network IDs and other similar things. You can set up the extension so that certain websites always open in a certain container. The Temporary Containers (TC) extension further builds upon MAC by dynamically creating and deleting containers as you browse, in an attempt to keep the browser from accumulating long-term cookies.

Screenshot of the Firefox Multi-Account Containers extension.

I have a few reasons for using such a setup. The first is that I commonly use company and personal accounts on the same websites. I want to keep these logins completely separate for convenience (no need to constantly log out and log in to change accounts). I've also once had an instance where a web shop silently merged accounts based on some hidden browser state. Even though that was most certainly unacceptable behavior on the web shop's end, I would like to avoid another case where my personal projects end up in the corporate order history.

The second reason is privacy and security. I think it is harder to exploit weaknesses in the browser's cross-site isolation, or to pull off phishing attacks, if the default browser instance I'm using doesn't store authentication cookies for any important accounts. The fact that most websites see a fresh browser state without cookies also slightly raises the bar for tracking my browsing habits across different websites.

Cookies and Site Data preferences in Firefox.

I used to have Firefox set up so that it cleared all cookies after every session. This took care of accumulating cookies, but meant that I needed to continuously re-login to every website. That wasn't such an inconvenience on my end. However, recently more and more websites have started treating new logins from a cookie-less browser as a security breach. At best I would constantly get mails about it, at worst I would get accounts blocked or thrown into a captcha hell for unusual behavior.

A typical "sign-in from a new device detected" mail.

I would still prefer to have this setting enabled for default browsing, combined with a few permanent containers for websites that do this sort of unusual behavior detection. However MAC doesn't allow you to set this independently for each container. In theory, using TC fixes that problem. Opening up a website in a fresh temporary container that is used once and then deleted after closing the browser tab has the same effect as clearing cookies.

My foremost problem with containers is that in practical use they don't really contain the state between websites. It's trivial to make a mistake and use the wrong container. If I click a link and open it in a new tab, that tab will inherit the container of the original tab. The same also happens if you enter a new URL manually into the address bar. It's very easy, for example, to follow a link a coworker shared in a web chat and then spend an hour researching related websites. If I forget to explicitly open the link in "a new Temporary Container" that browsing will all happen in the permanent container that I would prefer to only be used for the web chat service. The tab titles get a colored underline that shows what container they are using, but it's easy to overlook that.

All it takes is one such mistake and the container is permanently polluted with cookies and logins from unrelated websites that I would not like to have in there. These will persist, since to retain the web chat login I have to set the browser to retain cookies indefinitely. Over time I found that all permanent containers tend to accumulate cookies and persistent logins for websites I frequent, which defeats most of the benefits of using them.

There is the "Always Open This Site in..." option, but it works the opposite way from what I would want. You can define a list of websites that need to be opened in a certain container, but you can't say that a website outside this list needs to be opened outside of a container (or in a Temporary Container). The "Always Open This Site in..." option also has the additional problem that it's annoying if I want to use a website in two containers (e.g. work and personal ones). In that case I have to constantly click through warnings such as this:

Firefox warning about assigned containers.

Again, the Temporary Containers extension attempts to address this. There is a group of preferences called "Isolation". In theory you can set this up so that a new temporary container is automatically opened up when you navigate away to a different website. It also takes effect when using the permanent containers.

It's possible that I don't understand exactly how this works, but I found any setting other than "Never" to not be useful in practice. The problem is with websites that use third-party logins (e.g. a website where you log in with your Google account but which is not hosted under google.com; I'm guessing through a mechanism like OpenID). Isolation completely breaks this authentication flow, since the log-in form then opens up in a new container and whatever authentication tokens it sets aren't visible in the original one.

Isolation preferences of the Temporary Containers extension.

Finally, both extensions are somewhat buggy in my experience. For example, I'm occasionally seeing tabs opened in no container at all even though I have Automatic Mode turned on in TC, which should automatically re-open any tab that's not in a container in a temporary one. I can't find a reliable way to reproduce this, but it seems to only happen with tabs that I open from a different application and might be related to this issue.

Cleaning up old temporary containers often doesn't work reliably. Again, it's hard to say what exactly isn't working and whether that's one of the known bugs or not. I often find that the browser has accumulated tens of temporary containers that I then need to clean up by hand. This is especially annoying since a recent update to MAC made container deletion unnecessarily tedious. It takes 4 mouse clicks to delete one container, so deleting a few tens of them is quite a chore.

There is also a couple of very annoying bugs related to Firefox sync. MAC allows the container settings to be synchronized between all Firefox installations linked to the same Firefox account using Mozilla's servers. This is an incredibly useful feature for me since I commonly use multiple computers and often also multiple operating systems on each one. Needless to say, synchronizing browser settings manually is annoying.

Unfortunately, running MAC and TC with sync enabled runs the risk of permanently breaking the Firefox account. Because of the bugs I linked above it's very easy to accumulate so many temporary containers that you exceed the storage quota on the sync server. It seems that once that happens the sync server will not even let you delete the offending data that's stored there before erroring out. The result is that add-on sync will no longer work on that account and even cleaning up your local setup afterwards will not fix it.

Errors about insufficient storage in Firefox sync logs.

In conclusion, I'm not very happy with this setup (or the modern web in general, but I digress). Multi-Account Containers is certainly an improvement over the setup with multiple separate Firefox profiles that I was using previously. It works well enough for keeping work and personal accounts separate. On the other hand, it doesn't seem to be very effective at isolating cookies and other state between browsing sessions. I'm not sure what exactly a better solution would be. I feel like I'm leaning more and more towards a setup where I would just use two completely separate browsers: one for heavy web apps that require a login, persistent local storage and JavaScript, and another one that's more aggressive at cleaning up after itself, for everything else.

Posted by Tomaž | Categories: Code | Comments »

On missing IPv6 router advertisements

03.05.2020 16:58

I've been having problems with Internet connectivity for the past week or so. Connections would randomly time out and some things would work very slowly or not at all. In the end it turned out to be a problem with IPv6 routing. It seems my Internet service provider is having problems with sending out periodic Router Advertisements and the default route on my router often times out. I've temporarily worked around it by manually adding a route.

I'm running a simple, dual-stack network setup. There's a router serving a LAN. The router is connected over an optical link to the ISP, which is doing Prefix Delegation. The problems appeared intermittently. A lot of software seems to gracefully fall back onto IPv4 if IPv6 stops working, but there's usually a more or less annoying delay before it does that. On the other hand, some programs don't fall back and seem to assume that there's global connectivity as long as a host has a globally-routable IPv6 address.

The most apparent and reproducible symptom was that IPv6 pings to hosts outside of LAN often weren't working. At the same time, hosts on the LAN had valid, globally-routable IPv6 addresses, and pings inside the LAN would work fine:

$ ping -6 -n3 host-on-the-internet
connect: Network is unreachable
$ ping -6 -n3 host-on-the-LAN
PING ...(... (2a01:...)) 56 data bytes
64 bytes from ... (2a01:...): icmp_seq=1 ttl=64 time=0.404 ms
64 bytes from ... (2a01:...): icmp_seq=2 ttl=64 time=0.353 ms
64 bytes from ... (2a01:...): icmp_seq=3 ttl=64 time=0.355 ms

--- ... ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2026ms
rtt min/avg/max/mdev = 0.353/0.370/0.404/0.032 ms

Rebooting my router seemed to help for a while, but then the problem would reappear. After some debugging I found out that the immediate cause of the problems was that the default route on my router would disappear approximately 30 minutes after it had been rebooted. It would then randomly re-appear and disappear a few times a day.

On my router, the following command would return empty most of the time:

$ ip -6 route | grep default

But immediately after a reboot, or if I got lucky, I would get a route. I'm not sure why there are two identical entries here, but the only difference is the from field:

$ ip -6 route | grep default
default from 2a01::... via fe80::... dev eth0 proto static metric 512 pref medium
default from 2a01::... via fe80::... dev eth0 proto static metric 512 pref medium

The following graph shows the number of entries returned by the command above over time. You can see that for most of the day the router didn't have a default route:

Number of valid routes obtained from RA over time.
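
Data for a graph like this can be collected with a small script that periodically counts the default routes on the router. Below is a rough sketch of the idea; the log path and the one-minute interval are arbitrary choices of mine, not anything this setup requires:

# Rough sketch: log the number of default IPv6 routes once per minute.
# The log path and the interval are arbitrary; run this on the router.
import subprocess
import time

while True:
    out = subprocess.run(["ip", "-6", "route"],
                         capture_output=True, text=True).stdout
    n_default = sum(1 for line in out.splitlines()
                    if line.startswith("default"))

    with open("/var/log/default-route-count.log", "a") as f:
        f.write("%d\t%d\n" % (int(time.time()), n_default))

    time.sleep(60)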

The thing that was confusing me the most was the fact that the mechanism for getting the default IPv6 route is distinct from the way the prefix delegation is done. This means that every device in the LAN can get a perfectly valid, globally-routable IPv6 address, but at the same time there can be no configured route for packets going outside of the LAN.

The route is automatically configured via Router Advertisement (RA) packets, which are part of the Neighbor Discovery Protocol. When my router first connects to the ISP, it sends out a Router Solicitation (RS). In response to the RS, the ISP sends back an RA. The RA contains the link-local address to which traffic intended for the Internet should be directed, as well as a Router Lifetime, which sets the time interval for which this route is valid. This lifetime appears to be 30 minutes in my case, which is why rebooting the router seemed to fix the problems for a short while.

The trick is that the ISP should later periodically re-send the RA by itself, refreshing the information and the lifetime, hence pushing back the deadline at which the route times out. Normally, a new RA should arrive well before the lifetime of the first one runs out. However, in my case it seemed that for some reason the ISP suddenly started sending out RAs only sporadically. Hence the route would time out in most cases, and my router wouldn't know where to send the packets that were going outside of my LAN.

To monitor RA packets on the router using tcpdump:

$ tcpdump -v -n -i eth0 "icmp6 && ip6[40] == 134"

This should show packets like the following arriving in intervals that should be much shorter than the advertised router lifetime. On a different, correctly working network, I've seen packets arriving roughly once every 10 minutes with lifetime of 30 minutes:

18:52:01.080280 IP6 (flowlabel 0xb42b9, hlim 255, next-header ICMPv6 (58) payload length: 176)
fe80::... > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 176
	hop limit 64, Flags [managed, other stateful], pref medium, router lifetime 1800s, reachable time 0ms, retrans timer 0ms
	...
19:00:51.599538 IP6 (flowlabel 0xb42b9, hlim 255, next-header ICMPv6 (58) payload length: 176) 
fe80::... > ff02::1: [icmp6 sum ok] ICMP6, router advertisement, length 176
	hop limit 64, Flags [managed, other stateful], pref medium, router lifetime 1800s, reachable time 0ms, retrans timer 0ms
	...

However, in my case this wasn't happening. Similarly to what the graph above shows, these packets only arrived sporadically. As far as I know, this is an indication that something is wrong on the ISP side. Sending an RA in response to an RS seems to work, but the periodic RA sending doesn't. Strictly speaking there's nothing that can be done to fix this on my end. My understanding of RFC 4861 is that a downstream host should only send out an RS once, after connecting to the link.

Once the host sends a Router Solicitation, and receives a valid Router Advertisement with a non-zero Router Lifetime, the host MUST desist from sending additional solicitations on that interface, until the next time one of the above events occurs.

Indeed, as far as I can see, Linux doesn't have any provisions for re-sending an RS in case all routes from previously received RAs time out. This answer argues that it should, but I can find no references that would confirm this. On the other hand, this answer agrees with me that an RS should only be sent when connecting to a link. On that note, I've also found a discussion that mentions blocking multicast packets as a cause of similar problems. I don't believe that is the case here.

In the end I've used an ugly workaround so that things kept working. I've manually added a permanent route that is identical to what is randomly advertised in RA packets:

$ ip -6 route add default via fe80::... dev eth0

Compared to entries originating from RA, this manual entry in the routing table won't time out - at least not until my router gets rebooted. It also doesn't hurt anything if additional, identical routes get occasionally added via RA. Of course, it still goes completely against the IPv6 neighbor discovery mechanism. If anything changes on the ISP side, for example if the link-local address of their router changes, the entry won't get updated and the network will break again. However, it does seem to fix my issues at the moment. The fact that it's working also seems to confirm my suspicion that something is only wrong with RA transmissions on the ISP side, and that the actual routing on their end works correctly. I've reported my findings to the ISP and hopefully things will get fixed on their end, but in the meantime, this will have to do.
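
To make the workaround slightly less fragile, the same thing could be done from a small watchdog script that re-adds the route whenever it goes missing. This is just a sketch of the idea - the gateway address and interface name are placeholders and the script needs to run as root:

# Sketch of a watchdog that re-adds the static default route if it
# disappears. GATEWAY and DEV are placeholders for the real values.
import subprocess
import time

GATEWAY = "fe80::..."   # the ISP router's link-local address
DEV = "eth0"

while True:
    routes = subprocess.run(["ip", "-6", "route"],
                            capture_output=True, text=True).stdout

    if not any(line.startswith("default") for line in routes.splitlines()):
        subprocess.run(["ip", "-6", "route", "add", "default",
                        "via", GATEWAY, "dev", DEV])

    time.sleep(60)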

Posted by Tomaž | Categories: Code | Comments »

Printing .lto_priv symbols in GDB

14.02.2020 16:08

Here's a stupid little GNU debugger detail I've learned recently - you have to quote the names of some variables. When debugging a binary that was compiled with link time optimization, it sometimes appears like you can't inspect certain global variables.

GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
[...]
(gdb) print usable_arenas
No symbol "usable_arenas" in current context.

The general internet wisdom seems to be that if a variable is subject to link time optimization it can't be inspected in the debugger. I guess this comes from the similar problem of inspecting private variables that are subject to compiler optimization. In some cases private variables get assigned to a register and don't appear in memory at all.

However, if it's a global variable, accessed from various places in the code, then its value must be stored somewhere, regardless of what tricks the linker does with its location. It's unlikely it would get assigned to a register, even if it's theoretically possible. So after some mucking about in the disassembly to find the address of the usable_arenas variable I was interested in, I was surprised to find out that gdb does indeed know about it:

(gdb) x 0x5617171d2b80
0x5617171d2b80 <usable_arenas.lto_priv.2074>:	0x17215410
(gdb) info symbol 0x5617171d2b80
usable_arenas.lto_priv in section .bss of /usr/bin/python3.5

This suggests that the name has a .lto_priv or a .lto_priv.2074 suffix (perhaps meaning LTO private variable? It is declared as a static variable in C). However I still can't print it:

(gdb) print usable_arenas.lto_priv
No symbol "usable_arenas" in current context.
(gdb) print usable_arenas.lto_priv.2074
No symbol "usable_arenas" in current context.

The trick is not that this is some kind of a special variable or anything. It just has a tricky name. You have to put it in quotes so that gdb doesn't try to interpret the dot as an operator:

(gdb) print 'usable_arenas.lto_priv.2074'
$3 = (struct arena_object *) 0x561717215410

TAB completion also works against you here, since it happily completes the name without the quotes and without the .2074 at the end, giving the impression that it should work that way. It doesn't. If you use completion, you have to add the quotes and the number suffix manually around the completed name (or only press TAB after inputting the leading quote, which works correctly).

Finally, I don't know what the '2074' means, but it seems you need to find that number in order to use the symbol name in gdb. Every LTO-affected variable seems to get a different number assigned. You can find the one you're interested in via a regexp search through the symbol table like this:

(gdb) info variables usable_arenas
All variables matching regular expression "usable_arenas":

File ../Objects/obmalloc.c:
struct arena_object *usable_arenas.lto_priv.2074;
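
If you find yourself doing this often, the search can be automated through GDB's Python scripting support. Below is a rough sketch, assuming your gdb build has the Python API enabled; the helper name lto_print is just my choice. Load it with source lto_print.py and then run python lto_print("usable_arenas") at the gdb prompt:

# Rough sketch: find the mangled .lto_priv name of a static variable via
# the same regexp search and print its value. lto_print is a made-up name.
import re
import gdb

def lto_print(name):
    out = gdb.execute("info variables %s" % name, to_string=True)

    m = re.search(r"\b(%s\.lto_priv\.\d+)" % re.escape(name), out)
    if m is None:
        print("No .lto_priv symbol found for %s" % name)
        return

    # Quote the full name so that gdb doesn't interpret the dots as operators.
    value = gdb.parse_and_eval("'%s'" % m.group(1))
    print("%s = %s" % (m.group(1), value))
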
Posted by Tomaž | Categories: Code | Comments »

Checking Webmention adoption rate

25.01.2020 14:42

Webmention is a standard that attempts to give plain old web pages some of the attractions of big, centralized social media. The idea is that web servers can automatically inform each other about related content and actions. In this way a post on a self-hosted blog, like this one, can display backlinks to a post on another server that mentions it. It also makes it possible to implement gimmicks such as a like counter. Webmention is kind of a successor to pingbacks, which were popularized some time ago by WordPress. Work on standardizing Webmention seems to date back to at least 2014 and it was first published as a working draft by the W3C in 2016.

I first read about Webmention on jlelse's blog. I was wondering what the adoption of this standard is like nowadays. Some searching revealed conflicting amounts of enthusiasm for it, but not much recent information. Glenn Dixon wrote in 2017 about giving up on it due to lack of adoption. On the other hand, Ryan Barrett celebrated 1 million sent Webmentions in 2018.

To get a better feel for the state of things in my local web bubble, I've extracted all external links from my blog posts in the last two years (January 2018 to January 2020). That yielded 271 unique URLs on 145 domains from 44 blog posts. I've then used Web::Mention to discover any Webmention endpoints for these URLs. Endpoint discovery is the first step in sending a notification to a remote server about related content. If it fails, it likely means that the host doesn't implement the protocol.
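
For reference, endpoint discovery itself is not complicated. Here is a simplified sketch in Python using the requests and BeautifulSoup libraries (it skips some corner cases of the spec, and the URL at the bottom is just an example):

# Simplified sketch of Webmention endpoint discovery. Checks the HTTP Link
# header first, then <link> and <a> elements with rel="webmention".
# Not a complete implementation of the spec.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def discover_endpoint(url):
    resp = requests.get(url, timeout=10)

    # requests parses the Link header into a dict keyed by the rel value.
    link = resp.links.get("webmention")
    if link is not None:
        return urljoin(resp.url, link["url"])

    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup.find_all(["link", "a"], href=True):
        if "webmention" in (tag.get("rel") or []):
            return urljoin(resp.url, tag["href"])

    return None

print(discover_endpoint("https://example.com/a-blog-post"))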

The results weren't encouraging. None of the URLs had discoverable endpoints. That means that even if I would implement the sending part of the Webmention protocol on my blog, I wouldn't have sent any mentions in the last two years.

Another thing I wanted to check is whether anyone was doing the same in the other direction. Were there any failed incoming attempts to discover an endpoint on my end? Unfortunately there is no good way of determining that from the logs I keep. In theory, endpoint discovery can look just like a normal HTTP request. However, many Webmention implementations seem to have "webmention" in their User-Agent header. According to this heuristic I did likely receive at least 3 distinct requests for endpoint discovery in the last year. It's likely there were more (for example, I know that my log aggregates don't include requests from Wordpress plug-ins due to some filter regexps).
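
A check like that boils down to something like the following sketch. It assumes a combined-format access log where the User-Agent is the last quoted field; the log path is just an example:

# Sketch: count requests whose User-Agent mentions "webmention" in an
# access log. Assumes combined log format; the path is an example.
import re
from collections import Counter

counts = Counter()

with open("/var/log/apache2/access.log") as f:
    for line in f:
        quoted = re.findall(r'"([^"]*)"', line)
        if quoted and "webmention" in quoted[-1].lower():
            counts[quoted[-1]] += 1

for user_agent, count in counts.most_common():
    print(count, user_agent)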

So it seems that implementing this protocol isn't particularly inviting from the network effect standpoint. I also wonder whether Webmentions would become the spam magnet that pingbacks were back in the day if they reached any kind of widespread use. The standard does include a provision for endpoints to verify that the source page indeed links to the destination URL the Webmention request says it does. However, to me that protection seems trivial to circumvent and only creates a little more work for someone wanting to send out millions of spammy mentions across the web.
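
The verification step amounts to roughly the following sketch: fetch the source page and check that the target URL appears somewhere in it. A spammer only needs to make sure the link is present at the moment the endpoint fetches the page, which is why I don't think it would be much of a deterrent:

# Sketch of the source verification a receiving endpoint is supposed to do.
import requests

def verify_mention(source, target):
    resp = requests.get(source, timeout=10)
    return resp.ok and target in resp.text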

Posted by Tomaž | Categories: Code | Comments »

Radeon performance problem after suspend

08.01.2020 20:07

This is a problem I've encountered on my old desktop box that's still running Debian Stretch. Its most noticeable effect is that large GNOME terminal windows get very laggy and editing files in GVim is almost unusable due to the slow refresh rate. I'm not sure when this first started happening. I suspect it was after I upgraded the kernel to get support for the Wacom Cintiq. However, I only started noticing it much later, so it's possible that some other package upgrade triggered it. Apart from the kernel I can't find anything else (like recent Intel microcode updates) that could be affecting this issue though. On the other hand, the hardware here is almost a decade old at this point and way past due for an upgrade, so I'm not completely ruling out that something physical broke.

The ATI Radeon graphic card and the kernel that I'm using:

$ lspci | grep VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV710 [Radeon HD 4350/4550]
$ cat /proc/version
Linux version 4.19.0-0.bpo.6-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP Debian 4.19.67-2+deb10u2~bpo9+1 (2019-11-12)

The Radeon-related kernel parameters:

radeon.audio=1 radeon.hard_reset=1

I think I've added hard_reset because of some occasional hangs I was seeing a while ago. I'm not sure if it's still needed with this kernel version and I don't remember having problems with X hanging in recent times. I've also seen this exact performance problem with kernel 4.19.12 from Stretch backports. I can't reproduce the problem on stock Stretch kernel 4.9.189. Other than the kernel from backports I'm using a stock Stretch install of GNOME 3 and X.Org (xserver-xorg-video-radeon version 1:7.8.0-1+b1).

To reproduce the problem, open a GNOME terminal and resize it, say to 132x64 or something similar. The exact size isn't important. Fill the scrollback, for example by catting a large file or running yes for a second. After that, scroll the terminal contents by holding enter on the shell prompt. If everything is working correctly, the scrolling will be smooth. If this bug manifests itself, terminal contents will scroll in large, random increments, seemingly refreshing around once per second.

The second way is to open a largish (say 1000 line) text file in GVim and try to edit it or scroll through it. Again, the cursor will lag significantly behind the keyboard input. Interestingly, some applications aren't affected. For example, scrolling in Firefox or Thunderbird will remain smooth. GIMP doesn't seem to be affected much either.

On the affected kernels, I can reliably reproduce this by putting the computer to sleep (suspend to RAM - alt-click on the power button in the GNOME menu) and waking it up. After a fresh reboot, things will run normally. After a suspend, the performance problems described above manifest themselves. There is no indication in dmesg or syslog that anything went wrong at wake up.

I've tracked this down to a problem with Radeon's dynamic power saving feature. It seems that after sleep it gets stuck in its lowest performance setting and doesn't automatically adjust when some application starts actively using the GPU. I can verify that by running the following in a new terminal:

# watch cat /sys/kernel/debug/dri/0/radeon_pm_info

On an idle computer, this should display something like:

uvd    vclk: 0 dclk: 0
power level 0    sclk: 11000 mclk: 25000 vddc: 1100

After a fresh reboot, when scrolling the terminal or contents of a GVim buffer, the numbers normally jump up:

uvd    vclk: 0 dclk: 0
power level 2    sclk: 60000 mclk: 40000 vddc: 1100

However after waking the computer from sleep, the numbers in radeon_pm_info stay constant, regardless of any activity in the terminal window or GVim.

I've found a workaround to get the power management working again. The following script forces the DPM into the high profile and then resets it to whatever it was before (it's auto on my system). This seems to fix the problem and it can be verified through the radeon_pm_info method I described above. Most importantly, this indeed seems to restore the automatic adjustment. According to radeon_pm_info the card doesn't just get stuck again at the highest setting.

$ cat /usr/local/bin/radeon_dpm_workaround.sh
#!/bin/bash

set -eu

DPM_FORCE=/sys/class/drm/card0/device/power_dpm_force_performance_level

CUR=`cat "$DPM_FORCE"`
echo high > "$DPM_FORCE"
sleep 1
echo "$CUR" > "$DPM_FORCE"

To get this to automatically run each time the computer wakes from sleep, I've used the following systemd service file:

$ cat /etc/systemd/system/radeon_dpm_workaround.service
[Unit]
After=suspend.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/radeon_dpm_workaround.sh

[Install]
WantedBy=suspend.target

It needs to be enabled via:

# systemctl enable radeon_dpm_workaround.service

As I said in the introduction, this is a fairly old setup (it even still has a 3.5" floppy drive!). However, after many years it's still doing its job reasonably well and hence I never seem to find the motivation to upgrade it. This series of Radeon cards does seem to have somewhat buggy support in the open source drivers. I've always had some degree of problems with it. For a long time HDMI audio was very unreliable, and another problem that I still see sometimes is that shutting down X hangs the system for several minutes.

Posted by Tomaž | Categories: Code | Comments »

Dropping the "publicsuffix" Python package

02.12.2019 10:50

I have just released version 1.1.1 of the publicsuffix Python package. Barring any major bugs that would affect some popular software package using it, this will be the last release. I've released v1.1.1 because I received a report that a bug in the publicsuffix package was preventing installation of GNU Mailman.

In the grand scheme of things, it's not a big deal. It's a small library with a modest number of users. I haven't done any work, short of answering mail about it, since 2015. Drop-in alternatives exist. People that care strongly about the issues I cover below have most likely already switched to one of the forks and rewrites that popped up over the years. For those that don't care, nothing will change. The code still works and the library is still normally installable from PyPi. Debian package continues to exist. The purpose of this post is more to give some closure and to sum up a few mail threads that started back in 2015 and never reached a conclusion.

Screenshot of the publicsuffix package page on PyPi.

I first released the publicsuffix library back in 2011, two employers and a life ago. Back then there was no easily accessible Python implementation of Mozilla's Public Suffix List. Since I needed one for my work, I picked up a source file from an abandoned open source project on Google Code (which itself was just being abandoned by Google around that time). I did some minor work on it to make it usable as a standalone library and published it on PyPi.

I've not used publicsuffix myself for years. Looking back, most of my open source projects that I still maintain seem to be like that. Even though I don't use them, I feel some obligation to do basic maintenance on them and answer support mail. If not for other reasons, then out of a sense that I should give back to the body of free software that I depend so much on in my professional career. Some technical problems are also simply fun to work on and most of the time there's not much pressure.

However, one thing that was a source of long discussions about publicsuffix is the way the PSL data is distributed. I've written previously about the issue. In summary, you either distribute stale data with the code or fetch an up-to-date copy via the network, which is a privacy problem. These two are the only options possible and going with one or the other or both was always going to be a problem for someone. I hate software that phones home (well, phones Mozilla in this case) as much as anyone, but it's a problem that I, as a mere maintainer of a Python library, had no hope of solving, even if I got CC'd in all the threads discussing it.
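
For illustration, this is roughly how the two options look from the library user's side (a sketch based on the 1.1 API):

# Sketch of the two ways of getting PSL data with publicsuffix 1.1: either
# use the stale copy bundled with the package, or fetch a fresh copy over
# the network (the privacy trade-off discussed above).
from publicsuffix import PublicSuffixList, fetch

# Option 1: the list that shipped with the package (possibly outdated).
psl = PublicSuffixList()
print(psl.get_public_suffix("www.example.org.uk"))

# Option 2: download the current list, trading privacy for freshness.
psl_fresh = PublicSuffixList(fetch())
print(psl_fresh.get_public_suffix("www.example.org.uk"))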

The Public Suffix List is a funny thing. Ideally, software either should not care about the semantic meaning of domain names or this meaning should be embedded in the basic infrastructure of the Internet (e.g. DNS or something). But alas we don't live in either of those worlds and hence we have a magic text file that lives on a HTTP server somewhere and some software needs to have access to it if it wants to do its thing. No amount of worrying on my part was going to change that.

Screenshot of publicsuffix forks on GitHub.

The other issue that sparked at least one fork of publicsuffix was the fact that I refused to publish the source on GitHub. Even though there are usually several copies of the publicsuffix code on GitHub at any time, none of them are mine. I was instead hosting my own git repo and was accepting bug reports and other comments only over email.

Some time ago, GitHub became synonymous with open source. People simply expect a PyPi package to have a GitHub (or GitLab, or BitBucket) point-and-click interface somewhere on the web. The practical problem I have with that is that it hugely increases the amount of effort I have to spend on a project (subjectively speaking - keep in mind this is something I do in my free time). Yes, it makes it trivial for someone to contribute a patch. However, in practice I find that it does not result in a greater quantity of meaningful patches or bug reports. What it does do is create more work for me dealing with low-effort contributions I must reject.

I'm talking about a daunting asymmetry in communication. Writing two sentences in a hurry in a GitHub issue or pushing a bunch of untested code my way in a pull request can take all of a minute for the submitter. On the other hand, I don't want to discourage people from contributing to free software and I know how frustrating it can be to contribute to open source projects (see my post about drive by contributions). So I try to take some time to study the pull request and write an intelligible and useful answer. However this is simply not sustainable. Looking back I also seem to often fail at not letting my frustration show through in my answer. Hence I feel like requiring contributors to at least know how to use git format-patch and write an email forms a useful barrier to entry. It prevents frustration at both ends and I believe that for a well thought-out contribution, the overhead of opening a mail client should be negligible.

Of course, if the project is not officially present on GitHub you get the current situation, where multiple public copies of the project still exist on GitHub, made by random people for their own use. These copies often keep my contact details in and don't obviously state that the code has been modified and/or is not related to the PyPi releases. This causes confusion, since the code on GitHub is not the same as the one on PyPi. People also sometimes reuse version numbers for their own private use that conflict with version numbers on PyPi, and so on and so on. It is kind of a damned-if-you-do, damned-if-you-don't situation really.


How can I sum this up? I've maintained this software for around 8 years, well after I left the company for which it was originally developed. During that time people have forked and rewritten it for various, largely non-technical reasons. That's fine. It's how free software is supposed to work and my own package was based on another one that got abandoned. I might still be happy to work on technical issues, but the part that turned out to be much more exhausting than working on the code was dealing with the social and ideological issues people had with it. It's probably my failing that I've spent so much thought on those. In the end, my own interests have changed during that time as well and finally letting it go also feels like a weight off my shoulders.

Posted by Tomaž | Categories: Code | Comments »

ZX81 LPRINT bug and software archaeology

04.11.2019 19:07

By some coincidence I happened to stumble upon a week-old, unanswered question posted to Hacker News regarding a bug in Sinclair BASIC on a Timex Sinclair 1000 microcomputer. While I never owned a TS1000, the post attracted my interest. I've studied ZX81, an almost identical microcomputer, extensively when I was doing my research on Galaksija. It also reminded me of a now almost forgotten idea to write a post on some obscure BASIC bugs in Galaksija's ROM that I found mentioned in contemporary literature.

ZX81 exhibited at the Frisk festival.

The question on Hacker News is about the cause of a bug where the computer, when attached to a printer, would print out certain floating point numbers incorrectly. The most famous example, mentioned in the Wikipedia article on Timex Sinclair 1000, is the printout of 0.00001. The BASIC statement:

LPRINT 0.00001

unexpectedly types out the following on paper:

0.0XYZ1

This bug occurs both on the Timex Sinclair 1000 and on the Sinclair ZX81, since both computers share the same ROM code. Only the first zero after the decimal point is printed correctly, while the subsequent zeros seem to be replaced with random alphanumeric characters. The non-zero digit at the end is again printed correctly. Interestingly, this only happens when using the LPRINT (line-printer print) statement that makes a hard copy of the output on paper using a printer. The similar PRINT statement that displays the output on the TV screen works correctly (you can try it out on JtyOne's Online Emulator).

The cause of the bug lies in the code that takes a numerical value in the internal format of the BASIC's floating point calculator and prints out individual characters. One particular part of the code determines the number of zeros after the decimal point and uses a loop to print them out:

;; PF-ZEROS
L16B2:  NEG                     ; Prepare number of zeros
        LD      B,A             ; to print in B.

        LD      A,$1B           ; Print out character '.'
        RST     10H             ; 

        LD      A,$1C           ; Prepare character '0' 
				; to print out in A.

;; PF-ZRO-LP
L16BA:  RST     10H             ; Call "print character" routine
        DJNZ    L16BA           ; and loop back B times.

(This assembly listing is taken from Geoff Wearmouth's disassembly. Comments are mine.)

The restart 10h takes a character code in register A and either prints it out on the screen or sends it to the printer. Restarts are a bit like simple system calls - they are an efficient way to call an often-used routine on the Z80 CPU. The problem lies in the fact that this restart doesn't preserve the contents of the A register. It does preserve the contents of register B and the other main registers through the use of the EXX instruction and the shadow registers, however the original contents of A are lost after the call returns.

Since the code above doesn't reset the contents of the A register after each iteration, only the first zero after the decimal point is printed correctly. Subsequent zeros are replaced with whatever junk was left in the A register by the 10h restart code. The solution is to simply adjust the DJNZ instruction to loop back two bytes earlier, to the LD instruction, so that the character code is stored to A in each iteration. You can see this fix in Geoff's customized ZX81 ROM, or in the Timex Sinclair 1500 ROM (see line 3835 in this diff between TS1500 and TS1000).

This exact same code is also used when displaying numbers on the TV screen, however in that case it works correctly. The reason is that when set to print to the screen, printing the character '0' via the 10h restart actually preserves the contents of register A. Looking at the disassembly I suspect that was simply a lucky coincidence and not a conscious decision by the programmer. Any code calling 10h doesn't know whether the printer or the screen is used, and hence must assume that A isn't preserved anyway.


Of course, I'm far from being the first person to write about this particular Sinclair bug. Why then does the post on Hacker News say that there's little information to be found about it? The Wikipedia article doesn't cite a reference for this bug either.

It turns out that during my search for the answer, the three most useful pages were no longer on-line. Paul Farrow's ZX resource centre, S. C. Agate's ZX81 ROMs page and Geoff Wearmouth's Sinclair ROM disassemblies are wonderful historical resources that must have taken a lot of love and effort to put together. Sadly, they are now only accessible through snapshots on the Internet Archive's Wayback Machine. If I didn't know about them beforehand, I probably wouldn't be able to find them now. For the last one you even need to know what particular time range to look at on Archive.org, since the domain was taken over by squatters and recent snapshots only show ads (incidentally, this is also the reason why I'm re-hosting some of its former content).

I feel like we can still learn a lot from these early home computers and I'm happy that questions about them still pop up in various forums. This LPRINT bug seems to be a case of a faulty generalization. It's a well-known type of mistake where the programmer wrongly generalizes an assumption (10h preserves A) that is in fact only true in a special case (displaying a character on the screen). History tends to repeat itself and I believe that many of the blunders in modern software wouldn't happen if software developers were more aware of the history of their trade.

It's sad that these old devices are disappearing and that primary literature sources about them are hard to find, but I find it even more concerning that now it seems also these secondary sources are slowly fading out from general accessibility on the web.

Posted by Tomaž | Categories: Code | Comments »

Specializing C++ templates on AVR pins

09.10.2019 19:56

AVR is a popular processor core for microcontrollers, probably most well known from the ATmega chips on the classic Arduino boards. There's an excellent, free software C and C++ compiler toolchain available for them, which also usually sits behind the well-known Arduino IDE. AVR is a Harvard architecture, which means that code and data memory (i.e. flash and RAM) have separate address spaces. On these small, 8-bit chips there is very little RAM space (can be as little as 128 bytes on ATtiny devices). On the other hand, the amount of flash is usually much less restrictive.

Sometimes you want the microcontroller to drive multiple instances of the same external peripheral. Let's say for example this peripheral uses one GPIO pin and two instances differ only in the pin they are connected to. This means that the code in the microcontroller for peripheral 1 differs from code for peripheral 2 only by the address of the special function register (SFR) it uses and a bit mask. A naive way of implementing this in clean C++ would be something like the following:

#include <avr/io.h>

class DIO {
	public:
		PORT_t * const port;
		const int pin_bm;

		DIO(PORT_t* port_, int pin_bm_) 
			: port(port_), pin_bm(pin_bm_) {}

		void setup(void) const
		{
			port->DIR |= pin_bm;
		}

		// other methods would go here ...
};

DIO dio1(&PORTB, PIN4_bm);
DIO dio2(&PORTB, PIN5_bm);

int main(void)
{
	dio1.setup();
	dio2.setup();

	// other code would go here ...

	while (1);
}

Note that I'm using the nice PORT_t struct and the bit mask constants provided by avr-gcc headers instead of messing around with register addresses and such.

Unfortunately, this isn't optimal. Even though the class members are declared as const, the compiler stores them in RAM. This means that each instance of the DIO class costs 4 precious bytes of RAM that, barring a stray cosmic particle, will never ever change during the course of the program.

Compiling the code above for the ATtiny416 with avr-gcc 5.4.0 confirms this:

Program Memory Usage  : 204 bytes   5,0 % Full
Data Memory Usage     : 8 bytes   3,1 % Full

We could store member variables as data in program space. This is possible by using some special instructions and variable attributes (PROGMEM, pgm_read_byte() etc.), but it's kind of ugly and doesn't play nicely with the C language. Another way is to actually have copies of the binary code for each peripheral, but only change the constants in each copy.

Obviously we don't want to have multiple copies in the source as well. In C the usual approach is to do something rude with the C preprocessor and macros. Luckily, C++ offers a nicer way via templates. We can make a template class that takes non-type parameters and specialize that class on the register and bit mask it should use. Ideally, we would use something simple like this:

template <PORT_t* port, int pin_bm> class DIO {
	public:
		void setup(void) {
			port->DIR |= pin_bm;
		}
};

DIO<(&PORTB), PIN4_bm> dio1;
DIO<(&PORTB), PIN5_bm> dio2;

Sadly, this does not work. Behind the scenes, PORTB is a macro that casts a constant integer to the PORT_t struct via a pointer:

#define PORTB    (*(PORT_t *) 0x0620)

&PORTB evaluates back to an integral constant, however the compiler isn't smart enough to realize that and balks at seeing a structure and a pointer dereference:

main.cpp(8,7): error: a cast to a type other than an integral or enumeration type cannot appear in a constant-expression
DIO<(&PORTB), PIN4_bm> dio1;
main.cpp(8,7): error: '*' cannot appear in a constant-expression
main.cpp(8,7): error: '&' cannot appear in a constant-expression

I found a thread on the AVR freaks forum which talks about this exact problem. The discussion meanders around the usefulness of software abstractions and the ability of hardware engineers to follow orders, but fails to find a solution that does not involve a soup of C preprocessor macros.

Fortunately, there does seem to exist an elegant C++ solution. The compiler's restriction can be worked around without resorting to preprocessor magic and while still using the convenient constants defined in the avr-gcc headers. We only need to evaluate the macro in another expression and do the cast to PORT_t inside the template:

template <int port_addr, int pin_bm> class DIO {
	public:
		void setup(void) {
			port()->DIR |= pin_bm;

		}

		static PORT_t* port()
		{
			return (PORT_t*) port_addr;
		}
};

static const int PORTB_ADDR = (int) &PORTB; 
DIO<PORTB_ADDR, PIN4_bm> dio1;
DIO<PORTB_ADDR, PIN5_bm> dio2;

This compiles and works fine with avr-gcc. As intended, it uses zero bytes of RAM. Interestingly enough, it also significantly reduces the code size, something I did not expect:

Program Memory Usage  : 108 bytes   2,6 % Full
Data Memory Usage     : 0 bytes   0,0 % Full

I'm not sure why the first version using member variables produces so much more code. A brief glance at the disassembly suggests that the extra code comes from the constructor. I would guess that using a code-free constructor with initializer lists should generate minimal extra instructions. Apparently that is not so here. In any case, I'm not complaining. I'm also sure that in a non-toy example, the template version will use more flash space. After all, it has to duplicate the code for all methods.

You might also wonder how Arduino does this with the digitalWrite() function and friends. They store a number of data tables in program memory and then have preprocessor macros to read them out using pgm_read_byte(). As I mentioned above, this works, but isn't nicely structured and it's pretty far from clean, object-oriented C++ programming.

Posted by Tomaž | Categories: Code | Comments »

z80dasm 1.1.6

09.09.2019 20:32

z80dasm is a command-line disassembler for the Z80 CPU. I initially released it way back in 2007 when I was first exploring Galaksija's ROM and other disassemblers I could find didn't do a good enough job for me. Since then it accumulated a few more improvements and bug fixes as I received feedback and patches from various contributors.

Version 1.1.6 is a release that was way overdue. Most importantly, it fixes a segmentation fault bug that several people have reported. The patch for the bug has actually been committed to the git repository since June last year, but somehow I forgot to bump the version and roll up a new release.

The problem appeared when feeding a symbol file that was generated by z80dasm back as input to z80dasm (possibly after editing some symbol names). This is something that the man page explicitly mentions is supported. However, when this was done together with block definitions, it caused z80dasm to segfault with a NULL pointer dereference. Some code didn't expect that the symbol automatically generated to mark a block start could already be defined via the input symbol file. Thanks to J. B. Langston for first sending me the report and analysis of the crash.

I took this opportunity to review the rest of the symbol handling code and do some further clean ups. It has also led me to implement a feature that I have been asked for in the past. z80dasm now has the ability to sort the symbol table before writing it out to the symbol file.

More specifically, there is now a --sym-order command-line option that either takes a default or frequency argument. Default leaves the ordering as it was in the previous versions - ordered by symbol value. Frequency sorts the symbol table by how frequently a symbol is used in the disassembly. The most commonly used symbols are written at the start of the symbol file. When first approaching an unknown binary, this might help you to identify the most commonly used subroutines.

Anyway, the new release is available from the usual place. See the included README file for build and installation instructions. z80dasm is also included in Debian, however the new release is not in yet (if you're a Debian developer and would like to sponsor an upload, please get in touch).

Posted by Tomaž | Categories: Code | Comments »

Quick and ugly WFM data export for Rigol DS2072A

15.08.2019 14:48

At work I use a Rigol DS2072A oscilloscope. It's quite a featureful little two-channel digital scope that mostly does the job that I need it for. It can be buggy at times though and with experience I learned to avoid some of its features. Like for example the screenshot tool that sometimes, but not always, captures a perfectly plausible PNG that actually contains something different than what was displayed on the physical screen at the time. I'm not joking - I think there's some kind of a double-buffering issue there.

Recently I was using it to capture some waveforms that I wanted to further process on my computer. On most modern digital scopes that's a simple matter of exporting a trace to a CSV file on a USB stick. DS2072A indeed has this feature, however I soon found out that it is unbearably slow. Exporting 1.4 Msamples took nearly 6 minutes. I'm guessing exporting a full 14 Msample capture would take an hour - I've never had the patience to actually wait for one to finish and the progress indicator indeed remained pegged at 0% until I reset the scope in frustration. I planned to do many captures, so that approach was clearly unusable.

Rigol DS2072A oscilloscope.

Luckily, there's also an option for a binary export that creates WFM files. Exporting to those is much faster than to the text-based CSV format, but on the other hand it creates binary blobs that apparently only the scope itself can read. I found the open source pyRigolWFM tool for reading WFM files, but unfortunately it only seems to support the DS1000 series and doesn't work with files produced by DS2072A. There's also Rigol's WFM converter, but again it only works with DS4000 and DS6000 series, so I had no luck with that either.

I noticed that the sizes of the WFM files in bytes were similar to the number of samples they were supposed to contain, so I guessed that extracting raw data from them wouldn't be that complicated - they can't be compressed and there are only so many ways you can shuffle bytes around. The only weird thing was that files containing the same number of samples were all of slightly different sizes. A comment on the pyRigolWFM issue tracker mentioned that the WFM files are more or less a memory dump of the scope's memory, which gave me hope that their format isn't very complicated.

After some messing around in a Jupyter Notebook I came up with the following code that extracted the data I needed from WFM files into a Numpy array:

import numpy as np
import struct

def load_wfm(path):
    with open(path, 'rb') as f:
        header = f.read(0x200) 
        
    magic = header[0x000:0x002]
    assert magic == b'\xa5\xa5'
        
    offset_1 = struct.unpack('<i', header[0x044:0x048])[0]
    offset_2 = struct.unpack('<i', header[0x048:0x04c])[0]
    n_samples = struct.unpack('<i', header[0x05c:0x060])[0]
    sample_rate = struct.unpack('<i', header[0x17c:0x180])[0]
    
    assert n_samples % 2 == 0
    
    pagesize = n_samples//2
        
    data = np.fromfile(path, dtype=np.uint8)
    
    t = np.arange(n_samples)/sample_rate
    x0 = np.empty(n_samples)
    
    # Samples are interleaved on two (?) pages
    x0[0::2] = data[offset_1:offset_1+pagesize]
    x0[1::2] = data[offset_2:offset_2+pagesize]
    
    # These will depend on attenuator settings. I'm not sure
    # how to read them from the file, but it's easy to guess 
    # them manually when comparing file contents to what is
    # shown on the screen.
    n = -0.4
    k = 0.2
    
    x = x0*k + n
    
    return t, x
    
t, x = load_wfm("Newfile1.wfm")

Basically, the file consists of a header and sample buffer. The header contains metadata about the capture, like the sample rate and number of captured samples. It also contains pointers into the buffer. Each sample in a trace is represented by one byte. I'm guessing it is a raw, unsigned 8-bit value from the ADC. That value needs to be scaled according to the attenuator and probe settings to get the measured voltage level. I didn't manage to figure out how the attenuator settings were stored in the header. I calculated the scaling constants manually, by comparing the raw values with what was displayed on the scope's screen. Since I was doing all captures at the same settings that worked for me.

I also didn't bother to completely understand the layout of the file. The code above worked for exports containing only channel 1. In all my files the samples were interleaved in two memory pages: even samples were in one page, odd samples in another. I'm not sure if that's always the case and the code obviously does not attempt to cover any other layout.

Here is a plot of the data I extracted for the trace that is shown on the photograph above:

Plot of the sample data extracted from the WFM file.

I compared the trace data I extracted from the WFM file with the data from the CSV file that is generated by the oscilloscope's own slow CSV export function. The differences between the two are on the order of 10⁻¹⁵. That is most likely due to the floating point precision. For all practical purposes, the values from both exports are identical:

Difference between data from the WFM file and the CSV export.
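
A comparison like that takes just a few lines, using the load_wfm() function defined above. Note that the layout of the CSV export is an assumption here (comma-separated time and voltage columns with a couple of header lines), so the skip_header value and column index may need adjusting:

# Sketch of the comparison. The CSV layout is an assumption: comma-separated
# time and voltage columns with two header lines; adjust as needed.
import numpy as np

t, x = load_wfm("Newfile1.wfm")

csv_data = np.genfromtxt("Newfile1.csv", delimiter=",", skip_header=2)
x_csv = csv_data[:len(x), 1]

print(np.max(np.abs(x - x_csv)))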

Anyway, I hope that's useful for anyone else who needs to extract data from these scopes. Just please be aware that this is only a minimal viable solution for what I needed to do - the code will need some further hacking if you want to apply it to your own files.

Posted by Tomaž | Categories: Code | Comments »

When the terminal is not enough

19.04.2019 9:26

Sometimes I'm surprised by utilities that run fine in a terminal window on a graphical desktop but fail to do so in a remote SSH connection. Unfortunately, this seems to be a side-effect of software on desktop Linux getting more tightly integrated. These days, it's more and more common to see a command-line tool pop up a graphical dialog. It looks fancy, and there might be security benefits in the case of passphrase entries, but it also means that often the remote use case with no access to the local desktop gets overlooked.

The GNOME keyring management is the offender I bump against most often. It tries to handle all entry of sensitive credentials, like passwords and PINs, on a system and integrates with the SSH and GPG agents. I remember that it used to interfere with private SSH key passphrase entry when jumping from one SSH connection to another, but that seems to be fixed in Debian Stretch.

On the other hand, running GPG in an SSH session by default still doesn't work (this might also pop up when, for example, signing git tags):

$ gpg -s gpg_test
gpg: using "0A822E7A" as default secret key for signing
gpg: signing failed: Operation cancelled
gpg: signing failed: Operation cancelled

This happens when that same user is also logged in to the graphical desktop, but the graphical session is locked. I'm not sure exactly what happens in the background, but something somewhere seems to cancel the passphrase entry request.

The solution is to set the GPG agent to use the text-based pin entry tool. Install the pinentry-tty package and put the following into ~/.gnupg/gpg-agent.conf:

pinentry-program /usr/bin/pinentry-tty

After this, the passphrase can be entered through the SSH session:

$ gpg -s gpg_test
gpg: using "0A822E7A" as default secret key for signing
Please enter the passphrase to unlock the OpenPGP secret key:
"Tomaž Šolc (Avian) <tomaz.solc@tablix.org>"
4096-bit RSA key, ID 059A0D2C0A822E7A,
created 2013-01-13.

Passphrase:

Update: Note however that with this change in place graphical programs that run GPG without a terminal, such as Thunderbird's Enigmail extension, will not work.

Other offenders are the PulseAudio- and systemd-related tools. For example, inspecting the state of the sound system over SSH fails with:

$ pactl list
Connection failure: Connection refused
pa_context_connect() failed: Connection refused

Here, the error message is a bit misleading. The problem is that the XDG environment variables aren't set up properly. Specifically for PulseAudio, the XDG_RUNTIME_DIR should be set to something like /run/user/1000.

This environment is usually set up by the pam_systemd.so PAM module. However, some overzealous system administrators disable the use of PAM for SSH connections, so it might not be set in an SSH session. To have these variables set up automatically, you should have the following in your /etc/ssh/sshd_config:

UsePAM yes
Posted by Tomaž | Categories: Code | Comments »

Booting Raspbian from Barebox

01.03.2019 18:28

Barebox is an open source bootloader intended for embedded systems. In the past few weeks I've been spending a lot of time at work polishing its support for Raspberry Pi boards and fixing some rough edges. Some of my patches have already been merged upstream, and I'm hoping the remainder will end up in one of the upcoming releases. I thought I might as well write a short tutorial on how to get it running on a Raspberry Pi 3 with the stock Raspbian Linux distribution.

The Broadcom BCM28xx SoC on the Raspberry Pi has a peculiar boot process. First, a proprietary binary blob is started on the VideoCore processor. This firmware reads some config files and does some hardware initialization. Finally, it brings up the ARM core, loads a kernel image (or a bootloader, as we'll see later) and jumps to it. When starting the kernel, it also passes along some hardware information, such as a device tree and boot arguments. Raspbian Linux ships with its own patched Linux kernel that tends to be relatively tightly coupled with the VideoCore firmware and the information it provides.

On the other hand, upstream Linux kernels tend to ignore the VideoCore as much as possible. This makes them much easier to use with a bootloader. In fact, a bootloader is required since they can't use the device tree provided by the VideoCore firmware. Unfortunately, upstream kernels also historically have worse support for various peripherals found on Raspberry Pi boards, so sometimes you can't use them. In the rest of this text, I'll be only talking about Raspbian kernels as most of my work focused on making the VideoCore interoperation work correctly for them.

Raspberry Pi 3 connected to a USB-to-serial adapter.

To get started, you'll need a Raspberry Pi 3 board and an SD card with Raspbian Stretch Lite. You'll also need a USB-to-serial adapter that uses 3.3V levels (I'm using this one) and another computer to connect it to. Connect ground, pin 14 and pin 15 on the Raspberry Pi header to ground, RXD and TXD on the serial adapter, respectively. We will use this connection to access the Barebox serial console. For the parts where we will interact with the Linux system on the Raspberry Pi you can also connect it to an HDMI monitor and a keyboard. I prefer to work over ssh.

Run the following on the computer to bring up the serial terminal at 115200 baud (adjust the device path as necessary to access your serial adapter):

$ screen /dev/ttyUSB0 115200

It's possible to cross-compile Barebox for ARM on an Intel computer. However, for simplicity we'll just compile Barebox on the Raspberry Pi itself since all the tools are already available there. If you intend to experiment, it's worth setting up either the cross-compiler or two Raspberry Pis: one for compilation and one for testing.
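
For reference, cross-compiling is the same process as described below, just done on the other machine with a cross toolchain selected through the usual kernel-style variables. Something along these lines should work, assuming the toolchain from the Debian gcc-arm-linux-gnueabihf package:

$ export ARCH=arm
$ export CROSS_COMPILE=arm-linux-gnueabihf-
$ make rpi_defconfig
$ make -j$(nproc)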

Now log in to your Raspberry Pi and open a terminal, either through ssh or the monitor and keyboard. Fetch the Barebox source. At the time of writing, not all of my patches (1, 2) are in the upstream repository yet. You can get a patched tree based on the current next branch from my GitHub fork:

$ git clone https://github.com/avian2/barebox.git
$ cd barebox

Update: Barebox v2019.04.0 release should contain everything shown in this blog post.

In the top directory of the Barebox source tree, run the following to configure the build for the Raspberry Pi:

$ export ARCH=arm
$ make rpi_defconfig

At this point you can also run make menuconfig and poke around. Barebox uses the same configuration and build system as the Linux kernel, so if you've ever compiled Linux the interface should be very familiar. When you are ready to compile, run:

$ make -j4

This should take around a minute on a Raspberry Pi 3. If compilation fails, check if you are missing any tools and install them. You might need to install flex and bison with apt-get.
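
On Raspbian that would be something like:

$ sudo apt-get install flex bison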

After a successful compilation the bootloader binaries end up in the images directory. If you didn't modify the default config you should get several builds for different versions of the Raspberry Pi board. Since we're using Raspberry Pi 3, we need barebox-raspberry-pi-3.img. Copy this to the /boot partition:

$ sudo cp images/barebox-raspberry-pi-3.img /boot/barebox.img

Now put the following into /boot/config.txt. What this does is tell the VideoCore firmware to load the Barebox image instead of the Linux kernel and to bring up the serial interface hardware for us:

kernel=barebox.img
enable_uart=1

At this point you should be ready to reboot the Raspberry Pi. Do a graceful shutdown with the reboot command. After a few moments you should see the following text appear on the serial terminal on your computer:

barebox 2019.02.0-00345-g0741a17e3 #183 Fri Mar 1 09:44:11 GMT 2019


Board: RaspberryPi Model 3B
bcm2835-gpio 3f200000.gpio@7e200000.of: probed gpiochip-1 with base 0
bcm2835_mci 3f300000.sdhci@7e300000.of: registered as mci0
malloc space: 0x0fe7e5c0 -> 0x1fcfcb7f (size 254.5 MiB)
mci0: detected SD card version 2.0
mci0: registered disk0
environment load /boot/barebox.env: No such file or directory
Maybe you have to create the partition.
running /env/bin/init...

Hit any key to stop autoboot:    3

Press a key to drop into the Hush shell. You'll get a prompt that is very similar to Bash. You can use cd to move around and ls to inspect the filesystem. Barebox uses the concept of filesystem mounts just like Linux. By default, the boot partition on the SD card gets mounted under /boot. The root is a virtual RAM-only filesystem that is not persistent.

The shell supports simple scripting with local and global variables. Barebox calls the shell scripts that get executed on boot the bootloader environment. You can find those under /env. The default environment on Raspberry Pi currently doesn't do much, so if you intend to use Barebox in some application, you will need to do the scripting yourself.

To finally boot the Raspbian system, enter the following commands:

barebox@RaspberryPi Model 3B:/ global linux.bootargs.vc="$global.vc.bootargs"
barebox@RaspberryPi Model 3B:/ bootm -o /vc.dtb /boot/kernel7.img

Loading ARM Linux zImage '/boot/kernel7.img'
Loading devicetree from '/vc.dtb'
commandline: console=ttyAMA0,115200 console=ttyS1,115200n8   8250.nr_uarts=1 bcm2708_fb.fbwidth=656 bcm2708_fb.fbheight=416 bcm2708_fb.fbswap=1 vc_mem.mem_base=0x3ec00000 vc_mem.mem_size=0x40000000  dwc_otg.lpm_enable=0 console=tty1 root=PARTUUID=1c5963cc-02 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait
Starting kernel in hyp mode

The global command adds the kernel command-line from VideoCore to the final kernel boot arguments. Barebox saves the command-line that was passed to it from the firmware into the vc.bootargs variable. When booting the kernel, Barebox constructs the final boot arguments by concatenating all global variables that start with linux.bootargs.

The bootm command actually boots the ARMv7 kernel (/boot/kernel7.img) from the SD card using the device tree from the VideoCore. Each time it is started, Barebox saves the VideoCore device tree to its own root filesystem at /vc.dtb. A few moments after issuing the bootm command, the Raspbian system should come up and you should be able to log into it. Depending on whether the serial console has been enabled in the Linux kernel, you might also see kernel boot messages and a log-in prompt in the serial terminal.
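
To avoid typing these two commands on every boot, they can be put into a script in the Barebox environment. The following is just a sketch: /env/boot/raspbian is a hypothetical boot entry name, and persisting the environment is not covered here:

#!/bin/sh
# Pass the VideoCore-provided command-line on to the kernel and boot it
# together with the VideoCore device tree.
global linux.bootargs.vc="$global.vc.bootargs"
bootm -o /vc.dtb /boot/kernel7.img

With something like that in place, boot raspbian at the Hush prompt should do the same as typing the commands manually, and saveenv can be used to make the environment persistent.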

You can see that the system was booted from Barebox by inspecting the device tree:

$ cat /sys/firmware/devicetree/base/chosen/barebox-version
barebox-2019.02.0-00345-g0741a17e3

Since we passed both the VideoCore device tree contents and the command-line to the kernel, the system should work as if no bootloader was actually used. This means that settings in cmdline.txt and config.txt, such as device tree overlays and HDMI configuration, should be correctly honored.

In the end, why would you want to use such a bootloader? Increasing reliability and resistance to bad system updates is one reason. For example, Barebox can enable the BCM28xx hardware watchdog early in the boot process. This means that booting a bad kernel is less likely to result in a system that needs a manual reset. A related Barebox feature is the bootchooser which can be used to implement the A/B root partition scheme, so that a failed boot from one root automatically results in the bootloader falling back to the other root. Note however that one thing still missing in Barebox is the support for reading reset-source on Raspberry Pi, so you can't yet distinguish between a power-on reset and a watchdog running out.
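
As an illustration, enabling the watchdog from the environment early in boot could look something like the line below, added to one of the startup scripts. This is only a sketch and assumes watchdog support for the board is compiled into Barebox:

# Start the hardware watchdog with a 30 second timeout; if the booted
# system never refreshes it, the board resets itself.
wd 30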

As was pointed out on the Barebox mailing list, this is not perfect. Discussion on the design of the hardware watchdog aside, the bootloader also can't help with VideoCore firmware updates, since the firmware gets started before the bootloader. Also, since the device tree and kernel in Raspbian are coupled with the VideoCore firmware, kernel updates in Raspbian can be problematic as well, even with a bootchooser setup. However, a step towards a more reliable system, even if not a perfect one, can still solve some real-life problems.

Posted by Tomaž | Categories: Code | Comments »

The case of disappearing PulseAudio daemon

03.11.2018 19:39

Recently I was debugging an unusual problem with the PulseAudio daemon. PulseAudio handles high-level details of audio recording, playback and processing. These days it is used by default in many desktop Linux distributions. The problem I was investigating was on an embedded device using Linux kernel 4.9. I relatively quickly found a way to reproduce it. However, finding the actual cause was surprisingly difficult and led me into the kernel source and into learning about the real-time scheduler. I thought I might share this story in case someone else finds themselves in a similar situation.

The basic issue was that PulseAudio daemon occasionally restarted. This reset some run-time configuration that should not be lost while the device was operating and also broke connections to clients that were talking to the daemon. Restarts were seemingly connected to the daemon or the CPU being under load. For example, the daemon would restart if many audio files were played simultaneously on it. I could reproduce the problem on PulseAudio 12.0 using a shell script similar to the following:

n=0
while [ $n -lt 100 ]; do
	pacmd play-file foo.wav 0
	n=$(($n+1))
done

This triggers 100 playbacks of the foo.wav file at almost the same instant and would reliably make the daemon restart on the device. However, I was sometimes seeing restarts with fewer than ten simultaneous audio plays. A similar script with 1000 plays would sometimes also cause a restart on my Intel-based laptop using the same PulseAudio version. This made it easier to investigate the problem since I didn't need to do the debugging remotely on the device.

syslog held some clues about what was happening. The PulseAudio process was apparently being sent the SIGKILL signal. systemd detected that and restarted the service shortly after:

systemd[1]: pulseaudio.service: Main process exited, code=killed, status=9/KILL
systemd[1]: pulseaudio.service: Unit entered failed state.
systemd[1]: pulseaudio.service: Failed with result 'signal'.

However, there were no other clues as to what sent the SIGKILL or why. The kernel log from dmesg had nothing whatsoever related to this. Increasing logging verbosity in systemd and PulseAudio showed nothing relevant. Attaching a debugger and strace to the PulseAudio process showed that the signal was received at seemingly random points in the execution of the daemon. This suggested that the problem was not directly related to any specific line of code, but otherwise led me nowhere.

When I searched the web, all suggestions seemed to point to the kernel killing the process due to an out-of-memory condition. The kernel being the cause of the signal seemed reasonable, however an OOM condition is usually clearly logged. After I learned about systemd's resource control feature, another guess was that systemd itself was killing the process. This turned out to be a dead end as well.

The first really useful clue came when Gašper gave me a crash course in using perf. After setting up debugging symbols for the kernel and the PulseAudio binary, I used the following:

$ perf_4.9 record -g -a -e 'signal:*'
(trigger the restart here)
$ perf_4.9 script

This printed out a nice backtrace of the code that generated the problematic SIGKILL signal (amid noise from other, uninteresting signals being sent between processes on the system):

alsa-sink-ALC32 12510 [001] 24535.773432: signal:signal_generate: sig=9 errno=0 code=128 comm=alsa-sink-ALC32 pid=12510 grp=1 res=0
            7fffbae8982d __send_signal+0x80004520223d ([kernel.kallsyms])
            7fffbae8982d __send_signal+0x80004520223d ([kernel.kallsyms])
            7fffbaef0652 run_posix_cpu_timers+0x800045202522 ([kernel.kallsyms])
            7fffbaea7aa9 scheduler_tick+0x800045202079 ([kernel.kallsyms])
            7fffbaefaa80 tick_sched_timer+0x800045202000 ([kernel.kallsyms])
            7fffbaefa480 tick_sched_handle.isra.12+0x800045202020 ([kernel.kallsyms])
            7fffbaefaab8 tick_sched_timer+0x800045202038 ([kernel.kallsyms])
            7fffbaeebbfe __hrtimer_run_queues+0x8000452020de ([kernel.kallsyms])
            7fffbaeec2dc hrtimer_interrupt+0x80004520209c ([kernel.kallsyms])
            7fffbb41b1c7 smp_apic_timer_interrupt+0x800045202047 ([kernel.kallsyms])
            7fffbb419a66 __irqentry_text_start+0x800045202096 ([kernel.kallsyms])

alsa-sink-ALC32 is the thread in the PulseAudio daemon that is handling the interface between the daemon and the ALSA driver for the audio hardware. The stack trace shows that the signal was generated in the context of that thread, however the originating code was called from a timer interrupt, not a syscall. Specifically, the run_posix_cpu_timers function seemed to be the culprit. This was consistent with the random debugger results I saw before, since interrupts were not in sync with the code running in the thread.

Some digging later I found the following code that is reached from run_posix_cpu_timers via some static functions. Intermediate static functions probably got optimized away by the compiler and don't appear in the perf stack trace:

if (hard != RLIM_INFINITY &&
    tsk->rt.timeout > DIV_ROUND_UP(hard, USEC_PER_SEC/HZ)) {
	/*
	 * At the hard limit, we just die.
	 * No need to calculate anything else now.
	 */
	__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
	return;
}

Now things started to make sense. The Linux kernel implements a limit on how much CPU time a thread with real-time scheduling priority can use before it must cooperatively yield the CPU to other threads and processes (via a blocking syscall, for instance). If a thread hits this time limit, it is silently sent the SIGKILL signal by the kernel. Kernel resource limits are documented in the setrlimit man page (the relevant limit here is RLIMIT_RTTIME). The PulseAudio daemon was setting the ALSA thread to real-time priority, and it was getting killed under load.

Using real-time scheduling seems to be the default in PulseAudio 12.0 and the time limit set for the process is 200 ms. The limit for a running daemon can be inspected from shell using prlimit:

$ prlimit --pid PID | grep RTTIME
RTTIME     timeout for real-time tasks           200000    200000 microsecs

Details of real-time scheduling can be adjusted in /etc/pulse/daemon.conf, for instance with:

realtime-scheduling = yes
rlimit-rttime = 400000
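
To check which PulseAudio threads actually ended up with a real-time priority after changing these settings, something like the following can be used (cls and rtprio are standard procps ps format specifiers for the scheduling class and real-time priority):

$ ps -eLo pid,tid,cls,rtprio,comm | grep alsa-sink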

Just to make sure that an RTTIME over-run indeed produces such symptoms I made the following C program that intentionally triggers it. Running the program showed that indeed the cause for the SIGKILL in this case isn't logged anywhere and produces a similar perf backtrace:

#include <stdio.h>
#include <sched.h>
#include <sys/resource.h>

int main(int argc, char** argv)
{
        struct sched_param sched_param;
        if (sched_getparam(0, &sched_param) < 0) {
                printf("sched_getparam() failed\n");
                return 1;
        }
        sched_param.sched_priority = sched_get_priority_max(SCHED_RR);
        if (sched_setscheduler(0, SCHED_RR, &sched_param)) {
                printf("sched_setscheduler() failed\n");
                return 1;
        }
        printf("Scheduler set to Round Robin with priority %d...\n",
                        sched_param.sched_priority);
        fflush(stdout);

        struct rlimit rlimit;
        rlimit.rlim_cur = 500000;
        rlimit.rlim_max = rlimit.rlim_cur;

        if (setrlimit(RLIMIT_RTTIME, &rlimit)) {
                printf("setrlimit() failed\n");
                return 1;
        }
        printf("RTTIME limit set to %ld us...\n",
                        (long)rlimit.rlim_max);

        printf("Hogging the CPU...\n");
        while (1);
}

It would be really nice if the kernel logged the reason for this signal somewhere, like it does with OOM kills. It might be that in this particular case the developers wanted to avoid calling possibly expensive logging functions from interrupt context. On the other hand, it seems that the kernel by default doesn't log any process kills due to resource limit over-runs at all. Failed syscalls can be logged using auditd, but that wouldn't help here since no syscalls actually failed.

As far as this particular PulseAudio application was concerned, there weren't really any perfect solutions. This didn't look like a bug in PulseAudio but rather a fact of life in a constrained environment with real-time tasks. The PulseAudio man page discusses some trade-offs of real-time scheduling (which is nice in hindsight, but you first have to know where to look). In my specific case, there were more or less only three possibilities of how to proceed:

  1. Disable RTTIME limit and accept that PulseAudio might freeze other processes on the device for an arbitrary amount of time,
  2. disable real-time scheduling and accept occasional skips in the audio due to other processes taking the CPU from PulseAudio for too long, or
  3. accept the fact that PulseAudio will restart occasionally and make other software on the device recover from this case as best as possible.

After considering implications to the functionality of the device I went with the last one in the end. I also slightly increased the default RTTIME limit so that restarts would be less common while still having an acceptable maximum response time for other processes.

Posted by Tomaž | Categories: Code | Comments »

Getting ALSA sound levels from a command-line

01.09.2018 14:33

Sometimes I need to get an idea of signal levels coming from an ALSA capture device, like a line-in on a sound card, and I only have an ssh session available. For example, I've recently been exploring a case of radio-frequency interference causing audible noise in an audio circuit of an embedded device. It's straightforward to load up raw samples into a Jupyter notebook or Audacity and run some quick analysis (or simply listen to the recording on headphones). But what I've really been missing is just a simple command-line tool that would show me some RMS numbers. This way I wouldn't need to transfer potentially large wav files around quite so often.

I felt like a command-line VU meter was something that should have already existed, but finding anything like that on Google turned out to be elusive. Over time I've ended up with a bunch of custom half-finished Python scripts. However, Python is slow, especially on low-powered ARM devices, so I was considering writing a more optimized version in C. Luckily I've recently come across this recipe for sox, which does exactly what I want. Sox is easily apt-gettable on all Debian flavors I commonly care about, and even on slow CPUs the following doesn't take noticeably longer than the recording itself:

$ sox -q -t alsa -d -n stats trim 0 5
             Overall     Left      Right
DC offset  -0.000059 -0.000059 -0.000046
Min level  -0.333710 -0.268066 -0.333710
Max level   0.273834  0.271820  0.273834
Pk lev dB      -9.53    -11.31     -9.53
RMS lev dB    -25.87    -26.02    -25.73
RMS Pk dB     -20.77    -20.77    -21.09
RMS Tr dB     -32.44    -32.28    -32.44
Crest factor       -      5.44      6.45
Flat factor     0.00      0.00      0.00
Pk count           2         2         2
Bit-depth      15/16     15/16     15/16
Num samples     242k
Length s       5.035
Scale max   1.000000
Window s       0.050

This captures 5 seconds of audio (trim 0 5) from the default ALSA device (-t alsa -d), runs it through the stats filter and discards the samples without saving them to a file (-n). The stats filter calculates some useful statistics and dumps them to standard error. A different capture device can be selected through the AUDIODEV environment variable.
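
For example, to capture from a specific ALSA device instead of the default one, something like this should work (the exact hw name depends on the system; arecord -l lists the available capture devices):

$ AUDIODEV=hw:1,0 sox -q -t alsa -d -n stats trim 0 5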

The displayed values are pretty self-explanatory: min and max levels show extremes in the observed samples, scaled to ±1 (so they are independent of the number of bits in ADC samples). Root-mean-square (RMS) statistics are scaled so that 0 dB is the full-scale of the ADC. In addition to overall mean RMS level you also get peak and trough values measured over a short sliding window (length of this window is configurable, and is shown on the last line of the output). This gives you some idea of the dynamics in the signal as well as the overall volume. Description of other fields can be found in the sox(1) man page.
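
The window length can be changed with the -w option of the stats effect, for instance to use a 200 ms window:

$ sox -q -t alsa -d -n stats -w 0.2 trim 0 5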

Posted by Tomaž | Categories: Code | Comments »

Monitoring HomePlug AV devices

23.05.2018 18:51

Some time ago I wanted to monitor the performance of a network of Devolo dLAN devices. These are power-line Ethernet adapters. Each device looks like a power brick with a standard Ethernet RJ45 port. You can plug several of these into wall sockets around a building and theoretically they will together act as an Ethernet bridge, linking all their ports as if they were connected to a single network. The power-line network in question seemed to be having intermittent problems, but without some kind of a log it was hard to pin-point exactly what the problem was.

I have very little experience with power-line networks and some quick web searches yielded conflicting information about how these things work and what protocols are at work behind the scenes. Purely from the user perspective, the experience seems to be similar to wireless LANs. While individual devices have flashy numbers written on them, such as 500 Mbps, these are just theoretical "up to" throughputs. In practice, bandwidth of individual links in the network seems to be dynamically adjusted based on signal quality and is commonly quite a bit lower than advertised.

Devolo Cockpit screenshot.

Image by devolo.com

Devolo provides an application called Cockpit that allows you to configure the devices and visualize the power-line network. The feature I was most interested in was the real-time display of the physical layer bitrate for each individual link in the network. While the Cockpit is available for Linux, it is a user friendly point-and-click graphical application and chances were small that I would be able to integrate it into some kind of an automated monitoring process. The prospect of decoding the underlying protocol seemed easier. So I did a packet capture with Wireshark while the Cockpit was fetching the bitrate data:

HomePlug AV protocol decoded in Wireshark.

Wireshark immediately showed the captured packets as part of the HomePlug AV protocol and provided a nice decode. This finally gave me a good keyword I could base my further web searches on, which revealed a helpful white paper with some interesting background technical information. The HomePlug AV physical layer apparently uses frequencies in the range of 2 - 28 MHz, using OFDM with an adaptive number of bits per modulation symbol. The network management is centralized, using a coordinator and a mix of CSMA/CA and TDMA access.

More importantly, the fact that Wireshark decode showed bitrate information in plain text gave me confidence that replicating the process of querying the network would be relatively straightforward. Note how the 113 Mbit/sec in the decode directly corresponds to hex 0x71 in raw packet contents. It appeared that only two packets were involved, a Network Info Request and a Network Info Confirm:

HomePlug AV Network Info Confirmation packet decode.

However, before diving directly into writing code from scratch I came across the Faifa command-line tool on GitHub. The repository seems to be a source code dump from a now-defunct dev.open-plc.org web site. There is very little in terms of documentation or clues to its provenance. The last commit was in 2016. However, a browse through its source code revealed that it is capable of sending the 0xa038 Network Info Request packet and receiving and decoding the corresponding 0xa039 Network Info Confirm reply. This was exactly what I was looking for.

Some tweaking and a compile later I was able to get the bitrate info from my terminal. Here I am querying one device in the power-line network (its Ethernet address is in the -a parameter). The queried device returns the current network coordinator and a list of stations it is currently connected to, together with the receive and transmit bitrates for each of those connections:

# faifa -i eth4 -a xx:xx:xx:xx:xx:xx -t a038
Faifa for HomePlug AV (GIT revision master-5563f5d)

Started receive thread
Frame: Network Info Request (Vendor-Specific) (0xA038)

Dump:
Frame: Network Info Confirm (Vendor-Specific) (A039), HomePlug-AV Version: 1.0
Network ID (NID): xx xx xx xx xx xx xx
Short Network ID (SNID): 0x05
STA TEI: 0x24
STA Role: Station
CCo MAC: 
	xx:xx:xx:xx:xx:xx
CCo TEI: 0xc2
Stations: 1
Station MAC       TEI  Bridge MAC        TX   RX  
----------------- ---- ----------------- ---- ----
xx:xx:xx:xx:xx:xx 0xc2 xx:xx:xx:xx:xx:xx 0x54 0x2b
Closing receive thread

The original tool had some annoying problems that I needed to work around before deploying it to my monitoring system. Most of all, it operated by sending the query with the Ethernet broadcast address as the source. It then put the local network interface into promiscuous mode to listen for broadcasted replies. This seemed like bad practice and created problems for me, not the least of which was log spam with repeated kernel warnings about the interface entering and exiting promiscuous mode. It's possible that the use of broadcasts was a workaround for a hardware limitation on some devices, but the devices I tested (dLAN 200 and dLAN 550) seem to reply just fine to queries from non-broadcast addresses.

I also fixed a race condition in the original tool caused by the way it received the replies. If multiple queries were running on the same network simultaneously, replies from different devices sometimes got mixed up. Finally, I fixed some rough edges regarding libpcap usage that prevented the multi-threaded Faifa process from exiting cleanly once a reply was received. I added a -t command-line option for sending and receiving a single packet.

As usual, the improved Faifa tool is available in my fork on GitHub:

$ git clone https://github.com/avian2/faifa.git
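
For completeness, here is a very rough sketch of how the tool's output could be logged periodically for later graphing. The grep pattern simply picks out the per-station lines from the output format shown above; the interval and log path are arbitrary:

#!/bin/sh
# Query one adapter every 5 minutes and append the per-station bitrate
# lines, prefixed with a Unix timestamp, to a log file.
while true; do
	ts=$(date +%s)
	faifa -i eth4 -a xx:xx:xx:xx:xx:xx -t a038 2>/dev/null | \
		grep -iE '^([0-9a-f]{2}:){5}[0-9a-f]{2} ' | \
		sed "s/^/$ts /" >> /var/log/homeplug-bitrate.log
	sleep 300
done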

To conclude, here is an example of bitrate data I recorded using this approach. It shows transmit bitrates reported by one device in the network to two other devices (here numbered "station 1" and "station 2"). The data was recorded over the course of 9 days and the network was operating normally during this time:

Recorded PHY bitrate for two stations.

Even this graph shows some interesting things. Some devices (like the "station 1" here) seem to enter a power saving mode. Such devices don't appear in the network info reports, which is why data is missing for some periods of time. Even out of power saving mode, devices don't seem to update their reported bitrates if there is no data being transferred on that specific link. I think this is why the "station 2" here seems to have long periods where the reported bitrate remains constant.

Posted by Tomaž | Categories: Code | Comments »