Printing .lto_priv symbols in GDB

14.02.2020 16:08

Here's a stupid little GNU debugger detail I've learned recently - you have to quote the names of some variables. When debugging a binary that was compiled with link time optimization, it can appear as if you can't inspect certain global variables.

GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
[...]
(gdb) print usable_arenas
No symbol "usable_arenas" in current context.

The general internet wisdom seems to be that if a variable is subject to link time optimization it can't be inspected in the debugger. I guess this comes from the similar problem of inspecting private variables that are subject to compiler optimization. In some cases private variables get assigned to a register and don't appear in memory at all.

However, if it's a global variable, accessed from various places in the code, then its value must be stored somewhere, regardless of what tricks the linker does with its location. It's unlikely it would get assigned to a register, even if it's theoretically possible. So after some mucking about in the disassembly to find the address of the usable_arenas variable I was interested in, I was surprised to find out that gdb does indeed know about it:

(gdb) x 0x5617171d2b80
0x5617171d2b80 <usable_arenas.lto_priv.2074>:	0x17215410
(gdb) info symbol 0x5617171d2b80
usable_arenas.lto_priv in section .bss of /usr/bin/python3.5

This suggests that the name has a .lto_priv or a .lto_priv.2074 suffix (perhaps meaning LTO private variable? It is declared as a static variable in C). However, I still can't print it:

(gdb) print usable_arenas.lto_priv
No symbol "usable_arenas" in current context.
(gdb) print usable_arenas.lto_priv.2074
No symbol "usable_arenas" in current context.

The trick is not that this is some kind of a special variable or anything. It just has a tricky name. You have to put it in quotes so that gdb doesn't try to interpret the dot as an operator:

(gdb) print 'usable_arenas.lto_priv.2074'
$3 = (struct arena_object *) 0x561717215410

TAB completion also works against you here, since it happily completes the name without the quotes and without the .2074 at the end, giving the impression that it should work that way. It doesn't. If you use completion, you have to add the quotes and the number suffix manually around the completed name (or only press TAB after inputting the leading quote, which works correctly).

Finally, I don't know what the '2074' means, but it seems you need to find that number in order to use the symbol name in gdb. Every LTO-affected variable seems to get a different number assigned. You can find the one you're interested in via a regexp search through the symbol table like this:

(gdb) info variables usable_arenas
All variables matching regular expression "usable_arenas":

File ../Objects/obmalloc.c:
struct arena_object *usable_arenas.lto_priv.2074;
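
If this comes up often, the lookup can be automated with gdb's Python API. The following is just a quick sketch (it assumes a gdb built with Python support, and print-lto is a command name I made up):

import re
import gdb

class PrintLto(gdb.Command):
    """print-lto NAME: find NAME's .lto_priv symbol and print it."""

    def __init__(self):
        super(PrintLto, self).__init__("print-lto", gdb.COMMAND_DATA)

    def invoke(self, arg, from_tty):
        # "info variables" takes a regexp and lists matching symbols.
        out = gdb.execute("info variables %s" % re.escape(arg), to_string=True)
        match = re.search(r"(%s(?:\.lto_priv\.\d+)?);" % re.escape(arg), out)
        if match:
            # Quote the full name so gdb doesn't parse the dots as operators.
            gdb.execute("print '%s'" % match.group(1))
        else:
            gdb.write("No symbol matching %s found\n" % arg)

PrintLto()

After loading this with source print_lto.py inside gdb, print-lto usable_arenas should find the mangled name and print the value.
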
Posted by Tomaž | Categories: Code | Comments »

Checking Webmention adoption rate

25.01.2020 14:42

Webmention is a standard that attempts to give plain old web pages some of the attractions of big, centralized social media. The idea is that web servers can automatically inform each other about related content and actions. In this way a post on a self-hosted blog, like this one, can display backlinks to a post on another server that mentions it. It also makes it possible to implement gimmicks such as a like counter. Webmention is kind of a successor to pingbacks, which were popularized some time ago by Wordpress. Work on standardizing Webmention seems to date back to at least 2014 and it was first published as a working draft by the W3C in 2016.

I first read about Webmention on jlelse's blog. I was wondering what the adoption of this standard looks like nowadays. Some searching revealed conflicting amounts of enthusiasm for it, but not much recent information. Glenn Dixon wrote in 2017 about giving up on it due to lack of adoption. On the other hand, Ryan Barrett celebrated 1 million sent Webmentions in 2018.

To get a better feel for the state of things in my local web bubble, I extracted all external links from my blog posts in the last two years (January 2018 to January 2020). That yielded 271 unique URLs on 145 domains from 44 blog posts. I then used Web::Mention to discover any Webmention endpoints for these URLs. Endpoint discovery is the first step in notifying a remote server about related content. If it fails, it likely means that the host doesn't implement the protocol.
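
Web::Mention did the discovery for me, but for illustration it roughly boils down to checking the Link HTTP header and then the first link or a element with rel="webmention". Here's a simplified Python sketch (the actual spec has more rules - redirect handling, attribute ordering and so on - that I'm skipping):

import re
import requests

def discover_endpoint(url):
    # Rough Webmention endpoint discovery - just the general idea,
    # not a spec-compliant implementation.
    resp = requests.get(url, timeout=10)

    # 1. HTTP Link header with rel="webmention".
    if "webmention" in resp.links:
        return requests.compat.urljoin(resp.url, resp.links["webmention"]["url"])

    # 2. First <link> or <a> element with rel="webmention" in the body.
    for tag in re.finditer(r"<(?:link|a)\b[^>]*>", resp.text, re.I):
        if re.search(r'rel=["\']?[^"\'>]*\bwebmention\b', tag.group(0), re.I):
            href = re.search(r'href=["\']([^"\']*)', tag.group(0))
            if href:
                return requests.compat.urljoin(resp.url, href.group(1))

    return None  # the host likely doesn't implement Webmention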

The results weren't encouraging. None of the URLs had discoverable endpoints. That means that even if I had implemented the sending part of the Webmention protocol on my blog, I wouldn't have sent a single mention in the last two years.

Another thing I wanted to check was whether anyone was doing the same in the other direction. Were there any failed incoming attempts to discover an endpoint on my end? Unfortunately there is no good way of determining that from the logs I keep. In theory, endpoint discovery can look just like a normal HTTP request. However, many Webmention implementations seem to have "webmention" in their User-Agent header. According to this heuristic I likely received at least 3 distinct requests for endpoint discovery in the last year. It's likely there were more (for example, I know that my log aggregates don't include requests from Wordpress plug-ins due to some filter regexps).
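
The check itself amounts to something like the following (the log path and format here are hypothetical - mine differ a bit - so treat this as a sketch):

import re

# Count distinct clients mentioning "webmention" anywhere in the log
# line, which is good enough for the User-Agent heuristic.
clients = set()
with open("/var/log/apache2/access.log") as log:
    for line in log:
        if re.search(r"webmention", line, re.I):
            clients.add(line.split()[0])  # client address in common log format

print(len(clients), "distinct clients")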

So implementing this protocol doesn't look particularly inviting from the network-effect standpoint. I also wonder whether Webmentions would become the spam magnet that pingbacks were back in the day if they ever reached any kind of widespread use. The standard does include a provision for endpoints to verify that the source page indeed links to the destination URL the Webmention request says it does. However, that protection seems trivial to circumvent and only creates a little more work for someone wanting to send out millions of spammy mentions across the web.

Posted by Tomaž | Categories: Code | Comments »

On "The Bullet Journal Method" book

17.01.2020 12:02

How can you tell if someone uses a Bullet Journal®? You don't have to, they will immediately tell you themselves.

Some time last year I saw this book in the window of a local bookstore. I was aware of the website, but I didn't know the author also published a book about his method of organizing notebooks. I learned about the Bullet Journal back in 2014 and it motivated me to better organize my daily notes. About 3000 written pages later I'm still using some of the techniques I learned back then. I was curious if the book holds any new useful note-taking ideas, so I bought it on the spot.

The Bullet Journal Method by Ryder Carroll.

The Bullet Journal Method is a 2018 book by Ryder Carroll (by the way, the colophon says my copy was printed in Slovenia). The text is split into 4 parts: the first part gives the motivation for keeping a notebook. That is followed by a description of the actual note-taking methods. The third and longest part of the book, at around 100 pages, is called "The Practice". It's kind of a collection of essays giving life-philosophy advice on general topics such as meaning, gratitude and so on. The last part explores a few variations of the methods described in the book.

The methods described in the book differ a bit from what I remember. In fact the author does note in a few places that their advice has changed over time. The most surprising to me was the change from using blank squares as a symbol for an unfinished task to simple dots. The squares were in my opinion one of the most useful things I took from the Bullet Journal as they are a very clear visual cue. They really catch the eye among other notes and drawings when browsing for things left undone in a project.

In general, the contents of my notebooks are quite different from the journals the book talks about. I don't have such well-defined page formats (the book calls them "collections"), except perhaps monthly indexes. My notebooks more resemble lab notes and I also tend to write things in longer form than the really short bullet lists suggested in the book. The author spends a lot of time on migrations and reflection: rewriting things from an old, full notebook to a new one, moving notes between months and so on. I do very little of that and rely more on referencing and looking up things in old notebooks. I do see some value in it though, and after reading the book I've started doing more of it for some parts of my notes. I've experimented with a few other note-taking methods from the book as well; some seem to be working for me and the others I've dropped.

The Bullet Journal Method on Productivity.

I was surprised to see that a large portion of the book is dedicated to this very general motivational and life-style advice, including diagrams like the one you see above, much in the style of self-help books. It made me give up on the book half-way through for a few months. I generally dislike this kind of text, but I don't think it's badly written. The section is intertwined with exercises that you can write down in your journal, like the "five whys" and so on. Some were interesting and others not so much. Reading a suggestion to write your own obituary after a recent death in the family was off-putting, but I can hardly blame the book for that coincidence.

There is certainly some degree of Bullet Journal® brand building in this book. It feels like the author tries quite hard to sell their method in the first part of the book via thankful letters and stories from people who solved various tough life problems by following their advice. Again, this is something I think is commonly found in self-help books, and for me personally it usually has the opposite effect from what was probably intended. I do appreciate that the book doesn't really push the monetary side of things. The author's other businesses (branded notebooks and the mobile app) are each mentioned once towards the end of the book and not much more.

Another pleasant surprise was the tactful acknowledgment from the author that many journals shared on the web and social media don't resemble real things and can be very demotivational or misleading. I've noticed that myself. For example, if you search for "bullet journal" on YouTube you'll find plenty of people sharing their elaborately decorated notebooks that have been meticulously planned and sectioned for a year in advance. That's simply not how things work in my experience and most of all, I strongly believe that writing the notebook with the intention of sharing it on social media defeats the whole purpose.

In conclusion, it's an interesting book and so far I've kept it handy on my desk to occasionally look up some of the example page layouts given throughout it. I do recommend it if you're interested in using physical notebooks or are frustrated with the multitude of digital productivity apps that never quite seem to work out. It's certainly a good starting point, but keep in mind that what's recommended in there might not be what actually works best for you. My advice would be simply to keep writing and give it some time until you figure out which parts are useful to you.

Posted by Tomaž | Categories: Life | Comments »

Radeon performance problem after suspend

08.01.2020 20:07

This is a problem I've encountered on my old desktop box that's still running Debian Stretch. Its most noticeable effect is that large GNOME terminal windows get very laggy and editing files in GVim is almost unusable due to the slow refresh rate. I'm not sure when this first started happening. I suspect it was after I upgraded the kernel to get support for the Wacom Cintiq. However, I only started noticing it much later, so it's possible that some other package upgrade triggered it. Apart from the kernel, I can't find anything else (like recent Intel microcode updates) that would affect this issue. On the other hand, the hardware here is almost a decade old at this point and way past due for an upgrade, so I'm not completely ruling out that something physical broke.

The ATI Radeon graphic card and the kernel that I'm using:

$ lspci | grep VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV710 [Radeon HD 4350/4550]
$ cat /proc/version
Linux version 4.19.0-0.bpo.6-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)) #1 SMP Debian 4.19.67-2+deb10u2~bpo9+1 (2019-11-12)

The Radeon-related kernel parameters:

radeon.audio=1 radeon.hard_reset=1

I think I've added hard_reset because of some occasional hangs I was seeing a while ago. I'm not sure if it's still needed with this kernel version and I don't remember having problems with X hanging in recent times. I've also seen this exact performance problem with kernel 4.19.12 from Stretch backports. I can't reproduce the problem on stock Stretch kernel 4.9.189. Other than the kernel from backports I'm using a stock Stretch install of GNOME 3 and X.Org (xserver-xorg-video-radeon version 1:7.8.0-1+b1).

To reproduce the problem, open a GNOME terminal and resize it, say to 132x64 or something similar. The exact size isn't important. Fill the scrollback, for example by catting a large file or running yes for a second. After that, scroll the terminal contents by holding enter on the shell prompt. If everything is working correctly, the scrolling will be smooth. If this bug manifests itself, terminal contents will scroll in large, random increments, seemingly refreshing around once per second.

The second way is to open a largish (say 1000-line) text file in GVim and try to edit it or scroll through it. Again, the cursor will lag significantly behind the keyboard input. Interestingly, some applications aren't affected. For example, scrolling in Firefox or Thunderbird will remain smooth. GIMP doesn't seem to be affected much either.

On the affected kernels, I can reliably reproduce this by putting the computer to sleep (suspend to RAM - alt-click on the power button in the GNOME menu) and waking it up. After a fresh reboot, things will run normally. After a suspend, the performance problems described above manifest themselves. There is no indication in dmesg or syslog that anything went wrong at wake up.

I've tracked this down to a problem with Radeon's dynamic power saving feature. It seems that after sleep it gets stuck in its lowest performance setting and doesn't automatically adjust when some application starts actively using the GPU. I can verify that by running the following in a new terminal:

# watch cat /sys/kernel/debug/dri/0/radeon_pm_info

On an idle computer, this should display something like:

uvd    vclk: 0 dclk: 0
power level 0    sclk: 11000 mclk: 25000 vddc: 1100

After a fresh reboot, when scrolling the terminal or contents of a GVim buffer, the numbers normally jump up:

uvd    vclk: 0 dclk: 0
power level 2    sclk: 60000 mclk: 40000 vddc: 1100

However after waking the computer from sleep, the numbers in radeon_pm_info stay constant, regardless of any activity in the terminal window or GVim.

I've found a workaround that gets the power management working again. The following script forces the DPM into the high profile and then resets it to whatever it was before (auto on my system). This seems to fix the problem, which can be verified with the radeon_pm_info method described above. Most importantly, it indeed restores the automatic adjustment: according to radeon_pm_info, the card doesn't just get stuck again at the highest setting.

$ cat /usr/local/bin/radeon_dpm_workaround.sh
#!/bin/bash

set -eu

DPM_FORCE=/sys/class/drm/card0/device/power_dpm_force_performance_level

CUR=`cat "$DPM_FORCE"`
echo high > "$DPM_FORCE"
sleep 1
echo "$CUR" > "$DPM_FORCE"

To get this to automatically run each time the computer wakes from sleep, I've used the following systemd service file:

$ cat /etc/systemd/system/radeon_dpm_workaround.service
[Unit]
After=suspend.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/radeon_dpm_workaround.sh

[Install]
WantedBy=suspend.target

It needs to be enabled via:

# systemctl enable radeon_dpm_workaround.service

As I said in the introduction, this is a fairly old setup (it even still has a 3.5" floppy drive!). However, after many years it's still doing its job reasonably well and hence I never seem to find the motivation to upgrade it. This series of Radeon cards does seem to have somewhat buggy support in the open source drivers. I've always had some degree of problems with it. For a long time HDMI audio was very unreliable, and another problem that I still see sometimes is that shutting down X hangs the system for several minutes.

Posted by Tomaž | Categories: Code | Comments »

Jahresrückblick

03.01.2020 14:58

One of my more vivid childhood memories is from the center of Ljubljana, somewhere around the new year 1990. On the building of the Nama department store there was a new, bright red LED screen. It was scrolling a message that read "Welcome to the new decade". It was probably one of the first such displays I ever saw. I remember finding the message kind of weird and surprising. I think up to that point I had decades shelved among terms that were only the stuff of books or movies.

It's now the end of another decade and, amusingly enough, I was again not really thinking about it in such terms. I was only reminded of the upcoming round number on the calendar when social media posts and articles summarizing the past 10 years started popping up. Anyway, I'm not even going to attempt to sum up the decade. A huge number of things happened in the past year alone, both happy and sad and somewhere in between. I usually have trouble summarizing even that in a few paragraphs, so here are only a few personal highlights.

PCB with an OLED screen and some analog circuits.

On the electronic side, I'm really happy with how one work-related project turned out. It's a small multi-purpose microcontroller board that includes an analog front-end for certain proprietary buses. I executed the whole project, from writing up a specification based on measurements and reverse engineering, through drawing up the schematic, to assembling prototypes and coding the firmware. It was an interesting exercise in optimization, both from the perspective of having a minimal BOM and of low-level programming of the integrated peripherals in the microcontroller.

Each year I mention the left-over pile of unfinished side projects. This year isn't any different and some projects stretch back worryingly deep into my stack of notebooks. Perhaps the closest to completion is a curve tracer that I designed while researching a curiosity in bipolar transistor behavior. I've also received the ERASynth Micro signal generator that I've helped crowdfund. It's supposed to become a part of an RF measurement system that I'm slowly piecing together. I feel bad for not posting a review of it, but as usual other things intervened.

"Failed sim" drawing

I've continued to spend many evenings drawing, either in a classical drawing class at the National Gallery or behind the digital setup I have at home. At the start of the year I was doing some more experiments with animation, trying out lessons I've learned and checking out how far I can get with Python scripts for compositing and lighting effects. I further developed my GIMP plug-in.

I played around with some story ideas, but I didn't end up doing any kind of longer project like I did a year ago. I enjoyed trying out different drawing styles, experimenting with character design and doing random illustrations that came to mind. I've come up with some kind of an alternative space-race theme with animals, but in the end I realized that while I can draw the characters, I don't really have a story to tell about them.

Measuring the CPU temperature with an IR thermometer.

Speaking of telling stories, I've written more blog posts this year than the year before. I've also had my moment of fame when my rant about Google blocking messages from my mail server was posted on Hacker News and reached the top of the front page. My server received a year's worth of traffic in just a couple of days and the article got mentioned on sites like Bloomberg and BoingBoing. It was a fresh dose of motivation to keep writing amid the falling numbers of visitors and RSS subscribers that I've seen since around 2014.

As I always repeat in these posts, it's hard to sum up 365 days in a few paragraphs. I've tried to stick to the positive side of things above. In case the picture looks too rosy, I must add that there were also sad times that were hard to wade through and plans that didn't turn out as they should have. The next year looks like it will bring some big challenges for me, so again I'll say that I won't make any plans about what kind of personal projects I'll do. If anything, I wish to finish some that I already started and try to refrain from beginning any new ones to add to the pile.

Posted by Tomaž | Categories: Life | Comments »

More about pogo pins, and a note about beryllium

20.12.2019 12:42

Back in November I wrote about reliability problems with a bed-of-nails test fixture I've made for an electronic circuit. The fixture with 21 pogo pins only had around 60% long-term probability that all pins would contact their test pads correctly, leading to a very high false alarm rate. I did a quick review of blog posts about similar setups and scientific literature I've found on the subject. Based on what I've read it seemed that such severe problems were rare. From my own analysis I concluded that likely causes were either dirty test pads or bad contacts inside the pogo pins themselves, between the plunger and the body of the pin. The pogo pins I was using were on the cheaper end of the spectrum, so the latter explanation seemed likely.

Recently I got hold of a set of more expensive pins and, as it happened, also a new digital microscope. I was wondering how the mechanical design of the new pins compared to the old ones, so I looked at them under the microscope. This led to some new clues about the cause of the problems I was investigating:

Pogo pin tip comparison under a microscope.

Pogo pin tips pictured above from left to right:

a) Harwin P19-0121, new (23.00 € for 10 pieces). Tip material is gold-plated steel.

b) P75-B1 type of uncertain origin, new (4.46 € for 10 pieces).

c) and d) two examples of P75-B1 removed from the test fixture after approximately 1500 mating cycles.

The more expensive Harwin pins have a significantly sharper point than the ones sold by Adafruit. Even when new, the cheaper pins have a slightly rounded tip. Over many mating cycles with a test pad the tips end up even more flattened. The c) and d) pins above have been used with a flat test point surface on a lead-free HASL-finish PCB (the test setup described in my previous post). I couldn't find any longevity specification for the P75 series of pins. Harwin P19 pins are specified for 100k cycles, so it seems surprising that the P75 would wear down so much after less than 2% of that amount. This evaluation by OKI shows that the contact resistance of probes for wafer testing starts to rise somewhere after 10k cycles.

These flattened tips partly explain the problem I'm seeing. Compared to sharp ones, dull or rounded contacts have a worse chance of piercing surface contamination on a PCB, like oxide or flux residue. That would also explain why my analysis showed that the failure rate was related to the production batch: each batch had a slightly different amount of residue left on the boards and none was perfectly clean. First results show that replacing the pins did have a positive effect on the test reliability (I imagine it's hard to get any worse than that 40% fail rate), but I'll have to wait to get some statistically significant numbers.

While looking into more expensive pogo pins, another issue came up. Some of them use heads made from a beryllium-copper alloy. None of the pins pictured above do, but other head shapes from the same Harwin P19 product line do in fact use beryllium, according to their datasheets. Beryllium has some health risks associated with it, especially when it's in particulate form. I was wondering: if I switched the test setup to such pins, how much beryllium would be released into the environment by parts wearing down the way I've seen?

First paragraph from the Exposure Assessment Guide.

Image by Beryllium Science & Technology Association

From the microscope photographs above, I estimate that approximately 140000 μm3 of material was lost from one pin after 1500 cycles. This value is based on the volume of the cone that's missing above the flattened tips of pins c) and d) and probably overestimates the true amount. Given a BeCu alloy density of 8.25 g/cm3 and assuming a beryllium content of 3% by mass, this comes out to approximately 0.04 μg of pure beryllium released into the environment. One figure I found for the recommended beryllium exposure limit per inhalable volume of air is 0.6 μg/m3.

This means that all the accumulated dust from the wear of 15 pins would need to be concentrated in a single cubic meter of air to reach the maximum recommended density for breathable air. Considering that the amount of wear shown above happened over a span of months, it seems unlikely that all of it would instantaneously end up gathered in such a small volume. I don't know whether the missing material ends up as dust around the pins, or is slowly carried away, smeared little by little onto test pads. In any case, based on this back-of-the-envelope calculation, beryllium contacts seem reasonably safe to use, even if the amount of beryllium lost isn't completely negligible compared to published exposure limits (but of course, I'm not any kind of workplace safety expert).
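
The arithmetic, spelled out in a few lines of Python for anyone who wants to check it:

volume_um3 = 140e3                  # estimated material lost per pin
volume_cm3 = volume_um3 * 1e-12     # 1 um3 = 1e-12 cm3
alloy_ug = volume_cm3 * 8.25 * 1e6  # BeCu density 8.25 g/cm3, g -> ug
be_ug = alloy_ug * 0.03             # 3% beryllium by mass

print(be_ug)                        # ~0.035 ug, rounded to 0.04 above
print(15 * be_ug / 0.6)             # ~0.9 m3 of air at the 0.6 ug/m3 limit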

I don't think this result is surprising. Finished products using beryllium are generally considered safe. BeCu alloys have been used for mundane things like golf clubs and musical instruments. Harwin doesn't publish any MSDS documents for their products, and as far as I'm aware, beryllium use isn't covered by RoHS, REACH or other such regulations. In any case, it can't hurt to follow some basic precautions when working with electronic components that incorporate these kinds of materials.

Posted by Tomaž | Categories: Analog | Comments »

Food container damage in a microwave oven

12.12.2019 17:19

Some time ago I made a decision to start bringing my own lunch to work. The idea was that for a few days per week I would cook something simple at home the evening before and store it in a plastic container over night in my fridge. I would then bring the container with me to the office next day and heat it up in a microwave at lunch time. For various reasons I wasn't 100% consistent in following this plan, but for the past 3 months or so I did quite often make use of the oven in the office kitchenette. Back in mid-September I also bought a new set of food-grade plastic containers to use just for this purpose.

Around a week ago, just as I was about to fill one of the new containers, I noticed some white stains on its walls. Some increasingly vigorous scraping and rinsing later, the stains started looking less and less like dried-on food remains and more like some kind of corrosion of the plastic material. This had me worried, since the idea that I was eating dissolved polymer with my lunch didn't sound very inviting. On the other hand, I was curious. I've never seen plastic corroding in this way. In any case, I stopped using the containers and did some quick research.

Two types of clear plastic polypropylene food containers.

After carefully inspecting all plastic containers in my kitchen, I've found a few more instances of this exact same effect. All were on one of the two types of containers I've used for carrying lunch to work. The two types are shown on the photo above. The top blue one is a 470 ml Lock & Lock (this model is apparently now called "classic"). It's dated 2008, made in China. I have a stock of these that I've used for more than 10 years for freezing or refrigerating food, but until recently never for heating things up in a microwave. The bottom green one is a 1.1 L Curver "Smart fresh". I've bought a few of these 3 months ago and only used them for carrying and heating up lunches in a microwave.

Both of these types are marked microwave safe, food safe and dishwasher safe (I've been washing the containers in a dishwasher after use). They all have the number "5" resin identification code and the PP acronym, meaning they are supposed to be made out of polypropylene polymer. The following line of logos is embossed on the bottom of the Curver containers (on a side note, the capacity spelled out in Braille seems to say "7.6 L"). Lock & Lock has a similar line of logos, except they don't advertise "BPA FREE":

Markings on the Curver Smart Fresh food container.

The damage to the Curver container is visible on the photograph below. It looks like white dots and lines on the vertical wall of the container. At a first glance it could be mistaken for dried-on food remains. On all damaged containers it most often appears in a line approximately around the horizontal level where the interface between the liquid and air would be. I tend to fill these to around the half way mark and I've used the containers both for mostly solid food like rice or pasta and liquids like sauces and soups. If I run a finger across the stains they feel rough compared to the mirror finish plastic in the other parts. No amount of washing with water or a detergent will remove them. However, the stains are much less visible when wet.

Damaged walls of the Curver plastic food container.

Here is how the stains look under a microscope. The width of the area pictured is approximately 10 mm. The microscope shows much better than the naked eye that what look like white stains on the surface are actually small pits and patches of the wall that have become corrugated. The damage appears superficial and doesn't seem to penetrate beyond the immediate surface layer of the plastic.

Damage to the polypropylene surface under a microscope, 1.

Damage to the polypropylene surface under a microscope, 2.

I've only used the containers in this office microwave. It's a De'Longhi Perfecto MW 311 rated at 800 W (MAFF heating category "D"). I've always used the rotating plate, usually the highest power level and 2 to 3 minutes of heating time per container.

Power rating sign on the De'Longhi MW 311 microwave oven.

After some searching around the web, I found a MetaFilter post from 2010 that seems to describe exactly the same phenomenon: rough patches of plastic that look like corrosion appearing on Lock & Lock polypropylene containers. The only difference is that in Hakaisha's case the damage appears on the bottom of the container. The comments in that thread that seem plausible to me suggest physical damage from steam bubbles, chemical corrosion from tomato sauce or other acids in food, or some non-specific effect of microwaves on the plastic.

My experience suggests that heating and/or use in a microwave oven is required for this effect. If food contact alone were to blame, I'm sure I would have seen this on my old set of Lock & Lock containers sooner. Polypropylene is quite a chemically inert material (which is why it's generally considered food safe), however its resistance to various chemicals does decrease at higher temperatures. For example, the chemical resistance table entry for oleic acid goes from no significant attack at 20°C to light attack at 60°C.

The comment about tomatoes is interesting. I've definitely seen that oily foods with a strong red color from tomatoes or red peppers will stain the polypropylene, even when stored in the refrigerator. In fact, leaflets that come with these food containers often warn that this is possible. In my experience, the red, transparent stain remains on the container for several cycles in the dishwasher, but does fade after some time. My Lock & Lock containers have been stained like that many times, but didn't develop the damaged surface before I started microwaving food in them.

Physical damage from steam bubbles seems unlikely to me. I guess something similar to cavitation might occur as a liquid-filled container moves through the nodes and antinodes of the microwave oven's EM field, causing the water to boil and cool. However, that doesn't explain why the damage mostly occurs at the surface of the liquid. Direct damage from microwave radiation doesn't make sense either: it would occur all over the volume of the plastic, not only on the inner surface and in those specific spots. In any case, dielectric heating of the polypropylene itself should be negligible (it is, after all, used for low-loss capacitors exactly because of that property).

Another interesting source I found on this topic was a paper on the deformation of packaging materials by Yoon et al., published in the Korean Journal of Food Science and Technology in 2015. It discusses composite food pouches rather than monolithic polypropylene containers, however the inner layer of those pouches was in fact a polypropylene film. The authors investigated the causes of damage to that film after food in the pouches had been heated in a microwave. They show some microphotographs that look similar to what I've seen under my microscope.

Unfortunately, the paper is in Korean except for the abstract and figure captions, and Google Translate isn't very helpful. My understanding is that they conclude that hot spots can occur in salty, high-viscosity mixtures that contain little water. My guess is that the mixture must be salty to reduce the penetration depth due to increased conductivity, and high-viscosity to lessen the effect of evaporative cooling.
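
For reference (a standard textbook approximation, my addition rather than anything from the paper), the power penetration depth of microwaves in a low-loss dielectric is roughly:

D_p \approx \frac{\lambda_0 \sqrt{\varepsilon'}}{2 \pi \varepsilon''}

where λ0 is the free-space wavelength and the loss factor ε'' includes an ionic conductivity term σ/(ωε0). Dissolved salt increases the conductivity and hence ε'', shrinking the penetration depth and concentrating the heating near the surface.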

Most telling was the following graph that shows temperature measurements at various spots in a food pouch during microwave heating. Note how the highest temperatures are reached near the filling level (which I think means at the interface between the food in the pouch and the air). Below the filling level, the temperature never rises above the boiling point of water. Wikipedia gives values between 130°C and 166°C for the melting point of polypropylene. Given the graph below, it seems plausible that a partially dried-out food mixture stuck to the container above the liquid level might heat up enough to melt a spot on the wall.

Figure 3 from Analysis of the Causes of Deformation by Yoon et al.

Image by Yoon et al.

In summary, spot melting of the plastic, as described in the Yoon paper, seems the most plausible explanation for what I was seeing. Then again, I'm judging this based on my high-school knowledge of chemistry, so there are probably aspects of the question I didn't consider. It's also hard to find anything health- or food-related on the web that appears trustworthy. It would be interesting to run some experiments to test these theories. Whatever the true cause of the damage might be, I thought it prudent to buy some borosilicate glass containers to replace the polypropylene ones for the time being.

Posted by Tomaž | Categories: Life | Comments »

Dropping the "publicsuffix" Python package

02.12.2019 10:50

I have just released version 1.1.1 of the publicsuffix Python package. Barring any major bugs that would affect some popular software using it, this will be the last release. I released v1.1.1 because I received a report that a bug in the publicsuffix package was preventing the installation of GNU Mailman.

In the grand scheme of things, it's not a big deal. It's a small library with a modest number of users. I haven't done any work on it, apart from answering mail about it, since 2015. Drop-in alternatives exist. People who care strongly about the issues I cover below have most likely already switched to one of the forks and rewrites that popped up over the years. For those that don't care, nothing will change. The code still works and the library is still normally installable from PyPi. The Debian package continues to exist. The purpose of this post is more to give some closure and to sum up a few mail threads that started back in 2015 and never reached a conclusion.

Screenshot of the publicsuffix package page on PyPi.

I first released the publicsuffix library back in 2011, two employers and a life ago. Back then there was no easily accessible Python implementation of Mozilla's Public Suffix List. Since I needed one for my work, I picked up a source file from an abandoned open source project on Google Code (which was itself just being abandoned by Google around that time). I did some minor work on it to make it usable as a standalone library and published it on PyPi.

I've not used publicsuffix myself for years. Looking back, most of my open source projects that I still maintain seem to be like that. Even though I don't use them, I feel some obligation to do basic maintenance on them and answer support mail. If not for other reasons, then out of a sense that I should give back to the body of free software that I depend so much on in my professional career. Some technical problems are also simply fun to work on and most of the time there's not much pressure.

However, one thing that was a source of long discussions about publicsuffix is the way the PSL data is distributed. I've written previously about the issue. In summary, you either distribute stale data with the code or fetch an up-to-date copy via the network, which is a privacy problem. These are the only two options, and going with one or the other or both was always going to be a problem for someone. I hate software that phones home (well, phones Mozilla in this case) as much as anyone, but it's a problem that I, as a mere maintainer of a Python library, had no hope of solving, even if I got CC'd in all the threads discussing it.
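
To illustrate the dilemma, both options sit a line apart in the library's API (quoted from memory for the v1.1-era releases, so double-check against the package documentation):

from publicsuffix import PublicSuffixList, fetch

# Option 1: the list snapshot bundled with the package (goes stale).
psl = PublicSuffixList()

# Option 2: fetch a fresh copy over the network (the privacy problem).
psl = PublicSuffixList(fetch())

print(psl.get_public_suffix("www.example.co.uk"))  # 'example.co.uk'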

The Public Suffix List is a funny thing. Ideally, software either should not care about the semantic meaning of domain names, or this meaning should be embedded in the basic infrastructure of the Internet (e.g. DNS or something). But alas we don't live in either of those worlds, and hence we have a magic text file that lives on an HTTP server somewhere, which some software needs access to if it wants to do its thing. No amount of worrying on my part was going to change that.

Screenshot of publicsuffix forks on GitHub.

The other issue, one that sparked at least one fork of publicsuffix, was the fact that I refused to publish the source on GitHub. Even though there are usually several copies of the publicsuffix code on GitHub at any time, none of them are mine. I was instead hosting my own git repo and accepting bug reports and other comments only over email.

Some time ago, GitHub became synonymous with open source. People simply expect a PyPi package to have a GitHub (or GitLab, or BitBucket) point-and-click interface somewhere on the web. The practical problem I have with that is that it hugely increases the amount of effort I have to spend on a project (subjectively speaking - keep in mind this is something I do in my free time). Yes, it makes it trivial for someone to contribute a patch. However, in practice I find that it does not result in a greater quantity of meaningful patches or bug reports. What it does do is create more work for me dealing with low-effort contributions I must reject.

I'm talking about a daunting asymmetry in communication. Writing two hurried sentences in a GitHub issue or pushing a bunch of untested code my way in a pull request can take all of a minute for the submitter. On the other hand, I don't want to discourage people from contributing to free software and I know how frustrating contributing to open source projects can be (see my post about drive-by contributions). So I try to take some time to study the pull request and write an intelligible and useful answer. However, this is simply not sustainable. Looking back, I also seem to often fail at not letting my frustration show through in my answers. Hence I feel that requiring contributors to at least know how to use git format-patch and write an email forms a useful barrier to entry. It prevents frustration at both ends, and I believe that for a well thought-out contribution the overhead of opening a mail client should be negligible.

Of course, if the project is not officially present on GitHub you get the current situation, where multiple public copies of the project still exist on GitHub, made by random people for their own use. These copies often keep my contact details in and don't clearly state that the code has been modified and/or is not related to the PyPi releases. This causes confusion, since the code on GitHub is not the same as the one on PyPi. People also sometimes reuse version numbers for their own private use that conflict with the version numbers on PyPi, and so on. It really is a damned-if-you-do, damned-if-you-don't situation.


How can I sum this up? I've maintained this software for around 8 years, well after I left the company for which it was originally developed. During that time people have forked and rewritten it for various, largely non-technical reasons. That's fine. It's how free software is supposed to work, and my own package was based on another one that got abandoned. I might still have been happy to work on technical issues, but the part that turned out to be much more exhausting than working on the code was dealing with the social and ideological issues people had with it. It's probably my failing that I've spent so much thought on those. In the end, my own interests have changed during that time as well, and finally letting it go feels like a stone off my shoulders.

Posted by Tomaž | Categories: Code | Comments »

On reliability of pogo pins

26.11.2019 20:42

A bit over a year ago I designed and built a device for testing assembled printed circuit boards as they come off the assembly line. While I'm not new to electronic test fixtures, this was the first time I used the bed-of-nails approach: the test jig has a number of spring-loaded pogo pins that make contact with various test pads on the device-under-test (DUT). This setup has now been through thousands of cycles and has proved itself capable of detecting a large variety of defects, without doubt preventing many expensive debugging sessions.

However, one problem that has constantly troubled this setup since the beginning is its unreliability. Even after a lot of fussing around with various adjustments, the procedure still has an abysmal false alarm rate compared to the actual rate of manufacturing defects. In many cases the operator must remove and re-seat the DUT and restart the test several times before it signals a pass. Such test repetitions obviously cause a lot of frustration, decrease confidence in the testing procedure and significantly lengthen a test that would otherwise take only a few moments. All evidence, like the fact that the detected defect types appear completely random and that most test failures disappear when re-seating the DUT, firmly points towards the pogo pins as the cause.

I was surprised at this outcome, since I've never heard about bad contacts being such a problem with pogo pins. There are quite a few blog posts and basic tutorials around about the pogo pin test jigs. Hacker Noon mentions that getting the fine mechanical details correct can be tricky. The Big Mess o' Wires blog says that their test board only worked reliably after three iterations of the design. Thom wrote that they didn't have many issues with contacts on their test jig. It seems that reliability is not a common problem people have with pogo pins, once initial mechanical problems have been ironed out.

Pogo pins mounted on a test fixture.

My bed of nails setup is shown above. It uses P75-type pogo pins - a widely available, cheap variant of uncertain origin. For example, they are sold by Adafruit. The whole bed has 21 pins and uses a combination of needle heads (P75-B1) and cupped heads (P75-A1). There was not enough PCB space on the DUT for all the required test pads so I used cupped head pins to mate with the underside of THT connector pins. P75 pogo pins seem to use exposed steel for the head and plunger (they are slightly magnetic) and only have the gold plating on the bottom body part. I'm not using the mounting sleeves. The pin bodies are directly soldered to the test jig PCB.

The mechanical parts have been removed in the photograph above, but you can get an idea of how they look from the CAD render below. During the test the DUT is securely fixed onto the pins using a clamp, centering pins and a frame. This setup is similar to the one described by Hacker Noon. The difference is that I'm using two parallel PCB boards to position the pins instead of 3D printed parts. The setup was designed so that the pogo pins only compress to approximately half of their 100 mil travel. The mechanical frame carries most of the clamping force.

The boards I'm testing have a lead-free HASL finish and there is no solder paste applied to the test pads. This means that test pads might be sensitive to oxidation. However that shouldn't be a problem since the test is applied shortly after production. It's also worth mentioning that I'm testing an analog circuit. Compared to purely digital tests these are more sensitive to the resistance between the test fixture and the DUT.

CAD drawing of the test device with the bed-of-nails.

Since I have a lot of data collected from the test device I thought statistical analysis might shed some light on the reliability problem. If not directly showing a way to improve the existing device, perhaps it would at least give me some idea what can be expected from pogo pins when designing future test fixtures.

The first thing I was interested in was the resistance between a pogo pin on the test fixture and its corresponding test pad on the DUT. The test procedure was not designed to measure this directly. Fortunately, however, I found a way to estimate the test point resistance for two specific pogo pins (out of 21). I calculated their resistances from certain other measurements taken during the test procedure. Of course, this is not as good as a direct measurement, and the estimate is still affected somewhat by variations in some components on the DUT, the test device and the resistances of other test points. A Monte Carlo simulation showed an error in the resistance estimate of less than 10 mΩ due to these effects.

As luck would have it, one of the pogo pins I was able to estimate was using the needle head while the second one was using the cupped head. This resulted in the following two histograms of resistances to two test points. They show how commonly each of the two test points exhibited a certain resistance over thousands of matings with the DUT:

Histogram of resistances through a needle-head pogo pin.

Histogram of resistances through a cupped-head pogo pin.

Different colors show data from different DUT production batches. Overall, you can see that most commonly the connection resulted in a resistance of around 0.1 Ω and the majority of connections were below 0.5 Ω. This is pretty good, even if somewhat above the 50 mΩ rated contact resistance for this type of pin. The cupped head pin showed less variance than the needle head. Still, the values show much higher variance than the estimated 10 mΩ error, which gives some confidence that this is actually due to changing contact resistances of the pogo pins.

However, one thing that is not visible in these plots is the fact that some connections resulted in estimates well over 1 Ω (approximately 10% for the needle head and 6% for the cupped head). I could also only produce this estimate when the test progressed to the point where certain voltage measurements had been made (which depend on a reasonably good contact over 4 pogo pins for the needle-head estimate and 2 pogo pins for the cupped-head estimate). Hence test runs where these measurements were not taken are not included in the histograms above.

So what about these failed attempts? One way to show them is the number of test repetitions that a DUT had to undergo before a test first passed. Using records of thousands of tests, the following histogram emerged:

Number of test repetitions required before the first pass.

Again, the colors show data from different production batches. Overall, approximately 60% of DUTs passed on the first test attempt. A bit above 20% passed on the second and around 10% on the third attempt. You can also see some differences in batches. For example, the batch shown in red was particularly bad and more DUTs required a second repetition than passed the first test. Number of DUTs that failed the test 10 times or more is very small - mostly these are the DUTs that actually had a manufacturing defect and didn't fail due to a false reading on the test fixture.

The histogram shows a nicely exponential characteristic (a geometric distribution) - exactly what you would expect if each test repetition were an independent random event with probability P_{pass} of succeeding. From the data I can estimate that:

P_{pass} = 59.4\%

If I further assume that a test succeeds when all pogo pins contact successfully, and that each of the 21 pogo pin contacts is an independent random event in itself, I can calculate the probability P_{fail-pin} that a single pogo pin fails to make a good contact:

P_{fail-pin} = 1 - \sqrt[21]{P_{pass}} \approx 2.4\%

Using this model, I can also predict the probability that a DUT first passes the test on the N-th repetition:

P_{pass-after-N-repetitions} = (1 - P_{pass})^{N-1} \cdot P_{pass}

This model fits almost perfectly with the measured histogram, as you can see on the picture below. The predicted number of test repetitions before first pass (red) is laid over the histogram of measurements (gray).

Comparing the model for test repetitions to measurements.

The model also fits reasonably well with the number of cases where I estimated test point resistances above 1 Ω. This might be a bit hand-wavy, since it's hard to see how different failures would affect the results. For the needle-head test point I've seen approximately 10% of cases where the resistance was above 1 Ω. This fits well with the fact that 4 points needed to be well connected for the measurement to be accurate, given the 2.4% failure rate per connection:

P_{fail} = 1 - (1 - P_{fail-pin})^{N_{pins}} = 1 - (1 - 2.4\%)^4 = 9.3\%

Similarly for the cupped pin measurement, where I've seen 6% of measurements above 1 Ω and required 2 points to be well connected:

P_{fail} = 1 - (1 - 2.4\%)^2 = 4.7\%
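
All the numbers above can be reproduced with a few lines of Python:

p_pass = 0.594
p_fail_pin = 1 - p_pass ** (1 / 21)
print(p_fail_pin)                    # ~0.024, i.e. 2.4% per pin

# Geometric distribution of attempts until the first pass.
for n in range(1, 6):
    print(n, (1 - p_pass) ** (n - 1) * p_pass)

# Probability of a bad estimate for 4 and 2 required contacts.
for n_pins in (4, 2):
    print(n_pins, 1 - (1 - p_fail_pin) ** n_pins)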

In conclusion, my data shows that an individual pogo pin has approximately a 2.4% chance of not mating correctly with its test point. When they do contact correctly, the pins usually show a reasonably low resistance of approximately 100 mΩ between the pin and the test pad, with the worst cases below 500 mΩ. It's not clear from the data what is causing such a high rate of unsuccessful connections. Since the failure rate varies from batch to batch, at least part of it seems related in some way to the production process (for example, oxide or flux residue on the test pads). On the other hand, it's also possible that the pins themselves are responsible for these failures. The bad contact might in fact be between the plunger and the pin body, not between the head and the test pad. In that case it might be worth experimenting with more expensive pogo pins that have gold-plated heads and plungers.

Posted by Tomaž | Categories: Analog | Comments »

ZX81 LPRINT bug and software archaeology

04.11.2019 19:07

By some coincidence I happened to stumble upon a week-old, unanswered question posted to Hacker News regarding a bug in Sinclair BASIC on a Timex Sinclair 1000 microcomputer. While I never owned a TS1000, the post attracted my interest. I've studied the ZX81, an almost identical microcomputer, extensively when I was doing my research on Galaksija. It also reminded me of a now almost-forgotten idea to write a post on some obscure BASIC bugs in Galaksija's ROM that I found mentioned in contemporary literature.

ZX81 exhibited at the Frisk festival.

The question on Hacker News is about the cause of a bug where the computer, when attached to a printer, would print out certain floating point numbers incorrectly. The most famous example, mentioned in the Wikipedia article on Timex Sinclair 1000, is the printout of 0.00001. The BASIC statement:

LPRINT 0.00001

unexpectedly types out the following on paper:

0.0XYZ1

This bug occurs both on the Timex Sinclair 1000 and on the Sinclair ZX81, since both computers share the same ROM code. Only the first zero after the decimal point is printed correctly, while the subsequent zeros are replaced with seemingly random alphanumeric characters. The non-zero digit at the end is again printed correctly. Interestingly, this only happens when using the LPRINT (line-printer print) statement that makes a hard copy of the output on paper using a printer. The similar PRINT statement that displays the output on the TV screen works correctly (you can try it out in JtyOne's Online Emulator).

The cause of the bug lies in the code that takes a numerical value in the internal format of the BASIC's floating point calculator and prints out individual characters. One particular part of the code determines the number of zeros after the decimal point and uses a loop to print them out:

;; PF-ZEROS
L16B2:  NEG                     ; Prepare number of zeros
        LD      B,A             ; to print in B.

        LD      A,$1B           ; Print out character '.'
        RST     10H             ; 

        LD      A,$1C           ; Prepare character '0' 
				; to print out in A.

;; PF-ZRO-LP
L16BA:  RST     10H             ; Call "print character" routine
        DJNZ    L16BA           ; and loop back B times.

(This assembly listing is taken from Geoff Wearmouth's disassembly. Comments are mine.)

The restart 10h takes a character code in register A and either prints it on the screen or sends it to the printer. Restarts are a bit like simple system calls - they are an efficient way to call an often-used routine on the Z80 CPU. The problem lies in the fact that this restart doesn't preserve the contents of the A register. It does preserve the contents of register B and the other main registers through the use of the EXX instruction and the shadow registers, however the original contents of A are lost after the call returns.

Since the code above doesn't reset the contents of the A register after each iteration, only the first zero after the decimal point is printed correctly. Subsequent zeros are replaced with whatever junk was left in the A register by the 10h restart code. The solution is to simply adjust the DJNZ instruction to loop back two bytes earlier, to the LD instruction, so that the character code is stored in A on each iteration. You can see this fix in Geoff's customized ZX81 ROM, or in the Timex Sinclair 1500 ROM (see line 3835 in this diff between TS1500 and TS1000).
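
Spelled out, the repaired loop looks something like this (my reconstruction of the fix described above, following the conventions of Geoff's listing):

;; PF-ZRO-LP
L16B8:  LD      A,$1C           ; Reload character '0' on every iteration,
        RST     10H             ; since restart 10h clobbers register A.
        DJNZ    L16B8           ; Loop back B times, now to the LD.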

This exact same code is also used when displaying numbers on the TV screen, yet in that case it works correctly. The reason is that when output is set to the screen, printing the character '0' via the 10h restart happens to preserve the contents of register A. Looking at the disassembly I suspect that was simply a lucky coincidence and not a conscious decision by the programmer. Code calling 10h doesn't know whether the printer or the screen is in use, and hence must assume that A isn't preserved anyway.


Of course, I'm far from being the first person to write about this particular Sinclair bug. Why then does the post on Hacker News say that there's little information to be found about it? The Wikipedia article doesn't cite a reference for this bug either.

It turns out that during my search for the answer, the three most useful pages were no longer on-line. Paul Farrow's ZX resource centre, S. C. Agate's ZX81 ROMs page and Geoff Wearmouth's Sinclair ROM disassemblies are wonderful historical resources that must have taken a lot of love and effort to put together. Sadly, they are now only accessible through snapshots on the Internet Archive's Wayback Machine. If I hadn't known about them beforehand, I probably wouldn't have found them now. For the last one you even need to know which particular time range to look at on Archive.org, since the domain was taken over by squatters and recent snapshots only show ads (incidentally, this is also the reason why I'm re-hosting some of its former content).

I feel like we can still learn a lot from these early home computers and I'm happy that questions about them still pop up in various forums. This LPRINT bug seems to be a case of faulty generalization: a well-known type of mistake where the programmer wrongly generalizes an assumption (10h preserves A) that is in fact only true in a special case (displaying a character on the screen). History tends to repeat itself and I believe that many of the blunders in modern software wouldn't happen if software developers were more aware of the history of their trade.

It's sad that these old devices are disappearing and that primary literature about them is hard to find, but I find it even more concerning that these secondary sources now also seem to be slowly fading from general accessibility on the web.

Posted by Tomaž | Categories: Code | Comments »