On reliability of pogo pins

26.11.2019 20:42

A bit over a year ago I designed and built a device for testing assembled printed circuit boards as they come off the assembly line. While I'm not new to electronic test fixtures, this was the first time I used the bed-of-nails approach: the test jig has a number of spring-loaded pogo pins that make contact with various test pads on the device-under-test (DUT). This setup has now gone through thousands of cycles and the device has proved itself capable of detecting a large variety of defects, without doubt preventing many expensive debugging sessions.

However, one problem that has constantly troubled this setup since the beginning is its unreliability. Even after a lot of fussing around with various adjustments, the procedure still has an abysmal false error rate compared to the actual rate of manufacturing defects. In many cases, the operator must remove and re-seat the DUT and restart the test several times before the test signals a pass. Such test repetitions obviously cause a lot of frustration, decrease confidence in the testing procedure and significantly lengthen a test that would otherwise take only a few moments. All evidence, like the fact that the detected defect types appear completely random and that most test failures disappear when re-seating the DUT, firmly points towards the pogo pins as the cause.

I was surprised at this outcome, since I had never heard about bad contacts being such a problem with pogo pins. There are quite a few blog posts and basic tutorials around about pogo pin test jigs. Hacker Noon mentions that getting the fine mechanical details correct can be tricky. The Big Mess o' Wires blog says that their test board only worked reliably after three iterations of the design. Thom wrote that they didn't have many issues with contacts on their test jig. It seems that reliability is not a common problem people have with pogo pins, once the initial mechanical problems have been ironed out.

Pogo pins mounted on a test fixture.

My bed of nails setup is shown above. It uses P75-type pogo pins - a widely available, cheap variant of uncertain origin. For example, they are sold by Adafruit. The whole bed has 21 pins and uses a combination of needle heads (P75-B1) and cupped heads (P75-A1). There was not enough PCB space on the DUT for all the required test pads so I used cupped head pins to mate with the underside of THT connector pins. P75 pogo pins seem to use exposed steel for the head and plunger (they are slightly magnetic) and only have the gold plating on the bottom body part. I'm not using the mounting sleeves. The pin bodies are directly soldered to the test jig PCB.

The mechanical parts have been removed in the photograph above, but you can get an idea of how they look from the CAD render below. During the test the DUT is securely fixed onto the pins using a clamp, centering pins and a frame. This setup is similar to the one described by Hacker Noon. The difference is that I'm using two parallel PCBs to position the pins instead of 3D printed parts. The setup was designed so that the pogo pins only compress to approximately half of their 100 mil travel. The mechanical frame carries most of the clamping force.

The boards I'm testing have a lead-free HASL finish and there is no solder paste applied to the test pads. This means that the test pads might be sensitive to oxidation. However, that shouldn't be a problem since the test is applied shortly after production. It's also worth mentioning that I'm testing an analog circuit. Compared to purely digital tests, analog tests are more sensitive to the resistance between the test fixture and the DUT.

CAD drawing of the test device with the bed-of-nails.

Since I have a lot of data collected from the test device I thought statistical analysis might shed some light on the reliability problem. If not directly showing a way to improve the existing device, perhaps it would at least give me some idea what can be expected from pogo pins when designing future test fixtures.

The first thing I was interested in was the resistance between a pogo pin on the test fixture and its corresponding test pad on the DUT. The test procedure was not designed to measure this directly. Fortunately, however, I found a way to estimate the test point resistance for two specific pogo pins (out of 21). I calculated their resistances from certain other measurements I took during the test procedure. Of course, this is not as good as a direct measurement and the estimate is still affected somewhat by variations in some components on the DUT, the test device and the resistances of other test points. A Monte Carlo simulation showed an error in the resistance estimate of less than 10 mΩ due to these effects.

As luck would have it, one of the pogo pins I was able to estimate was using the needle head while the second one was using the cupped head. This resulted in the following two histograms of resistances to two test points. They show how commonly each of the two test points exhibited a certain resistance over thousands of matings with the DUT:

Histogram of resistances through a needle-head pogo pin.

Histogram of resistances through a cupped-head pogo pin.

Different colors show data from different DUT production batches. Overall, you can see that the connection most commonly resulted in a resistance of around 0.1 Ω and that the majority of connections were below 0.5 Ω. This is pretty good, even if somewhat above the 50 mΩ rated contact resistance for this type of pin. The cupped head pin showed less variance than the needle head. Still, the values show much higher variance than the estimated 10 mΩ error, which gives some confidence that this is actually due to changing contact resistances of the pogo pins.

However, one thing that is not visible on these plots is the fact that some connections resulted in estimates well over 1 Ω (approximately 10% for the needle head and 6% for the cupped head). I could also only produce this estimate when the test progressed to the point where some voltage measurements had been made (these depend on a reasonably good contact over 4 pogo pins for the needle-head pin and 2 pogo pins for the cupped-head pin). Hence test runs where these measurements were not taken are not included in the histograms above.

So what about these failed attempts? One way to show them is the number of test repetitions that a DUT had to undergo before a test first passed. Using records of thousands of tests, the following histogram emerged:

Number of test repetitions required before the first pass.

Again, the colors show data from different production batches. Overall, approximately 60% of DUTs passed on the first test attempt. A bit above 20% passed on the second and around 10% on the third attempt. You can also see some differences between batches. For example, the batch shown in red was particularly bad and more DUTs required a second repetition than passed the first test. The number of DUTs that failed the test 10 times or more is very small - mostly these are the DUTs that actually had a manufacturing defect and didn't fail due to a false reading on the test fixture.

The histogram shows a nicely exponential characteristic - exactly what you would expect if each test repetition was an independent random event with a probability P_{pass} of succeeding. From the data I can estimate that:

P_{pass} = 59.4\%

If I further assume that a test will succeed only if all pogo pins contact successfully, and that each of the 21 pogo pin contacts is an independent random event by itself, I can calculate the probability P_{fail-pin} that a pogo pin will fail to make a good contact:

P_{fail-pin} = 1 - \sqrt[21]{P_{pass}} \approx 2.4\%
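
As a quick sanity check of the number above, here is a minimal Python sketch that inverts the relation P_{pass} = (1 - P_{fail-pin})^21. The function and constant names are made up for this illustration and are not part of the test software. The only inputs are the 59.4% pass rate and the 21 pins mentioned in the text, and the result only holds under the same independence assumption.

```python
# Per-pin failure probability implied by the first-attempt pass rate,
# assuming every pin is independent and all pins must make contact.
P_PASS = 0.594   # first-attempt pass rate quoted in the text
N_PINS = 21      # number of pogo pins in the bed of nails


def pin_failure_probability(p_pass: float, n_pins: int) -> float:
    """Solve p_pass = (1 - p_fail_pin) ** n_pins for p_fail_pin."""
    return 1.0 - p_pass ** (1.0 / n_pins)


p_fail_pin = pin_failure_probability(P_PASS, N_PINS)
print(f"P_fail-pin = {p_fail_pin:.2%}")
```

The printed value should land between 2.4% and 2.5%, in line with the figure used in the rest of the post.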

Using this model, I can in turn predict the probability that a DUT will first pass the test on the N-th repetition:

P_{pass-after-N-repetitions} = (1 - P_{pass})^{N-1} \cdot P_{pass}

This model fits almost perfectly with the measured histogram, as you can see on the picture below. The predicted number of test repetitions before first pass (red) is laid over the histogram of measurements (gray).

Comparing the model for test repetitions to measurements.
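
The repetition model is simply a geometric distribution, which can be sketched in a few lines of Python. This only illustrates the formula and is not the code behind the plot. The function name is hypothetical and the 59.4% value is the P_{pass} estimate from the text.

```python
# Geometric model: probability that a DUT first passes on attempt n,
# assuming every attempt independently passes with probability p_pass.
P_PASS = 0.594   # estimate quoted in the text


def p_first_pass_on_attempt(n: int, p_pass: float = P_PASS) -> float:
    """Return (1 - p_pass) ** (n - 1) * p_pass for attempt n >= 1."""
    if n < 1:
        raise ValueError("attempt number must be at least 1")
    return (1.0 - p_pass) ** (n - 1) * p_pass


for n in range(1, 6):
    print(f"attempt {n}: {p_first_pass_on_attempt(n):.1%}")

# Mean number of attempts for a geometric distribution is 1 / p_pass.
print(f"expected attempts per DUT: {1.0 / P_PASS:.2f}")
```

With these numbers the model gives about 24% for the second attempt and just under 10% for the third, which is consistent with the "a bit above 20%" and "around 10%" read off the histogram.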

The model also fits reasonably well with the number of cases where I've estimated test point resistances above 1 Ω. This might be a bit handwavy since it's hard to see how different failures would affect the results. For the needle-head test point I've seen approximately 10% of cases where the resistance was above 1 Ω. This fits well with the fact that 4 points needed to be well connected for the measurement to be accurate, combined with the 2.4% failure rate per connection:

P_{fail} = 1 - (1 - P_{fail-pin})^{N_{pins}} = 1 - (1 - 2.4\%)^4 = 9.3\%

Similarly for the cupped pin measurement, where I've seen 6% of measurements above 1 Ω and required 2 points to be well connected:

P_{fail} = 1 - (1 - 2.4\%)^2 = 4.7\%
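
The two calculations above follow the same pattern, so they can be checked with one small helper. This is a sketch under the same independence assumption, with a hypothetical function name, using the rounded 2.4% per-pin figure from the text.

```python
# Probability that at least one of n_pins independent contacts fails.
P_FAIL_PIN = 0.024   # rounded per-pin failure probability from the text


def group_failure_probability(p_fail_pin: float, n_pins: int) -> float:
    """Return 1 - (1 - p_fail_pin) ** n_pins."""
    return 1.0 - (1.0 - p_fail_pin) ** n_pins


for label, n_pins in [("needle head, 4 pins", 4),
                      ("cupped head, 2 pins", 2),
                      ("whole bed, 21 pins", 21)]:
    p = group_failure_probability(P_FAIL_PIN, n_pins)
    print(f"{label}: {p:.1%}")
```

The last line of the loop is a consistency check: with 21 pins the failure probability comes out at roughly 40%, which matches the roughly 60% first-attempt pass rate the model started from.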

In conclusion, my data shows that individual pogo pins seem to have an approximately 2.4% chance of not mating correctly with their test points. When they do make contact, they usually show a reasonably low resistance of approximately 100 mΩ between the pin and the test pad, with the worst cases staying below 500 mΩ. It's not clear from the data what is causing such a high rate of unsuccessful connections. The fact that the failure rate varies from batch to batch suggests that at least part of it is related in some way to the production process (for example, oxide or flux residue on the test pads). On the other hand, it's also possible that the pins themselves are responsible for these failures. The bad contact might in fact be between the plunger and the pin body, not between the head and the test pad. In that case it might be worth experimenting with the more expensive pogo pins that have gold-plated heads and plungers.

Posted by Tomaž | Categories: Analog

ZX81 LPRINT bug and software archaeology

04.11.2019 19:07

By some coincidence I happened to stumble upon a week-old, unanswered question posted to Hacker News regarding a bug in Sinclair BASIC on a Timex Sinclair 1000 microcomputer. While I never owned a TS1000, the post attracted my interest. I studied the ZX81, an almost identical microcomputer, extensively when I was doing my research on Galaksija. It also reminded me of a now almost forgotten idea to write a post on some obscure BASIC bugs in Galaksija's ROM that I found mentioned in contemporary literature.

ZX81 exhibited at the Frisk festival.

The question on Hacker News is about the cause of a bug where the computer, when attached to a printer, would print out certain floating point numbers incorrectly. The most famous example, mentioned in the Wikipedia article on Timex Sinclair 1000, is the printout of 0.00001. The BASIC statement:

LPRINT 0.00001

unexpectedly types out the following on paper:

0.0XYZ1

This bug occurs on both the Timex Sinclair 1000 and the Sinclair ZX81, since both computers share the same ROM code. Only the first zero after the decimal point is printed correctly while the subsequent zeros seem to be replaced with random alphanumeric characters. The non-zero digit at the end is again printed correctly. Interestingly, this only happens when using the LPRINT (line-printer print) statement that makes a hard copy of the output on paper using a printer. The similar PRINT statement that displays the output on the TV screen works correctly (you can try it out on JtyOne's Online Emulator).

The cause of the bug lies in the code that takes a numerical value in the internal format of the BASIC's floating point calculator and prints out individual characters. One particular part of the code determines the number of zeros after the decimal point and uses a loop to print them out:

;; PF-ZEROS
L16B2:  NEG                     ; Prepare number of zeros
        LD      B,A             ; to print in B.

        LD      A,$1B           ; Print out character '.'
        RST     10H             ; 

        LD      A,$1C           ; Prepare character '0'
                                ; to print out in A.

;; PF-ZRO-LP
L16BA:  RST     10H             ; Call "print character" routine
        DJNZ    L16BA           ; and loop back B times.

(This assembly listing is taken from Geoff Wearmouth's disassembly. Comments are mine.)

The restart 10h takes a character code in register A and either prints it out on the screen or sends it to the printer. Restarts are a bit like simple system calls - they are an efficient way to call an often-used routine on the Z80 CPU. The problem lies in the fact that this restart doesn't preserve the contents of the A register. It does preserve the contents of register B and other main registers through the use of the EXX instruction and the shadow registers, but the original contents of A are lost after the call returns.

Since the code above doesn't reload the A register in each iteration, only the first zero after the decimal point is printed correctly. Subsequent zeros are replaced with whatever junk was left in the A register by the 10h restart code. The solution is to simply adjust the DJNZ instruction to loop back two bytes earlier, to the LD instruction, so that the character code is stored to A in each iteration. You can see this fix in Geoff's customized ZX81 ROM, or in the Timex Sinclair 1500 ROM (see line 3835 in this diff between TS1500 and TS1000).
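
The difference between the original and the fixed loop can be modelled in a few lines of Python. This is a toy model and not an emulator: the Cpu class, the rst_10h method and the junk value 0x3D are all assumptions made for illustration. In particular, the real routine leaves different junk in A on each call (hence "XYZ" on paper), while this sketch uses one constant value.

```python
# Toy model of the PF-ZEROS loop. Only registers A and B and the side
# effect of RST 10H on A are modelled; everything else is left out.
CHAR_DOT = 0x1B        # ZX81 character code for '.'
CHAR_ZERO = 0x1C       # ZX81 character code for '0'
ASSUMED_JUNK = 0x3D    # made-up leftover value in A (decodes to 'X')


class Cpu:
    def __init__(self, to_printer):
        self.a = 0
        self.b = 0
        self.to_printer = to_printer
        self.output = []

    def rst_10h(self):
        """Print the character in A. On the printer path A gets clobbered."""
        self.output.append(self.a)
        if self.to_printer:
            self.a = ASSUMED_JUNK


def print_zeros(cpu, count, fixed):
    """Print '.' followed by `count` zeros (count must be 1..255)."""
    cpu.b = count                   # LD B,A
    cpu.a = CHAR_DOT                # LD A,$1B
    cpu.rst_10h()                   # RST 10H
    cpu.a = CHAR_ZERO               # LD A,$1C
    while True:
        if fixed:
            cpu.a = CHAR_ZERO       # fixed ROM: DJNZ jumps back to the LD
        cpu.rst_10h()               # RST 10H
        cpu.b = (cpu.b - 1) & 0xFF  # DJNZ decrements B ...
        if cpu.b == 0:              # ... and falls through at zero
            break
    return cpu.output


def decode(codes):
    """Translate the ZX81 codes for '.', digits and letters into ASCII."""
    chars = []
    for c in codes:
        if c == CHAR_DOT:
            chars.append(".")
        elif 0x1C <= c <= 0x25:
            chars.append(chr(ord("0") + c - 0x1C))
        elif 0x26 <= c <= 0x3F:
            chars.append(chr(ord("A") + c - 0x26))
        else:
            chars.append("?")
    return "".join(chars)


print(decode(print_zeros(Cpu(to_printer=True), 4, fixed=False)))
print(decode(print_zeros(Cpu(to_printer=True), 4, fixed=True)))
print(decode(print_zeros(Cpu(to_printer=False), 4, fixed=False)))
```

For the four zeros in 0.00001, the model prints ".0XXX" for the original loop on the printer path and ".0000" for both the fixed loop and the original loop on the screen path.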

This exact same code is also used when displaying numbers on the TV screen, however in that case it works correctly. The reason is that when set to print to screen, printing character 0 via the 10h restart actually preserves the contents of register A. Looking at the disassembly I suspect that was simply a lucky coincidence and not a conscious decision by the programmer. Any code calling 10h doesn't know whether the printer or the screen is used, and hence must assume that A isn't preserved anyway.


Of course, I'm far from being the first person to write about this particular Sinclair bug. Why then does the post on Hacker News say that there's little information to be found about it? The Wikipedia article doesn't cite a reference for this bug either.

It turns out that during my search for the answer, the three most useful pages were no longer on-line. Paul Farrow's ZX resource centre, S. C. Agate's ZX81 ROMs page and Geoff Wearmouth's Sinclair ROM disassemblies are wonderful historical resources that must have taken a lot of love and effort to put together. Sadly, they are now only accessible through the snapshots on the Internet Archive's Wayback Machine. If I hadn't known about them beforehand, I probably wouldn't have found them now. For the last one you even need to know what particular time range to look at on Archive.org, since the domain was taken over by squatters and recent snapshots only show ads (incidentally, this is also the reason why I'm re-hosting some of its former content).

I feel like we can still learn a lot from these early home computers and I'm happy that questions about them still pop up in various forums. This LPRINT bug seems to be a case of a faulty generalization. It's a well-known type of mistake where the programmer wrongly generalizes an assumption (10h preserves A) that is in fact only true in a special case (displaying a character on the screen). History tends to repeat itself and I believe that many of the blunders in modern software wouldn't happen if software developers were more aware of the history of their trade.

It's sad that these old devices are disappearing and that primary literature sources about them are hard to find, but I find it even more concerning that these secondary sources now also seem to be slowly fading from general accessibility on the web.

Posted by Tomaž | Categories: Code