On hunting non-deterministic bugs

26.10.2014 14:13

Bugs that don't seem to consistently manifest themselves are one of the most time consuming problems to solve in software development. In multi-tasking operating systems they are typically caused by race conditions between threads or details in memory management. They are perhaps even more common in embedded software. Programs on microcontrollers are typically interfacing with external processes that run asynchronously by their very nature. If you mix software and hardware development, unexpected software conditions may even be triggered by truly random events on improperly designed hardware.

When dealing with such bugs, first thing you need to realize is that you are in fact looking for a bug that only appears sometimes. I have seen many commits and comments by developers that have seen a bug manifest itself, wrote a quick fix and thought they have fixed a bug, since it didn't happen the second way around. These are typically changes that, after closer scrutiny, do not actually have any effect on the process they are supposedly fixing. Often this is connected with incomplete knowledge of the workings of the program or development tools. In other cases, the fact that such a change apparently fixed an application bug is blamed on bugs in compilers or other parts of the toolchain.

You can only approach non-deterministic processes with statistics. And first requirement of doing any meaningful statistics is a significant sample size. The corollary of this is that automated tests are a must when you suspect a non-deterministic bug. Checking if running a test 100 times resulted in any failures should require no more than checking a single line of terminal output. If your debugging strategy includes manually checking if a particular printf line got hit out of hundreds lines of other debugging output, you won't be able to consistently tell whether the bug happened or not after half a day of debugging, much less run a hundred repetitions and have any kind of confidence in the result.

Say you've seen a bug manifest itself in 1 run of a test out of 10. You then look at the code, find a possible problem and implement a fix. How many repetitions of the test must you run to be reasonably sure that you have actually fixed the bug and you weren't just lucky the second run around?

In the first approximation, we can assume the probability Pfail of the bug manifesting itself is:

P_{fail} = \frac{1}{n} = \frac{1}{10}

The question whether your tests passed due to sheer luck then translates to the probability of seeing zero occurrences of an event with probability Pfail after m repetitions. The number of occurrences has a binomial distribution. Given the desired probability Ptest of our the test giving the correct result, the required number of repetitions m is:

m = \frac{\log{(1 - P_{test})}}{\log{(1 - P_{fail})}} = \frac{\log{(1-P_{test})}}{\log{(1 - \frac{1}{n})}}

It turns out the ratio between m and n is more or less constant for practical values of n (e.g. >10):

Ratio between number of test runs in discovery and number of required verification test runs.

\frac{m}{n} \approx -\log{(1 - P_{test})}

For instance, if you want to be 99% sure that your fix actually worked and that the test did not pass purely by chance, you need to run around 4.6 times more repetitions than those you used initially when discovering the bug.

This is not the whole story though. If you've seen a bug once in 10 runs, Pfail=0.1 is only the most likely estimate for the probability of its occurrence. It might be actually higher or lower and you've only seen one failure by chance:

PDF for probability of failure when one failure was seen in 10 test runs.

If you want to also account for the uncertainty in Pfail, the derivation of m gets a bit complex. It involves using the beta distribution for the likelihood of the Pfail estimate, deriving Ptest from the law of total probability and then solving for m. The end result, however, is similarly straightforward and can be summarized in a simple table:

Ptest [%]m/n

Even this still assumes the bug basically behaves as a weighted coin, whose flips are independent of each other and whose probability doesn't change with time. This might or might not be a good model. It probably works well for problems in embedded systems where a bug is caused by small physical variations in signal timings. Problems with memory management or heisenbugs on the other hand can behave in a completely different way.

Assuming the analysis above works, a good rule of thumb therefore seems to be that if you discovered a bug using n repetitions of the test, checking whether it has been fixed or not should be done using at least 10·n repetitions. Of course, you can never be absolutely certain. Using factor of 10 only means that you will on average mark a bug fixed, when in fact it is not, once out of hundred debugging sessions. It's usually worth understanding why the change fixed the bug in addition to seeing the test suite pass.

Posted by Tomaž | Categories: Ideas

Add a new comment

(No HTML tags allowed. Separate paragraphs with a blank line.)