Here's another story about a debugging session that took an embarrassing amount of time and effort, and confirmed once again that a well thought-out design for debuggability pays for itself many times over in stomach acid saved and general developer well-being.
As you might remember, the UHF receiver I designed a while back has been deployed on VESNA sensor nodes in a cognitive radio testbed. When the first demos of the deployment were presented, however, it became apparent that nodes equipped with it had a problem: experiments would often fail mysteriously and had to be repeated a few times before valid measurements could be retrieved.
In this particular case, the firmware running on VESNA's ARM CPU implements a very simple, home-brew scheduler. An experimenter sends a list of tasks over the management network to the sensor node and the scheduler attempts to execute them roughly at the specified times. After a while, the experimenter asks the node whether the tasks have been completed and, if they have, requests a download of the recorded data. Often, however, nodes would simply keep replying that the tasks had not been completed, well after the last task should have finished.
This kind of problem was specific to nodes equipped with the UHF receiver (not that other nodes don't have problems of their own) and seemingly limited to sensor nodes deployed high on light poles, as the one test article on my desk refused to exhibit this bug.
A hint of what might be going on came when I started monitoring the uptime of sensor nodes with Munin. When the bug manifested itself, the number of seconds since the CPU reboot fell to zero, making it apparent that the node was resetting itself during experiments. Upon reset the scheduler would forget about the scheduled tasks stored in volatile memory, leaving the non-volatile state queried by experimenters in a perpetual running state. This oversight in the scheduler design, together with the fact that the node can't signal errors back to the infrastructure over the management network protocol, had already made debugging this issue harder than necessary.
The next step was to determine what was causing these resets. Fortunately, the STM32F103 CPU used on VESNA provides a helpful set of flags in the RCC_CSR register that lets you distinguish between six different CPU reset causes. Unfortunately, the bootloader on deployed nodes clobbers the register, leaving no way to determine its value after a reboot.
Back to square one, I reasoned that the resets might be hardware related. Since the UHF tuner is quite power hungry, I guessed that they might have something to do with the power supply on deployed nodes. I also suspected hangs in the STM32's I2C interface, which is supposedly notoriously buggy when presented with marginal signals on the bus. Add the fact that the weatherproof plastic boxes used to house sensor nodes turned out not to be as resistant to rain as we originally hoped, and hardware-related problems did not seem that far-fetched.
However, poking around the circuit did not reveal anything obviously wrong, and with no way to reproduce the problem in the lab I came to another dead end.
The next breakthrough came when I managed to reproduce the problem on a node that had been unmounted and brought back to the lab. It turned out that on this node a specific command would result in a reset in around 2 cases out of 100. That might not sound like much, but a real-life experiment typically consists of many such tasks, adding up to a much higher probability of failure. Still, it took around two hours to reliably reproduce a reset this way, which meant carefully tracking failure probabilities against physical nodes and firmware versions. This tracking later proved that all nodes exhibited the bug, even the test article I had initially marked as problem-free.
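A quick back-of-the-envelope calculation shows how fast those 2% compound (the task count of twenty is a made-up example, not a figure from the deployment):

```latex
P(\text{experiment fails}) = 1 - (1 - p)^{k}
                           = 1 - 0.98^{20} \approx 0.33
\qquad (p = 0.02,\ k = 20)
```

So an experiment with twenty such tasks already fails roughly one time in three.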
Having a reproducible test case on my desk, however, did little to help at first. With no JTAG or serial console available in the production configuration, it was impossible to use an on-chip debugger. It did make it possible to upload a fixed bootloader, though, and the cause of the resets was revealed to be the hardware watchdog.
VESNA uses the STM32 independent watchdog, a piece of hardware in the microcontroller that resets the CPU unless a specific register write occurs every minute or so. The fact that the watchdog was resetting the CPU pointed to a software problem. Unfortunately, the wise developers at STM did not provide any way to determine what the CPU had been doing before the watchdog killed it. There is no way to get the last program counter value, or even a pointer to the last stack frame, and hence no sane way of debugging watchdog-related issues (I did briefly toy with the idea of patching the CPU reset vector and dumping memory contents on reset, but soon decided that does not qualify as a sane method).
This led to another fruitless hunt around the source code for functions that might be hanging the CPU for too long.
Then I noticed that one of the newer firmware versions had a much lower chance of failure: 1 failure in 2500 tries. Someone, probably unknowingly, since I didn't see any commit messages that would tell me otherwise, had already fixed, or nearly fixed, the bug. Since such accidental fixes tend to be accidentally removed as well, it still made sense to figure out what exactly was causing the bug, to make sure the fix stayed put. I fired up git bisect and after a few days of testing came up with the following situation:
Note that git bisect is being run in reverse here. Since I was searching for a commit that fixed a bug rather than introduced it, bisect/bad marks a revision with the bug fixed, while bisect/good marks a revision that still has the bug.
But if you look closely, you can see that the first revision with the bug fixed was a merge commit of two branches, both of which exhibited the bug. To make things even more curious, this was a straightforward merge with no conflicts. It made no sense that this merge would introduce timing changes large enough to trip the watchdog timer. However, the fact that the changes had to do with the integrated A/D converter did curiously point in a certain direction.
After carefully testing for and excluding bootloader and programming issues (after all, I did once find a bug where firmware would not be uploaded to flash properly if it was exactly a multiple of 512 bytes long), I came upon this little piece of code:
while (!(ADC_GetFlagStatus(ADC1, ADC_FLAG_EOC)));
Using STM's scarily-licensed firmware library, this triggers a one-shot A/D conversion and waits for the end-of-conversion flag to be set by hardware. When I carefully added a timeout to this loop, the bug disappeared in all firmware versions that had previously exhibited it. I say carefully because at this point I wasn't trusting the compiler either: I first added timeout code with a larger-than-realistic timeout value, made sure the bug was still there, and only then changed the timeout value without touching the rest of the code.
Now it would be most convenient to blame everything on a buggy microcontroller peripheral. So far most clues point in that direction: some minor timing issue during ADC calibration and turn-on may cause the ADC to sporadically hang, or to take far longer than expected to finish a conversion (turning off the watchdog did not usually result in a hang).
But even if that is the case (and I'm still not completely convinced, although the case is now closed as far as I'm concerned), this journey mostly showed what happens when you debug a 40,000-line embedded code base with little to no internal debug tooling at your disposal. It's a tremendous time sink and takes careful planning and note-taking - I'm left with pages of calculations I made to ensure tests ran long enough to give a good probability of a correct git-bisect result, given the prior probabilities of encountering the bug.
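The core of those calculations is simple enough to state: to call a revision bug-free with confidence 1 − α, when each try fails independently with probability p, you need n consecutive clean runs such that

```latex
(1 - p)^{n} \le \alpha
\quad\Longrightarrow\quad
n \ge \frac{\ln \alpha}{\ln (1 - p)}
     = \frac{\ln 0.05}{\ln 0.98} \approx 149
\qquad (p = 0.02,\ \alpha = 0.05)
```

With two-hour reproduction cycles, even these modest run counts are what turned each git-bisect step into a day-scale affair.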
So, if you made it this far and are writing code, please make proper error reporting facilities your top priority. Fail as early as possible and give as much feedback as possible to the people who might end up debugging things. Try to be reasonably robust against failures in code and hardware outside your control. And above all, make absolutely sure you don't interfere with whatever debugging facilities your platform provides. As limited as they might appear, they are still better than nothing.