2007-03-11 Bug in software or hardware

This week was a very rewarding week: we squashed a bug which seemed to elude the very best minds -- those of the Telis team.

The problem was that when measuring a voltage, we read out the wrong value. We're reading very accurately, on the microvolt (uV) scale, and this is done with an electronics board which incorporates an ADC. Once we had made sure that no current was running through the measured circuit, we expected to measure zero but actually got -14 uV. On this scale that by itself isn't something to worry about; besides the ADC there are more electronic components on the board and these can all account for a slight offset. Hell, on this scale even the temperature can play a role.

However, this ADC has a lot of options and one of them is a procedure to measure an offset and store it in a register. Further reads will then take this offset into account. The electronics guy had created a script for this purpose. I had incorporated the script into a nice Perl module with a button in the user interface named 'Measure Offsets'. I've previously described this procedure in 2006-10-20 Measuring FFO offsets.

So, we ran the procedure and did a new measurement. The offset changed, but didn't disappear. Hmm, strange. Now we measured -7 uV. Weird!

FFO plot offset correctie zichtbaar.png

First we tried the usual stuff, to make sure this faulty reading was repeatable. Turn off electronics, disconnect cables, reconnect, turn on again. Try an older version of the software. Completely reproducible. Then it was time to start thinking.

We tried to determine the location of the problem. Is it the hardware, the software, or the hardware instructions loaded into the flash located on the electronics board?

The measurement is run from the FFO board:

FFO-pll top.jpg

Our electronics guy tried the spare FFO board. Fully reproducible faulty behavior. So, it's not the hardware. Then it must be the software, right?

We reran the old script from which the Measure Offsets Perl module was created. This script ran the offset procedure for the ADC and then did some measurements. These checked out fine, printing zero uV after the offset procedure. However, if we then walked to the main software screen and read out the value, it had the -7 uV offset again. Can we rule out the software then?

We compared the Perl module and the original script line by line. These were the same. We also checked what each line did. They were created some time ago and we wanted to make sure everything still made sense.

Then we realized that there was a difference between a readout in the original Measure Offsets script and a readout in the main software screen. The second one uses a macro, one of the hardware instructions loaded into the flash located on the electronics board. This macro first puts the ADC in a high-resolution mode before making the measurement.

So we changed the Measure Offsets procedure to first put the ADC in the high-resolution mode before doing the offset procedure. Then we reran the measurement and waited with fingers crossed... and bingo! That was the problem. When we reran the plot, the following picture appeared:

offset fixed.jpg

The line on the left is the measurement before we ran the offsets procedure. The line on the right is the corrected measurement. (Note that the lines aren't as jagged as in the first plot -- that is because the ADC was set to a higher accuracy, which takes more time for the measurement.)
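
To make the mechanism concrete, here is a toy model in Python. It is not our Perl module and it says nothing about the real ADC's registers: the class SimulatedAdc, the mode names and the per-mode offsets are all made up for illustration. The offsets are picked only so that the numbers match the -14 uV and -7 uV readings described above. The assumption is that the ADC's zero offset differs per resolution mode, so an offset stored in one mode only partly corrects a read done in another.

```python
# Toy model, not the real ADC: the per-mode offsets are assumed values,
# chosen only to reproduce the -14 uV and -7 uV readings from this post.
INTRINSIC_OFFSET_UV = {"normal": -7, "high_res": -14}


class SimulatedAdc:
    """Hypothetical ADC whose zero offset depends on the resolution mode."""

    def __init__(self):
        self.mode = "normal"
        self.offset_register_uv = 0  # correction subtracted from every read

    def set_mode(self, mode):
        if mode not in INTRINSIC_OFFSET_UV:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def read_uv(self, true_input_uv=0):
        """Return the reading in uV, corrected by the stored offset."""
        raw = true_input_uv + INTRINSIC_OFFSET_UV[self.mode]
        return raw - self.offset_register_uv

    def measure_offset(self):
        """Store the zero-input reading of the *current* mode."""
        self.offset_register_uv = INTRINSIC_OFFSET_UV[self.mode]


def uncalibrated_read(readout_mode="high_res"):
    """Read a zero input without ever running the offset procedure."""
    adc = SimulatedAdc()
    adc.set_mode(readout_mode)
    return adc.read_uv()


def calibrate_then_read(calibration_mode, readout_mode="high_res"):
    """Run the offset procedure in one mode, then read in another."""
    adc = SimulatedAdc()
    adc.set_mode(calibration_mode)
    adc.measure_offset()
    adc.set_mode(readout_mode)
    return adc.read_uv()


print(uncalibrated_read())                      # no offset procedure at all
print(calibrate_then_read("normal"))            # offset measured in the wrong mode
print(calibrate_then_read("normal", "normal"))  # the old script: looks fine
print(calibrate_then_read("high_res"))          # the fix: same mode for both
```

In this model the old script sees zero because it calibrates and reads in the same mode, while the main screen reads in the other mode and keeps a residual offset. That matches what we saw, but the real cause inside the chip may well be more subtle than a fixed per-mode offset.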

Turns out it wasn't a hardware problem. It wasn't a software problem, either. It wasn't even really a problem in the macros. We just didn't use the offset options of the ADC in the right way. The offset procedure was fully tested, but not in the exact same way measurements were taken later.

This type of bug evaded unit testing and was only caught with good testing in the field. Can't beat that kind of testing.