Thursday, March 7, 2013

Post mortem: chasing elusive bugs

UPDATE: scroll below for an update, no spoilers here.

This is a post-mortem that I decided to share and write with everyone. It relates to my attempts to fix a seemingly simple bug in a CI loop that manages source code for checkbox.

This turned out to be widely different from what I initially thought.

So for about a week I've been plagued with tarmac failing tests on perfectly good code. This happened right after the change that made us reuse virtual machines instead of (erroneously) create a fresh one each time after tarmac removed the vagrant state file '.vagrant' from the tree.

The bug manifested itself like this:

If you follow that and hop to another pastebin you'll see "error: could not create 'plainbox.egg-info': File exists".

I started debugging this, initially suspecting bugs in, repeated testing was inconclusive, I never managed to grasp the bug. It would happen once, then again, only on precise guest, not on the quantal guest.

At that time, I could reproduce it locally on "fx" (my desktop), it would also happen on "tarmac-lander" the dell laptop that I got from Tim.

Time went by, I kept installing daily updates (both machines are on raring). By now I pushed the change that replaced "ln -s" with "ln -sf". That didn't work and we started getting failures again.

I started suspecting tarmac as the root cause. As you know tarmac rebuilds the pristine tree before merging any code or running tests. That also removed the .egg-info directory which is created by setuptools/distribute, for example, when running ' test' or ' egg_info'. That led me to push this change

I kept testing and testing and it would show the error is gone. The directory would now be rebuilt before running ' test'. I suspected that a distribute/setuptools bug is causing this so I started looking at that. This would play well with this failing on precise and not failing on quantal. I was wrong again.

In the meantime I landed this to trunk and started running tarmac merges again. And boy, did it fail. I wanted to pull my hair out as the bug was back in full glory. That did not make any sense. I ran everything a couple of times and the bug kept going away.

At that time I did a few good things for the first time:

  1. I created a copy of the virtual machine just after the error happened
  2. I created a copy of the directory shared to the virtual machine.

Now I was getting somewhere. I could reproduce the bug, once, then it would fix itself and never happen. I could restore the saved VM image and reproduce it again.

To reproduce the bug I was running a script that starts precise, runs tests that trip over the bug, suspends precise, does the same for quantal.

I started removing things to see what's really causing this. I removed the part that tested quantal, focusing on precise only. I was right, I could re-run tests over and over and they would keep failing. I was doing this all on the environment where our tarmac loop happens, on "tarmac-lander", the dell laptop. I never tried moving that environment to my desktop and that was a very good decision (perhaps by accident) in retrospective.

I started digging deeper. The failing command was basically:

$ vagrant ssh precise -c 'cd checkbox/plainbox && python3 egg_info

That ssh into the precise VM, goes to checkbox which is a symlink to /vagrant, which is the shared folder from host os, goes into the plainbox sub-directory and runs to build plainbox.egg-info

That kept failing, but this was really silly, it should not have. The error message, if you recall, was:

error: could not create 'plainbox.egg-info': File exists

The second good idea I had that day was to run strace. I prepended strace to python and then I saw it, right at the end, in plain sight:

mkdir("plainbox.egg-info", 0777) = -1 EEXIST
stat64("plainbox.egg-info", 0xbfd696fc) = -1 ENOENT

Now we're getting somewhere. Inspired by this small victory I wrote a minimal test program that reproduces it. Quick testing and... bingo, it's broken. So virtualbox must be broken, good.

I filed a bug upstream and asked my colleague to check if this fails for him on virtualbox 4.2 from oracle. Well, you can probably guess that it was not it, it did not fail, everything was right. At that time I was sure virtualbox 4.2 was the cure. I planned updating my "fx" workstation to see this fixes everything and the week-long horror goes away.

Then I had another good idea, let's check if it also fails on "fx" before I upgrade... well, it worked... no bug.

I had the same version of virtualbox. The same version of the guest kernel, the same version of all the other packages as both machines share an apt-cacher-ng cache.

So what was it? What's different? Both are x86_64, both are running the same, er! kernel? "fx" at that time was running 3.8.0-10 as I kept updating and rebooting every day. "tarmac-lander" was still at 3.8.0-8 as I never bothered to reboot our CI loop.

A few reboots later I had the "facts" changed again, hopefully for the last time. The Ubuntu kernel 3.8.0-8 is affected, anything after that, specifically 3.8.0-10 and -11 are okay. No virtualbox upgrades. Just the host kernel version. I've added a comment to the upstream bug I've reported and started going through changes between -8 and -10 to see what may be influencing this.

UPDATE: It just happened again, this time on 3.8.0-11. I'm clueless now. Could be time related? Memory corruption (unlikely)? Any hints?

Lessons learned:

  1. When reasoning about anything, move one piece and measure again, otherwise you reason based on false data and quickly go nowhere.
  2. If possible, copy stuff just after the failure elusive bugs may go away or be "fixed" by something random you did later while poking at the bug.
  3. Check all the reproduce instructions. Do everything from scratch following those. It takes forever but it may _save_ time if the fix is wrong.