This is a post-mortem that I decided to share and write with everyone. It relates to my attempts to fix a seemingly simple bug in a CI loop that manages source code for checkbox.
This turned out to be widely different from what I initially thought.
So for about a week I've been plagued with tarmac failing tests on perfectly good code. This happened right after the change that made us reuse virtual machines instead of (erroneously) create a fresh one each time after tarmac removed the vagrant state file '.vagrant' from the tree.
The bug manifested itself like this:
If you follow that and hop to another pastebin you'll see "error: could not create 'plainbox.egg-info': File exists".
I started debugging this, initially suspecting bugs in test-in-vagrant.sh, repeated testing was inconclusive, I never managed to grasp the bug. It would happen once, then again, only on precise guest, not on the quantal guest.
At that time, I could reproduce it locally on "fx" (my desktop), it would also happen on "tarmac-lander" the dell laptop that I got from Tim.
Time went by, I kept installing daily updates (both machines are on raring). By now I pushed the change that replaced "ln -s" with "ln -sf". That didn't work and we started getting failures again.
I started suspecting tarmac as the root cause. As you know tarmac rebuilds the pristine tree before merging any code or running tests. That also removed the .egg-info directory which is created by setuptools/distribute, for example, when running 'setup.py test' or 'setup.py egg_info'. That led me to push this change
I kept testing and testing and it would show the error is gone. The directory would now be rebuilt before running 'setup.py test'. I suspected that a distribute/setuptools bug is causing this so I started looking at that. This would play well with this failing on precise and not failing on quantal. I was wrong again.
In the meantime I landed this to trunk and started running tarmac merges again. And boy, did it fail. I wanted to pull my hair out as the bug was back in full glory. That did not make any sense. I ran everything a couple of times and the bug kept going away.
At that time I did a few good things for the first time:
- I created a copy of the virtual machine just after the error happened
- I created a copy of the directory shared to the virtual machine.
Now I was getting somewhere. I could reproduce the bug, once, then it would fix itself and never happen. I could restore the saved VM image and reproduce it again.
To reproduce the bug I was running a script that starts precise, runs tests that trip over the bug, suspends precise, does the same for quantal.
I started removing things to see what's really causing this. I removed the part that tested quantal, focusing on precise only. I was right, I could re-run tests over and over and they would keep failing. I was doing this all on the environment where our tarmac loop happens, on "tarmac-lander", the dell laptop. I never tried moving that environment to my desktop and that was a very good decision (perhaps by accident) in retrospective.
I started digging deeper. The failing command was basically:
$ vagrant ssh precise -c 'cd checkbox/plainbox && python3 setup.py egg_info
That ssh into the precise VM, goes to checkbox which is a symlink to /vagrant, which is the shared folder from host os, goes into the plainbox sub-directory and runs setup.py to build plainbox.egg-info
That kept failing, but this was really silly, it should not have. The error message, if you recall, was:
error: could not create 'plainbox.egg-info': File exists
The second good idea I had that day was to run strace. I prepended strace to python and then I saw it, right at the end, in plain sight:
mkdir("plainbox.egg-info", 0777) = -1 EEXIST
stat64("plainbox.egg-info", 0xbfd696fc) = -1 ENOENT
Now we're getting somewhere. Inspired by this small victory I wrote a minimal test program that reproduces it. Quick testing and... bingo, it's broken. So virtualbox must be broken, good.
I filed a bug upstream and asked my colleague to check if this fails for him on virtualbox 4.2 from oracle. Well, you can probably guess that it was not it, it did not fail, everything was right. At that time I was sure virtualbox 4.2 was the cure. I planned updating my "fx" workstation to see this fixes everything and the week-long horror goes away.
Then I had another good idea, let's check if it also fails on "fx" before I upgrade... well, it worked... no bug.
I had the same version of virtualbox. The same version of the guest kernel, the same version of all the other packages as both machines share an apt-cacher-ng cache.
So what was it? What's different? Both are x86_64, both are running the same, er! kernel? "fx" at that time was running 3.8.0-10 as I kept updating and rebooting every day. "tarmac-lander" was still at 3.8.0-8 as I never bothered to reboot our CI loop.
A few reboots later I had the "facts" changed again, hopefully for the last time. The Ubuntu kernel 3.8.0-8 is affected, anything after that, specifically 3.8.0-10
UPDATE: It just happened again, this time on 3.8.0-11. I'm clueless now. Could be time related? Memory corruption (unlikely)? Any hints?
Lessons learned:
- When reasoning about anything, move one piece and measure again, otherwise you reason based on false data and quickly go nowhere.
- If possible, copy stuff just after the failure elusive bugs may go away or be "fixed" by something random you did later while poking at the bug.
- Check all the reproduce instructions. Do everything from scratch following those. It takes forever but it may _save_ time if the fix is wrong.
So as stated above, this is still broken:
ReplyDeletezyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise -c 'cd checkbox/plainbox && ~/vboxbug'
mkdir(...) = -1: errno 17
stat(...) = -1: errno 2
This system is borked, thanks!
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ uname -a
Linux tarmac-lander 3.8.0-11-generic #20-Ubuntu SMP Tue Mar 5 20:32:48 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-38-generic i686)
Some more testing, host not modified, guest destroyed and re-created as shown below:
ReplyDeletezyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-38-generic i686)
* Documentation: https://help.ubuntu.com/
System information as of Thu Mar 7 22:08:34 UTC 2013
System load: 0.13 Processes: 64
Usage of /: 2.8% of 39.36GB Users logged in: 0
Memory usage: 42% IP address for eth0: 10.0.2.15
Swap usage: 0%
Graph this data and manage this system at https://landscape.canonical.com/
Get cloud support with Ubuntu Advantage Cloud Guest:
http://www.ubuntu.com/business/services/cloud
Use Juju to deploy your cloud instances and workloads:
https://juju.ubuntu.com/#cloud-precise
*** System restart required ***
vagrant@vagrant-ubuntu-precise-32:~$ uname -a
Linux vagrant-ubuntu-precise-32 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:20:02 UTC 2013 i686 i686 i386 GNU/Linux
vagrant@vagrant-ubuntu-precise-32:~$ wget https://gist.github.com/zyga/5110513/raw/85910f4ef112e7f4ad189312ce53c6bddfe82d8a/vboxbug.c
--2013-03-07 22:09:01-- https://gist.github.com/zyga/5110513/raw/85910f4ef112e7f4ad189312ce53c6bddfe82d8a/vboxbug.c
Resolving gist.github.com (gist.github.com)... 204.232.175.94
Connecting to gist.github.com (gist.github.com)|204.232.175.94|:443... connected.
gHTTP request sent, awaiting response... c200 OK
Length: unspecified [text/plain]
Saving to: `vboxbug.c'
[ <=> ] 2,763 --.-K/s in 0s
2013-03-07 22:09:02 (114 MB/s) - `vboxbug.c' saved [2763]c
vagrant@vagrant-ubuntu-precise-32:~$ gcc -Wall vboxbug.c -o vboxbug
vagrant@vagrant-ubuntu-precise-32:~$ logout
Connection to 127.0.0.1 closed.
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise -c 'cd checkbox/plainbox && ~/vboxbug'
mkdir(...) = -1: errno 17
stat(...) didn't fail, this system works ok
Maybe on to something again?
ReplyDeletezyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant up precise
[precise] VM already created. Booting if it's not already running...
[precise] Resuming suspended VM...
[precise] Booting VM...
[precise] Waiting for VM to boot. This can take a few minutes.
[precise] VM booted and ready for use!
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise -c 'cd checkbox/plainbox && ~/vboxbug'
mkdir(...) = -1: errno 17
stat(...) didn't fail, this system works ok
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ bzr status
unknown:
plainbox/plainbox.egg-info/
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ rm -rf plainbox/plainbox.egg-info/
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise -c 'cd checkbox/plainbox && ~/vboxbug'
mkdir(...) = -1: errno 17
stat(...) = -1: errno 2
This system is borked, thanks!
zyga@tarmac-lander:~/.cache/tarmac/branches/checkbox/trunk$ vagrant ssh precise -c 'cd checkbox/plainbox && ~/vboxbug'
mkdir(...) = -1: errno 17
stat(...) = -1: errno 2
This system is borked, thanks!
Hello, this is heisenbug, I'm still here:
ReplyDeletevagrant@vagrant-ubuntu-precise-32:~/checkbox/plainbox$ ls -la
total 92
drwxrwxr-x 1 vagrant vagrant 4096 Mar 7 22:25 .
drwxrwxr-x 1 vagrant vagrant 4096 Mar 7 22:20 ..
-rw-rw-r-- 1 vagrant vagrant 169 Mar 7 15:34 .coveragerc
-rw-rw-r-- 1 vagrant vagrant 66 Mar 7 15:34 .gitignore
-rw-rw-r-- 1 vagrant vagrant 78 Mar 7 15:34 .gitmodules
-rw-rw-r-- 1 vagrant vagrant 35147 Mar 7 15:34 COPYING
-rw-rw-r-- 1 vagrant vagrant 67 Mar 7 15:34 MANIFEST.in
-rw-rw-r-- 1 vagrant vagrant 3886 Mar 7 15:34 README.md
drwxrwxr-x 1 vagrant vagrant 4096 Mar 7 15:34 docs
-rwxrwxr-x 1 vagrant vagrant 1476 Mar 7 15:34 mk-interesting-graphs.sh
-rwxrwxr-x 1 vagrant vagrant 1077 Mar 7 15:34 mk-venv.sh
drwxrwxr-x 1 vagrant vagrant 4096 Mar 7 22:21 plainbox
-rw-rw-r-- 1 vagrant vagrant 63 Mar 7 15:34 setup.cfg
-rwxrwxr-x 1 vagrant vagrant 1527 Mar 7 15:34 setup.py
-rwxrwxr-x 1 vagrant vagrant 263 Mar 7 15:34 test-with-coverage.sh
vagrant@vagrant-ubuntu-precise-32:~/checkbox/plainbox$ stat plainbox.egg-info
stat: cannot stat `plainbox.egg-info': No such file or directory
vagrant@vagrant-ubuntu-precise-32:~/checkbox/plainbox$ mkdir plainbox.egg-info
mkdir: cannot create directory `plainbox.egg-info': File exists
I'm almost sure it's caused by some stale cache entry in vboxsf.
ReplyDeleteI'll try to provide verified reproducible instructions and automated test tomorrow:
Rough idea:
1) [host] start a VM
2) [host] share a folder
3) [guest] create directory in shared folder
4) [host] remove the directory in shared folder
5) [guest] observe the bug