Thursday, October 3, 2013

apt-proxy-setup 0.1 released

Hi

I've just made the first release of apt-proxy-setup - a smart and effortless way to use your local apt-cacher-ng at home or in the office while still having access to the main public archive when you're on the road. Perfect when you're attending a sprint or a conference and don't want to keep changing your proxy configuration all the time. There is a README file and an example configuration file. Enjoy!
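To give a flavour of the general idea, here is a rough sketch (this is NOT apt-proxy-setup's actual code; the host name, port and snippet path are assumptions made up for illustration): check whether the local apt-cacher-ng answers and point apt at it only then.

import socket

CACHER = ("apt-cacher.lan", 3142)            # hypothetical apt-cacher-ng host
CONF = "/etc/apt/apt.conf.d/01local-proxy"   # hypothetical apt.conf.d snippet

def cacher_reachable(timeout=1.0):
    # True if the cacher accepts TCP connections on its usual port.
    try:
        with socket.create_connection(CACHER, timeout=timeout):
            return True
    except OSError:
        return False

with open(CONF, "w") as stream:
    if cacher_reachable():
        stream.write('Acquire::http::Proxy "http://%s:%d/";\n' % CACHER)
    # otherwise leave the snippet empty so apt talks to the archive directly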
 Thanks
ZK

Monday, April 8, 2013

Why is command-not-found crashing?

Background

Many many years ago I wrote the command-not-found program. It's still there on the CD, on the server, installed by default. I'm really proud of that. What makes the pride go away is the sea of bugs in that program.

It does not help to say that command-not-found crashes gracefully, telling you how to report the problem. I myself feel helpless about those problems but once in a while someone wants to help out and comes asking for directions.

I'm really, really happy to help anyone contribute bug fixes, improvements, or just play around with the code to understand it better. In that spirit, instead of responding privately (command-not-found has no mailing list or anything similar), I've decided to write a blog post about the problem, hoping both to archive my thoughts so I can redirect others to it when needed, and to get some attention from people who could suggest a way to fix the problem.

So, the question is "do you know why command-not-found is crashing?"
Yes, I do.

Unicode decoding problems

Consider this scenario:
  1. The program is given some bytes, on stdin, via arguments to main() or otherwise.
  2. The program wants to interpret those bytes as text, so it needs to know the encoding.
  3. The program queries some locations to find out what the encoding is.
  4. The program attempts to interpret the bytes according to that encoding. Something is incorrect though (corrupted bytes, incorrect encoding hints) and stuff blows up. This is the UnicodeDecodeError exception that people often see.
There are several separate causes of this problem. I will talk about that later.
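To make step 4 concrete, here is a tiny standalone illustration (this is not command-not-found's actual code): the byte 0xE9 is a perfectly good 'é' in ISO8859-1 but is not valid UTF-8 on its own.

data = b"caf\xe9"                 # bytes handed to the program, e.g. via argv
try:
    text = data.decode("utf-8")   # the encoding hinted at by the (wrong) locale
except UnicodeDecodeError as exc:
    print("cannot interpret input:", exc)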

Locale problems

Another group of failures is related to locale. Locale is used for several things but, most importantly, for interacting with gettext to get translated strings to use at run-time. Locale-related problems look as follows:
  1. The program uses standard library calls to initialize the locale system and the translation catalog.
  2. That operation queries some environment variables, looks at certain files and tries to load them.
  3. Something is incorrect (bad settings, missing files) and stuff blows up. This is the locale.Error exception that people sometimes see (see the sketch after this list).
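As a tiny illustration of how step 3 blows up (again, not the actual code; the locale name is deliberately bogus and stands in for any locale that was never generated on the system):

import locale
import os

os.environ["LC_ALL"] = "xx_XX.UTF-8"      # advertised but never generated/installed
try:
    locale.setlocale(locale.LC_ALL, "")   # roughly what locale/gettext init does
except locale.Error as exc:
    print("locale setup failed:", exc)    # "unsupported locale setting"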
So if I know why this happens, why is it not fixed? Because it's not easy. You will notice that this virtually never happens if you are using Ubuntu directly. You'd have to try hard to make it happen (explicitly mis-configure your system or remove essential files). This is not interesting to fix as it affects practically nobody. It's only interesting in the sense that the "fix" should be equally good for local and ....

Remote users

This is where all of the problems are coming from. They are almost always observed when logging in remotely with SSH. SSH inherits / sets certain environment variables depending on the configuration of the system people connect FROM. Some of those are SSH/PAM bugs that incorrectly negotiate which variables are okay to forward. The rest might be default misconfigurations of ssh on OS X or of PuTTY on Windows that cause things to break as I've explained above.

Typical use cases:
  1. A Windows user using PuTTY to log in to a 12.04 server - explodes because of the missing locale for the en_US langpack (not installed by default IIRC) and because of incorrect PuTTY settings (assuming an ISO8859-X encoding) corrupting the input buffer (when you press enter, what you see and what gets sent to the remote machine are different).
  2. A Mac OS X user inheriting a bunch of environment variables that don't work in Ubuntu, or any Linux for that matter. This causes the Unicode exceptions or locale exceptions, depending on what settings people have.
So there you have it. I don't have the time and hardware/software necessary to carefully analyze all the possible remote interactions (various versions of Windows, Cygwin, SSH, locale settings or Macs).
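If you want to see what your client pushed across, here is a quick diagnostic you can run on the remote machine right after logging in (just a helper snippet, not part of command-not-found):

import os

# Anything here naming a locale that the server does not have is a candidate
# for the failures described above.
for name in sorted(os.environ):
    if name in ("LANG", "LANGUAGE") or name.startswith("LC_"):
        print(name, "=", os.environ[name])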

Possible solutions

I guess there are two ways this could be solved:
  1. The root problem could be carefully analyzed and solved. This would improve the experience of all users logging in remotely.
  2. command-not-found could silently turn off translations and fall back to assuming UTF-8 and English, or even silently do nothing (no suggestions). That works too, in some way, so if anything it's the low(er)-hanging fruit to go after (see the sketch below).
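A sketch of what option 2 might look like (an assumption about the shape of the fix, not the actual command-not-found code): never let a broken environment crash the program, just degrade to the portable "C" locale and untranslated English.

import gettext
import locale

try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    locale.setlocale(locale.LC_ALL, "C")  # broken environment: fall back quietly

# fallback=True returns NullTranslations when no catalog can be loaded,
# so _() simply returns the English string unchanged.
translation = gettext.translation("command-not-found", fallback=True)
_ = translation.gettext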
Anyone who wants to contribute is really welcome. I moved the project from Launchpad to GitHub a while ago and there is even some slow progress on the issues faced by OS X users. If you want to help, just fork the repo, dig through the code, experiment, open bugs and send comments. I'm really looking forward to helping anyone interested in working on this.

Thursday, March 7, 2013

Post mortem: chasing elusive bugs

UPDATE: scroll below for an update, no spoilers here.

This is a post-mortem that I decided to write up and share with everyone. It relates to my attempts to fix a seemingly simple bug in a CI loop that manages source code for checkbox.

This turned out to be wildly different from what I initially thought.

So for about a week I was plagued with tarmac failing tests on perfectly good code. This happened right after the change that made us reuse virtual machines instead of (erroneously) creating a fresh one each time after tarmac removed the Vagrant state file '.vagrant' from the tree.

The bug manifested itself like this:

If you follow that and hop to another pastebin you'll see "error: could not create 'plainbox.egg-info': File exists".

I started debugging this, initially suspecting bugs in test-in-vagrant.sh. Repeated testing was inconclusive and I never managed to pin the bug down. It would happen once, then again, only on the precise guest, never on the quantal guest.

At that time I could reproduce it locally on "fx" (my desktop); it would also happen on "tarmac-lander", the Dell laptop that I got from Tim.

Time went by and I kept installing daily updates (both machines are on raring). By then I had pushed the change that replaced "ln -s" with "ln -sf". That didn't work and we started getting failures again.

I started suspecting tarmac as the root cause. As you know, tarmac rebuilds the pristine tree before merging any code or running tests. That also removed the .egg-info directory which is created by setuptools/distribute, for example, when running 'setup.py test' or 'setup.py egg_info'. That led me to push this change.

I kept testing and testing and it looked like the error was gone. The directory would now be rebuilt before running 'setup.py test'. I suspected that a distribute/setuptools bug was causing this, so I started looking at that. That would also square with the failure occurring on precise but not on quantal. I was wrong again.

In the meantime I landed this to trunk and started running tarmac merges again. And boy, did it fail. I wanted to pull my hair out as the bug was back in full glory. That did not make any sense. I ran everything a couple of times and the bug kept going away.

At that time I did a few good things for the first time:

  1. I created a copy of the virtual machine just after the error happened.
  2. I created a copy of the directory shared with the virtual machine.

Now I was getting somewhere. I could reproduce the bug, once, then it would fix itself and never happen. I could restore the saved VM image and reproduce it again.

To reproduce the bug I was running a script that starts precise, runs the tests that trip over the bug, suspends precise, and then does the same for quantal.

I started removing things to see what was really causing this. I removed the part that tested quantal, focusing on precise only. I was right: I could re-run the tests over and over and they would keep failing. I was doing all of this in the environment where our tarmac loop runs, on "tarmac-lander", the Dell laptop. I never tried moving that environment to my desktop and, in retrospect, that was a very good decision (perhaps by accident).

I started digging deeper. The failing command was basically:

$ vagrant ssh precise -c 'cd checkbox/plainbox && python3 setup.py egg_info'

That SSHes into the precise VM, goes into checkbox (a symlink to /vagrant, the folder shared from the host OS), changes into the plainbox sub-directory and runs setup.py to build plainbox.egg-info.

That kept failing, which was really silly; it should not have. The error message, if you recall, was:

error: could not create 'plainbox.egg-info': File exists

The second good idea I had that day was to run strace. I prepended strace to python and then I saw it, right at the end, in plain sight:

mkdir("plainbox.egg-info", 0777) = -1 EEXIST
stat64("plainbox.egg-info", 0xbfd696fc) = -1 ENOENT

Now we're getting somewhere. Inspired by this small victory I wrote a minimal test program that reproduces it. Quick testing and... bingo, it's broken. So virtualbox must be broken, good.
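The original test program is not included here, but a minimal reproducer along these lines (a sketch built from the strace output above, with a hypothetical command-line argument for the target path) is enough to show the inconsistency:

import errno
import os
import sys

# Try to create a directory on the vboxsf shared folder, then stat it.  On the
# affected setup mkdir() failed with EEXIST while stat() on the very same name
# reported ENOENT.
path = sys.argv[1] if len(sys.argv) > 1 else "plainbox.egg-info"
try:
    os.mkdir(path)
    print("mkdir: created", path)
except OSError as exc:
    print("mkdir failed:", errno.errorcode[exc.errno])
try:
    os.stat(path)
    print("stat:", path, "exists")
except OSError as exc:
    print("stat failed:", errno.errorcode[exc.errno])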

I filed a bug upstream and asked my colleague to check if this fails for him on virtualbox 4.2 from Oracle. Well, you can probably guess that this was not it: it did not fail for him, everything was right. At that time I was sure virtualbox 4.2 was the cure. I planned on updating my "fx" workstation to see if this fixes everything and the week-long horror goes away.

Then I had another good idea: let's check if it also fails on "fx" before I upgrade... well, it worked... no bug.

I had the same version of virtualbox, the same version of the guest kernel, and the same version of all the other packages, as both machines share an apt-cacher-ng cache.

So what was it? What was different? Both are x86_64, both are running the same... er, kernel? "fx" at that time was running 3.8.0-10 as I kept updating and rebooting every day. "tarmac-lander" was still at 3.8.0-8 as I never bothered to reboot our CI loop.

A few reboots later I had the "facts" changed again, hopefully for the last time. The Ubuntu kernel 3.8.0-8 is affected; anything after that, specifically 3.8.0-10 and -11, is okay. No virtualbox upgrades, just the host kernel version. I've added a comment to the upstream bug I reported and started going through the changes between -8 and -10 to see what may be influencing this.

UPDATE: It just happened again, this time on 3.8.0-11. I'm clueless now. Could it be time-related? Memory corruption (unlikely)? Any hints?

Lessons learned:

  1. When reasoning about anything, move one piece and measure again; otherwise you reason based on false data and quickly go nowhere.
  2. If possible, copy stuff just after the failure; elusive bugs may go away or be "fixed" by something random you did later while poking at the bug.
  3. Check all the reproduce instructions. Do everything from scratch following them. It takes forever but it may _save_ time if the fix is wrong.