Monday, April 8, 2013

Why is command-not-found crashing?

Background

Many, many years ago I wrote the command-not-found program. It's still there on the CD, on the server, installed by default. I'm really proud of that. What makes the pride go away is the sea of bugs in that program.

It does not help to say that command-not-found crashes gracefully, telling you how to report the problem. I myself feel helpless about those problems, but once in a while someone wants to help out and comes asking for directions.

I'm really, really happy to help anyone contribute bug fixes and improvements, or just play around with the code to understand it better. In that spirit, instead of responding privately (command-not-found has no mailing list or anything similar), I've decided to write a blog post about the problem. I hope both to archive my thoughts so that I can point others to them when needed, and to get some attention from people who could suggest a way to fix the problem.

So, the question is: "Do you know why command-not-found is crashing?"
Yes, I do.

Unicode decoding problems

Consider this scenario:
  1. The program is given some bytes: on stdin, via arguments to main(), or otherwise.
  2. The program wants to interpret those bytes as text, so it needs to know the encoding.
  3. The program queries some locations to find out what the encoding is.
  4. The program attempts to interpret the bytes according to that encoding. Something is incorrect though (corrupted bytes, incorrect encoding hints) and stuff blows up. This is the UnicodeDecodeError exception that shows up so often.
There are several separate causes of this problem. I will talk about them later.
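
To make the four steps above concrete, here is a minimal Python sketch. It is not the actual command-not-found code: decode_argument is a made-up helper, and falling back to UTF-8 with replacement characters is just one assumption about what "not blowing up" could look like.

```python
import locale


def decode_argument(raw, encoding=None):
    """Decode bytes to text, falling back instead of raising (hypothetical helper)."""
    if encoding is None:
        # Step 3: ask the locale system which encoding to assume.
        encoding = locale.getpreferredencoding(False)
    try:
        # Step 4: strict decoding; this is where UnicodeDecodeError is raised.
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an encoding name Python does not know.
        # Assumption: fall back to UTF-8 and replace undecodable bytes with U+FFFD.
        return raw.decode("utf-8", errors="replace")


# "é" encoded as ISO-8859-1 is the single byte 0xE9, which is not valid UTF-8.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")

strict_failed = False
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True  # this is the crash users see

recovered = decode_argument(latin1_bytes, encoding="utf-8")
print(strict_failed, repr(recovered))
```

The recovered text is lossy (the accented letter becomes a replacement character), but the program keeps running.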

Locale problems

Another group of failures is related to locale. Locale is used for several things, but most importantly for interacting with gettext to get translated strings at run-time. Locale-related problems look as follows:
  1. The program uses standard library calls to initialize the locale system and the translation catalog.
  2. That operation queries some environment variables, looks at certain files, and tries to load them.
  3. Something is incorrect (bad settings, missing files) and stuff blows up. This is the locale.Error exception that is sometimes happening.
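
A rough Python sketch of those three steps, with the failure caught instead of propagated, might look like this. The function name init_translations, its locale_name parameter, the domain "hypothetical-domain" and the choice of the "C" locale as a fallback are all assumptions made for illustration, not what command-not-found actually does.

```python
import gettext
import locale


def init_translations(domain, locale_name=""):
    """Set up locale and gettext without letting a bad locale crash us."""
    try:
        # "" means "use whatever LANG / LC_* say"; this raises locale.Error
        # when those variables name a locale that is not available.
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        # Assumption: the plain "C" locale is an acceptable fallback.
        locale.setlocale(locale.LC_ALL, "C")
    # fallback=True returns NullTranslations (untranslated strings)
    # instead of raising when no catalog is found.
    return gettext.translation(domain, fallback=True)


# Simulate a bad setting by asking for a locale that should not exist.
translations = init_translations("hypothetical-domain", "xx_YY.bogus")
_ = translations.gettext
print(_("command not found"))
```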
So if I know why this happens, why is it not fixed? Because it's not easy. You will notice that this virtually never happens if you are using Ubuntu directly. You'd have to try hard to make it happen (explicitly misconfigure your system or remove essential files). This is not interesting to fix as it affects practically nobody. It's only interesting in the sense that the "fix" should be equally good for local and ...

Remote users

This is where all of the problems are coming from. The crashes are almost always observed when logging in remotely with SSH. SSH inherits or sets certain environment variables depending on the configuration of the system people connect FROM. Some of the failures are SSH/PAM bugs that incorrectly negotiate which variables are okay to forward. The rest might be SSH on OS X or PuTTY on Windows being misconfigured (by default), which causes things to break as I've explained above.

Typical use cases:
  1. Windows user using PuTTY to log in to a 12.04 server: explodes because of the missing locale for the en_US langpack (not installed by default, IIRC) and because of incorrect PuTTY settings (assuming an ISO8859-X encoding) corrupting the input buffer (what you see when you press enter and what gets sent to the remote machine are different).
  2. Mac OS X user inheriting a bunch of environment variables that don't work in Ubuntu or any Linux for that matter. This causes the Unicode exceptions or locale exceptions, depending on what settings people have.
So there you have it. I don't have the time or the hardware/software necessary to carefully analyze the possible remote interactions (various versions of Windows, Cygwin, SSH, locale settings, or Mac).
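
If you want to help with that analysis, a useful first step is simply to look at what the remote session received. Below is a small Python sketch for that. The helper locale_environment is hypothetical, and the sample values are only an illustration of the kind of thing a client might forward.

```python
import os


def locale_environment(environ=None):
    """Return only the variables that influence locale and encoding."""
    if environ is None:
        environ = os.environ
    return {
        name: value
        for name, value in environ.items()
        if name in ("LANG", "LANGUAGE") or name.startswith("LC_")
    }


# Illustrative values only; run locale_environment() with no arguments
# on the remote machine to see what your own session looks like.
sample = {"LANG": "en_US.UTF-8", "LC_CTYPE": "UTF-8", "HOME": "/home/user"}
for name, value in sorted(locale_environment(sample).items()):
    print(f"{name}={value}")
```

Attaching that output to a bug report, together with the client you connected from, makes it much easier to tell which of the cases above you are hitting.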

Possible solutions

I guess there are two ways this could be solved:
  1. The root problem could be carefully analyzed and solved. This would improve the experience of all users logging in remotely.
  2. command-not-found could silently turn off translations and fall back to assuming UTF-8 and English, or even silently do nothing (no suggestions). That works too, in some way, so if anything it's a low(er) hanging fruit to go after.
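
The "silently do nothing" end of the second option could be as small as the Python sketch below. It is only an illustration: safe_suggestions and the lookup callback are made-up names, not functions from the command-not-found code base.

```python
import locale


def safe_suggestions(command, lookup):
    """Return lookup(command), or an empty list if text handling fails."""
    try:
        return list(lookup(command))
    except (UnicodeError, locale.Error):
        # Bad bytes or a broken locale: stay silent instead of crashing.
        return []


def broken_lookup(command):
    # Hypothetical stand-in for a lookup that hits a broken locale.
    raise locale.Error("unsupported locale setting")


print(safe_suggestions("gti", broken_lookup))        # prints []
print(safe_suggestions("gti", lambda cmd: ["git"]))  # prints ['git']
```

The obvious downside is that it hides the root problem, which is exactly the trade-off between the two options.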
Anyone who wants to contribute is really welcome. I moved the project from Launchpad to GitHub a while ago and there is even some slow progress on the issues faced by OS X users. If you want to help, just fork the repo, dig through the code, experiment, open bugs and send comments. I'm really looking forward to helping anyone interested in working on this.