Background
Many, many years ago I wrote the command-not-found program. It's still there on the CD, on the server, installed by default. I'm really proud of that. What takes the pride away is the sea of bugs in that program.
It does not help to say that command-not-found crashes gracefully, telling you how to report the problem. I myself feel helpless about those problems, but once in a while someone wants to help out and comes asking for directions.
I'm really, really happy to help anyone contribute bug fixes, improvements or just play around with the code to understand it better. In that spirit, instead of responding privately (command-not-found has no mailing list or anything similar), I've decided to write a blog post about the problem. I hope both to archive my thoughts, so I can point others to them when needed, and to get some attention from people who could suggest a way to fix it.
So, the question is: "Do you know why command-not-found is crashing?"
Yes, I do.
Unicode decoding problems
Consider this scenario:
- The program is given some bytes: on stdin, via arguments to main(), or otherwise.
- To interpret those bytes as text, the program needs to know the encoding.
- The program queries some locations to find out what the encoding is.
- The program attempts to interpret the bytes according to that encoding. Something is incorrect, though (corrupted bytes, incorrect encoding hints), and stuff blows up. This is the UnicodeDecodeError exception that happens so often.
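The scenario above can be reproduced in a few lines of Python. This is only a minimal sketch: `decode_bytes` is a hypothetical helper, not a function from command-not-found, and falling back to UTF-8 with replacement characters is an assumption about what a tolerant program might do.

```python
def decode_bytes(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, degrading gracefully instead of raising."""
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong hint, corrupted input, or an unknown encoding name:
        # retry as UTF-8 and substitute U+FFFD for undecodable bytes.
        return raw.decode("utf-8", errors="replace")


# "cafe" with an accented final letter, encoded as ISO-8859-1.
# The trailing 0xE9 byte is not valid UTF-8 on its own.
latin1_bytes = b"caf\xe9"

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# ascii() keeps the demo printable even on a terminal with a broken encoding.
print(ascii(decode_bytes(latin1_bytes)))
```

The strict call is the one that blows up. The tolerant helper loses the damaged character but keeps the program alive.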
Locale problems
Another group of failures is related to locale. The locale is used for several things, most importantly for interacting with gettext to get the translated strings used at run time. Locale-related problems look as follows:
- The program uses standard library calls to initialize the locale system and the translation catalog.
- That operation queries some environment variables, looks for certain files and tries to load them.
- Something is incorrect (bad settings, missing files) and stuff blows up. This is the locale.Error exception that sometimes happens.
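The failing step is typically the `locale.setlocale()` call. Here is a hedged sketch of guarding it. `init_locale` is a hypothetical name, and the locale passed in the demo is made up on purpose so that it is unlikely to exist. Some C libraries accept arbitrary locale names, so the fallback branch may not trigger everywhere.

```python
import locale


def init_locale(name: str = "") -> str:
    """Activate a locale and return the name that ended up in effect.

    An empty name means "use whatever the environment asks for".
    """
    try:
        return locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # Bad settings or a locale that is not installed on this machine:
        # the "C" locale is always available.
        return locale.setlocale(locale.LC_ALL, "C")


# A deliberately bogus locale name, standing in for a bad forwarded setting.
print(init_locale("xx_YY.no-such-charset"))
```

Without the `try`/`except`, the same call is where the `locale.Error` traceback comes from.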
Remote users
This is where all of the problems are coming from. The crashes are almost always observed when logging in remotely with SSH. SSH inherits or sets certain environment variables depending on the configuration of the system people connect FROM. Some of the failures are SSH/PAM bugs that incorrectly negotiate which variables are okay to forward. The rest might be misconfiguration (by default) of ssh on OS X or PuTTY on Windows that causes things to break as I've explained above.
Typical use cases:
- Windows user using PuTTY to log in to a 12.04 server. This explodes because of the missing locale for the en_US langpack (not installed by default, IIRC) and because of incorrect PuTTY settings (assuming an ISO8859-X encoding) corrupting the input buffer (when you press enter, what you see and what gets sent to the remote machine are different).
- Mac OS X user inheriting a bunch of environment variables that don't work in Ubuntu, or any Linux for that matter. This causes the Unicode exceptions or locale exceptions, depending on what settings people have.
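When debugging a report like the ones above, it helps to see which locale-related variables actually arrived in the remote session. A small sketch follows. `locale_environment` is a hypothetical helper, and the sample values are illustrative assumptions, not taken from a real bug report.

```python
import os
from typing import Dict, Mapping

# Variables consulted by the locale and gettext machinery.
LOCALE_VARS = ("LANG", "LANGUAGE", "LC_ALL")


def locale_environment(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return only the locale-related variables from an environment."""
    return {
        name: value
        for name, value in environ.items()
        if name in LOCALE_VARS or name.startswith("LC_")
    }


# Illustrative environment, as it might look after an SSH login.
sample = {"LANG": "en_US.UTF-8", "LC_CTYPE": "UTF-8", "HOME": "/home/user"}
print(locale_environment(sample))
```

Calling `locale_environment()` with no argument inspects the current session. A value that does not name a locale installed on the server is a likely culprit.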
Possible solutions
I guess there are two ways this could be solved:
- The root problem could be carefully analyzed and solved. This would improve the experience of all users logging in remotely.
- command-not-found could silently turn off translations and fall back to assuming UTF-8 and English, or even silently do nothing (no suggestions). That works too, in some way, so if anything it's lower-hanging fruit to go after.
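The second option could look roughly like the sketch below. Both function names and the catalog domain in the demo are hypothetical, and this is not how command-not-found is structured today. It only illustrates the "turn off translations, or say nothing" idea using the standard gettext fallback.

```python
import gettext
import locale


def load_translator(domain: str):
    """Return a gettext-style function; untranslated English if anything fails."""
    try:
        # fallback=True yields NullTranslations when no catalog is found.
        return gettext.translation(domain, fallback=True).gettext
    except (OSError, UnicodeError):
        return gettext.NullTranslations().gettext


def suggest_quietly(suggest, command):
    """Run suggest(command); on encoding or locale failures, suggest nothing."""
    try:
        return suggest(command)
    except (UnicodeError, locale.Error):
        return []


# Hypothetical domain with no catalog installed: messages stay in English.
_ = load_translator("cnf-sketch-no-such-domain")
print(_("No command found"))
```

Swallowing the error hides the root problem instead of fixing it, which is exactly the trade-off between the two options listed above.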