Monday, August 8, 2016

Snap execution environment

This is the fourth article in the series about snappy interfaces. You can check out articles one, two and three though they are not directly required.

In this installment we will explore the layout and properties of the file system at the time snap application is executed. From the point of view of the user nothing special is happening. The user either runs an application by clicking on a desktop icon or by running a shell command. Internally, snapd uses a series of steps (which I will not explain today as they are largely an implementation detail) to configure the application process.

The (ch)root filesystem and a bit of magic

Let’s start with the most important fact: the root filesystem is not the filesystem of the host distribution. Using the host filesystem would lead to lots of inconsistencies. Those are rather obvious: different base libraries, potentially different filesystem layout, etc. At snap application runtime the root filesystem is changed to the core snap itself. You can think of this as a kind of chroot. Obviously the chroot itself would be insufficient as snaps are read only filesystems and the core snap is no different.

Certain directories in the core snap are bind mounted (you can think of this as a special type of a symbolic link or a hard link to a directory though neither are fully accurate) to locations on the host file system. This includes the /home directory, the /run directory and a few others (see Appendix A for the full list). Notably this does not include the /usr/lib or /usr/bin. If a snap needs a library or an executable to function, that library or executable has to be present in the snap itself. The only exception to that are very low level libraries like libc that are present in the core snap.

TIP: explore the core snap to see what is there. Having installed at least one snap you can go to /snap/ubuntu-core/current to see the list of files provided there.

With all those mounts and chroots in place one might wonder how mounts look like for all the other processes in the system? The answer is simple, they look as if nothing special related to snappy was happening.

To understand the answer you have to know a little about Linux namespaces. Namespaces are a way to create a separate view of a given aspect of a Linux system at a per-process level. Unlike full-blown virtual machines (where you run a whole emulated computer with a potentially different operating system kernel and emulated peripheral devices) namespaces are fine grained. Snappy uses just one of the available namespaces, the mount namespace. Now I won’t fool you, while the idea seems simple “mounts in the namespace are isolated from the mounts outside of the namespace” the reality is far more complex because of the so-called shared-subtrees. One interesting consequence is that mounts performed after a snap application is stated (e.g. in the /media directory) are visible to the said application (e.g. to VLC) while the reverse is not true. If a malicious snap tries (and manages despite various defenses put in place) to mount something in say, /usr/ that change will be visible only to the snap application process.

Don’t worry if you don’t fully understand this topic. The main point is that your application sees a different view of the filesystem but that view is consistent across distributions.

TIP: you may have seen the core snap as it looks like on disk if you followed the earlier tip. Now see the real file system at runtime! Install the snapd-hacker-toolbelt snap and run snapd-hacker-toolbelt.busybox sh. This will give you a shell with many of the familiar commands that let you peek and poke at the environment.
Now there are a few more tweaks I should point out but won’t go into too much detail:

  • Each process gets a private /tmp directory with a fresh tmpfs mounted there. This is a security measure. One simple consequence is that you cannot expect to share files by dropping them there and that you cannot create arbitrarily large files there since tmpfs is backed by a fraction of available system memory.
  • There’s also a private instance of /dev/pts directory with pseudo terminal emulators. This is an another security measure. In practice you will not care about this much. It’s just a part of the Linux plumbing that has to be setup in a given way.
  • The whole host filesystem is mounted at /var/lib/snapd/hostfs. This can be used by interfaces similar to the content sharing interface, for example. This is super interesting and we will devote a whole article to using this later on.
  • There’s special code that exists to support Nvidia proprietary drivers. I will discuss this with a separate installment that may be of interest to game developers.
  • The current working directory may be impossible to preserve across the whole chroot and mount and bind mount magic. The easiest way to experience this is to create a directory in /tmp (e.g. /tmp/foo) and try to run any snap command there. Because of the private (and empty) /tmp directory the /tmp/foo directory does not exist for the snap application process. Snap-confine will print an informative error message and refuse to run.

This now much more comfortable. Many of the usual places exist and contain the data the applications are familiar with. This doesn’t mean those directories are readable or writable to the application process, they are just present. Confinement and interfaces decide if something is readable or writable. This brings us to the second big part of snap-confine

Process confinement

Snap-confine (as of version 1.0.39) supports two sandboxing technologies: seccomp and apparmor.

Seccomp is used to constrain system calls that an application can use. System calls are the interface between the kernel and user space. If you are unfamiliar with the concept then don’t worry. In very rough terms some of the functions of your programming language are implemented as as system calls and seccomp is the linux subsystem that is responsible for mediating access to them.

Right now when your application runs in devmode you will not get any advice on the system calls you are relying on that are not allowed by the set of used interfaces. The so-called complain mode of seccomp is being actively developed upstream so the situation may change by the time you are reading this.

In strict confinement any attempt to use a disallowed system call will instantly kill the offending process. This is accompanied by a rather cryptic message that you can see in the system log:
sie 08 12:36:53 gateway kernel: audit: type=1326 audit(1470652613.076:27): auid=1000 uid=1000 gid=1000 ses=63 pid=66834 comm="links" exe="/snap/links/2/usr/bin/links" sig=31 arch=c000003e syscall=54 compat=0 ip=0x7f8dcb8ffc8a code=0x0
What this tells us is that process ID 66834 was killed with signal 31 (SIGSYS) because it tried to use system call 54 (setsockopt) Note that system call numbers are architecture specific. The output above was from an amd64 machine.

Even when a system call is allowed the particular operation may be intercepted and denied by apparmor. For example, the sandbox is setup so that applications can freely write to $SNAP_USER_DATA (or $SNAP_DATA for services) but cannot, by default, either read or write from the real home directory.
sie 08 12:56:40 gateway kernel: audit: type=1400 audit(1470653800.724:28): apparmor="DENIED" operation="open" profile="snap.snapd-hacker-toolbelt.busybox" name="/home/zyga/.ssh/authorized_keys" pid=67013 comm="busybox" requested_mask="r" denied_mask="r" fsuid=1000 ouid=1000

Here we see that process ID 67013 tried to “open” /home/zyga/.ssh/authorized_keys and that the “r” (read) mask was denied. In devmode that is obviously allowed but is accompanied with an appropriate warning message instead.

TIP: Whenever you run into problems like this you should give the snappy-debug snap a try. It is designed to read and understand messages like that and give you useful advice.

Apparmor has much wider feature set and can perform checks for linux capabilities, traditional UNIX IPC like signals and sockets, DBus messages (including details of the object and method invoked). The vast majority of the current snap confinement is made with apparmor profiles. We will look at all the features in greater detail in the next few articles in this series where we will actively implement simple new interfaces from scratch.

There's one last thing that snap-confine does, in some cases is creates...

A device control group

In certain cases a device control group is created and the application process is moved there. The control group has only a small set of typical device nodes (e.g. /dev/null) and explicitly defined additional devices. This is done by using udev rules to tag appropriate devices. This is listed for completeness, you should never worry about or even think about this topic. Once the need arises we will explore it and how it can be used to simplify handling of certain situations. Again, by default there is no device control group being used.

Putting it all together

With all of those changes in place snap-confine executes (using execv) a wrapper script corresponding to the application command entry in the snapcraft.yaml file. This happens each time you run an application.

If you are interested in learning more about snap-confine I encourage you to check out its manual page (snap-confine.5) and source code. If you have any questions please feel free to ask at the mailing list or using comments on this blog below.

Next time we will look at creating our first, extremely simple, interface in practice.

Appendix A: List of host directories that are bind-mounted.

NOTE: This list will slowly get less and less accurate as more of the mount points become dynamic and controlled by available interfaces.
  • /dev
  • /etc (except for /etc/alternatives)
  • /home
  • /root
  • /proc
  • /snap
  • /sys
  • /var/snap
  • /var/lib/snapd
  • /var/tmp
  • /var/log
  • /run
  • /media
  • /lib/modules
  • /usr/src