Wednesday, October 22, 2014

Launching a process to monitor stdout, stderr and exit code reliably

Recently I'm fixing a rather difficult bug that deals with doing one simple task reliably. Run a program and watch (i.e. intercept and process) stdout and stderr until the process terminates.

Doing this is surprisingly difficult and I was certainly caught in a few mistakes the first time I tried to do this. I recently posted a lengthy comment on the corresponding bug. It took me a few moments to carefully analyze and re-think the situation and how a reliable approach should work. Non the less I am only human and I certainly have made my set of mistakes.

Below is the reproduction for my current approach. The implementation is still in progress but it seems to work (I need to implement the termination phase of non-kill-able processes and switch to fully non-blocking I/O). So far I've used epoll(7) and signalfd(7). I'm still planning to use timerfd_create(2) for the timer, perhaps with CLOCK_RTC for hard wall-clock-time limit enforcement. I'll post the full, complete examples once I'm done with this but you can look at how it mostly looks like today in the python-glibc git tree's demos/ directory.

I'd like to ask everyone that has experience with this part of systems engineering to poke holes in my reasoning and show how this might fail and misbehave. Thanks.

The current approach, that so far works good on all the pathological cases is to do this.
The general idea is that we're in a I/O loop, using non-blocking I/O and a select-like mechanism to wait for wait for:
 - timeout (optional, new feature)
 - read side of the stdout pipe data
 - read side of the stdout pipe being closed
 - read side of the stderr pipe data
 - read side of the stderr pipe being closed
 - SIGCHLD being delivered with the intent to say that the process is dead
In general we keep looping and terminate only when the set of waited things (stdout depleted, stderr depleted, process terminated) is empty. This is not always true so see below. The action that we do on each is event is obviously different:
If the timeout has elapsed we proceed to send SIGTERM, reset the timer for shutdown period, followed by SIGQUIT and another timer reset. After that we send SIGKILL. This can fail as the process may have elevated itself beyond our capabilities. This is still undecided but perhaps, at this time, we should use an elevated process manager (see below). If we fail to terminate the process special provisions apply (see below).
If we have data to read we just do and process that (send to log files, process, send to .record.gz). This is a point where we can optimize the process and improve reliability in event of sudden system crash. Using more modern facilities we can implement tee in kernel space which lowers processing burden on python and, in general, makes it more likely that the log files will see actual output the process made just prior to its death.
We can also use pipes in O_DIRECT (aka packet mode) here to ensure that all writes() end up as individual records, which is the indented design of the I/O log record concept. This won't address the inherent buffering that is enabled in all programs that detect when they are redirected and no longer attached to a tty.
Whenever one of the pipes is depleted (which may *never* happen, lesson learned) we just close our side.
When the child dies, and this is the most important part and the actual bugfix, we do the following sequence of events:
 - if we still have stdout pipe open, read at most one PIPE_BUF. We cannot read more as the pipe may live on forever and we can just hang as we currently do. Reading one PIPE_BUF ensures that we catch the last moments of what the originally started process intended to tell us. Then we close the pipe. This will likely result in SIGPIPE in any processes that are still attached to it though we have no guarantee that it will rally kill them as that signal can be blocked.
 - if we still have stderr pipe open we follow the same logic as for stdout above.
 - we restore some signal handling that was blocked during the execution of the loop and terminate.
There's one more trick up our sleeve and that is PR_SET_CHILD_SUBREAPER but I'll describe that in a separate bug report that deals with runaway processes. Think dbus-launch or anything that double-forks and demonizes

If you have any comments or ideas please post them here (wherever you are reading this), on the launchpad bug report page or via email. Thanks a lot!