There are a number of corner cases to consider when dealing with Docker, multiple processes, and signals. Probably the most famous post on this matter is from the Phusion blog. Here, we'll reproduce these problems first hand, and look at one way to work around them: fpco/pid1.
The Phusion blog post recommends using their baseimage-docker. This image provides a
my_init entrypoint which handles the problems described here, as well as introducing some extra OS features, such as syslog handling. Unfortunately, we ran into problems with Phusion's usage of syslog-ng, in particular with it creating unkillable processes pegged at 100% CPU usage. We're still investigating the root cause, but in practice we have found that the syslog usage is a far less motivating case than simply a good init process, which is why we've created the pid1 Haskell package together with a simple fpco/pid1 Docker image.
This blog post is intended to be interactive: you'll get the most bang for your buck by opening up your terminal and running commands along with reading the text. It will be far more motivating to see your
Ctrl-C completely fail to kill a process.
NOTE The primary reason we wrote our own implementation in Haskell was to be able to embed it within the Stack build tool. There are other lightweight init processes already available, such as dumb-init. I've also blogged about using dumb-init. While this post uses
pid1, there's nothing specific to it versus other init processes.
Playing with entrypoints
Docker has a concept of entrypoints, which provide a default wrapping command for the commands you provide to
docker run. For example, consider this interaction with Docker:
$ docker run --entrypoint /usr/bin/env ubuntu:16.04 FOO=BAR bash -c 'echo $FOO'
BAR
This works because the above is equivalent to:
$ docker run ubuntu:16.04 /usr/bin/env FOO=BAR bash -c 'echo $FOO'
Entrypoints can be overridden on the command line (as we just did), but can also be specified in the Dockerfile (which we'll do later). The default entrypoint for the ubuntu Docker image is a null entrypoint, meaning that the provided command will be run directly without any wrapping. We're going to simulate that experience by using
/usr/bin/env as an entrypoint, since switching entrypoint back to null isn't yet supported in released Docker. When you run
/usr/bin/env foo bar baz, the
env process will exec the
foo command, making
foo the new PID 1, which for our purposes gives it the same behavior as a null entrypoint.
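The claim that env hands its process ID over to the command it launches can be checked without Docker. The following is a minimal sketch, assuming bash and env are on your PATH; the variable names are ours and purely illustrative. The outer shell prints its PID and then execs env, which in turn execs a second shell that prints its own PID.

```shell
# The outer bash prints its PID, then replaces itself with env,
# which replaces itself with the inner bash. No new process is created.
pids=$(bash -c 'echo $$; exec env bash -c "echo \$\$"')
outer_pid=$(printf '%s\n' "$pids" | sed -n 1p)
inner_pid=$(printf '%s\n' "$pids" | sed -n 2p)
echo "before exec: $outer_pid, after exec via env: $inner_pid"
```

Both numbers should be identical, which is why a command started through /usr/bin/env inside a container still ends up as PID 1.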
The fpco/pid1 and snoyberg/docker-testing images we'll use below set
/sbin/pid1 as the default entrypoint. In the example commands, we're explicitly including
--entrypoint /sbin/pid1. This is just to be clear on which entrypoint is being used; if you exclude that option, the same behavior will persist.
Sending TERM signal to process
We'll start with our sigterm.hs program, which runs
ps (we'll see why soon), then sends itself a
SIGTERM and then loops forever. On a Unix system, the default process behavior when receiving a
SIGTERM is to exit. Therefore, we'd expect that our process will just exit when run. Let's see:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing sigterm
  PID TTY          TIME CMD
    1 ?        00:00:00 sigterm
    9 ?        00:00:00 ps
Still alive!
Still alive!
Still alive!
^C
$
The process ignored the
SIGTERM and kept running, until I hit Ctrl-C (we'll see what that does later). Another feature in the sigterm code base, though, is that if you give it the command line argument
install-handler, it will explicitly install a SIGTERM handler which will kill the process. Perhaps surprisingly, this has a significant impact on our application:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing sigterm install-handler
  PID TTY          TIME CMD
    1 ?        00:00:00 sigterm
    8 ?        00:00:00 ps
Still alive!
$
The reason for this is some Linux kernel magic: the kernel treats a process with PID 1 specially, and does not, by default, kill the process when it receives the
SIGTERM or SIGINT signals. This can be very surprising behavior. For a simpler example, try running the following commands in two different terminals:
$ docker run --rm --name sleeper ubuntu:16.04 sleep 100
$ docker kill -s TERM sleeper
Notice how the
docker run command does not exit, and if you check your
ps aux output, you'll see that the process is still running. That's because the
sleep process was not designed to be PID 1, and does not install a special signal handler. To work around this problem, you've got two choices:
- Ensure every command you run from docker run has explicit handling of SIGTERM.
- Make sure the command you run isn't PID 1, but instead use a process that is designed to handle SIGTERM correctly.
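To make the first option concrete, here is a minimal sketch of a command that installs its own SIGTERM handler, written as a shell trap rather than in Haskell. It is not the code of the sigterm program; the messages and variable names are invented for illustration. The inner shell signals itself to stand in for docker kill -s TERM.

```shell
# Stand-in for a container command with explicit SIGTERM handling.
handler_log=$(bash -c '
  trap "echo caught TERM, exiting; exit 0" TERM
  kill -TERM $$        # stand-in for a TERM sent from outside
  echo "still alive"   # never reached: the trap runs first
')
handler_status=$?
echo "$handler_log (exit status $handler_status)"
```

Because the handler is installed explicitly, the process reacts to the signal on its own terms, and that is the property a command needs if it is going to run as PID 1.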
Let's see how the
sigterm program works with our /sbin/pid1 entrypoint:
$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing sigterm
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
    8 ?        00:00:00 sigterm
   12 ?        00:00:00 ps
The program exits immediately, as we'd like. But look at the
ps output: our first process is now
pid1 instead of sigterm. Since
sigterm is being launched as a different PID (8 in this case), the special casing from the Linux kernel does not come into play, and default
SIGTERM handling is active. To step through exactly what happens in our case:
- Our container is created, and the command /sbin/pid1 sigterm is run inside of it.
- pid1 starts as PID 1, does its business, and then launches sigterm as a child process.
- sigterm sends a SIGTERM signal to itself, causing it to die.
- pid1 sees that its child died from SIGTERM (== signal 15) and exits with exit code 143 (== 128 + 15).
- Since our PID 1 is dead, our container dies too.
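The 128 + 15 arithmetic follows the convention shells use when reporting a child killed by a signal, and you can see it without Docker. This is a small sketch, assuming bash; the child here is an ordinary shell that sends itself SIGTERM, not the sigterm program, and the variable names are illustrative.

```shell
# The child kills itself with SIGTERM; the trailing sleep is never reached.
bash -c 'kill -TERM $$; sleep 5'
child_status=$?
signal_number=$((child_status - 128))
# kill -l maps an exit status above 128 back to the signal name.
echo "exit status $child_status = 128 + $signal_number (SIG$(kill -l "$child_status"))"
```

On Linux this should report status 143 and the TERM signal, the same exit code the list above describes pid1 passing along.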
This isn't just some magic with
sigterm; you can do the same thing with sleep:
$ docker run --rm --name sleeper fpco/pid1 sleep 100
$ docker kill -s TERM sleeper
Unlike with the
ubuntu image, this will kill the container immediately, due to the
/sbin/pid1 entrypoint used by fpco/pid1.
NOTE In the case of
sigterm, which sends the TERM signal to itself, it turns out you don't need a special PID1 process with signal handling; anything will do. For example, try
docker run --rm --entrypoint /usr/bin/env
snoyberg/docker-testing /bin/bash -c "sigterm;echo bye". But playing with
sleep will demonstrate the need for a real signal-aware PID1 process.
Ctrl-C: sigterm vs sleep
There's a slight difference between
sigterm and sleep when it comes to the behavior of hitting
Ctrl-C. When you use
Ctrl-C, it sends a
SIGINT to the
docker run process, which proxies that signal to the process inside the container.
sleep will ignore it, just as it ignores
SIGTERM, due to the default signal handlers for PID1 in the Linux kernel. However, the
sigterm executable is written in Haskell, and the Haskell runtime itself installs a signal handler that converts
SIGINT into a user interrupt exception, overriding the PID1 default behavior. For more on signal proxying, see the docker attach documentation.
Suppose you have process A, which
fork/execs process B. When process B dies, process A must call
waitpid to get its exit status from the kernel, and until it does so, process B will be dead but with an entry in the system process table. This is known as being a zombie.
But what happens if process B outlives process A? In this case, process B is known as an orphan, and needs to be adopted by the init process, aka PID1. It is the init process's job to reap orphans so they do not remain as zombies.
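The zombie state is easy to observe on any Linux machine, outside of Docker. The sketch below is an illustration under a few assumptions: it relies on the Linux /proc filesystem, and the variable names and timings are arbitrary. Process A starts process B in the background and then execs into sleep, a program that never calls waitpid, so B lingers as a zombie once it exits.

```shell
pidfile=$(mktemp)
# Process A: starts B (sleep 1), records B's PID, then replaces itself
# with sleep 3, which never calls waitpid on its children.
( sleep 1 & echo $! > "$pidfile"; exec sleep 3 ) &
process_a=$!
sleep 2                      # B has exited by now, while A is still running
process_b=$(cat "$pidfile")
# The third field of /proc/<pid>/stat is the process state; Z means zombie.
read -r _ _ b_state _ < "/proc/$process_b/stat"
echo "process B ($process_b) state: $b_state"
wait "$process_a"            # once A exits, B is handed over to PID 1
rm -f "$pidfile"
```

When A finally exits, B becomes an orphan, and whether its process table entry ever disappears depends on PID 1 doing its reaping job.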
The orphans.hs program will:
- Spawn a child process, and then loop forever calling ps
- In the child process: run the echo command a few times, without calling waitpid, and then exit
As you can see, none of the processes involved will reap the zombie
echo processes. The output from the process confirms that we have, in fact, created zombies:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing orphans
1
2
3
4
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 orphans
    8 ?        00:00:00 orphans
   13 ?        00:00:00 echo <defunct>
   14 ?        00:00:00 echo <defunct>
   15 ?        00:00:00 echo <defunct>
   16 ?        00:00:00 echo <defunct>
   17 ?        00:00:00 ps
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 orphans
   13 ?        00:00:00 echo <defunct>
   14 ?        00:00:00 echo <defunct>
   15 ?        00:00:00 echo <defunct>
   16 ?        00:00:00 echo <defunct>
   18 ?        00:00:00 ps
Still alive!
And so on until we kill the container. That
<defunct> indicates a zombie process. The issue is that our PID 1, orphans, doesn't do reaping. As you probably guessed, we can solve this by just using the /sbin/pid1 entrypoint:
$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing orphans
1
2
3
4
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
   10 ?        00:00:00 orphans
   14 ?        00:00:00 orphans
   19 ?        00:00:00 echo <defunct>
   20 ?        00:00:00 echo <defunct>
   21 ?        00:00:00 echo <defunct>
   22 ?        00:00:00 echo <defunct>
   23 ?        00:00:00 ps
Still alive!
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
   10 ?        00:00:00 orphans
   24 ?        00:00:00 ps
Still alive!
pid1 now adopts the
echo processes when the child
orphans process dies, and reaps accordingly.
Let's try out something else: process A is the primary command for the Docker container, and it spawns process B. Before process B exits, process A exits, causing the Docker container to exit. In this case, the running process B will be forcibly closed by the kernel (see this Stack Overflow question for details). We can see this with our surviving.hs program:
$ docker run --rm --entrypoint /usr/bin/env snoyberg/docker-testing surviving
Parent sleeping
Child: 1
Child: 2
Child: 4
Child: 3
Child: 1
Child: 2
Child: 3
Child: 4
Parent exiting
Unfortunately this doesn't give our child processes a chance to do any cleanup. Instead, we would rather send them a
SIGTERM, and after a grace period send them a
SIGKILL. This is exactly what pid1 does:
$ docker run --rm --entrypoint /sbin/pid1 snoyberg/docker-testing surviving
Parent sleeping
Child: 2
Child: 3
Child: 1
Child: 4
Child: 2
Child: 1
Child: 4
Child: 3
Parent exiting
Got a TERM
Got a TERM
Got a TERM
Got a TERM
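The "SIGTERM first, SIGKILL after a grace period" pattern can be sketched in a few lines of shell. This is only an illustration of the idea, not how pid1 is implemented; the two-second grace period and all names are assumptions made for the example. One child has default SIGTERM handling, the other ignores SIGTERM and has to be killed.

```shell
# Child 1: default SIGTERM handling, so it exits as soon as it is asked to.
sleep 30 &
polite=$!
# Child 2: ignores SIGTERM (an ignored signal stays ignored across exec),
# so only SIGKILL will stop it.
bash -c 'trap "" TERM; exec sleep 30' &
stubborn=$!
sleep 1                                        # let both children start up

kill -TERM "$polite" "$stubborn"               # step 1: ask nicely
sleep 2                                        # step 2: grace period
kill -KILL "$polite" "$stubborn" 2>/dev/null   # step 3: force whatever is left

wait "$polite";   polite_status=$?
wait "$stubborn"; stubborn_status=$?
echo "polite: $polite_status, stubborn: $stubborn_status"
```

The well-behaved child should report 143 (128 + SIGTERM) and the stubborn one 137 (128 + SIGKILL), so cooperative processes get their chance to clean up while nothing can hold shutdown hostage.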
docker run vs PID1
When you run
sleep 60 and then hit Ctrl-C, the
sleep process itself receives a
SIGINT. When you instead run
docker run --rm
fpco/pid1 sleep 60 and hit Ctrl-C, you may think that the same thing is happening. However, in reality, it's not at all the same. Your
docker run call creates a
docker run process, which sends a command to the Docker daemon on your machine, and that daemon creates the actual
sleep process (inside a container). When you hit Ctrl-C on your terminal, you're sending SIGINT to
docker run, which is in fact sending a command to the Docker daemon, which in turn sends a
SIGINT to your sleep process.
Want proof? Try out the following:
$ docker run --rm fpco/pid1 sleep 60&
[1] 417
$ kill -KILL $!
$ docker ps
CONTAINER ID        IMAGE               COMMAND                 CREATED             STATUS              PORTS               NAMES
69fbc70e95e2        fpco/pid1           "/sbin/pid1 sleep 60"   11 seconds ago      Up 11 seconds                           hopeful_mayer
[1]+  Killed                  docker run --rm fpco/pid1 sleep 60
In this case, we sent a
SIGKILL to the
docker run command. Unlike
SIGINT, SIGKILL cannot be handled, and therefore
docker run is unable to delegate signal handling to a different process. As a result, the
docker run command itself dies, but the
sleep process (and its container) continue running.
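The relaying that docker run performs for catchable signals can be mimicked with two shell functions. This is a rough sketch of the concept only: the real docker run talks to the Docker daemon rather than signalling a child directly, and the function names are invented. It uses SIGTERM instead of SIGINT because non-interactive shells start background jobs with SIGINT ignored.

```shell
# Stand-in for the process inside the container: exits cleanly on SIGTERM.
worker() {
  trap 'echo "worker: got TERM"; exit 0' TERM
  while :; do sleep 1; done
}

# Stand-in for docker run: does no real work, only relays the signal.
proxy() {
  worker &
  worker_pid=$!
  trap 'echo "proxy: relaying TERM"; kill -TERM "$worker_pid"' TERM
  wait "$worker_pid"   # returns early when the trap fires
  wait "$worker_pid"   # now wait for the worker to actually exit
}

proxy_log=$(
  proxy &
  proxy_pid=$!
  sleep 1                    # give both processes time to install their traps
  kill -TERM "$proxy_pid"    # like a catchable signal reaching docker run
  wait "$proxy_pid"
)
echo "$proxy_log"
```

The relay lives entirely in the proxy's signal handler. A SIGKILL never reaches that handler, so the proxy would die with the worker still running, which is the situation shown above with docker ps.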
Some takeaways from this:
- Make sure you use something like pid1 so that your docker run process actually gets your container to reliably shut down
- If you must send a SIGKILL to your process, use the docker kill command instead
Alternative to entrypoint
I've used --entrypoint /sbin/pid1 a lot here. In fact, each usage of that has been superfluous, since the
fpco/pid1 and snoyberg/docker-testing images both use
/sbin/pid1 as their default entrypoint anyway. I included it for explicitness. To prove it to you:
$ docker run --rm fpco/pid1 sleep 60
^C
$
But if you don't want to muck with entrypoints, you can always just include
/sbin/pid1 at the beginning of your command, e.g.:
$ docker run --rm --entrypoint /usr/bin/env fpco/pid1 /sbin/pid1 sleep 60
^C
$
And if you have your own Docker image and you'd just like to include the
pid1 executable, you can download it from the GitHub releases page.
Dockerfiles, command vs exec form
You may be tempted to put something like
ENTRYPOINT /sbin/pid1 in your Dockerfile. Let's see why that won't work:
$ cat Dockerfile
FROM fpco/pid1
ENTRYPOINT /sbin/pid1
$ docker build --tag test .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM fpco/pid1
 ---> aef1f7b702b9
Step 2 : ENTRYPOINT /sbin/pid1
 ---> Using cache
 ---> f875b43a9e40
Successfully built f875b43a9e40
$ docker run --rm test ps
pid1: No arguments provided
The issue here is that we specified /sbin/pid1 in what Docker calls shell form, i.e. as a plain command string. This is just a raw string which is interpreted by the shell. It is unable to be passed an additional command (like
ps), and therefore
pid1 itself complains that it hasn't been told what to run. The correct way to specify your entrypoint is
ENTRYPOINT ["/sbin/pid1"], e.g.:
$ cat Dockerfile
FROM fpco/pid1
ENTRYPOINT ["/sbin/pid1"]
$ docker build --tag test .
Sending build context to Docker daemon 2.048 kB
Step 1 : FROM fpco/pid1
 ---> aef1f7b702b9
Step 2 : ENTRYPOINT /sbin/pid1
 ---> Running in ba0fa8c5bd41
 ---> 4835dec4aae6
Removing intermediate container ba0fa8c5bd41
Successfully built 4835dec4aae6
$ docker run --rm test ps
  PID TTY          TIME CMD
    1 ?        00:00:00 pid1
    8 ?        00:00:00 ps
Generally speaking, you should stick with exec form (the JSON array syntax) in your Dockerfiles at all times. It is explicit about whitespace handling, and avoids the need to use a shell as an interpreter.
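Why the string form swallows the extra command becomes clearer once you mimic it in a plain shell. The sketch below assumes, as the Docker documentation describes, that a string entrypoint is started as a subcommand of /bin/sh -c; echo stands in for /sbin/pid1, and the variable names are ours.

```shell
# String form: the entrypoint string is handed to /bin/sh -c. The extra
# argument (here "ps") becomes a shell positional parameter, not an
# argument of the wrapped command.
shell_form=$(/bin/sh -c 'echo entrypoint-args:' ps)

# Array form: the argument vector is used as-is, so "ps" is appended to it.
exec_form=$(echo entrypoint-args: ps)

echo "string form -> $shell_form"
echo "array form  -> $exec_form"
```

In the first case the wrapped command never sees ps, which matches the "No arguments provided" complaint from pid1 above.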
The main takeaway here is: unless you have a good reason to do otherwise, you should use a minimal init process like
pid1. The Phusion/my_init approach works, but may be too heavyweight for some. If you don't need syslog and other add-on features of Phusion, you're probably best with a minimal init instead.
As a separate but somewhat related comment: we're going to have a follow-up post on this blog in the coming days explaining how we compiled the
pid1 executable as a static executable to make it compatible with all various Linux flavors, and how you can do the same for your Haskell executables. Stay tuned!