It always begins with a PagerDuty alert...

A few weeks ago, I got a PagerDuty ping with this error:

[FIRING:1] :fire: KubeContainerWaitingQuestDB
Summary: Container in waiting state for longer than 1 hour

Huh, well, that could be caused by a bunch of things. Let's use kubectl to check the Pod's status.
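
A describe call along these lines (the Pod name and namespace are placeholders for the affected workload) surfaces the relevant event at the bottom of the output:

  kubectl describe pod <pod-name> -n <namespace>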

Failed to create pod sandbox: rpc error: code = Unknown desc = failed
to get sandbox image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
failed to pull image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
failed to pull and unpack image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
failed to resolve reference "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
pull access denied, repository does not exist or may require authorization:
authorization failed:
no basic auth credentials

Uhhh... pause:3.5? What is this thing? Why doesn't the node have access to it?

After a bit of digging (otherwise known as copy/pasting the error logs into Google), it turned out that I had hit a GitHub issue (Sandbox container image being GC'd in 1.29) in awslabs/amazon-eks-ami. Because my node was running low on disk space, the pause:3.5 image was garbage collected from the node. Subsequently, when a new Pod was created and containerd attempted to re-pull the image, the pull wasn't authenticated with ECR (the sandbox image is normally pulled with temporary ECR credentials when the node bootstraps, and containerd has no way to refresh them on its own), so the kubelet emitted the above error event on the Pod.
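
If you have shell access to the node, you can see both halves of this story for yourself. The config path and commands below are what I'd expect on a containerd-based EKS node, so treat them as a sketch rather than a recipe:

  # Which image containerd is configured to use for the sandbox
  grep sandbox_image /etc/containerd/config.toml

  # Whether that image is still present in containerd's image store
  ctr --namespace k8s.io images ls | grep pause

If the image is missing from that second listing, the next Pod scheduled to the node will fail exactly like the one above.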

While that seemed to explain the issue at hand, it didn't answer the question of why this pause:3.5 container was needed in the first place! Is this a sandbox container? And if so, what does that mean? Since the container is required by every Pod on my Node, clearly this is an important construct in the Kubernetes ecosystem.

What is a Pod Sandbox?

I knew that different containers running inside the same Pod share resources like filesystems and networks. This lets me, for example, create a volume mount in one container and access it from another one in the Pod.
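
As a quick illustration (the Pod and container names here are hypothetical, and both containers are assumed to mount the same volume at /shared), you can write a file from one container and read it back from the other:

  # Write a file from the first container
  kubectl exec <pod-name> -c writer -- sh -c 'echo hello > /shared/hello.txt'

  # Read it back from the second container
  kubectl exec <pod-name> -c reader -- cat /shared/hello.txt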

The sandbox creates an environment that allows a Pod's containers to safely share these resources while still isolating them from the rest of the node. This is described in more detail in the k8s blog post that introduced the Container Runtime Interface (CRI) back in 2016:

A Pod is composed of a group of application containers in an isolated environment with resource constraints. In CRI, this environment is called PodSandbox. We intentionally leave some room for the container runtimes to interpret the PodSandbox differently based on how they operate internally. For hypervisor-based runtimes, PodSandbox might represent a virtual machine. For others, such as Docker, it might be Linux namespaces.

Ok, so the sandbox is a group of namespaces? Then what does this pause container do? As with most things, let's go to the source to see if we can get another clue. After some basic argument handling (for -v) and registering signal handlers for SIGINT, SIGTERM, and SIGCHLD, we get to the main execution loop:

  for (;;)
    pause();

The pause(2) man page reveals that all this does is put the calling thread to sleep until a signal is delivered. So why does every Pod need an extra container that just sleeps?

Since the docs mentioned that a PodSandbox is implementation specific, I started digging around in containerd. There, I found the containerd cri plugin architecture doc on GitHub. From that doc, one of the Pod initialization steps is:

cri uses containerd internal to create and start a special pause container (the sandbox container) and put that container inside the pod’s cgroups and namespace (steps omitted for brevity);
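
You can actually see these sandboxes on a node with crictl, which talks to the CRI endpoint directly. Assuming crictl is installed and pointed at containerd's socket, something like this lists one sandbox per Pod running on the node:

  # List the pod sandboxes containerd knows about
  crictl pods

  # Inspect one of them (ID taken from the listing above)
  crictl inspectp <pod-sandbox-id>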

Finally, the answer!

Clicking the link in the quote above led me to Ian Lewis's blog, and there I finally had my answer.

In Kubernetes, the pause container serves as the “parent container” for all of the containers in your pod. The pause container has two core responsibilities. First, it serves as the basis of Linux namespace sharing in the pod. And second, with PID (process ID) namespace sharing enabled, it serves as PID 1 for each pod and reaps zombie processes.

If only I had found this article at the beginning of my journey! Ian goes into great detail about how the pause container not only sleeps, but also manages the lifecycle of child processes by reaping zombies, which is something that I missed when I first reviewed the container executable's source code. There are also other articles on the blog that discuss how container runtimes work at a low level, which is really interesting stuff. I highly recommend reading these!
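
One way to see the PID 1 behavior for yourself (the Pod and container names are placeholders; this assumes shareProcessNamespace: true is set in the Pod spec and that the container image ships a ps binary) is to list the processes visible inside an application container:

  kubectl exec <pod-name> -c <container-name> -- ps ax

With process namespace sharing enabled, the pause process shows up as PID 1; without it, each container sees its own entrypoint as PID 1 and the pause container only anchors the Pod's network and IPC namespaces.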

While this post could be boiled down to simply sharing a link to the aforementioned blog post, I thought it would be useful to publicly document my research process and train of thought. It often takes a combination of Google searches, blog posts, READMEs, docs, and source files to solve or understand a particular issue. I find that going the extra mile to properly investigate a problem really improves your understanding of the systems you work with every day, making it that much easier to solve the next issue that pops up.

Back to the original PagerDuty alert. I resolved the immediate problem by cordoning, draining, and deleting the affected node. The Pod was re-scheduled onto a new node (one that already had the pause image) and started up with no issues.
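
Concretely, that remediation looks roughly like this (the node name is a placeholder, and the drain flags are the ones typically needed when DaemonSets and emptyDir volumes are present):

  kubectl cordon <node-name>
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  kubectl delete node <node-name>

Draining evicts the Pods so they can be re-scheduled elsewhere before the Node object itself is removed.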

While I simply could've left it there, knowing that I had a remediation plan in case I ever saw that error again, I'm glad that I went the extra mile to fully investigate the problem. As a result, I not only feel more confident in my decision to remove the node, but I also have a deeper understanding of how Kubernetes works at a fundamental level, and I can add that bit of knowledge to my toolkit.