What the heck is a pod sandbox?

It always begins with a PagerDuty alert...

A few weeks ago, I got a PagerDuty ping with this error:

[FIRING:1] :fire: KubeContainerWaitingQuestDB
Summary: Container in waiting state for longer than 1 hour

Huh, well that could be caused by a bunch of things. Let's use kubectl to check the Pod's status:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed
to get sandbox image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
failed to pull image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
failed to pull and unpack image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
failed to resolve reference "602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks/pause:3.5":
pull access denied, repository does not exist or may require authorization:
authorization failed:
no basic auth credentials

Uhhh... pause:3.5? What is this thing? Why doesn't the node have access to it?

After a bit of digging (otherwise known as copy/pasting the error logs into Google), it turned out that I had hit a GitHub issue (Sandbox container image being GC'd in 1.29) in awslabs/amazon-eks-ami. Because my node was running low on disk space, the pause:3.5 image had been garbage collected by the container runtime, containerd. When a new Pod was later created and containerd attempted to re-pull the image, it had no valid ECR credentials, so the pull failed and the kubelet emitted the error event above on the Pod.

While that seemed to explain the issue at hand, it didn't answer the question of why this pause:3.5 container was needed in the first place! Is this a sandbox container? And if so, what does that mean? Since the container is required by every Pod on my Node, clearly this is an important construct in the Kubernetes ecosystem.

What is a Pod Sandbox?

I knew that different containers running inside the same Pod share resources like filesystems and networks. This lets me, for example, create a volume mount in one container and access it from another one in the Pod.

The sandbox creates an environment which allows a Pod's containers to safely share these resources, while still isolating them from the rest of the node. This is detailed more in a k8s blog post introducing the Container Runtime Interface (CRI) back in 2016:

A Pod is composed of a group of application containers in an isolated environment with resource constraints. In CRI, this environment is called PodSandbox. We intentionally leave some room for the container runtimes to interpret the PodSandbox differently based on how they operate internally. For hypervisor-based runtimes, PodSandbox might represent a virtual machine. For others, such as Docker, it might be Linux namespaces.

Ok, so the sandbox is a group of namespaces? Then what does this pause container do? As with most things, let's go to the source to see if we can find another clue. After some basic argument handling (for -v) and registering signal handlers for SIGINT, SIGTERM, and SIGCHLD, we get to the main execution loop:

  for (;;)
    pause();

Looking at the pause(2) man page, it turns out that all this does is suspend the calling thread until a signal is delivered. So why does every Pod need an extra container that just sleeps?

Since the docs mentioned that a PodSandbox is implementation-specific, I started digging around in containerd. There, I found the containerd cri plugin architecture doc on GitHub. From that doc, one of the Pod initialization steps is:

cri uses containerd internal to create and start a special pause container (the sandbox container) and put that container inside the pod’s cgroups and namespace (steps omitted for brevity);

Finally, the answer!

The link in the quote above led me to Ian Lewis's blog, where I finally had my answer.

In Kubernetes, the pause container serves as the “parent container” for all of the containers in your pod. The pause container has two core responsibilities. First, it serves as the basis of Linux namespace sharing in the pod. And second, with PID (process ID) namespace sharing enabled, it serves as PID 1 for each pod and reaps zombie processes.

If only I had found this article at the beginning of my journey! Ian goes into great detail about how the pause container not only sleeps, but also manages the lifecycle of child processes by reaping zombies, which is something that I missed when I first reviewed the container executable's source code. There are also other articles on the blog that discuss how container runtimes work at a low level, which is really interesting stuff. I highly recommend reading these!
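
To make the zombie-reaping part concrete, here's a rough sketch of the pattern an init-like process follows (this mirrors what pause.c does, but it isn't the actual source): install a SIGCHLD handler that reaps children with waitpid() in a non-blocking loop, then go back to sleeping forever.

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that has exited. WNOHANG makes waitpid() return
   immediately instead of blocking once there are no zombies left. */
static void sigreap(int sig) {
    (void) sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sigreap;
    sa.sa_flags = SA_NOCLDSTOP; /* ignore stopped children, only reap exits */
    sigaction(SIGCHLD, &sa, NULL);

    /* As PID 1 of the pod's PID namespace: do nothing but sleep,
       waking briefly whenever a signal (e.g. SIGCHLD) arrives. */
    for (;;)
        pause();
}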

While this post could be boiled down to simply sharing a link to the aforementioned blog post, I thought that it would be useful to publicly document my research process and train of thought. Many times, it takes a combination of Google searches, blog posts, READMEs, docs, and source files to solve or understand a particular issue. I find that by going the extra mile and taking the time to properly investigate a problem, you can really improve your understanding of the systems that you work with every day, making it that much easier to solve the next issue that pops up.

Back to the original PagerDuty alert. I originally solved the problem by cordoning, draining, and deleting the node. The Pod was re-scheduled to a new node (with a downloaded pause container image), and started up with no issues.

While I simply could've left it there, knowing that I had a remediation plan in case I ever experienced that error again, I'm glad that I went the extra mile to fully investigate the problem. As a result, I not only feel more confident in my choice to remove the node, but I also have a deeper understanding of how Kubernetes works at a fundamental level, and I can add that bit of knowledge to my toolkit.

A cloud engineer's first QuestDB Pull Request

Originally published on the QuestDB Blog

It was a little over a year and a half into my tenure as a cloud engineer at QuestDB when I opened my first Pull Request to the core database. Before that, I had spent my time working with tools like Kubernetes and Docker to manage QuestDB deployments across multiple datacenters. I implemented production-grade observability solutions, wrote a Kubernetes operator in Golang, and pored over seemingly minute details of our AWS bills.

While I enjoyed the cloud-native work (and still do!), I continued to have a nagging desire to meaningfully contribute to the actual database that I spent all day orchestrating. It took the birth of my daughter, and the accompanying parental leave, for me to disconnect a bit and think about my priorities and career goals.

What was really stopping me from contributing? Was it the dreaded imposter syndrome? Or just a backlog of cloud-related tasks on my plate? After all, QuestDB is open source, so not much was stopping me from submitting some code changes.

With this mindset, I had a meeting with Vlad, our CTO, as I was ending my leave and about to start ramping my workload back up.

First PR to QuestDB Core

Config Hot-Reloading

Since I was coming back to work part-time for a bit, I figured that I could pick up a project that wasn't particularly time-sensitive so I could continue to help out with the baby at home.

One item that came up was the ability for QuestDB to adjust its runtime configuration on-the-fly. To do so, we'd need to monitor the config file, server.conf, and apply any configuration changes to the database without restarting it.

This task immediately resonated with me, since I've personally felt the pain of not having this "hot-reload" feature. I've spent way too many hours writing Kubernetes operator code that restarts a running QuestDB Pod on a mounted ConfigMap change. Having the opportunity to build a hot-reload feature was tantalizing, to say the least.

So I was off to the races, excited to get started working on my first major contribution to QuestDB.

We quickly arrived at a basic design.

  1. Build a new FileWatcher class: Monitor the database's server.conf file for changes.

  2. Detect changes: FileWatcher detects changes to the server.conf file.

  3. Load new values: Read the server.conf and load any new configuration values from the updated file.

  4. Validate the updated configuration: Check that the new settings are valid before applying them.

  5. Apply the new configuration: Apply the new configuration values to the running server without restarting it.

Seems easy enough!

But like in most production-grade code, this seemingly simple problem got quite complex very quickly...

Complications

It's one thing to add a new feature to a relatively greenfield codebase. But it's quite another to add one to a mature codebase with over 100 contributors and years of history. Not all of these challenges were evident at the start, but over time, I started to internalize them.

  1. The FileWatcher component needed to be cross-platform, since QuestDB supports Linux, macOS, and Windows.
  2. The new reloading server config (called DynamicServerConfiguration) had to slot in cleanly with the existing plumbing that runs QuestDB and allows QuestDB Enterprise to plug in to the open core.
  3. We wanted the experience to be as seamless as possible for end users. This meant that we couldn't forcibly close open database connections or restart the entire server.
  4. Caching configuration values was much more common throughout the codebase than we initially thought. Many classes and factories read the server configuration only once, on initialization, and would need to be re-initialized to pick up a new config setting.
  5. As with everything we do at QuestDB, performance was paramount. The solution needed to be as efficient as possible, leaving compute resources free for more important things, like ingesting and querying data.

Inotify, kqueue, epoll, oh my!

Nine times out of ten, if you asked me to write a cross-platform file watcher library, I would google for "cross-platform filewatcher in Java" and pick something off the shelf. But working on a codebase that values strict memory accounting and efficient resource usage, it just didn't feel right to pull in a third-party library. To maintain the performance that QuestDB is known for, it's crucial to understand what every bit of code is doing under the hood. So, in spite of the famous "Not Invented Here Syndrome", I went about learning how to implement file watchers at the syscall level in C.

I've felt this way a few times in my career, picking up something so brand new that I wasn't even sure where to start. And during these times, I've reached for venerable and canonical books on the subject to learn the basics. So, I hit up Amazon and got some reading material.

Reading Material

Since I'd recently been coding on my fancy new Ryzen Zen 4-based EndeavourOS desktop, I decided to start with the Linux implementation. I began writing some primitive C code, working with APIs like inotify and epoll, to effectively park a thread and wait for a change in a specific file or directory. Once one of these lower-level APIs detected a change, execution would continue, at which point I'd filter the events for the particular filename and return.
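
To illustrate the shape of that Linux flow (a minimal sketch under my own assumptions, not the code that actually landed in QuestDB): add an inotify watch on the config directory, register the inotify file descriptor with epoll, and block until an event for the file we care about shows up.

#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Block until a file named `filename` inside `dir` is written, then return 0.
   Error handling and cleanup are abbreviated; this is just the skeleton. */
int wait_for_change(const char *dir, const char *filename) {
    int ifd = inotify_init1(IN_CLOEXEC);
    /* Watch the directory rather than the file itself: many editors replace
       the file on save, which would silently invalidate a file-level watch. */
    inotify_add_watch(ifd, dir, IN_CLOSE_WRITE | IN_MOVED_TO | IN_CREATE);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = ifd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, ifd, &ev);

    char buf[4096] __attribute__((aligned(8)));
    for (;;) {
        struct epoll_event out;
        /* Park the thread here until inotify has events for us. */
        if (epoll_wait(epfd, &out, 1, -1) <= 0)
            continue;

        ssize_t len = read(ifd, buf, sizeof(buf));
        for (char *p = buf; p < buf + len;) {
            struct inotify_event *e = (struct inotify_event *) p;
            if (e->len > 0 && strcmp(e->name, filename) == 0) {
                close(epfd);
                close(ifd);
                return 0; /* the file we care about changed */
            }
            p += sizeof(struct inotify_event) + e->len;
        }
    }
}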

Once I was happy with my implementation, I still needed to make the code available to the JVM, where QuestDB runs. I was able to use the Java Native Interface (JNI) to wrap my functions in macros that allow the JVM to load the compiled binary and call them directly from Java.
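
The JNI glue itself is mostly boilerplate. Here's a hypothetical example (the package, class, and method names are made up for illustration and aren't QuestDB's): the exported C symbol encodes the Java package, class, and method name, and a matching native declaration on the Java side lets the JVM call straight into the compiled library.

#include <jni.h>

int wait_for_change(const char *dir, const char *filename); /* from the sketch above */

/* JNIEXPORT and JNICALL are the macros that make this symbol visible to,
   and callable by, the JVM. The function name maps to a hypothetical
   com.example.FileWatcher#waitForChange(String) static native method. */
JNIEXPORT jint JNICALL
Java_com_example_FileWatcher_waitForChange(JNIEnv *env, jclass clazz, jstring jdir) {
    const char *dir = (*env)->GetStringUTFChars(env, jdir, NULL);
    jint rc = (jint) wait_for_change(dir, "server.conf");
    (*env)->ReleaseStringUTFChars(env, jdir, dir);
    return rc;
}

/* The matching Java side (shown as a comment to keep these snippets in one
   language) would look roughly like:

       public final class FileWatcher {
           static { System.loadLibrary("filewatcher"); }
           static native int waitForChange(String dir);
       }
*/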

But this was only the start. I also needed to make this work on macOS and Windows. Unfortunately, inotify isn't available on either of those operating systems, so I needed to find an alternative. Since macOS has BSD roots, it shares many of the same core APIs as FreeBSD, including kqueue, which I was able to use instead of inotify to implement the core functionality of my filewatcher. Luckily, QuestDB already has some kqueue code, since we use it to handle network traffic on those platforms, so I only had to add a few new C functions for the functionality I needed.
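
The kqueue version has the same shape, just with different primitives: open the file, register an EVFILT_VNODE filter for write/rename/delete events, and block in kevent() until one fires. A rough sketch (again, not QuestDB's actual code), assuming macOS or FreeBSD:

#include <fcntl.h>
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

/* Block until `path` is written to, renamed, or deleted. Returns 0 on an
   event, -1 on error. Error handling is abbreviated for brevity. */
int wait_for_change_kqueue(const char *path) {
    int kq = kqueue();
    int fd = open(path, O_RDONLY); /* macOS also offers O_EVTONLY here */
    if (kq < 0 || fd < 0)
        return -1;

    struct kevent change;
    EV_SET(&change, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
           NOTE_WRITE | NOTE_RENAME | NOTE_DELETE, 0, NULL);

    /* Register the filter and wait for a single event in one call. */
    struct kevent event;
    int n = kevent(kq, &change, 1, &event, 1, NULL);

    close(fd);
    close(kq);
    return n > 0 ? 0 : -1;
}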

As for Windows? Vlad was a lifesaver there, since I don't have a Windows machine! He used low-level WinAPI libraries to implement the filewatcher and made them available to QuestDB through the JNI.

When I first started reading the QuestDB codebase, I found a web of classes and interfaces with abstract names like FactoryProviderFactory and PropBootstrapConfiguration. Was this that "enterprise Java"-style of programming that I've heard so much about?

FizzBuzzEnterpriseEdition

After a lot of F12 and Opt+Shift+F12 in IntelliJ, I started to build a mental map of the project structure and things started to make more sense. At its core, the entrypoint is a linear process. We use Java's built-in Properties to read server.conf into a property of a BootstrapConfiguration, pass that to the constructor of a Bootstrap class, and use that as an input to ServerMain, QuestDB's entrypoint.

QuestDB Bootstrap Flow

The reason for so many factories, interfaces, and abstract classes is twofold.

  1. It allows devs to mock just about any dependency in unit tests.
  2. It creates abstraction layers for QuestDB Enterprise to use and extend existing core components.

Now, I was ready to make some changes! I added a new DynamicServerConfiguration interface that exposed a reload() method, and created an implementation of this interface that used the delegate pattern to wrap a legacy ServerConfiguration instance. When reload() was called, we would read the server.conf file, validate it, and atomically swap the delegate config with the new version. I then created an instance of my FileWatcher in the main QuestDB entrypoint, with a callback that called DynamicServerConfiguration.reload() whenever the watched file changed.

Dynamic Server Config Reload Sequence Diagram
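
The real implementation is Java, but the delegate-and-swap idea sketches easily in C (simplified, with made-up names, and not QuestDB's code): the rest of the server holds onto one stable wrapper forever, and reload() atomically replaces the delegate underneath it.

#include <stdatomic.h>
#include <stddef.h>

/* A heavily simplified stand-in for the server configuration. */
struct server_config {
    const char *pg_user;
    const char *pg_password;
    /* ...every other server.conf setting... */
};

/* The "dynamic" wrapper: callers keep a pointer to this forever,
   while reload() swaps the delegate it points at. */
struct dynamic_config {
    _Atomic(struct server_config *) delegate;
};

/* Called from the file-watcher callback when server.conf changes.
   The new config is parsed and validated before we ever swap it in. */
int reload(struct dynamic_config *cfg, struct server_config *fresh) {
    if (fresh == NULL)
        return -1; /* validation failed: keep serving the old config */
    atomic_store(&cfg->delegate, fresh);
    /* Note: the old delegate is not freed here. In Java the GC reclaims it
       once no reader holds a reference; safe reclamation in C would need
       refcounting or RCU, which is out of scope for this sketch. */
    return 0;
}

/* Readers always go through the wrapper, so they see the latest delegate. */
const char *current_pg_user(struct dynamic_config *cfg) {
    return atomic_load(&cfg->delegate)->pg_user;
}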

As you can imagine, wiring this all up wasn't the easiest, since I needed to maintain the existing class initialization order so that all dependencies would be ready at the correct time. I also didn't want to significantly modify the entrypoint of QuestDB. I felt this would not only confuse developers, but also cause problems when trying to compile Enterprise Edition.

Vlad had some great advice for me here (paraphrasing): "Make a change and re-run the unit tests. If you've broken hundreds of them, try a different way. If you've only broken around five, you're on the right track."

Now, what can we actually reload?

There are a lot of possible settings to change in QuestDB, at all different levels of the database. At first, we thought that something like hot-reloading a query timeout would be a nice feature to have. This way, if I find that a specific query is taking too long to execute, I can simply modify my server.conf without having to restart the database.

Unfortunately, query timeouts are cached deep inside the cairo query engine, and updating those components to read directly from the DynamicServerConfiguration would be an exercise in futility.

After a lot of poking and prodding of the codebase, we found something that would work: pgwire credentials! QuestDB supports configurable read/write and read-only users that are used to secure communication with the database over the Postgres wire protocol. We validate these users' credentials with a class that reads them from a ServerConfiguration and stores them in a custom (optimized) utf8 sink.

I was able to modify this class to accept my new dynamic configuration, cache it, and check whether the config reference has changed from the previous call. Because the dynamic configuration uses the delegate pattern, after a successful configuration reload (where we re-initialize the delegate), a new configuration would have a different memory address, and the cached reference would not match. At this point, the class would know to update its username & password sinks with the newly-updated config values.
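
Continuing the C illustration from above (the real code is Java), the check boils down to comparing the cached delegate pointer with the current one and rebuilding the credential sinks only when they differ:

#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical pgwire authenticator state, reusing the dynamic_config and
   server_config types from the previous sketch. It remembers which delegate
   it last copied credentials from. */
struct pg_authenticator {
    struct dynamic_config *cfg;
    struct server_config *seen; /* delegate the sinks were last built from */
    char user[64];
    char password[64];
};

void refresh_credentials_if_needed(struct pg_authenticator *auth) {
    struct server_config *current = atomic_load(&auth->cfg->delegate);
    if (current != auth->seen) {
        /* A reload swapped in a new delegate (new address), so rebuild the
           cached username and password sinks from the fresh values. */
        snprintf(auth->user, sizeof(auth->user), "%s", current->pg_user);
        snprintf(auth->password, sizeof(auth->password), "%s", current->pg_password);
        auth->seen = current;
    }
}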

The big moment, ready to merge!

It takes a village to raise a child. I've learned that already in my short time as a father. And a Pull Request is no different. Both Vlad and Jaromir helped to get this thing over the finish line. From acting as a sounding board to getting their hands dirty in Java and C code, they provided fantastic support over the 5 months that my PR was open.

Towards the end of the project, even though all tests were passing in core, a wrinkle in QuestDB Enterprise prevented us from merging the PR. We realized that our abstraction layers weren't quite perfect, so we couldn't reuse some parts of the core codebase in Enterprise. Instead of re-architecting everything from scratch, and probably adding weeks or more to the project, we ended up copying a few lines from core into Enterprise. It compiled, tests passed, and everyone was happy.

PR merged on GitHub

Now that Enterprise and core were both ready to go with a green check mark on GitHub, I hit the "Merge" button on GitHub and went outside for a long walk.

Learnings

While this ended up being an incredibly long journey to "simply" let users change pgwire credentials on-the-fly, I consider it to be a massive personal success in my growth and development as a software engineer. The amount of confidence that this task has given me cannot be overstated. From this project alone, I've:

  • written my own C code for the first time
  • learned several new kernel APIs
  • used unsafe semantics in a memory-managed programming language
  • navigated the inner workings of a massive, mature codebase

And all in a new IDE for me (IntelliJ)!

With confidence stemming from the breadth and depth of work in this project, I'm ready to take on my next challenge in the core QuestDB codebase. I've already implemented a few simple SQL functions and started to grok the SQL Expression Parser. But given our aggressive roadmap with features like Parquet support, Array data types, and an Apache Arrow ADBC driver, I'm sure that there are plenty of other things for me to contribute in the future! What's even more exciting is that I can use my cloud-native expertise to help drive the database forward as we move towards a fully distributed architecture.

If you're curious about all of this work, here's a link to the PR

How to add Kubernetes-powered leader election to your Go apps

The Kubernetes standard library is full of gems, hidden away in many of the various subpackages that are part of the ecosystem. One such example that I discovered recently is k8s.io/client-go/tools/leaderelection, which can be used to add a leader election protocol to any application running inside a Kubernetes cluster. This article will discuss what leader election is, how it's implemented in this Kubernetes package, and provide an example of how we can use this library in our own applications.

Leader Election

Leader election is a distributed systems concept that is a core building block of highly-available software. It allows multiple concurrent processes to coordinate amongst each other and elect a single "leader" process, which is then responsible for actions that must be performed by exactly one process at a time, like writing to a data store.

This is useful in systems like distributed databases or caches, where multiple processes are running to create redundancy against hardware or network failures, but can't write to storage simultaneously to ensure data consistency. If the leader process becomes unresponsive at some point in the future, the remaining processes will kick off a new leader election, eventually picking a new process to act as the leader.

Using this concept, we can create highly-available software with a single leader and multiple standby replicas.

In Kubernetes, the controller-runtime package uses leader election to make controllers highly-available. In a controller deployment, resource reconciliation only occurs when a process is the leader, and other replicas are waiting on standby. If the leader pod becomes unresponsive, the remaining replicas will elect a new leader to perform subsequent reconciliations and resume normal operation.

Kubernetes Leases

This library uses a Kubernetes Lease, which is effectively a distributed lock that can be acquired by a process. Leases are native Kubernetes resources that are held by a single identity, for a given duration, with a renewal option. Here's an example spec from the docs:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  labels:
    apiserver.kubernetes.io/identity: kube-apiserver
    kubernetes.io/hostname: master-1
  name: apiserver-07a5ea9b9b072c4a5f3d1c3702
  namespace: kube-system
spec:
  holderIdentity: apiserver-07a5ea9b9b072c4a5f3d1c3702_0c8914f7-0f35-440e-8676-7844977d3a05
  leaseDurationSeconds: 3600
  renewTime: "2023-07-04T21:58:48.065888Z"

Leases are used by the k8s ecosystem in three ways:

  1. Node Heartbeats: Every Node has a corresponding Lease resource and updates its renewTime field on an ongoing basis. If a Lease's renewTime hasn't been updated in a while, the Node will be tainted as not available and no more Pods will be scheduled to it.
  2. Leader Election: In this case, a Lease is used to coordinate among multiple processes by having a leader update the Lease's holderIdentity. Standby replicas, with different identities, are stuck waiting for the Lease to expire. If the Lease does expire, and is not renewed by the leader, a new election takes place in which the remaining replicas attempt to take ownership of the Lease by updating its holderIdentity with their own. Since the Kubernetes API server disallows updates to stale objects, only a single standby node will successfully be able to update the Lease, at which point it will continue execution as the new leader.
  3. API Server Identity: Starting in v1.26, as a beta feature, each kube-apiserver replica will publish its identity by creating a dedicated Lease. Since this is a relatively slim, new feature, there's not much else that can be derived from the Lease object aside from how many API servers are running. But this does leave room to add more metadata to these Leases in future k8s versions.

Now let's explore this second use case of Leases by writing a sample program to demonstrate how you can use them in leader election scenarios.

Example Program

In this code example, we are using the leaderelection package to handle the leader election and Lease manipulation specifics.

package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	rl "k8s.io/client-go/tools/leaderelection/resourcelock"
	ctrl "sigs.k8s.io/controller-runtime"
)

var (
	// lockName and lockNamespace need to be shared across all running instances
	lockName      = "my-lock"
	lockNamespace = "default"

	// identity is unique to the individual process. This will not work for
	// anything outside of a toy example, since processes running in different
	// containers or computers can share the same pid.
	identity = fmt.Sprintf("%d", os.Getpid())
)

func main() {
	// Get the active kubernetes context
	cfg, err := ctrl.GetConfig()
	if err != nil {
		panic(err.Error())
	}

	// Create a new lock. This will be used to create a Lease resource in the cluster.
	l, err := rl.NewFromKubeconfig(
		rl.LeasesResourceLock,
		lockNamespace,
		lockName,
		rl.ResourceLockConfig{
			Identity: identity,
		},
		cfg,
		time.Second*10,
	)
	if err != nil {
		panic(err)
	}

	// Create a new leader election configuration with a 15 second lease duration.
	// Visit https://pkg.go.dev/k8s.io/client-go/tools/leaderelection#LeaderElectionConfig
	// for more information on the LeaderElectionConfig struct fields
	el, err := leaderelection.NewLeaderElector(leaderelection.LeaderElectionConfig{
		Lock:          l,
		LeaseDuration: time.Second * 15,
		RenewDeadline: time.Second * 10,
		RetryPeriod:   time.Second * 2,
		Name:          lockName,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { println("I am the leader!") },
			OnStoppedLeading: func() { println("I am not the leader anymore!") },
			OnNewLeader:      func(identity string) { fmt.Printf("the leader is %s\n", identity) },
		},
	})
	if err != nil {
		panic(err)
	}

	// Begin the leader election process. This will block.
	el.Run(context.Background())

}

What's nice about the leaderelection package is that it provides a callback-based framework for handling leader elections. This way, you can act on specific state changes in a granular way and properly release resources when a new leader is elected. By running these callbacks in separate goroutines, the package takes advantage of Go's strong concurrency support to efficiently utilize machine resources.

Testing it out

To test this, let's spin up a test cluster using kind.

$ kind create cluster

Copy the sample code into main.go, create a new module (go mod init leaderelectiontest) and tidy it (go mod tidy) to install its dependencies. Once you run go run main.go, you should see output like this:

$ go run main.go
I0716 11:43:50.337947     138 leaderelection.go:250] attempting to acquire leader lease default/my-lock...
I0716 11:43:50.351264     138 leaderelection.go:260] successfully acquired lease default/my-lock
the leader is 138
I am the leader!

The exact leader identity will be different from what's in the example (138), since this is just the PID of the process that was running on my computer at the time of writing.

And here's the Lease that was created in the test cluster:

$ kubectl describe lease/my-lock
Name:         my-lock
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  coordination.k8s.io/v1
Kind:         Lease
Metadata:
  Creation Timestamp:  2024-07-16T15:43:50Z
  Resource Version:    613
  UID:                 1d978362-69c5-43e9-af13-7b319dd452a6
Spec:
  Acquire Time:            2024-07-16T15:43:50.338049Z
  Holder Identity:         138
  Lease Duration Seconds:  15
  Lease Transitions:       0
  Renew Time:              2024-07-16T15:45:31.122956Z
Events:                    <none>

See that the "Holder Identity" is the same as the process's PID, 138.

Now, let's open up another terminal and run the same main.go file in a separate process:

$ go run main.go
I0716 11:48:34.489953     604 leaderelection.go:250] attempting to acquire leader lease default/my-lock...
the leader is 138

This second process will wait until the first one becomes unresponsive. Let's kill the first process and wait around 15 seconds. Now that the first process is no longer renewing its claim on the Lease, the .spec.renewTime field won't be updated anymore. This will eventually cause the second process to trigger a new leader election, since the Lease's renew time is older than its duration. Because this process is now the only one running, it will elect itself as the new leader.

the leader is 604
I0716 11:48:51.904732     604 leaderelection.go:260] successfully acquired lease default/my-lock
I am the leader!

If there were multiple processes still running after the initial leader exited, the first process to acquire the Lease would be the new leader, and the rest would continue to be on standby.

No single-leader guarantees

This package is not foolproof, in that it "does not guarantee that only one client is acting as a leader (a.k.a. fencing)". For example, if a leader is paused and lets its Lease expire, another standby replica will acquire the Lease. Then, once the original leader resumes execution, it will think that it's still the leader and continue doing work alongside the newly-elected leader. In this way, you can end up with two leaders running simultaneously.

To fix this, a fencing token which references the Lease needs to be included in each request to the server. A fencing token is effectively an integer that increases by 1 every time a Lease changes hands. So a client with an old fencing token will have its requests rejected by the server. In this scenario, if an old leader wakes up from sleep and a new leader has already incremented the fencing token, all of the old leader's requests would be rejected because it is sending an older (smaller) token than what the server has seen from the newer leader.
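
Here's a tiny sketch of the server-side half of that idea (illustrative only; nothing like this is provided by the leaderelection package): the data store remembers the highest fencing token it has seen and rejects anything older.

/* Illustrative only: a data store that tracks the highest fencing token it
   has seen. A request carrying an older token must have come from a client
   that lost the lease, so it is rejected. Synchronization omitted for brevity. */
void apply_write(const char *payload); /* hypothetical: actually persist the data */

static long highest_token_seen = 0;

int handle_write(long fencing_token, const char *payload) {
    if (fencing_token < highest_token_seen)
        return -1; /* stale leader: reject the request */
    highest_token_seen = fencing_token;
    apply_write(payload);
    return 0;
}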

Implementing fencing in Kubernetes would be difficult without modifying the core API server to account for corresponding fencing tokens for each Lease. However, the risk of having multiple leader controllers is somewhat mitigated by the k8s API server itself. Because updates to stale objects are rejected, only controllers with the most up-to-date version of an object can modify it. So while we could have multiple controller leaders running, a resource's state would never regress to older versions if a controller misses a change made by another leader. Instead, reconciliation time would increase as both leaders need to refresh their own internal states of resources to ensure that they are acting on the most recent versions.

Still, if you're using this package to implement leader election using a different data store, this is an important caveat to be aware of.

Conclusion

Leader election and distributed locking are critical building blocks of distributed systems. When trying to build fault-tolerant and highly-available applications, having tools like these at your disposal is essential. The Kubernetes standard library gives us a battle-tested wrapper around its primitives that lets application developers easily build leader election into their own applications.

While use of this particular library does limit you to deploying your application on Kubernetes, that seems to be the way the world is going recently. If in fact that is a dealbreaker, you can of course fork the library and modify it to work against any ACID-compliant and highly-available datastore.

Stay tuned for more k8s source deep dives!