Annotations in Kubernetes Operator Design

It seems that annotations are everywhere in the Kubernetes (k8s) ecosystem. Ingress controllers, cloud providers, and operators of all kinds use the metadata stored in annotations to perform targeted actions inside of a cluster. So how can we leverage these when developing a new k8s operator?

To the Docs

Despite their widespread use, the official documentation of annotations is actually quite brief. In fact, it only takes two short sentences at the top of the page to define an annotation:

You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata.

While technically accurate, this definition is still pretty vague and not entirely helpful.

The docs expand on this by providing a few examples of the types of metadata that can be stored in an annotation. But these samples range from build information all the way to individuals' "phone or pager numbers" (who still carries a pager these days anyway?).

Somewhere within their ambiguity lies the true power of k8s annotations; they grant the ability to tag any cluster resource with structured data in almost any format. It's like having a dedicated key-value store attached to every resource in your cluster! So how can we harness this power in an operator?

In this post, I will detail a way in which I recently used annotations while writing an operator for my company's product, QuestDB. Hopefully this will give you an idea of how you can incorporate annotations into your own operators to harness their full potential.

Background

The operator that I've been working on is designed to manage the full lifecycle of a QuestDB database instance, including version and hardware upgrades, config changes, backups, and (eventually) recovery from node failure. I used the Operator SDK and kubebuilder frameworks to provide scaffolding and API support.

It always comes back to a JWK

In order to take advantage of the database's many performance optimizations (such as importing over 300k rows/sec with io_uring), we recommend that users ingest data over the InfluxDB Line Protocol (ILP). One of the features that we offer, which is not part of the original protocol, is authentication over TCP using a JSON Web Key (JWK).

This feature can be configured in a file that is referenced by the main server config on launch. You just need to add your JWK's key id and public data to the file in this format:

testUser1 ec-p-256-sha256 fLKYEaoEb9lrn3nkwLDA-M_xnuFOdSt9y0Z7_vWSHLU Dt5tbS1dEDMSYfym3fgMv0B99szno-dFc1rYF9t0aac
# [key/user id] [key type] {keyX keyY}

Let's say that you have your private key stored elsewhere in a k8s cluster as a Secret, so your client application can securely push data to your QuestDB instance. The JWK secret data would look something like this:

{
  "kty": "EC",
  "d": "5UjEMuA0Pj5pjK8a-fa24dyIf-Es5mYny3oE_Wmus48",
  "crv": "P-256",
  "kid": "testUser1",
  "x": "fLKYEaoEb9lrn3nkwLDA-M_xnuFOdSt9y0Z7_vWSHLU",
  "y": "Dt5tbS1dEDMSYfym3fgMv0B99szno-dFc1rYF9t0aac"
}

When a user creates a QuestDB Custom Resource (CR) in the cluster, we want the operator to find this private key, extract the public values ("kid", "x", and "y"), and reformat them into a valid auth.conf entry stored in a ConfigMap that gets mounted to the Pod running our QuestDB instance. The operator can then add line.tcp.auth.db.path=auth.conf to the main server config to make the database aware of the new authentication file, and the client application can communicate with QuestDB securely over ILP using the private key.
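
To make the reformatting step concrete, here's a minimal sketch of a helper in the spirit of the constructIlpAuthConfig function used later in this post (using the standard library's fmt package); the exact name and format handling in the real operator may differ:

// constructIlpAuthConfig builds a single auth.conf entry in the format
// shown above: [key/user id] [key type] [keyX] [keyY]
func constructIlpAuthConfig(x, y, kid string) string {
    // "ec-p-256-sha256" matches the P-256 JWK from the example above
    return fmt.Sprintf("%s ec-p-256-sha256 %s %s", kid, x, y)
}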

How can we let the operator know which Secret to use?

Using the Spec

One approach is to simply add a couple of fields to the QuestDB Custom Resource spec:

type QuestDBSpec struct {
    ...
    IlpSecretName      string `json:"ilpSecretName,omitempty"`
    IlpSecretNamespace string `json:"ilpSecretNamespace,omitempty"`
    ...
}

With these fields, a user can now set their values to the name and namespace of the secret that contains the JWK's private key, like so:

apiVersion: crd.questdb/v1
kind: QuestDB
...
spec:
  ilpSecretName: my-private-key
  ilpSecretNamespace: default

After applying the above yaml to the cluster, the operator will kick off a reconciliation loop of the newly created (or updated) QuestDB CR. Inside this loop, the operator will query the k8s API for the Secret default/my-private-key, obtain the "kid", "x", and "y" values from the Secret's data, modify the ConfigMap that is holding the QuestDB configuration, and continue the process as described above.
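
For reference, the lookup inside the reconcile loop would then be a straightforward Get using those two spec fields. A minimal sketch, assuming q is the QuestDB CR being reconciled and r embeds the controller-runtime client that kubebuilder scaffolds:

// Fetch the Secret referenced by the QuestDB spec
secret := &corev1.Secret{}
err := r.Get(ctx, types.NamespacedName{
    Name:      q.Spec.IlpSecretName,
    Namespace: q.Spec.IlpSecretNamespace,
}, secret)
if err != nil {
    // Missing or unreadable Secret: return the error and let the controller retry
    return ctrl.Result{}, err
}

// Public JWK fields needed to build the auth.conf entry
kid := string(secret.Data["kid"])
x := string(secret.Data["x"])
y := string(secret.Data["y"])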

Even though this technically works, the approach is fairly naive and can lead to some issues down the line. For example, if you want to rotate your JWK, how will the operator know to update the public key in the QuestDB auth ConfigMap? Or, what will happen if the secret does not even exist? Let's use some kubebuilder primitives to help answer these questions and improve the solution.

Kubebuilder Watches

Kubebuilder has built-in support for watching resources that are managed by the operator itself, as well as resources managed externally by other components. A watch registers the controller with the k8s API server so that the controller is notified when a watched resource changes. This allows the operator to kick off a reconciliation loop against the changed object and ensure that the actual resource state matches the desired spec (through the operator's custom logic).

Using kubebuilder, resource watches can be configured in a function:

func (r *QuestDBReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&questdbv1.QuestDB{}).
        Owns(&corev1.ConfigMap{}).
        Watches(
            &source.Kind{Type: &corev1.Secret{}},
            handler.EnqueueRequestsFromMapFunc(r.secretToQuestDB),
            builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
        ).
        Complete(r)
}

In this function, we register our reconciler with a controller manager and set up 3 different types of watches:

  • For(&questdbv1.QuestDB{}) instructs the manager that the controller's primary managed resource is a questdbv1.QuestDB. This watch registers the controller with the k8s API so it will be notified about any changes to a QuestDB CR. When a change is detected, the manager kicks off a reconcile of that object, calling the QuestDBReconciler.Reconcile() function to move the resource toward its desired state. Only one For clause can be used when registering a new controller, which goes hand-in-hand with the recommendation that a controller should be responsible for a single CR.

  • Owns(&corev1.ConfigMap{}) will kick off a reconcile of a QuestDB CR whenever a ConfigMap owned by that QuestDB changes. To establish ownership, you can use the controllerutil.SetControllerReference function to create a parent-child relationship between the QuestDB parent and the ConfigMap child (there's a short sketch of this right after the list). Once that relationship exists, changes to the ConfigMap will trigger a reconcile of the parent QuestDB in the controller.

  • Based on the function signature alone, the Watches block is clearly different from the previous two. In this case, we are listening for changes to any corev1.Secret in the entire cluster, regardless of ownership. The watch is also set up with a predicate (predicate.ResourceVersionChangedPredicate) to filter out some events; it only matches events where a Secret's resourceVersion has changed, i.e. the Secret was actually updated. So when a corev1.Secret changes anywhere in the cluster, the manager runs the secretToQuestDB function to map that Secret to zero or more QuestDB NamespacedName references, based on the Secret's characteristics.
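
Circling back to the Owns watch for a moment: establishing that parent-child relationship is only a couple of lines. A minimal sketch, where q is the QuestDB CR and cm is the ConfigMap being managed:

// Mark the QuestDB CR as the controlling owner of the ConfigMap so that
// ConfigMap changes trigger reconciles of the QuestDB, and the ConfigMap
// is garbage collected when the QuestDB is deleted
if err := controllerutil.SetControllerReference(q, cm, r.Scheme); err != nil {
    return ctrl.Result{}, err
}
if err := r.Update(ctx, cm); err != nil {
    return ctrl.Result{}, err
}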

Below, we will use this mapper function to update a QuestDB's config if a JWK value has changed. To do this, we need to map from a Secret to any QuestDBs that are using that Secret's value for ILP authentication.

Let's take a deeper look at this mapper function to see how to accomplish this.

EnqueueRequestsFromMapFunc

The sigs.k8s.io/controller-runtime package defines a MapFunc that is an input to the Watches function:

type MapFunc func(client.Object) []reconcile.Request

This function accepts a generic API object and returns a list of reconcile requests, which are simple wrappers on top of namespaced names (usually seen in the form "namespace/name"):

type Request struct {
  // NamespacedName is the name and namespace of the object to reconcile.
  types.NamespacedName
}

So how can we turn a client.Object (in our case, an abstraction over the Secret that changed) into the name and namespace of a QuestDB object that we want to reconcile?

There are many possible answers to this question!

One idea is to create a naming convention that encodes the name and namespace of the target QuestDB into the Secret's name, so we could use client.Object.GetName() and client.Object.GetNamespace() to build a NamespacedName to reconcile. Perhaps something like questdb-${DB_NAME}-ilp. But this would constrain what we can name our Secrets, which might not play nicely with something like an external secrets controller syncing the Secret from an external source like Vault, or with a developer who simply forgets the naming convention and has to debug why their QuestDB's ILP auth isn't working.

Maybe we could reuse the IlpSecretName and IlpSecretNamespace spec fields from the previous section? We could query for a QuestDB with Spec.IlpSecretName == client.Object.GetName() (and likewise for the namespace) inside our mapper function. But this doesn't work, for a couple of reasons.

The first is that you can't use field selectors on custom resource spec fields, so this query simply isn't possible in the current version of k8s!

Secondly, let's say you try to bypass this restriction by storing the secret name on the QuestDB object in something that can be queried against, like resource labels. Since the function only accepts a client.Object and does not return an error along with its []reconcile.Request, there's no clean place to instantiate a new client inside a MapFunc. To do that, you would need a cancelable context and a standardized way to handle API errors. You could create all of this inside a MapFunc, but you wouldn't be able to use the rest of kubebuilder's built-in error handling, nor the context that is attached to every other API request in the system. Based on the signature of MapFunc alone, it's clear that the designers don't want you making API queries inside of it!

Then how can we only use the data found in the client.Object to create a list of QuestDBs to reconcile?

Annotations to the rescue!

To solve this issue, I decided to create a new annotation: "crd.questdb.io/name". This annotation is attached to a Secret and points to the name of the QuestDB CR that will use the Secret's data to construct an ILP auth config file. For simplicity, I will assume that the Secret is only used by a single QuestDB, and that both the Secret and the QuestDB reside in the same namespace.
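
On the Secret side, this is just one extra line of metadata. An annotated Secret targeting a QuestDB named my-questdb might look something like this (the names are illustrative, and the JWK fields are stored as individual keys to match how the reconcile code below reads them):

apiVersion: v1
kind: Secret
metadata:
  name: my-private-key
  namespace: default
  annotations:
    crd.questdb.io/name: my-questdb
type: Opaque
stringData:
  kid: testUser1
  x: fLKYEaoEb9lrn3nkwLDA-M_xnuFOdSt9y0Z7_vWSHLU
  y: Dt5tbS1dEDMSYfym3fgMv0B99szno-dFc1rYF9t0aac
  d: 5UjEMuA0Pj5pjK8a-fa24dyIf-Es5mYny3oE_Wmus48  # private key; only kid/x/y are read by the operator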

This allows us to create a very simple mapper function that looks something like this:

func CheckSecretForQdbs(obj client.Object) []reconcile.Request {

  var (
    requests = []reconcile.Request{}
  )

  // Exit if the object is not a Secret
  if _, ok := obj.(*v1core.Secret); !ok {
    return requests
  }

  // Extract the target QuestDB from the annotation
  qdbName, ok := obj.GetAnnotations()["crd.questdb.io/name"]
  if !ok {
    return requests
  }

  requests = append(requests, reconcile.Request{
    NamespacedName: client.ObjectKey{
      Name:      qdbName,
      // The Secret and QuestDB must reside in
      // the same namespace for this to work
      Namespace: obj.GetNamespace(),
    },
  })

  return requests

}

Reconciliation logic

But we're not done yet! The controller still needs to find this Secret and use its data to construct the auth config.

Inside our QuestDB reconciliation loop, we can query for all Secrets in a QuestDB's namespace and iterate over them until we find the one we're looking for, based on our new annotation. Here's a small code sample of that, without any additional error-checking.

func (r *QuestDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
  q := &questdbv1.QuestDB{}

  // Assumes that the QuestDB exists (for simplicity)
  err := r.Get(ctx, req.NamespacedName, q)
  if err != nil {
    return ctrl.Result{}, err
  }

  allSecrets := &v1core.SecretList{}
  authSecret := v1core.Secret{}

  // Get a list of all secrets in the namespace
  if err := r.List(ctx, allSecrets, client.InNamespace(q.Namespace)); err != nil {
    return ctrl.Result{}, err
  }

  // Iterate over them to find the secret with the desired annotation
  for _, secret := range allSecrets.Items {
    if secret.Annotations["crd.questdb.io/name"] == q.Name {
      authSecret = secret
    }
  }

  if authSecret.Name == "" {
    return ctrl.Result{}, errors.New("auth secret not found")
  }

  var (
    x   = string(authSecret.Data["x"])
    y   = string(authSecret.Data["y"])
    kid = string(authSecret.Data["kid"])
  )

  // Construct the ILP auth string to add to the QuestDB config
  auth := constructIlpAuthConfig(x, y, kid)

  // Add this auth string to a ConfigMap value and update...
}

As you can see, the new annotation allows us to fully decouple the Secret from the QuestDB operator, since there are no domain-specific naming requirements for the Secret. You don't even need to change the QuestDB CR spec to update the config. All you need to do is add the annotation to any Secret in the QuestDB's namespace, set its value to the name of the QuestDB resource, and the operator will automatically be notified of the change and update your QuestDB's config to use the Secret's public key data.

Note that this is a golden-path solution; we still need to handle cases where more than one Secret carries the annotation, or where a matching Secret does not contain the keys needed to generate the JWK public values.
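
As a rough sketch of handling the first case, the selection loop above could collect every match and fail loudly when the result is ambiguous:

// Collect all Secrets in the namespace annotated for this QuestDB
var matches []v1core.Secret
for _, secret := range allSecrets.Items {
    if secret.Annotations["crd.questdb.io/name"] == q.Name {
        matches = append(matches, secret)
    }
}

switch len(matches) {
case 0:
    return ctrl.Result{}, fmt.Errorf("no ILP auth secret found for QuestDB %q", q.Name)
case 1:
    authSecret = matches[0]
default:
    return ctrl.Result{}, fmt.Errorf("found %d ILP auth secrets for QuestDB %q, expected exactly one", len(matches), q.Name)
}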

No limits

The beauty of annotations is that you can store anything in them, and with a custom operator, use that data to perform any cluster automation that you can dream of! K8s doesn't even prescribe the format of an annotation's value, as long as it can be represented in a YAML string. This means you can use simple strings, JSON, or even base64-encoded binary blobs as annotation values for an operator to use! Still, since k8s is a young-ish and constantly evolving system, I would probably stick with simple annotation values to abide by KISS as much as possible.
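
For instance, an annotation value could be anything from a plain string to a serialized JSON document (the keys below are made up purely for illustration):

metadata:
  annotations:
    # A simple string value
    example.com/owner: data-platform-team
    # A JSON document serialized into a single annotation value
    example.com/backup-policy: '{"schedule": "0 2 * * *", "retentionDays": 14}'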

After using annotations in my operator code, I've started to gain more of an appreciation for why the k8s annotation docs are so vague; because they can be used for any custom action, it's not really possible to define all of their capabilities. It's up to the operator developer to use annotations in his or her own way.

I hope this example has sparked some of your own ideas about how to use annotations in your own operators. Let me know if it has!

Running Databases on Kubernetes

A few weeks ago, Kelsey Hightower wrote a tweet and held a live discussion on Twitter about whether it's a good idea or not to run a database on Kubernetes. This happened to be incredibly timely for me, since we at QuestDB are about to launch our own cloud database service (built on top of k8s)!

You can run databases on Kubernetes because it's fundamentally the same as running a database on a VM. The biggest challenge is understanding that rubbing Kubernetes on Postgres won't turn it into Cloud SQL. 🧵

Kelsey Hightower on Twitter: https://t.co/zdFobm4ijy

K8s Primitives

When working with databases, the obvious concern is data persistence. Earlier in its history, k8s really shined in the area of orchestrating stateless workloads, but support for stateful workflows was limited. Eventually, primitives like StatefulSets, PersistentVolumes (PVs), and PersistentVolumeClaims (PVCs) were developed to help orchestrate stateful workloads on top of the existing platform.

PersistentVolumes are abstractions that allow for the management of raw storage, ranging from local disk to NFS, cloud-specific block storage, and more. These work in concert with PersistentVolumeClaims, which represent requests for a pod to access the storage managed by a PV. A user can bind a PVC to a PV to make an ownership claim on the raw disk resources encompassed by the PV. Then, you can add that PVC to any pod spec as a volume, effectively allowing you to mount any kind of persistent storage medium to a particular workload. The separation of PV and PVC also lets you fully control the lifecycle of your underlying block storage, including mounting it to different workloads or freeing it altogether once the claim is released.

StatefulSets manage the lifecycles of pods that require more stability than what exists in other primitives like Deployments and ReplicaSets. By creating a StatefulSet, you can guarantee that when you remove a pod, the storage managed by its mounted PVCs does not get deleted along with it. You can imagine how useful this property is if you're hosting a database! StatefulSets also allow for ordered deployment, scaling, and rolling updates, all of which create more predictability (and thus stability) in our workloads. This is also something that seems to go hand-in-hand with what you want out of your database's infrastructure.
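
To make that concrete, here's a pared-down StatefulSet that provisions per-pod persistent storage through a volumeClaimTemplate; the image, mount path, and storage size are placeholders:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: questdb
spec:
  serviceName: questdb
  replicas: 1
  selector:
    matchLabels:
      app: questdb
  template:
    metadata:
      labels:
        app: questdb
    spec:
      containers:
        - name: questdb
          image: questdb/questdb:latest
          volumeMounts:
            - name: data
              mountPath: /var/lib/questdb
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi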

What else?

While StatefulSets, PVs, and PVCs do quite a bit of work for us, there are still many administration and configuration actions that you need to perform on a production-grade database. For example, how do you orchestrate backups and restores? These can get quite complex when dealing with high-traffic databases that include functionality such as WALs. What about clustering and high availability? Or version upgrades? Are these operations zero-downtime? Every database deals with these features in different ways, many of which require precise coordination between components to succeed. Kubernetes alone can't handle this; for example, a StatefulSet can't automatically set up your average RDBMS in read-replica mode without some additional orchestration.

Not only do you have to implement many of these features yourself, but you also need to deal with the ephemeral nature of Kubernetes workloads. To ensure peak performance, you have to guarantee that the k8s scheduler places your pods on nodes that are already pre-tuned to run your database, with enough free resources to properly run it. If you're dealing with clustering, how are you handling networking to ensure that database nodes are able to connect to each other (ideally in the same cloud region)? This brings me to my next point...

Pets, not cattle

Once you start accounting for things like node performance-tuning and networking, along with the requirement to store data persistently in-cluster, your infrastructure starts to grow into a set of carefully groomed pet servers instead of nameless herds of cattle. But one of the main benefits of running your application in k8s is precisely the ability to treat your infrastructure like cattle instead of pets! All of the most common abstractions like Deployments, Ingresses, and Services, along with features like vertical and horizontal autoscaling, are made possible because you run your workloads on a high-level set of infrastructure components, so you don't have to worry about the physical infrastructure layer. These abstractions allow you to focus more on what you're trying to achieve with your infrastructure instead of how you're going to achieve it.

Then why even bother with k8s?

Despite these rough edges, there are plenty of reasons to want to run your database on k8s. There's no denying that k8s' popularity has increased tremendously over the past few years across both startups and enterprises. The k8s ecosystem is under constant development so that its feature set continues to expand and improve regularly. And its operator model allows end users to programmatically manage their workloads by writing code against the core k8s APIs to automatically perform tasks that would previously have to be done manually. K8s allows for easy GitOps-style management so you can leverage battle-tested software development practices when managing infrastructure in a reproducible and safe way. While vendor lock-in still exists in the world of k8s, its effect can be minimized to make it easier for you to go multi-cloud (or even swap one for another).

So what can we do if we want to take advantage of all the benefits that k8s has to offer while using it to host our database?

What do you need to build an RDS on k8s?

Towards the end of the live chat, someone asked Kelsey, "what do you actually need to build an RDS on k8s?" He jokingly answered with expertise, funding, and customers. While we're certainly on the right track with these at QuestDB, I think a better way to phrase it is that you need to implement Day 2 Operations to get to what a typical managed database service provides.

Day 2 Operations

Day 2 Operations encompass many of the items that I've been discussing: backups, restores, stop/start, replication, high availability, and clustering. These are the features that differentiate a managed database service from a simple database hosted on k8s primitives, which is what I would call a Day 1 Operation. While k8s and its ecosystem can make it very easy to install a database in your cluster, you're eventually going to need to start thinking about Day 2 Operations once you get past the prototype phase.

Here, I'll jump into more detail about what makes these operations so difficult to implement and why special care must be taken when implementing them, either by a database admin or a managed database service provider.

Stop/Start

Stopping and starting databases is a common operation in today's DevOps practices, and a must-have for any fully-featured managed database service. It's easy to find at least one reason for wanting to stop and start a database. For example, you may want a database for integration tests that run on a pre-defined schedule. Or maybe you have a shared instance that's used by a development team for live QA before merging a commit. You could always create and delete database instances on demand, but it is sometimes easier to have a reference to a static database connection string and URL in your test harness or orchestration code.

While stop/start can be automated in k8s (perhaps by simply setting a StatefulSet's replica count to 0), there are still other aspects that need to be considered. If you're shutting down a database to save some money, will you also be spinning down any infrastructure? If so, how can you ensure that this infrastructure will be available when you start the database back up? K8s provides primitives like node affinity and taints to help solve this problem, but everyone's infrastructure provisioning situation and budget are different, and there's no one-size-fits-all approach.
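
In the simplest case, "stopping" a database running in a StatefulSet is just scaling it to zero while its PVCs (and data) stay behind, and "starting" it is scaling back up:

# Stop: scale the StatefulSet down; the PVCs and their data remain
kubectl scale statefulset questdb --replicas=0

# Start: scale back up; the pod re-attaches the existing volume
kubectl scale statefulset questdb --replicas=1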

Backup & Restore

One interesting point that Kelsey made in his chat was that having the ability to start an instance from scratch (moving from a stopped to a running state) is not trivial. Many challenges need to be solved, including finding the appropriate infrastructure to run the database, setting up network connectivity, mounting the correct volume, and ensuring data integrity once the volume has been mounted. In fact, this is such an in-depth topic that Kelsey compares going from 0 -> 1 running instances to an actual backup-and-restore test. If you can indeed spin up an instance from scratch while loading pre-existing data, you have successfully completed a live restore test!

Even if you have restores figured out, backups have their own complexities. K8s provides some useful building blocks like Jobs and CronJobs, which you can use to take a one-off backup or create a backup schedule, respectively. But you need to ensure that these jobs are configured correctly in order to access raw database storage. Or, if your database allows you to perform a backup using a CLI, then these jobs also need secure access to credentials to even connect to the database in the first place. From an end-user standpoint, you need an easy way to manage existing backups, which includes maintaining an index of them and applying data-retention and RBAC policies. Again, while k8s can help us build out these backup-and-restore components, a lot of these features have to be built on top of the infrastructure primitives that k8s provides.
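
As a sketch of the scheduled case, a nightly backup might be expressed as a CronJob like the one below; the backup image, arguments, and credentials here are entirely hypothetical and would depend on your database and backup tooling:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: questdb-backup
spec:
  schedule: "0 2 * * *"   # nightly at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: example.com/db-backup:latest   # hypothetical backup tooling
              args: ["--target", "s3://my-backups/questdb"]
              envFrom:
                - secretRef:
                    name: backup-credentials        # credentials pulled from a Secret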

Replication, HA, and Clustering

These days, you can get very far by simply vertically scaling your database. The performance of modern databases can be sufficient for almost anyone's use case if you throw enough resources at the problem. But once you've reached a certain scale, or require features like high availability, there is a reason to enable some of the more advanced database management features like clustering and replication.

Once you start down this path, the amount of infrastructure orchestration complexity can increase exponentially. You need to start thinking more about networking and physical node placement to achieve your desired goal. If you don't have a centralized monitoring, logging, and telemetry solution, you're now going to need one if you want to easily diagnose issues and get the best performance out of your infrastructure. Based on its architecture and feature set, every database can have different options for enabling clustering, many of which require intimate knowledge of the inner workings of the database to choose the correct settings.

Vanilla k8s knows nothing of these complexities. Instead, these all need to be orchestrated by an administrator or operator (human or automated). If you're working with production data, changes may need to happen with close-to-zero downtime. This is where managed database services shine. They can make some of these features as easy to configure as a single web form with a checkbox or two and some input fields. Unless you're willing to invest the time into developing these solutions yourself, or leverage existing open-source solutions if they exist, sometimes it's worth giving up some level of control for automated expert assistance when configuring a database cluster.

Orchestration

For your Day 2 Operations to work as they would in a managed database service such as RDS, they need to not just work, but also be automated. Luckily for us, there are several ways to build automation around your database on k8s.

Helm & Yaml tools won't get us there

Since k8s configuration is declarative, it can be very easy to get from 0 -> 1 with traditional yaml-based tooling like Helm or cdk8s. Many industry-leading k8s tools install into a cluster with a simple helm install or kubectl apply command.

These are sufficient for Day 1 Operations and non-scalable deployments. But as soon as you start to move into more vendor-specific Day 2 Operations that require more coordination across system components, the usefulness of traditional yaml-based tools starts to degrade quickly, since some imperative programming logic is required.

Provisioners

One pattern that you can use to automate database management is a provisioner process. We've even used this approach to build v1 of our managed cloud solution. When a user wants to make a change to an existing database's state, our backend sends a message to a queue that is eventually picked up by a provisioner. The provisioner reads the message, uses its contents to determine which actions to perform on the cluster, and performs them sequentially. Where appropriate, each action contains a rollback step in case of a kubectl apply error to leave the infrastructure in a predictable state. Progress is reported back to the application on a separate gossip queue, providing almost-immediate feedback to the user on the progress of each state change.

While this has grown to be a powerful tool for us, there is another way to interact with the k8s API that we are now starting to leverage...

Operators

K8s has an extensible Operator pattern that you can use to manage your own Custom Resources (CRs) by writing and deploying a controller that reconciles your current cluster state into its desired state, as specified by CR yaml spec files that are applied to the cluster. This is also how the functionality of the basic k8s building blocks is implemented, which further emphasizes how powerful this model can be.

Operators have the ability to hook into the k8s API server and listen for changes to resources inside a cluster. These changes get processed by a controller, which then kicks off a reconciliation loop where you can add your custom logic to perform any number of actions, ranging from simple resource existence checks to complex Day 2 Operations. This is an ideal solution to our management problem; we can offload much of our imperative code into a native k8s object, and database-specific operations appear as seamless as the standard set of k8s building blocks. Many existing database products use operators to accomplish this, and more are currently in development (see the Data on Kubernetes community for more information on these efforts).

As you can imagine, coordinating activities like backups, restores, and clustering inside a mostly stateless and idempotent reconciliation loop isn't easy. Even if you follow best practices by writing a variety of simple controllers, with each managing its own clearly-defined CR, the reconciliation logic can still be very error-prone and time-consuming to write. While frameworks like the Operator SDK exist to help you scaffold your operator, and projects like Kubebuilder provide a set of incredibly useful controller-building libraries, it's still a lot of work to undertake.

K8s is just a tool

At the end of the day, k8s is a single tool in the DevOps engineer's toolkit. These days, it's possible to host workloads in a variety of ways; using managed services (PaaS), k8s, VMs, or even running on a bare metal server. The tool that you choose depends on a variety of factors including time, experience, performance requirements, ease of use, and cost.

While hosting a database on k8s might be a fit for your organization, it just as easily could create even more overhead and instability if not done carefully. Implementing the Day 2 features that I described above is time-consuming and costly to get right. Testing is incredibly important, since you want to be absolutely sure that your (and your customers') precious data is kept safe and accessible when it's needed.

If you just need a reliable database to run your application on top of, then all of the work required to run a database on k8s might be too much for you to undertake. But if your database has strong k8s support (most likely via an operator), or you are doing something unique (and at-scale) with your storage layer, it might be worth looking more into managing your stateful databases on k8s. Just be prepared for a large time investment, and ensure that you have the requisite in-house knowledge (or support) so that you can be confident that you're performing your database automation activities correctly and safely.

We've spent the past year building our own managed database service on top of k8s. If you want to check out what we've built, you can visit the QuestDB Cloud page and see it for yourself!


Originally posted on the QuestDB Blog.

Kubebuilder and Operator-SDK Tips and Tricks

Recently, I've been spending a lot of time writing a Kubernetes operator using the go operator-sdk, which is built on top of the Kubebuilder framework. This is a list of a few tips and tricks that I've compiled over the past few months working with these frameworks.

Log Formatting

Kubebuilder, like much of the k8s ecosystem, utilizes zap for logging. Out of the box, the Kubebuilder zap configuration outputs a timestamp for each log, which gets formatted using scientific notation. This makes it difficult for me to read the time of an event just by glancing at it. Personally, I prefer ISO 8601, so let's change it!

In your scaffolding's main.go, you can configure your current logger format by modifying the zap.Options struct and calling ctrl.SetLogger.

opts := zap.Options{
    Development: true,
    TimeEncoder: zapcore.ISO8601TimeEncoder,
}

ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

In this case, I added the zapcore.ISO8601TimeEncoder, which encodes timestamps to human-readable ISO 8601-formatted strings. It took some digging, along with a bit of help from the Kubernetes Slack org, to figure this one out. But it's been a huge quality-of-life improvement when debugging complex reconcile loops, especially in a multithreaded environment.

MaxConcurrentReconciles

Speaking of multithreaded environments: by default, an operator will only run a single reconcile loop per controller. However, in practice, especially when running a globally-scoped controller, it's useful to run multiple concurrent reconcile loops to handle many resource changes at once. Luckily, the Operator SDK makes this incredibly easy with the MaxConcurrentReconciles option. We can set this up in a new controller's SetupWithManager func:

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
        ...
        Complete(r)
}

I've created a command line arg in my main.go file that allows the user to set this value to any integer, since this will likely be tweaked over time depending on how the controller performs in a production cluster.
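
A minimal sketch of what that flag might look like in main.go, alongside the other scaffolded flags (the flag name is my own choice):

var maxConcurrentReconciles int
flag.IntVar(&maxConcurrentReconciles, "max-concurrent-reconciles", 1,
    "Maximum number of concurrent reconcile loops per controller")
flag.Parse()

// ...later, when registering the controller:
// WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles}).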

Parent-Child Relationships

One of the basic functions of a controller is to act as a parent to the Kubernetes resources that it manages. Establishing this ownership means that when the parent custom resource is deleted, all of its child objects are automatically garbage collected by the Kubernetes runtime.

I like this small helper, which can be called on any client.Object to set the custom resource managed by your controller as that object's parent.

func (r *CustomReconciler) ownObject(ctx context.Context, cr *myapiv1alpha1.CustomResource, obj client.Object) error {

	err := ctrl.SetControllerReference(cr, obj, r.Scheme)
	if err != nil {
		return err
	}
	return r.Update(ctx, obj)
}

You can then add Owns watches for these resources in your SetupWithManager func. These will instruct your controller to listen for changes in child resources of the specified types, triggering a reconcile loop on each change.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        Owns(&v1apps.Deployment{}).
        Owns(&v1core.ConfigMap{}).
        Owns(&v1core.Service{}).
        Complete(r)
}

Watches

Your controller can also watch resources that it doesn't own. This is useful for when you need to watch for changes in globally-scoped resources like PersistentVolumes or Nodes. Here's an example of how you would register this watch in your SetupWithManager func.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Watches(
            &source.Kind{Type: &v1core.Node{}},
            handler.EnqueueRequestsFromMapFunc(myNodeFilterFunc),
            builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
        ).
        Complete(r)
}

In this case, you need to implement myNodeFilterFunc to accept an obj client.Object and return a []reconcile.Request. Using the ResourceVersionChangedPredicate triggers the filter function for every change to that resource type, so it's important to make the function as efficient as possible, since it could be called quite a lot, especially if your controller is globally-scoped.
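
For illustration, myNodeFilterFunc could look something like the sketch below, mapping a changed Node to the custom resources that care about it; the label key and namespace are placeholders:

func myNodeFilterFunc(obj client.Object) []reconcile.Request {
    node, ok := obj.(*v1core.Node)
    if !ok {
        // Not a Node; nothing to reconcile
        return nil
    }

    // Placeholder logic: only map Nodes that carry a specific label,
    // and reconcile the custom resource named by that label's value
    crName, ok := node.Labels["example.com/custom-resource"]
    if !ok {
        return nil
    }

    return []reconcile.Request{
        {NamespacedName: types.NamespacedName{Name: crName, Namespace: "default"}},
    }
}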

Field Indexers

One gotcha that I encountered happened when trying to query for a list of Pods that are running on a particular Node. This query uses a FieldSelector filter, as seen here:

// Get a list of all pods on the node
err := c.List(ctx, &pods, &client.ListOptions{
    Namespace:     "",
    FieldSelector: fields.ParseSelectorOrDie(fmt.Sprintf("spec.nodeName=%s", node.Name)),
})

This codepath led to the following error: Index with name field:spec.nodeName does not exist. After some googling around, I found this GitHub issue that referenced a Kubebuilder docs page which contained the answer.

Controllers created using operator-sdk and Kubebuilder use a built-in caching mechanism to store the results of API requests. This prevents spamming the K8s API and improves reconciliation performance.

When performing resource lookups using FieldSelectors, you first need to add your desired search field to an index that the cache can use for lookups. Here's an example that will build this index for a Pod's nodeName:

if err := mgr.GetFieldIndexer().IndexField(context.TODO(), &v1core.Pod{}, "spec.nodeName", func(rawObj client.Object) []string {
    pod := rawObj.(*v1core.Pod)
    return []string{pod.Spec.NodeName}
}); err != nil {
    return err
}

Now, we can run the List call from above, FieldSelector included, with no issues.

Retries on Conflicts

If you've ever written controllers, you're probably very familiar with the error "Operation cannot be fulfilled on ...: the object has been modified; please apply your changes to the latest version and try again".

This occurs when the version of the resource that you're currently reconciling is out-of-date with the latest version in the cluster. If you're retrying your reconciliation loop on errors, your controller will eventually reconcile the resource, but the noise can really pollute your logs and make it difficult to spot more important errors.

After reading through the k8s source, I found the solution to this: RetryOnConflict. It's a utility function in client-go (k8s.io/client-go/util/retry) that runs your function and automatically retries on conflict errors, up to a certain number of attempts.

Now, you can just wrap your update logic inside the function that you pass to RetryOnConflict and never worry about this issue again! An added benefit is that inside that function you get to return err instead of return ctrl.Result{}, err, which makes the code that much easier to read.
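
Here's roughly what that wrapping looks like, using RetryOnConflict from k8s.io/client-go/util/retry; the status field being updated is just an example:

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Re-fetch the latest version of the object on every attempt
    cr := &myapiv1alpha1.CustomResource{}
    if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
        return err
    }

    // Apply the change and push the update; a conflict error triggers a retry
    cr.Status.NodeReady = true
    return r.Status().Update(ctx, cr)
})
if err != nil {
    return ctrl.Result{}, err
}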

Useful Kubebuilder Markers

Here are some useful code markers that I've found while developing my operator.

  1. To add custom columns to your custom resource's description (when running kubectl get), you can add markers like these above your API type:
//+kubebuilder:printcolumn:name="NodeReady",type="boolean",JSONPath=".status.nodeReady"
//+kubebuilder:printcolumn:name="NodeIp",type="string",JSONPath=".status.nodeIp"
  2. To add a shortname to your custom resource (like pvc for PersistentVolumeClaim, for example), you can add this marker:
//+kubebuilder:resource:shortName=mycr;mycrs

More docs on kubebuilder markers can be found here:

https://book.kubebuilder.io/reference/markers/crd.html