How do Kubernetes Operators Handle Concurrency?

By default, operators built using Kubebuilder and controller-runtime process a single reconcile request at a time. This is a sensible default, since it's easier for operator developers to reason about and debug the logic in their applications. It also limits the load that a controller places on core Kubernetes components like etcd and the API server.

But what if your work queue starts backing up and average reconciliation times increase due to requests that are left sitting in the queue, waiting to be processed? Luckily for us, a controller-runtime Controller struct includes a MaxConcurrentReconciles field (as I previously mentioned in my Kubebuilder Tips article). This option allows you to set the number of concurrent reconcile loops that are running in a single controller. So with a value above 1, you can reconcile multiple Kubernetes resources simultaneously.
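For reference, here's a minimal sketch of how that option can be wired up with controller-runtime's builder. The FooReconciler and examplev1.Foo types are placeholders for your own reconciler and custom resource:

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// FooReconciler and examplev1.Foo are placeholders for your own reconciler
// and custom resource type.
func (r *FooReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&examplev1.Foo{}).
		// Run up to 4 reconcile loops concurrently for this controller
		WithOptions(controller.Options{MaxConcurrentReconciles: 4}).
		Complete(r)
}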

Early in my operator journey, one question I had was: how can we guarantee that the same resource isn't being reconciled by 2 or more goroutines at the same time? With MaxConcurrentReconciles set above 1, this could lead to all sorts of race conditions and undesirable behavior, since the state of an object inside a reconciliation loop could change via a side effect from an external source (a reconciliation loop running in a different goroutine).

I thought about this for a while, and even implemented a sync.Map-based approach that would allow a goroutine to acquire a lock for a given resource (based on its namespace/name).

It turns out that all of this effort was for naught, since I recently learned (in a k8s slack channel) that the controller workqueue already includes this feature, albeit with a simpler implementation.

This is a quick story about how a k8s controller's workqueue guarantees that unique resources are reconciled sequentially. So even if MaxConcurrentReconciles is set above 1, you can be confident that only a single reconciliation function is acting on any given resource at a time.

client-go/util

Controller-runtime uses the client-go/util/workqueue library to implement its underlying reconciliation queue. In the package's doc.go file, a comment states that the workqueue supports these properties:

  • Fair: items processed in the order in which they are added.
  • Stingy: a single item will not be processed multiple times concurrently, and if an item is added multiple times before it can be processed, it will only be processed once.
  • Multiple consumers and producers. In particular, it is allowed for an item to be reenqueued while it is being processed.
  • Shutdown notifications.

Wait a second... My answer is right here in the second bullet, the "Stingy" property! According to these docs, the queue will automatically handle this concurrency issue for me, without having to write a single line of code. Let's run through the implementation.

How does the workqueue work?

The workqueue struct has 3 main methods, Add, Get, and Done. Inside a controller, an informer would Add reconcile requests (namespaced-names of generic k8s resources) to the workqueue. A reconcile loop running in a separate goroutine would then Get the next request from the queue (blocking if it is empty). The loop would perform whatever custom logic is written in the controller, and then the controller would call Done on the queue, passing in the reconcile request as an argument. This would start the process over again, and the reconcile loop would call Get to retrieve the next work item.
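To make that flow concrete, here's a small, self-contained sketch of the same Add / Get / Done pattern, using the workqueue package directly instead of a full controller:

package main

import (
    "fmt"

    "k8s.io/client-go/util/workqueue"
)

func main() {
    wq := workqueue.New()

    // Producer side: an informer event handler would Add the namespaced
    // name of the object that changed
    wq.Add("default/my-resource")

    // No more items will be added in this example; Get will drain what's
    // left in the queue and then report that the queue is shutting down
    wq.ShutDown()

    // Consumer side: each reconcile worker runs a loop like this one
    for {
        item, shutdown := wq.Get() // blocks until an item is available
        if shutdown {
            return
        }

        // The controller's reconciliation logic would run here
        fmt.Println("reconciling", item)

        // Mark the item as processed so that a queued duplicate (if any)
        // becomes available to the next Get call
        wq.Done(item)
    }
}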

This is similar to processing messages in RabbitMQ, where a worker pops an item off the queue, processes it, and then sends an "Ack" back to the message broker indicating that processing has completed and it's safe to remove the item from the queue.

Still, I have an operator running in production that powers QuestDB Cloud's infrastructure, and I wanted to be sure that the workqueue works as advertised. So I wrote a quick test to validate its behavior.

A little test

Here is a simple test that validates the "Stingy" property:

package main_test

import (
    "testing"

    "github.com/stretchr/testify/assert"

    "k8s.io/client-go/util/workqueue"
)

func TestWorkqueueStingyProperty(t *testing.T) {

    type Request int

    // Create a new workqueue and add a request
    wq := workqueue.New()
    wq.Add(Request(1))
    assert.Equal(t, wq.Len(), 1)

    // Subsequent adds of an identical object
    // should still result in a single queued one
    wq.Add(Request(1))
    wq.Add(Request(1))
    assert.Equal(t, wq.Len(), 1)

    // Getting the object should remove it from the queue
    // At this point, the controller is processing the request
    obj, _ := wq.Get()
    req := obj.(Request)
    assert.Equal(t, wq.Len(), 0)

    // But re-adding an identical request before it is marked as "Done"
    // should be a no-op, since we don't want to process it simultaneously
    // with the first one
    wq.Add(Request(1))
    assert.Equal(t, wq.Len(), 0)

    // Once the original request is marked as Done, the second
    // instance of the object will be now available for processing
    wq.Done(req)
    assert.Equal(t, wq.Len(), 1)

    // And since it is available for processing, it will be
    // returned by a Get call
    wq.Get()
    assert.Equal(t, wq.Len(), 0)
}

Since the workqueue uses a mutex under the hood, this behavior is thread-safe. So even if I wrote more tests that used multiple goroutines to read from and write to the queue at high speed in an attempt to break it, the workqueue's actual behavior would be the same as that of our single-threaded test.
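For the curious, here's a sketch of one such multi-goroutine test (it would live in the same test file as above, with "sync" added to the imports); it hammers the queue with identical requests and checks that the deduplication still holds:

func TestWorkqueueConcurrentAdds(t *testing.T) {

    type Request int

    wq := workqueue.New()

    // 100 goroutines all add an identical request at roughly the same time
    var wg sync.WaitGroup
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            wq.Add(Request(1))
        }()
    }
    wg.Wait()

    // The queue still only holds a single copy of the request
    assert.Equal(t, wq.Len(), 1)
}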

All is not lost

Kubernetes did it

There are a lot of little gems like this hiding in the Kubernetes standard libraries, some of which live in not-so-obvious places (like a controller workqueue living in the client-go package). Despite this discovery, and others like it that I've made in the past, I still don't think my earlier attempts at solving these problems were a waste of time. They force you to think critically about fundamental problems in distributed computing, and they help you understand more of what is going on under the hood. So by the time I discover that "Kubernetes did it", I'm relieved that I can simplify my codebase and perhaps remove some unnecessary unit tests.

An Introduction to Custom Resource Definitions and Custom Resources (Operators 101: Part 2)

Check out Part 1 of this series for an introduction to Kubernetes operators.

Now that we've covered the basics of what a Kubernetes operator is, we can start to dig into the details of each operator component. Remember that a Kubernetes operator consists of 3 parts: a Custom Resource Definition, a Custom Resource, and a Controller. Before we can focus on the Controller, which is where an operator's automation takes place, we first need to define our Custom Resource Definition and use that to create a Custom Resource.

Example: RSS Feed Reader Application

Throughout this series, I will be using a distributed RSS feed reader application (henceforth known as "FeedReader") to demonstrate basic Kubernetes operator concepts. Let's start with a quick overview of the FeedReader design.

FeedReader Architecture

The architecture for our FeedReader application is split into 3 main components:

  1. A StatefulSet running our feed database. This holds both the URLs of feeds that we wish to scrape, as well as their scraped contents
  2. A CronJob to perform the feed scraping job at a regular interval
  3. A Deployment that provides a frontend to query the feed database and display RSS feed contents to the end user

FeedReader Architecture Diagram

The FeedReader application runs as a single golang binary that has 3 modes: web server, feed scraper, and embedded feed database. Each of these modes is designed to run in a Deployment, CronJob, and StatefulSet respectively.

Back to CRDs: Starting At the End

Just to recap, a Custom Resource Definition is the schema of a new Kubernetes resource type that you define. But instead of thinking in terms of an abstract schema, I believe that it's easier to start from your desired result (a concrete Custom Resource) and work backwards from there.

My very basic rule of thumb is that whenever you need a variable to describe an option in your application, you should consider defining a new field and setting a sample value for it in a Custom Resource. Once you have a Custom Resource with fields and values that clearly explain your application's desired behavior, you can then start to generalize these fields into an abstract schema, or CRD.

Let's perform this exercise with the FeedReader.

FeedReader CRD

To keep the article at a readable length, I'm just going to focus on the most critical aspects of the FeedReader. I've extracted two key variables that are required for the application to function at its most basic level:

  1. A list of RSS feed urls to scrape
  2. A time interval that defines how often to scrape the urls in the above list

Thus, I'll add 2 fields to my FeedReader CR:

  1. A list of strings that hold my feed urls
  2. A Golang duration string that specifies the scrape interval

Based on these two requirements, here's an example of a FeedReader Custom Resource that I would like to define:

---
apiVersion: crd.sklar.rocks/v1alpha1
kind: FeedReader
metadata:
  name: feed-reader-app
spec:
  feeds:
    - https://sklar.rocks/atom.xml
  scrapeInterval: 60m

Once created, the above resource should deploy a new instance of the FeedReader app, add one feed (https://sklar.rocks/atom.xml) to the database, and scrape that url every 60 minutes for new posts that are not currently in the feed database.

Note that I could use something like crontab syntax to define the scrape interval, but I find the golang human-readable intervals easier to work with, since I don't always have to cross-check with crontab.guru to make sure that I've defined the correct schedule.
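As a quick illustration of why I prefer this format, here's roughly how a controller could consume the scrapeInterval value with nothing but the standard library:

package main

import (
	"fmt"
	"time"
)

func main() {
	// "60m" is the scrapeInterval value from the Custom Resource above
	interval, err := time.ParseDuration("60m")
	if err != nil {
		// An invalid value can be surfaced to the user as a validation error
		panic(err)
	}

	// The controller can use the parsed value directly, for example to
	// schedule the next scrape
	fmt.Println("scraping every", interval) // scraping every 1h0m0s
	fmt.Println("next scrape at", time.Now().Add(interval))
}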

Exercise

Can you think of any more options that the FeedReader would need? Personally, I would start with things like feed retention, logging, and scraping retry behavior, but the possibilities are endless!

From Instance to Schema

But before we can kubectl apply the above CR yaml, we first need to define and apply its schema, in the form of a Custom Resource Definition (CRD), so that objects of this type can be created in our Kubernetes cluster. Once the CRD is known to our cluster, we can then create and modify new instances of that CRD.

If you've ever worked with OpenAPI, Swagger, XML Schema or any other schema definition language, you probably know that in many cases, the schema of an object can be far more verbose than an instance of the object itself. In Kubernetes, to define a simple Custom Resource Definition with the two fields that I outlined above, the yaml would look something like this:

---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: feedreaders.crd.sklar.rocks
spec:
  group: crd.sklar.rocks
  names:
    kind: FeedReader
    listKind: FeedReaderList
    plural: feedreaders
    singular: feedreader
  scope: Namespaced
  versions:
  - name: v1alpha1
    schema:
      openAPIV3Schema:
        properties:
          apiVersion:
            description: 'APIVersion defines the versioned schema of this representation
              of an object. Servers should convert recognized schemas to the latest
              internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
            type: string
          kind:
            description: 'Kind is a string value representing the REST resource this
              object represents. Servers may infer this from the endpoint the client
              submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
            type: string
          metadata:
            type: object
          spec:
            properties:
              feeds:
                description: List of feed urls to scrape at a regular interval
                items:
                  type: string
                type: array
              scrapeInterval:
                description: 'Interval at which to scrape feeds, in the form of a
                  golang duration string (see https://pkg.go.dev/time#ParseDuration for valid formats)'
                type: string
            type: object
          status:
            properties:
              lastScrape:
                description: Time of the last scrape
                format: date-time
                type: string
            type: object
        type: object
    served: true
    storage: true
    subresources:
      status: {}

Now that's a lot of yaml! This schema includes metadata like the resource's name, API group, and version, as well as the formal definitions of our custom fields and standard Kubernetes fields like "apiVersion" and "kind".

If you're thinking "do I actually have to write something like this for all of my CRDs?", the answer is no! This is where we can leverage the Kubernetes standard library and golang tooling when defining CRDs. For example, I didn't have to write a single line of the yaml in the example above! Here's how I did it.

Controller-gen to the Rescue

The Kubernetes community has developed a tool to generate CRD yaml files directly from Go structs, called controller-gen. This tool parses Go source code for structs and special comments known as "markers" that begin with //+kubebuilder. It then uses this parsed data to generate a variety of operator-related code and yaml manifests. One such manifest that it can create is a CRD.

Using controller-gen, here's the code that I used to generate the above schema. It's certainly much less verbose than the resulting yaml file!

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type FeedReaderSpec struct {
	// List of feed urls to scrape at a regular interval
	Feeds []string `json:"feeds,omitempty"`
	// Interval at which to scrape feeds, in the form of a golang duration string
	// (see https://pkg.go.dev/time#ParseDuration for valid formats)
	ScrapeInterval string `json:"scrapeInterval,omitempty"`
}

type FeedReaderStatus struct {
	// Time of the last scrape
	LastScrape metav1.Time `json:"lastScrape,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

type FeedReader struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   FeedReaderSpec   `json:"spec,omitempty"`
	Status FeedReaderStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

type FeedReaderList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []FeedReader     `json:"items"`
}

Here, we've defined a FeedReader struct that has a Spec and a Status field. Seem familiar? Almost every Kubernetes resource that we deal with has a .spec and .status field in its definition. But the types of our FeedReader's Spec and Status are actually custom types that we defined directly in our Go code. This gives us the ability to craft our CRD schema in a compiled language with static type-checking while letting the tooling emit the equivalent yaml CRD spec to use in our Kubernetes clusters.

If you compare the CRD yaml with the FeedReaderSpec struct, you can see that controller-gen extracted the struct fields, types, and docstrings into the emitted yaml. Also, since our FeedReader embeds metav1.TypeMeta and metav1.ObjectMeta and is annotated with the //+kubebuilder:object:root=true comment marker, the fields from those embedded structs (like apiVersion and kind) are included in the yaml as well.

But what about things like the API group, crd.sklar.rocks, and version v1alpha1? These are handled in a separate Go file in the same package:

// +kubebuilder:object:generate=true
// +groupName=crd.sklar.rocks
package v1alpha1

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
	// GroupVersion is group version used to register these objects
	GroupVersion = schema.GroupVersion{Group: "crd.sklar.rocks", Version: "v1alpha1"}

	// SchemeBuilder is used to add go types to the GroupVersionKind scheme
	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}
)

func init() {
	SchemeBuilder.Register(&FeedReader{}, &FeedReaderList{})
}

Here, we initialize a GroupVersion instance that defines our API group (crd.sklar.rocks) and version (v1alpha1). We then use it to instantiate the SchemeBuilder, which is used to register the FeedReader and FeedReaderList structs from above. When controller-gen is run, it will load this Go package, check for any registered structs, parse them (and any related comment markers), and emit the CRD yaml that we can use to install the CRD in our cluster.
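As an aside, this registration is also what lets controller-runtime understand our new types at runtime. Here's a sketch of how the generated scheme might later be wired into a manager; the import path for the v1alpha1 package is hypothetical:

package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"

	// Hypothetical module path for the v1alpha1 package defined above
	crdv1alpha1 "github.com/example/feedreader/api/v1alpha1"
)

var scheme = runtime.NewScheme()

func init() {
	// Built-in types (Deployments, StatefulSets, CronJobs, ...)
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	// Our FeedReader and FeedReaderList types
	utilruntime.Must(crdv1alpha1.SchemeBuilder.AddToScheme(scheme))
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}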

Assuming that our api definitions are in the ./api directory and we want the output in ./deploy/crd, all we need to do is run the following command to generate our FeedReader CRD's yaml definition:

$ controller-gen crd paths="./api/..." output:dir=deploy/crd

Installing our Custom Resource Definition

Now that we've generated our Custom Resource Definition, how do we install it in the cluster? It's actually very simple! A CustomResourceDefinition is its own Kubernetes resource type, just like a Deployment, Pod, StatefulSet, or any of the other Kubernetes resources that you're used to working with. All you need to do is kubectl create your CRD yaml, and the new CRD will be validated and registered with the API server.

You can check this by typing kubectl get crd, which will list all of the Custom Resource Definitions installed in your cluster. If you've copy-pasted the CRD yaml and installed the FeedReader CRD in your cluster, you should see feedreaders.crd.sklar.rocks in the list, served under the v1alpha1 API version.

With our FeedReader CRD installed, you are free to create as many resources of kind: FeedReader as you wish. Each one of these represents a unique instance of the FeedReader application in your cluster, with each instance managing its own Deployment, StatefulSet and CronJob. This way, you can easily deploy and manage multiple instances of your application in your cluster, all by creating and modifying Kubernetes objects in the same way that you would for any other resource type.

What's Next?

Now that we have a Custom Resource Definition and a Custom Resource that defines our FeedReader application, we need to actually perform the orchestration that will manage the application in our cluster. Stay tuned for the next article in this series, which will discuss how to do that in a custom controller.

If you like this content and want to follow along with the series, I'll be writing more articles under the Operators 101 tag on this blog.

How the CSI (Container Storage Interface) Works

If you work with persistent storage in Kubernetes, maybe you've seen articles about how to migrate from in-tree to CSI volumes, but aren't sure what all the fuss is about? Or perhaps you're trying to debug a stuck VolumeAttachment that won't unmount from a node, holding up your important StatefulSet rollout? A clear understanding of what the Container Storage Interface (or CSI for short) is and how it works will give you confidence when dealing with persistent data in Kubernetes, allowing you to answer these questions and more!

The Container Storage Interface is an API specification that enables developers to build custom drivers which handle the provisioning, attaching, and mounting of volumes in containerized workloads. As long as a driver correctly implements the CSI API spec, it can be used in any supported Container Orchestration system, like Kubernetes. This decouples persistent storage development efforts from core cluster management tooling, allowing for the rapid development and iteration of storage drivers across the cloud native ecosystem.

In Kubernetes, the CSI has replaced legacy in-tree volumes with a more flexible means of managing storage backends. Previously, in order to take advantage of a new storage type, one would have had to upgrade an entire cluster's Kubernetes version to access new PersistentVolume API fields for that storage type. But now, with the plethora of independent CSI drivers available, you can add any type of underlying storage to your cluster as soon as a driver exists for it.

But what if existing drivers don't provide the features that you require and you want to build a new custom driver? Maybe you're concerned about the ramifications of migrating from in-tree to CSI volumes? Or, you simply want to learn more about how persistent storage works in Kubernetes? Well, you're in the right place! This article will describe what the CSI is and detail how it's implemented in Kubernetes.

It's APIs All the Way Down

Like many things in the Kubernetes ecosystem, the Container Storage Interface is actually just an API specification. In the container-storage-interface/spec GitHub repo, you can find this spec in 2 different forms:

  1. A protobuf file that defines the API schema in gRPC terms
  2. A markdown file that describes the overall system architecture and goes into detail about each API call

What I'm going to discuss in this section is an abridged version of that markdown file, while borrowing some nice ASCII diagrams from the repo itself!

Architecture

A CSI Driver has 2 components, a Node Plugin and a Controller Plugin. The Controller Plugin is responsible for high-level volume management: creating, deleting, attaching, detaching, snapshotting, and restoring physical (or virtualized) volumes. If you're using a driver built for a cloud provider, like EBS on AWS, the driver's Controller Plugin communicates with AWS HTTPS APIs to perform these operations. For other storage types like NFS, iSCSI, ZFS, and more, the driver sends these requests to the underlying storage's API endpoint, in whatever format that API accepts.

On the other hand, the Node Plugin is responsible for formatting and mounting a volume once it's been attached to a node. These low-level operations usually require privileged access, so the Node Plugin is installed on every node in your cluster's data plane, wherever a volume could be mounted.

The Node Plugin is also responsible for reporting metrics like disk usage back to the Container Orchestration system (referred to as the "CO" in the spec). As you might have guessed already, I'll be using Kubernetes as the CO in this post! But what makes the spec so powerful is that it can be used by any container orchestration system, like Nomad for example, as long as it abides by the contract set by the API guidelines.

The specification doc provides a few possible deployment patterns, so let's start with the most common one.

                             CO "Master" Host
+-------------------------------------------+
|                                           |
|  +------------+           +------------+  |
|  |     CO     |   gRPC    | Controller |  |
|  |            +----------->   Plugin   |  |
|  +------------+           +------------+  |
|                                           |
+-------------------------------------------+

                            CO "Node" Host(s)
+-------------------------------------------+
|                                           |
|  +------------+           +------------+  |
|  |     CO     |   gRPC    |    Node    |  |
|  |            +----------->   Plugin   |  |
|  +------------+           +------------+  |
|                                           |
+-------------------------------------------+

Figure 1: The Plugin runs on all nodes in the cluster: a centralized
Controller Plugin is available on the CO master host and the Node
Plugin is available on all of the CO Nodes.

Since the Controller Plugin is concerned with higher-level volume operations, it does not need to run on a host in your cluster's data plane. For example, in AWS, the Controller makes AWS API calls like ec2:CreateVolume, ec2:AttachVolume, or ec2:CreateSnapshot to manage EBS volumes. These functions can be run anywhere, as long as the caller is authenticated with AWS. All the CO needs is to be able to send messages to the plugin over gRPC. So in this architecture, the Controller Plugin is running on a "master" host in the cluster's control plane.

On the other hand, the Node Plugin must be running on a host in the cluster's data plane. Once the Controller Plugin has done its job by attaching a volume to a node for a workload to use, the Node Plugin (running on that node) will take over by mounting the volume to a well-known path and optionally formatting it. At this point, the CO is free to use that path as a volume mount when creating a new containerized process; so all data on that mount will be stored on the underlying volume that was attached by the Controller Plugin. It's important to note that the Container Orchestrator, not the Controller Plugin, is responsible for letting the Node Plugin know that it should perform the mount.

Volume Lifecycle

The spec provides a flowchart of basic volume operations, also in the form of a cool ASCII diagram:

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

Figure 5: The lifecycle of a dynamically provisioned volume, from
creation to destruction.

Mounting a volume is a strictly ordered process: each step requires the previous one to have completed successfully. For example, if a volume does not exist, how could we possibly attach it to a node?

When publishing (mounting) a volume for use by a workload, the Node Plugin first requires that the Controller Plugin has successfully published the volume to its node. In practice, this usually means that the Controller Plugin has created the volume and attached it to that node. Now that the volume is attached, it's time for the Node Plugin to do its job. At this point, the Node Plugin can access the volume at its device path to create a filesystem and mount it to a directory. Once it's mounted, the volume is considered to be published and it is ready for a containerized process to use. This ends the CSI mounting workflow.

Continuing the AWS example, when the Controller Plugin publishes a volume, it calls ec2:CreateVolume followed by ec2:AttachVolume. These two API calls allocate the underlying storage by creating an EBS volume and attaching it to a particular instance. Once the volume is attached to the EC2 instance, the Node Plugin is free to format it and create a mount point on its host's filesystem.
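To ground this in code, here's a heavily simplified sketch of what a Controller Plugin's CreateVolume handler could look like, using the Go bindings published in the container-storage-interface/spec repo. The embedded UnimplementedControllerServer assumes a reasonably recent version of those bindings, and the backend call is just a placeholder:

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// controllerServer is a bare-bones Controller Plugin. Embedding
// UnimplementedControllerServer lets us implement only the calls we care
// about in this sketch.
type controllerServer struct {
	csi.UnimplementedControllerServer
}

// CreateVolume allocates storage in the backend; for the AWS EBS driver,
// this is roughly where ec2:CreateVolume would be called.
func (s *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	// ... call the storage backend here, using req.GetName() and the
	// requested capacity to create the underlying volume ...
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      "vol-12345", // the ID returned by the backend
			CapacityBytes: req.GetCapacityRange().GetRequiredBytes(),
		},
	}, nil
}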

Here is an annotated version of the above volume lifecycle diagram, this time with the AWS calls included in the flow chart.

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+                 |    |                 +-+
                    |    |
 <ec2:CreateVolume> |    | <ec2:DeleteVolume>
                    |    |
 <ec2:AttachVolume> |    | <ec2:DetachVolume>
                    |    |
                +---v----+---+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

If a Controller wants to delete a volume, it must first wait for the Node Plugin to safely unmount the volume to preserve data and system integrity. Otherwise, if a volume is forcibly detached from a node before unmounting it, we could experience bad things like data corruption. Once the volume is safely unpublished (unmounted) by the Node Plugin, the Controller Plugin would then call ec2:DetachVolume to detach it from the node and finally ec2:DeleteVolume to delete it, assuming that you don't want to reuse the volume elsewhere.

What makes the CSI so powerful is that it does not prescribe how to publish a volume. As long as your driver correctly implements the required API methods defined in the CSI spec, it will be compatible with the CSI and by extension, be usable in COs like Kubernetes and Nomad.

Running CSI Drivers in Kubernetes

What I haven't entirely made clear yet is why the Controller and Node Plugins are plugins in the first place. How does the Container Orchestrator call them, and what do they plug into?

Well, the answer depends on which Container Orchestrator you are using. Since I'm most familiar with Kubernetes, I'll be using it to demonstrate how a CSI driver interacts with a CO.

Deployment Model

Since the Node Plugin, responsible for low-level volume operations, must be running on every node in your data plane, it is typically installed using a DaemonSet. If you have heterogeneous nodes and only want to deploy the plugin to a subset of them, you can use node selectors, affinities, or anti-affinities to control which nodes receive a Node Plugin Pod. Since the Node Plugin requires root access to modify host volumes and mounts, these Pods will be running in privileged mode. In this mode, the Node Plugin can escape its container's security context to access the underlying node's filesystem when performing mounting and provisioning operations. Without these elevated permissions, the Node Plugin could only operate inside of its own containerized namespace without the system-level access that it requires to provision volumes on the node.

The Controller Plugin is usually run in a Deployment because it deals with higher-level primitives like volumes and snapshots, which don't require filesystem access to every single node in the cluster. Again, let's think about the AWS example I used earlier. If the Controller Plugin is just making AWS API calls to manage volumes and snapshots, why would it need access to a node's root filesystem? Most Controller Plugins are stateless and highly available, both of which lend themselves to the Deployment model. The Controller also does not need to be run in a privileged context.

Event-Driven Sidecar Pattern

Now that we know how CSI plugins are deployed in a typical cluster, it's time to focus on how Kubernetes calls each plugin to perform CSI-related operations. A series of sidecar containers, registered with the Kubernetes API server to react to different events across the cluster, is deployed alongside each Controller and Node Plugin. In a way, this is similar to the typical Kubernetes controller pattern, where controllers react to changes in cluster state and attempt to reconcile the current cluster state with the desired one.

There are currently 6 different sidecars that work alongside each CSI driver to perform specific volume-related operations. Each sidecar registers itself with the Kubernetes API server and watches for changes in a specific resource type. Once the sidecar has detected a change that it must act upon, it calls the relevant plugin with one or more API calls from the CSI specification to perform the desired operations.

Controller Plugin Sidecars

Here is a table of the sidecars that run alongside a Controller Plugin:

Sidecar Name         | K8s Resources Watched   | CSI API Endpoints Called
---------------------|-------------------------|-------------------------------
external-provisioner | PersistentVolumeClaim   | CreateVolume, DeleteVolume
external-attacher    | VolumeAttachment        | Controller(Un)PublishVolume
external-snapshotter | VolumeSnapshot(Content) | CreateSnapshot, DeleteSnapshot
external-resizer     | PersistentVolumeClaim   | ControllerExpandVolume

How do these sidecars work together? Let's use an example of a StatefulSet to demonstrate. In this example, we're dynamically provisioning our PersistentVolumes (PVs) instead of mapping PersistentVolumeClaims (PVCs) to existing PVs. We start at the creation of a new StatefulSet with a VolumeClaimTemplate.

---
apiVersion: apps/v1
kind: StatefulSet
spec:
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
         storage: 1Gi

Creating this StatefulSet will trigger the creation of a new PVC based on the above template. Once the PVC has been created, the Kubernetes API will notify the external-provisioner sidecar that this new resource was created. The external-provisioner will then send a CreateVolume message to its neighbor Controller Plugin over gRPC. From here, the CSI driver's Controller Plugin takes over by processing the incoming gRPC message and will create a new volume based on its custom logic. In the AWS EBS driver, this would be an ec2:CreateVolume call.

At this point, the control flow moves to the built-in PersistentVolume controller, which will create a matching PV and bind it to the PVC. This allows the StatefulSet's underlying Pod to be scheduled and assigned to a Node.

Here, the external-attacher sidecar takes over. It will be notified of the new PV and call the Controller Plugin's ControllerPublishVolume endpoint, attaching the volume to the StatefulSet's assigned node. This would be the equivalent of ec2:AttachVolume in AWS.

At this point, we have an EBS volume that is attached to an EC2 instance, all based on the creation of a StatefulSet, a PersistentVolumeClaim, and the work of the AWS EBS CSI Controller Plugin.

Node Plugin Sidecars

There is only one unique sidecar that is deployed alongside the Node Plugin: the node-driver-registrar. This sidecar, running as part of a DaemonSet, registers the Node Plugin with a Node's kubelet. During the registration process, the Node Plugin will inform the kubelet that it is able to mount volumes using the CSI driver that it is part of. The kubelet itself will then wait until a Pod is scheduled to its corresponding Node, at which point it is then responsible for making the relevant CSI calls (like NodePublishVolume) to the Node Plugin over gRPC.
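For completeness, here's a sketch of the Node Plugin handler that the kubelet ultimately triggers, using the same Go bindings as the Controller Plugin sketch earlier. The mount itself is left as a placeholder comment, since it's highly driver-specific:

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeServer is a bare-bones Node Plugin; UnimplementedNodeServer fills in
// the CSI calls that this sketch doesn't implement.
type nodeServer struct {
	csi.UnimplementedNodeServer
}

// NodePublishVolume is called by the kubelet (over the shared socket) once
// a Pod that needs the volume has been scheduled to this node.
func (s *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	// targetPath is where the kubelet expects the volume to be mounted
	targetPath := req.GetTargetPath()

	// ... format the attached device if necessary and mount it at
	// targetPath (for example with k8s.io/mount-utils) ...
	_ = targetPath

	return &csi.NodePublishVolumeResponse{}, nil
}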

Common Sidecars

There is also a livenessprobe sidecar that runs in both the Controller and Node Plugin Pods. It monitors the health of the CSI driver and reports back to the Kubernetes Liveness Probe mechanism.

Communication Over Sockets

How do these sidecars communicate with the Controller and Node Plugins? Over gRPC, through a shared socket! Each sidecar and plugin container mounts a shared volume that contains a single unix socket.
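As a rough sketch (with a hypothetical socket path, and a driver that doesn't override any CSI calls yet), the plugin side of that socket could look something like this:

package main

import (
	"net"
	"os"

	"google.golang.org/grpc"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// driver embeds the Unimplemented* servers so it satisfies the CSI
// interfaces; a real plugin would override the calls it supports.
type driver struct {
	csi.UnimplementedIdentityServer
	csi.UnimplementedControllerServer
}

func main() {
	// The sidecars (and the kubelet, for a Node Plugin) send their gRPC
	// messages to this shared unix socket, mounted into each container
	const socketPath = "/csi/csi.sock"
	_ = os.Remove(socketPath) // clean up a stale socket from a previous run

	lis, err := net.Listen("unix", socketPath)
	if err != nil {
		panic(err)
	}

	d := &driver{}
	srv := grpc.NewServer()
	csi.RegisterIdentityServer(srv, d)
	csi.RegisterControllerServer(srv, d)

	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}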

CSI Controller Deployment

This diagram highlights the pluggable nature of CSI Drivers. To replace one driver with another, all you have to do is swap the CSI Driver container for another and ensure that it's listening on the unix socket that the sidecars are sending gRPC messages to. Because all drivers advertise their own capabilities and communicate over the shared CSI API contract, it's literally a plug-and-play solution.

Conclusion

In this article, I only covered the high-level concepts of the Container Storage Interface spec and its implementation in Kubernetes. I hope it has provided a clearer understanding of what happens once you install a CSI driver; actually writing one requires significant low-level knowledge of both your nodes' operating system(s) and the underlying storage mechanism that your driver is implementing. Luckily, CSI drivers exist for a variety of cloud providers and distributed storage solutions, so it's likely that you can find one that already fulfills your requirements. But it always helps to know what's happening under the hood in case your particular driver is misbehaving.

If this article interests you and you want to learn more about the topic, please let me know! I'm always happy to answer questions about CSI Drivers, Kubernetes Operators, and a myriad of other DevOps-related topics.