Kubebuilder and Operator-SDK Tips and Tricks

Recently, I've been spending a lot of time writing a Kubernetes operator using the Go operator-sdk, which is built on top of the Kubebuilder framework. This is a list of a few tips and tricks that I've compiled over the past few months of working with these frameworks.

Log Formatting

Kubebuilder, like much of the k8s ecosystem, utilizes zap for logging. Out of the box, the Kubebuilder zap configuration outputs a timestamp for each log, which gets formatted using scientific notation. This makes it difficult for me to read the time of an event just by glancing at it. Personally, I prefer ISO 8601, so let's change it!

In your scaffolding's main.go, you can configure your current logger format by modifying the zap.Options struct and calling ctrl.SetLogger.

opts := zap.Options{
    Development: true,
    TimeEncoder: zapcore.ISO8601TimeEncoder,
}

ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

In this case, I added the zapcore.ISO8601TimeEncoder, which encodes timestamps to human-readable ISO 8601-formatted strings. It took a bit of digging, along with a bit of help from the Kubernetes Slack org, to figure this one out. But it's been a huge quality-of-life improvement when debugging complex reconcile loops, especially in a multithreaded environment.

MaxConcurrentReconciles

Speaking of multithreaded environments, by default an operator will only run a single reconcile loop per controller. However, in practice, especially when running a globally-scoped controller, it's useful to run multiple concurrent reconcile loops to handle many resource changes at once. Luckily, the Operator SDK makes this incredibly easy with the MaxConcurrentReconciles setting. We can set this up in a new controller's SetupWithManager func:

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
        ...
        Complete(r)
}

I've created a command line arg in my main.go file that allows the user to set this value to any integer, since it will likely be tweaked over time depending on how the controller performs in a production cluster.
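
Here's a rough sketch of what that flag could look like; the flag name and variable are my own choices, not part of the generated scaffolding:

// In main.go, alongside the other scaffolded flags. The flag name and
// default value here are just examples.
var maxConcurrentReconciles int
flag.IntVar(&maxConcurrentReconciles, "max-concurrent-reconciles", 1,
    "number of concurrent reconcile loops to run per controller")
flag.Parse()

// ...then pass the value through when building the controller:
// WithOptions(controller.Options{MaxConcurrentReconciles: maxConcurrentReconciles})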

Parent-Child Relationships

One of the basic functions of a controller is to act as a parent to the Kubernetes resources that it manages. By setting owner references, your custom resource "owns" these child objects, so that when the parent resource is deleted, all of its children are automatically garbage collected by the Kubernetes runtime.

I like this small function that can be called for any client.Object to add a parent reference to the controller that you're writing.

func (r *CustomReconciler) ownObject(ctx context.Context, cr *myapiv1alpha1.CustomResource, obj client.Object) error {

	err := ctrl.SetControllerReference(cr, obj, r.Scheme)
	if err != nil {
		return err
	}
	return r.Update(ctx, obj)
}
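
Since ownObject issues an Update, it assumes the child object already exists in the cluster. Here's a rough, hypothetical example of calling it from a reconcile loop (the ConfigMap and its name are made up for illustration):

// Adopt an existing ConfigMap as a child of the custom resource so that it
// gets garbage collected along with it. The name "my-config" is made up.
var cm v1core.ConfigMap
key := types.NamespacedName{Name: "my-config", Namespace: cr.Namespace}
if err := r.Get(ctx, key, &cm); err != nil {
    return ctrl.Result{}, err
}

if err := r.ownObject(ctx, cr, &cm); err != nil {
    return ctrl.Result{}, err
}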

You can then add Owns watches for these resources in your SetupWithManager func. These will instruct your controller to listen for changes in child resources of the specified types, triggering a reconcile loop on each change.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Owns(&v1apps.Deployment{}).
        Owns(&v1core.ConfigMap{}).
        Owns(&v1core.Service{}).
        Complete(r)
}

Watches

Your controller can also watch resources that it doesn't own. This is useful when you need to watch for changes in globally-scoped resources like PersistentVolumes or Nodes. Here's an example of how you would register this watch in your SetupWithManager func.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Watches(
            &source.Kind{Type: &v1core.Node{}},
            handler.EnqueueRequestsFromMapFunc(myNodeFilterFunc),
            builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
        ).
        Complete(r)
}

In this case, you need to implement myNodeFilterFunc to accept an obj client.Object and return []reconcile.Request. Using the ResourceVersionChangedPredicate triggers the filter function for every change on that resource type, so it's important to write your filter function to be as efficient as possible, since it could be called quite often, especially if your controller is globally-scoped.
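
For reference, here's a sketch of what myNodeFilterFunc might look like if defined as a method on the reconciler (in which case you'd pass r.myNodeFilterFunc to EnqueueRequestsFromMapFunc). It simply enqueues a request for every CustomResource whenever a Node changes; a real implementation should filter much more aggressively:

// A sketch only: map a changed Node to the CustomResources that should be
// reconciled in response. CustomResourceList comes from your own API types.
func (r *CustomReconciler) myNodeFilterFunc(obj client.Object) []reconcile.Request {
    node := obj.(*v1core.Node)
    _ = node // inspect node labels, conditions, etc. to narrow this down

    var crList myapiv1alpha1.CustomResourceList
    if err := r.List(context.Background(), &crList); err != nil {
        return nil
    }

    requests := make([]reconcile.Request, 0, len(crList.Items))
    for _, cr := range crList.Items {
        requests = append(requests, reconcile.Request{
            NamespacedName: types.NamespacedName{
                Name:      cr.Name,
                Namespace: cr.Namespace,
            },
        })
    }
    return requests
}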

Field Indexers

One gotcha I encountered was when trying to query for a list of Pods running on a particular Node. This query uses a FieldSelector filter, as seen here:

// Get a list of all pods on the node
err := c.List(ctx, &pods, &client.ListOptions{
    Namespace:     "",
    FieldSelector: fields.ParseSelectorOrDie(fmt.Sprintf("spec.nodeName=%s", node.Name)),
})

This codepath led to the following error: Index with name field:spec.nodeName does not exist. After some googling around, I found this GitHub issue that referenced a Kubebuilder docs page which contained the answer.

Controllers created using operator-sdk and Kubebuilder use a built-in caching mechanism to store the results of API requests. This prevents spamming the K8s API and also improves reconciliation performance.

When performing resource lookups using FieldSelectors, you first need to add your desired search field to an index that the cache can use for lookups. Here's an example that will build this index for a Pod's nodeName:

if err := mgr.GetFieldIndexer().IndexField(context.TODO(), &v1core.Pod{}, "spec.nodeName", func(rawObj client.Object) []string {
    pod := rawObj.(*v1core.Pod)
    return []string{pod.Spec.NodeName}
}); err != nil {
    return err
}

Now we can run the List call from above, FieldSelector and all, with no issues.

Retries on Conflicts

If you've ever written controllers, you're probably very familiar with the error: Operation cannot be fulfilled on ...: the object has been modified; please apply your changes to the latest version and try again.

This occurs when the version of the resource that you're currently reconciling is out of date with the latest version of the K8s cluster state. If you're retrying your reconciliation loop on any error, your controller will eventually reconcile the resource, but this can really pollute your logs and make it difficult to spot more important errors.

After reading through the k8s source, I found the solution: RetryOnConflict. It's a utility function in the client-go package that runs a function and automatically retries it on conflict errors, up to a maximum number of attempts.

Now you can just wrap your reconciliation logic in the function that you pass to RetryOnConflict and never worry about this issue again! An added benefit is that inside that function you just return err instead of return ctrl.Result{}, err, which makes the code that much easier to read.
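
Here's a rough sketch of what that looks like in practice; the retry package is k8s.io/client-go/util/retry, and the Status.Ready field on the custom resource is a made-up example:

// Wrap the update in RetryOnConflict so that version conflicts are retried
// with a fresh copy of the object instead of failing the reconcile.
err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Re-fetch the latest version of the resource before modifying it
    if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
        return err
    }

    cr.Status.Ready = true // placeholder for your actual changes

    // Any conflict returned here triggers another retry
    return r.Status().Update(ctx, cr)
})
if err != nil {
    return ctrl.Result{}, err
}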

Useful Kubebuilder Markers

Here are some useful code markers that I've found while developing my operator.

  1. To add custom columns to your custom resource's description (when running kubectl get), you can add markers to your API object like these:
//+kubebuilder:printcolumn:name="NodeReady",type="boolean",JSONPath=".status.nodeReady"
//+kubebuilder:printcolumn:name="NodeIp",type="string",JSONPath=".status.nodeIp"
  2. To add a shortname to your custom resource (like pvc for PersistentVolumeClaim, for example), you can add this marker:
//+kubebuilder:resource:shortName=mycr;mycrs

More docs on kubebuilder markers can be found here:

https://book.kubebuilder.io/reference/markers/crd.html

Using QuestDB to collect infrastructure metrics

Originally posted on the QuestDB Blog

Introduction

One of my favorite things about QuestDB is the ability to write queries in SQL against a high-performance time series database. Since I've been using SQL as my primary query language for basically my entire professional career, it feels natural for me to interact with data using SQL instead of other newer proprietary query languages. Combined with QuestDB's custom SQL extensions, its built-in SQL support makes writing complex queries a breeze.

In my life as a Cloud Engineer, I deal with time series metrics all the time. Unfortunately, many of today's popular metrics databases don't support the SQL query language. As a result, I've become more dependent on pre-built dashboards, and it takes me longer to write my own queries with JOINs, transformations, and temporal aggregations.

QuestDB can be a great choice for ingesting application and infrastructure metrics; it just requires a little more work on the initial setup than the Kubernetes tooling du jour. Despite this extra upfront time investment (which is fairly minimal in the grand scheme of things), I think that the benefits of using QuestDB for infrastructure metrics are worth it. With QuestDB, you get industry-leading performance and the ability to interact with the database in the most commonly-used query language in existence. We're even using QuestDB to display customer database metrics in our own QuestDB Cloud!

In this article, I will demonstrate how we use QuestDB as the main component in this new feature. This should provide enough information for you to also use QuestDB for ingesting, storing, and querying infrastructure metrics in your own clusters.

Architecture

Prometheus is a common time series database that is already installed in many Kubernetes clusters. We will be leveraging its remote write functionality to pipe data into QuestDB for querying and storage. However, since Prometheus remote write does not support the QuestDB-recommended InfluxDB Line Protocol (ILP) as a serialization format, we need a proxy to translate Prometheus-formatted metrics into ILP messages. We will be using InfluxData's Telegraf as this translation component.

Once our data is in QuestDB, we can use SQL to query our metrics through any of the supported methods: the Web Console, the PostgreSQL wire protocol, or the HTTP REST API.

Here's a quick overview of the architecture:

Infrastructure diagram

Prometheus remote write

While Prometheus operates on an interval-based pull model, it also has the ability to push metrics to remote sources. This is known as "remote write" capability, and is easily configurable in a yaml file. Here's an example of a basic remote write configuration:

remoteWrite:
  - url: http://default.telegraf.svc:9999/write
    name: questdb-telegraf
    remote_timeout: 10s

This yaml will configure Prometheus to send samples to the specified url with a 10 second timeout. In this case, we will be forwarding our metrics on to telegraf, with a custom port and endpoint that we can specify in the telegraf config (see below for more details). There are also a variety of other remote write options that allow users to customize timeouts, headers, authentication, and additional relabeling configs before writing to the remote data store. All of the possible options can be found on the Prometheus website.

QuestDB ILP and Telegraf

Now that we have our remote write configured, we need to set up its destination. Installing telegraf into a cluster is straightforward: just helm install its Helm chart.

We do need to configure telegraf to listen for incoming metrics over HTTP (at the endpoint that Prometheus is configured to write to) and send them to QuestDB for long-term storage. In a Kubernetes deployment, these options can be set in the config section of the telegraf Helm chart's values.yaml file.

Input configuration

Since telegraf will be receiving metrics from Prometheus, we need to open a port that enables communication between the two services. Telegraf has an HTTP listener plugin that allows it to listen for traffic on a specified port. We also need to configure the path of the listener to match our Prometheus remote write url.

The HTTP listener (v2) supports multiple data formats to consume via its plugin architecture. A full list of options can be found in the telegraf docs. We will be using the Prometheus Remote Write Parser Plugin to accept our Prometheus messages.

Here is how this setup looks in the telegraf config:

[[inputs.http_listener_v2]]
  ## Address and port to host HTTP listener on
  service_address = ":9999"

  ## Paths to listen to.
  paths = ["/write"]

  ## Data format to consume.
  data_format = "prometheusremotewrite"

When passing these values to the Helm chart, you can use this yaml specification:

config:
  inputs:
    - http_listener_v2:
        service_address: ":9999"
        path: "/write"
        data_format: prometheusremotewrite

Output configuration

We recommend that you use the InfluxDB Line Protocol (ILP) over TCP to insert data into QuestDB. Luckily, telegraf includes an ILP output plugin! But unfortunately, this is not a plug-and-play solution. By default, all metrics will be written to a single measurement, prometheus_remote_write, with each individual metric's key sent over the wire as a field. In practice, this means that all of your metrics will be written to a single QuestDB table called prometheus_remote_write. That table will then have an additional column for every single metric and field that you are capturing, leaving you with one huge table of potentially thousands of columns that is difficult to work with and full of sparse data, which could negatively impact performance.

To fix this problem, Telegraf provides a sample starlark script that transforms each measurement so that we end up with one table per metric in QuestDB. This script runs for every metric that telegraf receives, so the output will be formatted correctly.

This is what telegraf's output config looks like:

[[outputs.socket_writer]]
  ## Address and port to write to
  address = "tcp://questdb.questdb.svc:9009"

[[processors.starlark]]
  source = '''
def apply(metric):
    if metric.name == "prometheus_remote_write":
        for k, v in metric.fields.items():
            metric.name = k
            metric.fields["value"] = v
            metric.fields.pop(k)
    return metric
'''

As an added benefit of using ILP with QuestDB, we don't even have to worry about each metric's fieldset. Over ILP, QuestDB automatically creates tables for new metrics, adds new columns for fields that it hasn't seen before, and INSERTs nulls for any missing fields.

Helm configuration

I've found that the easiest way to configure the values.yaml file is to mount the starlark script as a volume, and add a reference to it in the config. This way we don't need to deal with any whitespace-handling or special indentation in our ConfigMap specification.

The output and starlark Helm configuration would look like this:

# continued from above
# config:
outputs:
  - socket_writer:
      address: tcp://questdb.questdb.svc:9009
processors:
  - starlark:
      script: /opt/telegraf/remotewrite.star

We also need to add the volume and mount at the root level of the values.yaml:

volumes:
  - name: starlark-script
    configMap:
      name: starlark-script
mountPoints:
  - name: starlark-script
    mountPath: /opt/telegraf
    subpath: remotewrite.star

This volume references a ConfigMap that contains the starlark script from the above example:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: starlark-script
data:
  remotewrite.star: |
    def apply(metric):
      ...

Querying metrics with SQL

QuestDB has some powerful SQL extensions that can simplify writing time series queries. For example, given the standard set of metrics that a typical Prometheus installation collects, we can use QuestDB to not only find the pods with the highest memory usage in a cluster (over a 6 month period), but also pinpoint the specific time period when the memory usage spiked. We can even access custom labels to identify the pods with a human-readable name (instead of the long alphanumeric name assigned to pods by deployments or stateful sets). This can all be done with familiar SQL, using JOINs (enhanced by the ASOF keyword) and SAMPLE BY to bucket the data into days:

SELECT l.custom_label, w.timestamp, max(w.value / r.value) as mem_usage
FROM container_memory_working_set_bytes AS w
ASOF JOIN kube_pod_labels AS l ON (w.pod = l.pod)
ASOF JOIN kube_pod_container_resource_limits AS r ON (
  r.pod = w.pod AND
  r.container = w.container
)
WHERE custom_label IS NOT NULL
  AND r.resource = 'memory'
  AND w.timestamp > '2022-06-01'
SAMPLE BY 1d
ALIGN TO CALENDAR TIME ZONE 'Europe/Berlin'
ORDER BY mem_usage DESC;

Here's a sample output of that query:

custom_label            timestamp                      mem_usage
keen austin             2022-07-04T16:18:00.000000Z    0.999853875401
optimistic banzai       2022-07-12T16:18:00.000000Z    0.9763028946
compassionate taussig   2022-07-11T16:18:00.000000Z    0.975367909527
cranky leakey           2022-07-11T16:18:00.000000Z    0.974941994418
quirky morse            2022-07-05T16:18:00.000000Z    0.95084235665
admiring panini         2022-06-21T16:18:00.000000Z    0.925567626953

This is only one of many ways that you can use QuestDB to write powerful time-series queries that you can use for one-off investigation or to power dashboards.

Metric retention

Since databases storing infrastructure metrics can grow to extreme sizes over time, it is important to enforce a retention period to free up space by deleting old metrics. Even though QuestDB does not support the traditional DELETE SQL command, you can still implement metric retention by using the DROP PARTITION command.

In QuestDB, data is stored by column on disk and optionally partitioned by a time duration. By default, when a new table is automatically created through ILP ingestion, it is partitioned by DAY. This allows us to DROP PARTITIONs on a daily basis. If you need a different partitioning scheme, you can create the table with your desired partition period before ingesting any data over ILP, since ALTER TABLE does not support changing a table's partitioning. And because ILP automatically adds columns, that table specification can be very simple: just the table name and a designated timestamp column.

Once you've decided on your desired metric retention period, you can create a cron job that removes all partitions older than your oldest retention date. This will help keep your storage usage in check.
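
As a sketch of what that cron job could run, here's a small Go program that drops partitions older than 30 days over the PostgreSQL wire protocol (port 8812). The table name, the 30-day window, and the use of the lib/pq driver are all assumptions for illustration:

// A sketch of a retention job. Connection details, the table name, and the
// retention window are assumptions; adjust them for your own deployment.
package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // QuestDB speaks the PostgreSQL wire protocol on port 8812
)

func main() {
    db, err := sql.Open("postgres",
        "host=localhost port=8812 user=admin password=quest dbname=qdb sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Drop all partitions older than 30 days for a single metric table
    _, err = db.Exec(`ALTER TABLE container_memory_working_set_bytes
        DROP PARTITION WHERE timestamp < dateadd('d', -30, now())`)
    if err != nil {
        log.Fatal(err)
    }
}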

For more information about data retention in QuestDB, you can check out the docs.

Working example

I have created a working example of this setup in a repo, sklarsa/questdb-metrics-blog-post. The entire example runs in a local Kind cluster.

To run the example, execute the following commands:

git clone https://github.com/sklarsa/questdb-metrics-blog-post.git
cd questdb-metrics-blog-post
./run.sh

After a few minutes, all pods should be ready, and you should see the following prompt:

You can now access QuestDB here: http://localhost:9000
Ctrl-C to exit
Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000

From here, you can navigate to http://localhost:9000 and explore the metrics that are being ingested into QuestDB. The default Prometheus scrape interval is 30 seconds, so there might not be a ton of data in there yet, but you should see a list of tables, one for each metric that we are collecting:

Metrics in the console

Once you're done, you can clean up the entire experiment by deleting the cluster:

./cleanup.sh

Conclusion

QuestDB can be a very powerful piece in the Cloud Engineer's toolkit. It grants you the ability to run complex time-series queries across multiple metrics with unparalleled speed in the world's most ubiquitous query language, SQL. Every second counts when debugging an outage at 2AM, and reducing the cognitive load of writing queries, as well as their execution time, is a game-changer for me.

I wrote some Rust!

Thanks to a weekly Rust workshop hosted by one of my colleagues at QuestDB, along with some extra time spent in an airport lounge during a long layover, I was finally able to hack together some Rust that does something useful!

questdb-retention

The project I put together is questdb-retention, a small program that allows you to manage data retention in your own QuestDB instance.

Even though QuestDB does not support a traditional SQL DELETE statement (as of v7.0), you can still purge stale data from your database by using the DROP PARTITION command. Check out the docs for more information about the specifics of data retention.

The questdb-retention package allows you to drop old QuestDB partitions either interactively (on the command line) or through a yaml config file. Only tables that are partitioned are supported, for obvious reasons.

In my view, the best way to use this package is to write a yaml config, compile the executable, and add a cron job that runs the command at regular intervals. Here's an example of a yaml file that you can use:

---
conn_str: host=localhost user=admin password=quest port=8812
tables:
  my_partitioned_table_by_month: 5
  my_partitioned_table_by_day: 5

The config only specifies the number of partitions to keep per table, but since QuestDB supports partitioning by anything from HOUR to YEAR, it's difficult to tell how much data is actually being retained just by inspecting the config file. This is something worth improving in the future. Also, once the package reaches stability, I'll look into adding it to crates.io to make it official.

All in all, once I got the hang of some Rust basics, I really enjoyed working in the language! I find Rust's pattern matching incredibly useful, and I'm starting to get comfortable with borrows, mutability, and common Rust types like Option and Result. I'm excited to learn more concepts like async and lifetimes, and to start writing some lower-level code that can really take advantage of everything Rust has to offer.