Running Databases on Kubernetes

A few weeks ago, Kelsey Hightower wrote a tweet and held a live discussion on Twitter about whether it's a good idea or not to run a database on Kubernetes. This happened to be incredibly timely for me, since we at QuestDB are about to launch our own cloud database service (built on top of k8s)!

You can run databases on Kubernetes because it's fundamentally the same as running a database on a VM. The biggest challenge is understanding that rubbing Kubernetes on Postgres won't turn it into Cloud SQL. 🧵

Kelsey Hightower on Twitter: https://t.co/zdFobm4ijy

K8s Primitives

When working with databases, the obvious concern is data persistence. Earlier in its history, k8s really shined in the area of orchestrating stateless workloads, but support for stateful workflows was limited. Eventually, primitives like StatefulSets, PersistentVolumes (PVs), and PersistentVolumeClaims (PVCs) were developed to help orchestrate stateful workloads on top of the existing platform.

PersistentVolumes are abstractions that allow for the management of raw storage; ranging from local disk to NFS, cloud-specific block storage, and more. These work in concert with PersistentVolumeClaims that represent requests for a pod to access the storage managed by a PV. A user can bind a PVC to a PV to make an ownership claim on a set of raw disk resources encompassed by the PV. Then, you can add that PVC to any pod spec as a volume, effectively allowing you to mount any kind of persistent storage medium to a particular workload. The separation of PV and PVC also allows you to fully control the lifecycle of your underlying block storage, including mounting it to different workloads or freeing it all together once the claim expires.

StatefulSets manage the lifecycles of pods that require more stability than what exists in other primitives like Deployments and ReplicaSets. By creating a StatefulSet, you can guarantee that when you remove a pod, the storage managed by its mounted PVCs does not get deleted along with it. You can imagine how useful this property is if you're hosting a database! StatefulSets also allow for ordered deployment, scaling, and rolling updates, all of which create more predictability (and thus stability) in our workloads. This is also something that seems to go hand-in-hand with what you want out of your database's infrastructure.

What else?

While StatefulSets, PVs, and PVCs do quite a bit of work for us, there are still many administration and configuration actions that you need to perform on a production-level database. For example, how do you orchestrate backups and restores? These can get quite complex when dealing with high-traffic databases that include functionality such as WALs. What about clustering and high availability? Or version upgrades? Are these operations zero-downtime? Every database deals with these features in different ways, many of which require precise coordination between components to succeed. Kubernetes alone can't handle this. For example, you can't have a StatefulSet automatically set up your average RDBMS in a read-replica mode very easily without some additional orchestration.

Not only do you have to implement many of these features yourself, but you also need to deal with the ephemeral nature of Kubernetes workloads. To ensure peak performance, you have to guarantee that the k8s scheduler places your pods on nodes that are already pre-tuned to run your database, with enough free resources to properly run it. If you're dealing with clustering, how are you handling networking to ensure that database nodes are able to connect to each other (ideally in the same cloud region)? This brings me to my next point...

Pets, not cattle

Pets, not cattle

Once you start accounting for things like node performance-tuning and networking, along with the requirement to store data persistently in-cluster, all of a sudden your infrastructure starts to grow into a set of carefully groomed pet servers instead of nameless herds of cattle. But one of the main benefits of running your application in k8s is the exact ability to treat your infrastructure like cattle instead of pets! All of the most common abstractions like Deployments, Ingresses, and Services, along with features like vertical and horizontal autoscaling, are made possible because you can run your workloads on a high-level set of infrastructure components so you don't have to worry about your physical infrastructure layer. These abstractions allow you to focus more on what you're trying to achieve with your infrastructure instead of how you're going to achieve it.

Then why even bother with k8s?

Despite these rough edges, there are plenty of reasons to want to run your database on k8s. There's no denying that k8s' popularity has increased tremendously over the past few years across both startups and enterprises. The k8s ecosystem is under constant development so that its feature set continues to expand and improve regularly. And its operator model allows end users to programmatically manage their workloads by writing code against the core k8s APIs to automatically perform tasks that would previously have to be done manually. K8s allows for easy GitOps-style management so you can leverage battle-tested software development practices when managing infrastructure in a reproducible and safe way. While vendor lock-in still exists in the world of k8s, its effect can be minimized to make it easier for you to go multi-cloud (or even swap one for another).

So what can we do if we want to take advantage of all the benefits that k8s has to offer while using it to host our database?

What do you need to build an RDS on k8s?

Towards the end of the live chat, someone asked Kelsey, "what do you actually need to build an RDS on k8s?" He jokingly answered with expertise, funding, and customers. While we're certainly on the right track with these at QuestDB, I think that this can be better phrased in that you need to implement Day 2 Operations to get to what a typical managed database service would provide.

Day 2 Operations

Storage Engineer

Day 2 Operations encompass many of the items that I've been discussing; backups, restores, stop/start, replication, high availability, and clustering. These are the features that differentiate a managed database service from a simple database hosted on k8s primitives, which is what I would call a Day 1 Operation. While k8s and its ecosystem can make it very easy to install a database in your cluster, you're going to eventually need to start thinking about Day 2 Operations once you get past the prototype phase.

Here, I'll jump into more detail about what makes these operations so difficult to implement and why special care must be taken when implementing them, either by a database admin or a managed database service provider.

Stop/Start

Stopping and starting databases is a common operation in today's DevOps practices, and is a must-have for any fully-featured managed database service. It is pretty easy to find at least one reason for wanting to stop-and-start a database. For example, you may want to have a database used for running integration tests that run on a pre-defined schedule. Or you maybe have a shared instance that's used by a development team for live QA before merging a commit. You could always create and delete database instances on-demand, but it is sometimes easier to have a reference to a static database connection string and url in your test harness or orchestration code.

While stop/start can be automated in k8s (perhaps by simply setting a StatefulSet's replica count to 0), there are still other aspects that need to be considered. If you're shutting down a database to save some money, will you also be spinning down any infrastructure? If so, how can you ensure that this infrastructure will be available when you start the database backup? K8s provides primitives like node affinity and taints to help solve this problem, but everyone's infrastructure provisioning situation and budget are different, and there's no one-size-fits-all approach to this problem.

Backup & Restore

One interesting point that Kelsey made in his chat was that having the ability to start an instance from scratch (moving from a stopped -> running state), is not trivial. Many challenges need to be solved, including finding the appropriate infrastructure to run the database, setting up network connectivity, mounting the correct volume, and ensuring data integrity once the volume has been mounted. In fact, this is such an in-depth topic, that Kelsey compares going from 0 -> 1 running instance to an actual backup-and-restore test. If you can indeed spin up an instance from scratch while loading up pre-existing data, you have successfully completed a live restore test!

Even if you have restores figured out, backups have their own complexities. K8s provides some useful building blocks like Jobs and CronJobs, which you can use if you want to take a one-off backup or create a backup schedule respectively. But you need to ensure that these jobs are configured correctly in order to access raw database storage. Or if your database allows you to perform a backup using a CLI, then these jobs also need secure access to credentials to even connect to the database in the first place. From an end-user standpoint, you need an easy way to manage existing backups, which includes creating an index, applying data retention policies, and RBAC policies. Again, while k8s can help us build out these backup-and-restore components, a lot of these features are built on top of the infrastructure primitives that k8s provides.

Replication, HA, and Clustering

These days, you can get very far by simply vertically scaling your database. The performance of modern databases can be sufficient for almost anyone's use case if you throw enough resources at the problem. But once you've reached a certain scale, or require features like high availability, there is a reason to enable some of the more advanced database management features like clustering and replication.

Once you start down this path, the amount of infrastructure orchestration complexity can increase exponentially. You need to start thinking more about networking and physical node placement to achieve your desired goal. If you don't have a centralized monitoring, logging, and telemetry solution, you're now going to need one if you want to easily diagnose issues and get the best performance out of your infrastructure. Based on its architecture and feature set, every database can have different options for enabling clustering, many of which require intimate knowledge of the inner workings of the database to choose the correct settings.

Vanilla k8s knows nothing of these complexities. Instead, these all need to be orchestrated by an administrator or operator (human or automated). If you're working with production data, changes may need to happen with close-to-zero downtime. This is where managed database services shine. They can make some of these features as easy to configure as a single web form with a checkbox or two and some input fields. Unless you're willing to invest the time into developing these solutions yourself, or leverage existing open-source solutions if they exist, sometimes it's worth giving up some level of control for automated expert assistance when configuring a database cluster.

Orchestration

Orchestra Conductor

For your Day 2 Operations to work as they would in a managed database service such as RDS, they need to not just work, but also be automated. Luckily for us, there are several ways to build automation around your database on k8s.

Helm & Yaml tools won't get us there

Since k8s configuration is declarative, it can be very easy to get from 0 -> 1 with traditional yaml-based tooling like Helm or cdk8s. Many industry-leading k8s tools install into a cluster with a simple helm install or kubectl apply command.

These are sufficient for Day 1 Operations and non-scalable deployments. But as soon as you start to move into more vendor-specific Day 2 Operations that require more coordination across system components, the usefulness of traditional yaml-based tools starts to degrade quickly, since some imperative programming logic is required.

Provisioners

One pattern that you can use to automate database management is a provisioner process. We've even used this approach to build v1 of our managed cloud solution. When a user wants to make a change to an existing database's state, our backend sends a message to a queue that is eventually picked up by a provisioner. The provisioner reads the message, uses its contents to determine which actions to perform on the cluster, and performs them sequentially. Where appropriate, each action contains a rollback step in case of a kubectl apply error to leave the infrastructure in a predictable state. Progress is reported back to the application on a separate gossip queue, providing almost-immediate feedback to the user on the progress of each state change.

While this has grown to be a powerful tool for us, there is another way to interact with the k8s API that we are now starting to leverage...

Operators

K8s has an extensible Operator pattern that you can use to manage your own Custom Resources (CRs) by writing and deploying a controller that reconciles your current cluster state into its desired state, as specified by CR yaml spec files that are applied to the cluster. This is also how the functionality of the basic k8s building blocks are implemented, which just further emphasizes how powerful this model can be.

Operators have the ability to hook into the k8s API server and listen for changes to resources inside a cluster. These changes get processed by a controller, which then kicks off a reconciliation loop where you can add your custom logic to perform any number of actions, ranging from simple resource existence to complex Day 2 Operations. This is an ideal solution to our management problem; we can offload much of our imperative code into a native k8s object, and database-specific operations appear to be as seamless as the standard set of k8s building blocks. Many existing database products use operators to accomplish this, and more are currently in development (see the Data on Kubernetes community for more information on these efforts).

As you can imagine, coordinating activities like backups, restores, and clustering inside a mostly stateless and idempotent reconciliation loop isn't the easiest. Even if you follow best practices by writing a variety of simple controllers, with each managing its own clearly-defined CR, the reconciliation logic can still be very error-prone and time-consuming to write. While frameworks like Operator SDK exist to help you with scaffolding your operator, and libraries like Kubebuilder provide a set of incredibly useful controller libraries, it's still a lot of work to undertake.

K8s is just a tool

At the end of the day, k8s is a single tool in the DevOps engineer's toolkit. These days, it's possible to host workloads in a variety of ways; using managed services (PaaS), k8s, VMs, or even running on a bare metal server. The tool that you choose depends on a variety of factors including time, experience, performance requirements, ease of use, and cost.

While hosting a database on k8s might be a fit for your organization, it just as easily could create even more overhead and instability if not done carefully. Implementing the Day 2 features that I described above is time-consuming and costly to get right. Testing is incredibly important, since you want to be absolutely sure that your (and your customers') precious data is kept safe and accessible when it's needed.

If you just need a reliable database to run your application on top of, then maybe all of the work required to run a database on k8s might be too much for you to undertake. But if your database has strong k8s support (most likely via an operator), or you are doing something unique (and at-scale) with your storage layer, it might be worth it to look more into managing your stateful databases on k8s. Just be prepared for a large time investment and ensure that you have the requisite in-house knowledge (or support) so that you can be confident that you're performing your database automation activities correctly and safely.

We've spent the past year building our own managed database service on top of k8s. If you want to check out what we've built, you can visit the QuestDB Cloud page and see it for yourself!


Originally posted on the QuestDB Blog. All images are common use from https://www.pexels.com/.

Kubebuilder and Operator-SDK Tips and Tricks

Recently, I've been spending a lot of time writing a Kubernetes operator using the go operator-sdk, which is built on top of the Kubebuilder framework. This is a list of a few tips and tricks that I've compiled over the past few months working with these frameworks.

Log Formatting

Kubebuilder, like much of the k8s ecosystem, utilizes zap for logging. Out of the box, the Kubebuilder zap configuration outputs a timestamp for each log, which gets formatted using scientific notation. This makes it difficult for me to read the time of an event just by glancing at it. Personally, I prefer ISO 8601, so let's change it!

In your scaffolding's main.go, you can configure your current logger format by modifying the zap.Options struct and calling ctrl.SetLogger.

opts := zap.Options{
    Development: true,
    TimeEncoder: zapcore.ISO8601TimeEncoder,
}

ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

In this case, I added the zapcore.ISO8601TimeEncoder, which encodes timestamps to human-readable ISO 8601-formatted strings. It took a bit of digging, along with a bit of help from the Kubernetes Slack org, to figure this one out. But it's been a huge quality-of-life improvement when debugging complex reconcile loops, especially in a multithreaded environment.

MaxConcurrentReconciles

Speaking of multithreaded environments, by default, an operator will only run a single reconcile loop per-controller. However, in practice, especially when running a globally-scoped controller, it's useful to run multiple concurrent reconcile loops to simultaneously handle many resource changes at once. Luckily, the Operator SDK makes this incredibly easy with the MaxConcurrentReconciles setting. We can set this up in a new controller's SetupWithManager func:

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
		WithOptions(controller.Options{MaxConcurrentReconciles: 10}).
        ...
        Complete(r)
}

I've created a command line arg in my main.go file that allows the user to set this value to any integer value, since this will likely be a tweaked over time depending on how the controller performs in a production cluster.

Parent-Child Relationships

One of the basic functions of a controller is to act as a parent to Kubernetes resources. This allows the controller to "own" these objects such that when it is deleted, all child objects are automatically garbage collected by the Kubernetes runtime.

I like this small function that can be called for any client.Object to add a parent reference to the controller that you're writing.

func (r *CustomReconciler) ownObject(ctx context.Context, cr *myapiv1alpha1.CustomResource, obj client.Object) error {

	err := ctrl.SetControllerReference(cr, obj, r.Scheme)
	if err != nil {
		return err
	}
	return r.Update(ctx, obj)
}

You can then add Owns watches for these resources in your SetupWithManager func. These will instruct your controller to listen for changes in child resources of the specified types, triggering a reconcile loop on each change.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Owns(&v1apps.Deployment{}).
		Owns(&v1core.ConfigMap{}).
		Owns(&v1core.Service{}).
        Complete(r)
}

Watches

Your controller can also watch resources that it doesn't own. This is useful for when you need to watch for changes in globally-scoped resources like PersistentVolumes or Nodes. Here's an example of how you would register this watch in your SetupWithManager func.

func (r *CustomReconciler) SetupWithManager(mgr ctrl.Manager) error {

    return ctrl.NewControllerManagedBy(mgr).
        Watches(
            &source.Kind{Type: &v1core.Node{}},
            handler.EnqueueRequestsFromMapFunc(myNodeFilterFunc),
            builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
        ).
        Complete(r)
}

In this case, you need to implement myNodeFilterFunc to accept an obj client.Object and return []reconcile.Request. Using the ResourceVersionChangedPredicate triggers the filter function for every change on that resource type, so it's important to write your filter function to be as efficient as possible, since there is a chance that it could be called quite a bit, especially if your controller is globally-scoped.

Field Indexers

One gotcha that I encountered happened when trying to query for a list of Pods that are running on a particular Node. This query uses a FieldSelector filter, as seen here:

// Get a list of all pods on the node
err := c.List(ctx, &pods, &client.ListOptions{
    Namespace:     "",
    FieldSelector: fields.ParseSelectorOrDie(fmt.Sprintf("spec.nodeName=%s", node.Name)),
})

This codepath led to the following error: Index with name field:spec.nodeName does not exist. After some googling around, I found this GitHub issue that referenced a Kubebuilder docs page which contained the answer.

Controllers created using operator-sdk and Kubebuilder use a built-in caching mechanism to store results of API requests. This is to prevent spamming the K8s API, as well as improve reconciliation performance.

When performing resource lookups using FieldSelectors, you first need to add your desired search field to an index that the cache can use for lookups. Here's an example that will build this index for a Pod's nodeName:

if err := mgr.GetFieldIndexer().IndexField(context.TODO(), &v1core.Pod{}, "spec.nodeName", func(rawObj client.Object) []string {
    pod := rawObj.(*v1core.Pod)
    return []string{pod.Spec.NodeName}
}); err != nil {
    return err
}

Now, we can run the List function from above with the FieldSelector with no issues.

Retries on Conflicts

If you've ever written controllers, you're probably very familiar with the error Operation cannot be fulfilled on ...: the object has been modified; please apply your changes to the latest version and try again

This occurs when the version of the resource that you're currently reconciling in your controller is out-of-date with what's in latest version of the K8s cluster state. If you're retrying your reconciliation loop on any errors, your controller will eventually reconcile the resource, but this can really pollute your logs and make it difficult to spot more important errors.

After reading through the k8s source, I found the solution to this: RetryOnConflict. It's a utility function in the client-go package that runs a function and automatically retries on conflict, up to a certain point.

Now, you can just wrap your logic inside this function argument, and never have to worry about this issue again! And the added benefit is that you just get to return err instead of return ctrl.Result{}, err, which makes your code that much easier to read.

Useful Kubebuilder Markers

Here are some useful code markers that I've found while developing my operator.

  1. To add custom columns to your custom resource's description (when running kubectl get), you can add annotations to your API object like these:
//+kubebuilder:printcolumn:name="NodeReady",type="boolean",JSONPath=".status.nodeReady"
//+kubebuilder:printcolumn:name="NodeIp",type="string",JSONPath=".status.nodeIp"
  1. To add a shortname to your custom resource (like pvc for PersistentVolumeClaim for example), you can add this annotation:
//+kubebuilder:resource:shortName=mycr;mycrs

More docs on kubebuilder markers can be found here:

https://book.kubebuilder.io/reference/markers/crd.html

Using QuestDB to collect infrastructure metrics

Originally posted on the QuestDB Blog

Introduction

One of my favorite things about QuestDB is the ability to write queries in SQL against a high-performance time series database. Since I've been using SQL as my primary query language for basically my entire professional career, it feels natural for me to interact with data using SQL instead of other newer proprietary query languages. Combined with QuestDB's custom SQL extensions, its built-in SQL support makes writing complex queries a breeze.

In my life as a Cloud Engineer, I deal with time series metrics all the time. Unfortunately, many of today's popular metrics databases don't support the SQL query language. As a result, I've become more dependent on pre-built dashboards, and it takes me longer to write my own queries with JOINs, transformations, and temporal aggregations.

QuestDB can be a great choice for ingesting application and infrastructure metrics, it just requires a little more work on the initial setup than the Kubernetes tooling du jour. Despite this extra upfront time investment (which is fairly minimal in the grand scheme of things), I think that the benefits of using QuestDB for infrastructure metrics are worth it. With QuestDB, you get industry-leading performance and the ability to interact with the database in the most commonly-used query language in existence. We're even using QuestDB to display customer database metrics in our own QuestDB Cloud!

In this article, I will demonstrate how we use QuestDB as the main component in this new feature. This should provide enough information for you to also use QuestDB for ingesting, storing, and querying infrastructure metrics in your own clusters.

Architecture

Prometheus is a common time series database that is already installed in many Kubernetes clusters. We will be leveraging its remote write functionality to pipe data into QuestDB for querying and storage. However, since Prometheus remote write does not support the QuestDB-recommended InfluxDB Line Protocol (ILP) as a serialization format, we need to use a proxy to translate Prometheus-formatted metrics into ILP messages. We will be use Influx Data's Telegraf as this translation component.

Now, with our data in QuestDB, we can use SQL to query our metrics using any one of the supported methods: the Web Console, PostgreSQL wire protocol, or HTTP REST API.

Here's a quick overview of the architecture:

Infrastructure diagram

Prometheus remote write

While Prometheus operates on an interval-based pull model, it also has the ability to push metrics to remote sources. This is known as "remote write" capability, and is easily configurable in a yaml file. Here's an example of a basic remote write configuration:

remoteWrite:
  - url: http://default.telegraf.svc:9999/write
    name: questdb-telegraf
    remote_timeout: 10s

This yaml will configure Prometheus to send samples to the specified url with a 10 second timeout. In this case, we will be forwarding our metrics on to telegraf, with a custom port and endpoint that we can specify in the telegraf config (see below for more details). There are also a variety of other remote write options, allowing users to customize timeouts, headers, authentication, and additional relabling configs before writing to the remote data store. All of the possible options can be found on the Prometheus website.

QuestDB ILP and Telegraf

Now that we have our remote write configured, we need to set up its destination. Installing telegraf into a cluster is straightforward, just helm install its Helm chart.

We do need to configure telegraf to read from a web socket (where Prometheus is configured to writing to) and send to QuestDB for long-term storage. In a Kubernetes deployment, these options can be set in the config section of the telegraf Helm chart's values.yaml file.

Input Configuration

Since telegraf will be receiving metrics from Prometheus, we need to open a port that enables communication between the two services. Telegraf has an HTTP listener plugin that allows it to listen for traffic on a specified port. We also need to configure the path of the listener to match our Promtheus remote write url.

The HTTP listener (v2) supports multiple data formats to consume via its plugin architecture. A full list of options can be found in the telegraf docs. We will be using the Prometheus Remote Write Parser Plugin to accept our Prometheus messages.

Here is how this setup looks in the telegraf config:

[[inputs.http_listener_v2]]
  ## Address and port to host HTTP listener on
  service_address = ":9999"

  ## Paths to listen to.
  paths = ["/write"]

  ## Data format to consume.
  data_format = "prometheusremotewrite"

When passing these values to the Helm chart, you can use this yaml specification:

config:
  inputs:
    - http_listener_v2:
        service_address: ":9999"
        path: "/write"
        data_format: prometheusremotewrite

Output configuration

We recommend that you use the InfluxDB Line Protocol (ILP) over TCP to insert data into QuestDB. Luckily, telegraf includes an ILP output plugin! But unfortunately, this is not a plug-and-play solution. By default, all metrics will be written to a single measurement, prometheus_remote_write, with the individual metric's key being sent over the wire as a field. In practice, this means that all of your metrics will be written to a single QuestDB table, called prometheus_remote_write. There will then be an additional column for every single metric AND field that you are capturing. This leads to a large table, with potentially thousands of columns, that's difficult to work with and contains all sparse data, which could negatively impact performance.

To fix this problem, Telegraf provides us with a sample starlark script that transforms each measurement such that we will have a table-per-metric in QuestDB. This script will run for every metric that telegraf receives, so the output will be formatted correctly.

This is what telegraf's output config looks like:

[[outputs.socket_writer]]
  ## Address and port to write to
  address = "tcp://questdb.questdb.svc:9009"

[[processors.starlark]]
      source = '''
    def apply(metric):
      if metric.name == "prometheus_remote_write":
            for k, v in metric.fields.items():
                metric.name = k
                metric.fields["value"] = v
                metric.fields.pop(k)
      return metric
    '''

As an added benefit to using ILP with QuestDB, we don't even have to worry about each metric's fieldset. Over ILP, QuestDB automatically creates tables for new metrics. It also adds new columns for fields that it hasn't seen before, and INSERTs nulls for any missing fields.

Helm configuration

I've found that the easiest way to configure the values.yaml file is to mount the starlark script as a volume, and add a reference to it in the config. This way we don't need to deal with any whitespace-handling or special indentation in our ConfigMap specification.

The output and starlark Helm configuration would look like this:

# continued from above
# config:
outputs:
  - socket_writer:
      address: tcp://questdb.questdb.svc:9009
processors:
  - starlark:
      script: /opt/telegraf/remotewrite.star

We also need to add the volume and mount at the root level of the values.yaml:

volumes:
  - name: starlark-script
    configMap:
      name: starlark-script
mountPoints:
  - name: starlark-script
    mountPath: /opt/telegraf
    subpath: remotewrite.star

This volume references a ConfigMap that contains the starlark script from the above example:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: starlark-script
data:
  remotewrite.star: |
    def apply(metric):
      ...

Querying metrics with SQL

QuestDB has some powerful SQL extensions that can simplify writing time series queries. For example, given the standard set of metrics that a typical Prometheus installation collects, we can use QuestDB to not only find pods with the highest memory usage in a cluster (over a 6 month period), but also find the specific time period when the memory usage spiked. We can even access custom labels to help identify the pods with a human-readable name (instead of the long alphanumeric name assigned to pods by deployments or stateful sets). This is all performed with a simple SQL syntax using JOINs (enhanced by the ASOF keyword) and SAMPLE BY to bucket data into days with a simple line of SQL:

SELECT l.custom_label, w.timestamp, max(w.value / r.value) as mem_usage
FROM container_memory_working_set_bytes AS w
ASOF JOIN kube_pod_labels AS l ON (w.pod = l.pod)
ASOF JOIN kube_pod_container_resource_limits AS r ON (
  r.pod = w.pod AND
  r.container = w.container
)
WHERE custom_label IS NOT NULL
  AND r.resource = 'memory'
  AND w.timestamp > '2022-06-01'
SAMPLE BY 1d
ALIGN TO CALENDAR TIME ZONE 'Europe/Berlin'
ORDER BY mem_usage DESC;

Here's a sample output of that query:

custom_labeltimestampmem_usage
keen austin2022-07-04T16:18:00.000000Z0.999853875401
optimistic banzai2022-07-12T16:18:00.000000Z0.9763028946
compassionate taussig2022-07-11T16:18:00.000000Z0.975367909527
cranky leakey2022-07-11T16:18:00.000000Z0.974941994418
quirky morse2022-07-05T16:18:00.000000Z0.95084235665
admiring panini2022-06-21T16:18:00.000000Z0.925567626953

This is only one of many ways that you can use QuestDB to write powerful time-series queries that you can use for one-off investigation or to power dashboards.

Metric retention

Since databases storing infrastructure metrics can grow to extreme sizes over time, it is important to enforce a retention period to free up space by deleting old metrics. Even though QuestDB does not support the traditional DELETE SQL command, you can still implement metric retention by using the DROP PARTITION command.

In QuestDB, data is stored by columns on-disk and optionally partitioned by a time duration. By default, when using ILP to ingest metrics, and a new table is automatically created, it is partitioned by DAY. This allows us to DROP PARTITIONs on a daily basis. If you need a different partitioning scheme, you can create the table with your desired partition period before ingesting any data over ILP, since ALTER TABLE does not support any changes to table partitioning. But since ILP does automatically add columns, the table specification can be very simple, with just the name and a timestamp column.

Once you've decided on your desired metric retention period, you can create a cron job that removes all partitions older than your oldest retention date. This will help keep your storage usage in check.

For more information about data retention in QuestDB, you can check out the docs.

Working example

I have created a working example of this setup in a repo, sklarsa/questdb-metrics-blog-post. The entire example runs in a local Kind cluster.

To run the example, execute the following commands:

git clone https://github.com/sklarsa/questdb-metrics-blog-post.git
cd questdb-metrics-blog-post
./run.sh

After a few minutes, all pods should be in ready with the following prompt:

You can now access QuestDB here: http://localhost:9000
Ctrl-C to exit
Forwarding from 127.0.0.1:9000 -> 9000
Forwarding from [::1]:9000 -> 9000

From here, you can navigate to http://localhost:9000 and explore the metrics that are being ingested into QuestDB. The default Prometheus scrape interval is 30 seconds, so there might not be a ton of data in there, but you should see a list of tables, one per each metric that we are collecting:

Metrics in the console

Once you're done, you can clean up the entire experiment by deleting the cluster:

./cleanup.sh

Conclusion

QuestDB can be a very powerful piece in the Cloud Engineer's toolkit. It grants you the ability to run complex time-series queries across multiple metrics with unparalleled speed in the world's most ubiquitous query language, SQL. Every second counts when debugging an outage at 2AM, and reducing the cognitive load of writing queries, as well as their execution time, is a game-changer for me.