Upgrading ClickHouse® on Kubernetes: Canary and Rolling Updates with Zero Downtime

This is the tenth article in our series on running the ClickHouse® database on Kubernetes with the Altinity® Kubernetes Operator. A running cluster is not a finished cluster. New ClickHouse versions ship monthly, and the operator itself gets updates. This article shows how to apply those changes without taking your database offline.

Two different upgrades

There are two things you might upgrade, and they are independent. One is the ClickHouse version, the database image your pods run. The other is the operator version, the controller that manages your clusters (it runs in kube-system if you installed it from the default kubectl bundle). We cover both, starting with the more common one.

How a rolling ClickHouse upgrade works

To change the ClickHouse version you change the image in your CHI and reapply. The operator does not restart everything at once. It performs a rolling update, one host at a time: it optionally removes the host from the cluster's routing, restarts it on the new image, waits for it to come back healthy, and only then moves to the next. Because your data is replicated, the other replicas keep serving queries while one pod restarts. Done this way, a version change causes no downtime, provided every shard has at least two replicas.

To upgrade, edit the image tag in your pod template, for example from 26.3 to 26.4, and reapply:

spec:
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.4

kubectl apply -n ch -f cluster.yaml
kubectl get pods -n ch -w

You will see pods restart one by one, each picking up the new version, while the cluster stays available throughout.
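To see how far the rollout has progressed, it can help to list which image each pod is actually running. The following is a minimal sketch rather than an operator feature: the helper names pod_images and pods_not_on_image are hypothetical, and it assumes the ClickHouse container is the first container in each pod, as in the pod template above.

```shell
# List every pod in a namespace with the image of its first container,
# one "name image" pair per line.
# Assumption: the clickhouse container is the first container in the pod.
pod_images() {
  ns="$1"
  kubectl get pods -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[0].image}{"\n"}{end}'
}

# Read "name image" lines on stdin and print the names of pods
# that are not yet on the target image.
pods_not_on_image() {
  target="$1"
  awk -v target="$target" '$2 != target { print $1 }'
}

# Usage, with the namespace and image from the example above:
#   pod_images ch | pods_not_on_image clickhouse/clickhouse-server:26.4
```

An empty result means every pod in the namespace reports the new image. If the namespace also holds other workloads, such as Keeper pods, they will show up in the list and need to be filtered out.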

Canary first: test a new version on one replica

Moving every node onto a brand new version in a single rollout is risky, even when the restarts happen one at a time: if the release misbehaves, the whole cluster ends up on it. The safer pattern is a canary: run the new version on a single replica, verify it behaves, and only then roll it out everywhere. The operator supports this by letting you override the pod template for one specific replica while the rest stay on the current version.

You define two pod templates, the current version and the candidate, then point one replica of one shard at the candidate template:

spec:
  configuration:
    zookeeper:
      keeper:
        name: keeper
    clusters:
      - name: "main"
        templates:
          podTemplate: ch-current
        layout:
          shardsCount: 2
          replicasCount: 2
          shards:
            - name: "0"
              replicas:
                - name: "1"
                  templates:
                    podTemplate: ch-candidate
  templates:
    podTemplates:
      - name: ch-current
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.3
      - name: ch-candidate
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.4

When you apply this, only that one replica restarts onto the candidate version. The rest of the cluster keeps running the current version. You now have a real node on the new version, taking real replicated traffic, that you can observe.
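One way to confirm that exactly one replica is on the candidate is to ask ClickHouse itself which version each replica reports. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the cluster is reachable under the name main from the CHI above and that clickhouse-client can connect with default credentials inside the pod.

```shell
# Ask one pod for the version reported by every replica in the cluster.
# clusterAllReplicas() fans the query out to all replicas of the named cluster.
cluster_versions() {
  ns="$1"; pod="$2"; cluster="$3"
  kubectl exec -n "$ns" "$pod" -- clickhouse-client -q \
    "SELECT hostName(), version() FROM clusterAllReplicas('$cluster', system.one) ORDER BY hostName()"
}

# Read tab-separated "host version" lines on stdin and print
# how many distinct versions appear.
distinct_version_count() {
  awk '{ seen[$2] = 1 } END { n = 0; for (v in seen) n++; print n }'
}

# Usage, with the pod name taken from the verification example in this article:
#   cluster_versions ch chi-ch-main-0-0-0 main
#   cluster_versions ch chi-ch-main-0-0-0 main | distinct_version_count
```

While the canary is running you would expect two distinct versions, with the candidate appearing on a single host. Once the new version has been rolled out everywhere, the count should drop back to one.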

Propagating the update

Once the canary has proven itself, you propagate the new version to the whole cluster: point the cluster's default pod template at the new image, delete the per-replica override, and reapply. The snippet below shows only the fields that change; keep the rest of your CHI, including the zookeeper section, as it was:

spec:
  configuration:
    clusters:
      - name: "main"
        templates:
          podTemplate: ch-new
        layout:
          shardsCount: 2
          replicasCount: 2
  templates:
    podTemplates:
      - name: ch-new
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.4

The operator rolls the remaining nodes onto the new version one at a time, exactly as before, with no downtime. Canary, observe, propagate is the pattern to internalize for every version bump.

A real upgrade gotcha to know about

Operational honesty matters, so here is a concrete caution from the operator's own release notes. On some newer ClickHouse versions running on Kubernetes, an upstream regression could cause DDL statements, such as CREATE TABLE, to fail on freshly created pods. The documented workaround is simply to restart the affected ClickHouse pods, and the underlying issue is fixed in current patch releases. The lesson is general: before upgrading a production cluster, read the release notes for both ClickHouse and the operator, and test on a canary first. Surprises are far cheaper to discover on one node than on all of them.

Upgrading the operator itself

The operator is upgraded separately from your clusters. If you installed it with kubectl, apply the newer bundle:

kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml

If you installed it with Helm, update the repository and upgrade the release. Note that the Helm chart asks you to apply the updated Custom Resource Definitions separately during an upgrade, so follow the chart's upgrade instructions:

helm repo update clickhouse-operator
helm upgrade clickhouse-operator clickhouse-operator/altinity-clickhouse-operator

Upgrading the operator can trigger a rolling restart of managed clusters if pod templates or labels changed between versions, so treat an operator upgrade with the same care as a database upgrade: read its release notes, do it during a quiet window, and watch the rollout.
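Before and after an operator upgrade it is useful to record which operator images are actually deployed. This is a small sketch under stated assumptions: the helper name operator_images is hypothetical, and the defaults assume a Deployment called clickhouse-operator in kube-system, which matches the kubectl bundle install; a Helm release may use a different name or namespace.

```shell
# Print the container images of the operator Deployment, one per line.
# Assumed defaults: namespace kube-system, Deployment clickhouse-operator.
operator_images() {
  ns="${1:-kube-system}"
  deploy="${2:-clickhouse-operator}"
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}'
}

# Usage:
#   operator_images                        # record the images before upgrading
#   kubectl rollout status deployment/clickhouse-operator -n kube-system --timeout=120s
#   operator_images                        # confirm the new images afterwards
```

Comparing the two listings shows whether the upgrade really changed the running operator, and kubectl rollout status blocks until the new operator pod is ready or the timeout expires.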

Verifying after any upgrade

After an upgrade, confirm the version is what you expect and the cluster is healthy:

kubectl exec -n ch chi-ch-main-0-0-0 -- clickhouse-client -q "SELECT version()"
kubectl get chi -n ch

A Completed status on the CHI and the expected version string mean the rollout finished cleanly.
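In a script or CI job you may prefer to wait for the Completed status rather than check it by eye. The following is a rough sketch, not part of the operator: wait_for_chi is a hypothetical helper, and it assumes the status shown by kubectl get chi is stored in the resource's .status.status field and that the CHI is named ch, as the pod name above suggests.

```shell
# Poll the CHI until its status reads "Completed" or the timeout expires.
# Returns 0 on success and 1 on timeout. The interval must be greater than 0.
# Assumption: the CHI status is exposed at .status.status.
wait_for_chi() {
  ns="$1"; chi="$2"; timeout="${3:-600}"; interval="${4:-10}"
  waited=0
  status=""
  while [ "$waited" -le "$timeout" ]; do
    status="$(kubectl get chi "$chi" -n "$ns" -o jsonpath='{.status.status}')"
    if [ "$status" = "Completed" ]; then
      echo "CHI $chi is Completed"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out waiting for CHI $chi (last status: $status)" >&2
  return 1
}

# Usage: wait up to 10 minutes, polling every 10 seconds.
#   wait_for_chi ch ch 600 10
```

A zero exit code only says the operator finished reconciling. It is still worth running the SELECT version() check on the pods to confirm they ended up on the release you intended.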

What is next

You can now evolve the cluster safely over time. In the next article we add eyes to the system: monitoring ClickHouse on Kubernetes with Prometheus and Grafana, so you can see health, performance, and problems before your users do.
