All posts
Scaling ClickHouse® on Kubernetes: Shards, Availability Zones, and Anti-Affinity

Scaling ClickHouse® on Kubernetes: Shards, Availability Zones, and Anti-Affinity

May 26, 20265 min readGayathri
Share:

This is the ninth article in our series on running the ClickHouse® database on Kubernetes with the Altinity® Kubernetes Operator. We have a replicated cluster with two copies of the same data. Now we scale out: spread the data across more machines with shards, and make the placement of pods deliberate so a single failure cannot take down all copies of anything.

Replicas scale safety, shards scale size

A quick recap. A replica is a complete copy of the data; more replicas mean more fault tolerance and more read capacity, but every replica stores the whole dataset. A shard is a slice; each shard holds a different portion of the rows, so more shards mean the data and the query work are split across more machines. Real clusters combine both: several shards, each with a couple of replicas.

In the operator you control this with two numbers in the cluster layout, shardsCount and replicasCount. A cluster with shardsCount: 2 and replicasCount: 2 has four ClickHouse pods: two shards, each replicated twice.

Adding shards

Here is a two-shard, two-replica cluster. Save it as sharded.yaml (it assumes a Keeper named keeper exists, as built in the Keeper article):

apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: "ch"
spec:
  configuration:
    zookeeper:
      keeper:
        name: keeper
    clusters:
      - name: "main"
        layout:
          shardsCount: 2
          replicasCount: 2
  templates:
    podTemplates:
      - name: clickhouse-pod
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.3

After applying it you will see four pods: chi-ch-main-0-0-0 and chi-ch-main-0-1-0 for shard 0's two replicas, and chi-ch-main-1-0-0 and chi-ch-main-1-1-0 for shard 1's. The Distributed table from the previous article now automatically spans both shards, because it is defined over the {cluster} topology. Queries fan out to both shards and merge, and inserts spread rows across them. You scaled the dataset across more machines by changing one number.

The placement problem

By default Kubernetes is free to schedule pods anywhere. That is dangerous for a database. Imagine both replicas of shard 0 landing on the same physical node: if that node dies, you lose both copies and that shard goes offline. Similarly, if every pod lands in the same cloud availability zone, a single zone outage takes the whole cluster down. We want the opposite: copies spread apart. The operator gives you two tools for this, anti-affinity and zones.

Anti-affinity: keep copies on different nodes

Anti-affinity is a scheduling rule that tells Kubernetes to keep certain pods apart. The operator exposes it through podDistribution on a pod template. Adding a ClickHouseAntiAffinity distribution tells the operator to generate the underlying Kubernetes anti-affinity so its ClickHouse pods avoid sharing a node:

spec:
  templates:
    podTemplates:
      - name: clickhouse-pod
        podDistribution:
          - type: ClickHouseAntiAffinity
            scope: ClickHouseInstallation
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.3

The scope controls how wide the rule reaches; ClickHouseInstallation spreads all pods of this installation across nodes. With this in place, the scheduler will not co-locate two ClickHouse pods on the same node, so no single node holds two copies of the same shard.

One practical caveat for the local lab: spreading pods across nodes requires more than one node. A default minikube has a single node, so strict anti-affinity would leave extra pods unschedulable. To try it locally, start a multi-node cluster with minikube start --nodes 3. On a real cloud cluster you already have many nodes.

Zones: survive an availability zone outage

Cloud providers group machines into availability zones that fail independently. Spreading replicas across zones means a zone outage costs you at most one copy. Kubernetes labels each node with its zone under the well-known label topology.kubernetes.io/zone. The operator's pod template has a zone field that places pods on nodes matching a zone value:

spec:
  templates:
    podTemplates:
      - name: ch-zone-a
        zone:
          values:
            - "us-east-1a"
        podDistribution:
          - type: ClickHouseAntiAffinity
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.3
      - name: ch-zone-b
        zone:
          values:
            - "us-east-1b"
        podDistribution:
          - type: ClickHouseAntiAffinity
        spec:
          containers:
            - name: clickhouse
              image: clickhouse/clickhouse-server:26.3

You then assign different replicas to different zone templates in the cluster layout, so replica 1 lives in zone A and replica 2 in zone B. The operator translates the zone field into Kubernetes node affinity, scheduling each pod onto a node in the requested zone. Combined with anti-affinity, your copies are now spread across both separate nodes and separate zones, which is exactly what a resilient cluster needs.

Putting placement together

A robust production layout looks like this in words: two or more shards to split the data, two replicas per shard for safety, each shard's replicas pinned to different zones, and anti-affinity ensuring no two ClickHouse pods share a node. Every part of that is expressed declaratively in the CHI, and the operator keeps it true as the cluster changes. You describe the safety property you want; the operator enforces it through Kubernetes scheduling.

A note on local testing

On a single-node minikube you can still create shards and replicas, but anti-affinity and zone placement have no effect because there is only one node in one zone. Use a multi-node minikube or a cloud cluster to see placement actually take hold, and inspect where pods landed with kubectl get pods -n ch -o wide, which shows the node for each pod.

What is next

Your cluster is scaled and its copies are spread for safety. In the next article we handle change over time: upgrading the ClickHouse version with zero downtime using canary and rolling updates, and upgrading the operator itself.

References

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: