Troubleshooting ClickHouse® on Kubernetes: A Practical Debugging Guide

This is the sixteenth article in our series on running the ClickHouse® database on Kubernetes with the Altinity® Kubernetes Operator. Even a well-built cluster eventually misbehaves. This article gives you a repeatable method for finding out what is wrong and the fixes for the failures you are most likely to hit.

A debugging method that always works

When something is broken, resist the urge to guess. Follow the same path every time, from the high-level resource down to the individual container. Each step narrows the problem.

First, check the cluster's own status:

kubectl get chi -n ch

The status tells you a lot. Completed means the operator finished successfully. InProgress means it is still reconciling, so wait. Aborted means the operator hit a problem it would not push through, and you need to look closer.

Second, ask the resource to explain itself:

kubectl describe chi ch -n ch

The events and status messages at the bottom often name the problem directly.

Third, look at the pods:

kubectl get pods -n ch
kubectl describe pod <pod-name> -n ch

A pod's status and its events explain why it is not running.

Fourth, read logs, both the operator's and the database's:

kubectl logs -n kube-system -l app=clickhouse-operator --tail=100
kubectl logs <pod-name> -n ch

This sequence of status, describe, events, and logs resolves the large majority of issues. Now let us apply it to the common ones.
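The four steps can be bundled into one shell function so the order never varies. This is a minimal sketch, not something the operator ships: the name ch_triage is hypothetical, and it assumes the operator runs in kube-system with the app=clickhouse-operator label, as in the commands above.

```shell
# ch_triage NAMESPACE CHI_NAME
# Runs the sequence in order: CHI status, CHI describe, pods and events,
# then operator logs. Hypothetical helper; assumes the operator is in
# kube-system with the label app=clickhouse-operator.
ch_triage() {
  ns="$1"
  chi="$2"
  if [ -z "$ns" ] || [ -z "$chi" ]; then
    echo "usage: ch_triage NAMESPACE CHI_NAME" >&2
    return 2
  fi
  echo "== 1. CHI status =="
  kubectl get chi -n "$ns"
  echo "== 2. CHI description =="
  kubectl describe chi "$chi" -n "$ns"
  echo "== 3. Pods and recent events =="
  kubectl get pods -n "$ns"
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
  echo "== 4. Operator logs =="
  kubectl logs -n kube-system -l app=clickhouse-operator --tail=100
}

# Example: ch_triage ch ch
```

Database logs are left out on purpose, because they need a specific pod name that you only learn from step 3.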

Pod stuck in Pending

A Pending pod has not been scheduled onto any node. The describe pod output names the reason in its events. Three causes dominate. The storage claim cannot be satisfied, because no StorageClass matched or the cluster cannot provision the disk; check kubectl get pvc -n ch for a claim stuck in Pending. There is not enough CPU or memory on any node to meet the pod's requests; lower the requests or add capacity. Or a placement rule cannot be met, most often anti-affinity on a single-node cluster, where the scheduler refuses to co-locate pods; either add nodes or relax the rule, as discussed in the scaling article.
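To sort a scheduling event into one of those three causes quickly, a small text filter can help. The sketch below is illustrative only: classify_pending is a hypothetical name, and the phrases it matches are assumptions about scheduler wording, which can differ between Kubernetes versions.

```shell
# classify_pending: read FailedScheduling event text on stdin and print
# which of the three common causes it mentions. A message can name more
# than one cause, so every match is printed.
# The matched phrases are assumptions about scheduler wording.
classify_pending() {
  msg=$(cat)
  found=0
  case "$msg" in
    *PersistentVolumeClaim*) echo "storage: a claim is not bound"; found=1 ;;
  esac
  case "$msg" in
    *Insufficient*) echo "resources: requests exceed free CPU or memory"; found=1 ;;
  esac
  case "$msg" in
    *anti-affinity*) echo "placement: an anti-affinity rule cannot be met"; found=1 ;;
  esac
  if [ "$found" -eq 0 ]; then
    echo "unknown: read the full event text"
  fi
}

# Example:
#   kubectl get events -n ch --field-selector reason=FailedScheduling \
#     -o jsonpath='{.items[*].message}' | classify_pending
```

Treat the output as a hint about where to look first; the full event text in describe pod remains the authority.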

CrashLoopBackOff

CrashLoopBackOff means the container starts and immediately dies, repeatedly. The pod logs are your evidence:

kubectl logs <pod-name> -n ch --previous

The --previous flag shows the last crashed instance's output, which is where the real error is. For ClickHouse this is usually a configuration mistake: an unrecognized setting name, a malformed file added through the files block, or an invalid value in settings. ClickHouse prints the offending option as it refuses to start. Fix the value in the CHI and reapply.
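Crash output can be long, so it helps to keep only the lines that carry the failure. The following is a rough sketch: ch_errors and crash_errors are hypothetical names, and the filter assumes ClickHouse's usual log format, in which the level appears in angle brackets (for example <Error>) and failures mention Exception.

```shell
# ch_errors: keep only error-level or exception lines from stdin.
# Assumes the usual ClickHouse log format with levels like <Error>.
ch_errors() {
  grep -E '<(Error|Fatal)>|Exception'
}

# crash_errors POD NAMESPACE
# Filters the previous (crashed) container's output down to failures.
crash_errors() {
  kubectl logs "$1" -n "$2" --previous --tail=200 2>&1 | ch_errors
}

# Example: crash_errors chi-ch-main-0-0-0 ch
```

If the filter prints nothing, fall back to reading the unfiltered output, since a startup failure may be reported in a form this pattern does not match.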

Unbound PersistentVolumeClaim

If kubectl get pvc -n ch shows a claim stuck in Pending, storage provisioning failed. Common reasons are naming a StorageClass that does not exist, the cluster having no default StorageClass while your template names none, or the underlying provisioner being unable to create the disk. List your classes with kubectl get storageclass, confirm the name in your volumeClaimTemplates matches one of them, and on a local cluster make sure the default storage provisioner is enabled.
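The two StorageClass checks can be scripted. This is a sketch under assumptions: check_storage_class is a hypothetical helper, and it relies on kubectl get storageclass marking the default class with "(default)" in its table output.

```shell
# check_storage_class [NAME]
# With a name: verify that StorageClass exists.
# Without a name: verify the cluster has a default StorageClass.
check_storage_class() {
  name="$1"
  if [ -n "$name" ]; then
    if kubectl get storageclass "$name" >/dev/null 2>&1; then
      echo "StorageClass '$name' exists"
    else
      echo "StorageClass '$name' not found"
      return 1
    fi
  elif kubectl get storageclass 2>/dev/null | grep -q '(default)'; then
    echo "a default StorageClass is set"
  else
    echo "no default StorageClass; name one in volumeClaimTemplates"
    return 1
  fi
}

# Example: check_storage_class standard
```

Pass the name from your volumeClaimTemplates, or pass nothing if your template relies on the cluster default.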

ClickHouse cannot reach Keeper

If replicated tables fail to create or replicas will not sync, ClickHouse may not be reaching Keeper. Confirm the Keeper pods are running with kubectl get chk -n ch and kubectl get pods -n ch. Check that your CHI's zookeeper.keeper.name points at the actual CHK name. Then verify connectivity from inside a ClickHouse pod:

kubectl exec -n ch chi-ch-main-0-0-0 -- \
  clickhouse-client -q "SELECT * FROM system.zookeeper WHERE path='/' FORMAT Vertical"

If that query errors, ClickHouse is not connected to Keeper, which usually means the Keeper reference is wrong or the Keeper pods are unreachable; if it returns nodes, coordination is healthy.
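When a cluster has several replicas, the same check is worth running on each pod. Here is a minimal sketch: keeper_check is a hypothetical name, and the pod name in the example follows the naming used above.

```shell
# keeper_check POD NAMESPACE
# Runs a trivial query against system.zookeeper inside the pod and
# reports whether ClickHouse could reach Keeper.
keeper_check() {
  pod="$1"
  ns="$2"
  if kubectl exec -n "$ns" "$pod" -- \
      clickhouse-client -q "SELECT count() FROM system.zookeeper WHERE path='/'" \
      >/dev/null 2>&1; then
    echo "$pod: Keeper reachable"
  else
    echo "$pod: Keeper NOT reachable"
    return 1
  fi
}

# Example: keeper_check chi-ch-main-0-0-0 ch
```

The helper hides the query's error text, so rerun the original command by hand on any pod it flags to see the actual message.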

An aborted reconcile from a bad update

When you apply a change that ClickHouse cannot accept, such as an invalid setting or a broken image, the new pod never becomes Ready, and the operator stops the rolling update rather than breaking the rest of the cluster. The CHI status goes to Aborted and describe explains why. The reliable fix is to revert: change the manifest back to the last known-good version and reapply. The operator reconciles forward to the corrected state. Because it stopped instead of marching on, your healthy replicas kept serving throughout, which is the safety behaviour working as intended.
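If the manifest lives in Git, the revert takes two commands. The sketch below assumes exactly that: revert_chi is a hypothetical helper, and the file name and revision in the example are placeholders for your own.

```shell
# revert_chi FILE GOOD_REVISION NAMESPACE
# Restores one manifest from a known-good Git revision and reapplies it.
revert_chi() {
  file="$1"
  rev="$2"
  ns="$3"
  if [ -z "$file" ] || [ -z "$rev" ] || [ -z "$ns" ]; then
    echo "usage: revert_chi FILE GOOD_REVISION NAMESPACE" >&2
    return 2
  fi
  # Restore only this file, then reapply it; skip the apply if Git fails.
  git checkout "$rev" -- "$file" && kubectl apply -n "$ns" -f "$file"
}

# Example (placeholder file and revision): revert_chi chi.yaml abc123 ch
```

Afterwards, kubectl get chi -n ch should show the status move through InProgress and back to Completed.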

The DDL-on-new-pods gotcha

One specific issue is worth recognizing because it looks alarming. On some newer ClickHouse versions running on Kubernetes, an upstream regression could cause statements like CREATE TABLE to fail on a freshly created pod. The documented workaround is simply to restart the affected ClickHouse pods, and the underlying issue is fixed in current patch releases. If DDL fails only on brand-new pods, check the operator and ClickHouse release notes and restart the pod before assuming a deeper problem.

A note for FIPS deployments

If you have enabled the operator's image policy, covered in the next article, a cluster can be aborted with a FIPSImagePolicyViolation reason when its image is not a FIPS build. The fix is to point the CHI or CHK at a properly FIPS-tagged image. The reason string in the status tells you exactly which policy tripped.

General tips

A few habits prevent most emergencies. Make one change at a time and watch it reconcile, so when something breaks you know what caused it. Keep your manifests in version control so reverting is trivial. Watch the monitoring you set up, since a problem often shows as a metric trend before it becomes an outage. And when in doubt, return to the method: status, describe, events, logs. It rarely fails to point at the cause.

What is next

You can now diagnose and fix a cluster under stress. The final article in the series covers a specialized but important topic for regulated environments: running ClickHouse on Kubernetes in a FIPS 140-3 compliant posture with the operator's FIPS controls.
