Monitoring ClickHouse® on Kubernetes with Prometheus and Grafana

This is the eleventh article in our series on running the ClickHouse® database on Kubernetes with the Altinity® Kubernetes Operator. A cluster you cannot see is a cluster you cannot trust. This article gives you eyes on the system with Prometheus for collecting metrics and Grafana for visualizing them.

What to monitor and why

For a ClickHouse cluster the metrics that matter most are query throughput and latency, memory usage, disk space, the number of active data parts (a high count signals merge pressure), and replication health (whether replicas are keeping up through Keeper). Watching these lets you catch a full disk, a memory-hungry query, or a lagging replica before it becomes an outage.

How metrics get out of ClickHouse

Two sources feed your monitoring. First, the operator runs a metrics exporter that scrapes every cluster it manages and republishes the data in Prometheus format. It is exposed by a Service called clickhouse-operator-metrics in the kube-system namespace on port 8888. Second, the operator annotates the ClickHouse pods so a Prometheus that does Kubernetes service discovery will find and scrape them automatically.

You can see the operator metrics directly. Port-forward the service and open the endpoint:

kubectl -n kube-system port-forward service/clickhouse-operator-metrics 8888

Then visit http://localhost:8888/metrics in a browser. The wall of text you see is the raw metric data that Prometheus will collect.
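If the raw page is too much to scan, a small filter helps. The sketch below is one way to list the distinct metric names in Prometheus text-format output. The function name metric_names is our own, not part of any tool, and it assumes awk and sort are available on your workstation.

```shell
# Hypothetical helper: print each distinct metric name found in
# Prometheus text-format input on stdin.
metric_names() {
  # Skip comment lines (# HELP / # TYPE) and blank lines, then cut
  # everything from the first "{" or space onward, leaving the name.
  awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); print }' | LC_ALL=C sort -u
}
```

With the port-forward still running, you could call it as: curl -s http://localhost:8888/metrics | metric_names | head -n 20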

Step 1: Install Prometheus and Grafana

The simplest way to get both at once is the community kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, and Alertmanager and wires them together. Install it into a dedicated namespace:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring

Give it a minute, then confirm the pods are running:

kubectl get pods -n monitoring

You will see Prometheus, Grafana, and Alertmanager pods come up. If you prefer, Altinity also publishes ready-made Prometheus and Grafana manifests in the operator repository, but the Helm stack is the gentlest starting point.

Step 2: Tell Prometheus to scrape the operator

Prometheus needs to know about the operator's metrics endpoint. With the Prometheus Operator that the stack installs, you express a scrape target as a small ServiceMonitor resource pointing at the operator's metrics Service. Save this as operator-monitor.yaml:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: clickhouse-operator
  namespace: monitoring
  labels:
    release: monitoring
spec:
  namespaceSelector:
    matchNames:
      - kube-system
  selector:
    matchLabels:
      app: clickhouse-operator
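  endpoints:
    # Must match a named port on the clickhouse-operator-metrics Service.
    # Recent operator releases name the 8888 port clickhouse-metrics;
    # check your Service if the target never appears.
    - port: clickhouse-metrics
      path: /metrics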

kubectl apply -f operator-monitor.yaml
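A ServiceMonitor only matches when its selector labels and its endpoint port name agree with the target Service. As a quick check, the sketch below prints the labels and named ports of the operator's metrics Service so you can compare them with the manifest. The function name show_metrics_service is our own, and it assumes the operator lives in kube-system as described earlier.

```shell
# Hypothetical helper: print the labels of the operator's metrics
# Service, then one name=port line for each port it exposes.
show_metrics_service() {
  kubectl -n kube-system get service clickhouse-operator-metrics \
    -o jsonpath='{.metadata.labels}{"\n"}{range .spec.ports[*]}{.name}{"="}{.port}{"\n"}{end}'
}
```

Run show_metrics_service and confirm that the labels include the one used under matchLabels and that the port name matches the one under endpoints.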

With the chart's default settings, the stack's Prometheus only picks up ServiceMonitors that carry its Helm release label, which is why the manifest sets release: monitoring. Once applied, Prometheus begins scraping ClickHouse metrics through the operator. You can confirm by port-forwarding Prometheus and checking its targets page:

kubectl -n monitoring port-forward svc/monitoring-kube-prometheus-prometheus 9090

Open http://localhost:9090/targets and look for the ClickHouse operator target in the up state.
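The same information is available from the Prometheus HTTP API at /api/v1/targets, which is handy in a terminal or a script. The sketch below is a rough, grep-based tally of target health read from that JSON. The function name target_health_summary is our own, and it assumes the compact "health":"up" form that Prometheus emits; a JSON tool such as jq is the better choice for anything more careful.

```shell
# Hypothetical helper: count targets per health state (up, down,
# unknown) from the /api/v1/targets JSON supplied on stdin.
target_health_summary() {
  grep -o '"health":"[a-z]*"' \
    | cut -d'"' -f4 \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "=" $1 }'
}
```

With the port-forward running, you could call it as: curl -s http://localhost:9090/api/v1/targets | target_health_summary. It counts every target Prometheus knows about, not only the ClickHouse operator, so use the targets page to see which one is down.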

Step 3: Open Grafana and add the data source

Grafana is where the numbers become charts. Get its admin password and port-forward it:

kubectl -n monitoring get secret monitoring-grafana \
  -o jsonpath="{.data.admin-password}" | base64 -d; echo
kubectl -n monitoring port-forward svc/monitoring-grafana 3000:80

Open http://localhost:3000 and log in as admin with the password you just printed. The kube-prometheus-stack already configures Prometheus as the default data source, so you can go straight to dashboards. If you ever add Grafana separately, point its Prometheus data source at the in-cluster address http://<prometheus-service>.<namespace>.svc.cluster.local:9090 using proxy access.
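To make the in-cluster address concrete, here is a small sketch that fills in the pattern above. The function name prometheus_url is our own, and it assumes the default cluster domain cluster.local, which some clusters change.

```shell
# Hypothetical helper: build the in-cluster URL of a Prometheus Service.
# $1 = Service name, $2 = namespace, $3 = port (defaults to 9090)
prometheus_url() {
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$1" "$2" "${3:-9090}"
}
```

For the Service used in Step 2, prometheus_url monitoring-kube-prometheus-prometheus monitoring prints http://monitoring-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090.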

Step 4: Import the Altinity dashboard

Altinity publishes a ready-made Grafana dashboard for the operator and the ClickHouse clusters it manages. In Grafana, go to Dashboards, choose Import, and upload the Altinity ClickHouse Operator dashboard JSON from the operator repository (or paste its dashboard ID), then select your Prometheus data source. You immediately get panels for query rates, memory, parts, replication, and more, without building anything by hand.

Step 5: Alerts

Collecting metrics is only useful if something tells you when they go wrong. The operator repository ships a set of Prometheus alert rules for ClickHouse, covering conditions like a replica falling behind, too many parts, or a server becoming unreachable. Apply those rules to your Prometheus, and route them through Alertmanager to email or a chat channel so a human hears about trouble. Start with the provided rules and tune the thresholds to your workload over time.
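To show the shape of a rule under the Prometheus Operator, here is a minimal sketch that writes a single PrometheusRule to a file. It is not one of the rules Altinity ships. The function name, rule name, alert name, and five-minute window are our own choices, and the job label assumes the ServiceMonitor from Step 2 leaves the job name at its default, the Service name clickhouse-operator-metrics.

```shell
# Hypothetical helper: write an example PrometheusRule manifest to the
# path given as $1. The alert fires when Prometheus has failed to scrape
# the operator's metrics endpoint for five minutes.
write_example_rule() {
  cat > "$1" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: clickhouse-operator-example
  namespace: monitoring
  labels:
    release: monitoring   # lets the stack's Prometheus select this rule
spec:
  groups:
    - name: clickhouse-operator.example
      rules:
        - alert: ClickHouseOperatorMetricsDown
          # Assumed job label; check the targets page for the real one.
          expr: up{job="clickhouse-operator-metrics"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: ClickHouse operator metrics endpoint cannot be scraped
EOF
}
```

You could then run write_example_rule operator-rule.yaml followed by kubectl apply -f operator-rule.yaml, and look for the alert on the Prometheus alerts page.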

A quick health check from the database itself

Dashboards aside, ClickHouse exposes its own state through system tables, which are handy for a quick look or a custom panel:

SELECT metric, value FROM system.metrics WHERE metric LIKE '%Query%';
SELECT database, table, count() AS parts FROM system.parts WHERE active GROUP BY database, table ORDER BY parts DESC LIMIT 10;

The first shows live query activity; the second shows part counts per table, a key signal of merge health.
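Replication health can be checked the same way through the system.replicas table. The sketch below runs such a query inside a ClickHouse pod with kubectl exec. The function name ch_replica_check and the thresholds (60 seconds of delay, 100 queued entries) are illustrative assumptions, the namespace and pod name are placeholders you supply, and it assumes the default user can connect with clickhouse-client inside the pod.

```shell
# Illustrative query: replicas that are read-only, lagging, or backed up.
REPLICA_CHECK_SQL="SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly OR absolute_delay > 60 OR queue_size > 100
ORDER BY absolute_delay DESC"

# Hypothetical helper: run the query inside one ClickHouse pod.
# $1 = namespace, $2 = pod name
ch_replica_check() {
  kubectl exec -n "$1" "$2" -- clickhouse-client --query "$REPLICA_CHECK_SQL"
}
```

An empty result means no replica on that server crossed the thresholds. Use kubectl get pods in your ClickHouse namespace to find a pod name to pass in.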

What is next

You can now see your cluster's health and get alerted when it degrades. In the next article we harden it: encrypting connections with TLS, managing certificates, locking down users and networks, and keeping every secret in Kubernetes rather than in your manifests.
