How to Run ClickHouse® with Docker Compose: Single Node and Multi-Node Cluster (Beginner's Guide)

June 5, 2026 | 12 min read | Sivaranjani

This guide takes you from nothing to a working ClickHouse® database cluster on your own machine. First we run a single node and load a real open dataset to play with. Then we build a proper multi-node cluster, with sharding and replication, using Docker Compose.

No prior experience with Docker or ClickHouse is assumed. We explain every concept before we use it. By the end you will understand not just the commands, but why each piece exists.

This is a foundation article. If you came here on the way to building a Change Data Capture pipeline, this is the ClickHouse half you will need. See "What is Debezium and how to offload analytics to ClickHouse" for the bigger picture.

What is ClickHouse, briefly

ClickHouse is an open-source columnar database built for analytics. It is designed to scan and aggregate enormous numbers of rows in a fraction of a second, which makes it ideal for dashboards, reports, and data exploration. This guide uses ClickHouse 26.3, the current Long Term Support (LTS) release at the time of writing and a sensible choice for anything you plan to keep running.

What is Docker, in plain English

Docker is a tool for packaging a program together with everything it needs to run, into a single unit called an image. When you run an image, you get a container: an isolated little environment that behaves the same on every machine. You do not install ClickHouse on your laptop directly; you run the official ClickHouse image, and Docker takes care of the rest. When you are done, you throw the container away and your laptop is exactly as it was.

The benefit for learning is huge. There is nothing to install and uninstall, no version conflicts, and no mess. If something goes wrong, you delete the container and start again.

What is Docker Compose

Running one container by hand is easy. Running several containers that need to talk to each other, a database here, a coordination service there, is where it gets fiddly. Docker Compose solves this. You describe all your containers in a single text file called docker-compose.yml, and one command starts them all, wired together on a private network where they can find each other by name.

A cluster is exactly this kind of multi-container setup, which is why Compose is the perfect tool for it.

Prerequisites

Install Docker Desktop, which includes both Docker and Docker Compose. You will also want a terminal. That is everything.

Part A: A single ClickHouse node

Let us start with the simplest possible thing: one ClickHouse container.

Start it

Create a folder and a docker-compose.yml inside it:

services:
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    container_name: clickhouse
    ports:
      - "8123:8123"   # HTTP interface
      - "9000:9000"   # native client interface
    environment:
      CLICKHOUSE_PASSWORD: clickhouse
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

Start it:

docker compose up -d

The -d flag runs it in the background. The first time, Docker downloads the image, which takes a moment. The two published ports are the two ways to talk to ClickHouse: port 8123 for the HTTP interface (handy for tools and curl), and port 9000 for the fast native protocol used by the command-line client. CLICKHOUSE_PASSWORD sets the password for the default user. The ulimits block raises the limit on open files, because ClickHouse keeps many data files open at once.

Connect to it

Open the ClickHouse client inside the container:

docker compose exec clickhouse clickhouse-client --password clickhouse

You are now at a SQL prompt. Try:

SELECT version();

You can also reach it over HTTP from your host without entering the container:

echo 'SELECT version()' | curl 'http://localhost:8123/?password=clickhouse' --data-binary @-

Load a real open dataset

A database is more fun with real data. We will use the OpenFlights airports dataset, which lists about 7,700 airports around the world and is published under the Open Database License, so it is free to use. ClickHouse can read it straight from the web with the url table function.

First create a table to hold it:

CREATE TABLE airports
(
    airport_id Int32,
    name       String,
    city       String,
    country    String,
    iata       String,
    icao       String,
    lat        Float64,
    lon        Float64,
    altitude   Int32,
    timezone   String,
    dst        String,
    tz         String,
    type       String,
    source     String
)
ENGINE = MergeTree
ORDER BY airport_id;

This is a basic MergeTree table, the standard ClickHouse engine for a single node. The ORDER BY line tells ClickHouse how to physically sort the data on disk; a query that filters on the sort key can then skip most of the table instead of reading every row. Now load the data directly from the source:

INSERT INTO airports
SELECT * FROM url(
    'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat',
    'CSV',
    'airport_id Int32, name String, city String, country String, iata String, icao String, lat Float64, lon Float64, altitude Int32, timezone String, dst String, tz String, type String, source String'
);

Check it loaded:

SELECT count() FROM airports;

You should see roughly 7,700 rows.

Run an analytical query

Now ask an analytical question, the kind ClickHouse is built for. Which countries have the most airports?

SELECT country, count() AS airports
FROM airports
GROUP BY country
ORDER BY airports DESC
LIMIT 10;

This scans every row, groups by country, counts, and sorts, and it returns in a blink. That is the columnar engine at work. You now have a working single node with real data in it.

A single node is perfect for learning, for development, and even for plenty of production workloads. But it has two limits: if that one machine fills up, you cannot grow, and if it fails, your data is offline. A cluster solves both. That is Part B.

Part B: A multi-node cluster

A real ClickHouse cluster does two separate things, and it helps to keep them straight.

Replication means keeping identical copies of the same data on more than one node. If one node fails, another has the same data, so nothing goes offline. This is about safety.

Sharding means splitting one big table across several nodes, so each holds only a portion. This lets you store and query more data than one machine could handle. This is about scale.

We will build a cluster that does both: two shards, each with two replicas, for a total of four ClickHouse nodes. To keep the replicas in sync, the nodes need a coordination service, and ClickHouse provides its own, called ClickHouse Keeper. We run three Keeper nodes because Keeper decides by majority vote, and three nodes can lose one and still reach a majority. Modern ClickHouse uses Keeper; older guides that start a separate ZooKeeper are out of date.

That is seven containers in total. ClickHouse is light when idle, so this runs comfortably on a laptop.
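
The arithmetic behind the choice of three Keeper nodes is small enough to sketch in shell. This is only an illustration of majority voting, not anything Keeper itself runs, and the function names majority and tolerated are made up for this example.

```shell
# Majority (quorum) size for a group of N nodes: floor(N / 2) + 1.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# How many nodes can fail while a majority is still reachable.
tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the numbers for group sizes 1 through 5.
for n in 1 2 3 4 5; do
    echo "nodes=$n majority=$(majority "$n") tolerated_failures=$(tolerated "$n")"
done
```

Three nodes tolerate one failure, and four nodes still tolerate only one, so the even-sized group costs more without being any safer. That is why Keeper groups are usually sized at three or five.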

Set up the project folder

mkdir clickhouse-cluster
cd clickhouse-cluster

We will create a few small configuration files here, then one Compose file that ties them together.

The Keeper configuration

Create three nearly identical files, one per Keeper node. They differ only in the server_id line. Here is keeper-01.xml:

<clickhouse>
    <listen_host>0.0.0.0</listen_host>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/var/lib/clickhouse-keeper/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse-keeper/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server><id>1</id><hostname>keeper-01</hostname><port>9234</port></server>
            <server><id>2</id><hostname>keeper-02</hostname><port>9234</port></server>
            <server><id>3</id><hostname>keeper-03</hostname><port>9234</port></server>
        </raft_configuration>
    </keeper_server>
</clickhouse>

Create keeper-02.xml and keeper-03.xml as copies of this file, changing only <server_id> to 2 and 3 respectively. The raft_configuration block is the same in all three and lists every Keeper node so they can find each other.
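
If you would rather not copy and edit by hand, a short shell loop can write all three files for you. This is a convenience sketch that assumes you run it inside the clickhouse-cluster folder; it overwrites any keeper-01.xml to keeper-03.xml already there with the same content shown above.

```shell
# Write keeper-01.xml, keeper-02.xml and keeper-03.xml into the current folder.
# Only <server_id> changes between files; raft_configuration is identical.
for id in 1 2 3; do
    cat > "keeper-0${id}.xml" <<EOF
<clickhouse>
    <listen_host>0.0.0.0</listen_host>
    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>${id}</server_id>
        <log_storage_path>/var/lib/clickhouse-keeper/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse-keeper/snapshots</snapshot_storage_path>
        <raft_configuration>
            <server><id>1</id><hostname>keeper-01</hostname><port>9234</port></server>
            <server><id>2</id><hostname>keeper-02</hostname><port>9234</port></server>
            <server><id>3</id><hostname>keeper-03</hostname><port>9234</port></server>
        </raft_configuration>
    </keeper_server>
</clickhouse>
EOF
done
```

Generating the files from one template also removes the most common mistake here, which is forgetting to change server_id in one of the copies.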

The cluster configuration for ClickHouse nodes

Create cluster.xml, which all four ClickHouse nodes share. It defines the cluster layout and points the nodes at Keeper:

<clickhouse>
    <listen_host>0.0.0.0</listen_host>
    <remote_servers>
        <cluster_2S_2R>
            <shard>
                <internal_replication>true</internal_replication>
                <replica><host>clickhouse-01</host><port>9000</port></replica>
                <replica><host>clickhouse-02</host><port>9000</port></replica>
            </shard>
            <shard>
                <internal_replication>true</internal_replication>
                <replica><host>clickhouse-03</host><port>9000</port></replica>
                <replica><host>clickhouse-04</host><port>9000</port></replica>
            </shard>
        </cluster_2S_2R>
    </remote_servers>
    <zookeeper>
        <node><host>keeper-01</host><port>9181</port></node>
        <node><host>keeper-02</host><port>9181</port></node>
        <node><host>keeper-03</host><port>9181</port></node>
    </zookeeper>
</clickhouse>

Two things to read here. The remote_servers block describes our cluster, named cluster_2S_2R: two shards, each with two replicas. The zookeeper block points ClickHouse at Keeper; it is named zookeeper for historical reasons, but Keeper speaks the same protocol, so this is how you connect to it.

Per-node identity: macros

Each node needs to know which shard and replica it is. ClickHouse calls these values macros. Create four small files, one per node. Here is macros-01.xml:

<clickhouse>
    <macros>
        <cluster>cluster_2S_2R</cluster>
        <shard>01</shard>
        <replica>clickhouse-01</replica>
    </macros>
</clickhouse>

Create the other three by changing the shard and replica values to match the cluster layout:

  • macros-02.xml: shard 01, replica clickhouse-02
  • macros-03.xml: shard 02, replica clickhouse-03
  • macros-04.xml: shard 02, replica clickhouse-04

So clickhouse-01 and clickhouse-02 are the two replicas of shard 1, and clickhouse-03 and clickhouse-04 are the two replicas of shard 2.
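
The four macros files follow the same pattern, so they can also be generated with a loop. This is a sketch under the same assumption as before (run it in the clickhouse-cluster folder); the "node shard" pairs in the loop simply restate the layout listed above.

```shell
# Write macros-01.xml to macros-04.xml into the current folder.
# Each pair is "<node number> <shard number>", matching cluster.xml.
for pair in "01 01" "02 01" "03 02" "04 02"; do
    node=${pair%% *}     # text before the space, e.g. 03
    shard=${pair##* }    # text after the space, e.g. 02
    cat > "macros-${node}.xml" <<EOF
<clickhouse>
    <macros>
        <cluster>cluster_2S_2R</cluster>
        <shard>${shard}</shard>
        <replica>clickhouse-${node}</replica>
    </macros>
</clickhouse>
EOF
done
```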

Allow the nodes to talk

Recent ClickHouse images lock down the default user. For a local learning cluster, create users-access.xml to let the nodes connect to each other without a password. This is fine for a laptop but must never be used in production:

<clickhouse>
    <users>
        <default>
            <password></password>
            <networks><ip>::/0</ip></networks>
            <profile>default</profile>
            <quota>default</quota>
            <access_management>1</access_management>
        </default>
    </users>
</clickhouse>

The Docker Compose file

Now tie it all together. Create docker-compose.yml:

services:
  keeper-01:
    image: clickhouse/clickhouse-keeper:26.3
    container_name: keeper-01
    volumes:
      - ./keeper-01.xml:/etc/clickhouse-keeper/keeper_config.xml
 
  keeper-02:
    image: clickhouse/clickhouse-keeper:26.3
    container_name: keeper-02
    volumes:
      - ./keeper-02.xml:/etc/clickhouse-keeper/keeper_config.xml
 
  keeper-03:
    image: clickhouse/clickhouse-keeper:26.3
    container_name: keeper-03
    volumes:
      - ./keeper-03.xml:/etc/clickhouse-keeper/keeper_config.xml
 
  clickhouse-01:
    image: clickhouse/clickhouse-server:26.3
    container_name: clickhouse-01
    ports:
      - "8123:8123"
    depends_on: [keeper-01, keeper-02, keeper-03]
    ulimits:
      nofile: {soft: 262144, hard: 262144}
    volumes:
      - ./cluster.xml:/etc/clickhouse-server/config.d/cluster.xml
      - ./macros-01.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./users-access.xml:/etc/clickhouse-server/users.d/users-access.xml
 
  clickhouse-02:
    image: clickhouse/clickhouse-server:26.3
    container_name: clickhouse-02
    ports:
      - "8124:8123"
    depends_on: [keeper-01, keeper-02, keeper-03]
    ulimits:
      nofile: {soft: 262144, hard: 262144}
    volumes:
      - ./cluster.xml:/etc/clickhouse-server/config.d/cluster.xml
      - ./macros-02.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./users-access.xml:/etc/clickhouse-server/users.d/users-access.xml
 
  clickhouse-03:
    image: clickhouse/clickhouse-server:26.3
    container_name: clickhouse-03
    ports:
      - "8125:8123"
    depends_on: [keeper-01, keeper-02, keeper-03]
    ulimits:
      nofile: {soft: 262144, hard: 262144}
    volumes:
      - ./cluster.xml:/etc/clickhouse-server/config.d/cluster.xml
      - ./macros-03.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./users-access.xml:/etc/clickhouse-server/users.d/users-access.xml
 
  clickhouse-04:
    image: clickhouse/clickhouse-server:26.3
    container_name: clickhouse-04
    ports:
      - "8126:8123"
    depends_on: [keeper-01, keeper-02, keeper-03]
    ulimits:
      nofile: {soft: 262144, hard: 262144}
    volumes:
      - ./cluster.xml:/etc/clickhouse-server/config.d/cluster.xml
      - ./macros-04.xml:/etc/clickhouse-server/config.d/macros.xml
      - ./users-access.xml:/etc/clickhouse-server/users.d/users-access.xml

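Each ClickHouse node publishes its HTTP port on a different host port (8123 through 8126) so you can reach any of them from your laptop. If the single node from Part A is still running, stop it first with docker compose down in its folder, because it also claims host port 8123.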

Start the cluster

docker compose up -d
docker compose ps
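
Rather than guessing how long startup takes, one option is to poll the HTTP interface until it answers. The sketch below relies on ClickHouse's /ping endpoint and on curl being installed on your host; the function name wait_for_clickhouse and the default of 30 tries are choices made for this example.

```shell
# Poll a ClickHouse /ping URL until it answers or we run out of tries.
# Usage: wait_for_clickhouse URL [TRIES]
wait_for_clickhouse() {
    url=$1
    tries=${2:-30}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, --max-time: do not hang.
        if curl -sf --max-time 2 "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}

# Example, commented out so pasting the function alone has no side effects:
# for port in 8123 8124 8125 8126; do
#     wait_for_clickhouse "http://localhost:${port}/ping" && echo "port ${port} is ready"
# done
```

The function returns success as soon as the node responds and failure if it never does, so it can also be used at the top of a script that loads data.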

Give it a moment, then connect to the first node:

docker compose exec clickhouse-01 clickhouse-client

Confirm the cluster is assembled:

SELECT cluster, shard_num, replica_num, host_name
FROM system.clusters
WHERE cluster = 'cluster_2S_2R';

You should see four rows: two shards, two replicas each.

Create distributed tables

We create two tables. The first is a local table on every node, using ReplicatedMergeTree, which is the replicating cousin of MergeTree. The second is a Distributed table, which holds no data itself but acts as a single doorway that fans queries out across all shards.

Run this from clickhouse-01. The ON CLUSTER clause makes ClickHouse apply it to every node at once:

CREATE DATABASE flights ON CLUSTER cluster_2S_2R;
 
CREATE TABLE flights.airports_local ON CLUSTER cluster_2S_2R
(
    airport_id Int32,
    name       String,
    city       String,
    country    String,
    lat        Float64,
    lon        Float64,
    altitude   Int32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/airports', '{replica}')
ORDER BY airport_id;
 
CREATE TABLE flights.airports ON CLUSTER cluster_2S_2R
AS flights.airports_local
ENGINE = Distributed(cluster_2S_2R, flights, airports_local, cityHash64(country));

The ReplicatedMergeTree takes two parameters. The first is a path in Keeper where the table's replication state lives; the {shard} macro makes each shard use a different path, while the two replicas of a shard share the same path so they stay in sync. The second, {replica}, identifies each replica uniquely. Those macros are exactly the per-node files we created earlier.

The Distributed table is told which cluster to use, which local table to read and write, and how to decide which shard a row belongs to. We shard by cityHash64(country), so all airports in the same country land on the same shard, which keeps per-country queries efficient.
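
To make the routing idea concrete, here is a toy model in shell: hash the sharding key, then take the remainder after dividing by the number of shards. It assumes both shards have equal weight, and it uses cksum (a CRC-32 checksum) as a stand-in for cityHash64, so the shard numbers it prints will not match where ClickHouse actually places each country. The function name shard_for is made up for this example.

```shell
# Toy model of shard selection: shard = hash(key) modulo number of shards.
# ASSUMPTION: cksum (CRC-32) replaces cityHash64, so real placement differs.
NUM_SHARDS=2

shard_for() {
    # cksum prints "<checksum> <byte count>"; keep only the checksum.
    hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $(( hash % NUM_SHARDS + 1 ))
}

# The same country always maps to the same shard, however often it appears.
for country in "India" "France" "Brazil" "India"; do
    echo "$country -> shard $(shard_for "$country")"
done
```

The point to take away is determinism: because the shard depends only on the country value, every row for a given country is routed to the same place.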

Load the dataset into the cluster

Insert through the Distributed table, and ClickHouse routes each row to the correct shard:

INSERT INTO flights.airports
SELECT airport_id, name, city, country, lat, lon, altitude
FROM url(
    'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat',
    'CSV',
    'airport_id Int32, name String, city String, country String, iata String, icao String, lat Float64, lon Float64, altitude Int32, timezone String, dst String, tz String, type String, source String'
);

See sharding and replication at work

Query the whole dataset through the Distributed table:

SELECT count() FROM flights.airports;

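You get the full count of roughly 7,700. If the number is lower, wait a second and run it again: by default a Distributed table forwards inserted rows to the shards in the background. Now look at how it is split. Query the local table on shard 1 and shard 2 directly: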

docker compose exec clickhouse-01 clickhouse-client --query \
  "SELECT count() FROM flights.airports_local"
docker compose exec clickhouse-03 clickhouse-client --query \
  "SELECT count() FROM flights.airports_local"

The two counts are each a portion of the total, and they add up to the full number. That is sharding: each shard holds part of the data.

Now check replication. clickhouse-02 is the other replica of shard 1, so it should hold exactly the same data as clickhouse-01:

docker compose exec clickhouse-02 clickhouse-client --query \
  "SELECT count() FROM flights.airports_local"

It matches clickhouse-01. That is replication: the second replica received the data automatically.
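
The checks in this section boil down to two rules: the replicas of a shard must agree, and the shards must add up to the total. A small helper can verify both at once. This is an illustrative sketch, not a ClickHouse tool; check_counts and count_on are names invented for this example.

```shell
# check_counts TOTAL S1R1 S1R2 S2R1 S2R2
# Succeeds only if both replicas of each shard agree and the shards sum to TOTAL.
check_counts() {
    total=$1; s1r1=$2; s1r2=$3; s2r1=$4; s2r2=$5
    [ "$s1r1" -eq "$s1r2" ] || { echo "shard 1 replicas differ"; return 1; }
    [ "$s2r1" -eq "$s2r2" ] || { echo "shard 2 replicas differ"; return 1; }
    [ $(( s1r1 + s2r1 )) -eq "$total" ] || { echo "shards do not add up to total"; return 1; }
    echo "ok: replicas match and shards add up to $total"
}

# Against the running cluster, the inputs could be gathered like this
# (-T turns off the pseudo-terminal so the output can be captured):
# count_on() { docker compose exec -T "$1" clickhouse-client --query "SELECT count() FROM flights.$2"; }
# check_counts "$(count_on clickhouse-01 airports)" \
#     "$(count_on clickhouse-01 airports_local)" "$(count_on clickhouse-02 airports_local)" \
#     "$(count_on clickhouse-03 airports_local)" "$(count_on clickhouse-04 airports_local)"
```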

Finally, prove the cluster survives a failure. Stop one replica and query again through the Distributed table:

docker compose stop clickhouse-02
docker compose exec clickhouse-01 clickhouse-client --query \
  "SELECT count() FROM flights.airports"

You still get the full count, because the surviving replica of shard 1 served the data. Bring it back with docker compose start clickhouse-02, and it catches up automatically.

Single node or cluster: which should you use

For learning, development, and many production workloads, a single node is the right answer. It is simple, fast, and has far fewer moving parts. Do not build a cluster until you actually need one.

Reach for a cluster when your data outgrows one machine (you need sharding to spread it out), or when downtime is unacceptable (you need replication so a node failure does not take you offline), or both. The setup above gives you a realistic, if small, version of what a production cluster looks like.

Cleaning up

From either project folder:

docker compose down -v

The -v flag also removes the stored data, leaving your machine clean.

What is next

You now know how to run ClickHouse, load real data, and stand up a sharded, replicated cluster. The natural next step is to feed it live data from your existing databases. Start with the concepts in "What is Debezium and how to offload analytics to ClickHouse", then follow the hands-on guides for PostgreSQL, MySQL, and Oracle.

If you would like help designing a production ClickHouse deployment, including cluster sizing, sharding strategy, and operations, the engineers at Quantrail Data do exactly this. Reach out through our services page and we will be glad to help.
