All posts
PostgreSQL to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

PostgreSQL to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

June 9, 202618 min readSanjeev Kumar G
Share:

If you run an application on PostgreSQL and you want fast analytics on that data without hammering your production database, this guide is for you. We are going to build a real Change Data Capture (CDC) pipeline that copies every insert, update, and delete from PostgreSQL into the ClickHouse® database in near real time, using only open-source tools and Docker.

This is Part 1 of a three-part series. Part 2 covers MySQL to ClickHouse, and Part 3 covers Oracle to ClickHouse. We start with PostgreSQL because it is the most common starting point and the concepts you learn here carry over to the other two.

No prior experience with Debezium, Kafka, or ClickHouse is assumed. We explain every moving part in plain language before we run anything.

What you will build

By the end of this tutorial you will have a working pipeline that looks like this:

PostgreSQL writes every change to its write-ahead log. Debezium reads that log and turns each change into an event. Apache Kafka stores those events durably. The ClickHouse Kafka Connect Sink reads the events from Kafka and writes them into a ClickHouse table. When you insert, update, or delete a row in PostgreSQL, the same change shows up in ClickHouse a moment later.

Everything runs locally in Docker containers, so you can follow along on a laptop.

What is Change Data Capture, in plain English

Imagine you have an orders table in PostgreSQL. Customers place orders, change their minds, and cancel them all day long. Your analytics team wants to run heavy reports on this data, but running those reports directly on PostgreSQL would slow down the application for real customers.

The naive solution is to copy the whole table to an analytics database every night. That works, but the data is always up to a day stale, and copying the entire table every night is wasteful.

Change Data Capture is the better solution. Instead of copying everything, CDC watches for changes and copies only what actually changed, as it happens. The result is an analytics copy that stays a few seconds behind production instead of a full day.

The trick that makes CDC efficient is that it does not poll your tables with queries. Every relational database already keeps a log of every change it makes, so that it can recover after a crash. In PostgreSQL this log is called the Write-Ahead Log, or WAL. CDC tools read this log directly, which means they put almost no extra load on the database.

Why PostgreSQL and ClickHouse are a great pair

PostgreSQL is excellent at transactions: many small reads and writes, one row at a time, with strong correctness guarantees. It is the workhorse behind a huge number of applications.

ClickHouse is the opposite kind of database. It is built for analytics: scanning billions of rows, grouping and aggregating them, and returning results in milliseconds. It is not designed to be the primary store for your application, but it is outstanding for dashboards, reports, and data exploration.

By keeping PostgreSQL as your source of truth and continuously streaming changes into ClickHouse, you get the best of both worlds: a reliable transactional database for your app, and a blazing-fast analytical database for your reports, always in sync.

The tools and the exact versions

We are deliberately pinning specific, current versions so that this tutorial actually works when you run it. Mixing incompatible versions is the single most common reason these pipelines fail, so do not assume a newer or older tag will behave the same way.

ComponentRoleImage and version
PostgreSQLSource databasepostgres:17
Apache KafkaEvent log / transportapache/kafka:4.1.0 (KRaft mode, no ZooKeeper)
DebeziumPostgreSQL source connectorquay.io/debezium/connect:3.5
ClickHouse Kafka Connect SinkLoads events into ClickHousev1.3.7
ClickHouseAnalytics databaseclickhouse/clickhouse-server:26.3 (LTS)

A few notes on these choices. Debezium 3.5 (specifically 3.5.2.Final, released 2026-06-02) is built and tested against Kafka 4.1, which is why we pair them. Debezium 3.5 supports PostgreSQL versions 14 through 18, so postgres:17 is comfortably in range. ClickHouse 26.3 is the current Long Term Support release, which is the right choice for anything you intend to keep running.

Kafka 4 is worth a special mention. Older Kafka tutorials you may find online start a separate ZooKeeper container. Kafka 4 removed ZooKeeper entirely and uses a built-in system called KRaft, so there is no ZooKeeper in our setup. If a tutorial tells you to start ZooKeeper, it is out of date.

Prerequisites

You need Docker and Docker Compose installed, and roughly 4 GB of free memory for the containers. You also need a terminal and a willingness to copy and paste. That is all.

How the data actually flows

Before we touch any configuration, it helps to understand the journey a single change takes. Say a customer updates their email address in PostgreSQL.

First, PostgreSQL writes that update to its Write-Ahead Log as part of committing the transaction. Second, Debezium, which has subscribed to the WAL using a feature called logical replication, sees the change and produces an event describing it. That event includes the row before the change, the row after the change, and the type of operation. Third, Debezium hands the event to Kafka, which stores it in a topic. Fourth, the ClickHouse Kafka Connect Sink reads the event from the topic and inserts it into a ClickHouse table.

One detail matters a lot for beginners. Debezium's raw event is a nested structure with a before section and an after section. ClickHouse wants flat rows, not nested ones. We solve this with a small piece of Debezium configuration called the ExtractNewRecordState transformation, which flattens the event down to just the new state of the row and adds a marker so we can tell deletes apart from inserts and updates. We will set this up later.

Step 1: Prepare a project folder

Create a folder for the project and move into it:

mkdir pg-to-clickhouse-cdc
cd pg-to-clickhouse-cdc

We will create a handful of files in here as we go.

Step 2: Download the ClickHouse Kafka Connect Sink

The Debezium connect image already contains the PostgreSQL source connector, but it does not contain the connector that writes to ClickHouse. We add that ourselves by downloading it and mounting it into the container.

Download version 1.3.7 (the current release) and unzip it into a local folder:

mkdir -p connect-plugins
cd connect-plugins
curl -L -o clickhouse-kafka-connect.zip \
  https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.3.7/clickhouse-kafka-connect-v1.3.7.zip
unzip clickhouse-kafka-connect.zip
rm clickhouse-kafka-connect.zip
cd ..

You should now have a connect-plugins/clickhouse-kafka-connect-v1.3.7 folder containing some JAR files. We will mount this folder into the Debezium container so that Kafka Connect can find the ClickHouse sink alongside Debezium's own connectors.

Step 3: The Docker Compose file

Create a file named docker-compose.yml with the following content. Read the comments, because they explain what each service is for.

services:
  # The source database. We enable logical replication so Debezium can read the WAL.
  postgres:
    image: postgres:17
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: shop
    command:
      - "postgres"
      - "-c"
      - "wal_level=logical"
    ports:
      - "5432:5432"
 
  # A single-node Kafka 4 broker in KRaft mode (no ZooKeeper).
  kafka:
    image: apache/kafka:4.1.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
 
  # Kafka Connect, running the Debezium image, with the ClickHouse sink mounted in.
  connect:
    image: quay.io/debezium/connect:3.5
    depends_on:
      - kafka
      - postgres
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
      # Use plain JSON without schemas, which is the simplest format for the
      # ClickHouse sink to consume.
      KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
    volumes:
      # Mount the ClickHouse sink so Kafka Connect discovers it on startup.
      - ./connect-plugins/clickhouse-kafka-connect-v1.3.7:/kafka/connect/clickhouse-kafka-connect
 
  # The analytics database.
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: clickhouse
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    ulimits:
      nofile:
        soft: 262144
        hard: 262144

A short explanation of the less obvious parts. The Kafka service has a lot of environment variables because KRaft mode needs to be told that this single node is both the broker (it stores data) and the controller (it manages cluster metadata). The replication factors are all set to 1 because we only have one broker; in production you would have at least three.

The connect service mounts your downloaded ClickHouse sink folder into /kafka/connect/, which is where the Debezium image looks for connectors. We also tell Kafka Connect to use plain JSON without an embedded schema, which keeps the setup simple and is easy for the ClickHouse sink to read.

Step 4: Start the stack

Bring everything up:

docker compose up -d

The first run downloads several images, so give it a minute or two. Check that all four containers are running:

docker compose ps

Confirm that Kafka Connect is alive and can see both the Debezium and ClickHouse connector plugins:

curl -s http://localhost:8083/connector-plugins | grep -o '"class":"[^"]*"'

You should see io.debezium.connector.postgresql.PostgresConnector and com.clickhouse.kafka.connect.ClickHouseSinkConnector in the output. If the ClickHouse connector is missing, the volume mount path is wrong; double-check the folder name matches the version you downloaded.

Step 5: Create a table in PostgreSQL and add some data

Open a shell into PostgreSQL:

docker compose exec postgres psql -U postgres -d shop

Create a simple customers table and insert a few rows:

CREATE TABLE customers (
  id         INT PRIMARY KEY,
  name       TEXT NOT NULL,
  email      TEXT NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
INSERT INTO customers (id, name, email) VALUES
  (1, 'Ada Lovelace', 'ada@example.com'),
  (2, 'Alan Turing',  'alan@example.com'),
  (3, 'Grace Hopper', 'grace@example.com');

One important detail: Debezium needs each table it captures to have a way to identify rows. A primary key does this automatically, which is why our table has one. For updates and deletes to work correctly downstream, every captured table should have a primary key. Type \q to exit psql when you are done.

Step 6: Create the target table in ClickHouse

Now we design the ClickHouse table that will receive the changes. This is the most important design decision in the whole pipeline, so we will go slowly.

ClickHouse is append-only at heart. It does not update or delete individual rows efficiently the way PostgreSQL does. So how do we reflect updates and deletes? The answer is a special table engine called ReplacingMergeTree.

A ReplacingMergeTree keeps multiple versions of a row but, when you ask for the latest state, it collapses them down to the newest version per primary key. We give it two helper columns: a version number, so it knows which copy is newest, and a deleted flag, so we can represent a deleted row.

Open a ClickHouse client:

docker compose exec clickhouse clickhouse-client --password clickhouse

Create the database and table:

CREATE DATABASE IF NOT EXISTS shop;
 
CREATE TABLE shop.customers
(
    id         Int32,
    name       String,
    email      String,
    updated_at DateTime64(6),
    -- Debezium adds these via the transformation we configure in Step 7.
    _version   UInt64,
    _deleted   UInt8
)
ENGINE = ReplacingMergeTree(_version, _deleted)
ORDER BY id;

Two things to understand here. The ORDER BY id line tells ClickHouse that id is the key that identifies a unique customer, which must match the primary key in PostgreSQL. The ReplacingMergeTree(_version, _deleted) line tells ClickHouse to keep the row with the highest _version for each id, and to treat rows where _deleted = 1 as removed.

Step 7: Register the Debezium source connector

Now we tell Debezium to start watching the PostgreSQL customers table. Connectors are configured by sending a small piece of JSON to Kafka Connect's REST API.

Create a file named postgres-source.json:

{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "tasks.max": "1",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "shop",
    "topic.prefix": "shop",
    "plugin.name": "pgoutput",
    "table.include.list": "public.customers",
    "publication.autocreate.mode": "filtered",
 
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.delete.tombstone.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "op,source.lsn"
  }
}

Send it to Kafka Connect:

curl -X POST -H "Content-Type: application/json" \
  --data @postgres-source.json \
  http://localhost:8083/connectors

Let us unpack the important settings. plugin.name is set to pgoutput, which is PostgreSQL's built-in logical replication output and requires no extra installation. topic.prefix of shop means events for our table land in a Kafka topic named shop.public.customers. table.include.list limits capture to just our one table.

The transforms block is the part that makes ClickHouse happy. ExtractNewRecordState flattens Debezium's nested event into a plain row. delete.tombstone.handling.mode set to rewrite means that when a row is deleted in PostgreSQL, Debezium emits a normal-looking row with an added __deleted field set to true, instead of a confusing empty message. add.fields of op,source.lsn attaches the operation type and the PostgreSQL log sequence number to every event. We will use that log sequence number as our version column, because it always increases, which is exactly what ReplacingMergeTree needs to know which row is newest.

Note that option names on this transformation have changed across Debezium versions. The names above are correct for Debezium 3.5. If you ever use a different version, check the ExtractNewRecordState page in that version's documentation rather than assuming.

Check that the connector is running:

curl -s http://localhost:8083/connectors/postgres-source/status

The state should read RUNNING. Debezium will first take a snapshot of the existing three rows, then switch to streaming new changes.

Step 8: Register the ClickHouse sink connector

Now we connect the other end. Create a file named clickhouse-sink.json:

{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "shop.public.customers",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "shop",
    "username": "default",
    "password": "clickhouse",
    "ssl": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
 
    "transforms": "renameFields",
    "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.renameFields.renames": "__source_lsn:_version,__deleted:_deleted"
  }
}

Send it to Kafka Connect:

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json \
  http://localhost:8083/connectors

The sink reads from the same topic Debezium writes to, shop.public.customers, and inserts into the shop.customers table in ClickHouse. The ClickHouse sink matches incoming fields to table columns by name, so the field names must line up with your column names. The small ReplaceField transformation renames Debezium's metadata fields (__source_lsn and __deleted) to the column names we created (_version and _deleted). The true and false values that Debezium produces for the deleted flag are stored by ClickHouse as 1 and 0 in the UInt8 column.

Check it is running:

curl -s http://localhost:8083/connectors/clickhouse-sink/status

Step 9: See it work

Query ClickHouse for the three rows that were snapshotted:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT id, name, email FROM shop.customers FINAL ORDER BY id"

Notice the FINAL keyword. It tells ClickHouse to collapse all versions of each row down to the newest one before returning results. Without FINAL, you might temporarily see more than one version of a row, because ClickHouse merges versions in the background on its own schedule. For correctness in queries, always use FINAL (or filter by the latest version yourself) with a ReplacingMergeTree.

Now go back to PostgreSQL and make some changes:

docker compose exec postgres psql -U postgres -d shop
-- An update
UPDATE customers SET email = 'ada@newmail.com', updated_at = now() WHERE id = 1;
-- A new insert
INSERT INTO customers (id, name, email) VALUES (4, 'Edsger Dijkstra', 'edsger@example.com');
-- A delete
DELETE FROM customers WHERE id = 2;

Wait a couple of seconds, then query ClickHouse again:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT id, name, email FROM shop.customers FINAL WHERE _deleted = 0 ORDER BY id"

You should see Ada with her new email, the new row for Edsger, and Alan gone. We add WHERE _deleted = 0 to hide the deleted row. ReplacingMergeTree keeps the deleted row internally (marked as deleted) so it can override the older live version, and filtering it out at query time gives you the clean current state.

Congratulations: you have a working real-time CDC pipeline from PostgreSQL to ClickHouse.

How updates and deletes really work, step by step

It is worth slowing down on the part that confuses most beginners: how an append-only database correctly reflects updates and deletes.

When you updated Ada's email, PostgreSQL logged the change, Debezium emitted a flattened row with the new email and a higher log sequence number, and the sink inserted that as a new row in ClickHouse. For a moment, ClickHouse held two rows with id = 1: the old one and the new one. Because the table is a ReplacingMergeTree keyed on id with _version as the version, a FINAL query returns only the row with the highest version, which is the new email. Eventually ClickHouse merges the parts in the background and physically discards the old row.

The delete worked the same way. Debezium emitted a row for Alan with _deleted set to true and a high version number. That row wins over Alan's older live row, and your query filters out deleted rows, so Alan disappears from results.

This is the core mental model: in CDC to ClickHouse, you never update or delete in place. You always append a newer version, and the table engine plus the FINAL keyword give you the correct current picture.

Production considerations

This tutorial runs a single node of everything, which is perfect for learning but not for production. Here are the things that change when you go live.

Run one Debezium connector task per table. The PostgreSQL connector uses a single task, and the documented pattern is one connector instance per table when you need to scale, because that lets you spread tables across the cluster.

Mind your Kafka partitions and ordering. If a topic has more than one partition, you must ensure that all events for the same PostgreSQL row land in the same partition, otherwise an update could be processed before the insert it depends on. With a single partition this is automatic; with several partitions you need hash-based routing on the primary key. Start with a single partition while you learn.

Replace plaintext passwords and connections. Everything here uses plaintext for clarity. In production, use TLS for ClickHouse ("ssl": "true" and port 8443), real secrets management, and a dedicated PostgreSQL replication user with only the privileges it needs rather than the superuser.

Watch replication slot growth. Debezium keeps a logical replication slot open in PostgreSQL. If the connector stops for a long time, PostgreSQL retains WAL files for that slot and disk usage grows. Monitor slot lag and have an alert for it.

Use at least three Kafka brokers and replication factor three. Our single broker has no redundancy. A real deployment needs replication so that a broker failure does not lose events.

Turn on exactly-once delivery when correctness is critical. The ClickHouse sink supports exactly-once semantics backed by a ClickHouse feature called KeeperMap. It is off by default; enable it when duplicates would be a problem.

Troubleshooting

If the ClickHouse connector does not appear in the plugin list, the volume mount in Docker Compose points at the wrong folder. Confirm the folder name matches the version you unzipped.

If rows never arrive in ClickHouse, check the sink status endpoint for an error, then confirm the Kafka topic actually has messages with docker compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic shop.public.customers --from-beginning --max-messages 1.

If the field names do not match, the sink will report that columns are missing. Remember that the ClickHouse sink maps by field name, so your column names must match the (possibly renamed) fields in the Kafka message.

If updates appear as duplicates in ClickHouse, you almost certainly forgot FINAL, or your ORDER BY key in ClickHouse does not exactly match the PostgreSQL primary key. The order key and the source key must be the same set of columns.

Cleaning up

When you are finished, tear everything down with:

docker compose down -v

The -v flag also removes the data volumes, giving you a clean slate.

References

What is next

You now understand the full shape of a CDC pipeline: a source database's log, Debezium turning log entries into events, Kafka carrying them, and the ClickHouse Kafka Connect Sink landing them into a ReplacingMergeTree. Most of this knowledge transfers directly to other databases.

In Part 2 we apply the same architecture to MySQL, where Debezium reads the binary log instead of the write-ahead log. In Part 3 we tackle Oracle, which uses LogMiner and has a few more setup steps. The ClickHouse side stays almost identical, which is the whole point of building on a clean, well-understood foundation.

If you would like help designing a production-grade CDC pipeline into ClickHouse, including partitioning strategy, schema evolution, and monitoring, the engineers at Quantrail Data do exactly this. Reach out through our services page and we will be glad to help.

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: