All posts
Cassandra to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

Cassandra to ClickHouse® CDC with Debezium 3.5 and Kafka 4: A Complete Beginner's Guide

June 23, 202613 min readMohamed Hussain S
Share:

This guide shows you how to stream changes from Apache Cassandra into the ClickHouse® database in near real time, using Debezium, Apache Kafka, and Docker. Cassandra is the most unusual source in this series, because its Debezium connector does not work like any of the others, and we explain exactly why and how.

This article is self-contained. If Debezium and ClickHouse are brand new to you, the short overview in What is Debezium and how to offload analytics to ClickHouse is a good primer first. Because Cassandra CDC is genuinely more involved than the other sources, it helps to have built one of the simpler pipelines, such as the PostgreSQL one, before this.

A note before you start

The Debezium Cassandra connector is an incubating component, and it is the most advanced setup in this series. The concepts below are correct and stable, but exact artifact names and a few property names move between versions, so treat this as a working blueprint and check the Debezium Cassandra documentation for your exact version. We flag the version-sensitive parts as we go.

What you will build

A pipeline where Cassandra writes changes to its commit log, a Debezium agent running alongside Cassandra reads that commit log and turns each change into an event, Apache Kafka stores the events durably, and the ClickHouse Kafka Connect Sink writes them into a ClickHouse table.

How Cassandra CDC is fundamentally different

Every other connector in this series runs inside Kafka Connect. The Cassandra connector does not. This is the single most important thing to understand.

Cassandra is a distributed database where data lives on many nodes, and each node keeps its own commit log of the writes it received. There is no single central log to read. So Debezium's Cassandra connector is built as a small Java agent that you run directly on each Cassandra node. The agent watches that node's commit log directory, turns the changes it finds into Debezium events, and publishes them straight to Kafka with a Kafka producer. It is not a Kafka Connect connector, you do not register it with a REST call, and it does not appear in the connector plugin list.

This has two consequences that shape everything below. First, the agent must share the Cassandra node's files, specifically its CDC commit log directory, so in Docker the agent and Cassandra share a volume. Second, because the agent is not in Kafka Connect, it cannot apply the flattening transform we used on every other source. So for Cassandra, we move that transform to the ClickHouse sink instead.

Why Cassandra and ClickHouse are a great pair

Cassandra is superb at high-volume writes spread across many nodes, but it is deliberately limited at analytical queries: no joins, no easy aggregations across partitions. ClickHouse is the opposite, built for exactly those large scans and aggregations. Streaming Cassandra changes into ClickHouse gives you a queryable analytical copy of data that is otherwise hard to analyze in place.

The tools and the exact versions

ComponentRoleImage and version
CassandraSource databasecassandra:4.1
Debezium Cassandra agentReads the commit logdebezium-connector-cassandra-4 3.5.2.Final (a JAR)
Apache KafkaEvent log / transportapache/kafka:4.1.0 (KRaft mode, no ZooKeeper)
ClickHouse Kafka Connect SinkLoads events into ClickHousev1.3.7
ClickHouseAnalytics databaseclickhouse/clickhouse-server:26.3 (LTS)

The Cassandra agent's JAR flavor must match your Cassandra major version: debezium-connector-cassandra-4 for Cassandra 4.x. We use Cassandra 4.1 because the version 4 connector targets it directly. The agent needs Java 17 or later, which we get from a Temurin image. Kafka 4 uses KRaft and has no ZooKeeper.

The important caveats of Cassandra CDC

Be aware of these before you build, because they are properties of Cassandra itself, not bugs:

Commit logs record only the columns that changed, plus the primary key. Unlike a relational database, an update event may not carry the values of columns you did not touch. This affects how a partial update lands in ClickHouse.

In older Cassandra, commit log segments were only handed to CDC once full, adding delay. Cassandra 4 added near-real-time processing, which we enable.

In a real multi-node cluster you run one agent per node, and because of replication the same change can appear more than once, so downstream consumers must deduplicate. Our single-node tutorial sidesteps this, but keep it in mind for production.

The dataset

We use the OpenFlights airports dataset, published under the Open Database License. We model it the Cassandra way, partitioning airports by country, which is a natural Cassandra data model.

Prerequisites

Docker and Docker Compose, plus roughly 6 GB of free memory.

Step 1: Prepare the project and download the pieces

mkdir cassandra-to-clickhouse-cdc
cd cassandra-to-clickhouse-cdc
mkdir -p connect-plugins agent scripts

Download the ClickHouse sink (for the Kafka-to-ClickHouse half):

cd connect-plugins
curl -L -o ch.zip \
  https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.3.7/clickhouse-kafka-connect-v1.3.7.zip
unzip ch.zip && rm ch.zip
cd ..

Download the Debezium Cassandra agent JAR. Confirm the exact filename on Maven Central, since this is the version-sensitive part:

cd agent
curl -L -o debezium-cassandra-agent.jar \
  https://repo1.maven.org/maven2/io/debezium/debezium-connector-cassandra-4/3.5.2.Final/debezium-connector-cassandra-4-3.5.2.Final-jar-with-dependencies.jar
cd ..

Step 2: The agent configuration

The agent reads a properties file rather than a JSON document. Create agent/cdc.properties:

connector.name=cdc-airports
http.port=8000
 
# Where Cassandra's config and CDC commit logs live (shared with Cassandra).
cassandra.config=/etc/cassandra/cassandra.yaml
cassandra.hosts=cassandra
cassandra.port=9042
 
# Cassandra 4 near-real-time commit log processing.
commit.log.real.time.processing.enabled=true
commit.log.marked.complete.poll.interval.ms=1000
 
# Move processed commit logs here, then discard them.
commit.log.relocation.dir=/var/lib/cassandra/cdc_relocate
commit.log.transfer.class=io.debezium.connector.cassandra.BlackHoleCommitLogTransfer
latest.commit.log.only=false
 
# Order events by the Cassandra partition key.
event.order.guarantee.mode=PARTITION_VALUES
 
# Where the agent records its position.
offset.backing.store.dir=/var/lib/cassandra/cdc_offset
 
# Publish straight to Kafka.
topic.prefix=cdc
kafka.producer.bootstrap.servers=kafka:9092
 
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false

The cassandra.config path points at the same cassandra.yaml the database uses, which the agent reads to find the commit log directories. BlackHoleCommitLogTransfer means processed commit logs are simply deleted, which is what you want for a tutorial. Events are published to topics named cdc.<keyspace>.<table>, so our airports topic will be cdc.flights.airports.

Step 3: The ingest script

Create scripts/init.cql. It creates a keyspace and a CDC-enabled table, and loads a slice of the OpenFlights airports data, partitioned by country:

CREATE KEYSPACE IF NOT EXISTS flights
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
 
CREATE TABLE IF NOT EXISTS flights.airports (
    country    text,
    airport_id int,
    name       text,
    city       text,
    iata       text,
    PRIMARY KEY (country, airport_id)
) WITH cdc = true;
 
INSERT INTO flights.airports (country, airport_id, name, city, iata)
  VALUES ('Papua New Guinea', 1, 'Goroka Airport', 'Goroka', 'GKA');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
  VALUES ('Papua New Guinea', 2, 'Madang Airport', 'Madang', 'MAG');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
  VALUES ('Papua New Guinea', 3, 'Mount Hagen Airport', 'Mount Hagen', 'HGU');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
  VALUES ('France', 1386, 'Charles De Gaulle', 'Paris', 'CDG');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
  VALUES ('France', 1382, 'Orly', 'Paris', 'ORY');

The crucial part is WITH cdc = true, which tells Cassandra to preserve this table's commit logs for capture. We partition by country, so each country is a partition and airport_id orders airports within it; this is idiomatic Cassandra modeling.

Step 4: The Docker Compose file

Create docker-compose.yml. The trick that makes the agent work is sharing Cassandra's data and config directories with the agent through named volumes:

services:
  # Cassandra, with CDC switched on in its config at startup.
  cassandra:
    image: cassandra:4.1
    command: >
      bash -c "sed -i 's/^cdc_enabled: false/cdc_enabled: true/' /etc/cassandra/cassandra.yaml
      && exec docker-entrypoint.sh cassandra -f"
    ports:
      - "9042:9042"
    volumes:
      - cassandra-data:/var/lib/cassandra
      - cassandra-config:/etc/cassandra
 
  # The Debezium Cassandra agent, co-located via shared volumes.
  cassandra-agent:
    image: eclipse-temurin:17-jre
    depends_on:
      - cassandra
      - kafka
    volumes:
      - cassandra-data:/var/lib/cassandra
      - cassandra-config:/etc/cassandra:ro
      - ./agent:/agent
    command: >
      bash -c "mkdir -p /var/lib/cassandra/cdc_relocate /var/lib/cassandra/cdc_offset
      && until bash -c 'echo > /dev/tcp/cassandra/9042' 2>/dev/null; do echo waiting for cassandra; sleep 5; done
      && java -jar /agent/debezium-cassandra-agent.jar /agent/cdc.properties"
 
  kafka:
    image: apache/kafka:4.1.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
 
  # Kafka Connect runs only the ClickHouse sink here, not a source.
  connect:
    image: quay.io/debezium/connect:3.5
    depends_on:
      - kafka
    ports:
      - "8083:8083"
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: cdc-connect
      CONFIG_STORAGE_TOPIC: connect_configs
      OFFSET_STORAGE_TOPIC: connect_offsets
      STATUS_STORAGE_TOPIC: connect_statuses
      KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
      CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
    volumes:
      - ./connect-plugins/clickhouse-kafka-connect-v1.3.7:/kafka/connect/clickhouse-kafka-connect
 
  clickhouse:
    image: clickhouse/clickhouse-server:26.3
    ports:
      - "8123:8123"
      - "9000:9000"
    environment:
      CLICKHOUSE_USER: default
      CLICKHOUSE_PASSWORD: clickhouse
      CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
 
volumes:
  cassandra-data:
  cassandra-config:

Two notes. The sed line flips cdc_enabled to true in Cassandra's config before the database starts; because cassandra-config is a named volume, Docker fills it from the image on first run, so the edit sticks. The agent waits for Cassandra's port to open before launching, since it cannot do anything until Cassandra is up.

Step 5: Start infrastructure, then ingest, then the agent

The order matters here, because the agent snapshots CDC-enabled tables when it starts. Bring up everything except the agent first:

docker compose up -d cassandra kafka connect clickhouse

Wait about a minute for Cassandra to be ready, then run the ingest script:

cat scripts/init.cql | docker compose exec -T cassandra cqlsh

Now start the agent, which will snapshot the rows you just loaded and then stream new changes:

docker compose up -d cassandra-agent
docker compose logs -f cassandra-agent

In the logs you should see the agent connect to Cassandra, detect the CDC-enabled flights.airports table, and begin producing. Press Ctrl+C to stop following the logs.

Step 6: Create the target table in ClickHouse

docker compose exec clickhouse clickhouse-client --password clickhouse
CREATE DATABASE IF NOT EXISTS flights;
 
CREATE TABLE flights.airports
(
    country    String,
    airport_id Int32,
    name       String,
    city       String,
    iata       String,
    _version   UInt64,
    _deleted   UInt8
)
ENGINE = ReplacingMergeTree(_version, _deleted)
ORDER BY (country, airport_id);

The ORDER BY matches the Cassandra primary key, country then airport_id, so ReplacingMergeTree collapses versions of the same airport correctly.

Step 7: Register the ClickHouse sink connector

Here is the Cassandra-specific twist. Because the agent published the raw Debezium event without flattening it, the sink connector must do the flattening that other sources did on the source side. So the unwrap transform lives here. Create clickhouse-sink.json:

{
  "name": "clickhouse-sink",
  "config": {
    "connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
    "tasks.max": "1",
    "topics": "cdc.flights.airports",
    "hostname": "clickhouse",
    "port": "8123",
    "database": "flights",
    "username": "default",
    "password": "clickhouse",
    "ssl": "false",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
 
    "transforms": "unwrap,renameFields",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.delete.tombstone.handling.mode": "rewrite",
    "transforms.unwrap.add.fields": "op,source.ts_ms",
    "transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.renameFields.renames": "__source_ts_ms:_version,__deleted:_deleted"
  }
}

Send it:

curl -X POST -H "Content-Type: application/json" \
  --data @clickhouse-sink.json \
  http://localhost:8083/connectors

A version-sensitive caveat: the Cassandra connector's operation codes are not identical to the standard Debezium codes, so depending on your version you may need to adjust the delete handling, or place Debezium's EnvelopeTransformation before ExtractNewRecordState. If deletes do not behave, that is the first thing to check in the Debezium Cassandra docs.

Step 8: Verify the CDC pipeline

Four checks, from source to destination.

First, the agent is healthy. It exposes a health endpoint on port 8000 inside its container:

docker compose logs --tail=30 cassandra-agent

Look for lines showing it detected the CDC table and is processing, with no repeated stack traces.

Second, Cassandra has CDC enabled on the table:

docker compose exec cassandra cqlsh -e "SELECT keyspace_name, table_name, cdc FROM system_schema.tables WHERE keyspace_name='flights' ALLOW FILTERING;"

The cdc column should read True for airports.

Third, events reached Kafka:

docker compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 --topic cdc.flights.airports \
  --from-beginning --max-messages 1

You should see a JSON event for an airport.

Fourth, the data landed in ClickHouse:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT count() FROM flights.airports FINAL"

The count should match the rows you loaded. If all four pass, the pipeline works end to end.

Step 9: See changes flow

Make some changes in Cassandra:

docker compose exec cassandra cqlsh -e "
INSERT INTO flights.airports (country, airport_id, name, city, iata) VALUES ('Germany', 340, 'Frankfurt am Main', 'Frankfurt', 'FRA');
UPDATE flights.airports SET city = 'Goroka City' WHERE country = 'Papua New Guinea' AND airport_id = 1;
DELETE FROM flights.airports WHERE country = 'France' AND airport_id = 1382;
"

Wait a few seconds, then query ClickHouse, hiding deleted rows. Remember to use FINAL:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT country, airport_id, name, city FROM flights.airports FINAL WHERE _deleted = 0 ORDER BY country, airport_id"

You should see the new Frankfurt row, Goroka with its updated city, and Orly gone. Then run an aggregation, the kind ClickHouse excels at:

docker compose exec clickhouse clickhouse-client --password clickhouse \
  --query "SELECT country, count() FROM flights.airports FINAL WHERE _deleted = 0 GROUP BY country ORDER BY count() DESC"

A reminder about partial updates

Recall the caveat that Cassandra commit logs record only changed columns plus the primary key. When you updated only city, the event carried country, airport_id, and city, but the other columns may have arrived empty. With a ReplacingMergeTree, the newest version wins wholesale, so an update that omits columns can blank them out in ClickHouse. For analytics where you need the full row preserved, capture full rows at the source, or reconstruct the latest complete row in ClickHouse using aggregate functions that ignore nulls. This is a Cassandra characteristic, not a ClickHouse one, and it is the main thing that makes Cassandra CDC different to reason about.

Production considerations

Run one agent per Cassandra node, and deduplicate downstream because replication means the same change appears on multiple nodes. Watch the CDC free space on each node, since a full cdc_raw directory causes Cassandra to reject writes to CDC-enabled tables. Use TLS for the Kafka producer, run at least three Kafka brokers, and enable the ClickHouse sink's exactly-once mode when correctness is critical.

Troubleshooting

If no events appear, the most common causes are the table missing cdc = true, or the agent not sharing the same /var/lib/cassandra volume as Cassandra, so it cannot see the commit logs. If the agent starts but never produces, confirm commit.log.real.time.processing.enabled is true, otherwise on older Cassandra you wait for a segment to fill. If deletes misbehave in ClickHouse, revisit the operation-code note in Step 7.

Cleaning up

docker compose down -v

References

What is next

You have now built CDC pipelines for every common source in this series, including the unusual agent-based Cassandra connector. Revisit any of the others to compare: PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and MongoDB.

If you would like help designing a production-grade CDC pipeline into ClickHouse, including the harder cases like Cassandra, the engineers at Quantrail Data do exactly this. Reach out through our services page and we will be glad to help.

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: