This guide shows you how to stream changes from Apache Cassandra into the ClickHouse® database in near real time, using Debezium, Apache Kafka, and Docker. Cassandra is the most unusual source in this series, because its Debezium connector does not work like any of the others, and we explain exactly why and how.
This article is self-contained. If Debezium and ClickHouse are brand new to you, the short overview in What is Debezium and how to offload analytics to ClickHouse is a good primer first. Because Cassandra CDC is genuinely more involved than the other sources, it helps to have built one of the simpler pipelines, such as the PostgreSQL one, before this.
A note before you start
The Debezium Cassandra connector is an incubating component, and it is the most advanced setup in this series. The concepts below are correct and stable, but exact artifact names and a few property names move between versions, so treat this as a working blueprint and check the Debezium Cassandra documentation for your exact version. We flag the version-sensitive parts as we go.
What you will build
A pipeline where Cassandra writes changes to its commit log, a Debezium agent running alongside Cassandra reads that commit log and turns each change into an event, Apache Kafka stores the events durably, and the ClickHouse Kafka Connect Sink writes them into a ClickHouse table.
How Cassandra CDC is fundamentally different
Every other connector in this series runs inside Kafka Connect. The Cassandra connector does not. This is the single most important thing to understand.
Cassandra is a distributed database where data lives on many nodes, and each node keeps its own commit log of the writes it received. There is no single central log to read. So Debezium's Cassandra connector is built as a small Java agent that you run directly on each Cassandra node. The agent watches that node's commit log directory, turns the changes it finds into Debezium events, and publishes them straight to Kafka with a Kafka producer. It is not a Kafka Connect connector, you do not register it with a REST call, and it does not appear in the connector plugin list.
This has two consequences that shape everything below. First, the agent must share the Cassandra node's files, specifically its CDC commit log directory, so in Docker the agent and Cassandra share a volume. Second, because the agent is not in Kafka Connect, it cannot apply the flattening transform we used on every other source. So for Cassandra, we move that transform to the ClickHouse sink instead.
Why Cassandra and ClickHouse are a great pair
Cassandra is superb at high-volume writes spread across many nodes, but it is deliberately limited at analytical queries: no joins, no easy aggregations across partitions. ClickHouse is the opposite, built for exactly those large scans and aggregations. Streaming Cassandra changes into ClickHouse gives you a queryable analytical copy of data that is otherwise hard to analyze in place.
The tools and the exact versions
| Component | Role | Image and version |
|---|---|---|
| Cassandra | Source database | cassandra:4.1 |
| Debezium Cassandra agent | Reads the commit log | debezium-connector-cassandra-4 3.5.2.Final (a JAR) |
| Apache Kafka | Event log / transport | apache/kafka:4.1.0 (KRaft mode, no ZooKeeper) |
| ClickHouse Kafka Connect Sink | Loads events into ClickHouse | v1.3.7 |
| ClickHouse | Analytics database | clickhouse/clickhouse-server:26.3 (LTS) |
The Cassandra agent's JAR flavor must match your Cassandra major version: debezium-connector-cassandra-4 for Cassandra 4.x. We use Cassandra 4.1 because the version 4 connector targets it directly. The agent needs Java 17 or later, which we get from a Temurin image. Kafka 4 uses KRaft and has no ZooKeeper.
The important caveats of Cassandra CDC
Be aware of these before you build, because they are properties of Cassandra itself, not bugs:
Commit logs record only the columns that changed, plus the primary key. Unlike a relational database, an update event may not carry the values of columns you did not touch. This affects how a partial update lands in ClickHouse.
In older Cassandra, commit log segments were only handed to CDC once full, adding delay. Cassandra 4 added near-real-time processing, which we enable.
In a real multi-node cluster you run one agent per node, and because of replication the same change can appear more than once, so downstream consumers must deduplicate. Our single-node tutorial sidesteps this, but keep it in mind for production.
The dataset
We use the OpenFlights airports dataset, published under the Open Database License. We model it the Cassandra way, partitioning airports by country, which is a natural Cassandra data model.
Prerequisites
Docker and Docker Compose, plus roughly 6 GB of free memory.
Step 1: Prepare the project and download the pieces
mkdir cassandra-to-clickhouse-cdc
cd cassandra-to-clickhouse-cdc
mkdir -p connect-plugins agent scriptsDownload the ClickHouse sink (for the Kafka-to-ClickHouse half):
cd connect-plugins
curl -L -o ch.zip \
https://github.com/ClickHouse/clickhouse-kafka-connect/releases/download/v1.3.7/clickhouse-kafka-connect-v1.3.7.zip
unzip ch.zip && rm ch.zip
cd ..Download the Debezium Cassandra agent JAR. Confirm the exact filename on Maven Central, since this is the version-sensitive part:
cd agent
curl -L -o debezium-cassandra-agent.jar \
https://repo1.maven.org/maven2/io/debezium/debezium-connector-cassandra-4/3.5.2.Final/debezium-connector-cassandra-4-3.5.2.Final-jar-with-dependencies.jar
cd ..Step 2: The agent configuration
The agent reads a properties file rather than a JSON document. Create agent/cdc.properties:
connector.name=cdc-airports
http.port=8000
# Where Cassandra's config and CDC commit logs live (shared with Cassandra).
cassandra.config=/etc/cassandra/cassandra.yaml
cassandra.hosts=cassandra
cassandra.port=9042
# Cassandra 4 near-real-time commit log processing.
commit.log.real.time.processing.enabled=true
commit.log.marked.complete.poll.interval.ms=1000
# Move processed commit logs here, then discard them.
commit.log.relocation.dir=/var/lib/cassandra/cdc_relocate
commit.log.transfer.class=io.debezium.connector.cassandra.BlackHoleCommitLogTransfer
latest.commit.log.only=false
# Order events by the Cassandra partition key.
event.order.guarantee.mode=PARTITION_VALUES
# Where the agent records its position.
offset.backing.store.dir=/var/lib/cassandra/cdc_offset
# Publish straight to Kafka.
topic.prefix=cdc
kafka.producer.bootstrap.servers=kafka:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=falseThe cassandra.config path points at the same cassandra.yaml the database uses, which the agent reads to find the commit log directories. BlackHoleCommitLogTransfer means processed commit logs are simply deleted, which is what you want for a tutorial. Events are published to topics named cdc.<keyspace>.<table>, so our airports topic will be cdc.flights.airports.
Step 3: The ingest script
Create scripts/init.cql. It creates a keyspace and a CDC-enabled table, and loads a slice of the OpenFlights airports data, partitioned by country:
CREATE KEYSPACE IF NOT EXISTS flights
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS flights.airports (
country text,
airport_id int,
name text,
city text,
iata text,
PRIMARY KEY (country, airport_id)
) WITH cdc = true;
INSERT INTO flights.airports (country, airport_id, name, city, iata)
VALUES ('Papua New Guinea', 1, 'Goroka Airport', 'Goroka', 'GKA');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
VALUES ('Papua New Guinea', 2, 'Madang Airport', 'Madang', 'MAG');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
VALUES ('Papua New Guinea', 3, 'Mount Hagen Airport', 'Mount Hagen', 'HGU');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
VALUES ('France', 1386, 'Charles De Gaulle', 'Paris', 'CDG');
INSERT INTO flights.airports (country, airport_id, name, city, iata)
VALUES ('France', 1382, 'Orly', 'Paris', 'ORY');The crucial part is WITH cdc = true, which tells Cassandra to preserve this table's commit logs for capture. We partition by country, so each country is a partition and airport_id orders airports within it; this is idiomatic Cassandra modeling.
Step 4: The Docker Compose file
Create docker-compose.yml. The trick that makes the agent work is sharing Cassandra's data and config directories with the agent through named volumes:
services:
# Cassandra, with CDC switched on in its config at startup.
cassandra:
image: cassandra:4.1
command: >
bash -c "sed -i 's/^cdc_enabled: false/cdc_enabled: true/' /etc/cassandra/cassandra.yaml
&& exec docker-entrypoint.sh cassandra -f"
ports:
- "9042:9042"
volumes:
- cassandra-data:/var/lib/cassandra
- cassandra-config:/etc/cassandra
# The Debezium Cassandra agent, co-located via shared volumes.
cassandra-agent:
image: eclipse-temurin:17-jre
depends_on:
- cassandra
- kafka
volumes:
- cassandra-data:/var/lib/cassandra
- cassandra-config:/etc/cassandra:ro
- ./agent:/agent
command: >
bash -c "mkdir -p /var/lib/cassandra/cdc_relocate /var/lib/cassandra/cdc_offset
&& until bash -c 'echo > /dev/tcp/cassandra/9042' 2>/dev/null; do echo waiting for cassandra; sleep 5; done
&& java -jar /agent/debezium-cassandra-agent.jar /agent/cdc.properties"
kafka:
image: apache/kafka:4.1.0
ports:
- "9092:9092"
environment:
KAFKA_NODE_ID: 1
KAFKA_PROCESS_ROLES: broker,controller
KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 0
KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
# Kafka Connect runs only the ClickHouse sink here, not a source.
connect:
image: quay.io/debezium/connect:3.5
depends_on:
- kafka
ports:
- "8083:8083"
environment:
BOOTSTRAP_SERVERS: kafka:9092
GROUP_ID: cdc-connect
CONFIG_STORAGE_TOPIC: connect_configs
OFFSET_STORAGE_TOPIC: connect_offsets
STATUS_STORAGE_TOPIC: connect_statuses
KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
CONNECT_KEY_CONVERTER_SCHEMAS_ENABLE: "false"
CONNECT_VALUE_CONVERTER_SCHEMAS_ENABLE: "false"
volumes:
- ./connect-plugins/clickhouse-kafka-connect-v1.3.7:/kafka/connect/clickhouse-kafka-connect
clickhouse:
image: clickhouse/clickhouse-server:26.3
ports:
- "8123:8123"
- "9000:9000"
environment:
CLICKHOUSE_USER: default
CLICKHOUSE_PASSWORD: clickhouse
CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT: "1"
ulimits:
nofile:
soft: 262144
hard: 262144
volumes:
cassandra-data:
cassandra-config:Two notes. The sed line flips cdc_enabled to true in Cassandra's config before the database starts; because cassandra-config is a named volume, Docker fills it from the image on first run, so the edit sticks. The agent waits for Cassandra's port to open before launching, since it cannot do anything until Cassandra is up.
Step 5: Start infrastructure, then ingest, then the agent
The order matters here, because the agent snapshots CDC-enabled tables when it starts. Bring up everything except the agent first:
docker compose up -d cassandra kafka connect clickhouseWait about a minute for Cassandra to be ready, then run the ingest script:
cat scripts/init.cql | docker compose exec -T cassandra cqlshNow start the agent, which will snapshot the rows you just loaded and then stream new changes:
docker compose up -d cassandra-agent
docker compose logs -f cassandra-agentIn the logs you should see the agent connect to Cassandra, detect the CDC-enabled flights.airports table, and begin producing. Press Ctrl+C to stop following the logs.
Step 6: Create the target table in ClickHouse
docker compose exec clickhouse clickhouse-client --password clickhouseCREATE DATABASE IF NOT EXISTS flights;
CREATE TABLE flights.airports
(
country String,
airport_id Int32,
name String,
city String,
iata String,
_version UInt64,
_deleted UInt8
)
ENGINE = ReplacingMergeTree(_version, _deleted)
ORDER BY (country, airport_id);The ORDER BY matches the Cassandra primary key, country then airport_id, so ReplacingMergeTree collapses versions of the same airport correctly.
Step 7: Register the ClickHouse sink connector
Here is the Cassandra-specific twist. Because the agent published the raw Debezium event without flattening it, the sink connector must do the flattening that other sources did on the source side. So the unwrap transform lives here. Create clickhouse-sink.json:
{
"name": "clickhouse-sink",
"config": {
"connector.class": "com.clickhouse.kafka.connect.ClickHouseSinkConnector",
"tasks.max": "1",
"topics": "cdc.flights.airports",
"hostname": "clickhouse",
"port": "8123",
"database": "flights",
"username": "default",
"password": "clickhouse",
"ssl": "false",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "false",
"transforms": "unwrap,renameFields",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.delete.tombstone.handling.mode": "rewrite",
"transforms.unwrap.add.fields": "op,source.ts_ms",
"transforms.renameFields.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.renameFields.renames": "__source_ts_ms:_version,__deleted:_deleted"
}
}Send it:
curl -X POST -H "Content-Type: application/json" \
--data @clickhouse-sink.json \
http://localhost:8083/connectorsA version-sensitive caveat: the Cassandra connector's operation codes are not identical to the standard Debezium codes, so depending on your version you may need to adjust the delete handling, or place Debezium's EnvelopeTransformation before ExtractNewRecordState. If deletes do not behave, that is the first thing to check in the Debezium Cassandra docs.
Step 8: Verify the CDC pipeline
Four checks, from source to destination.
First, the agent is healthy. It exposes a health endpoint on port 8000 inside its container:
docker compose logs --tail=30 cassandra-agentLook for lines showing it detected the CDC table and is processing, with no repeated stack traces.
Second, Cassandra has CDC enabled on the table:
docker compose exec cassandra cqlsh -e "SELECT keyspace_name, table_name, cdc FROM system_schema.tables WHERE keyspace_name='flights' ALLOW FILTERING;"The cdc column should read True for airports.
Third, events reached Kafka:
docker compose exec kafka /opt/kafka/bin/kafka-console-consumer.sh \
--bootstrap-server kafka:9092 --topic cdc.flights.airports \
--from-beginning --max-messages 1You should see a JSON event for an airport.
Fourth, the data landed in ClickHouse:
docker compose exec clickhouse clickhouse-client --password clickhouse \
--query "SELECT count() FROM flights.airports FINAL"The count should match the rows you loaded. If all four pass, the pipeline works end to end.
Step 9: See changes flow
Make some changes in Cassandra:
docker compose exec cassandra cqlsh -e "
INSERT INTO flights.airports (country, airport_id, name, city, iata) VALUES ('Germany', 340, 'Frankfurt am Main', 'Frankfurt', 'FRA');
UPDATE flights.airports SET city = 'Goroka City' WHERE country = 'Papua New Guinea' AND airport_id = 1;
DELETE FROM flights.airports WHERE country = 'France' AND airport_id = 1382;
"Wait a few seconds, then query ClickHouse, hiding deleted rows. Remember to use FINAL:
docker compose exec clickhouse clickhouse-client --password clickhouse \
--query "SELECT country, airport_id, name, city FROM flights.airports FINAL WHERE _deleted = 0 ORDER BY country, airport_id"You should see the new Frankfurt row, Goroka with its updated city, and Orly gone. Then run an aggregation, the kind ClickHouse excels at:
docker compose exec clickhouse clickhouse-client --password clickhouse \
--query "SELECT country, count() FROM flights.airports FINAL WHERE _deleted = 0 GROUP BY country ORDER BY count() DESC"A reminder about partial updates
Recall the caveat that Cassandra commit logs record only changed columns plus the primary key. When you updated only city, the event carried country, airport_id, and city, but the other columns may have arrived empty. With a ReplacingMergeTree, the newest version wins wholesale, so an update that omits columns can blank them out in ClickHouse. For analytics where you need the full row preserved, capture full rows at the source, or reconstruct the latest complete row in ClickHouse using aggregate functions that ignore nulls. This is a Cassandra characteristic, not a ClickHouse one, and it is the main thing that makes Cassandra CDC different to reason about.
Production considerations
Run one agent per Cassandra node, and deduplicate downstream because replication means the same change appears on multiple nodes. Watch the CDC free space on each node, since a full cdc_raw directory causes Cassandra to reject writes to CDC-enabled tables. Use TLS for the Kafka producer, run at least three Kafka brokers, and enable the ClickHouse sink's exactly-once mode when correctness is critical.
Troubleshooting
If no events appear, the most common causes are the table missing cdc = true, or the agent not sharing the same /var/lib/cassandra volume as Cassandra, so it cannot see the commit logs. If the agent starts but never produces, confirm commit.log.real.time.processing.enabled is true, otherwise on older Cassandra you wait for a segment to fill. If deletes misbehave in ClickHouse, revisit the operation-code note in Step 7.
Cleaning up
docker compose down -vReferences
- Debezium Cassandra connector documentation
- Debezium Cassandra connector source code
- Debezium New Record State Extraction (event flattening) SMT
- ClickHouse Kafka Connect Sink
- OpenFlights dataset (Open Database License)
What is next
You have now built CDC pipelines for every common source in this series, including the unusual agent-based Cassandra connector. Revisit any of the others to compare: PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, and MongoDB.
If you would like help designing a production-grade CDC pipeline into ClickHouse, including the harder cases like Cassandra, the engineers at Quantrail Data do exactly this. Reach out through our services page and we will be glad to help.



