All posts
What Is Debezium? Using Change Data Capture to Offload Analytics to ClickHouse®

What Is Debezium? Using Change Data Capture to Offload Analytics to ClickHouse®

June 3, 20269 min readGayathri
Share:

If your dashboards are slow, your nightly reports are getting heavier, or your analysts keep asking to run "just one more query" against the production database, this article is for you. It explains, in plain language, what Debezium and Change Data Capture are, and how they let you move heavy analytics off your main database and onto the ClickHouse® database without disrupting your application.

This is a concept piece. It does not ask you to run anything. Once the ideas are clear, the hands-on tutorials in our CDC series show you exactly how to build the pipeline for PostgreSQL, MySQL, and Oracle, with more databases on the way.

Two kinds of database work: OLTP and OLAP

Almost every data problem in a growing company comes down to one mismatch, so it is worth naming it clearly.

Your application database does transactional work. It handles many small operations: create an order, update a profile, mark an item as shipped. Each touches a few rows and must be fast and correct. This style of work is called OLTP, which stands for Online Transaction Processing. PostgreSQL, MySQL, Oracle, SQL Server, MariaDB, and MongoDB are all excellent at it.

Analytics is a completely different kind of work. A question like "what were our weekly sales by region for the last two years" does not touch a few rows. It scans millions of them, groups them, and adds them up. This style of work is called OLAP, which stands for Online Analytical Processing.

The trouble is that a database tuned for the first kind of work is usually bad at the second. When you run a giant analytical query on your OLTP database, it competes with real customer traffic for memory, CPU, and disk. Reports run slowly, and worse, the application slows down for everyone while the report runs.

Why you should not run analytics on your production database

Teams usually feel this pain in stages. First the reports get slow. Then someone runs a heavy query at the wrong time and the application stutters. Then the team starts scheduling reports for the middle of the night to avoid the traffic, which means the data is always hours old by the time anyone looks at it. Eventually they add read replicas, which helps with read load but still uses a database that was never designed to scan billions of rows quickly.

The clean solution is to stop asking one database to do two opposite jobs. Keep your OLTP database doing what it is good at, and send a continuously updated copy of the data to a database built for analytics. That analytics database is ClickHouse.

Why ClickHouse for the analytics side

ClickHouse is a columnar database built specifically for OLAP. Two ideas explain most of its speed.

First, it stores data by column rather than by row. When you ask for the sum of one column across a billion rows, ClickHouse reads only that column from disk and skips the rest. A row-based database has to read every row in full to get at that one column.

Second, it compresses aggressively and processes data in large vectorized batches, using the full power of modern CPUs. The result is that queries which take minutes on a transactional database often take well under a second on ClickHouse.

ClickHouse is not a replacement for your OLTP database. You would not run your checkout process on it. It is the other half of the picture: the place where reporting, dashboards, and data exploration live.

The missing piece: getting data across, continuously

So we want to keep the operational database as the source of truth and continuously feed a ClickHouse copy. The naive way is to copy the whole table on a schedule, but that is slow, wasteful, and always stale. We want something that copies only what changed, the moment it changes. That technique is called Change Data Capture.

Change Data Capture, or CDC, is based on a simple observation. Every serious database already keeps a private log of every change it makes, so it can recover after a crash. PostgreSQL calls it the write-ahead log. MySQL and MariaDB call it the binary log. Oracle has the redo log. SQL Server has the transaction log. These logs contain a perfect, ordered record of every insert, update, and delete.

CDC means reading that log and turning each entry into an event you can send somewhere else. Because it reads the log rather than querying your tables, it adds almost no load to the database, and because the log is ordered, you get changes in exactly the order they happened.

Where Debezium comes in

Reading each database's internal log is hard. The formats are different, undocumented in places, and full of edge cases. You do not want to write that code yourself.

Debezium is an open-source platform that does it for you. It is the most widely used CDC tool in the world, it is free and Apache-licensed, and it already knows how to read the logs of all the major databases. You point Debezium at your database, and it produces a clean, uniform stream of change events, regardless of which database is underneath.

Debezium provides a separate connector for each database. As of Debezium 3.5, the current stable series, there are connectors for PostgreSQL, MySQL, MariaDB, Oracle, SQL Server, MongoDB, Cassandra, Db2, Informix, Spanner, and more. Each connector understands one source, but they all produce events in the same shape, which is what makes the downstream side so reusable.

A Debezium change event describes one change to one row. It includes the state of the row before the change, the state after the change, the type of operation (insert, update, or delete), and metadata such as a timestamp and the position in the source log. That last detail matters: it gives every event a natural ordering, which we use later to keep the ClickHouse copy correct.

How the pieces fit together

The standard, production-proven architecture has four parts.

Your source database sits at the start, writing changes to its log as it always has. Debezium reads that log and produces change events. Those events flow into Apache Kafka, a durable, ordered log of messages that acts as a buffer between the database and the analytics side. Finally, a sink connector reads the events from Kafka and writes them into ClickHouse.

Kafka in the middle is what makes the whole thing reliable. If ClickHouse is briefly down or slow, the events wait safely in Kafka and are delivered when it recovers. Nothing is lost. Kafka also lets more than one system consume the same change stream, so the same events that feed ClickHouse could also feed a search index or a cache.

On the ClickHouse end, the official ClickHouse Kafka Connect Sink reads the events and inserts them. With recent Kafka versions there is no ZooKeeper to manage, because Kafka now coordinates itself.

The one clever trick: representing updates and deletes

There is a single concept that surprises newcomers, so it is worth previewing here.

ClickHouse is append-only at heart. It is superb at adding rows and reading them back, but it does not update or delete individual rows the way a transactional database does. So how can it reflect an update or a delete that happened in PostgreSQL?

The answer is a special ClickHouse table type called ReplacingMergeTree. Instead of changing a row in place, the pipeline appends a new version of the row, tagged with a version number taken from the source log position. When you query the table with a keyword called FINAL, ClickHouse returns only the newest version of each row. Deletes are handled the same way: the pipeline appends a version of the row marked as deleted, and your queries filter those out.

The mental model is: never change a row, always append a newer version, and let ClickHouse collapse them to the latest state at query time. Every hands-on tutorial in this series uses exactly this pattern, and it is the same for every source database.

When this approach is the right fit

CDC into ClickHouse is an excellent fit when you want near real-time analytics on data that lives in a transactional database, when heavy reporting is hurting your production workload, or when you want to combine data from several operational databases into one analytical view. It shines for dashboards, internal reporting, customer-facing analytics, and ad-hoc data exploration.

It is less suitable when you genuinely need a few-row lookup with strict transactional guarantees, because that is what your OLTP database is already best at. CDC is about feeding analytics, not replacing your primary store. It also adds moving parts, so for a tiny application with light reporting, a nightly export might be simpler. The value grows as your data and your reporting needs grow.

What you need to learn next

To actually build this, you need two practical skills, and we have a guide for each.

First, you need somewhere to run ClickHouse. The companion article, How to run ClickHouse with Docker Compose, shows you how to start a single node for learning and a small sharded, replicated cluster for something closer to production, all on your own machine.

Second, you need to connect your specific database. The hands-on tutorials walk through the full pipeline end to end for PostgreSQL, MySQL, and Oracle, with MongoDB, MariaDB, Cassandra, and SQL Server coming next. The source database changes from guide to guide, but the Kafka and ClickHouse half stays the same, which is the whole point of building on a clean foundation.

Summary

Operational databases are built for transactions, not for scanning billions of rows. When analytics starts to hurt production, the answer is not a bigger transactional database, it is a second database built for analytics. ClickHouse is that database. Change Data Capture, powered by Debezium and Kafka, keeps a ClickHouse copy continuously in sync with your source, reading the database's own change log so it adds almost no load. The result is fast analytics on fresh data, with your production database left to do what it does best.

If you would like help designing a CDC pipeline into ClickHouse for your own stack, the engineers at Quantrail Data do exactly this work. Reach out through our services page and we will be glad to help.

References

Work with Quantrail

Expert ClickHouse services

We design, migrate, tune, and run ClickHouse for teams that own their data, from first architecture through day-two operations. Tell us what you are building and we will help.

Talk to an expert

Manage ClickHouse with CHOps

CHOps is our free, open-source ClickHouse admin tool: monitoring, query profiling, backups, visual access control, and alerting in one self-hosted interface, with zero agents on your servers.

Explore CHOps
Share: