How ClickHouse Queries Apache Iceberg: Internals & Trade-offs

January 27, 2026 · Mohamed Hussain S

Modern data platforms increasingly rely on data lakes for scalable and cost-efficient storage. However, querying data lakes reliably and efficiently has historically been difficult. This is where open table formats such as Apache Iceberg come into play.

In this article, we’ll explore how ClickHouse queries Apache Iceberg tables, focusing on internals, metadata flow, and architectural trade-offs. By the end, you’ll understand how these systems work together and when this approach makes sense in production.

The Problem with Traditional Data Lakes

A data lake typically stores data on object storage such as Amazon S3 or S3-compatible systems like MinIO. While this provides durability and low cost, object storage on its own has fundamental limitations when used as a table store:

  • No transactions spanning multiple files
  • No schema enforcement
  • No consistent notion of “latest data”
  • No coordination between readers and writers

From the storage layer’s perspective, a table is simply a collection of files. This makes correctness fragile as systems scale.

Why Open Table Formats Exist

To solve these issues, open table formats introduce a metadata layer on top of object storage. This layer defines:

  • Which files belong to a table
  • Which version of the table is current
  • How concurrent reads and writes are coordinated

This is where Apache Iceberg enters the picture.

What Is Apache Iceberg?

Apache Iceberg is not a database and not a query engine.
It is an open table format designed to bring database-like guarantees to data lakes.

Iceberg provides:

  • ACID-like transactional behavior
  • Snapshot-based versioning
  • Schema and partition evolution
  • Safe concurrent access for multiple engines

Iceberg achieves this by tracking every file that belongs to a table explicitly in metadata, rather than inferring table contents from directory listings.

Iceberg’s Metadata-First Architecture

An Iceberg table consists of four conceptual layers:

1. Data Files

These are immutable files (usually Parquet; ORC and Avro are also supported) stored in object storage.

2. Manifest Files

Manifest files list data files along with statistics such as:

  • Row counts
  • Partition values
  • Min/max column values

3. Snapshots

Each snapshot represents a consistent version of the table. It points to a manifest list, which in turn references the set of manifest files that make up that version.

4. Table Metadata

The top-level metadata file tracks:

  • Current snapshot
  • Schema definitions
  • Partition specs

Key principle:

Query engines using Iceberg never discover data files by scanning object storage;
they rely entirely on Iceberg metadata to locate valid files.
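
To make the layering concrete, here is a minimal in-memory sketch of the four layers. It is only an illustration: the class names, the `live_files` helper, and the S3 paths are assumptions made up for this example, and real Iceberg metadata is stored as JSON and Avro files rather than Python objects.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DataFile:
    path: str
    row_count: int


@dataclass(frozen=True)
class Manifest:
    data_files: tuple[DataFile, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifests: tuple[Manifest, ...]


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict[int, Snapshot] = field(default_factory=dict)

    def live_files(self, snapshot_id: Optional[int] = None) -> list[str]:
        """Return the data file paths visible in a snapshot (default: current)."""
        sid = self.current_snapshot_id if snapshot_id is None else snapshot_id
        snapshot = self.snapshots[sid]
        return [f.path for m in snapshot.manifests for f in m.data_files]


# Hypothetical table: snapshot 2 reuses the first manifest and adds a second.
m1 = Manifest((DataFile("s3://bucket/t/data/a.parquet", 100),))
m2 = Manifest((DataFile("s3://bucket/t/data/b.parquet", 50),))
table = TableMetadata(
    current_snapshot_id=2,
    snapshots={1: Snapshot(1, (m1,)), 2: Snapshot(2, (m1, m2))},
)

print(table.live_files())   # files in the current snapshot
print(table.live_files(1))  # files as of the older snapshot
```

Note that the file list comes purely from walking metadata; nothing in the sketch ever lists a directory. Older snapshots stay readable because data files and manifests are immutable.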

Writers and Readers in an Iceberg-Based Lakehouse

Who Are the Writers?

Writers are systems that:

  1. Write data files
  2. Update Iceberg metadata
  3. Atomically commit a new snapshot

Typical writers include:

  • Apache Spark
  • Apache Flink
  • Streaming pipelines fed by Apache Kafka

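Uploading files directly to S3 without committing them through Iceberg metadata bypasses the table format: readers will not see those files, and they remain orphaned objects in the bucket rather than part of the table.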
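
The atomic commit in step 3 is commonly described as optimistic concurrency: a writer prepares its changes against the version it read, then swaps the table pointer only if nobody else has committed in the meantime. The sketch below illustrates that idea with a toy in-process catalog. The `Catalog` class, its integer `version`, and `commit_with_retry` are hypothetical names for this example, not an Iceberg API.

```python
import threading


class Catalog:
    """Toy catalog; `version` stands in for a pointer to the current metadata file."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.version = 1

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Move the pointer only if no one committed since `expected` was read."""
        with self._lock:
            if self.version != expected:
                return False
            self.version = new
            return True


def commit_with_retry(catalog: Catalog, max_retries: int = 3) -> int:
    for _ in range(max_retries):
        base = catalog.version
        # A real writer would write data files and a new metadata file here.
        if catalog.compare_and_swap(base, base + 1):
            return base + 1
    raise RuntimeError("commit failed after retries")


catalog = Catalog()
base_a = catalog.version  # writer A reads the table state
base_b = catalog.version  # writer B reads the same state

ok_a = catalog.compare_and_swap(base_a, base_a + 1)  # A commits first
ok_b = catalog.compare_and_swap(base_b, base_b + 1)  # B's base is now stale
retried = commit_with_retry(catalog)                 # B re-reads and retries

print(ok_a, ok_b, retried)
```

The losing writer does not corrupt anything. Its commit is simply rejected, and it has to rebase on the new table state before trying again.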

Who Are the Readers?

Readers are query engines that:

  • Read Iceberg metadata
  • Resolve the latest snapshot
  • Fetch only the relevant data files

Readers never guess which files are valid: a file is part of the table only if the resolved snapshot's metadata references it.

How ClickHouse Queries Apache Iceberg Tables

ClickHouse is primarily a column-oriented OLAP database, but it can also act as a query engine for external data sources.

When ClickHouse queries an Iceberg table, the flow looks like this:

  1. ClickHouse locates the table, either through an Iceberg catalog or from a table path supplied directly
  2. It reads the table metadata file
  3. It resolves the latest snapshot
  4. It reads the manifest list and manifest files for that snapshot
  5. It applies partition and statistics-based pruning
  6. It reads only the required Parquet files from object storage

ClickHouse never scans S3 directories blindly.
All file discovery is driven by Iceberg metadata.
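
As a rough illustration of the metadata-reading and snapshot-resolution steps, the sketch below parses a heavily trimmed table metadata document and finds the manifest list of the current snapshot. The field names (`current-snapshot-id`, `snapshots`, `snapshot-id`, `manifest-list`) follow the Iceberg table spec. The bucket paths, the values, and the `resolve_manifest_list` helper are made up for this example, and this is not how ClickHouse implements the lookup internally.

```python
import json

# Trimmed, hypothetical metadata file; real files carry many more fields.
METADATA_JSON = """
{
  "format-version": 2,
  "location": "s3://bucket/warehouse/events",
  "current-snapshot-id": 2,
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1700000000000,
     "manifest-list": "s3://bucket/warehouse/events/metadata/snap-1.avro"},
    {"snapshot-id": 2, "timestamp-ms": 1700000600000,
     "manifest-list": "s3://bucket/warehouse/events/metadata/snap-2.avro"}
  ]
}
"""


def resolve_manifest_list(metadata_text: str) -> str:
    """Return the manifest-list path of the table's current snapshot."""
    metadata = json.loads(metadata_text)
    current_id = metadata["current-snapshot-id"]
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise LookupError(f"snapshot {current_id} not found")


manifest_list_path = resolve_manifest_list(METADATA_JSON)
print(manifest_list_path)
```

From there, a reader would open the manifest list, follow it to the manifests, and only then decide which data files to fetch.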

Predicate Pushdown and Pruning

Iceberg enables effective pruning by exposing file-level statistics. ClickHouse can leverage this to:

  • Skip entire files
  • Reduce I/O significantly
  • Avoid unnecessary reads

However, predicate pushdown works best when:

  • Filters align with partition specs
  • Expressions are simple and deterministic

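Filters that wrap a column in a function or a complex expression are harder to compare against stored min/max values, so they may reduce pruning effectiveness.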
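
The core of statistics-based pruning is a simple range-overlap test: a file can be skipped when its recorded min/max range for a column cannot contain any value matching the filter. Here is a minimal sketch of that test. The `FileStats` class, the `ts` column, and the file names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_ts: int  # smallest value of the hypothetical `ts` column in the file
    max_ts: int  # largest value of `ts` in the file


def prune(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep files whose [min_ts, max_ts] range overlaps the filter lo <= ts <= hi."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]


files = [
    FileStats("part-0.parquet", 0, 99),
    FileStats("part-1.parquet", 100, 199),
    FileStats("part-2.parquet", 200, 299),
]

# Filter: ts BETWEEN 150 AND 220
selected = prune(files, 150, 220)
print(selected)
```

The test is conservative: a file that passes may still contain no matching rows, but a file that fails is guaranteed to contain none, so skipping it is always safe.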

ClickHouse as a Database vs Query Engine

It’s important to be precise with terminology.

ClickHouse is:

  • A database when it owns its own MergeTree tables
  • A query engine when reading external formats like Iceberg

Both roles are valid, but they imply different trade-offs.

Real-World Example Architecture

  1. Events are produced by edge systems
  2. Data flows through Kafka
  3. Spark or Flink processes the stream
  4. Data is written to Iceberg tables on S3
  5. The writer commits a new Iceberg snapshot
  6. ClickHouse queries the table for analytics

This design ensures:

  • Consistent reads
  • Safe concurrent access
  • Engine independence

Trade-offs of Querying Iceberg with ClickHouse

Advantages

  • No data duplication
  • Strong consistency guarantees
  • Easy schema evolution
  • Multi-engine compatibility

Limitations

  • Metadata parsing overhead
  • Higher query latency than native MergeTree tables
  • Limited indexing compared to ClickHouse-native storage

Iceberg optimizes for correctness and interoperability, not raw OLAP speed.

Iceberg vs Other Table Formats

Other open table formats include:

  • Delta Lake
  • Apache Hudi

At a high level:

  • Iceberg focuses on metadata clarity and engine neutrality
  • Delta Lake integrates deeply with Spark
  • Hudi excels in incremental and CDC-heavy workloads

When This Architecture Makes Sense

Querying Iceberg with ClickHouse is ideal when:

  • Multiple engines need access to the same data
  • Data is large and shared across teams
  • Correctness matters more than lowest latency

It is less suitable when:

  • ClickHouse is the sole analytics engine
  • Sub-second latency is critical
  • Complex indexing is required

Final Thoughts

Apache Iceberg does not replace analytical databases like ClickHouse.
Instead, it complements them by solving the hardest problems in data lakes: correctness, consistency, and coordination.

ClickHouse, in turn, brings fast analytical querying to data that Iceberg safely manages.

Together, they form a powerful and flexible lakehouse architecture.

