All posts
Advanced ClickHouse® Aggregating Functions

Advanced ClickHouse® Aggregating Functions

June 24, 20267 min readSanjeev Kumar G
Share:

Introduction

Aggregation is one of the core strengths of ClickHouse®. While functions such as sum(), count(), and avg() are widely used, real-world analytical workloads often require more sophisticated aggregation techniques.

Large-scale event analytics, observability platforms, recommendation systems, financial reporting, and user behavior analysis frequently depend on advanced aggregating functions that can:

  • Estimate cardinality efficiently
  • Calculate percentiles on billions of rows
  • Track top-performing values
  • Build aggregation states for incremental processing
  • Merge pre-aggregated data
  • Reduce storage and computation costs

This article explores advanced aggregation functions in ClickHouse®, how they work, and when they should be used.


Understanding Aggregate Function States

Before discussing advanced functions, it is important to understand one of the most powerful concepts in ClickHouse®: aggregate states.

Most databases execute aggregation in a single step:

SELECT sum(revenue)
FROM sales;

ClickHouse® internally performs aggregation in two phases:

  1. Build aggregation state
  2. Finalize aggregation result

The database exposes these phases through combinators such as:

sumState()
sumMerge()
avgState()
avgMerge()
uniqState()
uniqMerge()

Example:

SELECT
    uniqState(user_id) AS state
FROM events;

The result is not a number. It is an intermediate aggregation state.

Later:

SELECT
    uniqMerge(state)
FROM aggregated_events;

The final value is computed by merging states.

This mechanism powers:

  • Materialized views
  • AggregatingMergeTree
  • Incremental aggregations
  • Distributed query processing

Cardinality Estimation Functions

Counting unique values is expensive.

A query like:

SELECT count(DISTINCT user_id)
FROM events;

requires maintaining a large hash set.

For datasets containing hundreds of millions or billions of rows, approximate cardinality functions become significantly more efficient.


uniq()

SELECT uniq(user_id)
FROM events;

Characteristics:

  • Approximate
  • Very fast
  • Low memory usage
  • Default recommendation for most workloads

ClickHouse® documentation recommends uniq() as the general-purpose distinct counting function because it balances speed, memory usage, and accuracy effectively.


uniqExact()

SELECT uniqExact(user_id)
FROM events;

Characteristics:

  • Exact result
  • Higher memory consumption
  • Slower than approximate variants

Use when:

  • Financial calculations
  • Compliance reporting
  • Auditing workloads

Avoid using it on extremely high-cardinality datasets unless exactness is mandatory.


uniqCombined()

SELECT uniqCombined(user_id)
FROM events;

Characteristics:

  • Hybrid algorithm
  • Better scalability
  • Lower memory footprint than exact counting
  • High accuracy

Useful when distinct counts become very large.


uniqHLL12()

SELECT uniqHLL12(user_id)
FROM events;

Uses HyperLogLog internally.

Characteristics:

  • Fixed memory consumption
  • Approximate result
  • Suitable for very large datasets

Trade-off:

  • Less accurate than some newer algorithms
  • Primarily useful when predictable memory usage is critical

uniqTheta()

SELECT uniqTheta(user_id)
FROM events;

Based on Theta Sketches.

Advantages:

  • Supports sketch merging
  • Good distributed aggregation behavior
  • High scalability

Common in systems requiring large-scale cardinality estimation across distributed clusters.


Quantile Functions

Averages hide distribution details.

Example:

SELECT avg(response_time_ms)
FROM requests;

A service can have:

  • Average latency = 100ms
  • P99 latency = 5000ms

Users experience the tail latency, not the average.

Quantile functions solve this problem.


quantile()

SELECT quantile(0.95)(response_time_ms)
FROM requests;

Returns approximately:

95th percentile

Meaning:

95% of requests are below this value.


quantiles()

Compute multiple percentiles simultaneously.

SELECT
    quantiles(0.5, 0.9, 0.95, 0.99)
    (response_time_ms)
FROM requests;

Output:

Median
P90
P95
P99

More efficient than calculating each percentile separately.


quantileExact()

SELECT quantileExact(0.99)
       (response_time_ms)
FROM requests;

Characteristics:

  • Exact percentile
  • Higher memory usage
  • More expensive computation

Use when exact percentile calculations are required.


quantileTDigest()

SELECT quantileTDigest(0.99)
       (response_time_ms)
FROM requests;

Uses T-Digest.

Advantages:

  • Efficient memory usage
  • Excellent tail percentile estimation
  • Popular for observability workloads

Frequently used for:

  • API latency
  • Database query latency
  • Infrastructure monitoring

quantileTiming()

SELECT quantileTiming(0.99)
       (response_time_ms)
FROM requests;

Optimized specifically for timing distributions.

Particularly useful for:

  • Response times
  • Service latency
  • Request duration metrics

Top-K Analysis

Finding the most common values is a common analytical requirement.

Examples:

  • Most viewed products
  • Most searched keywords
  • Most active users
  • Most common error codes

topK()

SELECT topK(10)(product_id)
FROM orders;

Returns:

Top 10 most frequent products

Characteristics:

  • Approximate
  • Extremely efficient
  • Suitable for streaming-scale datasets

Instead of sorting every distinct value, ClickHouse® maintains a compact structure tracking likely heavy hitters.


topKWeighted()

SELECT
    topKWeighted(10)
    (product_id, quantity)
FROM sales;

Incorporates weights into frequency estimation.

Useful for:

  • Revenue contribution
  • Purchase quantity
  • Weighted popularity metrics

Statistical Aggregation Functions

ClickHouse® includes several statistical aggregators.


varPop() and varSamp()

Population variance:

SELECT varPop(price)
FROM products;

Sample variance:

SELECT varSamp(price)
FROM products;

stddevPop() and stddevSamp()

Population standard deviation:

SELECT stddevPop(price)
FROM products;

Sample standard deviation:

SELECT stddevSamp(price)
FROM products;

Useful for:

  • Risk analysis
  • Anomaly detection
  • Data quality monitoring

covarPop()

Covariance between two variables:

SELECT covarPop(x, y)
FROM measurements;

Measures how variables move together.


corr()

Pearson correlation coefficient:

SELECT corr(x, y)
FROM measurements;

Returns:

-1 to +1

Useful for exploratory analytics and feature analysis.


Bitmap-Based Aggregations

For very large user sets, bitmaps can outperform traditional distinct counting approaches.


groupBitmap()

SELECT
    groupBitmap(user_id)
FROM events;

Creates a bitmap representation.


bitmapCardinality()

SELECT bitmapCardinality(bitmap_column)
FROM users;

Returns distinct count from a bitmap.

Benefits:

  • Compact storage
  • Fast set operations
  • Efficient merging

Common in:

  • Ad-tech
  • User segmentation
  • Audience analytics

Array-Based Aggregations

Sometimes the goal is not a single value.

Instead, all values must be collected.


groupArray()

SELECT
    user_id,
    groupArray(page)
FROM visits
GROUP BY user_id;

Example result:

["home","pricing","checkout"]

Useful for user journey analysis.


groupUniqArray()

SELECT
    groupUniqArray(page)
FROM visits;

Removes duplicates during aggregation.


groupArraySorted()

SELECT
    groupArraySorted(10)(score)
FROM results;

Returns sorted values.


AggregatingMergeTree and State Functions

One of the most advanced aggregation patterns combines:

  • Aggregate states
  • Materialized views
  • AggregatingMergeTree

Example:

CREATE TABLE daily_users
(
    date Date,
    users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY date;

Materialized view:

CREATE MATERIALIZED VIEW mv_daily_users
TO daily_users
AS
SELECT
    toDate(event_time) AS date,
    uniqState(user_id) AS users
FROM events
GROUP BY date;

Query:

SELECT
    date,
    uniqMerge(users)
FROM daily_users
GROUP BY date;

Benefits:

  • Faster analytical queries
  • Reduced CPU consumption
  • Incremental aggregation
  • Better scalability

This pattern is heavily used in production ClickHouse® deployments.


Exploring ClickHouse® for Your Analytics?

At Quantrail Data, we help teams run ClickHouse® reliably for real-time analytics – from Kubernetes deployments and migrations to performance tuning in production.

We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production – a story we’ve shared here:

Success Story: Quantrail Bare-Metal ClickHouse® Deployment

If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.

Contact

Quantrail Data

Conclusion

Advanced aggregation functions are one of the primary reasons ClickHouse® can execute analytical queries efficiently at scale.

The most important concepts to understand are:

  1. Aggregate states (State() / Merge()).
  2. Approximate cardinality estimation (uniq, uniqCombined, uniqHLL12).
  3. Percentile calculations (quantile, quantileTDigest).
  4. Heavy hitter detection (topK).
  5. Bitmap aggregations (groupBitmap).
  6. Incremental aggregation using AggregatingMergeTree.

For production analytical systems, selecting the appropriate aggregation function often has a greater impact on performance and resource consumption than hardware upgrades or query tuning.

References

clickhouse docs

Share: