Introduction
Aggregation is one of the core strengths of ClickHouse®. While functions such as sum(), count(), and avg() are widely used, real-world analytical workloads often require more sophisticated aggregation techniques.
Large-scale event analytics, observability platforms, recommendation systems, financial reporting, and user behavior analysis frequently depend on advanced aggregating functions that can:
- Estimate cardinality efficiently
- Calculate percentiles on billions of rows
- Track top-performing values
- Build aggregation states for incremental processing
- Merge pre-aggregated data
- Reduce storage and computation costs
This article explores advanced aggregation functions in ClickHouse®, how they work, and when they should be used.
Understanding Aggregate Function States
Before discussing advanced functions, it is important to understand one of the most powerful concepts in ClickHouse®: aggregate states.
Most databases execute aggregation in a single step:
SELECT sum(revenue)
FROM sales;ClickHouse® internally performs aggregation in two phases:
- Build aggregation state
- Finalize aggregation result
The database exposes these phases through combinators such as:
sumState()
sumMerge()
avgState()
avgMerge()
uniqState()
uniqMerge()Example:
SELECT
uniqState(user_id) AS state
FROM events;The result is not a number. It is an intermediate aggregation state.
Later:
SELECT
uniqMerge(state)
FROM aggregated_events;The final value is computed by merging states.
This mechanism powers:
- Materialized views
- AggregatingMergeTree
- Incremental aggregations
- Distributed query processing
Cardinality Estimation Functions
Counting unique values is expensive.
A query like:
SELECT count(DISTINCT user_id)
FROM events;requires maintaining a large hash set.
For datasets containing hundreds of millions or billions of rows, approximate cardinality functions become significantly more efficient.
uniq()
SELECT uniq(user_id)
FROM events;Characteristics:
- Approximate
- Very fast
- Low memory usage
- Default recommendation for most workloads
ClickHouse® documentation recommends uniq() as the general-purpose distinct counting function because it balances speed, memory usage, and accuracy effectively.
uniqExact()
SELECT uniqExact(user_id)
FROM events;Characteristics:
- Exact result
- Higher memory consumption
- Slower than approximate variants
Use when:
- Financial calculations
- Compliance reporting
- Auditing workloads
Avoid using it on extremely high-cardinality datasets unless exactness is mandatory.
uniqCombined()
SELECT uniqCombined(user_id)
FROM events;Characteristics:
- Hybrid algorithm
- Better scalability
- Lower memory footprint than exact counting
- High accuracy
Useful when distinct counts become very large.
uniqHLL12()
SELECT uniqHLL12(user_id)
FROM events;Uses HyperLogLog internally.
Characteristics:
- Fixed memory consumption
- Approximate result
- Suitable for very large datasets
Trade-off:
- Less accurate than some newer algorithms
- Primarily useful when predictable memory usage is critical
uniqTheta()
SELECT uniqTheta(user_id)
FROM events;Based on Theta Sketches.
Advantages:
- Supports sketch merging
- Good distributed aggregation behavior
- High scalability
Common in systems requiring large-scale cardinality estimation across distributed clusters.
Quantile Functions
Averages hide distribution details.
Example:
SELECT avg(response_time_ms)
FROM requests;A service can have:
- Average latency = 100ms
- P99 latency = 5000ms
Users experience the tail latency, not the average.
Quantile functions solve this problem.
quantile()
SELECT quantile(0.95)(response_time_ms)
FROM requests;Returns approximately:
95th percentileMeaning:
95% of requests are below this value.
quantiles()
Compute multiple percentiles simultaneously.
SELECT
quantiles(0.5, 0.9, 0.95, 0.99)
(response_time_ms)
FROM requests;Output:
Median
P90
P95
P99More efficient than calculating each percentile separately.
quantileExact()
SELECT quantileExact(0.99)
(response_time_ms)
FROM requests;Characteristics:
- Exact percentile
- Higher memory usage
- More expensive computation
Use when exact percentile calculations are required.
quantileTDigest()
SELECT quantileTDigest(0.99)
(response_time_ms)
FROM requests;Uses T-Digest.
Advantages:
- Efficient memory usage
- Excellent tail percentile estimation
- Popular for observability workloads
Frequently used for:
- API latency
- Database query latency
- Infrastructure monitoring
quantileTiming()
SELECT quantileTiming(0.99)
(response_time_ms)
FROM requests;Optimized specifically for timing distributions.
Particularly useful for:
- Response times
- Service latency
- Request duration metrics
Top-K Analysis
Finding the most common values is a common analytical requirement.
Examples:
- Most viewed products
- Most searched keywords
- Most active users
- Most common error codes
topK()
SELECT topK(10)(product_id)
FROM orders;Returns:
Top 10 most frequent productsCharacteristics:
- Approximate
- Extremely efficient
- Suitable for streaming-scale datasets
Instead of sorting every distinct value, ClickHouse® maintains a compact structure tracking likely heavy hitters.
topKWeighted()
SELECT
topKWeighted(10)
(product_id, quantity)
FROM sales;Incorporates weights into frequency estimation.
Useful for:
- Revenue contribution
- Purchase quantity
- Weighted popularity metrics
Statistical Aggregation Functions
ClickHouse® includes several statistical aggregators.
varPop() and varSamp()
Population variance:
SELECT varPop(price)
FROM products;Sample variance:
SELECT varSamp(price)
FROM products;stddevPop() and stddevSamp()
Population standard deviation:
SELECT stddevPop(price)
FROM products;Sample standard deviation:
SELECT stddevSamp(price)
FROM products;Useful for:
- Risk analysis
- Anomaly detection
- Data quality monitoring
covarPop()
Covariance between two variables:
SELECT covarPop(x, y)
FROM measurements;Measures how variables move together.
corr()
Pearson correlation coefficient:
SELECT corr(x, y)
FROM measurements;Returns:
-1 to +1Useful for exploratory analytics and feature analysis.
Bitmap-Based Aggregations
For very large user sets, bitmaps can outperform traditional distinct counting approaches.
groupBitmap()
SELECT
groupBitmap(user_id)
FROM events;Creates a bitmap representation.
bitmapCardinality()
SELECT bitmapCardinality(bitmap_column)
FROM users;Returns distinct count from a bitmap.
Benefits:
- Compact storage
- Fast set operations
- Efficient merging
Common in:
- Ad-tech
- User segmentation
- Audience analytics
Array-Based Aggregations
Sometimes the goal is not a single value.
Instead, all values must be collected.
groupArray()
SELECT
user_id,
groupArray(page)
FROM visits
GROUP BY user_id;Example result:
["home","pricing","checkout"]Useful for user journey analysis.
groupUniqArray()
SELECT
groupUniqArray(page)
FROM visits;Removes duplicates during aggregation.
groupArraySorted()
SELECT
groupArraySorted(10)(score)
FROM results;Returns sorted values.
AggregatingMergeTree and State Functions
One of the most advanced aggregation patterns combines:
- Aggregate states
- Materialized views
- AggregatingMergeTree
Example:
CREATE TABLE daily_users
(
date Date,
users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY date;Materialized view:
CREATE MATERIALIZED VIEW mv_daily_users
TO daily_users
AS
SELECT
toDate(event_time) AS date,
uniqState(user_id) AS users
FROM events
GROUP BY date;Query:
SELECT
date,
uniqMerge(users)
FROM daily_users
GROUP BY date;Benefits:
- Faster analytical queries
- Reduced CPU consumption
- Incremental aggregation
- Better scalability
This pattern is heavily used in production ClickHouse® deployments.
Exploring ClickHouse® for Your Analytics?
At Quantrail Data, we help teams run ClickHouse® reliably for real-time analytics – from Kubernetes deployments and migrations to performance tuning in production.
We see these challenges firsthand while supporting demanding analytics workloads. In one recent engagement, a customer achieved near bare-metal performance with ClickHouse® in production – a story we’ve shared here:
Success Story: Quantrail Bare-Metal ClickHouse® Deployment
If you’re evaluating ClickHouse® or trying to get more out of an existing setup, we’re happy to share practical lessons from real-world deployments.
Contact
Conclusion
Advanced aggregation functions are one of the primary reasons ClickHouse® can execute analytical queries efficiently at scale.
The most important concepts to understand are:
- Aggregate states (
State()/Merge()). - Approximate cardinality estimation (
uniq,uniqCombined,uniqHLL12). - Percentile calculations (
quantile,quantileTDigest). - Heavy hitter detection (
topK). - Bitmap aggregations (
groupBitmap). - Incremental aggregation using
AggregatingMergeTree.
For production analytical systems, selecting the appropriate aggregation function often has a greater impact on performance and resource consumption than hardware upgrades or query tuning.



