I’ve heard a lot about ClickHouse but never had the chance to work with it closely. I want to solve an example task to dig into ClickHouse a bit more.
Let’s build a simple aggregation pipeline that accepts 1-minute stock data and aggregates it into 4-hour data on the fly—using nothing but ClickHouse.
Input data
Stock market data is typically represented as candlesticks with the following attributes:
- Time – Unix timestamp
- Symbol – The stock ticker
- Open – Opening price
- High – Highest price during the time interval
- Low – Lowest price during the time interval
- Close – Closing price
- Volume – Number of shares traded
For aggregation, Time and Symbol serve as the aggregation key.
Each of the remaining attributes requires a different aggregation function:
- Open → Take the first value.
- High → Take the highest value.
- Low → Take the lowest value.
- Close → Take the last value.
- Volume → Compute the sum of all values.
Pipeline
We’ll leverage ClickHouse’s materialized views to build an efficient real-time aggregation pipeline:
Simply put, the candles_4h_mv materialized view is triggered on every insert into the candles_raw table. It aggregates the incoming data and stores the results in the candles_4h table continuously and in real time.
Now, let’s take a look at the database schema.
Database schema
Since we don’t need to store raw 1-minute data, we can keep it simple:
CREATE TABLE candles_raw
(
    dateTime DateTime('UTC'),
    symbol String,
    open Float64,
    high Float64,
    low Float64,
    close Float64,
    volume Float64
) ENGINE = Null;
The Null engine means this table doesn’t actually store data. Instead, it acts as a passthrough for the candles_4h_mv materialized view.
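Because the Null engine discards rows as soon as dependent materialized views have processed them, querying the raw table always comes back empty:
SELECT count() FROM candles_raw; -- always returns 0: the Null engine stores nothing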
Now, let’s create the table for 4-hour aggregated data:
CREATE TABLE candles_4h
(
    startTime DateTime('UTC') CODEC(Delta, ZSTD),
    symbol LowCardinality(String),
    open Float64 CODEC(Delta, ZSTD),
    high Float64 CODEC(Delta, ZSTD),
    low Float64 CODEC(Delta, ZSTD),
    close Float64 CODEC(Delta, ZSTD),
    volume Float64 CODEC(Delta, ZSTD)
) ENGINE = ReplacingMergeTree()
ORDER BY (startTime, symbol);
This table could be huge, so some optimizations are required:
- LowCardinality optimizes storage for symbol, which has a limited number of unique values.
- The Delta codec stores differences between consecutive values instead of full values, reducing storage size.
- The ZSTD codec further compresses stored data for efficiency.
- The ReplacingMergeTree engine removes duplicate data.
⚠️ The ReplacingMergeTree engine does not remove duplicates instantly; it relies on ClickHouse’s background merges. To force cleanup, run:
OPTIMIZE TABLE candles_4h FINAL;
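Alternatively, you can deduplicate at read time with the FINAL modifier, at the cost of a slower query:
SELECT * FROM candles_4h FINAL; -- merges duplicate rows on the fly for this query only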
If you don’t care about possible duplicates, then the AggregatingMergeTree engine is a better option.
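Here is a minimal sketch of that variant, assuming the same candles_raw source (the candles_4h_agg table and view names are my own). Instead of replacing rows, it stores aggregate function states that ClickHouse merges correctly across separate inserts:
-- hypothetical names for this sketch
CREATE TABLE candles_4h_agg
(
    startTime DateTime('UTC'),
    symbol LowCardinality(String),
    open AggregateFunction(argMin, Float64, DateTime('UTC')),
    high AggregateFunction(max, Float64),
    low AggregateFunction(min, Float64),
    close AggregateFunction(argMax, Float64, DateTime('UTC')),
    volume AggregateFunction(sum, Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY (startTime, symbol);

CREATE MATERIALIZED VIEW candles_4h_agg_mv
TO candles_4h_agg
AS
SELECT toStartOfInterval(dateTime, INTERVAL 4 HOUR) AS startTime,
       symbol,
       argMinState(open, dateTime) AS open,   -- partial state, not a final value
       maxState(high) AS high,
       minState(low) AS low,
       argMaxState(close, dateTime) AS close,
       sumState(volume) AS volume
FROM candles_raw
GROUP BY symbol, startTime;
Reads must then finalize the states with the -Merge combinators, e.g. SELECT argMinMerge(open), maxMerge(high), ... FROM candles_4h_agg GROUP BY startTime, symbol.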
Now, let’s connect these tables with the materialized view.
CREATE MATERIALIZED VIEW candles_4h_mv
TO candles_4h -- destination table
AS
SELECT toStartOfInterval(dateTime, INTERVAL 4 HOUR) AS startTime, -- 4-hour aggregation interval
       symbol,
       argMin(open, dateTime) AS open,   -- take the first `open` price
       max(high) AS high,                -- take the max `high` price
       min(low) AS low,                  -- take the min `low` price
       argMax(close, dateTime) AS close, -- take the last `close` price
       sum(volume) AS volume             -- sum `volume` amounts
FROM candles_raw -- source table
GROUP BY symbol, startTime; -- the aggregation key
⚠️ Aggregation functions like argMin() and argMax() are order-sensitive here: the materialized view aggregates each inserted block independently, so it only sees the rows within a single insert. If data is inserted out of historical order, the results may be incorrect.
Test pipeline
Insert 1-minute data for two symbols, FOO and BAR:
INSERT INTO candles_raw (dateTime, symbol, open, high, low, close, volume)
VALUES ('2025-02-01 01:10:00', 'FOO', 18000.0, 18500.0, 17500.0, 18200.0, 153.45),
('2025-02-01 01:10:01', 'FOO', 18100.0, 18700.0, 17400.0, 18500.0, 173.45),
('2025-02-01 01:10:00', 'BAR', 18000.0, 18600.0, 17500.0, 18200.0, 163.45);
Insert a duplicate entry for the BAR symbol:
INSERT INTO candles_raw (dateTime, symbol, open, high, low, close, volume)
VALUES ('2025-02-01 01:10:00', 'BAR', 19000.0, 18600.0, 17600.0, 18200.0, 163.45);
Ensure that duplicates are removed:
OPTIMIZE TABLE candles_4h FINAL;
Check the result:
SELECT * FROM candles_4h;
The result:
+-----------------+------+-----+-----+-----+-----+------+
|startTime |symbol|open |high |low |close|volume|
+-----------------+------+-----+-----+-----+-----+------+
|2025-02-01T00:00Z|BAR |19000|18600|17600|18200|163.45|
|2025-02-01T00:00Z|FOO |18000|18700|17400|18500|326.90|
+-----------------+------+-----+-----+-----+-----+------+
As you can see, the result is correct:
- FOO entries have been properly aggregated.
- The first BAR entry was replaced by the most recent insert, as both had the same timestamp.
Conclusion
This approach is highly efficient in terms of storage space and read/write performance.
- It requires only 90 MB to store 3 million records in candles_4h.
- The query performance is impressive: I was able to retrieve over 100K records in just 40 ms.
Also, additional aggregations like 1-day, 1-week, etc., can be easily added:
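For instance, a 1-day rollup can reuse the exact same pattern; here is a sketch (the candles_1d names are mine), reading from the same candles_raw source:
CREATE TABLE candles_1d
(
    startTime DateTime('UTC') CODEC(Delta, ZSTD),
    symbol LowCardinality(String),
    open Float64 CODEC(Delta, ZSTD),
    high Float64 CODEC(Delta, ZSTD),
    low Float64 CODEC(Delta, ZSTD),
    close Float64 CODEC(Delta, ZSTD),
    volume Float64 CODEC(Delta, ZSTD)
) ENGINE = ReplacingMergeTree()
ORDER BY (startTime, symbol);

CREATE MATERIALIZED VIEW candles_1d_mv
TO candles_1d -- destination table
AS
SELECT toStartOfInterval(dateTime, INTERVAL 1 DAY) AS startTime, -- 1-day aggregation interval
       symbol,
       argMin(open, dateTime) AS open,
       max(high) AS high,
       min(low) AS low,
       argMax(close, dateTime) AS close,
       sum(volume) AS volume
FROM candles_raw -- the same source table can feed multiple views
GROUP BY symbol, startTime;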