I’ve heard a lot about ClickHouse but never had the chance to work with it closely, so I decided to dig in a bit by solving an example task.

Let’s build a simple aggregation pipeline that accepts 1-minute stock data and aggregates it into 4-hour data on the fly—using nothing but ClickHouse.

Input data

Stock market data is typically represented as candlesticks with the following attributes:

  • Time – Unix timestamp
  • Symbol – The stock ticker
  • Open – Opening price
  • High – Highest price during the time interval
  • Low – Lowest price during the time interval
  • Close – Closing price
  • Volume – Number of shares traded

For aggregation, Time and Symbol serve as the aggregation key.
Each of the remaining attributes requires a different aggregation function:

  • Open → Take the first value.
  • High → Take the highest value.
  • Low → Take the lowest value.
  • Close → Take the last value.
  • Volume → Compute the sum of all values.
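
In ClickHouse terms, these map to the argMin(), max(), min(), argMax(), and sum() aggregate functions. As a quick standalone illustration (the column layout and numbers below are made-up sample data, not part of the pipeline):

SELECT argMin(open, time)  AS open,   -- open price of the earliest row
       max(high)           AS high,   -- highest high
       min(low)            AS low,    -- lowest low
       argMax(close, time) AS close,  -- close price of the latest row
       sum(volume)         AS volume  -- total traded volume
FROM values(
    'time DateTime, open Float64, high Float64, low Float64, close Float64, volume Float64',
    ('2025-02-01 01:10:00', 100.0, 105.0, 99.0, 104.0, 10.0),
    ('2025-02-01 01:11:00', 104.0, 106.0, 103.0, 105.0, 20.0));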

Pipeline

We’ll leverage ClickHouse’s materialized views to build an efficient real-time aggregation pipeline:

(Diagram: inserts into candles_raw are picked up by the candles_4h_mv materialized view and written to candles_4h.)

Simply put, the candles_4h_mv materialized view will be triggered on every insert into the candles_raw table.

It will aggregate the data and store the results in the candles_4h table continuously and in real time.

Now, let’s take a look at the database schema.

Database schema

Since we don’t need to store raw 1-minute data, we can keep it simple:

CREATE TABLE candles_raw
(
    dateTime DateTime('UTC'),
    symbol   String,
    open     Float64,
    high     Float64,
    low      Float64,
    close    Float64,
    volume   Float64
) ENGINE = Null;

The Null engine means this table doesn’t actually store data. Instead, it acts as a passthrough for the candles_4h_mv materialized view.
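
You can check the passthrough behaviour yourself: inserts into a Null table are handed to any attached materialized views and then discarded, so a direct SELECT always returns zero rows.

INSERT INTO candles_raw (dateTime, symbol, open, high, low, close, volume)
VALUES ('2025-02-01 01:10:00', 'FOO', 18000.0, 18500.0, 17500.0, 18200.0, 153.45);

SELECT count() FROM candles_raw; -- always 0: the Null engine stores nothing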

Now, let’s create the table for 4-hour aggregated data:

CREATE TABLE candles_4h
(
    startTime DateTime('UTC') CODEC (Delta, ZSTD),
    symbol    LowCardinality(String),
    open      Float64 CODEC (Delta, ZSTD),
    high      Float64 CODEC (Delta, ZSTD),
    low       Float64 CODEC (Delta, ZSTD),
    close     Float64 CODEC (Delta, ZSTD),
    volume    Float64 CODEC (Delta, ZSTD)
) ENGINE = ReplacingMergeTree()
      ORDER BY (startTime, symbol);

This table could be huge, so some optimizations are required:

  • LowCardinality optimizes storage for symbol, which has a limited number of unique values.
  • Delta codec stores differences between consecutive values instead of full values, reducing storage size.
  • ZSTD codec further compresses stored data for efficiency.
  • ReplacingMergeTree engine removes duplicate data.
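
A quick way to see how much these codecs actually save is to compare the compressed and uncompressed column sizes in system.columns:

SELECT name,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'candles_4h';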

⚠️ The ReplacingMergeTree engine does not remove duplicates instantly; it relies on ClickHouse’s background merges. To force cleanup, run:

OPTIMIZE TABLE candles_4h FINAL;
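
Alternatively, duplicates can be collapsed at query time with the FINAL modifier, at the cost of slower reads:

SELECT * FROM candles_4h FINAL;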

If you don’t care about possible duplicates, then the AggregatingMergeTree engine is a better option.

Now, let’s connect these tables with the materialized view.

CREATE MATERIALIZED VIEW candles_4h_mv
    TO candles_4h -- destination table
AS
SELECT toStartOfInterval(dateTime, INTERVAL 4 HOUR) AS startTime, -- 4-hour aggregation interval
       symbol,
       argMin(open, dateTime)                       AS open,      -- take the first `open` price
       max(high)                                    AS high,      -- take the max `high` price
       min(low)                                     AS low,       -- take the min `low` price
       argMax(close, dateTime)                      AS close,     -- take the last `close` price
       sum(volume)                                  AS volume     -- sum `volume` amounts
FROM candles_raw -- source table
GROUP BY symbol, startTime; -- the aggregation key

⚠️ The materialized view aggregates each inserted block independently, so functions like argMin() and argMax() only see the rows of that block. If data for the same 4-hour window arrives out of historical order or split across several inserts, the results may be incorrect.
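
For example, a late 1-minute candle arriving in its own insert would be aggregated on its own, and after a merge only one of the partial rows for the key survives (purely illustrative, not part of the test run below):

-- A late candle arriving in its own insert block (illustrative values).
-- The materialized view sees only this block and produces a second partial
-- aggregate for ('2025-02-01 00:00:00', 'FOO'); on merge, ReplacingMergeTree
-- keeps only the most recently inserted row for that key.
INSERT INTO candles_raw (dateTime, symbol, open, high, low, close, volume)
VALUES ('2025-02-01 01:05:00', 'FOO', 17900.0, 18100.0, 17800.0, 18000.0, 50.0);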

Testing the pipeline

Insert 1-minute data for two symbols, FOO and BAR:

INSERT INTO candles_raw (dateTime, symbol, open, high, low, close, volume)
VALUES ('2025-02-01 01:10:00', 'FOO', 18000.0, 18500.0, 17500.0, 18200.0, 153.45),
       ('2025-02-01 01:10:01', 'FOO', 18100.0, 18700.0, 17400.0, 18500.0, 173.45),
       ('2025-02-01 01:10:00', 'BAR', 18000.0, 18600.0, 17500.0, 18200.0, 163.45);

Insert a duplicate entry for the BAR symbol:

INSERT INTO candles_raw (dateTime, symbol, open, high, low, close, volume)
VALUES ('2025-02-01 01:10:00', 'BAR', 19000.0, 18600.0, 17600.0, 18200.0, 163.45);

Force a merge so that the duplicate is removed:

OPTIMIZE TABLE candles_4h FINAL;

Check the result:

SELECT * FROM candles_4h;

The result:

+-----------------+------+-----+-----+-----+-----+------+
|startTime        |symbol|open |high |low  |close|volume|
+-----------------+------+-----+-----+-----+-----+------+
|2025-02-01T00:00Z|BAR   |19000|18600|17600|18200|163.45|
|2025-02-01T00:00Z|FOO   |18000|18700|17400|18500|326.90|
+-----------------+------+-----+-----+-----+-----+------+

As you can see, the result is correct:

  • FOO entries have been properly aggregated.
  • The first BAR entry was replaced by the most recent insert, as both had the same timestamp.

Conclusion

This approach is highly efficient in terms of storage space and read/write performance.

  • It requires only 90MB to store 3 million records in candles_4h.
  • The query performance is impressive—I was able to retrieve over 100K records in just 40ms.
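
One way to check the on-disk size and row count for yourself is to query system.parts:

SELECT sum(rows)                                      AS rows,
       formatReadableSize(sum(data_compressed_bytes)) AS size_on_disk
FROM system.parts
WHERE active
  AND database = currentDatabase()
  AND table = 'candles_4h';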

Also, additional aggregations like 1-day, 1-week, etc., can be easily added:

(Diagram: the same pipeline extended with additional materialized views and destination tables for 1-day, 1-week, and other intervals.)
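
For instance, a 1-day rollup could reuse exactly the same pattern; the sketch below assumes a candles_1d destination table defined analogously to candles_4h:

CREATE MATERIALIZED VIEW candles_1d_mv
    TO candles_1d -- hypothetical destination table, defined like candles_4h
AS
SELECT toStartOfInterval(dateTime, INTERVAL 1 DAY) AS startTime,
       symbol,
       argMin(open, dateTime)  AS open,
       max(high)               AS high,
       min(low)                AS low,
       argMax(close, dateTime) AS close,
       sum(volume)             AS volume
FROM candles_raw
GROUP BY symbol, startTime;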