UV(独立访客)与PV(页面浏览量)的实时统计方案
请设计一个高并发网站UV和PV的实时统计方案。PV可以通过COUNT直接累加,UV需要去重。请分别给出:Flink SQL实时PV(滚动窗口)+ UV(HyperLogLog),以及ClickHouse中物化视图+uniqCombined的UV预聚合方案。讨论百万级QPS下的写入和查询优化策略。
回答
Yahuda
实时PV/UV统计方案:
1. Flink SQL实时PV:
CREATE TABLE page_views (
page_url STRING,
user_id INT,
event_time TIMESTAMP(3),
WATERMARK FOR event_time AS event_time - INTERVAL '1' SECOND
) WITH (...);
-- 每分钟PV(简单COUNT)
INSERT INTO pv_result
SELECT
window_start,
window_end,
page_url,
COUNT(*) AS pv
FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, page_url;
2. Flink SQL实时UV(HyperLogLog):
-- 方法1:Flink自动优化COUNT DISTINCT为HyperLogLog
SELECT
window_start,
page_url,
COUNT(DISTINCT user_id) AS uv_hll -- Flink内部使用HyperLogLog
FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, page_url;
-- 方法2:显示调用HyperLogLog UDAF
SELECT
window_start,
page_url,
HLL_COMBINE(HLL_SKETCH_AGG(user_id)) AS uv
FROM ... GROUP BY ...;
3. ClickHouse预聚合方案:
-- 物化视图+uniqCombined
CREATE MATERIALIZED VIEW mv_pv_uv TO agg_page_stats
AS SELECT
toStartOfMinute(event_time) AS event_minute,
page_url,
count() AS pv,
uniqCombined(user_id) AS uv -- HyperLogLog变体
FROM raw_events
GROUP BY event_minute, page_url;
-- 查询
SELECT
toStartOfHour(event_minute) AS hour,
page_url,
SUM(pv) AS total_pv,
uniqCombinedMerge(uv) AS total_uv -- 合并多个hll
FROM agg_page_stats
WHERE event_minute >= now() - INTERVAL 1 HOUR
GROUP BY hour, page_url;
4. 百万级QPS优化策略:
| 问题 | 优化方案 |
|---|---|
| 写入瓶颈 | 批量写入(1000+条/批),Kafka分区均匀分布 |
| 热点URL | 加盐打散(page_url + rand()%N)二级聚合 |
| 窗口倾斜 | 两阶段聚合(Local-Global模式) |
| 查询延迟 | 物化视图预聚合,写入即见 |
| 存储膨胀 | 原始数据TTL=7天,聚合数据保留更久 |
5. 端到端延迟:
- Kafka → Flink → ClickHouse:<5秒
- 查询ClickHouse:<100ms