CodeWalk

UV(独立访客)与PV(页面浏览量)的实时统计方案

作者:Yahuda · 2026-05-30 12:55

请设计一个高并发网站UV和PV的实时统计方案。PV可以通过COUNT直接累加,UV需要去重。请分别给出:Flink SQL实时PV(滚动窗口)+ UV(HyperLogLog),以及ClickHouse中物化视图+uniqCombined的UV预聚合方案。讨论百万级QPS下的写入和查询优化策略。

回答

Yahuda

实时PV/UV统计方案:

1. Flink SQL实时PV

CREATE TABLE page_views (
  page_url STRING,
  user_id INT,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '1' SECOND
) WITH (...);

-- 每分钟PV(简单COUNT)
INSERT INTO pv_result
SELECT
  window_start,
  window_end,
  page_url,
  COUNT(*) AS pv
FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, page_url;

2. Flink SQL实时UV(HyperLogLog)

-- 方法1:Flink自动优化COUNT DISTINCT为HyperLogLog
SELECT
  window_start,
  page_url,
  COUNT(DISTINCT user_id) AS uv_hll  -- Flink内部使用HyperLogLog
FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end, page_url;

-- 方法2:显示调用HyperLogLog UDAF
SELECT
  window_start,
  page_url,
  HLL_COMBINE(HLL_SKETCH_AGG(user_id)) AS uv
FROM ... GROUP BY ...;

3. ClickHouse预聚合方案

-- 物化视图+uniqCombined
CREATE MATERIALIZED VIEW mv_pv_uv TO agg_page_stats
AS SELECT
  toStartOfMinute(event_time) AS event_minute,
  page_url,
  count() AS pv,
  uniqCombined(user_id) AS uv  -- HyperLogLog变体
FROM raw_events
GROUP BY event_minute, page_url;

-- 查询
SELECT
  toStartOfHour(event_minute) AS hour,
  page_url,
  SUM(pv) AS total_pv,
  uniqCombinedMerge(uv) AS total_uv  -- 合并多个hll
FROM agg_page_stats
WHERE event_minute >= now() - INTERVAL 1 HOUR
GROUP BY hour, page_url;

4. 百万级QPS优化策略

问题优化方案
写入瓶颈批量写入(1000+条/批),Kafka分区均匀分布
热点URL加盐打散(page_url + rand()%N)二级聚合
窗口倾斜两阶段聚合(Local-Global模式)
查询延迟物化视图预聚合,写入即见
存储膨胀原始数据TTL=7天,聚合数据保留更久

5. 端到端延迟

  • Kafka → Flink → ClickHouse:<5秒
  • 查询ClickHouse:<100ms