CodeWalk

用户留存(Retention)计算的SQL实现与多维分析

作者:孤独的心 · 2026-05-30 12:55

请给出一个标准的新增用户次日/7日/30日留存率计算的Spark SQL实现。包括:新用户定义(首次启动)、留存回访定义、以及多维留存(按渠道/版本/地区分组)。同时讨论大表优化:BitMap留存、ClickHouse留存函数(retention)、以及Flink实时留存计算方案。

回答

孤独的心

用户留存SQL实现:

1. 核心逻辑

  • 新用户:第一次启动app
  • 留存用户:N天后回访的用户
  • 留存率 = 第N天回访的用户数 / 第0天新增用户数

2. Spark SQL留存计算

-- Step 1: 找出新用户(首次启动日期)
WITH new_users AS (
  SELECT
    user_id,
    MIN(dt) AS install_date
  FROM events
  WHERE event_name = 'app_start'
  GROUP BY user_id
),

-- Step 2: 计算回访情况
retention_base AS (
  SELECT
    n.install_date,
    n.user_id,
    -- 是否有后续回访
    MAX(CASE WHEN e.dt = DATE_ADD(n.install_date, 1) THEN 1 ELSE 0 END) AS day_1_retention,
    MAX(CASE WHEN e.dt = DATE_ADD(n.install_date, 7) THEN 1 ELSE 0 END) AS day_7_retention,
    MAX(CASE WHEN e.dt = DATE_ADD(n.install_date, 30) THEN 1 ELSE 0 END) AS day_30_retention
  FROM new_users n
  LEFT JOIN events e ON n.user_id = e.user_id
  GROUP BY n.install_date, n.user_id
)

-- Step 3: 计算留存率
SELECT
  install_date,
  COUNT(*) AS new_users,
  ROUND(SUM(day_1_retention) * 100.0 / COUNT(*), 2) AS day_1_retention_rate,
  ROUND(SUM(day_7_retention) * 100.0 / COUNT(*), 2) AS day_7_retention_rate,
  ROUND(SUM(day_30_retention) * 100.0 / COUNT(*), 2) AS day_30_retention_rate
FROM retention_base
WHERE install_date >= '2025-04-25'  -- 确保30天前
GROUP BY install_date
ORDER BY install_date;

3. 多维留存(按渠道/版本)

SELECT
  n.install_date,
  n.channel,
  n.app_version,
  ...  -- 同上retention计算
FROM new_users n
LEFT JOIN events e ...

4. ClickHouse留存函数

SELECT
  install_date,
  retention(
    install_date = first_event_date,
    install_date = first_event_date + 1,
    install_date = first_event_date + 7,
    install_date = first_event_date + 30
  ) AS retention_array
FROM ...
GROUP BY install_date;

5. Flink实时留存(滑动窗口)

  • 使用Flink State记录每个user_id的首次活跃日期
  • 每日0点计算前一日新增的次日留存
  • 使用ValueState<Long>存储首次活跃时间戳

6. BitMap优化

  • 每天用BitMap存储活跃用户
  • 留存计算 = BITMAP_INTERSECT(day0_new, dayN_active) / BITMAP_COUNT(day0_new)