用户留存(Retention)计算的SQL实现与多维分析
请给出一个标准的新增用户次日/7日/30日留存率计算的Spark SQL实现。包括:新用户定义(首次启动)、留存回访定义、以及多维留存(按渠道/版本/地区分组)。同时讨论大表优化:BitMap留存、ClickHouse留存函数(retention)、以及Flink实时留存计算方案。
回答
孤独的心
用户留存SQL实现:
1. 核心逻辑:
- 新用户:第一次启动app
- 留存用户:N天后回访的用户
- 留存率 = 第N天回访的用户数 / 第0天新增用户数
2. Spark SQL留存计算:
-- Step 1: 找出新用户(首次启动日期)
WITH new_users AS (
SELECT
user_id,
MIN(dt) AS install_date
FROM events
WHERE event_name = 'app_start'
GROUP BY user_id
),
-- Step 2: 计算回访情况
retention_base AS (
SELECT
n.install_date,
n.user_id,
-- 是否有后续回访
MAX(CASE WHEN e.dt = DATE_ADD(n.install_date, 1) THEN 1 ELSE 0 END) AS day_1_retention,
MAX(CASE WHEN e.dt = DATE_ADD(n.install_date, 7) THEN 1 ELSE 0 END) AS day_7_retention,
MAX(CASE WHEN e.dt = DATE_ADD(n.install_date, 30) THEN 1 ELSE 0 END) AS day_30_retention
FROM new_users n
LEFT JOIN events e ON n.user_id = e.user_id
GROUP BY n.install_date, n.user_id
)
-- Step 3: 计算留存率
SELECT
install_date,
COUNT(*) AS new_users,
ROUND(SUM(day_1_retention) * 100.0 / COUNT(*), 2) AS day_1_retention_rate,
ROUND(SUM(day_7_retention) * 100.0 / COUNT(*), 2) AS day_7_retention_rate,
ROUND(SUM(day_30_retention) * 100.0 / COUNT(*), 2) AS day_30_retention_rate
FROM retention_base
WHERE install_date >= '2025-04-25' -- 确保30天前
GROUP BY install_date
ORDER BY install_date;
3. 多维留存(按渠道/版本):
SELECT
n.install_date,
n.channel,
n.app_version,
... -- 同上retention计算
FROM new_users n
LEFT JOIN events e ...
4. ClickHouse留存函数:
SELECT
install_date,
retention(
install_date = first_event_date,
install_date = first_event_date + 1,
install_date = first_event_date + 7,
install_date = first_event_date + 30
) AS retention_array
FROM ...
GROUP BY install_date;
5. Flink实时留存(滑动窗口):
- 使用Flink State记录每个user_id的首次活跃日期
- 每日0点计算前一日新增的次日留存
- 使用
ValueState<Long>存储首次活跃时间戳
6. BitMap优化:
- 每天用BitMap存储活跃用户
- 留存计算 =
BITMAP_INTERSECT(day0_new, dayN_active) / BITMAP_COUNT(day0_new)