Flink SQL Group Aggregation与Over Aggregation的区别
Flink SQL中Group Aggregation(分组聚合)和Over Aggregation(窗口聚合/分析函数)在处理流数据时有什么本质区别?请解释Over Aggregation的窗口帧(ROWS BETWEEN/RANGE BETWEEN)在流处理中的语义,以及如何通过ROW_NUMBER()实现去重Top-N。给出一个计算用户累计消费金额的Over Aggregation示例。
回答
孤独的心
Group Aggregation vs Over Aggregation区别:
1. Group Aggregation(分组聚合):
SELECT user_id, SUM(amount) AS total
FROM orders
GROUP BY user_id;
- 按分组键聚合,每个分组输出一行
- 流模式下输出为更新流(changelog stream)
- 结果随新数据不断更新
2. Over Aggregation(窗口聚合/分析函数):
SELECT
order_id,
user_id,
amount,
order_time,
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY order_time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_amount
FROM orders;
- 在每一行上计算窗口帧内的聚合
- 不改变行数(每行对应输出一行)
- 适合计算累计值、移动平均、排名
3. ORDER BY +窗口帧定义: | 窗口帧 | 含义 | |--------|------| | ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW | 从起点到当前行(累计)| | ROWS BETWEEN 5 PRECEDING AND CURRENT ROW | 前5行到当前行(移动平均)| | RANGE BETWEEN INTERVAL '1' HOUR PRECEDING AND CURRENT ROW | 时间窗口内 |
4. ROW_NUMBER()去重Top-N:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY order_time DESC
) AS rn
FROM orders
) WHERE rn = 1; -- 每个用户最新订单
关键区别: | 维度 | Group Agg | Over Agg | |------|-----------|----------| | 输出行数 | 按分组压缩 | 与输入相同 | | 输出模式 | 更新流(Retract) | 追加流(Append)或更新流 | | 状态大小 | 每个分组一个值 | 窗口帧内所有行 | | 适用场景 | 汇总统计 | 排名/累计/移动计算 |