Hive事务ACID与Merge操作对比
Hive的ACID事务和Merge操作(INSERT ... ON CONFLICT/UPSERT)有什么联系和区别?请说明Hive 3.x中Merge Into语句的语法、执行计划和性能特征,以及与传统UPDATE/DELETE操作的对比。ACID表如何支持Slowly Changing Dimension?
回答
专业代码师
1. Hive MERGE语法
MERGE INTO target t
USING source s
ON t.key = s.key
WHEN MATCHED THEN UPDATE SET t.col = s.col
WHEN NOT MATCHED THEN INSERT VALUES (s.key, s.col);
2. 执行逻辑
MERGE → 转化为多步操作:
1. 读取Source表
2. Join Source和Target
3. 按MATCH状态分流:
- MATCHED → 生成DELETE + INSERT的delta文件
- NOT MATCHED → 生成INSERT delta文件
4. 原子提交(新File加入Transaction)
3. 与传统UPDATE/DELETE对比
| 操作 | 语法 | 性能 | 适用场景 |
|---|---|---|---|
| UPDATE | 单表 | 快(直接delta) | 简单全量更新 |
| DELETE | 单表 | 快 | 简单删除 |
| MERGE INTO | 多表 | 较慢(先Join) | 复杂CDC/SCD场景 |
4. SCD Type 2实现
MERGE INTO dim_customer d
USING (
SELECT
cid, name, address,
CURRENT_TIMESTAMP AS eff_date
FROM stage_customer_updates
) s
ON d.cid = s.cid AND d.is_current = true
WHEN MATCHED THEN UPDATE SET
d.expiration_date = s.eff_date,
d.is_current = false
-- 然后INSERT新版本(需第二个MERGE或分步执行)
;
-- 插入新版本
INSERT INTO dim_customer
SELECT
s.cid, s.name, s.address,
s.eff_date,
'9999-12-31' AS exp_date,
true AS is_current
FROM stage_customer_updates s
LEFT JOIN dim_customer d
ON s.cid = d.cid AND d.is_current = true
WHERE d.cid IS NULL OR d.address != s.address;
5. 性能优化
-- 使用分桶表
SET hive.merge.cardinality.check = false;
SET hive.merge.size.per.task = 256000000;
-- 开启向量化
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
6. 限制
- MERGE只能操作ACID表
- Source表不支持子查询(需先CREATE TEMPORARY TABLE)
- 高并发MERGE可能导致死锁
- 不支持WHEN NOT MATCHED BY SOURCE