CodeWalk

Hive事务ACID与Merge操作对比

作者:专业代码师 · 2026-05-30 12:55

Hive的ACID事务和Merge操作(INSERT ... ON CONFLICT/UPSERT)有什么联系和区别?请说明Hive 3.x中Merge Into语句的语法、执行计划和性能特征,以及与传统UPDATE/DELETE操作的对比。ACID表如何支持Slowly Changing Dimension?

回答

专业代码师

1. Hive MERGE语法

MERGE INTO target t
USING source s
ON t.key = s.key
WHEN MATCHED THEN UPDATE SET t.col = s.col
WHEN NOT MATCHED THEN INSERT VALUES (s.key, s.col);

2. 执行逻辑

MERGE → 转化为多步操作:
1. 读取Source表
2. Join Source和Target
3. 按MATCH状态分流:
   - MATCHED → 生成DELETE + INSERT的delta文件
   - NOT MATCHED → 生成INSERT delta文件
4. 原子提交(新File加入Transaction)

3. 与传统UPDATE/DELETE对比

操作语法性能适用场景
UPDATE单表快(直接delta)简单全量更新
DELETE单表简单删除
MERGE INTO多表较慢(先Join)复杂CDC/SCD场景

4. SCD Type 2实现

MERGE INTO dim_customer d
USING (
  SELECT 
    cid, name, address,
    CURRENT_TIMESTAMP AS eff_date
  FROM stage_customer_updates
) s
ON d.cid = s.cid AND d.is_current = true
WHEN MATCHED THEN UPDATE SET 
  d.expiration_date = s.eff_date,
  d.is_current = false
-- 然后INSERT新版本(需第二个MERGE或分步执行)
;

-- 插入新版本
INSERT INTO dim_customer
SELECT 
  s.cid, s.name, s.address,
  s.eff_date, 
  '9999-12-31' AS exp_date,
  true AS is_current
FROM stage_customer_updates s
LEFT JOIN dim_customer d 
ON s.cid = d.cid AND d.is_current = true
WHERE d.cid IS NULL OR d.address != s.address;

5. 性能优化

-- 使用分桶表
SET hive.merge.cardinality.check = false;
SET hive.merge.size.per.task = 256000000;

-- 开启向量化
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;

6. 限制

  • MERGE只能操作ACID表
  • Source表不支持子查询(需先CREATE TEMPORARY TABLE)
  • 高并发MERGE可能导致死锁
  • 不支持WHEN NOT MATCHED BY SOURCE