CodeWalk

HyperLogLog与Bitmap在用户画像交叉分析中的组合应用

作者:编译有声 · 2026-05-30 12:55

在大规模用户画像系统中,HyperLogLog(近似基数)和Bitmap(精确基数)各有优缺点。请设计一个混合策略:在什么场景下使用HLL(如7日累计UV)、什么场景下使用Bitmap(如多标签AND/OR圈选)?并说明ClickHouse中uniqCombined(HLL)和RoaringBitmap的联合使用策略,给出一个『高活人群∩高消费人群』的组合查询示例。

回答

编译有声

HLL与Bitmap混合策略:

1. 选型场景

场景推荐原因
多标签AND/OR圈选(≤100万用户)Bitmap精确、支持位运算
用户规模估算(DAU/MAU)HyperLogLog内存极低(12KB),误差<1%
留存计算(多天交集)Bitmap精确计算多天交集
大规模指标看板(近似值可接受)HyperLogLog响应快,资源少

2. ClickHouse混合应用

-- 方法1:uniqCombined(HLL变体)用于大规模近似去重
SELECT
  toDate(event_time) AS dt,
  uniqCombined(user_id) AS dau  -- HLL近似,内存12KB
FROM events
WHERE event_name = 'app_start' AND dt >= yesterday()
GROUP BY dt;

-- 方法2:uniqExact用于精确去重(小范围)
SELECT
  toDate(event_time) AS dt,
  uniqExact(user_id) AS dau_precise
FROM events
GROUP BY dt;

-- 方法3:RoaringBitmap精确交集/并集
SELECT
  bitmapCardinality(
    bitmapAnd(
      bitmapBuild(CAST(groupArrayState(user_id)) AS AggregateFunction(groupBitmap, UInt64)),
      bitmapBuild(...)
    )
  ) AS common_users;

3. 高活人群∩高消费人群查询

-- Create两张Bitmap表
CREATE TABLE bitmap_high_active (
  dt Date,
  user_bitmap AggregateFunction(groupBitmap, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY dt;

CREATE TABLE bitmap_high_spender (
  dt Date,
  user_bitmap AggregateFunction(groupBitmap, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY dt;

-- 查询交集
SELECT 
  bitmapCardinality(
    bitmapAnd(
      bitmapMerge(active.user_bitmap),
      bitmapMerge(spender.user_bitmap)
    )
  ) AS intersection_count
FROM bitmap_high_active active
CROSS JOIN bitmap_high_spender spender
WHERE active.dt = '2025-05-25' AND spender.dt = '2025-05-25';

4. 混合优化的内存对比: | 方法 | 1亿用户 | 5000万用户 | 误差 | |------|---------|-----------|------| | HyperLogLog | ~12KB | ~12KB | ~0.8% | | RoaringBitmap | ~16MB | ~8MB | 0%(精确)| | uniqExact | 巨大(O(N)) | 巨大 | 0%(精确)|

5. 最佳实践

  • 日活统计:HyperLogLog(快速)
  • 留存计算:Bitmap(精确,可用RoaringBitmap计算多天留存)
  • 标签圈选:Bitmap(AND/OR位运算)
  • 7日/30日累计:HyperLogLog Union(uniqCombinedMerge