CodeWalk

Prometheus + Grafana Monitoring Design and Alert Rules for Big Data Clusters

Author: 屠龙少年 · 2026-05-30 12:55

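Design a Prometheus + Grafana monitoring solution for a big data cluster. Cover: how to collect Hadoop/Spark/Flink/Kafka metrics through JMX Exporter and Node Exporter; the Prometheus metric types (Counter/Gauge/Histogram/Summary) and where each fits in monitoring; and the PromQL for the key alert rules (node down, frequent GC, checkpoint failure, consumer lag). Provide a complete alert rule example.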

Answer

屠龙少年

Prometheus + Grafana monitoring design:

1. Metric collection architecture

Cluster nodes
  ├─ Node Exporter (9100): CPU / memory / disk / network
  ├─ JMX Exporter (9101): HDFS / Spark / Flink JVM metrics
  ├─ Kafka Exporter (9308): Kafka consumer lag / partition state
  ├─ HDFS NameNode HTTP API: DataNode liveness
  └─ Flink Metrics Reporter (9249): job / checkpoint / backpressure
  ↓
Prometheus Server (pull model)
  ↓
Grafana Dashboard (visualization) + Alertmanager (alerting)
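The collection layer above could be wired into Prometheus roughly as follows. This is a minimal sketch, not a production config: every hostname under `example.internal`, the job names, and the rule file path are hypothetical placeholders, and static targets stand in for whatever service discovery the cluster actually uses. The ports match the diagram.

```yaml
# prometheus.yml (sketch; all hostnames, job names, and paths are placeholders)
global:
  scrape_interval: 15s       # how often targets are pulled
  evaluation_interval: 15s   # how often alert rules are evaluated

rule_files:
  - "rules/bigdata_alerts.yml"   # hypothetical path to the alert rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.example.internal:9093"]

scrape_configs:
  - job_name: "node"             # Node Exporter on every cluster host
    static_configs:
      - targets:
          - "worker-01.example.internal:9100"
          - "worker-02.example.internal:9100"

  - job_name: "hadoop-jmx"       # JMX Exporter attached to Hadoop daemons
    static_configs:
      - targets: ["namenode-01.example.internal:9101"]

  - job_name: "kafka"            # Kafka Exporter (consumer lag, partitions)
    static_configs:
      - targets: ["kafka-exporter.example.internal:9308"]

  - job_name: "flink"            # Flink Prometheus metrics reporter
    static_configs:
      - targets:
          - "flink-jm.example.internal:9249"
          - "flink-tm-01.example.internal:9249"
```

The `job` label that Prometheus attaches from `job_name` is what alert expressions filter on, so the names chosen here have to match the selectors used in the rules.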

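2. Prometheus metric types

| Type | Description | Monitoring example |
|------|-------------|--------------------|
| Counter | Cumulative value that only increases (and resets to 0 on process restart); query it with rate() or increase() | node_cpu_seconds_total (cumulative CPU time) |
| Gauge | Instantaneous value that can go up or down | node_memory_MemAvailable_bytes (available memory); flink_jobmanager_job_numRestarts (restart count, since Flink's Prometheus reporter exports Flink counters as gauges) |
| Histogram | Observations counted into buckets; quantiles are computed at query time with histogram_quantile() | prometheus_http_request_duration_seconds_bucket (request latency distribution) |
| Summary | Count and sum of observations, optionally with quantiles computed on the client side | jvm_gc_collection_seconds (GC count and time from the JMX Exporter's JVM collectors); Flink histograms are also exported as summaries |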
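How each type is queried differs, and a few recording rules make that concrete. The sketch below is illustrative only: the group name and the recorded series names are made up for this example, and it assumes Node Exporter, Prometheus's own HTTP metrics, and JMX Exporter JVM metrics are all being scraped.

```yaml
groups:
  - name: metric_type_examples   # hypothetical group name
    rules:
      # Counter: never read the raw value; take a per-second rate over a window.
      - record: instance:node_cpu_busy:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Gauge: the current value is meaningful as-is and can be used in arithmetic.
      - record: instance:node_memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Histogram: aggregate bucket rates by `le`, then estimate a quantile server-side.
      - record: job:prometheus_http_request_duration_seconds:p99_rate5m
        expr: >
          histogram_quantile(0.99,
            sum by (le, job) (rate(prometheus_http_request_duration_seconds_bucket[5m])))

      # Summary: _sum / _count gives an average; client-side quantiles cannot be aggregated.
      - record: instance:jvm_gc_pause_seconds:avg_rate5m
        expr: >
          sum by (instance) (rate(jvm_gc_collection_seconds_sum[5m]))
          /
          sum by (instance) (rate(jvm_gc_collection_seconds_count[5m]))
```

The practical consequence is that a P99 across many instances has to come from a Histogram, because quantiles precomputed in a Summary cannot be meaningfully averaged or summed.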

3. Key alert rules (PromQL)

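# Alert: YARN nodes down
# Assumes every cluster node runs Node Exporter under a scrape job labelled job="node".
- alert: YARNNodeDown
  expr: sum(up{job="node"}) / count(up{job="node"}) < 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "More than 10% of YARN cluster nodes are down"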

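# Alert: Flink checkpoint failure
# Flink's Prometheus reporter exports this counter as a gauge, without the _total suffix.
- alert: FlinkCheckpointFailed
  expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[5m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Flink checkpoints have failed within the last 5 minutes"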

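# Alert: Kafka consumer lag
# kafka_exporter labels this metric with consumergroup/topic/partition; lag is summed across partitions.
- alert: KafkaConsumerLag
  expr: sum by (consumergroup, topic) (kafka_consumergroup_lag{consumergroup=~"flink.*"}) > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Flink consumer group {{ $labels.consumergroup }} is more than 10000 messages behind on {{ $labels.topic }}"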

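# Alert: frequent GC
# jvm_gc_collection_seconds_sum comes from the JMX Exporter's JVM collectors; job="flink" is an assumed scrape job name.
- alert: FlinkGcFrequent
  expr: sum by (instance) (rate(jvm_gc_collection_seconds_sum{job="flink"}[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Flink JVM on {{ $labels.instance }} spends more than 50% of its time in GC"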

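# Alert: HDFS low free space
# Metric names depend on the JMX Exporter rules file in use; these two are the ones assumed by this answer.
- alert: HdfsLowSpace
  expr: (hadoop_namenode_capacity_used / hadoop_namenode_capacity_total) > 0.9
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "HDFS capacity usage is above 90%"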
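The rules above are list items; Prometheus only loads them once they sit under a `groups:` / `rules:` wrapper in a rule file. A complete, self-contained file for the single-node-down case might look like the sketch below. The group name, the `team` label, and the `job="node"` selector are assumptions for illustration.

```yaml
# rules/bigdata_alerts.yml (sketch; names and labels are placeholders)
groups:
  - name: bigdata-cluster-alerts
    interval: 30s                  # evaluation interval for this group
    rules:
      - alert: NodeExporterDown
        # `up` is set by Prometheus itself: 1 if the last scrape succeeded, 0 otherwise.
        expr: up{job="node"} == 0
        for: 5m                    # must stay down for 5 minutes before firing
        labels:
          severity: critical
          team: bigdata            # hypothetical routing label for Alertmanager
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"
          description: "Prometheus has failed to scrape {{ $labels.instance }} (job {{ $labels.job }}) for 5 minutes."
```

A file like this can be validated before reloading Prometheus with `promtool check rules rules/bigdata_alerts.yml`.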

4. Grafana Dashboard

  • Use prebuilt dashboards: Flink (ID: 11920), Kafka (ID: 15873), JVM (ID: 4701)
  • Build custom dashboards: aggregate Spark job execution time and shuffle data volume
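Imported or custom dashboards all need a Prometheus data source to query. One way to set that up reproducibly is Grafana's file-based provisioning; the following is a minimal sketch in which the data source name and the Prometheus URL are placeholders.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (sketch; URL is a placeholder)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies queries to Prometheus
    url: http://prometheus.example.internal:9090
    isDefault: true
```

With the data source provisioned, a prebuilt dashboard can be imported by its ID and pointed at `Prometheus` without per-environment manual setup.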