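Prometheus + Grafana Monitoring Design and Alert Rules for Big Data Clusters
Please design a monitoring solution for a big data cluster based on Prometheus + Grafana. It should cover: how to collect Hadoop/Spark/Flink/Kafka metrics through JMX Exporter and Node Exporter; the Prometheus metric types (Counter/Gauge/Histogram/Summary) and where each applies in monitoring; and the PromQL configuration for key alert rules (node down, frequent GC, checkpoint failure, consumer lag). Provide a complete alert rule example.
Answer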
屠龙少年
Prometheus + Grafana monitoring plan:
1. Metric collection architecture:
Cluster nodes
├─ Node Exporter (9100): CPU / memory / disk / network
├─ JMX Exporter (9101, user-chosen port): HDFS / Spark / Flink JVM metrics
├─ Kafka Exporter (9308): Kafka consumer lag / partition state
├─ HDFS NameNode HTTP API: DataNode liveness
└─ Flink Metrics Reporter (9249): job / checkpoint / backpressure
↓
Prometheus Server (pull mode)
↓
Grafana Dashboard (visualization) + Alertmanager (alerting)
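The pull model above could be wired up with a scrape configuration along the following lines. This is a minimal sketch rather than a production config: the ports come from the diagram, while every hostname (worker-01, namenode-01, and so on), the job names, and the rules/ path are placeholders assumed for illustration.

```yaml
# prometheus.yml (sketch). All hostnames and job names are placeholders.
global:
  scrape_interval: 15s       # how often targets are pulled
  evaluation_interval: 15s   # how often alert rules are evaluated

rule_files:
  - "rules/*.yml"            # assumed location of the alert rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "node"         # Node Exporter: CPU / memory / disk / network
    static_configs:
      - targets: ["worker-01:9100", "worker-02:9100"]

  - job_name: "hadoop-jmx"   # JMX Exporter java agent attached to HDFS daemons
    static_configs:
      - targets: ["namenode-01:9101"]

  - job_name: "flink"        # Flink metrics reporter (9249) plus JMX agent (9101)
    static_configs:
      - targets: ["flink-jm-01:9249", "flink-tm-01:9249", "flink-tm-01:9101"]

  - job_name: "kafka"        # Kafka Exporter: consumer lag / partition state
    static_configs:
      - targets: ["kafka-exporter-01:9308"]
```

In a cluster where nodes come and go, file-based or other service discovery would probably replace the static target lists.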
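2. Prometheus metric types:
| Type | Description | Monitoring example |
|------|------|---------|
| Counter | Cumulative value that only increases (resets to zero on restart) | node_cpu_seconds_total (cumulative CPU time) |
| Gauge | Instantaneous value that can go up or down | node_memory_MemAvailable_bytes (available memory) |
| Histogram | Observations counted into buckets; quantiles are computed at query time with histogram_quantile() | prometheus_http_request_duration_seconds (request latency distribution) |
| Summary | Count and sum of observations, optionally with client-side quantiles | jvm_gc_collection_seconds (GC count and total GC time) |
Note that Flink's PrometheusReporter exports Flink counters and meters as Prometheus gauges and Flink histograms as summaries, so metrics such as flink_jobmanager_job_numRestarts arrive as gauges.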
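Each metric type is queried with a different PromQL idiom. The recording rules below sketch one idiom per type, using metrics from Node Exporter, Prometheus itself, and the JVM collectors of a JMX Exporter agent. The group name and the recorded series names are made up for illustration.

```yaml
groups:
  - name: metric_type_examples   # hypothetical group name
    rules:
      # Counter: wrap in rate() to get a per-second value (here, CPU busy ratio)
      - record: instance:node_cpu_busy:ratio_rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

      # Gauge: use the current value directly (here, fraction of memory available)
      - record: instance:node_memory_available:ratio
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

      # Histogram: estimate a quantile from the _bucket series at query time
      - record: job:prometheus_http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (le, job) (rate(prometheus_http_request_duration_seconds_bucket[5m])))

      # Summary: average observation from _sum / _count (here, mean GC pause)
      - record: instance_gc:jvm_gc_collection_seconds:avg_5m
        expr: rate(jvm_gc_collection_seconds_sum[5m]) / rate(jvm_gc_collection_seconds_count[5m])
```

Histogram buckets can be aggregated across instances before the quantile is taken, whereas quantiles precomputed in a Summary cannot be meaningfully averaged. That difference is usually the deciding factor between the two types.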
3. Key alert rules (PromQL):
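# Alert: YARN node down
- alert: YARNNodeDown
  # Fires when fewer than 90% of Node Exporter targets are up.
  # Assumes the Node Exporter scrape job is named "node".
  expr: avg(up{job="node"}) < 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "More than 10% of YARN cluster nodes are down"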
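# Alert: Flink checkpoint failure
- alert: FlinkCheckpointFailed
  # Flink's PrometheusReporter exposes this count as a gauge without a _total suffix.
  expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[5m]) > 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Flink checkpoints have failed in the last 5 minutes"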
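# Alert: Kafka consumer lag
- alert: KafkaConsumerLag
  # Kafka Exporter labels the group as "consumergroup" and reports lag per partition,
  # so sum it per group and topic before comparing.
  expr: sum by (consumergroup, topic) (kafka_consumergroup_lag{consumergroup=~"flink.*"}) > 10000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer group {{ $labels.consumergroup }} lags by more than 10000 messages on topic {{ $labels.topic }}"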
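# Alert: excessive GC
- alert: FlinkGcFrequent
  # rate() of cumulative GC seconds is the fraction of wall-clock time spent in GC, per collector.
  expr: rate(jvm_gc_collection_seconds_sum{job="flink"}[5m]) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Flink JVM spends more than 50% of its time in GC"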
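# Alert: HDFS running out of space
- alert: HdfsLowSpace
  # Metric names depend on the JMX Exporter mapping rules in use.
  expr: (hadoop_namenode_capacity_used / hadoop_namenode_capacity_total) > 0.9
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "HDFS used capacity exceeds 90%"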
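Prometheus only loads rules that are wrapped in a groups section, so a complete rule file needs that outer structure. The sketch below shows a full file with a single per-node down alert. The file name, the group name, the team label, and the job name "node" are assumptions for illustration.

```yaml
# rules/cluster-alerts.yml (hypothetical file name)
groups:
  - name: cluster-node-alerts          # hypothetical group name
    interval: 30s                      # evaluation interval for this group
    rules:
      - alert: NodeExporterDown
        # up is set by Prometheus itself: 1 if the last scrape succeeded, 0 otherwise.
        # Assumes the Node Exporter scrape job is named "node".
        expr: up{job="node"} == 0
        for: 5m                        # must stay down for 5 minutes before firing
        labels:
          severity: critical
          team: bigdata                # hypothetical label used for Alertmanager routing
        annotations:
          summary: "Node {{ $labels.instance }} is unreachable"
          description: "Prometheus has failed to scrape {{ $labels.instance }} (job {{ $labels.job }}) for more than 5 minutes."
```

The file can be validated before deployment with `promtool check rules rules/cluster-alerts.yml`.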
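4. Grafana Dashboard:
- Use prebuilt dashboards: Flink (ID: 11920), Kafka (ID: 15873), JVM (ID: 4701); check that the panels' queries match the metric names your exporters emit
- Custom dashboards: aggregate Spark job execution time and shuffle data volume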
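Before any dashboard can be imported, Grafana needs Prometheus registered as a data source. One way to do that is through Grafana's file-based provisioning, sketched below. The URL http://prometheus:9090 and the file path are assumptions about the deployment.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                   # Grafana's backend proxies queries to Prometheus
    url: http://prometheus:9090     # placeholder address of the Prometheus server
    isDefault: true
```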