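Full-Stack Monitoring of a Big Data Cluster with Node Exporter + JMX Exporter
Please explain in detail how to configure Prometheus Node Exporter to collect system-level metrics (CPU/memory/disk/network) for big data components such as Hadoop/Spark/Flink, and JMX Exporter to collect JVM-level metrics (heap memory/GC/threads/class loading). Give a complete docker-compose or systemd configuration example, including the job configuration in prometheus.yml and the steps for importing Grafana dashboards.
Answer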
我还是少年
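Node Exporter + JMX Exporter full-stack monitoring configuration:
1. Node Exporter system-level metrics (CPU/memory/disk/network). Node Exporter runs once per host and listens on port 9100 by default, so a single instance covers every big data process on that host. Add each host to the `node` job: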
```yaml
# prometheus.yml job configuration
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'hadoop-nn-01:9100'     # NameNode host
          - 'hadoop-dn-01:9100'     # DataNode host
          - 'spark-master:9100'     # Spark Master host
          - 'kafka-broker-01:9100'  # Kafka Broker host
```
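If the hosts run Docker, Node Exporter can be deployed with docker-compose instead of the systemd unit in section 5. The sketch below follows the docker-compose layout in the node_exporter README; the service name and the `latest` tag are assumptions (pin a version in production). Host networking, the host PID namespace and the read-only root mount make the container report the host's CPU, memory, disk and network rather than its own.

```yaml
# docker-compose.yml on each cluster host (service name is illustrative)
services:
  node-exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node-exporter
    command:
      # Read host metrics from the bind-mounted root filesystem
      - '--path.rootfs=/host'
    # Share the host's network and PID namespaces so that metrics describe
    # the host and port 9100 is exposed directly on the host's interfaces
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - '/:/host:ro,rslave'
```

Start it with `docker compose up -d` on each host; the existing `node` job then scrapes `<host>:9100` unchanged.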
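2. JMX Exporter JVM-level metrics (heap memory/GC/threads/class loading). The JMX Exporter runs as a Java agent inside each JVM (NameNode, DataNode, Spark driver/executors, Flink JobManager/TaskManager) and serves Prometheus metrics over HTTP on the port given in its `-javaagent` argument: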
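```bash
# Java process startup parameters (using the Flink TaskManager as the example)
# Download the agent jar to the path referenced by -javaagent below
wget -O /opt/jmx_prometheus_javaagent.jar \
  https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
```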
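```yaml
# Flink configuration (config.yaml on Flink 1.19+, flink-conf.yaml on earlier versions)
# The agent listens on port 9101 and loads its rules from /opt/jmx_config.yaml
env.java.opts.taskmanager: -javaagent:/opt/jmx_prometheus_javaagent.jar=9101:/opt/jmx_config.yaml
```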
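```yaml
# /opt/jmx_config.yaml
# Heap/non-heap memory, GC, thread and class-loading metrics (jvm_memory_bytes_used,
# jvm_gc_collection_seconds, jvm_threads_current, jvm_classes_loaded) are exported
# automatically by the 0.x agent's built-in JVM collectors; the rules below only
# control how MBean attributes are converted.
startDelaySeconds: 10
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # Patterns match strings of the form domain<key=value, ...><compositeKey>attribute
  - pattern: 'java.lang<type=Memory><(Heap|NonHeap)MemoryUsage>(\w+)'
    name: jmx_memory_$2_bytes
    type: GAUGE
    labels:
      area: '$1'
  # Export every other MBean attribute with the default naming scheme
  - pattern: '.*'
```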
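The agent port 9101 configured above is separate from Flink's built-in PrometheusReporter port 9249 that the `flink` job in the full prometheus.yml scrapes, so the agent's JVM metrics need their own scrape job. A sketch, where the job name `flink-jvm` and the `component` label are hypothetical and `flink-tm-1` is reused from the target list in section 3:

```yaml
# Extra scrape job for the JMX Exporter agent (job name and label are illustrative)
scrape_configs:
  - job_name: 'flink-jvm'
    # The agent serves Prometheus text format on /metrics, which is
    # Prometheus's default metrics_path, so no override is needed
    static_configs:
      - targets: ['flink-tm-1:9101']
        labels:
          component: 'flink-taskmanager'
```

Add the job entry under the existing `scrape_configs` list rather than creating a second top-level key; the `component` label then lets a Grafana JVM dashboard filter by process type.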
3. Complete prometheus.yml:
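```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Node Exporter on every host
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100', 'node3:9100']

  # JMX Exporter agent attached to each HDFS daemon on port 9101 (on Hadoop 3,
  # e.g. via HDFS_NAMENODE_OPTS / HDFS_DATANODE_OPTS in hadoop-env.sh). The agent
  # serves Prometheus text format on /metrics (the default path), not on
  # Hadoop's JSON /jmx servlet, so metrics_path is left at its default.
  - job_name: 'hdfs'
    static_configs:
      - targets: ['nn1:9101', 'nn2:9101', 'dn1:9101']

  # Flink's built-in PrometheusReporter (default port 9249; the reporter must be
  # enabled in Flink's configuration)
  - job_name: 'flink'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['flink-jm:9249', 'flink-tm-1:9249']

  # kafka_exporter (default port 9308)
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka1:9308']
```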
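The question also asks for a docker-compose variant. A minimal sketch that runs Prometheus with the prometheus.yml above next to Grafana, assuming prometheus.yml sits in the same directory as docker-compose.yml and that the cluster hostnames (node1, nn1, flink-jm, kafka1 and so on) resolve from inside the containers, for example via DNS or `extra_hosts`. Service and volume names are illustrative.

```yaml
# docker-compose.yml on the monitoring host (service and volume names are illustrative)
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - '9090:9090'
    volumes:
      # Mount the prometheus.yml above at the image's default config path
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Persist the TSDB across container restarts
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - '3000:3000'
    volumes:
      # Persist dashboards, users and data sources
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:
```

Start the stack with `docker compose up -d`. Inside the compose network Grafana reaches Prometheus at `http://prometheus:9090`, so that is the URL to enter when adding the data source, and the standalone `docker run` for Grafana in step 4 is not needed.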
4. Importing Grafana dashboards:
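```bash
# 1. Start Grafana
docker run -d -p 3000:3000 grafana/grafana
# 2. Add Prometheus as a data source
# UI: Configuration → Data Sources → Add data source → Prometheus
#     (Connections → Data sources in newer Grafana versions);
#     set the URL to http://<prometheus-host>:9090
# 3. Import prebuilt dashboards
# UI: Dashboards → New → Import, enter the ID, then select the Prometheus data source
# Node Exporter Full: ID 1860
# JVM (Micrometer): ID 4701 (expects Micrometer metric names such as
#     jvm_memory_used_bytes; jmx_exporter 0.x exposes jvm_memory_bytes_used,
#     so some panels may need adjusting)
# Flink: ID 11920
# Kafka: ID 15873
# Hadoop HDFS: ID 6014
```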
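As an alternative to adding the data source by hand in step 2, Grafana can provision it at startup from a YAML file under `/etc/grafana/provisioning/datasources/`. A sketch, assuming Prometheus is reachable at `http://prometheus:9090` (the service name in a compose setup; substitute the real host otherwise). The file name is arbitrary.

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (file name is arbitrary)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Grafana's backend proxies queries to this URL
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

With Docker, mount the file into that directory (for example `-v $(pwd)/prometheus-ds.yml:/etc/grafana/provisioning/datasources/prometheus.yml:ro`). Dashboards are still imported by ID through the UI as in step 3.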
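5. systemd service configuration (save as /etc/systemd/system/node_exporter.service, then run `systemctl daemon-reload && systemctl enable --now node_exporter`):
```ini
[Unit]
Description=Node Exporter
After=network.target

[Service]
# mount-points-exclude / device-exclude replace the deprecated ignored-* flags;
# "$$" is systemd's escape for a literal "$" in the regex
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/) \
  --collector.netclass.ignored-devices=^(veth|docker) \
  --collector.netdev.device-exclude=^(veth|docker)
Restart=on-failure

[Install]
WantedBy=multi-user.target
```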