CodeWalk

End-to-end monitoring of a big data cluster with Node Exporter + JMX Exporter

Author: 我还是少年 · 2026-05-30 12:55

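Please explain in detail how to configure Prometheus Node Exporter to collect system-level metrics (CPU/memory/disk/network) on hosts running big data components such as Hadoop/Spark/Flink, and how to configure JMX Exporter to collect JVM-level metrics (heap memory/GC/threads/class loading). Provide a complete docker-compose or systemd configuration example, including the job configuration in prometheus.yml and the steps for importing a Grafana Dashboard.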

Answer

我还是少年

End-to-end monitoring configuration with Node Exporter + JMX Exporter:

1. Node Exporter: system-level metrics

# prometheus.yml job configuration
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
        - 'hadoop-nn-01:9100'   # NameNode
        - 'hadoop-dn-01:9100'   # DataNode
        - 'spark-master:9100'  # Spark Master
        - 'kafka-broker-01:9100' # Kafka Broker
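
Because each host above plays a different role, one optional refinement is to attach a static label to each target group so that dashboards and queries can filter by role. A minimal sketch, assuming a label named `role` (the label name and its values are illustrative and not part of the original answer):

```yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['hadoop-nn-01:9100']
        labels:
          role: 'namenode'       # hypothetical label name and value
      - targets: ['hadoop-dn-01:9100']
        labels:
          role: 'datanode'
      - targets: ['spark-master:9100']
        labels:
          role: 'spark-master'
      - targets: ['kafka-broker-01:9100']
        labels:
          role: 'kafka-broker'
```

With labels like these, a query such as `node_load1{role="datanode"}` would select only the DataNode hosts.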

2. JMX Exporter: JVM-level metrics

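# Java process startup parameters (using a Flink TaskManager as the example)
# Download jmx_prometheus_javaagent.jar; -O saves it under the path referenced by -javaagent below
wget -O /opt/jmx_prometheus_javaagent.jar https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar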

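# Flink configuration (config.yaml)
# Agent argument format: <port>:<path to config file>; the agent then serves metrics on port 9101
env.java.opts.taskmanager: "-javaagent:/opt/jmx_prometheus_javaagent.jar=9101:/opt/jmx_config.yaml"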

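# jmx_config.yaml
# Rule patterns match strings of the form domain<key=value, ...><compositeKey>attribute.
# The javaagent also exports built-in jvm_* metrics (memory, GC, threads, class loading),
# so the rules below use a jmx_ prefix to avoid clashing with those names.
startDelaySeconds: 10
rules:
  - pattern: 'java.lang<type=Memory><(\w+)MemoryUsage>(\w+)'
    name: jmx_memory_$2_bytes
    labels:
      area: '$1'          # Heap or NonHeap
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>CollectionTime'
    name: jmx_gc_collection_time_ms
    labels:
      gc: '$1'            # collector name
  - pattern: 'java.lang<type=Threading><>ThreadCount'
    name: jmx_threads_current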

3. Complete prometheus.yml configuration

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100','node2:9100','node3:9100']

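  - job_name: 'hdfs'
    # JMX Exporter serves Prometheus-format metrics on /metrics; Hadoop's own /jmx servlet returns JSON
    metrics_path: '/metrics'
    static_configs:
      - targets: ['nn1:9101','nn2:9101','dn1:9101']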

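  - job_name: 'flink'
    # 9249 is the default port of Flink's built-in PrometheusReporter;
    # to scrape the JMX Exporter agent from section 2 instead, target port 9101
    metrics_path: '/metrics'
    static_configs:
      - targets: ['flink-jm:9249','flink-tm-1:9249']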

  - job_name: 'kafka'
    # 9308 is the default port of kafka_exporter, which must run alongside the brokers
    static_configs:
      - targets: ['kafka1:9308']
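
The `flink` job above scrapes port 9249, which only answers if Flink's own Prometheus reporter is enabled on the JobManager and TaskManagers. A minimal sketch of the Flink-side settings that would expose it, assuming a Flink version that configures reporters via `factory.class` and ships the Prometheus reporter plugin (neither is confirmed by the original answer):

```yaml
# Flink configuration (config.yaml): enable the built-in Prometheus reporter
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249   # must match the port in the 'flink' scrape job
```

If several TaskManagers share one host, a port range may be needed so that each process can bind its own port, and the scrape targets must then be adjusted to match.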

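4. Importing Grafana Dashboards

# 1. Start Grafana
docker run -d -p 3000:3000 grafana/grafana

# 2. Configure the Prometheus data source
# UI: Configuration → Data Sources → Add data source → Prometheus, then enter the Prometheus URL

# 3. Import prebuilt dashboards
# UI: Dashboards → Import → enter the dashboard ID → Load → select the Prometheus data source
# Node Exporter Full: ID 1860
# JVM (Micrometer): ID 4701 (built for Micrometer metric names; panels may need adjusting for JMX Exporter metrics)
# Flink: ID 11920
# Kafka: ID 15873
# Hadoop HDFS: ID 6014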
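
As an alternative to clicking through the UI in step 2, the data source can be declared in a Grafana provisioning file so that it exists as soon as Grafana starts. A minimal sketch, assuming Prometheus is reachable at `http://prometheus:9090` and that the file is mounted under `/etc/grafana/provisioning/datasources/` (both are assumptions, not taken from the original answer):

```yaml
# prometheus-datasource.yml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    url: http://prometheus:9090    # assumed Prometheus address
    isDefault: true
```

The dashboards listed above would still be imported by ID, selecting this data source when prompted.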

5. systemd service configuration

[Unit]
Description=Node Exporter
After=network.target

[Service]
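# Recent node_exporter releases name this flag mount-points-exclude; older ones used ignored-mount-points
ExecStart=/usr/local/bin/node_exporter \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)" \
  --collector.netclass.ignored-devices="^(veth|docker)"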

[Install]
WantedBy=multi-user.target
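
For a containerized setup instead of systemd, the same pieces can be expressed as a docker-compose file. This is a minimal sketch under stated assumptions: the prometheus.yml from section 3 sits next to the compose file, the image tags are left at their defaults, and the volume name `grafana-data` is illustrative. It runs Node Exporter only on the compose host; every other cluster node still needs its own Node Exporter instance.

```yaml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro   # config from section 3
    ports:
      - "9090:9090"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter
    command:
      - '--path.rootfs=/host'      # read host metrics through the mounted root filesystem
    network_mode: host             # expose :9100 on the host and see host network devices
    pid: host
    volumes:
      - '/:/host:ro,rslave'
    restart: unless-stopped

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana   # persist dashboards and data sources
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  grafana-data:
```

Starting it with `docker compose up -d` brings up Prometheus on port 9090 and Grafana on port 3000. Inside the compose network, Grafana can reach Prometheus at `http://prometheus:9090`.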