CodeWalk

Prometheus Alertmanager: alert routing and high-availability cluster configuration

Author: 编译有声 · 2026-05-30 12:55

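How do I configure Prometheus Alertmanager for smart alert routing (for example, sending alerts of different severities to different recipients) and for cluster high availability (multiple instances plus the gossip protocol)? Please provide a complete alertmanager.yml that includes inhibit_rules (alert inhibition, to avoid duplicate alerts), route (routing rules that dispatch by severity), and receivers (Email/DingTalk/Webhook). Please also explain how Alertmanager's cluster Mesh communication works.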

Answer

编译有声

Alertmanager alert routing and high-availability configuration:

1. Complete alertmanager.yml configuration

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts'
  smtp_auth_password: '****'

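# Inhibition rules: suppress redundant alerts
# (newer Alertmanager releases deprecate source_match/target_match in favour of source_matchers/target_matchers)
inhibit_rules:
  - source_match:
      severity: 'critical'      # while a critical alert is firing
    target_match:
      severity: 'warning'       # mute the warning with the same alertname on the same instance
    equal: ['instance', 'alertname']
  - source_match:
      alertname: 'NodeDown'     # while the node is down
    target_match:
      alertname: 'HighCpuUsage' # mute CPU alerts for that instance (they carry no information)
    equal: ['instance']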

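# Routing rules
route:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Child routes are evaluated top to bottom and the first match wins,
    # so the alert-specific route has to come before the severity routes.
    - match:
        alertname: 'FlinkCheckpointFailed'
      receiver: 'flink-oncall'
    - match:
        severity: 'critical'
      receiver: 'sre-team'
      repeat_interval: 1h       # re-send critical alerts every hour
    - match:
        severity: 'warning'
      receiver: 'data-team'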

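# Receivers
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://webhook/alert'

  - name: 'sre-team'
    webhook_configs:
      # Alertmanager posts its own JSON schema, which the DingTalk robot API does not
      # accept directly. Put a converter such as prometheus-webhook-dingtalk in between;
      # the URL below is a placeholder for such a relay.
      - url: 'http://dingtalk-relay:8060/dingtalk/webhook1/send'
        send_resolved: true
    email_configs:
      - to: 'sre@example.com'

  - name: 'data-team'
    email_configs:
      - to: 'data-team@example.com'

  # Every receiver named in a route must be defined, or the config fails to load.
  - name: 'flink-oncall'
    email_configs:
      - to: 'flink-oncall@example.com'   # placeholder address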
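
To make the evaluation order of the routes and inhibit rules above concrete, here is a minimal Python sketch. It is a simplified model, not Alertmanager's actual implementation: it assumes exact label matching only (no regex), a single level of child routes, and no `continue` flag. The function names `pick_receiver` and `is_inhibited` and the label values `job-1` and `node-1` are made up for illustration.

```python
# Simplified model of Alertmanager routing and inhibition (illustration only).
# Assumptions: exact label matching, one level of child routes, no `continue`.


def matches(labels, matcher):
    """True if every key in the matcher is present in labels with the same value."""
    return all(labels.get(k) == v for k, v in matcher.items())


def pick_receiver(labels, routes, default="default"):
    """Return the receiver of the first child route that matches, else the default."""
    for matcher, receiver in routes:
        if matches(labels, matcher):
            return receiver
    return default


def is_inhibited(target, firing, rules):
    """True if some other firing alert mutes `target` under one of the inhibit rules."""
    for rule in rules:
        if not matches(target, rule["target_match"]):
            continue
        for source in firing:
            if source is target or not matches(source, rule["source_match"]):
                continue
            # Every label listed in `equal` must have the same value on both alerts.
            if all(source.get(name) == target.get(name) for name in rule["equal"]):
                return True
    return False


severity_first = [
    ({"severity": "critical"}, "sre-team"),
    ({"severity": "warning"}, "data-team"),
    ({"alertname": "FlinkCheckpointFailed"}, "flink-oncall"),
]
flink_first = [severity_first[2]] + severity_first[:2]

flink = {"alertname": "FlinkCheckpointFailed", "severity": "critical", "instance": "job-1"}
print(pick_receiver(flink, severity_first))  # sre-team
print(pick_receiver(flink, flink_first))     # flink-oncall

rules = [{"source_match": {"alertname": "NodeDown"},
          "target_match": {"alertname": "HighCpuUsage"},
          "equal": ["instance"]}]
node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "node-1"}
high_cpu = {"alertname": "HighCpuUsage", "severity": "warning", "instance": "node-1"}
print(is_inhibited(high_cpu, [node_down, high_cpu], rules))  # True
```

The two `pick_receiver` calls show why route order matters: a FlinkCheckpointFailed alert that also carries `severity: critical` only reaches `flink-oncall` when the alert-specific route is listed before the severity routes.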

2. Alertmanager cluster high availability (Mesh/Gossip)

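# Three Alertmanager instances that sync state over a gossip protocol
# (command for am1; --cluster.peer is repeated once per peer, and each node lists the other two)
./alertmanager --cluster.listen-address=0.0.0.0:9094 \
               --cluster.peer=am2:9094 \
               --cluster.peer=am3:9094 \
               --config.file=/etc/alertmanager/alertmanager.yml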

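# Cluster properties:
- Instances gossip silences and the notification log (a record of which notifications have already been sent), not the alerts themselves. The "Mesh" name dates from early releases built on Weaveworks Mesh with --mesh.* flags; current releases use HashiCorp memberlist with --cluster.* flags.
- Because alerts are not replicated between instances, Prometheus must send every alert to every instance. The shared notification log then lets the instances de-duplicate, so each notification normally goes out only once.
- If some instances go down, the surviving ones still hold the alerts and send the notifications, so alerts are not lost.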
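
The de-duplication described above can be pictured with a toy Python model. It is a sketch under strong assumptions, not how Alertmanager is implemented: gossip is treated as instantaneous and modelled as one shared set, each instance keeps a fixed position in the cluster, and the function name `notify_round` and the group key are hypothetical. The one detail taken from the real tool is that an instance delays its notification by its position multiplied by `--cluster.peer-timeout` (15s by default) and skips sending if the notification log already has an entry.

```python
# Toy model of notification de-duplication in an Alertmanager cluster.
# Assumption: gossip is instantaneous and modelled as one shared set.

PEER_TIMEOUT = 15  # seconds; mirrors the default of --cluster.peer-timeout


def notify_round(group_key, alive_positions, notification_log):
    """Simulate one alert group being handled by every live instance.

    Each instance waits position * PEER_TIMEOUT, then sends only if the
    shared notification log has no entry for the group yet.
    Returns a list of (send_time_seconds, position) for notifications sent.
    """
    sent = []
    for position in sorted(alive_positions):
        wait = position * PEER_TIMEOUT
        if group_key in notification_log:
            continue  # another instance already notified; stay quiet
        notification_log.add(group_key)
        sent.append((wait, position))
    return sent


print(notify_round("NodeDown/node-1", [0, 1, 2], set()))  # [(0, 0)]
print(notify_round("NodeDown/node-1", [1, 2], set()))     # [(15, 1)]
```

In the second call the instance at position 0 is down, so the instance at position 1 sends after its 15-second wait and the instance at position 2 stays silent.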

3. Integrating Prometheus with Alertmanager

# prometheus.yml
rule_files:
  - '/etc/prometheus/alerts/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 'am1:9093'
          - 'am2:9093'
          - 'am3:9093'  # Prometheus sends every alert to all listed Alertmanagers; do not put a load balancer in front of them
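
As a rough illustration of what listing several targets means, the Python sketch below builds one POST request per Alertmanager instance, each carrying the same alert batch, against the `/api/v2/alerts` endpoint. It only constructs the requests and never sends them. It is not Prometheus's own code; the helper name `build_requests` and the sample alert are assumptions made for the example.

```python
import json
from urllib.request import Request

TARGETS = ["am1:9093", "am2:9093", "am3:9093"]  # same targets as in prometheus.yml


def build_requests(alerts, targets=TARGETS):
    """Build one POST per Alertmanager instance, each carrying the full alert batch."""
    body = json.dumps(alerts).encode("utf-8")
    return [
        Request(
            url=f"http://{target}/api/v2/alerts",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        for target in targets
    ]


# Hypothetical alert; only the labels field is filled in here.
alerts = [{"labels": {"alertname": "NodeDown", "severity": "critical", "instance": "node-1"}}]
for req in build_requests(alerts):
    print(req.get_method(), req.full_url)
```

Every instance receives the identical payload, which is what allows any surviving instance to notify on its own when the others are unreachable.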

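4. Alert grouping and intervals:

| Parameter | Description | Suggested value |
|------|------|--------|
| group_wait | How long to wait before the first notification for a new group, so that alerts arriving together are batched | 30s |
| group_interval | How long to wait before notifying about new alerts added to a group that has already been notified | 5m |
| repeat_interval | How long to wait before re-sending a notification for alerts that are still firing | 4h (warning) / 1h (critical) |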
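
How the three timers interact is easier to see with numbers. The Python sketch below is a simplified model, assuming a group is flushed first after group_wait and then every group_interval, and that a flush sends a notification only if the group gained new alerts or repeat_interval has elapsed since the last send. The function name `notification_times` is hypothetical, and real Alertmanager timing can differ slightly at the boundaries.

```python
def notification_times(group_wait, group_interval, repeat_interval, horizon,
                       new_alert_times=()):
    """Return the times (seconds after the first alert) at which notifications go out.

    new_alert_times lists when further alerts join the same group.
    """
    sent = []
    last_sent = None
    prev_flush = 0
    t = group_wait  # first flush happens after group_wait
    while t <= horizon:
        # Did any new alert join the group since the previous flush?
        changed = any(prev_flush < a <= t for a in new_alert_times)
        # Is this the first send, or has repeat_interval passed since the last one?
        due = last_sent is None or t - last_sent >= repeat_interval
        if changed or due:
            sent.append(t)
            last_sent = t
        prev_flush = t
        t += group_interval  # later flushes happen every group_interval
    return sent


# 30s / 5m / 4h, observed for 15000 seconds
print(notification_times(30, 300, 14400, 15000))                         # [30, 14430]
print(notification_times(30, 300, 14400, 15000, new_alert_times=[100]))  # [30, 330, 14730]
```

With the suggested warning values, a single unchanged alert notifies at 30s and again 4h later. An alert joining the group at 100s triggers an extra notification at the next flush (330s), and the repeat timer restarts from there.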