Etcd Exporter

Overview

Key alerts built on the performance metrics scraped from etcd.

Global default alerts

Abnormal number of online cluster members

alert: EtcdMembersCountInvaild
expr: sum by (appid) (up{appid!~"uptime",job=~"etcd.*exporter"}) < 3
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Checked every 2 minutes, sustained for 6 minutes: etcd cluster member count is below 3 (current value: {{$value}})"

Whether the cluster has a leader node

alert: EtcdNoLeader
expr: (etcd_server_has_leader{appid!~"uptime",job=~"etcd.*exporter"}) == 0
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Checked every 2 minutes, sustained for 6 minutes: no leader found in the etcd cluster"
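
The fragments on this page are individual rule bodies. As a minimal sketch of how one of them might sit in a Prometheus rule file, assuming a hypothetical group name `etcd-exporter` and a group `interval` of 2m chosen to match the 2-minute check cadence mentioned in the summaries (neither is stated as configuration in the original):

```yaml
# Hypothetical rule file layout; the group name and interval are assumptions.
groups:
  - name: etcd-exporter
    # Evaluate the group every 2 minutes, so `for: 6m` requires the
    # condition to hold across several consecutive evaluations.
    interval: 2m
    rules:
      - alert: EtcdNoLeader
        expr: (etcd_server_has_leader{appid!~"uptime",job=~"etcd.*exporter"}) == 0
        for: 6m
        labels:
          api_opsaid_cn_fault_priority: P5
          api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
        annotations:
          summary: "Checked every 2 minutes, sustained for 6 minutes: no leader found in the etcd cluster"
```

A file like this can be syntax-checked with `promtool check rules <file>` before it is loaded.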

Frequent cluster leader changes

alert: EtcdHighNumberOfLeaderChanges
expr: ceil(increase(etcd_server_leader_changes_seen_total{appid!~"uptime",job=~"etcd.*exporter"}[10m])) >= 3
for: 0m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "At least 3 etcd cluster leader changes within 10 minutes (current value: {{$value}})"

High round-trip latency between cluster members

alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{appid!~"uptime"}[5m])) >= 0.3
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Checked every 2 minutes, sustained for 6 minutes: p99 round-trip time between etcd members is at least 0.3 s (current value: {{$value}})"

Failed proposals in the past hour

A proposal is a request that must go through the Raft protocol.

alert: EtcdHighNumberOfFailedProposals
expr: ceil(increase(etcd_server_proposals_failed_total{appid!~"uptime"}[1h])) >= 10
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "At least 10 failed proposals in the etcd cluster over the past hour (current value: {{$value}})"

High WAL fsync latency

alert: EtcdHighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{appid!~"uptime"}[2m])) >= 1
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Checked every 2 minutes, sustained for 6 minutes: p99 etcd WAL fsync latency is at least 1 s (current value: {{$value}})"

High backend commit latency

alert: EtcdHighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{appid!~"uptime"}[2m])) >= 1
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Checked every 2 minutes, sustained for 6 minutes: p99 etcd backend commit latency is at least 1 s (current value: {{$value}})"

Cluster storage approaching its quota

alert: EtcdBackendQuotaLowSpace
expr: (etcd_mvcc_db_total_size_in_bytes{appid!~"uptime"}/etcd_server_quota_backend_bytes{appid!~"uptime"}) * 100 > 60
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Checked every 2 minutes, sustained for 6 minutes: etcd backend storage usage exceeds 60% of the quota (current value: {{$value}})"
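
If the usage percentage is also wanted elsewhere (for example on dashboards), one option is to compute it once in a recording rule and alert on the recorded series. The following is only a sketch under assumptions: the group name and the recorded metric name `instance:etcd_backend_quota_used:percent` are hypothetical and do not appear in the original rules.

```yaml
# Sketch only: the group name and the recorded metric name are assumptions.
groups:
  - name: etcd-exporter-quota
    interval: 2m
    rules:
      # Backend database size as a percentage of the configured quota.
      - record: instance:etcd_backend_quota_used:percent
        expr: (etcd_mvcc_db_total_size_in_bytes{appid!~"uptime"} / etcd_server_quota_backend_bytes{appid!~"uptime"}) * 100
      # Same threshold, duration, and labels as the EtcdBackendQuotaLowSpace rule.
      - alert: EtcdBackendQuotaLowSpace
        expr: instance:etcd_backend_quota_used:percent > 60
        for: 6m
        labels:
          api_opsaid_cn_fault_priority: P5
          api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
        annotations:
          summary: "etcd backend storage usage exceeds 60% of the quota (current value: {{$value}})"
```

Rules within one group are evaluated in order, so the alert sees the value recorded in the same evaluation cycle.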



Last modified 2023.02.13: feat: add alert rules (2fcc453)