Etcd Exporter
Overview
Key alerts on the performance data collected from etcd.
Global default alerts
Abnormal number of online cluster members
alert: EtcdMembersCountInvaild
expr: count by (appid) (up{appid!~"uptime",job=~"etcd.*exporter"}) < 3
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Evaluated every 2 minutes, sustained for 6 minutes: the etcd cluster has fewer than 3 members (current value: {{$value}})"
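The rules on this page are shown as bare rule bodies. As a minimal sketch of how one of them could sit in a Prometheus rule file, the fragment below wraps the member-count rule in a rule group. The group name `etcd-exporter-default` is hypothetical, and the `interval: 2m` setting is an assumption inferred from the "evaluated every 2 minutes" wording in the summaries; the source does not show the actual group definition.

```yaml
groups:
  - name: etcd-exporter-default   # hypothetical group name, not from the source
    interval: 2m                  # assumption: matches "evaluated every 2 minutes" in the summaries
    rules:
      - alert: EtcdMembersCountInvaild
        expr: count by (appid) (up{appid!~"uptime",job=~"etcd.*exporter"}) < 3
        for: 6m                   # the condition must hold for 6 minutes before the alert fires
        labels:
          api_opsaid_cn_fault_priority: P5
          api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
        annotations:
          summary: "Evaluated every 2 minutes, sustained for 6 minutes: the etcd cluster has fewer than 3 members (current value: {{$value}})"
```

With a 2-minute interval and `for: 6m`, the condition has to hold across roughly four consecutive evaluations (pending at the first, firing at the fourth) before the alert is sent.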
Cluster leader presence
alert: EtcdNoLeader
expr: (etcd_server_has_leader{appid!~"uptime",job=~"etcd.*exporter"}) == 0
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Evaluated every 2 minutes, sustained for 6 minutes: no leader found in the etcd cluster"
Frequent cluster leader changes
alert: EtcdHighNumberOfLeaderChanges
expr: ceil(increase(etcd_server_leader_changes_seen_total{appid!~"uptime",job=~"etcd.*exporter"}[10m])) >= 3
for: 0m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "The etcd cluster leader changed at least 3 times within 10 minutes (current value: {{$value}})"
High synchronization latency between cluster members
alert: EtcdMemberCommunicationSlow
expr: histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{appid!~"uptime"}[5m])) >= 0.3
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Evaluated every 2 minutes, sustained for 6 minutes: the p99 round-trip time between etcd members is at or above 0.3 seconds (current value: {{$value}})"
Failed proposal requests over the past hour
A proposal is a request that must go through the raft protocol to complete.
alert: EtcdHighNumberOfFailedProposals
expr: ceil(increase(etcd_server_proposals_failed_total{appid!~"uptime"}[1h])) >= 10
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "At least 10 proposal requests failed in the etcd cluster over the past hour (current value: {{$value}})"
High WAL fsync latency
alert: EtcdHighFsyncDurations
expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{appid!~"uptime"}[2m])) >= 1
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Evaluated every 2 minutes, sustained for 6 minutes: the p99 WAL fsync latency of the etcd cluster is at or above 1 second (current value: {{$value}})"
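Because `rate()` keeps every label of the bucket series, the `histogram_quantile` expressions on this page produce one p99 value per scraped member, so each member is judged on its own. If a single per-cluster value is also wanted (for a dashboard, say), one option is a recording rule that sums the buckets by `appid` and `le` first. This is only a sketch: the group name and the recorded series name `appid:etcd_disk_wal_fsync_duration_seconds:p99` are hypothetical, and the 2-minute interval is the same assumption taken from the summaries.

```yaml
groups:
  - name: etcd-exporter-recording   # hypothetical group name, not from the source
    interval: 2m                    # assumption: same evaluation interval as the alerts
    rules:
      # Per-cluster p99 WAL fsync latency: merge the buckets of all members
      # of one appid before computing the quantile. "le" must be kept in the
      # grouping, otherwise histogram_quantile has no bucket boundaries.
      - record: appid:etcd_disk_wal_fsync_duration_seconds:p99   # hypothetical series name
        expr: |
          histogram_quantile(
            0.99,
            sum by (appid, le) (
              rate(etcd_disk_wal_fsync_duration_seconds_bucket{appid!~"uptime"}[2m])
            )
          )
```

A cluster-wide quantile can hide a single slow disk behind healthy members, which is one reason to keep the per-member form for alerting.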
High backend commit latency
alert: EtcdHighCommitDurations
expr: histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{appid!~"uptime"}[2m])) >= 1
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Evaluated every 2 minutes, sustained for 6 minutes: the p99 backend commit latency of the etcd cluster is at or above 1 second (current value: {{$value}})"
Database size approaching the quota
alert: EtcdBackendQuotaLowSpace
expr: (etcd_mvcc_db_total_size_in_bytes{appid!~"uptime"}/etcd_server_quota_backend_bytes{appid!~"uptime"}) * 100 > 60
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: "Evaluated every 2 minutes, sustained for 6 minutes: the etcd database size exceeds 60% of the backend quota (current value: {{$value}})"
Last modified 2023.02.13: feat: add alert rules (2fcc453)