Node Exporter
Overview
appid = "oneops"
Metric | Type | Description |
---|---|---|
node_filesystem_avail_bytes | xx | Remaining space on the partition |
node_filesystem_size_bytes | xx | Total space of the partition |
Common labels
Label | Description |
---|---|
api.opsaid.cn/fault-priority | P5 |
api.opsaid.cn/pm2-uuid | UUID of the project |
Global default alert rules
groups:
- name: default-global-host-2m
interval: 2m
rules:
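These rules use a long evaluation interval and a long hold duration, and are meant for global-level judgement.
Host CPU usage warning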
alert: NodeCPUUseHigh60
expr: |
(
1 - (sum(increase(node_cpu_seconds_total{appid!~"uptime",mode="idle"}[1m])) by (appid,job,instance,serial_num) / sum(increase(node_cpu_seconds_total{appid!~"uptime"}[1m])) by (appid,job,instance,serial_num) )
) >= 0.6
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, CPU usage above 60% (current value {{$value|humanizePercentage}})
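The arithmetic behind the CPU expression can be sketched in Python. This is only an illustration of the ratio `1 - idle / total`, not how Prometheus evaluates the rule; the function name and the sample numbers are hypothetical.

```python
def cpu_usage_ratio(increase_by_mode):
    """Return 1 - idle/total for one instance.

    increase_by_mode maps a CPU mode (e.g. "idle", "user") to the increase of
    node_cpu_seconds_total over the window, already summed across all cores.
    """
    total = sum(increase_by_mode.values())
    if total <= 0:
        raise ValueError("no CPU time recorded in the window")
    return 1 - increase_by_mode.get("idle", 0.0) / total


# Hypothetical sample: 4 cores over a 60-second window (240 CPU-seconds).
sample = {"idle": 84.0, "user": 120.0, "system": 24.0, "iowait": 12.0}
usage = cpu_usage_ratio(sample)

# Same threshold as NodeCPUUseHigh60.
fires = usage >= 0.6
```

The iowait rule follows the same pattern, with the iowait increase divided by the total instead of subtracting the idle share from 1.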
Host iowait usage warning
alert: NodeCPUIOWaitHigh20
expr: |
(
(sum(increase(node_cpu_seconds_total{appid!~"uptime",mode="iowait"}[1m])) by (appid,job,instance,serial_num) / sum(increase(node_cpu_seconds_total{appid!~"uptime"}[1m])) by (appid,job,instance,serial_num) )
) >= 0.2
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, iowait usage above 20% (current value {{$value|humanizePercentage}})
Host 1-minute load warning
alert: NodeLoad1High30
expr: ceil(avg(node_load1{appid!~"uptime"}) by (appid,job,instance,serial_num)) > 30
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, load1 above 30 (current value {{$value}})
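Filesystem predicted to fill up within 24 hours
Based on the previous 6 hours of data, predicts that the space may fill up within 24 hours. The condition must hold for 1 hour.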
alert: NodeFilesystemSpaceFillingUp
expr: |
(
node_filesystem_avail_bytes{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_size_bytes{appid="oneops",fstype!="",device="rootfs"} * 100 < 40
and
predict_linear(node_filesystem_avail_bytes{appid="oneops",fstype!="",device="rootfs"}[6h], 24*60*60) < 0
and
node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
)
for: 1h
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
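summary: Partition space may fill up within the next 24 hours
- Percentage of partition space still available: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
- Predicted available space: predict_linear(node_filesystem_avail_bytes{appid="oneops",fstype!=""}[6h], 24*60*60)
- Partition filesystem is not read-only: node_filesystem_readonly{appid="oneops",fstype!=""}
predict_linear(v range-vector, t scalar) applies simple linear regression to the samples in range vector v (here, the previous 6 hours) and predicts what the value will be t seconds from the current time.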
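To make the prediction step concrete, here is a minimal Python sketch of a least-squares extrapolation in the spirit of predict_linear. It is an illustration under stated assumptions, not Prometheus's implementation: the function signature, the sample spacing, and the filesystem numbers are all hypothetical.

```python
def predict_linear(samples, eval_time, horizon):
    """Fit a least-squares line through (timestamp, value) samples and
    extrapolate it `horizon` seconds past `eval_time`."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    # Shift timestamps so that x = 0 is the evaluation time.
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("all samples share one timestamp")
    slope = cov / var
    intercept = mean_y - slope * mean_x  # fitted value at eval_time
    return intercept + slope * horizon


# Hypothetical filesystem: starts with 1500 MiB free and loses 1 MiB per
# minute, sampled once a minute over a 6-hour window.
MIB = 1024 * 1024
now = 6 * 3600
samples = [(t, 1500 * MIB - (t // 60) * MIB) for t in range(0, now + 1, 60)]

predicted = predict_linear(samples, now, 24 * 60 * 60)
will_fill = predicted < 0  # mirrors the "< 0" comparison in the rule
```

With these numbers the fitted line crosses zero before the 24-hour horizon, so the predicted value is negative and the second condition of the rule would hold.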
Low available disk space
alert: NodeFilesystemAlmostOutOfSpace
expr: |
(
node_filesystem_avail_bytes{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_size_bytes{appid="oneops",fstype!="",device="rootfs"} * 100 < 5
and
node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
)
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, available space below 5% (current value {{ printf "%.2f" $value }}%)
Inodes predicted to be exhausted within 24 hours
alert: NodeFilesystemFilesFillingUp
expr: |
(
node_filesystem_files_free{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_files{appid="oneops",fstype!="",device="rootfs"} * 100 < 40
and
predict_linear(node_filesystem_files_free{appid="oneops",fstype!="",device="rootfs"}[6h], 24*60*60) < 0
and
node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
)
for: 1h
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Partition inodes may be exhausted within the next 24 hours
Low available inodes
alert: NodeFilesystemAlmostOutOfFiles
expr: |
(
node_filesystem_files_free{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_files{appid="oneops",fstype!="",device="rootfs"} * 100 < 5
and
node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
)
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, available inodes below 5% (current value {{ printf "%.2f" $value }}%)
Host NIC receive errors
alert: NodeNetworkReceiveErrs
expr: |
rate(node_network_receive_errs_total{appid="oneops"}[2m]) / rate(node_network_receive_packets_total{appid="oneops"}[2m]) > 0.01
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, NIC {{ $labels.device }} is reporting packet receive errors
Conntrack entries approaching the limit
alert: NodeHighNumberConntrackEntriesUsed
expr: |
(node_nf_conntrack_entries{appid!~"uptime"} / node_nf_conntrack_entries_limit{appid!~"uptime"}) > 0.75
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, conntrack usage is approaching the limit (current value {{$value|humanizePercentage}})
Clock skew detected
alert: NodeClockSkewDetected
expr: |
(
node_timex_offset_seconds{appid!~"uptime"} > 0.05
and
deriv(node_timex_offset_seconds{appid!~"uptime"}[5m]) >= 0
)
or
(
node_timex_offset_seconds{appid!~"uptime"} < -0.05
and
deriv(node_timex_offset_seconds{appid!~"uptime"}[5m]) <= 0
)
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, clock offset exceeds 0.05 seconds and is not converging
Clock not synchronising
alert: NodeClockNotSynchronising
expr: |
min_over_time(node_timex_sync_status{appid!~"uptime"}[5m]) == 0
and
node_timex_maxerror_seconds{appid!~"uptime"} >= 16
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, the clock is not synchronising
File descriptor check
- Alert rule
alert: NodeFileDescriptorLimit
expr: (node_filefd_allocated{appid!~"uptime"}/node_filefd_maximum{appid!~"uptime"}) * 100 > 70
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
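summary: Evaluated every 2 minutes, sustained for 6 minutes, file descriptor usage is approaching the limit (current value {{ printf "%.2f" $value }}%)
- Rule explanation
/proc/sys/fs/file-max
This file sets the limit on the number of file descriptors the running kernel can open.
/proc/sys/fs/file-nr
This is a status file containing three values: the first is the number of file descriptors allocated system-wide, the second is the number of free file descriptors (allocated but unused, waiting to be reassigned), and the third is the maximum number of file descriptors.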
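The three fields of file-nr and the percentage used by the alert can be sketched as follows. This is a hedged illustration: the function names and the sample contents are hypothetical, and the sketch parses a string rather than reading the file so that it runs on any platform.

```python
def parse_file_nr(text):
    """Split the contents of /proc/sys/fs/file-nr into its three fields:
    (allocated, free, maximum)."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError("expected three whitespace-separated values")
    allocated, free, maximum = (int(f) for f in fields)
    return allocated, free, maximum


def fd_usage_percent(text):
    """Allocated descriptors as a percentage of the maximum, the same ratio
    the alert builds from node_filefd_allocated / node_filefd_maximum."""
    allocated, _free, maximum = parse_file_nr(text)
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return allocated / maximum * 100


# Hypothetical file contents; real values depend on the host.
sample = "76800\t0\t102400\n"
allocated, free, maximum = parse_file_nr(sample)
percent = fd_usage_percent(sample)
fires = percent > 70  # same threshold as NodeFileDescriptorLimit
```

On a Linux host the same functions could be fed the result of `open("/proc/sys/fs/file-nr").read()`.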
Swap usage above 50% warning
- Alert rule
alert: NodeMemorySwapUsedHigh50
expr: ((node_memory_SwapTotal_bytes{appid!~"uptime"} - node_memory_SwapFree_bytes{appid!~"uptime"}) / (node_memory_SwapTotal_bytes{appid!~"uptime"})) > 0.5
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, host swap usage above 50% (current value {{$value|humanizePercentage}})
- Rule explanation
pass
Process open fds above 30% of the maximum
alert: NodeFileDescriptorProcessesUsedHigh30
expr: (process_open_fds{appid!~"uptime"} / process_max_fds{appid!~"uptime"}) > 0.3
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, a process has opened more than 30% of its maximum fds (current value {{$value|humanizePercentage}})
- Rule explanation
Depends on data reported by the client.
Too many network interfaces
- alert: NodeNetworkCardCountHigh600
expr: sort_desc(count(node_network_up{}) by(job,instance,host_ip)) > 600
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
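summary: Evaluated every 2 minutes, sustained for 6 minutes, number of network interfaces above 600 (current value {{$value}})
Too many network interfaces can cause the node_exporter process to hang during collection. Each interface typically accounts for about 3 KB (uncompressed).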
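As a rough worked example of that figure, the sketch below multiplies the per-interface size by the interface count. It assumes "3k" means 3000 bytes; the constant and function names are hypothetical.

```python
# Assumption: "3k bytes" is read as 3000 bytes of uncompressed data per interface.
BYTES_PER_INTERFACE = 3_000


def estimated_bytes(interface_count, per_interface=BYTES_PER_INTERFACE):
    """Rough uncompressed size attributable to network interfaces."""
    if interface_count < 0:
        raise ValueError("interface_count must be non-negative")
    return interface_count * per_interface


# Size at the alert threshold of 600 interfaces.
at_threshold = estimated_bytes(600)
at_threshold_mb = at_threshold / 1_000_000
```

Under this assumption, a host at the 600-interface threshold carries roughly 1.8 MB of uncompressed interface data per collection.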
NIC receive bandwidth utilisation at 90%
- Alert rule
- alert: NodeNetworkReceiveUsedHigh90
expr: |
(
(
max by(appid,job,serial_num,device,host_ip,instance) (rate(node_network_receive_bytes_total{appid!~"uptime"}[2m])*8)
) / 1000 / 1000
)
>
(
(
max (node_network_speed_bytes{appid!~"uptime"} * 8 / 1000 / 1000) by (appid,job,serial_num,device,host_ip,instance) > 20
) * 0.9
)
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
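summary: Evaluated every 2 minutes, sustained for 6 minutes, NIC receive bandwidth utilisation above 90% (current value {{$value}}Mbps)
- Rule explanation
The NIC speed is read from /sys/class/net/eth0/ and exposed as the metric node_network_speed_bytes. Its unit is bytes, so the value must be multiplied by 8 to convert it to bits.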
sort_desc(node_network_speed_bytes{} * 8 / 1000 / 1000 > 10 )
This filters for NICs faster than 10 Mbit/s.
Note:
- Most virtual machines may not expose this value.
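The unit conversion and the 90% comparison can be sketched in Python. This is an illustration only: the function names and sample rates are hypothetical, and a missing speed is modelled as `None` to reflect the note that many virtual machines do not expose it.

```python
def to_mbits(bytes_per_second):
    """Convert bytes per second to megabits per second (1 Mbit = 10**6 bits)."""
    return bytes_per_second * 8 / 1000 / 1000


def receive_utilisation(rx_bytes_per_second, link_speed_bytes):
    """Received rate divided by link speed, or None when the speed is
    missing or not positive."""
    if link_speed_bytes is None or link_speed_bytes <= 0:
        return None
    return to_mbits(rx_bytes_per_second) / to_mbits(link_speed_bytes)


# Hypothetical 1 Gbit/s link, expressed in bytes per second.
GIGABIT_IN_BYTES = 125_000_000
link_mbits = to_mbits(GIGABIT_IN_BYTES)

# Hypothetical receive rate of 118.75 MB/s on that link.
ratio = receive_utilisation(118_750_000, GIGABIT_IN_BYTES)
fires = ratio is not None and ratio > 0.9

# A host that does not report a link speed cannot be evaluated.
unknown = receive_utilisation(118_750_000, None)
```

Returning `None` for an unknown speed keeps such hosts out of the comparison, much as a missing series drops out of the PromQL expression.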
- alert: NodeNetworkTransmitBytesTotalHigh1000Mbps
expr: (max by(appid,job,serial_num,host_ip,instance)(rate(node_network_transmit_bytes_total{}[2m])*8)) / 1000 / 1000 > 1000
for: 6m
labels:
api_opsaid_cn_fault_priority: P5
api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
summary: Evaluated every 2 minutes, sustained for 6 minutes, outbound traffic above 1000Mbps (current value {{$value}}Mbps)
Last modified 2023.02.15: feat: add node-exporter host bandwidth warning (60d6433)