Node Exporter

Overview

appid = "oneops"

Metric name Type Description
node_filesystem_avail_bytes xx Space remaining on the partition
node_filesystem_size_bytes xx Total size of the partition

Common labels

Label name Description
api.opsaid.cn/fault-priority P5
api.opsaid.cn/pm2-uuid UUID of the project

Global default alert rules

groups:
- name: default-global-host-2m
  interval: 2m
  rules:

Long evaluation interval and long hold duration; used for global-level checks.

Host CPU usage warning

alert: NodeCPUUseHigh60
expr: |
  (
    1 - (sum(increase(node_cpu_seconds_total{appid!~"uptime",mode="idle"}[1m])) by (appid,job,instance,serial_num) / sum(increase(node_cpu_seconds_total{appid!~"uptime"}[1m])) by (appid,job,instance,serial_num) )
  ) >= 0.6
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, CPU usage above 60% (current value {{$value|humanizePercentage}})
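
As a rough illustration of what the CPU expression computes, the sketch below takes per-mode increases of node_cpu_seconds_total over one window and derives the busy ratio as 1 - idle / total. The function name and the sample numbers are made up for illustration; they are not part of the rule set.

```python
def cpu_busy_ratio(increase_by_mode):
    """Return 1 - idle/total for per-mode CPU-second increases over one window."""
    total = sum(increase_by_mode.values())
    if total == 0:
        # No CPU time recorded in the window, so no ratio can be computed.
        return None
    return 1 - increase_by_mode.get("idle", 0.0) / total


# Hypothetical increases (CPU seconds) over one window, summed across cores.
sample = {"idle": 18.0, "user": 30.0, "system": 9.0, "iowait": 3.0}
ratio = cpu_busy_ratio(sample)

# The rule compares the ratio against 0.6 (60%).
fires = ratio is not None and ratio >= 0.6
```

The iowait rule follows the same pattern, except that the numerator is the iowait increase itself rather than 1 minus the idle share.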

Host iowait usage warning

alert: NodeCPUIOWaitHigh20
expr: |
  (
    (sum(increase(node_cpu_seconds_total{appid!~"uptime",mode="iowait"}[1m])) by (appid,job,instance,serial_num) / sum(increase(node_cpu_seconds_total{appid!~"uptime"}[1m])) by (appid,job,instance,serial_num) )
  ) >= 0.2
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, iowait usage above 20% (current value {{$value|humanizePercentage}})

Host 1-minute load warning

alert: NodeLoad1High30
expr: ceil(avg(node_load1{appid!~"uptime"}) by (appid,job,instance,serial_num)) > 30
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, load1 above 30 (current value {{$value}})

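Partition predicted to fill up within 24 hours

Uses the previous 6 hours of data to predict whether the partition may fill up within 24 hours; the condition must hold for 1 hour.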

alert: NodeFilesystemSpaceFillingUp
expr: |
  (
    node_filesystem_avail_bytes{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_size_bytes{appid="oneops",fstype!="",device="rootfs"} * 100 < 40
  and
    predict_linear(node_filesystem_avail_bytes{appid="oneops",fstype!="",device="rootfs"}[6h], 24*60*60) < 0
  and
    node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
  )
for: 1h
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Partition space may fill up within the next 24 hours

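Remaining available space as a percentage: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100
Predicted available space: predict_linear(node_filesystem_avail_bytes{appid="oneops",fstype!=""}[6h], 24*60*60)
Filesystem is not read-only: node_filesystem_readonly{appid="oneops",fstype!=""} == 0

predict_linear(v range-vector, t scalar) fits a simple linear regression to the samples of v in the given range (the previous 6 hours here) and predicts what the value may be t seconds from now.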
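
To make the prediction step concrete, here is a minimal sketch of a least-squares linear extrapolation in the spirit of predict_linear. It is a simplification, not Prometheus's implementation: it extrapolates from the timestamp of the last sample rather than from the rule evaluation time, and the sample data (a partition losing 1 GiB per hour) is hypothetical.

```python
GIB = 1024 ** 3


def predict_linear(samples, t):
    """Fit a least-squares line through (timestamp, value) samples and
    return the fitted value t seconds after the last sample."""
    n = len(samples)
    if n < 2:
        return None  # a line needs at least two points
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        return None  # all samples share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    intercept = mean_y - slope * mean_x
    # Assumption: treat the last sample's timestamp as "now".
    now = samples[-1][0]
    return intercept + slope * (now + t)


# Hypothetical 6 hours of hourly samples: 20 GiB available, dropping 1 GiB/hour.
samples = [(h * 3600, (20 - h) * GIB) for h in range(7)]
predicted = predict_linear(samples, 24 * 60 * 60)

# A negative prediction means the partition is expected to be full within 24 hours.
will_fill = predicted < 0
```

In this example 14 GiB remain after 6 hours, and losing another 24 GiB over the next day drives the prediction below zero, which is the condition the rule checks.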

Low available disk space

alert: NodeFilesystemAlmostOutOfSpace
expr: |
  (
    node_filesystem_avail_bytes{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_size_bytes{appid="oneops",fstype!="",device="rootfs"} * 100 < 5
  and
    node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
  )
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, available space below 5% (current value {{ printf "%.2f" $value }}%)

inodes predicted to run out within 24 hours

alert: NodeFilesystemFilesFillingUp
expr: |
  (
    node_filesystem_files_free{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_files{appid="oneops",fstype!="",device="rootfs"} * 100 < 40
  and
    predict_linear(node_filesystem_files_free{appid="oneops",fstype!="",device="rootfs"}[6h], 24*60*60) < 0
  and
    node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
  )
for: 1h
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Partition inodes may run out within the next 24 hours

Low available inodes on partition

alert: NodeFilesystemAlmostOutOfFiles
expr: |
  (
    node_filesystem_files_free{appid="oneops",fstype!="",device="rootfs"} / node_filesystem_files{appid="oneops",fstype!="",device="rootfs"} * 100 < 5
  and
    node_filesystem_readonly{appid="oneops",fstype!="",device="rootfs"} == 0
  )
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, available inodes below 5% (current value {{ printf "%.2f" $value }}%)

Host NIC receive errors

alert: NodeNetworkReceiveErrs
expr: |
  rate(node_network_receive_errs_total{appid="oneops"}[2m]) / rate(node_network_receive_packets_total{appid="oneops"}[2m]) > 0.01
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, NIC {{ $labels.device }} is reporting packet receive errors

conntrack entries approaching the limit

alert: NodeHighNumberConntrackEntriesUsed
expr: |
  (node_nf_conntrack_entries{appid!~"uptime"} / node_nf_conntrack_entries_limit{appid!~"uptime"}) > 0.75
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, conntrack entry usage is approaching the limit (current value {{$value|humanizePercentage}})

Machine clock skew

alert: NodeClockSkewDetected
expr: |
  (
    node_timex_offset_seconds{appid!~"uptime"} > 0.05
  and
    deriv(node_timex_offset_seconds{appid!~"uptime"}[5m]) >= 0
  )
  or
  (
    node_timex_offset_seconds{appid!~"uptime"} < -0.05
  and
    deriv(node_timex_offset_seconds{appid!~"uptime"}[5m]) <= 0
  )
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, machine clock offset exceeds 0.05 seconds

Machine clock not synchronising

alert: NodeClockNotSynchronising
expr: |
  min_over_time(node_timex_sync_status{appid!~"uptime"}[5m]) == 0
  and
  node_timex_maxerror_seconds{appid!~"uptime"} >= 16
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, machine clock is not synchronising

File descriptor check

  • Alert rule
alert: NodeFileDescriptorLimit
expr: (node_filefd_allocated{appid!~"uptime"}/node_filefd_maximum{appid!~"uptime"}) * 100 > 70
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
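  summary: Evaluated every 2 minutes, sustained for 6 minutes, file descriptor usage is approaching the limit (current value {{ printf "%.2f" $value }}%)
  • Rule explanation
/proc/sys/fs/file-max

This file sets the limit on the number of file descriptors the running kernel can have open.

/proc/sys/fs/file-nr

This is a status file containing three values: the first is the number of file descriptors currently allocated system-wide, the second is the number of allocated but unused file descriptors (available for reuse), and the third is the maximum number of file descriptors.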
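
The three fields described above can be read as follows. This is a small sketch that parses a file-nr style string and computes the usage percentage the alert compares against 70; the sample string is hypothetical, and on a real Linux host the text would come from /proc/sys/fs/file-nr.

```python
def parse_file_nr(text):
    """Split the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def fd_usage_percent(text):
    """Allocated descriptors as a percentage of the system-wide maximum."""
    allocated, _unused, maximum = parse_file_nr(text)
    return allocated / maximum * 100


# Hypothetical content: 750000 allocated, 0 unused, limit of 1000000.
sample = "750000\t0\t1000000\n"
usage = fd_usage_percent(sample)

# The rule fires when usage stays above 70 for the configured duration.
over_threshold = usage > 70
```

The first and third fields correspond to the node_filefd_allocated and node_filefd_maximum metrics used in the rule.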

Swap usage reaches 50% warning

  • Alert rule
alert: NodeMemorySwapUsedHigh50
expr: ((node_memory_SwapTotal_bytes{appid!~"uptime"} - node_memory_SwapFree_bytes{appid!~"uptime"}) / (node_memory_SwapTotal_bytes{appid!~"uptime"})) > 0.5
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, host swap usage above 50% (current value {{$value|humanizePercentage}})
  • Rule explanation

pass

Process open fds reach 30% of the maximum warning

alert: NodeFileDescriptorProcessesUsedHigh30
expr: (process_open_fds{appid!~"uptime"} / process_max_fds{appid!~"uptime"}) > 0.3
for: 6m
labels:
  api_opsaid_cn_fault_priority: P5
  api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
annotations:
  summary: Evaluated every 2 minutes, sustained for 6 minutes, process open fds above 30% of the maximum (current value {{$value|humanizePercentage}})
  • Rule explanation

Depends on data reported by the client.

Too many network interfaces

- alert: NodeNetworkCardCountHigh600
  expr: sort_desc(count(node_network_up{}) by(job,instance,host_ip)) > 600
  for: 6m
  labels:
    api_opsaid_cn_fault_priority: P5
    api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
  annotations:
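    summary: Evaluated every 2 minutes, sustained for 6 minutes, more than 600 network interfaces on the host (current value {{$value}})

If a host has too many network interfaces, the node_exporter process collecting them can hang; each interface typically accounts for about 3k bytes (uncompressed).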
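
A quick back-of-the-envelope calculation shows why the threshold matters. The sketch below multiplies the interface count by the per-interface figure quoted above; reading "3k bytes" as 3000 bytes is an assumption, and the function name is illustrative only.

```python
# "about 3k bytes" per interface, from the text; taking k as 1000 is an assumption.
BYTES_PER_NIC = 3 * 1000


def estimated_scrape_bytes(nic_count, bytes_per_nic=BYTES_PER_NIC):
    """Rough uncompressed payload size contributed by network interfaces."""
    return nic_count * bytes_per_nic


# At the alert threshold of 600 interfaces.
at_threshold = estimated_scrape_bytes(600)
at_threshold_mb = at_threshold / 1000 / 1000
```

Under these assumptions, a host at the 600-interface threshold already produces on the order of a couple of megabytes of uncompressed network metrics per scrape.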

NIC receive bandwidth utilization reaches 90%

  • Alert rule
- alert: NodeNetworkReceiveUsedHigh90
  expr: |
    (
      (
        max by(appid,job,serial_num,device,host_ip,instance) (rate(node_network_receive_bytes_total{appid!~"uptime"}[2m])*8)
      ) / 1000 / 1000
    )

    >

    (
      (
        max (node_network_speed_bytes{appid!~"uptime"} * 8 / 1000 / 1000) by (appid,job,serial_num,device,host_ip,instance) > 20
      ) * 0.9
    )
  for: 6m
  labels:
    api_opsaid_cn_fault_priority: P5
    api_opsaid_cn_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
  annotations:
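    summary: Evaluated every 2 minutes, sustained for 6 minutes, NIC {{ $labels.device }} receive bandwidth above 90% of link speed (current value {{$value}})
  • Rule explanation

The NIC speed is read from /sys/class/net/eth0/ and exposed as the metric node_network_speed_bytes. Its unit is bytes, so converting to bits requires multiplying by 8.

sort_desc(node_network_speed_bytes{} * 8 / 1000 / 1000 > 10 )

This filters for NICs faster than 10 Mbit/s.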
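
The unit conversion and the 90% comparison can be sketched as below. This is an illustration under stated assumptions rather than a copy of the rule: both sides are converted with decimal megabits (divide by 1000 twice), the 20 Mbit/s minimum link speed mirrors the filter in the alert expression, and the function names and sample rates are hypothetical.

```python
def bytes_per_sec_to_mbits(bytes_per_sec):
    """Convert bytes/second to decimal megabits/second."""
    return bytes_per_sec * 8 / 1000 / 1000


def receive_over_threshold(rx_bytes_per_sec, link_speed_bytes_per_sec,
                           min_link_mbits=20, fraction=0.9):
    """True when receive throughput exceeds `fraction` of the link speed.

    Links at or below `min_link_mbits` are skipped, as in the alert expression.
    """
    link_mbits = bytes_per_sec_to_mbits(link_speed_bytes_per_sec)
    if link_mbits <= min_link_mbits:
        return False
    return bytes_per_sec_to_mbits(rx_bytes_per_sec) > link_mbits * fraction


# A 1 Gbit/s NIC reports node_network_speed_bytes = 125000000.
link_speed = 125_000_000
busy = receive_over_threshold(115_000_000, link_speed)   # 920 Mbit/s received
quiet = receive_over_threshold(100_000_000, link_speed)  # 800 Mbit/s received
```

Keeping both sides in the same unit matters: mixing 1024-based and 1000-based divisors shifts the effective threshold by roughly 5%.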

Note:

  1. Most virtual machines may not expose this value.

NIC transmit traffic exceeds 1000 Mbps

- alert: NodeNetworkTransmitBytesTotalHigh1000Mbps
  expr: (max by(appid,job,serial_num,host_ip,instance)(rate(node_network_transmit_bytes_total{}[2m])*8)) / 1000 / 1000 > 1000
  for: 6m
  labels:
    ops_internal_17173_com_fault_priority: P5
    ops_internal_17173_com_pm2_uuid: 99feafb5-bed6-4daf-927a-69a2ab80c485
  annotations:
    summary: Evaluated every 2 minutes, sustained for 6 minutes, outbound traffic above 1000 Mbps (current value {{$value}} Mbps)