etcd

Overview

A record of etcd problems encountered in specific scenarios.

Startup failures

“received signal; shutting down”

  • Key logs
{"level":"info","ts":"2023-07-27T13:31:34.277Z","caller":"rafthttp/stream.go:274","msg":"established TCP streaming connection with remote peer","stream-writer-type":"stream MsgApp v2","local-member-id":"edacac26a3fb24b3","remote-peer-id":"2c8b55a3459bb7ff"}
{"level":"info","ts":"2023-07-27T13:33:03.898Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"}
{"level":"info","ts":"2023-07-27T13:33:03.898Z","caller":"embed/etcd.go:367","msg":"closing etcd server","name":"node3","data-dir":"/data/etcd/","advertise-peer-urls":["https://10.49.2.36:2380"],"advertise-client-urls":["https://10.49.2.36:2379"]}
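  • Cause

In /etc/containerd/config.toml, SystemdCgroup is set to false, which differs from the cgroup driver configured for kubelet.

  • Resolution

Change the configuration so that containerd uses the same cgroup driver as kubelet, systemd: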

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
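
To confirm the mismatch before and after the change, the two settings can be compared directly. The following is a minimal sketch, not part of the original procedure: the function names are hypothetical, and the kubelet config path /var/lib/kubelet/config.yaml is an assumption (the kubeadm default), so adjust it if kubelet is configured elsewhere.

```shell
# Print the SystemdCgroup value ("true" or "false") from a containerd config.
containerd_systemd_cgroup() {
    config="${1:-/etc/containerd/config.toml}"
    sed -nE 's/^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*(true|false).*$/\1/p' "$config" | head -n 1
}

# Print the cgroupDriver value from a kubelet config file.
# Assumption: the default path is the kubeadm location.
kubelet_cgroup_driver() {
    config="${1:-/var/lib/kubelet/config.yaml}"
    sed -nE 's/^cgroupDriver:[[:space:]]*"?([A-Za-z]+)"?.*$/\1/p' "$config" | head -n 1
}

# Succeed only if containerd and kubelet agree on the cgroup driver.
cgroup_drivers_match() {
    case "$(containerd_systemd_cgroup "$1")" in
        true) runtime_driver=systemd ;;
        *)    runtime_driver=cgroupfs ;;   # false or unset means cgroupfs
    esac
    [ "$runtime_driver" = "$(kubelet_cgroup_driver "$2")" ]
}
```

Calling `cgroup_drivers_match` with no arguments checks the default paths. After editing config.toml, containerd has to be restarted for the setting to take effect, for example with `systemctl restart containerd` on a systemd-managed host.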

Backup and recovery

Recovering etcd data on a control-plane node

  • Log in to the failed node and remove the etcd static Pod manifest
cd /etc/kubernetes/manifests/
mv etcd.yaml /usr/local/src/
  • Log in to another healthy etcd cluster node

List the members:

ctr --namespace k8s.io run \
    --rm \
    --net-host \
    --mount type=bind,src=/etc/kubernetes,dst=/etc/kubernetes,options=rbind:ro \
    registry.cn-hangzhou.aliyuncs.com/kube-image-repo/etcd:v3.5.10 etcd-health \
    etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --endpoints https://127.0.0.1:2379 \
        member list
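
The ID needed for `member remove` comes from this listing. As a rough sketch, assuming the default comma-separated output of `etcdctl member list` in etcd v3.5 (ID, status, name, peer URLs, client URLs, learner flag), a small helper can pick out the ID for a given member name. The function name `member_id_by_name` is hypothetical, and the sample line in the comment is illustrative rather than captured from a real cluster.

```shell
# Read `etcdctl member list` output on stdin and print the ID of the member
# whose name (third field) equals $1.
# Illustrative input line (not from a real cluster):
#   63e2da0114a7b9b7, started, node2, https://192.168.201.122:2380, https://192.168.201.122:2379, false
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}
```

Piping the listing through `member_id_by_name node2` would then print the ID to pass to `member remove`. A member that has never started reports an empty name, so it has to be identified by its peer URL instead.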
  • Remove the faulty member
ctr --namespace k8s.io run \
    --rm \
    --net-host \
    --mount type=bind,src=/etc/kubernetes,dst=/etc/kubernetes,options=rbind:ro \
    registry.cn-hangzhou.aliyuncs.com/kube-image-repo/etcd:v3.5.10 etcd-health \
    etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --endpoints https://127.0.0.1:2379 \
        member remove 63e2da0114a7b9b7
  • Add the failed node's etcd back to the cluster as a new member
ctr --namespace k8s.io run \
    --rm \
    --net-host \
    --mount type=bind,src=/etc/kubernetes,dst=/etc/kubernetes,options=rbind:ro \
    registry.cn-hangzhou.aliyuncs.com/kube-image-repo/etcd:v3.5.10 etcd-health \
    etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --endpoints https://127.0.0.1:2379 \
        member add node2 --peer-urls="https://192.168.201.122:2380"

The command prints the settings to use for the new member:

ETCD_NAME="node2"
ETCD_INITIAL_CLUSTER="node3=https://192.168.201.123:2380,node2=https://192.168.201.122:2380,node1=https://192.168.201.121:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://192.168.201.122:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
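
The three `ctr` invocations above differ only in the final `etcdctl` subcommand. As a convenience sketch (not part of the original procedure), the shared arguments can be wrapped in a shell function. The function name `etcdctl_in_ctr` and the `DRY_RUN` variable are hypothetical; the image, container ID, certificate paths, and endpoint are copied from the commands above.

```shell
# Run an etcdctl subcommand inside a throwaway container via ctr.
# With DRY_RUN set to a non-empty value, print the command instead of running it.
etcdctl_in_ctr() {
    # Prepend the fixed ctr and etcdctl arguments to the subcommand passed in.
    set -- ctr --namespace k8s.io run \
        --rm \
        --net-host \
        --mount type=bind,src=/etc/kubernetes,dst=/etc/kubernetes,options=rbind:ro \
        registry.cn-hangzhou.aliyuncs.com/kube-image-repo/etcd:v3.5.10 etcd-health \
        etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt \
        --key /etc/kubernetes/pki/etcd/peer.key \
        --cacert /etc/kubernetes/pki/etcd/ca.crt \
        --endpoints https://127.0.0.1:2379 \
        "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

With this in place, the listing step becomes `etcdctl_in_ctr member list`, and setting `DRY_RUN=1` first shows the full command line without touching the cluster.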
  • Start the etcd service on the failed node

In the etcd startup arguments, change

--initial-cluster-state=new

to

--initial-cluster-state=existing
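
A minimal sketch of this edit follows. It assumes the flag lives in the etcd.yaml manifest that was moved to /usr/local/src/ in the first step; the original does not say which file holds it, so treat that as an assumption. The function name is hypothetical.

```shell
# Rewrite --initial-cluster-state=new to =existing in the given file.
# A temporary file is used instead of `sed -i` so the sketch behaves the same
# with GNU and BSD sed; writing back with cat keeps the file's mode and owner.
set_cluster_state_existing() {
    file="$1"
    sed 's/--initial-cluster-state=new/--initial-cluster-state=existing/' "$file" > "$file.tmp" &&
        cat "$file.tmp" > "$file" &&
        rm -f "$file.tmp"
}
```

Under that assumption the call would be `set_cluster_state_existing /usr/local/src/etcd.yaml`. Moving the file back into /etc/kubernetes/manifests/ afterwards lets kubelet recreate the static Pod.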



Last modified 2024.05.06: docs: split out common issue handling (96c4309)