kube-proxy

Overview

With a sufficiently recent kernel, "cilium" can fully replace the "kube-proxy" component, handling traffic forwarding for Kubernetes Service types such as ClusterIP, NodePort, and LoadBalancer.

Enabling the feature

Changing the configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  ......
  kube-proxy-replacement: "true"
  ......

Enable the feature by setting "kube-proxy-replacement" to "true" in the configuration.
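
The edit to the ConfigMap does not take effect until the agents reload it. A minimal sketch for rolling the agents, assuming the default DaemonSet name "cilium":

# Restart the Cilium agents so they pick up the updated ConfigMap
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium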

Verifying the status

Log into the "cilium" container and run the following command to check the status:

cilium status --verbose

The output below lists the enablement state of each feature:

KubeProxyReplacement Details:
  Status:                 True
  Socket LB:              Enabled
  Socket LB Tracing:      Enabled
  Socket LB Coverage:     Full
  Devices:                enx0826ae396977 192.168.0.2 (Direct Routing), wlp0s20f3 192.168.0.2
  Mode:                   SNAT
  Backend Selection:      Random
  Session Affinity:       Enabled
  Graceful Termination:   Enabled
  NAT46/64 Support:       Disabled
  XDP Acceleration:       Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767)
  - LoadBalancer:   Enabled
  - externalIPs:    Enabled
  - HostPort:       Enabled
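
The same check can also be run from outside the Pod with kubectl. A minimal sketch, assuming the default DaemonSet and container names of a standard installation:

# Query any agent Pod of the DaemonSet for its kube-proxy replacement state
kubectl -n kube-system exec ds/cilium -c cilium-agent -- \
    cilium status --verbose | grep -A 3 'KubeProxyReplacement'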

Verifying the functionality

  • Create the following resources (a kubectl generation sketch follows this list):
---
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: registry.cn-hangzhou.aliyuncs.com/kube-image-repo/nginx:1.23.2
    name: nginx
    resources: {}
    ports:
    - containerPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  selector:
    run: nginx
  type: NodePort
  • Check the Pod and Service status:
$ kubectl get pods,services -l run=nginx -o wide
NAME        READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
pod/nginx   1/1     Running   0          79s   192.168.201.4   192.168.0.2   <none>           <none>

NAME            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE   SELECTOR
service/nginx   NodePort   10.233.90.187   <none>        80:32515/TCP   79s   run=nginx
$
  • Check the cilium forwarding state:
$ kubectl exec -i -t cilium-mzzk6 -n kube-system -- cilium service list
ID   Frontend             Service Type   Backend
1    10.233.0.1:443       ClusterIP      1 => 192.168.0.2:6443 (active)
11   10.233.0.10:9153     ClusterIP      1 => 192.168.201.3:9153 (active)
12   10.233.0.10:53       ClusterIP      1 => 192.168.201.3:53 (active)
13   10.233.166.209:443   ClusterIP      1 => 192.168.201.12:4443 (active)
25   10.233.2.226:2746    ClusterIP      1 => 192.168.201.7:2746 (active)
......
37   192.168.0.2:32515    NodePort       1 => 192.168.201.4:80 (active)
38   0.0.0.0:32515        NodePort       1 => 192.168.201.4:80 (active)
$
  • Verify service access:
$ curl 'http://192.168.0.2:32515'
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
......
</body>
</html>
$
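
For reference, manifests in the shape above (note the "creationTimestamp: null" and empty "status" fields) can be produced with kubectl's client-side dry run; a sketch using the same image:

# Generate the Pod and Service manifests instead of writing them by hand
kubectl run nginx \
    --image=registry.cn-hangzhou.aliyuncs.com/kube-image-repo/nginx:1.23.2 \
    --port=80 --dry-run=client -o yaml > nginx-pod.yaml
kubectl expose pod nginx --port=80 --target-port=80 \
    --type=NodePort --dry-run=client -o yaml > nginx-service.yaml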

Common usage patterns

Restricting NodePort services to specific NICs

Use case: on a host with multiple NICs, prevent "NodePort" type "Service" resources from listening on unintended network segments; if the firewall is not configured properly, this can lead to accidental data exposure.

  • The host in this experiment has the following NICs:
$ ip addr list
2: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    inet 10.5.56.241/24 brd 10.5.56.255 scope global dynamic noprefixroute wlp0s20f3
3: enx0826ae396977: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    inet 192.168.0.2/24 brd 192.168.0.255 scope global noprefixroute enx0826ae396977
$

The NICs "wlp0s20f3" and "enx0826ae396977" are configured with the addresses "10.5.56.241/24" and "192.168.0.2/24" respectively.

  • Create a NodePort type Service:
apiVersion: v1
kind: Service
metadata:
  name: cilium-ingress
  namespace: kube-system
  labels:
    cilium.io/ingress: "true"
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    nodePort: 30080
  type: NodePort
  • Addresses the NodePort is now listening on:
$ cilium service list
ID   Frontend              Service Type   Backend
10   10.233.0.1:443        ClusterIP      1 => 192.168.0.2:6443 (active)
11   10.233.124.160:80     ClusterIP
12   192.168.0.2:30080     NodePort
14   0.0.0.0:30080         NodePort
15   10.233.148.250:80     ClusterIP      1 => 192.168.0.2:4244 (active)
16   10.233.46.189:80      ClusterIP      1 => 192.168.201.4:4245 (active)
17   10.233.49.29:80       ClusterIP      1 => 192.168.201.14:8081 (active)
18   10.233.216.132:2746   ClusterIP      1 => 192.168.201.1:2746 (active)
19   10.5.56.241:30080     NodePort
$

You can also use "bpftool net" to see which NICs currently have programs attached at the "tc" stage:

$ bpftool net
xdp:

tc:
wlp0s20f3(2) clsact/ingress cil_from_netdev-wlp0s20f3 id 19355
wlp0s20f3(2) clsact/egress cil_to_netdev-wlp0s20f3 id 19360
enx0826ae396977(3) clsact/ingress cil_from_netdev-enx0826ae396977 id 19340
enx0826ae396977(3) clsact/egress cil_to_netdev-enx0826ae396977 id 19341
cilium_net(4) clsact/ingress cil_to_host-cilium_net id 19325
cilium_host(5) clsact/ingress cil_to_host-cilium_host id 19304
cilium_host(5) clsact/egress cil_from_host-cilium_host id 19284
lxc96ee90fabd66(93) clsact/ingress cil_from_container-lxc96ee90fabd66 id 19327
lxc5f977feb7c71(95) clsact/ingress cil_from_container-lxc5f977feb7c71 id 19262
lxc442d93d4a07e(99) clsact/ingress cil_from_container-lxc442d93d4a07e id 19300
lxc3d8018e15dbd(101) clsact/ingress cil_from_container-lxc3d8018e15dbd id 19313
lxc_health(145) clsact/ingress cil_from_container-lxc_health id 19333

flow_dissector:

$

At this point "telnet 10.5.56.241 30080" connects successfully, but we do not want the service to listen on that NIC.
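
A quick probe makes the before/after comparison concrete; a sketch, assuming "nc" is available on a machine that can reach the host:

# Before limiting `devices`, the NodePort answers on the unwanted NIC
nc -zv 10.5.56.241 30080
# After applying the `devices` setting below and restarting the agents,
# the same probe should no longer connect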

  • Configure the "devices" parameter in "cilium-config":
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  ......
  devices: "lxc+ enx0826ae396977"
  ......

With multiple NICs, separate the entries in "devices" with spaces, e.g. "eth0 em3 bind1". The configuration above means traffic is handled only on NICs whose names start with "lxc" and on the NIC named "enx0826ae396977".
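
For a Helm-managed installation, the same restriction can be expressed as a chart value instead of editing the ConfigMap by hand. A sketch, assuming the release is named "cilium" and that "devices" is the corresponding value in the Cilium chart:

# Pass the device list via Helm; the list entries map to the
# space-separated `devices` key in `cilium-config`
helm upgrade cilium cilium/cilium -n kube-system \
    --reuse-values \
    --set devices='{lxc+,enx0826ae396977}'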

Then check again with "bpftool net" or "cilium service list" to confirm the setup meets the requirement:

$ bpftool net
xdp:

tc:
enx0826ae396977(3) clsact/ingress cil_from_netdev-enx0826ae396977 id 21450
enx0826ae396977(3) clsact/egress cil_to_netdev-enx0826ae396977 id 21447
cilium_net(4) clsact/ingress cil_to_host-cilium_net id 21434
cilium_host(5) clsact/ingress cil_to_host-cilium_host id 21355
cilium_host(5) clsact/egress cil_from_host-cilium_host id 21410
lxc96ee90fabd66(93) clsact/ingress cil_from_netdev-lxc96ee90fabd66 id 21500
lxc96ee90fabd66(93) clsact/egress cil_to_netdev-lxc96ee90fabd66 id 21502
lxc5f977feb7c71(95) clsact/ingress cil_from_netdev-lxc5f977feb7c71 id 21494
lxc5f977feb7c71(95) clsact/egress cil_to_netdev-lxc5f977feb7c71 id 21491
lxc442d93d4a07e(99) clsact/ingress cil_from_netdev-lxc442d93d4a07e id 21473
lxc442d93d4a07e(99) clsact/egress cil_to_netdev-lxc442d93d4a07e id 21475
lxc3d8018e15dbd(101) clsact/ingress cil_from_netdev-lxc3d8018e15dbd id 21463
lxc3d8018e15dbd(101) clsact/egress cil_to_netdev-lxc3d8018e15dbd id 21461
lxc_health(155) clsact/ingress cil_from_container-lxc_health id 21362

flow_dissector:

$

Note why the NICs starting with "lxc" must be included: if they are missing, "bpftool net" shows the following:

$ bpftool net
xdp:

tc:
enx0826ae396977(3) clsact/ingress cil_from_netdev-enx0826ae396977 id 19760
enx0826ae396977(3) clsact/egress cil_to_netdev-enx0826ae396977 id 19762
cilium_net(4) clsact/ingress cil_to_host-cilium_net id 19723
cilium_host(5) clsact/ingress cil_to_host-cilium_host id 19706
cilium_host(5) clsact/egress cil_from_host-cilium_host id 19665
lxc96ee90fabd66(93) clsact/ingress cil_from_container-lxc96ee90fabd66 id 19691
lxc5f977feb7c71(95) clsact/ingress cil_from_container-lxc5f977feb7c71 id 19682
lxc442d93d4a07e(99) clsact/ingress cil_from_container-lxc442d93d4a07e id 19733
lxc3d8018e15dbd(101) clsact/ingress cil_from_container-lxc3d8018e15dbd id 19670
lxc_health(147) clsact/ingress cil_from_container-lxc_health id 19747

flow_dissector:

$

As shown, the NICs whose names start with "lxc" now retain only the "clsact/ingress" hook; "clsact/egress" is gone. In this state, "cilium-envoy" cannot route to backends: the observed symptom is that the "reserved:ingress" IP cannot reach the corresponding Pods. The exact root cause has yet to be investigated.

$ cilium ip list | grep reserved
0.0.0.0/0           reserved:world
10.5.56.241/32      reserved:host
                    reserved:kube-apiserver
192.168.0.2/32      reserved:host
                    reserved:kube-apiserver
192.168.201.4/32    reserved:health
192.168.201.12/32   reserved:ingress
192.168.201.13/32   reserved:host
                    reserved:kube-apiserver
$

You can run "tcpdump" on the host against the specific Pod's NIC:

$ tcpdump -i lxc442d93d4a07e -nn
......
17:26:21.709985 IP 192.168.201.12.43687 > 192.168.201.1.2746: Flags [S], seq 3062515778, win 64240, options [mss 1460,sackOK,TS val 2429471181 ecr 0,nop,wscale 7], length 0
17:26:21.710010 IP 192.168.201.1.2746 > 192.168.201.12.43687: Flags [S.], seq 955549526, ack 3062515779, win 65160, options [mss 1460,sackOK,TS val 2974496186 ecr 2429468135,nop,wscale 7], length 0
17:26:21.710058 IP 192.168.201.1.2746 > 192.168.201.12.43687: Flags [S.], seq 955549526, ack 3062515779, win 65160, options [mss 1460,sackOK,TS val 2974496186 ecr 2429468135,nop,wscale 7], length 0
17:26:21.741919 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.741954 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.741973 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.741990 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.742008 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
......

Here the "reserved:ingress" IP "192.168.201.12" fails to establish a TCP connection with the backend Pod "192.168.201.1", with "ICMP unreachable" errors.

Cluster traffic policies

internalTrafficPolicy   externalTrafficPolicy   North-south traffic   East-west traffic
Cluster                 Cluster                 all nodes reachable   all nodes reachable
Cluster                 Local                   accessed node only    all nodes reachable
Local                   Cluster                 all nodes reachable   accessed node only
Local                   Local                   accessed node only    accessed node only
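
Both policies are standard fields on the Service spec. A minimal sketch applying them to the nginx Service from the earlier example:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    run: nginx
spec:
  type: NodePort
  selector:
    run: nginx
  ports:
  - port: 80
    protocol: TCP
    targetPort: 80
  # East-west (in-cluster) traffic is served only by backends on the node
  # that received the request
  internalTrafficPolicy: Local
  # North-south (NodePort/LoadBalancer) traffic is served only by backends
  # local to the accessed node
  externalTrafficPolicy: Local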


