kube-proxy
Overview
On a sufficiently recent kernel, "cilium" can fully replace the "kube-proxy" component, handling traffic forwarding for Kubernetes Service types such as ClusterIP, NodePort, and LoadBalancer.
Enabling the feature
Update the configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: cilium-config
namespace: kube-system
data:
......
kube-proxy-replacement: "true"
......
Set "kube-proxy-replacement" to "true" in the configuration to enable the replacement.
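Note that editing the ConfigMap alone does not reconfigure the running agents; the "cilium" DaemonSet has to be restarted to pick up the change. A minimal sketch, assuming the default DaemonSet name in "kube-system":
$ kubectl -n kube-system rollout restart daemonset/cilium
If the cluster is managed through the Cilium Helm chart instead, the same switch can be set via values (in recent chart versions):
$ helm upgrade cilium cilium/cilium -n kube-system --reuse-values --set kubeProxyReplacement=true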
Verify the status
Log into the "cilium" container and run the following command to check the status:
cilium status --verbose
The output below shows the enablement state of each feature:
KubeProxyReplacement Details:
Status: True
Socket LB: Enabled
Socket LB Tracing: Enabled
Socket LB Coverage: Full
Devices: enx0826ae396977 192.168.0.2 (Direct Routing), wlp0s20f3 192.168.0.2
Mode: SNAT
Backend Selection: Random
Session Affinity: Enabled
Graceful Termination: Enabled
NAT46/64 Support: Disabled
XDP Acceleration: Disabled
Services:
- ClusterIP: Enabled
- NodePort: Enabled (Range: 30000-32767)
- LoadBalancer: Enabled
- externalIPs: Enabled
- HostPort: Enabled
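If you prefer not to open a shell inside the agent Pod, the same check can be run through "kubectl exec"; a sketch, using the agent Pod name from this environment (replace "cilium-mzzk6" with your own):
$ kubectl -n kube-system exec cilium-mzzk6 -- cilium status --verbose | grep -A 20 KubeProxyReplacement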
Verify the functionality
- Create the following resources (a sketch of how such manifests can be generated follows the YAML):
---
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: nginx
name: nginx
spec:
containers:
- image: registry.cn-hangzhou.aliyuncs.com/kube-image-repo/nginx:1.23.2
name: nginx
resources: {}
ports:
- containerPort: 80
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
---
apiVersion: v1
kind: Service
metadata:
creationTimestamp: null
labels:
run: nginx
name: nginx
spec:
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
run: nginx
type: NodePort
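As an aside, manifests in this shape (with "creationTimestamp: null" and an empty "status") are what "kubectl" prints with a client-side dry run; a sketch of how they could have been generated, using the same image as above:
$ kubectl run nginx --image=registry.cn-hangzhou.aliyuncs.com/kube-image-repo/nginx:1.23.2 --port=80 --dry-run=client -o yaml
$ kubectl expose pod nginx --port=80 --target-port=80 --type=NodePort --dry-run=client -o yaml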
- Check the Pod and Service status:
$ kubectl get pods,services -l run=nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/nginx 1/1 Running 0 79s 192.168.201.4 192.168.0.2 <none> <none>
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/nginx NodePort 10.233.90.187 <none> 80:32515/TCP 79s run=nginx
$
- Check cilium's forwarding entries:
$ kubectl exec -i -t cilium-mzzk6 -n kube-system -- cilium service list
ID Frontend Service Type Backend
1 10.233.0.1:443 ClusterIP 1 => 192.168.0.2:6443 (active)
11 10.233.0.10:9153 ClusterIP 1 => 192.168.201.3:9153 (active)
12 10.233.0.10:53 ClusterIP 1 => 192.168.201.3:53 (active)
13 10.233.166.209:443 ClusterIP 1 => 192.168.201.12:4443 (active)
25 10.233.2.226:2746 ClusterIP 1 => 192.168.201.7:2746 (active)
......
37 192.168.0.2:32515 NodePort 1 => 192.168.201.4:80 (active)
38 0.0.0.0:32515 NodePort 1 => 192.168.201.4:80 (active)
$
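The NodePort entry can also be cross-checked against the eBPF load-balancer maps from inside the agent; a sketch using the port allocated above:
$ kubectl -n kube-system exec cilium-mzzk6 -- cilium bpf lb list | grep 32515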
- Verify access to the service:
$ curl 'http://192.168.0.2:32515'
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
......
</body>
</html>
$
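Since "Socket LB Coverage" is reported as "Full" above, the ClusterIP should also be reachable directly from the node; a quick check under that assumption, using the CLUSTER-IP shown by kubectl earlier:
$ curl 'http://10.233.90.187'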
Common usage patterns
Dedicating specific NICs to NodePort services
Use case: when a host has multiple NICs, keep "NodePort" Services from listening on unintended networks; if the firewall is not configured properly, this can lead to unexpected data exposure.
- The host in this experiment has the following NICs:
$ ip addr list
2: wlp0s20f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 10.5.56.241/24 brd 10.5.56.255 scope global dynamic noprefixroute wlp0s20f3
3: enx0826ae396977: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
inet 192.168.0.2/24 brd 192.168.0.255 scope global noprefixroute enx0826ae396977
$
The NICs "wlp0s20f3" and "enx0826ae396977" are configured with the addresses "10.5.56.241/24" and "192.168.0.2/24" respectively.
- Create a NodePort Service:
apiVersion: v1
kind: Service
metadata:
name: cilium-ingress
namespace: kube-system
labels:
cilium.io/ingress: "true"
spec:
ports:
- name: http
port: 80
protocol: TCP
nodePort: 30080
type: NodePort
- The addresses the NodePort is listening on at this point:
$ cilium service list
ID Frontend Service Type Backend
10 10.233.0.1:443 ClusterIP 1 => 192.168.0.2:6443 (active)
11 10.233.124.160:80 ClusterIP
12 192.168.0.2:30080 NodePort
14 0.0.0.0:30080 NodePort
15 10.233.148.250:80 ClusterIP 1 => 192.168.0.2:4244 (active)
16 10.233.46.189:80 ClusterIP 1 => 192.168.201.4:4245 (active)
17 10.233.49.29:80 ClusterIP 1 => 192.168.201.14:8081 (active)
18 10.233.216.132:2746 ClusterIP 1 => 192.168.201.1:2746 (active)
19 10.5.56.241:30080 NodePort
$
You can also check with "bpftool net" which NICs currently have programs attached at the "tc" stage:
$ bpftool net
xdp:
tc:
wlp0s20f3(2) clsact/ingress cil_from_netdev-wlp0s20f3 id 19355
wlp0s20f3(2) clsact/egress cil_to_netdev-wlp0s20f3 id 19360
enx0826ae396977(3) clsact/ingress cil_from_netdev-enx0826ae396977 id 19340
enx0826ae396977(3) clsact/egress cil_to_netdev-enx0826ae396977 id 19341
cilium_net(4) clsact/ingress cil_to_host-cilium_net id 19325
cilium_host(5) clsact/ingress cil_to_host-cilium_host id 19304
cilium_host(5) clsact/egress cil_from_host-cilium_host id 19284
lxc96ee90fabd66(93) clsact/ingress cil_from_container-lxc96ee90fabd66 id 19327
lxc5f977feb7c71(95) clsact/ingress cil_from_container-lxc5f977feb7c71 id 19262
lxc442d93d4a07e(99) clsact/ingress cil_from_container-lxc442d93d4a07e id 19300
lxc3d8018e15dbd(101) clsact/ingress cil_from_container-lxc3d8018e15dbd id 19313
lxc_health(145) clsact/ingress cil_from_container-lxc_health id 19333
flow_dissector:
$
At this point "telnet 10.5.56.241 30080" connects, but we do not want the service listening on that NIC.
- Configure the "devices" parameter in the "cilium-config" ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: cilium-config
namespace: kube-system
data:
......
devices: "lxc+ enx0826ae396977"
......
With multiple NICs, the entries in "devices" are separated by spaces, e.g. "eth0 em3 bind1". The configuration above means traffic is handled only on NICs whose names start with "lxc" and on the NIC named "enx0826ae396977".
Then check again with "bpftool net" or "cilium service list" to confirm the result meets expectations:
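As with the earlier ConfigMap change, the agents have to be restarted before the new "devices" value takes effect; afterwards, which devices were actually selected can be confirmed from the agent status. A sketch, exec'ing through the DaemonSet so any agent Pod is picked:
$ kubectl -n kube-system exec ds/cilium -- cilium status | grep Devices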
$ bpftool net
xdp:
tc:
enx0826ae396977(3) clsact/ingress cil_from_netdev-enx0826ae396977 id 21450
enx0826ae396977(3) clsact/egress cil_to_netdev-enx0826ae396977 id 21447
cilium_net(4) clsact/ingress cil_to_host-cilium_net id 21434
cilium_host(5) clsact/ingress cil_to_host-cilium_host id 21355
cilium_host(5) clsact/egress cil_from_host-cilium_host id 21410
lxc96ee90fabd66(93) clsact/ingress cil_from_netdev-lxc96ee90fabd66 id 21500
lxc96ee90fabd66(93) clsact/egress cil_to_netdev-lxc96ee90fabd66 id 21502
lxc5f977feb7c71(95) clsact/ingress cil_from_netdev-lxc5f977feb7c71 id 21494
lxc5f977feb7c71(95) clsact/egress cil_to_netdev-lxc5f977feb7c71 id 21491
lxc442d93d4a07e(99) clsact/ingress cil_from_netdev-lxc442d93d4a07e id 21473
lxc442d93d4a07e(99) clsact/egress cil_to_netdev-lxc442d93d4a07e id 21475
lxc3d8018e15dbd(101) clsact/ingress cil_from_netdev-lxc3d8018e15dbd id 21463
lxc3d8018e15dbd(101) clsact/egress cil_to_netdev-lxc3d8018e15dbd id 21461
lxc_health(155) clsact/ingress cil_from_container-lxc_health id 21362
flow_dissector:
$
Note why the NICs starting with "lxc" need to be included here: if they are missing, "bpftool net" shows the following:
$ bpftool net
xdp:
tc:
enx0826ae396977(3) clsact/ingress cil_from_netdev-enx0826ae396977 id 19760
enx0826ae396977(3) clsact/egress cil_to_netdev-enx0826ae396977 id 19762
cilium_net(4) clsact/ingress cil_to_host-cilium_net id 19723
cilium_host(5) clsact/ingress cil_to_host-cilium_host id 19706
cilium_host(5) clsact/egress cil_from_host-cilium_host id 19665
lxc96ee90fabd66(93) clsact/ingress cil_from_container-lxc96ee90fabd66 id 19691
lxc5f977feb7c71(95) clsact/ingress cil_from_container-lxc5f977feb7c71 id 19682
lxc442d93d4a07e(99) clsact/ingress cil_from_container-lxc442d93d4a07e id 19733
lxc3d8018e15dbd(101) clsact/ingress cil_from_container-lxc3d8018e15dbd id 19670
lxc_health(147) clsact/ingress cil_from_container-lxc_health id 19747
flow_dissector:
$
As you can see, the NICs starting with "lxc" only retain the "clsact/ingress" hook and have lost "clsact/egress". In this situation, when "cilium-envoy" is used, traffic cannot be routed to the backends; what is observed is that the "reserved:ingress" IP cannot reach the Pods. The exact root cause still needs further investigation.
$ cilium ip list | grep reserved
0.0.0.0/0 reserved:world
10.5.56.241/32 reserved:host
reserved:kube-apiserver
192.168.0.2/32 reserved:host
reserved:kube-apiserver
192.168.201.4/32 reserved:health
192.168.201.12/32 reserved:ingress
192.168.201.13/32 reserved:host
reserved:kube-apiserver
$
You can run "tcpdump" against the specific Pod's NIC on the host:
$ tcpdump -i lxc442d93d4a07e -nn
......
17:26:21.709985 IP 192.168.201.12.43687 > 192.168.201.1.2746: Flags [S], seq 3062515778, win 64240, options [mss 1460,sackOK,TS val 2429471181 ecr 0,nop,wscale 7], length 0
17:26:21.710010 IP 192.168.201.1.2746 > 192.168.201.12.43687: Flags [S.], seq 955549526, ack 3062515779, win 65160, options [mss 1460,sackOK,TS val 2974496186 ecr 2429468135,nop,wscale 7], length 0
17:26:21.710058 IP 192.168.201.1.2746 > 192.168.201.12.43687: Flags [S.], seq 955549526, ack 3062515779, win 65160, options [mss 1460,sackOK,TS val 2974496186 ecr 2429468135,nop,wscale 7], length 0
17:26:21.741919 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.741954 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.741973 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.741990 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
17:26:21.742008 IP 192.168.201.13 > 192.168.201.1: ICMP host 192.168.201.12 unreachable, length 68
......
Here "reserved:ingress = 192.168.201.12" cannot establish a TCP connection with the backend "Pod 192.168.201.1"; the error is "ICMP unreachable".
Cluster traffic policies
How the combinations of a Service's internalTrafficPolicy and externalTrafficPolicy affect where traffic can be served:
internalTrafficPolicy | externalTrafficPolicy | North-south traffic | East-west traffic |
---|---|---|---|
Cluster | Cluster | Reachable via all nodes | Reachable via all nodes |
Cluster | Local | Only the accessed node | Reachable via all nodes |
Local | Cluster | Reachable via all nodes | Only the accessed node |
Local | Local | Only the accessed node | Only the accessed node |
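Both fields are set directly on the Service spec; a minimal sketch that would switch the nginx Service used earlier to the "Local"/"Local" row (the field names are standard Kubernetes Service fields, the values here are only for illustration):
$ kubectl patch service nginx --type merge -p '{"spec":{"internalTrafficPolicy":"Local","externalTrafficPolicy":"Local"}}'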