I. The Problem
After running kubeadm join on a worker node, the node stays in NotReady status in the cluster, as shown for k8s-node-4 below:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-jmeter-1.novalocal Ready <none> 17d v1.18.5
k8s-jmeter-2.novalocal Ready <none> 17d v1.18.5
k8s-jmeter-3.novalocal Ready <none> 17d v1.18.5
k8s-master.novalocal Ready master 51d v1.18.5
k8s-node-1.novalocal Ready <none> 51d v1.18.5
k8s-node-2.novalocal Ready <none> 51d v1.18.5
k8s-node-3.novalocal Ready <none> 51d v1.18.5
k8s-node-4.novalocal NotReady <none> 160m v1.18.5
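To get more detail on why a node is NotReady, describing the node usually helps; kubelet typically reports something like "container runtime network not ready" in the Ready condition until the CNI plugin is up. A minimal check (node name taken from the listing above) looks like:
# Inspect the node's conditions; the Ready condition carries the reason
$ kubectl describe node k8s-node-4.novalocal | grep -A 10 Conditions: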
II. Troubleshooting
First, check whether the system Pods have initialized correctly:
$ kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-5b8b769fcd-srkrb 1/1 Running 0 3d19h 10.100.185.9 k8s-jmeter-2.novalocal <none> <none>
calico-node-5c8xj 1/1 Running 10 51d 172.16.106.227 k8s-node-1.novalocal <none> <none>
calico-node-9d7rt 1/1 Running 8 51d 172.16.106.203 k8s-node-3.novalocal <none> <none>
calico-node-crczj 1/1 Running 5 51d 172.16.106.226 k8s-node-2.novalocal <none> <none>
calico-node-g4hx4 0/1 Init:ImagePullBackOff 0 99s 172.16.106.219 k8s-node-4.novalocal <none> <none>
calico-node-gpmsv 1/1 Running 5 17d 172.16.106.209 k8s-jmeter-1.novalocal <none> <none>
calico-node-pz7w5 1/1 Running 4 51d 172.16.106.200 k8s-master.novalocal <none> <none>
calico-node-r59bw 1/1 Running 3 17d 172.16.106.216 k8s-jmeter-2.novalocal <none> <none>
calico-node-xhjj8 1/1 Running 4 17d 172.16.106.210 k8s-jmeter-3.novalocal <none> <none>
coredns-66db54ff7f-2cxcp 1/1 Running 0 5d22h 10.100.167.140 k8s-node-1.novalocal <none> <none>
coredns-66db54ff7f-gptgt 1/1 Running 0 5d22h 10.100.41.31 k8s-master.novalocal <none> <none>
eip-nfs-nfs-storage-6fddcc8f9d-hqv7m 1/1 Running 0 3d19h 10.100.185.4 k8s-jmeter-2.novalocal <none> <none>
etcd-k8s-master.novalocal 1/1 Running 0 5d21h 172.16.106.200 k8s-master.novalocal <none> <none>
kube-apiserver-k8s-master.novalocal 1/1 Running 14 51d 172.16.106.200 k8s-master.novalocal <none> <none>
kube-controller-manager-k8s-master.novalocal 1/1 Running 56 16d 172.16.106.200 k8s-master.novalocal <none> <none>
kube-proxy-5msrp 1/1 Running 1 9d 172.16.106.226 k8s-node-2.novalocal <none> <none>
kube-proxy-64pkw 1/1 Running 2 9d 172.16.106.210 k8s-jmeter-3.novalocal <none> <none>
kube-proxy-6j2fw 1/1 Running 1 9d 172.16.106.203 k8s-node-3.novalocal <none> <none>
kube-proxy-7cptn 1/1 Running 0 157m 172.16.106.219 k8s-node-4.novalocal <none> <none>
kube-proxy-fkt9p 1/1 Running 1 9d 172.16.106.227 k8s-node-1.novalocal <none> <none>
kube-proxy-fxvjb 1/1 Running 4 9d 172.16.106.209 k8s-jmeter-1.novalocal <none> <none>
kube-proxy-wnj2l 1/1 Running 2 9d 172.16.106.216 k8s-jmeter-2.novalocal <none> <none>
kube-proxy-wnzqg 1/1 Running 0 9d 172.16.106.200 k8s-master.novalocal <none> <none>
kube-scheduler-k8s-master.novalocal 1/1 Running 48 16d 172.16.106.200 k8s-master.novalocal <none> <none>
kuboard-5cc4bcccd7-t8h8f 1/1 Running 0 21h 10.100.185.24 k8s-jmeter-2.novalocal <none> <none>
metrics-server-677dcb8b4d-jtpgd 1/1 Running 0 3d20h 172.16.106.227 k8s-node-1.novalocal <none> <none>
From the output we can see that the calico component on k8s-node-4 failed to initialize: its Pod (calico-node-g4hx4) is stuck in Init:ImagePullBackOff.
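To confirm that the failure really is an image pull problem, and to see which image and registry are involved, the Pod's events are the quickest place to look:
# The Events section at the bottom shows which image failed to pull and why
$ kubectl describe pod calico-node-g4hx4 -n kube-system
# Or list just the events, sorted by time
$ kubectl get events -n kube-system --sort-by=.lastTimestamp | grep calico-node-g4hx4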
III. Resolution
1. Identify the container images
Use kubectl to list the container images the calico Pod uses (a cleaner jsonpath alternative is sketched after the image list below):
$ kubectl get pods calico-node-g4hx4 -n kube-system -o yaml | grep image:
f:image: {}
f:image: {}
f:image: {}
f:image: {}
image: calico/node:v3.13.1
image: calico/cni:v3.13.1
image: calico/cni:v3.13.1
- image: calico/pod2daemon-flexvol:v3.13.1
- image: calico/node:v3.13.1
- image: calico/cni:v3.13.1
- image: calico/cni:v3.13.1
- image: calico/pod2daemon-flexvol:v3.13.1
From this output we can see that the calico Pod uses the following three images:
- calico/node:v3.13.1
- calico/cni:v3.13.1
- calico/pod2daemon-flexvol:v3.13.1
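As mentioned above, if you want the image list without the managedFields noise from the yaml output, a jsonpath query is an alternative (a minimal sketch, using the pod name from this cluster):
# Print only the init container and container images of the Pod
$ kubectl get pod calico-node-g4hx4 -n kube-system -o jsonpath='{.spec.initContainers[*].image}{"\n"}{.spec.containers[*].image}{"\n"}'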
2. Pull the images
Log in to the k8s-node-4 host and pull them manually:
$ docker pull calico/node:v3.13.1
$ docker pull calico/cni:v3.13.1
$ docker pull calico/pod2daemon-flexvol:v3.13.1
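Once the pulls complete, it is worth confirming that all three images are present locally before moving on:
# Verify the images are now in the local Docker image cache
$ docker images | grep calico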
3. Transfer the images offline
If docker pull cannot reach the registry from k8s-node-4, you can export the calico images from another node that already has them and copy them over:
# Save the image to a local tar archive
$ docker save image_id -o xxxx.tar
# Copy the archive to the worker node
$ scp xxxx.tar root@k8s-node-4:/root/
# Load the image on the worker node
$ docker load -i xxxx.tar
# Saving by image ID drops the repository and tag, so re-tag the image after loading
$ docker tag image_id repository:tag
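As a concrete sketch of the same workflow (run on any node that already has the images): saving by repository:tag instead of by image ID keeps the tags in the archive, so the docker tag step becomes unnecessary.
# On a healthy node: export all three calico images into one archive, tags included
$ docker save calico/node:v3.13.1 calico/cni:v3.13.1 calico/pod2daemon-flexvol:v3.13.1 -o calico-v3.13.1.tar
# Copy the archive to the new worker node
$ scp calico-v3.13.1.tar root@k8s-node-4:/root/
# On k8s-node-4: load the images; repository and tag are preserved
$ docker load -i /root/calico-v3.13.1.tar
$ docker images | grep calico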
4. Recreate the Pod
On the master, delete the stuck Pod; since calico-node is managed by a DaemonSet, the controller will immediately create a replacement on the node:
$ kubectl delete pod calico-node-g4hx4 -n kube-system
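To watch the replacement come up without listing everything, either of the following should work (the k8s-app=calico-node label is the one used by the standard calico manifest; adjust if your install differs):
# Watch only the calico-node Pods as the new one starts
$ kubectl get pods -n kube-system -l k8s-app=calico-node -o wide -w
# Or wait until the DaemonSet reports all Pods ready
$ kubectl rollout status daemonset/calico-node -n kube-system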
After a short wait, check the Pod status again:
$ kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-5b8b769fcd-srkrb 1/1 Running 0 3d19h 10.100.185.9 k8s-jmeter-2.novalocal <none> <none>
calico-node-5c7hn 0/1 Running 0 8s 172.16.106.219 k8s-node-4.novalocal <none> <none>
calico-node-5c8xj 1/1 Running 10 51d 172.16.106.227 k8s-node-1.novalocal <none> <none>
calico-node-9d7rt 1/1 Running 8 51d 172.16.106.203 k8s-node-3.novalocal <none> <none>
calico-node-crczj 1/1 Running 5 51d 172.16.106.226 k8s-node-2.novalocal <none> <none>
calico-node-gpmsv 1/1 Running 5 17d 172.16.106.209 k8s-jmeter-1.novalocal <none> <none>
calico-node-pz7w5 1/1 Running 4 51d 172.16.106.200 k8s-master.novalocal <none> <none>
calico-node-r59bw 1/1 Running 3 17d 172.16.106.216 k8s-jmeter-2.novalocal <none> <none>
calico-node-xhjj8 1/1 Running 4 17d 172.16.106.210 k8s-jmeter-3.novalocal <none> <none>
coredns-66db54ff7f-2cxcp 1/1 Running 0 5d22h 10.100.167.140 k8s-node-1.novalocal <none> <none>
coredns-66db54ff7f-gptgt 1/1 Running 0 5d22h 10.100.41.31 k8s-master.novalocal <none> <none>
eip-nfs-nfs-storage-6fddcc8f9d-hqv7m 1/1 Running 0 3d19h 10.100.185.4 k8s-jmeter-2.novalocal <none> <none>
etcd-k8s-master.novalocal 1/1 Running 0 5d21h 172.16.106.200 k8s-master.novalocal <none> <none>
kube-apiserver-k8s-master.novalocal 1/1 Running 14 51d 172.16.106.200 k8s-master.novalocal <none> <none>
kube-controller-manager-k8s-master.novalocal 1/1 Running 56 16d 172.16.106.200 k8s-master.novalocal <none> <none>
kube-proxy-5msrp 1/1 Running 1 9d 172.16.106.226 k8s-node-2.novalocal <none> <none>
kube-proxy-64pkw 1/1 Running 2 9d 172.16.106.210 k8s-jmeter-3.novalocal <none> <none>
kube-proxy-6j2fw 1/1 Running 1 9d 172.16.106.203 k8s-node-3.novalocal <none> <none>
kube-proxy-7cptn 1/1 Running 0 160m 172.16.106.219 k8s-node-4.novalocal <none> <none>
kube-proxy-fkt9p 1/1 Running 1 9d 172.16.106.227 k8s-node-1.novalocal <none> <none>
kube-proxy-fxvjb 1/1 Running 4 9d 172.16.106.209 k8s-jmeter-1.novalocal <none> <none>
kube-proxy-wnj2l 1/1 Running 2 9d 172.16.106.216 k8s-jmeter-2.novalocal <none> <none>
kube-proxy-wnzqg 1/1 Running 0 9d 172.16.106.200 k8s-master.novalocal <none> <none>
kube-scheduler-k8s-master.novalocal 1/1 Running 48 16d 172.16.106.200 k8s-master.novalocal <none> <none>
kuboard-5cc4bcccd7-t8h8f 1/1 Running 0 21h 10.100.185.24 k8s-jmeter-2.novalocal <none> <none>
metrics-server-677dcb8b4d-jtpgd 1/1 Running 0 3d20h 172.16.106.227 k8s-node-1.novalocal <none> <none>
We can see that the new calico-node Pod (calico-node-5c7hn) is Running and will turn Ready once its readiness probe passes, while all other Pods remain healthy. Now check the node status again:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-jmeter-1.novalocal Ready <none> 17d v1.18.5
k8s-jmeter-2.novalocal Ready <none> 17d v1.18.5
k8s-jmeter-3.novalocal Ready <none> 17d v1.18.5
k8s-master.novalocal Ready master 51d v1.18.5
k8s-node-1.novalocal Ready <none> 51d v1.18.5
k8s-node-2.novalocal Ready <none> 51d v1.18.5
k8s-node-3.novalocal Ready <none> 51d v1.18.5
k8s-node-4.novalocal Ready <none> 161m v1.18.5
With that, the newly joined node is back in the Ready state.
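As a last sanity check, you can list everything scheduled on the new node and confirm it is all Running; a field selector on spec.nodeName keeps the output to that node only:
# List all Pods running on the newly joined node, across all namespaces
$ kubectl get pods -A -o wide --field-selector spec.nodeName=k8s-node-4.novalocal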