K8S Hands-On Series - 1.2 - Node Management

Node Management

Node Status

Please refer to: https://kubernetes.io/docs/concepts/architecture/nodes/#node-status
Here we focus on the Conditions part, as described in the documentation.
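For a quick look from the command line, a node's conditions can be pulled out with a jsonpath query (a minimal sketch, using the worker02 node that appears later in this walkthrough; it prints one Type=Status line per condition):

  ~ kubectl get node worker02 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'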

Node Unreachable or Node Down

List the nodes:

  ~ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready    master   21h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   Ready    <none>   21h   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
# list the pods and the nodes they run on
  ~ kubectl get pods -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running   1          19h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Running   0          19h   10.244.1.17   worker02   <none>           <none>
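
For context: the two nginx pods above come from a two-replica Deployment. If you are following along from scratch, an equivalent setup can be created with something like the following (an assumption on my part; the earlier part of this series set it up in its own way):

  ~ kubectl create deployment nginx --image=nginx
  ~ kubectl scale --replicas=2 deployment/nginx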

Now let's power off worker02 and check again:

  ~ kubectl get pod -o wide
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Running   0          20h   10.244.1.17   worker02   <none>           <none>
  ~ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready    master   22h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   Ready    <none>   21h   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
  ~ ping 192.168.100.117
PING 192.168.100.117 (192.168.100.117) 56(84) bytes of data.
^C
--- 192.168.100.117 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1007ms

As you can see, the node still reports Ready even though it can no longer be pinged. That is because the node controller waits out a grace period (kube-controller-manager's --node-monitor-grace-period, 40s by default) before marking a node whose kubelet has stopped posting status. Checking again a little later, the node is NotReady:
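To catch the transition the moment it happens, kubectl's watch flag streams updates as the node object changes (an optional convenience, not required for this walkthrough):

  ~ kubectl get node worker02 -w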

  ~ kubectl get node -o wide
NAME       STATUS     ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready      master   22h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   NotReady   <none>   21h   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
  ~ kubectl get pod -o wide 
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Running   0          20h   10.244.1.17   worker02   <none>           <none>

Describing the node shows that it has been tainted with node.kubernetes.io/unreachable:NoExecute and node.kubernetes.io/unreachable:NoSchedule (applied automatically by the node lifecycle controller), and the conditions have changed accordingly:

  ~ kubectl describe node worker02
Name:               worker02
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
...
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"4e:0a:b4:88:0d:83"}
...
CreationTimestamp:  Sat, 08 Jun 2019 12:42:00 +0800
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
Unschedulable:      false
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Sun, 09 Jun 2019 10:03:52 +0800   Sun, 09 Jun 2019 10:04:52 +0800   NodeStatusUnknown   Kubelet stopped posting node status.
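
The taints can also be read straight from the node spec, which is handier for scripting than parsing describe output (a minimal sketch):

  ~ kubectl get node worker02 -o jsonpath='{.spec.taints}'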

Checking the pods again: the pod on worker02 is being terminated, and a replacement has already been scheduled onto another node (strictly speaking, the Deployment's ReplicaSet creates a new pod rather than moving the old one):

  ~ kubectl get pods -o wide
NAME                    READY   STATUS        RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6q2pt   1/1     Running       1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-msd6t   1/1     Terminating   0          20h   10.244.1.17   worker02   <none>           <none>
nginx-6657c9ffc-n9s4p   1/1     Running       0          71s   10.244.0.14   worker01   <none>           <none>

Inspect the Terminating pod:

  ~ kubectl describe pod/nginx-6657c9ffc-msd6t
Name:            nginx-6657c9ffc-msd6t
Node:            worker02/192.168.100.117
Status:          Terminating (lasts 2m5s)
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

Note the two tolerations: they are injected into every pod by default (by the DefaultTolerationSeconds admission plugin), each with tolerationSeconds of 300. That explains why the pod kept its Running status for about five minutes after the node went NotReady before being terminated and recreated.
For more details, please refer to https://kubernetes.io/docs/concepts/architecture/nodes/#node-controller
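
If five minutes is too long or too short for a workload, these tolerations can be set explicitly in the pod template, which overrides the injected defaults. A sketch with illustrative values, assuming the nginx Deployment used here:

# snippet for spec.template.spec in the Deployment:
# evict pods from failed/unreachable nodes after 60s instead of the default 300s
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60

One more note: on a node that will never come back, the pod can stay in Terminating indefinitely, because the API server waits for the (dead) kubelet to confirm deletion. If you are certain the node is gone for good, a force delete removes the pod object immediately:

  ~ kubectl delete pod nginx-6657c9ffc-msd6t --grace-period=0 --force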

Removing a Node

For a clearer demonstration, first scale the deployment to 10 replicas:

  ~ kubectl scale --replicas=10 deployment/nginx
deployment.extensions/nginx scaled

  ~ kubectl get pod -o wide                     
NAME                    READY   STATUS    RESTARTS   AGE   IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-6lfbp   1/1     Running   0          8s    10.244.1.20   worker02   <none>           <none>
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h   10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-8cbpl   1/1     Running   0          8s    10.244.1.23   worker02   <none>           <none>
nginx-6657c9ffc-dbr7c   1/1     Running   0          8s    10.244.1.22   worker02   <none>           <none>
nginx-6657c9ffc-kk84d   1/1     Running   0          8s    10.244.1.21   worker02   <none>           <none>
nginx-6657c9ffc-kp64c   1/1     Running   0          8s    10.244.0.16   worker01   <none>           <none>
nginx-6657c9ffc-n9s4p   1/1     Running   0          40m   10.244.0.14   worker01   <none>           <none>
nginx-6657c9ffc-pxbs2   1/1     Running   0          8s    10.244.1.19   worker02   <none>           <none>
nginx-6657c9ffc-sb85v   1/1     Running   0          8s    10.244.1.18   worker02   <none>           <none>
nginx-6657c9ffc-sxb25   1/1     Running   0          8s    10.244.0.15   worker01   <none>           <none>

Drain the node, then delete it. Here --ignore-daemonsets lets the drain proceed despite DaemonSet-managed pods (which would otherwise block it), --force allows evicting pods not managed by a controller, and --grace-period overrides each pod's own graceful-termination period:

  ~ kubectl drain worker02 --force --grace-period=900 --ignore-daemonsets
node/worker02 already cordoned
WARNING: ignoring DaemonSet-managed Pods: kube-system/kube-flannel-ds-amd64-crq7k, kube-system/kube-proxy-x9h7m
evicting pod "nginx-6657c9ffc-sb85v"
evicting pod "nginx-6657c9ffc-dbr7c"
evicting pod "nginx-6657c9ffc-6lfbp"
evicting pod "nginx-6657c9ffc-8cbpl"
evicting pod "nginx-6657c9ffc-kk84d"
evicting pod "nginx-6657c9ffc-pxbs2"
pod/nginx-6657c9ffc-kk84d evicted
pod/nginx-6657c9ffc-dbr7c evicted
pod/nginx-6657c9ffc-8cbpl evicted
pod/nginx-6657c9ffc-pxbs2 evicted
pod/nginx-6657c9ffc-sb85v evicted
pod/nginx-6657c9ffc-6lfbp evicted
node/worker02 evicted
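
Note that drain first cordons the node (marks it unschedulable) and then evicts the pods one by one. If you only want to keep new pods off a node without evicting the ones already running, cordon alone is enough, and uncordon reverses it:

  ~ kubectl cordon worker02
  ~ kubectl uncordon worker02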

# after the drain, all pods are running on worker01
  ~ kubectl get pod -o wide   
NAME                    READY   STATUS    RESTARTS   AGE     IP            NODE       NOMINATED NODE   READINESS GATES
nginx-6657c9ffc-45svt   1/1     Running   0          63s     10.244.0.18   worker01   <none>           <none>
nginx-6657c9ffc-489zz   1/1     Running   0          63s     10.244.0.21   worker01   <none>           <none>
nginx-6657c9ffc-49vqv   1/1     Running   0          63s     10.244.0.20   worker01   <none>           <none>
nginx-6657c9ffc-6q2pt   1/1     Running   1          20h     10.244.0.11   worker01   <none>           <none>
nginx-6657c9ffc-fq7gg   1/1     Running   0          63s     10.244.0.22   worker01   <none>           <none>
nginx-6657c9ffc-kl6j9   1/1     Running   0          63s     10.244.0.19   worker01   <none>           <none>
nginx-6657c9ffc-kp64c   1/1     Running   0          4m31s   10.244.0.16   worker01   <none>           <none>
nginx-6657c9ffc-n9s4p   1/1     Running   0          44m     10.244.0.14   worker01   <none>           <none>
nginx-6657c9ffc-ssmph   1/1     Running   0          63s     10.244.0.17   worker01   <none>           <none>
nginx-6657c9ffc-sxb25   1/1     Running   0          4m31s   10.244.0.15   worker01   <none>           <none>
# delete the node
  ~ kubectl delete node worker02
node "worker02" deleted
  ~ kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
worker01   Ready    master   22h   v1.14.3

For more details, please refer to https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
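
Evictions performed by drain also respect PodDisruptionBudgets. As a sketch (assuming the pods carry the label app=nginx, which is what a deployment created via kubectl create deployment nginx would give them; verify with kubectl get pods --show-labels), the following PDB would force drain to keep at least 5 of the 10 replicas available throughout:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 5
  selector:
    matchLabels:
      app: nginx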

Joining a Node

After initializing the master with kubeadm, the output includes instructions for joining new nodes to the cluster; be sure to save them.
Next, we reset the just-removed worker02 and rejoin it to the cluster. On worker02:

  ~ kubeadm reset
[reset] WARNING: Changes made to this host by 'kubeadm init' or 'kubeadm join' will be reverted.
[reset] Are you sure you want to proceed? [y/N]: y
[preflight] Running pre-flight checks
W0609 11:00:23.206928    7376 reset.go:234] [reset] No kubeadm config, using etcd pod spec to get data directory
[reset] No etcd config found. Assuming external etcd
[reset] Please manually reset etcd to prevent further issues
[reset] Stopping the kubelet service
[reset] unmounting mounted directories in "/var/lib/kubelet"
[reset] Deleting contents of stateful directories: [/var/lib/kubelet /etc/cni/net.d /var/lib/dockershim /var/run/kubernetes]
[reset] Deleting contents of config directories: [/etc/kubernetes/manifests /etc/kubernetes/pki]
[reset] Deleting files: [/etc/kubernetes/admin.conf /etc/kubernetes/kubelet.conf /etc/kubernetes/bootstrap-kubelet.conf /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf]

The reset process does not reset or clean up iptables rules or IPVS tables.
If you wish to reset iptables, you must do so manually.
For example:
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X

If your cluster was setup to utilize IPVS, run ipvsadm --clear (or similar)
to reset your system's IPVS tables.

  ~ iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
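
If the join command printed by kubeadm init has been lost, or its token has expired (bootstrap tokens are valid for 24 hours by default), a fresh one can be generated on the master:

  ~ kubeadm token create --print-join-command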

Run kubeadm join:

  ~ kubeadm join 192.168.101.113:6443 --token ss6flg.csw4u0ok134n2fy1 \      
    --discovery-token-ca-cert-hash sha256:bac9a150228342b7cdedf39124ef2108653db1f083e9f547d251e08f03c41945
[preflight] Running pre-flight checks
    [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[kubelet-start] Downloading configuration for the kubelet from the "kubelet-config-1.14" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Activating the kubelet service
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...

This node has joined the cluster:
* Certificate signing request was sent to apiserver and a response was received.
* The Kubelet was informed of the new secure connection details.

Run 'kubectl get nodes' on the control-plane to see this node join the cluster.

Check the nodes: worker02 has been added back successfully.

  ~ kubectl get node -o wide
NAME       STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
worker01   Ready    master   23h   v1.14.3   192.168.101.113   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
worker02   Ready    <none>   33s   v1.14.3   192.168.100.117   <none>        Ubuntu 16.04.6 LTS   4.4.0-150-generic   docker://18.9.5
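
One small follow-up: the rejoined node shows ROLES as <none>. That column is derived from node-role.kubernetes.io/* labels, so if you would like it to read worker, a label is all it takes (purely cosmetic):

  ~ kubectl label node worker02 node-role.kubernetes.io/worker=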