Symptom:
After a delete operation in k8s, the pod stays in the Terminating state indefinitely.
Troubleshooting:
Run: kubectl get APIService
Output:
NAME                     SERVICE                      AVAILABLE                   AGE
v1beta1.events.k8s.io    Local                        True                        13d
v1beta1.extensions       Local                        True                        13d
v1beta1.metrics.k8s.io   kube-system/metrics-server   False (EndpointsNotFound)   71s
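On a cluster with many registered APIs, the one unhealthy row is easy to miss. As a minimal sketch, the helper below filters the table down to the entries whose AVAILABLE column is not True. The function name unavailable_apiservices is hypothetical (it is not part of kubectl), and the sketch assumes the default column order NAME, SERVICE, AVAILABLE, AGE.

```shell
# Hypothetical helper: read `kubectl get apiservice` table output on stdin
# and print the name of every APIService whose AVAILABLE column is not "True".
# Assumes the default columns: NAME SERVICE AVAILABLE AGE.
unavailable_apiservices() {
  awk 'NF >= 3 && $1 != "NAME" && $3 != "True" { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get apiservice | unavailable_apiservices
```

For the output shown above, this would leave only v1beta1.metrics.k8s.io.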
A newly added APIService is in an unhealthy state. Deleting and recreating it did not help, so inspect the error:
kubectl describe APIService v1beta1.metrics.k8s.io
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2021-10-11T03:01:28Z
  Resource Version:    4057041
  Self Link:           /apis/apiregistration.k8s.io/v1/apiservices/v1beta1.metrics.k8s.io
  UID:                 805aac80-69b7-4c41-bd00-b7e72f1f5fcb
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       kube-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2021-10-11T03:01:28Z
    Message:               cannot find endpoints for service/metrics-server in "kube-system"
    Reason:                EndpointsNotFound
    Status:                False
    Type:                  Available
Events:       <none>
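The only part of this output that matters here is the message of the Available condition. A minimal sketch for pulling out just that field with a jsonpath filter follows. The function name apiservice_message is hypothetical; the kubectl flags are standard, but treat this as an illustration rather than the exact command used in this investigation.

```shell
# Hypothetical helper: print the message of the "Available" condition
# for one APIService. Argument: <apiservice-name>
apiservice_message() {
  kubectl get apiservice "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].message}'
}

# Usage against a live cluster (not executed here):
#   apiservice_message v1beta1.metrics.k8s.io
```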
Cause:
cannot find endpoints for service/metrics-server in "kube-system"
Check the endpoints in the cluster:
Run:
kubectl get endpoints -n kube-system
NAME                      ENDPOINTS                                                     AGE
elasticsearch-logging     10.244.1.7:9300,10.244.1.8:9300,10.244.1.7:9200 + 1 more...   13d
kube-controller-manager   <none>                                                        13d
kube-dns                  10.244.0.6:53,10.244.3.5:53,10.244.0.6:9153 + 3 more...       13d
kube-scheduler            <none>                                                        13d
node-exporter             10.244.0.4:9100,10.244.1.6:9100,10.244.2.2:9100 + 5 more...   13d
There is no entry for service/metrics-server, which means the association between the Service and its pods is broken.
Check the Service:
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
There is no selector at all. That is the problem: the Service was never bound to any pods.
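When a Service has no selector, Kubernetes does not create endpoints for it automatically, which matches the EndpointsNotFound reason above. As a minimal sketch, the check below reports whether a Service defines a selector. The function name service_has_selector is hypothetical, and the sketch relies on kubectl printing nothing when the .spec.selector field is absent.

```shell
# Hypothetical helper: succeed (exit 0) if the Service defines a selector,
# fail (non-zero) if .spec.selector is missing or empty.
# Arguments: <namespace> <service-name>
service_has_selector() {
  selector=$(kubectl get service "$2" -n "$1" -o jsonpath='{.spec.selector}')
  [ -n "$selector" ]
}

# Usage against a live cluster (not executed here):
#   service_has_selector kube-system metrics-server || echo "no selector"
```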
The fix is to add a selector that matches the labels of the metrics-server pods to service.yaml and redeploy:
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: https
  selector:
    app: metric-server
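A selector only helps if it matches the labels actually set on the pods, key and value alike, so it is worth confirming the match before relying on it. The sketch below counts the pods that a given label selector would pick up. The function name count_matching_pods is hypothetical, and the label app=metric-server is simply the one used in the manifest above; the labels on your own metrics-server pods may differ (kubectl get pods -n kube-system --show-labels lists them).

```shell
# Hypothetical helper: count the pods in a namespace that match a label selector.
# Arguments: <namespace> <label-selector>, e.g. kube-system app=metric-server
count_matching_pods() {
  kubectl get pods -n "$1" -l "$2" --no-headers 2>/dev/null |
    awk 'NF { n++ } END { print n + 0 }'
}

# Usage against a live cluster (not executed here):
#   count_matching_pods kube-system app=metric-server
```

A count of 0 would mean the Service still selects nothing and the endpoints list stays empty.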
Then check the endpoints again:
kubectl get endpoints -n kube-system
NAME                      ENDPOINTS                                                     AGE
elasticsearch-logging     10.244.1.7:9300,10.244.1.8:9300,10.244.1.7:9200 + 1 more...   13d
kube-controller-manager   <none>                                                        13d
kube-dns                  10.244.0.6:53,10.244.3.5:53,10.244.0.6:9153 + 3 more...       13d
kube-scheduler            <none>                                                        13d
metrics-server            10.244.6.10:443                                               13s
node-exporter             10.244.0.4:9100,10.244.1.6:9100,10.244.2.2:9100 + 5 more...   13d
metrics-server now appears in the list, so the problem is solved.
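The endpoints list is an indirect signal; the APIService itself can also be checked. As a minimal sketch, the helper below blocks until the APIService reports the Available condition or a timeout expires. The function name wait_for_apiservice and the 60s timeout are assumptions chosen for illustration.

```shell
# Hypothetical helper: wait until the APIService reports Available=True,
# or fail once the (assumed) 60s timeout expires.
# Argument: <apiservice-name>
wait_for_apiservice() {
  kubectl wait --for=condition=Available "apiservice/$1" --timeout=60s
}

# Usage against a live cluster (not executed here):
#   wait_for_apiservice v1beta1.metrics.k8s.io
```

Once this returns successfully, kubectl get APIService should show True for v1beta1.metrics.k8s.io.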
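Root cause:
The unhealthy APIService kept the delete operation from completing, so the pod stayed in the Terminating state. The investigation traced this back to the metrics-server Service, which had no selector and therefore was not bound to any pods.
Glossary:
An endpoint is a resource object in a k8s cluster, stored in etcd, that records the access addresses of all the pods backing a Service.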
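To see this definition in practice, the addresses recorded for a Service can be read directly from its Endpoints object. The sketch below prints the pod IPs only; the function name endpoint_ips is hypothetical.

```shell
# Hypothetical helper: print the pod IPs recorded in a Service's Endpoints object.
# Arguments: <namespace> <service-name>
endpoint_ips() {
  kubectl get endpoints "$2" -n "$1" -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Usage against a live cluster (not executed here):
#   endpoint_ips kube-system metrics-server
```

An empty result means no ready pod currently backs the Service.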