监控指标 | 具体实现 | 举例 |
---|---|---|
Pod性能 | cAdvisor | 容器CPU,内存利用率 |
Node性能 | node-exporter | 节点CPU,内存利用率 |
K8S资源对象 | kube-state-metrics | Pod/Deployment/Service |
数据收集
我们这里使用kube-state-metrics
对k8s
资源数据进行收集。
架构图
监控指标
指标类别包括:
- CronJob Metrics
- DaemonSet Metrics
- Deployment Metrics
- Job Metrics
- LimitRange Metrics
- Node Metrics
- PersistentVolume Metrics
- PersistentVolumeClaim Metrics
- Pod Metrics
- Pod Disruption Budget Metrics
- ReplicaSet Metrics
- ReplicationController Metrics
- ResourceQuota Metrics
- Service Metrics
- StatefulSet Metrics
- Namespace Metrics
- Horizontal Pod Autoscaler Metrics
- Endpoint Metrics
- Secret Metrics
- ConfigMap Metrics
以pod为例:
- kube_pod_info
- kube_pod_owner
- kube_pod_status_phase
- kube_pod_status_ready
- kube_pod_status_scheduled
- kube_pod_container_status_waiting
- kube_pod_container_status_terminated_reason
- ...
部署 kube-state-metrics
默认会在kube-system命名空间下创建对应的资源,最好不要更换yaml
文件中的命名空间。
# 获取yml文件
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git
# 部署kube-state-metrics
kubectl apply -f kube-state-metrics-configs
# 查看pod状态
kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
kube-state-metrics-c698dc7b5-zstz9 1/1 Running 0 32m
获取kube-state-metrics-c698dc7b5-zstz9
容器日志,如图:
数据对接prometheus
当prometheus
准备好了之后,添加对应的采集job
即可:
# 添加在sidecar/cm-kube-mon-sidecar.yaml最后
- job_name: ‘kube-state-metrics‘
static_configs:
- targets: [‘kube-state-metrics.kube-system.svc.cluster.local:8080‘]
# 重新apply configmap文件,无需重启prometheus(配置的有热加载)
kubectl apply -f sidecar/cm-kube-mon-sidecar.yaml
打开promethues的web看一下target里的配置是否生效,如下图:
数据对接Grafana
当Grafana
准备好之后,我们在Grafana中导入选定/自定义的dashboard
,添加Prometheus
数据源,即可:
prometheus报警规则
修改sidecar/rules-cm-kube-mon-sidecar.yaml
配置文件,添加如下报警指标。
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: kube-mon
data:
alert-rules.yaml: |-
groups:
- name: White box monitoring
rules:
- alert: Pod-重启
expr: changes(kube_pod_container_status_restarts_total{pod !~ "analyzer.*"}[10m]) > 0
for: 1m
labels:
severity: 警告
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Pod: {{ $labels.pod }} Restart"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
pod: "{{ $labels.pod }}"
container: "{{ $labels.container }}"
- alert: Pod-未知错误/失败
expr: kube_pod_status_phase{phase="Unknown"} == 1 or kube_pod_status_phase{phase="Failed"} == 1
for: 1m
labels:
severity: 紧急
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Pod: {{ $labels.pod }} 未知错误/失败"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
pod: "{{ $labels.pod }}"
container: "{{ $labels.container }}"
- alert: Daemonset Unavailable
expr: kube_daemonset_status_number_unavailable > 0
for: 5m
labels:
severity: 紧急
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Daemonset: {{ $labels.daemonset }} 守护进程不可用"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
daemonset: "{{ $labels.daemonset }}"
- alert: Job-失败
expr: kube_job_status_failed == 1
for: 5m
labels:
severity: 警告
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Job: {{ $labels.job_name }} Failed"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
job: "{{ $labels.job_name }}"
- alert: Pod NotReady
expr: sum by (namespace, pod, cluster_id) (max by(namespace, pod, cluster_id)(kube_pod_status_phase{job=~".*kubernetes-service-endpoints",phase=~"Pending|Unknown"}) * on(namespace, pod, cluster_id)group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod,owner_kind, cluster_id) (kube_pod_owner{owner_kind!="Job"}))) > 0
for: 5m
labels:
severity: 警告
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "pod: {{ $labels.pod }} 处于 NotReady 状态超过15分钟"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
- alert: Deployment副本数
expr: (kube_deployment_spec_replicas{job=~".*kubernetes-service-endpoints"} !=kube_deployment_status_replicas_available{job=~".*kubernetes-service-endpoints"}) and (changes(kube_deployment_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
for: 5m
labels:
severity: 警告
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Deployment: {{ $labels.deployment }} 实际副本数和设置副本数不一致"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
deployment: "{{ $labels.deployment }}"
- alert: Statefulset副本数
expr: (kube_statefulset_status_replicas_ready{job=~".*kubernetes-service-endpoints"} !=kube_statefulset_status_replicas{job=~".*kubernetes-service-endpoints"}) and (changes(kube_statefulset_status_replicas_updated{job=~".*kubernetes-service-endpoints"}[5m]) == 0)
for: 5m
labels:
severity: 警告
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Statefulset: {{ $labels.statefulset }} 实际副本数和设置副本数不一致"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
statefulset: "{{ $labels.statefulset }}"
- alert: 存储卷PV
expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job=~".*kubernetes-service-endpoints"} > 0
for: 5m
labels:
severity: 紧急
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "存储卷PV: {{ $labels.persistentvolume }} 处于Failed或Pending状态"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
persistentvolume: "{{ $labels.persistentvolume }}"
- alert: 存储卷PVC
expr: kube_persistentvolumeclaim_status_phase{phase=~"Failed|Pending|Lost",job=~".*kubernetes-service-endpoints"} > 0
for: 5m
labels:
severity: 紧急
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "存储卷PVC: {{ $labels.persistentvolumeclaim }} Failed或Pending状态"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
persistentvolumeclaim: "{{ $labels.persistentvolumeclaim }}"
- alert: k8s service
expr: kube_service_status_load_balancer_ingress != 1
for: 5m
labels:
severity: 紧急
service: prometheus_bot
receiver_group: "{{ $labels.k8scluster}}_{{ $labels.namespace }}"
annotations:
summary: "Service: {{ $labels.service }} 服务负载均衡器入口状态DOWN!"
k8scluster: "{{ $labels.k8scluster}}"
namespace: "{{ $labels.namespace }}"
persistentvolumeclaim: "{{ $labels.service }}"
更新sidecar/rules-cm-kube-mon-sidecar.yaml
配置文件,如下:
# 稍等待一分钟左右,prometheus已定义热更新,无需apply
kubectl apply -f rules-cm-kube-mon-sidecar.yaml
对接AlertManager
部署微信告警
# 获取yml文件
git clone https://gitee.com/tengfeiwu/kube-state-metrics_prometheus_wechat.git && cd thanos/AlertManager
# 部署AlertManager
## 更改为自己wechat信息
kubectl apply -f cm-kube-mon-alertmanager.yaml
kubectl apply -f wechat-template-kube-mon.yaml
kubectl apply -f deploy-kube-mon-alertmanager.yaml
kubectl apply -f svc-kube-mon-alertmanager.yaml