Prepare the k8s cluster
- Preface
With the k8s cluster ready, we deploy Prometheus to collect container and resource metrics from the cluster, then build alerting policies on top of the collected metrics to improve monitoring response times.
$ kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
master01   Ready    master   13d   v1.16.0
master02   Ready    master   13d   v1.16.0
master03   Ready    master   13d   v1.16.0
node01     Ready    <none>   13d   v1.16.0
node02     Ready    <none>   13d   v1.16.0
node03     Ready    <none>   13d   v1.16.0
Deploy Prometheus
The Prometheus configuration file covers the following scrape jobs:
- kubernetes-nodes: metrics from every node in the cluster
- prometheus: Prometheus monitoring itself
- kubernetes-services: Services probed through blackbox-exporter
- kubernetes-nodes-cadvisor: Pod metrics collected from each node via cAdvisor
- kubernetes-ingresses: Ingresses probed through blackbox-exporter
- kubernetes-kubelet: the kubelet on every node
- traefik: Traefik liveness
- kubernetes-apiservers: API server liveness
- 关键性服务监控: health of critical services (static HTTP probes)
- blackbox_http_pod_probe: individual Pods probed through blackbox-exporter (see the annotation sketch below this list)
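For blackbox_http_pod_probe, the relabel rules in prometheus-conf.yaml (step 2) keep only Pods annotated with blackbox_scheme: http and build the probe target from the Pod address plus the blackbox_port and blackbox_path annotations. A minimal sketch of a Pod that would be discovered; the name, image, and port are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: demo-app                # hypothetical Pod, for illustration only
  annotations:
    blackbox_scheme: "http"     # matched by the keep rule (regex: http)
    blackbox_port: "8080"       # becomes $2 in the __param_target replacement
    blackbox_path: "/healthz"   # becomes $3 in the __param_target replacement
spec:
  containers:
  - name: demo-app
    image: nginx:1.25
    ports:
    - containerPort: 8080

blackbox-exporter would then probe <pod-ip>:8080/healthz with the http_2xx module on every scrape.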
The manifests prepared in the following steps:
├── prometheus-conf.yaml
├── prometheus-deployment.yaml
├── prometheus-ingress.yaml
├── prometheus-pv-pvc.yaml
├── prometheus-rules.yaml
└── prometheus-svc.yaml
1. Prepare the PV and PVC manifest
# cat prometheus-pv-pvc.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  # PersistentVolumes are cluster-scoped, so no namespace is set here.
  name: prometheus-server
  labels:
    name: prometheus-server
spec:
  nfs:
    path: /export/nfs_share/volume-prometheus/prometheus-server
    server: 10.65.0.94
  accessModes: ["ReadWriteMany", "ReadOnlyMany"]
  capacity:
    storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-server
  namespace: prometheus
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi
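Assuming the NFS export path already exists on 10.65.0.94 and the prometheus namespace has been created, verify that the claim binds before moving on:

$ kubectl get pv prometheus-server
$ kubectl -n prometheus get pvc prometheus-server

Both should show STATUS Bound; a Pending claim usually means the PV was not created or its access modes or capacity do not match the claim.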
2. Prepare the prometheus-conf manifest
# cat prometheus-conf.yaml
apiVersion: v1
data:
  prometheus.yml: |-
    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).
    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - 'alertmanager-service.prometheus:9093'
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
    - "/etc/prometheus/rules/nodedown.rule.yml"
    scrape_configs:
    - job_name: 'prometheus'
      static_configs:
      - targets: ['localhost:9090']
    - job_name: '关键性服务监控'
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
      - targets:
        - https://news-gg-xy.com/healthz
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
    - job_name: blackbox_http_pod_probe
      honor_timestamps: true
      params:
        module:
        - http_2xx
      scrape_interval: 15s
      scrape_timeout: 10s
      metrics_path: /probe
      scheme: http
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
        separator: ;
        regex: http
        replacement: $1
        action: keep
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
        separator: ;
        regex: ([^:]+)(?::\d+)?;(\d+);(.+)
        target_label: __param_target
        replacement: $1:$2$3
        action: replace
      - separator: ;
        regex: (.*)
        target_label: __address__
        replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
        action: replace
      - source_labels: [__param_target]
        separator: ;
        regex: (.*)
        target_label: instance
        replacement: $1
        action: replace
      - separator: ;
        regex: __meta_kubernetes_pod_label_(.+)
        replacement: $1
        action: labelmap
      - source_labels: [__meta_kubernetes_namespace]
        separator: ;
        regex: (.*)
        target_label: kubernetes_namespace
        replacement: $1
        action: replace
      - source_labels: [__meta_kubernetes_pod_name]
        separator: ;
        regex: (.*)
        target_label: kubernetes_pod_name
        replacement: $1
        action: replace
      kubernetes_sd_configs:
      - role: pod
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-nodes
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: kubernetes.default.svc:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/${1}/proxy/metrics
        source_labels:
        - __meta_kubernetes_node_name
        target_label: __metrics_path__
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
    - job_name: 'kubernetes-services'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: service
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - replacement: kubernetes.default.svc:443
        target_label: __address__
      - regex: (.+)
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
        source_labels:
        - __meta_kubernetes_node_name
        target_label: __metrics_path__
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: 'kubernetes-ingresses'
      metrics_path: /probe
      params:
        module: [http_2xx]
      kubernetes_sd_configs:
      - role: ingress
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name
    - job_name: 'kubernetes-kubelet'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
    - job_name: "traefik"
      static_configs:
      - targets: ['traefik-ingress-service.kube-system.svc.cluster.local:8080']
    - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
kind: ConfigMap
metadata:
  labels:
    app: prometheus
  name: prometheus-conf
  namespace: prometheus
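Once the Deployment from step 4 is running, the embedded configuration can be linted with promtool, which ships in the prom/prometheus image. A quick sketch against the live container:

$ kubectl -n prometheus exec deploy/prometheus -- promtool check config /etc/prometheus/prometheus.yml

This validates the scrape configuration and also resolves the file referenced under rule_files, so it doubles as a check that the rules ConfigMap is mounted correctly.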
3. Prepare the prometheus-rules manifest
# cat prometheus-rules.yaml
apiVersion: v1
data:
  nodedown.rule.yml: |
    groups:
    - name: YingPuDev-Alerting
      rules:
      - alert: 实例崩溃
        expr: up{instance!~""} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
      - alert: HTTP探测发现不健康服务端点
        expr: probe_http_status_code >= 400 or probe_http_status_code{instance!~"videoai-php-dev.videoai.svc:22"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "API {{ $labels.kubernetes_name }} is unavailable"
          description: "Service {{ $labels.kubernetes_name }} of job {{ $labels.job }} has been unavailable for more than 1 minute. Current status code: {{ $value }}."
kind: ConfigMap
metadata:
  labels:
    app: prometheus
  name: prometheus-rules
  namespace: prometheus
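The rule group can be linted the same way; a sketch against the running container:

$ kubectl -n prometheus exec deploy/prometheus -- promtool check rules /etc/prometheus/rules/nodedown.rule.yml

promtool should report one group (YingPuDev-Alerting) containing two rules.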
4. Prepare the prometheus-deployment manifest
# cat prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        # Deprecated since Prometheus 2.7 in favor of --storage.tsdb.retention.time.
        - --storage.tsdb.retention=30d
        env:
        - name: STAKATER_PROMETHEUS_CONF_CONFIGMAP
          value: e4dd2838dd54e8392b62d85898083cc3d20210cc
        - name: STAKATER_PROMETHEUS_RULES_CONFIGMAP
          value: ca65a78fcb15d2c767166e468e8e734c6d4e267f
        # Consider pinning a version tag; :latest may eventually drop the retention flag above.
        image: prom/prometheus:latest
        imagePullPolicy: Always
        name: prometheus
        ports:
        - containerPort: 9090
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-data-volume
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-conf-volume
          subPath: prometheus.yml
        - mountPath: /etc/prometheus/rules
          name: prometheus-rules-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      serviceAccount: prometheus
      serviceAccountName: prometheus
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - name: prometheus-data-volume
        persistentVolumeClaim:
          claimName: prometheus-server
      - configMap:
          defaultMode: 420
          name: prometheus-conf
        name: prometheus-conf-volume
      - configMap:
          defaultMode: 420
          name: prometheus-rules
        name: prometheus-rules-volume
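Note that none of the manifests in this section create the prometheus ServiceAccount the Deployment runs as, and the kubernetes_sd_configs from step 2 need cluster-wide read access to nodes, services, endpoints, pods, and ingresses. A minimal RBAC sketch that would satisfy them (names chosen to match serviceAccountName above; adjust to your own policy):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: prometheus

With RBAC in place, kubectl -n prometheus rollout status deployment/prometheus should report the rollout as successful.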
5. Prepare the prometheus-svc manifest
# cat prometheus-svc.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus-service
  namespace: prometheus
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  type: ClusterIP
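Before the Ingress exists you can smoke-test the Service directly with a port-forward; a quick check:

$ kubectl -n prometheus port-forward svc/prometheus-service 9090:9090

Then open http://localhost:9090/targets and confirm the jobs from step 2 appear.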
6. Prepare the prometheus-ingress manifest
# cat prometheus-ingress.yaml
# extensions/v1beta1 is still served on k8s 1.16; newer clusters use networking.k8s.io/v1.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: prometheus
spec:
  rules:
  - host: prometheus.movie.cn
    http:
      paths:
      - backend:
          serviceName: prometheus-service
          servicePort: 9090
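prometheus.movie.cn must resolve to your Ingress controller (Traefik here, judging by the traefik job in step 2) for the rule to match. For a quick test without DNS you can set the Host header yourself; <INGRESS_IP> below stands in for your controller's address:

$ curl -H 'Host: prometheus.movie.cn' http://<INGRESS_IP>/graph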
Apply the manifests in order (the prometheus namespace must exist first):
kubectl create namespace prometheus
kubectl apply -f prometheus-conf.yaml
kubectl apply -f prometheus-pv-pvc.yaml
kubectl apply -f prometheus-rules.yaml
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f prometheus-svc.yaml
kubectl apply -f prometheus-ingress.yaml
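Finally, a couple of sanity checks once everything is applied:

$ kubectl -n prometheus get pods,svc,ingress
$ kubectl -n prometheus logs deploy/prometheus --tail=20

On the /targets page, every job from step 2 should eventually show as UP; targets that point at components not covered by this section (alertmanager-service, blackbox-exporter, traefik-ingress-service) will stay down until those are deployed.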