Deploying Prometheus + Alertmanager + Grafana on a Kubernetes Cluster

Prepare the Kubernetes Cluster

  • Preface
    With a Kubernetes cluster in place, we deploy Prometheus to collect metrics from the cluster's containers and define alerting rules on those metrics, improving our ability to detect and respond to incidents.
$ kubectl get node
NAME       STATUS     ROLES    AGE   VERSION
master01   Ready      master   13d   v1.16.0
master02   Ready      master   13d   v1.16.0
master03   Ready      master   13d   v1.16.0
node01     Ready      <none>   13d   v1.16.0
node02     Ready      <none>   13d   v1.16.0
node03     Ready      <none>   13d   v1.16.0

Deploy Prometheus

  • The Prometheus configuration defines the following scrape jobs (a sketch of the namespace and RBAC they rely on follows the file tree below):
  • kubernetes-nodes: node-level metrics from every node in the cluster
  • prometheus: Prometheus monitoring its own service
  • kubernetes-services: Services probed via blackbox-exporter
  • kubernetes-nodes-cadvisor: Pod-level metrics from every node via cAdvisor
  • kubernetes-ingresses: Ingresses probed via blackbox-exporter
  • kubernetes-kubelet: the kubelet on every node
  • traefik: Traefik liveness
  • kubernetes-apiservers: API server liveness
  • critical-services: health of key business endpoints
  • blackbox_http_pod_probe: individual Pods probed via blackbox-exporter
├── prometheus-conf.yaml
├── prometheus-deployment.yaml
├── prometheus-ingress.yaml
├── prometheus-pv-pvc.yaml
├── prometheus-rules.yaml
└── prometheus-svc.yaml
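
The manifests below all target a namespace named prometheus, and the Deployment in step 4 runs under a ServiceAccount named prometheus, whose token authenticates the kubelet and apiserver scrape jobs. Neither object is part of the file list above, so here is a minimal sketch of what they assume (the ClusterRole rules are a common Prometheus baseline, not taken from the original files):

# cat prometheus-rbac.yaml (hypothetical companion file)
apiVersion: v1
kind: Namespace
metadata:
  name: prometheus
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
# read access to the objects the kubernetes_sd_configs below discover
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
# the apiserver's own /metrics endpoint
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: prometheus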

1. Prepare the PV/PVC manifest
# cat prometheus-pv-pvc.yaml 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-server
  # PersistentVolumes are cluster-scoped; a namespace field here would be ignored
  labels:
    name: prometheus-server
spec:
  nfs:
    path: /export/nfs_share/volume-prometheus/prometheus-server
    server: 10.65.0.94
  accessModes: ["ReadWriteMany","ReadOnlyMany"]
  capacity:
    storage: 50Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-server
  namespace: prometheus
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 50Gi
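
The PV assumes the NFS export /export/nfs_share/volume-prometheus/prometheus-server already exists on 10.65.0.94 and that every node has an NFS client installed. After applying (see the apply commands at the end of this article), confirm the claim binds:

$ kubectl get pv prometheus-server
$ kubectl get pvc prometheus-server -n prometheus   # STATUS should be Bound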

2. Prepare the prometheus-conf ConfigMap

# cat prometheus-conf.yaml  
apiVersion: v1
data:
  prometheus.yml: |-
    # my global config
    global:
      scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
            - 'alertmanager-service.prometheus:9093'
    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      - "/etc/prometheus/rules/nodedown.rule.yml"
    scrape_configs:
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']


      - job_name: 'critical-services'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
            - https://news-gg-xy.com/healthz
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: blackbox-exporter.prometheus.svc.cluster.local:9115

      - job_name: blackbox_http_pod_probe
        honor_timestamps: true
        params:
          module:
          - http_2xx
        scrape_interval: 15s
        scrape_timeout: 10s
        metrics_path: /probe
        scheme: http
        relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_blackbox_scheme]
          separator: ;
          regex: http
          replacement: $1
          action: keep
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_blackbox_port, __meta_kubernetes_pod_annotation_blackbox_path]
          separator: ;
          regex: ([^:]+)(?::\d+)?;(\d+);(.+)
          target_label: __param_target
          replacement: $1:$2$3
          action: replace
        - separator: ;
          regex: (.*)
          target_label: __address__
          replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
          action: replace
        - source_labels: [__param_target]
          separator: ;
          regex: (.*)
          target_label: instance
          replacement: $1
          action: replace
        - separator: ;
          regex: __meta_kubernetes_pod_label_(.+)
          replacement: $1
          action: labelmap
        - source_labels: [__meta_kubernetes_namespace]
          separator: ;
          regex: (.*)
          target_label: kubernetes_namespace
          replacement: $1
          action: replace
        - source_labels: [__meta_kubernetes_pod_name]
          separator: ;
          regex: (.*)
          target_label: kubernetes_pod_name
          replacement: $1
          action: replace
        kubernetes_sd_configs:
        - role: pod


      - job_name: kubernetes-nodes
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - regex: (.+)
          replacement: /api/v1/nodes/${1}/proxy/metrics
          source_labels:
          - __meta_kubernetes_node_name
          target_label: __metrics_path__
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true

      - job_name: 'kubernetes-services'
        metrics_path: /probe
        params:
          module: [http_2xx]
        kubernetes_sd_configs:
        - role: service
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__address__]
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          target_label: kubernetes_name

      - job_name: kubernetes-nodes-cadvisor
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - replacement: kubernetes.default.svc:443
          target_label: __address__
        - regex: (.+)
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
          source_labels:
          - __meta_kubernetes_node_name
          target_label: __metrics_path__
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true

      - job_name: 'kubernetes-ingresses'
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        metrics_path: /probe
        params:
          module: [http_2xx]
        kubernetes_sd_configs:
        - role: ingress
        relabel_configs:
        - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
          regex: (.+);(.+);(.+)
          replacement: ${1}://${2}${3}
          target_label: __param_target
        - target_label: __address__
          replacement: blackbox-exporter.prometheus.svc.cluster.local:9115
        - source_labels: [__param_target]
          target_label: instance
        - action: labelmap
          regex: __meta_kubernetes_ingress_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_ingress_name]
          target_label: kubernetes_name

      - job_name: 'kubernetes-kubelet'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      - job_name: "traefik"
        static_configs:
        - targets: ['traefik-ingress-service.kube-system.svc.cluster.local:8080']

      - job_name: kubernetes-apiservers
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: endpoints
        relabel_configs:
        - action: keep
          regex: default;kubernetes;https
          source_labels:
          - __meta_kubernetes_namespace
          - __meta_kubernetes_service_name
          - __meta_kubernetes_endpoint_port_name
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          insecure_skip_verify: true
kind: ConfigMap
metadata:
  labels:
    app: prometheus
  name: prometheus-conf
  namespace: prometheus
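
The blackbox jobs above are annotation-driven: blackbox_http_pod_probe only keeps Pods carrying blackbox_scheme, blackbox_port and blackbox_path annotations, while kubernetes-services and kubernetes-ingresses only keep objects annotated prometheus.io/probe: "true". A hypothetical Service illustrating the convention (name, namespace and port are examples, not from the original config):

apiVersion: v1
kind: Service
metadata:
  name: my-app                    # example service to be probed
  namespace: default
  annotations:
    prometheus.io/probe: "true"   # opts in to the kubernetes-services job
spec:
  selector:
    app: my-app
  ports:
  - port: 80

For the Pod probe job, the equivalent Pod-template annotations would be blackbox_scheme: "http", blackbox_port: "8080" and blackbox_path: "/healthz" (example values; the relabel rules above assemble them into the probe target).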

3. Prepare the prometheus-rules ConfigMap
# cat prometheus-rules.yaml 
apiVersion: v1
data:
  nodedown.rule.yml: |
    groups:
    - name: YingPuDev-Alerting
      rules:
      - alert: InstanceDown
        expr: up{instance!~""} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

      - alert: UnhealthyHTTPEndpoint
        expr: probe_http_status_code >= 400 or probe_http_status_code{instance!~"videoai-php-dev.videoai.svc:22"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "API {{ $labels.kubernetes_name }} is unavailable"
          description: "Service {{ $labels.kubernetes_name }} of job {{ $labels.job }} has been unavailable for more than 1 minute. Current status code: {{ $value }}"
kind: ConfigMap
metadata:
  labels:
    app: prometheus
  name: prometheus-rules
  namespace: prometheus
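
Both payloads can be validated with promtool (shipped in the prom/prometheus image) before applying. A sketch, assuming the two data blocks are saved out as plain files and the rule_files path in prometheus.yml resolves locally:

$ promtool check rules nodedown.rule.yml
$ promtool check config prometheus.yml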

4. Prepare the prometheus-deployment manifest
# cat prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --storage.tsdb.path=/prometheus
        - --storage.tsdb.retention.time=30d  # the bare --storage.tsdb.retention flag is deprecated
        env:
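        # injected by Stakater Reloader when the ConfigMaps change, to trigger a
        # rolling restart; safe to omit in a fresh deployment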
        - name: STAKATER_PROMETHEUS_CONF_CONFIGMAP
          value: e4dd2838dd54e8392b62d85898083cc3d20210cc
        - name: STAKATER_PROMETHEUS_RULES_CONFIGMAP
          value: ca65a78fcb15d2c767166e468e8e734c6d4e267f
        image: prom/prometheus:latest  # consider pinning a specific release for reproducible deployments
        imagePullPolicy: Always
        name: prometheus
        ports:
        - containerPort: 9090
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-data-volume
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-conf-volume
          subPath: prometheus.yml
        - mountPath: /etc/prometheus/rules
          name: prometheus-rules-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      serviceAccount: prometheus
      serviceAccountName: prometheus
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      volumes:
      - name: prometheus-data-volume
        persistentVolumeClaim:
          claimName: prometheus-server
      - configMap:
          defaultMode: 420
          name: prometheus-conf
        name: prometheus-conf-volume
      - configMap:
          defaultMode: 420
          name: prometheus-rules
        name: prometheus-rules-volume
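
After the apply commands at the end of this article, the rollout can be verified like this (the "Server is ready to receive web requests" line is Prometheus's normal startup message):

$ kubectl -n prometheus rollout status deployment/prometheus
$ kubectl -n prometheus logs deployment/prometheus --tail=20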

5. Prepare the prometheus-svc manifest
# cat prometheus-svc.yaml  
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus
  name: prometheus-service
  namespace: prometheus
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    app: prometheus
  sessionAffinity: None
  type: ClusterIP
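
The Service is ClusterIP-only, so a quick smoke test goes through a port-forward (a sketch; /-/healthy is Prometheus's built-in health endpoint):

$ kubectl -n prometheus port-forward svc/prometheus-service 9090:9090 &
$ curl -s http://localhost:9090/-/healthy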

6. Prepare the prometheus-ingress manifest
# cat prometheus-ingress.yaml  
apiVersion: extensions/v1beta1  # deprecated; networking.k8s.io/v1beta1 is available from k8s 1.14
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: prometheus
spec:
  rules:
  - host: prometheus.movie.cn
    http:
      paths:
      - backend:
          serviceName: prometheus-service
          servicePort: 9090
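
For the Ingress to work, prometheus.movie.cn must resolve to the Traefik ingress controller. Until DNS is set up, a Host-header probe works for testing (<ingress-ip> is a placeholder for your controller's address):

$ curl -s -o /dev/null -w '%{http_code}\n' -H 'Host: prometheus.movie.cn' http://<ingress-ip>/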


Apply the manifests in order:
kubectl apply -f prometheus-conf.yaml
kubectl apply -f prometheus-pv-pvc.yaml
kubectl apply -f prometheus-rules.yaml
kubectl apply -f prometheus-deployment.yaml
kubectl apply -f prometheus-svc.yaml
kubectl apply -f prometheus-ingress.yaml
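
Finally, confirm the Pod, Service, and Ingress are up, then open http://prometheus.movie.cn and check Status → Targets in the web UI: every job listed at the top of this article should report UP.
kubectl get pods,svc,ingress -n prometheus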