Deploying prometheus + grafana on k8s via Deployment:
- Host overview:
OS | IP | Role | CPU | Memory | Hostname |
---|---|---|---|---|---|
CentOS 7.8 | 192.168.30.128 | master | >=2 | >=2G | master1 |
CentOS 7.8 | 192.168.30.129 | master | >=2 | >=2G | master2 |
CentOS 7.8 | 192.168.30.130 | node | >=2 | >=2G | node1 |
CentOS 7.8 | 192.168.30.131 | node | >=2 | >=2G | node2 |
CentOS 7.8 | 192.168.30.132 | node | >=2 | >=2G | node3 |
```
kubectl get node

NAME      STATUS   ROLES    AGE   VERSION
master1   Ready    master   63d   v1.14.0
master2   Ready    master   63d   v1.14.0
node1     Ready    <none>   63d   v1.14.0
node2     Ready    <none>   63d   v1.14.0
node3     Ready    <none>   63d   v1.14.0
```
- Collection plan:
Host performance metrics are collected by prometheus-node-exporter and scraped by prometheus from its exposed /metrics endpoint. Cluster-related metrics from each node are obtained from the /metrics endpoints exposed by kube-apiserver, kube-controller-manager, kube-scheduler, etcd, kubelet and kube-proxy themselves. Container and Pod performance metrics are collected by cadvisor and scraped from its /metrics endpoint. Application network performance (http, tcp, icmp, etc.) is collected by blackbox-exporter and scraped from its /metrics endpoint. State metrics of k8s resource objects are collected by kube-state-metrics and scraped from its /metrics endpoint. Finally, metrics actively exposed by processes inside containers are collected from the applications themselves (the application implements the metrics endpoint and adds the agreed annotations; prometheus scrapes it based on those annotations).
- Scraping overview:
A set of annotation prefixes can be agreed upon to mark services that expose their own monitoring metrics. When an application adds these agreed annotations, Prometheus can scrape it based on them. For example:
```
prometheus.io/scrape: 'true'                  # the corresponding endpoint should be scraped
prometheus.io/app-metrics: 'true'             # the endpoint exposes metrics from the application process
prometheus.io/app-metrics-port: '8080'        # the port on which the process exposes metrics
prometheus.io/app-metrics-path: '/metrics'    # the path under which the process exposes metrics
```
An application can also set the agreed annotations on its Service so that Prometheus probes the application's network service:
```
# http probe:
prometheus.io/scrape: 'true'
prometheus.io/http-probe: 'true'
prometheus.io/http-probe-port: '8080'
prometheus.io/http-probe-path: '/healthz'

# tcp probe:
prometheus.io/scrape: 'true'
prometheus.io/tcp-probe: 'true'
prometheus.io/tcp-probe-port: '80'
```
From these annotations Prometheus knows that the corresponding service should be probed, which protocol to probe with (http, tcp, or another), and which port to probe. An http probe additionally needs the exact URL to hit.
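To make the convention concrete, here is a minimal sketch of a Service that opts in to both application-metrics scraping and http probing. The application name my-app, its namespace and its ports are hypothetical illustrations, not part of the manifests used later in this article:

```
apiVersion: v1
kind: Service
metadata:
  name: my-app                                 # hypothetical application
  namespace: default
  labels:
    app: my-app
  annotations:
    prometheus.io/scrape: 'true'               # opt in to scraping/probing
    prometheus.io/app-metrics: 'true'
    prometheus.io/app-metrics-port: '8080'
    prometheus.io/app-metrics-path: '/metrics'
    prometheus.io/http-probe: 'true'           # let blackbox-exporter probe /healthz
    prometheus.io/http-probe-port: '8080'
    prometheus.io/http-probe-path: '/healthz'
spec:
  selector:
    app: my-app
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
```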
namespace
```
mkdir -p /home/k8s/monitoring/{node-exporter,k8s,kube-state-metrics,blackbox-exporter,dingtalk,alertmanager,prometheus,grafana}

cd /home/k8s/monitoring
```

vim namespace.yaml
```
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```
node-exporter
vim node-exporter/node-exporter.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    app: node-exporter
  ports:
  - name: node-exporter
    port: 9100
    protocol: TCP
    targetPort: 9100
  clusterIP: None
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      name: node-exporter
      labels:
        app: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9100
          hostPort: 9100
      hostNetwork: true
      hostPID: true
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```
k8s components
controller-manager:
vim k8s/kube-controller-manager-prometheus-discovery.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager-prometheus-discovery
  namespace: kube-system
  labels:
    component: kube-controller-manager
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    component: kube-controller-manager
  ports:
  - name: http-metrics
    port: 10252
    targetPort: 10252
    protocol: TCP
  clusterIP: None
```
kube-scheduler:
vim k8s/kube-scheduler-prometheus-discovery.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: kube-scheduler-prometheus-discovery
  namespace: kube-system
  labels:
    component: kube-scheduler
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    component: kube-scheduler
  ports:
  - name: http-metrics
    port: 10251
    protocol: TCP
    targetPort: 10251
  clusterIP: None
```
kube-proxy:
vim k8s/kube-proxy-prometheus-discovery.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: kube-proxy-prometheus-discovery
  namespace: kube-system
  labels:
    k8s-app: kube-proxy
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    k8s-app: kube-proxy
  ports:
  - name: http-metrics
    port: 10249
    protocol: TCP
    targetPort: 10249
  clusterIP: None
```
kube-state-metrics
vim kube-state-metrics/rbac.yaml
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
  labels:
    app: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  verbs:
  - list
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
  labels:
    app: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
```
vim kube-state-metrics/kube-state-metrics.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/http-probe: 'true'
    prometheus.io/http-probe-path: '/healthz'
    prometheus.io/http-probe-port: '8080'
spec:
  selector:
    app: kube-state-metrics
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.mirrors.ustc.edu.cn/coreos/kube-state-metrics:v1.8.0    # kube-state-metrics:v1.9.7 is for Kubernetes 1.16+
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
      nodeSelector:
        node-role.kubernetes.io/master: ""
        kubernetes.io/hostname: "master2"
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```
blackbox-exporter
blackbox-exporter is a black-box probing tool that can probe a service's network endpoints over http, tcp, icmp and other protocols.
vim blackbox-exporter/config.yaml
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-exporter
  namespace: monitoring
  labels:
    app: blackbox-exporter
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 10s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: []
          method: GET
          preferred_ip_protocol: "ip4"
      http_post_2xx:
        prober: http
        timeout: 10s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          method: POST
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 10s
      icmp:
        prober: icmp
        timeout: 10s
        icmp:
          preferred_ip_protocol: "ip4"
```
vim blackbox-exporter/blackbox-exporter.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: blackbox-exporter
  namespace: monitoring
  labels:
    app: blackbox-exporter
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    app: blackbox-exporter
  ports:
  - name: blackbox
    port: 9115
    protocol: TCP
    targetPort: 9115
    nodePort: 30115
  type: NodePort
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
      - name: blackbox-exporter
        image: prom/blackbox-exporter:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9115
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 10
          timeoutSeconds: 5
        resources:
          requests:
            memory: 50Mi
            cpu: 100m
          limits:
            memory: 60Mi
            cpu: 200m
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        args:
        - '--config.file=/etc/blackbox_exporter/blackbox.yml'
        - '--web.listen-address=:9115'
      volumes:
      - name: config
        configMap:
          name: blackbox-exporter
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```
Note:

The blackbox-exporter configuration file is /etc/blackbox_exporter/blackbox.yml. The configuration can be reloaded dynamically at runtime; if the reload fails, the running configuration is not affected. To reload:

curl -XPOST http://ip:9115/-/reload
dingtalk
vim dingtalk/config.yaml
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: dingtalk-config
  namespace: monitoring
data:
  config.yml: |-
    targets:
      webhook:
        url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxxxxxx    # change to your DingTalk robot webhook
        mention:
          all: true    # @everyone
```
vim dingtalk/dingtalk.yaml
```
apiVersion: v1
kind: Service
metadata:
  name: dingtalk
  namespace: monitoring
  labels:
    app: dingtalk
  annotations:
    prometheus.io/scrape: 'false'
spec:
  selector:
    app: dingtalk
  ports:
  - name: dingtalk
    port: 8060
    protocol: TCP
    targetPort: 8060
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dingtalk
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dingtalk
  template:
    metadata:
      name: dingtalk
      labels:
        app: dingtalk
    spec:
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8060
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus-webhook-dingtalk
      volumes:
      - name: config
        configMap:
          name: dingtalk-config
```
alertmanager
vim alertmanager/templates.yaml
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-templates
  namespace: monitoring
data:
  default.tmpl: |
    {{ define "__alertmanager" }}AlertManager{{ end }}
    {{ define "__alertmanagerURL" }}{{ .ExternalURL }}/#/alerts?receiver={{ .Receiver }}{{ end }}
    {{ define "__subject" }}[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " " }} {{ if gt (len .CommonLabels) (len .GroupLabels) }}({{ with .CommonLabels.Remove .GroupLabels.Names }}{{ .Values | join " " }}{{ end }}){{ end }}{{ end }}
    {{ define "__description" }}{{ end }}
    {{ define "__text_alert_list" }}{{ range . }}Labels:
    {{ range .Labels.SortedPairs }} - {{ .Name }} = {{ .Value }}
    {{ end }}Annotations:
    {{ range .Annotations.SortedPairs }} - {{ .Name }} = {{ .Value }}
    {{ end }}Source: {{ .GeneratorURL }}
    {{ end }}{{ end }}
    {{ define "slack.default.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "slack.default.username" }}{{ template "__alertmanager" . }}{{ end }}
    {{ define "slack.default.fallback" }}{{ template "slack.default.title" . }} | {{ template "slack.default.titlelink" . }}{{ end }}
    {{ define "slack.default.pretext" }}{{ end }}
    {{ define "slack.default.titlelink" }}{{ template "__alertmanagerURL" . }}{{ end }}
    {{ define "slack.default.iconemoji" }}{{ end }}
    {{ define "slack.default.iconurl" }}{{ end }}
    {{ define "slack.default.text" }}{{ end }}
    {{ define "hipchat.default.from" }}{{ template "__alertmanager" . }}{{ end }}
    {{ define "hipchat.default.message" }}{{ template "__subject" . }}{{ end }}
    {{ define "pagerduty.default.description" }}{{ template "__subject" . }}{{ end }}
    {{ define "pagerduty.default.client" }}{{ template "__alertmanager" . }}{{ end }}
    {{ define "pagerduty.default.clientURL" }}{{ template "__alertmanagerURL" . }}{{ end }}
    {{ define "pagerduty.default.instances" }}{{ template "__text_alert_list" . }}{{ end }}
    {{ define "opsgenie.default.message" }}{{ template "__subject" . }}{{ end }}
    {{ define "opsgenie.default.description" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
    {{ if gt (len .Alerts.Firing) 0 -}}
    Alerts Firing:
    {{ template "__text_alert_list" .Alerts.Firing }}
    {{- end }}
    {{ if gt (len .Alerts.Resolved) 0 -}}
    Alerts Resolved:
    {{ template "__text_alert_list" .Alerts.Resolved }}
    {{- end }}
    {{- end }}
    {{ define "opsgenie.default.source" }}{{ template "__alertmanagerURL" . }}{{ end }}
    {{ define "victorops.default.message" }}{{ template "__subject" . }} | {{ template "__alertmanagerURL" . }}{{ end }}
    {{ define "victorops.default.from" }}{{ template "__alertmanager" . }}{{ end }}
    {{ define "email.default.subject" }}{{ template "__subject" . }}{{ end }}
    {{ define "email.default.html" }}
    {{ end }}
    {{ define "pushover.default.title" }}{{ template "__subject" . }}{{ end }}
    {{ define "pushover.default.message" }}{{ .CommonAnnotations.SortedPairs.Values | join " " }}
    {{ if gt (len .Alerts.Firing) 0 }}
    Alerts Firing:
    {{ template "__text_alert_list" .Alerts.Firing }}
    {{ end }}
    {{ if gt (len .Alerts.Resolved) 0 }}
    Alerts Resolved:
    {{ template "__text_alert_list" .Alerts.Resolved }}
    {{ end }}
    {{ end }}
    {{ define "pushover.default.url" }}{{ template "__alertmanagerURL" . }}{{ end }}
  slack.tmpl: |
    {{ define "slack.devops.text" }}
    {{range .Alerts}}{{.Annotations.DESCRIPTION}}
    {{end}}
    {{ end }}
```
vim alertmanager/config.yaml
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.163.com:465'     # SMTP smarthost; SSL is used for sending, the port is usually 465
      smtp_from: 'alert@163.com'             # sender address
      smtp_auth_username: 'alert@163.com'    # mailbox account
      smtp_auth_password: 'password'         # mailbox password or authorization code
      smtp_require_tls: false
    templates:
    - '/etc/templates/*.tmpl'
    route:
      receiver: 'default'
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 1h
      group_by: ['alertname', 'instance', 'cluster', 'service']
      routes:
      - receiver: 'default'
        match:
          severity: 'warning'
      - receiver: 'dingtalk'
        match:
          severity: 'critical'
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'instance', 'cluster', 'service']
    receivers:
    - name: 'default'
      email_configs:
      - to: 'receiver@163.com'
        send_resolved: true
    - name: 'dingtalk'
      webhook_configs:
      - url: 'http://dingtalk:8060/dingtalk/webhook/send'
        send_resolved: true
```
vim alertmanager/alertmanager.yaml
```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  rules:
  - host: alertmanager.lzxlinux.cn
    http:
      paths:
      - path: /
        backend:
          serviceName: alertmanager
          servicePort: 9093
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
  labels:
    name: alertmanager
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    app: alertmanager
  ports:
  - name: alertmanager
    port: 9093
    protocol: TCP
    targetPort: 9093
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9093
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: status.podIP
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager"
        - "--cluster.advertise-address=$(POD_IP):6783"    # without this flag the pod fails with: Failed to get final advertise address
        volumeMounts:
        - name: config
          mountPath: /etc/alertmanager
        - name: templates
          mountPath: /etc/templates
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: config
        configMap:
          name: alertmanager-config
      - name: templates
        configMap:
          name: alertmanager-templates
      - name: alertmanager
        emptyDir: {}
```
prometheus
vim prometheus/rbac.yaml
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: ["networking.k8s.io"]
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```
Notes on the prometheus configuration:
annotations attach arbitrary non-identifying metadata to objects, while labels are used to select objects and to find sets of objects that satisfy certain conditions. Although annotations are also key/value pairs, they are not used for identifying or selecting objects; Kubernetes does not use them directly, and their main purpose is to make information easy for people to read and look up.
The Kubernetes API server exposes the cluster API, and Prometheus has built-in Kubernetes service discovery with five roles: node, service, pod, endpoints and ingress. The configuration below follows the official Prometheus examples for Kubernetes service discovery. It contains a large number of relabel_configs; strictly speaking, service discovery would still work with all of them removed. relabel_configs only post-process the discovered targets: keep this, drop that, replace something, and so on. The __meta_-prefixed labels are metadata attached to each discovered target, and relabeling dynamically modifies, overwrites, adds or removes these labels and their values. Labels starting with __ are for internal use and are not written into the stored samples, so any such information you want to keep must be relabeled first; relabeling is of course not limited to __-prefixed labels. The relabel actions behave as follows:
- replace: the default action (used when action is not set); it matches the value of source_labels against regex and writes the matched value into target_label
- labelmap: matches label names against regex and uses the matched part as a new label name, with the original value as the new label's value
- keep: keeps only targets whose source labels match regex and drops all the others (used for selection)
- drop: drops targets whose source labels match regex and keeps all the others (used for exclusion)
- labeldrop: matches label names against regex and removes the matching labels from the target (they are neither collected nor stored)
- labelkeep: matches label names against regex and keeps only the matching labels; non-matching labels are dropped
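As a short illustration of how these actions are usually combined (a simplified sketch, not one of the jobs in the configuration below): keep only endpoints whose Service is annotated with prometheus.io/scrape: 'true', copy the Service labels onto the target, and record the namespace as an ordinary label.

```
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
  regex: true
  action: keep                                   # drop every target without the annotation
- action: labelmap
  regex: __meta_kubernetes_service_label_(.+)    # service labels become target labels
- source_labels: [__meta_kubernetes_namespace]
  action: replace                                # default action, shown explicitly for clarity
  target_label: kubernetes_namespace
```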
vim prometheus/config.yaml
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s

    alerting:
      alertmanagers:
      - static_configs:
        - targets:
          - alertmanager:9093

    rule_files:
    - "/etc/prometheus-rules/*.rules"

    scrape_configs:
    - job_name: 'node-exporter'                  # node performance metrics
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_endpoint_port_name]
        regex: true;node-exporter
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    - job_name: 'kube-apiservers'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        regex: default;kubernetes;https
        action: keep

    - job_name: 'kube-controller-manager'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: true;kube-system;kube-controller-manager-prometheus-discovery
        action: keep

    - job_name: 'kube-scheduler'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: true;kube-system;kube-scheduler-prometheus-discovery
        action: keep

    - job_name: 'kubelet'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: 192.168.30.188:6443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

    - job_name: 'kube-proxy'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_namespace, __meta_kubernetes_service_name]
        regex: true;kube-system;kube-proxy-prometheus-discovery
        action: keep

    - job_name: 'kubernetes-cadvisor'            # container / Pod performance metrics
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: 192.168.30.188:6443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
      - source_labels: [id]
        action: replace
        regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
        target_label: rkt_container_name
        replacement: '${2}-${1}'
      - source_labels: [id]
        action: replace
        regex: '^/system\.slice/(.+)\.service$'
        target_label: systemd_service_name
        replacement: '${1}'

    - job_name: 'kube-state-metrics'             # state of resource objects (Deployment, Pod, etc.)
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_endpoint_port_name]
        regex: true;kube-state-metrics
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    - job_name: 'kubernetes-service-http-probe'  # probe Service status over http
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_service_annotation_prometheus_io_http_probe]
        regex: true;true
        action: keep
      - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_namespace, __meta_kubernetes_service_annotation_prometheus_io_http_probe_port, __meta_kubernetes_service_annotation_prometheus_io_http_probe_path]
        action: replace
        target_label: __param_target
        regex: (.+);(.+);(.+);(.+)
        replacement: $1.$2:$3$4
      - target_label: __address__
        replacement: 192.168.30.128:30115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_annotation_prometheus_io_app_info_(.+)

    - job_name: 'kubernetes-service-tcp-probe'   # probe Service status over tcp
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module: [tcp_connect]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_service_annotation_prometheus_io_tcp_probe]
        regex: true;true
        action: keep
      - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_namespace, __meta_kubernetes_service_annotation_prometheus_io_tcp_probe_port]
        action: replace
        target_label: __param_target
        regex: (.+);(.+);(.+)
        replacement: $1.$2:$3
      - target_label: __address__
        replacement: 192.168.30.128:30115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_annotation_prometheus_io_app_info_(.+)

    - job_name: 'kubernetes-ingresses'           # probe Ingress status over http
      kubernetes_sd_configs:
      - role: ingress
      metrics_path: /probe
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_scheme, __address__, __meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: 192.168.30.128:30115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name
```
vim prometheus/rules.yaml
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  node.rules: |
    groups:
    - name: node
      rules:
      - alert: NodeDown
        expr: up == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: down"
          description: "{{ $labels.instance }} has been down for more than 3m"
          value: "{{ $value }}"
      - alert: NodeCPUHigh
        expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: High CPU usage"
          description: "{{ $labels.instance }}: CPU usage is above 75%"
          value: "{{ $value }}"
      - alert: NodeCPUIowaitHigh
        expr: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: High CPU iowait usage"
          description: "{{ $labels.instance }}: CPU iowait usage is above 50%"
          value: "{{ $value }}"
      - alert: NodeMemoryUsageHigh
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: High memory usage"
          description: "{{ $labels.instance }}: Memory usage is above 90%"
          value: "{{ $value }}"
      - alert: NodeDiskRootLow
        expr: (1 - node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint="/"} / node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint="/"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: Low disk (the / partition) space"
          description: "{{ $labels.instance }}: Disk (the / partition) usage is above 80%"
          value: "{{ $value }}"
      - alert: NodeDiskBootLow
        expr: (1 - node_filesystem_avail_bytes{fstype=~"ext.*|xfs",mountpoint="/boot"} / node_filesystem_size_bytes{fstype=~"ext.*|xfs",mountpoint="/boot"}) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: Low disk (the /boot partition) space"
          description: "{{ $labels.instance }}: Disk (the /boot partition) usage is above 80%"
          value: "{{ $value }}"
      - alert: NodeLoad5High
        expr: (node_load5) > (count by (instance) (node_cpu_seconds_total{mode='system'}) * 2)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: Load(5m) High"
          description: "{{ $labels.instance }}: Load(5m) is 2 times the number of CPU cores"
          value: "{{ $value }}"
```
vim prometheus/prometheus.yaml
```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
spec:
  rules:
  - host: prometheus.lzxlinux.cn
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus
          servicePort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
  annotations:
    prometheus.io/scrape: 'true'
spec:
  selector:
    app: prometheus
  ports:
  - name: prometheus
    port: 9090
    protocol: TCP
    targetPort: 9090
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      name: prometheus
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        imagePullPolicy: IfNotPresent
        args:
        - '--storage.tsdb.path=/prometheus'
        - '--storage.tsdb.retention.time=30d'
        - '--config.file=/etc/prometheus/prometheus.yml'
        ports:
        - containerPort: 9090
        resources:
          requests:
            cpu: 500m
            memory: 500M
          limits:
            cpu: 500m
            memory: 500M
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: rules
          mountPath: /etc/prometheus-rules
        - name: prometheus
          mountPath: /prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus-config
      - name: rules
        configMap:
          name: prometheus-rules
      - name: prometheus
        emptyDir: {}
      nodeSelector:
        node-role.kubernetes.io/master: ""
        kubernetes.io/hostname: "master2"
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```
grafana
vim grafana/secret.yaml
```
apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: monitoring
data:
  admin-password: YWRtaW4=    # base64-encoded ("admin")
  admin-username: YWRtaW4=
type: Opaque
```
vim grafana/grafana.yaml
```
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  rules:
  - host: grafana.lzxlinux.cn
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/metrics'
spec:
  selector:
    app: grafana
  ports:
  - name: grafana
    port: 3000
    protocol: TCP
    targetPort: 3000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: grafana
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_DASHBOARDS_JSON_ENABLED
          value: "true"
        - name: GF_INSTALL_PLUGINS
          value: grafana-kubernetes-app    # install the grafana-kubernetes-app plugin
        - name: GF_SECURITY_ADMIN_USER
          valueFrom:
            secretKeyRef:
              name: grafana
              key: admin-username
        - name: GF_SECURITY_ADMIN_PASSWORD
          valueFrom:
            secretKeyRef:
              name: grafana
              key: admin-password
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
          initialDelaySeconds: 10
          timeoutSeconds: 5
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
      volumes:
      - name: grafana-storage
        emptyDir: {}
```
Deploy
```
tree .
.
├── alertmanager
│   ├── alertmanager.yaml
│   ├── config.yaml
│   └── templates.yaml
├── blackbox-exporter
│   ├── blackbox-exporter.yaml
│   └── config.yaml
├── dingtalk
│   ├── config.yaml
│   └── dingtalk.yaml
├── grafana
│   ├── grafana.yaml
│   └── secret.yaml
├── k8s
│   ├── kube-controller-manager-prometheus-discovery.yaml
│   ├── kube-proxy-prometheus-discovery.yaml
│   └── kube-scheduler-prometheus-discovery.yaml
├── kube-state-metrics
│   ├── kube-state-metrics.yaml
│   └── rbac.yaml
├── namespace.yaml
├── node-exporter
│   └── node-exporter.yaml
└── prometheus
    ├── config.yaml
    ├── prometheus.yaml
    ├── rbac.yaml
    └── rules.yaml

8 directories, 20 files
```
```
kubectl apply -f namespace.yaml
kubectl apply -f node-exporter/
kubectl apply -f k8s/
kubectl apply -f kube-state-metrics/
kubectl apply -f blackbox-exporter/
kubectl apply -f dingtalk/
kubectl apply -f alertmanager/
kubectl apply -f prometheus/
kubectl apply -f grafana/
```
```
kubectl get all -n monitoring

NAME                                      READY   STATUS    RESTARTS   AGE
pod/alertmanager-9c4bf8565-z9mp9          1/1     Running   0          2m54s
pod/blackbox-exporter-57d847fc4c-mq8mx    1/1     Running   0          2m58s
pod/dingtalk-957f5896-9bd9b               1/1     Running   0          2m56s
pod/grafana-76779dc8cf-2fk4x              1/1     Running   0          2m46s
pod/kube-state-metrics-5d5f7cd774-tw4sw   1/1     Running   0          2m58s
pod/node-exporter-29bkg                   1/1     Running   0          3m5s
pod/node-exporter-45k2d                   1/1     Running   0          3m5s
pod/node-exporter-8dbts                   1/1     Running   0          3m5s
pod/node-exporter-9kwwt                   1/1     Running   0          3m5s
pod/node-exporter-bxhcf                   1/1     Running   0          3m5s
pod/prometheus-65848cf9b4-m5kcf           1/1     Running   0          2m49s

NAME                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
service/alertmanager         ClusterIP   10.98.52.72      <none>        9093/TCP         2m55s
service/blackbox-exporter    NodePort    10.106.73.127    <none>        9115:30115/TCP   2m58s
service/dingtalk             ClusterIP   10.103.205.136   <none>        8060/TCP         2m57s
service/grafana              ClusterIP   10.103.12.113    <none>        3000/TCP         2m47s
service/kube-state-metrics   ClusterIP   10.98.99.215     <none>        8080/TCP         3m1s
service/node-exporter        ClusterIP   None             <none>        9100/TCP         3m6s
service/prometheus           ClusterIP   10.99.50.109     <none>        9090/TCP         2m51s

NAME                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/node-exporter   5         5         5       5            5           <none>          3m5s

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/alertmanager         1/1     1            1           2m55s
deployment.apps/blackbox-exporter    1/1     1            1           2m58s
deployment.apps/dingtalk             1/1     1            1           2m57s
deployment.apps/grafana              1/1     1            1           2m46s
deployment.apps/kube-state-metrics   1/1     1            1           3m
deployment.apps/prometheus           1/1     1            1           2m51s

NAME                                            DESIRED   CURRENT   READY   AGE
replicaset.apps/alertmanager-9c4bf8565          1         1         1       2m55s
replicaset.apps/blackbox-exporter-57d847fc4c    1         1         1       2m58s
replicaset.apps/dingtalk-957f5896               1         1         1       2m56s
replicaset.apps/grafana-76779dc8cf              1         1         1       2m46s
replicaset.apps/kube-state-metrics-5d5f7cd774   1         1         1       3m
replicaset.apps/prometheus-65848cf9b4           1         1         1       2m51s
```
Pick any node IP and add hosts entries locally:
```
192.168.30.130 alertmanager.lzxlinux.cn
192.168.30.130 prometheus.lzxlinux.cn
192.168.30.130 grafana.lzxlinux.cn
```
Visit prometheus.lzxlinux.cn. Among the monitored Targets, only kube-proxy shows as DOWN, because it listens on 127.0.0.1:10249; changing its listen address fixes it, as sketched below.
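A minimal sketch of that fix, assuming a kubeadm-managed cluster where the setting lives in the KubeProxyConfiguration stored in the kube-proxy ConfigMap (edit it with kubectl edit configmap kube-proxy -n kube-system and then restart the kube-proxy pods for the change to take effect):

```
# Only the relevant field of the KubeProxyConfiguration is shown;
# leave the rest of the ConfigMap untouched.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
metricsBindAddress: 0.0.0.0:10249    # the default 127.0.0.1:10249 rejects remote scrapes
```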
Visit grafana.lzxlinux.cn. The data source is http://prometheus:9090; import dashboard 8919 (host details).
Configure the kubernetes data source in grafana
Enable the plugin: Plugins → kubernetes → Enable, then configure the cluster address and the access certificates:
If the k8s cluster was set up with kubeadm, there is an /etc/kubernetes/admin.conf file that contains the client certificates and keys as base64-encoded strings.
cat /etc/kubernetes/admin.conf
```
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN5RENDQWJDZ0F3SUJBZ0lCQURBTkJna3Foa2lHOXcwQkFRc0ZBREFWTVJNd0VRWURWUVFERXdwcmRXSmwKY201bGRHVnpNQjRYRFRJd01EVXhNakV3TURBek1Gb1hEVE13TURVeE1ERXdNREF6TUZvd0ZURVRNQkVHQTFVRQpBeE1LYTNWaVpYSnVaWFJsY3pDQ0FTSXdEUVlKS29aSWh2Y05BUUVCQlFBRGdnRVBBRENDQVFvQ2dnRUJBTDVpCmYxTGcxUUlqN0VlWlZ0cVFmS3dGZjg4V3NVbVVialZldll5NDZTUittMWpwRWdvM2wxWEIvZHBFNzRWOGtqTGQKZkdGdmVWZkVxNy8rMzdyamNGMXRpSm1BbThLZnMrMW9QdEpLOE0yZjNTSm5FZVVIQUlBeFl2cUE4ZFNsbThTQwpmSkJWU2J3K1pROTBTelpKNzdQUzFuZTBmYnRod0Y2VHE0Uy9FV3h3cUZZMzF5cENub05lVUNtcElsSjVnYWdtCnJ2QmhkTmFNb2oyQlRrMWNDVjh3dkRVS3RlbXFVYVE4R2ZCalZLeHhkdWtwcjJ3S3RPbXZkem1vMEdLSE11MFcKWmQ1TVd0dStIQVZrTXhzcE95Yk41NkFkNnloUkN5YkFJbTN2ZWJlTFV5cjBEY2JhNzJXNVlPRHRCY3ZBOEJxOAoxR1JQc1EwaXBUdGtYbDVCZEhzQ0F3RUFBYU1qTUNFd0RnWURWUjBQQVFIL0JBUURBZ0trTUE4R0ExVWRFd0VCCi93UUZNQU1CQWY4d0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFJS0lIb25wVllFWWpwR3JrN2wraGJyeGlxZXkKeGFQT1M3UW5TZEVZMC94TWtiUWxKcy9rUFcxU2lVemdoUk4wQWJxMnFtTXVuNHhlZ0pLdG***XhYRGJZNEhZbgpVVCtPWG0rQ1hBQjd3S3pYcDlmUTZBUDk3cmY0L2FRaXlGZEtsZUJ6Y3JNUkErZHZWTjk3NGlHUW94aFh3T1FNCmZXeGNrMDNhU0Qvc2s5UnJrcFhlL1g2NHQrV3BkUlFGRjE2YVFlSHVxNnJQRWZTR2VPUWVpcVIrQVgvdWpIOHoKZzJZY2JKWE85U3ZheXcyb3oxSlozTUx6K0FpeE5RTHFNYU00Tm43TklvMExxUHFqNzZoU3d1Qk1nREE0VnFtZAowZHRtS211OVZjTGZHcW9ITnZnajlTYlVlZ1crL3VEbzcwVXdvb2NGTmlnSnRnOVVSZWpEUXJJSm4rUT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    server: https://192.168.30.188:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUM4akNDQWRxZ0F3SUJBZ0lJZWtMYS9Fc1ZDSGd3RFFZSktvWklodmNOQVFFTEJRQXdGVEVUTUJFR0ExVUUKQXhNS2EzVmlaWEp1WlhSbGN6QWVGdzB5TURBMU1USXhNREF3TXpCYUZ3MHlNVEExTVRJeE1EQXdNekphTURReApGekFWQmdOVkJBb1REbk41YzNSbGJUcHRZWE4wWlhKek1Sa3dGd1lEVlFRREV4QnJkV0psY201bGRHVnpMV0ZrCmJXbHVNSUlCSWpBTkJna3Foa2lHOXcwQkFRRUZBQU9DQVE4QU1JSUJDZ0tDQVFFQXJ2a084MUg0ZW9zak5kM3oKSy9UUEhHcGtCR1FvZm1hbm9ldjRlWXNmUTlPZW0wYzBvVUJ3cXoxM2JabmJUbmJweFFqbmdZMkc4bHF4UmkwaQpCdlA2ZmtmS0ZFQlZzUTd4dGlqZXBrdnByWEdPL08wUUE1U0k4NHJzTjVHOVhOa2pQbWdzYTBlblZxNUVvRTBGClRaNXpRRjlwUlkxWUZZZXYrTDE1bU5FaXlScUg4UDJRY3BoUmxWK09IUXVHaVdLNEhIRVB2QWw2QUpJeWN6d3MKWWMrdk1IdHlZbmF5NUMwUldVWHhyUmc0ZytKMksrY1h1YlF0elhXdjdxaTNhNjFDekpaZi9TZkNOd0Jyam9zRwp0b215WEJWNVZTVGJUYVk1OFZrLzFPK1NSc3BybjF3TDc0djdXUXVEaE9ydXhBRXpuYmRXWWxOMEZBMm5MTjlZCmwxWkVKUUlEQVFBQm95Y3dKVEFPQmdOVkhROEJBZjhFQkFNQ0JhQXdFd1lEVlIwbEJBd3dDZ1lJS3dZQkJRVUgKQXdJd0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFHazlIRDAxRmRRUnd4THhGUi8yRjdPM2ZpdGRFV3pDTC9UawpsZUxZaGlQaVh3NjNwOGtWU0VabEIyNEYzNEd2WlB3YS9LWnNUQnZXM0Mwek9uNGpHQ2hueHEvaVdqTWFnVEdBCktPUFV2bUI2VzhvVzhlb0lrSStOOEs0NFhSRnZzeGIwNUtqaCtwd0VZZzJUQXpBNEFlQzlnSjZYaTBzbHpnVnIKcWRzbXZtV0QzNEdXYzJOcVIzSDA3cW43RlJwRHIrTjlrTHE4Ukt4L0YwMWNCV1I3VVRZcnJTLzJEQ2t1N3lsWgptdTcwcXZicndYWnF6TkI5b05hQk82SHJsZXpuU2JQbnFKZUo0Q1czc2NMNmJ1N3A3bEppV1VQb0VHT0xic3YvCnFjT0xqdnZSRFF6eC9Xak5DWFZLNFhxbzJjVERGYitXeFJ1U2xGaUlQclk1QjlkQlFJWT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
    client-key-data: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFb2dJQkFBS0NBUUVBcnZrTzgxSDRlb3NqTmQzeksvVFBIR3BrQkdRb2ZtYW5vZXY0ZVlzZlE5T2VtMGMwCm9VQndxejEzYlpuYlRuYnB4UWpuZ1kyRzhscXhSaTBpQnZQNmZrZktGRUJWc1E3eHRpamVwa3ZwclhHTy9PMFEKQTVTSTg0cnNONUc5WE5ralBtZ3NhMGVuVnE1RW9FMEZUWjV6UUY5cFJZMVlGWWV2K0wxNW1ORWl5UnFIOFAyUQpjcGhSbFYrT0hRdUdpV0s0SEhFUHZBbDZBSkl5Y3p3c1ljK3ZNSHR5WW5heTVDMFJXVVh4clJnNGcrSjJLK2NYCnViUXR6WFd2N3FpM2E2MUN6SlpmL1NmQ053QnJqb3NHdG9teVhCVjVWU1RiVGFZNThWay8xTytTUnNwcm4xd0wKNzR2N1dRdURoT3J1eEFFem5iZFdZbE4wRkEybkxOOVlsMVpFSlFJREFRQUJBb0lCQUI4OHI0S1krN2RFNThCUwpJM3VSZFBncHRqbGllQ2c0dzJ5UTZBY3E0eVlFdmFnVENqNVBkczNiWjFyVndPVTlMWGJUcENEbzExS2xCa2owCi9jSW9CR3hPL0xDbzI2T0VlM3A5eVdIKzQzVG5kUk9LYnZWMHF3NXZtc1JBN0lHSzhsUE4zVUE1eHBJZkFubHIKeHFxWXd4S1c5Z0JJdjVUNGFGNEwxWTJHcUtNbUlhenVjLzVleU5rZjk3bnRyOFJncXQxcDJyaWJIVS9nRzFlYgpIWktyNm01UWx2MWJpYTFIYms2SWI2b1pYTVFIWWJSckpVemJSaWp6eE1RVkRmZ2tpalhSR3pSRjdZeVFZbTk5CmwwUzI1bDYzY1dIT3J6czM5R21xRmQ4Z0JJc0M3SkxuUThUY01nb1Axb0M5WXUzMGRrSDBvUi95M2lOTnFxRW8KZVJ2d0w0RUNnWUVBeG1tY2JDZXRlRDI5VWZDTE1kKzByTU14aVV4bHI3TTdrYUJOeWxPK2lOZWxKbUk5UHRXcwpkUzQzS2hkeElnSVNIQVRTdWxpR3VKMStEekxTWWNGZ2FkSmQxd25NNHppOUc2cW9NOXZTTVN2ZGtvK28ydXRqCmNscVBZcnVRbC9nK252dkE2N1ZyZzAyUXF0NFlqcmkrYmxQL1RNaDZGS3NGa0VQeXZPTHUxOTBDZ1lFQTRjSFgKUm43WUl2TWtMNGQrR3dkUmRwcXl5YStxVC9nbUtKTnNwb3VLUVZlaUd3aW9vR21BR0E0MEJBR2hyN24vMXB6Rwo5VkVQb201VDdPRnVmVWxGaUNURmJBblN2RWU5RTREUHJ3SDNhazlXR0JzcWxYcUZwMjdwWWFyZ3NSS2JDWU9UCm9Nc1FJR0wxelN4NEpkdFArMUxDQ1BuRnowMTNkajhRbmc4TVBPa0NnWUFwczF5cTVwUHc1NWo0dGNPcmtjYloKWUpUeXRGblMyYXExYXFtdTBuY0RMNytJRjdHam1Ta0wzOUM4U2Z6L0ZzeFRremZ1N2xneVNQZUxualRWVXQwKwpvSFlVa2Z5NzdOcmlDN1lhWUNNSExwNzlCTENLZ2xwK1dFWTJqQkZSdjF6NThST1U5cVpJREc5UldpaHpKcVR2CmJ6d0RHVWQvUElxSXpaOGd6OWsvQ1FLQmdIbEFRaDVEdkZReElNdENTM0c2NFg4QklXdC9wTXFrcmVIM0pGRGkKKzFPUy9LYm1aS01iWnNnRXdOMHgveVJCa3U0eWNBMk1Cd2lubHYzUUtpYXlOdDBqV3NGbkdUODBqSkd3Q2x1bApnN3dlZGxBbUx4M3ZtMTlOQzU0QVNBUHl5VUEzNGc5bllQYjBENjZ0NXEzMmQ2TzFWQys3N3dralF6bElMK1drCmtWOFpBb0dBVk01R1lLbnpNVjUzVzNXT3I0dFdLSm5XUHFiaHVlUEt5SXMzbTNkUzhGUE56SDU2UHhNKzRUM24Ka2NzT1VsZTlkQkFENXRXT3E5eHFmNWF4MXpaU2s1SzFhdUphSzRaa3RzNkdMRUgrU09WckdoK1JXQWtRcUFVbgo0Qmk4ZVA4MmR5M3N2RmV1UkNvTWFXRVQ0QlFHaGRQaFFCd1NNdlYrSWI2R3U0VldwN289Ci0tLS0tRU5EIFJTQSBQUklWQVRFIEtFWS0tLS0tCg==
```
The certificate-authority-data, client-certificate-data and client-key-data fields correspond to the CA certificate, the client certificate and the client private key. Their values are base64-encoded; decoding each value (for example, echo "<data>" | base64 -d) restores the original certificate files.
First set the Datasource to Prometheus and click Save. The dashboards shown below then appear in grafana automatically.
K8s Cluster:
K8s Node:
If a panel shows nothing, it is because some prometheus metrics have been renamed, e.g. node_cpu became node_cpu_seconds_total. Redefine the dashboard's variables (the variables of dashboard 8919 are a useful reference) and tune them against the metrics listed in the prometheus UI. After tuning:
K8s Deployments:
K8s Container:
The kubernetes data source in grafana is now configured; next, test alerting.
Alert testing
- Downtime test:
Simulate a node-exporter outage to check that the DingTalk alert (critical) fires correctly. node3 is used as the test machine.
vim kill_node-exporter.sh
```
#!/bin/bash

nodepid=`netstat -lntp | grep 9100 | awk '{print $NF}' | awk -F '/' '{print $1}'`
nodenum=`netstat -lntp | grep 9100 | grep -v pause | wc -l`

if [ $nodenum -eq 0 ];then
    exit
else
    kill -9 $nodepid
    exit
fi
```
while :; do sh kill_node-exporter.sh; sleep 3; done
After waiting 3m, the DingTalk firing alert and, later, the resolved alert are received.
- CPU test:
Simulate 80% CPU usage to check that the email alert (warning) fires correctly. node3 is used as the test machine.
```
yum install -y stress-ng

stress-ng -c 0 -l 80    # -c sets the number of stress workers (0 = one per online CPU); -l sets the CPU load percentage; Ctrl + C to exit
```
After waiting 5m, the email firing alert and, later, the resolved alert are received.
Both the DingTalk and email alerts work as expected. More metrics and alerts for the k8s cluster can be configured as needed; this is only a demonstration.
Deploying prometheus + grafana on k8s via Deployment is complete. For production, a StatefulSet-based deployment is recommended; the overall process is similar to the above. The manifests are available in my personal GitHub repo: kubernetes