AlertManager
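Preface
Once a monitoring stack is in place, an alerting mechanism is indispensable. It pushes messages through channels such as email, SMS, DingTalk, or WeCom so that operations staff can find and fix problems as quickly as possible.
1. Create AlertManager
As usual, start by lifting the default configuration file out of an existing alertmanager container (remove that container before the next step, because the name is reused):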
docker cp alertmanager:/etc/alertmanager/alertmanager.yml .
Start AlertManager with the configuration file mounted:
docker run --name alertmanager -d -p 9093:9093 -v /Users/yujian/Documents/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml prom/alertmanager:latest
2. Configure AlertManager notification methods
Email method: edit alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: xxxxxxx@163.com
  smtp_auth_username: xxxxxx@163.com
  smtp_auth_password: xxxxx
  smtp_require_tls: false
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1m
  receiver: 'mail'
receivers:
- name: 'mail'
  email_configs:
  - to: xxxxxxxx@qq.com
At this point the AlertManager side of alerting is configured.
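To check the email route without waiting for a real alert, one option is to push a hand-made alert into AlertManager through its v2 API (POST /api/v2/alerts). The Go sketch below is only an illustration: the alert name manual-test, the instance localhost:8089, and the ALERTMANAGER_URL environment variable are my own assumptions, not part of the setup above. With the variable unset it only prints the payload.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// postableAlert holds the fields of one alert sent to POST /api/v2/alerts.
// startsAt is left out, so AlertManager uses the time of receipt.
type postableAlert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// buildTestAlert returns the JSON body for a single hand-made alert.
func buildTestAlert(name, instance, summary string) ([]byte, error) {
	alerts := []postableAlert{{
		Labels:      map[string]string{"alertname": name, "instance": instance},
		Annotations: map[string]string{"summary": summary},
	}}
	return json.Marshal(alerts)
}

func main() {
	body, err := buildTestAlert("manual-test", "localhost:8089", "test alert sent by hand")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(body))

	// ALERTMANAGER_URL is a hypothetical variable, e.g. http://localhost:9093.
	base := os.Getenv("ALERTMANAGER_URL")
	if base == "" {
		return // dry run: only print the payload
	}
	resp, err := http.Post(base+"/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("post failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("AlertManager answered:", resp.Status)
}
```

Run with ALERTMANAGER_URL=http://localhost:9093, the alert should reach the 'mail' receiver once group_wait has elapsed.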
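3. Create alert rules
Alert rules define the conditions under which an alert fires. Prometheus evaluates them and sends firing alerts to AlertManager, so the alerting section of prometheus.yml must point at the AlertManager address.
# Edit prometheus.yml
rule_files:
  - "/etc/prometheus/rules.yml"
  # - "second_rules.yml"
The file /etc/prometheus/rules.yml does not exist yet, so let's create one:
vi rule.yml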
groups:
- name: node-up
  rules:
  - alert: cpumax # alertname
    expr: easy_prometheus_system_cpu_percent{job="easy_prometheus"} > 20 # PromQL; threshold lowered to 20% so the alert is easy to trigger
    for: 3s # how long the condition must hold before the alert fires
    annotations:
      summary: "{{ $labels.instance }} CPU usage is above 20%!"
  - alert: node-up
    expr: up{job="easy_prometheus"} == 0 # PromQL
    for: 4s
    labels: # descriptive labels attached to the alert
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has stopped running!"
Recreate the Prometheus container with rule.yml mounted at /etc/prometheus/rules.yml. Once it is up, open the Alerts page to check that the rules were loaded.
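The `for` clause is easier to reason about as a small state machine: an alert is pending from the first evaluation where the expression is true, and becomes firing once it has stayed true for at least the `for` duration. The Go sketch below is a simplified model of that behaviour, not Prometheus code; every name in it (alert, eval, simulate) is hypothetical, and it ignores options such as keep_firing_for.

```go
package main

import (
	"fmt"
	"time"
)

// State is the lifecycle state of one alert instance.
type State string

const (
	Inactive State = "inactive"
	Pending  State = "pending"
	Firing   State = "firing"
)

// alert tracks one alert instance of a rule that has a `for` clause.
type alert struct {
	holdFor  time.Duration // the value of `for`
	state    State
	activeAt time.Time // first evaluation at which the expression was true
}

// eval applies one rule evaluation at time now and returns the new state.
func (a *alert) eval(now time.Time, exprTrue bool) State {
	if !exprTrue {
		a.state = Inactive // condition cleared, so the pending timer resets
		return a.state
	}
	if a.state != Pending && a.state != Firing {
		a.state, a.activeAt = Pending, now
	}
	if a.state == Pending && now.Sub(a.activeAt) >= a.holdFor {
		a.state = Firing
	}
	return a.state
}

// simulate evaluates `sample > threshold` once per step and records each state.
func simulate(samples []float64, threshold float64, holdSeconds, stepSeconds int) []State {
	a := &alert{holdFor: time.Duration(holdSeconds) * time.Second}
	start := time.Unix(0, 0)
	states := make([]State, 0, len(samples))
	for i, v := range samples {
		now := start.Add(time.Duration(i*stepSeconds) * time.Second)
		states = append(states, a.eval(now, v > threshold))
	}
	return states
}

func main() {
	// CPU percentages evaluated once per second against the rule "> 20 for 3s".
	fmt.Println(simulate([]float64{25, 30, 28, 27, 10}, 20, 3, 1))
}
```

In this model the state only changes at evaluation time, so with an evaluation interval longer than 3s the cpumax alert would turn firing at the second evaluation rather than exactly 3 seconds after the threshold is crossed.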
Webhook method: edit alertmanager.yml
route:
  group_by: ['instance']
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 20s
  #repeat_interval: 1h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://192.168.31.150:8089/webhook'
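With group_by: ['instance'], alerts that share the same instance label are bundled into one notification. The Go sketch below only illustrates that bucketing idea and is not AlertManager's implementation; the helper names and the second host 192.168.31.151:8089 are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// labels is the label set of one alert.
type labels map[string]string

// groupKey builds a key from the values of the group_by labels.
func groupKey(l labels, groupBy []string) string {
	parts := make([]string, 0, len(groupBy))
	for _, name := range groupBy {
		parts = append(parts, name+"="+l[name])
	}
	return strings.Join(parts, ",")
}

// groupAlerts puts alerts with identical group_by label values in the same bucket.
func groupAlerts(alerts []labels, groupBy []string) map[string][]labels {
	groups := make(map[string][]labels)
	for _, a := range alerts {
		k := groupKey(a, groupBy)
		groups[k] = append(groups[k], a)
	}
	return groups
}

func main() {
	alerts := []labels{
		{"alertname": "cpumax", "instance": "192.168.31.150:8089"},
		{"alertname": "node-up", "instance": "192.168.31.150:8089"},
		{"alertname": "cpumax", "instance": "192.168.31.151:8089"}, // hypothetical second host
	}
	groups := groupAlerts(alerts, []string{"instance"})

	// Sort the keys so the output order is stable.
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s -> %d alert(s)\n", k, len(groups[k]))
	}
}
```

Each bucket corresponds to one webhook call, and its key shows up in the payload as groupLabels.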
Message format
{"receiver":"webhook","status":"resolved","alerts":[{"status":"resolved","labels":{"action":"CPU utilization","alertname":"cpumax","application":"easy_prometheus","cause":"CPU…","exported_application":"easy_prometheus","instance":"192.168.31.150:8089","job":"easy_prometheus"},"annotations":{"summary":"192.168.31.150:8089 CPU usage is above 20%!"},"startsAt":"2021-06-19T03:21:56.117Z","endsAt":"2021-06-19T03:22:11.117Z","generatorURL":"http://406161e43292:9090/graph?g0.expr=easy_prometheus_system_cpu_percent%7Bjob%3D%22easy_prometheus%22%7D+%3E+20\u0026g0.tab=1","fingerprint":"1bcf523f0c524538"}],"groupLabels":{"instance":"192.168.31.150:8089"},"commonLabels":{"application":"easy_prometheus","instance":"192.168.31.150:8089","job":"easy_prometheus"},"commonAnnotations":{},"externalURL":"http://c731ba69bfca:9093","version":"4","groupKey":"{}:{instance=\"192.168.31.150:8089\"}","truncatedAlerts":0}
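If the receiver needs more than the summary, the whole payload can be decoded into typed structs. The Go sketch below mirrors the field names visible in the message above; the struct and function names are hypothetical, and the sample constant is a shortened version of that message with a placeholder summary.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// webhookAlert is one entry of the "alerts" array.
type webhookAlert struct {
	Status       string            `json:"status"`
	Labels       map[string]string `json:"labels"`
	Annotations  map[string]string `json:"annotations"`
	StartsAt     time.Time         `json:"startsAt"`
	EndsAt       time.Time         `json:"endsAt"`
	GeneratorURL string            `json:"generatorURL"`
	Fingerprint  string            `json:"fingerprint"`
}

// webhookMessage is the top-level object AlertManager posts to the webhook URL.
type webhookMessage struct {
	Version           string            `json:"version"`
	GroupKey          string            `json:"groupKey"`
	TruncatedAlerts   int               `json:"truncatedAlerts"`
	Status            string            `json:"status"`
	Receiver          string            `json:"receiver"`
	GroupLabels       map[string]string `json:"groupLabels"`
	CommonLabels      map[string]string `json:"commonLabels"`
	CommonAnnotations map[string]string `json:"commonAnnotations"`
	ExternalURL       string            `json:"externalURL"`
	Alerts            []webhookAlert    `json:"alerts"`
}

// parseWebhook decodes a raw request body into a webhookMessage.
func parseWebhook(body []byte) (*webhookMessage, error) {
	msg := &webhookMessage{}
	if err := json.Unmarshal(body, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

// sample is a shortened payload; the summary text is a placeholder.
const sample = `{"receiver":"webhook","status":"resolved","alerts":[{"status":"resolved","labels":{"alertname":"cpumax","instance":"192.168.31.150:8089"},"annotations":{"summary":"placeholder"},"startsAt":"2021-06-19T03:21:56.117Z","endsAt":"2021-06-19T03:22:11.117Z","fingerprint":"1bcf523f0c524538"}],"groupLabels":{"instance":"192.168.31.150:8089"},"version":"4","truncatedAlerts":0}`

func main() {
	msg, err := parseWebhook([]byte(sample))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, a := range msg.Alerts {
		// For a resolved alert, endsAt minus startsAt is how long it was active.
		fmt.Println(a.Labels["alertname"], a.Status, a.EndsAt.Sub(a.StartsAt))
	}
}
```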
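Next, modify the Easy-Prometheus source (already pushed to GitHub) so that it listens for webhook notifications.
The access_token comes from creating a robot in a DingTalk group.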
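// Needs encoding/json, fmt, io/ioutil, log, net/http, plus the httpgo package this handler calls.
// Ding keeps only the part of the webhook payload used here: each alert's summary annotation.
type Ding struct {
	Alerts []struct {
		Annotations struct {
			Summary string `json:"summary"`
		} `json:"annotations"`
	} `json:"alerts"`
}

// dingding receives the AlertManager webhook and forwards the first alert's summary to the DingTalk robot.
func dingding(w http.ResponseWriter, r *http.Request) {
	s, err := ioutil.ReadAll(r.Body)
	if err != nil {
		log.Println(err)
		return
	}
	fmt.Println(string(s)) // log the raw payload
	ding := &Ding{}
	if err := json.Unmarshal(s, ding); err != nil {
		log.Println(err)
		return
	}
	if len(ding.Alerts) == 0 {
		return // nothing to forward
	}
	anno := ding.Alerts[0]
	req := &httpgo.Req{}
	x, err := req.Header("Content-Type", "application/json").
		Method(http.MethodPost).
		Url("https://oapi.dingtalk.com/robot/send?access_token=xxxxxxx").
		Params(httpgo.Query{
			"link": map[string]interface{}{
				"title": "AlertManager notification",
				"text":  "Notification: " + anno.Annotations.Summary,
				// the picture is a random one found online
				"picUrl": "https://photo.16pic.com/00/65/09/16pic_6509905_b.png",
				// clicking the message title jumps straight to Prometheus
				"messageUrl": "http://localhost:9090/alerts",
			},
			"msgtype": "link",
		}).Go().Body()
	if err != nil {
		log.Println(err)
	}
	fmt.Println(x)
}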
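For readers who do not use httpgo, the same DingTalk "link" message can be sent with the standard library alone. This is a rough sketch under my own assumptions, not the Easy-Prometheus implementation: the type and function names are hypothetical, and main talks to a local httptest server standing in for the DingTalk endpoint so that it runs offline.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// dingLink is the body of a DingTalk robot message with msgtype "link".
type dingLink struct {
	Title      string `json:"title"`
	Text       string `json:"text"`
	PicURL     string `json:"picUrl,omitempty"`
	MessageURL string `json:"messageUrl"`
}

type dingMessage struct {
	MsgType string   `json:"msgtype"`
	Link    dingLink `json:"link"`
}

// buildDingMessage wraps an alert summary in a link message pointing at promURL.
func buildDingMessage(summary, promURL string) ([]byte, error) {
	return json.Marshal(dingMessage{
		MsgType: "link",
		Link: dingLink{
			Title:      "AlertManager notification",
			Text:       "Notification: " + summary,
			MessageURL: promURL,
		},
	})
}

// sendDing posts body to robotURL, the full robot/send URL including access_token.
func sendDing(robotURL string, body []byte) error {
	resp, err := http.Post(robotURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("robot endpoint returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Local stand-in for the DingTalk endpoint; it just prints what it receives.
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		b, _ := io.ReadAll(r.Body)
		fmt.Println("robot received:", string(b))
	}))
	defer fake.Close()

	body, err := buildDingMessage("192.168.31.150:8089 CPU usage is above 20%!", "http://localhost:9090/alerts")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	if err := sendDing(fake.URL, body); err != nil {
		fmt.Println("send failed:", err)
	}
}
```

To send for real, pass the robot/send URL with your own access_token to sendDing instead of fake.URL, and call buildDingMessage once per alert if every alert in a group should be reported.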
4. Test the alerts
For this test I started several applications to push CPU utilization to 20% and keep it there for 3 seconds.
DingTalk