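- What is Prometheus?
- The Prometheus data model
- Metric types
- Using Prometheus in Go
- Registering metrics
- Using metrics
- Collecting metrics on the server side
- Querying and visualizing monitoring data
- PromQL queries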
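What is Prometheus?
Prometheus is an open-source system that combines monitoring, alerting, and a time-series database.
Its main components are:
- Prometheus server: scrapes and stores the data and provides the PromQL query language.
- Client libraries/SDKs: client libraries and SDKs for the various languages.
- Push Gateway: a push gateway for short-lived jobs. Clients actively push their metrics to the Push Gateway, and Prometheus scrapes them from there.
- Alertmanager: handles alerting.
- Exporters: cover special monitoring targets such as HAProxy, StatsD, and Graphite, and serve their samples to Prometheus in the standard format.
- Various other supporting tools.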
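The Prometheus data model
Prometheus stores everything as time series. Each time series is uniquely identified by a metric name and a set of labels (key=value pairs).
Metric name
Usually names the thing being monitored; it is roughly analogous to a table name in a database.
Labels
Identify the different dimensions of a time series; they are roughly analogous to the columns of a table.
[Example]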
rpcServiceRequestsHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Subsystem: "rpc_service_requests",
		Name:      "something",
		Help:      "HistogramOpts statistics of rpc requests received",
		Buckets:   []float64{0.001, 0.002, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1, 2, 5, 10},
	},
	[]string{"kind", "code", "source", "invoke_service", "invoke_method"},
)
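The fully qualified metric name is assembled as {{Namespace}}_{{Subsystem}}_{{Name}}, with empty parts skipped. No Namespace is set here, so the name is rpc_service_requests_something. Because this is a histogram, the series that are actually exported carry the suffixes _count, _sum, and _bucket, for example rpc_service_requests_something_count. The metric has the labels {"kind", "code", "source", "invoke_service", "invoke_method"}. In terms of a traditional database, you can think of it roughly as follows:
- There is a table named rpc_service_requests_something_count.
- The table has the query columns {"kind", "code", "source", "invoke_service", "invoke_method"}.
- The table's primary key is a timestamp.
- There is also a value column (of type float64) that stores the measured value.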
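The naming rule and the "name plus labels" identity described above can be sketched in plain Go. This is a standalone illustration that only uses the standard library: buildFQName imitates the behaviour of prometheus.BuildFQName, while seriesKey and the label values "server" and "0" are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildFQName joins the non-empty parts with "_". It imitates the behaviour
// of prometheus.BuildFQName but is a standalone sketch, not the library code.
func buildFQName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

// seriesKey (hypothetical helper) renders the identity of one time series as
// name{k="v",...}. Label names are sorted so that the same label set always
// yields the same key.
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

func main() {
	// No namespace is set, as in the example above.
	name := buildFQName("", "rpc_service_requests", "something")
	// The label values are made up for illustration.
	fmt.Println(seriesKey(name+"_count", map[string]string{"kind": "server", "code": "0"}))
}
```

Two samples belong to the same time series only if both the name and every label value match, which is why each new label value creates a new series.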
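Metric types
Counter
A Counter is a cumulative metric whose value only ever increases (it starts again from zero when the process restarts), which suits values such as the total number of requests handled.
type Counter interface {
	Metric
	Collector

	// Inc increments the counter by 1. Use Add to increment it by arbitrary
	// non-negative values.
	Inc()
	// Add adds the given value to the counter. It panics if the value is <
	// 0.
	Add(float64)
}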
Gauge
Unlike a Counter, a Gauge reflects the current state of the system, so its value can go up as well as down. Typical examples are CPU usage and memory consumption.
The interface therefore provides methods for both increasing and decreasing the value.
type Gauge interface {
	Metric
	Collector

	// Set sets the Gauge to an arbitrary value.
	Set(float64)
	// Inc increments the Gauge by 1. Use Add to increment it by arbitrary
	// values.
	Inc()
	// Dec decrements the Gauge by 1. Use Sub to decrement it by arbitrary
	// values.
	Dec()
	// Add adds the given value to the Gauge. (The value can be negative,
	// resulting in a decrease of the Gauge.)
	Add(float64)
	// Sub subtracts the given value from the Gauge. (The value can be
	// negative, resulting in an increase of the Gauge.)
	Sub(float64)
	// SetToCurrentTime sets the Gauge to the current Unix time in seconds.
	SetToCurrentTime()
}
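To make the contrast between the two types concrete, here is a minimal sketch of a counter and a gauge written with the standard library only. The types counter and gauge and the helper addPanics are hypothetical stand-ins for this example, not the client library's implementation; the only behaviour taken from the interfaces above is that a counter rejects negative increments while a gauge accepts them.

```go
package main

import (
	"fmt"
	"sync"
)

// counter only ever goes up. add panics on negative values, matching the
// contract documented on Counter.Add.
type counter struct {
	mu  sync.Mutex
	val float64
}

func (c *counter) add(v float64) {
	if v < 0 {
		panic("counter cannot decrease")
	}
	c.mu.Lock()
	c.val += v
	c.mu.Unlock()
}

func (c *counter) value() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.val
}

// gauge can be set to any value and moved in both directions.
type gauge struct {
	mu  sync.Mutex
	val float64
}

func (g *gauge) set(v float64) {
	g.mu.Lock()
	g.val = v
	g.mu.Unlock()
}

func (g *gauge) add(v float64) {
	g.mu.Lock()
	g.val += v
	g.mu.Unlock()
}

func (g *gauge) value() float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.val
}

// addPanics reports whether c.add(v) panics.
func addPanics(c *counter, v float64) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c.add(v)
	return false
}

func main() {
	var requests counter
	var inFlight gauge
	requests.add(1)  // one request handled in total
	inFlight.add(1)  // a request enters
	inFlight.add(-1) // the request leaves again
	fmt.Println(requests.value(), inFlight.value(), addPanics(&requests, -1))
}
```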
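Histogram
A Histogram tracks the size of events, typically request durations or response sizes. What sets it apart is that it groups the observations into buckets; it also exposes the total number of observations (count) and the sum of all observed values (sum).
type Histogram interface {
	Metric
	Collector

	// Observe adds a single observation to the histogram.
	Observe(float64)
}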
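The bucket bookkeeping behind Observe can be sketched as follows. This is a simplified model using only the standard library: the histogram type here is hypothetical and stores cumulative counts directly for brevity, so that counts[i] is the number of observations less than or equal to the i-th upper bound, which is how the exported _bucket series with the le label are to be read.

```go
package main

import (
	"fmt"
	"sort"
)

// histogram is a simplified model of a Prometheus histogram.
type histogram struct {
	upperBounds []float64 // sorted bucket upper bounds ("le"), without +Inf
	counts      []uint64  // cumulative counts; the last entry is the +Inf bucket
	sum         float64   // sum of all observed values
	count       uint64    // total number of observations
}

func newHistogram(bounds []float64) *histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &histogram{upperBounds: b, counts: make([]uint64, len(b)+1)}
}

// observe records one value: every bucket whose upper bound is >= v is
// incremented, and so is the +Inf bucket.
func (h *histogram) observe(v float64) {
	// Index of the first bucket whose upper bound is >= v.
	i := sort.SearchFloat64s(h.upperBounds, v)
	for ; i < len(h.counts); i++ {
		h.counts[i]++
	}
	h.sum += v
	h.count++
}

func main() {
	h := newHistogram([]float64{0.01, 0.1, 1})
	for _, seconds := range []float64{0.003, 0.03, 0.3, 3} {
		h.observe(seconds)
	}
	fmt.Println(h.counts, h.count, h.sum)
}
```

Because the buckets are plain counters, histograms from several instances can be added together bucket by bucket, which is what makes server-side aggregation possible.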
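Summary
A Summary is very similar to a Histogram and is likewise used to track the size of events, typically request durations and response sizes. Besides count and sum, it provides quantiles computed over a sliding time window (for example, the median). The quantiles are computed in the client in real time, which costs client-side performance, and they cannot be aggregated: a reported quantile only describes a single instance. A Histogram can instead have its quantiles computed on the server with the histogram_quantile function; the result is less accurate but can be aggregated across instances.
type Summary interface {
	Metric
	Collector

	// Observe adds a single observation to the summary.
	Observe(float64)
}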
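Why client-side quantiles cannot be aggregated is easiest to see with numbers. The sketch below uses a simple nearest-rank quantile function (an assumption made for this example; it is not the streaming algorithm a Summary uses) and two hypothetical instances: one that served nine fast requests and one that served a single slow request.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the q-quantile of values using the nearest-rank method.
// values must not be empty.
func quantile(q float64, values []float64) float64 {
	s := append([]float64(nil), values...)
	sort.Float64s(s)
	idx := int(math.Ceil(q*float64(len(s)))) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	// Hypothetical latencies in seconds.
	instanceA := []float64{0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1}
	instanceB := []float64{1.0}

	// Averaging the per-instance medians...
	averaged := (quantile(0.5, instanceA) + quantile(0.5, instanceB)) / 2
	// ...is not the median of all requests taken together.
	all := append(append([]float64(nil), instanceA...), instanceB...)
	merged := quantile(0.5, all)

	fmt.Println("average of medians:", averaged)
	fmt.Println("median of merged data:", merged)
}
```

The average of the two medians is far above the true median of the ten requests, because the per-instance quantiles carry no information about how many requests each instance handled.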
Using Prometheus in Go
go get github.com/prometheus/client_golang
Defining metrics
import "github.com/prometheus/client_golang/prometheus"

// Metric definitions.
var (
	// Counts incoming requests.
	httpRequestCounter = prometheus.NewCounter(
		prometheus.CounterOpts{
			Subsystem: "service",
			Name:      "http_request_total",
			Help:      "Total number of http_request",
		},
	)
	// prometheus.NewCounter vs. prometheus.NewCounterVec: the Vec variant
	// additionally takes a list of label names.
	//httpRequestCounter = prometheus.NewCounterVec(
	//	prometheus.CounterOpts{
	//		Subsystem: "service",
	//		Name:      "http_request_total",
	//		Help:      "Total number of http_request",
	//	},
	//	[]string{"kind"},
	//)
	// Tracks real-time concurrency (requests currently being handled).
	concurrentHttpRequestsGauge = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Subsystem: "sdk",
			Name:      "http_handle_concurrent",
			Help:      "Number of incoming HTTP Requests handling concurrently now.",
		},
	)
	// Tracks request volume, request latency, and so on.
	httpRequestsHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Subsystem: "sdk",
			Name:      "http_handle_requests",
			Help:      "Histogram statistics of http requests handle by elete http. Buckets by latency",
			Buckets:   []float64{0.001, 0.002, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1, 2, 5, 10},
		},
		[]string{"code"},
	)
	// Inside a var block the declaration must use "=", not ":=".
	summary = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{
			Name: "test_summary",
			Help: "test summary",
			// Quantiles to compute and the allowed error for each.
			Objectives: map[float64]float64{
				0.5:  0.05,
				0.9:  0.01,
				0.99: 0.001,
			},
		},
		[]string{"name"},
	)
)
Registering metrics
Once defined, metrics must be registered with a registry.
The Prometheus SDK provides a default registry; prometheus.MustRegister registers the given metrics with it.
// Register the metric collectors.
func init() {
	prometheus.MustRegister(httpRequestCounter)
	// prometheus.Register(httpRequestCounter) returns an error instead of panicking.
	prometheus.MustRegister(concurrentHttpRequestsGauge)
	prometheus.MustRegister(httpRequestsHistogram)
	prometheus.MustRegister(summary)
}
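Using metrics
For example, in a gin middleware:
import (
	"strconv"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
)

func GinMetricsMid() gin.HandlerFunc {
	return func(ctx *gin.Context) {
		// Count the request.
		httpRequestCounter.Inc()
		// Track concurrency: +1 when the request enters.
		concurrentHttpRequestsGauge.Inc()
		startTime := time.Now()
		// Run the remaining handlers.
		ctx.Next()
		// After the request: record the latency in seconds, labelled with
		// the HTTP status code that was written.
		code := strconv.Itoa(ctx.Writer.Status())
		httpRequestsHistogram.With(prometheus.Labels{"code": code}).Observe(time.Since(startTime).Seconds())
		// Track concurrency: -1 when the request leaves.
		concurrentHttpRequestsGauge.Dec()
	}
}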
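Collecting metrics on the server side
There are two main ways for the server to collect monitoring data:
- The Prometheus server pulls directly from the client.
- The client pushes its metrics to the Push Gateway, and the Prometheus server pulls from the Push Gateway.
Pull mode
Pull mode requires the client to expose an HTTP endpoint for scraping.
Put simply, the client starts an HTTP server and exposes a /metrics endpoint.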
// reusePort, GetFQDN and INFO are helpers defined elsewhere (not shown here).
// promhttp is github.com/prometheus/client_golang/prometheus/promhttp.
func StartMetricsHandler(metricsAddr string) string {
	...
	// Define an HTTP server.
	var prometheusExporter http.Server
	// Add the handler.
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	...
	prometheusExporter.Handler = mux
	// Open the listener: a random free port if no address was given.
	var ln net.Listener
	var err error
	if metricsAddr == "" {
		ln, err = net.Listen("tcp4", "0.0.0.0:0")
	} else {
		config := &net.ListenConfig{Control: reusePort}
		ln, err = config.Listen(context.Background(), "tcp", metricsAddr)
	}
	if err != nil {
		panic(fmt.Sprintf("can't listen port %v", err))
	}
	// Build the address of the HTTP server from the host name and the bound port.
	spr := strings.Split(ln.Addr().String(), ":")
	port := spr[len(spr)-1]
	url := fmt.Sprintf("%s:%s", GetFQDN(), port)
	prometheusExporter.Addr = url
	INFO.Printf("prometheus metrics server start at %s", url)
	// Start the metrics HTTP server.
	go func() { // serve goroutine
		prometheusExporter.Serve(ln)
	}()
	...
	// Return the address so it can be registered with the Prometheus server as a scrape target.
	return url
}
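Once the server is running, the address of the scrape target must be configured in the Prometheus server; only then will Prometheus scrape that endpoint.
prometheus.yml
....
scrape_configs:
  # Prometheus monitoring itself. The label job=<job_name> is attached to every scraped time series.
  - job_name: 'prometheus'
    # The default metrics path is /metrics, e.g. localhost:9090/metrics
    # The default scheme is http
    static_configs:
      - targets: ['localhost:9090']
....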
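This approach is not flexible enough, however, and it does not suit environments such as Docker containers: you cannot reconfigure Prometheus every time a container starts. Prometheus is therefore usually combined with service discovery.
Supported service discovery types: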
// prometheus/discovery/config/config.go
type ServiceDiscoveryConfig struct {
	StaticConfigs       []*targetgroup.Group           `yaml:"static_configs,omitempty"`
	DNSSDConfigs        []*dns.SDConfig                `yaml:"dns_sd_configs,omitempty"`
	FileSDConfigs       []*file.SDConfig               `yaml:"file_sd_configs,omitempty"`
	ConsulSDConfigs     []*consul.SDConfig             `yaml:"consul_sd_configs,omitempty"`
	ServersetSDConfigs  []*zookeeper.ServersetSDConfig `yaml:"serverset_sd_configs,omitempty"`
	NerveSDConfigs      []*zookeeper.NerveSDConfig     `yaml:"nerve_sd_configs,omitempty"`
	MarathonSDConfigs   []*marathon.SDConfig           `yaml:"marathon_sd_configs,omitempty"`
	KubernetesSDConfigs []*kubernetes.SDConfig         `yaml:"kubernetes_sd_configs,omitempty"`
	GCESDConfigs        []*gce.SDConfig                `yaml:"gce_sd_configs,omitempty"`
	EC2SDConfigs        []*ec2.SDConfig                `yaml:"ec2_sd_configs,omitempty"`
	OpenstackSDConfigs  []*openstack.SDConfig          `yaml:"openstack_sd_configs,omitempty"`
	AzureSDConfigs      []*azure.SDConfig              `yaml:"azure_sd_configs,omitempty"`
	TritonSDConfigs     []*triton.SDConfig             `yaml:"triton_sd_configs,omitempty"`
}
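Service registration
The client registers its metrics address with a registry such as Consul, and Prometheus discovers the new targets to monitor on its own.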
Push Gateway mode
import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

var (
	pusher *push.Pusher
	// Counts incoming requests, partitioned by kind.
	httpRequestCounter = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Subsystem: "service",
			Name:      "http_request_total",
			Help:      "Total number of http_request",
		},
		[]string{"kind"},
	)
	// Tracks real-time concurrency (requests currently being handled).
	concurrentHttpRequestsGauge = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Subsystem: "sdk",
			Name:      "http_handle_concurrent",
			Help:      "Number of incoming HTTP Requests handling concurrently now.",
		},
	)
	// Tracks request volume, request latency, and so on.
	httpRequestsHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Subsystem: "sdk",
			Name:      "http_handle_requests",
			Help:      "Histogram statistics of http requests handle by elete http. Buckets by latency",
			Buckets:   []float64{0.001, 0.002, 0.005, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 1, 2, 5, 10},
		},
		[]string{"code"},
	)
	summary = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{
			Name: "test_summary",
			Help: "test summary",
			// Quantiles to compute and the allowed error for each.
			Objectives: map[float64]float64{
				0.5:  0.05,
				0.9:  0.01,
				0.99: 0.001,
			},
		},
		[]string{"name"},
	)
	completionTime = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "db_backup_last_completion_timestamp_seconds",
		Help: "The timestamp of the last successful completion of a DB backup.",
	})
)

func main() {
	// Create a pusher.
	pusher = push.New("http://pushgateway:9091", "job_name")
	// Add some grouping keys to the pusher.
	pusher.Grouping("service", "live_backend_go").Grouping("host", "localhost")
	// Add a single metric collector to the pusher.
	pusher.Collector(completionTime)
	// Add several metrics at once through a custom registry.
	registry := prometheus.NewRegistry()
	registry.MustRegister(httpRequestCounter, concurrentHttpRequestsGauge, httpRequestsHistogram, summary)
	// Attach the registry to the pusher.
	pusher.Gatherer(registry)
	// Push the collected metrics to the Push Gateway.
	if err := pusher.Push(); err != nil { // uses HTTP PUT
		fmt.Println("push failed:", err)
	}
	if err := pusher.Add(); err != nil { // uses HTTP POST
		fmt.Println("add failed:", err)
	}
}
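The difference between Push and Add, as explained in the source comments, is roughly this:
- Push uses HTTP PUT. It replaces all metrics in the Push Gateway that share the same job name and grouping key (the metrics pushed earlier for that group are wiped).
- Add uses HTTP POST. It only replaces the metrics whose names appear in the current push (under the same job name and grouping key); other metrics in the group are kept.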
// Push collects/gathers all metrics from all Collectors and Gatherers added to
// this Pusher. Then, it pushes them to the Pushgateway configured while
// creating this Pusher, using the configured job name and any added grouping
// labels as grouping key. All previously pushed metrics with the same job and
// other grouping labels will be replaced with the metrics pushed by this
// call. (It uses HTTP method "PUT" to push to the Pushgateway.)
//
// Push returns the first error encountered by any method call (including this
// one) in the lifetime of the Pusher.
func (p *Pusher) Push() error {
	return p.push(http.MethodPut)
}

// Add works like push, but only previously pushed metrics with the same name
// (and the same job and other grouping labels) will be replaced. (It uses HTTP
// method "POST" to push to the Pushgateway.)
func (p *Pusher) Add() error {
	return p.push(http.MethodPost)
}
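The difference between the two methods can be modelled with plain maps. In the sketch below, a group stands for everything the Push Gateway stores under one job name and grouping key; the type group, the functions put and post, and the metric names are all hypothetical and only mirror the semantics described in the comments above.

```go
package main

import (
	"fmt"
	"sort"
)

// group models the metrics stored under one job name and grouping key,
// as metric name -> value.
type group map[string]float64

// put models Push (HTTP PUT): the whole group is replaced by the pushed
// metrics, so the previously stored content is ignored.
func put(_ group, pushed group) group {
	out := group{}
	for name, v := range pushed {
		out[name] = v
	}
	return out
}

// post models Add (HTTP POST): only metrics with the same name are
// replaced; everything else in the group is kept.
func post(existing, pushed group) group {
	out := group{}
	for name, v := range existing {
		out[name] = v
	}
	for name, v := range pushed {
		out[name] = v
	}
	return out
}

// names returns the metric names of g in sorted order.
func names(g group) []string {
	out := make([]string, 0, len(g))
	for name := range g {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical metrics already stored for this group.
	stored := group{"backup_duration_seconds": 12, "backup_records_total": 100}
	// A new push that only contains one of them.
	pushed := group{"backup_duration_seconds": 15}

	fmt.Println("after Push:", names(put(stored, pushed)))
	fmt.Println("after Add: ", names(post(stored, pushed)))
}
```

After the PUT only backup_duration_seconds remains, whereas after the POST backup_records_total is still present with its old value.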
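Querying and visualizing monitoring data
Common Grafana UI operations
Creating a dashboard
- The first step of visualization is a dashboard. On the Grafana home page, click [+] - [Create] - [Dashboard] in the left sidebar.
Creating a panel
- Create the query.
- Configure the panel: choose the panel type, for example a histogram.
Alert configuration
- First, go to the alert rule settings page from the home page and add a notification channel. (The production environment should already have four P-level notification channels preset; add them if they are missing.)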
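PromQL queries
Official documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
Commonly used functions
- rate(): computes the per-second average rate of increase of the time series in a range vector.
- irate(): computes the per-second instant rate of increase of the time series in a range vector, based on the last two data points. The official documentation advises: "irate should only be used when graphing volatile, fast-moving counters. Use rate for alerts and slow-moving counters".
- sum(): aggregation.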
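The difference between rate() and irate() can be sketched on a handful of counter samples. The code below is a simplification under stated assumptions: it treats any drop in the counter as a reset to zero, and it divides by the time between the first and last sample, whereas the real rate() additionally extrapolates to the boundaries of the range. The sample type and the numbers are made up for the example.

```go
package main

import "fmt"

// sample is one scraped counter value.
type sample struct {
	t float64 // Unix timestamp in seconds
	v float64 // counter value
}

// increase sums the deltas between consecutive samples. A drop in value is
// treated as a counter reset, i.e. the counter restarted from zero.
func increase(samples []sample) float64 {
	total := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].v - samples[i-1].v
		if delta < 0 { // counter reset
			delta = samples[i].v
		}
		total += delta
	}
	return total
}

// rate is the average per-second increase over the whole window.
func rate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	return increase(samples) / (samples[len(samples)-1].t - samples[0].t)
}

// irate is the per-second increase between the last two samples only.
func irate(samples []sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	return rate(samples[len(samples)-2:])
}

func main() {
	// Steady traffic, then a burst just before the last scrape.
	window := []sample{{0, 100}, {15, 130}, {30, 160}, {45, 250}}
	fmt.Println("rate: ", rate(window))
	fmt.Println("irate:", irate(window))
}
```

The burst dominates irate because only the last two points count, while rate averages it over the whole window; this is why rate is the steadier choice for alerting.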
Examples
var (
	NormalHistogram = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Namespace: "test",
			Subsystem: "normal_app",
			Name:      "normal_http_histogram",
			Buckets:   []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1},
		},
		[]string{"code", "invoke_service", "invoke_method"},
	)
)
- Query the per-second request rate of each endpoint
sum(rate(test_normal_app_normal_http_histogram_count{}[1m])) by (invoke_method, code)
- Query the average latency of an endpoint
sum(rate(test_normal_app_normal_http_histogram_sum{}[1m])) by (invoke_method, code) / sum(rate(test_normal_app_normal_http_histogram_count{}[1m])) by (invoke_method, code)
- Query latency quantiles of an endpoint
Query the 0.99 quantile:
histogram_quantile(0.99, sum(rate(test_normal_app_normal_http_histogram_bucket{}[1m])) by (invoke_method, le))
- Percentage of HTTP 400 responses of an endpoint (error rate)
(sum(rate(test_normal_app_normal_http_histogram_count{ code="400"}[1m])) by (invoke_method) / sum(rate(test_normal_app_normal_http_histogram_count{}[1m])) by (invoke_method)) * 100
- Query service concurrency (the number of requests being handled at the same moment)
This uses NormalGauge:
sum(test_normal_app_normal_http_gauge{})