kafka监控实战(jmxtrans+InfluxDb+Grafana)
一、需求分析
一直在调研找一款好用的kafka监控,我测试使用过的KafkaOffsetMonitor、Burrow、kafka-monitor、Kafka-Manager,他们各有优缺点,下面会介绍, 并且它们基本上都是监控topic的写入和读取等等,没有提供对于整体集群的监控信息,比如集群的分片、延时、内存使用情况等等,无意中发现了jmxtrans,jmxtrans它是一个通过jmx采集java应用的数据采集器,他的输出可以是Graphite、StatsD、Ganglia、InfluxDb等等通过Grafana做展示,下面就给大家介绍一下jmxtrans+InfluxDb+Grafana监控kafka的整体解决方案,并且不需要任何额外的开发工作,完全使用原生的。
其余三种监控程序的优缺点
Kafka Web Console:监控功能较为全面,可以预览消息,监控Offset、Lag等信息,但存在bug,不建议在生产环境中使 用
Kafka Manager:偏向Kafka集群管理,若操作不当,容易导致集群出现故障。对Kafka实时生产和消费消息是通过JMX实现的。没有记录Offset、Lag等信息。
KafkaOffsetMonitor:程序一个jar包的形式运行,部署较为方便。只有监控功能,使用起来也较为安全。若只需要监控功能,推荐使用KafkaOffsetMonito,若偏重Kafka集群管理,推荐使用Kafka Manager。因为都是开源程序,稳定性欠缺。故需先了解清楚目前已存在哪些Bug,多测试一下,避免出现类似于Kafka Web Console的问题。
二、环境介绍
1、角色
a、10.10.10.10 InfluxDb
b、10.10.10.100 Grafana
c、10.10.30.69 jmxtrans
d、kafka集群
10.10.20.14 node1
10.10.20.15 node2
10.10.20.16 node3
10.10.20.17 node4
2、软件版本
influxdb-1.2.4-1.x86_64
grafana-4.1.1-1484211277.x86_64
jmxtrans-266.rpm
kafka_2.10-0.9.0.0.jar.asc
3、架构图
三、配置规划
1、jmxtrans我们可以分别在每台kafka节点上部署,也可以部署到一台机器上,我这里是选择了后者,因为我的集群小,这样配置文件可以集中管理,如果集群比较大,可以考虑分散部署。
2、关于jmxtrans的配置文件,分全局指标(每个kafka节点)和topic指标,全局指标每个节点一个配置文件,命名规则:base_10.10.20.14.json,topic指标是每个topic一个配置文件,命名规则:falcon_monitor_us_17.json
四、监控指标
1、全局指标
每秒输入的流量
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec""attr" : [ "Count" ]"resultAlias":"BytesInPerSec""tags" : {"application" : "BytesInPerSec"}
每秒输入的流量
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec""attr" : [ "Count" ]"resultAlias":"BytesOutPerSec""tags" : {"application" : "BytesOutPerSec"}
每秒输入的流量
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec""attr" : [ "Count" ]"resultAlias":"BytesRejectedPerSec""tags" : {"application" : "BytesRejectedPerSec"}
每秒的消息写入总量
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec""attr" : [ "Count" ]"resultAlias":"MessagesInPerSec""tags" : {"application" : "MessagesInPerSec"}
每秒FetchFollower的请求次数
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower""attr" : [ "Count" ]"resultAlias":"RequestsPerSec""tags" : {"request" : "FetchFollower"}
每秒FetchConsumer的请求次数
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer""attr" : [ "Count" ]"resultAlias":"RequestsPerSec""tags" : {"request" : "FetchConsumer"}
每秒Produce的请求次数
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce""attr" : [ "Count" ]"resultAlias":"RequestsPerSec""tags" : {"request" : "Produce"}
内存使用的使用情况
"obj" : "java.lang:type=Memory""attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ]"resultAlias":"MemoryUsage""tags" : {"application" : "MemoryUsage"}
GC的耗时和次数
"obj" : "java.lang:type=GarbageCollector,name=*""attr" : [ "CollectionCount","CollectionTime" ]"resultAlias":"GC""tags" : {"application" : "GC"}
线程的使用情况
"obj" : "java.lang:type=Threading""attr" : [ "PeakThreadCount","ThreadCount" ]"resultAlias":"Thread""tags" : {"application" : "Thread"}
副本落后主分片的最大消息数量
"obj" : "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica""attr" : [ "Value" ]"resultAlias":"ReplicaFetcherManager""tags" : {"application" : "MaxLag"}
该broker上的partition的数量
"obj" : "kafka.server:type=ReplicaManager,name=PartitionCount""attr" : [ "Value" ]"resultAlias":"ReplicaManager""tags" : {"application" : "PartitionCount"}
正在做复制的partition的数量
"obj" : "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions""attr" : [ "Value" ]"resultAlias":"ReplicaManager""tags" : {"application" : "UnderReplicatedPartitions"}
Leader的replica的数量
"obj" : "kafka.server:type=ReplicaManager,name=LeaderCount""attr" : [ "Value" ]"resultAlias":"ReplicaManager""tags" : {"application" : "LeaderCount"}
一个请求FetchConsumer耗费的所有时间
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer""attr" : [ "Count","Max" ]"resultAlias":"TotalTimeMs""tags" : {"application" : "FetchConsumer"}
一个请求FetchFollower耗费的所有时间
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower""attr" : [ "Count","Max" ]"resultAlias":"TotalTimeMs""tags" : {"application" : "FetchFollower"}
一个请求Produce耗费的所有时间
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce""attr" : [ "Count","Max" ]"resultAlias":"TotalTimeMs""tags" : {"application" : "Produce"}
2、topic的监控指标
falcon_monitor_us每秒的写入流量
"kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us""attr" : [ "Count" ]"resultAlias":"falcon_monitor_us""tags" : {"application" : "BytesInPerSec"}
falcon_monitor_us每秒的输出流量
"kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=falcon_monitor_us""attr" : [ "Count" ]"resultAlias":"falcon_monitor_us""tags" : {"application" : "BytesOutPerSec"}
falcon_monitor_us每秒写入消息的数量
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=falcon_monitor_us""attr" : [ "Count" ]"resultAlias":"falcon_monitor_us""tags" : {"application" : "MessagesInPerSec"}
falcon_monitor_us在每个分区最后的Offset
"obj" : "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*""attr" : [ "Value" ]"resultAlias":"falcon_monitor_us""tags" : {"application" : "LogEndOffset"}
PS:
1、参数说明
"obj"对应jmx的ObjectName,就是我们要监控的指标
"attr"对应ObjectName的属性,可以理解为我们要监控的指标的值
"resultAlias"对应metric 的名称,在InfluxDb里面就是MEASUREMENTS名
"tags" 对应InfluxDb的tag功能,对与存储在同一个MEASUREMENTS里面的不同监控指标可以做区分,我们在用Grafana绘图的时候会用到,建议对每个监控指标都打上tags
2、对于全局监控,每一个监控指标对应一个MEASUREMENTS,所有的kafka节点同一个监控指标数据写同一个MEASUREMENTS ,对于topc监控的监控指标,同一个topic所有kafka节点写到同一个MEASUREMENTS,并且以topic名称命名
五、安装
1、kafka
这里不详细介绍kafka集群的安装,主要说一下kafka的启动方式,因为我们需要通过jmx采集kafka的监控数据,所以在kafka的启动时候需要启动jmx端口,启动方式如下:
cd /data/kafka/bin/
JMX_PORT=9999 nohup ./kafka-server-start.sh ../config/server.properties >/dev/null 2>&1 &
2、InfluxDb
yum -y install influxdb ##安装
/etc/init.d/influxdb start ##启动服务
[root@ip-10-10-10-10 jmxtrans]# influx
Connected to http://localhost:8086 version 1.3.2
InfluxDB shell version: 1.3.2> CREATE USER "root" WITH PASSWORD '123456' WITH ALL PRIVILEGES ##添加一个账号>
3、Grafana
yum -y install grafana ##安装
/etc/init.d/grafana-server start ##启动服务
4、jmxtrans
wget http://central.maven.org/maven2/org/jmxtrans/jmxtrans/266/jmxtrans-266.rpm
rpm -ivh jmxtrans-266.rpm ##安装
/etc/init.d/jmxtrans start ##启动
六、配置
这里主要介绍jmxtrans采集数据的配置文件撰写和Grafana绘图的配置注意事项,kafka和InfluxDb的配置这里不做描述。
1、jmxtrans
a、jmxtrans默认读取/var/lib/jmxtrans下的配置文件去采集数据的,所以我们把采集kafka监控数据的配置文件都在这个目录下,下面是我的配置文件命名规范:
[root@ip-10-10-30-69 jmxtrans]# ll
total 96
-rw-r--r-- 1 root root 1657 Aug 18 17:03 article-feedback-10min-json_14.json
-rw-r--r-- 1 root root 1657 Aug 18 17:03 article-feedback-10min-json_15.json
-rw-r--r-- 1 root root 1657 Aug 18 17:04 article-feedback-10min-json_16.json
-rw-r--r-- 1 root root 1657 Aug 18 17:04 article-feedback-10min-json_17.json
-rw-r--r-- 1 root root 8430 Aug 22 08:24 base_10.10.20.14.json
-rw-r--r-- 1 root root 8431 Aug 22 08:24 base_10.10.20.15.json
-rw-r--r-- 1 root root 8431 Aug 22 08:25 base_10.10.20.16.json
-rw-r--r-- 1 root root 8431 Aug 22 08:25 base_10.10.20.17.json
-rw-r--r-- 1 root root 2027 Aug 21 16:19 falcon_monitor_us_14.json
-rw-r--r-- 1 root root 2027 Aug 21 16:20 falcon_monitor_us_15.json
-rw-r--r-- 1 root root 2484 Aug 21 20:58 falcon_monitor_us_16.json
-rw-r--r-- 1 root root 2027 Aug 21 16:20 falcon_monitor_us_17.json
-rw-r--r-- 1 root root 2147 Aug 21 17:43 highgmp-articles-through-primary_14.json
-rw-r--r-- 1 root root 2147 Aug 21 17:46 highgmp-articles-through-primary_15.json
-rw-r--r-- 1 root root 2147 Aug 21 17:46 highgmp-articles-through-primary_16.json
-rw-r--r-- 1 root root 2147 Aug 21 17:47 highgmp-articles-through-primary_17.json[root@ip-10-10-30-69 jmxtrans]# pwd
/var/lib/jmxtrans
b、全局监控的配置文件,以10.10.20.14为例:
[root@ip-10-10-30-69 jmxtrans]# cat base_10.10.20.14.json{
"servers" : [ {
"port" : "9999",
"host" : "10.10.20.14",
"queries" : [ {
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
"attr" : [ "Count","OneMinuteRate" ],
"resultAlias":"BytesInPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "BytesInPerSec"}
} ]
},{
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",
"attr" : [ "Count","OneMinuteRate" ],
"resultAlias":"BytesOutPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "BytesOutPerSec"}
} ]
},{
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec",
"attr" : [ "Count","OneMinuteRate" ],
"resultAlias":"BytesRejectedPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "BytesRejectedPerSec"}
} ]
},{
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
"attr" : [ "Count","OneMinuteRate" ],
"resultAlias":"MessagesInPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "MessagesInPerSec"}
} ]
},{
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchConsumer",
"attr" : [ "Count" ],
"resultAlias":"RequestsPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"request" : "FetchConsumer"}
} ]
},{
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=FetchFollower",
"attr" : [ "Count" ],
"resultAlias":"RequestsPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"request" : "FetchFollower"}
} ]
},{
"obj" : "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce",
"attr" : [ "Count" ],
"resultAlias":"RequestsPerSec",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"request" : "Produce"}
} ]
},{
"obj" : "java.lang:type=Memory",
"attr" : [ "HeapMemoryUsage", "NonHeapMemoryUsage" ],
"resultAlias":"MemoryUsage",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "MemoryUsage"}
} ]
},{
"obj" : "java.lang:type=GarbageCollector,name=*",
"attr" : [ "CollectionCount","CollectionTime" ],
"resultAlias":"GC",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "GC"}
} ]
},{
"obj" : "java.lang:type=Threading",
"attr" : [ "PeakThreadCount","ThreadCount" ],
"resultAlias":"Thread",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "Thread"}
} ]
},{
"obj" : "kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica",
"attr" : [ "Value" ],
"resultAlias":"ReplicaFetcherManager",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "MaxLag"}
} ]
},{
"obj" : "kafka.server:type=ReplicaManager,name=PartitionCount",
"attr" : [ "Value" ],
"resultAlias":"ReplicaManager",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "PartitionCount"}
} ]
},{
"obj" : "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
"attr" : [ "Value" ],
"resultAlias":"ReplicaManager",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "UnderReplicatedPartitions"}
} ]
},{
"obj" : "kafka.server:type=ReplicaManager,name=LeaderCount",
"attr" : [ "Value" ],
"resultAlias":"ReplicaManager",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "LeaderCount"}
} ]
},{
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer",
"attr" : [ "Count","Max" ],
"resultAlias":"TotalTimeMs",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "FetchConsumer"}
} ]
},{
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower",
"attr" : [ "Count","Max" ],
"resultAlias":"TotalTimeMs",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "FetchConsumer"}
} ]
},{
"obj" : "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
"attr" : [ "Count","Max" ],
"resultAlias":"TotalTimeMs",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "Produce"}
} ]
},{
"obj" : "kafka.server:type=ReplicaManager,name=IsrShrinksPerSec",
"attr" : [ "Count" ],
"resultAlias":"ReplicaManager",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "IsrShrinksPerSec"}
} ]
}]
} ]}
c、topic监控的配置文件,以falcon_monitor_us的10.10.20.14节点为例:
[root@ip-10-10-30-69 jmxtrans]# cat falcon_monitor_us_14.json{
"servers" : [ {
"port" : "9999",
"host" : "10.10.20.14",
"queries" : [ {
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=falcon_monitor_us",
"attr" : [ "Count" ],
"resultAlias":"falcon_monitor_us",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "BytesInPerSec"}
} ]
},{
"obj" : "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec,topic=falcon_monitor_us",
"attr" : [ "Count" ],
"resultAlias":"falcon_monitor_us",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "BytesOutPerSec"}
} ]
},{
"obj" : "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=falcon_monitor_us",
"attr" : [ "Count" ],
"resultAlias":"falcon_monitor_us",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "MessagesInPerSec"}
} ]
},{
"obj" : "kafka.log:type=Log,name=LogEndOffset,topic=falcon_monitor_us,partition=*",
"attr" : [ "Value" ],
"resultAlias":"falcon_monitor_us",
"outputWriters" : [ {
"@class" : "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
"url" : "http://10.10.10.10:8086/",
"username" : "root",
"password" : "root",
"database" : "jmxDB",
"tags" : {"application" : "LogEndOffset"}
} ]
}]
} ]}
2、Grafana配置
a、添加数据源
Url、Database、User、Password需要和jmxtrans采集数据配置文件里面的写一致,然后点击Save&Test,提示成功就正常了
b、创建一个dashboard,然后在这里配置每一个监控指标的图
c、要点说明
1、对于监控指标为Count的监控项,需要通过Grafana做计算得到我们想要的监控,比如BytesInPerSec这个指标,它的监控值是一个累计值,我们想要取到每秒的流量,肯定需要计算,(本次采集的值-上次采集的值)/60 ,jmxtrans是一分钟采集一次数据,具体配置参考下面截图:
2、X轴的单位选择,比如流量的单位、时间的单位、每秒消息的个数无单位等等,下面分布举一个例子介绍说明因为我们是一分钟采集一次数据,所以group by 和derivative选1分钟;因为我们要每秒的流量,所以math这里除以60
设置流量的单位 ,点击需要设置的图,选择"Edit"进入编辑页面,切到Axes这个tab页,Unit--》data(Metric)--》bytes
设置时间的单位 ,点击需要设置的图,选择"Edit"进入编辑页面,切到Axes这个tab页,Unit--》time--》milliseconds(ms)
设置按原始值展示,无单位 ,点击需要设置的图,选择"Edit"进入编辑页面,切到Axes这个tab页,Unit--》none--》none
七、收获总结
1、关于jmx收集了kafka的那些指标,对应的值都是那些类型,对应这个问题走了很多弯路,各种谷歌百度拿到了有人整理过的,一个一个试,发现很多不能用,要不就是写的是错误的,要不就是版本不同,写法不一样,最后看到了jconsole这个工具,他可以连接到本地或者远程的jmx端口,能看到在收集的所有指标,在windows下装好jdk,在bin目录你可以找到这个工具。
2、关于consumer的延时,关官方介绍有一个type是 type=consumer-fetch-manager-metrics的指标,但是我这通过jconsole连进来死活没有找到,昨天我已经用 pyhon写了一个小脚本,通过crontab 一分钟采集一次,入库到InfluxDB,也可以通过Grafana做展示了。
a、数据采集的脚本
#!/usr/bin/env python
#-*-coding:utf8-*-
from influxdb import InfluxDBClient
from kafka.consumer import SimpleConsumer
from kazoo.client import KazooClient
from kafka.client import SimpleClientimport timeimport datetime#Zookeepers
zookeepers="10.10.20.14"#Kafka
kafka="10.10.20.14:9092"#consumer group
group_list = ["monitor","topic1"]#InfluxDB client
client = InfluxDBClient('10.10.10.10', 8086, 'root', 'root', 'jmxDB', timeout=2, retries=3)#time##now = time.strftime("%Y-%m-%dT%H:%M:%SZ")
now = ((datetime.datetime.now()-datetime.timedelta(hours=8)).strftime("%Y-%m-%d %H:%M"))if __name__ == '__main__':
try:
broker = SimpleClient(hosts=kafka)
zk = KazooClient(hosts=zookeepers, read_only=True)
zk.start()
for group in group_list:
print group
topics = zk.get_children("/consumers/%s/owners" %(group))
for topic in topics:
try:
consumer = SimpleConsumer(broker, group, str(topic))
lag = int(consumer.pending())
json_body = [
{
"measurement": "lag",
"tags": {
"Consumer_id": group,
"topic": topic,
"addr": kafka
},
"time": now,
"fields": {
"value": int(lag)
}
}
]
try:
client.write_points(json_body)
except Exception as e:
print e
except Exception as e:
print e
except Exception as e:
print e
finally:
zk.stop()
b、Grafana的配置,下面这张图是topic1的两个Consumer的延时情况:
官网监控指标如下:
http://kafka.apache.org/documentation/#monitoring
八、grafana的报警(补充)
所有的指标都收集齐了,最后一步就是报警了,邮件报警配置很简单,这里就不多说了,好多公司现在办公都用钉钉,当然我们也在用钉钉,grafana从4.3开始支持钉钉报警了,这里介绍下具体的配置方法
1、创建一个钉钉群,把需要接收报警的人拉到群里
2、创建一个钉钉机器人并获取url,具体方法如下:
a、在PC版钉钉上,如下图1、2、3操作
b、点击“添加 ”
3、在grafana上配置钉钉报警
a、按下图方法添加一条通知策略
b、Name自己随便定义,勾选下面两个勾,Url这个填写上面在添加钉钉机器人复制的url,保存即可。
4、配置需要发报警的监控指标,编辑需要配置报警的监控图,切换到Alert页面,点击"Create Alert"到下图配置页面,
Alert Config,Name 会根据图的Title自动生成,可以修改,Evaluate every多久执行一次
Conditions,主要说一下query的第一个参数,如果以在一个图上有多个监控指标,需要多那个指标报警,选择对应的Metrics的行就可以了,后面两个参数就是时间范围了,这个报警的配置接收如下:
在5分钟内,如果A这个Metrics的平均值大于100 就报警,或者C这个Metrics的最大值大于200就报警;如果没有数据或者数据为null也报警,如果执行错误或者超时也报警
切换到Notifications,点击"+"号添加钉钉报警,并在Message添加报警内容,切记要执行“Ctrl +S”保存页面,不然刚才的配置白瞎了
ok,到此钉钉报警已经配置完成,点击钉钉群里的链接可以查看报警监控指标的具体情况。
转自:http://blog.51cto.com/navyaijm/1958376