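When monitoring Zabbix's own internal performance, we usually rely on the following metrics to gauge how the service is doing:
nvps, queue, update percent, process busy, proxy pending send data, and cache.
Adding monitoring for these makes it possible to spot Zabbix performance problems and then optimize in a targeted way.
Each metric is described briefly below: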
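1. nvps: the number of new values processed per second. It is a theoretical value, derived from the configured item intervals rather than measured.
SQL to obtain it:
Whole cluster: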
SELECT SUM(1.0/i.delay) AS qps FROM items i, hosts h WHERE i.status='0' AND i.hostid=h.hostid AND h.status='0' AND i.delay<>0;
Broken down by proxy:
SELECT h.proxy_hostid, SUM(1.0/i.delay) AS qps FROM items i, hosts h WHERE i.status='0' AND i.hostid=h.hostid AND h.status='0' AND i.delay<>0 AND h.proxy_hostid IS NOT NULL GROUP BY h.proxy_hostid;
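As a rough illustration of what the two queries above compute, the following Python sketch sums 1/delay over enabled items on enabled hosts. The `Item` record and its field names are hypothetical stand-ins for joined rows of the `items` and `hosts` tables, not part of Zabbix.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    # Hypothetical in-memory stand-in for one joined items/hosts row.
    delay: int                   # update interval in seconds (items.delay)
    item_enabled: bool           # items.status = 0
    host_enabled: bool           # hosts.status = 0
    proxy_hostid: Optional[int]  # hosts.proxy_hostid, None if monitored directly


def _counts(item: Item) -> bool:
    """Same filter as the WHERE clause: enabled item, enabled host, delay <> 0."""
    return item.item_enabled and item.host_enabled and item.delay != 0


def theoretical_nvps(items):
    """Sum of 1/delay over all counted items (whole cluster)."""
    return sum(1.0 / i.delay for i in items if _counts(i))


def nvps_by_proxy(items):
    """Same sum, grouped by proxy; hosts without a proxy are skipped."""
    totals = defaultdict(float)
    for i in items:
        if _counts(i) and i.proxy_hostid is not None:
            totals[i.proxy_hostid] += 1.0 / i.delay
    return dict(totals)


# Made-up sample data.
items = [
    Item(delay=60, item_enabled=True, host_enabled=True, proxy_hostid=10100),
    Item(delay=30, item_enabled=True, host_enabled=True, proxy_hostid=None),
    Item(delay=0, item_enabled=True, host_enabled=True, proxy_hostid=10100),
    Item(delay=60, item_enabled=False, host_enabled=True, proxy_hostid=10100),
]
print(theoretical_nvps(items))
print(nvps_by_proxy(items))
```

Only the first two sample items contribute to the cluster total, and only the first contributes to proxy 10100, because the other two are filtered out by the zero delay and the disabled status.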
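2. queue: how delayed the data is. For example, if an item's interval is set to 60s but the item is only updated after about 70s, the data is delayed by 10s.
The larger the queue value, the more likely it is that Zabbix has an internal performance problem. The most common cause is busy poller or trapper processes.
This is an internal check; the following items can be created: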
zabbix[queue]
zabbix[queue,5m]
zabbix[queue,10m]
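To make the delay arithmetic concrete, here is a minimal sketch that counts items overdue by at least a given threshold, similar in spirit to what `zabbix[queue,5m]` reports. It assumes an item is due at its last update time plus its interval; Zabbix's real scheduling is more involved, and the function name and tuple layout are hypothetical.

```python
def queue_size(items, now, min_delay=0):
    """Count items whose update is overdue by at least min_delay seconds.

    items: iterable of (lastclock, interval) pairs in seconds (hypothetical layout).
    Assumption: an item is due at lastclock + interval.
    """
    overdue = 0
    for lastclock, interval in items:
        late_by = now - (lastclock + interval)
        if late_by > 0 and late_by >= min_delay:
            overdue += 1
    return overdue


now = 1_000_000
items = [
    (now - 70, 60),    # 10s late: 60s interval, updated 70s ago
    (now - 30, 60),    # not due yet
    (now - 400, 60),   # 340s late
    (now - 700, 60),   # 640s late
]
print(queue_size(items, now))        # all overdue items
print(queue_size(items, now, 300))   # overdue by at least 5 minutes
print(queue_size(items, now, 600))   # overdue by at least 10 minutes
```

Tracking several thresholds side by side, as the three items above do, helps distinguish a brief backlog from items that are stuck for a long time.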
3. update percent:
Measures how up to date item values are. A low percentage indicates that data is delayed or that data from some agents is abnormal.
1) Whole cluster:
select a.aa / b.bb from
(select count(*) as aa from items
 where lastclock > UNIX_TIMESTAMP() - 1800 and delay < 900
 and hostid in (select hostid from hosts where status = 0)
 and status = 0
) a,
(select count(*) as bb from items
 where delay < 900 and status = 0
 and hostid in (select hostid from hosts where status = 0)
) b;
2) Per proxy:
select a.aa / b.bb from
(select count(*) as aa from items
 where lastclock > UNIX_TIMESTAMP() - 1800 and delay < 900
 and hostid in (select hostid from hosts where status = 0 and proxy_hostid = 10100)
 and status = 0
) a,
(select count(*) as bb from items
 where delay < 900 and status = 0
 and hostid in (select hostid from hosts where status = 0 and proxy_hostid = 10100)
) b;
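Here proxy_hostid is the ID of the proxy in question (10100 is just an example).
3) Per host, which pinpoints hosts whose values are not updating properly (more accurate than the unreachable alert):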
select h.hostname, c.ip, p.update_percent as uppercent from
(select a.hostid, round(a.aa * 100 / b.bb, 2) as update_percent from
 (select hostid, count(*) as aa from items
  where lastclock > UNIX_TIMESTAMP() - 1800 and delay < 900
  and hostid in (select hostid from hosts where status = 0)
  and status = 0 group by hostid
 ) a,
 (select hostid, count(*) as bb from items
  where delay < 900 and status = 0
  and hostid in (select hostid from hosts where status = 0) group by hostid
 ) b where a.hostid = b.hostid) p,
(select hostid, lower(host) as hostname from hosts where status = 0) h,
(select hostid, ip from interface where type = '1') c
where p.hostid = h.hostid and h.hostid = c.hostid and p.update_percent < 80
order by uppercent;
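The update-percent calculation can also be sketched outside the database. The Python below is a minimal illustration under the same filters as the queries (a 1800s freshness window, items with an interval under 900s); the function name and the tuple layout are hypothetical, and the input is assumed to contain only enabled items on enabled hosts.

```python
from collections import defaultdict


def update_percent_by_host(items, now, window=1800, max_delay=900):
    """Percentage of each host's items updated within the last `window` seconds.

    items: iterable of (hostid, lastclock, delay) tuples (hypothetical layout).
    Items with delay >= max_delay are ignored, mirroring the `delay < 900`
    filter in the queries.
    """
    total = defaultdict(int)
    fresh = defaultdict(int)
    for hostid, lastclock, delay in items:
        if delay >= max_delay:
            continue
        total[hostid] += 1
        if lastclock is not None and lastclock > now - window:
            fresh[hostid] += 1
    return {h: round(fresh[h] * 100 / total[h], 2) for h in total}


now = 1_000_000
items = [
    (1, now - 60, 60),
    (1, now - 5000, 60),
    (1, now - 10, 3600),   # ignored: interval is 900s or longer
    (2, now - 100, 300),
    (3, now - 9000, 60),
]
percents = update_percent_by_host(items, now)
lagging = {h: p for h, p in percents.items() if p < 80}
print(percents)
print(lagging)
```

One difference worth noting: the per-host query joins the "fresh" count with the total count, so a host with no recently updated items has no row on the "fresh" side and drops out of the result. This sketch reports such hosts at 0 percent instead.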
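4. Busy state of internal processes:
Shows how busy Zabbix's worker processes are, which helps quickly locate internal performance bottlenecks. These are internal checks,
for example zabbix[process,housekeeper,avg,busy], zabbix[process,http poller,avg,busy], zabbix[process,poller,avg,busy], and so on.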
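As a small sketch of how these keys might be generated and their values screened, the Python below builds keys in the `zabbix[process,<type>,<mode>,<state>]` form and flags process types above a busy threshold. Both helper names are hypothetical, the readings are made up, and the 75 percent threshold is an assumed starting point rather than a value from this article.

```python
def process_busy_key(process_type, mode="avg", state="busy"):
    """Build an internal item key such as zabbix[process,poller,avg,busy]."""
    return f"zabbix[process,{process_type},{mode},{state}]"


def overloaded(busy_percent_by_type, threshold=75.0):
    """Return process types whose busy percentage exceeds the threshold.

    threshold=75.0 is an assumption; tune it for your own installation.
    """
    return sorted(t for t, pct in busy_percent_by_type.items() if pct > threshold)


# Made-up sample readings, in percent.
readings = {"poller": 92.5, "http poller": 12.0, "housekeeper": 80.1, "trapper": 40.0}
for process_type in overloaded(readings):
    print(process_busy_key(process_type))
```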
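5. proxy pending send data:
Measures how well data is being sent from the proxy to the server; the smaller the value, the faster the data is being sent.
SQL to obtain it (run against the proxy's database):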
SELECT ((SELECT MAX(proxy_history.id) FROM proxy_history) - nextid) FROM ids WHERE field_name='history_lastid';
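To show what this query returns, the sketch below runs it against an in-memory SQLite database standing in for a proxy database. Only the columns the query touches are created, the rows are made up, and it assumes that `nextid` for `history_lastid` holds the id of the last row already sent to the server.

```python
import sqlite3

# In-memory SQLite stand-in for a proxy database (hypothetical, minimal schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE proxy_history (id INTEGER PRIMARY KEY);
    CREATE TABLE ids (field_name TEXT, nextid INTEGER);
    INSERT INTO ids VALUES ('history_lastid', 1200);
""")
# 1500 collected rows, of which the first 1200 are assumed to be sent already.
conn.executemany("INSERT INTO proxy_history (id) VALUES (?)",
                 [(i,) for i in range(1, 1501)])

PENDING_SQL = """
    SELECT ((SELECT MAX(proxy_history.id) FROM proxy_history) - nextid)
    FROM ids WHERE field_name = 'history_lastid'
"""


def pending_rows(connection):
    """Rows collected by the proxy but not yet sent to the server."""
    row = connection.execute(PENDING_SQL).fetchone()
    return row[0] if row else None


print(pending_rows(conn))
```

A value that keeps growing over time suggests the proxy is collecting data faster than it can send it.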
6. cache: also internal checks,
for example zabbix[wcache,history,pfree], zabbix[wcache,text,pfree]
This article is reposted from 菜菜光's 51CTO blog. Original link: http://blog.51cto.com/caiguangguang/1345664. To repost it, please contact the original author.