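On Windows, HDTune can show a disk's health, so you learn about trouble before the disk actually dies. On CentOS, SMART (Self-Monitoring, Analysis and Reporting Technology) provides the same kind of disk health monitoring.
The examples below use a Dell R720 server: /dev/sda is an ordinary 1 TB SCSI-interface disk, and /dev/sdd is a RAID 5 array built from three disks.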
# df -h    # list the disk device names
# dmesg | grep sdd    # find the disk in the boot messages
sd 0:2:0:0: [sdd] Attached SCSI disk
# hdparm -I /dev/sda    # show the disk's hardware details and enabled features; the output is very detailed
Now use SMART to check the disk status:
# yum install smartmontools    # install the SMART tools
# smartctl -H /dev/sdd    # check the disk health
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Health Status: OK
# smartctl -a /dev/sda    # or: smartctl --all /dev/sda; prints all SMART information for the disk (-A prints only the attribute table)
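When only one attribute matters (reallocated sectors, for example), the table printed by `smartctl -A` can be filtered with awk. The sketch below assumes an ATA/SATA disk, where the attribute name is the 2nd column and RAW_VALUE is the 10th; the helper name `attr_raw` and the sample line are made up for illustration and do not come from the server in this article.

```shell
# attr_raw NAME: read `smartctl -A` output on stdin and print the RAW_VALUE
# (10th column) of the attribute whose name (2nd column) equals NAME.
# Hypothetical helper, not part of smartmontools.
attr_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Illustrative sample line in the ATA attribute table layout (not real data).
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0'
printf '%s\n' "$sample" | attr_raw Reallocated_Sector_Ct
```

On a live system this would be used as `smartctl -A /dev/sda | attr_raw Reallocated_Sector_Ct`. Some attributes put several words in the raw column; this helper returns only the first one.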
# smartctl -a /dev/sdd
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.56-11.el6.centos.alt.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               DELL
Product:              PERC H310
Revision:             2.12
User Capacity:        598,879,502,336 bytes [598 GB]
Logical block size:   512 bytes
Logical Unit id:
Serial number:
Device type:          disk
Local Time is:        Wed Jan 14 15:37:39 2015 CST
Device does not support SMART
Error Counter logging not supported
Device does not support Self Test logging
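The output says "Device does not support SMART": /dev/sdd is the logical volume presented by the RAID controller, so the physical disks behind it have to be queried as shown below.
Check the first disk in the RAID 5 array: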
# smartctl -a -d megaraid,0 /dev/sdd
Check the second and third disks the same way, then hook the results into Nagios or Zabbix alerting to suit your own monitoring setup:
# smartctl -a -d megaraid,1 /dev/sdd
# smartctl -a -d megaraid,2 /dev/sdd
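The three commands above can be folded into a loop. This is only a sketch under a few assumptions: the physical disks have consecutive megaraid IDs starting at 0 (as in this article; IDs on other controllers may differ), and the function names `parse_health` and `check_megaraid` are invented here. The parser accepts both health lines that smartctl prints: `SMART Health Status: OK` for SCSI/SAS disks and `SMART overall-health self-assessment test result: PASSED` for ATA/SATA disks.

```shell
# Path to smartctl; can be overridden through the environment.
SMARTCTL="${SMARTCTL:-/usr/sbin/smartctl}"

# parse_health: read smartctl output on stdin and print the health verdict
# (the text after the colon on the health line). Hypothetical helper.
parse_health() {
    awk -F': *' '/^SMART Health Status|^SMART overall-health self-assessment test result/ { print $2; exit }'
}

# check_megaraid DEVICE COUNT: print one line per physical disk 0..COUNT-1.
# Assumes the megaraid IDs are consecutive and start at 0.
check_megaraid() {
    cm_dev="$1"
    cm_count="$2"
    cm_i=0
    while [ "$cm_i" -lt "$cm_count" ]; do
        cm_status=$("$SMARTCTL" -H -d "megaraid,$cm_i" "$cm_dev" | parse_health)
        # An empty verdict means smartctl printed no health line at all.
        echo "megaraid,$cm_i: ${cm_status:-UNKNOWN}"
        cm_i=$((cm_i + 1))
    done
}
```

Run as root, `check_megaraid /dev/sdd 3` would print one status line for each of the three disks in the array.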
For other smartctl usage, the built-in help is very detailed:
# smartctl -h
Usage: smartctl [options] device
============================================ SHOW INFORMATION OPTIONS =====
-h, --help, --usage
Display this help and exit
-V, --version, --copyright, --license
Print license, copyright, and version information and exit
-i, --info
Show identity information for device
-g NAME, --get=NAME
Get device setting: all, aam, apm, lookahead, security, wcache
-a, --all
Show all SMART information for device
-x, --xall
Show all information for device
--scan
Scan for devices
--scan-open
Scan for devices and try to open each device
================================== SMARTCTL RUN-TIME BEHAVIOR OPTIONS =====
-q TYPE, --quietmode=TYPE (ATA)
Set smartctl quiet mode to one of: errorsonly, silent, noserial
-d TYPE, --device=TYPE
Specify device type to one of: ata, scsi, sat[,auto][,N][+TYPE],
usbcypress[,X], usbjmicron[,x][,N], usbsunplus, marvell, areca,N/E,
3ware,N, hpt,L/M/N, megaraid,N, cciss,N, auto, test
-T TYPE, --tolerance=TYPE (ATA)
Tolerance: normal, conservative, permissive, verypermissive
-b TYPE, --badsum=TYPE (ATA)
Set action on bad checksum to one of: warn, exit, ignore
-r TYPE, --report=TYPE
Report transactions (see man page)
-n MODE, --nocheck=MODE (ATA)
No check if: never, sleep, standby, idle (see man page)
============================== DEVICE FEATURE ENABLE/DISABLE COMMANDS =====
-s VALUE, --smart=VALUE
Enable/disable SMART on device (on/off)
-o VALUE, --offlineauto=VALUE (ATA)
Enable/disable automatic offline testing on device (on/off)
-S VALUE, --saveauto=VALUE (ATA)
Enable/disable Attribute autosave on device (on/off)
-s NAME[,VALUE], --set=NAME[,VALUE]
Enable/disable/change device setting: aam,[N|off], apm,[N|off],
lookahead,[on|off], security-freeze, standby,[N|off|now],
wcache,[on|off]
======================================= READ AND DISPLAY DATA OPTIONS =====
-H, --health
Show device SMART health status
-c, --capabilities (ATA)
Show device SMART capabilities
-A, --attributes
Show device SMART vendor-specific Attributes and values
-f FORMAT, --format=FORMAT (ATA)
Set output format for attributes: old, brief, hex[,id|val]
-l TYPE, --log=TYPE
Show device log. TYPE: error, selftest, selective, directory[,g|s],
xerror[,N][,error], xselftest[,N][,selftest],
background, sasphy[,reset], sataphy[,reset],
scttemp[sts,hist], scttempint,N[,p],
scterc[,N,M], devstat[,N], ssd,
gplog,N[,RANGE], smartlog,N[,RANGE]
-v N,OPTION, --vendorattribute=N,OPTION (ATA)
Set display OPTION for vendor Attribute N (see man page)
-F TYPE, --firmwarebug=TYPE (ATA)
Use firmware bug workaround: none, samsung, samsung2,
samsung3, swapid
-P TYPE, --presets=TYPE (ATA)
Drive-specific presets: use, ignore, show, showall
-B [+]FILE, --drivedb=[+]FILE (ATA)
Read and replace [add] drive database from FILE
[default is +/etc/smart_drivedb.h
and then /usr/share/smartmontools/drivedb.h]
============================================ DEVICE SELF-TEST OPTIONS =====
-t TEST, --test=TEST
Run test. TEST: offline, short, long, conveyance, force, vendor,N,
select,M-N, pending,N, afterselect,[on|off]
-C, --captive
Do test in captive mode (along with -t)
-X, --abort
Abort any non-captive test on device
=================================================== SMARTCTL EXAMPLES =====
smartctl --all /dev/hda (Prints all SMART information)
smartctl --smart=on --offlineauto=on --saveauto=on /dev/hda
(Enables SMART on first disk)
smartctl --test=long /dev/hda (Executes extended disk self-test)
smartctl --attributes --log=selftest --quietmode=errorsonly /dev/hda
(Prints Self-Test & Attribute errors)
smartctl --all --device=3ware,2 /dev/sda
smartctl --all --device=3ware,2 /dev/twe0
smartctl --all --device=3ware,2 /dev/twa0
smartctl --all --device=3ware,2 /dev/twl0
(Prints all SMART info for 3rd ATA disk on 3ware RAID controller)
smartctl --all --device=hpt,1/1/3 /dev/sda
(Prints all SMART info for the SATA disk attached to the 3rd PMPort
of the 1st channel on the 1st HighPoint RAID controller)
smartctl --all --device=areca,3/1 /dev/sg2
(Prints all SMART info for 3rd ATA disk of the 1st enclosure
on Areca RAID controller)
http://linux-wiki.cn/wiki/zh-hans/SSD_(%E5%9B%BA%E6%80%81%E7%A1%AC%E7%9B%98)
Nagios setup
The following checks the RAID 5 array, three disks in total:
root@web:/usr/local/nagios/libexec# vim check_disk_status.sh
#!/bin/bash
# Nagios plugin exit codes
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2    # must be defined, otherwise "exit $STATE_CRITICAL" exits with status 0
SMARTCTL="/usr/sbin/smartctl"
CHECK_DISK="/dev/sda"

# First physical disk behind the RAID controller
DISK_HEALTH1=`$SMARTCTL -a -d megaraid,0 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}'`
if [ "$DISK_HEALTH1" = "OK" ] || [ "$DISK_HEALTH1" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 1 status is $DISK_HEALTH1"
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH1"
    exit $STATE_CRITICAL
fi

# Second physical disk
DISK_HEALTH2=`$SMARTCTL -a -d megaraid,1 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}'`
if [ "$DISK_HEALTH2" = "OK" ] || [ "$DISK_HEALTH2" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 2 status is $DISK_HEALTH2"
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH2"
    exit $STATE_CRITICAL
fi

# Third physical disk
DISK_HEALTH3=`$SMARTCTL -a -d megaraid,2 $CHECK_DISK | grep "SMART Health Status" | awk '{print $4}'`
if [ "$DISK_HEALTH3" = "OK" ] || [ "$DISK_HEALTH3" = "PASSED" ]; then
    echo "OK - $CHECK_DISK 3 status is $DISK_HEALTH3"
else
    echo "CRITICAL - $CHECK_DISK status is $DISK_HEALTH3"
    exit $STATE_CRITICAL
fi
exit $STATE_OK

# chmod 755 check_disk_status.sh
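Nagios decides the service state from the plugin's exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), so it can help to keep that mapping in one place. A minimal sketch, assuming the health verdict has already been extracted into a string; the function name `nagios_state` is hypothetical and not part of the script above:

```shell
# nagios_state STATUS: translate a smartctl health verdict into a Nagios
# plugin exit status. Hypothetical helper for illustration.
nagios_state() {
    case "$1" in
        OK|PASSED) return 0 ;;   # STATE_OK
        "")        return 3 ;;   # STATE_UNKNOWN: smartctl gave no verdict
        *)         return 2 ;;   # STATE_CRITICAL: anything else
    esac
}
```

Treating an empty verdict as UNKNOWN instead of CRITICAL separates "smartctl could not be run" (a sudo or path problem, say) from a disk that actually reports a failure.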
vim /usr/local/nagios/etc/nrpe.cfg
command[check_disk_status]=/usr/bin/sudo /usr/local/nagios/libexec/check_disk_status.sh
/usr/sbin/smartctl has to run as root to read the disk status, so allow the nagios user to run the script through sudo:
vim /etc/sudoers
#Defaults requiretty
nagios ALL=(ALL) NOPASSWD: /usr/local/nagios/libexec/check_disk_status.sh
Run the following on the Nagios server to test it:
root@nagios:/usr/local/nagios/libexec# ./check_nrpe -H 192.168.2.2 -c check_disk_status
OK - /dev/sda 1 status is OK
OK - /dev/sda 2 status is OK
OK - /dev/sda 3 status is OK
Define the Nagios service:
define service{
    use                  linux-service
    host_name            192_168_2_2
    service_description  check disk status
    check_command        check_nrpe!check_disk_status
}
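Then set the check interval to once a day. Scanning the disks constantly is unnecessary and not good for them either.
Reference: http://blog.chinaunix.net/uid-20592013-id-2436813.html
Running a script and sending mail
The simplest approach is to add the script to crontab and read the mail it sends. The script follows.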