Following up on the previous post: analysis of the listener log showed that when the rac1 database became unreachable, this entry appeared:
listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537
The trigger was a sudden power loss on rac2 shortly before (its last service_update was logged at 28-JUN-2016 07:55:30). Yet with node 2 down, node 1 should have taken over the listener automatically, transparently to the front-end applications.
The log analysis also turned up several more service_died errors over the six months the system had been running. This post records the investigation into what caused these service_died errors.
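The historical occurrences are easy to tally with grep over the rotated listener logs. A minimal sketch, assuming the listener_YYYYMMDD.log naming seen in the excerpts in this post:

```shell
# Count every service_died across the rotated listener logs.
# grep -c exits 1 when there are no matches, so "|| true" keeps the
# function usable even against an empty or missing set of logs.
count_deaths() {
  cat "$@" 2>/dev/null | grep -c 'service_died' || true
}

count_deaths listener_*.log
```

Dropping the `wc`-style count and running `grep 'service_died' listener_*.log` directly lists each hit with the file it came from, which is how the per-day excerpts below were produced.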
Checking the CRS log
Running find /u1 -name alert
locates the CRS alert log directory: '/u1/app/grid/diag/crs/test-rac1/crs/alert'.
Extract the entries for the day of the problem as follows.
[root@test-rac1 alert]# grep -n '2016-06-28' log.xml | head -1
121600:<msg time='2016-06-28T07:55:46.994+08:00' org_id='oracle' comp_id='crs'
[root@test-rac1 alert]# grep -n '2016-06-27' log.xml | head -1
[root@test-rac1 alert]# grep -n '2016-06-29' log.xml | head -1
123264:<msg time='2016-06-29T04:33:38.602+08:00' org_id='oracle' comp_id='crs'
[root@test-rac1 alert]# sed -n '121600,123264p' log.xml > log_20160628.xml
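The line-number lookup plus sed above can be collapsed into a single awk pass. A sketch, assuming each day's entries open with a `msg time='YYYY-MM-DD` prefix as shown:

```shell
# Extract one day's <msg> entries in one pass: start printing at the first
# message stamped with the target day, stop at the first message stamped
# with the following day.
extract_day() {  # usage: extract_day <logfile> <day> <next_day>
  awk -v d="msg time='$2" -v n="msg time='$3" \
      'index($0, d) {p=1} index($0, n) {p=0} p' "$1"
}
```

For example: `extract_day log.xml 2016-06-28 2016-06-29 > log_20160628.xml`.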
The extracted error messages:
<msg time='2016-06-28T07:55:46.994+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:46.991 [CRSD(2962)]CRS-2771: Maximum restart attempts reached for resource 'ora.scan1.vip'; will not restart.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.450+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.449 [CRSD(2962)]CRS-2771: Maximum restart attempts reached for resource 'ora.test-rac1.vip'; will not restart.
</txt>
</msg>
<msg time='2016-06-28T07:55:47.974+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.974 [ORAROOTAGENT(3053)]CRS-5017: The resource action "ora.net1.network start" encountered the following error:
2016-06-28 07:55:47.974+CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
</txt>
</msg>
<msg time='2016-06-28T07:55:47.992+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:47.990 [CRSD(2962)]CRS-2878: Failed to restart resource 'ora.net1.network'
</txt>
</msg>
<msg time='2016-06-28T07:55:48.005+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:48.005 [CRSD(2962)]CRS-2769: Unable to failover resource 'ora.net1.network'.
</txt>
</msg>
<msg time='2016-06-28T07:55:49.119+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.115 [ORAAGENT(3044)]CRS-5016: Process "/u1/app/12.1.0/grid/bin/lsnrctl" spawned by agent "ORAAGENT" for action "check" failed: details at "(:CLSN00010:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_oraagent_grid.trc"
</txt>
</msg>
<msg time='2016-06-28T07:55:49.132+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.131 [ORAROOTAGENT(3053)]CRS-5017: The resource action "ora.net1.network start" encountered the following error:
2016-06-28 07:55:49.131+CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
</txt>
</msg>
<msg time='2016-06-28T07:55:49.144+08:00' org_id='oracle' comp_id='crs'
msg_id='clsdadr_process_bucket:4466:2974305713' type='UNKNOWN' group='CLSDADR'
level='16' host_id='test-rac1.tp-link.net' host_addr='192.19.88.70'>
<txt>2016-06-28 07:55:49.144 [CRSD(2962)]CRS-2878: Failed to restart resource 'ora.LISTENER_SCAN1.lsnr'
</txt>
</msg>
Checking the orarootagent trace log
Location: /u1/app/grid/diag/crs/test-rac1/crs/trace
Using the timestamps of the listener failures below, look up the matching entries in the trace files.
listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 07:56:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 07:57:09 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:13:54 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:15:05 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:28:04 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:36:42 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:41:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:42:08 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:56:58 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 10:57:10 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:02:46 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:04:53 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 11:05:06 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 14:18:03 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:24:42 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:26:56 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:30:21 * service_died * LsnrAgt * 12537
listener_20160628.log:28-JUN-2016 16:30:26 * service_died * LsnrAgt * 12537
2016-06-28 07:55:46.806663 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] Failed to check 192.19.88.76 on eth0
2016-06-28 07:55:46.806700 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:55:46.806741 :CLSDYNAM:450848512: [ora.scan1.vip]{2:11414:10026} [check] VipAgent::checkIp returned false
2016-06-28 07:55:46.811719 : AGFW:463456000: {2:11414:10026} ora.scan1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:46.823646 : AGFW:463456000: {0:5:245} Generating new Tint for unplanned state change. Original Tint: {2:11414:10026}
2016-06-28 07:55:46.823776 : AGFW:463456000: {0:5:245} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749652
2016-06-28 07:55:46.933964 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:2} [check] Network Res Check Action returned 1 return ONLINE
2016-06-28 07:55:47.182374 : AGFW:463456000: {1:3322:2} Agent received the message: AGENT_HB[Engine] ID 12293:223268
2016-06-28 07:55:47.436435 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] Failed to check 192.19.88.82 on eth0
2016-06-28 07:55:47.436484 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:55:47.436518 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:2} [check] VipAgent::checkIp returned false
2016-06-28 07:55:47.440127 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start { acquire thndMX:f8023680
2016-06-28 07:55:47.440156 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start pThnd:f8000a30
2016-06-28 07:55:47.440246 : USRTHRD:450848512: {1:3322:2} Thread:[SendFail2SrvThread:] start 2 release thndMX:f8023680 }
2016-06-28 07:55:47.440817 : AGFW:463456000: {1:3322:2} ora.test-rac1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:47.440980 : AGFW:463456000: {0:5:246} Generating new Tint for unplanned state change. Original Tint: {1:3322:2}
2016-06-28 07:55:47.441095 : AGFW:463456000: {0:5:246} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749662
2016-06-28 07:55:47.445142 : USRTHRD:465557248: {1:3322:2} VipAgent::sendFail2Srv {
2016-06-28 07:55:47.461505 : USRTHRD:465557248: {1:3322:2} VipAgent::sendFail2Srv }
2016-06-28 07:55:47.461662 : USRTHRD:465557248: {1:3322:2} Thread:[SendFail2SrvThread:] isRunning is reset to false here
2016-06-28 07:55:47.941795 :CLSDYNAM:450848512: [ora.net1.network]{1:3322:2} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.943998 : AGFW:463456000: {1:3322:2} ora.net1.network test-rac1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:55:47.944024 : AGFW:463456000: {1:3322:2} Switching online monitor to offline one
2016-06-28 07:55:47.944108 : AGFW:463456000: {1:3322:2} Starting offline monitor
2016-06-28 07:55:47.944197 : AGFW:463456000: {1:3322:2} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.944248 : AGFW:463456000: {0:5:247} Generating new Tint for unplanned state change. Original Tint: {1:3322:2}
2016-06-28 07:55:47.944357 : AGFW:463456000: {0:5:247} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3749668
2016-06-28 07:55:47.955510 : AGFW:463456000: {0:5:247} Agent received the message: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.955543 : AGFW:463456000: {0:5:247} Preparing START command for: ora.net1.network test-rac1 1
2016-06-28 07:55:47.955561 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: OFFLINE to: STARTING
2016-06-28 07:55:47.957213 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (:CLSN00107:) clsn_agent::start {
2016-06-28 07:55:47.958107 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] NetworkAgent::init enter {
2016-06-28 07:55:47.958500 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] VendorType=0
2016-06-28 07:55:47.958556 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] Checking if eth0 Interface is fine
2016-06-28 07:55:47.958730 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.973654 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] Agent::commonStart Exception UserErrorException
2016-06-28 07:55:47.973987 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.net1.network start" encountered the following error:
CRS-5008: Invalid attribute value: eth0 for the network interface
2016-06-28 07:55:47.974495 :CLSDYNAM:450848512: [ora.net1.network]{0:5:247} [start] (:CLSN00107:) clsn_agent::start }
2016-06-28 07:55:47.974525 : AGFW:450848512: {0:5:247} Command: start for resource: ora.net1.network test-rac1 1 completed with status: FAIL
2016-06-28 07:55:47.974822 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.975059 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] NetworkAgent::init enter {
2016-06-28 07:55:47.975252 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] VendorType=0
2016-06-28 07:55:47.975296 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] Checking if eth0 Interface is fine
2016-06-28 07:55:47.975524 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.975551 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.976004 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] exception in init
2016-06-28 07:55:47.976497 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: STARTING to: OFFLINE
2016-06-28 07:55:47.976523 : AGFW:463456000: {0:5:247} Switching online monitor to offline one
2016-06-28 07:55:47.976587 : AGFW:463456000: {0:5:247} Starting offline monitor
2016-06-28 07:55:47.976700 : AGFW:463456000: {0:5:247} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.976840 : AGFW:463456000: {0:5:247} Agent sending last reply for: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223304
2016-06-28 07:55:47.982037 : AGFW:463456000: {0:5:247} Agent received the message: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:55:47.982071 : AGFW:463456000: {0:5:247} Preparing CLEAN command for: ora.net1.network test-rac1 1
2016-06-28 07:55:47.982089 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: OFFLINE to: CLEANING
2016-06-28 07:55:47.982930 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] (:CLSN00106:) clsn_agent::clean {
2016-06-28 07:55:47.982997 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] clean {
2016-06-28 07:55:47.983018 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] clean }
2016-06-28 07:55:47.983060 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [clean] (:CLSN00106:) clsn_agent::clean }
2016-06-28 07:55:47.983079 : AGFW:459253504: {0:5:247} Command: clean for resource: ora.net1.network test-rac1 1 completed with status: SUCCESS
2016-06-28 07:55:47.983607 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] NetworkAgent::init enter {
2016-06-28 07:55:47.983716 : AGFW:463456000: {0:5:247} Agent sending reply for: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:55:47.983802 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] VendorType=0
2016-06-28 07:55:47.983844 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] Checking if eth0 Interface is fine
2016-06-28 07:55:47.984009 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:55:47.984601 :CLSDYNAM:459253504: [ora.net1.network]{0:5:247} [check] exception in init
2016-06-28 07:55:47.985001 : AGFW:463456000: {0:5:247} ora.net1.network test-rac1 1 state changed from: CLEANING to: OFFLINE
2016-06-28 07:55:47.985026 : AGFW:463456000: {0:5:247} Switching online monitor to offline one
2016-06-28 07:55:47.985086 : AGFW:463456000: {0:5:247} Starting offline monitor
2016-06-28 07:55:47.985143 : AGFW:463456000: {0:5:247} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:55:47.985691 : AGFW:463456000: {0:5:247} Agent sending last reply for: RESOURCE_CLEAN[ora.net1.network test-rac1 1] ID 4100:223311
2016-06-28 07:56:55.862677 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] Failed to check 192.19.88.83 on eth0
2016-06-28 07:56:55.862713 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:55.862775 :CLSDYNAM:675165952: [ora.test-rac2.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:55.863613 : AGFW:463456000: {1:3322:17228} ora.test-rac2.vip 1 1 state changed from: PARTIAL to: OFFLINE
2016-06-28 07:56:55.863701 : AGFW:463456000: {0:5:251} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:55.863813 : AGFW:463456000: {0:5:251} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750208
2016-06-28 07:56:56.196759 :CLSDYNAM:459253504: [ora.net1.network]{1:3322:17228} [check] Network Res Check Action returned 1 return ONLINE
2016-06-28 07:56:56.363286 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] Failed to check 192.19.88.82 on eth0
2016-06-28 07:56:56.363324 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:56.363408 :CLSDYNAM:450848512: [ora.test-rac1.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:56.363737 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start { acquire thndMX:f8034b50
2016-06-28 07:56:56.363760 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start pThnd:f80235e0
2016-06-28 07:56:56.363855 : USRTHRD:450848512: {1:3322:17228} Thread:[SendFail2SrvThread:] start 2 release thndMX:f8034b50 }
2016-06-28 07:56:56.364491 : AGFW:463456000: {1:3322:17228} ora.test-rac1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:56.364598 : AGFW:463456000: {0:5:252} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:56.364658 : USRTHRD:452949760: {1:3322:17228} VipAgent::sendFail2Srv {
2016-06-28 07:56:56.364681 : AGFW:463456000: {0:5:252} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750217
2016-06-28 07:56:56.387222 : USRTHRD:452949760: {1:3322:17228} VipAgent::sendFail2Srv }
2016-06-28 07:56:56.387282 : USRTHRD:452949760: {1:3322:17228} Thread:[SendFail2SrvThread:] isRunning is reset to false here
2016-06-28 07:56:56.863825 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] Failed to check 192.19.88.76 on eth0
2016-06-28 07:56:56.863856 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] (null) category: 0, operation: , loc: , OS error: 0, other:
2016-06-28 07:56:56.863883 :CLSDYNAM:675165952: [ora.scan1.vip]{1:3322:17228} [check] VipAgent::checkIp returned false
2016-06-28 07:56:56.864616 : AGFW:463456000: {1:3322:17228} ora.scan1.vip 1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:56.864711 : AGFW:463456000: {0:5:253} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:56.864780 : AGFW:463456000: {0:5:253} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750220
2016-06-28 07:56:57.197099 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:17228} [check] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:56:57.197197 :CLSDYNAM:675165952: [ora.net1.network]{1:3322:17228} [check] ifName = eth1:1
2016-06-28 07:56:57.197608 : AGFW:463456000: {1:3322:17228} ora.net1.network test-rac1 1 state changed from: ONLINE to: OFFLINE
2016-06-28 07:56:57.197639 : AGFW:463456000: {1:3322:17228} Switching online monitor to offline one
2016-06-28 07:56:57.197714 : AGFW:463456000: {1:3322:17228} Starting offline monitor
2016-06-28 07:56:57.197804 : AGFW:463456000: {1:3322:17228} Started implicit monitor for [ora.net1.network test-rac1 1] interval=60000 delay=60000
2016-06-28 07:56:57.197841 : AGFW:463456000: {0:5:254} Generating new Tint for unplanned state change. Original Tint: {1:3322:17228}
2016-06-28 07:56:57.197897 : AGFW:463456000: {0:5:254} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:3750226
2016-06-28 07:56:57.203615 : AGFW:463456000: {0:5:254} Agent received the message: RESOURCE_START[ora.net1.network test-rac1 1] ID 4098:223726
2016-06-28 07:56:57.203639 : AGFW:463456000: {0:5:254} Preparing START command for: ora.net1.network test-rac1 1
2016-06-28 07:56:57.203651 : AGFW:463456000: {0:5:254} ora.net1.network test-rac1 1 state changed from: OFFLINE to: STARTING
2016-06-28 07:56:57.204319 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] (:CLSN00107:) clsn_agent::start {
2016-06-28 07:56:57.204507 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] NetworkAgent::init enter {
2016-06-28 07:56:57.204735 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] VendorType=0
2016-06-28 07:56:57.204795 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] Checking if eth0 Interface is fine
2016-06-28 07:56:57.205009 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] (null) category: -1, operation: failed system call, loc: ioctl, OS error: 99, other:
2016-06-28 07:56:57.205101 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] ifName = eth1:1
2016-06-28 07:56:57.205528 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] Agent::commonStart Exception UserErrorException
2016-06-28 07:56:57.205805 :CLSDYNAM:459253504: [ora.net1.network]{0:5:254} [start] clsnUtils::error Exception type=2 string=
CRS-5017: The resource action "ora.net1.network start" encountered the following error:
CRS-5008: Invalid attribute value: eth0 for the network interface
. For details refer to "(:CLSN00107:)" in "/u1/app/grid/diag/crs/test-rac1/crs/trace/crsd_orarootagent_root.trc".
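This timestamp lookup between the listener log and the trace files can be scripted. A minimal sketch, assuming GNU date is available to parse the listener's 28-JUN-2016 format:

```shell
# Convert a listener-log timestamp (e.g. 28-JUN-2016 07:55:47) into the
# trace-file format (2016-06-28 07:55:47). GNU date's -d parser is assumed
# to accept the dd-MMM-yyyy form.
to_trace_ts() {
  date -d "$1" '+%Y-%m-%d %H:%M:%S'
}

to_trace_ts '28-JUN-2016 07:55:47'
```

Each converted timestamp can then be grepped with `grep -F` in crsd_orarootagent_root.trc or crsd_oraagent_grid.trc under the trace directory above.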
Verification tests
Manual relocation
First, check whether the VIP can be relocated by hand.
- Start the monitoring script on node 1
cat mon_crs_status.sh
#!/bin/bash
# Poll every 5 seconds: interfaces, node/VIP reachability, CRS resource state.
while true
do
    date
    ifconfig -a
    echo ''
    ping -c 2 test-rac1
    ping -c 2 test-rac2
    ping -c 2 test-rac1-vip
    ping -c 2 test-rac2-vip
    ping -c 2 test-cluster
    crsctl stat res -t
    sleep 5
done
- Run the following on RAC1
nohup ./mon_crs_status.sh > mon_crs_log &
ps -ef | grep mon_crs_status.sh
- Capture network packets
tcpdump -s 0 -c 100 -w test1.cap
- Stop node 2's VIP by shutting down the node 2 stack
srvctl stop instance -d testdb -i TESTDB2
srvctl stop asm -n test-rac2 -force
srvctl stop listener -n test-rac2
srvctl stop nodeapps -n test-rac2
- Test locally whether rac2's VIP has failed over
tnsping rac_testdb2
tnsping rac_testdb
- Relocate the VIP manually
crs_relocate ora.test-rac2.vip
- Re-test whether rac2's VIP has failed over
tnsping rac_testdb2
tnsping rac_testdb
- Start the rac2 services again
srvctl start nodeapps -n test-rac2
srvctl start asm -n test-rac2
srvctl start instance -d testdb -i TESTDB2
crsctl stat res -t
Power off the RAC2 VM and check whether the VIP fails over
- Run the following on RAC1
nohup ./mon_crs_status.sh > mon_crs_log &
ps -ef | grep mon_crs_status.sh
- Capture network packets
tcpdump -s 0 -c 100 -w test2.cap
- Power off the rac2 server
poweroff
- Test locally whether rac2's VIP has failed over
tnsping rac_testdb2
tnsping rac_testdb
- Start the server and bring the stack back up
crsctl check crs
crsctl start crs
Test results (midday, 6 July)
- With the RAC services stopped by hand, the cluster services stayed available, but rac2-vip could not be relocated manually:
[grid@test-rac2 ~]$ crs_relocate ora.test-rac2.vip
CRS-5708: Resource 'ora.test-rac2.vip' is not relocatable (current and target state not running)
CRS-0223: Resource 'ora.test-rac2.vip' has placement error.
[grid@test-rac2 ~]$ srvctl relocate vip -vip test-rac2
PRCR-1090 : Failed to relocate resource ora.test-rac2.vip. It is not running.
- With the rac2 VM powered off, the cluster services stayed available: rac2-vip failed over to node rac1 automatically, but was left in the INTERMEDIATE state.
[oracle@test-rac1 scripts]$ crsctl stat res -t
--------------------------------------------------------------------------------
Name Target State Server State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE test-rac1 STABLE
ora.test-rac1.vip
1 ONLINE ONLINE test-rac1 STABLE
ora.test-rac2.vip
1 ONLINE INTERMEDIATE test-rac1 FAILED OVER,STABLE
ora.scan1.vip
1 ONLINE ONLINE test-rac1 STABLE
--------------------------------------------------------------------------------
Wed Jul 6 12:50:53 CST 2016
***********************start*************************************
eth0 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.70 Bcast:192.19.88.255 Mask:255.255.255.0
inet6 addr: fe80::215:5dff:fe75:b15/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:18106573 errors:0 dropped:0 overruns:0 frame:0
TX packets:16997728 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5394898582 (5.0 GiB) TX bytes:14791039101 (13.7 GiB)
eth0:1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.76 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth0:2 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.83 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth0:3 Link encap:Ethernet HWaddr 00:15:5D:75:0B:15
inet addr:192.19.88.82 Bcast:192.19.88.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
eth1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:16
inet addr:192.168.2.5 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::215:5dff:fe75:b16/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:109282486 errors:0 dropped:0 overruns:0 frame:0
TX packets:68084084 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:82205348347 (76.5 GiB) TX bytes:41966966280 (39.0 GiB)
eth1:1 Link encap:Ethernet HWaddr 00:15:5D:75:0B:16
inet addr:169.254.31.89 Bcast:169.254.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Conclusion: whether the node is stopped by hand or the VM is powered off outright, the cluster services are unaffected. What remains broken is the single-node VIP state: after failing over the VIP sits in INTERMEDIATE, so we need to find the configuration that restores it to ONLINE. A hardware power-loss test will follow once the software-level tests pass.
Correlating with the operating system logs
The tests above show that database service continues while a single node is down, so RAC failover itself works. The next step is to correlate with the Linux server's /var/log/messages.
NIC restarts
The OS log shows that the interface deletions line up closely with the listener deaths.
OS log:
Jun 28 07:55:47 test-rac1 kernel: lo: Disabled Privacy Extensions
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #2 eth1, fe80::215:5dff:fe75:b16#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #4 eth0, fe80::215:5dff:fe75:b15#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #6 eth0, 192.19.88.70#123, interface stats: received=0, sent=0, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #7 eth1, 192.168.2.5#123, interface stats: received=461, sent=461, dropped=0, active_time=415431 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #8 eth1:1, 169.254.31.89#123, interface stats: received=0, sent=0, dropped=0, active_time=415389 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #9 eth0:1, 192.19.88.82#123, interface stats: received=0, sent=0, dropped=0, active_time=415350 secs
Jun 28 07:55:48 test-rac1 ntpd[2054]: Deleting interface #11 eth0:3, 192.19.88.76#123, interface stats: received=0, sent=0, dropped=0, active_time=415108 secs
Jun 28 07:55:51 test-rac1 ntpd[2054]: Listening on interface #13 eth0, fe80::215:5dff:fe75:b15#123 Enabled
Listener log:
listener_20160628.log:28-JUN-2016 07:55:47 * service_died * LsnrAgt * 12537
OS log:
Jun 28 16:24:28 test-rac1 kernel: lo: Disabled Privacy Extensions
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #133 eth1, 192.168.2.5#123, interface stats: received=0, sent=8, dropped=0, active_time=7583 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #135 eth0:1, 192.19.88.82#123, interface stats: received=0, sent=0, dropped=0, active_time=7281 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #136 eth0:2, 192.19.88.76#123, interface stats: received=0, sent=0, dropped=0, active_time=7227 secs
Jun 28 16:24:33 test-rac1 ntpd[2054]: Deleting interface #137 eth0:3, 192.19.88.83#123, interface stats: received=0, sent=0, dropped=0, active_time=7178 secs
Jun 28 16:24:37 test-rac1 ntpd[2054]: Listening on interface #138 eth1, 192.168.2.5#123 Enabled
Listener log:
listener_20160628.log:28-JUN-2016 16:24:42 * service_died * LsnrAgt * 12537
The ntpd service
The logs above show the ntpd process deleting the network interfaces. Preliminary analysis: the database's NTP server was set to the VM at 192.168.2.4, which runs on physical host 2. Whenever physical host 2 attempted to come up, ntpd tried to reach that NTP server; since the host in fact failed to start, ntpd appeared to tear down and re-add all the interfaces while retrying. Similar reports can be found on Server Fault and the Red Hat site.
Solution
Since there is no standalone NTP server, the cluster clocks can instead be synchronized by the CTSS service that ships with CRS.
For the detailed steps see blogs 1 and 2.
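CTSS runs in observer mode as long as it finds an NTP configuration, and switches to active mode once that configuration is gone. A sketch of the switchover, to be run as root on each node (the backup file name is arbitrary; treat this as a template for host reconfiguration, not a tested script):

```shell
# Stop and disable ntpd, then hide its config so that Oracle CTSS (octssd)
# switches from observer to active mode.
service ntpd stop
chkconfig ntpd off
mv /etc/ntp.conf /etc/ntp.conf.bak   # CTSS checks for this file
rm -f /var/run/ntpd.pid

# Verify that CTSS now reports active mode on the cluster.
cluvfy comp clocksync -n all
crsctl check ctss
```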
Disabled Privacy Extensions
This kernel message relates to IPv6. Since none of the Hyper-V virtual NICs have IPv6 enabled anyway, IPv6 can be disabled outright in the Linux configuration to avoid such side effects.
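A minimal sketch of disabling IPv6 via sysctl, run as root on every node (assumed acceptable here because the interconnect and all VIPs are IPv4-only):

```shell
# Disable IPv6 on all interfaces, persistently (/etc/sysctl.conf) and for
# the running kernel.
cat >> /etc/sysctl.conf <<'EOF'
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
EOF
sysctl -p
```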