At around 15:20 on July 18, the customer reported that the application could not connect to the database; by around 15:30 the application could connect normally again. No operations were performed between 15:20 and 15:30.
Fault Analysis
The database alert log began reporting errors at 15:21; no further ORA-00445 errors appeared after 15:29.
Sat Jul 18 15:21:11 2020
Errors in file /oracle/app/diag/rdbms/orcl/orcl1/trace/orcl1_cjq0_15129.trc (incident=224489):
ORA-00445: background process "J000" did not start after 120 seconds
……
Sat Jul 18 15:29:25 2020
Errors in file /oracle/app/diag/rdbms/orcl/orcl1/trace/orcl1_qmnc_15048.trc (incident=224433):
ORA-00445: background process "q001" did not start after 120 seconds
ORA-00445 means that Oracle could not spawn a new process at the operating-system level to service a request, or that an existing process has hung. This error usually occurs when system resources are exhausted or resource parameters are misconfigured, so the analysis should start at the operating-system level.
The trace file orcl1_cjq0_15129.trc shows that memory was healthy at the time, with ample free memory and swap untouched. What stands out instead is that the load average had spiked to 65, and even a simple /bin/ps command timed out after 15 seconds:
* 2020-07-18 15:19:06.074
loadavg : 65.00 22.26 8.73
Memory (Avail / Total) = 72904.64M / 128946.45M
Swap (Avail / Total) = 32768.00M / 32768.00M
skgpgcmdout: read() for cmd /bin/ps -elf | /bin/egrep 'PID | 68878' | /bin/grep -v grep timed out after 15.000 seconds
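Before moving to the OS, it helps to pin down the exact failure window from the alert log. The following is a minimal sketch, not part of the original diagnosis; the line formats are assumed to match the alert-log excerpts above. It pairs each ORA-00445 occurrence with the timestamp line that precedes it:

```python
import re

# ORA-00445 line and the "Day Mon DD HH:MM:SS YYYY" stamp preceding it
# (formats assumed from the alert-log excerpts above).
ORA445 = re.compile(r'ORA-00445: background process "(\w+)" did not start after (\d+) seconds')
STAMP = re.compile(r'^\w{3} \w{3} \d{1,2} \d{2}:\d{2}:\d{2} \d{4}$')

def find_ora00445(lines):
    """Return (timestamp, process, seconds) for every ORA-00445 hit."""
    hits, last_stamp = [], None
    for line in lines:
        if STAMP.match(line.strip()):
            last_stamp = line.strip()
        m = ORA445.search(line)
        if m:
            hits.append((last_stamp, m.group(1), int(m.group(2))))
    return hits

sample = [
    "Sat Jul 18 15:21:11 2020",
    "Errors in file .../orcl1_cjq0_15129.trc (incident=224489):",
    'ORA-00445: background process "J000" did not start after 120 seconds',
]
print(find_ora00445(sample))
```

Running this over the full alert log gives the first and last ORA-00445 stamps, which bound the window the rest of the analysis needs to explain.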
The clusterware periodically checks the health of its resources. Starting at 15:18 a large number of check failures appeared, and the cluster returned to normal after 15:31.
2020-07-18 15:18:23.883
[/oracle/grid/bin/oraagent.bin(18771)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {1:42006:2} in /oracle/grid/log/lsfby01/agent/crsd/oraagent_grid/oraagent_grid.log.
2020-07-18 15:18:43.914
[/oracle/grid/bin/oraagent.bin(18771)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {1:42006:2} in /oracle/grid/log/lsfby01/agent/crsd/oraagent_grid/oraagent_grid.log.
2020-07-18 15:18:53.953
[/oracle/grid/bin/oraagent.bin(18771)]CRS-5014:Agent "/oracle/grid/bin/oraagent.bin" timed out starting process "/oracle/grid/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/oracle/grid/log/lsfby01/agent/crsd/oraagent_grid/oraagent_grid.log"
……
[/oracle/grid/bin/oraagent.bin(18771)]CRS-5822:Agent '/oracle/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:2:39} in /oracle/grid/log/lsfby01/agent/crsd/oraagent_grid/oraagent_grid.log.
2020-07-18 15:30:07.480
[/oracle/grid/bin/orarootagent.bin(18778)]CRS-5822:Agent '/oracle/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:21264} in /oracle/grid/log/lsfby01/agent/crsd/orarootagent_root/orarootagent_root.log.
2020-07-18 15:30:07.480
[/oracle/grid/bin/oraagent.bin(18800)]CRS-5822:Agent '/oracle/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:72} in /oracle/grid/log/lsfby01/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2020-07-18 15:31:17.399
[crsd(69518)]CRS-1012:The OCR service started on node lsfby01.
2020-07-18 15:31:17.846
[crsd(69518)]CRS-1201:CRSD started on node lsfby01.
The same ORA-00445 error also appears in the ASM alert log.
Sat Jul 18 15:22:12 2020
Errors in file /oracle/app/diag/asm/+asm/+ASM1/trace/+ASM1_mmon_13816.trc (incident=181601):
ORA-00445: background process "m000" did not start after 120 seconds
One possibility considered was that a cluster heartbeat failure had caused one node to be evicted, which in turn would have made the application unable to connect. However, a review of the ocssd log showed no heartbeat problems.
OSWatcher (osw) had already been deployed on both nodes on the 15th. Reviewing its oswtop, oswmeminfo and oswprvtnet output revealed an anomaly: osw is configured to sample every 5 seconds, yet there are no records at all between roughly 15:17 and 15:28. From this we can infer that the problem lay at the operating-system level.
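A sampling gap like this can be found mechanically rather than by eye. Below is a minimal sketch, with a hypothetical timestamp format; real oswtop files carry `zzz ***` marker lines whose exact layout depends on the OSWatcher version. It flags any interval between consecutive samples that is far larger than the configured 5 seconds:

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected=timedelta(seconds=5), tolerance=2.0):
    """Flag intervals where consecutive OSWatcher samples are much
    farther apart than the configured sampling interval."""
    fmt = "%m/%d/%y %H:%M:%S"  # assumed layout; adjust to the real marker lines
    ts = [datetime.strptime(t, fmt) for t in timestamps]
    gaps = []
    for a, b in zip(ts, ts[1:]):
        if (b - a) > expected * tolerance:
            gaps.append((a, b, (b - a).total_seconds()))
    return gaps

samples = ["07/18/20 15:16:55", "07/18/20 15:17:00",
           "07/18/20 15:17:05", "07/18/20 15:28:40"]
print(find_gaps(samples))  # expect one large gap, 15:17:05 -> 15:28:40
```

The silent window reported by such a check is strong evidence that the host itself, not the database, was stalled: osw could not even wake up to take a sample.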
The operating-system log shows the anomaly beginning at 15:17 and ending at 15:31, which exactly covers the window during which the database was unreachable.
Jul 18 15:17:27 lsfby01 kernel: scand[18500]: segfault at 7f194a664000 ip 00007f1952d95a68 sp 00007f19148950d8 error 4 in modmap.yeIQ3s (deleted)[7f1952d7b000+1c000]
Jul 18 15:17:37 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /tmp/.nstat.u0, event: OPEN, pid: 68854
Jul 18 15:17:37 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /etc/ld.so.cache, event: OPEN, pid: 68858
Jul 18 15:17:37 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /etc/ld.so.cache, event: OPEN, pid: 68857
……
Jul 18 15:31:09 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /usr/share/augeas/lenses/libreport.aug, event: OPEN, pid: 11371
Jul 18 15:31:09 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /usr/lib64/gconv/gconv-modules.cache, event: OPEN, pid: 69467
Jul 18 15:31:10 lsfby01 abrtd: Directory 'ccpp-2020-07-18-15:26:08-18407' creation detected
Jul 18 15:31:11 lsfby01 abrtd: Package 'efs' isn't signed with proper key
Jul 18 15:31:11 lsfby01 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2020-07-18-15:26:08-18407' exited with 1
Jul 18 15:31:11 lsfby01 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2020-07-18-15:26:08-18407'
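The outage boundaries can likewise be read off programmatically from the syslog messages. A small sketch, assuming the eset_rtp message layout shown above, that returns the first and last timeout timestamps:

```python
import re

# Syslog lines of the form "Mon DD HH:MM:SS host kernel: eset_rtp: ..."
# (layout assumed from the /var/log/messages excerpts above).
SYSLOG = re.compile(
    r'^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: eset_rtp: wait for scanner reply timeout')

def rtp_timeout_window(lines):
    """Return (first, last) timestamps of eset_rtp timeout messages, or None."""
    stamps = [m.group(1) for l in lines if (m := SYSLOG.match(l))]
    return (stamps[0], stamps[-1]) if stamps else None

sample = [
    "Jul 18 15:17:37 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /tmp/.nstat.u0, event: OPEN, pid: 68854",
    "Jul 18 15:31:09 lsfby01 kernel: eset_rtp: wait for scanner reply timeout, path: /usr/lib64/gconv/gconv-modules.cache, event: OPEN, pid: 69467",
]
print(rtp_timeout_window(sample))
```

Comparing this window (15:17 to 15:31) against the ORA-00445 window and the OSWatcher gap shows all three line up, pointing at a single OS-level cause.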
After discussing with the customer, we confirmed that antivirus software was installed on the server (ESET, per the eset_rtp kernel messages). At this point the cause can be inferred: after the scanner process segfaulted, the real-time protection module kept intercepting every file-open and waiting for a scanner reply that never came, so existing processes stalled and new processes could not start, which is exactly what the ORA-00445 errors describe. The antivirus software caused the outage.
Summary
The lesson from this incident: when an analysis stalls at the database level and the problem cannot be pinpointed there, look at the operating system. Going through the OS logs together with the OSWatcher data will often reveal the root cause.