Instance Eviction 导致实例重启

2024-01-02 20:32:52

一、Environment
[告警通知]
告警对象：XXX平台-19cdb-ip
触发时间：2021-10-10 03:55:38
告警规则：Oracle实例重启告警
告警信息：数据库的启动时间接近告警设置阈值，数据库可能在之前重启过，请注意。
告警次数：1
告警描述：instance_name:orcl2，启动时间:2021-10-10 03:53:25，运行时间:2分钟

二、Symptoms、Cause分析
1.二节点数据库alert日志如下，2021-10-10 03:51有明显LMS0 hung信息,随后进程中断、提示内存不存在，日志文件无法访问实例中断，紧跟着2021-10-10 03:53实例正常启动。

***********************************************************************
2021-10-10T03:51:34.572753+08:00
LMS0 (ospid: 25068_25071) has not called a wait for 22 secs.
2021-10-10T03:51:42.718171+08:00

Fatal NI connect error 12537, connecting to:
(ADDRESS=(PROTOCOL=tcp)(HOST=10.111.253.3)(PORT=39534))
2021-10-10T03:51:45.202219+08:00
LMS0 (ospid: 25068_25071) is hung in an acceptable location ( 0x2.ffff).
2021-10-10T03:51:52.557322+08:00

ns main err code: 12537
Incident details in: /data/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_10473/orcl2_rms0_25080_i10473.trc
2021-10-10T03:52:38.323480+08:00
opiodr aborting process unknown ospid (648) as a result of ORA-609

nt OS err code: 0
TNS-12537: TNS:connection closed
2021-10-10T03:52:38.325465+08:00
opiodr aborting process unknown ospid (1052) as a result of ORA-609
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
2021-10-10T03:52:38.327578+08:00
opiodr aborting process unknown ospid (1941) as a result of ORA-609
2021-10-10T03:52:38.328910+08:00
opiodr aborting process unknown ospid (1061) as a result of ORA-609
2021-10-10T03:52:40.123054+08:00
Errors in file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_fenc_25118.trc:
ORA-01034: ORACLE not available
ORA-27101: shared memory realm does not exist
Linux-x86_64 Error: 2: No such file or directory
Additional information: 4376
Additional information: -989601789
2021-10-10T03:52:42.061215+08:00
Errors in file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_rms0_25080.trc:
ORA-00313: open failed for members of log group 7 of thread 1
ORA-00312: online log 7 thread 1: '+DATA/JIRADB/ONLINELOG/group_7.307.1009060777'
ORA-17503: ksfdopn:2 Failed to open file +DATA/JIRADB/ONLINELOG/group_7.307.1009060777
ORA-01092: ORACLE instance terminated. Disconnection forced
2021-10-10T03:52:42.062072+08:00
Errors in file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_rms0_25080.trc:
ORA-00313: open failed for members of log group 7 of thread 1
ORA-00312: online log 7 thread 1: '+DATA/JIRADB/ONLINELOG/group_7.307.1009060777'
ORA-17503: ksfdopn:2 Failed to open file +DATA/JIRADB/ONLINELOG/group_7.307.1009060777
ORA-01092: ORACLE instance terminated. Disconnection forced
Errors in file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_rms0_25080.trc (incident=10474) (PDBNAME=CDB$ROOT):
ORA-313 [] [] [] [] [] [] [] [] [] [] [] []
2021-10-10T03:52:42.101863+08:00

2021-10-10T03:52:42.062072+08:00
Errors in file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_rms0_25080.trc:
ORA-00313: open failed for members of log group 7 of thread 1
ORA-00312: online log 7 thread 1: '+DATA/JIRADB/ONLINELOG/group_7.307.1009060777'
ORA-17503: ksfdopn:2 Failed to open file +DATA/JIRADB/ONLINELOG/group_7.307.1009060777
ORA-01092: ORACLE instance terminated. Disconnection forced
Errors in file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_rms0_25080.trc (incident=10474) (PDBNAME=CDB$ROOT):
ORA-313 [] [] [] [] [] [] [] [] [] [] [] []
2021-10-10T03:52:42.101863+08:00
System state dump requested by (instance=2, osid=25080 (RMS0)), summary=[abnormal instance termination]. error - 'Instance is terminating.
'
USER (ospid: ): terminating the instance due to ORA error
System State dumped to trace file /data/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_diag_25045.trc
2021-10-10T03:52:46.606056+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:47.630982+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:48.323006+08:00
Instance terminated by USER, pid = 25080
2021-10-10T03:53:21.788175+08:00
Starting ORACLE instance (normal) (OS id: 9001)
2021-10-10T03:53:22.141057+08:00
************************************************************
Instance SGA_TARGET = 15360 MB and SGA_MAX_SIZE = 15360 MB

2.查看asm的alert日志信息，也存在LMD0进程hung信息，提示检查一节点的alert和lmon进程，由于远程请求（很可能是实例驱逐），实例正在终止。

2021-10-10T00:04:10.556537+08:00
NOTE: cleaning up empty system-created directory '+DATA/orcl-cluster/OCRBACKUP/backup01.ocr.1429.1085500999'
2021-10-10T00:04:11.589012+08:00
NOTE: cleaning up empty system-created directory '+DATA/orcl-cluster/OCRBACKUP/backup00.ocr.1412.1085515417'
2021-10-10T00:04:12.788257+08:00
NOTE: cleaning up empty system-created directory '+DATA/orcl-cluster/OCRBACKUP/33845561.1422.1085529839'
2021-10-10T03:47:46.057983+08:00
LMD0 (ospid: 26035) has not called a wait for sub 0 secs.
2021-10-10T03:47:57.490009+08:00
LMD0 (ospid: 26035) is hung in an acceptable location ( 0x2.ffff).
2021-10-10T03:52:37.192259+08:00
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
USER (ospid: ): terminating the instance due to ORA error
Cause - 'Instance is being terminated due to remote request most likely an instance eviction
'
2021-10-10T03:52:38.149593+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.149613+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.261098+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.288658+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.288746+08:00
System state dump requested by (instance=2, osid=26040 (LMS0)), summary=[abnormal instance termination]. error - 'Instance is terminating.
'
System State dumped to trace file /data/app/19.3/diag/asm/+asm/+ASM2/trace/+ASM2_diag_26025.trc
2021-10-10T03:52:38.315951+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.342262+08:00
ORA-1092 : opitsk aborting process

2021-10-10T03:52:38.342262+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.369704+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.394837+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.420410+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.449409+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.478570+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.519968+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.922881+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:38.996310+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:39.218941+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:39.617907+08:00
Dumping diagnostic data in directory=[cdmp_20211010035237], requested by (instance=2, osid=26040 (LMS0)), summary=[abnormal instance termination].
2021-10-10T03:52:39.928602+08:00
ORA-1092 : opitsk aborting process
2021-10-10T03:52:39.945557+08:00
Instance terminated by USER, pid = 26037_26040
2021-10-10T03:53:05.189885+08:00
MEMORY_TARGET defaulting to 1128267776.
* instance_number obtained from CSS = 2, checking for the existence of node 0...
* node 0 does not exist. instance_number = 2
Starting ORACLE instance (normal) (OS id: 6096)
2021-10-10T03:53:05.456719+08:00
****************************************************
/dev/shm will be used for creating SGA
Large pages will not be used. Only standard 4K pages will be used
****************************************************
2021-10-10T03:53:05.460487+08:00

3.查看节点1的asm日志，2021-10-10 03:52:16 节点二从集群中被驱逐，LMD0进程没有响应，被系统kill。
2021-10-10T03:51:42.119890+08:00
Communications reconfiguration: instance_number 2 from LMS0 (ospid: 9528_9531)
2021-10-10T03:52:16.666228+08:00
Evicting instance 2 from cluster
Waiting for instances to leave: 2
2021-10-10T03:52:36.826748+08:00
LMD0 (ospid: 9526) has detected no messaging activity from instance 2
2021-10-10T03:52:36.591598+08:00
Remote instance kill is issued with system inc 10
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
2021-10-10T03:52:37.820611+08:00
Reconfiguration started (old inc 8, new inc 12)
List of instances (total 1) :
1
Dead instances (total 1) :
2
My inst 1
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Begin lmon rcfg omni enqueue reconfig stage1
End lmon rcfg omni enqueue reconfig stage1
Master broadcasted resource hash value bitmaps
Begin lmon rcfg omni enqueue reconfig stage2

My inst 1
Global Resource Directory frozen
* dead instance detected - domain 1 invalid = TRUE
* dead instance detected - domain 2 invalid = TRUE
Communication channels reestablished
Begin lmon rcfg omni enqueue reconfig stage1
End lmon rcfg omni enqueue reconfig stage1
Master broadcasted resource hash value bitmaps
Begin lmon rcfg omni enqueue reconfig stage2
End lmon rcfg omni enqueue reconfig stage2
Non-local Process blocks cleaned out
2021-10-10T03:52:37.867914+08:00
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived, skipped 0
Set master node info
Begin lmon rcfg omni enqueue reconfig stage3
End lmon rcfg omni enqueue reconfig stage3
Begin lmon rcfg omni enqueue reconfig stage4
End lmon rcfg omni enqueue reconfig stage4
Dwn-cvts replayed, VALBLKs dubious
Begin lmon rcfg omni enqueue reconfig stage5
End lmon rcfg omni enqueue reconfig stage5
All grantable enqueues granted
Begin lmon rcfg omni enqueue reconfig stage6
End lmon rcfg omni enqueue reconfig stage6
Post SMON to start 1st pass IR
2021-10-10T03:52:37.904154+08:00
Begin lmon rcfg omni enqueue reconfig stage7
End lmon rcfg omni enqueue reconfig stage7
2021-10-10T03:52:37.905430+08:00
NOTE: SMON detected lock domain 1 invalid at system inc 8 10/10/21 03:52:37
Reconfiguration complete (total time 0.1 secs)
2021-10-10T03:52:37.925380+08:00
NOTE: SMON starting instance recovery of group OCR domain 1 inc 12 (mounted) at 10/10/21 03:52:37
NOTE: 10/10/21 03:52:37 OCR.F1X0 found on disk 0 au 10 fcn 0.0 datfmt 1
NOTE: instance recovery of group OCR will recover thread=2 ckpt=2.0 domain=1 inc#=4 instnum=2
NOTE: SMON waiting for thread 2 recovery enqueue

五、Solution

根据监控查看，确实当时系统负载异常高。
个人认为：可以开启大页，部署oswatcher，优化系统.

六、References
Top 5 issues for Instance Eviction (Doc ID 1374110.1)
Issue #3 The problem instance was hanging before the instance crashes or is evicted

拓展：
1.什么是LMON进程?
在RAC中，LMON LMS LMD三个进程，不断的同其他进程进行通信，LMS负责管理、传输数据块，以支持cache fusion，LMD负责管理enqueue,LMON负责监控整个RAC的监控状况、数据库注册到css或从css中解除。

2.Node Eviction (节点驱逐)和Instance Eviction(实例驱逐)

码农公寓

相关文章