最近有一套数据库 crash ,环境2nodes RAC 11.2.0.3 on AIX , alert日志中记录了在crash前出现了ora-7445[opiaba],ora-600 [kgh_heap_sizes:ds], ora-600 [17147], 下面记录当时的一些日志。
# DB alert log
Wed Sep 16 23:44:23 2015 Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xF00000E161B7CD2] [PC:0x10662AFD4, opiaba()+788] [flags: 0x0, count: 1] Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_21759124.trc (incident=191931): ORA-07445: 出现异常错误: 核心转储 [opiaba()+788] [SIGSEGV] [ADDR:0xF00000E161B7CD2] [PC:0x10662AFD4] [Address not mapped to object] [] Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_191931/anbob1_ora_21759124_i191931.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Sep 16 23:46:37 2015 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_21759124.trc (incident=191932): ORA-00600: 内部错误代码, 参数: [kgh_heap_sizes:ds], [0x700000E82B30018], [], [], [], [], [], [], [], [], [], [] ORA-07445: 出现异常错误: 核心转储 [opiaba()+788] [SIGSEGV] [ADDR:0xF00000E161B7CD2] [PC:0x10662AFD4] [Address not mapped to object] [] Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_191932/anbob1_ora_21759124_i191932.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Sep 16 23:47:14 2015 Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Sep 16 23:47:14 2015 Dumping diagnostic data in directory=[cdmp_20150916234714], requested by (instance=1, osid=21759124), summary=[incident=191932]. Wed Sep 16 23:56:58 2015 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_14549208.trc (incident=203370): ORA-04030: out of process memory when trying to allocate 103504 bytes (QERHJ hash-joi,kllcqc:kllcqslt) Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Sep 16 23:57:02 2015 Sweep [inc][203370]: completed Sweep [inc][191932]: completed Sweep [inc][191931]: completed Wed Sep 16 23:58:03 2015 Process 0x700000fd0ac9dc8 appears to be hung while dumping Current time = 2507987599, process death time = 2507922725 interval = 60000 Attempting to kill process 0x700000fd0ac9dc8 with OS pid = 21759124 OSD kill succeeded for process 700000fd0ac9dc8 ERROR: Unable to normalize symbol name for the following short stack (at offset 521): dbgexProcessError()+180<-dbgeExecuteForError()+72<-dbgePostErrorKGE()+2048<-dbkePostKGE_kgsf()+68<-kgeade()+364<-kgeriv_int()+116<-kgeriv()+36<-kgeric1()+60<-kgherror()+324<-kghfrmrg()+1964<-kghfre()+368<-kghfrh_internal()+552<-kghfrh()+16<-kksFreeHeap()+272<-kksLockDelete()+760<-kksLockOperation()+140<-kgllkdl()+3136<-kgllkds()+104<-kssxdl()+1872<-kssdel()+724<-kssdch()+5732<-ksuxds()+940<-kssxdl()+472<-kssdel()+724<-kssdch()+5732<-ksudlp()+792<-kssxdl()+472<-kssdel()+724<-ksuxdl()+456<-ksuxda_del_proc()+188<-IPRA.$ksucln_dpc_cleanup()+272<-IPRA.$ksucln_dpc_dfs()+268<-ksucln_dpc_main()+892<-ksucln_dpc()+36<-ksucln()+1340<-ksbrdp()+2044<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112 Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_16253104.trc (incident=190123): ORA-00600: internal error code, arguments: [17147], [0x700000E82B30000], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_190123/anbob1_pmon_16253104_i190123.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Wed Sep 16 23:58:12 2015 Dumping diagnostic data in directory=[cdmp_20150916235812], requested by (instance=1, osid=16253104 (PMON)), summary=[incident=190123]. Errors in file /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_pmon_16253104.trc: ORA-00600: internal error code, arguments: [17147], [0x700000E82B30000], [], [], [], [], [], [], [], [], [], [] PMON (ospid: 16253104): terminating the instance due to error 472 Wed Sep 16 23:58:12 2015 opiodr aborting process unknown ospid (15336054) as a result of ORA-1092 Wed Sep 16 23:58:14 2015 ORA-1092 : opitsk aborting process Wed Sep 16 23:58:18 2015 Instance terminated by PMON, pid = 16253104
# anbob1_pmon_16253104.trc
*** TRACE FILE RECREATED AFTER BEING REMOVED *** Process 0x700000fd0ac9dc8 appears to be hung while dumping Current time = 2507987599, process death time = 2507922725 interval = 60000 Attempting to kill process 0x700000fd0ac9dc8 with OS pid = 21759124 OSD kill succeeded for process 700000fd0ac9dc8 *** 2015-09-16 23:58:06.766 KGX cleanup... KGX Atomic Operation Log 70000101e762520 Mutex 700000ea6aeec70(1300, 0) idn 49899eb8 oper EXCL Cursor Parent uid 1300 efd 11 whr 17 slp 0 oper=OPERATION_DEFAULT pt1=700000ea6aef098 pt2=700000ea6aeebf8 pt3=0 pt4=0 u41=0 stt=0 Incident 190123 created, dump file: /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_190123/anbob1_pmon_16253104_i190123.trc ORA-00600: internal error code, arguments: [17147], [0x700000E82B30000], [], [], [], [], [], [], [], [], [], [] PMON: fatal error while deleting s.o. 700001011075e88 in this tree: ---------------------------------------- SO: 0x700000fd0ac9dc8, type: 2, owner: 0x0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x700000fd0ac9dc8, name=process, file=ksu.h LINE:12616 ID:, pg=0 (process) Oracle pid:228, ser:25, calls cur/top: 0x700000f0aca7218/0x700000f0ac85808 flags : (0x1) DEAD flags2: (0x8000), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 Cleanup details: Found dead = 5 sec ago Total Cleanup attempts = 1, Cleanup time = 0 sec, Cleanup timer = 0.000 sec Last attempt (full) started 5 sec ago, Length = in progress (post info) last post received: 2147483647 0 2 last post received-location: ksl2.h LINE:2362 ID:kslpsr last process to post me: 700000fe0aa3210 1 6 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:285 ID:ksasnd last process posted by me: 700000fb8ab0d80 1 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 0x700000fc0b81778 O/S info: user: grid, term: UNKNOWN, ospid: 21759124 (DEAD) OSD pid info: Unix process pid: 21759124, image: oracle@4adata1 ---------------------------------------- SO: 0x700001020e39a58, type: 4, owner: 0x700000fd0ac9dc8, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0x700000fd0ac9dc8, name=session, file=ksu.h LINE:12624 ID:, pg=0 (session) sid: 1300 ser: 54478 trans: 0x0, creator: 0x700000fd0ac9dc8 flags: (0x100045) USR/- flags_idl: (0x21) BSY/-/-/DEL/-/- flags2: (0x40008) -/-/- DID: , short-term DID: txn branch: 0x0 oct: 47, prv: 0, sql: 0x70000100eea0d10, psql: 0x70000100b3773b8, user: 118/HBYDRMS ksuxds FALSE at location: 0 service name: SYS$USERS client details: O/S info: user: Administrator, term: unknown, ospid: 1234 machine: WIN-1BBB4BL8USU program: JDBC Thin Client Current Wait Stack: Not in wait; last wait ended 167 min 17 sec ago Wait State: fixed_waits=0 flags=0x21 boundary=0x0/-1 Session Wait History: elapsed time of 167 min 17 sec since last wait 0: waited for 'latch: shared pool' address=0x700000000109838, number=0x133, tries=0x0 wait_id=2824 seq_num=2825 snap_id=1 wait times: snap=0.000022 sec, exc=0.000022 sec, total=0.000022 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 129 min 40 sec of elapsed time 1: waited for 'latch: shared pool' address=0x700000000109838, number=0x133, tries=0x0 wait_id=2823 seq_num=2824 snap_id=1 wait times: snap=23.825311 sec, exc=23.825311 sec, total=23.825311 sec wait times: max=infinite wait counts: calls=0 os=0 occurred after 0.015511 sec of elapsed time ... ... ... ORA-00600: internal error code, arguments: [17147], [0x700000E82B30000], [], [], [], [], [], [], [], [], [], [] kjzduptcctx: Notifying DIAG for crash event ----- Abridged Call Stack Trace ----- ksedsts()+360<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+100<-ksuitm()+5124<-ksbrdp()+4508<-opirip()+1624<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+268<-main()+204<-__start()+112 ----- End of Abridged Call Stack Trace ----- *** 2015-09-16 23:58:12.650 PMON (ospid: 16253104): terminating the instance due to error 472 ksuitm: waiting up to [5] seconds before killing DIAG(16843208)
# anbob1_ora_21759124_i191931.trc
Dump file /u01/app/oracle/diag/rdbms/anbob/anbob1/incident/incdir_191931/anbob1_ora_21759124_i191931.trc Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP, Data Mining and Real Application Testing options ORACLE_HOME = /u01/app/oracle/product/11.2.0/dbhome_1 System name: AIX Node name: 4adata1 Release: 1 Version: 6 Machine: 00F8D36E4C00 Instance name: anbob1 Redo thread mounted by this instance: 1 Oracle process number: 228 Unix process pid: 21759124, image: oracle@4adata1 Dump continued from file: /u01/app/oracle/diag/rdbms/anbob/anbob1/trace/anbob1_ora_21759124.trc ORA-07445: 出现异常错误: 核心转储 [opiaba()+788] [SIGSEGV] [ADDR:0xF00000E161B7CD2] [PC:0x10662AFD4] [Address not mapped to object] [] *** 2015-09-16 23:44:23.229 dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x3, level=3, mask=0x0) ----- Current SQL Statement for this session (sql_id=by0ga2x4sm7ps) ----- begin execute immediate 'TRUNCATE TABLE sm_organization'; insert into sm_organization(groupcode,groupname,parentgroupcode) values(:1 ,:2 ,:3 ); ............. ; insert into sm_organization(groupcode,groupname,parentgroupcode) values(:412987 ,:412988 ,:412989 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:412990 ,:412991 ,:412992 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:412993 ,:412994 ,:412995 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:412996 ,:412997 ,:412998 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:412999 ,:413000 ,:413001 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413002 ,:413003 ,:413004 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413005 ,:413006 ,:413007 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413008 ,:413009 ,:413010 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413011 ,:413012 ,:413013 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413014 ,:413015 ,:413016 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413017 ,:413018 ,:413019 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413020 ,:413021 ,:413022 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413023 ,:413024 ,:413025 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413026 ,:413027 ,:413028 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413029 ,:413030 ,:413031 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413032 ,:413033 ,:413034 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413035 ,:413036 ,:413037 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413038 ,:413039 ,:413040 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413041 ,:413042 ,:413043 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413044 ,:413045 ,:413046 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413047 ,:413048 ,:413049 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413050 ,:413051 ,:413052 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413053 ,:413054 ,:413055 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413056 ,:413057 ,:413058 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413059 ,:413060 ,:413061 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413062 ,:413063 ,:413064 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413065 ,:413066 ,:413067 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413068 ,:413069 ,:413070 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413071 ,:413072 ,:413073 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413074 ,:413075 ,:413076 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413077 ,:413078 ,:413079 ); insert into sm_organization(groupcode,groupname,parentgroupcode) values(:413080 ,:413081 ,:413082 ); end; ----- Call Stack Trace ----- skdstdst <- ksedst1 <- ksedst <- dbkedDefDump <- ksedmp <- ssexhd <- 48bc <- opiprs <- kksParseChildCursor <- rpiswu2 <- kksLoadChild <- kxsGetRuntimeLock <- kksfbc <- kkspsc0 <- kksParseCursor <- opiosq0 <- kpooprx <- kpoal8 <- opiodr <- ttcpip <- opitsk <- opiino <- opiodr <- opidrv <- sou2o <- opimai_real <- ssthrdmain <- main <- start
Note:
触发ora-7445的PL/SQL是一条拼接的匿名块清空表后再多次insert的调用,绑定变量个数超过了41万, 在MOS中找到了篇note非常符合, instance crashed after ORA-7445 [opiaba] and ORA-600 [17147] (文档 ID 1466343.1),原因是当同一个sql/ plsql文本中使用的绑定变量个数大于65535时触发的bug 12578873, 导致数据库instance crash。
符合note中记录的在call stack 中含部分或全部下列函数调用:
opiaba opiprs rpiswu2 kksLoadChild kxsGetRuntimeLock kksfbc
Solution:
解决方法是可以打上补丁12578873, 该补丁只是避免类似SQL时宕机, 同时还是要建议修改SQL 使用少于65535的绑定变量个数, 本案例建议使用loop 修改, 另外注意大量的绑定变量使用在SQL解析时需要更久的DB TIME。