某客户操作人员反应很慢不能操作,管理人员登录小机系统后发现CPU使用到了96%。而且这种情况持续了几个月。以下是登录后小机后载取的topas图,而且是周末,并没有人使用系统。小机是IBM的550,配置是2颗6核的CPU,内存是48G。
如是登录数据库执行以下脚本来查看当前数据库消耗CPU最多的进程在执行什么
Connected to Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 Connected as gtp2 SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE, 2 (select c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext, 3 (select c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID 4 from v$session s,v$process p 5 where p.SPID in(491720,90116,127336,529102,987524,331990) 6 and s.event not like'%SQL*Net%' and s.USERNAME='GTP2' 7 order by s.wait_time desc 8 ; SID SPID USERNAME EVENT WAIT_TIME STATE SECONDS_IN_WAIT PROGRAM MACHINE SQL_FULLTEXT BIND_DATA SQL_ID ---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- ------------- 1020 90116 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 987524 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 331990 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 491720 GTP2 kksfbc child completion -1 WAITED SHORT TIME 53742 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 4 rows selected
SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE, 2 (select c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext, 3 (select c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID 4 from v$session s,v$process p 5 where p.SPID in(491720,90116,127336,529102,987524,331990) 6 and s.event not like'%SQL*Net%' and s.USERNAME='GTP2' 7 order by s.wait_time desc 8 ; SID SPID USERNAME EVENT WAIT_TIME STATE SECONDS_IN_WAIT PROGRAM MACHINE SQL_FULLTEXT BIND_DATA SQL_ID ---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- ------------- 1020 90116 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 987524 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 331990 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 1020 491720 GTP2 ksdxexeotherwait -1 WAITED SHORT TIME 3342 oracleorcl@dbserv WORKGROUP\WIN-AUQ43P0UU9L 063cu7y841kmc 4 rows selected
从上面的信息可以看到这些进程的等待事件为kksfbc child completion,ksdxexeotherwait。当看到这种情况时第一反应是不是遇到的BUG,以KKSFBC CHILD COMPLETION为关键字到MOS查询可以找到,该Bug的症状为进程不断spin且hang住、出现’KKSFBC CHILD COMPLETION’等待事件、还可能伴有’Waits for “cursor: pin S”‘等待事件,直接影响的版本有11.1.0.6、10.2.0.3和10.2.0.4。而我这里的版本是10.2.0.1。
对于该Bug的描述是在发生’kksfbc child completion’等待事件后会话陷入无休止的自旋(spins)中,这种自旋(spins)发生在由堆栈调用(stack call)kksSearchChildList->kkshgnc陷入对kksSearchChildList函数的无限循环中。需要更详细的stack call,如是对系统进程90116进行跟踪。
SQL> oradebug setospid 90116 Oracle pid: 40, Unix process pid: 90116, image: oracleorcl@dbserv SQL> oradebug unlimit; Statement processed. SQL> oradebug short_stack; ksdxfstk+002c< -ksdxcb+04e4<-sspuser+0068<-00004750<-kksfbc+0bb0<-kkspsc0+0f3c<-kksParseCursor+00d4<-opiosq0+0b10<-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1004<-opitsk+1000<-opiino+0990<-opiodr+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0070 SQL> oradebug dump processstate 10; Statement processed. SQL> oradebug dump systemstate 266; Statement processed. SQL> oradebug tracefile_name /oracle/admin/orcl/udump/orcl_ora_90116.trc
查看生成的跟踪文件orcl_ora_90116.trc有如下内容:
SO: 7000001486ab188, type: 4, owner: 70000014346c5a8, flag: INIT/-/-/0x00 (session) sid: 1020 trans: 0, creator: 70000014346c5a8, flag: (41) USR/- BSY/-/-/-/-/- DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000 txn branch: 0 oct: 0, prv: 0, sql: 7000001473dcf10, psql: 7000001225ac0c8, user: 82/GTP2 O/S info: user: gtp-default, term: WIN-AUQ43P0UU9L, ospid: 6708:12196, machine: WORKGROUP\WIN-AUQ43P0UU9L program: w3wp.exe last wait for 'kksfbc child completion' blocking sess=0x0 seq=2831 wait_time=48850 seconds since wait started=572057 =0, =0, =0 Dumping Session Wait History for 'kksfbc child completion' count=1 wait_time=48850 =0, =0, =0
可以从以上trace中看到会话确实曾长时间处于’kksfbc child completion’等待中,之后陷入无限自旋(spins)中消耗了大量CPU时间。但这里实际的表现又存有差异,引发无限循环的函数是kksfbc而不是kksSearchChildList(常规的调用序列是:kksParseCursor->kkspsc0->kksfbc ->kksSearchChildList->kkshgnc)。kksfbc意为K[Kernel]K[Kompile]S[Shared]F[Find]B[Best]C[Child]该函数用以在软解析时找寻合适的子游标,在10.2.0.2以后引入了mutex互斥体来取代原有的Cursor Pin机制,Mutex较Latch更为轻量级。虽然mutex的引入改变了众多cursor pin的内部机制,但kksfbc仍需要持有library cache latches才能扫描library cache hash chains。另一方面当kksfbc函数针对某个parent cursor找到合适child cursor后,可能使用KKSCHLPINx方法将该child cursor pin住,这个时候就需要exclusive地持有该child cursor相应的mutex。Oracle在10.2.0.4上提供了该Bug的one-off Patch
8575528,其在10.2.0.4 psu4以后的等价补丁为(Equivalent patch)为merge patch 9696904:8557428 9696904 7527908 Both fixes are needed. 6795880 superceded by 8575528 in 9696904 which includes extra files so may cause new conflicts。但merge patch 9696904目前仅有Linux x86/64平台上的版本,而问题数据库所在平台为IBM AIX on POWER Systems (64-bit),而且版本是10.2.0.1。那么要解决这个问题是不是没有办法了,其实不然,我们可以将数据库从10.2.0.1升级到10.2.0.5来解决这个BUG,在升级到10.2.0.5之后确实解决这个问题。