Environment:
OS: HP-UX
Database: Oracle 10.2.0.5 (two-node RAC)
Fault description: It began around 9:00 AM on the 27th, when an application developer ran an UPDATE statement and found it running for a long time without completing. He located the session in the database and tried to kill it, but the corresponding OS process survived the kill; kill -9 at the OS level could not remove it either. Sessions then accumulated steadily until about 6:00 PM, by which point the database server could no longer bear the backlog (24,800 sessions). Around 7:00 PM, after consulting the business side, we decided to restart the database. shutdown immediate would not complete; shutdown abort did, but the subsequent startup failed to bring the instance up because many UPDATE-related processes were still stuck in the background, so we restarted the host servers. Node 1 came back normal after its reboot, but node 2 then began showing the same symptoms node 1 had. To avoid repeated reboots that might not clear the fault, node 2 was left untouched to preserve the scene until the diagnosis was complete.
The fault was then localized as follows:
Querying the sample history in dba_hist_active_sess_history showed active sessions beginning to pile up shortly after 10:20 AM:
select sample_time, count(*)
  from dba_hist_active_sess_history
 where dbid = 3965601902
   and snap_id > 49716
   and sample_time > to_date('20150527 10:18:00', 'yyyymmdd hh24:mi:ss')
 group by sample_time
 order by sample_time;

SAMPLE_TIME                                COUNT(*)
---------------------------------------- ----------
27-MAY-15 10.18.04.924 AM                         4
27-MAY-15 10.18.12.694 AM                         3
27-MAY-15 10.18.15.024 AM                         3
27-MAY-15 10.18.22.794 AM                         1
27-MAY-15 10.18.25.124 AM                         4
27-MAY-15 10.18.35.224 AM                         3
27-MAY-15 10.18.45.324 AM                         1
27-MAY-15 10.18.55.434 AM                         2
27-MAY-15 10.19.03.204 AM                         1
27-MAY-15 10.19.13.304 AM                         1
27-MAY-15 10.19.23.404 AM                         2
27-MAY-15 10.19.35.834 AM                         2
27-MAY-15 10.19.43.614 AM                         1
27-MAY-15 10.19.45.934 AM                         1
27-MAY-15 10.19.53.714 AM                         2
27-MAY-15 10.19.56.044 AM                         2
27-MAY-15 10.20.03.814 AM                         1
27-MAY-15 10.20.06.144 AM                         1
27-MAY-15 10.20.13.914 AM                         1
27-MAY-15 10.20.16.244 AM                         1
27-MAY-15 10.20.24.014 AM                         3
27-MAY-15 10.20.26.343 AM                         2
27-MAY-15 10.20.34.114 AM                         1
27-MAY-15 10.20.36.443 AM                         3
27-MAY-15 10.20.44.224 AM                         1
27-MAY-15 10.20.46.553 AM                         1
27-MAY-15 10.20.54.324 AM                         9
27-MAY-15 10.21.04.424 AM                        13
27-MAY-15 10.21.06.753 AM                         7
27-MAY-15 10.21.14.525 AM                        21
27-MAY-15 10.21.16.853 AM                        17
27-MAY-15 10.21.24.624 AM                        32
27-MAY-15 10.21.26.953 AM                         1
27-MAY-15 10.21.34.814 AM                        50
27-MAY-15 10.21.37.053 AM                         2
27-MAY-15 10.21.44.914 AM                        49
27-MAY-15 10.21.47.163 AM                         2
27-MAY-15 10.21.55.014 AM                        61
27-MAY-15 10.21.57.263 AM                         2
27-MAY-15 10.22.05.116 AM                        69
27-MAY-15 10.22.07.363 AM                        12
27-MAY-15 10.22.15.216 AM                        77
27-MAY-15 10.22.17.463 AM                        12
27-MAY-15 10.22.25.316 AM                        87
27-MAY-15 10.22.27.562 AM                        13
27-MAY-15 10.22.35.426 AM                       112
27-MAY-15 10.22.37.662 AM                        47
27-MAY-15 10.22.45.526 AM                       121
27-MAY-15 10.22.47.772 AM                         7
27-MAY-15 10.22.55.626 AM                       129
27-MAY-15 10.22.57.872 AM                        10
select sample_time, event, count(*)
  from dba_hist_active_sess_history
 where dbid = 3965601902
   and snap_id > 49716
   and sample_time > to_date('20150527 10:20:00', 'yyyymmdd hh24:mi:ss')
 group by sample_time, event
 order by sample_time;

SAMPLE_TIME                ceshi      EVENT                                          COUNT(*)
---------------------------------------- --------------------------------------------- ----------
27-MAY-15 10.20.03.814 AM                                                                       1
27-MAY-15 10.20.06.144 AM                db file sequential read                                1
27-MAY-15 10.20.13.914 AM                                                                       1
27-MAY-15 10.20.16.244 AM                                                                       1
27-MAY-15 10.20.24.014 AM                db file sequential read                                1
27-MAY-15 10.20.24.014 AM                                                                       2
27-MAY-15 10.20.26.343 AM                                                                       2
27-MAY-15 10.20.34.114 AM                db file sequential read                                1
27-MAY-15 10.20.36.443 AM                                                                       3
27-MAY-15 10.20.44.224 AM                                                                       1
27-MAY-15 10.20.46.553 AM                                                                       1
27-MAY-15 10.20.54.324 AM                db file sequential read                                1
27-MAY-15 10.20.54.324 AM                direct path write                                      8
27-MAY-15 10.21.04.424 AM                db file sequential read                                1
27-MAY-15 10.21.04.424 AM                direct path write                                     10
27-MAY-15 10.21.04.424 AM                enq: TX - row lock contention                          1
27-MAY-15 10.21.04.424 AM                                                                       1
27-MAY-15 10.21.06.753 AM                db file sequential read                                1
27-MAY-15 10.21.06.753 AM                                                                       6
27-MAY-15 10.21.14.525 AM                db file sequential read                                1
27-MAY-15 10.21.14.525 AM                direct path write                                     17
27-MAY-15 10.21.14.525 AM                enq: TX - row lock contention                          2
27-MAY-15 10.21.14.525 AM                                                                       1
27-MAY-15 10.21.16.853 AM                gc current block 2-way                                 1
27-MAY-15 10.21.16.853 AM                gc current grant 2-way                                 1
27-MAY-15 10.21.16.853 AM                kjbdrmcvtq lmon drm quiesce: ping completion           1
27-MAY-15 10.21.16.853 AM                log file sync                                          8
27-MAY-15 10.21.16.853 AM                wait for scn ack                                       1
27-MAY-15 10.21.16.853 AM                                                                       5
27-MAY-15 10.21.24.624 AM                db file sequential read                                1
27-MAY-15 10.21.24.624 AM                direct path write                                     26
27-MAY-15 10.21.24.624 AM                enq: TX - row lock contention                          3
27-MAY-15 10.21.24.624 AM                gcs drm freeze in enter server mode                    1
27-MAY-15 10.21.24.624 AM                                                                       1
27-MAY-15 10.21.26.953 AM                                                                       1
27-MAY-15 10.21.34.814 AM                db file sequential read                                1
27-MAY-15 10.21.34.814 AM                direct path write                                     33
27-MAY-15 10.21.34.814 AM                enq: TX - row lock contention                          4
27-MAY-15 10.21.34.814 AM                gcs drm freeze in enter server mode                    1
27-MAY-15 10.21.34.814 AM                row cache lock                                        10
27-MAY-15 10.21.34.814 AM                                                                       1
27-MAY-15 10.21.37.053 AM                db file sequential read                                1
27-MAY-15 10.21.37.053 AM                                                                       1
27-MAY-15 10.21.44.914 AM                db file sequential read                                1
27-MAY-15 10.21.44.914 AM                direct path write                                     42
27-MAY-15 10.21.44.914 AM                enq: TX - row lock contention                          4
27-MAY-15 10.21.44.914 AM                enq: US - contention                                   1
27-MAY-15 10.21.44.914 AM                gcs drm freeze in enter server mode                    1
27-MAY-15 10.21.47.163 AM                log file parallel write                                1
27-MAY-15 10.21.47.163 AM                                                                       1
27-MAY-15 10.21.55.014 AM                db file sequential read                                1
27-MAY-15 10.21.55.014 AM                direct path write                                     42
27-MAY-15 10.21.55.014 AM                enq: TX - row lock contention                          5
27-MAY-15 10.21.55.014 AM                gc current block 2-way                                 1
27-MAY-15 10.21.55.014 AM                gc current grant 2-way                                 2
27-MAY-15 10.21.55.014 AM                kjbdrmcvtq lmon drm quiesce: ping completion           1
27-MAY-15 10.21.55.014 AM                log file sync                                          3
27-MAY-15 10.21.55.014 AM                wait for scn ack                                       1
27-MAY-15 10.21.55.014 AM                                                                       5
27-MAY-15 10.21.57.263 AM                db file parallel write                                 1
27-MAY-15 10.21.57.263 AM                                                                       1
In the breakdown above, rows with a blank EVENT are sessions sampled on CPU. The dominant wait is 'direct path write', so next we break those waits down by instance and by resource id (P1) to see where the wait samples piled up:
select instance_number, p1, count(*)
  from dba_hist_active_sess_history
 where dbid = 3965601902
   and snap_id > 49716
   and event = 'direct path write'
   and sample_time > to_date('20150527 10:20:00', 'yyyymmdd hh24:mi:ss')
 group by instance_number, p1
 order by 3;

INSTANCE_NUMBER         P1   COUNT(*)
--------------- ---------- ----------
              1        461          1
              1        494          1
              1        372          1
              1        495          1
              2        581          1
              2        285          1
              1        479          1
              2        495          1
              2        317          1
              1        587          1
              1        543          1
              2        316          1
              2        438          1
              2        498          1
              2        497          1
              2        499          2
              1        520          2
              2        437          2
              1        466          2
              2        496          2
              1        556          2
              1        439          2
              1        438          2
              1        316          2
              1        498          2
              2        439          3
              1        474          3
              1        436          3
              1        519          3
              1        496          4
              1        475          4
              1        497          4
              2        475          4
              2        494          4
              2        436          4
              1        499          5
              1        437          6
              2        474          7
              1        589      59128
              2        588      84346
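Two rows dominate: P1 = 589 on instance 1 (59,128 samples) and P1 = 588 on instance 2 (84,346 samples). For the 'direct path write' event, P1 is the absolute file number, which is why the next step maps these values against dba_data_files; if in doubt, the parameter meanings can be confirmed from v$event_name (a sanity check added here, not part of the original log):

SQL> select parameter1, parameter2, parameter3 from v$event_name where name = 'direct path write';

For this event PARAMETER1 reports 'file number', so 588 and 589 can be looked up as file ids.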
SQL> select file_name, tablespace_name from dba_data_files where file_id=&file;
Enter value for file: 588
old   1: select file_name,tablespace_name from dba_data_files where file_id=&file
new   1: select file_name,tablespace_name from dba_data_files where file_id=588

FILE_NAME                         TABLESPACE_NAME
--------------------------------- ------------------------------
/dev/essdb2vg5/rdb2vg5_10_lv001   TBS_TRADE_D10

SQL> /
Enter value for file: 589
old   1: select file_name,tablespace_name from dba_data_files where file_id=&file
new   1: select file_name,tablespace_name from dba_data_files where file_id=589

FILE_NAME                         TABLESPACE_NAME
--------------------------------- ------------------------------
/dev/essdb2vg5/rdb2vg5_10_lv002   TBS_TRADE_D09
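What these two tablespaces actually contain can be listed from dba_segments; a sketch of such a check (the query is an addition, not in the original notes), which is how one would arrive at the conclusion below:

SQL> select owner, segment_name, partition_name, segment_type from dba_segments where tablespace_name in ('TBS_TRADE_D10', 'TBS_TRADE_D09');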
Follow-up handling: this is an e-commerce platform and the business side is very sensitive about its data, so the first consideration was a backup. The two problem raw devices turned out to belong to two different tablespaces, each holding the photo-column partition of identically named partitioned tables under two different schemas. The first step was to test whether the data could still be read, using the query below. The read failed, which ruled out any form of backup (exp, expdp, or rman alike). The business side required service to be restored by 9:00 AM on the 28th, and after repeated attempts it was already past 10:00 PM on the 27th; a full RMAN restore would take an unpredictable amount of time, so we settled on archive-based recovery of just the two datafiles.
select /*+ parallel(t 8)*/* from "UCR_TRADE_03"."TF_F_IDCARD" partition(PAR_TF_B_IDCARD_VERIFY_8) t;
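Apart from the forced scan above, 10g RMAN can also probe block readability without writing a backup; an alternative check (not taken from the original log), with any corrupt blocks it finds listed in v$database_block_corruption:

RMAN> backup validate check logical datafile 588, 589;

SQL> select * from v$database_block_corruption;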
The archive-based recovery procedure: ask the storage engineer to prepare two raw devices of the same size as the failed ones and assign the proper database user permissions; take the two problem datafiles offline in the database; re-create (redirect) them onto the new raw devices; then copy over to node 1 the archived logs node 2 had generated since 6:00 that morning, when these datafiles were added during a database expansion, and run recover.
The new raw devices are:
/dev/essdb2vg7/rdb2vg7_10_lv001
/dev/essdb2vg7/rdb2vg7_10_lv002
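The workflow calls for taking the problem datafiles offline before re-creating them; that step is not captured in the log, so here is a minimal sketch using the file numbers 588 and 589 identified earlier:

SQL> alter database datafile 588 offline;
SQL> alter database datafile 589 offline;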
After the datafiles are offline, redirect them to the new devices:
SQL> alter database create datafile 588 as '/dev/essdb2vg7/rdb2vg7_10_lv001';
SQL> alter database create datafile 589 as '/dev/essdb2vg7/rdb2vg7_10_lv002';
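ALTER DATABASE CREATE DATAFILE places an empty copy of the file on the new device; its contents must then be rebuilt from archived logs reaching back to the file's creation, which is why the archives since the 6:00 AM expansion are needed. Whether the files are now flagged for media recovery can be verified with a query like this (an added check, not in the original log):

SQL> select file#, online_status, error, change# from v$recover_file;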
Recover using the archived logs:
SQL> recover datafile '/dev/essdb2vg7/rdb2vg7_10_lv001';
SQL> recover datafile '/dev/essdb2vg7/rdb2vg7_10_lv002';
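recover datafile prompts for each suggested archived log. Since the node 2 archives were copied over, either enter the copied path at each prompt or, if they sit in the expected archive destination, let SQL*Plus apply the suggestions automatically; a usage sketch (not from the original log):

SQL> set autorecovery on
SQL> recover datafile '/dev/essdb2vg7/rdb2vg7_10_lv001';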
After recovery completes, bring the datafiles online:
SQL> alter database datafile '/dev/essdb2vg7/rdb2vg7_10_lv001' online;
SQL> alter database datafile '/dev/essdb2vg7/rdb2vg7_10_lv002' online;
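A quick confirmation that both files are back online (an added verification step, not in the original log):

SQL> select file#, name, status from v$datafile where file# in (588, 589);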
If necessary, fall back to the point of failure:
Stop the database:
shutdown immediate;
startup mount;
alter database rename file '/dev/essdb2vg5/rdb2vg5_10_lv038' to '/dev/essdb2vg5/rdb2vg5_10_lv001';
alter database rename file '/dev/essdb2vg5/rdb2vg5_10_lv039' to '/dev/essdb2vg5/rdb2vg5_10_lv002';
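To complete this fallback sketch, the database would then be opened; depending on how consistent the renamed files are, media recovery may be required first (an assumption on my part, not covered in the original log):

SQL> alter database open;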
Fortunately, the archive-based datafile recovery succeeded.