集团某在线系统使用OGG做了同步复制用于二期业务生产使用。有同事过来说复制进程有点异常
-bash-3.2$ ogg
Oracle GoldenGate Command Interpreter for Oracle
Version 11.2.1.0.6 16211226 OGGCORE_11.2.1.0.6_PLATFORMS_130418.1829_FBO
Solaris, sparc, 64bit (optimized), Oracle 11g on Apr 22 2013 15:23:39
Copyright (C) 1995, 2013, Oracle and/or its affiliates. All rights reserved.
GGSCI (ODSDB) 1> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
JAGENT STOPPED
REPLICAT RUNNING RP10 00:00:00 00:00:03
REPLICAT RUNNING RP7 00:00:06 00:00:03
REPLICAT RUNNING RPS1 42:41:14 00:00:01
REPLICAT RUNNING RPS2 00:00:00 00:00:10
REPLICAT RUNNING RPS2A 00:00:00 00:00:00
REPLICAT RUNNING RPS2B 00:00:00 00:00:00
REPLICAT RUNNING RPS3 00:00:00 00:00:02
REPLICAT RUNNING RPS3A 00:00:00 00:00:03
REPLICAT RUNNING RPS3B 00:00:00 00:00:02
REPLICAT RUNNING RPS4 00:00:00 00:00:07
REPLICAT RUNNING RPS4A 00:00:00 00:00:07
REPLICAT RUNNING RPS4B 00:00:00 00:00:07
REPLICAT RUNNING RPS4C 00:00:00 00:00:07
REPLICAT RUNNING RPS4D 00:00:00 00:00:08
REPLICAT RUNNING RPS4E 00:00:00 00:00:07
REPLICAT RUNNING RPS4F 00:00:00 00:00:07
REPLICAT RUNNING RPS4G 00:00:00 00:00:07
REPLICAT RUNNING RPS4H 00:00:00 00:00:07
REPLICAT RUNNING RPS4I 00:00:00 00:00:08
REPLICAT RUNNING RPS4J 00:00:00 00:00:07
跟他去了终端旁边,发现确实有个rps1进程延迟有点高。看着进程状态是RUNNING,但是根据我维护六年ogg来看,有可能是假象,那么怎么判断呢?
1,查看ggserr.log
2,查看dirrpt下面该进程的dsc文件
3,info 进程名多次,看看RBA是否有变化
经过以上几步rps1进程状态是正常的,延迟高问题出在哪里?view params rps1看到该进程只有一张表。那我们看看这个进程到底卡在哪里了
SQL> !ps -ef |grep rps1 |grep -v grep
orao 12169 11012 0 Jun 07 ? 614:58 /odogg/baseogg/replicat PARAMFILE /odsogg/baseogg/dirprm/rps1.prm REPORTFILE /
SQL> !ps -ef |grep 12169 |grep -v grep
orao 12170 12169 1 Jun 07 ? 2315:30 oracleodb (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
orao 12169 11012 0 Jun 07 ? 614:58 /odogg/baseogg/replicat PARAMFILE /odsogg/baseogg/dirprm/rps1.prm REPORTFILE /
SQL> select s.sid,s.SERIAL#,sql_id from v$process p,v$session s where p.addr=s.paddr and p.spid=12170;
SID SERIAL# SQL_ID
---------- ---------- -------------
5146 113 5099dwmx3s4mf
SQL> /
SID SERIAL# SQL_ID
---------- ---------- -------------
5146 113 dzubcf8jm69yx
SQL> /
SID SERIAL# SQL_ID
---------- ---------- -------------
5146 113 dzubcf8jm69yx
SQL> /
SID SERIAL# SQL_ID
---------- ---------- -------------
5146 113 dzubcf8jm69yx
询问了一下维护过该系统的同事,说是不是统计信息太旧。顺着思路往下走走
SQL> select OWNER,TABLE_NAME,LAST_ANALYZED from dba_tables where table_name='TF_F_USER_REVGRPTAG';
OWNER TABLE_NAME LAST_ANALYZED
-------------------------------------------------------------------------------
TTT TF_F_USER_REXXX 2017-06-26 00:20:56
SQL> select table_name,num_rows from dba_tables where owner='TTT' and table_name='TF_F_USER_REXXX';
TABLE_NAME NUM_ROWS
------------------------------ ----------
TF_F_USER_REXXX 409269832
SQL> select count(*) from TTT.TF_F_USER_REXXX;
COUNT(*)
----------
409955510
接下来我们看看该sql执行计划是不是走错了?
两条执行计划,很明显137632316运行了5400次但是执行时间却远远高于第二个执行计划。
用sql profile绑定一下执行计划
SQL> @coe_xfr_sql_profile.sql
Parameter 1:
SQL_ID (required)
Enter value for 1: dzubcf8jm69yx
PLAN_HASH_VALUE AVG_ET_SECS
--------------- -----------
2320424968 .004
137632316 21.119
Parameter 2:
PLAN_HASH_VALUE (required)
Enter value for 2: 2320424968
Values passed to coe_xfr_sql_profile:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SQL_ID : "dzubcf8jm69yx"
PLAN_HASH_VALUE: "2320424968"
查看是否绑定成功
SQL>select name from dba_sql_profiles;
NAME
------------------------------
coe_dzubcf8jm69yx_2320424968
SYS_SQLPROF_015d075ccdd40000
coe_a5hgtqfq09tcu_229988255
在将内存中的该sql语句清除出去
SQL>select address,hash_value,executions,parse_calls from v$sql where sql_TEXT like 'UPDATE "TTT"."TF_F_USER_RExxxx" SET "PARTITION_ID" = :a27,"USER_ID" = :a28%';
ADDRESS HASH_VALUE EXECUTIONS PARSE_CALLS
---------------- ---------- ---------- -----------
00000017673EA5E8 4198240878 100 0
00000017CF3E0000 897328139 673 0
00000017982D40D0 1588244011 8 0
000000049D04B1A8 590555101 10 0
00000017AA525230 4271852353 2 0
0000001777FC6A20 3666847391 735 0
00000017B9154A58 1180290247 1 0
7 rows selected.
SQL>alter session set events '5614566 trace name context forever';
Session altered.
SQL>exec dbms_shared_pool.purge('00000017673EA5E8,4198240878','C');
PL/SQL procedure successfully completed.
SQL>exec dbms_shared_pool.purge('00000017CF3E0000,897328139','C');
exec db
PL/SQL procedure successfully completed.
SQL>ms_shared_pool.purge('00000017982D40D0,1588244011','C');
PL/SQL procedure successfully completed.
SQL>exec dbms_shared_pool.purge('000000049D04B1A8,590555101','C');
PL/SQL procedure successfully completed.
SQL>exec dbms_shared_pool.purge('00000017AA525230,4271852353','C');
PL/SQL procedure successfully completed.
SQL>exec dbms_shared_pool.purge('0000001777FC6A20,3666847391','C');
PL/SQL procedure successfully completed.
SQL>exec dbms_shared_pool.purge('00000017B9154A58,1180290247','C');
PL/SQL procedure successfully completed.
观察ogg复制进程延迟已经开始减小。该隐患顺利解决!
GGSCI (ODSDB) 8> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
JAGENT STOPPED
REPLICAT RUNNING RP10 00:00:00 00:00:01
REPLICAT RUNNING RP7 00:00:00 00:00:07
REPLICAT RUNNING RPS1 31:10:29 00:00:50
REPLICAT RUNNING RPS2 00:00:00 00:00:00
REPLICAT RUNNING RPS2A 00:00:00 00:00:00
REPLICAT RUNNING RPS2B 00:00:00 00:00:00
REPLICAT RUNNING RPS3 00:00:00 00:00:01
REPLICAT RUNNING RPS3A 00:00:00 00:00:02
REPLICAT RUNNING RPS3B 00:00:00 00:00:01
REPLICAT RUNNING RPS4 00:00:00 00:00:05
REPLICAT RUNNING RPS4A 00:00:00 00:00:04
REPLICAT RUNNING RPS4B 00:00:00 00:00:04
REPLICAT RUNNING RPS4C 00:00:00 00:00:04
REPLICAT RUNNING RPS4D 00:00:00 00:00:04
REPLICAT RUNNING RPS4E 00:00:00 00:00:04
REPLICAT RUNNING RPS4F 00:00:00 00:00:04
REPLICAT RUNNING RPS4G 00:00:00 00:00:05
REPLICAT RUNNING RPS4H 00:00:00 00:00:04
REPLICAT RUNNING RPS4I 00:00:00 00:00:04
REPLICAT RUNNING RPS4J 00:00:00 00:00:04
本文转自yangjunfeng 51CTO博客,原文链接:http://blog.51cto.com/yangjunfeng/1944421