【OGG】OGG故障错误处理总结
第一章 GoldenGate错误分析与处理
在维护GoldenGate过程中,由于各种意外情况,难免还是会遇到各种各样的问题。掌握一些常见的GoldenGate故障诊断和错误分析的方法是非常有必要的,而且掌握这些错误分析工具也进一步加深对GoldenGate产品的认识与对GoldenGate原理的理解。
1.1 GoldenGate常见异常处理
GoldenGate运行起来后,随着时间的推移可能会碰到各种各样的问题,下面就来介绍常见的异常现象以及常见的异常处理方法。
1.1.1 异常处理的一般步骤
首先确定是GoldenGate的哪类进程有故障(是抽取,投递还是复制进程有问题),解决故障的一般思路如下。
(1)通过GGSCI>view report命令查找ERROR字样,确定错误原因并根据其信息进行排除。
(2)通过GGSCI>view ggsevt查看告警日志信息。
(3)检查两端数据库是否正常运行,网络是否连通。
(4)通过logdump工具对队列文件进行分析。
1.1.2 RAC单节点失败
在RAC环境下,GoldenGate软件安装在共享目录下,可以通过任一个节点连接到共享目录,启动GoldenGate运行界面。如果其中一个节点失败,导致GoldenGate进程中止,可直接切换到另外一个节点继续运行。
操作步骤如下。
(1)以Oracle用户登录源系统(使用另外一个正常的节点)。
(2)确认将GoldenGate安装的所在文件系统装载到另一节点相同目录。
(3)确认GoldenGate安装目录属于Oracle用户及其所在组。
(4)确认Oracle用户及其所在组对GoldenGate安装目录拥有读写权限。
(5)进入GoldenGate安装目录。
(6)执行./ggsci进入命令行界面。
(7)执行start mgr启动MGR。
(8)执行start er *启动所有进程。
检查各进程是否正常启动,即可进入正常复制。
1.1.3 Extract常见异常
以下为列举的一些常见错误信息作参考用。
Extract进程包括抽取与投递进程,投递进程报错大部分原因是由于网络故障。对于源数据库,抽取进程ext**如果变为abended,则可以通过在GGSCI中使用view report命令查看报告,可以通过搜索ERROR快速定位错误。
一般情况下,抽取异常的原因是因为其无法找到对应的归档日志,可以通过到归档日志目录命令行下执行
示例9-1:
ls –lt arch_x_xxxx.arc
查看该日志是否存在,如不存在则可能的原因如下。
“ 日志已经被压缩。
“ GoldenGate无法自动解压缩,需要人工解压缩后才能读取。
“ 日志已经被删除。
如果日志已经被删除,需要进行恢复才能继续复制。
一般需要定期备份归档日志,并清除旧的归档日志。需要保证归档日志在归档目录中保留足够长时间之后,才能被备份和清除。即定期备份清除若干小时之前的归档,而不是全部归档。保留时间计算如下。
某归档文件保留时间?抽取进程处理完该文件中所有日志所需的时间。
可以通过命令行或者GoldenGate Director Web界面,运行info extxx showch命令查看抓取进程ext处理到哪条日志序列号。在此序列号之前的归档,都可以被安全的清除。
抽取进程在抽取不支持的数据对象时也会abend,report文件会有详细的报错信息,根据report文件来定位错误信息然后再排错即可。
下面再单独列出更多的几个故障。
(1)Extract: Application failded to initialize(Win)。
错误信息:run GGSCI command but the Alert window report "Application failded to initialize(0xc000026e)"。
GoldenGate在Windows平台上需要安装Microsoft Visual C ++ 2005 SP1 Redistributable Package。如果是Microsoft Itanium平台,需要安装vcredist_IA64.exe。
Windows 2008需以下额外操作:右击‘cmd’ (DOS),选择‘run as administrator’,然后在该命令行窗口中启动MGR和Extract才能够读取数据库日志。
将OGG安装为服务时(即运行“install ADDSERVICE”),需要使用管理员权限,这样启动服务后即能访问日志。
通过以下方法为运行MGR和Extract的用户添加读取日志文件的权限,右键单击文件->property->security->edit->add。
(2)Extract: Cannot load program./ggsci…
错误分析:请首先检查该OGG Build是否与操作系统和数据库相符;其次如果是Aix请检查xLC版本是否符合10.0以上。
另外,检查环境变量中动态库路径是否包含了数据库动态库目录,例如:
示例9-2:
export LD_LIBRARY_PATH=$ORACLE_HOME/lib
不同平台下的环境变量不同。
“ AIX LIBPATH。
“ Solaris、Linux等 LD_LIBRARY_PATH。
“ HP-Unix SHLIB_PATH。
重设环境变量需重启Mgr和Ext/Rep进程。
(3)Extract: Block size mismatch (8192/512)…
裸设备的偏移量各操作系统默认为0,但AIX默认为4096。当创建裸设备时使用了-TO选项时,Oracle不会跳过4096字节而是直接从0开始读写。 因此在AIX下使用裸设备时,出现此错误需要指定OGG从偏移量0开始读取。
示例9-3:
tranlogoptions rawdeviceoffset 0
该参数其在实际环境中使用几率非常高,在以前版本中如果缺少此参数Extract立即终止,但新版本Extract会持续进行尝试,并不自动终止,需检查报告文件。
(4)Extract: ORA-15000 ASM connection error
该错误为OCI错误,表示Extract是在连接数据库时出现问题,根据错误信息判断为权限问题。
首先在Extract参数中检查ASM相关参数tranlogoptions asmuser sys@+ASM1,asmpassword oracle,再检查tnsnames.ora和listener.ora验证ASM实例配置是否正确,确认ASM用户具有SYSDBA 权限;如果使用SYS,需要将ASM实例的init.ora中REMOTE_LOGIN_PASSWORDFILE参数设置为SHARED(多个数据库可以使用一个password文件,只有SYS用户可以远程登录)。
使用sqlplus验证:
示例9-4:
sqlplus sys/oracle@asm1 as sysdba; //可以登录
sqlplus sys/oracle@asm1; //报告15000错误
(5)Extract: Encountered SCN That Is Not Greater Than The Highest SCN Already Processed…
原因分析:在Oracle RAC环境中,Extract会启动一个coordinator线程对各个节点上的操作进行根据SCN进行排序,它在交易提交后会等待THREADOPTIONS MAXCOMMITPROPAGATIONDELAY参数所定义时间来确认空闲节点没有交易,然后再收集交易数据;写入该交易后如果空闲节点后来又读到了一个SCN号要小的交易,则会报告该错误。
可能原因:
“ 各节点之间没有配置时钟同步。
“ 一个节点比另外一个节点慢(IO问题可能性较大)。
解决办法:
调整Extract参数:
示例9-5:
THREADOPTIONS MAXCOMMITPROPAGATIONDELAY <msec> IOLATENCY <msec>
MAXCOMMITPROPAGATIONDELAY有效范围是0-90000ms,默认为3s(即3000ms)。
GGS V9.x多了一个IOLATENCY参数,可以与上面参数一起加大等待时间。IOLATENCY默认为1.5s,最大值为180000。
建议出现该错误后可以将此二参数设置为较大值,然后逐步降低获取最佳设置。
需要补充说明的是,出现此错误后,因后面的交易可能已被写入日志,重启Extract可成功启动,但是可能出现如下问题:Extract会重写当前队列覆盖前面的交易数据,后面的Data Pump进程可能会出现“abend with incompatible record errors”错误终止(旧版本可能出现)。
此问题的恢复步骤如下。
① 停止所有Data Pump和Replicat,针对所有的Extract记录其Write Checkpoint的队列Seqno。
② 对于每个Extract向下滚动一个队列:
示例9-6:
ALTER EXTRACT [name], ETROLLOVER
启动Extract查看是否滚动到了下一个队列,记录其新队列seqno,应当是旧队列号+1。
③ 修改Data Pump从新的队列开始传输:
示例9-7:
ALTER EXTRACT [pump_name], EXTSEQNO ##### EXTRBA 0
重启Data Pump查看是否能够重启成功并从新的队列传输。
④ 修改Replicat参数文件,加入或者打开HANDLECOLLISIONS,如果有GROUPTRANSOPS和MAXTRANSOPS请注释掉,启动Replicat,观察其是否能够读取新传输过来的队列如Replicat无法自动滚动到下一个队列,需要通过如下命令手工滚动:
示例9-8:
alter replicat [replicat_name], EXTSEQNO ##### EXTRBA 0
等待Replicat处理到结尾没有延迟时,可以关闭HANDLECOLLISIONS和恢复原来的GROUPTRANSOPS和MAXTRANSOPS参数。
⑤ 重新启动Replicat即可恢复正常复制。
1.1.4 网络故障
如果MGR进程参数文件里面设置了autorestart参数,GoldenGate可以自动重启,无需人工干预。
当网络不稳定或者发生中断时, GoldenGate负责产生远地队列的Pump进程会自动停止。 此时,MGR进程会定期根据mgr.prm里面autorestart设置自动启动Pump进程以试探网络是否恢复。在网络恢复后,负责产生远程队列的Pump进程会被重新启动,GoldenGate的检查点机制可以保证进程继续从上次中止复制的日志位置继续复制。
需要注意的是,因为源端的抽取进程(Capture)仍然在不断地抓取日志并写入本地队列文件,但是Pump进程不能及时把本地队列搬动到远地,所以本地队列文件无法被自动清除而堆积下来,需要保证足够容量的存储空间来存储堆积的队列文件。计算公式如下。
存储容量?单位时间产生的队列大小×网络故障恢复时间
MGR定期启动抓取和复制进程参数配置参考:
示例9-9:
GGSCI > edit param mgr
port 7809
autorestart er *,waitminutes 3,retries 5,RESETMINUTES 60
每3分钟重试一次,5次重试失败以后等待60分钟,然后重新试三次。
1.1.5 Replicat进程常见异常
对于目标数据库,投递进程repXX如果变为abended,则可以通过在GGSCI中使用view report命令查看报告,可以通过搜索ERROR快速定位错误。
复制进程的错误通常为目标数据库错误,比如:
“ 数据库临时停机。
“ 目标表空间存储空间不够。
“ 目标表出现不一致。
可以根据报告查看错误原因,排除后重新启动rep进程即可。
需要注意一点:往往容易忽略UNDO表空间。如果DML语句中包含了大量的UPDATE和DELETE操作,则目标端UNDO的生成速度会很快,有可能填满UNDO表空间。
典型错误(数据复制典型错误)如下:
示例9-10:
- SQL error 1403 mapping 2010-02-25 13:20:08 GGS WARNING 218 Oracle GoldenGate Delivery for Oracle, rep_stnd.prm: SQL error 1403 mapping HR.MY_EMPLOYEE to HR.MY_EMPLOYEE.
可能原因包括以下几个方面。
“ 两端结构不一致(异构环境,列和主键不同)。
“ 两端有不一致记录。
“ 附加日志不全。
可以到discard文件中查看具体错误信息,如果为UPDATE或者DELETE找不到对应记录,并且某几个字段为空,则可认定为缺少了附加日志。
1.2 使用reperror进行错误处理
对于Replicat进程处理DML操作过程中报错时,GoldenGate提供了一个参数用来控制如何处理Replicat进程的报错。这就是本节内容要介绍的reperror参数。这个参数能控制大部分的GoldenGate错误处理的手段。
如某案例的Replicat进程参数如图9-1所示。
图9-1
1.2.1 reperror处理类型与含义
Reperror在GoldenGate11版本*提供了7类处理错误方式,分别如下。
(1)abend:Replicat遇到不能处理的记录时,回滚事务,然后停止处理,Replicat进程状态转为abend。
(2)discard:将不能处理记录的错误信息放到discard文件而Replicat进程继续处理下面的记录。
(3)exception:将错误按照预先定义好的方式处理。
(4)ignore:将不能处理的记录忽略掉,然后继续处理下面的记录。
(5)retryop [maxretries <n>]:遇到不能处理的记录时,重试n次。
(6)transabort [,maxretries <n>][, delay[c]sesc<n>];终止事务处理,将rba号指到该事务的开头,也可以指定重试几次。
(7)reset:清除掉所有的reperror规则,然后将reperror的规则默认为abend。
在Replicat进程的参数中,可以将任意一个处理类型设置为默认,如reperror、default、abend。
通常,为了保证数据的一致性,都将reperror的默认规则设置为abend。
1.2.2 复制进程常见数据库错误类型与处理方法
在实际的GoldenGate系统中,很大一部分Replicat错误信息都类似于ORA开头的数据库错误(这里以Oracle数据库为例)。虽然,通常对于ORA错误,需要手动查找数据库的原因,但可以用reperror处理一些预知的错误类型,然后再在数据库层面找到错误的原因,手动排除,而不至于导致该进程处理其他正常的表而abend掉。
例如:可以忽略掉重复数据的插入而其他类型的报错则abend。
示例9-11:
Reperror (default, abend)
Reperror (-1, ignore)
当然,也可以只针对某张表的忽略掉重复数据的插入而abend掉其他类型的报错。
示例9-12:
REPERROR (-1, IGNORE)
MAP sales.product, TARGET sales.product;
REPERROR RESET
MAP sales.account, TARGET sales.account;
最常见的错误为ORA-1403。
1403错误是指记录无法投递到目标库,纯属数据错误,要通过查看错误信息和discard文件,到两端库寻找相应记录,结合logdump分析队列中的实际数据,再分析出问题的原因。可能存在的原因有:两端表结构不一致;附加日志错误;初始化方法错误导致不一致;目标端级联删除、trigger没有被禁止;目标端存在Oracle的job或者操作系统任务修改数据。
处理方法:
“ 重新初始化该表。
“ 手工修复该条数据。
“ 修改reperror参数为discard或ignore模式,忽略掉错误(在使用这个参数之前用户应该非常清楚自己在做什么,因为它会导致两端数据不一致)。
1.3 Ddlerror处理DDL复制错误
当GoldenGate打开了DDL复制时,当DDL复制报错时,则需要用到此处的ddlerror参数预处理一些常见的报错信息。Ddlerror对于抽取、复制进程均有效,默认为abend。
Ddlerror参数的语法为:
示例9-13:
DDLERROR
{<error> | DEFAULT} {<response>}
[RETRYOP MAXRETRIES <n> [RETRYDELAY <delay>]]
{INCLUDE <inclusion clause> | EXCLUDE <exclusion clause>}
[IGNOREMISSINGTABLES | ABENDONMISSINGTABLES]
如当DDL复制报ORA-1430错误,传递了重复的alter语句导致,则可以用ddlerror (1430, discard)将错误信息扔到discard文件里。
其他的错误处理与reperror类似。
1.4 Discardfile记录进程错误信息
用discardfile 这个参数来生成一个discard文件,将GoldenGate不能处理的信息记录到这个文件。这样对GoldenGate的troubleshooting非常的有帮助。
如源端表结构有变化,默认传递过来的数据应用时Replicat进程则报错,此时则可以通过discard文件看到报错信息位哪个表做了怎样的alter操作,再在目标端也将表结构改变一些,错误即可排除。
Discard文件默认在GoldenGate安装目录的dirrpt子文件夹,如图9-2所示。
图9-2
Discard文件记录的报错信息如图9-3所示。
图9-3
1.5 GoldenGate常见错误分析
(1)解决GoldenGate错误的一个关键点就是通过错误分析工具(包括report文件,ggserr.log discard文件logdump工具,GGSCI命令行)确定错误的根源是哪个组件引起的。
“ 系统或者网络?
“ 数据库报错或者应用报错?
“ GoldenGate安装报错?
“ GoldenGate的某个进程报错?
“ GoldenGate的参数配置文件报错?
“ SQL语句或者存储过程报错?
然后再确定错误的原因,逐个排查。
(2)当GoldenGate遇到错误时,则可以借助日志、report文件找到错误原因,一步一步来排查。一般的错误信息GoldenGate都会提示有相应的解决办法。
如下介绍一个错误案例:
通过命令:
示例9-14:
GGSCI>view ggsevt
看到的报错信息如图9-4所示。
图9-4
通过view report dpeyb 看到的也是类似的信息。
再来观察容灾端复制进程的报错信息为:
示例9-15:
2011-03-02 12:03:37 ERROR OGG-01028 Incompatible record in ./dirdat/ yb018262, rba 72955479 (getting header).
通过logdump进入到该trail文件查看,如图9-5所示。
图9-5
通过分析推敲等,确认是因为trail文件有一条记录已损坏,导致投递进程不识别,不能自动翻滚到下一个trail文件,而复制进程也不能自动应用到下一个trail文件,Pump进程通过手动etrollover,复制进程通过alter手动指定到下一个trail文件应用,故障即可排除。
1.5.1 AIX GGSCI无法运行
错误信息:
示例9-16:
Cannot load ICU resource bundle 'ggMessage', error code 2 - No such file or directory
Cannot load ICU resource bundle 'ggMessage', error code 2 - No such file or directory
IOT/Abort trap (core dumped)
或者GGSCI可以启动,但是运行任何命令都报上面的错误。
处理方法:通常使用已有的mount点安装GoldenGate,在mount时使用了并发CIO参数。新建文件系统,重新mount,作为GoldenGate安装目录。
错误信息:
示例9-17:
$ ./ggsci
exec(): 0509-036 Cannot load program GGSCI because of the following errors:
0509-130 Symbol resolution failed for GGSCI because:
0509-136 Symbol _GetCatName__FiPCc (number 158) is not exported from dependent module /usr/lib/libC.a[ansi_64.o].
0509-136 Symbol _Getnumpunct__FPCc (number 162) is not exported from dependent module /usr/lib/libC.a[ansi_64.o].
0509-136 Symbol __ct__Q2_3std8_LocinfoFPCci (number 183) is not exported from dependent module /usr/lib/libC.a[ansi_64.o].
0509-192 Examine .loader section symbols with the 'dump -Tv' command.
原因是XLC是6.0版本,升级XLC版本到10.1以上,问题即可解决。
1.5.2 HP-UX GGSCI无法运行
错误信息:core dumped
该问题只在HP-UX11.31上发现。
处理方法:环境变量没有设置正确。
1.5.3 队列文件保存天数
在mgr.prm中,添加:
示例9-29:
PURGEOLDEXTRACTS ./dirdat/*,usecheckpoints, minkeepdays 3
修改之后,必须重启manager即可看到队列文件占用的空间被按照上面指定的规则释放。
如果存储空间不够,可以将minkeepdays修改为MINKEEPHOURS。
如果源端存储空间不足,最好修改最少保留的时间。
9.5.12 复制进程拆分及指定队列文件及RBA
拆分前通过INFO XXX获取队列文件信息及RBA号,返回样例如下:
示例9-30:
GGSCI> INFO REPYXA
REPLICAT REPYXA Last Started 2011-01-08 19:48 Status RUNNING
Checkpoint Lag 00:00:00 (updated 00:01:42 ago)
Log Read Checkpoint File ./dirdat/p1000556 First Record RBA 59193235
在将Replicat进程拆分后,指定从拆分前的队列文件及RBA号码开始复制:
示例9-31:
ALTER REPLICAT xxx EXTSEQNO nnn, EXTRBA mmm
以上面的为例:
示例9-32:
ALTER REPLICAT REPYXA 556, EXTRBA 59193235
1.5.4 BOUNDED RECOVERY
错误信息:
示例9-33:
BOUNDED RECOVERY: reset to initial or altered checkpoint.
数据库问题,不能读取第2个节点的archivelog文件。
1.5.5 排除不复制的表
在参数文件中增加:
示例9-34:
TABLEEXCLUDE schema.table_name
1.5.6 从指定时间重新抓取
重新抓取数据前提:归档文件没有删除。
示例9-35:
ALTER EXTRACT xxx, TRANLOG, BEGIN 2010-12-31 08:00
时间格式:yyyy-mm-dd [hh:mi:[ss[.cccccc]]]
如果是新建:
示例9-36:
ADD EXTRACT xxx, TRANLOG, BEGIN 2010-12-31 08:00
1.5.7 进程无法停止
通常情况是在处理大交易,尤其在有超过2小时以上的大交易,建议等待进程处理完毕。
处理方法:如果必须停止进程,可以强制杀死进程。
示例9-37:
send xxx forcestop
1.5.8 CLOB处理
如果包含CLOB字段,在Extract参数文件中必须添加:
示例9-38:
TRANLOGOPTIONS CONVERTUCS2CLOBS
1.5.9 DB2不能使用checkpoint table
处理方法:在增加Replicat进程时使用nodbcheckpoint参数。
示例9-39:
add replicat xxx, exttrail /GoldenGate/dirdat/rb, nodbcheckpoint
1.6 ogg-错误
1.6.1 OGG-00446
1.6.1.1 OGG-00446 Could not find archived log for sequence 53586 thread 1 under alternative destinations.
错误信息:
OGG-00446 Could not find archived log for sequence 53586 thread 1 under alternative destinations. SQL <SELECT MAX(sequence#) FROM v$log WHERE thread# = :ora_thread>. Last alternative log tried /arch_cx/1_53586_776148274.arc., error retri
eving redo file name for sequence 53586, archived = 1, use_alternate = 0Not able to establish initial position for sequence 53586, rba
44286992.
处理办法:
将缺失的归档日志从备份中恢复出来。如果依旧找不到所需归档日志,那么只能重新实施数据初始化。
今天启动一个extract时,出现以下错误:
2011-10-16 22:41:02 ERROR OGG-00446 Oracle GoldenGate Capture for Oracle, e430rks2.prm: Could not find archived log for sequence 10770 thread 1 under default destinations SQL <SELECT name FROM v$archived_log WHERE sequence# = :ora_seq_no AND thread# = :ora_thread AND resetlogs_id = :ora_resetlog_id AND archived = 'YES' AND deleted = 'NO>, error retrieving redo file name for sequence 10770, archived = 1, use_alternate = 0Not able to establish initial position for sequence 10770, rba 78960656.
2011-10-16 22:41:02 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, e430rks2.prm: PROCESS ABENDING.
原因是extract 所需的archived log已经被清走,不在log_archive_dest指定的目录下,解决方法很简单,只要把sequence 从10770开始到当前的archived log重新拷贝回log_archive_dest目录下即可。
第一种办法,会导致数据不一致,改变抽取进程的时间,如下执行:
GGSCI (HP-HP) 8> alter extract extl,begin now
第二种办法:重新初始化
重新初始化过程如下:
---- source库
SQL> col current_scn format 999999999999999
SQL> Select current_scn from v$database;
CURRENT_SCN
----------------
12242466771468
expdp XPADB/XPADB directory=DMP dumpfile=xpadb_20160125_01.dmp LOGFILE=xpadb_20160125.log TABLES=BASE_ACTIONPOWER,BASE_BANK FLASHBACK_SCN=12242466771468
--- target库
impdp XPADRPT/xpadrpt DIRECTORY=OGGD DUMPFILE=xpadb_20160125_01.dmp LOGFILE=impdp.xpadb_20160125_01.log REMAP_SCHEMA=xpadb:xpadrpt REMAP_TABLESPACE=xpaddat:xpaddata table_exists_action=replace
start replicat ggsrep , aftercsn 12242466771468
1.6.1.2 OGG-00446 No valid log files for current redo sequence
goldengate 从oracle asm 中增量同步数据时,出现下述错误.
ERROR OGG-00446 No valid log files for current redo sequence 367, thread 1, error retrieving redo file name
for sequence 367, archived = 0, use_alternate = 0Not able to establish initial position for begin time 2013-03-27 15:32:46.
ERROR OGG-01668 PROCESS ABENDING.
在抽取进程的参数文件中加入TRANLOGOPTIONS DBLOGREADER即可。
参考:Extract fail due to an ASM connection configuration issue [ID 1061093.1]
Applies to:
Oracle GoldenGate - Version 11.1.1.0.0 and later
Information in this document applies to any platform.
Goal
To show how to recover from an extract failure when your Archive or Redo files are stored under ASM
and you see one of the following messages
ERROR 118 No Valid Log File For Current Redo Sequence Xxxx, Thread Y
ERROR 500 No valid log files for current redo sequence X, thread Y, error retrieving redo file name for sequence X, archived = 0, use_alternate = 0 Not able to establish initial position for begin time YYYY-MM-DD HH:MI:SS
ERROR OGG-00446 error 2 (No such file or directory) opening redo log <log file name>.dbf for sequence ####
Not able to establish initial position for begin time YYYY-MM-DD HH:MI:SS
Fix
If you are running Oracle ASM, the problem may be that the ASM connection is either not defined or is incorrectly defined or TRANSLOGOPTINS DBLOGREADER needs to be added. If your archive files are ONLY under ASM and extract receives an error 500, extract may have run successfully until the process needed to read from the ARCHIVES instead of the REDO. Once it needs to read from archive, the extract will fail.
Please Add the following line, or correct it in your Extract parameter file, if you are On Oracle 11.2.0.2 or better, or 10.2.0.5 or better and using OGG 11.x
TRANLOGOPTIONS DBLOGREADER
If the above version of Oracle or OGG doesn't apply to you specifying a user that can connect to the ASM instance and restart your Extract:
TRANLOGOPTIONS ASMUSER <user>@<ASM_instance_name>,
ASMPASSWORD <password>
1.6.1.3 OGG-00446 Missing filename opening checkpoint file.
ERROR OGG-00446 Missing filename opening checkpoint file.
进程RSJQZ011进程abended,如下:
ERROR OGG-00446 Missing filename opening checkpoint file.
检查RSJQZ011配置情况:
GGSCI (oraserver.localdomain) 19> view param RSJQZ011
Sourcedefs /goldengate/dirdef/DESJQZ001.def
---handlecollisions
batchsql
SETENV ( NLS_LANG = ".ZHS16GBK")
OBEY /goldengate/dirprm/pwd.obey
Discardfile /goldengate/dirrpt/RSJZX001.dsc, append, megabytes 100
map DB_DJGL.A, target DB_NBGY.A;
发现Replicat RSJQZ011一行被删除了,所以导致报错。
加上Replicat RSJQZ011后进程启动正常。
1.6.2 OGG-01154 Oracle GoldenGate Delivery for Oracle, repn.prm
错误信息:
OGG-01154 Oracle GoldenGate Delivery for Oracle, repn.prm: SQL error 1691 mapping DATA_USER.DMH_WJXXB to DATA_USER.DMH_WJXXB OCI Error ORA-01691: unable to extend lob segment DATA_USER.SYS_LOB0000083691C00014$$ by 16384 in tablespace DATA_USER_LOB_U128M_1 (status = 1691), SQL <INSERT INTO "DATA_USER"."DMH_WJXXB" ("DMH_WJXXB_ID","DMH_ZLXXB_ID","DMH_GPXXB_ID","DMH_PCXXB_ID","PICIH","SHENQINGH","FID","WENJIANZL","WENJIANLXDM","WENJIANMC","DTDBBH","FAMINGMC","FUTUGS","WENJIANST>.
处理办法:
数据库中该表空间已满,需要对该表空间进行扩容。
1.6.2.1 OGG-01154
错误信息:2011-03-29 15:53:57 WARNING OGG-01154 Oracle GoldenGate Delivery for Oracle, repya.prm: SQL error 14402 mapping EPMA.D_METER to E
PMA.D_METER OCI Error ORA-14402: updating partition key column would cause a partition change (status = 14402), SQL <UPDATE "EPMA"."D_METER" SET "PR_ORG" = :a1,"BELONG_DEPT" = :a2 WHERE "METER_ID" = :b0>.
导致原因:源端更新了分区列,但目标端没有打开行移动,导致更新时报错;
处理方法:SQLPLUS>alter table SCHEMA.TABLENAME enable row movement;
1.6.3 OGG-00664
1.6.3.1 OGG-00664 OCI Error during OCIServerAttach (status = 12541-ORA-12541: TNS:no listener).
错误信息:
OGG-00664 OCI Error during OCIServerAttach (status = 12541-ORA-12541: TNS:no listener).
处理方法:
启动数据库的监听器。
1.6.3.2 OGG-00664 OCI Error during OCIServerAttach (status = 12545-Error while trying to retrieve text for error ORA-12545).
2015-06-09 22:31:11 ERROR OGG-00664 OCI Error during OCIServerAttach (status = 12545-Error while trying to retrieve text for error ORA-12545).
2015-06-09 22:31:16 ERROR OGG-01668 PROCESS ABENDING.
ORACLE_HOME设置有问题。
解决办法:setenv (ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1)
1.6.4 OGG-00665
1.6.4.1 OGG-00665 OCI Error describe for query (status = 3135-ORA-03135: connection lost contact
错误信息:
OGG-00665 OCI Error describe for query (status = 3135-ORA-03135: connection lost contact
Process ID: 8859
Session ID: 131 Serial number: 31), SQL<SELECT DECODE(archived, 'YES', 1, 0), status FROM v$log WHERE thread# = :ora_thread AND sequence# = :ora_seq_no>.
处理方法:
在没有关闭OGG进程的情况下,提前关闭了数据库,导致OGG进程出现异常。如果是发现了这个错误提示,应该马上关闭OGG进程,注意数据库的归档日志情况,保证归档日志不会缺失,然后等待数据库启动成功后,马上启动OGG进程。
1.6.4.2 OGG-00665 OCI Error describe for query
Applies to:
Oracle GoldenGate - Version: 11.1.1.0.7 and later [Release: 11.1.1 and later ]
Information in this document applies to any platform.
Symptoms
When attempting to start an Extract, we get error
2010-12-09 18:59:25 GGS ERROR 182 OCI Error describe for query (bad syntax) (status = 942-ORA-00942: table or view does not exist), SQL< select value$ from sys.props$ where name = 'NLS_LANGUAGE'>.
2010-12-09 18:59:25 GGS ERROR 190 PROCESS ABENDING.
Cause
The database user does not have the necessary privilege.
Solution
Grant the necessary privilege to the Golden Gate user.
SQL> grant select on sys.props$ to ggsuser;
or
SQL> grant select any dictionary to ggsuser;
1.6.4.3 OGG-00665 OCI Error describe for query (status = 942-ORA-00942: table or view does not exist), SQL<SELECT 1 FROM DUAL WHERE EXISTS ( SELECT 'x' FROM ggusr.GGS_DDL_HIST WHERE OPTIME < '2015-05-25 11:12:43')>.
2015-06-08 12:12:43 ERROR OGG-00665 OCI Error describe for query (status = 942-ORA-00942: table or view does not exist), SQL<SELECT 1 FROM DUAL WHERE EXISTS ( SELECT 'x' FROM ggusr.GGS_DDL_HIST WHERE OPTIME < '2015-05-25 11:12:43')>.
2015-06-08 12:12:43 ERROR OGG-01668 PROCESS ABENDING.
如果想使用DDL功能,需要在之前运行支持DDL的相关脚本。
1.@marker_setup.sql
2.@ddl_setup.sql mode of installation:initialsetup
3.@role_setup.sql
4.GRANT GGS_GGSUSER_ROLE TO gguser
5.@ddl_enable.sql
1.6.5 OGG-01161 Bad column index (4) specified for table QQQ.TIANSHI, max columns = 4.
错误信息:
OGG-01161 Bad column index (4) specified for table QQQ.TIANSHI, max columns = 4.
处理方法:
对照一下生产端与容灾端的这一张表的表结构,如果容灾端的表缺少一列,则在容灾端,登陆数据库,增加这一列,然后启动复制进程。
1.6.6 OGG-00199 Table QQQ.T0417 does not exist in target database.
错误信息:
ERROR OGG-00199 Table QQQ.T0417 does not exist in target database.
处理方法:
查看源端抽取进程的参数,DDL复制参数是否配置,针对这张表,重新实施数据初始化。
1.6.7 OGG-01738 BOUNDED RECOVERY
database version:11.2.0.3 RAC
goldengate version :11.1.1.1.2
早上发现数据同步异常,source端状态如下:
GGSCI (ulecardrac1) 3> info all
Program Status Group Lag Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EXT232 00:00:00 06:32:33
EXTRACT RUNNING PUMP232 00:00:00 00:00:03
status还是为RUNNING,但是已经有六个半小时没有update了,其实该进程已经hang住
查看告警日志ggserr.log
发现存在OGG-01738提示
2013-03-07 02:42:28 INFO OGG-01738 Oracle GoldenGate Capture for Oracle, ext232.prm: BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p5905_Redo Thread 1: start=SeqNo: 679, RBA: 83280912, SCN: 1.913813052 (5208780348), Timestamp: 2013-03-06 22:00:20.000000, end=SeqNo: 679, RBA: 129051136, SCN: 1.938808049 (5233775345), Timestamp: 2013-03-07 02:42:03.000000.
2013-03-07 02:42:28 INFO OGG-01738 Oracle GoldenGate Capture for Oracle, ext232.prm: BOUNDED RECOVERY: CHECKPOINT: for object pool 2: p5905_Redo Thread 2: start=SeqNo: 692, RBA: 103611920, SCN: 1.913812238 (5208779534), Timestamp: 2013-03-06 22:00:16.000000, end=SeqNo: 693, RBA: 93604864, SCN: 1.938808100 (5233775396), Timestamp: 2013-03-07 02:42:15.000000.
MOS上有一篇关于该错误的文章 note 1293772.1
国内大牛刘相兵的博客上也有一篇关于该错误的说明:
http://www.askmaclean.com/archives/ogg-01738-bounded-recovery.html
The solution is to reset the Bounded Recovery Checkpoint file when restarting the extract like:
GGSCI> start <extract_name> BRRESET
因为extract进程ext232已经假死,无法stop掉,甚至用'send ext232 forcestop'和'stop mgr'也无法stop掉该extract进程
最后只能在shell下kill掉进程,再重新执行
GGSCI> start ext232 BRRESET
重新启动后,发现状态已经正常,同步已经基本无延迟。
该bug只在RAC中或者单实例设置了多个thread的情况下出现,而且在更高级版本中已经修复,为了一劳永逸,可以考虑将ogg升级至11.2.1.0.1
2012-10-20 10:28:02 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 79286800, SCN: 0.3712874 (3712874), Timesta
mp: 2012-10-19 22:27:45.000000, Thread: 1, end=SeqNo: 343, RBA: 79287296, SCN: 0.3712874 (3712874), Timestamp: 2012-10-19 22:27:45.000000, Thread: 1.
2012-10-20 14:28:05 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 107000336, SCN: 0.3725744 (3725744), Timest
amp: 2012-10-20 02:27:14.000000, Thread: 1, end=SeqNo: 343, RBA: 107000832, SCN: 0.3725744 (3725744), Timestamp: 2012-10-20 02:27:14.000000, Thread: 1.
2012-10-20 18:28:06 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 128054288, SCN: 0.3739371 (3739371), Timest
amp: 2012-10-20 06:28:02.000000, Thread: 1, end=SeqNo: 343, RBA: 128054784, SCN: 0.3739371 (3739371), Timestamp: 2012-10-20 06:28:02.000000, Thread: 1.
2012-10-20 22:28:06 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 153368080, SCN: 0.3752583 (3752583), Timest
amp: 2012-10-20 10:27:46.000000, Thread: 1, end=SeqNo: 343, RBA: 153368576, SCN: 0.3752583 (3752583), Timestamp: 2012-10-20 10:27:46.000000, Thread: 1.
2012-10-21 02:28:08 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 165712912, SCN: 0.3763760 (3763760), Timest
amp: 2012-10-20 14:28:00.000000, Thread: 1, end=SeqNo: 343, RBA: 165713408, SCN: 0.3763760 (3763760), Timestamp: 2012-10-20 14:28:00.000000, Thread: 1.
2012-10-21 06:28:15 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 179789328, SCN: 0.3774866 (3774866), Timest
...skipping one line
2012-10-21 10:28:16 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 201859088, SCN: 0.3788193 (3788193), Timest
amp: 2012-10-20 22:26:32.000000, Thread: 1, end=SeqNo: 343, RBA: 201859584, SCN: 0.3788193 (3788193), Timestamp: 2012-10-20 22:26:32.000000, Thread: 1.
2012-10-21 14:28:26 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 246480912, SCN: 0.3803284 (3803284), Timest
amp: 2012-10-21 02:27:31.000000, Thread: 1, end=SeqNo: 343, RBA: 246481408, SCN: 0.3803284 (3803284), Timestamp: 2012-10-21 02:27:31.000000, Thread: 1.
2012-10-21 18:28:33 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p17448_extr: start=SeqNo: 343, RBA: 291493392, SCN: 0.3821051 (3821051), Timest
amp: 2012-10-21 06:28:22.000000, Thread: 1, end=SeqNo: 343, RBA: 291493888, SCN: 0.3821051 (3821051), Timestamp: 2012-10-21 06:28:22.000000, Thread: 1.
在Oracle GoldenGate版本11.x中,引入了Bounded Recovery(BR)的概念,即允许extract对于长事务(long running transaction 比BRINTERVAL指定值更长的事务)写入到本地BR目录。当extract重启时,它会首先读取BR文件,取而代之读取恢复检查点指定的归档日志,这样有助于提升性能以及减少对旧归档文件的依赖。
但是当在RAC环境中使用Bounded Recovery(BR)特性来恢复一个异常abend掉的extract的话,小概率可能会遇到extract hang住或丢失特性的事务。该BUG仅在RAC环境中或者单实例情况下使用多个thread设置时出现。
1. bug 10368242: transaction loss with BR
When a transaction is committed, it will be flushed to trail file. But when BR writing started (after the transaction commit) and extract abends abnormally, the extract may not have chance to flush the committed transaction to trail. When extract restarted, it will read from BR, and leave that committed transaction as persist committed transaction in memory and never be written to trail. So this committed transaction may be lost.
The problem will not happen when the extract stops in normal mode.
2.bug 12532428 (base bug 10408077 ): extract hung when using BR and new objects are added to extract
With BR setup, when new objects (table, sequence, DDL, et al) are including in the extract, restarted extract will pick up more data that causes the producer queue limit (a fixed number) used by BR be reached. Because the extract is still in BR recovery, the consumer thread is stopped and not processing data from the producer queues. This caused a deadlock, and the extract will appear hung.
解决方案
1. 对于BUG 12532428引起的事务丢失,该BUG在11.1.1.1中被修复,且会在11.1.1.0中被backport。
2. 对于BUG 10408077 引起的extract hang,该BUG在11.1.1.1和 11.1.1.0.30中被修复,也可以如下workaround绕过:
A workaround with earlier 11.1.1.0 version is to start extract with BRRESET, when new object is added to an extract. All the archived logs since recovery checkpoint need to be available.
ggsci> start extract, BRRESET
When running Oracle Golden Gate 11.1.1.0.6 or higher, extract is “abending” every 4 hours on the hour. This approximates the same time or interval that Bounded Recovery is set to by default.
Extract can be restarted and continues to work but then fails again after 4 hours with the same errors as shown below.
ERROR
———
2011-02-06 05:15:38 WARNING OGG-01573 br_validate_bcp: failed in call to: ggcrc64valid.
2011-02-06 05:15:38 WARNING OGG-01573 br_validate_bcp: failed in call to: br_validate_bcp.
2011-02-06 05:15:38 INFO OGG-01639 BOUNDED RECOVERY: ACTIVE: for object pool 1: p7186_Redo Thread 1.
2011-02-06 05:15:38 INFO OGG-01640 BOUNDED RECOVERY: recovery start XID: 0.0.0.
…
2011-02-06 09:15:46 INFO OGG-01738 BOUNDED RECOVERY: CHECKPOINT: for object pool 1: p7186_Redo Thread 1: start=SeqNo: 21659, RBA: 117520912, SCN: 0.2984644709 (2984644709), Timestamp: 2011-02-06 09:15:44.000000, end=SeqNo: 21659, RBA: 117602816, SCN: 0.2984644709 (2984644709), Timestamp: 2011-02-06 09:15:44.000000.
Cause
Under these conditions, this may be a problem with the Bounded Recovery Checkpoint file. It is likely corrupted.
Solution
The solution is to reset the Bounded Recovery Checkpoint file when restarting the extract like:
GGSCI> start <extract> BRRESET
BOUNDED RECOVERY
错误信息:BOUNDED RECOVERY: reset to initial or altered checkpoint.
数据库问题,不能读取第2个节点的archivelog文件
1.6.8 OGG-00268 OGG-01668参数文件格式问题
OGG-00268 OGG-01668参数文件格式问题
现象:
Start ext1报错:
2012-04-23 04:17:21 ERROR OGG-00268 Parameter unterminated.
2012-04-23 04:17:21 ERROR OGG-01668 PROCESS ABENDING.
原因:
GoldenGate对语法要求非常严格,比如逗号,分号,空格等
处理方法:
在参数文件最后加一个分号“;”
1.6.9 WARNING OGG-00959 (MINKEEPFILES option not used.).
在mgr.rpt 里面找到这个警告WARNING OGG-00959 (MINKEEPFILES option not used.).
-- Purge old trail-files
删除老文件
PURGEOLDEXTRACTS /ggs/tdmInput/m1/g3*, USECHECKPOINTS, MINKEEPHOURS 12
2012-10-30 15:15:09 WARNING OGG-00959 PURGEOLDEXTRACTS /ggs/tdmInput/m1/g3*, USECHECKPOINTS, MINKEEPHOURS 12 (MINKEEPFILES option not used.).
The descripton for this warning is:
// *Cause: The PURGEOLDEXTRACTS parameter contains the option MINKEEPHOURS or
// MINKEEPDAYS with the option MINKEEPFILES. These are mutually
// exclusive. If either MINKEEPHOURS or MINKEEPDAYS is used with
// MINKEEPFILES, then MINKEEPHOURS or MINKEEPDAYS is accepted, and
// MINKEEPFILES is ignored.
// *Action: Remove MINKEEPFILES (or MINKEEPHOURS depending on your
// requirements.
告警描述:
原因:PURGEOLDEXTRACTS 参数包含了MINKEEPHOURS 或者MINKEEPDAYS 参数并且包含MINKEEPFILES参数 ,他们之间是相互冲突的。
如果MINKEEPHOURS ,MINKEEPDAYS ,MINKEEPFILES 同时使用那么系统接受MINKEEPHOURS和MINKEEPDAYS 参数将对MINKEEPFILES 参数做忽略。
1.6.10 OGG-00303 Did not recognize parameter argument.
参数变量配置不正确
问题描述:
ERROR OGG-00303 Did not recognize parameter argument. |
问题分析:
进程参数文件配置不正确。
问题处理:
检查参数配置文件,可能是进程名称与配置文件不一致或者是参数不正确,重启进程。
1.6.11 OGG-01044
2015-06-08 17:54:45 ERROR OGG-01044 The trail './dirdat/aa' is not assigned to extract 'EORA_T1'. Assign the trail to the extract with the command "ADD EXTTRAIL/RMTTRAIL ./dirdat/aa, EXTRACT EORA_T1".
解决办法:需要添加trail文件
GGSCI (orcltest) 11> add exttrail ./dirdat/aa,extract eora_t1,megabytes 100
EXTTRAIL added.
1.6.12 OGG-00396
2015-01-07 11:39:38 ERROR OGG-00396 Command not terminated by semi-colon.
2015-01-07 11:39:38 ERROR OGG-01668 PROCESS ABENDING.
原因是配置文件中没有以分号结尾;
解决办法:修改配置文件。
1.6.13 OGG-01031 goldengate源端意外宕机,导致OGG-01031报错
示例9-21:
ERROR OGG-01031 There is a problem in network communication, a remote file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown error. (Reply received is Expected 4 bytes, but got 0 bytes, in trail ./dirdat/t1000026, seqno 26, reading record trailer token at RBA 103637218).
2011-01-06 11:04:16 ERROR OGG-01668 PROCESS ABENDING.
处理方法:
可能是目标端的trail file出问题了,前滚重新生成一个新的SEND EXTRACT xxx ROLLOVER,或者“alter extract xxx rollover”。
服务器宕机,没有停止dpump进程,启动后处于abend状态,检查ggserr.log报以下错误:
2011-04-01 11:13:19 ERROR OGG-01031 Oracle GoldenGate Capture for Oracle, dpump.prm: There is a problem in network communication, a remote file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown error. (Reply received is Unable to open file "/vistor/media/GG/dirdat/rt000003" (error 11, Resource temporarily unavailable)).
原因是由于目标端的OGG代码正在更新,同时dpump进程没有停止导致dpump进程始终寻找老的manager端口和源端的trail文件。
解决方法重新启动exp、ddump、ext、mananger进程,若还是报错就需要更改参数。
dpump添加 ETROLLOVER属性,产生一个新的文件点
alter extract ext1 etrollover
start extract dpump
info extract dpump
标记源端trail文件sequence number开启生成新的rt文件
send replicat rep1,logend
alter replicat rep1,extseqno 4, extrba 0
start replicat rep1
进程启动恢复正常。
source端:
GGSCI (orcltest) 31> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EORA_HR 00:00:00 00:00:07
EXTRACT ABENDED PORA_HR 00:00:00 40:04:19
REPLICAT RUNNING RORA_HR2 00:00:00 00:00:00
REPLICAT STOPPED TESTRPT 00:00:00 00:05:48
GGSCI (orcltest) 32> view report PORA_HR
***********************************************************************
Oracle GoldenGate Capture for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 23 2012 08:42:16
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
Starting at 2015-06-12 10:36:28
***********************************************************************
Operating System Version:
Linux
Version #1 SMP Sun Nov 10 22:19:54 EST 2013, Release 2.6.32-431.el6.x86_64
Node: orcltest
Machine: x86_64
soft limit hard limit
Address Space Size : unlimited unlimited
Heap Size : unlimited unlimited
File Size : unlimited unlimited
CPU Time : unlimited unlimited
Process id: 14523
Description:
***********************************************************************
** Running with the following parameters **
***********************************************************************
2015-06-12 10:36:28 INFO OGG-03035 Operating system character set identified as UTF-8. Locale: en_US, LC_ALL:.
extract pora_hr
setenv (ORACLE_SID=ogg1)
Set environment variable (ORACLE_SID=ogg1)
setenv (ORACLE_HOME=/u02/app/oracle/product/11.2.0/dbhome_1)
Set environment variable (ORACLE_HOME=/u02/app/oracle/product/11.2.0/dbhome_1)
setenv (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
Set environment variable (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
passthru
rmthost 192.168.59.130,mgrport 7809
rmttrail ./dirdat/pa
table hr.*;
2015-06-12 10:36:28 INFO OGG-01815 Virtual Memory Facilities for: COM
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/u01/gg11/dirtmp.
CACHEMGR virtual memory values (may have been adjusted)
CACHESIZE: 64G
CACHEPAGEOUTSIZE (normal): 8M
PROCESS VM AVAIL FROM OS (min): 128G
CACHESIZEMAX (strict force to disk): 96G
2015-06-12 10:36:33 INFO OGG-01226 Socket buffer size set to 27985 (flush size 27985).
Source Context :
SourceModule : [er.extrout]
SourceID : [/scratch/aime1/adestore/views/aime1_adc4150256/oggcore/OpenSys/src/app/er/extrout.c]
SourceFunction : [complete_tcp_msg]
SourceLine : [1522]
ThreadBacktrace : [9] elements
: [/u01/gg11/libgglog.so(CMessageContext::AddThreadContext()+0x1e) [0x7f5c2f9bd06e]]
: [/u01/gg11/libgglog.so(CMessageFactory::CreateMessage(CSourceContext*, unsigned int, ...)+0x2cc) [0x7f5c2f9b944c]]
: [/u01/gg11/libgglog.so(_MSG_ERR_ER_REMOTE_COMM_PROBLEM(CSourceContext*, char const*, CMessageFactory::MessageDisposition)+0x31) [0x7f5c2f9a11e9]]
: [/u01/gg11/extract(complete_tcp_msg(extract_def*)+0x424) [0x51313c]]
: [/u01/gg11/extract(flush_tcp(extract_def*, int)+0x20d) [0x5139f1]]
: [/u01/gg11/extract(RECOVERY_initialize()+0x371) [0x524f91]]
: [/u01/gg11/extract(main+0x4a5) [0x56ca65]]
: [/lib64/libc.so.6(__libc_start_main+0xfd) [0x3a5221ed1d]]
: [/u01/gg11/extract(__gxx_personality_v0+0x38a) [0x4e8b7a]]
2015-06-12 10:36:43 ERROR OGG-01031 There is a problem in network communication, a remote file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown error. (Reply rece
ived is Unable to open file "./dirdat/pa000002" (error 11, Resource temporarily unavailable)).
2015-06-12 10:36:43 ERROR OGG-01668 PROCESS ABENDING.
GGSCI (orcltest) 34> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EORA_HR 00:00:00 00:00:05
EXTRACT ABENDED PORA_HR 00:00:00 40:05:10
REPLICAT RUNNING RORA_HR2 00:00:00 00:00:10
REPLICAT STOPPED TESTRPT 00:00:00 00:06:39
GGSCI (orcltest) 35> alter extract pora_hr etrollover
2015-06-12 10:38:15 INFO OGG-01520 Rollover performed. For each affected output trail of Version 10 or higher format, after starting the source extract, issue ALTER EXTSEQNO for that trail's reader (either pump EXTRACT or REPLICAT) to move the reader's scan to the new trail file; it will not happen automatically.
EXTRACT altered.
GGSCI (orcltest) 36> view params PORA_HR
extract pora_hr
setenv (ORACLE_SID=ogg1)
setenv (ORACLE_HOME=/u02/app/oracle/product/11.2.0/dbhome_1)
setenv (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
passthru
rmthost 192.168.59.130,mgrport 7809
rmttrail ./dirdat/pa
table hr.*;
GGSCI (orcltest) 37> start extract PORA_HR
Sending START request to MANAGER ...
EXTRACT PORA_HR starting
GGSCI (orcltest) 38> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EORA_HR 00:00:00 00:00:06
EXTRACT RUNNING PORA_HR 00:00:00 00:00:49
REPLICAT RUNNING RORA_HR2 00:00:00 00:00:01
REPLICAT STOPPED TESTRPT 00:00:00 00:07:42
target端:
GGSCI (rhel6_lhr) 30> view report RORA_HR
***********************************************************************
Oracle GoldenGate Delivery for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 23 2012 08:48:07
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
Starting at 2015-06-10 04:48:15
***********************************************************************
Operating System Version:
Linux
Version #1 SMP Tue Apr 21 08:37:59 PDT 2015, Release 2.6.32-504.16.2.el6.x86_64
Node: rhel6_lhr
Machine: x86_64
soft limit hard limit
Address Space Size : unlimited unlimited
Heap Size : unlimited unlimited
File Size : unlimited unlimited
CPU Time : unlimited unlimited
Process id: 40019
Description:
***********************************************************************
** Running with the following parameters **
***********************************************************************
2015-06-10 04:48:15 INFO OGG-03035 Operating system character set identified as UTF-8. Locale: en_US, LC_ALL:.
replicat rora_hr
setenv (ORACLE_SID=ogg2)
Set environment variable (ORACLE_SID=ogg2)
setenv (ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1)
Set environment variable (ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1)
setenv (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
Set environment variable (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
ddl include all
ddlerror default ignore retryop maxretries 3 retrydelay 5
userid ggusr,password ***
handlecollisions
assumetargetdefs
discardfile ./dirrpt/rora_hr.dsc,purge
map hr.* ,target hr.*;
2015-06-10 04:48:15 INFO OGG-01815 Virtual Memory Facilities for: COM
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/u01/gg11/dirtmp.
CACHEMGR virtual memory values (may have been adjusted)
CACHESIZE: 2G
CACHEPAGEOUTSIZE (normal): 8M
PROCESS VM AVAIL FROM OS (min): 4G
CACHESIZEMAX (strict force to disk): 3.41G
Database Version:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
Database Language and Character Set:
NLS_LANG = "AMERICAN_AMERICA.ZHS16GBK"
NLS_LANGUAGE = "AMERICAN"
NLS_TERRITORY = "AMERICA"
NLS_CHARACTERSET = "ZHS16GBK"
***********************************************************************
** Run Time Messages **
***********************************************************************
Opened trail file ./dirdat/pa000002 at 2015-06-10 04:48:15
2015-06-10 04:48:19 WARNING OGG-01519 Waiting at EOF on input trail file ./dirdat/pa000002, which is not marked as complete; but succeeding trail file ./dirdat/pa000003 exists. If ALTER ETROLLOVER has been pe
rformed on source extract, ALTER EXTSEQNO must be performed on each corresponding downstream reader.
2015-06-10 04:48:29 WARNING OGG-01519 Waiting at EOF on input trail file ./dirdat/pa000002, which is not marked as complete; but succeeding trail file ./dirdat/pa000003 exists. If ALTER ETROLLOVER has been pe
rformed on source extract, ALTER EXTSEQNO must be performed on each corresponding downstream reader.
2015-06-10 04:48:50 WARNING OGG-01519 Waiting at EOF on input trail file ./dirdat/pa000002, which is not marked as complete; but succeeding trail file ./dirdat/pa000003 exists. If ALTER ETROLLOVER has been pe
rformed on source extract, ALTER EXTSEQNO must be performed on each corresponding downstream reader.
2015-06-10 04:49:30 WARNING OGG-01519 Waiting at EOF on input trail file ./dirdat/pa000002, which is not marked as complete; but succeeding trail file ./dirdat/pa000003 exists. If ALTER ETROLLOVER has been pe
rformed on source extract, ALTER EXTSEQNO must be performed on each corresponding downstream reader.
2015-06-10 04:50:50 WARNING OGG-01519 Waiting at EOF on input trail file ./dirdat/pa000002, which is not marked as complete; but succeeding trail file ./dirdat/pa000003 exists. If ALTER ETROLLOVER has been pe
rformed on source extract, ALTER EXTSEQNO must be performed on each corresponding downstream reader.
2015-06-10 04:53:30 WARNING OGG-01519 Waiting at EOF on input trail file ./dirdat/pa000002, which is not marked as complete; but succeeding trail file ./dirdat/pa000003 exists. If ALTER ETROLLOVER has been pe
rformed on source extract, ALTER EXTSEQNO must be performed on each corresponding downstream reader.
2015-06-10 04:54:21 INFO OGG-01021 Command received from GGSCI: STATS.
GGSCI (rhel6_lhr) 31> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EORA_HR2 00:00:00 00:00:04
EXTRACT RUNNING PORA_HR2 00:00:00 00:00:04
REPLICAT RUNNING RORA_HR 00:00:00 00:00:10
GGSCI (rhel6_lhr) 32> send replicat RORA_HR,logend
ERROR: No Command for SEND.
GGSCI (rhel6_lhr) 33> alter replicat RORA_HR,extseqno 3, extrba 0
ERROR: REPLICAT RORA_HR is running and cannot be altered (1,2,No such file or directory).
GGSCI (rhel6_lhr) 34>
GGSCI (rhel6_lhr) 34> stop RORA_HR
Sending STOP request to REPLICAT RORA_HR ...
Request processed.
GGSCI (rhel6_lhr) 35> alter replicat RORA_HR,extseqno 3, extrba 0
REPLICAT altered.
GGSCI (rhel6_lhr) 36> start RORA_HR
Sending START request to MANAGER ...
REPLICAT RORA_HR starting
GGSCI (rhel6_lhr) 37> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EORA_HR2 00:00:00 00:00:08
EXTRACT RUNNING PORA_HR2 00:00:00 00:00:05
REPLICAT RUNNING RORA_HR 00:05:33 00:00:03
GGSCI (rhel6_lhr) 38> view report RORA_HR
***********************************************************************
Oracle GoldenGate Delivery for Oracle
Version 11.2.1.0.1 OGGCORE_11.2.1.0.1_PLATFORMS_120423.0230_FBO
Linux, x64, 64bit (optimized), Oracle 11g on Apr 23 2012 08:48:07
Copyright (C) 1995, 2012, Oracle and/or its affiliates. All rights reserved.
Starting at 2015-06-10 05:01:13
***********************************************************************
Operating System Version:
Linux
Version #1 SMP Tue Apr 21 08:37:59 PDT 2015, Release 2.6.32-504.16.2.el6.x86_64
Node: rhel6_lhr
Machine: x86_64
soft limit hard limit
Address Space Size : unlimited unlimited
Heap Size : unlimited unlimited
File Size : unlimited unlimited
CPU Time : unlimited unlimited
Process id: 40703
Description:
***********************************************************************
** Running with the following parameters **
***********************************************************************
2015-06-10 05:01:13 INFO OGG-03035 Operating system character set identified as UTF-8. Locale: en_US, LC_ALL:.
replicat rora_hr
setenv (ORACLE_SID=ogg2)
Set environment variable (ORACLE_SID=ogg2)
setenv (ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1)
Set environment variable (ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1)
setenv (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
Set environment variable (NLS_LANG=AMERICAN_AMERICA.ZHS16GBK)
ddl include all
ddlerror default ignore retryop maxretries 3 retrydelay 5
userid ggusr,password ***
handlecollisions
assumetargetdefs
discardfile ./dirrpt/rora_hr.dsc,purge
map hr.* ,target hr.*;
2015-06-10 05:01:13 INFO OGG-01815 Virtual Memory Facilities for: COM
anon alloc: mmap(MAP_ANON) anon free: munmap
file alloc: mmap(MAP_SHARED) file free: munmap
target directories:
/u01/gg11/dirtmp.
CACHEMGR virtual memory values (may have been adjusted)
CACHESIZE: 2G
CACHEPAGEOUTSIZE (normal): 8M
PROCESS VM AVAIL FROM OS (min): 4G
CACHESIZEMAX (strict force to disk): 3.41G
Database Version:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
Database Language and Character Set:
NLS_LANG = "AMERICAN_AMERICA.ZHS16GBK"
NLS_LANGUAGE = "AMERICAN"
NLS_TERRITORY = "AMERICA"
NLS_CHARACTERSET = "ZHS16GBK"
***********************************************************************
** Run Time Messages **
***********************************************************************
Opened trail file ./dirdat/pa000003 at 2015-06-10 05:01:13
2015-06-10 05:01:13 INFO OGG-01020 Processed extract process RESTART_ABEND record at seq 3, rba 1046 (aborted 0 records).
Switching to next trail file ./dirdat/pa000004 at 2015-06-10 05:01:13 due to EOF, with current RBA 1108
Opened trail file ./dirdat/pa000004 at 2015-06-10 05:01:13
Processed extract process graceful restart record at seq 4, rba 1074.
Processed extract process graceful restart record at seq 4, rba 1136.
2015-06-10 05:01:13 INFO OGG-01407 Setting current schema for DDL operation to [hr].
2015-06-10 05:01:13 INFO OGG-01408 Restoring current schema for DDL operation to [ggusr].
GGSCI (rhel6_lhr) 39>
1.6.13.1 OGG-01031
启动源端传输进程DPEND,ggserr.log错误显示如下:
2012-08-28 15:09:39 ERROR OGG-01031 Oracle GoldenGate Capture for Oracle, dpend.prm: There is a problem in network communication, a remote file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown error. (Reply received is Unable to open file "/uo1/app/ogg/dirdat/nd000004" (error 2, No such file or directory)).
2012-08-28 15:09:41 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, dpend.prm: PROCESS ABENDING.目标端ggserr.log错误显示如下:
2012-08-28 15:06:30 WARNING OGG-01223 Oracle GoldenGate Collector for Oracle: Unable to lock file "/uo1/app/ogg/dirdat/nd000004" (error 11, Resource temporarily unavailable). Lock currently held by process id (PID) 13854.
2012-08-28 15:06:30 WARNING OGG-01223 Oracle GoldenGate Collector for Oracle: Unable to open file "/uo1/app/ogg/dirdat/nd000004" (error 2, No such file or directory).
导致原因:可能是网络出现过故障,OGG源端的Data Pump进程与目标断了联系,目标端mgr为其启动的server进程一直还在运行,下次data pump重启时目标mgr会试图生成另外一个server进程,这样两个进程会争同一个队列文件。
处理方法:
1、停掉源端的所有data pump,使用ps –ef|grep server(或OGG安装目录)看看是不是还有OGG的server进程在跑,如果有,杀死它(一定要确认源端data pump全停掉,并且杀的是server进程,不要杀其它extract/replicat/mgr等),重启源端data pump即可。
2、可能是目标端的trail file出问题了,前滚重新生成一个新的队列文件
SEND EXTRACT xxx ETROLLOVER
或者:alter extract xxx etrollover
xxx为datapump的名称
1.6.14 OGG-01296
示例9-18:
ERROR OGG-01296 Oracle GoldenGate Delivery for Oracle, yx_rep3.prm: Error mapping from SGPM.A_PAY_FLOW to SGPM.A_PAY_FLOW.
由于源端进行了表结构更改,没有通知目标端,导致此错误。
处理方法:在目标端执行相应的语句,将表结构修改为和源端一致。
1.6.15 OGG-01088
错误信息:
示例9-19:
ERROR OGG-01088 Oracle GoldenGate Delivery for Oracle, pms_rep1.prm: malloc 2097152 bytes failed.
ERROR OGG-01668 Oracle GoldenGate Delivery for Oracle, pms_rep1.prm: PROCESS ABENDING.
处理方法:
(1)“ulimit –a”,验证操作系统对用户是否所有资源都是无限制。
(2)将进程进行拆分,拆分为多个进程。
(3)从support.oracle.com下载最新的补丁包,升级GoldenGate。
1.6.16 OGG-01223
启动源端传输进程DPEND,ggserr.log错误显示如下:
2012-08-17 11:43:50 WARNING OGG-01223 Oracle GoldenGate Capture for Oracle, dpend.prm: TCP/IP error 79 (Connection refused).
2012-08-17 11:45:01 WARNING OGG-01223 Oracle GoldenGate Capture for Oracle, dpend.prm: TCP/IP error 79 (Connection refused).
导致原因:因为目标端110上MGR进程没有启动,导致报错
处理方法:
在目标端启动start mgr启动进程后,再启动源端的传输进程DPEND,错误消失,文件顺利传输过来了。
正常的日志如下:
2012-08-17 14:31:51 INFO OGG-00993 Oracle GoldenGate Capture for Oracle, dpend.prm: EXTRACT DPEND started.
2012-08-17 14:33:13 INFO OGG-01226 Oracle GoldenGate Capture for Oracle, dpend.prm: Socket buffer size set to 27985 (flush size 27985).
2012-08-17 14:33:26 INFO OGG-01052 Oracle GoldenGate Capture for Oracle, dpend.prm: No recovery is required for target file F:\ogg\dirdat\nd000000, at RBA 0 (file not opened).
2012-08-17 14:33:26 INFO OGG-01478 Oracle GoldenGate Capture for Oracle, dpend.prm: Output file F:\ogg\dirdat\nd is using format RELEASE 11.2.
1.6.17 OGG-01224
示例9-20:
ERROR OGG-01224 Oracle GoldenGate Manager for Oracle, mgr.prm: No buffer space available
处理方法:
修改mgr.prm,扩大动态端口范围,dynamicportlist 7840-7914。
1.6.17.1 OGG-01224
启动源端传输进程DPEND,ggserr.log错误显示如下:
2012-08-22 05:33:10 ERROR OGG-01224 Oracle GoldenGate Capture for Oracle, dpend.prm: TCP/IP error 113 (No route to host).
2012-08-22 05:33:10 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, dpend.prm: PROCESS ABENDING.
导致原因:因为目标端235上的防火墙没有关闭,导致报错
处理方法:
在目标端机器关闭防火墙后,再启动源端的传输进程DPEND,错误消失,文件顺利传输过来了。
1.6.18 OGG-01476
ERROR OGG-01476 The previous run abended due to an out of order transaction. Issue ALTER ETROLLOVER to advance the output trail sequence past the current trail sequence number, then restart. Then, use ALTER EXTSEQNO on the subsequent pump EXTRACT, or REPLICAT, process group to start reading from the new trail file created by ALTER ETROLLOVER; the downstream process will not automatically switch to the new trail file.
在初始化的时候,由于容灾端没有准备就绪,生产端来回进行了很多次的操作,导致生产端抽取混乱,此时在进行RMAN之前,重新启动抽取,忽略调之前的混乱信息。
处理方法:“alter extract xxx, etrollover”。
1.6.19 OGG-00850
ERROR OGG-00850 Oracle GoldenGate Capture for DB2, extxa.prm: Database instance XP1 has both USEREXIT and LOGRETAIN set to off.
ERROR OGG-01668 Oracle GoldenGate Capture for DB2, extxa.prm: PROCESS ABENDING.
处理方法:
(I)如果是DB2 8.1/8.2,必须将USEREXIT和LOGRETAIN设置为ON。
(2)如果是DB2 9.5,已经使用LOGARCHMETH1和LOGARCHMETH2代替以上两个参数,通常LOGARCHMETH1为DISK,LOGARCHMETH2为TSM,采用这两个参数开启归档模式。在DB2 9.5中,USEREXIT可以设置为OFF,但是LOGRETAIN仍需设置为ON。
1.6.20 OGG-01027(长事务)
示例9-25:
WARNING OGG-01027 Long Running Transaction: XID 82.4.242063, Items 0, Extract YX_EXT1, Redo Thread 1, SCN 2379.2132775890 (10219859973074), Redo Seq #5688, Redo RBA 195997712.
可以通过下面的命令寻找更详细的信息:
示例9-26:
GGSCI> send extract xxx, showtrans [thread n] [count n]
其中,thread n是可选的,表示只查看其中一个节点上的未提交交易;count n也是可选的,表示只显示n条记录。
例如查看xxx进程中节点1上最长的10个交易,可以通过下列命令:
示例9-27:
GGSCI> send extract extsz , showtrans thread 1 count 10
记录XID,通过DBA查找具体的长交易执行的内容:
示例9-28:
GGSCI> SEND EXTRACT xxx, SKIPTRANS <82.4.242063> THREAD <2> //跳过交易
GGSCI>SEND EXTRACT xxx, FORCETRANS <82.4.242063> THREAD <1> //强制认为该交 易已经提交
使用这些命令只会让GoldenGate进程跳过或者认为该交易已经提交,但并不改变数据库中的交易,它们依旧存在于数据库中。因此,强烈建议使用数据库中提交或者回滚交易而不是使用GoldenGate处理。
1.6.21 OGG-01072
示例9-22:
ERROR OGG-01072 LOBROW_get_next_chunk(LOBROW_row_t *, BOOL, BOOL, BOOL, LOBROW_chunk_header_t *, char *, size_t, BOOL, *) Buffer overflow, needed: 132, alloc 2.
处理方法:
(1)如果版本为11.1.1.0.1 Build 078版本,升级到最新的补丁包。
(2)使用“ulimit –a”查看资源使用限制,调整资源为unlimited。
(3)Extract: DBOPTIONS LOBBUFSIZE <bytes>。
(4)replicat: DBOPTIONS LOBWRITESIZE 1MB。
1.7 用户不存在
问题描述:
2010-05-02 10:45:20 GGS ERROR 2001 Oracle GoldenGate Delivery for Oracle, rcrmheal.prm: Fatal error executing DDL replication: error [Error code [1918], ORA-01918: user 'KINGSTAR' does not exist, SQL /* GOLDENGATE_DDL_REPLICATION */ alter user kingstar account unlock ], no error handler present. |
问题分析:
根据分析日志可以确定是目标端不存在该用户导致的故障。
问题处理:
方法1、如果不需要同步该用户,可以在目标端去掉掉映射该用户,再重启进程。
例如去掉:MAP KINGSTAR.*, TARGET CRMKINGSTAR.*;
方法2、在目标端手工创建该用户,再重启进程。
1.8 表不存在
问题描述:
2010-05-10 15:02:12 GGS ERROR 101 Oracle GoldenGate Delivery for Oracle, rcrmheal.prm: Table CRMOLAP.TB_FT_OFSTK_CLIENT_BY_DAY does not exist in target database. |
问题分析:
根据分析日志可以确定是目标端不存在该表导致的故障。
问题处理:
方法1、如果不需要同步该表,可以在目标端排除掉该表,再重启进程。
例如添加:MAPEXCLUDE OLAP.TB_FT_OFSTK_CLIENT_BY_DAY
方法2、在目标端手工创建该表, 异构数据库还需要重新生成表结构定义文件,再重启进程。
1.9 数据库索引失效
问题描述
2010-07-05 14:48:32 GGS WARNING 218 Oracle GoldenGate Delivery for Oracle, rapcaxht.prm: SQL error 1502 mapping AXHT.DOCONTRACT to APCAXHT.DOCONTRACT OCI Error ORA-01502: index 'APCAXHT.PK_SID' or partition of such index is in unusable state (status = 1502), SQL <INSERT INTO "APCAXHT"."DOCONTRACT" ("SID","RIQI","JGID","HT_ID","KH_XM","KH_ID","KH_NUM","CREATEDDATE","MODIFIEDDATE","USERNAME","REALNAME","BS","MEMO1","MEMO2","KH_IDLX","DXJGID","KH_IDTY","CPID") VA>. |
问题分析:
数据库索引失效引起的故障。
问题处理:
重建这个有问题的索引,再重启进程,故障排除。
1.10 表结构不一致
问题描述:
2010-05-08 14:50:44 GGS ERROR 218 Oracle GoldenGate Delivery for Oracle, rcrmheal.prm: Error mapping from OLAP.TB_FT_OFSTK_BAL_HIS to CRMOLAP.TB_FT_OFSTK_BAL_HIS. |
问题分析:
出现该问题一般都是由于同步的源和目标表结构不一致,包括表字段和索引。
问题处理:
1、 如果是表字段不一致,需要修改表字段,异构数据库还需要重新生成表结构定义文件,再重启进程。
2、 如果是索引不一致,需要重建索引,异构数据库还需要重新生成表结构定义文件,再重启进程。
1.11 磁盘空间不足
问题描述:
2010-05-07 04:05:31 GGS ERROR 103 Oracle GoldenGate Collector: Unable to write to file "./dirdat/crm/fl003629" (error 28, No space left on device). 2010-05-07 04:05:31 GGS ERROR 190 PROCESS ABENDING. |
问题分析:
根据分析日志可以确定是磁盘空间不足导致的故障。
问题处理:
划分足够的磁盘空间,再重启进程。
1.12 TCP/IP故障
问题描述:
2010-06-25 21:06:04 GGS WARNING 150 Oracle GoldenGate Capture for Oracle, BSAIAXEC.prm: TCP/IP error 10060 (由于连接方在一段时间后没有正确答复或连接的主机没有反应,连接尝试失败。). |
问题分析:
根据分析日志可以确定是不能连接到远程主机,包括ip地址或端口号。
问题处理:
需要打通能够连接到远程主机IP和端口,再重启进程。
1.13 数据库不能连接
问题描述:
2010-05-20 18:25:13 GGS ERROR 182 Oracle GoldenGate Delivery for Oracle, rtasaxta.prm: OCI Error during OCIServerAttach (status = 12154-ORA-12154: TNS:could not resolve the connect identifier specified). |
问题分析:
这种故障是数据库不能连接导致goldengate进程异常。
问题处理:
需要先解决数据库异常,再重启进程。
1.14 表空间不足
问题描述:
2010-02-01 17:19:18 GGS ERROR 103 Discard file (./dirrpt/rep1.dsc) exceeded max bytes (10000000). |
问题分析:
根据错误可以看出直接引起GoldenGate进程停止的原因是discard文件被写满了,是什么原因造成discard文件被写满的呢?从discard文件中我们看到是发生了ORA-01653: unable to extend 错误,看到这里我相信大家都知道该怎么处理了吧,我们只要扩展这个aaa.TB_LVY_TEMPINVOIC对象所在的表空间的大小即可。
问题处理:
1、找到相关对象存储的表空间;
例如:select owner,table_name,tablespace_name from dba_tables
2、执行表空间扩展
例如:ALTER TABLESPACE tbs_03 ADD DATAFILE 'tbs_f04.dbf' SIZE 100K AUTOEXTEND ON NEXT 10K MAXSIZE 100K;
1.15 网络传输问题
问题描述:
2010-06-29 16:22:28 GGS ERROR 112 There is a problem in network communication, a remote file problem, encryption keys for target and source do not match (if using ENCRYPT) or an unknown error. (Remote file used is /oradataA/ggtrail/b1000008, reply received is Unable to lock file "/oradataA/ggtrail/b1000008" (error 13, Permission denied). Lock currently held by process id (PID) 3674350). |
问题分析:
问题处理:
方法1、手工去KILL掉相应的锁进程,再重新启动进程。
方法2、不需理会,大概2小时后会自动释放该锁进程。
方法3、goldengate 10.4.0.76 会解决锁问题。
1.16 捕获进程不能为表添加补充日志
问题描述:
2010-07-19 16:20:03 GGS ERROR 2100 Oracle GoldenGate Capture for Oracle, ecrmheal.prm: Could not add TRAN DATA for table, error [ORA-32588: supplemental logging attribute all column exists, SQL ALTER TABLE "AXTECH"."TB_FUND_MATCHING" ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS /* GOLDENGATE_DDL_REPLICATION */], error code [32588], operation [ALTER TABLE "AXTECH"."TB_FUND_MATCHING" ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS /* GOLDENGATE_DDL_REPLICATION */ (size 113)]. |
问题分析:
因为表已经开启了补充日志(附加日志),而对表做DDL操作时,参数“DDLOPTIONS ADDTRANDATA”会对表重新开启补充日志(附加日子),但如果该表超过32个字段,并且该表没有唯一索引时会出现上面的异常;
问题处理:
方法1、去掉参数“DDLOPTIONS ADDTRANDATA”。
方法2、DELETE TRANDATA 用户.表
方法3、登录数据库执行: ALTER TABLE AXHT.BMBM2002 DROP SUPPLEMENTAL LOG DATA (ALL) COLUMNS
1.17 数据库补充日志(附加日志)没有打开
问题描述:
2010-10-14 09:25:50 GGS ERROR 190 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: No minimum supplemental logging is enabled. This may cause extract process to handle key update incorrectly if key column is not in first row piece. 2010-10-14 09:25:50 GGS ERROR 190 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: PROCESS ABENDING. |
问题分析:
根据分析日志可以确定是源端oracle补充日志没有打开导致的故障,如果主键或唯一索引是组合的(复合的),就需要为表配置supplemental log,否则就不必,也就是说,如果所有表的主键是单列的,那根本就不必去理会它是什么意思,如果更新了主键中的部分字段,那supplemental log的作用就是把该记录其余的组成部分的数据也传输到目标机,否则目标机就存在不确定性。
问题处理:
登录数据库,使用命令ALTER DATABASE ADD SUPPLEMENTAL LOG DATA打开补充日志。然后重新添加捕获进程和本地队列。
1.18 表补充日志(附加日志)没有打开
问题描述:
2010-10-14 09:30:49 GGS WARNING Z1-078 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: No valid default archive log destination directory found for thread 1. 2010-10-14 09:30:50 GGS ERROR 500 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: Found unsupported in-memory undo record in sequence 2, at RBA 39675920, with SCN 0.554993 (554993) ... Minimum supplemental logging must be enabled to prevent data loss. 2010-10-14 09:30:51 GGS ERROR 190 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: PROCESS ABENDING. |
问题分析:
根据分析日志可以确定是源端oracle补充日志没有打开导致的故障。
问题处理:
登录数据库,使用命令ALTER DATABASE ADD SUPPLEMENTAL LOG DATA打开补充日志。
1.19 DDL复制表没找到
问题描述:
2010-10-14 13:32:10 GGS ERROR 2008 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: DDL Replication is enabled but table GGS.GGS_DDL_HIST is not found. Please check DDL installation in the database. 2010-10-14 13:32:10 GGS ERROR 190 Oracle GoldenGate Capture for Oracle, ECRMGGS.prm: PROCESS ABENDING. |
问题分析:
根据分析日志可以确定是DDL复制操作已经打开,但没有找到安装复制DDL执行脚本产生的表GGS.GGS_DDL_HIST导致的故障。
问题处理:
因为安装复制DDL是使用用户GGDDL,执行脚本后会在该用户产生跟踪goldengate运行的表,所以要实现支持DDL操作,在参数文件中登录数据库必须使用GGDLL和对应的密码登录。例如:USERID GGDDL@CRMDB,PASSWORD GGDDL。
1.20 GoldenGate之update操作节点间不同步
GoldenGate之update操作节点间不同步
故障现象:节点1、节点2进行update操作后,不能实现同步
解决过程:
1、常规巡检:
检查进程状态:正常
GGSCI (gc1) 7> info all
Program Status Group Lag Time Since Chkpt
MANAGER RUNNING
EXTRACT RUNNING EORA_1 00:00:00 00:00:04
EXTRACT RUNNING PORA_1 00:00:00 00:00:08
REPLICAT RUNNING RORA_1 00:00:00 00:00:05
重新赋权:
SQL> grant INSERT, UPDATE, DELETE on scott.tcustmer to ogg;
--把需要同步表的DML操作授权给ogg
SQL> grant INSERT, UPDATE, DELETE on scott.tcustord to ogg;
--把需要同步表的DML操作授权给ogg
故障依旧。
2、执行如下操作:
GGSCI (gc1) 8> ADD TRANDATA scott.*
注意:先关闭rora_1进程,再添加,然后重新启动。
--两个节点操作:ADD TRANDATA scott.新表,设置后发现ogg会去捕捉新表的日志信息
故障解决,两节点可以同步update操作。
3、总结:
如果新建的表进行同步,update可能不成功,需要进行以下操作,
这样ogg才会去捕捉新表的日志信息:
操作指令:ADD TRANDATA scott.new_tab
但要注意:先关闭rora_1进程,再添加,然后重新启动。
1.21 ERROR: Could not delete DB checkpoint for REPLICAT
GGSCI (mail) 9> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT ABENDED REP1 00:00:00 00:58:10
GGSCI (mail) 10> delete REPLICAT REP1
ERROR: Could not delete DB checkpoint for REPLICAT REP1 (Database login required to delete database checkpoint).
GGSCI (mail) 11> dblogin userid ogg,password oracle
Successfully logged into database.
GGSCI (mail) 12> delete REP1
Deleted REPLICAT REP1.
GGSCI (mail) 13> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
若还是不能删除,则如下操作:
GGSCI (rhel6_lhr) 23> delete REPLICAT RORA_HR
ERROR: Could not delete DB checkpoint for REPLICAT RORA_HR (Database login required to delete database checkpoint).
GGSCI (rhel6_lhr) 24> dblogin userid ggusr@ogg2, password lhr
Successfully logged into database.
GGSCI (rhel6_lhr) 25> delete RORA_HR
ERROR: Could not delete DB checkpoint for REPLICAT RORA_HR (OCI Error ORA-00942: table or view does not exist (status = 942). Deleting from checkpoint table ggusr.ggschkpt, group 'RORA_HR', key 293545198 (0x117f24ee), SQL <DELETE FROM ggusr.ggschkpt WHERE group_name = 'RORA_HR' AND group_key = 293545198>).
GGSCI (rhel6_lhr) 26> info all
Program Status Group Lag at Chkpt Time Since Chkpt
MANAGER RUNNING
REPLICAT STOPPED RORA_HR 00:00:00 00:13:32
GGSCI (rhel6_lhr) 27> delete REPLICAT RORA_HR
ERROR: Could not delete DB checkpoint for REPLICAT RORA_HR (OCI Error ORA-00942: table or view does not exist (status = 942). Deleting from checkpoint table ggusr.ggschkpt, group 'RORA_HR', key 293545198 (0x117f24ee), SQL <DELETE FROM ggusr.ggschkpt WHERE group_name = 'RORA_HR' AND group_key = 293545198>).
GGSCI (rhel6_lhr) 28> add checkpointtable ggusr.ggschkpt
Successfully created checkpoint table ggusr.ggschkpt.
GGSCI (rhel6_lhr) 29> delete REPLICAT RORA_HR
Deleted REPLICAT RORA_HR.
1.22 GoldenGate OGG-00717 unsupported in-memory undo record
2013-07-08 16:31:48 INFO OGG-01515 Oracle GoldenGate Capture for Oracle, EXT1.prm: Positioning to begin time 2013-7-8 下午04:10:22.
2013-07-08 16:31:48 INFO OGG-01516 Oracle GoldenGate Capture for Oracle, EXT1.prm: Positioned to Sequence 18, RBA 9212432, SCN 0.0, 2013-7-8 下午04:10:22.
2013-07-08 16:31:48 ERROR OGG-00717 Oracle GoldenGate Capture for Oracle, EXT1.prm: Found unsupported in-memory undo record in sequence 18, at RBA 9212432, with SCN 0.1347542 (1347542) ... Minimum supplemental logging must be enabled to prevent data loss.
2013-07-08 16:31:48 ERROR OGG-01668 Oracle GoldenGate Capture for Oracle, EXT1.prm: PROCESS ABENDING.
搭建GoldenGate环境过程时,碰到了一个诡异的问题。“ Found unsupported in-memory undo record in sequence 18” 实际上,oracle的最小日志附加模式已经开启。
经过重启进程mgr, extract进程,问题解决,若还是不能解决就删掉重建该进程。
1.23 中文表/中文字段处理
比如有个如下的中文表:
示例9-40:
create table 测试表(
ID NUMBER,
姓名 VARCHAR2(30),
FLAG CHAR(1),
CONSTRAINT PK_TESTD PRIMARY KEY (ID) USING INDEX);
--源端创建MV LOG和MV:
drop materialized view log on "测试表";
create materialized view log on "测试表" with primary key;
drop materialized view mv_cn_table;
create materialized view mv_cn_table refresh fast on commit as select id, 姓名 as en_name,flag from "测试表";
在目标端创建表及view:
示例9-41:
create or replace view v_cn_table as select id,姓名 as en_name,flag from 测 试表;
这里NLS_LANG在GG中,抽取和复制必须设置为和目标字符集一致:
示例9-42:
SETENV (NLS_LANG = "AMERICAN_AMERICA.AL32UTF8")
Extract相关:
示例9-43:
extract ODISC
SETENV (NLS_LANG = "AMERICAN_AMERICA.AL32UTF8")
userid custom_src, password custom_src
exttrail D:/GoldenGate/dirdat/ODISoc/oc
TABLE CUSTOM_SRC.MV_CN_TABLE;
Pump相关:
示例9-44:
extract ODIT1P
SETENV (NLS_LANG = "AMERICAN_AMERICA.AL32UTF8")
PASSTHRU
rmthost localhost, mgrport 7909
rmttrail D:/gg_stg/dirdat/ODIT1op/op
TABLE CUSTOM_SRC.MV_CN_TABLE;
Replicat相关:
示例9-45:
replicat ODIT1A1
SETENV (NLS_LANG = "AMERICAN_AMERICA.AL32UTF8")
userid odi_staging, password odi_staging
discardfile D:/gg_stg/dirrpt/ODIT1.dsc, purge
ASSUMETARGETDEFS
这里必须指定APPLYNOOPUPDATES参数,否则UPDATE有问题,另外,也要指定KEYCOLS,否则删除和更新有问题:
示例9-46:
map CUSTOM_SRC.MV_CN_TABLE, TARGET ODI_STAGING.V_CN_TABLE, KEYCOLS (ID);