Using Oracle Grid to Configure High Availability for GoldenGate and Other Third-Party Applications

1. Overview

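Oracle Grid Infrastructure provides high availability not only for Oracle Database itself but also for third-party applications.

For example, it can protect logical replication tools such as OGG and SharePlex, as well as applications such as Apache.

There are two main ways to have Oracle Grid manage a third-party application: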

  1. Oracle Grid Infrastructure Agents (XAG)
  2. Third-party script (script agent)

Official documentation:

  Clusterware Administration and Deployment Guide
  Third-Party Applications Using the Script Agent

MOS reference:

  Oracle_GoldenGate_Best_Practices_-_Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1_.pdf

Log locations for third-party applications:

  Oracle Grid 11.2, when the resource is added as the oracle user:
    $GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
  From 12c onward, Grid logs follow the standard ADR directory layout:
    $GRID_BASE/diag/crs/crs/agent/scriptagent_oracle.trc
  If the resource is added as the grid user, the path or log name uses scriptagent_grid instead.

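2. Managing a Third-Party Script with Grid

The test below uses a Grid-managed third-party script to provide high availability. For the XAG approach, refer to the official documentation.

Deployment steps at a glance:

  1. Configure an application VIP. This is not a RAC VIP; it exists only for the application and exposes a single address so that a switchover is transparent to clients.
  2. Deploy the third-party script that starts and stops GoldenGate.
  3. Register the resource with crsctl and set its permissions.
  4. Test high availability.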

2.1 Configuring the VIP

  (1) Log in as root and create the VIP:
  # appvipcfg create -network=1 \
      -ip=192.168.204.242 \
      -vipname=czhvip \
      -user=root
  (2) Inspect the network configuration:
  # crsctl stat res -p |grep -ie .network -ie subnet |grep -ie name -ie subnet
  (3) As root, grant the oracle user permission on the VIP:
  # crsctl setperm resource czhvip -u user:oracle:r-x
  Note: the VIP resource must be owned by root. Other users cannot configure an IP address, so a VIP owned by anyone else fails to start.

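2.2 Deploying OGG

  Installing OGG itself is not covered here. The following layouts are possible:
  1. Use ACFS as shared storage and keep both the OGG software and its dir* directories on the ACFS file system.
     For supported ACFS versions and patches, see:
     ACFS Support On OS Platforms (Certification Matrix) (Doc ID 1369107.1)
  2. Keep only the GoldenGate trail files and similar data on ACFS and install the OGG software on a local mount point. Symbolic links in the OGG home then point to the dir* directories on ACFS:
     $ ln -s /acfs_mount_point/dirdat dirdat
  3. Use another cluster file system such as ocfs2 or gpfs.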
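For the symbolic-link layout, creating the links one by one is easy to get wrong on a second node. The following is a minimal sketch, not part of OGG: link_ogg_dirs is a hypothetical helper name, and which dir* directories belong on shared storage is an assumption you should check against your own installation.

```shell
#!/bin/sh
# Hypothetical helper (not part of OGG): point selected dir* directories in the
# OGG home at a shared mount so that every node sees the same files.
# Usage: link_ogg_dirs <ogg_home> <shared_root> <dir> [<dir> ...]
link_ogg_dirs () {
    ogg_home=$1
    shared_root=$2
    shift 2
    for d in "$@"; do
        target="${shared_root}/${d}"
        link="${ogg_home}/${d}"
        # Create the shared directory on first use.
        mkdir -p "${target}" || return 1
        if [ -L "${link}" ]; then
            # Replace a stale link left over from an earlier run.
            rm -f "${link}" || return 1
        elif [ -e "${link}" ]; then
            # A real local directory may hold data; leave it for the operator.
            echo "skipped ${d}: ${link} exists and is not a symlink" >&2
            continue
        fi
        ln -s "${target}" "${link}" || return 1
    done
}
```

A call such as `link_ogg_dirs /ogg_home /acfs_mount_point dirdat dirchk` (both paths are placeholders) would be run once on each node. Existing local directories are skipped on purpose so that their contents can be moved by hand first.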

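2.2 Script Overview

The script below is only an example. A real script should add logic that suits the application; the check entry point, for instance, needs to inspect the actual process state.

Entry points of a Grid third-party script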

  1. On Grid 11.2 the script must handle start/stop/clean/check/abort.
  Sample script:
  #!/bin/sh
  # Minimal action script: each entry point only logs its name and reports success.
  case $1 in
  'start')
      echo $(date)' start'>>/tmp/crs.log
      exit 0
      ;;
  'stop')
      echo $(date)' stop'>>/tmp/crs.log
      exit 0
      ;;
  'clean')
      echo $(date)' clean'>>/tmp/crs.log
      # Also record the exit status of the previous command
      echo $?'clean' >>/tmp/crs.log
      exit 0
      ;;
  'check')
      echo "CHECK entry point has been called.."
      echo $(date)' check'>>/tmp/crs.log
      exit 0
      ;;
  'abort')
      echo $(date)' abort'>>/tmp/crs.log
      exit 0
      ;;
  esac
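The sample always exits 0 from check, so Grid would report the resource as online even if the application had died. As a rough sketch of a check that looks at real process state, assuming the application writes its PID to a file (check_pidfile and the PID-file convention are illustrative, not something Grid provides):

```shell
#!/bin/sh
# Hypothetical check helper: succeed only if the PID stored in a PID file
# belongs to a live process that the current user may signal.
check_pidfile () {
    pidfile=$1
    # No readable PID file: treat the application as stopped.
    [ -r "${pidfile}" ] || return 1
    pid=$(cat "${pidfile}")
    # Reject empty or non-numeric content.
    case ${pid} in
        ''|*[!0-9]*) return 1 ;;
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    kill -0 "${pid}" 2>/dev/null
}
```

Inside the case statement, the check branch could then end with `check_pidfile /path/to/app.pid; exit $?` (the path is a placeholder). Note that kill -0 also fails when the process belongs to another user, so the check should run as the user that owns the application.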
  2. Entry point notes
  Only the two entry points introduced in 11gR2 are described here.
  Releases from 12c onward add more entry points, as the agent startup log shows.
  CLEAN
  Clean was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle
  Clusterware 10g Release 2 or 11g Release 1. Clean is called when there is a need to clean up the
  resource. It is a non-graceful operation.
  ABORT
  Abort was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle
  Clusterware 10g Release 2 or 11g Release 1. Abort is called if any of the resource components
  hang to abort the ongoing action. Abort is not required to be included.
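  3. Environment variables in the script
  The programs launched by start/stop/clean/check/abort may depend on environment variables. For example:
  (1) If an OGG Extract connects to the database through the local ORACLE_SID rather than a TNS alias, then ggsci> start extract depends on the ORACLE_SID environment variable. In that case, define ORACLE_SID and ORACLE_HOME inside the script. The VIP is owned by root, so when the OGG resource has a hard dependency on the VIP and Grid starts it, the script sees only root's environment, not the oracle user's, and the resource fails to start.
  (2) Define every required variable inside the script and do not rely on the caller's environment. Otherwise the resource may fail to start, or it may start without launching the expected processes, and the cause is hard to trace.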
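One way to follow this advice is to set the variables at the top of the action script and stop immediately if any of them is empty. The sketch below assumes placeholder values for ORACLE_HOME and ORACLE_SID, and require_env is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: fail fast when a variable the action script relies on
# is empty, instead of letting ggsci fail later with a connection error.
require_env () {
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "ERROR: ${name} is not set" >&2
            return 1
        fi
    done
}

# Placeholder values (assumptions); replace with the real path and SID.
ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1
ORACLE_SID=orcl1
export ORACLE_HOME ORACLE_SID

require_env ORACLE_HOME ORACLE_SID || exit 1
```

Because the values are assigned in the script itself, the result is the same whether Grid runs the script with root's environment or oracle's.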

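2.3 OGG High-Availability Script

How OGG connects to ASM depends on the database version:

  If the redo logs are stored in ASM, configure the Capture (Extract) ASM connection as follows:
  Oracle versions earlier than 10.2.0.5 or 11.2.0.2:
      TRANLOGOPTIONS ASMUSER sys@asminst, ASMPASSWORD oracle
  Oracle 10.2.0.5, 11.2.0.2, or later, with GoldenGate 11g or later:
      TRANLOGOPTIONS DBLOGREADER
  On AIX, if the database redo logs are on raw devices, TRANLOGOPTIONS RAWDEVICEOFFSET may also be required:
      TRANLOGOPTIONS RAWDEVICEOFFSET 0
  Other platforms do not need this parameter.

The script below applies when ASM is not used, or when the database is Oracle 10.2.0.5, 11.2.0.2, or later. Earlier versions, which must pick up the ASM instance's ORACLE_SID, need special handling.

For the complete example, see OracleGoldenGate_Best_Practices-Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1.pdf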

  #!/bin/sh
  # goldengate_action.scr
  # Load the oracle user's environment. That profile must define the variables
  # OGG needs (such as ORACLE_SID); otherwise starting an Extract can fail.
  . ~oracle/.bash_profile
  # Require an action argument; print usage and exit if it is missing.
  [ -z "$1" ] && echo "ERROR!! Usage $0 <start|stop|check|clean|abort>" && exit 99
  # GoldenGate installation directory
  GGS_HOME=<set the path here>
  # Delay after start before checking for a successful start
  start_delay_secs=5
  # Include the Oracle GoldenGate home in the library path to start GGSCI
  # (on AIX the variable is LIBPATH)
  export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${GGS_HOME}
  # Set the Oracle home of the database so that Oracle GoldenGate gets
  # the right environment settings to connect to the database
  export ORACLE_HOME=<set the ORACLE_HOME path here>
  export CRS_HOME=<set the CRS_HOME path here>
  # Set NLS_LANG, otherwise it defaults to US7ASCII
  export NLS_LANG=American_America.US7ASCII
  logfile=/tmp/crs_gg_start.log
  # -f: do not complain when the log file does not exist yet
  rm -f ${logfile}
  # log writes a timestamped message to stdout and to the log file.
  # (Declared without the "function" keyword so that it also works in plain sh.)
  log ()
  {
      DATETIME=`date +%d/%m/%y-%H:%M:%S`
      echo $DATETIME "goldengate_action.scr>>" $1
      echo $DATETIME "goldengate_action.scr>>" $1 >> $logfile
  }
  # check_process validates that a manager process is running at the PID
  # that Oracle GoldenGate specifies. It exits the script with the result.
  check_process ()
  {
      if [ -f "${GGS_HOME}/dirpcs/MGR.pcm" ]
      then
          pid=`cut -f8 "${GGS_HOME}/dirpcs/MGR.pcm"`
          # Quoted so that an empty ps result does not break the test
          if [ "${pid}" = "`ps -e |grep ${pid} |grep mgr |awk '{ print $1 }'`" ]
          then
              # manager process is running on the PID: exit success
              echo "manager process is running on the PID . exit success">> /tmp/check.out
              exit 0
          else
              # manager process is not running on the PID
              echo "manager process is not running on the PID" >> /tmp/check.out
              exit 1
          fi
      else
          # manager is not running because there is no PID file
          echo "manager is not running because there is no PID file" >> /tmp/check.out
          exit 1
      fi
  }
  # call_ggsci is a generic routine that executes a ggsci command
  call_ggsci () {
      log "entering call_ggsci"
      ggsci_command=$1
      #log "about to execute $ggsci_command"
      log "id= $USER"
      cd ${GGS_HOME}
      # Feed the command and "exit" to ggsci on stdin (a pipe instead of a
      # here-document, so the indentation of this listing does not matter)
      ggsci_output=`printf '%s\nexit\n' "${ggsci_command}" | ${GGS_HOME}/ggsci`
      log "got output of : $ggsci_output"
  }
  case $1 in
  'start')
      # Updated by Sourav B (02/10/2011)
      # During failover if the "mgr.pcm" file is not deleted at the node crash
      # then Oracle Clusterware won't start the manager on the new node assuming the
      # manager process is still running on the failed node. To get around this issue
      # we will delete the "mgr.pcm" file before starting up the manager on the new
      # node. We will also delete the other process files with pc* extension and to
      # avoid any file locking issue we will first backup the checkpoint files and then
      # delete them from the dirchk directory. After that we will restore the checkpoint
      # files from backup to the original location (dirchk directory).
      log "removing *.pc* files from dirpcs directory..."
      rm -f $GGS_HOME/dirpcs/*.pc*
      log "creating tmp directory to backup checkpoint file...."
      mkdir -p $GGS_HOME/dirchk/tmp
      log "backing up checkpoint files..."
      cp $GGS_HOME/dirchk/*.cp* $GGS_HOME/dirchk/tmp
      log "Deleting checkpoint files under dirchk......"
      rm -f $GGS_HOME/dirchk/*.cp*
      log "Restore checkpoint files from backup to dirchk directory...."
      cp $GGS_HOME/dirchk/tmp/*.cp* $GGS_HOME/dirchk
      log "Deleting tmp directory...."
      rm -r $GGS_HOME/dirchk/tmp
      log "starting manager"
      call_ggsci 'start manager'
      # there is a small delay between issuing the start manager command
      # and the process being spawned on the OS. Wait before checking
      log "sleeping for start_delay_secs"
      sleep ${start_delay_secs}
      # check whether manager is running and exit accordingly
      check_process
      ;;
  'stop')
      # attempt a clean stop for all non-manager processes
      call_ggsci 'stop er *'
      # ensure everything is stopped
      call_ggsci 'stop er *!'
      # stop manager without (y/n) confirmation
      call_ggsci 'stop manager!'
      # exit success
      exit 0
      ;;
  'check')
      # check_process exits with 0 (running) or 1 (not running)
      check_process
      ;;
  'clean')
      # attempt a clean stop for all non-manager processes
      call_ggsci 'stop er *'
      # ensure everything is stopped
      call_ggsci 'stop er *!'
      # in case there are lingering processes
      call_ggsci 'kill er *'
      # stop manager without (y/n) confirmation
      call_ggsci 'stop manager!'
      # exit success
      exit 0
      ;;
  'abort')
      # ensure everything is stopped
      call_ggsci 'stop er *!'
      # in case there are lingering processes
      call_ggsci 'kill er *'
      # stop manager without (y/n) confirmation
      call_ggsci 'stop manager!'
      # exit success
      exit 0
      ;;
  esac
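Before registering the script with Grid, it can help to call its entry points by hand as the oracle user and look at the exit codes, since the exit code is what the agent acts on. A small sketch of such a dry run follows; try_entry_points is a hypothetical name, and the script path in the usage note is the placeholder used elsewhere in this article.

```shell
#!/bin/sh
# Hypothetical dry-run helper: call the given entry points of an action script
# by hand and print the exit code of each call.
try_entry_points () {
    script=$1
    shift
    for action in "$@"; do
        # Output is discarded here; only the exit code is reported.
        "${script}" "${action}" >/dev/null 2>&1
        echo "${action} -> exit $?"
    done
}
```

For example, `try_entry_points /acfs_mount_point/ogg.sh check` only probes the manager state, whereas passing start or stop really starts or stops GoldenGate, so use those on a test system.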

2.4 Adding the OGG Resource with CRSCTL

  # Log in as oracle:
  $ /u01/app/11.2/grid/bin/crsctl add resource oggapp -type cluster_resource -attr "ACTION_SCRIPT='/acfs_mount_point/ogg.sh',CHECK_INTERVAL=30,START_DEPENDENCIES='hard(czhvip) pullup(czhvip)',STOP_DEPENDENCIES='hard(czhvip)'"
  Notes:
  - The script may live in any local directory that the oracle user can read and execute. If it is stored locally, every Grid node needs its own copy of the file.
  - If OGG is installed on ACFS, START_DEPENDENCIES can also include a hard dependency on ASM.
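The attribute string names the VIP three times, and the start and stop dependencies must refer to the same VIP. A sketch that builds the string from a single VIP argument is shown below; build_ogg_attr is a hypothetical helper, and it covers only the attributes used in this article.

```shell
#!/bin/sh
# Hypothetical helper: build the -attr string from one VIP name so the start
# and stop dependencies cannot drift apart.
# Usage: build_ogg_attr <action_script> <vip_name> [<check_interval>]
build_ogg_attr () {
    action_script=$1
    vip=$2
    interval=${3:-30}
    printf "ACTION_SCRIPT='%s',CHECK_INTERVAL=%s,START_DEPENDENCIES='hard(%s) pullup(%s)',STOP_DEPENDENCIES='hard(%s)'\n" \
        "${action_script}" "${interval}" "${vip}" "${vip}" "${vip}"
}
```

The result can be passed straight to crsctl, for instance `crsctl add resource oggapp -type cluster_resource -attr "$(build_ogg_attr /acfs_mount_point/ogg.sh czhvip)"`.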

With these steps the third-party application is managed by Grid, which is convenient and practical.
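The last deployment step, testing high availability, can be as simple as relocating the resource to another node and checking where it ends up. The sketch below wraps the standard crsctl status and relocate commands; ha_smoke_test, the CRSCTL variable, and the node name in the usage note are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical smoke test. CRSCTL may be set to the full path of crsctl.
CRSCTL=${CRSCTL:-crsctl}

# Usage: ha_smoke_test <resource_name> <target_node>
ha_smoke_test () {
    res=$1
    target_node=$2
    # Show where the resource is running now.
    "${CRSCTL}" status resource "${res}" || return 1
    # -f forces the relocation; it is typically needed when resources tied by
    # hard dependencies (here the VIP) have to move as well.
    "${CRSCTL}" relocate resource "${res}" -n "${target_node}" -f || return 1
    # Show the placement after the move.
    "${CRSCTL}" status resource "${res}"
}
```

A run might look like `ha_smoke_test oggapp db-oracle-node2`, where the node name is a placeholder. A harsher test is to kill the manager process on the active node and watch whether Grid restarts it after the next check interval.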

3. Troubleshooting

3.1 Resource Fails to Start

  1. Symptom
  $ crsctl start res czhapp
  CRS-2672: Attempting to start 'czhapp' on 'db-oracle-node1'
  CRS-2674: Start of 'czhapp' on 'db-oracle-node1' failed
  CRS-2679: Attempting to clean 'czhapp' on 'db-oracle-node1'
  CRS-2678: 'czhapp' on 'db-oracle-node1' has experienced an unrecoverable failure
  CRS-0267: Human intervention required to resume its availability.
  CRS-4000: Command Start failed, or completed with errors.
  If the resource is owned by oracle, the log directory is:
  $GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
  Key log entries:
  2021-04-26 11:35:07.342: [czhapp][156428032]{1:39006:13462} [clean] Executing action script: /software/crs.sh[clean]
  2021-04-26 11:35:07.397: [ AGFW][156428032]{1:39006:13462} Command: clean for resource: czhapp 1 1 completed with invalid status: 209
  2021-04-26 11:35:07.397: [czhapp][156428032]{1:39006:13462} [check] Executing action script: /software/crs.sh[check]
  2021-04-26 11:35:07.397: [ AGFW][158529280]{1:39006:13462} Agent sending reply for: RESOURCE_CLEAN[czhapp 1 1] ID 4100:717590
  2021-04-26 11:35:07.454: [ AGFW][156428032]{1:39006:13462} Received unknown resource status code: 209
  2021-04-26 11:35:07.455: [ AGFW][158529280]{1:39006:13462} czhapp 1 1 state changed from: CLEANING to: UNKNOWN
  2. Analysis
  The log shows that the agent located the script. However, trace output added at specific points inside the script showed that the script never actually ran.
  The cause turned out to be a missing interpreter declaration on the first line of the script.
  3. Fix
  Add the interpreter line at the top of the script:
  #!/bin/sh
  Scripts should always be written properly. Leaving out the declaration had never caused trouble before, so it was a surprise that Grid could not run the script without it.
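A quick pre-flight check can catch this class of problem before the resource is registered. The following sketch tests only two things, the execute bit and the #! line; check_action_script is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical pre-flight check for an action script.
check_action_script () {
    script=$1
    if [ ! -x "${script}" ]; then
        echo "not executable: ${script}" >&2
        return 1
    fi
    # Read only the first line; it must start with the interpreter marker.
    IFS= read -r first_line < "${script}"
    case ${first_line} in
        '#!'*) return 0 ;;
        *)
            echo "missing #! line: ${script}" >&2
            return 1
            ;;
    esac
}
```

Running `check_action_script /software/crs.sh` as the user that owns the resource would have flagged the script from this case.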

3.2 OGG Cannot Start the Extract

  1. Symptom
  OCI-related errors; the Extract cannot connect to the database.
  2. Analysis
  On AIX:
  ps -ef|grep goldengate
  ps eauwww <pid>
  The process environment contains no ORACLE_SID.
  The Extract is not configured to connect through a TNS alias, so it depends on ORACLE_SID in the OS environment of the user that starts it. The VIP resource created by appvipcfg did not grant the oracle user sufficient permissions, so oracle could not start the VIP resource. The VIP was therefore started by root, whose environment has no ORACLE_SID, and the Extract failed to start.
  3. Fix
  Log in as root:
  # crsctl setperm resource oggvip -u user:oracle:rwx
  Log in as oracle and test:
  $ crsctl start resource oggvip
  If oracle still cannot start the resource, adjust the oggvip permissions further.
  Log in as root and set the permissions for other to rwx:
  # crsctl getperm resource oggvip
  # crsctl setperm resource oggvip -u other::rwx
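The analysis above reads the process environment with the AIX form of ps. On Linux, the same information is usually available as a NUL-separated file under /proc. The sketch below assumes that Linux layout, and env_var_from_dump is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print NAME=value for one variable from a NUL-separated
# environment dump such as /proc/<pid>/environ on Linux.
# Returns non-zero when the variable is not present.
env_var_from_dump () {
    dump=$1
    name=$2
    # Turn the NUL separators into newlines, then keep the exact variable.
    tr '\0' '\n' < "${dump}" | grep "^${name}="
}
```

With the PID of the manager or Extract process, `env_var_from_dump /proc/<pid>/environ ORACLE_SID` prints the value the process was started with, or nothing if the variable is missing. Reading another user's environ file requires root.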

3.3 appvipcfg Fails to Run

  1. Symptom
  # ./appvipcfg create -network=1 \
      -ip=192.168.204.245 \
      -vipname=czhvip \
      -user=root
  /bin/ls: cannot access /ade/ade_88979932/perl/lib: No such file or directory
  2. Cause
  Applying a patch with opatch changed the contents of appvipcfg. appvipcfg is a shell script under $GRID_HOME/bin, not a binary. It defines ORACLE_HOME and ORA_CRS_HOME, and the patch left both variables pointing at wrong paths. Setting them back to the correct Grid home resolves the problem.
  $ cat /u01/app/11.2/grid/bin/appvipcfg
  #!/bin/sh
  #
  # This script is used for managing
  # user mode vip resource.
  #
  # Do not change the line below for ORACLE_HOME setting
  #ORACLE_HOME=/u01/app/11.2/grid
  ORACLE_HOME=/ade/ade_19289128/11.2/grid
  export ORACLE_HOME
  #ORA_CRS_HOME=/u01/app/11.2/grid
  ORA_CRS_HOME=/ade/ade_19289128/11.2/grid
  export ORA_CRS_HOME
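After patching, a short check can confirm that the homes hard-coded in such a wrapper script still exist. This is a sketch only: check_home_lines is a hypothetical helper, and it assumes the assignments are written as plain NAME=path lines, as in the listing above.

```shell
#!/bin/sh
# Hypothetical check: report ORACLE_HOME / ORA_CRS_HOME assignments in a
# wrapper script that point at a directory that does not exist.
check_home_lines () {
    # Only active (uncommented) assignments at the start of a line are read.
    bad=$(sed -n -e 's/^ORACLE_HOME=//p' -e 's/^ORA_CRS_HOME=//p' "$1" |
        while IFS= read -r dir; do
            [ -d "${dir}" ] || echo "${dir}"
        done)
    if [ -n "${bad}" ]; then
        echo "nonexistent home(s) in $1:" >&2
        echo "${bad}" >&2
        return 1
    fi
}
```

Calling `check_home_lines /u01/app/11.2/grid/bin/appvipcfg` on the file shown above would list the /ade/... path twice and return a non-zero status.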