1. 概述
Oracle Grid不止能提供自身Oracle Database高可用,还可以为第三方应用提供高可用。
可以为OGG、SharePlex等逻辑复制,Apache等应用提供高可用。
使用Oracle Grid代理第三方应用主要有以下两种方式:
- Oracle Grid Infrastructure Agents
- Third-Part Script
官方文档位置:
Clusterware Administration and Deployment Guide
Third-Party Applications Using the Script Agent
Mos文档参考:
Oracle_GoldenGate_Best_Practices_-_Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1_.pdf
- 关于第三方应用日志位置
Oracle Grid 11.2如果使用oracle添加资源,则日志位置:
$GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
12c以后GRID日志也变为标准ADR目录
$GRID_BASE/diag/crs/crs/agent/scriptagent_oracle.trc
# 如果为GRID添加资源,路径或日志名称scriptagent_grid即可。
2. Grid代理第三方脚本
下面测试利用Grid代理第三方脚本形式提供高可用,XAG方式参考官方文档即可。
部署步骤概述:
- 配置应用VIP(此VIP不是RAC VIP,仅仅为了应用本身使用),对外提供唯一IP,使切换对应用透明。
- 部署goldengate启停第三方脚本。
- crsctl加载资源,配置权限。
- 测试高可用。
2.1 配置VIP
(1) login as root
# appvipcfg create -network=1 \
-ip=192.168.204.242 \
-vipname=czhvip \
-user=root
(2) 查看配置vip
# crsctl stat res -p |grep -ie .network -ie subnet |grep -ie name -ie subnet
(3) login as root
# crsctl setperm resource czhvip -u user:oracle:r-x
--配置资源使用权限用户,IP资源属主一定必须是root,其他用户无法配置IP,会导致无法启动VIP资源。
2.2 部署OGG
ogg安装部署不在此赘述,可按照以下几种方式:
1. 使用ACFS作为共享磁盘,OGG软件本身以及dir*相关目录均存放于ACFS文件系统。
ACFS相应版本以及补丁参考下面文档:
ACFS Support On OS Platforms (Certification Matrix). (Doc ID 1369107.1).pdf
2.使用ACFS存放goldengate的trail文件等,OGG软件本身存放于操作挂载点即可,通过在操作系统相应路径下建立软链接方式指向ACFS中dir*相应目录
$ ln –s /acfs_mount_point/dirdat dirdat
3.使用例如ocfs2、gpfs等集群文件系统存放
2.2 部署脚本说明
下面脚本仅仅用做示例,实际脚本可以根据不同应用加入相应模块脚本,比如check脚本就需要判断进程状态等等。
Grid 第三方脚本模块说明
1. Grid 11.2脚本需要包含start/stop/clean/check/abort
--示例脚本
#!/bin/sh
case $1 in
'start')
echo $(date)' start'>>/tmp/crs.log
exit 0
;;
'stop')
echo $(date)' stop'>>/tmp/crs.log
exit 0
;;
'clean')
echo $(date)' clean'>>/tmp/crs.log
echo $?'clean' >>/tmp/crs.log
exit 0
;;
'check')
echo "CHECK entry point has been called.."
echo $(date)' check'>>/tmp/crs.log
exit 0
;;
'abort')
echo $(date)' abort'>>/tmp/crs.log
exit 0
;;
esac
2. 模块说明
--主要介绍11gR2引入的两个新模块
--12c以后版本引入了更多模块,这点可以从启动日志中看到。
CLEAN
Clean was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle
Clusterware 10g Release 2 or 11g Release 1. Clean is called when there is a need to clean up the
resource. It is a non-graceful operation.
ABORT
Abort was introduced with Oracle Clusterware 11g Release 2. It will not be used for Oracle
Clusterware 10g Release 2 or 11g Release 1. Abort is called if any of the resource components
hang to abort the ongoing action. Abort is not required to be included.
3.关于脚本中变量说明
如果start/stop/clean/check/abort对应脚本中启动程序脚本需要依赖环境变量,例如
(1)ogg如果extract配置使用本地ORACLE_SID连接数据库进行捕获,不是使用tnsalias方式连接数据库,则ggsci> start extract时,依赖于环境变量ORACLE_SID,这种情况下,需要在上面脚本中定义好依赖的ORACLE_SID以及ORACLE_HOME变量,因为Grid启动时由于vip属主为root,所以如果vip与ogg资源强依赖时,只能获取到root的用户环境变量,无法获得oracle用户环境变量,会导致资源无法正常启动。
(2)所以环境变量一定要在脚本中完全定义,不要依赖于外部变量,否则将会发生问题后很难排查以及遇到无法启动资源或启动资源无法启动程序中相应进程。
2.3 OGG高可用脚本
下面为OGG连接ASM与版本关系
如果 Redo Log 存储在 ASM 中,设置 Catpure ASM 连接方式如下:
Oracle 10.2.0.5 或 11.2.0.2 之前版本:
TRANLOGOPTIONS ASMUSER sys@asminst, asmpassword oracle
Oracle 10.2.0.5、11.2.0.2 或之以后版本,GoldenGate 为 11g 或以后版本:
TRANLOGOPTIONS DBLOGREADER
如果在 AIX 平台数据库的 redo log 使用的是 RAW,则可能需要设置参数:TRANLOGOPTIONS
RAWDEVICEOFFSET,设置此参数:
TRANLOGOPTIONS RAWDEVICEOFFSET 0
其他平台不需要设置此参数。
下面脚本为未使用ASM或Oracle 10.2.0.5、11.2.0.2 或之以后版本,如果为早期需要调取ASM实例ORACLE_SID,则需要特殊处理
完整示例详细可以参考OracleGoldenGate_Best_Practices-Oracle_GoldenGate_high_availability_using_Oracle_Clusterware_v8_6_ID1313703_1.pdf
#!/bin/sh
# goldengate_action.scr
# 生效oracle用户下环境变量,oracle下环境变量一定要配置相关变量,防止下面启动ogg无法读取相关ORACLE_SID导致启动extract失败
. ~oracle/.bash_profile
# 判断调用脚本是否有选项,如果第一个选项为空,则报错,提示使用选项
[ -z "$1" ]&& echo "ERROR!! Usage $0 <start|stop|abort|clean>"&& exit 99
# 指定goldengate安装目录
GGS_HOME=<set the path here>
#specify delay after start before checking for successful start
start_delay_secs=5
#Include the Oracle GoldenGate home in the library path to start GGSCI,AIX variable is LIBPATH
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${GGS_HOME}
#set the oracle home to the database to ensure Oracle GoldenGate will get
#the right environment settings to be able to connect to the database
export ORACLE_HOME=<set the ORACLE_HOME path here>
export CRS_HOME=<set the CRS_HOME path here>
#Set NLS_LANG otherwise it will default to US7ASCII
export NLS_LANG=American_America.US7ASCII
logfile=/tmp/crs_gg_start.log
\rm ${logfile}
# define function log.
function log ()
{
DATETIME=`date +%d/%m/%y-%H:%M:%S`
echo $DATETIME "goldengate_action.scr>>" $1
echo $DATETIME "goldengate_action.scr>>" $1 >> $logfile
}
# define function check_process to check goldengate MGR process is runing or not.
#check_process validates that a manager process is running at the PID
#that Oracle GoldenGate specifies.
check_process ()
{
if ( [ -f "${GGS_HOME}/dirpcs/MGR.pcm" ] )
then
pid=`cut -f8 "${GGS_HOME}/dirpcs/MGR.pcm"`
if [ ${pid} = `ps -e |grep ${pid} |grep mgr |awk '{ print $1 }'` ]
then
#manager process is running on the PID . exit success
echo "manager process is running on the PID . exit success">> /tmp/check.out
exit 0
else
#manager process is not running on the PID
echo "manager process is not running on the PID" >> /tmp/check.out
exit 1
fi
else
#manager is not running because there is no PID file
echo "manager is not running because there is no PID file" >> /tmp/check.out
exit 1
fi
}
# call_ggsci is a generic routine that executes a ggsci command
call_ggsci () {
log "entering call_ggsci"
ggsci_command=$1
#log "about to execute $ggsci_command"
log "id= $USER"
cd ${GGS_HOME}
ggsci_output=`${GGS_HOME}/ggsci << EOF
${ggsci_command}
exit
EOF`
log "got output of : $ggsci_output"
}
case $1 in
'start')
#Updated by Sourav B (02/10/2011)
# During failover if the “mgr.pcm” file is not deleted at the node crash
# then Oracle clusterware won’t start the manager on the new node assuming the
# manager process is still running on the failed node. To get around this issue
# we will delete the “mgr.prm” file before starting up the manager on the new
# node. We will also delete the other process files with pc* extension and to
# avoid any file locking issue we will first backup the checkpoint files and then
# delete them from the dirchk directory.After that we will restore the checkpoint
# files from backup to the original location (dirchk directory).
log "removing *.pc* files from dirpcs directory..."
rm -f $GGS_HOME/dirpcs/*.pc*
log "creating tmp directory to backup checkpoint file...."
mkdir $GGS_HOME/dirchk/tmp
log "backing up checkpoint files..."
cp $GGS_HOME/dirchk/*.cp* $GGS_HOME/dirchk/tmp
log "Deleting checkpoint files under dirchk......"
rm -f $GGS_HOME/dirchk/*.cp*
log "Restore checkpoint files from backup to dirchk directory...."
cp $GGS_HOME/dirchk/tmp/*.cp* $GGS_HOME/dirchk
log "Deleting tmp directory...."
rm -r $GGS_HOME/dirchk/tmp
log "starting manager"
call_ggsci 'start manager'
#there is a small delay between issuing the start manager command
#and the process being spawned on the OS . wait before checking
log "sleeping for start_delay_secs"
sleep ${start_delay_secs}
#check whether manager is running and exit accordingly
check_process
;;
'stop')
#attempt a clean stop for all non-manager processes
call_ggsci 'stop er *'
#ensure everything is stopped
call_ggsci 'stop er *!'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
'check')
check_process
exit 0
;;
'clean')
#attempt a clean stop for all non-manager processes
call_ggsci 'stop er *'
#ensure everything is stopped
call_ggsci 'stop er *!'
#in case there are lingering processes
call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
'abort')
#ensure everything is stopped
call_ggsci 'stop er *!'
#in case there are lingering processes
call_ggsci 'kill er *'
#stop manager without (y/n) confirmation
call_ggsci 'stop manager!'
#exit success
exit 0
;;
esac
2.4 CRSCTL添加ogg Grid资源
# login as oracle:
$ /u01/app/11.2/grid/bin/crsctl add resource oggapp -type cluster_resource -attr "ACTION_SCRIPT='/acfs_mount_point/ogg.sh',CHECK_INTERVAL=30,START_DEPENDENCIES='hard(czhvip) pullup(czhvip)',STOP_DEPENDENCIES='hard(mvggatevip)'"
--脚本位置可以存放于本地oracle用户有读取执行权限的目录,如果存放于本地,则Grid各个节点都需要备份该文件
--如果ogg安装使用acfs,则START_DEPENDENCIES可以配置与ASM强依赖。
上述步骤即已完成第三方应用使用Grid托管,还是非常方便实用的。
3. 遇到问题解决
3.1 无法启动resource
1. 无法启动
$ crsctl start res czhapp
CRS-2672: Attempting to start 'czhapp' on 'db-oracle-node1'
CRS-2674: Start of 'czhapp' on 'db-oracle-node1' failed
CRS-2679: Attempting to clean 'czhapp' on 'db-oracle-node1'
CRS-2678: 'czhapp' on 'db-oracle-node1' has experienced an unrecoverable failure
CRS-0267: Human intervention required to resume its availability.
CRS-4000: Command Start failed, or completed with errors.
# 如果配置资源属于Oracle,则日志目录为:
$GRID_HOME/log/{node_name}/agent/crsd/scriptagent_oracle
--关键内容如下
2021-04-26 11:35:07.342: [czhapp][156428032]{1:39006:13462} [clean] Executing action script: /software/crs.sh[clean]
2021-04-26 11:35:07.397: [ AGFW][156428032]{1:39006:13462} Command: clean for resource: czhapp 1 1 completed with invalid status: 209
2021-04-26 11:35:07.397: [czhapp][156428032]{1:39006:13462} [check] Executing action script: /software/crs.sh[check]
2021-04-26 11:35:07.397: [ AGFW][158529280]{1:39006:13462} Agent sending reply for: RESOURCE_CLEAN[czhapp 1 1] ID 4100:717590
2021-04-26 11:35:07.454: [ AGFW][156428032]{1:39006:13462} Received unknown resource status code: 209
2021-04-26 11:35:07.455: [ AGFW][158529280]{1:39006:13462} czhapp 1 1 state changed from: CLEANING to: UNKNOWN
2. 分析
可以从日志输出看到,识别到了脚本,但是通过在脚本中指定位置配置输出,发现脚本并未真正执行。
最终排查原因主要为脚本开头未声明脚本类型导致该问题。
3. 解决
#!/bin/sh
--写脚本还是要规范,以前写脚本偶尔拉下声明部分,并不影响,这次Oracle Grid代理脚本没有声明部分无法启动还是挺意外的,也说明还是要规范。
3.2 OGG无法启动extract
1. 现象
OCI相关报错,无法连接数据库
2.分析
AIX:
ps -ef|grep goldengate
ps eauwww <pid>
查看进程环境变量发现,变量中无ORACLE_SID。
由于goldengate extract中配置,未配置使用tnsalias方式连接数据库,所以依赖于启动extract时用户操作系统环境变量ORACLE_SID,但是由于appvipcfg配置
的vip资源未给oracle足够权限,导致使用oracle用户无法启动vip资源,进而导致使用root启动vip资源之后,环境变量无法取到ORACLE_SID,导致未能启动extract。
3.解决
--login as root
# crsctl setperm resource oggvip -u user:oracle:rwx
--login as oracle 测试
$ crsctl start resource oggvip
--如果上述命令依然无法使oracle启动资源,则继续修改oggvip权限
--login as root
--将other组权限设置为rwx即可解决
# crsctl getperm resource oggvip
# crsctl setperm resource oggvip -u other::rwx
3.3 appvipcfg无法执行
1. 现象
# ./appvipcfg create -network=1 \
-ip=192.168.204.245 \
-vipname=czhvip \
-user=root
/bin/ls: cannot access /ade/ade_88979932/perl/lib: No such file or directory
2. 原因
由于opatch打补丁导致appvipcfg内容发生改变,appvipcfg本身为$GRID_HOME/bin/下的一个脚本文件,不是一个二进制文件,脚本中定义了ORACLE_HOME与ORA_CRS_HOME,由于打补丁导致该文件两个变量不正确,修改为正确路径即可解决。
$ cat /u01/app/11.2/grid/bin/appvipcfg
#!/bin/sh
#
# This script is used for managing
# user mode vip resource.
#
# Do not change the line below for ORACLE_HOME setting
#ORACLE_HOME=/u01/app/11.2/grid
ORACLE_HOME=/ade/ade_19289128/11.2/grid
export ORACLE_HOME
#ORA_CRS_HOME=/u01/app/11.2/grid
ORA_CRS_HOME=/ade/ade_19289128/11.2/grid
export ORA_CRS_HOME