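Problems really do keep coming. I have built quite a few RAC environments, yet local experiments still run into trouble. Last night I wiped the old environment and built RAC twice over, finally getting it working in the early hours. I configured EM, had a look, and it seemed fine, so I put the virtual machines into suspend. In the morning I resumed them and found that both nodes had stopped on their own, and restarting them failed. That was only the beginning.
#Problem 1: node instances fail to start
Starting the database with srvctl reported the following errors.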
srvctl start database -d RACDB
PRKP-1001 : Error starting instance RACDB1 on node rac1
CRS-0215: Could not start resource 'ora.RACDB.RACDB1.inst'.
PRKP-1001 : Error starting instance RACDB2 on node rac2
CRS-0215: Could not start resource 'ora.RACDB.RACDB2.inst'.
I then tried to start the instance manually and got an ORA-00600 error.
ORA-00600: internal error code, arguments: [3716], [], [], [], [], [], [], []
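The logs did not offer much information, and both nodes showed the same problem. In the end I turned to Metalink, which happened to have a note, 796425., describing exactly this problem.
It gives the cause as follows: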
CAUSE
Just before the error occurs we see there is a Node reboot and a Oracle instance startup without
proper shutdown. This might cause a missed write to Controlfile.
The error may be reported during alter database open due to mismatch in information in controlfile
Once the cause is understood, the fix is much easier.
On node 1, proceed as follows.
[oracle@rac1 ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Sat Aug 22 10:26:43 2015
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [3716], [], [], [], [], [], [], []
Back up the control file.
SQL> Alter database backup controlfile to trace ;
Database altered.
SQL> Alter database backup controlfile to '/home/oracle/ctl_bak.ctl';
Database altered.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
[oracle@rac1 ~]$ rman target /
Recovery Manager: Release 10.2.0.1.0 - Production on Sat Aug 22 10:28:03 2015
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: RACDB (DBID=885985165, not open)
RMAN> List backup of controlfile ;
using target database control file instead of recovery catalog
RMAN> Shutdown immediate ;
database dismounted
Oracle instance shut down
RMAN> Startup nomount
Oracle instance started
Total System Global Area 218103808 bytes
Fixed Size 1218604 bytes
Variable Size 88082388 bytes
Database Buffers 125829120 bytes
Redo Buffers 2973696 bytes
RMAN> Restore controlfile from '/home/oracle/ctl_bak.ctl';
Starting restore at 2015-08-22:10:30:34
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=147 instance=RACDB1 devtype=DISK
channel ORA_DISK_1: copied control file copy
output filename=+DATA/racdb/controlfile/current.286.888368397
output filename=+DATA/racdb/controlfile/current.285.888368401
Finished restore at 2015-08-22:10:30:42
RMAN> alter database mount;
database mounted
released channel: ORA_DISK_1
RMAN> recover database;
Starting recover at 2015-08-22:10:30:58
Starting implicit crosscheck backup at 2015-08-22:10:30:58
allocated channel: ORA_DISK_1
Finished implicit crosscheck backup at 2015-08-22:10:30:59
Starting implicit crosscheck copy at 2015-08-22:10:30:59
using channel ORA_DISK_1
Finished implicit crosscheck copy at 2015-08-22:10:30:59
searching for all files in the recovery area
cataloging files...
no files cataloged
using channel ORA_DISK_1
starting media recovery
archive log thread 1 sequence 2 is already on disk as file +DATA/racdb/onlinelog/group_1.287.888368405
archive log thread 1 sequence 3 is already on disk as file +DATA/racdb/onlinelog/group_2.289.888368411
archive log thread 2 sequence 1 is already on disk as file +DATA/racdb/onlinelog/group_3.293.888368557
archive log filename=+DATA/racdb/onlinelog/group_1.287.888368405 thread=1 sequence=2
archive log filename=+DATA/racdb/onlinelog/group_3.293.888368557 thread=2 sequence=1
archive log filename=+DATA/racdb/onlinelog/group_2.289.888368411 thread=1 sequence=3
media recovery complete, elapsed time: 00:00:06
Finished recover at 2015-08-22:10:31:07
RMAN> alter database open resetlogs;
After that, node 2 can simply be started.
Node 2:
SQL> startup nomount
......
SQL> alter database mount;
Database altered.
SQL> alter database open;
Database altered.
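The node 1 recovery above can be condensed into a single RMAN command file. The following is a minimal sketch, assuming the control file backup already exists at /home/oracle/ctl_bak.ctl as in the session above. The function name print_recover_cmds and the file name recover_ctl.rman are made up for illustration. The function only prints the commands, so nothing touches the database until the file is handed to rman.

```shell
#!/bin/sh
# Print the RMAN commands from the node 1 session above, in the same order.
# print_recover_cmds is a hypothetical helper name; the backup path is the
# one used in the session and must already exist before the restore step.
print_recover_cmds() {
    cat <<'EOF'
shutdown immediate;
startup nomount;
restore controlfile from '/home/oracle/ctl_bak.ctl';
alter database mount;
recover database;
alter database open resetlogs;
EOF
}
```

One way to use it is to run `print_recover_cmds > recover_ctl.rman`, review the file, and then run `rman target / cmdfile=recover_ctl.rman` on node 1. Because the last command opens the database with resetlogs, it is worth reading the file carefully before running it.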
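That settled the first problem. To be safe this time, I shut the virtual machines down properly and reopened them after a meal, only to find another problem: node 2 could start, but node 1 would not start no matter what.
#Problem 2: one node instance fails to start
The log indicated that the MMAN process had terminated. I tried various comparison tests, but node 1 simply would not start.
Surprisingly, Metalink had help for this one too:
Oracle Crash After Ora-7445 ORA-822 (Doc ID 1422003.1)
The note mostly talks about swap being too small or about adjusting sga_target. As soon as I saw those words I understood: node 1 could not start because memory was short. Running free -m showed only about 30 MB of memory left. The VM had just 800 MB in total, and CRS, ASM and EM had used nearly all of it. After I added some more memory, the problem went away.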
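The memory check described above can be scripted so it runs before each instance start. This is a minimal sketch: mem_free_mb and low_memory are hypothetical helper names, and the 100 MB threshold in the usage comment is an arbitrary illustration, not a figure from the Metalink note. Both helpers read `free -m` output on stdin, which keeps them easy to try against saved output.

```shell
#!/bin/sh
# Read `free -m` output on stdin and print the "free" column of the Mem: row.
# mem_free_mb and low_memory are hypothetical helper names.
mem_free_mb() {
    awk '$1 == "Mem:" { print $4 }'
}

# Succeed (exit status 0) when free memory is below the threshold given in MB.
# Usage: free -m | low_memory 100   (100 is an arbitrary example threshold)
low_memory() {
    free_mb=$(mem_free_mb)
    [ -n "$free_mb" ] && [ "$free_mb" -lt "$1" ]
}
```

For example, `free -m | low_memory 100 && echo "memory is tight, add RAM before starting the instance"` prints the warning only when less than 100 MB is free.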
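Of course that was not the end. When I rebuilt RAC, the CRS and ASM configuration went fairly smoothly, with every error and problem under control, but the last step, creating the database with dbca, still threw an error, and one that could not be ignored.
#Problem 3: dbca database creation fails
The error messages were: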
DBCA_PROGRESS : 2%
ORA-00119: invalid specification for system parameter REMOTE_LISTENER
ORA-00132: syntax error or unresolved network name 'LISTENER_RACDBTEST'
ORA-01078: failure in processing system parameters
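This was rather strange, because I had kept the default parameters in dbca and changed nothing else.
This one did teach me something, since I had not met it before. I checked Metalink:
ORA-119, ORA-132 ORA-1078 Received From DBCA (Doc ID 433817.1)
The approach there is to uncheck the corresponding parameter in one of the last few dbca steps. That does solve the problem, but it should not be applied blindly, because until now creating a database instance with the default parameters had never caused trouble. If the parameter has to be disabled by hand, something else is wrong, and the real cause turned out to be fairly simple.
Looking at the listeners in the CRS status shows the output below. The listener entries on both nodes include a line in the UNKNOWN state. In short, this comes from the OCR information not being cleaned up properly when the old RAC environment was removed.
We can either create new listeners or delete these stale entries.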
[oracle@rac1 ~]$ crs_stat -t|grep lsnr
ora....C1.lsnr application ONLINE UNKNOWN rac1
ora....C1.lsnr application ONLINE ONLINE rac1
ora....C2.lsnr application ONLINE UNKNOWN rac2
ora....C2.lsnr application ONLINE ONLINE rac2
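Stale entries like these are easy to miss in a longer listing, so the filter can be scripted. This is a minimal sketch: lsnr_unknown is a hypothetical helper name, and it assumes the column order seen in the output above (name, type, target, state, host).

```shell
#!/bin/sh
# Read `crs_stat -t` output on stdin and print the name and host of every
# listener resource whose State column is UNKNOWN.
# Assumed column order: Name Type Target State Host.
# lsnr_unknown is a hypothetical helper name.
lsnr_unknown() {
    awk '$1 ~ /lsnr$/ && $4 == "UNKNOWN" { print $1, $5 }'
}
```

Run it as `crs_stat -t | lsnr_unknown`. Resource names are abbreviated in the -t listing, so the full name of each stale listener still has to be looked up before anything is removed.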
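With this fixed, the three problems on the RAC nodes were finally resolved, at least for now.
For all three problems, once the cause was understood, the fix was much easier. To be honest, these problems are all fairly unusual. A local test environment can go wrong in countless ways, and that is exactly why it is worth making good use of.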