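A database I finished installing yesterday failed at startup with ORA-01102. The installation itself had gone smoothly, with no error reported anywhere,
and this was the first time I had run into this problem. I started by checking the alert log and looked up the relevant error messages on Metalink.
My analysis pointed to contention between processes over a shared resource. Below is how I worked through the problem; at the end I have attached the
original Metalink note and a friend's understanding of the problem and his fix.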
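Why does the error below occur? Background processes left over from an earlier startup of the instance are still running and still hold
the exclusive lock on the lk<SID> file, so the new startup cannot take that lock and ALTER DATABASE MOUNT fails.
The fix is to stop those leftover processes so that the lock is released, then start the instance again.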
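sculkget: failed to lock /orasoft/product/10.2.0/db_1/dbs/lkWWL exclusive  (the exclusive lock on the lock file could not be obtained)
sculkget: lock held by PID: 26312  (the lock is held by process 26312)
ORA-09968: Message 9968 not found; No message file for product=RDBMS, facility=ORA  (the message file was not found, so only the error number is shown; ORA-09968 is "unable to lock file")
Linux Error: 11: Resource temporarily unavailable  (the resource is in use and cannot be acquired right now)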
Additional information: 26312
Thu Nov 17 15:51:16 2011
ORA-1102 signalled during: ALTER DATABASE MOUNT...
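The PID in the "lock held by PID" line is the starting point for everything that follows. A minimal sketch for pulling it out of the alert log is shown here; the function name lock_holder_pid and the alert log path in the usage comment are assumptions for illustration, not something taken from the original session.

```shell
#!/bin/sh
# Print the PID from the last "sculkget: lock held by PID: N" line.
# Reads the file named in $1, or stdin when no file is given.
lock_holder_pid() {
    sed -n 's/^sculkget: lock held by PID: *\([0-9][0-9]*\).*/\1/p' "$@" | tail -n 1
}

# Usage (the path is an assumed 10g-style location; adjust it to your system):
#   pid=$(lock_holder_pid "$ORACLE_BASE/admin/wwl/bdump/alert_wwl.log")
#   ps -fp "$pid"
```

Taking only the last match means that an alert log containing several failed startups yields the most recent lock holder.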
The steps I took to resolve the error were as follows:
1. The following command shows that the process holding the lock is ora_dbw0_wwl
[oracle@ora10g dbs]$ ps -ef|grep 26312
oracle 26312 1 0 15:43 ? 00:00:02 ora_dbw0_wwl
oracle 26663 26574 0 17:39 pts/1 00:00:00 grep 26312
2. Log in to the database and shut down the instance first
[oracle@ora10g ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 17 17:45:56 2011
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
SQL> shutdown immediate
ORA-01507: database not mounted
ORACLE instance shut down.
SQL> exit
Disconnected from Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
Go to $ORACLE_HOME/dbs. There is a file named lkWWL, which is the lock file named in the error message
[oracle@ora10g ~]$ cd $ORACLE_HOME/dbs
[oracle@ora10g dbs]$ ls
hc_wwl.dat initdw.ora init.ora lkWWL orapwwwl spfilewwl.ora
[oracle@ora10g dbs]$ su - root
Password:
Running fuser -u lkWWL confirmed it: the processes holding the file had not been released
[root@ora10g ~]# cd /orasoft/product/10.2.0/db_1/dbs
[root@ora10g dbs]# fuser -u lkWWL
lkWWL: 26306 26308 26310 26312 26314 26316 26318 26320 26322 26324 26326 26334 26336 26340 26354 26356
[root@ora10g dbs]# fuser -k lkWWL
lkWWL: 26306 26308 26310 26312 26314 26316 26318 26320 26322 26324 26326 26334 26336 26340 26354 26356
[root@ora10g dbs]# fuser -u lkWWL
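Because fuser -k sends SIGKILL to every listed process at once, it can help to look at each holder before killing anything. The sketch below assumes fuser output in the form shown above ("file: pid pid(user) ..."); the helper name fuser_pids is hypothetical.

```shell
#!/bin/sh
# Turn one line of fuser output, e.g. "lkWWL: 26306 26308(oracle)",
# into a list of PIDs, one per line. Reads from stdin.
fuser_pids() {
    sed -e 's/^[^:]*://' -e 's/([^)]*)//g' |
        tr -s ' \t' '\n\n' |
        sed -n 's/^\([0-9][0-9]*\)[a-z]*$/\1/p'
}

# Usage: show what each holder is before deciding whether to kill it.
#   fuser -u lkWWL 2>&1 | fuser_pids | while read -r pid; do ps -fp "$pid"; done
```

If every holder turns out to be an ora_*_wwl background process of the instance that failed to start, killing them as in the session above is the expected outcome.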
Restart the database. This time there is no error and it starts normally.
[root@ora10g dbs]# su - oracle
[oracle@ora10g ~]$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Thu Nov 17 17:47:50 2011
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 285212672 bytes
Fixed Size 1218992 bytes
Variable Size 92276304 bytes
Database Buffers 188743680 bytes
Redo Buffers 2973696 bytes
Database mounted.
Database opened.
SQL> col host_name format a20
SQL> select host_name,instance_name,status from v$instance;
HOST_NAME INSTANCE_NAME STATUS
-------------------- ---------------- ------------
ora10g.localdomain wwl OPEN
SQL>
The original Metalink note follows:
analysis:
Problem Description:
====================
You are trying to startup the database and you receive the following error:
ORA-01102: cannot mount database in EXCLUSIVE mode
Cause: Some other instance has the database mounted exclusive
or shared.
Action: Shutdown other instance or mount in a compatible mode.
Problem Explanation:
====================
A database is started in EXCLUSIVE mode by default. Therefore, the
ORA-01102 error is misleading and may have occurred due to one of the
following reasons:
- there is still an "sgadef<sid>.dbf" file in the "ORACLE_HOME/dbs"
directory
- the processes for Oracle (pmon, smon, lgwr and dbwr) still exist
- shared memory segments and semaphores still exist even though the
database has been shutdown
- there is a "ORACLE_HOME/dbs/lk<sid>" file
Search Words:
=============
ORA-1102, crash, immediate, abort, fail, fails, migration
Solution Description:
=====================
Verify that the database was shutdown cleanly by doing the following:
1. Verify that there is not a "sgadef<sid>.dbf" file in the directory
"ORACLE_HOME/dbs".
% ls $ORACLE_HOME/dbs/sgadef<sid>.dbf
If this file does exist, remove it.
% rm $ORACLE_HOME/dbs/sgadef<sid>.dbf
2. Verify that there are no background processes owned by "oracle"
% ps -ef | grep ora_ | grep $ORACLE_SID
If background processes exist, remove them by using the Unix
command "kill". For example:
% kill -9 <Process_ID_Number>
3. Verify that no shared memory segments and semaphores that are owned
by "oracle" still exist
% ipcs -b
If there are shared memory segments and semaphores owned by "oracle",
remove the shared memory segments
% ipcrm -m <Shared_Memory_ID_Number>
and remove the semaphores
% ipcrm -s <Semaphore_ID_Number>
NOTE: The example shown above assumes that you only have one
database on this machine. If you have more than one
database, you will need to shutdown all other databases
before proceeding with Step 4.
4. Verify that the "$ORACLE_HOME/dbs/lk<sid>" file does not exist
5. Startup the instance
Solution Explanation:
=====================
The "lk<sid>" and "sgadef<sid>.dbf" files are used for locking shared memory. It seems that even though no memory is allocated, Oracle thinks memory is still locked. By removing the "sgadef" and "lk" files you remove any knowledge oracle has of shared memory
that is in use. Now the database can start.
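My friend's understanding of the problem and his fix are as follows:
ORA-01102 can have the following causes:
I. In an HA system, another node has already started the instance and is holding the resources shared by the two machines (such as raw devices on the disk array);
II. Oracle was shut down abnormally and some resources were not released. The usual possibilities are:
1. Oracle's shared memory segments or semaphores were not released;
2. Oracle background processes (such as SMON, PMON, DBWn) were not stopped;
3. The files used to lock shared memory, lk<sid> and sgadef<sid>.dbf, were not removed.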
solution:
method1:
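First, although ours is an HA system, the instance on the standby node is always kept shut down; checking the database status on the standby node confirmed this.
Second, the database went down because of a power failure, and the system was rebooted once power was restored, so points 1 and 2 of cause II can be ruled out. That leaves point 3 as the most likely suspect.
Check the $ORACLE_HOME/dbs directory: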
$ cd $ORACLE_HOME/dbs
$ ls sgadef*
sgadef* not found
$ ls lk*
lkORA92
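Sure enough, the lk<sid> file had not been removed. Delete it:
$ rm lk*
Start the database again: success.
If you suspect that shared memory was not released, check with the following command: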
$ipcs -mop
IPC status from /dev/kmem as of Thu Jul 6 14:41:43 2006
T ID KEY MODE OWNER GROUP NATTCH CPID LPID
Shared Memory:
m 0 0x411c29d6 --rw-rw-rw- root root 0 899 899
m 1 0x4e0c0002 --rw-rw-rw- root root 2 899 901
m 2 0x4120007a --rw-rw-rw- root root 2 899 901
m 458755 0x0c6629c9 --rw-r----- root sys 2 9113 17065
m 4 0x06347849 --rw-rw-rw- root root 1 1661 9150
m 65541 0xffffffff --rw-r--r-- root root 0 1659 1659
m 524294 0x5e100011 --rw------- root root 1 1811 1811
m 851975 0x5fe48aa4 --rw-r----- oracle oinstall 66 2017 25076
Then remove the shared memory segment by its ID:
$ipcrm -m 851975
For semaphores, check with the following command:
$ ipcs -sop
IPC status from /dev/kmem as of Thu Jul 6 14:44:16 2006
T ID KEY MODE OWNER GROUP
Semaphores:
s 0 0x4f1c0139 --ra------- root root
... ...
s 14 0x6c200ad8 --ra-ra-ra- root root
s 15 0x6d200ad8 --ra-ra-ra- root root
s 16 0x6f200ad8 --ra-ra-ra- root root
s 17 0xffffffff --ra-r--r-- root root
s 18 0x410c05c7 --ra-ra-ra- root root
s 19 0x00446f6e --ra-r--r-- root root
s 20 0x00446f6d --ra-r--r-- root root
s 21 0x00000001 --ra-ra-ra- root root
s 45078 0x67e72b58 --ra-r----- oracle oinstall
Remove the semaphore by its ID with the following command:
$ipcrm -s 45078
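Picking the right ID out of a long ipcs listing by eye is error-prone. A small filter can do it, sketched here under the assumption that ipcs prints the layout shown above (type, ID, key, mode, owner, group). Linux ipcs uses different columns, so the field numbers would need adjusting there. The function name ipc_ids_owned_by is hypothetical.

```shell
#!/bin/sh
# Print the IDs of IPC objects of type $1 ("m" = shared memory, "s" = semaphore)
# owned by user $2 (default: oracle). Expects ipcs output on stdin in the
# "T ID KEY MODE OWNER GROUP ..." layout shown above.
ipc_ids_owned_by() {
    awk -v t="$1" -v owner="${2:-oracle}" '$1 == t && $5 == owner { print $2 }'
}

# Usage: list candidates first, and only then pass them to ipcrm by hand.
#   ipcs -mop | ipc_ids_owned_by m
#   ipcs -sop | ipc_ids_owned_by s
```

The filter only prints IDs. Removal is left as a separate manual step so that a segment still attached to a running instance is not deleted by accident.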
If Oracle processes are still running, list them with the following command:
$ ps -ef|grep ora
oracle 29976 1 0 Jun 22 ? 0:52 ora_dbw0_ora92
oracle 29978 1 0 Jun 22 ? 0:51 ora_dbw1_ora92
oracle 5128 1 0 Jul 5 ? 0:00 oracleora92 (LOCAL=NO)
... ...
Then kill the processes with kill -9:
$kill -9 <ID>
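On a host that runs more than one instance, grep ora matches every instance as well as the grep command itself. A narrower filter is sketched below, assuming background processes are named ora_<name>_<sid> as in the listings above; the function name instance_background_procs is hypothetical.

```shell
#!/bin/sh
# Keep only the `ps -ef` lines whose command is an Oracle background process
# (ora_<name>_<sid>) belonging to the instance named in $1. Reads stdin.
instance_background_procs() {
    awk -v sid="$1" '$NF ~ ("^ora_[a-z0-9]+_" sid "$")'
}

# Usage: print the PIDs of one instance's background processes.
#   ps -ef | instance_background_procs ora92 | awk '{ print $2 }'
```

Server processes such as "oracleora92 (LOCAL=NO)" are deliberately not matched, so they have to be checked separately.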
method 2
[root@qa-oracle dbs]# fuser -u lkNDMSQA
lkNDMSQA: 6666(oracle) 6668(oracle) 6670(oracle) 6672(oracle) 6674(oracle) 6676(oracle) 6678(oracle) 6680(oracle) 6690(oracle) 6692(oracle) 6694(oracle) 6696(oracle) 6737(oracle) 6830(oracle)
Sure enough, the file had not been released. Kill the holders with fuser:
[root@qa-oracle dbs]# fuser -k lkNDMSQA
lkNDMSQA: 6666 6668 6670 6672 6674 6676 6678 6680 6690 6692 6694 6696 6737 6830
[root@qa-oracle dbs]# fuser -u lkNDMSQA
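Summary:
When ORA-01102 occurs, check and troubleshoot in this order:
1. If it is an HA system, check whether another node has already started the instance;
2. Check whether Oracle processes still exist and, if so, kill them;
3. Check whether semaphores still exist and, if so, remove them;
4. Check whether shared memory segments still exist and, if so, remove them;
5. Check whether the memory lock files lk<sid> and sgadef<sid>.dbf exist and, if so, delete them.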
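The single-host part of this checklist can be gathered into one read-only pass, sketched below. It assumes, as in the examples in this post (wwl gives lkWWL, ora92 gives lkORA92), that the lock file is named "lk" plus the upper-cased SID. The function names lk_file_path and check_ora1102 are hypothetical, and nothing in the sketch kills or removes anything.

```shell
#!/bin/sh
# Build the lock-file path for SID $1 under Oracle home $2 (default: $ORACLE_HOME).
# Assumption: the file is named "lk" followed by the upper-cased SID.
lk_file_path() {
    sid_upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    printf '%s/dbs/lk%s\n' "${2:-$ORACLE_HOME}" "$sid_upper"
}

# Read-only pass over the checklist for SID $1; it only reports what it finds.
check_ora1102() {
    sid=$1
    echo "== background processes =="
    ps -ef | grep "ora_.*_${sid}\$" | grep -v grep
    echo "== shared memory and semaphores =="
    ipcs -m
    ipcs -s
    echo "== lock file and its holders =="
    ls -l "$(lk_file_path "$sid")" && fuser -u "$(lk_file_path "$sid")"
}

# Usage:
#   check_ora1102 wwl
```

The HA check in the first item still has to be done on the other node by hand.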
ORA-09968: unable to lock file lk$ORACLE_SID (2010-03-04 14:53)
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
starting up 1 shared server(s) ...
Thu Mar 4 11:48:07 2010
ALTER DATABASE MOUNT
Thu Mar 4 11:48:07 2010
sculkget: failed to lock /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS exclusive
sculkget: lock held by PID: 3443
Thu Mar 4 11:48:07 2010
ORA-09968: unable to lock file
Linux Error: 11: Resource temporarily unavailable
Additional information: 3443
Thu Mar 4 11:48:07 2010
ORA-1102 signalled during: ALTER DATABASE MOUNT...
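The log shows that process 3443 is holding the resource. The log of the previous startup shows that this process is the Oracle background process
DBWR (DBW0). According to Note 236794.1, the process may have hung, which prevents the database from running normally.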
fuser -u /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS
PMON started with pid=2, OS id=3437
MMAN started with pid=4, OS id=3441
PSP0 started with pid=3, OS id=3439
DBW0 started with pid=5, OS id=3443
LGWR started with pid=6, OS id=3445
CKPT started with pid=7, OS id=3447
SMON started with pid=8, OS id=3449
RECO started with pid=9, OS id=3451
CJQ0 started with pid=10, OS id=3453
MMON started with pid=11, OS id=3455
Tue Feb 16 11:08:17 2010
starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'...
MMNL started with pid=12, OS id=3457
Tue Feb 16 11:08:17 2010
starting up 1 shared server(s) ...
Tue Feb 16 11:08:18 2010
ALTER DATABASE MOUNT
Tue Feb 16 11:08:22 2010
Setting recovery target incarnation to 2
Tue Feb 16 11:08:22 2010
Successful mount of redo thread 1, with mount id 1844152034
Tue Feb 16 11:08:22 2010
Database mounted in Exclusive Mode
Completed: ALTER DATABASE MOUNT
Tue Feb 16 11:08:22 2010
ALTER DATABASE OPEN
Use lsof to find the processes holding the lock
# lsof |grep lkFDS
oracle 4476 oracle 17uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4478 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4480 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4482 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4484 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4486 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4488 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4490 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4492 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4494 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4496 oracle 15uR REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4513 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4531 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4534 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
oracle 4812 oracle 15u REG 8,7 24 2911344 /var/oracle/product/10.2.0/db_1/dbs/lkFDS
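In the listing above some FD entries read "15uR" and others only "15u". Assuming lsof's usual FD notation, where a letter after the access mode marks a lock (R being a read lock on the whole file), the processes that actually hold a lock can be separated from those that merely have the file open. The function name lsof_lock_holders is hypothetical.

```shell
#!/bin/sh
# From lsof output on stdin, print the PIDs whose FD column carries a lock
# character after the access mode (for example "15uR"). Entries such as
# "15u" have the file open but show no lock.
lsof_lock_holders() {
    awk '$4 ~ /^[0-9]+[rwu-][A-Za-z]$/ { print $2 }'
}

# Usage:
#   lsof | grep lkFDS | lsof_lock_holders
```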
Use fuser to find the processes holding the lock
# fuser -u /u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS
/u01/app/oracle/product/10.2.0/db_1/dbs/lkFDS: 4476(oracle) 4478(oracle) 4480(oracle) 4482(oracle) 4484(oracle) 4486(oracle) 4488(oracle) 4490(oracle) 4492(oracle) 4494(oracle) 4496(oracle) 4513(oracle) 4531(oracle) 4534(oracle) 4812(oracle)
[root@CHN-DG-3-5CE ~]#
What does fuser do, and how exactly is it used?
fuser Command
Purpose
Identifies processes using a file or file structure.
Syntax
fuser [ -c | -d | -f ] [ -k ] [ -u ] [ -x ] [ -V ] File ...
Description
The fuser command lists the process numbers of local processes that use the
local or remote files specified by the File parameter. For block special
devices, the command lists the processes that use any file on that device.
c Uses the file as the current directory.
e Uses the file as a program's executable object.
r Uses the file as the root directory.
s Uses the file as a shared library (or other loadable object).
The process numbers are written to standard output in a line with spaces between
process numbers. A new line character is written to standard error after the
last output for each file operand. All other output is written to standard
error.
The fuser command will not detect processes that have mmap regions where that
associated file descriptor has since been closed.
Flags
-c Reports on any open files in the file system containing File.
-d Implies the use of the -c and -x flags. Reports on any open files which have
been unlinked from the file system (deleted from the parent directory). When
of the deleted file.
-f Reports on open instances of File only.
-k Sends the SIGKILL signal to each local process. Only the root user can kill a
process of another user.
-u Provides the login name for local processes in parentheses after the process
number.
-V Provides verbose output.
-x Used in conjunction with -c or -f, reports on executable and loadable objects
in addition to the standard fuser output.
Examples
1. To list the process numbers of local processes using the /etc/passwd file,
enter:
fuser /etc/passwd
2. To list the process numbers and user login names of processes using the
/etc/filesystems file, enter:
fuser -u /etc/filesystems
3. To terminate all of the processes using a given file system, enter:
fuser -k -x -u /dev/hd1 -OR-
fuser -kxuc /home
Either command lists the process number and user name, and then terminates
each process that is using the /dev/hd1 (/home) file system. Only the root
user can terminate processes that belong to another user. You might want to
use this command