一般情况下,ORACLE DBA看到如下情况的第一反应是,数据库实例没有启动或者是数据库环境变量没有设置正确,今天遇到的情况均不是以上两种情况,有点特别,且来看看为哪般。
oracle@POC-SV12-I2KDB:~> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:31:58 2015
Copyright (c) 1982, 2008, Oracle. All rights reserved.
Connected to an idle instance.
SQL> quit
Disconnected.
检查数据库实例是否启动
oracle@POC-SV12-I2KDB:~> ps -ef|grep ora_
oracle 8799 1 0 2011 ? 02:27:30 ora_pmon_inomc
oracle 8801 1 0 2011 ? 00:00:13 ora_vktm_inomc
oracle 8805 1 0 2011 ? 00:00:26 ora_diag_inomc
oracle 8807 1 0 2011 ? 00:00:53 ora_dbrm_inomc
oracle 8809 1 0 2011 ? 00:07:28 ora_psp0_inomc
oracle 8811 1 0 2011 ? 1-22:14:21 ora_dia0_inomc
oracle 8813 1 0 2011 ? 00:00:22 ora_mman_inomc
oracle 8815 1 0 2011 ? 01:09:25 ora_dbw0_inomc
oracle 8817 1 0 2011 ? 00:39:51 ora_lgwr_inomc
oracle 8819 1 0 2011 ? 04:23:53 ora_ckpt_inomc
oracle 8821 1 0 2011 ? 01:35:38 ora_smon_inomc
oracle 8823 1 0 2011 ? 00:00:05 ora_reco_inomc
oracle 8825 1 0 2011 ? 00:16:42 ora_mmon_inomc
oracle 8827 1 0 2011 ? 02:24:41 ora_mmnl_inomc
oracle 8829 1 0 2011 ? 00:00:07 ora_d000_inomc
oracle 8831 1 0 2011 ? 00:00:07 ora_s000_inomc
oracle 8875 1 0 2011 ? 00:03:26 ora_arc0_inomc
oracle 8877 1 0 2011 ? 00:03:27 ora_arc1_inomc
oracle 8879 1 0 2011 ? 00:03:28 ora_arc2_inomc
oracle 8881 1 0 2011 ? 00:00:16 ora_arc3_inomc
oracle 8883 1 0 2011 ? 00:00:16 ora_fbda_inomc
oracle 8885 1 0 2011 ? 00:00:10 ora_qmnc_inomc
oracle 8901 1 0 2011 ? 00:00:05 ora_q000_inomc
oracle 8903 1 0 2011 ? 00:08:33 ora_q001_inomc
oracle 10895 1 0 2011 ? 01:02:28 ora_cjq0_inomc
oracle 10965 1 0 2011 ? 00:00:26 ora_smco_inomc
oracle 23221 22576 0 11:31 pts/2 00:00:00 grep ora_
检查数据库监听状态是否正常
oracle@POC-SV12-I2KDB:~> lsnrctl status
LSNRCTL for Linux: Version 11.1.0.7.0 - Production on 13-AUG-2015 11:31:34
Copyright (c) 1991, 2008, Oracle. All rights reserved.
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 11.1.0.7.0 - Production
Start Date 17-SEP-2011 14:59:03
Uptime 431 days 15 hr. 36 min. 44 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /opt/oracle/app/product/11g/db/network/admin/listener.ora
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=POC-SV12-I2KDB)(PORT=1521)))
Services Summary...
Service "inomc" has 1 instance(s).
Instance "inomc", status READY, has 1 handler(s) for this service...
Service "inomcXDB" has 1 instance(s).
Instance "inomc", status READY, has 1 handler(s) for this service...
Service "inomc_XPT" has 1 instance(s).
Instance "inomc", status READY, has 1 handler(s) for this service...
The command completed successfully
看吧,数据库实例状态、监听状态都是正常的,但是ORACLE数据库用户SYS登录就是不能一如往常登录进去。
再排除一下环境变量设置不当(如果同一数据库安装多个实例,ORACLE环境变量又ORACLE_SID配置不当是有可能出现SYS登录空闲实例的)的问题
oracle@POC-SV12-I2KDB:/opt/oracle/app/product/11g/db> pwd
/opt/oracle/app/product/11g/db
oracle@POC-SV12-I2KDB:/opt/oracle/app/product/11g/db> ps -ef|grep smon
oracle 8821 1 0 2011 ? 01:35:38 ora_smon_inomc
oracle 32267 22576 0 11:36 pts/2 00:00:00 grep smon
oracle@POC-SV12-I2KDB:~> echo $ORACLE_HOME
/opt/oracle/app/product/11g/db
oracle@POC-SV12-I2KDB:~> echo $ORACLE_SID
inomc
数据库环境变量也都是正常配置了的,那我接下来尝试使用业务用户在本地登录看数据库报什么错误:
oracle@POC-SV12-I2KDB:/opt/oracle/app/product/11g/db/network/admin> sqlplus imap/imap
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:42:21 2015
Copyright (c) 1982, 2008, Oracle. All rights reserved.
ERROR:
ORA-01034: ORACLE not available
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 1
Additional information: 819200
Additional information: 8
Process ID: 0
Session ID: 0 Serial number: 0
这回很明显了,数据库报内存溢出了,接下来再看下当前数据库连接到数据库实例的进程,由于当前数据库SYS无法登录,只能ps -ef|grep LOCAL了,结果发现有300个之多。
oracle@POC-SV12-I2KDB:~> ps -ef|grep LOCAL
oracle 28688 1 0 2011 ? 00:03:05 oracleinomc (LOCAL=NO)
oracle 28690 1 0 2011 ? 00:00:00 oracleinomc (LOCAL=NO)
oracle 28737 1 0 2011 ? 00:00:00 oracleinomc (LOCAL=NO)
oracle 28762 1 0 2011 ? 00:00:00 oracleinomc (LOCAL=NO)
oracle 28782 1 0 2011 ? 00:00:16 oracleinomc (LOCAL=NO)
oracle 28785 1 0 2011 ? 00:00:16 oracleinomc (LOCAL=NO)
oracle 28791 1 0 2011 ? 00:00:16 oracleinomc (LOCAL=NO)
oracle 28795 1 0 2011 ? 00:00:16 oracleinomc (LOCAL=NO)
oracle 28798 1 0 2011 ? 00:00:16 oracleinomc (LOCAL=NO)
oracle 28815 1 0 2011 ? 00:00:00 oracleinomc (LOCAL=NO)
oracle 29217 1 0 2011 ? 00:11:59 oracleinomc (LOCAL=NO)
oracle 29226 1 0 2011 ? 00:09:16 oracleinomc (LOCAL=NO)
.
.
.
再看下数据库服务器的内存状态发现,16G的内存确实已经用完了,SWAP分区也已经使用殆尽了。
oracle@POC-SV12-I2KDB:~> free -t
total used free shared buffers cached
Mem: 8096276 8039268 57008 0 195744 1732592
-/+ buffers/cache: 6110932 1985344
Swap: 16779852 16779844 8
Total: 24876128 24819112 57016
与现场业务、系统管理员方沟通后决定杀掉LOCAL=NO的进程释放内存。
oracle@POC-SV12-I2KDB:~> kill -9 28688
oracle@POC-SV12-I2KDB:~> kill -9 28690
oracle@POC-SV12-I2KDB:~> kill -9 28737
oracle@POC-SV12-I2KDB:~> kill -9 28762
oracle@POC-SV12-I2KDB:~> kill -9 28782
oracle@POC-SV12-I2KDB:~> kill -9 28785
oracle@POC-SV12-I2KDB:~> kill -9 28791
oracle@POC-SV12-I2KDB:~> kill -9 28795
oracle@POC-SV12-I2KDB:~> kill -9 28798
oracle@POC-SV12-I2KDB:~> kill -9 28815
oracle@POC-SV12-I2KDB:~> free
total used free shared buffers cached
Mem: 8096276 8044760 51516 0 195652 1730480
-/+ buffers/cache: 6118628 1977648
Swap: 16779852 16754328 25524
oracle@POC-SV12-I2KDB:~> free
total used free shared buffers cached
Mem: 8096276 8033788 62488 0 194856 1686568
-/+ buffers/cache: 6152364 1943912
Swap: 16779852 16753524 26328
oracle@POC-SV12-I2KDB:~> free
total used free shared buffers cached
Mem: 8096276 8028440 67836 0 194856 1686568
-/+ buffers/cache: 6147016 1949260
Swap: 16779852 16753524 26328
oracle@POC-SV12-I2KDB:~> free
total used free shared buffers cached
Mem: 8096276 8006900 89376 0 194856 1686568
-/+ buffers/cache: 6125476 1970800
Swap: 16779852 16753524 26328
释放完部分内存,再尝试oracle管理账户SYS本地登录,发现SYS用户能够正常登录了:
oracle@POC-SV12-I2KDB:~> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:48:29 2015
Copyright (c) 1982, 2008, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.1.0.7.0 - 64bit Production
With the Partitioning, Oracle Label Security, OLAP, Data Mining,
Oracle Database Vault and Real Application Testing options
SQL>
后续分析,真的很奇怪,以前遇见过会话数超、进程数超、内存溢出SYS无法登录的,但是现状也只是到如下就卡主(这是比较明显的,要么是像业务用户本地登录一样直接报内存溢出)
oracle@POC-SV12-I2KDB:~> sqlplus / as sysdba
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Aug 13 11:48:29 2015
Copyright (c) 1982, 2008, Oracle. All rights reserved.
Connected to:
由于查杀进程前无法登录数据库,无法获知告警日志的确切位置,杀过进程后,查看告警日志,有大量如下报错:
Process J000 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /opt/oracle/app/diag/rdbms/inomc/inomc/trace/inomc_cjq0_10895.trc:
.
.
.
Thu Aug 13 11:45:52 2015
Process W000 died, see its trace file
Process W000 died, see its trace file
Process W000 died, see its trace file
查看kkjcre1p: unable to spawn jobq slave process 的官方解释,说可能是数据库进程数超了,但是数据库告警日志没有报,并且查看数据库进程数设置有1000个之多,会话数设置有1101个之多,根据现有信息分析,不是数据库进程数超或会话数超引起的。
总结:由于数据库服务器内存小,而数据库会话数、进程数设置过大,当数据库进程数急剧增多时,数据库服务器内存耗尽,导致数据库管理账户SYS无法登陆;比较奇怪的是11G的SYS无法登陆不报服务器内存溢出却连接空闲实例,而使用业务用户登录时却报了明显的内存溢出。