1、事件的原因排查
应用连接数据库失败,先连接数据库服务器,启动数据库服务恢复应用业务,然后排查数据库crush原因:
1)查看messages日志是否有与Oracle用户相关的出错信息
# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)
# cat /var/log/messages |grep failed
2)查看用户邮件日志是否有与Oracle用户相关的出错信息
# tail -n 200 /var/mail/oracle
3)检查跟踪日志文件
SQL> select value from v$diag_info where name =‘Diag Trace‘;
VALUE
--------------------------------
/app/oracle/diag/rdbms/crm/crm/trace
$ cat /app/oracle/diag/rdbms/crm/crm/trace/alert_crm.log |grep ORA-
$ cat /app/oracle/diag/rdbms/crm/crm/trace/alert_crm.log |grep err
$ cat /app/oracle/diag/rdbms/crm/crm/trace/alert_crm.log |grep fail
SQL> select name,value from v$diag_info;
NAME VALUE
--------------------------------
Diag Enabled TRUE
ADR Base /app/oracle/
ADR Home /app/oracle/diag/rdbms/crm/crm
Diag Trace /app/oracle/diag/rdbms/crm/crm/trace
Diag Alert /app/oracle/diag/rdbms/crm/crm/alert
Diag Incident /app/oracle/diag/rdbms/crm/crm/incident
Diag Cdump /app/oracle/diag/rdbms/crm/crm/cdump
Health Monitor /app/oracle/diag/rdbms/crm/crm/hm
Default Trace File /app/oracle/diag/rdbms/crm/crm/trace/crm_ora_10974.trc
Active Problem Count 1
Active Incident Count 1
查看log.xml搜索问题出现时间点
$ less /app/oracle/diag/rdbms/crm/crm/alert/log.xml
Errors in file /app/oracle/diag/rdbms/crm/crm/trace/crm_dbw2_80900.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
Errors in file /app/oracle/diag/rdbms/crm/crm/trace/crm_dbw3_80902.trc:
ORA-27157: OS post/wait facility removed
ORA-27300: OS system dependent operation:semop failed with status: 43
ORA-27301: OS failure message: Identifier removed
ORA-27302: failure occurred at: sskgpwwait1
2、事件的解决处理
This could be due to some outside user or application removing the semaphores/shared memory.
To monitor the semaphore/shared memory state we can use the following methods:
Setup a cronjob to run every 5-10min and dump the output of ‘ipcs‘ and ‘ps - ef‘ to a file with a timestamp.
Rotate your logs every 4-7 days to build a history.
Then if the problem re-occurs, we can at least try to make sure ‘ipcrm‘ wasn‘t the culprit and get some general information of the state of the IPC resources plus the processes running.
You can also consult with your sysadmin to check if there is any OS level auditing that can be turned on to audit the usage of commands like ‘ipcrm‘ which can remove shared memory segments /semaphore sets.
Note: This issue can happen on different platform, but in case you encounter the issue in RHEL7.2, then please also check below RHEL7.2 specific information.
In RHEL7.2 operating system setting RemoveIPC=YES crashes the database.The default value for RemoveIPC in RHEL7.2 is YES.
Workaround :
1) Set RemoveIPC=no in /etc/systemd/logind.conf if it is not in that file
2) Reboot the server or restart systemd-logind as follows:
# systemctl daemon-reload
# systemctl restart systemd-logind
Centos7.2有个特性,默认该配置是为yes
# vim /etc/systemd/logind.conf
RemoveIPC=no
# systemctl daemon-reload
# systemctl restart systemd-logind