Here the * actually means that the setting is used by all instances in the RAC.

RAC Diagnostics Script
2007-02-15 00:00:00

from metalink:

This script is broken up into different SQL statements that can be used individually. Each SQL statement adds information to help in debugging a RAC hang or severe performance scenario.

Script
------

- - - - - - - - - - - - - - - - Script begins here - - - - - - - - - - - - - - - -

-- NAME:  RACDIAG.SQL
--    SYS OR INTERNAL USER, CATPARR.SQL ALREADY RUN, PARALLEL QUERY OPTION ON
-- ------------------------------------------------------------------------
-- AUTHOR:
--    Michael Polaski - Oracle Support Services - DataServer Group
--    Copyright 2002, Oracle Corporation
-- ------------------------------------------------------------------------
-- PURPOSE:
-- This script is intended to provide a user friendly guide to troubleshoot RAC
-- hung sessions or slow performance scenarios.  The script includes information
-- to gather a variety of important debug information to determine the cause of a
-- RAC hang.  The script will create a file called racdiag_<timestamp>.out
-- in your local directory while dumping hang analyze dumps in the user_dump_dest(s)
-- and background_dump_dest(s) on all nodes.
--
-- ------------------------------------------------------------------------
-- DISCLAIMER:
--    This script is provided for educational purposes only. It is NOT
--    supported by Oracle World Wide Technical Support.
--    The script has been tested and appears to work as intended.
--    You should always run new scripts on a test instance initially.
-- ------------------------------------------------------------------------
-- Script output is as follows:

set echo off
set feedback off
column timecol new_value timestamp
column spool_extension new_value suffix
select to_char(sysdate,'Mondd_hhmi') timecol,
'.out' spool_extension from sys.dual;
column output new_value dbname
select value || '_' output
from v$parameter where name = 'db_name';
spool racdiag_&&dbname&&timestamp&&suffix
set lines 200
set pagesize 35
set trim on
set trims on
alter session set nls_date_format = 'MON-DD-YYYY HH24:MI:SS';
alter session set timed_statistics = true;
set feedback on
select to_char(sysdate) time from dual;

set numwidth 5
column host_name format a20 tru
select inst_id, instance_name, host_name, version, status, startup_time
from gv$instance
order by inst_id;

set echo on

-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see if the
-- file is being generated.
oradebug -g all dump systemstate 266

-- WAITING SESSIONS:
-- The entries that are shown at the top are the sessions that have
-- waited the longest amount of time that are waiting for non-idle wait
-- events (event column).  You can research and find out what the wait
-- event indicates (along with its parameters) by checking the Oracle
-- Server Reference Manual or look for any known issues or documentation
-- by searching Metalink for the event name in the search bar.  Example
-- (include single quotes): [ 'buffer busy due to global cache' ].
-- Metalink and/or the Server Reference Manual should return some useful
-- information on each type of wait event.  The inst_id column shows the
-- instance where the session resides and the SID is the unique identifier
-- for the session (gv$session).  The p1, p2, and p3 columns will show
-- event specific information that may be important to debug the problem.
-- To find out what the p1, p2, and p3 indicates see the next section.
-- Items with wait_time of anything other than 0 indicate we do not know
-- how long these sessions have been waiting.
--
set numwidth 10
column state format a7 tru
column event format a25 tru
column last_sql format a40 tru
select sw.inst_id, sw.sid, sw.state, sw.event, sw.seconds_in_wait seconds,
sw.p1, sw.p2, sw.p3, sa.sql_text last_sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and sw.seconds_in_wait > 0
and (sw.inst_id = s.inst_id and sw.sid = s.sid)
and (s.inst_id = sa.inst_id and s.sql_address = sa.address)
order by seconds desc;

-- EVENT PARAMETER LOOKUP:
-- This section will give a description of the parameter names of the
-- events seen in the last section.  p1text is the parameter name for
-- p1 in the WAITING SESSIONS section while p2text is the parameter
-- name for p2 and p3text is the parameter name for p3.  The
-- parameter values in the first section can be helpful for debugging
-- the wait event.
--
column event format a30 tru
column p1text format a25 tru
column p2text format a25 tru
column p3text format a25 tru
select distinct event, p1text, p2text, p3text
from gv$session_wait sw
where sw.event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and seconds_in_wait > 0
order by event;

-- GES LOCK BLOCKERS:
-- This section will show us any sessions that are holding locks that
-- are blocking other users.  The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session.  The grant_level will show us how the GES lock is granted to
-- the user.  The request_level will show us what status we are trying to obtain.
-- The lockstate column will show us what status the lock is in.  The last column
-- shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocker = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;

-- GES LOCK WAITERS:
-- This section will show us any sessions that are waiting for locks that
-- are blocked by other users.  The inst_id will show us the instance that
-- the session resides on while the sid will be a unique identifier for
-- the session.  The grant_level will show us how the GES lock is granted to
-- the user.  The request_level will show us what status we are trying to obtain.
-- The lockstate column will show us what status the lock is in.  The last column
-- shows how long this session has been waiting.
--
set numwidth 5
column state format a16 tru;
column event format a30 tru;
select dl.inst_id, s.sid, p.spid, dl.resource_name1,
decode(substr(dl.grant_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.grant_level) as grant_level,
decode(substr(dl.request_level,1,8),'KJUSERNL','Null','KJUSERCR','Row-S (SS)',
'KJUSERCW','Row-X (SX)','KJUSERPR','Share','KJUSERPW','S/Row-X (SSX)',
'KJUSEREX','Exclusive',dl.request_level) as request_level,
decode(substr(dl.state,1,8),'KJUSERGR','Granted','KJUSEROP','Opening',
'KJUSERCA','Canceling','KJUSERCV','Converting') as state,
s.sid, sw.event, sw.seconds_in_wait sec
from gv$ges_enqueue dl, gv$process p, gv$session s, gv$session_wait sw
where blocked = 1
and (dl.inst_id = p.inst_id and dl.pid = p.spid)
and (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by sw.seconds_in_wait desc;

-- LOCAL ENQUEUES:
-- This section will show us if there are any local enqueues.  The inst_id will
-- show us the instance that the session resides on while the sid will be a
-- unique identifier for the session.  The addr column will show the lock
-- address.  The type will show the lock type.  The id1 and id2 columns will
-- show specific parameters for the lock type.
--
set numwidth 12
column event format a12 tru
select l.inst_id, l.sid, l.addr, l.type, l.id1, l.id2,
decode(l.block,0,'blocked',1,'blocking',2,'global') block,
sw.event, sw.seconds_in_wait sec
from gv$lock l, gv$session_wait sw
where (l.sid = sw.sid and l.inst_id = sw.inst_id)
and l.block in (0,1)
order by l.type, l.inst_id, l.sid;

-- LATCH HOLDERS:
-- If there is latch contention or 'latch free' wait events in the WAITING
-- SESSIONS section we will need to find out which processes are holding
-- latches.  The inst_id will show us the instance that the session resides
-- on while the sid will be a unique identifier for the session.  The username
-- column will show the session's username.  The os_user column will show the os
-- user that the user logged in as.  The name column will show us the type
-- of latch being waited on.  You can search Metalink for the latch name in
-- the search bar.  Example (include single quotes):
-- [ 'library cache' latch ]. Metalink should return some useful information
-- on the type of latch.
--
set numwidth 5
select distinct lh.inst_id, s.sid, s.username, p.username os_user, lh.name
from gv$latchholder lh, gv$session s, gv$process p
where (lh.sid = s.sid and lh.inst_id = s.inst_id)
and (s.inst_id = p.inst_id and s.paddr = p.addr)
order by lh.inst_id, s.sid;

-- LATCH STATS:
-- This view will show us latches with less than optimal hit ratios.
-- The inst_id will show us the instance for the particular latch.  The
-- latch_name column will show us the type of latch.  You can search Metalink
-- for the latch name in the search bar.  Example (include single quotes):
-- [ 'library cache' latch ]. Metalink should return some useful information
-- on the type of latch.  The hit_ratio shows the percentage of time we
-- successfully acquired the latch.
--
column latch_name format a30 tru
select inst_id, name latch_name,
round((gets-misses)/decode(gets,0,1,gets),3) hit_ratio,
round(sleeps/decode(misses,0,1,misses),3) "SLEEPS/MISS"
from gv$latch
where round((gets-misses)/decode(gets,0,1,gets),3) < .99
and gets != 0
order by round((gets-misses)/decode(gets,0,1,gets),3);

-- No Wait Latches:
--
select inst_id, name latch_name,
round((immediate_gets/(immediate_gets+immediate_misses)), 3) hit_ratio,
round(sleeps/decode(immediate_misses,0,1,immediate_misses),3) "SLEEPS/MISS"
from gv$latch
where round((immediate_gets/(immediate_gets+immediate_misses)), 3) < .99
and immediate_gets + immediate_misses > 0
order by round((immediate_gets/(immediate_gets+immediate_misses)), 3);

-- GLOBAL CACHE CR PERFORMANCE
-- This shows the average latency of a consistent block request.
-- AVG CR BLOCK RECEIVE TIME should typically be about 15 milliseconds depending
-- on your system configuration and volume.  It is the average latency of a
-- consistent-read request round-trip from the requesting instance to the holding
-- instance and back to the requesting instance. If your CPU has limited idle time
-- and your system typically processes long-running queries, then the latency may
-- be higher. However, it is possible to have an average latency of less than one
-- millisecond with User-mode IPC. Latency can be influenced by a high value for
-- the DB_FILE_MULTIBLOCK_READ_COUNT parameter. This is because a requesting
-- process can issue more than one request for a block depending on the setting
-- of this parameter. Correspondingly, the requesting process may wait longer.
-- Also check interconnect bandwidth, OS tcp settings, and OS udp settings if
-- AVG CR BLOCK RECEIVE TIME is high.
--
set numwidth 20
column "AVG CR BLOCK RECEIVE TIME (ms)" format 9999999.9
select b1.inst_id, b2.value "GCS CR BLOCKS RECEIVED",
b1.value "GCS CR BLOCK RECEIVE TIME",
((b1.value / b2.value) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"
from gv$sysstat b1, gv$sysstat b2
where b1.name = 'global cache cr block receive time' and
b2.name = 'global cache cr blocks received' and b1.inst_id = b2.inst_id;

-- GLOBAL CACHE LOCK PERFORMANCE
-- This shows the average global enqueue get time.
-- Typically AVG GLOBAL LOCK GET TIME should be 20-30 milliseconds.  The elapsed
-- time for a get includes the allocation and initialization of a new global
-- enqueue. If the average global enqueue get (global cache get time) or average
-- global enqueue conversion times are excessive, then your system may be
-- experiencing timeouts.  See the 'WAITING SESSIONS', 'GES LOCK BLOCKERS',
-- 'GES LOCK WAITERS', and 'TOP 10 WAIT EVENTS ON SYSTEM' sections if the
-- AVG GLOBAL LOCK GET TIME is high.
--
set numwidth 20
column "AVG GLOBAL LOCK GET TIME (ms)" format 9999999.9
select b1.inst_id, (b1.value + b2.value) "GLOBAL LOCK GETS",
b3.value "GLOBAL LOCK GET TIME",
(b3.value / (b1.value + b2.value) * 10) "AVG GLOBAL LOCK GET TIME (ms)"
from gv$sysstat b1, gv$sysstat b2, gv$sysstat b3
where b1.name = 'global lock sync gets' and
b2.name = 'global lock async gets' and b3.name = 'global lock get time'
and b1.inst_id = b2.inst_id and b2.inst_id = b3.inst_id;

-- RESOURCE USAGE
-- This section will show how much of our resources we have used.
--
set numwidth 8
select inst_id, resource_name, current_utilization, max_utilization,
initial_allocation
from gv$resource_limit
where max_utilization > 0
order by inst_id, resource_name;

-- DLM TRAFFIC INFORMATION
-- This section shows how many tickets are available in the DLM.  If the
-- TCKT_WAIT column says "YES" then we have run out of DLM tickets which could
-- cause a DLM hang.  Make sure that you also have enough TCKT_AVAIL.
--
set numwidth 5
select * from gv$dlm_traffic_controller
order by TCKT_AVAIL;

-- DLM MISC
--
set numwidth 10
select * from gv$dlm_misc;

-- LOCK CONVERSION DETAIL:
-- This view shows the types of lock conversion being done on each instance.
--
select * from gv$lock_activity;

-- TOP 10 WRITE PINGING/FUSION OBJECTS
-- This view shows the top 10 objects for write pings across instances.
-- The inst_id column shows the node that the block was pinged on.  The name
-- column shows the object name of the offending object.  The file# shows the
-- offending file number (gc_files_to_locks).  The STATUS column will show the
-- current status of the pinged block.  The READ_PINGS will show us read converts
-- and the WRITE_PINGS will show us objects with write converts.  Any rows that
-- show up are objects that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_writes) desc)
where rownum < 11
order by WRITE_PINGS desc;

-- TOP 10 READ PINGING/FUSION OBJECTS
-- This view shows the top 10 objects for read pings.  The inst_id column shows
-- the node that the block was pinged on.  The name column shows the object name
-- of the offending object.  The file# shows the offending file number
-- (gc_files_to_locks).  The STATUS column will show the current status of the
-- pinged block.  The READ_PINGS will show us read converts and the WRITE_PINGS
-- will show us objects with write converts.  Any rows that show up are objects
-- that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_reads) desc)
where rownum < 11
order by READ_PINGS desc;

-- TOP 10 FALSE PINGING OBJECTS
-- This view shows the top 10 objects for false pings.  This can be avoided by
-- better gc_files_to_locks configuration.  The inst_id column shows the node
-- that the block was pinged on.  The name column shows the object name of the
-- offending object.  The file# shows the offending file number
-- (gc_files_to_locks).  The STATUS column will show the current status of the
-- pinged block.  The READ_PINGS will show us read converts and the WRITE_PINGS
-- will show us objects with write converts.  Any rows that show up are objects
-- that are concurrently accessed across more than 1 instance.
--
set numwidth 8
column name format a20 tru
column kind format a10 tru
select inst_id, name, kind, file#, status, BLOCKS,
READ_PINGS, WRITE_PINGS
from (select p.inst_id, p.name, p.kind, p.file#, p.status,
count(p.block#) BLOCKS, sum(p.forced_reads) READ_PINGS,
sum(p.forced_writes) WRITE_PINGS
from gv$false_ping p, gv$datafile df
where p.file# = df.file# (+)
group by p.inst_id, p.name, p.kind, p.file#, p.status
order by sum(p.forced_writes) desc)
where rownum < 11
order by WRITE_PINGS desc;

-- INITIALIZATION PARAMETERS:
-- Non-default init parameters for each node.
--
set numwidth 5
column name format a30 tru
column value format a50 wra
column description format a60 tru
select inst_id, name, value, description
from gv$parameter
where isdefault = 'FALSE'
order by inst_id, name;

-- TOP 10 WAIT EVENTS ON SYSTEM
-- This view will provide a summary of the top wait events in the db.
--
set numwidth 10
column event format a25 tru
select inst_id, event, time_waited, total_waits, total_timeouts
from (select inst_id, event, time_waited, total_waits, total_timeouts
from gv$system_event where event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
order by time_waited desc)
where rownum < 11
order by time_waited desc;

-- SESSION/PROCESS REFERENCE:
-- This section is very important for most of the above sections to find out
-- which user/os_user/process is identified to which session/process.
--
set numwidth 7
column event format a30 tru
column program format a25 tru
column username format a15 tru
select p.inst_id, s.sid, s.serial#, p.pid, p.spid, p.program, s.username,
p.username os_user, sw.event, sw.seconds_in_wait sec
from gv$process p, gv$session s, gv$session_wait sw
where (p.inst_id = s.inst_id and p.addr = s.paddr)
and (s.inst_id = sw.inst_id and s.sid = sw.sid)
order by p.inst_id, s.sid;

-- SYSTEM STATISTICS:
-- All System Stats with values of > 0.  These can be referenced in the
-- Server Reference Manual
--
set numwidth 5
column name format a60 tru
column value format 9999999999999999999999999
select inst_id, name, value
from gv$sysstat
where value > 0
order by inst_id, name;

-- CURRENT SQL FOR WAITING SESSIONS:
-- Current SQL for any session in the WAITING SESSIONS list
--
set numwidth 5
column sql format a80 wra
select sw.inst_id, sw.sid, sw.seconds_in_wait sec, sa.sql_text sql
from gv$session_wait sw, gv$session s, gv$sqlarea sa
where sw.sid = s.sid (+)
and sw.inst_id = s.inst_id (+)
and s.sql_address = sa.address
and sw.event not in
('rdbms ipc message','smon timer','pmon timer',
'SQL*Net message from client','lock manager wait for remote message',
'ges remote message', 'gcs remote message', 'gcs for action', 'client message',
'pipe get', 'null event', 'PX Idle Wait', 'single-task message',
'PX Deq: Execution Msg', 'KXFQ: kxfqdeq - normal deqeue',
'listen endpoint status','slave wait','wakeup time manager')
and seconds_in_wait > 0
order by sw.seconds_in_wait desc;

-- Taking Hang Analyze dumps
-- This may take a little while...
oradebug setmypid
oradebug unlimit
oradebug -g all hanganalyze 3
-- This part may take the longest, you can monitor bdump or udump to see if the
-- file is being generated.
oradebug -g all dump systemstate 266

set echo off
select to_char(sysdate) time from dual;
spool off

-- ---------------------------------------------------------------------------
Prompt;
Prompt racdiag output files have been written to:;
Prompt;
host pwd
Prompt alert log and trace files are located in:;
column host_name format a12 tru
column name format a20 tru
column value format a60 tru
select distinct i.host_name, p.name, p.value
from gv$instance i, gv$parameter p
where p.inst_id = i.inst_id (+)
and p.name like '%_dump_dest'
and p.name != 'core_dump_dest';

- - - - - - - - - - - - - - - -  Script ends here  - - - - - - - - - - - - - - - -
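
The two latency figures in the GLOBAL CACHE CR PERFORMANCE and GLOBAL CACHE LOCK PERFORMANCE sections are plain arithmetic: the script multiplies time/count by 10, which implies the time statistics are in centiseconds. A minimal Python sketch of the same calculation follows. The function names and the sample values are illustrative assumptions, not part of the original script, and unlike the SQL it returns None when the count is zero instead of failing on the division.

```python
from typing import Optional


def avg_ms(total_time_cs: float, count: float) -> Optional[float]:
    """Average time per operation in milliseconds.

    total_time_cs is a cumulative time statistic in centiseconds,
    so the quotient is multiplied by 10, as in the script.
    """
    if count == 0:
        return None  # nothing received yet; the SQL would raise a divide error
    return total_time_cs / count * 10


def avg_global_lock_get_ms(sync_gets: float, async_gets: float,
                           get_time_cs: float) -> Optional[float]:
    """Average global lock get time: get time over sync plus async gets."""
    return avg_ms(get_time_cs, sync_gets + async_gets)


# Hypothetical sample values, not taken from a real system.
cr_latency = avg_ms(4500, 3000)
lock_latency = avg_global_lock_get_ms(600, 400, 2500)
print(cr_latency, lock_latency)
```

Compared against the rough guidance in the script comments (about 15 ms for CR block receive time, 20-30 ms for global lock get time), values well above those figures would be the ones to investigate further.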
RMAN-06726
2007-02-13 00:00:00

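A case of an RMAN backup failing in a RAC environment.

The environment is a four-node RAC. The following error occurred while backing up archived logs: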

allocated channel: c1
channel ch00: sid=132 devtype=DISK
allocated channel: c2
channel ch01: sid=32 devtype=DISK
allocated channel: c3
channel ch02: sid=156 devtype=DISK
allocated channel: c4
channel ch02: sid=123 devtype=DISK
Starting backup at 03-FEB-07
released channel: c1
released channel: c2
released channel: c3
released channel: c4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 02/03/2007 01:17:34
RMAN-06726: could not locate archivelog XXXXXXXXXXXXXXXXX

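Looking for the cause: the backup script was checked and had no problems, and the archive destination was checked and had no problems either.

All the archived logs were present. Checking the parameters showed that on one instance cluster_database_instances was set to 3: a node had recently been added and the value had not been updated. After cluster_database_instances was changed to 4, archived log backups returned to normal.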
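
The check that found the problem can be expressed as a small consistency test: every instance should report a cluster_database_instances value equal to the number of instances in the cluster. The Python sketch below assumes the values have already been fetched, for example from gv$parameter, into a dictionary keyed by inst_id. The function name and the sample data are hypothetical; the source does not say which instance held the stale value.

```python
from typing import Dict, List


def mismatched_instances(param_by_instance: Dict[int, int]) -> List[int]:
    """Return the inst_ids whose cluster_database_instances value
    differs from the number of instances that reported a value."""
    expected = len(param_by_instance)
    return sorted(inst for inst, value in param_by_instance.items()
                  if value != expected)


# Hypothetical four-node cluster where instance 4 still has the old value.
values = {1: 4, 2: 4, 3: 4, 4: 3}
print(mismatched_instances(values))
```

This only compares against the instances that are currently up, so it is a sanity check to run after adding a node, not a substitute for reviewing the parameter file.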

rac ora-12545
2007-02-09 00:00:00

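HP-UX

Oracle 9.2.0.5 RAC

node1 had been down for a long time, and the RAC could only be re-formed after the VG information was synchronized.

Both nodes were then started up.

When the application was started, it could not connect: ORA-12545: Connect failed because target host or object does not exist. A retry, however, would succeed. Investigation showed that both instances had local_listener and remote_listener set. After these were cleared, the problem was resolved.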
Oracle RAC Wait Events
2007-01-19 00:00:00

RAC Differences
The main difference to keep in mind when monitoring a RAC database versus a single-instance
database, is the buffer cache and its operation. In a RAC environment the
buffer cache is global across all instances in the cluster and hence the processing
differs. When a process in a RAC database needs to modify or read data, Oracle will
first check to see if it already exists in the local buffer cache. If the data is not in the
local buffer cache the global buffer cache will be reviewed to see if another instance
already has it in their buffer cache. In this case the remote instance will send the data
to the local instance via the high-speed interconnect, thus avoiding a disk read.
Monitoring a RAC database often means monitoring this situation and the amount of
requests going back and forth over the RAC interconnect. The most common wait
events related to this are gc cr request and gc buffer busy.

gc cr request
This wait event, also known as global cache cr request prior to Oracle 10g, specifies the
time it takes to retrieve the data from the remote cache. High wait times for this wait
event often are because of:
1. RAC Traffic Using Slow Connection - typically RAC traffic should use a high-speed
interconnect to transfer data between instances, however, sometimes Oracle may not
pick the correct connection and instead route traffic over the slower public network.
This will significantly increase the amount of wait time for the gc cr request event. The
oradebug command can be used to verify which network is being used for RAC traffic:
SQL> oradebug setmypid
SQL> oradebug ipc
This will dump a trace file to the location specified by the user_dump_dest Oracle
parameter containing information about the network and protocols being used for the
RAC interconnect.

2. Inefficient Queries - poorly tuned queries will increase the amount of data blocks
requested by an Oracle session. The more blocks requested typically means the more
often a block will need to be read from a remote instance via the interconnect.

gc buffer busy
This wait event, also known as global cache buffer busy prior to Oracle 10g, specifies
the time the remote instance locally spends accessing the requested data block. This wait
event is very similar to the buffer busy waits wait event in a single-instance
database and is often the result of:
1. Hot Blocks - multiple sessions may be requesting a block that is either not in buffer
cache or is in an incompatible mode. Deleting some of the hot rows and re-inserting
them back into the table may alleviate the problem. Most of the time the rows will be
placed into a different block and reduce contention on the block. The DBA may also
need to adjust the pctfree and/or pctused parameters for the table to ensure the rows
are placed into a different block.

2. Inefficient Queries - as with the gc cr request wait event, the more blocks requested
from the buffer cache, the greater the likelihood of a session having to wait for other sessions.
Tuning queries to access fewer blocks will often result in less contention for the same
block.

Conclusion
Oracle RAC is somewhat of a unique case of an Oracle environment, but everything
learned about wait events in the single instance database also applies to clustered
databases. However, the special use of a global buffer cache in RAC makes it
imperative to monitor inter-instance communication via the cluster-specific wait events
such as the ones discussed above. Understanding these wait events will help in the
diagnosis of problems and pinpointing solutions in a RAC database.
Oracle vs. Sybase
2007-01-19 00:00:00

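A customer wanted to choose a platform for a data warehouse.

A benchmark plan was drawn up. The Sybase side was run by the vendor's own engineers; I was responsible for the Oracle side.

Oracle turned out to be much faster at loading data.

For 50 million rows, sqlldr finished in 5 minutes, whereas Sybase took nearly 20 minutes. On queries, Oracle was also well ahead of Sybase.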
Handling an ORA-12500
2007-01-18 00:00:00

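I got a call saying that clients could not connect to the database, and rushed to the site.

After part of the application was stopped, connections worked again; once the application was restarted, connections failed again. As soon as the connection count reached 283, no further logins were possible and ORA-12500 was reported.

The customer's system was Windows 2000 with 2 GB of memory.

Checking the SGA showed that it was a full 1.4 GB; with the PGA added, the total came to 1.5 GB.

In this situation, once the connection count gets high, resources run short and Oracle has no memory available, which limits new connections. After the SGA was reduced to 900 MB, the problem was resolved.

What I find odd is why XX's database (small as it is) should be running on such a low-end configuration. ......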
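
The reasoning above can be turned into a back-of-the-envelope estimate. The Python sketch below is only a rough model under loudly stated assumptions: it treats the full 2 GB as the memory budget, ignores the operating system and PGA overhead, and assumes every connection costs the same amount of memory. The function names are hypothetical; only the figures 2 GB, 1.4 GB, 900 MB and 283 connections come from the case.

```python
def per_connection_mb(budget_mb: float, sga_mb: float,
                      connections_at_failure: int) -> float:
    """Memory left outside the SGA, spread over the connections
    that existed when logins started failing."""
    return (budget_mb - sga_mb) / connections_at_failure


def estimated_max_connections(budget_mb: float, sga_mb: float,
                              per_conn_mb: float) -> int:
    """How many connections of that size fit outside a given SGA."""
    return int((budget_mb - sga_mb) // per_conn_mb)


# Figures from the case: 2 GB budget, 1.4 GB SGA, failure at 283 connections.
cost = per_connection_mb(2048, 1400, 283)
# After shrinking the SGA to 900 MB, under the same per-connection cost.
print(round(cost, 2), estimated_max_connections(2048, 900, cost))
```

The absolute numbers should not be trusted; the point of the sketch is the direction of the effect, namely that every megabyte taken from the SGA becomes headroom for additional connections.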
A reminder for VCS users
2007-01-08 00:00:00

When upgrading Oracle RAC, be sure not to forget to sync the Oracle libraries:
$ cp /opt/VRTSvcs/ops/lib/libskgxp92_64.so $ORACLE_HOME/lib/libskgxp9.so
$ cp /opt/ORCLcluster/lib/9iR2/libskgxn2_64.so $ORACLE_HOME/lib/libskgxn9.so

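Otherwise the problem may only surface after the system has been running for more than half a year, or even a year, and by then you will have no idea where to look.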
Diagnosing a flood of trace files under RAC
2007-01-07 00:00:00

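DB version: Oracle 9.2.0.7 RAC
OS version: Solaris 9
Clusterware: Veritas Cluster Server 4.1

Symptom:

A trace file was being generated every three seconds on average. The trace files kept accumulating, so free disk space shrank rapidly.

The alert log, however, contained no error messages, only this line: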

Thu Jan 4 11:34:53 2007
Errors in file /oracle_bin/rac9i/admin/XXrac/udump/XXrac2_ora_1942.trc

Trace file:
/oracle_bin/rac9i/admin/XXrac/udump/XXrac2_ora_1942.trc
Oracle9i Enterprise Edition Release 9.2.0.7.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /oracle_bin/rac9i/product
System name: SunOS
Node name: XXXXX
Release: 5.9
Version: Generic_118558-13
Machine: sun4u
Instance name: yjrac2
Redo thread mounted by this instance: 2
Oracle process number: 13
Unix process pid: 1942, image: oracle@XXXXX (TNS V1-V3)

*** SESSION ID:(85.7779) 2006-04-23 01:13:12.231
=================================
Begin 4031 Diagnostic Information
=================================
The following information assists Oracle in diagnosing
causes of ORA-4031 errors. This trace may be disabled
by setting the init.ora parameter _4031_dump_bitvec = 0
======================================
Allocation Request Summary Information
======================================
Current information setting: 00654fff
Dump Interval=300 seconds SGA Heap Dump Interval=3600 seconds
Last Dump Time=04/14/2030 14:07:45
Allocation request for: kglsim object batch
Heap: 380032950, size: 4032
******************************************************
HEAP DUMP heap name="sga heap(2,0)" desc=380032950
extent sz=0xfe0 alt=200 het=32767 rec=9 flg=-126 opc=0
parent=0 owner=0 nex=0 xsz=0x1
====================
Process State Object
====================
----------------------------------------
SO: 41f4ee448, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=13, calls cur/top: 41f6d3320/41f6d3320, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=20
holding 428c53f60 Child library cache level=5 child#=5
Location from where latch is held: kglobpn: child:: latch
Context saved from call: 6
state=busy
Process Group: DEFAULT, pseudo proc: 41f5cbab0
O/S info: user: orarac, term: UNKNOWN, ospid: 1942
OSD pid info: Unix process pid: 1942, image: oracle@XXXXX(TNS V1-V3)
=========================
User Session State Object
=========================
----------------------------------------
SO: 4204f6d48, type: 4, owner: 41f4ee448, flag: INIT/-/-/0x00
(session) trans: 0, creator: 41f4ee448, flag: (41) USR/- BSY/-/-/-/-/-
DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
txn branch: 0
oct: 0, prv: 0, sql: 438fa8348, psql: 0, user: 0/SYS
O/S info: user: , term: , ospid: , machine:
program:
temporary object counter: 0
...No current library cache object being loaded
...No instantiation object
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksm_4031_dump()+186 CALL ksedst() 00000000B ? 000000000 ?
8 000000000 ? 103327258 ?
00000003E ?
FFFFFFFF7FFF7938 ?
ksmasg()+352 CALL ksm_4031_dump() 000103756 ? 380000030 ?
380032950 ? 000654FFF ?
103756000 ? 103751768 ?
kghnospc()+364 PTR_CALL 0000000000000000 1037519C8 ? 380000030 ?
000000FC0 ? 000000FC0 ?
380000078 ?
FFFFFFFF7FFFB138 ?
kghalo()+4156 CALL kghnospc() 1037519C8 ? 380032950 ?
000000000 ? 004000000 ?
102D430F8 ? 103519280 ?
kglsim_chk_objlist( CALL kghalo() 000000000 ?
)+340 FFFFFFFF7FFFB2A0 ?
1037519C8 ? 000001000 ?
428D58638 ? 000000000 ?

Clearly the errors are caused by ORA-4031, but why were ORA-4031 errors being raised so frequently?
From v$resource_limit we found the following:
RESOURCE_NAME          CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
---------------------- ------------------- --------------- ------------------ -----------
processes                              479             490               1000        1000
sessions                               486             506               1105        1105
enqueue_locks                          165             468              13282       13282
enqueue_resources                      179             343               5080   UNLIMITED
ges_procs                              478             488               1001        1001
ges_ress                             32169           59712              20754   UNLIMITED
ges_locks                            29613           55281              32150   UNLIMITED
ges_cache_ress                        1929           29346                  0   UNLIMITED
ges_reg_msgs                          1087            2301               2230   UNLIMITED
ges_big_msgs                           106            1183               2230   UNLIMITED
ges_rsv_msgs                             0               0               1000        1000
gcs_resources                       161607          189852             211052      211052
gcs_shadows                          88914           99594             211052      211052
dml_locks                               26             370               4860   UNLIMITED
temporary_table_locks                    0               2          UNLIMITED   UNLIMITED
transactions                             8              21               1215   UNLIMITED
branches                                 0               1               1215   UNLIMITED
cmtcallbk                                0               1               1215   UNLIMITED
sort_segment_locks                      27              41          UNLIMITED   UNLIMITED
max_rollback_segments                   14              15                244         244
max_shared_servers                       0               0                 20          20
parallel_max_servers                     0               5                  6           6

22 rows selected.

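The output shows that ges_ress and ges_locks have both outgrown their initial allocation. For ges_ress, the current utilization (32169) is already above the initial allocation (20754), and the peak (59712) is nearly three times that figure. For ges_locks, the current utilization (29613) is just below the initial allocation (32150), but the peak (55281) reached about 1.7 times that figure.
Once ges_ress and ges_locks grow beyond their initial allocation, the extra memory chunks are taken from the shared pool.
The larger the overshoot, the more shared pool memory is consumed. With the database under very heavy load at the same time, ORA-4031 errors kept recurring.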
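The check described above can be sketched in a few lines of Python. This is only an illustration: the rows are a subset copied from the v$resource_limit output above, and the function name `over_allocated` is a hypothetical helper, not part of any Oracle tooling.

```python
# Subset of the v$resource_limit rows shown above:
# (resource_name, current_utilization, max_utilization, initial_allocation)
ROWS = [
    ("processes", 479, 490, "1000"),
    ("ges_ress", 32169, 59712, "20754"),
    ("ges_locks", 29613, 55281, "32150"),
    ("gcs_resources", 161607, 189852, "211052"),
    ("temporary_table_locks", 0, 2, "UNLIMITED"),
]


def over_allocated(rows):
    """Return (name, ratio) for resources whose peak use exceeded the initial allocation.

    Hypothetical helper: ratio is max_utilization / initial_allocation, rounded.
    """
    flagged = []
    for name, _current, peak, initial in rows:
        initial = initial.strip()
        # INITIAL_ALLOCATION is a string column and may hold UNLIMITED; skip those and zero.
        if not initial.isdigit() or int(initial) == 0:
            continue
        if peak > int(initial):
            flagged.append((name, round(peak / int(initial), 2)))
    return flagged


if __name__ == "__main__":
    for name, ratio in over_allocated(ROWS):
        print(f"{name}: peak is {ratio}x the initial allocation")
```

With the sample rows, only ges_ress and ges_locks are flagged, which matches the reading of the table given above.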

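What we can do at this point is keep these two resources within a bounded range, or raise their initial allocation and cap their maximum.
Two hidden parameters control this: _lm_locks and _lm_ress. For example, _lm_locks=(200000,200000), where the two values are the initial allocation and the maximum.
In other words, 200000 are allocated up front and no more than 200000 can ever be used. Be aware that this setting increases the amount of memory Oracle uses.
If you have a 10 GB SGA and set the value to 1000000, expect roughly 1 GB more, so about 11 GB in total with the SGA.
So always check the host's memory in this situation: after the parameter is changed, the memory is preallocated when the instance restarts. You can see this with ipcs.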
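The sizing rule of thumb above can be turned into a quick estimate. This is a rough sketch only: the per-entry cost is derived from the figure quoted in the text (1000000 entries for roughly 1 GB) and is an assumption, not a documented Oracle value, and both function names are hypothetical.

```python
GIB = 1024 ** 3

# Assumption derived from the text's rule of thumb: 1,000,000 entries ~ 1 GB.
BYTES_PER_ENTRY = GIB / 1_000_000


def extra_memory_gib(entries, bytes_per_entry=BYTES_PER_ENTRY):
    """Estimate the memory preallocated for `entries` lock or resource entries, in GiB."""
    return entries * bytes_per_entry / GIB


def total_footprint_gib(sga_gib, entries):
    """Estimate SGA plus the preallocated entries, in GiB."""
    return sga_gib + extra_memory_gib(entries)


if __name__ == "__main__":
    print(f"200000 entries: about {extra_memory_gib(200_000):.2f} GiB extra")
    print(f"10 GiB SGA + 1000000 entries: about {total_footprint_gib(10, 1_000_000):.1f} GiB")
```

Compare the estimate with the free memory on the host before restarting the instance, and confirm the real allocation afterwards with ipcs.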
A Case Study: No Objects Can Be Created on a RAC Platform
2006-12-30 00:00:00

OS: Solaris 5.9
Clusterware: Veritas Cluster Server
File system: VxFS
Database: Oracle 9.2.0.7 on RAC

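We received a call reporting that no database object could be created on any instance. The same failure had occurred before, and the customer had cleared it by restarting the instances. Over the last few days the failure came back.

v$lock showed locking. Querying dba_waiters revealed that the session with SID 12 held a lock on obj$. The SQL statement found in v$sql was: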

select o.owner#,o.obj#,decode(o.linkname,null, decode(u.name,null,'SYS',u.name),o.remoteowner),
o.name,o.linkname,o.namespace,o.subname
from user$ u, obj$ o where u.user#(+)=o.owner# and o.type#=1
and not exists (select p_obj# from dependency$ where p_obj# = o.obj#)
order by o.obj# for update;

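Joining to v$process to get the spid showed that this session was the SMON process, which held the obj$ lock directly. In other words, the SMON processes of the two instances were blocking each other, so no database object could be created.

But why would they block each other? SMON performs recovery at instance startup. It also cleans up temporary segments that are no longer in use and coalesces adjacent free extents in dictionary-managed tablespaces. In RAC, the SMON process of one instance can also perform instance recovery for a failed CPU or instance. In our case no RAC instance had crashed, so recovery was not involved. That left temporary segment cleanup.

On checking, we found that the carrier's DBA had made a serious mistake: the temp tablespace had been created with a datafile, and it was dictionary-managed as well. As a result, the SMON processes of both instances kept trying to clean up and coalesce the same temporary segments and blocked each other, which is why no database object could be created afterwards.

The fix is to rebuild the temp tablespace as a locally managed temporary tablespace that uses a tempfile, then restart the nodes. The other option is to set an event that disables this SMON work on one of the instances:

event = '10052 trace name context forever'

I went with the first approach. Over the following half month the database has run normally.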
A Case Study: Diagnosing a Database Hang
2006-12-30 00:00:00

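Machine: sun4u Sun Fire E4900
Memory: 16 GB
CPU: 4*1350
Oracle: 10.2.0.2.0 on RAC
Cluster software: Sun Cluster
OS: Solaris 5.9

Symptoms:
The mobile carrier reported that the database had hung the day before during a bulk SMS send. The same thing happened again in the afternoon, and while it lasted any SQL statement would hang.
Strangely, the other node could run any operation without trouble, while on this node even sqlplus could not log in.
I went on site and collected the AWR report (awrrpt) for the period of the hang. These were the top 5 events: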

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
latch: library cache                  1,031      48,359  46905   89.6 Concurrenc
CPU time                                          5,235           9.7
gc cr multi block request         4,889,209         824      0    1.5 Cluster
enq: TM - contention                  1,181         568    481    1.1 Applicatio
gc current block 2-way               58,770          41      1    0.1 Cluster

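The latch wait is extremely high: about 47 seconds per wait on average, and 89.6% of total call time. For background on the library cache, see the other articles on this blog.
Looking at the top sessions, I found a SQL statement that had started half a month earlier and still had not finished. It turned out to be an unimportant select statement
that checks for locks. I killed its session and ran the SQL by hand, and it hung almost immediately. Could this SQL be the cause? That seemed impossible, since it only queries the v$lock and v$session views. Still, whatever the views, a query that runs for half a month is a serious problem, and the hang always occurred at the SMS peak.
Fortunately I caught one occurrence live. I logged in with sqlplus under truss, and it soon hung. The truss output showed nothing abnormal, which meant the problem was not at the OS or software level. Going back to the AWR report, one item looked suspicious: a SQL statement with a version count of more than 37,000: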
 Version
   Count   Executions SQL Id
-------- ------------ -------------
  37,164          N/A g2k0nc8fbn337

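This means Oracle has to search through that many child cursors for this one statement. A Metalink search showed it to be a bug, 5442957, triggered by cursor_sharing=similar. It can be worked around with event 38056.

That night we shut down the database and added Event = '38056 trace name context forever, level 1', which fixed the problem.
No hang has occurred in the three days since.

Note: while node 01 had the problem, node 02 was completely unaffected.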
Wait Events in Oracle
2006-12-17 00:00:00

from http://www.itpub.net/showthread.php?threadid=398220&pagenumber=

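Table: non-idle wait events and what they mean
buffer busy waits: A wait for access to the data buffer cache. It occurs when a session reads data into a buffer or modifies data in a buffer, for example when DBWR is writing blocks to a datafile while another process needs to read the same blocks. It can also mean the freelists setting on a table is too low to support many concurrent INSERTs. In v$session_wait, p1 is the file number of the block and p2 is the block number within that file. Joining this information to dba_data_files and dba_extents quickly identifies the contended object, so you can narrow down the root cause.
db file parallel write: A wait related to the DBWR process. It generally indicates a problem with I/O capacity. It is usually tied to the number of DBWR processes or DBWR I/O slaves configured, and it can also mean I/O contention on the device.
db file scattered read: A wait related to full table scans. It usually means too many full table scans, insufficient I/O capacity, or I/O contention.
db file sequential read: A wait related to index scans. It also points to an I/O problem: I/O contention or too much I/O demand.
db file single write: A wait related to writing file headers when a checkpoint occurs. It is usually associated with disordered file numbers when the checkpoint synchronizes the datafiles.
direct path read: A wait related to direct I/O reads. It occurs when data is read directly into PGA memory. Typical sources of these reads are sort I/O, parallel slave queries, and read-ahead requests. This wait is usually related to I/O capacity or I/O contention.
direct path write: Same as above, for writes.
enqueue: A wait related to the internal queuing mechanism, for example a request for a lock that protects an internal resource or component. It is a concurrency protection mechanism.
free buffer inspected: A wait while a process looks for enough free memory when reading data into the data buffer cache. It usually means the buffer cache is too small.
free buffer waits: The data buffer cache is short of free space. It is usually caused by a buffer cache that is too small or by dirty blocks being written out too slowly. Consider enlarging the buffer cache or configuring more DBWR processes.
latch free: Contention on a latch. First make sure enough latches are available. If the wait persists, find out which latch is contended (p2 in v$session_wait is the latch number), then work out what is causing the contention. Most latch contention is caused not by the latch itself but by the component it protects, so you need to find the underlying cause. For example, library cache latch contention usually means the library cache is poorly configured, or the SQL is poorly written and causes heavy hard parsing.
library cache load lock: A wait while an object is being loaded into the library cache. It usually indicates heavy statement reloading or loading, possibly because SQL statements are not shared or the shared pool is too small.
library cache lock: A wait involving multiple processes accessing the library cache. It usually indicates an unsuitable shared pool size.
library cache pin: This wait is also related to library cache concurrency. It occurs when an object in the library cache is being modified or examined.
log buffer space: A wait for space in the log buffer. A process writing to the log buffer cannot get the memory it needs, usually because the log buffer is too small or LGWR writes too slowly.
log file parallel write: The wait from the moment LGWR issues an I/O request to the operating system until the I/O completes. It can occur whenever an LGWR write is triggered: every 3 seconds, when the log buffer is one-third full, when 1 MB of redo has accumulated, or before DBWR writes. It usually indicates I/O contention on the log files or a slow drive under them.
log file single write: A wait while writing a log file header block. It generally occurs at a checkpoint.
log file switch (archiving needed): A wait because archiving is too slow for the log switch to proceed. There can be many causes, the main one being that archiving cannot keep up with the rate of log switches. Possible causes include redo logs that are too large, too few redo log groups, low archiving capacity, I/O contention on the archive files, a suspended archiver, or archive logs placed on a slow device.
log file switch (checkpoint incomplete): The checkpoint on the file has not completed at log switch time. It generally means the log files are too small, so log switches happen too fast, or some other cause.
log file sync: The wait from the moment a server process issues commit or rollback until LGWR completes the related log write. If many server processes issue these commands at once and LGWR cannot finish the writes in time, this wait can result.
transaction: A wait on a blocking rollback operation.
undo segment extension: A wait for a rollback segment to extend dynamically. It suggests the transaction volume may be too high, and also that the initial size of the rollback segment may not be optimal, with minextents set too low. Consider reducing transactions or using rollback segments with a larger minimum number of extents.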
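As an illustration of the buffer busy waits entry in the table above, the lookup from p1 and p2 to the contended segment can be sketched like this. The helper name `segment_lookup` and the bind names are hypothetical. The query uses the standard dba_extents columns (owner, segment_name, segment_type, file_id, block_id, blocks) and would be run through whatever database client you already use.

```python
def segment_lookup(p1_file_id, p2_block_id):
    """Build the dba_extents lookup for a buffer busy wait.

    p1 is the file number and p2 the block number reported in v$session_wait.
    Returns the SQL text plus a dict of bind values (hypothetical helper).
    """
    sql = (
        "SELECT owner, segment_name, segment_type "
        "FROM dba_extents "
        "WHERE file_id = :file_id "
        "AND :block_id BETWEEN block_id AND block_id + blocks - 1"
    )
    binds = {"file_id": int(p1_file_id), "block_id": int(p2_block_id)}
    return sql, binds


if __name__ == "__main__":
    # Example values only; take the real p1/p2 from v$session_wait.
    sql, binds = segment_lookup(7, 1234)
    print(sql)
    print(binds)
```

The range predicate matters because p2 points at a single block, while each dba_extents row describes an extent that starts at block_id and spans `blocks` blocks.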
ORA-3136 in 10g
2006-12-09 00:00:00

The alert log frequently reports this error:

WARNING: inbound connection timed out (ORA-3136)

I later found the following approach on Metalink and am posting it here:

1. Set INBOUND_CONNECT_TIMEOUT_<listener_name> = 0 in listener.ora.
2. Set SQLNET.INBOUND_CONNECT_TIMEOUT = 0 in sqlnet.ora on the server.
3. Stop and start both the listener and the database.
4. Now try to connect to the DB and observe the behaviour.
Formatting plan_table Output
2006-12-09 00:00:00

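I often need to look at SQL execution plans. For large statements autotrace is a poor fit because it takes time, and raw plan_table output is hard to read. Here is a script that makes it easier. I have forgotten which book it came from.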

SQL> EXPLAIN PLAN
     SET STATEMENT_ID='SQL1' FOR select * from dual;

Explained.

SQL> select lpad(' ',2*(level-1))||level||'.'||nvl(position,0)||' '||
     operation||' '||options||' '||object_name||' '||object_type
     ||' '||decode(id,0,statement_id||'cost='||position)||cost
     ||' '||object_node "query plan"
     from plan_table
     start with id=0 and statement_id='SQL1'
     connect by prior id=parent_id
     and statement_id='SQL1';

query plan
-------------------------------------------------------------------------------
1.0 SELECT STATEMENT    SQL1cost=
  2.1 TABLE ACCESS FULL DUAL

SQL>
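To make the hierarchy walk in the query above easier to follow, here is a minimal Python sketch of the same idea: start at the root row, follow the id/parent_id links, and indent two spaces per level. The row dictionaries and the `format_plan` helper are hypothetical stand-ins for rows fetched from plan_table; this is not the author's script.

```python
def format_plan(rows):
    """Render plan rows as an indented tree, numbered level.position.

    rows: dicts with id, parent_id, position, operation, options, object_name.
    The root row is the one whose parent_id is None.
    """
    children = {}
    for row in rows:
        children.setdefault(row["parent_id"], []).append(row)

    lines = []

    def walk(row, level):
        parts = [row["operation"], row.get("options"), row.get("object_name")]
        text = " ".join(p for p in parts if p)  # skip empty columns
        label = f"{level}.{row.get('position') or 0}"
        lines.append("  " * (level - 1) + f"{label} {text}")
        for child in sorted(children.get(row["id"], []), key=lambda r: r["id"]):
            walk(child, level + 1)

    for root in children.get(None, []):
        walk(root, 1)
    return lines


if __name__ == "__main__":
    sample = [
        {"id": 0, "parent_id": None, "position": None,
         "operation": "SELECT STATEMENT", "options": None, "object_name": None},
        {"id": 1, "parent_id": 0, "position": 1,
         "operation": "TABLE ACCESS", "options": "FULL", "object_name": "DUAL"},
    ]
    print("\n".join(format_plan(sample)))
```

The indentation plays the role of lpad with level, and the recursive walk plays the role of connect by prior id = parent_id.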

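In a RAC environment we often see parameter entries like these:

rac1.instance_number=2
rac2.instance_number=1
*.db_block_size=8192
*.db_cache_size=25165824

The asterisk (*) means the setting applies to every instance in the RAC, while a prefix such as rac1 limits the setting to that instance.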
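How the prefixes combine for one instance can be sketched as follows. This is an illustrative parser only, and `resolve_parameters` is a hypothetical helper: it applies the `*.` entries to every instance and lets an entry prefixed with the instance's own SID override them.

```python
LINES = [
    "rac1.instance_number=2",
    "rac2.instance_number=1",
    "*.db_block_size=8192",
    "*.db_cache_size=25165824",
]


def resolve_parameters(lines, sid):
    """Return the effective parameters for one instance (hypothetical helper).

    Entries prefixed with '*.' apply to all instances; entries prefixed with
    the given SID apply only to that instance and win over '*.' entries.
    """
    common, specific = {}, {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        prefix, _, name = key.strip().partition(".")
        if not name:
            continue  # no 'prefix.name' form, ignore in this sketch
        if prefix == "*":
            common[name] = value.strip()
        elif prefix == sid:
            specific[name] = value.strip()
    return {**common, **specific}


if __name__ == "__main__":
    for sid in ("rac1", "rac2"):
        print(sid, resolve_parameters(LINES, sid))
```

Both instances end up with the same db_block_size and db_cache_size, while instance_number differs per SID.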
