[MOS] Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1, Doc ID 2062234.1)

   

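Cluster Health Monitor (CHM) is a tool provided by Oracle that automatically collects operating system resource usage data (CPU, memory, swap, processes, I/O, network, and so on). CHM collects data once per second.

These system resource data are very helpful for diagnosing node reboots, hangs, instance evictions, performance problems, and similar issues in a cluster. In addition, users can use CHM to detect problems such as high system load or abnormal memory usage early, and so avoid more serious problems.

CHM is installed automatically with the following software:
    Oracle Grid Infrastructure 11.2.0.2 and later for Linux (excluding Linux Itanium) and Solaris (Sparc 64 and x86-64)
    Oracle Grid Infrastructure 11.2.0.3 and later for AIX and Windows (excluding Windows Itanium).

    In a cluster, the status of the CHM resource (ora.crf) can be checked with the following command: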
    $ crsctl stat res -t -init      
    --------------------------------------------------------------------------------      
    NAME           TARGET  STATE        SERVER                   STATE_DETAILS
    Cluster Resources
    ora.crf        ONLINE  ONLINE       rac1


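CHM consists mainly of two services:
    1). System Monitor Service (osysmond): this service runs on every node. osysmond sends each node's resource usage to the cluster logger service, which receives the data from all nodes and saves it to the CHM repository.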
      $ ps -ef|grep osysmond
       root      7984     1  0 Jun05 ?        01:16:14 /u01/app/11.2.0/grid/bin/osysmond.bin

    2). Cluster Logger Service (ologgerd): within a cluster, ologgerd has one master node and one standby node. When ologgerd runs into a problem on the current node and cannot start, it is started on the standby node.

     Master node:
     $ ps -ef|grep ologgerd
       root      8257     1  0 Jun05 ?        00:38:26 /u01/app/11.2.0/grid/bin/ologgerd -M -d /u01/app/11.2.0/grid/crf/db/rac2

     Standby node:
      $ ps -ef|grep ologgerd
       root      8353     1  0 Jun05 ?        00:18:47 /u01/app/11.2.0/grid/bin/ologgerd -m rac2 -r -d /u01/app/11.2.0/grid/crf/db/rac1

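CHM Repository: stores the collected data. By default it resides under the Grid Infrastructure home and requires 1 GB of disk space; each node takes up roughly 0.5 GB per day. You can use OCLUMON to change its location and the space it is allowed to use (at most 3 days of data can be kept).

The following commands show the current settings: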
     $ oclumon manage -get reppath
       CHM Repository Path = /u01/app/11.2.0/grid/crf/db/rac2
       Done

     $ oclumon manage -get repsize
       CHM Repository Size = 68082 <==== unit is seconds
       Done

     Change the location:
     $ oclumon manage -repos reploc /shared/oracle/chm

     Change the size:
     $ oclumon manage -repos resize 68083 <== must be between 3600 (1 hour) and 259200 (3 days)
      rac1 --> retention check successful
      New retention is 68083 and will use 1073750609 bytes of disk space
      CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
      Done
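
The range noted above (3600 to 259200 seconds) can be checked before the resize is attempted. The following is a minimal sketch, not part of the Oracle tooling: the function names chm_retention_ok and chm_resize_cmd are hypothetical, and the second function only prints the command rather than running it.

```shell
# Check that a retention value (in seconds) lies in the range quoted above:
# 3600 (1 hour) to 259200 (3 days). Returns 0 if valid, 1 otherwise.
chm_retention_ok() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 3600 ] && [ "$1" -le 259200 ]
}

# Print (do not run) the resize command only when the value is acceptable.
chm_resize_cmd() {
    if chm_retention_ok "$1"; then
        echo "oclumon manage -repos resize $1"
    else
        echo "retention must be between 3600 and 259200 seconds" >&2
        return 1
    fi
}
```

Printing the command first makes it easy to review the value before running it by hand.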

There are two ways to obtain the data generated by CHM:
     1. One is to use Grid_home/bin/diagcollection.pl:
        1). First, determine the master node of the cluster logger service:
         $ oclumon manage -get master
         Master = rac2  

        2). As root, run the following command on the master node rac2:
         # /bin/diagcollection.pl -collect -chmos -incidenttime inc_time -incidentduration duration
         inc_time is the time from which to start collecting data, in the format MM/DD/YYYYHH:MM:SS with HH on a 24-hour clock; duration is how much data to collect after that start time.

         For example: # diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome /u01/app/11.2.0/grid -chmos -incidenttime 06/15/201215:30:00 -incidentduration 00:05

       3). After the command has run, the CHM data is written to the file chmosData_rac2_20120615_1537.tar.gz.

    2. The other way to obtain the data generated by CHM is oclumon:
        $ oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning]] [-h]

        -s is the start time and -e is the end time
       $ oclumon dumpnodeview -allnodes -v -s "2012-06-15 07:40:00" -e "2012-06-15 07:57:00" > /tmp/chm1.txt

       $ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00" >/tmp/chm1.txt
       $ oclumon dumpnodeview -allnodes -last "00:15:00" >/tmp/chm1.txt
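
The -last examples above take a duration written as "HH:MM:SS". As a small sketch, the hypothetical helpers below (chm_last_arg and chm_dump_last_cmd are illustrative names, not part of oclumon) convert a number of minutes into that form and print the matching command without running it.

```shell
# Convert a number of minutes into the HH:MM:SS string used with -last.
chm_last_arg() {
    minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

# Print (do not run) a dumpnodeview command for the last N minutes on all nodes.
chm_dump_last_cmd() {
    printf 'oclumon dumpnodeview -allnodes -last "%s"\n' "$(chm_last_arg "$1")"
}
```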


Below is part of the contents of /tmp/chm1.txt:
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo:168880
----------------------------------------

SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 mcache: 1064024 swapfree: 3988376 swaptotal: 4192956 ior: 57 iow: 59 ios: 10 swpin: 0 swpout: 0 pgin: 57 pgout: 59 netr: 65.767 netw: 34.871 procs: 183 rtprocs: 10 #fds: 4902 #sysfdlimit: 6815744 #disks: 4 #nics: 3  nicErrors: 0

TOP CONSUMERS:
topcpu: 'mrtg(32385) 64.70' topprivmem: 'ologgerd(8353) 84068' topshm: 'oracle(8760) 329452' topfd: 'ohasd.bin(6627) 720' topthread: 'crsd.bin(8235) 44'

PROCESSES:

name: 'mrtg' pid: 32385 #procfdlimit: 65536 cpuusage: 64.70 privmem: 1160 shm: 1584 #fd: 5 #threads: 1 priority: 20 nice: 0
name: 'oracle' pid: 32381 #procfdlimit: 65536 cpuusage: 0.29 privmem: 1456 shm: 12444 #fd: 32 #threads: 1 priority: 15 nice: 0
...
name: 'oracle' pid: 8756 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2892 shm: 24356 #fd: 47 #threads: 1 priority: 16 nice: 0

----------------------------------------
Node: rac2 Clock: '06-15-12 07.40.02' SerialNo:168878
----------------------------------------

SYSTEM:
#cpus: 1 cpu: 40.72 cpuq: 8 physmemfree: 34072 physmemtotal: 2065856 mcache: 1005636 swapfree: 3991808 swaptotal: 4192956 ior: 54 iow: 104 ios: 11 swpin: 0 swpout: 0 pgin: 54 pgout: 104 netr: 77.817 netw: 33.008 procs: 178 rtprocs: 10 #fds: 4948 #sysfdlimit: 6815744 #disks: 4 #nics: 4  nicErrors: 0

TOP CONSUMERS:
topcpu: 'orarootagent.bi(8490) 1.59' topprivmem: 'ologgerd(8257) 83108' topshm: 'oracle(8873) 324868' topfd: 'ohasd.bin(6744) 720' topthread: 'crsd.bin(8362) 47'

PROCESSES:

name: 'oracle' pid: 9040 #procfdlimit: 65536 cpuusage: 0.19 privmem: 6040 shm: 121712 #fd: 33 #threads: 1 priority: 16 nice: 0
...
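
A report in the format shown above can be reduced to one line per sample for a quick look at CPU load. The sketch below is an illustration only: chm_cpu_summary is a hypothetical name, and it assumes the "Node:" header and the "#cpus: ... cpu: ..." line look exactly like the sample output.

```shell
# Summarise a dumpnodeview report read from stdin: one line per sample
# with the node name, the clock value and the system-wide "cpu:" value.
chm_cpu_summary() {
    awk '
        $1 == "Node:" {               # header line: Node: rac1 Clock: ...
            node = $2
            clock = $4 " " $5
            gsub("\047", "", clock)   # strip the single quotes around the clock
        }
        $1 == "#cpus:" {              # first line of the SYSTEM section
            for (i = 1; i < NF; i++)
                if ($i == "cpu:") print node, clock, $(i + 1)
        }
    '
}

# Example use against the file produced earlier:
#   chm_cpu_summary < /tmp/chm1.txt
```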


  For more details about CHM, see the Oracle documentation:
  http://docs.oracle.com/cd/E11882_01/rac.112/e16794/troubleshoot.htm#CWADD92242
  Oracle® Clusterware Administration and Deployment Guide
  11g Release 2 (11.2)
  Part Number E16794-17

  or the My Oracle Support document:
  Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)

           


 


Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)

 

In this Document


Purpose

Questions and Answers
  What is the Cluster Health Monitor?
  What is the purpose of the Cluster Health Monitor?
  What platform does Cluster Health Monitor support and where can I get the Cluster Health Monitor?
  What is the resource name for Cluster Health Monitor in 11.2.0.2 or higher?
  Does stopping or starting ora.crf affect clusterware or cluster database functionality?
  Can the Cluster Health Monitor be installed on a single node, non-RAC server?
  Do Engineered Systems like Exadata have a default usage with CHM and if so, any specific version?
  Where is oclumon?
  How do I collect the Cluster Health Monitor data?
  Why does “diagcollection.pl --collect --chmos” return “Cannot parse master from output: ERROR : in reading init file” error?
  How do you get the syntax of different options and explanations for those options for diagcollection.pl and oclumon?
  What is IPD/OS?
  How is the Cluster Health Monitor different from OSWatcher?
  Is the Cluster Health Monitor replacing OSWatcher?
  How much overhead does the Cluster Health Monitor cause?
  Does CHM on Multiple Node configurations (e.g. 4 to 8 nodes) have scaling concerns?
  Will CDB and PDB result in any new information or special conditions using CHM?
  How much disk space is needed for the Cluster Health Monitor?
  How do I find out the size of data collected and saved by the Cluster Health Monitor in my system?
  How can I increase the size of the Cluster Health Monitor repository?
  On what platforms can I run the Cluster Health Monitor?
  What steps are needed to install 11.2.0.2 when the Cluster Health Monitor from OTN is already running?
  Where is the Cluster Health Monitor from OTN installed on Linux?
  What logs and data should I gather before logging an SR for a Cluster Health Monitor error?
  How do I increase the trace level of the Cluster Health Monitor?
  Can I use procwatcher to get the pstack of the Cluster Health Monitor regularly?
  What are the processes and components for the Cluster Health Monitor?
  What is oclumon?
  What are the *.bdb, _db.*, *.ldb, and log.* files that the tool creates in the BDB (Berkeley Database) location directory?
  Because it takes many days / weeks to resolve a problem like the node reboot or performance degradation, is there any way to keep the Cluster Health Monitor data for that long so that it can be replayed any time later when needed?
  Where is the location for the log files for the Cluster Health Monitor from OTN (pre 11.2.0.2)?
  How do I fix the problem that the time in the oclumon report is in UTC time zone instead of the time zone of my server?
  Can I install CHM from OTN on 11.2.0.2? What if I stop and disable CHM resource (ora.crf) on 11.2.0.2?
  Where is the trace file for a client like oclumon? How do I increase the trace level for oclumon?
  Can the Directory path to the CHM Repository be same on all nodes if shared storage is used?
  How much data (how long in time) does a node store locally when it cannot communicate with the master?
  How often does CHM collect the system metric data? Can this be changed?
  What is the default CHM retention time?
  How can you reduce the size of bdb file that became big for any reason?
  Can you set up CHM to run locally on each node?
  Can CHM be used on a single node non-RAC server?
  How do I start and stop CHM that is installed as part of GI in 11.2 and higher?
  Database - RAC/Scalability Community

References


APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 to 12.1.0.2 [Release 10.1 to 12.1]
Information in this document applies to any platform.

PURPOSE


 

The Cluster Health Monitor FAQ is an evolving document that answers common questions about the Cluster Health Monitor.

QUESTIONS AND ANSWERS

What is the Cluster Health Monitor?

The Cluster Health Monitor collects OS statistics (system metrics) such as memory and swap space usage, processes, IO usage, and network related data. It collects this information in real time, usually once a second, and uses OS APIs to improve performance and reduce CPU overhead. It collects as many system metrics and as much data as is feasible within the level of resource consumption that is acceptable for the tool.

What is the purpose of the Cluster Health Monitor?

The Cluster Health Monitor is developed to provide system metrics and data for troubleshooting many different types of problems such as node reboot and hang, instance eviction and hang, severe performance degradation, and any other problems that need the system metrics and data. 

By monitoring the data constantly, users can use the Cluster Health Monitor to detect potential problem areas such as CPU load, memory constraints, and spinning processes before the problem causes an unwanted outage.

What platform does Cluster Health Monitor support and where can I get the Cluster Health Monitor?

The Cluster Health Monitor is NOT supported on Linux Itanium, IBM Linux Z, or HP-UX.

The Cluster Health Monitor is an integrated part of 11.2.0.2 Oracle Grid Infrastructure for Linux (not on Linux Itanium and IBM Linux Z) and Solaris (Sparc 64 and x86-64 only), so installing 11.2.0.2 Oracle Grid Infrastructure on those platforms automatically installs the Cluster Health Monitor. AIX has the Cluster Health Monitor starting from 11.2.0.3. The Cluster Health Monitor is also enabled for Windows (except Windows Itanium) in 11.2.0.3.

Prior to 11.2.0.2 on Linux (not on Linux Itanium and IBM Linux Z), the Cluster Health Monitor can be downloaded from OTN.

http://www-content.oracle.com/technetwork/products/clustering/downloads/ipd-download-homepage-087212.html

The OTN version for Windows is not available.  Please upgrade to 11.2.0.3 if you need CHM for Windows.

What is the resource name for Cluster Health Monitor in 11.2.0.2 or higher?

ora.crf is the Cluster Health Monitor resource name that ohasd manages. Issue "crsctl stat res -t -init" to check the current status of the Cluster Health Monitor.

Does stopping or starting ora.crf affect clusterware or cluster database functionality?

No. Stopping and starting the ora.crf resource stops and starts the Cluster Health Monitor and its data collection; it does not affect clusterware or database functionality.

Can the Cluster Health Monitor be installed on a single node, non-RAC server?

The Cluster Health Monitor Standalone for Linux x86 and x86-64 can be downloaded from OTN and installed on a single node, non-RAC server without the need to install Grid Infrastructure or CRS. For other platforms, Grid Infrastructure or CRS must be installed to get CHM/OS.

Do Engineered Systems like Exadata have a default usage with CHM and if so, any specific version?

Engineered systems use the default GI stack that includes CHM functionality as shipped on standard platforms. At this time there are no specific extensions for Engineered systems but this may change in future releases.

Where is oclumon?

If the CHM is installed as a part of 11.2 installation on the supported platform, then the location of oclumon is in GI_HOME/bin directory. 

If the CHM is manually installed using the CHM file from OTN, then the location of oclumon is in: 
Linux : /usr/lib/oracrf/bin 
Windows : C:\Program Files\oracrf\bin

How do I collect the Cluster Health Monitor data?

As the grid user, running “/bin/diagcollection.pl --collect --chmos” produces output for all data collected in the repository. This may be too much data and may take a long time, so the suggestion is to limit the query to a time interval of interest.

For example, issue “/bin/diagcollection.pl --collect --crshome $ORA_CRS_HOME --chmos --incidenttime --incidentduration 05:00”

The above outputs the report that covers 5 hours from the time specified by incidenttime.
The incidenttime must be in MM/DD/YYYYHH:MN:SS format, where MM is month, DD is day, YYYY is year, HH is hour in 24 hour format, MN is minute, and SS is second. For example, to start from 10:15 PM on June 01, 2011, the incident time is 06/01/201122:15:00. The incidenttime and incidentduration can be changed to capture more data.
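
Because the incidenttime format has no separator between the date and the time, it is easy to mistype. As a sketch, the hypothetical helper below (chm_incident_time is an illustrative name, not part of diagcollection.pl) converts a "YYYY-MM-DD HH:MM:SS" timestamp into the MM/DD/YYYYHH:MN:SS form and fails if the input does not match.

```shell
# Turn "YYYY-MM-DD HH:MM:SS" into the MM/DD/YYYYHH:MN:SS form described above.
# Prints nothing and returns 1 when the input is not in the expected shape.
chm_incident_time() {
    out=$(printf '%s\n' "$1" |
        sed -nE 's|^([0-9]{4})-([0-9]{2})-([0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})$|\2/\3/\1\4|p')
    [ -n "$out" ] && printf '%s\n' "$out"
}

# Example: 10:15 PM on June 01, 2011
#   chm_incident_time "2011-06-01 22:15:00"
```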

Alternatively, use ‘oclumon dumpnodeview -allnodes -v -last "11:59:59" > your-filename’ if diagcollection.pl fails for any reason. This generates a report from the repository covering up to the last 12 hours. The -last value can be changed to get more or less data.

Another example of using oclumon is 'oclumon dumpnodeview -allnodes -v -s "2012-06-01 22:15:00" -e "2012-06-02 03:15:00" > /tmp/chm.log '.  The difference in this command is that it specifies the start (-s flag) and end time (-e flag).
In this case, the time format used is "YYYY-MM-DD HH24:MI:SS" like "2007-11-12 23:05:00".

Why does “diagcollection.pl --collect --chmos” return “Cannot parse master from output: ERROR : in reading init file” error?

This is due to bug 10048487 that affects 11.2.0.2. As a result, the bug in the script causes the diagcollection.pl to never be able to retrieve the master node.

The workaround for this is to issue 
oclumon dumpnodeview -allnodes -v -last "amount of data needed"
For example, oclumon dumpnodeview -allnodes -v -last "01:00:00"
will provide last one hour of data from all nodes.

How do you get the syntax of different options and explanations for those options for diagcollection.pl and oclumon?

Issue "/bin/diagcollection.pl -h" and "oclumon -h". You may need to drill down further to get information for different options.

What is IPD/OS?

The IPD/OS is an old name for the Cluster Health Monitor. The names can be used interchangeably although Oracle now calls the tool Cluster Health Monitor.

How is the Cluster Health Monitor different from OSWatcher?

OSWatcher collects OS statistics by running regular Unix commands such as vmstat, top, ps, iostat, netstat, mpstat, and meminfo. The private.net file can be configured in OSWatcher to issue the traceroute command over the private interconnect to test it. OSWatcher also runs at user priority, so it often cannot run when CPU load is heavy.

Is the Cluster Health Monitor replacing OSWatcher?

The Cluster Health Monitor has many advantages over OSWatcher. The most significant is that the Cluster Health Monitor runs in real time, usually once a second, so it collects data even when OSWatcher cannot. However, there is some information, such as top, traceroute, and netstat output, that the Cluster Health Monitor does not collect, so running the Cluster Health Monitor alongside OSWatcher is ideal. The two tools complement rather than supplant each other.
On the other hand, if only one of the tools can be used, then Oracle recommends that the Cluster Health Monitor be used.

How much overhead does the Cluster Health Monitor cause?

On today's servers, the Cluster Health Monitor uses less than approximately 3% of the server's CPU capacity, so its overhead is minimal. However, CHM on a server with a large number of disks or IO devices and more CPUs and memory uses more CPU than CHM on a server that does not have many disks, CPUs, or much memory.

Does CHM on Multiple Node configurations (e.g. 4 to 8 nodes) have scaling concerns?

CHM functionality is designed to scale automatically with the cluster. While each node hosts an osysmond daemon, the ologgerd daemon services multiple osysmonds. Should a cluster grow large enough, another ologgerd daemon is spawned to manage the increased load. The user is responsible for increasing the CHM data repository size as nodes are added to ensure that sufficient retention time is maintained. The recommended retention time is 72 hours.

Will CDB and PDB result in any new information or special conditions using CHM?

As CHM is collecting OS metrics there currently are no CDB or PDB specific metrics collected. There are currently no special conditions that are triggered when hosting a multitenant (CDB) database.

How much disk space is needed for the Cluster Health Monitor?

The Cluster Health Monitor takes up 1 GB of space by default on all nodes in the cluster. The approximate amount of data collected is 0.5 GB per node per day. The repository can be enlarged to collect and save up to 3 days of data, which increases the disk usage accordingly.

How do I find out the size of data collected and saved by the Cluster Health Monitor in my system?

“oclumon manage -get repsize” will show the size in seconds.
To estimate the space required, use the following formula:

# of nodes * 720MB * 3 = Size required for 3 days retention  
eg. for 4 node cluster: 4 * 720 * 3 = 8,640MB (8.4GB)
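
The formula above is simple enough to script. The following sketch uses hypothetical function names (chm_repo_mb, chm_repo_gb) and takes the 720 MB per node per day figure directly from the formula; the number of days defaults to 3.

```shell
# nodes * 720 MB * days, as in the formula above; prints megabytes.
chm_repo_mb() {
    nodes=$1
    days=${2:-3}              # default to 3 days of retention
    echo $((nodes * 720 * days))
}

# Same estimate expressed in gigabytes with one decimal place.
chm_repo_gb() {
    awk -v mb="$(chm_repo_mb "$@")" 'BEGIN { printf "%.1f\n", mb / 1024 }'
}
```

For the 4 node example, chm_repo_mb 4 prints 8640 and chm_repo_gb 4 prints 8.4.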

How can I increase the size of the Cluster Health Monitor repository?

“oclumon manage -repos resize ”. Setting the value to 259200 collects and saves the data for 72 hours (3 days). It is recommended to set 72 hours of retention based on the above formula. This space needs to be available on all nodes in the cluster. Please resize the repositories, or move them if necessary, in order to achieve 72 hours of retention.

On what platforms can I run the Cluster Health Monitor?

11.2.0.1 and earlier: Linux only (download from OTN) 
11.2.0.2: Solaris (Sparc 64 and x86-64 only), and Linux. 
11.2.0.3: AIX, Solaris (Sparc 64 and x86-64 only), Linux, and Windows.

Cluster Health Monitor is NOT available for any Itanium platform such as Linux Itanium and Windows Itanium.

What steps are needed to install 11.2.0.2 when the Cluster Health Monitor from OTN is already running?

Remove the Cluster Health Monitor from OTN before upgrading the CRS or installing Grid Infrastructure.

Where is the Cluster Health Monitor from OTN installed on Linux?

$CRF_HOME is set to /usr/lib/oracrf on Linux by default if the Cluster Health Monitor is from OTN. This is the Cluster Health Monitor home location.

What logs and data should I gather before logging an SR for a Cluster Health Monitor error?

1) 3-4 pstack outputs taken over a minute for osysmond.bin
2) output of strace -v for osysmond.bin for about 2 minutes
3) output of strace -cp for about 2 minutes
4) oclumon dumpnodeview -v output for that node for 2 minutes
5) output of "uname -a"
6) output of "ps -eLf | grep osysmond.bin"
7) the ologgerd and sysmond log files in the CRS_HOME/log/ directory from all nodes

How do I increase the trace level of the Cluster Health Monitor?

Increase the log level for the daemons using,
oclumon debug log all allcomp:

The higher the trace level, the more detailed the tracing, so do not forget to reset the trace level back to 1 (the default trace level when the CHM is first installed) by issuing "oclumon debug log all allcomp:1"

Can I use procwatcher to get the pstack of the Cluster Health Monitor regularly?

Procwatcher version 030810 can now be used to monitor IPD procs. Just add the proc names to the CLUSTERPROCS list. The change is that Procwatcher is now smarter about picking the path of the executable so now it can find the IPD daemons if it is looking for them.

What are the processes and components for the Cluster Health Monitor?

Cluster Logger Service (Ologgerd) – there is a master ologgerd that receives the data from other nodes and saves them in the repository (Berkeley database). It compresses the data before persisting to save the disk space. In an environment with multiple nodes, a replica ologgerd is also started on a node where the master ologgerd is not running. The master ologgerd will sync the data with replica ologgerd by sending the data to the replica ologgerd. The replica ologgerd takes over if the master ologgerd dies. A new replica ologgerd starts when the replica ologgerd dies. There is only one master ologgerd and one replica ologgerd per cluster. 

System Monitor Service (Sysmond) – the sysmond process collects the system statistics of the local node and sends the data to the master ologgerd. A sysmond process runs on every node and collects the system statistics including CPU, memory usage, platform info, disk info, nic info, process info, and filesystem info.

To find the master ologgerd, one can use the following command:
oclumon manage -get master
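
In a script, the node name can be pulled out of that command's output. The sketch below assumes the output contains a line of the form "Master = rac2", as in the example earlier in this post; chm_master_from_output is a hypothetical helper name.

```shell
# Extract the master node name from "Master = <node>" read on stdin.
chm_master_from_output() {
    sed -nE 's/^[[:space:]]*Master[[:space:]]*=[[:space:]]*([^[:space:]]+).*$/\1/p'
}

# Example use on a cluster node:
#   oclumon manage -get master | chm_master_from_output
```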

What is oclumon?

OCLUMON command-line tool - use oclumon command line to query the CHM repository to display node-specific metrics for a specified time period. 

You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period. These states are based on predefined thresholds for each resource metric and are denoted as red, orange, yellow, and green, indicating decreasing order of criticality.

What are the *.bdb, _db.*, *.ldb, and log.* files that the tool creates in the BDB (Berkeley Database) location directory?

*.bdb & _db.* - These are the files created for the Berkeley DB, which stores the collected data.

log.* - These are Berkeley DB log files, which record changes before they are applied to the db files. Checkpointing is set up, and the log files are reused.

*.ldb - This is the local logging file and MUST be present on all servers.

Do not delete the above files unless you are trying to reduce the size of a bdb file that has grown large. To reduce the size of the bdb file, refer to the question "How can you reduce the size of bdb file that became big for any reason?" in this document.

Because it takes many days / weeks to resolve a problem like the node reboot or performance degradation, is there any way to keep the Cluster Health Monitor data for that long so that it can be replayed any time later when needed?

The Cluster Health Monitor is designed to store up to 3 days of data, as best it can, by increasing the size of the repository up to 2GB. If you want to keep data for longer than that, one way is to zip the output from ‘oclumon dumpnodeview’ or ‘diagcollection’ regularly (for example, every hour).

Before 12.1.0.2, another way is to archive the whole BDB regularly (like every day) by making a copy of BDB file in the BDB location directory.

The way that CHMOS reads archived BDB is to start it in debug mode. It starts by using
ologdbg -d  
After it starts, issue the oclumon dumpnodeview to get the data from the archived BDB.
For example, issue
oclumon dumpnodeview -n -s -e -v

Where is the location for the log files for the Cluster Health Monitor from OTN (pre 11.2.0.2)?

Check the directory /usr/lib/oracrf/log/* for the alert.log and the other subdirectories for each daemon's (SYSMOND, LOGGERD, OPROXYD) log.

How do I fix the problem that the time in the oclumon report is in UTC time zone instead of the time zone of my server?

The time in the repository is in UTC, and by default oclumon shows the time in UTC. According to the README, oclumon shows UTC if ORACRF_TZ is not set, so setting ORACRF_TZ should fix the time zone issue.

Can I install CHM from OTN on 11.2.0.2? What if I stop and disable CHM resource (ora.crf) on 11.2.0.2?

You cannot install CHM from OTN if there is any conflicting install, so installing CHM from OTN on servers that have 11.2.0.2 Grid Infrastructure will not work. Disabling the CHM resource (ora.crf) on 11.2.0.2 still keeps the installation; hence, the OTN install will fail.

Where is the trace file for a client like oclumon? How do I increase the trace level for oclumon?

The 'log' file for oclumon is in log//clients/oclumon.log.

Generally it is not generated because, at log level 0, there is no log data.
To see logs at a higher log level, do the following:
1. oclumon [Enter the interactive mode]
2. query> debug log all allcomp:3

After this, any command execution will produce finer logs in oclumon.log

Can the Directory path to the CHM Repository be same on all nodes if shared storage is used?

The CHM repository can be placed on shared storage under the same directory, although this is not recommended, partly for performance reasons. In that case, each node's repository is located in a subdirectory named after its hostname.

How much data (how long in time) does a node store locally when it cannot communicate with the master?

The local repository that a node uses to save CHM data when it cannot communicate with the master is small.

With a sampling interval of 1 second, it ideally holds around 1 hour of data. With 11.2.0.3, the sampling interval moved to 5 seconds, so in that case 4-5 hours of data can be retained.

How often does CHM collect the system metric data? Can this be changed?

In pre-11.2.0.3, the CHM collection interval is usually once a second, but this can change depending on the amount of data being collected. In 11.2.0.3, the CHM collection interval changed to once every 5 seconds.

Currently, the collection interval cannot be changed.

What is the default CHM retention time? 

In the pre-11.2.0.2 CHM available from OTN, the default data retention time was 24 hours.

In 11.2.0.2, the retention time is determined by the repository size. The default size has changed to 1GB, so the default retention time depends on how large the cluster is. For example, it is usually 6.9 hours for a one-node cluster when the sampling interval is 1 second. Please issue "oclumon manage -get repsize" to find out the retention time of your cluster. The output is in seconds.

With the sampling interval moving to 5 seconds in 11.2.0.3, the retention time becomes 5 times the retention time at a sampling interval of 1 second.

It is recommended to set a retention time of 72 hours.
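
Since the repsize output is in seconds, a small filter can turn it into hours and flag values below the 72 hour recommendation. This is a sketch only: chm_retention_hours is a hypothetical name, and it assumes the output line has the form "CHM Repository Size = 68082" shown earlier in this post.

```shell
# Read "CHM Repository Size = <seconds>" from stdin and report the
# retention in hours, flagging values below 259200 seconds (72 hours).
chm_retention_hours() {
    awk -F= '/CHM Repository Size/ {
        secs = $2 + 0                      # numeric part after the "=" sign
        printf "%.1f hours", secs / 3600
        if (secs < 259200) printf " (below the recommended 72 hours)"
        printf "\n"
    }'
}

# Example use on a cluster node:
#   oclumon manage -get repsize | chm_retention_hours
```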

How can you reduce the size of bdb file that became big for any reason?

You can manage repository size in terms of space using below command. This feature is present from 11.2.0.3.

oclumon manage -repos changesize .

As a temporary workaround, you can kill ologgerd and delete the contents of the BDB directory. osysmond should respawn ologgerd, and a new bdb file will be created. The past data is lost when this is done.

Please note the minimum size must be >= 1024 MB (1 GB), otherwise CRS-9100 "Error setting Cluster Health Monitor repository size" will be reported.

Can you set up CHM to run locally on each node?

On OTN, one can do that by installing CHM on each node independently although it is not recommended.

The Cluster Health Monitor that comes with the Grid Infrastructure install image must run with only one master ologgerd, so it cannot be set up to run locally on each node.

Can CHM be used on a single node non-RAC server?

The CHM available on OTN can be used on a single node non-RAC server, but only the Linux and Windows versions of CHM are available from OTN. The CHM that comes with GI in 11.2 and higher must run with GI (RAC).

How do I start and stop CHM that is installed as part of GI in 11.2 and higher?

The ora.crf resource in 11.2 GI (and higher) is the resource for CHM, and the ora.crf resource is managed by ohasd. Starting and stopping ora.crf resource starts and stops CHM. 

To stop CHM (or ora.crf resource managed by ohasd) 
$GRID_HOME/bin/crsctl stop res ora.crf -init 

To start CHM (or ora.crf resource managed by ohasd) 
$GRID_HOME/bin/crsctl start res ora.crf -init


 

Database - RAC/Scalability Community

To discuss this topic further with Oracle experts and industry peers, we encourage you to review, join or start a discussion in the My Oracle Support Database - RAC/Scalability Community

 


 

How to relocate CHM repository and increase retention time (Doc ID 2062234.1)

In this Document


Goal

Solution
  11.2
  12.1

References


APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.

GOAL

CHM data often ages out if it is not collected in time. This note provides steps to increase the retention time, which is strongly recommended.

SOLUTION

11.2

In 11.2, the CHM repository is in the Grid home. To change the retention time:

$ /bin/oclumon manage -repos resize 259200
racnode1 --> retention check successful
racnode2 --> retention check successful
New retention is 259200 and will use 4525424640 bytes of disk space

CRS-9115-Cluster Health Monitor repository size change completed on all nodes.

Done

Note: the command line specifies for how many seconds to retain the data and it's recommended to be at least 259200 which is 3 days.

 

In case there's insufficient amount of space in Grid home, relocate CHM data with the following command:

$ /bin/oclumon manage -repos reploc /home/grid/chm
racnode1 --> Ready to commit new location
racnode2 --> Ready to commit new location
New retention is 259200 and will use 4525424640 bytes of disk space

CRS-9113-Cluster Health Monitor repository location change completed on all nodes. Restarting Loggerd.

Done

12.1

In 12c, the CHM repository is the GIMR, which is a database, so only the retention time can be changed. To change the retention time:

1. Check how much space is needed for the expected retention time:

$ /bin/oclumon manage -repos checkretentiontime 259200  
The Cluster Health Monitor repository is too small for the desired retention. Please first resize the repository to 3896 MB    

Note: the command line specifies for how many seconds to retain the data and it's recommended to be at least 259200 which is 3 days. The output tells that the repository needs to be at least 3896 MB for 3 days.

 

2. Change the repository size: 

$ /bin/oclumon manage -repos changerepossize 3896   
The Cluster Health Monitor repository was successfully resized.The new retention is 259200 seconds.    
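
The two 12.1 steps can be chained by reading the required size out of the checkretentiontime message. The sketch below assumes the message wording shown in step 1; the function names are hypothetical, and the second function only prints the follow-up command. What the tool prints when the repository is already large enough is not shown in this note, so in that case the helper simply prints nothing.

```shell
# Extract the size in MB from the "Please first resize the repository to N MB"
# message read on stdin.
chm_required_mb() {
    sed -nE 's/.*resize the repository to ([0-9]+) MB.*/\1/p'
}

# Print (do not run) the matching changerepossize command, if a size was found.
chm_changerepossize_cmd() {
    mb=$(chm_required_mb)
    [ -n "$mb" ] && echo "oclumon manage -repos changerepossize $mb"
}
```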

 

 

REFERENCES

NOTE:1589394.1  - How to Move/Recreate GI Management Repository to Different Shared Storage (Diskgroup, CFS or NFS etc)  

