Thursday, July 1, 2010

Diagnosing Oracle Clusterware Node evictions (Diagwait)

Oracle Clusterware evicts a node from the cluster when:
  • The node is not pinging via the network heartbeat
  • The node is not pinging the Voting disk
  • The node is hung/busy and is unable to perform either of the earlier tasks
In most cases when a node is evicted, information is written to the logs that can be used to analyze the cause of the eviction. In certain cases, however, this information may be missing. The steps documented in this note are for those cases where there is too little or no information to diagnose the cause of the eviction, on Clusterware versions earlier than 11gR2 (11.2.0.1).

How to set up diagwait
  1. Execute the following as root on both nodes:
    #crsctl stop crs
    #$CRS_HOME/bin/oprocd stop
  2. Ensure that the Clusterware stack is down on all nodes by executing the command below. Do not continue if any Clusterware processes are still running: proceeding to the next step while they are running will corrupt your OCR.

    #ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"
  3. From one node of the cluster, change the value of the "diagwait" parameter to 13 seconds by issuing the command below as root. Changing the Clusterware parameter diagwait to 13 is the Oracle-supported technique for changing the oprocd margin to 10 seconds. Note that 13 is the only allowed value for the diagwait parameter; any value other than 13 (or unset) is not allowed and not supported.

    #crsctl set css diagwait 13 -force
  4. Check that diagwait was set successfully by executing the following command. The command should return 13. If diagwait is not set, the following message is returned: "Configuration parameter diagwait is not defined"

    #crsctl get css diagwait
  5. Restart Oracle Clusterware on all the nodes by executing:

    #crsctl start crs
  6. Check that the nodes are running:

    #crsctl check crs
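
The two checks in steps 2 and 4 are easy to get wrong by eye, so here is a minimal shell sketch of how they could be scripted. The helper names clusterware_procs and diagwait_is_13 are made up for this example and are not Oracle tools. The sketch also assumes that "crsctl get css diagwait" prints nothing but the value when diagwait is set, which is how the note above describes it; adjust the comparison if your version prints more.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical, not part of Clusterware.

# Reads `ps -ef` style text on stdin and prints the lines that name one of
# the four daemons from step 2. The [x] bracket form keeps the pattern from
# matching a command line that merely contains the pattern text.
clusterware_procs() {
    grep -E '[c]rsd\.bin|[o]cssd\.bin|[e]vmd\.bin|[o]procd'
}

# Succeeds only if the argument, with all whitespace removed, is exactly 13.
# Assumption: `crsctl get css diagwait` prints just the value when it is set.
diagwait_is_13() {
    value=$(printf '%s' "$1" | tr -d '[:space:]')
    [ "$value" = "13" ]
}

# Step 2: refuse to go on while any Clusterware daemon is still listed.
if ! ps_out=$(ps -ef 2>/dev/null); then
    echo "Could not run ps; stack state unknown." >&2
elif printf '%s\n' "$ps_out" | clusterware_procs; then
    echo "Clusterware processes still running; do not continue." >&2
else
    echo "No Clusterware processes found."
fi

# Step 4: only attempted where crsctl is actually on the PATH.
if command -v crsctl >/dev/null 2>&1; then
    result=$(crsctl get css diagwait 2>&1)
    if diagwait_is_13 "$result"; then
        echo "diagwait is set to 13"
    else
        echo "diagwait is not 13; crsctl returned: $result" >&2
    fi
fi
```

The step 2 check has to pass on every node before step 3 is run, so the first half would need to be executed on each node in turn.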
Unsetting/Removing diagwait

Diagwait should not be unset without first fixing the OS scheduling issues, as that can lead to node evictions via reboot. Diagwait delays the node eviction (and reconfiguration) by diagwait (13) seconds. If diagwait does need to be removed, follow the steps above, but replace the command in step 3 with the following:

    #crsctl unset css diagwait
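
Since the set and unset procedures differ only at step 3, a small dry-run helper can make the order explicit. This is just a sketch: print_plan is a hypothetical name, it only echoes the commands from this note and runs nothing, and the "[verify]" label on steps 4 and 6 is my own, since the note does not say which node to run them on.

```shell
#!/bin/sh
# Sketch only: print_plan is a hypothetical helper that echoes the commands
# from this note in order. It does not execute any of them.
# Argument: "set" or "unset".
print_plan() {
    case "$1" in
        set)   step3='crsctl set css diagwait 13 -force' ;;
        unset) step3='crsctl unset css diagwait' ;;
        *)     echo "usage: print_plan set|unset" >&2; return 1 ;;
    esac
    echo '[all nodes] crsctl stop crs'
    echo '[all nodes] $CRS_HOME/bin/oprocd stop'
    echo '[all nodes] ps -ef |egrep "crsd.bin|ocssd.bin|evmd.bin|oprocd"'
    echo "[one node]  $step3"
    echo '[verify]    crsctl get css diagwait'
    echo '[all nodes] crsctl start crs'
    echo '[verify]    crsctl check crs'
}

print_plan set
```

Calling print_plan unset prints the same seven lines with only the step 3 command changed.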
