RAC Database Hang Analysis

---Node: knewracn1 Clock: '13-03-02 06.20.04' SerialNo:178224

---SYSTEM:

#pcpus: 1 #vcpus: 2 cpuht: Y chipname: Intel(R) cpu: 7.97 cpuq: 2 physmemfree: 441396 physmemtotal: 5019920 mcache: 2405048 swapfree: 11625764 swaptotal: 12583912 hugepagetotal: 0 hugepagefree: 0 hugepagesize: 2048 ior: 93 iow: 242 ios: 39 swpin: 0 swpout: 0 pgin: 90 pgout: 180 netr: 179.471 netw: 124.380 procs: 305 rtprocs: 16 #fds: 26144 #sysfdlimit: 6815744 #disks: 5 #nics: 4 nicErrors: 0

TOP CONSUMERS:

topcpu: 'gipcd.bin(4205) 5.79' topprivmem: 'ovmd(719) 214072' topshm: 'ora_ppa7_knewdb(27372) 841520' topfd: 'ovmd(719) 1023' topthread: 'crsd.bin(4415) 48'

CPUS:

cpu0: sys-4.94 user-3.10 nice-0.0 usage-8.5 iowait-10.93

cpu1: sys-5.14 user-2.74 nice-0.0 usage-7.88 iowait-4.68

PROCESSES:

name: 'ora_smco_knewdb' pid: 27360 #procfdlimit: 65536 cpuusage: 0.00 privmem: 2092 shm: 17836 #fd: 26 #threads: 1 priority: 20 nice: 0 state: S

name: 'ora_gtx0_knewdb' pid: 27366 #procfdlimit: 65536 cpuusage: 0.00 privmem: 2028 shm: 17088 #fd: 26 #threads: 1 priority: 20 nice: 0 state: S

name: 'ora_rcbg_knewdb' pid: 27368 #procfdlimit: 65536 cpuusage: 0.00 privmem: ...

In this section, we explore how to invoke and interpret a hang analysis dump to diagnose a RAC database that appears hung, slow, or blocked. When a database is running unacceptably slowly, is hung because of an internal deadlock or latch contention, or is suffering from a prolonged blocking lock that hurts overall performance, it is advisable to perform a hang analysis, which helps greatly in identifying the root cause of the problem. The following commands show how to invoke the hang analysis:

$ sqlplus / as sysdba
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug setinst all              -- enables cluster-wide hang analysis
SQL> oradebug -g all hanganalyze 3     -- 3 is the most commonly used level

<< wait for a couple of minutes >>

SQL> oradebug -g all hanganalyze 3

The hang analysis level can be set to a value from 1 to 5, or to 10. When hanganalyze is invoked, the diagnostic information is written to a dump file under $ORACLE_BASE/diag/rdbms/dbname/instance_name/trace, which can be used to troubleshoot the problem.
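The commands and the trace location above can be sketched in a small helper. This is only an illustrative sketch: the function names build_hanganalyze_script and trace_dir are hypothetical, the set of accepted levels is taken from the sentence above, and the sample values (/u00/app/oracle, rondb, RONDB1) are the ones that appear in the trace file name later in this section. It only builds text; feeding the script to a SYSDBA SQL*Plus session is left to the reader.

```python
import posixpath

# Levels named in the text: 1 through 5, and 10.
VALID_LEVELS = {1, 2, 3, 4, 5, 10}


def build_hanganalyze_script(level=3):
    """Return the oradebug commands for one cluster-wide hang analysis pass.

    Hypothetical helper: it only assembles the command text shown in this
    section and does not connect to a database.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported hanganalyze level: {level}")
    return "\n".join([
        "oradebug setmypid",
        "oradebug unlimit",
        "oradebug setinst all",
        f"oradebug -g all hanganalyze {level}",
    ])


def trace_dir(oracle_base, db_name, instance_name):
    """Build $ORACLE_BASE/diag/rdbms/<db_name>/<instance_name>/trace."""
    return posixpath.join(
        oracle_base, "diag", "rdbms", db_name, instance_name, "trace"
    )


if __name__ == "__main__":
    print(build_hanganalyze_script(3))
    print(trace_dir("/u00/app/oracle", "rondb", "RONDB1"))
```

Because a single dump is only a snapshot, the same script would typically be run twice, a couple of minutes apart, and the two dumps compared.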

We built the following test case to create a blocking scenario in a RAC database and demonstrate the procedure in practice. We then interpret the trace file to show how its contents help troubleshoot the issue. The following steps were performed as part of the test scenario:

Create an EMP table:

SQL> create table emp (eno number(3),deptno number(2), sal number(9));

Load a few records in the table. From instance 1, execute an update statement:

SQL> update emp set sal=sal+100 where eno=101; -- no commit performed

From instance 2, execute an update statement for the same record to develop a blocking scenario:

SQL> update emp set sal=sal+200 where eno=101;

At this point, the session on instance 2 is hanging and the cursor doesn’t return to the SQL prompt, as expected.

Now, from another session, run the hang analysis as follows:

SQL> oradebug setmypid
Statement processed.

SQL> oradebug setinst all
Statement processed.

SQL> oradebug -g all hanganalyze 3     << level 3 is suitable in most circumstances >>
Hang Analysis in /u00/app/oracle/diag/rdbms/rondb/RONDB1/trace/RONDB1_diag_6534.trc

Let’s walk through the trace file and interpret its contents to identify the blocking session (the holder) and the session waiting on it. Here is an excerpt from the trace file:

Node id: 1

List of nodes: 0, 1, << nodes (instance) count >>

*** 2012-12-16 17:19:18.630

===============================================================================

HANG ANALYSIS:

instances (db_name.oracle_sid): rondb.rondb2, rondb.rondb1

oradebug_node_dump_level: 3 << hang analysis level >>

analysis initiated by oradebug

os thread scheduling delay history: (sampling every 1.000000 secs)

0.000000 secs at [ 17:19:17 ]

Chains most likely to have caused the hang:

[a] Chain 1 Signature: 'SQL*Net message from client'<='enq: TX - row lock contention'

Chain 1 Signature Hash: 0x38c48850

which is waiting for 'SQL*Net message from client' with wait info:

{

p1: 'driver id'=0x62657100

p2: '#bytes'=0x1

time in wait: 27.311965 sec

timeout after: never

wait id: 131

blocking: 1 session

*** 2012-12-16 17:19:18.725

State of ALL nodes

([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):

[102]/2/103/1243/c0000000e4ae9518/12250/NLEAF/[262]

[262]/1/14/125/c0000000d4a03f90/8047/LEAF/

*** 2012-12-16 17:19:47.303

===============================================================================

HANG ANALYSIS DUMPS:

oradebug_node_dump_level: 3

===============================================================================

State of LOCAL nodes

([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):

[102]/2/103/1243/c0000000e4ae9518/12250/NLEAF/[262]

===============================================================================

END OF HANG ANALYSIS

===============================================================================

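In the preceding example, the node lines follow the field order given in the header ([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]). Node [102] is SID 103 on instance 2; its state is NLEAF, and its adjacency list points to node [262], which is SID 14 on instance 1 and is a LEAF node. In other words, SID 103 on instance 2 is blocked by SID 14 on instance 1. Upon identifying the holder, either complete its transaction (commit or roll back) or terminate the session to release the lock.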
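Reading the node lines by hand becomes tedious when a trace lists many sessions. The following is a minimal sketch of how the lines could be parsed, assuming the field order printed in the trace header and treating cnode as the instance number, as the interpretation in this section does. The function names parse_node_line and blocking_pairs are hypothetical and are not part of any Oracle tool.

```python
import re

# Field order taken from the trace header:
# [nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]
NODE_RE = re.compile(
    r"^\[(\d+)\]/(\d+)/(\d+)/(\d+)/([0-9a-fA-Fx]+)/(\d+)/([A-Z_]+)/(.*)$"
)


def parse_node_line(line):
    """Parse one node line into a dict, or return None if it is not one."""
    match = NODE_RE.match(line.strip())
    if match is None:
        return None
    nodenum, cnode, sid, serial, session, ospid, state, adj = match.groups()
    return {
        "nodenum": int(nodenum),
        "cnode": int(cnode),      # assumed to be the instance number
        "sid": int(sid),
        "serial": int(serial),
        "session": session,
        "ospid": int(ospid),
        "state": state,
        # The adjacency list names the node(s) this session is waiting on.
        "adjlist": [int(n) for n in re.findall(r"\[(\d+)\]", adj)],
    }


def blocking_pairs(lines):
    """Return (waiter, blocker) pairs for every adjacency that resolves."""
    nodes = {}
    for line in lines:
        node = parse_node_line(line)
        if node is not None:
            nodes[node["nodenum"]] = node  # repeated lines collapse here
    pairs = []
    for node in nodes.values():
        for target in node["adjlist"]:
            if target in nodes:
                pairs.append((node, nodes[target]))
    return pairs


if __name__ == "__main__":
    excerpt = [
        "([nodenum]/cnode/sid/sess_srno/session/ospid/state/[adjlist]):",
        "[102]/2/103/1243/c0000000e4ae9518/12250/NLEAF/[262]",
        "[262]/1/14/125/c0000000d4a03f90/8047/LEAF/",
    ]
    for waiter, blocker in blocking_pairs(excerpt):
        print(
            f"SID {waiter['sid']} (instance {waiter['cnode']}, {waiter['state']}) "
            f"waits on SID {blocker['sid']} "
            f"(instance {blocker['cnode']}, {blocker['state']})"
        )
```

For the two node lines in the excerpt, this reports a single pair: the NLEAF node [102] waiting on the LEAF node [262]. Header lines and other trace text are skipped because they do not match the node pattern.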

It is sometimes advisable to take a SYSTEMSTATE dump along with the HANGANALYZE dump to generate more detailed diagnostic information for identifying the root cause of the issue. Depending on the level used for the SYSTEMSTATE dump, the cursor might take a very long time to return to the SQL prompt. The trace file details can also be found in the database alert.log file.

You shouldn’t generate a SYSTEMSTATE dump under normal circumstances; do so only when the database has a serious problem or when Oracle Support asks for one to troubleshoot a serious issue. A SYSTEMSTATE dump tends to produce a very large trace file, and under unpredictable circumstances it can cause an instance crash.

In addition, Oracle provides the HANGFG tool to automate the collection of systemstate and hang analysis dumps for both non-RAC and RAC database environments. You need to download the tool from My Oracle Support (previously known as MetaLink). Once you invoke the tool, it generates a couple of output files, named hangfiles.out and hangfg.log, under the $ORACLE_BASE/diag/rdbms/database/instance_name/trace location.

Summary

This chapter discussed the architecture and components of the Oracle Clusterware stack, including the updates in Oracle Clusterware 12cR1. We will talk about some other new Oracle Clusterware features introduced in Oracle 12cR1 in Chapter 4.

This chapter also discussed tools and tips for Clusterware management and troubleshooting. Applying the tools, utilities, and guidelines described in this chapter, you can diagnose many serious cluster-related issues and address Clusterware stack startup failures. In addition, you have learned how to modify the default tracing levels of various Clusterware daemon processes and their subcomponents to obtain detailed debugging information to troubleshoot various cluster-related issues. In a nutshell, the chapter has offered you the essential cluster management and troubleshooting concepts and skills that will help you in managing a medium-scale or large-scale cluster environment.

From the document Expert Oracle RAC 12c (pages 67-71)