Root cause analysis - Oracle Enterprise Manager Grid Control 11gR1: Business Service Management

Root cause analysis is a process of solving a problem by identifying the source of origin of that specific problem. This is based on the fact that a problem in a source can have potential impact in other related areas and the dependent problems will also be fixed in most cases by tackling it at the source. In systems management parlance, the process of determining the component that is responsible for any abnormal behavior

Chapter 4

[ 161 ]

that is visible is known as root cause analysis. For instance, a performance issue seen in a database instance can potentially impact the performance of all the applications associated with the database. So by fixing the performance problem in the database, the overall performance degradation can be fixed.

OEM Grid Control provides various diagnostic features in determining a specific performance issue. In the services arena, OEM Grid Control provides root cause analysis to quickly determine the underlying problem in a business service. In root cause analysis, the Oracle Management Server attempts to figure out the possible causes of a service failure. Root cause analysis is associated with the availability of a service and is enabled whenever the service status goes DOWN.

When a service target that is defined based on a system target goes DOWN, the root cause analysis checks the availability state of all the key components. It readily identifies the component targets that are down and points them as possible causes of service failure. In addition, the root cause analysis also executes certain tests on the component targets whenever the service target is DOWN. These component tests indicate if the performance of the component target has degraded even if the availability of the component target is UP. These tests are usually based on the metrics of the component targets and some thresholds. For example, if the number of stuck threads in an Oracle WebLogic Server is higher than a threshold value of three, then that server could be considered as hung and could be enlisted as one of the possible causes behind the service failure. Similarly, the tests can be specified on the hosts on which these component targets run. In general, when the service target is DOWN, the root cause analysis Configuration executes the tests configured for each of the key components or the hosts running the key components. If the key component is down or if any test on the component or the underlying target fail, then this component is considered as one of the possible causes behind the service failure.

Configuration

OEM Grid Control supports two modes of root cause analysis:

Automatic mode: This is the default mode of root cause analysis. In this mode, the Oracle Management Server attempts to find the possible causes of a service failure automatically. It also stores the results of the automatic root cause analysis for future reference even after the service failure is fixed.

Manual mode: In this mode, the analysis is not performed automatically.

The system administrator needs to manually analyze the various parameters views to identify the possible causes of a service failure. Oracle Management Server provides all the relevant monitoring data that can be handy in

determining the root causes here. Manual Mode is preferred if there are too many service targets configured. This mode reduces the load on the OMS.

•

As the Automatic mode performs the root cause analysis automatically and stores the results even after the service failure is rectified, it is the recommended mode of root cause analysis.

The root cause analysis configuration can be specified as either Automatic or Manual in the Root Cause Analysis Configuration page. This page can be reached from the Monitoring Configuration page by clicking on the Root Cause Analysis Configuration link.

The previous image is a screen shot of the Root Cause Analysis Configuration page for the generic service CheckOutService in OEM Grid Control. As can be seen in the previous screenshot, this page provides different configuration options for the root cause analysis for the CheckOutService. In the Analysis Mode section, the current analysis mode is displayed. In this case, the current analysis mode is shown as Automatic. The root cause analysis can also be made manual by clicking on the Set Mode Manual button. If the current configuration is Manual, a button Set Mode Automatic is provided.

Chapter 4

[ 163 ]

The Root Cause Analysis Configuration also displays that the service

CheckOutService is based on the CheckOutSystem system target and enlists the key components associated with it in a tabular form in the Component Tests section. This table displays the number of additional tests configured for each of the component targets. It also provides a link to configure component tests by clicking on the hyperlink provided in the Component Tests column for each key component.

In the image shown here, the key component of type Oracle WebLogic Server target associated with this service has an additional component test configured. Whenever the CheckOutService is DOWN, if this test fails or if the component target is down, then this WebLogic Server will be considered as one of the possible causes of the service failure.

The previous image is a screenshot of the Component Tests page within the root cause analysis configuration. This page can be reached by clicking on any one of the Component Tests link in key components table in the Root Cause Analysis Configuration page. The previous image displays the component test configured for the key component of type Oracle WebLogic Server in the User-Defined Tests section. As can be seen, the test indicates that if the Stuck Threads in the server is higher than the threshold of 2, then this component can be considered as a possible root cause. This page also displays the number of component tests enabled for the host target on which, this component is running.

The User-Defined Tests section also provides various management controls such as Add, Edit, and Remove to manage the Component Tests. Clicking on the Add button navigates to the Add Component Test page. This page provides the list of all metrics for the component target from which, the desired metric can be chosen. Once the metric is chosen for the test, the user is navigated to the Add Component Test:

Set Object and Threshold page where the metric thresholds can be specified.

The Component Tests page also provides a link that drills down to the Component Tests configuration page for the component target.

This offers a quick way of adding component tests on the host targets.

The previous image is a screenshot of the Add Component Test: Set Object and Threshold page. This page can be navigated by clicking on the Add button in the Component Tests page and then selecting the required metric in the Add Component Test page. In this page, the user can reuse the critical severity threshold configured within the component target for alerts. The user can also specify an alternate threshold different from the severity threshold. By clicking on the Finish button, the new component tests are added for the key components of the service.

Whenever the service is Down, these tests are also executed to determine the possible causes of failure of the service.

Chapter 4

[ 165 ]

Service failure: Diagnostics using root cause

Dans le document Oracle Enterprise Manager Grid Control 11gR1: Business Service Management (Page 177-182)