Root cause analysis

To identify the root cause of problems, DESK doesn't depend solely on time correlation. It uses all of the OneAgent-delivered context information, such as topology, transaction, and code-level data, to identify events that share the same root cause. Using all available context information, DESK can pinpoint the root cause of problems in your application-delivery chain and thereby dramatically reduce the alert spam generated by individual incidents that originate from the same root cause.

Why time correlation alone isn't effective

Time correlation alone is ineffective in identifying the root cause of many performance problems. Consider, for example, a simple scenario in which a 'Service A' calls a 'Service B'. The first event in this problem evolution sequence is a slowdown on 'Service B'. The next event in the sequence is a slowdown on 'Service A'. In this case, time correlation seems to work quite well for indicating the root cause of the problem: the slowdown on 'Service B' led sequentially to the slowdown on 'Service A'. This is, however, a very simplistic problem.

What if the events in the problem evolution sequence are more nuanced and open to interpretation? What if, for example, 'Service A' has a long history of performance problems? With such knowledge it becomes impossible to say conclusively that the slowdown on 'Service A' was caused by the slowdown on 'Service B'. It may be that 'Service A' is simply experiencing another episode in its history of performance issues. Such subtleties make time correlation alone ineffective in conclusively pinpointing the root cause of many performance problems.
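The sketch below illustrates this limitation in a deliberately simplified form. The event records, the five-minute window, the history counts, and the function name are all illustrative assumptions, not part of DESK: a purely time-based correlator pairs any two events that occur close together, so a slowdown on 'Service B' followed by one on 'Service A' looks causal even when 'Service A' misbehaves regularly on its own.

```python
from datetime import datetime, timedelta

# Illustrative event records; a real monitoring system would supply these.
events = [
    {"service": "Service B", "kind": "slowdown", "time": datetime(2018, 6, 7, 7, 45)},
    {"service": "Service A", "kind": "slowdown", "time": datetime(2018, 6, 7, 7, 47)},
]

# Hypothetical history: slowdowns observed on each service in the last 30 days.
history = {"Service A": 14, "Service B": 0}

def correlate_by_time(events, window=timedelta(minutes=5)):
    """Naive time correlation: any event that follows another within the
    window is assumed to have been caused by it."""
    ordered = sorted(events, key=lambda e: e["time"])
    pairs = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later["time"] - earlier["time"] <= window:
            pairs.append((earlier["service"], later["service"]))
    return pairs

for cause, effect in correlate_by_time(events):
    print(f"Time correlation suggests: {cause} caused the slowdown on {effect}")
    # The conclusion is ambiguous if the 'effect' service misbehaves regularly.
    if history.get(effect, 0) > 10:
        print(f"  ...but {effect} had {history[effect]} unrelated slowdowns recently,")
        print("  so time correlation alone cannot confirm causation.")
```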

A context-aware approach for the detection of interdependent events

Once Davis (the DESK AI causation engine) identifies a problem in one of your application's components, it uses all monitored transactions (PurePath) to identify interdependencies between that problem and events on other components that occurred around the same time and within the dependent topology. All vertical topological dependencies are therefore analyzed automatically, as is the complete horizontal dependency tree.
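As a rough mental model only (not DESK's actual implementation; the call paths, the vertical-stack mapping, and the function names are assumptions made for illustration), the horizontal dependency tree can be derived from the call paths observed in monitored transactions, while vertical dependencies attach each service to the process and host it runs on:

```python
# Hypothetical, simplified topology model. Call paths observed in monitored
# transactions give the horizontal dependencies; each service also has a
# vertical stack (process, host) underneath it.
call_paths = [
    ["Application", "Service 1", "Service 2"],
    ["Application", "Service 1", "Service 3"],
]
vertical_stack = {
    "Service 1": ["Process 1", "Host 1"],
    "Service 2": ["Process 2", "Host 2"],
    "Service 3": ["Process 3", "Host 2"],
}

def horizontal_dependencies(entity, paths):
    """Collect every entity reachable from `entity` along the observed call paths."""
    reachable = set()
    for path in paths:
        if entity in path:
            reachable.update(path[path.index(entity) + 1:])
    return reachable

def dependency_closure(entity):
    """Horizontal callees plus the vertical stack underneath each of them."""
    closure = horizontal_dependencies(entity, call_paths)
    for service in list(closure):
        closure.update(vertical_stack.get(service, []))
    return closure

print(dependency_closure("Application"))
# e.g. {'Service 1', 'Service 2', 'Service 3', 'Process 1', 'Host 1', ...}
```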

The image below shows how DESK automatically analyzes all the vertical and horizontal topological dependencies for a given problem. In this example, an application exhibits abnormal behavior, but the underlying vertical stack shows no incidents. The automatic analysis follows all the transactions that were monitored for that application and detects a dependency on Service 1, which also exhibits abnormal behavior. In addition, the dependencies of Service 1 show abnormal behavior and are therefore part of the root cause of the overall problem. As shown in the example, the automatic root-cause detection includes all the relevant vertical stacks and ranks all root-cause contributors to determine which one has the most negative impact. DESK not only detects all the root-cause contributors, it also offers drill-downs at the component level so that you can analyze the root cause down to the code level, showing, for instance, failing methods within your service code or high GC activity on underlying Java processes.
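A minimal sketch of the ranking idea follows. The scoring formula (severity of the deviation weighted by the share of affected transactions) and the contributor data are illustrative assumptions, not the actual DESK algorithm; they only show what "ranking contributors by negative impact" can mean in practice.

```python
# Hypothetical root-cause contributors with an illustrative impact score.
contributors = [
    {"entity": "Service 1", "severity": 2.5, "affected_ratio": 0.60},
    {"entity": "Process 1", "severity": 1.2, "affected_ratio": 0.10},
    {"entity": "Host 2",    "severity": 4.0, "affected_ratio": 0.45},
]

def impact(contributor):
    """Illustrative impact score: deviation severity times share of affected transactions."""
    return contributor["severity"] * contributor["affected_ratio"]

# Rank the contributors so the one with the most negative impact comes first.
ranked = sorted(contributors, key=impact, reverse=True)
for c in ranked:
    print(f'{c["entity"]}: impact score {impact(c):.2f}')
```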

Problems are seldom one-time events; they usually appear in regular patterns and are often symptoms of larger issues within your environment. If any other entities that depend on the same components also experienced problems around the same time, those entities become part of the problem's root cause analysis as well. When DESK detects an interdependence between a service problem and other monitored events, it shows you the details of that interdependence and the related root cause analysis.

Drill down to code-level details of a detected root-cause component

On the problem overview page, click the component tile within the Root cause section to navigate to the component's infographics page. You will see the relevant service, host, or process overview page in the context of the actual problem you're analyzing.

The example below presents a typical problem overview page that shows two root-cause contributors: a service called CheckDestination with degraded response time, and an underlying host that is experiencing CPU saturation.

Opening a component overview page within the context of a problem gives you specific navigational hints about the violating metrics or the detected issues on the component in focus. The image below shows the host entity page with a navigational hint to review the CPU metric.

In the case of a high CPU event on a host, you can drill down further to the list of consuming processes on that host to find out which processes are the main contributors.
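Outside of DESK, you can reproduce the same kind of drill-down manually. The short script below (shown purely as an illustration, not as part of the product) uses the third-party psutil package to list the processes consuming the most CPU on a host:

```python
import time
import psutil  # third-party package: pip install psutil

# Prime the per-process CPU counters, then sample again after a short interval.
for proc in psutil.process_iter():
    try:
        proc.cpu_percent(interval=None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
time.sleep(1.0)

samples = []
for proc in psutil.process_iter(["pid", "name"]):
    try:
        samples.append((proc.cpu_percent(interval=None), proc.info["pid"], proc.info["name"]))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

# Print the top five CPU consumers, mirroring the drill-down described above.
for cpu, pid, name in sorted(samples, reverse=True)[:5]:
    print(f"{cpu:5.1f}%  pid={pid}  {name}")
```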

Visual resolution path

If several components of your infrastructure are affected, a Visual resolution path is included in the Root cause section (see the example above). The visual resolution path provides an overview of the part of your topology that has been affected by the problem. If you click the visual resolution path tile, you're presented with an enlarged view of the resolution path along with the Replay tab on the right (see the image below). This tab enables you to replay the problem's lifespan in detail by clicking the play arrow at the top.

In the example below, you can see that the problem spiked between 8:00 and 9:00. The list of events beneath the diagram includes all the events that occurred within the highlighted interval (i.e., 2018-06-07 07:45 - 08:00), grouped by their respective entities. If you click the small arrow next to the name of an entity (for example, next to MicroJourneyService), you open the entity overview page, where you can follow the navigational hints for further analysis.
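Conceptually, the list under the replay diagram is simply the problem's events filtered to the highlighted interval and grouped by entity. A minimal sketch of that filtering follows; the event data, entity names, and function name are made up for illustration and do not reflect the DESK API.

```python
from collections import defaultdict
from datetime import datetime

# Illustrative events belonging to one problem.
events = [
    {"entity": "MicroJourneyService", "event": "Response time degradation",
     "time": datetime(2018, 6, 7, 7, 48)},
    {"entity": "Host 1", "event": "CPU saturation",
     "time": datetime(2018, 6, 7, 7, 52)},
    {"entity": "MicroJourneyService", "event": "Failure rate increase",
     "time": datetime(2018, 6, 7, 8, 20)},
]

def events_in_interval(events, start, end):
    """Keep only the events inside the highlighted replay interval, grouped by entity."""
    grouped = defaultdict(list)
    for e in events:
        if start <= e["time"] < end:
            grouped[e["entity"]].append(e["event"])
    return dict(grouped)

interval = (datetime(2018, 6, 7, 7, 45), datetime(2018, 6, 7, 8, 0))
for entity, names in events_in_interval(events, *interval).items():
    print(entity, "->", names)
```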

For further reading on root cause analysis, see the root cause analysis use case provided below: