Next generation AI root-cause analysis

The next generation of the DESK AI engine delivers smarter, more precise answers along with an increased awareness of external data and events.

To enable the new AI engine, select Problems from the navigation menu. Then, click the Switch to next generation AI button on the in-product teaser.

Opt out

Once enabled, you can switch back and forth between the current and enhanced causation engines. This enables you to try the new AI engine without risk and to provide feedback before the next generation AI engine becomes the new standard.

To switch between the current and enhanced causation engines:

  1. Select Problems from the navigation menu.
  2. Click the Browse [...] menu in the upper-right corner.
  3. Click either Switch to new causation engine or Revert to previous AI engine.

Smarter and more precise root causes

Switching to the new causation engine provides several major improvements:

  • Metric and event-based detection of abnormal component state
    The new AI engine automatically checks all component metrics for suspicious behavior. This involves the near real-time analysis of thousands of topologically related metrics per component, including your own custom metrics.
  • Seamless integration of custom metrics within the DESK AI process
    You can integrate all kinds of metrics by writing custom plugins, through JMX, or by using the DESK REST API. The new AI causation engine seamlessly analyzes your custom metrics along with all affected transactions. It’s no longer necessary to define a threshold or to trigger an event for your custom metrics, as DESK AI automatically picks up metrics that display abnormal behavior.
  • Third-party event ingests
    While the current DESK AI doesn't consider external events when determining root causes, the new DESK AI seamlessly picks up any third-party event along the affected Smartscape topology.
  • Availability root-cause
    In many cases, the shutdown or restart of hosts or individual processes is the root cause of a detected problem. The newly introduced availability root-cause section summarizes all relevant changes in availability within the grouped vertical stack.
  • Grouped root-cause
    Until now, each problem details page presented root-cause candidates as individual components, even when the affected component was a single process or a subset of processes within a larger cluster. The improved root-cause section still displays up to three root-cause candidates, but those candidates are aggregated into groups of vertical topologies. This allows you to quickly review outliers within affected service instances or process clusters.

The following sections explain these improvements in greater detail.

Metric and event-based detection of abnormal component state

The original root-cause analysis depends on events to indicate the unhealthy state of a given component. Examples include a baseline-triggered slowdown event on a web service or a simple CPU saturation event on a host. DESK detects more than 100 event types on various topological components, raised either by automatic baselining or by thresholds. Whenever an event is triggered on a component, the AI root-cause analysis automatically collects all transactions (PurePaths) along the horizontal stack. The analysis continues whenever the horizontal stack shows that a called service is also marked unhealthy. With each hop on the horizontal stack, the vertical technology stack is also collected and analyzed for unhealthy states. This automatic analysis has proven to be far superior to manual analysis. The weakness that the enhanced root-cause analysis solves is that this approach depends entirely on single events.
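
To make this traversal concrete, the following minimal sketch walks a toy topology in the same spirit: follow the horizontal stack of called services while they are unhealthy and, at each hop, inspect the vertical technology stack. The data structures and names are illustrative assumptions, not DESK internals.

```python
from collections import deque

# Minimal sketch (not DESK's implementation) of event-driven root-cause
# traversal: walk the horizontal stack of called services and, at each
# unhealthy hop, inspect the vertical technology stack for open events.
service_calls = {            # horizontal stack: service -> called services
    "frontend": ["backend"],
    "backend": [],
}
vertical_stack = {           # vertical stack: service -> [process, host]
    "frontend": ["nginx-process", "host-web-1"],
    "backend": ["jvm-process", "linux-host"],
}
open_events = {"frontend", "backend", "jvm-process", "linux-host"}

def collect_root_cause_candidates(entry_service: str) -> list[str]:
    candidates, queue, seen = [], deque([entry_service]), set()
    while queue:
        service = queue.popleft()
        if service in seen or service not in open_events:
            continue  # only continue along unhealthy services
        seen.add(service)
        # Check the vertical stack of every unhealthy hop.
        for component in vertical_stack.get(service, []):
            if component in open_events:
                candidates.append(component)
        queue.extend(service_calls.get(service, []))
    return candidates

print(collect_root_cause_candidates("frontend"))
# ['jvm-process', 'linux-host'] -> the Linux host surfaces as root cause
```
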
As shown in the following image, an event is open on all unhealthy components, and DESK correctly detects the Linux host as the root cause:

Experience has shown that baselines and thresholds can't trigger an event in every abnormal situation. Let's modify the example above by removing one of the critical events from the affected topology. Assume that CPU usage on the Linux host spikes but misses the critical threshold, as shown below:

Because there is no event on the Linux host, the host appears healthy, and the old analysis wouldn't consider it part of the root cause. See the following vertical stack diagram and make a note of the Linux host, which no longer shows an open CPU event:

Compared to the situation above, we would detect the root cause on the backend service but wouldn't identify a root cause at the process or host level. In many cases, the root-cause section is simply empty, as shown in the following screen:

The overall vision of the next generation of the DESK AI engine is to solve this problem of not displaying a root cause in non-event scenarios. The considerations listed below led to the new approach within the enhanced AI root-cause analysis:

  • Every host produces around 400 different metric types and time series, depending on the number of processes and technologies running. That means 10K hosts result in 4,000,000 metrics in total.
  • Every threshold you set on a metric, and even the best automatic baseline observed over a period of time, produces roughly 1% false-positive alerts. One false-positive alert per host doesn't sound like much, but on 10K hosts it means 10,000 false alerts! With a growing number of metrics per component, we must expect a proportionally higher number of false positives, which leads to alert spam (see the back-of-the-envelope calculation after this list).
  • It’s obvious that additional or more aggressive thresholds, or even baselines, on all of those metrics are not a solution.
  • AI 2.0 solution: analyze all problem-related metrics proactively within the root-cause detection process.
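
The back-of-the-envelope calculation below makes the scale of the alert-spam problem concrete, using the numbers from the list above; treating every alert evaluation as independent is a simplifying assumption.

```python
# Back-of-the-envelope numbers from the list above.
hosts = 10_000
metrics_per_host = 400        # ~400 metric types/time series per host
false_positive_rate = 0.01    # ~1% false positives per threshold/baseline

total_metrics = hosts * metrics_per_host
print(f"{total_metrics:,} metrics in total")               # 4,000,000

# Even a single false positive per host is already alert spam:
print(f"{hosts:,} alerts at one false positive per host")  # 10,000

# With a threshold or baseline on every metric, false positives
# scale with the metric count instead of the host count:
print(f"{int(total_metrics * false_positive_rate):,} expected false alerts")  # 40,000
```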

To tackle the challenge of the growing number of metrics, the new root-cause analysis automatically checks all available metrics on all affected components. Suspicious metric behavior is detected by analyzing the historical distribution of a metric's values and comparing it with the current values. The new analysis therefore no longer depends on events and thresholds. If an event is present, or a user has defined a custom threshold, it is still included in the root-cause process.
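
The following minimal sketch illustrates the idea of comparing a metric's current values against its historical distribution. It uses a robust z-score (median and median absolute deviation), which is just one plausible choice; it is not DESK's actual algorithm.

```python
import statistics

def is_suspicious(history: list[float], current: list[float],
                  z_threshold: float = 3.0) -> bool:
    """Flag a metric whose current values deviate strongly from its
    historical distribution (illustrative robust z-score check)."""
    median = statistics.median(history)
    # Median absolute deviation as a robust estimate of spread.
    mad = statistics.median(abs(x - median) for x in history)
    if mad == 0:
        mad = 1e-9  # avoid division by zero on flat metrics
    robust_z = 0.6745 * (statistics.median(current) - median) / mad
    return abs(robust_z) > z_threshold

# Example: CPU usage hovered around 20% in the past, now sits near 90%.
past = [18.0, 22.0, 19.5, 21.0, 20.5, 19.0, 23.0, 20.0]
now = [88.0, 91.0, 90.5]
print(is_suspicious(past, now))  # True -> flagged without any event
```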

See how the new root-cause analysis would tackle such a missing root-cause scenario:

To sum up, the new root-cause analysis is based on a hybrid approach that can detect a root cause even if there is no open event on a component.

Seamless integration of custom metrics within the DESK AI process

The DESK platform allows the ingest of customer-defined metrics and events through plugins and the REST API. Plugins for third-party integrations can be a great source of additional root-cause information. An example is tight integration with your continuous integration and deployment toolchain, providing information about recent rollouts, responsible product owners, and possible remediation actions. The new analysis covers both kinds of ingested information: custom metrics as well as custom events sent from third-party integrations.

Let's focus on the analysis of custom metrics first, as its main functionality was already described in the previous section. Consider a specific JMX metric titled ‘Account creation duration’ that measures the time needed to create a new account. Once the JMX metric is registered and monitored, it becomes a first-class citizen within our root-causation engine. In the case of a problem affecting real users, the JMX metric is automatically analyzed. If it shows an abnormal distribution compared to the past, it is identified within the root cause, as shown below:
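
For illustration, here is what pushing such a custom metric through a REST API could look like. The endpoint path, payload fields, and timeseries ID below are hypothetical placeholders, not the documented DESK API contract.

```python
import time
import requests

# Hypothetical ingest of the 'Account creation duration' metric via the
# DESK REST API. Endpoint path, payload fields, and the timeseries ID
# are illustrative assumptions, not the documented API.
DESK_URL = "https://desk.example.com/api/v1/entity/infrastructure/custom/account-backend"
API_TOKEN = "YOUR_API_TOKEN"  # placeholder

payload = {
    "displayName": "Account backend",
    "series": [
        {
            "timeseriesId": "custom:account.creation.duration",
            "dataPoints": [[int(time.time() * 1000), 512.0]],  # [epoch ms, duration ms]
        }
    ],
}

response = requests.post(
    DESK_URL,
    json=payload,
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()  # the metric is then baselined automatically
```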

Third-party event ingests

External events are another information source that the enhanced AI engine analyzes during the root-cause detection process. Such events fall into one of the following categories:

  • semantically predefined events, such as deployment, configuration change, or annotation
  • generic events on each severity level, such as availability, error, slowdown, or resource
  • informational events

External events can also contain key-value pairs that add context information about the event. See the following example of a third-party deployment event that was sent through the REST event API and collected during the root-cause process:
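
As a sketch of what such an event push could look like, the snippet below sends a deployment event with key-value context. The endpoint, payload shape, and entity ID are assumptions for illustration, not the documented DESK event API.

```python
import requests

# Illustrative push of a third-party deployment event with key-value
# context; endpoint, payload shape, and entity ID are assumed.
DESK_URL = "https://desk.example.com/api/v1/events"
API_TOKEN = "YOUR_API_TOKEN"  # placeholder

event = {
    "eventType": "CUSTOM_DEPLOYMENT",           # semantically predefined type
    "deploymentName": "backend-service 1.4.2",
    "source": "Jenkins",
    "attachRules": {"entityIds": ["SERVICE-0000000000000001"]},
    "customProperties": {                       # free-form key-value context
        "ciJobUrl": "https://ci.example.com/job/backend/142",
        "owner": "team-payments",
        "remediationAction": "roll back to 1.4.1",
    },
}

response = requests.post(
    DESK_URL,
    json=event,
    headers={"Authorization": f"Api-Token {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
```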

Availability root-cause

Changes in availability at the host or process level often represent the root cause of large-scale issues within your technology stack. Different reasons lead to changes in availability, such as the explicit restart of application servers after software updates, restarts of hosts or virtual machines, and crashes of individual processes or servers. While each DESK-monitored host and process shows an availability chart on its component dashboard, it can be hard to quickly check the availability state of all relevant components on the vertical stack of a service. The newly introduced availability section within the problem root-cause section immediately collects and summarizes all relevant downtimes of the underlying infrastructure: it shows all changes in availability of the relevant processes and hosts that run your services, organized along the vertical stack. See an example of the newly introduced availability root-cause section in the following screen:

Grouped root-cause

Another improvement within the new analysis is the detection of grouped root causes. Because the old analysis detected root-cause candidates on individual components rather than at the group level, it led to an information explosion in highly clustered environments. Imagine a case where you run 25 processes within a cluster to serve a microservice. If some of those processes were identified as the root cause, the DESK root-cause section displayed the individual instances rather than explaining the overall problem. The new analysis identifies root-cause candidates at the group level to explain the overall situation, such as a set of outliers within a large cluster of service instances. While the problem details screen shows just a quick summary of the top contributors, a click on the Analyze findings button opens a detailed analysis view. The root-cause analysis view can chart all affected service instances in a combined chart along with all identified abnormal metrics. This drill-down view is organized to show an identified root cause as a grouped vertical stack: the top layer always shows service findings, followed by process group findings, and finally all host and infrastructure findings.
As shown in the following screen, each vertical stack layer is displayed as a tile containing all metrics where abnormal behavior was detected. If more than one service instance, process group instance, or Docker image is affected, the metric chart automatically groups those instances into a combined chart that shows all metric findings on the vertical stack, as shown below:
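
The following minimal sketch illustrates the grouping idea: individual findings are aggregated into one tile per vertical-stack layer and group, rather than being listed per instance. All field names and values are assumed for illustration.

```python
from collections import defaultdict

# Illustrative grouping of individual findings into vertical-stack tiles
# (service -> process group -> host). All names and fields are assumed.
findings = [
    {"layer": "service", "group": "checkout-service", "instance": "inst-3",
     "metric": "response_time_p90"},
    {"layer": "process", "group": "checkout-pg", "instance": "proc-7",
     "metric": "gc_suspension_time"},
    {"layer": "process", "group": "checkout-pg", "instance": "proc-9",
     "metric": "gc_suspension_time"},
    {"layer": "host", "group": "linux-hosts", "instance": "host-42",
     "metric": "cpu_usage"},
]

LAYER_ORDER = ["service", "process", "host"]  # top of the stack first

# Aggregate findings per (layer, group) instead of per instance.
tiles = defaultdict(list)
for finding in findings:
    tiles[(finding["layer"], finding["group"])].append(finding)

for layer in LAYER_ORDER:
    for (tile_layer, group), members in tiles.items():
        if tile_layer != layer:
            continue
        instances = sorted({m["instance"] for m in members})
        metrics = sorted({m["metric"] for m in members})
        print(f"{layer} tile '{group}': {len(instances)} instance(s), "
              f"abnormal metrics: {', '.join(metrics)}")
```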

Overall benefit

By introducing the next generation of the DESK AI engine, we've further improved the strengths of the existing automated root-cause detection. Well-proven aspects, such as business impact analysis and the PurePath-based analysis of individual incidents, are unchanged, while improvements such as metric anomaly detection, custom events, and custom metrics have been seamlessly integrated. Overall, these improvements have pushed the boundaries of automatic, AI-based root-cause analysis and have opened up DESK as a platform for third-party integrations.