How to adjust the sensitivity of problem detection

If your organization has scheduled periods of system downtime during which you want to pause DESK monitoring, see How to define a maintenance window.

Typical application- and service-level anomalies reported by DESK include failure rate increases, response time degradations, and spikes or drops in application traffic. DESK detects these anomalies by automatically learning reference values (baselines). On top of this automated baselining, DESK allows you to define thresholds that specify how far a metric must deviate from its baseline before the deviation is severe enough to generate a problem alert. Keep in mind that these threshold settings only adjust the levels at which DESK alerts you to detected anomalies; they do not affect automated performance baselining.

There are some use cases for which adjusting these alerting thresholds may be beneficial:

  • Setting higher thresholds for applications and services that are still in development or are in the testing stage.
  • Setting lower thresholds for mission-critical services within your infrastructure (where default thresholds may be too tolerant).

Defining such thresholds is, in effect, a way of adjusting the sensitivity of problem detection.

The sensitivity of problem detection controls the level of statistical confidence required to raise an alert. Low sensitivity means that high statistical confidence is required before an alert is raised, while high sensitivity means that alerts are raised at lower statistical confidence. For example, to receive alerts immediately, even when only a few data points have breached the threshold, select high sensitivity.
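The trade-off between sensitivity and confidence can be illustrated with a minimal sketch. It is not how DESK computes statistical confidence internally; it simply assumes that each sensitivity level maps to a fraction of data points in an observation window that must breach the threshold. The function name, the level names, and the fractions are all hypothetical.

```python
# Hypothetical mapping from sensitivity level to the fraction of data points
# in the observation window that must breach the threshold before alerting.
# These values are illustrative assumptions, not DESK parameters.
REQUIRED_BREACH_FRACTION = {"low": 0.8, "medium": 0.5, "high": 0.2}


def alert_for_window(samples_ms, threshold_ms, sensitivity):
    """Return True when enough samples breach the threshold for this sensitivity."""
    if not samples_ms:
        return False  # no data, nothing to alert on
    breaches = sum(1 for sample in samples_ms if sample > threshold_ms)
    return breaches / len(samples_ms) >= REQUIRED_BREACH_FRACTION[sensitivity]


# Response times in ms; 3 of the 10 data points exceed 300 ms.
window = [120, 95, 310, 101, 330, 99, 305, 98, 97, 100]

print(alert_for_window(window, 300, "high"))  # True: few breaches are enough
print(alert_for_window(window, 300, "low"))   # False: more evidence is required
```

With the same data, high sensitivity alerts early at the cost of more false positives, while low sensitivity waits for stronger evidence.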

To configure detection sensitivity, from the navigation menu, go to Settings > Anomaly detection. If you select Applications, for example, you will see that DESK distinguishes between an absolute threshold and a relative threshold for the median and for the slowest 10% of each given metric. As shown in the example below, the median thresholds for response time degradation are set to 100 ms (absolute) and 50% (relative) above the auto-learned baseline. The thresholds for the slowest 10% of requests are set to 1,000 ms (absolute) and 100% (relative) above the auto-learned baseline.

Figure: Relative and absolute thresholds

Also, as you can see in the example above, DESK anomaly detection threshold settings allow you to specify how many actions per minute must be observed before Davis (the DESK AI causation engine) sends out problem alerts related to anomalies. This setting allows you to disable alerting for low-traffic applications and services, where baselining and alerting often lead to unnecessary alerts.
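The way these settings interact can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than DESK's actual evaluation logic: it assumes that a degradation counts only when the observed value exceeds the baseline by more than both the absolute and the relative threshold, and that no alert is raised below the configured traffic minimum. The class name, function names, and the minimum of 10 actions per minute are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DegradationThreshold:
    absolute_ms: float   # e.g. 100 ms above the baseline
    relative_pct: float  # e.g. 50 (% above the baseline)


def exceeds_threshold(baseline_ms, observed_ms, threshold):
    """Assumption: BOTH the absolute and the relative threshold must be exceeded."""
    delta = observed_ms - baseline_ms
    relative_limit_ms = baseline_ms * threshold.relative_pct / 100
    return delta > threshold.absolute_ms and delta > relative_limit_ms


def should_raise_problem(baseline_ms, observed_ms, threshold,
                         actions_per_min, min_actions_per_min):
    """Suppress alerting for low traffic, then apply the degradation threshold."""
    if actions_per_min < min_actions_per_min:
        return False  # too little traffic for a meaningful alert
    return exceeds_threshold(baseline_ms, observed_ms, threshold)


# Median thresholds from the example: 100 ms absolute, 50% relative.
median = DegradationThreshold(absolute_ms=100, relative_pct=50)

# Baseline 400 ms, so the median must rise by more than 100 ms AND more than 200 ms.
print(should_raise_problem(400, 650, median, 50, 10))  # True: +250 ms exceeds both
print(should_raise_problem(400, 550, median, 50, 10))  # False: +150 ms is under 50%
print(should_raise_problem(400, 650, median, 5, 10))   # False: traffic below minimum
```

Requiring both conditions keeps fast requests from alerting on small absolute changes and slow requests from alerting on small relative changes.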

In addition to automatically detecting all your applications, services, and running processes, DESK also monitors your development and testing services, including build systems such as Jenkins. In cases where Davis isn't able to collect enough statistically relevant data for such services, automated baselining isn't the best approach to anomaly detection. For situations like these, where your development team knows the acceptable limits better than an automatically learned baseline can, DESK provides fixed thresholds. Fixed thresholds allow you to override Davis smart multidimensional baselining by setting hard limits on response times and error rates that must not be exceeded. You can specify fixed thresholds for services and applications at the global level or for specific application and service instances.
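By contrast with baseline-relative thresholds, a fixed threshold compares the measured value directly against a hard limit, with no baseline involved. The following minimal sketch shows the idea; the class, field, and function names and the limit values are hypothetical and do not reflect DESK's configuration schema.

```python
from dataclasses import dataclass


@dataclass
class FixedThresholds:
    max_response_time_ms: float  # hard limit on response time
    max_error_rate_pct: float    # hard limit on error rate


def fixed_threshold_violations(response_time_ms, error_rate_pct, limits):
    """Return the names of the hard limits that the measured values exceed."""
    violated = []
    if response_time_ms > limits.max_response_time_ms:
        violated.append("response_time")
    if error_rate_pct > limits.max_error_rate_pct:
        violated.append("error_rate")
    return violated


# Illustrative limits for a build service with too little traffic to baseline.
build_service_limits = FixedThresholds(max_response_time_ms=2000, max_error_rate_pct=5.0)

print(fixed_threshold_violations(2500, 1.0, build_service_limits))  # ['response_time']
print(fixed_threshold_violations(800, 0.5, build_service_limits))   # []
```

Because no learned baseline is consulted, this check behaves the same whether the service has handled ten requests or ten million.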

Adapting the sensitivity of anomaly detection either by deviating from automated baselines or by specifying fixed thresholds is supported for:

For the following, sensitivity can be adapted only by specifying fixed thresholds:

Configure thresholds for individual entities

As an alternative to defining thresholds globally across your entire environment, you can disable global settings and instead fine-tune threshold settings for individual applications and services using the application- and service-specific settings pages. See examples below.