DESK Managed cluster failover mechanism

DESK Managed allows for high-availability deployments built with multiple equally important nodes. While every cluster contains a master node, all nodes in a DESK Managed cluster run the same services and are capable of taking over the master node role. As a result, you don't need to know which node is currently acting as the master node; perform all cluster configuration and maintenance through the Cluster Management Console.

To achieve the best failover deployments, we recommend the following:

  • Redundancy
    Plan to deploy a minimum of three nodes per cluster. In such clusters, data is automatically replicated across nodes, so each primary shard has two additional replicas.

    All events, user sessions, and metrics are stored with a replication factor of three, and the entire configuration of the DESK cluster and its environments is stored on each node. For DESK to remain fully operational while one data node is completely unavailable, a majority quorum of nodes must still be running: in a three-node cluster, two nodes must be operational while one is down; in a five-node cluster, two nodes can be down (see the quorum sketch after this list). The latency between nodes should be around 10 ms or less.

    Raw transaction data (call stacks, database statements, code-level visibility, and so on) isn't replicated across nodes; instead, it's evenly distributed across all nodes. As a result, in the event of a node failure, DESK can accurately estimate the missing data (a simple illustration of this follows the list). This is possible because the data is typically short-lived, and the high volume of raw data that DESK collects ensures that each node still holds a large enough data set even if another node is unavailable for some time.

    If you plan to distribute nodes across separate data centers, don't deploy more than two nodes in each data center. The replication factor of three then ensures that each data center holds all of the metric and event data. Additionally, for seamless continuity you need at least three data centers, so that operations can continue if one of them fails.

  • Hardware
    To prevent loss of monitoring data, deploy each node on a separate physical machine. To minimize performance loss, deploy nodes on machines with the same hardware characteristics. In the event of a hardware failure, only the data on the failed machine is affected; no monitoring data is lost because the monitoring data is replicated across all nodes. Performance loss is minimized because all nodes run on the same type of hardware with an evenly distributed workload.

  • Processing capacity
    Build your cluster with additional capacity and possible node failure in mind. A cluster that operates at 100% of its processing capacity has no headroom to compensate for a lost node and is therefore susceptible to dropping data in the event of a node failure. Deployments planned for node failure should have a processing capacity one-third higher than their typical utilization.
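
The quorum sketch referenced under Redundancy above follows. It's a minimal Python illustration of the general majority-quorum rule, not DESK internals; the function names are hypothetical.

    def majority_quorum(node_count):
        """Smallest number of nodes that forms a majority."""
        return node_count // 2 + 1

    def tolerated_failures(node_count):
        """Nodes that can be down while a majority is still running."""
        return node_count - majority_quorum(node_count)

    for nodes in (3, 5):
        print(f"{nodes} nodes: quorum={majority_quorum(nodes)}, "
              f"tolerated failures={tolerated_failures(nodes)}")
    # 3 nodes: quorum=2, tolerated failures=1
    # 5 nodes: quorum=3, tolerated failures=2

The output matches the statements above: a three-node cluster stays operational with one node down, and a five-node cluster with two nodes down.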
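
The illustration mentioned in the raw-transaction-data paragraph sketches one simple way to think about estimating missing raw data: if the data is spread evenly across all nodes and one node is unavailable, totals observed on the surviving nodes can be scaled up proportionally. This is an assumption-based illustration, not DESK's actual estimation algorithm; the function and figures are hypothetical.

    def estimate_total(observed_on_surviving_nodes, total_nodes, unavailable_nodes=1):
        """Scale counts from surviving nodes, assuming raw transaction data
        is spread evenly across all nodes (illustrative assumption only)."""
        surviving_nodes = total_nodes - unavailable_nodes
        return observed_on_surviving_nodes * total_nodes / surviving_nodes

    # Example: the two surviving nodes of a three-node cluster report
    # 200,000 raw transactions in total, so the estimated true total is
    # roughly 300,000.
    print(estimate_total(200_000, total_nodes=3))  # 300000.0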

If a node fails, the NGINX load balancer that distributes traffic across the cluster automatically redirects all OneAgent traffic to the remaining healthy nodes; no user action is needed other than replacing the failed node.
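
As a rough illustration of this failover behavior, the sketch below models a load balancer that distributes OneAgent traffic round-robin across cluster nodes and skips any node marked as unhealthy. It is not DESK's actual NGINX configuration or routing logic, and the node names and health flags are hypothetical.

    from itertools import cycle

    # Hypothetical cluster nodes and their current health state.
    nodes = {
        "node-1.cluster.internal": True,
        "node-2.cluster.internal": False,  # failed node
        "node-3.cluster.internal": True,
    }

    def healthy_nodes():
        """Return only the nodes that are currently up."""
        return [name for name, up in nodes.items() if up]

    def route_requests(request_count):
        """Distribute requests round-robin across healthy nodes only."""
        targets = cycle(healthy_nodes())
        return [next(targets) for _ in range(request_count)]

    print(route_requests(4))
    # ['node-1.cluster.internal', 'node-3.cluster.internal',
    #  'node-1.cluster.internal', 'node-3.cluster.internal']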