Distributed systems math: availability

Who doesn’t like math? (Rhetorical question, do not answer.) Today we’ll look at some simple math calculations which can help you approximate the overall availability of a distributed system.

This model will assume a quorum-based protocol where a simple majority of nodes must be online and functional in order to ensure availability. In other words, if you have five nodes, a total of three (int(5/2) + 1) must be up to form a majority quorum. As well, the nodes do not have overlapping fault domains (i.e. they have separate power, network connections, etc.). This ensures that single node faults are largely independent of each other and not subject to cascading failures.

We’ll start with the degenerate case. In a system with exactly one node and an associated node failure rate of 5%, what would be the unavailability? It should be easy to see that this is just 5%; when one node is down, the whole system is down. Alternatively, we could say that we have 95% availability (halfway between “one nine” and “two nines”).

How about a more interesting case, where we have a three node system? Here, a single node fault will not cause unavailability. Instead, we must experience two or three simultaneous faults to be unavailable. The unavailability would thus be the sum of all two and three node fault cases. There are three distinct ways that a two node fault could occur (think: only one of node A, B, or C survives) and exactly one way all nodes could fault. The probability of a two node fault is 5% x 5% x 95% (where 95% is the probability that no fault occurred) times the three ways or 0.2375% x 3 = 0.7125%. The probability of a three node fault is just 5% x 5% x 5% or 0.0125%. Overall we would have just 0.725% unavailability, squarely within the “two nines” range.

We can do this for any number of nodes n and node fault rate f by using the binomial distribution. Since the math gets rather tedious with more than a couple nodes, you can use this C# program to automate it:

namespace BinomialDistribution
{
    using System;

    class Program
    {
        static void Main(string[] args)
        {
            double f = 0.05d;
            int n = 3;

            double u = 0.0d;
            for (int i = 1 + (n / 2); i <= n; ++i)
            {
                u += Pow(f, i) * Pow(1.0d - f, n - i) * Comb(n, i);
            }

            Console.WriteLine("Unavailability: {0}%", u * 100.0d);
            Console.WriteLine("Availability:   {0}%", (1 - u) * 100.0d);
        }

        private static double Pow(double b, int n)
        {
            double result = 1.0d;
            for (int i = 0; i < n; ++i)
            {
                result *= b;
            }

            return result;
        }

        private static int Comb(int n, int r)
        {
            int numerator = 1;
            int k = n;
            while (k > (n - r))
            {
                numerator *= k--;
            }

            int denominator = 1;
            k = r;
            while (k > 1)
            {
                denominator *= k--;
            }

            return numerator / denominator;
        }
    }
}

Some experimentation will show that you can achieve 99.9% availability with three nodes when the fault rate is less than 2%, or with five nodes and a 4% fault rate.

While this is a simplistic model, it is a useful teaching aid for basic reliability and mathematical concepts. Viva Bernoulli!

Leave a Reply

Your email address will not be published. Required fields are marked *

Time limit is exhausted. Please reload the CAPTCHA.