Who doesn’t like math? (Rhetorical question, do not answer.) Today we’ll look at some simple calculations that can help you approximate the overall availability of a distributed system. This model will assume a quorum-based protocol where a simple majority…
Design for disaster: the small stuff
Is your system designed with disaster recovery in mind? There are obvious things like taking backups of state (and maybe validating them, too) and configuring geo-replication to mitigate the effects of regional outages. But what about the more subtle aspects…
Once is not enough
In the olden days of boxed software products, the “full test pass” was a borderline sacred ritual performed near the end of a release. Ostensibly, its purpose was to make sure all the product features worked as intended — for…
Automatic vs. automated
Modern software systems involve plenty of automation, especially in testing, deployment, and operations. However, there is a world of difference between automatic and automated processes. Automation can be an insidious half-measure that creates the illusion of agility while actually…
A watched app never fails
Watchdogs have been longtime staples of the embedded systems space. But they are also quite useful for distributed services where it is advantageous to attempt automatic recovery from transient errors. When it comes to watchdog implementations, the stakes are understandably…
Scale minimization
Long ago, I wrote about high-level testing and alluded to scale minimization as a useful technique in doing so. In this post, I’ll explore this idea a bit more. What is scale minimization? You may have heard it referred to…
Disorderly fault injection
Fault injection is a commonly used testing technique to force the system under test into failure paths, allowing observation and evaluation of the system’s ability to tolerate and recover from errors. There are many fault injection tools available, such as…
Testing from up high
In the previous post, I introduced a simple distributed service and some considerations that might drive a test planning effort. In this initial drill down, I will take a look at the tradeoffs of testing this system “from up high”…
Testing at the right level
A large-scale distributed service is deployed to a datacenter across hundreds of machines. The basic topology is as follows: Consider the following scenarios and requirements: The service should respond with an error if a client requests a nonexistent resource. The…