Fault injection is a commonly used testing technique that forces the system under test into failure paths, allowing observation and evaluation of its ability to tolerate and recover from errors. There are many fault injection tools available, such as TestApi for .NET and the classic Holodeck (open-sourced as of 2014).
A good chunk of these fault injection tools use some form of API hooking to return an error or throw an exception in response to a known library or system call. For example, if the program you are testing is known to use the Win32 HeapAlloc function to allocate memory, you could force this call to throw an exception with code STATUS_NO_MEMORY to simulate an out-of-memory fault.
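To make the hook-and-fail idea concrete, here is a minimal sketch transposed from Win32 API hooking to Python monkey-patching for illustration; load_index and its use of bytearray are hypothetical stand-ins for "code known to allocate memory", not anything from a particular tool.

```python
from unittest import mock

def load_index(size):
    # Code under test: allocates a large buffer (filling it from disk is omitted).
    buffer = bytearray(size)
    return buffer

def test_load_index_survives_allocation_failure():
    # Hook the allocation call and force it to fail, much like raising
    # STATUS_NO_MEMORY from a hooked HeapAlloc.
    with mock.patch("builtins.bytearray", side_effect=MemoryError):
        try:
            load_index(1 << 20)
        except MemoryError:
            # A real test would assert that the error path degrades gracefully.
            pass

if __name__ == "__main__":
    test_load_index_survives_allocation_failure()
    print("allocation fault injected and handled")
```

The key property is that only this one call site fails, exactly once, under the tester's control, which is precisely the limitation discussed next.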
This is fine as a gray-box approach for exploring the system and probing for missing error handlers, but it tends to be a bit artificial: the system is not likely to fail in exactly one place or API call while all other paths proceed as usual. These highly targeted errors can be patched in a straightforward way, which may lead to false confidence that the system is now “robust”; in reality, a true low-memory situation could cause an unrecoverable failure cascade, especially when interacting with third-party libraries or code that you do not otherwise control.
With distributed systems, fault injection becomes that much more difficult. The software under test spans network and machine boundaries, beyond the reach of local tools. To cope with these difficulties, forward-looking software engineers will often create various debug APIs to allow operations like shutting down and restarting service processes, forcing a crash fault (e.g. to capture a debug dump), and so on. These low-level functions tend to come in handy during incidents or outages, when normal system paths have failed and operator-assisted recovery is necessary. They also happen to be useful for fault injection. However, these system-provided APIs are generally going to inject “orderly” faults: not too invasive and not all-encompassing (since there is usually no legitimate service operator scenario for, say, randomly failing memory allocations). Of course, this approach has a critical prerequisite: the system itself must be up and running in order to use the system APIs! That is not necessarily a given for fault scenarios which may involve (purposely) extended downtime.
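As a rough illustration of what such a debug API might look like, here is a minimal sketch using only the Python standard library; the endpoint paths, port, and behaviors are hypothetical choices for this example, not a prescribed design.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class DebugHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == "/debug/crash":
            # Crash fault: abort() terminates the process immediately (and can
            # produce a debug dump); the 202 may never reach the caller.
            self.send_response(202)
            self.end_headers()
            os.abort()
        elif self.path == "/debug/shutdown":
            # Orderly stop: shut down the listener from another thread so the
            # current request can finish first.
            self.send_response(202)
            self.end_headers()
            threading.Thread(target=self.server.shutdown).start()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Bind to localhost only; never expose debug endpoints on public interfaces.
    HTTPServer(("127.0.0.1", 8081), DebugHandler).serve_forever()
```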
In the quest for software resiliency, disorderly faults are generally more interesting. These types of faults (abrupt crashes, total network disconnection, sustained high CPU and memory load) cannot just be guarded against with trivial patches and bug fixes; protection from these conditions must be designed in. Any service with a disaster recovery plan deserves a bit of invasive fault injection to help prove that the plan works in practice.
What does it take to get started with this type of fault injection? Simple tools like Chaos Monkey by Netflix can take you pretty far, maybe as far as you need to go depending on the specifics of your system. Also consider designing the system with not just debug APIs but a debug service instance which can inflict disorder upon other collocated instances. (Proceed with caution, however, and never expose such debug APIs/instances via insecure channels.)
Why not begin with the most basic disorderly fault? Abruptly terminate a service process. This gives the system no time to collect its thoughts or flush data to disk. If the system is designed well, no real downtime should occur (high availability!) and there should be no permanent data loss or corruption upon recovery (data consistency!). In day-to-day development, there are inevitably going to be bugs that crash or “fail-fast” the system without warning, so this is just a way to exercise those paths in a (tester-)controlled fashion. Also consider introducing network loss (tools such as dummynet can help here). Again, the system architecture should have accounted for this, since networks are imperfect and retransmissions are inevitable (albeit at the cost of increased latency).
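A rough harness for these two faults can be quite small. The sketch below assumes a Linux host with root privileges, a hypothetical service process named "myservice" on interface eth0, and uses tc/netem (a Linux alternative to dummynet) for packet loss.

```python
import random
import subprocess
import time

SERVICE_NAME = "myservice"  # hypothetical process name
INTERFACE = "eth0"          # network interface to degrade

def kill_service_abruptly():
    # SIGKILL gives the process no chance to flush buffers or run cleanup,
    # matching the "no time to collect its thoughts" fault described above.
    subprocess.run(["pkill", "-9", "-x", SERVICE_NAME], check=False)

def inject_packet_loss(percent):
    # Drop a percentage of outgoing packets on the interface via netem.
    subprocess.run(["tc", "qdisc", "add", "dev", INTERFACE, "root",
                    "netem", "loss", f"{percent}%"], check=False)

def clear_packet_loss():
    subprocess.run(["tc", "qdisc", "del", "dev", INTERFACE, "root"], check=False)

if __name__ == "__main__":
    # Degrade the network, let the system run under loss for a while,
    # then kill the service abruptly and restore the network.
    inject_packet_loss(10)
    time.sleep(random.uniform(30, 120))
    kill_service_abruptly()
    clear_packet_loss()
```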
With only these two faults, a lot of interesting issues can be found. Armed with a few simple tools and a bit of forethought about testability, software engineers can gain more confidence that their system has what it takes to persevere in the face of the inevitable disorder and chaos of the real world.