Is your system designed with disaster recovery in mind? There are obvious things like taking backups of state (and maybe validating them, too) and configuring geo-replication to mitigate the effects of regional outages. But what about the more subtle aspects, the things you don’t think much of until they stop working? These aren’t necessarily the business-critical pieces, but they can greatly affect the time wasted (or saved) during recovery operations.
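As a concrete illustration of the “validating them, too” part, a small scheduled job can restore the latest snapshot and sanity-check it, instead of discovering a corrupt backup mid-outage. Here is a minimal sketch, assuming backups land as SQLite snapshot files on disk; the paths, table name, and checks are placeholders for whatever your system actually produces:

```python
#!/usr/bin/env python3
"""Illustrative backup validation job (paths and checks are hypothetical)."""
import glob
import os
import sqlite3
import sys

BACKUP_GLOB = "/backups/orders-*.sqlite"  # hypothetical backup location
MIN_EXPECTED_ROWS = 1                     # tune to what "non-empty" means for you

def newest_backup() -> str:
    candidates = sorted(glob.glob(BACKUP_GLOB), key=os.path.getmtime)
    if not candidates:
        sys.exit("no backups found -- that is already a disaster finding")
    return candidates[-1]

def validate(path: str) -> None:
    # A restore test: open the snapshot read-only and run a cheap sanity query.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
        if count < MIN_EXPECTED_ROWS:
            sys.exit(f"{path}: backup restored but looks empty ({count} rows)")
        print(f"{path}: OK ({count} rows)")
    finally:
        conn.close()

if __name__ == "__main__":
    validate(newest_backup())
```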
Let’s start with the diagnostics pipeline. Do your application monitors and error log viewers have any dependencies that might also be down during an outage? If the backing storage where your logs are shipped is down, do you have a way to at least read the data collected so far? Can you fall back to a “read in place” solution, without manually logging into servers and copying files around? The convenient diagnostic tools and dashboards you’ve invested so much in should remain functional during a disaster, even if in a degraded mode.
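One way to get that degraded mode is to build the fallback directly into your log-search tooling: if the central store doesn’t answer, scan whatever is still sitting on the hosts. A rough sketch, where the backend call and file locations are hypothetical stand-ins for your actual setup:

```python
#!/usr/bin/env python3
"""Sketch of a "read in place" fallback for log search (names are hypothetical)."""
import glob
import gzip
import sys
from typing import Iterator

LOCAL_LOG_GLOB = "/var/log/myapp/*.log*"  # hypothetical on-host log location

def search_central_store(query: str) -> Iterator[str]:
    # Placeholder for whatever your log backend exposes (Elasticsearch, Loki,
    # a vendor API, ...). During an outage, this is the call that fails.
    raise ConnectionError("log backend unreachable")

def search_local_files(query: str) -> Iterator[str]:
    # Degraded mode: scan whatever is still on disk, including rotated files.
    for path in sorted(glob.glob(LOCAL_LOG_GLOB)):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                if query in line:
                    yield f"{path}: {line.rstrip()}"

def search(query: str) -> Iterator[str]:
    try:
        yield from search_central_store(query)
    except (ConnectionError, TimeoutError):
        print("central store unavailable, reading local files", file=sys.stderr)
        yield from search_local_files(query)

if __name__ == "__main__":
    for hit in search(sys.argv[1] if len(sys.argv) > 1 else "ERROR"):
        print(hit)
```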
How complicated is the bootstrapping process for your application platform? A bad enough outage may force you to spin up fresh instances without relying on any existing resources being online. If you have a complex web of dependencies or an arcane sequence of setup steps, you will undoubtedly suffer longer downtime than necessary. Either you will wait for the precious few “in-house experts” to get everything running again, or you will suffer through multiple missteps as less enlightened engineers fumble through lengthy, error-prone procedures.
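One antidote is to collapse that arcane sequence into a single scripted entry point that encodes the dependency order, so a fresh environment can be brought up with one command instead of a wiki page of manual steps. A sketch of the shape such a script might take, where every step is a placeholder for your real provisioning commands:

```python
#!/usr/bin/env python3
"""Sketch of a single-entry-point bootstrap script (step names are hypothetical)."""
import subprocess
import sys
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]

def run(cmd: List[str]) -> None:
    print(f"  $ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

def provision_network() -> None:
    # e.g. create the VPC / virtual network the platform expects
    run(["echo", "provisioning network (placeholder)"])

def provision_storage() -> None:
    # e.g. create buckets / volumes before anything tries to write to them
    run(["echo", "provisioning storage (placeholder)"])

def deploy_core_services() -> None:
    # e.g. apply base manifests for config, secrets, and service discovery
    run(["echo", "deploying core services (placeholder)"])

def deploy_application() -> None:
    run(["echo", "deploying application (placeholder)"])

# The list is the documentation: steps run in dependency order, top to bottom.
STEPS: List[Step] = [
    ("network", provision_network),
    ("storage", provision_storage),
    ("core services", deploy_core_services),
    ("application", deploy_application),
]

def bootstrap() -> None:
    for name, step in STEPS:
        print(f"==> {name}")
        try:
            step()
        except subprocess.CalledProcessError as exc:
            # Fail loudly with the step name so a non-expert knows where to resume.
            sys.exit(f"bootstrap failed at step '{name}': {exc}")

if __name__ == "__main__":
    bootstrap()
```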
Disasters (thankfully) do not happen all that often. Engineers with the right tribal knowledge can avoid rookie mistakes and may have clever workarounds for quirks or limitations of the system. These facts cause many to become complacent and lose focus on the “small stuff” that stands in the way of a friction-free disaster recovery story. This is why it is critical to execute disaster drills to find all the weak points in your recovery. Involve multiple engineers across the stack, including the most junior members of the team. See how foolproof your documented procedures and tools really are.
A recovery strategy based on technical wizardry is unlikely to succeed in the long run. Do sweat the small stuff when it comes to disaster readiness.