In the olden days of boxed software products, the “full test pass” was a borderline sacred ritual performed near the end of a release. Ostensibly, its purpose was to make sure all the product features worked as intended — for some nebulous definitions of “all,” “product features,” and “worked as intended.” Even in those ancient times, nigh on five or ten years ago, it was clear that running through all the tests only a handful of times was hardly sufficient. Sure, the more deterministic tests in the basic regression suite would generally stop yielding new information after the first or second pass. But it seemed like there was always something interesting and unexpected happening in the longer-running load tests that warranted much more than one time through.
In the current software landscape, we have continuous integration as the modern answer to the full test pass of yore. The CI approach definitely offers many improvements, but these mostly boil down to timeliness: the feedback loop around each change is far shorter. Despite the “continuous” appellation, it is still a discrete test suite run once against each change. Once is just not enough.
To approach the truly continuous validation that we desire, we need to look beyond regression suites and integration environments. Software breaks for many reasons that are impossible to discover by just “testing the changes.” A service that worked perfectly yesterday could fail horribly the next day — without anything new being deployed.
It could be that a new usage pattern in production (e.g., a load spike) brought the system to its knees. Or maybe there was an unbounded cache that eventually ate up all the memory on the server. These are all things that we could test for (assuming we know enough to do so), but eventually the overall cost becomes an issue. How much time and effort (if any) are we willing to spend in pre-production trying various load and scale experiments?
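As a concrete illustration of that second failure mode, here is a minimal, hypothetical Python sketch (the class names and limits are invented for illustration, not taken from any particular system). The first cache grows with every distinct key it sees, so a short pre-production test pass looks fine while weeks of production traffic slowly exhaust memory; the bounded variant caps its footprint by evicting old entries.

```python
from collections import OrderedDict


class UnboundedCache:
    """Keeps every entry forever; harmless in a short test run,
    but memory grows with every distinct key seen in production."""

    def __init__(self):
        self._data = {}

    def get_or_compute(self, key, compute):
        if key not in self._data:
            self._data[key] = compute(key)
        return self._data[key]


class BoundedLRUCache:
    """Evicts the least-recently-used entry once max_entries is reached,
    putting a hard cap on the cache's memory footprint."""

    def __init__(self, max_entries=10_000):
        self._data = OrderedDict()
        self._max_entries = max_entries

    def get_or_compute(self, key, compute):
        if key in self._data:
            self._data.move_to_end(key)      # mark as recently used
            return self._data[key]
        value = compute(key)
        self._data[key] = value
        if len(self._data) > self._max_entries:
            self._data.popitem(last=False)   # drop the oldest entry
        return value
```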
As it turns out, we do have options to enable reasonably accurate and continuous measurement of the software in its natural environment without breaking the budget or falling prey to the static test pass trap. We can deploy watchdogs alongside the system to alert us to issues as they happen. We can think of them as little tests that run all the time and do not depend on any discrete trigger (like a new push to master). We can also deploy other active monitors in production to generate synthetic load to measure stress, scale, performance, and the like. To be clear, these monitors, load generators, and watchdogs do not take the place of all the useful and necessary testing done in the CI system; rather, they augment the test portfolio and make possible scenarios that were previously out of reach in the more static world.
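As one illustration of the watchdog idea, here is a minimal sketch, assuming a hypothetical health endpoint (HEALTH_URL), a simple latency budget, and a stand-in send_alert hook; none of these names come from a specific product. The key property is the loop at the bottom: the check runs on a fixed interval, forever, with no dependence on a push or a release.

```python
import time
import urllib.request

HEALTH_URL = "https://example.com/healthz"   # assumed endpoint
CHECK_INTERVAL_SECONDS = 30
LATENCY_BUDGET_SECONDS = 2.0


def send_alert(message: str) -> None:
    # Stand-in for a real paging or alerting integration.
    print(f"ALERT: {message}")


def probe_once() -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
            status = response.status
    except Exception as exc:
        send_alert(f"health check failed: {exc}")
        return
    elapsed = time.monotonic() - start
    if status != 200:
        send_alert(f"unexpected status {status}")
    elif elapsed > LATENCY_BUDGET_SECONDS:
        send_alert(f"slow response: {elapsed:.2f}s")


if __name__ == "__main__":
    while True:                 # the "test" never stops running
        probe_once()
        time.sleep(CHECK_INTERVAL_SECONDS)
```

A synthetic load generator follows the same pattern; instead of a single probe it issues a steady stream of requests and records throughput and latency, so stress and scale behavior is measured continuously rather than once per release.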
Bottom line: a validation strategy that relies on a single sample to gauge success is at best incomplete. A distributed service with many moving parts has myriad sources of error that only become apparent when tested in place — in production — over and over again.