One box to rule them all?

If your distributed system supports some form of scale minimization, you may feel a temptation to solve a variety of semi-related problems with the same “one box.” It’s more efficient to reuse code, after all. Unfortunately, it is also easy to go wrong here and end up creating a suboptimal experience for everyone. Do you really understand and appreciate the distinct use cases for your one box infrastructure, or are you building “one box to rule them all”?

Let’s look at a few reasons to have a one box environment:

  • To allow a support engineer to locally reproduce and debug a problem seen in the production system.
  • To enable an internal group to run their software together with yours without the need for a hosted instance.
  • To offer a low-friction “playground” environment for external customers of your system.
  • To facilitate automated integration tests of your system.
  • To help your engineers perform interactive exploration and experimentation with the features of your system.

If we squint, we can see some commonality among these scenarios, but the devil is in the details. I will drill down into each example and offer questions and considerations to help motivate the right solutions.

Consider the support engineer, pressed for time, attempting to find the root cause of a pernicious production problem. A debugging session is easiest when the number of debuggees is as small as possible. Does your one box create many distinct processes or can they all be faithfully and easily collapsed to one? Can you quickly and easily load pre-created data and state (e.g. that which was produced on a production node)? Is it possible to generate crash dumps on fatal errors or asserts? You may also want to consider why you are resorting to a local repro; it could be a sign of diagnostic deficiencies (insufficient logging/tracing, etc.).

As for the internal group with dependencies on your system, they are likely going to take new builds from you as often as you can provide them. It is particularly important that prior versions can be replaced by newer ones with a minimum of hassle. Can they easily set up and instantiate your one box, e.g. with XCOPY deployment and a startup script? Is the teardown/refresh step just as easy? Does your one box require machine-wide configuration steps or administrative privileges? You may also want to consider whether this team really needs a one box solution from you or whether they are overcompensating for inappropriately tight coupling to externals.
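
To make the setup/teardown question concrete, here is a minimal sketch of copy-style deployment in Python. The `install_one_box` function, the `current` directory name, and the `VERSION` file are all hypothetical names chosen for illustration; the point is only that install and refresh can be a plain directory copy with no machine-wide state or administrative privileges.

```python
import shutil
import tempfile
from pathlib import Path


def install_one_box(build_dir: Path, install_root: Path) -> Path:
    """Copy a build into install_root/current, replacing any prior version."""
    target = install_root / "current"
    if target.exists():
        shutil.rmtree(target)           # teardown is just a directory delete
    shutil.copytree(build_dir, target)  # setup is just a directory copy
    return target


def demo():
    """Install build 1, then refresh to build 2, inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        builds = []
        for version in ("1", "2"):
            build = root / f"build-{version}"
            build.mkdir()
            (build / "VERSION").write_text(version)  # hypothetical marker file
            builds.append(build)

        seen = []
        for build in builds:
            target = install_one_box(build, root / "onebox")
            seen.append((target / "VERSION").read_text())
        return seen
```

If a refresh needs more than this (registry edits, services, elevated prompts), that is a hint the one box is not yet friendly to a team taking frequent builds.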

For the external customers in search of a playground, you should determine if they want a one box or just a simulator/emulator. How “external” are they (first party vs. third party)? It may not be advisable (or allowable depending on licenses or intellectual property rights) to ship internal system libraries that would normally make up a one box solution to such a customer. In any case, simplicity is the key. Verbose configuration files and fancy debugging features appropriate for a support engineer or internal customer probably won’t fly here. Is a graphical UI warranted? How much automation or scripting capability do they need, if any?

As the esteemed J.B. Rainsberger (aka @jbrains) will tell you, integrated tests are a scam. But many engineering teams still rely on them and consider a one box solution a necessary tool to enable them. Does your one box operate in an automation-friendly “silent mode” (no UI, no prompts)? Is it easy to distinguish failures in the one box infrastructure from test failures? Is it possible (or desirable) for multiple tests to share the same one box “instance” and benefit from the same initialization steps? Can the complete diagnostic output of the one box session be gathered easily in the event of a failure? You may also want to consider the advice of Mr. Rainsberger; are these one box integration tests really adding value or are you just missing a fast, isolated unit test suite?

Finally, we have the interactive and exploratory use case — perhaps a sort of “dogfooding” scenario. It can be informative and enlightening to actually use the software you are building. For complex distributed systems, a one box solution could lower the barrier to entry here. Is there an interactive shell/console to help drive useful workloads as they come to an engineer’s mind? This might be better than requiring that everyone build a custom client application. Is the one box system a normal build target which can be recompiled automatically as code changes are made? Is it configurable enough to facilitate meaningful experiments (e.g. imagine probing for performance issues by tweaking background thread counts and timer values)? Does it run without tweaks by having a useful default configuration?
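
As a rough illustration of "useful defaults plus easy tweaks," the sketch below models the configuration as a frozen dataclass whose defaults work out of the box, with `key=value` overrides for experiments. The `OneBoxConfig` name, its two fields, and the default values are assumptions made up for this example, not settings from any real system.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class OneBoxConfig:
    # Hypothetical knobs and defaults; the one box should run with no overrides.
    background_threads: int = 4
    timer_interval_ms: int = 1000


def parse_overrides(args):
    """Build a config from 'key=value' strings, starting from the defaults."""
    valid = {f.name for f in fields(OneBoxConfig)}
    overrides = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or key not in valid:
            raise ValueError(f"unknown or malformed setting: {arg!r}")
        overrides[key] = int(value)  # every field in this sketch is an integer
    return replace(OneBoxConfig(), **overrides)
```

An engineer probing a performance issue could then start the one box with something like `background_threads=16` and leave everything else alone, while a typo in a setting name fails loudly instead of being silently ignored.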

As you can see, there are myriad pointed questions to ask yourself and your users before you ship “the” one-box-fits-all solution. Always keep in mind the goals and non-goals of your one box story and help build the right tool(s) for the job.
