Watchdogs have long been a staple of embedded systems, but they are also quite useful for distributed services where it is advantageous to attempt automatic recovery from transient errors.
When it comes to watchdog implementations, the stakes are understandably much lower for typical cloud software compared to, say, the Voyager spacecraft. A minimal implementation of a synthetic monitoring app paired with a “kill switch” mechanism can get you pretty far. I have uploaded a WatchdogSample on GitHub as a demonstration of this concept.
The basic use case is as follows:
- The application to be watched is deployed side-by-side with its watchdog app. Both applications are assumed to be configured to restart on failure.
- The watchdog periodically runs simple operations and consistency checks to gauge the health of the watched app (the exact details of which would be domain-specific).
- When some error threshold is reached, e.g. the app fails to respond properly after a period of time, the watchdog requests termination of the app.
- The app is then restarted and the process repeats.
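To make the error threshold step concrete, here is a minimal sketch in portable C++. It is not code from WatchdogSample: the Watchdog class, its Poll method, and the "consecutive failures" policy are all hypothetical choices for illustration, and the health check and termination request are passed in as callbacks because their details are domain-specific.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical sketch; none of these names come from WatchdogSample.
class Watchdog {
public:
    Watchdog(std::function<bool()> healthCheck,
             std::function<void()> requestTermination,
             int failureThreshold)
        : healthCheck_(std::move(healthCheck)),
          requestTermination_(std::move(requestTermination)),
          failureThreshold_(failureThreshold) {}

    // Runs one health check. Returns true if termination was requested.
    bool Poll() {
        if (healthCheck_()) {
            consecutiveFailures_ = 0;  // a healthy response clears the count
            return false;
        }
        if (++consecutiveFailures_ < failureThreshold_) {
            return false;  // still below the threshold, keep watching
        }
        requestTermination_();
        consecutiveFailures_ = 0;  // the app is expected to restart; start over
        return true;
    }

    int ConsecutiveFailures() const { return consecutiveFailures_; }

private:
    std::function<bool()> healthCheck_;
    std::function<void()> requestTermination_;
    int failureThreshold_;
    int consecutiveFailures_ = 0;
};

// Usage sketch: three failed checks in a row produce one termination request.
inline int DemoOutage() {
    int terminations = 0;
    Watchdog dog([] { return false; }, [&] { ++terminations; }, 3);
    for (int i = 0; i < 3; ++i) {
        dog.Poll();
    }
    assert(terminations == 1);
    return terminations;
}
```

A real watchdog would call something like Poll on a timer. Counting consecutive failures rather than total failures is one way to avoid restarting the app over an isolated blip.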
The WatchdogSample produces a native C++ DLL to manage the kill switch, which in this case is a simple manual reset event object. The watched app calls the InitializeWatchdog export method (preferably as the first thing in its Main function) which initiates a thread pool wait on the event. When the watchdog app signals the event via SignalWatchdog, the wait is satisfied and a failfast exception is generated, causing an immediate shutdown of the watched app. On a properly configured Windows system, this would initiate some sort of application crash handling (perhaps via a postmortem debugger or Windows Error Reporting). If, as luck would have it, the watched application exits normally, the event is closed and the wait is canceled, thus safely disabling the kill switch.
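The signal, fire, and cancel lifecycle described above can be imitated in portable C++ without any Win32 calls. The sketch below is only a stand-in for the idea, not the sample's implementation. The KillSwitch class is hypothetical, a std::condition_variable and a dedicated thread take the place of the manual reset event and the thread pool wait, and the failfast action is a caller-supplied callback (in a real app it might be something like std::abort).

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical portable stand-in for the kill switch; not the sample's Win32 code.
class KillSwitch {
public:
    // Starts waiting immediately, like calling an init function first thing in main.
    explicit KillSwitch(std::function<void()> onSignal)
        : onSignal_(std::move(onSignal)),
          waiter_([this] { WaitForSignal(); }) {}

    // Normal exit path: cancel the pending wait and join the waiter thread.
    ~KillSwitch() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            canceled_ = true;
        }
        changed_.notify_all();
        waiter_.join();
    }

    // Watchdog side: set the event. Like a manual reset event, it stays set.
    void Signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        changed_.notify_all();
    }

private:
    void WaitForSignal() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return signaled_ || canceled_; });
        const bool fire = signaled_;
        lock.unlock();
        if (fire) {
            onSignal_();  // stands in for the failfast action
        }
    }

    std::function<void()> onSignal_;
    std::mutex mutex_;
    std::condition_variable changed_;
    bool signaled_ = false;
    bool canceled_ = false;
    std::thread waiter_;  // declared last so the members it uses are initialized first
};

// Usage sketch: signaling runs the callback; the destructor waits for it to finish.
inline bool DemoSignalFires() {
    bool fired = false;
    {
        KillSwitch killSwitch([&] { fired = true; });
        killSwitch.Signal();
    }  // the waiter thread has been joined here
    assert(fired);
    return fired;
}
```

Unlike this single-process stand-in, the real sample has to share the event between two processes, which is where the naming and security concerns come in.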
There are a few other features of note for the WatchdogSample:
- The core logic is fully unit tested aside from the very thin Win32 integration layer in the DLL itself. The approach makes use of two “ports” (in the “ports-adapters-simulators” architectural style), IEvent (a simple “schedule and signal” abstraction) and ManualResetEvent (which defines much lower level Win32 interaction details pluggable in a statically polymorphic way via a
- The underlying implementation is native but there is an easy-to-use managed P/Invoke wrapper demonstrated in WatchedManagedApp.
- The managed/native interoperability makes it possible to use any combination of native and managed watchdogs and watched apps. You can see for yourself by signaling WatchedNativeApp with WatchedManagedApp or vice versa.
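As a rough illustration of how a statically polymorphic port lets core logic be unit tested without Win32, here is a small sketch. It assumes the event type is supplied as a template parameter, which is one common way to get static polymorphism in C++ but may not be what the sample does. FakeEvent, WatchdogCore, and the Schedule/Signal/Cancel method names are all hypothetical and are not the sample's IEvent or ManualResetEvent interfaces.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical simulator: remembers the scheduled callback and runs it on Signal().
class FakeEvent {
public:
    void Schedule(std::function<void()> callback) { callback_ = std::move(callback); }
    void Signal() {
        if (callback_) {
            callback_();
        }
    }
    void Cancel() { callback_ = nullptr; }

private:
    std::function<void()> callback_;
};

// Hypothetical core logic. Any type with Schedule and Cancel can be plugged in
// at compile time, so no virtual interface is needed.
template <typename Event>
class WatchdogCore {
public:
    WatchdogCore(Event& event, std::function<void()> failFast) : event_(event) {
        event_.Schedule(std::move(failFast));  // arm the kill switch
    }
    ~WatchdogCore() { event_.Cancel(); }  // normal exit disarms it

private:
    Event& event_;
};

// Usage sketch: a unit test drives the core with the simulator instead of a real event.
inline int DemoFakeEvent() {
    FakeEvent event;
    int failFastCalls = 0;
    {
        WatchdogCore<FakeEvent> core(event, [&] { ++failFastCalls; });
        event.Signal();  // armed: the callback runs
    }
    event.Signal();  // disarmed after destruction: nothing happens
    assert(failFastCalls == 1);
    return failFastCalls;
}
```

A production build would instantiate the same template with an adapter over the real OS event, keeping that adapter thin enough that leaving it untested is a reasonable trade.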
Keep in mind that a production-ready watchdog implementation may need additional considerations like a good naming scheme and discoverability for the shared events, proper security so that unintended callers cannot gain unauthorized access to the kill switch, a “watcher for the watcher” to make sure the watchdog itself is monitored (perhaps via a heartbeat mechanism), and so on. There’s also a small chance that the kill switch will not actually succeed in terminating the app, in which case you would need to fall back to a “kill -9” strategy of last resort. Caveats aside, check out WatchdogSample and see how it might be a useful starting point for your scenarios.