Starting in .NET 4.5, managed code developers finally have a simple way to interact with ETW —
System.Diagnostics.Tracing.EventSource. For a quick introduction, check out Vance Morrison’s inaugural blog post on
EventSource and the also very helpful feature specification.
EventSource is an improvement over the prior art (using
EventProvider manually or via unwieldy generated code from MC.exe), it is not perfect. In particular:
- It forces the use of the singleton anti-pattern. If you construct more than one instance of the same
EventSource, you are likely to encounter problems.
- Most existing tools do not understand how to decode
EventSourcedoes support the generation of an instrumentation manifest from your code (and it will even send this manifest at runtime as part of the event stream). However, the current in-box tools generally only know about registered ETW providers. Getting traces in a human-readable form is therefore a bit of a challenge, but there are solutions which I will cover in a future post.
- In the initial version, it did not provide built-in support for activity/correlation IDs. This has been addressed in .NET 4.5.1 (e.g. see
WriteEventWithRelatedActivityId) but will remain a problem for those who can’t upgrade. You can still set activity IDs yourself, however, and it will propagate them, as I show in the sample referenced below.
But don’t let these issues dissuade you —
EventSource is worth a look.
I put together a complete sample to demonstrate some of the more challenging aspects of designing traces: EventSourceSample on GitHub. The sample is built around a WCF calculator service client (the canonical WCF sample). Below I describe a few important design points which may be of use to trace authors.
I limited the number of contract methods to keep things relatively succinct (add, subtract, square root), but you can imagine in the real world you may have dozens of different client operations. The temptation might be to instrument every single operation with events like
AddError but I don’t recommend this. Not only does it bloat your trace events, it makes writing automated analysis tools harder since they have to deal with dozens of distinct-but-really-the-same cases for start/end/error operations. Instead, as demonstrated in
CalculatorClientWithEvents, I have a single typed event only for the start of every operation. Wrapping around this, I have
CalculatorClientWithActivity which writes more generic client requests events for start, end, and error.
The activity wrapper also handles the mostly mechanical process of generating a correlation ID before a call, resuming it on completion, and (perhaps most importantly) restoring it outside these scopes. The correlation ID is a critical piece of the tracing equation since it allows you to stitch together related calls across contexts. In this case, it is essential since the client calls are asynchronous and there is really no other way to tell which start and end/error events are related.
With correlation IDs I can track distinct calls but I still can’t tell where the calls originated. There might be one client or 100 clients all invoking methods at the same time. To distinguish them, it is useful to provide a client ID in the event payload. Depending on the needs of the application, this can be a friendly name (string), an integer (perhaps auto-incremented from some factory class?), or a totally unique ID (GUID). I opted for a GUID in the sample since it has the benefit of being truly unique anywhere and fewer drawbacks: integers can’t easily be synchronized across clients in different processes or machines; strings also have uniqueness issues and if allowed to be arbitrary size, can quickly bloat trace files.
Keywords are an important filtering mechanism. In ETW, they group together broad categories of functionality within a provider. In the sample I define three keywords: “Basic” (add, subtract), “Advanced” (square root), “Request” (generic request start/end/error), and “Connection” (open/abort). These come in handy in situations where you want to generate a pre-filtered trace file with only the events likely to be useful for a specific analysis scenario (e.g. to do performance analysis of requests regardless of operation type you could just enable the “Request” keyword).
Choose event levels wisely. For most situations,
Informational is appropriate. Use
Verbose for events which are helpful for detailed debugging but would cause too much noise or performance degradation if always enabled.
Error are tricky; I use both of them for cases of unexpected runtime exceptions with the distinction that
Warning is more of a non-fatal or recoverable condition whereas
Error indicates true degraded functionality or imminent failure. For expected exceptions, use
Informational or even
Verbose so as not to litter your traces with “wolf cry” warnings and errors; these “ignorable” errors cause heartburn for others who aren’t clued in (i.e. basically everyone except the original developer). I rarely see
Critical in practice, but it could be the right choice for an event after which the app would immediately terminate.
Is it possible to test your traces? Yes, quite easily in fact when you use
EventSource. With an appropriately configured
EventListener implementation, you too can practice TDD for trace events. This gives you a nice safety net as you evolve your system (e.g. asserting that important traces will continue to be emitted in all tested code paths, despite later refactoring) and also saves you from mistakes in initial event definitions (which are quite easy to make due to a preponderance of copy/paste errors in such areas). Check out
ClientWithActivityTest for a starting point.
Check out the commit history for EventSourceSample to see how I built a complete ETW-instrumented app in steps with full TDD.