Previously, we determined that mixing EventSource and Span is costly. Now it’s just a matter of finding the lowest cost.
The last approach used an ArrayPool for all ETW strings. But remember that ETW events are capped at 64 KB. It would therefore be impossible to have Unicode string data exceeding 32,768 characters. Further, ETW is a synchronous technology. There is no async/await or thread-hopping going on here.
With these limitations in mind, we can use the infamous [ThreadStatic] to share a buffer. Each string we trace can copy its data into the buffer and increment the index for the next user. It’s a tricky pattern, and it “leaks” 64 KB per ETW thread, but we’ll do anything in the name of performance! We’ll also assume that we never have to deal with Spans that already include a null terminator.
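    using System;
    using System.Buffers;

    public ref struct EtwString
    {
        // A [ThreadStatic] field initializer runs only on the first thread that touches
        // the type, so the buffer is rented lazily, once per thread. It is never
        // returned to the pool (this is the intentional "leak").
        [ThreadStatic]
        private static char[] buffer;

        private readonly Span<char> dest;

        public EtwString(ReadOnlySpan<char> source, ref int index)
        {
            char[] buf = buffer ?? (buffer = ArrayPool<char>.Shared.Rent(32768));

            // Reserve room for the string data plus the null terminator.
            int len = source.Length + 1;
            this.dest = new Span<char>(buf, index, len);
            source.CopyTo(this.dest);
            this.dest[len - 1] = '\0';

            // Advance the shared index so the next string lands after this one.
            index += len;
        }

        // Size in bytes, including the null terminator.
        public int Size => 2 * this.dest.Length;

        public ReadOnlySpan<char> Span => this.dest;
    }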
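One general caveat with [ThreadStatic] is worth keeping in mind for this pattern: an inline field initializer runs only once, on the thread that first touches the type, so any other thread sees null in that field. The sketch below illustrates the difference between an inline initializer and lazy per-thread initialization. The names `ThreadStaticDemo`, `eager`, and `lazy` are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Threading;

public static class ThreadStaticDemo
{
    // Inline initializer: executed once by the type initializer, on one thread only.
    [ThreadStatic] private static char[] eager = new char[16];

    // No initializer: each thread creates its own array on first use.
    [ThreadStatic] private static char[] lazy;

    private static char[] Lazy => lazy ?? (lazy = new char[16]);

    // Reports what a freshly created thread observes in each field.
    public static (bool EagerIsNull, bool LazyIsNull) ObserveOnSecondThread()
    {
        // Read the field on the calling thread first, so the type initializer
        // has definitely run before the second thread starts.
        GC.KeepAlive(eager);

        bool eagerIsNull = false, lazyIsNull = false;
        var thread = new Thread(() =>
        {
            eagerIsNull = eager == null;
            lazyIsNull = Lazy == null;
        });
        thread.Start();
        thread.Join();
        return (eagerIsNull, lazyIsNull);
    }

    public static void Main()
    {
        var (eagerIsNull, lazyIsNull) = ObserveOnSecondThread();
        Console.WriteLine($"eager is null on second thread: {eagerIsNull}");
        Console.WriteLine($"lazy is null on second thread: {lazyIsNull}");
    }
}
```

On the second thread the eagerly initialized field is observed as null, while the lazy accessor hands back a per-thread array. A shared ETW buffer that may be used from more than one thread therefore needs the lazy form.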
Here is the subtly different calling code in our EventSource:
    [NonEvent]
    public unsafe int Hello(int n1, int n2, ReadOnlySpan<char> s1)
    {
        int i = 0;
        EtwString _s1z = new EtwString(s1, ref i);
        fixed (char* _s1 = _s1z.Span)
        {
            EventData* descrs = stackalloc EventData[3];
            descrs[0].DataPointer = (IntPtr)(&n1);
            descrs[0].Size = sizeof(int);
            descrs[1].DataPointer = (IntPtr)(&n2);
            descrs[1].Size = sizeof(int);
            descrs[2].DataPointer = (IntPtr)_s1;
            descrs[2].Size = _s1z.Size;
            return this.Write(1, null, 3, descrs);
        }
    }
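The `ref int index` parameter only pays off once an event carries more than one string. As a rough sketch of how two strings would share the per-thread buffer, the example below packs them back to back and reports the sizes an EventData entry would need. It makes some simplifying assumptions: `TwoStringsDemo` and `Pack` are hypothetical names, the buffer is a plain array rather than an ArrayPool rental, and no ETW event is actually written.

```csharp
using System;

public ref struct EtwString
{
    // Per-thread buffer, created lazily (plain array here to keep the sketch small).
    [ThreadStatic] private static char[] buffer;

    private readonly Span<char> dest;

    public EtwString(ReadOnlySpan<char> source, ref int index)
    {
        char[] buf = buffer ?? (buffer = new char[32768]);
        int len = source.Length + 1;                 // data plus null terminator
        this.dest = new Span<char>(buf, index, len); // throws if the buffer is exhausted
        source.CopyTo(this.dest);
        this.dest[len - 1] = '\0';
        index += len;                                // next string starts after this one
    }

    public int Size => 2 * this.dest.Length;
    public ReadOnlySpan<char> Span => this.dest;
}

public static class TwoStringsDemo
{
    // Packs two strings into the shared buffer using a single running index.
    public static (int CharsUsed, int Bytes1, int Bytes2, bool FirstIntact) Pack(
        ReadOnlySpan<char> s1, ReadOnlySpan<char> s2)
    {
        int i = 0;
        EtwString a = new EtwString(s1, ref i);
        EtwString b = new EtwString(s2, ref i);

        // The second copy must not have overwritten the first one.
        bool firstIntact = a.Span.Slice(0, s1.Length).SequenceEqual(s1);
        return (i, a.Size, b.Size, firstIntact);
    }

    public static void Main()
    {
        var result = Pack("hello".AsSpan(), "hi".AsSpan());
        Console.WriteLine($"chars used: {result.CharsUsed}");
        Console.WriteLine($"bytes: {result.Bytes1} and {result.Bytes2}");
        Console.WriteLine($"first string intact: {result.FirstIntact}");
    }
}
```

Because the index restarts at 0 on every call, each event reuses the buffer from the beginning. Nothing in the struct checks the 32,768-char limit explicitly; the Span constructor throws an ArgumentOutOfRangeException if the strings of one event do not fit.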
Now, for the benchmarks. We no longer care about “SpanZ” and we know the pre-allocated string case is constant cost, so now we’re just comparing substring allocation (StrAlloc) to our tricky EtwString (the Span rows):
| Method | Len | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------- |------ |------------:|----------:|----------:|-------:|------:|------:|----------:|
| StrAlloc | 10 | 23.03 ns | 0.492 ns | 0.673 ns | 0.0061 | - | - | 48 B |
| Span | 10 | 33.18 ns | 0.150 ns | 0.140 ns | - | - | - | - |
| StrAlloc | 100 | 37.69 ns | 0.425 ns | 0.397 ns | 0.0286 | - | - | 224 B |
| Span | 100 | 39.77 ns | 0.714 ns | 0.668 ns | - | - | - | - |
| StrAlloc | 1000 | 209.36 ns | 1.272 ns | 1.127 ns | 0.2580 | - | - | 2024 B |
| Span | 1000 | 86.85 ns | 0.569 ns | 0.533 ns | - | - | - | - |
| StrAlloc | 10000 | 1,929.48 ns | 40.373 ns | 37.765 ns | 2.5444 | - | - | 20024 B |
| Span | 10000 | 846.17 ns | 7.394 ns | 6.916 ns | - | - | - | - |
This is an improvement! The break-even point appears to be ~100 chars, and even for 10-char strings the overhead is only about 10 ns. Still, this ThreadStatic business seems a bit weird, and who likes intentional memory leaks?
We have at least one other option — our dear friend stackalloc. We’re already using it for the EventData array, so why not go all out? We just need a buffer big enough to hold the string data plus the null terminator:
    [NonEvent]
    public unsafe int Hello(int n1, int n2, ReadOnlySpan<char> s1)
    {
        int n = s1.Length;
        char* buffer = stackalloc char[n + 1];
        buffer[n] = '\0';
        fixed (char* _s1 = s1)
        {
            Buffer.MemoryCopy(_s1, buffer, 2 * n, 2 * n);
            EventData* descrs = stackalloc EventData[3];
            descrs[0].DataPointer = (IntPtr)(&n1);
            descrs[0].Size = sizeof(int);
            descrs[1].DataPointer = (IntPtr)(&n2);
            descrs[1].Size = sizeof(int);
            descrs[2].DataPointer = (IntPtr)buffer;
            descrs[2].Size = 2 * (n + 1);
            return this.Write(1, null, 3, descrs);
        }
    }
No more ref structs, no destination Span buffers; just good old-fashioned C-style memcpy patterns. Does the performance justify this pointer madness?
| Method | Len | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------- |------ |------------:|----------:|----------:|-------:|------:|------:|----------:|
| StrAlloc | 10 | 22.52 ns | 0.224 ns | 0.198 ns | 0.0061 | - | - | 48 B |
| Span | 10 | 14.74 ns | 0.172 ns | 0.144 ns | - | - | - | - |
| StrAlloc | 100 | 38.83 ns | 0.775 ns | 0.796 ns | 0.0286 | - | - | 224 B |
| Span | 100 | 23.32 ns | 0.291 ns | 0.258 ns | - | - | - | - |
| StrAlloc | 1000 | 212.57 ns | 4.108 ns | 3.642 ns | 0.2580 | - | - | 2024 B |
| Span | 1000 | 170.16 ns | 3.368 ns | 5.041 ns | - | - | - | - |
| StrAlloc | 10000 | 1,910.12 ns | 38.194 ns | 35.726 ns | 2.5444 | - | - | 20024 B |
| Span | 10000 | 1,708.65 ns | 32.994 ns | 36.673 ns | - | - | - | - |
Well, just barely. The stackalloc case is faster than ThreadStatic up to a point, but takes about twice as long for the larger (1000+ char) strings (170.16 ns vs. 86.85 ns at 1000 chars). Maybe Buffer.MemoryCopy is too slow? Let’s try using Span.CopyTo instead:
    [NonEvent]
    public unsafe int Hello(int n1, int n2, ReadOnlySpan<char> s1)
    {
        int n = s1.Length;
        char* buffer = stackalloc char[n + 1];

        // Span.CopyTo replaces Buffer.MemoryCopy from the previous version.
        s1.CopyTo(new Span<char>(buffer, n + 1));
        buffer[n] = '\0';

        // Left over from the previous version: _s1 is no longer read,
        // since the data already lives in the stack buffer.
        fixed (char* _s1 = s1)
        {
            EventData* descrs = stackalloc EventData[3];
            descrs[0].DataPointer = (IntPtr)(&n1);
            descrs[0].Size = sizeof(int);
            descrs[1].DataPointer = (IntPtr)(&n2);
            descrs[1].Size = sizeof(int);
            descrs[2].DataPointer = (IntPtr)buffer;
            descrs[2].Size = 2 * (n + 1);
            return this.Write(1, null, 3, descrs);
        }
    }
The results:
| Method | Len | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|--------- |------ |------------:|----------:|-----------:|-------:|------:|------:|----------:|
| StrAlloc | 10 | 23.73 ns | 0.506 ns | 0.449 ns | 0.0061 | - | - | 48 B |
| Span | 10 | 13.97 ns | 0.315 ns | 0.462 ns | - | - | - | - |
| StrAlloc | 100 | 39.16 ns | 0.845 ns | 1.264 ns | 0.0286 | - | - | 224 B |
| Span | 100 | 21.95 ns | 0.068 ns | 0.057 ns | - | - | - | - |
| StrAlloc | 1000 | 216.41 ns | 4.341 ns | 4.061 ns | 0.2580 | - | - | 2024 B |
| Span | 1000 | 136.79 ns | 0.966 ns | 0.807 ns | - | - | - | - |
| StrAlloc | 10000 | 2,034.16 ns | 40.512 ns | 100.888 ns | 2.5444 | - | - | 20024 B |
| Span | 10000 | 1,384.88 ns | 2.610 ns | 2.314 ns | - | - | - | - |
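This is measurably better. Kudos to the .NET Core team for their performance work! Still, stackalloc is roughly 60% slower than the ThreadStatic ArrayPool buffer for 1000+ chars (136.79 ns vs. 86.85 ns, and 1,384.88 ns vs. 846.17 ns). It is also dangerous: the allocation size comes straight from the input length, so a large enough string on a deep enough stack can cause an unrecoverable stack overflow.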
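One common way to bound the stack overflow risk, which this post does not benchmark, is to use stackalloc only below a fixed size and fall back to a pooled array above it. The sketch below shows just the buffer selection and the null-terminated copy. `GuardedCopy`, `CopyTerminated`, and the 256-char cutoff `MaxStackChars` are hypothetical, and the place where the ETW event would be written is marked with a comment.

```csharp
using System;
using System.Buffers;

public static class GuardedCopy
{
    // Hypothetical cutoff, chosen only for illustration; measure before picking a value.
    private const int MaxStackChars = 256;

    // Copies source plus a null terminator into a temporary buffer and returns
    // the size in bytes that an EventData entry would report.
    public static int CopyTerminated(ReadOnlySpan<char> source, out bool usedStack)
    {
        int len = source.Length + 1; // string data plus the null terminator
        usedStack = len <= MaxStackChars;

        // Rent from the pool only when the input is too large for the stack buffer.
        char[] rented = usedStack ? null : ArrayPool<char>.Shared.Rent(len);

        // The stack allocation has a constant size, so its cost is bounded.
        Span<char> dest = usedStack ? stackalloc char[MaxStackChars] : rented;
        try
        {
            dest = dest.Slice(0, len);
            source.CopyTo(dest);
            dest[len - 1] = '\0';

            // A real EventSource method would pin dest here and write the event.
            return dest.Length * sizeof(char);
        }
        finally
        {
            if (rented != null)
            {
                ArrayPool<char>.Shared.Return(rented);
            }
        }
    }

    public static void Main()
    {
        int small = CopyTerminated("abc".AsSpan(), out bool smallOnStack);
        int large = CopyTerminated(new string('x', 1000).AsSpan(), out bool largeOnStack);
        Console.WriteLine($"small: {small} bytes, on stack: {smallOnStack}");
        Console.WriteLine($"large: {large} bytes, on stack: {largeOnStack}");
    }
}
```

This trades the unbounded stack usage for a rent/return pair on large inputs, which is the same cost the earlier ArrayPool approach paid, so whether it beats the ThreadStatic buffer would have to be measured.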
What is the final verdict for EventSource with Spans? All things being equal, I am tempted to recommend EtwString with a leaked ThreadStatic ArrayPool buffer. Of course, this is a very tentative result based solely on microbenchmarks, which cannot tell the whole story. Take these conclusions as guidelines and always measure in context.