EventSource and Span: take 2

Spread the love

Last time, we talked about how EventSource and Span don’t really mix. ETW wants a null-terminated string for tracing and Span isn’t set up for that.

It turns out this null terminator issue has been explored in some depth by Eric Sink. He even has a solution for his particular use case in a custom utf8z struct. Still, it requires somehow allocating a string with a null terminator. Is there any hope of a zero allocation solution?

Well, one way to get amortized zero allocation would be a pooled allocation strategy like ArrayPool. We could ask for a buffer, copy the Span contents, and return it when we’re done. Here is a struct EtwString that can do this for us:

    public ref struct EtwString
    {
        private readonly ReadOnlySpan<char> source;
        private readonly char[] dest;

        public EtwString(ReadOnlySpan<char> source)
        {
            this.source = source;
            int len = this.source.Length;
            if ((len == 0) || (this.source[len - 1] != '\0'))
            {
                char[] buf = ArrayPool<char>.Shared.Rent(++len);
                this.source.CopyTo(buf);
                buf[len - 1] = '\0';
                this.dest = buf;
            }
            else
            {
                this.dest = null;
            }

            this.Size = len * 2;
        }

        public int Size { get; }

        public ReadOnlySpan<char> Span => this.dest ?? this.source;

        public void Dispose()
        {
            if (this.dest != null)
            {
                ArrayPool<char>.Shared.Return(this.dest);
            }
        }
    }

This is a very similar strategy used by .NET Core itself in ValueUtf8Converter. The implementation above also takes of the (exceedingly rare) case that we were somehow given a null-terminated Span. Note that we cannot actually use IDisposable but rather just define a Dispose method because ref structs cannot implement interfaces.

Now, if we want to use this in our EventSource and also benchmark its performance, we’re going to have to make some changes:

        [NonEvent]
        public unsafe int Hello(int n1, int n2, ReadOnlySpan<char> s1)
        {
            using EtwString _s1z = new EtwString(s1);
            fixed (char* _s1 = _s1z.Span)
            {
                EventData* descrs = stackalloc EventData[3];
                descrs[0].DataPointer = (IntPtr)(&n1);
                descrs[0].Size = sizeof(int);
                descrs[1].DataPointer = (IntPtr)(&n2);
                descrs[1].Size = sizeof(int);
                descrs[2].DataPointer = (IntPtr)_s1;
                descrs[2].Size = _s1z.Size;
                return this.Write(1, null, 3, descrs);
            }
        }

        [Event(1, Message = "Hello {0} {1} {2}")]
        public unsafe int Hello(int n1, int n2, string s1)
        {
            s1 ??= string.Empty;
            fixed (char* _s1 = s1)
            {
                EventData* descrs = stackalloc EventData[3];
                descrs[0].DataPointer = (IntPtr)(&n1);
                descrs[0].Size = sizeof(int);
                descrs[1].DataPointer = (IntPtr)(&n2);
                descrs[1].Size = sizeof(int);
                descrs[2].DataPointer = (IntPtr)_s1;
                descrs[2].Size = (s1.Length + 1) * 2;
                return this.Write(1, null, 3, descrs);
            }
        }

        private unsafe int Write(ushort eventId, Guid* relatedActivityId, byte eventDataCount, EventData* data)
        {
            // Actually writing to ETW will interfere with the benchmark
            // this.WriteEventWithRelatedActivityIdCore(eventId, relatedActivityId, eventDataCount, data);
            int total = 0;
            for (int i = 0; i < eventDataCount; ++i)
            {
                total += data[i].Size;
            }

            return total;
        }

In particular, we have gotten rid of the IsEnabled check (since it never will be during our benchmark) and avoided writing to ETW in favor of doing some basic calculations over the EventData. The benchmarks will cover four cases — a cached string, an allocated substring, a normal sliced Span, and a null-terminated Span (“SpanZ”):

    [SimpleJob(RuntimeMoniker.NetCoreApp31)]
    [MemoryDiagnoser]
    public class EtwBenchmark
    {
        private string str;
        private string cached;
        private string cachedZ;

        [Params(10, 100, 1000, 10000)]
        public int Len { get; set; }

        [GlobalSetup]
        public void Setup()
        {
            this.str = new string('x', 32767);
            this.cached = this.str.Substring(0, this.Len);
            this.cachedZ = this.cached + '\0';
        }

        [Benchmark]
        public int StrCached()
        {
            return MyEventSource.Instance.Hello(1, 2, this.cached);
        }

        [Benchmark]
        public int StrAlloc()
        {
            string s = this.str.Substring(0, this.Len);
            return MyEventSource.Instance.Hello(1, 2, s);
        }

        [Benchmark]
        public int Span()
        {
            ReadOnlySpan<char> s = this.str.AsSpan(0, this.Len);
            return MyEventSource.Instance.Hello(1, 2, s);
        }

        [Benchmark]
        public int SpanZ()
        {
            ReadOnlySpan<char> s = this.cachedZ.AsSpan();
            return MyEventSource.Instance.Hello(1, 2, s);
        }
    }

The benchmark results:

|    Method |   Len |         Mean |      Error |     StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------- |------ |-------------:|-----------:|-----------:|-------:|------:|------:|----------:|
| StrCached |    10 |     5.953 ns |  0.0429 ns |  0.0359 ns |      - |     - |     - |         - |
|  StrAlloc |    10 |    23.154 ns |  0.4894 ns |  0.5026 ns | 0.0061 |     - |     - |      48 B |
|      Span |    10 |    67.104 ns |  0.3162 ns |  0.2803 ns |      - |     - |     - |         - |
|     SpanZ |    10 |    26.225 ns |  0.0798 ns |  0.0666 ns |      - |     - |     - |         - |
| StrCached |   100 |     6.551 ns |  0.0689 ns |  0.0611 ns |      - |     - |     - |         - |
|  StrAlloc |   100 |    38.310 ns |  0.7810 ns |  0.7671 ns | 0.0286 |     - |     - |     224 B |
|      Span |   100 |    77.672 ns |  0.6171 ns |  0.5153 ns |      - |     - |     - |         - |
|     SpanZ |   100 |    26.091 ns |  0.1227 ns |  0.1148 ns |      - |     - |     - |         - |
| StrCached |  1000 |     5.948 ns |  0.0605 ns |  0.0566 ns |      - |     - |     - |         - |
|  StrAlloc |  1000 |   210.800 ns |  3.2663 ns |  3.0553 ns | 0.2580 |     - |     - |    2024 B |
|      Span |  1000 |   122.370 ns |  1.0371 ns |  0.8660 ns |      - |     - |     - |         - |
|     SpanZ |  1000 |    26.298 ns |  0.1379 ns |  0.1289 ns |      - |     - |     - |         - |
| StrCached | 10000 |     6.528 ns |  0.0859 ns |  0.0803 ns |      - |     - |     - |         - |
|  StrAlloc | 10000 | 1,935.778 ns | 37.2905 ns | 53.4809 ns | 2.5444 |     - |     - |   20024 B |
|      Span | 10000 |   880.727 ns |  7.4884 ns |  7.0046 ns |      - |     - |     - |         - |
|     SpanZ | 10000 |    26.551 ns |  0.2645 ns |  0.2474 ns |      - |     - |     - |         - |

Clearly, a pre-allocated string is the way to go. It has by far the lowest overhead, not surprisingly. But if all you have is a Span, the ArrayPool strategy wins out if you need a few hundred characters. Perhaps counterintuitively, it is better to take the heap allocation hit for shorter substrings — at least in this microbenchmark scenario. Of course, a null-terminated Span has good performance all around, but that’s not likely to be an option in real life.

So is it just not worth it to mix EventSource and Span? The numbers don’t look good, but maybe there are a few more things we can try…

Leave a Reply

Your email address will not be published. Required fields are marked *