{"id":5705,"date":"2020-01-08T15:00:20","date_gmt":"2020-01-08T15:00:20","guid":{"rendered":"http:\/\/writeasync.net\/?p=5705"},"modified":"2020-01-05T17:34:14","modified_gmt":"2020-01-05T17:34:14","slug":"eventsource-and-span-take-3","status":"publish","type":"post","link":"http:\/\/writeasync.net\/?p=5705","title":{"rendered":"EventSource and Span: take 3"},"content":{"rendered":"<p>Previously, we determined that <a href=\"writeasync.net\/?p=5700\">mixing EventSource and Span is costly<\/a>. Now it&#8217;s just a matter of finding the lowest cost.<\/p>\n<p>The last approach used an ArrayPool for all ETW strings. But remember that <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows\/win32\/etw\/about-event-tracing#missing-events\">ETW events are capped at 64 KB<\/a>. It would therefore be impossible to have Unicode string data exceeding 32,768 characters. Further, ETW is a synchronous technology. There is no async\/await or <a href=\"https:\/\/spring.io\/blog\/2019\/12\/13\/flight-of-the-flux-3-hopping-threads-and-schedulers\">thread-hopping<\/a> going on here.<\/p>\n<p>With these limitations in mind, we can use the infamous <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.threadstaticattribute\">[ThreadStatic]<\/a> to share a buffer. Each string we trace can copy its data into the buffer and increment the index for the next user. It&#8217;s a tricky pattern, and it &#8220;leaks&#8221; 64KB per ETW thread, but we&#8217;ll do anything in the name of performance! We&#8217;ll also assume that we never have to deal with already-terminated Spans.<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n    public ref struct EtwString\r\n    {\r\n        &#x5B;ThreadStatic]\r\n        private static readonly char&#x5B;] Buffer = ArrayPool&lt;char&gt;.Shared.Rent(32768);\r\n\r\n        private readonly Span&lt;char&gt; dest;\r\n\r\n        public EtwString(ReadOnlySpan&lt;char&gt; source, ref int index)\r\n        {\r\n            int len = source.Length + 1;\r\n            this.dest = new Span&lt;char&gt;(Buffer, index, len);\r\n            source.CopyTo(dest);\r\n            this.dest&#x5B;len - 1] = '&#92;&#48;';\r\n            index += len;\r\n        }\r\n\r\n        public int Size =&gt; 2 * this.dest.Length;\r\n\r\n        public ReadOnlySpan&lt;char&gt; Span =&gt; this.dest;\r\n    }\r\n<\/pre>\n<p>Here is the subtly different calling code in our EventSource:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n        &#x5B;NonEvent]\r\n        public unsafe int Hello(int n1, int n2, ReadOnlySpan&lt;char&gt; s1)\r\n        {\r\n            int i = 0;\r\n            EtwString _s1z = new EtwString(s1, ref i);\r\n            fixed (char* _s1 = _s1z.Span)\r\n            {\r\n                EventData* descrs = stackalloc EventData&#x5B;3];\r\n                descrs&#x5B;0].DataPointer = (IntPtr)(&amp;n1);\r\n                descrs&#x5B;0].Size = sizeof(int);\r\n                descrs&#x5B;1].DataPointer = (IntPtr)(&amp;n2);\r\n                descrs&#x5B;1].Size = sizeof(int);\r\n                descrs&#x5B;2].DataPointer = (IntPtr)_s1;\r\n                descrs&#x5B;2].Size = _s1z.Size;\r\n                return this.Write(1, null, 3, descrs);\r\n            }\r\n        }\r\n<\/pre>\n<p>Now, for the benchmarks. We no longer care about &#8220;SpanZ&#8221; and we know the pre-allocated string case is constant cost, so now we&#8217;re just comparing substring allocation to our tricky <code>EtwString<\/code>:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n|   Method |   Len |        Mean |     Error |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |\r\n|--------- |------ |------------:|----------:|----------:|-------:|------:|------:|----------:|\r\n| StrAlloc |    10 |    23.03 ns |  0.492 ns |  0.673 ns | 0.0061 |     - |     - |      48 B |\r\n|     Span |    10 |    33.18 ns |  0.150 ns |  0.140 ns |      - |     - |     - |         - |\r\n| StrAlloc |   100 |    37.69 ns |  0.425 ns |  0.397 ns | 0.0286 |     - |     - |     224 B |\r\n|     Span |   100 |    39.77 ns |  0.714 ns |  0.668 ns |      - |     - |     - |         - |\r\n| StrAlloc |  1000 |   209.36 ns |  1.272 ns |  1.127 ns | 0.2580 |     - |     - |    2024 B |\r\n|     Span |  1000 |    86.85 ns |  0.569 ns |  0.533 ns |      - |     - |     - |         - |\r\n| StrAlloc | 10000 | 1,929.48 ns | 40.373 ns | 37.765 ns | 2.5444 |     - |     - |   20024 B |\r\n|     Span | 10000 |   846.17 ns |  7.394 ns |  6.916 ns |      - |     - |     - |         - |\r\n<\/pre>\n<p>This is an improvement! The break-even point appears to be ~100 chars, though the overhead is not far off even in the small string case. Still, this ThreadStatic business seems a bit weird, and who likes <a href=\"https:\/\/blogs.msdn.microsoft.com\/stephen_griffin\/2006\/11\/03\/the-intentional-memory-leak\/\">intentional memory leaks<\/a>?<\/p>\n<p>We have at least one other option &#8212; our dear friend <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/csharp\/language-reference\/operators\/stackalloc\">stackalloc<\/a>. We&#8217;re already using it for the EventData array, so why not go all out? We just need a buffer big enough to hold the string data plus the null terminator:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n        &#x5B;NonEvent]\r\n        public unsafe int Hello(int n1, int n2, ReadOnlySpan&lt;char&gt; s1)\r\n        {\r\n            int n = s1.Length;\r\n            char* buffer = stackalloc char&#x5B;n + 1];\r\n            buffer&#x5B;n] = '&#92;&#48;';\r\n            fixed (char* _s1 = s1)\r\n            {\r\n                Buffer.MemoryCopy(_s1, buffer, 2 * n, 2 * n);\r\n                EventData* descrs = stackalloc EventData&#x5B;3];\r\n                descrs&#x5B;0].DataPointer = (IntPtr)(&amp;n1);\r\n                descrs&#x5B;0].Size = sizeof(int);\r\n                descrs&#x5B;1].DataPointer = (IntPtr)(&amp;n2);\r\n                descrs&#x5B;1].Size = sizeof(int);\r\n                descrs&#x5B;2].DataPointer = (IntPtr)buffer;\r\n                descrs&#x5B;2].Size = 2 * (n + 1);\r\n                return this.Write(1, null, 3, descrs);\r\n            }\r\n        }\r\n<\/pre>\n<p>No more ref structs, no destination Span buffers; just good old fashioned <a href=\"http:\/\/www.cplusplus.com\/reference\/cstring\/memcpy\/\">C-style memcpy<\/a> patterns. Does the performance justify this pointer madness?<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n|   Method |   Len |        Mean |     Error |    StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |\r\n|--------- |------ |------------:|----------:|----------:|-------:|------:|------:|----------:|\r\n| StrAlloc |    10 |    22.52 ns |  0.224 ns |  0.198 ns | 0.0061 |     - |     - |      48 B |\r\n|     Span |    10 |    14.74 ns |  0.172 ns |  0.144 ns |      - |     - |     - |         - |\r\n| StrAlloc |   100 |    38.83 ns |  0.775 ns |  0.796 ns | 0.0286 |     - |     - |     224 B |\r\n|     Span |   100 |    23.32 ns |  0.291 ns |  0.258 ns |      - |     - |     - |         - |\r\n| StrAlloc |  1000 |   212.57 ns |  4.108 ns |  3.642 ns | 0.2580 |     - |     - |    2024 B |\r\n|     Span |  1000 |   170.16 ns |  3.368 ns |  5.041 ns |      - |     - |     - |         - |\r\n| StrAlloc | 10000 | 1,910.12 ns | 38.194 ns | 35.726 ns | 2.5444 |     - |     - |   20024 B |\r\n|     Span | 10000 | 1,708.65 ns | 32.994 ns | 36.673 ns |      - |     - |     - |         - |\r\n<\/pre>\n<p>Well, just barely. The stackalloc case is faster than ThreadStatic up to a point, but is about 50% slower in the larger (1000+ char) strings. Maybe <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.buffer.memorycopy\">Buffer.MemoryCopy<\/a> is too slow? Let&#8217;s try using <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.span-1.copyto?view=netcore-3.1\">Span.CopyTo<\/a> instead:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n        &#x5B;NonEvent]\r\n        public unsafe int Hello(int n1, int n2, ReadOnlySpan&lt;char&gt; s1)\r\n        {\r\n            int n = s1.Length;\r\n            char* buffer = stackalloc char&#x5B;n + 1];\r\n            s1.CopyTo(new Span&lt;char&gt;(buffer, n + 1));\r\n            buffer&#x5B;n] = '&#92;&#48;';\r\n            fixed (char* _s1 = s1)\r\n            {\r\n                EventData* descrs = stackalloc EventData&#x5B;3];\r\n                descrs&#x5B;0].DataPointer = (IntPtr)(&amp;n1);\r\n                descrs&#x5B;0].Size = sizeof(int);\r\n                descrs&#x5B;1].DataPointer = (IntPtr)(&amp;n2);\r\n                descrs&#x5B;1].Size = sizeof(int);\r\n                descrs&#x5B;2].DataPointer = (IntPtr)buffer;\r\n                descrs&#x5B;2].Size = 2 * (n + 1);\r\n                return this.Write(1, null, 3, descrs);\r\n            }\r\n        }\r\n<\/pre>\n<p>The results:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n|   Method |   Len |        Mean |     Error |     StdDev |  Gen 0 | Gen 1 | Gen 2 | Allocated |\r\n|--------- |------ |------------:|----------:|-----------:|-------:|------:|------:|----------:|\r\n| StrAlloc |    10 |    23.73 ns |  0.506 ns |   0.449 ns | 0.0061 |     - |     - |      48 B |\r\n|     Span |    10 |    13.97 ns |  0.315 ns |   0.462 ns |      - |     - |     - |         - |\r\n| StrAlloc |   100 |    39.16 ns |  0.845 ns |   1.264 ns | 0.0286 |     - |     - |     224 B |\r\n|     Span |   100 |    21.95 ns |  0.068 ns |   0.057 ns |      - |     - |     - |         - |\r\n| StrAlloc |  1000 |   216.41 ns |  4.341 ns |   4.061 ns | 0.2580 |     - |     - |    2024 B |\r\n|     Span |  1000 |   136.79 ns |  0.966 ns |   0.807 ns |      - |     - |     - |         - |\r\n| StrAlloc | 10000 | 2,034.16 ns | 40.512 ns | 100.888 ns | 2.5444 |     - |     - |   20024 B |\r\n|     Span | 10000 | 1,384.88 ns |  2.610 ns |   2.314 ns |      - |     - |     - |         - |\r\n<\/pre>\n<p>This is measurably better. Kudos to the .NET Core team for <a href=\"https:\/\/github.com\/dotnet\/corefx\/issues\/16840\">their performance work<\/a>! Still, stackalloc is slower than the ArrayPool case for 1000+ chars. It is also dangerous and can lead to <a href=\"https:\/\/github.com\/dotnet\/corefx\/issues\/26954\">unrecoverable stack overflows<\/a>.<\/p>\n<p>What is the final verdict for EventSource with Spans? All things being equal, I am tempted to recommend EtwString with a leaked ThreadStatic ArrayPool buffer. Of course, this is a very tentative result based solely on <a href=\"https:\/\/stackoverflow.com\/questions\/2842695\/what-is-microbenchmarking\/2842707#2842707\">microbenchmarks which cannot tell the whole story<\/a>. Take these conclusions as guidelines and always measure in context.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Previously, we determined that mixing EventSource and Span is costly. Now it&#8217;s just a matter of finding the lowest cost. The last approach used an ArrayPool for all ETW strings. But remember that ETW events are capped at 64 KB.&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[71,104],"tags":[],"class_list":["post-5705","post","type-post","status-publish","format-standard","hentry","category-diagnostics","category-performance"],"_links":{"self":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5705","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5705"}],"version-history":[{"count":1,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5705\/revisions"}],"predecessor-version":[{"id":5706,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5705\/revisions\/5706"}],"wp:attachment":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5705"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5705"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5705"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}