{"id":5668,"date":"2019-12-28T15:00:20","date_gmt":"2019-12-28T15:00:20","guid":{"rendered":"http:\/\/writeasync.net\/?p=5668"},"modified":"2019-12-26T18:12:55","modified_gmt":"2019-12-26T18:12:55","slug":"high-performance-datagramming-heap","status":"publish","type":"post","link":"https:\/\/writeasync.net\/?p=5668","title":{"rendered":"High performance datagramming: heap?"},"content":{"rendered":"<p>Last time, we looked at an implementation of a <a href=\"http:\/\/writeasync.net\/?p=5662\">datagram server with and without channels<\/a>. The non-channel solution ultimately won out in terms of performance. But that&#8217;s not all we can do to squeeze every ounce of throughput from this simple program.<\/p>\n<p>First, let&#8217;s reestablish the baseline. I am not performing these experiments in any clean room lab &#8212; rather my home computer with a background workload of web browsers, Visual Studio instances, and so on &#8212; so I can&#8217;t trust any numbers more than a few minutes old. 
Here is the current server code for reference:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\nusing System;\r\nusing System.Collections.Generic;\r\nusing System.Net;\r\nusing System.Net.Sockets;\r\nusing System.Text;\r\nusing System.Threading;\r\nusing System.Threading.Tasks;\r\n\r\ninternal static class DatagramServer\r\n{\r\n    public static IEnumerable&lt;Task&gt; Start(MessageCount count, ushort port, CancellationToken token)\r\n    {\r\n        yield return TaskEx.RunAsync(nameof(ReceiveAsync), (count, port), token, (x, t) =&gt; ReceiveAsync(x, t));\r\n    }\r\n\r\n    private static async Task ReceiveAsync((MessageCount, ushort) x, CancellationToken token)\r\n    {\r\n        MessageCount count = x.Item1;\r\n        ushort port = x.Item2;\r\n        using Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n        socket.Bind(new IPEndPoint(IPAddress.Loopback, port));\r\n\r\n        while (!token.IsCancellationRequested)\r\n        {\r\n            byte&#x5B;] buffer = new byte&#x5B;64];\r\n            int size = await socket.ReceiveAsync(new Memory&lt;byte&gt;(buffer), SocketFlags.None, token);\r\n            string message = Encoding.UTF8.GetString(buffer, 0, size);\r\n            count.Increment();\r\n            int checksum = Fletcher32(message);\r\n            if (checksum == 0x12345678)\r\n            {\r\n                Console.WriteLine(&quot;!&quot;);\r\n            }\r\n        }\r\n    }\r\n\r\n    private static int Fletcher32(string data)\r\n    {\r\n        int sum1 = 0;\r\n        int sum2 = 0;\r\n        foreach (char c in data)\r\n        {\r\n            sum1 = (sum1 + c) % 0xFFFF;\r\n            sum2 = (sum2 + sum1) % 0xFFFF;\r\n        }\r\n\r\n        return (sum2 &lt;&lt; 16) | sum1;\r\n    }\r\n}\r\n<\/pre>\n<p>Today I am seeing approximately 144K\/sec processing rate:<br \/>\n<a 
href=\"http:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-channel-2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-channel-2-300x180.png\" alt=\"Datagram without channel (~144K\/sec)\" width=\"300\" height=\"180\" class=\"alignnone size-medium wp-image-5670\" srcset=\"https:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-channel-2-300x180.png 300w, https:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-channel-2.png 481w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>This is certainly respectable performance but even a cursory glance at the receive code shows some obvious optimization opportunities. Of course, I am talking about <a href=\"https:\/\/blog.maartenballiauw.be\/post\/2016\/10\/19\/making-net-code-less-allocatey-garbage-collector.html\">avoiding heap allocation<\/a>. For every received datagram, there are at least two heap allocations here within our control. The first is clear &#8212; we are reallocating the byte buffer every time. The second one is more subtle as there is no <code>new<\/code> keyword &#8212; the call to <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.text.encoding.getstring?view=netcore-3.1#System_Text_Encoding_GetString_System_Byte___System_Int32_System_Int32_\">Encoding.GetString<\/a> is indeed allocating a new string object on every call. Luckily, those .NET architects thought of everything and provided another method <a href=\"https:\/\/docs.microsoft.com\/en-us\/dotnet\/api\/system.text.encoding.getchars?view=netcore-3.1#System_Text_Encoding_GetChars_System_Byte___System_Int32_System_Int32_System_Char___System_Int32_\">Encoding.GetChars<\/a> which allows for a user-provided buffer to store the characters. 
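<\/p>\n<p>As a quick sanity check, one can measure the per-call allocations directly. This is just a sketch (not part of the original benchmark); it assumes .NET Core 3.0 or later, where <code>GC.GetAllocatedBytesForCurrentThread<\/code> is publicly available:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n// Sketch: compare allocations of GetString vs. GetChars into a reused buffer.\r\nbyte&#x5B;] bytes = Encoding.UTF8.GetBytes(&quot;hello&quot;);\r\nchar&#x5B;] chars = new char&#x5B;64];\r\n\r\nlong before = GC.GetAllocatedBytesForCurrentThread();\r\nstring s = Encoding.UTF8.GetString(bytes, 0, bytes.Length); // allocates a new string\r\nlong withString = GC.GetAllocatedBytesForCurrentThread() - before;\r\n\r\nbefore = GC.GetAllocatedBytesForCurrentThread();\r\nint length = Encoding.UTF8.GetChars(bytes, 0, bytes.Length, chars, 0); // fills the existing array\r\nlong withChars = GC.GetAllocatedBytesForCurrentThread() - before;\r\n\r\nConsole.WriteLine($&quot;GetString: {withString} bytes; GetChars: {withChars} bytes&quot;);\r\n<\/pre>\n<p>On my machine the <code>GetChars<\/code> path reports zero bytes, confirming the reasoning above. 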
Addressing these two points will result in essentially <a href=\"https:\/\/lukasmestan.com\/simple-guide-to-big-o-notation\/\">O(1)<\/a> memory usage.<\/p>\n<p>Here is the new implementation:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n    private static async Task ReceiveAsync((MessageCount, ushort) x, CancellationToken token)\r\n    {\r\n        MessageCount count = x.Item1;\r\n        ushort port = x.Item2;\r\n        using Socket socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);\r\n        socket.Bind(new IPEndPoint(IPAddress.Loopback, port));\r\n\r\n        byte&#x5B;] bytes = new byte&#x5B;64];\r\n        Memory&lt;byte&gt; buffer = new Memory&lt;byte&gt;(bytes);\r\n        char&#x5B;] chars = new char&#x5B;64];\r\n\r\n        while (!token.IsCancellationRequested)\r\n        {\r\n            int size = await socket.ReceiveAsync(buffer, SocketFlags.None, token);\r\n            int length = Encoding.UTF8.GetChars(bytes, 0, size, chars, 0);\r\n            count.Increment();\r\n            int checksum = Fletcher32(new ReadOnlySpan&lt;char&gt;(chars, 0, length));\r\n            if (checksum == 0x12345678)\r\n            {\r\n                Console.WriteLine(&quot;!&quot;);\r\n            }\r\n        }\r\n    }\r\n\r\n    private static int Fletcher32(ReadOnlySpan&lt;char&gt; data)\r\n    {\r\n        int sum1 = 0;\r\n        int sum2 = 0;\r\n        foreach (char c in data)\r\n        {\r\n            sum1 = (sum1 + c) % 0xFFFF;\r\n            sum2 = (sum2 + sum1) % 0xFFFF;\r\n        }\r\n\r\n        return (sum2 &lt;&lt; 16) | sum1;\r\n    }\r\n<\/pre>\n<p>The numbers show that this is an overall win &#8212; 2% improvement to ~147K\/sec:<br \/>\n<a href=\"http:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-heap-2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-heap-2-300x180.png\" alt=\"Datagram 
without heap alloc (~147K\/sec)\" width=\"300\" height=\"180\" class=\"alignnone size-medium wp-image-5671\" srcset=\"https:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-heap-2-300x180.png 300w, https:\/\/writeasync.net\/wp-content\/uploads\/2019\/12\/Datagram-without-heap-2.png 480w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>The moral of the story: when working on high performance code, you have to be like <a href=\"https:\/\/github.com\/davidfowl\">David Fowler<\/a> and <a href=\"https:\/\/twitter.com\/davidfowl\/status\/1099925353101877248\">see the allocations<\/a>. Don&#8217;t make the GC do unnecessary work!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last time, we looked at an implementation of a datagram server with and without channels. The non-channel solution ultimately won out in terms of performance. But that&#8217;s not all we can do to squeeze every ounce of throughput from this&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21,104],"tags":[],"class_list":["post-5668","post","type-post","status-publish","format-standard","hentry","category-async","category-performance"],"_links":{"self":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5668","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5668"}],"version-history":[{"count":3,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5668\/revisions"}],"predecessor-version":[{"id":5673,
"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5668\/revisions\/5673"}],"wp:attachment":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5668"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5668"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5668"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}