{"id":5892,"date":"2023-10-04T07:00:33","date_gmt":"2023-10-04T14:00:33","guid":{"rendered":"http:\/\/writeasync.net\/?p=5892"},"modified":"2023-10-03T21:53:59","modified_gmt":"2023-10-04T04:53:59","slug":"digits-faster-in-practice","status":"publish","type":"post","link":"https:\/\/writeasync.net\/?p=5892","title":{"rendered":"Digits: faster in practice?"},"content":{"rendered":"<p>While everyone loves <a href=\"http:\/\/writeasync.net\/?p=5884\">theoretical performance improvements<\/a>, we need to check back in with reality from time to time. To review, we had proposed a near-zero allocation strategy to help improve the speed of a <a href=\"http:\/\/writeasync.net\/?p=5881\">Digits game solver<\/a>. After all, heap allocation leads to <a href=\"https:\/\/michaelscodingspot.com\/avoid-gc-pressure\/\">GC pressure<\/a> which leads to <a href=\"https:\/\/blog.ycrash.io\/2021\/06\/24\/how-many-millions-of-dollars-enterprises-waste-due-to-garbage-collection\/\">millions of dollars wasted<\/a>! Well, potentially. Let&#8217;s try a few code optimizations and see how these &#8220;improvements&#8221; fare in practice.<\/p>\n<p>Recall that our initial benchmark clocked the solver at about 296 ms and 715 MB of allocations:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n| Method |     Mean |   Error |  StdDev |        Gen0 | Allocated |\r\n|------- |---------:|--------:|--------:|------------:|----------:|\r\n|  Solve | 296.0 ms | 1.29 ms | 1.21 ms | 175000.0000 | 714.83 MB |\r\n<\/pre>\n<p>Let&#8217;s start with the observation that any Digits solution would be the result of the immediately preceding move. So why not check right away if <code>TryMove<\/code> yields our target number? That should eliminate extra work in going back to the board and scanning for the solution again. 
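<\/p>\n<p>As a minimal sketch of that idea (the names here are invented for illustration; the real <code>Board.TryMove<\/code> signature differs), the move application itself can report a solution:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\npublic static class EarlyExitSketch\r\n{\r\n    \/\/ Hypothetical stand-in for Board.TryMove: combine two board\r\n    \/\/ values and compare the result to the target right away,\r\n    \/\/ instead of re-scanning the board for the target afterward.\r\n    public static bool MoveSolves(int a, int b, char op, int target, out int result)\r\n    {\r\n        result = op switch\r\n        {\r\n            '+' =&gt; a + b,\r\n            '-' =&gt; a - b,\r\n            '*' =&gt; a * b,\r\n            _ =&gt; 0,\r\n        };\r\n        return result == target;\r\n    }\r\n}\r\n<\/pre>\n<p>For example, <code>MoveSolves(19, 6, '+', 25, out _)<\/code> returns <code>true<\/code>, so the solver can stop searching at that point. 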
Here is the code change: <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/97587ed2b24578e53cd47730c27d692eb30a19b3\">Board &#8211; TryMove with target<\/a>. Our new benchmark numbers:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |\r\n|------- |---------:|--------:|--------:|-----------:|----------:|\r\n|  Solve | 263.5 ms | 4.48 ms | 4.19 ms | 83000.0000 | 665.93 MB |\r\n<\/pre>\n<p>Not bad, we gained more than a 10% speed boost and reduced allocation by 7%! But we have even more low-hanging fruit; let&#8217;s <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/7d73a4fa06d4c4765c19dda7c93a08b22b4926e4\">change <code>Move<\/code> to a readonly (record) struct<\/a>, as it is a trivial value-like type anyway:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |\r\n|------- |---------:|--------:|--------:|-----------:|----------:|\r\n|  Solve | 242.5 ms | 4.83 ms | 5.16 ms | 65333.3333 | 522.22 MB |\r\n<\/pre>\n<p>Another 8% speed improvement, and 21% memory reduction! This is fun! Let&#8217;s follow suit with what you would think is the biggest memory hog, and <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/a8f8482a2139301039fadb7850bfe3ba295534a0\">change Board to a readonly struct<\/a>:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |\r\n|------- |---------:|--------:|--------:|-----------:|----------:|\r\n|  Solve | 228.2 ms | 4.49 ms | 4.81 ms | 60000.0000 | 479.33 MB |\r\n<\/pre>\n<p>A modest 6% gain in execution speed, with 9% less in allocation. 
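<\/p>\n<p>Both struct conversions follow the same recipe: a small, immutable reference type becomes a value type, so the countless short-lived instances created during the search stay off the GC heap. A hypothetical before-and-after (the real <code>Move<\/code> fields may differ):<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\n\/\/ Before (illustrative shape): every Move is a heap allocation.\r\npublic sealed class MoveBefore\r\n{\r\n    public MoveBefore(int left, int right, char op) =&gt; (Left, Right, Op) = (left, right, op);\r\n\r\n    public int Left { get; }\r\n\r\n    public int Right { get; }\r\n\r\n    public char Op { get; }\r\n}\r\n\r\n\/\/ After: a readonly record struct is copied by value (typically on\r\n\/\/ the stack), with compiler-generated value equality for free.\r\npublic readonly record struct Move(int Left, int Right, char Op);\r\n<\/pre>\n<p>The same trick applies to <code>Board<\/code>, though a struct that still holds a reference to a heap-allocated list keeps allocating for the list itself. 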
It&#8217;s good, but the wins are getting smaller&#8230; How about we <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/d480299ed53180cedc98618a22f310529ea0f5e0\">use an array inside Board instead of a List<\/a>?<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |\r\n|------- |---------:|--------:|--------:|-----------:|----------:|\r\n|  Solve | 170.6 ms | 2.98 ms | 2.33 ms | 43666.6667 | 350.66 MB |\r\n<\/pre>\n<p>Ah, that&#8217;s more like it. More than 25% improvement, both in time and space. This is about the last easy change we&#8217;re going to get, it seems. Next up, maybe we can try that idea of using a fancy VLQ-encoded list, packed into a single 64-bit integer. That would perhaps eliminate the most obvious source of heap allocation, that of the internal board array. We first need a <code>Base128<\/code> type, to help with the encode\/decode algorithm. To make absolutely sure we did this right, <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/ec61ced4196510488f594c675d03cdbd96f63ec0\">we<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/dc5bbb70cc5f8f6c575438c2337b21fc8dccfeb5\">need<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/960db4becddf37f74b9b3c91fa7fe799f721b820\">quite<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/3bfa9b8b2cabac612adbd614baa92f2d937a97df\">a<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/d77c733f127ca5154a9bc29bfc3811544de13c3d\">few<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/79b3eb19f478825813bc3bf1daffe901665b61b8\">unit<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/1aebb1f6c925257e9f6be9b63628fcb8480ad2fc\">tests<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/afe3704f9c5bf7bb34c7d2000bacd9897ef9a6d6\">to<\/a> <a 
href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/a09faae32eaf99ba7342d2f51520091e85f0aec5\">guide<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/e36de41974bb4d687843085d5647ddb6313c8c5c\">us<\/a>. Voila:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\npublic static class Base128\r\n{\r\n    public static int Read(ReadOnlySpan&lt;byte&gt; input)\r\n    {\r\n        int i = -1;\r\n        int output = 0;\r\n        byte next;\r\n        do\r\n        {\r\n            next = input&#x5B;++i];\r\n            output |= (next &amp; 0x7F) &lt;&lt; (7 * i);\r\n        }\r\n        while (next &gt; 127);\r\n\r\n        return output;\r\n    }\r\n\r\n    public static int Write(int value, Span&lt;byte&gt; output)\r\n    {\r\n        int i = 0;\r\n        do\r\n        {\r\n            byte next = (byte)(value &amp; 0x7F);\r\n            if (value &gt; 127)\r\n            {\r\n                next |= 0x80;\r\n            }\r\n\r\n            value &gt;&gt;= 7;\r\n            output&#x5B;i++] = next;\r\n        }\r\n        while (value &gt; 0);\r\n\r\n        return i;\r\n    }\r\n}\r\n<\/pre>\n<p>Now we need the list struct to pack those bits! 
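<\/p>\n<p>Before building that list struct, a quick round-trip shows the encoding at work: each byte carries 7 payload bits, and the high bit marks a continuation. (The byte values in the comments follow directly from the <code>Base128<\/code> code above.)<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\nSpan&lt;byte&gt; buffer = stackalloc byte&#x5B;6];\r\n\r\n\/\/ 300 = 0b1_0010_1100: the low 7 bits (0x2C) are written first with\r\n\/\/ the continuation bit set (0xAC), then the remaining bits (0x02).\r\nint written = Base128.Write(300, buffer);\r\n\/\/ written == 2; buffer&#x5B;0] == 0xAC, buffer&#x5B;1] == 0x02\r\n\r\nint value = Base128.Read(buffer);\r\n\/\/ value == 300\r\n\r\n\/\/ Anything that fits in 7 bits takes a single byte.\r\nint one = Base128.Write(57, buffer);\r\n\/\/ one == 1; buffer&#x5B;0] == 0x39\r\n<\/pre>\n<p>Back to the list struct itself. 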
This ends up being pretty complicated, and we need several <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/40124bca9e5c8a1ae23032439ed116754cb1e334\">more<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/9efa5b6233636451ca3a58a80c49c108a7b82d46\">tests<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/e732553fda665e61dea10f5bad096b92de538840\">to<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/93920032593055e7e4c216fe24f84914569ba456\">show<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/6c50d31fb1616808a6288439a2a3145918b8aa4f\">that<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/2041fd21e64447ee30f0929f718773b4ce7e3e23\">the<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/0d01647a452fb22808ccbd8e3e38f14378b66c67\">various<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/bc56ade74741f6fd0f71f86f85d2671834f0e447\">cases<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/67eedfd191c7ce0f9ac4237604e709a4c7e92c36\">are<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/a0d011a61c7e8d24fcc0038d59e82ed2debda2ea\">handled<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/writeasync2\/commit\/b323875d8ad56d549adc258661f9971629018e9a\">correctly<\/a>. 
The result:<\/p>\n<pre class=\"brush: csharp; title: ; notranslate\" title=\"\">\r\npublic readonly struct List6V\r\n{\r\n    private readonly ulong _raw;\r\n\r\n    private List6V(ulong raw)\r\n    {\r\n        _raw = raw;\r\n    }\r\n\r\n    public int this&#x5B;byte index]\r\n    {\r\n        get\r\n        {\r\n            Span&lt;byte&gt; bytes = stackalloc byte&#x5B;6];\r\n            Unpack(bytes);\r\n            return Base128.Read(bytes.Slice(index + Delta(index)));\r\n        }\r\n    }\r\n\r\n    public byte Count =&gt; (byte)(_raw &amp; 0xFF);\r\n\r\n    private byte Delta(byte index)\r\n    {\r\n        byte bits = (byte)(_raw &gt;&gt; 56);\r\n        return index switch\r\n        {\r\n            1 =&gt; (byte)(bits &amp; 0x7),\r\n            2 =&gt; (byte)((bits &gt;&gt; 3) &amp; 0x3),\r\n            3 =&gt; (byte)((bits &gt;&gt; 5) &amp; 0x3),\r\n            4 =&gt; (byte)((bits &gt;&gt; 7) &amp; 0x1),\r\n            _ =&gt; 0,\r\n        };\r\n    }\r\n\r\n    public List6V Add(int value)\r\n    {\r\n        Span&lt;byte&gt; bytes = stackalloc byte&#x5B;6];\r\n        Unpack(bytes);\r\n        int delta = Delta(Count);\r\n        int len = Base128.Write(value, bytes.Slice(Count + delta));\r\n        return new(Pack(bytes) | ((ulong)Count + 1) | Adjust(delta + len - 1));\r\n    }\r\n\r\n    private ulong Adjust(int delta)\r\n    {\r\n        byte bits = (byte)(_raw &gt;&gt; 56);\r\n        bits |= Count switch\r\n        {\r\n            0 =&gt; (byte)delta,\r\n            1 =&gt; (byte)(delta &lt;&lt; 3),\r\n            2 =&gt; (byte)(delta &lt;&lt; 5),\r\n            3 =&gt; (byte)(delta &lt;&lt; 7),\r\n            _ =&gt; 0,\r\n        };\r\n        return (ulong)bits &lt;&lt; 56;\r\n    }\r\n\r\n    private static ulong Pack(ReadOnlySpan&lt;byte&gt; bytes)\r\n    {\r\n        ulong raw = bytes&#x5B;0];\r\n        raw |= (ulong)bytes&#x5B;1] &lt;&lt; 8;\r\n        raw |= (ulong)bytes&#x5B;2] &lt;&lt; 16;\r\n        raw |= (ulong)bytes&#x5B;3] 
&lt;&lt; 24;\r\n        raw |= (ulong)bytes&#x5B;4] &lt;&lt; 32;\r\n        raw |= (ulong)bytes&#x5B;5] &lt;&lt; 40;\r\n        return raw &lt;&lt; 8;\r\n    }\r\n\r\n    private void Unpack(Span&lt;byte&gt; bytes)\r\n    {\r\n        bytes&#x5B;0] = (byte)(_raw &gt;&gt; 8);\r\n        bytes&#x5B;1] = (byte)(_raw &gt;&gt; 16);\r\n        bytes&#x5B;2] = (byte)(_raw &gt;&gt; 24);\r\n        bytes&#x5B;3] = (byte)(_raw &gt;&gt; 32);\r\n        bytes&#x5B;4] = (byte)(_raw &gt;&gt; 40);\r\n        bytes&#x5B;5] = (byte)(_raw &gt;&gt; 48);\r\n    }\r\n}\r\n<\/pre>\n<p>It&#8217;s&#8230; something. We got rid of heap allocated lists and\/or arrays, and we fit all of our state into 64 bits, as promised. Does this really pay off? In a word, <em>no<\/em>. The benchmark is somewhat better on the memory front, as expected, but far worse in speed:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |\r\n|------- |---------:|--------:|--------:|-----------:|----------:|\r\n|  Solve | 244.7 ms | 2.84 ms | 2.65 ms | 36666.6667 | 293.16 MB |\r\n<\/pre>\n<p>What happened? We have to drill into another benchmark to see. 
Let&#8217;s look just at the <code>Board.TryMove<\/code> cost for various board sizes (N):<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n|  Method | N |      Mean |    Error |   StdDev | Allocated |\r\n|-------- |-- |----------:|---------:|---------:|----------:|\r\n| TryMove | 2 |  68.40 ns | 0.553 ns | 0.518 ns |         - |\r\n| TryMove | 4 | 136.23 ns | 0.742 ns | 0.694 ns |         - |\r\n| TryMove | 6 | 203.40 ns | 1.079 ns | 1.010 ns |         - |\r\n<\/pre>\n<p>Now let&#8217;s compare that to what it was with the plain old array:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n|  Method | N |     Mean |    Error |   StdDev |   Gen0 | Allocated |\r\n|-------- |-- |---------:|---------:|---------:|-------:|----------:|\r\n| TryMove | 2 | 36.27 ns | 0.152 ns | 0.135 ns | 0.0076 |      64 B |\r\n| TryMove | 4 | 45.57 ns | 0.316 ns | 0.280 ns | 0.0076 |      64 B |\r\n| TryMove | 6 | 51.87 ns | 0.919 ns | 0.860 ns | 0.0095 |      80 B |\r\n<\/pre>\n<p>Yes, the array usage carries some heap allocation cost, but the execution speed is undeniably better. It goes to show that for most situations, you really can&#8217;t beat the popular (and, by extension, highly optimized) case. With an array you get the best possible read\/write speeds (indexed, contiguous memory access) and, if the data is small enough, a good chance of <a href=\"https:\/\/stackoverflow.com\/questions\/12065774\/why-does-cache-locality-matter-for-array-performance\">cache locality<\/a>.<\/p>\n<p>Maybe the VLQ could be better with a lot more work &#8212; I admit I didn&#8217;t spend a lot of time trying to be clever with <a href=\"https:\/\/graphics.stanford.edu\/~seander\/bithacks.html\">bit twiddling hacks<\/a> and the like. Maybe we can find a better optimization by thinking harder.<\/p>\n<p>Honestly, I&#8217;m good with the simple improvements we already did above. 
Performance optimization is often a game of <a href=\"https:\/\/softwareengineering.stackexchange.com\/questions\/179924\/does-low-latency-code-sometimes-have-to-be-ugly\">making the code ugly in order to make the latency graphs pretty<\/a>. On the contrary, these particular changes meet three very important criteria: (1) the code afterward is still intelligible, (2) the changes are easy to explain, (3) they could likely be applied in other contexts.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While everyone loves theoretical performance improvements, we need to check back in with reality from time to time. To review, we had proposed a near-zero allocation strategy to help improve the speed of a Digits game solver. After all, heap&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[105,104],"tags":[],"class_list":["post-5892","post","type-post","status-publish","format-standard","hentry","category-games","category-performance"],"_links":{"self":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5892","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5892"}],"version-history":[{"count":1,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5892\/revisions"}],"predecessor-version":[{"id":5893,"href":"https:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5892\/revisions\/5893"}],"wp:attachment":[{"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5892"}],"wp:term
":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5892"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5892"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}