{"id":5638,"date":"2019-05-28T13:00:09","date_gmt":"2019-05-28T13:00:09","guid":{"rendered":"http:\/\/writeasync.net\/?p=5638"},"modified":"2019-05-28T00:12:11","modified_gmt":"2019-05-28T00:12:11","slug":"native-code-always-faster","status":"publish","type":"post","link":"http:\/\/writeasync.net\/?p=5638","title":{"rendered":"Native code: always faster?"},"content":{"rendered":"<p>Last month, I explored <a href=\"http:\/\/writeasync.net\/?p=5625\">some performance optimizations for a C# Letter Boxed solver<\/a>. It was a fair bit of effort but in the end I could solve a typical puzzle in ~80 ms. Success! However, you might rightly point out that I could have spent that time building a solution in a &#8220;naturally&#8221; high performance language such as native C++. Would that effort have been worth it? After all, isn&#8217;t a given <a href=\"https:\/\/stackoverflow.com\/questions\/138361\/how-much-faster-is-c-than-c\">C++ app simply just faster<\/a> than its C# equivalent?<\/p>\n<p>It certainly seems that way, if you <a href=\"https:\/\/www.quora.com\/Exactly-how-much-is-C-slower-than-C++\">browse<\/a> <a href=\"https:\/\/www.linkedin.com\/pulse\/c-faster-than-depends-coding-idom-joe-ellsworth\/\">the<\/a> <a href=\"https:\/\/blog.codinghorror.com\/on-managed-code-performance\/\">literature<\/a> (i.e. blog posts). Intuitively, it makes sense that any intermediate runtime layer is going to cost at least a few more CPU cycles, all else being equal. But that&#8217;s the sticking point &#8212; all else is <em>not<\/em> equal. Rarely is a program just a set of pure mathematical calculations, so in practice there are additional (usually dominating) factors which affect performance. Everything from the <a href=\"http:\/\/modernescpp.com\/index.php\/strategies-for-the-allocation-of-memory\">memory allocation strategy<\/a>, the <a href=\"https:\/\/stackoverflow.com\/questions\/12065774\/why-does-cache-locality-matter-for-array-performance\">cache locality<\/a> of your data, and even the <a href=\"https:\/\/en.cppreference.com\/w\/cpp\/container\">available data structures<\/a> will matter as much &#8212; if not more &#8212; to the overall execution speed.<\/p>\n<p>But rather than opine abstractly about it, let&#8217;s look at a real use case. The Letter Boxed solver example is actually a great proving ground for this &#8220;C++ is faster&#8221; performance hypothesis for multiple reasons:<\/p>\n<ol>\n<li>I already have a working reference implementation in C# for direct comparison.<\/li>\n<li>The project has full unit tests which I can borrow to aid in implementing the C++ version.<\/li>\n<li>The core algorithms and abstractions exist in very similar forms for C++ (Dictionary has its analog <a href=\"http:\/\/www.cplusplus.com\/reference\/unordered_map\/unordered_map\/\">unordered_map<\/a>, HashSet has <a href=\"http:\/\/www.cplusplus.com\/reference\/unordered_set\/unordered_set\/\">unordered_set<\/a>, etc.).<\/li>\n<\/ol>\n<p>So let&#8217;s get started! First, we need three more projects (<a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/3ed50edb9e64c402ee4e290062fbec22f0a82f15\">native test library, native core library, native solver app<\/a>). Then it&#8217;s on to the code, starting with a relatively straight port of the Str class (<a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/e7a2ce88bc92be3e117e5e8012b9841b73935166\/src\/Words.Native.Core\/Str.h\">Str.h<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/e7a2ce88bc92be3e117e5e8012b9841b73935166\/src\/Words.Native.Core\/Str.cpp\">Str.cpp<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/e7a2ce88bc92be3e117e5e8012b9841b73935166\/test\/Words.Native.Test\/StrTest.cpp\">StrTest.cpp<\/a>). Note that I have used some C++-isms, such as <a href=\"http:\/\/www.cplusplus.com\/reference\/string\/string\/operator+\/\"><code>operator+<\/code><\/a> instead of an <code>Append<\/code> method, a <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/cpp\/user-defined-literals-cpp?view=vs-2019\">user-defined character literal<\/a> (eliminating the need for a full-fledged <code>Ch<\/code> enum, which became just a <code>typedef<\/code> of <code>uint8_t<\/code> here), and a <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/standard-library\/overloading-the-output-operator-for-your-own-classes?view=vs-2019\">stream insertion operator<\/a> for &#8220;stringifying.&#8221;<\/p>\n<p>Next up is LetterBoxStr, a pretty simple fa\u00e7ade around Str (<a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/d2fb39b90a0868c62cae8db79ff813cc581dcabb\/src\/Words.Native.Core\/LetterBoxStr.h\">LetterBoxStr.h<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/d2fb39b90a0868c62cae8db79ff813cc581dcabb\/src\/Words.Native.Core\/LetterBoxStr.cpp\">LetterBoxStr.cpp<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/d2fb39b90a0868c62cae8db79ff813cc581dcabb\/test\/Words.Native.Test\/LetterBoxStrTest.cpp\">LetterBoxStrTest.cpp<\/a>). Again I can benefit from C++ here by using <a href=\"https:\/\/www.tutorialspoint.com\/cpp_standard_library\/bitset.htm\">bitset<\/a> rather than rolling my own Vertices struct as I did in the C# version. (One caveat is that the string output is &#8220;backwards&#8221; compared to the one I built, so I had to reverse all the expected values in the assertions.)<\/p>\n<p>At last, it is time to build the star of the show, StrTrie. (<a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/3dee9d83671d911c42d64fb16832e965b4c5f304\/src\/Words.Native.Core\/StrTrie.h\">StrTrie.h<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/3dee9d83671d911c42d64fb16832e965b4c5f304\/src\/Words.Native.Core\/StrTrie.cpp\">StrTrie.cpp<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/3dee9d83671d911c42d64fb16832e965b4c5f304\/test\/Words.Native.Test\/StrTrieTest.cpp\">StrTrieTest.cpp<\/a>) It is a relatively faithful port from the C# implementation, aside from the <a href=\"https:\/\/devblogs.microsoft.com\/oldnewthing\/20190227-00\/?p=101072\">required, albeit verbose, iterator syntax for finding and inserting values into the map<\/a>.<\/p>\n<p>By this point, I have almost enough pieces to stitch together the first half of the native solver app. Just <a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/e4ed5b324a113ae7f74cd41b30fff255cbad1679\">one little Stopwatch class<\/a> (very thinly wrapping the types in <a href=\"https:\/\/docs.microsoft.com\/en-us\/cpp\/standard-library\/chrono?view=vs-2019\">std::chrono<\/a>) to aid in performance reporting and we&#8217;re off to the races! Except I happened to realize there is a terrible bug in StrTrie which in practice I had not hit yet because my inputs were always sorted. How embarrassing! After <a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/b84a0fd37b2c534304cf2a27c5d8a3127063ec4d\">a quick fix in the C# code<\/a> and <a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/8814a065192f9eada215d946381baeeed73ca51e\">its C++ counterpart<\/a> we can now move back to the <a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/d74f4bf61706c36dc3c704d0da1adadb927b3fdd\">native solver skeleton<\/a>. So far it just loads the trie from a file, but already I can see that we&#8217;re in trouble:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&gt; LetterBoxedSolver.Native.exe RMEWCLTGKAPI words.txt\r\n&#x5B;0.000] Loading trie...\r\n&#x5B;0.226] Loaded 162309 words.\r\n<\/pre>\n<p>Recall the C# version:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&gt; dotnet LetterBoxedSolver.dll RMEWCLTGKAPI words.txt\r\n&#x5B;000.000] Loading trie...\r\n&#x5B;000.059] Loaded 162309 words.\r\n&#x5B;000.060] Finding valid words...\r\n&#x5B;000.072] Found 754 valid words.\r\n&#x5B;000.072] Finding solutions...\r\nMARKETPLACE-EARWIG\r\nWIGWAM-MARKETPLACE\r\nPRAGMATIC-CAKEWALK\r\n&#x5B;000.080] Done.\r\n<\/pre>\n<p>Simply loading the trie in the C++ app puts us at ~140 ms <em>slower<\/em> than the <strong>entire<\/strong> C# solution. Running the profiler and trying various tricks to no avail have convinced me that the way <code>unordered_map<\/code> is defined leads to this slowdown &#8212; apparently <a href=\"https:\/\/stackoverflow.com\/questions\/31112852\/how-stdunordered-map-is-implemented\">an artifact of how iterators need to be implemented<\/a>. What a pity &#8212; we could really do without the iterator requirement as they are not necessary for this particular application.<\/p>\n<p>Not be deterred, I continued with the rest of the implementation. Moving on to LetterBoxStrSearch (<a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/316e23d4175fa00c5b78b8f7fce5c8de0aa0acb7\/src\/Words.Native.Core\/LetterBoxStrSearch.h\">LetterBoxStrSearch.h<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/316e23d4175fa00c5b78b8f7fce5c8de0aa0acb7\/src\/Words.Native.Core\/LetterBoxStrSearch.cpp\">LetterBoxStrSearch.cpp<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/316e23d4175fa00c5b78b8f7fce5c8de0aa0acb7\/test\/Words.Native.Test\/LetterBoxStrSearchTest.cpp\">LetterBoxStrSearchTest.cpp<\/a>) which uses a template parameter for the <code>found<\/code> functor (<a href=\"https:\/\/stackoverflow.com\/questions\/16435549\/when-to-use-stdfunction-instead-of-inheritance\">compile-time inheritance FTW<\/a>). Then to the last required data structure, LetterBoxStrWords, to select the two-word solutions. (<a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/b779b27dd08213f322f1dfc92c5537602f66a73c\/src\/Words.Native.Core\/LetterBoxStrWords.h\">LetterBoxStrWords.h<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/b779b27dd08213f322f1dfc92c5537602f66a73c\/src\/Words.Native.Core\/LetterBoxStrWords.cpp\">LetterBoxStrWords.cpp<\/a>, <a href=\"https:\/\/github.com\/brian-dot-net\/games\/blob\/b779b27dd08213f322f1dfc92c5537602f66a73c\/test\/Words.Native.Test\/LetterBoxStrWordsTest.cpp\">LetterBoxStrWordsTest.cpp<\/a>)<\/p>\n<p>With all that done, we can implement the <a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/09c73567a44c3264324933f625e2862d90991fe7\">full native solver<\/a>. The syntax of the output streams is a bit awkward due to the lack of a nice formatting library, but it&#8217;ll do. Let&#8217;s see the results:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&#x5B;0.000] Loading trie...\r\n&#x5B;0.225] Loaded 162309 words.\r\n&#x5B;0.225] Finding valid words...\r\n&#x5B;0.229] Found 754 valid words.\r\n&#x5B;0.230] Finding solutions...\r\nMARKETPLACE-EARWIG\r\nWIGWAM-MARKETPLACE\r\nPRAGMATIC-CAKEWALK\r\n&#x5B;0.235] Done.\r\n<\/pre>\n<p>We knew it was going to be slower overall but there is one big ray of hope here &#8212; the solution process is much faster than the C# version. It only took ~5 ms to find the words and ~5 ms to find solutions, easily twice as fast as the managed app.<\/p>\n<p>Let&#8217;s try a puzzle with many more potential solutions to see if the perf gains are sticky:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&gt;LetterBoxedSolver.Native.exe ERGBIDNCFTAO words.txt\r\n&#x5B;0.000] Loading trie...\r\n&#x5B;0.224] Loaded 162309 words.\r\n&#x5B;0.225] Finding valid words...\r\n&#x5B;0.237] Found 1436 valid words.\r\n&#x5B;0.238] Finding solutions...\r\nBAITED-DEFORCING\r\n -- snipped 72 other solutions -- \r\nOBTECTED-DRAFTING\r\n&#x5B;0.324] Done.\r\n<\/pre>\n<p>How does that compare with the C# app?<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\ndotnet LetterBoxedSolver.dll ERGBIDNCFTAO words.txt\r\n&#x5B;000.000] Loading trie...\r\n&#x5B;000.056] Loaded 162309 words.\r\n&#x5B;000.057] Finding valid words...\r\n&#x5B;000.071] Found 1436 valid words.\r\n&#x5B;000.072] Finding solutions...\r\nROBAND-DEFECTING\r\n -- snipped 72 other solutions -- \r\nOBTECTED-DRAFTING\r\n&#x5B;000.120] Done.\r\n<\/pre>\n<p>Hmm, not so great. (We can ignore that the order of solutions is different because of hash table implementation details &#8212; the words are ultimately the same.) Finding solutions is now also slower. But we should be suspicious because we are dealing with console output here, and <a href=\"https:\/\/stackoverflow.com\/questions\/4340396\/does-the-c-standard-mandate-poor-performance-for-iostreams-or-am-i-just-deali\">iostreams are known to be slow<\/a>.<\/p>\n<p>Let&#8217;s commit a bit of programming heresy and use the <a href=\"https:\/\/stackoverflow.com\/questions\/2872543\/printf-vs-cout-in-c\">maligned<\/a> and <a href=\"https:\/\/hownot2code.com\/2016\/06\/26\/dangerous-printf\/\">unsafe printf<\/a> <a href=\"https:\/\/github.com\/brian-dot-net\/games\/commit\/8f52b7846a86bf566e8aced994a8a4bd76db758d\">instead of cout<\/a>:<\/p>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&#x5B;000.000] Loading trie...\r\n&#x5B;000.223] Loaded 162309 words.\r\n&#x5B;000.224] Finding valid words...\r\n&#x5B;000.237] Found 1436 valid words.\r\n&#x5B;000.237] Finding solutions...\r\nBAITED-DEFORCING\r\n -- snipped 72 other solutions -- \r\nOBTECTED-DRAFTING\r\n&#x5B;000.280] Done.\r\n<\/pre>\n<p>That&#8217;s a bit better. The C++ solution now has a slight edge over the C# version when solving, but still takes too long to load the trie. Clearly we need to rethink some of the data structures we are using here, but that is an exploration for another time.<\/p>\n<p>What did we learn? C++ can be faster at <em>some<\/em> things, no doubt. But there are many other factors to consider before blindly assuming native perf &gt;&gt; managed perf.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last month, I explored some performance optimizations for a C# Letter Boxed solver. It was a fair bit of effort but in the end I could solve a typical puzzle in ~80 ms. Success! However, you might rightly point out&hellip; <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[105,101,104],"tags":[],"class_list":["post-5638","post","type-post","status-publish","format-standard","hentry","category-games","category-native","category-performance"],"_links":{"self":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5638","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5638"}],"version-history":[{"count":1,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5638\/revisions"}],"predecessor-version":[{"id":5639,"href":"http:\/\/writeasync.net\/index.php?rest_route=\/wp\/v2\/posts\/5638\/revisions\/5639"}],"wp:attachment":[{"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5638"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5638"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/writeasync.net\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5638"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}