Last month, I explored some performance optimizations for a C# Letter Boxed solver. It was a fair bit of effort but in the end I could solve a typical puzzle in ~80 ms. Success! However, you might rightly point out that I could have spent that time building a solution in a “naturally” high performance language such as native C++. Would that effort have been worth it? After all, isn’t a given C++ app simply just faster than its C# equivalent?
It certainly seems that way, if you browse the literature (i.e. blog posts). Intuitively, it makes sense that any intermediate runtime layer is going to cost at least a few more CPU cycles, all else being equal. But that’s the sticking point — all else is not equal. Rarely is a program just a set of pure mathematical calculations, so in practice there are additional (usually dominating) factors which affect performance. Everything from the memory allocation strategy, the cache locality of your data, and even the available data structures will matter as much — if not more — to the overall execution speed.
But rather than opine abstractly about it, let’s look at a real use case. The Letter Boxed solver example is actually a great proving ground for this “C++ is faster” performance hypothesis for multiple reasons:
- I already have a working reference implementation in C# for direct comparison.
- The project has full unit tests which I can borrow to aid in implementing the C++ version.
- The core algorithms and abstractions exist in very similar forms for C++ (Dictionary has its analog unordered_map, HashSet has unordered_set, etc.).
So let’s get started! First, we need three more projects (native test library, native core library, native solver app). Then it’s on to the code, starting with a relatively straight port of the Str class (Str.h, Str.cpp, StrTest.cpp). Note that I have used some C++-isms, such as operator+
instead of an Append
method, a user-defined character literal (eliminating the need for a full-fledged Ch
enum, which became just a typedef
of uint8_t
here), and a stream insertion operator for “stringifying.”
Next up is LetterBoxStr, a pretty simple façade around Str (LetterBoxStr.h, LetterBoxStr.cpp, LetterBoxStrTest.cpp). Again I can benefit from C++ here by using bitset rather than rolling my own Vertices struct as I did in the C# version. (One caveat is that the string output is “backwards” compared to the one I built, so I had to reverse all the expected values in the assertions.)
At last, it is time to build the star of the show, StrTrie. (StrTrie.h, StrTrie.cpp, StrTrieTest.cpp) It is a relatively faithful port from the C# implementation, aside from the required, albeit verbose, iterator syntax for finding and inserting values into the map.
By this point, I have almost enough pieces to stitch together the first half of the native solver app. Just one little Stopwatch class (very thinly wrapping the types in std::chrono) to aid in performance reporting and we’re off to the races! Except I happened to realize there is a terrible bug in StrTrie which in practice I had not hit yet because my inputs were always sorted. How embarrassing! After a quick fix in the C# code and its C++ counterpart we can now move back to the native solver skeleton. So far it just loads the trie from a file, but already I can see that we’re in trouble:
> LetterBoxedSolver.Native.exe RMEWCLTGKAPI words.txt [0.000] Loading trie... [0.226] Loaded 162309 words.
Recall the C# version:
> dotnet LetterBoxedSolver.dll RMEWCLTGKAPI words.txt [000.000] Loading trie... [000.059] Loaded 162309 words. [000.060] Finding valid words... [000.072] Found 754 valid words. [000.072] Finding solutions... MARKETPLACE-EARWIG WIGWAM-MARKETPLACE PRAGMATIC-CAKEWALK [000.080] Done.
Simply loading the trie in the C++ app puts us at ~140 ms slower than the entire C# solution. Running the profiler and trying various tricks to no avail have convinced me that the way unordered_map
is defined leads to this slowdown — apparently an artifact of how iterators need to be implemented. What a pity — we could really do without the iterator requirement as they are not necessary for this particular application.
Not be deterred, I continued with the rest of the implementation. Moving on to LetterBoxStrSearch (LetterBoxStrSearch.h, LetterBoxStrSearch.cpp, LetterBoxStrSearchTest.cpp) which uses a template parameter for the found
functor (compile-time inheritance FTW). Then to the last required data structure, LetterBoxStrWords, to select the two-word solutions. (LetterBoxStrWords.h, LetterBoxStrWords.cpp, LetterBoxStrWordsTest.cpp)
With all that done, we can implement the full native solver. The syntax of the output streams is a bit awkward due to the lack of a nice formatting library, but it’ll do. Let’s see the results:
[0.000] Loading trie... [0.225] Loaded 162309 words. [0.225] Finding valid words... [0.229] Found 754 valid words. [0.230] Finding solutions... MARKETPLACE-EARWIG WIGWAM-MARKETPLACE PRAGMATIC-CAKEWALK [0.235] Done.
We knew it was going to be slower overall but there is one big ray of hope here — the solution process is much faster than the C# version. It only took ~5 ms to find the words and ~5 ms to find solutions, easily twice as fast as the managed app.
Let’s try a puzzle with many more potential solutions to see if the perf gains are sticky:
>LetterBoxedSolver.Native.exe ERGBIDNCFTAO words.txt [0.000] Loading trie... [0.224] Loaded 162309 words. [0.225] Finding valid words... [0.237] Found 1436 valid words. [0.238] Finding solutions... BAITED-DEFORCING -- snipped 72 other solutions -- OBTECTED-DRAFTING [0.324] Done.
How does that compare with the C# app?
dotnet LetterBoxedSolver.dll ERGBIDNCFTAO words.txt [000.000] Loading trie... [000.056] Loaded 162309 words. [000.057] Finding valid words... [000.071] Found 1436 valid words. [000.072] Finding solutions... ROBAND-DEFECTING -- snipped 72 other solutions -- OBTECTED-DRAFTING [000.120] Done.
Hmm, not so great. (We can ignore that the order of solutions is different because of hash table implementation details — the words are ultimately the same.) Finding solutions is now also slower. But we should be suspicious because we are dealing with console output here, and iostreams are known to be slow.
Let’s commit a bit of programming heresy and use the maligned and unsafe printf instead of cout:
[000.000] Loading trie... [000.223] Loaded 162309 words. [000.224] Finding valid words... [000.237] Found 1436 valid words. [000.237] Finding solutions... BAITED-DEFORCING -- snipped 72 other solutions -- OBTECTED-DRAFTING [000.280] Done.
That’s a bit better. The C++ solution now has a slight edge over the C# version when solving, but still takes too long to load the trie. Clearly we need to rethink some of the data structures we are using here, but that is an exploration for another time.
What did we learn? C++ can be faster at some things, no doubt. But there are many other factors to consider before blindly assuming native perf >> managed perf.
Pingback: An even faster hash table – WriteAsync .NET