Last time, I wrote about a native C++ implementation of the Letter Boxed solver. Vexingly, this native implementation was actually slower than the corresponding .NET Core app. Today we’ll try to fix that.
As I mentioned, some basic profiling revealed that the STL unordered_map was the cause of the slowdown. The documentation even says:
Moreover, inserting an element invalidates no iterators, and removing an element invalidates only those iterators which point at the removed element.
We actually do not need such a strong guarantee — indeed, we need no iterators at all in our use case. What happens if we abandon this general-purpose data structure in favor of one with only the minimal operations we require? It seems we have arrived at a situation every budding computer scientist dreams of: to improve performance, we need to build a custom hash table.
If, like most of us, your CS fundamentals are rusty and you don’t write hash tables for fun or profit, consult the “Hash table” Wikipedia page for a refresher. In summary, we will need an array of buckets to hold our data items, a hash function and an equality operation for the key type, and a collision strategy for keys that hash to the same bucket. To keep our V1 implementation simple, we’ll choose an open addressing scheme, where any bucket is “open” (free to be used) by any hashed entry; in other words, an entry may end up in a bucket other than the one its hash code indicates. To deal with collisions, we will probe for open buckets using linear probing, meaning we keep scanning forward through the buckets until we find a free slot. With open addressing, the load factor (the percentage of the table we are allowed to fill before resizing) matters a great deal, so we need to keep this in mind as well.
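The probe sequence itself is simple arithmetic: start at the key’s “home” bucket and step forward one slot at a time, wrapping around the table. Here is a minimal sketch (the names `HomeBucket` and `NextBucket` are illustrative, not from the solver):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// The "home" bucket is where a key's hash code lands directly.
std::size_t HomeBucket(const std::string& key, std::size_t capacity)
{
    return std::hash<std::string>{}(key) % capacity;
}

// Linear probing: on a collision, simply advance to the next slot,
// wrapping around to the start of the table when we fall off the end.
std::size_t NextBucket(std::size_t i, std::size_t capacity)
{
    return (i + 1) % capacity;
}
```

A lookup keeps calling `NextBucket` until it finds the key or hits an empty slot (which, absent deletions, proves the key is missing).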
Since I’m no Bob Jenkins, I will be using a prime number table size to mitigate the effects of mediocre hash functions. When the table is full, I will simply expand the capacity by doubling (geometric expansion with a growth factor of 2). Very conveniently, INT_MAX (for 32-bit integers) is a prime number, so we can set it as our upper bound and repeatedly halve it, finding the nearest prime at each step. This C++ program should find all the primes we need:
```cpp
#include <cmath>
#include <cstdio>

bool IsPrime(int n)
{
    if ((n % 2) == 0)
    {
        return false;
    }

    int k = static_cast<int>(sqrt(n));
    for (int f = 3; f <= k; ++f)
    {
        if ((n % f) == 0)
        {
            return false;
        }
    }

    return true;
}

template <typename TFound>
void FindPrimes(int min, int max, TFound found)
{
    int i = max;
    while (i >= min)
    {
        if (IsPrime(i))
        {
            found(i);

            // Halve the capacity, keeping the candidate odd.
            i /= 2;
            if ((i % 2) == 0)
            {
                ++i;
            }
        }
        else
        {
            i -= 2;
        }
    }
}

int main()
{
    FindPrimes(3, 2147483647, [](int p) { printf("%d\n", p); });
    return 0;
}
```
For the Letter Boxed solver we only need two operations from our hash table: `get`, which returns a value if present, and `insert`, which adds or updates a value and optionally returns the previous value. We can reuse the STL definitions of `std::hash` and `std::equal_to` for the hash function and equality operations, respectively.
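To make the two operations concrete, here is a toy open-addressing table with just `Get` and `Insert`, parameterized on `std::hash` and `std::equal_to`. This is a sketch only — fixed capacity, no resizing — and all names are illustrative rather than taken from the actual Hashtable.h:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy open-addressing hash table: Get and Insert only, fixed capacity.
// Empty slots are represented by an empty std::optional.
template <typename TKey, typename TValue,
          typename THash = std::hash<TKey>,
          typename TEquals = std::equal_to<TKey>>
class MiniTable
{
public:
    explicit MiniTable(std::size_t capacity) : slots_(capacity) {}

    // Returns the value for `key` if present.
    std::optional<TValue> Get(const TKey& key) const
    {
        std::size_t i = THash{}(key) % slots_.size();
        while (slots_[i].has_value())
        {
            if (TEquals{}(slots_[i]->first, key))
            {
                return slots_[i]->second;
            }

            i = (i + 1) % slots_.size();
        }

        return std::nullopt;
    }

    // Adds or updates `key`; returns the previous value, if any.
    std::optional<TValue> Insert(const TKey& key, const TValue& value)
    {
        std::size_t i = THash{}(key) % slots_.size();
        while (slots_[i].has_value())
        {
            if (TEquals{}(slots_[i]->first, key))
            {
                TValue old = slots_[i]->second;
                slots_[i]->second = value;
                return old;
            }

            i = (i + 1) % slots_.size();
        }

        slots_[i] = std::make_pair(key, value);
        return std::nullopt;
    }

private:
    std::vector<std::optional<std::pair<TKey, TValue>>> slots_;
};
```

Note that with no resizing, inserting into a full table would loop forever — exactly the kind of infinite loop a real implementation has to guard against.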
After experimenting, writing a few tests, and debugging more than a few infinite loops, I ended up with this no-frills implementation: Hashtable.h, HashtableTest.cpp.
To figure out if this was actually any good, I had to write a simple benchmark program. (Maybe I’ll try Google Benchmark some other time.) Here it is: Words.Benchmark.Native/Main.cpp. Three core benchmarks are tested, each with parameters `n` (number of elements) and `f` (load factor of the hash table):
- FindMissing: try to get elements known to be missing from the table
- FindPresent: get elements known to be present in the table
- Insert: insert elements
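The timing harness for benchmarks like these can be as simple as a loop around `std::chrono` (this is a generic sketch, not the actual Main.cpp; `TimeUsec` is a hypothetical name):

```cpp
#include <chrono>
#include <functional>

// Run `op` for `iterations` repetitions and return the mean time per
// repetition in microseconds. steady_clock avoids wall-clock jumps.
double TimeUsec(const std::function<void()>& op, int iterations)
{
    using Clock = std::chrono::steady_clock;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        op();
    }

    auto elapsed = Clock::now() - start;
    double usec = std::chrono::duration<double, std::micro>(elapsed).count();
    return usec / iterations;
}
```

Each benchmark (FindMissing, FindPresent, Insert) would then pass its operation as `op` and report the averaged figure.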
Here are the results I got on my machine, conveniently sorted by `f` then `n`:
Operation | n | f | Time (usec) |
---|---|---|---|
FindMissing | 10 | 0.25 | 0.205 |
FindPresent | 10 | 0.25 | 0.199 |
Insert | 10 | 0.25 | 0.875 |
FindMissing | 100 | 0.25 | 1.972 |
FindPresent | 100 | 0.25 | 1.911 |
Insert | 100 | 0.25 | 19.821 |
FindMissing | 1000 | 0.25 | 19.846 |
FindPresent | 1000 | 0.25 | 19.241 |
Insert | 1000 | 0.25 | 141.171 |
FindMissing | 10000 | 0.25 | 196.617 |
FindPresent | 10000 | 0.25 | 190.697 |
Insert | 10000 | 0.25 | 4630.353 |
FindMissing | 100000 | 0.25 | 1969.729 |
FindPresent | 100000 | 0.25 | 1913.572 |
Insert | 100000 | 0.25 | 30337.493 |
FindMissing | 10 | 0.50 | 0.204 |
FindPresent | 10 | 0.50 | 0.137 |
Insert | 10 | 0.50 | 1.252 |
FindMissing | 100 | 0.50 | 1.978 |
FindPresent | 100 | 0.50 | 1.918 |
Insert | 100 | 0.50 | 12.975 |
FindMissing | 1000 | 0.50 | 19.552 |
FindPresent | 1000 | 0.50 | 18.869 |
Insert | 1000 | 0.50 | 112.012 |
FindMissing | 10000 | 0.50 | 195.986 |
FindPresent | 10000 | 0.50 | 177.933 |
Insert | 10000 | 0.50 | 2413.639 |
FindMissing | 100000 | 0.50 | 1973.212 |
FindPresent | 100000 | 0.50 | 1897.388 |
Insert | 100000 | 0.50 | 28885.607 |
FindMissing | 10 | 0.75 | 0.137 |
FindPresent | 10 | 0.75 | 0.199 |
Insert | 10 | 0.75 | 1.34 |
FindMissing | 100 | 0.75 | 1.984 |
FindPresent | 100 | 0.75 | 1.92 |
Insert | 100 | 0.75 | 15.118 |
FindMissing | 1000 | 0.75 | 19.513 |
FindPresent | 1000 | 0.75 | 19.122 |
Insert | 1000 | 0.75 | 159.648 |
FindMissing | 10000 | 0.75 | 197.038 |
FindPresent | 10000 | 0.75 | 137.337 |
Insert | 10000 | 0.75 | 1726.716 |
FindMissing | 100000 | 0.75 | 1958.956 |
FindPresent | 100000 | 0.75 | 1257.9 |
Insert | 100000 | 0.75 | 32195.437 |
FindMissing | 10 | 1.00 | 0.137 |
FindPresent | 10 | 1.00 | 0.198 |
Insert | 10 | 1.00 | 1.248 |
FindMissing | 100 | 1.00 | 1.985 |
FindPresent | 100 | 1.00 | 1.931 |
Insert | 100 | 1.00 | 14.176 |
FindMissing | 1000 | 1.00 | 19.738 |
FindPresent | 1000 | 1.00 | 19.265 |
Insert | 1000 | 1.00 | 277.614 |
FindMissing | 10000 | 1.00 | 196.745 |
FindPresent | 10000 | 1.00 | 191.986 |
Insert | 10000 | 1.00 | 8672.173 |
FindMissing | 100000 | 1.00 | 1970.514 |
FindPresent | 100000 | 1.00 | 1263.978 |
Insert | 100000 | 1.00 | 97280.546 |
Clearly the hash table performs terribly when the load factor approaches 1.0 (as “Big-O” would have predicted). But 0.5 looks like the sweet spot; we are “only” wasting 50% of the memory but we get a nice speedup in return.
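Enforcing the chosen maximum load factor comes down to a simple occupancy check plus a jump to the next prime capacity. A sketch of that growth policy (the capacity schedule below uses small example primes, not the actual values produced by the prime-finding program; all names are illustrative):

```cpp
#include <array>
#include <cstddef>

// Example prime capacity schedule, each entry roughly double the last.
constexpr std::array<std::size_t, 4> kCapacities = {11, 23, 47, 97};

// Find the next prime capacity larger than the current one.
std::size_t NextCapacity(std::size_t current)
{
    for (std::size_t c : kCapacities)
    {
        if (c > current)
        {
            return c;
        }
    }

    return current;  // Top of the schedule; stop growing.
}

// With maxLoad = 0.5, grow once half or more of the slots are in use.
bool ShouldGrow(std::size_t count, std::size_t capacity, double maxLoad)
{
    return static_cast<double>(count) >= static_cast<double>(capacity) * maxLoad;
}
```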
The only thing left was a small set of changes to StrTrie to use the custom hash table, choosing a load factor of 0.5 given the promising benchmark results (commit).
Now let’s see what difference this makes:
```
[000.000] Loading trie...
[000.075] Loaded 162309 words.
[000.075] Finding valid words...
[000.079] Found 1436 valid words.
[000.079] Finding solutions...
BAITED-DEFORCING
-- snipped 72 other solutions --
OBTECTED-DRAFTING
[000.120] Done.
```
These results are more than twice as fast as the original native implementation and just about on par with the managed app. Notice that it still takes longer to build the trie (~75 ms compared to ~60 ms) but this loss is offset by the generally faster solution performance.
All in all, this NIH boondoggle was surprisingly productive. And yet, we could probably do even better… but that will have to wait for now.