Last week, in my continuing exploration of a C++ Letter Boxed solver, I had to build a faster hash table to improve performance. Today we’ll make it even faster, hoping to catch up to the performance of my .NET Core solver app.
Before making the code changes, I first needed to fix the benchmark code. The find operations were not really working the way they should have been. (Luckily the numbers didn’t change all that much.)
Now, let’s start optimizing! First, note that `Hashtable` has a single vector for its buckets. Each entry looks like this:
```cpp
struct Entry
{
    TKey key_;
    TValue value_;
    bool occupied_;
};
```
This means that every time we resize the table, we have to reallocate and copy these keys and values again. Perhaps we can get a performance edge by using two vectors: one for buckets, which would just be a list of `int`s, and another for the entries, which would never need to change other than appending totally new key-value pairs. The number in each bucket would refer to the index of the entry stored there.
Since a vector zero-fills by default, we should take advantage of that fact by treating a bucket value of zero as empty. Then we just need to be careful to one-index the entry list by setting its initial size to one.
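To make that concrete, here is a minimal sketch of the two-vector layout, with `std::string` and `int` standing in for `TKey` and `TValue` (the names here are illustrative, not the actual `Hashtable` members):

```cpp
#include <string>
#include <utility>
#include <vector>

// Sketch only: a bucket holds a one-based index into the entry list,
// and zero means "empty".
std::vector<int> buckets(16);                          // zero-filled by default, so every bucket starts empty
std::vector<std::pair<std::string, int>> entries(1);   // initial size one, so index 0 is never a real entry
```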
Unfortunately, this approach forces us into a near total rewrite of the `Hashtable` class. But no matter: we have precious CPU cycles at stake! After a lot more logic errors than I care to admit, I was able to come up with these changes: Hashtable.h diff. Overall, I think this actually looks simpler, even if some of the loop patterns are duplicated. Where possible, I extracted helper functions to avoid literal copy/paste code. Also note that we no longer need to track any additional bookkeeping about occupied buckets, so the Entry is now just a `typedef`:

```cpp
typedef std::pair<TKey, TValue> Entry;
```
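Continuing that sketch, a lookup walks the probe chain until it reaches a zero bucket. The following is a hedged illustration only (a free function with linear probing and `std::hash` standing in for the table’s own hash function), not the code from the diff:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Probe buckets until we hit a zero (empty) slot; any nonzero value is a
// one-based index into the entry list. Assumes a load-factor cap keeps
// the table from ever being completely full.
const int* find_sketch(
    const std::vector<int>& buckets,
    const std::vector<std::pair<std::string, int>>& entries,
    const std::string& key)
{
    int n = static_cast<int>(buckets.size());
    int index = static_cast<int>(std::hash<std::string>()(key) % n);
    while (buckets[index] != 0)
    {
        const std::pair<std::string, int>& entry = entries[buckets[index]];
        if (entry.first == key)
        {
            return &entry.second;
        }
        index = (index + 1) % n;    // linear probe to the next bucket
    }
    return nullptr;                 // empty bucket reached; key is absent
}
```

An insert would do the same walk, and on reaching a zero bucket, push the new pair onto the entry list and store its one-based index in that bucket.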
Given the structure of this new code, we can even explore a new collision scheme. Recall that we are using linear probing, where we try successive buckets B, B + 1, B + 2, B + 3, etc. In practice, quadratic probing can be much faster than linear probing (with a few caveats about load factor being 0.5 or less). An example quadratic probing scheme would use buckets B, B + 1, B + 4, B + 9, etc. Using the fact that each perfect square is the sum of consecutive odd numbers (1² = 1, 2² = 1 + 3, 3² = 1 + 3 + 5, and so on), we can implement it as follows:
```cpp
void next_index(int& index, int c) const
{
    index += 1 + (2 * c);
    int n = static_cast<int>(buckets_.size());
    if (index >= n)
    {
        index = index % n;
    }
}
```
In this case, `c` represents the (zero-based) collision count: commit.
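As a quick sanity check on that identity, here is a tiny standalone demo (the table size and starting bucket are made up) that inlines the same stepping logic and shows the probe visiting offsets 1, 4, 9, and 16 from the starting bucket:

```cpp
#include <cstdio>

int main()
{
    const int n = 64;   // hypothetical bucket count
    int index = 10;     // hypothetical starting bucket B
    for (int c = 0; c < 4; ++c)
    {
        // Same step as next_index: add the next odd number (1, 3, 5, 7).
        index += 1 + (2 * c);
        if (index >= n)
        {
            index %= n;
        }
        std::printf("after collision %d: bucket %d\n", c, index);
    }
    // Prints buckets 11, 14, 19, 26: that is, B + 1, B + 4, B + 9, B + 16.
    return 0;
}
```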
So now we have implementation A (original), implementation B (use separate buckets and entries vectors), and implementation C (B with quadratic probing). Did any of this make a difference? Let’s run the benchmark again:
Operation | n | Load factor f | Time A | Time B | Time C |
---|---|---|---|---|---|
FindMissing | 10 | 0.25 | 0.137 | 0.134 | 0.143 |
FindPresent | 10 | 0.25 | 0.132 | 0.13 | 0.156 |
Insert | 10 | 0.25 | 0.869 | 1.29 | 1.315 |
FindMissing | 100 | 0.25 | 1.305 | 1.315 | 1.313 |
FindPresent | 100 | 0.25 | 1.29 | 1.267 | 1.248 |
Insert | 100 | 0.25 | 16.809 | 9.726 | 9.677 |
FindMissing | 1000 | 0.25 | 13.147 | 12.983 | 13.071 |
FindPresent | 1000 | 0.25 | 12.858 | 12.699 | 12.39 |
Insert | 1000 | 0.25 | 155.848 | 126.977 | 127.08 |
FindMissing | 10000 | 0.25 | 202.026 | 129.621 | 131.877 |
FindPresent | 10000 | 0.25 | 128.109 | 128.009 | 124.676 |
Insert | 10000 | 0.25 | 4560.988 | 1433.636 | 1519.233 |
FindMissing | 100000 | 0.25 | 1326.763 | 1299.435 | 1314.786 |
FindPresent | 100000 | 0.25 | 1300.205 | 1273.827 | 1248.835 |
Insert | 100000 | 0.25 | 30199.97 | 15307.192 | 15920.22 |
FindMissing | 10 | 0.5 | 0.137 | 0.134 | 0.137 |
FindPresent | 10 | 0.5 | 0.132 | 0.131 | 0.13 |
Insert | 10 | 0.5 | 1.211 | 1.469 | 1.469 |
FindMissing | 100 | 0.5 | 1.302 | 1.301 | 1.306 |
FindPresent | 100 | 0.5 | 1.286 | 1.267 | 1.252 |
Insert | 100 | 0.5 | 10.765 | 10.147 | 10.127 |
FindMissing | 1000 | 0.5 | 13.053 | 13.091 | 13.171 |
FindPresent | 1000 | 0.5 | 13.114 | 12.752 | 12.976 |
Insert | 1000 | 0.5 | 94.139 | 86.497 | 87.787 |
FindMissing | 10000 | 0.5 | 131.463 | 131.698 | 131.291 |
FindPresent | 10000 | 0.5 | 127.911 | 127.799 | 124.366 |
Insert | 10000 | 0.5 | 2145.042 | 1308.991 | 1290.921 |
FindMissing | 100000 | 0.5 | 1295.371 | 1297.845 | 1301.38 |
FindPresent | 100000 | 0.5 | 1259.002 | 1268.924 | 1249.057 |
Insert | 100000 | 0.5 | 26632.592 | 17775.16 | 17773.18 |
FindMissing | 10 | 0.75 | 0.137 | 0.143 | 0.137 |
FindPresent | 10 | 0.75 | 0.132 | 0.13 | 0.129 |
Insert | 10 | 0.75 | 1.116 | 1.554 | 1.572 |
FindMissing | 100 | 0.75 | 1.292 | 1.306 | 1.316 |
FindPresent | 100 | 0.75 | 1.285 | 1.271 | 1.256 |
Insert | 100 | 0.75 | 12.537 | 12.031 | 11.966 |
FindMissing | 1000 | 0.75 | 13.19 | 12.976 | 13.125 |
FindPresent | 1000 | 0.75 | 14.007 | 12.686 | 12.453 |
Insert | 1000 | 0.75 | 138.326 | 118.606 | 115.429 |
FindMissing | 10000 | 0.75 | 130.416 | 129.792 | 131.337 |
FindPresent | 10000 | 0.75 | 127.876 | 127.651 | 125.036 |
Insert | 10000 | 0.75 | 1388.83 | 1365.991 | 1241.811 |
FindMissing | 100000 | 0.75 | 1294.254 | 1297.491 | 1311.522 |
FindPresent | 100000 | 0.75 | 1271.941 | 1417.87 | 1324.824 |
Insert | 100000 | 0.75 | 28530.218 | 21140.233 | 20065.233 |
FindMissing | 10 | 1 | 0.137 | 0.134 | 0.137 |
FindPresent | 10 | 1 | 0.131 | 0.131 | 0.129 |
Insert | 10 | 1 | 0.988 | 1.618 | 1.642 |
FindMissing | 100 | 1 | 1.298 | 1.301 | 1.315 |
FindPresent | 100 | 1 | 1.289 | 1.26 | 1.253 |
Insert | 100 | 1 | 10.695 | 12.925 | 12.879 |
FindMissing | 1000 | 1 | 13.024 | 13.055 | 15.003 |
FindPresent | 1000 | 1 | 12.719 | 18.594 | 16.313 |
Insert | 1000 | 1 | 267.827 | 252.815 | 162.483 |
FindMissing | 10000 | 1 | 129.751 | 129.187 | 131.334 |
FindPresent | 10000 | 1 | 127.38 | 127.444 | 124.795 |
Insert | 10000 | 1 | 5708.22 | 5482.168 | 2368.82 |
FindMissing | 100000 | 1 | 1299.615 | 1293.964 | 1314.281 |
FindPresent | 100000 | 1 | 1265.741 | 1259.396 | 1252.715 |
Insert | 100000 | 1 | 94450.147 | 109098.851 | 31993.897 |
Paying attention to the 0.5 load factor numbers (since this is what we chose for the trie’s use case), we are clearly doing better with B. C is roughly the same overall, though drastically faster in some situations as the load factor increases. For now we’ll declare C the winner and use it as our solver implementation.
The final contest will be a head-to-head tourney of the managed and native solver implementations, to see if our theoretical improvements translate to actual wall clock time savings:
Puzzle | .NET Core Solver | C++ Solver |
---|---|---|
RME WCL TGK API | [000.000] Loading trie... [000.059] Loaded 162309 words. [000.060] Finding valid words... [000.072] Found 754 valid words. [000.072] Finding solutions... MARKETPLACE-EARWIG WIGWAM-MARKETPLACE PRAGMATIC-CAKEWALK [000.080] Done. | [000.000] Loading trie... [000.049] Loaded 162309 words. [000.049] Finding valid words... [000.052] Found 754 valid words. [000.052] Finding solutions... MARKETPLACE-EARWIG WIGWAM-MARKETPLACE PRAGMATIC-CAKEWALK [000.054] Done. |
ERG BID NCF TAO | [000.000] Loading trie... [000.056] Loaded 162309 words. [000.057] Finding valid words... [000.071] Found 1436 valid words. [000.072] Finding solutions... ROBAND-DEFECTING -- snipped 72 other solutions -- OBTECTED-DRAFTING [000.120] Done. | [000.000] Loading trie... [000.049] Loaded 162309 words. [000.050] Finding valid words... [000.053] Found 1436 valid words. [000.053] Finding solutions... BAITED-DEFORCING -- snipped 72 other solutions -- OBTECTED-DRAFTING [000.092] Done. |
C++ wins! It “only” took a custom data structure and several iterations of optimizations, but we got there in the end. Was it worth it? Sure, if all we care about is personal enrichment. Otherwise, you should look at your situation and heed the advice of Carlos Bueno:
> Unless your optimizations are going to stick around long enough to pay for the time you spend making them plus the opportunity cost of not doing something else, it’s a net loss.