An even faster hash table


Last week, in my continuing exploration of a C++ Letter Boxed solver, I had to build a faster hash table to improve performance. Today we’ll make it even faster, hoping to catch up to the performance of my .NET Core solver app.

Before making the code changes, I first needed to fix the benchmark code. The find operations were not really working the way they should have been. (Luckily the numbers didn’t change all that much.)

Now, let’s start optimizing! First, note that Hashtable has a single vector for its buckets. Each entry looks like this:

struct Entry
{
    TKey key_;
    TValue value_;
    bool occupied_; // tracks whether this slot currently holds a live entry
};

This means that every time we resize the table, we have to reallocate and copy all of these keys and values again. Perhaps we can get a performance edge by using two vectors: one for the buckets, which would just be a list of ints, and one for the entries, which would never need to change except when appending brand-new key-value pairs. The number in each bucket would refer to the index of the entry stored there.

Since a vector of ints zero-fills by default, we can take advantage of that by treating a bucket value of zero as "empty." We just need to be careful to one-index the entry list, setting its initial size to one so that index zero is never used for a real entry.

Unfortunately, this approach forces us into a near total rewrite of the Hashtable class. But no matter — we have precious CPU cycles at stake! After a lot more logic errors than I care to admit, I was able to come up with these changes: Hashtable.h diff. Overall, I think this actually looks simpler, even if some of the loop patterns are duplicated. Where possible, I extracted helper functions to avoid literal copy/paste code. Also note that we do not need to keep track of any additional bookkeeping information about occupied buckets, so now the Entry is just a typedef:

typedef std::pair<TKey, TValue> Entry;
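
To make the new layout concrete, here is a minimal sketch of how the two vectors could fit together. The member names (buckets_, entries_, hash_) are illustrative assumptions, not necessarily the exact code in the diff:

#include <functional>
#include <utility>
#include <vector>

template <typename TKey, typename TValue>
class Hashtable
{
public:
    explicit Hashtable(int bucket_count)
        : buckets_(bucket_count), // zero-filled, so every bucket starts out empty
          entries_(1)             // one dummy entry reserves index 0 as the empty sentinel
    {
    }

    // ... insert/find/resize operate on buckets_, touching entries_ only to append ...

private:
    typedef std::pair<TKey, TValue> Entry;

    std::hash<TKey> hash_;       // hashing policy (assumed here; the real class may differ)
    std::vector<int> buckets_;   // each element is an index into entries_ (0 == empty)
    std::vector<Entry> entries_; // append-only; a bucket resize never copies these
};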

Given the structure of this new code, we can even explore a new collision scheme. Recall that we are using linear probing, where we try successive buckets B, B + 1, B + 2, B + 3, and so on. In practice, quadratic probing can be much faster than linear probing (with a few caveats, such as keeping the load factor at 0.5 or below). A quadratic probing scheme would instead use buckets B, B + 1, B + 4, B + 9, and so on. Using the fact that each perfect square is a sum of consecutive odd numbers starting from one (1² = 1, 2² = 1 + 3, 3² = 1 + 3 + 5, and so on), we can implement it as follows:

void next_index(int& index, int c) const
{
    // Add the next odd number; after the step for collision c, the index
    // has advanced by 1 + 3 + ... + (2c + 1) = (c + 1)^2 in total.
    index += 1 + (2 * c);
    int n = static_cast<int>(buckets_.size());
    if (index >= n)
    {
        index = index % n; // wrap around to the start of the table
    }
}

In this case, c represents the (zero-based) collision count: commit.
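
To see how next_index fits into the probing loop, here is a hypothetical find method written against the layout sketched above. The real code is in the linked commit, so treat this purely as an illustration (hash_ is the assumed std::hash<TKey> member from the earlier sketch):

const TValue* find(const TKey& key) const
{
    int index = static_cast<int>(hash_(key) % buckets_.size());
    for (int c = 0; ; ++c)
    {
        int e = buckets_[index];
        if (e == 0)
        {
            return nullptr; // reached an empty bucket, so the key is absent
        }

        if (entries_[e].first == key)
        {
            return &entries_[e].second; // found the key; hand back its value
        }

        next_index(index, c); // probe B, B + 1, B + 4, B + 9, ...
    }
}

The loop relies on the table never being completely full, and quadratic probe sequences are not guaranteed to visit every bucket for arbitrary table sizes, which is one reason for the load factor caveats mentioned above.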

So now we have implementation A (the original), implementation B (separate buckets and entries vectors), and implementation C (B with quadratic probing). Did any of this make a difference? Let’s run the benchmark again (n is the number of elements, f is the load factor):

Operation    n       f     Time A     Time B      Time C
FindMissing  10      0.25  0.137      0.134       0.143
FindPresent  10      0.25  0.132      0.13        0.156
Insert       10      0.25  0.869      1.29        1.315
FindMissing  100     0.25  1.305      1.315       1.313
FindPresent  100     0.25  1.29       1.267       1.248
Insert       100     0.25  16.809     9.726       9.677
FindMissing  1000    0.25  13.147     12.983      13.071
FindPresent  1000    0.25  12.858     12.699      12.39
Insert       1000    0.25  155.848    126.977     127.08
FindMissing  10000   0.25  202.026    129.621     131.877
FindPresent  10000   0.25  128.109    128.009     124.676
Insert       10000   0.25  4560.988   1433.636    1519.233
FindMissing  100000  0.25  1326.763   1299.435    1314.786
FindPresent  100000  0.25  1300.205   1273.827    1248.835
Insert       100000  0.25  30199.97   15307.192   15920.22
FindMissing  10      0.5   0.137      0.134       0.137
FindPresent  10      0.5   0.132      0.131       0.13
Insert       10      0.5   1.211      1.469       1.469
FindMissing  100     0.5   1.302      1.301       1.306
FindPresent  100     0.5   1.286      1.267       1.252
Insert       100     0.5   10.765     10.147      10.127
FindMissing  1000    0.5   13.053     13.091      13.171
FindPresent  1000    0.5   13.114     12.752      12.976
Insert       1000    0.5   94.139     86.497      87.787
FindMissing  10000   0.5   131.463    131.698     131.291
FindPresent  10000   0.5   127.911    127.799     124.366
Insert       10000   0.5   2145.042   1308.991    1290.921
FindMissing  100000  0.5   1295.371   1297.845    1301.38
FindPresent  100000  0.5   1259.002   1268.924    1249.057
Insert       100000  0.5   26632.592  17775.16    17773.18
FindMissing  10      0.75  0.137      0.143       0.137
FindPresent  10      0.75  0.132      0.13        0.129
Insert       10      0.75  1.116      1.554       1.572
FindMissing  100     0.75  1.292      1.306       1.316
FindPresent  100     0.75  1.285      1.271       1.256
Insert       100     0.75  12.537     12.031      11.966
FindMissing  1000    0.75  13.19      12.976      13.125
FindPresent  1000    0.75  14.007     12.686      12.453
Insert       1000    0.75  138.326    118.606     115.429
FindMissing  10000   0.75  130.416    129.792     131.337
FindPresent  10000   0.75  127.876    127.651     125.036
Insert       10000   0.75  1388.83    1365.991    1241.811
FindMissing  100000  0.75  1294.254   1297.491    1311.522
FindPresent  100000  0.75  1271.941   1417.87     1324.824
Insert       100000  0.75  28530.218  21140.233   20065.233
FindMissing  10      1     0.137      0.134       0.137
FindPresent  10      1     0.131      0.131       0.129
Insert       10      1     0.988      1.618       1.642
FindMissing  100     1     1.298      1.301       1.315
FindPresent  100     1     1.289      1.26        1.253
Insert       100     1     10.695     12.925      12.879
FindMissing  1000    1     13.024     13.055      15.003
FindPresent  1000    1     12.719     18.594      16.313
Insert       1000    1     267.827    252.815     162.483
FindMissing  10000   1     129.751    129.187     131.334
FindPresent  10000   1     127.38     127.444     124.795
Insert       10000   1     5708.22    5482.168    2368.82
FindMissing  100000  1     1299.615   1293.964    1314.281
FindPresent  100000  1     1265.741   1259.396    1252.715
Insert       100000  1     94450.147  109098.851  31993.897

Paying attention to the 0.5 load factor numbers (since that is what we chose for the trie’s use case), we are clearly doing better with B. C is roughly on par with B, though drastically faster in some situations as the load factor increases. For now we’ll declare C the winner and use it as our solver implementation.

The final contest will be a head-to-head tourney of the managed and native solver implementations, to see if our theoretical improvements translate to actual wall clock time savings:

Puzzle: RME / WCL / TGK / API

.NET Core Solver:

[000.000] Loading trie...
[000.059] Loaded 162309 words.
[000.060] Finding valid words...
[000.072] Found 754 valid words.
[000.072] Finding solutions...
MARKETPLACE-EARWIG
WIGWAM-MARKETPLACE
PRAGMATIC-CAKEWALK
[000.080] Done.

C++ Solver:

[000.000] Loading trie...
[000.049] Loaded 162309 words.
[000.049] Finding valid words...
[000.052] Found 754 valid words.
[000.052] Finding solutions...
MARKETPLACE-EARWIG
WIGWAM-MARKETPLACE
PRAGMATIC-CAKEWALK
[000.054] Done.

Puzzle: ERG / BID / NCF / TAO

.NET Core Solver:

[000.000] Loading trie...
[000.056] Loaded 162309 words.
[000.057] Finding valid words...
[000.071] Found 1436 valid words.
[000.072] Finding solutions...
ROBAND-DEFECTING
 -- snipped 72 other solutions --
OBTECTED-DRAFTING
[000.120] Done.

C++ Solver:

[000.000] Loading trie...
[000.049] Loaded 162309 words.
[000.050] Finding valid words...
[000.053] Found 1436 valid words.
[000.053] Finding solutions...
BAITED-DEFORCING
 -- snipped 72 other solutions --
OBTECTED-DRAFTING
[000.092] Done.

C++ wins! It “only” took a custom data structure and several iterations of optimizations, but we got there in the end. Was it worth it? Sure, if all we care about is personal enrichment. Otherwise, you should look at your situation and heed the advice of Carlos Bueno:

Unless your optimizations are going to stick around long enough to pay for the time you spend making them plus the opportunity cost of not doing something else, it’s a net loss.
