A faster hash table

Last time, I wrote about a native C++ implementation of the Letter Boxed solver. Vexingly, this native implementation was actually slower than the corresponding .NET Core app. Today we’ll try to fix that.

As I mentioned, some basic profiling revealed that the STL unordered_map was the cause of the slowdown. The documentation even says:

Moreover, inserting an element invalidates no iterators, and removing an element invalidates only those iterators which point at the removed element.

We actually do not need such a strong guarantee; indeed, we need no iterators at all in our use case. What happens if we just abandon this general data structure in favor of one with the minimal operations we require? It seems we have come to a situation every budding computer scientist dreams of: to improve performance, we need to build a custom hash table.

If, like most of us, your CS fundamentals are rusty and you don’t write hash tables for fun or profit, consult the “Hash table” Wikipedia page for a refresher. In summary, we will need a list of buckets to hold our data items, a hash function and equality operation for the key type, and a collision strategy for keys that hash to the same bucket. To keep our V1 implementation simple, we’ll choose an open addressing scheme, where every bucket is “open” (free to be used) by any entry; in other words, if a key’s home bucket is occupied, we are allowed to store it in a different bucket. To resolve collisions, we will use linear probing: scan forward one bucket at a time (wrapping around at the end of the table) until we find a free slot. With open addressing, the load factor (the fraction of the table we are allowed to fill before resizing) matters a great deal, so we need to keep it in mind as well.
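To make the probing idea concrete, here is a minimal standalone sketch of open addressing with linear probing. This is illustrative only (the class and member names are mine, not the solver's); it omits resizing, so the caller must keep the table below full or insertion will fail.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Illustrative linear-probing table: each bucket is either empty or
// holds one key/value pair. No resizing; capacity is fixed.
template <typename K, typename V>
class ProbingTable
{
public:
    explicit ProbingTable(std::size_t capacity) : buckets_(capacity) {}

    bool Insert(const K& key, const V& value)
    {
        std::size_t i = std::hash<K>{}(key) % buckets_.size();
        // Linear probing: scan forward (wrapping around) until we find
        // the key (update) or an open bucket (insert).
        for (std::size_t n = 0; n < buckets_.size(); ++n)
        {
            Bucket& b = buckets_[i];
            if (!b.used || std::equal_to<K>{}(b.key, key))
            {
                b.used = true;
                b.key = key;
                b.value = value;
                return true;
            }
            i = (i + 1) % buckets_.size();
        }
        return false; // table is completely full
    }

    std::optional<V> Get(const K& key) const
    {
        std::size_t i = std::hash<K>{}(key) % buckets_.size();
        for (std::size_t n = 0; n < buckets_.size(); ++n)
        {
            const Bucket& b = buckets_[i];
            if (!b.used)
            {
                return std::nullopt; // an open bucket ends the probe chain
            }
            if (std::equal_to<K>{}(b.key, key))
            {
                return b.value;
            }
            i = (i + 1) % buckets_.size();
        }
        return std::nullopt;
    }

private:
    struct Bucket
    {
        bool used = false;
        K key{};
        V value{};
    };

    std::vector<Bucket> buckets_;
};
```

Note that an empty bucket terminates a lookup early; this is why deletions are tricky with open addressing (they would break probe chains), but our use case never deletes.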

Since I’m no Bob Jenkins, I will be using a prime number size for my table to blunt the effects of mediocre hash functions. When the table is full, I will simply double the capacity (geometric expansion with growth factor 2). Very conveniently, INT_MAX for 32-bit integers (2,147,483,647, which is 2^31 − 1, a Mersenne prime) is itself prime. So we can set it as our upper bound, then repeatedly halve and find the nearest prime at each step. This C++ program should find all the primes we need:

#include <cmath>
#include <cstdio>

bool IsPrime(int n)
{
    // Assumes n >= 3; even numbers are rejected immediately.
    if ((n % 2) == 0)
    {
        return false;
    }

    // Trial division up to sqrt(n); even factors cannot divide an odd n.
    int k = static_cast<int>(std::sqrt(n));
    for (int f = 3; f <= k; f += 2)
    {
        if ((n % f) == 0)
        {
            return false;
        }
    }

    return true;
}

template <typename TFound>
void FindPrimes(int min, int max, TFound found)
{
    // Walk downward from max. Each time a prime is found, report it and
    // halve the candidate (bumping an even result up to the next odd
    // number); otherwise step down by 2 to the next odd candidate.
    int i = max;
    while (i >= min)
    {
        if (IsPrime(i))
        {
            found(i);
            i /= 2;
            if ((i % 2) == 0)
            {
                ++i;
            }
        }
        else
        {
            i -= 2;
        }
    }
}

int main()
{
    FindPrimes(3, 2147483647, [](int p) { printf("%d\n", p); });
    return 0;
}

For the Letter Boxed solver we only need two operations from our hash table: get, which returns a value if one is present, and insert, which adds or updates a value and optionally returns the previous one. We can reuse the STL definitions std::hash and std::equal_to for the hash function and equality operation, respectively.
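Both STL function objects are stateless and callable directly, which is what makes them convenient defaults for the table's template parameters. A trivial standalone check of what they provide (these wrapper names are mine, not the solver's):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// std::hash<K> maps a key to a std::size_t; equal keys always hash
// to the same value within a single run of the program.
std::size_t HashKey(const std::string& key)
{
    return std::hash<std::string>{}(key);
}

// std::equal_to<K> is a function object wrapping operator==.
bool KeysEqual(const std::string& a, const std::string& b)
{
    return std::equal_to<std::string>{}(a, b);
}
```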

After experimenting, writing a few tests, and debugging more than a few infinite loops, I ended up with this no-frills implementation: Hashtable.h, HashtableTest.cpp.

To figure out if this was actually any good, I had to write a simple benchmark program. (Maybe I’ll try Google Benchmark some other time.) Here it is: Words.Benchmark.Native/Main.cpp. Three core benchmarks are tested, each with parameters n (number of elements) and f (load factor of hash table):

  • FindMissing: try to get elements known to be missing from the table
  • FindPresent: get elements known to be present in the table
  • Insert: insert elements

Here are the results I got on my machine, conveniently sorted by f then n:

Operation n f Time (usec)
FindMissing 10 0.25 0.205
FindPresent 10 0.25 0.199
Insert 10 0.25 0.875
FindMissing 100 0.25 1.972
FindPresent 100 0.25 1.911
Insert 100 0.25 19.821
FindMissing 1000 0.25 19.846
FindPresent 1000 0.25 19.241
Insert 1000 0.25 141.171
FindMissing 10000 0.25 196.617
FindPresent 10000 0.25 190.697
Insert 10000 0.25 4630.353
FindMissing 100000 0.25 1969.729
FindPresent 100000 0.25 1913.572
Insert 100000 0.25 30337.493
FindMissing 10 0.50 0.204
FindPresent 10 0.50 0.137
Insert 10 0.50 1.252
FindMissing 100 0.50 1.978
FindPresent 100 0.50 1.918
Insert 100 0.50 12.975
FindMissing 1000 0.50 19.552
FindPresent 1000 0.50 18.869
Insert 1000 0.50 112.012
FindMissing 10000 0.50 195.986
FindPresent 10000 0.50 177.933
Insert 10000 0.50 2413.639
FindMissing 100000 0.50 1973.212
FindPresent 100000 0.50 1897.388
Insert 100000 0.50 28885.607
FindMissing 10 0.75 0.137
FindPresent 10 0.75 0.199
Insert 10 0.75 1.34
FindMissing 100 0.75 1.984
FindPresent 100 0.75 1.92
Insert 100 0.75 15.118
FindMissing 1000 0.75 19.513
FindPresent 1000 0.75 19.122
Insert 1000 0.75 159.648
FindMissing 10000 0.75 197.038
FindPresent 10000 0.75 137.337
Insert 10000 0.75 1726.716
FindMissing 100000 0.75 1958.956
FindPresent 100000 0.75 1257.9
Insert 100000 0.75 32195.437
FindMissing 10 1.00 0.137
FindPresent 10 1.00 0.198
Insert 10 1.00 1.248
FindMissing 100 1.00 1.985
FindPresent 100 1.00 1.931
Insert 100 1.00 14.176
FindMissing 1000 1.00 19.738
FindPresent 1000 1.00 19.265
Insert 1000 1.00 277.614
FindMissing 10000 1.00 196.745
FindPresent 10000 1.00 191.986
Insert 10000 1.00 8672.173
FindMissing 100000 1.00 1970.514
FindPresent 100000 1.00 1263.978
Insert 100000 1.00 97280.546

Clearly the hash table performs terribly as the load factor approaches 1.0 (as “Big-O” would have predicted). But 0.5 looks like the sweet spot: we are “only” wasting half the table’s slots, and we get a nice speedup in return.
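The resize trigger itself is just an arithmetic check; a minimal sketch of the policy (the function name is mine, not taken from the solver's code):

```cpp
#include <cstddef>

// Grow the table before an insert would push the occupancy past the
// chosen maximum load factor (e.g. 0.5, per the benchmarks above).
bool ShouldGrow(std::size_t size, std::size_t capacity, double maxLoadFactor)
{
    return static_cast<double>(size + 1) >
           maxLoadFactor * static_cast<double>(capacity);
}
```

When the check fires, the table would move to the next capacity in the precomputed prime chain (roughly doubling) and rehash every entry, since each key's home bucket depends on the capacity.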

The only thing left was a small set of changes to StrTrie to use the custom hash table, choosing a load factor of 0.5 given the promising benchmark results (commit).

Now let’s see what difference this makes:

[000.000] Loading trie...
[000.075] Loaded 162309 words.
[000.075] Finding valid words...
[000.079] Found 1436 valid words.
[000.079] Finding solutions...
BAITED-DEFORCING
 -- snipped 72 other solutions -- 
OBTECTED-DRAFTING
[000.120] Done.

These results are more than twice as fast as the original native implementation and just about on par with the managed app. Notice that it still takes longer to build the trie (~75 ms compared to ~60 ms) but this loss is offset by the generally faster solution performance.

All in all, this NIH boondoggle was surprisingly productive. And yet, we could probably do even better… but that will have to wait for now.
