Last time, I wrote about a native C++ implementation of the Letter Boxed solver. Vexingly, this native implementation was actually slower than the corresponding .NET Core app. Today we’ll try to fix that.
As I mentioned, some basic profiling revealed that the STL unordered_map was the cause of the slowdown. The documentation even says:
Moreover, inserting an element invalidates no iterators, and removing an element invalidates only those iterators which point at the removed element.
We actually do not need such a strong guarantee — indeed, we need no iterators at all in our use case. What happens if we abandon this general-purpose data structure in favor of one with only the minimal operations we require? It seems we have arrived at a situation every budding computer scientist dreams of: to improve performance, we need to build a custom hash table.
If, like most of us, your CS fundamentals are rusty and you don’t write hash tables for fun or profit, consult the “Hash table” Wikipedia page for a refresher. In summary, we will need an array of buckets to hold our data items, a hash function and an equality operation for the key type, and a collision strategy for keys that hash to the same bucket. To keep our V1 implementation simple, we’ll choose an open addressing scheme, where any bucket is “open” (free to be used) by any hashed entry; in other words, an entry may end up in a bucket other than the one its hash code indicates. To deal with collisions, we will probe for open buckets using linear probing, meaning we keep scanning forward through the buckets until we find a free slot. With open addressing, the load factor (the percentage of the table we are allowed to fill before resizing) matters a great deal, so we need to keep this in mind as well.
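The probe sequence itself is simple arithmetic: start at the key’s “home” bucket and step forward one slot at a time, wrapping around the table. Here is a minimal sketch (the names `HomeBucket` and `NextBucket` are illustrative, not from the solver):

```cpp
#include <cstddef>
#include <functional>
#include <string>

// The "home" bucket is where a key's hash code lands directly.
std::size_t HomeBucket(const std::string& key, std::size_t capacity)
{
    return std::hash<std::string>{}(key) % capacity;
}

// Linear probing: on a collision, simply advance to the next slot,
// wrapping around to the start of the table when we fall off the end.
std::size_t NextBucket(std::size_t i, std::size_t capacity)
{
    return (i + 1) % capacity;
}
```

A lookup keeps calling `NextBucket` until it finds the key or hits an empty slot (which, absent deletions, proves the key is missing).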
Since I’m no Bob Jenkins, I will be using a prime number table size to mitigate the effects of mediocre hash functions. When the table is full, I will simply expand the capacity by doubling (geometric expansion with a growth factor of 2). Very conveniently, INT_MAX (for 32-bit integers) is a prime number, so we can set it as our upper bound and repeatedly halve it, finding the nearest prime at each step. This C++ program should find all the primes we need:
```cpp
#include <cmath>
#include <cstdio>

bool IsPrime(int n)
{
    if ((n % 2) == 0)
    {
        return false;
    }

    int k = static_cast<int>(sqrt(n));
    for (int f = 3; f <= k; ++f)
    {
        if ((n % f) == 0)
        {
            return false;
        }
    }

    return true;
}

template <typename TFound>
void FindPrimes(int min, int max, TFound found)
{
    int i = max;
    while (i >= min)
    {
        if (IsPrime(i))
        {
            found(i);

            // Halve the capacity, keeping the candidate odd.
            i /= 2;
            if ((i % 2) == 0)
            {
                ++i;
            }
        }
        else
        {
            i -= 2;
        }
    }
}

int main()
{
    FindPrimes(3, 2147483647, [](int p) { printf("%d\n", p); });
    return 0;
}
```
For the Letter Boxed solver we only need two operations from our hash table: `get`, which returns a value if present, and `insert`, which adds or updates a value and optionally returns the previous value. We can reuse the STL definitions of `std::hash` and `std::equal_to` for the hash function and equality operations, respectively.
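To make the two operations concrete, here is a toy open-addressing table with just `Get` and `Insert`, parameterized on `std::hash` and `std::equal_to`. This is a sketch only — fixed capacity, no resizing — and all names are illustrative rather than taken from the actual Hashtable.h:

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy open-addressing hash table: Get and Insert only, fixed capacity.
// Empty slots are represented by an empty std::optional.
template <typename TKey, typename TValue,
          typename THash = std::hash<TKey>,
          typename TEquals = std::equal_to<TKey>>
class MiniTable
{
public:
    explicit MiniTable(std::size_t capacity) : slots_(capacity) {}

    // Returns the value for `key` if present.
    std::optional<TValue> Get(const TKey& key) const
    {
        std::size_t i = THash{}(key) % slots_.size();
        while (slots_[i].has_value())
        {
            if (TEquals{}(slots_[i]->first, key))
            {
                return slots_[i]->second;
            }

            i = (i + 1) % slots_.size();
        }

        return std::nullopt;
    }

    // Adds or updates `key`; returns the previous value, if any.
    std::optional<TValue> Insert(const TKey& key, const TValue& value)
    {
        std::size_t i = THash{}(key) % slots_.size();
        while (slots_[i].has_value())
        {
            if (TEquals{}(slots_[i]->first, key))
            {
                TValue old = slots_[i]->second;
                slots_[i]->second = value;
                return old;
            }

            i = (i + 1) % slots_.size();
        }

        slots_[i] = std::make_pair(key, value);
        return std::nullopt;
    }

private:
    std::vector<std::optional<std::pair<TKey, TValue>>> slots_;
};
```

Note that with no resizing, inserting into a full table would loop forever — exactly the kind of infinite loop a real implementation has to guard against.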
After experimenting, writing a few tests, and debugging more than a few infinite loops, I ended up with this no-frills implementation: Hashtable.h, HashtableTest.cpp.
To figure out if this was actually any good, I had to write a simple benchmark program. (Maybe I’ll try Google Benchmark some other time.) Here it is: Words.Benchmark.Native/Main.cpp. Three core benchmarks are tested, each with parameters `n` (number of elements) and `f` (load factor of the hash table):
- FindMissing: try to get elements known to be missing from the table
- FindPresent: get elements known to be present in the table
- Insert: insert elements
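The timing harness for benchmarks like these can be as simple as a loop around `std::chrono` (this is a generic sketch, not the actual Main.cpp; `TimeUsec` is a hypothetical name):

```cpp
#include <chrono>
#include <functional>

// Run `op` for `iterations` repetitions and return the mean time per
// repetition in microseconds. steady_clock avoids wall-clock jumps.
double TimeUsec(const std::function<void()>& op, int iterations)
{
    using Clock = std::chrono::steady_clock;
    auto start = Clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        op();
    }

    auto elapsed = Clock::now() - start;
    double usec = std::chrono::duration<double, std::micro>(elapsed).count();
    return usec / iterations;
}
```

Each benchmark (FindMissing, FindPresent, Insert) would then pass its operation as `op` and report the averaged figure.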
Here are the results I got on my machine, conveniently sorted by `f` then `n`:
Operation | n | f | Time (usec) |
---|---|---|---|
FindMissing | 10 | 0.25 | 0.205 |
FindPresent | 10 | 0.25 | 0.199 |
Insert | 10 | 0.25 | 0.875 |
FindMissing | 100 | 0.25 | 1.972 |
FindPresent | 100 | 0.25 | 1.911 |
Insert | 100 | 0.25 | 19.821 |
FindMissing | 1000 | 0.25 | 19.846 |
FindPresent | 1000 | 0.25 | 19.241 |
Insert | 1000 | 0.25 | 141.171 |
FindMissing | 10000 | 0.25 | 196.617 |
FindPresent | 10000 | 0.25 | 190.697 |
Insert | 10000 | 0.25 | 4630.353 |
FindMissing | 100000 | 0.25 | 1969.729 |
FindPresent | 100000 | 0.25 | 1913.572 |
Insert | 100000 | 0.25 | 30337.493 |
FindMissing | 10 | 0.50 | 0.204 |
FindPresent | 10 | 0.50 | 0.137 |
Insert | 10 | 0.50 | 1.252 |
FindMissing | 100 | 0.50 | 1.978 |
FindPresent | 100 | 0.50 | 1.918 |
Insert | 100 | 0.50 | 12.975 |
FindMissing | 1000 | 0.50 | 19.552 |
FindPresent | 1000 | 0.50 | 18.869 |
Insert | 1000 | 0.50 | 112.012 |
FindMissing | 10000 | 0.50 | 195.986 |
FindPresent | 10000 | 0.50 | 177.933 |
Insert | 10000 | 0.50 | 2413.639 |
FindMissing | 100000 | 0.50 | 1973.212 |
FindPresent | 100000 | 0.50 | 1897.388 |
Insert | 100000 | 0.50 | 28885.607 |
FindMissing | 10 | 0.75 | 0.137 |
FindPresent | 10 | 0.75 | 0.199 |
Insert | 10 | 0.75 | 1.34 |
FindMissing | 100 | 0.75 | 1.984 |
FindPresent | 100 | 0.75 | 1.92 |
Insert | 100 | 0.75 | 15.118 |
FindMissing | 1000 | 0.75 | 19.513 |
FindPresent | 1000 | 0.75 | 19.122 |
Insert | 1000 | 0.75 | 159.648 |
FindMissing | 10000 | 0.75 | 197.038 |
FindPresent | 10000 | 0.75 | 137.337 |
Insert | 10000 | 0.75 | 1726.716 |
FindMissing | 100000 | 0.75 | 1958.956 |
FindPresent | 100000 | 0.75 | 1257.9 |
Insert | 100000 | 0.75 | 32195.437 |
FindMissing | 10 | 1.00 | 0.137 |
FindPresent | 10 | 1.00 | 0.198 |
Insert | 10 | 1.00 | 1.248 |
FindMissing | 100 | 1.00 | 1.985 |
FindPresent | 100 | 1.00 | 1.931 |
Insert | 100 | 1.00 | 14.176 |
FindMissing | 1000 | 1.00 | 19.738 |
FindPresent | 1000 | 1.00 | 19.265 |
Insert | 1000 | 1.00 | 277.614 |
FindMissing | 10000 | 1.00 | 196.745 |
FindPresent | 10000 | 1.00 | 191.986 |
Insert | 10000 | 1.00 | 8672.173 |
FindMissing | 100000 | 1.00 | 1970.514 |
FindPresent | 100000 | 1.00 | 1263.978 |
Insert | 100000 | 1.00 | 97280.546 |
Clearly the hash table performs terribly when the load factor approaches 1.0 (as “Big-O” would have predicted). But 0.5 looks like the sweet spot; we are “only” wasting 50% of the memory but we get a nice speedup in return.
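Enforcing the chosen maximum load factor comes down to a simple occupancy check plus a jump to the next prime capacity. A sketch of that growth policy (the capacity schedule below uses small example primes, not the actual values produced by the prime-finding program; all names are illustrative):

```cpp
#include <array>
#include <cstddef>

// Example prime capacity schedule, each entry roughly double the last.
constexpr std::array<std::size_t, 4> kCapacities = {11, 23, 47, 97};

// Find the next prime capacity larger than the current one.
std::size_t NextCapacity(std::size_t current)
{
    for (std::size_t c : kCapacities)
    {
        if (c > current)
        {
            return c;
        }
    }

    return current;  // Top of the schedule; stop growing.
}

// With maxLoad = 0.5, grow once half or more of the slots are in use.
bool ShouldGrow(std::size_t count, std::size_t capacity, double maxLoad)
{
    return static_cast<double>(count) >= static_cast<double>(capacity) * maxLoad;
}
```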
The only thing left was a small set of changes to StrTrie to use the custom hash table, choosing a load factor of 0.5 given the promising benchmark results (commit).
Now let’s see what difference this makes:
```
[000.000] Loading trie...
[000.075] Loaded 162309 words.
[000.075] Finding valid words...
[000.079] Found 1436 valid words.
[000.079] Finding solutions...
BAITED-DEFORCING
-- snipped 72 other solutions --
OBTECTED-DRAFTING
[000.120] Done.
```
These results are more than twice as fast as the original native implementation and just about on par with the managed app. Notice that it still takes longer to build the trie (~75 ms compared to ~60 ms) but this loss is offset by the generally faster solution performance.
All in all, this NIH boondoggle was surprisingly productive. And yet, we could probably do even better… but that will have to wait for now.