An even faster hash table

Last week, in my continuing exploration of a C++ Letter Boxed solver, I had to build a faster hash table to improve performance. Today we’ll make it even faster, hoping to catch up to the performance of my .NET Core solver app.

Before making the code changes, I first needed to fix the benchmark code. The find operations were not really working the way they should have been. (Luckily the numbers didn’t change all that much.)

Now, let’s start optimizing! First, note that Hashtable has a single vector for its buckets. Each entry looks like this:

struct Entry
{
    TKey key_;
    TValue value_;
    bool occupied_;
};

This means that every time we resize the table, we have to reallocate and copy these keys and values again. Perhaps we can get a performance edge by using two vectors — one for the buckets, which would just be a list of ints, and another for the entries, which would never need to change other than appending new key-value pairs. The number in each bucket would refer to the index of the entry stored there.

Since a std::vector&lt;int&gt; zero-fills its elements by default, we can take advantage of that fact by treating a bucket value of zero as empty. Then we just need to be careful to one-index the entry list, setting its initial size to one so that index zero is never used for a real entry.

Unfortunately, this approach forces us into a near total rewrite of the Hashtable class. But no matter — we have precious CPU cycles at stake! After a lot more logic errors than I care to admit, I was able to come up with these changes: Hashtable.h diff. Overall, I think this actually looks simpler, even if some of the loop patterns are duplicated. Where possible, I extracted helper functions to avoid literal copy/paste code. Also note that we do not need to keep track of any additional bookkeeping information about occupied buckets, so now the Entry is just a typedef:

typedef std::pair<TKey, TValue> Entry;

Given the structure of this new code, we can even explore a new collision scheme. Recall that we are using linear probing, where we try successive buckets B, B + 1, B + 2, B + 3, etc. In practice, quadratic probing can be much faster than linear probing (with a few caveats about load factor being 0.5 or less). An example quadratic probing scheme would use buckets B, B + 1, B + 4, B + 9, etc. Using the fact that each perfect square is the sum of consecutive odd numbers (1² = 1, 2² = 1 + 3, 3² = 1 + 3 + 5, and so on), we can implement it as follows:

void next_index(int& index, int c) const
{
    // Advance by the c-th odd number, so the total offset from the
    // home bucket after c collisions is (c + 1)^2.
    index += 1 + (2 * c);
    int n = static_cast<int>(buckets_.size());
    if (index >= n)
    {
        index = index % n; // wrap around the end of the table
    }
}

In this case, c represents the (zero-based) collision count: commit.

So now we have implementation A (original), implementation B (use separate buckets and entries vectors), and implementation C (B with quadratic probing). Did any of this make a difference? Let’s run the benchmark again:

Operation n f Time A Time B Time C
(n = element count; f = load factor)
FindMissing 10 0.25 0.137 0.134 0.143
FindPresent 10 0.25 0.132 0.13 0.156
Insert 10 0.25 0.869 1.29 1.315
FindMissing 100 0.25 1.305 1.315 1.313
FindPresent 100 0.25 1.29 1.267 1.248
Insert 100 0.25 16.809 9.726 9.677
FindMissing 1000 0.25 13.147 12.983 13.071
FindPresent 1000 0.25 12.858 12.699 12.39
Insert 1000 0.25 155.848 126.977 127.08
FindMissing 10000 0.25 202.026 129.621 131.877
FindPresent 10000 0.25 128.109 128.009 124.676
Insert 10000 0.25 4560.988 1433.636 1519.233
FindMissing 100000 0.25 1326.763 1299.435 1314.786
FindPresent 100000 0.25 1300.205 1273.827 1248.835
Insert 100000 0.25 30199.97 15307.192 15920.22
FindMissing 10 0.5 0.137 0.134 0.137
FindPresent 10 0.5 0.132 0.131 0.13
Insert 10 0.5 1.211 1.469 1.469
FindMissing 100 0.5 1.302 1.301 1.306
FindPresent 100 0.5 1.286 1.267 1.252
Insert 100 0.5 10.765 10.147 10.127
FindMissing 1000 0.5 13.053 13.091 13.171
FindPresent 1000 0.5 13.114 12.752 12.976
Insert 1000 0.5 94.139 86.497 87.787
FindMissing 10000 0.5 131.463 131.698 131.291
FindPresent 10000 0.5 127.911 127.799 124.366
Insert 10000 0.5 2145.042 1308.991 1290.921
FindMissing 100000 0.5 1295.371 1297.845 1301.38
FindPresent 100000 0.5 1259.002 1268.924 1249.057
Insert 100000 0.5 26632.592 17775.16 17773.18
FindMissing 10 0.75 0.137 0.143 0.137
FindPresent 10 0.75 0.132 0.13 0.129
Insert 10 0.75 1.116 1.554 1.572
FindMissing 100 0.75 1.292 1.306 1.316
FindPresent 100 0.75 1.285 1.271 1.256
Insert 100 0.75 12.537 12.031 11.966
FindMissing 1000 0.75 13.19 12.976 13.125
FindPresent 1000 0.75 14.007 12.686 12.453
Insert 1000 0.75 138.326 118.606 115.429
FindMissing 10000 0.75 130.416 129.792 131.337
FindPresent 10000 0.75 127.876 127.651 125.036
Insert 10000 0.75 1388.83 1365.991 1241.811
FindMissing 100000 0.75 1294.254 1297.491 1311.522
FindPresent 100000 0.75 1271.941 1417.87 1324.824
Insert 100000 0.75 28530.218 21140.233 20065.233
FindMissing 10 1 0.137 0.134 0.137
FindPresent 10 1 0.131 0.131 0.129
Insert 10 1 0.988 1.618 1.642
FindMissing 100 1 1.298 1.301 1.315
FindPresent 100 1 1.289 1.26 1.253
Insert 100 1 10.695 12.925 12.879
FindMissing 1000 1 13.024 13.055 15.003
FindPresent 1000 1 12.719 18.594 16.313
Insert 1000 1 267.827 252.815 162.483
FindMissing 10000 1 129.751 129.187 131.334
FindPresent 10000 1 127.38 127.444 124.795
Insert 10000 1 5708.22 5482.168 2368.82
FindMissing 100000 1 1299.615 1293.964 1314.281
FindPresent 100000 1 1265.741 1259.396 1252.715
Insert 100000 1 94450.147 109098.851 31993.897

Paying attention to the 0.5 load factor numbers (since this is what we chose for the trie’s use case), we are clearly doing better with B. C is roughly the same, though drastically faster in some situations as the load factor increases. For now we’ll declare C the winner and use it as our solver implementation.

The final contest will be a head-to-head tourney of the managed and native solver implementations, to see if our theoretical improvements translate to actual wall clock time savings:

Puzzle .NET Core Solver C++ Solver
RME
WCL
TGK
API
.NET Core Solver:
[000.000] Loading trie...
[000.059] Loaded 162309 words.
[000.060] Finding valid words...
[000.072] Found 754 valid words.
[000.072] Finding solutions...
MARKETPLACE-EARWIG
WIGWAM-MARKETPLACE
PRAGMATIC-CAKEWALK
[000.080] Done.
C++ Solver:
[000.000] Loading trie...
[000.049] Loaded 162309 words.
[000.049] Finding valid words...
[000.052] Found 754 valid words.
[000.052] Finding solutions...
MARKETPLACE-EARWIG
WIGWAM-MARKETPLACE
PRAGMATIC-CAKEWALK
[000.054] Done.
ERG
BID
NCF
TAO
.NET Core Solver:
[000.000] Loading trie...
[000.056] Loaded 162309 words.
[000.057] Finding valid words...
[000.071] Found 1436 valid words.
[000.072] Finding solutions...
ROBAND-DEFECTING
 -- snipped 72 other solutions -- 
OBTECTED-DRAFTING
[000.120] Done.
C++ Solver:
[000.000] Loading trie...
[000.049] Loaded 162309 words.
[000.050] Finding valid words...
[000.053] Found 1436 valid words.
[000.053] Finding solutions...
BAITED-DEFORCING
 -- snipped 72 other solutions --
OBTECTED-DRAFTING
[000.092] Done.

C++ wins! It “only” took a custom data structure and several iterations of optimizations, but we got there in the end. Was it worth it? Sure, if all we care about is personal enrichment. Otherwise, you should look at your situation and heed the advice of Carlos Bueno:

Unless your optimizations are going to stick around long enough to pay for the time you spend making them plus the opportunity cost of not doing something else, it’s a net loss.
