Letter Boxed: optimizing the trie

Last time, I demonstrated a relatively fast Letter Boxed solver. It took about 15 ms to solve the puzzle, but that was after a delay of over 200 ms while loading the trie. Surely we can squeeze some more performance out of this!

First, be aware that tries take a lot of memory. My StringTrie fares slightly better than the naïve implementation by using a dictionary instead of a fixed-length array for the child node list. Still, for a typical word list of 100,000+ entries, there will be anywhere from three to ten times that number of nodes, depending on the maximum word length and the number of unique prefixes. Second, heap allocation in .NET can be expensive. Sometimes you just have to deal with heap allocation, for example, when passing around interface references. But when performance is paramount, it often pays to remove the need for allocation entirely.

Given these observations, my hypothesis is that the relatively large number of memory allocations is slowing down the trie load process. Of course, we cannot just use intuition to address performance problems, so let’s measure! But before we can write a benchmark, we need to extract the loading code from the solver app and place it in the Words.Core library: StringTrie.Load

Now we can leverage BenchmarkDotNet to test the load time. The experiment design I landed on involves reading 25%, 50%, or 100% of the complete word list I used last time. To keep the results predictable, I consider the file in chunks of 16 words and read only the first four (for 25%), the first eight (for 50%), or all 16 words (for 100%). This gives a good mix of shared prefixes, as opposed to random sampling or even/odd skipping (either of which would lead to fewer contiguous word entries).
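
Conceptually, the sampling filter looks something like the sketch below (the SampleWords name is hypothetical; the chunk size of 16 and the take counts of 4, 8, and 16 are from the design above):

// Hypothetical sketch of the chunked sampling described above: keep
// only the first 'take' words of every 16-word chunk, giving a
// deterministic sample with contiguous runs of shared prefixes.
// Requires System.Collections.Generic.
private static IEnumerable<string> SampleWords(IEnumerable<string> words, int take)
{
    int i = 0;
    foreach (string word in words)
    {
        if ((i % 16) < take)
        {
            yield return word;
        }

        ++i;
    }
}

After committing the benchmark code, I ran it to collect the results: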

|   Method | Pct |      Mean |    Error |   StdDev |      Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|--------- |---- |----------:|---------:|---------:|-----------:|----------:|----------:|----------:|
| FromFile |  25 |  88.58 ms | 1.148 ms | 1.074 ms |  5000.0000 | 2166.6667 |  666.6667 |  26.78 MB |
| FromFile |  50 | 159.98 ms | 3.493 ms | 3.431 ms |  8500.0000 | 3500.0000 | 1000.0000 |  45.98 MB |
| FromFile | 100 | 270.75 ms | 2.266 ms | 2.009 ms | 15000.0000 | 6500.0000 | 2000.0000 |  79.95 MB |

// * Hints *
Outliers
  LoadTrie.FromFile: Core -> 1 outlier  was  removed (183.99 ms)
  LoadTrie.FromFile: Core -> 1 outlier  was  removed (280.28 ms)

// * Legends *
  Pct       : Value of the 'Pct' parameter
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Gen 0     : GC Generation 0 collects per 1000 operations
  Gen 1     : GC Generation 1 collects per 1000 operations
  Gen 2     : GC Generation 2 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 ms      : 1 Millisecond (0.001 sec)

As I suspected, the allocation sizes are relatively high: almost 80 MB to store the ~162,000 words and their prefixes. We can also see that the garbage collector is performing a fair number of collections of short-lived objects in Gen 0 and Gen 1. Let’s examine the Load method to see if we can account for these allocations:

        public static StringTrie Load(Stream stream)
        {
            StringTrie trie = new StringTrie();
            using (StreamReader reader = new StreamReader(stream))
            {
                string line;
                do
                {
                    line = reader.ReadLine();

                    // Keep only words of 3 to 12 letters; Letter Boxed words
                    // must have at least 3 letters, and we previously limited
                    // the solution space to 12-letter words.
                    if ((line != null) && (line.Length > 2) && (line.Length < 13))
                    {
                        trie.Add(line);
                    }
                }
                while (line != null);
            }

            return trie;
        }

While there aren’t many obvious heap allocations (of the new T() variety), there are plenty of hidden ones. For instance, StreamReader.ReadLine returns a string which had to be allocated from somewhere. Our own StringTrie.Add method will allocate up to N StringTrie.Node instances for a string of length N. Internally, each node constructs a dictionary of child nodes keyed by char. Even more internally, a Dictionary creates arrays for its hash buckets and entries and reallocates them as they fill up. Now it is easy to see why 160K words in our trie cost so much: those words are backed by hundreds of thousands of dictionaries with multiple arrays each.

To begin solving this memory problem, we could use a compressed or succinct trie. However, that requires more complex code and puts more restrictions on the input format; you would have to build the expensive trie once and then save it in a more compact representation for faster deserialization later. Can we find a way to keep our unrestricted “lines of normal text” input file and just load it really fast with fewer allocations?

Yes, this is possible! But we have to rethink the problem a little bit. Let’s start by noting the inefficiency of the data structures we are using. In .NET, a string is an arbitrary-length sequence of Unicode characters, encoded in UTF-16. That means we are spending two whole bytes (65,536 possible values) to represent a character that will only ever be one of 27 values (A-Z plus a null character), which would fit in five bits. Further, we have already limited the solution space to 12-letter words. If we wanted to be parsimonious, we could represent an entire word in only 12 x 5 = 60 bits. If we used a 64-bit unsigned integer (ulong), we would even have four bits left over to represent the length of the string (a maximum of 15, comfortably above our limit of 12).

Okay, so we have a design for a 5-bit character and a 64-bit “stack string.” How exactly does this help reduce the size of our trie? It doesn’t unless we rethink the storage and lookup strategy. Before we had a tree structure of StringTrie.Node instances to facilitate traversal and prefix lookup — but where we’re going, we don’t need nodes. If every string we’ll ever use can fit in a ulong, we can directly use the value as a dictionary key and simply record whether that word is a terminal or just a prefix. This implies a design with a single massive dictionary with ulong keys and bool values (where true means “is terminal”). This doesn’t get rid of heap memory entirely (impractical, if not impossible) but instead amortizes the allocations under the roof of one dictionary structure with one collection of buckets.

Unfortunately we cannot measure the performance impact this new design will have without a near-total reimplementation of the system we previously built. Luckily, the system is small, the algorithms are simple enough, and we have tests of the older classes that we can steal from. “Start typing!”

We will start with our new string and character structs, lovingly named Str and Ch because they are shorter versions of their counterparts: Str.cs and Ch.cs

Ch is an enum with underlying type byte, the smallest integral type in .NET. It has members A through Z corresponding to 1 through 26, respectively, plus a zero value called None. Str is a struct which contains a single ulong value and does a lot of bit twiddling to implement the necessary operations, demonstrated as follows:

// An empty Str with raw value 0.
Str s = default(Str);

// Get the length which is just the lower 4 bits.
byte len = s.Length;

// Append a char and return a new Str; internally
// this shifts the input based on the existing
// length and combines it with the previous value,
// incrementing the length bits.
// The value now is 0x000001A1, or in binary:
//    6           5
// 32109 87654 32109     87654 3210
// -----|-----|-----| / |-----|----|
// 00000 00000 00000 ... 11010 0001
s = s.Append(Ch.Z);

// Get the first char; internally this shifts and
// masks the raw value based on the requested index.
Ch c0 = s[0];

// Returns a "real" string representation of the
// value, converting each Ch value to a char.
string text = s.ToString();
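
The real code is in the linked Str.cs and Ch.cs, but a minimal sketch of the two types, assuming the bit layout just described (bits 0-3 hold the length; character i occupies the five bits starting at bit 4 + 5i), might look like this:

// Minimal sketch; see the linked Str.cs and Ch.cs for the real code.
public enum Ch : byte
{
    None = 0,
    A = 1, B, C, D, E, F, G, H, I, J, K, L, M,
    N, O, P, Q, R, S, T, U, V, W, X, Y, Z,
}

public struct Str
{
    private readonly ulong value;

    private Str(ulong value)
    {
        this.value = value;
    }

    // The length lives in the lowest 4 bits.
    public byte Length
    {
        get { return (byte)(this.value & 0xF); }
    }

    // Character i lives in the 5 bits starting at bit 4 + 5i.
    public Ch this[int index]
    {
        get { return (Ch)((this.value >> (4 + (5 * index))) & 0x1F); }
    }

    // Assumes the current length is less than 12.
    public Str Append(Ch c)
    {
        byte length = this.Length;
        ulong raw = this.value & ~0xFUL;        // clear the length bits
        raw |= (ulong)c << (4 + (5 * length));  // place the char in the next slot
        raw |= (ulong)(length + 1);             // write the incremented length
        return new Str(raw);
    }

    public override string ToString()
    {
        char[] chars = new char[this.Length];
        for (int i = 0; i < chars.Length; ++i)
        {
            chars[i] = (char)('A' + (int)this[i] - 1);
        }

        return new string(chars);
    }
}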

Now we need a StrTrie to go with our Str: StrTrie.cs

The basic idea is that we add words by inserting entries into one massive dictionary. The full word Str is added as a terminal (true), and then every successively shorter prefix of that word is added as a prefix (false) if not already present. So if we were inserting the word “ALE” into an empty trie, we would have dictionary entries like so:

{
  "ALE": true,
  "AL":  false,
  "A":   false,
}

If we subsequently add “ALERT”, we would end up with this:

{
  "ALERT": true,
  "ALER":  false,
  "ALE":   true,
  "AL":    false,
  "A":     false,
}

This Add algorithm requires a chop-like method added to Str, so here it is: Str.Chop
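
Continuing the earlier Str sketch, Chop might be implemented by masking off the last character’s bits and decrementing the length (the linked change has the real code):

// Sketch: return a new Str with the last character removed.
// Assumes the length is at least 1.
public Str Chop()
{
    byte length = this.Length;

    // Keep only the bits below the last character's 5-bit slot...
    ulong mask = (1UL << (4 + (5 * (length - 1)))) - 1;

    // ...then clear the old length bits and write the new length.
    ulong raw = (this.value & mask & ~0xFUL) | (ulong)(length - 1);
    return new Str(raw);
}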

Note that we can no longer traverse the trie because it is flat. Instead we have a StrTrie.Find method which tells us what kind of node the entry is (None if missing, otherwise Prefix or Terminal).
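
Putting it together, a minimal sketch of StrTrie (again, the real class is linked above) could look like this, with Add writing the terminal entry and then walking backwards via Chop until it hits an existing prefix:

// Sketch of the flat trie: one Dictionary holding every word (true)
// and every proper prefix (false). Requires System.Collections.Generic.
public class StrTrie
{
    public enum NodeKind
    {
        None,
        Prefix,
        Terminal,
    }

    private readonly Dictionary<Str, bool> nodes = new Dictionary<Str, bool>();

    public void Add(Str word)
    {
        // The full word is a terminal; use the indexer in case it was
        // previously added as a prefix of a longer word.
        this.nodes[word] = true;

        // Each shorter version is a prefix. If one is already present,
        // all of its own prefixes must be present too, so we can stop.
        for (Str s = word.Chop(); s.Length > 0; s = s.Chop())
        {
            if (this.nodes.ContainsKey(s))
            {
                break;
            }

            this.nodes.Add(s, false);
        }
    }

    public NodeKind Find(Str s)
    {
        bool isTerminal;
        if (!this.nodes.TryGetValue(s, out isTerminal))
        {
            return NodeKind.None;
        }

        return isTerminal ? NodeKind.Terminal : NodeKind.Prefix;
    }
}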

Now we’re ready for a head-to-head benchmark comparison of StrTrie and StringTrie: Str() and String() TrieLoad benchmarks
And yet… the initial results are less than impressive:

| Method | Pct |      Mean |     Error |    StdDev |      Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|------- |---- |----------:|----------:|----------:|-----------:|----------:|----------:|----------:|
| String |  25 |  87.86 ms | 0.3433 ms | 0.2867 ms |  5000.0000 | 2166.6667 |  666.6667 |  26.78 MB |
|    Str |  25 |  35.06 ms | 0.6470 ms | 0.6052 ms |  3357.1429 | 1428.5714 | 1357.1429 |  21.36 MB |
| String |  50 | 158.53 ms | 3.1136 ms | 2.9124 ms |  8500.0000 | 3500.0000 | 1000.0000 |  45.98 MB |
|    Str |  50 |  68.37 ms | 0.2945 ms | 0.2459 ms |  6000.0000 | 1750.0000 | 1625.0000 |  44.08 MB |
| String | 100 | 274.41 ms | 5.3901 ms | 7.3780 ms | 15000.0000 | 6500.0000 | 2000.0000 |  79.95 MB |
|    Str | 100 | 143.03 ms | 0.4444 ms | 0.3470 ms | 11500.0000 | 2750.0000 | 2750.0000 |  90.69 MB |

The speedup is undeniable, almost 50% faster across the board, but for the full word list the allocated memory has actually increased. What gives? Well, we forgot one thing:

The IEquatable<T> interface is used by generic collection objects such as Dictionary<TKey,TValue>, List<T>, and LinkedList<T> when testing for equality in such methods as Contains, IndexOf, LastIndexOf, and Remove. It should be implemented for any object that might be stored in a generic collection.

If we don’t implement IEquatable<T>, the default comparer implementation falls back to object.Equals, which means an expensive boxing operation on every key comparison. Let’s fix that now: IEquatable for Str
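
The change itself is tiny; in the sketch from before, the struct declaration becomes public struct Str : IEquatable<Str> and gains an equality pair along these lines:

// Sketch: compare the single packed ulong directly; no boxing occurs
// when the Dictionary calls the strongly typed overload.
public bool Equals(Str other)
{
    return this.value == other.value;
}

public override bool Equals(object obj)
{
    return (obj is Str) && this.Equals((Str)obj);
}
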
The new results:

| Method | Pct |      Mean |     Error |    StdDev |      Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|------- |---- |----------:|----------:|----------:|-----------:|----------:|----------:|----------:|
| String |  25 |  87.94 ms | 0.7441 ms | 0.6214 ms |  5000.0000 | 2166.6667 |  666.6667 |  26.78 MB |
|    Str |  25 |  32.01 ms | 0.6339 ms | 0.5929 ms |  2875.0000 | 1937.5000 | 1937.5000 |  17.38 MB |
| String |  50 | 157.26 ms | 0.8076 ms | 0.6744 ms |  8500.0000 | 3500.0000 | 1000.0000 |  45.98 MB |
|    Str |  50 |  62.54 ms | 0.3572 ms | 0.2983 ms |  5000.0000 | 3000.0000 | 3000.0000 |  34.66 MB |
| String | 100 | 268.59 ms | 1.6544 ms | 1.3815 ms | 15000.0000 | 6500.0000 | 2000.0000 |  79.95 MB |
|    Str | 100 | 125.43 ms | 3.1945 ms | 2.9881 ms |  8600.0000 | 3800.0000 | 3800.0000 |  69.36 MB |

Better, but there are still so many allocations. This is because we didn’t implement GetHashCode, and the default ValueType.GetHashCode does not perform well. Fixing that: GetHashCode for Str
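
Again a small addition to the Str sketch; since the entire string is one ulong, we can lean on its existing hash:

// Sketch: derive the hash directly from the packed value; the built-in
// ulong hash mixes the upper and lower 32 bits.
public override int GetHashCode()
{
    return this.value.GetHashCode();
}
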
Our results so far:

| Method | Pct |      Mean |     Error |    StdDev |      Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|------- |---- |----------:|----------:|----------:|-----------:|----------:|----------:|----------:|
| String |  25 |  87.68 ms | 0.3789 ms | 0.3164 ms |  5000.0000 | 2166.6667 |  666.6667 |  26.78 MB |
|    Str |  25 |  15.80 ms | 0.3857 ms | 0.3961 ms |  1968.7500 | 1968.7500 | 1968.7500 |  10.15 MB |
| String |  50 | 158.00 ms | 1.7737 ms | 1.6591 ms |  8500.0000 | 3500.0000 | 1000.0000 |  45.98 MB |
|    Str |  50 |  33.67 ms | 0.6532 ms | 0.7522 ms |  1187.5000 |  687.5000 |  687.5000 |  20.89 MB |
| String | 100 | 268.90 ms | 1.5949 ms | 1.3318 ms | 15000.0000 | 6500.0000 | 2000.0000 |  79.95 MB |
|    Str | 100 |  65.27 ms | 1.2211 ms | 1.1993 ms |  2375.0000 | 1375.0000 | 1375.0000 |  43.02 MB |

This is starting to look very good, but we can still do better. Recall that we were using StreamReader.ReadLine to read lines from the input file. That was fine when we actually used the string value in our trie node implementation, but now we are just parsing it into Ch values and throwing it away. Let’s switch to ReadBlock instead to avoid the string allocations. This substantially complicates the reading algorithm, as you can see: use ReadBlock instead of ReadLine
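
The commit is linked above; the core of the idea, sketched with some simplifications (this version assumes an uppercase A-Z word list and glosses over a final word with no trailing newline), looks roughly like this:

// Sketch: read raw chars in bulk and build each Str in place, so no
// intermediate line strings are ever allocated.
public static StrTrie Load(Stream stream)
{
    StrTrie trie = new StrTrie();
    using (StreamReader reader = new StreamReader(stream))
    {
        char[] buffer = new char[4096];
        Str current = default(Str);
        int length = 0;
        int read;
        while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; ++i)
            {
                char c = buffer[i];
                if (c == '\n')
                {
                    // Same filter as before: keep words of 3 to 12 letters.
                    if ((length > 2) && (length < 13))
                    {
                        trie.Add(current);
                    }

                    current = default(Str);
                    length = 0;
                }
                else if (c != '\r')
                {
                    // Track the true length but never append past 12 chars,
                    // which would overflow the 64-bit value.
                    if (++length < 13)
                    {
                        current = current.Append((Ch)(c - 'A' + 1));
                    }
                }
            }
        }
    }

    return trie;
}
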
Was it worth it? Let’s check:

| Method | Pct |      Mean |     Error |    StdDev |      Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|------- |---- |----------:|----------:|----------:|-----------:|----------:|----------:|----------:|
| String |  25 |  87.60 ms | 0.4848 ms | 0.4298 ms |  5000.0000 | 2166.6667 |  666.6667 |  26.78 MB |
|    Str |  25 |  13.77 ms | 0.0614 ms | 0.0513 ms |  1984.3750 | 1984.3750 | 1984.3750 |   8.07 MB |
| String |  50 | 157.46 ms | 2.0659 ms | 1.8313 ms |  8500.0000 | 3500.0000 | 1000.0000 |  45.98 MB |
|    Str |  50 |  28.38 ms | 0.3224 ms | 0.3016 ms |   718.7500 |  718.7500 |  718.7500 |  16.73 MB |
| String | 100 | 267.83 ms | 1.6942 ms | 1.5848 ms | 15000.0000 | 6500.0000 | 2000.0000 |  79.95 MB |
|    Str | 100 |  57.62 ms | 1.2599 ms | 1.2938 ms |  1555.5556 | 1555.5556 | 1555.5556 |   34.7 MB |

Success! We have reduced unnecessary allocations to essentially zero, which we can infer from the Gen 0, Gen 1, and Gen 2 collection counts being equal (indicating that almost all heap objects survived and were promoted to Gen 2).

One possible inefficiency remains: we are using a StreamReader, which internally buffers the data before handing it to us. Can we squeeze out even more performance with unbuffered raw Stream reads? Let’s find out: use raw Stream.Read
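
Since the word list is plain ASCII, each byte maps directly to one character, so the reading loop can be sketched with a byte buffer and no StreamReader at all:

// Sketch: fill a byte buffer straight from the stream; each byte is
// one ASCII character.
byte[] buffer = new byte[4096];
int read;
while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    for (int i = 0; i < read; ++i)
    {
        byte b = buffer[i];
        // ...same newline/append handling as above, with (Ch)(b - 'A' + 1)...
    }
}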

The results now:

| Method | Pct |      Mean |     Error |    StdDev |      Gen 0 |     Gen 1 |     Gen 2 | Allocated |
|------- |---- |----------:|----------:|----------:|-----------:|----------:|----------:|----------:|
| String |  25 |  87.68 ms | 0.7423 ms | 0.6580 ms |  5000.0000 | 2166.6667 |  666.6667 |  26.78 MB |
|    Str |  25 |  13.39 ms | 0.0879 ms | 0.0734 ms |  1984.3750 | 1984.3750 | 1984.3750 |   8.07 MB |
| String |  50 | 156.24 ms | 0.8575 ms | 0.7602 ms |  8500.0000 | 3500.0000 | 1000.0000 |  45.98 MB |
|    Str |  50 |  27.39 ms | 0.1829 ms | 0.1621 ms |   718.7500 |  718.7500 |  718.7500 |  16.73 MB |
| String | 100 | 268.23 ms | 1.9595 ms | 1.5299 ms | 15000.0000 | 6500.0000 | 2000.0000 |  79.95 MB |
|    Str | 100 |  55.44 ms | 0.5060 ms | 0.4485 ms |  1555.5556 | 1555.5556 | 1555.5556 |   34.7 MB |

No real impact on allocations, but we see a small perf gain of about 3% — I’ll take it!

Of course, we now need to rewrite our other data structures to use the new Str* types: LetterBoxStr.cs, LetterBoxStrWords.cs

While writing those first two classes and their associated tests, I got very bogged down by the cumbersome Append().Append().Append() chains to build up the Str values. I fixed this by implementing Str.Parse. Now the tests look much nicer. (Important: this method is only for the convenience of tests — we will lose some of our hard-won perf gains if we use it in Load.)
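
A plausible shape for Parse, built on Append from the earlier sketch (the real method is linked above):

// Sketch: convenient for tests, but using it in Load would mean
// allocating a string per word all over again.
public static Str Parse(string s)
{
    Str result = default(Str);
    foreach (char c in s)
    {
        result = result.Append((Ch)(c - 'A' + 1));
    }

    return result;
}

Now a test can write Str.Parse("ALERT") instead of a five-link Append chain.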

Now it is time for the searching algorithm. We will have to approach the problem from a different angle since node traversal is no longer possible. Instead, we can pass along a Str value, appending to it on each step of the algorithm, and use the StrTrie.Find method to decide whether to continue the recursion. Structurally it is not all that different, though we have to be careful not to append past a length of 12. Incidentally, while writing this code, I discovered and fixed a huge bug in the StrTrie.Add method (thankfully, it did not significantly change the perf results from before). Here is the final code: LetterBoxStrSearch.cs
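
The linked file has the real implementation; the recursion skeleton goes something like the sketch below, assuming a trie field and a hypothetical NextLetters helper that yields the puzzle letters legal after the current last letter (i.e., those on a different side):

// Sketch of the search recursion; NextLetters is a hypothetical helper
// yielding letters from sides other than the last letter's side.
private void Search(Str current, Action<Str> found)
{
    foreach (Ch c in this.NextLetters(current))
    {
        Str next = current.Append(c);
        StrTrie.NodeKind kind = this.trie.Find(next);
        if (kind == StrTrie.NodeKind.None)
        {
            // No words start with this prefix; prune this branch.
            continue;
        }

        if (kind == StrTrie.NodeKind.Terminal)
        {
            found(next);
        }

        // A terminal can still be a prefix of a longer word, but never
        // append past the 12-char capacity of Str.
        if (next.Length < 12)
        {
            this.Search(next, found);
        }
    }
}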

In the last step, we make a one-to-one translation of the solver app from the original types to the Str* types. (Actually, the real last step was a final performance optimization I noticed for the StrTrie.Add method which shaved a further 5% off the micro-benchmark!)

And now the moment of truth — is the solver app actually faster?

> LetterBoxedSolver.exe RMEWCLTGKAPI words.txt
[000.000] Loading trie...
[000.059] Loaded 162309 words.
[000.060] Finding valid words...
[000.072] Found 754 valid words.
[000.072] Finding solutions...
MARKETPLACE-EARWIG
WIGWAM-MARKETPLACE
PRAGMATIC-CAKEWALK
[000.080] Done.

Yes, it is way faster — ~60 ms trie load time + ~20 ms solution time = ~80 ms end to end, compared to ~225 ms originally. I can’t complain!

But wait — wasn’t the solution time actually faster before (~15 ms compared to ~20 ms)? We’ll dig into that discrepancy next time.
