Letter Boxed: perf gains and losses


After successfully optimizing the trie loading path, we apparently “pessimized” the solution path. Let’s see if we can find out why.

I’ll start by building a benchmark for the word finding algorithm: FindWords benchmark
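The shape of that benchmark is roughly the following. This is a self-contained sketch, not the actual repo code: `FakeSearch` and its tiny word list stand in for the real search types, but the counting callback mirrors the real benchmark's approach.

```csharp
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

// Stand-in for the real word-finding algorithm; the actual types and
// signatures in the repo differ, but the benchmark pattern is the same.
public static class FakeSearch
{
    public static void FindWords(IReadOnlyList<string> words, Action<string> found)
    {
        foreach (var w in words)
            if (w.Length >= 3) found(w); // placeholder filter standing in for puzzle rules
    }
}

[MemoryDiagnoser]
public class FindWordsBench
{
    private List<string> words;

    [GlobalSetup]
    public void Setup() => words = new List<string> { "cat", "at", "dog", "house" };

    [Benchmark]
    public int StringF()
    {
        // The callback only counts, so the measurement isolates the search
        // itself rather than any per-word processing.
        int count = 0;
        FakeSearch.FindWords(words, _ => count++);
        return count;
    }
}
```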

Note that this benchmark eliminates all work other than receiving the callback and counting the results. Still, the results clearly show a latency gap of nearly two to one between the String* and Str* versions of the algorithm:

|  Method |       Mean |    Error |   StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|-------- |-----------:|---------:|---------:|------:|------:|------:|----------:|
| StringF |   772.6 us | 1.645 us | 1.458 us |     - |     - |     - |      88 B |
|    StrF | 1,466.9 us | 4.700 us | 4.166 us |     - |     - |     - |      88 B |

There is no appreciable heap memory pressure, so that is not the culprit. We have to dig deeper. We need another benchmark to rule out additional slowdowns, such as in the trie find algorithm: FindTrie benchmark
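The benchmark shape might look roughly like this. It is a stand-in sketch: the set-based "trie" and the placeholder inputs are mine, not the real StrTrie, but the `[Params]` pattern for covering multiple input cases is the interesting part.

```csharp
using System;
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

public class FindTrieBench
{
    // Placeholder inputs; Len = -1 encodes the "missing word with a
    // partially valid prefix" case.
    private static readonly Dictionary<int, string> Inputs = new()
    {
        [-1] = "abczzz",
        [6] = "abcdef",
        [12] = "abcdefabcdef",
    };

    // Stand-in for the real trie: a hash set of the known words.
    private readonly HashSet<string> trie = new() { "abcdef", "abcdefabcdef" };

    // BenchmarkDotNet runs the benchmark once per value listed here.
    [Params(-1, 6, 12)]
    public int Len { get; set; }

    [Benchmark]
    public bool Find() => trie.Contains(Inputs[Len]); // single hash lookup
}
```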

We have three cases covered for the find input: a word that is missing but has a partially valid prefix, a six-letter word that is present, and a 12-letter word that is present. We should expect to see StrTrie emerge as a clear winner given its direct lookup, compared to the recursive algorithm we need for StringTrie. Is our prediction right? Let’s peek at the results:

|  Method | Len |      Mean |     Error |    StdDev |
|-------- |---- |----------:|----------:|----------:|
| StringF |  -1 |  66.92 ns | 0.5942 ns | 0.5558 ns |
|    StrF |  -1 |  14.30 ns | 0.0063 ns | 0.0049 ns |
| StringF |   6 | 103.67 ns | 0.2356 ns | 0.2089 ns |
|    StrF |   6 |  14.47 ns | 0.0329 ns | 0.0292 ns |
| StringF |  12 | 229.12 ns | 4.3933 ns | 4.3148 ns |
|    StrF |  12 |  14.60 ns | 0.0251 ns | 0.0210 ns |

Not only is StrTrie faster by far, but it achieves constant time complexity — no matter the input, it needs but a single hash calculation and lookup to find the result.
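To see why, compare the two lookup shapes side by side. This is a simplified sketch; the real StringTrie and StrTrie carry more state, but the complexity difference is the same.

```csharp
using System;
using System.Collections.Generic;

// A minimal node-based trie, standing in for the StringTrie approach.
class Node
{
    public readonly Dictionary<char, Node> Children = new();
    public bool IsWord;
}

static class Lookups
{
    public static void Insert(Node root, string word)
    {
        var node = root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new Node();
            node = next;
        }
        node.IsWord = true;
    }

    // StringTrie-style: one node dereference per character, O(length),
    // and each hop can land on a different cache line.
    public static bool FindByWalk(Node root, string word)
    {
        var node = root;
        foreach (var c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false;
        return node.IsWord;
    }

    // StrTrie-style: hash the whole word once and probe a set, O(1)
    // no matter how long the word is.
    public static bool FindByHash(HashSet<string> words, string word)
        => words.Contains(word);
}
```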

So now we have a problem. Despite achieving a measurable speedup in a rather important portion of the algorithm, incurring no heap allocation, and adding no appreciable complexity to the recursive search, we still have an unexplained but undeniable slowdown overall. As Carlos Bueno [under]states, “optimization is hard to do well.”

While reading some material from Adam Sitnik about reference vs. value types, I noticed he was using a HardwareCounters diagnoser for BenchmarkDotNet to detect cache misses. This seemed interesting, so I added the same to my FindWords benchmark. Note that collecting this data requires reading an ETW session, which means running elevated. Here is the FindWords benchmark with cache miss metrics:

|  Method |       Mean |     Error |    StdDev | CacheMisses/Op |
|-------- |-----------:|----------:|----------:|---------------:|
| StringF |   768.2 us |  2.025 us |  1.795 us |          1,193 |
|    StrF | 1,468.3 us | 11.890 us | 10.540 us |          2,570 |
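Wiring up the diagnoser is a one-attribute change; `HardwareCounter.CacheMisses` lives in `BenchmarkDotNet.Diagnosers`. This is a sketch of the configuration, not the exact benchmark source, and as noted above it only works on Windows from an elevated prompt.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;

// Adds a CacheMisses/Op column to the results table.
[HardwareCounters(HardwareCounter.CacheMisses)]
[MemoryDiagnoser]
public class FindWords
{
    // ...benchmark methods as before...
}
```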

Clearly the Str algorithm incurs significantly more cache misses (more than twice as many per operation). Could this be the reason for the slowness? I have to admit, I don’t know! I would merely say it is plausibly correlated with the decrease in performance. I also noted while looking at the “IL with C#” view in ILSpy that LetterBoxStrSearch.Next has slightly larger code size than LetterBoxSearch.Next (132 bytes vs. 112 bytes). Again, this could be a factor, but who knows?

Without more advanced profiling tools like Cachegrind or Intel VTune Amplifier, it would be hard to draw many conclusions here. Perhaps this conclusion is unsatisfying, but I think it demonstrates another important lesson of performance optimization — knowing when to stop!
