Digits: faster in practice?

While everyone loves theoretical performance improvements, we need to check back in with reality from time to time. To review: we proposed a near-zero-allocation strategy to improve the speed of a Digits game solver. After all, heap allocation leads to GC pressure, which leads to millions of dollars wasted! Well, potentially. Let’s try a few code optimizations and see how these “improvements” fare in practice.

Recall that our initial benchmark clocked the solver at about 296 ms and 715 MB of allocations:

```
| Method |     Mean |   Error |  StdDev |        Gen0 | Allocated |
|------- |---------:|--------:|--------:|------------:|----------:|
|  Solve | 296.0 ms | 1.29 ms | 1.21 ms | 175000.0000 | 714.83 MB |
```

Let’s start with the observation that any Digits solution is necessarily the result of the move that was just applied. So why not check right away whether `TryMove` yielded our target number? That eliminates the extra work of going back to the board and scanning for the solution again. Here is the code change (Board – TryMove with target), and our new benchmark numbers:

```
| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |
|------- |---------:|--------:|--------:|-----------:|----------:|
|  Solve | 263.5 ms | 4.48 ms | 4.19 ms | 83000.0000 | 665.93 MB |
```
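The post references the change only by name, so here is a minimal sketch of the idea; the `Board` shape, the `Op` enum, and the `TryMove` signature are all assumptions for illustration, not the author’s actual code:

```csharp
using System.Collections.Generic;

enum Op { Add, Sub, Mul, Div }

class Board
{
    private readonly List<int> _numbers;

    public Board(IEnumerable<int> numbers) => _numbers = new List<int>(numbers);

    // Applies a move and reports, in the same call, whether the number it
    // produced equals the target, so the solver can stop immediately
    // instead of rescanning the resulting board for the solution.
    public bool TryMove(int a, int b, Op op, int target, out Board next, out bool hitTarget)
    {
        next = this;
        hitTarget = false;
        int result = op switch
        {
            Op.Add => a + b,
            Op.Sub => a - b,
            Op.Mul => a * b,
            Op.Div => b != 0 && a % b == 0 ? a / b : -1,
            _ => -1,
        };
        if (result <= 0) return false; // Digits disallows negatives and fractions

        var numbers = new List<int>(_numbers);
        numbers.Remove(a);
        numbers.Remove(b);
        numbers.Add(result);
        next = new Board(numbers);
        hitTarget = result == target;
        return true;
    }
}
```

The point is the `hitTarget` out-parameter: the check happens where the new number is created, so no second pass over the board is needed.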

Not bad: we gained more than a 10% speed boost and reduced allocations by 7%! But there is even more low-hanging fruit; let’s change `Move` to a `readonly record struct`, since it is a trivial value-like type anyway:

```
| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |
|------- |---------:|--------:|--------:|-----------:|----------:|
|  Solve | 242.5 ms | 4.83 ms | 5.16 ms | 65333.3333 | 522.22 MB |
```
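The post doesn’t show `Move` itself; a plausible shape for it as a `readonly record struct` (the fields here are assumptions) is simply:

```csharp
enum Op { Add, Sub, Mul, Div }

// A readonly record struct is an immutable value type: instances live on the
// stack or inline inside containers, so candidate moves no longer touch the
// heap, and the compiler generates value-based equality for free.
readonly record struct Move(int Left, int Right, Op Operation, int Result);
```

Value-based equality is a nice side benefit here: two moves with the same operands compare equal without any custom `Equals` code.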

Another 8% speed improvement and a 21% memory reduction! This is fun! Let’s follow suit with what you would think is the biggest memory hog and change `Board` to a readonly struct:

```
| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |
|------- |---------:|--------:|--------:|-----------:|----------:|
|  Solve | 228.2 ms | 4.49 ms | 4.81 ms | 60000.0000 | 479.33 MB |
```

A modest 6% gain in execution speed, with about 8% less allocation. Good, but the wins are getting smaller… How about using an array inside `Board` instead of a `List<int>`?

```
| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |
|------- |---------:|--------:|--------:|-----------:|----------:|
|  Solve | 170.6 ms | 2.98 ms | 2.33 ms | 43666.6667 | 350.66 MB |
```
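Again, the diff isn’t inlined in the post; a sketch of what an array-backed `Board` might look like (the `Replace` helper is illustrative, not the author’s API):

```csharp
// Hypothetical sketch: Board as a readonly struct over a plain int[]
// instead of a List<int>. Each derived board is one exact-size array
// allocation, with no List growth slack or wrapper object.
readonly struct Board
{
    private readonly int[] _numbers;

    public Board(int[] numbers) => _numbers = numbers;

    public int Count => _numbers.Length;

    public int this[int index] => _numbers[index];

    // Produce the board that results from replacing two numbers with the
    // result of a move: copy everything except one occurrence of each
    // operand, then append the new number at the end.
    public Board Replace(int removeA, int removeB, int result)
    {
        var next = new int[_numbers.Length - 1];
        int i = 0;
        bool skippedA = false, skippedB = false;
        foreach (int n in _numbers)
        {
            if (!skippedA && n == removeA) { skippedA = true; continue; }
            if (!skippedB && n == removeB) { skippedB = true; continue; }
            next[i++] = n;
        }
        next[i] = result;
        return new Board(next);
    }
}
```

The array is allocated at exactly the needed size, and indexed access into contiguous memory is about as cheap as reads get.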

Ah, that’s more like it: more than 25% improvement in both time and space. This is about the last easy change we’re going to get, it seems. Next up, let’s try that idea of a fancy VLQ-encoded list, packed into a single 64-bit integer. That would eliminate the most obvious remaining source of heap allocation: the internal board array. We first need a `Base128` type to handle the encode/decode algorithm. To make absolutely sure we get this right, quite a few unit tests guide us. Voilà:

```
public static class Base128
{
    // Decode a VLQ value: seven payload bits per byte, with the high bit set
    // on every byte except the last.
    public static int Read(ReadOnlySpan<byte> input)
    {
        int i = -1;
        int output = 0;
        byte next;
        do
        {
            next = input[++i];
            output |= (next & 0x7F) << (7 * i);
        }
        while (next > 127);

        return output;
    }

    // Encode a non-negative value; returns the number of bytes written.
    public static int Write(int value, Span<byte> output)
    {
        int i = 0;
        do
        {
            byte next = (byte)(value & 0x7F);
            if (value > 127)
            {
                next |= 0x80; // continuation bit: more bytes follow
            }

            value >>= 7;
            output[i++] = next;
        }
        while (value > 0);

        return i;
    }
}
```
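One of those unit tests might be a simple round-trip property check; the class body is repeated here so the snippet stands alone:

```csharp
using System;

public static class Base128
{
    // Decode a VLQ value: seven payload bits per byte, high bit = "more follows".
    public static int Read(ReadOnlySpan<byte> input)
    {
        int i = -1;
        int output = 0;
        byte next;
        do
        {
            next = input[++i];
            output |= (next & 0x7F) << (7 * i);
        }
        while (next > 127);
        return output;
    }

    // Encode a non-negative value; returns the number of bytes written.
    public static int Write(int value, Span<byte> output)
    {
        int i = 0;
        do
        {
            byte next = (byte)(value & 0x7F);
            if (value > 127) next |= 0x80;
            value >>= 7;
            output[i++] = next;
        }
        while (value > 0);
        return i;
    }
}
```

A round trip over boundary values (0, 127, 128, and a few large numbers) exercises both the single-byte and multi-byte paths.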

Now we need the list struct to pack those bits! This ends up being pretty complicated, and we need several more tests to show that the various cases are handled correctly. The result:

```
public readonly struct List6V
{
    private readonly ulong _raw;

    private List6V(ulong raw)
    {
        _raw = raw;
    }

    public int this[byte index]
    {
        get
        {
            Span<byte> bytes = stackalloc byte[6];
            Unpack(bytes);
            // An element starts at its index plus the extra bytes consumed
            // by any multi-byte values before it.
            return Base128.Read(bytes.Slice(index + Delta(index)));
        }
    }

    public byte Count => (byte)(_raw & 0xFF);

    // Extra payload bytes used by the elements before `index`, read from the
    // length-delta bits stored in the top byte.
    private byte Delta(byte index)
    {
        byte bits = (byte)(_raw >> 56);
        return index switch
        {
            1 => (byte)(bits & 0x7),
            2 => (byte)((bits >> 3) & 0x3),
            3 => (byte)((bits >> 5) & 0x3),
            4 => (byte)((bits >> 7) & 0x1),
            _ => 0,
        };
    }

    public List6V Add(int value)
    {
        Span<byte> bytes = stackalloc byte[6];
        Unpack(bytes);
        int delta = Delta(Count);
        int len = Base128.Write(value, bytes.Slice(Count + delta));
        return new(Pack(bytes) | ((ulong)Count + 1) | Adjust(delta + len - 1));
    }

    // Fold the new element's extra-length bits into the top byte.
    private ulong Adjust(int delta)
    {
        byte bits = (byte)(_raw >> 56);
        bits |= Count switch
        {
            0 => (byte)delta,
            1 => (byte)(delta << 3),
            2 => (byte)(delta << 5),
            3 => (byte)(delta << 7),
            _ => 0,
        };
        return (ulong)bits << 56;
    }

    private static ulong Pack(ReadOnlySpan<byte> bytes)
    {
        ulong raw = bytes[0];
        raw |= (ulong)bytes[1] << 8;
        raw |= (ulong)bytes[2] << 16;
        raw |= (ulong)bytes[3] << 24;
        raw |= (ulong)bytes[4] << 32;
        raw |= (ulong)bytes[5] << 40;
        return raw << 8; // payload lives above the count byte
    }

    private void Unpack(Span<byte> bytes)
    {
        bytes[0] = (byte)(_raw >> 8);
        bytes[1] = (byte)(_raw >> 16);
        bytes[2] = (byte)(_raw >> 24);
        bytes[3] = (byte)(_raw >> 32);
        bytes[4] = (byte)(_raw >> 40);
        bytes[5] = (byte)(_raw >> 48);
    }
}
```
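Reading that code, the 64-bit layout is: the low byte holds the element count, the next six bytes hold the VLQ payload, and the top byte holds the per-element length-delta bits. A tiny standalone demo of that packing (the values are arbitrary):

```csharp
using System;

// Layout, as read from List6V above, low bits first:
//   bits  0..7  : element count
//   bits  8..55 : up to six bytes of VLQ-encoded payload
//   bits 56..63 : per-element extra-length ("delta") bits
ulong raw = 0;
raw |= 2UL;                // count = 2
raw |= (ulong)42 << 8;     // payload byte 0
raw |= (ulong)17 << 16;    // payload byte 1
raw |= (ulong)0b001 << 56; // delta bits for the first element

Console.WriteLine($"count={raw & 0xFF}, first={(raw >> 8) & 0xFF}, deltas={raw >> 56}");
```

Everything, count and all, fits in one register-sized value, which is exactly what made the idea so tempting.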

It’s… something. We got rid of heap-allocated lists and arrays, and we fit all of our state into 64 bits, as promised. Does this really pay off? In a word, no. The benchmark is somewhat better on the memory front, as expected, but far worse in speed:

```
| Method |     Mean |   Error |  StdDev |       Gen0 | Allocated |
|------- |---------:|--------:|--------:|-----------:|----------:|
|  Solve | 244.7 ms | 2.84 ms | 2.65 ms | 36666.6667 | 293.16 MB |
```

What happened? We have to drill into another benchmark to see. Let’s look at just the `Board.TryMove` cost for various board sizes (N):

```
|  Method | N |      Mean |    Error |   StdDev | Allocated |
|-------- |-- |----------:|---------:|---------:|----------:|
| TryMove | 2 |  68.40 ns | 0.553 ns | 0.518 ns |         - |
| TryMove | 4 | 136.23 ns | 0.742 ns | 0.694 ns |         - |
| TryMove | 6 | 203.40 ns | 1.079 ns | 1.010 ns |         - |
```

Now let’s compare that to what it was with the plain old array:

```
|  Method | N |     Mean |    Error |   StdDev |   Gen0 | Allocated |
|-------- |-- |---------:|---------:|---------:|-------:|----------:|
| TryMove | 2 | 36.27 ns | 0.152 ns | 0.135 ns | 0.0076 |      64 B |
| TryMove | 4 | 45.57 ns | 0.316 ns | 0.280 ns | 0.0076 |      64 B |
| TryMove | 6 | 51.87 ns | 0.919 ns | 0.860 ns | 0.0095 |      80 B |
```

Yes, the array version carries some heap-allocation cost, but its execution speed is undeniably better. It goes to show that in most situations you really can’t beat the popular (and, by extension, highly optimized) path. With an array you get the best possible read/write speeds (indexed, contiguous memory access) and, if the data is small enough, a good chance of cache locality.

Maybe the VLQ could be better with a lot more work — I admit I didn’t spend a lot of time trying to be clever with bit twiddling hacks and the like. Maybe we can find a better optimization by thinking harder.

Honestly, I’m happy with the simple improvements we already made above. Performance optimization is often a game of making the code ugly in order to make the latency graphs pretty. These particular changes, by contrast, meet three very important criteria: (1) the code afterward is still intelligible, (2) the changes are easy to explain, and (3) they could likely be applied in other contexts.