TKey performance indicators: primitives

Spread the love

Recently I encountered a situation where a seemingly innocuous change in the structure of a dictionary key had a noticeable negative performance impact (more on this later). This got me thinking, how exactly do different types of keys measure up in a head-to-head comparison? Let’s find out!

First, we need to design an appropriate benchmark. (Note that we’ll be using BenchmarkDotNet, of course.) The basic flow will be as follows:

  1. Pick a dictionary key type.
  2. Generate some number of unique keys as a one-time setup step.
  3. Create a dictionary of ints and insert a value for each previously generated key.

For the first set of measurements, I have chosen the following key types: byte, short, int, long, and finally three variations of string (using lengths of 2, 8, and 32 chars respectively). The benchmark code looks like this:

[CoreJob]
[MemoryDiagnoser]
public class DictKeys
{
    private List<byte> byteKeys;
    private List<short> shortKeys;
    private List<int> intKeys;
    private List<long> longKeys;
    private List<string> str2Keys;
    private List<string> str8Keys;
    private List<string> str32Keys;

    [Params(10, 100, 1000)]
    public int N { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        this.byteKeys = Setup(this.N, i => (byte)i);
        this.shortKeys = Setup(this.N, i => (short)i);
        this.intKeys = Setup(this.N, i => i);
        this.longKeys = Setup(this.N, i => (long)i);
        this.str2Keys = Setup(this.N, i => TwoChars(i, 1));
        this.str8Keys = Setup(this.N, i => TwoChars(i, 4));
        this.str32Keys = Setup(this.N, i => TwoChars(i, 16));
    }

    [Benchmark]
    public int ByteK() => Run(this.byteKeys).Count;

    [Benchmark]
    public int ShortK() => Run(this.shortKeys).Count;

    [Benchmark]
    public int IntK() => Run(this.intKeys).Count;

    [Benchmark]
    public int LongK() => Run(this.longKeys).Count;

    [Benchmark]
    public int Str2K() => Run(this.str2Keys).Count;

    [Benchmark]
    public int Str8K() => Run(this.str8Keys).Count;

    [Benchmark]
    public int Str32K() => Run(this.str32Keys).Count;

    private static List<T> Setup<T>(int n, Func<int, T> make)
    {
        List<T> k = new List<T>();
        for (int i = 0; i < n; ++i)
        {
            k.Add(make(i));
        }

        return k;
    }

    private static string TwoChars(int val, int n)
    {
        StringBuilder sb = new StringBuilder();
        char one = (char)('0' + (val / 32));
        char two = (char)('0' + (val % 32));
        for (int i = 0; i < n; ++i)
        {
            sb.Append(one);
            sb.Append(two);
        }

        return sb.ToString();
    }

    private static Dictionary<T, int> Run<T>(List<T> k)
    {
        Dictionary<T, int> dict = new Dictionary<T, int>();
        int n = k.Count;
        for (int i = 0; i < n; ++i)
        {
            dict[k[i]] = i;
        }

        return dict;
    }
}

For the setup step, the integer value is used as-is for the key in all cases except for string. The string keys are a “Base32“-ish encoding of the N parameter value repeated to create the desired length; for example, the first few keys of length 8 are 00000000, 01010101, 02020202, etc. Really, the values don’t matter much, as long as they are unique.

Let’s see the results:

| Method |    N |        Mean |        Error |       StdDev |      Median |   Gen 0 |  Gen 1 | Gen 2 | Allocated |
|------- |----- |------------:|-------------:|-------------:|------------:|--------:|-------:|------:|----------:|
|  ByteK |   10 |    332.8 ns |     3.626 ns |     3.214 ns |    331.5 ns |  0.1230 |      - |     - |     776 B |
| ShortK |   10 |    335.7 ns |     1.480 ns |     1.236 ns |    335.4 ns |  0.1230 |      - |     - |     776 B |
|   IntK |   10 |    332.3 ns |     3.183 ns |     2.485 ns |    331.7 ns |  0.1230 |      - |     - |     776 B |
|  LongK |   10 |    350.5 ns |     1.542 ns |     1.367 ns |    350.3 ns |  0.1574 |      - |     - |     992 B |
|  Str2K |   10 |    588.8 ns |    11.707 ns |    23.381 ns |    575.6 ns |  0.1574 |      - |     - |     992 B |
|  Str8K |   10 |    619.0 ns |     1.886 ns |     1.575 ns |    618.8 ns |  0.1574 |      - |     - |     992 B |
| Str32K |   10 |    799.9 ns |     6.069 ns |     4.739 ns |    799.8 ns |  0.1574 |      - |     - |     992 B |
|  ByteK |  100 |  2,675.2 ns |    16.033 ns |    13.388 ns |  2,675.0 ns |  1.1711 | 0.0191 |     - |    7392 B |
| ShortK |  100 |  2,906.2 ns |    20.842 ns |    16.272 ns |  2,904.8 ns |  1.1711 | 0.0191 |     - |    7392 B |
|   IntK |  100 |  2,728.3 ns |     3.704 ns |     2.892 ns |  2,728.2 ns |  1.1711 | 0.0191 |     - |    7392 B |
|  LongK |  100 |  3,074.8 ns |    61.137 ns |   111.792 ns |  3,039.5 ns |  1.6174 | 0.0381 |     - |   10192 B |
|  Str2K |  100 |  4,831.0 ns |    14.509 ns |    12.116 ns |  4,831.0 ns |  1.6174 | 0.0381 |     - |   10192 B |
|  Str8K |  100 |  5,666.8 ns |   112.301 ns |   196.686 ns |  5,707.1 ns |  1.6174 | 0.0381 |     - |   10192 B |
| Str32K |  100 |  7,494.3 ns |    46.232 ns |    43.245 ns |  7,477.0 ns |  1.6174 | 0.0381 |     - |   10192 B |
|  ByteK | 1000 | 18,044.9 ns |   347.597 ns |   325.142 ns | 18,233.2 ns |  2.5330 | 0.0610 |     - |   16064 B |
| ShortK | 1000 | 29,562.3 ns |   186.980 ns |   156.137 ns | 29,617.7 ns | 11.5967 | 1.6479 |     - |   73168 B |
|   IntK | 1000 | 26,968.4 ns |   531.823 ns |   958.986 ns | 26,501.7 ns | 11.5967 | 1.6479 |     - |   73168 B |
|  LongK | 1000 | 30,206.0 ns |   600.060 ns | 1,170.372 ns | 29,567.4 ns | 16.1743 | 2.6855 |     - |  102216 B |
|  Str2K | 1000 | 47,720.6 ns |   215.247 ns |   201.342 ns | 47,639.7 ns | 16.1743 | 2.6855 |     - |  102216 B |
|  Str8K | 1000 | 59,098.2 ns | 1,325.516 ns | 1,901.015 ns | 58,183.4 ns | 16.1743 | 2.6855 |     - |  102216 B |
| Str32K | 1000 | 76,835.0 ns |   383.037 ns |   339.552 ns | 76,724.6 ns | 16.1133 | 2.5635 |     - |  102216 B |

Some of the results do not make sense at first. For instance, why is short always slower than int? I can’t think of a clear reason, but the performance effect is definitely consistent across all the cases.

Other non-obvious results seem strange but can be explained more readily. Take for example the fact that the allocation sizes are essentially equivalent for all the integral types excluding long. This I am pretty sure is due to memory alignment requirements. If you look at the internals of Dictionary you can see an entry struct that holds each key/value pair, defined roughly like this:

        struct Entry
        {
            public int next;
            public uint hashCode;
            public TKey key;
            public TValue value;
        }

In the benchmark, TValue is always int, and TKey varies in size (K = 1, 2, 4, or 8 bytes). Naïvely, we would then expect the struct to be (4 + 4 + K + 4) = 12 + K bytes long. In reality, the CLR has alignment requirements for these built-in types, specifically that 4-byte values have 4-byte alignment (i.e. stored at a memory address divisible by 4). For all the ints to be aligned, the layout would need to be:

         +---------------------------------+
         | 0 1 2 3 4 5 6 7 8 9 A B C D E F |
         +-^-------^-------^-------^-------+
---------+ |       |       |       |
int next  _|       |       |       |
uint hashCode  ____|       |       |
TKey key  _________________|       |
int value  ________________________|
---------+

Even if TKey is byte, the following int value field must land on the next multiple of 4 address, giving us a minimum size requirement of 16 for the struct. That being said, the N=1000 case for Dictionary<byte, int> uses far less memory than the others, which I can’t explain yet (perhaps something to do with the resizing strategy and how the entries are laid out?).

The rest of the results are fairly mundane. We can see that long and string both have the same size characteristics. This is because they are both 8 bytes on 64-bit architectures. In fact, a field of any reference type, string included, will be 8-bytes, no matter how big the actual “pointed to” memory is. (Remember that we pre-allocated the keys in our benchmark, so we only pay for the reference, and not the string data itself.)

Furthermore, longer string keys cost more in CPU time. This is because the GetHashCode implementation for string considers all characters of the string when hashing, making it roughly O(N) for input size N.

This is enough performance goodness to chew on for now. Next, we’ll explore the implications of using custom data types for keys.

One thought on “TKey performance indicators: primitives

  1. Pingback: TKey performance indicators: custom types – WriteAsync .NET

Leave a Reply

Your email address will not be published. Required fields are marked *