
BenchmarkDotNet in Practice: Measuring What Actually Matters
Most .NET developers have heard of BenchmarkDotNet. Fewer have used it on something that matters. And almost nobody talks about the part that actually changes how you think about performance: designing benchmarks that answer the right question.
I spent several weeks optimizing a poker hand evaluator — a pure C# reimplementation of Cactus Kev’s classic algorithm — until it was running neck-and-neck with a native C++ build on the same hardware. BenchmarkDotNet was the instrument panel for that entire journey. This post isn’t about poker. It’s about what I learned structuring benchmarks so the numbers actually mean something.
The Problem With “How Fast Is It?”
When I first started benchmarking the evaluator, I wrote one method, slapped [Benchmark] on it, and ran it. The number came back. Great. But what did it tell me?
Nothing useful, as it turned out. The method was doing too many things: shuffling cards, evaluating hands, building UI-ready result objects, allocating arrays. When I optimized the evaluation math and the number barely moved, I knew the benchmark was measuring the wrong thing. The signal was buried in noise.
That’s the first lesson: a single benchmark that exercises your entire pipeline will tell you your pipeline’s speed. It won’t tell you where the time goes. And if you can’t answer that question, you can’t optimize anything.
Designing Three Tiers of Measurement
I restructured the benchmarks into three distinct methods, each answering a different question:
Tier 1 — The Core Engine (What’s the theoretical ceiling?)
This benchmark strips everything down to the raw math. Nine players, seven cards each, 21 five-card combinations per player — evaluated using integer values only, with zero allocations.
[Benchmark(Description = "Optimized core evaluator (max throughput)")]
public int EngineOnly_SevenCardBestOf21_9Players_ValuesOnly_NoAllocs()
{
    // Board cards sit at indices 18-22, after 9 players x 2 hole cards.
    int b0 = _shuffled[18].Value, b1 = _shuffled[19].Value,
        b2 = _shuffled[20].Value, b3 = _shuffled[21].Value,
        b4 = _shuffled[22].Value;
    var perm = PokerLib.Perm7Indices;
    int acc = 0;
    for (int p = 0; p < 9; p++)
    {
        ushort best = ushort.MaxValue;
        Span<int> sevenVals = stackalloc int[7];   // scratch buffer on the stack
        sevenVals[0] = _shuffled[p].Value;
        sevenVals[1] = _shuffled[p + 9].Value;
        sevenVals[2] = b0; sevenVals[3] = b1;
        sevenVals[4] = b2; sevenVals[5] = b3;
        sevenVals[6] = b4;
        for (int row = 0; row < 21; row++)
        {
            int i = row * 5;
            ushort v = PokerLib.Eval5CardsFast(
                sevenVals[perm[i]], sevenVals[perm[i + 1]],
                sevenVals[perm[i + 2]], sevenVals[perm[i + 3]],
                sevenVals[perm[i + 4]]);
            if (v < best) best = v;
        }
        acc ^= best;   // fold each result in so the JIT cannot discard the work
    }
    return acc;
}
The stackalloc keeps everything on the stack. The ReadOnlySpan<byte> permutation table is a flattened, zero-allocation lookup. Folding each best value into acc and returning the accumulator prevents the JIT from optimizing away the entire computation — dead code elimination is real, and BenchmarkDotNet won’t save you from it if the return value is unused.
Result: 900 ns per operation. Zero Gen0 collections.
That’s the ceiling. If the evaluator itself is costing 900 nanoseconds, and any downstream benchmark shows 2,000 ns, I know exactly where the other 1,100 ns is going.
Tier 2 — The Full Pipeline (What does the user actually experience?)
This benchmark runs the same evaluation through EvalEngine.EvaluateRiverNinePlayersArrays — the exact method the web application calls. It includes scoring, ranking, and building the Card[][] arrays that the UI renders.
[Benchmark(Description = "Full 9-player evaluation: what the webapp uses")]
public int EndToEnd_EvalEngine_IncludeBestHands()
{
    var (scores, ranks, bestHands) =
        EvalEngine.EvaluateRiverNinePlayersArrays(
            _shuffledArray, includeBestHands: true);
    int acc = 0;
    for (int i = 0; i < 9; i++)
    {
        acc ^= scores[i];
        acc ^= ranks[i];
        acc ^= bestHands[i][0].Value;
    }
    return acc;
}
Result: 1,204 ns per operation. 784 bytes allocated.
The delta between Tier 1 and Tier 2 is roughly 300 ns and 784 bytes. That’s the cost of building UI-ready result objects. For a web app serving poker hands, that’s completely acceptable. But now I know the exact price I’m paying for developer ergonomics over raw speed, and I can make that tradeoff consciously.
Tier 3 — Parallel Throughput (How does it scale across cores?)
This benchmark throws 10 million hands at Parallel.For in batches of 64 to test multi-core scalability on a 14-core/28-thread i9-9940X.
[Benchmark(Description = "Throughput: Parallel.For batched (values-only)")]
public int Parallel_Batched_ValuesOnly()
{
    const int N = 10_000_000;
    const int Batch = 64;
    int groups = (N + Batch - 1) / Batch;   // ceiling division into batches
    // ... board cards, perm table setup ...
    int global = 0;
    Parallel.For(0, groups,
        () => 0,                            // thread-local accumulator seed
        (g, _, local) =>
        {
            int start = g * Batch;
            int end = Math.Min(start + Batch, N);
            int sum = 0;
            for (int iter = start; iter < end; iter++)
            {
                for (int p = 0; p < 9; p++)
                {
                    Span<int> seven = stackalloc int[7];
                    // ... fill and evaluate ...
                    sum += best;            // 'best' comes from the elided evaluation
                }
            }
            return local + sum;             // no shared writes inside the loop
        },
        local => Interlocked.Add(ref global, local));
    return global;
}
Result: ~647 ms for 10 million 9-player evaluations. 10,856 bytes total allocation.
The near-zero allocation count across 10 million iterations confirms that the stackalloc pattern holds under parallel execution. The Interlocked.Add on the final accumulator keeps contention minimal — each thread works independently on its batch and only synchronizes once at the end.
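This batching shape generalizes well beyond poker. Here is a minimal, self-contained sketch of the same pattern summing integers instead of evaluating hands (`ParallelBatchedSum` is an illustrative name, not from the project):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BatchedParallel
{
    // Same shape as the benchmark above: each worker accumulates a
    // thread-local partial sum inside its batch, then performs exactly
    // one Interlocked.Add when the thread finishes.
    public static int ParallelBatchedSum(int n, int batch)
    {
        int groups = (n + batch - 1) / batch;   // ceiling division
        int global = 0;
        Parallel.For(0, groups,
            () => 0,                            // thread-local seed
            (g, _, local) =>
            {
                int start = g * batch;
                int end = Math.Min(start + batch, n);
                int sum = 0;
                for (int i = start; i < end; i++)
                    sum += i;                   // stand-in for real work
                return local + sum;             // no shared writes in the loop
            },
            local => Interlocked.Add(ref global, local)); // one sync per thread
        return global;
    }
}
```

The design point is where the synchronization lives: contention is pushed entirely out of the hot loop and into a single atomic add per thread.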
The Setup That Makes It Reproducible
None of those numbers mean anything without a deterministic test fixture. BenchmarkDotNet will happily give you precise measurements of a random scenario that changes every run.
The [Params] attribute captures a specific shuffled deck as a pipe-delimited string of card IDs:
[Params("30|8|12|19|23|27|31|48|16|26|35|47|51|...")]
public string CardIds { get; set; } = string.Empty;
The [GlobalSetup] restores that exact deal before any measurements begin:
[GlobalSetup]
public async Task Setup()
{
    _orderedDeck = (await _deckService.RawDeckAsync()).ToList();
    _shuffled = RestoreShuffledFromIds(_orderedDeck, CardIds);
    _shuffledArray = _shuffled.ToArray();
}
Every run evaluates the same cards in the same order. If I change the evaluator and the number moves, I know it’s because of my code, not because the RNG dealt a different hand. You can uncomment additional deck configurations in the [Params] attribute to average across multiple board textures, but for optimization work, one deterministic scenario with known characteristics (the current deck produces a straight flush) is more useful than averaging across randomness.
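The post doesn’t show `RestoreShuffledFromIds`, but the idea is simple enough to sketch. This is a hypothetical implementation, assuming each card exposes an integer `Id` matching the pipe-delimited values (the `Card` record here is a minimal stand-in, not the project’s type):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record Card(int Id, int Value);   // minimal stand-in type

static class DeckRestore
{
    // Hypothetical sketch, not the project's actual implementation:
    // map each pipe-delimited id back to its card in the ordered deck,
    // reproducing the exact shuffled order every run.
    public static List<Card> RestoreShuffledFromIds(
        IReadOnlyList<Card> orderedDeck, string cardIds)
    {
        var byId = orderedDeck.ToDictionary(c => c.Id);
        return cardIds.Split('|', StringSplitOptions.RemoveEmptyEntries)
                      .Select(s => byId[int.Parse(s)])
                      .ToList();
    }
}
```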
Cross-Language Comparison: C# vs C++
The other half of this project was answering the question every .NET performance engineer eventually faces: how close can managed code get to native?
I ran the same evaluation algorithm as a native C++ build (bwedding/PokerEvalMultiThread) compiled with MSVC /O2 and AVX2 on the same i9-9940X. Both implementations use the same underlying Cactus Kev algorithm and were validated against the same hand distribution checksums.
The 7-card evaluation results:
| Implementation | Measured | Per evaluation |
|---|---|---|
| C++ (MSVC /O2 AVX2) | ~188–191M hands/sec | ~5.3 ns |
| C# (.NET 10, values-only) | ~900 ns per 9-player op | ~5.3 ns derived per 5-card eval |
| C# (.NET 10, full pipeline) | ~1,204 ns per 9-player op | ~6.4 ns derived |
When you normalize the C# values-only benchmark down to per-five-card-evaluation, the managed code is effectively on par with native C++. That result surprised me. Not because .NET is fast — I knew that — but because the gap between “fast” and “native” has narrowed to the point where the measurement uncertainty is larger than the performance difference.
The key techniques that closed the gap: Span<T> and stackalloc for zero-allocation inner loops, ReadOnlySpan<byte> for the permutation table, aggressive method inlining via [MethodImpl(MethodImplOptions.AggressiveInlining)], and eliminating all List<T> usage from the hot path.
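Those techniques compose into a recognizable pattern. A minimal sketch under toy assumptions (the table, method, and `HotPath` class here are illustrative, not the real evaluator):

```csharp
using System;
using System.Runtime.CompilerServices;

static class HotPath
{
    // A ReadOnlySpan<byte> property over a byte[] literal is emitted as
    // static data in the binary by the C# compiler: no heap allocation,
    // no runtime initialization.
    static ReadOnlySpan<byte> Perm => new byte[] { 6, 5, 4, 3, 2, 1, 0 };

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int SumPermuted(ReadOnlySpan<int> values)
    {
        Span<int> scratch = stackalloc int[7];   // scratch lives on the stack
        for (int i = 0; i < 7; i++)
            scratch[i] = values[Perm[i]];        // table-driven reordering
        int acc = 0;
        for (int i = 0; i < 7; i++)
            acc += scratch[i];
        return acc;
    }
}
```

Everything the method touches is either a stack buffer, a span over caller memory, or compiler-baked constant data, which is why a loop built this way can report zero allocations under [MemoryDiagnoser].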
What MemoryDiagnoser Actually Tells You
Adding [MemoryDiagnoser] to the benchmark class is one line of code that fundamentally changes what you can see:
[MemoryDiagnoser]
public class FinalRiverBench { ... }
The results table now includes Gen0 and Allocated columns. For the values-only benchmark, both read zero — confirming the inner loop is truly allocation-free. For the full pipeline benchmark, you see 0.0763 Gen0 collections per 1,000 operations and 784 bytes allocated. That’s the Card[][] best-hands array being constructed for the UI.
This distinction matters. A benchmark that shows “fast” but triggers Gen0 collections is hiding latency spikes that will surface under production load. The allocation data tells you whether your “fast” is consistently fast or just fast on average with occasional GC pauses.
Practical Takeaways
If you’re setting up BenchmarkDotNet for the first time on a real project, here’s what I’d suggest based on this experience:
Isolate the layers. Don’t start by benchmarking your API endpoint. Benchmark the computation first, then the pipeline, then the endpoint. Each layer answers a different question, and the deltas between them tell you where your time and memory are going.
Use deterministic inputs. The [Params] attribute with serialized test data gives you repeatable results. Random inputs average out the interesting cases. When you’re optimizing, you want to see the same case get faster, not a different case get lucky.
Prevent dead code elimination. Always use the result. XOR-fold it, return it, accumulate it — whatever it takes to ensure the JIT doesn’t realize your benchmark does nothing. BenchmarkDotNet’s baseline-subtraction handles overhead, but it can’t detect that RyuJIT optimized your entire method into a constant.
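The XOR-fold guard in a nutshell (`Evaluate` here is a hypothetical stand-in for whatever your benchmark actually calls):

```csharp
using System;

static class DceGuard
{
    static int Evaluate(int x) => x * 31 % 7919;   // stand-in for real work

    // Every iteration's result feeds acc, and acc escapes via the return
    // value, so the JIT cannot prove the loop dead and delete it. Drop the
    // return (or never read acc) and the whole loop becomes eligible for
    // elimination.
    public static int BenchBody(int n)
    {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc ^= Evaluate(i);
        return acc;
    }
}
```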
Turn on MemoryDiagnoser. Always. The allocation column is worth more than the timing column for most optimization decisions. A 10% speed improvement that doubles allocations is usually a regression in production.
Same hardware, same conditions. BenchmarkDotNet automatically sets the power plan to High Performance during runs and reverts it afterward. But if you’re comparing across languages, run both on the same machine, same day, same thermal conditions. My C# vs C++ comparison would be meaningless if I’d run them on different CPUs.
The full benchmark code and results live at:
PokerBenchmarks repository
The evaluator itself is live at poker-calculator.johnbelthoff.com.
All benchmarks were executed on Windows 10 (22H2) with an Intel Core i9-9940X (14 cores / 28 threads), .NET 10.0.0, BenchmarkDotNet v0.15.6, High Performance power plan.