Search optimization journey 3: Optimize norm gathering
We recently won the search benchmark game across query and collection types. Part of the work involved hand-tuned AVX2 code. That was fine while we only shipped x86_64 binaries. Then we decided to add linux/aarch64 releases (Docker, deb, tar.gz) so people could run SereneDB on ARM machines and Apple Silicon through Docker.
Most of the AVX2 code either compiled away behind #ifdef or had straightforward portable alternatives. One piece of code made us pause: the norm gathering loop. We weren't sure whether the portable version would regress, so we built a microbenchmark. What we found was surprising.
What are norms and where do they show up
BM25 normalizes term frequency by document length. The document length is stored per-document as a "norm" value and read during scoring.
As described in the block scoring post, IResearch scores documents in blocks of 128. For each block the engine reads 128 norm values from columnar storage.
Unlike Lucene and Tantivy, which quantize norms to a single byte (losing ~10% retrieval quality in older Lucene versions), IResearch stores norms at full precision: 1, 2, or 4 bytes depending on the maximum document length in the segment. We also support different scoring models (BM25 with tunable k1/b parameters, TFIDF with norms) through a pluggable scoring framework.
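For reference, this is where the document length enters the standard BM25 term score (textbook form, with k1 and b the tunable parameters mentioned above):

```latex
\operatorname{score}(q, d) =
  \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, d)\,(k_1 + 1)}
       {f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}
```

The per-document length |d|, taken relative to the average document length avgdl, is exactly the norm value read during scoring; b controls how strongly it discounts term frequency.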
The problem
Here's the core loop. For each document in a block of 128, read its norm value from a byte array at an offset derived from the document ID:
for (scores_size_t i = 0; i != kPostingBlock; ++i) {
  values[i] = ReadValue(origin, docs[i] - doc_base);
}

ReadValue loads 1, 2, or 4 bytes depending on the encoding. The issue is that docs[i] - doc_base gives a different offset each iteration. Each load goes to a different memory location. This is a gather pattern.
The compiler will not auto-vectorize a gather. Even with #pragma clang loop vectorize(enable), clang refuses to emit vpgatherdd. More on this below.
We had an AVX2 version using explicit gather intrinsics:
auto indices = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(docs + i));
indices = _mm256_sub_epi32(indices, base);
auto gathered = _mm256_i32gather_epi32(
    reinterpret_cast<const int*>(origin), indices,
    std::to_underlying(Encoding));
gathered = _mm256_and_si256(gathered, mask);
_mm256_storeu_si256(reinterpret_cast<__m256i*>(values + i), gathered);
This works on x86 with AVX2. On aarch64, NEON has no gather instructions at all (only SVE does). We wanted to know: can we do better than both the scalar loop and the AVX2 gather, portably?
The key insight: not all blocks are sparse
In posting lists, document IDs within a block are sorted. In many real workloads a significant fraction of blocks contain contiguous IDs: [1100, 1101, 1102, ..., 1227]. For a contiguous block, origin[first], origin[first+1], ..., origin[first+127] is a sequential memory scan. The compiler will happily auto-vectorize that into wide SIMD loads on any architecture.
The check is cheap:
if (docs[kPostingBlock - 1] - docs[0] == kPostingBlock - 1) {
  // contiguous: sequential read
} else {
  // sparse: individual loads
}
No need to subtract doc_base for the comparison. If the first and last doc differ by exactly 127, the block is contiguous.
Benchmarking
We wrote a microbenchmark testing three variants (Scalar, Hybrid with the contiguous check, Gather with AVX2 intrinsics) across three scenarios:
- Dense: contiguous doc IDs (best case for hybrid)
- Sparse: sorted with random gaps of 1-20
- Mixed: 50/50 random choice between dense and sparse blocks
All benchmarks run on an AMD Ryzen 9 9950X (Zen 5, 16 cores, 48 KiB L1d per core, 1 MiB L2, 32 MiB L3), compiled with clang 21 targeting at least Haswell (AVX2). Haswell came out in 2013, so this is baseline for any x86_64 machine you'd actually run a search engine on today.
For the mixed case we pre-generate 60,000 random choices. This matters. Daniel Lemire recently measured that AMD Zen 5 can learn a repeating branch pattern roughly 30,000 entries long, Apple M4 about 10,000, and Intel chips around 5,000. At 60,000 random entries, no CPU can learn the pattern, so we measure real misprediction cost.
One caveat: the benchmark data fits in L1/L2 cache. In production with large segments, cache misses would affect all variants equally. What we're measuring here is pure computational throughput.
First results
The hybrid approach dominates dense blocks: 2.7 ns vs 25.7 ns, about 9x faster. The compiler turns the sequential loop into wide vmovdqu + vpmovzxwd loads. One memory access replaces eight.
AVX2 gather is ~24 ns regardless of pattern. It barely beats scalar. Each vpgatherdd instruction looks like one operation, but internally the gather unit still issues 8 individual micro-loads. The instruction count drops 3.5x compared to scalar, but the gather unit is the bottleneck.
For mixed blocks, hybrid averages ~17.8 ns, only ~3 ns above the 14.4 ns midpoint of the pure dense and sparse times: the branch misprediction penalty is small.
Looking at the assembly
We expected the scalar loop to compile into simple scalar code. It didn't. Clang auto-vectorizes the scalar loop into an extract/insert pattern whether you ask for it or not.
The compiler vectorizes the index subtraction (vpaddd), extracts each index to a scalar register (vpextrd), does 8 individual loads, and reassembles the result into a vector (vpinsrw). The insert chain is serial: each vpinsrw depends on the previous one.
We tried #pragma clang loop vectorize(enable) interleave(enable). Same codegen, same performance. The compiler already did everything it could. It just won't emit vpgatherdd on its own.
To inspect assembly we used the C++ Compiler Explorer VS Code extension. It picks up compile_commands.json from your build directory, so you see the assembly your actual compiler produces with your actual flags and dependencies. Much more useful than Godbolt for code that depends on project headers.
Full unrolling changes things
The scalar loop runs 16 iterations (128 elements / 8 per vectorized iteration). Each iteration has an 8-instruction insert chain that must execute serially. What if we removed the loop so the CPU could overlap independent chains?
#pragma clang loop unroll(full)
for (scores_size_t i = 0; i != kPostingBlock; ++i) {
  values[i] = ScalarRead<Encoding>(origin, docs[i] - doc_base);
}
This made a real difference on sparse data:
| Variant | Dense (ns) | Sparse (ns) | Mixed (ns) |
|---|---|---|---|
| Scalar | 25.7 | 25.7 | 26.6 |
| Scalar + unroll(full) | 22.8 | 22.5 | 23.5 |
| Gather (AVX2) | 23.5 | 23.4 | 24.1 |
Full unrolling beats AVX2 gather on Short encoding. With 16 independent insert chains visible at once, the CPU overlaps their execution through the out-of-order engine. Gather can't benefit from this because vpgatherdd is a single instruction that serializes internally.
The trap: don't unroll the dense path
If unrolling helps sparse, why not unroll both paths in the hybrid?
| Variant | Dense (ns) | Sparse (ns) |
|---|---|---|
| Hybrid (no unroll) | 2.7 | 26.0 |
| Hybrid (unroll sparse) | 2.7 | 18.9 |
| Hybrid (unroll both) | 13.5 | 22.5 |
Unrolling the dense path makes it 5x slower: 13.5 ns vs 2.7 ns.
Without unrolling, the compiler recognizes the sequential access pattern and emits vpmovzxwd instructions that load 8 consecutive shorts and zero-extend them to 32-bit in a single operation:
; non-unrolled dense path: load+widen 8 elements at once
vpmovzxwd (%rdi,%rcx,2), %ymm0 ; 8 shorts -> 8 dwords
vpmovzxwd 16(%rdi,%rcx,2), %ymm1 ; next 8
vpmovzxwd 32(%rdi,%rcx,2), %ymm2 ; next 8
vpmovzxwd 48(%rdi,%rcx,2), %ymm3 ; next 8
vmovdqa %ymm0, 128(%r9) ; store 8 dwords
vmovdqa %ymm1, 160(%r9)
vmovdqa %ymm2, 192(%r9)
vmovdqa %ymm3, 224(%r9)
With unroll(full), the compiler sees 128 individual accesses with constant offsets and gives up on vectorization entirely. It generates 128 scalar movzwl (load one short, zero-extend) + movl (store one dword) pairs:
; unrolled dense path: one element at a time
leal -1000(%rax), %edx
movzwl (%rdi,%rdx,2), %edx ; load 1 short
movl %edx, 128(%r9) ; store 1 dword
leal -999(%rax), %ecx
movzwl (%rdi,%rcx,2), %ecx ; load 1 short
movl %ecx, 132(%r9) ; store 1 dword
; ... 126 more times
Same data, same addresses. The compiler just lost the forest for the trees.
The final solution
Unroll the sparse path. Leave the dense path alone. No intrinsics needed.
| Benchmark | Scalar (ns) | Hybrid (ns) | HybridUS (ns) | HybridUB (ns) | Gather (ns) |
|---|---|---|---|---|---|
| Dense | 25.7 | 2.7 | 2.7 | 13.5 | 23.5 |
| Sparse | 25.7 | 26.0 | 18.9 | 22.5 | 23.4 |
| Mixed | 26.6 | 17.8 | 13.8 | 20.7 | 24.1 |
HybridUS (unroll sparse only) matches or beats every other variant in every row; HybridUB (unroll both) confirms the dense-path regression. Here is the production code:
void GetPostingBlock(
    std::span<const doc_id_t, kPostingBlock> docs,
    std::span<uint32_t, kPostingBlock> values) noexcept final {
  const auto* IRS_RESTRICT const origin = _origin;
  auto* IRS_RESTRICT const values_data = values.data();
  const auto* IRS_RESTRICT const docs_data = docs.data();
  if (docs_data[kPostingBlock - 1] - docs_data[0] == kPostingBlock - 1) {
    const auto first = docs_data[0] - _doc_base;
    for (scores_size_t i = 0; i != kPostingBlock; ++i) {
      values_data[i] = ReadValue(origin, first + i);
    }
  } else {
    const auto base = _doc_base;
#pragma clang loop unroll(full)
    for (scores_size_t i = 0; i != kPostingBlock; ++i) {
      values_data[i] = ReadValue(origin, docs_data[i] - base);
    }
  }
}
Zero platform-specific intrinsics. Works on x86, aarch64, wherever clang runs. Faster than hand-written AVX2 gather in every scenario we tested.
Takeaways
Sometimes the compiler knows better than you. The dense path auto-vectorizes into perfect SIMD code. Trying to help with unrolling actively hurts.
Sometimes you need to help the compiler. The sparse path has a serial dependency chain that the compiler can't break. Full unrolling exposes enough independent work for the out-of-order engine to overlap.
The combination gives us the best of both worlds. And we got to delete the AVX2 intrinsics.
Microbenchmarking is tricky to get right. Our data fits in L2, the branch predictor sees a fixed pattern length, and we're measuring one function in isolation. If you think we made a mistake or measured something unfairly, open an issue — we're happy to discuss it or fix it. The benchmark source is in the repo.
Other articles in the Search optimization journey series:
- Benchmark overview
- Search optimization journey 1: Collecting top-K candidates
- Search optimization journey 2: Block scoring
- Search optimization journey 4: When is it good to be lazy? (coming soon)
- Search optimization journey 5: Adaptive posting format (coming soon)
If you find this work interesting, we would appreciate your support: a star on GitHub goes a long way for an early-stage project.