Search optimization journey 2: Block scoring
In a recent benchmark, SereneDB's search engine IResearch outperformed established search engines across query and collection types. This post is part of the Search optimization journey series, a set of technical reports examining the implementation decisions behind those results.
We focus here on block scoring and SIMD vectorization — a family of techniques that allows score evaluation (BM25, TF-IDF, etc.) to be restructured from a scalar per-document computation into a data-parallel operation over fixed-width blocks of candidate documents, exploiting the wide SIMD registers available in modern CPUs. We describe the design, discuss the tradeoffs involved, and analyze the performance impact as observed in the benchmark.
IResearch supports several query execution strategies, each optimized for a different shape of query. Before diving into the strategies themselves, it is worth understanding the scoring model they all share.
Block-at-a-time scoring
There are two classical scoring models. In term-at-a-time (TAAT) evaluation, the engine iterates over each term's posting list in full, accumulating partial scores into an array indexed by document ID. TAAT can be fast when posting lists are long and sequential access is cheap, but it requires an accumulator array proportional to the index size and must touch every document that contains at least one query term. There is no way to skip low-scoring documents early or to stop before exhausting all postings.
The more common alternative, and the model used by both Lucene and Tantivy, is document-at-a-time (DAAT). For each candidate document, the engine resolves a virtual call to the scorer, retrieves term frequencies one by one, looks up the document norm, computes a partial score per query term and accumulates the result. DAAT enables early termination and threshold-based pruning, but it evaluates one document at a time and leaves significant CPU performance on the table.
IResearch takes a different approach, one borrowed from analytical query execution engines. Rather than invoking the score function per document, IResearch organises all scoring arguments into column buffers (one column per argument, N rows deep, where N is the block size) and applies the score function to the entire block at once. The output is written into a result column of the same depth.
The arguments to the score function are whatever the scoring model requires. For BM25 these are term frequencies (one column per query term, sourced from posting list decompression), document norms (sourced from columnar storage) and collection-level IDF values (constants, broadcast across the block). But the same model accommodates richer signals: per-document boost factors, per-document IDF values computed at index time and potentially custom authority signals like PageRank, click-through rates and recency weights stored as additional columns in columnar storage. Each of these is just another column in the input block. The score function consumes all of them in a single pass.
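As a concrete picture, the input to the score function can be sketched as a plain struct-of-arrays, one column per argument. The names and layout below are illustrative, not IResearch's actual types:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative struct-of-arrays layout for one scoring block of a
// two-term BM25 query: one contiguous column per argument, kBlockSize
// rows deep. Per-term IDF values are scalars broadcast across the block.
constexpr std::size_t kBlockSize = 128;

struct ScoreBlock {
  std::array<std::uint32_t, kBlockSize> freq_term0{};  // from posting list decompression
  std::array<std::uint32_t, kBlockSize> freq_term1{};  // from posting list decompression
  std::array<float, kBlockSize> norm{};                // sequential scan of columnar storage
  std::array<float, kBlockSize> score{};               // output column, same depth
};
```

Any additional signal — a boost factor, an authority score, a recency weight — is simply one more column in this struct, consumed by the score function in the same pass.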
This design has three concrete benefits over DAAT evaluation.
First, vectorized computation. A scoring loop that operates over contiguous, fixed-stride column buffers is exactly the pattern that enables compiler auto-vectorization. IResearch structures its scoring loops to have no data-dependent branches and no pointer aliasing, which allows the compiler to generate SIMD instructions automatically across SSE4.2, AVX2 and ARM NEON without hand-written intrinsics.
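A sketch of what such a loop can look like (illustrative, not IResearch's actual code): the inputs are contiguous arrays, the trip count is fixed, the body has no branches, and `__restrict` tells the compiler the buffers do not alias, which is what lets GCC and Clang emit SIMD code for the whole loop without intrinsics. The norm column here is assumed to hold the precomputed BM25 quantity k1·(1 − b + b·len/avg_len):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 128;

// Block BM25 term scoring shaped for auto-vectorization: fixed stride,
// no data-dependent branches, no pointer aliasing between buffers.
void score_bm25_block(const std::uint32_t* __restrict freq,
                      const float* __restrict norm,  // precomputed k1*(1-b+b*len/avg_len)
                      float idf,                     // per-term constant, broadcast
                      float k1,
                      float* __restrict out) {
  for (std::size_t i = 0; i < kBlockSize; ++i) {
    const float tf = static_cast<float>(freq[i]);
    // BM25 term contribution: idf * tf*(k1+1) / (tf + norm_i)
    out[i] = idf * (tf * (k1 + 1.0f)) / (tf + norm[i]);
  }
}

// Tiny self-check: score one block of constant inputs, return the first score.
float demo_first_score() {
  std::uint32_t freq[kBlockSize];
  float norm[kBlockSize];
  float out[kBlockSize];
  for (std::size_t i = 0; i < kBlockSize; ++i) { freq[i] = 2; norm[i] = 1.2f; }
  score_bm25_block(freq, norm, /*idf=*/1.0f, /*k1=*/1.2f, out);
  return out[0];  // 1.0 * 2*2.2 / (2 + 1.2) = 1.375
}
```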
Second, negligible virtual call overhead. In a DAAT engine, the virtual dispatch to the scorer is on the critical path, invoked once per document per query term. Devirtualizing this call is an ongoing engineering challenge in both Lucene and Tantivy. In a block-at-a-time engine the virtual call is amortized across the entire block: one call processes N documents, so the per-document cost of the dispatch rounds to zero.
Third, no need to quantize norms. Lucene and Tantivy encode document length as a single quantized byte per document, trading scoring accuracy for scoring performance. Because block-at-a-time scoring reads norms as a sequential columnar scan, IResearch can store norms at full precision without a per-document random access penalty.
Everything comes with a trade-off. Block-at-a-time scoring requires that the arguments to the score function be available in column form before scoring begins. This is straightforward for greedy (disjunctive) queries, where posting list blocks are decompressed eagerly. It is more complex for lazy (conjunctive and max score) queries, where not all arguments are known upfront. A naive block-at-a-time scorer applied uniformly would either overscore, attributing non-zero contribution to non-matching terms in a conjunction, or underscore, computing partial scores before all matching terms for a document have been found. Getting this right requires different scoring strategies for different query types, which is exactly what the following sections describe.
Execution strategies
The three strategies below are all built on top of the block-at-a-time scoring pipeline. What differs between them is how the input block is populated: which documents enter it, which argument columns are filled and at what point in query execution scoring is triggered.
Conjunction Query — Late Scoring
The defining challenge of a conjunctive query is finding documents that match all query terms. On a query like "engine" AND "search" AND "index", the engine must intersect three posting lists that may each contain millions of entries. The naive approach — advancing all iterators one step at a time and checking for overlap — wastes the vast majority of those advances on positions that will never produce a match.
IResearch solves this with a branch-free skip-ahead strategy. The ConjunctionIterator tracks the current minimum position across all term iterators and calls lazy-seek-to-target on each lagging iterator, jumping directly to the next candidate position rather than stepping through intervening entries one by one.
Scoring is deliberately postponed to a separate phase. Phase 1 produces only a buffer of matching document IDs, with no term frequency data collected yet. This is critical: if scoring were triggered eagerly as each posting list is advanced, some term frequencies would be missing for documents not yet confirmed as full matches — producing incorrect scores. By separating intersection from scoring, IResearch ensures that when the score function is invoked, all argument columns are fully populated for every document in the block.
Phase 2 collects term frequencies for the matched documents into dense per-term column buffers. Phase 3 then scores the entire block at once. Norms are read in a single vectorized sequential columnar scan — once per document, shared across all terms and the scoring loop runs over contiguous arrays with fixed-stride access and no data-dependent branches.
AND query — postpone scoring
Phase 1: intersect · Phase 2: collect TFs · Phase 3: score block
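A minimal sketch of the leapfrog intersection behind Phase 1. Types and names are illustrative; a real posting iterator would use skip data in the index rather than `std::lower_bound` over a decoded array:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy posting iterator over a sorted doc ID list with a seek-to-target
// that jumps past intervening entries instead of stepping one by one.
struct PostingIterator {
  const std::vector<std::uint32_t>* docs;
  std::size_t pos = 0;
  static constexpr std::uint32_t kEnd = UINT32_MAX;

  std::uint32_t seek(std::uint32_t target) {
    pos = std::lower_bound(docs->begin() + pos, docs->end(), target) - docs->begin();
    return pos < docs->size() ? (*docs)[pos] : kEnd;
  }
};

// Phase 1: produce only the block of matching doc IDs. Term frequencies
// are collected later (Phase 2) and the block is scored at once (Phase 3).
// Assumes at least one iterator.
std::vector<std::uint32_t> intersect(std::vector<PostingIterator> its) {
  std::vector<std::uint32_t> matches;
  std::uint32_t target = its[0].seek(0);
  while (target != PostingIterator::kEnd) {
    std::uint32_t max_doc = target;
    for (auto& it : its) max_doc = std::max(max_doc, it.seek(target));
    if (max_doc == PostingIterator::kEnd) break;
    if (max_doc == target) {             // all iterators aligned: a match
      matches.push_back(target);
      target = its[0].seek(target + 1);  // advance past the match
    } else {
      target = max_doc;                  // lagging iterators jump ahead
    }
  }
  return matches;
}

// Self-check on three small posting lists; only 3 and 5 appear in all.
std::vector<std::uint32_t> demo_matches() {
  std::vector<std::uint32_t> a{1, 3, 5, 7}, b{3, 5, 9}, c{2, 3, 5, 11};
  return intersect({{&a}, {&b}, {&c}});
}
```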
Disjunction Query — Greedy Block Scoring
A disjunctive query must enumerate every document matching any query term. On a common two-term OR query over English Wikipedia, this can easily mean hundreds of thousands of candidates, every one of which must be scored.
Because every document that appears in any posting list is a valid match, there is no risk of the overscore or underscore problem that affects conjunctions.
IResearch exploits this with a greedy block scoring strategy: each TermScorer decompresses a block of up to 128 document IDs and term frequencies from its posting list in a single pass, then feeds them directly into the scoring pipeline without waiting.
The argument columns (e.g. frequencies and norms) are populated as the block decompresses and scoring runs immediately.
The key performance property is that decompression and vectorized scoring are fused into one pass.
Norms are read once per document per TermScorer.
In the disjunction case each term leaf reads the norm for the documents in its block independently.
This is a deliberate trade-off given that OR queries process each term's posting list separately.
The columnar storage layout ensures these reads are still sequential and cache-friendly within each block.
OR query — greedy block scoring
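A sketch of the greedy path for a single TermScorer. Decompression is faked here with a pre-decoded (doc, freq) list, and the norm column is assumed to hold the precomputed BM25 length term; names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::size_t kBlock = 128;

// Argument and output columns for one term's block.
struct TermBlock {
  std::uint32_t doc[kBlock];
  float tf[kBlock];
  float norm[kBlock];
  float score[kBlock];
  std::size_t size = 0;
};

// Fill one block from the posting list and score it immediately:
// decompression (faked) and scoring fused into a single pass.
void score_term_block(const std::vector<std::pair<std::uint32_t, std::uint32_t>>& postings,
                      std::size_t& cursor,
                      const std::vector<float>& norms,  // columnar norm store
                      float idf, float k1,
                      TermBlock& blk) {
  blk.size = 0;
  while (cursor < postings.size() && blk.size < kBlock) {
    blk.doc[blk.size] = postings[cursor].first;
    blk.tf[blk.size] = static_cast<float>(postings[cursor].second);
    blk.norm[blk.size] = norms[postings[cursor].first];  // sequential within the block
    ++cursor;
    ++blk.size;
  }
  for (std::size_t i = 0; i < blk.size; ++i)
    blk.score[i] = idf * blk.tf[i] * (k1 + 1.0f) / (blk.tf[i] + blk.norm[i]);
}

// Self-check: two postings, constant norms; first score is 2*2.2/3.2 = 1.375.
float demo_first_score() {
  const std::vector<std::pair<std::uint32_t, std::uint32_t>> postings{{0, 2}, {3, 1}};
  const std::vector<float> norms{1.2f, 1.2f, 1.2f, 1.2f};
  std::size_t cursor = 0;
  TermBlock blk;
  score_term_block(postings, cursor, norms, /*idf=*/1.0f, /*k1=*/1.2f, blk);
  return blk.score[0];
}
```

In a full disjunction, each term leaf runs this loop independently and the per-term score columns are merged per document downstream.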
MaxScore Iterator — Combining Greedy and Lazy Scoring
The MaxScore iterator handles queries where some terms are common (essential) and some are rare (non-essential), and where the goal is top-k retrieval rather than exhaustive enumeration. It combines both strategies above under a single iterator that decides per-term which path to take.
Before query execution begins, each term is assigned a maxScore — an upper bound on its possible score (e.g. BM25) contribution to any document.
Terms at or above the current score threshold are essential and processed greedily, exactly as in the OR strategy: blocks are decompressed eagerly and scored immediately.
Terms below the threshold are non-essential and processed lazily: their posting lists are not decompressed upfront.
For non-essential terms, hits are first collected into a block via lazy Seek(doc_id) calls — one per document produced by the essential path.
Only once the hit block is assembled does scoring run, with norms read as a columnar scan over the hit block only.
Documents that miss never populate the argument columns and never reach the score function.
The pruning rule ties both sides together. Before triggering any non-essential seek, the iterator checks whether essentialScore(d) + Σ maxScore(non-essential terms) exceeds the current minimum score in the top-k collector.
If it does not, the entire non-essential path including argument collection, norm read and block scoring is bypassed.
This is the key source of speedup on top-k queries: the score function is never called for documents that cannot make it into the result set.
The norm read pattern in MaxScore is a combination of both earlier strategies:
- essential terms use the sequential columnar scan from the OR strategy;
- non-essential terms use a targeted scan over the hit block only, paying the norm read cost only for documents that actually matched.
MaxScore iterator — WAND + block scoring
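The pruning check itself is a single comparison; a hedged sketch with illustrative names:

```cpp
#include <cassert>
#include <vector>

// MaxScore pruning rule: before any non-essential seek, compare the
// essential score plus the sum of non-essential upper bounds against
// the current top-k floor. If even this optimistic bound cannot beat
// the k-th best score, the whole non-essential path (argument
// collection, norm read, block scoring) is skipped.
bool should_seek_non_essential(float essential_score,
                               const std::vector<float>& non_essential_max_scores,
                               float topk_min_score) {
  float bound = essential_score;
  for (float m : non_essential_max_scores) bound += m;
  return bound > topk_min_score;
}
```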
Summary
| Strategy | Query type | When block is populated | Norm read | Scoring triggered |
|---|---|---|---|---|
| AND — postpone scoring | Conjunction | After full intersection (Phase 3) | Once per doc, shared across all terms | After all matches collected |
| OR — greedy block | Disjunction | Eagerly, as posting list decompresses | Once per doc per TermScorer leaf | Immediately per block |
| MaxScore | Top-k mixed | Essential: eager. Non-essential: on hit | Essential: per scorer. Non-essential: hits only | Essential: immediate. Non-essential: after hit block fills. Pruned docs: never |
All three strategies share the same block-at-a-time scoring pipeline. The differences are entirely in when documents enter that pipeline and how the argument columns get populated.
Inherited benefits of block-at-a-time scoring
Switching from document-at-a-time to block-at-a-time scoring removes two long-standing constraints that have shaped how search engines handle scoring for decades.
No need to quantize norms
BM25 normalizes term frequency by document length. To do this efficiently at query time, search engines store a per-document norm value (an encoding of the document's field length) in the index. In Lucene and Tantivy, this value is quantized into a single byte per document.
The quantization is not accidental. In a DAAT engine, the norm for each candidate document is fetched in a separate random access — one lookup per document. At scale, across millions of candidates, these random accesses dominate memory bandwidth. Compressing norms to one byte reduces the working set size, improves cache utilization and makes the random-access pattern less punishing.
The cost is scoring accuracy. Lucene 6 and earlier approximated the inverse square root of the document length, introducing quantization error for virtually every document length. The impact was measured: in a well-known reproducibility study, the author found approximately a 10% difference in retrieval quality between Lucene's quantized BM25 and an exact implementation on the same data — a gap far too large to dismiss as noise. Lucene 7 introduced a more careful encoding that retains the four most significant bits of large values — a quarter-precision format — which reduced but did not eliminate the error. Tantivy inherits a similar constraint.
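To see the shape of the trade-off, here is a simplified "keep the top four significant bits" encoder. This is not Lucene's exact SmallFloat format, only an illustration of how such quantization discards length information:

```cpp
#include <cassert>
#include <cstdint>

// Keep only the four most significant bits of a length value, zeroing
// the rest. Small values stay exact; large values lose low-order detail.
std::uint32_t quantize_length(std::uint32_t len) {
  if (len < 16) return len;  // fits in 4 bits, exact
  std::uint32_t shift = 0;
  while ((len >> shift) >= 16) ++shift;  // find the top 4 significant bits
  return (len >> shift) << shift;        // zero everything below them
}
```

For a document length of 1000 this encoder returns 960 — a 4% error in the very quantity BM25's length normalization depends on, and every length between 960 and 1023 collapses to the same stored value.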
IResearch does not quantize norms. Because block-at-a-time scoring reads norms as a sequential columnar scan over a contiguous block region, the access pattern is cache-friendly regardless of value precision. There is no random-access penalty to amortize through compression. Norms are stored and read at full precision and BM25 scores reflect the actual document lengths the formula was designed to use.
The practical implication is that IResearch produces BM25 scores that are both faster to compute and more accurate than those produced by engines constrained by the DAAT access model.
Extensible scoring framework
Tantivy ships with a single BM25 scorer. The parameters k1 and b, which control term frequency saturation and length normalization respectively, are hardcoded into the library. Changing them requires recompiling Tantivy from source. There is no mechanism to plug in a different scoring model, inject per-document signals or compose multiple scoring functions at query time.
This is a deliberate simplicity trade-off and it works well for use cases where BM25 with default parameters is sufficient. Different domains might have different optimal parameter values — k1 and b tuned for short news headlines are wrong for long legal documents. Retrieval quality on specialized datasets might require custom scoring that incorporates signals beyond term statistics: document age, authority, click-through rates or application-specific boosts. A fixed scorer makes none of this possible without forking the library.
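For reference, the BM25 term contribution with its two tunable parameters, written as a plain formula rather than any specific engine's API:

```cpp
#include <cassert>

// BM25 contribution of one term to one document.
//   k1 controls term-frequency saturation (higher k1 = slower saturation),
//   b  controls length normalization (0 = none, 1 = full).
float bm25_term_score(float tf, float doc_len, float avg_len,
                      float idf, float k1, float b) {
  const float norm = k1 * (1.0f - b + b * doc_len / avg_len);
  return idf * tf * (k1 + 1.0f) / (tf + norm);
}
```

With b = 0, document length has no effect at all; with b = 0.75 (a common default), a document twice the average length scores visibly lower for the same term frequency — which is exactly the knob one would tune differently for short headlines versus long legal documents.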
IResearch, like Lucene, provides an extensible scoring framework. The score function is a composable component that the host database or application can configure at query time. BM25 parameters are tunable without recompilation. Custom scorers can be registered and invoked through the same block-at-a-time pipeline, receiving the same column buffers (frequencies, norms and any additional columnar signals) as the built-in BM25 implementation. Per-document boost factors and authority signals stored in columnar storage are first-class inputs to the scoring pipeline, not afterthoughts bolted on via post-processing.
Conclusion
DAAT evaluation has been the dominant execution model in information retrieval for decades. It is simple, composable and sufficient for modest query volumes but it is fundamentally constrained. The per-document virtual dispatch, the random-access norm fetch and the scalar accumulation loop are not incidental implementation details: they are structural properties of the model. Optimizing within them yields diminishing returns. Lucene's ongoing efforts to reduce virtual call overhead and Tantivy's single hardcoded BM25 scorer are both responses to the same ceiling.
Block-at-a-time scoring is a different execution model. By restructuring scoring as a function from column buffers to a result buffer, it aligns information retrieval with the execution paradigm of analytical databases. Hardware trends reward that alignment: wide SIMD registers, larger caches and higher memory bandwidth all translate directly into higher scoring throughput without changes to the algorithm.
The secondary benefits like full-precision norms follow from the same structural change and compound the advantage. But the primary claim is simpler: DAAT scoring has reached the limits of what scalar optimization can deliver. Block-at-a-time scoring has not.
Other articles in the Search optimization journey series:
- Benchmark overview
- Search optimization journey 1: Collecting top-K candidates
- Search optimization journey 3: When is it good to be lazy? (coming soon)
- Search optimization journey 4: Adaptive posting format (coming soon)
If you find this work interesting, we would appreciate your support — a star on GitHub goes a long way for an early-stage project.