
Search optimization journey 1: Collecting top-K candidates

In a recent benchmark, SereneDB's search engine IResearch outperformed established search engines across query types and collection modes. This post is part of the Search optimization journey series, a set of technical reports examining the implementation decisions behind those results.

We focus here on top-K candidate collection — the mechanism responsible for identifying the K highest-scoring documents from a potentially large set of matches. While often treated as a secondary concern, the data structure and algorithm used for collection sit on the critical path of every ranked query: every matching candidate must pass through it, and at high match counts it becomes a measurable bottleneck. We describe the classical priority-queue-based approach used by most engines, explain why IResearch takes a different path based on nth_element over a flat vector, and show how that choice integrates naturally with block-based SIMD scoring to produce a compounding performance advantage.

Most search queries are not "give me everything that matches" — they are "give me the best K results." In SQL terms this looks like:

SELECT doc_id, score
FROM documents
WHERE MATCH(body, 'engine search index')
ORDER BY score DESC
LIMIT 10;

The sort condition does not have to be the search score. It can be any stored attribute: a timestamp, a price, a popularity signal or a combination of score and attribute:

SELECT doc_id, published_at, score
FROM documents
WHERE MATCH(body, 'engine search index')
ORDER BY published_at, score DESC
LIMIT 10;

In both cases the engine must evaluate a potentially large set of matching candidates and surface the K best ones according to the sort condition. How that collection is managed has a direct impact on query latency.

Naive approach

The simplest strategy is to collect all matching candidates into an array, sort it and return the first K elements. For N matching documents this costs O(N log N) time and O(N) space. For small collections this is fine, but for queries that match millions of documents on a large corpus, collecting and sorting everything is prohibitively expensive and may not fit in memory. A good example is a multi-term query over Wikipedia, which easily returns several million candidates.
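As a sketch, the naive strategy fits in a few lines of C++. The `Candidate` struct and function name here are illustrative assumptions, not IResearch's actual types:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative candidate type; not IResearch's actual representation.
struct Candidate {
    uint32_t doc_id;
    float score;
};

// Naive top-K: buffer every match, sort the whole array, keep the first K.
// O(N log N) time and O(N) space for N matching documents.
std::vector<Candidate> naive_top_k(std::vector<Candidate> matches, std::size_t k) {
    std::sort(matches.begin(), matches.end(),
              [](const Candidate& a, const Candidate& b) { return a.score > b.score; });
    if (matches.size() > k) matches.resize(k);
    return matches;
}
```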

Priority queue of size K

The traditional solution is to maintain a priority queue (a binary, ternary or higher-arity heap) or a tournament tree of exactly K elements. As each new candidate arrives, it is compared against the current minimum. If it scores above the minimum it displaces the minimum; otherwise it is discarded. The minimum value serves as a running threshold: any candidate that cannot beat it is rejected immediately without being added to the data structure.

This reduces the problem significantly. Insertion into a priority queue of size K costs O(log K), so processing N candidates costs O(N log K) time with O(K) space. Since K is typically small (10, 100, 1000), log K is nearly constant and the threshold grows tighter as better candidates are found — pruning more and more candidates as the query progresses. This is the approach used by Lucene and most production search engines.
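A minimal heap-based collector might look like the following sketch. The `Candidate` type and class shape are hypothetical; production collectors additionally break score ties by document id:

```cpp
#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

// Illustrative candidate type; not an actual engine representation.
struct Candidate {
    uint32_t doc_id;
    float score;
};

// Min-heap of size K: the weakest collected candidate sits on top and acts
// as the running threshold. O(N log K) time, O(K) space for N candidates.
class HeapCollector {
public:
    explicit HeapCollector(std::size_t k) : k_(k) {}

    void collect(Candidate c) {
        if (heap_.size() < k_) {
            heap_.push(c);
        } else if (c.score > heap_.top().score) {
            heap_.pop();   // displace the current minimum
            heap_.push(c);
        }                  // otherwise: rejected without insertion
    }

    // Current threshold: candidates scoring at or below this are pruned.
    float threshold() const {
        return heap_.size() == k_ ? heap_.top().score
                                  : -std::numeric_limits<float>::infinity();
    }

private:
    struct ByScoreDesc {  // yields a min-heap on score
        bool operator()(const Candidate& a, const Candidate& b) const {
            return a.score > b.score;
        }
    };
    std::size_t k_;
    std::priority_queue<Candidate, std::vector<Candidate>, ByScoreDesc> heap_;
};
```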

IResearch: nth_element on a 2K vector

IResearch takes a different approach. Rather than maintaining a heap of size K, it maintains a flat vector of size 2K. Candidates are appended to the vector as they arrive. When the vector reaches 2K elements, IResearch runs nth_element — a partial sort that rearranges the vector so that the element at position K is the one that would be there in a fully sorted array, with all elements before it greater than or equal to it and all elements after it less than or equal to it. This is the median of the 2K collected candidates.

nth_element runs in O(N) average time — linear in the number of elements being partitioned, with no logarithmic factor, because it does not need to produce a fully sorted order. It only needs to find the Kth element and partition around it.

After partitioning, the threshold is updated to the value at position K. The top half of the vector (positions 0 to K−1) is kept; the bottom half is discarded. Collection resumes from position K, appending new candidates that beat the current threshold until the vector fills to 2K again, at which point another nth_element is triggered. Over the full query, the threshold tightens with each partition pass as progressively better candidates are found.
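The mechanism can be sketched as a small C++ collector. This is a simplified illustration of the strategy described above, not IResearch's actual code; the `Candidate` type and class shape are assumptions:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Illustrative candidate type; not IResearch's actual representation.
struct Candidate {
    uint32_t doc_id;
    float score;
};

// 2K-vector collector: append candidates that beat the threshold; when the
// buffer reaches 2K entries, partition around position K with nth_element,
// keep the top K, and tighten the threshold to the score at position K.
class NthElementCollector {
public:
    explicit NthElementCollector(std::size_t k)
        : k_(k), threshold_(-std::numeric_limits<float>::infinity()) {
        buf_.reserve(2 * k);
    }

    void collect(Candidate c) {
        if (c.score <= threshold_) return;  // pruned without any write
        buf_.push_back(c);
        if (buf_.size() == 2 * k_) partition();
    }

    // Extract the final top-K, best first (called once at end of query).
    std::vector<Candidate> take() {
        const std::size_t n = std::min(k_, buf_.size());
        std::partial_sort(buf_.begin(), buf_.begin() + n, buf_.end(), by_score_desc);
        buf_.resize(n);
        return std::move(buf_);
    }

private:
    static bool by_score_desc(const Candidate& a, const Candidate& b) {
        return a.score > b.score;
    }

    void partition() {
        // Afterwards, positions 0..K-1 hold the K best candidates (in
        // arbitrary order) and position K holds the median of the 2K batch.
        std::nth_element(buf_.begin(), buf_.begin() + k_, buf_.end(), by_score_desc);
        threshold_ = buf_[k_].score;  // tightened running threshold
        buf_.resize(k_);              // discard the bottom half
    }

    std::size_t k_;
    float threshold_;
    std::vector<Candidate> buf_;
};
```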

The overall complexity is O(N) average time rather than O(N log K), with O(K) space. In practice the difference is most visible at large N and moderate K — exactly the shape of the broad disjunctive queries, evaluated with algorithms like MaxScore, where IResearch spends most of its time.

Top-K collection strategies

[Interactive animation: a side-by-side comparison of the three strategies: Naive (collect all, sort all; O(N log N)), Heap (maintain top K; O(N log K)) and IResearch (2K vector + nth_element; O(N) avg), plus a SIMD threshold filter applied to incoming blocks of scored documents.]

Why this fits block-based scoring naturally

The 2K vector approach integrates cleanly with the block-at-a-time execution model. Each scoring pass produces a full block of scored candidates rather than scoring one document at a time. Instead of inserting each candidate into the collector individually, IResearch vectorizes the filtering step: the entire block of scores is compared against the current threshold in a single SIMD pass and only candidates that beat the threshold are written into the 2K collection vector. Candidates that fall below the threshold are rejected without any write at all.

This means the threshold check — which in a heap-based engine requires a comparison and a conditional branch per document — becomes a vectorized comparison over a block, with the results used as a write mask. The cost of filtering a block of 128 candidates against the threshold is roughly the same as the cost of filtering a single candidate in a scalar engine, amortized across all 128 positions.
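The mask-and-compact idea can be sketched in portable C++ as a branch-free loop. This is an illustration of the technique, not IResearch's actual kernel, which would use explicit SIMD instructions; the function name and buffer layout are assumptions:

```cpp
#include <cstddef>
#include <cstdint>

// Branch-free threshold filter over a scored block. Each comparison yields
// a 0/1 mask; every candidate is written to the current output slot, but the
// write cursor only advances when the mask is 1, so rejected candidates are
// simply overwritten by the next iteration. A vectorizing compiler can lower
// the compare to a SIMD pass; hand-written kernels use the comparison result
// directly as a write mask for compaction.
std::size_t filter_block(const uint32_t* doc_ids, const float* scores,
                         std::size_t n, float threshold,
                         uint32_t* out_ids, float* out_scores) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const bool pass = scores[i] > threshold;
        out_ids[written] = doc_ids[i];
        out_scores[written] = scores[i];
        written += pass;  // advance only on passing candidates
    }
    return written;
}
```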

The two mechanisms reinforce each other: block scoring fills the 2K vector faster than document-at-a-time scoring would, each nth_element pass tightens the threshold sooner and a tighter threshold filters more candidates in subsequent SIMD passes — reducing the number of writes into the collection vector and keeping it lean.

Closing Thoughts

The techniques described here are not novel in isolation — priority queues and selection algorithms are well-studied. This work demonstrates that the choice of collection strategy is not independent of the scoring architecture. When scoring is restructured around block-based SIMD execution, a more cache-friendly, branch-free alternative becomes preferable. The performance characteristics observed in the benchmark are in large part a consequence of treating these two subsystems as a jointly optimized unit rather than as independent components.



If you find this work interesting, we would appreciate your support — a star on GitHub goes a long way for an early-stage project.