
SereneDB IResearch: C++ search engine that won the game

SereneDB is a search-OLAP database where search is a first-class citizen of the query engine. All search functionality is powered by IResearch, an open-source C++ information retrieval library that core members of the SereneDB team have been developing since 2016. An older version of IResearch is the search foundation behind ArangoSearch in ArangoDB, running in production deployments since 2018.

The SereneDB team has made significant architectural changes to IResearch, including block-at-a-time vectorized scoring, a redesigned top-K collection pipeline and improvements to phrase and conjunction queries, and we now want to share the results. This post is a methodology-transparent comparison of IResearch against the two most relevant open-source alternatives: Lucene and Tantivy.

The comparison runs on the search-benchmark-game, a benchmark built by the Tantivy team using a corpus and query set of their choosing. We thought that was a good reason to compete on their terms, so we forked the repository, added an IResearch engine implementation and ran the numbers.

Detailed benchmark results →

[Interactive results chart: per-query latency in microseconds for iresearch (by SereneDB), lucene and tantivy, broken down by query type (OR, AND, phrase, other) and collection mode (count, top_100, top_100 count).]


If you find this interesting, we'd be grateful if you'd support SereneDB with a star on GitHub. For an early-stage project, it means more than you might think.


IResearch

IResearch is an embeddable C++ search engine built by the SereneDB team to live inside database kernels. It provides a comprehensive, Lucene-class feature set: inverted indexes with BM25/TF-IDF scoring, columnar storage, HNSW-based vector search, S2 geospatial indexing, semi-structured filtering and a pluggable NLP pipeline covering many languages. At the same time, it integrates natively with the host database's write-ahead log and transaction protocol. There is no separate process, no network hop and no consistency gap: IResearch participates directly in the host's commit protocol, giving search queries read-committed visibility over the same data the primary storage sees. A complete description of IResearch's capabilities is available at IResearch.

Search Benchmark, the Game

The database world has no shortage of well-established benchmarks. TPC-H has been the standard for analytical query performance since 1999; TPC-C covers transactional workloads; ClickBench, introduced by ClickHouse, has become a standard for OLAP query benchmarking. Each of these has a fixed schema, a fixed query set and a reproducible methodology that makes cross-engine comparison credible.

Search benchmarks are, by comparison, surprisingly thin. Lucene has an excellent nightly benchmark suite that tracks indexing throughput, query latency and segment merge performance over time. It is a great tool for tracking Lucene's own regression history, but it is not easily extensible to other engines.

The most suitable option for a fair multi-engine comparison is the search-benchmark-game, originally developed by the Tantivy team and maintained by Quickwit. It provides a standardized corpus (the English Wikipedia), a fixed query workload derived from the AOL query dataset and a simple harness that any engine can implement by satisfying a small interface. The benchmark covers the query types that matter in practice: single-term lookups, phrase queries, boolean conjunctions and disjunctions, and count-only variants of each. Results are reported as best-of-100 single-threaded throughput, with Lucene's query cache disabled and Java's JIT warmed up before timing begins.

Corpus

The corpus is the English Wikipedia, pre-processed into a single JSON file. Stemming is disabled. Queries are derived from the AOL query dataset — specifically, multi-term queries that return at least one result when searched as a phrase. They contain no personal information.

Execution

Each engine is benchmarked independently in single-threaded mode. The benchmark driver communicates with each engine over stdin/stdout, issuing one query at a time and measuring wall-clock time with time.monotonic().
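As a rough sketch of that loop (the actual harness and its line protocol live in the search-benchmark-game repository; the engine command and query format below are illustrative):

```python
import subprocess
import sys
import time

def run_queries(engine_cmd, queries):
    # Start the engine as a simple line-oriented stdin/stdout server.
    proc = subprocess.Popen(engine_cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True, bufsize=1)
    timings = []
    for q in queries:
        start = time.monotonic()          # wall-clock timing, as in the harness
        proc.stdin.write(q + "\n")        # write the query to the engine's stdin
        proc.stdin.flush()
        result = proc.stdout.readline()   # read the result count back
        elapsed_us = (time.monotonic() - start) * 1e6
        timings.append((q, result.strip(), elapsed_us))
    proc.stdin.close()
    proc.wait()
    return timings
```

Because the timer brackets both the write and the read, serialization overhead is included in each measurement, exactly as described above.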

Before timing begins, each engine goes through a 60-second warmup phase where all queries are issued repeatedly in shuffled order. This ensures the index data is fully loaded into the OS page cache and, for JVM-based engines like Lucene, that the JIT compiler has had time to compile the hot paths. All benchmarks are therefore warm-cache measurements.

After warmup, the full query set is run 10 times in the same shuffled order. All 10 durations are recorded per query. The reported per-query latency is the median across the 10 runs. Aggregate statistics across all queries are reported as average, P50, P90 and P99.

Queries are shuffled with a fixed random seed, ensuring reproducibility and preventing any systematic ordering effects from favouring engines with sequential access patterns.
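A minimal sketch of this reporting scheme (the seed value and the data layout are illustrative, not the harness's actual ones):

```python
import random
import statistics

def shuffled_order(queries, seed=6):
    # Fixed seed: every engine sees the same "random" order.
    order = list(queries)
    random.Random(seed).shuffle(order)
    return order

def report(latencies_us):
    # latencies_us maps each query to its 10 per-run durations (microseconds).
    # Per-query latency is the median of its runs; aggregates are computed
    # over those per-query medians.
    medians = sorted(statistics.median(runs) for runs in latencies_us.values())
    def pct(p):  # nearest-rank percentile
        rank = max(1, round(p / 100 * len(medians)))
        return medians[min(rank, len(medians)) - 1]
    return {"average": statistics.mean(medians),
            "p50": pct(50), "p90": pct(90), "p99": pct(99)}
```

Taking the median per query damps one-off scheduler hiccups, while the P90/P99 aggregates still expose queries that are consistently slow.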

What is measured

Each query measures end-to-end latency as seen by the benchmark driver — from writing the query to stdin to reading the result count from stdout. This includes any serialization overhead, but engines are simple in-process servers so this cost is negligible. No network stack is involved.

Query Types

Term — A single high-frequency stop word (the). The posting list for such a term spans nearly the entire corpus, leaving no room for skipping or early termination. This is a degenerate but important stress test for raw posting list traversal speed.

Intersection — All terms are required, expressed with a + prefix on every token (e.g. +griffith +observatory). This is the most selective query type, typically matching the fewest documents. It heavily exercises conjunctive traversal and skipping efficiency. Engines that implement score pruning can avoid visiting large portions of the posting lists.

Union — All terms are optional with no prefix (e.g. griffith observatory). This matches the broadest set of documents and is typically the most expensive query type, as the engine must merge many posting lists and score far more candidates. It stresses score collector management and the cost of BM25 scoring at scale.
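Since union queries are dominated by repeated BM25 evaluation, it helps to see what each candidate costs. A textbook per-term BM25 score (with the common defaults k1=1.2, b=0.75 and a Lucene-style smoothed IDF; individual engines may differ in details):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    # Rarer terms get a larger IDF weight (smoothed so it stays positive).
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term-frequency saturation, normalized by document length.
    norm = tf / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * (k1 + 1) * norm
```

A union query sums this over every matching term in every candidate document, which is why broad disjunctions stress the scoring pipeline far more than selective intersections.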

Phrase — Terms must appear consecutively and in order, wrapped in quotes (e.g. "griffith observatory"). This requires the engine to verify positional information after the initial term match, adding overhead on top of standard intersection traversal. Phrase queries tend to match very few documents, but the position-checking step makes them more expensive per candidate than a plain intersection.

IntersectionUnion — Some terms carry a + prefix (mandatory) while others are bare (optional, scoring only). For example, +climate policy will only return documents containing "climate", but documents that also mention "policy" will rank higher. This is the default query mode in Lucene and reflects typical real-world search behavior: the mandatory terms act as a hard filter, while the optional terms act as a ranking signal. It tests the engine's ability to efficiently combine strict filtering with scoring-time boosting. Also known as a Required/Optional query.

Negated — One required term combined with one or more excluded terms (e.g. +python -snake -monty). The engine must match documents containing the required term while filtering out any that contain the excluded terms. Queries are designed around ambiguous words like Java, Jaguar, Mercury, Apple where the negated terms carve out a specific semantic meaning. This tests how efficiently an engine handles exclusion without degrading traversal of the primary posting list.

Two-phase critic — A single query combining a phrase requirement with an additional required term: +"the who" +uk. This forces a two-phase evaluation where the engine first finds phrase matches and then filters by the second term. It is a targeted stress test for engines that handle phrase and term constraints in separate passes.
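The surface syntax above is enough to tell these types apart. A rough classifier (illustrative only; the harness does its own parsing) might look like:

```python
def classify(query):
    # Map a benchmark query string to the query types described above,
    # based purely on its surface syntax.
    tokens = query.split()
    if '"' in query:
        # A bare phrase, or a phrase combined with an extra required term.
        if any(t.startswith("+") and '"' not in t for t in tokens):
            return "two-phase"
        return "phrase"
    if any(t.startswith("-") for t in tokens):
        return "negated"
    plus = sum(t.startswith("+") for t in tokens)
    if plus == len(tokens):
        return "intersection" if len(tokens) > 1 else "term"
    if plus == 0:
        return "union" if len(tokens) > 1 else "term"
    return "intersection-union"
```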

Collection Types

COUNT — Only the total number of matching documents is returned. No results are fetched and no scoring is performed. This isolates raw posting list traversal speed in its purest form. The engine can skip scoring entirely and potentially use shortcuts unavailable when ranking is required.

TOP 100 — The 100 highest-scoring documents by BM25 are returned. This mode tests the engine's ability to evaluate documents, maintain a priority queue of the best candidates and utilize dynamic pruning strategies (like WAND/Maxscore or Block-Max WAND/Maxscore). The engine attempts to quickly establish a high enough competitive threshold to safely skip evaluating lower-scoring documents, measuring how efficiently the engine can short-circuit the scoring process while still fulfilling the result quota.

TOP 100 COUNT — Both the top 100 scored results and the exact total match count are returned. Crucially, requiring the total hit count alongside ranked results typically disables dynamic pruning optimizations entirely. Every single matching document must be fully traversed and scored. This is the most demanding mode, but it provides vital insight into the raw throughput of the scoring function itself, completely free from the interference of early-termination algorithms. This metric is especially valuable for use cases where ranking is implemented externally, allowing engineers to measure the pure speed of the engine's scoring phase.
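The difference between the three modes can be sketched over a stream of already-scored (doc_id, score) pairs. Real engines interleave collection with WAND/Block-Max pruning rather than scoring everything up front; this sketch only shows what each mode must produce and why TOP 100 COUNT cannot skip any document:

```python
import heapq

def collect(scored_docs, mode, k=100):
    # scored_docs: iterable of (doc_id, bm25_score) pairs.
    if mode == "COUNT":
        # No scoring or ranking needed: just traverse and count.
        return sum(1 for _ in scored_docs)
    heap, count = [], 0
    for doc_id, score in scored_docs:
        count += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # heap[0][0] is the competitive threshold; a pruning engine
            # uses it to skip whole blocks that cannot beat it.
            heapq.heapreplace(heap, (score, doc_id))
    top = sorted(heap, reverse=True)
    return top if mode == "TOP_100" else (top, count)
```

In TOP_100 mode, every document below the threshold is a pruning opportunity; in TOP_100_COUNT mode, the exact total still requires visiting all of them.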

The Good, the Bad, the Ugly

The Good

Isolated search benchmark. The benchmark is focused on search retrieval performance. Each engine runs as a simple in-process server communicating over stdin/stdout, with no network stack and no unrelated system activity in the measurement loop. What you measure is the search engine.

Reproducibility. The corpus (English Wikipedia), the query set and the execution parameters are all fixed and publicly available. Queries are shuffled with a fixed random seed. Anyone can clone the repository, run make bench and get comparable results on their own hardware.

Disk I/O is deliberately eliminated. The benchmark runs a 60-second warmup phase before any timing begins, issuing all queries repeatedly to ensure the index is fully resident in the OS page cache. At approximately 5 GB, the corpus is small enough to fit entirely in RAM, so no disk reads occur during measurement after warmup. This isolates retrieval algorithm performance from storage hardware differences.

Extensive scoring coverage. Every query is benchmarked across three collection modes — COUNT, TOP 100 and TOP 100 COUNT. This cleanly separates traversal cost from ranking cost and ranking cost from the additional overhead of exact hit counting.

Phrase queries. Many search benchmarks skip phrase queries entirely. This one includes them as a first-class query type, which matters because position list decoding and adjacency checking are a meaningfully different workload from plain term or boolean queries.

Realistic queries and dataset. The queries are derived from the AOL query dataset and the corpus is Wikipedia. Both are quite close to real-world search workloads. The queries cover a natural distribution of specificities, from narrow named entities like "robert green ingersoll" to broad concepts like digital scanning.

The Bad

No deletions, no updates. The benchmark indexes a static corpus once and never modifies it. Real search engines spend significant effort managing segment merges, tombstone filtering and live/deleted document tracking. A single consolidated index segment with no deleted documents is the best-case scenario for any engine and it doesn't reflect production conditions.

Single index segment. All engines operate on one fully-merged segment. Multi-segment indexes are the norm in production (Lucene-based systems in particular) and segment merging, per-segment overhead and cross-segment result merging all have real costs that this benchmark never exercises.

No field values, no sort-by-field. The benchmark only measures full-text retrieval ranked by BM25. There is no access to stored fields, doc values, or numeric attributes. Sorting by date, filtering by category, or retrieving stored fields alongside results — all common in real applications — are entirely absent.

Limited query complexity. The query set covers term, phrase, union, intersection, mandatory/optional, negation and one two-phase query. There are no range queries, no faceted queries, no nested boolean structures, no wildcard or fuzzy queries. Real-world query workloads are considerably more varied.

Indexing performance is not measured. The benchmark only measures query latency. Index build time, indexing throughput and memory usage during indexing are not captured, even though these are often primary concerns when evaluating a search engine.

The query difficulty distribution is skewed toward easy queries. Looking at the actual result counts from the benchmark (using Tantivy 0.25 as a reference), 60% of phrase queries return fewer than 100 results and 53% of intersection queries return fewer than 1,000 results. These are queries that complete in microseconds. In production search, the high-cardinality queries — the ones that match hundreds of thousands of documents and take tens of milliseconds — are typically far more common, since users tend to search with short, broad terms. The benchmark has them (union queries have a median result count of ~170k and the single the term query matches 4.5M documents), but they are outnumbered by the many narrow phrase and intersection queries that any engine handles trivially.

Next Steps

The benchmark results above give us a clear picture of where things stand today. They also point directly at what needs to be done next — both in terms of improving the benchmark itself and pushing SereneDB's performance further.

End-to-end search benchmarks. Isolated retrieval benchmarks like search-benchmark-game are valuable, but they only tell part of the story. We are planning a ClickBench-style benchmark for search engines and databases — a full end-to-end comparison that puts SereneDB up against established players like Elasticsearch and OpenSearch on realistic workloads, measuring not just query latency but indexing throughput, resource usage and overall system behavior under load.

Extending the query set. As outlined in the weaknesses above, the current query set in search-benchmark-game has limited structural diversity. We plan to contribute additional query types — range queries, nested boolean expressions, fuzzy and wildcard queries — to make the benchmark more representative of real-world search workloads and more useful to the broader community.

Indexing performance. Search-benchmark-game currently measures query performance only. We plan to add an indexing benchmark track, covering ingestion throughput, segment merge behavior and resource consumption during index build. This is an important gap, especially when comparing engines with fundamentally different indexing architectures.

Making IResearch even faster. IResearch already performs strongly in the current benchmark results, and we have a concrete set of improvements planned that we expect to push performance further.

Closing Thoughts

SereneDB is committed to continuing development of IResearch as a standalone library, so that native C++ projects can benefit from enterprise-grade search capabilities without shipping a JVM dependency. For teams building in C++ today, that dependency is often a non-starter, and IResearch gives them a way out.

If you like what we're doing, we'd be grateful if you'd support SereneDB with a star on GitHub.

For those interested in the technical internals behind these results, we have written a series of deep-dives covering the key architectural decisions in IResearch.

Stay tuned!