Andrey Abramov
Search optimization journey 5: Adaptive posting list format
In a recent benchmark, SereneDB's search engine IResearch outperformed established search engines across query types and collection modes. This post is part of the Search optimization journey series, a set of technical reports examining the implementation decisions behind those results.
The previous four posts covered query execution: lazy two-phase iterators, block-at-a-time SIMD scoring, nth_element top-K collection and contiguous-block norm gathering. This post goes one level deeper - to the data the iterators read.
Before any iterator runs, every block of 128 document IDs has already been compressed and written to disk. The choices made at that layer have a direct multiplier effect on everything above: a poorly compressed block takes longer to decode, and since posting list decoding sits on the critical path of every single query type, that cost compounds across the entire workload.
The problem with a fixed posting list format
Classical posting list formats commit to one codec per posting list. Lucene's original design, introduced around 2012, used FOR-delta (Frame of Reference with delta coding and bitpacking) for full 128-element blocks, with a VInt-encoded tail for the last partial block. The tail is a special case: it has its own code path and none of the compression options available to full blocks. Tantivy follows a similar model. The tail block is encoded differently from the main blocks and the encoding choice is made once per posting list rather than per block.
Fig. 0 - The old fixed format: every full block uses FOR-delta regardless of local data shape; the tail always falls back to VInt with no access to the compression options available to full blocks.
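For context, VInt is the classic variable-length integer scheme: 7 payload bits per byte, with the high bit marking that more bytes follow. A minimal sketch of the write side (the general idea only, not Lucene's exact code):

```cpp
#include <cstdint>

// Classic VInt write: 7 payload bits per byte, high bit = "more bytes follow".
uint8_t* write_vint(uint8_t* out, uint32_t v) {
  while (v >= 0x80) {                     // more than 7 bits remain
    *out++ = uint8_t((v & 0x7F) | 0x80);  // low 7 bits plus continuation flag
    v >>= 7;
  }
  *out++ = uint8_t(v);                    // final byte, continuation flag clear
  return out;
}
```

A gap below 128 costs one byte, below 16384 two bytes, and so on - a per-value floor that block-level compression can go well under.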
The fundamental problem is that the value distribution within a block changes throughout a posting list. For example, a fresh segment for a common term like "the" has long runs of nearly-consecutive document IDs efficiently compressible with a bitset. Later blocks, after many small commits and partial merges, may have large irregular gaps between the same term's occurrences. A single encoding scheme cannot be optimal for both shapes. Whichever format you choose, there will be blocks where a different choice would have been two or three times smaller.
The lineage: columnar compression research and its adaptation in Information Retrieval
The problem of choosing the best encoding per block is not unique to search engines. Columnar data formats face the same challenge: given a block of values, what is the most compact representation that also decodes fast? The answer depends on the data type, the value range and the actual distribution of values in that specific block, which can vary dramatically across blocks in the same column.
Several research systems have tackled this directly. BtrBlocks (SIGMOD 2023) divides each column into fixed-size blocks and runs an encoding competition per block - FOR, RLE, frequency encoding, dictionary - picking the smallest result. Its central finding is that per-block selection consistently beats per-column selection because local distributions vary dramatically even within a single column. FastLanes (VLDB 2023) takes a complementary angle: redesigning the bit-unpacking layout to target a virtual 1024-bit register so the same code auto-vectorizes across SSE4.2, AVX2, ARM NEON and SVE without platform-specific intrinsics. [DuckDB](https://duckdb.org/2022/10/28/lightweight-compression) applies adaptive compression in its storage layer, selecting from a menu of encodings per data segment based on measured compression ratio and decode cost, making the same per-block competition a production reality in a widely-used analytical engine.
Posting lists are a special case of this general problem. The values are always 32-bit unsigned integers, always monotonically increasing and always delta-encoded before compression. This constraint narrows the encoding menu considerably compared to general columnar data, but it also enables encodings that would not make sense otherwise. A bitset over a contiguous range of document IDs is only meaningful for sorted unique integers. An all-same encoding for uniform gaps is only meaningful after delta coding. The IR-specific structure of the data is what makes these encodings possible and what makes the competition tractable.
With that in mind, we designed a new adaptive posting list format that follows the same per-block competition principle, but with an encoding menu tuned specifically for sorted 32-bit integer sequences in an inverted index.
IResearch adaptive postings format
The physical layout is a sequence of self-describing blocks, each compressed with whatever encoding was smallest at write time. Each block is prefixed by a single selector byte; at decode time a switch on that byte picks the decode path, with no special case for the tail block.
Fig. 1 - A posting list is a sequence of self-describing blocks. Encoding is chosen per-block; the tail uses the same competition as full blocks.
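To make the self-describing layout concrete, here is a minimal decode-side sketch. The selector values, names and byte layout are illustrative assumptions, not IResearch's actual on-disk format, and only the two simplest encodings from the tables below are shown:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical selector values - illustrative only.
enum : uint8_t { kConstDelta = 0, kRaw = 1 };

// Decodes one block of `count` doc IDs that follows `prev`.
// Returns a pointer just past the consumed input bytes.
const uint8_t* decode_block(const uint8_t* in, uint32_t prev,
                            uint32_t count, std::vector<uint32_t>& out) {
  const uint8_t selector = *in++;            // one selector byte per block
  switch (selector) {
    case kConstDelta: {                      // all gaps equal: one stored gap
      uint32_t gap;                          // (simplified to a fixed 4 bytes)
      std::memcpy(&gap, in, sizeof gap);
      in += sizeof gap;
      for (uint32_t i = 0; i < count; ++i) out.push_back(prev += gap);
      break;
    }
    case kRaw: {                             // uncompressed 32-bit doc IDs
      for (uint32_t i = 0; i < count; ++i) {
        uint32_t doc;
        std::memcpy(&doc, in, sizeof doc);
        in += sizeof doc;
        out.push_back(doc);
      }
      break;
    }
  }
  return in;                                 // tail blocks take the same path
}
```

The point is structural: a tail block with 17 documents flows through exactly the same switch as a full block of 128.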
The format uses two separate encoding families, because doc IDs and payload values (frequencies, positions) have different structure. Doc IDs are sorted unique integers, so delta coding is always applied first - only the gaps between consecutive IDs are stored (a short sketch follows the doc ID table below). Frequencies, on the other hand, are unsigned integers with no monotonicity guarantee, so delta coding is sub-optimal for them.
Doc ID encodings (after delta coding):
| Encoding | Description | Block size | When it wins |
|---|---|---|---|
| SIMD bitpacking | All gaps packed at N bits each using Lemire's simdpackwithoutmaskd1. N = bit_width(max_delta). | N × 16 bytes | Most natural language text - gaps small and fairly uniform |
| FOR bitset | Frame of Reference bitmask: a bit per position in [prev, prev+range]. Set bit = doc present. | ceil(range / 64) × 8 bytes | Dense blocks - many docs in a narrow ID range |
| Constant delta | All gaps are equal. Store the single gap value (1, 2 or 4 bytes). | 1-4 bytes | Structured data with regular ID spacing |
| StreamVByte | Variable-length byte encoding. Each value takes 1-4 bytes based on magnitude. | varies | Tail blocks with highly variable gaps |
| Raw values | Uncompressed 32-bit values. Fallback when nothing else wins. | 128 × 4 bytes | Pathological distributions |
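A short sketch of the delta-coding step and of how the bitpack width in the first row is derived - plain C++ for illustration, not the engine's SIMD path:

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>

// Delta-code a block of sorted, unique doc IDs and return the bitpack width:
// N = bit_width(max gap). A full 128-gap block packed at N bits occupies
// 128 * N / 8 = N * 16 bytes.
uint32_t bitpack_width(const std::vector<uint32_t>& docs, uint32_t prev) {
  uint32_t max_delta = 0;
  for (uint32_t doc : docs) {
    max_delta = std::max(max_delta, doc - prev);  // gap to the previous doc
    prev = doc;
  }
  return uint32_t(std::bit_width(max_delta));     // bits per packed gap
}
```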
Frequency and position encodings:
| Encoding | Description | Block size | When it wins |
|---|---|---|---|
| SIMD bitpacking | Values packed at N bits using Lemire's simdpackwithoutmask. No delta coding. | N × 16 bytes | General case for frequencies |
| Constant value | All values in the block are equal. Store one value (1, 2 or 4 bytes). | 1-4 bytes | tf=1 throughout - the majority of frequency blocks |
| StreamVByte | Variable-length byte encoding. No delta coding. | varies | Tail blocks with variable values |
| Raw values | Uncompressed 32-bit values. | 128 × 4 bytes | Fallback |
Encoding selection runs in a single pass over the block. Three quantities are tracked simultaneously: the maximum delta value (which determines the bitpack width), a flag for whether all deltas are equal and a running StreamVByte size estimate. The bitset cost is computed analytically from max_doc - prev. At the end of the pass, the byte cost of each candidate is known and the winner is written as a single byte prefix followed by its payload.
Fig. 2 - Encoding selection runs in a single pass. Four candidates are costed simultaneously; the smallest wins.
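Here is a sketch of what that single-pass competition can look like for doc IDs. The names and cost formulas follow the tables above but are simplified illustrations, not the engine's actual code - in particular, the StreamVByte estimate and the constant-delta size are approximated:

```cpp
#include <algorithm>
#include <bit>
#include <cstddef>
#include <cstdint>

enum class DocEncoding { kBitpack, kBitset, kConstDelta, kStreamVByte, kRaw };

DocEncoding pick_doc_encoding(const uint32_t* docs, size_t count, uint32_t prev) {
  const uint32_t first_delta = docs[0] - prev;
  uint32_t max_delta = 0;
  bool all_equal = true;
  size_t svb_bytes = (count + 3) / 4;              // one 2-bit tag per value
  uint32_t p = prev;
  for (size_t i = 0; i < count; ++i) {             // single pass over the block
    const uint32_t delta = docs[i] - p;
    p = docs[i];
    max_delta = std::max(max_delta, delta);
    all_equal &= (delta == first_delta);
    svb_bytes += (std::bit_width(delta) + 7) / 8;  // 1-4 data bytes per value
  }
  // Candidate costs; the bitset cost is derived analytically from the range.
  const uint32_t range = docs[count - 1] - prev;
  const size_t bitpack_bytes = size_t(std::bit_width(max_delta)) * 16;
  const size_t bitset_bytes  = ((range + 63) / 64) * 8;
  const size_t const_bytes   = all_equal ? sizeof(uint32_t) : SIZE_MAX;
  const size_t raw_bytes     = count * sizeof(uint32_t);
  // Smallest candidate wins; bitpacking needs a full block, StreamVByte is
  // reserved for tails.
  size_t best = raw_bytes;
  DocEncoding enc = DocEncoding::kRaw;
  if (count == 128 && bitpack_bytes < best) { best = bitpack_bytes; enc = DocEncoding::kBitpack; }
  if (bitset_bytes < best)                  { best = bitset_bytes;  enc = DocEncoding::kBitset; }
  if (const_bytes < best)                   { best = const_bytes;   enc = DocEncoding::kConstDelta; }
  if (count < 128 && svb_bytes < best)      { enc = DocEncoding::kStreamVByte; }
  return enc;
}
```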
Tail unification
Both Lucene and Tantivy encode the tail block, the last partial block with fewer than 128 documents, using VInt: each doc ID is stored as a variable-length integer individually, with no block-level compression. This is a separate code path from the main blocks and it means the tail gets none of the compression options available to full blocks.
The current format eliminates the distinction. The tail runs through the same encoding competition as full blocks. Bitpacking is excluded for tails since it requires exactly 128 elements to operate, but all other encodings are available - constant delta, bitset, StreamVByte and raw values. StreamVByte is available only for tails: full blocks must produce a fixed-stride layout that the downstream SIMD scoring pipeline relies on, while tails have no such constraint.
| Term | Tail docs | VInt tail | New format |
|---|---|---|---|
| Common term, dense tail | 96, range of 128 | ~192 bytes (2 bytes/gap avg) | 16 bytes (bitset) |
| Mid-frequency term | 64, 12-bit gaps | ~192 bytes (3 bytes/gap avg) | 96 bytes (bitpacking) |
| Structured data, uniform gaps | 80, equal gaps | ~160 bytes (2 bytes/gap avg) | 2 bytes (constant delta) |
This matters most for rare terms - those with fewer than 128 documents - which have exactly one tail block and no full blocks at all. They outnumber common terms by orders of magnitude in any natural language vocabulary.
Extensibility
Because the selector byte fully describes each block's encoding, adding a new encoding to the format requires touching only two places: a new entry in the encoder and a corresponding entry in the decoder. The per-block competition picks it up automatically - if the new encoding produces a smaller result than existing options on a given block, it wins; otherwise it loses silently and nothing else changes.
For example, a Frame of Reference combined with StreamVByte is already planned as a future encoding. When it lands, it becomes a live competitor across the entire corpus with minimal code changes. Whether it wins often enough to justify the dependency is a question the competition answers empirically, not by intuition.
The same property absorbs larger structural changes. Experimenting with a 256-element block size, which would enable wider SIMD bitpacking and lower per-block header overhead, would change the block size constant and the bitpacking entries, while the competition logic, tail handling and selector byte format all carry over unchanged.
When each encoding wins
| Encoding | Wins when | Typical scenario | Size for 128 docs |
|---|---|---|---|
| Constant delta | All inter-doc gaps identical | Structured data with regular ID spacing | 1-4 bytes |
| Bitset | High density, narrow ID range | Common terms in fresh segments | 16-64 bytes |
| Bitpacking | Gaps small and fairly uniform | Most natural language text | N × 16 bytes |
| StreamVByte | Gaps variable in magnitude | Rare terms, short tail blocks | varies |
| Constant value (freqs) | All frequency values equal | tf=1 for all docs in block | 2 bytes |
Constant delta is the most extreme compression: when every gap between consecutive doc IDs is identical, the entire block, regardless of how many documents it contains, compresses to a single value. That's 1-4 bytes for 128 documents. It sounds like a special case, but it occurs naturally in structured data where IDs are assigned at regular intervals: partitioned datasets, range filters over numeric fields, evenly-distributed sharded collections.
Bitset wins when many doc IDs fall in a narrow range. Instead of storing gaps, a single bit is set for each ID present in the range. For 120 documents spanning IDs 1000-1127, the bitset is 17 bytes. The same block with 1-bit gap bitpacking is 16 bytes - essentially the same. But once the range widens, bitpacking pulls ahead; the competition always picks correctly without manual tuning.
Bitpacking is the workhorse for most natural language text. Gaps between occurrences of a mid-frequency term typically fit in 10-14 bits. A block at 12 bits per gap costs 192 bytes, vs 512 bytes uncompressed - a 2.7× reduction. The bit width is chosen per block based on the largest gap present, so a block with small uniform gaps compresses tightly even if a neighbouring block needs more bits.
Constant value for frequencies is worth calling out specifically. Most terms appear exactly once per document - tf=1 throughout the corpus. For those terms, an entire block of 128 frequencies compresses to 2 bytes: 1 selector + 1 value. This fell out naturally from the all-same check rather than requiring dedicated engineering and it applies to the large majority of all frequency blocks in a real index.
The bitset encoding in practice
The bitset encoding stores a bitmask of which doc IDs are present within a range. Beyond its compression benefits, it is naturally fast for seek operations: checking whether a specific doc ID is present is a single bit test, with no decoding required. This makes it particularly effective for counting queries, where the engine traverses the entire posting list checking membership - bitset seeks are as cheap as any format can offer and the compact representation reduces I/O.
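The membership check itself is as small as it sounds - a sketch, with names and the exact base convention assumed for illustration:

```cpp
#include <cstdint>

// `bits` is the bitset payload of one block, `prev` the base of its ID range.
inline bool bitset_contains(const uint64_t* bits, uint32_t prev, uint32_t doc) {
  const uint32_t offset = doc - prev;                // position within [prev, prev + range]
  return (bits[offset / 64] >> (offset % 64)) & 1;   // one load, one shift, one mask
}
```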
In many other cases, however, the engine needs to materialize doc IDs into a flat array - for example, when gathering norms from columnar storage during scoring or when returning all matched doc IDs directly to the caller. In these cases, materialization becomes the bottleneck. A bitset that is cheap to seek can still be expensive to fully decode if the conversion is done naively.
Even Lemire's optimized bit iteration - repeatedly extracting the lowest set bit with std::countr_zero, recording the position and clearing the bit - runs at around 45 ns per 128-element block. By comparison, his SIMD bitpack decoder (simdunpackd1) decodes a bitpacked block in ~15 ns, about 3× faster. Using bit iteration for materialization would make the bitset encoding a net loss for any query that requires a flat array of doc IDs.
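For reference, the scalar baseline being measured here looks roughly like this (a sketch, not the engine's actual code):

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>

// Scalar materialization: peel off the lowest set bit of each word, record
// its position, clear it, repeat.
uint32_t* materialize_scalar(const uint64_t* words, size_t num_words,
                             uint32_t prev, uint32_t* out) {
  for (size_t i = 0; i < num_words; ++i) {
    uint64_t w = words[i];
    const uint32_t base = prev + uint32_t(i) * 64;
    while (w != 0) {
      *out++ = base + uint32_t(std::countr_zero(w));  // lowest set bit position
      w &= w - 1;                                     // clear that bit
    }
  }
  return out;
}
```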
To address this, we implemented a SIMD materialization path using a 256-entry lookup table built at compile time. Each possible byte value maps to its popcount and the positions of its set bits. For each byte in the bitset, five AVX2 instructions produce up to 8 doc IDs: broadcast the base offset, load positions from the table, sign-extend, add and store. The result is ~15-20 ns per block depending on density, at parity with simdunpackd1. The bitset encoding is now a strict win on dense blocks: better compression and equally fast to decode.
Fig. 3 - 120 doc IDs compress from 480 bytes to 17 bytes. The AVX2 lookup table matches simdunpackd1 throughput; scalar bit iteration is 3× slower and would negate the compression benefit.
SIMD materialization: implementation
The core of the fast path is a 256-entry lookup table built entirely at compile time, with zero runtime cost. Each possible byte value maps to the count of its set bits and their positions within the byte. For a bitset block, the outer loop walks 64-bit words and the inner loop processes one byte at a time, expanding each byte through the table with a handful of AVX2 instructions.
```cpp
#include <array>
#include <cstdint>

// One entry per possible byte value: how many bits are set and at which positions.
struct alignas(16) BitsetByteEntry {
  uint8_t count;
  uint8_t positions[8];
};

// Built entirely at compile time; no runtime initialization cost.
static constexpr std::array<BitsetByteEntry, 256> kBitsetByteTable = [] {
  std::array<BitsetByteEntry, 256> t{};
  for (uint32_t b = 0; b != 256; ++b) {
    t[b].count = 0;
    for (uint32_t i = 0; i != 8; ++i) {
      if (b & (1 << i)) t[b].positions[t[b].count++] = i;
    }
  }
  return t;
}();
```
For each 64-bit word in the bitset, the AVX2 path processes 8 bytes. For each byte, five instructions produce up to 8 doc IDs:
```cpp
const __m256i base_vec =
    _mm256_set1_epi32(prev + i * 64 + b * 8);   // broadcast base
const __m128i pos8 =
    _mm_loadl_epi64(                            // load 8 byte positions
        reinterpret_cast<const __m128i*>(e.positions));
const __m256i result =
    _mm256_add_epi32(                           // base + each position
        base_vec, _mm256_cvtepi8_epi32(pos8));
_mm256_storeu_si256(                            // store 8 doc IDs
    reinterpret_cast<__m256i*>(end), result);
end += e.count;                                 // advance by actual count
```
_mm256_storeu_si256 always writes 32 bytes, but end only advances by e.count. The extra positions beyond count are simply overwritten in the next iteration; this is safe as long as the output buffer has 8 entries of headroom past the actual output length.
Fig. 4 - Five AVX2 instructions per byte: broadcast base, load positions, sign-extend, add, store. Write pointer advances by count, not 8 - safe overshoot by design.
Total work per 128-doc block: 16 words of 8 bytes each, every byte producing up to 8 doc IDs, with an early-out for empty words and no branch per doc ID. In practice this reaches ~15 ns for a 50%-dense block and ~20 ns for a fully-dense block - at parity with SIMD bitpack decode.
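Putting the pieces together, the full routine looks roughly like this - a sketch that reuses the BitsetByteEntry table and the AVX2 kernel shown above, assuming little-endian byte order and at least 8 entries of headroom in the output buffer:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Ties the lookup table and the per-byte kernel above into one routine.
uint32_t* materialize_bitset_avx2(const uint64_t* words, size_t num_words,
                                  uint32_t prev, uint32_t* out) {
  uint32_t* end = out;
  for (size_t i = 0; i < num_words; ++i) {
    if (words[i] == 0) continue;                           // early-out for empty words
    const auto* bytes = reinterpret_cast<const uint8_t*>(&words[i]);
    for (size_t b = 0; b < 8; ++b) {
      const BitsetByteEntry& e = kBitsetByteTable[bytes[b]];
      if (e.count == 0) continue;
      const __m256i base_vec =
          _mm256_set1_epi32(int(prev + i * 64 + b * 8));   // broadcast base
      const __m128i pos8 = _mm_loadl_epi64(
          reinterpret_cast<const __m128i*>(e.positions));  // 8 bit positions
      const __m256i result =
          _mm256_add_epi32(base_vec, _mm256_cvtepi8_epi32(pos8));
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(end), result);
      end += e.count;                                      // overshoot is overwritten
    }
  }
  return end;
}
```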
On aarch64, where NEON lacks the gather instructions needed, a scalar fallback is used - correct everywhere, with AVX2 preferred where available. This mirrors the pattern from the norm gathering post: a fast architecture-specific path and a portable fallback, without intrinsics that complicate cross-platform builds.
Future work
The SIMD materialization path closes the decode throughput gap, but materialization still happens even when the caller only needs to know whether a document is present. For seek-only operations, checking membership in a bitset is already a single bit test. Implementing the iterator's seek and advance operations directly over the bitset representation, without converting to a flat array first, would eliminate this conversion entirely on paths where it is not needed. The payoff is largest for top-K queries with pruning (e.g. BlockMax Score).
On the hardware side, AVX-512 would allow processing more bits per instruction in the materialization path and a 256-element block size would enable wider SIMD bitpacking and reduce per-block overhead. We're exploring both areas.
Conclusion
A posting list format that looks simple from the outside - one byte, then a payload - hides a lot of design work in that one byte. Getting it right means every query that runs on top benefits: dense blocks compress smaller with the bitset encoding, structured data compresses to near-nothing with constant delta and rare terms stop paying the overhead of a VInt tail. Each encoding fits naturally into the same competition and adding a new one touches only the encoder and decoder.
The results are available on the official benchmark, accepted and published by the Tantivy maintainers. IResearch leads across all query types and categories; all values are in microseconds.
These are still sort of "microbenchmark" results on a single-node search library. Real-world performance is shaped by many more factors like network, storage, query patterns, indexing load and the full system stack around the search engine. Translating library-level wins into end-to-end product performance is the next challenge for SereneDB and it's the one we are working on. We will be back with broader benchmarks soon. Stay tuned!
In the meantime, if you found this series useful, a star on GitHub is the best way to show it!
Other articles in the Search optimization journey series:
