
Andrey Abramov
May 6, 2026 · 6 minutes read
Search where your data lives
Why search engines haven't followed the data into the lake and why agents need them to
Elasticsearch first shipped in 2010 and the deal was simple: your data lived in Postgres or MySQL; you copied it into Lucene; you got ranked search. The duplication tax was the price of admission. For most of the years since, that's been the pattern - ETL into a search engine, run it on a separate cluster, keep the two halves in sync.
The analytical landscape has moved since. Operational data still lives in row-stores, but most analytical data now sits across a wider list of sources, such as Parquet files on S3, Iceberg tables in a lakehouse, JSON or CSV files dropped into a bucket and increasingly data hubs like Hugging Face that publish datasets directly over HTTPS. The volumes have grown to the point where moving data is the expensive operation. You can't pre-ingest what you don't know you'll need.
Where agent context comes from
The agent case is where this shows up most clearly. Watch what an agent actually reaches for to answer one real question.
A research agent answering "find me recent papers on quantum error correction" wants arXiv, plus a colleague's S3 bucket of pre-prints, plus your own local notes. A support copilot wants the ticket history, the product docs, the customer's Slack thread and yesterday's release notes. Each agent question pulls from a different mix of stores and the next question may pull from a different mix again.
You can't make this work by ingesting everything into a single search cluster ahead of time. The volume is too big, the staleness window is too short and the relevant subset is decided per question, not per pipeline. What you actually want is to let the agent navigate the data where it lives. Search and analytics are the right tools for that and they always have been. What's new is that the data is no longer in one place, so the engine has to be able to read across all of it.
Federation is plumbing - it's not enough on its own
"Federated" means: query the Parquet on S3, the Iceberg table in your warehouse, the JSON in a colleague's bucket and the rows in your local database, from the same engine, without a copy step. Necessary, but not the whole story. If you can read distributed bytes but can't find what you need in them, you've solved connectivity and nothing else. Search and analytics are what turn distributed bytes into answers.
Analytical engines have largely figured federation out. You point a modern analytical engine at a Parquet file and you get aggregates - no ingest required. That's now the dominant pattern. Search engines, though, are still ingest-first. The typical pattern is: copy your data into a separate cluster in the cluster's own format, then keep the two halves in sync.
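To make the "plumbing" point concrete, here is a minimal Python sketch of federation on its own. It is not SereneDB code: the two in-memory strings stand in for remote files, and the names `scan_csv`, `scan_jsonl` and `federated_search` are hypothetical. It shows one query function reading two formats with no copy step, and it also shows the limit: without an index, every row of every source is scanned.

```python
import csv
import io
import json

# Stand-ins for two remote sources in different formats (assumed data).
CSV_SOURCE = "id,title\n1,quantum error correction\n2,lattice surgery\n"
JSONL_SOURCE = (
    '{"id": 3, "title": "surface codes"}\n'
    '{"id": 4, "title": "quantum error mitigation"}\n'
)

def scan_csv(text):
    """Yield rows from CSV text as dicts, one at a time."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "title": row["title"]}

def scan_jsonl(text):
    """Yield rows from JSON-lines text as dicts, one at a time."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

# One entry per source; each value opens a fresh scan of that source.
SOURCES = {
    "csv": lambda: scan_csv(CSV_SOURCE),
    "jsonl": lambda: scan_jsonl(JSONL_SOURCE),
}

def federated_search(sources, term):
    """Apply the same predicate to every source, reading each in place."""
    hits = []
    for name, scan in sources.items():
        for row in scan():  # brute force: touches every row
            if term in row["title"].split():
                hits.append((name, row["id"]))
    return hits

print(federated_search(SOURCES, "quantum"))
```

The query reaches both sources, but the cost grows with the total size of the data rather than with the size of the answer.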
The approach: index where the engine is, data where the data is
The fix is straightforward to describe. A search index references the data instead of duplicating it. The engine reads the source (a Parquet file on S3, an Iceberg table, a JSON dump on Hugging Face or rows in your local table) and records which rows contain which terms. At query time, the index tells you which rows match. Reading the underlying row content - what databases call materialization - is deferred until a query actually needs it. A COUNT(*), a GROUP BY on an indexed facet or a top-K with LIMIT pay that cost for a handful of rows or none at all; the rest of the data lake stays on the lake.
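The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how SereneDB or IResearch is implemented: `ReferencingIndex`, `LAKE` and `fetch` are hypothetical names, the "lake" is an in-memory dict, and results are ordered by row reference rather than by relevance. The point is only that postings hold (source, row) references, so a count needs no row reads and a top-K reads just K rows.

```python
from collections import defaultdict

class ReferencingIndex:
    """Inverted index that stores row references, not row content."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {(source, row_id)}
        self.materialized = 0             # rows read back from the source

    def add_source(self, name, rows):
        # One pass at index time; only references are kept.
        for row_id, text in enumerate(rows):
            for term in text.lower().split():
                self.postings[term].add((name, row_id))

    def count(self, term):
        # A COUNT(*)-style answer comes from the postings alone.
        return len(self.postings.get(term.lower(), ()))

    def top_k(self, term, k, fetch):
        # Materialize only the k rows the query actually returns.
        refs = sorted(self.postings.get(term.lower(), ()))[:k]
        rows = []
        for source, row_id in refs:
            rows.append(fetch(source, row_id))
            self.materialized += 1
        return rows

# Assumed stand-in for data that stays where it lives.
LAKE = {
    "reviews.parquet": ["great film", "dull film", "great cast", "great score"],
    "notes.json": ["great idea"],
}

def fetch(source, row_id):
    """Read one row from its source on demand."""
    return LAKE[source][row_id]

index = ReferencingIndex()
for name, rows in LAKE.items():
    index.add_source(name, rows)

print(index.count("great"))            # no rows materialized
print(index.top_k("great", 2, fetch))  # two rows materialized
```

Deferring `fetch` until the final result set is known is what keeps the cost proportional to the answer rather than to the data.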
A few practical consequences:
- Zero ETL. The data lake stays the source of truth. No re-staging, no syncing, no separate cluster to maintain.
- One SQL surface, no matter where the rows live. Whether you're hitting a hf:// URL, a local Parquet file or a regular database table, the queries are the same. Only the view or table source changes.
- Search and analytics in one query. Full-text predicates are first-class SQL expressions, so they feed COUNT, AVG, GROUP BY and JOIN like any other column. No round-trips between a search cluster and an analytical engine to stitch results together at the application layer.
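As a rough illustration of the last point, the sketch below uses Python's built-in sqlite3 module, not SereneDB. The `matches` function is a hypothetical stand-in for a full-text predicate, registered as a user-defined SQL function, and the table and its rows are made up. It shows the shape of the idea: the text predicate sits in the WHERE clause and feeds COUNT, AVG and GROUP BY in the same statement.

```python
import sqlite3

def matches(text, term):
    """Hypothetical full-text predicate: 1 if term is a whole word in text."""
    return int(term.lower() in text.lower().split())

conn = sqlite3.connect(":memory:")
conn.create_function("matches", 2, matches)  # expose it to SQL
conn.executescript("""
    CREATE TABLE reviews (repo TEXT, body TEXT, score REAL);
    INSERT INTO reviews VALUES
        ('alpha', 'fast index build', 4.0),
        ('alpha', 'slow index merge', 2.0),
        ('beta',  'index corruption on restart', 1.0),
        ('beta',  'nice docs', 5.0);
""")

# Text filter and aggregation in one statement, one round-trip.
QUERY = """
    SELECT repo, COUNT(*) AS hits, AVG(score) AS avg_score
    FROM reviews
    WHERE matches(body, 'index')
    GROUP BY repo
    ORDER BY repo
"""
result = conn.execute(QUERY).fetchall()
print(result)
```

In this stand-in the predicate is evaluated row by row; an engine with an inverted index would answer the same WHERE clause from its postings instead.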
What this looks like in SereneDB
SereneDB lets you search and analyse data where it lives. It's a Postgres-compatible engine: drop into psql, use your existing drivers, no extension to install. Inside, an inverted index (powered by IResearch, the fastest C++ search engine, written by the SereneDB team) sits alongside a columnar store with a vectorized analytical executor and a planner that combines inverted-index lookups with columnar scans in a single plan.
The index doesn't own the data. It points at it. The same SQL runs over a hf:// URL, a Parquet file on S3, an Iceberg table in your warehouse or rows in a native SereneDB table. The data lake stays the source of truth and the engine reads through to it.
One query, one round-trip. A coding agent asks for "the top 5 hits, grouped by repo, with average relevance per group." A research workflow asks for "matching papers, joined to a metadata table, filtered by year." Both are plain SQL queries. Full-text search is just part of the language, not a separate API you have to learn alongside it. Anything that already speaks Postgres (your driver, your BI tool, your LLM agent) already speaks SereneDB. What used to take three systems (search cluster, analytical engine, app-layer glue) becomes a single SQL query.
Coming next
The next post walks through three demos that share a dataset (IMDb reviews) and the same set of queries, run against remote Parquet on Hugging Face, against local Parquet on disk and against a native SereneDB table. You'll see phrase queries, BM25 ranking, hybrid analytics and JOINs against full-text predicates - with no ingest pipeline in sight.
Next: One SQL, three access modes - the demo walkthrough.
If you'd rather skip ahead and play with the demos directly, everything is in the examples folder of the repo, including install instructions and demo SQL.
Downloads page
GitHub releases
Issues and feature requests welcome on GitHub
And if you like what you see, star us on GitHub - it genuinely helps us reach more people.