Retrieval and Knowledge Systems

Every Core RAG Technique Was Already Invented by Search Engines

2026-03-26

We write RecursiveCharacterTextSplitter in LangChain and split documents into 512-token chunks. We use OpenAI’s embedding API to turn those chunks into vectors. We store the vectors in Pinecone and retrieve the 5 most relevant results with cosine similarity. We add a cross-encoder reranker for second-stage ranking. We think this pipeline was a collective invention of the 2023 AI community.

None of these steps was.

Document chunking was passage retrieval, published by Callan at SIGIR in 1994. Turning text into vectors was Microsoft’s DSSM dual-encoder model in 2013. Approximate nearest neighbor search was the HNSW algorithm in 2016. Two-stage ranking was Learning to Rank from Microsoft Research in 2005. The Reciprocal Rank Fusion used in hybrid retrieval came from Cormack’s 2009 paper.

This is not to say RAG has no value. RAG’s contribution is real, but it happened at the engineering and product layers: it compressed a search pipeline that previously required a PhD in Information Retrieval, or IR, into a few dozen lines of code that any Python developer can run with LangChain. This is cost compression. It is democratization. But at the algorithmic level, every component in the RAG pipeline has an IR predecessor, with time gaps ranging from 3 years to 52 years.

Understanding this fact has immediate practical value: the IR field has accumulated decades of engineering experience and failed experiments around these problems. It already knows which paths do not work and which trade-offs are worth making. If we think only within the RAG frame, we will keep stepping into traps that IR climbed out of decades ago.


A comparison table

Let’s start with the full picture. Here are the 7 core components in the RAG pipeline and their predecessors in IR:

RAG Component             IR Predecessor                       IR Year     RAG Adoption    Gap
Document Chunking         Passage Retrieval (Callan)           1994        2020 (DPR)      26 years
Dense Embedding           Dual Encoder (DSSM)                  2013        2020 (DPR)      7 years
Vector Search (HNSW)      HNSW algorithm                       2016        2021+           5 years
Cross-encoder Reranking   Learning to Rank / BERT Reranking    2005/2019   2022+           3-17 years
Hybrid Search (RRF)       Reciprocal Rank Fusion               2009        2023+           14 years
Query Rewriting           Rocchio Relevance Feedback           1971        2023+           52 years
Query Expansion           RM3 Pseudo-Relevance Feedback        2001        2023+           22 years

Behind these numbers sits more than fifty years of accumulated work from an entire discipline. From Salton building the SMART system and the vector space model at Cornell in the 1960s, to Spärck Jones proposing the statistical interpretation of IDF in 1972, to BM25 and the TREC evaluation framework maturing in the 1990s, and then to the rise of neural IR in the 2010s, IR is a field with a complete theoretical system, standard evaluation methods, and industrial deployment experience. The RAG community walked this path again between 2020 and 2023, but skipped a large amount of engineering experience that had already been accumulated along the way.

Below I go through the components one by one. Each section focuses on three things: what problem the technique is trying to solve, the key technical decision in the original IR paper, and how those trade-offs can guide us when tuning a RAG system.


Chunking: passage size is an overlooked variable

The first step in a RAG pipeline is to split long documents into short pieces. The core intuition is simple: a full document is too long and too coarse. If a document contains multiple topics and participates in retrieval as a whole, it dilutes the signal from the passages that are actually relevant to the query. Splitting into smaller pieces makes precise matching possible.

When Callan studied this problem at SIGIR in 1994, he made one key technical decision: he systematically tested how different passage sizes affected retrieval quality. His experiments found that 100-150 words was the optimal range, but the best value changed with query type. Factoid questions prefer short passages because the answer is usually concentrated in one or two sentences. Topical questions prefer longer passages because they require more context to judge relevance. The TREC Passage Track from 2003 to 2004 further validated this conclusion in standardized evaluation. The 100-word passage setting used in the DPR paper from 2020 is almost identical to the TREC standard, but when this idea spread through the RAG community, the variable got frozen into 512 tokens with no distinction between query types.

That creates a direct tuning opportunity: chunk size should be a tunable parameter in a RAG system, not a fixed constant. In practice, splitting by semantic boundaries such as headings, paragraphs, and list boundaries usually works better than splitting by a fixed token count, and hierarchical retrieval, first at the document level and then at the passage level, often performs better on long-document workloads. If we have to pick one default, TREC’s experience points to 200-300 tokens rather than 512.
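
The packing logic above can be sketched in a few lines of stdlib Python. This is an illustrative sketch, not code from any framework: the `chunk_by_paragraphs` name is made up, and "tokens" are approximated by whitespace-separated words, where a real system would use the embedding model's own tokenizer.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 250) -> list[str]:
    """Pack paragraphs into chunks up to a tunable size budget,
    splitting on semantic boundaries (blank lines) rather than
    cutting at a fixed token offset."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in text.split("\n\n"):       # semantic boundary: blank line
        para = para.strip()
        if not para:
            continue
        n = len(para.split())             # crude word-based token count
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The `max_tokens=250` default reflects the TREC-era 200-300 range rather than the common 512; the point is that it is a parameter we can sweep per query type, not a constant.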


Dense Embedding: cosine vs dot product is not an arbitrary choice

Turning text into vectors and retrieving with vector similarity is the central technical choice in RAG. The intuition is also clear: keyword matching can only find literal overlap. It cannot understand synonyms or semantic relatedness. We need to turn text into numeric vectors and compare meaning in a continuous semantic space.

DSSM, published by Microsoft Research in 2013, was the first production-ready dense retrieval model. Its overall design is the same dual-encoder pattern DPR adopted in 2020: two separate encoders process the query and the document, produce low-dimensional vectors, and rank by similarity. The encoders themselves differ, feedforward networks over letter trigrams in DSSM versus BERT in DPR, but the retrieval architecture is the same. DSSM made a key technical decision: it used cosine similarity rather than dot product. The reason is that cosine normalizes vector length and removes the effect of document length on the score. A 5,000-word document and a 200-word document receive the same score if their semantic relevance is the same. Many RAG implementations default to dot product, partly because some vector databases use it as the default setting. On corpora with large document-length variation, this creates a systematic bias: long documents have embeddings with larger norms, and dot product naturally favors them.

The tuning advice is concrete. If our corpus contains documents with large length differences, for example a mix of short FAQs and long technical manuals, we should verify that the vector database uses cosine similarity rather than dot product, or apply L2 normalization after embedding. Another lesson from IR is that pure dense retrieval has a clear weakness on exact matching. Proper nouns, product IDs, and error codes tend to get generalized into semantic categories by embedding models, which loses lexical precision. That is the fundamental reason hybrid search exists.
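
The length-bias mechanism is easy to demonstrate with toy vectors (these are made-up numbers, not real embeddings): two documents pointing in the same semantic direction but with different norms get different dot-product scores, while cosine, which is just dot product after L2 normalization, scores them identically.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity = dot product of L2-normalized vectors.
    return dot(l2_normalize(a), l2_normalize(b))

query     = [0.6, 0.8]
short_doc = [0.6, 0.8]   # same direction as the query
long_doc  = [1.5, 2.0]   # same direction, larger norm (a "longer" document)

# Dot product prefers the long document purely because of its norm...
assert dot(query, long_doc) > dot(query, short_doc)
# ...while cosine scores the two documents identically.
assert abs(cosine(query, long_doc) - cosine(query, short_doc)) < 1e-9
```

This is also why normalizing embeddings once at index time is equivalent to switching the database's metric to cosine.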


Vector Search: the milliseconds saved in retrieval barely matter next to the LLM call

After HNSW, or Hierarchical Navigable Small World, was published in 2016, the first large-scale adopters were recommendation systems and ad tech. Facebook’s FAISS, open sourced in 2017, was mainly used for embedding retrieval in recommendation scenarios. The wave of vector database startups, Pinecone, Weaviate, Qdrant, and Milvus, built CRUD interfaces and persistence on top of HNSW. At the intuition level there is not much to explain: brute-force search is too slow at the million-vector scale, so we need an approximate algorithm that trades accuracy for speed.

The key technical decision in HNSW is that it separates build accuracy and query accuracy into two independent parameters. ef_construction controls accuracy during index building and affects graph quality. ef_search controls how many neighbors are explored during a query and affects recall and latency. These two parameters can be tuned independently because building is a one-time offline cost, while querying is a repeated online cost. RAG tutorials usually leave them at default values, but experience from IR and recommendation systems suggests that if retrieval quality matters, ef_search should be raised to 2-3 times the default. The reason is cost structure: one HNSW query takes a few to a few dozen milliseconds, while reranking and LLM generation usually take hundreds of milliseconds to seconds. Spending an extra 10 ms in retrieval for a 5 percent recall gain is negligible next to the cost of the LLM call.
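
Raising ef_search is only justified if recall actually improves, so the practical loop is: fix a query set, compute exact top-k once by brute force, then measure recall@k of the index at each candidate ef_search value. A stdlib sketch of that measurement follows; the function names are illustrative, and a simulated result list stands in for a real HNSW query.

```python
import random

def exact_top_k(query, vectors, k):
    """Ground truth: brute-force top-k by dot product over all vectors."""
    order = sorted(range(len(vectors)),
                   key=lambda i: -sum(q * x for q, x in zip(query, vectors[i])))
    return order[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
dim, n, k = 8, 500, 10
vectors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(dim)]

truth = exact_top_k(query, vectors, k)
# Simulate an index run at low ef_search that misses 2 of the 10 true
# neighbors and returns 2 near-misses instead:
misses = [i for i in range(n) if i not in truth][:2]
approx = truth[:7] + [truth[9]] + misses
r = recall_at_k(approx, truth)  # 0.8 by construction here
```

With this harness in place, sweeping ef_search and plotting recall against query latency makes the "10 extra milliseconds for 5 percent more recall" trade-off concrete for our own corpus.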


Reranking: the ceiling depends on recall, not on the ranking model

The reranker in a RAG pipeline is usually a cross-encoder: it concatenates the query and the document, feeds them into a BERT-like model, and outputs a relevance score. The intuition is a two-step retrieval process. Step one uses a fast but coarse method, a dual encoder, to pull the top 100 candidates from a million documents. Step two uses a slow but accurate method, a cross-encoder, to rerank those 100 candidates. This preserves speed at scale while improving final ranking quality.

IR calls this architecture cascade ranking, and it has been the standard approach since Microsoft’s RankNet in 2005. There is an often-overlooked technical decision here: the ceiling of a cascade is determined by the recall quality of the first stage, not by the second-stage ranking model. If the first stage fails to retrieve the relevant document, the reranker only sees irrelevant candidates, and even a perfect ranker cannot fix that. A common mistake in the RAG community is to spend too much effort choosing and fine-tuning the reranker while ignoring diversity and coverage in the recall stage. The IR approach is to make sure the first stage has multiple retrieval channels, BM25 plus dense retrieval and possibly other signals, and then let the reranker fuse and sort them.

That makes the tuning priority clear: first ensure coverage in the recall stage with multiple channels and a sensible top-k, then invest effort in tuning the reranker. Also, many popular rerankers in the RAG community, especially the ms-marco family, were trained on MS MARCO, a dataset built from Bing queries, and are biased toward general web search. If a RAG system serves a specific domain such as law, medicine, or code, it should at least be evaluated on domain-specific data to confirm whether a general reranker actually works there.
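
The recall-ceiling argument reduces to one line of arithmetic: the relevant document must both survive the first stage and be picked by the reranker, so end-to-end success is bounded by the product. The numbers below are made up for illustration.

```python
def cascade_success(first_stage_recall, reranker_accuracy):
    """Probability the relevant document ends up ranked first:
    it must be in the candidate set AND the reranker must pick it."""
    return first_stage_recall * reranker_accuracy

# A perfect reranker behind a weak first stage is capped at the
# first stage's recall...
perfect_reranker = cascade_success(0.70, 1.00)   # 0.70
# ...while a merely decent reranker behind a better first stage wins.
decent_reranker = cascade_success(0.95, 0.85)    # 0.8075
assert decent_reranker > perfect_reranker
```
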


Hybrid Search and RRF: fuse by rank, not by score

The exact-match weakness of pure dense retrieval led the RAG community to gradually accept hybrid search: use BM25, a classic keyword retrieval algorithm based on term frequency and document length, for sparse retrieval, use embeddings for dense retrieval, then merge the two result lists. The standard way to combine them is Reciprocal Rank Fusion, or RRF.

RRF was published by Cormack at SIGIR in 2009. The original paper is only two pages long and the formula is extremely simple: for each result, compute 1 / (k + rank) and sum the scores across channels. The method came from meta-search, the problem of fusing results from multiple search engines into one ranking, and its key technical decision is to use rank rather than score for fusion. The reason is that the score scales across retrieval channels are completely incomparable. BM25 may range from 0 to 30. Cosine similarity may range from -1 to 1. Taking a weighted average of the two has no statistical meaning. Rank, however, is universal: rank 1 is rank 1, regardless of the raw score. This insight is often missed in the RAG community, where some implementations try to normalize BM25 and cosine scores and then average them. In practice, they often perform worse than plain RRF.

BM25 itself also contains an important design choice worth understanding: its term-frequency component uses a saturation function, so gains diminish after a term appears often enough. This prevents long documents from dominating the ranking just because their term counts are high. Dense retrieval has no such mechanism. The embedding signal of long documents can overwhelm short ones. That is another value of hybrid search: the BM25 channel naturally compensates for document length and balances the length bias of the dense channel. The tuning advice is straightforward: if a system currently uses only dense retrieval, adding a BM25 channel for hybrid search is the highest-return improvement. Use RRF with k = 60, the default from the original paper, though 40-100 is usually fine.
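
The two-page simplicity of RRF shows in code. The following is a direct implementation of the formula from Cormack et al. (2009): each channel contributes 1 / (k + rank) per document, ranks are 1-based, and raw scores are never compared across channels. The document ids are hypothetical.

```python
from collections import defaultdict

def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) across channels,
    then sort documents by the fused score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results  = ["d3", "d1", "d7", "d2"]   # sparse channel (by rank)
dense_results = ["d1", "d5", "d3", "d9"]   # dense channel (by rank)

fused = rrf([bm25_results, dense_results])
# d1 (ranks 2 and 1) and d3 (ranks 1 and 3) appear in both channels,
# so they rise above every single-channel document.
assert set(fused[:2]) == {"d1", "d3"}
```

Note that BM25 scores and cosine similarities never enter the computation, only positions do, which is exactly why the method needs no per-channel calibration.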


Query Rewriting: LLM is an upgraded Rocchio

An increasingly popular optimization in RAG systems is to let an LLM rewrite the user’s query, turning a vague question into a precise one, or generating multiple query variants to improve recall. The intuition is direct: user queries often contain too little information and there is a vocabulary gap between the user’s phrasing and the corpus. Expansion or rewriting helps match relevant documents.

This idea goes back to Rocchio relevance feedback in 1971. A user submits an initial query, the system returns results, the user marks relevant and irrelevant items, and the system modifies the query vector and retrieves again. Later, Lavrenko and Croft in 2001 proposed RM3 pseudo-relevance feedback: assume the top-k results from the first retrieval round are relevant, extract keywords from them, and expand the query without manual labels. LLM-based query rewriting is essentially an upgraded version of the same idea. It replaces the statistical method with a generative model and gains much more flexibility. This is one of the few areas where RAG offers a real improvement over traditional IR.

But experience from IR points to a risk: query drift. If the first retrieval results are themselves irrelevant, then query expansion based on them injects noise and drifts even further away from the original intent. LLM query rewriting has the same risk, especially when the model introduces concepts that were not in the original query. The practical fix is to retrieve with both the original query and the rewritten query, then take the union, rather than replacing the original query completely.
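
The union pattern is a small wrapper around whatever retriever the system already has. A sketch under stated assumptions: `retrieve` is a stand-in for the actual retrieval call, the function name is made up, and a simple interleave with the original query first is one reasonable merge policy (RRF over the two lists is another).

```python
def merged_retrieve(retrieve, original_query, rewritten_query, top_k=5):
    """Retrieve with both queries and interleave the results,
    original query first, dropping duplicates. This keeps the
    original intent anchored even if the rewrite drifted."""
    original_hits = retrieve(original_query)
    rewritten_hits = retrieve(rewritten_query)
    merged = []
    for a, b in zip(original_hits, rewritten_hits):
        for doc in (a, b):
            if doc not in merged:
                merged.append(doc)
    return merged[:top_k]

# Toy index standing in for a real retriever:
index = {"q": ["d1", "d2", "d3"], "q_rewritten": ["d2", "d4", "d5"]}
result = merged_retrieve(lambda q: index[q], "q", "q_rewritten", top_k=4)
assert result[0] == "d1"   # original query's top hit stays on top
```
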


Where RAG’s real contribution lies

The point of the component-by-component comparison above is not to diminish RAG. RAG’s contribution is real. It just happened at a different layer from what many people assume.

RAG’s core contribution is cost compression. Before RAG, building a pipeline with chunking, embedding, vector retrieval, and reranking required engineers with an IR background and an understanding of BM25 mathematics, HNSW graph structure, and Learning to Rank objectives. RAG frameworks such as LangChain, LlamaIndex, and Haystack wrapped these techniques into high-level APIs so that any developer who can write Python can assemble a working retrieval-augmented generation system in a day.

The value of that democratization should not be underestimated. In the history of technology, many major advances came from lowering the barrier rather than inventing a new algorithm. Linux did not invent the operating system kernel, but it moved server operating systems from mainframes to ordinary PCs. AWS did not invent virtualization, but it turned compute resources from hardware procurement into an API call. RAG did something similar for retrieval-augmented generation.

RAG’s second contribution is innovation in product form. It feeds retrieval results directly into an LLM to generate a coherent answer rather than presenting a list of links. Search engines return links that users must click, read, and synthesize on their own. RAG delegates that synthesis step to the LLM, so the user receives an answer built on retrieved results. That is a change in interaction paradigm.

But at the algorithmic level, every component in the RAG pipeline is a mature IR technique. The practical value of recognizing this is that a large body of IR engineering experience, evaluation methods, and failure lessons can be reused directly. We do not need to rediscover the best chunk size from scratch. TREC already provided guidance. We do not need to debate whether to add BM25. Two decades of IR practice already showed that sparse plus dense beats any single channel. We do not need to invent our own score fusion method. RRF is already a tested, simple, and effective solution.


Search engines are changing too: bidirectional convergence

So far one side of the story is that RAG inherited techniques from IR. The other side is that search engines are actively absorbing LLMs. This convergence is happening on three layers.

At the surface layer, AI summaries are being overlaid on top of search results. Google AI Overviews is the most obvious example. A user asks a question, and Google places an LLM-generated summary above the traditional blue-link results. The underlying retrieval pipeline has not fundamentally changed. What changed is the presentation layer.

At the retrieval layer, neural retrieval is enhancing traditional search. In 2023 Elasticsearch introduced ELSER, Elastic Learned Sparse EncodeR, which uses a BERT model to learn token weights instead of relying on BM25 term-frequency statistics. But the output remains a sparse vector compatible with inverted indexes. Users do not need to move to a vector database. They can gain semantic retrieval on top of an existing Elasticsearch cluster.

At the reasoning layer, the LLM becomes a search decision-maker. Perplexity lets an LLM decide what to search, how many times to search, and how to combine information from different sources. A user asks one question, the LLM decomposes it into subqueries, searches separately, evaluates source credibility, and then synthesizes an answer.

Search engines are adding generation from the retrieval side. RAG is introducing retrieval techniques from the generation side. The two architectures are converging.


New things created by the collision

What is truly worth watching is the new technology created by the collision between the two worlds, the products of the intersection between the IR tradition and the LLM era. Neither side could have produced these alone.

SPLADE, short for SParse Lexical AnD Expansion model, teaches BERT to write inverted indexes. It uses the Masked Language Model head of BERT to predict the probability of every vocabulary term for the input text, keeps high-probability terms as expansion terms, and converts probabilities into weights with a log-saturation function. The output is a sparse vector whose dimensionality equals the vocabulary size, with most dimensions zero. It performs implicit query expansion and learned term weighting at the same time, while staying compatible with inverted indexes and thus storable in Elasticsearch. This is a bridge between neural IR and traditional IR.
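
The weighting step can be isolated from the model itself. The sketch below assumes the BERT MLM head has already produced per-token logits over the vocabulary (the logits here are invented numbers) and applies SPLADE's ReLU, log saturation, and max-pooling to obtain the sparse vector; only nonzero dimensions are kept.

```python
import math

def splade_weights(token_logits):
    """token_logits: one {vocab_term: logit} dict per input token,
    as an MLM head would produce. Returns the sparse term->weight map:
    w_j = max over input tokens of log(1 + ReLU(logit))."""
    weights = {}
    for logits in token_logits:
        for term, logit in logits.items():
            w = math.log1p(max(logit, 0.0))   # ReLU, then log saturation
            if w > 0:
                weights[term] = max(weights.get(term, 0.0), w)  # max-pool
    return weights

# Invented logits: the input "heart attack" activates related vocabulary
# terms the text never used, which is the implicit query expansion.
logits = [
    {"heart": 3.1, "cardiac": 1.8, "banana": -2.0},
    {"attack": 2.7, "infarction": 0.9},
]
sparse = splade_weights(logits)
assert "cardiac" in sparse      # expansion term kept with a learned weight
assert "banana" not in sparse   # negative logit zeroed out by ReLU
```

Because the output is a term-to-weight map, it drops straight into an inverted index, which is the compatibility property the paragraph above describes.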

ColBERT, or Contextualized Late Interaction over BERT, finds a new balance between efficiency and quality. A standard dual encoder compresses an entire document into a single vector and loses token-level granularity. A cross-encoder preserves token interaction but is too expensive. ColBERT keeps a vector for each token instead of compressing the document into one vector. During matching, for each query token it finds the most similar document token and sums the MaxSim values. Document token vectors can be precomputed and indexed, making it orders of magnitude faster than a cross-encoder and more precise than a dual encoder.
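
MaxSim itself is a few lines once the token vectors exist. A minimal sketch with hand-made 2-dimensional "token embeddings" (real ColBERT vectors are 128-dimensional BERT outputs); vectors are assumed L2-normalized so dot product stands in for cosine similarity.

```python
def maxsim_score(query_tokens, doc_tokens):
    """ColBERT late interaction: for each query token, take the max
    dot product against all document tokens, then sum over the query."""
    total = 0.0
    for q in query_tokens:
        total += max(sum(qi * di for qi, di in zip(q, d)) for d in doc_tokens)
    return total

query = [[1.0, 0.0], [0.0, 1.0]]              # two query token vectors
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # covers both query tokens
doc_b = [[0.7, 0.7]]                          # partial match only

# doc_a matches each query token exactly, so it scores higher.
assert maxsim_score(query, doc_a) > maxsim_score(query, doc_b)
```

The document-side vectors never depend on the query, which is what makes them precomputable and indexable, unlike a cross-encoder's query-conditioned scores.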

Agentic Search is the newest direction: the LLM actively plans retrieval strategy. Traditional search and traditional RAG both struggle with questions that require multi-step reasoning, for example comparing how a research field changed over two years. Agentic search lets the LLM act as a retrieval planner that decides what to search, how many times to search, and which method to use at each step. Perplexity Deep Research and OpenAI’s deep research mode are both exploring this direction.


Academia is already discussing this explicitly

SIGIR 2024 held the first IR-RAG Workshop and explicitly stated that current efforts have undervalued the role of information retrieval within the RAG pipeline. SIGIR 2025 continued the workshop, with call-for-papers topics that included how to bring IR evaluation methods into RAG.

TREC, the standard evaluation program that has run in IR for more than 30 years, formally launched a RAG Track in 2025. The first round received 46 submissions from 12 research groups and evaluated the end-to-end performance of RAG systems on open-domain question answering. The IR community has started using its own evaluation framework to measure RAG quality systematically.

The developers of RAGFlow, an open-source RAG framework, published a reflection article in mid-2025 stating bluntly that genuine innovation in concepts and systems was notably scarce. Their observation was that the RAG field has produced many papers, but most of them only tweak existing components and offer little foundational innovation.

These voices are converging on a shared view: RAG needs to absorb more from IR, and IR also needs to engage seriously with the new paradigm introduced by LLMs. The gap between the two communities is getting smaller.


Practical advice: five things we can do today

Based on the analysis above, if we are maintaining a RAG system, the following five improvements offer the highest return on effort.

First, add a BM25 channel for hybrid search. Fuse the two result lists with RRF and use k = 60. Fuse by rank rather than weighted score. Elasticsearch, Vespa, Weaviate, and Milvus all support this natively.

Second, revisit the chunking strategy. Treat chunk size as a tunable parameter rather than a fixed constant. Splitting by semantic boundaries usually works better than splitting by a fixed token count. For long documents, consider hierarchical retrieval.

Third, ensure diversity in the recall stage. The ceiling of a reranker is constrained by candidate quality. First improve recall with multiple channels and a sensible top-k, then spend effort tuning the reranker.

Fourth, inspect the choice of vector similarity. If the corpus includes documents with large length differences, confirm that the system uses cosine similarity rather than dot product, or apply L2 normalization after embedding.

Fifth, pay attention to learned sparse retrieval. SPLADE and ELSER bridge the sparse and dense worlds and form a natural upgrade path from BM25. If you are already using Elasticsearch, ELSER can add semantic retrieval without changing the overall architecture.


A converging direction

Search engines and RAG are moving toward the same architectural pattern from opposite directions: multi-channel retrieval with sparse, dense, and other signals; neural reranking; and LLM-based generation or reasoning. Search engines start from retrieval and are gradually adding neural retrieval and generation. RAG starts from generation and is gradually adding BM25, hybrid search, and more mature retrieval evaluation.

Understanding IR fundamentals does not make us conservative. It gives us a richer toolbox for designing RAG systems. We know what problems BM25 solves, how to tune HNSW parameters, why RRF can fuse scores from different scales seamlessly, and why cascade ranking is two-stage rather than one-stage. This knowledge translates directly into better system design.

The IR field accumulated fifty years of retrieval technology. LLMs brought generation capability and semantic understanding. Their intersection is producing new directions such as SPLADE, ColBERT, and agentic search. If we understand the language of both sides, we stand right at that intersection.


References

Technical history and classic papers:
- Callan (1994), Passage-level evidence in document retrieval, SIGIR: https://dl.acm.org/doi/10.1145/188490.188560
- TREC Passage Retrieval Track (2003): https://trec.nist.gov/pubs/trec12/papers/PASSAGE.OV.pdf
- DSSM (2013), Learning Deep Structured Semantic Models, CIKM: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cikm2013_DSSM_fullversion.pdf
- Cormack et al. (2009), Reciprocal Rank Fusion, SIGIR: https://cormack.uwaterloo.ca/cormacksigir09-rrf.pdf
- Karpukhin et al. (2020), Dense Passage Retrieval, EMNLP: https://arxiv.org/abs/2004.04906

Bidirectional convergence:
- Google AI Overviews: https://developers.google.com/search/docs/appearance/ai-features
- Perplexity API: https://docs.perplexity.ai/docs/getting-started/overview.md
- Elasticsearch ELSER: https://www.elastic.co/search-labs/blog/elastic-learned-sparse-encoder-elser-retrieval-performance

Academic frontier:
- SPLADE, TOIS: https://hal.sorbonne-universite.fr/hal-04787990/file/splade_tois_REVISION_-1.pdf
- ColBERT, SIGIR 2020: https://people.eecs.berkeley.edu/~matei/papers/2020/sigir_colbert.pdf
- SIGIR 2025 IR-RAG Workshop CFP: https://easychair.org/cfp/irrag2025
- TREC RAG Track 2025: https://arxiv.org/html/2603.09891v1

Industry commentary:
- Coveo, Search Engine vs Vector Database: https://www.coveo.com/blog/search-engine-vs-vector-database/
- RAGFlow 2025 Mid-year Reflections: https://ragflow.io/blog/rag-at-the-crossroads-mid-2025-reflections-on-ai-evolution