RAG (Retrieval-Augmented Generation) powers today’s most visible AI applications — ChatGPT’s web search, Perplexity, and enterprise knowledge-base Q&A. Underneath them all is the same mechanism: turn the user’s question into a vector, find the most relevant document chunks in a vast vector database, and feed them to the LLM.
This step is called vector search. It isn’t a new problem born of AI. The search engine industry has spent decades on it — the only difference is that we used to search web pages, and now we search document passages. The underlying math is identical. At its core sits Approximate Nearest Neighbor search, or ANN.
Over the decades, the ANN field has produced an arsenal of sophisticated techniques: tree structures, hash encoding, product quantization, graph navigation. Each generation broke through the previous one’s bottleneck, driven by the demands of internet-scale search — billions of web pages, queries covering every topic imaginable, data scattered across every direction of the vector space. In that setting, a single divide-and-conquer strategy couldn’t cut it, which is why we needed increasingly elaborate recovery mechanisms.
But RAG text corpora have an underappreciated difference.
Internet search faces arbitrary web pages and arbitrary queries. RAG faces domain-specific document collections: customer support knowledge bases, product manuals, internal docs. Even with general-purpose RAG — say, Wikipedia-based Q&A — the text embedding space has a geometric property that web pages lack. A 2026 paper titled The Geometry of Consolidation formalized this: embeddings from contrastive text encoders concentrate their variance in extremely few effective dimensions. Different phrasings of the same idea — “how do I return this” and “refund process” — are packed tightly together in vector space.
Under specific metrics and conditions, the paper compared simple cluster centroids against mainstream ANN methods like PQ and HNSW. The result: on most real-world corpora, centroids came out on top. Centroids aren’t new — they’re just averaging. The conclusion comes with important caveats, which we’ll address later.
The paper gave us the key to understanding this result: when is the “cut and repair” logic essential, and when can you simply drop the redundancy without any repair?
Let’s ground this with concrete numbers. A typical enterprise RAG system might have hundreds of thousands to millions of passages. Each passage is encoded into a vector of 768 to 4096 dimensions. For every user query, the system needs to find the most cosine-similar passages in a fraction of a second.
Brute force means computing a million cosine similarities and sorting. A single machine takes tens to hundreds of milliseconds per query, collapsing under concurrent load. Hence approximation: don’t guarantee the absolute nearest neighbor, but guarantee a very close one with high probability, in far less time. Trade precision for speed.
That’s ANN. FAISS, Milvus, Pinecone, Weaviate — every vector database you use runs on some ANN method under the hood. Most developers call these libraries without caring what’s inside. Pass in vectors, get results back. Enough. But when you need to make technical choices — memory budget, 10ms latency targets, acceptable recall loss — the black box isn’t enough.
Pull back and look across all ANN methods, and one intuition runs through them: don’t search the entire space. Cut it into pieces, and search only the pieces relevant to the query. This is divide and conquer.
Divide and conquer has a trap. The pieces are discrete, but queries are continuous. Between any two adjacent pieces there’s a boundary line, and a query vector can land right on it. If you only search the piece the query falls in, you’ll miss candidates in the adjacent piece — potentially the true nearest neighbor. Partitioning creates blind spots.
So the essential difference between ANN methods isn’t whose math is prettier. It’s two things: how they cut the space, and how they recover the boundary information that cutting destroys.
What follows are four methods — just their “cut” and “repair” intuitions.
Tree methods: cutting on spatial coordinates. The earliest approach, like k-d trees. At each step, pick a dimension and draw a line splitting the space in two, recursively. When querying, walk down the tree to a leaf, then “backtrack” — check sibling partitions for closer points. Backtracking is the repair mechanism.
The problem is dimensionality. Beyond a few dozen dimensions, “draw a line to split the space” stops making mathematical sense. In high dimensions, all points are roughly equidistant. No single dimension can effectively separate the data. Backtracking explodes to brute-force levels. Tree methods excel in low dimensions but are effectively unusable for today’s hundreds-to-thousands-dimensional text vectors.
Hash methods: encoding with probability. Different approach: instead of cutting coordinate space, project vectors into a code space. Generate random hyperplanes, check which side each vector falls on, and assign a binary string. Semantically similar vectors end up with similar codes. At query time, find the query’s code and search within that hash bucket.
The boundary repair mechanism is “multiple tables”: build several independent hash tables with different random hyperplane combinations. The true nearest neighbor might land in a different bucket in any given table, but it’ll probably share a bucket with the query in at least one table. Multiple tables equal multiple chances, giving probabilistic guarantees against misses.
This is Locality-Sensitive Hashing (LSH). Its strength is strict theoretical backing: given parameters, you can calculate exactly how many tables you need for a target recall rate. The cost is memory: more tables, more storage.
Quantization methods: compressing the vector itself. Another angle. The first two methods cut “space” — what about cutting the “vector”? Slice each 1024-dimensional vector into segments of a few dozen dimensions each. Cluster each segment independently and replace floating-point values with cluster center indices. A single vector becomes a handful of short integers, shrinking storage from kilobytes to tens of bytes.
At query time, use compressed approximate distances for coarse filtering, pull a candidate set, then compute exact cosine similarity for fine ranking. Coarse filtering recovers quantization loss; fine ranking ensures boundary cases aren’t missed. This is Product Quantization (PQ), FAISS’s default compression scheme, the backbone of billion-scale retrieval.
PQ’s core assumption: every vector stays in the index, just stored more cheaply.
Graph methods: making cutting and repairing the same thing. This approach cuts neither space nor vectors. Build a multi-layer graph: upper layers have fewer nodes with sparse edges for long jumps; lower layers are dense for fine-grained positioning. A query starts at a random top-layer entry point and greedily walks downward, eventually converging on the nearest neighbor.
This is the Hierarchical Navigable Small World (HNSW), the top performer on most benchmarks today. Its elegance: the hierarchy simultaneously acts as both cutting and repair. If you stray at one layer, lower layers have more edges to help you reorient. No explicit “cut then repair” two-phase operation — navigation is both, in one.
The cost is memory. Each node stores many edges; the graph consumes more space than the raw vectors. In practice, HNSW is paired with PQ: the graph handles navigation, compressed vectors handle final distance computation.
The four methods above, however clever their cutting strategy, all spend effort repairing the boundary information that cutting destroys. The clustering approach takes the opposite stance: no repair.
Cluster all vectors by semantic similarity into groups. Keep only one centroid per cluster. At query time, compute distances to all centroids, hit the top-k clusters, and pull the original passage text from those clusters. No fine ranking. No repair step.
Intuitively, this can’t work. What about queries near cluster boundaries? What about clusters whose internal points are spread out? These are exactly the problems the first four methods spent decades solving.
The consolidation paper’s finding: on real RAG text corpora, these problems mostly don’t arise.
The reason lies in the geometry of embedding spaces. Modern text encoders are trained to pull synonymous expressions together and push unrelated ones apart. Dozens of phrasings of the same idea — regardless of wording — end up packed into a small patch on the sphere. That patch can, intuitively, be summarized by a single center point. The centroid sits in the middle, like the peak of a dome tent, covering all the points beneath it. Wherever a query lands in that region, the nearest centroid it finds is overwhelmingly likely to be the right one.
By contrast, internet-scale general search faces highly diverse data: one article on quantum mechanics, another on cooking, another on sports. Their embeddings scatter across every direction of the space. A single centroid can’t cover that. That’s when PQ compression and HNSW graph navigation become necessary.
The paper’s contribution isn’t inventing centroids — centroids predate every other method here. Its contribution is turning the “when is no repair enough?” question from a fuzzy intuition into a computable geometric criterion: measure your corpus’s effective cluster dimensions and within-cluster average distance. If you’re in the tight regime, clustering suffices. If you’re in the spread regime, reach for more complex recovery.
Where this paper’s claims end. The “centroids do better” result holds under very specific conditions. The paper’s core metric isn’t an industry-standard retrieval quality measure — it’s one that, by definition, tilts toward cluster-based methods. And the advantage on that metric doesn’t consistently translate to downstream task improvement. The paper provides no latency comparison, and centroid indices aren’t natively supported by any major vector database. So this paper doesn’t overturn decades of search engine research. Its most useful offering is a decision heuristic: for text RAG, where data distributions tend to be tight, clustering centroids are an underappreciated option — provided index bytes are constrained. Check your data’s geometry before choosing a pipeline. It doesn’t say you can throw away everything else.
Practical engineering choices come down to four questions. You don’t need to memorize the internals of every method.
First, what’s your data scale and memory budget. Under a million vectors, with comfortable memory: HNSW with full-precision vectors is the safe default. Billion-scale, memory-tight: go PQ, or HNSW plus PQ hybrid.
Second, how rigid is your recall requirement. Above 99% demands HNSW-level precision. 90% is usually good enough — PQ or LSH become viable, trading for far better compression ratios.
Third, how many effective dimensions does your data actually have. For text RAG, your embeddings are nominally 1024-dimensional but effective dimensions are often only in the teens to twenties. This is a well-replicated finding in the embedding literature. Lower effective dimensions mean tighter data, which means you can cut more aggressively. When you get a new corpus, check d_eff and within-cluster average distance first. If you’re in the tight regime, centroid aggregation may be enough — no HNSW needed.
Fourth, are you an SDK caller or do you need deep customization. For most scenarios, FAISS’s default HNSW or PQ index is already excellent. The value of understanding these intuitions is this: when you see FAISS’s index type selection screen, you have a mental map of where each option sits on the “cut” and “repair” axes, and where your data and requirements land.
The consolidation paper’s most useful message: in many scenarios where you were about to invest heavily in building a complex ANN pipeline, the conditions are already satisfied for the simple approach. Those investments might be wasted. Check geometry first, then choose.