
You're Doing RAG Retrieval Wrong

Why your pipeline retrieves the wrong chunks — and the surprisingly simple fix.

You embedded your documents. You indexed them in a vector database. A user asks a question, you search for the nearest chunks, feed them to a language model, and get an answer. Except the answer is wrong. Not hallucinated — the model faithfully used the context you gave it. The problem is you gave it the wrong context. The retrieval step failed. This happens far more often than people admit, and the reason is almost always the same: you're comparing things that shouldn't be compared.

I.

The Promise and the Problem

The promise of RAG is simple: give a language model access to your data, and it can answer questions about it. No fine-tuning, no retraining. Just retrieve relevant chunks and let the model synthesize.

In practice, the retrieval step is where most RAG pipelines break. And it breaks silently. The model doesn't say "I got the wrong context." It reads whatever you gave it and generates a confident, well-structured, completely wrong answer.

The root cause is what I call semantic dissonance. You're comparing a question to a document chunk using vector similarity. But questions and document chunks are fundamentally different kinds of text. A question is short, interrogative, specific. A document chunk is declarative, dense, contextual. They don't land in the same part of the embedding space — even when one is the answer to the other.

It's like walking into a library and, instead of consulting the card catalog (which is organized around questions like "Where can I learn about X?"), trying to match your question directly against random paragraphs from random books. Sometimes you get lucky. Often you don't.

II.

The Embedding Space

To understand why retrieval fails, you need to understand where things live in embedding space.

An embedding model converts text into a vector — a point in high-dimensional space. The idea is that semantically similar texts end up near each other. "How do I reset my password?" should be close to "What's the password reset process?" That works. Both are questions about the same thing.

But here's the problem: "How do I reset my password?" is not necessarily close to "To reset your password, navigate to Settings > Security > Reset Password and click the reset link." One is a question. The other is an answer. They're about the same topic, but they're different kinds of text. And embedding models often encode the kind of text as strongly as the topic.

[Figure: 2D Embedding Space — Questions vs Documents. Blue circles are questions, orange squares are document chunks, and the purple diamond is the query; lines mark its 3 nearest neighbors. The query lands near the questions, not near the document that answers it.]

This is the core insight. In embedding space, questions cluster with questions and documents cluster with documents. Your query — which is a question — will naturally be closer to other questions than to the document chunk that actually contains the answer.

This is semantic dissonance. The embedding model is doing its job correctly — it's placing similar text near similar text. The problem is that "similar" means "same type of text," not "contains the answer." You're asking the embedding to do something it wasn't trained to do.

III.

Match Questions to Questions

The fix is elegant. Instead of searching your query against document chunks, search it against questions about document chunks.

Here's how: for each chunk in your corpus, use an LLM to generate 3-5 questions that the chunk answers. Index those questions. When a user query comes in, match it against the generated questions — not the raw documents. Then retrieve the document chunk associated with the best-matching question.

This works because you're now comparing apples to apples. A user question matches against synthetic questions. Both are interrogative, similar in structure, similar in length. The embedding model can actually do its job.
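In code, the index simply maps each synthetic question back to its source chunk. Here is a minimal sketch, using a toy bag-of-words embedding as a stand-in for a real embedding model; the synthetic questions are assumed to have been generated by an LLM at index time:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch is self-contained;
    # swap in a real sentence-embedding model in production.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class QuestionIndex:
    """Index synthetic questions; retrieval returns the chunk behind the best match."""

    def __init__(self):
        self._entries = []  # (question_vector, source_chunk)

    def add(self, chunk: str, questions: list[str]) -> None:
        # `questions` are the 3-5 LLM-generated questions this chunk answers.
        for q in questions:
            self._entries.append((embed(q), chunk))

    def retrieve(self, query: str, top_n: int = 3) -> list[str]:
        qv = embed(query)
        ranked = sorted(self._entries, key=lambda e: cosine(qv, e[0]), reverse=True)
        chunks = []
        for _, chunk in ranked:  # deduplicate: several questions may share a chunk
            if chunk not in chunks:
                chunks.append(chunk)
            if len(chunks) == top_n:
                break
        return chunks
```

Note that the user query is only ever compared to questions; the chunks ride along as payload and are returned once the best-matching questions are found.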

This technique is sometimes called query-to-question matching; a closely related variant is HyDE (Hypothetical Document Embeddings).[1]

HyDE, introduced by Gao et al., takes a slightly different approach: it generates a hypothetical answer to the query, then searches for documents similar to that hypothetical answer. The principle is the same — match like with like. The question-generation approach is often simpler and more reliable in practice.
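A minimal sketch of the HyDE flow, with the LLM stubbed out as a callable and the same kind of toy bag-of-words embedding standing in for a real model:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding; a real pipeline would use a sentence-embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(query: str, chunks: list[str], hypothesize) -> str:
    """HyDE: search with an LLM-drafted hypothetical answer, not the query.

    `hypothesize` is your LLM call: it drafts a plausible (possibly wrong)
    answer to the query. The draft is declarative text, so it lands near
    real document chunks in embedding space — like matches like.
    """
    draft = hypothesize(query)
    dv = embed(draft)
    return max(chunks, key=lambda c: cosine(dv, embed(c)))
```

Even a factually wrong hypothetical answer usually works: what matters is that it uses the vocabulary and register of the documents, not that it is correct.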

[Figure: Retrieval Comparison — Naive vs Question-Matched. The same query, matched against raw document chunks (naive, left) versus synthetic questions (question-matched, right); green marks relevant results, red irrelevant. The results differ dramatically.]

The difference is often dramatic. In benchmarks, question-to-question matching improves retrieval precision by 20-40% over naive chunk matching. And better retrieval means better answers downstream — the LLM can only work with what you give it.

IV.

The Embedding Matters

Even after fixing what you compare, how you compare still matters. Generic embedding models — the ones you get out of the box — are trained on general web text. They know that "bank" relates to "money" and "river." They don't know that in your fintech codebase, "default" means "loan default," not "default parameter value."

This is the domain embedding problem. In a specialized corpus, words and phrases carry domain-specific meanings. A generic embedding model conflates them.

[Figure: Embedding Space — Generic vs Domain-Specific. In generic embedding space, "default" (finance) and "default" (code) are neighbors; in domain-specific space, they separate.]

The solution: fine-tune your embedding model on your domain data, or use a model that's already been trained on similar text. This doesn't require massive datasets — even a few thousand domain-specific sentence pairs can significantly improve retrieval quality.
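As a sketch of what that fine-tuning can look like (assuming the sentence-transformers library; the data shape and model name are illustrative): positive (question, answer) pairs from your own corpus are enough, because with an in-batch-negatives loss the other pairs in each batch supply the contrast.

```python
def build_pairs(faq):
    """Turn FAQ-style records ({"question": ..., "answer": ...}) into
    (anchor, positive) training pairs for contrastive fine-tuning."""
    return [(item["question"], item["answer"]) for item in faq]

def finetune(pairs, base_model="all-MiniLM-L6-v2"):
    # Contrastive fine-tuning with sentence-transformers. With
    # MultipleNegativesRankingLoss, the other pairs in each batch act
    # as negatives, so only positive pairs are needed.
    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[q, a]) for q, a in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    return model
```

Support tickets, FAQ pages, and search logs with clicked results are all good sources of pairs; a few thousand is a reasonable starting point.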

V.

Beyond Cosine Similarity

You've fixed what you compare and how you embed it. But you're still ranking by cosine similarity alone. And cosine similarity, while useful, only measures one thing: the angle between two vectors. It doesn't know about freshness, specificity, document quality, or user preference.

Two chunks can have nearly identical cosine similarity to your query. One is a three-sentence summary from an outdated FAQ. The other is a detailed, current technical explanation. Cosine similarity can't tell the difference.

This is where reranking comes in. After your initial vector search returns the top-N candidates, a second pass scores them on additional dimensions:

- Cosine similarity (semantic match)
- Recency (newer is often better)
- Specificity (longer, more detailed chunks score higher)
- Popularity (documents that users have found helpful before)
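A minimal combined-score reranker over those signals might look like this (the candidate shape and weights are illustrative: each signal is assumed pre-normalized to [0, 1], and the weights should be tuned against your own relevance judgments):

```python
def combined_score(candidate, weights=None):
    # Weighted blend of the four signals; tune the weights on real queries.
    weights = weights or {"cosine": 0.5, "recency": 0.2,
                          "specificity": 0.2, "popularity": 0.1}
    return sum(w * candidate[signal] for signal, w in weights.items())

def rerank(candidates, top_k=5):
    # Second pass over the top-N candidates from the initial vector search.
    return sorted(candidates, key=combined_score, reverse=True)[:top_k]
```

With weights like these, a stale FAQ summary at cosine 0.83 can lose to a current, detailed chunk at 0.81 — exactly the "top result flips" behavior described above.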

[Figure: Reranking — Cosine Score vs Combined Score. Ranking by a combined score that factors in recency, specificity, and quality flips the top result relative to ranking by cosine similarity alone.]

Cross-encoder rerankers — models specifically trained to score query-document pairs — are another powerful option. They're slower than cosine similarity but dramatically more accurate, because they see the query and document together rather than comparing pre-computed vectors.

VI.

Setting Smart Thresholds

Here's a question that most RAG tutorials skip: what cosine similarity score is "good enough"?

The honest answer is: it depends entirely on your data. A score of 0.82 might be excellent for one corpus and mediocre for another. The distribution of scores matters far more than any individual score.

If you set the threshold too high, you miss relevant results (low recall). Too low, and you flood the LLM with junk (low precision). The right threshold lives at the intersection of these tradeoffs — and you can only find it by looking at your actual score distribution.

[Figure: Threshold Explorer — Precision vs Recall. A histogram of similarity scores (green bars relevant, gray irrelevant) with an adjustable threshold, showing how precision and recall trade off as the threshold moves.]

In practice, the best approach is to evaluate a sample of queries with known-good answers, plot the score distribution, and pick the threshold that maximizes F1 — or whatever metric matters most for your use case. There is no universal magic number.
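That sweep is straightforward to implement. A sketch, given (score, is_relevant) pairs from evaluation queries with known-good answers:

```python
def best_threshold(scored):
    """Pick the similarity cutoff that maximizes F1.

    `scored` is a list of (similarity, is_relevant) pairs. Only thresholds
    equal to an observed score can change the precision/recall split, so
    those are the only candidates worth testing.
    """
    total_relevant = sum(1 for _, rel in scored if rel)
    best_t, best_f1 = 0.0, -1.0
    for t in sorted({s for s, _ in scored}):
        kept = [rel for s, rel in scored if s >= t]
        tp = sum(kept)  # relevant results that survive the cutoff
        precision = tp / len(kept) if kept else 0.0
        recall = tp / total_relevant if total_relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Swap the F1 computation for whatever metric your application actually cares about — recall-heavy for research assistants, precision-heavy for anything user-facing.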

VII.

Putting It All Together

The complete pipeline looks like this:

1. Index time: For each document chunk, generate synthetic questions using an LLM. Embed both the questions and the chunks using a domain-specific embedding model.

2. Query time: Embed the user's query. Search against the synthetic questions (not the raw chunks). Retrieve the top-N associated chunks.

3. Rerank: Score the candidates using a combination of cosine similarity, recency, specificity, and optionally a cross-encoder. Reorder by combined score.

4. Threshold: Filter out any results below your empirically determined quality threshold.

5. Generate: Pass the surviving chunks to the LLM as context.
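The query-time path (steps 2 through 5) condenses into one function. A self-contained sketch, with a toy bag-of-words embedding, an illustrative rerank weighting, and every collaborator a stand-in for your real components:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words embedding standing in for a real model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer(query, index, generate, threshold=0.3, top_n=5):
    """Steps 2-5. `index` is a list of (synthetic_question, chunk, recency)
    triples built offline in step 1; `generate` is the answer LLM."""
    qv = embed(query)
    scored = [(cosine(qv, embed(q)), chunk, recency)        # step 2: match
              for q, chunk, recency in index]               #   against questions
    reranked = sorted(scored,                               # step 3: rerank
                      key=lambda s: 0.8 * s[0] + 0.2 * s[2],
                      reverse=True)[:top_n]
    context = [chunk for sim, chunk, _ in reranked
               if sim >= threshold]                         # step 4: threshold
    return generate(query, context)                         # step 5: generate
```

The 0.8/0.2 weighting and the 0.3 threshold are placeholders; both should come out of the evaluation process described in section VI.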

Each step removes a different class of errors. Question matching fixes semantic dissonance. Domain embeddings fix vocabulary confusion. Reranking fixes the "close but useless" problem. Thresholding prevents garbage from reaching the model.

None of these steps is novel in isolation. But the combination — and understanding why each matters — is what separates RAG pipelines that work from RAG pipelines that confidently produce wrong answers.

RAG isn't a search problem. It's a relevance problem. And relevance requires understanding what you're comparing, how you're comparing it, and when to stop.

Written by Danish Mohd.
AI product builder. Previously VP Engineering at Pixis AI.
Based on an earlier post. Rebuilt as an interactive explainer.