You build a RAG pipeline. You try a few queries. The answers look reasonable. You ship it. Three weeks later, a user asks a question and the system confidently fabricates an answer from context it retrieved but misread. You have no way to know how often this happens because you never measured it.
This is the default state of most RAG systems. As of early 2026, 70% still lack systematic evaluation. The ones that work well in production all share one thing: they measure retrieval and generation independently, using metrics designed specifically for RAG.
The RAG Triad
Every RAG system has three moving parts: a query comes in, context gets retrieved, and a response gets generated. The RAG Triad, popularized by TruLens and now adopted across all major evaluation frameworks, measures the three pairwise relationships between these parts: query to context, context to response, and query to response.
Context relevance asks: did you retrieve the right documents? If a user asks about return policies and your retriever pulls up shipping information, no amount of good generation will save you. This is the signal-to-noise ratio of your retrieval step.
Faithfulness (also called groundedness) asks: is every claim in the response actually supported by the retrieved context? This is the hallucination check. The most common RAG failure mode isn't retrieving wrong documents. It's retrieving right documents and then ignoring, misinterpreting, or extrapolating beyond them. The model sounds confident either way.
Answer relevance asks: does the response actually address the user's question? A response can be perfectly grounded in the retrieved context and still be useless if it answers a different question than the one asked.
These three metrics are necessary and jointly sufficient for basic RAG evaluation. If all three score high, your system is retrieving relevant information, using it faithfully, and producing useful answers. If any one fails, you know exactly where to look.
The metrics in detail
Beyond the triad, RAGAS (the most widely cited open-source RAG evaluation framework) defines several additional metrics that matter in practice.
Context precision measures whether relevant chunks are ranked higher than irrelevant ones in your retrieval results. You might retrieve five documents, three of which are relevant, but if the relevant ones are at positions 3, 4, and 5, the model is more likely to attend to the irrelevant ones at the top. Order matters.
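The exact formulation varies by framework version, but the core of context precision is a rank-weighted average: precision@k, counted only at the ranks where a relevant chunk appears. A minimal sketch (the relevance labels would come from an LLM judge or human annotation):

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision that rewards placing relevant chunks early.

    relevance[k] is True if the chunk at rank k+1 is relevant to the query.
    """
    if not any(relevance):
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k, summed only at relevant ranks
    return score / hits
```

This makes the ordering effect from the paragraph above concrete: three relevant chunks at positions 1-3 of five score 1.0, while the same three chunks at positions 3-5 score about 0.48, even though plain precision is 0.6 in both cases.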
Context recall measures whether all the information needed to produce a correct answer was actually retrieved. You might retrieve relevant documents but miss a critical piece. This requires ground-truth reference answers to compute, making it harder to run in production but essential during development.
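In essence, context recall is claim coverage: break the ground-truth reference answer into claims, then count how many of them the retrieved context supports. A sketch of the aggregation step (in practice an LLM judge decides which claims are supported; here that judgment is passed in directly):

```python
def context_recall(reference_claims: list[str],
                   supported: set[str]) -> float:
    """Fraction of ground-truth answer claims the retrieved context supports.

    reference_claims: claims extracted from the reference answer.
    supported: the subset an LLM judge found attributable to the context.
    """
    if not reference_claims:
        return 1.0  # nothing required, so nothing was missed
    hits = sum(1 for claim in reference_claims if claim in supported)
    return hits / len(reference_claims)
```

The `reference_claims` parameter is why this metric needs ground truth: without a reference answer, there is no claim list to check coverage against.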
Noise sensitivity measures how much irrelevant context degrades generation quality. Some models handle noise gracefully. Others weave irrelevant context into confident-sounding hallucinations. If your retriever has mediocre precision, noise sensitivity tells you how much that actually hurts your outputs.
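RAGAS defines noise sensitivity in terms of incorrect claims attributable to retrieved context; a cruder but useful diagnostic is an A/B comparison: score the same queries with clean retrieval and with irrelevant chunks injected, and measure the faithfulness drop. A sketch of that comparison (the injection and scoring steps are assumed to happen upstream):

```python
def noise_degradation(clean_scores: list[float],
                      noisy_scores: list[float]) -> float:
    """Mean drop in per-query faithfulness when irrelevant chunks are
    injected into the retrieved context for the same queries."""
    if len(clean_scores) != len(noisy_scores):
        raise ValueError("score lists must be paired per query")
    deltas = [c - n for c, n in zip(clean_scores, noisy_scores)]
    return sum(deltas) / len(deltas)
```

A model that handles noise gracefully shows a degradation near zero; a large positive value means mediocre retrieval precision is actively hurting your outputs.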
LLM-as-a-Judge: the evaluation engine
All of these metrics need a way to compute scores at scale. The dominant approach in 2025-2026 is LLM-as-a-Judge: using a language model to evaluate the outputs of another language model.
This sounds circular, but it works. The Judge's Verdict benchmark (2025) evaluated 54 LLMs as judges and found that LLM-as-a-Judge reaches roughly 80% agreement with human preferences (about the same as human-to-human consistency) at 500-5000x lower cost. Of those 54 models, 27 achieved top-tier performance and 23 exhibited human-like judgment patterns.
The key insight is that judging is easier than generating. A model that might hallucinate when writing a response can still reliably detect hallucinations in someone else's response. It's the same reason a film critic doesn't need to be a director.
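The mechanics are simple: send the judge a structured prompt, get back a structured verdict, turn it into a score. The prompt wording and JSON schema below are illustrative, not from any particular framework:

```python
import json

# Doubled braces escape the JSON example so .format() leaves it intact.
FAITHFULNESS_PROMPT = """\
You are a strict evaluator. For each factual claim in the RESPONSE,
decide whether the CONTEXT supports it.
Return JSON only: {{"supported": <int>, "total": <int>}}

CONTEXT:
{context}

RESPONSE:
{response}
"""

def parse_faithfulness(judge_output: str) -> float:
    """Turn the judge's JSON verdict into a 0-1 faithfulness score."""
    verdict = json.loads(judge_output)
    if verdict["total"] == 0:
        return 1.0  # no factual claims to check against the context
    return verdict["supported"] / verdict["total"]
```

Counting supported claims rather than asking for a single holistic score is the common trick: judges are far more reliable at "is this one claim in the context?" than at assigning a global number.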
Amazon Bedrock made RAG evaluation and LLM-as-a-Judge generally available in March 2025, signaling that this approach has reached enterprise maturity. RAGAS now provides tools to align LLM judges to specific evaluation criteria, including custom rubrics calibrated against human preferences.
The frameworks
Four frameworks dominate RAG evaluation in 2026. Each takes a different philosophical approach.
RAGAS is the reference-free standard. Most of its metrics require no ground-truth annotations, so you can run them on live production queries (context recall, which needs reference answers, is the exception). It covers context precision, context recall, faithfulness, answer relevance, noise sensitivity, and now supports agentic workflow evaluation and automatic synthetic test data generation. If you're starting from zero, start here.
ARES (Stanford) takes a statistically rigorous approach. Instead of using a general-purpose LLM judge directly, ARES generates synthetic training data from your corpus, fine-tunes lightweight judge models on that data, then uses Prediction-Powered Inference with roughly 150 human-annotated examples to produce calibrated confidence intervals for its scores. If you need to tell your VP of engineering "faithfulness is 0.89 with a 95% confidence interval of [0.86, 0.92]," ARES is how you get there.
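ARES's actual Prediction-Powered Inference uses the small human-labeled set to debias the judge's estimates; that correction is beyond a short sketch, but the reporting format it enables, a score with a confidence interval rather than a bare number, can be illustrated with a plain normal-approximation CI:

```python
import math

def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean of per-example judge scores with a normal-approximation 95% CI.

    Note: this is NOT ARES's PPI estimator, which additionally corrects
    judge bias using human-annotated examples; it only shows the shape
    of an interval-based report.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)
    return mean, mean - half, mean + half
```

The point of the interval is honesty at small sample sizes: "0.89 [0.86, 0.92]" on 2,000 queries and "0.89 [0.70, 1.00]" on 20 queries are very different claims.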
DeepEval is built for CI/CD integration. It provides 50+ metrics, is compatible with pytest, and treats RAG evaluation like unit testing. You define evaluation suites with passing thresholds, run them on every commit, and block deployment if faithfulness drops below your threshold. Every metric is debuggable: you can inspect the LLM judge's reasoning to understand why a score was assigned.
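The gate itself is simple threshold logic. The sketch below mirrors the spirit of DeepEval's threshold-based assertions without using its API; the metric names and threshold values are examples, not a recommended configuration:

```python
# Example quality gate: deployment is blocked if any metric regresses
# below its threshold. Values here are illustrative, not standards.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevance": 0.80,
    "context_precision": 0.70,
}

def gate(scores: dict[str, float]) -> list[str]:
    """Return the metrics that fail their threshold; empty list = ship it."""
    return [metric for metric, threshold in THRESHOLDS.items()
            if scores.get(metric, 0.0) < threshold]
```

Wired into pytest, a non-empty return becomes a failing assertion, which is what turns evaluation from a report into a deployment gate.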
FaithJudge (Vectara, EMNLP 2025) focuses specifically on hallucination detection. It benchmarks 160+ LLMs on RAG faithfulness across summarization, question-answering, and data-to-text generation. The best judge configuration (o3-mini-high) achieves 84% balanced accuracy and 82.1% F1-macro. Vectara's HHEM-2.1 model provides probabilistic faithfulness scores: a score of 0.8 means 80% probability the response is factually consistent with context.
Multi-turn changes everything
Most RAG evaluation assumes single-turn interactions: one question, one retrieval, one response. Real users have conversations.
IBM's MTRAG benchmark (TACL 2025) is the first end-to-end human-generated multi-turn RAG benchmark. It covers 110 conversations with 842 tasks across four domains, evaluating faithfulness, appropriateness, naturalness, and completeness.
The findings are sobering. Even state-of-the-art RAG systems struggle significantly on multi-turn conversations. Performance degrades on later turns (where context accumulation causes the same attention problems described in lost-in-the-middle research). Unanswerable questions, non-standalone questions, and cross-domain queries are particularly challenging. Single-turn metrics dramatically overestimate real-world performance.
If your RAG system handles conversations, you need to evaluate on conversations. MTRAG is the benchmark to use.
Evaluation in production
The gap between "we evaluated during development" and "we evaluate continuously in production" is where most RAG systems fail.
The production evaluation stack that's emerging in 2025-2026 has three layers. Pre-deployment: generate synthetic test queries from your corpus (RAGAS or RAGEval can do this automatically); run the RAG Triad plus context precision and recall; test specifically for hallucination using FaithJudge or HHEM; and benchmark against MTRAG if you support conversations.
CI/CD integration: DeepEval with pytest defines evaluation suites as code. Quality gates block deployment if metrics regress. The evaluation code lives alongside application code and runs on every commit. 60% of new RAG deployments now include systematic evaluation from day one, up from less than 30% in early 2025.
Production monitoring: reference-free metrics (faithfulness and answer relevance via LLM-as-a-Judge) run continuously on live queries. Retrieval latency, relevance scores, and response drift detection catch unexpected changes. Braintrust captures complete execution traces from production and converts failures into test cases automatically, creating a continuous improvement loop.
The key insight is that RAG evaluation isn't a one-time activity. Your documents change. Your users' questions change. Your retrieval index drifts. A system that scored 0.92 faithfulness last month might score 0.78 this month because someone updated the knowledge base and broke three embeddings. Without continuous evaluation, you won't know until a user complains.
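A rolling-window drift check is enough to catch the "0.92 last month, 0.78 this month" scenario before a user does. A minimal sketch, assuming a reference-free faithfulness score is already computed per query (baseline, window, and tolerance values are illustrative):

```python
from collections import deque

class FaithfulnessMonitor:
    """Rolling-window drift check on per-query faithfulness scores.

    Alerts when the recent mean falls more than `tolerance` below the
    baseline established at deployment time.
    """

    def __init__(self, baseline: float, window: int = 100,
                 tolerance: float = 0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the last N queries

    def record(self, score: float) -> bool:
        """Add one score; return True if drift warrants an alert."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance
```

In practice you would alert through whatever pager or dashboard you already have; the important part is that the check runs on every query, not in a quarterly review.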
What to do today
If you have a RAG system in production with no evaluation, here's the minimum viable measurement stack. Install RAGAS. Run context relevance, faithfulness, and answer relevance on a sample of production queries. This takes an afternoon and will immediately tell you whether your system has a retrieval problem, a generation problem, or both.
If you find issues (you will), fix the worst-scoring component first. Low context relevance means your retriever needs work: better embeddings, better chunking, reranking. Low faithfulness means your generation prompt needs constraints or your model needs guardrails. Low answer relevance is usually a retrieval problem in disguise: the model is faithfully summarizing irrelevant context.
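That triage logic is mechanical enough to write down. A sketch, with an illustrative 0.8 threshold (not a standard; pick yours per use case):

```python
def triage(context_relevance: float, faithfulness: float,
           answer_relevance: float, threshold: float = 0.8) -> str:
    """Map the weakest RAG Triad score to the component to fix first."""
    diagnoses = {
        "retriever: improve embeddings, chunking, or add reranking":
            context_relevance,
        "generator: constrain the prompt or add guardrails":
            faithfulness,
        "retriever (in disguise): context looks plausible but misses the question":
            answer_relevance,
    }
    worst, value = min(diagnoses.items(), key=lambda kv: kv[1])
    return worst if value < threshold else "all triad scores above threshold"
```

The diagnosis strings are just the advice from the paragraph above encoded as return values; the useful habit is always asking which score is lowest before touching anything.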
Then add it to CI/CD. The bar for "good enough" varies by use case, but if you can't tell whether your last deployment made things better or worse, you don't have an engineering process. You have a prayer.
References
- Es et al., RAGAS: Automated Evaluation of Retrieval Augmented Generation, 2023
- Saad-Falcon et al., ARES: An Automated Evaluation Framework for RAG Systems, Stanford, 2023
- Confident AI, DeepEval: The Open-Source LLM Evaluation Framework
- Mishra et al., FaithJudge: Evaluating Faithfulness in RAG, EMNLP 2025
- Deng et al., MTRAG: A Multi-Turn Conversational RAG Benchmark, TACL 2025
- Zhu et al., RAGBench: Explainable Benchmark for RAG Systems, 2024