Understanding RAG Part IV: Retrieval-augmented generation evaluation framework

Written by
Silas Grey
Updated on: June 27, 2025
Recommendation

Explore the RAG evaluation framework in depth and master the key to improving the performance of large language models.

Core content:
1. How RAG enhances large language models through retrievers
2. Overview of frameworks for evaluating RAG systems: DeepEval, MLflow LLM Evaluate, RAGAs
3. Detailed explanation of RAGAs evaluation metrics: retrieval and generation performance metrics and their applications


Retrieval-augmented Generation (RAG) plays a key role in extending the capabilities of standalone Large Language Models (LLMs) and overcoming many of their limitations. By integrating a retriever, RAG enhances the relevance and factual accuracy of responses: it leverages external knowledge sources (e.g., vector document stores) in real time and adds relevant contextual information to the original user query or prompt, which is then passed to the LLM for output generation.

For those who dive into the RAG space, a natural question arises: How do we evaluate these far-from-simple systems?

To this end, several frameworks exist. DeepEval provides more than 14 evaluation metrics covering criteria such as hallucination and faithfulness. MLflow LLM Evaluate is known for its modularity and simplicity and can be used within custom evaluation pipelines. RAGAs focuses on evaluating RAG pipelines and provides metrics such as faithfulness and contextual relevancy that can be combined into an overall RAGAs quality score.

Here is a summary of the three frameworks:

Understanding RAGAs

RAGAs (short for Retrieval-Augmented Generation Assessment) is considered one of the best toolkits for evaluating LLM applications. It can evaluate the performance of the RAG system components (i.e., the retriever and the generator) in a straightforward way, either individually or jointly as a single pipeline.

A core element of RAGAs is its Metrics-Driven Development (MDD) approach, which relies on data to make informed system decisions. MDD requires continuous monitoring of key metrics to provide clear insights into an application's performance. In addition to allowing developers to measure their LLM/RAG applications and run metric-assisted experiments, the MDD approach also supports reproducible evaluation.

RAGAs Components

  • Prompt object: A component that defines the structure and content of a prompt used to elicit a response from the language model. Consistent, clear prompts support accurate evaluation.
  • Evaluation sample: A single data instance consisting of a user query, a generated response, and a reference response or ground truth (the kind of reference that classic LLM metrics such as ROUGE, BLEU, and METEOR also rely on). It is the basic unit for evaluating the performance of a RAG system.
  • Evaluation dataset: A set of evaluation samples used to evaluate the performance of the entire RAG system more systematically against various metrics. Its purpose is to assess the effectiveness and reliability of the system comprehensively; see the sketch after this list.
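
To make the sample and dataset concepts concrete, here is a minimal sketch in plain Python. The field names (question, contexts, answer, ground_truth) follow a common RAGAs-style column layout and are an illustrative assumption, not a prescribed schema.

```python
# Minimal sketch of an evaluation sample and dataset in plain Python.
# Field names mirror a common RAGAs-style layout; treat them as an
# illustrative assumption rather than a fixed schema.
from typing import List, TypedDict


class EvalSample(TypedDict):
    question: str          # original user query
    contexts: List[str]    # documents returned by the retriever
    answer: str            # response produced by the generator
    ground_truth: str      # reference answer used as the gold standard


eval_dataset: List[EvalSample] = [
    {
        "question": "When was the Eiffel Tower completed?",
        "contexts": ["The Eiffel Tower was completed in 1889 in Paris."],
        "answer": "It was completed in 1889.",
        "ground_truth": "The Eiffel Tower was completed in 1889.",
    },
    # ... more samples make the evaluation more systematic and reliable
]
```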

RAGAs Metrics

RAGAs lets you configure the metrics of a RAG system by defining specific metrics for the retriever and the generator and blending them into an overall RAGAs score.
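
As a rough illustration of what metric selection looks like in code, the sketch below assumes the ragas 0.1-style API (an evaluate() function plus metric objects from ragas.metrics) together with a Hugging Face Dataset. Exact module paths, column names, and judge-model configuration vary between versions, so treat this as a sketch rather than a definitive recipe.

```python
# Sketch only: assumes a ragas 0.1-style API; names and columns may differ
# across versions, and evaluate() needs a configured judge LLM (e.g. an
# OpenAI API key) to score the samples.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,  # retriever-side metric
    context_recall,     # retriever-side metric
    faithfulness,       # generator-side metric
    answer_relevancy,   # generator-side metric
)

data = {
    "question": ["When was the Eiffel Tower completed?"],
    "contexts": [["The Eiffel Tower was completed in 1889 in Paris."]],
    "answer": ["It was completed in 1889."],
    "ground_truth": ["The Eiffel Tower was completed in 1889."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, context_recall, faithfulness, answer_relevancy],
)
print(result)  # one score per selected metric for the dataset
```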

Let’s look at some of the most common metrics for retrieval and generation.

1. Retrieval performance metrics:

  • Context Recall:
    Context recall measures how many of the documents relevant to answering the question actually appear among the retrieved Top-K results. It is calculated by dividing the number of relevant documents retrieved by the total number of relevant documents in the knowledge base.
  • Contextual Precision:
    How many of the retrieved documents are relevant to the prompt, rather than noise? Contextual precision answers this question and is calculated by dividing the number of relevant documents retrieved by the total number of documents retrieved. A small computational sketch of both ratios follows this list.
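
Given the definitions above, both metrics reduce to simple ratios over sets of documents. The sketch below is framework-agnostic plain Python; the document IDs are made up for illustration.

```python
# Plain-Python sketch of the two ratios described above.
def context_recall(retrieved: set, relevant: set) -> float:
    """Relevant documents retrieved / all relevant documents."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def context_precision(retrieved: set, relevant: set) -> float:
    """Relevant documents retrieved / all retrieved documents."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


# Hypothetical example: the retriever returned 4 documents, 3 of which
# are among the 5 documents actually relevant to the question.
retrieved_ids = {"d1", "d2", "d3", "d9"}
relevant_ids = {"d1", "d2", "d3", "d4", "d5"}

print(context_recall(retrieved_ids, relevant_ids))     # 3 / 5 = 0.6
print(context_precision(retrieved_ids, relevant_ids))  # 3 / 4 = 0.75
```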

2. Generation performance metrics:

  • Faithfulness:
    Faithfulness evaluates whether the generated response is consistent with the retrieved evidence, in other words, the factual grounding of the response. It is usually assessed by comparing the response with the retrieved documents.
  • Answer Relevancy:
    This metric determines how relevant the generated response is to the query. It is usually computed from human judgment or through automatic semantic similarity scoring (e.g., cosine similarity), as sketched below.
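
As a rough sketch of the automatic variant, the snippet below scores relevancy as the cosine similarity between an embedding of the query and an embedding of the response. The embed() function is a hypothetical stand-in for whatever embedding model you use; here it returns deterministic random vectors purely to keep the sketch runnable.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# embed() is a hypothetical placeholder for an embedding model call;
# it fakes embeddings with seeded random vectors for illustration only.
def embed(text: str, dim: int = 384) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)


query = "When was the Eiffel Tower completed?"
response = "It was completed in 1889."
print(cosine_similarity(embed(query), embed(response)))  # higher = more relevant
```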

As example metrics that connect the two aspects of the RAG system (retrieval and generation), we have:

  • Context Utilization:
    This evaluates how effectively the RAG system uses the retrieved context when generating its response. Even if the retriever obtains excellent context (high precision and recall), a poorly performing generator may fail to use it effectively; context utilization is meant to capture this nuance.

In the RAGAs framework, individual metrics are combined to calculate an overall RAGAs score, which comprehensively quantifies the performance of the RAG system. Calculating this score involves selecting the relevant metrics and computing them, normalizing them to the same range (usually 0-1), and then taking a weighted average. The weights depend on the priorities of the use case. For example, in a system that requires a high degree of factual accuracy, especially when handling specifics such as data, dates, or events, faithfulness and precision should be weighted more heavily than raw recall or retrieval speed; prioritizing factual grounding in this way makes the output more reliable and reduces the risk of spreading false information.
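
Here is a minimal sketch of that weighting step in plain Python. The metric values and weights are illustrative assumptions, with the weights skewed toward faithfulness for a factual-accuracy-critical use case.

```python
# Sketch of combining normalized metric scores (all already in [0, 1])
# into a single weighted RAG quality score. Values and weights are
# illustrative assumptions, not outputs of a real evaluation run.
scores = {
    "context_precision": 0.82,
    "context_recall": 0.74,
    "faithfulness": 0.91,
    "answer_relevancy": 0.68,
}

# Heavier weight on faithfulness for a use case where factual accuracy
# matters more than fast or exhaustive retrieval.
weights = {
    "context_precision": 0.2,
    "context_recall": 0.2,
    "faithfulness": 0.4,
    "answer_relevancy": 0.2,
}

overall = sum(scores[m] * weights[m] for m in scores) / sum(weights.values())
print(f"Overall score: {overall:.3f}")
```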

Summary

This article introduced and outlined RAGAs: a popular evaluation framework for systematically measuring multiple aspects of RAG system performance from the perspectives of information retrieval and text generation. Understanding the framework's key elements is the first step toward applying it in practice to build high-performance RAG applications.