Recommendation
Master automated evaluation techniques for a RAG knowledge question-answering system to improve its accuracy and reliability.
Core content:
1. Manual evaluation and AI automated evaluation methods of the RAG question-answering system
2. The core concepts and evaluation dimensions of the ragas framework
3. Detailed steps to use the ragas framework to implement automated evaluation of the RAG system
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
In the previous article, I built a RAG local knowledge question-and-answer system using pure code. While using it, I found that:
1) If the local documents are of relatively high quality, the answers are quite good.
2) Once the knowledge base documents themselves are of poor quality, or there are too many documents and similar knowledge is scattered across different chunks, the answers become skewed.
So how do you test the output? One way is to have someone familiar with the material evaluate it manually; another is to let a large AI model do the evaluation automatically. This article introduces how to use the ragas framework to automatically evaluate the responses of a RAG system. As always, let's look at the final result first.
The "credit business" knowledge base trained in the previous article is used for the evaluation; the set of evaluation questions can be customized. There are many articles online that introduce the ragas framework, so I won't go into detail here and will only cover a few important concepts we need in practice. Official ragas documentation: https://docs.ragas.io/en/stable/concepts/
1. Core concepts of ragas
1) The evaluation dimensions of ragas are split into a retrieval stage and a generation stage.
2) Retrieval-stage metrics include context relevance and context recall; in plain terms, they measure whether the system "finds the right information".
3) Generation-stage metrics include faithfulness and answer relevance; in plain terms, they measure whether the answer "stays on topic" and "doesn't make things up".

2. Implementation process
1) Import the ragas Python package
2) Prepare the evaluation data
3) Construct the evaluation parameters
4) Run the evaluation
5) Output the evaluation report
Step 1) deserves special attention: the ragas 0.2+ API is written quite differently from earlier versions, so check which version you have installed (see the sketch after the dataset code below).

3. Prepare the evaluation data
1) Evaluation data content: mainly the user questions, the answers generated by the large model, the retrieved contexts, and the ground-truth answers (the human-written answers can be omitted).
2) Obtaining the evaluation data: it can be produced with the program we built before, which is ready to use out of the box (see the previous article "RAG creates a personal local knowledge question-and-answer system, a plug-in assistant for workplace people, with no hardware requirements!"). The data can be taken from the backend logs or the frontend chat box, or collected programmatically; it is recommended to save it locally for repeated use.
The core code for collecting it programmatically is as follows:
- eval_questions = []: you can set multiple questions;
- answers = []: for each question, the answer given by the large model, i.e. the result returned by the question-answering interface, taken from the response returned by the RAG query;
- contexts = []: the retrieved contexts, taken from response.source_nodes returned by the RAG query.

```python
# =================== Dataset function ==================
def prepare_eval_dataset():
    """Prepare the evaluation dataset (uncomment the call for the first run)."""
    knowledge_base_id = "credit business"
    embedding_model_id = "huggingface_bge-large-zh-v1.5"
    eval_questions = ["What are the special circumstances for credit approval?"]
    answers, contexts = [], []
    for q in eval_questions:
        try:
            query_engine = utils.load_vector_index(
                knowledge_base_id, embedding_model_id
            ).as_query_engine(llm=DeepSeek_llm)
            response = query_engine.query(q)
            answers.append(response.response.strip())
            contexts.append([node.text for node in response.source_nodes])
        except Exception as e:
            logger.error(f"Failed to generate answer: {str(e)}")
            answers.append("")
            contexts.append([])
    eval_dataset = Dataset.from_dict({
        "question": eval_questions,
        "answer": answers,
        "contexts": contexts
    })
    eval_dataset.save_to_disk("eval_dataset")
    logger.info("Evaluation dataset saved")
```
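As noted above, the ragas API changed noticeably at 0.2, so it is worth confirming your installation before running the dataset function. Below is a minimal sketch of the version check and the imports the snippet above assumes; `utils.load_vector_index`, `DeepSeek_llm`, and `logger` come from the question-answering program built in the previous article and are placeholders here.

```python
# Install first (shell): pip install ragas datasets
import ragas
print(ragas.__version__)  # the 0.2+ API is written quite differently from earlier releases

# Imports assumed by prepare_eval_dataset() above (a sketch; adapt to your project):
from datasets import Dataset, load_from_disk  # HuggingFace datasets: build and reload the eval set

# These names come from the previous article's program and are not defined here:
#   utils.load_vector_index(knowledge_base_id, embedding_model_id) -> LlamaIndex vector index
#   DeepSeek_llm -> the LLM passed to as_query_engine()
#   logger       -> the program's logger

prepare_eval_dataset()  # first run: builds and saves the "eval_dataset" folder to disk
```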
4. Construct the evaluation parameters
- LLMContextPrecisionWithoutReference: evaluates context precision without requiring a reference answer
- AnswerRelevancy: answer relevance

```python
metrics = [
    ContextRelevance(llm=ragas_llm),
    Faithfulness(llm=ragas_llm),
    AnswerRelevancy(llm=ragas_llm)
]
```
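The snippet above assumes a `ragas_llm` object has already been created; that code is not shown in this article. As one possible sketch (an assumption, not the author's setup), in ragas 0.2+ you can wrap a LangChain-compatible chat model with `LangchainLLMWrapper` and import the metric classes from `ragas.metrics`:

```python
# Hypothetical construction of ragas_llm -- not the author's code; adjust to your own LLM setup.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
# Metric classes used above (in recent ragas versions they live in ragas.metrics;
# check your version if the import fails):
from ragas.metrics import ContextRelevance, Faithfulness, AnswerRelevancy

chat_model = ChatOpenAI(
    model="deepseek-chat",                 # assumed model name
    base_url="https://api.deepseek.com",   # assumed endpoint
    api_key="YOUR_API_KEY",                # replace with your key
)
ragas_llm = LangchainLLMWrapper(chat_model)
```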
5. Run the evaluation
After preparing all of the above, simply call the evaluate method of ragas. The core code is as follows:

```python
# Load the dataset
try:
    eval_dataset = load_from_disk("eval_dataset")
    logger.info(f"Successfully loaded the dataset | Number of samples: {len(eval_dataset)}")
except Exception as e:
    logger.error(f"Failed to load the dataset: {str(e)}")
    exit(1)

# Execute the evaluation
try:
    result = evaluate(
        eval_dataset,
        metrics=metrics,
        llm=ragas_llm,
        raise_exceptions=False
        # , timeout=300
    )
except Exception as e:
    logger.critical(f"Evaluation process terminated abnormally: {str(e)}")
    exit(1)
```
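A quick note on the call above: with `raise_exceptions=False`, ragas logs scoring failures instead of aborting the whole run, and samples that could not be scored typically show up as missing values. Before formatting the report, you can sanity-check the returned result like this (a small illustrative addition, not from the original code):

```python
# Quick sanity check of the evaluation result (illustrative).
print(result)                # aggregate score per metric
df = result.to_pandas()      # one row per question, one column per metric
print(df.head())             # inspect individual questions; unscored rows appear as NaN
```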
6. Output the evaluation report
The results can be printed or exported in markdown format. The core code is as follows:

```python
# Defensive handling of the results
logger.info("\n" + " Evaluation Report ".center(50, "="))
score_map = {
    'context_relevance': 0.0,
    'faithfulness': 0.0,
    'answer_relevancy': 0.0
}
for key in score_map.keys():
    if key in result:
        score_map[key] = result[key].mean(skipna=True)
logger.info(f"Context relevance: {score_map['context_relevance']:.2%}")
logger.info(f"Answer faithfulness: {score_map['faithfulness']:.2%}")
logger.info(f"Answer relevance: {score_map['answer_relevancy']:.2%}")
logger.info("\nDetailed results:")
print(result.to_pandas().to_markdown(index=False))
```
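Beyond printing, the detailed results can also be persisted so that different runs can be compared later. An optional addition (file names are arbitrary; pandas is already available since `to_pandas()` is used above):

```python
# Optional: save the per-question results for later comparison (not in the original article).
df = result.to_pandas()
df.to_csv("ragas_eval_results.csv", index=False, encoding="utf-8-sig")  # utf-8-sig keeps Chinese text readable in Excel
with open("ragas_eval_report.md", "w", encoding="utf-8") as f:
    f.write(df.to_markdown(index=False))
```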
That's it for today's ragas automated evaluation. Simple as it is, it matters: just as any system needs to be tested before it goes into production, a RAG system needs to be evaluated before you rely on its answers. I hope you find it useful~