Recommendation
Master automated evaluation techniques for a RAG knowledge question-answering system to improve its accuracy and reliability.
Core content:
1. Manual evaluation and AI automated evaluation methods of the RAG question-answering system
2. The core concepts and evaluation dimensions of the ragas framework
3. Detailed steps to use the ragas framework to implement automated evaluation of the RAG system
Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)
In the previous article, I built a RAG local knowledge question-and-answer system using pure code. While using it, I found that:
1) If the local documents are of relatively high quality, the answers are quite good.
2) Once the knowledge base documents themselves are of poor quality, or there are too many documents and similar knowledge is scattered across different chunks, the answers become skewed.
So how do you test the output? One way is to have someone familiar with the material evaluate it manually; another is to let a large AI model do the evaluation automatically. This article introduces how to use the ragas framework to automatically evaluate the responses of a RAG system. As always, let's look at the final result first.
The "credit business" knowledge base trained in the previous article is used for the evaluation; the set of evaluation questions can be customized. There are many articles online that introduce the ragas framework, so I won't go into detail here and will only cover a few important concepts we need in practice. Official ragas documentation: https://docs.ragas.io/en/stable/concepts/
1. Core concepts of ragas
1) The evaluation dimensions of ragas are split into a retrieval stage and a generation stage.
2) Retrieval-stage metrics include context relevance and context recall; in plain terms, they measure whether the system "finds the right information".
3) Generation-stage metrics include faithfulness and answer relevance; in plain terms, they measure whether the answer "stays on topic" and "doesn't make things up".

2. Implementation process
1) Import the ragas Python package
2) Prepare the evaluation data
3) Construct the evaluation parameters
4) Run the evaluation
5) Output the evaluation report
Step 1) deserves special attention: the ragas 0.2+ API is written quite differently from earlier versions, so check which version you have installed (see the sketch after the dataset code below).

3. Prepare the evaluation data
1) Evaluation data content: mainly the user questions, the answers generated by the large model, the retrieved contexts, and the ground-truth answers (the human-written answers can be omitted).
2) Obtaining the evaluation data: it can be produced with the program we built before, which is ready to use out of the box (see the previous article "RAG creates a personal local knowledge question-and-answer system, a plug-in assistant for workplace people, with no hardware requirements!"). The data can be taken from the backend logs or the frontend chat box, or collected programmatically; it is recommended to save it locally for repeated use.
The core code for collecting it programmatically is as follows:
- eval_questions = []: you can set multiple questions;
- answers = []: for each question, the answer given by the large model, i.e. the result returned by the question-answering interface, taken from the response returned by the RAG query;
- contexts = []: the retrieved contexts, taken from response.source_nodes returned by the RAG query.

```python
# =================== Dataset function ==================
def prepare_eval_dataset():
    """Prepare the evaluation dataset (uncomment the call for the first run)."""
    knowledge_base_id = "credit business"
    embedding_model_id = "huggingface_bge-large-zh-v1.5"
    eval_questions = ["What are the special circumstances for credit approval?"]
    answers, contexts = [], []
    for q in eval_questions:
        try:
            query_engine = utils.load_vector_index(
                knowledge_base_id, embedding_model_id
            ).as_query_engine(llm=DeepSeek_llm)
            response = query_engine.query(q)
            answers.append(response.response.strip())
            contexts.append([node.text for node in response.source_nodes])
        except Exception as e:
            logger.error(f"Failed to generate answer: {str(e)}")
            answers.append("")
            contexts.append([])
    eval_dataset = Dataset.from_dict({
        "question": eval_questions,
        "answer": answers,
        "contexts": contexts
    })
    eval_dataset.save_to_disk("eval_dataset")
    logger.info("Evaluation dataset saved")
```
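As noted above, the ragas API changed noticeably at 0.2, so it is worth confirming your installation before running the dataset function. Below is a minimal sketch of the version check and the imports the snippet above assumes; `utils.load_vector_index`, `DeepSeek_llm`, and `logger` come from the question-answering program built in the previous article and are placeholders here.

```python
# Install first (shell): pip install ragas datasets
import ragas
print(ragas.__version__)  # the 0.2+ API is written quite differently from earlier releases

# Imports assumed by prepare_eval_dataset() above (a sketch; adapt to your project):
from datasets import Dataset, load_from_disk  # HuggingFace datasets: build and reload the eval set

# These names come from the previous article's program and are not defined here:
#   utils.load_vector_index(knowledge_base_id, embedding_model_id) -> LlamaIndex vector index
#   DeepSeek_llm -> the LLM passed to as_query_engine()
#   logger       -> the program's logger

prepare_eval_dataset()  # first run: builds and saves the "eval_dataset" folder to disk
```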
4. Construct the evaluation parameters
- LLMContextPrecisionWithoutReference: evaluates context precision without requiring a reference answer
- AnswerRelevancy: answer relevance

```python
metrics = [
    ContextRelevance(llm=ragas_llm),
    Faithfulness(llm=ragas_llm),
    AnswerRelevancy(llm=ragas_llm)
]
```
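The snippet above assumes a `ragas_llm` object has already been created; that code is not shown in this article. As one possible sketch (an assumption, not the author's setup), in ragas 0.2+ you can wrap a LangChain-compatible chat model with `LangchainLLMWrapper` and import the metric classes from `ragas.metrics`:

```python
# Hypothetical construction of ragas_llm -- not the author's code; adjust to your own LLM setup.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
# Metric classes used above (in recent ragas versions they live in ragas.metrics;
# check your version if the import fails):
from ragas.metrics import ContextRelevance, Faithfulness, AnswerRelevancy

chat_model = ChatOpenAI(
    model="deepseek-chat",                 # assumed model name
    base_url="https://api.deepseek.com",   # assumed endpoint
    api_key="YOUR_API_KEY",                # replace with your key
)
ragas_llm = LangchainLLMWrapper(chat_model)
```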
5. Run the evaluation
After preparing all of the above, simply call the evaluate method of ragas. The core code is as follows:

```python
# Load the dataset
try:
    eval_dataset = load_from_disk("eval_dataset")
    logger.info(f"Successfully loaded the dataset | Number of samples: {len(eval_dataset)}")
except Exception as e:
    logger.error(f"Failed to load the dataset: {str(e)}")
    exit(1)

# Execute the evaluation
try:
    result = evaluate(
        eval_dataset,
        metrics=metrics,
        llm=ragas_llm,
        raise_exceptions=False
        # , timeout=300
    )
except Exception as e:
    logger.critical(f"Evaluation process terminated abnormally: {str(e)}")
    exit(1)
```
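A quick note on the call above: with `raise_exceptions=False`, ragas logs scoring failures instead of aborting the whole run, and samples that could not be scored typically show up as missing values. Before formatting the report, you can sanity-check the returned result like this (a small illustrative addition, not from the original code):

```python
# Quick sanity check of the evaluation result (illustrative).
print(result)                # aggregate score per metric
df = result.to_pandas()      # one row per question, one column per metric
print(df.head())             # inspect individual questions; unscored rows appear as NaN
```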
6. Output the evaluation report
The results can be printed or exported in markdown format. The core code is as follows:

```python
# Defensive handling of the results
logger.info("\n" + " Evaluation Report ".center(50, "="))
score_map = {
    'context_relevance': 0.0,
    'faithfulness': 0.0,
    'answer_relevancy': 0.0
}
for key in score_map.keys():
    if key in result:
        score_map[key] = result[key].mean(skipna=True)
logger.info(f"Context relevance: {score_map['context_relevance']:.2%}")
logger.info(f"Answer faithfulness: {score_map['faithfulness']:.2%}")
logger.info(f"Answer relevance: {score_map['answer_relevancy']:.2%}")
logger.info("\nDetailed results:")
print(result.to_pandas().to_markdown(index=False))
```
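Beyond printing, the detailed results can also be persisted so that different runs can be compared later. An optional addition (file names are arbitrary; pandas is already available since `to_pandas()` is used above):

```python
# Optional: save the per-question results for later comparison (not in the original article).
df = result.to_pandas()
df.to_csv("ragas_eval_results.csv", index=False, encoding="utf-8-sig")  # utf-8-sig keeps Chinese text readable in Excel
with open("ragas_eval_report.md", "w", encoding="utf-8") as f:
    f.write(df.to_markdown(index=False))
```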
That's it for today's ragas automated evaluation. Simple as it is, it matters: just as any system needs to be tested before it goes into production, a RAG system needs to be evaluated before you rely on its answers. I hope you find it useful~