Key knowledge points for large model application evaluation

Master the key skills for evaluating large model applications and improve the efficiency of enterprise knowledge management.
Core content:
1. The role and importance of RAG technology in enterprise private large model deployments
2. Quantitative methods and indicators for evaluating RAG effectiveness
3. Implementation details of retrieval phase evaluation, generation phase evaluation, and overall answer quality evaluation
More and more companies now deploy private large models and use Retrieval-Augmented Generation (RAG) to attach proprietary knowledge bases to them, so that the model can retrieve the company's private-domain knowledge in real time and give more accurate, reliable answers. Building a RAG application, however, is not a one-off effort: it is a continuous, iterative process in which the RAG pipeline must be optimized again and again to meet the enterprise's actual business needs. To measure how effective a RAG application is, we need to evaluate it quantitatively. RAG application evaluation differs from traditional functional testing: functional testing asks whether a feature works "correctly", while RAG evaluation asks whether the results are "good". Measurement itself is not the end goal; the point is to use these indicators to identify concrete directions for optimizing the application.
So how do we evaluate RAG applications? The academic community has proposed a wide variety of evaluation methods and indicators, which can seem overwhelming at first glance. In fact, there is no need to get bogged down in each specific method. We can start from the basic concept of RAG and ask which aspects need to be evaluated. RAG stands for "Retrieval-Augmented Generation"; as the name suggests, it consists of two parts, "retrieval" and "generation". We can therefore evaluate these two parts separately and then evaluate the pipeline as a whole. This gives us three categories of RAG evaluation:
Retrieval phase evaluation
Generation Phase Evaluation
Overall answer quality assessment
Retrieval phase evaluation
The main task of the retrieval phase is to find documents related to the user's question. Therefore, we are concerned about whether the retrieved content fragments are "accurate" and "complete", which can be divided into the following two sub-indicators:
Context Precision: evaluates whether the content relevant to the correct answer ranks high in the retrieval results and accounts for a high proportion of them (i.e., the signal-to-noise ratio). In plain words, it checks whether the retrieved content is "accurate".
Context Recall: evaluates how many of the reference passages relevant to the standard answer are successfully retrieved. A higher score means less relevant information has been missed. In plain words, it checks whether the retrieved content fragments are "complete".
Let's take a concrete example:
User question : "What are the test case design methods for functional testing?"
Known standard answer: "Functional test case design methods include equivalence class partitioning, boundary value analysis, error guessing, decision table, cause-and-effect diagram, orthogonal experiment, scenario method and flow chart, etc."
Related documents in the Knowledge Base :
1. "The basic methods of functional test case design are: equivalence class partitioning method, boundary value analysis method and error guessing method." 2. "In addition to the above basic methods, there are some combinatorial logic methods for functional test case design, including decision table method, cause-effect diagram method and orthogonal experiment method."
3. “In addition, there is a category called scenario and process method, which includes scenario method and flowchart method.” RAG Search Results :
1. "The basic methods of functional test case design are: equivalence class partitioning method, boundary value analysis method and error guessing method."
2. "In addition to the above basic methods, there are some combinatorial logic methods for functional test case design, including decision table method, cause-effect diagram method and orthogonal experiment method."
3. "In performance testing, benchmark testing is essential. On this basis, we need to conduct stress testing, load testing, and stability testing."
4. “Functional correctness testing focuses on whether the output results of the software meet expectations, with the aim of ensuring that the software operates as defined in the requirements specification; whereas functional suitability testing focuses on evaluating whether the software functions meet the needs and expectations of users, that is, whether the functions provided by the software are appropriate and useful to users.”
In this search result, we can see:
Context Precision :
The first two retrieved results are directly related to the standard answer, while the last two are not, so the proportion of relevant fragments is 50%, i.e., the context precision is 50%.
Context Recall :
The knowledge base contains 3 documents relevant to the standard answer, 2 of which were successfully retrieved, so the context recall is 67%.
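To make the arithmetic concrete, here is a minimal sketch of this simplified calculation in Python. It is only a stand-in: the chunk texts are placeholders, relevance is decided by exact membership rather than by a human or LLM judge, and frameworks such as RAGAS additionally weight precision by rank.

def context_precision(retrieved, relevant):
    # Proportion of retrieved chunks that are relevant (rank-agnostic simplification).
    hits = sum(1 for chunk in retrieved if chunk in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def context_recall(retrieved, relevant):
    # Proportion of relevant knowledge-base chunks that were actually retrieved.
    hits = sum(1 for chunk in relevant if chunk in retrieved)
    return hits / len(relevant) if relevant else 0.0

relevant = ["basic methods chunk", "combinatorial methods chunk", "scenario/process chunk"]
retrieved = ["basic methods chunk", "combinatorial methods chunk",
             "performance testing chunk", "functional correctness/suitability chunk"]
print(context_precision(retrieved, relevant))  # 0.5
print(context_recall(retrieved, relevant))     # 0.666... (about 67%)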
Generation Phase Evaluation
In the generation phase, the large language model generates an answer based on the retrieved content. Here we care about whether the generated answer is faithful to the retrieved references and relevant to the question, which gives the following two sub-indicators:
Faithfulness: evaluates the factual consistency of the generated answer with the retrieved references and detects "hallucination" (i.e., whether everything the model generates can actually be found in the retrieved content). The usual implementation is to decompose the generated answer into multiple statements and check whether each statement is supported by the retrieved content, as sketched after this list.
Answer Relevancy: evaluates how relevant the generated answer is to the user's question, i.e., whether the answer actually addresses the question, drifts off topic, or is unrelated to it. One implementation is to compute the semantic similarity between the question and the answer and judge whether the answer covers the core points of the question; this can be measured by manual scoring or by indicators such as cosine similarity and BERTScore.
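As an illustration of the faithfulness calculation described above, the sketch below splits an answer into statements and checks each one against the retrieved contexts. The support test here is a naive word-overlap heuristic chosen only to keep the example self-contained; in practice this judgment is usually delegated to a large language model.

def is_supported(statement, contexts, threshold=0.6):
    # Naive stand-in for an LLM judgment: a statement counts as supported if most of
    # its words appear in at least one retrieved context.
    words = set(statement.lower().split())
    if not words:
        return False
    best = max((len(words & set(ctx.lower().split())) / len(words) for ctx in contexts), default=0.0)
    return best >= threshold

def faithfulness(statements, contexts):
    # Fraction of answer statements that are supported by the retrieved contexts.
    supported = sum(1 for s in statements if is_supported(s, contexts))
    return supported / len(statements) if statements else 0.0

statements = ["Zhang Wei is an engineer", "Zhang Wei works in the teaching and research department"]
contexts = ["Zhang Wei, engineer of the teaching and research department, is in charge of curriculum development"]
print(faithfulness(statements, contexts))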
Overall answer quality assessment
Overall, our main concern is the ultimate indicator, Answer Correctness, which combines the effects of the retrieval and generation stages to evaluate whether the RAG application's answers meet the requirements. Intuitively, a correct answer should satisfy two conditions: it is semantically close enough to the standard answer, and it is consistent with objective facts. Based on this, we can split answer correctness into the following two sub-indicators:
Semantic Similarity
Evaluates the semantic similarity between the generated answer and the standard answer, i.e., whether the answer is close enough to the standard answer in meaning. The indicator is obtained by embedding the answer and the standard answer (ground_truth) as text vectors and then computing the similarity of the two vectors. There are many ways to calculate vector similarity, such as cosine similarity, Euclidean distance and Manhattan distance; cosine similarity is the most commonly used.
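A minimal sketch of this step, assuming a hypothetical embed() function that returns a sentence vector (in the RAGAS example later in this article, that role is played by the DashScopeEmbeddings model); cosine similarity itself is just a normalized dot product:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors, in [-1, 1].
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# answer_vec = embed(answer)              # hypothetical embedding call
# ground_truth_vec = embed(ground_truth)  # hypothetical embedding call
# semantic_similarity = cosine_similarity(answer_vec, ground_truth_vec)
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.5, 1.0]))  # toy vectors, about 0.94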
Factual Accuracy
Semantic similarity alone is not enough, because the large model may simply parrot the wording and produce an answer that looks similar to the standard answer but actually contradicts objective facts. We therefore also need to assess factual accuracy: an indicator that measures the difference in factual content between the answer and the standard answer (ground_truth). Note that this differs from the Faithfulness (factual fidelity) indicator above: Faithfulness compares the answer only with the retrieved references, while Answer Correctness compares it directly with the standard answer. A high Faithfulness score does not guarantee a high Answer Correctness score, because the retrieved references may contain facts that are inconsistent with the standard answer, or the retrieved content may be incomplete. So how is this indicator calculated?
First, decompose both the generated answer and the standard answer (ground_truth) into lists of statements. An example is as follows:
answer = "Functional test case design methods include equivalence class partitioning, boundary value analysis, error conjecture, decision table, cause-effect diagram, orthogonal experiment, scenario method and flow chart, etc."ground_truth = "Functional test case design methods include equivalence class partitioning, boundary value analysis, error conjecture, decision table, cause-effect diagram, orthogonal experiment, functional correctness test method and functional suitability test method, etc."
# Generate a list of viewpoints: answer_points = [ "Functional test case design methods include equivalence class partitioning method", "Functional test case design methods include boundary value analysis method", "Functional test case design methods include error speculation method", "Functional test case design methods include decision table method", "Functional test case design methods include cause-effect diagram method", "Functional test case design methods include orthogonal experiment method", "Functional test case design methods include scenario method", "Functional test case design methods include flow chart"]
ground_truth_points = [ "Functional test case design methods include equivalence class partitioning method", "Functional test case design methods include boundary value analysis method", "Functional test case design methods include error inference method", "Functional test case design methods include decision table method", "Functional test case design methods include cause-effect diagram method", "Functional test case design methods include orthogonal experiment method", "Functional test case design methods include functional correctness test method", "Functional test case design methods include functional suitability test method"]
Then initialize three lists, TP, FP and FN, and traverse the two statement lists.
For each statement in the answer's list:
If the statement is supported by a statement in ground_truth, add it to the TP list. In the above example,
TP = [
    "Functional test case design methods include equivalence class partitioning method",
    "Functional test case design methods include boundary value analysis method",
    "Functional test case design methods include error guessing method",
    "Functional test case design methods include decision table method",
    "Functional test case design methods include cause-effect diagram method",
    "Functional test case design methods include orthogonal experiment method"
]
If the statement finds no support in the ground_truth statement list, add it to the FP list. In the above example,
FP = [
    "Functional test case design methods include scenario method",
    "Functional test case design methods include flow chart"
]
For each statement in the ground_truth list:
If the statement has no matching item in the answer's statement list, add it to the FN list. In the above example,
FN = [
    "Functional test case design methods include functional correctness test method",
    "Functional test case design methods include functional suitability test method"
]
Finally, count the number of elements in the TP, FP, and FN lists, and calculate the f1 score as follows:
f1 score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
In the above example, f1 score = 6 / (6 + 0.5 * (2 + 2)) = 0.75.
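Putting the whole factual accuracy calculation together, here is a minimal sketch that reproduces the example above, reusing the answer_points and ground_truth_points lists defined earlier. Exact string matching stands in for the statement-matching step, which in practice is performed by a large language model judging whether two statements express the same fact.

def factual_accuracy_f1(answer_points, ground_truth_points):
    # TP: answer statements supported by ground_truth; FP: answer statements with no
    # support in ground_truth; FN: ground_truth statements missing from the answer.
    tp = [s for s in answer_points if s in ground_truth_points]
    fp = [s for s in answer_points if s not in ground_truth_points]
    fn = [s for s in ground_truth_points if s not in answer_points]
    tp_n, fp_n, fn_n = len(tp), len(fp), len(fn)
    return tp_n / (tp_n + 0.5 * (fp_n + fn_n)) if tp_n > 0 else 0.0

print(factual_accuracy_f1(answer_points, ground_truth_points))  # 0.75 for the lists above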
Score Summary
After obtaining the semantic similarity and factual accuracy scores as described above, the final Answer Correctness score is computed as their weighted sum:
Answer Correctness score = 0.25 * semantic similarity score + 0.75 * factual accuracy score
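For instance, taking the factual accuracy of 0.75 computed earlier and assuming a hypothetical semantic similarity score of 0.9:

answer_correctness = 0.25 * 0.9 + 0.75 * 0.75  # = 0.225 + 0.5625 = 0.7875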
Automated Methods for Evaluating RAG Applications
If the evaluation above relied entirely on manual work, it would be time-consuming and error-prone. Is there a smarter, automated approach? For example, calculating factual accuracy requires decomposing texts into statement lists and matching the lists against each other; can a mature large reasoning model (such as the full-scale version of DeepSeek) take over this task? The RAGAS framework is exactly such a tool for automatically evaluating RAG applications. It quantifies the performance of the retriever and generator components of a RAG system through structured indicators, and makes full use of large language models and embedding models to automate the evaluation, significantly reducing the cost of the evaluation task. A code example is as follows:
from langchain_community.llms.tongyi import Tongyi
from langchain_community.embeddings import DashScopeEmbeddings
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness

# Dataset
data_samples = {
    'question': [
        'Which department is Zhang Wei from?',
        'Which department is Zhang Wei from?',
        'Which department is Zhang Wei from?'
    ],
    'answer': [
        "Based on the information provided, there is no mention of Zhang Wei's department. If you can provide more information about Zhang Wei, I may be able to help you find the answer.",
        'Zhang Wei is from the personnel department',
        'Zhang Wei is from the teaching and research department'
    ],
    'ground_truth': [
        'Zhang Wei is a member of the teaching and research department',
        'Zhang Wei is a member of the teaching and research department',
        'Zhang Wei is a member of the teaching and research department'
    ]
}
dataset = Dataset.from_dict(data_samples)

# Perform automated evaluation
score = evaluate(
    dataset=dataset,
    # Define evaluation indicators
    metrics=[answer_correctness],
    llm=Tongyi(model_name="qwen-plus-0919"),
    embeddings=DashScopeEmbeddings(model="text-embedding-v3")
)

# Convert the evaluation results to DataFrame format
score.to_pandas()
The example above mainly calculates answer correctness: the semantic similarity sub-indicator is computed with the DashScopeEmbeddings embedding model, and factual accuracy is computed with the Tongyi large language model. In the same way, we can also calculate context precision and context recall:
from langchain_community.llms.tongyi import Tongyi
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

data_samples = {
    'question': [
        'Which department is Zhang Wei from?',
        'Which department is Zhang Wei from?',
        'Which department is Zhang Wei from?'
    ],
    'answer': [
        "Based on the information provided, there is no mention of Zhang Wei's department. If you can provide more information about Zhang Wei, I may be able to help you find the answer.",
        'Zhang Wei is from the Personnel Department',
        'Zhang Wei is from the Teaching and Research Department'
    ],
    'ground_truth': [
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department',
        'Zhang Wei is a member of the Teaching and Research Department'
    ],
    'contexts': [
        ['Provide administrative management and coordination support to optimize administrative work processes.',
         'Performance Management Department Han Shan Li Fei I902 041 Human Resources'],
        ['Li Kai, Director of the Teaching and Research Department', 'Newton discovered gravity'],
        ['Newton discovered gravity',
         'Zhang Wei, engineer of the Teaching and Research Department, is currently in charge of curriculum development'],
    ],
}
dataset = Dataset.from_dict(data_samples)

score = evaluate(
    dataset=dataset,
    metrics=[context_recall, context_precision],
    llm=Tongyi(model_name="qwen-plus-0919")
)
score.to_pandas()
How to optimize RAG applications based on evaluation results
Evaluation is not the ultimate goal. The ultimate goal is to make targeted optimization suggestions based on the evaluation results. Although the correctness of the answer can comprehensively measure the overall effect of the application, when the effect is not satisfactory, simply looking at this indicator cannot tell where the problem lies. At this time, we need to combine the four sub-indicators to comprehensively evaluate the application effect. According to the process sequence of RAG, we should first look at the context precision and context recall rate . These two indicators reflect the effect of the retrieval link. If these two indicators are very low, it means that there is a problem in the retrieval link. The possible reasons and countermeasures are as follows:
Common reasons for low context recall :
The content of the knowledge base is not complete enough , resulting in insufficient reference information for recall. Therefore, it is necessary to supplement the knowledge base with relevant reference knowledge.
The document segmentation strategy is unreasonable , such as using fixed-length segments, which results in incomplete context information. Therefore, it is necessary to optimize the segmentation strategy so that the segments can contain the context information as completely as possible.
The number of recalled blocks is too small , resulting in insufficient reference information for recall. Therefore, the number of recalled blocks needs to be increased.
The embedding model used is not powerful enough to retrieve all relevant information, so a better embedding model is needed. A good embedding model can understand the deep semantics of the text. If two sentences are deeply related, they can get a high similarity score even if they "seem" unrelated.
The user's query is unclear or lacks key information, resulting in poor retrieval results. Here, developers can design a prompt template and use a large model to rewrite the query, improving the way the question is expressed and supplementing key information, thereby improving recall, as sketched below.
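As a concrete example of the query-rewriting remedy, here is a minimal sketch reusing the LangChain Tongyi wrapper that appears in the RAGAS examples above; the prompt wording is only an illustration, not a fixed template:

from langchain_community.llms.tongyi import Tongyi
from langchain_core.prompts import PromptTemplate

rewrite_prompt = PromptTemplate.from_template(
    "Rewrite the following user question so that it is clear, specific, and contains "
    "the key terms needed to search an enterprise knowledge base. "
    "Return only the rewritten question.\n\nQuestion: {question}"
)
llm = Tongyi(model_name="qwen-plus-0919")
# The rewritten query is then sent to the retriever instead of the raw user question.
rewritten_query = llm.invoke(rewrite_prompt.format(question="Which department is Zhang Wei from?"))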
Common reasons for low context precision :
The content quality of the knowledge base is low , a lot of noise information is mixed in it, and useful information is difficult to distinguish. Therefore, it is necessary to clean and preprocess the content in the knowledge base.
The document segmentation strategy is unreasonable , for example, the segment length is too large or contains too much interference information. Therefore, the segmentation strategy needs to be optimized.
Too many blocks are recalled , which results in too much interference information in the recalled reference information. Therefore, the number of blocks to be recalled should be reduced appropriately.
The embedding model used is not powerful enough to accurately match the query with the relevant information, so a better embedding model should be substituted. Another very effective method is to add a re-ranking step after retrieval: a dedicated re-ranking model scores and re-orders the retrieved content more precisely so that the relevant text segments rank higher, i.e., a "fine ranking" after the "coarse screening", as sketched below.
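A minimal sketch of such a re-ranking step, assuming the sentence-transformers library is available; the model name is only an example, and any dedicated cross-encoder re-ranking model can be substituted:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example re-ranking model
query = "What are the test case design methods for functional testing?"
candidates = ["chunk about basic design methods ...",
              "chunk about performance testing ...",
              "chunk about combinatorial design methods ..."]  # coarse retrieval results
scores = reranker.predict([(query, chunk) for chunk in candidates])
# Keep only the highest-scoring chunks as references for the generation step.
reranked = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)]
top_chunks = reranked[:2]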
If both indicators are high, it means that the retrieval phase performs well and does not need too much attention. Then we can focus on the generation phase . The generation phase is mainly completed by the large model. For the large model, the following optimization directions can be tried:
Common reasons for low factual fidelity :
The model itself hallucinates heavily and therefore easily makes up wrong answers. In that case, switch to a large model with a lower hallucination rate (authoritative hallucination evaluation leaderboards can help with the choice), or fine-tune the model.
The task prompt is not clear enough. The prompt should explicitly require the large model to answer only based on the retrieved content, and to say "I don't know" when the answer is not there (see the prompt sketch after this list).
In addition, the self-RAG approach can be introduced: before outputting the final answer, the large model "reflects" on whether the answer it generated contradicts the facts; if so, it corrects the answer and performs another round of self-review, until it is confident that the response contains no hallucination.
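For the prompt-clarity point above, one possible phrasing of a grounding prompt is sketched below; the exact wording is an assumption, not a template taken from any particular framework:

# One possible grounding prompt for the generation step; adapt the wording as needed.
grounded_prompt = """You are an enterprise knowledge-base assistant.
Answer the user's question using ONLY the reference passages below.
If the passages do not contain the information needed, reply "I don't know" instead of guessing.

Reference passages:
{contexts}

Question: {question}
Answer:"""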
Common reasons for low answer relevance :
Improper model parameter settings: the large model has two important parameters, temperature and top_p. The higher these values, the more random and diverse the generated answers; the lower they are, the more deterministic and conservative the answers. They therefore need to be tuned for the specific task: for tasks with a single, definite standard answer, it is advisable to lower both values to keep the model from diverging too much (see the sketch after this list).
The model does not understand the user's question, perhaps because the question is not expressed clearly enough, or because the model itself is not capable enough. In that case, improve the wording of the user's question or switch to a stronger large model.
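As a sketch of the parameter-tuning point, here is one way to lower the randomness of the Tongyi model used in the earlier examples; top_p is a field of the LangChain Tongyi wrapper, while passing temperature through model_kwargs is an assumption about how extra DashScope parameters are forwarded:

from langchain_community.llms.tongyi import Tongyi

llm = Tongyi(
    model_name="qwen-plus-0919",
    top_p=0.5,                          # smaller nucleus -> more conservative sampling
    model_kwargs={"temperature": 0.1},  # assumed pass-through of DashScope's temperature parameter
)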