2025 AI Agent (Multi-Agent System) Evaluation and Optimization Guide

A practical guide to comprehensively master the evaluation and optimization of multi-agent systems.
Core content:
1. A detailed walk-through of the evaluation process, from drawing samples to scoring the application's output
2. An introduction to the key evaluation metrics, covering task success rate, collaboration efficiency, and more
3. In-depth discussion of multi-agent system evaluation tools and optimization methods
For Agent products, evaluation and optimization are two critical tasks that directly determine whether the product is usable, and they take up a large share of the actual work. For example, a team might build the basic framework in two weeks, while the subsequent evaluation and optimization can take another two months. In this article we look at how to evaluate and optimize a multi-agent system from four angles: the evaluation process, evaluation metrics, evaluation tools, and optimization methods.
As usual, the summary comes first so you can decide whether to read on. I have tried to keep the language plain, though some technical terms are unavoidable, so please bear with me. The full article runs to more than 5,000 words, so set aside some time for it.
Summary
1. Evaluation process: samples are drawn from a dataset and fed into the application to obtain outputs; an evaluator (optionally with access to ground-truth answers) then scores those outputs to complete the evaluation of the product.
2. Evaluation metrics: mainly covers task success rate, correct function-call usage, collaboration metrics, and so on.
3. Evaluation tools: introduces a range of multi-agent system (MAS) evaluation tools, including DeepEval, LangSmith, MultiAgentBench, and others.
4. Optimization methods: optimization of multi-agent systems, discussed from both an engineering and an algorithmic perspective.
Evaluation Process
First, let’s take a brief look at the evaluation process.
First, samples are drawn from the dataset and fed into the application to obtain outputs. Then the evaluator (optionally with access to ground-truth answers) scores the outputs, completing the evaluation of the product.
The dataset contains a number of examples that are provided to the application as input. The "(Optional)" label indicates that you can also supply reference data, such as ground-truth answers or annotations, for the evaluator to compare against.
The application takes the examples from the dataset as input and, after running, produces the corresponding outputs.
The evaluator receives the application's output and, where reference data is available, compares it against the ground-truth or expected result. It then assigns a score based on that comparison to quantify the application's performance.
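To make the loop concrete, here is a minimal sketch in Python. The `run_app` function and the exact-match evaluator are hypothetical stand-ins for your application and your scorer, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    query: str
    reference: str | None = None  # optional ground-truth answer

def run_app(query: str) -> str:
    """Hypothetical application under test; replace with your agent system."""
    return "Paris" if "capital of France" in query else "unknown"

def evaluate(output: str, reference: str | None) -> float:
    """Toy evaluator: exact match against the reference when one is provided."""
    if reference is None:
        return 0.0  # a rule-based or LLM judge could score without a reference
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

dataset = [
    Example("What is the capital of France?", "Paris"),
    Example("What is the capital of Italy?", "Rome"),
]

scores = []
for ex in dataset:
    output = run_app(ex.query)                      # sample -> application -> output
    scores.append(evaluate(output, ex.reference))   # evaluator scores the output

print(f"average score: {sum(scores) / len(scores):.2f}")
```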
This is the general process. Let’s expand it a little bit.
The process is the same as the one above; the diagram simply expands on four parts: (1) the dataset, (2) the evaluator, (3) the task type, and (4) the form of evaluation.
(1) Dataset: The dataset is the input source of the evaluation process and contains three types of examples:
Developer curated: Samples manually selected and annotated by developers to cover key use cases and edge cases.
User-provided: Input from real user logs (including user feedback), reflecting the diversity and noise of the product in actual use.
LLM generated (Synthetic): Examples automatically generated by large models, which can be used to quickly expand the dataset, simulate rare scenarios, or perform stress testing.
(2) Evaluator: The evaluator is responsible for scoring or judging the application output. It is divided into three types:
LLM-as-judge: a model acts as the referee and scores the generated output; when several judge models are used, the majority vote decides.
Rule-based: checks whether the generated content complies with predefined rules.
Human evaluation: human reviewers compare the output against reference data.
(3) Task type: the scenarios being evaluated, such as RAG question answering, chatbot dialogue, and code generation.
(4) Evaluation implementation form: Evaluation can be performed at different stages and depths to ensure continuous improvement of product quality. There are two main methods:
Evaluation in production: run tests on live traffic to assess how the application performs under real operating conditions. Common methods include A/B testing (a minimal sketch follows this list), manual online review, and replaying historical traffic.
Pre-deployment evaluation: tests run before the application is deployed. Common methods include unit testing, offline evaluation, and pairwise comparison.
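For the production side, an A/B comparison can be as simple as comparing task success rates between two variants on live or historical traffic. The sketch below uses simulated outcomes with made-up success probabilities, purely for illustration.

```python
import random

random.seed(0)
# Hypothetical logged outcomes: True = the task was solved for that request.
variant_a = [random.random() < 0.78 for _ in range(500)]  # current system
variant_b = [random.random() < 0.82 for _ in range(500)]  # candidate system

def success_rate(outcomes: list[bool]) -> float:
    return sum(outcomes) / len(outcomes)

rate_a, rate_b = success_rate(variant_a), success_rate(variant_b)
print(f"A: {rate_a:.3f}  B: {rate_b:.3f}  lift: {rate_b - rate_a:+.3f}")
# In practice you would also run a significance test (e.g. a two-proportion
# z-test) before deciding to promote variant B.
```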
Evaluation Metrics
Measuring MAS performance involves multiple metrics to fully evaluate accuracy, efficiency, and scalability:
Task success rate
Metrics such as exact match or task completion rate evaluate whether the MAS produces correct results; task completion accuracy quantifies overall correctness. In collaborative settings that involve information retrieval, precision and recall may also be used. In many cases, though, a simple success rate is enough to judge system performance.
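As a concrete illustration (not tied to any particular benchmark), success rate and precision/recall can be computed as follows; the task outcomes and document IDs are made up.

```python
def success_rate(results: list[bool]) -> float:
    """Fraction of tasks the system completed correctly (exact match / completion)."""
    return sum(results) / len(results)

def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall for an information-retrieval style sub-task."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

task_outcomes = [True, True, False, True]             # 3 of 4 tasks solved
p, r = precision_recall({"doc1", "doc2", "doc5"}, {"doc1", "doc2", "doc3"})
print(success_rate(task_outcomes), round(p, 2), round(r, 2))  # 0.75 0.67 0.67
```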
Correct function call usage
In a MAS, agents call various tools (functions). A key metric is whether an agent calls the correct function or API for a given problem.
Tool success rate: measures the proportion of tool/API calls that achieved the expected results.
Function-call evaluation: for example, the Berkeley Function-Calling Leaderboard (BFCL) provides tasks with their expected function calls and checks whether the agent chooses the right call.
Advanced BFCL evaluation: includes multi-step scenarios and state-based metrics that track whether the agent correctly maintains system state across multiple calls, thereby evaluating the correctness of function calls.
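A simple way to check function-call correctness, in the spirit of (but not identical to) BFCL-style checks, is to compare the call the agent emitted against the expected call. The tool names and arguments below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def call_correct(predicted: ToolCall, expected: ToolCall) -> bool:
    """Correct if the agent chose the right function with the right arguments."""
    return predicted.name == expected.name and predicted.args == expected.args

# Hypothetical evaluation samples: (agent's call, expected call)
samples = [
    (ToolCall("get_weather", {"city": "Tokyo"}), ToolCall("get_weather", {"city": "Tokyo"})),
    (ToolCall("search_web", {"q": "weather Tokyo"}), ToolCall("get_weather", {"city": "Tokyo"})),
]

tool_success_rate = sum(call_correct(p, e) for p, e in samples) / len(samples)
print(f"tool success rate: {tool_success_rate:.2f}")  # 0.50
```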
Scalability metrics
To evaluate the scalability of a MAS, you typically vary the number of agents or tasks and observe how performance degrades. Key metrics include:
Throughput/latency changes: as the number of agents or tasks increases, does throughput grow nearly linearly and does latency remain at an acceptable level?
Task allocation accuracy: measures whether tasks are correctly assigned to the most appropriate agent, ensuring the team can still operate efficiently as it expands.
Communication overhead: track the number of messages per task to prevent new agents from causing excessive coordination costs and slowing down the system.
A scalable MAS should maintain near-linear throughput growth as agents are added, without a dramatic increase in coordination cost.
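A rough way to probe scalability is to sweep the number of agents and record throughput and messages per task. The sketch below simulates the workload with `time.sleep`, and its coordination-cost model is entirely made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_task(agent_id: int, n_agents: int) -> int:
    """Simulated task: base work plus a made-up coordination cost that grows with team size."""
    time.sleep(0.01 + 0.001 * n_agents)    # pretend coordination adds latency
    return n_agents - 1                    # pretend each agent messages every peer once

for n_agents in (2, 4, 8):
    tasks = list(range(50))
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        messages = list(pool.map(lambda t: run_task(t % n_agents, n_agents), tasks))
    elapsed = time.perf_counter() - start
    print(f"{n_agents} agents: throughput={len(tasks)/elapsed:.1f} tasks/s, "
          f"avg messages/task={sum(messages)/len(tasks):.1f}")
```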
Collaboration metrics
A MAS also needs to be evaluated on how well its agents collaborate, to ensure that outputs are consistent and that cooperation between agents runs smoothly. Key metrics include:
Output coherence: measures whether the final output (such as a report or plan) is logically consistent, unified, and coherent. This can be assessed manually or automatically.
Coordination success rate: detects whether agents avoid conflicts (for example, two agents editing the same file at the same time and producing inconsistent data).
Task execution path match: when a ground-truth sequence (an optimal execution order) exists, measure how closely the MAS's actual execution path follows it; this is especially useful for planning tasks.
A high-quality MAS needs to ensure consistent output, smooth agent coordination, and the ability to follow the best action sequence when performing complex tasks.
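One simple way to score the match between an executed action sequence and a ground-truth sequence is the longest common subsequence (LCS) ratio; edit distance or prefix match would also work. The action names below are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def path_match(executed: list[str], ground_truth: list[str]) -> float:
    """Fraction of the ground-truth sequence that the MAS reproduced, in order."""
    return lcs_length(executed, ground_truth) / len(ground_truth)

ground_truth = ["gather_data", "clean_data", "analyze", "write_report"]
executed = ["gather_data", "analyze", "write_report"]   # skipped the cleaning step
print(f"path match: {path_match(executed, ground_truth):.2f}")  # 0.75
```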
In practice, you rarely need to evaluate all of these metrics; which ones to use depends on the specific situation.
For example, for a multi-agent data analysis assistant, the following key metrics can be tracked simultaneously:
Accuracy: did the agent reach the correct analytical conclusions?
Tool usage success rate: did the data acquisition agent successfully retrieve the required data?
Latency: how long does it take for the agent to respond to user requests?
Scalability: can the system support more data sources or new analysis agents?
By focusing on multiple metrics, we can ensure that the MAS not only produces correct results but is also efficient and can scale to handle more complex problems.
Evaluation Tools
Evaluating a MAS requires structured frameworks, benchmarks, and tools. Researchers and industry experts have developed a variety of evaluation frameworks to systematically test MAS performance in different scenarios.
MultiAgentBench
MultiAgentBench is a comprehensive benchmark for MAS, used to evaluate collaboration and competition across a variety of interaction scenarios. It not only measures the final task success rate but also the quality of collaboration and competition, using milestone KPIs to refine the evaluation. For example, in a collaborative research task it sets intermediate milestones (collecting reference materials → drafting chapters → completing the report) and evaluates how well agents cooperate at each step, so the team's overall collaboration efficiency can be assessed. MultiAgentBench also evaluates agent coordination protocols, studying how different communication topologies (star, chain, graph) affect team performance, and examines how different strategies (such as group discussion) affect collaboration efficiency.
Compared with single-agent evaluation, MultiAgentBench provides a more comprehensive MAS evaluation that can reflect: the best agent collaboration strategy, the optimal communication architecture, and the team performance of the multi-agent system, rather than just the capabilities of a single agent.
PlanBench
PlanBench is a test suite specifically designed to evaluate Agent planning capabilities. This suite is mainly used to evaluate the following points:
Generate effective plans: can the agent develop reasonable, executable plans?
Optimization capability: does the agent find an efficient path, or merely a feasible one?
Adaptability: can the agent readjust its plans when conditions change?
Execution reasoning: can the agent predict steps where it might fail and make adjustments?
Suppose an agent needs to arrange a furniture move. PlanBench might run the following tests:
Basic test: can the agent list all the necessary moving steps in a logically sound order?
Adaptability test: if the truck is too small, can the agent adjust the plan, for example by arranging additional vehicles? (Source: Mastering Agents: Evaluating AI Agents - Galileo AI)
Compared with simple task-execution evaluation, PlanBench digs into the planning intelligence of a MAS to determine whether the agent truly understands the task or is just repeating training samples. By scoring plan quality, execution reasoning, and adaptability, PlanBench makes the genuineness and generalization of a MAS's planning ability clear at a glance.
Function Call Benchmark
As mentioned earlier, the Berkeley Function-Calling Leaderboard (BFCL) is a framework specifically designed to evaluate agents on tool usage and API calls. It provides a dataset containing queries and their expected function calls, and checks whether the agent correctly selects and invokes the corresponding function to solve each query.
Key BFCL metrics include the agent's ability to maintain state across a series of API calls and whether it can correctly execute multi-step tool usage.
Such benchmarks are particularly important for multi-agent systems (MAS) because these systems rely on external tools (such as data retrieval, computation, etc.). BFCL's tests ensure that the agents can handle real-world API usage patterns.
Industry Evaluation Tools
In practice, engineers use frameworks and libraries to capture MAS metrics and to record and analyze multi-agent behavior.
DeepEval: lets you define custom MAS-related metrics and integrates with CI/CD for continuous testing (see the sketch at the end of this subsection).
TruLens: focuses on explainability, helping debug inter-agent communication and ensuring outputs meet their targets.
RAGAS: evaluates retrieval-augmented generation (RAG) pipelines, useful for agents that share a knowledge base; it tracks answer accuracy and context usage.
Deepchecks: monitors fairness and bias to ensure the MAS shows no unfair tendencies when assigning tasks or making decisions.
LangSmith: a platform for debugging, testing, monitoring, and optimizing production-grade large language model applications, helping developers iterate and deploy LLM solutions efficiently.
Langfuse: can be thought of as an open-source counterpart to LangSmith.
Arize Phoenix: an open-source observability tool designed for experimentation, evaluation, and troubleshooting of AI and LLM applications. It enables AI engineers and data scientists to quickly visualize data, evaluate performance, track down issues, and export data for further optimization.
The GitHub links for this section are included at the end of the article.
By using these tools, the team can continuously evaluate MAS's performance on key metrics and identify problems in a timely manner, such as when an agent consumes too many resources or fails to coordinate with other agents.
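As an illustration of how such tools slot into a continuous-testing workflow, here is a sketch roughly following DeepEval's pytest-style quickstart pattern; the exact API can change between versions, so treat this as an approximation and check the current documentation. `query_my_agent` is a hypothetical stand-in for your MAS entry point.

```python
# Hedged sketch: class and function names follow DeepEval's documented
# quickstart and may differ slightly across versions.
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def query_my_agent(question: str) -> str:
    """Hypothetical entry point into the multi-agent system under test."""
    return "Revenue grew 12% quarter over quarter, driven by the EU region."

def test_analysis_answer_relevancy():
    question = "Summarize last quarter's revenue trend."
    test_case = LLMTestCase(input=question, actual_output=query_my_agent(question))
    # Fails the CI run if answer relevancy falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```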
Optimization methods
Optimizing a multi-agent system means improving the agents' learning methods or design so that they perform better on the evaluation metrics. We will look at optimization from both the engineering and the algorithmic perspective.
Engineering
Standardized communication protocols
Develop unified communication standards and data exchange formats to ensure clear and accurate information transmission between agents and reduce call errors caused by inconsistent formats. For example, you can refer to the experience of some mature systems (such as ROS) and learn from their communication module design.
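A minimal sketch of what a unified message format might look like; the field names are illustrative, not an established standard.

```python
import json
from dataclasses import dataclass, field, asdict
from uuid import uuid4

@dataclass
class AgentMessage:
    """Illustrative unified envelope for inter-agent messages."""
    sender: str
    receiver: str
    intent: str                      # e.g. "tool_request", "result", "error"
    payload: dict = field(default_factory=dict)
    message_id: str = field(default_factory=lambda: str(uuid4()))

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "AgentMessage":
        return AgentMessage(**json.loads(raw))

msg = AgentMessage("planner", "retriever", "tool_request",
                   {"tool": "search", "query": "Q3 revenue"})
print(AgentMessage.from_json(msg.to_json()).intent)  # round-trips cleanly
```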
Building an error handling mechanism
Introduce a dedicated middleware or agent management system to centrally coordinate task allocation and tool calls, with built-in automatic retry, fallback and fault tolerance mechanisms. In this way, when an agent fails to call a tool, the error can be automatically captured and remedied to reduce overall system interruption.
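A minimal sketch of the retry-with-fallback idea; the tool functions and the error-handling policy are made up for illustration.

```python
import time
from typing import Callable

def call_with_retry(tool: Callable[[], str],
                    fallback: Callable[[], str],
                    max_retries: int = 3,
                    backoff_s: float = 0.5) -> str:
    """Try the primary tool a few times, then fall back instead of failing the whole run."""
    for attempt in range(1, max_retries + 1):
        try:
            return tool()
        except Exception as exc:                      # in practice, catch specific errors
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(backoff_s * attempt)           # simple linear backoff
    return fallback()

# Made-up tools: the primary one always fails, the fallback succeeds.
def flaky_api() -> str:
    raise TimeoutError("upstream API timed out")

def cached_answer() -> str:
    return "served from cache"

print(call_with_retry(flaky_api, cached_answer))  # -> "served from cache"
```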
Establishing an automatic verification mechanism
Design a unified API interface for each agent to call the tool, and cooperate with the automatic verification and feedback mechanism to ensure that the parameters in the calling process are correct and the results meet expectations. In addition, you can use logging and monitoring tools (such as Arize Phoenix, Langfuse) to track the tool calling process in real time and quickly locate and correct errors.
Leveraging Distributed Optimization
By using distributed algorithms and parallel computing, each agent can execute its tasks independently in a local environment, with the results aggregated afterwards, reducing the delays and errors caused by serial tool calls. This also relieves the load on any single node and improves the responsiveness and stability of the overall system.
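A minimal sketch of running agents' sub-tasks in parallel and aggregating the results; `agent_task` is a toy placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

def agent_task(source: str) -> dict:
    """Toy placeholder: each agent fetches and processes one data source independently."""
    return {"source": source, "rows": len(source) * 10}

sources = ["sales_db", "crm", "web_logs"]

# Each agent works on its own slice in parallel; results are aggregated afterwards.
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    partial_results = list(pool.map(agent_task, sources))

total_rows = sum(r["rows"] for r in partial_results)
print(partial_results, total_rows)
```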
Hybrid Optimization Methods
Using hybrid multi-agent systems, multiple optimization solvers (such as direct search methods and metaheuristic algorithms) are integrated into a coordinated framework. In such a system, each solver runs as an autonomous "agent", while a scheduler (or coordinator) manages the entire optimization process, maintaining a balance between cooperation and competition. This adaptive switching strategy can reduce unnecessary or erroneous tool calls because the system is able to dynamically select the most effective method at the moment. For example, one research paper proposed a multi-agent system for hybrid optimization in which different types of solvers work on a given problem simultaneously, while a scheduling agent supervises model evaluation and solver performance. This collaborative optimization approach can minimize computational overhead and reduce the probability of failure due to inappropriate or incorrect solver calls.
Distributed and consensus optimization methods
Another optimization approach is distributed optimization and consensus methods. In these methods, agents share local information (such as partial solutions or cost estimates) so that the entire network can converge to a global optimal solution. The advantages of this approach include:
Improve collaboration efficiency: agents coordinate and cooperate to avoid repeated computation, achieve synchronized updates, and reduce the need for central control.
Reduce tool invocation errors: agents can share environmental information to avoid making decisions based on outdated or incomplete data, reducing the risk of errors when invoking tools.
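A tiny sketch of a consensus step: each agent repeatedly averages its local estimate with its neighbors' until the network converges. The ring topology and the cost estimates are made up.

```python
# Hypothetical ring topology: each agent averages with itself and its two neighbors.
estimates = {"a": 10.0, "b": 4.0, "c": 7.0, "d": 1.0}   # local cost estimates
neighbors = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}

for _ in range(20):
    estimates = {
        agent: (estimates[agent] + sum(estimates[n] for n in neighbors[agent]))
               / (1 + len(neighbors[agent]))
        for agent in estimates
    }

print({k: round(v, 2) for k, v in estimates.items()})  # values approach the global mean 5.5
```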
Algorithms
Multi-Agent Reinforcement Learning (MARL): In MARL, each agent learns a strategy by earning rewards, and many algorithms are adapted from single-agent reinforcement learning. The core challenge is that the actions of agents affect each other, so the learning process needs to consider cooperation or competition. For example, Q-learning and policy gradient methods have multi-agent versions. In a cooperative environment, a global reward can be provided to all agents to encourage teamwork; in a competitive environment, each agent will maximize its own reward (such as game score).
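A minimal sketch of independent Q-learning with a shared global reward on a toy two-agent coordination game (the team is rewarded only when the agents' actions match). This illustrates the idea of cooperative MARL, not a production setup.

```python
import random

random.seed(0)
ACTIONS = [0, 1]

def team_reward(a1: int, a2: int) -> float:
    """Toy cooperative game: the team is rewarded only when both agents match."""
    return 1.0 if a1 == a2 else 0.0

# One Q-table per agent (the game is stateless, so Q is just a value per action).
q = [{a: 0.0 for a in ACTIONS} for _ in range(2)]
alpha = 0.1

for episode in range(2000):
    eps = 0.3 * (1 - episode / 2000)                  # decaying exploration
    acts = []
    for i in range(2):
        if random.random() < eps:
            acts.append(random.choice(ACTIONS))                   # explore
        else:
            acts.append(max(ACTIONS, key=lambda a: q[i][a]))      # exploit
    r = team_reward(*acts)                            # same global reward for both agents
    for i, a in enumerate(acts):
        q[i][a] += alpha * (r - q[i][a])              # independent Q-learning update

print([{a: round(v, 2) for a, v in qt.items()} for qt in q])
# Both agents learn to prefer the same action, so the team reliably earns reward 1.
```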
Another great example of using multi-agent reinforcement learning (MARL) and self-play for optimization is OpenAI Five’s application in Dota 2. The system trains a team of five agents to cooperate in a highly complex game environment.
OpenAI used shaped rewards, combining win/loss results, kills, and game objectives to encourage teamwork. Through reinforcement learning, the agents spontaneously learned a division of labor, for example one agent taking on a support role while another focuses on attacking. These strategies emerged entirely through learning.
Evolutionary Algorithms (EAs): evolutionary algorithms are inspired by natural selection and are used to optimize agent behavior. Unlike gradient-based methods, EAs maintain a population of agent strategies and improve it by repeatedly selecting the better-performing strategies and applying mutation and crossover. This technique is particularly powerful in multi-agent environments because it can explore diverse strategies and help agents escape the local optima that gradient methods may fall into.
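A minimal sketch of the selection–crossover–mutation loop; the fitness function is a stand-in for an episode return (here, closeness to a made-up target strategy vector).

```python
import random

random.seed(0)
TARGET = [0.8, 0.2, 0.5]   # made-up "ideal" strategy parameters

def fitness(strategy: list[float]) -> float:
    """Stand-in for an episode return: higher is better, peaking at TARGET."""
    return -sum((s - t) ** 2 for s, t in zip(strategy, TARGET))

def mutate(strategy: list[float], sigma: float = 0.1) -> list[float]:
    return [s + random.gauss(0, sigma) for s in strategy]

def crossover(a: list[float], b: list[float]) -> list[float]:
    return [random.choice(pair) for pair in zip(a, b)]

population = [[random.random() for _ in TARGET] for _ in range(20)]

for generation in range(50):
    population.sort(key=fitness, reverse=True)            # selection: keep the fittest
    parents = population[:5]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children

print([round(x, 2) for x in max(population, key=fitness)])  # close to TARGET
```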
Hybrid Evolutionary Algorithms (EA) + Reinforcement Learning (RL): modern research often combines evolutionary algorithms with reinforcement learning. RACE (Representation Asymmetry and Collaborative Evolution) is one such framework, demonstrating that evolutionary algorithms can contribute to cooperative multi-agent reinforcement learning (MARL) tasks. The framework maintains a population of multi-agent teams that evolves in parallel with the main reinforcement learning training. Periodically, well-performing behaviors learned by RL are injected into the population, and elite evolved strategies are fed back into RL training.