What is the most important thing for building a good AI Agent?

Written by Clara Bennett
Updated on: June 26, 2025

The development of AI Agents has entered a new stage, and how to define benchmarks is crucial.

Core content:
1. The current state of AI development and the concept of the "second half"
2. The impact and value of benchmark definition for AI Agents
3. The relationship between real-world problems and benchmarks, and future directions



Why?

Because we already have enough of a technical toolkit: as long as we clearly define the problem we want to solve (that is, the benchmark), we can solve it.

OpenAI's Shunyu Yao recently proposed the concept of the "second half" of AI. We already have:

  1. Pre-trained models that store large amounts of knowledge (priors), plus the know-how to keep training them
  2. Agent capabilities (environments) that let these models think, reason, and take actions
  3. Reinforcement learning algorithms

Pre-training supplies the model with prior knowledge → the Agent equips the model with tools → reinforcement learning elicits the application of that knowledge. The whole recipe is standardized, generalizes well, applies to almost any scenario, and can break through one benchmark after another.

The focus therefore shifts to: what kind of benchmarks should we define? We already have many benchmarks covering areas such as mathematical reasoning and programming, and new large models are routinely released with high scores on them, yet their impact on the real world is not that great.

Obviously, we should define benchmarks that are closer to real-world problems. Once they are defined, the recipe above can continuously optimize against them: the benchmark guides the collection of real-world data → that data improves the prior knowledge of the pre-trained model → reinforcement learning steers the model's outputs toward the benchmark.

The closer a benchmark is to the real world, the greater its impact and value. This is the most important issue in the second half of AI, and also the most important issue for AI Agents. (AI Agents are the current face of AI: the large model provides prior knowledge and reasoning ability, and the Agent equips it with environment perception and action capabilities. Solving real-world problems will definitely require Agents.)

What is it?

What are real-world benchmarks?

In the past, most benchmarks were essentially fixed tasks in a closed world, such as math problems, algorithm problems, Go, and games, where the problem, the rules, and the answer can all be clearly defined. Such benchmarks are relatively easy to build because the rules and procedures are ready-made; reasoning tasks also largely fall into this category. Once large models reached their current stage, these problems became relatively easy to solve.

However, these tasks are far removed from the problems we have to solve in the real world every day, and they are not real-world environments. Previously, we lacked the ability to perceive and process real-world tasks with their massive, complex rules; now, large models and Agents have begun to acquire that ability.

Currently there are many cross-cutting, single-capability benchmarks, covering planning ability (PlanBench, AutoPlanBench, etc.), tool-calling ability (ToolBench, BFCL, etc.), and reflection ability (LLF-Bench, LLM-Evolve, etc.). There are also larger unified benchmarks for general task completion, mainly around operating browsers and computers, such as OpenAI's BrowseComp (evaluating complex information retrieval and comprehension) and the academic community's OSWorld (evaluating the ability to complete tasks through GUI operations).

However, these cross-cutting or general benchmarks may not be what users care about. AI Agents need to be practical, and users care more about their capabilities on vertical tasks: can it write good code for me, provide good customer service, create good stories, produce good research reports? The industry is still in its early stages, so it makes sense to first solve basic, general problems through benchmarks; after a certain threshold is crossed, benchmarks for vertical tasks will become more important.

If we simply classify these tasks, we can divide them into two categories: tasks with clear goals and tasks with unclear goals.

Tasks with clear goals

In reality, some tasks have a clear definition of whether the result is correct (they can have standard answers, much like mathematics), but the process requires constant interaction with the real environment. A typical example is AI Coding: whether the program runs and whether the bug has been fixed can be verified unambiguously. Customer service, data analysis, and similar tasks also fall into this category.

This category is the easiest for AI to crack, but defining a good benchmark for it is still not easy.

The most authoritative benchmark in AI Coding is SWE-Bench, which tries to define problems as close to the real world as possible by starting from real GitHub issues, yet it still struggles to capture how different models perform in actual coding work. o1, DeepSeek R1, and Claude 3.5 all score around 49%, but in real use Claude 3.5 is a level above the others in usability, and no benchmark reflects that tier-level gap. Claude 3.7 scores 70%, yet the actual experience does not differ from 3.5 as much as the scores suggest. Beyond the models themselves, once tooling is added there are dozens of AI Coding products such as Windsurf, Cursor, Trae, and Augment, and it remains unclear how their real-world effectiveness differs or how to evaluate and measure it.
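
To make the verification idea concrete, here is a minimal sketch of how a SWE-Bench-style check works; this is not the official harness, and the repository path, patch file, and test command are illustrative assumptions. A candidate patch counts as resolved only if the tests that reproduce the original issue pass after the patch is applied.

```python
# Minimal sketch of SWE-Bench-style scoring (illustrative, not the official harness):
# a model-generated patch "resolves" a task only if the tests that reproduce the
# original GitHub issue pass after the patch is applied to a clean checkout.
import subprocess

def resolves_issue(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    # Apply the candidate patch to a clean checkout of the repository.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # The patch does not even apply cleanly.
    # Re-run the tests that reproduce the issue; exit code 0 means they now pass.
    result = subprocess.run(test_cmd, cwd=repo_dir)
    return result.returncode == 0

if __name__ == "__main__":
    # Hypothetical paths and test command for one task instance.
    ok = resolves_issue(
        repo_dir="workdir/some-project",
        patch_file="predictions/task_0001.patch",
        test_cmd=["python", "-m", "pytest", "tests/test_issue_regression.py"],
    )
    print("resolved" if ok else "unresolved")
```

The benchmark score is then just the fraction of task instances resolved this way, which is what makes clear-goal tasks comparatively easy to measure even when the scores do not fully track real-world usability.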

Moreover, SWE-Bench covers only part of coding; abilities such as understanding large projects, developing visuals and animations, code review (CR), and requirements understanding still lack benchmarks and need to be filled in. Benchmarks such as SWE-bench Multimodal, AgentBench, and SWE-Lancer keep being launched to try to cover these gaps.

I haven’t seen any relevant benchmarks in other fields yet.

Tasks with unclear goals

For most real-world tasks, the quality of the result is hard to define clearly and is not black and white: research reports, travel planning, resume screening and interviewing, and all kinds of text/image/video creation scenarios such as marketing copy, story writing, and email replies. For many of these, the quality of the output can only be judged by humans.

One reason for DeepSeek's popularity at the beginning of the year, beyond its soaring scores, is the high quality of the Chinese text it produces. Yet no benchmark can measure this, because it is genuinely hard to define what makes text clearly good; it depends on culture, preference, logic, diversity, and more.

The same goes for image and video generation: past a certain threshold, judging which generated image is better involves many dimensions and subjective human judgment, and no benchmark currently captures this.

How to evaluate this kind of task?

  1. Manual evaluation: For image generation, for example, a common practice is to score results by hand along several dimensions and compare the outputs of different models; articles and videos can be evaluated the same way. There are also online blind head-to-head (PK) arenas that compare results in large batches and rank models by total score. For internal iteration on your own product, you can also use post-launch data such as adoption rate to gauge quality. All of these require human participation, carry a large subjective component, and are hard to turn into a recognized standard benchmark.
  2. Relying on models: As models' understanding improves, they can approach human-level evaluation ability, and the manual evaluation above can be shifted to model evaluation. For images, current multimodal models such as GPT-4o have increasingly strong understanding and can already judge some aspects of quality; the same goes for text, which can be scored by an evaluator model that can even propose evaluation dimensions for the scenario on its own (a minimal model-as-judge sketch follows this list). If a given model's judging ability is widely accepted, and the datasets and evaluation dimensions are defined, this can become a benchmark, although today's models are not yet on par with human evaluation.
  3. Relying on task decomposition: Do not measure the overall result; measure only the clearly defined parts in the middle, converting those parts into the clear-goal tasks described above. For email communication, for example, evaluate only whether the email contains the required key information; for travel planning, evaluate only whether it satisfies the stated preferences (such as lowest price) and whether operations such as flight-booking API calls are correct.
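
As referenced in point 2, below is a minimal model-as-judge sketch. It assumes the OpenAI Python client with GPT-4o as the judge; the scoring dimensions and prompt are illustrative choices, not an established benchmark.

```python
# Minimal model-as-judge sketch (assumptions: OpenAI Python client, GPT-4o as the
# judge, and made-up scoring dimensions; not an established benchmark).
import json
from openai import OpenAI

client = OpenAI()
DIMENSIONS = ["relevance", "logic", "style", "diversity"]

def judge(task: str, candidate: str) -> dict:
    prompt = (
        f"Task: {task}\n\nCandidate answer:\n{candidate}\n\n"
        f"Score the answer from 1 to 10 on each of {DIMENSIONS}. "
        'Reply with JSON only, e.g. {"relevance": 7, "logic": 8, "style": 6, "diversity": 5}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any sufficiently strong judge model
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON
    )
    return json.loads(response.choices[0].message.content)

# Usage: score two models' outputs on the same prompt and compare per dimension.
# scores_a = judge("Write a 200-word product announcement", output_from_model_a)
# scores_b = judge("Write a 200-word product announcement", output_from_model_b)
```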

If we want Agents to play a real role and create value across fields, each field may have its own vertical Agent, and each field will need one or more benchmarks to cover it. AI Coding is moving fastest and already has multiple benchmarks; fields such as customer service, e-commerce, marketing, content creation, healthcare, and education will follow. Each major area will contain small vertical tasks, and each type of task may need a benchmark to measure who does it best and to drive up the task's success rate.

If you want to build a vertical Agent, the most worthwhile thing to do is to define its benchmark, much like TDD (test-driven development) in software engineering, and this approach may matter even more in the AI era. The benchmark clarifies the problem definition, guides the optimization direction, supplies optimization data, and is not invalidated by model upgrades; it is a core asset of an Agent in its field.
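
As a rough illustration of the TDD analogy, here is a minimal sketch of a vertical-Agent benchmark defined as a set of test cases with deterministic checks. The task, checks, and the `run_agent` callable are hypothetical placeholders for whatever Agent you are building.

```python
# Minimal sketch of a vertical-Agent benchmark defined like a TDD test suite:
# each case pairs an input with a deterministic check, and the benchmark score
# is simply the pass rate. `run_agent` is a hypothetical placeholder for your Agent.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    name: str
    task: str
    check: Callable[[str], bool]  # returns True if the Agent's output passes

# Example case for a hypothetical email-writing Agent: check only the
# clearly defined parts (key information present), as discussed above.
CASES = [
    Case(
        name="meeting_invite_contains_key_info",
        task="Write an email inviting the team to a project review on Friday at 3pm in Room 204.",
        check=lambda out: all(k in out for k in ["Friday", "3pm", "Room 204"]),
    ),
]

def run_benchmark(run_agent: Callable[[str], str]) -> float:
    passed = sum(1 for case in CASES if case.check(run_agent(case.task)))
    return passed / len(CASES)  # pass rate is the benchmark score

if __name__ == "__main__":
    # Trivial stand-in "Agent" so the sketch runs end to end.
    echo_agent = lambda task: "Project review: Friday 3pm, Room 204."
    print(f"pass rate: {run_benchmark(echo_agent):.0%}")
```

Like a TDD suite, the case set stays stable while models and scaffolding change underneath it, which is what makes it a durable asset for the field.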