Agent 2.0: From prompt optimization to self-developed tools

The self-evolution of AI Agents: a leap from reliance on human design to self-optimization.
Core content:
1. The development history and future trends of AI Agent self-evolution
2. The application and importance of prompt optimization in Agent systems
3. The role of evaluation methods and optimization signals in Agent self-evolution
This article summarizes the most popular paper on Hugging Face in April, "Advances and Challenges in Foundation Agents", a panoramic survey of AI Agents. The paper is nearly 200 pages long and is divided into four main parts. Today I will introduce the second part: the self-evolution of agents.
In the past, many parts of AI systems were designed by hand, such as feature extraction and behavior rules. As technology has advanced, more and more of this work is done by machines themselves. For example, neural networks that once required expert design can now be generated automatically by algorithms. Agent systems are going through the same shift.
Although fully "automatic evolution" of agents has not yet been achieved, the direction is clear. In the future, agent systems will not need to be built step by step by humans; they will learn, modify themselves, and grow stronger on their own. Manual design will be replaced by self-optimizing systems. Just as humans keep learning, agents will be able to grow by themselves.
What are the benefits of such a system? First, it is more convenient: there is no need to retrain the language model for every change. Second, it saves labor: developers no longer have to make adjustments by hand all the time. Third, it is closer to human thinking: when the agent hits a problem, it can work out a fix itself instead of waiting for someone to repair it.
Many studies now use large language models to drive this. Language models can not only understand instructions but also help agents select tools and optimize workflows. Some systems (such as AFLOW) can already generate complete agent workflows automatically. In this sense, agents are not built; they are "grown".
Next, let's look at the practical directions currently being explored.
Prompt optimization
In LLM-based agent optimization, prompt optimization is the most central piece. Compared with modifying the model itself, adjusting the prompt affects task performance, response speed, and compute cost far more directly. The goal is to generate the most suitable prompt for a given task so that the model performs at its best.
The whole process rests on three key modules: optimization, execution, and evaluation. The execution module produces results with the current prompt; the evaluation module analyzes the quality of those results and generates evaluation and optimization signals; the optimization module then improves the prompt accordingly.
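To make the loop concrete, here is a minimal sketch in Python. The `llm` callable, the exact-match scoring, and the task format are assumptions for illustration, not details from the survey; the point is only how the execution, evaluation, and optimization modules feed into one another.

```python
from typing import Callable, List, Tuple

def optimize_prompt(
    llm: Callable[[str], str],        # assumed interface: text in, text out
    seed_prompt: str,
    tasks: List[Tuple[str, str]],     # (input, reference answer) pairs
    rounds: int = 5,
) -> str:
    """Optimize -> execute -> evaluate loop for a single prompt."""
    current, best_prompt, best_score = seed_prompt, seed_prompt, -1.0
    for _ in range(rounds):
        # Execution module: run the current prompt on every task.
        outputs = [llm(f"{current}\n\nInput: {x}") for x, _ in tasks]

        # Evaluation module: exact match against the reference answers.
        score = sum(o.strip() == ref.strip()
                    for o, (_, ref) in zip(outputs, tasks)) / len(tasks)
        if score > best_score:
            best_prompt, best_score = current, score

        # Optimization module: rewrite the prompt using failure feedback.
        failures = [(x, o, ref) for (x, ref), o in zip(tasks, outputs)
                    if o.strip() != ref.strip()]
        feedback = "\n".join(f"Input: {x}\nGot: {o}\nExpected: {ref}"
                             for x, o, ref in failures[:3])
        current = llm(
            "Improve this prompt so the model avoids the failures below.\n"
            f"Prompt:\n{current}\n\nFailures:\n{feedback}\n\n"
            "Return only the new prompt."
        )
    return best_prompt
```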
At the base of evaluation is the evaluation function. It takes the model output and the reference answer and uses various methods to judge whether the prompt works. Common evaluation sources include comparing the model output against the reference answer, comparing different model outputs against each other, or relying solely on the model's own feedback.
There are three evaluation methods: benchmark testing, LLM-as-judge, and human feedback. Benchmark testing scores against predefined metrics and is the most widely used; LLM-as-judge generates feedback in natural language and is gradually becoming the mainstream automated approach; human feedback gives the most accurate evaluation but is costly and hard to scale.
Evaluation signals also come in three forms: numerical feedback quantifies the effect, textual feedback provides concrete suggestions, and ranking feedback lets the system compare prompts against each other without defining an absolute standard.
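A small sketch of how one evaluation function might emit all three signal forms. The `llm` judge, the exact-match metric, and the return structure are illustrative choices, not the survey's definitions.

```python
from typing import Callable, Dict, List

def evaluate_prompt(llm: Callable[[str], str], outputs: List[str],
                    references: List[str], candidates: List[str]) -> Dict:
    """Produce numerical, textual, and ranking feedback for a prompt-optimization loop."""
    # Numerical feedback: a quantitative score (exact-match rate here).
    numeric = sum(o.strip() == r.strip()
                  for o, r in zip(outputs, references)) / len(outputs)
    # Textual feedback: an LLM judge turns the mistakes into concrete suggestions.
    textual = llm(
        "Compare these outputs with the references and suggest how the prompt "
        f"should change.\nOutputs: {outputs}\nReferences: {references}"
    )
    # Ranking feedback: order candidate prompts without an absolute standard.
    ranking = llm(
        "Rank these candidate prompts from best to worst for the task, one per line:\n"
        + "\n".join(candidates)
    )
    return {"numeric": numeric, "textual": textual, "ranking": ranking}
```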
During optimization, some methods rely only on evaluation signals: they start from the best-performing prompts and keep adjusting them with evolutionary algorithms or heuristic strategies. Others use more explicit optimization signals, such as analyzing failure cases directly or extracting what high-scoring prompts have in common to guide the next edit. For example, TextGrad turns reflections on failures into "text gradients" that drive the rewriting of new prompts, and Revolve simulates a deeper feedback chain to help the system escape local optima.
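To show the flavor of the optimization-signal approach, here is a conceptual sketch of one "text gradient" step. It is not the TextGrad library's API, only the idea: a critique of the failures acts as the direction, and a rewrite applies it.

```python
from typing import Callable, List

def textual_gradient_step(llm: Callable[[str], str], prompt: str,
                          failure_cases: List[str]) -> str:
    """One 'text gradient' step: turn failure reflections into an edit direction,
    then apply it. Conceptual sketch only, not the TextGrad library."""
    # 1. Critique: reflect on why the current prompt failed (the "gradient").
    gradient = llm(
        "The prompt below produced these failures. Explain concretely what about "
        f"the prompt caused them.\n\nPrompt:\n{prompt}\n\nFailures:\n"
        + "\n".join(failure_cases)
    )
    # 2. Update: rewrite the prompt in the direction the critique points to.
    return llm(
        f"Rewrite the prompt to address this critique.\nCritique:\n{gradient}\n\n"
        f"Prompt:\n{prompt}\n\nReturn only the rewritten prompt."
    )
```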
To judge how well optimization works, researchers use three kinds of metrics: performance, efficiency, and behavioral. Performance metrics such as accuracy and F1 score reflect results directly; efficiency metrics track the compute and sample sizes required; behavioral metrics look at consistency, fairness, and model confidence. Together, these metrics delineate the capabilities and limits of a prompt optimization system.
Workflow optimization
Prompt optimization can improve a single language model's performance, but complex tasks usually require several models to collaborate. That calls for optimizing the entire agent workflow, not just the prompts.
A workflow consists of multiple nodes, each representing a language model responsible for a subtask. Nodes collaborate through preset rules and goals rather than acting fully autonomously. Systems such as MetaGPT and AlphaCodium use this structure. Optimizing these workflows improves overall system performance and is key to building stronger agents.
A workflow can be formalized as a node set N plus an edge set E. Each node is parameterized along four dimensions: model, temperature, prompt, and output format. The optimization goal is to find an optimal structure K* that best balances task completion, computational efficiency, and response speed.
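Spelled out, the objective might be written roughly as follows; the trade-off weights and term names are assumptions used here to express "best balance", not notation taken from the survey.

```latex
% Assumed formulation: \alpha, \beta, \gamma are illustrative trade-off weights.
K = (N, E), \qquad
K^{*} \;=\; \arg\max_{K} \Big[\, \alpha\,\mathrm{TaskPerf}(K)
      \;-\; \beta\,\mathrm{ComputeCost}(K)
      \;-\; \gamma\,\mathrm{Latency}(K) \,\Big]
```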
How edges are represented determines how expressive and how optimizable the structure is. Common representations include graphs, well suited to complex processes; neural networks, which support dynamic adjustment; and code, the most flexible option, since it can embed logical judgment and loop control.
Node-level optimization matters too: the choice of model, temperature setting, and output format all affect the response. Prompts remain important, but in a workflow they are only one part of what gets optimized.
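As a small illustration of a node's four dimensions and of the "code structure" edge representation, here is a hypothetical two-node workflow. The `llm` interface, the node names, and the review convention are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Node:
    name: str
    model: str          # which LLM backs this node
    temperature: float
    prompt: str         # the node's task-specific prompt
    output_format: str  # e.g. "json", "markdown", "code"

def run_workflow(llm: Callable[[str, str, float], str],
                 nodes: Dict[str, Node], task: str) -> str:
    """Code-structured workflow: plain Python control flow plays the role of edges."""
    writer, reviewer = nodes["writer"], nodes["reviewer"]
    draft = llm(writer.model, f"{writer.prompt}\nTask: {task}", writer.temperature)
    review = llm(reviewer.model, f"{reviewer.prompt}\nDraft:\n{draft}",
                 reviewer.temperature)
    # Logical judgment and loops live directly in code, which is what makes
    # this representation the most flexible of the three.
    if "LGTM" not in review:
        draft = llm(writer.model,
                    f"{writer.prompt}\nRevise this draft using the review.\n"
                    f"Draft:\n{draft}\nReview:\n{review}",
                    writer.temperature)
    return draft
```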
As the number of nodes grows, the search space expands rapidly, so the optimization strategy must balance efficiency with scale. In short, optimizing an agent is not just about tweaking wording; it is about systematically building and tuning a structure whose parts can complete tasks together.
Tool optimization
Unlike a traditional single-turn dialogue model, an agent can plan over multiple turns and call external tools. Tool optimization has therefore become a key lever for improving agent performance: the goal is to let the agent select, call, and combine tools more efficiently, reduce latency, and improve decision accuracy and task completion.
Tool optimization splits into two directions: learning to use tools and creating new tools.
Learning to use tools can be done by imitating human demonstrations or by reinforcement learning from feedback. The former learns tool use via behavioral cloning; the latter keeps adjusting the policy based on environmental or human feedback. Language models can also improve their tool-calling decisions with reasoning techniques such as Chain-of-Thought or Tree-of-Thought, further refining the order and manner of calls based on the model's own outputs.
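A minimal sketch of a single tool-call decision in this spirit: reason first, then decide whether and which tool to call, then fold the observation back in. The JSON protocol and the `llm`/`tools` interfaces are assumptions, not any particular system's API.

```python
import json
from typing import Callable, Dict

def call_with_tools(llm: Callable[[str], str],
                    tools: Dict[str, Callable[[str], str]],
                    question: str) -> str:
    """One tool-use decision: reason step by step, then decide whether and
    which tool to call. Prompt format and interfaces are illustrative."""
    decision = llm(
        "Reason step by step, then on the LAST line output only a JSON object with keys "
        '"use_tool" (true/false), "tool" (name or null), "tool_input", "answer".\n'
        f"Available tools: {list(tools)}\nQuestion: {question}"
    )
    parsed = json.loads(decision.strip().splitlines()[-1])  # keep only the JSON line
    if parsed.get("use_tool") and parsed.get("tool") in tools:
        observation = tools[parsed["tool"]](parsed.get("tool_input", ""))
        # Feed the tool result back so the model can compose the final answer.
        return llm(f"Question: {question}\nTool result: {observation}\nFinal answer:")
    return parsed.get("answer", "")
```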
Beyond learning existing tools, some systems can generate new tools automatically for a task. For example, ToolMakers first generates functions and then tests and packages them automatically; CREATOR introduces a closed "create-decide-execute-correct" loop; CRAFT extracts small reusable tools and combines them to handle complex problems. These methods show that tools themselves can be generated and evolved.
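A hedged sketch of that generate-test-correct pattern, written in the spirit of ToolMakers and CREATOR rather than copied from them; the `llm` callable and the `tool(x)` signature are assumptions.

```python
from typing import Callable, List, Tuple

def make_tool(llm: Callable[[str], str], task_description: str,
              tests: List[Tuple[str, str]], max_attempts: int = 3) -> str:
    """Generate a reusable tool (Python source text), then test and correct it in a loop."""
    code = llm(f"Write a Python function `tool(x: str) -> str` that solves: "
               f"{task_description}. Return only the code.")
    for _ in range(max_attempts):
        namespace: dict = {}
        try:
            exec(code, namespace)            # load the generated tool
            failures = []
            for x, expected in tests:        # automatic testing against known cases
                got = namespace["tool"](x)
                if got != expected:
                    failures.append((x, got, expected))
        except Exception as err:             # generation or execution failed outright
            failures = [("<exception>", str(err), "")]
        if not failures:
            return code                      # tests pass: the tool is ready to package
        # Correct: feed the failures back and regenerate.
        code = llm("Fix this function so the tests pass.\n"
                   f"Code:\n{code}\nFailures: {failures}\nReturn only the code.")
    return code
```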
To evaluate how effectively tools are used, researchers have proposed full measurement frameworks. The first question is whether a tool needs to be called at all, then which tool is most suitable, and finally how efficient retrieval and invocation are. Metrics include call accuracy, selection accuracy, ranking ability, and cost-benefit ratio. For complex tasks, it also matters whether the sequence of tool calls is reasonable, the plan coherent, the language clear, and the logic consistent.
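Two of those metrics, call accuracy and selection accuracy, are easy to state precisely; the record fields below are assumed names, not a standard schema.

```python
from typing import List

def tool_use_metrics(records: List[dict]) -> dict:
    """Each record is assumed to hold: needed_tool (bool), called_tool (str or None),
    correct_tool (str or None)."""
    # Call accuracy: did the agent call a tool exactly when one was needed?
    call_acc = sum((r["called_tool"] is not None) == r["needed_tool"]
                   for r in records) / len(records)
    # Selection accuracy: when a tool was needed and called, was it the right one?
    relevant = [r for r in records if r["needed_tool"] and r["called_tool"] is not None]
    select_acc = (sum(r["called_tool"] == r["correct_tool"] for r in relevant)
                  / len(relevant)) if relevant else 0.0
    return {"call_accuracy": call_acc, "selection_accuracy": select_acc}
```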
Some evaluation frameworks also stress the quality of behavior planning: the model must not only pick the right tools but also summarize intermediate results sensibly and plan the next step. Overall, tool optimization is not just function selection; it is about how the agent thinks and acts systematically within a task, and that determines whether it can really solve real-world problems.
Using large models as optimizers
This chapter covers something very interesting: using large models themselves as optimizers. In the past we mostly had them generate answers; now people are starting to let them try, modify, and optimize, for example adjusting prompts, designing task workflows, and even deciding how agents divide labor and collaborate.
Most optimization methods used to follow mathematical routines: if a gradient exists, use it, as in gradient descent; if not, rely on trial and error, as in Bayesian optimization. But these methods struggle with natural language, which has no fixed structure or formula. Language models are good at handling such "messy but human-understandable" inputs, so they have become a different kind of optimizer, one that uses language and context to adjust repeatedly and improve results gradually without an explicit objective function.
This kind of optimization rarely happens in one step; it runs as a "try, evaluate, try again" cycle. Early on, some approaches simply generated large numbers of prompts by random search and kept the ones that performed well, but that is too expensive. Later methods began to simulate "directional adjustment", for instance by consulting the history of previous edits or having a language model propose the next modification, a sort of language-level stand-in for a descent direction where no real gradient exists.
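A sketch of that "try, evaluate, try again" cycle with simulated directional adjustment: the model sees past candidates and their scores and proposes the next one. The `llm` and `score` callables are assumptions, and this is a generic illustration rather than any specific published method.

```python
from typing import Callable, List, Tuple

def llm_optimize(llm: Callable[[str], str], score: Callable[[str], float],
                 seed: str, rounds: int = 10) -> str:
    """Try -> evaluate -> try again, with the score history standing in for a gradient."""
    history: List[Tuple[str, float]] = [(seed, score(seed))]
    for _ in range(rounds):
        # Show the model past candidates and their scores, best last, so it can
        # infer a "direction" to move in even though no real gradient exists.
        history.sort(key=lambda pair: pair[1])
        trajectory = "\n".join(f"score={s:.3f}: {c}" for c, s in history[-5:])
        candidate = llm(
            "Here are previous candidates and their scores (higher is better).\n"
            f"{trajectory}\nPropose one new candidate likely to score higher. "
            "Return only the candidate."
        )
        history.append((candidate, score(candidate)))
    return max(history, key=lambda pair: pair[1])[0]
```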
Another way to cut costs is to build a surrogate model: first predict which changes are likely to help, then verify only a small number of them for real. This saves money and effort, but it only works as well as the surrogate is accurate.
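A minimal sketch of the surrogate idea, assuming a cheap scorer and an expensive one: rank everything with the cheap predictor, then spend the real evaluation budget only on the shortlist.

```python
from typing import Callable, List

def surrogate_filter(candidates: List[str],
                     cheap_score: Callable[[str], float],
                     expensive_score: Callable[[str], float],
                     verify_top_k: int = 3) -> str:
    """Use a cheap surrogate to predict promising candidates, then verify only a few."""
    ranked = sorted(candidates, key=cheap_score, reverse=True)  # surrogate pass
    shortlist = ranked[:verify_top_k]                           # small verification budget
    return max(shortlist, key=expensive_score)                  # real evaluation on the few
```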
The hyperparameters of this optimization process also matter: whether to add "momentum", how many candidates to try per round, how to summarize feedback, and so on. For now these are mostly tuned by experience, with no unified standard. Hence a new direction: let the language model optimize its own optimization, so-called "meta-optimization". The model looks back at what it did before, learns from it, and does better next time.
The time dimension matters too. Most optimization methods make one-shot adjustments, but an LLM can optimize continuously across rounds, much like an RNN unrolled over time, getting better step by step. Some work even frames the whole optimization process as a state machine or a game, letting it respond to change more flexibly.
Although many of the results are impressive, the theoretical foundations have not caught up. Some researchers try to explain where LLM optimization ability comes from, pointing to in-context learning or to the computational properties of the Transformer architecture itself; others do interpretability work to see what is happening inside the model. These explanations are still incomplete, and in uncertain environments the model remains poor at exploration, at "trying new things."
Simply put, language models are no longer just "answer machines"; they are slowly becoming systems that can think, make mistakes, and improve. The way they optimize is not a traditional mathematical solution but closer to how humans work: try something, see how it goes, think about what to change, and get better with each revision. The approach has great potential, and much remains to be explored.
Offline and online self-evolution of agents
Agent self-evolution follows two paths: online optimization and offline optimization.
Online optimization happens while the agent is running, using real-time feedback to adjust its behavior continuously. For example, the model can check its output for errors right away and try to correct them (Reflexion, Self-Refine), and multiple agents can communicate and collaborate to improve their shared understanding of the task (MetaGPT, ChatDev). The reward mechanism is updated dynamically during execution and parameters adapt to the environment, without human intervention. This gives the agent the ability to learn while in use and to respond immediately to environmental change.
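A minimal sketch of such an online self-correction loop, in the spirit of Reflexion and Self-Refine rather than their actual implementations; the `llm` callable and the "OK" convention are assumptions.

```python
from typing import Callable

def self_refine(llm: Callable[[str], str], task: str, max_rounds: int = 3) -> str:
    """Generate, self-check, and correct during operation (online optimization)."""
    answer = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_rounds):
        critique = llm(f"Task: {task}\nAnswer:\n{answer}\n"
                       "List concrete errors, or reply exactly 'OK' if there are none.")
        if critique.strip() == "OK":
            break  # the loop stops as soon as the output passes its own check
        answer = llm(f"Task: {task}\nPrevious answer:\n{answer}\n"
                     f"Critique:\n{critique}\nRewrite the answer to fix these issues:")
    return answer
```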
Offline optimization is more like structured training. It relies on high-quality data and a predetermined training plan to systematically improve the model's generalization. It includes large-scale fine-tuning, retrieval-augmented methods to strengthen memory, and adjustments to the reward function so it better matches the real goal. Offline optimization emphasizes stability, giving the agent a solid and reliable foundation before it faces deployment.
Each has its strengths and weaknesses: online optimization is fast but can go off track; offline optimization is solid and stable but less flexible. More and more systems combine the two into a hybrid optimization framework.
In this hybrid strategy, the agent first builds a solid foundation through offline training; then, during real operation, it adjusts and optimizes its strategy autonomously; finally, it periodically "writes back" these online improvements into the main model to maintain long-term performance. The whole process is a cycle: pre-training, adjustment in practice, then consolidation.
This mechanism lets the agent improve continuously like a human while staying stable like a machine, and it is an important path toward handling complex tasks.