The First Half of Agent Development: How Environment, Tools, and Context Shape Agents | Chapter 42

Written by
Jasper Cole
Updated on: June 13, 2025
Recommendation

Explore the latest developments and application prospects of Agent technology.

Core content:
1. The definition of an Agent and its place in current technology trends
2. The rapid development of Agent technology over the past two years and its practical value
3. The importance of Context to Agents and how it differs from other products

Yang Fangxian, Founder of 53A, Most Valuable Expert (TVP) of Tencent Cloud

Qu Kai: Agents are definitely the hot topic right now. I have some core questions about Agents that I keep thinking about, and I believe many other people share them. So today we invited Wen Feng, who has long been researching and building Agents, to discuss these issues.

First of all, I would like to ask, how do you define Agent?

Wen Feng: I think the best definition is Anthropic's: an Agent is a program that lets the model use tools based on environmental feedback.

Qu Kai: So what do you think of the recent Agent craze?

Wen Feng: This wave of Agents is very different from the past.

In the wave represented by AutoGPT in April 2023, Agents were more like toys. The demos were very cool, but their practical value was very limited.

After two years of development, this wave of Agents can indeed solve problems in actual work and life scenarios and bring value to everyone.

The first reason for this transition is that the capabilities of the underlying models have improved dramatically, especially once combined with RL. Models like o1 have also given Agents the ability to think over long horizons.

The second reason is that there have been major breakthroughs in the engineering and product aspects of Agent. The main manifestation is that everyone knows better how to build a suitable Context for Agent, so as to better solve the problem.

Qu Kai: How do you understand this Context?

Wen Feng: Context refers to the sum of various information required by the large model to perform tasks.

Specifically, the context of different products is different. Take our product, Sheet0, for example. It is a data agent. Its core goal is to open up the entire data workflow, allowing the agent to automatically complete the entire process of collecting data on the web page, processing data, and taking actions based on the data.

Our Context includes web pages, collected and organized data tables, instructions issued by users, and some SQL generated when analyzing data, etc.

Qu Kai: But what makes the Context in an Agent different? After all, when people build other products, they also collect various information and feed it into the Prompt or use it via RAG.

Wen Feng: The core difference lies in the source of Context.

Taking Sheet0 as an example: with the older RAG approach, many steps require manual intervention. For example, if a web page contains a lot of irrelevant information, someone has to manually extract the valid parts; and if a SQL query is generated along the way, its correctness also has to be manually verified.

But in Agent, this information will be extracted in some automated form without human involvement.

Qu Kai: I see. Recently, we often hear concepts such as Function Call, MCP, A2A, Computer Use, and Browser Use. Can you help us quickly sort out the differences between them?

Wen Feng: These concepts are essentially solving the same problem, which is to enable large models to perform tasks more efficiently through tool calls (Tool Use).

Function Call was first proposed by OpenAI; it lets large models implement Tool Use by calling external functions. However, calling conventions differ from system to system. Just as a +86 mobile number cannot make calls in the United States, moving to another system often means redoing everything, so the approach is not very universal.
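To make this concrete, here is a minimal sketch of what a Function Call tool definition looks like in OpenAI's chat-completions "tools" format; the `get_weather` function itself is a hypothetical example, and other vendors historically used their own variants of this schema, which is exactly the incompatibility described above:

```python
# A minimal OpenAI-style Function Call tool definition.
# The outer shape follows the chat-completions "tools" format;
# get_weather is a hypothetical example function.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}
```

The model never runs the function itself; it only emits a structured call matching this schema, and the surrounding program executes it and feeds the result back.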

MCP was created to solve this problem. Its core value is standardizing tool use (in effect, "unifying the weights and measures" of tool calling), which greatly lowers the barrier. It breaks tasks down into multiple subtasks, each served by modular components with a unified standard, so that in the end everyone can call various tools much more freely.

As for A2A, which Google recently launched, I don't think it provides any new technical solution. It feels more like a KPI project that a large company pushed out to compete for influence over tool use, then recruited a group of partners to promote.

A2A claims to differ from MCP in that MCP only lets Agents call external tools or APIs through function interfaces, while A2A enables interaction between Agents. But in fact there is no essential difference between these two interaction modes: Agents themselves expose function-call interfaces, so MCP can also realize Agent-to-Agent interaction indirectly.

Computer Use and Browser Use refer to letting the big model use the computer and browser as tools. The browser may be one of the most important tools that the big model can currently use.

Qu Kai: After listening to them, I feel that these Tool Use solutions can be divided into two camps. One is Function Call, MCP, and A2A. The logic behind them is to directly use code to solve problems. The other is Computer Use and Browser Use, which will combine some visual recognition or RPA (robotic process automation) solutions to simulate humans to solve problems.

Wen Feng: Yes. But these two schools are not mutually exclusive. For example, you can also use the MCP method to perform Browser Use.

Browser Use essentially lets the Agent interact with web pages through the GUI (Graphical User Interface). Concretely, the large model on the back end may receive a screenshot of the browser, identify the interactive elements in it, compute a coordinate, and then simulate a series of human operations on the front end, such as moving the mouse to that coordinate and clicking, or typing some content, as if the Agent were really using the browser.

But this purely visual solution is far from mature. Adept, a very popular company in 2023 and 2024, did exactly this, and the company is now gone because the problem was too hard.

So in practice, when people do Browser Use today, they usually need MCP as an intermediary: they wrap the browser API into MCP components and then let the Agent complete the subsequent operations in code.
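The code-driven path Wen Feng describes can be sketched as follows. `BrowserTool` is a hypothetical stand-in for a real automation API such as Playwright wrapped behind MCP; nothing here talks to an actual browser, the point is only that the Agent emits structured tool calls by name and selector instead of guessing pixel coordinates:

```python
# Hedged sketch: browser actions exposed as named tools, the way one
# would when packaging a browser API behind MCP. BrowserTool is a
# hypothetical stand-in for a real automation library.
class BrowserTool:
    def __init__(self):
        self.log = []  # action record, useful for showing users the process

    def goto(self, url: str) -> str:
        self.log.append(("goto", url))
        return f"navigated to {url}"

    def click(self, selector: str) -> str:
        # A real implementation resolves the element in the DOM,
        # so no coordinate estimation is involved.
        self.log.append(("click", selector))
        return f"clicked {selector}"

browser = BrowserTool()
TOOLS = {"goto": browser.goto, "click": browser.click}

def dispatch(call: dict) -> str:
    """Execute one structured tool call emitted by the model."""
    return TOOLS[call["name"]](**call["args"])

result = dispatch({"name": "click", "args": {"selector": "#submit"}})
```

Because the call targets a DOM selector rather than a coordinate, it avoids the miscalculated-click failures of the purely visual approach.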

Qu Kai: It is like an agent performing a play for people on the front end. It seems to be simulating human operations, but in fact, it is still driven by code.

But after all, many companies are not yet compatible with MCP, and some companies may even be reluctant to become compatible in order to protect their user data. Will everyone have to use the browser in a way that simulates humans?

Wen Feng: MCP is a standardized interface, so whether a given SaaS product natively supports MCP is not that important. What matters is whether it has an Open API, because an Open API can be wrapped as MCP. In overseas software ecosystems an Open API is basically standard, so MCP has a very wide range of applications.

However, the situation in China is very different, because most domestic companies have not yet opened up an Open API or SDK (software development kit), so there this path is indeed blocked.

Qu Kai: So we can draw a conclusion that if the company can open up various back-end interfaces in the future, we can directly call the tools through code. If it is not supported, we can only solve the problem through vision and simulation of human use of computers.

Wen Feng: Yes. We have tried both solutions. The current visual solution is not stable or accurate enough; for example, given a screenshot containing a form-submit button, the LLM often miscalculates the button's coordinates. Its advantage, though, is that it is low-cost and fast, and consumes at least an order of magnitude fewer tokens.

Therefore, these two solutions have their own advantages and disadvantages, and can be used in combination. As for how to combine them to be more efficient, developers need to adjust the ratio according to actual needs, because each Agent wants to solve different problems.

Qu Kai: I remember a few weeks ago in the United States, an algorithm engineer working on Agents asked me a question. He didn't understand why Manus used Browser Use: in his view, as long as you can connect to the back-end code, every problem can be solved directly, with no need to put a browser window on the front end.

How would you answer his question?

Wen Feng: When we design the Agent, a key issue is how to create a "credible atmosphere" for users so that users can trust the results generated by the Agent.

In order to achieve this, a very important means is to allow users to see the entire process of the Agent performing the task in an easy-to-understand way.

The browser is a naturally more human-friendly way of presentation, which is far more vivid and intuitive than a black window like a code interface.

Qu Kai: What solutions do Devin, Manus, and GenSpark use respectively?

Wen Feng: Devin and Manus are both mixed solutions of Coding and Computer Use.

As for GenSpark, I used it to run some tasks and felt that it might also call some web APIs in the backend, but the frontend did not expose the web page usage process to users through the browser window like Devin or Manus did.

From this perspective, I feel that GenSpark may not quite meet my expectations for the Agent experience.

Qu Kai: But from the user's perspective, isn't it enough to solve the problem in the end? Why should we care whether the Agent backend is running anything or whether it can use a computer or browser like a human?

Wen Feng: This is a very good question.

The core of this issue is to make users feel that they are in control of everything at all times, because everyone has a sense of insecurity, and making everything transparent is the key to building a sense of security.

For example, suppose you are my boss and assign me a task. If we want to build a trusting relationship, I may have to let you see how I do things and understand my general approach. Only when you know me well enough will you trust me.

Qu Kai: This makes sense. In essence, people think that the agent is not ready yet and is not reliable, so they need to see the process of it performing tasks and participate in its execution from time to time by answering questions.

Then I think the current market discussion and understanding of Agent is actually very similar to the LLM wave two years ago. At that time, many people were discussing whether the future would belong to the general AGI model, the vertical field model, or the small model developed by startups themselves, etc.

Now people are beginning to discuss whether Agent will eventually become universal or vertical. What do you think about this issue?

Wen Feng: I think we are now in, and will remain in, an era of vertical agents for a long time.

Recently I like to use cooking as an analogy. Many people can cook, but when we cook, we mostly just take out our phones, open a recipe app, and follow the recipe step by step.

A better agent is like a chef in a five-star hotel. He has received years of professional training and does not need a recipe. The dishes he makes are delicious and many times better than ours. So he is a chef, and we are just ordinary people who can cook.

Qu Kai: I understand. At least in the past six months, the two hottest tracks in the market that have received the most money are Agent and AI Coding. Will AI Coding and Agent with Coding as the core end up in the same place?

I originally thought that these two tracks had nothing to do with each other, but I increasingly feel that they are likely to come together in the future, because many agents are now using AI Coding solutions.

Wen Feng: And AI Coding is also saying that coding is the infrastructure of everything (laughs).

Qu Kai: Yes, haha. I even saw a news report a few days ago saying that coding may also be the foundation of future AGI.

Theoretically speaking, AI Coding and Agent may eventually reach the same destination. To give an extreme example, if we want to use a browser, we can actually let AI Coding directly create a browser and then use it by itself, right?

Wen Feng: Theoretically it is possible, but the economic and time costs of that approach are far too high.

AI Coding can only be said to be a powerful tool for large models to perform tasks. This tool has two key problems. First, it is difficult to coordinate with other tools, and second, it is difficult to reuse.

If we use AI Coding to perform tasks directly, it needs to break down the task first, and then write executable programs for each subtask one by one. And every time a new task is encountered, it has to do this from beginning to end, which is very inefficient and costly.

Therefore, the best option for agents is to first see if there are any ready-made tools at hand when solving the task. If there is really no such tool after searching around, then consider using AI Coding to create it on the spot.

Qu Kai: I see. What is the relationship between RL and Agent? How should startups ultimately apply RL?

Wen Feng: The concept of Agent itself originates from RL, so if you don’t understand RL, it will be difficult to understand what Agent is, and it will be difficult to design a good product.

To be a good agent, we must first understand the definition of agent in RL. Agent in RL has three elements:

1) State, corresponding to Context.

2) Action, corresponding to Tool Use.

3) Reward, i.e., the incentive signal: the feedback used to evaluate the effect of each action the LLM takes and to guide its next action.
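The three elements map directly onto an Agent loop. Here is a minimal sketch under toy assumptions: `toy_env_step` and `toy_policy` stand in for a real product environment and an LLM choosing tools based on its Context:

```python
# State ~ Context, Action ~ Tool Use, Reward ~ the incentive signal.
# The environment and policy are toy stand-ins, not a real product.
def toy_env_step(state: int, action: str):
    """Return (next_state, reward): +1 for moving toward the goal."""
    if action == "forward":
        return state + 1, 1.0
    return state, -1.0

def toy_policy(state: int) -> str:
    # A real Agent would be an LLM picking a tool from the action space.
    return "forward"

state, total_reward = 0, 0.0
for _ in range(5):                               # one episode
    action = toy_policy(state)                   # Tool Use
    state, reward = toy_env_step(state, action)  # environment feedback
    total_reward += reward                       # incentive signal accumulates
```

The shape of `toy_env_step`, what states it exposes and what it rewards, is exactly the "environment design" discussed next.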

So for startups, it is very important to build a good "environment" into your product. This environment needs to clearly describe the current state, the actions the Agent can take (the action space), and the definition of a good or bad result.

The action space determines how many nodes there are in the workflow you design.

The reason why you must define the results well is that only in this way can you design an effective evaluation system and incentive mechanism, and then allow the Agent to continuously iterate itself based on dynamic feedback.

If you don’t define the result well, the whole system will not be able to converge. Failure to converge means that the agent is likely to give the user a poor quality result, or present a state of “knowing a little bit of everything but mastering nothing”.

Therefore, I also recommend all agent developers and product designers to read "Reinforcement Learning: An Introduction" by Richard Sutton, the father of reinforcement learning. After reading this book, you will gain a mindset that allows you to constantly think, adjust, and define your environment when designing products.

Qu Kai: How to judge the quality of the environment?

Wen Feng: The key to judging whether an environment is good or not is to see whether this environment can provide an incentive signal based on the results of actions.

From this perspective, an IDE is a good environment: as soon as the Agent generates a piece of code, the code can be run immediately, and if it fails to run, the IDE produces an error message. That error message is naturally an incentive signal.

Qu Kai: I see. Do you think Workflow will be completely replaced by Agent?

Wen Feng: No. I think Workflow and Agent will coexist for a long time.

The essential difference between the two is that Workflow is driven by humans, while Agent is driven by AI.

The advantage of human-driven is stability and reliability, but the disadvantage is that it lacks generalization ability and is relatively rigid. AI-driven is just the opposite. It is more generalized and flexible, and can deal with some problems that you have not thought of in advance, but its disadvantage is that it is highly uncertain and may fail 5 out of 10 times.

Therefore, Agent is suitable for solving the 20% of tasks in the world that are more open and require long-term exploration and trial and error, while Workflow is completely sufficient for the remaining 80% of more daily problems.

Qu Kai: You have been working on Agents for more than a year. Have you accumulated any non-consensus insights?

Wen Feng: I think "Chat" is the most important interactive entrance for Agent.

Because for the Agent, the freedom of user interaction is the most important thing, and its importance is far higher than the accuracy of the interaction.

Once you limit the user's freedom, you are effectively asking the user to adapt to your product, which increases their cognitive burden. A good Agent should be smart enough that users can use it as freely as a carefree child.

Among the existing interaction methods, Chat is the one that can best guarantee the user's freedom of interaction.

Of course, this doesn't mean accuracy is unimportant; rather, accuracy is not a burden the user should bear, but a problem for developers and product designers to solve. The industry already has many ways to improve accuracy: introducing Human-in-the-loop, accumulating user preferences the way Devin and Manus do, or doing more product design, such as asking users questions that guide them to gradually refine a vague need until it becomes specific and executable.

You don't need to design a lot of extra interfaces or pile too many components onto the front end; just push the right components to the user at the right time. Even if you design 200 components, users' needs don't actually differ that much, so each user may only ever use 10 of them. Displaying all 200 would only increase their cognitive burden.

Qu Kai: I agree with your last point. A simple chat box is not necessarily the most efficient way of interaction, but if the chat box can be combined with some scene-recommended UI components, it is indeed a reasonable solution.

However, to achieve this form of interaction, we must first do a good job of intent recognition and determine what the user wants. Intention recognition and context seem to be interdependent. The more context, the more likely the model is to guess the user's intention; conversely, after understanding the user's intention, the model also needs more context to determine what to do to better complete the entire task.

Wen Feng: Therefore, the model itself must be able to determine whether the current context is sufficient. If not, it must obtain more context by calling external APIs or using methods such as RAG.
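The loop Wen Feng describes, judge whether the current Context suffices and fetch more if not, can be sketched as follows. `needs_more_context` and `fetch_more` are hypothetical stand-ins for a model-side judgment and an external API or RAG call:

```python
# Hedged sketch of context acquisition: keep fetching until the model
# judges the Context sufficient, up to a round limit.
def needs_more_context(context: list, task: str) -> bool:
    # A real system would ask the model itself; here, a toy check.
    return task not in " ".join(context)

def fetch_more(task: str) -> str:
    # Stand-in for an external API call or RAG retrieval.
    return f"retrieved document about {task}"

def build_context(task: str, context: list, max_rounds: int = 3) -> list:
    for _ in range(max_rounds):
        if not needs_more_context(context, task):
            break                      # Context judged sufficient
        context.append(fetch_more(task))
    return context

ctx = build_context("quarterly revenue", [])
```

The round limit matters in practice: without it, a model that never judges its Context sufficient would fetch forever.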

Qu Kai: This matter is actually closely related to the intelligence of the model itself and the know-how in the vertical field.

Wen Feng: Yes, in addition, the System Prompts preset by developers in the Agent can also assist in the performance of the model. For example, Cursor and Windsurf have thousands of lines of System Prompts.

Qu Kai: System Prompt actually only works in vertical fields, because if you want to write targeted prompts, you have to know the user's goals, and the more you understand this field, the more accurate the prompts you write may be.

For example, if you want to be an agent specializing in research, you can preset a System Prompt in advance for the research scenario, because each time it performs a task, it can follow the process of searching the web, finding data and related articles, summarizing key information, and finally outputting it to Excel or PPT. Moreover, each step is independent and can be optimized separately.

But if you want to make a general agent, it will be difficult to write a system prompt that adapts to all tasks in the face of the vastly different needs of users. Moreover, each step of a general agent is highly dependent on the result of the previous step, so it is very likely that "one wrong step will lead to all wrong steps", which will reduce the accuracy of the final result.

Wen Feng: Yes. In short, the more context you collect at the beginning, the better.

Qu Kai: So I remember that Apple would record the webpage you just visited before opening a certain webpage. In fact, this is collecting context. Including the memory system that OpenAI just released recently, it is also essentially building a context.

I had dinner with Zhang Yueguang a few weeks ago, and he also made a particularly good point.

He said that the moment you click on an app, a large amount of context is actually provided. For example, if you click on Meituan, you probably want to order takeout, and if you click on Didi, you want to take a taxi, so the product design of these apps is based on these contexts.

Then, when users use your app, more context will continue to be generated, such as what content they input, what operations they performed, etc. All this information combined can help the system more accurately identify user intentions, predict the next step, and even proactively ask questions to guide users to get the desired results.

Wen Feng: Yes. If you want to better understand a person, you need to look at his/her past. Similarly, if you want to better understand the user’s intention, you need to track where he/she came from and what the path was during the process.

It is just like playing Go. The current move is not that important. What is important is that you have to understand how your opponent played the previous 100 moves. Only in this way can you judge the opponent's thinking throughout the game, and then infer his next strategy and make corresponding actions.

Qu Kai: So Google has been saving users’ cache for a long time.

Wen Feng: This is indeed Google's biggest competitive advantage in the AI Native era. That massive trove of user click data can be used for intent recognition in the future.

Qu Kai: Yes. Do you have any other non-consensus understanding of Agent?

Wen Feng: Agent developers also need to solve two trust issues.

First, you have to trust the capabilities of the big model.

If you don't trust the big model, you will fall back on the old rule-based approach and add a bunch of restrictions to the model, such as constantly telling the model through prompts "who you are, what you can only do, what you can't do", etc. But in fact, this artificially limits the generalization ability of the big model, resulting in a significant reduction in the Agent's utilization of the model's intelligence.

Second, you have to think about how to make users trust the results given by the agent through product design.

A particularly good example in this regard is DeepSeek R1. Before R1, when I used some similar products to generate reports, my first reaction after getting the results was often "Is this reliable?", because I didn't know how the report was generated and whether there were any errors in the process.

But R1 allowed me to see the AI's reasoning process for the first time, so I felt more secure psychologically and was more willing to believe the result. Manus actually has a similar mechanism.

Qu Kai: I see. Let's talk about Sheet0. You said earlier that it can automatically complete the entire process of data collection, processing, and taking actions based on the data. Can you give a specific example?

Wen Feng: For example, we can automate the following process: first, we grab the list of startups from the most recent rounds of YC, then find out who the founder of each company is, then further search for their Twitter accounts and follow them, and finally send a private message to establish a connection.

We have achieved 100% accuracy in this process.

We also tried to use Deep Research and Manus to perform this task, but found that they both lost data. Moreover, after Deep Research got the data, it could only generate a report and could not complete the subsequent connection actions like we did. Although Manus has the ability to act, it is dynamically coding at every step, and the process requires constant debugging and adjustment, so it is difficult to guarantee stability and success rate.

Qu Kai: So how did you achieve 100% accuracy?

Wen Feng: We used some AI Coding technologies. But that’s not enough. We also built a lot of small tool modules in advance throughout the process. We have verified these tools in advance to ensure they are easy to use. Every time we get a new task, the model can directly call these modules instead of writing a program from scratch.

The core logic behind this approach is "reuse", which is more efficient and less costly.
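This "reuse first" logic can be sketched as a tool registry consulted before any on-the-spot code generation. The registry contents and `generate_tool_with_ai_coding` are hypothetical illustrations, not Sheet0's actual modules:

```python
# Hedged sketch: check a registry of pre-verified tool modules before
# falling back to generating a new tool from scratch.
VERIFIED_TOOLS = {
    "follow_on_twitter": lambda handle: f"followed {handle}",
    "send_dm": lambda handle, text: f"sent '{text}' to {handle}",
}

def generate_tool_with_ai_coding(name: str):
    # Stand-in for generating and verifying a new tool on the spot,
    # which is slower and costlier than reuse.
    return lambda *args: f"[generated {name}] ran with {args}"

def get_tool(name: str):
    tool = VERIFIED_TOOLS.get(name)
    if tool is None:                  # nothing ready-made: fall back
        tool = generate_tool_with_ai_coding(name)
        VERIFIED_TOOLS[name] = tool   # verified once, reused afterwards
    return tool

out = get_tool("follow_on_twitter")("@some_founder")
```

Registering newly generated tools means the expensive codegen path is paid at most once per tool, which is the reuse economics described above.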

But Manus doesn't work this way. Whenever it encounters a problem, it opens the IDE and writes code from scratch.

That doesn't mean Manus's method is necessarily bad, because there is a trade-off between an Agent's versatility and its accuracy. The more you pursue versatility, the more you depend on the model's generalization ability; but higher generalization brings higher randomness and greater uncertainty in the results. Which mode you choose depends on what kind of Agent you want to build.

Qu Kai: So if you want an agent that is both general and accurate, the team will have to invest a lot of time and energy to manually tinker with various tool components.

Wen Feng: Yes. But not everything needs to be built by hand; sometimes ready-made tools are more cost-effective. A simple process like sending an email is well suited to a hand-built module. But for database-related operations, you certainly cannot write a fresh set of scripts every time; a more reasonable approach is to call them directly through something like MCP.

Qu Kai: What is the difference between Sheet0 and other Agents?

Wen Feng: I differentiate agents by the results they deliver. From this perspective, agents on the market can be roughly divided into two categories.

One type is the Coding Agent, which delivers executable code.

The other category is research agents. GenSpark, Deep Research, and Manus actually all belong to this category. The final result they deliver to users is a report, but they cannot really help you place an order on Meituan or buy something on JD.com.

We are a spreadsheet agent. Compared with other agents, the difference is essentially between "qualitative analysis" and "quantitative analysis".

"Qualitative analysis" is the way many agents solve problems. For example, if you want to get a general understanding of a problem, you can use a tool like Deep Research to generate a report. This report can help you build a perception of the problem, but it cannot give you very precise data.

What we want to solve are those scenarios in life that require precision, so we need to use "quantitative analysis" to solve the problem.

For example, if you want to know a very precise number, you need an accurate data source, and this data source is usually a clear and complete table. What Sheet0 does is to use AI to capture various data from these data sources, summarize these data into a table, and then use this table for the next step of analysis.

We have also solved the problem of model hallucination in engineering to ensure the accuracy of this process.

Qu Kai: Speaking of model hallucination, I suddenly thought, is AI Coding equivalent to the translation and assistant of the big model? If a little AI Coding is introduced into each link, can the accuracy of the results be improved and the problem of hallucination be solved?

Wen Feng: Yes, AI Coding is the “dexterous hands” of big models.

Executing a task with a large model involves many steps, and the accuracy of the final result is the product of the accuracies of all the steps before it. For example, if each step succeeds 90% of the time, then after 10 steps in a row the overall success rate drops to 0.9 to the power of 10, about 35%.
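As a quick sanity check, the compounding arithmetic works out as stated:

```python
# Compounding error: the chain succeeds only if every step succeeds.
step_success = 0.9
steps = 10
overall = step_success ** steps  # each step's result feeds the next
# roughly 0.35, i.e. the 10-step chain succeeds about a third of the time
```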

This is because the next step is executed based on the results of the previous step, and the results of each step are difficult to evaluate, so it is difficult to make corrections in time.

To solve this problem, we can introduce AI Coding in each step, so that the difficult-to-evaluate results can be converted into verifiable codes.

For example, at each step I can generate 10 pieces of code through AI Coding. Since code is easy to verify, it doesn't matter if only half of them are correct: I keep the 5 correct pieces, use them to produce a correct intermediate result, and then move on to the next step. This is how we ensure the accuracy of the final result.
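The filter-by-verification idea can be sketched as follows; the candidate list stands in for samples from an LLM, and the check is a hypothetical known test case:

```python
# Hedged sketch: generate several candidate snippets, keep only those
# that pass an executable check, and proceed with a verified one.
def verify(code: str) -> bool:
    """Run the candidate and check it against a known test case."""
    scope = {}
    try:
        exec(code, scope)
        return scope["double"](21) == 42
    except Exception:
        return False

candidates = [
    "def double(x): return x * 2",   # correct
    "def double(x): return x + 2",   # wrong, caught by the check
    "def double(x): return x ** 2",  # wrong, caught by the check
]
verified = [c for c in candidates if verify(c)]
```

Because verification is cheap relative to generation, wrong candidates cost almost nothing; what matters is that at least one candidate passes.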

MCP actually breaks down the barriers between tool calls through this solution.

Qu Kai: What predictions do you have for the development of Agent in the next few years?

Wen Feng: AI is developing too fast now. Instead of sharing a specific prediction result, I would rather share a thinking framework.

If you want to judge the future development direction of the Agent, the most important thing is to grasp the key variables. As we talked about before, the key to whether the Agent is doing well is whether it can really deliver a good result, and the quality of this result mainly depends on two factors: one is the model capability, and the other is whether you can build a better Context.

Therefore, if Agent wants to make a breakthrough, at least the model needs to be more powerful, or we need to go further in context engineering.

Qu Kai: If you were an investor, what questions would you ask to judge whether an agent company is doing well or not?

Wen Feng: The first thing I would ask is whether anyone in their team has read "Reinforcement Learning: An Introduction" (laughs), because people who have read this book are likely to have the right mindset and be able to build a good product in a very solid way.

In addition, I might ask them how they design the incentive signals in the product, that is, how they evaluate the results. This is a very critical question that determines whether the large model can continue to iterate in a better direction.

Qu Kai: So what is the incentive signal of your product?

Wen Feng: The core of our product is the table generated by AI during the task execution process. "Whether the data in the table is empty" itself is a very intuitive feedback signal.

In addition, as mentioned earlier, we use AI Coding to convert results that are hard to evaluate directly into verifiable code. For example, based on the model's analysis of a page's structure and the relationships between pages, we use AI Coding to generate a script. Whether that script runs successfully, and whether its output meets expectations, is also an incentive signal.
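The table-based signal described above can be sketched as a simple completeness score; the table format and threshold are illustrative, not Sheet0's actual code:

```python
# Hedged sketch of the "are the cells empty?" incentive signal:
# the fraction of non-empty cells in the AI-built table.
def table_signal(rows: list) -> float:
    """Fraction of non-empty cells; 1.0 means a fully filled table."""
    cells = [v for row in rows for v in row.values()]
    if not cells:
        return 0.0
    return sum(1 for v in cells if v not in (None, "")) / len(cells)

rows = [
    {"company": "Acme", "founder": "A. Lee", "twitter": "@alee"},
    {"company": "Beta", "founder": "", "twitter": "@beta_hq"},
]
signal = table_signal(rows)  # 5 of 6 cells filled
```

A signal like this is cheap to compute after every step, which is exactly what makes it usable as continuous feedback rather than a one-off final check.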

Qu Kai: I understand, thank you! Finally, a note: Sheet0 recently opened its waiting list and will soon begin closed beta testing.