Windsurf published: “What is an Agent?” (15,000 words)

Written by Clara Bennett
Updated on: June 20, 2025
Recommendation

Explore the core position and unique value of intelligent agents in the field of AI.

Core content:
1. The popularity of, and confusion around, the concept of agents
2. Windsurf's definition and practice of agents
3. The basic framework and working mechanism of agent systems




" In 2025, the term "agent" has been widely used, but it has gradually lost its precise communication value due to the diversity of its meaning. Because everyone confidently uses it to refer to different things. However, when we explore the core concepts and development history of agents in depth, we will find its important position in the field of AI and its significant differences from previous AI tools. " 


Agent?" class="rich_pages wxw-img" data-ratio="0.5629251700680272" data-type="png" data-w="1764" data-imgfileid="100004129">

 

 


Hello everyone, I am Si07. On Sunday, a friend in the "Awareness Flow" community shared a blog post by Windsurf (a well-known agent application in the AI coding space), in which Windsurf explains "what is an agent" from its own practical perspective. To illustrate the point, Windsurf even put together a careful schematic animation. Let's take a look at what the post says.

This time I took a shortcut: I simply laid out the material and then wrote up my thoughts after reading it, so please bear with me. The following is a full translation of the article:

What is an agent?

Welcome to the 2025 edition of "a term that is so overused that it starts to lose any real meaning in conversation because everyone confidently uses it to refer to something different."

If you are a builder trying to build an agent solution, this article is probably not for you. This article is for those of you who, when you hear someone talk about AI agents in a meeting, boardroom, or conversation, are either (a) not quite sure what an agent is and how it differs from the generative AI capabilities we have seen so far, (b) not quite sure whether the person using the term knows what an agent is, or (c) convinced you knew what an agent was until you read the first sentence of this article.

Although we will reference Windsurf in this article to make some theoretical concepts easier to understand, this is not a sales pitch.

Let’s get started.

The most basic core concepts

To answer the title of this article, an agent AI system can be simply understood as a system that receives user input and then alternately calls the following components:

  • A large language model (which we will call the "reasoning model") that decides what actions to take based on the input, any automatically retrieved context, and the accumulated conversation. The reasoning model outputs (a) text that reasons about what the next action should be, and (b) structured information that specifies the action (which action to take, the values of its input parameters, etc.). The output "action" may also be that there are no actions left to perform.
  • Tools, which need not have anything to do with a large language model, that can perform the various actions specified by the reasoning model, producing results that are incorporated into the information for the next call to the reasoning model. The reasoning model is essentially prompted to choose among the set of tools and actions accessible to the system.

This creates a basic agent loop:
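To make the loop concrete, here is a minimal sketch in Python. It is illustrative only, not Windsurf's implementation: `call_reasoning_model` is an assumed helper that returns either a structured tool call or a final answer, and the recipe tool is a toy example.

```python
# Minimal agent loop (illustrative sketch, not Windsurf's implementation).
# call_reasoning_model() is an assumed helper that returns either a tool call
# (name + arguments) or a final answer, given the conversation and tool list.

def get_recipe(dish: str) -> str:
    """Toy tool: look up a recipe in a recipe book."""
    return f"Recipe for {dish}: ..."

TOOLS = {"get_recipe": get_recipe}

def run_agent(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    while True:
        # 1. The reasoning model decides the next action from the accumulated
        #    conversation (plus any automatically retrieved context).
        action = call_reasoning_model(messages, tools=TOOLS)  # assumed helper
        if action["type"] == "final_answer":
            return action["content"]  # no actions left to perform
        # 2. A tool executes the chosen action; its result feeds the next call
        #    to the reasoning model.
        result = TOOLS[action["tool"]](**action["arguments"])
        messages.append({"role": "tool", "name": action["tool"], "content": result})
```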

Indeed, that's all there is to it. There are different variations of agent systems in terms of how the agent loop is presented to the user, which we'll discuss later, but if you understand that the large language model is being used less as a pure content generator (as in ChatGPT) and more as a tool-selecting reasoning component, then you've already grasped most of the concept.

The term “reasoning” is also overused — in the agent world, it has a very specific meaning. It refers to using a large language model to choose the next action to take, i.e. which tool to call and with what parameters.

Models like OpenAI's o1 also use the term "reasoning", but the meaning there is completely different. For these large language models, "reasoning" refers to chain-of-thought prompting. The idea is that the model outputs intermediate steps before attempting a final answer to a query, trying to think more like a human would rather than relying on the magic of pure pattern matching. For these models, no tools are called (as they are in the agent world); the output of the large language model is simply generated in a way that looks like multiple thinking steps strung together, hence the name "chain of thought".

Another misuse of "agent" is to refer to so-called "AI workflows." For example, one might build an automation or workflow that takes in a raw document, uses a large language model to recognize and extract elements from it, then cleans up those extracted elements, uses another large language model to summarize them, and finally adds the summary to a database. There are multiple large language model calls here, but the large language model is not being used as a tool-calling reasoning engine. Rather than having the large language model decide in real time which tools should be called, we specify ahead of time which large language models should be called and how. This is just an automation, not an agent.
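A hedged sketch of the contrast: in a workflow, the sequence of model calls is fixed in code ahead of time, whereas in an agent the reasoning model picks the next tool at runtime. `call_llm` is an assumed helper and the pipeline steps are illustrative.

```python
# AI workflow: the sequence of LLM calls is fixed in code by the developer.
# call_llm() is an assumed helper; the steps themselves are illustrative.

def document_workflow(raw_document: str, database: list[str]) -> None:
    elements = call_llm(f"Extract the key elements from:\n{raw_document}")  # LLM call #1, always first
    cleaned = elements.strip()                                              # deterministic clean-up step
    summary = call_llm(f"Summarize these elements:\n{cleaned}")             # LLM call #2, always second
    database.append(summary)                                                # always the final step

# In an agent, by contrast, the reasoning model decides at runtime which tool
# to call next (see the run_agent loop sketched earlier); nothing is fixed.
```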

A very simple example to understand the difference between an agent and a non-agent: suppose you ask an AI system to give you a recipe for making pizza. In the non-agent world, you would pass that prompt to a large language model and let it generate the result:

In the agent world, one of the tools the agent might have is to retrieve a recipe from a recipe book, and one of the recipes is for pizza. In this agent world, the system would use a large language model (the reasoning model) to determine that, given the prompt, we should use the "recipe tool" with input "pizza" to retrieve the correct recipe. The tool would then be called, outputting the text of that recipe, and the reasoning model would then use the output of the tool call to determine that there is no more work to do and complete its "loop".
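For this prompt, the reasoning model's output in a single turn might look something like the structure below; the exact schema varies by provider, so this shape is only illustrative.

```python
# One turn of the pizza example: the reasoning model emits (a) reasoning text
# and (b) a structured action, and the system then executes the named tool.
reasoning_step = {
    "reasoning": "The user wants a pizza recipe; the recipe tool can fetch it.",
    "action": {"tool": "get_recipe", "arguments": {"dish": "pizza"}},
}
# After the tool returns the recipe text, the next reasoning step sees that
# output and emits a final answer, completing the loop.
```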

While the distinction may now be clear, you may be asking - what's so interesting about this? It seems to be just a technical detail in the approach.

Well, this is interesting for several reasons:

  • Imagine this example is a bit more advanced. Say "Get a pizza recipe with healthy ingredients, Neapolitan style." It's possible that a non-agent system will get a reasonable result using the capabilities of the generative model, but as the request becomes more detailed and multi-layered, it becomes less and less likely that a single call to the large language model will complete the request accurately. An agent system, on the other hand, might first reason about using a tool that calls the large language model to describe how pizza is made in Naples, then reason about using a tool to do a web search on which ingredients count as healthy, and finally reason about using a tool to retrieve recipes, with information from the first two steps informing the configurable inputs to this final tool. This decomposition into steps should feel natural because it is how we humans work, and it should reduce the variance in potential outcomes because the agent is using tools that we understand better and have more control over. While there is no guarantee of success, this approach is more likely than a non-agent approach to get the AI system to complete the task correctly.
  • The tools we give agents can help make up for the things that large language models are not good at. Remember, large language models are stochastic systems based on natural language patterns. They have no inherent understanding of non-textual concepts. Large language models are bad at math? We can add a calculator tool. Large language models don't know the current time? We can add a system time tool. Large language models can't compile code? We can add a build tool. Now, the reasoning model (the large language model in the agent world) doesn't need to inherently know how to do math, tell the time, or compile code. Instead, it just needs to know when it is appropriate to use a calculator, look up the system time, or try to build source code, and be able to determine the correct inputs to those tools. Knowing how to call these tools is much more feasible and can be based on textual context (a sketch of such tools follows this list).
  • Tools can actually change the state of the world, rather than just providing a text response. For example, in the pizza example, suppose we want the AI to send the pizza recipe to my sister. Maybe the agent has tools to access my contacts and send a text message. The agent will enter a loop — first reasoning about retrieving the recipe, then reasoning about retrieving my sister's contact information, and finally reasoning about sending the text message. The first two steps could probably be achieved with some very smart RAG (retrieval-augmented generation), but the last step? The ability to actually take action rather than just generate text is what makes agent systems potentially very powerful.
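As mentioned in the list above, here is a minimal sketch of what such gap-filling tools might look like on the implementation side; the function names and descriptions are illustrative, not Windsurf's actual tools.

```python
import subprocess
from datetime import datetime

# Tools that cover things large language models are bad at: arithmetic,
# knowing the current time, and compiling code.

def calculator(expression: str) -> str:
    # eval() with no builtins is enough for a sketch; a real tool would use a
    # proper math parser.
    return str(eval(expression, {"__builtins__": {}}, {}))

def system_time() -> str:
    return datetime.now().isoformat()

def build_project(path: str) -> str:
    # Hypothetical build command; the agent only needs to know *when* to call it.
    proc = subprocess.run(["make", "-C", path], capture_output=True, text=True)
    return proc.stdout + proc.stderr

# The reasoning model is shown descriptions like these and chooses among them.
TOOL_DESCRIPTIONS = {
    "calculator": "Evaluate an arithmetic expression and return the result.",
    "system_time": "Return the current date and time.",
    "build_project": "Compile the project at the given path and return the log.",
}
```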

Congratulations, now you know what an agent is! But there is still some background and context that can make you dangerous in conversations about "agents"...

How we got here and why now

Before we get into the mental models we can use to have meaningful conversations, we’ll quickly review how we got here and provide some clarification on the types of AI-based tools in the context of our field, software engineering, so that it doesn’t get completely abstract.

If you remember the world a few years ago, humans actually did work before generative AI tools came along. This work can be represented as a timeline of actions. In software engineering, this might range from doing research on StackOverflow to running terminal commands to actually writing some code:

With the advent of large language models, we are starting to get systems that do specific tasks really well. ChatGPT for answering questions, GitHub Copilot for auto-completing a few lines of code, etc. These tools can be trusted because they satisfy two conditions at once:

  • They solve problems that are important to users (e.g. every developer hates writing boilerplate code, so auto-completing that text is valuable)
  • Large language model technology is good enough to solve these problems at a level that is robust enough for users to trust it for their specific application (e.g. the developer doesn't like the autocomplete suggestions? No problem, they can just keep typing and move on)

The latter is actually quite critical. For years, people have been building impressive demonstrations of systems based on large language models solving extremely complex tasks. However, many of these were just demonstrations and could not be productized and trusted, which led to a disconnect between hype and reality, and the ensuing trough of disillusionment. Take summarizing pull requests, for example. This has obvious value to users (no one likes writing pull request descriptions), but think about the accuracy required for users to trust it over the long term. The first time the AI gets the description wrong? Users will forever check all the files and write the description themselves, thus killing the value of the tool. The bar for robustness required for this use case is very high and probably not met by today's technology. Still, while this large language model technology is not perfect, it is also improving rapidly, so the frontier of complex tasks that can be solved in a sufficiently robust way is also advancing. One day, AI will be able to robustly write pull request descriptions.

I've gotten off topic a bit. Initially, the intersection of "useful" and "possible" was limited to what I call "copilot-like" systems. These are AI systems that are able to solve very specific tasks, like responding to a prompt or generating auto-complete suggestions, with a single call to a large language model. A human is always in the loop reviewing the result and then "accepting" it, so the potential for the AI to get out of control is not a concern. The main issue is hallucination, which refers to the models providing inaccurate results because they are trained on text from the internet (where everyone sounds confident) and lack the knowledge needed to ground their responses in reality (at the end of the day, these models are just super-sophisticated pattern-matching algorithms).

Thus, these copilot-like systems were improved by increasingly powerful retrieval-augmented generation (RAG) approaches, which is a fancy way of saying that these systems would first retrieve information relevant to the user's query, augment the query with that information, and then pass the accumulated information to a large language model for final generation. Giving AI systems that perform specific tasks this kind of access to knowledge defined the first few years of large-language-model-based applications — the "Copilot era":
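A minimal sketch of the retrieve-augment-generate pattern just described; `search_index` and `call_llm` are assumed helpers, not any specific product's API.

```python
# Retrieval-augmented generation: retrieve -> augment the prompt -> generate.
# search_index() and call_llm() are assumed helpers, not a specific product's API.

def rag_answer(query: str) -> str:
    # 1. Retrieve information relevant to the user's query.
    snippets = search_index(query, top_k=5)
    # 2. Augment the query with the retrieved context.
    prompt = "Context:\n" + "\n".join(snippets) + f"\n\nQuestion: {query}"
    # 3. A single LLM call produces the final generation.
    return call_llm(prompt)
```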

These co-pilot-like non-agent systems are the ones that drive long-term, reliable, stable value. However, this does not mean that the concept of “agent systems” is new.

The first popular agent framework, AutoGPT, was actually released in early 2023, not long after ChatGPT. The approach there was to have the agent loop run autonomously: the user only provides a prompt, lets the agent do its thing, and then reviews the results. Essentially, because these systems have access to tools and make multiple large language model calls, they run for longer and are able to accomplish a much larger range of things than copilot-like systems:

However, while AutoGPT remains one of the most popular GitHub repositories of all time, agents created with the framework never really took off. A year later, Cognition launched Devin, billed as a fully functional AI developer that could replace human software developers. Again, a fully autonomous agent system with some very powerful tools, yet one that today can only solve relatively simple problems.

What’s going on? If agents are so powerful, why do users primarily derive value from non-agent RAG-based co-pilot systems, rather than these agent-based systems?

OK, remember that intersection between "useful problems" and "sufficiently robust technology"? This is the general challenge facing these autonomous agent systems. While autonomous agents are the obvious direction the world is headed, today's large language models may not be powerful enough to handle tasks of this complexity without any human involvement or correction.

This reality has given rise to a new approach to agents, based on the realization that there will be some balance between what humans should do and what agents should do. To distinguish them from the autonomous approach, we call these collaborative agents, or AI flows for short.

Tactically:

  • There must be clear ways for a human to observe what the flow is doing as it executes, so that if it veers off course, the human can correct it early on. In other words, reintroduce some of the "human in the loop" collaborative aspects of a copilot-like system.
  • These flows must run in the same environment where humans do their work. Most attempts at autonomous agents, because they work independently of the user, are invoked from a different surface than the one where the user works manually. For example, Devin is invoked from a web page, whereas in reality the developer would be writing code in an IDE. While this may be fine in a world where agents can do anything, by not running in the same place where the user works manually, these autonomous agent systems will not be aware of a lot of the things that humans do manually. As a result, they will miss a lot of the implicit context that comes from those actions. If, on the other hand, the agent runs in an IDE, then it will be aware of recent manual edits, which implicitly inform what the agent should do next.

In other words, in this case, it is important that humans be able to observe what the agent does, but it is equally important that the agent be able to observe what humans do.

Coming back to the intersection between "interesting problems" and "sufficiently robust technology," the threshold of robustness required for the collaborative agent approach is significantly lower than for the autonomous approach, because the human can always correct the AI at intermediate steps, must approve certain of its actions (e.g., executing terminal commands), and reviews changes in real time.

That is why this is the approach taken by all of the broadly accessible agent applications adding value today, such as Windsurf's Cascade, Cursor's Composer Agent, and GitHub Copilot Workspace. In a flow, the human and the agent are always operating on the same state of the world:

We go to so much effort to distinguish autonomous agents from collaborative agents because they really are very different approaches to building "agent systems": different levels of human involvement in the loop, different levels of trust, different ways of interacting, and so on. Because the word "agent" is overused, a lot of discussion about building autonomous agents points to agent systems like Windsurf's Cascade as evidence that agents work, when in reality the two approaches are very different.

How to dissect an "agent system"

OK, finally, here's what you've been waiting for — a quick checklist that synthesizes everything we've covered so you can (a) reason through conversations about "agents" and (b) ask questions that get to the heart of the technology. Many of these questions could be an article in themselves, but I'll try to provide helpful starting points.

Question 1: Is the system in question truly an agent?

Clearly, too many systems are being called "agents" when they are not. Is the large language model used as a tool-calling reasoning model, and are tools actually being called? Or is it just some kind of chain-of-thought reasoning, or something else that uses the same term but means something different?

Question 2: Is it autonomous or collaborative?

Is the approach to have agents working in the background without human involvement, or to have agents that can take multiple steps independently but are embedded in the existing working environment, with a human still in the loop?

If the former, ask the follow-up question — are today's models really good enough to reason over data and tools of the scale and complexity involved, such that users can rely on the agent end to end? Or does the autonomous agent sound great in theory but prove impractical in reality?

Question 3: Does the agent have all the inputs and components it needs to be powerful?

This begins to involve distinguishing the substance of different agent implementations (especially collaborative agents, or flows) that are trying to solve the same task.

Question 3a: What tools does the agent have access to?

Not just a list of tools, but how are those tools implemented? For example, Windsurf’s Cascade takes a unique approach to performing web searches by chunking and parsing website copy. Also, is it easy to add your own unique tools? Approaches like Anthropic’s Model Context Protocol (MCP) aim to standardize a low-friction way to incorporate new tools into existing agent systems.

Question 3b: What reasoning model does the agent use?

It is important to remember to evaluate a large language model on its ability to perform tool invocation, rather than on whether it is the best model on standard benchmarks across a variety of tasks and topics. Just because a model is great at answering coding questions does not necessarily mean it will choose the right tool when solving coding-related tasks in an agentic manner. No single large language model is the best reasoning model for all tasks, and while Anthropic's Claude 3.5 Sonnet has historically been one of the best models for tool invocation, these models are improving rapidly. It is therefore worth asking whether the agent should give users a choice of models.

Question 3c: How does the agent process existing data?

What data sources does the agent have access to? Does the agent's access to those data sources respect the user's existing access-control rules (especially in a world of collaborative agents)? Sometimes the answer isn't as simple as the data source itself — in the case of a code repository, does the agent only have access to the repository the user currently has checked out in the IDE, or can it access information in other repositories to help inform its results? The latter can add value since code is often distributed across repositories, but the questions around access become more important.

Overall, the agent approach changes the paradigm for thinking about the data retrieval problem. In a Copilot-like system, you only had one chance to call a large language model, so you only had one chance to retrieve, which led to increasingly complex RAG systems. In an agent, if the first retrieval returns poor results, the reasoning model can simply choose to do another retrieval with different parameters until it is sure that all the relevant information has been collected to take action. This is very similar to how humans find data. So if the discussion around RAGs, parsing, and intermediate data structures gets too deep, ask - are we overcomplicating the problem in the agent world? More on this later.
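A hedged sketch of that difference: instead of a single, carefully engineered retrieval, the reasoning model can keep reformulating the query until it judges it has enough context. The helpers (`search_index`, `call_reasoning_model_judge`) and the stopping criterion are illustrative assumptions.

```python
# Agentic retrieval: the reasoning model may retry with different parameters
# until it judges that enough context has been gathered. search_index() and
# call_reasoning_model_judge() are assumed helpers.

def agentic_retrieve(task: str, max_rounds: int = 3) -> list[str]:
    gathered: list[str] = []
    query = task
    for _ in range(max_rounds):
        gathered.extend(search_index(query, top_k=5))
        # The reasoning model decides whether the collected context is enough
        # and, if not, proposes a refined query for the next round.
        verdict = call_reasoning_model_judge(task, gathered)
        if verdict["sufficient"]:
            break
        query = verdict["refined_query"]
    return gathered
```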

That said, if the data has structure, it is fair to ask how the information in these data sources is processed. For example, we deal with codebases, and code is highly structured, so we can use clever techniques like abstract syntax tree (AST) parsing to intelligently chunk the code for any tool that tries to reason about or search over the codebase. Smart pre-processing and multi-step retrieval are not mutually exclusive.
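As an illustration of that kind of pre-processing, here is a minimal sketch of AST-based chunking using Python's built-in `ast` module, splitting a file into one chunk per top-level function or class; real implementations are certainly more sophisticated.

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split a Python file into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno / end_lineno are 1-based and inclusive (Python 3.8+).
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```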

Question 3d: How do collaborative agents, or flows, capture user intent?

Are there implicit signals, captured from the human user's manual actions, that would never be explicitly stated? The agent may not know what was said at the water cooler, but by "reading the user's mind", far more valuable and magical experiences can often be created. In our world, this intent can be found in the other tabs the user has open in their IDE, the edits they just made in their text editor, the terminal command they just executed, what they pasted to their clipboard, and so on. This comes back to lowering the activation energy of using the agent: if every detail that could be inferred implicitly has to be explicitly stated at every turn, the user's expectations of the quality of the AI's results will only be higher.

Question 4: What makes the user experience of this agent truly outstanding?

So far, we have discussed the factors that can affect the quality of an agent's results. You may find that conversations about agent systems focus on these factors, but anyone serious about creating an agent system that is actually adopted by users should pay attention to every axis of user experience, even if nothing about the underlying agent changes. Many of these axes of user experience are not easy to build, and therefore require careful thought.

Question 4a: What is the latency of this agent system?

Imagine two agents that do the same thing for a particular task, but one takes an hour and the other only a minute. If you know that both will definitely complete the task successfully, maybe you don't care much about the time difference, because you can do other things in the meantime. But what if there is a chance the agent will not succeed? Then you would much prefer the latter, because you can see the failure sooner, tweak some prompts, give the agent more guidance, and so on. For fully autonomous agents, latency has always been one of the main challenges, because they usually take longer to complete tasks than humans doing the work manually; unless an autonomous agent achieves a very high success rate, it will not be used at all.

There are two reasons to call out latency explicitly. First, agent builders often add complex, slow tools to improve quality without considering the impact on the user experience or weighing this tradeoff. Second, improving latency at every layer of the stack is a very difficult problem: model inference optimization? Prompt construction that maximizes cache hits? Parallelizing work within a tool? Improving latency often requires a completely different set of engineering skills.
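As one concrete reading of the "parallelizing work within a tool" point, independent sub-requests can be issued concurrently rather than sequentially; the `fetch_page` helper below is a hypothetical stand-in.

```python
import asyncio

async def fetch_page(url: str) -> str:
    # Hypothetical async HTTP fetch, stubbed so the sketch stays self-contained.
    await asyncio.sleep(0.1)
    return f"<contents of {url}>"

async def web_search_tool(urls: list[str]) -> list[str]:
    # Fetching pages one by one adds their latencies together; issuing the
    # requests concurrently keeps the tool about as slow as its slowest page.
    return list(await asyncio.gather(*(fetch_page(u) for u in urls)))
```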

Question 4b: How do users observe and guide the agent?

This is a big advantage that collaborative agents have over autonomous agents, but it is rarely trivial to execute. For example, if a coding agent can make multiple edits to multiple files in an IDE, how a developer can effectively review those changes is completely different than looking at a single autocomplete suggestion or reviewing a reply in a chat panel.

Likewise, it takes time for people to accumulate context about the best practices for performing specific tasks in specific environments. What kind of user experiences can you build that allow humans to guide agents with these best practices? For example, Windsurf's Cascade can accept user-defined rules, and it provides simple ways to tag context so users can pass along what they already know. Yes, the goal of any agent is to eventually do everything on its own, but if humans can easily make the agent's job easier, the agent can do higher-quality work faster.

Question 4c: How is the agent integrated into the application?

This all comes down to the polish of invoking the agent and leveraging its output. The popularity of ChatGPT has made the chat panel the default way of invoking any AI-based system. While this may be one way, it does not have to be the only way. For example, Windsurf's Cascade can be invoked by other means, such as a simple button to explain a code block, and context can be passed to Cascade in many ways that do not require copy-pasting text, such as previews that allow console logs and UI components to be sent to Cascade.

Question 4d: How do agent experiences balance with non-agent experiences?

This may be surprising, but not everything needs to be an agent. For example, if a developer is just trying to do a local refactor, they should use some combination of Command and Tab, both of which are non-agent "copilot-like" experiences that are fast and efficient for these tasks. Agents are the new frontier, but we can't treat every problem as a nail just because we have a new hammer! It is often useful to ask, "do we even need to build an agent for this task?"

Again, we’ve only scratched the surface, but this list should help you have conversations about agents, ask questions that get to the heart of the matter, and inject some realism into the ideas.

The Bitter Lesson

But there is one more thing. I've saved it for last because, if there is one question you take away from this post, it should be this: are we violating the "bitter lesson"?

The "bitter lesson" comes from Richard Sutton's essay of the same name (see the appendix to this post). The main (paraphrased) takeaway is that more computing power, more data, and greater overall scale will always eventually outperform any system that relies on human-defined structure or rules. We saw this in computer vision, where CNNs outperformed hand-crafted rules for edge and shape detection. We saw deep search, and then deep neural networks, outperform any rules-based computer system in ever more complex games. Even large language models are an example of this trend, outperforming every "traditional" NLP approach.

With agents, we are again in danger of forgetting this hard-learned lesson. We may think we know more about a specific use case, so we spend a lot of time crafting the perfect prompts, carefully choosing a subset of valuable tools, or trying any number of other ways to inject "our knowledge." Eventually, the models will keep improving and computing power will keep getting cheaper and more plentiful, and all of that effort will be in vain.

Don't fall into the bitter lesson's trap.


 



Appendix: The Bitter Lesson

 

This essay was written by Rich Sutton on March 13, 2019, about the lessons learned from artificial intelligence research. The full translation follows:

The Bitter Lesson (by Rich Sutton)

March 13, 2019

The biggest lesson to be drawn from 70 years of AI research is that general-purpose approaches to exploiting computation are ultimately the most effective, and by a wide margin. The underlying reason for this is Moore’s Law, or the law that the cost per unit of computation continues to decrease exponentially. Most AI research is conducted under the assumption that the amount of computational power available to the agent is fixed (in which case exploiting human knowledge is one of the few ways to improve performance), but over a period slightly longer than a typical research project, there is bound to be a lot more computational power available. To make progress in the short term, researchers try to exploit their human knowledge of the domain, but in the long run, the only thing that matters is how the computation is exploited. The two are not necessarily contradictory, but in practice they often are. Time spent on one is time not spent on the other. There is a psychological investment in one approach or the other. And approaches based on human knowledge tend to complicate the method, making it less suitable for exploiting general-purpose approaches to exploiting computational power. There are many examples of AI researchers belatedly learning this painful lesson, and it is instructive to review some of the most prominent ones.

In computer chess, the approach that defeated world champion Garry Kasparov in 1997 was based on massive, deep search. Most researchers working on computer chess at the time were frustrated by this approach, and they pursued approaches that exploited humans’ understanding of the peculiar structure of chess. When a simpler, search-based approach (with special hardware and software) proved more effective, these human-knowledge-based chess researchers were not very good losers. They said “brute force search” might have won this time, but it was not a universal strategy, and it was not the way humans play chess anyway. These researchers expected approaches based on human input to win, and were disappointed when they did not.

Progress in computer Go follows a similar pattern, albeit with a 20-year delay. The initial big efforts were attempts to avoid search by exploiting human knowledge or special features of the game, but all of these efforts became irrelevant or worse once search was effectively applied at scale. Also important was the use of self-play to learn value functions (this is also true in many other games and even in chess, although learning did not play a large role in the program that first beat the world champion in 1997). Self-play learning and learning itself, like search, enable large amounts of computing power to be put to use. Search and learning are the two most important classes of techniques in AI research that exploit large amounts of computing power. In computer Go, as in computer chess, researchers initially worked to exploit human understanding (so as to reduce the need for search) and later had greater success by embracing search and learning.

In speech recognition, an early competition sponsored by DARPA in the 1970s included many ad hoc methods that exploited human knowledge—knowledge of the vocabulary, phonemes, the human vocal tract, and so on. On the other hand, some newer, more statistical methods based on Hidden Markov Models (HMMs) were more computationally intensive. Again, statistical methods won out over methods based on human knowledge. This led to a major shift in natural language processing, with statistics and computation beginning to dominate the field over the decades. The recent rise of deep learning in speech recognition is the latest step in this ongoing direction. Deep learning methods rely less on human knowledge, use more computational power, and combine learning on huge training sets to produce significantly better speech recognition systems. Just like in the game, researchers always tried to make systems work the way they thought their own brains worked—they tried to bake that knowledge into their systems—but as Moore's Law made massive amounts of computational power available, and found ways to exploit that computational power, this ultimately proved counterproductive and wasted a lot of researchers' time.

A similar situation has occurred in computer vision. Early approaches viewed vision as finding edges, generalized cylinders, or SIFT features, etc. But now all of that has been thrown out. Modern deep learning neural networks use only convolutions and some notion of invariance, and perform much better.

This is an important lesson. As a field, we still have not thoroughly learned it, because we are still making the same kind of mistakes. To see this, and to effectively resist it, we must understand the appeal of these mistakes. We must learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term and is personally satisfying to the researchers, but 3) in the long run it plateaus or even inhibits further progress, and 4) breakthrough progress eventually arrives through the opposing approach, based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often not fully digested, because it is a victory over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general-purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All of these are part of an arbitrary, intrinsically complex, external world. They should not be built in, because their complexity is endless; instead, we should build in only the meta-methods that can find and capture this arbitrary complexity. The key to these methods is that they can find good approximations, but the search for them should be done by our methods, not by us. We want AI agents that can discover as we do, not ones that merely contain what we have already discovered. Building in our discoveries only makes the discovery process harder to see.

 




 

My thoughts

When Windsurf was first released at the end of last year, I unsubscribed from Cursor and used only Windsurf. Although I later switched back to Cursor, it is undeniable that Windsurf is an excellent AI coding product! After reading this Windsurf blog post, I would like to share a few takeaways and thoughts.

This article lays out the components of an agent system, and the criteria for judging one, from Windsurf's perspective. That is, a true agent must use the LLM as a tool-calling reasoning model and must actually be able to invoke tools to perform the corresponding tasks; a system cannot be called an agent simply because it relies on an LLM to generate some content or perform CoT. This clear definition is quite accessible and helps beginners accurately identify and understand what a true agent is, avoiding conceptual confusion and misuse.

Analytical Framework for Agent Systems

The article provides a relatively clear and detailed analysis framework to dissect an intelligent system from multiple dimensions, including whether it is a true intelligent agent, the difference between autonomy and collaboration, the input and components of the intelligent agent, user experience, etc. This framework is like a rigorous mind map, which allows us to systematically analyze the key elements of an intelligent system, rather than looking at the problem in a scattered and one-sided way. For example, when analyzing an intelligent system, we must first determine whether it meets the basic definition of an intelligent agent, and then see whether it is autonomous or collaborative, and further explore its tools, reasoning models, data processing methods, etc., and finally consider its pros and cons from the perspective of user experience. This is a lightweight method, very practical!

Review of key elements of intelligent agent systems

  • Tools and reasoning models: emphasizes the importance of tools to agents and the key role of the reasoning model in calling those tools. The power of an agent depends not only on its own algorithms and models, but also on the tools it has and how it uses them. When building or researching agent systems, we should focus not only on optimizing model performance, but also on the scalability and adaptability of the tools, and on how to call those tools accurately through an effective reasoning model to maximize the agent's effectiveness.
  • Data processing and user intent capture: points out that how the agent processes existing data and how it captures user intent are important factors affecting its performance. Data is the basis for the agent's decisions and actions, while user intent is the compass for the agent's work. By processing data well, the agent can better understand and respond to complex task scenarios; accurately capturing user intent makes the agent's work more aligned with user needs and improves the quality and efficiency of task completion. In developing agent systems, we need to pay full attention to the research and development of data management and intent-recognition technology to improve the intelligence level and practicality of the system. And this resonates with my own practice! The deeper you go into applying agents, the more you appreciate the importance of data and intent. These two are also the upstream key elements for improving the performance of an agent system (upstream here refers to the elements that initially trigger the agent's actions, such as input instructions and data). I once shared a point of view of my own in the community group, and you can help me judge whether it is right: when an agent operates, it must face the "three soul questions" (Who am I, and where am I? What do I have? Where am I going?).

  For the sake of this write-up, I have further refined the views I shared in the community, as follows:

    1. Who am I? Where am I? (Cognition) This question involves the meta-definition of the agent and the cognition of the environment. The specific elements are as follows:

    • Meta-definition: including role definition, rule definition, system constraints, etc., to clarify the responsibilities and permissions of the agent in the system or environment. For example, in an intelligent customer service system, the role of the agent is to provide consulting and problem-solving services to customers, and it must comply with the relevant service rules and policies.
    • Environmental cognition: covers environmental conditions, data entities, context, etc., helping the agent determine its specific location and environmental status. For example, data entities are the basis for agent operations, such as customer information and order data; context covers the background information of the current task. For intelligent customer service agents, conversation context helps them better understand the user's intentions and needs, thereby providing more accurate service.
    • More detailed meta-definition classification: in addition to institutional constraints, factors such as cultural background and organizational structure also have an important impact on the definition of agent roles. For example, agents operating in a cross-cultural environment need to understand the differences in behavioral norms and values across cultural backgrounds in order to better adapt to and integrate into the environment.
    • Dynamic perception and update mechanisms: the agent should be able to dynamically perceive its own identity and location, and adjust its cognition as the environment changes. For example, when the production-line layout or equipment status changes, the production-scheduling agent of a smart factory must quickly perceive the change and update its understanding of the environment and of its own role in order to make accurate scheduling decisions.
    2. What do I have? (Data) This question is related to the agent's cognition of its abilities and resources. The specific elements are as follows:

    • Capabilities: clarify what capabilities the agent possesses, such as text processing, image recognition, data analysis, etc. These capabilities determine what tasks the agent can complete.
    • Tools: understand which tools can be used, such as various software tools, hardware devices, etc. These tools help the agent achieve its goals more efficiently.
    • Data: organize, store, and manage data through data definitions (such as a Schema). Clarifying the structure and format of data helps the agent better understand and use it, providing support for subsequent decisions and actions.
    • Capability evaluation and optimization mechanisms: the agent must not only identify its own capabilities, but also be able to evaluate and optimize them. As the environment and tasks change, the agent must be able to identify shortcomings in its capabilities and improve them through learning, upgrading, etc. For example, an intelligent medical-diagnosis agent can continuously optimize its diagnostic algorithms and knowledge base as medical research progresses and clinical data accumulates, improving diagnostic accuracy.
    • Tool selection and combination strategy: the agent should flexibly select and combine tools according to the dynamics of the task and environment. Different task phases and environmental conditions may require different tool combinations. For example, in complex data-analysis tasks, the agent can flexibly select and combine tools such as data mining, statistical analysis, and machine learning according to factors like data type and analysis objectives, to achieve the best results.
    3. Where am I going? (Intention) This question focuses on the agent's cognition of goals and intentions. The specific elements are as follows:

    • Semantic understanding: through semantic-understanding technology, analyze human language, text, and other expressions, extract the semantic information, and clarify the specific tasks that humans want the agent to complete.
    • Intent mining: the agent should dig into users' latent intentions and needs, rather than only understanding explicit intentions. For example, when a user asks about travel destinations, in addition to providing information about popular attractions, the agent should analyze the user's preferences and history to uncover latent needs around travel experience, cultural characteristics, food, and so on, and provide more personalized and comprehensive travel suggestions.
    • Goal combination: combine the short-term goal of the current task with long-term goals. For example, an intelligent personal assistant should not only complete the user's current scheduling task, but also consider how to help the user achieve long-term career-development plans, health-management goals, etc., providing personalized and diversified assistance.

    Compared with human organizational management, the above is a bit like employee governance. For example, in the "Yang triangle" of the organizational capability model (willing to: employee willingness; able to: employee capability; allowed to: environmental permission), treating the AI as a person is very interesting. So, in the actual operation of an agent, the aspects above are interrelated and influence each other. The agent needs to constantly and dynamically perceive and adjust its roles, capabilities, data, and other factors to better adapt to the environment and meet human needs. Among these, intention, data, and data flow account for a large share of the factors. This is the understanding I have gained from practice, and it is why I said above that I agree that "in the development of agent systems, we need to pay full attention to the research and development of data management and intent-recognition technology to improve the intelligence level and practicality of agent systems."

     

  • User experience: treating user experience as an independent and important part reflects Windsurf's user-centric design philosophy. An excellent agent system should not only be powerful, but also provide users with a better experience, including low latency, easy observation and guidance, good integration, and a reasonable balance between agent and non-agent experiences. When building AI products, user experience cannot be ignored. Starting from users' actual needs and usage experience, we must continuously optimize each aspect of the agent system so that it can serve users better. Windsurf's lightweight breakdown of user experience is easy to learn from and easy to apply.

Realistic thinking on technology application and development

The article also mentions the "bitter lesson". I have attached the original text of "The Bitter Lesson" in the appendix above. (I was really too lazy to summarize this one myself.)

The original article mentioned that in the development of intelligent systems, we cannot rely too much on the rules and knowledge preset by humans, but must follow the trend of technological development and make full use of the continuous progress of large-scale computing, data and models to promote the improvement of intelligent performance. Therefore, in the research and development and application of intelligent technology, we must not only pay attention to current technical means and methods, but also maintain a keen insight into the trend of technological development, and continue to explore and innovate to meet the needs of future intelligent technology development.

          Therefore, when we think about building intelligent systems, should we prioritize how to utilize large-scale computing capabilities rather than trying to limit the behavior of the system through artificially designed rules? Of course, this does not deny the value of human knowledge in system development. Rather, we should focus on how to use computing resources to discover and learn solutions to complex problems, rather than simply encoding our existing cognition directly into the system (which is currently encountering bottlenecks at the model level). This requires us to explore the Scaling Law of AI Agents! This deserves further thinking and exploration. I have also seen this thinking of Windsurf in several leading Agent framework teams in the industry recently. It's very interesting! Maybe this will become an industry trend.