Real, trained large language model agents are coming soon

Written by Iris Vance
Updated on: July 9, 2025

Core content:
1. The major advances of OpenAI DeepResearch and Claude 3.7 Sonnet in the field of agents
2. The difference between large language model agents and workflow systems and their limitations
3. The historical background and future prospects of agent research


Intelligent agents are everywhere today. Yet one of the most critical advances in the field of large language model agents has received little attention.

In January 2025, OpenAI released DeepResearch, a specialized version of o3 for web and document search. Thanks to "reinforcement learning training on these browsing tasks," DeepResearch acquired the ability to plan its search strategy, cross-referencing sources and relevant expertise based on intermediate feedback. Claude 3.7 Sonnet appears to have successfully applied the same approach to code: on its own, the model outperforms the elaborate orchestration built around previous models when handling complex sequences of programming tasks.

In short, as William Brown puts it: “Large language model agents are capable of performing long, multi-step tasks.”

This progress raises a question: what exactly is a large language model agent? Last December, Anthropic proposed a new definition: agents are systems in which "LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks."

In contrast, the more common form of agentic system is what Anthropic calls a workflow: the orchestration of "LLMs and tools through predefined code paths." The recently popular Manus AI fits this definition perfectly, and all the tests I ran on it over the weekend surfaced the same basic limitations of workflow systems, limitations already evident in the AutoGPT era, especially for search:

  • They cannot plan; they often get stuck midway with no idea how to proceed.

  • They cannot retain information; they struggle to sustain a task for more than 5 to 10 minutes.

  • They cannot act effectively over long horizons; sequences of actions routinely fail as errors accumulate.

This post starts from this new, more demanding definition of large language model agents. It does its best to summarize what we know so far, combining the limited information released by the large labs, emerging replications in the open research ecosystem, and a dose of personal speculation.

The bitter lesson of simple large language model agents

The concept of an agent is almost fundamentally at odds with that of a base language model.

In classic agent research, agents operate in constrained environments: you are in a maze, you can move in this direction but not that one, and you cannot fly, dig underground, or vanish. You are bounded by the laws of physics and, possibly, the rules of the game. Within those constraints, a practical agent still enjoys some degree of freedom, since there is usually more than one way to solve a game. Yet every move has to be conceived with the end goal in mind: winning and collecting the final reward. Efficient agents gradually memorize past actions and carefully distill them into action patterns and heuristics.

This process is called "search." It is a very apt metaphor: an agent exploring a maze behaves exactly like a web user clicking through a search engine. Search has been studied for decades. Notably, the algorithm rumored to power OpenAI's new generation of models, Q-star (still unconfirmed...), is said to derive from A*, a search algorithm from 1968. The recent Pokémon training experiments run with PufferLib are a good illustration of the process: we literally watch agents searching for the best path, failing, and trying again and again.

PufferLib’s Pokémon reinforcement learning experiment

The base language model works almost exactly the opposite way:

  • Agents memorize the environment they are in. Base models do not: they can only react to the information available within their context window.

  • Agents operate under bounded rationality. Base models generate any text that is possible: while this may yield reasoning that is consistent in practice, nothing guarantees it, and the model can drift off course at any moment for purely aesthetic reasons.

  • Agents can develop long-term strategies. If designed well, they can plan actions ahead of time or backtrack to previous steps. Language models handle single-step reasoning tasks but quickly saturate on multi-step reasoning. More generally, they are constrained by the rules of text, not by the rules of physics or the rules of a game.

A naive way to combine large language models with agents is to simply predefine their outputs through pre-written prompts and rules. This is the approach taken by most LLM agent systems today, and it is bound to run into Richard Sutton's Bitter Lesson. The Bitter Lesson is sometimes misread as a guide to pre-training language models; it is really mostly about agents, and about the temptation to bake hard-coded knowledge into models. If you see a wall, avoid it and go another way. If you see too many walls, backtrack. This is great in the short term: you see improvements immediately, and you do not have to run the algorithm forever to see the effects. In the long run, however, you are bound to end up with suboptimal solutions, or stuck in situations nobody anticipated:

We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

Now let's apply this lesson to the way large language models are currently used in production. Workflows like Manus, or your usual LLM wrapper, are precisely "building knowledge in": they steer the model through a series of pre-written prompts. This may be the most sensible solution in the short term, since you do not have to retrain the model, but it is not the optimal one. What you end up creating is a hybrid of generative AI and rule-based systems, a collection of "simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries."

Let's be clear: if Manus AI cannot book a flight correctly, or cannot offer sound advice on fighting a tiger bare-handed, it is not because it was poorly designed. It is simply running into the bitter lesson. Prompts do not scale. Hard-coded rules do not scale. You need to design, from the ground up, systems that can search, plan, and act. You need to design true large language model agents.

Reinforcement Learning + Reasoning: The Secret to Success

This is another difficult problem. There is very little public information. Anthropic, OpenAI, DeepMind, and a few other labs hold relevant knowledge. So far, we can only rely on a small amount of official information, unofficial rumors, and some limited open research attempts.

  • Like classical agents, large language model agents are trained with reinforcement learning. There is a "maze": all the potential words that could be used to describe something. There is a way out, or "reward." The process of checking whether you got the reward is called a verifier, and that is exactly what William Brown's new verifiers library is about. Currently, verifiers work best on formally checkable results, like math equations or programming sequences. However, as Kalomaze has shown, it is entirely possible to build verifiers for outputs that are not strictly verifiable, by training specialized classifiers. Here we get a major twist: language models are better at evaluating than at creating. So even a small LLM used as a judge can yield significant gains in performance and in overall reward design.
  • Large language model agents are trained by generating drafts and evaluating those drafts in their entirety. This was not an obvious choice, as early research focused instead on extending search across the sequence of individual tokens. Computational limits are a major factor, but the recent breakthroughs around "reasoning" models have also played a big role; they might more accurately be called draft-generation models. A typical reasoning training run has the model generate its own logical sequences, under the assumption that the chains producing the correct answer are more likely to be correct overall. This can produce counterintuitive results (the best example being DeepSeek R1-Zero occasionally switching between English and Chinese). However, in typical bitter-lesson fashion, reinforcement learning only cares about what works, and it will not hesitate to take unorthodox or unplanned shortcuts if they pay off. Like a classical agent lost in a maze, a language model must find its way out through a pure exercise of reasoning: no predefined hints, no directions to guide it, just rewards and ways of getting them. A bitter solution to a bitter lesson.
  • The drafts generated by large language models are pre-formatted into structured data sections, which makes reward verification easier and, to some extent, simplifies the overall reasoning process. This is a form of rule-design engineering that can be handled directly in the reward function or, as I suspect is more common in large-lab training setups, through an initialization phase at the start of post-training.
  • Large language model agents typically require multi-step training over a large number of drafts. This is especially true for search: we do not evaluate the result of a search in one shot, but rather the model's ability to access a resource, fetch the result, elaborate on it, fetch another resource, elaborate again, change plans, backtrack, and so on. For this reason, the current method of choice for training large language model agents is DeepSeek's Group Relative Policy Optimization (GRPO), especially in combination with vLLM for text generation. A few weeks ago, I published a popular notebook based on William Brown's work that managed to run GRPO on an A100 GPU provided through Google Colab. This reduction in computational requirements is an important factor that will drive the popularity of reinforcement learning and agent design in the coming years. A minimal sketch of how verifiers, structured drafts, and group-relative rewards fit together follows this list.
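To make these ingredients concrete, here is a minimal sketch, in plain Python, of a rule-based verifier over structured drafts combined with a GRPO-style group-relative advantage. The `<reasoning>`/`<answer>` tag names, the reward values, and the toy drafts are illustrative assumptions, not the format used by any particular lab.

```python
import re
import statistics

# Hypothetical draft format: the model is asked to emit
# <reasoning>...</reasoning><answer>...</answer>. Tag names are illustrative.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def format_reward(draft: str) -> float:
    """Small bonus for respecting the predefined structure of the draft."""
    return 0.5 if ANSWER_RE.search(draft) else 0.0

def correctness_reward(draft: str, reference: str) -> float:
    """Verifier for formally checkable outputs: exact match on the final answer."""
    match = ANSWER_RE.search(draft)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: each draft is scored relative to its own group,
    so no separately trained value network is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# A group of 4 drafts for the same prompt (real setups use more, e.g. the 16 mentioned below).
drafts = [
    "<reasoning>7*6=42</reasoning><answer>42</answer>",
    "<reasoning>7*6=43</reasoning><answer>43</answer>",
    "42",  # right answer but wrong format: the verifier cannot parse it
    "<reasoning>6+6+6+6+6+6+6</reasoning><answer>42</answer>",
]
rewards = [format_reward(d) + correctness_reward(d, "42") for d in drafts]
print(rewards)                          # [1.5, 0.5, 0.0, 1.5]
print(group_relative_advantages(rewards))
```

Because each draft is scored against the mean of its own group, this setup needs no separately trained value model, which is part of why it is comparatively light on memory.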

Wait… How do we achieve scale?

These are the basic building blocks. Now, there is still some distance between these building blocks and OpenAI’s DeepResearch and other emerging agents that can handle long sequences of actions. Allow me to speculate a little.

Most open RL and reasoning research focuses on math, because it turns out we have a large collection of math problems, some of them present in the Common Crawl dataset and extracted by HuggingFace with classifiers (FineMath). For many other areas, especially search, we do not have the data, because what we need are actual sequences of actions: logs, click records, behavior patterns. Not long ago I worked on log analysis. The models of the time (still Markov chains, but hey, the field moves fast...) were often still trained on data leaked from AOL in the late 1990s (!). More recently, the field has gained at least one key open dataset: Wikipedia Clickstream, an anonymized record of the paths users take from one Wikipedia article to another. Now let me ask a simple question: is this dataset available on HuggingFace? No. In fact, there is almost no real agentic data on HuggingFace at all, that is, data that could teach a model to plan. The whole field still operates on the assumption that large language models have to be orchestrated by custom rule-based systems. I am not even sure OpenAI or Anthropic hold such data in sufficient quantity. On this front, traditional tech companies have a strong advantage, and there is no easy alternative: you cannot buy a giant dataset of Google user queries (unless it somehow leaks onto the dark web).

There is a way around this: generating the data directly through simulation. Classic reinforcement learning models do not need past examples; they infer constraints and workable policies through extensive, repeated search. Applied to search, the typical reinforcement learning setup is not so different from game RL: let the model search freely and reward it every time it finds the right answer. This can be a very long process. Say you need to find a very specific chemistry experiment buried in a forgotten Soviet paper from the 1960s. Through pure brute-force search, perhaps forced to vary the language of its queries, the model will eventually stumble onto the right result. And if it can then summarize the factors that led it there, the probability of finding similar results in the future increases.

Let's do some math. In a typical RL design, say Group Relative Policy Optimization (GRPO), you might be processing 16 drafts at once, and I would not be surprised if the models trained by large labs used far more drafts per iteration. Each draft might in turn browse at least 100 different pages. That is roughly 1,600 potential queries, and that is just... one step. A complex RL training run can take hundreds of thousands of steps (I think this is one reason the process is now closer to a mid-scale training run) and requires a wide variety of examples, especially for a task as complex as general search. What you are looking at is a single training run requiring hundreds of millions of individual connections, which along the way could amount to a distributed denial-of-service (DDoS) attack on some popular academic resource. This is... not ideal. Bandwidth, not raw compute, becomes your main limiting factor.
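To make the back-of-the-envelope arithmetic explicit, here is the same calculation in a few lines of Python; every number below is an illustrative assumption taken from the paragraph above, not a figure reported by any lab.

```python
# Illustrative numbers only, mirroring the estimate above.
drafts_per_step = 16          # GRPO group size
pages_per_draft = 100         # pages browsed per draft, a lower bound
training_steps = 200_000      # "hundreds of thousands of steps"

queries_per_step = drafts_per_step * pages_per_draft
total_queries = queries_per_step * training_steps

print(f"{queries_per_step:,} queries per step")        # 1,600 queries per step
print(f"{total_queries:,} queries per training run")   # 320,000,000 queries per training run
```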

Game reinforcement learning faces similar limitations. This is why advanced frameworks like PufferLib "wrap the environment so that, from the perspective of the learning library, it looks like an Atari game, without losing generality": the reinforcement learning model only sees what it needs to use. Applied to search, this could mean leveraging the large Common Crawl dataset and serving the data as if it came over the network, complete with URLs, application programming interface (API) calls, and other typical Hypertext Transfer Protocol (HTTP) trappings, while the data actually sits in a local data frame that can be queried very quickly.
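As a rough illustration of that idea, the sketch below wraps a fixed, locally stored crawl snapshot behind an interface that looks like search and HTTP fetches to the agent. Everything here is an assumption for illustration: the class names, the toy keyword scoring, and the column layout of the corpus.

```python
from dataclasses import dataclass

import pandas as pd  # the corpus lives in a local data frame, as described above


@dataclass
class SimulatedResponse:
    url: str
    status: int
    body: str


class SimulatedWeb:
    """Serves a fixed crawl snapshot as if it were the live web."""

    def __init__(self, corpus: pd.DataFrame):
        # corpus columns assumed: "url", "text"
        self.corpus = corpus

    def search(self, query: str, k: int = 10) -> list[str]:
        """Toy keyword search; a real simulator would use a proper index."""
        terms = query.lower().split()
        scores = self.corpus["text"].str.lower().apply(
            lambda text: sum(term in text for term in terms)
        )
        top = self.corpus.assign(score=scores).nlargest(k, "score")
        return top["url"].tolist()

    def fetch(self, url: str) -> SimulatedResponse:
        """Looks like an HTTP GET to the agent, but never leaves the local machine."""
        hits = self.corpus[self.corpus["url"] == url]
        if hits.empty:
            return SimulatedResponse(url=url, status=404, body="")
        return SimulatedResponse(url=url, status=200, body=hits.iloc[0]["text"])
```

A production-grade simulator would swap the keyword scoring for a real index (BM25 or embeddings) over the crawl slice, but the interface the agent sees could stay the same.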

So, I expect that a typical large language model RL agent for search could be trained in the following way:

  • Create a large-scale simulation of web search from a fixed dataset, with the data continuously "served" back to the model.
  • Lightly pre-train the model with some form of supervised fine-tuning (SFT) (as in DeepSeek's SFT-RL-SFT-RL pipeline), perhaps based on whatever existing search patterns can be found. The general idea is to pre-format the inferences and outputs so as to speed up the actual reinforcement learning training: a kind of predefined rule-design engineering.
  • Prepare more or less complex queries and use the associated results as verifiers. My best guess is that this involves some complex synthetic pipeline based on back-translating existing resources, or perhaps just very expensive annotation work by PhD-level annotators.
  • Run the actual reinforcement learning training in multiple steps. The model receives a query, launches a search, gets the results, and can then browse a page or reformulate the results, all in successive steps. From the model's perspective it is really browsing the web, while all of these exchanges are prepared in advance by the search simulator in the background (a rough rollout sketch follows this list).
  • Perhaps, once the model gets good enough at search, run another round of reinforcement learning and supervised fine-tuning, this time focused on writing the final synthesis. Again, I expect this to involve a complex synthetic pipeline in which outputs become inputs: chopping existing long reports into small pieces and applying some reasoning to stitch them back together.
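Putting the pieces together, one multi-step rollout against such a simulator might look roughly like the sketch below. It is pure speculation mirroring the steps listed above; the `policy` object, the text-based action format, and the `verifier` callable are all hypothetical.

```python
def rollout(policy, sim_web, query: str, verifier, max_steps: int = 20) -> float:
    """One simulated search episode: from the model's perspective it is browsing
    the web, but every exchange is served locally by the search simulator."""
    context = [f"QUERY: {query}"]
    for _ in range(max_steps):
        # The policy emits a single text action, e.g. "SEARCH soviet chemistry 1960s",
        # "FETCH https://example.org/paper", or "ANSWER <final synthesis>".
        action = policy.act("\n".join(context))
        if action.startswith("SEARCH "):
            urls = sim_web.search(action.removeprefix("SEARCH "))
            context.append("RESULTS: " + ", ".join(urls))
        elif action.startswith("FETCH "):
            page = sim_web.fetch(action.removeprefix("FETCH "))
            context.append(f"PAGE {page.url} ({page.status}): {page.body[:2000]}")
        elif action.startswith("ANSWER "):
            # Only the final synthesis is scored; intermediate browsing earns nothing directly.
            return verifier(action.removeprefix("ANSWER "))
        else:
            context.append("ERROR: unknown action")  # malformed actions just waste a step
    return 0.0  # ran out of steps without answering
```

In GRPO terms, many such rollouts for the same query would form one group, and their returns would then be normalized exactly as in the advantage sketch earlier.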

You will no longer give prompts to the agent

Finally, we have a true agent model. How does it change the actual application compared to the standard workflow or model orchestration? Is it just better overall quality? Or is it a completely different paradigm?

Let's go back to Anthropic's definition: a large language model agent "dynamically directs its own processes and tool usage, maintaining control over how it accomplishes tasks." I will again take a use case I am most familiar with as an example: search.

There has been a lot of speculation about the demise of Retrieval-Augmented Generation (RAG), to be replaced by large language models used directly with long context windows. This has not happened, for several reasons: long context is computationally expensive, not very accurate beyond relatively simple lookups, and offers poor traceability of inputs. True search agents built on large language models will not make RAG obsolete either. What may actually happen is that RAG gets largely automated, absorbing all the fiddly machinery of vector stores, routing, re-ranking, and so on. A typical search process might then proceed roughly as follows (a rough sketch follows the list):

  • Analyze the query, break it down, and make some assumptions about user intent.
  • If the query is unclear, immediately return a clarifying prompt to the user (OpenAI's DeepResearch already does this).
  • The model can then run a general search or, where appropriate, go straight to more specialized research resources. It has memorized the standard API schemas and can call them directly. To save inference time, the model may prefer to rely on existing "mocked" versions of the web: APIs, site maps, and the vast ecosystem of structured web data.
  • Search sequences are learned and trained. The model can abandon a wrong direction, or branch onto other paths the way a professional knowledge worker would. Some of the most impressive results I have seen from OpenAI's DeepResearch demonstrate this ability: through a chain of internal inferences, it manages to correctly locate sources that are poorly indexed.
  • These steps and routings are recorded as internal reasoning traces, providing a degree of explainability.
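As a rough illustration of that flow, the outer loop of such an agent-driven search could be organized as below. The `agent` interface, the routing heuristic, and the trace format are all hypothetical; the point is only that clarification, routing, exploration, and synthesis become decisions the model makes itself, logged as a reasoning trace.

```python
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    query: str
    trace: list[str] = field(default_factory=list)  # internal reasoning trace, kept for explainability

    def log(self, step: str) -> None:
        self.trace.append(step)


def run_search(agent, session: SearchSession) -> str:
    # 1. Analyze the query, break it down, and surface ambiguities early.
    analysis = agent.analyze(session.query)
    session.log(f"analysis: {analysis.intent}")
    if analysis.needs_clarification:
        return agent.ask_user(analysis.clarifying_question)

    # 2. Route: general web search, or a specialized resource whose API schema
    #    the model has memorized and can call directly.
    source = "scholarly_api" if analysis.is_research_topic else "general_search"
    session.log(f"routing to {source}")

    # 3. Search, read, and backtrack as needed, then synthesize with the trace attached.
    findings = agent.explore(session.query, source=source, session=session)
    return agent.synthesize(findings, trace=session.trace)
```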

In short, the search process is engineered by the agent itself. The LLM agent takes the existing search infrastructure as it is and does its best to find the best path through it. No additional data preparation is required upfront, and there is no need to train users to interact with generative AI systems. As Tim Berners-Lee emphasized more than a decade ago: "One way to think about [the agent] is that in every case the program does exactly what the user would want it to do if specifically asked."

Now, to get a clearer picture of what a real large language model agent in production would look like, apply the same approach to other domains. A true network engineering agent would likewise interact directly with existing infrastructure: generating device configurations (routers, switches, firewalls) from requirements, analyzing network topology and recommending optimizations, or parsing error logs to determine the root cause of network issues. A true finance agent would be trained to convert seamlessly and accurately between competing data standards (say, from ISO 20022 to MT103). Currently, none of these tasks can be accomplished with a set of system prompts.

Currently, only the large labs have the capability to develop true large language model agents. They hold all the key ingredients: the expertise, some of the data (or at least methods for synthesizing it), and an overall vision for turning their models into products. I am not sure this concentration of capabilities is a good thing, but it is largely fueled by a funding ecosystem reluctant to treat actual model training as a source of real disruption and value creation in the long run.

I generally do not like to overhype things. However, given the huge transformative and value-creating potential of large language model agents, I do think that democratizing the training and deployment of true LLM agents is quickly becoming critical. So: open-source verifiers, Group Relative Policy Optimization (GRPO) training samples, and perhaps soon, complex synthetic pipelines and simulators.

Will 2025 be the year of the intelligent agent? It’s still possible. Let’s wait and see what the final result is.