OpenAI's Shunyu Yao: Welcome to the Second Half of AI!

An OpenAI researcher offers a deep reading of new trends in AI development and where the field is headed.
Core content:
1. AI development has entered a new stage: from training models to defining problems
2. Reinforcement learning finally generalizes, a milestone breakthrough for AI
3. In the second half, AI researchers need to change how they think and move closer to the perspective of a product manager
Abstract: We are at AI's halftime.
For decades, AI has focused primarily on developing new training methods and models. This strategy has paid off: from beating world champions at chess and Go, to outperforming most humans on the SAT and the bar exam, to winning gold medals at the International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). Behind these historic milestones—Deep Blue, AlphaGo, GPT-4, and a host of models starting with “o”—are fundamental innovations in AI approaches: search, deep reinforcement learning (RL), scaling, and reasoning. And things keep getting better over time.
So what’s suddenly different now?
In three words: RL finally works. More precisely: RL finally generalizes. After several major inflection points and a series of milestones, we have arrived at a working recipe that uses language and reasoning to solve a wide range of RL tasks. Even a year ago, if you had told most AI researchers that a single recipe could handle software engineering, creative writing, IMO-level math, mouse-and-keyboard operation, and long-form question answering, they would have laughed it off as fantasy. Each of these tasks is extremely difficult, and many researchers spend their entire PhDs on just one narrow slice of them.
Yet, all this happened.
So, what comes next? The second half of AI, starting now, will shift the focus from solving problems to defining them. In this new era, evaluation will become more important than training. We no longer just ask, “Can we train a model to solve problem X?” Instead, we ask, “What should we train AI to do, and how do we measure real progress?” To succeed in the second half, we'll need a shift in mindset and skill set, one closer to that of a product manager.
First Half
To understand the first half, look at its winners. Which do you think are the most influential AI papers so far?
I tried the quiz from Stanford's CS224N course, and the answers were not surprising: Transformer, AlexNet, GPT-3, and so on. What do these papers have in common? They each propose some fundamental breakthrough for training better models. And yet, they too got published by demonstrating some (significant) improvement on some benchmark.
However, there is one underlying commonality: these “winners” are training methods or models, not benchmarks or tasks. Even arguably the most influential benchmark, ImageNet, has less than a third of the citations of AlexNet. Elsewhere, the method-benchmark split is even more stark—for example, the main benchmark for Transformers is WMT’14, whose workshop report has about 1,300 citations, while Transformers have over 160,000 citations.
This illustrates the game of the first half: the focus is on building new models and methods, while evaluation and benchmarking are secondary (though necessary to make the whole system work).
Why? A big reason is that in the first half of AI, the methods were harder and more exciting than the tasks. Creating a new algorithm or model architecture from scratch—think breakthroughs like the backpropagation algorithm, convolutional networks (AlexNet), or the Transformer used in GPT-3—required extraordinary insight and engineering ability. In contrast, defining tasks for AI often felt simpler: we just turned things that humans already do (like translation, image recognition, or chess) into benchmarks. That required little insight and not even much engineering.
Methods also tend to be more general and more broadly applicable than a single task, making them particularly valuable. For example, the Transformer architecture ended up driving advances in computer vision (CV), natural language processing (NLP), reinforcement learning (RL), and many other fields — far beyond the single dataset where it initially proved itself (WMT'14 translation). A great new method can continue to improve on many different benchmarks because it is simple and general, so its impact often goes beyond a single task.
This game has been going on for decades and has inspired world-changing ideas and breakthroughs, which show up as steadily rising benchmark performance across fields. So why is the game changing now? Because the accumulation of these ideas and breakthroughs has finally produced a working recipe for solving tasks.
The Recipe
What is the recipe? Its ingredients, unsurprisingly, include large-scale language pre-training, scale (in data and compute), and the idea of reasoning and acting. These may sound like buzzwords you hear every day in San Francisco, but why call them a recipe?
We can understand this through the lens of reinforcement learning (RL), which is often considered the “end game” of AI — after all, in theory, RL is guaranteed to win at games, and in practice it is difficult to imagine superhuman systems (such as AlphaGo) without RL.
In reinforcement learning, there are three key components: algorithms, environment, and prior knowledge. For a long time, reinforcement learning researchers have focused mainly on algorithms (e.g., REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO, etc.) — the intellectual core of agent learning — while treating the environment and prior knowledge as fixed or minimized. For example, the classic textbook by Sutton and Barto focuses almost exclusively on algorithms, with little reference to the environment or prior knowledge.
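To make these three components concrete, here is a toy sketch (my own illustration, not tied to any specific library or paper) of a two-armed bandit in Python, where the environment, the algorithm, and the prior each appear as a separate, swappable piece:

```python
import random

# Toy illustration of RL's three ingredients: the environment (what the
# agent interacts with), the algorithm (how it selects actions and updates
# value estimates), and the prior (what it believes before any interaction).
# Purely hypothetical example code.

class BanditEnv:
    """Two-armed bandit: arm 1 pays off more often than arm 0."""
    def step(self, action):
        payoff_prob = [0.3, 0.7][action]
        return 1.0 if random.random() < payoff_prob else 0.0

def run(prior=None, steps=200, epsilon=0.1, lr=0.1):
    env = BanditEnv()
    # Prior knowledge: initial value estimates. Classic RL starts from
    # scratch (all zeros); a good prior already prefers arm 1.
    q = list(prior) if prior else [0.0, 0.0]
    total = 0.0
    for _ in range(steps):
        # The "algorithm": epsilon-greedy action selection ...
        a = random.randrange(2) if random.random() < epsilon else q.index(max(q))
        r = env.step(a)
        # ... plus an incremental value update (a TD-style rule).
        q[a] += lr * (r - q[a])
        total += r
    return total / steps

print("average reward, no prior  :", run())
print("average reward, good prior:", run(prior=[0.0, 1.0]))
```

Most classic RL research varies only the middle piece, the update rule inside `run`, while the environment stays a toy and the prior stays empty.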
However, in the era of deep reinforcement learning, it has become clear that the environment matters empirically: the performance of an algorithm is often highly dependent on the environment in which it was developed and tested. If you ignore the environment, you might build an "optimal" algorithm that only performs well in a toy environment. So why don't we first identify the environment we actually want to solve, and then find the algorithm that works best for it?
This is exactly what OpenAI initially planned to do. It built Gym, a standard interface for RL environments across all kinds of games, and then the World of Bits and Universe projects, which tried to turn the internet or a whole computer into such an environment. Not a bad plan, is it? Once we turn every digital world into an environment and solve it with clever RL algorithms, we get digital artificial general intelligence (AGI).
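For reference, the interaction loop that Gym standardized looks roughly like the sketch below. This uses the classic pre-0.26 Gym API (later Gym and Gymnasium releases changed the `reset` and `step` return values slightly), with a random policy standing in for a learned one:

```python
import gym  # classic OpenAI Gym API (pre-0.26)

# One episode of CartPole with a random policy, showing the standard
# observation -> action -> reward loop that Gym defined across games.
env = gym.make("CartPole-v1")
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()           # replace with a learned policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
env.close()
print("episode return:", episode_return)
```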
A nice plan, but it didn’t quite work. OpenAI made great progress down this path, using reinforcement learning to solve Dota, robotic hand manipulation, and more. But it never came close to solving computer use or web navigation, and RL agents trained in one domain didn’t transfer to another. Something was missing.
It wasn’t until GPT-2 or GPT-3 came along that it became clear that the missing piece was prior knowledge. You need strong language pre-training to distill general common sense and language knowledge into the model, which can then be fine-tuned to become a web (WebGPT) or chat (ChatGPT) agent (and change the world). It turns out that the most important part of reinforcement learning may not even be the reinforcement learning algorithm or the environment, but the prior knowledge, which can be acquired in ways completely unrelated to reinforcement learning.
Language pre-training creates good prior knowledge for chatting, but not for controlling computers or playing video games. Why? These domains are far from the distribution of internet text, and simply doing supervised fine-tuning (SFT)/reinforcement learning on these domains generalizes poorly. I noticed this problem in 2019, when GPT-2 came out, and I did SFT/RL on top of it to solve text-based games - CALM was the world's first agent built with a pre-trained language model. But this agent required millions of reinforcement learning steps to continuously improve in one game, and it was not transferable to new games.
While this is exactly the nature of RL and not surprising to RL researchers, I found it odd, because we humans can pick up a new game and do significantly better zero-shot. That gave me my first aha moment: we are able to generalize because we can choose to do more than just “walk to chest 2”, “open chest 3 with key 1”, or “kill the dungeon monster with the sword”; we can also choose to think things like “the dungeon is dangerous and I need a weapon to fight. There is no visible weapon, so maybe I need to find one in a locked chest or cabinet. Chest 3 is in chest 2, so I’ll go there first and open it”.
Thinking, or reasoning, is a strange kind of action: it does not directly affect the external world, yet the space of reasoning is open-ended and combinatorially infinite. You can think about a word, a sentence, a whole paragraph, or 10,000 random English words, and the world around you does not immediately change. In classical RL theory, this is a terrible deal that makes decision making nearly impossible. Imagine you need to choose between two boxes, one containing $1 million and the other empty; your expected payoff is $500,000. Now imagine I add an infinite number of empty boxes to choose from; your expected payoff drops to zero.
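Spelled out in expected-value terms (assuming the box is chosen uniformly at random), the thought experiment is just:

```latex
% Two boxes, one holding $1{,}000{,}000:
\mathbb{E}[\text{payoff}] = \tfrac{1}{2}\cdot 1{,}000{,}000 + \tfrac{1}{2}\cdot 0 = 500{,}000
% Add k empty boxes (k + 2 boxes in total, still chosen uniformly):
\mathbb{E}[\text{payoff}] = \frac{1{,}000{,}000}{k+2} \xrightarrow{\;k\to\infty\;} 0
```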
But by adding reasoning to the action space of any RL environment, we leverage the language pre-training prior to enable generalization, and we gain flexible test-time compute for different decisions. This is a pretty amazing thing, and I apologize for not fully explaining it here; it probably deserves a dedicated blog post. You can read ReAct for the original story of reasoning for agents, and for my thinking at the time. For now, my intuitive explanation is: even though you now face infinitely many extra empty boxes, you have seen boxes like these in all kinds of games throughout your life, and choosing among them has prepared you to pick the box with the money in any given game. My abstract explanation is: language enables generalization through reasoning in agents.
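Here is a minimal sketch of what "reasoning as an action" can look like in an agent loop, in the spirit of ReAct but with hypothetical interfaces (`llm.generate` and `env.execute` are placeholder names, not a real API): a "think" step only extends the agent's own context, while an "act" step actually touches the environment.

```python
# Hypothetical ReAct-style loop: "think:" actions only append to the
# agent's context and leave the world untouched; "act:" actions are
# executed in the environment and return an observation.

def react_loop(llm, env, task, max_steps=20):
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        step = llm.generate(context)           # e.g. "think: ..." or "act: open chest 2"
        context.append(step)
        if step.startswith("think:"):
            continue                           # reasoning: no effect on the external world
        observation, done = env.execute(step)  # acting: the world changes
        context.append(f"Observation: {observation}")
        if done:
            break
    return context
```

Because the thinking steps live in language, a strong pre-trained prior makes them sensible even in games the agent has never seen.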
Once we have the right RL prior (language pre-training) and the right RL environment (language reasoning added to the action space), it turns out the RL algorithm may be the least important part. Hence the o-series, R1, deep research, computer-use agents, and much more on the horizon. What an ironic turn of events! RL researchers had long obsessed over algorithms while paying almost no attention to environments, let alone priors; every RL experiment essentially started from scratch. It took us decades of detours to realize that perhaps our priorities were exactly backwards.
But, as Steve Jobs said: you can’t connect the dots looking forward; you can only connect them looking backwards.
Second Half
This recipe completely changes the game. To recap, the game of the first half was:
• We develop new training methods or models to continuously improve on benchmarks.
• We create harder benchmarks and continue the cycle.
This game is being broken because:
• The recipe essentially standardizes and industrializes benchmark climbing without requiring much new thinking. Because the recipe scales and generalizes well, your new method for a specific task might improve it by 5%, while the next o-series model might improve it by 30% without explicitly targeting it.
• Even as we create harder benchmarks, they are solved soon (and increasingly soon). My colleague Jason Wei created a beautiful chart that visualizes this trend nicely:
So what is left in the second half? If new methods are no longer needed and harder benchmarks are being solved faster and faster, what should we do?
I think we should fundamentally rethink evaluation. This means not just creating new and harder benchmarks, but fundamentally questioning existing evaluation setups and creating new ones, so that we are forced to invent benchmarks that go beyond the existing playbook. This is hard, because humans have inertia and rarely question basic assumptions: you take them for granted without realizing they are assumptions, not laws.
To illustrate this inertia, suppose you invented one of the most successful evaluations in history, based on human exams. In 2021 that was a very bold idea, but three years later it is saturated. What would you do? Most likely create a harder exam. Or suppose you solved easy programming tasks. What would you do? Most likely find harder programming tasks to solve, until you reach IOI gold-medal level.
Inertia is natural, but here’s the thing: AI has beaten world champions at chess and Go, outperformed most humans on the SAT and the bar exam, and achieved gold medal performance at the International Olympiad in Informatics (IOI) and the International Mathematical Olympiad (IMO). Yet the world hasn’t changed much, at least from an economic and gross domestic product (GDP) perspective.
I call this the utility problem, and I think it is the most important problem in artificial intelligence.
Maybe we’ll solve the utility problem soon, maybe not. Either way, the root of the problem may be surprisingly simple: our evaluation setting differs from the real-world setting in many fundamental ways. Here are two examples:
Evaluation should be run automatically
In this standard setup, the agent receives a task input, acts autonomously, and then receives a task reward. But in reality, the agent has to interact with a human throughout the task: you wouldn't send customer service one super long message, wait 10 minutes, and then expect a single detailed reply that solves everything. Questioning this setup has produced new benchmarks that either bring real humans into the loop (such as Chatbot Arena) or simulate users (such as tau-bench).
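As an illustration, a user-simulated evaluation (in the spirit of tau-bench, though this is a hypothetical sketch rather than its actual code) makes the evaluation loop itself multi-turn, with a simulated user on the other side:

```python
# Illustrative only: a user simulator (another LLM or a scripted policy)
# talks to the agent over multiple turns, and the whole interaction is
# scored at the end. All interfaces here are assumed, not a real API.

def evaluate_with_simulated_user(agent, user_sim, scenario, max_turns=10):
    transcript = []
    user_msg = user_sim.open(scenario)           # e.g. "I need to change my flight..."
    for _ in range(max_turns):
        agent_msg = agent.respond(transcript, user_msg)
        transcript.append(("user", user_msg))
        transcript.append(("agent", agent_msg))
        user_msg, finished = user_sim.reply(transcript)
        if finished:
            break
    # The reward depends on the whole interaction, not one final answer.
    return scenario.check(transcript)
```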
The evaluation should be independent and identically distributed (iid)
If you have a test set of 500 tasks, you run each task independently, average the per-task metrics, and get an overall score. But in reality, you solve tasks sequentially, not in parallel. A Google software engineer (SWE) gets better at solving problems in Google's codebase as they grow familiar with it, while a software-engineering agent solves many problems in the same codebase without gaining any such familiarity. We clearly need long-term memory methods (and they exist), but academia has neither suitable benchmarks to justify the need nor the boldness to question the i.i.d. assumption that underlies machine learning.
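The gap between the two protocols is easy to see in code. Below is a hypothetical sketch (`agent.solve` is an assumed interface, not an existing benchmark harness) contrasting the standard i.i.d. average with a sequential run where memory carries over:

```python
# Hypothetical sketch: i.i.d. evaluation vs. sequential evaluation with
# cross-task memory. `agent.solve(task, memory)` is an assumed interface.

class Memory:
    """Toy cross-task memory: accumulates past tasks and outcomes."""
    def __init__(self):
        self.history = []

    def update(self, task, score):
        self.history.append((task, score))

def evaluate_iid(agent, tasks):
    # Each task is solved in isolation; per-task scores are averaged.
    scores = [agent.solve(task, memory=None) for task in tasks]
    return sum(scores) / len(scores)

def evaluate_sequential(agent, tasks):
    # Tasks are solved in order; familiarity (e.g. with a codebase)
    # accumulates in memory and can help on later tasks.
    memory, scores = Memory(), []
    for task in tasks:
        score = agent.solve(task, memory=memory)
        memory.update(task, score)
        scores.append(score)
    return sum(scores) / len(scores)
```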
These assumptions have “always” been this way, and in the first half of AI it was fine to build benchmarks under them, because when intelligence was low, raising intelligence generally raised utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is:
• We develop new evaluation settings or tasks for real-world utility.
• We solve them with the recipe, or augment the recipe with new components. The cycle continues.
This game is hard because it is unfamiliar. But it is exciting. While players in the first half solved video games and exams, players in the second half get to build companies worth billions or even hundreds of billions of dollars by building genuinely useful products. While the first half was full of incremental methods and models, the second half filters them out to some extent: the general recipe will simply crush your incremental method, unless you create new assumptions that break the recipe. Then you can do truly transformative research.