OpenAI Yao Shunyu: The second half of AI will shift from solving problems to defining problems

Written by Iris Vance
Updated on: June 30, 2025
Recommendation

The AI field is about to undergo a major transformation. OpenAI researcher Yao Shunyu offers a deep analysis of the new trends in AI development.

Core content:
1. AI development is shifting from "solving problems" to "defining problems"
2. The technical turning point and challenges brought by the generalization of reinforcement learning
3. How to redesign the evaluation method to reflect the utility and value of the real world


AI's benchmark scores keep climbing, but higher scores may not translate into real-world efficiency gains or broad utility.

The focus in 2025 will no longer be on “how many points the model can get” but on “how much it can do”.

This is gradually becoming a consensus.

Just as the new o3 model was about to be released, OpenAI researcher and ReAct framework proposer Yao Shunyu published an article titled "The Second Half", in which he deeply reflected on the limitations of existing evaluation methods from the perspective of the underlying paradigm.

He pointed out that the reason leaderboards keep being broken is that we have reached a new technological inflection point: a broadly generalizing AI recipe has emerged. By adding "reasoning" to the action space of the RL environment, we can use the prior knowledge gained from language pre-training to generalize, while also flexibly allocating compute at test time.

Because of this, many tasks once considered "still hard" collapse very quickly in the face of scale and large models.

When “capability climbing” becomes predictable and automated, continuing to raise model intelligence is no longer the most critical task. What is really worth asking is how we should define the “task” itself, and how to redesign evaluation so that it reflects real-world utility and value. The shift from “problem solving” to “problem setting” may be the new theme of AI’s second half.

We believe that this article points out the problems that must be solved in the current development of AI, so we compiled it.

The following is the original text:

Second Half


tl;dr: We are at AI's halftime.

For decades, AI research has focused on developing new training methods and models. And it has succeeded: from beating world champions at chess and Go, to outperforming most humans on the SAT and the bar exam, to winning IMO and IOI gold medals. Behind these historic milestones—Deep Blue, AlphaGo, GPT-4, and the o-series—are fundamental innovations in AI methods: search, deep reinforcement learning, scaling, and reasoning. And it’s only going to get better over time.

So is anything suddenly different now?

Three words: reinforcement learning finally works. More precisely: reinforcement learning finally generalizes. After several important detours and a series of milestones, we have finally found a viable recipe that uses language and reasoning to solve a wide variety of reinforcement learning tasks. Even a year ago, if you told most AI researchers that a single recipe could handle software engineering, creative writing, IMO-level mathematics, mouse-and-keyboard operation, and long-form question answering, they would have laughed at your hallucinations. Each of these tasks is extremely difficult, and many researchers spend their entire PhDs focusing on just a small slice of them.

But it still happened.

So what happens next? The “second half” of AI — starting now — will move from solving problems to defining them. In this new era, evaluation becomes more important than training. Instead of just asking “Can we train a model to solve problem X?”, we ask “What should we train AI to do? How do we measure real progress?”

To stand out in the second half, we need to shift our mindset and skill set in time - the new paradigm may be closer to the way a product manager thinks.

First Half


To understand the first half, let’s first look at the winners. Which do you think are the most influential AI papers to date?

I tried the Stanford 224N quiz and the answers were as expected: Transformer, AlexNet, GPT-3, etc. What do these papers have in common? They proposed some fundamental breakthroughs to train better models. And, they also managed to get published by showing some (significant) improvements on some benchmarks.

There is one underlying commonality among them, though: these “winners” are training methods or models, not benchmarks or tasks. Even the most influential benchmark, ImageNet, has less than a third of the citations of AlexNet. Everywhere else the difference between methods and benchmarks is even more dramatic — for example, the main benchmark for Transformers is WMT’14, whose workshop report has about 1,300 citations, while Transformers have over 160,000 citations.

This illustrates the game of the first half: focus on building new models and methods, with evaluation and benchmarking secondary (although necessary to make the system of papers work).

Why? One big reason is that in the first half of AI, the methods were harder and more exciting than the tasks. Creating a new algorithm or model architecture from scratch—think of breakthroughs like the backpropagation algorithm, convolutional networks (AlexNet), or the Transformer used in GPT-3—requires extraordinary insight and engineering. In contrast, defining tasks for AI often feels more straightforward: we just turn tasks that humans have already accomplished (like translation, image recognition, or chess) into benchmarks. Not much insight or even engineering is required.

Methods also tend to be more general and applicable to a wider range of tasks than a single task, which makes them particularly valuable. For example, the Transformer architecture ultimately drove advances in computer vision, natural language processing, reinforcement learning, and many other fields far beyond the single dataset where it initially proved its worth (WMT'14 translation). A great new method is able to achieve excellent results on many different benchmarks because it is simple and general, so its impact often goes beyond a single task.

This game has been running for decades and has inspired world-changing ideas and breakthroughs, reflected in ever-rising benchmark performance across fields. Why has the game changed? Because the accumulation of these ideas and breakthroughs has made a qualitative difference: a general, working recipe for solving tasks has emerged.

Recipe for success


What’s the recipe? Unsurprisingly, it involves large-scale language pre-training, scale (of both data and compute), and the idea of reasoning and acting. These may sound like buzzwords you hear every day in San Francisco, but why call them a recipe?

We can understand this through the lens of reinforcement learning (RL), which is often considered the “end game” of AI — after all, in theory, RL is guaranteed to win games, and empirically, it’s hard to imagine superhuman systems (such as AlphaGo) without RL.

In reinforcement learning, there are three key components: algorithms, environments, and priors. For a long time, reinforcement learning researchers have focused mainly on algorithms (e.g. REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO…) — the intellectual core of agent learning — while treating the environment and priors as fixed or minimal. For example, the classic textbook by Sutton and Barto is all about algorithms and barely touches on the environment or priors.
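
To make the three components concrete, here is a minimal, self-contained sketch of one of the algorithms listed above (REINFORCE) on a toy three-armed bandit. The bandit, learning rate, and step count are illustrative choices, not anything from the original post; the point is only to show which part of the system the "algorithm" is.

```python
# Minimal REINFORCE sketch on a toy 3-armed bandit (illustrative setup only).
# The "environment" is the bandit's payoffs; the "prior" is the (uninformed)
# initial policy parameters; REINFORCE itself is the "algorithm".
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # environment: expected payoff of each arm
theta = np.zeros(3)                      # prior: start from a uniform policy

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                  # sample an action from the policy
    reward = rng.normal(true_means[a], 0.1)     # environment feedback
    # Policy-gradient update: push up the log-probability of the taken action
    # in proportion to the reward received (no baseline, for brevity).
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += 0.05 * reward * grad_log_pi

print("learned action probabilities:", softmax(theta).round(3))
```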

However, in the era of deep reinforcement learning, experience has shown that the environment is crucial: the performance of an algorithm is often highly correlated with the environment in which it is developed and tested. If you ignore the environment, you risk building an "optimal" algorithm that only performs well in simple environments. So why don't we first figure out what environment we actually want to solve, and then find the best algorithm for it?

This is exactly what OpenAI planned in the beginning. It built gym, a standard reinforcement learning environment for various games, and then the World of Bits and Universe projects, which tried to turn the internet or computers into games. It's a great plan, isn't it? Once we turn all digital worlds into an environment and solve it with smart reinforcement learning algorithms, we have digital general artificial intelligence (AGI).
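
For readers who have not used it, this is roughly what the standard environment interface popularized by gym looks like. The snippet below uses the gymnasium package (the maintained successor of the original gym; the original reset/step signatures differed slightly), and CartPole-v1 is just a stock example task, not one of the projects mentioned above.

```python
# The standard agent-environment loop behind gym-style environments.
# Requires `pip install gymnasium`; CartPole-v1 is a stock example task.
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # a random policy, purely for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print("episode return:", total_reward)
```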

The plan was good, but it didn’t fully work. OpenAI made great progress along this path, solving Dota and robotic hand manipulation with reinforcement learning. But it never made much headway on computer use or web navigation, and reinforcement learning agents that worked in one domain didn’t transfer to another. Something was missing.

It wasn’t until GPT-2 or GPT-3 that people realized the missing piece was prior knowledge. You need strong language pre-training to distill general common sense and language knowledge into a model, and then fine-tune it to become a web (WebGPT) or chat (ChatGPT) agent (and change the world). It turns out that the most important part of reinforcement learning may not even be the reinforcement learning algorithm or the environment, but prior knowledge, and prior knowledge can be obtained in a way that is completely unrelated to reinforcement learning.

Language pre-training creates good priors for chatting, but doesn’t work well for controlling computers or playing video games. Why? These domains are far from the distribution of internet text, and naively performing SFT/RL on these domains generalizes poorly.

I noticed this problem back in 2019, when GPT-2 had just come out, and I did SFT/RL on top of it to solve text-based games - CALM was the world's first agent built on a pre-trained language model. But the agent needed millions of RL steps to hill-climb on a single game, and it couldn't transfer to new games.

While this is exactly what RL is all about, and nothing strange to RL researchers, it struck me as odd because we humans can easily pick up a new game and perform significantly better zero-shot. Then I had my first aha moment — we are able to generalize because we can choose to do more than just “go to locker #2” or “open chest #3 with key #1” or “kill the monster with the sword”; we can also choose to think things like “this dungeon is dangerous, and I need a weapon to fight. There are no visible weapons, so maybe I need to look in a locked box or chest. Chest #3 is in locker #2, let me go there first to open it.”

Thinking, or reasoning, is a strange behavior - it has no direct impact on the external world, and yet the space of reasoning is open and combinatorially infinite - you can think about a word, a sentence, a whole paragraph, or 10,000 random English words, and the world around you doesn't immediately change.

In classic RL theory, this is a bad deal and makes decision making impossible. Imagine you need to choose between two boxes, only one of which contains $1 million and the other is empty. You expect to make $500,000. Now imagine I add an infinite number of empty boxes. You expect to make nothing.
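
Spelled out, under the (simplifying) assumption that the chooser has no information and must pick a box uniformly at random:

E[payoff | 2 boxes] = (1/2) × $1,000,000 = $500,000, whereas E[payoff | N boxes] = $1,000,000 / N, which tends to 0 as N grows.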

But by adding reasoning to the action space of any RL environment, we leverage the language pre-training prior for generalization, and we can flexibly allocate test-time compute to different decisions. This is a truly remarkable thing, and I apologize for not fully explaining it here; I probably need to write another blog post about it. Feel free to read ReAct for the original story of reasoning in agents, and for how I felt at the time.

Currently, my intuitive explanation is: even if you added an infinite number of empty boxes, you would have seen them in various games, and choosing these boxes allows you to better choose the boxes with money for any particular game. My abstract explanation is: language generalizes through the agent's reasoning.
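
A highly simplified sketch of that idea, not the actual ReAct implementation: `env` and `llm` are hypothetical interfaces, and the only thing the sketch tries to show is that "think:" steps enlarge the action space without ever touching the external world.

```python
# Sketch of "reasoning as an action" (hypothetical env/llm interfaces, not ReAct itself).
def reasoning_episode(env, llm, max_steps=20):
    """The model may interleave free-form "think:" steps (which never call
    env.step) with "act:" steps, the only ones that change the world."""
    obs, history = env.reset(), []
    for _ in range(max_steps):
        prompt = "\n".join(history + [f"observation: {obs}"])
        step = llm(prompt)                    # model returns "think: ..." or "act: ..."
        history.append(step)
        if step.startswith("think:"):
            continue                          # reasoning action: external state unchanged
        action = step[len("act:"):].strip()   # external action
        obs, reward, done = env.step(action)
        history.append(f"observation: {obs}")
        if done:
            return reward, history
    return 0.0, history
```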

Once we have the right RL prior (language pre-training) and the right RL environment (language reasoning added as an action), the RL algorithm may become the most trivial part. Hence the o-series, R1, deep research, computer-using agents, and more to come. What an irony! For a long time, RL researchers cared far more about algorithms than environments, and no one paid attention to priors - almost all RL experiments started from scratch. It took us decades to realize that perhaps our priorities should have been completely reversed.

But as Steve Jobs said: You can’t connect the dots forward; you can only connect them backward.

Second Half


This recipe completely changed the game. Recall how the game of the first half was played:

  • We develop novel training methods or models that climb the benchmarks.

  • We create harder benchmarks and the cycle continues.


This game is ruined because:

  • The recipe essentially standardizes and industrializes benchmark hill-climbing without requiring many new ideas. Because the recipe scales and generalizes well, your new method for a specific task might improve it by 5%, while the next o-series model might improve it by 30% without explicitly targeting that task.

  • Even as we create harder benchmarks, they are quickly (and ever more quickly) solved by the recipe. My colleague Jason Wei created a beautiful chart that shows this trend well:

So what's left to play for in the second half? If new methods are no longer needed and harder benchmarks will soon be solved, what should we do?

I believe we should fundamentally rethink evaluation. This means not only creating new, more rigorous benchmarks, but also questioning existing evaluation setups and creating new ones that force us to invent methods beyond the current recipe. This is difficult because humans are lazy and rarely question basic assumptions - you take them for granted without realizing they are assumptions, not laws.

To explain this inertia, suppose you invented one of the most successful evaluation methods in history, based on human exams. That was an incredibly bold idea in 2021, but three years later it is saturated. What would you do? Most likely design a harder exam. Or suppose you just solved a few easy coding tasks. What would you do? Most likely find harder coding tasks to solve, until you reach IOI gold.

Inertia is natural, but therein lies the problem. AI has beaten world champions at chess and Go, outperformed most humans on the SAT and the Bar Exam, and achieved gold medal level performance in the IOI and IMO. Yet the world has not changed much, at least in terms of economics and GDP.

I call this the utility problem, and I think it is the most important problem in artificial intelligence.

Maybe we’ll solve the utility problem soon, maybe not. In any case, the root cause of this problem may be deceptively simple: our evaluation setting differs from the real-world setting in many fundamental ways. Here are two examples:

  • Evaluations “should” run automatically, so typically the agent receives a task input, performs the task autonomously, and then gets rewarded for the task. But in reality, the agent must interact with a human throughout the entire task — you can’t just send a long text message to customer service, wait 10 minutes, and expect a detailed response to resolve all issues. By questioning this setup, new benchmarks were invented to involve real people (such as Chatbot Arena) or user simulations (such as tau-bench) in the loop.



  • Evaluation "should" be run i.i.d. If you have a test set of 500 tasks, you run each task independently, average the per-task metrics, and report an overall metric. But in reality, you solve tasks sequentially, not in parallel. A Google SWE solves google3 issues better and better as she grows familiar with the codebase, yet a SWE agent solves many issues in the same codebase without ever gaining that familiarity. We clearly need long-term memory methods (and they do exist), but academia has no proper benchmarks to demonstrate the need, nor even the courage to question the i.i.d. assumption that underlies machine learning. A sketch contrasting the two protocols appears right after this list.
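
A minimal sketch of the two protocols contrasted above. The `agent.solve(task, memory)` interface is hypothetical, made up purely to show where the i.i.d. assumption enters; it does not come from any existing benchmark.

```python
# Hypothetical agent interface: agent.solve(task, memory) -> (score, new_memory).
def evaluate_iid(agent, tasks):
    """Standard protocol: every task is scored from a blank slate."""
    scores = [agent.solve(task, memory=None)[0] for task in tasks]
    return sum(scores) / len(scores)

def evaluate_sequential(agent, tasks):
    """Sequential protocol: experience from earlier tasks (e.g. growing
    familiarity with the same codebase) is carried forward as memory."""
    memory, scores = None, []
    for task in tasks:
        score, memory = agent.solve(task, memory)
        scores.append(score)
    return sum(scores) / len(scores)
```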


These assumptions have “always” held, and in the first half of AI it was fine to build benchmarks on them, because when intelligence was low, raising intelligence generally raised utility. But now the general recipe is essentially guaranteed to work under these assumptions. So the way to play the new game of the second half is:

  • We develop novel evaluation settings or tasks for real-world practicality.

  • We solve them with the recipe, or augment the recipe with new components. The cycle continues.


This game is hard because it is unfamiliar. But it is also exciting. Players in the first half solved video games and exams; players in the second half get to use intelligence to build useful products and create billion-dollar or even trillion-dollar companies. The first half was full of incremental methods and models, while the second half filters them to some extent. The general recipe will simply crush your incremental methods, unless you create new assumptions that break the recipe. Then you get to do truly game-changing research.