Dai Yusen of ZhenFund: A long talk about AI Agents, and why every industry will meet its "Lee Sedol moment" (Part 1)

Dai Yusen of ZhenFund offers a deep reading of where AI Agents are headed and how they will affect every industry.
Core content:
1. Two major events in the AI industry: the releases of o1 and R1 and their impact
2. The gains in reasoning ability, cost reduction, and coding ability brought by o1 and R1
3. The application prospects of Agents in 2025, and how AI companies large and small are moving and adjusting
Last month, Dai Yusen, managing partner of ZhenFund, had a long conversation with LatePost about AI and Agents. We have compiled the interview into a complete transcript, published in two parts: (Part 1) and (Part 2).
Since last year there have been two important milestones: o1 and R1. They have had two major impacts on the AI industry:
First, o1 introduced reinforcement learning into large language models, opening up new scaling laws for post-training and test-time compute (compute spent at the inference stage) on top of the pre-training scaling law, and greatly improving models' reasoning ability.
Second, DeepSeek's R1, a reasoning model like o1, was trained at strikingly low cost and fully open-sourced. The enormous nationwide attention it triggered pushed many people to re-examine the most important question in the large-model industry: how to improve model capability. The open-sourcing of R1, alongside the simultaneous release of Kimi k1.5, another reasoning model with a detailed technical report, also told the whole field plainly that certain directions were dead ends: neither team ended up using methods such as Monte Carlo tree search.
In this conversation, Dai Yusen and LatePost start from o1 and R1: the gains in reasoning ability and the cost reductions the two models brought, together with simultaneous improvements in coding ability and tool use, open up the application prospects for Agents in 2025.
Dai shared in detail his current read on Agent opportunities, as well as how AI companies large and small are moving and adjusting amid the changes DeepSeek has brought to the open-source ecosystem.
01
Inspiration from the OpenAI o series and the DeepSeek R series
Q: Over the past six months, the two most important events in AI have been OpenAI's release of o1 last September and DeepSeek's recent release of R1, which set off a global frenzy. Let's start there. Can you first talk about what you see as the significance of o1 and R1?
Dai Yusen: I think o1 first showed everyone the intelligence gains that come from applying reinforcement learning in post-training. At the time, everyone was wondering what would come after GPT-4o. When o1 came out, it really did deliver a big improvement in reasoning and other measures of intelligence. The later release of o3 proved that along the o-series technical route, model capability can keep improving: the ceiling is still far off and there is plenty of headroom.
I have heard that o4-mini has also been trained. From this we can see the post-training scaling law at work, implemented by applying reinforcement learning in the post-training stage. At the same time, we can see that the longer a model reasons, the better its answers become. This is the test-time compute scaling law, also called the inference-time scaling law. These two new scaling laws, stacked on top of pre-training, let AI models improve further.
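To make the test-time compute idea concrete, here is a minimal, hypothetical sketch of the simplest way to trade inference compute for answer quality: sample several independent reasoning rollouts and keep the majority answer (self-consistency). This is not how o1 works internally, which OpenAI has not disclosed; the `sample_answer` stub below just simulates a noisy solver.

```python
import collections
import random

def sample_answer(question: str) -> str:
    """Stand-in for one stochastic reasoning rollout of an LLM.
    For illustration, it returns the right answer 40% of the time
    and one of several wrong answers otherwise."""
    return "42" if random.random() < 0.4 else random.choice(["41", "43", "44"])

def answer_with_more_compute(question: str, n_samples: int) -> str:
    # Spend more inference compute: sample n independent chains of
    # thought, then return the most common final answer (majority vote).
    answers = [sample_answer(question) for _ in range(n_samples)]
    return collections.Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    for n in (1, 5, 25, 125):
        hits = sum(answer_with_more_compute("q", n) == "42" for _ in range(200))
        print(f"n_samples={n:>3}: accuracy ~ {hits / 200:.2f}")
```

Run it and accuracy climbs toward 1.0 as n_samples grows: more compute spent at inference time yields better answers, with no change to the underlying model.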
Previously, the leading labs more or less understood that reinforcement learning was quite useful and could lift model performance. But after o1 appeared, everyone became certain the path really works. I think the reasoning gains from the o-series models are the key to unlocking the Agent product form. If a model's thinking ability is not strong enough, it cannot independently use tools, make plans, or check whether its work is done, and all of those are prerequisites for Agent products. So we first had to rely on the o series to raise models' thinking ability before new product forms could be unlocked.
Q: What is the difference between o4 and o3? What are the main optimizations in the iteration?
Dai Yusen: There is some recent gossip that o4-mini's reasoning time may reach several hours. That got me thinking: what separates exceptional humans from ordinary ones? Why does a doctoral thesis take five years? Because in five years a capable PhD student can deliver better, more valuable work, while an ordinary person might not manage it in ten. So first, the person's underlying ability has to be good; second, there has to be enough time.
We often say training a model is like training a smarter person. But even a smart person needs more time to deliver better work, and that is the inference-time scaling law. In the o-series models such as o3 and o4, letting the model think longer and get better results is becoming an increasingly achievable goal.
Q: We just discussed o1. To summarize briefly: o1 proved that reinforcement learning has great potential under the post-training and test-time compute scaling laws, and that this path can go a long way. That is the value of the o series.
Next, let's talk about R1. To some extent its influence exceeds the o series, because R1 is the hot topic everyone is discussing.
Dai Yusen: I think the R series is genuinely world-class work, and it has taught us a lot. The first lesson is open source vs. closed source. Because DeepSeek chose open source, everyone can understand how the model was trained. In the R1 and V3 technical reports, we saw many things that OpenAI had long known but the public had not. For example, DeepSeek-R1-Zero proved that without SFT, doing reinforcement learning directly on the V3 base model, the model will output longer and longer responses, gain intelligence, and realize the inference scaling law. Skipping SFT is a very important innovation. Then there is GRPO. I heard OpenAI knew about it earlier, but it was DeepSeek's paper that made everyone realize GRPO is feasible. Before that, when people discussed o1, they wondered whether it was achieved through a search method like MCTS, or through process reward models (PRMs) with step-by-step annotation. DeepSeek generously shared that they tried these methods and none of them worked. Often, knowing that a path is not feasible is just as important.
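For readers who want the one formula behind GRPO, here is a minimal sketch of its group-relative advantage, following the public DeepSeekMath and R1 reports: sample a group of responses per prompt, score each, and normalize rewards within the group, so no separate value network (critic) is needed. The full objective also includes a clipped policy ratio and a KL penalty, omitted here; this is an illustration, not DeepSeek's actual code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: `rewards` has shape (batch, G),
    one row of G sampled responses per prompt. Each response's
    advantage is its reward normalized against its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Toy example: one prompt, a group of 4 sampled answers scored 0/1
# by a rule-based verifier (R1-Zero-style math/code rewards).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))  # correct answers get positive advantage
```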
I recently learned a term, "one bit of information": for some key pieces of knowledge, a single bit is enough to convey them.
I think the power of DeepSeek's papers lies in providing these one-bit signals. For example, knowing MCTS is not feasible, because DeepSeek tried it and it did not work, saves everyone from wasting effort on that road. This "one-bit information" reflects DeepSeek's generous spirit of sharing on the one hand, and on the other, the gap between Silicon Valley and China: there may be one-bit signals in Silicon Valley that we do not know. From what we understood last year, by mid-2024 it was already consensus among first-tier Silicon Valley labs that the RL route was feasible, but that information may not have reached China until o1 and R1 appeared. A lot of the key information in frontier exploration hides in these single bits.
The open-source spirit of sharing has many benefits. On the one hand, other model-training teams learned a great deal. On the other, we have seen companies that already have their own models, such as WeChat and Baidu, integrate DeepSeek precisely because it is open source, so more people can use a good model. Monica, one of our portfolio companies, also uses R1 in the domestic version it recently launched. In the past, many Chinese application developers built products overseas because good models like GPT-4o and Claude 3.5 were available there. Now that China has a good model in R1, developers have more weapons at hand. Open source also pushes the whole industry to develop faster, with everyone learning from each other and progressing together.
That was the first point, the victory of open source. The second point, I think, is the victory of reinforcement learning (RL). OpenAI never disclosed the specifics of o1's training, but R1's release showed everyone that the reinforcement-learning road really can go a long way and pointed out a direction worth exploring in depth. That is a big win for RL.
Third, R1, V3, and DeepSeek as a whole fully demonstrated the importance of team focus. When resources are limited, people come up with more creative solutions. Using MoE, for example, is a way to save resources; a traditional dense model would cost far more to train and to serve. And when MoE ran into bottlenecks such as chips, technical innovations like MLA let training and inference proceed smoothly, within the bounds of legality and compliance, and achieve better results. Resource constraints can often be a source of innovation.
DeepSeek is also a company that has made deliberate choices about research direction. In 2023, many people were working on things like multimodal generation and AI virtual girlfriends, and many were focused on consumer products, but DeepSeek did not follow the crowd; it did not even launch its own app until after R1's release. Although DeepSeek already had plenty of GPUs, money, and excellent people, it stayed focused on improving intelligence and the base capabilities of the model, concentrating its strength in one direction, and finally got this result. That reflects both an accurate judgment about the direction of the technology and the payoff of firm choices and resolute investment.
This also shows that young AI-native teams can compete with larger companies that have more resources and more users. People used to assume that big companies held absolute advantages in funding, talent, GPUs, and user numbers, and that small companies simply could not compete. DeepSeek is not a small company in the ordinary sense, but it is still a relatively young team, and many members are graduate students and PhDs trained in China. That has given everyone confidence in China's talent pipeline, which is also critical.
Another point matters a lot to me. DeepSeek proves that in the early stage of a technological revolution, if technical progress lets you give users a brand-new, magical experience, you will reap unexpected rewards. Many people used DeepSeek's R1 as their first taste of a reasoning model, and when they saw its outputs they were amazed. That drives spontaneous word of mouth and a flood of organic traffic: without spending a cent on advertising, DeepSeek gained tens of millions of daily active users (DAU). Meanwhile its API was oversubscribed; many people were willing to pay, and some even asked for a paid, stable version of R1. Technical progress changes the product experience; the changed experience drives organic spread and traffic; and business models emerge on top. So I believe that in the early stage of a technological revolution, you must insist on technical breakthroughs and a lead in intelligence, rather than carving decorative details into products and operations on top of existing intelligence.
Q: Do you think this is a consensus?
Dai Yusen: From 2023 to 2024, many researchers did say that "intelligence is what matters; don't carve decorations on top of what already exists." But everyone needed a practical, vivid example. Before DeepSeek-R1 arrived, people were too focused on internet-era metrics: DAU, retention, time spent. Take the AI virtual girlfriends and AI voice-call features that were so popular at the time. Why were so many people keen to build them? Because on the data, retention was relatively high and session time was long; a voice call naturally stretches the clock. But does that represent an improvement in intelligence? Personally, I think it mostly satisfies users' emotional needs rather than advancing intelligence. If you optimize for time spent and DAU, you will not build a product like DeepSeek that improves intelligence.
China's internet scene has long had its controversies. Everyone knows the soil for enterprise services is thin, and users seem more willing to pay to kill time than to save it, so everyone is in the habit of hunting for the next ByteDance. When I reported to our LPs in October 2024, I said ByteDance's formula might not work in the future: ByteDance makes money by occupying user time, but user time is finite, and Douyin, Honor of Kings, and the like already occupy a great deal of it. So the next breakout app may be one that helps users save time, or creates value outside those 8 or 16 hours, rather than fighting Douyin for attention, which is very hard to win because Douyin is very strong. DeepSeek has become a good example of this.
02
Agents bring a scaling law that converts capital into productivity
Q: What industry and application changes will reasoning models like the o series and R series bring? You have said before that better reasoning points toward Agent applications, which has been a constant topic since the second half of last year.
Dai Yusen: Follow the framework we just discussed: technical progress unlocks new product forms. From GPT to GPT-3, then alignment into the conversational InstructGPT, and finally GPT-3.5, the Chatbot product form was unlocked. Models with strong coding ability, represented by Claude Sonnet, unlocked products like Cursor, the programming assistant; the relationship is mutual, and without Sonnet, Cursor would not have taken off. Starting from Sonnet 3.5, models began to show some reasoning ability, and o1 and the subsequent o-series models made that reasoning very strong. I think the corresponding product form may be the Agent.
What is an Agent? In English, "agency" denotes subjective initiative. In the past, only humans on Earth had it: we know our goals, can make plans, use tools, and evaluate the results of our work. That is one reason humans rule the planet. But AI's capabilities have now reached a breakout point that lets AI play the role of an agent.
In my view, AI's ability to make this leap was unlocked by three technological advances:
The first is reasoning. Reasoning is AI's base intelligence. Without enough of it, a model faces a cascade of problems: it cannot clearly define its objectives, struggles to draft a workable plan, and finds it even harder to judge whether it has finished the task.
The second is coding. In the digital world, understanding code, writing code, and completing tasks with it are basic skills, the "language" of the cyber world.
The third is tool use. Humans have built a vast array of tools and software for themselves, and for AI to pull its weight it must first adapt to those human tools. For example, AI has to rely on human browsers and websites to get information.
Over the past 12 months, these three abilities, reasoning, coding, and tool use, have changed dramatically and entered a phase of exponential growth. The industry has benchmarks for each. For reasoning, we often use GPQA, a test pitched roughly at PhD-qualification level. Ordinary people score around 20, human PhDs around 60. At the start of 2024, the most advanced models scored just over 10; now frontier models like o3 score above 70 (if I remember correctly), so the climb has been very fast.
For coding, people often use SWE-bench, which collects real programming tasks from GitHub. At the start of 2024, GPT-4o scored only in the single digits, basically unusable. Now o3 has reached 70 to 80, meaning AI can resolve 70% to 80% of those real-world programming tasks.
The rapid progress of AI has created a new problem: it is getting hard to find questions that can test it. Some time ago came FrontierMath, a benchmark Terence Tao was involved with, whose easiest problems are at IMO (International Mathematical Olympiad) difficulty. At the time, people thought those problems would hold AI back for at least a few years. Yet o3 can already score 25 on FrontierMath, and o4 reportedly does even better.
Once reinforcement learning is applied to a domain, the AI's growth curve tends to go exponential. AlphaGo used reinforcement learning to break through in Go; later, DeepMind's AlphaStar quickly surpassed top human players in StarCraft the same way. Autonomous driving, too: technically it is already many times safer than human driving, though regulation has kept it from wide deployment. I call the iconic moment when AI capability surpasses humans the "Lee Sedol moment." Everyone remembers that when Lee Sedol played AI at Go, he lost four games out of five. That was when people discovered AI could comfortably beat even the strongest humans.
Q: Will humans soon lose the ability to evaluate AI capabilities?
Dai Yusen: I think we are already running short. Take "Humanity's Last Exam," the benchmark Alexandr Wang put together: models have already reached 20 points on it.
Q: Is the full score 100?
Dai Yusen: Yes, and going from 20 to 80 may happen very fast. The hard part is that humans have to keep inventing difficult problems, which is a real challenge for us. If AI can get there just by spending compute, leaning on RL and stronger inference, the gap will be hard to close.
Q: The "Lee Sedol moment" you mention began, very intuitively, with AI surpassing humans. I have talked with some Go devotees, such as Lou Tiancheng, who said that when AlphaGo Zero appeared, it not only surpassed humans, it exceeded what human intelligence could comprehend. He felt Go and autonomous driving were alike in that you cannot probe such systems by testing them. The patterns humans accumulated over thousands of years were casually broken by AI.
Dai Yusen: I think comprehensibility and explainability are not guaranteed to exist.
Q: Because, from first principles, humans do not currently grasp all the truths and laws of the world.
Dai Yusen: For example, we cannot understand how Einstein came up with his theories. Push it further: cats and dogs certainly cannot understand why humans do what they do, right? With AI developing this fast, we may soon face a situation like primary-school students setting exam questions for a PhD candidate: the students rack their brains to write what they think are impossibly hard questions, but to the PhD candidate they may not be hard at all.
This is a crucial issue for AI safety: we may lose the ability to evaluate. Many human tests now yield AI scores above 95. When I was at Tsinghua, people used to say that some students score 100 because 100 is their limit, while others score 100 because the paper only goes up to 100; give them a 1000-point paper and they would score 1000.
Q: Has it reached the stage where we can no longer evaluate the capabilities of AI?
Dai Yusen: I think evaluation is still possible today, but in the foreseeable future, within a few years, it may become genuinely difficult.
Q: What will that bring?
Dai Yusen: We have already seen plenty of signs. During Spring Festival, for instance, an article circulated that was said to be Liang Wenfeng's reply on Zhihu and became hugely popular; it later turned out to have been written by DeepSeek's model.
I have been using OpenAI's Deep Research lately. It has been very helpful, and it has also shocked me. We just discussed Agents; in fact, the first Agent scenario is helping me do research. I ask a question, and it has to think about how to answer, lay out a research plan, find information, summarize, and compare. From the original GPT-4o with no reasoning ability, to o1, then o1 pro with deeper thinking, then o3-mini-high, and now Deep Research, the whole run took three to six months, but I can clearly feel the improvement has been exponential.
Yesterday I was thinking: pick ten people at random off the street, and I suspect at least nine are less capable than Deep Research. It can hand you a research report on any topic within minutes, at roughly the level of a white-collar worker with one or two years at a good company. Many people, even given more time, lack that ability to think and reason, to gather information, and to summarize. So AGI is no longer a science-fiction concept. Two years ago people treated AGI as something remote, but AI has already surpassed most people at tasks like collecting and organizing information.
Q: People like us, information workers whose job is bits in, bits out.
Dai Yusen: What AI cannot do is a conversation like the one we are having today, because this is proprietary information between us; before we talked, it did not exist anywhere. But wherever the information already exists and is not proprietary, AI will certainly do much better than most people. I am sure of that. AI's growth rate is genuinely fast; we have seen the exponential curve, and we will witness many more of those "Lee Sedol moments."
Back to the opening topic: unlocking the Agent matters enormously. Every product model of the internet era can be summed up in a famous line: "Attention is all you need."
Whether at Tencent or ByteDance, the core question is how many users spend how much time on your product. There is a formula for it: time spent × number of users × monetization rate. So everyone works on attracting more users, keeping them longer, and then raising the monetization rate. But there is a hard ceiling: there are only so many people, everyone sleeps 8 hours and is awake at most 16, and eating and working cannot be done while staring at a phone, so screen time is hard to double. So everyone pushes monetization instead: how do I extract more value from the same hour? Hence Douyin's video ads and live-stream commerce, but that road also has an end.
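As a back-of-the-envelope illustration of that formula and its ceiling (all numbers below are hypothetical, chosen only to show the shape of the bound):

```python
# Attention-economy revenue, as described above:
#   revenue = users x time per user x monetization rate
users = 700e6            # DAU of a hypothetical super-app
hours_per_user = 2.0     # daily time spent per user
yuan_per_hour = 0.5      # ad / live-stream monetization per user-hour

daily_revenue = users * hours_per_user * yuan_per_hour
print(f"daily revenue: {daily_revenue / 1e6:.0f}M")       # 700M

# The hard cap: even monetizing every waking hour of every user,
# revenue cannot exceed users x 16 x rate, because attention is finite.
ceiling = users * 16 * yuan_per_hour
print(f"absolute ceiling: {ceiling / 1e6:.0f}M per day")  # 5600M
```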
In human history, essentially everything has required human attention, with one exception: automation. In the past, mechanical automation like machine tools could run on its own once people built it, but it had no initiative of its own. Today's AI progress creates a new possibility: first, it does not need human attention, and second, it can carry out tasks autonomously. It is no exaggeration to call this the greatest advance since the birth of humanity. If what separates humans from other animals is tool use, then every past tool demanded attention, until now, when tools like Agents do not. When I hand a problem to Deep Research, it works on it for five minutes by itself and needs no attention from me in the meantime. When I used Devin last year, I gave it a task and it did it on its own; I could interrupt midway, add requirements, and check progress, but if I did not interrupt, it would finish by itself. So let me offer a new saying: in the Agent era, "Attention is not all you need."
This will unlock unlimited leverage for people. As noted, human attention is finite; if human attention is no longer required, the theoretical multiplier is unbounded. It is like directing employees from the boss's seat, without spending your own attention. In the past, most people executed under someone else's attention; only a few were bosses.
Now that AI is getting stronger, everyone can be the boss of an AI, and deciding what to have the AI do becomes a very important question. Many people find the assistant very smart but, beyond simple things like booking flights and ordering takeout, do not know what to ask of it. I think this will significantly affect society and education, but once people adapt to the paradigm, they will find far more they can hand to AI. Extending this further, we may see a scaling law of work. Today it is not easy to simply scale up output: a large company with 10 billion or even 100 billion in funds cannot convert that money directly into productivity; it still has to recruit and train, and too many people breeds infighting, so money does not automatically equal productivity. But as AI models get stronger and their reasoning keeps improving, you find that having money equals having compute, and the more compute, the more productivity AI can generate. That is the scaling law for converting capital into productivity.
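A toy sketch of the contrast Dai is drawing: unlike hiring, converting budget into agent labor is roughly linear in compute, with no per-person attention cap. The $2-per-agent-hour price is a placeholder, loosely echoing the roughly $2-per-run Deep Research figure he cites later in the interview.

```python
def agent_hours(budget_usd: float, usd_per_agent_hour: float = 2.0) -> float:
    """Capital -> compute -> productivity, in its crudest form:
    double the compute budget, double the agent-hours. The price
    here is a hypothetical placeholder, and likely to keep falling."""
    return budget_usd / usd_per_agent_hour

for budget in (1e6, 1e8, 1e10):
    print(f"${budget:,.0f} buys {agent_hours(budget):,.0f} agent-hours")
```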
Q: But does the world need so much productivity?
Dai Yusen: This is the same as how people thought before cars and airplanes were invented. Back then, people figured that if you wanted to reach the next village, you could just walk; why fly?
Q: Do you think it will create new demand?
Dai Yusen: At the least, I think history has verified this point over and over, across a long list of technologies.
Q: Against the span of the human species and its long ancient history, the era of technological explosion is actually very short, only four or five hundred years.
Dai Yusen: That is the more interesting point. Technological explosions used to happen on the scale of a generation; gradually the question became how many explosions one generation experiences in a lifetime. Now the cycle has shortened to under ten years. It has only been 13 years since AlexNet, and not long since ChatGPT was born. When ChatGPT first appeared, everyone thought it highly capable; looking back now, it had plenty of room to improve. Technology is changing so fast that people may struggle to adapt in time, which is bound to put real stress on society.
Beyond that, exponential growth is the norm in this world, but before the final steep climb, an exponential looks very much like a straight line. As the saying goes, "gradually, then suddenly." Before the sharp upturn everything looks calm, which is exactly why the AI-safety crowd is so worried. Now everyone senses we have entered the exponential phase. This is no longer preparing for a rainy day; the thunder has started and the rain is about to fall. I think the massive rise in productivity is a very important variable, if you believe productivity ultimately creates economic value.
Then the question becomes what that productivity is and how to make it create value for everyone. On one hand, as Sam Altman has said, one-person companies will become very powerful; someone who can effectively command AI, or even command Agents through AI, may create enormous value. On the other hand, a reason entrepreneurs could beat big companies in the past is that they converted capital into productivity more efficiently: sharper vision, harder work, no organizational drag. But if a big company pours money into hiring a very capable entrepreneur-like Agent, ordinary entrepreneurs may struggle to compete. Perhaps only top entrepreneurs will still beat big companies, while ordinary ones get displaced by the AI that big companies hire; it is hard to say. So some think this will make the rich richer, because the rich can buy more productivity. In the past, a rich person might still lose to a brilliant young one, but the future may look different.
Q: So there are two directions: one is the super-individual, and the other is something like a "science-fiction utopia," with resources gradually concentrating in ever more powerful companies.
Dai Yusen: So AI is bringing great change, both to productivity and to social structure. But unlocking those changes requires model capability to keep improving. I think finding the first PMF (product-market fit) in the early stage of a technological revolution is sometimes a sweet trap, even a curse. In mobile internet, BlackBerry found PMF first. The technology of the day was limited, processors weak, networks slow, so it felt it could only do email, BlackBerry messages, and push notifications, and to nail that PMF it built a phone with a keyboard, a keyboard it stayed proud of. Later, as processors got stronger, networks faster, and screens bigger, Apple simply dropped the keyboard and built a full touch-screen phone. BlackBerry still insisted that typing email without a keyboard could never work. That is the curse of PMF: when the technology moved on, it was trapped by its own PMF.
The internet had the same story. Yahoo was the first to find PMF online, with a portal model: list the information for users to browse. Then the search engine Google appeared and hit Yahoo hard. Yahoo was cluttered, everything needed a click to see, while Google was just a search box you typed into. Yahoo actually once had the chance to acquire Google, but the bid was too low, and Google later overturned it.
So I want to say that Chatbots, too, may be a sweet trap. There are so many chatbots now, and people may want to keep optimizing on that base. But I keep feeling the chatbot may cap what frontier models can show. When you chat with ChatGPT, Kimi, or Doubao, aren't you used to short, fragmented exchanges, like on WeChat? Yet to give an Agent an instruction, you often need to write a substantial brief, like applying for a grant from the National Natural Science Foundation: you must fully explain what you want to do, the goals and the constraints, and communicate completely. In a WeChat-style chatbot context, only fragmented communication happens, and the model's intelligence may never show.
I have talked with people at OpenAI who said they found that more advanced models barely improved user satisfaction in casual chat. It is like chatting on WeChat: between an ordinary college student and a scientist, the difference is not that big. But ask each to write a doctoral thesis, and it is the difference between 0 and 1. So the Chatbot, the product form most easily accepted early on, is not necessarily the form that survives to the end.
Suppose we optimize short-term metrics on that base: find ways to keep people in the chatbot longer, then launch a voice-call feature. Are calls consistent with improving intelligence? A good call may hinge on tone and emotional intelligence, which have nothing to do with intelligence or productivity gains. History repeats this pattern: those who find the first PMF, if they stop exploring deeper, are likely to be trapped by it.
Q: We have made many predictions about Agents. Following the logic of your scaling law of work, what will the first batch of Agents look like in 2025?
Dai Yusen: For the first batch, Deep Research is clearly the hottest right now. OpenAI has launched Deep Research, though Google shipped the idea first, and then Perplexity launched its own; I know many startups are heading the same way. Why this direction? Because everyone discovered that letting AI dig into information, pull in more sources, decide what to fetch next based on what it has read, loop, and finally produce a research report is exactly what we normally ask analysts to do, and for the same time, or just slightly more, you get better results. We call this a "read-only agent": it performs read operations and no writes. The PMF is already obvious; the Deep Research I use is genuinely better than my intern. For knowledge workers like us, who sit at a computer, study a topic across a pile of websites, and produce reports, both the willingness to pay and the use cases are crystal clear.
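A minimal sketch of that read-only loop: search, read, decide what to read next, then report. The `llm`, `search`, and `fetch` callables are hypothetical placeholders, not any vendor's actual API, and real Deep Research is far more elaborate.

```python
def deep_research(question: str, llm, search, fetch, max_steps: int = 10) -> str:
    """Read-only research agent: fetches and summarizes pages in a
    loop, never performing write actions on the outside world."""
    notes: list[str] = []
    plan = llm(f"Draft a research plan for: {question}")
    for _ in range(max_steps):
        # The model decides what to look up next based on what it has read.
        query = llm(f"Plan: {plan}\nNotes so far: {notes}\n"
                    "What should we search next? Reply DONE if we have enough.")
        if query.strip() == "DONE":
            break
        for url in search(query)[:3]:   # read operations only
            notes.append(llm(f"Summarize for '{question}': {fetch(url)}"))
    # Finally, synthesize everything into a report.
    return llm(f"Write a research report on '{question}' from notes: {notes}")
```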
The second step goes from reading to writing. OpenAI launched Operator and Anthropic launched MCP, and both are really about how AI uses tools. This brings real security risks; nobody wants AI running amok. But clearly, under controlled conditions, letting AI perform write operations and publish information outward is a very important capability. Monica, which we invested in, is building a similar product, now known to everyone as Manus. Yesterday they shared an interesting case: a test task to fetch the subway timetable of a US city, say Phoenix. The model first went to the official website and found the link would not open. It then directly invoked an email client and drafted a message to the Phoenix city government to ask, stopping at the step of confirming whether to send. It can do all of this entirely on its own.
Q: Is this their product?
Dai Yusen: Yes. Their product can invoke tools and drive a browser, and it has many interesting traits; for instance, the AI actively uses tools and has its own "computer." In the past, many people imagined apps like AutoGLM in China letting the AI control our phones, say, ordering takeout on my phone. But think about it: does an assistant work on his own device or on yours? His own, of course. So my AI assistant should live in the cloud with its own phone or computer and order takeout from its own device, not mine; after all, I still need my phone for Douyin and WeChat. This is essentially virtualization technology.
Q: So in terms of authority, it is still your account system, right?
Dai Yusen: Not necessarily. The AI can be given its own "computer." Say you subscribe to an expensive Bloomberg terminal: your AI assistant might ask, "Boss, lend me your account," and you log it in for him. Or you might buy your assistant its own LinkedIn Premium. All of these arrangements are possible.
Once AI can use tools, it can do a great deal, because most software is used either by calling APIs or by operating the interface itself. That makes multimodal reasoning, as in Kimi k1.5, very important, especially for driving software interfaces, which requires understanding web pages. People talk about using world models to understand the physical world, which is genuinely hard; a simple example is that when we look at things we perceive front, back, and depth, while AI is still mediocre at judging depth. But if the job is just operating computer and phone interfaces, AI can already do a lot.
Q: So this is the second type, which is both reading and writing.
Dai Yusen: Since it can write, here is another example. When AI hits a problem, in theory it can post a request for help, even offer a bounty, since it is connected to a payment provider: whoever solves it gets $100. That is not science fiction; it is entirely possible now. And we find that strong models produce solutions humans would not think of. Where a human concludes a problem is unsolvable, the AI may consider changing the problem, or acquiring permissions it does not have.
That, however, is something AI-safety research must watch, because an AI may genuinely do something harmful in order to solve a problem. I hit a typical case myself: I was using Windsurf to build a sample personal website, and to deploy it, the agent said two processes were occupying the port and proposed killing them. I agreed at the time, but afterwards I thought: what if killing them had crashed the system? It only wanted to deploy the demo site and had not considered the side effects on me. These issues can be aligned away, of course, but the latent risks are many.
So once Agents with "write" capability are built well, they will be very powerful, but deployment will certainly be slower, because the potential consequences are large; they need heavy monitoring, training, and alignment, and abuse must be prevented. "Read" will therefore move faster. For "write," Operator is an example: use it to book a flight and you will find it slower than booking yourself, with a confirmation at every step. But in AI, slowness always gets solved eventually; from slow to fast, from expensive to cheap, that is AI's standing pattern. Imagine AI finishing in one second something that used to take an assistant 30 minutes: how much more could be done each day? The freed-up time goes to other things, and that affects everyone.
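The read/write boundary Dai describes can be made concrete in a few lines: read tools pass straight through, while write tools (the ones with irreversible side effects) pause for explicit human confirmation, the way Operator confirms each step and Manus stopped before sending its email. Everything below is an illustrative sketch; the tool names and dispatcher are made up.

```python
READ_TOOLS = {"search", "open_url"}
WRITE_TOOLS = {"send_email", "post_message", "make_payment"}

def run_tool(name: str, args: dict, confirm=input) -> str:
    if name in WRITE_TOOLS:
        # Irreversible side effects: show the exact action and wait.
        answer = confirm(f"Agent wants to {name}({args}). Approve? [y/N] ")
        if answer.strip().lower() != "y":
            return "action denied by user"
    return dispatch(name, args)

def dispatch(name: str, args: dict) -> str:
    # The underlying tool implementations are omitted in this sketch.
    return f"{name} executed with {args}"
```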
Q: Does this progression match the five capability levels OpenAI defined earlier? After Agent comes Innovator, and beyond that, Organization.
Dai Yusen: Yes, and it raises several questions. The simplest: today humans command Agents, so can Agents command Agents? If every task finishes within a second, humans will not be able to ask questions fast enough to keep up.
Q: In the future, when we draft interview outlines, our agent may connect with your agent, and the two of them will write the outline.
Dai Yusen: That is entirely possible, but there is an important problem: memory. Today, if you and I each ask ChatGPT the same question, the results are similar. But an assistant who has been with me for years would, beyond the public parts, answer quite differently from yours. Then our agents would actually have something to talk about, because each carries its own memory; yet today's memory mechanisms are still rudimentary.
I think memory is especially important. Everyone is working on it, and no one has done it well. Take ChatGPT: its so-called memory is really a system prompt assembled from your conversations, remembering things like "this person has a dog; this person is a college student." That is very crude. Real memory is very long, partly instilled actively as you talk, partly acquired through other channels. In short, memory is a key point.
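A minimal sketch of the rudimentary mechanism he describes: distill durable facts from each conversation and prepend them to the next session's system prompt. The `llm` callable is a placeholder; production memory systems (including ChatGPT's) are more involved than this.

```python
class SimpleMemory:
    """Toy long-term memory: a growing list of facts injected
    into the system prompt, e.g. 'this person has a dog'."""

    def __init__(self):
        self.facts: list[str] = []

    def update(self, llm, conversation: str) -> None:
        # After each chat, ask the model to extract durable facts.
        new = llm("Extract durable facts about the user, one per line:\n"
                  + conversation)
        self.facts.extend(line for line in new.splitlines() if line.strip())

    def system_prompt(self) -> str:
        return "Known facts about this user:\n" + "\n".join(self.facts)
```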
Online learning also matters. Humans have an ability AI currently lacks: AI models must ship new versions to update their weights, while people, through reading or socializing, keep learning and actively rewriting the "weights" in their heads. That is a property of biological organisms; today's AI must pass through a training run for every update.
There are also many interesting frontier questions. For instance, Agents currently use human tools; but if an Agent is ten times smarter and ten times faster than a human, why should it keep using human tools? We do not eat with children's cutlery; we use cutlery sized for adults. So a family of tools designed specifically for AI may emerge, and tools designed for superhumans will differ from those for ordinary people. AI-native tools, and how AI iterates its own tools, are worth studying. By then we humans may not even know how to use them, just as most people cannot use EDA software.
Q: And it's possible that AI itself designs these tools.
Dai Yusen: Push the thought further and the iteration speed lands us in science-fiction territory. Yet we keep finding that concepts once deemed pure science fiction are no longer out of reach; as models advance, these things become realizable. So again: progress in intelligence unlocks new product forms, and those new forms may be very powerful. If we only polish and refine the existing Chatbot, it may be overturned soon.
Q: When we discussed Agents two or three months ago, you brought up coding, but you haven't mentioned it just now.
Dai Yusen: You mean Agents for coding, right? I see the relationship in two steps. Step one is the Agent that does coding, like Cursor or Windsurf; that is a scenario where Agents land easily. The further step is the Agent that can code. Suppose your assistant majored in liberal arts: teach him to write code and he can write a crawler to gather more information for you, so when you go interview people, you know whom to interview. Your Agent has acquired programming as a new skill. I think that is the bigger paradigm going forward.
At first, Agents were mainly built to write code, but not that many people need code written. Tools like Cursor, Windsurf, and Devin mainly serve programmers, and programmers are a small share of the population. So for the much larger group of non-programmer knowledge workers, ordinary white-collar staff, what should their Agent be? I think writing code is a necessary ability for that Agent, because only by writing code can it move freely through the cyber world.
Q: The industry moves fast. A few months ago, when people talked about Agents, coding was the direction and many startups formed around it. Now the talk is of Agents that can write code, and then do far more with it.
Dai Yusen: It used to be agents specialized in writing code (Coding Agents); now it is agents that can write code (Agents that can code).
Q: What other abilities do you think are necessary to be a good agent?
Dai Yusen: Let me sort it out. The three big capabilities are reasoning, coding, and tool use, followed by memory and online learning. Those last two are very important and still unsolved.
Q: In 2025, do you think Agents will be built more by application companies, or by companies with particularly strong model capabilities, the way OpenAI launched Operator and Anthropic launched Computer Use?
Dai Yusen: For now, model companies can indeed use RL to lift model capability and use stronger models to optimize their own, so they hold certain advantages. But application companies have advantages too. First, they can mix multiple models and play to each one's strengths. Second, user mindshare: take Perplexity, which staked out AI search early and captured that association even as the models underneath kept upgrading; for most users it is synonymous with AI search. Cursor is another good example. At first everyone called it a mere wrapper, but in fact it and the model made each other: without Sonnet 3.5, Cursor would not have taken off or been able to predict your next edit; and without Cursor, Sonnet 3.5 might have lacked the carrier that made it popular.
Q: You mentioned Monica is one of your portfolio companies. They are exploring Agents built on third-party or open-source models, right?
Dai Yusen: Since they don't train their own models, barring delays they will release a very interesting Agent product next week (Manus launched its internal beta on March 6, 2025). We think that when you can use models well, let the model use tools, and layer on a series of clever product designs, the experience can be genuinely different.
Q: You said earlier that the chatbot is a "sweet trap" for whoever first finds PMF. Are there similar traps in the Agent form? What could distract you or slow your approach to AGI?
Dai Yusen: I haven't fully thought Agents through; it is still the exploration stage, so it is hard to say. But my instinct is this: if an AI product gains a huge user base, serving all those users well may force compromises in model size and capability. A simple example: with a large user base, a big model, and Chinese users who are hard to charge, serving everyone a high-inference-cost model for free is plainly uneconomical, so the model may have to be slimmed down. Does a lighter model conflict with the pursuit of AGI? Now that DeepSeek has this many users, many people debate whether it should work on retaining them. I think that is itself a sweet trap: tens of millions of DAU, worldwide, with wildly different usage scenarios. Serving them well demands enormous time and energy across compute, product design, and operations, and that eats into the resources for exploring AGI. Resources are not unlimited, after all.
Q: It seems DeepSeek is not trying to retain users right now.
Dai Yusen: I think that is the right call, and it is the only posture that lets WeChat cooperate. If DeepSeek also wanted to seize the moment to build a super app, WeChat would find it hard to work with them.
Q: A point just occurred to me: multimodality. Though for Agents, it is multimodal understanding that matters most, rather than generation.
Dai Yusen: Multimodality certainly matters, but right now it is not raising intelligence that fast. Language is an extremely concentrated form of intelligence, so pushing intelligence through language is the faster route. Once language has been largely mined, the next frontier is images. Images carry a great deal of information; any photo you take contains a lot. But they do not carry much intelligence, and you would have to watch an enormous amount of video to distill any intelligence from it. Newton's laws can be understood in a few sentences; how many videos would you have to watch to derive Newton's laws yourself? So I think video matters in specific applications, but for generating intelligence, its information compression ratio is not yet high enough.
Q: Then why was everyone training multimodal models during that period?
Dai Yusen: There are two cases. The first is the multimodal-generation route Sora took. That route has clear PMF, because the world runs on video ads; take the popular "Cooking Orange Cat" clips. Done well enough, such videos can be monetized, so there is a business model. Midjourney, for example, has never raised funds, yet it already reached PMF. With PMF and decent results, naturally someone will do it.
Q: What do the DAUs of Midjourney and Sora look like now? Have they dropped?
Dai Yusen: Midjourney is doing quite well; its early users stayed and sustain it on their own. Keling and Hailuo, built along Sora's technical lines, are also getting good results. Sora itself, ironically, started early but ended up less dazzling. Veo 2, which Google released yesterday, is genuinely impressive; at least for single-shot footage, it is currently the best video generation model.
But the general feeling now is that video generation may not be the most important direction for raising intelligence, and the fiercest competition remains in reasoning. It is like walking: when a clear path lies ahead, most people take that path first. In AI we will keep alternating between exploring and sprinting. When you hit a bottleneck, the seemingly aimless side explorations may yield the new breakthrough. So from a company's perspective, you have to sprint the straightaway, as in a race, while also funding frontier exploration, because you never know what the short term will bring.
Q: So it still has to be done by big companies? In the United States, it’s Google, and in China, it’s ByteDance.
Dai Yusen: There is also OpenAI in the United States.
Q: So startups have no resources at all.
Dai Yusen: I wouldn't put it that way. It depends on what stage we are in and how long the stage lasts. In a stage that demands innovation, startups can dodge big-company competition through a different vision. But if everyone is just sprinting a straightaway, then whoever has money and GPUs charges ahead. The strength of startups has always been doing what big companies do not see; once everything is laid out plainly, big companies certainly hold the advantage.
Q: When we discussed Agents possibly taking off in 2025, we didn't specifically mention cost. Is falling cost an important driver of Agent development?
Dai Yusen: Of course, and I am confident costs will fall. My basic assumption is: first make it work, then make it cheap. Cost reduction is certain to arrive, and Agent capability will keep growing, though bottlenecks and blockages along the way are entirely possible. So the order is: make it work, then make it easy to use, then make it cheap. If it cannot even work, cheapness is moot.
I also think Agents will land with different difficulty in China and the United States. US labor costs are very high; the job market there is visibly tight, with many positions unfillable. So Devin's pricing at the time, a few dollars per hour, might sound expensive to us, but for American companies, California's minimum wage is $16: an hour at McDonald's costs $16, while an Agent runs $6 to $8 per hour. It is cheap to begin with, and a year later it will be stronger at the same price, effectively cheaper still. In an environment already accustomed to paying for enterprise services, that is an easy sell.
I subscribe to GPT Pro's $200-a-month plan myself and consider it a great deal: 100 Deep Research runs works out to $2 per run. If I asked my intern instead, he could not hand me a report within five minutes at two in the morning, and the quality would generally fall short of GPT Pro's. I keep telling my interns that if all they do is collect information into a noncommittal report, it may not beat the $2-per-run service.
William Gibson once said, "The future has already arrived, it's just not evenly distributed yet." The gap in imagining the future between people already using frontier AI well and the many who are opening a chatbot for the first time, or have never used one, is extremely uneven. So I truly think that for clerical work, AI replacing people is no longer an imagined future; it is happening.
Q: What do you think will be the next technological paradigm after RL, that is, after Agent is unlocked?
Dai Yusen: First, I think RL itself can still go a long way. Second, the next important step is discovering new knowledge. Dario Amodei, the founder of Anthropic, wrote an essay called "Machines of Loving Grace" arguing that AI's future is to discover new science and acquire new knowledge, which also seems to match one of OpenAI's five levels.
Q: Level 4. Level 4 is Innovator.
Dai Yusen: Much scientific discovery starts from a hypothesis that is then verified by experiment. At the hypothesis stage, AI may already do well. But verification sometimes needs observation, sometimes physical, chemical, or medical experiments, which imposes limits. If we can find ways to run verification at scale in parallel, checking whether AI's hypotheses hold, including purely mental domains like mathematical theorems where new knowledge comes from thought alone, then AI may enter a "left foot stepping on the right foot" bootstrap: it generates new knowledge, then uses that knowledge to improve itself, forming a loop of self-iterating evolution.
But by then the product form may look different. Plenty of big names have asked me when the elixir of life will be invented; apparently that is the common goal once people have made enough money. People may no longer just want Agents to do their work; they will want the elixir of life, and solutions to humanity's great problems, such as a cure for cancer.
Q: As AI gets smarter, it may find more efficient ways to use energy, perhaps even crack controlled nuclear fusion, which humans haven't solved in 50 years, closing the loop.
Dai Yusen: AI can already complete tasks humans can complete, and it will soon take on tasks humans cannot solve. It is just like the "Move 37" Lee Sedol faced back then: we do not know how the move was conceived, but as long as we can verify the result, even without knowing how it was produced, finding that it is indeed feasible and usable may bring a great deal of new progress.