R1-Zero is more important than R1 - In-depth analysis of R1-Zero and R1

R1-Zero represents a new paradigm of AI self-evolution and deserves more attention.
Core content:
1. Mike Knoop's perspective on where AI development is heading
2. R1-Zero's new paradigm, which does not rely on human annotation
3. How ARC Prize 2024 pushes AI to adapt to novel problems
Special thanks to Tuhin and Abu from Baseten, and Yuchen from Hyperbolic Labs, for hosting R1-Zero for us. Almost no providers host this model variant, and its availability is critical for research.
The goal of the ARC Prize Foundation is to define, measure, and inspire new ideas toward AGI. To do this, we strive to build the strongest innovation environment in the world.
We do not yet have AGI, and innovation remains constrained: pure LLM pre-training at scale is not the path, even though that was the dominant AI industry narrative and mainstream public opinion as recently as last summer.
Narratives matter because they ultimately drive economic activity: investment, research focus, funding, geopolitics, trade, and so on. For example, new LLM startups attracted ~$20 billion in investment in 2023-24, while new AGI startups attracted only ~$200 million.
We launched ARC Prize 2024 last June to increase understanding of the scaling limits of LLMs and to steer development in a new direction with a useful benchmark, ARC-AGI-1, which requires AI systems to adapt to novel, unseen problems rather than rely strictly on memorization.
1 DeepSeek R1 Architecture
DeepSeek R1 architecture diagram by @SirrahChan.
Last week, DeepSeek released its new R1-Zero and R1 "reasoner" systems, which compete with OpenAI's o1 system on ARC-AGI-1. R1-Zero, R1, and o1 (low compute) all score around 15-20%, in stark contrast to the 5% scored by GPT-4o, the culmination of years of pure LLM scaling. Based on this week's US market reaction, the public is beginning to realize that pure LLM scaling has limits as well. However, the public remains largely unaware of the inference demand that is coming.
In December 2024, OpenAI announced its new breakthrough o3 system, which we verified. It scored 76% in low-compute mode and 88% in high-compute mode. o3 demonstrated the first practical, general-purpose implementation of a computer adapting to novel, unseen problems.
Although this was important scientific news, o3's breakthrough on ARC-AGI-1 went largely unnoticed and unreported by the mainstream media.
This is a very important moment in the field of AI and computer science in general, and these systems need to be studied. But due to the closed nature of o1/o3, we had to rely on speculation. Thanks to ARC-AGI-1 and the now (almost) open-source R1-Zero and R1, we can increase our understanding. In particular, R1-Zero is more important than R1.
"Almost" because DeepSeek has not disclosed a reproducible way to generate its model weights.
2 R1-Zero Eliminates Human Bottlenecks
In our o1 and o3 analyses, we speculated about how these reasoning systems work. The key points:
1. Generate chains of thought (CoT) for a problem domain.
2. Label the intermediate CoT steps, using a mix of human experts ("supervised fine-tuning" or SFT) and automated machines ("reinforcement learning" or RL).
3. Use (2) to train the base model.
4. At test time, iteratively sample from the process model.
The iterative sampling techniques and their ARC-AGI-1 scores are described below:
NOTE: ARC-AGI-1 semi-private scores are shown.
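To make the four speculated steps concrete, here is a minimal, hypothetical sketch. Every name in it (generate_cot, label_step, finetune, and so on) is an illustrative placeholder; neither OpenAI nor DeepSeek has published code for these systems.

```python
from typing import Callable, List, Tuple

def build_process_model(
    generate_cot: Callable[[str], List[str]],   # step (1): sample a CoT for a problem
    label_step: Callable[[str], float],         # step (2): SFT/RL labeler for one step
    finetune: Callable[[List[Tuple[str, str, float]]], object],  # step (3): train base model
    problems: List[str],
) -> object:
    """Steps (1)-(3): generate CoTs, label their steps, train on the labels."""
    examples: List[Tuple[str, str, float]] = []
    for problem in problems:
        for step in generate_cot(problem):                       # (1) generate CoT steps
            examples.append((problem, step, label_step(step)))   # (2) label each step
    return finetune(examples)                                    # (3) train the base model

def answer(sample: Callable[[str], str], score: Callable[[str], float],
           problem: str, n: int = 8) -> str:
    """Step (4): at test time, iteratively sample from the process model, keep the best."""
    candidates = [sample(problem) for _ in range(n)]
    return max(candidates, key=score)
```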
Based on DeepSeek's newly published research, we can now better ground that speculation. The key insight is that LLM reasoning systems gain greater adaptability to novelty (and reliability) along three dimensions:
1. Adding human labels to CoT process-model training, i.e., SFT.
2. Using CoT search instead of linear inference (parallel step-wise CoT inference).
3. Whole-CoT sampling (parallel trajectory inference).
Item (1) is subject to the bottlenecks and limits of human data generation, which determine the domains in which these reasoning systems benefit most. For example, o1's score on MMLU's professional-law category is notably lower than its scores on math and logic.
Items (2) and (3) are constrained by efficiency. o1 and o3 show logarithmic accuracy improvements on ARC-AGI-1 as more inference compute is spent at test time, and different ways of spending that compute shift the x-axis of the curve.
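To make the distinction between items (2) and (3) concrete, here is a minimal sketch. It assumes only a stochastic `propose` function that extends a chain by one step and a `score` function that rates chains; both are illustrative interfaces, not any lab's actual API.

```python
from typing import Callable, List

Step = str
Chain = List[Step]

def linear_cot(propose: Callable[[Chain], Step], steps: int) -> Chain:
    """Baseline: a single greedy chain of thought, no extra test-time compute."""
    chain: Chain = []
    for _ in range(steps):
        chain.append(propose(chain))
    return chain

def stepwise_cot_search(propose: Callable[[Chain], Step],
                        score: Callable[[Chain], float],
                        steps: int, beam: int = 4) -> Chain:
    """Item (2): parallel step-wise CoT search, sketched here as a beam search."""
    frontier: List[Chain] = [[]]
    for _ in range(steps):
        # branch every partial chain, then keep only the highest-scoring ones
        expanded = [chain + [propose(chain)] for chain in frontier for _ in range(beam)]
        frontier = sorted(expanded, key=score, reverse=True)[:beam]
    return frontier[0]

def whole_cot_sampling(propose: Callable[[Chain], Step],
                       score: Callable[[Chain], float],
                       steps: int, n: int = 16) -> Chain:
    """Item (3): sample n complete trajectories in parallel and keep the best."""
    chains = [linear_cot(propose, steps) for _ in range(n)]
    return max(chains, key=score)
```

Both strategies spend more compute per problem than a single linear chain; they differ in whether that compute is spent branching at each step or sampling whole trajectories.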
In my opinion, the most interesting thing about DeepSeek is the separate release of R1-Zero. R1-Zero is a model that drops SFT, item (1) above, entirely; instead, it relies solely on reinforcement learning.
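As a rough illustration of what "RL only, no SFT" means here, below is a simplified sketch of a training signal built purely from rule-based, verifiable rewards. DeepSeek's paper describes GRPO with accuracy and format rewards and a `<think>`/`<answer>` output template; the loop below is a generic stand-in for that, and `policy.sample`/`policy.reinforce` are assumed interfaces, not a real API.

```python
import re
from typing import List

def extract_final_answer(completion: str) -> str:
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    return m.group(1).strip() if m else ""

def reward(completion: str, reference_answer: str) -> float:
    """No human step labels: just a format check plus an automatic verifier."""
    r = 0.0
    if re.search(r"<think>.*?</think>", completion, re.S):
        r += 0.1                                   # format reward: CoT markup present
    if extract_final_answer(completion) == reference_answer:
        r += 1.0                                   # accuracy reward: verified correct
    return r

def rl_step(policy, prompts: List[str], references: List[str], group_size: int = 8):
    """One GRPO-like update: score a group of samples against a group baseline."""
    for prompt, ref in zip(prompts, references):
        group = [policy.sample(prompt) for _ in range(group_size)]
        rewards = [reward(c, ref) for c in group]
        baseline = sum(rewards) / len(rewards)     # group-relative baseline
        for completion, r in zip(group, rewards):
            policy.reinforce(prompt, completion, advantage=r - baseline)
```

The key property is that the reward is computable by a machine checker, so no human annotation enters the loop.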
R1-Zero and R1 show strong agreement on ARC-AGI-1, scoring 14% and 15% respectively. DeepSeek's own reported benchmark scores also show strong agreement between R1-Zero and R1; for example, on MATH AIME 2024 they score 71% and 76% respectively (versus about 40% for the base DeepSeek V3).
In the paper, the authors state that "DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing," and others have verified this online. However, in our own testing of R1-Zero on ARC-AGI-1, a domain similar to math and coding, we found little evidence of incoherence.
Taken together, these findings suggest that:
1. In domains with strong verification, SFT (e.g., human expert annotation) is not required for accurate and readable CoT reasoning. The R1-Zero training process creates its own internal domain-specific language ("DSL") within the token space through RL optimization.
2. SFT is needed to broaden the set of domains in which CoT reasoning works.
This makes intuitive sense, since language itself is effectively a domain-specific language for reasoning. The exact same "words" can be learned in one domain and applied in another, like a program. Pure reinforcement learning methods cannot yet discover a broadly shared vocabulary, and I expect this to be a focus of future research.
Ultimately, R1-Zero demonstrates the prototype of a potentially scalable approach with zero human bottlenecks, even in the acquisition of the training data itself.
DeepSeek has almost certainly set its sights on OpenAI's o3 system. It will be interesting to see whether SFT ends up being a requirement for adding CoT search and sampling, or whether a hypothetical "R2-Zero" exists on the same logarithmic accuracy-versus-inference-compute curve. Based on the R1-Zero results, I believe SFT would not be needed for such a hypothetical scaled-up version to beat ARC-AGI-1.
3 The Economics of Reliability
From an economic perspective, two major shifts are taking place in the field of artificial intelligence:
1. You can now buy more accuracy and reliability with money.
2. Training dollars are shifting to inference dollars.
Both will drive a lot of inference demand, and neither will reduce the need for more compute. In fact, they will increase the need for compute.
AI reasoning systems promise payoffs well beyond higher benchmark accuracy. The number-one issue holding back broader use of AI automation, and hence driving the demand for reasoning, is reliability. I have spoken with hundreds of Zapier customers who have tried to deploy AI agents in their businesses, and the feedback has been remarkably consistent: "I don't trust them yet because they don't work reliably."
I have previously argued that progress toward ARC-AGI would bring greater reliability. The challenge with LLM agents is that they need strong local-domain guardrails to work reliably, while stronger generalization requires the ability to adapt to unseen situations. We are now starting to see evidence that this view is correct, as several companies ship agents (Anthropic, OpenAI, Apple, etc.).
Agents will drive significant near-term inference demand because of these reliability requirements. More broadly, developers can choose to spend more compute to increase users' trust in a system. However, higher reliability does not mean 100% accuracy; instead, expect more consistent inaccuracy. This is acceptable because, where accuracy falls short, users and developers can now be more confident about steering behavior through prompting.
Problems that computers couldn't solve before now have dollar amounts associated with them. As efficiency increases, these dollar amounts will decrease.
4 Inference as Training
Another major shift is in where the input data for LLM systems comes from. Previously, most data was either purchased, scraped, or synthetically generated from existing LLMs (e.g., distilled or augmented).
These reasoning systems offer a new option: generating "real" data, as opposed to "synthetic" data. The AI industry uses the term "synthetic" for the low-quality data often recycled through an LLM to boost the overall amount of training data, with diminishing returns.
But now with an inference system and a validator, we can create new legitimate data to train on. This can be done offline, with the developer paying to create the data, or at inference time, with the end user paying!
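A minimal sketch of this loop, assuming hypothetical `solve` and `validate` interfaces: paid inference produces candidate solutions, a domain verifier filters them, and only verified outputs become new training data.

```python
from typing import Callable, List, Tuple

def data_flywheel(
    solve: Callable[[str], str],           # reasoning system (compute paid per call)
    validate: Callable[[str, str], bool],  # domain verifier, e.g. a test suite or checker
    tasks: List[str],
    attempts_per_task: int = 8,
) -> List[Tuple[str, str]]:
    """Turn paid inference into verified, non-synthetic training data."""
    new_data: List[Tuple[str, str]] = []
    for task in tasks:
        for _ in range(attempts_per_task):
            candidate = solve(task)
            if validate(task, candidate):  # only verified outputs become training data
                new_data.append((task, candidate))
                break
    return new_data

# Closing the loop (sketch):
# model = finetune(model, data_flywheel(model.solve, verifier, tasks))
```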
This is a fascinating shift in economics, suggesting there could be a moment of runaway power concentration for the AI-system developers with the most paying customers. Those customers pay for the creation of new, high-quality data... which improves the model... which makes it better and more popular with users... you get my drift.
If we can remove the human-expert bottleneck and build a highly efficient system for creating new data through search/synthesis and validation, then we should expect massive amounts of compute to be devoted to these reasoning systems, since their performance can be improved simply by spending money and raw data on them. Eventually, this kind of AI training will completely dominate pre-training on human-generated data.
5 Conclusion
We will continue to see market adjustments as the growing demand for inference becomes more pronounced. The efficiency of AI systems will only drive greater usage, not just because of the Jevons paradox, but because new training paradigms are unlocked as efficiency increases.
Now that R1 is open and reproducible, more people and teams will push CoT and search to their limits. This will show us more quickly where the frontier actually lies and will spark a wave of innovation that accelerates progress toward AGI.
Several people have already told me they plan to use an R1-style system in ARC Prize 2025, and I am very excited to see the results.
The fact that R1 is open is a great thing for the world. DeepSeek pushes the frontiers of science forward.