Open R1 Project Progress Phase 2

A major breakthrough in the field of deep learning mathematical reasoning!
Core content:
1. Overview of two weeks of progress on the Open R1 project, which reconstructs the missing pieces of DeepSeek R1's training process and data
2. Release of the OpenR1-Math-220k dataset, a new milestone for large-scale mathematical reasoning
3. Community progress: discussion of curating small, high-quality datasets and of techniques for controlling a model's number of reasoning steps
Originally published on February 10, 2025
It has been two weeks since we started the Open R1 project, which aims to fill in the missing parts of DeepSeek R1, especially the training process and synthetic data.
In this article, we are happy to share a big achievement with you: OpenR1-Math-220k, the first large-scale mathematical reasoning dataset we have created!
In addition, we cover some exciting developments from the community, such as how small, carefully curated high-quality datasets can be used to fine-tune models, and how to control the "number of thinking steps" of reasoning models during training and inference.
Let’s take a look together!
OpenR1-Math-220k dataset
The power of DeepSeek R1 is that it can "impart" advanced reasoning capabilities to small models. The DeepSeek team generated 600,000 reasoning traces to fine-tune the Qwen and Llama series models, and the results show that distilling directly from R1, without any reinforcement learning, already works very well. For example, DeepSeek-R1-Distill-Qwen-7B scored 55.5% on AIME 2024, beating the much larger QwQ-32B-Preview.
However, these reasoning traces were never made public, which has prompted the community to recreate several similar datasets on its own.
Introducing OpenR1-Math-220k! This is a large-scale mathematical reasoning dataset generated locally on 512 H100s, with multiple answers for each question. Together with the NuminaMath team, we are also launching an upgraded version of their hugely popular dataset.
What sets the OpenR1 dataset apart from the others:
- 800,000 reasoning traces: we generated two answers for each of 400,000 questions. After filtering, 220,000 questions remain, each with at least one reliable reasoning trace.
- 512 H100s running locally: no APIs involved; we rely on our own scientific computing cluster, producing 180,000 reasoning traces per day.
- Based on NuminaMath 1.5: we focus on mathematical reasoning and generate answers for the problems in NuminaMath 1.5 (the upgraded version of the dataset).
- Automatic filtering: we keep only questions with at least one correct answer, and use an LLM "judge" to recover reliable answers that the rule-based parser rejects (for example, answers whose formatting it cannot parse).
- Performance parity: models fine-tuned on the dataset match the original DeepSeek-Distill-Qwen-7B.
We hope that this scalable, high-quality method for generating inference data can be used not only in mathematics, but also in areas such as code generation.
Where does the data come from?
To build OpenR1-220k, we had DeepSeek R1 solve 400,000 problems from NuminaMath 1.5, using the sampling parameters suggested in its model card. We also prepended the following sentence to each problem's prompt:
“Please reason step by step and write your answer in \boxed{} at the end.”
Each question was given a budget of up to 16k tokens, since we found that 75% of problems can be solved within 8k tokens and the rest mostly need the full 16k. At first we used vLLM for inference, reaching 15 answers per hour per H100; the generation scripts were shared in the previous update and in the Open R1 repository. Later we switched to SGLang, which doubled the throughput to 25 answers per hour per H100. With 512 H100s that comes to roughly 300,000 answers per day, so we accumulated 800,000 reasoning traces within a few days.
We generated two answers per question, and in some cases four, which gives us more flexibility for filtering and training. The approach is similar to DeepSeek R1's rejection sampling, and it also supports preference optimization methods such as DPO.
Generation script:
Unfiltered dataset:
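To make the setup more concrete, here is a minimal generation sketch with vLLM. It is not the project's actual script (see the links above): the model name is a smaller distilled stand-in, since serving the full DeepSeek R1 needs a multi-node cluster, and the sampling settings are assumptions rather than the exact parameters used.

```python
from vllm import LLM, SamplingParams

# Instruction appended to every problem, mirroring the prompt quoted above.
PROMPT_SUFFIX = "\nPlease reason step by step, and put your final answer within \\boxed{}."

# Stand-in model: the full DeepSeek R1 requires a multi-node deployment.
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

# Two candidate solutions per problem, capped at 16k tokens each.
params = SamplingParams(n=2, temperature=0.6, top_p=0.95, max_tokens=16_384)

problems = ["What is the sum of the first 100 positive integers?"]
outputs = llm.generate([p + PROMPT_SUFFIX for p in problems], params)

for request in outputs:
    for candidate in request.outputs:  # n=2 candidates per problem
        print(candidate.text[:200], "...")
```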
How to filter data
To keep only high-quality, correct reasoning traces, we use Math-Verify, a mathematical expression evaluation system designed to assess LLM answers, and compare the final answer given by the model with the ground-truth answer in the dataset.
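In code, the rule-based check looks roughly like the sketch below, using Math-Verify's parse/verify API; the helper name and example strings are ours, not taken from the actual pipeline.

```python
from math_verify import parse, verify

def has_correct_answer(gold_answer: str, generations: list[str]) -> bool:
    """Keep a question only if at least one generation matches the reference answer."""
    gold = parse(gold_answer)
    return any(verify(gold, parse(g)) for g in generations)

# Equivalent answers written in different notations should be accepted.
print(has_correct_answer(
    "$\\frac{1}{2}$",
    ["... so the final answer is $\\boxed{0.5}$.", "I believe it is $\\boxed{2}$."],
))  # expected: True
```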
It turns out that 55% of the questions have at least one correct answer. However, some ground-truth answers in NuminaMath 1.5 are empty or in formats that cannot be verified automatically, which is quite troublesome. Although we upgraded Math-Verify to handle these odd formats better (more on the improvements below), we also set up a fallback: using Llama-3.3-70B-Instruct as a "judge" to recover reliable answers from the rejected ones. Before running the judge, we filtered out samples with incomplete or empty ground-truth answers, keeping only those with a clean format and a clearly boxed final answer. In the end, this recovered 28,000 questions.
The instructions we gave Llama-3.3-70B-Instruct were:
You are a checker of math answers. Given a problem, compare the standard answer with the model's final answer and decide whether they mean the same thing, even if the format is different.
Problem:
{problem}
Standard answer:
{answer}
Model answer:
{generation}
Just look at the final mathematical answer given by the model and ignore these differences:
- Formatting (such as \boxed{} and normal text)
- Multiple choice format (e.g. "A" and full answer)
- Order of coordinate pairs or answers
- Equivalent mathematical expressions or symbolic differences
- If the model gives a confusing answer, just say "Result: Uncertain"
First briefly explain your comparison in two or three sentences, then give your conclusion using one of these formats:
- "Result: Same"
- "Result: Different"
- "Result: Uncertain"
By combining rule-based verification (Math-Verify) with LLM judging, we maintained data quality without sacrificing scale. The final dataset contains 220,000 questions with verified reasoning traces, a solid resource for training reasoning models. Having multiple answers per question also makes it easier for the community to filter for better responses, or to slice the data by NuminaMath source and question type.
The dataset is divided into two parts:
- `default` (94,000 questions): gives the best results after SFT.
- `extended` (131,000 questions): adds further NuminaMath 1.5 sources such as `cn_k12`, contributing more reasoning traces. However, fine-tuning on this split performs worse than on `default`, possibly because the `cn_k12` problems are too simple.
For questions with multiple correct answers, we also tried using a reward model (RM) to pick the best one. For each question where R1 produced several correct answers, we stripped the thinking process (`<think>…</think>`), scored the question-answer pairs with the reward model running on vLLM, ranked them by score, and put the top answer into the training set. Unfortunately, experiments showed that this selection was no better than picking a correct answer at random. A future improvement would be to include the reasoning trace when scoring, rather than judging the final answer alone.
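For reference, stripping the thinking block before scoring can be as simple as the helper below (an illustrative snippet, not the actual pipeline code).

```python
import re

# Matches the reasoning block emitted by R1-style models.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(completion: str) -> str:
    """Remove the <think>...</think> block, leaving only the final answer text."""
    return THINK_BLOCK.sub("", completion)

print(strip_think("<think>Pair terms: 1+100, 2+99, ...</think>\nThe answer is \\boxed{5050}."))
# -> "The answer is \boxed{5050}."
```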
Comparing performance with DeepSeek-Distill-Qwen-7B
We fine-tuned Qwen2.5-Math-Instruct on the `default` split for three epochs with a learning rate of 5e-5. To extend the context length from 4k to 32k, we raised the RoPE frequency to 300k, and trained with a linear learning-rate schedule and 10% warmup. The table below compares the resulting model against DeepSeek-Distill-Qwen-7B and other baselines:
This version of the dataset is just a starting point, and the community can further optimize it, for example, using DeepSeek R1's rejection sampling method to improve quality.
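For readers who want to reproduce the recipe above, here is a minimal SFT sketch with TRL. The model and dataset identifiers are illustrative assumptions, and exact argument names can vary between TRL versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

# Raise the RoPE base frequency so the model can handle 32k-token reasoning traces.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B-Instruct",
    rope_theta=300_000,
)
dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train")

args = SFTConfig(
    output_dir="openr1-qwen-7b-sft",
    learning_rate=5e-5,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_seq_length=32_768,  # train on the full 32k context
    packing=True,
)
trainer = SFTTrainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```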
What's New in Math-Verify
We found some issues when checking the results of Math-Verify, so we made a major overhaul. We strongly recommend that you upgrade to the latest version (0.5.2) to experience these improvements:
pip install math-verify==0.5.2
The main upgrades are:
- Improved parsing and verification of plain-text (non-LaTeX) answers, so equivalent plain-text forms are treated as the same answer.
- Improved parsing of answer lists, so equivalent list formats are recognized as equal.
- Fixed a bug so that multiple boxed answers within a single LaTeX expression are now recognized (e.g., they are combined into a set such as {1,2}).
- Added support for ordered tuples. Because it is very hard to tell whether a list is a tuple or a set, we rely on the gold answer: (1,2,3) ≠ {3,2,1}; 1,2,3 == {3,2,1}; {3,2,1} == {1,2,3}.
- Support for relational expressions in the gold answer (such as an inequality) matched against an interval in the prediction.
Community Hotspots
This week the community explored GRPO from many angles, and new research showed that just 1,000 high-quality samples can be enough to elicit reasoning from existing open-source models.
GRPO in practice
- nrehiew applied GRPO to the Qwen2.5-0.5B base model and reached 51% accuracy on GSM8k, about 10 points higher than Qwen2.5-0.5B-Instruct. This impressive result has sparked interest in the role of instruction data in pre-training. So far, however, GRPO on other base models (such as Llama 3) has not produced comparable breakthroughs.
- It was also found that base models can self-reflect with just a little prompting, so the "aha moment" described in the DeepSeek-R1 paper may owe more to the base model itself than to RL optimization.
- Unsloth managed to run GRPO on a 15B-parameter model using only 15GB of VRAM, which now even works on a free Google Colab instance.
- Wing Lian shared further findings, and Alexander Doria devised a GRPO reward that, notably, is the first public example of GRPO being applied outside the realm of "verifiable" tasks.
Test performance
The first part of AIME 2025 was released this week: 15 difficult problems for high-school students on the path to the International Mathematical Olympiad. Over the past year, AIME 2024 has been the main benchmark for LLM mathematical ability, so everyone was eager to see how LLMs would perform on genuinely new questions:
Researchers tested a bunch of models and found that the difference was far less than expected, only 10-20 percentage points.
However, it then turned out that several AIME 2025 questions had already been available online for a long time, which may be an accidental leak of the problems.
Does LLM have to use natural language reasoning?
A new and interesting paper uses a recurrent language model to reason implicitly in latent space, which allows computation to be scaled up at test time. It resembles earlier work on training language models in latent space, but applied to reasoning. The advantage is efficiency: good results without generating a large number of "thinking" tokens.
Is small but high-quality reasoning data becoming a trend?
DeepSeek R1 used 600,000 reasoning traces for distillation, but recent studies have found that a small number of carefully selected samples can also teach a model complex reasoning, without massive training data.
For example, one dataset contains only 1,000 math problems with distilled reasoning traces, selected for difficulty, diversity, and quality. The authors used it to fine-tune Qwen2.5-32B-Instruct, which beat OpenAI's o1-preview by 27% on competition math benchmarks.
Another dataset is even smaller, with only 817 samples, yet it performs impressively on AIME and MATH. The authors conjecture that if a model has already absorbed enough domain knowledge during pre-training, a few hundred good examples may be enough to unlock its reasoning ability.
Controlling the length of thought chains: budget forcing and reward design
The key behind the fine-tuned Qwen2.5-32B-Instruct is budget forcing. This trick adjusts reasoning time at test time, either by appending "Wait" to make the model think longer, or by appending an end-of-thinking marker to make it stop. The authors found that the model scales with test-time compute: giving it more thinking time increases accuracy on math benchmarks.
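Here is a toy sketch of the idea, a simplification of ours rather than the paper's code: whenever the model tries to close its thinking block before a minimum token budget has been spent, we append "Wait" and let it continue. The model name is an illustrative stand-in.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")  # illustrative reasoning model

def generate_with_budget(prompt: str, min_think_tokens: int = 2_048, max_rounds: int = 4) -> str:
    text = prompt + "\n<think>\n"
    spent = 0
    thinking = SamplingParams(temperature=0.6, max_tokens=4_096, stop=["</think>"])
    for _ in range(max_rounds):
        out = llm.generate([text], thinking)[0].outputs[0]
        text += out.text
        spent += len(out.token_ids)
        if spent >= min_think_tokens:
            break
        text += " Wait"  # below budget: nudge the model to keep reasoning
    text += "\n</think>\n"
    # Final pass to produce the answer once the thinking block is closed.
    answer = llm.generate([text], SamplingParams(temperature=0.6, max_tokens=1_024))[0].outputs[0]
    return text + answer.text
```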
Similarly, Yeo et al. studied how chain-of-thought (CoT) length affects performance. They designed a **cosine reward** that encourages short CoTs for correct answers and longer CoTs for wrong answers, which stabilizes RL training, especially when the context length is limited and response lengths tend to explode. They also add a **repetition penalty** that punishes the model for repeating itself on hard problems to farm reward, forcing it to actually solve the problem.
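A simplified illustration of a length-aware reward is sketched below; this is our own toy version, not the exact formula from the paper: correct answers score higher when the CoT is short, while wrong answers are penalized less when the CoT is long, leaving room for more search.

```python
import math

def cosine_length_reward(is_correct: bool, cot_tokens: int, max_tokens: int = 16_384) -> float:
    t = min(cot_tokens / max_tokens, 1.0)        # 0 = very short CoT, 1 = at the context limit
    decay = 0.5 * (1.0 + math.cos(math.pi * t))  # 1.0 at t=0, falling to 0.0 at t=1
    if is_correct:
        return 0.5 + 0.5 * decay       # in [0.5, 1.0]: shorter correct answers score higher
    return -1.0 + 0.5 * (1.0 - decay)  # in [-1.0, -0.5]: longer wrong answers hurt less
```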
What to do next?
GRPO is running nicely in TRL, and we are currently running a large experiment to see which hyperparameters and reward functions work best. If you want to follow the progress, check out the project, and we will publish a detailed report in the next update!
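For the curious, a minimal GRPO run with TRL looks roughly like the sketch below, with an accuracy reward built on Math-Verify. The model choice, column names, and hyperparameters are illustrative assumptions, not the settings from our sweep.

```python
from datasets import load_dataset
from math_verify import parse, verify
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train")
dataset = dataset.map(lambda ex: {"prompt": ex["problem"]})  # GRPOTrainer expects a "prompt" column

def accuracy_reward(completions, answer, **kwargs):
    # 1.0 if the completion's final answer matches the reference answer, else 0.0.
    return [float(verify(parse(a), parse(c))) for c, a in zip(completions, answer)]

args = GRPOConfig(
    output_dir="qwen-grpo",
    num_generations=8,            # group size used for the relative advantage
    max_completion_length=1_024,
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```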