Open R1 Project Progress Phase 1

Explore the latest progress of the Open R1 project and the performance reproduction of DeepSeek R1.
Core content:
1. A week of achievements of the Open-R1 project
2. Benchmark results for evaluating the performance of the DeepSeek R1 model
3. Updates on community project progress and evaluation rankings
It’s been two weeks since DeepSeek R1 was released (note: the original article was published on February 2), and only a week since we started the open-r1 project, which attempts to fill in the missing training pipeline and synthetic data. This article briefly covers:
Open-R1’s progress in replicating DeepSeek-R1’s pipeline and data
Our understanding of and discussions around DeepSeek-R1
Interesting projects the community has built since DeepSeek-R1 was released
This serves both as the latest project update and as a collection of interesting information about DeepSeek-R1.
Progress after one week
Let’s take a look at what Open-R1 accomplished this week. The project is only a week old, but thanks to the joint efforts of the team and the community, we already have some results to share.
Evaluation
To reproduce someone else's pipeline, the first step is to confirm that we can match their published results. We evaluated on the MATH-500 benchmark, and our numbers closely match those published by DeepSeek:
Want to know how to test it? Go to the instructions.
We also found that the answers generated by the DeepSeek models are very long, which makes them difficult to evaluate. In the OpenThoughts dataset, DeepSeek-R1's answers average about 6,000 tokens, and some exceed 20,000 tokens. For scale: a book page holds roughly 500 words, and a word is one or more tokens, so many answers would fill more than 10 pages! (Source: )
These long answers make training with GRPO very challenging: generating very long completions requires a lot of GPU memory to store gradients and activations.
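If you want a rough sense of these lengths on your own data, a minimal sketch along the following lines works (the dataset id, column schema, and tokenizer here are assumptions; swap in whatever you are actually measuring):

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed dataset id and message schema -- adjust to the data you want to measure
dataset = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")

def count_tokens(example):
    # Take the assistant turn (assumed to be the last message) and measure it in tokens
    answer = example["conversations"][-1]["value"]
    return {"num_tokens": len(tokenizer(answer)["input_ids"])}

with_lengths = dataset.map(count_tokens)
print(sum(with_lengths["num_tokens"]) / len(with_lengths))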
To let everyone see the progress, we have created an open-r1 evaluation leaderboard where the community can keep track of our reproduction status:
Training Process
With the launch of Open R1, GRPO (Group Relative Policy Optimization) was integrated into the latest version of TRL (). With it, any model can be trained against one or more reward functions. GRPO also works with DeepSpeed ZeRO 1/2/3 for multi-GPU training, and can use vLLM to speed up generation - after all, the biggest bottleneck in online training is generation speed.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

# Simple reward: completions close to 20 characters long get the highest score
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO", logging_steps=10)
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
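Since reward_funcs also accepts a list of functions, several reward signals can be combined. As a rough sketch building on the snippet above (the format reward below is purely illustrative, not the one used in open-r1):

import re

# Illustrative format reward: score 1.0 if the completion wraps its reasoning in <think> tags
def reward_format(completions, **kwargs):
    return [1.0 if re.search(r"<think>.*?</think>", c, re.DOTALL) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[reward_len, reward_format],  # the trainer combines the individual rewards
    args=training_args,
    train_dataset=dataset,
)
trainer.train()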
However, the memory usage is still a bit high now, and we are trying to optimize it.
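In the meantime, the standard TrainingArguments options that GRPOConfig inherits can take the edge off; a hedged sketch (values are illustrative and depend on your hardware and TRL version):

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO",
    logging_steps=10,
    bf16=True,                      # keep activations and gradients in bfloat16
    gradient_checkpointing=True,    # recompute activations instead of storing them
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # preserve the effective batch size
)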
Synthetic Data Generation
The most exciting part of the R1 report is that the main model can generate synthetic reasoning traces, and smaller models fine-tuned on this data achieve performance comparable to the main model. So we also want to reproduce this synthetic reasoning dataset so that everyone can use it to fine-tune their own models.
For a model as large as R1, the difficulty lies in generating data efficiently and quickly. We spent a week experimenting with various configurations.
At first, we ran the model on two 8xH100 nodes with vLLM as the inference server. The results were not good: throughput was low, only 8 requests could be processed at the same time, and the GPUs' KV cache filled up quickly. Once the cache was full, requests were preempted, and with PreemptionMode.RECOMPUTE they had to wait until GPU memory was free and then start over.
Later we switched to 4 nodes of 8xH100, 32 GPUs in total. That gave us enough GPU memory to run 32 requests concurrently, with almost no requests requeued because of a full cache.
At first, we sent requests to vLLM in batches, but the slowest requests in each batch held everything back and GPU utilization fluctuated: a new batch could only start once the previous one had completely finished. We then switched to streaming the requests, and GPU utilization became much more stable:
Changing the code is not difficult. The original batch inference code is:
# 500 requests per batch (runs inside an async function)
for batch in batch_generator(dataset, bs=500):
    active_tasks = []
    for row in batch:
        task = asyncio.create_task(send_requests(row))
        active_tasks.append(task)
    if active_tasks:
        await asyncio.gather(*active_tasks)
The streaming code is now:
active_tasks = set()
for row in dataset:
    # Keep the number of in-flight requests below 500
    while len(active_tasks) >= 500:
        done, active_tasks = await asyncio.wait(
            active_tasks,
            return_when=asyncio.FIRST_COMPLETED
        )
    task = asyncio.create_task(send_requests(row))
    active_tasks.add(task)

# Wait for the remaining tasks to finish
if active_tasks:
    await asyncio.gather(*active_tasks)
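For context, send_requests above is just an async call to the vLLM server's OpenAI-compatible endpoint. A minimal sketch of what such a function could look like (the URL, model name, and prompt column are assumptions):

from openai import AsyncOpenAI

# vLLM's OpenAI-compatible server, assumed to be listening on localhost:8000
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def send_requests(row):
    # "problem" is an assumed column name for the prompt
    response = await client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=[{"role": "user", "content": row["problem"]}],
        temperature=0.6,
        max_tokens=16384,
    )
    return response.choices[0].message.content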
Generation throughput is now quite stable, but we are still considering whether it would be better to swap long, preempted requests to CPU memory rather than recompute them.
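vLLM does expose knobs for that trade-off. A hedged sketch of the engine-level options (the model, parallelism, and values here are placeholders, and preemption behaviour differs across vLLM versions):

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",  # placeholder model
    tensor_parallel_size=4,
    gpu_memory_utilization=0.95,  # leave a little headroom for activations
    max_num_seqs=32,              # upper bound on concurrent sequences
    swap_space=16,                # GiB of CPU memory per GPU for swapped-out KV blocks
)

sampling = SamplingParams(temperature=0.6, max_tokens=16384)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], sampling)
print(outputs[0].outputs[0].text)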
Want to see the current inference code? See .
Promotion
open-r1 has become so popular that even mainstream media are paying attention. Team members appeared in the news repeatedly over the past week:
Lewis went live on CNN!
Thom appeared on Bloomberg:
Leandro chatted with NPR's Planet Money (about 21 minutes in):
There are a bunch of reports:
What can we learn from DeepSeek-R1?
Everyone is still digesting DeepSeek-R1's results and report, but just two weeks after its release the model has gone mainstream and attracted enormous attention.
What reactions did R1 trigger?
The first week after the release was relatively calm, but in the second week the market suddenly came alive and the major AI labs weighed in:
The stock market panicked a bit on Monday, but stabilized over the following days and even recovered:
OpenAI CEO Sam Altman praised DeepSeek and revealed that they will speed up and ship some new things soon:
OpenAI researcher Mark Chen said that DeepSeek's ideas coincide with their own o1 ideas:
Anthropic CEO Dario Amodei took the opportunity to emphasize export controls, outlining a future that is either bipolar or unipolar:
At the same time, many companies rushed to integrate DeepSeek models into their platforms (a few examples below):
Dell: In collaboration with Hugging Face, Dell founder and CEO Michael Dell announced a solution for running DeepSeek-R1 locally:
AWS: Amazon CEO Andy Jassy announced that DeepSeek-R1 is now available on Amazon Bedrock and SageMaker:
Hyperbolic AI:
Together AI:
Fireworks AI:
How realistic is the training cost of DeepSeek V3?
People are especially curious about the training cost of V3 and R1. Although the exact numbers may not matter that much, many people have done back-of-the-envelope estimates and found the published figures broadly plausible. Take a look at these discussions:
Professor Tom Goldstein of the University of Maryland:
Reiner Pope, founder of MatX, compared Llama 3 and DeepSeek V3:
Lukas Beyer of OpenAI, formerly of Google Brain and DeepMind, talked about the origin of MFU:
SemiAnalysis also published a report speculating about the hardware behind DeepSeek:
Many teams are working to reproduce the training process, so we should soon find out just how efficient this model's training really was.
Training data
Last week there was speculation that DeepSeek may have quietly used OpenAI's data to train its own models, as reported by the Financial Times, for example. It is not yet clear where these claims will lead.
The open source community is also very lively
The open source community is extremely active around DeepSeek-R1, and many people are creating various interesting projects based on this model.
What fun projects are people building?
There are projects that try to replicate the basic learning mechanics on a smaller scale so you can try them out at home:
This paper demonstrates a method for reproducing a simple learning curve with a trainer and a Llama 1B model.
Even better, for less than $30 you can experience that "aha moment" yourself with a 3B base model.
Philipp Schmid also wrote Mini-R1, which walks you step by step through finding that "aha moment".
Researchers from the Hong Kong University of Science and Technology tried a larger model: in a paper, they describe how to develop reasoning capabilities starting from a 7B math model.
The folks at the Evolving LLM lab have already started working on a multimodal version of R1, which can be found here:
Stepanov used R1 to extract graphs from text; the tutorial is here:
TinyZero results show that the model’s reasoning ability has become stronger
Chart from HKUST. As training time increases, the model’s inference process becomes longer.
Dataset builders are also busy
Many people in the community are busy working on R1-related datasets. The highlights are:
: Mimics the original data pipeline, using DeepSeek-R1 to generate a large set of questions, reasoning traces, and answers, which were then used to fine-tune Qwen 7B and 32B models.
: A fantastic synthetic reasoning dataset with 114k examples covering math, science, code, puzzles, and more, part of Open Thoughts.
: A large collection of 800,000 samples mixing DeepSeek-R1 and Gemini Flash generations with 200,000 Dolphin chat samples, intended to help train R1-like models.
: Currently 17,000 samples, created by ServiceNow's language model team to support the Open-R1 plan.
: The data used to train Sky-T1-32B-Preview for less than $450; see this article for details.
: Extends an existing method to generate instruction data that includes reasoning, which is quite interesting.
This list covers only a small number of reasoning and problem-solving related datasets on the Hub. We look forward to seeing what other datasets the community is able to build in the coming weeks.
What's next?
We are just getting started: the plan is to finish the training pipeline, try it out on small models, and then use the scaled-up inference pipeline to produce a high-quality dataset. If you want to help, check out GitHub or follow us on Hugging Face!