A 1.5B small model fights back! How DeepScaleR uses reinforcement learning to rewrite the rules of AI math competition

Written by Caleb Hayes
Updated on: July 16, 2025

In the field of AI, how can small models make a comeback? The Berkeley team's DeepScaleR model provides an answer!

Core content:
1. How a small model with 1.5B parameters surpasses large-scale models
2. DeepScaleR's innovative training method: iterative context expansion
3. The application and effect of the reinforcement learning reward mechanism

Training large models typically demands enormous computing resources and high costs. Recently, however, a research team at the University of California, Berkeley used reinforcement learning (RL) fine-tuning to push its 1.5B-parameter DeepScaleR model past OpenAI's o1-preview, demonstrating strong reasoning in mathematics competitions such as AIME 2024. This striking breakthrough has once again sparked heated discussion in the AI field about small models.

Breakthrough achievement: 1.5B small model challenges large model

In the AI community, many believe that only large-scale pre-trained models can excel at mathematical reasoning and other complex tasks. The Berkeley team's research broke that convention: through careful training methods and high-quality mathematical datasets, they used reinforcement learning to improve the reasoning of a model with only 1.5B parameters, surpassing OpenAI's o1-preview.
This achievement proves that bigger is not always better. In specific fields, small models can also show great potential through innovative training methods.

DeepScaleR's innovative training method: step-by-step progress and gradual breakthroughs

The success of DeepScaleR lies not only in the efficiency of its small model, but also in its unique training method, which moves beyond the limitations of traditional large-scale pre-training and adopts a progressive reinforcement learning strategy that gradually guides the model toward stronger reasoning.

1. Iterative context expansion: from short to long, gradually deepening reasoning

DeepScaleR's training strategy is based on iterative context expansion. Unlike methods that train directly at long context lengths, the research team started with shorter contexts and gradually expanded the model's reasoning space. In the initial stage the model worked within an 8K-token context (problems roughly at the level of high school homework); as its reasoning ability improved, the budget was expanded to 16K tokens (more challenging problems) and finally to 24K tokens. This gradual expansion avoids the negative impact that prematurely increasing complexity would have on training.
The core idea is to let the model first master shorter, simpler reasoning chains, and only then tackle more complex mathematical problems by extending the context length. This strategy helps the model build its own reasoning paths step by step, avoids "information overload", and lets it digest and understand complex problems efficiently.
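To make the idea concrete, here is a minimal Python sketch of a staged context-length schedule. The stage lengths mirror the 8K/16K/24K figures above; the step counts and the trainer API (set_max_response_tokens, train) are hypothetical placeholders for illustration, not DeepScaleR's actual training code.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    max_response_tokens: int  # generation budget for this stage
    rl_steps: int             # how many RL updates before expanding

# 8K -> 16K -> 24K, mirroring the schedule described above
# (step counts are illustrative placeholders)
SCHEDULE = [
    Stage(max_response_tokens=8_192,  rl_steps=1_000),
    Stage(max_response_tokens=16_384, rl_steps=500),
    Stage(max_response_tokens=24_576, rl_steps=500),
]

def run_context_expansion(trainer) -> None:
    """Train in stages, raising the context cap only after the model
    has had time to master the shorter budget."""
    for stage in SCHEDULE:
        trainer.set_max_response_tokens(stage.max_response_tokens)  # hypothetical API
        trainer.train(steps=stage.rl_steps)                         # hypothetical API
```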

2. Reinforcement Learning Reward Mechanism: Purely Driven by Correctness

DeepScaleR uses a very simple and effective reinforcement learning reward: 1 point for a correct answer, 0 points for a wrong one. This avoids reward hacking and ensures the model is rewarded only when it produces the correct answer. Unlike the process rewards sometimes used in reinforcement learning, DeepScaleR uses an outcome reward model (ORM), which cares only about the correctness of the final result rather than the intermediate steps. The approach is particularly effective for mathematical reasoning, where the goal is to reach the correct final answer and the quality of intermediate steps does not map directly to the result.
This purely binary reward scheme is also more in line with mathematical rigor: there is no middle ground, only right and wrong. It keeps the model focused on actually solving the problem and discourages redundant reasoning.
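As a rough illustration, an outcome-only binary reward can be as small as the following Python sketch. The \boxed{...} extraction and the exact-string match are simplifying assumptions made here for illustration; DeepScaleR's actual answer checker is not reproduced.

```python
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the final answer out of a response, assuming a \\boxed{...} convention."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def outcome_reward(response: str, ground_truth: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches the reference,
    0.0 otherwise -- no partial credit for intermediate steps."""
    answer = extract_final_answer(response)
    if answer is None:          # missing or unparseable answer earns nothing
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0

# Example: outcome_reward("So the result is \\boxed{42}.", "42") -> 1.0
```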

3. Small Model, Big Breakthrough: Efficient Utilization of Computing Resources

Although DeepScaleR has only 1.5B parameters, its training efficiency far exceeded expectations. The entire training run consumed only 3,800 A100 GPU hours, which puts the training cost at roughly $4,500. Compared with traditional large-scale pre-training, where models with many more parameters and longer contexts routinely cost millions of dollars or more to train, this is remarkably cost-effective. DeepScaleR's success shows that, with innovative training methods, small models can acquire powerful reasoning capabilities without huge computing resources.
This efficiency comes from two factors: the effective use of reinforcement learning, and iterative context expansion, which lets the training process concentrate on the model's reasoning ability rather than on brute-force computation.
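As a back-of-the-envelope check of the figures above: 3,800 A100 hours at roughly $1.2 per GPU-hour (an assumed cloud rate implied by the reported numbers, not a price quoted in the article) lands near the cited $4,500.

```python
# Sanity check of the cost figure (the per-hour rate is an assumption).
gpu_hours = 3_800
usd_per_a100_hour = 1.2            # assumed cloud price per A100-hour
total_cost = gpu_hours * usd_per_a100_hour
print(f"~${total_cost:,.0f}")      # ~$4,560, consistent with the ~$4,500 cited
```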

Outstanding performance: beyond o1-preview in math competitions

To evaluate DeepScaleR, the research team compared it with several mainstream models, including its base model, DeepSeek-R1-Distill-Qwen-1.5B, and the academic models rSTAR-Math-7B, Eurus-2-7B-PRIME, and Qwen2.5-7B-SimpleRL.
The comparison shows that DeepScaleR-1.5B-Preview performed well across multiple math competition benchmarks, including AIME 2024, MATH 500, and AMC 2023. On AIME 2024 in particular, DeepScaleR's Pass@1 accuracy reached 43.1%, an improvement of 14.4 percentage points over the base model. This result surpassed OpenAI's o1-preview and demonstrated excellent reasoning ability.
Across all tests, DeepScaleR's average score reached 57.0%, well above the other baseline models. It also achieved strong results of 87.8% on MATH 500 and 73.6% on AMC 2023.
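For reference, the Pass@1 metric used in these comparisons can be estimated as in the sketch below: sample one or more completions per problem, check each against the reference answer, and average the per-problem success rate. The multi-sample averaging convention and the results layout here are assumptions made for illustration, not the team's exact evaluation script.

```python
def pass_at_1(results: list[list[bool]]) -> float:
    """results[i] holds the correctness of each sampled completion for problem i;
    Pass@1 is the mean per-problem fraction of correct samples."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)

# Example: 3 problems, 2 samples each -> (0.5 + 0.0 + 1.0) / 3 = 0.5
print(pass_at_1([[True, False],
                 [False, False],
                 [True, True]]))
```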