Qwen 2.5 Technical Report Interpretation

Written by
Audrey Miles
Updated on: July 14, 2025
Recommendation

Qwen 2.5 Technical Report: More efficient pre-training and post-training optimization.

Core content:
1. Improvements in the pre-training and post-training phases of Qwen 2.5
2. Details of the core components of the model and the newly developed size model
3. Significant enhancements to pre-training data quality and the data filtering mechanism



Paper publication date: December 19, 2024

This paper is quite enjoyable to read. Unlike the Qwen 1 report, which runs to almost 100 pages, it can be read in about 10 minutes. Moreover, the Qwen2.5 technical report mainly describes the differences from Qwen2 from a training perspective and does not go deep into technical principles.

Although Qwen was not the first to release a reasoning model, nor did it become as popular as DeepSeek, the hundreds of models it has open-sourced over the past few years make its significance self-evident.


Abstract & Introduction

The paper introduces the Qwen2.5 series of LLMs, which have significant improvements in both pre-training and post-training stages.

The pre-training dataset is expanded to 18 trillion tokens, laying a solid foundation for common sense, expertise, and reasoning capabilities. Post-training uses supervised fine-tuning and multi-stage reinforcement learning to improve alignment with human preferences as well as long-text generation, structured data analysis, and instruction-following capabilities.


Architecture & Tokenizer

The model architecture itself remains unchanged (decoder-only), but the training corpus and training process have been improved.

The core components of the Qwen2.5 model remain as follows:

  • GQA (Grouped Query Attention)
  • SwiGLU (Swish-gated linear unit) activation
  • RoPE (Rotary Position Embedding)
  • RMSNorm (root mean square layer normalization)
  • DCA (Dual Chunk Attention)
  • YaRN (RoPE scaling for long-context extrapolation)
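To make one of these components concrete, here is a minimal RMSNorm sketch in NumPy (the function name and shapes are illustrative; this is the standard formula, not code from the Qwen repository). Unlike LayerNorm, RMSNorm skips mean subtraction and only rescales by the root mean square of the features:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale each feature vector by its root mean square; no mean
    # subtraction, unlike LayerNorm. `weight` is a learned per-feature gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# Example: a (batch, hidden) activation with a unit gain.
x = np.array([[3.0, -4.0]])
y = rms_norm(x, np.ones(2))
```

After normalization the mean square of the features is 1, which is what makes the operation cheaper than LayerNorm while stabilizing activations similarly.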

The series covers models of various sizes: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B dense models, plus the MoE-based Qwen2.5-Turbo and Qwen2.5-Plus.


Pre-training


Pre-training data

Compared with its predecessor Qwen2, Qwen2.5 shows significant improvements in pre-training data quality:

(1) Better data filtering. The Qwen2-Instruct models are leveraged as data-quality filters to evaluate and score training samples. This enables more fine-grained quality assessment, retaining more high-quality training data and more effectively filtering out low-quality samples across many languages.
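The report does not publish the filtering code, but the idea can be sketched as a score-and-threshold loop. Everything below is a hypothetical illustration: `score_sample` is a stub heuristic standing in for an LLM judge such as Qwen2-Instruct, and the threshold is made up:

```python
def score_sample(text):
    # Stub quality score (0-10). In practice this would prompt an
    # instruct model (e.g. Qwen2-Instruct) to rate the sample.
    return min(10, len(set(text.split())))

def filter_corpus(samples, threshold=5):
    # Keep only samples the judge scores at or above the threshold.
    return [s for s in samples if score_sample(s) >= threshold]

corpus = [
    "the the the the",
    "a diverse well written sentence with many distinct words",
]
clean = filter_corpus(corpus)
```

The design point is that the filter is itself a model, so "quality" can mean fluency, factuality, or educational value rather than simple surface heuristics.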

(2) Better math and code data. In the pre-training stage of Qwen2.5, the training data from Qwen2.5-Math and Qwen2.5-Coder are integrated.

(3) Better synthetic data. To generate high-quality synthetic data, especially for mathematics, code, and knowledge, Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct are used. The quality of this synthetic data is further improved by rigorous filtering with a proprietary general reward model and the specialized Qwen2-Math-RM-72B model.

This is effectively knowledge distillation.

Based on these techniques, a larger and higher-quality pre-training dataset was built, growing from the 7 trillion tokens used in Qwen2 to 18 trillion tokens.


Continue exploring Scaling Law

While previous studies have mainly used Scaling Laws to determine the optimal model size for a given computational budget, the Qwen team leveraged them to identify optimal hyperparameters across model architectures. Specifically, Scaling Laws help determine key training parameters for dense models and MoE models of different sizes, such as batch size B and learning rate μ.
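The report does not publish its fitted formulas, but the mechanics of this kind of scaling-law fit can be sketched: assume a power law such as lr(N) = a · N^b, fit it in log space from a few (model size, best learning rate) pairs, then extrapolate to a larger model. The data points below are synthetic, purely for illustration:

```python
import numpy as np

# Made-up "observations": for a few model sizes, the learning rate that
# minimized loss in small-scale sweeps. The Qwen report does not publish
# its actual points; these follow lr = 3e-4 * (N / 1e9) ** -0.25 exactly.
sizes = np.array([0.5e9, 1.5e9, 7e9, 14e9])
best_lr = 3e-4 * (sizes / 1e9) ** -0.25

# A power law is linear in log space, so an ordinary least-squares fit
# recovers the exponent b and prefactor a.
b, log_a = np.polyfit(np.log(sizes), np.log(best_lr), 1)
a = np.exp(log_a)

def predict_lr(n_params):
    # Extrapolate the fitted law to an unseen model size.
    return a * n_params ** b
```

The same template applies to batch size, and (as the report notes) the fitted laws can be compared across dense and MoE families to match an MoE configuration to a target dense model.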

In addition, the Scaling Law is used to predict and compare the performance of MoE models with different parameter counts with their dense counterparts. This analysis guides the hyperparameter configuration of the MoE model, enabling performance parity with specific dense model variants (e.g., Qwen2.5-72B and Qwen2.5-14B) to be achieved by carefully tuning activation parameters and total parameters.


Long context pre-training

Qwen2.5 adopts a two-stage pre-training method:

The initial phase trains with a context length of 4,096 tokens, followed by an extension phase for longer sequences.

In the final pre-training stage, the context length of all model variants except Qwen2.5-Turbo is extended from 4,096 tokens to 32,768 tokens. At the same time, the RoPE base frequency is increased from 10,000 to 1,000,000 using the ABF (adjusted base frequency) technique.
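The effect of raising the RoPE base can be seen directly in the inverse frequencies: a larger base slows the rotation of the low-frequency dimension pairs, stretching the longest wavelength so distant positions remain distinguishable. A small sketch (the head dimension of 128 is an assumption for illustration):

```python
import numpy as np

def rope_inv_freq(base, head_dim=128):
    # Standard RoPE inverse frequencies: base ** (-2i / d) for each
    # dimension pair i = 0, 1, ..., d/2 - 1.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

f_old = rope_inv_freq(10_000)      # pre-ABF base
f_new = rope_inv_freq(1_000_000)   # post-ABF base

# The lowest frequency sets the longest wavelength (in tokens) that the
# rotation can represent before wrapping around.
max_wavelength_old = 2 * np.pi / f_old[-1]
max_wavelength_new = 2 * np.pi / f_new[-1]
```

With base 10,000 the longest wavelength is on the order of tens of thousands of tokens; with base 1,000,000 it grows to millions, which is why the adjustment is a prerequisite for the 32K+ context lengths described here.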

To enhance the model’s ability to handle longer sequences during inference, YaRN and DCA were implemented. With these innovations, the sequence-length capacity is quadrupled, enabling Qwen2.5-Turbo to handle up to 1 million tokens and the other models up to 131,072 tokens.
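The core idea behind DCA can be illustrated in a heavily simplified form: absolute positions beyond the trained window are decomposed into a chunk index and a within-chunk index, so relative positions used in attention stay inside the range seen during training. This is only a sketch of the position decomposition, with a toy chunk size; the real method also defines separate intra-chunk, inter-chunk, and successive-chunk attention patterns:

```python
def chunk_positions(seq_len, chunk_size=4):
    # Decompose each absolute position into (within-chunk index, chunk id).
    # Within-chunk indices never exceed chunk_size - 1, so relative
    # offsets computed from them stay within the trained range.
    intra = [p % chunk_size for p in range(seq_len)]
    chunk_id = [p // chunk_size for p in range(seq_len)]
    return intra, chunk_id

intra, cid = chunk_positions(10)
```

Because the decomposition is purely an indexing change, it requires no additional training, which is what makes it attractive for extending context at inference time.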


Post-training

Compared to Qwen 2, Qwen 2.5 introduces two major improvements in its post-training design:

(1) Expanded supervised fine-tuning data coverage: The supervised fine-tuning process leverages a massive dataset containing millions of high-quality samples. This expansion specifically addresses areas where previous models had limitations, such as long-sequence generation, mathematical problem solving, coding, instruction following, structured data understanding, logical reasoning, cross-lingual transfer, and robust handling of system prompts.

(2) Two-stage reinforcement learning: The reinforcement learning (RL) process in Qwen 2.5 is split into two distinct stages: offline RL and online RL.
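The offline stage is preference-based (the report names DPO below). For orientation, here is a minimal sketch of the per-pair DPO loss; the function and argument names are illustrative, not from any Qwen codebase:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO loss for one preference pair:
    #   -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    # where pi_* are policy log-probs and ref_* are frozen reference
    # log-probs of the chosen/rejected responses.
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference, the margin is 0 and the loss
# is exactly log(2); training pushes the chosen margin above the
# rejected one, driving the loss toward 0.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

This offline stage needs no reward model at training time, only a fixed dataset of preference pairs, which is what distinguishes it from the online RL stage that follows.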

Post-training techniques DPO, S