Qwen2.5-Max fully embraces DeepSeek technology route

Written by Iris Vance
Updated on: July 16, 2025
Recommendation: Explore the breakthrough progress of ultra-large-scale MoE models in the development of AGI.

Core content:
1. The connection between Scaling Law and AGI and the disclosure of DeepSeek's technical route
2. The ultra-large-scale MoE architecture and training process of the Qwen2.5-Max model
3. The advantages of the MoE model compared to the Dense model and the expert collaborative working mechanism


It is widely believed that scaling laws offer a possible path to AGI: continuously expanding the scale of data and model parameters can significantly improve a model's intelligence. However, whether for dense models or mixture-of-experts (MoE) models, both research and industry have limited experience in effectively scaling extremely large models.

Many key details of this scaling process were not disclosed until the recent release of the DeepSeek V3 and R1 models, which let the community see both the effects of an ultra-large-scale MoE model and how it is achieved (reinforcement learning and knowledge distillation).

At the same time, the Alibaba Tongyi Qianwen team has been developing Qwen2.5-Max, an ultra-large-scale MoE model pre-trained on more than 20 trillion tokens and then post-trained with carefully curated supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Qwen2.5-Max fully embraces the DeepSeek technology route.


1. Dense model or MoE model

What are dense (Dense) and mixture-of-experts (MoE) models? They are two network architectures with significant differences in the field of deep learning.
In a Dense model, all parameters are activated for every input: each token passes through every layer's full set of weights. This keeps the architecture simple and the learned features fully shared, but the compute cost per token grows in lockstep with the total parameter count.
An MoE model, by contrast, distributes the work across a set of expert networks: a gating network decides which experts should process each input.
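
To make the gating mechanism concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. All names and sizes (d_model, n_experts, top_k) are illustrative assumptions, not the actual Qwen2.5-Max configuration.

```python
# Minimal sketch of an MoE layer with top-k gating (illustrative, not Qwen2.5-Max's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an ordinary feed-forward sub-network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The gating network scores every expert for each token.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                      # x: (n_tokens, d_model)
        scores = self.gate(x)                  # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts
        out = torch.zeros_like(x)
        # Only the selected experts run; the rest stay idle for this token.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out
```
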
Why did Qwen2.5-Max choose MoE instead of Dense? An MoE model handles specific tasks efficiently through the collaborative work of multiple expert sub-models, intelligently selecting the relevant experts to process each input, which optimizes the use of computing resources and improves overall efficiency and effectiveness.
  • Expert collaboration : The MoE model handles specific tasks more effectively through the collaboration of multiple "expert" sub-models. This division of labor is similar to the experts on a team who each perform their own duties and work together to complete a complex project, improving overall efficiency and effectiveness.

  • Intelligent selection of experts : The MoE architecture intelligently selects the appropriate "expert" models to process the input data, optimizing the use of computing resources. When processing different tasks, only the relevant expert sub-models are activated, reducing unnecessary computing overhead; the short calculation after this list makes the saving concrete.
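
The compute saving follows from simple arithmetic: a token only pays for the experts it is routed to, not for all of them. The numbers below are made up purely for illustration; the real Qwen2.5-Max configuration has not been fully disclosed.

```python
# Back-of-the-envelope illustration of sparse activation (all numbers assumed).
n_experts   = 64        # experts per MoE layer (assumed)
top_k       = 2         # experts activated per token (assumed)
expert_size = 1.0e9     # parameters per expert (assumed)
shared_size = 10.0e9    # attention, embeddings, etc. used by every token (assumed)

total_params  = shared_size + n_experts * expert_size
active_params = shared_size + top_k * expert_size

print(f"total parameters:   {total_params / 1e9:.0f}B")            # 74B
print(f"active per token:   {active_params / 1e9:.0f}B")           # 12B
print(f"activated fraction: {active_params / total_params:.1%}")   # ~16.2%
```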

Large-model vendors have successively abandoned Dense in favor of MoE, much as, in the mobile-internet era, companies chose horizontally replicated microservice architectures instead of continuing to scale up the performance of a single machine.

In the base-model comparison, Qwen2.5-Max was evaluated against the leading open-source MoE model DeepSeek V3, the largest open-source dense model Llama-3.1-405B, and the leading open-source dense model Qwen2.5-72B. The results show that the MoE models (Qwen2.5-Max and DeepSeek V3) scored higher than the dense models (Llama-3.1-405B and Qwen2.5-72B). The specific comparison results are shown in the figure below.



2. Pre-training and post-training

How does Qwen2.5-Max perform pre-training and post-training? Qwen2.5-Max achieves efficient pre-training and post-training by pre-training on more than 20 trillion tokens, then combining carefully curated supervised fine-tuning (SFT) with reinforcement learning from human feedback (RLHF).
  1. Supervised fine-tuning (SFT) : fine-tuning a pre-trained model on a large amount of manually annotated data.

  2. Reinforcement learning from human feedback (RLHF) : collecting human feedback on model outputs and optimizing the model with reinforcement-learning algorithms. Qwen2.5-Max combines multi-stage reinforcement learning, including offline DPO and online GRPO; a minimal DPO loss sketch follows this list.
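
To make the offline stage concrete, here is a minimal sketch of the standard DPO loss, which compares the policy's log-probabilities on a preferred and a rejected response against a frozen reference model. The function and the beta value are generic illustrations, not Qwen's internal code.

```python
# Generic DPO loss sketch (standard formulation, not Qwen2.5-Max's implementation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is a (batch,) tensor: the total log-probability that the
    policy / frozen reference model assigns to the chosen / rejected response.
    beta controls how far the policy may drift from the reference."""
    chosen_reward   = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```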

Why does Qwen2.5-Max embrace the DeepSeek technology route? Its pre-training and post-training pipeline resembles OpenAI's in that both rest on large-scale data, advanced architectures, supervision, and reinforcement learning. Its distinguishing choices are an optimized GRPO reinforcement-learning algorithm and the use of knowledge distillation instead of large-scale SFT for post-training. These strategies are consistent with DeepSeek's explorations in improving model performance and efficiency, which is why Qwen2.5-Max is considered to embrace the DeepSeek technology route.

  1. GRPO (Group Relative Policy Optimization) : In traditional reinforcement learning, the model (the "policy model") adjusts its behavior based on reward signals from the environment, which usually involves an additional value model (the "critic") to evaluate the quality of the current policy. GRPO simplifies this process: it does not require a value model, and instead optimizes the policy model through relative rewards within a group of sampled responses, as sketched below.
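
A minimal sketch of the group-relative advantage computation that replaces the critic: several responses are sampled per prompt, and each reward is normalized against the group's own statistics. This follows the published GRPO formulation in simplified form; the clipped policy-gradient update and KL penalty are omitted.

```python
# Group-relative advantages: the group itself provides the baseline, so no value model is needed.
import torch

def group_relative_advantages(rewards):
    """rewards: (n_prompts, group_size) - one scalar reward per sampled
    response, e.g. group_size responses sampled for the same prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std  = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Example: 4 responses to one prompt, scored by a reward model.
rewards = torch.tensor([[0.9, 0.2, 0.5, 0.4]])
print(group_relative_advantages(rewards))
# Responses above the group mean get positive advantage, those below get negative.
```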

  2. Knowledge distillation : a model-compression and knowledge-transfer method that improves a small student model by transferring knowledge from a large teacher model, reducing model size while maintaining or improving performance. A minimal sketch follows.
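
Here is a minimal sketch of the classic soft-label distillation loss (Hinton-style); Qwen's actual distillation recipe is not public, so this shows only the generic technique.

```python
# Classic soft-label distillation loss (generic technique, not Qwen2.5-Max's recipe).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Both logit tensors have shape (batch, vocab). Softening with a
    temperature exposes the teacher's full output distribution, not just
    its argmax, to the student."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student  = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 to keep gradient magnitudes
    # comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```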