Qwen3 hardcore analysis: From 36 trillion tokens to "thinking budget"

Written by
Silas Grey
Updated on: June 19, 2025
Recommendation

Alibaba Qwen3 series models, a breakthrough integration of thinking and non-thinking modes, lead a new era of large model performance.

Core content:
1. Qwen3 pioneered the "thinking budget" mechanism to achieve the best balance between performance and latency
2. A 36-trillion-token pre-training corpus and support for 119 languages broaden global application scenarios
3. Architecture upgrades make the models efficient and robust, and the Qwen3 series achieves SOTA performance on multiple benchmarks

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

Introduction

From GPT-4o and Claude 3.7 to Llama-4, these models are trained on massive amounts of data and demonstrate remarkable knowledge and task-solving capabilities. However, existing models often face a core challenge: how to strike a balance between deep reasoning (which requires multi-step thinking) and fast responses (which require direct, context-driven answers)? Users often need to switch between models optimized for chat and models focused on reasoning, which adds complexity.

To address this pain point and further improve the overall performance and efficiency of open-source large models, Alibaba recently released the Qwen3 series. Qwen3 not only inherits the strong performance tradition of the Qwen family but also makes bold innovations in model design, aiming to seamlessly integrate thinking mode and non-thinking mode.

A sneak peek at the conclusion of this article:

  1. Pioneering fusion mode:  Qwen3 integrates the "thinking mode" (for complex multi-step reasoning) and the "non-thinking mode" (for fast contextual response) into a single model for the first time, and introduces a "thinking budget" mechanism that allows users to dynamically allocate computing resources to achieve the best balance between performance and latency.
  2. Extreme pre-training scale and multi-language support:  The model is pre-trained on up to 36 trillion tokens, a significant increase in data volume compared to the previous generation. It supports 119 languages and dialects, greatly broadening global application scenarios.
  3. Efficient and robust architecture upgrade:  In terms of model architecture, the Qwen3 dense model removes QKV-bias and introduces QK-Norm to improve training stability; the MoE (mixture of experts) model further improves expert division of labor and reasoning efficiency by removing shared experts and adopting global batch load balancing loss.
  4. Comprehensive and leading performance:  The Qwen3 series models achieve leading SOTA (state-of-the-art) results across benchmark tests, especially in code generation, mathematical reasoning, and agent tasks. The flagship model Qwen3-235B-A22B even surpasses top open-source models such as DeepSeek-V3 on many metrics and is comparable to closed-source models such as GPT-4o and Gemini 2.5 Pro.
  5. "Strong to Weak" Distillation Strategy:  For small models, Qwen3 introduces an innovative "strong to weak" distillation strategy. By drawing knowledge from large flagship models, small models can maintain high competitiveness while significantly reducing training costs and development workload.

Next, we’ll explore these exciting innovations in detail.

Model architecture revealed: robustness and efficiency coexist

Qwen3 has iterated and optimized the model architecture to achieve higher training stability and inference efficiency. It includes two types of models: dense models and Mixture-of-Experts (MoE) models.

Dense Models

The Qwen3 series includes 6 dense models with parameter sizes ranging from 0.6 billion to 32 billion. These models largely follow the proven architecture of Qwen2.5, for example:

  • Grouped Query Attention (GQA)
  • SwiGLU activation function
  • Rotary Positional Embeddings (RoPE)
  • RMSNorm normalization

However, Qwen3 is not a simple copy. The paper notes that the research team removed the QKV-bias used in Qwen2 and introduced QK-Norm. This change is significant:

  • QKV-bias : In the attention mechanism, QKV-bias refers to the bias terms added to the query, key, and value projections. Although they can increase the expressiveness of the model, they may cause training instability or performance degradation in some cases.
  • QK-Norm : a normalization applied to queries and keys to keep attention computations stable. With this adjustment, Qwen3's dense models are more robust during training (a minimal sketch follows below).
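To make the idea concrete, here is a minimal sketch of QK-Norm in a simplified attention step. It is not Qwen3's actual implementation; it simply assumes an RMSNorm applied over the head dimension of queries and keys, with no QKV bias terms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Minimal RMSNorm over the last dimension."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def attention_with_qk_norm(q, k, v, q_norm: RMSNorm, k_norm: RMSNorm):
    # q, k, v: (batch, heads, seq, head_dim); note there are no Q/K/V bias terms
    q = q_norm(q)  # normalize queries before the dot product
    k = k_norm(k)  # normalize keys before the dot product
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```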

Mixture of Experts Models (MoE Models)

The Qwen3 series also launched two MoE models: Qwen3-30B-A3B and the flagship model Qwen3-235B-A22B. The characteristic of the MoE architecture is that the model contains multiple "expert networks". During reasoning, only some of the experts will be activated to process the input, thereby achieving efficient reasoning with a huge number of parameters.

Qwen3's MoE model design highlights include:

  • 128 experts, 8 experts activated for each token : This means that the model has a lot of "professional domain knowledge", but when processing each input, only the 8 most relevant "experts" will be called, which greatly improves the reasoning efficiency.
  • Shared Experts removed : Unlike Qwen2.5-MoE, the Qwen3-MoE model no longer uses shared experts.
  • Introducing Global-batch Load Balancing Loss : This is an important optimization in MoE model training.
    • In traditional MoE training, load balancing is usually performed in the micro-batch dimension, which may cause the expert to fail to fully utilize the data diversity for learning during the entire training process.
    • Global-batch load balancing is performed over the larger global batch, which means the model can learn from a wider data distribution during training, prompting experts to specialize better and avoiding a situation where a few experts take on most of the tokens while the rest stay idle. This not only improves training efficiency but also enhances overall model performance. According to the paper, global load balancing significantly improves downstream-task performance and expert specialization (a minimal sketch of the loss follows below).
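The following sketch contrasts the two balancing granularities. It is an illustrative approximation using the standard auxiliary loss from Switch-style MoE routing, not the exact loss from the Qwen3 paper; the per-microbatch statistics and how they are pooled are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    """Auxiliary loss of the form num_experts * sum_i f_i * P_i."""
    # router_probs: (tokens, num_experts) softmax outputs of the router
    # expert_indices: (tokens, top_k) experts selected for each token
    one_hot = F.one_hot(expert_indices, num_experts).float()
    f = one_hot.sum(dim=(0, 1)) / expert_indices.numel()  # fraction of assignments per expert
    p = router_probs.mean(dim=0)                          # mean routing probability per expert
    return num_experts * torch.sum(f * p)

def global_batch_loss(microbatch_stats, num_experts):
    """Pool routing statistics across all micro-batches (e.g., gathered across
    devices) and compute a single loss, so experts only need to balance over the
    full global batch rather than within every small micro-batch."""
    f = torch.stack([s["f"] for s in microbatch_stats]).mean(dim=0)
    p = torch.stack([s["p"] for s in microbatch_stats]).mean(dim=0)
    return num_experts * torch.sum(f * p)
```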

In addition, the Qwen3 model continues to use Qwen's tokenizer, with a vocabulary size of 151,669. These sophisticated architectural designs lay a solid foundation for Qwen3's excellent performance.
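As a quick sanity check, the tokenizer can be loaded from any Qwen3 checkpoint on Hugging Face; the model id below is an assumption, and the printed count may differ slightly depending on how special tokens are counted.

```python
from transformers import AutoTokenizer

# Assumed Hugging Face model id; all Qwen3 checkpoints share the same tokenizer.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
print(len(tok))  # total vocabulary including special tokens, reported as 151,669 in the paper
```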

Massive data: Qwen3's pre-training journey

The reason why Qwen3 achieves such high performance is inseparable from the massive data scale and refined processing in the pre-training stage. The paper points out that Qwen3 significantly expanded its pre-training data: the total data volume doubled compared to Qwen2.5, and the number of supported languages roughly quadrupled (from 29 to 119).

Data size and diversity

Qwen3's pre-training dataset contains an astonishing 36 trillion tokens, covering 119 languages and dialects. This means the model is exposed to an unprecedented breadth of languages and knowledge during training. To build such a large, high-quality dataset, the research team adopted several strategies:

  • PDF document text extraction and refinement:  Qwen2.5-VL (Qwen2.5 visual language model) is used to extract text from a large number of PDF documents, and then the recognized text is refined through the Qwen2.5 model, effectively obtaining trillions of tokens of high-quality text data.
  • Domain-specific data synthesis:  With professional models such as Qwen2.5, Qwen2.5-Math (mathematical model) and Qwen2.5-Coder (code model), trillions of tokens of domain-specific data are synthesized, including textbooks, question-answer pairs, instructions and code snippets, covering dozens of fields. This ensures the deep learning of the model in specific professional fields.
  • Multilingual data expansion:  Additional multilingual data was incorporated, increasing the number of supported languages from 29 in Qwen2.5 to 119 and significantly improving the model's cross-language understanding and generation capabilities.
  • Fine-grained data annotation and mixing:  The team developed a multilingual data annotation system that annotated more than 30 trillion tokens along dimensions such as educational value, domain, topic, and safety. Unlike previous mixture optimization at the data-source or domain level, Qwen3 optimizes the data mixture at the instance level, arriving at a more effective fine-grained combination through extensive ablation experiments on small proxy models (see the sketch after this list).
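The paper does not publish the exact mixing procedure, so the following is only a hypothetical sketch of what instance-level mixing could look like: each document carries its own annotations (domain, educational value) and receives its own sampling weight, rather than inheriting a single weight from its source corpus. All field names and weighting rules here are invented for illustration.

```python
import random

def sample_training_corpus(documents, domain_weights, quality_floor=0.5, k=1000):
    """Hypothetical instance-level mixing: weight each document by its own
    annotations instead of by the corpus it came from."""
    weights = []
    for doc in documents:
        w = domain_weights.get(doc["domain"], 1.0)  # per-domain up/down-weighting
        w *= max(doc["edu_value"], 0.0)             # annotated educational value in [0, 1]
        if doc["edu_value"] < quality_floor:
            w *= 0.1                                # heavily down-weight low-quality instances
        weights.append(w)
    return random.choices(documents, weights=weights, k=k)
```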

Three-stage pre-training strategy

The pre-training process of Qwen3 adopts a three-stage strategy to gradually give the model powerful capabilities:

  1. General Stage (S1):  In this stage, all Qwen3 models are first trained on more than 30 trillion tokens with a sequence length of 4,096 tokens. The goal is to build proficiency in 119 languages and a broad base of general world knowledge.
  2. Reasoning Stage (S2):  To further improve the reasoning ability of the model, the pre-training corpus in this stage focuses on increasing the proportion of STEM (science, technology, engineering, mathematics), code, reasoning tasks, and synthetic data. The model is trained on about 5 trillion high-quality tokens, and the sequence length is still 4,096 tokens. The learning rate decay is accelerated in this stage to focus on and strengthen reasoning capabilities faster.
  3. Long Context Stage:  In the final pre-training stage, the research team collected high-quality long-context corpora and extended the context length of the Qwen3 models to 32,768 tokens. In this corpus, 75% of the texts are between 16,384 and 32,768 tokens long, and 25% are between 4,096 and 16,384 tokens. Similar to Qwen2.5, Qwen3 increases the RoPE base frequency from 10,000 to 1,000,000 and introduces YaRN (Yet another RoPE extensioN) and Dual Chunk Attention (DCA) to quadruple the usable sequence length at inference time, thereby handling ultra-long contexts efficiently (a configuration sketch follows below).
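For readers who want to try the 4x inference-time extension, the Qwen3 model cards describe enabling YaRN through a rope_scaling entry in the model config. The snippet below is a sketch of that approach using Hugging Face transformers; the model id and exact field names are assumptions and should be checked against the official card for the checkpoint you use.

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen3-32B"  # assumed checkpoint id

# Enable YaRN so the 32,768-token native context can be stretched roughly 4x at inference.
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, torch_dtype="auto")
```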

Through the careful design of these three stages, the Qwen3 model not only possesses solid general knowledge, but also has excellent reasoning ability and the advantage of processing ultra-long texts.

Overview of pre-training performance

The paper compares the performance of Qwen3 basic models in general tasks, mathematics and STEM, code, and multilingual tasks. The key highlights include:

  • The flagship model Qwen3-235B-A22B-Base leads across the board:  Even compared with top open-source models such as DeepSeek-V3 Base (about 3x the total parameters and 1.5x the activated parameters of Qwen3) and Llama-4-Maverick (about 2x the total parameters), Qwen3-235B-A22B-Base performs better on most benchmarks. It also surpasses its predecessor Qwen2.5-72B-Base on all benchmarks while using only about 1/3 of the activated parameters, significantly reducing inference and training costs.
  • Amazing efficiency of MoE model:  Qwen3's MoE base model can achieve similar performance to Qwen3 dense model with only 1/5 of the activation parameters, and performs better with less than half of the activation parameters and total parameters of Qwen2.5 MoE model. This means a higher performance efficiency ratio.
  • Improved competitiveness of small models:  Qwen3's dense models, such as Qwen3-32B-Base, surpass Qwen2.5-72B-Base on many benchmarks despite having less than half its parameters. This shows that Qwen3 has achieved a leap in model efficiency and capability, making smaller-scale models highly competitive.

The results of these pre-training stages have laid a solid foundation for subsequent models to cope with complex instructions and changing scenarios.

Crafted with care: Qwen3's post-training recipe

Pre-training gives the model general knowledge and preliminary capabilities, while post-training is to polish these "raw capabilities" into a "weapon" that can accurately understand and respond to user instructions. Qwen3's post-training process is cleverly designed with two core goals:

  1. Thinking Control:  Gives the model the ability to switch between “thinking” and “non-thinking” modes, and can control the depth of thinking according to user needs (through thinking budget).
  2. Strong-to-Weak Distillation:  For small models, by learning from large models, it can significantly improve their performance while significantly reducing the computational cost and development effort.

Qwen3’s flagship models (such as 235B-A22B) follow a complex four-stage training process , while smaller models efficiently acquire similar capabilities through innovative distillation techniques.

Four-Stage Post-training Process

1. Long-CoT Cold Start

This is the starting stage of the model learning to generate long chains of thought (CoT). The team carefully constructed a comprehensive dataset containing math, code, logical reasoning, and STEM questions, each of which is accompanied by a verified reference answer or code test case.

  • Strict data screening:
    • Query Filtering:  Remove queries that are difficult to verify (such as those containing multiple sub-questions or requiring open-ended text generation) and queries that Qwen2.5-72B-Instruct can answer correctly without CoT. This ensures the model only learns from complex questions that truly require deep reasoning (see the sketch after this list).
    • Response Filtering:  QwQ-32B (the Qwen team's reasoning model) is used to generate multiple candidate responses for each query. These responses are then strictly screened to remove incorrect final answers, repetitions, pure guesses without reasoning, inconsistencies between the thinking and the summary, mixed languages or style deviations, and responses too similar to the validation set.
  • Goal:  At this stage, the goal is to allow the model to form a basic reasoning pattern, rather than immediately pursuing the ultimate in reasoning performance. By minimizing the number of training samples and steps, we can leave room for greater improvement in the subsequent reinforcement learning stage.
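As an illustration of the query-filtering rule described above, here is a hypothetical sketch. The helper functions (answer_without_cot, is_verifiable) and the reference_answer field are invented for the example; the paper only describes the criteria, not an implementation.

```python
def filter_cold_start_queries(queries, answer_without_cot, is_verifiable):
    """Keep only verifiable queries that the non-reasoning model gets wrong."""
    kept = []
    for q in queries:
        if not is_verifiable(q):
            continue  # drop multi-part or open-ended questions that are hard to check
        if answer_without_cot(q) == q["reference_answer"]:
            continue  # too easy: solvable without chain-of-thought
        kept.append(q)
    return kept
```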

2. Reasoning RL

After the model has the initial CoT capability, it enters the reasoning reinforcement learning stage to further improve its reasoning ability.

  • Data selection principle:  Select data that is not used in the cold start phase, that the cold start model can learn, that is sufficiently challenging, and that covers a wide range of sub-domains.
  • Training strategy:  The GRPO (Group Relative Policy Optimization) algorithm was adopted, and the team found that large batch sizes, a high number of rollouts per query, and off-policy training were very beneficial to sample efficiency and the training process. By controlling the model's entropy, the research team balanced exploration and exploitation, achieving continuous improvement in training rewards and validation performance; for example, the score of Qwen3-235B-A22B on AIME'24 increased from 70.1 to 85.1 (a minimal sketch of the group-relative advantage follows below).
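To show the core idea of GRPO without the full RL machinery, the sketch below computes group-relative advantages: each response is scored against the other rollouts sampled for the same prompt, so no separate value network is needed. This is a generic illustration of the algorithm, not the Qwen3 training code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages for GRPO.

    rewards: (num_prompts, rollouts_per_prompt), one reward per sampled response.
    Each response's advantage is its reward standardized within its own group.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-6
    return (rewards - mean) / std
```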

3. Thinking Mode Fusion

This phase aims to integrate “non-thinking” capabilities into models that already have “thinking” capabilities, allowing one model to handle both modes at the same time and reduce the complexity of deploying multiple models.

  • SFT data construction:  Combine "thinking" data (generated by rejection sampling using the Stage 2 model) and "non-thinking" data (including code, mathematics, instruction following, multilingual tasks, creative writing, question answering, and other diverse tasks) for continuous supervised fine-tuning (SFT). Ensure data quality and diversity through automatically generated checklists and adding translation tasks for low-resource languages.
  • Chat template design:  To enable dynamic mode switching, Qwen3 designed a chat template with /think and /no_think flags; the model recognizes the flag in an instruction and selects the corresponding mode. By default the model is in thinking mode: if the user does not specify a flag, the model tends to think. In multi-turn dialogue, the model follows the last flag encountered (see the usage sketch after this list).
  • The "emergence" ability of thinking budget:  A surprising discovery is that once the model learns to respond in two modes, it naturally develops the ability to handle intermediate situations - that is, to generate responses based on incomplete thinking. When the length of the model's thinking reaches the user-defined threshold, the system will actively interrupt the thinking process and insert an instruction "Considering the user's limited time, I will directly give a solution based on the current thinking", and then the model will generate a final response based on the existing thinking. This ability is not explicitly trained, but a natural result of model fusion learning.

4. General Reinforcement Learning (General RL)

The final stage aims to comprehensively improve the model's capabilities and stability in different scenarios.

  • Comprehensive Reward System:  A complex reward system covering more than 20 different tasks is built, each with customized scoring criteria to enhance the model’s:
    • Ability to follow instructions:  Accurately understand and execute user instructions (content, format, length, structured output).
    • Format compliance:  Strictly follow the specified format, such as responding in the mode indicated by the /think and /no_think flags, and using <think> and </think> tags to separate the thinking content from the response.
    • Preference alignment:  Improve the helpfulness, engagement, and style of models for open-ended queries, providing a more natural and satisfying user experience.
    • Agent capabilities:  The training model correctly calls tools through specified interfaces, and improves performance and stability in long-term decision-making tasks through interactive feedback with the real environment.
    • Professional scenario capabilities:  For example, in the RAG (retrieval augmentation generation) task, the introduction of reward signals guides the model to generate accurate and contextually appropriate responses and reduce hallucinations.
  • Various reward types:
    • Rule-based rewards:  Suitable for tasks such as reasoning and format compliance, where the correctness of the output can be evaluated with high precision (a minimal sketch follows after this list).
    • Model-based reward with reference answer:  Provides a reference answer for each query and uses Qwen2.5-72B-Instruct to score the model's response, flexibly handling diverse tasks.
    • Model-based rewards without reference answers:  Use human preference data to train a reward model that scores responses, handling a broader range of queries and making the model more engaging and helpful.
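Here is a hypothetical example of what a rule-based reward might look like for a verifiable math-style task: it checks that thinking is properly delimited and that the final answer matches a reference. The scoring weights and regex are invented for illustration and are not from the paper.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Score format compliance plus final-answer correctness with simple rules."""
    reward = 0.0
    # Thinking content must be wrapped in exactly one <think>...</think> block.
    if response.count("<think>") == 1 and response.count("</think>") == 1:
        reward += 0.2
    # Check the final answer after the thinking block against the reference.
    final = response.split("</think>")[-1]
    match = re.search(r"(-?\d+(?:\.\d+)?)", final)
    if match and match.group(1) == reference_answer:
        reward += 1.0
    return reward
```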

Strong-to-Weak Distillation

To optimize the post-training process of lightweight models, Qwen3 introduces an innovative Strong-to-Weak Distillation pipeline for 5 dense models and 1 MoE model.

  • Off-policy Distillation:  In the initial stage, the outputs generated by the teacher model in both /think and /no_think modes are combined for response distillation. This helps the lightweight student model build basic reasoning and mode-switching capabilities.
  • On-policy Distillation:  In this stage, the student model generates on-policy sequences for fine-tuning: responses are sampled in /think or /no_think mode, and the student is then fine-tuned by aligning its logits with those of the teacher model (Qwen3-32B or Qwen3-235B-A22B), minimizing the KL divergence (a minimal sketch of this step follows below).
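The logit-alignment step can be sketched as a per-token KL divergence between teacher and student distributions, as below. This is a generic distillation loss written for illustration, assuming both models share the same vocabulary; it is not the exact objective from the report.

```python
import torch
import torch.nn.functional as F

def distillation_kl_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL(teacher || student) over the vocabulary at every generated position.

    student_logits, teacher_logits: (batch, seq, vocab) scored on the same
    on-policy sequences; the teacher would be Qwen3-32B or Qwen3-235B-A22B.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```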

Core advantages:  This distillation method shows great advantages in both performance and training efficiency. Compared with reinforcement learning, it achieves significantly better performance in only about 1/10 of the GPU hours, especially on the Pass@64 metric (the pass rate over 64 sampled attempts). This shows that distilling knowledge from a strong teacher model can more effectively guide the student model's learning and expand its exploration space and reasoning potential.

All-round evaluation: Qwen3's excellent performance

The Qwen3 team conducted a comprehensive and rigorous evaluation of the model's performance, covering both pre-trained models and instruction fine-tuned models. The evaluation not only used widely recognized open benchmarks, but also utilized automated datasets carefully constructed within the team for specific capabilities (such as long context, code, and agent tasks) to ensure the comprehensiveness and fairness of the evaluation.

Excellence in Thinking Mode

Qwen3's performance in thinking mode is particularly striking, showing its strong reasoning ability:

  • Flagship model Qwen3-235B-A22B:  In thinking mode, this model surpasses DeepSeek-R1 on most benchmarks, especially on tasks that require deep reasoning such as mathematics, agent tasks, and programming, showing top reasoning capability among open-source models. It even competes closely with closed-source models such as OpenAI-o1, Grok-3-Beta, and Gemini 2.5 Pro, greatly narrowing the reasoning gap between open-source and closed-source models.
  • Flagship dense model Qwen3-32B:  In thinking mode, Qwen3-32B surpasses the previous strongest reasoning model QwQ-32B on most benchmarks, becoming the new SOTA at the scale of 32 billion parameters. It also competes with the closed-source OpenAI-o3-mini in terms of alignment and multi-language performance.

General Abilities in Non-Thinking Mode

Even in non-thinking mode, the Qwen3 models' general capabilities perform well:

  • Qwen3-235B-A22B:  In non-thinking mode, it surpasses DeepSeek-V3, LLaMA-4-Maverick, and its previous flagship model Qwen2.5-72B-Instruct. More impressively, it even surpasses GPT-4o-2024-11-20 in multiple benchmarks, demonstrating its inherent ability even without initiating complex thinking processes.
  • Qwen3-32B:  In non-thinking mode, the model performs well in almost all benchmarks, and even significantly surpasses Qwen2.5-72B-Instruct in alignment, multi-language, and reasoning-related tasks, demonstrating the comprehensive improvement of the Qwen3 series in basic capabilities.
  • Lightweight models:  including Qwen3-30B-A3B, Qwen3-14B and other smaller dense models, which consistently outperform open-source models of comparable or larger size. This fully validates the "strong-to-weak" distillation strategy, making lightweight models highly competitive as well.

Multilingual capabilities

Qwen3 supports 119 languages in the pre-training phase, and this multilingual capability is fully reflected in the evaluation. The team expanded multiple multilingual benchmarks covering instruction following, knowledge understanding, mathematics, and logical reasoning. Qwen3 makes significant progress on multilingual tasks, and even in some low-resource languages it demonstrates strong understanding and generation capabilities.

Long context handling capability

The Qwen3 model has also been rigorously tested for its ability to handle long contexts:

  • RULER, LV-Eval, and LongBench-Chat:  In these long-context benchmarks, the Qwen3 series models, especially those equipped with YARN and DCA technology, demonstrated strong long text processing capabilities. Qwen2.5-72B-Instruct performed well at all context lengths, significantly outperforming existing open source and closed source models such as GPT-4o-mini and GPT-4.
  • 1M Token Passkey Retrieval:  Qwen2.5-Turbo achieved 100% accuracy in the Passkey Retrieval task of 1 million tokens, demonstrating its excellent ability to extract detailed information from ultra-long contexts.
  • Inference speed optimization:  Through a sparse attention mechanism based on MInference, Qwen2.5-Turbo reduces the computational load of attention by 12.5x when processing a 1M-token sequence and speeds up TTFT (Time To First Token) by 3.2 to 4.3x, greatly improving the user experience of long-context inference.

Thinking budget effectiveness

The paper also verified the effectiveness of the "thinking budget" through experiments. In the benchmark tests in mathematics, code, and STEM fields, Qwen3's performance continued to improve steadily as the budget allocated to thinking increased, indicating that the model can indeed improve its problem-solving ability through more "deep" thinking.

These comprehensive evaluation results undoubtedly demonstrate Qwen3's leading position in the current open source large model field and its great potential in versatility, efficiency and scalability.

Summary

The Qwen3 series of models launched by Alibaba is undoubtedly a milestone in the field of open source large models. It not only continues the excellent performance of the Qwen family, but also redefines our understanding of the boundaries of large model capabilities through a series of groundbreaking technological innovations.

The core highlight of Qwen3 lies in its original thinking mode and non-thinking mode fusion mechanism, supplemented by precise thinking budget control . This design enables the model to flexibly switch between deep reasoning and fast response, effectively solving the differentiated requirements of different tasks on model efficiency and intelligence level, and providing users with unprecedented flexibility.

At the technical level, Qwen3's architecture upgrades (such as QK-Norm in the dense models and the global-batch load balancing loss in the MoE models) ensure robust training and efficient inference. The huge 36-trillion-token corpus and coverage of 119 languages endow Qwen3 with excellent general knowledge and global reach.

What is even more commendable is that, through a sophisticated four-stage post-training process and especially the "strong-to-weak" distillation strategy, Qwen3 not only greatly improves the overall strength of the flagship models, making them perform well across benchmarks and comparable to top closed-source models, but also enables lightweight models to achieve strong performance at very low cost, opening up new possibilities for AI applications on edge devices and in resource-constrained environments.