Qwen3 series models released: deep thinking, fast responses

Written by
Iris Vance
Updated on: June 26, 2025
Recommendation

Explore the Qwen3 series models and experience their powerful multilingual capabilities and hybrid thinking modes.

Core content:
1. Overview of the Qwen3 model family, covering both dense and mixture-of-experts (MoE) architectures
2. Hybrid thinking mode, giving flexible control over how much the model reasons
3. The flagship model Qwen3-235B-A22B's strong performance across multiple benchmarks

Recommended by Yang Fangxian, founder of 53A and Tencent Cloud Most Valuable Expert (TVP)


Qwen3

Overview

  1. Two architecture families: dense models (0.6B/1.7B/4B/8B/14B/32B) and mixture-of-experts (MoE) models (30B-A3B/235B-A22B)
  2. Hybrid thinking mode: supports both a thinking mode and a non-thinking mode, letting users control how deeply the model reasons for a given task
  3. Multilingual: 119 languages and dialects
  4. Enhanced agent capabilities: improved agent and coding abilities in Qwen3, with strengthened support for MCP
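The hybrid design is visible in the raw output: in thinking mode, Qwen3 wraps its reasoning in `<think>...</think>` tags before the final answer, while in non-thinking mode the block is empty or absent. A minimal sketch of separating the two parts (the helper name `split_thinking` is ours, not part of any official API):

```python
import re

def split_thinking(response: str) -> tuple[str, str]:
    """Split a Qwen3 response into (reasoning, answer).

    In thinking mode Qwen3 emits its reasoning inside <think>...</think>
    before the final answer; in non-thinking mode the tags are empty or
    missing, so everything is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if match:
        reasoning = match.group(1).strip()
        answer = response[match.end():].strip()
    else:
        reasoning, answer = "", response.strip()
    return reasoning, answer

# Illustrative output shape, not an actual model response
demo = "<think>2+2 is basic arithmetic.</think>\n\n2 + 2 = 4"
reasoning, answer = split_thinking(demo)
```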

Architecture

Dense Model

| Models | Layers | Heads (Q / KV) | Tie Embedding | Context Length |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-1.7B | 28 | 16 / 8 | Yes | 32K |
| Qwen3-4B | 36 | 32 / 8 | Yes | 32K |
| Qwen3-8B | 36 | 32 / 8 | No | 128K |
| Qwen3-14B | 40 | 40 / 8 | No | 128K |
| Qwen3-32B | 64 | 64 / 8 | No | 128K |

MoE Model

| Models | Layers | Heads (Q / KV) | # Experts (Total / Activated) | Context Length |
| --- | --- | --- | --- | --- |
| Qwen3-30B-A3B | 48 | 32 / 4 | 128 / 8 | 128K |
| Qwen3-235B-A22B | 94 | 64 / 4 | 128 / 8 | 128K |
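The model names encode the MoE sparsity: "30B-A3B" means roughly 30B total parameters with about 3B activated per token, because each token is routed through only 8 of the 128 experts per MoE layer. A quick sketch of that arithmetic (the expert counts come from the table above; treating activation as a simple expert ratio is an illustrative simplification, since shared attention and embedding parameters are always active):

```python
def active_fraction(total_experts: int, active_experts: int) -> float:
    """Fraction of experts a single token is routed through."""
    return active_experts / total_experts

# Both Qwen3 MoE models route each token to 8 of 128 experts
frac = active_fraction(128, 8)  # 0.0625, i.e. 1/16 of the experts
```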

Benchmarks

According to the official benchmarks, the flagship model Qwen3-235B-A22B delivers highly competitive results against DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro on tests of code, math, and general ability. In addition, the small MoE model Qwen3-30B-A3B outperforms QwQ-32B while activating only about 10% as many parameters, and even a small model like Qwen3-4B can match the performance of Qwen2.5-72B-Instruct.

Training

Qwen3 is pre-trained on about 36 trillion tokens, in three stages:

  1. The model is first pre-trained on more than 30 trillion tokens with a 4K-token context length, building basic language skills and general knowledge
  2. The dataset is then improved by raising the proportion of knowledge-intensive data (such as STEM, programming, and reasoning tasks), and the model is pre-trained on an additional 5 trillion tokens
  3. Finally, high-quality long-context data is used to extend the context length to 32K tokens

Post-training is divided into four steps:

  1. Long chain-of-thought cold start: the model is fine-tuned on diverse long chain-of-thought data covering tasks and domains such as mathematics, coding, logical reasoning, and STEM problems, equipping it with basic reasoning capabilities
  2. Reasoning reinforcement learning: large-scale reinforcement learning with rule-based rewards to enhance the model's exploration and exploitation capabilities
  3. Thinking-mode fusion: non-thinking capabilities are integrated into the thinking model
  4. General reinforcement learning: reinforcement learning is applied to more than 20 general-domain tasks to further strengthen the model's general capabilities and correct undesirable behaviors

Using Qwen3

Upgrading Ollama

The Qwen3 models require Ollama v0.6.6 or higher. First, upgrade Ollama on Linux to v0.6.6:

wget https://github.com/ollama/ollama/releases/download/v0.6.6/ollama-linux-amd64.tgz
sudo systemctl stop ollama
sudo rm -rf /usr/lib/ollama
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
sudo systemctl start ollama

After the upgrade, pull the qwen3:8b model (about 5.2 GB):

ollama pull qwen3:8b
pulling manifest
pulling a3de86cd1c13: 100% ▕██████████████▏ 5.2 GB
pulling eb4402837c78: 100%
pulling d18a5cc71b84: 100%
pulling cff3f395ef37: 100%
pulling 05a61d37b084: 100%
verifying sha256 digest
writing manifest
success
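Besides chat front-ends like LobeChat, the pulled model can be queried directly through Ollama's local REST API. The sketch below assumes Ollama is running on its default port (11434) and uses Qwen3's soft switch: appending `/no_think` to a prompt disables thinking mode for that turn (the helper names are ours):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default port

def build_payload(prompt: str, think: bool = True) -> dict:
    """Build a request body for Ollama's /api/generate endpoint.

    Qwen3 honours soft switches embedded in the prompt: appending
    /no_think disables the thinking mode for this turn.
    """
    if not think:
        prompt = prompt.rstrip() + " /no_think"
    return {"model": "qwen3:8b", "prompt": prompt, "stream": False}

def generate(prompt: str, think: bool = True) -> str:
    """Send the prompt to a locally running Ollama server."""
    data = json.dumps(build_payload(prompt, think)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

For example, `generate("Classify this headline: ...", think=False)` would return a fast answer without the reasoning preamble, while the default `think=True` lets the model deliberate first.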

After configuring the model, try qwen3:8b in LobeChat to see it in action:

In one content-classification test, DeepSeek-R1:14B and qwen3:8b performed on par; in another, DeepSeek-R1:14B came out ahead of qwen3:8b.

Overall, each model has its own strengths; choose one based on your own test results.