Qwen3: Think deeply and move quickly

Written by
Clara Bennett
Updated on:June-26th-2025
Recommendation

The Qwen3 series of large language models are newly released with excellent performance and open source sharing!

Core content:
1. The outstanding performance of the Qwen3 series models in benchmark tests
2. Detailed parameters of Qwen3 model parameters and activation parameters
3. Open source information and deployment recommendations of Qwen3 models

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

Today, we are announcing  Qwen3 , the latest member of the Qwen family of large language models. Our flagship model,  Qwen3-235B-A22B  , shows highly competitive results with top models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro ​​in benchmarks such as code, math, and general ability. In addition, the small MoE model Qwen3-30B-A3B has 10% of the number of activation parameters of QwQ-32B, outperforming even small models like Qwen3-4B, which can match the performance of Qwen2.5-72B-Instruct.




We open-sourced the weights of two MoE models: Qwen3-235B-A22B , a large model with more than 235 billion total parameters and more than 22 billion activation parameters, and Qwen3-30B-A3B , a small MoE model with about 30 billion total parameters and 3 billion activation parameters. In addition, six Dense models have also been open-sourced, including Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B, all under the Apache 2.0 license.

Models 

Layers

Heads 

(Q / KV)

Tie Embedding

Context Length

Qwen3-0.6B

28

16 / 8

Yes 

32K

Qwen3-1.7B 

28

16 / 8

Yes 

32K

Qwen3-4B

36

32 / 8 

Yes 

32K

Qwen3-8B 

36

32 / 8 

No

128K

Qwen3-14B 

40

40 / 8 

No

128K

Qwen3-32B

64

64 / 8 

No

128K

Swipe up and down to see more


Models 

Layers

Heads 

(Q / KV)

Experts (Total/Activated)

Context Length

Qwen3-30B-A3B

48

32 / 4

128 / 8 

128K 

Qwen3-235B-A22B

94

64 / 4

128 / 8 

128K 


Post-trained models, such as Qwen3-30B-A3B, and their pre-trained base models (such as Qwen3-30B-A3B-Base) are now open for use on platforms such as Hugging Face, ModelScope, and Kaggle. For deployment, we recommend frameworks such as SGLang and vLLM; for local use, tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers are also highly recommended. These options ensure that users can easily integrate Qwen3 into their workflows, whether for research, development, or production environments.


We believe that the release and open source of Qwen3 will greatly promote the research and development of large-scale basic models. Our goal is to empower researchers, developers, and organizations around the world to help them build innovative solutions using these cutting-edge models.


Welcome to try Qwen3 on the  Qwen Chat  web version (chat.qwen.ai) and Tongyi APP  !




Key highlights

Multiple thinking modes

The Qwen3 model supports two thinking modes:

Thinking mode: In this mode, the model will reason step by step and give a final answer after careful consideration. This method is very suitable for complex problems that require deep thinking.

No-Thinking Mode: In this mode, the model provides fast, near-instant responses and is suitable for simple problems where speed is more important than depth.


This flexibility enables users to control the degree to which the model "thinks" depending on the specific task. For example, complex problems can be solved by extending the reasoning steps, while simple problems can be answered directly and quickly without delay. Crucially, the combination of these two modes greatly enhances the model's ability to achieve stable and efficient "thinking budget" control. As mentioned above, Qwen3 demonstrates scalable and smooth performance improvements, which are directly related to the allocated computational reasoning budget. This design makes it easier for users to configure specific budgets for different tasks, achieving a better balance between cost-effectiveness and reasoning quality.



Multilingual

Qwen3 models support 119 languages ​​and dialects. This broad multilingual capability opens up new possibilities for international applications, allowing users around the world to benefit from the powerful capabilities of these models.


Language 

Languages ​​& Dialects

Indo-European

English, French, Portuguese, German, Romanian, Swedish, Danish, Bulgarian, Russian, Czech, Greek, Ukrainian, Spanish, Dutch, Slovak, Croatian, Polish, Lithuanian, Norwegian (Bokmal), Norwegian Nynorsk, Persian, Slovenian, Gujarati, Latvian, Italian, Occitan, Nepali, Marathi, Belarusian, Serbian, Luxembourgish, Venetian, Assamese, Welsh, Silesian , Asturian, Chhattisgarhi, Awadhi, Maithili, Bhojpuri, Sindhi, Irish, Faroese, Hindi, Punjabi, Bengali, Oriya, Tajik, Eastern Yiddish, Lombard, Ligurian, Sicilian, Friulian, Sardinian, Galician, Catalan, Icelandic, Tosk, Albanian, Limburgish, Romanian, Dari, Afrikaans, Macedonian, Sinhala, Urdu, Magahi, Bosnian, Armenian

Swipe up and down to see more

Sino-Tibetan

Chinese (Simplified, Traditional, Cantonese), Burmese

Afro-Asiatic

Arabic (Standard, Najdi, Levantine, Egyptian, Moroccan, Mesopotamian, Taiz-Adni, Tunisian), Hebrew, Maltese

Austronesian

Indonesian, Malay, Tagalog, Cebuano, Javanese, Sundanese, Minangkabau, Balinese, Banga, Pangasinan, Iloko, Warai (Philippines)

Dravidian

Tamil, Telugu, Kannada, Malayalam

Turkic 

Turkish, Northern Azerbaijani, Northern Uzbek, Kazakh, Bashkir, Tatar

Tai-Kadai

Thai, Lao

Uralic

Finnish, Estonian, Hungarian

Austroasiatic

Vietnamese, Khmer

other

Japanese, Korean, Georgian, Basque, Haitian, Papiamento, Cabvirdianu, Tok Pisin, Swahili




Pre-training

In terms of pre-training, the dataset of Qwen3 has been significantly expanded compared to Qwen2.5. Qwen2.5 was   pre-trained on  18 trillion tokens , while Qwen3 uses almost twice as much data, reaching about 36 trillion tokens , covering  119 languages ​​and dialects . To build this huge dataset, we not only collected data from the Internet, but also extracted information from PDF documents. We used Qwen2.5-VL to extract text from these documents and Qwen2.5 to improve the quality of the extracted content. In order to increase the amount of mathematics and code data, we used Qwen2.5-Math and Qwen2.5-Coder, two expert models in the fields of mathematics and code, to synthesize data in various forms including textbooks, question-answer pairs, and code snippets.


The pre-training process is divided into three stages. In the first stage (S1), the model was pre-trained on more than 30 trillion tokens with a context length of 4K tokens. This stage provides the model with basic language skills and general knowledge. In the second stage (S2), we improved the dataset by increasing the proportion of knowledge-intensive data (such as STEM, programming, and reasoning tasks), and the model was subsequently pre-trained on an additional 5 trillion tokens. In the final stage, we extended the context length to 32K tokens using high-quality long-context data to ensure that the model can effectively handle longer inputs.



Due to the improvement of model architecture, the increase of training data, and more efficient training methods, the overall performance of Qwen3 Dense base models is comparable to Qwen2.5 base models with more parameters. For example, Qwen3-1.7B/4B/8B/14B/32B-Base perform comparable to Qwen2.5-3B/7B/14B/32B/72B-Base, respectively. Especially in areas such as STEM, coding, and reasoning, the performance of Qwen3 Dense base models even exceeds that of the larger Qwen2.5 models. For Qwen3 MoE base models, they achieve similar performance to Qwen2.5 Dense base models while using only 10% of the activation parameters. This results in significant savings in training and reasoning costs.




Post-training


To develop a hybrid model that can simultaneously possess the ability to reason and respond quickly, we implemented a four-stage training process consisting of: (1) long thought chain cold start, (2) long thought chain reinforcement learning, (3) thought mode fusion, and (4) general reinforcement learning.

In the first phase, we fine-tuned the model using a variety of long-term thought chain data, covering a variety of tasks and domains such as mathematics, code, logical reasoning, and STEM problems. This process aims to equip the model with basic reasoning capabilities. The second phase focuses on large-scale reinforcement learning, using rule-based rewards to enhance the model's exploration and exploration capabilities.

In the third stage, we fine-tune the model on a combination of long thought chain data and commonly used instruction fine-tuning data to integrate non-thinking modes into the thinking model. This ensures a seamless combination of reasoning and fast response capabilities. Finally, in the fourth stage, we applied reinforcement learning on more than 20 general domain tasks including instruction following, format following, and agent capabilities to further enhance the general capabilities of the model and correct bad behaviors.




Get started with Qwen3

Here is a simple guide on how to use Qwen3 in different frameworks. First, we provide a standard example of using Qwen3-30B-A3B in Hugging Face transformers:

from  modelscope  import  AutoModelForCausalLM, AutoTokenizer
model_name =  "Qwen/Qwen3-30B-A3B"
# load the tokenizer and the modeltokenizer = AutoTokenizer.from_pretrained(model_name)model = AutoModelForCausalLM.from_pretrained(    model_name,    torch_dtype = "auto" ,    device_map= "auto")
# prepare the model inputprompt =  "Give me a short introduction to large language model."messages = [    { "role""user""content" : prompt}]text = tokenizer.apply_chat_template(    messages,    tokenize = False ,    add_generation_prompt = True ,    enable_thinking= True  # Switch between thinking and non-thinking modes. Default is True.)model_inputs = tokenizer([text], return_tensors= "pt" ).to(model.device)
# conduct text completiongenerated_ids = model.generate(    **model_inputs,    max_new_tokens = 32768)output_ids = generated_ids[ 0 ][ len (model_inputs.input_ids[ 0 ]):].tolist() 
# parsing thinking contenttry :    # rindex finding 151668 (</think>)    index =  len (output_ids) - output_ids[::- 1 ].index( 151668 )except  ValueError:    index =  0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens= True ).strip( "\n" )content = tokenizer.decode(output_ids[index:], skip_special_tokens= True ).strip( "\n" )
print ( "thinking content:" , thinking_content)print ( "content: " , content)

To disable thinking mode, simply modify the parameter enable_thinking as follows:

text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, enable_thinking=False # True is the default value for enable_thinking.)

For deployment, you can use sglang>=0.4.6.post1 or vllm>=0.8.4 to create an API endpoint compatible with the OpenAI API:

SGLang:

    python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B --reasoning-parser qwen3

vLLM:

    vllm serve Qwen/Qwen3-30B-A3B --enable-reasoning --reasoning-parser deepseek_r1

To disable reasoning mode, you can remove the parameter --reasoning-parser (and --enable-reasoning).

For local development, you can use ollama to interact with the model by running a simple command ollama run qwen3:30b-a3b You can also use LMStudio or code libraries such as llama.cpp and ktransformers for local development.




Advanced Usage

We provide a soft switching mechanism that allows users to dynamically control the model's behavior when enable_thinking=True. Specifically, you can add /think and /no_think in user prompts or system messages to switch the model's thinking mode on a turn-by-turn basis. In multi-turn conversations, the model follows the most recent instruction.


The following is an example of a multi-turn conversation:

from  transformers  import  AutoModelForCausalLM, AutoTokenizer
classQwenChatbot:    def  __init__ ( self, model_name= "Qwen3-30B-A3B/Qwen3-30B-A3B" ):        self.tokenizer = AutoTokenizer.from_pretrained(model_name)        self.model = AutoModelForCausalLM.from_pretrained(model_name)        self.history = []
    def  generate_response ( self, user_input ):        messages = self.history + [{ "role""user""content" : user_input}]
        text = self.tokenizer.apply_chat_template(            messages,            tokenize = False ,            add_generation_prompt = True        )
        inputs = self.tokenizer(text, return_tensors= "pt" )        response_ids = self.model.generate(**inputs, max_new_tokens= 32768 )[ 0 ][ len (inputs.input_ids[ 0 ]):].tolist()        response = self.tokenizer.decode(response_ids, skip_special_tokens= True )
        # Update history        self.history.append({ "role""user""content" : user_input})        self.history.append({ "role""assistant""content" : response})
        return  response
# Example Usageif  __name__ ==  "__main__" :    chatbot = QwenChatbot()
    # First input (without /think or /no_think tags, thinking mode is enabled by default)    user_input_1 =  "How many r's in strawberries?"    print ( f"User:  {user_input_1} " )    response_1 = chatbot.generate_response(user_input_1)    print ( f"Bot:  {response_1} " )    print ( "----------------------" )
    # Second input with /no_think    user_input_2 =  "Then, how many r's in blueberries? /no_think"    print ( f"User:  {user_input_2} " )    response_2 = chatbot.generate_response(user_input_2)    print ( f"Bot:  {response_2} "    print ( "----------------------" )
    # Third input with /think    user_input_3 =  "Really? /think"    print ( f"User:  {user_input_3} " )    response_3 = chatbot.generate_response(user_input_3)    print ( f"Bot:  {response_3} " )



Agent Example

Qwen3 performs well in tool calling capabilities. We recommend using Qwen-Agent to give full play to the Agent capabilities of Qwen3. Qwen-Agent encapsulates tool calling templates and tool calling parsers, which greatly reduces code complexity.


To define the available tools, you can use MCP configuration files, use the tools built into Qwen-Agent, or integrate other tools yourself.

from  qwen_agent.agents  import  Assistant
# Define LLMllm_cfg = {    'model''Qwen3-30B-A3B' ,
    # Use the endpoint provided by Alibaba Model Studio:    # 'model_type': 'qwen_dashscope',    # 'api_key': os.getenv('DASHSCOPE_API_KEY'),
    # Use a custom endpoint compatible with OpenAI API:    'model_server''http://localhost:8000/v1' ,   # api_base    'api_key''EMPTY' ,
    # Other parameters:    # 'generate_cfg': {    # # Add: When the response content is `<think>this is the thought</think>this is the answer;    # # Do not add: When the response has been separated by reasoning_content and content.    # 'thought_in_content': True,    # },}
# Define Toolstools = [    { 'mcpServers' : {   # You can specify the MCP configuration file            'time' : {                'command''uvx' ,                'args' : [ 'mcp-server-time''--local-timezone=Asia/Shanghai' ]            },            "fetch" : {                "command""uvx" ,                "args" : [ "mcp-server-fetch" ]            }        }    },  'code_interpreter' ,   # Built-in tools]
# Define Agentbot = Assistant(llm=llm_cfg, function_list=tools)
# Streaming generationmessages = [{ 'role''user''content''https://qwenlm.github.io/blog/ Introduce the latest developments of Qwen' }]for  responses  in  bot.run(messages=messages):    passprint (responses)




Qwen's Friends

Thank you to all friends for your continued support for Qwen! We welcome more new friends to join our community and help us become better!





Future Development

Qwen3 represents an important milestone in our journey towards artificial general intelligence (AGI) and artificial super intelligence (ASI). By scaling up pre-training and reinforcement learning, we have achieved a higher level of intelligence. We seamlessly integrate thinking mode and non-thinking mode, providing users with the ability to flexibly control their thinking budget. In addition, we have expanded support for multiple languages ​​to help more users around the world.


Looking ahead, we plan to improve our models from multiple dimensions. This includes optimizing model architecture and training methods to achieve several key goals: expanding data scale, increasing model size, extending context length, broadening modal range, and using environmental feedback to advance reinforcement learning for long-term reasoning. We believe that we are transitioning from an era focused on training models to an era centered on training agents. Our next iteration will surely bring meaningful progress to everyone's work and life.