Qwen3 is here

Written by
Jasper Cole
Updated on: June 29th, 2025
Recommendation

Qwen3, the latest breakthrough in the Qwen series, brings revolutionary multi-language conversation and reasoning capabilities.

Core content:
1. Qwen3's reasoning capabilities and multi-language support
2. Seamless switching between thinking mode and non-thinking mode
3. Quick start guide and code examples


Qwen3 Features

Qwen3 is the latest generation of large language models in the Qwen series, offering a range of dense and mixture-of-experts (MoE) models. Building on extensive training, Qwen3 has made breakthrough progress in reasoning ability, instruction following, agent capabilities, and multi-language support. The main features are as follows:

  • Supports seamless switching between thinking mode (for complex logical reasoning, mathematics, and programming) and non-thinking mode (for efficient, general conversations) within a single model, ensuring optimal performance across a variety of scenarios.
  • Significantly improved reasoning ability, surpassing the previous QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode), with strong performance in mathematics, code generation, and common-sense logical reasoning.
  • Better alignment with human preferences, excelling in creative writing, role-playing, multi-turn conversations, and instruction following, enabling a more natural, engaging, and immersive conversational experience.
  • Powerful agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes and achieving leading performance among open-source models on complex agent-based tasks.
  • Supports more than 100 languages and dialects, with strong multi-language instruction-following and translation capabilities.

Model Overview

Qwen3-0.6B has the following features:

  • Type: Causal Language Model
  • Training phase: pre-training and post-training
  • Number of parameters: 0.6B
  • Number of non-embedding parameters: 0.44B
  • Number of layers: 28
  • Number of attention heads (GQA): Q is 16, KV is 8
  • Context length: 32,768
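
For reference, these configuration values can be checked programmatically before downloading the full weights. The snippet below is a minimal sketch using transformers' AutoConfig; the attribute names (num_hidden_layers, num_attention_heads, num_key_value_heads, max_position_embeddings) follow the standard Qwen-style configuration in transformers and are assumptions on my part, not values quoted from this article.

from transformers import AutoConfig

# Load only the model configuration (only the config file is fetched, no weights)
config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")

print(config.num_hidden_layers)        # number of layers, expected: 28
print(config.num_attention_heads)      # Q heads, expected: 16
print(config.num_key_value_heads)      # KV heads (GQA), expected: 8
print(config.max_position_embeddings)  # context length, expected: 32768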


Get started quickly

Qwen3's code has been integrated into the latest Hugging Face transformers, and we recommend that you use the latest version of transformers.

If you are using transformers<4.51.0, you will encounter the following error:

KeyError: 'qwen3'
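
As a quick sanity check, you can print the installed version and upgrade if needed (a minimal sketch; the 4.51.0 threshold is the one stated above):

import transformers

print(transformers.__version__)  # Qwen3 requires transformers >= 4.51.0
# If the version is too old, upgrade with: pip install -U transformers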

Here is a code snippet showing how to use this model to generate content given an input:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switch between thinking and non-thinking mode; the default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Perform text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse the thinking content
try:
    # rindex search for token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

For deployment, you can use vllm>=0.8.5 or sglang>=0.4.5.post2 to create an OpenAI-compatible API endpoint:

  • vLLM:
    vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
  • SGLang:
    python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser deepseek-r1
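
Once either server is running, it can be queried with any OpenAI-compatible client. The sketch below assumes the server is reachable at http://localhost:8000/v1 (the vLLM default) with a placeholder API key; adjust both for your deployment:

from openai import OpenAI

# The base_url and api_key below are assumptions for a local deployment
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language model."}],
)
print(response.choices[0].message.content)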

Switching between thinking and non-thinking modes

enable_thinking=True

By default, Qwen3 has its thinking capability enabled, similar to QwQ-32B. This means the model will use its reasoning capabilities to improve the quality of generated responses. For example, when you explicitly set enable_thinking=True or leave it at its default value in tokenizer.apply_chat_template, the model enters thinking mode.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value of enable_thinking
)

In this mode, the model generates a block of reasoning wrapped in <think>...</think> tags, followed by the final response.

enable_thinking=False

We provide a hard switch to strictly disable the model's thinking behavior, making it functionally consistent with the previous Qwen2.5-Instruct model. This mode is particularly useful in scenarios where you need to disable thinking to improve efficiency.

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Set enable_thinking=False to disable thinking mode
)

In this mode, the model does not generate any thinking content, and its output does not include a <think>...</think> block.

Advanced usage: Switching between thinking and non-thinking modes via user input

We provide a soft switch mechanism that allows users to dynamically control the model's behavior when enable_thinking=True. Specifically, you can add /think or /no_think to user prompts to switch the model's thinking mode between rounds of a multi-turn conversation. The model will follow the most recent instruction.

The following is an example of a multi-turn conversation:

from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (no /think or /no_think tag used; thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input, using /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input, using /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")