Qwen3 is here

Qwen3, the latest generation of the Qwen series, brings major advances in multilingual conversation and reasoning.
Core content:
1. Qwen3's reasoning capabilities and multi-language support
2. Seamless switching between thinking mode and non-thinking mode
3. Quick start guide and code examples
Qwen3 Features
Qwen3 is the latest generation of large language models in the Qwen series, offering a full range of dense and mixture-of-experts (MoE) models. Building on extensive training, Qwen3 has made breakthrough progress in reasoning, instruction following, agent capabilities, and multilingual support. The main features are as follows:
Seamless switching between thinking mode (for complex logical reasoning, mathematics, and programming) and non-thinking mode (for efficient, general conversation) within a single model, ensuring optimal performance across a variety of scenarios.
Significantly improved reasoning: Qwen3 surpasses the previous QwQ in thinking mode and the Qwen2.5 instruct models in non-thinking mode, performing well in mathematics, code generation, and commonsense logical reasoning.
Better alignment with human preferences: the model excels at creative writing, role-playing, multi-turn conversation, and instruction following, enabling a more natural, engaging, and immersive conversational experience.
Strong agent capabilities: the model integrates precisely with external tools in both thinking and non-thinking modes and achieves leading performance among open-source models on complex agent tasks.
Support for more than 100 languages and dialects, with strong multilingual instruction following and translation capabilities.
Model Overview
Qwen3-0.6B has the following features:
Type: Causal Language Model
Training stage: pre-training and post-training
Number of parameters: 0.6B
Number of non-embedding parameters: 0.44B
Number of layers: 28
Number of attention heads (GQA): 16 for Q, 8 for KV
Context length: 32,768
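If you want to confirm these numbers yourself, the architecture fields can be read straight from the hub config. A minimal sketch using the standard transformers config field names:

from transformers import AutoConfig

# Read the architecture straight from the hub config and compare with the
# numbers listed above.
config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
print("layers:", config.num_hidden_layers)
print("Q heads:", config.num_attention_heads)
print("KV heads (GQA):", config.num_key_value_heads)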
Quick Start
Qwen3's code has been integrated into the latest Hugging Face transformers, and we recommend that you use the latest version. If you are using transformers<4.51.0, you will encounter the following error:
KeyError: 'qwen3'
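A quick way to check your installed version before loading the model (a minimal sketch; packaging is already a dependency of transformers):

from packaging.version import Version
import transformers

# Qwen3 support requires transformers >= 4.51.0.
if Version(transformers.__version__) < Version("4.51.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old for Qwen3; "
        "upgrade with `pip install -U transformers`"
    )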
Here is a code snippet showing how to use this model to generate content given an input:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes; the default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Perform text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Parse the thinking content
try:
    # rindex search for token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
For deployment, you can use vllm>=0.8.5 or sglang>=0.4.5.post2 to create an OpenAI-compatible API endpoint:
vLLM: vllm serve Qwen/Qwen3-0.6B --enable-reasoning --reasoning-parser deepseek_r1
SGLang: python -m sglang.launch_server --model-path Qwen/Qwen3-0.6B --reasoning-parser deepseek-r1
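Once the server is running, any OpenAI-compatible client can call it. A minimal sketch with the openai Python package, assuming the default local ports (8000 for vLLM, 30000 for SGLang; adjust base_url to match your deployment):

from openai import OpenAI

# Assumes a local vLLM server on its default port; for SGLang use
# base_url="http://localhost:30000/v1".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)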
Switching between thinking and non-thinking modes
enable_thinking=True
By default, Qwen3 has its thinking capability enabled, similar to QwQ-32B, which means the model uses its reasoning abilities to improve the quality of generated responses. Whether you explicitly set enable_thinking=True or leave it at its default value in tokenizer.apply_chat_template, the model enters thinking mode.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # True is the default value of enable_thinking
)
In this mode, the model generates a block of thinking content wrapped in <think>...</think>, followed by the final response.
enable_thinking=False
We provide a hard switch to strictly disable the model's thinking behavior, making it functionally consistent with the previous Qwen2.5-Instruct model. This mode is particularly useful in scenarios where you need to disable thinking to improve efficiency.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Set enable_thinking=False to disable thinking mode
)
In this mode, the model does not generate any thinking content, and its output does not include a <think>...</think> block.
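If you serve the model behind vLLM or SGLang, the same switch can typically be forwarded through the chat template. A hedged sketch using the chat_template_kwargs pass-through available in recent vLLM versions (verify the exact mechanism for your server and version):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# chat_template_kwargs is forwarded to apply_chat_template on the server,
# so enable_thinking=False disables the <think> block for this request.
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)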
Advanced usage: Switching between thinking and non-thinking modes via user input
We provide a soft-switch mechanism that lets users dynamically control the model's behavior when enable_thinking=True. Specifically, you can add /think and /no_think to user prompts to switch the model's thinking mode in each round of a multi-turn conversation. The model follows the most recent instruction.
The following is an example of a multi-turn conversation:
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []

    def generate_response(self, user_input):
        messages = self.history + [{"role": "user", "content": user_input}]
        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )
        inputs = self.tokenizer(text, return_tensors="pt")
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update the conversation history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response

# Example usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (no /think or /no_think tag; thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input, with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input, with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")