Today, the Tongyi Qianwen (Qwen) team officially launched Qwen3, the latest generation of the Qwen series of large language models. The Qwen3 series offers dual-mode reasoning (deep thinking / fast response), supports 119 languages and dialects, and strengthens agent and code-execution capabilities, covering both complex problem solving and global applications.
Among them, the flagship model Qwen3-235B-A22B delivers highly competitive results against top models such as DeepSeek-R1, o1, o3-mini, Grok-3, and Gemini-2.5-Pro on benchmarks covering code, mathematics, and general capabilities. In addition, the small MoE model Qwen3-30B-A3B uses only about 10% of the activated parameters of QwQ-32B yet performs better, and even a small model like Qwen3-4B can match the performance of Qwen2.5-72B-Instruct.
This release open-sources the weights of two MoE models: Qwen3-235B-A22B, a large model with 235 billion total parameters and 22 billion activated parameters, and Qwen3-30B-A3B, a small MoE model with about 30 billion total parameters and 3 billion activated parameters. Six dense models have also been open-sourced: Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B, all released under the Apache 2.0 license.
The Qwen3 model supports two thinking modes:
Thinking mode: In this mode, the model reasons step by step and gives the final answer after careful deliberation. It is suited to complex problems that require in-depth thinking.
Non-thinking mode: In this mode, the model provides fast, near-instant responses, suited to simple problems where speed matters more than depth.
Multilingual
The Qwen3 models support 119 languages and dialects, spanning the Indo-European, Sino-Tibetan, Afro-Asiatic, Austronesian, Dravidian, Turkic, Tai-Kadai, Uralic, and Austroasiatic language families, among others. This broad multilingual capability opens up new possibilities for international applications, allowing users around the world to benefit from these models.
Enhanced Agent Capabilities
The agent and code capabilities of the Qwen3 models have been optimized, and support for MCP has also been enhanced (a practical tutorial on using the Qwen3 series together with MCP is included later in this article).
Transformers
Using Qwen3-30B-A3B in transformers:
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switch between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
To disable thinking mode, just modify the parameter enable_thinking as follows:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # True is the default value for enable_thinking.
)
Multi-tool deployment
Developers can use sglang>=0.4.6.post1 or vllm>=0.8.4 to create an API endpoint compatible with the OpenAI API:
SGLang:
SGLANG_USE_MODELSCOPE=1 python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3
vLLM:
VLLM_USE_MODELSCOPE=1 vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1
To disable reasoning mode, launch the server without the --reasoning-parser (and --enable-reasoning) flags.
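Once a server is running, any OpenAI-compatible client can talk to it. The snippet below is a minimal sketch assuming vLLM's default local port (8000; SGLang uses a different default port, so adjust base_url accordingly):

from openai import OpenAI

# assumes a local server started with one of the commands above; adjust the port and model if needed
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(response.choices[0].message.content)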
For local development, you can use Ollama to interact with the model with a single command: ollama run qwen3:30b-a3b. You can also use LM Studio, llama.cpp, or ktransformers for local development.
Ollama:
ollama run modelscope.cn/unsloth/Qwen3-8B-GGUF
Ollama is in thinking mode by default. If you need to switch to non-thinking mode, add /no_think after the prompt. In addition, please make sure to upgrade Ollama to the latest version (v0.6.6 or above).
The API-Inference service of the ModelScope platform also provides day-one support for the Qwen3 series models, so ModelScope users can call them directly through the API. For specific usage instructions, refer to each model page (for example, https://www.modelscope.cn/models/Qwen/Qwen3-32B ):
Or refer to the API-Inference documentation: https://www.modelscope.cn/docs/model-service/API-Inference/intro . Note that the Qwen3 series models can switch freely between thinking and normal modes via the extra_body parameter: thinking is controlled by enable_thinking (enabled by default), and when thinking mode is on, the thinking_budget parameter can additionally limit the length of the thinking process (it is generally recommended not to set thinking_budget too small; 4096 tokens or more is appropriate).
Calling example:
from openai import OpenAI

client = OpenAI(
    base_url='https://api-inference.modelscope.cn/v1/',
    api_key='MODELSCOPE_SDK_TOKEN',  # ModelScope Token
)

# set extra_body for thinking control
extra_body = {
    # enable thinking, set to False to disable
    "enable_thinking": True,
    # use thinking_budget to control the number of tokens used for thinking
    # "thinking_budget": 4096
}

response = client.chat.completions.create(
    model='Qwen/Qwen3-32B',  # ModelScope Model-Id
    messages=[
        {
            'role': 'user',
            'content': 'Which is bigger, 9.9 or 9.11?'
        }
    ],
    stream=True,
    extra_body=extra_body
)

done_thinking = False
for chunk in response:
    thinking_chunk = chunk.choices[0].delta.reasoning_content
    answer_chunk = chunk.choices[0].delta.content
    if thinking_chunk != '':
        print(thinking_chunk, end='', flush=True)
    elif answer_chunk != '':
        if not done_thinking:
            print('\n\n === Final Answer ===\n')
            done_thinking = True
        print(answer_chunk, end='', flush=True)
Output:
Well, I am going to compare which is bigger, 9.9 or 9.11. First, I should understand the structure of these two numbers. 9.9 is a decimal, and 9.11 also looks like a decimal, but some people may misunderstand it as a date or something else? However, the question clearly says that it is to compare the size of values, so it should be handled according to mathematical decimals. First, I need to confirm the decimal point position of the two numbers. 9.9 is nine and nine tenths, that is, 9 + 0.9 = 9.9. And 9.11 is nine and eleven hundredths, that is, 9 + 0.11 = 9.11. At this time, I need to compare the size of these two decimals. The usual way to compare decimals is to compare them digit by digit from left to right until you find different numbers. First compare the integer part: both numbers are 9, so this part is equal. Next compare the decimal part. The first number is 0.9 and the second number is 0.11. At this time, you need to pay attention to the number of decimal places. 0.9 can be regarded as 0.90, because the zero at the end of the decimal does not change the size of the value. The two decimals become 0.90 and 0.11. Now, if we compare 0.90 and 0.11, the first decimal places are 9 and 1. Obviously, 9 is greater than 1, so 0.90 is greater than 0.11, so the whole number 9.9 (i.e. 9.90) is greater than 9.11.

=== Final Answer ===

The comparison between 9.9 and 9.11 can be done by the following steps:

1. **Integer part**: Both are 9, equal.
2. **Decimal part**:
   - 9.9 can be written as **9.90** (padding zeros does not change the value).
   - Compare the tenths place: the tenths place of 9.90 is **9**, and the tenths place of 9.11 is **1**.
   - Since $9 > 1$, **9.90 > 9.11**.

**Conclusion**:
$$\boxed{9.9 > 9.11}$$
Thanks to the Alibaba Cloud Bailian platform for providing computing power support.
The Qwen3 series provides better support for agents and tool calls, and can accurately integrate external tools in both thinking and non-thinking modes. We also integrated multiple Qwen3 models into the ModelScope MCP Playground right away, so that Qwen3's tool-calling capabilities can be experienced in MCP scenarios.
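As a rough illustration of what a tool call looks like at the API level, the sketch below sends a standard OpenAI-style tools definition through the OpenAI-compatible interface used earlier; the get_weather function and its schema are hypothetical, and whether tool calls are enabled for a particular model or endpoint should be verified against its documentation:

from openai import OpenAI

client = OpenAI(
    base_url='https://api-inference.modelscope.cn/v1/',
    api_key='MODELSCOPE_SDK_TOKEN',  # ModelScope Token
)

# a hypothetical tool described in the standard OpenAI function-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model='Qwen/Qwen3-32B',
    messages=[{'role': 'user', 'content': 'What is the weather like in Hangzhou right now?'}],
    tools=tools,
)
# if the model decides to use the tool, the structured call shows up here
print(response.choices[0].message.tool_calls)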
As an example, we let Qwen3 help us with the preparations before a trip:
Below we introduce how to use ms-swift to perform SFT/GRPO on Qwen/Qwen3-8B and how to use Megatron-SWIFT to perform SFT on Qwen/Qwen3-30B-A3B. ms-swift is the training and deployment framework for large language models and multi-modal large models officially provided by the ModelScope community.
ms-swift open source address:
https://github.com/modelscope/ms-swift
We will show a runnable fine-tuning demo and give the format of a custom dataset.
Before you start fine-tuning, make sure your environment is ready.
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install liger-kernel transformers -U
SFT
The script for training Qwen3-8B is as follows, which can be run in the free GPU Notebook provided by ModelScope:
# GPU memory for training: 22GB
# You can specify `--dataset AI-ModelScope/alpaca-gpt4-data-zh` to run the experiment
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --packing true \
    --use_liger_kernel true
The format of the custom dataset is as follows (the system field is optional), just specify `--dataset <dataset_path>`:
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
GRPO
Taking Qwen3-8B as an example, the following uses the ms-swift framework to perform GRPO training.
We use AI-MO/NuminaMath-TIR as the dataset and an accuracy function to compute the accuracy reward for the model's answers. To compute this reward, install the following dependency:
pip install math_verify==0.5.2
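math_verify is the library the accuracy reward relies on to decide whether a generated answer matches the reference. A quick standalone check, assuming the library's top-level parse and verify helpers:

from math_verify import parse, verify

# parse turns a LaTeX/plain-text answer into a comparable object
gold = parse("${1,3} \\cup {2,4}$")
answer = parse("${1,2,3,4}$")

# verify returns True when the two parsed expressions are judged mathematically equivalent
print(verify(gold, answer))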
The custom dataset format is similar to SFT, except that the assistant turn is not required. If the accuracy reward is used, a solution column is required so that accuracy can be computed.
# llm{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}{"messages": [{"role": "user", "content": "What is your name?"}]}# mllm{"messages": [{"role": "user", "content": "<image>What is the difference between the two images?"}], "images": ["/xxx/x.jpg"]}{"messages": [{"role": "user", "content": "<image><image>What is the difference between the two images?"}], "images": ["/xxx/y.jpg", "/xxx/z.png"]}
You can also use a custom reward function or reward model for training; any extra columns in the dataset are passed to the reward function via **kwargs. For an example of a custom reward function, see swift/examples/train/grpo/plugin/plugin.py:
--external_plugins examples/train/grpo/plugin/plugin.py \
--reward_funcs external_math_acc external_math_format \
--reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2
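A custom reward function generally follows the plugin pattern referenced above. The sketch below is only a guess at the general shape (a toy length-based reward registered under a hypothetical name); check plugin.py in the ms-swift repository for the actual ORM interface and registration mechanism:

from typing import List

# assumed import path; see swift/examples/train/grpo/plugin/plugin.py for the real one
from swift.plugin import ORM, orms


class DummyLengthReward(ORM):
    """Toy reward: prefer completions shorter than 512 characters."""

    def __call__(self, completions: List[str], **kwargs) -> List[float]:
        # extra dataset columns (e.g. `solution`) arrive through **kwargs
        return [1.0 if len(c) < 512 else 0.0 for c in completions]


# register under a hypothetical name, then reference it via --reward_funcs dummy_length
orms['dummy_length'] = DummyLengthReward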
During training, we use vLLM to speed up the sampling process. With num_infer_workers=8, one vLLM engine is deployed per device to accelerate sampling.
Training script
# 70G * 8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset AI-MO/NuminaMath-TIR \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --gradient_accumulation_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 4096 \
    --vllm_max_model_len 8192 \
    --reward_funcs accuracy \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.4 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3 \
    --num_infer_workers 8 \
    --tensor_parallel_size 1 \
    --temperature 1.0 \
    --top_p 0.85 \
    --report_to wandb \
    --log_completions true \
    --overlong_filter true
MoE Training (Megatron-SWIFT)
ms-swift introduces Megatron's parallelism techniques to accelerate large-model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the DeepSeek-R1 distilled series.
For environment preparation (container image) and conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation, which is not elaborated here: https://swift.readthedocs.io/zh-cn/latest/Instruction/Megatron-SWIFT%E8%AE%AD%E7%BB%83.html
We use DLC to launch the training command. The training environment is two nodes, each with 8 x 80GiB A800 GPUs:
# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Please make sure that the weight saving path of the two nodes is the same
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
    --load Qwen3-30B-A3B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 100 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-30B-A3B-Base \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --use_flash_attn true
Training loss graph (part):
The custom dataset format is the same as for `swift sft` (see above); just specify `--dataset <dataset_path>`.
The comparison of the full parameter training speed/memory usage of the Qwen3-30B-A3B model using `megatron sft` and `swift sft` is as follows: