Qwen3 coming soon!

The Qwen3 series of models is expected to bring significant performance improvements and new technical features.
Core content:
1. The new features of the Qwen3 models and how Qwen3MoE differs from traditional MoE
2. A technical comparison of Qwen3MoE with Qwen2.5 and the advantages of Qwen3MoE
3. The performance and application prospects of Qwen3MoE among small-parameter models
From a huggingface/transformers pull request, you can see Qwen3 and Qwen3MoE.
Original PR: https://github.com/huggingface/transformers/pull/36878
Browsing the code, the models added in this update are:
https://huggingface.co/Qwen/Qwen3-15B-A2B (MoE model)
https://huggingface.co/Qwen/Qwen3-8B-beta
Qwen/Qwen3-0.6B-Base
It seems this update covers small-parameter models. I am looking forward to a 30-40B MoE.
Differences from traditional MoE
Key differences:
Routing strategy: traditional MoE uses global routing, i.e. all experts participate in the computation; Qwen3MoE uses sparse routing, where only the Top-K experts participate (see the sketch after this list).
Load balancing: traditional MoE has no explicit optimization and is prone to expert collapse; Qwen3MoE adds a load_balancing_loss that penalizes imbalanced expert usage.
Computational complexity: traditional MoE costs O(N×E), where N is the sequence length and E is the number of experts; Qwen3MoE costs O(N×K), where K is the Top-K parameter.
Dynamic adaptability: traditional MoE uses a fixed-frequency RoPE; Qwen3MoE can adjust the RoPE frequency dynamically (dynamic RoPE scaling).
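To make the routing and load-balancing points concrete, here is a minimal PyTorch sketch of Top-K sparse routing with an auxiliary load-balancing loss. It is illustrative only, not the actual Qwen3MoE implementation; the layer sizes, expert counts, and the exact loss formula are all assumptions.

```python
# Minimal sketch of Top-K sparse routing with an auxiliary load-balancing loss.
# Illustrative only -- not the actual Qwen3MoE code; shapes and names are assumptions.
import torch
import torch.nn as nn

class SparseTopKMoE(nn.Module):
    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden) -> flatten tokens for routing
        b, s, h = x.shape
        tokens = x.reshape(-1, h)                                  # (N, hidden)
        logits = self.gate(tokens)                                 # (N, num_experts)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)     # only K experts per token
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                                 # tokens routed to expert e
            if not mask.any():
                continue
            token_ids, slot = mask.nonzero(as_tuple=True)
            out[token_ids] += topk_probs[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])

        # Load-balancing loss: pushes the average routing probability and the
        # actual assignment fraction toward a uniform distribution over experts.
        density = torch.zeros(self.num_experts, device=x.device)
        density.scatter_add_(0, topk_idx.reshape(-1),
                             torch.ones_like(topk_idx.reshape(-1), dtype=x.dtype))
        density = density / topk_idx.numel()        # fraction of assignments per expert
        avg_probs = probs.mean(dim=0)               # average router probability per expert
        lb_loss = self.num_experts * (density * avg_probs).sum()
        return out.reshape(b, s, h), lb_loss

moe = SparseTopKMoE()
y, aux = moe(torch.randn(2, 16, 64))
print(y.shape, aux.item())
```

Because only K experts run per token, the per-token cost scales with K rather than with the total number of experts, which is the O(N×K) versus O(N×E) difference noted above.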
Comparison with Qwen2.5
Key differences:
RoPE type: Qwen2.5 only supports static RoPE; Qwen3MoE supports multiple types such as dynamic, yarn, and llama3.
Sparse layer scheduling: Qwen2.5 has no explicit support; Qwen3MoE allows flexible control through mlp_only_layers and sparse_step (see the config sketch after this list).
Attention backend: Qwen2.5 has only the basic implementation; Qwen3MoE integrates Flash Attention 2 and SDPA acceleration.
Cache management during generation: Qwen2.5 uses a traditional KV cache; Qwen3MoE supports a sliding-window cache (sliding_window).
MoE implementation: Qwen2.5 does not use MoE; Qwen3MoE implements sparse MoE plus a load-balancing loss.
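If you want to check these settings yourself once the checkpoints are up, something like the snippet below should work. The repo id and the field names (num_experts, num_experts_per_tok, decoder_sparse_step, mlp_only_layers, rope_scaling, sliding_window) are assumptions based on the PR and the existing Qwen2MoE-style config in transformers, so the snippet falls back gracefully if a name differs.

```python
from transformers import AutoConfig

# Hypothetical check -- repo id and field names are assumptions based on the PR
# and the Qwen2MoE-style config; getattr falls back if a name does not exist.
config = AutoConfig.from_pretrained("Qwen/Qwen3-15B-A2B")
for name in ("num_experts", "num_experts_per_tok", "decoder_sparse_step",
             "mlp_only_layers", "rope_scaling", "sliding_window"):
    print(name, "=", getattr(config, name, "<not present in this config>"))
```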
Advantages of Qwen3MoE
Key features:
Dynamic RoPE: supports multiple scaling strategies to adapt to long sequences and different hardware.
Sparse MoE: improves model capacity and training stability through Top-K routing and a load-balancing loss.
Efficient attention: integrates Flash Attention 2 and SDPA to speed up generation (see the loading example after this list).
Modular design: inherits and extends Llama/Mistral components, improving code maintainability.
Generation optimization: sliding-window cache and dynamic KV updates reduce decoding memory usage.
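As a usage sketch, loading the model with the Flash Attention 2 backend (or SDPA) goes through the standard transformers API. The repo id here is taken from the links above and may change at release time.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id comes from the links above and may change once the models are released.
model_id = "Qwen/Qwen3-15B-A2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa" if flash-attn is not installed
    device_map="auto",
)

inputs = tokenizer("Qwen3 is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```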
Summary
At present, among small-parameter models, my overall personal experience is that Qwen models are the first choice, especially the Qwen3-15B-A2B from this update: a sparse MoE model with 15B total parameters but only about 2B activated parameters per token, so it needs fewer hardware resources and can run faster.