Qwen3 coming soon!

The Qwen3 series of models is expected to bring significant performance improvements and new technical features.
Core content:
1. The new features of the Qwen3 models and how Qwen3MoE differs from traditional MoE
2. A technical comparison of Qwen3MoE with Qwen2.5 and the advantages of Qwen3MoE
3. The performance and application prospects of Qwen3MoE among small-parameter models
From a huggingface/transformers pull request, you can see Qwen3 and Qwen3MoE.
Original PR: https://github.com/huggingface/transformers/pull/36878
Browsing the code, the models added in this update are:
https://huggingface.co/Qwen/Qwen3-15B-A2B (MoE model)
https://huggingface.co/Qwen/Qwen3-8B-beta
Qwen/Qwen3-0.6B-Base
It seems this update covers small-parameter models. I am looking forward to a 30-40B MoE.
Differences from traditional MoE
Key differences:
Routing strategy: traditional MoE uses global routing, i.e. all experts participate in the computation; Qwen3MoE uses sparse routing, where only the Top-K experts participate (see the sketch after this list).
Load balancing: traditional MoE has no explicit optimization and is prone to expert collapse; Qwen3MoE adds a load_balancing_loss that penalizes imbalanced expert usage.
Computational complexity: traditional MoE costs O(N×E), where N is the sequence length and E is the number of experts; Qwen3MoE costs O(N×K), where K is the Top-K parameter.
Dynamic adaptability: traditional MoE uses a fixed-frequency RoPE; Qwen3MoE can adjust the RoPE frequency dynamically (dynamic RoPE scaling).
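To make the routing and load-balancing points concrete, here is a minimal PyTorch sketch of Top-K sparse routing with an auxiliary load-balancing loss. It is illustrative only, not the actual Qwen3MoE implementation; the layer sizes, expert counts, and the exact loss formula are all assumptions.

```python
# Minimal sketch of Top-K sparse routing with an auxiliary load-balancing loss.
# Illustrative only -- not the actual Qwen3MoE code; shapes and names are assumptions.
import torch
import torch.nn as nn

class SparseTopKMoE(nn.Module):
    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq_len, hidden) -> flatten tokens for routing
        b, s, h = x.shape
        tokens = x.reshape(-1, h)                                  # (N, hidden)
        logits = self.gate(tokens)                                 # (N, num_experts)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)     # only K experts per token
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e)                                 # tokens routed to expert e
            if not mask.any():
                continue
            token_ids, slot = mask.nonzero(as_tuple=True)
            out[token_ids] += topk_probs[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])

        # Load-balancing loss: pushes the average routing probability and the
        # actual assignment fraction toward a uniform distribution over experts.
        density = torch.zeros(self.num_experts, device=x.device)
        density.scatter_add_(0, topk_idx.reshape(-1),
                             torch.ones_like(topk_idx.reshape(-1), dtype=x.dtype))
        density = density / topk_idx.numel()        # fraction of assignments per expert
        avg_probs = probs.mean(dim=0)               # average router probability per expert
        lb_loss = self.num_experts * (density * avg_probs).sum()
        return out.reshape(b, s, h), lb_loss

moe = SparseTopKMoE()
y, aux = moe(torch.randn(2, 16, 64))
print(y.shape, aux.item())
```

Because only K experts run per token, the per-token cost scales with K rather than with the total number of experts, which is the O(N×K) versus O(N×E) difference noted above.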
Comparison with Qwen2.5
Key differences:
RoPE type: Qwen2.5 only supports static RoPE; Qwen3MoE supports multiple types such as dynamic, yarn, and llama3.
Sparse layer scheduling: Qwen2.5 has no explicit support; Qwen3MoE allows flexible control through mlp_only_layers and sparse_step (see the config sketch after this list).
Attention backend: Qwen2.5 has only the basic implementation; Qwen3MoE integrates Flash Attention 2 and SDPA acceleration.
Cache management during generation: Qwen2.5 uses a traditional KV cache; Qwen3MoE supports a sliding-window cache (sliding_window).
MoE implementation: Qwen2.5 does not use MoE; Qwen3MoE implements sparse MoE plus a load-balancing loss.
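If you want to check these settings yourself once the checkpoints are up, something like the snippet below should work. The repo id and the field names (num_experts, num_experts_per_tok, decoder_sparse_step, mlp_only_layers, rope_scaling, sliding_window) are assumptions based on the PR and the existing Qwen2MoE-style config in transformers, so the snippet falls back gracefully if a name differs.

```python
from transformers import AutoConfig

# Hypothetical check -- repo id and field names are assumptions based on the PR
# and the Qwen2MoE-style config; getattr falls back if a name does not exist.
config = AutoConfig.from_pretrained("Qwen/Qwen3-15B-A2B")
for name in ("num_experts", "num_experts_per_tok", "decoder_sparse_step",
             "mlp_only_layers", "rope_scaling", "sliding_window"):
    print(name, "=", getattr(config, name, "<not present in this config>"))
```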
Advantages of Qwen3MoE
Key features:
Dynamic RoPE: supports multiple scaling strategies to adapt to long sequences and different hardware.
Sparse MoE: improves model capacity and training stability through Top-K routing and a load-balancing loss.
Efficient attention: integrates Flash Attention 2 and SDPA to speed up generation (see the loading example after this list).
Modular design: inherits and extends Llama/Mistral components, improving code maintainability.
Generation optimization: sliding-window cache and dynamic KV updates reduce decoding memory usage.
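As a usage sketch, loading the model with the Flash Attention 2 backend (or SDPA) goes through the standard transformers API. The repo id here is taken from the links above and may change at release time.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id comes from the links above and may change once the models are released.
model_id = "Qwen/Qwen3-15B-A2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa" if flash-attn is not installed
    device_map="auto",
)

inputs = tokenizer("Qwen3 is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```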
Summary
At present, among small-parameter models, my overall personal experience is that Qwen models are the first choice, especially the Qwen3-15B-A2B from this update: a sparse MoE model with 15B total parameters but only about 2B activated parameters per token, so it needs fewer hardware resources and can run faster.