Recently, the Muon optimizer has shown strong results in training small-scale language models, but its scalability to large models had not yet been verified. Kimi identified two key techniques for scaling Muon up:
Weight decay: crucial for scaling to larger models
Consistent update RMS: maintaining a consistent root mean square (RMS) of the updates across model parameters
These techniques allow Muon to be used out of the box in large-scale training without hyperparameter tuning. Scaling-law experiments show that, under compute-optimal training, Muon provides roughly 2× the sample efficiency of AdamW, the commonly used default optimizer.
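To make these two techniques concrete, here is a minimal PyTorch sketch of a single Muon-style update for one A×B matrix parameter. It is not Kimi's released implementation: the Newton-Schulz coefficients follow the public Muon reference, and the 0.2 * sqrt(max(A, B)) factor is one way to bring the update RMS in line with AdamW's typical value of about 0.2, so exact constants and names may differ.

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately orthogonalize the momentum matrix via a quintic
    # Newton-Schulz iteration (coefficients from the public Muon reference).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        S = X @ X.T
        X = a * X + (b * S + c * S @ S) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum, lr=1e-3, mu=0.95, weight_decay=0.1):
    # Heavy-ball momentum on the raw gradient (weight is assumed to be 2D).
    momentum.mul_(mu).add_(grad)
    # Orthogonalized update direction.
    update = newton_schulz(momentum)
    # Rescale so the update RMS roughly matches AdamW's (~0.2), keeping matrix
    # and non-matrix parameters on a consistent scale.
    A, B = weight.shape
    update = update * 0.2 * max(A, B) ** 0.5
    # Decoupled (AdamW-style) weight decay, the ingredient that makes Muon
    # scale to larger models.
    weight.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)

In practice Muon is typically applied only to 2D weight matrices, while non-matrix parameters (embeddings, norm scales, the output head) are still handled by an AdamW-style rule.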
Building on these improvements, Kimi used Muon to train the Moonlight-16B-A3B series of models: a mixture-of-experts (MoE) model with 16B total parameters (3B activated parameters), trained on 5.7T tokens. The model advances the current Pareto frontier, achieving better performance with fewer training FLOPs than previous models.
At the same time, Kimi open-sourced a memory-optimized and communication-efficient Muon implementation, and also released the pretrained model, the instruction-tuned model, and intermediate checkpoints to support future research.
All code is available at MoonshotAI/Moonlight.
Code link:
https://github.com/MoonshotAI/Moonlight
Model Links:
Moonlight-16B-A3B
https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B
Moonlight-16B-A3B-Instruct
https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B-Instruct
Demo link:
https://www.modelscope.cn/studios/moonshotai/Moonlight-16B-Demo/summary
Technical contributions include:
Analysis of scaling Muon effectively: Through extensive analysis, the research team found that weight decay plays a key role in Muon's scalability. In addition, the team proposed maintaining a consistent update root mean square (RMS) across matrix and non-matrix parameters through parameter-level adjustment of the update scale. These adjustments significantly improved training stability.
Efficient distributed implementation: The research team developed a distributed version of Muon with ZeRO-1 style optimizations, achieving optimal memory efficiency and reduced communication overhead while preserving the mathematical properties of the algorithm (a simplified sketch of this idea follows this list).
Scaling-law validation: The research team conducted a scaling-law study comparing Muon with a strong AdamW baseline, demonstrating Muon's superior performance (see Figure 1). Based on the scaling-law results, Muon needs only about 52% of the training FLOPs to reach performance comparable to a model trained with AdamW.
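To make the ZeRO-1 idea concrete, below is a deliberately simplified sketch, not Kimi's released implementation (which partitions and communicates far more efficiently): matrix parameters are assigned round-robin to data-parallel ranks, each rank stores momentum only for the parameters it owns and runs the single-device muon_step sketched above on them, and the updated weights are then broadcast to the other ranks. Gradients are assumed to have already been all-reduced (e.g., by DDP).

import torch
import torch.distributed as dist

def shard_momentum(params):
    # ZeRO-1 style state sharding: each rank allocates momentum buffers only
    # for the (matrix) parameters it owns.
    rank, world = dist.get_rank(), dist.get_world_size()
    return {i: torch.zeros_like(p) for i, p in enumerate(params) if i % world == rank}

@torch.no_grad()
def distributed_muon_step(params, momenta, lr=1e-3, mu=0.95, weight_decay=0.1):
    rank, world = dist.get_rank(), dist.get_world_size()
    for i, p in enumerate(params):
        owner = i % world  # round-robin ownership of parameters
        if rank == owner:
            # Only the owner holds momentum for this parameter and runs the
            # relatively expensive Newton-Schulz step; muon_step is the
            # single-device update sketched earlier in this post.
            muon_step(p.data, p.grad, momenta[i], lr=lr, mu=mu, weight_decay=weight_decay)
        # Broadcast the owner's updated weights so all replicas stay in sync.
        dist.broadcast(p.data, src=owner)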
Scaling Up Muon
Figure 1: (a) Scaling-law comparison between Muon and AdamW: Muon achieves about 2× the sample efficiency of AdamW. (b) MMLU performance of Moonlight (trained with Muon) and other comparable models: Moonlight pushes the Pareto frontier of the trade-off between performance and training FLOPs.
Moonlight is compared with SOTA public models of similar scale:
Llama3-3B is a 3B-parameter dense model trained on 9T tokens
Qwen2.5-3B is a 3B-parameter dense model trained on 18T tokens
DeepSeek-V2-Lite is a 2.4B/16B-parameter (activated/total) MoE model trained on 5.7T tokens
Inference Code
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "moonshotai/Moonlight-16B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "1+1=2, 1+2="
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
GPU memory usage:
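Since Moonlight-16B-A3B-Instruct is an instruction-tuned model, chat-style prompts are usually rendered through the tokenizer's chat template rather than passed as raw text. A minimal sketch, reusing the model and tokenizer loaded above and assuming the bundled tokenizer exposes apply_chat_template:

# Build a conversation and let the chat template format it for the model.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Is 123 a prime number?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=200)
print(tokenizer.batch_decode(generated_ids)[0])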
ms-swift was the first framework to integrate the Muon optimizer. ms-swift is a large-model training and deployment framework provided by the ModelScope community; its open-source repository is: https://github.com/modelscope/ms-swift
Since the moonshotai/Moonlight-16B-A3B series of MoE models does not currently support further fine-tuning (due to topk_method='noaux_tc'), we instead use Moonshot's improved Muon optimizer to fine-tune a dense model. Specifically, in the following example we use Qwen2.5-7B-Instruct to validate Muon-based fine-tuning through Swift.
Before you start fine-tuning, make sure your environment is ready.
# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
The fine-tuning script is as follows:
# 17GB
# ref: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --optimizer muon \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot
Training GPU memory usage:
If you want to use a custom dataset for training, you can refer to the following format and specify `--dataset <dataset_path>`.
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1+1"}, {"role": "assistant", "content": "It is 2"}, {"role": "user", "content": "What about adding 1"}, {"role": "assistant", "content": "It is 3"}]}
After training is complete, use the following command to perform inference on the trained weights:
Tip: `--adapters` should be replaced with the path to the last checkpoint folder produced by training. Since the adapters folder contains the training parameter file `args.json`, there is no need to additionally specify `--model`; Swift will read these parameters automatically. To turn this behavior off, set `--load_args false`.
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0
Training results:
Push the model to ModelScope:
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'