Recently, the Muon optimizer has shown strong results in training small-scale language models, but its scalability to large models had not yet been verified. Kimi identified two key techniques for scaling Muon up:
Weight decay: crucial for scaling to larger models
Consistent update RMS: maintaining a consistent root mean square (RMS) of the updates across model parameters
These techniques allow Muon to be used out of the box in large-scale training without hyperparameter tuning. Scaling-law experiments show that, under compute-optimal training, Muon provides roughly 2× the sample efficiency of AdamW, the commonly used default optimizer.
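To make these two techniques concrete, here is a minimal PyTorch sketch of a single Muon-style update for one A×B matrix parameter. It is not Kimi's released implementation: the Newton-Schulz coefficients follow the public Muon reference, and the 0.2 * sqrt(max(A, B)) factor is one way to bring the update RMS in line with AdamW's typical value of about 0.2, so exact constants and names may differ.

import torch

def newton_schulz(G, steps=5, eps=1e-7):
    # Approximately orthogonalize the momentum matrix via a quintic
    # Newton-Schulz iteration (coefficients from the public Muon reference).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        S = X @ X.T
        X = a * X + (b * S + c * S @ S) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(weight, grad, momentum, lr=1e-3, mu=0.95, weight_decay=0.1):
    # Heavy-ball momentum on the raw gradient (weight is assumed to be 2D).
    momentum.mul_(mu).add_(grad)
    # Orthogonalized update direction.
    update = newton_schulz(momentum)
    # Rescale so the update RMS roughly matches AdamW's (~0.2), keeping matrix
    # and non-matrix parameters on a consistent scale.
    A, B = weight.shape
    update = update * 0.2 * max(A, B) ** 0.5
    # Decoupled (AdamW-style) weight decay, the ingredient that makes Muon
    # scale to larger models.
    weight.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)

In practice Muon is typically applied only to 2D weight matrices, while non-matrix parameters (embeddings, norm scales, the output head) are still handled by an AdamW-style rule.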
Building on these improvements, Kimi used Muon to train the Moonlight-16B-A3B series of models: a mixture-of-experts (MoE) model with 16B total parameters (3B activated parameters), trained on 5.7T tokens. The model advances the current Pareto frontier, achieving better performance with fewer training FLOPs than previous models.
At the same time, Kimi open-sourced a memory-optimized and communication-efficient Muon implementation, and also released the pretrained model, the instruction-tuned model, and intermediate checkpoints to support future research.
All code is available at MoonshotAI/Moonlight.
Code link:
https://github.com/MoonshotAI/Moonlight
Model Links:
Moonlight-16B-A3B
https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B
Moonlight-16B-A3B-Instruct
https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B-Instruct
Demo link:
https://www.modelscope.cn/studios/moonshotai/Moonlight-16B-Demo/summary
Technical contributions include:
Analysis of scaling Muon effectively: Through extensive analysis, the research team found that weight decay plays a key role in Muon's scalability. In addition, the team proposed maintaining a consistent update root mean square (RMS) across matrix and non-matrix parameters through parameter-level adjustment of the update scale. These adjustments significantly improved training stability.
Efficient distributed implementation: The research team developed a distributed version of Muon with ZeRO-1 style optimizations, achieving optimal memory efficiency and reduced communication overhead while preserving the mathematical properties of the algorithm (a simplified sketch of this idea follows this list).
Scaling-law validation: The research team conducted a scaling-law study comparing Muon with a strong AdamW baseline, demonstrating Muon's superior performance (see Figure 1). Based on the scaling-law results, Muon needs only about 52% of the training FLOPs to reach performance comparable to a model trained with AdamW.
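To make the ZeRO-1 idea concrete, below is a deliberately simplified sketch, not Kimi's released implementation (which partitions and communicates far more efficiently): matrix parameters are assigned round-robin to data-parallel ranks, each rank stores momentum only for the parameters it owns and runs the single-device muon_step sketched above on them, and the updated weights are then broadcast to the other ranks. Gradients are assumed to have already been all-reduced (e.g., by DDP).

import torch
import torch.distributed as dist

def shard_momentum(params):
    # ZeRO-1 style state sharding: each rank allocates momentum buffers only
    # for the (matrix) parameters it owns.
    rank, world = dist.get_rank(), dist.get_world_size()
    return {i: torch.zeros_like(p) for i, p in enumerate(params) if i % world == rank}

@torch.no_grad()
def distributed_muon_step(params, momenta, lr=1e-3, mu=0.95, weight_decay=0.1):
    rank, world = dist.get_rank(), dist.get_world_size()
    for i, p in enumerate(params):
        owner = i % world  # round-robin ownership of parameters
        if rank == owner:
            # Only the owner holds momentum for this parameter and runs the
            # relatively expensive Newton-Schulz step; muon_step is the
            # single-device update sketched earlier in this post.
            muon_step(p.data, p.grad, momenta[i], lr=lr, mu=mu, weight_decay=weight_decay)
        # Broadcast the owner's updated weights so all replicas stay in sync.
        dist.broadcast(p.data, src=owner)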
Scaling Up Muon
Figure 1: (a) Scaling-law comparison between Muon and AdamW: Muon achieves about 2× the sample efficiency of AdamW. (b) MMLU performance of Moonlight (trained with Muon) and other comparable models: Moonlight pushes the Pareto frontier of the trade-off between performance and training FLOPs.
Moonlight is compared with SOTA public models of similar scale:
Llama3-3B is a 3B-parameter dense model trained on 9T tokens
Qwen2.5-3B is a 3B-parameter dense model trained on 18T tokens
DeepSeek-V2-Lite is a 2.4B/16B-parameter (activated/total) MoE model trained on 5.7T tokens
Inference Code
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "moonshotai/Moonlight-16B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "1+1=2, 1+2="
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
GPU memory usage:
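Since Moonlight-16B-A3B-Instruct is an instruction-tuned model, chat-style prompts are usually rendered through the tokenizer's chat template rather than passed as raw text. A minimal sketch, reusing the model and tokenizer loaded above and assuming the bundled tokenizer exposes apply_chat_template:

# Build a conversation and let the chat template format it for the model.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Is 123 a prime number?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=200)
print(tokenizer.batch_decode(generated_ids)[0])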
ms-swift was the first framework to integrate the Muon optimizer. ms-swift is a large-model training and deployment framework provided by the ModelScope community; its open-source repository is: https://github.com/modelscope/ms-swift
Since the moonshotai/Moonlight-16B-A3B series of MoE models does not currently support further fine-tuning (due to topk_method='noaux_tc'), we instead use Moonshot's improved Muon optimizer to fine-tune a dense model. Specifically, in the following example we use Qwen2.5-7B-Instruct to validate Muon-based fine-tuning through Swift.
Before you start fine-tuning, make sure your environment is ready.
# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
The fine-tuning script is as follows:
# 17GB
# ref: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --optimizer muon \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot
Training GPU memory usage:
If you want to use a custom dataset for training, you can refer to the following format and specify `--dataset <dataset_path>`.
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1+1"}, {"role": "assistant", "content": "It is 2"}, {"role": "user", "content": "What about adding 1"}, {"role": "assistant", "content": "It is 3"}]}
After training is complete, use the following command to perform inference on the trained weights:
Tip: `--adapters` should be replaced with the path to the last checkpoint folder produced by training. Since the adapters folder contains the training parameter file `args.json`, there is no need to additionally specify `--model`; Swift will read these parameters automatically. To turn this behavior off, set `--load_args false`.
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0
Training results:
Push the model to ModelScope:
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'