Kimi open-sources Moonlight-16B-A3B: an efficient large model based on the Muon optimizer, achieving breakthroughs in both performance and training efficiency!

Written by Jasper Cole
Updated on: July 15th, 2025
Recommendation

Kimi has open-sourced Moonlight-16B-A3B, a model trained with the Muon optimizer that marks a breakthrough in both large-model performance and training efficiency.

Core content:
1. How the key techniques of the Muon optimizer are applied and scaled to large models
2. Parameter scale and training data of the Moonlight-16B-A3B series models
3. The resources Kimi has open-sourced and their technical contributions to future research and applications

01

Preface



Recently, the Muon optimizer has shown strong results when training small language models, but its scalability to large models had not yet been verified. Kimi identified two key techniques for scaling Muon up:

  • Weight decay: crucial for scaling to larger models

  • Consistent RMS updates: keeping the root-mean-square magnitude of updates consistent across matrix and non-matrix parameters


These techniques allow Muon to be used out of the box for large-scale training without hyperparameter tuning. Scaling-law experiments show that, in compute-optimal training, Muon delivers roughly twice the sample efficiency of AdamW, the optimizer typically used by default.
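To make the RMS-matching idea concrete, here is a minimal single-matrix sketch (not Kimi's released implementation; the Newton-Schulz coefficients and the 0.2 RMS target follow the public Muon write-ups, and the hyperparameter values are placeholders):

```python
import math
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate the orthogonal factor U V^T of G via a 5th-order Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon write-up
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(W, G, M, lr=2e-2, mu=0.95, wd=0.1):
    """One Muon-style update for a single weight matrix W with momentum buffer M."""
    M.mul_(mu).add_(G)                        # momentum accumulation
    O = newton_schulz5(M)                     # orthogonalized update direction
    O = O * (0.2 * math.sqrt(max(W.shape)))   # rescale so the update RMS is ~0.2, as with AdamW
    W.mul_(1 - lr * wd)                       # decoupled (AdamW-style) weight decay
    W.add_(O, alpha=-lr)                      # apply the update
    return W, M

# Toy usage with a random "gradient"
W = torch.randn(1024, 4096) * 0.02
M = torch.zeros_like(W)
G = torch.randn_like(W)
W, M = muon_step(W, G, M)
```

The rescaling makes the per-element update magnitude roughly independent of the matrix shape, which is what lets learning rates and weight decay tuned for AdamW transfer directly.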


Building on these improvements, Kimi trained the Moonlight-16B-A3B series of models with Muon. Moonlight is a Mixture-of-Experts (MoE) model with 16B total parameters (3B activated parameters), trained on 5.7T tokens. It improves the current Pareto frontier, achieving better performance with fewer training FLOPs than previous models.


At the same time, Kimi has open-sourced its memory-optimized and communication-efficient Muon implementation, and has released the pretrained, instruction-tuned, and intermediate checkpoints to support future research.


All code is available at MoonshotAI/Moonlight.


Code link:

https://github.com/MoonshotAI/Moonlight


Model Links:

  • Moonlight-16B-A3B

    https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B


  • Moonlight-16B-A3B-Instruct

    https://modelscope.cn/models/moonshotai/Moonlight-16B-A3B-Instruct


Experience link:

https://www.modelscope.cn/studios/moonshotai/Moonlight-16B-Demo/summary


Technical contributions include:

  • Effective scaling analysis for Muon: Through extensive analysis, the research team found that weight decay plays a key role in Muon's scalability. In addition, they propose keeping the update root mean square (RMS) consistent across matrix and non-matrix parameters via parameter-wise update scaling. These adjustments significantly improve training stability.

  • Efficient distributed implementation: The research team developed a distributed version of Muon with ZeRO-1-style optimizations, achieving optimal memory efficiency and reduced communication overhead while preserving the mathematical properties of the algorithm; a toy sketch of the partitioning idea follows this list.

  • Scaling-law verification: The research team conducted a scaling-law study comparing Muon against a strong AdamW baseline, demonstrating Muon's superior performance (see Figure 1). Based on the scaling-law results, Muon needs only about 52% of the training FLOPs to match the performance of a corresponding model trained with AdamW.
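The distributed details are in the released code; purely for intuition, here is a toy, single-process illustration of the ZeRO-1 idea applied to Muon's momentum state (the parameter partitioning scheme and the SVD stand-in for Newton-Schulz are illustrative assumptions, not the actual implementation):

```python
import torch

def orthogonalize(M: torch.Tensor) -> torch.Tensor:
    # SVD-based stand-in for Muon's Newton-Schulz iteration: returns U V^T
    U, _, Vh = torch.linalg.svd(M, full_matrices=False)
    return U @ Vh

# Four weight matrices and their (fake) gradients
params = {f"w{i}": torch.randn(256, 512) * 0.02 for i in range(4)}
grads = {name: torch.randn_like(p) for name, p in params.items()}

world_size = 2
# ZeRO-1: the optimizer state (here, Muon's momentum) is partitioned, so each
# "rank" stores momentum only for the parameters it owns
owned = {r: [n for i, n in enumerate(params) if i % world_size == r] for r in range(world_size)}
momentum = {r: {n: torch.zeros_like(params[n]) for n in owned[r]} for r in range(world_size)}

updates = {}
for rank in range(world_size):
    for name in owned[rank]:                # each rank computes updates only for its shard
        M = momentum[rank][name]
        M.mul_(0.95).add_(grads[name])      # momentum lives solely on the owning rank
        updates[name] = orthogonalize(M)    # full-matrix orthogonalization, as in single-GPU Muon

# In a real run this step would be a collective (e.g. all-gather of updated
# parameters); here every parameter simply receives its owner's update
for name, p in params.items():
    p.add_(updates[name], alpha=-2e-2)
```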



Figure 1. (a) Scaling-law comparison between Muon and AdamW: Muon achieves roughly twice the sample efficiency of AdamW. (b) MMLU performance of the Moonlight model (trained with Muon) versus comparable models: Moonlight pushes the Pareto frontier of the trade-off between performance and training FLOPs.


02

Performance



Moonlight is compared against similarly sized state-of-the-art public models:

  • Llama3.2-3B: a 3B-parameter dense model trained on 9T tokens

  • Qwen2.5-3B: a 3B-parameter dense model trained on 18T tokens

  • DeepSeek-V2-Lite: a 2.4B/16B-parameter MoE model trained on 5.7T tokens



| | Benchmark (Metric) | Llama3.2-3B | Qwen2.5-3B | DSV2-Lite | Moonlight |
|---|---|---|---|---|---|
| | Activated parameters† | 2.81B | 2.77B | 2.24B | 2.24B |
| | Total parameters† | 2.81B | 2.77B | 15.29B | 15.29B |
| | Training tokens | 9T | 18T | 5.7T | 5.7T |
| | Optimizer | AdamW | * | AdamW | Muon |
| English | MMLU | 54.75 | 65.6 | 58.3 | 70.0 |
| | MMLU-pro | 25.0 | 34.6 | 25.5 | 42.4 |
| | BBH | 46.8 | 56.3 | 44.1 | 65.2 |
| | TriviaQA‡ | 59.6 | 51.1 | 65.1 | 66.3 |
| Code | HumanEval | 28.0 | 42.1 | 29.9 | 48.1 |
| | MBPP | 48.7 | 57.1 | 43.2 | 63.8 |
| Math | GSM8K | 34.0 | 79.1 | 41.1 | 77.4 |
| | MATH | 8.5 | 42.6 | 17.1 | 45.3 |
| | CMath | - | 80.0 | 58.4 | 81.1 |
| Chinese | C-Eval | - | 75.0 | 60.3 | 77.2 |
| | CMMLU | - | 75.0 | 64.3 | 78.2 |


03

Model Inference



Inference code

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "moonshotai/Moonlight-16B-A3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

prompt = "1+1=2, 1+2="
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```


GPU memory usage:
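The exact figure depends on dtype, sequence length, and hardware; as a quick check on your own machine, you can query PyTorch's allocator after generation:

```python
import torch

# Peak GPU memory allocated by PyTorch during the run above
if torch.cuda.is_available():
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak allocated GPU memory: {peak_gib:.1f} GiB")
```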


04

Fine-tuning with the Muon Optimizer



ms-swift was among the first to add support for the Muon optimizer. ms-swift is a large-model training and deployment framework provided by the ModelScope community, open-sourced at https://github.com/modelscope/ms-swift.


Because the current moonshotai/Moonlight-16B-A3B MoE models do not support further fine-tuning (due to topk_method='noaux_tc'), we instead use Moonshot's improved Muon optimizer to fine-tune dense models. Specifically, the following example uses Qwen2.5-7B-Instruct to verify Muon-based fine-tuning through Swift.


Before you start fine-tuning, make sure your environment is ready.

```bash
# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
```


The fine-tuning script is as follows:

```bash
# 17GB
# ref: https://github.com/MoonshotAI/Moonlight/blob/master/examples/toy_train.py
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --train_type lora \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
              'AI-ModelScope/alpaca-gpt4-data-en#500' \
              'swift/self-cognition#500' \
    --optimizer muon \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 5 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --model_author swift \
    --model_name swift-robot
```

GPU memory usage during training:


If you want to use a custom dataset for training, you can refer to the following format and specify `--dataset <dataset_path>`.

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "The capital of Zhejiang is Hangzhou."}]}{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1+1"}, {"role": "assistant", "content": "It is 2"}, {"role": "user", "content": "What about adding 1"}, {"role": "assistant", "content": "It is 3"}]}

After training is complete, use the following command to perform inference on the trained weights:

Tip: replace `--adapters` here with the last checkpoint folder generated by training. Because the adapters folder contains the training parameter file `args.json`, there is no need to specify `--model` separately; Swift reads these parameters automatically. To turn this behavior off, set `--load_args false`.

```bash
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0
```


Training results:


Push the model to ModelScope:

```bash
CUDA_VISIBLE_DEVICES=0 \
swift export \
    --adapters output/vx-xxx/checkpoint-xxx \
    --push_to_hub true \
    --hub_model_id '<your-model-id>' \
    --hub_token '<your-sdk-token>'
```