Let's talk about DeepSeek MoE expert load balancing

Exploring DeepSeek-R1 expert load balancing and how MoE models actually behave in practical applications.
Core content:
1. Data analysis of DeepSeek-R1 expert load balancing
2. Expert overlap analysis and its impact on model performance
3. Technical evolution and optimization strategies for fine-grained MoE models
Last week, a colleague gave me a set of online DeepSeek-R1 inference expert-activation data to study some expert load-balancing algorithms. This online data, of course, comes from internal company requests. It shows that the experts in the first 10 layers are fairly well balanced, while the later layers become increasingly imbalanced. When we discussed this, I suspected that some internal requests were concentrated in the e-commerce domain and caused the imbalance, so I did some research. I happened to find an Intel paper, "Semantic Specialization in MoE Appears with Scale: A Study of DeepSeek-R1 Expert Specialization" [1], which analyzes expert specialization in MoE from a semantic perspective and touches on exactly these questions. In addition, a few days ago I saw an interesting claim in a public-account interview with a dean: "Dense models are suitable for toB business, and MoE models are suitable for toC business." So I did some analysis and am recording it here.
1. Expert Overlap Analysis
From the paper's first Word-in-Context experiment, we can see that the expert-overlap probability in the first ten layers of DeepSeek-R1 is relatively high, which is consistent with the analysis of the online data.
What is special is that after the tenth layer, the distinction between different semantics and similar semantics is fully revealed, and the overlap itself drops significantly thanks to the fine-grained MoE design (routing 8 out of 256 experts). The paper also compares two Mistral MoE models, which route 2 out of 8 experts; there the semantic distinction between experts appears much weaker. This conclusion also supports the correctness of DeepSeek's technical route toward more fine-grained experts. I have written an article about the technical evolution of DeepSeek MoE before.
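As a side note, the "overlap" here can be thought of as a simple set statistic over the routed expert IDs of two inputs. Below is a minimal sketch (my own illustration with random placeholder logits, not the paper's code) of how the chance-level overlap already differs between a top-8-of-256 router and a top-2-of-8 router:

```python
import numpy as np

def topk_experts(router_logits: np.ndarray, k: int) -> set:
    """Indices of the top-k routed experts for a single token."""
    return set(np.argsort(router_logits)[-k:].tolist())

def expert_overlap(logits_a: np.ndarray, logits_b: np.ndarray, k: int) -> float:
    """Fraction of experts shared between two tokens' top-k selections."""
    a, b = topk_experts(logits_a, k), topk_experts(logits_b, k)
    return len(a & b) / k

rng = np.random.default_rng(0)
# DeepSeek-style fine-grained routing: top-8 of 256 routed experts
print(expert_overlap(rng.normal(size=256), rng.normal(size=256), k=8))
# Mixtral-style coarse routing: top-2 of 8 experts
print(expert_overlap(rng.normal(size=8), rng.normal(size=8), k=2))
```

With independent random logits the expected shared fraction is roughly k/N, i.e. about 3% for 8-of-256 versus 25% for 2-of-8, so fine-grained routing already starts from a much lower chance-level overlap before any semantic specialization is measured.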
Of course, there are several possible factors that may lead to this result:
1. The importance of the Shared Expert: by separating out the common component shared among experts, it reduces the probability of Routed Expert overlap.
2. Essentially, is it simply the effect of the number of Routed Experts?
3. Does R1's reinforcement-learning workflow further strengthen Expert Specialization?
But it is worth noting that in the last 20 layers of the model, the overlap still varies considerably from layer to layer and does not decrease further, which matches the distribution of the online data I obtained.
Here is a thought: the All-to-All communication time of each layer is constrained by the bandwidth and latency of the distributed deployment, so if the model is too deep it hurts TPOT. Some ScaleUp approaches can mitigate this, but considering the reliability and cost of GB200, that trade-off is not appropriate. On the other hand, the overlap around layer 40 in the figure above shows obvious jitter; one option is to make the later layers sparser to reduce the overlap further. Whether there is a corresponding Scaling Law is analyzed in the later sections.
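For intuition, here is a back-of-the-envelope sketch of how per-layer All-to-All cost accumulates into TPOT during decode; the bandwidth, latency, and payload numbers below are placeholders I picked for illustration, not measured values:

```python
def all_to_all_us(bytes_per_token: float, link_bw_GBps: float, base_latency_us: float) -> float:
    """Rough per-layer, per-token dispatch cost during decode: fixed latency + transfer time."""
    transfer_us = bytes_per_token / (link_bw_GBps * 1e3)   # 1 GB/s == 1e3 bytes/us
    return base_latency_us + transfer_us

# Placeholder assumptions: hidden size 7168, top-8 routed experts, bf16 activations,
# 50 GB/s effective per-GPU bandwidth, 10 us base latency per all-to-all.
per_layer_us = all_to_all_us(bytes_per_token=8 * 7168 * 2, link_bw_GBps=50, base_latency_us=10)
moe_layers = 58   # DeepSeek-V3/R1 has 61 layers, of which the first 3 are dense
print(f"~{per_layer_us:.1f} us per layer, ~{moe_layers * per_layer_us / 1e3:.2f} ms of TPOT "
      f"spent on dispatch alone (combine roughly doubles it)")
```

Under these assumptions the cost is dominated by per-layer latency rather than payload, which is exactly why model depth, more than expert width, is what pressures TPOT.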
2. SAE analysis
Another highlight of this paper is analyzing the experts' routing patterns based on Sparse Autoencoder (SAE) features. I have written several articles about SAE before.
From the SAE analysis in the paper, we can conclude that different experts are responsible for different reasoning and cognitive specializations, which matches DeepSeek's original intention in designing fine-grained MoE and expert specialization.
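For readers unfamiliar with SAE, a minimal sketch of the idea follows (dimensions and the L1 coefficient are placeholders, and the paper's actual setup may differ): hidden states from a layer are decomposed into an overcomplete set of sparse features, which are easier to tie to interpretable concepts and routing behavior.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary with ReLU features and L1 sparsity."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h: torch.Tensor):
        f = torch.relu(self.encoder(h))        # sparse, interpretable feature activations
        return self.decoder(f), f

sae = SparseAutoencoder(d_model=1024, d_features=8 * 1024)
h = torch.randn(4, 1024)                       # hidden states captured from some MoE layer
recon, feats = sae(h)
loss = ((recon - h) ** 2).mean() + 1e-3 * feats.abs().mean()   # reconstruction + L1 sparsity
```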
In fact, Zha B has long been suggesting analyzing large models from the SAE perspective, and using constraints on SAE activations as a mechanism in the reinforcement-learning workflow.
SAE provides a visual explanation of concepts, and Anthropic and OAI have both built corresponding visualizations, such as Anthropic's multimodal "Golden Gate Bridge" concept.
OAI and Anthropic have been laying out work in this area for quite some time, while China still lags behind.
3. R1 from the perspective of category theory
This is a long-standing topic. I have been wanting to take a week to analyze it and write a proper note, but I have been struggling with various project deadlines in recent months, so I will just give a brief summary first. From the perspective of category theory, the entire training process of R1 looks roughly as follows:
First of all, the V3-Base model is essentially a presheaf formed through a series of pre-training passes over data sets. R1-Zero, built on the V3-Base presheaf, strengthens the weights of certain morphisms; on top of the MoE model, these weights make it more generalizable. Then, starting from V3-Base again, R1-Zero's cold-start data and some general samples are mixed in to build the final R1.
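To make that reading slightly more concrete, a loose notation (entirely my own hand-waving, not anything DeepSeek has published) might be:

```latex
% C: an informal category of contexts/prompts (my own notation)
\[
F_{\text{V3-Base}} : \mathcal{C}^{\mathrm{op}} \to \mathbf{Set}
\qquad
\eta_{\text{RL}} : F_{\text{V3-Base}} \Rightarrow F_{\text{R1-Zero}}
\qquad
\eta_{\text{cold-start}+\text{general}} : F_{\text{V3-Base}} \Rightarrow F_{\text{R1}}
\]
```

In other words, pre-training builds the presheaf, and each post-training stage acts as a morphism of presheaves that re-weights which continuations a given context supports.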
I am curious whether DeepSeek records the gradient updates during the entire post-training process; combining them with SAE analysis might reveal more. Personally, I think that although ORM has achieved good results, PRM itself has some process defects. Can SAE help us find more of the reasons? In a sense, could it also provide some more abstract, generalized constraint capabilities for ORM training?
Of course, this also brings a considerable compute challenge: a trade-off between the compute consumed by SAE and the overall efficiency of the RL workflow.
4. MoE Scaling Law
At the beginning of this article, an interesting claim was mentioned: "Dense models are suitable for toB business, and MoE models are suitable for toC business." But GPT-4 is an MoE model, right? Is it toB or toC? Llama 3 is a dense model, right? Is it toB or toC? The essential issue is that under compute constraints, MoE becomes an inevitable means of continuing to scale. Of course, the numerical-stability problems of MoE gating, together with the relatively low temperature usually set for reasoning models, do increase the model's tendency to hallucinate, which makes it less suitable for some toB business scenarios.
Recently there is also an interesting paper, "Chain-of-Experts: Unleashing the Communication Potential of MoE Experts" [2], which obtains the final output hidden state through sequential processing among experts within the same layer; there is a bit of an RNN flavor here. However, if such a mechanism iterates too many times, it seems difficult to balance training and inference efficiency.
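My rough reading of the chaining idea, as a toy sketch (this is my own re-implementation of the concept, not the paper's code; the residual update and the naive per-token dispatch loop are simplifications):

```python
import torch
import torch.nn as nn

class ChainOfExpertsLayer(nn.Module):
    """Sketch: routed experts applied sequentially (chained) instead of summed in parallel."""
    def __init__(self, d_model: int, n_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.router(x), dim=-1)
        topk = weights.topk(self.top_k, dim=-1)
        h = x
        # Each selected expert refines the previous expert's output (RNN-like chaining).
        for i in range(self.top_k):
            idx = topk.indices[..., i]
            w = topk.values[..., i : i + 1]
            # Per-token expert dispatch, written naively for clarity.
            out = torch.stack([self.experts[int(idx[b])](h[b]) for b in range(h.shape[0])])
            h = h + w * out
        return h

layer = ChainOfExpertsLayer(d_model=64, n_experts=8, top_k=2)
print(layer(torch.randn(3, 64)).shape)   # torch.Size([3, 64])
```

Each chained step adds a sequential dependency inside the layer, which is exactly where the RNN flavor, and the training/inference efficiency concern, comes from.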
From the figure in the first section of this article, it seems that to some extent we can derive a structure similar to the pyramid-MoE proposed in DeepSpeed-MoE [3]: as the layer index increases, the degree of expert specialization increases, so the number of experts and the TopK selection also need to grow accordingly?
In fact, this is also a question I have been thinking about recently: is the essence of MoE, to some extent, similar to the HNSW (Hierarchical Navigable Small World) algorithm?
An earlier article of mine also introduces some HNSW/CAGRA GPU-acceleration work.
So with the help of the Grace+Blackwell architecture, can we make something interesting? I can think of an incremental MoE algorithm:
First, train with a relatively fine-grained configuration, for example 256 Routed Experts with TopK=8. Then, when training reaches, say, 500B tokens, gradually add some new experts in the later layers, and through repeated training iterate the model into a pyramid structure. Finally, in the post-training process, based on the SAE features or MoE routing rules of certain layers, freeze some expert parameters or apply KL-divergence constraints on top of them to reduce hallucination?
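A minimal sketch of what the "add experts mid-training" step could look like for a single layer (purely hypothetical: the function name, the cloning strategy, and the near-zero router initialization for new experts are all my own choices):

```python
import copy
import torch
import torch.nn as nn

def grow_experts(experts: nn.ModuleList, router: nn.Linear, n_new: int):
    """Hypothetical 'incremental MoE' step: add n_new experts to one layer mid-training.

    New experts are cloned from existing ones (plus small noise to break symmetry),
    and their router rows start near zero so existing routing is initially undisturbed.
    """
    d_model, n_old = router.in_features, router.out_features
    for i in range(n_new):
        clone = copy.deepcopy(experts[i % n_old])
        for p in clone.parameters():
            p.data.add_(0.01 * torch.randn_like(p))     # break symmetry with the parent expert
        experts.append(clone)
    new_router = nn.Linear(d_model, n_old + n_new)
    with torch.no_grad():
        new_router.weight[:n_old] = router.weight        # keep old routing behavior
        new_router.bias[:n_old] = router.bias
        new_router.weight[n_old:].mul_(1e-3)             # new experts start nearly unreachable
        new_router.bias[n_old:].fill_(-4.0)
    return experts, new_router
```

Because the new router rows start essentially dead, the existing routing distribution is preserved, and the load-balancing loss can gradually pull traffic onto the new experts as training continues.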
Why do we need Grace? Because to some extent we still need more memory space on the CPU side to swap expert weights in and out, and the bandwidth of PCIe alone is too small. Of course, deploying such a model may bring more challenges at the inference stage; taking inference performance into account when designing the model architecture is a factor that must be considered. I haven't fully figured this out yet, but I vaguely feel that in such a model, doing Expert Prediction/Prefetch for the next few layers may be one approach.
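Hand-waving a bit further, the prefetch idea could look like the sketch below (completely hypothetical: `router_next`, the host/GPU weight dictionaries, and the trick of reusing the current hidden state against the next layer's router are all assumptions of mine):

```python
import torch

def prefetch_next_layer_experts(router_next: torch.nn.Linear,
                                hidden: torch.Tensor,
                                cpu_experts: dict,
                                gpu_cache: dict,
                                top_k: int = 8) -> list:
    """Speculatively copy the experts most likely to fire in the next layer from host to GPU.

    cpu_experts: {expert_id: state_dict kept in pinned host (Grace) memory}
    gpu_cache:   {expert_id: state_dict already resident in GPU memory}
    """
    with torch.no_grad():
        scores = router_next(hidden)                       # crude guess at next-layer routing
        likely = torch.topk(scores.mean(dim=0), top_k).indices.tolist()
    for eid in likely:
        if eid not in gpu_cache:                           # async H2D copy overlaps with compute
            gpu_cache[eid] = {name: w.to("cuda", non_blocking=True)
                              for name, w in cpu_experts[eid].items()}
    return likely
```

The point of Grace+NVLink-C2C here would simply be that such speculative host-to-device copies are cheap enough to hide behind the current layer's compute, which plain PCIe struggles to do.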
Currently, Alibaba Cloud is optimizing heterogeneous resource pools of GPUs and CPUs. Going forward, the key capability that databases need to develop is to reserve expensive GPUs as much as possible for the most valuable compute and caching, and to push secondary compute and caching down to the three-tier pool of CPU, memory, and storage, making online inference more cost-effective.
From the perspective of infrastructure and distributed systems, there is still more work to be done in coordination with models.