The technical foundation behind DeepSeek: MoE, data parallelism, and model parallelism

Written by
Jasper Cole
Updated on: July 15, 2025
Recommendation

Explore the technological innovations behind DeepSeek and learn how the MoE architecture helps large models run efficiently.

Core content:
1. The relationship between large model training cost and parameter scale
2. The sparse computing advantage of the MoE architecture
3. MoE working principle and dynamic expert selection mechanism

Yang Fangxian
Founder of 53AI / Most Valuable Expert of Tencent Cloud (TVP)
In the past few years, deep learning has advanced rapidly, especially at large scale. From GPT-4 to DeepSeek, and across large-scale application scenarios such as translation and speech recognition, modern systems all rely on the support of large models. However, as model scale keeps increasing, training costs and inference time have grown exponentially, posing huge challenges for many AI researchers and companies.

The exponential growth of training costs

For example, a model with millions of parameters is relatively cheap to train. But once the parameter count reaches billions or even tens of billions, the hardware resources and time required for training rise rapidly. Traditional training methods usually demand ever more computing nodes and storage; with each additional layer or parameter, the training cost keeps climbing steeply, which puts enormous economic pressure on the teams involved.

How to solve this problem?

So, how can we significantly increase model size without letting computing and storage costs skyrocket? The answer is the MoE (Mixture-of-Experts) architecture.
The MoE architecture relies on "sparse computation": for each input, only some of the "expert" networks are activated, which greatly reduces the consumption of computing resources. The model does not have to run every parameter for every input; instead, it intelligently selects a few experts to do the work. It is like shopping in a supermarket: each time you take only what you need, instead of carrying the whole supermarket home.

1. Basic Idea of the MoE (Mixture-of-Experts) Architecture

The MoE (Mixture-of-Experts) architecture reduces computation and storage requirements by breaking a large model into multiple smaller models (the "experts") and dynamically selecting and activating only some of them for each calculation.

1. How MoE works

In a traditional dense deep neural network, every input sample is processed by all nodes in every layer; in other words, every node sees every input. In the MoE architecture, an input sample does not have to pass through all the experts; only a few selected experts process it, which greatly reduces the amount of computation.


The specific process is as follows:

  • Feed the input into the gating network: based on the sample's features, the gating network computes an activation weight for each expert.

  • Select the Top-K experts: using a Softmax over these weights, the gating network picks the Top-K most relevant experts (usually 1 or 2) for each sample.

  • Experts compute and output results: the selected experts process the input, and their outputs are combined as a weighted sum (weighted by the gating scores) to produce the final prediction.

This mechanism ensures that only a few experts are activated for each input, reducing wasted computing resources.
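To make the flow above concrete, here is a minimal sketch of a top-k gated MoE layer in PyTorch. It is purely illustrative (the class name and all sizes are invented for this example): production systems such as DeepSeek's add load-balancing losses, capacity limits, and fused kernels, but the gate-score / Top-K / weighted-sum structure is the same.

```python
# Minimal sketch of a top-k gated MoE layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network: one score per expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert networks: small independent feed-forward blocks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)                  # activation weight per expert
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)   # keep only the Top-K experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            idx = topk_idx[:, k]
            w = topk_scores[:, k].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the tokens routed to this expert are processed by it,
                    # and its output is weighted by the gating score.
                    out[mask] += w[mask] * expert(x[mask])
        return out

tokens = torch.randn(8, 64)                                       # 8 tokens, d_model = 64
layer = SimpleMoELayer(d_model=64, d_hidden=256, num_experts=4, top_k=2)
print(layer(tokens).shape)                                        # torch.Size([8, 64])
```

For readability the sketch loops over experts; real implementations group the tokens per expert and dispatch them in parallel.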

2. Comparison between the traditional model and the MoE model

Imagine a supermarket. A traditional deep learning model is like walking down every aisle and picking up every product on every visit. The MoE model is a "smart supermarket" that automatically recommends, and lets you pick up, only the products most relevant to your needs, saving a great deal of time and effort. This sparse computation is the core advantage of MoE.

2. MoE (Mixture-of-Experts) Distributed Parallel Strategy

The MoE architecture is not only optimized on a single device; distributed parallel strategies further improve its efficiency in large-scale training. There are two main parallel strategies:

1. MoE + Data Parallelism

Data parallelism is a common distributed training method: the training data is split into multiple mini-batches, and each computing unit (such as a GPU) processes a part of the data. In the MoE setting, both the gating network and all expert networks are replicated on every computing unit, and each unit processes a different shard of the data.
The advantage of this approach is that each computing unit runs the same, relatively simple computation, which suits large-scale parallel training, as sketched below.
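The sketch below shows this replication pattern with PyTorch DistributedDataParallel. It assumes the SimpleMoELayer class from the earlier sketch is importable (the module name moe_layer and all hyperparameters are placeholders) and is meant to be launched with torchrun; it illustrates the pattern only and is not DeepSeek's training code.

```python
# MoE + data parallelism sketch: the gating network and ALL experts are
# replicated on every GPU; each rank trains on its own shard of the data and
# DDP all-reduces the gradients. Launch with:
#   torchrun --nproc_per_node=<num_gpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

from moe_layer import SimpleMoELayer  # hypothetical import of the earlier sketch

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")

    model = SimpleMoELayer(d_model=64, d_hidden=256, num_experts=4, top_k=2).to(device)
    # find_unused_parameters=True because an expert may receive no tokens
    # in a given mini-batch on this rank.
    model = DDP(model, device_ids=[local_rank], find_unused_parameters=True)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):
        # Each rank sees a different mini-batch (random data stands in for a
        # real dataset wrapped in a DistributedSampler).
        x = torch.randn(32, 64, device=device)
        loss = model(x).pow(2).mean()   # dummy loss for illustration
        opt.zero_grad()
        loss.backward()                 # gradients are averaged across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```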

2. MoE + Model Parallelism

Under the model-parallel strategy, the gating network is replicated on every computing unit, while the expert networks are distributed across different computing units. This approach requires network communication to exchange information between the units.
For example, suppose we have 6 expert models distributed over 2 computing units: each unit hosts and trains 3 experts, and tokens routed to experts on the other unit are exchanged through communication between the units. Although this method allows more experts to be handled in parallel, it introduces additional communication overhead.
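The sketch below illustrates this 6-experts-over-2-devices layout on a single machine. All names and sizes are made up, and a plain .to(device) transfer stands in for the all-to-all collectives that real expert-parallel systems use; the point is only to show where the extra communication comes from.

```python
# MoE + model (expert) parallelism sketch: the gating network is available on
# every device, while the 6 experts are split 3-per-device. Tokens routed to a
# remote expert must cross the interconnect and their outputs must come back:
# that transfer is the communication overhead discussed above.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts, experts_per_device = 64, 6, 3
if torch.cuda.device_count() >= 2:
    devices = [torch.device("cuda:0"), torch.device("cuda:1")]
else:
    devices = [torch.device("cpu"), torch.device("cpu")]     # CPU fallback for the sketch

gate = nn.Linear(d_model, num_experts).to(devices[0])        # replicated gating network
experts = [nn.Linear(d_model, d_model).to(devices[e // experts_per_device])
           for e in range(num_experts)]                      # experts 0-2 on device 0, 3-5 on device 1

x = torch.randn(16, d_model, device=devices[0])              # 16 tokens living on device 0
top1 = F.softmax(gate(x), dim=-1).argmax(dim=-1)             # top-1 expert per token

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    mask = top1 == e
    if mask.any():
        expert_device = next(expert.parameters()).device
        # Ship the selected tokens to the expert's device, compute, ship back.
        out[mask] = expert(x[mask].to(expert_device)).to(devices[0])
print(out.shape)                                             # torch.Size([16, 64])
```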

3. Advantages of MoE large model

The biggest feature of the MoE architecture is that it can support very large-scale model training at a low computing cost. Specifically, MoE has the following significant advantages:

1. Faster training and better results

Since MoE activates only a small number of experts for each computation, the computational burden of each training step is greatly reduced. In natural language processing tasks, for example, MoE speeds up computation by activating only the most relevant experts and avoiding unnecessary work elsewhere. As a result, the model not only trains faster but also maintains quality.

2. Same parameters, low inference cost

Compared with traditional dense deep neural networks of the same parameter count, MoE significantly reduces the amount of computation during inference because only a few experts are activated. This keeps latency and computational cost relatively low, making MoE particularly suitable for scenarios that require efficient inference, such as online recommendation systems and speech recognition.

3. Excellent scalability

The MoE architecture scales well and can support trillion-parameter models. For example, the Switch Transformer used the MoE architecture to successfully train a model with more than 1 trillion parameters, which is almost impossible to achieve with traditional dense architectures.

4. Multi-task learning capability

MoE not only performs well in single tasks, but also has strong capabilities in multi-task learning. For example, Switch Transformer shows stronger performance in multilingual machine translation tasks by activating different experts to handle tasks in different languages.


4. Challenges of MoE Large Model

Although the MoE architecture has many advantages, there are also some challenges in practical applications:

1. Training stability issues

MoE can suffer from stability issues during training, especially at large model sizes. For example, since only some experts are activated for each token, rarely selected experts may receive too few parameter updates, which hurts the stability and convergence speed of the model.

2. High communication costs

Because MoE's expert networks are distributed across different computing units, frequent communication over the network is required, especially under model parallelism. This communication overhead can become a bottleneck in large-scale training; when using multi-GPU clusters, communication efficiency becomes particularly important.

3. Model Complexity

The MoE architecture is relatively complex to design and must be tuned for different hardware. In practice, implementing and debugging MoE requires substantial engineering effort.

4. Overfitting Problem

Because of its sparsity, MoE can be prone to overfitting during fine-tuning, especially when the downstream task has little data, so special attention must be paid to the model's generalization ability.


5. How does MoE achieve larger model parameters and lower training costs?

1. Advantages of sparse routing

Through a sparse routing mechanism, MoE activates only a few experts per input and thus significantly reduces the amount of computation: each input sample activates only the Top-K most relevant experts. This decouples the total parameter count from the per-token compute, so computing resources are used far more efficiently in large-scale models.
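The back-of-the-envelope calculation below (with made-up sizes, not DeepSeek's actual dimensions) illustrates this decoupling: a top-2-of-64 MoE feed-forward layer stores 64 experts' worth of parameters but only touches 2 experts' worth per token.

```python
# Illustrative arithmetic: parameters stored vs. parameters active per token
# for a hypothetical top-2-of-64 MoE feed-forward layer.
d_model, d_hidden = 4096, 16384
num_experts, top_k = 64, 2

ffn_params = 2 * d_model * d_hidden           # one expert (or one dense FFN)
moe_total_params = num_experts * ffn_params   # parameters you get to store
moe_active_params = top_k * ffn_params        # parameters actually used per token

print(f"total MoE FFN parameters: {moe_total_params / 1e9:.1f} B")
print(f"active per token        : {moe_active_params / 1e9:.1f} B "
      f"({100 * moe_active_params / moe_total_params:.1f}% of the total)")
```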

2. Mixed Precision Training

The MoE architecture can also adopt a mixed-precision strategy during training. For example, the bulk of the expert computation can run in bfloat16, while numerically sensitive parts such as the routing softmax are kept in full (float32) precision, as in the Switch Transformer recipe. This reduces memory usage as well as computation and communication costs.
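Here is a minimal sketch of that selective-precision idea in PyTorch: the router softmax runs in float32 outside autocast, while the expert matmuls run under a bfloat16 autocast context. The layer shapes are placeholders, and this is not DeepSeek's actual training configuration.

```python
# Selective precision sketch: keep the small but numerically sensitive router
# computation in float32, and run the expensive expert feed-forward in bfloat16.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, num_experts = 64, 4
gate = nn.Linear(d_model, num_experts)
expert = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                       nn.Linear(4 * d_model, d_model))

x = torch.randn(8, d_model)

# Router in full precision: the softmax over logits is sensitive to rounding error.
router_probs = F.softmax(gate(x), dim=-1)

# Expert computation in bfloat16: this is where most FLOPs and memory go.
# (Use device_type="cuda" when running on GPU.)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    expert_out = expert(x)

print(router_probs.dtype, expert_out.dtype)   # torch.float32 torch.bfloat16
```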


6. How does MoE solve the problems of training stability and overfitting?

1. Load balancing loss

To stabilize training, MoE introduces an auxiliary load-balancing loss. It encourages the router to spread tokens roughly evenly across the experts, which keeps the utilization of each computing device healthy and prevents a few experts from being over-activated while others are starved of updates, degrading the training efficiency of the whole system.
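Below is a minimal sketch of one common form of this loss, the Switch Transformer-style auxiliary loss, which multiplies the fraction of tokens routed to each expert by its mean routing probability. The function name and coefficient are placeholders; exact formulations vary between models, and the source does not specify which one DeepSeek uses.

```python
# Switch Transformer-style auxiliary load-balancing loss (illustrative).
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (num_tokens, num_experts) raw gate scores, top-1 routing assumed."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)               # routing probabilities
    assignment = probs.argmax(dim=-1)                      # top-1 expert per token
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(assignment, num_experts).float().mean(dim=0)
    # P_i: mean routing probability assigned to expert i
    p = probs.mean(dim=0)
    # Minimized when both f and P are uniform (1 / num_experts each).
    return num_experts * torch.sum(f * p)

logits = torch.randn(1024, 8)
aux = load_balancing_loss(logits)   # ~1.0 when routing is roughly balanced
print(aux)
```

In practice this auxiliary term is added to the task loss with a small weight (e.g. 0.01) so it nudges the router toward balance without dominating training.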

2. Dropout and learning rate adjustment

To prevent overfitting in the fine-tuning stage, MoE models often rely on dropout; for example, the Switch Transformer applies a higher dropout rate inside the expert layers ("expert dropout") to improve generalization. In addition, learning-rate adjustment helps balance overfitting against convergence speed.
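The short sketch below shows what this looks like in code: a larger dropout rate inside the expert feed-forward block than elsewhere, plus a reduced fine-tuning learning rate. The rates and learning rate are placeholders for illustration, not values used by DeepSeek.

```python
# Fine-tuning regularization sketch: higher dropout inside the expert
# feed-forward block ("expert dropout") than in the rest of the network.
import torch
import torch.nn as nn

d_model, d_hidden = 64, 256
dropout_rate = 0.1           # dropout used elsewhere in the network
expert_dropout_rate = 0.4    # higher rate inside the sparsely updated experts

expert = nn.Sequential(
    nn.Linear(d_model, d_hidden),
    nn.ReLU(),
    nn.Dropout(expert_dropout_rate),   # regularizes the expert's hidden activations
    nn.Linear(d_hidden, d_model),
    nn.Dropout(dropout_rate),
)

# A smaller learning rate than in pre-training also curbs overfitting on small
# downstream datasets.
optimizer = torch.optim.AdamW(expert.parameters(), lr=1e-5)
print(expert(torch.randn(8, d_model)).shape)   # torch.Size([8, 64])
```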


7. Application Scenarios of MoE in Large Language Models

The application of MoE in large language models is mainly reflected in the following aspects:
  1. Solving multimodal problems: in multimodal learning, MoE can assign data of different modalities (such as text, images, and speech) to different experts, improving the model's processing capability.
  2. Vertical-domain applications: for domain-specific tasks, MoE can improve the model's relevance and effectiveness by letting experts from different domains take on different tasks.
  3. Improving model scale and efficiency: MoE's sparse computation makes it possible to train larger models while also improving training efficiency and inference speed.
  4. Natural language processing: MoE has achieved remarkable results in NLP; for example, introducing MoE into machine translation has significantly improved translation quality.


8. Conclusion

The MoE architecture represents an important direction for the development of deep learning models. Through sparsification and the expert mechanism, it not only improves the training efficiency of large models but also opens up new possibilities for multi-task and multi-modal processing. Although practical applications still face certain challenges, as the technology continues to advance, MoE is set to become one of the core architectures for large-scale model training and inference.

