DeepSeek open-sources DeepEP: an efficient expert-parallel communication library

DeepEP, a communication library designed specifically for Mixture-of-Experts (MoE) models, optimizes data transmission and improves distributed training efficiency.
Core content:
1. Provides all-to-all GPU kernels for high-throughput, low-latency communication
2. Dynamic resource control to adjust the number of SMs according to task requirements
3. Supports low-precision computing to accelerate large-scale distributed training
DeepEP is a communication library designed for Mixture-of-Experts (MoE) models and Expert Parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels, i.e., the MoE dispatch and combine operations, and supports low-precision operations such as FP8.
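To make the dispatch and combine terminology concrete, the sketch below expresses the same all-to-all pattern with plain PyTorch collectives instead of DeepEP's fused kernels; the even token split across ranks and the expert_fn callback are simplifying assumptions for illustration.

    import torch
    import torch.distributed as dist

    def moe_dispatch_combine(tokens: torch.Tensor, expert_fn) -> torch.Tensor:
        # tokens: [num_tokens, hidden], already permuted so that consecutive
        # equal-sized chunks are destined for ranks 0, 1, ..., world_size - 1.
        world_size = dist.get_world_size()
        assert tokens.shape[0] % world_size == 0

        # Dispatch: each rank sends one chunk to every other rank so that
        # every expert receives the tokens routed to it.
        received = torch.empty_like(tokens)
        dist.all_to_all_single(received, tokens)

        # Local expert computation on the received tokens.
        expert_out = expert_fn(received)

        # Combine: the reverse all-to-all returns expert outputs to the
        # ranks that own the original tokens.
        combined = torch.empty_like(expert_out)
        dist.all_to_all_single(combined, expert_out)
        return combined

DeepEP implements this exchange with dedicated kernels tuned for NVLink and RDMA rather than a generic collective, which is where its throughput and latency advantages come from.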
To match the group-limited gating algorithm proposed in the DeepSeek-V3 paper, DeepEP provides a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for training and inference prefilling, and they allow the number of streaming multiprocessors (SMs) they occupy to be controlled.
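As a hedged sketch of how the SM budget for these high-throughput kernels can be capped, the snippet below follows the pattern shown in the DeepEP repository's README; the Buffer class, the set_num_sms name, and the value 24 are assumptions taken from that example and may differ in the released API.

    from deep_ep import Buffer  # import path assumed from the project README

    # Cap the communication kernels at 24 SMs so the remaining SMs stay free
    # for computation; the exact method name and value are assumptions.
    Buffer.set_num_sms(24)

Tuning this number trades communication throughput against the SMs left over for the model's compute kernels.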
For latency-sensitive inference decoding, DeepEP includes a set of low-latency kernels based on pure RDMA to minimize communication latency. In addition, the library introduces a hook-based communication-computation overlap method that does not occupy any SM resources.
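The hook-based overlap can be pictured as the callback pattern below: the dispatch call returns immediately together with a hook, the GPU runs unrelated computation while the RDMA traffic completes in the background without occupying SMs, and invoking the hook finalizes reception. The names dispatch_async, recv_hook, and attention_block are hypothetical stand-ins for illustration, not DeepEP's exact API.

    def decode_step(buffer, hidden_states, topk_idx, attention_block):
        # Issue the low-latency dispatch; recv_hook is a callable that blocks
        # until the RDMA transfers have landed in the receive buffer.
        # (dispatch_async is a hypothetical name for illustration.)
        recv_states, recv_hook = buffer.dispatch_async(hidden_states, topk_idx)

        # Overlap: run computation that does not depend on the dispatched tokens.
        other_out = attention_block(hidden_states)

        # Finalize communication only when the expert inputs are actually needed.
        recv_hook()
        return recv_states, other_out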
DeepEP mainly addresses the communication bottleneck of MoE models in distributed training and inference, achieving "cost reduction and efficiency improvement" by optimizing data transmission and resource scheduling.
Efficient all-to-all communication: Supports high-bandwidth communication within a node (NVLink) and between nodes (RDMA), optimizing the fast exchange of data between different expert subnetworks.
Dynamic resource control: Based on the group-limited gating algorithm, the number of GPU streaming multiprocessors (SMs) is allocated dynamically, adding resources when the workload is heavy and reducing power consumption when it is light, which cuts resource waste.
Support for low-precision operations: Native FP8 support reduces memory usage and accelerates computation, making it suitable for large-scale distributed training (a minimal sketch follows this list).
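As a rough illustration of why FP8 helps, the sketch below quantizes activations to float8_e4m3 with a per-tensor scale before transmission and dequantizes them afterwards; it uses plain PyTorch dtypes (available from torch 2.1 onward), not DeepEP's fused FP8 path, and the 448 constant is the approximate E4M3 maximum.

    import torch

    def quantize_fp8(x: torch.Tensor):
        # Scale so the largest magnitude maps near the FP8 E4M3 maximum (~448).
        scale = x.abs().max().clamp(min=1e-12) / 448.0
        x_fp8 = (x / scale).to(torch.float8_e4m3fn)
        return x_fp8, scale

    def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return x_fp8.to(torch.bfloat16) * scale

    x = torch.randn(8, 4096, dtype=torch.bfloat16)
    x_fp8, scale = quantize_fp8(x)            # one byte per element vs. two for bf16
    x_restored = dequantize_fp8(x_fp8, scale)

Halving the bytes per element roughly halves the all-to-all traffic, which is the main saving during large-scale training and prefilling.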