DeepSeek open-sources DeepEP: an efficient expert parallel communication library

Written by
Silas Grey
Updated on: July 15th, 2025

DeepEP, a communication library designed specifically for Mixture-of-Experts (MoE) models, optimizes data transmission and improves distributed training efficiency.

Core content:
1. All-to-all GPU kernels for high-throughput, low-latency communication
2. Dynamic resource control that adjusts the number of SMs to the workload
3. Low-precision (FP8) support to accelerate large-scale distributed training


DeepEP is a communication library designed for Mixture-of-Experts (MoE) and Expert Parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels, namely the MoE dispatch and combine operations, and supports low-precision operations such as FP8.
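To make the dispatch/combine pattern concrete, here is a minimal sketch of the underlying all-to-all exchange written with stock torch.distributed rather than DeepEP's own kernels; the routing step, the assumption that tokens are pre-sorted by destination rank, and the placeholder expert computation are illustrative simplifications.

```python
# Sketch of MoE dispatch/combine as an all-to-all exchange, using plain
# torch.distributed (not DeepEP's kernels). Launch with torchrun; tokens are
# assumed to be already permuted by destination rank, and the "expert" is a
# placeholder.
import torch
import torch.distributed as dist

def dispatch_and_combine(tokens: torch.Tensor, send_counts: list[int]) -> torch.Tensor:
    """tokens: [num_tokens, hidden]; send_counts[r] = tokens routed to rank r."""
    device = tokens.device

    # Exchange per-rank token counts so every rank knows how much it will receive.
    send_count_t = torch.tensor(send_counts, dtype=torch.long, device=device)
    recv_count_t = torch.empty_like(send_count_t)
    dist.all_to_all_single(recv_count_t, send_count_t)
    recv_counts = recv_count_t.tolist()

    # Dispatch: each token travels to the rank hosting its expert.
    recv_buf = tokens.new_empty((sum(recv_counts), tokens.shape[1]))
    dist.all_to_all_single(recv_buf, tokens.contiguous(),
                           output_split_sizes=recv_counts,
                           input_split_sizes=send_counts)

    expert_out = recv_buf * 2.0  # placeholder for the local expert MLP

    # Combine: send expert outputs back to the ranks owning the original tokens.
    combined = torch.empty_like(tokens)
    dist.all_to_all_single(combined, expert_out,
                           output_split_sizes=send_counts,
                           input_split_sizes=recv_counts)
    return combined
```

DeepEP implements this exchange with dedicated NVLink/RDMA kernels and fuses the token permutation, but the data flow is conceptually the same.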

To match the group-limited gating algorithm proposed in the DeepSeek-V3 paper, DeepEP provides a set of kernels optimized for asymmetric-domain bandwidth forwarding, such as forwarding data from the NVLink domain to the RDMA domain. These kernels deliver high throughput, making them suitable for training and inference prefilling tasks, and they allow the number of streaming multiprocessors (SMs) used to be controlled.
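As a reference for how group-limited gating constrains routing, the sketch below implements a simplified version in PyTorch: experts are split into groups, the top groups are chosen first, and the top-k experts are then selected only within those groups. The group-scoring rule (maximum logit per group) and the function name are assumptions for illustration, not DeepSeek's exact formulation.

```python
# Simplified group-limited gating: pick top groups first, then top-k experts
# only inside the kept groups. Group scoring by max logit is a simplification.
import torch

def group_limited_topk(logits: torch.Tensor, num_groups: int,
                       topk_groups: int, topk_experts: int):
    """logits: [num_tokens, num_experts]; returns (topk_idx, topk_weight)."""
    num_tokens, num_experts = logits.shape
    group_size = num_experts // num_groups

    # Score each group by its strongest expert and keep the top `topk_groups`.
    group_scores = logits.view(num_tokens, num_groups, group_size).amax(dim=-1)
    kept_groups = group_scores.topk(topk_groups, dim=-1).indices   # [T, kept]

    # Mask experts outside the kept groups, then take a plain top-k.
    group_mask = torch.zeros(num_tokens, num_groups, dtype=torch.bool,
                             device=logits.device)
    group_mask.scatter_(1, kept_groups, True)
    expert_mask = group_mask.repeat_interleave(group_size, dim=1)  # [T, E]
    masked = logits.masked_fill(~expert_mask, float("-inf"))

    topk_weight, topk_idx = masked.softmax(dim=-1).topk(topk_experts, dim=-1)
    return topk_idx, topk_weight
```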

For latency-sensitive inference decoding scenarios, DeepEP includes a set of low-latency kernels based on pure RDMA to minimize communication latency. In addition, the library introduces a hook-based communication-computation overlap method that does not occupy any SM resources.
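The hook mechanism can be pictured as "kick off the receive, return immediately, and only wait when the data is actually needed." The toy below imitates that calling pattern with a background thread standing in for the RDMA transfer; none of it is DeepEP code, but it shows why communication costs no SMs while the caller overlaps other work.

```python
# Toy, runnable illustration of the hook-based overlap idea (not DeepEP code):
# "dispatch" starts a background transfer and returns a hook; the caller does
# unrelated work and calls the hook only when the received data is needed.
import threading
import time

def fake_dispatch(payload):
    result = {}

    def transfer():                   # stands in for the background RDMA receive
        time.sleep(0.05)
        result["tokens"] = [x * 2 for x in payload]

    t = threading.Thread(target=transfer)
    t.start()

    def hook():                       # called only when the data is needed
        t.join()
        return result["tokens"]

    return hook

hook = fake_dispatch([1, 2, 3])
other_work = sum(range(1000))         # overlapped computation, independent of the data
received = hook()                     # block here, as late as possible
print(other_work, received)
```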

DeepEP mainly addresses the communication bottleneck of MoE models in distributed training and inference, achieving "cost reduction and efficiency improvement" by optimizing data transmission and resource scheduling.

Efficient all-to-all communication: supports high-bandwidth communication within a node (NVLink) and between nodes (RDMA), optimizing the fast exchange of data between different expert subnetworks.

Dynamic resource control: based on the group-limited gating algorithm, the number of GPU streaming multiprocessors (SMs) is allocated dynamically, adding resources under heavy load and reducing power consumption under light load, which cuts resource waste.

Low-precision operations: native support for the FP8 format reduces memory usage and accelerates computation, making it suitable for large-scale distributed training.
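To illustrate the FP8 point, the sketch below quantizes dispatch payloads to float8_e4m3fn with a single per-tensor scale and dequantizes to BF16 on the receiving side, halving the bytes on the wire relative to BF16. The scaling scheme is a deliberate simplification, and the example assumes a PyTorch build with float8 support.

```python
# Minimal FP8 dispatch sketch: quantize tokens to float8_e4m3fn for transport,
# dequantize to BF16 on arrival. Per-tensor scaling is a simplification.
import torch

def quantize_fp8(x: torch.Tensor):
    scale = x.abs().amax().clamp(min=1e-12) / 448.0   # 448 = max value of e4m3
    return (x / scale).to(torch.float8_e4m3fn), scale

def dequantize_bf16(x_fp8: torch.Tensor, scale: torch.Tensor):
    return x_fp8.to(torch.bfloat16) * scale

tokens = torch.randn(128, 7168, dtype=torch.bfloat16)
fp8_tokens, scale = quantize_fp8(tokens)
print(fp8_tokens.element_size(), tokens.element_size())  # 1 byte vs 2 bytes per value
restored = dequantize_bf16(fp8_tokens, scale)
```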


Performance
Conventional kernels (NVLink and RDMA forwarding)
We tested the conventional kernels on H800 GPUs (NVLink maximum bandwidth ~160 GB/s), each paired with a CX7 InfiniBand 400 Gb/s RDMA NIC (maximum bandwidth ~50 GB/s), following the DeepSeek-V3/R1 pretraining configuration: 4096 tokens per batch, hidden dimension 7168, top-4 group selection, top-8 expert activation, FP8 dispatch with BF16 combine.
Low-latency kernels (pure RDMA)
We tested the low-latency kernels on H800 GPUs, each paired with a CX7 InfiniBand 400 Gb/s RDMA NIC (maximum bandwidth ~50 GB/s), following a typical DeepSeek-V3/R1 production configuration: 128 tokens per batch, hidden dimension 7168, top-8 expert activation, FP8 dispatch with BF16 combine.