DeepSeek's fourth open-source release is a big one: three parallel-computing optimizations in a single drop, covering training speed, GPU utilization, and optimization experience

Written by Jasper Cole
Updated on: July 15, 2025
Recommendation

The fourth DeepSeek open source project brings a revolution in large model training efficiency! Explore the three major tools for parallel computing optimization.

Core content:
1. DualPipe: Innovative bidirectional pipeline parallel algorithm to achieve full overlap of computing and communication and reduce pipeline bubbles
2. EPLB: Expert parallel load balancer to optimize GPU resource allocation and achieve efficient load balancing
3. profile-data: Performance analysis data, in-depth analysis of the parallel computing secrets of V3/R1 models



 

On the fourth day of DeepSeek Open Source Week, the team launched its "Three Musketeers of parallel computing optimization", releasing the parallel-computing optimization technology behind the DeepSeek-V3 and R1 models as three gem projects in one go!

In short, the three projects are:

✅ DualPipe - a bidirectional pipeline parallel algorithm that makes computation and communication overlap efficiently
✅ EPLB - an expert-parallel load balancer that gives every GPU its fair share of work
✅ profile-data - performance analysis data offering an in-depth look at the parallel secrets of V3/R1

Each of these three projects is hard-core technology, and each one hits an efficiency pain point of large-model training and inference head-on. Let's take a look at them one by one.


DualPipe: Bidirectional pipeline parallel algorithm

Project address: https://github.com/deepseek-ai/DualPipe

DualPipe is an innovative bidirectional pipeline parallel algorithm proposed by DeepSeek-AI in the DeepSeek-V3 technical report. What makes it so powerful?

  • Full computation-communication overlap: traditional pipeline parallelism inevitably suffers from "pipeline bubbles" that leave GPUs waiting. The beauty of DualPipe is that it lets the computation and communication phases of the forward and backward passes fully overlap!
  • Fewer pipeline bubbles: through careful scheduling, DualPipe significantly reduces pipeline bubbles and maximizes GPU utilization.

Take a look at the schedule diagram in the official repository; it is simply a work of art! It clearly shows the scheduling for 8 PP ranks and 20 micro-batches: forward and backward computations proceed symmetrically from the two ends, and the overlapping regions are clear at a glance.

The Pipeline Bubbles and Memory Usage Comparison table shows that, compared with 1F1B and ZB1P, DualPipe cuts down on bubbles while keeping memory usage competitive.
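To make that comparison concrete, here is a minimal sketch that plugs made-up chunk times into the bubble formulas from the repo's comparison table, as I read them (F = forward chunk, B = full backward chunk, W = "backward for weights" chunk, and one overlapped forward-and-backward chunk). Only the formulas come from the table; the numbers are invented.

```python
# Illustrative arithmetic for the bubble formulas in DualPipe's comparison table
# (as I read them). F = forward chunk time, B = full backward chunk time,
# W = "backward for weights" chunk time, FB = one overlapped forward+backward chunk.
# The timing values below are made up; only the formulas come from the table.

PP = 8                      # pipeline parallel ranks
F, B, W = 1.0, 2.0, 1.0     # hypothetical relative chunk times
FB = F + B                  # pessimistic stand-in for the overlapped forward+backward chunk

bubble_1f1b     = (PP - 1) * (F + B)
bubble_zb1p     = (PP - 1) * (F + B - 2 * W)
bubble_dualpipe = (PP // 2 - 1) * (FB + B - 3 * W)

for name, bubble in [("1F1B", bubble_1f1b), ("ZB1P", bubble_zb1p), ("DualPipe", bubble_dualpipe)]:
    print(f"{name:>8}: bubble ≈ {bubble:.1f} (relative time units)")
```

Even with FB set pessimistically to F+B (no gain from the overlapped chunk itself), the (PP/2 − 1) coefficient already gives DualPipe the smallest bubble of the three in this toy setting.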

If you want to use DualPipe in your own project, DeepSeek-AI also provides a Quick Start guide and example.py sample code, so you can get started easily on PyTorch 2.0+.
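As a warm-up before the repo's example.py, here is a generic, hedged sketch of the prep step any pipeline-parallel wrapper needs: slicing a model into per-rank stages. The helper below is purely illustrative and is not DualPipe's API.

```python
# A minimal, generic sketch of the prep work any pipeline-parallel wrapper needs:
# split a model into per-rank stages. This is NOT DualPipe's actual API -- see the
# repo's example.py for real usage; requires PyTorch 2.0+.
import torch
import torch.nn as nn

def split_into_stages(model: nn.Sequential, num_stages: int) -> list[nn.Sequential]:
    """Evenly slice a sequential model into `num_stages` pipeline stages."""
    layers = list(model.children())
    per_stage = (len(layers) + num_stages - 1) // num_stages
    return [nn.Sequential(*layers[i:i + per_stage]) for i in range(0, len(layers), per_stage)]

toy_model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])
stages = split_into_stages(toy_model, num_stages=4)   # one stage per PP rank

x = torch.randn(2, 64)
for stage in stages:      # sanity check: the stages still compose to the full model
    x = stage(x)
print(x.shape)            # torch.Size([2, 64])
```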


EPLB: Expert parallel load balancing, letting every GPU do its job!

Project address: https://github.com/deepseek-ai/eplb

EPLB  (Expert Parallelism Load Balancer), as the name suggests, is a load balancing tool tailored for Expert Parallelism (EP)!

In EP, different experts are assigned to different GPUs. However, each expert's load fluctuates with the input data, which leads to uneven GPU load and hurts overall efficiency. EPLB is here to solve exactly this problem!

DeepSeek-V3 uses a redundant experts strategy: heavy-load experts are replicated and cleverly distributed across GPUs to balance the load. Combined with its group-limited expert routing, experts in the same group are placed on the same node whenever possible to reduce cross-node communication.
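To see why replicating a hot expert helps, here is a toy greedy packing sketch (my own illustration, not EPLB's actual algorithm): physical experts are assigned one by one to the currently lightest GPU, and duplicating the hottest expert splits its traffic across the two copies.

```python
# Toy greedy packing (NOT EPLB's actual algorithm) showing why expert replication helps:
# each physical expert is placed on whichever GPU currently has the least total load.
import heapq

def pack(loads, num_gpus):
    """Greedy longest-processing-time packing; returns sorted per-GPU total load."""
    gpus = [(0, i) for i in range(num_gpus)]
    heapq.heapify(gpus)
    for load in sorted(loads, reverse=True):
        total, idx = heapq.heappop(gpus)
        heapq.heappush(gpus, (total + load, idx))
    return sorted(total for total, _ in gpus)

expert_loads = [90, 30, 30, 30, 20, 10]       # hypothetical per-expert token counts
print(pack(expert_loads, num_gpus=2))         # [100, 110] -> imbalanced, hot expert dominates

replicated = [45, 45, 30, 30, 30, 20, 10]     # duplicate the 90-load expert, split its traffic
print(pack(replicated, num_gpus=2))           # [105, 105] -> balanced
```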

EPLB provides two load balancing strategies:

  • Hierarchical load balancing: used when the number of server nodes evenly divides the number of expert groups. It first balances the load across nodes, then balances the load across the GPUs within each node. Suited to the prefilling stage.
  • Global load balancing: used in all other cases. Experts are replicated globally and then packed onto individual GPUs. Suited to the decoding stage.

The project provides a detailed Interface and Example section that makes eplb.rebalance_experts easy to understand: the function computes the optimal expert replication and placement plan from the expert weights, the number of replicas, groups, nodes, and GPUs. There is also a vivid placement-plan diagram that makes the result clear at a glance.
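Here is a hedged sketch of what calling eplb.rebalance_experts looks like, following the interface described above. The load values are made up, and the return-value names (physical-to-logical map, logical-to-physical map, replica counts) follow my reading of the repo's example and may differ slightly.

```python
# Sketch of calling eplb.rebalance_experts with the parameters described above.
# Weight values are made up (per-expert historical load, one row per MoE layer);
# the three return values follow my reading of the repo's example.
import torch
import eplb  # from https://github.com/deepseek-ai/eplb

# 2 MoE layers x 12 logical experts: estimated load (e.g. token counts) per expert
weight = torch.tensor([[ 90, 132,  40,  61, 104, 165,  39,   4,  73,  56, 183,  86],
                       [ 20, 107, 104,  64,  19, 197, 187, 157, 172,  86,  16,  27]])

num_replicas = 16   # physical expert slots in total (logical experts + redundant copies)
num_groups   = 4    # expert groups (for group-limited routing)
num_nodes    = 2    # server nodes
num_gpus     = 8    # GPUs in total

# Returns the physical->logical mapping, logical->physical mapping, and replica counts
phy2log, log2phy, logcnt = eplb.rebalance_experts(
    weight, num_replicas, num_groups, num_nodes, num_gpus)
print(phy2log)      # placement plan: which logical expert each physical slot hosts
```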


profile-data: Performance analysis data, revealing the V3/R1 parallel strategy!

Project address: https://github.com/deepseek-ai/profile-data

DeepSeek directly discloses the performance analysis data of their  training  and  inference  frameworks! It is like a step-by-step guide to learning optimization!

The data were collected with the PyTorch Profiler and can be downloaded and opened directly in Chrome or Edge via chrome://tracing or edge://tracing for visual analysis. DeepSeek-AI also thoughtfully profiled with a simulated, perfectly balanced MoE routing strategy.
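For reference, here is a generic sketch (not DeepSeek's actual profiling setup) of how a chrome://tracing-compatible trace is typically produced with the PyTorch Profiler:

```python
# Generic sketch: produce a chrome://tracing-compatible trace with PyTorch Profiler.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
x = torch.randn(32, 1024)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    model, x = model.cuda(), x.cuda()
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, record_shapes=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Load the resulting JSON in chrome://tracing or edge://tracing
prof.export_chrome_trace("trace.json")
```

The exported trace.json can then be inspected the same way as the trace files in the profile-data repo.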

The project provides performance data for three scenarios: Training, Prefilling, and Decoding:

  • Training: demonstrates DualPipe's overlap strategy on a pair of individual forward and backward chunks, using the DeepSeek-V3 pre-training settings of 4 MoE layers, EP64, TP1, and a 4K sequence length. Note that PP communication is excluded to simplify the analysis.
  • Prefilling: uses EP32, TP1, a 4K prompt length, and a batch size of 16K tokens per GPU. It shows how two micro-batches are used to overlap computation with all-to-all communication, with the attention computation load balanced across the two micro-batches (a conceptual sketch of this overlap pattern follows the list).
  • Decoding: uses EP128, TP1, a 4K prompt length, and a batch size of 128 requests per GPU. Two micro-batches are again used to overlap computation with all-to-all communication, but unlike prefilling, the all-to-all communication in the decoding phase does not occupy GPU SMs: after the RDMA messages are issued, the SMs are freed, and the system waits for the all-to-all to complete before resuming computation. Even more efficient!
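To illustrate the two-micro-batch overlap pattern mentioned above, here is a conceptual sketch of my own (not DeepSeek's code): while one micro-batch's all-to-all dispatch is in flight, the other micro-batch computes, and then the roles swap. It assumes a torchrun launch with the NCCL backend on GPUs; the tensor shapes and the compute stand-in are arbitrary.

```python
# Conceptual two-micro-batch overlap (my own illustration, not DeepSeek's code):
# while micro-batch A's all-to-all is in flight, micro-batch B computes, then swap.
# Assumes launch via `torchrun --nproc-per-node=<n>` on GPUs with the NCCL backend.
import os
import torch
import torch.distributed as dist

def compute(x: torch.Tensor) -> torch.Tensor:
    """Stand-in for the attention / MLP work of one micro-batch."""
    return torch.relu(x @ x.T) @ x

def dispatch(x: torch.Tensor):
    """Launch a non-blocking all-to-all (token dispatch) and return (output, handle)."""
    out = torch.empty_like(x)
    handle = dist.all_to_all_single(out, x, async_op=True)
    return out, handle

torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
dist.init_process_group("nccl")

# row count should be divisible by the world size for the even all-to-all split
micro_a = torch.randn(1024, 512, device="cuda")
micro_b = torch.randn(1024, 512, device="cuda")

out_a, handle_a = dispatch(micro_a)    # A communicates...
hidden_b = compute(micro_b)            # ...while B computes
handle_a.wait()

out_b, handle_b = dispatch(hidden_b)   # B communicates...
hidden_a = compute(out_a)              # ...while A computes
handle_b.wait()

dist.destroy_process_group()
```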

Through this profiling data, you can see exactly how DeepSeek-AI fine-tunes the interplay of computation and communication, and learn how they squeeze out efficiency in the low-level implementation. It is a truly valuable resource for studying large-model parallel computing!


In conclusion:

These three open-source projects from DeepSeek-AI are full of sincerity, laying out their efficiency optimization know-how for large-model training and inference. Great news for AI researchers!

  • DualPipe lets you master the core technique of efficient pipeline parallelism and speed up model training.
  • EPLB teaches you how to load-balance expert-parallel models and improve GPU utilization.
  • profile-data gives you a deeper look into DeepSeek-V3's parallel strategies, so you can learn from a top team's optimization experience.

 

Please like!