DeepSeek open-sources DualPipe algorithm, improving training efficiency by 30%

DeepSeek open-sources the DualPipe algorithm, which significantly improves AI training efficiency.
Core content:
1. DualPipe fully overlaps computation and communication across the forward and backward passes
2. Symmetric micro-batch scheduling reduces scheduling complexity and pipeline bubbles
3. Memory-optimization figures and scheduling examples verify its resource-utilization advantages
Bidirectional pipeline parallel architecture: Through a bidirectional pipeline design, DualPipe fully overlaps computation and communication in the forward and backward passes, sharply reducing the "pipeline bubbles" (device idle time) of traditional training.
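The benefit of overlap can be seen with a toy per-block timing model (an illustrative sketch with made-up millisecond costs, not DeepSeek's profiler): when a block's communication is fully hidden behind computation, its cost drops from the sum of the two to the larger of the two.

```python
# Toy timing model for compute/communication overlap (illustrative only).
# Serial execution of one pipeline block costs compute + comm; with full
# overlap, the communication is hidden and the block costs max(compute, comm).
def block_time(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    return max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms

serial = block_time(10.0, 6.0, overlap=False)      # 16.0 ms
overlapped = block_time(10.0, 6.0, overlap=True)   # 10.0 ms
print(serial, overlapped)
```

Under these hypothetical numbers, hiding a 6 ms communication behind a 10 ms computation removes it from the critical path entirely, which is exactly the effect the bidirectional design aims for at every block.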
Symmetric micro-batch scheduling: In a configuration with 8 pipeline-parallel (PP) stages and 20 micro-batches, backward micro-batches are scheduled symmetrically with forward micro-batches. In the example diagram, cells sharing a black border represent overlapped computation and communication operations, which simplifies scheduling.
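For intuition on why bubbles matter at this scale, a standard back-of-envelope estimate for a plain unidirectional pipeline (the classic fill/drain formula, assuming unit-time blocks; this is the baseline DualPipe improves on, not DualPipe's own schedule) gives the idle fraction for the 8-stage, 20-micro-batch configuration above:

```python
# Illustrative bubble-fraction estimate for a unidirectional pipeline,
# assuming every forward/backward block takes one unit of time.
def bubble_fraction(pp_stages: int, micro_batches: int) -> float:
    # Each stage idles for (pp_stages - 1) slots while the pipeline
    # fills and drains, out of (micro_batches + pp_stages - 1) total slots.
    total_slots = micro_batches + pp_stages - 1
    return (pp_stages - 1) / total_slots

print(round(bubble_fraction(8, 20), 3))  # 0.259
```

So roughly a quarter of device time is lost to fill/drain in this naive baseline, which is the idle time that DualPipe's bidirectional, overlapped schedule is designed to reclaim.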
Memory optimization: Compared with traditional methods (such as 1F1B alternating execution and the ZB1P zero-bubble unidirectional pipeline), DualPipe roughly doubles peak memory (it keeps a second copy of the parameters for the reverse-direction pipeline) but significantly reduces training latency.
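The trade-off can be sketched with a simple per-device model (hypothetical unit costs and a simplified accounting, not measured numbers: it assumes 1F1B holds one parameter copy plus up to PP in-flight activations per stage, while DualPipe holds two parameter copies plus one extra in-flight activation):

```python
# Rough peak-memory model per device (illustrative assumptions, see lead-in).
def peak_memory_gb(param_gb: float, act_gb_per_microbatch: float,
                   pp_stages: int, scheme: str) -> float:
    if scheme == "1F1B":
        # One parameter copy; up to PP micro-batches of activations in flight.
        return param_gb + act_gb_per_microbatch * pp_stages
    if scheme == "DualPipe":
        # Assumption: doubled parameters (bidirectional pipeline) and one
        # additional in-flight activation micro-batch.
        return 2 * param_gb + act_gb_per_microbatch * (pp_stages + 1)
    raise ValueError(f"unknown scheme: {scheme}")

print(peak_memory_gb(10.0, 1.0, 8, "1F1B"))      # 18.0
print(peak_memory_gb(10.0, 1.0, 8, "DualPipe"))  # 29.0
```

The point of the sketch is qualitative: the extra cost is dominated by the duplicated parameters, which is the price paid for the bubble reduction.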
Reducing pipeline bubbles: By overlapping computation and communication, DualPipe significantly shortens training time. In DeepSeek-V3 training, for example, efficiency improved by about 30%, helping bring the training cost down to US$5.576 million, far lower than comparable models.
Memory-usage comparison: The technical report shows the execution times of the different block types (forward chunk, backward chunk, and weight-gradient backward chunk) and their overlap efficiency; the tabulated data further verifies its advantage in resource utilization.
DualPipe addresses the efficiency bottleneck in large-model training through an innovative bidirectional pipeline-parallel design while keeping memory overhead in check. Open-sourcing it not only lowers the barrier to AI training but also encourages hardware-ecosystem adaptation (for example, by Moore Threads), making it an important technical benchmark in the field. Going forward, the algorithm is expected to show its potential on more complex tasks (such as multilingual understanding and code generation), further advancing the intelligent allocation of computing resources.