DeepSeek and Tencent join hands: The story behind communication optimization that speeds up AI training

Written by
Silas Grey
Updated on: June 24, 2025
Recommendation

The hero behind AI large model training: Communication optimization technology revealed.

Core content:
1. Memory and computing speed challenges faced by AI large model training
2. Two parallel computing methods: tensor parallelism and pipeline parallelism
3. The key role of the "highway" connecting GPUs in parallel computing

Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)

 

Have you ever wondered how the big AI models we now use so casually, such as the "intelligent agents" that can write code, draw pictures, and chat with you fluently, are actually trained, and how they manage to respond so quickly?

These models are simply enormous! They have hundreds of billions or even trillions of parameters. A single computer (even one equipped with the most powerful GPU) cannot hold the entire model, let alone finish training or inference in a reasonable time. It is like building a magnificent city or running a country: one person cannot do it alone; it takes thousands of people working together.

Why do large AI models require “teamwork” (parallel computing)?

The answer comes down to two main reasons:

  1. The memory cannot hold the model! The model parameters and the intermediate data generated during training (such as activation values and gradients) can far exceed the memory capacity of a single GPU. Just as a hard drive that is too small cannot hold all your high-definition movies, a single GPU cannot hold the entire large model (a rough back-of-the-envelope calculation follows this list).  
  2. It is too slow! Even if the model could be loaded, a single GPU working through all of its layers step by step would take an unacceptable amount of time. It is like asking one person to empty an entire warehouse: far too inefficient.  
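To get a feel for reason 1, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: the 100-billion-parameter model size, the 80 GB of GPU memory, and the standard mixed-precision Adam accounting are illustrative assumptions, and real training needs activation memory on top of this.

```python
# Rough estimate of training memory for a large dense model (illustrative numbers).
params = 100e9                     # assume a 100-billion-parameter model

bytes_weights   = params * 2       # fp16/bf16 weights: 2 bytes per parameter
bytes_grads     = params * 2       # gradients in the same precision
bytes_optimizer = params * 12      # Adam in fp32: master weights + two moments (4+4+4 bytes)

total_gb = (bytes_weights + bytes_grads + bytes_optimizer) / 1e9
print(f"Model states alone: ~{total_gb:.0f} GB")          # roughly 1600 GB

gpu_memory_gb = 80                 # assume a high-end GPU with 80 GB of memory
print(f"GPUs needed just to hold model states: {total_gb / gpu_memory_gb:.0f}+")
```

Even before counting activations, the model states alone would need on the order of twenty high-end GPUs just to fit.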

Therefore, we must "split" the model and let multiple GPUs share the computing load. This is the core idea of parallel computing. To make these GPUs cooperate efficiently, AI scientists and engineers have invented several different ways to "split the model". Today we will focus on the two most important ones, tensor parallelism and pipeline parallelism, as well as a crucial behind-the-scenes hero: the "highway" connecting the GPUs.

Method 1: Tensor Parallelism (TP)

An analogy: imagine that you and your colleagues have to complete a very large calculation together (say, a matrix multiplication). The problem is too big: one person's scratch paper cannot hold it, or one person working alone is simply too slow.

What to do? Tensor parallelism "cuts" this single huge computing task (the computation within one model layer) into several pieces and hands them to different GPUs to calculate at the same time. Each GPU is responsible for only a part of the big task. Finally, everyone "pieces together" their partial results to obtain the complete answer.

  • Features: this method splits the computation and data within a single layer of the model. Each GPU handles only a portion of that layer's weights and data.  
  • Advantages: purpose-built for the case where a single layer is too large for one GPU to hold or compute.  
  • Disadvantages: during the calculation, GPUs must frequently exchange intermediate results (for example, after each GPU finishes its part, the partial results must be combined before the next step can proceed). This frequent intra-layer communication is the main overhead.  

Use a diagram to illustrate the concept of tensor parallelism:

The figure above shows tensor parallelism dividing one large layer's computation across multiple GPUs; after each GPU computes its partial result, the results must be synchronized to obtain the final output.
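To make the idea concrete, here is a minimal single-process sketch in NumPy that simulates column-wise splitting of one layer's weight matrix across a few "GPUs". The shapes and the 4-way split are illustrative assumptions; a real implementation shards across actual devices and uses real collectives rather than a `concatenate` call.

```python
import numpy as np

# Toy "layer": y = x @ W, where W stands in for a weight matrix that is too big
# for a single device in the real setting.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1024))        # a small batch of activations
W = rng.standard_normal((1024, 4096))     # the full weight matrix of one layer

# Tensor parallelism (column-parallel): split W's output columns across 4 "GPUs".
num_gpus = 4
W_shards = np.split(W, num_gpus, axis=1)  # each shard is 1024 x 1024

# Each "GPU" computes only its slice of the output (in parallel on real hardware).
partial_outputs = [x @ W_shard for W_shard in W_shards]

# Communication step: piece the partial results back together (an all-gather).
y_parallel = np.concatenate(partial_outputs, axis=1)

# Check against the single-device result.
y_reference = x @ W
print("Max difference:", np.abs(y_parallel - y_reference).max())
```

The final `concatenate` plays the role of the synchronization step described above: no single GPU has the full answer until the partial results are exchanged.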

Method 2: Pipeline Parallelism (PP)

An analogy: it is like a factory assembly line. The entire large model is treated as one complex production process made up of many steps (the model's different layers or modules).

What to do? Pipeline parallelism assigns different layers of the model (or modules made up of several layers) to different GPUs, forming a processing chain. After GPU 1 computes the first layer of the model, it immediately passes the output to GPU 2 to compute the second layer; after GPU 2 finishes, it passes the result to GPU 3 for the third layer, and so on. Data is like a product moving along the line, processed by different GPUs in sequence.

  • Features: this method splits the model by layers/modules; each GPU is responsible for computing a different part of the model.  
  • Advantages: compared with tensor parallelism, the communication pattern between GPUs is relatively simple: usually only the output of one stage has to be passed to the downstream GPU. Best of all, during training different GPUs can process different batches of data at the same time, just like a real assembly line, greatly improving overall throughput and GPU utilization. This method also makes effective use of the total memory capacity of all GPUs in the cluster.  
  • Disadvantages: if one GPU in the pipeline computes slowly or data does not arrive in time, the next GPU has to sit idle waiting for its input, creating the so-called **"pipeline bubbles"** that hurt overall efficiency.  

Use a diagram to illustrate the concept of pipeline parallelism during training (showing parallel processing of different data batches):

The figure above shows pipeline parallelism letting different GPUs work on model layers for different data batches at the same time. Note the "waiting" gaps: these are the "bubbles".

The diagram above keeps things simple. To see more clearly how pipelining improves efficiency (in particular, how multiple data batches are processed simultaneously during training), picture several micro-batches shuttling through the pipeline, with each stage working on a different one at any given moment.
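To make the "bubble" idea concrete, here is a minimal sketch that models a naive forward-only pipeline schedule and reports how much of the time each GPU sits idle. The 4 stages, the micro-batch counts, and the unit step time are illustrative assumptions.

```python
# Toy model of a forward-only pipeline: S stages (GPUs), M micro-batches,
# each stage spending 1 time unit per micro-batch.
num_stages = 4         # GPUs in the pipeline (assumption)
num_microbatches = 8   # micro-batches per training step (assumption)

# Stage s can only start micro-batch m after stage s-1 has finished it,
# so the whole step takes M + S - 1 time slots while each GPU works for M of them.
total_slots = num_microbatches + num_stages - 1
bubble_fraction = 1 - num_microbatches / total_slots
print(f"Total time slots: {total_slots}")
print(f"Each GPU is idle {bubble_fraction:.0%} of the time (the 'bubbles')")

# More micro-batches in flight shrink the bubbles:
for m in (2, 8, 32):
    frac = (num_stages - 1) / (m + num_stages - 1)
    print(f"micro-batches = {m:3d} -> bubble fraction {frac:.0%}")
```

The trend is the point: the more micro-batches in flight, the smaller the bubbles, which is exactly why training pipelines chop each batch into many micro-batches.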


In practice: they are usually used in combination

For today's largest and most complex models, these two techniques are usually combined: tensor parallelism first solves the problem of a single layer being too large (splitting one layer across a group of GPUs), and pipeline parallelism then distributes these layers (each possibly already tensor-parallel) across even more GPUs. In addition, special architectures such as MoE (Mixture-of-Experts) models require further parallel schemes such as expert parallelism. All of this places extremely high demands on communication between GPUs.
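As a rough illustration of how the two are combined, here is a minimal sketch that lays out 8 hypothetical GPUs as a 4-stage pipeline with 2-way tensor parallelism inside each stage. The sizes are illustrative assumptions; real frameworks construct such process groups with their own launchers and configuration.

```python
# Lay out 8 hypothetical GPUs as: 4 pipeline stages x 2-way tensor parallelism.
pipeline_stages = 4   # assumption: the model is split into 4 chunks of layers
tensor_parallel = 2   # assumption: each layer inside a chunk is split across 2 GPUs

gpu_id = 0
layout = {}
for stage in range(pipeline_stages):
    group = []
    for _ in range(tensor_parallel):
        group.append(gpu_id)
        gpu_id += 1
    layout[stage] = group

for stage, gpus in layout.items():
    print(f"pipeline stage {stage}: tensor-parallel group = GPUs {gpus}")
# Within a stage: heavy intra-layer communication (tensor parallelism).
# Between stages: activations are handed to the downstream group (pipeline parallelism).
```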

Why is the “highway” connecting GPUs so important?

Now that we understand the basic principles of parallel computing, one thing stands out: whether it is the frequent intra-layer data exchange of tensor parallelism, the stage-to-stage handoff of pipeline parallelism, or the routing of data between GPUs in expert parallelism, the efficiency of every parallel method depends heavily on the data communication between GPUs.

Imagine that you are in charge of one station on a factory assembly line and must quickly deliver the parts you have processed to the colleague at the next station. If the delivery road is narrow and congested (a slow network interconnect), or the delivery "truck" takes forever to depart (an inefficient network card), parts pile up at your station or the next colleague sits waiting, and the whole line slows down. Processing faster at your own station (no matter how powerful a single GPU's compute is) does not help.

The "network interconnection" technology that connects the GPUs is the "highway" that determines the efficiency of these parallel calculations! It determines whether data can flow quickly and smoothly between GPUs. If this "highway" has high latency, low bandwidth, and unstable traffic, it will cause serious efficiency problems: 

  • "Bubbles" run rampant: a GPU finishes its computation but cannot send or receive data, so it can only wait. The result is a lot of idle time ("bubbles"), and precious GPU compute is wasted.  
  • Advanced optimizations stop working: many advanced parallel techniques try to overlap computation and communication so that each masks the other's latency (a minimal sketch of this overlap idea appears after the diagram below). But if the network is too slow, no amount of clever "overlapping" can hide the enormous communication time.  
  • Resources are stacked up to compensate: to make up for an inefficient network, you may be forced to throw in more GPUs, which then spend most of their time "waiting for data" instead of actually computing, wasting resources on a huge scale.  

Use a diagram to illustrate the impact of network communication on efficiency:

The above figure shows that the network transmission speed between GPUs directly affects the waiting time of the downstream GPU, which in turn affects its utilization and overall efficiency.
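To illustrate the computation/communication overlap mentioned above, here is a minimal runnable sketch using PyTorch's asynchronous collectives. It runs as a single-process "world" on the CPU gloo backend purely for demonstration, and the tensor sizes are illustrative assumptions; real training launches one rank per GPU with the NCCL backend.

```python
import os
import torch
import torch.distributed as dist

# Single-process "cluster" on the CPU gloo backend, just to demonstrate the API.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

grads = torch.randn(4 * 1024 * 1024)       # pretend these are gradients to synchronize
next_layer_input = torch.randn(2048, 2048)

# Kick off the communication without blocking...
work = dist.all_reduce(grads, op=dist.ReduceOp.SUM, async_op=True)

# ...and keep doing independent computation while the data is in flight.
activations = next_layer_input @ next_layer_input.T

# Only wait at the point where the communication result is actually needed.
work.wait()
print("overlapped compute done:", activations.shape, "grads synced:", grads.shape)

dist.destroy_process_group()
```

If the network is slow, `work.wait()` dominates no matter how much independent computation is squeezed in between, which is exactly the failure mode described above.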

Why is DeepSeek particularly grateful for Tencent’s “superhighway” improvements?

This is why the open-source AI company DeepSeek publicly thanked Tencent. The gratitude was not a mere courtesy; it stemmed from Tencent's hard-core, behind-the-scenes optimization of the AI computing "highway"!

Specifically, DeepSeek maintains an open-source communication framework designed for MoE and similar model architectures, called DeepEP. This framework has to handle high-throughput, low-latency data transfers, which are key to efficient model training.

Tencent's Starlink Networking team analyzed in depth how the DeepEP framework performs in a real high-performance network environment, like giving the "highway" a physical exam. They pinpointed two major choke points:

  1. The dual ports of the network card were not fully used: the network interface card (NIC) that connects a GPU to the network usually has two physical ports, like a highway entrance with two lanes. In practice at the time, however, problems in the underlying software or configuration meant the bandwidth of the two ports was not fully exploited: it was as if only one of the two lanes was open, or both lanes were open but the vehicles were scheduled so poorly that total traffic could not increase.  
  2. CPU coordination was not fast enough: the CPU responsible for directing how data moves over the network introduced extra latency, like a traffic controller who dispatches too slowly and holds up the traffic.  

In response to these specific issues, the Tencent team carried out detailed and in-depth optimization:

  • Open both lanes and add toll gates: they modified the underlying communication library (NVSHMEM) so that the system transparently (meaning DeepEP's upper-layer code needs almost no changes) enables both ports of the network card at the same time and establishes multiple concurrent data transmission channels (multiple queue pairs). This is like fully opening both lanes of the highway and adding extra toll gates, letting massive amounts of data flow into the network in parallel and greatly improving transmission concurrency and bandwidth utilization (a toy sketch after this list illustrates the idea). A diagram illustrates the dual-port optimization concept.  

  • Remove the "invisible trap" in the underlying library: during testing they also stumbled on an even lower-level problem. After the widely used multi-GPU communication library NCCL is upgraded to certain versions (such as 2.22 and later), DeepEP's network communication performance drops sharply! Through in-depth investigation they speculated that a "delayed connection" mechanism in the new version was the cause, and found an effective workaround: by adjusting environment variables they could sidestep the trap and restore high performance. A picture illustrates the problem caused by the NCCL version: even with the same DeepEP framework, different underlying NCCL library versions can yield very different network transfer speeds.  
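To build intuition for the dual-port, multi-queue-pair idea in the first bullet above, here is a toy sketch that models spreading one large transfer over several transmission channels. All names and numbers are illustrative assumptions; the real work lives inside NVSHMEM and the NIC driver, not in Python.

```python
# Toy model of the dual-port / multi-queue-pair idea (all numbers are assumptions).
port_line_rate_GBps = 25.0   # assumed line rate of one physical NIC port
single_qp_GBps = 8.0         # assumed throughput a single queue pair can drive
message_gb = 2.0             # assumed size of one transfer between GPUs

def effective_bandwidth(num_ports, qps_per_port):
    # More queue pairs push a port closer to its line rate; using both ports
    # doubles the achievable ceiling.
    per_port = min(qps_per_port * single_qp_GBps, port_line_rate_GBps)
    return num_ports * per_port

for ports, qps in [(1, 1), (2, 1), (2, 4)]:
    bw = effective_bandwidth(ports, qps)
    print(f"{ports} port(s) x {qps} QP(s): ~{bw:.0f} GB/s, "
          f"{message_gb / bw * 1000:.0f} ms per {message_gb:.0f} GB transfer")
```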

What did these optimizations bring? Let the data speak for itself:

These seemingly “low-level” technical improvements have brought amazing measured results:

  • First, they successfully **"activated" the dual network ports on the high-performance network card**. Before the fix, even though two interfaces existed physically, the achievable bandwidth was often no better than that of a single port, and could even fall below traditional high-performance networks (such as InfiniBand) because of limitations in the underlying software and drivers. After optimization, their solution transparently and fully exploits both ports, greatly raising the total bandwidth a dual-port setup can deliver; the wasted potential has been fully tapped! In their tests on an Ethernet-based network environment such as RoCE, bandwidth ran stably at an ultra-high 50-60 GB per second, close to the theoretical peak.
  • Even more striking, they exposed and solved the "trap" introduced by the underlying communication library. Their tests showed that DeepEP's bandwidth could reach 50-60 GB per second with NCCL 2.21, but after upgrading to NCCL 2.22 it plunged to 30-40 GB per second, a loss of nearly half, like a smooth highway suddenly narrowing to a single lane! The Tencent team uncovered this "invisible trap" and provided a fix that restores the high bandwidth even on the newer NCCL version (a rough sketch of how such bandwidth is measured follows this list).
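The bandwidth numbers above come from Tencent's own tests. As a rough illustration of how achieved collective bandwidth can be measured in a PyTorch job (not DeepEP's actual benchmark; the message size, iteration count, and single-process gloo/CPU backend here are assumptions made for the sake of a runnable example), a timing harness might look like this:

```python
import os
import time
import torch
import torch.distributed as dist

# Single-process demo setup; a real measurement runs one rank per GPU with the
# NCCL backend and accounts for the world size in the bandwidth formula.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

payload = torch.randn(64 * 1024 * 1024)   # ~256 MB of float32 data (assumption)
iters = 20

dist.all_reduce(payload)                  # warm-up
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(payload)
elapsed = time.perf_counter() - start

bytes_moved = payload.numel() * payload.element_size() * iters
print(f"~{bytes_moved / elapsed / 1e9:.1f} GB/s of all_reduce traffic "
      f"(meaningful only with multiple ranks over a real network)")

dist.destroy_process_group()
```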

Ultimately, these optimizations together delivered a significant boost in overall performance. As reported, performance increased by 100% (a straight doubling) in the RoCE network environment and by 30% in the InfiniBand environment. The specific GitHub pull request is linked in the original article.

Why is this important to DeepSeek and the industry as a whole?

  • Training speed soars: Network transmission is no longer a bottleneck, and collaboration between GPUs is smoother, significantly speeding up the training cycle of large models.  
  • Maximize resource utilization: The “bubbles” of pipeline parallelism are smaller , GPU waiting time is reduced, and more computing tasks can be completed with limited GPU resources.  
  • Cost and efficiency advantages: faster training and more efficient GPU usage mean the same goals can be reached with less hardware cost and in less time, a huge competitive advantage. Some reports noted that Tencent achieved higher training efficiency on its existing GPUs and may even have been able to slow the pace at which it deploys new ones.  
  • Promote the open source ecosystem: Tencent has fully open-sourced these optimized DeepEP frameworks and related underlying modifications , and has successfully applied them in the training of its own Hunyuan large model . This not only verifies the effectiveness of the optimization, but also contributes its results to the entire AI community, helping more developers and researchers improve the efficiency of large model training.  

To summarize:

Large AI models require parallel computing across many GPUs, and techniques such as tensor parallelism, pipeline parallelism, and expert parallelism all depend on large volumes of efficient GPU-to-GPU communication. The network connecting the GPUs is the "highway" that determines how efficient these parallel schemes can be. Through low-level, hard-core optimization of DeepSeek's open-source DeepEP framework, Tencent's Starlink Networking team broke through the dual-port bottleneck of the network card and defused the "invisible trap" in the underlying communication library, significantly raising GPU-to-GPU communication bandwidth (to 50-60 GB/s under RoCE) and delivering real-world performance gains of 100% on RoCE and 30% on InfiniBand. These breakthroughs not only give DeepSeek's model training a huge speed-up, but also, through open-source contributions, inject powerful momentum into the efficient development of the entire large-model ecosystem. As AI compute becomes ever more precious, the behind-the-scenes heroes who improve how efficiently GPUs collaborate are playing an increasingly critical role!