A brief talk on computing power networks: Alibaba Cloud Wanka (10,000-GPU) cluster networking practice

Written by Silas Grey
Updated on: July 8, 2025

Recommendation

Explore the efficient design and practical application of Alibaba Cloud's Wanka cluster network architecture.

Core content:
1. The three-layer RoCE network design of the Wanka cluster's HPN 7.0 architecture
2. Details of the Wanka cluster's network topology and port configuration
3. Analysis of the fault-handling mechanism and the advantages of the dual-uplink design


Due to project needs, I have recently been studying Alibaba Cloud's new-generation Wanka cluster network architecture, HPN 7.0.

Let's start with the information taken from the paper. The three-layer RoCE network is designed with a 1:1 convergence ratio. The overall network topology is shown in the figure below; each Pod contains 8 Segments.

Each Segment contains 16 Leaf switches with a total of 2,048 downstream 200 Gbps ports, which can connect 128 GPU servers (2,048 200 Gbps network ports in total).

Each Pod therefore has a total of 1,024 GPU servers (8,192 GPU cards).

Each Leaf switch has 64 upstream 400 Gbps ports and 128 downstream 200 Gbps ports. Each downstream 200 Gbps port connects to one of the 200 Gbps network ports of a GPU server. Each GPU server has 16 200 Gbps network ports, which connect upstream to 16 Leaf-switch 200 Gbps ports (one on each Leaf switch in the Segment).
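To double-check the arithmetic above, here is a quick back-of-the-envelope sketch in Python (my own model of the figures quoted from the paper, not an official tool); it derives the Segment and Pod scale and verifies the Leaf switch's 1:1 convergence ratio.

```python
# Back-of-the-envelope model of the 1:1 HPN 7.0 figures quoted above.
SEGMENTS_PER_POD   = 8
LEAVES_PER_SEGMENT = 16
LEAF_DOWN_PORTS    = 128   # 200 Gbps each
LEAF_UP_PORTS      = 64    # 400 Gbps each
PORTS_PER_SERVER   = 16    # 8 dual-port 200 Gbps NICs
GPUS_PER_SERVER    = 8

# Downstream ports in one Segment and the servers they can host.
segment_down_ports  = LEAVES_PER_SEGMENT * LEAF_DOWN_PORTS     # 16 x 128 = 2048
servers_per_segment = segment_down_ports // PORTS_PER_SERVER   # 2048 / 16 = 128

# Scale per Pod.
servers_per_pod = SEGMENTS_PER_POD * servers_per_segment       # 8 x 128 = 1024
gpus_per_pod    = servers_per_pod * GPUS_PER_SERVER            # 1024 x 8 = 8192

# 1:1 convergence at the Leaf: upstream bandwidth equals downstream bandwidth.
assert LEAF_UP_PORTS * 400 == LEAF_DOWN_PORTS * 200            # 25.6 Tbps each way

print(segment_down_ports, servers_per_segment, servers_per_pod, gpus_per_pod)
# -> 2048 128 1024 8192
```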


Each GPU server uses 8 dual-port 200 Gbps network cards, for a total of 16 200 Gbps ports, and adopts a dual-uplink mode: each GPU gets two uplinks, and the two uplinks are connected to different Leaf switches. That is, for the 128 GPU servers in each Segment, all No. 1 NIC ports are connected to Leaf switch No. 1, and so on, up to all No. 16 NIC ports being connected to Leaf switch No. 16.
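The cabling rule can be written down as a simple mapping. The sketch below is my reading of the description above (NIC port i of every server in a Segment lands on Leaf switch i); the function name and numbering are hypothetical.

```python
# Hypothetical rail-style cabling for one Segment: NIC port i of every
# GPU server connects to Leaf switch i, so the two ports of each dual-port
# NIC always land on two different Leaf switches.
from collections import Counter

SERVERS_PER_SEGMENT = 128
PORTS_PER_SERVER    = 16   # ports 1..16 -> Leaf switches 1..16

def leaf_for(server_id, nic_port):
    """Leaf switch (1..16) that NIC port `nic_port` of `server_id` plugs into."""
    return nic_port

# Every Leaf switch ends up with exactly one port from each of the 128 servers,
# i.e. exactly its 128 downstream 200 Gbps ports.
load = Counter(
    leaf_for(s, p)
    for s in range(1, SERVERS_PER_SEGMENT + 1)
    for p in range(1, PORTS_PER_SERVER + 1)
)
assert len(load) == 16
assert all(count == 128 for count in load.values())

# The two ports of the same dual-port NIC (e.g. ports 1 and 2) hit different switches.
assert leaf_for(1, 1) != leaf_for(1, 2)
```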


This dual-uplink design doubles both the number of GPUs and the communication bandwidth within each Segment.

Communication between GPUs within a Segment needs to traverse only a single Leaf switch; a Segment can interconnect up to 1,024 GPU cards with a total communication bandwidth of 409.6 Tbps.
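The 1,024-GPU and 409.6 Tbps figures follow directly from the port counts; a quick check of my own:

```python
# Intra-Segment scale and bandwidth.
gpus_per_segment = 128 * 8            # 128 servers x 8 GPU cards = 1024
segment_bw_tbps  = 2048 * 200 / 1000  # 2048 ports x 200 Gbps = 409.6 Tbps
print(gpus_per_segment, segment_bw_tbps)   # -> 1024 409.6
```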

In addition, the dual-uplink design also mitigates failures caused by network cards, switches, optical modules, optical fibers, and so on. For example, when an uplink or its corresponding switch fails, traffic can be switched to the other port and service continues without interrupting the training task (although training speed may be affected). The traffic detour path in such a failure scenario is shown in the figure below.
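To make the failover behavior concrete, here is a minimal illustration of the idea (my own sketch with made-up names, not Alibaba's implementation): each GPU has two uplinks to two different Leaf switches, and traffic simply moves to the surviving port when one path goes down.

```python
# Minimal illustration (assumed names/structure) of dual-uplink failover:
# each GPU's dual-port NIC has one uplink per Leaf switch, and traffic
# falls back to whichever uplink is still healthy.

def pick_uplink(uplinks, link_up):
    """Return the first healthy uplink, or None if both paths are down."""
    for port in uplinks:
        if link_up.get(port, False):
            return port
    return None

# Hypothetical naming: GPU 0's NIC port a goes to Leaf 1, port b to Leaf 2.
uplinks = ["nic0.port_a->leaf1", "nic0.port_b->leaf2"]

# Normal operation: port a is used.
print(pick_uplink(uplinks, {"nic0.port_a->leaf1": True, "nic0.port_b->leaf2": True}))

# Port a's link (or Leaf 1 itself) fails: traffic detours via Leaf 2.
print(pick_uplink(uplinks, {"nic0.port_a->leaf1": False, "nic0.port_b->leaf2": True}))
```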



The above describes the Spine and Core switch design with a 1:1 convergence ratio. In a case I encountered recently, the design was instead based on a 1:15 convergence ratio, presumably reflecting Alibaba's long-term observation and modeling of its own traffic. The network topology is shown below (only 3 units are drawn).


The entire cluster is divided into 15 units; each unit typically has 128-136 GPU servers. Under the dual-plane design (Plane A and Plane B), each GPU server has 16 200 Gbps ports, 8 of which connect to Plane A and the other 8 to Plane B. The entire cluster can therefore support a maximum of 2,040 GPU servers, i.e. over 16,000 GPU cards (2,040 × 8 = 16,320).
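Working out the maximum scale from these figures (assuming the upper end of 136 servers per unit, which the leaf port budget below also supports):

```python
# Maximum scale of the 15-unit, dual-plane cluster described above.
UNITS            = 15
SERVERS_PER_UNIT = 136   # assumed upper end of the 128-136 range
GPUS_PER_SERVER  = 8

max_servers = UNITS * SERVERS_PER_UNIT       # 15 x 136 = 2040
max_gpus    = max_servers * GPUS_PER_SERVER  # 2040 x 8 = 16320 (over 16,000 cards)
print(max_servers, max_gpus)
```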


Each unit is equipped with 16 leaf switches (8 for Plane A and 8 for Plane B), so 15 fully populated units use 240 leaf switches in total. Each leaf switch has 60 upstream and 68 downstream 400 Gbps ports; each downstream 400 Gbps port can be split into two 200 Gbps ports, giving each leaf switch 136 downstream 200 Gbps ports.
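The 136 downstream 200 Gbps ports per leaf are exactly enough for one plane's share of a 136-server unit; a quick consistency check of my own on the numbers above:

```python
# Leaf switch port budget, per unit and per plane.
LEAVES_PER_UNIT_PER_PLANE = 8
LEAF_DOWN_400G            = 68
LEAF_DOWN_200G            = LEAF_DOWN_400G * 2   # each 400G splits into 2 x 200G -> 136

down_ports_per_plane = LEAVES_PER_UNIT_PER_PLANE * LEAF_DOWN_200G  # 8 x 136 = 1088
ports_needed         = 136 * 8                                     # 136 servers x 8 ports per plane
assert down_ports_per_plane == ports_needed                        # 1088 == 1088

total_leaves = 15 * 16                                             # 240 leaf switches across the cluster
print(LEAF_DOWN_200G, down_ports_per_plane, total_leaves)
```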


In a full configuration, Plane A and Plane B are each equipped with 60 Spine switches. Each Spine switch has 8 upstream and 120 downstream 400 Gbps ports, and each downstream 400 Gbps port of a Spine switch corresponds to one upstream 400 Gbps port of a Leaf switch.
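The 1:15 convergence ratio mentioned earlier shows up at the Spine layer (120 downstream vs. 8 upstream 400 Gbps ports), and the Spine downstream ports match the Leaf uplinks exactly; a quick check:

```python
# Spine layer check for one plane.
SPINES_PER_PLANE = 60
SPINE_DOWN_400G  = 120
SPINE_UP_400G    = 8

LEAVES_PER_PLANE = 15 * 8   # 15 units x 8 leaf switches = 120
LEAF_UP_400G     = 60

# Convergence (oversubscription) ratio at the Spine: 120 down : 8 up = 15 : 1.
print(SPINE_DOWN_400G // SPINE_UP_400G)   # -> 15

# Every Leaf uplink has a matching Spine downstream port: 7200 == 7200.
assert SPINES_PER_PLANE * SPINE_DOWN_400G == LEAVES_PER_PLANE * LEAF_UP_400G
```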


The entire cluster is configured with 8 Core switches. Each Core switch has 8 upstream and 120 downstream 400 Gbps ports, and each downstream 400 Gbps port corresponds to one upstream 400 Gbps port of a Spine switch.
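Finally, the Core layer's downstream ports line up with the Spine uplinks across both planes; a quick check:

```python
# Core layer check across both planes.
CORES          = 8
CORE_DOWN_400G = 120
SPINES_TOTAL   = 2 * 60   # 60 Spine switches in Plane A + 60 in Plane B
SPINE_UP_400G  = 8

# Core downstream ports exactly cover all Spine uplinks: 960 == 960.
assert CORES * CORE_DOWN_400G == SPINES_TOTAL * SPINE_UP_400G
print(CORES * CORE_DOWN_400G)   # -> 960
```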