The Ultimate Guide to Building an AI Training Platform: Combining RoCE/IB Network, 3FS Storage, and HAI Platform

Master the core elements of building an AI training platform and gain an in-depth understanding of how RoCE/IB networking, 3FS storage, and the HAI Platform integrate.
Core content:
1. RoCE and IB network technology foundation and selection recommendations
2. Key strategies for network optimization: QoS configuration, routing and congestion control
3. Storage solutions and platform integration practices for AI training platforms
The construction of an AI training platform is a core driver of artificial intelligence development, especially for distributed training and large-scale model training, which span networking, storage, and platform integration. Based on current research and practice, this report discusses in detail how to build an efficient AI training platform across these dimensions: the underlying RoCE or IB network, network optimization, 3FS storage, and High-Flyer's HAI Platform. It is aimed at technical practitioners and decision makers, and strives to be easy to understand.
1. Underlying network: technical foundation of RoCE and IB
AI training has extremely high requirements for network performance, especially distributed GPU training, which requires low latency and high bandwidth to support fast data exchange between multiple nodes. RoCE (RDMA over Converged Ethernet) and IB (InfiniBand) are two key underlying network technologies that are widely used in data center AI infrastructure.
Features and advantages of RoCE :
RoCE runs on existing Ethernet infrastructure and uses RDMA to achieve low-latency, high-bandwidth communication. Research shows that RoCEv2 in particular suits AI training, supporting distributed tasks across thousands of GPUs such as content recommendation, natural language processing, and generative AI model training (see RoCE networks for distributed AI training at scale [1]). It is cost-effective, integrates easily with existing networks, and is suitable for large-scale deployment.
For example, Meta has expanded its RoCE network to multiple clusters, each supporting thousands of GPUs, covering production tasks such as ranking and content understanding.
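As a concrete illustration of pointing a distributed training job at a RoCE fabric, the sketch below builds the NCCL environment variables commonly used to enable RDMA over RoCEv2. The interface name, GID index, and traffic class here are site-specific assumptions for illustration, not values from this article; query your NIC (e.g. with `show_gids`) for the actual RoCEv2 GID entry.

```python
import os

def rocev2_nccl_env(ifname: str = "eth0", gid_index: int = 3) -> dict:
    """Build NCCL environment variables for RDMA over RoCEv2.

    All values are deployment-specific assumptions for this sketch.
    """
    return {
        "NCCL_IB_DISABLE": "0",               # keep the IB transport on (it also drives RoCE)
        "NCCL_SOCKET_IFNAME": ifname,         # interface for bootstrap / out-of-band traffic
        "NCCL_IB_GID_INDEX": str(gid_index),  # GID table entry that maps to RoCEv2
        "NCCL_IB_TC": "106",                  # traffic class; must match the switch QoS plan
    }

# Apply before initializing the distributed backend (e.g. torch.distributed with NCCL):
os.environ.update(rocev2_nccl_env(ifname="eth2"))
```

The same variables apply whether the job is launched via `torchrun`, MPI, or a scheduler; they only need to be set in each worker's environment before NCCL initializes.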
IB's performance and applicable scenarios :
IB is known for ultra-low latency and very high bandwidth, making it particularly suitable for AI training environments with extreme performance requirements. However, it typically requires dedicated hardware and is comparatively expensive, so it is more common in scientific research or well-funded projects (see InfiniBand vs. RoCE: Choosing a Network for AI Data Centers [2]).
Selection suggestion :
For most enterprises, RoCE is the more economical choice; teams that are latency-sensitive and have sufficient budget can opt for IB instead. Either way, evaluate against actual workloads, and design the network with scalability and compatibility in mind.
For more information about RoCE, please read the previous article: What is RoCE network? What are the advantages compared with IB network?
2. Key strategies for network optimization
Network optimization is the core of ensuring the efficient operation of the AI training platform. It involves multiple technical levels and aims to reduce bottlenecks and improve overall performance.
QoS (Quality of Service) configuration :
Traffic from AI training tasks needs to be prioritized: QoS settings ensure that critical data transfers are not disrupted by other network activity. For example, configuring priority queues can reduce latency jitter during training.
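A minimal sketch of traffic marking from the host side: tagging a socket with a DSCP value so switch QoS queues can classify its packets. The DSCP class chosen here is an assumption for illustration; real deployments pick values matching the fabric's QoS plan, and RoCE traffic itself is usually marked by the NIC driver rather than application sockets.

```python
import socket

# DSCP is a 6-bit field occupying the upper bits of the IP TOS byte.
DSCP_AF41 = 34  # example class for latency-sensitive traffic (assumption)

def mark_socket_dscp(sock: socket.socket, dscp: int) -> None:
    """Tag outgoing packets so switch QoS queues can prioritize them."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mark_socket_dscp(sock, DSCP_AF41)
```

On the switch side, the matching half of the configuration maps this DSCP value to a strict-priority or weighted queue; both halves must agree for end-to-end prioritization to work.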
Routing and congestion control :
Multipath routing techniques such as ECMP (Equal-Cost Multi-Path) spread flows across equal-cost paths to avoid congestion hot spots, and adaptive variants adjust paths dynamically as links congest. Studies have shown that congestion control mechanisms such as ECN (Explicit Congestion Notification) significantly improve network stability under high load (see Scaling RoCE Networks for AI Training [3]).
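The flow-hashing idea behind ECMP can be sketched in a few lines: a deterministic hash of the flow's 5-tuple picks one of N equal-cost paths, so packets of one flow never reorder while many flows spread across links. This is a toy model, not a switch implementation (hardware uses its own hash functions).

```python
import hashlib

def ecmp_pick_path(src_ip: str, dst_ip: str, src_port: int,
                   dst_port: int, proto: str, n_paths: int) -> int:
    """Hash the flow 5-tuple to one of n equal-cost paths.

    Deterministic: the same flow always maps to the same path,
    which preserves per-flow packet ordering.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Same flow, same path (no reordering); different flows spread out.
p1 = ecmp_pick_path("10.0.0.1", "10.0.1.2", 49152, 4791, "UDP", 8)
p2 = ecmp_pick_path("10.0.0.1", "10.0.1.2", 49152, 4791, "UDP", 8)
assert p1 == p2
```

This also exposes ECMP's weakness for AI traffic: a few large "elephant" flows can hash onto the same link, which is one motivation for the adaptive and congestion-aware schemes mentioned above.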
Scalability design :
As AI clusters grow rapidly, the network must support more GPUs and nodes. Optimizations include higher bandwidth (such as 200Gbps or faster InfiniBand NICs), link aggregation, and distributed topology design to keep performance scaling close to linear.
The goal of network optimization is to create an efficient and stable communication environment to support the complex needs of AI training.
3. 3FS Storage: Performance Accelerator for AI Training
The storage system is another key component of an AI training platform; traditional file systems struggle with the access patterns of massive datasets. 3FS (Fire-Flyer File System) is a distributed file system optimized for AI training and inference. It leverages modern SSDs and RDMA networks to deliver high-throughput, low-latency storage.
Technical architecture :
3FS uses a decentralized architecture in which thousands of SSDs and hundreds of storage nodes work together, ensuring transparency and location independence of data access (see 3FS: Innovation in Distributed Storage for AI [4]). It is built on Chain Replication with Apportioned Queries (CRAQ) to ensure strong consistency while simplifying application development.
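To make the CRAQ mechanism concrete, here is a toy sketch (a teaching model, not 3FS code): writes propagate from the head of the chain toward the tail, the tail commits immediately, and the tail's acknowledgment travels back marking interior versions clean. Reads can hit any replica; only a dirty key is resolved via the tail, which is what lets CRAQ spread read load across the whole chain.

```python
class CraqChain:
    """Toy Chain Replication with Apportioned Queries (CRAQ)."""

    def __init__(self, n: int):
        self.nodes = [{"clean": {}, "dirty": {}} for _ in range(n)]

    def begin_write(self, key, value):
        """Propagate a write head -> tail; the tail commits on receipt."""
        for node in self.nodes[:-1]:
            node["dirty"][key] = value        # uncommitted at interior nodes
        self.nodes[-1]["clean"][key] = value  # tail's version is authoritative

    def commit(self, key):
        """Tail's ack travels back up the chain, marking versions clean."""
        for node in self.nodes[:-1]:
            if key in node["dirty"]:
                node["clean"][key] = node["dirty"].pop(key)

    def read(self, key, node_idx):
        node = self.nodes[node_idx]
        if key in node["dirty"]:              # dirty: resolve via the tail
            return self.nodes[-1]["clean"].get(key)
        return node["clean"].get(key)

chain = CraqChain(3)
chain.begin_write("x", "v1")
mid_write = chain.read("x", 0)   # mid-write read is still consistent (via tail)
chain.commit("x")
committed = chain.read("x", 1)   # now served locally from the clean version
```

The payoff is that strongly consistent reads scale with the number of replicas instead of bottlenecking on a single tail node, which suits read-heavy AI training workloads.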
AI Optimization Features :
Supports complex training workflows, including parallel checkpointing and inference tasks, without preloading or shuffling datasets.
Provides the ability to randomly access training samples, reduces data preparation time, and improves training efficiency.
A KVCache feature offers a cost-effective alternative to DRAM caching for inference, providing much larger capacity at lower cost (see GitHub - deepseek-ai/3FS [5]).
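The "no preloading or shuffling" point above rests on cheap random reads: with fixed-size records, an epoch's shuffle becomes a permutation of indices and fetching a sample is a single seek. The sketch below illustrates the idea with a plain local file; the record format and 16-byte size are assumptions for the demo, not the 3FS on-disk layout.

```python
import os
import random
import struct
import tempfile

RECORD_SIZE = 16  # bytes per sample record: two little-endian int64s (assumption)

def write_samples(path: str, samples) -> None:
    """Write fixed-size records so record i lives at offset i * RECORD_SIZE."""
    with open(path, "wb") as f:
        for s in samples:
            f.write(struct.pack("<2q", *s))

def read_sample(f, index: int):
    """Seek straight to the record: no preloading, no shuffled copy on disk."""
    f.seek(index * RECORD_SIZE)
    return struct.unpack("<2q", f.read(RECORD_SIZE))

path = os.path.join(tempfile.gettempdir(), "samples.bin")
write_samples(path, [(i, i * i) for i in range(1000)])

order = list(range(1000))
random.shuffle(order)  # the epoch shuffle happens in the index list, not the file
with open(path, "rb") as f:
    batch = [read_sample(f, i) for i in order[:4]]
```

On a local disk, a fully random read pattern like this would thrash; the point of a system like 3FS is that SSDs plus RDMA make such random access fast enough that datasets never need to be pre-shuffled or staged.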
Performance :
Tests show that a 3FS cluster (180 storage nodes, each with 16 14TiB NVMe SSDs and 2×200Gbps InfiniBand NICs) performs well under read stress tests, sustaining concurrent access from 500+ client nodes with throughput far exceeding traditional storage (see DeepSeek Develops Linux File-System For Better AI Training & Inference Performance [6]).
Applicable scenarios :
3FS is particularly suitable for processing large data sets and intermediate output management in AI training, and is suitable for highly data-intensive fields such as autonomous driving and generative AI.
The introduction of 3FS has significantly improved storage performance and reduced the bottleneck of AI training. It is an essential component for building an efficient platform.
For more information about 3FS storage, please read the previous article: DeepSeek's open source high-performance distributed file system: 3FS
4. HAI Platform: A comprehensive solution for integration and expansion
HAI Platform is a comprehensive platform for AI training that integrates RoCE/IB networks, 3FS storage, and software tools to provide an end-to-end solution suitable for large-scale AI training tasks.
Platform features :
Network and storage integration : The HAI platform seamlessly integrates RoCE/IB networks and 3FS storage to ensure high-performance communication and efficient data access.
Scalability : Designed to support thousands of GPUs and massive amounts of data, it is suitable for enterprise-level AI training needs.
User-friendliness : Provides intuitive interfaces and tools that reduce deployment and management complexity, serving both technical teams and non-expert users (features inferred from comparable platforms such as HAI.AI [7]).
Actual value :
The HAI Platform accelerates the AI development cycle and reduces operational complexity by managing network and storage resources in a unified way. For example, it supports parallel checkpointing and distributed training workflows, significantly shortening model training time. Note that High-Flyer's open-source release has not been updated for about two years; it is best used for learning or as a basis for secondary development.
Summary and Outlook
Building an AI training platform requires comprehensive consideration from the underlying network (such as RoCE/IB), network optimization, 3FS storage to the HAI Platform. RoCE and IB provide a high-performance communication foundation, network optimization ensures stability and scalability, 3FS storage accelerates data access, and the HAI Platform integrates resources to improve overall efficiency. The combination of these technologies not only meets current AI training needs, but also lays the foundation for future large-scale development.
As of this writing (March 23, 2025), AI training platform construction is in a stage of rapid development. Enterprises should choose the right combination of technologies for their actual needs and optimize continuously to cope with increasingly complex AI workloads.