Zhiyuan open-sources FlagOS upgrade: full-parameter DeepSeek-R1 can be deployed efficiently on multiple AI chips for the first time

Zhiyuan has open-sourced an upgraded FlagOS that enables efficient multi-chip deployment of DeepSeek-R1, breaking down ecosystem walls and advancing the development of large AI models.
Core content:
1. The DeepSeek-R1 multi-chip version, built on the open-source FlagOS stack, enables inference across different AI chip architectures
2. Rigorous evaluation ensures that the DeepSeek-R1 versions on different chips are aligned with the NVIDIA version
3. Source code, model files, Docker images, and more are fully open-sourced, lowering the barrier to use and improving deployment efficiency
Recently, DeepSeek-R1 achieved performance comparable to first-tier models at a low training cost and was fully open-sourced, triggering large-scale deployment and real-world application, with demand for inference compute growing rapidly. Building on FlagOS, an open-source unified software stack for large models that supports a variety of AI chips, Zhiyuan Research Institute has jointly developed and open-sourced multi-chip versions of DeepSeek-R1 with multiple chip manufacturers, aiming to promote the adaptation and application of large models on different chips, break down ecosystem walls and compute constraints, and build a unified technology stack and open-source software-hardware ecosystem spanning multiple chips.

This release is the industry's first multi-chip open-source version of DeepSeek-R1 implemented through a unified open-source software stack, accompanied by rigorous model-alignment results that ensure both open availability and consistent ease of use. It brings users the following benefits.
Code unification: the same open-source codebase and underlying framework implement DeepSeek-R1 inference on different AI chip architectures, promoting a unified and open ecosystem.
Accuracy alignment: following a scientific and rigorous methodology, Zhiyuan strictly evaluated each chip vendor's multi-chip release against DeepSeek-R1 running on NVIDIA chips, ensuring that the DeepSeek-R1 versions on different chip architectures are aligned with, and perform as well as, the original NVIDIA version. This alignment evaluation is based on Zhiyuan's FlagEval large-model evaluation system, and the results can be viewed on the Hugging Face and ModelScope platforms.
Fully open source: the source code of the multi-chip versions, the DeepSeek-R1 model files for each chip, and one-stop Docker runtime images for each chip are published on GitHub/Gitee, Hugging Face, ModelScope, cloud vendors' image repositories, and other platforms, making them easy for developers and users to obtain.
Efficient and easy to use: on top of the base image adapted for each chip, the FlagOS core components are pre-installed, including the heterogeneous parallel training framework FlagScale and the general-purpose large-model operator library FlagGems. From there, the DeepSeek-R1 model service and automatic distributed-inference tuning can be deployed with one click, and an OpenAI-compatible API is provided, greatly lowering the barrier to use and improving deployment efficiency.
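Because the deployed service exposes an OpenAI-compatible API, it can be exercised with any standard HTTP client. Below is a minimal sketch; the host, port, path, and model name are assumptions that depend on how the FlagOS image launches the service on your machine.

```python
import json
from urllib import request

# Assumed values: adjust to match how the service is exposed locally.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "DeepSeek-R1"

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style /chat/completions payload."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the payload to the local service and return the model's reply."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the service to be running):
#   reply = chat("Summarize the FlagOS stack in one sentence.")
```

Any existing OpenAI SDK client can also be pointed at the same endpoint by overriding its base URL, which is what makes the drop-in compatibility useful in practice.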
FlagOS is a unified, open-source system-software stack for multiple AI chips, led by Zhiyuan and co-developed with multiple manufacturers. It includes FlagScale, an efficient parallel training framework supporting multiple AI chips; FlagAttention and FlagGems, high-performance operator libraries supporting multiple AI chip architectures; and FlagCX, a unified communication library supporting multiple AI chips. FlagOS aims to provide users with unified open-source system software on NVIDIA and other AI chips, supporting the efficient and easy use of various large models across different AI chips and thereby breaking compute constraints.
The multi-chip version of DeepSeek-R1 built on FlagOS can launch FlagScale with one click to run parallel inference of the 671-billion-parameter model across chips, letting users flexibly select compute combinations as needed while parallel inference is orchestrated automatically. FlagScale automatically optimizes the distributed parallel strategy according to the compute characteristics of different AI chips, ensuring optimal resource allocation and efficient utilization and improving overall deployment performance. FlagScale provides a unified, simple command-execution mechanism, so users can quickly and seamlessly deploy services on various hardware platforms with the same command. The underlying high-performance operator library FlagGems provides open-source CUDA replacements for 25 general-purpose operators, with replacement of fused operators planned for the next release, supporting rapid migration of models to multiple chips. Thanks to FlagScale's unified Runner mechanism and its deep integration with FlagGems, users only need to add an environment variable in the configuration file to switch seamlessly to the FlagGems operator library for inference.
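To see why cross-chip parallel inference is necessary at this scale, a back-of-the-envelope memory calculation helps. This is illustrative arithmetic only, not FlagScale's actual planner, which also accounts for activations, KV cache, and expert placement.

```python
def weight_gib_per_device(n_params: float, bytes_per_param: int,
                          tensor_parallel: int, pipeline_parallel: int) -> float:
    """Approximate weight memory per device in GiB, ignoring activations
    and KV cache; weights are sharded across TP * PP devices."""
    total_bytes = n_params * bytes_per_param
    return total_bytes / (tensor_parallel * pipeline_parallel) / 2**30

# DeepSeek-R1 has ~671 billion parameters; BF16 stores 2 bytes per parameter.
full_model = weight_gib_per_device(671e9, 2, 1, 1)   # ~1250 GiB: far beyond any single device
sharded    = weight_gib_per_device(671e9, 2, 8, 2)   # TP=8 x PP=2 -> ~78 GiB per device
```

Even before activations and KV cache, the weights alone exceed a terabyte in BF16, which is why the stack must split the model across many accelerators and why an automatic parallel strategy matters.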
Model and related file downloads
ModelScope address:
https://www.modelscope.cn/organization/FlagRelease
Hugging Face address:
https://huggingface.co/FlagRelease
Detailed steps
Based on FlagOS, users can set up the environment and deploy the model on a supported AI chip server in just a few steps. For detailed steps, please refer to the model README we provide (the following link uses MetaX as an example).
https://www.modelscope.cn/models/FlagRelease/DeepSeek-R1-FlagOS-Metax-BF16
Five commands to deploy DeepSeek-R1 from scratch on a non-NVIDIA AI chip server
The multi-chip version of DeepSeek-R1 built on FlagOS ships pre-configured images for each chip, which skip distributed-environment setup and chip-specific configuration to achieve zero-cost adaptation, greatly easing deployment of DeepSeek-R1 on different AI chip servers. The first batch covers AI chips from five different manufacturers, with support for more AI chips coming online soon. Going forward, multi-chip versions of more leading large models will also be released on the FlagOS technology stack.
In terms of accuracy, the cross-chip performance of DeepSeek-R1 on FlagOS is fully aligned with its performance on NVIDIA H100.
DeepSeek-R1-H100-CUDA is the baseline: a CUDA deployment on H100 that closely reproduces the figures in the DeepSeek-R1 technical report.
DeepSeek-R1-H100-FlagOS runs the model with FlagOS on the H100 GPU; its performance matches the baseline, demonstrating the feasibility and consistency of cross-chip deployment.
DeepSeek-R1-FlagOS-Cambricon-BF16 is deployed on Cambricon chips using FlagOS with BF16 precision. Its performance aligns with the baseline, demonstrating the high-performance potential of cross-chip migration.
DeepSeek-R1-FlagOS-Metax-BF16 is deployed on MetaX chips using FlagOS with BF16 precision. Its performance also matches the baseline, further verifying the model's compatibility and stability across different chip platforms.
DeepSeek-R1-FlagOS-Iluvatar-INT8 is deployed on Iluvatar (Tianshu) chips using FlagOS with INT8 quantization. Although quantization slightly reduces performance, the model still maintains high accuracy.
Evaluation results of DeepSeek-R1 based on FlagOS on various chips
Note: 1. These evaluation results are provided by FlagEval. The current release involves performance evaluation on multiple chip platforms, which takes considerable time to complete; we will publish the performance-alignment results for each platform progressively as evaluation proceeds, ensuring that accurate and reliable performance data is available for different hardware environments.
2. This test only verifies the consistency of the migrated model with the NVIDIA version. Because the adapted chip architectures differ from the architecture on which the original parameters were produced, evaluation metrics on each dataset are considered consistent if they differ by no more than 1% under the same numerical precision (and the same quantization strategy).
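The 1% consistency criterion can be expressed as a simple check. The sketch below reads the threshold as a relative difference, which is one plausible interpretation of the note; the scores shown are made up for illustration.

```python
def is_aligned(baseline: float, candidate: float, tol: float = 0.01) -> bool:
    """Treat two benchmark scores as consistent when their relative
    difference is within `tol` (1% by default)."""
    denom = max(abs(baseline), abs(candidate))
    if denom == 0:
        return True  # both scores are zero: trivially consistent
    return abs(baseline - candidate) / denom <= tol

# Illustrative accuracies: H100-CUDA baseline vs. a FlagOS multi-chip build.
assert is_aligned(0.842, 0.839)        # ~0.36% apart -> consistent
assert not is_aligned(0.842, 0.815)    # ~3.2% apart -> flag for review
```

Running this check per dataset, per precision setting, is essentially what the alignment tables on the release pages summarize.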
FlagGems is a general-purpose large-model operator library developed by Zhiyuan together with several companies. Built on the OpenAI Triton language, it supports multiple chip architectures. Leveraging the openness and flexibility of Triton, FlagGems provides a unified, efficient operator-layer entry point for a variety of accelerator hardware. FlagGems is currently the most comprehensive Triton-based general operator library in the world, with the following features:
Comprehensive coverage: more than 140 operators in total, far exceeding comparable projects in breadth of operator types.
Superior performance: average performance exceeds the PyTorch CUDA version by more than 90%.
Multi-backend support: currently supports 7 accelerator backends, with continuous optimization yielding significant speedups.
Innovative technology: unique code-generation and runtime optimizations deliver better secondary-development efficiency and runtime performance than comparable projects.
The FlagGems operator library has preliminarily validated the feasibility of a unified operator layer across multiple chips, while building a full-chain industrial ecosystem spanning model application companies, system integrators, and chip companies. Going forward, the library plans to further improve performance, support more models and chips, and lead both the technology frontier and industrial adoption of a unified ecosystem for heterogeneous multi-chip computing.
FlagScale is an open-source large-model framework for multiple chips, built by Zhiyuan and its ecosystem partners on open-source technology. It aims to maximize compute-resource utilization while ensuring the effectiveness of model training and inference. By providing key components for the full pipeline of model development, training, and deployment, FlagScale aims to become an essential open-source toolkit for optimizing the efficiency and effectiveness of large-model workflows, with the following features:
Leading heterogeneous hybrid-training technology: realizes, for the first time, heterogeneous hybrid training of large models across chips of different generations and architectures, providing a general multi-dimensional heterogeneous hybrid parallel strategy and supporting cross-node RDMA direct connection and CPU-relayed communication between chips from different manufacturers.
Efficient end-to-end training and inference: supports end-to-end pre-training and inference for more than 10 models inside and outside Zhiyuan, covering dense and sparse models in the language and multimodal domains at parameter scales up to hundreds of billions. Under the same LLaVA-OneVision configuration, training efficiency is 1.7x that of DeepSpeed, and multimodal CFG inference efficiency is 3.8x to 6.7x that of HuggingFace.
Cross-chip automatic tuning: provides out-of-the-box auto-tuning tools that find the best parallel strategy with one click from a configuration file, greatly lowering the barrier to distributed training and inference deployment. In real-world tests, auto-tuning improved performance across multiple chips by an average of 11.3%.
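Conceptually, the auto-tuner scores candidates from a search space of parallel degrees like the one enumerated below. This is a simplified sketch, not FlagScale's actual algorithm, which would also weigh communication cost, memory limits, and per-chip throughput.

```python
from itertools import product

def candidate_strategies(num_devices: int, max_tp: int = 8):
    """Enumerate (tensor, pipeline, data) parallel degrees whose product
    exactly covers all devices -- the space an auto-tuner would score."""
    found = []
    for tp, pp in product(range(1, max_tp + 1), range(1, num_devices + 1)):
        if num_devices % (tp * pp) == 0:
            found.append((tp, pp, num_devices // (tp * pp)))
    return found

strategies = candidate_strategies(16)
# Includes e.g. (8, 2, 1): TP8 x PP2, and (4, 2, 2): TP4 x PP2 x DP2.
```

Even for 16 devices the space has many valid layouts, which is why an automatic search beats hand-picking a configuration for each new chip.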
Multi-chip training and inference adaptation: together with manufacturers, we have completed training and inference adaptation on 8 different chips, achieving precision alignment at four levels: operators, pre-training loss, fine-tuning loss, and evaluation results. The work covers models of different sizes in the language and multimodal domains, and end-to-end full training has been completed on thousands of non-NVIDIA chips.
FlagCX is a heterogeneous unified communication library built and open-sourced by Zhiyuan together with ecosystem partners, and an important part of the open-source software stack for diverse compute. It enables efficient cross-node communication between different chips, supports efficient heterogeneous hybrid training of a single task in a multi-chip environment, and provides large-scale adaptive communication optimization, significantly reducing migration costs across chips, scales, and tasks. FlagCX has the following features:
Standardized: functions and interfaces are standardized, greatly reducing manufacturers' adaptation costs.
Compatible: works with frameworks such as PyTorch, with vendor-developed communication libraries, and with standard IB/RoCE network protocols.
Adaptive: automatic tuning mechanisms are provided for different task loads, cluster sizes, and chips from different vendors.
High performance: zero-overhead communication dispatch has been achieved on homogeneous chips, and heterogeneous cross-machine communication reaches more than 90% of peak bandwidth.
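"More than 90% of peak bandwidth" is typically reported as bus bandwidth. The sketch below uses the standard ring all-reduce correction factor from NCCL's performance documentation to convert a timing into that figure; it is illustrative, not FlagCX's actual measurement code.

```python
def ring_allreduce_bus_bw_gbps(bytes_reduced: float, seconds: float,
                               n_ranks: int) -> float:
    """Convert an all-reduce timing to bus bandwidth in GB/s using the
    ring correction: busBW = algBW * 2*(n-1)/n."""
    alg_bw = bytes_reduced / seconds / 1e9          # algorithm bandwidth, GB/s
    return alg_bw * 2 * (n_ranks - 1) / n_ranks

def peak_efficiency(measured_gbps: float, peak_gbps: float) -> float:
    """Fraction of the link's theoretical peak bandwidth achieved."""
    return measured_gbps / peak_gbps

# 1 GB reduced in 10 ms across 8 ranks -> 100 GB/s algBW, 175 GB/s busBW.
bus = ring_allreduce_bus_bw_gbps(1e9, 0.01, 8)
```

Comparing that bus bandwidth against the link's rated peak is what yields an efficiency figure like the ">90%" cited above.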
To better advance the heterogeneous unified communication library FlagCX and accelerate the development and adoption of relevant standards, Zhiyuan is actively building the surrounding software ecosystem, forming a virtuous cycle of industry-academia-research collaboration that accelerates the technology's promotion and application.
FlagEval (Libra) is a large-model evaluation system and open platform launched by Zhiyuan in 2023. It is committed to establishing scientific, fair, and open evaluation benchmarks, methods, and toolsets to help researchers comprehensively assess the performance of foundation models and training algorithms.
FlagEval has progressively released a series of evaluation tools covering large language model evaluation, multilingual text-image model evaluation, text-to-image generation evaluation, and more. Through systematic tooling, the platform not only broadly evaluates large language models and cross-modal models but has also expanded its evaluation scenarios to cover four major fields: natural language processing (NLP), computer vision (CV), audio processing, and multimodal, supporting a variety of downstream tasks. To date, FlagEval has evaluated more than 800 large models from China and abroad, and supports customized online or offline blind tests for four major tasks: language Q&A, multimodal text-image understanding, text-to-image, and text-to-video, providing strong support for comprehensive model evaluation.