Behind GPT-5, there is more than just an all-in-one machine
Updated on: June 18, 2025
Recommendation
In the GPT-5 era, the real competition in the big-model industry lies in building super intelligent computing platforms.
Core content:
1. The current state of the big-model industry: the limits of all-in-one machines as an industry buffer
2. The real AI industry landscape: the computing power infrastructure of the super intelligent computing platform
3. The necessity of the super intelligent computing platform: global scheduling and resource pooling
Yang Fangxian
Founder of 53A / Tencent Cloud Most Valuable Expert (TVP)
Over the past year, China's big-model industry has shown a highly consistent "landing rhythm": major vendors have rushed to launch "big-model all-in-one machines," delivering everything as a package, from software to hardware and from models to chips, as if the moment a customer deploys a machine, the AI future can begin.

But the surface excitement cannot cover up a collectively ignored fact: what really determines the shape of the AI industry is never how many all-in-one machines are sold, but who has the ability to build a super intelligent computing platform.

All-in-one machines are only a tactical stopgap, an industry buffer before the real big-model era arrives. They cannot support the large-scale training, real-time inference, and massive concurrency that future AI will require. Only GPU infrastructure with 100,000 or even millions of cards can support those needs.

This is just like cloud computing: it was not the companies that "sold servers" that won, but the companies that could build CPU computing platforms and offer elastic services that became the masters of infrastructure. Today, the AI industry is repeating that scene, except the protagonist has changed from CPU to GPU, and the end point has been upgraded from "resources as a service" to "intelligence as a service."

We must start facing the fact that in the big-model era, the real "new infrastructure" is the super intelligent computing cluster, together with the ability to schedule and continuously evolve large-scale GPU infrastructure in a coordinated way. Whoever can build a 100,000-card GPU cluster will be able to provide global-scale AI capability; whoever can only sell all-in-one machines is destined to remain at the primary stage of "assembly and delivery."

Why the real challenge of large models is super intelligent computing, not equipment delivery

On the surface, the bottleneck of large models seems to be application: how do we serve enterprise customers well? How do we embed AI into business scenarios? How do we turn AI into something people can actually use?

But if we extend the time scale and raise the perspective from "terminal deployment" to "industrial structure," a more fundamental problem appears: the issue is not that AI cannot be used, but that it cannot be supplied at scale.

☆ Large models are resource-heavy: inference consumes more compute than training

Training a GPT-4-class model can easily cost hundreds of millions. What is even harder comes after launch: every inference call consumes GPU memory, IO, bandwidth, and energy. Once a model is embedded in hundreds of scenarios (search, customer service, documents, code, finance, and more), it is no longer a single "intelligent entity" but real-time intelligent infrastructure. You are not deploying a model; you are operating a "smart power plant" that is always online, handles massive concurrency, and must respond at low latency.
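The claim that inference eventually outweighs training is easy to sanity-check with the standard dense-transformer approximations: training costs roughly 6*N*D FLOPs for N parameters and D tokens, and generating one token costs roughly 2*N FLOPs. The sketch below uses purely illustrative numbers for model size and daily traffic, not figures from this piece.

```python
# Back-of-the-envelope: when does cumulative inference compute overtake training?
# Approximations: training ~ 6*N*D FLOPs; inference ~ 2*N FLOPs per generated token.
# All numbers are illustrative assumptions.

N = 1.0e12         # model parameters (assume a 1T-parameter model)
D = 1.0e13         # training tokens (assume 10T tokens)
train_flops = 6 * N * D                       # ~6e25 FLOPs, paid once

tokens_per_day = 1.0e12                       # assume 1T tokens served per day, fleet-wide
infer_flops_per_day = 2 * N * tokens_per_day  # ~2e24 FLOPs, paid every day

days_to_match = train_flops / infer_flops_per_day
print(f"training:  {train_flops:.1e} FLOPs (one-time)")
print(f"inference: {infer_flops_per_day:.1e} FLOPs per day")
print(f"serving matches the whole training run after ~{days_to_match:.0f} days")
```

Under these assumptions, about a month of serving already equals the entire training run, which is exactly why a deployed model behaves like always-on infrastructure rather than a one-off project.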
☆ Single-point deployment cannot support AI's long-tail demand

All-in-one machines and private deployments do relieve the "security" anxiety of local scenarios, but they have natural limitations:

The computing power is fixed and cannot scale elastically;
The unit cost of compute is likely higher than that of a cloud cluster;
The model cannot be updated, optimized, or improved with feedback in real time;
There is no mechanism for scheduling across models, distributing multiple tasks, or balancing inference load.

It is like using a laptop as a server: it works in the early days, but it collapses once the business scales up. True AI applications cannot be carried by a single all-in-one machine; they require a super intelligent computing platform with global scheduling, on-demand supply, and resource pooling.
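The unit-cost point in that list comes down to utilization: a fixed appliance is paid for around the clock whether or not it is busy, while a pooled cluster keeps the same hardware amortized across many tenants. A minimal sketch, with hypothetical prices, throughput, and utilization levels:

```python
# Effective serving cost for a fixed appliance vs. a pooled elastic cluster.
# Hourly cost, peak throughput, and utilization are hypothetical assumptions.

def cost_per_million_tokens(hourly_cost, peak_tokens_per_sec, utilization):
    """You pay for the whole hour, but only the utilized fraction of peak
    throughput produces billable tokens."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1e6

# Fixed all-in-one machine: paid 24/7, busy mostly during office hours.
appliance = cost_per_million_tokens(hourly_cost=40.0,
                                    peak_tokens_per_sec=2_000,
                                    utilization=0.15)

# Pooled cluster: same hardware economics, but multi-tenant scheduling keeps it busy.
pooled = cost_per_million_tokens(hourly_cost=40.0,
                                 peak_tokens_per_sec=2_000,
                                 utilization=0.70)

print(f"appliance: ${appliance:.2f} per 1M tokens")
print(f"pooled:    ${pooled:.2f} per 1M tokens ({appliance / pooled:.1f}x cheaper)")
```

The hardware is identical in both cases; the entire gap comes from how busy the scheduler can keep it, which is precisely what a single box cannot control.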
☆ The super intelligent computing platform is the "highway, power grid, and water system" of the AI era

Just as the cloud computing era relied on resource pools of tens of thousands of CPU servers to support SaaS, video, social, and payment systems, the infrastructure of the AI era is an intelligent computing cluster of tens or hundreds of thousands of GPUs, supporting:

1. Multimodal intelligent systems
2. Complex task chains (RAG, agents, code generation, etc.)
3. Training and warm-starting models with hundreds of billions of parameters
4. Concurrent response to massive volumes of inference requests

Without super intelligent computing there are no large-scale model services; without elastic clusters there is no industry-level access to intelligence. The real challenge of the big-model era has never been "how to fit the model into a machine," but how to design computing systems that support its continuous evolution, real-time response, and broad service reach.

Foreign players are already reaching the summit; China must catch up

The technical focus of global AI development has shifted from "do you have models" to "can you train and serve ever-larger models." The decisive force behind this shift is exactly this: who has the strongest, largest, most flexible GPU intelligent computing cluster. Overseas, an arms race over super intelligent computing clusters has already begun, with OpenAI, Microsoft, Google, xAI, AWS, Oracle, and others as the main players.

Look at OpenAI first. Its super computing platform has been the decisive assist. Models such as GPT-4 and Sora caused a world-class sensation, but what really supports their rapid iteration is not just algorithms; it is an ultra-large-scale platform for scheduling intelligent computing resources. GPT-4's training reportedly used a cluster of more than 25,000 GPUs, with multi-node parallel training and cross-chip synchronization; GPT-5 and Sora reportedly run on a platform of more than 100,000 cards, with high throughput, high bandwidth, and high energy efficiency. The AI supercomputing centers Microsoft builds for OpenAI keep expanding, with the goal of operating the world's largest GPU scheduling system.

Look at NVIDIA next. It does not just sell GPUs; through NVIDIA DGX Cloud it is building a global AI computing platform, evolving from a "hardware company" into a "global smart-grid infrastructure provider."

As for Google, it runs some of the highest-performance data centers in the world. Its TPU v4/v5 clusters provide petabyte-scale bandwidth, connect tens of thousands of TPU chips, and supply the training power behind the Gemini series. Its Borg scheduling system is practically an "operating system for intelligent computing," supporting load awareness, energy balancing, and task migration during large-model training.

Meta is not far behind. It has publicly stated that it runs a training platform with more than 30,000 GPUs and keeps investing in expansion, building a combination of "open models + self-built training platform + a highly optimized Transformer stack." The LLaMA series can iterate quickly and stably (LLaMA2 to LLaMA3 and on to LLaMA4) because its internal computing capacity is under its own control.

This is why a true AI power is defined not by how many models it has trained, but by whether it can continuously train, serve, and operate world-class models.

Competition abroad is fierce; what about at home? China's leading technology companies saw this trend long ago and are acting on it. Teams such as Baidu Kunlun, Alibaba Cloud, and Huawei Ascend are building autonomous intelligent computing centers; the Beijing-Tianjin-Hebei region, the Yangtze River Delta, and the Guangdong-Hong Kong-Macao Greater Bay Area are advancing a national "computing power scheduling network"; the Chinese Academy of Sciences, Inspur Information, and other institutions have built GPU platforms with thousands to tens of thousands of cards to improve large-model training capability.

But we must face reality: many gaps remain. Unstable chip supply caps cluster sizes; the software ecosystem is immature, and scheduling systems, framework adaptation, and system stability all need optimization; high-speed interconnect technologies (NVLink, InfiniBand, etc.) depend on imports and form a physical bottleneck for cluster expansion; cost control and energy-efficiency optimization have not yet become system-level capabilities. We have "computing power nodes" but lack a "super intelligent computing platform"; we have "stacks of GPU cards" but not yet the "industrial capability for cluster-level AI services." Make no mistake: this is an infrastructure arms race.

Whoever first builds an intelligent computing base of hundreds of thousands or even millions of cards will be able to supply "basic power" to global AI applications. Just as AWS established its dominance in cloud computing not with server counts but with the ability to deliver elastic services globally, the AI industry must first break through in intelligent computing infrastructure if it wants to reach the top.

Super intelligent computing is not a show of skill; it is the foundation of AI platformization

To the public, super computing clusters often look like a "technological spectacle" or an "arms race": burning money, stacking cards, competing on specs. In the real platform competition of AI, however, super intelligent computing has never been about showing off; it is the starting point for building a platform ecosystem, capability services, and industry support.

☆ Without super computing power there is no platform-level AI capability flywheel

A truly sustainable large-model platform must close the loop on five capabilities (a toy scheduler illustrating the third follows the list):

1. Continuous training: high-frequency iteration over new models, new tasks, and new data
2. Low-cost inference: efficient deployment, invocation, and distribution across thousands of industries
3. Multi-tenant, multi-modal scheduling: serving many users and many task types at the same time
4. Adaptive model optimization: automatic compression, acceleration, distillation, and migration to improve real service capability
5. Cost control and energy-efficiency optimization: being genuinely "commercially affordable"

All five links depend on a powerful, flexibly scheduled, large-scale intelligent computing platform. If any link is missing, the model service loop breaks, and the only option left is the old path of "single-point deployment plus manual delivery."
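To make capability 3 concrete, here is a toy sketch of weighted fair queuing across tenants: each tenant accumulates "virtual cost" as it consumes the cluster, and the least-served backlog runs next. The tenant names and token counts are invented; a production scheduler would layer batching, preemption, and placement on top of this core idea.

```python
import heapq
import itertools
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    sort_key: float                       # tenant's consumed virtual cost at submit time
    seq: int                              # tie-breaker so ordering is stable
    tenant: str = field(compare=False)
    task: str = field(compare=False)      # e.g. "chat", "vision", "code"
    tokens: int = field(compare=False)

class FairScheduler:
    """Simplified fair queuing: requests from lightly served tenants sort
    ahead of requests from heavily served ones, so no tenant is starved."""
    def __init__(self):
        self.queue = []
        self.used = {}                    # tenant -> virtual cost consumed so far
        self.seq = itertools.count()

    def submit(self, tenant, task, tokens):
        key = self.used.get(tenant, 0.0)
        heapq.heappush(self.queue, Request(key, next(self.seq), tenant, task, tokens))

    def next_request(self):
        req = heapq.heappop(self.queue)
        self.used[req.tenant] = self.used.get(req.tenant, 0.0) + req.tokens
        return req

sched = FairScheduler()
sched.submit("bank", "chat", 4000)
print(sched.next_request().tenant)        # bank runs first
sched.submit("bank", "code", 4000)        # bank now carries 4000 of virtual cost
sched.submit("retailer", "vision", 500)   # new tenant starts at 0...
print(sched.next_request().tenant)        # ...so retailer runs before bank's backlog
print(sched.next_request().tenant)        # bank
```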
☆ The underlying role of super intelligent computing: the "base power grid" of the entire AI ecosystem

AI is no longer a single function; it is an ecosystem of multimodal, multi-task, multi-role systems running in parallel:

1. Many users simultaneously use AI customer service, AI coding assistants, AI design assistants, AI financial analysts, and more
2. The backend must support concurrent multi-instance inference for models with hundreds of billions of parameters
3. Task A must not affect Task B, and Task B must not slow down Task C
4. All of it under hard constraints on response time, energy consumption, and cost

This complexity cannot be solved by deploying a few all-in-one machines. It is like powering a neighborhood with a diesel generator: you can turn on a light, but you can never light up a city. Super intelligent computing clusters are the power stations that light up cities in the AI era.

☆ Whoever controls the intelligent computing platform controls the "right to distribute AI capabilities"

For large-model companies, super intelligent computing is less a display of "technological muscle" than a ticket to the platform game. Without it, you will always be a model supplier and a tool factory. With it, you can become a service operator, an ecosystem organizer, and a maker of platform rules. This is what AWS is to global developers, and what NVIDIA is to AI developers. The future big-model leader will be determined not by "who has the best model" but by who can build AI capability infrastructure powerful, open, and reliable enough to support the operation of an entire intelligent society.

All-in-one machines are tactical relief; super intelligent computing is the strategic breakthrough for China's AI

In China's current AI industry, the popularity of all-in-one machines reflects not only market realities but also a tangle of factors: insufficient computing supply, chip restrictions, and policy requirements. In the short term they genuinely relieve practical problems such as model deployment, data-residency limits, and security compliance; they are a stopgap. But we must be honest: the all-in-one machine answers "can it be used," while super intelligent computing answers "can we win."

☆ The all-in-one machine is a tactical compromise

It meets local deployment needs: compliance requirements in finance, government, and other industries where data must not leave the premises;
It fits existing procurement habits: enterprise customers are used to buying equipment, and suppliers are used to delivering projects;
It monetizes quickly: vendors close an early commercial loop by bundling hardware and services.

But it cannot solve the deeper problems: models cannot be iterated and updated quickly; limited computing power makes complex multimodal applications hard to support; inference costs are high, resource utilization is low, and ecosystem coordination is difficult. The model works in the early stage of commercialization, but the moment AI capability becomes industrial infrastructure, it is destined to be displaced by intelligent computing platforms of larger scale, higher efficiency, and greater service capacity.

☆ For China's AI, super intelligent computing is not an option; it is a national strategic task

The essence of global AI competition is no longer "whose model is stronger" but "whose computing power base is more controllable, more scalable, and more sustainable." Behind that question, a country is tested on:

Independent chip development;
High-performance networking and interconnect technology;
Green computing power layout (energy-efficiency optimization);
Elastic scheduling systems and model serving systems;
Computing power sovereignty.

From this perspective, all-in-one machines are small-batch products, while super intelligent computing platforms are the true "industrial mother machines of the AI era." We cannot be satisfied with "getting AI running"; we must pursue keeping AI running continuously: faster, farther, and more stably.

What are the key challenges in building a 100,000-card GPU intelligent computing cluster?

Building a cluster with tens of thousands or even hundreds of thousands of GPU cards is, of course, no easy task. It is not simply "stacking servers at a larger scale"; it demands a comprehensive reconstruction of computing architecture, systems engineering, scheduling algorithms, energy strategy, and ecosystem organization. Six core challenges must be addressed.

1. GPU chips and the supply chain: scarcity, dependence, and substitution

High-performance GPUs (NVIDIA A100/H100/H200, GH200) are concentrated in NVIDIA's hands and cannot be freely purchased in China, which restricts large-scale expansion. Independent alternatives (Ascend, Kunlun, Moore Threads, Horizon, etc.) are still maturing and still trail top GPUs in ecosystem, performance, and power consumption. And the chip is only the bottom layer: building a stable supply chain, driver stack, and operations system around it is itself an extremely demanding project. The way forward combines domestic chip substitution, compatibility and adaptation across heterogeneous computing power, and a unified programming abstraction (a unified AI runtime, sketched below).
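In practice, a "unified AI runtime" begins as a thin dispatch layer that keeps the vendor choice out of application code. Below is a minimal sketch in Python on top of PyTorch; the probing order and the torch_npu plugin (Huawei's Ascend adapter for PyTorch) are assumptions about what a given cluster happens to have installed.

```python
# Minimal "unified runtime" sketch: application code asks for an accelerator
# and never hard-codes the vendor. The backends probed here are assumptions
# about the local install, not an exhaustive or authoritative list.
import torch

def pick_device() -> torch.device:
    if torch.cuda.is_available():               # NVIDIA GPUs (or ROCm builds)
        return torch.device("cuda")
    try:
        import torch_npu                        # Ascend plugin, if installed
        if hasattr(torch, "npu") and torch.npu.is_available():
            return torch.device("npu")
    except ImportError:
        pass
    return torch.device("cpu")                  # portable fallback

device = pick_device()
x = torch.randn(1024, 1024, device=device)
y = x @ x                                       # identical model code on any backend
print(f"ran matmul on: {y.device}")
```

Full-stack frameworks such as MindSpore push this much further, but even a thin layer like this removes hard vendor lock-in from application code, which is the precondition for mixing chip vendors inside one cluster.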
2. High-speed interconnect: the invisible killer of cluster performance

At the scale of 10,000 or 100,000 GPUs, inter-GPU communication becomes the decisive bottleneck for training and inference. Current mainstream solutions (InfiniBand, NVLink, PCIe) depend heavily on overseas supply, and "multi-hop" copies of data between GPUs amplify latency and cut throughput, seriously hurting distributed training and inference efficiency. The solution directions: domestic high-speed interconnects (such as Sugon's "Xingchen Interconnect"), low-latency topology design, and co-optimization of GPU scheduling and communication.
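The scale sensitivity is visible in the standard ring all-reduce cost model: every data-parallel step moves roughly 2*(N-1)/N times the gradient payload through each worker's link. A back-of-the-envelope sketch with assumed model size and link bandwidths (illustrative, not measured):

```python
# Ring all-reduce cost model: each of N workers sends and receives about
# 2*(N-1)/N times the gradient payload per step. Sizes and bandwidths below
# are illustrative assumptions.

params         = 100e9                   # 100B-parameter model
bytes_per_grad = 2                       # fp16/bf16 gradients
payload        = params * bytes_per_grad # 200 GB of gradients per step

def allreduce_seconds(n_workers, link_gb_per_s):
    traffic = 2 * (n_workers - 1) / n_workers * payload   # bytes over each link
    return traffic / (link_gb_per_s * 1e9)

for bw in (25, 50, 400):   # 200 Gb/s ~ 25 GB/s; NDR 400 Gb/s ~ 50 GB/s; multi-rail higher
    t = allreduce_seconds(n_workers=1024, link_gb_per_s=bw)
    print(f"{bw:>4} GB/s per link -> ~{t:.1f} s of pure communication per step")
```

Even before any overlap with computation, a slow fabric adds seconds of stall to every optimizer step, which is why interconnect, not raw FLOPs, becomes the decisive shortcoming at cluster scale.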
3. System scheduling and elastic resource management

Good scheduling and resource management require a stack of capabilities. Large-model training typically needs multi-node synchronization, fine-grained partitioning of parallel work, and fault-tolerant rescheduling; a 10,000-card-class scheduler must support job-aware task scheduling, multi-tenant model scheduling, tiered scheduling of inference and training, and task preemption with cold-start optimization. Most scheduling systems in mainstream use in China today (Slurm, Kubernetes, YARN) have not been deeply optimized for large-scale distributed AI training and inference. The future requires an "intelligent-computing-native operating system" for AI workloads and a unified scheduling center (such as the Borg-like system behind OpenAI).

4. Software stack and model compatibility: a unified ecosystem from chips to APIs

A super intelligent computing platform cannot run just one family of models. It must support many model types (language, vision, multimodal, speech); multiple frameworks (PyTorch, TensorFlow, MindSpore); and heterogeneous chips, models, optimizers, and fine-tuning pipelines from many vendors. Without a unified closed loop covering model development, deployment, scheduling, and monitoring, the platform degenerates into a jigsaw of isolated systems. Hence the need for a unified AI development and operation platform (MindSpore is one domestic example) that makes models portable across chips and frameworks.

5. Energy-efficiency control and green computing power layout

The power draw of 100,000 GPU cards approaches the level of a mid-sized city's grid. The concrete problems: power supply pressure (a single GPU server can draw 3-6 kilowatts); heat dissipation (large clusters require custom liquid-cooling or fluorine-cooling systems); and operations (outages have a wide blast radius, faults are hard to localize, and thermal runaway is a serious risk). Solutions start with power-aware intelligent scheduling and the adoption of energy-efficient AI chips (such as purpose-built inference chips). Liquid cooling in particular has become the key to breaking the deadlock; for details, see the Data Ape article "Is it time to use 'liquid cooling'?"
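The "mid-sized city" comparison can be sanity-checked from the 3-6 kW per-server figure above. The GPUs-per-server count and the PUE (facility overhead) in this sketch are illustrative assumptions:

```python
# Cluster power draw from the per-server figure above (3-6 kW per GPU server).
# GPUs per server and PUE are illustrative assumptions.

gpus = 100_000
gpus_per_server = 8
servers = gpus / gpus_per_server                  # 12,500 servers

for kw_per_server in (3, 6):
    it_load_mw = servers * kw_per_server / 1000   # IT load in megawatts
    facility_mw = it_load_mw * 1.3                # assume PUE ~1.3 for cooling and losses
    annual_gwh = facility_mw * 24 * 365 / 1000    # continuous draw over a year
    print(f"{kw_per_server} kW/server -> {it_load_mw:.0f} MW IT, "
          f"~{facility_mw:.0f} MW facility, ~{annual_gwh:.0f} GWh/year")
```

Roughly 40 to 100 megawatts of continuous draw is indeed city-scale load, which is why siting, green energy, and liquid cooling become first-order design decisions rather than afterthoughts.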
6. Service capability and a closed commercial loop

Beyond the technical challenges, building a healthy commercial loop is just as critical. A super cluster is not a "research project"; it must become "computing as a service" (CaaS) business infrastructure with the ability to serve external tenants: companies renting GPUs on demand, inference exposed through API calls, and security isolation, billing, operations, and monitoring across the whole pipeline. The direction here is a platform-based operating model: learning the productization discipline of AWS and Azure, and exploring a twin engine of "model as a service" plus "computing power as a service."

Building an intelligent computing cluster, in short, is not about filling cabinets with GPUs; it is about creating operating-system-grade infrastructure for intelligence that can carry the wave of AI services over the next decade.

In this era of rapid AI evolution, we talk about model capability, application landing, and industry integration, with "new breakthroughs" and "new concepts" announced almost daily. Few people notice that what really determines the AI landscape is not only who can produce a SOTA model, but who can support its continuous evolution, large-scale deployment, and elastic service.

The starting point of all of this, in the end, is super computing power. And super computing power is not about "how many GPUs you have"; it is about whether you can organize them to run as efficiently as the power grid, as flexibly and openly as a cloud platform, and as broadly as an operating system serving thousands of industries.

Over the past decade, the rise of cloud computing reshaped enterprise IT and created platform giants such as AWS, Azure, Alibaba Cloud, Tencent Cloud, and Huawei Cloud. In the next decade, AI computing infrastructure will determine who dominates the next platform order.