On the infrastructure evolution discussed in DeepSeek-v3

How the DeepSeek-v3 team tightly couples algorithms and infrastructure to co-optimize computing power.
Core content:
1. The DeepSeek team's deep background in both algorithms and infrastructure
2. Why computing power should be a jointly optimized variable, with practical cases
3. Challenges and solutions in coordinating computing power and algorithms in quantitative trading
My impression after reading DeepSeek-v3 is that its algorithms and infra are very tightly integrated. In many large-model teams, the algorithm and infra sides are quite separate; people who deeply understand both are rare, and DeepSeek is one of the teams that has them. There are presumably quite a few OI (competitive programming) medalists on the DeepSeek team. For those of us with an OI background, the optimization tricks in these computations are easy to pick up, and in many cases we also studied processor architecture in depth, so doing algorithms and infra at the same time comes naturally. Most newcomers in algorithm roles today, however, have rather limited coding ability...
Of course, Zha B (this author's self-deprecating handle) is bragging a little: he knows the underlying chips and their interconnects better than DeepSeek does, and perhaps a bit more math... Yesterday I told a friend a bad pun: isn't the Quantization in FP8 training just Quant turning into 渣 ("zha", scum), Quant-za-tion? ^o^
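Since the pun is about FP8 quantization (which DeepSeek-v3 uses in training), here is a toy sketch of what rounding to an FP8-like E4M3 grid means: a simulated round-trip in plain Python. This is only an illustration; it ignores subnormals and NaN and is not DeepSeek's actual recipe.

```python
import math

E4M3_MAX = 448.0          # largest finite value in OCP FP8 E4M3
MANTISSA_BITS = 3

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest value on a simplified E4M3 grid
    (sign, 4-bit exponent, 3-bit mantissa).
    Subnormal and NaN handling are omitted for clarity."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)          # saturate at the format max
    e = math.floor(math.log2(a))       # exponent of the leading bit
    step = 2.0 ** (e - MANTISSA_BITS)  # spacing between representable neighbors
    return sign * round(a / step) * step

# FP8 training keeps a scale factor so values occupy the full range.
def quantize_with_scale(x: float, amax: float) -> float:
    scale = E4M3_MAX / amax            # map |x| <= amax onto the grid
    return quantize_e4m3(x * scale) / scale

print(quantize_e4m3(3.1415))   # -> 3.25 (only 3 mantissa bits survive)
```

The point of the pun, and of the format: with 3 mantissa bits almost everything becomes "scum", and the per-tensor scale is what keeps training usable anyway.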
1. Computing power should no longer be just a constraint, but a variable that can be jointly optimized
In fact, many years ago, when the Alimama team introduced deep learning into the recommendation system, they did a lot of work on co-designing algorithms and computing power. We strongly agree with what Guorui Zhou said: "Computing power should no longer be merely a constraint, but a variable that can be jointly optimized."
At the beginning of this year I also wrote up this line of algorithm/compute co-evolution; you can refer to
"Talking about AI businesses that are easy to land: search and recommendation"
In fact, quantitative trading is much like search and recommendation: it must also balance computing power and algorithms under a hard time constraint. Many high-frequency trading strategies are even harder, involving a whole chain of hardware/algorithm co-design, sometimes even sacrificing stability. For example, some HFT teams still overclock consumer CPUs for faster compute; on many NICs, even a single register is worth fighting over...
When DeepSeek/High-Flyer made building large models a main business, the whole team's firepower was naturally at full blast... Of course, Zha B, having taken part in designing the trading networks of almost every domestic exchange, stayed out of the high-frequency field for compliance reasons and his own professional ethics...
On the other hand, Zha B still has many disagreements with today's Transformer-based large-model architecture. It is certainly not the final form on the road to AGI, because a scaling-law style of algorithm that relies on extreme amounts of computing power is essentially a wrong turn. So Zha B spends more time on optimizing the underlying compute and on the mathematical principles behind the top-level algorithms.
On the underlying-compute side, the main focus is GPU microarchitecture analysis, Tensor-computation-related work, and high-speed interconnects for AI accelerators.
《The Evolution of GPU Architecture》
Tensor Operations
AI Accelerator Interconnect
On the mathematics side (well, "mathematics" as learned from the great God J), I have long made a bold claim: the mathematical foundation of this AI revolution is category theory / algebraic topology / algebraic geometry, the first 20th-century mathematics to step onto the stage of commercial computing. So I have been doing some dedicated study.
Mathematical foundations of large models
I have read some papers recently, such as a topos-theoretic perspective on multimodal large models, and things like Grothendieck graph neural networks. It feels like a glimmer of light, but such work remains the province of a few heroic individuals in this world: the romance of pen and paper.
Of course, many people doubt that these algebraic ideas, and the sparse-compute efficiency issues of GNNs themselves, have anything to do with AGI. But they may well be among the most wonderful structures in the human brain. Reading about MTP (multi-token prediction) yesterday, one viewpoint came to mind:
MTP reminds me of Zen 5's very interesting 2-Ahead Branch Predictor work. In fact, for a model like o3, a token is essentially an instruction.
In this view, GPT's original mode is sequential execution: predict the next token, much like pc++, while operating on the stack (the history of tokens as the stack).
o1/o3-style large reasoning models, whether through MoE or reinforcement-learning components such as a PRM, are essentially a divergence on token prediction: jumps, loops, backtracking, and so on. The PRM can be regarded as the CPU's branch predictor. From a systems-architecture perspective, large models can gradually approach Turing-complete processing capability.
From this standpoint, the current GPU's TensorCores/CUDA cores actually constitute the execution engine, which needs a set of control logic, branch predictors, decoders, and LSUs to cooperate. There are still many interesting topics to explore in how the infrastructure evolves.
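The CPU analogy above can be made concrete with a toy decode loop. Everything here is invented for illustration: the "model" is a deterministic stand-in, and a made-up scoring function plays the role of the PRM-as-branch-predictor.

```python
# Toy illustration of the analogy: sequential decoding is `pc++`,
# while a reasoning model may branch, scored by a PRM-like function.

def sequential_decode(model, prompt, steps):
    """GPT-style decoding: always take the single next token (pc++)."""
    ctx = list(prompt)                 # the history of tokens is the "stack"
    for _ in range(steps):
        ctx.append(model(ctx)[0])      # top-1 continuation only
    return ctx

def branching_decode(model, prm, prompt, steps):
    """LRM-style decoding: consider several continuations (branch
    targets) and let a PRM-like scorer act as the branch predictor."""
    ctx = list(prompt)
    for _ in range(steps):
        candidates = model(ctx)        # several possible "jump targets"
        ctx.append(max(candidates, key=lambda t: prm(ctx, t)))
    return ctx

# Deterministic stand-in "model": proposes the next two integers.
toy_model = lambda ctx: [ctx[-1] + 1, ctx[-1] + 2]
# Stand-in "PRM" that prefers even tokens (an arbitrary rule).
toy_prm = lambda ctx, t: t % 2 == 0

print(sequential_decode(toy_model, [0], 4))          # -> [0, 1, 2, 3, 4]
print(branching_decode(toy_model, toy_prm, [0], 4))  # -> [0, 2, 4, 6, 8]
```

The sequential path walks straight ahead like a program counter, while the branching path diverges wherever the scorer says so; that divergence is the part the text attributes to PRM-style reasoning.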
Another bold claim: the current Transformer model itself is the datapath for generating tokens, while things like Grothendieck graph neural networks and the related algebraic structures are the model's control path. That is one way to run an LRM.
2. Evolution of hardware and architecture
The implementation of DeepSeek-v3 is also very elegant. For example, to work around the H800's cut-down NVLink bandwidth, training avoids tensor parallelism (TP) altogether; MoE's all-to-all is then optimized to the extreme with techniques such as PXN and IBGDA, plus warp specialization and DualPipe.
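To see why all-to-all is the communication pattern worth this much optimization effort, here is a minimal sketch of MoE token dispatch. The router is a fake hash for illustration (real routers are learned gates), and no actual communication happens; PXN/IBGDA/DualPipe themselves are far beyond a few lines.

```python
# Toy MoE dispatch: every rank holds some tokens, and routing sends
# each token to the rank that owns its expert -- an all-to-all exchange.

NUM_RANKS = 4
EXPERTS_PER_RANK = 2

def route(token_id: int) -> int:
    """Fake top-1 router: pick an expert by hashing the token id."""
    return token_id % (NUM_RANKS * EXPERTS_PER_RANK)

def build_alltoall_buckets(tokens_per_rank):
    """For each source rank, bucket its tokens by destination rank.
    buckets[src][dst] is what src sends to dst in the all-to-all."""
    buckets = [[[] for _ in range(NUM_RANKS)] for _ in range(NUM_RANKS)]
    for src, tokens in enumerate(tokens_per_rank):
        for t in tokens:
            dst = route(t) // EXPERTS_PER_RANK   # rank owning the expert
            buckets[src][dst].append(t)
    return buckets

tokens = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
buckets = build_alltoall_buckets(tokens)
print(buckets[0])   # what rank 0 sends to ranks 0..3
```

Every rank sends a different-sized bucket to every other rank each layer, which is exactly the irregular, latency-sensitive traffic that the PXN/IBGDA work attacks.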
By contrast, look at Meta: at last year's OCP they were still issuing a call for action on all-to-all, and as for Llama 3's MoE, rumor via Mu Li has it that their MoE training runs failed... no wonder they spent ten times the money...
Back to the future hardware requirements the DS team raises: for example, 20 of the H800's 132 SMs are currently given over to communication, which argues for a dedicated communication coprocessor; and to reduce application-programming complexity, they hope the hardware can unify the scale-out and scale-up networks from the compute unit's point of view, so that through one unified interface the compute unit can submit communication requests using simple primitives.
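What "simple primitives over a unified interface" might look like from the compute unit's perspective can be sketched as a tiny memory-semantic API. All names here are hypothetical, and plain Python lists stand in for remote memory windows; a real device would map them onto the intra-host or inter-host fabric transparently.

```python
# Hypothetical sketch of a unified memory-semantic communication
# interface: the compute unit issues read/write/multicast/reduce
# against remote memory windows, without caring whether the peer
# is scale-up (intra-host) or scale-out (inter-host).

class UnifiedComm:
    def __init__(self, num_peers: int, window_size: int):
        # one "memory window" per peer
        self.windows = [[0] * window_size for _ in range(num_peers)]

    def write(self, peer, offset, values):
        self.windows[peer][offset:offset + len(values)] = values

    def read(self, peer, offset, length):
        return self.windows[peer][offset:offset + length]

    def multicast(self, peers, offset, values):
        for p in peers:                  # one submit, many destinations
            self.write(p, offset, values)

    def reduce(self, peers, offset, length):
        # sum the same window region across peers (e.g. a gradient sum)
        acc = [0] * length
        for p in peers:
            for i, v in enumerate(self.read(p, offset, length)):
                acc[i] += v
        return acc

comm = UnifiedComm(num_peers=4, window_size=8)
comm.multicast([0, 1, 2, 3], offset=0, values=[1, 2, 3])
print(comm.reduce([0, 1, 2, 3], offset=0, length=3))   # -> [4, 8, 12]
```

The point of the sketch: once everything is a memory window, multicast and reduce become single submissions from the compute unit instead of SM-resident communication code, which is exactly the role the text assigns to a communication coprocessor.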
In fact, Zha B laid all of this out years ago and built a series of PoCs. In 2018, when the Transformer appeared, models began to grow, and communication became a bottleneck, Zha B was doing early AI-infra research at Cisco, and was among the first to bring deep learning models into Cisco routers for a series of performance-assurance and security-assurance services.
Then in 2020, after some discussions with 4Paradigm, we designed and implemented NetDAM. Today you will find Tesla's TTPoE doing the same thing.
NetDAM Special Topic
Today, you will find that the future hardware evolution DeepSeek describes is fully realized within this framework.
First, it presents a standard memory interface to the GPU: through a region of memory on NetDAM, scale-out (inter-host) and scale-up (intra-host) communication are completely unified under memory semantics. The Read/Write/multicast/reduce operations DS mentions are likewise functions NetDAM supported from day one. Where, for example, RoCE must access GPU memory multiple times and pull in CPU control flow, NetDAM offloads this directly.
As for the near-memory computing DS mentions later, quantization, scaling and the like, NetDAM is naturally the best attachment point. For example, many people praise Mellanox for low latency; NetDAM can easily beat it by bypassing PCIe latency entirely.
But the world is not perfect, because where you sit determines where you stand. Cisco poured all its effort into Silicon One; Intel guarded its UPI while chasing CXL; and Nvidia merged IB and NVSwitch into a single switching chip in the B200 generation, though these will eventually be separated again.