In-depth analysis of the technology industry: a Q&A about DeepSeek

In-depth analysis of the technology industry, giving you insight into the story behind DeepSeek.
Core content:
1. Ben Thompson's in-depth analysis of the DeepSeek incident
2. The impact of US-China relations on the technology industry and the far-reaching consequences of the chip ban
3. The significance of DeepSeek's technological breakthrough and analysis of reactions from all walks of life
Q: Why haven’t you written about DeepSeek yet?
Ben Thompson: I did! I wrote about R1 last Tuesday. I still stand by that article, including the two key points I emphasized (emergent chain-of-thought reasoning achieved through pure reinforcement learning, and the power of distillation), and I also mentioned the impact of low training costs and the chip ban. But my observations at the time were confined too narrowly to the current state of AI development; I had no idea the news would reverberate so widely through the larger discussion, especially with respect to US-China relations.
Q: Have you ever had a similar misjudgment?
Ben Thompson: Yes, there was. In September 2023, Huawei released the Mate 60 Pro, equipped with a 7nm chip manufactured by SMIC. The chip's existence did not surprise people who closely follow the field: SMIC had produced 7nm chips a year earlier (I had noted their existence even before that), and TSMC had mass-produced 7nm chips without relying on EUV lithography (only some later 7nm processes introduced EUV). Intel had likewise used DUV lithography to build 10nm chips (roughly equivalent to TSMC's 7nm), though yields were too low to be profitable. So the fact that SMIC could make 7nm chips with existing equipment was not surprising, at least to me.
But what I didn't expect at all was how intense the reaction in Washington would be. The US government's chip ban escalated dramatically, ultimately culminating in the Biden administration moving chip sales to a licensing regime. The reason: many people did not understand the intricacies of chip manufacturing and were caught off guard by the sudden appearance of Huawei's Mate 60 Pro. I feel like something similar has played out over the past 72 hours: the specific technological breakthroughs DeepSeek achieved, and the ones it hasn't, matter less than how people have reacted and what that reaction reveals about their prior assumptions.
Q: So what exactly did DeepSeek release?
Ben Thompson: The most direct trigger for this weekend's discussion was R1, a model with reasoning capabilities comparable to OpenAI's o1. But the information that really shook the market, such as DeepSeek's training costs, was actually disclosed with the V3 release over Christmas. And the key technological breakthroughs underpinning V3 trace back to the V2 release last January.
Q: So is this confusing model naming convention, inherited from OpenAI, OpenAI's "biggest crime"?
Ben Thompson: The second biggest crime, and we'll talk about the biggest one later. Let's start from the beginning: What is the V2 model and why is it important?
DeepSeek-V2 brings two key breakthroughs: DeepSeekMoE and DeepSeekMLA.
The "MoE" in DeepSeekMoE refers to "Mixture of Experts". Models like GPT-3.5 activate the entire model during training and reasoning, but in fact, not all parts contribute to a specific task. MoE improves efficiency by dividing the model into multiple "experts" and activating relevant experts only when needed. It is speculated that GPT-4 is also a MoE model, which may contain 16 experts, each with about 110 billion parameters.
DeepSeekMoE, introduced in V2, innovated on this concept, including finer-grained expert division and shared experts with stronger generalization. DeepSeek also optimized the load balancing and routing mechanisms used during training. MoE models traditionally paid a heavy communication overhead in training in exchange for efficient inference, but DeepSeek's approach made training more efficient as well.
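To make the idea concrete, here is a minimal sketch of a DeepSeekMoE-style layer with shared plus routed experts, assuming PyTorch; every hyperparameter and name here is illustrative, not DeepSeek's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A small feed-forward network; an MoE layer holds many of these."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class MoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=256, n_routed=16, n_shared=2, top_k=4):
        super().__init__()
        # Shared experts see every token; routed experts are chosen per token.
        self.shared = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_shared)])
        self.routed = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        out = sum(e(x) for e in self.shared)
        weights, idx = F.softmax(self.router(x), dim=-1).topk(self.top_k, dim=-1)
        # Naive per-token dispatch for clarity; real implementations batch by expert.
        for t in range(x.size(0)):
            for w, i in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out

y = MoELayer()(torch.randn(8, 512))  # each token activates only 4 of the 16 routed experts
```

The point of the structure: every token pays for the small shared experts plus only top_k routed experts, so the active parameters per token are a fraction of the total parameter count.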
DeepSeekMLA (multi-head latent attention) is an even bigger breakthrough. One of the biggest limitations on inference is memory: you need to load not only the entire model but also the entire context window. The context window is expensive because a key and a value must be stored for every token. DeepSeekMLA compresses that key-value store, dramatically reducing memory usage during inference.
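To see why the KV cache dominates memory, and what compressing it buys, here is a back-of-envelope sketch. All the numbers (layer count, heads, context length, latent size) are illustrative assumptions, not DeepSeek's actual configuration:

```python
# Illustrative model shape; not DeepSeek's real numbers.
layers, heads, head_dim = 60, 128, 128
context_tokens, bytes_per_value = 128_000, 2        # 128K-token window, BF16

# Standard attention: cache one key and one value vector per head, per layer, per token.
kv_bytes_per_token = layers * heads * head_dim * 2 * bytes_per_value
print(f"full KV cache:   {kv_bytes_per_token * context_tokens / 1e9:6.1f} GB")

# MLA-style: cache a single low-rank latent per layer, per token, and
# up-project it back into keys and values at attention time.
d_latent = 512
latent_bytes_per_token = layers * d_latent * bytes_per_value
print(f"latent KV cache: {latent_bytes_per_token * context_tokens / 1e9:6.1f} GB")
# ~503 GB vs ~8 GB in this toy setup: the cache shrinks by the ratio
# (heads * head_dim * 2) / d_latent, here 64x.
```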
The real significance of these breakthroughs, and the part you need to pay attention to, only became apparent with V3. V3 further optimized load balancing (reducing communication overhead even more) and added multi-token prediction in training (densifying each training step, again reducing overhead). The result is that V3's training cost is astonishingly low.
DeepSeek claims that the training of the model consumed a total of 2.788 million H800 GPU hours, and at $2 per GPU hour, the total cost was only $5.576 million. This cost is incredibly low. DeepSeek explicitly stated that this cost only refers to the final training run and does not include all other expenses. The following is the relevant content from the V3 paper:
We once again emphasize the economical training cost of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage was completed in less than two months and cost 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.
So no, you cannot replicate DeepSeek the company for $5.576 million.
Q: I still don’t believe this number.
Ben Thompson: Actually, the burden of proof is on the skeptics, at least once you understand the V3 architecture. Remember DeepSeekMoE: V3 has 671 billion parameters, but only the 37 billion parameters of the active experts are computed for any given token, which equates to 333.3 billion FLOPs of compute per token.
Another DeepSeek innovation matters here: while the parameters are stored in BF16 or FP32 precision, they are cast down to FP8 for computation. In addition, the combined compute of 2048 H800 GPUs is 3.97 exaFLOPs (i.e., 3.97×10¹⁸ FLOPS) at FP8. The training set contains 14.8 trillion tokens; run the numbers and you find that 2.788 million H800 hours is indeed sufficient to train V3. Again, this covers only the final training run, not total costs, but it is a plausible number.
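You can do that arithmetic yourself. A minimal sketch, using the figures quoted above plus the standard "6 FLOPs per parameter per token" training rule of thumb (the rule of thumb is my assumption, not DeepSeek's published number):

```python
# Does 2.788M H800-hours cover 14.8T tokens on a 37B-active-parameter model?
active_params = 37e9
tokens = 14.8e12
required = 6 * active_params * tokens             # ~3.29e24 training FLOPs

cluster_flops = 3.97e18                           # 2048 H800s at FP8, from the text
claimed_gpu_hours = 2.788e6                       # total H800 GPU-hours claimed
wall_seconds = claimed_gpu_hours / 2048 * 3600    # cluster wall-clock time
available = cluster_flops * wall_seconds          # FLOPs at 100% utilization

print(f"required:  {required:.2e} FLOPs")
print(f"available: {available:.2e} FLOPs")
print(f"implied utilization: {required / available:.0%}")  # ~17%, plausible vs FP8 peak
```

An implied utilization well under 100% is the point: the claimed budget has headroom, so the number is internally consistent.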
Q: Alexandr Wang, CEO of Scale AI, said they have 50,000 H100s.
Ben Thompson: I'm not sure where Wang got his data from, but I'm guessing he's referring to a November 2024 tweet from Dylan Patel, which said that DeepSeek had over 50,000 Hopper GPUs. H800s are Hopper GPUs too; they are just far more constrained in chip-to-chip interconnect bandwidth than H100s because of US export controls.
The key point is that many of DeepSeek's innovations are aimed at overcoming exactly that bandwidth limitation. Moreover, if you actually did the math above, you realized that DeepSeek had compute to spare; that is because DeepSeek dedicated 20 of the H800's 132 streaming multiprocessors to managing cross-chip communication.
This is not possible in CUDA (Nvidia's standard programming framework), so DeepSeek's engineers had to drop down to PTX (the low-level instruction set for Nvidia GPUs, roughly their assembly language). That is an insane level of optimization, one that only makes sense if you are stuck with H800s rather than H100s. Meanwhile, DeepSeek also needs those GPUs to serve inference, which means its GPUs do far more than training.
Q: So, does this violate the US chip ban?
Ben Thompson: No. H100s are banned, but H800s are not. Many people assumed that training top-tier models required more inter-chip bandwidth than the H800 offers, but that is precisely the constraint DeepSeek optimized around, adjusting both its model architecture and its infrastructure until the problem was solved.
Again, all of the decisions that DeepSeek made in designing V3 only make sense if they were limited to the H800. If they had access to an H100, they would probably have used a larger training cluster and not done as much optimization specifically for low bandwidth.
Q: So, is the V3 a top-of-the-line model?
Ben Thompson: V3 is indeed competitive with OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, and appears stronger than the largest model in the Llama series. More importantly, DeepSeek may well have obtained high-quality training tokens from those models through distillation.
Q: What is distillation?
Ben Thompson: Distillation is a way to extract knowledge from a stronger model: you send inputs to a teacher model, record its outputs, and use that data to train a student model. This is how you get models like GPT-4 Turbo from GPT-4.
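As a concrete illustration, here is a minimal sketch of classic logit distillation, assuming PyTorch; student and teacher are placeholder models that return logits, and the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, temperature=2.0):
    """One training step: match the student's token distribution to the teacher's."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids)          # (batch, seq, vocab)
    student_logits = student(input_ids)
    # KL divergence between temperature-softened distributions is the
    # classic distillation loss; scaling by T^2 keeps gradients comparable.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    loss.backward()
    return loss
```

When you only have API access rather than raw logits, distillation degrades into ordinary supervised fine-tuning on the teacher's sampled text, which is the more unwieldy variant described next.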
Distillation is easiest for a company to do on its own models, since it has full access; but you can still distill someone else's model, somewhat more awkwardly, through its API, or even, if you get creative, through its chat client.
Of course, distillation usually violates the terms of service of companies like OpenAI, and the only way to prevent it is to directly ban the IP or restrict API access. But it is still a common training strategy, which is why we see more and more models approaching the quality of GPT-4o. We can't be 100% sure whether DeepSeek has distilled GPT-4o or Claude, but honestly, it would be strange if they didn't do so.
Q: Distillation is bad news for top models, right?
Ben Thompson: Exactly! On the bright side, OpenAI, Anthropic, and Google almost certainly use distillation themselves to optimize the inference models they serve in their consumer-facing apps. On the bad side, they bear the full cost of training the leading-edge models while everyone else free-rides on their results.
In fact, this may be the core economic reason why Microsoft and OpenAI are drifting apart. Microsoft is interested in providing inference services, but is less willing to pay for $100 billion in data centers because these models may have been distilled, replicated, and made cheap before they are commercialized.
Q: Is this why all tech stocks are falling?
Ben Thompson: In the long run, model commoditization and cheaper inference, which DeepSeek has just demonstrated, are good for big tech companies:
Microsoft: Dramatically cheaper inference means Microsoft either spends dramatically less on data centers and GPUs, or sees usage surge for the same spend.
Amazon (AWS): AWS has largely failed to produce top-tier AI models of its own, but that matters less if there are high-quality open source models it can serve at ultra-low cost while still profiting from inference.
Apple: Dramatically lower memory requirements for inference make edge inference much more viable, and Apple has the best hardware for it: the unified memory of Apple Silicon is a better fit for AI inference than Nvidia's gaming GPUs.
Meta (Facebook): The biggest winner! All of Meta’s AI initiatives benefit from lower inference costs, making their AI ecosystem easier to implement.
Google: It may actually be worse off. Lower hardware requirements weaken the relative advantage of Google's TPUs, and cheaper inference makes AI products that replace search more viable, threatening Google's core business.
You asked why the stocks fell, and I gave you the long-term picture; today the market is reacting to a short-term shock.
Q: Wait, you haven’t talked about R1 yet?
Ben Thompson: R1 is a reasoning model, similar to OpenAI's o1. It can think through a problem step by step, producing much higher-quality results in domains like coding, mathematics, and logic.
Q: Is R1 more amazing than V3?
Ben Thompson: Actually, I spent so much time on V3 because it is V3 that demonstrated the trends that shook the market. R1 is notable for two other reasons: first, its very existence shows that OpenAI has no "unique advantage that cannot be replicated"; second, R1 is open weights (though not the training data), so you can run it on any server, or even locally, without paying OpenAI.
Q: How does DeepSeek train R1?
Ben Thompson: DeepSeek actually trained two models: R1 and R1-Zero. R1-Zero is the more noteworthy of the two; as I wrote in last Tuesday's update:
R1-Zero is the real deal in my opinion. Their paper states:
In this paper, we take the first step towards improving the reasoning capabilities of language models using pure reinforcement learning (RL). Our goal is to explore the potential of large language models (LLMs) to self-evolve through a pure reinforcement learning process without any supervised data. Specifically, we use DeepSeek-V3-Base as the base model and GRPO as the reinforcement learning framework to improve the model's performance in reasoning. During training, DeepSeek-R1-Zero naturally emerges with many powerful and interesting reasoning behaviors. After thousands of steps of reinforcement learning, DeepSeek-R1-Zero shows extraordinary performance on reasoning benchmarks. For example, the pass@1 score on the AIME 2024 competition increased from 15.6% to 71.0%, and with majority voting, the score was further improved to 86.7%, reaching the level of OpenAI-o1-0912.
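To make the two evaluation numbers in that quote concrete, here is a toy sketch of how pass@1 (estimated as average accuracy over k samples) and majority voting are computed; the sampled answers are made up:

```python
from collections import Counter

samples = ["72", "72", "68", "72", "90"]   # k answers sampled from the model
correct = "72"

# pass@1, estimated as the mean per-sample accuracy over the k samples.
pass_at_1 = sum(a == correct for a in samples) / len(samples)

# Majority voting (cons@k): pick the most common answer, then check it.
majority = Counter(samples).most_common(1)[0][0]

print(pass_at_1, majority == correct)      # 0.6 True
```

Majority voting can beat pass@1, as in the quoted 71.0% vs 86.7%, because it only needs the correct answer to be the plurality, not every sample.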
Reinforcement learning is a machine learning technique in which a model is provided with a set of data and a reward function. A classic example is AlphaGo, where DeepMind provided the model with the rules of Go and set "winning the game" as the reward function, and then let the model figure out all the other strategies on its own. This approach ultimately proved to be more effective than other more human-guided techniques.
However, to date, large language models have relied primarily on “reinforcement learning with human feedback” (RLHF), a process in which humans are involved to help guide the model and solve difficult problems where the rewards are not clear enough. RLHF is a key innovation in the evolution of GPT-3 into ChatGPT, allowing the model to generate well-structured paragraphs and provide concise responses without going off topic or generating meaningless content.
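At the heart of RLHF is a reward model trained on human preference pairs, whose scores then serve as the RL reward. A minimal sketch of that training objective, assuming PyTorch (the function and tensor names are mine):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry objective: push the score of the human-preferred
    response above the score of the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy scores a reward model might emit for one (preferred, rejected) response pair.
print(reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3])))
```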
R1-Zero, however, drops the "HF" part: no human feedback, only reinforcement learning. The DeepSeek team gave the model a set of math, code, and logic problems and set up two reward functions: one rewarding correct answers, the other rewarding well-formed output that exhibits the reasoning process. Notably, the method is relatively simple: rather than step-by-step process supervision, or searching over all possible answers the way AlphaGo does, DeepSeek encouraged the model to try several different answers at a time and then graded them with the two reward functions.
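Here is a minimal sketch of that recipe in Python. The `<think>`/`<answer>` tag format follows the paper's template, but the reward details and the group-relative scoring (the idea behind GRPO) are simplified assumptions, not DeepSeek's exact implementation:

```python
import re

def format_reward(completion: str) -> float:
    """Reward output that wraps its reasoning and answer in the expected tags."""
    ok = re.search(r"<think>.+</think>\s*<answer>.+</answer>", completion, re.S)
    return 1.0 if ok else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward a verifiably correct final answer; math and code make this checkable."""
    m = re.search(r"<answer>(.+?)</answer>", completion, re.S)
    return 1.0 if m and m.group(1).strip() == ground_truth else 0.0

def group_advantages(completions: list[str], ground_truth: str) -> list[float]:
    """Score a group of sampled answers; each advantage is the reward minus
    the group mean, so above-average samples are reinforced and the rest
    discouraged, with no learned value model required."""
    rewards = [format_reward(c) + accuracy_reward(c, ground_truth) for c in completions]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

The design choice worth noticing: both rewards are cheap, rule-based checks, so no human labeling is needed anywhere in the loop.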
The result is a model that autonomously developed reasoning and chains-of-thought, including what the DeepSeek team calls an "aha moment":
During the training of DeepSeek-R1-Zero, we observed a particularly interesting phenomenon: the occurrence of an "aha moment". As shown in Table 3, this occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by re-evaluating its initial approach. This behavior is not only a testament to the model's growing reasoning ability, but also a striking example of how reinforcement learning can produce unexpected and sophisticated outcomes.
This moment is an "aha moment" not only for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model how to solve a problem, we simply give it the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" is a powerful reminder that reinforcement learning can unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.
This may be one of the most powerful affirmations of “The Bitter Lesson” to date: you don’t need to teach AI how to reason, just give it enough computing power and data, and it will teach itself!
But there is a catch: R1-Zero can reason, yet it does so in a way that is hard for humans to follow. Back to the paper's introduction:
However, DeepSeek-R1-Zero still faces challenges such as poor readability and language mixing. To address these issues and further improve reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start examples to fine-tune DeepSeek-V3-Base. Following this, we perform reasoning-oriented reinforcement learning, as with DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new supervised fine-tuning data through rejection sampling on the RL checkpoint, combine it with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain DeepSeek-V3-Base. After fine-tuning on the new data, the checkpoint undergoes an additional RL stage that takes prompts from all scenarios into account. After these steps, we obtain DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
This sounds very similar to the approach taken by OpenAI in the o1 training process: the DeepSeek research team first used a set of "thought chain" examples to guide the model to learn a format suitable for human understanding, and then used reinforcement learning to enhance reasoning capabilities, and performed a series of editing and optimization steps. The final model is competitive with OpenAI-o1 in both reasoning ability and readability.
DeepSeek may indeed benefit from distillation, especially in training R1. However, this is an important conclusion in itself: we are in an era where AI models teach AI models and AI models train themselves, and we are seeing the AI takeoff unfold in real time.
Q: So, are we close to AGI?
Ben Thompson: It certainly seems that way. This also explains why SoftBank (and whatever investors Masayoshi Son can gather) would provide OpenAI the funding Microsoft would not: the belief that we are at a takeoff point where being first actually matters.
Q: But isn’t R1 leading now?
Ben Thompson: I don't think so; that claim is overstated. R1 is on par with o1, though with jagged capabilities, some of which suggest it was distilled from o1-Pro output. Meanwhile, OpenAI has already demonstrated o3, a far more powerful reasoning model. DeepSeek is clearly the leader in efficiency, but that is different from being the overall leader.
Q: Then why is everyone panicking?
Ben Thompson: I think there are several reasons. First, the fact that China has caught up with the top US labs is shocking to many, because the prevailing assumption was that China is simply not as good at software as the US. That assumption is probably the key reason I underestimated the reaction; in fact, China's software industry as a whole is very strong, with a good track record in AI model building specifically.
Secondly, V3's low training cost and DeepSeek's low inference prices. This surprised me too, but the numbers check out. That surprise is what made the market nervous about Nvidia, whose market position rests heavily on compute demand staying expensive.
Third, DeepSeek achieved all this despite the chip ban. Yes, the ban has loopholes, but DeepSeek likely accomplished this with legally acquired chips.
Q: I own Nvidia stock! Is this the end?
Ben Thompson: It does pose a real challenge to the Nvidia growth story. Nvidia has two big moats: CUDA is developers' language of choice, and CUDA runs only on Nvidia chips; and Nvidia has a huge lead in the technology for linking many chips into one large virtual GPU.
These two moats reinforce each other. I noted earlier that with H100s, DeepSeek probably would have used a larger cluster to train its model, simply because that would have been easier; the fact that they didn't, and were bandwidth-constrained, shaped their architecture and infrastructure decisions. US labs, by contrast, have had little reason to optimize, because Nvidia has kept delivering ever more powerful systems on demand; the path of least resistance has been paying Nvidia. DeepSeek has just demonstrated the alternative path: deep optimization can produce remarkable results on weaker hardware, so writing Nvidia ever-bigger checks is not the only way to make better models.
However, three things still work in Nvidia's favor. First, how much more powerful would DeepSeek's approach be if applied to H100s or the upcoming GB100s? Finding a more efficient way to compute does not make more compute useless. Second, lower inference costs should, over the long run, drive dramatically greater usage; Microsoft CEO Satya Nadella made exactly this point (the Jevons Paradox) in a late-night tweet almost certainly directed at the market. Third, reasoning models like R1 and o1 derive their superior performance from using more compute. To the extent that AI's advance continues to depend on more computing power, Nvidia stands to benefit.
Still, the picture is not all rosy. At a minimum, DeepSeek's efficiency and broad availability cast serious doubt on the most optimistic Nvidia growth projections, at least in the near term. And the payoff from DeepSeek's model and infrastructure optimizations suggests there are meaningful gains to be had from exploring alternative approaches to inference. For example, it might now be viable to run inference on a single AMD GPU, entirely sidestepping AMD's inferior chip-to-chip communication; and advances in reasoning models raise the value of dedicated inference chips that are even more specialized than Nvidia's GPUs.
In short, Nvidia isn't going anywhere, but its stock suddenly faces uncertainty that the market has not yet priced in, and that in turn drags down the entire market.
Q: So, what about the chip ban?
Ben Thompson: The simplest answer is that the chip ban matters even more now that the United States' software lead is rapidly disappearing. Software and know-how cannot be embargoed; we have litigated that question and reached that conclusion before. But chips are physical objects, and the US has every reason to keep China from getting them.
At the same time, some humility is in order, because the earlier chip ban appears to have directly spurred DeepSeek's innovations. And those innovations apply not only to smuggled Nvidia chips or nerfed H800s, but to Huawei's Ascend chips as well. Indeed, you could argue the main consequence of the chip ban so far is today's plunge in Nvidia's stock price.
I am more concerned about the thinking behind the chip ban: the United States is competing not through future innovation, but by blocking past innovation. In the short term, this may help - after all, if DeepSeek had more computing power, their results might be better. But in the long run, this will only sow new seeds of competition in chips and semiconductor equipment, two industries where the United States dominates.
Q: Just like an AI model?
Ben Thompson: AI models are a great example. I said earlier that I would get to OpenAI's biggest crime, which I consider to be the 2023 Biden executive order on AI. As I wrote in Attenuating Innovation:
The core point of this passage is this: If you accept the premise that regulation will entrench incumbent giants, then it is worth noting why the companies that first won the AI race are the ones most actively creating AI panic in Washington. Their concerns do not seem to be serious enough to stop their own AI research. Instead, they regard themselves as the responsible party and actively call for regulation—and if such regulation ultimately kills off future competitors, so much the better.
That section was mainly about OpenAI, and the broader San Francisco AI community. For years, it was the very people dedicated to building and controlling AI who hyped its dangers. Those supposed dangers were the stated reason OpenAI withheld the full GPT-2 release in 2019:
"Given concerns that large language models could be used to generate deceptive, biased, or abusive language at scale, we are releasing only a smaller version of GPT-2 and its sample code. We are not releasing the dataset, training code, or GPT-2 model weights… We recognize that some researchers have the technical ability to replicate and open source our work. We believe our release strategy limits the organizations that might initially do so and gives the AI community more time to discuss the impact of such systems."
"We also believe that governments should consider expanding or launching more systematic initiatives to monitor the societal impacts of AI technologies and their spread, and to measure progress in the capabilities of these systems. If pursued, these efforts could provide AI labs and governments with a more solid evidence base for informing their decisions and AI policies."
The arrogance of this statement is outrageous: six years later, the world has access to model weights that are much more powerful than GPT-2. OpenAI's strategy of trying to maintain control through the US government has completely failed. During this period, how many innovation opportunities have we missed because the top models are not open source? More broadly, how much time and energy have we wasted lobbying the government to build a moat that DeepSeek has just destroyed? This time and energy could have been used to promote real innovation.
Q: So, you’re not worried about the AI doomsday theory?
Ben Thompson: I completely understand this concern, after all, we have entered the stage where AI trains AI and AI learns to reason on its own. But I also know that this train cannot be stopped. More importantly, this is why openness is so important: we need more AI in the world, not an unsupervised board of directors to rule over everyone.
Q: Wait, why is China open sourcing their models?
Ben Thompson: To be precise, it is DeepSeek that is open source. Its CEO Liang Wenfeng said in an interview that open source is the key to attracting talent:
"In the face of disruptive technology, closure is only temporary. Even OpenAI's closed strategy cannot prevent others from catching up. Therefore, our core value lies in the team - colleagues grow in the process, accumulate technical knowledge, and form an innovative organization and culture. This is our moat. Open source and publishing papers are actually free for us. For technical talents, seeing others follow their innovations will give them a great sense of accomplishment. In fact, open source is more of a cultural behavior than a commercial behavior, and participating in it can win respect. For a company, this culture is also attractive."
Later in that interview, the interviewer pressed the point: will this strategy change? DeepSeek today has an idealistic aura reminiscent of the early OpenAI, and it is open source. Will you shift to closed source later, as both OpenAI and Mistral did?
Liang Wenfeng replied: "We will not turn to closure. We believe that building a strong technology ecosystem first is more important than anything else. This is not just idealism, but in line with business logic. If the model is commoditized - which it seems to be at present - then the long-term competitive advantage comes from a better cost structure, and DeepSeek has achieved this. This also echoes how China has achieved dominance in other industries. This way of thinking is very different from most American companies, which generally rely on differentiated products to maintain higher profit margins."
Q: So, is OpenAI finished?
Ben Thompson: Not necessarily. ChatGPT made OpenAI an accidental consumer tech company, which is to say a product company. With some combination of subscriptions and advertising, OpenAI can still build a sustainable consumer business on top of commoditized models. And it is still betting on winning the AI takeoff race.
Anthropic, by contrast, may be the biggest loser this time around. DeepSeek's app topped the App Store, while Claude has gained little traction outside San Francisco. Anthropic's API business fares better, but API businesses in general are the most exposed to commoditization (note, too, that OpenAI's and Anthropic's inference prices look much higher than DeepSeek's in part because they were capturing fat margins, and those margins are now collapsing).
Q: So, is this all very frustrating?
Ben Thompson: Not really. I think DeepSeek brings a huge gift to almost everyone. The biggest winners are consumers and businesses, who can look forward to a future of almost free AI products and services. In the long run, the Jevons Paradox will dominate the situation, and AI users will ultimately be the biggest beneficiaries.
Another group of winners are large consumer technology companies. In a world of free AI, products and channels are the most important, and these companies have already won the competition. China is also a big winner, and this may take time to fully emerge. Not only can China directly use DeepSeek's technology, but DeepSeek's success relative to the top AI labs in the United States may further stimulate China's enthusiasm for innovation and make them realize that they can compete.
What remains is the United States, and the choice we must make. We could double down on defensive measures, dramatically expanding the chip ban and imposing a permission-based licensing regime on chips and semiconductor equipment, in the way the EU regulates tech. Or we could recognize that we have real competition and give ourselves permission to compete: stop wringing our hands, stop lobbying for regulation, and indeed go the other way, cutting out everything in our companies that has nothing to do with winning. If we choose to compete, we can still win. And if we do, we will have a Chinese company to thank.