Talk about WeChat + DeepSeek

Written by
Jasper Cole
Updated on: July 16, 2025

How will the combination of WeChat and DeepSeek change the content ecosystem and market competition?

Core content:
1. The application prospects and potential impact of DeepSeek in the WeChat ecosystem
2. The competition between Baidu and ByteDance over large-model costs and technological innovation
3. The gap between domestic and foreign large-model technology and its impact on cost estimates

When ima.copilot was first released, I used it for a while. Aside from some gaps in the Hunyuan model itself, it supports the public-account content ecosystem very well. Some time ago I also tested LLM-driven automatic replies in the public-account backend. Apart from some interaction problems in the business logic (such as distinguishing normal message exchanges from searching and ranking public-account content), the organized results were actually quite readable; of course, the base model still has its issues. Recently, WeChat began a gray-scale rollout of DeepSeek, which will bring more variables to the 2C market, and especially pressure on ByteDance's ecosystem. After all, WeChat's DAU is close to 1 billion.

On the other hand, there are the recent statements from Baidu and Volcano Engine (see the Fast Technology report [1]).

Shen Dou, president of Baidu's Intelligent Cloud business group, said at a staff meeting that last year's "malicious" price war in the domestic large-model industry left the industry's overall revenue several orders of magnitude lower than overseas.

Tan Dai, president of ByteDance's Volcano Engine, responded via WeChat Moments that the price cuts for large models were achieved through technological innovation: we should focus on fundamentals, as DeepSeek does, and avoid groundless speculation and blaming external factors. Tan Dai pointed out that both the pre-training and inference costs of Volcano Engine's Doubao 1.5 Pro model are lower than DeepSeek-V3's, and far lower than other domestic models', with a healthy gross margin at the current price. He further explained: "Domestic and foreign vendors alike are relying on technological innovation to cut model prices. We have only reached the price level of Gemini 2.0 Flash, and that price was achieved entirely through technological progress."

In fact, differences in cost estimates often boil down to a technical gap. For example, Professor You Yang's estimate differs by more than 10x from the performance of the PD (prefill-decode) separation + EP (expert parallel) deployment described in the DeepSeek-V3 paper. The essential cost difference is the large gap between estimates based on the simple TP/PP parallelism of some open-source implementations and what optimized systems achieve. The price level of Google's Gemini 2.0 Flash, in particular, suggests there is still plenty of room for technical optimization; see, for example, the MoE performance analysis in yesterday's article:

《Talk about DeepSeek MoE model optimization and future evolution and ByteDance Ultra-Sparse Memory related work》

Here is a simple roofline analysis. In terms of compute, the DeepSeek-V3/R1 models' requirements are relatively small; the bottlenecks are mainly memory access, All2All communication, and how to balance expert load during inference. For example, Huawei Ascend mentioned: "Through the EP hybrid-parallel algorithm, communication optimization improved performance by 30%+ and memory-access performance by 20%+, reducing expert imbalance and thereby improving inference throughput by 20%~35%." On the other hand, a remark from Professor Yuan Jinhui explains why Mr. Liang recommends around 80 machines for best performance: EP parallelism yields better data locality.
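For intuition, here is a minimal roofline helper: attainable throughput is simply the smaller of the compute roof and arithmetic intensity times memory bandwidth. The H20-class peak and bandwidth numbers below are my placeholder assumptions, not vendor-verified specs.

```python
# Minimal roofline model: attainable throughput is capped by either peak
# compute or arithmetic intensity x memory bandwidth. The H20-class
# numbers below are placeholder assumptions, not vendor-verified specs.

def attainable_tflops(peak_tflops: float, mem_bw_tbs: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline: min(compute roof, bandwidth roof) at a given intensity."""
    return min(peak_tflops, mem_bw_tbs * intensity_flop_per_byte)

PEAK_TFLOPS = 148.0   # assumed BF16 peak for an H20-class card
MEM_BW_TBS = 4.0      # assumed HBM bandwidth, TB/s

# Low-batch MoE decode re-reads expert weights for few tokens, so its
# arithmetic intensity sits far left of the ridge point (~37 FLOP/byte
# here): it is memory/communication bound, not compute bound.
for intensity in (1, 4, 37, 100):
    tflops = attainable_tflops(PEAK_TFLOPS, MEM_BW_TBS, intensity)
    print(f"{intensity:>3} FLOP/B -> {tflops:.0f} TFLOPs")
```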

From the network-bound perspective, take a single token's hidden state of 7168 B; a 400 Gbps interconnect gives 50 GB/s, which serves as a simple upper-bound estimate. The model has 60 layers to traverse, and each token activates 8 routed experts plus 1 shared expert, so a single token requires about 7168 x 9 x 60 ≈ 4 MB of All2All traffic. Even adding the attention block's communication, a single card can generate more than 6,000 tokens per second. Factoring in communication losses and SLA latency guarantees, at a 30%~50% derating a single card can reach roughly 1,800~3,000 tokens/s with an appropriate parallel strategy.
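A small sketch reproducing the back-of-the-envelope numbers above. All inputs are the article's own assumptions; halving the ceiling for attention-block traffic is my guess at how the ">6,000 tokens/s" intermediate figure was reached.

```python
# Back-of-the-envelope All2All (network-bound) ceiling, using only the
# article's assumptions. Halving for attention-block traffic is my guess
# at how the ">6,000 tokens/s" intermediate figure was obtained.

HIDDEN_BYTES = 7168              # bytes sent per token per expert hop
LAYERS = 60                      # layers counted in the article
EXPERTS = 8 + 1                  # 8 routed experts + 1 shared expert
LINK_BYTES_PER_S = 400 / 8 * 1e9 # 400 Gbps ~= 50 GB/s

bytes_per_token = HIDDEN_BYTES * EXPERTS * LAYERS   # ~3.9 MB per token
ceiling = LINK_BYTES_PER_S / bytes_per_token        # ~13k tok/s, All2All only
with_attention = ceiling / 2                        # >6,000 tok/s (assumed split)

# 30%-50% derating for communication losses and SLA latency headroom.
low, high = with_attention * 0.3, with_attention * 0.5
print(f"{bytes_per_token/1e6:.1f} MB/token, raw ceiling {ceiling:,.0f} tok/s")
print(f"with attention {with_attention:,.0f} tok/s; derated {low:,.0f}-{high:,.0f} tok/s")
```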

From the memory-bound perspective: per the DeepSeek-V3 paper, the decoding stage batches 256 tokens, which is only about 1.8 MB of activation data, while a single expert's parameters come to about 44 MB. Therefore, if experts are spread out as much as possible and the L2-cache hit rate is kept high, memory-bandwidth efficiency improves many times over.
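A sketch of the sizes behind this argument. Hidden size 7168 and per-expert FFN size 2048 are DeepSeek-V3's published shapes; FP8 (1 byte per parameter) is my assumption to reproduce the 44 MB figure.

```python
# Why decode is weight-read dominated: a 256-token batch moves ~1.8 MB of
# hidden states, while one expert's weights are ~44 MB. FP8 (1 byte per
# parameter) is assumed to match the article's 44 MB figure.

D_MODEL = 7168           # hidden size
D_FFN = 2048             # per-expert intermediate size
BATCH_TOKENS = 256       # decode batch from the DSv3 paper, per the article

expert_bytes = 3 * D_MODEL * D_FFN          # gate/up/down projections
activation_bytes = BATCH_TOKENS * D_MODEL   # one hidden state per token

print(f"expert weights    ~ {expert_bytes/1e6:.0f} MB")     # ~44 MB
print(f"batch activations ~ {activation_bytes/1e6:.1f} MB") # ~1.8 MB
# Spreading experts via EP so each GPU's resident experts see denser
# traffic (and stay hot in L2) amortizes those 44 MB reads over more
# tokens, multiplying effective memory-bandwidth efficiency.
```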

However, single-, dual-, or quad-machine PP/TP parallelism makes it hard to obtain this data-locality advantage. This is why Mr. Liang recommends 40 or 80 machines for larger-scale EP parallelism.

On the other hand, we also need to consider DeepSeek-V3/R1's support for MTP. For example, after SGLang recently implemented MTP, performance almost doubled; so with MTP support, a single card's TPS can be nearly doubled.
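A toy model of why speculative acceptance from an MTP head can nearly double TPS. The draft length and acceptance rates here are hypothetical, chosen only to match the "almost doubled" observation.

```python
# Toy model of MTP used as speculative decoding. With a k-token draft,
# each decode step verifies the draft and emits every accepted prefix
# token plus one token from the base model. In memory-bound decode,
# verifying k+1 tokens costs roughly the same as decoding 1, so expected
# tokens per step approximates the speedup. Acceptance rates are guesses.

def expected_tokens_per_step(draft_len: int, accept_rate: float) -> float:
    """E[tokens emitted per step]: 1 + sum of accept_rate**i over the draft."""
    expected, p = 1.0, 1.0
    for _ in range(draft_len):
        p *= accept_rate
        expected += p
    return expected

# A single MTP head (draft_len=1) with 80-90% acceptance (hypothetical)
# gives ~1.8-1.9x, consistent with "performance almost doubled".
for rate in (0.8, 0.9):
    print(f"accept={rate}: {expected_tokens_per_step(1, rate):.2f} tokens/step")
```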

Accounting for some additional overhead, we take 2,000 TPS as a single card's practical ceiling, so an 8-card machine delivers about 16,000 TPS. At a rate of 20 TPS per user, a single 8-card H20 machine can serve about 800 users. Allowing for the extra overhead of prefill nodes under PD separation, roughly 600 users per machine is technically feasible.

Now consider WeChat's 1 billion DAU. From 7 to 10 in the morning the traffic is mostly public-account news pushes, the afternoon is mostly ads and e-commerce, and the evening content is richer, so load stays relatively high all day. Assuming 60 minutes of WeChat usage per user per day, concurrent active users come to roughly 40 million. At 800 users per machine, that is about 50,000 machines, i.e., 400,000 cards. In practice, relaxing to 10 tokens/s per user and accounting for Poisson arrivals and actual usage frequency, about 100,000 to 200,000 cards suffice, which is the range the public account Consensus Crusher gave in "WeChat + DeepSeek: the turning point for 2C applications":

We have already seen from the supply chain that Tencent placed an additional order for 100,000 to 200,000 H20s; now the purpose of the WeChat version of DeepSeek is clear.
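Putting the sizing arithmetic of the last two paragraphs in one place; every input is an assumption from the article (or its quoted source), not measured data.

```python
# Fleet sizing following the article's arithmetic; all inputs are the
# article's assumptions, not measured data.

DAU = 1_000_000_000      # WeChat daily active users (~1B, per the article)
MINUTES_PER_USER = 60    # assumed daily WeChat usage per user
concurrent = DAU * MINUTES_PER_USER / (24 * 60)   # ~42M concurrent users

TPS_PER_CARD = 2000      # assumed practical per-card ceiling (H20)
CARDS_PER_MACHINE = 8
USER_TPS = 20            # per-user token-rate target
users_per_machine = TPS_PER_CARD * CARDS_PER_MACHINE // USER_TPS   # 800

machines = concurrent / users_per_machine   # ~52k machines
cards = machines * CARDS_PER_MACHINE        # ~420k cards

print(f"concurrent ~ {concurrent/1e6:.0f}M users")
print(f"machines   ~ {machines:,.0f}, cards ~ {cards:,.0f}")
# Relaxing to 10 tokens/s per user and smoothing for Poisson arrivals and
# real usage frequency cuts this by roughly half to three quarters,
# landing in the article's 100k-200k card range.
```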

Let me digress a little. Besides inference tuning, I have also been doing some R1-reproduction work recently. The reinforcement-learning workflow is DeepSeek's main line: the goal is to reach AGI/ASI through reinforcement learning, and MLA/MoE/MTP/FP8, and even the DeepSeek app itself, are means to that end. The rumor that Mr. Liang does not actually want these tens of millions of DAU strikes me as very real and credible. While reproducing R1 recently, Zha B hit inference performance as the bottleneck and had to do inference-optimization work alongside, which deepened my appreciation of this point.

In fact, on the topic of reinforcement learning there seem to be endless stories to tell. Almost 25 years ago I wrote piles of dynamic-programming algorithms for OI competitions, eventually won an award, and was admitted to a certain technical school in southwest Shanghai. Almost 20 years ago, my graduation thesis priced financial assets from a game-theory perspective using cellular automata and complex networks, implemented through multi-agent simulated trading. 7~8 years ago, the SDN and SWAN networks built at Cisco on reinforcement-learning models and Segment Routing won the CEO Award and were released as Cisco Predictable Network. Two years ago, I used a very simple dynamic-programming algorithm to design eRDMA's congestion-control algorithm. Over the Spring Festival I revisited reinforcement-learning algorithms, and although the past week was mostly DS-R1 inference tuning, I also did some model-training work to reproduce R1. During training I found inference efficiency too low and had to optimize trl and vllm. At that point the whole chain finally made complete sense.

In many cases we should focus on fundamentals, as DeepSeek does, and avoid groundless speculation and blaming external factors. This is also why Zha B is still researching the co-optimization of MoE algorithms and infrastructure, for example the MoE-related content mentioned above:

《Talk about DeepSeek MoE model optimization and future evolution and ByteDance Ultra-Sparse Memory related work》

On the side of basic mathematical algorithms: are there high-performance learning algorithms for nonlinear spaces? Can constraints from algebraic structures serve as reward features for RL? Would tuning the softmax temperature of each attention layer's scores via RL also be a viable approach? (A sketch of that last idea follows the link below.)

《Mathematical Foundations for the Era of Large Models》
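To make the temperature question concrete, here is scaled-dot-product attention written with a hypothetical per-layer temperature \(\tau_l\) as the knob an RL policy might tune; the RL scheme itself is pure speculation.

```latex
% Scaled-dot-product attention with a hypothetical per-layer temperature
% \tau_l. Setting \tau_l = 1 recovers standard attention; an RL policy
% could (speculatively) raise \tau_l to flatten, or lower it to sharpen,
% each layer's attention distribution.
\[
\mathrm{Attn}_l(Q, K, V) \;=\;
\operatorname{softmax}\!\left(\frac{Q K^{\top}}{\tau_l \sqrt{d_k}}\right) V
\]
```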

There is also the co-design of algorithms and infrastructure:

"Talking about the basic qualities of AISys architects"  , "The evolution of GPU architecture"

Of course, there is also more support for domestic compute, such as chip scale-up/scale-out interconnects and tensor-computation analysis:

《AI Accelerator Interconnect》, 《Tensor Computing》

In this era, we need a group of people who can stay focused on technology. Keep at it, everyone.