In-depth technical analysis: DeepSeek's theoretical cost-profit ratio is 545% (2025)

Written by
Clara Bennett
Updated on: July 14, 2025
Recommendation

An in-depth technical analysis reveals DeepSeek's high theoretical cost-profit ratio and the industry shift it may drive.

Core content:
1. A detailed disclosure of the costs and profit margins of the DeepSeek V3/R1 inference system
2. An in-depth analysis of how token processing and the cache hit rate affect cost
3. The potential impact of high theoretical margins on the AI industry and business-model innovation

Just when the market thought DeepSeek's Open Source Week had wrapped up, on March 1 DeepSeek announced "One More Thing": it unveiled the V3/R1 inference system and disclosed the costs and revenues of its large-scale deployment.

According to the article "Overview of the DeepSeek-V3/R1 Inference System," DeepSeek calculates that, assuming a GPU rental cost of $2/hour, the total cost is $87,072/day. If all loads, including web, app, and API, are counted, and all model tokens are billed at DeepSeek-R1 pricing ($0.14 per million input tokens on cache hits, $0.55 per million input tokens on cache misses, and $2.19 per million output tokens), the theoretical total revenue is $562,027/day, for a cost-profit ratio of 545%.
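For reference, the headline ratio follows directly from the two disclosed daily figures; a minimal check in Python, using the article's numbers:

```python
# Cost-profit ratio from the two figures disclosed in the article.
daily_cost = 87_072        # $/day, at $2 per GPU-hour (see the statistics section below)
daily_revenue = 562_027    # $/day, theoretical, with all tokens billed at DeepSeek-R1 prices
print(f"cost-profit ratio: {(daily_revenue - daily_cost) / daily_cost:.0%}")  # 545%
```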

What does a profit margin of up to 545% mean, and what impact will it have on the industry?

In natural language processing, a token is the basic unit into which text is segmented. Each time a user asks the AI a question and receives an answer, the lengths of the question and the answer correspond to varying numbers of tokens, and the AI consumes compute to process every one of them. There is also the question of whether the cache is hit. A cache hit means the data relevant to the user's question already exists in the cache, so the model can use it directly without recomputing or re-retrieving it, saving compute, time, and storage and lowering cost. A miss consumes more compute and other resources, so the cost is higher.
Charging by token is currently the main business model of AI companies: cache hits are billed relatively cheaply, while misses cost more.
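To make the pricing gap concrete, here is a toy comparison using the R1 input prices quoted above; the 50,000-token prompt is an arbitrary assumption for illustration:

```python
# Illustrative only: input-token cost of one prompt under the R1 prices quoted above.
HIT_PRICE = 0.14 / 1_000_000     # $ per input token when the cache is hit
MISS_PRICE = 0.55 / 1_000_000    # $ per input token when the cache is missed
prompt_tokens = 50_000           # assumed prompt length (e.g. a long multi-turn context)

print(f"all hits:   ${prompt_tokens * HIT_PRICE:.4f}")    # $0.0070
print(f"all misses: ${prompt_tokens * MISS_PRICE:.4f}")   # $0.0275, roughly 4x more
```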

For the industry, the 56.3% cache hit rate mentioned in DeepSeek's latest article is a figure of real significance (the article states that, within the 24-hour statistical window, DeepSeek V3 and R1 processed a total of 608B input tokens, of which 342B tokens, or 56.3%, hit the on-disk KVCache).

"Although none of the companies have released relevant data, a hit rate of more than half is already a very high level in the industry."

Take, for example, the 671-billion-parameter model developed by DeepSeek: the questions typed by hundreds of millions of users all differ to some degree. Achieving such a high hit rate under these conditions shows the team has done a great deal of work on overall optimization.
According to the DeepSeek team, the optimization goal of the V3 and R1 inference systems is "higher throughput and lower latency."
DeepSeek's core architecture is a Mixture of Experts (MoE) model: the very large model is composed of many smaller expert models with different divisions of labor. The scheduling this requires can be explained by analogy with human teamwork: to bring experts from different fields together on a task, the overall task must be split into multiple steps in advance and assigned to specialists in each field, who apply their expertise before the conclusions are consolidated.
DeepSeek writes that because DeepSeek-V3/R1 has a very large number of experts, and by design only 8 of the 256 experts in each layer are activated in actual operation, meeting the team's goal of "high throughput, low latency" requires "efficiently calling" every expert while processing a large number of tasks in a short time. This is what the article calls "large-scale cross-node expert parallelism (Expert Parallelism / EP)".
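To illustrate what "activating 8 of 256 experts per layer" means mechanically, here is a minimal top-k router sketch in PyTorch; the gating scheme, hidden size, and random weights are generic assumptions for illustration, not DeepSeek's actual routing code:

```python
import torch

def route_tokens(hidden, router_weight, k=8):
    """Toy top-k MoE router: pick k of n_experts for each token.

    hidden:        [num_tokens, d_model] token representations
    router_weight: [d_model, n_experts] gating projection (n_experts = 256 here)
    """
    logits = hidden @ router_weight                     # [num_tokens, n_experts]
    topk_logits, topk_idx = torch.topk(logits, k, dim=-1)
    gates = torch.softmax(topk_logits, dim=-1)          # weights for the chosen experts
    return topk_idx, gates                              # which experts to call, and how to mix them

tokens = torch.randn(4, 7168)                           # 4 tokens, assumed hidden size
w_router = torch.randn(7168, 256)                       # 256 routed experts per layer
idx, gates = route_tokens(tokens, w_router)
print(idx.shape, gates.shape)                           # torch.Size([4, 8]) torch.Size([4, 8])
```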
"This is an extremely difficult balancing task. If the model optimization allocation is not done well, a super large model with more than 600 billion parameters may only have 8 or a few experts actually running at a time, and if one of them does not finish running, all the remaining experts may be waiting. Waiting usually means a waste of computing resources. Before DeepSeek was open sourced, the balanced design of the hybrid expert model was a difficult problem that many AI model manufacturers had not yet solved.
In addition, according to DeepSeek, because user traffic is heavy during the day and light at night, the team implemented a mechanism that deploys inference services across all model nodes when daytime load is high and scales inference nodes back at night, when load is low, so they can be used for research and training.

According to DeepSeek's statistics, under this "inference by day, training by night" scheme, over the past 24 hours the DeepSeek V3 and R1 inference services together occupied up to 278 nodes at the peak of busy periods, and 226.75 nodes on average (each node has 8 NVIDIA H800 GPUs).

Considering that DeepSeek also has new model projects and other work that require GPUs, the roughly 1,800-2,000 H800 GPUs implied above (the average of 226.75 nodes times 8 GPUs is about 1,814) have most likely used up nearly all the computing resources DeepSeek can currently devote to the DeepSeek V3 and R1 models.

According to industry observers, DeepSeek's breakthrough is that it has maximized efficiency under limited resources and thereby achieved low-cost model development; the series of efficiency optimizations above is what yields the 545% cost-profit ratio.

But DeepSeek also stressed that 545% is only a theoretical figure; actual revenue is nowhere near that much, because V3 is priced lower, only part of the service is paid, and nighttime discounts apply.
DeepSeek had previously drawn attention among model vendors for its low prices, earning it the nickname "the Pinduoduo of AI."
When DeepSeek launched the V2 model last year, it cut API prices to 1 yuan per million input tokens and 2 yuan per million output tokens for the first time, prompting vendors such as Doubao, Kimi, and Wenxin Yiyan to follow and setting off the first round of model price wars. The latest V3 model's service pricing is only about 1/15 that of OpenAI's comparable model, GPT-4o, and the R1 model is also priced far below its peers.
The high margin disclosed this time also lets the outside world see the "trump card" behind DeepSeek's price cuts.
Prior to this, the industry had hotly debated whether "DeepSeek's low API pricing will lead to huge losses." Luo Fuli, a former DeepSeek researcher, denied this on her personal Zhihu account in May last year: at DeepSeek's current pricing, serving at scale is not loss-making, and the margin exceeds 50%. DeepSeek founder Liang Wenfeng likewise said in an interview with 36Kr that the company's pricing principle is "not to sell at a loss, and not to chase excessive profits; the current price keeps only a modest margin above cost."
At present, most vendors that have announced deployments of the "full-strength" DeepSeek R1 model run it on small setups such as single machines (servers with 8 GPUs) or dual machines. According to the reporter, "four machines is currently the industry watershed for testing a company's technical capability." As the number of servers grows, so does the difficulty of scheduling and optimizing a large-scale deployment; the DeepSeek team's deployment across nearly 300 servers has opened up a clear gap in technical capability.
Although the 545% cost-profit ratio is a theoretical figure calculated by DeepSeek for large-scale deployment, and the actual profit level has not been officially disclosed, it has nonetheless let the industry begin to see the "hope of making money."
DeepSeek also open-sourced its optimization methods while disclosing its margin, and the industry will now study them and deploy DeepSeek more proactively. For most companies, "knowing" and "doing" remain two different things, and putting the same optimizations into practice will run into all sorts of new problems, but the industry as a whole will make more attempts in this direction.

According to official disclosure, the optimization goal of the DeepSeek-V3/R1 inference system is: higher throughput and lower latency.

To achieve these two goals, DeepSeek uses large-scale cross-node expert parallelism (Expert Parallelism / EP). First, EP greatly increases the batch size, thereby improving the efficiency of GPU matrix multiplication and improving throughput. Second, EP allows experts to be dispersed on different GPUs, and each GPU only needs to calculate a small number of experts (thus requiring less memory access), thereby reducing latency.

However, EP also increases the complexity of the system. The complexity is mainly reflected in two aspects:

EP introduces cross-node transmission. In order to optimize throughput, it is necessary to design a suitable computing process so that transmission and computing can be performed simultaneously.

EP involves multiple nodes, so Data Parallelism (DP) is naturally required, and load balancing is required between different DPs.

The article therefore explains how EP is used to increase the batch size, how transmission time is hidden, and how load balancing is performed.

Large-scale cross-node expert parallelism (Expert Parallelism / EP)

Since DeepSeek-V3/R1 has a very large number of experts, with only 8 of the 256 experts in each layer activated, the model's high sparsity means a very large overall batch size is needed to give each expert a sufficient per-expert batch size, and thus achieve higher throughput and lower latency. This requires large-scale cross-node expert parallelism (Expert Parallelism / EP).
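The sparsity argument can be made concrete with a back-of-the-envelope calculation; the target of 128 tokens per expert below is an arbitrary illustrative figure, not a disclosed one:

```python
# With top-8 routing over 256 experts, each expert sees ~1/32 of the tokens on average.
n_experts, k = 256, 8
tokens_per_expert_target = 128        # assumed batch each expert needs for efficient GEMMs

share = k / n_experts                                    # 0.03125
global_tokens_needed = tokens_per_expert_target / share  # 4,096 tokens in flight
print(f"each expert sees {share:.2%} of tokens; "
      f"~{global_tokens_needed:,.0f} tokens must be in flight to feed every expert {tokens_per_expert_target}")
```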

A multi-machine, multi-GPU expert parallelism strategy is used to achieve the following:

Prefill: routed experts EP32, MLA and shared experts DP32; one deployment unit is 4 nodes, with 32 redundant routed experts, and 9 routed experts and 1 shared expert per GPU.

Decode: routed experts EP144, MLA and shared experts DP144; one deployment unit is 18 nodes, with 32 redundant routed experts, and 2 routed experts and 1 shared expert per GPU.
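The per-GPU expert counts in these two configurations follow from simple arithmetic, assuming 256 routed experts per layer plus the 32 redundant copies mentioned above:

```python
# Sanity-check the disclosed prefill/decode layouts (illustrative arithmetic only).
routed_experts, redundant_experts, shared_experts = 256, 32, 1

for phase, nodes in [("prefill", 4), ("decode", 18)]:
    gpus = nodes * 8                                     # EP group size: 32 for prefill, 144 for decode
    per_gpu = (routed_experts + redundant_experts) / gpus
    print(f"{phase}: EP{gpus}, {per_gpu:g} routed + {shared_experts} shared expert(s) per GPU")
# prefill: EP32, 9 routed + 1 shared expert(s) per GPU
# decode:  EP144, 2 routed + 1 shared expert(s) per GPU
```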

Computation-Communication Overlap


Expert parallelism across multiple machines and GPUs introduces substantial communication overhead, so dual-batch overlap is used to hide this overhead and improve overall throughput.


In the prefill phase, the computation and communication of two batches are interleaved, so that the computation of one batch hides the communication overhead of the other.


In the decode stage, the execution times of the different sub-stages differ, so the attention part is split into two stages, forming a 5-stage pipeline in total that overlaps computation and communication.
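A rough way to see why overlapping helps is to compare the time for N micro-batch steps with and without hiding communication behind computation; the per-step timings below are made-up numbers, not measurements:

```python
# Toy timing model for dual-batch overlap (all numbers are assumptions, in milliseconds).
compute_ms, comm_ms, n_steps = 6.0, 4.0, 100

serial = n_steps * (compute_ms + comm_ms)         # compute, then communicate, one batch at a time
overlapped = n_steps * max(compute_ms, comm_ms)   # batch A computes while batch B communicates
print(f"serial: {serial:.0f} ms, overlapped: ~{overlapped:.0f} ms "
      f"({serial / overlapped:.2f}x throughput)")
```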

Balancing load as evenly as possible


Due to the large-scale parallelism (including data parallelism and expert parallelism), if the computation or communication load of a GPU is too heavy, it will become a performance bottleneck and slow down the entire system; at the same time, other GPUs will idle due to waiting, resulting in a decrease in overall utilization. Therefore, it is necessary to distribute a balanced computation load and communication load to each GPU as much as possible.


  1. Prefill Load Balancer

    1. Core problem: the number and length of requests differ across data-parallel (DP) instances, so the core-attention computation and the dispatch send volume also differ.

    2. Optimization goal: make each GPU's computation volume as similar as possible (core-attention compute load balancing) and each GPU's input token count as similar as possible (dispatch send load balancing), so that no GPU takes disproportionately long.

  2. Decode Load Balancer

    1. Core problem: the number and length of requests differ across DP instances, so the core-attention computation (tied to KVCache occupancy) and the dispatch send volume also differ.

    2. Optimization goal: make each GPU's KVCache usage as similar as possible (core-attention compute load balancing) and each GPU's request count as similar as possible (dispatch send load balancing).

  3. Expert-Parallel Load Balancer

    1. Core problem: for a given MoE model, some experts are naturally high-load, which leaves expert compute loads unbalanced across GPUs.

    2. Optimization goal: balance the expert computation on every GPU, i.e., minimize the maximum dispatch receive volume across all GPUs (a toy sketch of this min-max objective follows the list).
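As a toy illustration of the min-max objective in the third balancer, the sketch below greedily places each (hypothetical) expert's load on the currently least-loaded GPU. The production balancer also relies on the redundant expert copies mentioned earlier, which this sketch omits:

```python
import heapq

def greedy_min_max_assign(expert_loads, n_gpus):
    """Toy balancer: put each expert's load on the least-loaded GPU so far,
    approximately minimizing the maximum per-GPU load."""
    heap = [(0.0, gpu) for gpu in range(n_gpus)]   # (accumulated load, gpu id)
    heapq.heapify(heap)
    placement = {}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        gpu_load, gpu = heapq.heappop(heap)        # least-loaded GPU so far
        placement[expert] = gpu
        heapq.heappush(heap, (gpu_load + load, gpu))
    return placement, max(load for load, _ in heap)

# Hypothetical per-expert token counts: a few "hot" experts and many light ones.
loads = {expert: (900 if expert < 4 else 100) for expert in range(32)}
placement, max_load = greedy_min_max_assign(loads, n_gpus=8)
print(f"max per-GPU load: {max_load:.0f}  (ideal would be {sum(loads.values()) / 8:.0f})")
```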

Actual statistics of online systems

All DeepSeek-V3 and R1 services run on H800 GPUs at the same precision as training: matrix computation and dispatch transmission use the FP8 format consistent with training, while core-attention computation and combine transmission use BF16 consistent with training, which preserves service quality to the greatest extent.


In addition, since service load is high during the day and low at night, a mechanism was implemented to deploy inference services across all nodes when daytime load is high, and to reduce inference nodes at night, when load is low, freeing them for research and training. Over the last 24 hours (2025/02/27 12:00 to 2025/02/28 12:00, Beijing time), the DeepSeek-V3 and R1 inference services together occupied a peak of 278 nodes, with an average of 226.75 nodes (8 H800 GPUs per node). Assuming a GPU rental cost of $2/hour, the total cost is $87,072/day.

During the 24-hour statistical window, for DeepSeek-V3 and R1:

The total number of input tokens is 608B, of which 342B tokens (56.3%) hit the KVCache hard disk cache.

The total number of output tokens is 168B. The average output speed is 20-22 tokens/s, and the average KVCache length per output token is 4,989 tokens.

The average throughput per H800 node is: for prefill tasks, an input throughput of about 73.7k tokens/s (including cache hits); for decode tasks, an output throughput of about 14.8k tokens/s.

The above statistics include all loads from web, app, and API. If all tokens were billed at DeepSeek-R1 pricing, theoretical daily revenue would be $562,027, for a cost-profit ratio of 545%. In reality, of course, revenue is far lower, because V3 is priced lower, only part of the service is paid, and nighttime discounts apply.
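Putting the disclosed token statistics and the R1 price list together reproduces the headline figures; a sketch that, like the article, bills every token at R1 prices (the rounded billion-token counts give a total slightly above the exact $562,027):

```python
# Reproduce the theoretical daily revenue and ~545% ratio from the article's own statistics.
PRICE = {"input_hit": 0.14, "input_miss": 0.55, "output": 2.19}   # $ per million tokens (R1)

input_total_b, input_hit_b, output_b = 608, 342, 168              # billions of tokens in 24 h
revenue = (input_hit_b * 1e3 * PRICE["input_hit"]                        # 342B cached input tokens
           + (input_total_b - input_hit_b) * 1e3 * PRICE["input_miss"]   # 266B missed input tokens
           + output_b * 1e3 * PRICE["output"])                           # 168B output tokens

cost = 226.75 * 8 * 24 * 2.0                                      # avg nodes x 8 GPUs x 24 h x $2/h
print(f"revenue ~${revenue:,.0f}/day, cost ${cost:,.0f}/day, "
      f"cost-profit ratio ~{(revenue - cost) / cost:.1%}")
# -> revenue ~$562,100/day (article's exact figure: $562,027), cost $87,072/day, ratio ~545.6%
```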