In-depth technical interpretation: DeepSeek's theoretical cost-profit ratio is 545%

An in-depth technical interpretation of DeepSeek's high theoretical cost-profit ratio, and how it may drive change across the industry.
Core content:
1. Detailed disclosure of the costs and profit margin of the DeepSeek V3/R1 inference system
2. In-depth analysis of how token volume and cache hit rate affect cost
3. The potential impact of a high profit margin on the AI industry and on business model innovation
Just when the market thought DeepSeek's Open Source Week content had all been released, on March 1 DeepSeek announced "One More Thing": it unveiled the V3/R1 inference system and disclosed the costs and revenues of its large-scale deployment.
According to the article "DeepSeek-V3/R1 Inference System Overview", DeepSeek calculates that, assuming a GPU rental cost of $2/hour, the total cost is $87,072/day. If all workloads, including web, app and API traffic, are counted, and all model tokens are billed at DeepSeek-R1 pricing ($0.14 per million input tokens on cache hits, $0.55 per million input tokens on cache misses, $2.19 per million output tokens), the theoretical total daily revenue is $562,027, and the cost-profit ratio is 545%.
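For readers wondering how the headline number is derived: taking "cost-profit ratio" to mean profit divided by cost, the article's own figures reproduce 545%. A minimal check, with both figures copied from the article:

```python
# Quick check of the 545% headline: cost-profit ratio = (revenue - cost) / cost.
revenue_per_day = 562_027   # USD, theoretical, with all tokens billed at R1 pricing
cost_per_day = 87_072       # USD, 226.75 nodes x 8 H800 GPUs x $2/hour x 24 hours

ratio = (revenue_per_day - cost_per_day) / cost_per_day
print(f"{ratio:.0%}")       # 545%
```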
What does a profit margin of up to 545% mean, and what impact will it have on the industry?
For the industry, the 56.3% cache hit rate mentioned in DeepSeek's latest article (the original article states that within the 24-hour statistical window, DeepSeek V3 and R1 processed a total of 608B input tokens, of which 342B tokens, or 56.3%, hit the on-disk KVCache) is a figure of great significance.
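As a rough back-of-the-envelope illustration (not a calculation from the article), the hit rate directly shifts input tokens from the cache-miss price to the much lower cache-hit price used in the revenue estimate:

```python
# Illustrative only: how the 56.3% KVCache hit rate changes the effective
# (blended) price per million input tokens under DeepSeek-R1 list pricing.
hit_rate = 0.563
price_hit = 0.14     # USD per million input tokens, cache hit
price_miss = 0.55    # USD per million input tokens, cache miss

blended = hit_rate * price_hit + (1 - hit_rate) * price_miss
print(f"${blended:.3f} per million input tokens")   # ~$0.319, versus $0.55 with no caching
```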
"Although none of the companies have released relevant data, a hit rate of more than half is already a very high level in the industry."
According to DeepSeek's statistics, under this "inference by day, training by night" scheme, over the past 24 hours the nodes occupied by the DeepSeek V3 and R1 inference services, taken together, peaked at 278 nodes during busy periods, with an average occupancy of 226.75 nodes (each node has 8 NVIDIA H800 GPUs).
Considering that DeepSeek also has new model projects and other work that require GPUs, the above 1,800-2,000 H800 GPUs (the average number of occupied nodes multiplied by 8 GPUs per node) have most likely used up nearly all of the computing resources DeepSeek can currently call on for serving the DeepSeek V3 and R1 models.
Industry observers hold that DeepSeek's breakthrough lies in maximizing efficiency under limited resources, thereby achieving low-cost model development; it is this series of efficiency optimizations that produces the 545% cost-profit ratio.
According to the official disclosure, the optimization goals of the DeepSeek-V3/R1 inference system are higher throughput and lower latency.
To achieve these two goals, DeepSeek uses large-scale cross-node expert parallelism (Expert Parallelism, EP). First, EP greatly increases the batch size, which improves the efficiency of GPU matrix multiplication and thus raises throughput. Second, EP spreads the experts across different GPUs, so each GPU only needs to compute a small number of experts (and therefore performs less memory access), which reduces latency.
However, EP also increases the complexity of the system. The complexity is mainly reflected in two aspects:
EP introduces cross-node communication. To optimize throughput, a suitable computation pipeline must be designed so that communication and computation can proceed simultaneously.
EP spans multiple nodes, so data parallelism (DP) is naturally required as well, and the load must be balanced across the different DP instances.
The article therefore explains how DeepSeek uses EP to increase the batch size, how it hides communication time, and how it performs load balancing.
Large-scale cross-node expert parallelism (Expert Parallelism / EP)
Since DeepSeek-V3/R1 has a very large number of experts and only 8 of the 256 experts in each layer are activated per token, the model's high sparsity means that a very large overall batch size is needed to give each expert a sufficient per-expert batch size, which in turn yields higher throughput and lower latency. This calls for large-scale cross-node expert parallelism (Expert Parallelism, EP).
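A rough illustration of this point (only the 8-of-256 routing comes from the article; the batch sizes below are hypothetical):

```python
# With top-8 routing over 256 experts, each expert sees only ~1/32 of the tokens
# in a batch (assuming roughly uniform routing), so the global batch must be
# large for per-expert matrix multiplications to be efficient.
experts_total = 256
experts_active_per_token = 8

for global_batch_tokens in (128, 4096, 131072):   # hypothetical batch sizes
    tokens_per_expert = global_batch_tokens * experts_active_per_token / experts_total
    print(f"global batch {global_batch_tokens:>6} tokens -> ~{tokens_per_expert:>6.0f} tokens per expert")
```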
The multi-machine, multi-GPU expert parallel strategy is as follows:
Prefill: routed experts EP32, MLA and shared experts DP32; one deployment unit is 4 nodes, with 32 redundant routed experts, and 9 routed experts plus 1 shared expert per GPU.
Decode: routed experts EP144, MLA and shared experts DP144; one deployment unit is 18 nodes, with 32 redundant routed experts, and 2 routed experts plus 1 shared expert per GPU.
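The per-GPU expert counts follow directly from these figures. A small sanity check (my reading of the numbers, assuming 8 GPUs per node and that the 256 routed experts plus 32 redundant copies are spread evenly over one deployment unit):

```python
# Sanity check of the deployment figures quoted above.
ROUTED_EXPERTS = 256      # routed experts per MoE layer in DeepSeek-V3/R1
REDUNDANT_EXPERTS = 32    # extra copies of hot experts, per the article
GPUS_PER_NODE = 8

for phase, nodes in (("prefill", 4), ("decode", 18)):
    gpus = nodes * GPUS_PER_NODE                       # 32 GPUs (EP32) / 144 GPUs (EP144)
    per_gpu = (ROUTED_EXPERTS + REDUNDANT_EXPERTS) // gpus
    print(f"{phase}: {gpus} GPUs, {per_gpu} routed experts + 1 shared expert per GPU")
# prefill: 32 GPUs, 9 routed experts + 1 shared expert per GPU
# decode: 144 GPUs, 2 routed experts + 1 shared expert per GPU
```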
Computation-Communication Overlap
Cross-node expert parallelism introduces substantial communication overhead, so dual-batch overlap is used to hide the communication cost and improve overall throughput.
In the prefill phase, the computation and communication of two micro-batches are interleaved, so that the computation of one micro-batch covers the communication overhead of the other.
In the decode phase, the execution times of the different stages differ, so the attention part is split into two stages, forming a 5-stage pipeline that overlaps computation and communication.
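A minimal sketch of why this overlap matters (not DeepSeek's implementation; all timings and the layer count are hypothetical):

```python
# Dual-batch overlap, schematically: while micro-batch A computes, micro-batch B's
# all-to-all dispatch/combine communication is in flight, so communication time is
# largely hidden behind compute (ignoring pipeline fill/drain).
compute_ms = 6.0   # hypothetical per-layer compute time for one micro-batch
comm_ms = 4.0      # hypothetical per-layer communication time for one micro-batch
layers = 60        # hypothetical layer count, for scale only

serial = layers * 2 * (compute_ms + comm_ms)         # two micro-batches, no overlap
overlapped = layers * 2 * max(compute_ms, comm_ms)   # one batch's comm hidden behind the other's compute
print(f"serial: {serial:.0f} ms, overlapped: {overlapped:.0f} ms")
```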
Load balancing as evenly as possible
With such large-scale parallelism (both data parallelism and expert parallelism), if the computation or communication load on any single GPU is too heavy, it becomes a performance bottleneck and slows down the entire system, while other GPUs idle waiting for it, lowering overall utilization. The computation and communication load must therefore be distributed as evenly as possible across all GPUs.
Prefill Load Balancer
Core problem: the number and length of requests differ across data parallel (DP) instances, so the core-attention computation and the dispatch send volume also differ.
Optimization goal: make the amount of computation on each GPU as similar as possible (core-attention computation load balancing) and the number of input tokens as similar as possible (dispatch send load balancing), so that no GPU takes disproportionately long.
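A minimal greedy sketch of this balancing idea (not DeepSeek's actual scheduler; request lengths and instance count are made up): assign each incoming request to the DP instance with the fewest tokens so far, so per-instance input-token counts stay close.

```python
import heapq

def assign_requests(request_lengths, num_dp_instances):
    """Greedy longest-first assignment of requests to DP instances by input-token count."""
    heap = [(0, i) for i in range(num_dp_instances)]   # (total tokens assigned, instance id)
    heapq.heapify(heap)
    assignment = {i: [] for i in range(num_dp_instances)}
    for length in sorted(request_lengths, reverse=True):   # longest requests first
        total, inst = heapq.heappop(heap)
        assignment[inst].append(length)
        heapq.heappush(heap, (total + length, inst))
    return assignment

loads = assign_requests([3000, 1200, 800, 4500, 600, 2200, 900, 1700], num_dp_instances=4)
print({inst: sum(lens) for inst, lens in loads.items()})   # per-instance totals stay close
```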
Decode Load Balancer
Core problem: the number and length of requests differ across data parallel (DP) instances, so the core-attention computation (which is tied to KVCache occupancy) and the dispatch send volume also differ.
Optimization goal: make the KVCache usage of each GPU as similar as possible (core-attention computation load balancing), and the number of requests as similar as possible (dispatch send load balancing).
Expert-Parallel Load Balancer
Core problem: for a given MoE model, some experts are naturally hot and heavily loaded, which makes the expert computation load uneven across GPUs.
Optimization goal: balance the expert computation across GPUs (that is, minimize the maximum dispatch-receive load over all GPUs).
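A minimal greedy sketch of this minimax objective (not DeepSeek's implementation; the per-expert loads are hypothetical): place experts, heaviest first, onto the currently least-loaded GPU.

```python
import heapq

def place_experts(expert_loads, num_gpus):
    """Greedy placement: the heaviest expert goes to the least-loaded GPU, keeping the max load small."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]   # (total load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        heapq.heappush(heap, (total + load, gpu))
    return placement, max(total for total, _ in heap)

# Hypothetical per-expert token counts observed over some window.
loads = {f"expert_{i}": load for i, load in enumerate([90, 10, 35, 70, 15, 40, 25, 60])}
placement, max_load = place_experts(loads, num_gpus=4)
print(placement, "max per-GPU load:", max_load)
```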
Actual statistics of online systems
All DeepSeek-V3 and R1 services use H800 GPUs and the same precision as training: matrix computation and dispatch transmission use the FP8 format consistent with training, while core-attention computation and combine transmission use BF16 consistent with training, preserving service quality to the greatest extent.
In addition, since service load is high during the day and low at night, a mechanism was implemented to deploy inference services across all nodes when load is high during the day, and to shrink the inference node pool at night so the freed nodes can be used for research and training. Over the last 24 hours (2025/02/27 12:00 to 2025/02/28 12:00 Beijing time), the DeepSeek-V3 and R1 inference services occupied up to 278 nodes at peak, with an average of 226.75 nodes (8 H800 GPUs per node). Assuming a GPU rental cost of $2/hour, the total cost is $87,072/day.
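The cost figure falls straight out of these numbers (a simple reproduction, not an official breakdown):

```python
# Daily GPU rental cost implied by the occupancy statistics above.
avg_nodes = 226.75
gpus_per_node = 8
gpu_hourly_cost_usd = 2.0
hours_per_day = 24

daily_cost = avg_nodes * gpus_per_node * gpu_hourly_cost_usd * hours_per_day
print(f"${daily_cost:,.0f}/day")   # $87,072/day
```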
In the 24-hour statistical window, DeepSeek-V3 and R1:
Processed a total of 608B input tokens, of which 342B tokens (56.3%) hit the on-disk KVCache.
Generated a total of 168B output tokens. The average output speed was 20-22 tokens/s, and the average KVCache length per output token was 4,989 tokens.
The average throughput per H800 node was: for prefill tasks, an input throughput of about 73.7k tokens/s (including cache hits); for decode tasks, an output throughput of about 14.8k tokens/s.
The above statistics include all workloads from web, app and API. If all tokens were billed at DeepSeek-R1 pricing, the theoretical total daily revenue would be $562,027, for a cost-profit ratio of 545%. In reality, revenue is far lower: V3 is priced below R1, only part of the services are paid, and there are discounts during off-peak night hours.
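For completeness, the revenue figure can be reproduced from the token statistics above (a reading of the numbers, not an official breakdown; the small discrepancy comes from rounding of the reported token counts):

```python
# Theoretical daily revenue with every token billed at DeepSeek-R1 list prices.
input_hit_b = 342            # billions of input tokens that hit the KVCache
input_miss_b = 608 - 342     # billions of input tokens that missed the cache
output_b = 168               # billions of output tokens

revenue = (input_hit_b * 1_000 * 0.14      # $0.14 per million tokens, cache hit
           + input_miss_b * 1_000 * 0.55   # $0.55 per million tokens, cache miss
           + output_b * 1_000 * 2.19)      # $2.19 per million output tokens
print(f"~ ${revenue:,.0f}/day")            # ~ $562,100/day, versus the reported $562,027
```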