Analysis of DeepSeek deployment practice

An in-depth analysis of the DeepSeek V3/R1 inference system, covering its deployment architecture and the key techniques used in the inference phase.
Core content:
1. The two phases of model inference: a detailed analysis of Prefill and Decode
2. The logical structure and configuration requirements of the R1 deployment architecture
3. How to optimize the R1 configuration for large-scale concurrency scenarios
The output of the prefill phase is the set of hidden states the model produces for every input token after processing all of the input (prompt) tokens in a single pass. These hidden states feed subsequent work: further generation in the decode phase, classification tasks, or other downstream operations.
Summary of key points:
1. Unified 61-layer structure:
- Prefill and Decode both run through the same 61 decoder layers; no additional sub-layer partitioning is needed
2. Prefill stage characteristics:
- All prompt tokens are fed in at once and computed in parallel
- No existing KV-Cache is read; this pass computes the K/V values that populate the cache for Decode
3. Decode stage characteristics:
- One token is fed in per step, and the KV-Cache is used to avoid recomputing earlier tokens
- Each layer's Self-Attention reads the KV-Cache and appends the new token's K/V (see the sketch after this list)
4. MoE (sparse experts):
- The MoE FFN is part of each layer's structure and is invoked as needed in both the Prefill and Decode stages
- Sparse expert computation improves the model's efficiency and scalability
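A minimal sketch of the two phases, using a single toy attention layer in NumPy: the hidden size, random weights, and "reuse the last hidden state as the next token" step are placeholders rather than DeepSeek's actual code, but it shows how Prefill processes the whole prompt in parallel and builds the cache, while Decode consumes one token per step and only appends to it.

```python
# Toy illustration of Prefill vs. Decode with a KV-Cache.
# One attention layer stands in for the 61 real decoder layers.
import numpy as np

D = 16                                            # hidden size of the toy layer
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

def attention(x, k_cache, v_cache):
    """x: [T, D] new-token states; the cache holds K/V of all earlier tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_all = np.concatenate([k_cache, k])          # append new K/V to the cache
    v_all = np.concatenate([v_cache, v])
    scores = q @ k_all.T / np.sqrt(D)             # causal mask omitted for brevity
    probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return probs @ v_all, k_all, v_all            # layer output + updated cache

# Prefill: the whole prompt enters at once and is computed in parallel;
# this pass is what fills the KV-Cache.
prompt = rng.standard_normal((8, D))              # 8 prompt-token embeddings
hidden, k_cache, v_cache = attention(prompt, np.empty((0, D)), np.empty((0, D)))

# Decode: one token per step; each step reads the cache and appends its own K/V,
# so earlier tokens are never recomputed.
for _ in range(4):
    new_tok = hidden[-1:]                         # stand-in for the sampled next token
    step_out, k_cache, v_cache = attention(new_tok, k_cache, v_cache)
    hidden = np.concatenate([hidden, step_out])

print(hidden.shape, k_cache.shape)                # (12, 16) (12, 16)
```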
2. R1 deployment architecture
Logically, R1 has 61 decoder layers, each with 256 routing experts + 8 activated experts + 1 shared expert. The simplest configuration can be deployed on 8× MI300X or 8× H200 GPUs using SGLang.
DeepSeek R1 Full Version on Azure AMD MI300X
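To make the per-layer structure described above concrete (256 routing experts, 8 activated per token, plus 1 shared expert), here is a minimal routing sketch. The gate weights, expert FFNs, and dimensions are random toy stand-ins, not DeepSeek's gating or scoring implementation:

```python
# Toy sketch of one MoE layer: a gate picks the top-8 of 256 routed experts
# per token, and the single shared expert always runs.
import numpy as np

D, N_ROUTED, TOP_K = 32, 256, 8
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((D, N_ROUTED)) * 0.02            # toy router weights
experts = [lambda x, w=rng.standard_normal((D, D)) * 0.02: x @ w
           for _ in range(N_ROUTED)]                           # 256 routed expert FFNs (toy)
W_shared = rng.standard_normal((D, D)) * 0.02                  # the single shared expert

def shared_expert(x):
    return x @ W_shared

def moe_layer(x):
    """x: [T, D] token states -> [T, D]; every token runs the shared expert
    plus its own top-8 of the 256 routed experts."""
    logits = x @ W_gate                                        # [T, 256] routing scores
    top8 = np.argsort(logits, axis=-1)[:, -TOP_K:]             # indices of the 8 chosen experts
    out = shared_expert(x)                                     # all tokens pass through it
    for t in range(x.shape[0]):
        w = logits[t, top8[t]]
        w = np.exp(w) / np.exp(w).sum()                        # normalize the 8 gate scores
        for e, g in zip(top8[t], w):
            out[t] += g * experts[e](x[t:t+1])[0]              # only 8 of 256 experts compute
    return out

tokens = rng.standard_normal((4, D))
print(moe_layer(tokens).shape)                                 # (4, 32)
```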
For large-scale concurrency, however, you can refer to the optimized configuration below.
Model logic:
• Each layer has 256 "routing experts", i.e. a pool of 256 experts available for selection.
• During inference, not all 256 experts are computed at once; for each input token, only 8 of them are activated (the "8 activated experts").
• In addition, there is 1 "shared expert" that every token passes through, with no sparse routing involved.

Prefill stage configuration: EP32 / DP32
• Cluster size: 4 nodes × 8 GPUs/node = 32 GPUs.
• Official deployment: the 256 routing experts are distributed across the 32 cards, but with redundancy, so each card holds 9 routing experts rather than 256 ÷ 32 = 8.
• That works out to 32 GPUs × 9 routing experts per card = 288 routing-expert copies.
  – The 32 copies beyond 256 are "redundant experts": frequently scheduled experts get extra copies so the load is balanced across cards (the arithmetic is reproduced in the sketch at the end of this section).
• Meanwhile, the 1 shared expert uses data parallelism, i.e. one copy per card. Each card therefore holds "9 routing experts + 1 shared expert".
• Summary: logically the layer is still "256 routing experts + 1 shared expert"; physically, redundancy brings each card up to 9, yielding "EP32 + DP32".

Decode stage configuration: EP144 / DP144
• The Decode stage uses more nodes: 18 nodes × 8 GPUs/node = 144 GPUs.
• The 256 routing experts are distributed across the 144 cards, again with redundancy, so 2 routing experts are placed on each card.
• Result: 144 GPUs × 2 routing experts per card = 288 routing-expert copies.
  – As in Prefill, that is 32 more than 256, for redundancy and load balancing.
• The 1 shared expert continues to be replicated across the 144 cards in a data-parallel manner, so each card now holds "2 routing experts + 1 shared expert".
• Logically it is still "256 + 1"; at this larger parallel scale the experts are redistributed, with redundancy, to form "EP144 + DP144".

The relationship between the "8 activated experts" and the distribution
• "Each token activates 8 experts" means that when input arrives, the gating router selects the 8 best-matching experts from the 256 logical (288 physical) routing experts to perform the computation.
• Whether in the Prefill (9 per card) or Decode (2 per card) configuration, the gating system can locate the required 8 experts among all cards and all expert copies and route that token's compute to them.
• The more redundant copies there are, the less any single card becomes congested, and the higher the overall throughput.

One-sentence summary
• From the training, or logical, perspective, each layer of the model has "256 routing experts + 8 activated experts + 1 shared expert".
• When deployed across dozens or hundreds of GPUs, these 256 experts are split, with redundant copies, across the cards: with 32 cards it is "9 per card → 288 in total", with 144 cards it is "2 per card → 288 in total", both being "256 + 32 redundancy". Meanwhile, the 1 shared expert is simply replicated on every card (DP).
• The final result matches the official description:
  – Prefill stage: EP32/DP32, "9 routed + 1 shared" experts per card
  – Decode stage: EP144/DP144, "2 routed + 1 shared" experts per card
  Both keep the "256 + 1 per layer" MoE structure and the "8 activated experts" scheme unchanged.
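The per-card counts above follow from simple arithmetic, reproduced in the sketch below. Note that in the real system the choice of which experts receive the 32 redundant copies is driven by measured expert load; here only the counting is shown.

```python
# Reproduce the per-GPU expert counts for the two official configurations.
N_LOGICAL = 256                          # routed experts per layer (logical)
N_REDUNDANT = 32                         # extra copies for load balancing
N_PHYSICAL = N_LOGICAL + N_REDUNDANT     # 288 physical routed-expert copies

for stage, n_gpus in [("Prefill (EP32/DP32)", 32), ("Decode (EP144/DP144)", 144)]:
    routed_per_gpu = N_PHYSICAL // n_gpus
    shared_per_gpu = 1                   # shared expert is replicated on every card (DP)
    print(f"{stage}: {n_gpus} GPUs -> "
          f"{routed_per_gpu} routed + {shared_per_gpu} shared expert per GPU "
          f"({n_gpus * routed_per_gpu} routed copies in total)")

# Prefill (EP32/DP32): 32 GPUs -> 9 routed + 1 shared expert per GPU (288 routed copies in total)
# Decode (EP144/DP144): 144 GPUs -> 2 routed + 1 shared expert per GPU (288 routed copies in total)
```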