Official report: How the DeepSeek-V3 model was created!

Written by
Iris Vance
Updated on: July 15, 2025

An interpretation of the official DeepSeek-V3 technical report, revealing the latest breakthroughs in open-source large language models.

Core content:
1. The development background and goals of DeepSeek-V3: improving open-source model performance while keeping training costs economical
2. Model architecture innovation: Multi-head Latent Attention and Mixture-of-Experts optimization under the Transformer framework
3. The performance goal of DeepSeek-V3: approaching the level of closed source models in specific fields and surpassing other open source models


DeepSeek-V3 official report interpretation

https://arxiv.org/abs/2412.19437



1. Paper background: Why develop DeepSeek-V3?

In recent years, large language models (LLMs) have developed rapidly. Not only have closed-source models (such as GPT-4o and Claude-3.5-Sonnet) performed strongly, but open-source models are also making continuous progress, such as the DeepSeek series and the LLaMA series. The goal of DeepSeek-V3 is to further enhance the capabilities of open-source models and narrow the gap with closed-source models while keeping training costs economical.

  • Model size : DeepSeek-V3 has a total of 671 billion parameters, but only 37 billion parameters are activated each time a token is processed. This design reduces computational costs.
  • Core goal : In terms of performance, DeepSeek-V3 should reach or even exceed other open source models, and be close to the level of closed source models in some areas (such as mathematics and programming); at the same time, the training cost should be as low as possible to achieve "economic efficiency".


2. The architecture of DeepSeek-V3: What does it look like?

The architecture of DeepSeek-V3 is based on the Transformer framework, but with some innovative designs, including the following key parts:


2.1 Multi-head Latent Attention (MLA)

  • Function : MLA is designed to make the model more efficient during inference. It reduces the space occupied by the key-value cache (KV cache) through low-rank compression.
  • Simple explanation : Imagine that the model needs to remember a lot of information (keys and values) when processing long texts; storing all of it directly would fill up memory. MLA compresses this information into a much smaller latent representation and caches only that, which saves memory without noticeably affecting performance (a minimal sketch follows this section's bullets).
  • Relationship with DeepSeek-V2 : MLA has been proven effective in DeepSeek-V2 and will continue to be used in DeepSeek-V3.
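
To make the idea concrete, here is a minimal, illustrative PyTorch sketch of low-rank KV compression in the spirit of MLA. The class name, dimensions, and the omission of the decoupled RoPE path are simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SimplifiedMLACache(nn.Module):
    """Toy low-rank KV compression in the spirit of MLA: the hidden state is
    down-projected to a small latent that is cached, and keys/values are
    reconstructed from it at attention time (decoupled RoPE path omitted)."""

    def __init__(self, d_model=1024, n_heads=8, d_head=128, d_latent=256):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)    # compression
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)

    def forward(self, hidden, cache):
        # hidden: (batch, new_tokens, d_model); cache: (batch, past_tokens, d_latent)
        cache = torch.cat([cache, self.down_kv(hidden)], dim=1)    # cache only the latent
        b, t, _ = cache.shape
        k = self.up_k(cache).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(cache).view(b, t, self.n_heads, self.d_head)
        return k, v, cache

# Per token, the cache holds d_latent values instead of 2 * n_heads * d_head.
mla = SimplifiedMLACache()
k, v, cache = mla(torch.randn(1, 4, 1024), torch.zeros(1, 0, 256))
print(cache.shape)   # torch.Size([1, 4, 256])
```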


2.2 DeepSeekMoE: Optimization of Mixture of Experts

  • What is MoE? : MoE is a model structure containing many "experts"; each token is routed to a subset of experts for processing. DeepSeekMoE is the DeepSeek team's own MoE design, characterized by finer-grained experts and some shared experts.
  • Innovation: Auxiliary-Loss-Free Load Balancing
    • Problem : MoE models are prone to "unbalanced expert load" (some experts are very busy, while others are idle). The traditional solution is to add an auxiliary loss, but this may harm model performance.
    • Solution : DeepSeek-V3 introduces a new method that balances the load by dynamically adjusting a per-expert "bias term" used during routing. This method does not rely on auxiliary losses and reduces the negative impact on performance (see the sketch after this list).
    • Additional measures : To prevent extreme imbalance within a single sequence, a small sequence-wise balance loss is added, but the impact is small.
  • Node-Limited Routing : In order to reduce the communication cost during training, each token is sent to at most 4 nodes, which can better utilize hardware resources.
  • No token discard : Due to good load balancing, there is no need to discard tokens during training and inference, which is more efficient.
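
Below is a minimal PyTorch sketch of the bias-based routing idea described above. Function names, the update step size, and the toy sizes are illustrative assumptions; only the general mechanism (biased top-k selection plus a per-step bias nudge toward uniform load) follows the report.

```python
import torch

def route_tokens(scores, bias, k=4):
    """scores: (tokens, experts) affinity scores; bias: (experts,) balance bias.
    Top-k expert selection uses the biased scores, while the gating weights
    are computed from the original, unbiased scores."""
    topk = torch.topk(scores + bias, k, dim=-1).indices              # selection only
    gates = torch.zeros_like(scores).scatter(1, topk, scores.gather(1, topk))
    return topk, gates / gates.sum(dim=-1, keepdim=True)             # normalized weights

def update_bias(bias, topk, n_experts, gamma=1e-3):
    """After each step, nudge load toward uniform with no auxiliary loss:
    lower the bias of overloaded experts, raise it for underloaded ones."""
    load = torch.bincount(topk.flatten(), minlength=n_experts).float()
    return bias - gamma * torch.sign(load - load.mean())

# Toy usage (illustrative sizes, not the paper's 256 routed experts):
scores = torch.rand(32, 16)           # affinities for 32 tokens and 16 experts
bias = torch.zeros(16)
topk, gates = route_tokens(scores, bias)
bias = update_bias(bias, topk, n_experts=16)
print(bias[:4])
```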


2.3 Multi-Token Prediction (MTP)

  • Function : Traditional language models predict only the next token at each step, while DeepSeek-V3 additionally predicts the token after that (i.e., the next 2 tokens).
  • Benefits :
    • Denser training signals : predicting multiple future tokens provides more learning signal per position, improving data efficiency.
    • Inference acceleration : combined with speculative decoding, generation speed improves significantly, reaching roughly 1.8× tokens per second (TPS).
  • Implementation :
    • DeepSeek-V3 uses the main model to predict the next token and adds an MTP module to predict the token after that.
    • During training, the MTP module's loss is added to the total loss; during inference, the MTP module can simply be discarded and the main model works as usual (a minimal sketch follows this list).
  • Results : Experiments show that the MTP strategy improves model performance on most evaluation benchmarks, especially mathematics and programming tasks.
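
The following PyTorch sketch illustrates a depth-1 MTP module of the kind described above: a small extra block that predicts the token after next, whose loss is added during training and which can be dropped at inference. The layer structure, sizes, and names are illustrative, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHead(nn.Module):
    """Toy depth-1 MTP module: combine the main model's hidden state at
    position i with the embedding of token i+1 and predict token i+2.
    The embedding and output head are shared with the main model."""

    def __init__(self, d_model, embed, lm_head):
        super().__init__()
        self.embed, self.lm_head = embed, lm_head        # shared with the main model
        self.proj = nn.Linear(2 * d_model, d_model, bias=False)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def loss(self, hidden, tokens):
        # hidden: (b, T, d) from the main model; tokens: (b, T) input ids
        h = hidden[:, :-2]                               # positions i
        e = self.embed(tokens[:, 1:-1])                  # tokens i+1
        x = self.block(self.proj(torch.cat([h, e], dim=-1)))
        logits = self.lm_head(x)                         # predictions for tokens i+2
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 2:].reshape(-1))

# Toy usage: total_loss = main_loss + lambda_mtp * mtp.loss(hidden, tokens);
# at inference the MTP head can be dropped or reused for speculative decoding.
embed, lm_head = nn.Embedding(1000, 512), nn.Linear(512, 1000)
mtp = MTPHead(512, embed, lm_head)
print(mtp.loss(torch.randn(2, 16, 512), torch.randint(0, 1000, (2, 16))))
```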


3. Training process: How to build DeepSeek-V3?

The training of DeepSeek-V3 is divided into three stages: Pre-Training, Long Context Extension, and Post-Training. The total training cost is 2.788 million H800 GPU hours; assuming $2 per GPU hour, that is about $5.576 million.


3.1 Pre-training: A feast for the model

  • data :
    • A total of 14.8 trillion high-quality and diverse tokens were used, covering multiple languages (mainly English and Chinese), with additional samples related to mathematics and programming.
    • In data processing, redundancy is reduced while maintaining diversity. The document packing method is adopted, but cross-sample attention masking is not used.
    • A Fill-in-the-Middle (FIM) strategy is introduced, similar to DeepSeek-Coder-V2, to help the model learn to predict middle content from its surrounding context (a minimal sketch follows this list).
  • Tokenizer :
    • Byte-level BPE (Byte-level Byte Pair Encoding) is used, and the vocabulary size is 128K.
    • The multi-language compression efficiency has been optimized, and some special tokens (such as a combination of punctuation marks and line breaks) have been added. However, in order to avoid token boundary bias, some combined tokens are randomly split during training.
  • Hyperparameters :
    • The number of Transformer layers is 61 and the hidden dimension is 7168.
    • The MoE layers start from layer 4; each MoE layer has 1 shared expert and 256 routed experts, of which 8 routed experts are activated per token.
    • During training, the sequence length is 4K and the batch size is gradually increased from 3072 to 15360.
    • The AdamW optimizer is used, and the learning rate is increased from 0 to 2.2 × 10⁻⁴ and then gradually decayed.
  • Stability : The entire pre-training process is very stable, with no unrecoverable loss spikes and no need for rollback.
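
As a concrete illustration of the FIM idea mentioned above, here is a minimal sketch of a Prefix-Suffix-Middle (PSM) rewrite applied to a fraction of documents. The sentinel token strings and the splitting logic are illustrative assumptions; only the general PSM structure and an application rate of roughly 0.1 follow the report.

```python
import random

FIM_RATE = 0.1   # fraction of documents rewritten in FIM form

def maybe_apply_fim(doc: str) -> str:
    """Prefix-Suffix-Middle (PSM) rewrite: the document is split into prefix,
    middle, and suffix, and the model learns to generate the middle after
    seeing both sides. Sentinel token names are illustrative."""
    if random.random() >= FIM_RATE or len(doc) < 3:
        return doc
    i, j = sorted(random.sample(range(1, len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>{middle}"

print(maybe_apply_fim("def add(a, b):\n    return a + b\n"))
```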


3.2 Long Context Extension: Allowing Models to Read Long Texts

  • Goal : Expand the model's context window from 4K to 128K.
  • Method :
    • The YaRN (Yet another RoPE extensioN) approach is used, and the extension is carried out in two phases (a simplified sketch of the frequency scaling follows this subsection):
    • Phase 1: extend from 4K to 32K, with sequence length 32K, batch size 1920, trained for 1000 steps.
    • Phase 2: extend from 32K to 128K, with sequence length 128K, batch size 480, trained for 1000 steps.
    • The learning rate is kept at the value from the end of pre-training (7.3 × 10⁻⁶).
  • Effect :
    • In the Needle In A Haystack (NIAH) test, DeepSeek-V3 performs well at 128K context length, demonstrating its long context processing capability.
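
For intuition, here is a greatly simplified NumPy sketch in the spirit of YaRN-style RoPE frequency scaling: dimensions that rotate many times within the original context are kept, dimensions that rotate few times are interpolated by the context-scaling factor, with a linear ramp in between. The boundary values are illustrative assumptions, and the attention-temperature adjustment of full YaRN is omitted.

```python
import numpy as np

def yarn_like_frequencies(d_head=64, base=10000.0,
                          orig_ctx=4096, scale=32.0,
                          alpha=1.0, beta=32.0):
    """Simplified YaRN-flavored scaling of RoPE frequencies. `scale` is the
    context extension factor (e.g., 4K -> 128K gives 32); alpha/beta are
    illustrative ramp boundaries in units of rotations per original context."""
    freqs = base ** (-np.arange(0, d_head, 2) / d_head)     # standard RoPE freqs
    rotations = orig_ctx * freqs / (2 * np.pi)              # rotations per original context
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 0 -> few rotations -> fully interpolate; ramp = 1 -> keep frequency
    return freqs * (ramp + (1 - ramp) / scale)

print(yarn_like_frequencies()[:4])   # scaled frequencies for the first dimensions
```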


3.3 Post-training: Making the model closer to human needs

  • Supervised Fine-Tuning (SFT) :
    • Reasoning data (mathematics, programming, logic): the internal DeepSeek-R1 model is used to generate data, but R1's answers can be overly long or poorly formatted. Through system-prompt design and rejection sampling, accuracy is balanced with clarity and conciseness.
    • Non-inference data (writing, role-playing, question-answering): DeepSeek-V2.5 was used to generate answers, which were then verified manually.
    • The dataset contains 1.5 million instances covering multiple domains.
    • Two epochs of training were performed, with the learning rate decaying from 5 × 10⁻⁶ to 1 × 10⁻⁶; a sample-masking strategy keeps packed samples from attending to each other.
  • Reinforcement Learning (RL) :
    • Reward Model (RM) :
      • Rule-based RM : for tasks with clear rules (such as math and programming problems), answers are verified against the rules; for example, math answers must be given in a specified boxed format, and code is checked by a compiler against test cases.
      • Model-based RM : for free-form tasks (such as writing), the RM is trained from DeepSeek-V3's SFT checkpoint, with chain-of-thought data added to reduce reward hacking.
    • Optimization method : Group Relative Policy Optimization (GRPO) estimates the baseline from the scores of a group of sampled responses, avoiding a large critic model (a minimal sketch follows this list).
    • Effect : RL improves the model's performance on benchmarks, especially in scenarios with limited SFT data.
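
For illustration, here is a minimal sketch of the group-relative idea behind GRPO: rewards within a group of sampled responses are normalized into advantages, and a clipped policy-gradient loss is applied. The KL regularization term is omitted and all values are illustrative; this is not the exact objective from the paper.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: for a group of G responses to the same prompt,
    normalize each reward by the group mean and std (no learned critic)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Sketch of a clipped GRPO-style objective over per-response log-probs
    (summed over tokens); the KL penalty of the full method is omitted."""
    adv = grpo_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()

# Toy usage: a group of 4 sampled answers for one prompt
rewards  = torch.tensor([1.0, 0.0, 0.0, 1.0])      # e.g., rule-based correctness
logp_old = torch.tensor([-20.0, -25.0, -22.0, -19.0])
logp_new = logp_old + 0.1 * torch.randn(4)
print(grpo_loss(logp_new, logp_old, rewards))
```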


4. Training efficiency: Why is DeepSeek-V3 training cost low?

The training efficiency of DeepSeek-V3 is due to the coordinated optimization of hardware, algorithms and frameworks:


4.1 Computing Cluster

  • A cluster of 2048 NVIDIA H800 GPUs was used, with GPUs within a node connected via NVLink and nodes connected to each other via InfiniBand (IB).


4.2 Training Framework (HAI-LLM)

  • Parallel strategies :
    • 16-way Pipeline Parallelism (PP): Assign different layers of the model to different GPUs.
    • 64-way Expert Parallelism (EP): spans 8 nodes, with experts evenly distributed.
    • ZeRO-1 Data Parallelism (DP): shards optimizer states across data-parallel ranks to reduce memory usage.
  • DualPipe algorithm :
    • The innovative pipeline parallel algorithm reduces pipeline bubbles and hides communication delays through computation-communication overlap.
    • Even if the model size increases, the communication overhead of cross-node expert parallelism is almost zero as long as the computation-communication ratio is kept constant.
  • Efficient communication :
    • All-to-all communication optimization across nodes takes advantage of the bandwidth difference between IB and NVLink, and each token is sent to up to 4 nodes, reducing IB traffic.
    • Using warp specialization technology, the bandwidth can be fully utilized with only 20 SMs (Streaming Multiprocessors).
  • Memory optimization :
    • RMSNorm outputs and MLA up-projections are recomputed during backpropagation instead of being stored, reducing activation memory.
    • Exponential Moving Average (EMA) parameters are kept in CPU memory and updated asynchronously, saving GPU memory (a minimal sketch follows this list).
    • The embedding layer and output head of the MTP module are shared with the main model, further saving memory.
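
As a small illustration of the EMA-on-CPU trick, the sketch below keeps the averaged parameters in host memory. The asynchronous update described in the report is shown here as a plain synchronous update for clarity, and the decay value is an assumption.

```python
import torch

class CPUEma:
    """Keep an exponential moving average of model parameters in CPU memory so
    it costs no GPU memory; shown as a synchronous update for clarity
    (the report updates it asynchronously), with an illustrative decay."""

    def __init__(self, model, decay=0.999):
        self.decay = decay
        self.shadow = {n: p.detach().to("cpu", copy=True)
                       for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model):
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach().cpu(), alpha=1 - self.decay)

# Toy usage after each optimizer step:
model = torch.nn.Linear(8, 8)
ema = CPUEma(model)
ema.update(model)
```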


4.3 FP8 Training

  • Background : FP8 is a low-precision format that saves more memory and computing resources than the traditional BF16, but is prone to overflow and quantization errors.
  • Innovation :
    • Fine-Grained Quantization : Activations are quantized in 1×128 tiles and weights in 128×128 blocks to reduce quantization error (a minimal sketch follows this list).
    • Increasing Accumulation Precision : When Tensor Cores process FP8 GEMM, the intermediate results are promoted to the FP32 registers of CUDA Cores every 128 elements (4 WGMMA) to reduce the accumulation error.
    • Online Quantization : Calculate the maximum absolute value in real time to simplify the quantization process.
    • Low-precision storage and communication : Optimizer states (the AdamW moments) are stored in BF16 and some cached activations in FP8, and some activations are quantized to FP8 during communication to reduce bandwidth pressure.
  • Results : Compared with BF16 training, the loss error of FP8 training is less than 0.25%, and the feasibility is verified on 16B and 230B scale models.
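
The following sketch illustrates the fine-grained (per-tile) quantization idea. FP8 is only simulated here by scaling and clipping rather than an actual FP8 cast, and the code assumes the last dimension is a multiple of the tile size; the function name and toy sizes are illustrative.

```python
import torch

FP8_E4M3_MAX = 448.0   # largest value representable in the FP8 E4M3 format

def quantize_activation_tiles(x, tile=128):
    """One scale per 1 x `tile` slice of the last dimension (the report uses
    1x128 tiles for activations and 128x128 blocks for weights). FP8 is only
    simulated here by scaling and clipping; a real kernel would cast to an
    FP8 dtype and keep the per-tile scales for dequantization."""
    rows, cols = x.shape
    x = x.reshape(rows, cols // tile, tile)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / FP8_E4M3_MAX
    q = torch.clamp(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)      # "FP8" payload
    return q.reshape(rows, cols), scale.squeeze(-1)

x = torch.randn(4, 512)
q, scales = quantize_activation_tiles(x)
x_back = (q.reshape(4, -1, 128) * scales.unsqueeze(-1)).reshape(4, 512)
print((x - x_back).abs().max())   # ~0 here; real FP8 rounding adds a small error
```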


4.4 Inference and Deployment

  • Deployment Units :
    • Prefilling stage: The minimum unit is 4 nodes (32 GPUs), the attention part uses 4-way tensor parallelism (TP4) + sequence parallelism (SP) + 8-way data parallelism (DP8), and the MoE part uses 32-way expert parallelism (EP32).
    • Decoding stage: The minimum unit is 40 nodes (320 GPUs), TP4+SP+DP80 is used for the attention part, and EP320 is used for the MoE part.
  • Load Balancing :
    • Through a redundant-experts strategy, heavily loaded experts are dynamically replicated, and the arrangement is adjusted periodically (about every 10 minutes) based on serving statistics (a minimal sketch follows this list).
    • A dynamic redundancy strategy, in which a globally optimal routing is computed before each inference step, is also being explored but needs further optimization.
  • Throughput improvement :
    • Two micro-batches are processed simultaneously, with the attention of one overlapping with the MoE communication of the other.
    • In the decoding stage, attention consumes a larger share of time, so only a small number of SMs need to be allocated to the MoE part while maintaining overall performance.
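
A minimal sketch of the static redundant-experts idea: pick the most heavily loaded experts from recent routing statistics and replicate them. The statistics source, the number of redundant experts, and the rebalancing trigger are illustrative assumptions.

```python
import torch

def pick_redundant_experts(expert_load, n_redundant):
    """From recent routing statistics, pick the most heavily loaded experts to
    replicate on spare capacity so that per-GPU load evens out; the report
    rebalances the arrangement periodically based on serving statistics."""
    return torch.topk(expert_load, n_redundant).indices.tolist()

# Toy usage: token counts routed to each of 256 experts during a serving window
load = torch.randint(0, 10_000, (256,)).float()
print(pick_redundant_experts(load, n_redundant=8))   # experts to duplicate
```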


4.5 Hardware Recommendations

  • Communication Hardware :
    • The current communication relies on SM, which is inefficient. It is recommended to develop a dedicated communication coprocessor (such as NVIDIA SHARP) to unify the IB and NVLink networks and simplify programming.
  • Computing Hardware :
    • It is recommended to increase the accumulation precision of FP8 GEMM in Tensor Cores (e.g., toward full FP32 accumulation).
    • Supports tile-level and block-level quantization to avoid frequent data movement between Tensor Cores and CUDA Cores.
    • Supports online quantization, integrates FP8 conversion and TMA access, and reduces memory read and write.
    • Supports transposed GEMM operations to simplify the quantization process.


5. Performance Evaluation: How does DeepSeek-V3 perform?

DeepSeek-V3 was evaluated on multiple benchmarks, with the evaluation divided into two parts: the base model and the chat model.


5.1 Base model performance

  • Compared with : DeepSeek-V2, Qwen2.5 72B, LLaMA-3.1 405B (all are base models).
  • Key Results :
    • English benchmarks : Leads on most tasks, including MMLU (87.1%), MMLU-Pro (64.4%), and DROP (89.0 F1), especially knowledge-intensive tasks.
    • Code Benchmark : It performs well on tasks such as HumanEval (65.2%), MBPP (75.4%), and LiveCodeBench (19.4%), far exceeding other open source models.
    • Mathematical benchmarks : It is significantly ahead in tasks such as GSM8K (89.3%), MATH (61.6%), and MGSM (79.8%), and is close to the level of closed-source models.
    • Chinese benchmark : Outperforms Qwen2.5 72B on tasks such as C-Eval (90.1%) and CMMLU (88.8%).
    • Multilingual Benchmark : MMMLU Non-English Section (79.4%), with excellent performance.
  • Conclusion : DeepSeek-V3 is currently the strongest open source model, especially in the areas of code and mathematics.


5.2 Chat Model Performance

  • Compared with : DeepSeek-V2, DeepSeek-V2.5, Qwen2.5 72B, LLaMA-3.1 405B, Claude-3.5-Sonnet, GPT-4o.
  • Key Results :
    • English benchmarks : MMLU (88.5%), MMLU-Pro (75.9%), and GPQA-Diamond (59.1%) are close to closed-source models and ahead of all open-source models; long-context tasks are also strong (e.g., DROP 91.6 F1, LongBench v2 48.7%). On English factual QA (SimpleQA 24.9%) it trails GPT-4o and Claude-3.5-Sonnet slightly, but it leads on Chinese factual QA (C-SimpleQA 64.8%).
    • Code benchmarks : It leads on algorithmic tasks such as HumanEval-Mul (82.6%) and LiveCodeBench (37.6%); on engineering tasks such as SWE-Bench (42.0%) and Aider (79.7%), it is second only to Claude-3.5-Sonnet but far ahead of other open-source models.
    • Mathematical benchmarks : It is significantly ahead on AIME 2024 (39.2%), MATH-500 (90.2%), and CNMO 2024 (43.2%), setting a new record among non-o1 models.
    • Chinese benchmarks : It outperforms Qwen2.5 72B on C-Eval (86.5%) and C-SimpleQA (64.8%).
    • Open-ended evaluation : On Arena-Hard (85.5%) it is the first open-source model to exceed 85%, and with AlpacaEval 2.0 (70.0%) it approaches closed-source models.
  • Conclusion : DeepSeek-V3 is the strongest open source chat model, approaching or exceeding closed source models in many fields, especially in mathematics and programming tasks.


5.3 Ablation studies

  • MTP Strategy : Verified on models of different sizes, MTP improves the performance of most benchmarks, especially on code and math tasks.
  • Auxiliary-loss-free load balancing : Compared with traditional auxiliary-loss methods, the new strategy performs better on most benchmarks and yields a clearer division of labor among experts.
  • Batch-level vs. sequence-level load balancing : Batch-level balancing (as in the auxiliary-loss-free method) is more flexible, allowing experts to specialize more across domains and achieving better performance.


6. Summary of innovations: What breakthroughs did DeepSeek-V3 make?

  • Architecture innovation :
    • An auxiliary-loss-free load-balancing strategy reduces the performance penalty of load balancing.
    • MTP training objective improves data efficiency and inference speed.
  • Pre-training efficiency :
    • FP8 mixed-precision training is adopted, and its feasibility is validated for the first time on an extremely large-scale model.
    • DualPipe algorithm and communication optimization to achieve nearly complete overlap of computation and communication.
    • The total training cost is only 2.788 million H800 GPU hours, which is cost-effective.
  • Post-training optimization :
    • Distilling reasoning capabilities from DeepSeek-R1, significantly improving math and programming performance.
    • A self-rewarding approach uses the model's own voting results as feedback to improve alignment.


7. Limitations and Future Directions


7.1 Limitations

  • Deployment cost : The recommended deployment unit is relatively large (32 GPUs for prefilling, 320 GPUs for decoding), which can be a burden for small teams.
  • Inference speed : Although generation is more than twice as fast as DeepSeek-V2, there is still room for improvement.


7.2 Future Directions

  • Architecture optimization : further improve training and inference efficiency, explore support for unlimited context length, and break through Transformer architecture limitations.
  • Data expansion : Increase the amount and quality of training data and explore more sources of training signals.
  • Reasoning ability : Improve the model's deep thinking ability and extend the length of the reasoning chain.
  • Assessment methods : Develop more comprehensive assessment methods to avoid over-optimization of specific benchmarks and misleading capability assessment.


8. Summary: The significance of DeepSeek-V3

DeepSeek-V3 is currently the most powerful open source language model, especially in the fields of code and mathematics, with performance close to or even exceeding closed source models (such as GPT-4o, Claude-3.5-Sonnet). Its training cost is low ($5.576 million), thanks to FP8 training, communication optimization and architectural innovation. The DeepSeek team upholds the spirit of open source and is committed to promoting the development of AGI (artificial general intelligence). In the future, it will continue to optimize architecture, data and reasoning capabilities to bring more breakthroughs to the open source community.