DeepSeek V3: A new breakthrough in AI, a leap forward in performance and efficiency

Written by Audrey Miles
Updated on: July 17, 2025

A new milestone in AI technology! DeepSeek V3 brings a dual revolution in performance and efficiency.

Core content:
1. DeepSeek V3's innovative architecture and performance improvements
2. The technological evolution and optimizations from V1 to V3
3. A detailed explanation of auxiliary-loss-free load balancing


DeepSeek V3 Introduction: Innovative Architecture, Extreme Performance

DeepSeek V3 is the latest version of the DeepSeek series. It inherits the core strengths of the previous two versions (V1 and V2) while making large-scale upgrades to the technical architecture and optimization methods.
  • V1: Focused on data quality and infrastructure optimization, adopted the LLaMA architecture, and performed style alignment via Supervised Fine-Tuning (SFT) on high-quality datasets.
  • V2: Introduced Multi-Head Latent Attention (MLA) to improve inference efficiency, and expanded the model's parameter capacity and computing power through the DeepSeekMoE architecture.
V3 builds on V2 with genuine technological breakthroughs, especially in inference speed, expert load balancing, and multi-token prediction, marking a new stage of development for DeepSeek.

Core technical innovations of DeepSeek V3

DeepSeek V3 introduces a number of breakthrough technologies on top of DeepSeek V2, further improving inference efficiency, training cost, and overall performance. The main technical innovations are as follows:
1. Auxiliary-Loss-Free Load Balancing
In large-scale Mixture of Experts (MoE) models, load balancing is a long-standing problem. Traditional MoE models often suffer from "expert overload": some experts are activated far too frequently while others almost never participate in the computation, which directly hurts model performance.
DeepSeek V3 addresses this with Auxiliary-Loss-Free Load Balancing. Traditional load-balancing methods rely on an extra auxiliary loss to force each expert's activation frequency toward a target, but this interferes with the main training objective and often degrades model quality. DeepSeek V3 instead adds a bias term to each expert's routing score and adjusts it dynamically: after each training step, the bias of an overloaded expert is decreased so it is selected less often, while the bias of an underloaded expert is increased, keeping expert activations balanced.
The advantage of this method is that no auxiliary loss is needed, so the loss function does not distort the training objective, while training stability and inference efficiency both improve.
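To make the mechanism concrete, here is a minimal sketch of bias-corrected top-k routing. All names (num_experts, top_k, update_rate) and the sign-based update rule are illustrative assumptions, not DeepSeek's exact implementation; the key idea is that the bias influences expert selection but is updated outside of backpropagation.

```python
import torch

num_experts, top_k, update_rate = 8, 2, 0.001
bias = torch.zeros(num_experts)   # per-expert bias, updated outside backprop

def route(scores: torch.Tensor) -> torch.Tensor:
    # scores: [num_tokens, num_experts] affinities from the gate.
    # The bias only influences which experts are *selected*; the raw
    # scores would still weight each selected expert's output.
    return torch.topk(scores + bias, k=top_k, dim=-1).indices

def update_bias(chosen: torch.Tensor) -> None:
    # After each step: lower the bias of overloaded experts, raise the
    # bias of underloaded ones, nudging future routing toward balance.
    global bias
    counts = torch.bincount(chosen.flatten(), minlength=num_experts).float()
    bias = bias - update_rate * torch.sign(counts - counts.mean())

chosen = route(torch.randn(32, num_experts))
update_bias(chosen)
```

Because the bias enters only the selection step and is adjusted by a simple rule rather than a loss term, no gradient pressure is placed on the main objective.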


2. Multi-Token Prediction
In traditional language models, text is generated token by token: the model predicts one token at a time and feeds it back as input for the next prediction. This one-at-a-time approach preserves the accuracy of the generated text but sacrifices inference speed and becomes inefficient for long texts.
DeepSeek V3 greatly improves both inference speed and generation quality by introducing Multi-Token Prediction (MTP). Unlike traditional next-token prediction, MTP trains the model to predict several future tokens at once rather than only the immediately following one. This not only raises inference efficiency, increasing generation throughput from 20 to 60 tokens per second (TPS), but also gives the model a longer-range view when producing subsequent tokens, making the generated text more fluent and coherent.
During training, DeepSeek V3 realizes this through additional MTP modules. These modules share the embedding layer and output head with the main model, and each combines the previous hidden states with a Transformer block of its own, improving training efficiency and data utilization.
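The sketch below shows one plausible shape of such a module, assuming shared embedding and output-head weights as described above; the merge projection, the single Transformer block, and all class and parameter names are hypothetical, not DeepSeek's exact design.

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """One extra prediction depth; embedding and head weights are shared."""

    def __init__(self, d_model: int, shared_embed: nn.Embedding,
                 shared_head: nn.Linear):
        super().__init__()
        self.embed = shared_embed      # shared with the main model
        self.head = shared_head        # shared output head
        self.merge = nn.Linear(2 * d_model, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8,
                                                batch_first=True)

    def forward(self, hidden: torch.Tensor, next_tokens: torch.Tensor):
        # hidden: [B, T, d] states from the main model (or previous depth);
        # next_tokens: [B, T] ground-truth tokens one step further ahead.
        h = self.merge(torch.cat([hidden, self.embed(next_tokens)], dim=-1))
        h = self.block(h)
        return self.head(h), h  # logits for the deeper prediction, new states

d_model, vocab = 64, 1000
embed, head = nn.Embedding(vocab, d_model), nn.Linear(d_model, vocab)
mtp = MTPModule(d_model, embed, head)
logits, h = mtp(torch.randn(2, 16, d_model), torch.randint(0, vocab, (2, 16)))
```

Sharing the embedding and output head keeps the extra prediction depth cheap in parameters while still providing a training signal for tokens beyond the next one.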

3. FP8 mixed precision training

To improve training efficiency and reduce compute and memory overhead, DeepSeek V3 applies an FP8 mixed-precision training framework for the first time at ultra-large scale. By performing computation and storage in the FP8 format, DeepSeek V3 significantly reduces GPU memory usage and accelerates training, making large-scale language model training more efficient and markedly cheaper.
Within the FP8 framework, DeepSeek V3 mixes data formats of different precisions, FP8, BF16, and FP32, to optimize computation and memory usage. In the forward pass, inputs and weights are computed in FP8, while gradient accumulation is carried out in FP32, balancing computation speed against numerical accuracy.
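As a rough illustration, the following sketch simulates the common FP8 pattern of per-tensor scaling: cast both operands to FP8 (E4M3) for the multiply, then accumulate in FP32 and undo the scales. This is a numerical toy, not DeepSeek's kernel; FP8_MAX, the scaling rule, and fp8_matmul are assumptions for illustration, and real implementations fuse this into custom GPU kernels. It requires a recent PyTorch with float8 dtypes.

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of the E4M3 format

def to_fp8(x: torch.Tensor):
    # Per-tensor scaling: stretch the tensor to fill the narrow FP8 range.
    scale = FP8_MAX / x.abs().max().clamp(min=1e-12)
    xq = (x * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return xq, scale

def fp8_matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Quantize both operands to FP8, then accumulate in FP32 and undo
    # the scales, mirroring "FP8 compute, high-precision accumulate".
    aq, sa = to_fp8(a)
    bq, sb = to_fp8(b)
    return (aq.to(torch.float32) @ bq.to(torch.float32)) / (sa * sb)

x, w = torch.randn(4, 8), torch.randn(8, 16)
print((fp8_matmul(x, w) - x @ w).abs().max())  # small quantization error
```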

4. Training framework optimization - DualPipe algorithm

DeepSeek V3 uses a new algorithm called DualPipe to improve pipeline-parallel efficiency during training. Compared with traditional pipeline-parallel schemes, DualPipe overlaps computation and communication more fully, shrinking the idle periods ("pipeline bubbles") during training and thereby raising training efficiency. The algorithm is particularly well suited to distributed training, where computation-communication overlap hides inter-node communication overhead and increases training speed.
In addition, by optimizing memory usage and cross-node communication, DualPipe allows DeepSeek V3 to be trained efficiently without resorting to costly tensor parallelism.
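DualPipe's full schedule is beyond a short snippet, but the core trick it relies on, launching communication asynchronously so computation proceeds while data is in flight, can be sketched as follows. This assumes an already-initialized torch.distributed process group; the function and buffer names are hypothetical.

```python
import torch
import torch.distributed as dist

def overlapped_step(model, local_batch, send_buf, recv_buf):
    # Kick off the all-to-all without blocking (async_op=True returns
    # a handle instead of waiting for completion).
    handle = dist.all_to_all_single(recv_buf, send_buf, async_op=True)
    # Local computation runs while the communication is in flight,
    # hiding its latency behind useful work.
    out = model(local_batch)
    # Synchronize only when the received data is actually needed.
    handle.wait()
    return out, recv_buf
```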

5. Further optimization of the DeepSeekMoE architecture

DeepSeek V3 continues to use the DeepSeekMoE architecture and further increases the model's capacity through more numerous, finer-grained experts. Compared with DeepSeek V2, V3 tunes the number of activated experts and the size of each expert for more efficient parallel computation. V3 also refines the expert-selection mechanism: a gating function assigns tokens to experts based on token-to-expert affinity, which works together with the bias correction described above to keep expert load balanced.
Through these optimizations, DeepSeek V3 allocates computing resources more efficiently across diverse tasks, improving overall performance.
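A minimal sketch of affinity-based top-k gating follows; the sigmoid affinity and the renormalization over selected experts are common MoE practice and are assumptions here, as are all names.

```python
import torch
import torch.nn as nn

class AffinityGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int):
        super().__init__()
        # One learnable centroid per expert; affinity = sim(token, centroid).
        self.centroids = nn.Parameter(torch.randn(num_experts, d_model))
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        affinity = torch.sigmoid(x @ self.centroids.T)         # [T, E]
        weights, idx = torch.topk(affinity, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize
        return weights, idx  # combine expert outputs with these weights

gate = AffinityGate(d_model=64, num_experts=16, top_k=4)
w, idx = gate(torch.randn(32, 64))
```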

6. Efficient cross-node communication

DeepSeek V3 specifically optimizes cross-node communication. Dedicated communication kernels, designed together with the MoE routing algorithm, fully exploit the bandwidth of InfiniBand and NVLink to achieve complete overlap of communication and computation. This significantly reduces cross-node communication overhead and improves the efficiency of large-scale distributed training.
By limiting each token's routing to at most 4 nodes, DeepSeek V3 caps communication traffic while relying on NVLink's high intra-node bandwidth for efficient data transfer.
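The node cap can be realized as a two-stage top-k: first pick the best nodes for a token, then pick experts only within those nodes. The sketch below is an illustrative reconstruction; experts_per_node, the node-scoring rule, and all shapes are assumptions.

```python
import torch

def node_limited_topk(affinity: torch.Tensor, experts_per_node: int,
                      max_nodes: int, top_k: int):
    # affinity: [num_tokens, num_experts], experts laid out node by node.
    T, E = affinity.shape
    num_nodes = E // experts_per_node
    per_node = affinity.view(T, num_nodes, experts_per_node)
    # Score each node by the sum of its strongest expert affinities.
    k_in_node = min(top_k, experts_per_node)
    node_score = per_node.topk(k_in_node, dim=-1).values.sum(-1)  # [T, N]
    keep_nodes = node_score.topk(max_nodes, dim=-1).indices       # [T, max_nodes]
    # Mask out experts on all other nodes, then take the final top-k.
    expert_node = torch.arange(E) // experts_per_node             # [E]
    keep = (expert_node.view(1, E, 1) == keep_nodes.unsqueeze(1)).any(-1)
    masked = affinity.masked_fill(~keep, float("-inf"))
    return masked.topk(top_k, dim=-1)

scores = torch.sigmoid(torch.randn(32, 64))   # 64 experts across 8 nodes
vals, idx = node_limited_topk(scores, experts_per_node=8, max_nodes=4, top_k=8)
```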

A leap forward in performance and efficiency

  • Improved inference speed: Thanks to the introduction of MTP, DeepSeek V3's generation speed has tripled, from 20 TPS in V2 to 60 TPS, greatly improving generation efficiency and giving users a smoother experience.
  • Training efficiency: DeepSeek V3 also performs very well in the pre-training stage, with further gains in training stability and cost control. Through the co-design of algorithms, framework, and hardware, V3 keeps the training process both efficient and inexpensive.
In model evaluations, DeepSeek V3 is not only far ahead among open-source models but also on par with the strongest closed-source models (such as GPT-4o and Claude-3.5-Sonnet) in several key areas. In particular, DeepSeek V3 demonstrates strong capabilities in complex tasks such as mathematics, code generation, and long-text understanding.