DeepSeek open-sources FlashMLA, breaking through GPU performance limits

DeepSeek has open-sourced FlashMLA, an MLA decoding kernel deeply optimized for NVIDIA Hopper GPUs.
Core content:
1. DeepSeek open-sources FlashMLA, which is deeply optimized for NVIDIA Hopper GPUs
2. The core MLA architecture significantly improves the efficiency of long-context inference
3. Measured performance is a breakthrough: memory bandwidth and compute throughput approach the hardware limits
MLA architecture (Multi-head Latent Attention): restructures the attention mechanism to compress the KV cache and reduce memory usage, so the same hardware can handle longer contexts. In a standard Transformer the KV cache grows linearly with sequence length (while attention computation grows as O(n²)), leading to memory blow-up in long-context scenarios. Latent attention compression: the K/V matrices of multi-head attention are compressed into a latent space through low-rank projection, shrinking the KV cache by 60%-80% (for example, a 40 GB cache can be compressed to 8-16 GB); see the first sketch after this feature list.
Paged KV cache (block size 64): manages the cache in fixed 64-token pages, improving cache utilization and reducing latency; see the paging sketch below.
BF16 precision support: balances compute throughput and memory efficiency, matching current mainstream AI hardware.
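To make the latent-compression idea concrete, here is a minimal PyTorch sketch. The dimensions (d_model, d_latent, head counts) and projection names are illustrative assumptions, not DeepSeek's actual model configuration.

```python
import torch

# Illustrative sizes only -- not DeepSeek's actual model configuration.
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 2048
seq_len = 8192

# Low-rank projections (hypothetical names): compress the hidden state into one
# shared latent vector per token; per-head K/V are re-expanded from it on the fly.
W_down_kv = torch.randn(d_model, d_latent) * 0.02
W_up_k = torch.randn(d_latent, n_heads * d_head) * 0.02
W_up_v = torch.randn(d_latent, n_heads * d_head) * 0.02

hidden = torch.randn(seq_len, d_model)

# Standard attention caches full K and V: 2 * seq_len * n_heads * d_head values.
# MLA caches only the compressed latent:   seq_len * d_latent values.
kv_latent_cache = hidden @ W_down_kv              # (seq_len, d_latent) -- the only tensor kept

# At attention time, K and V are reconstructed from the latent.
k = (kv_latent_cache @ W_up_k).view(seq_len, n_heads, d_head)
v = (kv_latent_cache @ W_up_v).view(seq_len, n_heads, d_head)

BYTES = 2  # assuming BF16 storage, 2 bytes per value
full_cache = 2 * seq_len * n_heads * d_head * BYTES
latent_cache = seq_len * d_latent * BYTES
print(f"full KV cache:   {full_cache / 2**20:.0f} MiB per layer")
print(f"latent KV cache: {latent_cache / 2**20:.0f} MiB per layer")
print(f"reduction:       {1 - latent_cache / full_cache:.0%}")   # ~75%, inside the 60-80% range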
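The paged cache can likewise be sketched with a simple block table. The layout below is the generic paged-attention pattern (fixed 64-token pages drawn from a shared pool), not FlashMLA's internal data structure; names and sizes other than the 64-token block size are assumptions.

```python
import torch

BLOCK_SIZE = 64          # FlashMLA pages the cache in 64-token blocks
d_latent = 2048          # per-token cache width (illustrative, matches the sketch above)
num_physical_blocks = 1024

# One global pool of fixed-size pages; each sequence owns pages via a block table,
# so memory is allocated in 64-token chunks instead of one contiguous buffer per sequence.
kv_pool = torch.zeros(num_physical_blocks, BLOCK_SIZE, d_latent)
free_blocks = list(range(num_physical_blocks))

class PagedSequence:
    def __init__(self):
        self.block_table = []   # logical block index -> physical block index
        self.length = 0

    def append_token(self, kv_vec: torch.Tensor) -> None:
        # Grab a fresh page whenever the current one is full.
        if self.length % BLOCK_SIZE == 0:
            self.block_table.append(free_blocks.pop())
        block = self.block_table[self.length // BLOCK_SIZE]
        kv_pool[block, self.length % BLOCK_SIZE] = kv_vec
        self.length += 1

    def gather(self) -> torch.Tensor:
        # Reassemble the logical cache for this sequence (a real kernel would
        # instead read the pages directly through the block table).
        blocks = kv_pool[self.block_table].reshape(-1, d_latent)
        return blocks[: self.length]

seq = PagedSequence()
for _ in range(100):                                   # 100 tokens -> ceil(100/64) = 2 pages
    seq.append_token(torch.randn(d_latent))
print(len(seq.block_table), seq.gather().shape)        # 2, torch.Size([100, 2048])
```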
Measured results on an H800 SXM5 GPU:
Memory bandwidth: reaches 3000 GB/s in memory-bound configurations, approaching the H800's theoretical HBM3 bandwidth (roughly 3.35 TB/s), i.e. close to the hardware's physical limit.
Compute performance: achieves 580 TFLOPS in compute-bound configurations, approaching the theoretical peak of the Hopper architecture.
These optimizations significantly speed up large-model inference, which is especially valuable for real-time generation tasks such as chatbots and streaming text generation, while also lowering deployment costs.
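As a rough illustration of why decode speed tracks memory bandwidth, here is a back-of-envelope sketch. The per-token cache width, layer count, and context length below are assumptions chosen for illustration, not DeepSeek's published figures; only the 3000 GB/s number comes from the benchmark above.

```python
# Back-of-envelope: why decode throughput tracks memory bandwidth.
bandwidth_gb_s = 3000                  # achieved memory bandwidth (from the benchmark above)
bytes_per_token_per_layer = 2048 * 2   # latent width * 2 bytes (BF16) -- illustrative
num_layers = 60                        # illustrative layer count
context_len = 64_000                   # tokens already in the cache

# Each newly generated token must stream the whole compressed cache through the GPU once.
cache_bytes = bytes_per_token_per_layer * num_layers * context_len
tokens_per_second = bandwidth_gb_s * 1e9 / cache_bytes
print(f"cache size: {cache_bytes / 2**30:.1f} GiB")
print(f"bandwidth-bound decode rate: ~{tokens_per_second:.0f} tokens/s")
```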
Comparative experimental data released by DeepSeek highlights FlashMLA's advantages:
FlashMLA not only cuts costs significantly in the training phase but also delivers breakthroughs in long-context inference. Its core techniques are:
- Communication optimization: an expert-gradient compression algorithm reduces all-to-all communication bandwidth requirements by 62%
- Computation pipeline reconstruction: FFN-layer matrix multiplications are overlapped with activation-function execution to improve instruction-level parallelism
- Dynamic load balancing: the compute load of each expert is monitored in real time, and asynchronous scheduling prevents resources from sitting idle (a toy sketch follows after this list)
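The load-balancing idea can be sketched in a few lines of PyTorch. The capacity-based reassignment below is a generic top-1 routing heuristic used only for illustration; it is not DeepSeek's actual scheduler, and all sizes are assumptions.

```python
import torch

# Toy sketch of expert load monitoring and rebalancing for an MoE layer.
num_experts, num_tokens, capacity = 8, 4096, 640   # illustrative numbers

router_logits = torch.randn(num_tokens, num_experts)
expert_choice = router_logits.argmax(dim=-1)             # top-1 routing

# "Real-time monitoring": count how many tokens each expert received.
load = torch.bincount(expert_choice, minlength=num_experts)
print("per-expert load before balancing:", load.tolist())

# Tokens beyond an expert's capacity are re-routed to their next-best expert,
# so no expert sits idle while another is oversubscribed.
order = router_logits.argsort(dim=-1, descending=True)   # experts ranked per token
slots_used = torch.zeros(num_experts, dtype=torch.long)
assignment = torch.full((num_tokens,), -1, dtype=torch.long)
for t in range(num_tokens):
    for e in order[t].tolist():
        if slots_used[e] < capacity:
            assignment[t] = e
            slots_used[e] += 1
            break

rerouted = (assignment != expert_choice).sum().item()
print("per-expert load after balancing: ", slots_used.tolist())
print("tokens re-routed:", rerouted)
```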