Just now! DeepSeek open-sources FlashMLA, the core technology for inference acceleration

Written by
Jasper Cole
Updated on: July 15th, 2025

DeepSeek open-sources FlashMLA, a new breakthrough in inference acceleration!

Core content:
1. On the first day of DeepSeek Open Source Week, the FlashMLA decoding kernel was released
2. FlashMLA is optimized for Hopper GPUs, significantly improving inference efficiency
3. Project quick deployment guide and performance test results



Last Friday, DeepSeek announced on Twitter that this week would be Open Source Week and that it would open-source five software libraries in succession.

The first project is indeed related to inference acceleration.

At 9:00 a.m. Beijing time on Monday, just as the workday started (and just before Silicon Valley got off work), DeepSeek delivered on its promise and open-sourced an efficient MLA decoding kernel for Hopper GPUs: FlashMLA.


The project gained more than 400 stars in just 45 minutes, and the count was still climbing when we took the screenshot.


Project address: https://github.com/deepseek-ai/FlashMLA

As is well known, MLA (Multi-head Latent Attention) is one of the key technical innovations in DeepSeek's large models. Its main purpose is to shrink the KV cache used during inference, enabling longer-context inference on fewer devices and greatly reducing inference costs.
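
To give a feel for where those savings come from, here is a minimal sketch of the MLA idea: instead of caching full per-head K and V tensors, only a small latent vector per token is cached, and K/V are reconstructed from it at attention time. The dimensions below are illustrative assumptions, not DeepSeek's actual configuration or code.

import torch
import torch.nn as nn

# Minimal sketch of MLA-style KV-cache compression (illustrative, not DeepSeek's code).
# Standard attention caches per-token K and V of size 2 * n_heads * head_dim;
# MLA caches only a small latent vector and reconstructs K/V from it on the fly.
d_model, n_heads, head_dim, kv_latent = 4096, 32, 128, 512   # assumed sizes

down_proj = nn.Linear(d_model, kv_latent, bias=False)        # compress hidden state
up_k = nn.Linear(kv_latent, n_heads * head_dim, bias=False)  # expand latent -> K
up_v = nn.Linear(kv_latent, n_heads * head_dim, bias=False)  # expand latent -> V

h = torch.randn(1, 1, d_model)        # hidden state of one new token
latent = down_proj(h)                 # (1, 1, 512): this is all that gets cached
k = up_k(latent).view(1, 1, n_heads, head_dim)
v = up_v(latent).view(1, 1, n_heads, head_dim)

# Cache size per token: 512 floats (latent) vs 2 * 32 * 128 = 8192 floats (full K/V).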

This time, DeepSeek has directly open-sourced an improved version of this core technology, a genuinely generous move.

Next, let's take a look at the core content of this open-source project.

According to the repository, FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequence serving.

What has been released so far:

  •  BF16 support
  •  Paged KV cache with a block size of 64 (see the sketch below)
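
To make the "paged KV cache with block size 64" point concrete, here is a minimal sketch of how a block table maps each sequence's logical token positions onto fixed-size physical cache blocks. The pool size, head count, and head dimension are illustrative assumptions, not FlashMLA's internals.

import torch

# Minimal sketch of a paged KV cache with block size 64 (illustrative assumptions).
block_size = 64
num_blocks = 1024                     # size of the global block pool (assumed)
h_kv, head_dim = 1, 576               # single compressed KV head, assumed dimensions

# Physical cache: a pool of fixed-size blocks shared by all sequences.
kv_pool = torch.zeros(num_blocks, block_size, h_kv, head_dim, dtype=torch.bfloat16)

# Logical-to-physical mapping: row i lists the block indices owned by sequence i.
cache_seqlens = torch.tensor([130, 70], dtype=torch.int32)      # current lengths
block_table = torch.tensor([[3, 17, 42],                        # seq 0 uses 3 blocks
                            [8,  5,  0]], dtype=torch.int32)    # seq 1 uses 2 (last entry unused)

# Token t of sequence i lives at kv_pool[block_table[i, t // block_size], t % block_size].
i, t = 0, 130 - 1
blk, off = block_table[i, t // block_size], t % block_size
print(kv_pool[blk, off].shape)        # (1, 576): the cached entry for that token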

It is very fast, reaching up to 3000 GB/s in memory-bound configurations and up to 580 TFLOPS in compute-bound configurations on the H800 SXM5 GPU.

Before deploying this project, you need the following (a quick environment check is sketched after the list):

  •  A Hopper GPU
  •  CUDA 12.3 or above
  •  PyTorch 2.0 or above
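
Here is a quick way to sanity-check that environment from Python; it assumes Hopper corresponds to CUDA compute capability 9.x, which holds for the H100/H800.

import torch

# Quick environment sanity check for the requirements above.
# Hopper GPUs (H100/H800) report CUDA compute capability 9.x.
major, _minor = torch.cuda.get_device_capability(0)
print("GPU:", torch.cuda.get_device_name(0), "| Hopper:", major == 9)
print("CUDA runtime built into PyTorch:", torch.version.cuda)    # needs >= 12.3
print("PyTorch version:", torch.__version__)                     # needs >= 2.0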

Quick Start

  • Install

python setup.py install

  • Benchmarks

python tests/test_flash_mla.py

Using CUDA 12.6 on an H800 SXM5, the benchmark achieves up to 3000 GB/s in a memory-bound configuration and 580 TFLOPS in a compute-bound configuration.
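
For readers who want to interpret these numbers, the sketch below shows how such figures are usually derived from a timed run: achieved bandwidth is bytes read and written divided by elapsed time, and achieved throughput is floating-point operations divided by elapsed time. The kernel, byte count, and FLOP count here are stand-ins, not FlashMLA's benchmark code (that lives in tests/test_flash_mla.py).

import time
import torch

def measure(kernel, bytes_moved, flops, iters=100):
    # Time a CUDA kernel and report achieved bandwidth and throughput.
    # bytes_moved / flops are per-call estimates supplied by the caller.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        kernel()
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    print(f"{bytes_moved / elapsed / 1e9:.0f} GB/s, {flops / elapsed / 1e12:.0f} TFLOPS")

# Example with a stand-in kernel: a BF16 matmul (not FlashMLA).
a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
measure(lambda: a @ b,
        bytes_moved=3 * a.numel() * a.element_size(),   # rough estimate: read a, b; write output
        flops=2 * a.shape[0] * a.shape[1] * b.shape[1])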

  • Usage


from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Scheduling metadata is computed once per decoding step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    o_i, lse_i = flash_mla_with_kvcache(q_i, kvcache_i, block_table, cache_seqlens, dv,
                                        tile_scheduler_metadata, num_splits, causal=True)
    ...
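
For context, here is a hypothetical sketch of how the inputs to that call might be prepared. The tensor shapes and dtypes (a 576-wide head with dv = 512, a single compressed KV head, BF16 everywhere) are assumptions consistent with MLA and the BF16/paged-cache support listed above, not an authoritative layout; consult the repository's test script for the exact interface.

import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative input preparation (all shapes are assumptions, not the official spec).
b, s_q, h_q, h_kv = 2, 1, 128, 1        # batch, query tokens per step, query/KV heads
d, dv = 576, 512                        # total head dim and value dim (assumed MLA dims)
block_size, max_blocks = 64, 32

cache_seqlens = torch.tensor([1000, 250], dtype=torch.int32, device="cuda")
block_table = torch.arange(b * max_blocks, dtype=torch.int32, device="cuda").view(b, max_blocks)
kvcache_i = torch.randn(b * max_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
q_i = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)
o_i, lse_i = flash_mla_with_kvcache(q_i, kvcache_i, block_table, cache_seqlens, dv,
                                    tile_scheduler_metadata, num_splits, causal=True)
print(o_i.shape)   # expected (b, s_q, h_q, dv)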

The project received rave reviews shortly after its release.


Some netizens even joked: "I heard that the fifth day will be AGI."


Finally, as we have said before: this is the real "OpenAI".