Just now! DeepSeek open-sources FlashMLA, the core technology for inference acceleration

Written by
Jasper Cole
Updated on: July 15th, 2025

DeepSeek open-sources FlashMLA, a new breakthrough in inference acceleration!

Core content:
1. On the first day of DeepSeek Open Source Week, the FlashMLA decoding kernel was released
2. FlashMLA is optimized for Hopper GPUs, significantly improving inference efficiency
3. Project quick deployment guide and performance test results



Last Friday, DeepSeek announced on Twitter that this week would be Open Source Week and that it would open-source five software libraries in succession.

The first project is indeed related to inference acceleration.

At 9:00 a.m. Beijing time on Monday, just as the workday started (and just before Silicon Valley got off work), DeepSeek delivered on its promise and open-sourced an efficient MLA decoding kernel for Hopper GPUs: FlashMLA.


The project gained more than 400 stars in just 45 minutes, and the count was still climbing when we took the screenshot.


Project address: https://github.com/deepseek-ai/FlashMLA

As is well known, MLA (Multi-head Latent Attention) is one of the key technical innovations in DeepSeek's large models. Its main purpose is to shrink the KV cache used during inference, enabling longer-context inference on fewer devices and greatly reducing inference costs.
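
To give a feel for where those savings come from, here is a minimal sketch of the MLA idea: instead of caching full per-head K and V tensors, only a small latent vector per token is cached, and K/V are reconstructed from it at attention time. The dimensions below are illustrative assumptions, not DeepSeek's actual configuration or code.

import torch
import torch.nn as nn

# Minimal sketch of MLA-style KV-cache compression (illustrative, not DeepSeek's code).
# Standard attention caches per-token K and V of size 2 * n_heads * head_dim;
# MLA caches only a small latent vector and reconstructs K/V from it on the fly.
d_model, n_heads, head_dim, kv_latent = 4096, 32, 128, 512   # assumed sizes

down_proj = nn.Linear(d_model, kv_latent, bias=False)        # compress hidden state
up_k = nn.Linear(kv_latent, n_heads * head_dim, bias=False)  # expand latent -> K
up_v = nn.Linear(kv_latent, n_heads * head_dim, bias=False)  # expand latent -> V

h = torch.randn(1, 1, d_model)        # hidden state of one new token
latent = down_proj(h)                 # (1, 1, 512): this is all that gets cached
k = up_k(latent).view(1, 1, n_heads, head_dim)
v = up_v(latent).view(1, 1, n_heads, head_dim)

# Cache size per token: 512 floats (latent) vs 2 * 32 * 128 = 8192 floats (full K/V).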

This time, DeepSeek has directly open-sourced an improved version of this core technology, a genuinely generous move.

Next, let's take a look at the core content of this open-source project.

According to the repository, FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequence serving.

What has been released so far:

  •  BF16 support
  •  Paged KV cache with a block size of 64 (see the sketch below)
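
To make the "paged KV cache with block size 64" point concrete, here is a minimal sketch of how a block table maps each sequence's logical token positions onto fixed-size physical cache blocks. The pool size, head count, and head dimension are illustrative assumptions, not FlashMLA's internals.

import torch

# Minimal sketch of a paged KV cache with block size 64 (illustrative assumptions).
block_size = 64
num_blocks = 1024                     # size of the global block pool (assumed)
h_kv, head_dim = 1, 576               # single compressed KV head, assumed dimensions

# Physical cache: a pool of fixed-size blocks shared by all sequences.
kv_pool = torch.zeros(num_blocks, block_size, h_kv, head_dim, dtype=torch.bfloat16)

# Logical-to-physical mapping: row i lists the block indices owned by sequence i.
cache_seqlens = torch.tensor([130, 70], dtype=torch.int32)      # current lengths
block_table = torch.tensor([[3, 17, 42],                        # seq 0 uses 3 blocks
                            [8,  5,  0]], dtype=torch.int32)    # seq 1 uses 2 (last entry unused)

# Token t of sequence i lives at kv_pool[block_table[i, t // block_size], t % block_size].
i, t = 0, 130 - 1
blk, off = block_table[i, t // block_size], t % block_size
print(kv_pool[blk, off].shape)        # (1, 576): the cached entry for that token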

It is very fast, reaching up to 3000 GB/s in memory-bound configurations and up to 580 TFLOPS in compute-bound configurations on the H800 SXM5 GPU.

Before deploying this project, you need the following (a quick environment check is sketched after the list):

  •  A Hopper GPU
  •  CUDA 12.3 or above
  •  PyTorch 2.0 or above
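
Here is a quick way to sanity-check that environment from Python; it assumes Hopper corresponds to CUDA compute capability 9.x, which holds for the H100/H800.

import torch

# Quick environment sanity check for the requirements above.
# Hopper GPUs (H100/H800) report CUDA compute capability 9.x.
major, _minor = torch.cuda.get_device_capability(0)
print("GPU:", torch.cuda.get_device_name(0), "| Hopper:", major == 9)
print("CUDA runtime built into PyTorch:", torch.version.cuda)    # needs >= 12.3
print("PyTorch version:", torch.__version__)                     # needs >= 2.0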

Quick Start

  • Install

python setup.py install

  • Benchmarks

python tests/test_flash_mla.py

Using CUDA 12.6 on an H800 SXM5, the benchmark achieves up to 3000 GB/s in a memory-bound configuration and 580 TFLOPS in a compute-bound configuration.
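
For readers who want to interpret these numbers, the sketch below shows how such figures are usually derived from a timed run: achieved bandwidth is bytes read and written divided by elapsed time, and achieved throughput is floating-point operations divided by elapsed time. The kernel, byte count, and FLOP count here are stand-ins, not FlashMLA's benchmark code (that lives in tests/test_flash_mla.py).

import time
import torch

def measure(kernel, bytes_moved, flops, iters=100):
    # Time a CUDA kernel and report achieved bandwidth and throughput.
    # bytes_moved / flops are per-call estimates supplied by the caller.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        kernel()
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    print(f"{bytes_moved / elapsed / 1e9:.0f} GB/s, {flops / elapsed / 1e12:.0f} TFLOPS")

# Example with a stand-in kernel: a BF16 matmul (not FlashMLA).
a = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 8192, device="cuda", dtype=torch.bfloat16)
measure(lambda: a @ b,
        bytes_moved=3 * a.numel() * a.element_size(),   # rough estimate: read a, b; write output
        flops=2 * a.shape[0] * a.shape[1] * b.shape[1])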

  • Usage


from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Scheduling metadata is computed once per decoding step and reused across layers.
tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)

for i in range(num_layers):
    ...
    o_i, lse_i = flash_mla_with_kvcache(q_i, kvcache_i, block_table, cache_seqlens, dv,
                                        tile_scheduler_metadata, num_splits, causal=True)
    ...
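
For context, here is a hypothetical sketch of how the inputs to that call might be prepared. The tensor shapes and dtypes (a 576-wide head with dv = 512, a single compressed KV head, BF16 everywhere) are assumptions consistent with MLA and the BF16/paged-cache support listed above, not an authoritative layout; consult the repository's test script for the exact interface.

import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative input preparation (all shapes are assumptions, not the official spec).
b, s_q, h_q, h_kv = 2, 1, 128, 1        # batch, query tokens per step, query/KV heads
d, dv = 576, 512                        # total head dim and value dim (assumed MLA dims)
block_size, max_blocks = 64, 32

cache_seqlens = torch.tensor([1000, 250], dtype=torch.int32, device="cuda")
block_table = torch.arange(b * max_blocks, dtype=torch.int32, device="cuda").view(b, max_blocks)
kvcache_i = torch.randn(b * max_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda")
q_i = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")

tile_scheduler_metadata, num_splits = get_mla_metadata(cache_seqlens, s_q * h_q // h_kv, h_kv)
o_i, lse_i = flash_mla_with_kvcache(q_i, kvcache_i, block_table, cache_seqlens, dv,
                                    tile_scheduler_metadata, num_splits, causal=True)
print(o_i.shape)   # expected (b, s_q, h_q, dv)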

The project received rave reviews shortly after its release.


Some netizens even joked: "I heard that the fifth day will be AGI."


Finally, as we have said before: this is the real "OpenAI".