What are the advantages of DeepSeek's open source FlashMLA?

Written by Audrey Miles
Updated on: July 14, 2025

DeepSeek open-sources FlashMLA, a new breakthrough in AI acceleration!

Core content:
1. The design concept and background story of FlashMLA
2. How FlashMLA optimizes GPU performance and computing efficiency
3. The performance improvement and practical application value brought by FlashMLA

You have probably already seen this news.

On February 21, 2025, DeepSeek announced the launch of "Open Source Week," a plan to open-source five code repositories within one week. The first repository, released on Monday (February 24), was FlashMLA.

What is FlashMLA? To understand it, I'll tell you a story first:

Once upon a time, there was a small town with a magical fortune teller. He could answer any question, but there was one problem - he was very slow. Every time someone asked a question, he would spend a long time flipping through books and calculating, which made people wait anxiously.

One day, a smart young man came to town.

Seeing the fortune teller's dilemma, he came up with a solution: he split the fortune teller's book into many small sections and designed a quick lookup method. The fortune teller no longer had to flip through the book page by page, and he could answer questions much faster.

This young man's invention is just like FlashMLA.

FlashMLA builds a "quick search system" for the AI model, so the AI is no longer as slow as before when answering questions. In other words, FlashMLA straps a pair of "Hot Wheels" onto AI.

According to the official statement: FlashMLA is an "accelerator" specially optimized for high-performance GPUs.

Specifically, FlashMLA is an efficient decoding kernel for Multi-head Latent Attention (MLA, DeepSeek's attention variant), tailored to NVIDIA's Hopper-architecture GPUs (such as the H800). Through a series of optimizations, it lets AI models use the GPU's computing power more efficiently during inference, significantly shortening response times.

So, how powerful is this "accelerator"? Three key points:

First, the performance improvement is real.

On the H800, it reaches up to 3000 GB/s of memory throughput in memory-bound configurations and 580 TFLOPS of compute in compute-bound ones. These numbers may seem abstract, but you can read them as making an already powerful GPU even more "terrifying."

Just like a sports car that is already very fast, FlashMLA fits it with a more powerful engine, letting it instantly leave its competitors behind on the track. In other words, it takes the AI model's reaction speed from "very fast" to "instant."
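A quick back-of-the-envelope calculation shows why memory throughput is the headline number for decoding: generating each new token requires re-reading the entire KV cache. All model numbers below are illustrative assumptions of mine, not DeepSeek's figures:

```python
# Back-of-the-envelope: decoding one token is dominated by reading the KV cache.
# The per-token cache size and context length below are illustrative assumptions.

bandwidth_gb_s = 3000           # achieved memory throughput (GB/s), per the FlashMLA figure
kv_bytes_per_token = 70 * 1024  # assume ~70 KiB of KV cache per cached token
context_len = 8192              # assume an 8K-token context

bytes_per_step = kv_bytes_per_token * context_len  # KV cache scanned per decode step
ms_per_step = bytes_per_step / (bandwidth_gb_s * 1e9) * 1e3

print(f"KV cache read per step: {bytes_per_step / 1e9:.2f} GB")
print(f"Memory-bandwidth lower bound per decode step: {ms_per_step:.2f} ms")
```

Under these assumptions, each decode step must move roughly 0.6 GB, so the achieved bandwidth directly sets a floor on latency per token.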

The second point is that it is particularly "labor-saving".

How does it save effort? When handling problems, traditional AI models are like a novice driver who floors the accelerator whether or not it is necessary.

FlashMLA is like an experienced driver who knows when to accelerate and when to ease off. It uses a smart dynamic scheduling approach, investing computing resources only where they are really needed.

This is what the official statement says:

FlashMLA uses paged KV cache technology, dividing cached data into small blocks (block size 64), which allows finer-grained memory management and reduces GPU memory fragmentation.

It also supports BF16 precision, a format that further improves memory-bandwidth utilization while preserving adequate computational accuracy.

This optimization is like, during a traffic jam, only letting the vehicles onto the road that actually need to pass, avoiding needless waste. Put plainly: in summer you turn on the air conditioner only when you need it, rather than leaving it running all the time.
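To make the paged-KV-cache idea concrete, here is a toy sketch of a block table that maps each sequence's token positions onto fixed-size physical blocks (block size 64, as above). This illustrates the general paging technique, not DeepSeek's actual implementation:

```python
BLOCK_SIZE = 64  # tokens per physical cache block, matching FlashMLA's block size

class PagedKVCache:
    """Toy block-table allocator: maps (sequence, token position) -> physical slot."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token `pos`'s KV entry goes."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                   # current block is full (or first token)
            table.append(self.free_blocks.pop())    # grab a fresh physical block
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free_sequence(self, seq_id: int) -> None:
        """Blocks return to the pool as soon as a sequence finishes."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
for pos in range(130):                  # a 130-token sequence occupies 3 blocks
    block, offset = cache.append_token(seq_id=0, pos=pos)
print(cache.block_tables[0])            # the three physical block ids used
cache.free_sequence(0)                  # all three immediately reusable
```

Because memory is handed out in uniform 64-token blocks and returned to the pool the moment a sequence finishes, long and short sequences can share the GPU without leaving awkward gaps, which is exactly the fragmentation problem the official description mentions.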

The third advantage is: industrial-grade practical design.

What is industrial-grade practical design? Simply put, it is not a theoretical technology, but a mature solution that has been rigorously tested and verified in real scenarios.

Since it is a mature solution, it should have the following characteristics. First, high reliability: FlashMLA runs stably in high-intensity business scenarios and does not crash when something unexpected happens.

Second, high performance: FlashMLA not only runs fast but sustains that speed, and it is easy to deploy and maintain. Like a USB flash drive, enterprises can plug it into existing systems and use it right away, plug and play.

Finally, it can adapt to various complex business scenarios, and even when processing massive amounts of data it does not leak sensitive information. So "industrial-grade practical design" means it is not just technologically advanced, but a proven, ready-to-use product that reduces trial-and-error costs.

So, where did the inspiration for FlashMLA come from?

Two projects are credited on its GitHub page: FlashAttention 2 & 3 and CUTLASS. I looked them up: FlashAttention is a project focused on efficient implementations of the attention mechanism, and it significantly improves Transformer performance by optimizing memory access and computation.

You can think of FlashAttention as a super-efficient "commander." It directs the computer's various resources so they work together to finish complex tasks faster.

It is like a factory where the commander arranges for the workers to complete each step efficiently, improving the productivity of the whole factory.
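For intuition, here is a toy, single-query version of FlashAttention's central trick: computing attention block by block with an online softmax, so the full score matrix never has to be materialized at once. This is a plain NumPy restatement of the math, not the real fused CUDA kernel:

```python
import numpy as np

def attention_tiled(q, K, V, block=64):
    """Single-query attention computed over K/V blocks with an online softmax."""
    m = -np.inf                  # running max of scores (numerical stability)
    denom = 0.0                  # running softmax denominator
    acc = np.zeros(V.shape[1])   # running weighted sum of values
    scale = 1.0 / np.sqrt(q.shape[0])
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) * scale  # scores for this block only
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale earlier partial results
        p = np.exp(s - m_new)
        denom = denom * correction + p.sum()
        acc = acc * correction + p @ V[start:start + block]
        m = m_new
    return acc / denom

# Check against naive attention, which builds the full score vector at once.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
s_full = (K @ q) / np.sqrt(64)
p_full = np.exp(s_full - s_full.max())
reference = (p_full / p_full.sum()) @ V
assert np.allclose(attention_tiled(q, K, V), reference)
```

The result is bit-for-bit the same attention output, but each loop iteration only touches one small block of keys and values, which is what lets the real kernel keep its working set in fast on-chip memory.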

CUTLASS is a high-performance matrix-computation library developed by NVIDIA, focused on optimizing matrix multiplication (GEMM) and related kernels on CUDA.

You can think of it as a "math genius." In school, some students are especially good at mental arithmetic and get answers quickly; CUTLASS optimizes algorithms so computers can finish complex mathematical calculations faster.
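The core of what CUTLASS accelerates is tiled matrix multiplication: large matrices are split into tiles that fit in fast on-chip memory and get reused many times. Here is a toy NumPy rendition of the tiling idea (CUTLASS itself is a C++/CUDA template library):

```python
import numpy as np

def gemm_tiled(A, B, tile=32):
    """Blocked matrix multiply: process one tile of A and B at a time,
    mimicking how GPU kernels stage tiles in shared memory."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # Each loaded tile is reused across a whole sub-product,
                # which is the source of GEMM's efficiency on GPUs.
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile]
                )
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((96, 64)), rng.standard_normal((64, 128))
assert np.allclose(gemm_tiled(A, B), A @ B)
```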

Therefore, FlashMLA drew on the strengths of both projects in its design.

From FlashAttention it learned how to command resources efficiently; from CUTLASS, how to complete complex mathematical calculations quickly. Combining the two, it is good at both orchestration and computation.

I think the open source of FlashMLA is very important for enterprises and developers.

Why?

On the one hand, in the business world, time is money. For companies that rely on AI technology, faster inference means lower operating costs, higher customer satisfaction, and stronger market competitiveness.

On the other hand, the open source of FlashMLA will enable more companies and developers to use this advanced technology for free, thereby promoting the development of the entire industry.

Having written all this, the question arises: how do you use it?

Hardware requirements: an NVIDIA Hopper-architecture GPU (such as the H800). Software requirements: CUDA 12.3 or above, and PyTorch 2.0 or above.
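Before installing, you can sanity-check your environment from Python. This little check is my own addition, not part of the FlashMLA repository; Hopper GPUs report compute capability (9, 0):

```python
import torch

# Verify a CUDA device is visible and that it is a Hopper (sm90) GPU,
# matching FlashMLA's stated hardware and software requirements.
assert torch.cuda.is_available(), "No CUDA device visible"
major, minor = torch.cuda.get_device_capability()
print(f"GPU: {torch.cuda.get_device_name()} (sm_{major}{minor})")
print(f"PyTorch {torch.__version__}, built against CUDA {torch.version.cuda}")
assert (major, minor) >= (9, 0), "FlashMLA requires a Hopper (sm90) GPU such as the H800"
```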

Then, three steps:

First, get the code. The GitHub address is: https://github.com/deepseek-ai/FlashMLA.

Second, enter the code folder and run: python setup.py install. This step builds and installs the parts FlashMLA needs to work properly.

Finally, verify that FlashMLA installed successfully by running a simple test. In the code folder, run: python tests/test_flash_mla.py

If everything is OK, you will see the test results, showing how FlashMLA performs on your machine.
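Once installed, calling FlashMLA from your own decode loop follows the usage pattern in the repository's README, roughly as below. The names here (s_q, h_q, h_kv, dv, q_i, kvcache_i, block_table, cache_seqlens, num_layers) are schematic placeholders for your query length, head counts, value head dimension, and tensors; see the README for a complete runnable example:

```python
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Scheduling metadata is computed once per batch from the cached sequence lengths.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens,          # placeholder: lengths of the cached sequences in the batch
    s_q * h_q // h_kv,      # placeholder: query tokens times heads per KV head
    h_kv,                   # placeholder: number of KV heads
)

# Then, inside the per-layer decode loop:
for i in range(num_layers):
    o_i, lse_i = flash_mla_with_kvcache(
        q_i,                # placeholder: this layer's queries
        kvcache_i,          # placeholder: paged KV cache (blocks of 64 tokens)
        block_table,        # placeholder: logical-to-physical block mapping
        cache_seqlens,
        dv,                 # placeholder: value head dimension
        tile_scheduler_metadata, num_splits,
        causal=True,
    )
```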

In short, if you are an AI developer, or your product needs faster AI inference, FlashMLA is definitely worth trying. It is also a rare business opportunity. I am not an independent developer and am still learning myself, but I will keep sharing relevant information with you as soon as I can, and I hope it helps.