Shocking: the two stages of large model inference differ in speed by nearly 140x! An experiment shows why LLM inference needs PD separation

Written by Clara Bennett
Updated on: June 20, 2025
Recommendation

The inference speed of a large model differs dramatically between its two stages. Why is PD separation so critical?

Core content:
1. The two stages of large model inference: Prefill and Decode
2. An experiment reveals the performance difference between the Prefill and Decode stages
3. The importance of PD separation for improving inference efficiency


Introduction

In the previous article, we introduced the two stages of large model inference: the Prefill stage and the Decode stage. To briefly review: the Prefill stage processes the entire input in parallel and builds the initial key-value (KV) cache for the context, while the Decode stage uses that cached context to generate the output sequence one token at a time, autoregressively.
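To make the two stages concrete, here is a minimal sketch (not the code from the referenced article) of how they look when driving a model directly through the Hugging Face transformers API, using gpt2 as a stand-in model: a single parallel forward pass builds the KV cache (Prefill), and subsequent tokens are produced one at a time against that cache (Decode).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The two stages of LLM inference are", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: the whole prompt is processed in one parallel forward pass,
    # producing next-token logits plus the key-value cache for every prompt position.
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1:].argmax(dim=-1)

    generated = [next_token]
    for _ in range(9):
        # Decode: each step feeds only the newest token and reuses the cached
        # keys/values, so the work per token is dominated by reading that cache.
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values
        next_token = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```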

The two stages differ significantly in computing characteristics, parallelism potential, and the core resources they depend on (compute versus memory bandwidth). Building on the previous article, this one uses an experiment to show intuitively why PD separation is needed.

For an introduction to the Prefill and Decode stages, see the earlier article on the WeChat public account "Feng Shao's Technology Space": Understanding the working mechanism of large language models: a detailed explanation of the Prefill and Decode stages.

Experiment: Why is PD separation necessary?

To understand and quantify the performance difference between the Prefill and Decode stages, we designed a benchmark around their respective computing characteristics. As noted above, the Prefill stage processes all input data in parallel, while the Decode stage generates the output tokens one by one, autoregressively. This essential difference implies a large gap in processing throughput. The experiment therefore simulates concurrent user requests and measures the time spent in each stage separately, so that the performance of each stage, and the gap between them, can be quantified precisely.

Based on this idea, we wrote an experimental script that simulates 5 concurrent user requests. Each request's input prompt is padded to 255 tokens, for 1275 input tokens in total, and each request generates 256 new tokens, for 1280 output tokens in total. The script then measures the speed of the Prefill stage and the Decode stage separately.

The experimental code is open source: https://github.com/chen-ace/LLM-Prefill-Decode-Benchmark. The repository provides a script for NVIDIA CUDA as well as code for Apple M-series chips, so users of Apple laptops can also run the benchmark. Because the hardware resources of Apple M-series devices are limited, the MPS version of the test code only uses the gpt2 model.

The experiment evaluates the two key stages of the inference process as follows. The time for the "approximate Prefill stage" covers processing all 1275 input tokens and generating the first new token for each of the 5 requests. To evaluate the "approximate Decode stage", the experiment records the total time required to generate all 1280 output tokens, subtracts the prefill time, and obtains the net time spent generating the remaining 1275 tokens. A minimal sketch of this measurement method is shown below.
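The open-sourced repository contains the actual benchmark; the following is only a minimal sketch of the measurement method just described, assuming gpt2 and the Hugging Face transformers API (the prompt text and padding setup here are illustrative). It times a 1-token generation to approximate the Prefill stage and subtracts that from a full 256-token generation to approximate the Decode stage.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # the repo also has an MPS variant
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation starts right after the prompt
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()

# 5 concurrent requests, each prompt padded to 255 tokens (1275 input tokens in total).
prompts = ["Explain why PD separation matters for LLM inference."] * 5
batch = tokenizer(prompts, padding="max_length", max_length=255,
                  return_tensors="pt").to(device)

def timed_generate(max_new_tokens: int) -> float:
    """Run greedy generation for the whole batch and return the wall-clock time."""
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**batch, max_new_tokens=max_new_tokens,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

# "Approximate Prefill": process all 1275 input tokens and emit the first token per request.
t_prefill = timed_generate(max_new_tokens=1)
# Total time to generate 256 tokens per request (1280 output tokens in total).
t_total = timed_generate(max_new_tokens=256)
t_decode = t_total - t_prefill  # net time for the remaining 1275 output tokens

print(f"prefill: {t_prefill:.4f}s  ({1275 / t_prefill:.1f} input tokens/s)")
print(f"decode : {t_decode:.4f}s  ({1275 / t_decode:.1f} output tokens/s)")
```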

Figure: Comparison of Prefill and Decode stage data

The experimental data is shown in the figure above: the approximate Prefill stage takes only 0.2394 seconds (input throughput 5325.18 tokens/second), while the approximate Decode stage takes 32.8948 seconds to generate the remaining 1275 tokens (output throughput 38.76 tokens/second), for a total of 33.1343 seconds. This means not only that the Prefill stage is about 137 times faster than the Decode stage, but also that the vast majority of the total time (more than 99%) of the entire inference task is spent in the Decode stage. This huge performance gap directly exposes the inherent imbalance in LLM inference, a core problem that must be addressed to improve overall inference efficiency.
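To make the arithmetic behind these figures explicit, the lines below recompute the throughputs, the speed ratio, and the Decode stage's share of total time from the measured durations reported above (small rounding differences aside):

```python
prefill_time = 0.2394   # seconds: 1275 input tokens + first output token of each of the 5 requests
decode_time = 32.8948   # seconds: the remaining 1275 output tokens

prefill_throughput = 1275 / prefill_time  # ~5326 input tokens/second
decode_throughput = 1275 / decode_time    # ~38.8 output tokens/second

speed_ratio = prefill_throughput / decode_throughput       # ~137x
decode_share = decode_time / (prefill_time + decode_time)  # ~99.3% of total time

print(f"{prefill_throughput:.0f} vs {decode_throughput:.1f} tokens/s -> "
      f"{speed_ratio:.0f}x faster, decode share {decode_share:.1%}")
```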

This leads to a clear conclusion: the GPU's compute capacity is far from fully utilized during the Decode stage; in other words, a large share of its compute resources is wasted. This is why the Prefill and Decode stages need to be optimized separately. Without such targeted optimization, the Decode stage, which dominates total runtime, wastes a large amount of compute and significantly increases the cost of model deployment.

Advantages of PD separation

To solve the above problem, PD separation (Prefill-Decode Separation) emerged. It is not a simple technical tweak but a core architectural principle and system design philosophy: understand and strictly distinguish the essential differences between the Prefill (P) and Decode (D) stages, and apply clearly differentiated strategies for resource allocation, task scheduling, and optimization algorithms. Its core advantages are mainly the following:

1. Significant improvement in inference performance (low latency and high throughput)

By tailoring optimization strategies to the Prefill and Decode stages, PD separation directly reduces the user-perceived time to first token (TTFT) and time per output token (TPOT), making interaction faster and smoother. At the same time, thanks to higher resource utilization and better scheduling, overall system throughput also improves substantially, allowing more concurrent requests to be handled efficiently.

2. Hardware resource utilization and efficiency are greatly improved

Unified processing often wastes resources (for example, compute units sit idle during the Decode stage), whereas the differentiated management of PD separation ensures that expensive hardware such as GPUs is used more fully and effectively across the two very different stages of Prefill and Decode. This reduces hardware idling and improves overall computing efficiency and cost-effectiveness.

3. Targeted independent optimization paths are realized

PD separation allows developers to independently apply the most effective, state-of-the-art optimization techniques (specific attention variants, memory management schemes, etc.) to the compute-intensive Prefill stage and the memory-bandwidth-intensive Decode stage. This separation avoids potential conflicts between different optimization strategies and lets researchers explore each stage's performance potential more deeply and precisely.

4. Lay the foundation for advanced system scheduling

Modern, efficient scheduling algorithms such as continuous batching require the system to distinguish between requests in the P state and the D state. PD separation provides this foundation, allowing the scheduler to manage and combine requests intelligently and dynamically to maximize concurrency, further amplifying the performance and efficiency gains, especially in multi-user serving scenarios. The toy scheduler sketch below illustrates the idea.
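As an illustration only (a toy sketch, not how any particular serving engine such as vLLM is actually implemented), the following scheduler loop shows why the P and D states must be distinguishable: new requests need a one-off prefill pass before they can join the per-step decode batch, and finished requests free their slot so waiting requests can be admitted continuously.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    max_new_tokens: int
    generated: int = 0  # tokens decoded so far

def prefill(req: Request) -> None:
    # Placeholder: a real engine runs the full prompt through the model in one
    # parallel pass and fills the KV cache (compute-bound work).
    pass

def decode_one_token(req: Request) -> None:
    # Placeholder: a real engine runs a single autoregressive step that reads the
    # whole KV cache to produce one token (memory-bandwidth-bound work).
    req.generated += 1

def schedule_step(waiting: deque, running: list, max_batch: int) -> None:
    """One toy continuous-batching step: admit prefill work, then batch decode work."""
    # Admit waiting requests (each needs a prefill pass first) up to the batch limit.
    while waiting and len(running) < max_batch:
        req = waiting.popleft()
        prefill(req)
        running.append(req)
    # Every running request advances by exactly one token per decode step.
    for req in running:
        decode_one_token(req)
    # Retire finished requests so new ones can join the batch on the next step.
    running[:] = [r for r in running if r.generated < r.max_new_tokens]

# Example: three requests share decode batches, with new ones slotting in as others finish.
waiting = deque(Request(i, prompt_tokens=255, max_new_tokens=4) for i in range(3))
running: list = []
steps = 0
while waiting or running:
    schedule_step(waiting, running, max_batch=2)
    steps += 1
print(f"served 3 requests in {steps} scheduler steps")
```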

Conclusion

The experiment in this article directly reveals the huge performance disparity between the Prefill and Decode stages of LLM inference: the Decode stage accounts for more than 99% of the total time, with throughput roughly 137 times lower than Prefill. This leads to serious waste of GPU compute during the dominant Decode stage and significantly raises the cost of model deployment.

This result is the fundamental reason PD separation has become a core architectural principle. By managing the parallel potential of Prefill and the memory-bandwidth bottleneck of Decode separately and in a targeted way, PD separation turns the resource waste that pervades the Decode stage into tangible gains in inference performance and cost-effectiveness, overcoming the inefficiency of the traditional unified-processing model.