SGLang: An inference engine with 5 times higher throughput than vLLM

SGLang: Breaking through the bottlenecks of LLM applications and delivering up to 5 times the throughput of vLLM.
Core content:
1. Performance bottleneck problems faced by LLM applications
2. SGLang's software and hardware collaborative design concept
3. Technical features and advantages of RadixAttention and front-end DSL
LLM application scenarios have expanded far beyond simple conversation to complex tasks that require multiple rounds of interaction, intricate control logic, and integration with external environments. Building complex, efficient, and controllable LLM applications therefore still runs into several bottlenecks, such as:
Slow inference speed: complex LLM applications usually require multiple model calls, and repeated calculations and data transmission lead to high overall latency.
Insufficient controllability: traditional methods make it difficult to precisely control the LLM generation process, limiting the flexibility and reliability of applications.
High programming complexity: lacking programming languages and tools designed specifically for LLM applications, developers have to spend a lot of time on low-level details.
To break through these bottlenecks, SGLang was created. It applies the concept of software-hardware co-design, optimizing everything from the backend runtime system to the frontend programming language, so that developers can build high-performance, highly controllable LLM applications more quickly and easily, with up to 5 times the throughput of vLLM.
Technical Features
SGLang's two core features are RadixAttention and a front-end DSL. Together, these components bring a qualitative leap to LLM applications.
1. RadixAttention: Automatic KV Cache Reuse
When an LLM generates text, it maintains a KV cache that stores the intermediate results computed for previously processed tokens. In multi-turn conversations or complex tasks, many requests share the same prefix, such as the same system prompt or conversation history. Traditional inference systems often recompute the KV cache for these shared prefixes, causing a large amount of redundant computation and wasted memory. Some systems do support KV cache reuse, but they usually require manual configuration and struggle with complex reuse patterns.
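To make the prefix-sharing idea concrete, here is a minimal sketch (not SGLang code, and the token IDs are made up for illustration): two requests that begin with the same system prompt only need the KV cache for that shared prefix to be computed once.

```python
def shared_prefix_len(tokens_a, tokens_b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n

# Hypothetical token IDs: both requests start with the same system prompt.
system_prompt = [101, 7592, 2088, 2003, 1037]
request_1 = system_prompt + [2054, 2003, 102]   # "...question 1"
request_2 = system_prompt + [3459, 2129, 102]   # "...question 2"

print(shared_prefix_len(request_1, request_2),
      "prefix tokens need their KV cache computed only once")
```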
(Figure: the blue boxes are the sharable prompt parts, the green boxes are the non-shared parts, and the yellow boxes are the non-shared model outputs. Sharable parts include few-shot examples, the question in self-consistency, the chat history in multi-turn dialogue, and the search history in tree-of-thought (ToT).)
SGLang proposes RadixAttention, an automatic and efficient KV cache reuse technique. It organizes the KV cache in a radix tree data structure and combines an LRU (Least Recently Used) eviction policy with a cache-aware scheduling policy, so that shared KV caches across different LLM calls are automatically identified and reused at runtime. A simple analogy: think of RadixAttention as a smart librarian. The library (GPU memory) stores a large number of books (KV caches), and each book has a unique title (token sequence). When a new reader (LLM request) arrives, the librarian (RadixAttention) can quickly check whether the library already holds a book containing the information the reader needs (a KV cache with a shared prefix). If it does, the book is reused directly rather than a new copy being purchased (recomputation), which saves a great deal of time and resources.
As shown in the figure below, each path from the root of the radix tree corresponds to a cached token sequence, with edges labeled by tokens. When a new request arrives, RadixAttention performs prefix matching in the tree, finds the node with the longest shared prefix, and reuses its KV cache. The strength of the radix tree lies in efficient prefix search, insertion, and eviction, which lets it handle a wide variety of complex KV cache reuse patterns.
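The sketch below illustrates the underlying idea with a simplified prefix tree in Python. It is not SGLang's implementation: a real radix tree compresses chains of single-child nodes, the cache holds actual KV tensors on the GPU, and eviction interacts with the scheduler; here the "KV" is just a placeholder string and each node holds one token.

```python
import time

class Node:
    """One node per token; a real radix tree compresses single-child chains."""
    def __init__(self):
        self.children = {}            # token id -> Node
        self.kv = None                # stand-in for the cached KV tensors
        self.last_used = time.monotonic()

class PrefixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = time.monotonic()
            matched += 1
        return matched

    def insert(self, tokens):
        """Add a token sequence, creating nodes only for the uncached suffix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_used = time.monotonic()
            node.kv = f"KV for prefix ending at token {t}"

    def evict_lru_leaf(self):
        """Free memory by dropping the least recently used leaf node."""
        best = None                   # (last_used, parent, token)
        stack = [self.root]
        while stack:
            node = stack.pop()
            for tok, child in node.children.items():
                if child.children:
                    stack.append(child)
                elif best is None or child.last_used < best[0]:
                    best = (child.last_used, node, tok)
        if best is not None:
            del best[1].children[best[2]]

cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # first request fills the cache
print(cache.match_prefix([1, 2, 3, 9]))    # -> 3: the prefix [1, 2, 3] is reusable
cache.evict_lru_leaf()                     # called under memory pressure
```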
RadixAttention's advantages are also clear:
Automation: no manual configuration is required; shared KV caches are identified and reused automatically.
Efficiency: the radix tree structure and cache policies ensure efficient cache management and reuse.
Versatility: it is compatible with existing techniques such as continuous batching and paged attention, and can be extended to multimodal models.
2. Front-end Python-embedded DSL to Simplify LLM Programming
SGLang does not stop at backend optimizations; it also provides a domain-specific language (DSL) embedded in Python, designed to simplify programming LLM applications. It lets users easily express advanced prompting techniques, control flow, multimodal input, parallelism, and external interaction. SGLang programs can be executed in interpreter mode or compiler mode.
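For a flavor of the DSL, here is a small sketch based on SGLang's documented frontend primitives (sgl.function, sgl.gen, and the chat roles). Exact argument names and defaults may differ across versions, and the local endpoint URL is an assumption; a server must already be running there.

```python
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    # The shared system prompt and chat history form a common prefix that
    # RadixAttention can reuse across calls.
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

# Assumes a local SGLang server; the port is a placeholder.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = multi_turn_qa.run(
    question_1="What is a radix tree?",
    question_2="How does it help with KV cache reuse?",
)
print(state["answer_2"])
```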
The figure below shows an example of a multi-dimensional paper scorer implemented with SGLang. It uses the branch-solve-merge prompting technique to evaluate a paper's quality along multiple dimensions and then generates a summary and scores. With these concise yet powerful APIs, developers can build complex LLM application logic without worrying about low-level model calls and cache management details.
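As a rough sketch of that branch-solve-merge pattern (the dimensions, prompts, and token budgets below are made up; the paper's actual example differs), a forked SGLang program might look like this, reusing the backend set up above:

```python
import sglang as sgl

# Hypothetical scoring dimensions; the real example defines its own rubric.
DIMENSIONS = ["clarity", "originality", "technical soundness"]

@sgl.function
def paper_scorer(s, paper_text):
    s += "You are reviewing the following paper:\n" + paper_text + "\n"
    # Branch: fork the prompt state once per dimension; the shared prefix
    # (the paper text) is served from the KV cache rather than recomputed.
    forks = s.fork(len(DIMENSIONS))
    for f, dim in zip(forks, DIMENSIONS):
        f += f"Assess the paper's {dim} in two sentences:\n"
        f += sgl.gen("judgment", max_tokens=128)
    # Merge: gather the branch outputs and ask for a summary and final score.
    s += "Per-dimension assessments:\n"
    for f, dim in zip(forks, DIMENSIONS):
        s += f"- {dim}: " + f["judgment"] + "\n"
    s += "Summarize the assessments and give an overall score from 1 to 10:\n"
    s += sgl.gen("summary", max_tokens=256)

state = paper_scorer.run(paper_text="<paper text here>")
print(state["summary"])
```

Reading a fork's output (f["judgment"]) waits for that branch to finish, so the merge step always sees completed branch results.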
Performance
SGLang achieves significant improvements in throughput and latency through automatic KV cache reuse, parallelism within the interpreter, and co-design of the frontend and backend. Across a series of benchmarks, it delivers up to 5 times the throughput of existing systems such as Guidance and vLLM.
Summary
As a rising star, SGLang stands on the shoulders of giants (SGLang Runtime imports some model and layer implementations from vLLM, but redesigns the batching and cache scheduling). It focuses on the new pain points encountered when building LLM applications and achieves very good results in both performance and development efficiency. At the same time, because the project is still young, usability has some rough edges (configuration is more involved than vLLM's). There is still a long way to go, but the idea of improving inference serving for complex LLM applications is exactly right, the outlook is promising, and the project is well worth following and studying.
Project address: https://github.com/sgl-project/sglang