Prompt compression in large models: making the most of every token

Written by
Silas Grey
Updated on: July 13, 2025
Recommendation

Explore revolutionary technologies to improve the efficiency of large models and maximize the effectiveness of each token.

Core content:
1. The definition of prompt compression and its importance in improving the efficiency of large language models
2. The three goals of prompt compression: reducing costs, increasing speed, and working within token limits
3. How prompt compression addresses the token limits and cost issues of large language models


1. Definition and Objectives of Prompt Compression

Prompt compression is the practice of simplifying and optimizing the input text provided to a large language model while preserving its core meaning and context. The process involves removing redundant information, simplifying sentence structure, and applying specialized compression techniques to minimize the number of tokens used.

Suppose a request is made to a large language model. A detailed prompt might be: "Can you please provide me with a comprehensive summary of Company X's latest quarterly financial report, highlighting both the positives and negatives?" A condensed prompt might be: "Summarize Company X's quarterly report: pros and cons?" Both prompts target the same output, but the condensed version is shorter, clearer, and cheaper to run.

Prompt compression focuses on three goals: reducing costs, since fewer tokens mean lower usage fees; increasing speed, since shorter inputs let the model respond faster; and working within token limits, which matters most when long contexts have to be processed.

2. The Importance of Prompt Compression for Large Language Models

As large language models become more deeply embedded in everyday applications, interacting with them efficiently becomes critical. Although large language models are powerful, they have inherent limitations, most notably token limits, cost, and latency; prompt compression is an effective way to address these challenges.

1. Constraints of Token Limits

Large language models have a maximum token capacity, which covers both the input prompt and the response the model generates. Taking a model with a 4096-token limit as an example, if the input prompt takes up 3500 tokens, the space left for the model's response is extremely limited. Without prompt compression, you may face truncated or incomplete output, loss of important contextual information when the response is shortened, and reduced usability in long-context applications such as document summarization and multi-turn conversations. By compressing the prompt, you make room for more detailed and comprehensive output, improving the model's performance on complex tasks.
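To make the budget arithmetic concrete, here is a minimal sketch of checking a prompt against a token limit using the open-source tiktoken tokenizer; the 4096-token limit, the reserved response budget, and the example prompts are illustrative values, not a prescription for any particular model.

```python
# A minimal sketch of checking a prompt against a model's token budget.
# Assumes the "tiktoken" package is installed; the limit and reserved-response
# figures are illustrative values borrowed from the article's example.
import tiktoken

MODEL_TOKEN_LIMIT = 4096       # total budget for prompt + response
RESERVED_FOR_RESPONSE = 1024   # room we want to leave for the output

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_budget(prompt: str) -> bool:
    """Return True if the prompt leaves enough room for the response."""
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + RESERVED_FOR_RESPONSE <= MODEL_TOKEN_LIMIT

verbose = ("Can you please provide me with a comprehensive summary of "
           "Company X's latest quarterly financial report, highlighting "
           "both the positives and negatives?")
compressed = "Summarize Company X's quarterly report: pros and cons?"

for p in (verbose, compressed):
    print(len(enc.encode(p)), "tokens, fits:", fits_in_budget(p))
```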

2. Cost efficiency

Most large language model providers, such as OpenAI and Anthropic, charge based on token usage. Longer prompts consume more tokens, which directly increases costs, and in high-frequency usage scenarios those costs accumulate quickly. For example, an uncompressed prompt of 2,000 tokens at $0.02 per 1,000 tokens costs $0.04 per request; a compressed prompt of only 500 tokens costs $0.01 per request. For users running thousands of queries per day, the savings are considerable. Prompt compression reduces operating costs without sacrificing output quality.
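The same back-of-the-envelope arithmetic can be scripted; the price per 1,000 tokens and the daily request volume below are illustrative assumptions that mirror the example above, not actual provider pricing.

```python
# A back-of-the-envelope cost comparison mirroring the article's numbers.
# The $0.02 per 1K-token price and the request volume are illustrative only.
PRICE_PER_1K_TOKENS = 0.02

def request_cost(prompt_tokens: int) -> float:
    """Cost of a single request given its prompt length in tokens."""
    return prompt_tokens / 1000 * PRICE_PER_1K_TOKENS

daily_requests = 10_000
for tokens in (2000, 500):
    per_request = request_cost(tokens)
    print(f"{tokens} tokens: ${per_request:.2f}/request, "
          f"${per_request * daily_requests:,.0f}/day at {daily_requests:,} requests")
```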


3. Reduce latency and increase speed

Long prompts are not only more expensive but also take longer to process. In real-time applications such as customer service chatbots or voice assistants, every millisecond counts: users expect fast, smooth interactions, and delays lead to churn. By shortening the input, prompt compression speeds up processing and noticeably improves responsiveness in latency-sensitive environments.

4. Enhanced focus and output quality

Surprisingly, longer prompts can sometimes distract the model. Overly lengthy instructions or redundant information can confuse large language models, resulting in vague or less relevant responses or even misinterpretation of key information. Prompt compression can make prompts clearer and ensure that the model focuses on key points and avoids wasting resources on irrelevant information, thereby improving the quality of the output.

3. Practical application scenarios of prompt compression

Prompt compression has demonstrated great value in many fields, optimizing and improving workflows across industries.

In the field of customer support, concise prompts can achieve faster and lower-cost automatic replies. When handling a large number of customer inquiries, quick response is key. Compressed prompts can allow chatbots to quickly understand questions and give accurate answers, improving customer satisfaction.

In legal document summarization, lengthy and complex contracts and filings take significant time and effort to analyze. Prompt compression can condense these documents into a more manageable form, extracting key information and helping legal professionals analyze and research more efficiently.

In coding assistance scenarios, developers can generate code snippets from minimal instructions. As the pace of software development accelerates, developers need quick inspiration and references while coding. Prompt compression helps coding assistants understand requirements faster, provide accurate code suggestions, and improve development efficiency.

Content creation also benefits. Whether writing marketing copy, blog summaries, or social media posts, cost-effectiveness is an important consideration. With prompt compression, creators can obtain high-quality content-generation suggestions while keeping costs under control.

4. Technical means of prompt compression

Prompt compression involves a variety of strategies, ranging from simple traditional tricks to cutting-edge advanced techniques, all aiming to preserve the intent and quality of the prompt while reducing token usage.

1. Traditional methods

Traditional prompt compression methods are simple and direct and can be applied without specialized tools or models.

  1. Information extraction
    Condense lengthy text into a concise summary, focusing on the core information and removing unnecessary details. For example, shorten "Please explain in detail how photosynthesis in plants works" to "Explain photosynthesis in plants."
  2. Structured prompt design
    Reformat lengthy instructions into bullet points or direct commands, using key words instead of full sentences. For example, “Can you give me a comprehensive summary of this book, including its key themes and major characters?” could be condensed into “Book Summary: Key Themes and Major Characters.”
  3. Keyword extraction
    Identifying and retaining only key terms is very effective in information retrieval or search related applications. For example, for the question “Describe the economic impact of climate change on developing countries”, the keywords “climate change, economic impact, developing countries” are extracted.
  4. Contextual summarization
    Automatically condense the context using a pre-trained summarization model, reducing text length while keeping the semantics intact. (A minimal sketch of these traditional techniques follows this list.)
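As a concrete illustration of these traditional techniques, the sketch below removes filler phrases and extracts keywords using only the Python standard library; the filler-phrase and stop-word lists are illustrative assumptions rather than a canonical recipe.

```python
# A minimal sketch of two traditional techniques: stripping filler phrases and
# extracting keywords. The filler and stop-word lists are illustrative only.
import re

FILLER_PHRASES = [
    r"can you please", r"could you", r"provide me with", r"i would like you to",
]
STOP_WORDS = {"the", "of", "on", "in", "a", "an", "and", "for", "to", "with", "its"}

def strip_fillers(prompt: str) -> str:
    """Remove polite filler phrases that add tokens but no meaning."""
    out = prompt
    for pattern in FILLER_PHRASES:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()

def extract_keywords(prompt: str) -> str:
    """Keep only content-bearing words, e.g. for retrieval-style queries."""
    words = re.findall(r"[A-Za-z']+", prompt.lower())
    return ", ".join(w for w in words if w not in STOP_WORDS)

print(strip_fillers("Can you please provide me with a comprehensive summary "
                    "of this book, including its key themes and major characters?"))
print(extract_keywords("Describe the economic impact of climate change "
                       "on developing countries"))
```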

2. Advanced techniques

  1. LLMLingua Series
    This is a family of methods that improves the efficiency of large language models by compressing the input prompt; it includes LLMLingua, LongLLMLingua, and LLMLingua-2. (A usage sketch appears after this list.)
  • LLMLingua
    This method uses a small, well-trained language model (such as GPT-2 small or LLaMA-7B) to identify and remove non-essential tokens from the prompt. It adopts a coarse-to-fine compression strategy and uses a budget controller to maintain semantic integrity at high compression rates. Interdependencies between tokens are modeled by an iterative token-level compression algorithm, and the distribution of compressed prompts is aligned with the target large language model through instruction tuning. LLMLingua can achieve up to 20x compression with minimal performance loss.
  • LongLLMLingua
    To address the challenges of long-context scenarios, LongLLMLingua improves the ability of large language models to handle long inputs through query-aware compression and reorganization, mitigating the increased compute cost, latency, and performance degradation that long inputs cause. Evaluations show that LongLLMLingua can improve performance by up to 17.1% while reducing the number of tokens by roughly four times.
  • LLMLingua-2
    Building on the previous two, LLMLingua-2 introduces a data distillation approach for task-agnostic prompt compression. Trained on data distilled from GPT-4, it reframes prompt compression as a token classification problem and uses a BERT-level encoder to capture key information from bidirectional context. The method performs well on out-of-domain data and runs 3 to 6 times faster than the original LLMLingua, making it suitable for a wide range of applications.
  • 500xCompressor
    This is an advanced prompt compression method that compresses large amounts of natural language context into very few special tokens, even up to 500 tokens into a single special token, significantly shortening the input prompt. It addresses the increased inference time, high compute cost, and degraded user experience caused by long prompts, with compression ratios ranging from 6x to 480x. Its notable characteristics include a high compression ratio, few additional parameters, zero-shot generalization, non-selective compression, and preserved model performance; it works effectively after large-scale corpus pre-training and fine-tuning on specific datasets.
  • PCToolkit (Prompt Compression Toolkit)
    This is a unified, plug-and-play solution that improves the efficiency of large language models by shortening input prompts while retaining key information. It provides a modular framework that integrates cutting-edge prompt compressors, multiple datasets, and comprehensive evaluation metrics. PCToolkit includes a variety of mainstream compression techniques, offers a user-friendly interface, and its modular design makes it easy to switch between methods, datasets, and metrics. The toolkit has been evaluated on a variety of natural language tasks and shows good versatility and effectiveness.
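As a concrete illustration of the LLMLingua family, here is a minimal sketch using the open-source llmlingua package; the model name, constructor arguments, and compression parameters shown are assumptions based on the library's typical usage and should be verified against its documentation.

```python
# A minimal sketch of prompt compression with the open-source "llmlingua"
# package (pip install llmlingua). The model name and parameters below are
# assumptions; check the library's documentation for the current API.
from llmlingua import PromptCompressor

# LLMLingua-2-style compressor backed by a BERT-level token classifier.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

long_context = "..."  # the lengthy document or conversation history to shrink

result = compressor.compress_prompt(
    long_context,
    instruction="Summarize the quarterly report: pros and cons.",
    question="What are the key positives and negatives?",
    rate=0.33,  # keep roughly one third of the tokens
)

print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])
```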
5. Challenges and strategies for prompt compression

Although prompt compression brings many advantages to large language models, it also faces challenges in practice and needs to be applied with care.

1. Balancing Compression and Context Loss

When compressing prompts, one of the main challenges is how to keep key context while shortening the length. If too much information is removed, large language models may misunderstand user intent, give vague or irrelevant responses, or even miss key details required for accurate answers. For example, if "Summarize the economic impact of climate change on agriculture, considering factors such as crop yield fluctuations, irrigation challenges, and growing season changes" is over-compressed into "Summarize the impact of climate change", the model may ignore specific details related to agriculture. To address this issue, key entities, actions, and results can be prioritized during compression; structured prompts (such as bullet-point lists or keywords) can be used to retain key details; and query-aware compression can be used to ensure that the compressed prompt still reflects the intent of the original question.

2. Risks and Mitigation of Over-Compression

Over-compression may result in vague or incomplete instructions, reduced semantic richness that blunts nuanced responses, and the loss of important qualifiers or constraints (such as timelines or conditions). For example, if "Explain the EU's data privacy regulations, focusing on technology startups' compliance with GDPR" is over-compressed into "Explain data privacy", the model may give a generic explanation and miss the specific concerns. To avoid this risk, set a compression threshold rather than chasing extreme ratios; test iteratively, compressing prompts gradually and evaluating output quality at each stage (a minimal sketch of this iterative approach follows); use a hybrid approach that combines basic summarization with advanced techniques (such as LLMLingua or PCToolkit) for controllable compression; and retain key keywords or entities as contextual anchors to guide the model's response.
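To make the iterative-testing idea concrete, the sketch below lowers the compression rate step by step and keeps the last version that still passes a quality check; both `compress` and `evaluate_output_quality` are hypothetical placeholders, not functions from any particular library.

```python
# A minimal sketch of iterative compression with a quality floor. Both
# compress() and evaluate_output_quality() are hypothetical placeholders:
# plug in a real compressor (e.g. LLMLingua) and your own evaluation.
def compress(prompt: str, rate: float) -> str:
    """Placeholder: return a version of `prompt` keeping ~rate of its tokens."""
    raise NotImplementedError

def evaluate_output_quality(prompt: str) -> float:
    """Placeholder: run the model on `prompt` and score the output (0..1)."""
    raise NotImplementedError

def compress_with_floor(prompt: str, min_quality: float = 0.9) -> str:
    """Try progressively stronger compression; keep the last version that
    still clears the quality threshold."""
    best = prompt
    for rate in (0.8, 0.6, 0.4, 0.2):
        candidate = compress(prompt, rate)
        if evaluate_output_quality(candidate) >= min_quality:
            best = candidate
        else:
            break  # stop before over-compression degrades the output
    return best
```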

3. Considerations for Different Use Cases

Different application scenarios have different requirements for prompt compression. In conversational AI, priority should be given to expressing user intent clearly, so that excessive compression does not disrupt the flow of the conversation; document summarization needs to retain topic-specific keywords and key entities; code generation should avoid deleting important function names, parameters, or code comments.

Prompt compression plays an indispensable role in the application of large language models. It is not only an effective means of dealing with token limits, cost, and latency, but also a key technique for improving the efficiency and quality of large language model applications. By using compression techniques and tools judiciously, taking the needs of different scenarios into account, and balancing compression against context preservation, more efficient, smarter, and more cost-effective large language model applications can be built, advancing the development and adoption of artificial intelligence across fields.