Understanding RAG Part 5: Managing context length

Written by Clara Bennett
Updated on: June 27, 2025

An in-depth look at the challenges RAG systems face in managing long contexts and the strategies for handling them.

Core content:
1. The context length limits of large language models and their impact
2. Four strategies for managing context length in RAG systems
3. Applying and optimizing contextual summarization techniques in RAG

One of the main limitations of traditional large language models (LLMs) is their limited context length, which restricts the amount of information that can be processed in a single user-model interaction. Addressing this limitation has been one of the main directions of work in the LLM development community, and it has raised awareness of the advantages of longer contexts for generating more coherent and accurate responses. For example, GPT-3, released in 2020, has a context length of 2,048 tokens, while its younger but more powerful sibling GPT-4 Turbo, released in 2023, can process up to 128K tokens in a single prompt. That is roughly equivalent to processing an entire book in a single interaction, for example to summarize it.

Retrieval-augmented generation (RAG), on the other hand, incorporates external knowledge from retrieved documents (usually stored in a vector database) to enhance the context and relevance of the LLM's output. However, managing context length in RAG systems remains a challenge: in scenarios that require a lot of contextual information, the retrieved content must be selected and condensed effectively so that it stays below the LLM's input limit without losing necessary knowledge.

Strategies for Managing Long Context in RAG

A RAG system can apply several strategies to incorporate as much relevant retrieved knowledge as possible alongside the initial user query before passing everything to the LLM, while staying within the model's input constraints. Four of these strategies are outlined below, from simplest to most sophisticated.

1. Document Chunking

Document chunking is usually the simplest strategy: it splits the documents in the vector database into smaller chunks. Although it may not be obvious at first glance, this helps overcome the LLM's context length limits within a RAG system in several ways, for example by reducing the risk of retrieving redundant information while keeping each chunk internally coherent.
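
As a rough illustration, a simple word-based chunker with a small overlap between consecutive chunks might look like the sketch below; the `chunk_document` helper and its parameter values are illustrative assumptions, not part of the article.

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-based chunks.

    chunk_size and overlap are counted in words here for simplicity;
    production systems often count tokens or split on sentence boundaries.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk_words = words[start:start + chunk_size]
        if chunk_words:
            chunks.append(" ".join(chunk_words))
    return chunks

# Each chunk would then be embedded and stored in the vector database individually.
```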

2. Selective Retrieval

Selective retrieval applies a filtering process to a large set of relevant documents so that only the most relevant parts are retrieved, thereby reducing the size of the input sequence passed to the LLM. By intelligently filtering which parts of the retrieved documents to keep, the goal is to avoid including irrelevant or extraneous information.
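
Assuming the vector store returns (chunk, similarity score) pairs, the filtering step could be as simple as the threshold-plus-top-k sketch below; the function name and threshold values are hypothetical.

```python
def select_relevant_chunks(retrieved: list[tuple[str, float]],
                           min_score: float = 0.75,
                           max_chunks: int = 5) -> list[str]:
    """Keep only the highest-scoring chunks that pass a relevance threshold."""
    # Drop chunks below the similarity threshold, then keep the top-k that remain.
    passing = [item for item in retrieved if item[1] >= min_score]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in passing[:max_chunks]]

# Example usage (assuming a vector search that returns scored chunks):
# results = [("chunk about topic A", 0.82), ("loosely related chunk", 0.41)]
# context_chunks = select_relevant_chunks(results)  # -> only the 0.82 chunk
```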

3. Targeted Retrieval

While similar to selective retrieval, the essence of targeted retrieval is to retrieve data with a very specific intent or end response in mind. This is achieved by optimizing the retriever mechanism for a particular type of query or data source, for example by building retrievers specifically for medical texts, news articles, the latest scientific breakthroughs, and so on. In short, it is an evolved, more specialized form of selective retrieval that adds domain-specific criteria into the loop.
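
One way to picture this, purely as a sketch, is a router that sends each query to a retriever specialized for a given domain; the `DomainRouter` class, the keyword rules, and the retrievers' `search` method are all assumptions made for illustration.

```python
class DomainRouter:
    """Route queries to domain-specific retrievers (e.g. medical vs. news)."""

    def __init__(self, retrievers: dict, default_domain: str = "general"):
        # e.g. {"medical": medical_retriever, "news": news_retriever, "general": ...}
        self.retrievers = retrievers
        self.default_domain = default_domain

    def classify_domain(self, query: str) -> str:
        # Placeholder keyword rules; a small classifier model could be used instead.
        lowered = query.lower()
        if any(term in lowered for term in ("symptom", "dosage", "diagnosis")):
            return "medical"
        if any(term in lowered for term in ("election", "headline", "breaking")):
            return "news"
        return self.default_domain

    def retrieve(self, query: str, k: int = 5):
        # Each retriever is assumed to expose a search(query, k) method.
        domain = self.classify_domain(query)
        retriever = self.retrievers.get(domain, self.retrievers[self.default_domain])
        return retriever.search(query, k=k)
```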

4. Contextual Summarization

Contextual summarization is a more sophisticated approach to managing context length in RAG systems, in which text summarization techniques are applied while building the final context. One possible approach is to use an additional language model (usually smaller and trained for summarization tasks) to condense the large number of retrieved documents. Summarization can be either extractive or abstractive: extractive summarization identifies and pulls out the most relevant text passages, while abstractive summarization generates a summary from scratch, reformulating and simplifying the original text. In addition, some RAG solutions use heuristics to evaluate the relevance of text fragments (e.g., individual chunks) and discard the less relevant ones.
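
As a sketch of the abstractive variant, a small summarization model could condense each retrieved chunk before the final context is assembled; the Hugging Face transformers summarization pipeline is used here, and the model choice and length limits are illustrative assumptions.

```python
from transformers import pipeline

# A small, summarization-tuned model stands in for the "additional language model".
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_context(chunks: list[str], max_length: int = 120) -> str:
    """Condense each retrieved chunk and join the summaries into one context block."""
    summaries = [
        summarizer(chunk, max_length=max_length, min_length=30, do_sample=False)[0]["summary_text"]
        for chunk in chunks
    ]
    return "\n".join(summaries)

# The condensed context is then combined with the user query in the final prompt.
```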

| Strategy | Summary |
| --- | --- |
| Document chunking | Divide documents into smaller, more coherent chunks to preserve context while reducing redundancy and staying within LLM limits. |
| Selective retrieval | Filter large sets of relevant documents to retrieve only the most relevant parts, minimizing irrelevant information. |
| Targeted retrieval | Use retrievers specialized for particular query intents or data sources, adding domain-specific criteria to refine results. |
| Contextual summarization | Use extractive or abstractive summarization techniques to condense large amounts of retrieved content, ensuring that the necessary information still reaches the LLM. |

Long-Context Language Models

What about long-context LLMs? Are they enough on their own, making RAG unnecessary?

This is an important question worth addressing. Long-context LLMs (LC-LLMs) are "super-large" LLMs that can accept very long sequences of input tokens. Although research evidence shows that LC-LLMs generally outperform RAG systems, RAG still has unique advantages, especially in scenarios that require dynamic, real-time information retrieval and cost-effectiveness. In such applications, it is worth considering a smaller LLM wrapped in a RAG system that adopts the strategies above instead of an LC-LLM. Neither is a one-size-fits-all solution; each has advantages in its own setting.

Summary

This article introduced and illustrated four strategies for managing context length in RAG systems, helping the LLM in such a system handle long contexts despite the limits on how much input it can accept in a single interaction. While so-called long-context LLMs have become a recent trend for overcoming this problem, in some cases it may still be worthwhile to stick with a RAG system, especially in dynamic information retrieval scenarios that require real-time context updates.