Understanding RAG Part 5: Managing context length

An in-depth look at the challenges and strategies of RAG systems in managing long contexts.
Core content:
1. The context length limits of large language models and their impact
2. Four strategies for managing context length in RAG systems
3. Applying and optimizing contextual summarization in RAG
Retrieval-augmented generation (RAG), on the other hand, integrates external knowledge from retrieved documents (typically stored in a vector database) to enhance the context and relevance of LLM outputs. However, managing context length in RAG systems remains a challenge: in scenarios that require a lot of contextual information, the retrieved content must be selected and summarized effectively so that it stays below the LLM's input limit without losing essential knowledge.
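To make that input-limit constraint concrete, here is a minimal sketch of trimming retrieved context to a token budget before prompting the LLM. The `tiktoken` encoding name, the 3,000-token budget, and the `fit_to_budget` helper are illustrative assumptions, not part of any specific RAG framework.

```python
# Minimal sketch: keep only as many retrieved chunks as fit in a token budget.
# The encoding name and budget are illustrative assumptions.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(chunks: list[str], max_tokens: int = 3000) -> list[str]:
    """Assumes chunks are sorted by relevance; keep them until the budget is exhausted."""
    kept, used = [], 0
    for chunk in chunks:
        n = len(enc.encode(chunk))
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return kept
```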
Long-Context Management Strategies in RAG
RAG systems have several strategies for incorporating as much relevant retrieved knowledge as possible alongside the initial user query, while staying within the model's input constraints. Four of these strategies are outlined below, from simplest to most sophisticated.
1. Document Chunking
Document chunking is usually the simplest strategy: it splits the documents in the vector database into smaller chunks. Although it may not be obvious at first glance, this helps overcome the LLM's context length limits within a RAG system in several ways, for example by reducing the risk of retrieving redundant information while preserving contextual integrity within each chunk.
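As a rough illustration, the sketch below splits a document into fixed-size, overlapping character chunks before indexing; the chunk size, overlap, and function name are arbitrary choices for the example. In practice, chunking is often done on sentence or paragraph boundaries rather than raw characters so that each chunk stays coherent.

```python
def chunk_document(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split a document into overlapping character-based chunks (sizes are illustrative)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```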
2. Selective Retrieval
Selective retrieval applies a filtering step to a large set of related documents so that only the most relevant parts are retrieved, reducing the size of the input sequence passed to the LLM. By intelligently filtering which parts of the retrieved documents are kept, the goal is to avoid including irrelevant or extraneous information.
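One simple way to implement this filtering, sketched below, is to score each retrieved chunk by cosine similarity to the query embedding and keep only the top-scoring chunks above a threshold; the threshold, top-k value, and function name are assumptions made for illustration.

```python
import numpy as np

def select_relevant(query_vec: np.ndarray, chunks: list[str], chunk_vecs: np.ndarray,
                    threshold: float = 0.75, top_k: int = 5) -> list[str]:
    """Keep up to top_k chunks whose cosine similarity to the query exceeds the threshold."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [chunks[i] for i in order[:top_k] if sims[i] >= threshold]
```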
3. Targeted Retrieval
While similar to selective retrieval, targeted retrieval means retrieving data with a very specific intent or end response in mind. This is achieved by optimizing the retriever for a particular type of query or data source, for example building retrievers specifically for medical texts, news articles, or the latest scientific publications. In short, it is an evolved, more specialized form of selective retrieval that adds domain-specific criteria into the loop.
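A very small sketch of this idea is shown below: the query is routed to a retriever specialized for its apparent domain. The keyword lists and retriever interfaces are hypothetical placeholders; a real system would more likely use a trained intent classifier or metadata filters.

```python
from typing import Callable

def route_query(query: str,
                retrievers: dict[str, Callable[[str], list[str]]],
                default: str = "general") -> list[str]:
    """Route the query to a domain-specific retriever via naive keyword matching (illustrative)."""
    domain_keywords = {
        "medical": ["symptom", "diagnosis", "treatment"],
        "news": ["today", "latest", "breaking"],
    }
    for domain, keywords in domain_keywords.items():
        if domain in retrievers and any(k in query.lower() for k in keywords):
            return retrievers[domain](query)
    return retrievers[default](query)
```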
4. Contextual Summarization
Contextual summarization is a more sophisticated approach to managing context length in RAG systems, in which text summarization techniques are applied while building the final context. One possible approach is to use an additional language model (usually smaller and trained for summarization) to condense the large number of retrieved documents. Summarization can be either extractive or abstractive: extractive summarization identifies and pulls out the most relevant passages, while abstractive summarization generates a summary from scratch, reformulating and condensing the original text. In addition, some RAG solutions use heuristics to evaluate the relevance of text fragments (e.g., individual chunks) and discard the less relevant ones.
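The sketch below shows one possible abstractive variant: a smaller summarization model condenses each retrieved chunk before the summaries are joined into the final context. The specific model checkpoint and length limits are illustrative assumptions, not a recommendation.

```python
from transformers import pipeline

# A smaller summarization model; the checkpoint name is an illustrative choice.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def summarize_context(chunks: list[str]) -> str:
    """Condense each retrieved chunk abstractively and join the summaries into one context."""
    summaries = []
    for chunk in chunks:
        out = summarizer(chunk, max_length=120, min_length=30, do_sample=False)
        summaries.append(out[0]["summary_text"])
    return "\n".join(summaries)
```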
| Strategy | Summary |
| --- | --- |
| Document Chunking | Divide documents into smaller, more coherent chunks to preserve context while reducing redundancy and staying within LLM limits. |
| Selective Retrieval | Filter large sets of relevant documents to retrieve only the most relevant parts, minimizing irrelevant information. |
| Targeted Retrieval | Use retrievers specialized for a specific query intent or data source, adding domain-specific criteria to refine the results. |
| Contextual Summarization | Use extractive or abstractive summarization to condense large amounts of retrieved content, ensuring that the necessary information reaches the LLM. |
Long-Context Language Models
What about long-context LLMs? Are they enough on their own, making RAG unnecessary?
This is an important question. Long-context LLMs (LC-LLMs) are "super-sized" LLMs that can accept very long input token sequences. Although research evidence suggests that LC-LLMs generally outperform RAG systems, RAG still offers unique advantages, especially in scenarios that demand dynamic, real-time information retrieval and cost-effectiveness. In such applications, it is worth considering a smaller LLM wrapped in a RAG system that adopts the strategies above instead of an LC-LLM. Neither is a one-size-fits-all solution; each has advantages in its own setting.
Summary
This article introduced and illustrated four strategies for managing context length in RAG systems, helping the LLMs in such systems handle long contexts when there are limits on the acceptable input length in a single interaction. While using so-called long-context LLMs has become a trend in recent years to overcome this problem, in some cases it may still be worthwhile to stick with RAG systems, especially in dynamic information retrieval scenarios that require real-time context updates.