Context compression: the core step of RAG information extraction

Written by Audrey Miles
Updated on: June 28, 2025
Recommendation

Context compression is a key technology to improve the efficiency of LLM information processing.

Core content:
1. Definition and goal of context compression technology
2. Application scenarios of context compression in RAG
3. LangChain implementation of context compression method and case


Document chunks retrieved by the traditional RAG method often contain irrelevant noise, and the LLM's context window is limited. Context compression addresses this by post-processing the retrieved documents so that only the content most relevant to the query is retained. This shortens the context, removes noise, and increases information density, giving the LLM a shorter, cleaner context and improving the quality and efficiency of its answers.

  • 1. What is context compression?
  • 2. Implementation using LangChain
    • 2.1 Code (based on ContextualCompressionRetriever)
    • 2.2 Operation log
    • 2.3 Other LangChain compressor implementations


In RAG, a simple and direct approach is to split documents into equal-sized chunks and store the vector embeddings of those chunks in a vector database.

When a user asks a question, the system embeds the question, performs a similarity search in the vector database to find the most relevant document chunks, and appends them to the LLM's prompt.

One problem with this approach is that when you import data into a vector database, you usually don't know what specific queries will later be used to retrieve these document chunks.

This means that when a specific user question arrives and a chunk is retrieved, even if the chunk contains some relevant content, it is likely to contain irrelevant content as well.

This causes two problems:

  1. Document snippets that are only loosely relevant, or even completely irrelevant, to the user's question act as "noise" that interferes with the LLM, causing it to generate inaccurate or off-topic answers.

  2. LLMs have a maximum input length they can handle (the context window). If the retrieved document snippets are too long and exceed the LLM's context window, we cannot provide all the relevant information to the model.

Contextual compression was created to solve these problems. Following a "retrieve first, then compress" approach, it filters or extracts from each initially retrieved document chunk, keeping only the content most relevant to the current query, before sending it to the generation stage.

1. What is context compression?

Contextual compression refers to the technique of processing and reducing the retrieved raw document chunks before feeding them to the LLM. Its main goals are:

  • Reduce context length: ensure that the context input to the LLM does not exceed its maximum length limit.
  • Remove noise: identify and remove parts that are irrelevant to the user query.
  • Preserve critical information: ensure that the compressed context still contains the most important information needed to answer the query.

Through context compression, we can provide a shorter, cleaner, and more information-dense context to the LLM, thereby improving the quality and efficiency of generated answers.

"Compression" here refers to both compressing the content of a single document block and filtering out irrelevant documents as a whole.


2. Implementation using LangChain

LangChain introduces a DocumentCompressor abstraction that operates on retrieved document chunks through a compress_documents(documents: List[Document], query: str) method.

The core idea is simple: instead of returning retrieved document chunks as-is, we compress each chunk against the question and return only the relevant information.

The goal of compression is to make the information passed to the LLM more relevant. This way, you can also pass more information to the LLM, because in the initial retrieval phase, you can focus on recall (i.e. increase the number of document chunks returned) and let compression take care of precision.

The downside is that additional API calls are needed, scaling with the number of document chunks retrieved, which increases the cost and latency of your application.
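
To make the abstraction concrete, here is a minimal sketch of a custom compressor, assuming the BaseDocumentCompressor base class from langchain_core. The KeywordCompressor name and its naive sentence-overlap heuristic are illustrative inventions, not part of LangChain; real compressors such as LLMChainExtractor use an LLM for this step.

# A toy compressor for illustration only (KeywordCompressor is NOT a LangChain class):
# it keeps only the sentences of each chunk that share a word with the query.
from typing import Optional, Sequence

from langchain_core.callbacks import Callbacks
from langchain_core.documents import Document
from langchain_core.documents.compressor import BaseDocumentCompressor

class KeywordCompressor(BaseDocumentCompressor):
    def compress_documents(
        self,
        documents: Sequence[Document],
        query: str,
        callbacks: Optional[Callbacks] = None,
    ) -> Sequence[Document]:
        query_words = set(query.lower().split())
        compressed = []
        for doc in documents:
            # Keep sentences that share at least one word with the query
            sentences = doc.page_content.split(". ")
            kept = [s for s in sentences if query_words & set(s.lower().split())]
            if kept:   # Drop the chunk entirely if nothing in it matched
                compressed.append(Document(page_content=". ".join(kept), metadata=doc.metadata))
        return compressed

Any compressor that implements this interface can be plugged into ContextualCompressionRetriever, which is exactly what the following code does with the built-in LLMChainExtractor.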

2.1 Code (based on ContextualCompressionRetriever)

Initialize the environment and import files:

Complete the basic settings and use LangChain's TextLoader to load text in preparation for subsequent text processing.

# Import the os module for environment variables and file paths
import os

# Import the Chroma vector database integration
from langchain_chroma import Chroma
# Import the OpenAI chat model and embedding model classes
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Set the OpenAI API key
OPENAI_API_KEY = 'hk-iwtbie191e427'
# Expose the API key as an environment variable
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

### 1. Import files ################################################################################

# Import the text document loader
from langchain_community.document_loaders import TextLoader
# Import the recursive character text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create a text loader instance and load the Journey to the West text file
loader = TextLoader(file_path="../../data/西游记1.txt", encoding='utf-8')
# Load the document content into memory
data = loader.load()

# Print the number of documents loaded
print(f'A total of {len(data)} documents')
# Print the number of characters in the first document
print(f'A total of {len(data[0].page_content)} characters')

Text chunking and vector embedding:

This involves two core steps of a RAG system: text splitting and vector embedding.

First, we use RecursiveCharacterTextSplitter to split the long text into 500-character chunks with a 50-character overlap. Then we use the embedding model to convert these chunks into vector representations and store them in the Chroma vector database, ready for subsequent similarity searches.

### 2. File chunking ########################################################################################

# Create a text splitter instance, setting the chunk size and overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# Split the document
splits = text_splitter.split_documents(data)
# Print the number of chunks after splitting
print(f'A total of {len(splits)} blocks')

### 3. Text embedding ###################################################################################
# Create an OpenAI embedding model instance
embeddings = OpenAIEmbeddings(model="text-embedding-3-large", base_url="https://api.openai-hk.com/v1")

# Import file operation utilities
import shutil

# If the chroma_db directory already exists, delete it
if os.path.exists("./chroma_db"):
    shutil.rmtree("./chroma_db")

# Create a Chroma vector database instance
vectordb = Chroma.from_documents(
    documents=splits,                 # Use the split documents
    embedding=embeddings,             # Use the OpenAI embedding model
    persist_directory="./chroma_db"   # Set the persistence directory
)

Related reading: Text Splitting Based on Text Structure - Text Splitting, an Indispensable Link to RAG

Define the base retriever:

Implement the basic vector retrieval step: create a similarity-based retriever set to return the top 3 most relevant document chunks, perform the retrieval, and then initialize an LLM instance for the subsequent generation step.

### 4. Define the base retriever #######################################################################
# Set the query question
query = "Who did Sun Wukong fight with?"

# Set the number of search results to return
top_k = 3
# Create the base retriever
retriever = vectordb.as_retriever(
    search_type='similarity',     # Use similarity search
    search_kwargs={"k": top_k}    # Set the number of results returned
)
# Perform a search
docs = retriever.invoke(query)

# Print the basic search results
print("===Basic search=========")
for doc in docs:   # Iterate over the search results
    print(doc)
    print("--------------------------")
print("====================================")

# Create an OpenAI chat model instance
llm = ChatOpenAI(
    model="gpt-4.1-nano",   # Use the gpt-4.1-nano model
    temperature=0,          # Set the temperature to 0 for deterministic output
    base_url="https://api.openai-hk.com/v1"   # Specify the API endpoint
)

Related reading: Top-K Similarity Search: Accurately extracting the most relevant knowledge in RAG systems

Define the base compressor LLMChainExtractor:

This code implements document compression, the core of contextual compression. It imports the ContextualCompressionRetriever and LLMChainExtractor classes and creates a document compressor from the previously initialized language model. The compressor processes the previously retrieved document chunks, extracting the content relevant to the question and discarding the rest.

### 5. Define the base compressor #######################################################################
# Import the contextual compression retriever and document compressor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
# Create an LLM document compressor
compressor = LLMChainExtractor.from_llm(llm)
# Compress the search results
docs = compressor.compress_documents(documents=docs, query=query)

# Print the compressed results
print("===Compression=========")
for doc in docs:   # Iterate over the compressed results
    print(doc)
    print("--------------------------")
print("====================================")

Define the contextual compression retriever ContextualCompressionRetriever and the question-answering chain:

Integrate the base compressor and base retriever defined above into a contextual compression retriever, realizing the complete "retrieve first, then compress" process: create a ContextualCompressionRetriever instance and use it to perform a query.

Finally, a RetrievalQA question-answering chain is created to combine the language model and the compression retriever for question answering.

### 6. Define the contextual compression retriever #######################################################################
# Create a contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,   # Set the base compressor
    base_retriever=retriever      # Set the base retriever
)

# Perform a search using the compression retriever
compressed_docs = compression_retriever.invoke(query)

# Print the results of the compression retriever
print("===Compressed Retriever=========")
for doc in compressed_docs:   # Iterate over the search results
    print(doc)
    print("--------------------------")
print("====================================")

# Import the retrieval question-answering chain
from langchain.chains import RetrievalQA

# Create a retrieval question-answering chain instance
qa = RetrievalQA.from_chain_type(
    llm=llm,                           # Set the language model
    retriever=compression_retriever,   # Set the retriever
    return_source_documents=True       # Return the source documents
)
# Execute the Q&A
result = qa.invoke(query)
# Print the answer and the full result
print(result['result'])
print(result)

2.2 Operation log

  1. The basic search section shows the three original document chunks returned for the query "Who did Sun Wukong fight with?"
  2. The compression section shows the results after processing by the LLM compressor. It is clear that:
  • The first document chunk was compressed from more than 500 characters down to just the core of Sun Wukong's fight with the Demon King.
  • The second document chunk was compressed from a long passage to a single key sentence.
  • The third document chunk likewise retains only the key description of the battle between Sun Wukong and the giant spirit.
  3. Finally, the LLM generated a concise and accurate answer based on the compressed context: "Sun Wukong fought with the devil, the giant spirit and other characters."

This example clearly demonstrates the value of context compression: it effectively removes noise and retains the core content relevant to the query, helping the LLM generate more accurate answers.

G:\workspace\idea\py\hello-langchain\.venv\Scripts\python.exe g:\workspace\idea\py\hello-langchain\rag\retriever\contextual_compression.py 
A total of 1 document
A total of 73714 characters
A total of 206 blocks
===Basic search=========
page_content= 'The Monkey King shouted, "This devil has such big eyes that he can't see me!" The devil saw it and laughed, "You are less than four feet tall, less than thirty years old, and have no weapons in your hands. How dare you be so arrogant and want to fight me?" Wukong scolded, "You devil, you have no eyes! You think I am small, but it is not difficult to be big. You think I have no weapons, but my two hands are hooked on the moon in the sky! Don't be afraid, just take my punch!" He jumped up and hit him in the face. The devil stretched out his hand to block it and said, "You are so short and I am so tall. If you want to use your fists, I will use my sword. If I use my sword, I will kill you and make you laugh. Wait until I put down my sword and let's fight." Wukong said, "That's right. Good man! Come on!" The devil threw away his posture and started to fight. Wukong rushed in and bumped into each other. They punched and kicked, and rushed and bumped. It turned out that long fists are empty and short fists are strong. The demon king was hit in the ribs and crotch by Wukong, and several times hit his joints, which severely injured him. He dodged, picked up the big steel knife, and chopped Wukong on the head. Wukong quickly retreated, and the chop missed. Seeing that he was ferocious, Wukong used his body skills to pull out a handful of hair, threw it into his mouth, chewed it into pieces, and sprayed it into the air, shouting "Change!", and he turned into three or two hundred little monkeys, gathering around him. '  metadata={'source': '../../data/西游记1.txt'}
--------------------------
page_content= 'Good Monkey King, jump to the bridge, use a water-blocking method, twist the magic, dive into the waves, split the waterway, and go straight to the bottom of the East China Sea. As he was walking, he suddenly saw a patrolling sea yaksha, who stopped him and asked: "Who is that holy man pushing the water? Tell me clearly so that I can inform and welcome him." Wukong said: "I am Sun Wukong, a natural saint born in the Flower-Fruit Mountain, and a close neighbor of your old Dragon King. Why don't you recognize me?" The yaksha heard it and quickly reported to the Crystal Palace: "Your Majesty, there is a natural saint born in the Flower-Fruit Mountain, Sun Wukong, who claims to be your close neighbor and will come to the palace." Ao Guang, the Dragon King of the East China Sea, immediately got up and, with his dragon sons, dragon grandsons, shrimp soldiers, and crab generals, went out of the palace to greet him, saying, "Please come in, Immortal, please come in." They met in the palace, sat down and offered tea, and then asked, "When did you attain the Tao, and what magic did you teach?" Wukong said, "Since I was born, I became a monk and practiced, and I have obtained a body that is neither born nor destroyed. Recently, I have been teaching my children and grandchildren to guard the cave, but I don't have any weapons. I have heard that my virtuous neighbors enjoy the Jade Palace and Beique, so there must be some extra magic weapons, so I came to ask for one." Seeing this, the Dragon King couldn't refuse, so he asked Mandarin Fish Commander to take out a big sword and offer it to him. Wukong said, "I can't use a sword, please give me another one." The Dragon King also asked Captain Bay to lead the Eel Warriors to bring out a nine-pronged fork. Wukong jumped down, took it in his hand, used it all the way, put it down and said, "It's light! Light! Light! It's not suitable for my hand! I beg for another one." The Dragon King laughed and said, "My Lord, look at this. This fork weighs 3,600 kilograms!" Wukong said, "It's not suitable for my hand! It's not suitable for my hand!" The Dragon King was frightened and ordered the commander-in-chief and the general Li to bring out a painted halberd, which weighed 7,200 kilograms. Wukong saw'  metadata={'source': '../../data/西游记1.txt'}
--------------------------
page_content= 'The stick is called Ruyi, and the axe is called Xuanhua. When the two met for the first time, they didn't know who was better; the axe and the stick were used left and right. One was hiding his magic, and the other boasted loudly. He used his power to spit out clouds and mist; he spread his hands to spread soil and sand. The heavenly generals have magical powers, and the Monkey King has unlimited changes. The stick is raised like a dragon playing in the water, and the axe is like a phoenix flying through flowers. The giant spirit's reputation spread throughout the world, but it turned out that his skills were not as good as his; the Great Sage swung the iron stick lightly, and it hit his head and his whole body was numb. The giant spirit could not resist him, and was hit on the head by the Monkey King. He hurriedly blocked the axe, and with a crack, he broke the axe handle into two pieces, and quickly retreated to escape. The Monkey King laughed and said, "Pustules! Pustules! I have spared you, you go and report the news! Go and report the news!"
    When the giant spirit returned to the camp gate, he saw the Pagoda-Bearing Heavenly King directly, and knelt down hurriedly and said, "Bimawen is indeed very powerful! I couldn't defeat him, so I came back to apologize for my defeat." Heavenly King Li was furious and said, "This guy has blunted my edge, take him out and kill him!" Prince Nezha appeared and said, "Father, please calm down and forgive the giant spirit's crime. When I go out on a battle, you will know my depth." The Heavenly King listened to the advice and told him to return to the camp and wait for punishment.
    This prince Nezha, in full armor, jumped out of the camp and rushed to the outside of the Water Curtain Cave. Wukong was coming to withdraw his troops and saw Nezha's bravery. Good prince:'
 metadata={'source': '../../data/西游记1.txt'}
--------------------------

===Compression=========
page_content= 'The Monkey King shouted, "This devil has such big eyes that he can't see me!" The devil saw it and laughed, "You are less than four feet tall, less than thirty years old, and you have no weapons in your hands. How dare you be so arrogant and want to fight me?" Wukong scolded, "You devil, you have no eyes! You think I am small, but it is not difficult to be big. You think I have no weapons, but my two hands are hooked on the moon in the sky! Don't be afraid, just take my punch!" He jumped up and hit him in the face. The devil stretched out his hand to block it and said, "You are so short and I am so tall. If you want to use your fists, I will use my sword. If I use my sword, I will kill you and make you laugh. Wait until I put down my sword and let's fight with you." Wukong said, "That's right. Good man! Come on!" The devil threw away his posture and started to fight. Wukong rushed in and bumped into each other. They punched and kicked, rushing and bumping. '  metadata={'source': '../../data/西游记1.txt'}
--------------------------
page_content= 'Wukong said: "After I was born, I became a monk and practiced Buddhism, and I have obtained a body that is neither born nor destroyed. Recently, I have been teaching my children and grandchildren to guard the cave, but I don't have any weapons. I have heard that my virtuous neighbors are enjoying the Jade Palace and Beique, so they must have extra magic weapons, so I came to ask for one."'  metadata={'source': '../../data/西游记1.txt'}
--------------------------
page_content= 'The Monkey King's transformations are truly limitless. His club is raised like a dragon playing in the water, and his axe is wielded like a phoenix flying through flowers. The giant spirit's fame spreads across the world, but it turns out that he is not as capable as the Great Sage; the Great Sage gently swings his iron club, and it hits his head, and his whole body is numb. The giant spirit could not resist him, and was hit on the head by the Monkey King's club. He hurriedly blocked the blow with his axe, and with a "crack", he broke the axe handle in two, and quickly retreated to escape. The Monkey King laughed and said, "Pustules! Pustules! I have spared you, go and tell the news! Go and tell the news!"'  metadata={'source': '../../data/西游记1.txt'}
--------------------------

====================================
Sun Wukong fought with the devil, the giant spirit and other characters.

2.3 Other LangChain compressor implementations

LLMChainFilter: uses an LLM chain to decide which of the initially retrieved document chunks to keep and which to drop, without modifying their content.

LLMListwiseRerank: uses LLM-based listwise reranking of documents; a more reliable but more expensive option.

EmbeddingsFilter: embeds the documents and the query, and returns only the documents whose embedding similarity to the query exceeds a threshold.
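
As a rough sketch of swapping compressors (reusing the embeddings, retriever, and query objects from the code above; the similarity_threshold value is an illustrative assumption you would tune for your data), an EmbeddingsFilter can replace the LLMChainExtractor without any other changes to the pipeline:

# Sketch: swap the LLM extractor for the cheaper embeddings-based filter
# (reuses embeddings/retriever/query from the code above; the threshold is illustrative)
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter

# Keep only chunks whose embedding similarity to the query exceeds the threshold
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.6)

# Same "retrieve first, then compress" pipeline, with no LLM calls during compression
embeddings_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,   # Embeddings filter as the compressor
    base_retriever=retriever             # Reuse the base retriever defined earlier
)
compressed_docs = embeddings_retriever.invoke(query)

Because the EmbeddingsFilter never calls the LLM, it avoids the per-chunk API cost and latency of LLMChainExtractor, at the price of coarser filtering: it keeps or drops whole chunks rather than extracting the relevant sentences from them.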