RAG techniques and underlying code analysis

An in-depth look at RAG, walking you through building a RAG system from scratch!
Core content:
1. RAG principles and an implementation with basic Python libraries
2. The complete process from text preprocessing to generation optimization
3. 9 practical techniques for breaking through the answer-quality bottleneck
Introduction
[Building the RAG core from scratch with basic Python libraries] Are you still relying on off-the-shelf frameworks to implement RAG? This article opens up the technical black box and builds a RAG system using only basic Python libraries such as numpy. From text chunking, vectorization, and similarity retrieval to generation optimization, the core logic of Retrieval-Augmented Generation is dissected line by line, along with an in-depth look at 9 practical techniques, from intelligent chunking strategies to dynamic context compression, to help you break through the answer-quality bottleneck. Stop being a mere parameter tuner and master the underlying DNA of RAG once and for all!
Most of us are already familiar with RAG (Retrieval-Augmented Generation). In practice, many people use frameworks such as LangChain or FAISS to implement RAG. But have you ever tried implementing a RAG system from scratch by hand?
To help you understand how RAG works from the ground up, this article walks you step by step through building a simple RAG system. Along the way we will not use any complex frameworks, relying only on the familiar Python standard library and common scientific computing libraries such as numpy.
1. Starting from 0: Simple RAG Implementation
Before building a more complex RAG architecture, we start with the most basic version. The entire process can be divided into the following key steps:
1. Data import: Load and preprocess raw text data to prepare for subsequent processing.
2. Text Chunking: Split long text into smaller paragraphs or sentences to improve retrieval efficiency and relevance.
3. Create Embedding: Use the embedding model to convert text blocks into vector representations to facilitate semantic comparison and matching.
4. Semantic search: Based on the query content entered by the user, the most relevant text blocks are retrieved from the existing vector library.
5. Response generation: Based on the retrieved relevant content, the final answer output is generated in combination with the language model.
Setting up the environment
First, we need to import the necessary libraries:
import os
import numpy as np
import json
import fitz
import dashscope
from openai import OpenAI

os.environ['DASHSCOPE_API_KEY'] = "your dashscope api key"
Extract text from PDF files
First we need a text data source. In this article, we use the PyMuPDF library to extract text from PDF files. Here is a function defined to extract text from PDF:
def extract_text_from_pdf(pdf_path):
    # Open the PDF file
    document = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Traverse each page in the PDF
    for page_num in range(document.page_count):
        page = document[page_num]  # Get the page
        text = page.get_text("text")  # Extract text from the page
        all_text += text  # Append the extracted text to the all_text string

    return all_text  # Return the extracted text
Divide the extracted text into chunks
Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.
def chunk_text(text_input, chunk_size, overlap_size):
    text_chunks = []  # Initialize a list to store text chunks
    # Loop through the text with a step size of (chunk_size - overlap_size)
    for i in range(0, len(text_input), chunk_size - overlap_size):
        text_chunks.append(text_input[i:i + chunk_size])  # Append the slice from i to i + chunk_size
    return text_chunks  # Return the list of text chunks
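To see the overlap behavior concretely, here is a quick check on a made-up string (the values are illustrative, not from the article):

# Toy example: chunks of 10 characters with an overlap of 3 (step size 7)
sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, overlap_size=3):
    print(chunk)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz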
Setting up the OpenAI API client
Initialize the OpenAI client for generating embeddings and responses.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you don't have the environment variable configured, replace this with your API Key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian service
)
Extract and chunk text from PDF files
Now, we load the PDF, extract the text, and split it into chunks.
# Define the PDF file path
pdf_path = "knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Split the extracted text into chunks of 1000 characters with an overlap of 100 characters
text_chunks = chunk_text(extracted_text, 1000, 100)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text block:")
print(text_chunks[0])
Number of text blocks: 5
First block of text:
What is the intelligent coding assistant Tongyi Lingma
Intelligent Coding Assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud. It provides capabilities such as intelligent code generation, intelligent Q&A, multi-file modification, and a programming agent, bringing developers an efficient and smooth coding experience and leading a new paradigm of AI-native R&D. At the same time, we provide enterprise customers with Enterprise Standard and Enterprise Exclusive editions, which offer enterprise-level scenario customization and private-domain knowledge enhancement to help enterprises upgrade their R&D intelligence.
Core Capabilities
Code Completion
Trained on a large amount of high-quality open source code, it can generate line-level/function-level code, unit tests, and code optimization suggestions based on the current code file and cross-file context. With an immersive coding flow and second-level generation speed, you can focus on technical design and complete coding work efficiently.
Ask Mode
The intelligent Q&A mode draws on a large body of R&D documents, product documents, and general R&D knowledge, combined with project-level perception capabilities, to help developers solve R&D problems encountered while coding and assist them in fixing and debugging code or troubleshooting runtime errors.
Edit Mode
The file editing mode can modify code across multiple files. When developers need to modify code files precisely, it combines the requirement description with the current project environment, supports multiple iterations and code review, and helps developers complete code modification tasks efficiently and controllably.
Agent Mode
The agent mode can make autonomous decisions, perceive the environment, and use tools. Based on the developer's coding request, it uses tools such as project search, file editing, and the terminal to complete coding tasks end to end. Developers can also configure MCP tools so that coding fits their workflow better.
Product Advantages
• Multiple conversation modes: Ask mode, edit mode, and agent mode are supported within a single conversation flow; developers can switch modes freely according to the scenario and problem difficulty to maximize work efficiency.
• Automatic project perception: Based on the developer's task description, it automatically perceives project information such as the project framework, technology stack, required code files, and error messages, with no need to add project context manually; task descriptions become lighter and code completion better matches the business scenario of the current code base.
• Project-level changes: Based on the developer's task description, it can independently decompose tasks and modify multiple code files within the project, iterating step by step or rolling back snapshots across multiple conversations, collaborating with Tongyi Lingma to complete coding tasks.
• Memory perception: Supports autonomous memory based on large models. As developers converse with Tongyi Lingma, it gradually builds rich memories related to the developer, the project, and past problems; the more you use it, the better it understands you.
• Multiple Enterprise Edition plans, flexible choice: Provides Enterprise Standard Edition, Enterprise
Create embedding vectors for text blocks
Embedding vectors convert text into numerical vectors, allowing efficient similarity search. Here, Alibaba Cloud's embedding model "text-embedding-v3" is used.
# Create embedding vectors for text chunks
def create_embeddings(texts, model="text-embedding-v3"):
    """
    Take a string or a list of strings and return the corresponding list of embedding vectors.
    """
    if isinstance(texts, str):
        texts = [texts]  # Ensure the input is a list
    completion = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )
    # Convert the response to a dict and extract all embeddings
    data = json.loads(completion.model_dump_json())
    embeddings = [item["embedding"] for item in data["data"]]
    return embeddings
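As a quick sanity check (assuming the client above is configured with a valid API key), the function can be called on a small list of strings and returns one vector per input:

sample_vectors = create_embeddings([
    "Tongyi Lingma is an intelligent coding assistant.",
    "RAG combines retrieval with generation."
])
print(len(sample_vectors))     # 2 -- one embedding per input text
print(len(sample_vectors[0]))  # dimensionality of the embedding vector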
Performing semantic searches
We find the most relevant text chunks to the user query by calculating cosine similarity.
from sklearn.metrics.pairwise import cosine_similarity

# Semantic search function
def semantic_search(query, text_chunks, embeddings=None, k=2):
    """
    Find the top-k text chunks in text_chunks that are most relevant to the query.

    Parameters:
        query: the query string
        text_chunks: list of candidate text chunks
        embeddings: list of corresponding embedding vectors (if pre-computed)
        k: number of most relevant results to return

    Returns:
        top_k_chunks: the top-k most relevant text chunks
    """
    if embeddings is None:
        embeddings = create_embeddings(text_chunks)  # Generate them automatically if not provided
    else:
        assert len(embeddings) == len(text_chunks), "embeddings and text_chunks must have the same length"

    query_embedding = create_embeddings(query)[0]  # Get the embedding of the query

    # Compute similarities
    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        score = cosine_similarity([query_embedding], [chunk_embedding])[0][0]
        similarity_scores.append((i, score))

    # Sort and take the top-k
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [index for index, _ in similarity_scores[:k]]
    return [text_chunks[index] for index in top_indices]
Finally, the query operation is executed and the results are printed.
# Perform a semantic search
query = "What are the capabilities of Tongyi Lingma's intelligent agent?"
top_chunks = semantic_search(query, text_chunks, k=2)

# Output the results
print("Query:", query)
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Query: What are the capabilities of Tongyi Lingma's intelligent agent?
Context 1:
What is the intelligent coding assistant Tongyi Lingma? The intelligent coding assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud. It provides capabilities such as intelligent code generation, intelligent question and answer, multi-file modification, and programming agents, bringing developers an efficient and smooth coding experience and leading a new paradigm of AI native research and development. At the same time, we provide enterprise customers with enterprise standard and exclusive versions, which have enterprise-level scene customization, private domain knowledge enhancement and other capabilities to help enterprises upgrade their research and development intelligence.
Core Capabilities
Code Completion: After training with a large amount of excellent open source code data, it can generate line-level/function-level code, unit tests, code optimization suggestions, etc. for you based on the current code file and cross-file context. Immersive coding flow and generation speed in seconds allow you to focus more on technical design and complete coding work efficiently.
Ask Mode: The intelligent question and answer mode has a large amount of R&D documents, product documents, general R&D knowledge, etc., and combines engineering-level perception capabilities to help developers solve R&D problems encountered during the coding process, and assist developers in fixing code problems, debugging, or troubleshooting running errors.
Edit Mode: The file editing mode has the ability to modify multiple files. When developers need to modify code files accurately, they can modify multiple files in combination with the requirement description and the current engineering environment, and can perform multiple iterations and code reviews to help developers complete code modification tasks efficiently and controllably.
Agent Mode: The agent mode has the capabilities of autonomous decision-making, environmental perception, and tool use. It can use tools such as engineering retrieval, file editing, and terminals according to the developer's coding requirements to complete coding tasks end-to-end. At the same time, it supports developers to configure MCP tools, so that coding is more in line with the developer's workflow.
Product Advantages
• Multiple session modes: Question and answer mode, file editing mode, and agent mode are supported in one session flow. Developers can freely switch modes according to different scenarios and problem difficulties to maximize work efficiency.
• Automatic project perception: Based on the developer's task description, it can automatically perceive project information such as project framework, technology stack, required code files, error messages, etc., without manually adding project context, making task description easier and code completion more in line with the business scenario of the current code base.
• Project-level changes: Based on the developer's task description, it can independently decompose tasks and modify multiple code files in the project. At the same time, it can gradually iterate or roll back snapshots through multiple conversations, and collaborate with Tongyi Lingma to complete coding tasks.
• Memory perception: Supports autonomous memory capabilities based on large models. During the conversation between developers and Tongyi Lingma, Tongyi Lingma will gradually form rich memories related to developers, projects, problems, etc., and the more you use it, the better it understands you.
• Multiple Enterprise Edition solutions, flexible choice: Provide Enterprise Standard Edition, Enterprise
=====================================
Context 2:
Perception: Supports autonomous memory capabilities based on large models. In the process of dialogue between developers and Tongyi Lingma, Tongyi Lingma will gradually form rich memories related to developers, projects, problems, etc., and the more you use it, the better it understands you.
• Multiple Enterprise Edition solutions, flexible choice: Provide a variety of solutions for enterprise customers such as Enterprise Standard Edition and Enterprise Exclusive Edition, and provide enterprise personalized solutions, which can be flexibly selected to accelerate the large-scale implementation of intelligent R&D within the enterprise.
Function Introduction
Inline code completion
• Line-level/function-level real-time continuation: According to the current syntax and cross-file code context, it automatically perceives the current project and generates line- and function-level code in real time;
• Generate code by commenting: Describe the functions you want through comments, and generate code directly in the editor area, so that the coding flow is uninterrupted.
Intelligent Q&A
• R&D Q&A: When you encounter coding questions or technical problems, you can call up Tongyi Lingma with one click and quickly get answers and solutions without leaving the IDE client.
• Engineering Q&A: Through Q&A, you can quickly understand the project and query the code in combination with the current repository. At the same time, you can describe requirements in natural language, and for simple requirements or defects it will generate overall repair suggestions and recommended code in combination with the current project.
• Image multimodal Q&A: Supports selecting, dragging or pasting to add images as context, automatically analyzes the image content, and generates code suggestions or problem repair suggestions based on the requirement description.
• Enterprise knowledge base Q&A: Use enterprise knowledge and data to conduct Q&A, quickly build an enterprise R&D knowledge Q&A assistant, and improve the work efficiency and collaboration ability of the team.
File editing
• Engineering-level changes: According to the developer's task description, it can modify multiple code files in the project, and can perform gradual iterations or snapshot rollbacks through multiple conversations. Developers and Tongyi Lingma work together to gradually complete coding tasks.
• Precise editing: Completes code file modifications within the context provided by the developer, and makes no modifications beyond the developer's expectations.
• Fast execution: Strictly follows the developer's task description and the provided context to modify code files. No additional complex task planning is needed, so tasks are completed more quickly than in agent mode.
• Tool usage: It can use code-modification-related tools such as file reading, semantic retrieval within the project, and file editing, helping developers quickly complete code modifications.
Programming agent
• Project-level changes: According to the developer's task description, it can independently decompose tasks and modify multiple code files within the project. At the same time, it can perform step-by-step iteration or snapshot rollback through multiple conversations, and collaborate with Tongyi Lingma to complete coding tasks.
• Automatic project perception: According to the developer's task description, it can automatically perceive project information such as project framework, technology stack, required code files, error messages, etc., without manually adding project context, making task description easier.
• Tool usage: It can independently use more than a dozen built-in programming tools, such as reading and writing files, generation === ...
Generate a response based on the retrieved chunk
# Initialize the DashScope client (using Alibaba Cloud Tongyi Qianwen)
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # Make sure the environment variable is set in advance
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

# Set the system prompt
SYSTEM_PROMPT = (
    "You are an AI assistant and must answer strictly based on the context provided. "
    "If the answer cannot be directly inferred from the context provided, respond with: "
    "'I can't answer this question based on the information available.'"
)

def generate_response(system_prompt, user_message, model="qwen-max"):
    """
    Generate a context-based answer using DashScope's Tongyi Qianwen models.

    Parameters:
        system_prompt (str): system prompt that controls the AI's behavior
        user_message (str): the user's question together with the retrieved context
        model (str): name of the model to use, default "qwen-max"

    Returns:
        str: the answer generated by the model
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,   # Temperature 0 for deterministic output
        max_tokens=512,    # Maximum output length, adjust as needed
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content.strip()

# Example top_chunks (assuming this is what semantic_search returned)
top_chunks = [
    "Tongyi Lingma is an AI-based intelligent programming assistant.",
    "File editing capabilities include auto-completion, error fixing, and code refactoring."
]
query = "What are the capabilities of Tongyi Lingma's intelligent agent?"

# Build the user prompt (context + question)
user_prompt = "\n".join([f"context {i + 1}:\n{chunk}" for i, chunk in enumerate(top_chunks)])
user_prompt += f"\n\nQuestion: {query}"

# Generate the AI answer
answer = generate_response(SYSTEM_PROMPT, user_prompt)

# Output the result
print("AI answers:")
print(answer)
AI answers:
The intelligent agent capabilities of Tongyi Lingma include the following aspects:
- **Autonomous decision-making**: It can independently decompose tasks based on the developer's coding needs.
- **Environment perception**: It can automatically perceive project information such as the project framework, technology stack, required code files, and error messages, without manually adding project context.
- **Tool usage**: It can independently use more than ten built-in programming tools, such as reading and writing files and code editing.
- **End-to-end completion of coding tasks**: Based on the developer's needs, it uses a variety of tools such as project retrieval, file editing, and the terminal to complete coding tasks from start to finish.
- **Support for configuring MCP tools**: Makes the coding process fit the developer's personal workflow better.
Evaluating the Answer
# Define the system prompt for the evaluation system
evaluate_system_prompt = (
    "You are an intelligent evaluation system responsible for assessing the quality of the AI assistant's answers. "
    "If the AI assistant's answer is very close to the reference answer, give 1 point; "
    "if the answer is wrong or irrelevant to the reference answer, give 0 points; "
    "if the answer is partially correct, give 0.5 points. "
    "Please output the score directly: 0, 0.5 or 1."
)

# Reference answer used for scoring (supply your own ground-truth answer here)
ideal_answer = "Tongyi Lingma's agent mode supports autonomous decision-making, environment perception, and tool use, and can complete coding tasks end to end."

# Build the evaluation prompt and get the score
evaluation_prompt = f"""
User question: {query}

AI answer:
{answer}

Reference answer: {ideal_answer}

Please score according to the following criteria:
- If the AI answer is very close to the reference answer → output 1
- If the answer is wrong or irrelevant → output 0
- If it partially matches → output 0.5
"""

evaluation_result = generate_response(evaluate_system_prompt, evaluation_prompt)

# Output the final score
print("AI answer score:", evaluation_result)
AI answer rating: 1
2. Semantic-based text segmentation
In RAG, text chunking is a crucial step. Its core function is to divide a large continuous text into multiple smaller paragraphs with semantic integrity, thereby improving the accuracy and overall effect of information retrieval.
Traditional chunking methods usually use a fixed-length segmentation strategy, such as segmenting every 500 characters or every several sentences. Although this method is simple to implement, it is easy to split complete semantic units in practical applications, which affects subsequent information retrieval and understanding.
In contrast, a smarter chunking method is semantic chunking. Instead of mechanically dividing by word or sentence count, it determines appropriate split positions by analyzing the semantic similarity between adjacent sentences: when a significant semantic difference is detected between consecutive sentences, the text is split at that position to start a new semantic paragraph.
How to determine the split point
In order to find the appropriate semantic segmentation point, we can use the following common statistical methods:
1. Percentile method: Compute the Xth percentile of the semantic similarity differences between all adjacent sentences, and split at the positions where the difference exceeds this threshold.
2. Standard deviation method: Split at positions where the similarity between sentences falls below the mean minus X times the standard deviation.
3. Interquartile range (IQR) method: Use the spread between the upper and lower quartiles (Q3 - Q1) to identify positions with unusually large changes and treat them as potential split points (a short NumPy sketch of these rules follows below).
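To make the three rules concrete, here is a small NumPy sketch on a made-up list of adjacent-sentence similarities (the numbers are illustrative only):

import numpy as np

similarity_scores = [0.82, 0.79, 0.35, 0.81, 0.77, 0.30, 0.80]  # hypothetical values

# Percentile method: use the similarity value at the Xth percentile as the threshold
percentile_threshold = np.percentile(similarity_scores, 90)

# Standard deviation method: mean minus X standard deviations (here X = 1)
std_threshold = np.mean(similarity_scores) - 1 * np.std(similarity_scores)

# IQR method: Q1 - 1.5 * (Q3 - Q1)
q1, q3 = np.percentile(similarity_scores, [25, 75])
iqr_threshold = q1 - 1.5 * (q3 - q1)

print(percentile_threshold, std_threshold, iqr_threshold)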
Practical application examples
In this practice, we use the **percentile method** to perform semantic segmentation and test its segmentation effect on a sample text.
Creating sentence-level embeddings
First, a piece of original text is preliminarily segmented into sentences, and then a corresponding vector representation (Embedding) is generated for each sentence to facilitate the subsequent calculation of the semantic similarity between sentences.
# Initialize the client
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you don't have the environment variable configured, replace this with your API Key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the Bailian service
)
# Create embedding vectors for text chunks
def create_embeddings(texts, model="text-embedding-v3"):
    """
    Take a string or a list of strings and return the corresponding list of embedding vectors.
    """
    if isinstance(texts, str):
        texts = [texts]  # Ensure the input is a list
    completion = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )
    # Convert the response to a dict and extract all embeddings
    data = json.loads(completion.model_dump_json())
    embeddings = [item["embedding"] for item in data["data"]]
    return embeddings
# Preliminarily split the text into sentences using the period as a delimiter
sentences = extracted_text.split(".")

# Remove empty strings and strip leading/trailing whitespace
sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

# Generate embedding vectors for all sentences in one batch (recommended)
embeddings = create_embeddings(sentences)

print(f"Successfully generated embedding vectors for {len(embeddings)} sentences.")
Successfully generated embedding vectors for 5 sentences.
Calculate similarity difference
We calculate the cosine similarity between consecutive sentences to measure their semantic proximity.
import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Parameters:
        vec1 (np.ndarray): the first vector.
        vec2 (np.ndarray): the second vector.

    Returns:
        float: the cosine similarity.

    Raises:
        ValueError: if an input vector is not one-dimensional or the shapes do not match.
    """
    if vec1.ndim != 1 or vec2.ndim != 1:
        raise ValueError("Input vectors must be one-dimensional arrays")
    if vec1.shape[0] != vec2.shape[0]:
        raise ValueError("Input vectors must have the same dimensions")
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute the cosine similarity between each pair of consecutive sentences
similarities = [
    cosine_similarity(np.array(embeddings[i]), np.array(embeddings[i + 1]))
    for i in range(len(embeddings) - 1)
]
Implementing semantic chunking
We implemented three different methods to identify breakpoints in the text, that is, to determine where to split a piece of text into multiple meaningful paragraphs.
The core idea of these methods is to determine the segmentation position based on the change in semantic similarity between sentences . When a large semantic difference is detected between consecutive sentences, it is considered to be a potential paragraph dividing point.
def compute_breakpoints(similarity_scores, method="percentile", threshold=90):
    """
    Compute segmentation breakpoints based on drops in similarity.

    Parameters:
        similarity_scores (List[float]): list of similarities between adjacent sentences.
        method (str): threshold calculation method, one of 'percentile', 'standard_deviation', or 'interquartile'.
        threshold (float): threshold parameter (percentile value or number of standard deviations).

    Returns:
        List[int]: the index positions where the text should be split.
    """
    # Determine the threshold value based on the selected method
    if method == "percentile":
        # Use the similarity value at the given percentile as the threshold
        threshold_value = np.percentile(similarity_scores, threshold)
    elif method == "standard_deviation":
        # Compute mean and standard deviation; threshold = mean - X standard deviations
        mean = np.mean(similarity_scores)
        std_dev = np.std(similarity_scores)
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Use the interquartile range (IQR) rule to determine the outlier threshold
        q1, q3 = np.percentile(similarity_scores, [25, 75])
        iqr = q3 - q1
        threshold_value = q1 - 1.5 * iqr
    else:
        # Raise an error if the method is invalid
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Positions where the similarity falls below the threshold are the segmentation breakpoints
    return [i for i, score in enumerate(similarity_scores) if score < threshold_value]

# Use the percentile method with the 90th percentile as the threshold
breakpoints = compute_breakpoints(similarity_scores=similarities, method="percentile", threshold=90)
Split text into semantic chunks
def split_into_chunks(sentence_list, break_indices):
    """
    Divide the sentence list into semantic paragraphs based on the breakpoint indices.

    Parameters:
        sentence_list (List[str]): list of sentences.
        break_indices (List[int]): list of index positions at which to split.

    Returns:
        List[str]: the list of semantic paragraphs.
    """
    semantic_chunks = []     # Store the resulting paragraphs
    current_start_index = 0  # Start index of the current segment

    # Iterate through each breakpoint to create paragraphs
    for bp in break_indices:
        # Join the sentences from the current start index up to the breakpoint and end with a period
        semantic_chunks.append(". ".join(sentence_list[current_start_index:bp + 1]) + ".")
        current_start_index = bp + 1  # Move the start index to the next sentence

    # Add the last paragraph (the remaining sentences)
    semantic_chunks.append(". ".join(sentence_list[current_start_index:]))
    return semantic_chunks  # Return the list of semantic paragraphs

# Use split_into_chunks to generate the paragraphs
text_chunks = split_into_chunks(sentence_list=sentences, break_indices=breakpoints)

# Print the number of generated paragraphs
print(f"Number of semantic paragraphs generated: {len(text_chunks)}")

# Print the first paragraph to verify the result
print("\nFirst semantic paragraph:")
print(text_chunks[0])
Number of semantic paragraphs generated: 4
The first semantic paragraph:
What is the intelligent coding assistant Tongyi Lingma
Intelligent Coding Assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud. It provides capabilities such as intelligent code generation, intelligent Q&A, multi-file modification, and a programming agent, bringing developers an efficient and smooth coding experience and leading a new paradigm of AI-native R&D. At the same time, we provide enterprise customers with Enterprise Standard and Enterprise Exclusive editions, which offer enterprise-level scenario customization and private-domain knowledge enhancement to help enterprises upgrade their R&D intelligence.
Creating embedding vectors for semantic chunks
After completing the semantic segmentation of the text, we need to generate an embedding vector for each semantic chunk to facilitate subsequent retrieval and use.
# Create embedding vectors for text chunks
def create_embeddings(texts, model="text-embedding-v3"):
    """
    Take a string or a list of strings and return the corresponding list of embedding vectors.
    """
    if isinstance(texts, str):
        texts = [texts]  # Ensure the input is a list
    completion = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )
    # Convert the response to a dict and extract all embeddings
    data = json.loads(completion.model_dump_json())
    embeddings = [item["embedding"] for item in data["data"]]
    return embeddings
Conduct semantic search
We use cosine similarity to retrieve the chunks most relevant to the query content.
# Semantic search function
def semantic_search(query, text_chunks, embeddings=None, k=2):
    """
    Find the top-k text chunks in text_chunks that are most relevant to the query.

    Parameters:
        query: the query string
        text_chunks: list of candidate text chunks
        embeddings: list of corresponding embedding vectors (if pre-computed)
        k: number of most relevant results to return

    Returns:
        top_k_chunks: the top-k most relevant text chunks
    """
    if embeddings is None:
        embeddings = create_embeddings(text_chunks)  # Generate them automatically if not provided
    else:
        assert len(embeddings) == len(text_chunks), "embeddings and text_chunks must have the same length"

    query_embedding = create_embeddings(query)[0]  # Get the embedding of the query

    # Compute the similarity between the query and every chunk (using the cosine_similarity defined above)
    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding))
        similarity_scores.append((i, score))

    # Sort and take the top-k
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [index for index, _ in similarity_scores[:k]]
    return [text_chunks[index] for index in top_indices]
# Perform a semantic search
query = "What is the intelligent coding assistant Tongyi Lingma?"
top_chunks = semantic_search(query, text_chunks, k=2)

# Output the results
print("Query:", query)
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Query: What is the intelligent coding assistant Tongyi Lingma?
Context 1:
What is the intelligent coding assistant Tongyi Lingma. Intelligent coding assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud, providing capabilities such as intelligent code generation, intelligent question and answer, multi-file modification, and programming agents, bringing developers an efficient and smooth coding experience and leading a new paradigm of AI native research and development. At the same time, we provide enterprise customers with enterprise standard version and exclusive version, which have the capabilities of enterprise-level scene customization and private domain knowledge enhancement to help enterprises upgrade their research and development intelligence.
=====================================
Context 2:
Core Capabilities Code Completion: After training with a large amount of excellent open source code data, it can generate line-level/function-level code, unit tests, code optimization suggestions, etc. for you according to the context of the current code file and across files.
=====================================
Generate a response based on the retrieved text block
After completing the semantic search and finding the chunks of text that are most relevant to the user's query, the next step is to generate answers based on these retrieval results .
import os
from openai import OpenAI

# Initialize the DashScope client (using Alibaba Cloud Tongyi Qianwen)
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # Make sure the environment variable is set in advance
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

# Set the system prompt
SYSTEM_PROMPT = (
    "You are an AI assistant and must answer strictly based on the context provided. "
    "If the answer cannot be directly inferred from the context provided, respond with: "
    "'I can't answer this question based on the information available.'"
)

def generate_response(system_prompt, user_message, model="qwen-max"):
    """
    Generate a context-based answer using DashScope's Tongyi Qianwen models.

    Parameters:
        system_prompt (str): system prompt that controls the AI's behavior
        user_message (str): the user's question together with the retrieved context
        model (str): name of the model to use, default "qwen-max"

    Returns:
        str: the answer generated by the model
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,   # Temperature 0 for deterministic output
        max_tokens=512,    # Maximum output length, adjust as needed
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )
    return response.choices[0].message.content.strip()
3. Introducing “Context Enhanced Retrieval” in RAG
The traditional approach has an obvious problem: it only returns isolated blocks of text that lack context, which sometimes causes the AI to obtain incomplete information, resulting in incorrect answers or incomplete content.
To solve this problem, we introduce a method called "Context-Enriched Retrieval". Its core idea: not only find the most relevant text block, but also return the text blocks immediately before and after it, helping the AI better understand the context and generate more accurate and complete answers.
The whole process mainly includes the following steps:
1. Data Ingestion Extract the original text content from the PDF file.
2. Chunking with Overlapping Context: Chunking a large paragraph into multiple small chunks, but each chunk has some overlap with the previous and next chunks. This ensures that even if a sentence is split between two chunks, the full context can be seen in one chunk.
3. Create Embedding Vectors (Embedding Creation): Convert each text block into a numerical representation (an "embedding vector") to facilitate subsequent similarity calculations. You can think of it as giving each text block a "semantic label" so that content with similar semantics can be found quickly.
4. Context-Aware Retrieval When a user asks a question, the system will not only find the most relevant text block, but also return the text blocks before and after it. This way, AI can obtain richer background information when answering questions and avoid taking things out of context.
5. Response Generation Use large language models (such as Llama, ChatGLM, etc.) to generate natural and accurate responses based on search results that include context. Just like when you are taking an exam, you can flip through the book to find the answer, and you can also see the content before and after that page, so you can naturally answer more accurately.
6. Evaluation Finally, we will evaluate the AI's answer to determine whether the introduction of context has improved the accuracy and completeness of the answer. For example, we can use manual scoring or let another AI evaluate the quality of the answer.
Implementing context-aware semantic search
It is an improvement on the original semantic search: during the retrieval process, not only the most relevant text block is returned, but also its adjacent previous and next text blocks, thus providing more complete and context-supported information.
def context_enriched_search(search_query, chunked_texts, chunk_embeddings, top_k=1, context_window_size=1):
    """
    Return not only the most relevant paragraph but also the paragraphs before and after it,
    to provide richer background information.

    Parameters:
        search_query (str): the user's query.
        chunked_texts (List[str]): list of chunked texts.
        chunk_embeddings (List): embedding vectors of the text paragraphs.
        top_k (int): number of relevant paragraphs to retrieve (only the top 1 is used to locate the central paragraph).
        context_window_size (int): number of context paragraphs to include on each side.

    Returns:
        List[str]: the most relevant paragraph together with its surrounding context.
    """
    # Step 1: Convert the user's question into an embedding vector for comparison with the paragraphs
    query_embedding = np.array(create_embeddings(search_query)[0])

    similarity_list = []  # Stores (index, similarity score) pairs for each paragraph

    # Step 2: Traverse all paragraph vectors and compute the cosine similarity with the query vector
    for i, chunk_embedding in enumerate(chunk_embeddings):
        # The closer the similarity is to 1, the more similar the paragraph is to the query
        similarity_score = cosine_similarity(query_embedding, np.array(chunk_embedding))
        # Save the paragraph index and its similarity, e.g. (0, 0.75)
        similarity_list.append((i, similarity_score))

    # Step 3: Sort all paragraphs by similarity, from high to low
    similarity_list.sort(key=lambda x: x[1], reverse=True)

    # Step 4: Get the index of the most relevant paragraph
    most_relevant_index = similarity_list[0][0]

    # Step 5: Determine the context range (the central paragraph plus context_window_size paragraphs on each side)
    start_index = max(0, most_relevant_index - context_window_size)  # Do not go past the beginning
    end_index = min(len(chunked_texts), most_relevant_index + context_window_size + 1)  # Do not go past the end

    # Step 6: Return the list of paragraphs including the context
    return [chunked_texts[i] for i in range(start_index, end_index)]
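A possible way to call it, assuming the text_chunks produced in the previous sections (the variable names and query below are illustrative):

# Embed each chunk once, then retrieve the best match plus one neighbouring chunk on each side
chunk_embeddings = create_embeddings(text_chunks)
query = "What is the intelligent coding assistant Tongyi Lingma?"
enriched_chunks = context_enriched_search(query, text_chunks, chunk_embeddings, top_k=1, context_window_size=1)
for i, chunk in enumerate(enriched_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")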
4. Add Contextual Chunk Headers (CCH)
RAG improves the factual accuracy of language models by retrieving relevant information from external knowledge bases before generating answers. However, in traditional text chunking methods, important contextual information is often lost, resulting in poor retrieval results or even causing the model to generate out-of-context answers.
To solve this problem, we introduced an improved method: Contextual Chunk Headers (CCH) . The core idea of this method is: when dividing the text into small chunks (chunks), the high-level context information (such as document title, chapter title, etc.) to which the content belongs is added to the beginning of each text chunk before embedding and retrieval. This allows each text chunk to carry its background information, helping the model to better understand which part it belongs to, thereby improving the relevance of the retrieval and avoiding the model from generating wrong answers based on out-of-context content.
The steps in this method are as follows:
1. Data Ingestion: Load and preprocess raw text data.
2. Chunking with Contextual Headers automatically identifies chapter titles in a document and adds them to the front of the corresponding paragraphs to form text blocks with context. For example:
# Chapter 3: Basic technologies of artificial intelligence
The core methods of artificial intelligence include machine learning, deep learning and natural language processing...
3. Create embedding vectors (Embedding Creation) to convert these text blocks with contextual information into digital form (i.e., embedding vectors) for subsequent semantic search.
4. Semantic Search When a user asks a question, the system will find the most relevant content based on these enhanced text blocks.
5. Response Generation uses large language models (such as Llama, ChatGLM, etc.) to generate natural and accurate responses based on retrieval results.
6. Evaluation: Evaluate the AI’s answers through a scoring system to check whether adding contextual titles improves the accuracy and relevance of the answers.
Chunking text using contextual headings
In order to improve the effect of information retrieval, we use a large language model (LLM) to automatically generate a descriptive title (Header) for each text block and add it in front of the text block.
def generate_chunk_header(text_chunk, model_name="qwen-max"):
    """
    Use a large language model (LLM) to generate a title/summary for the given text paragraph.

    Parameters:
        text_chunk (str): the text paragraph for which a title should be generated.
        model_name (str): name of the language model used to generate titles. Defaults to "qwen-max".

    Returns:
        str: the title or summary generated by the model.
    """
    # System prompt guiding the model's behavior
    header_system_prompt = "Please generate a concise and informative title for the following text."

    # Call the LLM with the system prompt and the input text
    llm_response = client.chat.completions.create(
        model=model_name,
        temperature=0,
        messages=[
            {"role": "system", "content": header_system_prompt},
            {"role": "user", "content": text_chunk}
        ]
    )

    # Extract the generated title and strip surrounding whitespace
    return llm_response.choices[0].message.content.strip()

def chunk_text_with_headers(input_text, chunk_size, overlap_size):
    """
    Split the input text into smaller paragraphs and generate a heading for each one.

    Parameters:
        input_text (str): the complete text to be segmented.
        chunk_size (int): the size of each paragraph (number of characters).
        overlap_size (int): the number of overlapping characters between adjacent paragraphs.

    Returns:
        List[dict]: a list of dictionaries with the keys 'header' and 'text', holding each paragraph's title and content.
    """
    text_chunks = []  # List of paragraphs with their titles

    # Iterate over the text using the specified paragraph size and overlap
    for start_index in range(0, len(input_text), chunk_size - overlap_size):
        current_chunk = input_text[start_index:start_index + chunk_size]  # Extract the current paragraph
        chunk_header = generate_chunk_header(current_chunk)  # Generate a title with the large language model
        text_chunks.append({"header": chunk_header, "text": current_chunk})  # Store the title together with the content

    return text_chunks  # Return the list of paragraphs with titles and content
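A small usage sketch, assuming extracted_text from section 1 (the chunk sizes here are illustrative):

# Split the extracted text into headed chunks and inspect the first one
text_chunks = chunk_text_with_headers(extracted_text, chunk_size=1000, overlap_size=200)
print("Header:", text_chunks[0]["header"])
print("Text preview:", text_chunks[0]["text"][:200])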
Create embedding vectors for the title and body text
In order to improve the accuracy of information retrieval, we not only generate embeddings for the main text content , but also generate embedding vectors for the header in front of each text block.
from tqdm import tqdm

# Generate embedding vectors for each text chunk
chunk_embeddings = []  # List of dictionaries holding titles, texts, and their embedding vectors

# Iterate over each text chunk and generate embedding vectors (with a progress bar)
for current_chunk in tqdm(text_chunks, desc="Generating embedding vectors"):
    # Generate the embedding vector for the chunk's text content
    text_embedding = create_embeddings(current_chunk["text"])[0]
    # Generate the embedding vector for the chunk's title
    header_embedding = create_embeddings(current_chunk["header"])[0]
    # Store the title, text, and their embedding vectors
    chunk_embeddings.append({
        "header": current_chunk["header"],
        "text": current_chunk["text"],
        "embedding": text_embedding,
        "header_embedding": header_embedding
    })
Semantic Search
import numpy as np

def _calculate_similarity(query_vec, chunk_vec):
    """
    Compute the cosine similarity between the query vector and a chunk vector.

    Parameters:
        query_vec (np.array): the embedding vector of the query.
        chunk_vec (np.array): the embedding vector of the text chunk.

    Returns:
        float: the cosine similarity.
    """
    return cosine_similarity(np.array(query_vec), np.array(chunk_vec))

def semantic_search(query, chunks, top_k=5):
    """
    Search for the text chunks most relevant to the query.

    Parameters:
        query (str): the query entered by the user.
        chunks (List[dict]): list of text chunks with their embedding vectors.
        top_k (int): number of most relevant results to return.

    Returns:
        List[dict]: the top_k most relevant text chunks.
    """
    # Generate the vector representation of the query
    query_vector = create_embeddings(query)[0]

    # List of (chunk, similarity) pairs
    chunk_similarity_pairs = []

    # Traverse each text chunk and compute the similarity
    for chunk in chunks:
        text_vector = chunk["embedding"]           # Embedding vector of the text content
        header_vector = chunk["header_embedding"]  # Embedding vector of the title
        # Compute the similarity between the query and the text and the title, then average them
        similarity_text = _calculate_similarity(query_vector, text_vector)
        similarity_header = _calculate_similarity(query_vector, header_vector)
        avg_similarity = (similarity_text + similarity_header) / 2
        # Store the chunk together with its average similarity
        chunk_similarity_pairs.append((chunk, avg_similarity))

    # Sort by similarity from high to low
    chunk_similarity_pairs.sort(key=lambda pair: pair[1], reverse=True)

    # Return the top-k most relevant text chunks
    return [pair[0] for pair in chunk_similarity_pairs[:top_k]]
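To close the loop, a sketch of how these headed chunks can feed the generate_response helper and SYSTEM_PROMPT from section 1 (the query below is illustrative):

query = "What are the core capabilities of Tongyi Lingma?"
top_chunks = semantic_search(query, chunk_embeddings, top_k=2)

# Build the user prompt from the header and text of each retrieved chunk
user_prompt = "\n".join(
    f"context {i + 1}:\n{chunk['header']}\n{chunk['text']}" for i, chunk in enumerate(top_chunks)
)
user_prompt += f"\n\nQuestion: {query}"

print(generate_response(SYSTEM_PROMPT, user_prompt))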
5. RAG based on question generation
This section enhances the document content by introducing question generation in the document processing stage.
We generate relevant questions for each text block, thereby improving the effectiveness of information retrieval, and ultimately helping the language model generate more accurate and relevant answers.
The core idea of this method is: In traditional RAG (Retrieval-Augmented Generation), we usually only embed text blocks and store them in the vector library. In this improved method, we also automatically generate some related questions for each text block and embed these questions as well. In this way, when users ask questions, the system can better understand which text blocks are most relevant to the questions, thereby improving the retrieval effect and answer quality.
The implementation steps are as follows:
1. Data Ingestion Extract the original text content from the PDF file.
2. Chunking: Chunking large text into small chunks for easy processing. Each chunk usually contains about 200 to 300 words.
3. Question Generation uses a large language model (LLM) to automatically generate several questions related to each text block. For example, if you input a piece of content about "machine learning", the output may be:
“What is machine learning?”
“What are some common algorithms for machine learning?”
“What is the relationship between machine learning and artificial intelligence?”
4. Create embedding vectors (Embedding Creation) Generate embedding vectors (i.e. convert them into digital representations) for each text block and its corresponding question for semantic matching.
5. Vector Store Creation Use NumPy to build a simple vector database to store the embedded vectors of all text blocks and questions.
6. Semantic Search When a user asks a question, the system will first look for the generated questions that are most similar to his question, and then find the corresponding text block as context.
7. Response Generation Based on the retrieved relevant text blocks, the language model generates natural and accurate responses.
8. Evaluation Finally, we will score the generated answers to evaluate whether this enhanced RAG improves the quality and accuracy of the answers.
Generate questions for text blocks
Automatically generate related questions for each block of text - questions that can be answered by looking at the text.
import re

def _extract_questions_from_response(response_text):
    """
    Extract questions ending with a question mark from the text returned by the model.

    Parameters:
        response_text (str): the raw text returned by the model.

    Returns:
        List[str]: the cleaned list of valid questions.
    """
    questions = []
    for line in response_text.split('\n'):
        cleaned_line = line.strip()  # Remove leading and trailing whitespace
        if cleaned_line:
            # Remove any leading numbering or bullet prefixes (e.g. "1.", "2)", "•", "-")
            cleaned_line = re.sub(r'^[\d\-•\*]+\s*[\.\)]?\s*', '', cleaned_line)
            # Keep lines that contain a question mark (half-width or full-width)
            if '?' in cleaned_line or '？' in cleaned_line:
                # Normalize the ending so every question ends with a single question mark
                question = cleaned_line.rstrip('?').rstrip('？') + '?'
                questions.append(question)
    return questions
def generate_questions(text, question_count=5, model="qwen-max"):
    """
    Generate answerable questions based on the provided block of text.

    Parameters:
        text (str): the text content to generate questions from.
        question_count (int): the number of questions to generate.
        model (str): name of the language model used to generate questions.

    Returns:
        List[str]: the generated list of questions.
    """
    # System instruction: define the model's behavior
    system_instruction = (
        "You are an expert at generating relevant questions from text. "
        "Please use only the text provided to create concise questions that focus on key information and concepts."
    )

    # User request template: the specific task and format requirements
    user_request = f"""
Please generate {question_count} different questions based on the following text. These questions must be answerable from the text:

{text}

Please output the questions as a numbered list, without adding anything else.
"""

    # Call the large model API to generate the questions
    response = client.chat.completions.create(
        model=model,
        temperature=0.7,
        messages=[
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": user_request}
        ]
    )

    # Extract the raw response content and strip surrounding whitespace
    raw_questions_text = response.choices[0].message.content.strip()

    # Use the helper function to extract and filter valid questions
    filtered_questions = _extract_questions_from_response(raw_questions_text)
    return filtered_questions
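For example, on a short passage drawn from the knowledge base above (the exact questions returned will vary from run to run):

sample_text = (
    "Tongyi Lingma is an intelligent coding assistant provided by Alibaba Cloud. "
    "It offers code completion, intelligent Q&A, multi-file editing, and an agent mode."
)
for question in generate_questions(sample_text, question_count=3):
    print("-", question)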
Building a simple vector repository
We will use NumPy to implement a simple vector store .
import numpy as np
from typing import List, Dict, Optional

class SimpleVectorStore:
    """
    A simple NumPy-based vector store implementation.
    """

    def __init__(self):
        """
        Initialize the vector store with lists for vectors, texts, and metadata.
        """
        self.vectors: List[np.ndarray] = []   # Stores the vectors
        self.texts: List[str] = []            # Stores the original texts
        self.metadata_list: List[Dict] = []   # Stores the metadata

    def add_item(self, text: str, vector: List[float], metadata: Optional[Dict] = None):
        """
        Add an entry to the vector store.

        Parameters:
            text (str): the original text content.
            vector (List[float]): the embedding vector.
            metadata (Dict, optional): optional metadata.
        """
        self.vectors.append(np.array(vector))
        self.texts.append(text)
        self.metadata_list.append(metadata or {})

    def similarity_search(self, query_vector: List[float], top_k: int = 5) -> List[Dict]:
        """
        Find the top_k entries most similar to the query vector.

        Parameters:
            query_vector (List[float]): the query vector.
            top_k (int): the number of results to return.

        Returns:
            List[Dict]: a list of dictionaries with the similar text, its metadata, and the similarity score.
        """
        if not self.vectors:
            return []

        # Convert the query vector to a numpy array
        query_array = np.array(query_vector)

        # Compute the cosine similarity between each stored vector and the query vector
        similarities = []
        for idx, vector in enumerate(self.vectors):
            similarity = np.dot(query_array, vector) / (
                np.linalg.norm(query_array) * np.linalg.norm(vector)
            )
            similarities.append((idx, similarity))

        # Sort by similarity in descending order
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Build and return the results
        results = []
        for i in range(min(top_k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata_list[idx],
                "similarity_score": float(score)
            })
        return results
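A minimal sanity check of the store with made-up two-dimensional vectors:

store = SimpleVectorStore()
store.add_item("hello world", [1.0, 0.0], metadata={"type": "chunk"})
store.add_item("goodbye world", [0.0, 1.0], metadata={"type": "chunk"})
print(store.similarity_search([0.9, 0.1], top_k=1))
# [{'text': 'hello world', 'metadata': {'type': 'chunk'}, 'similarity_score': ...}]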
Using Question Enhancement to Process Documents
Now, we put all the previous steps together to fully process the document : generating relevant questions for the text block, creating embeddings, and building an augmented vector store .
def process_document ( pdf_path, chunk_size= 1000 , chunk_overlap= 200 , questions_per_chunk= 5 ):
"""
The document is processed and question enhancements are generated.
parameter:
pdf_path(str): PDF file path.
chunk_size (int): The number of characters per text chunk.
chunk_overlap(int): The number of overlapping characters between text chunks.
questions_per_chunk(int): The number of questions to generate per chunk of text.
return:
Tuple[List[str], SimpleVectorStore]: Processed text chunks and vector storage.
"""
print ( "Extracting text from PDF..." )
extracted_text = extract_text_from_pdf(pdf_path)
print ( "Split text into blocks..." )
text_chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
print ( f"A total of { len (text_chunks)} text chunks were created" )
vector_store = SimpleVectorStore()
print ( "Process each text block and generate related questions..." )
for idx, chunk in enumerate (tqdm(text_chunks, desc= "Processing text chunks" )):
# Generate embedding for the current text block
chunk_embedding_response = create_embeddings(chunk)
chunk_embedding = chunk_embedding_response.data[ 0 ].embedding
# Add text blocks to the vector library
vector_store.add_item(
text=chunk,
vector=chunk_embedding,
metadata={ "type" : "chunk" , "index" : idx}
)
# Generate multiple questions for the current text block
questions = generate_questions(chunk, num_questions=questions_per_chunk)
# Generate embeddings for each question and add them to the vector library
for q_idx, question in enumerate (questions):
question_embedding_response = create_embeddings(question)
question_embedding = question_embedding_response.data[ 0 ].embedding
# Add the question to the vector library
vector_store.add_item(
text=question,
vector=question_embedding,
metadata={ "type" : "question" , "chunk_index" : idx, "original_chunk" : chunk}
)
return text_chunks, vector_store
Extracting and processing documents
# Define the PDF file path
pdf_file_path = "knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf"
# Process documents (extract text, generate chunks, create questions, build vector library)
text_chunks, vector_store = process_document(
pdf_file_path,
chunk_size = 1000 ,
chunk_overlap = 100 ,
questions_per_chunk = 3
)
# Output the number of entries in the vector library
print ( f"The vector store contains { len (vector_store.texts)} entries" )
Querying on the enhanced vector library
import json
search_query = 'Which company is the intelligent coding assistance tool Tongyi Lingma produced by?'
# Find related content using semantic search
search_results = semantic_search(search_query, vector_store, k= 5 )
print ( "Query content: " , search_query)
print ( "\nSearch results:" )
# Sort results by type
document_chunks = []
matched_questions = []
for result in search_results:
if result[ "metadata" ][ "type" ] == "chunk" :
document_chunks.append(result)
else :
matched_questions.append(result)
# Print document fragment
print ( "\nRelated document fragment: " )
for index, result in enumerate (document_chunks):
print ( f"Context {index + 1 } (similarity: {result[ 'similarity' ]: .4 f} ):" )
print (result[ "text" ][: 300 ] + "..." )
print ( "=====================================" )
# Print matching questions
print ( "\nMatched questions: " )
for index, result in enumerate (matched_questions):
print ( f"Problem {index + 1 } (similarity: {result[ 'similarity' ]: .4 f} ):" )
print (result[ "text" ])
chunk_index = result[ "metadata" ][ "chunk_index" ]
print ( f"from fragment {chunk_index} " )
print ( "=====================================" )
Query content: Which company is the intelligent coding assistance tool Tongyi Lingma produced by?
Matching questions:
Question 1 (Similarity: 0.9770):
Which company provides Tongyi Lingma, an intelligent coding assistance tool, and what capabilities does it mainly provide?
From segment 0
=====================================
Question 2 (Similarity: 0.8629):
In addition to individual developers, what versions and special services does Tongyi Lingma provide for enterprises?
From segment 0
=====================================
Question 3 (Similarity: 0.8108):
What enterprise plans does Tongyi Lingma provide for customers to choose from?
From segment 1
=====================================
Question 4 (Similarity: 0.8078):
How does Tongyi Lingma's code completion function work, and what kind of code suggestions can it generate for developers?
From segment 0
=====================================
6. Query Rewriting
This section implements three query transformations to improve the information retrieval performance of the Retrieval-Augmented Generation (RAG) system.
Core objectives:
By modifying or expanding the user's original query , it helps the system understand the user's intent more accurately and find more relevant information from the vector library.
Three query conversion techniques
1. Query Rewriting
Make users' questions more specific and detailed to improve the accuracy of retrieval.
Example:
User's original question: "What is AI?"
Rewritten question: "What is the definition of artificial intelligence and what are its core technologies?"
✅ Improvement: Make the search more precise and avoid overly broad results.
2. Step-back Prompting
Generate a broader, higher-level question that captures more context and helps the system better understand the context.
Example:
User's original question: "What are the applications of deep learning in the medical field?"
Fallback question: “What are the applications of artificial intelligence in the healthcare industry?”
✅ Improvement points: Helps to find important background knowledge that is related to the question but not a direct match.
3. Sub-query Decomposition
Split a complex question into multiple simpler small questions , search them separately, and finally combine all the results to provide a more comprehensive answer.
Example:
User's original question: "Compare the advantages, disadvantages and application scenarios of machine learning and deep learning."
Disassembled into:
“What is machine learning?”
“What is deep learning?”
“What are the advantages and disadvantages of machine learning?”
“What are the advantages and disadvantages of deep learning?”
“What scenarios are they suitable for?”
✅ Improvement points: Make sure to cover all aspects of the problem to avoid missing key information.
Implementing query transformation technology
1. Query Rewriting
In many cases, the questions asked by users may be vague or brief, such as:
“What is AI?”
Although this question is clear, it is not specific enough, and the system may return overly broad or irrelevant results when searching the vector library.
The role of query rewriting is to generate a clearer and more detailed version based on the intent of the original question, helping the system find more relevant information.
def rewrite_query ( original_query, model= "qwen-max" ):
"""
Rewrite user queries to make them more specific and detailed to improve retrieval results.
parameter:
original_query(str): user's original query statement
model(str): model name used for rewriting
return:
str: optimized query statement
"""
# System prompts: guide AI assistant behavior
system_prompt = "You are an AI assistant that is good at optimizing search queries. Your task is to rewrite the user's query to be more specific, detailed, and helpful in obtaining relevant information."
# User prompt: Provide the original query that needs to be rewritten
user_prompt = f"""
Please rephrase the following query to make it more specific and include relevant terms and concepts that will help you find the right results.
Original query: {original_query}
Rewritten query:
"""
# Generate rewrite results using the specified model
response = client.chat.completions.create(
model=model,
temperature= 0.0 , # Set the temperature to 0 to ensure stable output
messages=[
{ "role" : "system" , "content" : system_prompt},
{ "role" : "user" , "content" : user_prompt}
]
)
# Return the rewritten query content and remove the leading and trailing spaces
return response.choices[ 0 ].message.content.strip()
2. Step-back Prompting
The goal is to generate broader, higher-level questions that can be used to retrieve contextual information relevant to the user's question.
What is a "fallback question"?
In many cases, the questions asked by users can be very specific, such as:
“What are the applications of deep learning in medical imaging diagnosis?”
Although this question is clear, it is too focused and may cause the system to only retrieve very local information while ignoring important context.
The idea of "backward questioning" is:
First, take a step back and ask a broader question, such as:
“What are the applications of artificial intelligence in the medical industry?”
This allows the system to acquire some overall background knowledge first, helping it to better understand the context of the current question, thereby improving the accuracy and completeness of the final answer.
def generate_step_back_query ( original_query, model= "qwen-max" ):
"""
Generate a more general "step back" query to get broader contextual information.
parameter:
original_query(str): user's original query statement
model(str): the name of the model used to generate the query
return:
str: A broader context query
"""
# System prompts: guide AI assistant behavior
system_prompt = "You are an AI assistant that excels at search strategies. Your task is to expand specific queries into more general forms to obtain relevant contextual information."
# User prompt: Provide the original query that needs to be generalized
user_prompt = f"""
Please generate a broader, more general version of the specific query below to gain useful context.
Original query: {original_query}
One step back query:
"""
# Generate a broader query statement using the specified model
response = client.chat.completions.create(
model=model,
temperature = 0.1 , # The temperature is slightly higher to increase diversity
messages=[
{ "role" : "system" , "content" : system_prompt},
{ "role" : "user" , "content" : user_prompt}
]
)
# Return the generated query content and remove the leading and trailing spaces
return response.choices[ 0 ].message.content.strip()
3. Sub-query Decomposition
The goal is to split complex user questions into multiple simpler and more specific sub-questions, thereby achieving more comprehensive information retrieval.
What is subquery decomposition?
When a user asks a complex or multi-part question, such as:
“Please compare the principles, advantages and disadvantages, and application scenarios of machine learning and deep learning.”
If you search directly using this question, it may be difficult for the system to find an exact match, resulting in incomplete information or inaccurate answers.
The idea of subquery decomposition is:
Break the question into smaller, more manageable parts, search each separately, and then combine all the results to generate a complete answer.
Implementation:
def decompose_query ( original_query, num_subqueries= 4 , model= "qwen-max" ):
"""
Break complex queries into simpler subqueries.
parameter:
original_query(str): original complex query content
num_subqueries (int): The number of subqueries to generate
model(str): the name of the model used to decompose the query
return:
List[str]: the subquery list after splitting
"""
# System prompt words: guide the behavior logic of AI assistant
system_prompt = "You are an AI assistant that is good at breaking down complex questions. Your task is to break down complex queries into multiple simpler questions, the answers to which together form the answer to the original question."
# User prompt: Provide the original query that needs to be decomposed
user_prompt = f"""
Decompose the following complex query into {num_subqueries} simpler subqueries. Each subquery should focus on a different aspect of the original question.
Original query: {original_query}
Please output the results in the following format:
1. [First subquery]
2. [Second subquery]
...
"""
# Generate subqueries using the specified model
response = client.chat.completions.create(
model=model,
temperature = 0.2 , # The temperature is slightly higher to increase diversity
messages=[
{ "role" : "system" , "content" : system_prompt},
{ "role" : "user" , "content" : user_prompt}
]
)
# Extract and process the response content
content = response.choices[ 0 ].message.content.strip()
# Split the response content by line
lines = content.split( "\n" )
sub_queries = []
# Parse each row and extract the subquery after the number
for line in lines:
if line.strip() and any (line.strip().startswith( f"{i}." ) for i in range ( 1 , 10 )):
query = line.strip()
query = query[query.find( "." ) + 1 :].strip() # Remove the serial number part and keep the actual content
sub_queries.append(query)
return sub_queries
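As a quick sanity check, the three transformations can be applied to the same question. The snippet below is only a usage sketch: it assumes the client configured earlier is available, and the example query is illustrative.
# Usage sketch: apply all three transformations to one example query
original_query = "Which company is the intelligent coding assistance tool Tongyi Lingma produced by?"
print("Rewritten query:", rewrite_query(original_query))
print("Step-back query:", generate_step_back_query(original_query))
for i, sub_query in enumerate(decompose_query(original_query, num_subqueries=3), start=1):
    print(f"Sub-query {i}: {sub_query}")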
7. Reranking
Reranking is a second round of screening and optimization based on the initial search results , with the aim of ensuring that the content ultimately used to generate the answer is the most relevant and accurate part .
In traditional semantic search, we usually use vector similarity (such as cosine similarity) to find the most relevant text blocks. But this "preliminary search" is not always perfect, and sometimes returns some content that seems relevant but actually does not match.
The role of reordering is:
✅ Further screening in the initial search results;
✅ Re-score content using a more accurate relevance scoring model;
✅ Re-rank by actual relevance;
✅ Keep only the most relevant documents for subsequent answer generation.
The core process of reordering
1. Initial Retrieval
Use basic semantic similarity search (such as vector matching) to quickly obtain a batch of candidate text blocks;
This step is fast, but has limited accuracy.
2. Document Scoring
Perform a deeper relevance assessment on each retrieved document;
You can use specialized reranking models (such as BERT reranker, ColBERT, Cross-Encoder, etc.) to score based on the semantic relationship between user queries and document content;
This approach provides a better understanding of “sentence-level” relevance than simple vector matching.
3. Reordering
Re-rank all candidate documents according to the scoring results;
The most relevant ones are placed at the front, while the least relevant ones are placed at the back or eliminated.
4. Content Selection
Only the top-ranked documents are selected as context to provide to the language model;
Avoid introducing noise information and improve the accuracy and reliability of answers.
For example:
Suppose a user asks: "What are the main applications of deep learning?"
A preliminary search might return the following three paragraphs:
1. “Deep learning is widely used in image recognition and natural language processing.”
2. “Machine learning can be divided into two types: supervised learning and unsupervised learning.”
3. “Convolutional neural networks are a type of deep learning model commonly used for image classification.”
By reordering, we can determine:
Passage 1 is highly relevant ✅
Passage 2 is barely relevant ❌
Passage 3 is partially relevant ✅
So we only keep items 1 and 3 as context to generate the final answer.
Reranking based on large models
import re

def rerank_with_llm(query, search_results, top_n=3, model="qwen-max"):
    """
    Rerank search results using an LLM.
    parameter:
        query (str): user query statement.
        search_results (List[Dict]): initial list of search results; each element contains document text, metadata and a similarity score.
        top_n (int): number of documents returned after reranking.
        model (str): name of the LLM model used for scoring.
    return:
        List[Dict]: list of documents sorted by relevance score.
    """
    print(f"Reranking {len(search_results)} documents...")
    scored_results = []  # Store results together with their relevance scores
    # Define the system prompt that guides the LLM on how to score
    system_prompt = """You are an expert in assessing the relevance of documents to search queries.
Your task is to score documents between 0 and 10 based on how relevant the document is to answering a given query.
Scoring guidelines:
- 0-2 points: the document is completely irrelevant
- 3-5 points: the document has some relevant information, but does not directly answer the query
- 6-8 points: the document is relevant and can partially answer the query
- 9-10 points: the document is highly relevant and can directly answer the query
Please only output an integer score (0 to 10), do not include any other content."""
    # Iterate over each search result
    for idx, result in enumerate(search_results):
        if idx % 5 == 0:
            print(f"Scoring document {idx + 1}/{len(search_results)}...")
        # Construct the user prompt with the query and the document content
        user_prompt = f"""Query: {query}
Document content:
{result['text']}
Please rate the relevance of this document to the above query (0-10):"""
        # Call the LLM API to get the score
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        )
        # Extract the scoring result
        score_text = response.choices[0].message.content.strip()
        # Use a regular expression to extract the numeric score
        score_match = re.search(r'\b(10|[0-9])\b', score_text)
        if score_match:
            relevance_score = float(score_match.group(1))
        else:
            # If the score cannot be extracted, fall back to the similarity score
            print(f"Warning: unable to extract score from response: '{score_text}', using similarity score instead")
            relevance_score = result["similarity"] * 10
        # Add the scored result to the list
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": relevance_score
        })
    # Sort the results in descending order by relevance score
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)
    # Return the top_n results
    return reranked_results[:top_n]
Reranking based on keywords
def rerank_with_keywords ( query, doc_results, top_n= 3 ):
"""
A simple re-ranking method based on keyword matching and position.
Args:
query(str): the question the user is querying
doc_results(List[Dict]): Initial search results, each dictionary contains text and other metadata
top_n(int): The number of results to return after reordering, the default is the first 3
Returns:
List[Dict]: The results are re-sorted by relevance, keeping only the top_n ones
"""
def extract_keywords ( text ):
"""Extract keywords from text"""
return [word.lower() for word in text.split() if len (word) > 3 ]
# Extract important keywords from user questions
keywords = extract_keywords(query)
ranked_docs = [] # Create a list to store the ranked documents
for doc_result in doc_results:
document_text = doc_result[ "text" ].lower() # Convert document text to lowercase for subsequent comparison
# The base score starts from vector similarity, multiplied by 0.5 to indicate that it is not the only determining factor
base_score = doc_result[ "similarity" ] * 0.5
# Initialize keyword scores
keyword_score = 0
for keyword in keywords:
if keyword in document_text:
# If a keyword is found, add 0.1 points
keyword_score += 0.1
# If the keyword appears in the first 1/4 of the document, it is more likely to answer the question directly, and 0.1 points will be added
first_position = document_text.find(keyword)
if first_position < len (document_text) / 4 : # In the first quarter
keyword_score += 0.1
# Add points based on the number of times the keyword appears, but the maximum is 0.2 points
frequency = document_text.count(keyword)
keyword_score += min ( 0.05 * frequency, 0.2 ) # add up to 0.2 points
# Calculate the final score: basic score plus keyword score
final_score = base_score + keyword_score
# Store this document and its related information and score in a list
ranked_docs.append({
"text" : doc_result[ "text" ],
"metadata" : doc_result[ "metadata" ],
"similarity" : doc_result[ "similarity" ],
"relevance_score" : final_score
})
# Sort all documents in descending order according to the final relevance score
reranked_docs = sorted (ranked_docs, key= lambda x: x[ "relevance_score" ], reverse= True )
# Return the top_n documents with the highest scores
return reranked_docs[:top_n]
Complete RAG flow with reordering
So far, we have implemented the core modules in the RAG process, including:
Document Processing
Question Answering
Reranking
Now, we integrate these modules together to build a complete RAG system flow .
def rag_with_reranking(query, vector_store, reranking_method="llm", top_n=3, model="qwen-max"):
    """
    Full RAG pipeline including reranking.
    parameter:
        query (str): user query
        vector_store (SimpleVectorStore): vector store
        reranking_method (str): reranking method ('llm' or 'keywords')
        top_n (int): number of results to return after reranking
        model (str): model used to generate answers
    return:
        Dict: dictionary of results containing the query, context and answer
    """
    # Create the query embedding
    query_embedding = create_embeddings(query)
    # Initial retrieval (fetch more results than needed so reranking has candidates to work with)
    initial_results = vector_store.similarity_search(query_embedding, k=10)
    # Apply reranking
    if reranking_method == "llm":
        reranked_results = rerank_with_llm(query, initial_results, top_n=top_n)
    elif reranking_method == "keywords":
        reranked_results = rerank_with_keywords(query, initial_results, top_n=top_n)
    else:
        # No reranking: directly use the top_n results of the initial retrieval
        reranked_results = initial_results[:top_n]
    # Merge the reranked context
    context = "\n\n===\n\n".join([result["text"] for result in reranked_results])
    # Generate the answer based on the context
    response = generate_response(query, context, model)
    return {
        "query": query,
        "reranking_method": reranking_method,
        "initial_results": initial_results[:top_n],
        "reranked_results": reranked_results,
        "context": context,
        "response": response
    }
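A usage sketch for the pipeline above. It assumes a vector store whose similarity_search accepts a k argument (matching the call inside rag_with_reranking) and that create_embeddings and generate_response from earlier are available; the query string is just an example.
# Usage sketch: run the pipeline with LLM scoring, then with the keyword heuristic
query = "Which company is the intelligent coding assistance tool Tongyi Lingma produced by?"
for method in ("llm", "keywords"):
    result = rag_with_reranking(query, vector_store, reranking_method=method, top_n=3)
    print(f"\n--- Reranking method: {method} ---")
    print(result["response"])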
8. Relevant Segment Extraction (RSE) for Enhancing RAG
Different from the traditional approach of simply retrieving multiple isolated blocks of text, our goal is to identify and reconstruct continuous text fragments to provide more complete and logical context information for the language model.
Core Concept:
In documents, text blocks related to user questions often appear in the same area or in consecutive paragraphs . If we can identify the connections between these related text blocks and organize them into a coherent whole paragraph in sequence , we can significantly improve the language model's ability to understand the context.
Why use RSE?
Problems with traditional RAG:
The search results consist of multiple unconnected text blocks;
Transitions and background information may be missing between blocks;
This makes it difficult for the language model to understand and even causes quotations to be taken out of context.
The advantages of RSE:
✅ Combine related text blocks into continuous paragraphs;
✅ Preserve the original text structure and semantic coherence;
✅ Provide a more natural and complete context to the language model;
✅ Improve the accuracy and fluency of the final answer.
The implementation steps of RSE are:
1. Preliminary search
Use semantic search to find the text blocks that are most relevant to the user's question from the vector library.
2. Position sorting
If the text blocks in the original document have numbering or position information (such as page numbers, paragraph order), we can reorder the search results based on this information.
3. Cluster analysis
Analyze which text blocks are close to each other and have similar semantics in the original text, and group them together to form "related paragraph clusters".
4. Paragraph reconstruction
The text blocks belonging to the same cluster are stitched together to form a complete context paragraph. If necessary, the adjacent preceding and following content can be added to enhance the context coherence.
5. Input language model
The reconstructed continuous paragraphs are used as context and provided to the large language model (LLM) to generate the final answer.
Example description:
Suppose a user asks: "What are the applications of deep learning?"
Traditional RAG might return the following three isolated blocks of text:
1. “Deep learning is widely used in image recognition.”
2. “It is also used for speech recognition and natural language processing.”
3. “It also has important applications in the field of autonomous driving.”
With RSE, we can concatenate these three blocks into one paragraph in the original order:
"Deep learning is widely used in image recognition. It is also used in speech recognition and natural language processing, and has important applications in the field of autonomous driving."
This way, the language model is better able to understand the overall meaning rather than processing several independent sentences separately.
Creating a simple vector database
import numpy as np

class SimpleVectorStore:
    """
    A lightweight vector store implementation using NumPy.
    """
    def __init__(self, dimension=1536):
        """
        Initialize the vector store.
        parameter:
            dimension (int): dimension of the embedding vectors
        """
        self.dimension = dimension
        self.vectors = []        # store vector data
        self.documents = []      # store document content
        self.metadata_list = []  # store metadata

    def add_documents(self, documents, vectors=None, metadata_list=None):
        """
        Add documents to the vector store.
        parameter:
            documents (List[str]): list of document chunks
            vectors (List[List[float]], optional): list of embedding vectors
            metadata_list (List[Dict], optional): list of metadata dictionaries
        """
        if vectors is None:
            vectors = [None] * len(documents)
        if metadata_list is None:
            metadata_list = [{} for _ in range(len(documents))]
        for doc, vec, metadata in zip(documents, vectors, metadata_list):
            self.documents.append(doc)
            self.vectors.append(vec)
            self.metadata_list.append(metadata)

    def search(self, query_vector, top_k=5):
        """
        Search for the most similar documents.
        parameter:
            query_vector (List[float]): query embedding vector
            top_k (int): number of results to return
        return:
            List[Dict]: list of results containing documents, similarity scores and metadata
        """
        if not self.vectors or not self.documents:
            return []
        # Convert the query vector to a NumPy array
        query_array = np.array(query_vector)
        # Compute cosine similarity against every stored vector
        similarities = []
        for index, vector in enumerate(self.vectors):
            if vector is not None:
                similarity = np.dot(query_array, vector) / (
                    np.linalg.norm(query_array) * np.linalg.norm(vector)
                )
                similarities.append((index, similarity))
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        # Collect the top-k results
        results = []
        for index, score in similarities[:top_k]:
            results.append({
                "document": self.documents[index],
                "score": float(score),
                "metadata": self.metadata_list[index]
            })
        return results
Processing documents using RSE
Now, let’s implement the core functionality of Relevant Segment Extraction (RSE) .
For example
Imagine you are looking for a book in a library and the administrator gives you a list of dozens of potentially related books.
You start reading one book at a time and find that some of them do indeed tell you what you want to know, while some of them only seem relevant in title but have completely different contents.
So you rate each book:
Very relevant content: 90 points
Somewhat related: 60 points
Irrelevant: a penalty is applied, dropping the score to 40 points or even below zero
In the end, you only picked the ones with high scores to read.
The RSE steps below do just that: they score every chunk and pick out the most useful information.
from typing import List , Dict , Tuple
def process_document ( pdf_path: str , chunk_size: int = 800 ) -> Tuple [ List [ str ], SimpleVectorStore, Dict ]:
"""
Processes a document for use with RSE (Relevant Segment Extraction).
parameter:
pdf_path(str): path to the PDF document
chunk_size(int): The size of each text chunk (number of characters)
return:
Tuple[List[str], SimpleVectorStore, Dict]: a tuple containing a list of text chunks, a vector store instance, and document information
"""
print ( "Extracting text from document..." )
# Extract text content from PDF files
document_text = extract_text_from_pdf(pdf_path)
print ( "Split text into non-overlapping segments..." )
# Split the extracted text into non-overlapping text blocks
text_chunks = chunk_text(document_text, chunk_size=chunk_size, overlap= 0 )
print ( f"A total of { len (text_chunks)} text chunks were created " )
print ( "Generate embedding vector for text block..." )
# Generate embedding vectors for each text block
chunk_embeddings = create_embeddings(text_chunks)
# Create a SimpleVectorStore instance to store vector data
vector_store = SimpleVectorStore()
# Add a document with metadata (including text block index for subsequent reconstruction)
metadata_list = [{ "chunk_index" : index, "source" : pdf_path} for index in range ( len (text_chunks))]
vector_store.add_documents(text_chunks, chunk_embeddings, metadata_list)
# Record the original document structure for subsequent splicing
document_info = {
"chunks" : text_chunks,
"source" : pdf_path,
}
return text_chunks, vector_store, document_info
✅ Summary
With the document processed, the remaining RSE steps will:
1. Convert the user's question into a vector;
2. Find which text blocks in the vector store are most similar to this question;
3. Score each text block: the more relevant the text, the higher the score;
4. Apply a penalty to irrelevant text blocks so they are easier to ignore;
5. Return a list that tells the system which chunks are important and which are not.
RSE core algorithm: Calculate the value of text blocks and find the best paragraphs
Now that we have document processing capabilities and the ability to generate embedding vectors for each block of text, we can now start implementing the core algorithm of RSE (Relevant Paragraph Extraction) .
def calculate_chunk_values ( query: str , chunks: List [ str ], vector_store, irrelevant_chunk_penalty: float = 0.2 ) -> List [ float ]:
"""
Calculate the value of each document slice, combining its relevance score and location information.
parameter:
query(str): query text entered by the user
chunks(List[str]): List of text chunks after the document is segmented
vector_store: vector database, containing vector representations of document blocks
irrelevant_chunk_penalty (float): The penalty imposed on irrelevant document chunks. The default value is 0.2
return:
List[float]: List of values corresponding to each document block (floating point number)
"""
# Convert user queries into embedding vectors for semantic matching
query_embedding = create_embeddings([query])[ 0 ]
# Get the number of all text blocks and search for similarity results
total_chunks = len (chunks)
search_results = vector_store.search(query_embedding, top_k=total_chunks)
# Build a mapping dictionary from chunk_index to relevance score
relevance_scores = {
result[ "metadata" ][ "chunk_index" ]: result[ "score" ]
for result in search_results
}
# Calculate the value based on the relevance score and apply the penalty mechanism for irrelevant blocks
chunk_values = []
for i in range (total_chunks):
score = relevance_scores.get(i, 0.0 )
value = score - irrelevant_chunk_penalty
chunk_values.append(value)
return chunk_values
Let’s take a daily example
Imagine you are looking at the table of contents of a book, and each chapter has an "importance score" in front of it. You want to:
1. Pick a few chapters to read ;
2. Read a maximum of 20 sections per chapter ;
3. No more than 30 sections in total ;
4. Each piece of content must be valuable (score greater than 0.2) ;
Then you try different combinations from the beginning: "How about reading 5 sections starting from section 3?", "How about reading 3 sections starting from section 10?"... and finally select the few ranges that score the highest. A minimal sketch of such a selection function (find_best_segments) follows below.
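The complete pipeline later in this section calls find_best_segments, which is not shown in the article. Below is a minimal sketch under the constraints just described (segments of at most 20 chunks, at most 30 chunks in total, minimum segment value 0.2), using a simple greedy selection; treat it as one plausible implementation rather than the original author's code.
from typing import List, Tuple

def find_best_segments(chunk_values: List[float],
                       max_segment_length: int = 20,
                       total_max_length: int = 30,
                       min_segment_value: float = 0.2) -> Tuple[List[Tuple[int, int]], List[float]]:
    """
    Sketch: greedily select non-overlapping (start, end) chunk ranges with the highest summed value.
    """
    n = len(chunk_values)
    candidates = []
    # Enumerate every contiguous segment up to max_segment_length and score it by its summed chunk value
    for start in range(n):
        for end in range(start + 1, min(start + max_segment_length, n) + 1):
            value = sum(chunk_values[start:end])
            if value >= min_segment_value:
                candidates.append((value, start, end))
    # Greedily keep the highest-value segments that do not overlap and respect the total chunk budget
    candidates.sort(reverse=True)
    selected, used_chunks, used_total = [], set(), 0
    for value, start, end in candidates:
        length = end - start
        if used_total + length > total_max_length:
            continue
        if any(i in used_chunks for i in range(start, end)):
            continue
        selected.append((start, end, value))
        used_chunks.update(range(start, end))
        used_total += length
    # Return segments in document order so the reconstructed context reads naturally
    selected.sort(key=lambda seg: seg[0])
    best_segments = [(start, end) for start, end, _ in selected]
    scores = [value for _, _, value in selected]
    return best_segments, scores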
Rebuilding and using paragraphs in RAG
def reconstruct_segments ( document_chunks: List [ str ], best_segment_indices: List [ Tuple [ int , int ]] ) -> List [ Dict ]:
"""
Reconstruct the text paragraph based on the optimal slice index.
parameter:
document_chunks(List[str]): All text chunks of the original document
best_segment_indices(List[Tuple[int, int]]): List of start and end indices of the best segment
return:
List[Dict]: List of dictionaries containing the reconstructed paragraphs and their ranges
"""
reconstructed_segments = []
for start_idx, end_idx in best_segment_indices:
segment_text = " " .join(document_chunks[start_idx:end_idx])
reconstructed_segments.append({
"text" : segment_text,
"segment_range" : (start_idx, end_idx),
})
return reconstructed_segments
def format_segments_for_context ( segments: List [ Dict ] ) -> str :
"""
Format a paragraph of text into a context string that is usable by a language model.
parameter:
segments(List[Dict]): List of dictionaries containing paragraph text and index ranges
return:
str: formatted context text
"""
context_lines = []
for index, segment in enumerate (segments):
header = f"SEGMENT {index + 1 } (Chunks {segment[ 'segment_range' ][ 0 ]} - {segment[ 'segment_range' ][ 1 ] - 1 } ):"
context_lines.append(header)
context_lines.append(segment[ "text" ])
context_lines.append( "-" * 80 )
return "\n\n" .join(context_lines)
For example
Suppose the input is these two paragraphs:
segments = [
    {
        "segment_range": [2, 5],
        "text": "Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior."
    },
    {
        "segment_range": [7, 9],
        "text": "Deep learning is a special type of machine learning method that is particularly good at processing image and speech data."
    }
]
Then the output will be like this:
SEGMENT 1 (Chunks 2-4):
Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior.
--------------------------------------------------------------------------------
SEGMENT 2 (Chunks 7-8):
Deep learning is a special type of machine learning that is particularly good at processing image and speech data.
--------------------------------------------------------------------------------
Complete pipeline
def rag_with_rse ( pdf_path: str , query: str , chunk_size: int = 800 , penalty: float = 0.2 ) -> Dict :
"""
The complete RAG process uses the Relevant Paragraph Extraction (RSE) strategy to filter out the most useful document content.
parameter:
pdf_path(str): PDF document path
query(str): user query
chunk_size(int): text slice size
penalty(float): penalty coefficient for irrelevant slices
return:
Dict: A dictionary containing the query, selected paragraphs, and generated answers
"""
print ( "\n=== Start executing the RAG process based on relevant paragraph extraction===" )
print ( f"Query content: {query} " )
# Step 1: Process documents and generate vector storage
text_chunks, vector_store, doc_info = process_document(pdf_path, chunk_size)
# Step 2: Calculate the relevance score and value of each text block
print ( "\nCalculating text block relevance score and value..." )
chunk_values = calculate_chunk_values(query, text_chunks, vector_store, penalty)
# Step 3: Select the best paragraph based on value
best_segments, scores = find_best_segments(
chunk_values=chunk_values,
max_segment_length = 20 ,
total_max_length = 30 ,
min_segment_value = 0.2
)
# Step 4: Reconstruct the best paragraph
print ( "\nRebuilding best text paragraph..." )
selected_segments = reconstruct_segments(text_chunks, best_segments)
# Step 5: Format the context for use with the large model
formatted_context = format_segments_for_context(selected_segments)
# Step 6: Call the large model to generate the final response
response = generate_response(query, formatted_context)
# Arrange the output results
result = {
"query" : query,
"segments" : selected_segments,
"response" : response
}
print ( "\n=== The final reply is as follows===" )
print (response)
return result
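A usage sketch that runs the RSE pipeline end to end, reusing the example PDF path from earlier; the query is illustrative.
# Usage sketch: run the RSE-based RAG pipeline
rse_result = rag_with_rse(
    pdf_path="knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf",
    query="Which company is the intelligent coding assistance tool Tongyi Lingma produced by?",
    chunk_size=800,
    penalty=0.2
)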
9. Context Compression Technology: Improving RAG System Efficiency
We will filter and compress the retrieved text blocks to keep only the most relevant content , thereby:
✅ Reduce noise information;
✅ Improve the accuracy and relevance of language model answers;
✅ Make more efficient use of limited context windows.
Background
When using the RAG system for document retrieval, we often get some text blocks containing mixed content :
Some sentences are relevant to the user’s question;
Some sentences are completely irrelevant or are just background information.
For example:
“Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior. Many AI systems rely on big data for training. Deep learning is a special type of machine learning method.”
If the user's question is: "What is artificial intelligence?" then only the first sentence is the most relevant, and the rest of the content, although correct, is irrelevant to the current question.
The goal of context compression
What we are going to do is:
✅ Remove irrelevant sentences or paragraphs ;
✅Only keep information that is highly relevant to the user’s query ;
✅ Maximize the “useful information density” in the context window .
This allows the language model to focus more on key content and avoid being distracted by irrelevant information, thereby improving the quality of the final answer.
Implementation ideas
We will implement a simple context compression process from scratch, which mainly includes the following steps:
1. Analyze relevance sentence by sentence
Each text block is split into sentences, and a semantic model (such as BERT, Sentence-BERT, etc.) is used to calculate the relevance score between each sentence and the user query.
Example (pseudocode; see the sketch after this list):
scores = [similarity(query_embedding, embed(sentence)) for sentence in sentences]
2. Set threshold or select Top-K sentences
We can choose one of two strategies to filter sentences:
✅ Keep sentences with scores above a certain threshold;
✅ Or keep the top K sentences with the highest scores.
3. Reconstruct the compressed context
Reassemble the filtered sentences into a new, more compact contextual paragraph in their original order.
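Before moving on to the LLM-based implementation below, here is a minimal embedding-based sketch of steps 1 to 3. The embed argument is a placeholder for whatever embedding call you use (for example the article's create_embeddings, adapted to return a plain vector); everything else is plain NumPy.
import numpy as np

def compress_by_sentence(chunk, query, embed, top_k=3):
    """
    Sketch: keep only the top_k sentences most similar to the query, in their original order.
    embed is a placeholder function that maps a string to a list/array of floats.
    """
    # Naive sentence split; a real implementation would use a proper sentence tokenizer
    sentences = [s.strip() for s in chunk.replace("?", ".").replace("!", ".").split(".") if s.strip()]
    if not sentences:
        return ""
    query_vec = np.array(embed(query))
    scored = []
    for i, sentence in enumerate(sentences):
        vec = np.array(embed(sentence))
        score = float(np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        scored.append((score, i, sentence))
    # Take the top_k sentences by score, then restore their original document order
    kept = sorted(sorted(scored, reverse=True)[:top_k], key=lambda item: item[1])
    return ". ".join(sentence for _, _, sentence in kept) + "."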
Example Demo
Original text block:
“Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior. Many AI systems rely on big data for training. Deep learning is a special type of machine learning method.”
User Question:
“What is artificial intelligence?”
Contents retained after compression:
“Artificial intelligence is a branch of computer science that seeks to enable machines to simulate intelligent human behavior.”
Implementing context compression
This is the core part of our approach. We will use a large language model to filter and compress the retrieved content , thus retaining the information most relevant to the user’s question.
def compress_chunk ( chunk: str , query: str , compression_type: str = "selective" , model: str = "qwen-max" ) -> Tuple [ str , float ]:
"""
Compresses the retrieved text blocks, keeping only the parts relevant to the query.
parameter:
chunk(str): the text block to be compressed
query(str): user query
compression_type (str): compression method ("selective", "summary", or "extraction")
model(str): LLM model name used
return:
Tuple[str, float]: compressed text block and compression ratio (percentage)
"""
# Build system prompts based on different compression types
if compression_type == "selective" :
system_prompt = """You are an expert at filtering information.
Your task is to analyze the document snippet and extract the sentences or paragraphs that are directly related to the user query. Remove all irrelevant content.
Output requirements:
1. Include only the text that helps answer the question
2. Keep the original wording of relevant sentences (do not paraphrase)
3. Maintain the original order
4. Include all relevant content, even if it appears to be repetitive
5. Exclude any text that is not relevant to the question
Please output in plain text format without adding additional instructions. """
elif compression_type == "summary" :
system_prompt = """You are a summary expert.
Your task is to succinctly summarize the given document snippet, focusing only on the information relevant to the user's query.
Output requirements:
1. Be concise but cover all relevant content
2. Focus on information relevant to your query
3. Ignore irrelevant details
4. Write in a neutral, objective tone
Please output in plain text format without adding additional instructions. """
else : # extraction
system_prompt = """You are an information extraction expert.
Your task is to extract the exact sentences that contain relevant information from the document fragment to answer the user's query.
Output requirements:
1. Include only relevant sentences from the original text
2. Keep the original sentence unchanged (do not modify it)
3. Include only sentences directly related to the question
4. Separate each sentence with a line break
5. Do not add any comments or other content
Please output in plain text format without adding additional instructions. """
# Build user prompt
user_prompt = f"""
Query: {query}
Document snippet:
{chunk}
Extract content relevant to the query.
"""
# Call the large model API for compression
response = client.chat.completions.create(
model=model,
messages=[
{ "role" : "system" , "content" : system_prompt},
{ "role" : "user" , "content" : user_prompt}
],
temperature = 0
)
# Get the compressed content
compressed_content = response.choices[ 0 ].message.content.strip()
# Calculate the compression ratio
original_length = len (chunk)
compressed_length = len (compressed_content)
compression_ratio = (original_length - compressed_length) / original_length * 100
return compressed_content, compression_ratio
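The complete pipeline below calls batch_compress_chunks, which is not shown in the article. A straightforward version simply loops over the retrieved chunks and reuses compress_chunk; here is a minimal sketch under that assumption.
from typing import List, Tuple

def batch_compress_chunks(chunks: List[str], query: str, compression_type: str = "selective", model: str = "qwen-max") -> List[Tuple[str, float]]:
    """Sketch: compress each retrieved chunk in turn and return (compressed_text, compression_ratio) pairs."""
    print(f"Compressing {len(chunks)} text chunks...")
    results = []
    for i, chunk in enumerate(chunks):
        print(f"Compressing chunk {i + 1}/{len(chunks)}...")
        compressed_text, ratio = compress_chunk(chunk, query, compression_type, model)
        results.append((compressed_text, ratio))
    return results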
Complete pipeline
def rag_with_compression ( pdf_path: str , query: str , k: int = 10 , compression_type: str = "selective" , model: str = "qwen-max" ) -> Dict :
"""
The complete RAG process, using the context compression strategy to reduce the input length.
parameter:
pdf_path(str): PDF file path
query(str): user query
k (int): the number of text blocks to retrieve initially
compression_type(str): compression method
model(str): the name of the large model used
return:
Dict: A dictionary containing the query, the content before and after compression, the response, etc.
"""
print ( "\n=== Context compression RAG process starts===" )
print ( f"Query content: {query} " )
print ( f"Compression type: {compression_type} " )
# Load the document and create vector storage
text_chunks, vector_store = process_document(pdf_path) # assumes a process_document variant that returns (chunks, vector store), like the one in the question-enhancement section
# Create query embeddings
query_embedding = create_embeddings(query)
# Retrieve the top k most relevant text blocks
print ( f"Retrieving the first {k} related text blocks..." )
results = vector_store.similarity_search(query_embedding, k=k)
retrieved_chunks = [result[ "text" ] for result in results]
# Compress each text block
compressed_results = batch_compress_chunks(retrieved_chunks, query, compression_type, model)
compressed_chunks = [result[ 0 ] for result in compressed_results]
compression_ratios = [result[ 1 ] for result in compressed_results]
# Filter empty content
valid_compressed_data = [(chunk, ratio) for chunk, ratio in zip (compressed_chunks, compression_ratios) if chunk.strip()]
if not valid_compressed_data:
# Fall back to the original text blocks when every compressed text is empty
print ( "Warning: All text blocks were compressed to nothing. The original text blocks will be used." )
valid_compressed_data = [(chunk, 0.0 ) for chunk in retrieved_chunks]
compressed_chunks, compression_ratios = zip (*valid_compressed_data)
# Build context
context = "\n\n---\n\n" .join(compressed_chunks)
# Generate final response
print ( "Generate reply based on compressed text block..." )
response = generate_response(query, context, model)
# Return results
result = {
"query" : query,
"original_chunks" : retrieved_chunks,
"compressed_chunks" : compressed_chunks,
"compression_ratios" : compression_ratios,
"context_length_reduction" : f" { sum (compression_ratios) / len (compression_ratios): .2 f} %" ,
"response" : response
}
print ( "\n=== Final reply===" )
print (response)
return result
10. Feedback Loop in RAG
In this section, I will implement a RAG system with a feedback mechanism that can continuously optimize itself over time. By collecting and integrating user feedback, it can:
✅ Learn which responses are effective and which need improvement;
✅ Continuously improve the relevance of search results and the quality of answers;
✅ Get “smarter” with every interaction.
Limitations of Traditional RAG Systems
Traditional RAG systems are static :
They retrieve information based only on vector similarity and do not learn from user feedback.
This means:
If an inaccurate or irrelevant answer is returned, the system will not automatically correct it;
Even if the same question is asked multiple times, the system does not "remember" previous better responses.
Advantages of RAG with feedback mechanism
We built a dynamic and adaptive RAG system with the following capabilities:
✅Memory function : remember which documents have provided useful information and which have not;
✅Dynamically adjust the score : Update the relevance score of the document based on historical feedback;
✅Knowledge accumulation : add successful question-answer pairs to the knowledge base for future queries;
✅Continuous evolution : Every interaction with users is a learning opportunity. The system will become more accurate and better with use.
The core process of the feedback mechanism
1. User questions
The user enters a question and gets an answer generated by the RAG system.
2. Get user feedback
Users can provide feedback through rating, likes/dislikes, or direct comments;
Example:
“This answer is very helpful✅”
“This answer is not detailed enough ❌”
"Could you please provide more details?"
3. Record feedback data
Store user questions, original answers, feedback content and other information to form a feedback log.
4. Analysis and learning
Use models to analyze which documents and passages produce high-quality responses;
Adjust the retrieval weight of these documents in the future;
Add high-quality question-answer pairs to the knowledge base to enhance future semantic understanding.
5. Optimize your next answer
The next time you encounter a similar question, the system can find the best answer faster and more accurately.
Example Demo
User's first question:
“What is machine learning?”
System answer:
“Machine learning is an artificial intelligence technique that allows computers to learn patterns from data.”
User feedback:
“Not bad, but could be more detailed.”
The second time, a similar question is asked:
“What are the fundamental principles of machine learning?”
The system combines previous feedback and returns a more detailed answer:
“Machine learning is an artificial intelligence technique that allows computers to automatically learn patterns and regularities through training data. Common methods include supervised learning, unsupervised learning, and reinforcement learning.”
Building a simple vector database
from typing import List , Dict , Optional , Tuple , Callable
import numpy as np
class SimpleVectorStore:
"""
A simple vector database implementation based on NumPy.
This class provides an in-memory vector storage and retrieval system, supporting basic similarity searches using cosine similarity.
"""
def __init__ ( self ):
"""
Initialize the vector database, which contains three parallel lists:
- vectors: stores embedding vectors (NumPy arrays)
- texts: stores raw text blocks
- metadata: stores metadata for each text block
"""
self.vectors: List [np.ndarray] = [] # Embedding vector list
self.texts: List [ str ] = [] # Text content list
self.metadata: List [ Dict ] = [] # metadata list
def add_item ( self, text: str , embedding: List [ float ], metadata: Optional [ Dict ] = None ) -> None :
"""
Adds an entry to the vector database.
parameter:
text (str): the text content to be stored
embedding (List[float]): represents the embedding vector of the text
metadata (Dict, optional): Optional metadata dictionary
"""
self.vectors.append(np.array(embedding))
self.texts.append(text)
self.metadata.append(metadata or {})
def similarity_search (
self,
query_embedding: List [ float ],
k: int = 5 ,
filter_func: Optional [ Callable [[ Dict ], bool ]] = None
) -> List [ Dict ]:
"""
Use cosine similarity to find the entries that are most similar to the query vector.
parameter:
query_embedding (List[float]): query embedding
k (int): number of results to return
filter_func (Callable, optional): filter function used to filter results based on metadata
return:
List[Dict]: A list of results containing text, metadata, and relevance scores
"""
if not self.vectors:
return []
query_vector = np.array(query_embedding)
similarities = []
for i, vector in enumerate (self.vectors):
if filter_func and not filter_func(self.metadata[i]):
continue
similarity = np.dot(query_vector, vector) / (
np.linalg.norm(query_vector) * np.linalg.norm(vector)
)
similarities.append((i, similarity))
similarities.sort(key= lambda x: x[ 1 ], reverse= True )
results = []
for i in range ( min (k, len (similarities))):
idx, score = similarities[i]
results.append({
"text" : self.texts[idx],
"metadata" : self.metadata[idx],
"similarity" : score,
"relevance_score" : self.metadata[idx].get( "relevance_score" , score)
})
return results
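The filter_func hook lets the caller restrict a search by metadata. For example, once feedback-enhanced entries exist (see fine_tune_index later in this section), a search could consider only those; the store and query_embedding names below are hypothetical placeholders.
# Usage sketch: only consider entries whose metadata marks them as feedback-enhanced
feedback_only_results = store.similarity_search(
    query_embedding,
    k=3,
    filter_func=lambda metadata: metadata.get("type") == "feedback_enhanced"
)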
Feedback system function module
Now we will implement the core feedback system components .
from datetime import datetime

def get_user_feedback ( query: str , response: str , relevance: int , quality: int , comments: str = "" ) -> Dict :
"""
Format user feedback as a dictionary.
parameter:
query(str): user's question
response(str): the system's answer
relevance(int): relevance score (1-5)
quality(int): answer quality score (1-5)
comments(str): optional comments
return:
Dict: Formatted feedback dictionary
"""
return {
"query" : query,
"response" : response,
"relevance" : int (relevance),
"quality" : int (quality),
"comments" : comments,
"timestamp" : datetime.now().isoformat()
}
def store_feedback ( feedback: Dict , feedback_file: str = "feedback_data.json" ) -> None :
"""
Save the feedback data into a JSON file.
parameter:
feedback(Dict): feedback data
feedback_file(str): file path
"""
with open (feedback_file, "a" ) as f:
json.dump(feedback, f)
f.write( "\n" )
def load_feedback_data ( feedback_file: str = "feedback_data.json" ) -> List [ Dict ]:
"""
Load feedback data from a file.
parameter:
feedback_file(str): file path
return:
List[Dict]: Feedback data list
"""
feedback_data = []
try :
with open (feedback_file, "r" ) as f:
for line in f:
if line.strip():
feedback_data.append(json.loads(line.strip()))
except FileNotFoundError:
print ( "Feedback file not found. Starting with empty feedback." )
return feedback_data
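A quick end-to-end check of these helpers; the scores and comment below are made up for illustration.
# Usage sketch: record one piece of feedback and read the log back
feedback = get_user_feedback(
    query="What is Tongyi Lingma?",
    response="Tongyi Lingma is an intelligent coding assistant.",
    relevance=5,
    quality=4,
    comments="Helpful, but could be more detailed."
)
store_feedback(feedback)
print(f"{len(load_feedback_data())} feedback records on disk")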
Feedback-aware document processing
def process_document ( pdf_path: str , chunk_size: int = 1000 , chunk_overlap: int = 200 ) -> Tuple [ List [ str ], SimpleVectorStore]:
"""
Process documents for RAG process.
step:
1. Extract text from PDF
2. Segment the text
3. Create an embed
4. Store in vector database
parameter:
pdf_path(str): PDF file path
chunk_size(int): size of each slice
chunk_overlap(int): the number of overlapping characters between slices
return:
Tuple[List[str], SimpleVectorStore]: text slices and vector database
"""
print ( "Extracting text from PDF..." )
extracted_text = extract_text_from_pdf(pdf_path)
print ( "Splitting text..." )
chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
print ( f"Generated { len (chunks)} text slices" )
print ( "Creating embedding for slice..." )
chunk_embeddings = create_embeddings(chunks)
store = SimpleVectorStore()
for i, (chunk, embedding) in enumerate ( zip (chunks, chunk_embeddings)):
store.add_item(
text=chunk,
embedding=embedding,
metadata={
"index" : i,
"source" : pdf_path,
"relevance_score" : 1.0 ,
"feedback_count" : 0
}
)
print ( f" { len (chunks)} slices have been added to the vector database" )
return chunks, store
def assess_feedback_relevance ( query: str , doc_text: str , feedback: dict ) -> bool :
"""
Use language model (LLM) to determine whether a piece of past user feedback is relevant to the current query and document content.
This function is used to decide which historical feedback should affect the current search result ranking or result generation.
parameter:
query(str): the query question of the current user
doc_text (str): the document content being evaluated (i.e. the block of text to be retrieved)
feedback(Dict): previously saved user feedback data, including 'query' and 'response' fields
return:
bool: Returns True if the feedback is relevant to the current query and document; otherwise returns False
"""
# System prompt words: Tell the AI that it can only determine whether the feedback is relevant, and cannot do anything else
system_prompt = """You are the expert on whether feedback is relevant or not. Please just answer "yes" or "no" without providing any explanation or anything else.""""
# User prompt words: build input context, including current query, past questions, document content snippets, and previous answers
user_prompt = f"""
Current query: {query}
Past feedback questions: {feedback[ 'query' ]}
Document content: {doc_text[: 500 ]} ... [truncated]
Past responses that received feedback: {feedback[ 'response' ][: 500 ]} ... [truncated]
Is this historical feedback relevant to the current query and document content? Please answer yes or no.
"""
# Call the LLM model API to obtain the judgment result
response = client.chat.completions.create(
model = "qwen-max" , # Model name used
messages=[
{ "role" : "system" , "content" : system_prompt}, # System instructions (translated into Chinese)
{ "role" : "user" , "content" : user_prompt} # User input content
],
temperature = 0 # Set to 0 to ensure output determinism
)
# Extract model response and process it
answer = response.choices[ 0 ].message.content.strip().lower()
# Check if it contains "yes"
return 'yes' in answer
def adjust_relevance_scores ( query: str , results: List [ Dict ], feedback_data: List [ Dict ] ) -> List [ Dict ]:
"""
Adjust the relevance score of search results based on historical feedback.
parameter:
query(str): current query
results(List[Dict]): search results
feedback_data(List[Dict]): historical feedback data
return:
List[Dict]: adjusted result
"""
if not feedback_data:
return results
print ( "Adjusting relevance score based on feedback history..." )
for i, result in enumerate (results):
document_text = result[ "text" ]
relevant_feedback = []
for fb in feedback_data:
if assess_feedback_relevance(query, document_text, fb):
relevant_feedback.append(fb)
if relevant_feedback:
avg_relevance = sum (f[ 'relevance' ] for f in relevant_feedback) / len (relevant_feedback)
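# Map the 1-5 average feedback score to a multiplier between 0.7 and 1.5 (e.g. 1 -> 0.7, 3 -> 1.1, 5 -> 1.5)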
modifier = 0.5 + (avg_relevance / 5.0 )
original_score = result[ "similarity" ]
adjusted_score = original_score * modifier
result.update({
"original_similarity" : original_score,
"similarity" : adjusted_score,
"relevance_score" : adjusted_score,
"feedback_applied" : True ,
"feedback_count" : len (relevant_feedback)
})
print ( f" Document {i+ 1 } : score adjusted from {original_score: .4 f} to {adjusted_score: .4 f} based on { len (relevant_feedback)} pieces of feedback" )
results.sort(key= lambda x: x[ "similarity" ], reverse= True )
return results
def rag_with_feedback_loop (
query: str ,
vector_store: SimpleVectorStore,
feedback_data: List [ Dict ],
k: int = 5 ,
model: str = "qwen-amax"
) -> Dict :
"""
Execute the complete RAG process with feedback mechanism.
parameter:
query(str): user query
vector_store(SimpleVectorStore): vector database
feedback_data(List[Dict]): historical feedback data
k(int): number of retrievals
model(str): LLM model
return:
Dict: contains the query, retrieved documents and the result of the response
"""
print ( f"\n=== Processing RAG query with feedback===" )
print ( f"Query: {query} " )
query_embedding = create_embeddings(query)
results = vector_store.similarity_search(query_embedding, k=k)
adjusted_results = adjust_relevance_scores(query, results, feedback_data)
retrieved_texts = [result[ "text" ] for result in adjusted_results]
context = "\n\n---\n\n" .join(retrieved_texts)
print ( "Generating reply..." )
response = generate_response(query, context, model)
return {
"query" : query,
"retrieved_documents" : adjusted_results,
"response" : response
}
Using feedback to fine-tune your index
def fine_tune_index (
current_store: SimpleVectorStore,
chunks: List [ str ],
feedback_data: List [ Dict ]
) -> SimpleVectorStore:
"""
Fine-tune the vector database using high-quality feedback.
parameter:
current_store(SimpleVectorStore): current database
chunks(List[str]): raw text slices
feedback_data(List[Dict]): historical feedback data
return:
SimpleVectorStore: a fine-tuned database
"""
print ( "Fine-tuning index using high-quality feedback..." )
good_feedback = [fb for fb in feedback_data if fb[ 'relevance' ] >= 4 and fb[ 'quality' ] >= 4 ]
if not good_feedback:
print ( "No high-quality feedback found." )
return current_store
new_store = SimpleVectorStore()
for i in range ( len (current_store.texts)):
new_store.add_item(
text=current_store.texts[i],
embedding=current_store.vectors[i],
metadata=current_store.metadata[i].copy()
)
for feedback in good_feedback:
enhanced_text = f"Question: {feedback[ 'query' ]} \nAnswer: {feedback[ 'response' ]} "
embedding = create_embeddings(enhanced_text)
new_store.add_item(
text=enhanced_text,
embedding=embedding,
metadata={
"type" : "feedback_enhanced" ,
"query" : feedback[ "query" ],
"relevance_score" : 1.2 ,
"feedback_count" : 1 ,
"original_feedback" : feedback
}
)
print ( f"Feedback content added: {feedback[ 'query' ][: 50 ]} ..." )
print ( f"After fine-tuning the index contains { len (new_store.texts) } items (original: { len (chunks)} )" )
return new_store
Complete workflow: from initial setup to feedback collection
def full_rag_workflow (
pdf_path: str ,
query: str ,
feedback_data: Optional [ List [ Dict ]] = None ,
feedback_file: str = "feedback_data.json" ,
fine_tune : bool = False
) -> Dict :
"""
Complete RAG workflow with integrated feedback mechanism.
parameter:
pdf_path(str): PDF file path
query(str): user query
feedback_data(Optional[List[Dict]]): Historical feedback data
feedback_file(str): feedback file path
fine_tune(bool): whether to enable index fine tuning
return:
Dict: contains the result of the response and the retrieved information
"""
if feedback_data is None :
feedback_data = load_feedback_data(feedback_file)
print ( f" { len (feedback_data)} pieces of feedback loaded from {feedback_file} " )
chunks, vector_store = process_document(pdf_path)
if fine_tune and feedback_data:
vector_store = fine_tune_index(vector_store, chunks, feedback_data)
result = rag_with_feedback_loop(query, vector_store, feedback_data)
print ( "\n=== Would you like to provide feedback on this reply? ===" )
print ( "Score relevance (1-5):" )
relevance = input ()
print ( "Rating quality (1-5):" )
quality = input ()
print ( "Any comments? (skip)" )
comments = input ()
feedback = get_user_feedback(
query=query,
response=result[ "response" ],
relevance = int (relevance),
quality = int (quality),
comments=comments
)
store_feedback(feedback, feedback_file)
print ( "Thanks for your feedback!" )
return result
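Finally, a usage sketch that ties the feedback loop together, reusing the example PDF from earlier; later runs can set fine_tune=True once enough high-quality feedback has accumulated.
# Usage sketch: the first run builds the index and collects feedback interactively
result = full_rag_workflow(
    pdf_path="knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf",
    query="Which company is the intelligent coding assistance tool Tongyi Lingma produced by?",
    fine_tune=False
)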