RAG techniques and underlying code analysis

Written by
Caleb Hayes
Updated on: June 13, 2025
Recommendation

An in-depth analysis of RAG technology that walks you through building a RAG system from scratch!

Core content:
1. RAG technology principles and Python basic library implementation
2. The complete process from text preprocessing to generation optimization
3. 9 practical skills to help you break through the bottleneck of answer quality


Introduction

[Build the RAG core from scratch with basic Python libraries] Are you still relying on ready-made frameworks to implement RAG? This article opens the technical black box and builds a RAG system using only basic Python libraries such as numpy, reconstructing the RAG core from the ground up. From text chunking, vectorization, and similarity retrieval to generation optimization, the core logic of retrieval-augmented generation is dissected line by line, and nine practical techniques are analyzed in depth, from intelligent chunking strategies to dynamic context compression, to help you break through the bottleneck of answer quality. Stop being a mere parameter tuner; this time, thoroughly master the inner workings of RAG!

I believe everyone is familiar with RAG (Retrieval-Augmented Generation). In practice, many people use frameworks such as LangChain or FAISS to implement RAG functionality. But have you ever tried to implement a RAG system from scratch by hand?

In order to help you better understand the working principle of RAG from the bottom up, this article will take you step by step to implement a simple version of the RAG system. In this process, we will not use any complex frameworks, but only rely on the familiar Python standard library and common scientific computing libraries, such as numpy.

1. Starting from 0: Simple RAG Implementation

Before building a more complex RAG architecture, we start with the most basic version. The entire process can be divided into the following key steps:

1. Data import: Load and preprocess raw text data to prepare for subsequent processing.

2. Text Chunking: Split long text into smaller paragraphs or sentences to improve retrieval efficiency and relevance.

3. Create Embedding: Use the embedding model to convert text blocks into vector representations to facilitate semantic comparison and matching.

4. Semantic search: Based on the query content entered by the user, the most relevant text blocks are retrieved from the existing vector library.

5. Response generation: Based on the retrieved relevant content, the final answer output is generated in combination with the language model.
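Before diving into the individual steps, here is a minimal sketch of how they chain together, using the helper functions defined in the rest of this section (the PDF path and question are placeholders):

# End-to-end sketch of the five steps, using functions defined later in this section:
# extract_text_from_pdf, chunk_text, create_embeddings, semantic_search, generate_response.
pdf_path = "knowledge_base/your_document.pdf"                       # placeholder path

raw_text = extract_text_from_pdf(pdf_path)                          # 1. data import
chunks = chunk_text(raw_text, chunk_size=1000, overlap_size=100)    # 2. text chunking
chunk_embeddings = create_embeddings(chunks)                        # 3. create embeddings
relevant_chunks = semantic_search("your question", chunks, chunk_embeddings, k=2)  # 4. semantic search
answer = generate_response(                                          # 5. response generation
    system_prompt="Answer strictly based on the provided context.",
    user_message="\n\n".join(relevant_chunks) + "\n\nQuestion: your question"
)
print(answer)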

Setting up the environment

First, we need to import the necessary libraries:

import os
import json

import numpy as np
import fitz  # PyMuPDF
import dashscope
from openai import OpenAI

os.environ['DASHSCOPE_API_KEY'] = "your dashscope api key"

Extract text from PDF files

First we need a text data source. In this article, we use the PyMuPDF library to extract text from PDF files. Here is a function defined to extract text from PDF:

def extract_text_from_pdf(pdf_path):
    # Open the PDF file
    document = fitz.open(pdf_path)
    all_text = ""  # Initialize an empty string to store the extracted text

    # Traverse each page in the PDF
    for page_num in range(document.page_count):
        page = document[page_num]      # Get the page
        text = page.get_text("text")   # Extract text from the page
        all_text += text               # Append the extracted text to all_text

    return all_text  # Return the extracted text

Divide the extracted text into chunks

Once we have the extracted text, we divide it into smaller, overlapping chunks to improve retrieval accuracy.

def chunk_text(text_input, chunk_size, overlap_size):
    text_chunks = []  # Initialize a list to store text chunks

    # Loop through the text with a step size of (chunk_size - overlap_size)
    for i in range(0, len(text_input), chunk_size - overlap_size):
        text_chunks.append(text_input[i:i + chunk_size])  # Append the slice from i to i + chunk_size

    return text_chunks  # Return the list of text chunks

Setting up the OpenAI API client

Initialize the OpenAI client for generating embeddings and responses.

client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not configured the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the DashScope (Bailian) compatible-mode service
)

Extract and chunk text from PDF files

Now, we load the PDF, extract the text, and split it into chunks.

# Define the PDF file path
pdf_path = "knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf"

# Extract text from the PDF file
extracted_text = extract_text_from_pdf(pdf_path)

# Split the extracted text into chunks of 1000 characters with an overlap of 100 characters
text_chunks = chunk_text(extracted_text, 1000, 100)

# Print the number of text chunks created
print("Number of text chunks:", len(text_chunks))

# Print the first text chunk
print("\nFirst text block:")
print(text_chunks[0])
Number of text blocks: 5
First block of text:What is the intelligent coding assistant Tongyi Lingma Intelligent Coding Assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud.It provides capabilities such as intelligent code generation, intelligent question and answer, multi-file modification, and programming intelligence, bringing highefficient and smooth coding experience, leading a new paradigm of AI native R&D. At the same time, we provide enterprise customers withStandard and exclusive versions have the capabilities of enterprise-level scenario customization and private domain knowledge enhancement, helping enterprises develop intelligentChemical upgrade. Core Competencies Code Completion After training with a large amount of excellent open source code data, it can generate for you according to the current code file and the context across files.Line-level/function-level code, unit tests, code optimization suggestions, etc. Immersive coding flow, generation speed in seconds,You focus more on technical design and complete coding work efficiently. Ask Mode The intelligent question-answering model has a large amount of R&D documents, product documents, general R&D knowledge, etc., and combines engineering-level perception capabilitiesWe can help developers solve R&D problems encountered during the coding process and assist them in fixing and debugging code problems.Or run error troubleshooting, etc. Edit Mode The file editing mode has the ability to modify multiple file codes. When developers need to modify code files accurately,It can combine the requirements description with the current engineering environment to modify multiple files, and can perform multiple iterations and code reviews.Check to help developers complete code modification tasks efficiently and controllably. Agent Mode The intelligent agent mode has the ability to make autonomous decisions, perceive the environment, use tools, etc., and can be coded by the developer.Request, use engineering search, file editing, terminal and other tools, you can complete the coding task end to end.Developers configure MCP tools to make coding more in line with the developer workflow. Product Advantages • Multiple conversation modes: support question-and-answer mode, file editing mode, and intelligent model simultaneously in one conversation flowDevelopers can freely switch modes according to different scenarios and problem difficulties to achieve maximum work efficiencychange. • Automatic project perception: Based on the developer's task description, it can automatically perceive the project framework, technology stack, and requiredCode files, error messages and other project information, no need to manually add project context, task description is lighterThe code completion is more in line with the business scenarios of the current code base. • Project-level changes: Based on the developer's task description, you can independently perform task decomposition and multiple codes within the project.File modification can be done through multiple conversations, and can be done through gradual iteration or snapshot rollback, in collaboration with Tongyi Lingma.A coding task. • Memory perception: Supports autonomous memory capabilities based on large models.The Tongyi Lingma will gradually form rich memories related to developers, projects, problems, etc.The more you use it, the better we understand you. • Multiple enterprise edition plans, flexible choice: Provide enterprise standard edition, enterprise

Create embedding vectors for text blocks

Embedding vectors convert text into numerical vectors, allowing efficient similarity search. Here, Alibaba Cloud's embedding model "text-embedding-v3" is used.

# Create embedding vectors for text chunks
def create_embeddings(texts, model="text-embedding-v3"):
    """
    Take a piece of text (string) or a list of texts and return the corresponding list of embedding vectors.
    """
    if isinstance(texts, str):
        texts = [texts]  # Ensure the input is a list

    completion = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )

    # Convert the response to a dict and extract all embeddings
    data = json.loads(completion.model_dump_json())
    embeddings = [item["embedding"] for item in data["data"]]
    return embeddings

Performing semantic searches

We find the most relevant text chunks to the user query by calculating cosine similarity.
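As a quick refresher before the implementation, cosine similarity is the dot product of two vectors divided by the product of their norms; a toy numpy illustration:

import numpy as np

# Two toy 3-dimensional vectors (illustrative values only).
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])

# Cosine similarity: dot product divided by the product of the vector norms.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # 0.5 for these two vectors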

from sklearn.metrics.pairwise import cosine_similarity

# Semantic search function
def semantic_search(query, text_chunks, embeddings=None, k=2):
    """
    Find the top-k text chunks in text_chunks that are most relevant to the query.

    Parameters:
        query: the query string
        text_chunks: list of candidate text chunks
        embeddings: list of corresponding embedding vectors (if pre-computed)
        k: the number of most relevant results to return

    Returns:
        top_k_chunks: the top-k most relevant text chunks
    """
    if embeddings is None:
        embeddings = create_embeddings(text_chunks)  # Generate them automatically if not provided
    else:
        assert len(embeddings) == len(text_chunks), "embeddings and text_chunks must have the same length"

    query_embedding = create_embeddings(query)[0]  # Get the embedding of the query

    # Compute the similarity scores
    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        score = cosine_similarity([query_embedding], [chunk_embedding])[0][0]
        similarity_scores.append((i, score))

    # Sort and take the top-k
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [index for index, _ in similarity_scores[:k]]
    return [text_chunks[index] for index in top_indices]

Finally, the query operation is executed and the results are printed.

# Perform a semantic search
query = "What are the capabilities of Tongyi Lingma's intelligent agents?"
top_chunks = semantic_search(query, text_chunks, k=2)

# Output the results
print("Query:", query)
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Query: What are the intelligent capabilities of Tongyi Lingma? Context 1: What is the intelligent coding assistant Tongyi Lingma? The intelligent coding assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud. It provides capabilities such as intelligent code generation, intelligent question and answer, multi-file modification, and programming intelligent agents, bringing developers an efficient and smooth coding experience and leading a new paradigm of AI native research and development. At the same time, we provide enterprise customers with enterprise standard and exclusive versions, which have enterprise-level scene customization, private domain knowledge enhancement and other capabilities to help enterprises upgrade their research and development intelligence. Core Capabilities Code Completion Code Completion has been trained with a large amount of excellent open source code data, and can generate line-level/function-level code, unit tests, code optimization suggestions, etc. for you based on the current code file and cross-file context. Immersive coding flow and generation speed in seconds allow you to focus more on technical design and complete coding work efficiently. Intelligent Question and Answer Ask Mode Intelligent question and answer mode has a large amount of R&D documents, product documents, general R&D knowledge, etc., and combines engineering-level perception capabilities to help developers solve R&D problems encountered during the coding process, and assist developers in fixing code problems, debugging, or troubleshooting running errors. File Editing Edit Mode File editing mode has the ability to modify multiple files. When developers need to modify code files accurately, they can modify multiple files in combination with the requirement description and the current engineering environment, and can perform multiple iterations and code reviews to help developers complete code modification tasks efficiently and controllably. Agent Mode Agent mode has the capabilities of autonomous decision-making, environmental perception, and tool use. It can use tools such as engineering retrieval, file editing, and terminals according to the developer's coding requirements to complete coding tasks end-to-end. At the same time, it supports developers to configure MCP tools, so that coding is more in line with the developer's workflow. Product Advantages • Multiple session modes: Question and answer mode, file editing mode, and agent mode are supported in one session flow. Developers can freely switch modes according to different scenarios and problem difficulties to maximize work efficiency. • Automatic project perception: Based on the developer's task description, it can automatically perceive project information such as project framework, technology stack, required code files, error messages, etc., without manually adding project context, making task description easier and code completion more in line with the business scenario of the current code base. • Project-level changes: Based on the developer's task description, it can independently decompose tasks and modify multiple code files in the project. At the same time, it can gradually iterate or roll back snapshots through multiple conversations, and collaborate with Tongyi Lingma to complete coding tasks. • Memory perception: Supports autonomous memory capabilities based on large models. 
During the conversation between developers and Tongyi Lingma, Tongyi Lingma will gradually form rich memories related to developers, projects, problems, etc., and the more you use it, the better it understands you. • Multiple Enterprise Edition solutions, flexible choice: Provide Enterprise Standard Edition, Enterprise =========================================Context 2: Perception: Support autonomous memory capabilities based on large models. In the process of dialogue between developers and Tongyi Lingma, Tongyi Lingma will gradually form rich memories related to developers, projects, problems, etc., and the more you use it, the better it understands you. • Multiple Enterprise Edition solutions, flexible choice: Provide a variety of solutions for enterprise customers such as Enterprise Standard Edition and Enterprise Exclusive Edition, and provide enterprise personalized solutions, which can be flexibly selected to accelerate the large-scale implementation of intelligent R&D within the enterprise. Function introduction Interline code completion • Line-level/function-level real-time continuation: According to the current syntax and cross-file code context, it automatically perceives the current project and generates line and function-level code in real time; • Generate code by commenting: Describe the functions you want through comments, and generate code directly in the editor area, so that the coding flow is uninterrupted. Intelligent Q&A • R&D Q&A: When you encounter coding questions or technical problems, you can call up Tongyi Lingma with one click, and you can quickly get answers and solutions without leaving the IDE client. • Engineering Q&A: Through Q&A, you can quickly combine the current warehouse to understand the project, query the code, etc. At the same time, you can describe the requirements through natural language, and generate overall repair suggestions and recommended codes for simple requirements or defects in combination with the current project. • Image multimodal Q&A: Supports selecting, dragging or pasting to add images as context, automatically analyzes the image content, and generates code suggestions or problem repair suggestions based on the requirement description. • Enterprise knowledge base Q&A: Use enterprise knowledge and data to conduct Q&A, quickly build an enterprise R&D knowledge Q&A assistant, and improve the work efficiency and collaboration ability of the team. File editing • Engineering-level changes: According to the developer's task description, you can modify multiple code files in the project, and you can perform gradual iterations or snapshot rollbacks through multiple conversations. Developers and Tongyi Lingma work together to gradually complete coding tasks. • Precise editing: Complete code file modifications within the context provided by the developer, and no modifications beyond the developer's expectations will be made. • Fast execution: Strictly follow the developer's task description and the context provided to modify code files. There is no need for additional complex task planning, which completes tasks more quickly than the agent mode. • Tool usage: It has the ability to use code modification related tools such as file reading, semantic retrieval within the project, and file editing, which can help developers quickly complete code modifications. Programming agent • Project-level changes: According to the developer's task description, it can independently decompose tasks and modify multiple code files within the project. 
At the same time, it can perform step-by-step iteration or snapshot rollback through multiple conversations, and collaborate with Tongyi Lingma to complete coding tasks. • Automatic project perception: According to the developer's task description, it can automatically perceive project information such as project framework, technology stack, required code files, error messages, etc., without manually adding project context, making task description easier. • Tool usage: You can use more than a dozen built-in programming tools independently, such as reading and writing files, generation === ...

Generate a response based on the retrieved chunk

# Initialize the DashScope client (using Alibaba Cloud Tongyi Qianwen)
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # Make sure the environment variable is set in advance
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

# Set the system prompt
SYSTEM_PROMPT = (
    "You are an AI assistant and must answer strictly based on the context provided. "
    "If the answer cannot be directly inferred from the provided context, respond with: "
    "'I can't answer this question based on the information available.'"
)

def generate_response(system_prompt, user_message, model="qwen-max"):
    """
    Generate a context-based answer using DashScope's Qwen (Tongyi Qianwen) model.

    Parameters:
        system_prompt (str): system prompt that controls the AI's behavior
        user_message (str): the question and context entered by the user
        model (str): the model name to use; the default is qwen-max

    Returns:
        str: the answer content generated by the model
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,   # Set the temperature to 0 for deterministic output
        max_tokens=512,    # Maximum output length; adjust as needed
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )

    return response.choices[0].message.content.strip()
# Example top_chunks (assume this is what semantic_search returned)
top_chunks = [
    "Tongyi Lingma is an AI-based intelligent programming assistant.",
    "File editing capabilities include auto-completion, error fixing, and code refactoring."
]

query = "What are the capabilities of Tongyi Lingma's intelligent agents?"

# Build the user prompt (context + question)
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}" for i, chunk in enumerate(top_chunks)])
user_prompt += f"\n\nQuestion: {query}"

# Generate the AI answer
answer = generate_response(SYSTEM_PROMPT, user_prompt)

# Output the result
print("AI answer:")
print(answer)
AI answer:
The intelligent capabilities of Tongyi Lingma include the following aspects:

- **Autonomous decision-making**: the ability to independently decompose tasks based on the developer's coding needs.
- **Environmental awareness**: it can automatically perceive project information such as the project framework, technology stack, required code files, and error messages, without manually adding project context.
- **Tool usage**: the ability to independently use more than ten built-in programming tools, such as reading and writing files and editing code.
- **End-to-end completion of coding tasks**: based on the developer's needs, it uses a variety of tools such as project retrieval, file editing, and the terminal to complete coding tasks from start to finish.
- **Support for configuring MCP tools**: makes the coding process better fit the developer's personal workflow.

Evaluating the answer

# Define the system prompt for the evaluation system
evaluate_system_prompt = (
    "You are an intelligent evaluation system responsible for evaluating the quality of the AI assistant's answers. "
    "If the AI assistant's answer is very close to the reference answer, give 1 point; "
    "if the answer is wrong or irrelevant to the reference answer, give 0 points; "
    "if the answer partially matches, give 0.5 points. "
    "Please output the rating directly: 0, 0.5 or 1."
)

# Build the evaluation prompt and get the score
# Note: ideal_answer is the reference answer prepared in advance; define it before running this step.
evaluation_prompt = f"""User question: {query}
AI answer: {answer}
Reference answer: {ideal_answer}
Please rate according to the following criteria:
- If the AI answer is very close to the reference answer -> output 1
- If the answer is wrong or irrelevant -> output 0
- If it partially matches -> output 0.5"""

evaluation_result = generate_response(evaluate_system_prompt, evaluation_prompt)

# Output the final score
print("AI answer score:", evaluation_result)
AI answer rating: 1

2. Semantic-based text segmentation

In RAG, text chunking is a crucial step. Its core function is to divide a large continuous text into multiple smaller paragraphs with semantic integrity, thereby improving the accuracy and overall effect of information retrieval.

Traditional chunking methods usually use a fixed-length segmentation strategy, such as segmenting every 500 characters or every several sentences. Although this method is simple to implement, it is easy to split complete semantic units in practical applications, which affects subsequent information retrieval and understanding.

In contrast, a smarter chunking method is semantic chunking. It no longer mechanically divides based on the number of words or sentences, but instead determines the appropriate segmentation position by analyzing the content similarity between sentences. When a significant semantic difference is detected between adjacent sentences, the text is split at that position to form a new semantic paragraph.

How to determine the split point

In order to find the appropriate semantic segmentation point, we can use the following common statistical methods:

1. Percentile method: find the Xth percentile of the similarity values between all adjacent sentences and split at the positions where the similarity drops below this threshold (a short numeric sketch follows this list).

2. Standard deviation method: split at the positions where the similarity falls below the mean minus X times the standard deviation.

3. Interquartile range (IQR) method: use the difference between the upper and lower quartiles (Q3 - Q1) to identify positions with unusually large drops and treat them as potential split points.
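For intuition, here is a minimal numeric sketch of the percentile idea, mirroring the compute_breakpoints function shown later; the similarity values are made up, and the 25th percentile is chosen purely for illustration:

import numpy as np

# Toy similarity scores between consecutive sentences (hypothetical values).
similarities = [0.82, 0.79, 0.31, 0.85, 0.77, 0.28, 0.80]

# Take a percentile of the similarities as the threshold and treat every
# position that falls below it as a breakpoint.
threshold_value = np.percentile(similarities, 25)
breakpoints = [i for i, s in enumerate(similarities) if s < threshold_value]
print(threshold_value, breakpoints)  # ~0.54, [2, 5]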

Practical application examples

In this practice, we use the **percentile method** to perform semantic segmentation and test its segmentation effect on a sample text.

Creating sentence-level embeddings

First, a piece of original text is preliminarily segmented into sentences, and then a corresponding vector representation (Embedding) is generated for each sentence to facilitate the subsequent calculation of the semantic similarity between sentences.

# Initialize the client
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # If you have not configured the environment variable, replace this with your API key
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"  # base_url of the DashScope (Bailian) compatible-mode service
)

# Create embedding vectors for text chunks
def create_embeddings(texts, model="text-embedding-v3"):
    """
    Take a piece of text (string) or a list of texts and return the corresponding list of embedding vectors.
    """
    if isinstance(texts, str):
        texts = [texts]  # Ensure the input is a list

    completion = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )

    # Convert the response to a dict and extract all embeddings
    data = json.loads(completion.model_dump_json())
    embeddings = [item["embedding"] for item in data["data"]]
    return embeddings

# Roughly split the text into sentences based on periods
sentences = extracted_text.split(".")
# Remove empty strings and leading/trailing spaces
sentences = [sentence.strip() for sentence in sentences if sentence.strip()]

# Generate embedding vectors for all sentences in one batch (recommended)
embeddings = create_embeddings(sentences)

print(f"Successfully generated embedding vectors for {len(embeddings)} sentences.")

Successfully generated embedding vectors for 5 sentences.

Calculate similarity difference

We calculate the cosine similarity between consecutive sentences to measure their semantic proximity.

import numpy as np

def cosine_similarity(vec1, vec2):
    """
    Compute the cosine similarity between two vectors.

    Parameters:
        vec1 (np.ndarray): the first vector.
        vec2 (np.ndarray): the second vector.

    Returns:
        float: the cosine similarity.

    Raises:
        ValueError: if the input vectors are not 1-D arrays or their shapes do not match.
    """
    if vec1.ndim != 1 or vec2.ndim != 1:
        raise ValueError("Input vectors must be one-dimensional arrays")
    if vec1.shape[0] != vec2.shape[0]:
        raise ValueError("Input vectors must have the same dimensions")

    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compute the similarity between consecutive sentences
similarities = [
    cosine_similarity(np.array(embeddings[i]), np.array(embeddings[i + 1]))
    for i in range(len(embeddings) - 1)
]

Implementing semantic chunking

We implemented three different methods to identify breakpoints in the text, that is, to determine where to split a piece of text into multiple meaningful paragraphs.

The core idea of ​​these methods is to determine the segmentation position based on the change in semantic similarity between sentences . When a large semantic difference is detected between consecutive sentences, it is considered to be a potential paragraph dividing point.

def compute_breakpoints(similarity_scores, method="percentile", threshold=90):
    """
    Compute segmentation breakpoints based on drops in similarity.

    Parameters:
        similarity_scores (List[float]): list of similarities between consecutive sentences.
        method (str): threshold calculation method, one of 'percentile', 'standard_deviation', or 'interquartile'.
        threshold (float): threshold value (used by the percentile and standard deviation methods).

    Returns:
        List[int]: the index positions where the text should be split.
    """
    # Determine the threshold value according to the selected method
    if method == "percentile":
        # Use the similarity value at the given percentile as the threshold
        threshold_value = np.percentile(similarity_scores, threshold)
    elif method == "standard_deviation":
        # Compute the mean and standard deviation, and set the threshold to the mean minus X standard deviations
        mean = np.mean(similarity_scores)
        std_dev = np.std(similarity_scores)
        threshold_value = mean - (threshold * std_dev)
    elif method == "interquartile":
        # Use the interquartile range (IQR) rule to determine the outlier threshold
        q1, q3 = np.percentile(similarity_scores, [25, 75])
        iqr = q3 - q1
        threshold_value = q1 - 1.5 * iqr
    else:
        # Raise an error if the method is invalid
        raise ValueError("Invalid method. Choose 'percentile', 'standard_deviation', or 'interquartile'.")

    # Find the positions where the similarity is below the threshold, i.e. the segmentation breakpoints
    return [i for i, score in enumerate(similarity_scores) if score < threshold_value]

# Use the percentile method with the 90th percentile as the threshold to compute breakpoints
breakpoints = compute_breakpoints(similarity_scores=similarities, method="percentile", threshold=90)

Split text into semantic chunks

Next, we divide the text according to its semantic content based on the calculated breakpoints. In the previous step, we have identified some potential breakpoints by analyzing the changes in semantic similarity between sentences. Now, we will use these positions to divide the original text into multiple paragraphs with clear semantic boundaries, also known as "semantic chunks".
def split_into_chunks(sentence_list, break_indices):
    """
    Divide the sentence list into semantic paragraphs based on the breakpoint indices.

    Parameters:
        sentence_list (List[str]): list of sentences.
        break_indices (List[int]): list of index positions where the text should be split.

    Returns:
        List[str]: the list of resulting semantic paragraphs.
    """
    semantic_chunks = []      # Store the resulting paragraphs
    current_start_index = 0   # Start index of the current segment

    # Iterate over each breakpoint to create paragraphs
    for bp in break_indices:
        # Join the sentences from the current start index up to the breakpoint and end with a period
        semantic_chunks.append(". ".join(sentence_list[current_start_index:bp + 1]) + ".")
        current_start_index = bp + 1  # Move the start index to the next sentence

    # Add the last paragraph (the remaining sentences)
    semantic_chunks.append(". ".join(sentence_list[current_start_index:]))

    return semantic_chunks  # Return the list of semantic paragraphs

# Use the split_into_chunks function to generate paragraphs
text_chunks = split_into_chunks(sentence_list=sentences, break_indices=breakpoints)

# Print the number of generated paragraphs
print(f"Number of semantic paragraphs generated: {len(text_chunks)}")

# Print the first paragraph to verify the result
print("\nFirst semantic paragraph:")
print(text_chunks[0])
Number of semantic paragraphs generated: 4
The first semantic paragraph:What is the intelligent coding assistant Tongyi Lingma Intelligent Coding Assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud.It provides capabilities such as intelligent code generation, intelligent question and answer, multi-file modification, and programming intelligence, bringing highEfficient and smooth coding experience, leading a new paradigm of AI native R&D. At the same time, we provide enterprise customers withStandard and exclusive versions have the capabilities of enterprise-level scenario customization and private domain knowledge enhancement, helping enterprises develop intelligentChemical upgrade.

Creating embedding vectors for semantic chunks

After completing the semantic segmentation of the text, we need to generate an embedding vector for each semantic chunk to facilitate subsequent retrieval and use.

# Create embedding vectors for text chunks
def create_embeddings(texts, model="text-embedding-v3"):
    """
    Take a piece of text (string) or a list of texts and return the corresponding list of embedding vectors.
    """
    if isinstance(texts, str):
        texts = [texts]  # Ensure the input is a list

    completion = client.embeddings.create(
        model=model,
        input=texts,
        encoding_format="float"
    )

    # Convert the response to a dict and extract all embeddings
    data = json.loads(completion.model_dump_json())
    embeddings = [item["embedding"] for item in data["data"]]
    return embeddings
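With the function in place, a hedged one-liner creates the embeddings for the semantic paragraphs produced by split_into_chunks in the previous step:

# Embed every semantic paragraph produced in the previous step
chunk_embeddings = create_embeddings(text_chunks)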

Conduct semantic search

We use cosine similarity to retrieve the chunks that are most relevant to the query.

from sklearn.metrics.pairwise import cosine_similarity

# Semantic search function
def semantic_search(query, text_chunks, embeddings=None, k=2):
    """
    Find the top-k text chunks in text_chunks that are most relevant to the query.

    Parameters:
        query: the query string
        text_chunks: list of candidate text chunks
        embeddings: list of corresponding embedding vectors (if pre-computed)
        k: the number of most relevant results to return

    Returns:
        top_k_chunks: the top-k most relevant text chunks
    """
    if embeddings is None:
        embeddings = create_embeddings(text_chunks)  # Generate them automatically if not provided
    else:
        assert len(embeddings) == len(text_chunks), "embeddings and text_chunks must have the same length"

    query_embedding = create_embeddings(query)[0]  # Get the embedding of the query

    # Compute the similarity scores
    similarity_scores = []
    for i, chunk_embedding in enumerate(embeddings):
        score = cosine_similarity([query_embedding], [chunk_embedding])[0][0]
        similarity_scores.append((i, score))

    # Sort and take the top-k
    similarity_scores.sort(key=lambda x: x[1], reverse=True)
    top_indices = [index for index, _ in similarity_scores[:k]]
    return [text_chunks[index] for index in top_indices]
# Perform a semantic search
query = "What is the intelligent coding assistant Tongyi Lingma?"
top_chunks = semantic_search(query, text_chunks, k=2)

# Output the results
print("Query:", query)
for i, chunk in enumerate(top_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")
Query: What is the intelligent coding assistant Tongyi Lingma Context 1: What is the intelligent coding assistant Tongyi Lingma Intelligent coding assistant Tongyi Lingma (abbreviated as Tongyi Lingma) is an intelligent coding assistance tool provided by Alibaba Cloud, providing capabilities such as intelligent code generation, intelligent question and answer, multi-file modification, and programming intelligence, bringing developers an efficient and smooth coding experience and leading a new paradigm of AI native research and development. At the same time, we provide enterprise customers with enterprise standard version and exclusive version, which have the capabilities of enterprise-level scene customization and private domain knowledge enhancement to help enterprises upgrade their research and development intelligence.======================================Context 2: Core Capabilities Code Completion Code Completion After training with a large amount of excellent open source code data, it can generate line-level/function-level code, unit tests, code optimization suggestions, etc. for you according to the context of the current code file and across files.=========================================

Generate a response based on the retrieved text block

After completing the semantic search and finding the chunks of text that are most relevant to the user's query, the next step is to generate answers based on these retrieval results .

import os
from openai import OpenAI

# Initialize the DashScope client (using Alibaba Cloud Tongyi Qianwen)
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),  # Make sure the environment variable is set in advance
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1"
)

# Set the system prompt
SYSTEM_PROMPT = (
    "You are an AI assistant and must answer strictly based on the context provided. "
    "If the answer cannot be directly inferred from the provided context, respond with: "
    "'I can't answer this question based on the information available.'"
)

def generate_response(system_prompt, user_message, model="qwen-max"):
    """
    Generate a context-based answer using DashScope's Qwen (Tongyi Qianwen) model.

    Parameters:
        system_prompt (str): system prompt that controls the AI's behavior
        user_message (str): the question and context entered by the user
        model (str): the model name to use; the default is qwen-max

    Returns:
        str: the answer content generated by the model
    """
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,   # Set the temperature to 0 for deterministic output
        max_tokens=512,    # Maximum output length; adjust as needed
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ]
    )

    return response.choices[0].message.content.strip()
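For completeness, a hedged sketch of feeding the semantically retrieved chunks into generate_response, mirroring the pattern used in the first section (top_chunks and query come from the semantic search above):

# Build the user prompt from the retrieved semantic chunks (same pattern as in section 1)
user_prompt = "\n".join([f"Context {i + 1}:\n{chunk}" for i, chunk in enumerate(top_chunks)])
user_prompt += f"\n\nQuestion: {query}"

# Generate and print the answer
answer = generate_response(SYSTEM_PROMPT, user_prompt)
print("AI answer:")
print(answer)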


3. Introducing “Context Enhanced Retrieval” in RAG

The traditional approach has an obvious problem: it only returns isolated blocks of text that lack context, which sometimes causes the AI to obtain incomplete information, resulting in incorrect answers or incomplete content.

To solve this problem, we proposed a new method called  "Context-Enriched Retrieval" . Its core idea is: not only to find the most relevant text block, but also to return the previous and next text blocks at the same time, to help AI better understand the context, so as to generate more accurate and complete answers.

The whole process mainly includes the following steps:

1. Data Ingestion Extract the original text content from the PDF file.

2. Chunking with Overlapping Context: Chunking a large paragraph into multiple small chunks, but each chunk has some overlap with the previous and next chunks. This ensures that even if a sentence is split between two chunks, the full context can be seen in one chunk.

3. Create Embedding Vectors (Embedding Creation): convert each text block into a numerical representation (an "embedding vector") to facilitate subsequent similarity calculations. You can think of it as giving each text block a "semantic label" so that semantically similar content can be found quickly.

4. Context-Aware Retrieval When a user asks a question, the system will not only find the most relevant text block, but also return the text blocks before and after it. This way, AI can obtain richer background information when answering questions and avoid taking things out of context.

5. Response Generation Use large language models (such as Llama, ChatGLM, etc.) to generate natural and accurate responses based on search results that include context. Just like when you are taking an exam, you can flip through the book to find the answer, and you can also see the content before and after that page, so you can naturally answer more accurately.

6. Evaluation Finally, we will evaluate the AI's answer to determine whether the introduction of context has improved the accuracy and completeness of the answer. For example, we can use manual scoring or let another AI evaluate the quality of the answer.

Implementing context-aware semantic search

It is an improvement on the original semantic search: during the retrieval process, not only the most relevant text block is returned, but also its adjacent previous and next text blocks, thus providing more complete and context-supported information.

def context_enriched_search(search_query, chunked_texts, chunk_embeddings, top_k=1, context_window_size=1):
    """
    Return not only the most relevant paragraph but also the paragraphs before and after it,
    so that richer background information is available.

    Parameters:
        search_query (str): the user's query.
        chunked_texts (List[str]): list of chunked text paragraphs.
        chunk_embeddings (List[List[float]]): embedding vector of each text paragraph.
        top_k (int): number of relevant paragraphs to retrieve (only the top 1 is used here to locate the central paragraph).
        context_window_size (int): number of neighboring paragraphs to include on each side.

    Returns:
        List[str]: the most relevant paragraph together with its surrounding context paragraphs.
    """
    # Step 1: Convert the user's question into a vector (embedding) so it can be compared with the paragraphs
    query_embedding = create_embeddings(search_query)[0]

    similarity_list = []  # Stores (index, similarity score) pairs for each paragraph

    # Step 2: Iterate over all paragraph vectors and compute their cosine similarity with the query vector
    for i, chunk_embedding in enumerate(chunk_embeddings):
        # The closer the score is to 1, the more similar the paragraph is to the query
        similarity_score = cosine_similarity(np.array(query_embedding), np.array(chunk_embedding))
        # Save the paragraph index and its similarity, e.g. (0, 0.75)
        similarity_list.append((i, similarity_score))

    # Step 3: Sort all paragraphs by similarity, from high to low
    similarity_list.sort(key=lambda x: x[1], reverse=True)

    # Step 4: Get the index of the most relevant paragraph
    most_relevant_index = similarity_list[0][0]

    # Step 5: Determine the context range to extract (the central paragraph plus context_window_size paragraphs on each side)
    start_index = max(0, most_relevant_index - context_window_size)                     # Do not go past the beginning
    end_index = min(len(chunked_texts), most_relevant_index + context_window_size + 1)  # Do not go past the end

    # Step 6: Return the list of paragraphs including the context
    return [chunked_texts[i] for i in range(start_index, end_index)]
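A hedged usage sketch (text_chunks comes from the earlier chunking step; the query is illustrative):

# Embed the chunks once, then retrieve the best match plus one neighbor on each side.
chunk_embeddings = create_embeddings(text_chunks)

query = "What is the intelligent coding assistant Tongyi Lingma?"
context_chunks = context_enriched_search(query, text_chunks, chunk_embeddings,
                                         top_k=1, context_window_size=1)

for i, chunk in enumerate(context_chunks):
    print(f"Context {i + 1}:\n{chunk}\n=====================================")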


4. Add Contextual Chunk Headers (CCH)

RAG improves the factual accuracy of language models by retrieving relevant information from external knowledge bases before generating answers. However, in traditional text chunking methods, important contextual information is often lost, resulting in poor retrieval results or even causing the model to generate out-of-context answers.

To solve this problem, we introduced an improved method: Contextual Chunk Headers (CCH) . The core idea of ​​this method is: when dividing the text into small chunks (chunks), the high-level context information (such as document title, chapter title, etc.) to which the content belongs is added to the beginning of each text chunk before embedding and retrieval. This allows each text chunk to carry its background information, helping the model to better understand which part it belongs to, thereby improving the relevance of the retrieval and avoiding the model from generating wrong answers based on out-of-context content.

The steps in this method are as follows:

1. Data Ingestion: Load and preprocess raw text data.

2. Chunking with Contextual Headers: automatically identify chapter titles in the document and add them to the front of the corresponding paragraphs to form text blocks with context. For example:

# Chapter 3: Basic technologies of artificial intelligence
The core methods of artificial intelligence include machine learning, deep learning, and natural language processing...

3. Create embedding vectors (Embedding Creation) to convert these text blocks with contextual information into digital form (i.e., embedding vectors) for subsequent semantic search.

4. Semantic Search When a user asks a question, the system will find the most relevant content based on these enhanced text blocks.

5. Response Generation uses large language models (such as Llama, ChatGLM, etc.) to generate natural and accurate responses based on retrieval results.

6. Evaluation: Evaluate the AI’s answers through a scoring system to check whether adding contextual titles improves the accuracy and relevance of the answers.


Chunking text using contextual headings

In order to improve the effect of information retrieval, we use a large language model (LLM) to automatically generate a descriptive title (Header) for each text block and add it in front of the text block.

def generate_chunk_header(text_chunk, model_name="qwen-max"):
    """
    Use a large language model (LLM) to generate a title/summary for the given text paragraph.

    Parameters:
        text_chunk (str): the text paragraph for which a title needs to be generated.
        model_name (str): the name of the language model used to generate the title. Defaults to "qwen-max".

    Returns:
        str: the title or summary generated by the model.
    """
    # System prompt that guides the AI's behavior
    header_system_prompt = "Please generate a concise and informative title for the following text."

    # Call the LLM to generate a response based on the system prompt and the input text
    llm_response = client.chat.completions.create(
        model=model_name,
        temperature=0,
        messages=[
            {"role": "system", "content": header_system_prompt},
            {"role": "user", "content": text_chunk}
        ]
    )

    # Extract the generated content and strip surrounding whitespace
    return llm_response.choices[0].message.content.strip()
def chunk_text_with_headers(input_text, chunk_size, overlap_size):
    """
    Split the input text into smaller paragraphs and generate a heading for each paragraph.

    Parameters:
        input_text (str): the complete text to be split.
        chunk_size (int): the size of each paragraph (number of characters).
        overlap_size (int): the number of overlapping characters between adjacent paragraphs.

    Returns:
        List[dict]: a list of dictionaries with the keys 'header' and 'text', holding each paragraph's title and content.
    """
    text_chunks = []  # Initialize an empty list to store text paragraphs with their titles

    # Iterate over the text using the specified paragraph size and overlap
    for start_index in range(0, len(input_text), chunk_size - overlap_size):
        current_chunk = input_text[start_index:start_index + chunk_size]     # Extract the current paragraph
        chunk_header = generate_chunk_header(current_chunk)                  # Generate a title with the LLM
        text_chunks.append({"header": chunk_header, "text": current_chunk})  # Store the title and content together

    return text_chunks  # Return the list of paragraphs with titles and content
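A hedged usage sketch (reusing extracted_text from the first section; the chunk sizes are illustrative):

# Chunk the PDF text and attach an LLM-generated header to each chunk.
text_chunks = chunk_text_with_headers(extracted_text, chunk_size=1000, overlap_size=200)

print("First chunk header:", text_chunks[0]["header"])
print("First chunk text (first 200 characters):", text_chunks[0]["text"][:200])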

Create embedding vectors for the title and body text

In order to improve the accuracy of information retrieval, we not only generate embeddings for the main text content , but also generate embedding vectors for the header in front of each text block.

from tqdm import tqdm

# Generate embedding vectors for each text chunk
chunk_embeddings = []  # Initialize an empty list to store dictionaries with titles, texts, and their embedding vectors

# Iterate over each text chunk and generate embedding vectors (with a progress bar)
for current_chunk in tqdm(text_chunks, desc="Generating embedding vectors"):
    # Generate the embedding vector for the text content of the current chunk
    text_embedding = create_embeddings(current_chunk["text"])[0]

    # Generate the embedding vector for the title of the current chunk
    header_embedding = create_embeddings(current_chunk["header"])[0]

    # Store the title, text, and corresponding embedding vectors in the list
    chunk_embeddings.append({
        "header": current_chunk["header"],
        "text": current_chunk["text"],
        "embedding": text_embedding,
        "header_embedding": header_embedding
    })

Semantic Search

import numpy as np

def _calculate_similarity(query_vec, chunk_vec):
    """
    Compute the cosine similarity between the query vector and a chunk vector.

    Parameters:
        query_vec (np.array): the embedding vector of the query.
        chunk_vec (np.array): the embedding vector of the text chunk.

    Returns:
        float: the cosine similarity.
    """
    # Uses the 1-D cosine_similarity helper defined in the semantic chunking section
    return cosine_similarity(np.array(query_vec), np.array(chunk_vec))

def semantic_search(query, chunks, top_k=5):
    """
    Search for the most relevant text chunks based on the query semantics.

    Parameters:
        query (str): the query entered by the user.
        chunks (List[dict]): list of text chunks with their embedding vectors.
        top_k (int): the number of most relevant results to return.

    Returns:
        List[dict]: the top_k most relevant text chunks.
    """
    # Generate the vector representation of the query
    query_vector = create_embeddings(query)[0]

    # Initialize a list to store each text chunk and its similarity
    chunk_similarity_pairs = []

    # Iterate over each text chunk and compute its similarity to the query
    for chunk in chunks:
        text_vector = chunk["embedding"]           # Embedding vector of the text content
        header_vector = chunk["header_embedding"]  # Embedding vector of the title

        # Compute the similarity between the query and the text and the title, then take the average
        similarity_text = _calculate_similarity(query_vector, text_vector)
        similarity_header = _calculate_similarity(query_vector, header_vector)
        avg_similarity = (similarity_text + similarity_header) / 2

        # Store the text chunk and its average similarity
        chunk_similarity_pairs.append((chunk, avg_similarity))

    # Sort by similarity, from high to low
    chunk_similarity_pairs.sort(key=lambda pair: pair[1], reverse=True)

    # Return the top-k most relevant text chunks
    return [pair[0] for pair in chunk_similarity_pairs[:top_k]]
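A hedged usage sketch (chunk_embeddings is the list built in the previous step; the query is illustrative):

# Retrieve the two chunks whose averaged text/header similarity to the query is highest.
query = "What are the core capabilities of Tongyi Lingma?"
top_results = semantic_search(query, chunk_embeddings, top_k=2)

for i, result in enumerate(top_results):
    print(f"Result {i + 1} - header: {result['header']}")
    print(result["text"][:200])
    print("=====================================")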


5. RAG based on question generation

This section enhances the document content by introducing question generation in the document processing stage.

We generate relevant questions for each text block, thereby improving the effectiveness of information retrieval, and ultimately helping the language model generate more accurate and relevant answers.

The core idea of ​​this method is: In traditional RAG (Retrieval-Augmented Generation), we usually only embed text blocks and store them in the vector library. In this improved method, we also automatically generate some related questions for each text block and embed these questions as well. In this way, when users ask questions, the system can better understand which text blocks are most relevant to the questions, thereby improving the retrieval effect and answer quality.

The implementation steps are as follows:

1. Data Ingestion Extract the original text content from the PDF file.

2. Chunking: Chunking large text into small chunks for easy processing. Each chunk usually contains about 200 to 300 words.

3. Question Generation uses a large language model (LLM) to automatically generate several questions related to each text block. For example, if you input a piece of content about "machine learning", the output may be:

  • “What is machine learning?”

  • “What are some common algorithms for machine learning?”

  • “What is the relationship between machine learning and artificial intelligence?”

4. Create embedding vectors (Embedding Creation) Generate embedding vectors (i.e. convert them into digital representations) for each text block and its corresponding question for semantic matching.

5. Vector Store Creation Use NumPy to build a simple vector database to store the embedded vectors of all text blocks and questions.

6. Semantic Search When a user asks a question, the system will first look for the generated questions that are most similar to his question, and then find the corresponding text block as context.

7. Response Generation Based on the retrieved relevant text blocks, the language model generates natural and accurate responses.

8. Evaluation Finally, we will score the generated answers to evaluate whether this enhanced RAG improves the quality and accuracy of the answers.

Generate questions for text blocks

Automatically generate related questions for each block of text - questions that can be answered by looking at the text.

import re

def _extract_questions_from_response(response_text):
    """
    Extract questions ending with a question mark from the text returned by the model.

    Parameters:
        response_text (str): the raw text returned by the model.

    Returns:
        List[str]: the cleaned list of valid questions.
    """
    questions = []
    for line in response_text.split('\n'):
        cleaned_line = line.strip()  # Remove leading and trailing spaces
        if cleaned_line:
            # Remove any numbering prefixes (such as "1.", "2)", "•", "-", etc.)
            cleaned_line = re.sub(r'^[\d\-•\*]+\s*[\.\)]?\s*', '', cleaned_line)
            # Check whether the line contains a question mark (half-width or full-width)
            if '?' in cleaned_line or '？' in cleaned_line:
                # Normalize the ending so every question ends with a single question mark
                question = cleaned_line.rstrip('?').rstrip('？') + '?'
                questions.append(question)
    return questions

def generate_questions(text, question_count=5, model="qwen-max"):
    """
    Generate answerable questions based on the provided block of text.

    Parameters:
        text (str): the text content for which questions need to be generated.
        question_count (int): the number of questions to generate.
        model (str): the name of the language model used to generate the questions.

    Returns:
        List[str]: the generated list of questions.
    """
    # System instruction: defines the behavior rules for the AI
    system_instruction = (
        "You are an expert at generating relevant questions from text. "
        "Please use only the text provided to create concise questions that focus on key information and concepts."
    )

    # User request template: provides the concrete task and format requirements
    user_request = f"""
    Please generate {question_count} different questions based on the following text. These questions must be answerable from the text:

    {text}

    Please output the questions as a numbered list, without adding anything else.
    """

    # Call the large model API to generate the questions
    response = client.chat.completions.create(
        model=model,
        temperature=0.7,
        messages=[
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": user_request}
        ]
    )

    # Extract the raw response content and strip surrounding whitespace
    raw_questions_text = response.choices[0].message.content.strip()

    # Use the helper function to extract and filter valid questions
    filtered_questions = _extract_questions_from_response(raw_questions_text)

    return filtered_questions
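A hedged usage sketch (assuming text_chunks is a list of plain-text chunks as produced by chunk_text; the generated questions vary from run to run):

# Generate three candidate questions for the first text chunk.
sample_questions = generate_questions(text_chunks[0], question_count=3)
for q in sample_questions:
    print("-", q)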

Building a simple vector repository

We will use NumPy to implement a simple vector store.

import numpy as np
from typing import List, Dict, Optional

class SimpleVectorStore:
    """
    A simple NumPy-based vector store implementation.
    """

    def __init__(self):
        """
        Initialize the vector store with lists for vectors, texts, and metadata.
        """
        self.vectors: List[np.ndarray] = []   # Stores the vectors
        self.texts: List[str] = []            # Stores the original texts
        self.metadata_list: List[Dict] = []   # Stores the metadata

    def add_item(self, text: str, vector: List[float], metadata: Optional[Dict] = None):
        """
        Add an entry to the vector store.

        Parameters:
            text (str): the original text content.
            vector (List[float]): the embedding vector representation.
            metadata (Dict, optional): optional metadata.
        """
        self.vectors.append(np.array(vector))
        self.texts.append(text)
        self.metadata_list.append(metadata or {})

    def similarity_search(self, query_vector: List[float], top_k: int = 5) -> List[Dict]:
        """
        Find the top_k most similar records in the store for the given query vector.

        Parameters:
            query_vector (List[float]): the query vector.
            top_k (int): the number of results to return.

        Returns:
            List[Dict]: a list of dictionaries with the similar texts, metadata, and similarity scores.
        """
        if not self.vectors:
            return []

        # Convert the query vector to a numpy array
        query_array = np.array(query_vector)

        # Compute the cosine similarity between each stored vector and the query vector
        similarities = []
        for idx, vector in enumerate(self.vectors):
            similarity = np.dot(query_array, vector) / (
                np.linalg.norm(query_array) * np.linalg.norm(vector)
            )
            similarities.append((idx, similarity))

        # Sort by similarity in descending order
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Build and return the results
        results = []
        for i in range(min(top_k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata_list[idx],
                "similarity_score": float(score)
            })

        return results

Using Question Enhancement to Process Documents

Now, we put all the previous steps together to fully process the document : generating relevant questions for the text block, creating embeddings, and building an augmented vector store .

from tqdm import tqdm

def process_document(pdf_path, chunk_size=1000, chunk_overlap=200, questions_per_chunk=5):
    """
    Process a document and generate question augmentations.

    Parameters:
        pdf_path (str): PDF file path.
        chunk_size (int): Number of characters per text chunk.
        chunk_overlap (int): Number of overlapping characters between chunks.
        questions_per_chunk (int): Number of questions to generate per chunk.

    Returns:
        Tuple[List[str], SimpleVectorStore]: Processed text chunks and the vector store.
    """
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)

    print("Splitting text into chunks...")
    text_chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"A total of {len(text_chunks)} text chunks were created")

    vector_store = SimpleVectorStore()

    print("Processing each text chunk and generating related questions...")
    for idx, chunk in enumerate(tqdm(text_chunks, desc="Processing text chunks")):
        # Generate an embedding for the current text chunk
        chunk_embedding_response = create_embeddings(chunk)
        chunk_embedding = chunk_embedding_response.data[0].embedding

        # Add the text chunk to the vector store
        vector_store.add_item(
            text=chunk,
            vector=chunk_embedding,
            metadata={"type": "chunk", "index": idx}
        )

        # Generate several questions for the current text chunk
        questions = generate_questions(chunk, question_count=questions_per_chunk)

        # Generate an embedding for each question and add it to the vector store
        for q_idx, question in enumerate(questions):
            question_embedding_response = create_embeddings(question)
            question_embedding = question_embedding_response.data[0].embedding

            # Add the question to the vector store
            vector_store.add_item(
                text=question,
                vector=question_embedding,
                metadata={"type": "question", "chunk_index": idx, "original_chunk": chunk}
            )

    return text_chunks, vector_store

Extracting and processing documents

# Define the PDF file path
pdf_file_path = "knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf"

# Process the document (extract text, generate chunks, create questions, build the vector store)
text_chunks, vector_store = process_document(
    pdf_file_path,
    chunk_size=1000,
    chunk_overlap=100,
    questions_per_chunk=3
)

# Print the number of entries in the vector store
print(f"The vector store contains {len(vector_store.texts)} entries")

Querying on the enhanced vector library

import json

search_query = 'Which company is the intelligent coding assistance tool Tongyi Lingma produced by?'

# Find related content using semantic search
search_results = semantic_search(search_query, vector_store, k=5)

print("Query content: ", search_query)
print("\nSearch results:")

# Group results by type
document_chunks = []
matched_questions = []

for result in search_results:
    if result["metadata"]["type"] == "chunk":
        document_chunks.append(result)
    else:
        matched_questions.append(result)

# Print document chunks
print("\nRelated document chunks:")
for index, result in enumerate(document_chunks):
    print(f"Context {index + 1} (similarity: {result['similarity']:.4f}):")
    print(result["text"][:300] + "...")
    print("=====================================")

# Print matched questions
print("\nMatched questions:")
for index, result in enumerate(matched_questions):
    print(f"Question {index + 1} (similarity: {result['similarity']:.4f}):")
    print(result["text"])
    chunk_index = result["metadata"]["chunk_index"]
    print(f"From chunk {chunk_index}")
    print("=====================================")

Example output:

Query content: Which company is the intelligent coding assistance tool Tongyi Lingma produced by?

Matched questions:
Question 1 (similarity: 0.9770): Which company provides Tongyi Lingma, an intelligent coding assistance tool, and what capabilities does it mainly provide?
From chunk 0
=====================================
Question 2 (similarity: 0.8629): In addition to individual developers, what versions and special services does Tongyi Lingma provide for enterprises?
From chunk 0
=====================================
Question 3 (similarity: 0.8108): What enterprise plans does Tongyi Lingma provide for customers to choose from?
From chunk 1
=====================================
Question 4 (similarity: 0.8078): How does Tongyi Lingma's code completion function work, and what kind of code suggestions can it generate for developers?
From chunk 0
=====================================

6. Query Rewriting

This section implements three query transformations to improve the information retrieval performance of the Retrieval-Augmented Generation (RAG) system.

Core objectives:

By modifying or expanding the user's original query , it helps the system understand the user's intent more accurately and find more relevant information from the vector library.

Three query conversion techniques

1. Query Rewriting

Make users' questions more specific and detailed to improve the accuracy of retrieval.

Example:

  • User's original question: "What is AI?"

  • Rewritten question: "What is the definition of artificial intelligence and what are its core technologies?"

✅ Improvement: Make the search more precise and avoid overly broad results.


2. Step-back Prompting

Generate a broader, higher-level question that captures more context and helps the system better understand the context.

Example:

  • User's original question: "What are the applications of deep learning in the medical field?"

  • Step-back question: "What are the applications of artificial intelligence in the healthcare industry?"

✅ Improvement points: Helps to find important background knowledge that is related to the question but not a direct match.


3. Sub-query Decomposition

Split a complex question into multiple simpler small questions , search them separately, and finally combine all the results to provide a more comprehensive answer.

Example:

  • User's original question: "Compare the advantages, disadvantages and application scenarios of machine learning and deep learning."

  • Disassembled into:

  • “What is machine learning?”

  • “What is deep learning?”

  • “What are the advantages and disadvantages of machine learning?”

  • “What are the advantages and disadvantages of deep learning?”

  • “What scenarios are they suitable for?”

✅ Improvement points: Make sure to cover all aspects of the problem to avoid missing key information.

Implementing query transformation technology

1. Query Rewriting

In many cases, the questions asked by users may be vague or brief, such as:

“What is AI?”

Although this question is clear, it is not specific enough, and the system may return overly broad or irrelevant results when searching the vector library.

The role of query rewriting is to generate a clearer and more detailed version based on the intent of the original question, helping the system find more relevant information.

def rewrite_query(original_query, model="qwen-max"):
    """
    Rewrite the user query to make it more specific and detailed, improving retrieval results.

    Parameters:
        original_query (str): The user's original query.
        model (str): Model name used for rewriting.

    Returns:
        str: The optimized query.
    """
    # System prompt: guide the assistant's behavior
    system_prompt = "You are an AI assistant that is good at optimizing search queries. Your task is to rewrite the user's query to be more specific, detailed, and helpful for retrieving relevant information."

    # User prompt: provide the original query to be rewritten
    user_prompt = f"""
    Please rephrase the following query to make it more specific and include relevant terms and concepts that will help find the right results.

    Original query: {original_query}

    Rewritten query:
    """

    # Generate the rewritten query using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # Temperature 0 for stable output
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    # Return the rewritten query with leading/trailing whitespace removed
    return response.choices[0].message.content.strip()

2. Step-back Prompting

The goal is to generate broader, higher-level questions that can be used to retrieve contextual information relevant to the user's question.


What is a "step-back question"?

In many cases, the questions asked by users can be very specific, such as:

“What are the applications of deep learning in medical imaging diagnosis?”

Although this question is clear, it is too focused and may cause the system to only retrieve very local information while ignoring important context.

The idea of "step-back prompting" is:

First, take a step back and ask a broader question, such as:

“What are the applications of artificial intelligence in the medical industry?”

This allows the system to acquire some overall background knowledge first, helping it to better understand the context of the current question, thereby improving the accuracy and completeness of the final answer.

def generate_step_back_query(original_query, model="qwen-max"):
    """
    Generate a more general "step back" query to obtain broader contextual information.

    Parameters:
        original_query (str): The user's original query.
        model (str): Name of the model used to generate the query.

    Returns:
        str: A broader context query.
    """
    # System prompt: guide the assistant's behavior
    system_prompt = "You are an AI assistant that excels at search strategies. Your task is to expand specific queries into more general forms to obtain relevant contextual information."

    # User prompt: provide the original query to be generalized
    user_prompt = f"""
    Please generate a broader, more general version of the specific query below to gain useful context.

    Original query: {original_query}

    Step-back query:
    """

    # Generate a broader query using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0.1,  # Slightly higher temperature to increase diversity
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    # Return the generated query with leading/trailing whitespace removed
    return response.choices[0].message.content.strip()

3. Sub-query Decomposition

The goal is to split complex user questions into multiple simpler and more specific sub-questions, thereby achieving more comprehensive information retrieval.


What is subquery decomposition?

When a user asks a complex or multi-part question, such as:

“Please compare the principles, advantages and disadvantages, and application scenarios of machine learning and deep learning.”

If you search directly using this question, it may be difficult for the system to find an exact match, resulting in incomplete information or inaccurate answers.

The idea of subquery decomposition is:

Break the question into smaller, more manageable parts, search each separately, and then combine all the results to generate a complete answer.

Implementation:

def decompose_query(original_query, num_subqueries=4, model="qwen-max"):
    """
    Break a complex query into simpler sub-queries.

    Parameters:
        original_query (str): The original complex query.
        num_subqueries (int): Number of sub-queries to generate.
        model (str): Model name used for decomposition.

    Returns:
        List[str]: The list of sub-queries.
    """
    # System prompt: guide the assistant's behavior
    system_prompt = "You are an AI assistant that is good at breaking down complex questions. Your task is to break a complex query into multiple simpler questions whose answers together form the answer to the original question."

    # User prompt: provide the original query to be decomposed
    user_prompt = f"""
    Decompose the following complex query into {num_subqueries} simpler sub-queries. Each sub-query should focus on a different aspect of the original question.

    Original query: {original_query}

    Please output the results in the following format:
    1. [First sub-query]
    2. [Second sub-query]
    ...
    """

    # Generate sub-queries using the specified model
    response = client.chat.completions.create(
        model=model,
        temperature=0.2,  # Slightly higher temperature to increase diversity
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ]
    )

    # Extract the response content
    content = response.choices[0].message.content.strip()

    # Split the response content by line
    lines = content.split("\n")
    sub_queries = []

    # Parse each line and extract the sub-query after its number
    for line in lines:
        if line.strip() and any(line.strip().startswith(f"{i}.") for i in range(1, 10)):
            query = line.strip()
            query = query[query.find(".") + 1:].strip()  # Drop the numbering and keep the actual content
            sub_queries.append(query)

    return sub_queries
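As a quick sanity check, the three transformations can be tried side by side. The snippet below is only a minimal usage sketch: it assumes the `client` from the setup section is available, and the example query is purely illustrative.

# Try all three query transformations on one illustrative question
example_query = "Compare the advantages, disadvantages and application scenarios of machine learning and deep learning."

# 1. Query rewriting: a more specific version of the question
print("Rewritten:", rewrite_query(example_query))

# 2. Step-back prompting: a broader question for background context
print("Step-back:", generate_step_back_query(example_query))

# 3. Sub-query decomposition: several simpler questions
for i, sub_query in enumerate(decompose_query(example_query, num_subqueries=4), start=1):
    print(f"Sub-query {i}: {sub_query}")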

7. Reranking

Reranking is a second round of screening and optimization based on the initial search results , with the aim of ensuring that the content ultimately used to generate the answer is the most relevant and accurate part .

In traditional semantic search, we usually use vector similarity (such as cosine similarity) to find the most relevant text blocks. But this "preliminary search" is not always perfect, and sometimes returns some content that seems relevant but actually does not match.

The role of reordering is:

✅ Further screening in the initial search results; 

✅ Re-score content using a more accurate relevance scoring model;

✅ Re-rank by actual relevance;

✅ Keep only the most relevant documents for subsequent answer generation.

The core process of reordering

1. Initial Retrieval

  • Use basic semantic similarity search (such as vector matching) to quickly obtain a batch of candidate text blocks;

  • This step is fast, but has limited accuracy.

2. Document Scoring

  • Perform a deeper relevance assessment on each retrieved document;

  • You can use specialized reranking models (such as BERT reranker, ColBERT, Cross-Encoder, etc.) to score based on the semantic relationship between user queries and document content;

  • This approach provides a better understanding of “sentence-level” relevance than simple vector matching.

3. Reordering

  • Re-rank all candidate documents according to the scoring results;

  • The most relevant ones are placed at the front, while the least relevant ones are placed at the back or eliminated.

4. Content Selection

  • Only the top-ranked documents are selected as context to provide to the language model;

  • Avoid introducing noise information and improve the accuracy and reliability of answers.

For example:

Suppose a user asks: "What are the main applications of deep learning?"

A preliminary search might return the following three paragraphs:

1. “Deep learning is widely used in image recognition and natural language processing.”

2. “Machine learning can be divided into two types: supervised learning and unsupervised learning.”

3. “Convolutional neural networks are a type of deep learning model commonly used for image classification.”

By reordering, we can determine:

  • Item 1 is highly relevant ✅

  • Item 2 is not very relevant ❌

  • Item 3 is partially relevant ✅

So we only keep items 1 and 3 as context to generate the final answer.

Reranking based on large models

def rerank_with_llm(query, search_results, top_n=3, model="qwen-max"):
    """
    Rerank search results using an LLM.

    Parameters:
        query (str): User query.
        search_results (List[Dict]): Initial search results; each element contains document text, metadata, and a similarity score.
        top_n (int): Number of documents to return after reranking.
        model (str): Name of the LLM used for scoring.

    Returns:
        List[Dict]: Documents sorted by relevance score.
    """
    print(f"Reranking {len(search_results)} documents...")

    scored_results = []  # Store results with relevance scores

    # System prompt that tells the LLM how to score
    system_prompt = """You are an expert in assessing the relevance of documents to search queries.
Your task is to score documents between 0 and 10 based on how well they answer the given query.

Scoring guidelines:
- 0-2 points: the document is completely irrelevant
- 3-5 points: the document has some relevant information but does not directly answer the query
- 6-8 points: the document is relevant and partially answers the query
- 9-10 points: the document is highly relevant and directly answers the query

Please output only an integer score (0 to 10), with no other content."""

    # Iterate over each search result
    for idx, result in enumerate(search_results):
        if idx % 5 == 0:
            print(f"Scoring document {idx + 1}/{len(search_results)}...")

        # Build the user prompt containing the query and the document content
        user_prompt = f"""Query: {query}

Document content:
{result['text']}

Please rate the relevance of this document to the query above (0-10):"""

        # Call the LLM API to get the score
        response = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        )

        # Extract the score text
        score_text = response.choices[0].message.content.strip()

        # Use a regular expression to extract the numeric score
        score_match = re.search(r'\b(10|[0-9])\b', score_text)
        if score_match:
            relevance_score = float(score_match.group(1))
        else:
            # If no score can be extracted, fall back to the similarity score
            print(f"Warning: unable to extract score from response: '{score_text}', using similarity score instead")
            relevance_score = result["similarity"] * 10

        # Append the scored result
        scored_results.append({
            "text": result["text"],
            "metadata": result["metadata"],
            "similarity": result["similarity"],
            "relevance_score": relevance_score
        })

    # Sort results in descending order by relevance score
    reranked_results = sorted(scored_results, key=lambda x: x["relevance_score"], reverse=True)

    # Return the top_n results
    return reranked_results[:top_n]

Reranking based on keywords

def rerank_with_keywords(query, doc_results, top_n=3):
    """
    A simple re-ranking method based on keyword matching and position.

    Parameters:
        query (str): The user's question.
        doc_results (List[Dict]): Initial search results; each dictionary contains text and other metadata.
        top_n (int): Number of results to return after reranking (default: top 3).

    Returns:
        List[Dict]: Results re-sorted by relevance, keeping only the top_n.
    """
    def extract_keywords(text):
        """Extract keywords (words longer than 3 characters) from text."""
        return [word.lower() for word in text.split() if len(word) > 3]

    # Extract important keywords from the user's question
    keywords = extract_keywords(query)

    ranked_docs = []  # List to store the scored documents

    for doc_result in doc_results:
        document_text = doc_result["text"].lower()  # Lowercase the document text for comparison

        # The base score starts from vector similarity, weighted by 0.5 so it is not the only factor
        base_score = doc_result["similarity"] * 0.5

        # Initialize the keyword score
        keyword_score = 0
        for keyword in keywords:
            if keyword in document_text:
                # If a keyword is found, add 0.1 points
                keyword_score += 0.1

                # If the keyword appears in the first quarter of the document, it is more likely
                # to answer the question directly, so add another 0.1 points
                first_position = document_text.find(keyword)
                if first_position < len(document_text) / 4:
                    keyword_score += 0.1

                # Add points based on keyword frequency, capped at 0.2 points
                frequency = document_text.count(keyword)
                keyword_score += min(0.05 * frequency, 0.2)

        # Final score: base score plus keyword score
        final_score = base_score + keyword_score

        # Store this document with its score
        ranked_docs.append({
            "text": doc_result["text"],
            "metadata": doc_result["metadata"],
            "similarity": doc_result["similarity"],
            "relevance_score": final_score
        })

    # Sort all documents in descending order by the final relevance score
    reranked_docs = sorted(ranked_docs, key=lambda x: x["relevance_score"], reverse=True)

    # Return the top_n highest-scoring documents
    return reranked_docs[:top_n]

Complete RAG flow with reordering

So far, we have implemented the core modules in the RAG process, including:

  • Document Processing

  • Question Answering

  • Reranking

Now, we integrate these modules together to build a complete RAG system flow .

def rag_with_reranking(query, vector_store, reranking_method="llm", top_n=3, model="qwen-max"):
    """
    Full RAG pipeline including reranking.

    Parameters:
        query (str): User query.
        vector_store (SimpleVectorStore): Vector store.
        reranking_method (str): Reranking method ('llm' or 'keywords').
        top_n (int): Number of results to keep after reranking.
        model (str): Model used to generate the answer.

    Returns:
        Dict: Result dictionary containing the query, context, and answer.
    """
    # Create the query embedding
    query_embedding = create_embeddings(query)

    # Initial retrieval (fetch more results than needed so reranking has candidates to work with)
    initial_results = vector_store.similarity_search(query_embedding, k=10)

    # Apply reranking
    if reranking_method == "llm":
        reranked_results = rerank_with_llm(query, initial_results, top_n=top_n)
    elif reranking_method == "keywords":
        reranked_results = rerank_with_keywords(query, initial_results, top_n=top_n)
    else:
        # No reranking: use the top_n results of the initial retrieval directly
        reranked_results = initial_results[:top_n]

    # Merge the reranked context
    context = "\n\n===\n\n".join([result["text"] for result in reranked_results])

    # Generate the answer based on the context
    response = generate_response(query, context, model)

    return {
        "query": query,
        "reranking_method": reranking_method,
        "initial_results": initial_results[:top_n],
        "reranked_results": reranked_results,
        "context": context,
        "response": response
    }
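A minimal usage sketch follows. It assumes a vector_store built by the earlier document-processing step; the query string is just the running example from this article.

# Run the reranked pipeline on the running example query
result = rag_with_reranking(
    query="Which company is the intelligent coding assistance tool Tongyi Lingma produced by?",
    vector_store=vector_store,
    reranking_method="llm",   # or "keywords" for the cheaper heuristic
    top_n=3
)
print(result["response"])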


8. Relevant Segment Extraction (RSE) for Enhancing RAG

Different from the traditional approach of simply retrieving multiple isolated blocks of text, our goal is to identify and reconstruct continuous text fragments to provide more complete and logical context information for the language model.

Core Concept:

In documents, text blocks related to user questions often appear in the same area or in consecutive paragraphs . If we can identify the connections between these related text blocks and organize them into a coherent whole paragraph in sequence , we can significantly improve the language model's ability to understand the context.

Why use RSE?

Problems with traditional RAG:

  • The search results consist of multiple unconnected text blocks;

  • Transitions and background information may be missing between blocks;

  • This makes it difficult for the language model to understand and even causes quotations to be taken out of context.

The advantages of RSE are:

✅ Combining related text blocks into continuous paragraphs;

✅ Preserving the original text structure and semantic coherence;

✅ Providing a more natural and complete context to the language model;

✅ Improving the accuracy and fluency of the final answer.

The implementation steps of RSE are:

1. Preliminary search

Use semantic search to find the text blocks that are most relevant to the user's question from the vector library.

2. Position sorting

If the text blocks in the original document have numbering or position information (such as page numbers, paragraph order), we can reorder the search results based on this information.

3. Cluster analysis

Analyze which text blocks are close to each other and have similar semantics in the original text, and group them together to form "related paragraph clusters".

4. Paragraph reconstruction

The text blocks belonging to the same cluster are stitched together to form a complete context paragraph. If necessary, the adjacent preceding and following content can be added to enhance the context coherence.

5. Input language model

The reconstructed continuous paragraphs are used as context and provided to the large language model (LLM) to generate the final answer.

Example description:

Suppose a user asks: "What are the applications of deep learning?"

Traditional RAG might return the following three isolated blocks of text:

1. “Deep learning is widely used in image recognition.”

2. “It is also used for speech recognition and natural language processing.”

3. “It also has important applications in the field of autonomous driving.”

With RSE, we can concatenate these three blocks into one paragraph in the original order:

"Deep learning is widely used in image recognition. It is also used in speech recognition and natural language processing, and has important applications in the field of autonomous driving."

This way, the language model is better able to understand the overall meaning rather than processing several independent sentences separately.

Creating a simple vector database

import numpy as np

class SimpleVectorStore:
    """
    A lightweight vector store implementation using NumPy.
    """
    def __init__(self, dimension=1536):
        """
        Initialize the vector store.

        Parameters:
            dimension (int): Dimension of the embedding vectors.
        """
        self.dimension = dimension
        self.vectors = []        # Store vector data
        self.documents = []      # Store document content
        self.metadata_list = []  # Store metadata

    def add_documents(self, documents, vectors=None, metadata_list=None):
        """
        Add documents to the vector store.

        Parameters:
            documents (List[str]): List of document chunks.
            vectors (List[List[float]], optional): List of embedding vectors.
            metadata_list (List[Dict], optional): List of metadata dictionaries.
        """
        if vectors is None:
            vectors = [None] * len(documents)
        if metadata_list is None:
            metadata_list = [{} for _ in range(len(documents))]

        for doc, vec, metadata in zip(documents, vectors, metadata_list):
            self.documents.append(doc)
            self.vectors.append(vec)
            self.metadata_list.append(metadata)

    def search(self, query_vector, top_k=5):
        """
        Search for the most similar documents.

        Parameters:
            query_vector (List[float]): Query embedding vector.
            top_k (int): Number of results to return.

        Returns:
            List[Dict]: Results containing documents, similarity scores, and metadata.
        """
        if not self.vectors or not self.documents:
            return []

        # Convert the query vector to a NumPy array
        query_array = np.array(query_vector)

        # Compute cosine similarity against every stored vector
        similarities = []
        for index, vector in enumerate(self.vectors):
            if vector is not None:
                similarity = np.dot(query_array, vector) / (
                    np.linalg.norm(query_array) * np.linalg.norm(vector)
                )
                similarities.append((index, similarity))

        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)

        # Collect the top-k results
        results = []
        for index, score in similarities[:top_k]:
            results.append({
                "document": self.documents[index],
                "score": float(score),
                "metadata": self.metadata_list[index]
            })

        return results

Processing documents using RSE

Now, let’s implement the core functionality of Relevant Segment Extraction (RSE)  .

For example:

Imagine you are looking for a book in a library and the administrator gives you a list of dozens of potentially related books.

You start reading one book at a time and find that some of them do indeed tell you what you want to know, while some of them only seem relevant in title but have completely different contents.

So you rate each book:

  • Very relevant content: 90 points

  • Somewhat related: 60 points

  • Irrelevant: a 20-point penalty → 40 points or even a negative score

In the end, you only picked the ones with high scores to read.

This function does just that -  picks out the most useful information .

from typing import List, Dict, Tuple

def process_document(pdf_path: str, chunk_size: int = 800) -> Tuple[List[str], SimpleVectorStore, Dict]:
    """
    Process a document for use with RSE (Relevant Segment Extraction).

    Parameters:
        pdf_path (str): Path to the PDF document.
        chunk_size (int): Size of each text chunk (in characters).

    Returns:
        Tuple[List[str], SimpleVectorStore, Dict]: The text chunks, the vector store instance, and document info.
    """
    print("Extracting text from document...")
    # Extract text content from the PDF file
    document_text = extract_text_from_pdf(pdf_path)

    print("Splitting text into non-overlapping chunks...")
    # Split the extracted text into non-overlapping chunks
    text_chunks = chunk_text(document_text, chunk_size=chunk_size, overlap=0)
    print(f"A total of {len(text_chunks)} text chunks were created")

    print("Generating embedding vectors for text chunks...")
    # Generate embedding vectors for each text chunk
    chunk_embeddings = create_embeddings(text_chunks)

    # Create a SimpleVectorStore instance to hold the vectors
    vector_store = SimpleVectorStore()

    # Add the documents with metadata (including the chunk index for later reconstruction)
    metadata_list = [{"chunk_index": index, "source": pdf_path} for index in range(len(text_chunks))]
    vector_store.add_documents(text_chunks, chunk_embeddings, metadata_list)

    # Record the original document structure for later stitching
    document_info = {
        "chunks": text_chunks,
        "source": pdf_path,
    }

    return text_chunks, vector_store, document_info

✅ Summary

The chunk-scoring step implemented in the next subsection does several things:

1. Convert the user's question into a vector;

2. Find which text blocks in the vector library are most similar to this question;

3. Score each text block, the more relevant the text, the higher the score;

4. Deduct points (penalize) irrelevant text blocks to make them easier to ignore;

5. Return a list to tell the system: "Which of these chunks are important and which are not important."

RSE core algorithm: Calculate the value of text blocks and find the best paragraphs

Now that we have document processing capabilities and the ability to generate embedding vectors for each block of text, we can now start implementing  the core algorithm of RSE (Relevant Paragraph Extraction) .

def calculate_chunk_values(query: str, chunks: List[str], vector_store, irrelevant_chunk_penalty: float = 0.2) -> List[float]:
    """
    Calculate the value of each document chunk by combining its relevance score with a penalty for irrelevant chunks.

    Parameters:
        query (str): The user's query text.
        chunks (List[str]): The list of text chunks after splitting the document.
        vector_store: The vector store containing the chunk embeddings.
        irrelevant_chunk_penalty (float): Penalty applied to irrelevant chunks (default 0.2).

    Returns:
        List[float]: The value of each document chunk (a float).
    """
    # Convert the user query into an embedding vector for semantic matching
    query_embedding = create_embeddings([query])[0]

    # Get the total number of chunks and search all of them for similarity
    total_chunks = len(chunks)
    search_results = vector_store.search(query_embedding, top_k=total_chunks)

    # Build a mapping from chunk_index to relevance score
    relevance_scores = {
        result["metadata"]["chunk_index"]: result["score"]
        for result in search_results
    }

    # Compute the value from the relevance score, applying the penalty for irrelevant chunks
    chunk_values = []
    for i in range(total_chunks):
        score = relevance_scores.get(i, 0.0)
        value = score - irrelevant_chunk_penalty
        chunk_values.append(value)

    return chunk_values

Let's take an everyday example:

Imagine you are looking at the table of contents of a book, and each chapter has an "importance score" in front of it. You want to:

1. Pick a few chapters to read ;

2. Read a maximum of 20 sections per chapter ;

3. No more than 30 sections in total ;

4. Each piece of content must be valuable (score greater than 0.2) ;

Then you would try different starting points: "What about reading 5 sections starting from section 3?", "What about 3 sections starting from section 10?"... and finally pick the few segments that look the most interesting and worth reading.
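The pipeline later in this section calls a find_best_segments helper without listing it, so here is a minimal sketch of how such a greedy segment search could look, under the constraints from the example above (segments of at most max_segment_length chunks, at most total_max_length chunks overall, and only segments whose total value exceeds min_segment_value). This is an illustrative implementation, not necessarily the author's exact one.

def find_best_segments(chunk_values, max_segment_length=20, total_max_length=30, min_segment_value=0.2):
    """
    Greedily pick non-overlapping (start, end) chunk ranges with the highest total value.
    Sketch only: the exact selection strategy in the original article may differ.

    Returns:
        Tuple[List[Tuple[int, int]], List[float]]: Selected segment ranges and their scores.
    """
    n = len(chunk_values)
    candidates = []

    # Enumerate every contiguous segment up to max_segment_length and score it
    for start in range(n):
        for end in range(start + 1, min(start + max_segment_length, n) + 1):
            segment_value = sum(chunk_values[start:end])
            if segment_value > min_segment_value:
                candidates.append((segment_value, start, end))

    # Greedily take the best-scoring, non-overlapping segments within the total length budget
    candidates.sort(reverse=True)
    selected, scores, used_chunks = [], [], 0
    taken = set()
    for value, start, end in candidates:
        if used_chunks + (end - start) > total_max_length:
            continue
        if any(i in taken for i in range(start, end)):
            continue
        selected.append((start, end))
        scores.append(value)
        taken.update(range(start, end))
        used_chunks += end - start

    # Return segments in document order so reconstruction reads naturally
    order = sorted(range(len(selected)), key=lambda i: selected[i][0])
    return [selected[i] for i in order], [scores[i] for i in order]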

Rebuilding and using paragraphs in RAG

def reconstruct_segments(document_chunks: List[str], best_segment_indices: List[Tuple[int, int]]) -> List[Dict]:
    """
    Reconstruct text segments from the best segment indices.

    Parameters:
        document_chunks (List[str]): All text chunks of the original document.
        best_segment_indices (List[Tuple[int, int]]): List of (start, end) indices of the best segments.

    Returns:
        List[Dict]: Dictionaries containing the reconstructed segments and their ranges.
    """
    reconstructed_segments = []

    for start_idx, end_idx in best_segment_indices:
        segment_text = " ".join(document_chunks[start_idx:end_idx])
        reconstructed_segments.append({
            "text": segment_text,
            "segment_range": (start_idx, end_idx),
        })

    return reconstructed_segments


def format_segments_for_context(segments: List[Dict]) -> str:
    """
    Format text segments into a context string usable by the language model.

    Parameters:
        segments (List[Dict]): Dictionaries containing segment text and index ranges.

    Returns:
        str: The formatted context text.
    """
    context_lines = []

    for index, segment in enumerate(segments):
        header = f"SEGMENT {index + 1} (Chunks {segment['segment_range'][0]}-{segment['segment_range'][1] - 1}):"
        context_lines.append(header)
        context_lines.append(segment["text"])
        context_lines.append("-" * 80)

    return "\n\n".join(context_lines)

For example

Suppose the input is these two paragraphs:

segments = [
    {
        "segment_range": [2, 5],
        "text": "Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior."
    },
    {
        "segment_range": [7, 9],
        "text": "Deep learning is a special type of machine learning method that is particularly good at processing image and speech data."
    }
]

Then the output will be like this:

SEGMENT 1 (Chunks 2-4):
Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior.
--------------------------------------------------------------------------------

SEGMENT 2 (Chunks 7-8):
Deep learning is a special type of machine learning method that is particularly good at processing image and speech data.
--------------------------------------------------------------------------------

Complete pipeline

def rag_with_rse(pdf_path: str, query: str, chunk_size: int = 800, penalty: float = 0.2) -> Dict:
    """
    Complete RAG pipeline using the Relevant Segment Extraction (RSE) strategy to keep only the most useful document content.

    Parameters:
        pdf_path (str): PDF document path.
        query (str): User query.
        chunk_size (int): Text chunk size.
        penalty (float): Penalty coefficient for irrelevant chunks.

    Returns:
        Dict: A dictionary containing the query, selected segments, and generated answer.
    """
    print("\n=== Starting the RAG pipeline with relevant segment extraction ===")
    print(f"Query content: {query}")

    # Step 1: Process the document and build the vector store
    text_chunks, vector_store, doc_info = process_document(pdf_path, chunk_size)

    # Step 2: Compute the relevance score and value of each text chunk
    print("\nCalculating chunk relevance scores and values...")
    chunk_values = calculate_chunk_values(query, text_chunks, vector_store, penalty)

    # Step 3: Select the best segments based on chunk values
    best_segments, scores = find_best_segments(
        chunk_values=chunk_values,
        max_segment_length=20,
        total_max_length=30,
        min_segment_value=0.2
    )

    # Step 4: Reconstruct the best segments
    print("\nReconstructing the best text segments...")
    selected_segments = reconstruct_segments(text_chunks, best_segments)

    # Step 5: Format the context for the large model
    formatted_context = format_segments_for_context(selected_segments)

    # Step 6: Call the large model to generate the final response
    response = generate_response(query, formatted_context)

    # Assemble the output
    result = {
        "query": query,
        "segments": selected_segments,
        "response": response
    }

    print("\n=== The final reply is as follows ===")
    print(response)

    return result
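A minimal usage sketch, reusing the knowledge-base PDF from earlier; the query is the running example and purely illustrative.

# Run the RSE-based pipeline end to end
rse_result = rag_with_rse(
    pdf_path="knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf",
    query="Which company is the intelligent coding assistance tool Tongyi Lingma produced by?",
    chunk_size=800,
    penalty=0.2
)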


9. Context Compression Technology: Improving RAG System Efficiency

We will filter and compress the retrieved text blocks to keep only the most relevant content , thereby:

✅ Reduce noise information;

✅ Improve the accuracy and relevance of language model answers;

✅ Make more efficient use of limited context windows.

Background

When using the RAG system for document retrieval, we often get some text blocks containing mixed content :

  • Some sentences are relevant to the user’s question;

  • Some sentences are completely irrelevant or are just background information.

For example:

“Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior. Many AI systems rely on big data for training. Deep learning is a special type of machine learning method.”

If the user's question is: "What is artificial intelligence?" then only the first sentence is the most relevant, and the rest of the content, although correct, is irrelevant to the current question.

The goal of context compression

What we are going to do is:

✅  Remove irrelevant sentences or paragraphs ;

✅Only  keep information that is highly relevant to the user’s query ;

✅  Maximize the “useful information density” in the context window .

This allows the language model to focus more on key content and avoid being distracted by irrelevant information, thereby improving the quality of the final answer.

Implementation ideas

We will implement a simple context compression process from scratch, which mainly includes the following steps:

1. Analyze relevance sentence by sentence

Each text block is split into sentences, and a semantic model (such as BERT, Sentence-BERT, etc.) is used to calculate the relevance score between each sentence and the user query.

Example (pseudocode):

scores = [model.similarity(query_embedding, model.encode(sentence)) for sentence in sentences]

2. Set threshold or select Top-K sentences

We can choose one of two strategies to filter sentences:

✅ Keep sentences with scores above a certain threshold;

✅ Or keep the top K sentences with the highest scores.

3. Reconstruct the compressed context

Reassemble the filtered sentences into a new, more compact contextual paragraph in their original order.
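Putting the three steps above together, here is a minimal sentence-level compression sketch. It assumes the create_embeddings helper returns one embedding per input text (as in the RSE section) and uses a simple Top-K strategy; compress_by_sentence is an illustrative helper name, and the LLM-based implementation shown later in this section is an alternative to it.

import re
import numpy as np

def compress_by_sentence(chunk: str, query: str, top_k: int = 3) -> str:
    """
    Keep only the top_k sentences of a chunk that are most similar to the query.
    Sketch only: assumes create_embeddings(list_of_texts) returns one embedding per text.
    """
    # 1. Split the chunk into sentences (a simple punctuation-based split)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?。！？])\s+', chunk) if s.strip()]
    if not sentences:
        return chunk

    # 2. Score each sentence against the query with cosine similarity
    query_vec = np.array(create_embeddings([query])[0])
    sentence_vecs = [np.array(v) for v in create_embeddings(sentences)]
    scores = [
        float(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
        for v in sentence_vecs
    ]

    # 3. Keep the top_k sentences and restore their original order
    top_indices = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_k])
    return " ".join(sentences[i] for i in top_indices)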

Example Demo

Original text block:

“Artificial intelligence is a branch of computer science that aims to enable machines to simulate human intelligent behavior. Many AI systems rely on big data for training. Deep learning is a special type of machine learning method.”

User Question:

“What is artificial intelligence?”

Contents retained after compression:

“Artificial intelligence is a branch of computer science that seeks to enable machines to simulate intelligent human behavior.”

Implementing context compression

This is the core part of our approach. We will use a large language model to filter and compress the retrieved content , thus retaining the information most relevant to the user’s question.

def compress_chunk(chunk: str, query: str, compression_type: str = "selective", model: str = "qwen-max") -> Tuple[str, float]:
    """
    Compress a retrieved text chunk, keeping only the parts relevant to the query.

    Parameters:
        chunk (str): The text chunk to compress.
        query (str): The user query.
        compression_type (str): Compression method ("selective", "summary", or "extraction").
        model (str): Name of the LLM used.

    Returns:
        Tuple[str, float]: The compressed text chunk and the compression ratio (percentage).
    """
    # Build the system prompt according to the compression type
    if compression_type == "selective":
        system_prompt = """You are an expert at filtering information.
        Your task is to analyze the document snippet and extract the sentences or paragraphs that are directly related to the user query. Remove all irrelevant content.

        Output requirements:
        1. Include only the text that helps answer the question
        2. Keep the original wording of relevant sentences (do not paraphrase)
        3. Maintain the original order
        4. Include all relevant content, even if it appears repetitive
        5. Exclude any text that is not relevant to the question

        Please output in plain text format without adding additional explanations."""
    elif compression_type == "summary":
        system_prompt = """You are a summarization expert.
        Your task is to succinctly summarize the given document snippet, focusing only on the information relevant to the user's query.

        Output requirements:
        1. Be concise but cover all relevant content
        2. Focus on information relevant to the query
        3. Ignore irrelevant details
        4. Write in a neutral, objective tone

        Please output in plain text format without adding additional explanations."""
    else:  # extraction
        system_prompt = """You are an information extraction expert.
        Your task is to extract the exact sentences from the document snippet that contain information relevant to the user's query.

        Output requirements:
        1. Include only relevant sentences from the original text
        2. Keep the original sentences unchanged (do not modify them)
        3. Include only sentences directly related to the question
        4. Separate sentences with line breaks
        5. Do not add any comments or other content

        Please output in plain text format without adding additional explanations."""

    # Build the user prompt
    user_prompt = f"""
        Query: {query}

        Document snippet:
        {chunk}

        Extract the content relevant to the query.
    """

    # Call the large model API to perform the compression
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    # Get the compressed content
    compressed_content = response.choices[0].message.content.strip()

    # Calculate the compression ratio
    original_length = len(chunk)
    compressed_length = len(compressed_content)
    compression_ratio = (original_length - compressed_length) / original_length * 100

    return compressed_content, compression_ratio
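The pipeline below calls a batch_compress_chunks helper that is not listed in the article; a minimal sketch would simply apply compress_chunk to each retrieved chunk in turn.

def batch_compress_chunks(chunks, query, compression_type="selective", model="qwen-max"):
    """
    Compress a list of retrieved chunks one by one.
    Sketch only: returns a list of (compressed_text, compression_ratio) tuples, mirroring compress_chunk's output.
    """
    results = []
    for i, chunk in enumerate(chunks):
        print(f"Compressing chunk {i + 1}/{len(chunks)}...")
        results.append(compress_chunk(chunk, query, compression_type, model))
    return results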

Complete pipeline

def rag_with_compression(pdf_path: str, query: str, k: int = 10, compression_type: str = "selective", model: str = "qwen-max") -> Dict:
    """
    Complete RAG pipeline using the context compression strategy to reduce the input length.

    Parameters:
        pdf_path (str): PDF file path.
        query (str): User query.
        k (int): Number of text chunks to retrieve initially.
        compression_type (str): Compression method.
        model (str): Name of the large model used.

    Returns:
        Dict: A dictionary containing the query, the content before and after compression, the response, etc.
    """
    print("\n=== Context compression RAG pipeline starts ===")
    print(f"Query content: {query}")
    print(f"Compression type: {compression_type}")

    # Load the document and create the vector store
    vector_store = process_document(pdf_path)

    # Create the query embedding
    query_embedding = create_embeddings(query)

    # Retrieve the top k most relevant text chunks
    print(f"Retrieving the top {k} related text chunks...")
    results = vector_store.similarity_search(query_embedding, k=k)
    retrieved_chunks = [result["text"] for result in results]

    # Compress each text chunk
    compressed_results = batch_compress_chunks(retrieved_chunks, query, compression_type, model)
    compressed_chunks = [result[0] for result in compressed_results]
    compression_ratios = [result[1] for result in compressed_results]

    # Filter out empty compressed chunks
    valid_compressed_data = [(chunk, ratio) for chunk, ratio in zip(compressed_chunks, compression_ratios) if chunk.strip()]

    if not valid_compressed_data:
        # Fall back to the original text chunks when everything was compressed to empty strings
        print("Warning: all text chunks were compressed to nothing. The original text chunks will be used.")
        valid_compressed_data = [(chunk, 0.0) for chunk in retrieved_chunks]

    # Unpack after the fallback so both branches feed the same variables
    compressed_chunks, compression_ratios = zip(*valid_compressed_data)

    # Build the context
    context = "\n\n---\n\n".join(compressed_chunks)

    # Generate the final response
    print("Generating a reply based on the compressed text chunks...")
    response = generate_response(query, context, model)

    # Assemble the results
    result = {
        "query": query,
        "original_chunks": retrieved_chunks,
        "compressed_chunks": compressed_chunks,
        "compression_ratios": compression_ratios,
        "context_length_reduction": f"{sum(compression_ratios) / len(compression_ratios):.2f}%",
        "response": response
    }

    print("\n=== Final reply ===")
    print(response)

    return result
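A minimal usage sketch, again reusing the knowledge-base PDF from earlier; the query is illustrative.

# Run the compression-based pipeline end to end
compression_result = rag_with_compression(
    pdf_path="knowledge_base/Intelligent Coding Assistant Tongyi Lingma.pdf",
    query="Which company is the intelligent coding assistance tool Tongyi Lingma produced by?",
    k=10,
    compression_type="selective"
)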

10. Feedback Loop in RAG

In this section, I will implement a RAG system with a feedback mechanism that can continuously optimize itself over time. By collecting and integrating user feedback, it can:

✅ Learn which responses are effective and which need improvement;

✅ Continuously improve the relevance of search results and the quality of answers;

✅ Get “smarter” with every interaction.

Limitations of Traditional RAG Systems

Traditional RAG systems are static

They retrieve information based only on vector similarity and do not learn from user feedback.

This means:

  • If an inaccurate or irrelevant answer is returned, the system will not automatically correct it;

  • Even if the same question is asked multiple times, the system does not "remember" previous better responses.

Advantages of RAG with feedback mechanism

We built a dynamic and adaptive RAG system with the following capabilities:

✅ Memory: remember which documents provided useful information and which did not;

✅ Dynamic score adjustment: update a document's relevance score based on historical feedback;

✅ Knowledge accumulation: add successful question-answer pairs to the knowledge base for future queries;

✅ Continuous evolution: every interaction with users is a learning opportunity, so the system becomes more accurate the more it is used.

The core process of the feedback mechanism

1. User questions

  • The user enters a question and gets an answer generated by the RAG system.

2. Get user feedback

  • Users can provide feedback through rating, likes/dislikes, or direct comments;

  • Example:

  • “This answer is very helpful✅”

  • “This answer is not detailed enough ❌”

  • "Could you please provide more details?"

3. Record feedback data

  • Store user questions, original answers, feedback content and other information to form a feedback log.

4. Analysis and learning

  • Use models to analyze which documents and passages produce high-quality responses;

  • Adjust the retrieval weight of these documents in the future;

  • Add high-quality question-answer pairs to the knowledge base to enhance future semantic understanding.

5. Optimize your next answer

  • The next time you encounter a similar question, the system can find the best answer faster and more accurately.

Example Demo

User's first question:

“What is machine learning?”

System answer:

“Machine learning is an artificial intelligence technique that allows computers to learn patterns from data.”

User feedback:

“Not bad, but could be more detailed.”

The second time, a similar question is asked:

“What are the fundamental principles of machine learning?”

The system combines previous feedback and returns a more detailed answer:

“Machine learning is an artificial intelligence technique that allows computers to automatically learn patterns and regularities through training data. Common methods include supervised learning, unsupervised learning, and reinforcement learning.”

Building a simple vector database

from typing import List, Dict, Optional, Tuple, Callable
import numpy as np

class SimpleVectorStore:
    """
    A simple NumPy-based vector database.

    This class provides an in-memory vector storage and retrieval system that supports basic similarity search using cosine similarity.
    """

    def __init__(self):
        """
        Initialize the vector database with three parallel lists:
        - vectors: embedding vectors (NumPy arrays)
        - texts: raw text chunks
        - metadata: metadata for each text chunk
        """
        self.vectors: List[np.ndarray] = []  # Embedding vector list
        self.texts: List[str] = []           # Text content list
        self.metadata: List[Dict] = []       # Metadata list

    def add_item(self, text: str, embedding: List[float], metadata: Optional[Dict] = None) -> None:
        """
        Add an entry to the vector database.

        Parameters:
            text (str): The text content to store.
            embedding (List[float]): The embedding vector representing the text.
            metadata (Dict, optional): Optional metadata dictionary.
        """
        self.vectors.append(np.array(embedding))
        self.texts.append(text)
        self.metadata.append(metadata or {})

    def similarity_search(
        self,
        query_embedding: List[float],
        k: int = 5,
        filter_func: Optional[Callable[[Dict], bool]] = None
    ) -> List[Dict]:
        """
        Use cosine similarity to find the entries most similar to the query vector.

        Parameters:
            query_embedding (List[float]): Query embedding.
            k (int): Number of results to return.
            filter_func (Callable, optional): Filter function applied to metadata.

        Returns:
            List[Dict]: Results containing text, metadata, and relevance scores.
        """
        if not self.vectors:
            return []

        query_vector = np.array(query_embedding)
        similarities = []

        for i, vector in enumerate(self.vectors):
            # Skip entries rejected by the optional metadata filter
            if filter_func and not filter_func(self.metadata[i]):
                continue

            similarity = np.dot(query_vector, vector) / (
                np.linalg.norm(query_vector) * np.linalg.norm(vector)
            )
            similarities.append((i, similarity))

        similarities.sort(key=lambda x: x[1], reverse=True)

        results = []
        for i in range(min(k, len(similarities))):
            idx, score = similarities[i]
            results.append({
                "text": self.texts[idx],
                "metadata": self.metadata[idx],
                "similarity": score,
                "relevance_score": self.metadata[idx].get("relevance_score", score)
            })

        return results

Feedback system function module

Now we will implement the core feedback system components .

def get_user_feedback(query: str, response: str, relevance: int, quality: int, comments: str = "") -> Dict:
    """
    Format user feedback as a dictionary.

    Parameters:
        query (str): the user's question
        response (str): the system's answer
        relevance (int): relevance score (1-5)
        quality (int): answer quality score (1-5)
        comments (str): optional comments

    Returns:
        Dict: the formatted feedback dictionary
    """
    return {
        "query": query,
        "response": response,
        "relevance": int(relevance),
        "quality": int(quality),
        "comments": comments,
        "timestamp": datetime.now().isoformat()
    }

def store_feedback(feedback: Dict, feedback_file: str = "feedback_data.json") -> None:
    """
    Append a feedback record to a JSON Lines file.

    Parameters:
        feedback (Dict): feedback data
        feedback_file (str): file path
    """
    with open(feedback_file, "a") as f:
        json.dump(feedback, f)
        f.write("\n")

def load_feedback_data(feedback_file: str = "feedback_data.json") -> List[Dict]:
    """
    Load feedback data from a file.

    Parameters:
        feedback_file (str): file path

    Returns:
        List[Dict]: list of feedback records
    """
    feedback_data = []
    try:
        with open(feedback_file, "r") as f:
            for line in f:
                if line.strip():
                    feedback_data.append(json.loads(line.strip()))
    except FileNotFoundError:
        print("Feedback file not found. Starting with empty feedback.")

    return feedback_data
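As a quick sanity check, the three helpers can be exercised together; the scores and comment below are invented for illustration:

# Hypothetical round trip: create a feedback record, persist it, then reload the history
fb = get_user_feedback(
    query="What is RAG?",
    response="RAG augments a language model with retrieved context.",
    relevance=5,
    quality=4,
    comments="Concise and accurate."
)
store_feedback(fb)              # appends one JSON line to feedback_data.json
history = load_feedback_data()  # reads every stored record back into a list
print(len(history), history[-1]["relevance"])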

Feedback-aware document processing

def process_document(pdf_path: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> Tuple[List[str], SimpleVectorStore]:
    """
    Process a document for the RAG pipeline.

    Steps:
    1. Extract text from the PDF
    2. Split the text into chunks
    3. Create embeddings for the chunks
    4. Store them in the vector database

    Parameters:
        pdf_path (str): PDF file path
        chunk_size (int): size of each chunk
        chunk_overlap (int): number of overlapping characters between chunks

    Returns:
        Tuple[List[str], SimpleVectorStore]: the text chunks and the vector database
    """
    print("Extracting text from PDF...")
    extracted_text = extract_text_from_pdf(pdf_path)

    print("Splitting text...")
    chunks = chunk_text(extracted_text, chunk_size, chunk_overlap)
    print(f"Generated {len(chunks)} text chunks")

    print("Creating embeddings for the chunks...")
    chunk_embeddings = create_embeddings(chunks)

    store = SimpleVectorStore()

    # Every chunk starts with a neutral relevance score and no feedback attached
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        store.add_item(
            text=chunk,
            embedding=embedding,
            metadata={
                "index": i,
                "source": pdf_path,
                "relevance_score": 1.0,
                "feedback_count": 0
            }
        )

    print(f"Added {len(chunks)} chunks to the vector database")
    return chunks, store
def assess_feedback_relevance(query: str, doc_text: str, feedback: Dict) -> bool:
    """
    Use an LLM to determine whether a piece of past user feedback is relevant to the current query and document content.
    This decides which historical feedback should influence the ranking of the current search results.

    Parameters:
        query (str): the current user's query
        doc_text (str): the document content being evaluated (i.e. the retrieved text chunk)
        feedback (Dict): previously saved user feedback, containing 'query' and 'response' fields

    Returns:
        bool: True if the feedback is relevant to the current query and document; otherwise False
    """
    # System prompt: the model may only judge whether the feedback is relevant, nothing else
    system_prompt = """You are an expert at judging whether feedback is relevant. Please answer only "yes" or "no" without providing any explanation or anything else."""

    # User prompt: build the input context from the current query, the past question, a document snippet, and the past response
    user_prompt = f"""
    Current query: {query}
    Past feedback question: {feedback['query']}
    Document content: {doc_text[:500]}... [truncated]
    Past response that received feedback: {feedback['response'][:500]}... [truncated]

    Is this historical feedback relevant to the current query and document content? Please answer yes or no.
    """

    # Call the LLM API to obtain the judgment
    response = client.chat.completions.create(
        model="qwen-max",                                  # model name
        messages=[
            {"role": "system", "content": system_prompt},  # system instructions
            {"role": "user", "content": user_prompt}       # user input
        ],
        temperature=0                                      # 0 for deterministic output
    )

    # Extract and normalize the model's answer
    answer = response.choices[0].message.content.strip().lower()

    # The feedback is considered relevant if the answer contains "yes"
    return 'yes' in answer
def adjust_relevance_scores(query: str, results: List[Dict], feedback_data: List[Dict]) -> List[Dict]:
    """
    Adjust the relevance scores of search results based on historical feedback.

    Parameters:
        query (str): the current query
        results (List[Dict]): search results
        feedback_data (List[Dict]): historical feedback data

    Returns:
        List[Dict]: the adjusted results
    """
    if not feedback_data:
        return results

    print("Adjusting relevance scores based on feedback history...")

    for i, result in enumerate(results):
        document_text = result["text"]
        relevant_feedback = []

        for fb in feedback_data:
            if assess_feedback_relevance(query, document_text, fb):
                relevant_feedback.append(fb)

        if relevant_feedback:
            avg_relevance = sum(f['relevance'] for f in relevant_feedback) / len(relevant_feedback)
            modifier = 0.5 + (avg_relevance / 5.0)
            original_score = result["similarity"]
            adjusted_score = original_score * modifier

            result.update({
                "original_similarity": original_score,
                "similarity": adjusted_score,
                "relevance_score": adjusted_score,
                "feedback_applied": True,
                "feedback_count": len(relevant_feedback)
            })

            print(f"Document {i + 1}: score adjusted from {original_score:.4f} to {adjusted_score:.4f} based on {len(relevant_feedback)} pieces of feedback")

    results.sort(key=lambda x: x["similarity"], reverse=True)
    return results
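The key line is modifier = 0.5 + (avg_relevance / 5.0): an average rating of 1 maps to a multiplier of 0.7, 2.5 maps to 1.0 (neutral), and 5 maps to 1.5, so well-rated content is boosted and poorly-rated content demoted. A tiny illustration, using a made-up original similarity of 0.80:

# Illustrative numbers only
for avg_relevance in (1.0, 2.5, 5.0):
    modifier = 0.5 + (avg_relevance / 5.0)
    print(avg_relevance, round(0.80 * modifier, 2))
# prints: 1.0 0.56, 2.5 0.8, 5.0 1.2 -- note the adjusted score may exceed 1.0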
def rag_with_feedback_loop(
    query: str,
    vector_store: SimpleVectorStore,
    feedback_data: List[Dict],
    k: int = 5,
    model: str = "qwen-max"
) -> Dict:
    """
    Execute the complete RAG pipeline with the feedback mechanism.

    Parameters:
        query (str): user query
        vector_store (SimpleVectorStore): vector database
        feedback_data (List[Dict]): historical feedback data
        k (int): number of chunks to retrieve
        model (str): LLM model

    Returns:
        Dict: contains the query, the retrieved documents, and the response
    """
    print("\n=== Processing RAG query with feedback ===")
    print(f"Query: {query}")

    query_embedding = create_embeddings(query)
    results = vector_store.similarity_search(query_embedding, k=k)
    adjusted_results = adjust_relevance_scores(query, results, feedback_data)

    retrieved_texts = [result["text"] for result in adjusted_results]
    context = "\n\n---\n\n".join(retrieved_texts)

    print("Generating response...")
    response = generate_response(query, context, model)

    return {
        "query": query,
        "retrieved_documents": adjusted_results,
        "response": response
    }

Using feedback to fine-tune your index

def fine_tune_index(
    current_store: SimpleVectorStore,
    chunks: List[str],
    feedback_data: List[Dict]
) -> SimpleVectorStore:
    """
    Fine-tune the vector database using high-quality feedback.

    Parameters:
        current_store (SimpleVectorStore): the current database
        chunks (List[str]): the original text chunks
        feedback_data (List[Dict]): historical feedback data

    Returns:
        SimpleVectorStore: the fine-tuned database
    """
    print("Fine-tuning index using high-quality feedback...")

    good_feedback = [fb for fb in feedback_data if fb['relevance'] >= 4 and fb['quality'] >= 4]

    if not good_feedback:
        print("No high-quality feedback found.")
        return current_store

    new_store = SimpleVectorStore()

    for i in range(len(current_store.texts)):
        new_store.add_item(
            text=current_store.texts[i],
            embedding=current_store.vectors[i],
            metadata=current_store.metadata[i].copy()
        )

    for feedback in good_feedback:
        enhanced_text = f"Question: {feedback['query']}\nAnswer: {feedback['response']}"
        embedding = create_embeddings(enhanced_text)

        new_store.add_item(
            text=enhanced_text,
            embedding=embedding,
            metadata={
                "type": "feedback_enhanced",
                "query": feedback["query"],
                "relevance_score": 1.2,
                "feedback_count": 1,
                "original_feedback": feedback
            }
        )

        print(f"Added feedback content: {feedback['query'][:50]}...")

    print(f"After fine-tuning, the index contains {len(new_store.texts)} items (original: {len(chunks)})")
    return new_store

Complete workflow: from initial setup to feedback collection

def full_rag_workflow(
    pdf_path: str,
    query: str,
    feedback_data: Optional[List[Dict]] = None,
    feedback_file: str = "feedback_data.json",
    fine_tune: bool = False
) -> Dict:
    """
    Complete RAG workflow with an integrated feedback mechanism.

    Parameters:
        pdf_path (str): PDF file path
        query (str): user query
        feedback_data (Optional[List[Dict]]): historical feedback data
        feedback_file (str): feedback file path
        fine_tune (bool): whether to enable index fine-tuning

    Returns:
        Dict: contains the response and the retrieved information
    """
    if feedback_data is None:
        feedback_data = load_feedback_data(feedback_file)
        print(f"Loaded {len(feedback_data)} pieces of feedback from {feedback_file}")

    chunks, vector_store = process_document(pdf_path)

    if fine_tune and feedback_data:
        vector_store = fine_tune_index(vector_store, chunks, feedback_data)

    result = rag_with_feedback_loop(query, vector_store, feedback_data)

    print("\n=== Would you like to provide feedback on this response? ===")
    print("Rate relevance (1-5):")
    relevance = input()

    print("Rate quality (1-5):")
    quality = input()

    print("Any comments? (press Enter to skip)")
    comments = input()

    feedback = get_user_feedback(
        query=query,
        response=result["response"],
        relevance=int(relevance),
        quality=int(quality),
        comments=comments
    )

    store_feedback(feedback, feedback_file)
    print("Thanks for your feedback!")

    return result
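Putting everything together, an end-to-end run might look like the sketch below; the file name and question are placeholders, and the script prompts interactively for ratings once the answer has been generated:

# Hypothetical end-to-end run (placeholder file name and question)
if __name__ == "__main__":
    result = full_rag_workflow(
        pdf_path="my_document.pdf",
        query="What are the main ideas of this document?",
        fine_tune=True
    )
    print("\nFinal answer:\n" + result["response"])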