A full-process guide to improving enterprise RAG accuracy: from data extraction to precise retrieval

An essential guide to improving enterprise data retrieval efficiency: a full-process analysis of how to raise RAG system accuracy.
Core content:
1. PDF file upload and record creation, optimizing storage and the processing pipeline
2. PDF parsing and chunking, improving text extraction and semantic segmentation accuracy
3. Applying embedding models and extracting entity relationships to enhance retrieval accuracy
In an enterprise environment, accurately and efficiently retrieving information from large amounts of unstructured data, such as PDF files, is crucial. Retrieval-augmented generation (RAG) systems play an important role here, but improving their accuracy is a complex and challenging task. This article walks through a step-by-step guide to improving enterprise RAG accuracy, covering the key steps from data extraction to retrieval.
1. Extracting knowledge from PDF
1.1 Upload and record creation
The user uploads a PDF file (other file types such as audio and video will be supported in the future). The system saves the file to disk (with plans to migrate to AWS S3 buckets in the near future to better meet enterprise needs), inserts a record into the database, and creates a processing-status entry. SingleStore is used as the database because it supports multiple data types, hybrid search, and retrieval in a single query. The PDF-processing task is then placed on a background queue, processed asynchronously and tracked with Redis and Celery.
# Pseudo-code
save_file_to_disk(pdf)
db_insert(document_record, status="started")
queue_processing_task(pdf)
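As a rough illustration of the queueing step, here is a minimal sketch using Celery with a Redis broker, as named above. The task and helper names (process_pdf, run_pdf_pipeline) are hypothetical placeholders for the steps described in sections 1.2-1.5.

# Minimal sketch (assumed setup): background PDF processing with Celery + Redis.
from celery import Celery

app = Celery("rag_pipeline",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

@app.task(bind=True, max_retries=3, default_retry_delay=30)
def process_pdf(self, document_id: int, file_path: str) -> None:
    try:
        # 1.2 parse and chunk, 1.3 embed, 1.4 extract entities/relations, 1.5 update status
        run_pdf_pipeline(document_id, file_path)  # placeholder for the full pipeline
    except Exception as exc:
        raise self.retry(exc=exc)

def run_pdf_pipeline(document_id: int, file_path: str) -> None:
    """Placeholder: parse, chunk, embed, and extract knowledge for one document."""
    ...

# Enqueue after the file is saved and the DB record is created:
# process_pdf.delay(document_id, saved_path)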
1.2 Parsing and Chunking the PDF
Open the file and verify the size limit and password protection; if the file is unreadable, terminate processing as early as possible. Extract the file content as text or Markdown. You can use LlamaParse (from LlamaIndex) instead of the previously used PyMuPDF; its free tier supports parsing 1,000 documents per day and handles table and image extraction better. Analyze the document structure (table of contents, headings, etc.) and use Gemini 2.0 Flash to segment the text into semantically meaningful chunks. If semantic chunking fails, fall back to a simple splitting method and add overlaps between chunks to maintain contextual coherence.
# Pseudo-code
validate_pdf(pdf)
text = extract_text(pdf)
chunks = semantic_chunking(text) or fallback_chunking(text)
add_overlaps(chunks)
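For the fallback path, a minimal sketch of fixed-size chunking with overlap might look like the following; the chunk size and overlap values are illustrative, not tuned settings.

# Minimal sketch: simple fixed-size chunking with overlap, used when semantic chunking fails.
def fallback_chunking(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks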
1.3 Generating Embeddings
# Pseudo-code
for chunk in chunks:
    vector = generate_embedding(chunk.text)
    db_insert(embedding_record, vector)
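A minimal sketch of the generate_embedding helper, assuming OpenAI's text-embedding-ada-002 (the same model referenced in section 2.3 and in the retrieval code of section 3):

# Minimal sketch: generate a 1536-dimensional embedding with text-embedding-ada-002.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding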
1.4 Extracting Entities and Relations Using Large Language Models
This step has a large impact on overall accuracy. Each semantically organized text chunk is sent to OpenAI with a specific prompt asking it to return the entities and relationships in that chunk, including key entities (name, type, description, aliases). The relationships between entities are mapped so that duplicate data is not inserted, and the extracted "knowledge" is stored in structured tables.
# Pseudo-code
for chunk in chunks:
    entities, relationships = extract_knowledge(chunk.text)
    db_insert(entities)
    db_insert(relationships)
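A minimal sketch of what extract_knowledge could look like with the OpenAI chat API returning JSON. The prompt wording, JSON schema, and model name (gpt-4o-mini) are illustrative assumptions, not the exact prompt used in the pipeline.

# Minimal sketch: ask an OpenAI chat model for entities and relationships as JSON.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

EXTRACTION_PROMPT = (
    "Extract entities and relationships from the text. Respond with JSON of the form "
    '{"entities": [{"name": "...", "type": "...", "description": "...", "aliases": ["..."]}], '
    '"relationships": [{"source": "...", "target": "...", "relation_type": "..."}]}'
)

def extract_knowledge(chunk_text: str) -> tuple[list[dict], list[dict]]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; substitute whatever the pipeline uses
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": chunk_text},
        ],
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("entities", []), data.get("relationships", [])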
1.5 Final processing status
If all steps complete successfully, the status is updated to "completed" so the front end can poll and display the correct state at any time; if processing fails, the record is marked "failed" and the partial data is cleaned up.
# Pseudo-code
if success:
    update_status("completed")
else:
    update_status("failed")
    cleanup_partial_data()
2. Knowledge Retrieval (RAG Pipeline)
2.1 User Query
The user submits a query request to the system.
# Pseudo-code
query = get_user_query()
2.2 Preprocessing and expanding queries
The system normalizes the query (removing punctuation, standardizing whitespace) and expands it with synonyms using a large language model (for example, one served by Groq, which offers faster inference).
# Pseudo-code
query = preprocess_query(query)
expanded_query = expand_query(query)
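A minimal sketch of the two helpers, using the Groq SDK for the expansion call as mentioned above; the model name and prompt are illustrative assumptions.

# Minimal sketch: normalize the query, then expand it with a Groq-hosted model.
import os
import re
from groq import Groq

groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))

def preprocess_query(query: str) -> str:
    query = re.sub(r"[^\w\s]", " ", query)      # strip punctuation
    return re.sub(r"\s+", " ", query).strip()   # normalize whitespace

def expand_query(query: str) -> str:
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model hosted on Groq
        messages=[
            {"role": "system",
             "content": "Rewrite the search query, adding synonyms and related terms. "
                        "Return only the expanded query."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content.strip()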
2.3 Embedding Query and Search Vectors
The query is embedded as a high-dimensional vector using the same ada model used during extraction (text-embedding-ada-002), and the best-matching text chunks are found with a semantic search over the document embeddings (e.g., a dot-product operation in SingleStore).
# Pseudo-code
query_vector = generate_embedding(expanded_query)
top_chunks = vector_search(query_vector)
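A minimal sketch of vector_search, reusing the connection settings and the <*> dot-product operator that appear in the full hybrid query of section 3; the table and column names follow that query.

# Minimal sketch: pure vector search against the Document_Embeddings table.
import json
import mysql.connector

def vector_search(query_vector: list[float], db_config: dict, k: int = 10) -> list[dict]:
    conn = mysql.connector.connect(**db_config)
    cursor = conn.cursor(dictionary=True)
    cursor.execute("SET @qvec = %s", (json.dumps(query_vector),))
    cursor.execute(
        """
        SELECT doc_id, content, (embedding <*> @qvec) AS vector_score
        FROM Document_Embeddings
        ORDER BY vector_score DESC
        LIMIT %s;
        """,
        (k,),
    )
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results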
2.4 Full-text search
A full-text search is run in parallel to complement the vector search; in SingleStore it can be implemented with the MATCH statement.
# Pseudo-code
text_results = full_text_search(query)
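A minimal sketch of full_text_search, using the same MATCH(TABLE ...) AGAINST form that the hybrid query in section 3 uses; connection details are assumed to match that section.

# Minimal sketch: full-text search with SingleStore's MATCH ... AGAINST.
import mysql.connector

def full_text_search(query: str, db_config: dict, k: int = 10) -> list[dict]:
    conn = mysql.connector.connect(**db_config)
    cursor = conn.cursor(dictionary=True)
    cursor.execute(
        """
        SELECT doc_id, content,
               MATCH(TABLE Document_Embeddings) AGAINST(%s) AS text_score
        FROM Document_Embeddings
        WHERE MATCH(TABLE Document_Embeddings) AGAINST(%s)
        ORDER BY text_score DESC
        LIMIT %s;
        """,
        (query, query, k),
    )
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results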
2.5 Merging and sorting results
The results of the vector search and the full-text search are merged and re-ranked by relevance. The number of top-k results returned is tunable (k = 10 or higher tends to work better), and low-confidence results can be filtered out.
# Pseudo-code
merged_results = merge_and_rank(top_chunks, text_results)
filtered_results = filter_low_confidence(merged_results)
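A minimal sketch of the merge-and-filter step. The 0.7 / 0.3 weights mirror the hybrid SQL query in section 3; the confidence threshold is an illustrative value.

# Minimal sketch: merge vector and full-text hits with a weighted combined score.
def merge_and_rank(vector_hits: list[dict], text_hits: list[dict],
                   w_vec: float = 0.7, w_txt: float = 0.3) -> list[dict]:
    merged: dict[int, dict] = {}
    for hit in vector_hits:
        merged[hit["doc_id"]] = {**hit, "vector_score": hit.get("vector_score", 0.0),
                                 "text_score": 0.0}
    for hit in text_hits:
        entry = merged.setdefault(hit["doc_id"], {**hit, "vector_score": 0.0})
        entry["text_score"] = hit.get("text_score", 0.0)
    for entry in merged.values():
        entry["combined_score"] = (w_vec * entry["vector_score"]
                                   + w_txt * entry["text_score"])
    return sorted(merged.values(), key=lambda e: e["combined_score"], reverse=True)

def filter_low_confidence(results: list[dict], threshold: float = 0.5) -> list[dict]:
    return [r for r in results if r["combined_score"] >= threshold]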
2.6 Retrieving Entities and Relationships
If entities and relationships exist for the retrieved text block, they are included in the response.
# Pseudo-code
for result in filtered_results:
    entities, relationships = fetch_knowledge(result)
    enrich_result(result, entities, relationships)
2.7 Generating the Final Answer
The prompt is augmented with the retrieved context, and the relevant data is sent to a large language model (such as o3-mini) to generate the final response.
# Pseudo-code
final_answer = generate_llm_response(filtered_results)
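A minimal sketch of the answer-generation step; the query is passed in explicitly here, and the prompt wording and model name in the code are illustrative assumptions.

# Minimal sketch: build a context-augmented prompt and ask the answer model for a grounded response.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_llm_response(query: str, filtered_results: list[dict]) -> str:
    context = "\n\n".join(
        f"[{r['doc_id']}] {r['content']}\nEntities: {r.get('entities')}\n"
        f"Relationships: {r.get('relationships')}"
        for r in filtered_results
    )
    response = client.chat.completions.create(
        model="o3-mini",  # assumed answer model; substitute the one used in production
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context. "
                        "Cite the doc_id of each chunk you rely on."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content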
2.8 Return the answer to the user
The response is returned as a structured JSON payload along with the original database search results to facilitate debugging and tuning if needed.
# Pseudo-code
return_response(final_answer)
3. Performance optimization and implementation
In practice, we found that the response time of the retrieval flow was too long (about 8 seconds); the main bottleneck was the large language model calls (about 1.5-2 seconds each), while the SingleStore database queries usually completed within 600 milliseconds. After switching some LLM calls to Groq, the response time dropped to about 3.5 seconds. For further optimization, parallel calls can be used instead of serial ones, as sketched below.
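One way to overlap independent steps is a thread pool: run the full-text search on the raw query while the LLM-based expansion, embedding, and vector search proceed. This is a minimal sketch that reuses the illustrative helpers from the earlier sections, not the production implementation.

# Minimal sketch: overlap independent retrieval steps instead of running them serially.
from concurrent.futures import ThreadPoolExecutor

def retrieve_concurrently(query: str, db_config: dict) -> list[dict]:
    with ThreadPoolExecutor(max_workers=2) as pool:
        text_future = pool.submit(full_text_search, query, db_config)  # no LLM needed
        expanded = expand_query(query)                # LLM call (Groq)
        query_vector = generate_embedding(expanded)   # embedding call (OpenAI)
        vector_hits = vector_search(query_vector, db_config)
        return merge_and_rank(vector_hits, text_future.result())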
To simplify management and improve database response time, retrieval is implemented as a single query. The query embedding is generated with OpenAI's Embeddings API, and one hybrid-search SQL query is executed in SingleStore to fetch the text chunks together with their vector scores, text scores, combined scores, and the related entities and relationships:
import os
import json
import mysql.connector
from openai import OpenAI

# Define database connection parameters (assumed from env vars)
DB_CONFIG = {
    "host": os.getenv("SINGLESTORE_HOST", "localhost"),
    "port": int(os.getenv("SINGLESTORE_PORT", "3306")),
    "user": os.getenv("SINGLESTORE_USER", "root"),
    "password": os.getenv("SINGLESTORE_PASSWORD", ""),
    "database": os.getenv("SINGLESTORE_DATABASE", "knowledge_graph"),
}

def get_query_embedding(query: str) -> list:
    """
    Generate a 1536-dimensional embedding for the query using OpenAI embeddings API.
    """
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=query
    )
    return response.data[0].embedding  # Extract embedding vector

def retrieve_rag_results(query: str) -> list:
    """
    Execute the hybrid search SQL query in SingleStore and return the top-ranked results.
    """
    conn = mysql.connector.connect(**DB_CONFIG)
    cursor = conn.cursor(dictionary=True)

    # Generate query embedding
    query_embedding = get_query_embedding(query)
    embedding_str = json.dumps(query_embedding)  # Convert to JSON for SQL compatibility

    # Set the query embedding session variable
    cursor.execute("SET @qvec = %s", (embedding_str,))

    # Hybrid search SQL query: weighted vector + full-text score, plus related entities/relationships
    sql_query = """
    SELECT
        d.doc_id,
        d.content,
        (d.embedding <*> @qvec) AS vector_score,
        MATCH(TABLE Document_Embeddings) AGAINST(%s) AS text_score,
        (0.7 * (d.embedding <*> @qvec) + 0.3 * MATCH(TABLE Document_Embeddings) AGAINST(%s)) AS combined_score,
        JSON_AGG(DISTINCT JSON_OBJECT(
            'entity_id', e.entity_id,
            'name', e.name,
            'description', e.description,
            'category', e.category
        )) AS entities,
        JSON_AGG(DISTINCT JSON_OBJECT(
            'relationship_id', r.relationship_id,
            'source_entity_id', r.source_entity_id,
            'target_entity_id', r.target_entity_id,
            'relation_type', r.relation_type
        )) AS relationships
    FROM Document_Embeddings d
    LEFT JOIN Relationships r ON r.doc_id = d.doc_id
    LEFT JOIN Entities e ON e.entity_id IN (r.source_entity_id, r.target_entity_id)
    WHERE MATCH(TABLE Document_Embeddings) AGAINST(%s)
    GROUP BY d.doc_id, d.content, d.embedding
    ORDER BY combined_score DESC
    LIMIT 10;
    """

    # Execute the query
    cursor.execute(sql_query, (query, query, query))
    results = cursor.fetchall()

    cursor.close()
    conn.close()
    return results  # Return list of retrieved documents with entities and relationships
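A quick usage sketch of the function above; the query string is illustrative.

# Illustrative usage of the single-query retrieval function
if __name__ == "__main__":
    results = retrieve_rag_results("What are the payment terms in the vendor contract?")
    for row in results:
        print(row["doc_id"], round(row["combined_score"], 4))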
4. Lessons learned and future improvement directions
Improving RAG accuracy and maintaining low latency is a challenging task, especially when dealing with structured data. Future improvements can be made in the following areas:
4.1 Accuracy Improvement - Extraction Phase
- Externalize and experiment with entity-extraction prompts: try different prompting strategies to extract entities from text chunks more accurately.
- Summarize each text chunk before processing: this may have a significant impact on accuracy.
- Add a better retry mechanism for failures: ensure effective recovery when any of the processing steps fails (see the sketch after this list).
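As a minimal sketch of the retry idea, a generic decorator with exponential backoff could wrap any processing step (parsing, embedding, extraction); the attempt count and delays are illustrative.

# Minimal sketch: retry-with-backoff decorator for processing steps.
import time
import functools

def with_retries(max_attempts: int = 3, base_delay: float = 1.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        return wrapper
    return decorator

# Example: apply @with_retries(max_attempts=3) to generate_embedding or extract_knowledge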
4.2 Accuracy Improvement - Retrieval Phase
- Use better query-expansion techniques: custom dictionaries, industry-specific terms, etc., to improve query relevance.
- Fine-tune the weights for vector and text search: the settings are currently externalized in a configuration file and can be optimized further (see the sketch after this list).
- Add a second large language model for re-ranking: but there is a latency trade-off.
- Resize the search window: optimize the balance between recall and relevance.
- Generate chunk-level summaries: avoid sending raw text directly to the large language model, reducing the processing burden.
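For the externalized search weights, a minimal sketch of a JSON-backed config loader is shown below, so the 0.7 / 0.3 split used in the section 3 query can be tuned without code changes; the file name and keys are illustrative.

# Minimal sketch: load hybrid-search weights from an external JSON config.
import json

DEFAULT_WEIGHTS = {"vector_weight": 0.7, "text_weight": 0.3, "top_k": 10}

def load_search_config(path: str = "search_config.json") -> dict:
    try:
        with open(path, encoding="utf-8") as f:
            config = {**DEFAULT_WEIGHTS, **json.load(f)}
    except FileNotFoundError:
        config = dict(DEFAULT_WEIGHTS)
    return config

# Usage: weights = load_search_config(); substitute weights["vector_weight"] and
# weights["text_weight"] into the combined_score expression of the hybrid query.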