A full-process guide to improving enterprise RAG accuracy: from data extraction to precise retrieval

Written by
Caleb Hayes
Updated on: July 9, 2025
Recommendation

An essential guide to improving enterprise data retrieval efficiency: a full-process analysis of how to raise RAG system accuracy.

Core content:
1. PDF file upload and record creation, optimizing storage and the processing workflow
2. PDF parsing and chunking, improving text extraction and semantic segmentation accuracy
3. Embedding model application and entity-relationship extraction, enhancing retrieval accuracy


In an enterprise environment, it is crucial to accurately and efficiently retrieve information from large amounts of unstructured data, such as PDF files. Systems based on retrieval-augmented generation (RAG) play an important role here, but improving their accuracy is a complex and challenging task. This article presents a step-by-step guide to improving the accuracy of enterprise RAG systems, covering the key steps from data extraction to retrieval.

1. Extracting knowledge from PDF

1.1 Upload and record creation

The user uploads a PDF file (other file types such as audio and video will be supported in the future). The system saves the file to disk (with plans to migrate to AWS S3 buckets in the near future to better meet enterprise needs), inserts a record into the database, and creates a processing-status entry. SingleStore is used as the database because it supports multiple data types, hybrid search, and retrieval in a single query. At the same time, the PDF-processing task is placed in a background queue, where it is processed asynchronously and tracked via Redis and Celery.

# Pseudo-code
save_file_to_disk(pdf)
db_insert(document_record, status="started")
queue_processing_task(pdf)
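
As a rough illustration of how this step could be wired together, the sketch below uses Celery with a Redis broker; the helper names (save_file_to_disk, db_insert, process_pdf) reuse the pseudo-code above and are placeholders, not the actual implementation.

# Illustrative sketch (assumed names): enqueue PDF processing with Celery + Redis
from celery import Celery

celery_app = Celery("pdf_tasks", broker="redis://localhost:6379/0")

@celery_app.task
def process_pdf(document_id, file_path):
    # Runs asynchronously in a worker: parse, chunk, embed, extract knowledge
    ...

def handle_upload(pdf_file):
    file_path = save_file_to_disk(pdf_file)  # local disk today; S3 planned
    document_id = db_insert(document_record, status="started", path=file_path)
    process_pdf.delay(document_id, file_path)  # enqueue the background task via Redis
    return document_id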

1.2 Parsing and Chunking the PDF

Open the file and verify the size limit and password protection; if the file is unreadable, terminate processing as early as possible. Extract the file content into text or Markdown format. You can use LlamaParse (from LlamaIndex) instead of the previously used PyMuPDF; its free tier supports parsing 1,000 documents per day and handles table and image extraction better. Analyze the document structure (such as the table of contents and headings) and use Gemini Flash 2.0 to segment the text into semantically meaningful chunks. If semantic chunking fails, fall back to a simple splitting method and add overlaps between chunks to maintain contextual coherence.

# Pseudo-code
validate_pdf(pdf)
text = extract_text(pdf)
chunks = semantic_chunking(text) or fallback_chunking(text)
add_overlaps(chunks)
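
The semantic chunking itself is delegated to Gemini Flash 2.0, but the fallback path is easy to sketch. Below is one possible fixed-size fallback with overlap; the chunk size and overlap values are illustrative assumptions, not figures from the article.

# Illustrative fallback chunker with overlap (sizes are assumptions)
def fallback_chunking(text, chunk_size=1500, overlap=200):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # re-use the tail of the previous chunk to keep context
    return chunks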

1.3 Generating Embeddings

Use an embedding model to convert each text chunk into a high-dimensional vector, for example OpenAI's text-embedding-ada-002 model, which produces 1536-dimensional vectors. Store the text chunks and their embedding vectors in the database. In SingleStore, the chunk text and its vector are stored in different columns of the same table for easy maintenance and retrieval.

# Pseudo-code
for chunk in chunks:
    vector = generate_embedding(chunk.text)
    db_insert(embedding_record, vector)
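
A minimal sketch of this step, assuming the OpenAI Python SDK and a Document_Embeddings table with content and embedding columns; the table layout and the JSON_ARRAY_PACK call are assumptions about the schema, not the article's exact code.

# Illustrative sketch: embed each chunk and store text + vector in SingleStore
import json
import mysql.connector
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_and_store(chunks, db_config):
    conn = mysql.connector.connect(**db_config)
    cursor = conn.cursor()
    for chunk in chunks:
        response = client.embeddings.create(
            model="text-embedding-ada-002",  # 1536-dimensional vectors
            input=chunk,
        )
        vector = response.data[0].embedding
        cursor.execute(
            "INSERT INTO Document_Embeddings (content, embedding) "
            "VALUES (%s, JSON_ARRAY_PACK(%s))",
            (chunk, json.dumps(vector)),
        )
    conn.commit()
    cursor.close()
    conn.close()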

1.4 Extracting Entities and Relations Using Large Language Models

This step has a large impact on overall accuracy. The semantically organized text chunks are sent to OpenAI with specific prompts asking it to return the entities and relationships in each chunk, including key entity attributes (name, type, description, aliases). Relationships between entities are mapped so that duplicate data is not added, and the extracted "knowledge" is stored in structured tables.

# Pseudo-code
for chunk in chunks:
    entities, relationships = extract_knowledge(chunk.text)
    db_insert(entities)
    db_insert(relationships)
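
In practice this can be a chat-completion call that asks for strict JSON; the prompt wording, JSON keys, and model choice below are illustrative assumptions rather than the article's actual prompt.

# Illustrative sketch: extract entities and relationships as JSON (prompt and schema are assumptions)
import json
from openai import OpenAI

client = OpenAI()

EXTRACTION_PROMPT = (
    "Extract entities (name, type, description, aliases) and relationships "
    "(source, target, relation_type) from the text. Respond with JSON only, "
    'using the keys "entities" and "relationships".'
)

def extract_knowledge(chunk_text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": chunk_text},
        ],
    )
    payload = json.loads(response.choices[0].message.content)
    return payload.get("entities", []), payload.get("relationships", [])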

1.5 Final processing status

If all steps are processed correctly, the status is updated to "Completed" so that the front end can poll and display the correct status at any time; if the processing fails, it is marked as "Failed" and the temporary data is cleaned up.

# Pseudo-code
if success:
    update_status("completed")
else:
    update_status("failed")
    cleanup_partial_data()

2. Knowledge Retrieval (RAG Pipeline)

2.1 User Query

The user submits a query request to the system.

# Pseudo-code
query = get_user_query()

2.2 Preprocessing and expanding queries

The system normalizes the query by removing punctuation and standardizing whitespace, and expands it with synonyms using a large language model (served via Groq, for example, which offers faster inference).

# Pseudo-code
query = preprocess_query(query)
expanded_query = expand_query(query)
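
A possible shape for these two helpers is sketched below; it assumes Groq's OpenAI-compatible endpoint, and the model name and expansion prompt are placeholders.

# Illustrative sketch: normalize the query, then ask an LLM for synonym expansion
import os
import re
from openai import OpenAI

# Groq exposes an OpenAI-compatible API; the model name below is an assumption
llm = OpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))

def preprocess_query(query):
    query = re.sub(r"[^\w\s]", " ", query)     # remove punctuation
    return re.sub(r"\s+", " ", query).strip()  # standardize whitespace

def expand_query(query):
    response = llm.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query, appending useful synonyms: {query}",
        }],
    )
    return response.choices[0].message.content.strip()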

2.3 Embedding Query and Search Vectors

The query is embedded as a high-dimensional vector using the same ada model used during extraction, and the best-matching text chunks are found via semantic search over the document-embedding table (e.g., a dot-product operation in SingleStore).

# Pseudo-code
query_vector = generate_embedding(expanded_query)
top_chunks = vector_search(query_vector)
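
For reference, a minimal dot-product search in SingleStore might look like the snippet below; it reuses the Document_Embeddings table and the <*> dot-product operator from the full query in section 3, and assumes a dictionary cursor.

# Illustrative sketch: top-k dot-product search in SingleStore (table/column names assumed)
def vector_search(cursor, query_vector_json, k=10):
    cursor.execute(
        """
        SELECT doc_id, content, (embedding <*> JSON_ARRAY_PACK(%s)) AS vector_score
        FROM Document_Embeddings
        ORDER BY vector_score DESC
        LIMIT %s
        """,
        (query_vector_json, k),
    )
    return cursor.fetchall()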

2.4 Full-text search

Parallel full-text searches are performed to complement vector searches and can be implemented in SingleStore using the MATCH statement.

# Pseudo-code
text_results = full_text_search(query)
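
A matching full-text lookup, following the MATCH(TABLE ...) AGAINST syntax used in the query in section 3, could look like this sketch (names again assumed).

# Illustrative sketch: full-text search in SingleStore
def full_text_search(cursor, query, k=10):
    cursor.execute(
        """
        SELECT doc_id, content, MATCH(TABLE Document_Embeddings) AGAINST(%s) AS text_score
        FROM Document_Embeddings
        WHERE MATCH(TABLE Document_Embeddings) AGAINST(%s)
        ORDER BY text_score DESC
        LIMIT %s
        """,
        (query, query, k),
    )
    return cursor.fetchall()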

2.5 Merging and sorting results

The results of vector search and full-text search are combined and re-ranked according to relevance. The number of top k results returned can be adjusted (k = 10 or higher is better), and low-confidence results can be filtered out.

# Pseudo-code
merged_results = merge_and_rank(top_chunks, text_results)
filtered_results = filter_low_confidence(merged_results)
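
One simple way to merge the two result sets is a weighted score, mirroring the 0.7/0.3 split used in the combined SQL query in section 3; the confidence threshold below is an assumption, and the rows are assumed to be dictionaries keyed by doc_id.

# Illustrative sketch: weighted merge of vector and full-text results, then confidence filtering
def merge_and_rank(vector_results, text_results, w_vec=0.7, w_text=0.3):
    scores = {}
    for row in vector_results:
        scores.setdefault(row["doc_id"], {"row": row, "vec": 0.0, "text": 0.0})["vec"] = row["vector_score"]
    for row in text_results:
        scores.setdefault(row["doc_id"], {"row": row, "vec": 0.0, "text": 0.0})["text"] = row["text_score"]
    merged = [
        {**item["row"], "combined_score": w_vec * item["vec"] + w_text * item["text"]}
        for item in scores.values()
    ]
    return sorted(merged, key=lambda r: r["combined_score"], reverse=True)

def filter_low_confidence(results, threshold=0.5):  # threshold is an assumption
    return [r for r in results if r["combined_score"] >= threshold]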

2.6 Retrieving Entities and Relationships

If entities and relationships exist for the retrieved text block, they are included in the response.


# Pseudo-code
for result in filtered_results:
    entities, relationships = fetch_knowledge(result)
    enrich_result(result, entities, relationships)

2.7 Generating the Final Answer

The context is augmented through prompting, and the relevant data is sent to a large language model (such as o3-mini) to generate the final response.

# Pseudo-code
final_answer = generate_llm_response(filtered_results)
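
A sketch of this final call, assuming the OpenAI chat API; the model name, prompt wording, and result fields are illustrative assumptions.

# Illustrative sketch: build a context-augmented prompt and generate the answer
from openai import OpenAI

client = OpenAI()

def generate_llm_response(query, filtered_results):
    context = "\n\n".join(
        f"[{r['doc_id']}] {r['content']}\n"
        f"Entities: {r.get('entities')}\nRelationships: {r.get('relationships')}"
        for r in filtered_results
    )
    response = client.chat.completions.create(
        model="o3-mini",  # model choice is an assumption
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content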

2.8 Return the answer to the user

The response is returned as a structured JSON payload along with the original database search results to facilitate debugging and tuning if needed.

# Pseudo-code
return_response(final_answer)

3. Performance optimization and implementation

In practice, we found that the response time of the retrieval process was too long (about 8 seconds), and the main bottleneck was the large language model calls (about 1.5-2 seconds each), while the SingleStore database query time was usually within 600 milliseconds. After switching some large language model calls to Groq, the response time was shortened to 3.5 seconds. For further optimization, you can try making the calls in parallel instead of serially, as sketched below.
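
For example, independent calls (such as query expansion and a query rewrite) can be issued concurrently with asyncio; the sketch below assumes the async OpenAI client, and the model and prompts are placeholders.

# Illustrative sketch: run independent LLM calls concurrently instead of serially
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def llm_call(prompt):
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def parallel_calls(query):
    # Both calls start immediately and overlap, instead of adding their latencies
    expansion, rewrite = await asyncio.gather(
        llm_call(f"Expand this query with synonyms: {query}"),
        llm_call(f"Rewrite this query as a standalone question: {query}"),
    )
    return expansion, rewrite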

To simplify management and improve database response time, retrieval is implemented as a single query. The query embedding vector is generated through OpenAI's Embeddings API, and a hybrid-search SQL query is executed in SingleStore to obtain text chunks, vector scores, text scores, combined scores, and the related entity and relationship information.

import os
import json
import mysql.connector
from openai import OpenAI

# Define database connection parameters (assumed from env vars)
DB_CONFIG = {
    "host": os.getenv("SINGLESTORE_HOST", "localhost"),
    "port": int(os.getenv("SINGLESTORE_PORT", "3306")),
    "user": os.getenv("SINGLESTORE_USER", "root"),
    "password": os.getenv("SINGLESTORE_PASSWORD", ""),
    "database": os.getenv("SINGLESTORE_DATABASE", "knowledge_graph"),
}

def get_query_embedding(query: str) -> list:
    """Generate a 1536-dimensional embedding for the query using OpenAI embeddings API."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=query,
    )
    return response.data[0].embedding  # Extract embedding vector

def retrieve_rag_results(query: str) -> list:
    """Execute the hybrid search SQL query in SingleStore and return the top-ranked results."""
    conn = mysql.connector.connect(**DB_CONFIG)
    cursor = conn.cursor(dictionary=True)

    # Generate query embedding
    query_embedding = get_query_embedding(query)
    embedding_str = json.dumps(query_embedding)  # Convert to JSON for SQL compatibility

    # Set the query embedding session variable
    cursor.execute("SET @qvec = %s", (embedding_str,))

    # Hybrid Search SQL Query (same as provided earlier)
    sql_query = """
    SELECT
        d.doc_id,
        d.content,
        (d.embedding <*> @qvec) AS vector_score,
        MATCH(TABLE Document_Embeddings) AGAINST(%s) AS text_score,
        (0.7 * (d.embedding <*> @qvec) + 0.3 * MATCH(TABLE Document_Embeddings) AGAINST(%s)) AS combined_score,
        JSON_AGG(DISTINCT JSON_OBJECT(
            'entity_id', e.entity_id,
            'name', e.name,
            'description', e.description,
            'category', e.category
        )) AS entities,
        JSON_AGG(DISTINCT JSON_OBJECT(
            'relationship_id', r.relationship_id,
            'source_entity_id', r.source_entity_id,
            'target_entity_id', r.target_entity_id,
            'relation_type', r.relation_type
        )) AS relationships
    FROM Document_Embeddings d
    LEFT JOIN Relationships r ON r.doc_id = d.doc_id
    LEFT JOIN Entities e ON e.entity_id IN (r.source_entity_id, r.target_entity_id)
    WHERE MATCH(TABLE Document_Embeddings) AGAINST(%s)
    GROUP BY d.doc_id, d.content, d.embedding
    ORDER BY combined_score DESC
    LIMIT 10;
    """

    # Execute the query
    cursor.execute(sql_query, (query, query, query))
    results = cursor.fetchall()

    cursor.close()
    conn.close()
    return results  # Return list of retrieved documents with entities and relationships

4. Lessons learned and future improvement directions

Improving RAG accuracy and maintaining low latency is a challenging task, especially when dealing with structured data. Future improvements can be made in the following areas:

4.1 Accuracy Improvement - Extraction Phase

  • Externalize and experiment with entity extraction prompts
    Try different prompting strategies to extract entities from text chunks more accurately.
  • Summarize a text chunk before processing
    This may have a significant impact on accuracy.
  • Add better failure retry mechanism
    Ensure effective recovery when failures occur in various processing steps.

4.2 Accuracy Improvement - Retrieval Phase

  • Use better query expansion techniques
    Such as custom dictionaries, specific industry terms, etc., to improve the relevance of queries.
  • Fine-tuning weights for vector and text search
    Currently the settings are externalized in the configuration file and can be further optimized.
  • Adding a second large language model for re-ranking
    But there is a latency trade-off.
  • Resize the search window
    Optimize the balance between recall and relevance.
  • Generate chunk-level summaries
    Avoid sending raw text directly to the large language model, reducing the processing burden.

By following the step-by-step guide above and pursuing these improvement directions, enterprises can continuously optimize the accuracy of their RAG systems, better meet the need for efficient information retrieval across large amounts of unstructured data, and improve business efficiency and decision-making quality.