A step-by-step guide to improving enterprise RAG accuracy

Written by
Iris Vance
Updated on: July 14, 2025

[Figure: Knowledge graph generated from PDF files]

In my previous blog post, I wrote about how using new models like Gemini Flash 2.0, with their very large context windows, for semantic chunking can significantly improve overall retrieval accuracy from unstructured data like PDFs.

As I explored this, I began looking into other strategies to further improve the accuracy of the responses, since in most large enterprises the tolerance for inaccuracy is almost zero, and it should be. In this pursuit, I ended up trying many different things, and in this blog, let’s look at the overall steps that ultimately helped improve accuracy.

Before we get into the steps, let's look at the whole process from a slightly higher level and understand that, to get more accurate results, we have to do significantly better in two areas:

  1. Extraction — Given a set of documents, extract data and knowledge in a well-ordered manner to facilitate better and more accurate retrieval.
  2. Retrieval — When a query arrives, apply some pre-retrieval and post-retrieval steps and better contextualize the query with the extracted "knowledge" to get better results.

Now, let's look at the specific steps that reflect the current state of the project. I have put some pseudocode in each step to make the overall article easier to follow. For those looking for code-level specifics, you can check out my GitHub repository where I published the code.

1. Extracting knowledge from PDF

[Figure: Extraction process]

When a PDF enters the system, several things need to happen: it is stored, processed, chunked, embedded, and enriched with structured knowledge. Here’s how the whole process unfolds:

Step 1: Upload and record creation

  • The user uploads a PDF (other file types like audio and video are coming soon).
  • The system saves the file to disk (in the near future I will move this to an AWS S3 bucket to better support enterprise use cases).
  • A record is inserted into the database and a processing-status entry is created. For the database, I use SingleStore because it supports multiple data types, mixed (hybrid) search, and single-query retrieval.
  • A background task is queued to process the PDF asynchronously. This led me to dig deeper into how long the overall steps took, and I ultimately decided to use Redis and Celery for task processing and tracking (a rough sketch follows the pseudocode below). This did prove to be a bit of a pain to deploy, but we can get back to that later.

# Pseudocode
save_file_to_disk(pdf)
db_insert(document_record, status="started")
queue_processing_task(pdf)
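For reference, here is a minimal sketch of how the background task could be wired up with Celery and Redis. The broker URL, task name, and the update_status/process_pdf helpers are illustrative assumptions, not the exact code from the repository.

# Minimal Celery + Redis sketch (names and broker URL are assumptions)
from celery import Celery

celery_app = Celery("rag_pipeline", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=3)
def process_pdf_task(self, document_id: str, file_path: str):
    try:
        update_status(document_id, "processing")   # hypothetical helper
        process_pdf(document_id, file_path)        # parse, chunk, embed, extract knowledge
        update_status(document_id, "completed")
    except Exception as exc:
        update_status(document_id, "failed")
        raise self.retry(exc=exc, countdown=30)    # simple retry with a delay

# Enqueued from the upload handler:
# process_pdf_task.delay(document_id, saved_path)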

Step 2: Parsing and chunking the PDF

  • The file is opened and checked for size limits and password protection, because we want to fail early if the file is unreadable.
  • Content is extracted as text/markdown. This is another big topic. I was using PyMuPDF for the extraction, but then I discovered LlamaParse, and switching made my life a lot easier. The free tier of LlamaParse allows parsing 1,000 documents per day and has many additional features for returning responses in different formats and better extracting tables and images from PDFs.
  • The document structure is analyzed (e.g., table of contents, headings, etc.).
  • The text is split into meaningful chunks using a semantic approach (a sketch follows the pseudocode below). This is where I use Gemini Flash 2.0, because of its huge context size and significantly lower pricing.
  • If semantic chunking fails, the system falls back to simpler segmentation.
  • Overlap is added between chunks to maintain context.
validate_pdf(pdf)
text = extract_text(pdf)
chunks = semantic_chunking(text) or fallback_chunking(text)
add_overlaps(chunks)
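As a rough illustration, semantic chunking with Gemini Flash 2.0 could look something like the sketch below. The prompt wording, the chunk delimiter, and the google-generativeai usage are my assumptions about the approach, not the project's exact implementation.

# Sketch: semantic chunking with a large-context Gemini model (illustrative)
import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

def semantic_chunking(text: str) -> list:
    """Ask the model to split the text at semantic boundaries."""
    model = genai.GenerativeModel("gemini-2.0-flash")
    prompt = (
        "Split the following document into self-contained semantic sections. "
        "Return the sections separated by the delimiter <<<CHUNK>>>.\n\n" + text
    )
    response = model.generate_content(prompt)
    return [c.strip() for c in response.text.split("<<<CHUNK>>>") if c.strip()]

def fallback_chunking(text: str, size: int = 2000, overlap: int = 200) -> list:
    """Simple fixed-size fallback if the semantic split fails."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]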

Step 3: Generate Embeddings

  • Each chunk is converted into a high-dimensional vector via an embedding model. I used 1,536 dimensions because I used OpenAI's text-embedding-ada-002 model.
  • Next, both the chunk and its embedding are stored in the database (a sketch follows the pseudocode below). In SingleStore, we store the chunk text and its embedding in two different columns of the same table for easy maintenance and retrieval.

# Pseudocode
for chunk in chunks:
    vector = generate_embedding(chunk.text)
    db_insert(embedding_record, vector)
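For illustration, the embedding step could look roughly like this, assuming the OpenAI Python client and a SingleStore table with text and vector columns (the table and column names mirror the ones used in the final query, but are otherwise assumptions):

# Sketch: embed chunks and store them alongside the text (names are illustrative)
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_embedding(text: str) -> list:
    response = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return response.data[0].embedding  # 1536-dimensional vector

def store_chunk(cursor, doc_id, chunk_text: str):
    vector = generate_embedding(chunk_text)
    cursor.execute(
        "INSERT INTO Document_Embeddings (doc_id, content, embedding) VALUES (%s, %s, %s)",
        (doc_id, chunk_text, json.dumps(vector)),  # vector passed as a JSON array string
    )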

Step 4: Extract entities and relationships using LLMs

  • This step has a big impact on the overall accuracy. I send the semantically organized chunks to OpenAI and, with some specific prompts, ask it to return the entities and relationships found in each chunk (a sketch follows the pseudocode below). The results include key entities (name, type, description, aliases).
  • Relationships between entities are mapped out. Here, if we find an entity more than once, we update its record (e.g., its category) with the enriched data instead of adding duplicates.
  • The extracted "knowledge" is then stored in structured tables.

# Pseudocode
for chunk in chunks:
    entities, relationships = extract_knowledge(chunk.text)
    db_insert(entities)
    db_insert(relationships)
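The extraction call itself could look roughly like the following sketch. The prompt, the JSON schema, and the model name are placeholders for illustration, not the exact prompts used in the project.

# Sketch: entity/relationship extraction with an LLM (prompt and schema are illustrative)
import os
import json
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

EXTRACTION_PROMPT = (
    "Extract entities and relationships from the user's text. Return JSON of the form "
    '{"entities": [{"name": "", "type": "", "description": "", "aliases": []}], '
    '"relationships": [{"source": "", "target": "", "relation_type": ""}]}'
)

def extract_knowledge(chunk_text: str):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": chunk_text},
        ],
    )
    data = json.loads(response.choices[0].message.content)
    return data.get("entities", []), data.get("relationships", [])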

Step 5: Final processing status

  • If everything is processed correctly, the status is updated to "completed". This way the front end can keep polling and always display the correct status.
  • If a failure occurs, the status is marked as "failed" and any temporary data is cleaned up.

# Pseudocode
if success:
    update_status("completed")
else:
    update_status("failed")
    cleanup_partial_data()

When these steps are completed, we now have semantic chunks, their corresponding embeddings, and the entities and relations found in the document, all in tables that reference each other.

We are now ready to move on to the next step, which is retrieval.

2. Retrieving knowledge (the RAG pipeline)

[Figure: Retrieval process]

Now that the data is structured and stored, we need to retrieve it efficiently when a user asks a question. The system processes the query, finds the relevant information, and generates a response.

Step 1: User Query

  • Users submit queries to the system.

Step 2: Preprocess and expand the query

  • The system normalizes the query (removes punctuation, normalizes whitespace, expands it with synonyms). Here I use an LLM again (Groq, for faster processing); a sketch follows the pseudocode below.

query = preprocess_query(query)
expanded_query = expand_query(query)
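A rough sketch of the preprocessing and expansion, using the Groq Python client (the model name and the expansion prompt are illustrative assumptions):

# Sketch: query normalization and LLM-based expansion via Groq (illustrative)
import os
import re
from groq import Groq

groq_client = Groq(api_key=os.getenv("GROQ_API_KEY"))

def preprocess_query(query: str) -> str:
    query = re.sub(r"[^\w\s]", " ", query)       # remove punctuation
    return re.sub(r"\s+", " ", query).strip()    # normalize whitespace

def expand_query(query: str) -> str:
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Rewrite this search query as a single line, adding useful "
                       "synonyms and related terms: " + query,
        }],
    )
    return response.choices[0].message.content.strip()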

Step 3: Embed query and search vectors

  • The query is embedded into a high-dimensional vector. I use the same ada model that I used during extraction.
  • The system uses semantic search to find the closest matches among the document embeddings. I use dot_product in SingleStore to achieve this (a sketch follows the pseudocode below).

query_vector = generate_embedding(expanded_query)
top_chunks = vector_search(query_vector)
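For illustration, the dot-product search in SingleStore could look like the sketch below. It mirrors the session-variable pattern and the <*> (dot product) operator used in the full hybrid query later in the article; exact casting requirements may vary with the SingleStore version.

# Sketch: semantic search via dot product in SingleStore (illustrative)
import json

def vector_search(cursor, query_vector, top_k: int = 10):
    # Store the query vector in a session variable, as in the full query below
    cursor.execute("SET @qvec = %s", (json.dumps(query_vector),))
    cursor.execute(
        """
        SELECT doc_id, content, (embedding <*> @qvec) AS vector_score
        FROM Document_Embeddings
        ORDER BY vector_score DESC
        LIMIT %s
        """,
        (top_k,),
    )
    return cursor.fetchall()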

Step 4: Full text search

  • A full-text search is performed in parallel to complement the vector search. In SingleStore, we use the MATCH statement to achieve this (a sketch follows the pseudocode below).

text_results = full_text_search(query)
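A sketch of the complementary full-text search, assuming a full-text index exists on the Document_Embeddings table (the MATCH(TABLE ...) form is the one used in the final query):

# Sketch: full-text search in SingleStore (assumes a full-text index on the table)
def full_text_search(cursor, query: str, top_k: int = 10):
    cursor.execute(
        """
        SELECT doc_id, content,
               MATCH(TABLE Document_Embeddings) AGAINST(%s) AS text_score
        FROM Document_Embeddings
        WHERE MATCH(TABLE Document_Embeddings) AGAINST(%s)
        ORDER BY text_score DESC
        LIMIT %s
        """,
        (query, query, top_k),
    )
    return cursor.fetchall()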

Step 5: Merge and rank results

  • The vector and text search results are merged and re-ranked based on relevance (a sketch follows the pseudocode below). One of the configurations we can tweak here is the number of top-k results; I got better results with top k = 10 or higher.
  • Results with low confidence are filtered out.

merged_results = merge_and_rank(top_chunks, text_results)
filtered_results = filter_low_confidence(merged_results)
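A minimal sketch of the merge-and-rank step using a weighted score. The 0.7/0.3 weights mirror the ones in the single-query version later; the confidence threshold is an illustrative assumption.

# Sketch: merge vector and text results with a weighted score (weights/threshold illustrative)
def merge_and_rank(top_chunks, text_results, w_vec=0.7, w_text=0.3, top_k=10):
    merged = {}
    for row in top_chunks:
        entry = merged.setdefault(row["doc_id"], {"content": row["content"], "vector_score": 0.0, "text_score": 0.0})
        entry["vector_score"] = row["vector_score"]
    for row in text_results:
        entry = merged.setdefault(row["doc_id"], {"content": row["content"], "vector_score": 0.0, "text_score": 0.0})
        entry["text_score"] = row["text_score"]
    ranked = [
        {"doc_id": doc_id, **vals, "combined_score": w_vec * vals["vector_score"] + w_text * vals["text_score"]}
        for doc_id, vals in merged.items()
    ]
    ranked.sort(key=lambda r: r["combined_score"], reverse=True)
    return ranked[:top_k]

def filter_low_confidence(results, threshold=0.5):
    return [r for r in results if r["combined_score"] >= threshold]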

Step 6: Retrieve entities and relationships

  • Next, if entities and relationships exist for a retrieved chunk, they are included in the response (a sketch follows the pseudocode below).

# Pseudocode
for result in filtered_results:
    entities, relationships = fetch_knowledge(result)
    enrich_result(result, entities, relationships)
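A sketch of the knowledge lookup, assuming the schema implied by the final query (Relationships carries the doc_id, and Entities are reached through the relationship endpoints):

# Sketch: fetch stored entities/relationships for a retrieved chunk (schema assumed)
def fetch_knowledge(cursor, doc_id):
    cursor.execute(
        "SELECT relationship_id, source_entity_id, target_entity_id, relation_type "
        "FROM Relationships WHERE doc_id = %s",
        (doc_id,),
    )
    relationships = cursor.fetchall()
    cursor.execute(
        """
        SELECT DISTINCT e.entity_id, e.name, e.description, e.category
        FROM Entities e
        JOIN Relationships r ON e.entity_id IN (r.source_entity_id, r.target_entity_id)
        WHERE r.doc_id = %s
        """,
        (doc_id,),
    )
    entities = cursor.fetchall()
    return entities, relationships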

Step 7: Generate Final Answer

  • Now we take the overall context, enhance it with prompts, and send the relevant data to the LLM (I used gpt3o-mini) to generate the final response; a sketch follows the pseudocode below.

final_answer = generate_llm_response(filtered_results)
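The final generation step could look roughly like this; the prompt wording and the model name are placeholders rather than the exact ones used in the project.

# Sketch: build the augmented context and ask the LLM for the final answer (illustrative)
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def generate_llm_response(query: str, filtered_results) -> str:
    blocks = []
    for r in filtered_results:
        entity_names = ", ".join(e["name"] for e in r.get("entities", []))
        blocks.append(f"Source (doc {r['doc_id']}):\n{r['content']}\nEntities: {entity_names}")
    context = "\n\n".join(blocks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer strictly from the provided context and cite doc ids."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content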

Step 8: Return the answer to the user

  • The system sends the response back as a structured JSON payload, along with the original database search results, so the source can be identified for further debugging and tuning if needed.

# Pseudocode
return_response(final_answer)

Now, here comes the kicker. Overall, the retrieval process took about 8 seconds for me, which is unacceptable.

When tracing the calls, I found that the highest response times were coming from LLM calls (around 1.5 to 2 seconds). SingleStore database queries consistently returned in 600 milliseconds or less. After switching to Groq for some of the LLM calls, the overall response time dropped to 3.5 seconds. I think this could be improved further if we made some parallel calls instead of serial, but that’s another project.

Now for the key change.

Given that we are using SingleStore, I wanted to see if I could do the entire retrieval with just one query, not only because it would be easier to manage, update, and improve, but also because I wanted better response times from the database. The assumption here is that LLM models will get better and faster in the near future, and I have no control over that (of course, if you are really serious about latency, you can deploy a local LLM on the same network).

Here is the code (a single file for convenience) that now performs the entire retrieval in a single query.

import os
import json
import mysql.connector
from openai import OpenAI

DB_CONFIG = {
    "host": os.getenv("SINGLESTORE_HOST", "localhost"),
    "port": int(os.getenv("SINGLESTORE_PORT", "3306")),
    "user": os.getenv("SINGLESTORE_USER", "root"),
    "password": os.getenv("SINGLESTORE_PASSWORD", ""),
    "database": os.getenv("SINGLESTORE_DATABASE", "knowledge_graph"),
}

def get_query_embedding(query: str) -> list:
    """
    Generate a 1536-dimensional embedding for the query using the OpenAI embeddings API.
    """
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=query
    )
    return response.data[0].embedding

def retrieve_rag_results(query: str) -> list:
    """
    Execute the hybrid search SQL query in SingleStore and return the top-ranked results.
    """
    conn = mysql.connector.connect(**DB_CONFIG)
    cursor = conn.cursor(dictionary=True)

    # Embed the query once and expose it to the SQL as a session variable
    query_embedding = get_query_embedding(query)
    embedding_str = json.dumps(query_embedding)
    cursor.execute("SET @qvec = %s", (embedding_str,))

    # Hybrid search: dot-product vector score (70%) + full-text score (30%),
    # with entities and relationships aggregated per document
    sql_query = """
    SELECT
        d.doc_id,
        d.content,
        (d.embedding <*> @qvec) AS vector_score,
        MATCH(TABLE Document_Embeddings) AGAINST(%s) AS text_score,
        (0.7 * (d.embedding <*> @qvec) + 0.3 * MATCH(TABLE Document_Embeddings) AGAINST(%s)) AS combined_score,
        JSON_AGG(DISTINCT JSON_OBJECT(
            'entity_id', e.entity_id,
            'name', e.name,
            'description', e.description,
            'category', e.category
        )) AS entities,
        JSON_AGG(DISTINCT JSON_OBJECT(
            'relationship_id', r.relationship_id,
            'source_entity_id', r.source_entity_id,
            'target_entity_id', r.target_entity_id,
            'relation_type', r.relation_type
        )) AS relationships
    FROM Document_Embeddings d
    LEFT JOIN Relationships r ON r.doc_id = d.doc_id
    LEFT JOIN Entities e ON e.entity_id IN (r.source_entity_id, r.target_entity_id)
    WHERE MATCH(TABLE Document_Embeddings) AGAINST(%s)
    GROUP BY d.doc_id, d.content, d.embedding
    ORDER BY combined_score DESC
    LIMIT 10;
    """

    cursor.execute(sql_query, (query, query, query))
    results = cursor.fetchall()
    cursor.close()
    conn.close()
    return results

Lessons Learned

As you can imagine, it's one thing to do a "simple" RAG chat with your PDFs, and another to reach over 80% accuracy while keeping latency low. Add structured data to the mix, and you've got yourself deep into a project that has almost become a full-time job.

I plan to continue tweaking and improving and blogging about this project, and in the short term I’m looking for some ideas to explore next.

Improving accuracy

Extraction

  1. Externalize the entity-extraction prompts and experiment with them.
  2. Summarize the chunks before processing; I suspect this might have a non-trivial effect.
  3. Add better retry mechanisms for failures at the different steps.

Retrieval

  1. Use better query-expansion techniques (custom dictionaries, industry-specific terminology).
  2. Fine-tune the weights for vector and text search (these have been externalized in the configuration YAML file).
  3. Add a second LLM pass to re-rank the top results (I'd be cautious about this given the latency tradeoff).
  4. Adjust the search window size to optimize recall and relevance.
  5. Generate chunk-level summaries instead of sending the original text to the LLM.

Summary

In many ways, I document this to remind myself of the enterprise requirements I need to consider when building an enterprise-level RAG or KAG. If you as a reader find some of what I have done to be naive, or have other ideas for how I can improve, please feel free to contact me here or on LinkedIn so we can work together.