[RAG] What is the hardest part of a RAG project when helping a traditional enterprise transform with AI?

Written by
Clara Bennett
Updated on: June 19, 2025

An in-depth look at the challenges RAG projects face in the AI transformation of traditional enterprises, plus hands-on lessons learned.

Core content:
1. The difficulties of data integration and how to cope with them
2. The pain points of data cleaning and preprocessing
3. The key steps of knowledge extraction and structuring


I'm Bo Ge, and I focus on large models and recommendation systems, continuously sharing AI algorithm interview knowledge, hands-on projects, and interview experience.

[Large model / search-ads-recommendation one-on-one personalized project coaching] and [14-week large-model hands-on autumn recruitment sprint camp]: for details, add me on WeChat: Burger_AI




Recently, a friend of mine took on a job helping a traditional enterprise transform with AI. The boss was very ambitious and decreed that the first project would be RAG (Retrieval-Augmented Generation): he wanted to turn the massive data accumulated over decades into an intelligent knowledge base, intelligent customer service, and so on. Sounds simple, right? In practice it was a mixture of pain and joy.

Today I'd like to use this project as an occasion to talk about which parts of building a RAG system are the hardest and most frustrating. Don't be fooled by the countless online tutorials where a LangChain demo runs in a few lines of code. In a production environment, there are more pitfalls than stars in the sky.

1. Everything is hard at the start: the mountain of data alone can give you a very hard time

Do you think the hardest part of RAG is tuning the large model? Naive! Let me tell you: the most troublesome, most time-consuming, and most likely to stall your project is, without question, the data! Especially in a traditional enterprise with decades of history, the data situation is one "surprise" after another.

  1. Data "archaeology" and "migration" (Data Ingestion & Integration):
    1. Pain point:  Customer data is so diverse. There are SQL databases in outdated systems, Excel, Word, and PPT files stored by various departments, a bunch of scanned PDF contracts and reports, and even some "antiques" that were printed out and scanned back. If you want to "invite" these scattered data out and then "move" them to a unified place, the communication cost and technical difficulty are simply too high!
    2. Complaint:  In order to get the data interface of a certain department, I had to attend many coordination meetings with a smile on my face and fill out many application forms. Some systems are so old that they don’t even have documentation, so we can only rely on guesswork. This is not AI, it’s simply data archaeologist + diplomat!
  2. Data Cleaning & Preprocessing:
    1. Pain point:  "Garbage in, garbage out" is the iron law of RAG. When we finally got the data and opened it up, we were shocked: typos, mojibake, chaotic formatting, redundant information, stale information... Feed that into the model uncleaned and you are feeding it poison. (A rough cleaning pass is sketched after this list.)
    2. Complaint:  The scanned PDFs were the worst: the OCR output was downright "abstract". Misaligned tables and garbled formulas were everyday occurrences. Do you know how fiddly OCR parameter tuning and post-processing are? I lost a lot of hair squeezing out each extra point of accuracy. If you think you can skimp on this step and call it done, the project is already half dead.
  3. Extracting "real gold" from the "hodgepodge" (Knowledge Extraction & Structuring):
    1. Pain point:  Cleaning alone is not enough; you still have to pull usable knowledge points out of the text. In a several-hundred-page report, what are the core ideas? What are the key figures? Building a knowledge graph would be ideal, but the workload... (A small LLM-assisted extraction sketch also follows this list.)
    2. Complaint:  Some knowledge is buried so deep that it needs manual labeling or complex NLP models to extract. This step directly determines whether retrieval can later find the truly relevant "material".
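Since so much of the pain lives in ingestion, here is a minimal loader sketch in Python, assuming born-digital PDF and Word files. The paths and the record schema are hypothetical, and every real source (legacy SQL, scanned PDFs) would need its own connector:

```python
# A minimal ingestion sketch: pull text out of a few common office formats
# into one normalized list of records. The record schema is hypothetical.
from pathlib import Path
from pypdf import PdfReader   # pip install pypdf
from docx import Document     # pip install python-docx

def load_pdf(path: Path) -> str:
    # Works for born-digital PDFs; scanned PDFs need an OCR pass instead.
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def load_docx(path: Path) -> str:
    return "\n".join(p.text for p in Document(str(path)).paragraphs)

LOADERS = {".pdf": load_pdf, ".docx": load_docx}

def ingest(root: str) -> list[dict]:
    records = []
    for path in Path(root).rglob("*"):
        loader = LOADERS.get(path.suffix.lower())
        if path.is_file() and loader:
            # Keep the source path as metadata so answers can cite it later.
            records.append({"source": str(path), "text": loader(path)})
    return records
```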
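For the cleaning step, a rough sketch of the kind of normalization pass described above. The regex rules are illustrative only, not a drop-in cleaner for any particular OCR engine:

```python
import re

def clean_text(raw: str) -> str:
    text = raw.replace("\x00", "")            # stray control bytes
    text = re.sub(r"-\n(?=[a-z])", "", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)    # collapse runs of blank lines
    # Drop lines that are mostly non-word noise (a crude OCR-garbage filter).
    lines = [ln for ln in text.splitlines()
             if not ln.strip()
             or len(re.findall(r"\w", ln)) / max(len(ln), 1) > 0.3]
    return "\n".join(lines).strip()
```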
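And for extraction, one hedged sketch of LLM-assisted distillation. Here `call_llm` is a hypothetical stand-in for whatever model API you use, and the prompt and output schema are purely illustrative:

```python
import json

EXTRACT_PROMPT = """Read the passage and return JSON with two fields:
"key_points": a list of the core claims, and
"key_figures": a list of {{"metric": ..., "value": ...}} pairs.

Passage:
{passage}"""

def extract_knowledge(passage: str) -> dict:
    raw = call_llm(EXTRACT_PROMPT.format(passage=passage))  # hypothetical helper
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Extraction is unreliable; route failures to a manual-review queue.
        return {"key_points": [], "key_figures": [], "needs_review": passage[:200]}
```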

2. Retrieval: how do you find the needle in the haystack, fast and accurately?

With the data finally sorted out, the next step is making the model "find it" and "find it right".

  1. The art of "dividing the cake": text chunking (Chunking Strategy):
    1. Pain point:  Long documents must be cut into small pieces before they can be fed to the model. How big should the pieces be? What should you cut along? Fixed-length splitting is simple but happily slices sentences in half; semantic splitting (say, by paragraph) works better but is harder to implement. (Both styles are sketched after this list.)
    2. Complaint:  This is pure trial and error. Cut too small and there isn't enough context for the model to understand; cut too large and you drown in noise, or exceed the model's "appetite" (its context window). There is no silver bullet; you can only refine it bit by bit against the customer's data.
  2. Labeling the knowledge: Embedding Model Selection:
    1. Pain point:  Text chunks must be converted into vectors before semantic similarity search is possible. Should you use a general-purpose embedding model or one fine-tuned on the customer's industry data? This directly determines retrieval accuracy. (A minimal embedding sketch follows this list.)
    2. Complaint:  A general model may stumble on the customer's professional jargon. Want to fine-tune? Then where do you get high-quality labeled data? Another headache.
  3. "Cast a wide net" and "fish accurately" (Recall & Precision ):
    1. Pain point:  Not only must we find all the relevant information (recall rate), but we must also ensure that the information we find is actually useful (precision rate). Vector search alone is sometimes not enough, and we may need to add keyword search or something else.
    2. Complaints:  Customers often ask: "I searched for this word, why didn't the relevant document come up?" or "Why did a bunch of irrelevant results come up?" At this time, you have to optimize the retrieval strategy, maybe do a hybrid search, and then add a re-ranking, which increases the complexity and computational effort.
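To make the chunking trade-off concrete, here is a minimal sketch of both styles. Sizes are in characters for simplicity; production code would count tokens with the model's tokenizer:

```python
def fixed_chunks(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    # Fixed-size window with overlap, so a sentence cut in half at a boundary
    # still appears whole in the neighboring chunk.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def paragraph_chunks(text: str, max_size: int = 800) -> list[str]:
    # Paragraph-based packing: accumulate paragraphs until the budget is spent.
    # An oversized single paragraph passes through as-is; real code would fall
    # back to the fixed splitter for those.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_size:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```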
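For the embedding step, a minimal sketch using sentence-transformers. The model name is just a common general-purpose example, not a recommendation; for domain jargon you would evaluate (and possibly fine-tune) on your own data:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # example general-purpose model

chunks = ["...your text chunks..."]
# normalize_embeddings=True makes the dot product equal cosine similarity.
chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode(["customer question"], normalize_embeddings=True)[0]
scores = chunk_vecs @ query_vec  # cosine similarity per chunk
```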
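And for hybrid retrieval, a sketch that fuses BM25 keyword ranking with vector ranking via Reciprocal Rank Fusion (RRF). The documents and the `vector_rank` placeholder are illustrative; a cross-encoder re-ranker would typically re-score the fused top-k afterwards:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    # Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

docs = ["the quarterly report shows revenue grew", "employee handbook leave policy"]
query = "revenue growth last quarter"

bm25 = BM25Okapi([d.split() for d in docs])  # naive whitespace tokenization
kw_scores = bm25.get_scores(query.split())
keyword_rank = sorted(range(len(docs)), key=lambda i: kw_scores[i], reverse=True)

vector_rank = [0, 1]  # placeholder: doc ids sorted by cosine score (see above)
top_ids = rrf([keyword_rank, vector_rank])[:5]
```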

3. Generation: getting the AI to "speak human", correctly and well

Even after retrieval, we still rely on the large model to compose the answer.

  1. Model's "appetite" problem: context length ( LLM Context Length):
    1. Pain point:  The context that the model can receive is limited. What if there are too many good things retrieved and it can’t be stuffed in?
    2. Complaint:  You have to find a way to filter out the most important information and feed it to the model. Although there are long-context models now, they are expensive and slow to reason about, and the boss's budget may not allow it.
  2. Don't let the AI "spout nonsense with a straight face" (Output Optimization & Hallucination Mitigation):
    1. Pain point:  Even with RAG, the large model will sometimes "freestyle" and say something unreliable. Getting it to answer honestly from the given material is a major challenge. (A grounding-prompt sketch follows this list.)
    2. Complaint:  Prompt tuning is table stakes; sometimes you also have to figure out how to verify what the model says and whether you can cite the source. What customers fear most is the AI talking nonsense and wrecking their reputation.
  3. Making the AI follow the "rules": Output Format Fixing:
    1. Pain point:  We often want the AI to output in a fixed format, such as generating JSON or filling out a form. The AI is not always so obedient. (A parse-validate-retry sketch also follows this list.)
    2. Complaint:  Ask it for a list and it may hand you a paragraph. Controlling output format is delicate work too.
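To make the context-budget idea concrete, a tiny packing sketch, assuming chunks arrive ranked best-first. The character-based budget is a stand-in for a real tokenizer count:

```python
def pack_context(ranked_chunks: list[str], budget_chars: int = 6000) -> str:
    # Take chunks in ranked order until the budget is spent; everything else
    # is dropped rather than truncated mid-chunk.
    picked, used = [], 0
    for chunk in ranked_chunks:
        if used + len(chunk) > budget_chars:
            break
        picked.append(chunk)
        used += len(chunk)
    return "\n\n---\n\n".join(picked)
```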
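For grounding, one common pattern is a prompt that numbers the sources, demands inline citations, and gives the model an explicit way out when the material lacks the answer. The wording below is illustrative, not a battle-tested template:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the numbered sources below.
Cite sources inline like [1] or [2]. If the sources do not contain the answer,
reply exactly: "I cannot find this in the provided documents."

Sources:
{sources}

Question: {question}
Answer:"""

def build_prompt(chunks: list[str], question: str) -> str:
    # Number each chunk so the model's citations can be traced back to files.
    sources = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return GROUNDED_PROMPT.format(sources=sources, question=question)
```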
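And for format fixing, a parse-validate-retry sketch. `call_llm` is again a hypothetical stand-in for your model API, and the required keys are made up for the example:

```python
import json

def get_structured(prompt: str, required: tuple = ("answer", "sources"), retries: int = 2):
    message = prompt + "\nReturn ONLY a JSON object with keys: " + ", ".join(required)
    for _ in range(retries + 1):
        reply = call_llm(message)  # hypothetical helper
        try:
            data = json.loads(reply)
            if all(k in data for k in required):
                return data
            message += "\nYour last reply was missing required keys. Reply with valid JSON only."
        except json.JSONDecodeError as e:
            message += f"\nYour last reply was not valid JSON ({e}). Reply with JSON only."
    return None  # caller falls back to a template or a human handoff
```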

4. An interviewer's perspective: don't stop at "Hello, RAG"

By the way, I recently helped interview several candidates and found that many who wrote "familiar with RAG" on their resumes fell apart under detailed questioning.

  • Asked about their text chunking strategy: fixed length, semantic splitting, or adaptive? What are the trade-offs? Many hesitated.
  • Asked how they selected the embedding model and settled on the dimension size, they couldn't explain it clearly.
  • Asked whether they used re-ranking and how to combine BM25 with vector search, they looked lost.
  • Asked whether they designed the prompts themselves and how they evaluated hit rate and hallucination, they were basically stuck.

To put it bluntly, many people have only run a demo with a framework like LangChain: chunk the text, dump it into a vector store, done. But the retrieval half of RAG is essentially the "recall, coarse ranking, fine ranking" pipeline of a recommendation system, full of subtle tricks, and the generation control in the second half is even more delicate. Without some algorithm background, or serious experience optimizing search and recommendation, these deeper questions are genuinely hard to answer.

5. So which part of RAG is the hardest?

After all this, if you ask me which part of this project was hardest, my vote still goes to data processing! It is the foundation of the entire system; if the foundation is shaky, every fancy technique built on top is useless. The workload is huge, the communication cost is high, and the technical details are endless. It is often the least visible work, yet the most deadly.

Second place goes to the fine polishing of the retrieval module. Quickly and accurately fishing the single most useful piece of information out of massive, messy, even low-quality data and handing it to the large model is a real test of skill.

Of course, the other links have their own difficulties, such as making model output more controllable and building a reliable evaluation system. RAG is not as simple as stacking building blocks; it is systems engineering that takes patience and craft, piece by piece.

That's enough for today. I hope my pit-stomping experience gives some inspiration to those of you also exploring the RAG road. The work is hard, but it genuinely helps the company, and the sense of accomplishment is real! Now back to moving bricks!




Finally, let me share our latest [14-week Large-Model Hands-On Autumn Recruitment Sprint Camp]; you can sign up by adding me on WeChat: Burger_AI


Student offer cases

1. After 2 months of one-on-one tutoring, landed an Ant LLM + recommendation algorithm offer

2. Switched from CV to recommendation algorithms and got offers from 3 big companies in 3 months

3. Entered the large-model field with zero background and landed a ByteDance summer internship in 2 months


Course Link:
https://course.terminiai.com/
Course Preview:
https://acnfgkjh8azx.feishu.cn/docx/CXXTdNy9motFD6xOE3AcZyyUnpd?from=from_copylink