Still relying on RAG to search for documents? Here’s a tip to make AI answers more reliable!

Use AI to optimize documentation queries and make programming more efficient!
Core content:
1. Challenges and the current state of documentation lookup in AI-assisted programming
2. How NotebookLM and the Gemini engine perform on documentation queries
3. Strategies and a practical walkthrough for optimizing documentation queries with large language models
In the course of AI-assisted programming, have you ever run into this challenge: you have a large pile of documentation, but you urgently need to find one specific piece of information in it?
A few years ago, the answer was simple and straightforward: read the documentation!
Nowadays, many people have turned to a smarter approach: ask GPT directly!
Not to mention that with programming assistants like Cursor and Cline, finding answers becomes easier.
While this approach works most of the time, you will occasionally find that the answers you get are incomplete or half-baked, far from what you actually need.
Example analysis: NotebookLM performance
Let me share my own experience with NotebookLM (which is powered by the Gemini engine) as an example.
I fed NotebookLM the documentation for Dify (an excellent open-source platform for building LLM chat applications). I then asked how to create a chat assistant, but it either had no idea or returned only fragments of an answer.
Why do AI answers appear “random”?
In fact, even powerful tools such as Cursor and Windsurf may produce quite random answers.
Why is this happening?
The main reasons are:
- Document size vs. model capacity: depending on the size of the documentation and the token limit of the large language model (LLM), these tools often fall back, implicitly, on Retrieval-Augmented Generation (RAG) to process the documents.
- Retrieval mechanism: simply put, the system first builds a vector database from the documents and then searches against that database, passing only the retrieved fragments to the model (see the sketch below).
- Current challenges: although this works as a stopgap in many scenarios, the quality of RAG-based retrieval is still an active research topic in generative AI, so it is not easy to get a complete answer in one shot.
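For intuition, here is a deliberately simplified sketch of what such implicit retrieval roughly looks like. It is not any particular tool's actual implementation: it uses TF-IDF similarity instead of learned embeddings, and the chunks and question are invented for illustration. The point is that only the top-scoring chunks ever reach the model, so anything the retriever misses is invisible to the LLM.

```python
# Simplified illustration of RAG-style retrieval (not any tool's real pipeline).
# Uses TF-IDF + cosine similarity instead of learned embeddings to stay self-contained.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Pretend these are chunks split out of a large documentation set.
chunks = [
    "To create a chat assistant in the console, click 'Create App' and choose 'Chatbot'.",
    "Workflows let you chain multiple LLM calls and tools together.",
    "API keys can be managed under Settings > Model Providers.",
]

question = "How do I create a chat assistant?"

# Build a tiny "vector database" over the chunks, then score the question against it.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(chunks + [question])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Only the top-k chunks are passed on to the LLM; everything else never reaches it.
top_k = scores.argsort()[::-1][:2]
for i in top_k:
    print(f"score={scores[i]:.2f}  {chunks[i]}")
```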
A simpler solution: make the entire document the context
Fortunately, with large language models like Gemini that support context windows of up to 2 million tokens, we have a simpler approach.
Specifically:
Depending on the size of your repository, you can pass the entire documentation set as context to the LLM. In practice, this means pasting the whole document into a chat window that supports an extra-large context (such as ChatGPT, Claude or Gemini) and then asking your questions against this "complete" context.
This method is not only simple and intuitive, but also greatly reduces the risk of information loss due to occasional deviations in RAG retrieval.
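As a minimal sketch of what this means in practice (the file name and the question below are placeholders, not taken from the article), the whole trick is simply reading the packed documentation and putting it in front of your question:

```python
# Minimal sketch: use the entire documentation as the prompt context.
# "dify-docs.md" and the question are placeholders for your own files and queries.
from pathlib import Path

docs = Path("dify-docs.md").read_text(encoding="utf-8")

prompt = (
    "Answer strictly based on the documentation below.\n\n"
    f"<documentation>\n{docs}\n</documentation>\n\n"
    "Question: How do I create a chat assistant?"
)

# Paste `prompt` into a large-context chat window, or send it via an API (see Step 3).
print(f"Prompt length: {len(prompt)} characters")
```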
Real-life example: Dify creates a chat assistant
Does it sound a bit abstract? Let's take a real project as an example and see how to use this approach to create a chat assistant with Dify. Of course, the method applies to any other documentation, as long as you can get at its source files.
Step 1: Find the document repository and directory
Open source projects usually have public documentation. You can quickly locate the corresponding Git repository and specific directory by clicking the "Edit" button on the documentation page.
Taking https://docs.dify.ai/ as an example, you can easily find the folder where the documentation files live.
Note: Specifying the exact folder can reduce the amount of files passed to the model.
If you pass the entire repository (including a lot of unnecessary source code) as the context, it is often not worth the effort. The better approach is to extract only the Markdown documents. In the case of Dify, for example, all the documentation is concentrated in a single folder.
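As a small sketch of this step, the snippet below clones the docs repository and collects only the Markdown files. The repository URL and folder name are assumptions for illustration; use whatever you find via the "Edit" button on the documentation page.

```python
# Sketch: clone the docs repository and collect only the Markdown files.
# The repository URL and folder name are assumptions; adjust to what you find
# via the "Edit" button on the documentation page.
import subprocess
from pathlib import Path

repo = "https://github.com/langgenius/dify-docs"  # assumed docs repository
docs_dir = "en"                                   # assumed folder holding the Markdown docs

subprocess.run(["git", "clone", "--depth", "1", repo, "dify-docs"], check=True)

md_files = sorted(
    p for p in Path("dify-docs", docs_dir).rglob("*") if p.suffix in {".md", ".mdx"}
)
print(f"Found {len(md_files)} Markdown files")
```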
Step 2: Compress the file
Now that you have found the document directory, you need to "join" the Markdown files in the entire directory into a single file.
This is where an open-source tool called Repomix comes in handy. You can run it with the npx repomix command, or use its online version directly at https://repomix.com/.
The operation process is as follows:
- Copy the Git repository link (there is no need to include the path).
- Fill in the file path you want to include in the "include pattern" field.
- Select Markdown as the output format.
- Finally, click "pack".
In just a few seconds you will have a file with all the contents of the selected folder in one long string, which you can choose to copy or download.
At the same time, the left side of the page shows the token count of this packed version; in my case it was about 475,000 tokens.
(That is more than Claude Sonnet's context window can hold, but it still fits comfortably within Gemini's one-million or even two-million-token limit.)
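If you prefer not to use Repomix, a few lines of Python can do roughly the same job. The folder and file names below carry over the assumptions from Step 1, and the token count uses OpenAI's tiktoken tokenizer only as a rough proxy, since Gemini counts tokens with its own tokenizer.

```python
# Rough stand-in for Repomix: join every Markdown file in the docs folder into one file
# and estimate its size in tokens. tiktoken is OpenAI's tokenizer, used here only as a
# rough approximation; Gemini tokenizes differently.
from pathlib import Path
import tiktoken

docs_dir = Path("dify-docs/en")  # assumed folder from Step 1
parts = []
for path in sorted(docs_dir.rglob("*.md")):
    parts.append(f"\n\n---\nFile: {path}\n---\n\n{path.read_text(encoding='utf-8')}")

packed = "".join(parts)
Path("packed-docs.md").write_text(packed, encoding="utf-8")

enc = tiktoken.get_encoding("cl100k_base")
print(f"~{len(enc.encode(packed)):,} tokens")
```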
Step 3: "Talk" with the document
Now, once you have saved the entire document to a file or copied it to your clipboard, you can head over to Gemini (or Claude, if your packed document is small enough to fit within its context window).
Next, you can start asking questions or let AI automatically generate code based on this complete document context.
In my example, I used Gemini through the free Google AI Studio. Unlike before, when I asked again how to add an assistant, I got an extremely comprehensive and correct answer!
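If you would rather script this step than paste into AI Studio, here is a hedged sketch using the google-generativeai Python package. The model name, environment variable, and file names are assumptions; check the current API documentation before relying on them.

```python
# Sketch: ask Gemini a question with the packed documentation as context.
# Model name, env var, and file names are assumptions; adapt them to your setup.
import os
from pathlib import Path
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # a large-context model

docs = Path("packed-docs.md").read_text(encoding="utf-8")
prompt = (
    "Answer only from the documentation below.\n\n"
    f"{docs}\n\n"
    "Question: How do I create a chat assistant in Dify?"
)

response = model.generate_content(prompt)
print(response.text)
```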
Summary
By following the steps above, you can now use the same prompt (with a whole document as context) to ask questions or generate code, without having to rely on perfunctory RAG search results or "made-up" answers that sound reasonable but don't match the actual document.
A few friendly tips:
- Token limits are the main constraint: this approach is bounded by the token limit of the model you use. Gemini, for example, supports up to 1 million tokens (2 million on some models), which is enough to hold a very large set of documents.
- Explore more possibilities: you can also look for related tools and apply the same idea inside coding assistants.