Three reflections on the future of RAG, and two works in language and cultural analysis

An in-depth look at the current state and future trends of RAG, together with some innovative work in language and cultural analysis.
Core content:
1. Limitations of the RAG framework and its future direction
2. The potential and evolution path of GraphRAG
3. Two interesting works in language and cultural analysis: a historical newspaper corpus for studying evolution, and an analysis of large-model language
Today is Friday, May 16, 2025; Beijing, sunny.
As we have said before, a great deal has been done around knowledge graphs, RAG, and document intelligence. These areas developed rapidly in 2024, and many detailed solutions emerged: GraphRAG, DeepResearch, and the like appeared in an endless stream, there was a wave of document-parsing work such as MinerU and Mistral OCR, and Qwen3 has also been released.
Now that we have entered May, however, everything seems to have quieted down. Many GitHub projects are barely updated, and model releases are no longer so exciting. It feels as if we have slowly entered a period of silence, fatigue, or slow climbing. So what should we make of RAG at this point? Here are three thoughts.
In addition, let's look at two interesting works on language analysis: one is a historical dataset that can be used to study evolution, and the other is an analysis of the language of large models. Both are very interesting.
Grasping the fundamental issues, identifying root causes, and treating them in a specialized, systematic way leads to deeper thinking. Let's work on this together.
1. Three thoughts about RAG
1. "A demo in a week, but still not usable after half a year": this will remain true no matter how the technology develops.
Because RAG is a framework, not a cure-all, the fundamental solution to the current problems still lies in specific business scenarios and specific business problems, which must be handled in a tailored way, case by case. This rule will not change.

At the technical-solution level, many variants have emerged: query rewriting, query decomposition, and HyDE on the query side; all kinds of embeddings for vectorization; hybrid retrieval on the recall side; various ranking and rerank modules for sorting and denoising; assorted combination strategies at the prompt-assembly stage; citation generation and self-correction at the result-generation stage; wrapping a while loop around the whole pipeline to get AgenticRAG-style DeepResearch; or switching to ColQwen-style multimodal RAG. There are already plenty of these solutions (a minimal sketch of the pipeline follows below).

From the perspective of open-source frameworks, there are low-code drag-and-drop builders such as Coze and Dify, as well as multiple RAG framework libraries such as RAGFlow, LangChain, LlamaIndex, and Cherry Studio, which largely satisfy the need to stand up a RAG system in half a day. Looking across them, these frameworks are highly homogenized and hard to differentiate, so many simply bolt on multimodal data, connect real-time data, or invest in early document intelligence and deeper DeepDoc work.
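To make the shape of this pipeline concrete, here is a minimal Python sketch. Every function is a hypothetical stub standing in for a real component (an LLM call, a vector store, a reranker), not any particular framework's API.

```python
# Minimal sketch of the standard RAG pipeline stages named above.
# Every function is an illustrative stub, not any framework's real API.

def rewrite_query(query: str) -> list[str]:
    # Query side: rewriting/decomposition; HyDE would instead generate a
    # hypothetical answer and embed that.
    return [query]  # stub: a real system would call an LLM here

def hybrid_retrieve(query: str, k: int = 20) -> list[str]:
    # Recall side: merge dense (embedding) hits with sparse (BM25) hits.
    dense: list[str] = []   # stub for a vector-store search
    sparse: list[str] = []  # stub for a keyword/BM25 search
    return (dense + sparse)[:k]

def rerank(query: str, passages: list[str], k: int = 5) -> list[str]:
    # Ranking side: a cross-encoder would score (query, passage) pairs.
    return passages[:k]  # stub: keep top-k as-is

def generate(query: str, context: list[str]) -> str:
    # Generation side: assemble the prompt; citation generation and
    # self-correction would also live here.
    prompt = "Answer from the context.\n" + "\n".join(context) + "\nQ: " + query
    return prompt  # stub: a real system would send this prompt to the LLM

def agentic_rag(query: str, max_rounds: int = 3) -> str:
    # The "while loop outside the pipeline" that turns plain RAG into
    # AgenticRAG / DeepResearch: retrieve, generate, decide whether to loop.
    answer = ""
    for _ in range(max_rounds):
        passages: list[str] = []
        for q in rewrite_query(query):
            passages += hybrid_retrieve(q)
        answer = generate(query, rerank(query, passages))
        if answer:  # stub stopping rule; a real agent would self-assess
            break
    return answer

print(agentic_rag("What changed in RAG during 2024?"))
```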
Based on these observations and this reality, the standalone importance of RAG is actually decreasing: its cost-effectiveness is not high, and the needs it addresses are no longer urgent. Instead, it is turning into a small component slotted into the larger Agent system, and that is the direction in which it is currently evolving.
2. GraphRAG has few remaining points for further evolution
The reason GraphRAG generated so many ideas in the past lies in its basic characteristics, which hold for a Graph or a KG alike. First, it has structured properties: structured information extraction pulls out keywords, entities, and relationships, which refines and denoises the information and provides an anchor point for organizing and associating it. Second, the graph structure is relational: it provides connections that support subsequent multi-hop, breadth-first, or depth-first traversal, and it addresses the comprehensiveness of recall. For example, Microsoft's local search improves the comprehensiveness of answers about a given entity, and community-detection algorithms enable layer-by-layer summaries, which addresses document summarization. Third, quantitative algorithms can be run on the graph structure: graph algorithms such as PageRank, centrality measures, shortest paths, and node2vec provide ways to quantify the data (a toy illustration follows below).

So what is left to do? From the structural angle, one can add node types, for example introducing multimodality by linking images, text, video, paragraphs, and hierarchy levels, or designing more appropriate nodes, in preparation for multimodal RAG, that is, multimodal GraphRAG. From the angle of graph relevance, the work includes pruning paths, finding the paths that bear a causal relationship to the question itself, denoising more accurately, and producing more concise context, all of which is genuinely hard. From the quantitative angle, the next step is probably GNNs (graph neural networks), but the data modeling there is harder still.

That may be the situation within RAG itself, but one can still ride hot trends: for example, combining with Agent memory, doing graph-based memory management to enhance the agent's personalized experience, for which Graph is a natural fit (mem0^g and Graphiti are such solutions); or combining with R1-style chains of thought, using GraphRAG to synthesize interpretable reasoning data (MedReason is one such work). As long as the hot spots keep coming, careful study will always turn up a point that fits.
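To illustrate the three characteristics concretely, here is a toy sketch using networkx on a made-up mini knowledge graph; it is illustrative only, not any GraphRAG system's actual code.

```python
# Toy illustration of the three graph properties discussed above:
# structure, connectivity, and quantitative graph algorithms.
# The mini knowledge graph below is invented for demonstration.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Structure: entities as nodes, extracted relations as edges.
G = nx.Graph()
G.add_edges_from([
    ("aspirin", "inflammation"), ("aspirin", "COX-1"),
    ("COX-1", "prostaglandin"), ("ibuprofen", "COX-1"),
    ("ibuprofen", "pain"), ("pain", "inflammation"),
])

# Connectivity: multi-hop expansion around a query entity, the basis of
# local-search-style recall (everything within 2 hops of "aspirin").
two_hop = nx.single_source_shortest_path_length(G, "aspirin", cutoff=2)
print("2-hop neighborhood:", sorted(two_hop))

# Connectivity: community detection, the basis of layer-by-layer summaries.
for i, com in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}:", sorted(com))

# Quantitative: PageRank as one way to score node importance for ranking.
print("pagerank:", nx.pagerank(G))
```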
3. Document parsing in RAG is worth doing but does not require heavy investment
With the application of large models, and especially the RAG wave, demand for and attention to document parsing have risen rapidly; this has also been the focus of my own work over the past year. The supporting logic is that RAG depends on recalling document elements, so parsing quality directly affects chunking and question-answering quality. For example, when facing a non-editable PPT or PDF, traditional tools such as pdfminer and pypdf destroy the tables, images, and other structures inside, garbling the text.

This gave rise to a seemingly complete document-parsing agenda: layout analysis covering different domains and page sizes, detecting tables, figures, headers and footers, paragraphs, titles, table captions, figure captions, and formula regions on the page; table parsing, converting wired, wireless, missing-line, research-report, and long financial tables into HTML or LaTeX representations for later TableQA; formula parsing; paragraph and title OCR and reading order; watermark and seal removal; handwriting recognition; multi-column reading order; and so on. These are in fact the routine tasks of the classic OCR stack, developed over many years, not an emerging field.

In reality, however, although document parsing matters to RAG, it does not matter that much. Given current capabilities, large models digest plain elements such as paragraphs and titles well, while their ability to digest formulas, tables, charts, and images is still poor, even though those elements consume most of the parsing R&D time; and large models are actually quite tolerant of occasional paragraph misplacement and text disorder. On that premise, deep and complex document parsing is not cost-effective. The focus should be on handling the text well, doing solid layout analysis, and isolating the corresponding element regions, which already covers most scenario requirements (see the sketch below). Table parsing and formula parsing offer less return for the effort.
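Here is a sketch of that "do the text well, isolate the rest" approach. The detect_layout function is a hypothetical stand-in for any layout-detection model; only the pypdf calls are a real library API.

```python
# Sketch of the "do the text well, isolate the rest" approach argued above.
# detect_layout() is a hypothetical stand-in for a layout-detection model;
# only the pypdf calls are a real library API.
from pypdf import PdfReader

TEXT_LIKE = {"paragraph", "title"}
ISOLATE = {"table", "figure", "formula", "header", "footer"}

def detect_layout(page) -> list[dict]:
    # Hypothetical: a real system runs a detector trained on page layouts
    # and returns labelled regions with bounding boxes.
    return [{"category": "paragraph", "bbox": None}]

def parse_for_rag(pdf_path: str) -> list[dict]:
    chunks = []
    reader = PdfReader(pdf_path)
    for page_no, page in enumerate(reader.pages):
        for region in detect_layout(page):
            if region["category"] in TEXT_LIKE:
                # Handle the text part well (a real system would crop to
                # region["bbox"] before extracting).
                chunks.append({"page": page_no, "text": page.extract_text()})
            elif region["category"] in ISOLATE:
                # Isolate non-text regions rather than parsing them deeply;
                # a placeholder preserves reading order without garbled text.
                chunks.append({"page": page_no,
                               "text": f"[{region['category']} omitted]"})
    return chunks
```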
In the current document-parsing space, people also attach great weight to document hierarchy and hope for perfect markdown recovery. This is not actually a hard requirement for RAG; it belongs to a different field, document restoration, used mainly for format conversion and document recovery, as in pdf2docx and pdf2ppt. In that scenario, the document must be reproduced with as much fidelity as possible, with nothing omitted and every element accurate, and that is genuinely necessary. But note that this has little to do with RAG, and little to do with LLMs. Of course, if it is done well, RAG benefits, and that logic is sound, but it comes down to the input-output ratio and whether it is worthwhile.
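For the document-restoration use case, a minimal usage example of the pdf2docx library mentioned above; the file paths are placeholders.

```python
# Minimal usage of the pdf2docx library mentioned above: high-fidelity
# format conversion, the scenario where per-element accuracy matters.
# File paths are placeholders.
from pdf2docx import Converter

cv = Converter("report.pdf")
cv.convert("report.docx", start=0, end=None)  # convert all pages
cv.close()
```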
Therefore, as RAG continues to move forward, some things are worth summarizing and some can be predicted; these are all points worth discussing in depth.
2. Some interesting points about language
Let's continue with two interesting works.
The first is a dataset: a historical newspaper corpus of American newspapers covering 1780 to 1960. Melissa Dell and her collaborators processed nearly 20 million newspaper scans from American public libraries, yielding 1.14 billion pieces of text data: https://huggingface.co/datasets/dell-research-harvard/AmericanStories. This could be collected alongside our earlier People's Daily historical data, which is meaningful for historical research. It can also be fed to a large model for analysis, extracting viewpoints and evolutionary trends, all of which is worthwhile.
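Loading a slice of the corpus with the Hugging Face datasets library might look like the sketch below; the "subset_years" configuration name and the year_list argument are taken from the dataset card as I recall it, so treat them as assumptions to verify against the card.

```python
# Loading a slice of the AmericanStories corpus from Hugging Face.
# The "subset_years" configuration and year_list argument follow the
# dataset card; verify them against the card before use.
from datasets import load_dataset

ds = load_dataset(
    "dell-research-harvard/AmericanStories",
    "subset_years",
    year_list=["1900", "1910"],
    trust_remote_code=True,  # the dataset ships a custom loading script
)
print(ds)  # one split per requested year
```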
The second concerns large models themselves: they have become tools for mass-producing content. On that premise, if different large models are treated as different authors, it is meaningful to study what characteristics their content exhibits. See "A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias" (https://arxiv.org/pdf/2505.09056), which analyzes 3 million texts generated by 12 mainstream LLMs. It characterizes the internal similarity of each model's outputs (generally higher than that of humans), cross-model style differences, diversity, and potential biases; for example, GPT-4 has a distinctive vocabulary style yet stays close to GPT-3.5 in deep semantics, while Gemma-7B and Gemini-Pro are comparatively balanced on bias. All quite interesting.
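As a sketch (not the paper's exact methodology), one way to measure the internal similarity of a model's outputs is to embed the texts and average the pairwise cosine similarities; the encoder model name and sample texts below are placeholders.

```python
# A sketch (not the paper's exact methodology) of measuring the internal
# similarity of one model's outputs: embed the texts and average pairwise
# cosine similarity. Encoder name and texts are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_pairwise_similarity(texts: list[str]) -> float:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T                 # cosine similarity matrix
    mask = ~np.eye(len(texts), dtype=bool)  # drop the self-similarity diagonal
    return float(sims[mask].mean())

# Comparing this score across "authors" (different LLMs answering the same
# prompts) shows which model's outputs are more internally homogeneous.
print(mean_pairwise_similarity(["text a", "text b", "text c"]))
```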