Open-source tools for visualizing the large-model generation process, ZeroSearch misconceptions, and RAG document-parsing issues in open-source projects

This article explores visualization tools for the large-model generation process, and takes a closer look at common misconceptions about ZeroSearch and document-parsing issues in RAG.
Core content:
1. Introduction and application of visualization tools for the large-model generation process
2. Technical findings and optimization ideas around ZeroSearch misconceptions
3. Discussion of RAG document-parsing problems and solutions
Today we will look at open-source tools for visualizing the large-model generation process, misconceptions about ZeroSearch, and RAG document-parsing issues in open-source projects.
Specifically:
First, several tools for interpretable visualization of large models. By extracting the intermediate steps of large-model inference and combining them with visualization rendering, we can build some intuition about what the model is doing internally, which serves interpretability.
Second, some interesting technical findings and ideas, including misconceptions about ZeroSearch, and how to discover document-parsing problems and optimization ideas. What exactly are the problems? Take notes and build a deeper understanding.
Grasping the fundamental issues, identifying root causes, and treating them in a focused, systematic way leads to deeper thinking. Let's work through it together.
1. Several tools for visualizing the large model generation process
Let's look at progress on visualization tools for large-model generation. Several already exist; we examine three below. They can be used to explore a model's internal mechanisms, for interpretability research, and so on.
1. OpenMAV
OpenMAV (https://github.com/attentionmech/mav) visualizes an LLM's internal state in real time through an interactive terminal interface as it generates text, including attention distributions, MLP activation values, and token prediction probabilities. Its visualizations can be easily extended via plug-ins, and it supports multiple models such as GPT-2 and Llama.
In other words, at each predicted token it extracts the corresponding values of each layer's internal state and renders them dynamically. However, this kind of data only lets us observe internal value changes; how should they be explained? Any interpretation is largely after-the-fact guesswork, much like discourse analysis.
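The kind of terminal panel OpenMAV draws can be sketched in a few lines. The example below is a toy illustration of the idea only, not OpenMAV's actual code: the vocabulary, logits, and `render_step` helper are all invented for the demo; a real tool would pull the logits and layer states from a live model at each step.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def render_step(step, token, vocab, logits, bar_width=20):
    """Render one generation step as a text panel, OpenMAV-style:
    the emitted token plus the top candidates with probability bars."""
    probs = softmax(logits)
    ranked = sorted(zip(vocab, probs), key=lambda p: -p[1])[:3]
    lines = [f"step {step}: emitted {token!r}"]
    for tok, p in ranked:
        bar = "#" * int(p * bar_width)
        lines.append(f"  {tok:<8} {p:5.2f} |{bar}")
    return "\n".join(lines)

# Toy vocabulary and logits standing in for a real model's output.
vocab = ["the", "cat", "sat", "mat"]
panel = render_step(1, "cat", vocab, [2.0, 3.0, 0.5, 0.1])
print(panel)
```

In a real integration, the same panel would be redrawn after every decoding step, which is what produces the "dynamic visualization" effect described above.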
We have introduced this in "Disassembly of the Knowledge Graph + Knowledge Base RAG Project Yuxi-Know and the Large Model Reasoning Internal Visualization Tool OpenMAV" ( https://mp.weixin.qq.com/s/6mT-zDxv3n4vD7s9TijblQ ).
2. Logitloom
Logitloom (https://github.com/vgel/logitloom) explores token-trajectory trees in instruct and base models, providing a visual tree structure that intuitively displays token generation paths.
The result at each step is shown below; you can see the probability and token value at each node.
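The data structure behind such a tree can be sketched as a recursive top-k expansion. This is a toy sketch, not logitloom's implementation: `fake_next_probs` is a hard-coded stub standing in for a real model's next-token distribution, which logitloom would obtain by querying an actual LLM.

```python
def fake_next_probs(prefix):
    # Stub next-token distribution; a real tool would query an LLM here.
    table = {
        (): {"A": 0.7, "B": 0.3},
        ("A",): {"x": 0.6, "y": 0.4},
        ("B",): {"x": 0.9, "y": 0.1},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

def build_tree(prefix, depth, top_k=2):
    """Recursively expand the top-k continuations of a prefix,
    recording each branch's token and probability -- the tree
    structure that a trajectory visualizer then renders."""
    if depth == 0:
        return []
    probs = fake_next_probs(prefix)
    top = sorted(probs.items(), key=lambda kv: -kv[1])[:top_k]
    return [
        {"token": tok, "prob": p,
         "children": build_tree(prefix + [tok], depth - 1, top_k)}
        for tok, p in top
    ]

tree = build_tree([], depth=2)
```

Each node carries its token and probability, so a renderer can label branches exactly as in the screenshots above.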
3. ReasonGraph
ReasonGraph (https://github.com/ZongqianLi/ReasonGraph) visualizes the reasoning process of large models, targeting large reasoning models in particular.
It intuitively displays and analyzes the execution of multiple reasoning methods, integrating approaches such as chain-of-thought and self-refine.
For example: Chain-of-Thought: visualization of a linear reasoning process. It is presented as a linear directed graph, where each node represents a reasoning step and nodes are connected by arrows. Each node contains the specific reasoning text, and the last node (green) shows the final conclusion. Suitable for solving mathematical and logical reasoning problems.
Self-Refine: demonstration of an iterative optimization process. It uses an iterative loop structure, with the initial reasoning result shown as a blue node, improvement steps marked by yellow nodes, and arrows indicating the direction of refinement. Suitable for text-generation optimization and answer improvement.
Least-to-Most: visualization of problem decomposition and step-by-step solving. A hierarchical tree: the top node (light blue) is the original problem, middle nodes are the decomposed sub-problems, bottom nodes (green) are the solutions to each sub-problem, and a final summary node shows the complete solution. Suitable for decomposing complex problems and solving them step by step.
Self-Consistency: comparative analysis of multi-path reasoning. A parallel multi-path structure: multiple starting nodes represent different reasoning paths, each path develops independently, and the results are finally aggregated through a voting mechanism (central node). Suitable for problems that require multi-angle verification.
Tree-of-Thoughts: a tree-shaped display of branching reasoning. A full tree structure where each branch represents a possible reasoning direction; nodes can be dynamically expanded and collapsed, and both depth-first and breadth-first exploration are supported. Suitable for open-ended questions and comparing multiple solutions.
Beam Search visualization: a score-driven tree structure with a score attached to each node, a fixed-width search beam, and the optimal path highlighted in a dark color. Suitable for decision problems that require quantitative evaluation.
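To make the simplest of these concrete, the linear chain-of-thought layout can be emitted as Graphviz DOT text in a few lines. This is an illustrative sketch, not ReasonGraph's code: the `cot_to_dot` helper and its step strings are invented for the demo, and it only covers the linear case (one box per step, arrows in order, a green conclusion node).

```python
def cot_to_dot(steps, conclusion):
    """Emit a Graphviz DOT string for a linear chain-of-thought:
    one box node per reasoning step, arrows in sequence, and the
    final conclusion node filled green, as in the description above."""
    lines = ["digraph cot {", "  rankdir=LR;"]
    for i, s in enumerate(steps):
        lines.append(f'  s{i} [label="{s}", shape=box];')
    lines.append('  end [label="%s", shape=box, style=filled, '
                 'fillcolor=lightgreen];' % conclusion)
    for i in range(len(steps) - 1):
        lines.append(f"  s{i} -> s{i+1};")
    if steps:
        lines.append(f"  s{len(steps) - 1} -> end;")
    lines.append("}")
    return "\n".join(lines)

dot = cot_to_dot(["2 + 3 = 5", "5 * 4 = 20"], "answer: 20")
print(dot)
```

The resulting string can be rendered with any Graphviz front end; the other layouts (loop, hierarchy, parallel paths, beam) are variations on the same node/edge emission.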
2. Some interesting technical discoveries and exploration ideas
1. Misconceptions about ZeroSearch
First, an interesting piece of work released yesterday sparked discussion in the community: ZeroSearch (https://alibaba-nlp.github.io/ZeroSearch/, https://github.com/Alibaba-nlp/ZeroSearch, https://huggingface.co/collections/sunhaonlp/zerosearch, https://arxiv.org/pdf/2505.04588). The core idea is to simulate a search engine: a large model is fine-tuned, based on its own knowledge, to act as a search engine that generates relevant or noisy documents for a given query, thereby eliciting the search ability of the model being trained. The goal is to reduce the cost of training data: during training, interaction with real search engines (such as Google) is avoided, which cuts both cost and uncontrollability.
This is explained in the paper's introduction, as follows:
However, many articles have spread misconceptions, mainly the following: 1. ZeroSearch is not a search engine [it is a reinforcement learning framework for enhancing a large model's search-oriented reasoning behavior, not a search engine]; 2. ZeroSearch only saves on real-search-engine interaction during training, which has nothing to do with "surpassing Google search" (indeed, it distills positive and negative samples from the Google API for fine-tuning); 3. ZeroSearch is unrelated to search engines on the application side and has limited feasibility for deployment there. It is work aimed at the reinforcement-learning training stage, not at solving application problems such as AI search and RAG.
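The role the simulation plays in the training loop can be sketched abstractly. The sketch below is a toy stand-in, not ZeroSearch's code: in the real system a fine-tuned LLM generates the documents, whereas here `simulated_search` just fabricates labeled stubs, and the `noise_rate` knob mimics the paper's idea of deliberately mixing relevant and noisy results during rollouts.

```python
import random

def simulated_search(query, n_docs=3, noise_rate=0.5, seed=0):
    """Toy stand-in for a simulation LLM: given a query, return a
    mix of 'relevant' and 'noise' documents instead of calling a
    real search API. Real ZeroSearch generates document text with
    a fine-tuned model; here each doc is just a labeled stub."""
    rng = random.Random(seed)
    docs = []
    for i in range(n_docs):
        kind = "noise" if rng.random() < noise_rate else "relevant"
        docs.append({"rank": i, "kind": kind,
                     "text": f"[{kind}] document about {query!r}"})
    return docs

# During an RL rollout, the policy model would receive these docs
# in place of real search results.
results = simulated_search("capital of France", noise_rate=0.3)
```

The point of the interface is exactly the misconception above: the simulator only replaces the search engine *inside the training loop*; it is not itself a product-facing search component.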
2. RAG document parsing issues in open source projects
We can look at the changelogs and issues of GitHub projects to find ideas.
For example, in https://github.com/netease-youdao/QAnything/blob/qanything-v2/README_zh.md you can see the document-parsing problems the RAG framework QAnything v2 ran into when processing documents, and how they were fixed.
For example: more reasonable chunk lengths reduce the semantic and logical loss caused by paragraphs that are too small or incomplete; recognition of multi-column text has improved, the reading order can be determined intelligently, and paragraphs that span pages are handled correctly; the new version can recognize and save images and tables inside text paragraphs, ensuring no important information is missed.
Table parsing has been optimized, including the parsing and storage of long tables that exceed the chunk limit and of xlsx files with complex structure; text blocks are located and organized according to recognized subheadings in the document, making the parsed structure clearer and the information hierarchy more distinct; parsing results for web-page URLs have been optimized and are converted to .md format; and txt and docx files in more encodings are supported.
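The first fix, avoiding chunks that are too small or semantically incomplete, can be sketched as a greedy paragraph merger. This is a minimal illustration of the idea, not QAnything's actual implementation; the `merge_chunks` helper and its `min_len`/`max_len` thresholds are invented for the demo.

```python
def merge_chunks(paragraphs, min_len=80, max_len=400):
    """Greedily merge adjacent paragraphs until each chunk reaches
    min_len characters, without exceeding max_len -- one simple way
    to avoid tiny, semantically incomplete chunks. A paragraph that
    alone exceeds max_len is kept as its own oversized chunk."""
    chunks, buf = [], ""
    for p in paragraphs:
        candidate = (buf + "\n" + p).strip() if buf else p
        if len(candidate) <= max_len:
            buf = candidate
            if len(buf) >= min_len:
                chunks.append(buf)
                buf = ""
        else:
            if buf:
                chunks.append(buf)
            buf = p
    if buf:
        chunks.append(buf)
    return chunks

paras = ["Short line.", "Another short line.", "A third short paragraph here."]
chunks = merge_chunks(paras, min_len=40, max_len=200)
```

Here the three short paragraphs merge into a single chunk instead of producing three fragments, which is exactly the "semantic loss from too-small paragraphs" problem the changelog describes.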
Another example is the parsing problems in MinerU (https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md). Reading order is determined by a model that sorts the spatial distribution of readable content, so some regions may come out in the wrong order in extremely complex layouts. Vertical text is not supported. Tables of contents and lists are recognized by rules, and some uncommon list formats may not be recognized.
Code blocks are not yet supported in the layout model; comic books, art albums, elementary-school textbooks, and exercise books cannot be parsed well; table recognition may get rows or columns wrong on complex tables; OCR may produce inaccurate characters on PDFs in low-resource languages (e.g. Latin diacritics, easily confused Arabic characters); and some formulas may fail to render in markdown.
Knowing these problems is a great help for algorithm design and system design.
Summary
This article mainly introduced some interesting discoveries and mining ideas around the internal workings of large models, along with technical explanations. They can be applied to our actual development and are worth reading.