Does RAG need vectors at all? Building Agentic RAG with PageIndex

Explore the new era of long-document retrieval and how reasoning can improve RAG system performance.
Core content:
1. Limitations and challenges of RAG system vector database retrieval
2. Innovation of PageIndex: document indexing system based on tree search
3. Practical application and usage steps of PageIndex
Have you been disappointed by the accuracy of vector-database retrieval on long, professional documents? Traditional vector-based RAG systems rely on semantic similarity rather than true relevance. But what retrieval actually needs is relevance, and relevance requires reasoning. Similarity search often falls short on professional documents that demand domain expertise and multi-step reasoning.
Reasoning-based RAG provides a better option: allowing large language models to think and reason to find the most relevant parts of documents. Inspired by AlphaGo, Vectify AI proposed using tree search to perform structured document retrieval.
PageIndex is a document indexing system that builds a searchable tree structure from long documents in preparation for reasoning-based RAG. It is developed by Vectify AI.
What is PageIndex
PageIndex converts lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for Large Language Models (LLMs). It is particularly suitable for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds an LLM's context window.
Main Features
Hierarchical tree structure: lets the LLM traverse the document logically, like a smart, LLM-optimized table of contents.
Precise page references: each node contains its summary and the physical start/end page indices, enabling accurate retrieval.
No artificial chunking: no arbitrary chunks; nodes follow the natural structure of the document.
Designed for large-scale documents: easily handles documents of hundreds or even thousands of pages.
PageIndex Format
The following is an example of the output; more example documents and their resulting tree structures are available in the project repository.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "Federal Reserve...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "Federal Reserve monitoring..."
    },
    {
      "title": "National and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve and..."
    }
  ]
}
...
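To illustrate how such a tree can be consumed downstream, here is a minimal Python sketch that flattens the structure into a node_id → page-range lookup. The tree literal is a trimmed copy of the example above; the traversal itself is an assumption about how one might use the output, not part of PageIndex.

```python
# A trimmed copy of the PageIndex example tree shown above.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "Federal Reserve...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "Federal Reserve monitoring...",
        },
        {
            "title": "National and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve and...",
        },
    ],
}

def flatten(node, table=None):
    """Walk the tree depth-first, mapping node_id -> (title, start, end pages)."""
    if table is None:
        table = {}
    table[node["node_id"]] = (node["title"], node["start_index"], node["end_index"])
    for child in node.get("nodes", []):
        flatten(child, table)
    return table

pages = flatten(tree)
print(pages["0007"])  # ('Monitoring Financial Vulnerabilities', 22, 28)
```

Because every node carries physical page indices, any node the LLM selects can be mapped straight back to the exact pages of the source PDF.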
In fact, looking at this output, many earlier RAG frameworks and algorithms share similar ideas: LlamaIndex's Node abstraction, RAPTOR's hierarchical clustering, and MinerU's PDF-to-Markdown conversion, whose output can likewise be parsed into chapter-structured JSON.
So what makes PageIndex stand out? The answer is in the final section on reasoning-based RAG with PageIndex: this Agentic RAG is smarter than earlier Advanced and Modular RAG designs. Let's see how to implement it.
How to use
Follow the steps below to generate the PageIndex tree structure from a PDF document.
1. Install dependencies
pip3 install -r requirements.txt
2. Set up your OpenAI API key
Create a .env file in the root directory and add your API key:
CHATGPT_API_KEY=your openai key
3. Run PageIndex on the PDF
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
You can customize the process by passing additional optional parameters:
--model                   OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages         Number of pages to check for a table of contents (default: 20)
--max-pages-per-node      Maximum number of pages per node (default: 10)
--max-tokens-per-node     Maximum number of tokens per node (default: 20000)
--if-add-node-id          Add node IDs (yes/no, default: yes)
--if-add-node-summary     Add node summaries (yes/no, default: no)
--if-add-doc-description  Add a document description (yes/no, default: yes)
Cloud API (Beta)
If you don't want to deploy it yourself, try Vectify AI 's PageIndex hosted API . The hosted version uses Vectify AI's custom OCR model to more accurately recognize PDFs and provide better tree structures for complex documents. Leave your email in this form to get 1,000 pages of processing credits for free.
Case Study: Mafin 2.5
Mafin 2.5 is a state-of-the-art reasoning-based RAG model designed for financial document analysis. It is built on PageIndex and achieves an astonishing 98.7% accuracy in the FinanceBench benchmark - significantly outperforming traditional vector-based RAG systems.
PageIndex's hierarchical indexing enables precise navigation and extraction of relevant content in complex financial reports such as SEC filings and earnings disclosures.
View the full benchmark results for detailed comparisons and performance metrics.
Reasoning-based RAG using PageIndex
Use PageIndex to build a reasoning-based retrieval system that does not rely on semantic similarity. It is well suited to domain-specific tasks that require subtle distinctions.
Preprocessing workflow example
1. Process the document with PageIndex to generate its tree structure.
2. Store the tree structure and its corresponding document ID in a database table.
3. Store the contents of each node in a separate table, indexed by node ID and tree ID.
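The storage side of this preprocessing can be sketched with SQLite. The schema below (table and column names are assumptions for illustration, not part of PageIndex) keeps one table for trees and one for node contents keyed by tree ID and node ID:

```python
import json
import sqlite3

# Hypothetical schema: one table for whole trees, one for per-node contents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (doc_id TEXT PRIMARY KEY, structure TEXT)")
conn.execute("""CREATE TABLE node_contents (
    tree_id TEXT, node_id TEXT, content TEXT,
    PRIMARY KEY (tree_id, node_id))""")

# Store a (trimmed) tree structure and one node's full text.
structure = {"title": "Financial Stability", "node_id": "0006", "nodes": []}
conn.execute("INSERT INTO trees VALUES (?, ?)", ("doc-1", json.dumps(structure)))
conn.execute("INSERT INTO node_contents VALUES (?, ?, ?)",
             ("doc-1", "0006", "Full text of the Financial Stability section..."))

# Retrieval: load the tree for a document, then fetch a selected node's content.
row = conn.execute("SELECT structure FROM trees WHERE doc_id = ?", ("doc-1",)).fetchone()
tree = json.loads(row[0])
content = conn.execute(
    "SELECT content FROM node_contents WHERE tree_id = ? AND node_id = ?",
    ("doc-1", "0006")).fetchone()[0]
print(tree["title"], "->", content[:20])
```

Keeping the tree (small, sent to the LLM) separate from node contents (large, fetched only for selected nodes) is what keeps the prompt cheap at selection time.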
An example of an inference-based RAG framework
1. Query preprocessing: analyze the query to determine what knowledge is required.
2. Document selection: search for relevant documents and their IDs, then fetch the corresponding tree structures from the database.
3. Node selection: search the tree structure to identify the relevant nodes.
4. LLM generation: fetch the selected nodes' content from the database, format and extract the relevant information, send the assembled context along with the original query to the LLM, and generate an informed response.
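The node-selection and generation steps of this loop can be sketched end to end. Everything here is hypothetical scaffolding: `call_llm` is a stub standing in for any chat-completion API, so the flow is runnable without a key.

```python
import json

def call_llm(prompt):
    # Stub: a real system would call a chat-completion API here. We pretend
    # the model chose node 0007 after reading the tree structure.
    return json.dumps({"thinking": "Vulnerability monitoring is covered in 0007.",
                       "node_list": ["0007"]})

def answer(question, tree_structure, node_contents):
    # Node selection: ask the LLM which nodes may contain the answer.
    reply = json.loads(call_llm(f"Question: {question}\nTree: {tree_structure}"))
    # Fetch the selected nodes' text and assemble the context.
    context = "\n\n".join(node_contents[nid] for nid in reply["node_list"])
    # Generation: in a real system, context + question go back to the LLM;
    # here we just return the assembled input.
    return f"Context:\n{context}\n\nQuestion: {question}"

contents = {"0007": "Section text on monitoring financial vulnerabilities..."}
out = answer("How are vulnerabilities monitored?", "{...tree...}", contents)
print(out)
```

Note that no embedding or similarity computation appears anywhere: the only retrieval mechanism is the LLM reading the tree and reasoning about where the answer lives.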
Example prompt for node selection
prompt = f"""
Given a question and a document tree structure.
You need to find all the nodes that could possibly contain the answer.
Question: {question}
Document tree structure: {structure}
Please reply in the following JSON format:
{{
"thinking": <the reasoning process about where to look>,
"node_list": [node_id1, node_id2, ...]
}}
"""
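The model's reply to this prompt still has to be parsed, and models sometimes wrap JSON in code fences or return malformed output. Here is a small defensive parser (an assumption about post-processing, not part of PageIndex) that fails closed to an empty node list:

```python
import json
import re

def parse_node_selection(reply: str) -> list:
    """Parse the model's JSON reply into a node-id list.
    Strips ```json fences some models add around their output."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())
    try:
        return json.loads(cleaned).get("node_list", [])
    except json.JSONDecodeError:
        return []  # fail closed: select no nodes on malformed output

reply = '```json\n{"thinking": "Likely in the monitoring section.", "node_list": ["0007"]}\n```'
print(parse_node_selection(reply))  # ['0007']
```

The returned IDs are then used to look up node contents (and, via each node's start/end page indices, the exact source pages) for the generation step.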
Seeing this design, it becomes clear why PageIndex needs no chunk vectors: it converts the document into nodes and lets a large model do the selecting. What used to be retrieval plus reranking in RAG has become an LLM judge.
The trade-off is cost: when there are many documents, or the documents are very long, LLM-based selection is comparatively expensive.