Does RAG need vectors at all? Building Agentic RAG with PageIndex

Explore the new era of long-document retrieval and how reasoning can improve RAG system performance.
Core content:
1. Limitations and challenges of RAG system vector database retrieval
2. Innovation of PageIndex: document indexing system based on tree search
3. Practical application and usage steps of PageIndex
Have you been disappointed by the accuracy of vector-database retrieval on long, professional documents? Traditional vector-based RAG systems rely on semantic similarity rather than true relevance. But what retrieval actually needs is relevance, and relevance requires reasoning. Similarity search often falls short on professional documents that demand domain expertise and multi-step reasoning.
Reasoning-based RAG provides a better option: allowing large language models to think and reason to find the most relevant parts of documents. Inspired by AlphaGo, Vectify AI proposed using tree search to perform structured document retrieval.
PageIndex is a document indexing system that builds a searchable tree structure from long documents in preparation for reasoning-based RAG. It is developed by Vectify AI.
What is PageIndex
PageIndex converts lengthy PDF documents into a semantic tree structure, similar to a "table of contents" but optimized for Large Language Models (LLMs). It is particularly suitable for financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds an LLM's context window.
Main Features
Hierarchical tree structure: lets the LLM traverse the document logically, like a smart, LLM-optimized table of contents.
Precise page references: each node contains its summary and the physical start/end page indices, enabling accurate retrieval.
No artificial chunking: no arbitrary chunks; nodes follow the natural structure of the document.
Designed for large-scale documents: easily handles documents of hundreds or even thousands of pages.
PageIndex Format
The following is an example of the output; more example documents and their resulting tree structures are available in the project repository.
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "Federal Reserve...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "Federal Reserve monitoring..."
    },
    {
      "title": "National and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve and..."
    }
  ]
}
...
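To illustrate how such a tree can be consumed downstream, here is a minimal Python sketch that flattens the structure into a node_id → page-range lookup. The tree literal is a trimmed copy of the example above; the traversal itself is an assumption about how one might use the output, not part of PageIndex.

```python
# A trimmed copy of the PageIndex example tree shown above.
tree = {
    "title": "Financial Stability",
    "node_id": "0006",
    "start_index": 21,
    "end_index": 22,
    "summary": "Federal Reserve...",
    "nodes": [
        {
            "title": "Monitoring Financial Vulnerabilities",
            "node_id": "0007",
            "start_index": 22,
            "end_index": 28,
            "summary": "Federal Reserve monitoring...",
        },
        {
            "title": "National and International Cooperation and Coordination",
            "node_id": "0008",
            "start_index": 28,
            "end_index": 31,
            "summary": "In 2023, the Federal Reserve and...",
        },
    ],
}

def flatten(node, table=None):
    """Walk the tree depth-first, mapping node_id -> (title, start, end pages)."""
    if table is None:
        table = {}
    table[node["node_id"]] = (node["title"], node["start_index"], node["end_index"])
    for child in node.get("nodes", []):
        flatten(child, table)
    return table

pages = flatten(tree)
print(pages["0007"])  # ('Monitoring Financial Vulnerabilities', 22, 28)
```

Because every node carries physical page indices, any node the LLM selects can be mapped straight back to the exact pages of the source PDF.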
In fact, looking at this output, many earlier RAG frameworks and algorithms share similar ideas: LlamaIndex's Node abstraction, RAPTOR's hierarchical clustering, and MinerU's PDF-to-Markdown conversion, whose output can likewise be parsed into chapter-structured JSON.
So what makes PageIndex stand out? The answer is in the final section on reasoning-based RAG with PageIndex: this Agentic RAG is smarter than earlier Advanced and Modular RAG designs. Let's see how to implement it.
How to use
Follow the steps below to generate the PageIndex tree structure from a PDF document.
1. Install dependencies
pip3 install -r requirements.txt
2. Set up your OpenAI API key
Create a .env file in the root directory and add your API key:
CHATGPT_API_KEY=your openai key
3. Run PageIndex on the PDF
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
You can customize the process by passing additional optional parameters:
--model                   OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages         Number of pages to check for a table of contents (default: 20)
--max-pages-per-node      Maximum number of pages per node (default: 10)
--max-tokens-per-node     Maximum number of tokens per node (default: 20000)
--if-add-node-id          Add node IDs (yes/no, default: yes)
--if-add-node-summary     Add node summaries (yes/no, default: no)
--if-add-doc-description  Add a document description (yes/no, default: yes)
Cloud API (Beta)
If you don't want to deploy it yourself, try Vectify AI 's PageIndex hosted API . The hosted version uses Vectify AI's custom OCR model to more accurately recognize PDFs and provide better tree structures for complex documents. Leave your email in this form to get 1,000 pages of processing credits for free.
Case Study: Mafin 2.5
Mafin 2.5 is a state-of-the-art reasoning-based RAG model designed for financial document analysis. It is built on PageIndex and achieves an astonishing 98.7% accuracy in the FinanceBench benchmark - significantly outperforming traditional vector-based RAG systems.
PageIndex's hierarchical indexing enables precise navigation and extraction of relevant content in complex financial reports such as SEC filings and earnings disclosures.
View the full benchmark results for detailed comparisons and performance metrics.
Reasoning-based RAG using PageIndex
Use PageIndex to build a reasoning-based retrieval system that does not rely on semantic similarity. It is well suited to domain-specific tasks that require subtle distinctions.
Preprocessing workflow example
1. Process the document with PageIndex to generate its tree structure.
2. Store the tree structure and its corresponding document ID in a database table.
3. Store the contents of each node in a separate table, indexed by node ID and tree ID.
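The storage side of this preprocessing can be sketched with SQLite. The schema below (table and column names are assumptions for illustration, not part of PageIndex) keeps one table for trees and one for node contents keyed by tree ID and node ID:

```python
import json
import sqlite3

# Hypothetical schema: one table for whole trees, one for per-node contents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (doc_id TEXT PRIMARY KEY, structure TEXT)")
conn.execute("""CREATE TABLE node_contents (
    tree_id TEXT, node_id TEXT, content TEXT,
    PRIMARY KEY (tree_id, node_id))""")

# Store a (trimmed) tree structure and one node's full text.
structure = {"title": "Financial Stability", "node_id": "0006", "nodes": []}
conn.execute("INSERT INTO trees VALUES (?, ?)", ("doc-1", json.dumps(structure)))
conn.execute("INSERT INTO node_contents VALUES (?, ?, ?)",
             ("doc-1", "0006", "Full text of the Financial Stability section..."))

# Retrieval: load the tree for a document, then fetch a selected node's content.
row = conn.execute("SELECT structure FROM trees WHERE doc_id = ?", ("doc-1",)).fetchone()
tree = json.loads(row[0])
content = conn.execute(
    "SELECT content FROM node_contents WHERE tree_id = ? AND node_id = ?",
    ("doc-1", "0006")).fetchone()[0]
print(tree["title"], "->", content[:20])
```

Keeping the tree (small, sent to the LLM) separate from node contents (large, fetched only for selected nodes) is what keeps the prompt cheap at selection time.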
An example of an inference-based RAG framework
1. Query preprocessing: analyze the query to determine what knowledge is required.
2. Document selection: search for relevant documents and their IDs, then fetch the corresponding tree structures from the database.
3. Node selection: search the tree structure to identify the relevant nodes.
4. LLM generation: fetch the selected nodes' content from the database, format and extract the relevant information, send the assembled context along with the original query to the LLM, and generate an informed response.
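The node-selection and generation steps of this loop can be sketched end to end. Everything here is hypothetical scaffolding: `call_llm` is a stub standing in for any chat-completion API, so the flow is runnable without a key.

```python
import json

def call_llm(prompt):
    # Stub: a real system would call a chat-completion API here. We pretend
    # the model chose node 0007 after reading the tree structure.
    return json.dumps({"thinking": "Vulnerability monitoring is covered in 0007.",
                       "node_list": ["0007"]})

def answer(question, tree_structure, node_contents):
    # Node selection: ask the LLM which nodes may contain the answer.
    reply = json.loads(call_llm(f"Question: {question}\nTree: {tree_structure}"))
    # Fetch the selected nodes' text and assemble the context.
    context = "\n\n".join(node_contents[nid] for nid in reply["node_list"])
    # Generation: in a real system, context + question go back to the LLM;
    # here we just return the assembled input.
    return f"Context:\n{context}\n\nQuestion: {question}"

contents = {"0007": "Section text on monitoring financial vulnerabilities..."}
out = answer("How are vulnerabilities monitored?", "{...tree...}", contents)
print(out)
```

Note that no embedding or similarity computation appears anywhere: the only retrieval mechanism is the LLM reading the tree and reasoning about where the answer lives.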
Example prompt for node selection
prompt = f"""
Given a question and a document tree structure.
You need to find all the nodes that could possibly contain the answer.
Question: {question}
Document tree structure: {structure}
Please reply in the following JSON format:
{{
"thinking": <the reasoning process about where to look>,
"node_list": [node_id1, node_id2, ...]
}}
"""
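The model's reply to this prompt still has to be parsed, and models sometimes wrap JSON in code fences or return malformed output. Here is a small defensive parser (an assumption about post-processing, not part of PageIndex) that fails closed to an empty node list:

```python
import json
import re

def parse_node_selection(reply: str) -> list:
    """Parse the model's JSON reply into a node-id list.
    Strips ```json fences some models add around their output."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", reply.strip())
    try:
        return json.loads(cleaned).get("node_list", [])
    except json.JSONDecodeError:
        return []  # fail closed: select no nodes on malformed output

reply = '```json\n{"thinking": "Likely in the monitoring section.", "node_list": ["0007"]}\n```'
print(parse_node_selection(reply))  # ['0007']
```

The returned IDs are then used to look up node contents (and, via each node's start/end page indices, the exact source pages) for the generation step.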
Seeing this design, it becomes clear why PageIndex needs no chunk vectors: it converts the document into nodes and lets a large model do the selecting. What used to be retrieval plus reranking in RAG has become an LLM judge.
The trade-off is cost: when there are many documents, or the documents are very long, LLM-based selection is comparatively expensive.