Microsoft open-sources PIKE-RAG: an L0-to-L4 staged system construction strategy

This article walks through Microsoft PIKE-RAG's staged system construction strategy, exploring how knowledge understanding and reasoning capabilities improve from L0 to L4.
Core content:
1. L0 level: basic tasks and modules for building a comprehensive knowledge base
2. Application and challenges of document parsing and knowledge organization in L0
3. L1 level: focusing on factual questions, with enhanced chunking and knowledge retrieval strategies
01 — L0: Knowledge base construction
1. File parsing
File parsing is a key step in handling diverse data sources. Tools such as LangChain make it easy to parse text documents in various formats and integrate heterogeneous data, while deep-learning tools and commercial cloud APIs provide OCR and table extraction to convert scanned documents into structured text.
For professional documents containing complex tables and charts, it is recommended to perform layout analysis, retain multimodal elements such as charts and figures, and describe these elements with vision-language models (VLMs); this preserves document integrity and improves retrieval quality.
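A minimal sketch of this parsing step might look like the following, assuming LangChain's community document loaders are installed; describe_figure() is a hypothetical stand-in for a VLM captioning call, not part of PIKE-RAG's actual code:

```python
# Parse a PDF into per-page records, keeping text and (optionally) figure descriptions.
from langchain_community.document_loaders import PyPDFLoader


def describe_figure(image_bytes: bytes) -> str:
    """Hypothetical VLM call that returns a natural-language description of a chart or figure."""
    raise NotImplementedError("plug in your own vision-language model here")


def parse_pdf(path: str) -> list[dict]:
    pages = PyPDFLoader(path).load()          # one Document per page
    records = []
    for page in pages:
        records.append({
            "source": path,
            "page": page.metadata.get("page"),
            "text": page.page_content,        # plain text extracted from the page
            # figures would be cropped by a layout-analysis step and described here, e.g.:
            # "figures": [describe_figure(img) for img in extract_figures(page)],
        })
    return records
```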
2. Knowledge organization
The knowledge base is organized as a multi-layer heterogeneous graph that makes the different granularities, abstraction levels, and relationships of the information explicit, as shown in the following figure:
It is divided into an information resource layer, a corpus layer, and a distilled knowledge layer, supporting both semantic understanding and efficient retrieval.
Information resource layer: records the various data sources and uses nodes and edges to represent their reference relationships, facilitating cross-validation and reasoning.
Corpus layer: splits documents into sections and chunks while retaining their original hierarchical structure. Tables and figures are summarized by large language models (LLMs) and integrated into nodes so that multimodal content remains searchable.
Distilled knowledge layer: through entity recognition and relation extraction, the corpus is converted into structured forms such as knowledge graphs, atomic knowledge, and tabular knowledge to support deep reasoning. The specific distillation methods are:
Knowledge graph: use LLMs to extract entities and relations, forming "node-edge-node" triples that are assembled into a graph.
Atomic knowledge: split the text into atomic sentences and generate atomic knowledge by combining node relationships.
Tabular knowledge: extract entity pairs of specified types and relations and combine them into tabular knowledge.
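The sketch below illustrates the three-layer idea with networkx; the layer names follow the description above, while extract_triples() is a hypothetical LLM extraction call and the exact schema is an assumption, not the project's implementation:

```python
# Build a multi-layer heterogeneous graph: source node -> chunk nodes -> extracted triples.
import networkx as nx


def extract_triples(text: str) -> list[tuple[str, str, str]]:
    """Hypothetical LLM-based entity/relation extraction returning (head, relation, tail) triples."""
    raise NotImplementedError


def build_knowledge_graph(doc_id: str, chunks: list[str]) -> nx.MultiDiGraph:
    g = nx.MultiDiGraph()
    g.add_node(doc_id, layer="information_resource")           # source-level node
    for i, chunk in enumerate(chunks):
        chunk_id = f"{doc_id}#chunk{i}"
        g.add_node(chunk_id, layer="corpus", text=chunk)        # corpus-layer node
        g.add_edge(doc_id, chunk_id, relation="contains")
        for head, rel, tail in extract_triples(chunk):          # distilled-knowledge layer
            g.add_node(head, layer="distilled_knowledge")
            g.add_node(tail, layer="distilled_knowledge")
            g.add_edge(head, tail, relation=rel, source=chunk_id)
    return g
```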
02 — L1: Focusing on factual questions
L1 adds knowledge retrieval and organization capabilities on top of L0 to improve retrieval and generation. The core challenges are semantic alignment and text segmentation: a large number of professional terms can reduce segmentation accuracy, and unreasonable segmentation destroys semantic integrity and introduces noise. To address this, the L1 system introduces more sophisticated query analysis and basic knowledge extraction modules, and extends the architecture with task decomposition, coordination, and preliminary knowledge organization so that complex queries are handled more effectively.
1. Enhanced chunking
Chunking is the process of splitting a large text into smaller chunks. The main methods are fixed-size chunking, semantic chunking, and hybrid chunking. Reasonable chunking improves retrieval efficiency and accuracy and directly affects system performance. Chunking plays a dual role in the L1 system:
First, chunks serve as the vectorized units of information used for retrieval;
Second, chunks provide the basis for subsequent knowledge extraction and summarization.
Improper chunking leads to loss of semantic information; in scenarios such as laws and regulations, fixed-size chunking often breaks the context and degrades extraction quality. The chunking process is shown in the following figure:
The text segmentation algorithm breaks a large document into small chunks while preserving context and generating an effective summary for each chunk.
Given a source text, the algorithm splits it iteratively. The first iteration produces a forward summary of the initial chunk, which serves as context for the chunks that follow. Each subsequent chunk is combined with the forward summary to generate its own summary; that summary is stored and used to update the forward summary, the processed text is removed, and the cycle continues until the whole text has been decomposed. The algorithm can also adjust the chunk size dynamically based on the content and structure of the text.
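A minimal sketch of this chunk-and-summarize loop, under the assumptions that summarize() stands in for an LLM call and that splitting prefers paragraph boundaries (the real algorithm adjusts chunk size more adaptively):

```python
# Iteratively split a text into chunks, carrying a "forward summary" as context for each chunk.
def summarize(forward_summary: str, chunk: str) -> str:
    """Hypothetical LLM call: summarize `chunk` given the running forward summary."""
    raise NotImplementedError


def chunk_with_summaries(text: str, max_chars: int = 2000) -> list[dict]:
    chunks, forward_summary = [], ""
    remaining = text
    while remaining:
        if len(remaining) <= max_chars:
            cut = len(remaining)
        else:
            # prefer to break at a paragraph boundary, otherwise cut at max_chars
            cut = remaining.rfind("\n\n", 0, max_chars)
            cut = cut if cut > 0 else max_chars
        piece, remaining = remaining[:cut], remaining[cut:].lstrip()
        summary = summarize(forward_summary, piece)   # summary conditioned on prior context
        chunks.append({"text": piece, "summary": summary})
        forward_summary = summary                     # carried forward to the next chunk
    return chunks
```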
2. Automatic tagging
In domain-specific RAG scenarios, the corpus is mostly written in professional terminology while user queries use everyday language. In medical Q&A, for example, a symptom may be described in plain words while the corpus uses clinical terms; this gap leads to inaccurate retrieval. To close it, the automatic tagging module preprocesses the corpus to extract a comprehensive set of domain-specific tags or to establish tag mapping rules.
Concretely, large language models (LLMs) are used to identify the key factors in each chunk, summarize them into tag categories, and generate extraction prompts. When no query samples are available, tags are extracted from the corpus to form a tag set; when samples exist, tags are extracted from both queries and answer chunks to build a cross-domain mapping. At query time, query tags are mapped onto this set to optimize retrieval, improving accuracy and coverage.
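As a small sketch of the corpus-side step, extract_tags() below stands in for the LLM tagging prompt (a hypothetical helper); the frequency threshold is an illustrative assumption:

```python
# Aggregate per-chunk tags into a corpus-level tag vocabulary.
from collections import Counter


def extract_tags(chunk: str) -> list[str]:
    """Hypothetical LLM call returning domain-specific tags for one chunk."""
    raise NotImplementedError


def build_tag_set(chunks: list[str], min_count: int = 2) -> set[str]:
    counts = Counter(tag for chunk in chunks for tag in extract_tags(chunk))
    return {tag for tag, n in counts.items() if n >= min_count}
```

At query time, the same extraction prompt is applied to the user query and its tags are mapped onto this vocabulary before retrieval.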
3. Multi-granularity retrieval
The L1 system supports multi-layer, multi-granularity retrieval over the heterogeneous knowledge graph. Each layer of the graph (the information resource layer, corpus layer, and distilled knowledge layer) provides knowledge at a different level of abstraction and granularity. Queries can be mapped to whole documents or to specific chunks, flexibly adapting to task requirements. The system computes the similarity between the query and graph nodes, and propagates and aggregates information across layers to ensure both breadth and depth.
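A minimal sketch of layer-aware scoring over such a graph; embed() is a hypothetical embedding call returning unit-normalized vectors, and in practice node embeddings would be precomputed rather than recomputed per query:

```python
# Score the query against nodes of every layer and keep the best hits per layer.
import numpy as np


def embed(text: str) -> np.ndarray:
    """Hypothetical embedding call; returns a unit-normalized vector."""
    raise NotImplementedError


def multi_granularity_search(query: str, graph, top_k: int = 3) -> dict[str, list]:
    q = embed(query)
    hits: dict[str, list] = {}
    for node, attrs in graph.nodes(data=True):            # nodes carry a "layer" attribute
        text = attrs.get("text") or str(node)
        score = float(np.dot(q, embed(text)))              # cosine similarity for unit vectors
        hits.setdefault(attrs.get("layer", "unknown"), []).append((score, node))
    return {layer: sorted(h, reverse=True)[:top_k] for layer, h in hits.items()}
```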
03 — L2: Focusing on chain reasoning questions
L2 focuses on efficiently retrieving multi-source information and performing complex reasoning. To this end, it introduces a knowledge extraction module and a task decomposition and coordination module: the former extracts the relevant information precisely, and the latter decomposes complex tasks into manageable subtasks to improve system efficiency (see Figure 9 of the paper).
1. Knowledge Atomization
A document chunk often contains multiple pieces of information, while a task only needs a subset of them. Traditional retrieval treats the whole chunk as a single unit, which is inefficient.
Knowledge atomization therefore uses large language models (LLMs) to generate question labels for each chunk: questions that the chunk can answer, covering content such as tables and figures. The labels and chunks form a hierarchical knowledge base that supports both coarse- and fine-grained queries and quickly locates relevant chunks through the question index.
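A minimal sketch of this indexing idea, where generate_questions() is a hypothetical LLM prompt ("list the questions this chunk can answer") rather than the project's actual interface:

```python
# Map each generated question label to the chunk indices that can answer it.
def generate_questions(chunk: str) -> list[str]:
    """Hypothetical LLM call returning the questions a chunk can answer."""
    raise NotImplementedError


def build_question_index(chunks: list[str]) -> dict[str, list[int]]:
    index: dict[str, list[int]] = {}
    for i, chunk in enumerate(chunks):
        for question in generate_questions(chunk):
            index.setdefault(question, []).append(i)
    return index
```

At retrieval time the query is matched against the question labels first (coarse granularity), and only the chunks behind the best-matching labels are read in full (fine granularity).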
2. Knowledge-aware task decomposition
3. Knowledge-aware task decomposer training
04 — L3: Focusing on predictive questions
L3 focuses on improving prediction capability; its core is to efficiently collect and organize knowledge and to build a basis for prediction. Through the task decomposition and coordination modules, the system generates prediction logic from the retrieved knowledge, as shown in the following figure:
To support advanced analysis and prediction, the knowledge organization module adds structuring and sorting submodules that convert raw knowledge into a clear format. In the FDA scenario, for example, drug labels, clinical trials, and other data are integrated into a multi-layer knowledge base: the structuring submodule extracts drug names and approval dates according to the task requirements, and the sorting submodule then groups them by date for statistics and prediction.
To compensate for the weaknesses of large language models in professional reasoning, the knowledge-centric reasoning module adds a prediction submodule that infers results from the query and the organized knowledge (for example, the number of drugs approved each year). It is not limited to answering questions about historical data; it can also predict future trends, enabling more flexible responses.
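The sketch below illustrates the structuring and prediction steps on the FDA example; the record format and the simple linear trend are illustrative assumptions, not the paper's method:

```python
# Structuring/sorting: group retrieved approval records by year; prediction: extrapolate the trend.
from collections import Counter
import numpy as np


def count_approvals_per_year(records: list[dict]) -> dict[int, int]:
    """Group retrieved {drug, approval_year} records into per-year counts."""
    return dict(sorted(Counter(r["approval_year"] for r in records).items()))


def predict_next_year(per_year: dict[int, int]) -> float:
    """Fit a simple linear trend over the yearly counts and extrapolate one year ahead."""
    years = np.array(list(per_year.keys()), dtype=float)
    counts = np.array(list(per_year.values()), dtype=float)
    slope, intercept = np.polyfit(years, counts, deg=1)
    return slope * (years.max() + 1) + intercept
```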
By optimizing knowledge organization and prediction, L3 can efficiently handle complex and dynamic knowledge bases.
05 — L4: Focusing on creative questions
L4 achieves multi-angle thinking by introducing a multi-agent mechanism. Solving creative problems requires combining facts and principles for innovative reasoning; the main difficulties lie in extracting the underlying logic from knowledge, handling complex influencing factors, and evaluating the quality of answers to open-ended questions. To this end, the system coordinates multiple agents that each analyze and reason in their own way, integrates the different lines of thought in parallel, and outputs a comprehensive solution, as shown in the following figure:
This design supports diverse perspectives and can respond effectively to complex queries, inspiring new ideas rather than producing fixed answers. Multi-agent collaboration not only deepens reasoning but also gives users richer insights, promoting creative thinking and distinctive solutions to complex problems.
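As a rough sketch of the multi-agent idea (the personas and the ask_llm() helper are hypothetical, not PIKE-RAG's actual agent design), several perspective agents answer the same question in parallel and an aggregator merges their proposals:

```python
# Run several "perspective" agents in parallel, then merge their drafts into one proposal.
from concurrent.futures import ThreadPoolExecutor


def ask_llm(system_prompt: str, question: str) -> str:
    """Hypothetical LLM call with a per-agent system prompt."""
    raise NotImplementedError


PERSONAS = {
    "engineer": "Reason from technical feasibility and known principles.",
    "economist": "Reason from cost, incentives, and market impact.",
    "skeptic": "Look for risks, failure modes, and missing evidence.",
}


def creative_answer(question: str) -> str:
    with ThreadPoolExecutor() as pool:                    # agents run in parallel
        drafts = dict(zip(PERSONAS, pool.map(lambda p: ask_llm(p, question), PERSONAS.values())))
    merged = "\n\n".join(f"[{name}] {text}" for name, text in drafts.items())
    return ask_llm("Integrate the perspectives below into one coherent proposal.", merged)
```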
Finally, the main content of this article is translated and summarized from the paper "PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation". For more information about PIKE-RAG, please refer to the open-source project and paper:
GitHub link: https://github.com/microsoft/PIKE-RAG
Paper link: https://arxiv.org/abs/2501.11551