Microsoft open-sources PIKE-RAG: unlocking the understanding of and reasoning over professional domain knowledge, a new breakthrough in RAG!

Written by

Caleb Hayes

Updated on: July 1, 2025

Over the past year, Retrieval-Augmented Generation (RAG) systems have made progress in extending large language models (LLMs) with external retrieval. However, they still rely mainly on text retrieval and on the LLM's own comprehension, do little to extract, understand, and use knowledge from multi-source data, and fall noticeably short in domains that demand deep professional knowledge, such as industrial applications.

To address this, Microsoft Research Asia proposed PIKE-RAG (sPecIalized KnowledgE and Rationale Augmented Generation), a method that "focuses on extracting, understanding, and applying domain-specific knowledge while building a coherent thinking logic to gradually guide LLMs to obtain accurate responses". It targets the following problems:

1. Diversity of knowledge sources: Knowledge comes from many different sources. PIKE-RAG addresses this by constructing multi-layer heterogeneous graphs that represent information and knowledge at different levels.
2. Generality and the "one size fits all" problem: Different types of questions, such as simple factual questions and complex questions that require multi-step reasoning, call for different processing strategies. Existing RAG methods apply one uniform pipeline and do not fully account for the complexity and specific needs of different application scenarios, so they cannot serve all of them well. Through task classification and system capability grading, PIKE-RAG builds solutions driven by the capabilities each task demands, which significantly improves the system's adaptability to problems of varying complexity.
3. LLMs lack domain expertise: In industrial applications, RAG systems need to draw on private knowledge and logic from specialized domains, but existing methods perform poorly there, especially in areas where LLMs are weak. PIKE-RAG strengthens the extraction and organization of domain-specific knowledge through knowledge atomization and dynamic task decomposition. In addition, the system can automatically extract domain knowledge from system interaction logs and consolidate what it has learned by fine-tuning the LLM, so that the knowledge can be applied in future question-answering tasks.
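
To make the first and third points more concrete, here is a minimal Python sketch of a layered graph that links a source document to a text chunk and to an "atomic" piece of knowledge derived from that chunk. It is an illustration only, not PIKE-RAG's actual data model: the class names, the three layer names ("document", "chunk", "atomic"), the sample texts, and the choice to store atomic knowledge as a short question are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    layer: str  # hypothetical layer names: "document", "chunk", "atomic"
    text: str


@dataclass
class LayeredGraph:
    """Toy multi-layer graph: nodes carry a layer tag, edges are undirected."""
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_node(self, node_id, layer, text):
        self.nodes[node_id] = Node(node_id, layer, text)

    def add_edge(self, src, dst):
        # Refuse links to nodes that were never registered.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges[src].add(dst)
        self.edges[dst].add(src)

    def neighbors(self, node_id, layer=None):
        """Return linked nodes, optionally restricted to a single layer."""
        linked = [self.nodes[n] for n in self.edges.get(node_id, set())]
        if layer is not None:
            linked = [n for n in linked if n.layer == layer]
        return sorted(linked, key=lambda n: n.node_id)


# Made-up sample data for illustration.
graph = LayeredGraph()
graph.add_node("doc1", "document", "Pump maintenance manual")
graph.add_node("chunk1", "chunk", "The seal must be replaced every 2000 hours.")
graph.add_node("atom1", "atomic", "How often must the seal be replaced?")
graph.add_edge("doc1", "chunk1")
graph.add_edge("chunk1", "atom1")

# Walk from an atomic node back to the chunk that supports it.
for node in graph.neighbors("atom1", layer="chunk"):
    print(node.text)
```

Tagging every node with its layer lets a retriever match a question against the fine-grained atomic layer first and then follow edges back to the chunk and document that support it.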

—

PIKE-RAG Framework

As shown in the figure below, PIKE-RAG is a versatile and extensible RAG framework. It is composed of several basic modules: file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. With this modular design, different RAG methods can be built by adjusting the sub-modules within each main module to match the required system capabilities, which helps the framework cope with complex requirements in real-world scenarios.
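
The idea of swapping sub-modules can be sketched in a few lines of Python. The example below is a toy, not the PIKE-RAG API: `RagPipeline`, `parse_file`, `keyword_retrieve`, and `template_reason` are hypothetical names, the pipeline covers only three of the modules listed above, and the "reasoning" step is a string template standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


def parse_file(raw: str) -> List[str]:
    """File parsing stand-in: split raw text into sentence-like units."""
    return [s.strip() for s in raw.split(".") if s.strip()]


def keyword_retrieve(question: str, store: List[str], top_k: int = 1) -> List[str]:
    """Retrieval stand-in: rank units by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        store,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def template_reason(question: str, context: List[str]) -> str:
    """Reasoning stand-in: a real system would call an LLM here."""
    return f"Q: {question} | Context: {' '.join(context)}"


@dataclass
class RagPipeline:
    """Each stage is a plain callable, so any one can be replaced on its own."""
    parse: Callable[[str], List[str]]
    retrieve: Callable[[str, List[str]], List[str]]
    reason: Callable[[str, List[str]], str]

    def answer(self, raw_document: str, question: str) -> str:
        store = self.parse(raw_document)
        context = self.retrieve(question, store)
        return self.reason(question, context)


document = "The pump seal lasts 2000 hours. The motor is rated at 5 kW."
question = "How long does the pump seal last?"

baseline = RagPipeline(parse_file, keyword_retrieve, template_reason)
print(baseline.answer(document, question))

# Swap only the retrieval module; parsing and reasoning stay unchanged.
last_unit_only = RagPipeline(parse_file, lambda q, store: store[-1:], template_reason)
print(last_unit_only.answer(document, question))
```

Because every stage shares a simple function signature, a different retrieval or reasoning strategy can be dropped in without touching the rest of the pipeline, which is the property the modular design is meant to provide.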

—

Phased system construction strategy from L0 to L4

PIKE-RAG adopts a layered, phased strategy for building and deploying the system, so that its ability to handle complex problems improves step by step, as shown in the following figure.

System construction is divided into five stages, L0 to L4: knowledge base construction (L0), the factual question module (L1), the chain reasoning question module (L2), the predictive question module (L3), and the creative question module (L4). Each stage has its own goals and challenges.
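
One way to read the L0 to L4 staging is as a lookup from question type to the stages a deployment has to complete. The sketch below encodes that reading in Python. It assumes that the stages are cumulative (each one builds on all earlier ones), and the question-type labels and the `stages_needed` helper are hypothetical names chosen for this example.

```python
from enum import IntEnum


class Level(IntEnum):
    L0 = 0  # knowledge base construction
    L1 = 1  # factual questions
    L2 = 2  # chain reasoning questions
    L3 = 3  # predictive questions
    L4 = 4  # creative questions


STAGE_NAME = {
    Level.L0: "knowledge base construction",
    Level.L1: "factual question module",
    Level.L2: "chain reasoning question module",
    Level.L3: "predictive question module",
    Level.L4: "creative question module",
}

# Hypothetical labels for the kind of question a deployment must handle.
REQUIRED_LEVEL = {
    "factual": Level.L1,
    "chain_reasoning": Level.L2,
    "predictive": Level.L3,
    "creative": Level.L4,
}


def stages_needed(question_type: str) -> list:
    """List every stage up to the one this question type requires.

    Assumes stages are cumulative: reaching L2 means L0 and L1 are done.
    """
    target = REQUIRED_LEVEL[question_type]
    return [STAGE_NAME[level] for level in Level if level <= target]


print(stages_needed("chain_reasoning"))
```

Under this assumption, a team that only needs factual question answering could stop after L1, while support for predictive questions would require the knowledge base and every module through L3.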

At present, the system has achieved good results both on public benchmarks and in some professional fields. For more information about PIKE-RAG, see the open-source project and paper below:

GitHub link: https://github.com/microsoft/PIKE-RAG
Paper link: https://arxiv.org/abs/2501.11551