General RAG: answering questions over multi-source heterogeneous knowledge bases via a routing module

The UniversalRAG framework integrates multi-source heterogeneous knowledge bases to support multimodal question answering.
Core content:
1. Overview of the UniversalRAG framework and its support for multimodal knowledge retrieval
2. Application of modality-aware retrieval and routing modules in question answering
3. Design and challenges of granularity-aware retrieval, and of training-free versus trained routing modules
How can knowledge of different modalities and granularities be retrieved and integrated across multiple corpora (multi-source heterogeneous knowledge bases spanning text, images, and videos)? UniversalRAG is a multimodal RAG framework built for exactly this: retrieving and integrating knowledge from corpora of multiple modalities and granularities. The ideas below are offered for reference.
Method
As the figure above shows, the core idea of UniversalRAG is to dynamically identify the most appropriate modality and granularity for each query and route retrieval to the corresponding knowledge source.
Modality-Aware Retrieval:
Multimodal corpus: UniversalRAG maintains three independent embedding spaces, one each for the text, image, and video modalities. Each modality's corpus is further organized into sub-corpora; for example, the text corpus is split into paragraph level and document level, and the video corpus into full videos and video clips.
Router: A routing module, the Router, dynamically selects the most appropriate modality for each query. Given a query q, the Router predicts the query-relevant modality r, and relevant items c are then retrieved from the corresponding modality-specific corpus.
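The modality-aware retrieval step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pseudo-embedding function, the toy corpora, and the keyword-based `route()` rule are all illustrative stand-ins (UniversalRAG uses real encoders and an LLM or trained classifier as the Router).

```python
import numpy as np

def embed(text, dim=8):
    # Deterministic pseudo-embedding so the sketch is self-contained;
    # a real system would use a trained text/image/video encoder.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

# One sub-corpus and one separate embedding index per modality.
corpora = {
    "text":  ["Paris is the capital of France.", "The Nile flows through Egypt."],
    "image": ["photo of the Eiffel Tower", "diagram of a plant cell"],
    "video": ["clip of a soccer goal", "full documentary about whales"],
}
indexes = {m: np.stack([embed(d) for d in docs]) for m, docs in corpora.items()}

def route(query):
    # Trivial keyword rule standing in for the Router module.
    if any(w in query for w in ("photo", "picture", "image")):
        return "image"
    if any(w in query for w in ("clip", "video", "scene")):
        return "video"
    return "text"

def retrieve(query, k=1):
    r = route(query)                  # routing decision r
    q = embed(query)
    sims = indexes[r] @ q             # cosine similarity (vectors are unit-norm)
    top = np.argsort(-sims)[:k]
    return r, [corpora[r][i] for i in top]

modality, items = retrieve("What is the capital of France?")
```

The key design point this illustrates is that each modality keeps its own embedding index, so similarity scores are only ever compared within one modality, never across the three spaces.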
Granularity-Aware Retrieval:
Multi-granularity support: To flexibly match the information needs of different queries, UniversalRAG further divides each modality into multiple granularity levels. For example, the text corpus is split into paragraph level and document level, and the video corpus into video clips and full videos.
Routing decision: The routing decision r falls into one of six categories: None, Paragraph, Document, Image, Clip, or Video. Retrieval then proceeds according to r.
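One plausible way to write this routing-then-retrieval step, reconstructed from the surrounding description (the paper's exact notation may differ):

```latex
r = \mathrm{Router}(q), \qquad
c =
\begin{cases}
\varnothing, & r = \text{None},\\
\mathrm{Retrieve}\!\left(q,\ \mathcal{C}_r\right), & r \in \{\text{Paragraph},\ \text{Document},\ \text{Image},\ \text{Clip},\ \text{Video}\},
\end{cases}
\qquad
a = \mathrm{LM}(q, c)
```

where $\mathcal{C}_r$ denotes the sub-corpus at the routed modality and granularity, and $a$ is the answer generated from the query together with the retrieved items.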
Prompt design: Given a query q, the LLM receives a detailed instruction describing the routing task, along with several in-context examples.
Predicted routing type: Based on the prompt and examples, the LLM predicts the most appropriate retrieval type for the query, choosing among the six predefined options.
Leveraging the inductive bias of benchmarks: Each benchmark is assumed to be primarily associated with a specific modality and retrieval granularity. For example, queries in a text question answering benchmark may mainly require paragraph-level information, while a multi-hop question answering benchmark may require document-level information.
Label allocation: For text question answering benchmarks, queries are labeled 'None' (if the query can be answered from the model's parametric knowledge alone), 'Paragraph' (single-hop RAG benchmarks), or 'Document' (multi-hop RAG benchmarks).
For the image benchmark, queries are labeled 'Image'.
For video question answering benchmarks, queries are labeled 'Clip' (if the query focuses on a local event or a specific moment in a video) or 'Video' (if it requires understanding the storyline or broader context of the entire video).
Routing module design:
1. Training-free routing
Training-free routing leverages the inherent knowledge and reasoning capabilities of a pre-trained LLM to classify queries: the LLM is prompted with the routing instruction and in-context examples, and outputs one of the six retrieval types.
Summary: This approach requires no additional training data and benefits from the strong generalization ability of the LLM. However, its performance is bounded by the LLM's pre-trained knowledge and reasoning ability.
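The training-free routing described above can be sketched as below. This is a hedged illustration: `call_llm` is a hypothetical stub (a real system would call an actual LLM API), and the few-shot examples are illustrative, not taken from the paper.

```python
# The six predefined routing options from the framework.
ROUTE_OPTIONS = ["None", "Paragraph", "Document", "Image", "Clip", "Video"]

# Illustrative in-context examples (not from the paper).
FEW_SHOT = [
    ("Who wrote Hamlet?", "None"),
    ("What does the bird in the picture look like?", "Image"),
    ("What happens in the final minute of the match video?", "Clip"),
]

def build_prompt(query: str) -> str:
    # Instruction describing the routing task, plus few-shot examples.
    lines = ["Classify the query into one of: " + ", ".join(ROUTE_OPTIONS) + "."]
    for q, r in FEW_SHOT:
        lines.append(f"Query: {q}\nRoute: {r}")
    lines.append(f"Query: {query}\nRoute:")
    return "\n\n".join(lines)

def call_llm(prompt: str) -> str:
    # Hypothetical stub; replace with a real LLM call in practice.
    return "Paragraph"

def route_training_free(query: str) -> str:
    answer = call_llm(build_prompt(query)).strip()
    # Fall back to paragraph-level text retrieval on an invalid LLM output.
    return answer if answer in ROUTE_OPTIONS else "Paragraph"
```

The validity check on the LLM's output matters in practice: a free-form model reply that is not one of the six options must be mapped to some default rather than passed downstream.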
2. Trained routing
To improve routing accuracy, UniversalRAG also explores training the routing module. The main challenge here is the lack of ground-truth (query, label) supervision for optimal corpus selection. The paper therefore constructs a training dataset indirectly:
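The indirect construction can be sketched as follows, combining the label-allocation rules above: each query inherits a route label from the assumed modality and granularity of its source benchmark, and the resulting pairs can then train a lightweight router classifier. The benchmark names here are illustrative placeholders, not the paper's actual benchmark list.

```python
# Map each (hypothetical) benchmark to its assumed route label, following
# the inductive-bias assumption described in the text.
BENCHMARK_TO_LABEL = {
    "closed_book_qa":     "None",       # answerable from parametric knowledge
    "single_hop_text_qa": "Paragraph",
    "multi_hop_text_qa":  "Document",
    "image_qa":           "Image",
    "local_video_qa":     "Clip",
    "global_video_qa":    "Video",
}

def build_training_set(benchmarks):
    """benchmarks: {benchmark_name: [query, ...]} -> list of (query, label)."""
    pairs = []
    for name, queries in benchmarks.items():
        label = BENCHMARK_TO_LABEL[name]
        pairs.extend((q, label) for q in queries)
    return pairs

data = build_training_set({
    "single_hop_text_qa": ["Who discovered penicillin?"],
    "image_qa": ["What color is the car in the picture?"],
    "global_video_qa": ["How does the documentary's story unfold?"],
})
```

Labels produced this way are noisy by construction (a benchmark-level assumption is applied to every query in it), which is the price of avoiding manual per-query annotation.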