Recommendation
Quickly master the core terminology of RAG and gain an in-depth understanding of the architecture and applications of retrieval-augmented generation.
Core content:
1. The core components of the RAG architecture and their meanings
2. Detailed explanation of embedding and vector retrieval technology
3. Key terms of generation and context control
4. Introduction to related technologies and patterns
Yang Fangxian
Founder of 53A/Most Valuable Expert of Tencent Cloud (TVP)
RAG (Retrieval-Augmented Generation)
1. Core components of RAG architecture
| Term | Description |
|---|---|
| Retriever | Finds documents or snippets relevant to the user's question in an external knowledge base (e.g., Top-k retrieval in a vector database). |
| Generator | Usually a large language model (e.g., GPT, T5) that uses the retrieved information to generate the final answer (the sketch after this table walks through this flow). |
| Index | The core data structure of the retrieval system, used to look up documents quickly. Usually a vector index. |
| Knowledge Base / Corpus | The collection of structured or unstructured content from which the RAG system retrieves relevant information. |
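To make the division of labor concrete, here is a minimal sketch of the retrieve-then-generate flow. The keyword-overlap retriever and the prompt-only `generate` function are hypothetical stand-ins for a real vector store and an LLM API call; only the plumbing is illustrative.

```python
from typing import List

def retrieve(query: str, knowledge_base: List[str], k: int = 3) -> List[str]:
    """Toy retriever: rank documents by naive keyword overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(query: str, contexts: List[str]) -> str:
    """Assemble a grounded prompt; a real system would send this to an LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        f"Answer the question using only the context below.\n"
        f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"
    )

kb = [
    "RAG combines a retriever with a generator.",
    "BM25 is a sparse retrieval algorithm.",
    "FAISS stores dense vectors for similarity search.",
]
print(generate("What does RAG combine?", retrieve("What does RAG combine?", kb)))
```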
2. Embedding and vector retrieval
| Term | Description |
|---|---|
| Embedding | Converts text into vectors so that it can be compared and retrieved semantically. |
| Dense Retrieval | Retrieval based on semantic vectors (e.g., from DPR or BERT encoders); it typically captures meaning better than traditional sparse methods such as TF-IDF. |
| Vector Store | A database for storing document vectors, such as FAISS, Pinecone, Milvus, or Weaviate (see the FAISS sketch after this table). |
| ANN (Approximate Nearest Neighbor) | A family of algorithms for efficiently finding similar vectors, widely used in large-scale vector retrieval. |
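A minimal sketch of dense retrieval with FAISS (one of the vector stores named above). Random vectors stand in for real embeddings, which in practice would come from an embedding model; `IndexFlatIP` is an exact index, while ANN indexes such as `IndexHNSWFlat` trade exactness for speed on large corpora.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128
rng = np.random.default_rng(0)

# Random stand-ins for document embeddings (1000 docs, 128-dim).
doc_vectors = rng.random((1000, dim), dtype=np.float32)
faiss.normalize_L2(doc_vectors)        # normalize so inner product = cosine

index = faiss.IndexFlatIP(dim)         # exact inner-product index
index.add(doc_vectors)                 # build the index over all documents

query = rng.random((1, dim), dtype=np.float32)
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)   # Top-k = 5 nearest neighbors
print(ids[0], scores[0])               # document ids and similarity scores
```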
3. Retrieval techniques
| Term | Description |
|---|---|
| DPR (Dense Passage Retrieval) | A dense retrieval method proposed by Facebook that trains separate query and passage encoders. |
| BM25 | A classic sparse retrieval algorithm based on term frequency, commonly used in traditional search engines. |
| Hybrid Retrieval | Combines the results of sparse retrieval (e.g., BM25) and dense retrieval (e.g., DPR) to improve recall (sketched after this table). |
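One common way to fuse sparse and dense results is reciprocal rank fusion (RRF), which scores each document by its rank in every list. A minimal sketch, assuming each retriever returns a ranked list of document ids (the ids below are made up):

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]   # sparse (BM25) ranking
dense_results = ["doc1", "doc5", "doc3"]  # dense (e.g., DPR) ranking
print(reciprocal_rank_fusion([bm25_results, dense_results]))
# doc1 and doc3 rise to the top because both retrievers agree on them
```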
4. Generation and Context Control
| Term | Description |
|---|---|
| Context Window | The maximum input length the LLM can handle; input beyond this length is truncated. |
| Chunking | Splitting long documents into smaller chunks so they fit retrieval and context-window constraints (see the sketch after this table). |
| Top-k Retrieval | Returns the k document chunks or passages most relevant to the query. |
| Prompt Engineering | Designing prompts that guide the language model to answer using the retrieved content. |
| Grounding | Ensuring that generated content is based on real retrieval results rather than hallucinations. |
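A minimal chunking sketch: split a document into fixed-size, overlapping word windows so that each chunk fits the context window and sentences near a boundary appear in two chunks. Word-based sizing is a simplification; production systems often chunk by tokens or sentences.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping word-window chunks.

    Consecutive chunks share `overlap` words so that content near a
    boundary is never lost to truncation.
    """
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]

doc = ("word " * 500).strip()
chunks = chunk_text(doc, chunk_size=200, overlap=50)
print(len(chunks), [len(c.split()) for c in chunks])  # 3 chunks of 200 words
```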
5. Related technologies and patterns
| Term | Description |
|---|---|
| Reranking | Re-scores and re-orders preliminary retrieval results to improve quality (sketched after this table). |
| Query Expansion | Improves retrieval by augmenting the query with synonyms, related terms, and reformulations. |
| Multi-hop Retrieval | Supports complex question answering that spans multiple documents or retrieval steps. |
| Fusion-in-Decoder (FiD) | A generative architecture proposed by Facebook AI Research that fuses multiple retrieved documents in the decoder. |
| Retriever-Reader Architecture | The traditional question-answering architecture of retriever + reader, the predecessor of RAG. |
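A minimal reranking sketch. The cheap `toy_overlap_score` below is a hypothetical stand-in for a real cross-encoder (e.g., a fine-tuned BERT that reads query and document together); in practice only the scoring function would change.

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Second-stage reranking: score each (query, candidate) pair with a
    more expensive model and keep only the best top_n."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, doc: str) -> float:
    """Hypothetical scorer: fraction of query terms present in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

candidates = [
    "RAG uses a retriever and generator.",
    "BM25 ranks by term frequency.",
    "Reranking improves precision of the final context.",
]
print(rerank("how does reranking improve RAG", candidates, toy_overlap_score))
```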
6. RAG deployment
| Term | Description |
|---|---|
| Cold Start | The lack of useful retrieval results or embedded representations when the system first runs. |
| Latency | The total time for retrieval + generation; one of the key targets of RAG system optimization. |
| Caching | Caching frequent retrieval or generation results to improve performance (sketched after this table). |
| Incremental Indexing | A mechanism for adding new documents without rebuilding the entire index. |
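A minimal caching sketch using Python's built-in `lru_cache`: repeated identical queries skip the retriever entirely. `expensive_retrieve` is a hypothetical placeholder for a real vector-store call.

```python
from functools import lru_cache

def expensive_retrieve(query: str) -> list:
    """Placeholder: in production this would query the vector store."""
    print(f"retrieving for: {query!r}")  # visible only on cache misses
    return [f"chunk relevant to {query}"]

@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> tuple:
    """Cache by exact query string; tuples are hashable and safe to cache."""
    return tuple(expensive_retrieve(query))

cached_retrieve("what is RAG")       # miss: runs the retriever
cached_retrieve("what is RAG")       # hit: served from the cache
print(cached_retrieve.cache_info())  # CacheInfo(hits=1, misses=1, ...)
```

Exact-string caching is the simplest policy; semantically equivalent queries with different wording still miss, which is why some systems cache on normalized or embedded query keys instead.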