Building Production-Grade RAG Pipelines with Gemini and Qdrant: Design Guidelines and Code Practice

Written by Silas Grey
Updated on: June 9, 2025

Master RAG technology and build an efficient intelligent assistant. Core content: 1. Key applications of RAG technology in dynamic domains; 2. A comparison of RAG, fine-tuning, and CAG, and how to choose among them for the airport scenario; 3. Architectural design and code practice for a production-grade RAG pipeline.

 

1. Core values and application scenarios of RAG technology

In the field of artificial intelligence, Retrieval-Augmented Generation (RAG) is becoming a key technology for addressing problems such as the lagging knowledge updates and untraceable outputs of large language models (LLMs). Traditional fine-tuning bakes knowledge into model parameters, making it hard to keep pace with rapidly changing domains; RAG, by decoupling retrieval from generation, enables real-time knowledge updates and traceability, and is especially suited to domains where policies change frequently and accuracy requirements are extremely high, such as healthcare, law, and aviation management.

This article takes the construction of an airport intelligent assistant as an example, combining Google's Gemini multimodal models with the Qdrant vector database to explain how to design and implement a highly reliable, scalable production-grade RAG pipeline. It covers architectural design principles, key technology selection, data management strategies, and a complete code implementation, aiming to give developers end-to-end guidance from theory to practice.

 

2. Technical selection: RAG vs fine-tuning vs CAG

Before starting a RAG project, you need to settle on a technical route. The table below compares the core differences between three approaches: RAG, fine-tuning, and Cache-Augmented Generation (CAG):

| Dimension | RAG | Fine-Tuning | CAG |
| --- | --- | --- | --- |
| Knowledge update | Update documents dynamically without retraining the model | Requires retraining the model | Depends on cached predefined responses |
| Traceability | Output is linked to the original documents; high transparency | Knowledge is implicit in parameters and cannot be traced | Only cache hits can be traced |
| Hallucination resistance | Generation grounded in retrieved content; low risk | May reinforce outdated knowledge | Safe only for known queries |
| Applicable scenarios | Dynamic domains (e.g., aviation policy) | Static domains (e.g., fixed rule manuals) | High-frequency repetitive scenarios (e.g., customer-service Q&A) |

2.1 Technical decision-making in airport scenarios

In airport scenarios, knowledge such as security protocols, flight scheduling rules, and customs policies is highly time-sensitive and must strictly follow official documents. Therefore:

  • Choose RAG: ensures the assistant always provides the latest information, while vector search links each answer back to the original content to meet audit requirements.
  • Exclude fine-tuning: updating model parameters is expensive and cannot keep up with weekly or even daily policy changes.
  • Use CAG as a supplement: for high-frequency fixed queries (such as "Where is Gate 3?"), response speed can be improved with a Redis cache, but the core business logic remains RAG-based.

3. System architecture design: from requirements to a layered architecture

3.1 Breaking down the business requirements

The airport intelligent assistant must meet the following core requirements:

  1. Real-time, accurate responses: in high-pressure scenarios such as check-in and security screening, response latency must stay under 500ms.
  2. Context awareness: provides personalized guidance based on the passenger's location (e.g., Terminal area F), identity type (e.g., transit passenger), and emotional state (e.g., anxious).
  3. Multi-round dialogue memory: stores up to 12 rounds of conversation history to keep interactions consistent.
  4. Multimodal support: must later extend to image recognition (e.g., analysis of baggage security images), hence the choice of the multimodal Gemini models.

3.2 Layered architecture design

Based on the above requirements, the following five-layer architecture is designed:

1. Data layer

  • Data sources: airport protocol documents in PDF format, a flight status API returning JSON, and employee training manuals in CSV format.
  • Preprocessing components: use the pdf-parse library to extract PDF text, strip redundant line breaks with regular expressions, and normalize runs of whitespace into single spaces.

2. Vector storage layer

  • Qdrant database: stores the Gemini embedding vectors of the document chunks. Reasons for choosing Qdrant:
    • supports local deployment, meeting airport data-privacy requirements;
    • provides hybrid search (semantic + keyword), so a query such as "international flight tax refund process" matches paragraphs both semantically and on the keyword "tax refund";
    • scales horizontally, so knowledge-base growth can be handled by adding nodes.

3. Retrieval layer

  • Vector search: generate the embedding vector of the query with Gemini, run a cosine-similarity search in Qdrant, and return the 3 most relevant document chunks.
  • Caching layer: use Redis to store results of high-frequency queries; the key format is rag:cache:{interactionId}:{queryHash} with a validity period of 1 hour (see the sketch below).
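A minimal cache-aside sketch of this layer, using the node-redis v4 client; the queryHash helper and cachedQuery wrapper are illustrative names, not part of the original pipeline:

const crypto = require("crypto");
const { createClient } = require("redis");

const redis = createClient({ url: "redis://localhost:6379" });
redis.connect().catch(console.error);

// Stable short hash so the cache key stays within a reasonable length
const queryHash = (query) =>
  crypto.createHash("sha256").update(query).digest("hex").slice(0, 16);

// Look up a cached response; on a miss, compute it and store it for 1 hour
const cachedQuery = async (interactionId, query, computeFn) => {
  const key = `rag:cache:${interactionId}:${queryHash(query)}`;
  const hit = await redis.get(key);
  if (hit) return JSON.parse(hit); // cache hit: skip retrieval and generation
  const result = await computeFn(query);
  await redis.set(key, JSON.stringify(result), { EX: 3600 }); // TTL = 1 hour
  return result;
};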

4. Generation layer

  • Gemini model: uses the gemini-2.5-pro-preview version and generates two responses in parallel:
    • Protocol mode: strictly grounded in the retrieved document content, lists the operational steps as a numbered list; suitable for scenarios such as security screening where the protocol must be followed exactly.
    • Experience mode: offers suggestions in a friendly tone, with Markdown-formatted output (e.g., bolding key information).

5. Application layer

  • API interface: exposes an /ask endpoint that receives a JSON request containing message, context (location, mood, etc.), and interactionId, and returns the dual-mode response (see the sketch below).
  • Monitoring system: integrates Prometheus + Grafana to track metrics such as Qdrant search latency, Gemini call success rate, and cache hit rate.
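A minimal Express sketch of the /ask endpoint (an assumed shape, with error handling and authentication reduced to the basics; queryGemini is the function defined in Section 5.3):

const express = require("express");
const app = express();
app.use(express.json());

app.post("/ask", async (req, res) => {
  const { message, context, interactionId } = req.body;
  if (!message || !interactionId) {
    return res.status(400).json({ error: "message and interactionId are required" });
  }
  try {
    // Returns { protocol, experience, sources } as built in Section 5.3
    const answer = await queryGemini(message, context, interactionId);
    res.json(answer);
  } catch (err) {
    console.error("queryGemini failed:", err);
    res.status(500).json({ error: "internal error" });
  }
});

app.listen(3000, () => console.log("RAG assistant listening on :3000"));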

4. Key technology implementation details

4.1 Data chunking and embedding strategy

4.1.1 Sliding-window chunking

The granularity of document chunking directly affects retrieval accuracy. We use sliding-window chunking with a chunk size of 1000-1500 tokens and a 20% overlap, which preserves semantic coherence across paragraph boundaries. The implementation is as follows:

const chunkText = (text) => {
  // Collapse line breaks and runs of whitespace into single spaces
  const cleanText = text.replace(/(\r\n|\n|\r)+/g, " ").replace(/\s+/g, " ").trim();
  const maxSize = 1500; // upper bound of the 1000-1500 token window (approximated in characters here)
  const overlap = Math.floor(maxSize * 0.2); // 20% overlap between consecutive chunks
  const chunks = [];
  let start = 0;
  while (start < cleanText.length) {
    chunks.push(cleanText.slice(start, start + maxSize));
    start += maxSize - overlap; // slide the window forward
  }
  return chunks;
};

4.1.2 Gemini embedding generation

Embeddings are generated with Gemini's dedicated embedding model gemini-embedding-exp-03-07, which is optimized for retrieval scenarios. Each document chunk is embedded as a 3072-dimensional vector:

const { GoogleGenAI } = require("@google/genai");
const genAI = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const getEmbeddings = async (chunks) => {
  const embeddings = [];
  for (const chunk of chunks) {
    const response = await genAI.models.embedContent({
      model: "gemini-embedding-exp-03-07",
      contents: chunk,
      config: { taskType: "RETRIEVAL_DOCUMENT" }, // optimize embeddings for document retrieval
    });
    embeddings.push(response.embeddings[0].values); // 3072-dimensional vector
  }
  return embeddings;
};

4.2 Context-aware prompt engineering

Prompt engineering is the core means of controlling Gemini's output. Two prompt templates are designed for the airport scenario:

4.2.1 Protocol-mode prompt

This is an airport scenario. Provide protocol steps for: "${user_query}". 
Context: ${retrieved_documents} 
Conversation History: ${last_12_messages} 
Guest Profile: ${profile}, Location: ${location}, Mood: ${mood} 
Response Requirements: 
1. Strictly based on provided context 
2. Use numbered list 
3. Under 300 words

4.2.2 Experience mode prompt

This is an airport scenario. Help staff respond to: "${user_query}". 
Focus on improving guest experience for ${profile} at ${location}. 
Context: ${retrieved_documents} 
Conversation History: ${last_12_messages} 
Response Requirements: 
1. Friendly tone with emojis 
2. Highlight key actions in bold 
3. Under 100 words 
4. Use Markdown formatting
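These templates correspond to the buildProtocolPrompt and buildExperiencePrompt helpers called in Section 5.3; a minimal sketch of how they might be assembled (the argument shapes are assumptions):

const buildProtocolPrompt = (userQuery, retrievedDocuments, context, history) => `
This is an airport scenario. Provide protocol steps for: "${userQuery}".
Context: ${retrievedDocuments}
Conversation History: ${history}
Guest Profile: ${context.profile}, Location: ${context.location}, Mood: ${context.mood}
Response Requirements:
1. Strictly based on provided context
2. Use numbered list
3. Under 300 words`.trim();

const buildExperiencePrompt = (userQuery, retrievedDocuments, context, history) => `
This is an airport scenario. Help staff respond to: "${userQuery}".
Focus on improving guest experience for ${context.profile} at ${context.location}.
Context: ${retrievedDocuments}
Conversation History: ${history}
Response Requirements:
1. Friendly tone with emojis
2. Highlight key actions in bold
3. Under 100 words
4. Use Markdown formatting`.trim();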

4.3 High Availability Architecture Design

4.3.1 Failure handling

  • Retrieval failure: if Qdrant returns an empty result, first check whether the query is a common question (e.g., by matching the Redis cache on keywords). If there is still no result, return: "Sorry, this question needs to be handled manually; please contact the check-in counter."
  • Model failure: retry the Gemini call up to three times (e.g., with a helper such as the promise-retry package). If it still fails, return the most recent valid cached response (after verifying the cache entry has not expired). A sketch of this retry-with-fallback flow follows.
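A minimal sketch of the retry-with-cache-fallback logic (the withRetries helper is hand-rolled here, and getLastValidResponse is an assumed cache accessor):

const withRetries = async (fn, attempts = 3, delayMs = 500) => {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, delayMs * (i + 1))); // linear backoff
    }
  }
  throw lastError;
};

const askWithFallback = async (userQuery, context, interactionId) => {
  try {
    return await withRetries(() => queryGemini(userQuery, context, interactionId));
  } catch (err) {
    // All retries failed: fall back to the latest non-expired cached response
    const cached = await getLastValidResponse(interactionId); // assumed helper
    if (cached) return cached;
    throw err;
  }
};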

4.3.2 Asynchronous processing optimization

For non-real-time work (such as weekly knowledge-base updates), a message queue (such as RabbitMQ) decouples the data-processing flow, as sketched after the list:

  1. An administrator uploads a new PDF file to an S3 bucket;
  2. A queue listener triggers the document-parsing task and generates new vector chunks;
  3. Qdrant performs a batch upsert; points written under existing IDs overwrite the stale vectors.
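A minimal sketch of the queue listener in step 2, using the amqplib RabbitMQ client (the queue name and message shape are assumptions; processPDF is the pipeline function from Section 5.2, here assumed to receive an already-downloaded file path):

const amqp = require("amqplib");

const startUpdateListener = async () => {
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue("kb-updates", { durable: true });

  channel.consume("kb-updates", async (msg) => {
    if (!msg) return;
    try {
      // Message published by the upload hook: { filePath, documentId }
      const { filePath, documentId } = JSON.parse(msg.content.toString());
      await processPDF(filePath, documentId); // parse, chunk, embed, upsert
      channel.ack(msg);
    } catch (err) {
      console.error("Knowledge-base update failed:", err);
      channel.nack(msg, false, true); // requeue for another attempt
    }
  });
};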

5. Code implementation: from document parsing to response generation

5.1 Qdrant initialization and vector operations

const { QdrantClient } = require("@qdrant/js-client-rest");
const { v5: uuidv5 } = require("uuid");
const client = new QdrantClient({ url: "http://localhost:6333" }); // local deployment address

// Make sure the collection exists and the vector dimension matches the Gemini output
const ensureCollectionExists = async () => {
  const { exists } = await client.collectionExists("airport-protocols");
  if (!exists) {
    await client.createCollection("airport-protocols", {
      vectors: { size: 3072, distance: "Cosine" }, // cosine similarity over 3072-dim Gemini embeddings
    });
  }
};

const upsertVectors = async (documentId, chunks, embeddings) => {
  await ensureCollectionExists();
  const points = chunks.map((chunk, index) => ({
    // Qdrant point IDs must be unsigned integers or UUIDs, so derive a
    // stable UUID from the documentId/chunk-index pair
    id: uuidv5(`${documentId}-${index}`, uuidv5.URL),
    vector: embeddings[index],
    payload: { text: chunk, document_id: documentId, source: "official-sop" }, // attach metadata
  }));
  await client.upsert("airport-protocols", { points, wait: true }); // wait=true ensures the operation completes
};

5.2 Document parsing pipeline

const fs = require("fs");
const pdf = require("pdf-parse");

// Parse a PDF and store its vectors in Qdrant
const processPDF = async (filePath, documentId) => {
  // 1. Parse the PDF text
  const text = await extractTextFromPDF(filePath);
  // 2. Chunk the text
  const chunks = chunkText(text);
  // 3. Generate embeddings
  const embeddings = await getEmbeddings(chunks);
  // 4. Save to the vector database
  await upsertVectors(documentId, chunks, embeddings);
  console.log(`Processed ${chunks.length} chunks for document ${documentId}`);
};

const extractTextFromPDF = async (filePath) => {
  const data = fs.readFileSync(filePath);
  const pdfData = await pdf(data);
  if (!pdfData.text) throw new Error("Invalid PDF file");
  return pdfData.text;
};

5.3 Dual-mode query interface

const queryGemini = async (userQuery, context, interactionId) => {
  // 1. Generate the query vector
  // (a production system could embed queries with taskType "RETRIEVAL_QUERY" instead)
  const queryEmbedding = (await getEmbeddings([userQuery]))[0];

  // 2. Vector search
  const results = await client.query("airport-protocols", {
    query: queryEmbedding,
    limit: 3,
    with_payload: true,
  });
  const relevantChunks = results.points.map((p) => p.payload.text).join("\n\n");

  // 3. Fetch conversation history (up to 12 rounds)
  const history = await getConversationHistory(interactionId, 12);

  // 4. Build the dual-mode prompts
  const protocolPrompt = buildProtocolPrompt(userQuery, relevantChunks, context, history);
  const experiencePrompt = buildExperiencePrompt(userQuery, relevantChunks, context, history);

  // 5. Call Gemini in parallel for efficiency
  const [protocolResp, experienceResp] = await Promise.all([
    genAI.models.generateContent({
      model: "gemini-2.5-pro-preview",
      contents: [{ role: "user", parts: [{ text: protocolPrompt }] }],
      config: { temperature: 0.1 }, // low temperature keeps the output deterministic
    }),
    genAI.models.generateContent({
      model: "gemini-2.5-pro-preview",
      contents: [{ role: "user", parts: [{ text: experiencePrompt }] }],
      config: { temperature: 0.7 }, // higher temperature allows more flexibility
    }),
  ]);

  return {
    protocol: protocolResp.text.trim(),
    experience: experienceResp.text.trim(),
    sources: results.points.map((p) => p.payload.document_id), // referenced document IDs
  };
};
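queryGemini assumes a getConversationHistory helper; a minimal sketch backed by the node-redis client from Section 3.2 (the rag:history:{interactionId} key format and JSON serialization are illustrative assumptions):

const historyKey = (interactionId) => `rag:history:${interactionId}`;

// Append one message and cap the list at 12 rounds (user + assistant pairs)
const appendToHistory = async (interactionId, role, text) => {
  const key = historyKey(interactionId);
  await redis.rPush(key, JSON.stringify({ role, text }));
  await redis.lTrim(key, -24, -1);
};

const getConversationHistory = async (interactionId, rounds = 12) => {
  const entries = await redis.lRange(historyKey(interactionId), -rounds * 2, -1);
  return entries
    .map((e) => JSON.parse(e))
    .map(({ role, text }) => `${role}: ${text}`)
    .join("\n");
};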

6. Performance optimization and monitoring

6.1 Search performance tuning

  • Index optimization: create a payload index on the document_id field in Qdrant to speed up queries filtered by document (see the snippet below).
  • Index tuning: Qdrant's vector index is HNSW-based; running it on well-provisioned servers and tuning the HNSW parameters reduced the average search latency from 200ms to 80ms in this deployment.
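The payload index from the first bullet can be created with the Qdrant JS client as follows ("keyword" is the payload schema type for exact-match string fields such as document_id):

const createDocumentIdIndex = async () => {
  await client.createPayloadIndex("airport-protocols", {
    field_name: "document_id",
    field_schema: "keyword",
    wait: true, // block until the index is built
  });
};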

6.2 Key monitoring indicators

| Metric | Tool | Threshold | Alert policy |
| --- | --- | --- | --- |
| Qdrant retrieval latency | Grafana | P99 > 500ms | Open a ticket and check index status |
| Gemini call success rate | Prometheus | < 95% | Restart the model service node |
| Cache hit rate | Redis monitoring | < 70% | Scale out the cache cluster or adjust the TTL |
| Knowledge-base update time | Custom logs | > 30 minutes | Check the document-parsing pipeline for errors |
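A minimal sketch of exporting the retrieval-latency metric with the prom-client library (the metric name and bucket boundaries are illustrative assumptions):

const promClient = require("prom-client");

const retrievalLatency = new promClient.Histogram({
  name: "qdrant_retrieval_latency_seconds",
  help: "Latency of Qdrant vector searches",
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2], // P99 > 0.5s maps to the Grafana alert above
});

const timedSearch = async (collection, params) => {
  const stop = retrievalLatency.startTimer(); // returns a function that records the elapsed time
  try {
    return await client.query(collection, params);
  } finally {
    stop();
  }
};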

7. Expansion and future direction

7.1 Multimodal capability enhancement

  • Image retrieval: generate visual embeddings for baggage images captured by security screening equipment via Gemini's vision capability, and search them alongside text vectors to support "find the relevant screening rules from an image".
  • Voice interaction: integrate Google Speech-to-Text and Text-to-Speech so airport staff can query information quickly by voice.

7.2 Federated Learning Deployment

For multi-airport groups, a federated learning model can be adopted:

  • Each airport runs a local Qdrant instance to store private data (such as that airport's layout diagrams);
  • A central server maintains the shared model (such as universal aviation-security protocols), and cross-site retrieval is realized through encrypted vector exchange.

8. Core principles of RAG implementation

Drawing on the airport intelligent assistant practice, the key design points of a production-grade RAG system can be summarized as follows:

  1. Business-driven selection: choose RAG over fine-tuning based on domain dynamism and traceability requirements, and avoid over-engineering.
  2. Layered, decoupled architecture: separate retrieval, generation, and caching so that each component scales independently; for example, Qdrant handles storage scaling while Gemini focuses on generation quality.
  3. Data quality first: invest at least 30% of development time in data cleaning, chunking strategy, and metadata annotation to avoid "garbage in, garbage out".
  4. Engineering mindset: build complete monitoring, logging, and failure-recovery mechanisms instead of focusing only on model quality.

The value of RAG lies not only in remedying the inherent shortcomings of LLMs, but in building intelligent systems that can keep evolving. By continuously optimizing data pipelines and prompt strategies, enterprises can adapt to fast-changing business needs at lower cost. As multimodal models such as Gemini iterate, RAG will unlock even greater potential in more vertical domains such as smart manufacturing and smart healthcare.