Dify+RAGFlow creates an enterprise-level intelligent knowledge base: PDF tables become structured data in seconds, and search accuracy increases dramatically

Written by

Clara Bennett

Updated on: July 8, 2025

A detailed tutorial, with an analysis of the underlying principles, on combining Dify with RAGFlow to deploy a local knowledge base and improve retrieval accuracy:

1. Environment Preparation and Deployment Architecture

Hardware requirements:

CPU ≥ 4 cores (AVX instruction set support recommended)
Memory ≥ 16GB
Disk ≥ 50 GB (for storing vector indexes)
GPU is not required but can speed up processing (NVIDIA T4 or above is recommended)
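
The minimums above can be checked before installation. The following is a minimal sketch, not part of either project: the function names are hypothetical, and `probe_host` assumes a Linux host where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`.

```python
import os
import shutil

# Minimums taken from the hardware list above.
MIN_CPU_CORES = 4
MIN_MEMORY_GB = 16
MIN_DISK_GB = 50


def check_requirements(cpu_cores, memory_gb, disk_gb):
    """Return a list of human-readable problems; an empty list means the host qualifies."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU: {cpu_cores} cores, need >= {MIN_CPU_CORES}")
    if memory_gb < MIN_MEMORY_GB:
        problems.append(f"Memory: {memory_gb:.1f} GB, need >= {MIN_MEMORY_GB}")
    if disk_gb < MIN_DISK_GB:
        problems.append(f"Disk: {disk_gb:.1f} GB free, need >= {MIN_DISK_GB}")
    return problems


def probe_host(path="/"):
    """Read core count, total RAM and free disk space (assumes Linux sysconf names)."""
    cpu_cores = os.cpu_count() or 0
    memory_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_free_bytes = shutil.disk_usage(path).free
    return cpu_cores, memory_bytes / 1024**3, disk_free_bytes / 1024**3
```

Calling `check_requirements(*probe_host())` prints nothing by itself; it returns the list of shortfalls. A machine sold as "16 GB" usually reports slightly less usable memory, so a small tolerance on `MIN_MEMORY_GB` may be appropriate.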

Software architecture:

User side → Dify application layer (workflow orchestration) → RAGFlow engine (document parsing/retrieval) → Local LLM (Ollama, etc.)

This architecture decouples Dify from RAGFlow through API interfaces, so document processing stays with a specialized engine while application development remains flexible.

2. Detailed deployment steps

1. RAGFlow deployment (document processing layer)

# Clone the repository and start the containers (Docker must be installed in advance)
git clone https://github.com/infiniflow/ragflow.git
# The upstream repository keeps its Compose files in the docker/ directory
cd ragflow/docker
docker-compose up -d

Key configuration:

In docker-compose.yml, change MINIO_ROOT_PASSWORD (the object storage secret key)
Increase the Elasticsearch memory allocation to 8 GB or more

2. Dify deployment (application development layer)

# Edit the environment variables (key step)
vim dify-main/docker/.env
# In .env: enable custom models and point Dify at Ollama (no spaces around "=")
CUSTOM_MODEL_ENABLED=true
OLLAMA_API_BASE_URL=http://[host IP]:11434

Deployment commands:

cd dify-main/docker
docker compose -p dify_docker up -d

This configuration routes model calls to the local Ollama instance, avoiding the latency of cloud APIs.

3. System Integration and Configuration

1. API connection process

Step	Dify operation	RAGFlow operation
1	Create an external knowledge base	Create a new knowledge base and upload documents
2	Fill in the API endpoint	Get the address from the console: `http://[IP]:9380`
3	Configure the API key	Generate and copy the key in the admin console
4	Enter the knowledge base ID	Get the unique ID from the knowledge base details page
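
To make the three values in the table concrete, the sketch below builds the kind of request Dify sends to an external knowledge API: a POST to `<endpoint>/retrieval` with a Bearer key and a JSON body carrying `knowledge_id`, `query`, and `retrieval_setting`. The helper names are hypothetical, and the `/api/v1/dify` path suffix used in the usage note is an assumption that should be checked against the RAGFlow version in use.

```python
import json
import urllib.request


def build_retrieval_request(endpoint, api_key, knowledge_id, query,
                            top_k=8, score_threshold=0.35):
    """Build (but do not send) a retrieval POST for an external knowledge API."""
    body = {
        "knowledge_id": knowledge_id,  # the ID copied in step 4
        "query": query,
        "retrieval_setting": {"top_k": top_k, "score_threshold": score_threshold},
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/retrieval",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # the key copied in step 3
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_records(response_text):
    """Extract (score, content) pairs from a response shaped like {"records": [...]}."""
    payload = json.loads(response_text)
    return [(r["score"], r["content"]) for r in payload.get("records", [])]
```

For a manual check, a request built with an endpoint such as `http://[IP]:9380/api/v1/dify` (assumed path) can be passed to `urllib.request.urlopen`, and the decoded response body handed to `parse_records`. An empty list usually points to a wrong knowledge base ID or to documents that have not finished parsing.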

Special note: complete the following steps in RAGFlow beforehand:

Enable "deep layout parsing" mode for PDF documents
Select "Cell Level Segmentation" for Excel files
Set the multi-language support parameters (Chinese requires special configuration)

2. Hybrid retrieval configuration

In the Dify workflow, set:

retrieval_strategy:
  - vector_search:
      model: jina-embeddings-v2-base-zh
      top_k: 8
  - full_text:
      analyzer: ik_max_word
rerank:
  model: bge-reranker-large
  score_threshold: 0.35

This configuration combines semantic retrieval with keyword matching. Tests have shown that it can improve the recall rate of table data.
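
The article does not say how the vector and full-text result lists are merged before re-ranking. One common option is reciprocal rank fusion (RRF), sketched below as a stand-in rather than as Dify's actual method; the function name and the constant `k=60` are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=8):
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) for every list it appears in,
    where rank starts at 1. RRF is a swapped-in technique here, not
    something the source attributes to Dify or RAGFlow.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the order is deterministic.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_k]
```

A document that appears in both lists outranks one that tops only a single list, which is the behavior that helps table chunks: they often match on exact keywords (column names, quarter labels) even when their embedding is only moderately close to the query.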

4. Core Strategies for Improving Accuracy

1. Document parsing optimization

Layout-aware parsing: RAGFlow uses computer vision (CV) algorithms to locate tables in PDFs, avoiding the misalignment problems of traditional OCR (tests show a 62% improvement in table-parsing completeness for scanned documents)
Intelligent chunking algorithm:

Split Chinese text at full stops rather than at line breaks (28% more accurate)
Store tables as associated "title + cell" records
Automatically generate alt text for images and build cross-modal indexes
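
The first two chunking rules above can be illustrated with a short sketch. This is not RAGFlow's implementation: the function names and the record layout (`title | row | column: cell`) are assumptions chosen only to show the idea of sentence-level splitting and of keeping each cell attached to its table title.

```python
import re


def split_sentences(text):
    """Split Chinese text after sentence-ending punctuation, keeping the punctuation."""
    # Zero-width split right after a full stop, exclamation mark or question mark.
    parts = re.split(r"(?<=[。！？])", text)
    return [part.strip() for part in parts if part.strip()]


def table_to_chunks(title, header, rows):
    """Store every cell together with the table title and its column header."""
    chunks = []
    for row_index, row in enumerate(rows, start=1):
        for column, cell in zip(header, row):
            chunks.append(f"{title} | row {row_index} | {column}: {cell}")
    return chunks
```

Because every cell record repeats the table title, a query such as "2024Q3 sales" can match a single cell without the retriever needing to return the whole table.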

2. Retrieval Enhancement Mechanism

Multi-way recall strategy:

Vector retrieval: captures semantic similarity
Full-text search: ensures keyword matches
Graph recall: expands along associations within a document

Dynamic re-ranking: use a BGE model to re-rank the top 50 results and eliminate "semantic drift"
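
The recall-then-re-rank flow described above can be sketched as follows. All names are hypothetical, and `overlap_score` is a toy word-overlap scorer standing in for a real cross-encoder such as a BGE reranker; only the shape of the pipeline (merge, cap at 50, score, filter, sort) reflects the text.

```python
from itertools import zip_longest

_MISSING = object()


def merge_candidates(*recall_paths, limit=50):
    """Interleave several recall lists, dropping duplicates, and cap the result."""
    seen, merged = set(), []
    for tier in zip_longest(*recall_paths, fillvalue=_MISSING):
        for chunk in tier:
            if chunk is not _MISSING and chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged[:limit]


def overlap_score(query, chunk):
    """Toy relevance score: fraction of query words that occur in the chunk."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(chunk.lower().split())) / len(query_words)


def rerank(query, candidates, score_fn=overlap_score, score_threshold=0.35):
    """Score each candidate, drop those below the threshold, sort best first."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Interleaving rather than concatenating keeps the 50-candidate cap from being filled by a single recall path. In a real deployment `score_fn` would wrap a call to the reranking model.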

3. Workflow Optimization

5. Effect Verification and Tuning

1. Case comparison

Query type	Dify alone	Dify + RAGFlow
"2024Q3 Sales Data Table"	Missing 37% of cells	Complete recall
"Technical features in patent claims"	42% false match rate	Precisely locates the clause
"Key terms in a scanned contract"	Unable to parse	Structured extraction

2. Parameter Tuning Guide

Dynamic TopK adjustment: set according to the average document length (6-12 recommended)
Score threshold: start testing at 0.3 and adjust in steps of 0.05
Segment overlap rate: set to 10-15% to avoid information fragmentation
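
The threshold sweep and the overlap setting above can be tried offline on a handful of labeled queries. The sketch below is illustrative only: the function names, the character-based chunk size, and the use of F1 as the selection metric are assumptions, not settings taken from Dify or RAGFlow.

```python
def chunk_with_overlap(text, chunk_size=200, overlap_ratio=0.1):
    """Slice text into fixed-size chunks that share overlap_ratio of their length."""
    if not text:
        return []
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap_ratio must be below 1")
    # Stop early so the last chunk is never fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def threshold_candidates(start=0.30, step=0.05, stop=0.60):
    """Thresholds to try: start, start + step, ... up to stop (rounded to 2 places)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def f1_at_threshold(scored_hits, relevant, threshold):
    """F1 of the hits whose score reaches the threshold, against the relevant set."""
    kept = {doc for score, doc in scored_hits if score >= threshold}
    true_positives = len(kept & relevant)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(kept)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(scored_hits, relevant, thresholds):
    """Return the lowest threshold that achieves the best F1."""
    return max(thresholds, key=lambda t: f1_at_threshold(scored_hits, relevant, t))
```

Preferring the lowest threshold among ties keeps recall as high as possible, which matters for table queries where a missing cell is worse than an extra one. In practice the sweep should be averaged over many queries rather than decided on one.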

6. Summary of the principles of improving accuracy

Deep document understanding: RAGFlow's layout parsing goes beyond what traditional NLP tools can handle, especially for scanned documents and complex tables
Hybrid search mechanism: combined with Dify's flexible workflow orchestration, it provides three-way matching on keywords, semantics, and associations
Dynamic optimization strategy: a closed loop of continuous optimization based on the re-ranking model and parameter tuning
Local deployment: avoids the overhead of sending data to external APIs and keeps the original data secure
