Dify+RAGFlow creates an enterprise-level intelligent knowledge base: PDF tables become structured data in seconds, and search accuracy increases dramatically

Written by
Clara Bennett
Updated on: July 8, 2025
Recommendation

An efficient approach to building an enterprise-level intelligent knowledge base: convert PDF tables into structured data in seconds and greatly improve search accuracy.

Core content:
1. Hardware and software architecture requirements and environment preparation
2. Detailed deployment steps, including RAGFlow and Dify configuration
3. System integration and configuration, plus core strategies for optimizing document parsing and improving search accuracy

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

A detailed tutorial, with an analysis of the underlying principles, on combining Dify with RAGFlow to deploy a local knowledge base and improve retrieval accuracy:


1. Environment Preparation and Deployment Architecture

Hardware requirements:

  • CPU ≥ 4 cores (AVX instruction set support recommended)
  • Memory ≥ 16GB
  • Disk ≥ 50 GB (for storing vector indexes)
  • GPU is not required but can speed up processing (NVIDIA T4 or above is recommended)
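The minimums above can be checked before installing anything. The following is a minimal sketch, not part of either project: the function names are hypothetical, the probe relies on `os.sysconf` (so it assumes Linux or macOS), and it treats the disk figure as free space, which is an assumption on top of the list above.

```python
import os
import shutil

# Minimums taken from the hardware list above.
MIN_CPU_CORES = 4
MIN_MEMORY_GB = 16
MIN_DISK_GB = 50


def check_requirements(cpu_cores, memory_gb, free_disk_gb):
    """Return a list of human-readable shortfalls (empty list means OK)."""
    problems = []
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"CPU cores: {cpu_cores} < {MIN_CPU_CORES}")
    if memory_gb < MIN_MEMORY_GB:
        problems.append(f"Memory: {memory_gb:.1f} GB < {MIN_MEMORY_GB} GB")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"Free disk: {free_disk_gb:.1f} GB < {MIN_DISK_GB} GB")
    return problems


def probe_host(path="/"):
    """Read the local values (Linux/macOS only: relies on os.sysconf)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return {
        "cpu_cores": os.cpu_count() or 0,
        "memory_gb": page_size * page_count / 1024**3,
        "free_disk_gb": shutil.disk_usage(path).free / 1024**3,
    }
```

Calling `check_requirements(**probe_host())` on the target machine returns an empty list when all three minimums are met. AVX and GPU support are not checked here.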

Software architecture:

User side → Dify application layer (workflow orchestration) → RAGFlow engine (document parsing/retrieval) → Local LLM (Ollama, etc.)

This architecture decouples Dify and RAGFlow through API interfaces: RAGFlow handles specialized document processing, while Dify keeps application development flexible.


2. Detailed deployment steps

1. RAGFlow deployment (document processing layer)

# Clone the repository and start the containers (Docker must be installed in advance)
git clone https://github.com/infiniflow/ragflow.git
# The compose files live in the repository's docker/ directory
cd ragflow/docker
docker-compose up -d

Key configuration:

  • Change MINIO_ROOT_PASSWORD (the object storage key) in docker-compose.yml
  • Increase the Elasticsearch memory allocation to 8 GB or more

2. Dify deployment (application development layer)

# Edit the environment file (key step)
vim dify-main/docker/.env
# In .env: enable custom models and point Dify at Ollama
# (replace [localhost IP] with the host machine's IP address)
CUSTOM_MODEL_ENABLED=true
OLLAMA_API_BASE_URL=http://[localhost IP]:11434

Deployment command:

cd dify-main/docker
docker compose -p dify_docker up -d

With this configuration, Dify calls the local model directly and avoids the latency of cloud APIs.
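If the .env edit needs to be scripted rather than done by hand in vim, something like the sketch below could work. `set_env_var` is a hypothetical helper, not a Dify tool; it only rewrites `KEY=value` text and assumes the variable names used in this article.

```python
def set_env_var(env_text, key, value):
    """Return env_text with KEY=value set, replacing an existing entry if any."""
    lines = env_text.splitlines()
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Leave comments and lines without an assignment untouched.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = new_line  # also normalizes "KEY = value" to "KEY=value"
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

A typical use would be to read `dify-main/docker/.env`, call `set_env_var` once per variable, and write the result back before running the deployment command.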


3. System Integration and Configuration

1. API connection process

| Step | Dify operation                    | RAGFlow operation                                     |
|------|-----------------------------------|-------------------------------------------------------|
| 1    | Create an external knowledge base | Create a new knowledge base and upload documents      |
| 2    | Fill in the API endpoint          | Get it from the console: http://[IP]:9380             |
| 3    | Configure the API key             | Generate and copy the key in the admin console        |
| 4    | Enter the knowledge base ID       | Get the unique ID on the knowledge base details page  |
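To make the four values in the table concrete, here is a rough sketch of the request Dify sends to an external knowledge base. The payload fields (`knowledge_id`, `query`, `retrieval_setting`) follow Dify's External Knowledge API; the `/api/v1/dify` path prefix, the key, and the ID below are placeholders and assumptions that should be checked against your RAGFlow version. The request is built but not sent.

```python
import json
import urllib.request


def build_retrieval_request(endpoint, api_key, knowledge_id, query,
                            top_k=8, score_threshold=0.35):
    """Build (but do not send) a POST request for an external knowledge base."""
    payload = {
        "knowledge_id": knowledge_id,
        "query": query,
        "retrieval_setting": {"top_k": top_k, "score_threshold": score_threshold},
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/retrieval",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values only; replace with the ones collected in the steps above.
request = build_retrieval_request(
    endpoint="http://127.0.0.1:9380/api/v1/dify",  # assumed path prefix
    api_key="ragflow-XXXX",
    knowledge_id="your-knowledge-base-id",
    query="2024Q3 sales data table",
)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

Sending a request like this by hand is a quick way to confirm that the endpoint, key, and knowledge base ID are valid before wiring them into Dify.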

Special note: the following must be configured in RAGFlow in advance:

  • Enable "deep layout parsing" mode for PDF documents
  • Select "Cell Level Segmentation" for Excel files
  • Set the multi-language support parameters (Chinese requires special configuration)

2. Hybrid retrieval configuration

In the Dify workflow, set:

retrieval_strategy:
  - vector_search:
      model: jina-embeddings-v2-base-zh
      top_k: 8
  - full_text:
      analyzer: ik_max_word
rerank:
  model: bge-reranker-large
  score_threshold: 0.35

This configuration combines semantic retrieval with keyword matching. In testing, it improved the recall rate for table data.
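The effect of `top_k` and `score_threshold` can be illustrated with a small sketch. This is not how Dify or RAGFlow implement retrieval internally: `hybrid_retrieve` and `make_overlap_scorer` are hypothetical names, and the word-overlap scorer is a toy stand-in for a real reranker such as bge-reranker-large.

```python
def make_overlap_scorer(query):
    """Toy reranker: fraction of query words that appear in the chunk."""
    words = set(query.lower().split())

    def score(chunk):
        chunk_words = set(chunk.lower().split())
        return len(words & chunk_words) / len(words) if words else 0.0

    return score


def hybrid_retrieve(vector_hits, fulltext_hits, rerank_score,
                    top_k=8, score_threshold=0.35):
    """Merge two candidate lists, rerank them, and keep the best chunks.

    vector_hits / fulltext_hits: lists of chunk strings, best first.
    rerank_score: callable(chunk) -> float in [0, 1].
    """
    # Union of both recall paths, first-seen order, duplicates dropped.
    candidates = list(dict.fromkeys(vector_hits[:top_k] + fulltext_hits[:top_k]))
    scored = [(rerank_score(chunk), chunk) for chunk in candidates]
    # Discard anything the reranker scores below the threshold.
    kept = [(s, c) for s, c in scored if s >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


vector_hits = ["Q3 revenue grew in all regions", "Sales table for 2024 Q3"]
fulltext_hits = ["Sales table for 2024 Q3", "Office relocation notice"]
results = hybrid_retrieve(
    vector_hits, fulltext_hits, make_overlap_scorer("2024 Q3 sales table")
)
print(results)
```

In this toy run only the chunk that matches all four query words clears the 0.35 threshold; the chunk found by both paths is counted once.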


4. Core Strategies for Improving Accuracy

1. Document parsing optimization

  • Layout-aware parsing: RAGFlow uses computer vision (CV) algorithms to locate tables in PDFs, avoiding the misalignment problems of traditional OCR (tests show a 62% improvement in table-parsing completeness for scanned documents)
  • Intelligent chunking algorithm:
    • Split Chinese text on the full stop "。" (28% more accurate than splitting on line breaks)
    • Store tables as "title + cell" associations
    • Automatically generate alt text for images and build cross-modal indexes
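The two chunking ideas in the list can be sketched in a few lines. This is an illustration only, not RAGFlow's parser: it assumes the sentence delimiter meant above is the Chinese full stop "。", and the function names and record layout are hypothetical.

```python
def split_sentences(text, delimiter="。"):
    """Split text after each delimiter, keeping the delimiter with its sentence."""
    chunks = []
    current = ""
    for char in text:
        current += char
        if char == delimiter:
            chunks.append(current.strip())
            current = ""
    # Keep any trailing text that has no closing delimiter.
    if current.strip():
        chunks.append(current.strip())
    return chunks


def table_to_records(title, header, rows):
    """Store every cell together with the table title and its column name."""
    records = []
    for row_index, row in enumerate(rows):
        for column, cell in zip(header, row):
            records.append({"table": title, "row": row_index,
                            "column": column, "value": cell})
    return records


sentences = split_sentences("第一句。第二句。尾部")
records = table_to_records(
    "2024Q3 Sales", ["Region", "Revenue"], [["North", "120"], ["South", "95"]]
)
```

Keeping the table title and column name on every cell record means a retrieved cell still carries enough context to be interpreted on its own.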

2. Retrieval Enhancement Mechanism

  • Multi-way recall strategy:
    1. Vector retrieval: captures semantic similarity
    2. Full-text search: ensures keyword matching
    3. Graph recall: expands along associations within the document
  • Dynamic re-ranking: use the BGE model to re-rank the top 50 results and eliminate "semantic drift"
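One common way to merge several recall paths before re-ranking is reciprocal rank fusion (RRF). The sketch below uses RRF purely as an illustration; the source does not say which fusion method RAGFlow or Dify uses, and the chunk IDs and the constant `k=60` are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first ID lists into one ranking using RRF."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items ranked highly in any list get a larger contribution.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_recall = ["c2", "c1", "c4"]
fulltext_recall = ["c1", "c3"]
graph_recall = ["c1", "c2"]
fused = reciprocal_rank_fusion([vector_recall, fulltext_recall, graph_recall])

# The first 50 fused candidates would then be handed to the reranker.
rerank_candidates = fused[:50]
```

A chunk that appears in all three lists rises to the top even if no single path ranked it first, which is the point of multi-way recall.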
3. Workflow Optimization



5. Effect Verification and Tuning

1. Case comparison

| Query type                                 | Dify alone           | Dify+RAGFlow                |
|--------------------------------------------|----------------------|-----------------------------|
| "2024Q3 Sales Data Table"                  | 37% of cells missing | Complete recall             |
| "Technical features in patent claims"      | 42% false match rate | Precisely locates the clause |
| "Key contract terms in a scanned document" | Unable to parse      | Structured extraction       |

2. Parameter Tuning Guide

  • TopK dynamic adjustment: set according to the average document length (6 to 12 recommended)
  • Score threshold: start testing at 0.3 and adjust in steps of 0.05
  • Segment overlap rate: set to 10-15% to avoid information fragmentation
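Two of these settings are easy to make concrete. The sketch below generates the threshold values to try and shows what a 10% segment overlap looks like on a token list. Both helper names are hypothetical, and the real chunking is done inside RAGFlow rather than by code like this.

```python
def threshold_grid(start=0.3, step=0.05, count=5):
    """Candidate score thresholds: start, start+step, ... (rounded to 2 places)."""
    return [round(start + i * step, 2) for i in range(count)]


def chunk_with_overlap(tokens, chunk_size, overlap_ratio=0.1):
    """Slide a window over tokens so consecutive chunks share some tokens."""
    overlap = int(chunk_size * overlap_ratio)
    stride = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


thresholds = threshold_grid()
chunks = chunk_with_overlap(list(range(20)), chunk_size=10, overlap_ratio=0.1)
```

Each threshold in the grid would be evaluated against the same set of test queries, keeping the value that gives the best balance between missed and irrelevant chunks.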

6. Summary of the principles behind the accuracy gains

1. Deep document understanding: RAGFlow's layout parsing algorithm overcomes the limitations of traditional NLP tools, especially on scanned documents and complex tables.
2. Hybrid search mechanism: combined with Dify's flexible workflow orchestration, it matches along three dimensions: keywords, semantics, and associations.
3. Dynamic optimization strategy: a closed loop of continuous optimization built on the re-ranking model and parameter tuning.
4. Local deployment: removes API transmission overhead and keeps the original data secure.
