[Open Source] Combine Dify and RAGFlow to improve knowledge base accuracy and turn PDF tables into structured data in seconds!

Accurately parse PDF tables and greatly improve knowledge base search efficiency!
Core content:
1. Integration advantages of Dify and RAGFlow and the resulting improvement in search quality
2. Simplified configuration steps and how to verify the results
3. Scenario value and industry use cases
Integration Advantages
1. Deep document parsing capabilities
RAGFlow can parse complex formats such as PDFs, scanned documents, and tables. It automatically identifies the layout and extracts structured data, making up for the shortcomings of Dify's native parsing.
2. A leap in search quality
Through multi-path recall and reranking, RAGFlow significantly improves the accuracy of answers. For example, the parsing completeness of scanned PDF tables has increased by more than 40%.
3. Hybrid search mode
Dify supports vector retrieval, full-text retrieval, and hybrid retrieval (recommended). Combined with calls to RAGFlow's API, it achieves the dual advantages of "unstructured data + semantic matching".
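To make "multi-path recall" and hybrid retrieval more concrete, here is a minimal sketch of one common way to merge results from several retrieval paths: reciprocal rank fusion (RRF). This is an illustration only. The article does not state which fusion method RAGFlow or Dify actually use, and the chunk IDs and the constant `k=60` are hypothetical choices for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks found by more than one retrieval path rise toward the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for stability.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from two recall paths for the same query.
vector_hits = ["table_3", "clause_7", "intro_1"]       # semantic (vector) path
keyword_hits = ["table_3", "appendix_2", "clause_7"]   # full-text path

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['table_3', 'clause_7', 'appendix_2', 'intro_1']
```

Chunks that both paths return (`table_3`, `clause_7`) end up ahead of chunks that only one path returns, which is the intuition behind combining vector and full-text retrieval. A reranking model could then reorder the fused list.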
Configuration steps (simplified version)
1. Deploy RAGFlow
- Clone the source code and start the Docker containers by running `docker-compose up -d` in the console (requires CPU ≥ 4 cores and memory ≥ 16 GB).
- Record the RAGFlow API address (such as `http://IP:9380`) and API Key.
2. Dify configuration
- Modify the `.env` file to enable the custom model and fill in the Ollama API address.
- Fill in RAGFlow’s API Endpoint, Key and knowledge base ID in Dify’s “External Knowledge Base”.
3. Effect verification
Upload a test document (such as a scanned contract or a complex table) and compare the results of RAGFlow's native search with those of the Dify integration. The latter returns more complete data and more coherent answers.
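For the verification step, it can help to see roughly what Dify sends when it queries an external knowledge base. The sketch below builds such a request by hand. It assumes Dify's external knowledge API convention (a POST to `<endpoint>/retrieval` with a Bearer key and a JSON body containing `knowledge_id`, `query`, and `retrieval_setting`) and assumes RAGFlow exposes a Dify-compatible endpoint under `/api/v1/dify`. Both should be checked against the versions you deploy; the helper names and placeholder values are hypothetical.

```python
import json
import urllib.request


def build_retrieval_request(endpoint, api_key, knowledge_id, query,
                            top_k=5, score_threshold=0.5):
    """Build (but do not send) a POST request to `<endpoint>/retrieval`."""
    payload = {
        "knowledge_id": knowledge_id,  # the RAGFlow knowledge base ID
        "query": query,
        "retrieval_setting": {"top_k": top_k, "score_threshold": score_threshold},
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/retrieval",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def retrieve(request, timeout=30):
    """Send the request and return the list under "records" (assumed response shape)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body.get("records", [])


# Example usage (placeholders; the endpoint path is an assumption):
# req = build_retrieval_request("http://IP:9380/api/v1/dify",
#                               "YOUR_RAGFLOW_API_KEY", "YOUR_KB_ID",
#                               "What is the termination clause?")
# for record in retrieve(req):
#     print(record.get("score"), record.get("content"))
```

Running the same query here and in Dify's retrieval test makes it easier to tell whether a poor answer comes from parsing and retrieval on the RAGFlow side or from the configuration on the Dify side.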
Precautions
- Hardware requirements: Make sure the server meets resource thresholds (CPU/memory/storage).
- Interface compatibility: RAGFlow's port 9380 must be open, and the API Key permissions must include knowledge base access.
- Port conflicts: In a local environment, RAGFlow and Dify both use ports 80 and 443 by default, so one of the two services will fail to start. To solve this, I suggest changing RAGFlow's default ports: in RAGFlow's `docker-compose.yml` file, map the container's port 80 to the host's port 8000 and the container's port 443 to the host's port 4333. After this change, the ports of RAGFlow and Dify no longer conflict.
- Model adaptation: It is recommended to turn off Dify's Rerank model and give priority to trusting RAGFlow's parsing results.
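The hardware and port precautions above can be checked with a small preflight script before starting the containers. This is a minimal sketch: the function names are hypothetical, the thresholds are the ones quoted in this article (4 cores, 16 GB), and the memory lookup relies on `os.sysconf` keys that are available on Linux only.

```python
import os
import socket


def meets_minimum(cpu_cores, memory_gib, min_cores=4, min_memory_gib=16):
    """True if the machine meets the CPU and memory thresholds from the article."""
    return cpu_cores >= min_cores and memory_gib >= min_memory_gib


def total_memory_gib():
    """Physical memory in GiB (Linux only).

    A machine sold as "16 GB" often reports slightly less than 16 GiB,
    so treat a near miss as a prompt to look closer, not a hard failure.
    """
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def port_is_listening(port, host="127.0.0.1", timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


# Example usage:
# print("hardware ok:", meets_minimum(os.cpu_count() or 0, total_memory_gib()))
# Before starting RAGFlow, the host ports it will use should NOT be listening yet:
# for port in (8000, 4333, 9380):
#     print(port, "already in use" if port_is_listening(port) else "free")
# After `docker-compose up -d`, port 9380 should be listening:
# print("RAGFlow API reachable:", port_is_listening(9380))
```

The connection test is used instead of trying to bind the port because binding ports below 1024, such as 80 and 443, usually requires elevated privileges on Linux.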
Scenario Value
Suitable for industries such as law, finance, and medicine that need to process large volumes of unstructured documents, for example:
- Quickly extracting key information from contract terms;
- Storing medical imaging reports and diagnostic records in structured form;
- Analyzing financial data in corporate annual reports in real time.