RAGFlow Automation Script Suite: Custom Parsing + Answer Quality Assessment + Parameter Automatic Tuning

Explore the RAGFlow automation script suite to improve the efficiency and flexibility of data processing and system configuration.
Core content:
1. Introduction to the automation script suite: data processing, system testing and parameter optimization
2. Advantages of data processing and system configuration scripts: automation, systematization, and programmability
3. Flexible configuration options and advanced data set configuration implementation
MCP (Model Context Protocol) has been quite popular recently. I have spent some time studying how it can work together with RAG, and I will publish my notes once I have organized them.
Back to the topic. The previous article introduced RAGFlow's Python API in detail. Today I will give a simple demonstration based on several legal documents, as a starting point for discussion. This article introduces three example scripts: a data processing and system configuration script, a system test script, and a parameter optimization script. Compared with RAGFlow's web interface, this set of automation scripts offers three key advantages:
Automation and efficiency: Reduce operations that take hours by hand to a fully automated process of a few minutes
Systematic and repeatable: Ensure the objectivity, systematicity and repeatability of the testing and optimization process
Programmable and scalable: Configuration, test methods and optimization strategies can be adjusted according to specific needs
Scripts of this kind can be regarded as an "enhanced supporting tool" for RAGFlow: they extend the base platform through code and adapt more flexibly to specialized business scenarios.
The source code has been published on Knowledge Planet.
1 Data processing and system configuration scripts
1.1 Advantages over the web interface
Automated process handling
One-click configuration: Full automation of the entire process from dataset creation, document upload to chat assistant configuration
Batch processing capability: can process documents in an entire directory at once
Process control: Automatically wait for the document to be parsed before creating the chat assistant to ensure a reasonable process sequence
Flexibility and Customizability
Flexible parameter adjustment: You can adjust various parameters directly in the code without clicking them one by one in the interface
Conditional processing: You can add logical judgments to perform different operations according to different situations
Error handling: Built-in error handling mechanism, providing more detailed information when problems occur
Reusability
Environment replication: The same configuration script can be reused in different environments
Version control: Configuration can be included in the code version control system to facilitate tracking changes
Standardized deployment: ensure that different instances use exactly the same configuration
Integration capabilities
Integration with other systems: can run as part of a larger workflow
Scheduled tasks: can be automatically run as scheduled tasks
Connect with test scripts: can be seamlessly connected with test scripts to automatically complete configuration and testing
1.2 Configuration options currently implemented
Dataset configuration:
Customize dataset name and description
Configure the embedding model used (BAAI/bge-m3)
Use a chunking method designed for legal documents (chunk_method="laws")
Document processing:
Supports automatic processing of multiple document formats (docx, doc, pdf, txt)
Bulk upload documents
Parse documents asynchronously and monitor progress
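The "parse asynchronously and monitor progress" step can be sketched as a simple polling loop. This is only a minimal sketch, not the author's script: `wait_for_parsing` and its parameters are hypothetical names, and it assumes that each object returned by `dataset.list_documents(id=...)` exposes a `run` status string with values such as "RUNNING", "DONE", "FAIL" and "CANCEL". Check this against the API reference for your RAGFlow version.

```python
import time


def wait_for_parsing(dataset, doc_ids, poll_seconds=5.0, timeout_seconds=1800.0,
                     sleep=time.sleep):
    """Poll until every document has finished parsing; return the IDs that failed.

    Assumption (not confirmed by the article): documents expose a `run`
    attribute that is "DONE" on success and "FAIL" or "CANCEL" otherwise.
    """
    pending = set(doc_ids)
    failed = []
    deadline = time.monotonic() + timeout_seconds
    while pending:
        for doc_id in sorted(pending):
            docs = dataset.list_documents(id=doc_id)
            if not docs:
                continue  # not visible yet; check again on the next round
            status = getattr(docs[0], "run", None)
            if status == "DONE":
                pending.discard(doc_id)
            elif status in ("FAIL", "CANCEL"):
                pending.discard(doc_id)
                failed.append(doc_id)
        if pending:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Parsing did not finish for: {sorted(pending)}")
            sleep(poll_seconds)
    return failed
```

In a configuration script, this would typically be called right after `dataset.async_parse_documents(doc_ids)` and before the chat assistant is created, so that the assistant is only built on fully parsed documents.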
Chat Assistant Configuration:
Custom assistant name
Linked to the created legal and regulatory dataset
1.3 Other configuration options that can be added
Advanced Dataset Configuration
def create_legal_dataset(rag_object, dataset_name="Legal Knowledge Base"):
    # Add more advanced configurations
    dataset = rag_object.create_dataset(
        name=dataset_name,
        description="Contains legal and regulatory documents such as the Biosafety Law",
        embedding_model="BAAI/bge-m3",
        chunk_method="laws",
        permission="team",  # Set to team accessible
        parser_config={
            "raptor": {"use_raptor": False}
        }  # Add specific parser configuration for legal documents
    )
    return dataset
Document metadata configuration
import os

def upload_documents_with_metadata(dataset, docs_path):
    documents = []
    for filename in os.listdir(docs_path):
        if filename.endswith(('.docx', '.doc', '.pdf', '.txt')):
            file_path = os.path.join(docs_path, filename)
            with open(file_path, "rb") as f:
                blob = f.read()
            # Add metadata
            documents.append({
                "display_name": filename,
                "blob": blob,
                "meta_fields": {
                    "Legal Type": "Administrative Regulations" if "Regulations" in filename else "Laws",
                    "Promulgation Year": filename.split("(")[1].split(")")[0] if "(" in filename else "Unknown",
                    "Effectiveness Level": "National Level"
                }
            })
    dataset.upload_documents(documents)
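The filename parsing in the function above only recognizes half-width parentheses, and it accepts any text between them as the year. A slightly more defensive variant might look like the sketch below. The helpers `extract_year` and `build_meta_fields` are hypothetical names, and the sketch assumes filenames carry a four-digit year in half-width or full-width parentheses, for example "Biosafety Law (2020).pdf"; adjust the pattern to your own naming scheme.

```python
import re

# Four-digit year inside half-width () or full-width （） parentheses.
_YEAR_PATTERN = re.compile(r"[(（]\s*(\d{4})\s*[)）]")


def extract_year(filename, default="Unknown"):
    """Return the year found in the filename, or `default` if there is none."""
    match = _YEAR_PATTERN.search(filename)
    return match.group(1) if match else default


def build_meta_fields(filename):
    """Build the same three metadata fields as above from a filename."""
    return {
        "Legal Type": "Administrative Regulations" if "Regulations" in filename else "Laws",
        "Promulgation Year": extract_year(filename),
        "Effectiveness Level": "National Level",
    }
```

With this in place, the `"meta_fields"` entry of each document could be filled with `build_meta_fields(filename)` instead of the inline expressions.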
Document parsing custom configuration
def customize_document_parsing(dataset, doc_ids):
    # Get the document and update the parsing configuration
    for doc_id in doc_ids:
        docs = dataset.list_documents(id=doc_id)
        if docs:
            doc = docs[0]
            # Update document parsing configuration
            doc.update({
                "chunk_method": "laws",
                "parser_config": {
                    "raptor": {"use_raptor": True}
                }
            })
    # Then parse the document
    dataset.async_parse_documents(doc_ids)
Chat Assistant Advanced Configuration
def create_legal_assistant(rag_object, dataset_ids, assistant_name="Legal Assistant"):
    # Accept either a single dataset ID or a list of IDs
    if isinstance(dataset_ids, str):
        dataset_ids = [dataset_ids]
    # Create a chat assistant with advanced configuration
    assistant = rag_object.create_chat(
        name=assistant_name,
        dataset_ids=list(dataset_ids),
        llm={
            "model_name": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
            "temperature": 0.1,
            "top_p": 0.3,
            "presence_penalty": 0.2,
            "frequency_penalty": 0.7,
            "max_token": 1024
        },
        prompt={
            "similarity_threshold": 0.2,  # Set similarity threshold
            "keywords_similarity_weight": 0.7,  # Keyword similarity weight
            "top_n": 8,  # Retrieve the top N results
            "rerank_model": "BAAI/bge-reranker-v2-m3",  # Use reranking model
            "prompt": """You are a professional legal consultant who is well versed in Chinese laws and regulations, especially relevant laws and regulations such as the Biosafety Law. Please accurately answer the user's questions based on the retrieved legal provisions.
When answering, please:
1. Cite the specific legal provision number
2. Explain the meaning of the legal provision
3. If necessary, explain the relationship between the provisions
4. Be objective and do not add personal opinions
5. If the search results are not sufficient to answer the question, please clearly state so

{knowledge}"""
        }
    )
    return assistant
Multi-dataset association and permission management
def setup_multiple_datasets(rag_object):
    # Create multiple thematic datasets
    datasets = []
    topics = ["Biosafety Law", "Law on the Prevention and Control of Infectious Diseases", "Wildlife Protection Law"]
    for topic in topics:
        dataset = rag_object.create_dataset(
            name=f"{topic} Knowledge Base",
            description=f"Analysis of laws and regulations specifically for {topic}",
            embedding_model="BAAI/bge-m3",
            chunk_method="laws",
            permission="team"  # Team sharing
        )
        datasets.append(dataset)
    # Create a comprehensive legal assistant and associate all datasets
    dataset_ids = [dataset.id for dataset in datasets]
    assistant = create_legal_assistant(rag_object, dataset_ids, "Comprehensive Legal and Regulatory Consultant")
    return datasets, assistant
The advanced configurations above can be combined and adjusted to fit your actual needs; you do not have to follow my approach exactly.
2 System test script
This script automatically generates different types of test questions, collects the system's responses, evaluates response quality with a large model, and produces a detailed evaluation report.
It supports systematic testing of four typical legal question types (direct citation, concept interpretation, scenario application, and cross-clause association). Compared with the web interface, it offers a more comprehensive and objective way to test and evaluate the system automatically.
2.1 Test question classification
Four types of test questions are designed:
Direct citation: asks about the content of a specific clause
Concept interpretation: asks for the definition of a concept in the law
Scenario application: describes a practical scenario and asks which legal provisions apply
Cross-clause association: questions whose answers must draw on multiple clauses
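One possible way to represent these four categories in a test script is a small enum plus a question record. This is a sketch under assumptions, not the author's code: `QuestionType`, `LegalQuestion`, `group_by_type` and the sample question texts are hypothetical placeholders, and real questions would be written or generated from the actual documents in the dataset.

```python
from dataclasses import dataclass
from enum import Enum


class QuestionType(Enum):
    DIRECT_CITATION = "direct citation"
    CONCEPT_INTERPRETATION = "concept interpretation"
    SCENARIO_APPLICATION = "scenario application"
    CROSS_CLAUSE = "cross-clause association"


@dataclass(frozen=True)
class LegalQuestion:
    question_type: QuestionType
    text: str


def group_by_type(questions):
    """Group questions by category so each category can be reported separately."""
    grouped = {qtype: [] for qtype in QuestionType}
    for question in questions:
        grouped[question.question_type].append(question)
    return grouped


# Illustrative placeholders only; replace with questions based on your documents.
SAMPLE_QUESTIONS = [
    LegalQuestion(QuestionType.DIRECT_CITATION,
                  "What does Article 2 of the Biosafety Law stipulate?"),
    LegalQuestion(QuestionType.CONCEPT_INTERPRETATION,
                  "How does the Biosafety Law define 'biosafety'?"),
    LegalQuestion(QuestionType.SCENARIO_APPLICATION,
                  "A laboratory plans to handle pathogenic microorganisms. Which provisions apply?"),
    LegalQuestion(QuestionType.CROSS_CLAUSE,
                  "Which provisions together govern the reporting of biosafety incidents?"),
]
```

Keeping the category on each question makes it straightforward to report scores per question type later, rather than as a single average.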
2.2 Evaluation metrics
The quality of each response is evaluated along five dimensions:
Accuracy: Does the answer cite the correct legal provision?
Completeness: Are all relevant clauses included?
Quality of interpretation: whether the interpretation of legal provisions is clear and accurate
Citation format: Is the clause number cited correctly?
Overall rating: Overall evaluation based on the above points
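The evaluation step could be sketched as two helpers: one builds a prompt for the judging model, the other parses its reply into scores. Everything here is an assumption for illustration: the function names, the dimension keys, the 1 to 10 scale, and the JSON reply format are not taken from the author's script, and the call to the judging model itself is left out.

```python
import json

# Hypothetical keys for the first four dimensions; "overall" is handled separately.
DIMENSIONS = ("accuracy", "completeness", "interpretation_quality", "citation_format")


def build_judge_prompt(question, answer):
    """Build a prompt asking the judging model to score an answer as JSON."""
    keys = ", ".join(DIMENSIONS + ("overall",))
    return (
        "You are evaluating an answer to a legal question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Score each of the following from 1 to 10 and reply with JSON only, "
        f"using exactly these keys: {keys}."
    )


def parse_scores(reply):
    """Extract scores from the judge's reply, tolerating text around the JSON."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    raw = json.loads(reply[start:end + 1])
    scores = {}
    for name in DIMENSIONS:
        value = float(raw[name])
        if not 1 <= value <= 10:
            raise ValueError(f"{name} score out of range: {value}")
        scores[name] = value
    # Fall back to the mean of the four dimensions if no overall score is given.
    overall = raw.get("overall")
    scores["overall"] = (float(overall) if overall is not None
                         else sum(scores.values()) / len(DIMENSIONS))
    return scores
```

Asking for JSON with fixed keys keeps the judge's output machine-readable, which is what makes an aggregated report across many questions practical.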
3 Parameter optimization script
This script automatically tests multiple parameter combinations, creates temporary test assistants, evaluates the performance of each combination, and identifies the best configuration. As a first exploration, you can compare different combinations of similarity threshold, keyword weight, and number of returned documents.
By comparison, in the web interface you can only adjust one set of parameters at a time by hand and then judge the result subjectively, whereas a script of this kind can compare multiple sets of parameters automatically. Note that the parameter optimization scheme listed here is only an example; adjust it to your specific business needs.
The script uses grid search to test different parameter combinations:
Similarity threshold (similarity_threshold): [0.1, 0.15, 0.2, 0.25]
Keyword weight (keywords_similarity_weight): [0.6, 0.7, 0.8, 0.9]
Number of returned items (top_n): [8, 10, 12, 15]
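The grid above can be enumerated with `itertools.product`. The following is a minimal sketch rather than the author's script: `iter_combinations`, `grid_search` and the `evaluate` callback are hypothetical names. In a real run, `evaluate` would create a temporary assistant with the given parameters, ask the test questions, and return an average score; here it is left to the caller.

```python
from itertools import product

PARAM_GRID = {
    "similarity_threshold": [0.1, 0.15, 0.2, 0.25],
    "keywords_similarity_weight": [0.6, 0.7, 0.8, 0.9],
    "top_n": [8, 10, 12, 15],
}


def iter_combinations(grid):
    """Yield every parameter combination in the grid as a dict."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(grid, evaluate):
    """Return (best_params, best_score) for a caller-supplied scoring function."""
    best_params, best_score = None, float("-inf")
    for params in iter_combinations(grid):
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With four values per parameter, the grid contains 4 × 4 × 4 = 64 combinations, and each one requires a full pass over the test questions, so it can be worth starting with a smaller grid or fewer questions.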
4 Other
In addition to the techniques mentioned above, you can also test different embedding models and reranking models, and combine automatic evaluation with manual evaluation.
In short, a combination of scripts designed around the structure of your documents and your business goals will usually be faster, better and cheaper.