RAGFlow Automation Script Suite: Custom Parsing + Answer Quality Assessment + Parameter Automatic Tuning

Written by

Audrey Miles

Updated on:July-09th-2025

MCP (Model Context Protocol) has been getting a lot of attention recently. I have spent some time studying how it can work alongside RAG, and I will publish my notes once I have organized them.

Back to the topic. The previous article covered RAGFlow's Python API in detail. Today I will give a simple demonstration, using several legal documents, as a starting point for discussion. This article introduces three example scripts: one for data processing and system configuration, one for system testing, and one for parameter optimization. Compared with RAGFlow's web interface, this set of automation scripts offers three key advantages:

Automation and efficiency: Operations that take hours to complete by hand are reduced to a fully automated process that takes a few minutes

Systematic and repeatable: Ensure the objectivity, systematicity and repeatability of the testing and optimization process

Programmable and scalable: Configuration, test methods and optimization strategies can be adjusted according to specific needs

Scripts of this kind can be regarded as an "enhanced supporting tool" for RAGFlow: they extend the capabilities of the base platform through code and adapt more flexibly to in-depth business scenarios.

The source code has been published on Knowledge Planet.

Data processing and system configuration scripts

1.1 Advantages over the web interface

Automated process handling

One-click configuration: Full automation of the entire process from dataset creation, document upload to chat assistant configuration

Batch processing capability: can process documents in an entire directory at once

Process control: Automatically wait for the document to be parsed before creating the chat assistant to ensure a reasonable process sequence

Flexibility and Customizability

Flexible parameter adjustment: You can adjust various parameters directly in the code without clicking them one by one in the interface

Conditional processing: You can add logical judgments to perform different operations according to different situations

Error handling: Built-in error handling mechanism, providing more detailed information when problems occur

Reusability

Environment replication: The same configuration script can be reused in different environments

Version control: Configuration can be included in the code version control system to facilitate tracking changes

Standardized deployment: ensure that different instances use exactly the same configuration

Integration capabilities

Integration with other systems: The script can run as one step of a larger workflow

Scheduled tasks: can be automatically run as scheduled tasks

Connection with test scripts: The configuration script can hand off directly to the test scripts, so configuration and testing complete in one automated run
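The reusability points above (environment replication, version control, standardized deployment) depend on keeping the configuration in a file rather than in someone's memory of the web form. The following is a minimal sketch under stated assumptions, not the author's script: `PipelineConfig`, its field names, `save_config`, and `load_config` are all hypothetical, and the default values are simply the ones mentioned in this article.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings bundle; the field names are illustrative only."""
    dataset_name: str = "Legal Knowledge Base"
    embedding_model: str = "BAAI/bge-m3"
    chunk_method: str = "laws"
    docs_path: str = "./docs"
    assistant_name: str = "Legal Assistant"
    file_extensions: tuple = (".docx", ".doc", ".pdf", ".txt")


def save_config(config: PipelineConfig, path: str) -> None:
    """Write the configuration as JSON so it can be committed to version control."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(asdict(config), f, ensure_ascii=False, indent=2)


def load_config(path: str) -> PipelineConfig:
    """Read a configuration file back into a PipelineConfig."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    if "file_extensions" in raw:
        # JSON has no tuple type, so the list is converted back
        raw["file_extensions"] = tuple(raw["file_extensions"])
    return PipelineConfig(**raw)
```

The same JSON file can then be loaded in a test environment and in production, so both instances are built from identical settings.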

1.2 Configuration options currently implemented

Dataset configuration:

Customize dataset name and description

Configure the embedding model used (BAAI/bge-m3)

Use a chunking method designed for legal documents (chunk_method="laws")

Document processing:

Supports automatic processing of multiple document formats (docx, doc, pdf, txt)

Bulk upload documents

Parse documents asynchronously and monitor progress

Chat Assistant Configuration:

Customize the assistant name

Link the assistant to the created legal and regulatory dataset
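The "parse asynchronously and monitor progress" step, and the requirement that the chat assistant is only created after parsing has finished, can be sketched as a polling loop. This is a rough sketch rather than the author's implementation: `wait_for_parsing`, `poll_interval`, and `timeout` are hypothetical names, and the code assumes that each document returned by `dataset.list_documents(id=...)` exposes a `run` status string such as "DONE", "FAIL", or "CANCEL".

```python
import time


def wait_for_parsing(dataset, doc_ids, poll_interval=5.0, timeout=1800.0):
    """Block until every document is parsed; raise on failure or timeout.

    Assumption: dataset.list_documents(id=...) returns objects with a
    `run` attribute holding the parsing status as a string.
    """
    pending = set(doc_ids)
    deadline = time.monotonic() + timeout
    while pending:
        for doc_id in list(pending):
            docs = dataset.list_documents(id=doc_id)
            if not docs:
                raise LookupError(f"document {doc_id} not found")
            status = docs[0].run
            if status == "DONE":
                pending.discard(doc_id)
            elif status in ("FAIL", "CANCEL"):
                raise RuntimeError(f"parsing of {doc_id} ended with status {status}")
        if pending:
            if time.monotonic() > deadline:
                raise TimeoutError(f"still parsing after {timeout}s: {sorted(pending)}")
            time.sleep(poll_interval)
```

A typical call order would be `dataset.async_parse_documents(doc_ids)`, then `wait_for_parsing(dataset, doc_ids)`, and only then the creation of the chat assistant.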

1.3 Other configuration options that can be added

Advanced Dataset Configuration

    def create_legal_dataset(rag_object, dataset_name="Legal Knowledge Base"):
        """Create a dataset with additional configuration for legal documents."""
        dataset = rag_object.create_dataset(
            name=dataset_name,
            description="Contains legal and regulatory documents such as the Biosafety Law",
            embedding_model="BAAI/bge-m3",
            chunk_method="laws",
            permission="team",  # Accessible to the whole team
        )
        # Parser configuration specific to legal documents, applied through
        # update(), which accepts a plain dict
        dataset.update({"parser_config": {"raptor": {"use_raptor": False}}})
        return dataset

Document metadata configuration

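    import os

    SUPPORTED_EXTENSIONS = (".docx", ".doc", ".pdf", ".txt")


    def build_legal_metadata(filename):
        """Derive metadata fields from the file name."""
        return {
            "Legal Type": "Administrative Regulations" if "Regulations" in filename else "Laws",
            # The year is expected inside full-width parentheses, e.g. "（2020）"
            "Promulgation Year": filename.split("（")[1].split("）")[0] if "（" in filename else "Unknown",
            "Effectiveness Level": "National Level",
        }


    def upload_documents_with_metadata(dataset, docs_path):
        documents = []
        for filename in os.listdir(docs_path):
            if filename.endswith(SUPPORTED_EXTENSIONS):
                with open(os.path.join(docs_path, filename), "rb") as f:
                    documents.append({"display_name": filename, "blob": f.read()})
        if not documents:
            return []

        # Upload first, then attach the metadata to each returned document
        uploaded = dataset.upload_documents(documents)
        for doc in uploaded:
            doc.update({"meta_fields": build_legal_metadata(doc.name)})
        return uploaded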

Document parsing custom configuration

    def customize_document_parsing(dataset, doc_ids):
        """Update the parsing configuration of each document, then parse them."""
        for doc_id in doc_ids:
            docs = dataset.list_documents(id=doc_id)
            if docs:
                doc = docs[0]
                # Update the document's parsing configuration
                doc.update({
                    "chunk_method": "laws",
                    "parser_config": {"raptor": {"use_raptor": True}},
                })
        # Parse the documents once their configuration has been updated
        dataset.async_parse_documents(doc_ids)

Chat Assistant Advanced Configuration

    LEGAL_PROMPT = """You are a professional legal consultant who is well versed in Chinese laws and regulations, especially the Biosafety Law and related legislation. Answer the user's question accurately based on the retrieved legal provisions. When answering:
    1. Cite the specific provision number
    2. Explain the meaning of the provision
    3. If necessary, explain how the provisions relate to each other
    4. Stay objective and do not add personal opinions
    5. If the retrieved results are not sufficient to answer the question, say so clearly

    {knowledge}"""


    def create_legal_assistant(rag_object, dataset_ids, assistant_name="Legal Assistant"):
        """Create a chat assistant with advanced configuration.

        `dataset_ids` may be a single dataset ID or a list of IDs.
        """
        if isinstance(dataset_ids, str):
            dataset_ids = [dataset_ids]
        assistant = rag_object.create_chat(name=assistant_name, dataset_ids=list(dataset_ids))
        # Apply the settings through update(), which accepts plain dicts
        assistant.update({
            "llm": {
                "model_name": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
                "temperature": 0.1,
                "top_p": 0.3,
                "presence_penalty": 0.2,
                "frequency_penalty": 0.7,
                "max_tokens": 1024,
            },
            "prompt": {
                "similarity_threshold": 0.2,  # Similarity threshold
                "keywords_similarity_weight": 0.7,  # Keyword similarity weight
                "top_n": 8,  # Number of retrieved results passed to the model
                "rerank_model": "BAAI/bge-reranker-v2-m3",  # Reranking model
                "prompt": LEGAL_PROMPT,
            },
        })
        return assistant

Multi-dataset association and permission management

    def setup_multiple_datasets(rag_object):
        """Create one dataset per topic and one assistant linked to all of them."""
        topics = [
            "Biosafety Law",
            "Law on the Prevention and Control of Infectious Diseases",
            "Wildlife Protection Law",
        ]
        datasets = []
        for topic in topics:
            dataset = rag_object.create_dataset(
                name=f"{topic} Knowledge Base",
                description=f"Analysis of laws and regulations specifically for {topic}",
                embedding_model="BAAI/bge-m3",
                chunk_method="laws",
                permission="team",  # Team sharing
            )
            datasets.append(dataset)

        # Create a comprehensive legal assistant and associate it with all datasets
        dataset_ids = [dataset.id for dataset in datasets]
        assistant = create_legal_assistant(
            rag_object, dataset_ids, "Comprehensive Legal and Regulatory Consultant"
        )
        return datasets, assistant

The advanced configurations above can be combined and adjusted to fit your actual needs; there is no need to follow my implementation exactly.

System test script

The test script automatically generates different types of test questions, collects the system's responses, uses a large model to evaluate response quality, and produces a detailed evaluation report.

It supports systematic testing of four typical types of legal question (direct citation, concept interpretation, scenario application, and cross-clause association). Compared with the web interface, it is a more comprehensive and objective tool for automated testing and evaluation.

2.1 Test question classification

Four types of test questions were designed:

Direct citation: Asks about the content of a specific clause

Concept interpretation: Asks for the definition of a concept in the law

Scenario application: Presents a concrete scenario and asks which legal provisions apply

Cross-clause association: Asks a question that can only be answered by combining several clauses
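The four question types above can be represented as tagged records, so that results can later be broken down by category. The sketch below is only an illustration: `LegalQuestion`, `TEMPLATES`, `make_question`, and `collect_answers` are hypothetical names, the template wording is made up for the example, and `ask` stands in for whatever function sends a question to the chat assistant and returns its answer text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical question templates, one per category described above
TEMPLATES = {
    "direct_citation": "What does Article {article} of the {law} stipulate?",
    "concept_interpretation": "How does the {law} define '{concept}'?",
    "scenario_application": "{scenario} Which provisions of the {law} apply?",
    "cross_clause_association": "Which provisions of the {law} jointly regulate {topic}?",
}


@dataclass
class LegalQuestion:
    category: str
    question: str


def make_question(category: str, **fields) -> LegalQuestion:
    """Fill in the template for the given category."""
    if category not in TEMPLATES:
        raise ValueError(f"unknown category: {category}")
    return LegalQuestion(category, TEMPLATES[category].format(**fields))


def collect_answers(questions: Iterable[LegalQuestion], ask: Callable[[str], str]) -> list:
    """Send each question through `ask` and keep the category with the answer."""
    return [
        {"category": q.category, "question": q.question, "answer": ask(q.question)}
        for q in questions
    ]
```

Keeping the category next to each answer makes it possible to report, for example, that scenario questions score lower than direct citation questions.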

2.2 Evaluation metrics

The quality of responses was evaluated along five dimensions:

Accuracy: Does the answer cite the correct legal provision?

Completeness: Are all relevant clauses included?

Quality of interpretation: Is the interpretation of the legal provisions clear and accurate?

Citation format: Is the clause number cited correctly?

Overall rating: An overall evaluation based on the four dimensions above
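One way to turn these five dimensions into numbers is to ask the judging model for a JSON object and validate it before aggregating. The following is a sketch under stated assumptions and not the author's evaluation code: the dimension keys, the 1 to 10 scale, and the function names `build_judge_prompt`, `parse_scores`, and `summarize` are all hypothetical, and the call to the judging model itself is left out.

```python
import json
import re
from statistics import mean

# Hypothetical keys for the five dimensions listed above
DIMENSIONS = ("accuracy", "completeness", "interpretation_quality",
              "citation_format", "overall")


def build_judge_prompt(question: str, answer: str) -> str:
    """Build the instruction sent to the judging model (assumed 1-10 scale)."""
    keys = ", ".join(DIMENSIONS)
    return (
        "Rate the following answer to a legal question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Reply with a JSON object containing the keys {keys}, "
        "each scored from 1 to 10."
    )


def parse_scores(judge_reply: str) -> dict:
    """Extract the JSON object from the reply and validate every score."""
    match = re.search(r"\{.*\}", judge_reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the judge reply")
    data = json.loads(match.group(0))
    scores = {}
    for dim in DIMENSIONS:
        value = float(data[dim])
        if not 1 <= value <= 10:
            raise ValueError(f"{dim} score out of range: {value}")
        scores[dim] = value
    return scores


def summarize(score_list: list) -> dict:
    """Average each dimension over all evaluated answers."""
    return {dim: round(mean(s[dim] for s in score_list), 2) for dim in DIMENSIONS}
```

Validating the reply before averaging matters because a judging model sometimes wraps the JSON in extra prose or omits a key, and silently skipping such replies would bias the report.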

Parameter optimization script

The parameter optimization script automatically tests multiple parameter combinations, creates a temporary test assistant for each one, evaluates how each combination performs, and identifies the best configuration. As a first pass, you can explore different combinations of similarity threshold, keyword weight, and number of returned documents.

By comparison, the web interface only lets you adjust one set of parameters at a time by hand and then judge the result subjectively, whereas a script of this kind can compare many parameter sets automatically. Note that the optimization scheme listed here is only an example; adjust it to your specific business needs.

Use a grid search method to test different parameter combinations:

Similarity threshold: [0.1, 0.15, 0.2, 0.25]

Keyword weight: [0.6, 0.7, 0.8, 0.9]

Number of returned items (top_n): [8, 10, 12, 15]
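The grid above gives 4 x 4 x 4 = 64 combinations. A minimal sketch of the search loop follows, assuming the three values map to the `similarity_threshold`, `keywords_similarity_weight`, and `top_n` prompt settings used earlier in this article. `iter_combinations`, `grid_search`, and the `evaluate` callback are hypothetical names; in a real run, `evaluate` would create a temporary assistant with the given parameters, run the test questions, and return an average score.

```python
from itertools import product

PARAM_GRID = {
    "similarity_threshold": [0.1, 0.15, 0.2, 0.25],
    "keywords_similarity_weight": [0.6, 0.7, 0.8, 0.9],
    "top_n": [8, 10, 12, 15],
}


def iter_combinations(grid: dict):
    """Yield one dict per point of the Cartesian product of the grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


def grid_search(grid: dict, evaluate):
    """Score every combination with `evaluate` (higher is better).

    Returns the best parameters, their score, and the full result list.
    """
    results = [(params, evaluate(params)) for params in iter_combinations(grid)]
    best_params, best_score = max(results, key=lambda item: item[1])
    return best_params, best_score, results
```

Because every combination triggers a full test run, the cost grows with the product of the list lengths, so it can be worth starting with a coarser grid and refining around the best region.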

Other

In addition to the techniques mentioned above, you can also test different embedding models and reranking models, and combine automatic evaluation with manual review.

In any case, a combination of scripts designed around the structure of your documents and your business goals is usually faster, more reliable, and cheaper than tuning everything by hand in the web interface.