Build a DeepSeek test case generation system knowledge base in 8 minutes

Written by
Jasper Cole
Updated on: July 8, 2025
Recommendation

Quickly build the DeepSeek test case generation system to improve software testing expertise.

Core content:
1. The importance of knowledge base in AI test case generation
2. System architecture and key technical point analysis
3. Implementation details of knowledge base construction and enhanced search engine

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

1. Background and system positioning

Previously, I shared two articles in the 8-minute series on DeepSeek empowering software testing, which drew many like-minded readers into the discussion. Those articles built the basic test case generation capability; today I will focus on the knowledge base.

Build a DeepSeek-powered test case tool in 8 minutes
(Polaris School, WeChat Official Account)

Build a DeepSeek API intelligent testing engine in 8 minutes: before the coffee gets cold, the test report is ready
(Polaris School, WeChat Official Account)


Building on that foundation, this system introduces retrieval-augmented generation (RAG), integrating domain documents and historical test case data so that the generated results better match real business scenarios.

1.1 Why do we need a knowledge base?

Traditional AI generation solutions have two major pain points:

  1. Lack of domain knowledge
    Large models have not seen enterprise-private documents (such as requirement specifications and interface documents)
  2. Wasted historical experience
    Past test cases are not effectively reused

This system addresses both through a lightweight RAG architecture (no vector database required):

  • Intelligent parsing of PDF documents ➡️ builds the domain knowledge base
  • Semantic retrieval of historical use cases ➡️ forms an experience-reuse mechanism
  • Dynamic prompt enhancement ➡️ improves the professionalism of generated cases
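The three-step flow above can be sketched end to end. The following is a minimal illustration only: the `retrieve` and `build_prompt` helper names and the sample segments are hypothetical, and the actual model call (sending the prompt to the DeepSeek API) is left out:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, segments, top_k=2):
    """Rank knowledge segments against the query with TF-IDF + cosine similarity."""
    texts = [s["content"] for s in segments] + [query]
    tfidf = TfidfVectorizer().fit_transform(texts)
    scores = cosine_similarity(tfidf[-1], tfidf[:-1])[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [segments[i] for i in ranked]

def build_prompt(query, hits):
    """Inject the retrieved segments into the generation prompt."""
    refs = "\n".join(f"- {h['content']}" for h in hits)
    return (f"References:\n{refs}\n\n"
            f"Requirement: {query}\nGenerate test cases as a JSON array.")

# Toy knowledge base (contents are illustrative)
segments = [
    {"content": "Registration requires a valid 11-digit mobile phone number."},
    {"content": "Passwords must be 8-20 characters with at least one digit."},
    {"content": "The admin console exports reports as PDF."},
]
query = "mobile phone number login"
prompt = build_prompt(query, retrieve(query, segments))
```

The prompt now carries the phone-number registration rule, which is exactly why the enhanced generation in the demo below "knows" about mobile-number login.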

Watch the video demo first:
  1. Upload documents to the knowledge base
  2. First generation, without knowledge base enhancement → the designed test cases have nothing to do with mobile-phone-number login
  3. Second generation, with knowledge base enhancement → the designed test cases cover registering with a mobile phone number and many more details

2. Core Logic Analysis

2.1 System Architecture Overview


2.2 Description of key technical points

2.2.1 Knowledge base building module

import re
import PyPDF2

def process_pdf(uploaded_file):
    """Extract text from a PDF and split it into structured knowledge segments."""
    pdf_reader = PyPDF2.PdfReader(uploaded_file)
    segments = []
    for page_num, page in enumerate(pdf_reader.pages):
        text = page.extract_text() or ""
        # Simple segmentation rule: split on blank lines into natural paragraphs
        paragraphs = re.split(r'\n\s*\n', text)
        for i, paragraph in enumerate(paragraphs):
            if len(paragraph.strip()) < 20:  # filter invalid short text
                continue
            # Structured storage
            segments.append({
                'segment_id': f"{uploaded_file.name}_{page_num}_{i}",
                'document_name': uploaded_file.name,
                'page_num': page_num + 1,
                'content': paragraph
            })
    return segments

Key points:

  • Each segment gets a unique ID
  • Text is split on natural paragraphs, preserving contextual semantics
  • Invalid short text (<20 characters) is filtered out

2.2.2 Enhanced search engine

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_cases(new_req, df, top_k=3):
    # TF-IDF vectorization over historical requirements plus the new one
    # (assumes df has a 'requirement' text column)
    corpus = df['requirement'].tolist() + [new_req]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)
    # Cosine similarity between the new requirement (last row) and history
    similarity = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])
    top_indices = similarity[0].argsort()[-top_k:][::-1]
    return top_indices

Design considerations:

  • Simpler to implement than the BM25 algorithm
  • Computationally efficient: linear-time scoring, real-time responses on thousands of records
  • Highly interpretable results, convenient for debugging
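As a quick sanity check, here is a self-contained toy run of the same TF-IDF retrieval logic; the `requirement` column name and the sample data are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy history of past test-case requirements (column name is illustrative)
df = pd.DataFrame({"requirement": [
    "User logs in with mobile phone number and SMS code",
    "Admin exports the monthly report to Excel",
    "User resets password via email link",
]})

new_req = "Login with phone number verification code"
corpus = df["requirement"].tolist() + [new_req]
tfidf = TfidfVectorizer().fit_transform(corpus)
# Compare the new requirement (last row) against all historical rows
scores = cosine_similarity(tfidf[-1], tfidf[:-1])[0]
top_indices = scores.argsort()[::-1][:2]
print(df.loc[top_indices[0], "requirement"])
```

The phone-login case shares the most terms with the new requirement, so it ranks first.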

2.2.3 Dynamic Prompt Engineering

system_prompt = f"""References:
Document {item['document']}, page {item['page']}: {item['content']}
Historical case {idx + 1}: {case}
Generation requirements:
1. JSON array format...
"""

Enhancement strategies:

  • Knowledge fragments are truncated (each segment ≤ 512 characters)
  • Priority ordering: domain knowledge > historical use cases
  • Strict format constraints (JSON schema injection)
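Those three strategies can be combined when assembling the prompt context. The sketch below is illustrative; the `assemble_context` helper name and its inputs are assumptions, not the system's exact implementation:

```python
def assemble_context(knowledge_items, history_cases, max_seg_len=512):
    """Assemble prompt context: domain knowledge first, then historical
    cases, truncating each knowledge fragment to max_seg_len characters."""
    lines = ["References:"]
    for item in knowledge_items:  # domain knowledge takes priority
        lines.append(f"Document {item['document']}, page {item['page']}: "
                     f"{item['content'][:max_seg_len]}")
    for idx, case in enumerate(history_cases):  # historical cases come second
        lines.append(f"Historical case {idx + 1}: {case}")
    return "\n".join(lines)

# Demo: an over-long fragment gets cut to 512 characters
context = assemble_context(
    [{"document": "spec.pdf", "page": 3, "content": "x" * 1000}],
    ["login succeeds with a registered phone number"],
)
```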

3. Analysis of Key Technology Selection

3.1 What is RAG?

Retrieval-Augmented Generation improves generation quality through the following pipeline:

User question → knowledge retrieval → prompt enhancement → large model generation → result output

Differences from traditional generation:

  • Real-time knowledge
    No model retraining needed
  • Data security
    Sensitive information never leaves your environment
  • Controllable results
    Retrieved content guides the generation direction

3.2 Why not use a vector database?

Although vector databases (such as ChromaDB) are widely used in RAG, this system instead uses TF-IDF plus CSV file storage, for the following reasons:

Dimension              | Vector database solution        | This system's solution
-----------------------|---------------------------------|------------------------------------
Deployment complexity  | Separate service to deploy      | Zero dependencies, single-file operation
Hardware requirements  | Requires GPU acceleration       | Runs on CPU
Data scale             | Suited to millions of records   | Optimal for thousands of documents
Maintainability        | Professional DBA required       | Edit the CSV file directly
Learning cost          | Query syntax to learn           | No new knowledge required for developers

This approach suits:

  • Small and medium-sized teams quickly validating the value of RAG
  • Domain documents that update infrequently (e.g. weekly)
  • Test data volumes under 100,000 records

4. Quick Deployment Guide

4.1 Environmental Preparation

4.1.1 Installing Python packages

# Core dependencies
pip install streamlit pandas requests scikit-learn
# PDF processing
pip install PyPDF2 pdfminer.six
# JSON repair
pip install json_repair

4.1.2 Obtaining an API key

  1. Register an account with any large-model provider; this article uses Tencent Cloud.
  2. Create an application → obtain a key in sk-xxxx format
  3. Replace it in your code:

    headers = {"Authorization": "Bearer sk-xxxx"}
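For context, the key is typically sent in a chat-completions style request. The sketch below only prepares the request; the endpoint URL and `deepseek-v3` model name are placeholders (check your provider's documentation), and the actual network call is left commented out:

```python
import json

def build_chat_request(api_key, system_prompt, user_prompt):
    """Prepare (headers, body) for an OpenAI-compatible chat completions call.
    Model name below is a placeholder; substitute your provider's model id."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "deepseek-v3",  # placeholder model name
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return headers, json.dumps(body)

headers, payload = build_chat_request(
    "sk-xxxx", "You are a test engineer.", "Generate test cases.")
# The actual call would then be:
# requests.post(API_URL, headers=headers, data=payload)
```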

4.2 System Startup

# Create the knowledge base directory before the first run
mkdir -p temp

# Start the web service
streamlit run testcase_generator.py

4.3 Functional Verification Process

  1. Upload domain documents:

  • Go to the "Knowledge Base Management" page
  • Upload requirement/interface documents in PDF format
  • Review the processed knowledge paragraphs

  2. Generate enhanced use cases:

    Requirement example:
    User management module, including user registration and login, etc.

    • Check "Use knowledge base enhancement"
    • Review the generated boundary-value test cases

  3. Export results:

    • Copy the JSON output directly
    • Export to Excel via Pandas:

      pd.DataFrame(new_cases).to_excel("output.xlsx")
5. Performance Optimization Suggestions (optional further work for interested readers)

5.1 Hierarchical storage of the knowledge base

# New fields in knowledge_segments.csv
knowledge_df['category'] = "Requirement Document"  # Requirement Document | Interface Specification | Test Report
knowledge_df['importance'] = 5  # 1-5 rating

Prioritize high-importance knowledge fragments during retrieval
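One simple way to do that is to blend the similarity score with the importance rating. This is a sketch under assumptions: the `rank_segments` helper, the 0.1 weight per importance point, and the sample data are all illustrative choices:

```python
import pandas as pd

def rank_segments(knowledge_df, scores, weight=0.1):
    """Combine TF-IDF similarity with the 1-5 importance rating.
    The 0.1 weight per importance point is an illustrative choice."""
    ranked = knowledge_df.copy()
    ranked["score"] = pd.Series(scores, index=ranked.index) \
        + weight * ranked["importance"]
    return ranked.sort_values("score", ascending=False)

# Toy example: a less similar but high-importance segment can outrank
# a more similar, low-importance one
df = pd.DataFrame({
    "content": ["seg A", "seg B"],
    "importance": [5, 1],
})
ranked = rank_segments(df, [0.30, 0.50])
```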

5.2 Cache mechanism

from functools import lru_cache

@lru_cache(maxsize=100)
def load_cases():
    # Cache the loading of historical use cases (body omitted)
    ...

5.3 Asynchronous processing

import asyncio

async def async_generate_cases():
    # Non-blocking generation (body omitted)
    ...


6. Expansion Directions

1. Multimodal support: parse requirement documents embedded in images (OCR)
2. Automated review: add a use case quality scoring model
3. CI/CD integration: automatic triggering from Jenkins/GitLab