Build a DeepSeek test case generation system knowledge base in 8 minutes

Written by
Jasper Cole
Updated on: July 8, 2025
Recommendation

Quickly build the DeepSeek test case generation system to improve software testing expertise.

Core content:
1. The importance of knowledge base in AI test case generation
2. System architecture and key technical point analysis
3. Implementation details of knowledge base construction and enhanced search engine

Yang Fangxian
Founder of 53AI/Most Valuable Expert of Tencent Cloud (TVP)

1. Background and system positioning

Previously, I shared two articles in the 8-minute series on DeepSeek empowering software testing, which drew many like-minded readers into the discussion. Those articles built the basic test case generation capability; today I will focus on the knowledge base.

Build a DeepSeek-powered test case tool in 8 minutes
(Polaris School, WeChat Official Account)

Build a DeepSeek API intelligent testing engine in 8 minutes: before the coffee gets cold, the test report is ready
(Polaris School, WeChat Official Account)


Building on that foundation, this system introduces retrieval-augmented generation (RAG), integrating domain documents and historical test case data so that the generated results better match real business scenarios.

1.1 Why do we need a knowledge base?

Traditional AI generation solutions have two major pain points:

  1. Lack of domain knowledge
    Large models have not seen enterprise-private documents (such as requirement specifications and interface documents)
  2. Wasted historical experience
    Past test cases are not effectively reused

This system addresses both through a lightweight RAG architecture (no vector database required):

  • Intelligent parsing of PDF documents ➡️ builds the domain knowledge base
  • Semantic retrieval of historical use cases ➡️ forms an experience-reuse mechanism
  • Dynamic prompt enhancement ➡️ improves the professionalism of generated cases
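The three-step flow above can be sketched end to end. The following is a minimal illustration only: the `retrieve` and `build_prompt` helper names and the sample segments are hypothetical, and the actual model call (sending the prompt to the DeepSeek API) is left out:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, segments, top_k=2):
    """Rank knowledge segments against the query with TF-IDF + cosine similarity."""
    texts = [s["content"] for s in segments] + [query]
    tfidf = TfidfVectorizer().fit_transform(texts)
    scores = cosine_similarity(tfidf[-1], tfidf[:-1])[0]
    ranked = scores.argsort()[::-1][:top_k]
    return [segments[i] for i in ranked]

def build_prompt(query, hits):
    """Inject the retrieved segments into the generation prompt."""
    refs = "\n".join(f"- {h['content']}" for h in hits)
    return (f"References:\n{refs}\n\n"
            f"Requirement: {query}\nGenerate test cases as a JSON array.")

# Toy knowledge base (contents are illustrative)
segments = [
    {"content": "Registration requires a valid 11-digit mobile phone number."},
    {"content": "Passwords must be 8-20 characters with at least one digit."},
    {"content": "The admin console exports reports as PDF."},
]
query = "mobile phone number login"
prompt = build_prompt(query, retrieve(query, segments))
```

The prompt now carries the phone-number registration rule, which is exactly why the enhanced generation in the demo below "knows" about mobile-number login.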

Watch the video demo first:
  1. Upload documents to the knowledge base
  2. First generation, without knowledge base enhancement → the designed test cases have nothing to do with mobile-phone-number login
  3. Second generation, with knowledge base enhancement → the designed test cases cover registering with a mobile phone number and many more details

2. Core Logic Analysis

2.1 System Architecture Overview


2.2 Description of key technical points

2.2.1 Knowledge base building module

import re
import PyPDF2

def process_pdf(uploaded_file):
    """Extract text from a PDF and split it into structured knowledge segments."""
    pdf_reader = PyPDF2.PdfReader(uploaded_file)
    segments = []
    for page_num, page in enumerate(pdf_reader.pages):
        text = page.extract_text() or ""
        # Simple segmentation rule: split on blank lines into natural paragraphs
        paragraphs = re.split(r'\n\s*\n', text)
        for i, paragraph in enumerate(paragraphs):
            if len(paragraph.strip()) < 20:  # filter invalid short text
                continue
            # Structured storage
            segments.append({
                'segment_id': f"{uploaded_file.name}_{page_num}_{i}",
                'document_name': uploaded_file.name,
                'page_num': page_num + 1,
                'content': paragraph
            })
    return segments

Key points:

  • Each segment gets a unique ID
  • Text is split on natural paragraphs, preserving contextual semantics
  • Invalid short text (<20 characters) is filtered out

2.2.2 Enhanced search engine

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_cases(new_req, df, top_k=3):
    # TF-IDF vectorization over historical requirements plus the new one
    # (assumes df has a 'requirement' text column)
    corpus = df['requirement'].tolist() + [new_req]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)
    # Cosine similarity between the new requirement (last row) and history
    similarity = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])
    top_indices = similarity[0].argsort()[-top_k:][::-1]
    return top_indices

Design considerations:

  • Simpler to implement than the BM25 algorithm
  • Computationally efficient: linear-time scoring, real-time responses on thousands of records
  • Highly interpretable results, convenient for debugging
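As a quick sanity check, here is a self-contained toy run of the same TF-IDF retrieval logic; the `requirement` column name and the sample data are illustrative assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy history of past test-case requirements (column name is illustrative)
df = pd.DataFrame({"requirement": [
    "User logs in with mobile phone number and SMS code",
    "Admin exports the monthly report to Excel",
    "User resets password via email link",
]})

new_req = "Login with phone number verification code"
corpus = df["requirement"].tolist() + [new_req]
tfidf = TfidfVectorizer().fit_transform(corpus)
# Compare the new requirement (last row) against all historical rows
scores = cosine_similarity(tfidf[-1], tfidf[:-1])[0]
top_indices = scores.argsort()[::-1][:2]
print(df.loc[top_indices[0], "requirement"])
```

The phone-login case shares the most terms with the new requirement, so it ranks first.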

2.2.3 Dynamic Prompt Engineering

system_prompt = f"""References:
Document {item['document']}, page {item['page']}: {item['content']}
Historical case {idx + 1}: {case}
Generation requirements:
1. JSON array format...
"""

Enhancement strategies:

  • Knowledge fragments are truncated (each segment ≤ 512 characters)
  • Priority ordering: domain knowledge > historical use cases
  • Strict format constraints (JSON schema injection)
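Those three strategies can be combined when assembling the prompt context. The sketch below is illustrative; the `assemble_context` helper name and its inputs are assumptions, not the system's exact implementation:

```python
def assemble_context(knowledge_items, history_cases, max_seg_len=512):
    """Assemble prompt context: domain knowledge first, then historical
    cases, truncating each knowledge fragment to max_seg_len characters."""
    lines = ["References:"]
    for item in knowledge_items:  # domain knowledge takes priority
        lines.append(f"Document {item['document']}, page {item['page']}: "
                     f"{item['content'][:max_seg_len]}")
    for idx, case in enumerate(history_cases):  # historical cases come second
        lines.append(f"Historical case {idx + 1}: {case}")
    return "\n".join(lines)

# Demo: an over-long fragment gets cut to 512 characters
context = assemble_context(
    [{"document": "spec.pdf", "page": 3, "content": "x" * 1000}],
    ["login succeeds with a registered phone number"],
)
```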

3. Analysis of Key Technology Selection

3.1 What is RAG?

Retrieval-Augmented Generation improves generation quality through the following pipeline:

User question → knowledge retrieval → prompt enhancement → large model generation → result output

Differences from traditional generation:

  • Real-time knowledge
    No model retraining needed
  • Data security
    Sensitive information never leaves your environment
  • Controllable results
    Retrieved content guides the generation direction

3.2 Why not use a vector database?

Although vector databases (such as ChromaDB) are widely used in RAG, this system instead uses TF-IDF plus CSV file storage, for the following reasons:

Dimension              | Vector database solution        | This system's solution
-----------------------|---------------------------------|------------------------------------
Deployment complexity  | Separate service to deploy      | Zero dependencies, single-file operation
Hardware requirements  | Requires GPU acceleration       | Runs on CPU
Data scale             | Suited to millions of records   | Optimal for thousands of documents
Maintainability        | Professional DBA required       | Edit the CSV file directly
Learning cost          | Query syntax to learn           | No new knowledge required for developers

This approach suits:

  • Small and medium-sized teams quickly validating the value of RAG
  • Domain documents that update infrequently (e.g. weekly)
  • Test data volumes under 100,000 records

4. Quick Deployment Guide

4.1 Environmental Preparation

4.1.1 Installing Python packages

# Core dependencies
pip install streamlit pandas requests scikit-learn
# PDF processing
pip install PyPDF2 pdfminer.six
# JSON repair
pip install json_repair

4.1.2 Obtaining an API key

  1. Register an account with any large-model provider; this article uses Tencent Cloud.
  2. Create an application → obtain a key in sk-xxxx format
  3. Replace it in your code:

    headers = {"Authorization": "Bearer sk-xxxx"}
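For context, the key is typically sent in a chat-completions style request. The sketch below only prepares the request; the endpoint URL and `deepseek-v3` model name are placeholders (check your provider's documentation), and the actual network call is left commented out:

```python
import json

def build_chat_request(api_key, system_prompt, user_prompt):
    """Prepare (headers, body) for an OpenAI-compatible chat completions call.
    Model name below is a placeholder; substitute your provider's model id."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "deepseek-v3",  # placeholder model name
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return headers, json.dumps(body)

headers, payload = build_chat_request(
    "sk-xxxx", "You are a test engineer.", "Generate test cases.")
# The actual call would then be:
# requests.post(API_URL, headers=headers, data=payload)
```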

4.2 System Startup

# Create the knowledge base directory before the first run
mkdir -p temp

# Start the web service
streamlit run testcase_generator.py

4.3 Functional Verification Process

  1. Upload domain documents:

  • Go to the "Knowledge Base Management" page
  • Upload requirement/interface documents in PDF format
  • Review the processed knowledge paragraphs

  2. Generate enhanced use cases:

    Requirement example:
    User management module, including user registration and login, etc.

    • Check "Use knowledge base enhancement"
    • Review the generated boundary-value test cases

  3. Export results:

    • Copy the JSON output directly
    • Export to Excel via Pandas:

      pd.DataFrame(new_cases).to_excel("output.xlsx")
5. Performance Optimization Suggestions (optional further work for interested readers)

5.1 Hierarchical storage of the knowledge base

# New fields in knowledge_segments.csv
knowledge_df['category'] = "Requirement Document"  # Requirement Document | Interface Specification | Test Report
knowledge_df['importance'] = 5  # 1-5 rating

Prioritize high-importance knowledge fragments during retrieval
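One simple way to do that is to blend the similarity score with the importance rating. This is a sketch under assumptions: the `rank_segments` helper, the 0.1 weight per importance point, and the sample data are all illustrative choices:

```python
import pandas as pd

def rank_segments(knowledge_df, scores, weight=0.1):
    """Combine TF-IDF similarity with the 1-5 importance rating.
    The 0.1 weight per importance point is an illustrative choice."""
    ranked = knowledge_df.copy()
    ranked["score"] = pd.Series(scores, index=ranked.index) \
        + weight * ranked["importance"]
    return ranked.sort_values("score", ascending=False)

# Toy example: a less similar but high-importance segment can outrank
# a more similar, low-importance one
df = pd.DataFrame({
    "content": ["seg A", "seg B"],
    "importance": [5, 1],
})
ranked = rank_segments(df, [0.30, 0.50])
```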

5.2 Cache mechanism

from functools import lru_cache

@lru_cache(maxsize=100)
def load_cases():
    # Cache the loading of historical use cases (body omitted)
    ...

5.3 Asynchronous processing

import asyncio

async def async_generate_cases():
    # Non-blocking generation (body omitted)
    ...


6. Expansion Directions

1. Multimodal support: parse requirement documents embedded in images (OCR)
2. Automated review: add a use case quality scoring model
3. CI/CD integration: automatic triggering from Jenkins/GitLab