How to perform stress testing on large models? Try Apifox!

Apifox is a handy tool for running efficient stress tests against large models and improving system performance.
Core content:
1. Clarify stress testing goals: performance benchmark, capacity assessment, bottleneck identification
2. Preliminary preparation: environment configuration, API interface confirmation, test data preparation
3. Stress testing solution design: low, medium, and high load test scenarios
I. Clarify stress testing goals
Performance benchmark: determine the response time (latency), throughput, and stability of the locally deployed large model's API under different loads.
Capacity assessment: find the maximum concurrency the API can handle, that is, the largest number of requests it can serve without crashing or responding too slowly.
Bottleneck identification: discover the system's potential performance bottlenecks under high load (such as CPU, memory, or I/O).
Example targets:
When handling 100 requests per second, the average response time does not exceed 2 seconds.
When the number of concurrent users reaches 200, the system still runs stably.
II. Preliminary preparation
1. Environment configuration
Locally deployed large model: make sure the model is deployed and exposed through an API (such as a RESTful interface). For example, assume the API address is http://localhost:8000/v1/completions.
Apifox: download and install the latest version of Apifox (official website: https://www.apifox.cn/), then register and log in to your account (the free version supports basic stress testing features).
Test machine: at least an 8-core CPU and 32 GB of memory are recommended (adjust according to the model size); operating system: Windows/Linux/macOS; make sure the network is stable and free of external interference.
Monitoring tools: install system performance monitoring tools (such as htop or nmon on Linux, or Resource Monitor on Windows) to observe CPU, memory, and disk I/O usage. If the server is a cloud server, you can use the platform's built-in monitoring instead.
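If no dedicated monitoring tool is at hand, a minimal Python sketch using the psutil library can log resource usage while the test runs; the sampling interval and output format here are arbitrary choices:

```python
# pip install psutil
import time

import psutil

def log_system_usage(duration_s: int = 300, interval_s: float = 1.0) -> None:
    """Print CPU and memory usage once per interval for the duration of the test."""
    end = time.time() + duration_s
    while time.time() < end:
        cpu = psutil.cpu_percent(interval=interval_s)  # average CPU % over the interval
        mem = psutil.virtual_memory().percent          # % of RAM currently in use
        print(f"{time.strftime('%H:%M:%S')}  CPU {cpu:5.1f}%  MEM {mem:5.1f}%")

if __name__ == "__main__":
    log_system_usage(duration_s=300)  # e.g. cover the 5-minute low-load scenario
```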
2. API interface confirmation
Interface documentation: obtain the interface description of the large model API (for example in OpenAPI/Swagger format), including the request method (GET/POST), parameters, and response format.
Example interface:
URL: POST http://localhost:8000/v1/completions
Request body:
```json
{
  "prompt": "Hello, please generate a text about AI.",
  "max_tokens": 100,
  "temperature": 0.7
}
```
Response:
```json
{
  "text": "AI is the trend of the future...",
  "status": "success"
}
```
3. Test data preparation
Diversified input:
Short text: e.g. "Hello".
Medium text: e.g. "Please write a 100-word article."
Long text: e.g. "Analyze the application of AI in the medical field, about 500 words."
Parameter variations:
max_tokens: 50, 100, 200.
temperature: 0.5, 0.7, 1.0.
Save these inputs as a JSON file for Apifox to read (a small generator script is sketched below).
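The exact layout of the file is up to you. As one possible sketch (the prompt and max_tokens field names match the request body above; everything else is an arbitrary choice), a short Python script can write all the combinations to a JSON file that can then be imported into Apifox as test data:

```python
import itertools
import json

prompts = [
    "Hello",                                                            # short text
    "Please write a 100-word article.",                                 # medium text
    "Analyze the application of AI in the medical field, 500 words.",   # long text
]
max_tokens_values = [50, 100, 200]
temperatures = [0.5, 0.7, 1.0]

# One record per prompt/parameter combination (3 * 3 * 3 = 27 cases).
cases = [
    {"prompt": p, "max_tokens": m, "temperature": t}
    for p, m, t in itertools.product(prompts, max_tokens_values, temperatures)
]

with open("test_data.json", "w", encoding="utf-8") as f:
    json.dump(cases, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(cases)} test cases to test_data.json")
```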
III. Stress testing solution design
Scenario 1: Low load test
Number of concurrent users: 10
Request frequency: 1 request/second/user
Duration: 5 minutes
Purpose: verify basic performance and stability.
Scenario 2: Medium load test
Number of concurrent users: 50
Request frequency: 2 requests/second/user
Duration: 10 minutes
Purpose: evaluate performance under normal usage.
Scenario 3: High load test
Number of concurrent users: 200
Request frequency: 5 requests/second/user
Duration: 15 minutes
Purpose: probe the capacity limit and stability.
Metrics to record:
Response time: average, P95 (the response time within which 95% of requests complete), maximum.
Throughput: requests per second (RPS).
Error rate: the percentage of failed requests.
System resources: CPU usage, memory usage, network bandwidth.
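To make these metrics concrete, here is a minimal, self-contained Python sketch of a concurrent load generator that collects the same numbers (average, P95, maximum, RPS, error rate). It only illustrates the idea; the actual stress test below is driven by Apifox, and the URL and payload are the examples assumed earlier:

```python
# pip install requests
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/v1/completions"  # example address from above
PAYLOAD = {"prompt": "Hello", "max_tokens": 50, "temperature": 0.7}

def one_request() -> tuple[float, bool]:
    """Send a single request and return (latency in seconds, success flag)."""
    start = time.perf_counter()
    try:
        resp = requests.post(API_URL, json=PAYLOAD, timeout=30)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    return time.perf_counter() - start, ok

def run_load(concurrency: int = 10, total_requests: int = 100) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(total_requests)))
    wall = time.perf_counter() - start

    latencies = sorted(lat for lat, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]

    print(f"average: {statistics.mean(latencies):.2f}s  P95: {p95:.2f}s  max: {latencies[-1]:.2f}s")
    print(f"throughput: {total_requests / wall:.1f} RPS  error rate: {errors / total_requests:.1%}")

if __name__ == "__main__":
    run_load(concurrency=10, total_requests=100)  # roughly Scenario 1's scale
```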
IV. Implementing stress testing in Apifox
1. Create a project and add the API
Open Apifox and click "New Project".
Add the API in "Interface Management":
Enter the URL: http://localhost:8000/v1/completions.
Set the request method to POST.
Enter the sample request body (the JSON above) in "Body".
Save, then test a single request to make sure the response is normal.
2. Set up the stress testing script
Go to the "Automated Test" module and click "New Test".
Configure the test steps:
Step 1: Call the API. Select the API you just added and set variables (such as prompt) to dynamic values read randomly from the prepared JSON file.
Step 2: Verify the response. Check that the status code is 200 and that the status field in the response is "success" (see the sketch below).
Save the script.
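The two checks in Step 2 are configured inside Apifox; expressed as standalone Python for clarity (the status field name comes from the example response above), they amount to:

```python
# pip install requests
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Hello", "max_tokens": 50, "temperature": 0.7},
    timeout=30,
)

# Check 1: the HTTP status code is 200.
assert resp.status_code == 200, f"unexpected status code: {resp.status_code}"

# Check 2: the "status" field in the response body is "success".
assert resp.json().get("status") == "success", f"unexpected body: {resp.text}"
```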
3. Configure stress testing parameters
Click the "Stress Test" tab and set the scenario parameters: Scenario 1: 10 concurrent requests, 1 time per second, for 300 seconds. Scenario 2: 50 concurrent requests, 2 requests per second, and a duration of 600 seconds. Scenario 3: 200 concurrent requests, 5 requests per second, and a duration of 900 seconds. Select Dynamic Value: Import the JSON file and make prompt and max_tokens vary randomly. Set the stop condition: The error rate exceeds 10%. The average response time is over 5 seconds.
Click "Start stress testing" and Apifox will simulate concurrent requests.
At the same time, open the system monitoring tool to record resource usage.
After each scenario, save the results report.
V. Results Analysis
1. Collect the data
Export the report from Apifox, including:
Response time distribution (mean, P95, maximum).
Throughput (RPS).
Error rate.
Combine this with the system monitoring data and record the peak CPU and memory usage.
2. Analysis Example
Scenario 1:
Average response time: 0.5 seconds
Throughput: 10 RPS
CPU: 20%
Conclusion: Good performance under light load.
Scenario 2:
Average response time: 1.2 seconds
Throughput: 100 RPS
CPU: 60%
Conclusion: Acceptable for medium loads.
Scenario 3:
Average response time: 4.8 seconds
Throughput: 800 RPS
CPU: 95%, memory overflow
Conclusion: the system is overloaded at 200 concurrent users and needs optimization.
3. Troubleshoot bottlenecks
If the response time is too long, check:
Model inference speed: Is GPU acceleration required?
Server resources: Is there insufficient CPU/memory?
Network latency: usually not an issue for a local deployment, but worth confirming.