How to perform stress testing on large models? Try Apifox!

Apifox is a handy tool for running efficient stress tests against large models and improving system performance.
Core content:
1. Clarify stress testing goals: performance benchmark, capacity assessment, bottleneck identification
2. Preliminary preparation: environment configuration, API interface confirmation, test data preparation
3. Stress testing solution design: low, medium, and high load test scenarios
I. Clarify stress testing goals
Performance benchmark: determine the response time (latency), throughput, and stability of the locally deployed large model's API under different loads.
Capacity assessment: find the maximum concurrency the API can handle, that is, the largest number of requests it can serve without crashing or responding too slowly.
Bottleneck identification: discover the system's potential performance bottlenecks under high load (such as CPU, memory, or I/O).
Example targets:
When handling 100 requests per second, the average response time does not exceed 2 seconds.
When the number of concurrent users reaches 200, the system still runs stably.
II. Preliminary preparation
1. Environment configuration
Locally deployed large model: make sure the model is deployed and exposed through an API (such as a RESTful interface). For example, assume the API address is http://localhost:8000/v1/completions.
Apifox: download and install the latest version of Apifox (official website: https://www.apifox.cn/), then register and log in to your account (the free version supports basic stress testing features).
Test machine: at least an 8-core CPU and 32 GB of memory are recommended (adjust according to the model size); operating system: Windows/Linux/macOS; make sure the network is stable and free of external interference.
Monitoring tools: install system performance monitoring tools (such as htop or nmon on Linux, or Resource Monitor on Windows) to observe CPU, memory, and disk I/O usage. If the server is a cloud server, you can use the platform's built-in monitoring instead.
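If no dedicated monitoring tool is at hand, a minimal Python sketch using the psutil library can log resource usage while the test runs; the sampling interval and output format here are arbitrary choices:

```python
# pip install psutil
import time

import psutil

def log_system_usage(duration_s: int = 300, interval_s: float = 1.0) -> None:
    """Print CPU and memory usage once per interval for the duration of the test."""
    end = time.time() + duration_s
    while time.time() < end:
        cpu = psutil.cpu_percent(interval=interval_s)  # average CPU % over the interval
        mem = psutil.virtual_memory().percent          # % of RAM currently in use
        print(f"{time.strftime('%H:%M:%S')}  CPU {cpu:5.1f}%  MEM {mem:5.1f}%")

if __name__ == "__main__":
    log_system_usage(duration_s=300)  # e.g. cover the 5-minute low-load scenario
```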
2. API interface confirmation
Interface documentation: obtain the interface description of the large model API (for example in OpenAPI/Swagger format), including the request method (GET/POST), parameters, and response format.
Example interface:
URL: POST http://localhost:8000/v1/completions
Request body:
```json
{
  "prompt": "Hello, please generate a text about AI.",
  "max_tokens": 100,
  "temperature": 0.7
}
```
Response:
```json
{
  "text": "AI is the trend of the future...",
  "status": "success"
}
```
3. Test data preparation
Diversified input:
Short text: e.g. "Hello".
Medium text: e.g. "Please write a 100-word article."
Long text: e.g. "Analyze the application of AI in the medical field, about 500 words."
Parameter variations:
max_tokens: 50, 100, 200.
temperature: 0.5, 0.7, 1.0.
Save these inputs as a JSON file for Apifox to read (a small generator script is sketched below).
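The exact layout of the file is up to you. As one possible sketch (the prompt and max_tokens field names match the request body above; everything else is an arbitrary choice), a short Python script can write all the combinations to a JSON file that can then be imported into Apifox as test data:

```python
import itertools
import json

prompts = [
    "Hello",                                                            # short text
    "Please write a 100-word article.",                                 # medium text
    "Analyze the application of AI in the medical field, 500 words.",   # long text
]
max_tokens_values = [50, 100, 200]
temperatures = [0.5, 0.7, 1.0]

# One record per prompt/parameter combination (3 * 3 * 3 = 27 cases).
cases = [
    {"prompt": p, "max_tokens": m, "temperature": t}
    for p, m, t in itertools.product(prompts, max_tokens_values, temperatures)
]

with open("test_data.json", "w", encoding="utf-8") as f:
    json.dump(cases, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(cases)} test cases to test_data.json")
```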
III. Stress testing solution design
Scenario 1: Low load test
Number of concurrent users: 10
Request frequency: 1 request/second/user
Duration: 5 minutes
Purpose: verify basic performance and stability.
Scenario 2: Medium load test
Number of concurrent users: 50
Request frequency: 2 requests/second/user
Duration: 10 minutes
Purpose: evaluate performance under normal usage.
Scenario 3: High load test
Number of concurrent users: 200
Request frequency: 5 requests/second/user
Duration: 15 minutes
Purpose: probe the capacity limit and stability.
Metrics to record:
Response time: average, P95 (the response time within which 95% of requests complete), maximum.
Throughput: requests per second (RPS).
Error rate: the percentage of failed requests.
System resources: CPU usage, memory usage, network bandwidth.
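To make these metrics concrete, here is a minimal, self-contained Python sketch of a concurrent load generator that collects the same numbers (average, P95, maximum, RPS, error rate). It only illustrates the idea; the actual stress test below is driven by Apifox, and the URL and payload are the examples assumed earlier:

```python
# pip install requests
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "http://localhost:8000/v1/completions"  # example address from above
PAYLOAD = {"prompt": "Hello", "max_tokens": 50, "temperature": 0.7}

def one_request() -> tuple[float, bool]:
    """Send a single request and return (latency in seconds, success flag)."""
    start = time.perf_counter()
    try:
        resp = requests.post(API_URL, json=PAYLOAD, timeout=30)
        ok = resp.status_code == 200
    except requests.RequestException:
        ok = False
    return time.perf_counter() - start, ok

def run_load(concurrency: int = 10, total_requests: int = 100) -> None:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: one_request(), range(total_requests)))
    wall = time.perf_counter() - start

    latencies = sorted(lat for lat, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]

    print(f"average: {statistics.mean(latencies):.2f}s  P95: {p95:.2f}s  max: {latencies[-1]:.2f}s")
    print(f"throughput: {total_requests / wall:.1f} RPS  error rate: {errors / total_requests:.1%}")

if __name__ == "__main__":
    run_load(concurrency=10, total_requests=100)  # roughly Scenario 1's scale
```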
IV. Implementing stress testing in Apifox
1. Create a project and add the API
Open Apifox and click "New Project".
Add the API in "Interface Management":
Enter the URL: http://localhost:8000/v1/completions.
Set the request method to POST.
Enter the sample request body (the JSON above) in "Body".
Save, then test a single request to make sure the response is normal.
2. Set up the stress testing script
Go to the "Automated Test" module and click "New Test".
Configure the test steps:
Step 1: Call the API. Select the API you just added and set variables (such as prompt) to dynamic values read randomly from the prepared JSON file.
Step 2: Verify the response. Check that the status code is 200 and that the status field in the response is "success" (see the sketch below).
Save the script.
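The two checks in Step 2 are configured inside Apifox; expressed as standalone Python for clarity (the status field name comes from the example response above), they amount to:

```python
# pip install requests
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Hello", "max_tokens": 50, "temperature": 0.7},
    timeout=30,
)

# Check 1: the HTTP status code is 200.
assert resp.status_code == 200, f"unexpected status code: {resp.status_code}"

# Check 2: the "status" field in the response body is "success".
assert resp.json().get("status") == "success", f"unexpected body: {resp.text}"
```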
3. Configure stress testing parameters
Click the "Stress Test" tab and set the scenario parameters: Scenario 1: 10 concurrent requests, 1 time per second, for 300 seconds. Scenario 2: 50 concurrent requests, 2 requests per second, and a duration of 600 seconds. Scenario 3: 200 concurrent requests, 5 requests per second, and a duration of 900 seconds. Select Dynamic Value: Import the JSON file and make prompt and max_tokens vary randomly. Set the stop condition: The error rate exceeds 10%. The average response time is over 5 seconds.
Click "Start stress testing" and Apifox will simulate concurrent requests.
At the same time, open the system monitoring tool to record resource usage.
After each scenario, save the results report.
V. Results Analysis
1. Collect the data
Export the report from Apifox, including:
Response time distribution (mean, P95, maximum).
Throughput (RPS).
Error rate.
Combine this with the system monitoring data and record the peak CPU and memory usage.
2. Analysis Example
Scenario 1:
Average response time: 0.5 seconds
Throughput: 10 RPS
CPU: 20%
Conclusion: Good performance under light load.
Scenario 2:
Average response time: 1.2 seconds
Throughput: 100 RPS
CPU: 60%
Conclusion: Acceptable for medium loads.
Scenario 3:
Average response time: 4.8 seconds
Throughput: 800 RPS
CPU: 95%, memory overflow
Conclusion: the system is overloaded at 200 concurrent users and needs optimization.
3. Troubleshoot bottlenecks
If the response time is too long, check:
Model inference speed: Is GPU acceleration required?
Server resources: Is there insufficient CPU/memory?
Network latency: usually not an issue for a local deployment, but worth confirming.