Ollama high concurrency test

Written by
Clara Bennett
Updated on: July 10, 2025

Explore the limits of Ollama's high-concurrency performance and learn how to optimize concurrent processing capabilities.

Core content:
1. Testing concurrency under Ollama's default parameters
2. Adjusting parameters for a high-concurrency configuration
3. High-concurrency test results and deployment suggestions

This article tests Ollama's high-concurrency capability.
The test setup is as follows:
1. Running Ollama with default parameters
We opened 4 chat windows and asked DeepSeek "Tell me a joke" in each, then watched the order in which the windows received their answers.
The answers arrived one after another: with no parameters set, requests are executed sequentially. In other words, under the default configuration Ollama does not support high concurrency; it replies to requests one by one.
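The same four-window experiment can be reproduced programmatically. Below is a minimal sketch; the model tag deepseek-r1 is an assumption, so substitute whatever `ollama list` shows on your machine:

```python
import threading
import time

import requests  # assumes the requests package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODEL = "deepseek-r1"  # hypothetical tag; use a model you have pulled

def ask(window_id: int) -> None:
    """Send one non-streaming request and report when it finishes."""
    start = time.time()
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "prompt": "Tell me a joke",
        "stream": False,
    })
    resp.raise_for_status()
    print(f"window {window_id} finished after {time.time() - start:.1f}s")

# Fire 4 requests at once, mimicking 4 chat windows.
threads = [threading.Thread(target=ask, args=(i,)) for i in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With default settings the finish times are staggered by roughly one full generation each, which is the sequential behavior observed above.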
2. Adjusting Ollama's high-concurrency parameters
Ollama has two parameters related to high concurrency:
OLLAMA_MAX_LOADED_MODELS: the maximum number of models that can be loaded into memory at the same time, i.e., how many different LLMs can be resident and serving simultaneously.
As an application scenario, you can have two LLMs answer the same chat at once and compare their responses (a runnable sketch follows these definitions).
With this setting, different users can also request different models at the same time.
OLLAMA_NUM_PARALLEL: the maximum number of requests that each loaded model will process in parallel, i.e., how many users a single LLM can answer at the same time.
This parameter is critical for high concurrency. Suppose you have deployed Ollama and 10 people request your LLM at the same time. If requests are answered one by one and each reply takes 10 seconds, the 10th person waits 90 seconds before their request even starts and about 100 seconds for the full answer, which is unacceptable.
Both parameters should be set according to your own hardware.
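To illustrate OLLAMA_MAX_LOADED_MODELS, here is a minimal sketch that queries two different models at once; both model tags are assumptions, so substitute models you have actually pulled:

```python
from concurrent.futures import ThreadPoolExecutor

import requests  # assumes the requests package is installed

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
MODELS = ["deepseek-r1", "llama3"]  # hypothetical tags; use models you have pulled

def ask(model: str) -> str:
    """Send the same prompt to one model and return its answer."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": "Tell me a joke",
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["response"]

# With OLLAMA_MAX_LOADED_MODELS >= 2, both models can stay loaded and
# answer in parallel instead of evicting each other from memory.
with ThreadPoolExecutor(max_workers=2) as pool:
    for model, answer in zip(MODELS, pool.map(ask, MODELS)):
        print(f"--- {model} ---\n{answer}\n")
```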
High-concurrency testing:
We add the above two parameters to the computer's environment variables and set them both to 4:

OLLAMA_MAX_LOADED_MODELS = 4
OLLAMA_NUM_PARALLEL = 4
After setting them, confirm the environment variables and restart Ollama. Let's see the effect.
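If you prefer to set the variables per process rather than system-wide, here is a minimal sketch that launches the server with both settings (assuming the ollama binary is on your PATH):

```python
import os
import subprocess

# Copy the current environment and add the two concurrency settings.
env = os.environ.copy()
env["OLLAMA_MAX_LOADED_MODELS"] = "4"
env["OLLAMA_NUM_PARALLEL"] = "4"

# Start the Ollama server with the adjusted environment. Any instance
# already running must be stopped first: the variables are read at startup.
subprocess.run(["ollama", "serve"], env=env)
```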
As the test shows, with the concurrency set to 4, the model responds to requests from 4 users at the same time.
Generally speaking, Ollama can serve as the base for small and medium-sized deployments: deploy multiple servers and spread requests across them with a reverse proxy and load balancing.
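As one illustration, here is a minimal client-side round-robin sketch; the backend addresses are hypothetical placeholders, and in production you would more likely put a reverse proxy such as Nginx in front of the servers:

```python
import itertools

import requests  # assumes the requests package is installed

# Hypothetical Ollama backends; replace with your real server addresses.
BACKENDS = itertools.cycle([
    "http://10.0.0.1:11434",
    "http://10.0.0.2:11434",
])

def generate(prompt: str, model: str = "deepseek-r1") -> str:
    """Client-side round-robin: each call goes to the next server in turn."""
    base = next(BACKENDS)
    resp = requests.post(f"{base}/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["response"]
```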
If you need to handle more concurrent requests than that, Ollama is not recommended as the base; deploy with vLLM instead.