Which one should we use, Ollama or vLLM?

Written by
Caleb Hayes
Updated on: July 14, 2025

An in-depth analysis of large language model (LLM) inference frameworks to help you choose the right tool.

Core content:
1. A comparison of the advantages and disadvantages of the LLM inference frameworks Ollama and vLLM
2. Analysis across multiple dimensions, such as resource utilization, deployment and maintenance, and specific use cases
3. An exploration, based on actual test results, of the best choice for different scenarios


Today I will take an in-depth look at the advantages and disadvantages of two large language model (LLM) inference frameworks, Ollama and vLLM, so that you can choose the one that fits your scenario.

This comparison is not about finding out which framework is the best, but about understanding which framework performs better in different scenarios.

We will focus on the following aspects:

  1. Resource Utilization and Efficiency
  2. Ease of deployment and maintenance
  3. Specific use cases and recommendations
  4. Security and production readiness
  5. Documentation

Let's get into the hands-on comparison and see how it turns out.

Benchmark Settings ⚡

To ensure a fair comparison, we will use the same hardware and models for both frameworks:

  1. Hardware configuration:
  • GPU: NVIDIA RTX 4060 Ti 16GB
  • Memory: 64GB RAM
  • CPU: AMD Ryzen 7
  • Storage: NVMe SSD
  2. Model:
  • Qwen2.5-14B-Instruct (int4 quantization)
  • Context length: 8192 tokens
  • Batch size: 1 (single-user scenario)

Resource Utilization Comparison

Next, we analyze how the two frameworks manage system resources in different ways, focusing on their core architectural approaches and their impact on resource utilization.

Ollama:

Let's start with a single prompt: "Tell me a 1000 word story". For this one request, with no parallelism, I measured a generation rate of 25.59 tokens/second.
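For reference, this single-request rate can be read directly from the eval_count and eval_duration fields that Ollama returns with every non-streaming response. A minimal sketch, assuming Ollama's default endpoint and a hypothetical qwen2.5:14b-instruct-q4_K_M model tag:

import requests

# Minimal sketch: measure single-request generation speed against a local Ollama server.
# Assumes Ollama's default port; the exact model tag below is an assumption.
OLLAMA_URL = "http://localhost:11434/api/generate"

payload = {
    "model": "qwen2.5:14b-instruct-q4_K_M",  # assumed model tag
    "prompt": "Tell me a 1000 word story",
    "stream": False,
}

resp = requests.post(OLLAMA_URL, json=payload).json()

# Ollama reports eval_count (generated tokens) and eval_duration (nanoseconds)
tokens_per_second = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tokens_per_second:.2f} tokens/second")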

To enable parallel requests on Ubuntu, edit /etc/systemd/system/ollama.service and add the line Environment="OLLAMA_NUM_PARALLEL=4", which allows up to 4 requests to be served in parallel. After editing, reload systemd (sudo systemctl daemon-reload) and restart the service (sudo systemctl restart ollama) for the change to take effect.

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/henry/.local/bin:/usr/local/cuda/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OPENAI_BASE_URL=http://0.0.0.0:11434/api"

[Install]
WantedBy=multi-user.target

However, there is one thing about Ollama that I am very dissatisfied with, and it is why I don't consider it a good framework for production environments: Ollama reserves all of the memory it might need up front, even if only a small part of it is actually used. As a result, with just 4 concurrent requests it becomes impossible to load the entire model onto the GPU, and some layers are offloaded to the CPU, which you can confirm by running ollama ps in the terminal.

What's even crazier is that only 15% of the neural network was loaded onto the GPU, even though there was almost 2GB of GPU memory still free! Why does Ollama behave this way?

Knowing this, what is the maximum context length Ollama can support while still loading 100% of the model onto the GPU? I modified my Modelfile to set PARAMETER num_ctx 24576, but I ran into the same problem: part of the model was still placed on the CPU (a 4% CPU share), despite almost 2GB of free video memory on the GPU.
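As a side note, num_ctx does not have to be baked into the Modelfile; it can also be passed per request through the options field of Ollama's API. A minimal sketch, again assuming the hypothetical qwen2.5:14b-instruct-q4_K_M tag from above:

import requests

# Minimal sketch: request a 24576-token context window per call via Ollama's
# "options" field, instead of setting PARAMETER num_ctx in the Modelfile.
payload = {
    "model": "qwen2.5:14b-instruct-q4_K_M",  # assumed model tag
    "prompt": "Tell me a 1000 word story",
    "stream": False,
    "options": {"num_ctx": 24576},
}

resp = requests.post("http://localhost:11434/api/generate", json=payload).json()
print(resp["response"][:200])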

vLLM:

vLLM takes a purely GPU-optimized approach. For a fair comparison, I wanted to find the maximum context length my GPU could handle. After several attempts, I found that my RTX 4060 Ti supports 24576 tokens, so I ran this adjusted Docker command:

# Run the container with GPU support
docker run -it \
    --runtime nvidia \
    --gpus all \
    --network="host" \
    --ipc=host \
    -v ./models:/vllm-workspace/models \
    -v ./config:/vllm-workspace/config \
    vllm/vllm-openai:latest \
    --model models/Qwen2.5-14B-Instruct/Qwen2.5-14B-Instruct-Q4_K_M.gguf \
    --tokenizer Qwen/Qwen2.5-14B-Instruct \
    --host "0.0.0.0" \
    --port 5000 \
    --gpu-memory-utilization 1.0 \
    --served-model-name "VLLMQwen2.5-14B" \
    --max-num-batched-tokens 24576 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --generation-config config
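Once the container is up, a quick sanity check is to list the models exposed by the OpenAI-compatible endpoint; a minimal sketch, assuming the host and port from the command above:

import requests

# Minimal sketch: verify the vLLM server is serving by listing its models
# via the OpenAI-compatible /v1/models endpoint.
resp = requests.get("http://<your_vLLM_server_ip>:5000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])  # expected: "VLLMQwen2.5-14B"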

And I could run up to 20 parallel requests! To test the framework, I used the following script:

import requests
import concurrent.futures

BASE_URL = "http://<your_vLLM_server_ip>:5000/v1"
API_TOKEN = "sk-1234"
MODEL = "VLLMQwen2.5-14B"

def create_request_body():
    return {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Tell me a story of 1000 words."}
        ]
    }

def make_request(request_body):
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json"
    }
    response = requests.post(f"{BASE_URL}/chat/completions", json=request_body, headers=headers, verify=False)
    return response.json()

def parallel_requests(num_requests):
    request_body = create_request_body()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
        futures = [executor.submit(make_request, request_body) for _ in range(num_requests)]
        results = [future.result() for future in concurrent.futures.as_completed(futures)]
    return results

if __name__ == "__main__":
    num_requests = 50  # Number of parallel requests to send
    responses = parallel_requests(num_requests)
    for i, response in enumerate(responses):
        print(f"Response {i+1}: {response}")

I got over 100 tokens/second! I couldn't believe I could get that out of a gaming GPU. The GPU was 100% utilized, which is exactly what I wanted: to use the card's full performance (after all, I paid for 100% of it, didn't I?).
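For reference, that aggregate rate can be estimated by timing the batch and summing the completion_tokens reported in each response's OpenAI-style usage block; a minimal sketch that reuses the parallel_requests() helper from the test script above:

import time

# Minimal sketch: estimate aggregate throughput by timing a batch of parallel
# requests and summing completion_tokens from each response's "usage" block.
# Reuses parallel_requests() from the test script above.
start = time.time()
responses = parallel_requests(20)
elapsed = time.time() - start

total_completion_tokens = sum(r["usage"]["completion_tokens"] for r in responses)
print(f"{total_completion_tokens / elapsed:.1f} tokens/second across 20 parallel requests")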

And that's not even the best part: we set --max-num-seqs 256, so in theory we could send up to 256 parallel requests!

Final decision… ⚖️

  1. Performance Overview: The clear winner is vLLM. Even for a single request, vLLM is about 11% faster (26 tokens/sec for Ollama vs. 29 tokens/sec for vLLM).
  2. Resource Management: vLLM is definitely the king here. I was very disappointed to see that Ollama could not handle many concurrent requests; because of its inefficient resource management, it struggled with even 4.
  3. Ease of use and development difficulty: Nothing is easier to use than Ollama. Even if you are not an expert, you can be talking to a large language model with a single command. vLLM requires more background knowledge, such as Docker and a larger set of configuration parameters.
  4. Production readiness: vLLM is designed for production environments, and many companies already run this framework on their inference nodes.
  5. Security: vLLM supports API-key authorization, while Ollama does not. So if your node is not otherwise protected, anyone on the network can access an Ollama endpoint (see the sketch after this list for the vLLM side).
  6. Documentation: The two frameworks take different approaches to documentation. Ollama's documentation is simple and beginner-friendly, but lacks technical depth, especially regarding performance and parallel processing, and important questions on its GitHub discussion board often go unanswered. In contrast, vLLM provides comprehensive technical documentation, including detailed API references and guides; its GitHub is well maintained, the developers are responsive and helpful for solving problems and understanding the framework, and there is a dedicated documentation site at https://docs.vllm.ai/en/latest/.
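On the security point above, here is a minimal sketch of calling an API-key-protected vLLM endpoint with the official openai client. It assumes the server was launched with an additional --api-key sk-1234 argument, which is not part of the Docker command shown earlier:

# pip install openai -- vLLM exposes an OpenAI-compatible API
from openai import OpenAI

# Assumption: the vLLM server was started with `--api-key sk-1234`;
# requests without this key would then be rejected.
client = OpenAI(base_url="http://<your_vLLM_server_ip>:5000/v1", api_key="sk-1234")

response = client.chat.completions.create(
    model="VLLMQwen2.5-14B",
    messages=[{"role": "user", "content": "Tell me a story of 1000 words."}],
)
print(response.choices[0].message.content)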

So, in my opinion, the winner is... neither!

If you often need to quickly experiment with large language models in a local environment or even on a remote server without having to go through too much setup hassle, then Ollama is undoubtedly your preferred solution. Its simplicity and ease of use make it ideal for rapid prototyping, testing ideas, or for developers who are just starting to use large language models and want a gentle learning curve.

However, when we shift our focus to production environments, where performance, scalability, and resource optimization are critical, vLLM clearly wins. Its excellent handling of parallel requests, efficient GPU utilization, and strong documentation make it a strong contender for serious deployments at scale. The framework's ability to squeeze maximum performance out of available hardware resources is particularly impressive and could be a game-changer for companies looking to optimize their large language model infrastructure.

That said, the choice between Ollama and vLLM cannot be made in isolation. It must depend on your specific use case, taking into account the following factors:

  1. Project size
  2. Your team's technical expertise
  3. Application-specific performance requirements
  4. Your development timeline and resources
  5. Customization and fine-tuning requirements
  6. Long-term maintenance and support considerations

Essentially, while vLLM may provide superior performance and scalability in a production environment, the simplicity of Ollama may be invaluable in certain situations, especially in the early stages of development or for smaller-scale projects.

Ultimately, the best choice will be the one that best meets your project's unique needs and constraints. It's worth considering that in some cases you might even benefit from using both: using Ollama for rapid prototyping and initial development, and then using vLLM when you're ready to scale and optimize for production. This hybrid approach gives you the best of both worlds, enabling you to leverage the strengths of each framework at different stages of the project lifecycle.