Comparison of LLM runtime frameworks: A brief analysis of Ollama and vLLM

Written by
Iris Vance
Updated on: July 13, 2025
Recommendation

Explore the best practices for open source LLM deployment and compare the performance and ease of use of Ollama and vLLM.

Core content:
1. The importance of open-source LLMs in private enterprise deployment
2. The simplicity of installing the Ollama framework and running LLMs with it
3. A comparative analysis of the performance and ease of use of the vLLM and Ollama frameworks


Open-source LLMs have become a great choice for programmers, enthusiasts, and users who want to use generative AI in their daily work while maintaining privacy, as well as for private deployments in enterprises. These models provide excellent performance, sometimes comparable to large closed-source models (such as GPT-4o or Claude 3.5 Sonnet) on many tasks.

These LLMs are open source, but that doesn't mean they can be used out of the box. A runtime framework is still needed to run the models locally or on a server for a specific use case. In addition, OpenAI-compatible servers have become the most popular way to deploy any model, because such APIs let us consume LLM services from almost any SDK or client, such as the OpenAI SDK, Transformers, LangChain, etc.

So, what is the best runtime framework for deploying an LLM behind an OpenAI-compatible API? Here we analyze Ollama and vLLM, two popular runtime frameworks that can serve models through OpenAI-compatible APIs, and compare them in terms of performance, ease of use, customization, and other aspects.


1. Ollama

Ollama is a powerful runtime framework designed to make running LLMs as easy as possible. Ollama simplifies the entire process of downloading, running, and managing large language models on a local machine or server.

Ollama is easy to use and can be installed on different platforms:

curl -fsSL https://ollama.com/install.sh | sh (Linux)
brew install ollama (macOS)

Ollama provides a ready-made model runtime environment and can start serving a large model with a single command: ollama run <anymodel>. This command runs any model listed in the Ollama model repository directly in your terminal. For example:

ollama run qwen2.5:14b --verbose

The --verbose flag is added so you can see the token throughput (tokens/sec).

1.1 Ollama Parameters

If we need to create a private model with specific parameters, we need to create a Modelfile, which is a separate plain text file that contains the parameters that need to be set.

FROM qwen2.5:14b

PARAMETER temperature 0.5

# Context size
PARAMETER num_ctx 8192

# Maximum number of tokens to generate
PARAMETER num_predict 4096

# System prompt configuration
SYSTEM """You are a helpful AI assistant."""

We can build and run this custom model:

# Build the model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel --verbose

Ollama provides two ways to interact with the model:

1. Native REST API: Ollama runs a local server on port 11434 by default, and we can interact with it using standard HTTP requests:
import requests

response = requests.post('http://<my_ollama_server_ip>:11434/api/chat', 
    json={
        'model': 'qwen2.5:14b',
        'messages': [
            {
                'role': 'system',
                'content': 'You are a helpful AI assistant.'
            },
            {
                'role': 'user',
                'content': 'What is AI Agent?'
            }
        ],
        'stream': False
    }
)
print(response.json()['message']['content'])

2. To achieve seamless integration with existing applications, Ollama provides OpenAI API compatibility and can be used with the OpenAI Python SDK:
from openai import OpenAI

client = OpenAI(
    base_url="http://<my_ollama_server_ip>:11434/v1",
    api_key="Abel" # can be set to any string
)

response = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI Agent?"}
        ]
)
print(response.choices[0].message.content)

1.2 Features of Ollama API

Ollama’s API has several features that make it a popular choice among developers. Its main features are as follows:

  • Streaming support: Real-time token generation, fully compatible with the OpenAI API, ideal for building responsive applications (see the sketch after this list).
  • Multiple model management: Ability to run different models simultaneously, with one caveat: when VRAM is limited, Ollama will stop one model in order to run another, which requires careful resource planning.
  • Parameter control: Highly customizable settings through API calls, which provides great flexibility but is less friendly for beginners and production servers.
  • CPU compatibility: When VRAM is insufficient, Ollama automatically offloads part of the model to the CPU, allowing large model services to run on systems with limited GPU memory.
  • Language agnostic: You can use Python, JavaScript, Go, or any other programming language with HTTP capabilities.
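
As a quick illustration of the streaming support mentioned above, here is a minimal sketch that streams tokens through the OpenAI-compatible endpoint, reusing the server address and model from the earlier examples:

from openai import OpenAI

client = OpenAI(
    base_url="http://<my_ollama_server_ip>:11434/v1",
    api_key="Abel"  # can be set to any string
)

# Request a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[{"role": "user", "content": "What is AI Agent?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()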

2. vLLM

vLLM is a high-performance framework designed for LLM inference, focusing on efficiency and scalability. It is built on PyTorch, leverages CUDA for GPU acceleration, and implements advanced optimization techniques such as continuous batching, efficient memory management, and tensor parallelism, making it particularly suitable for production environments and high-throughput scenarios.

vLLM is not as simple to set up as Ollama; the easiest way is probably to run it with Docker. Docker provides a consistent environment, making cross-system deployment easier. The prerequisites for running vLLM with Docker are as follows:

  • Docker installed on your system.
  • NVIDIA Container Toolkit (for GPU support).
  • At least 16GB of RAM (recommended).
  • An NVIDIA GPU with enough VRAM for the target model (see the quick check after this list).
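
Before pulling the vLLM image, a quick way to confirm that the GPU and CUDA stack are visible is a small PyTorch check. This is a minimal sketch and assumes PyTorch with CUDA support is installed on the host:

# Quick sanity check that a CUDA-capable GPU is visible to PyTorch
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("CUDA available:", torch.cuda.get_device_name(0))
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA-capable GPU detected")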

2.1 GGUF (GPT-Generated Unified Format)

GGUF is considered by many to be the successor to GGML. It is a model file format with built-in quantization that enables hybrid CPU-GPU execution of large language models, optimizing memory usage and inference speed. It is the only format Ollama supports for running models. The format is particularly efficient on CPU architectures and Apple Silicon, supporting a variety of quantization levels (from 4 to 8 bits) while maintaining model quality.

Although vLLM currently provides only limited GGUF support and focuses on native GPU optimization, understanding this format is important when comparing how the two frameworks run large models.

2.2 Docker deployment and operation

We continue with Qwen2.5-14B-Instruct as the reference model. Downloading the model may take a little time, depending on your Internet connection speed:

mkdir models/
mkdir models/Qwen2.5-14B-Instruct/

# Download a 4bit quantization model
wget -P models/Qwen2.5-14B-Instruct/ https://huggingface.co/lmstudio-community/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf
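
As an alternative to wget, the same GGUF file can be downloaded from the Hugging Face Hub with the huggingface_hub Python package. This is a minimal sketch and assumes huggingface_hub is installed (pip install huggingface_hub); the repository and filename are the same as in the wget command above:

# Download the same 4-bit GGUF file via the Hugging Face Hub API
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="lmstudio-community/Qwen2.5-14B-Instruct-GGUF",
    filename="Qwen2.5-14B-Instruct-Q4_K_M.gguf",
    local_dir="models/Qwen2.5-14B-Instruct",
)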

We also need to create a generation_config.json file. For testing convenience, temperature is set to 0 here.

{
  "bos_token_id": 151643,
  "pad_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "repetition_penalty": 1.05,
  "temperature": 0.0,
  "top_p": 0.8,
  "top_k": 20,
  "transformers_version": "4.37.0"
}
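
If you prefer to generate this file from a script, a minimal sketch that writes it into a local config/ folder (the folder mounted by the docker command below) could look like this:

# Write generation_config.json into a local ./config folder
import json
import os

gen_config = {
    "bos_token_id": 151643,
    "pad_token_id": 151643,
    "do_sample": True,
    "eos_token_id": [151645, 151643],
    "repetition_penalty": 1.05,
    "temperature": 0.0,
    "top_p": 0.8,
    "top_k": 20,
    "transformers_version": "4.37.0",
}

os.makedirs("config", exist_ok=True)
with open("config/generation_config.json", "w") as f:
    json.dump(gen_config, f, indent=2)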

Make sure this JSON file lives in its own folder (here ./config) and is named generation_config.json. Then run the docker container with multiple parameters:

# GPU support required
docker run -it \
    --runtime nvidia \
    --gpus all \
    --network="host" \
    --ipc=host \
    -v ./models:/vllm-workspace/models \
    -v ./config:/vllm-workspace/config \
    vllm/vllm-openai:latest \
    --model models/Qwen2.5-14B-Instruct/Qwen2.5-14B-Instruct-Q4_K_M.gguf \
    --tokenizer Qwen/Qwen2.5-14B-Instruct \
    --host "0.0.0.0" \
    --port 5000 \
    --gpu-memory-utilization 1.0 \
    --served-model-name "VLLMQwen2.5-14B" \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --generation-config config

The meanings of these parameters are as follows:

--runtime nvidia --gpus all: Enable NVIDIA GPU support for the container.
--network="host": Use host network mode for better performance.
--ipc=host: Enable shared memory between the host and the container.
-v ./models:/vllm-workspace/models: Mount the local model directory (containing the example Qwen2.5-14B model) into the container.
-v ./config:/vllm-workspace/config: Mount the local config directory (containing generation_config.json) into the container.
--model: Specify the path to the GGUF model file.
--tokenizer: Define the HuggingFace tokenizer to use.
--gpu-memory-utilization 1.0: Set GPU memory utilization to 100%.
--served-model-name: A custom name for the model when served through the API. You can specify any name you want.
--max-num-batched-tokens: Maximum number of tokens in a batch.
--max-num-seqs: Maximum number of sequences to process simultaneously.
--max-model-len: Maximum context length of the model.
--generation-config: Path to the folder containing generation_config.json.

These parameters can be adjusted based on your specific hardware capabilities and performance requirements. After running this command, a large amount of log output will be printed; once the logs indicate that the API server has started, the service is ready to use.
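
A simple way to detect when the server is ready is to poll the OpenAI-compatible /v1/models endpoint until it responds. This is a minimal sketch; the host and port match the docker command above, so adjust them if you changed the mapping:

# Poll /v1/models until the vLLM server answers
import time
import requests

URL = "http://localhost:5000/v1/models"

for _ in range(60):  # wait up to ~5 minutes
    try:
        r = requests.get(URL, timeout=2)
        if r.status_code == 200:
            print("Server ready, models:", [m["id"] for m in r.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
else:
    print("Server did not become ready in time")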

By default, vLLM's REST API listens on port 8000; since we passed --port 5000 above, the examples below use port 5000. You can interact with it using standard HTTP requests:

import requests

response = requests.post('http://192.168.123.23:5000/v1/chat/completions', 
    json={
        'model': 'VLLMQwen2.5-14B',
        'messages': [
            {
                'role': 'system',
                'content': 'You are a helpful AI assistant.'
            },
            {
                'role': 'user',
                'content': 'What is artificial intelligence?'
            }
        ],
        'stream': False
    }
)
print(response.json()['choices'][0]['message']['content'])
To seamlessly integrate with existing applications, vLLM also provides an OpenAI-compatible interface:
from openai import OpenAI

client = OpenAI(
    base_url="http://<my_vLLM_server_ip>:5000/v1",
    api_key="Abel" # vLLM supports API authentication. For testing and comparison, it is also set to Abel 
)

response = client.chat.completions.create(
    model="VLLMQwen2.5-14B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI Agent?"}
        ]
)
print(response.choices[0].message.content)

2.3 vLLM API Features

The vLLM API is designed for high-performance inference and production environments. Its main features are as follows:

  • Efficient GPU optimization: Leverages CUDA and PyTorch to maximize GPU usage, resulting in faster inference.
  • Batch processing capability: Implements continuous batching and efficient memory management to improve throughput for multiple concurrent requests.
  • Security features: Built-in API key support and proper request validation, rather than skipping authentication entirely.
  • Flexible deployment: Full Docker support with fine-grained control over GPU memory usage and model parameters.

Although vLLM requires more parameters and environment setup, it delivers excellent performance and the features needed for production environments.

3. Comparison between Ollama and vLLM

Which runtime inference framework should we use? We can compare Ollama and vLLM from the following dimensions:

  • Resource utilization and efficiency
  • Ease of deployment and maintenance
  • Specific use cases and recommendations
  • Production readiness and security
  • Documentation quality

3.1 Benchmarks

We use the same hardware and models for both frameworks:

Hardware configuration:

  • GPU: NVIDIA RTX 4060 Ti 16GB
  • RAM: 64GB
  • CPU: AMD Ryzen 7
  • Storage: NVMe SSD

Model:

  • Qwen2.5-14B-Instruct (4-bit quantization)
  • Context length: 8192 tokens.
  • Batch size: 1 (single user case).

3.2 Model operation

As a simple test, both frameworks are given the prompt "Generate a 1000 word story".

A single Ollama request takes about 25 seconds, and requests are not executed in parallel. To enable parallel requests, the user must edit /etc/systemd/system/ollama.service (the server runs Ubuntu) and add the line Environment="OLLAMA_NUM_PARALLEL=4", which allows up to 4 parallel requests; systemd then needs to be reloaded and the Ollama service restarted.

[Unit]
Description=Ollama Qwen Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/home/abel_cao/.local/bin:/usr/local/cuda/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_DEBUG=1"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OPENAI_BASE_URL=http://0.0.0.0:11434/api"

[Install]
WantedBy=multi-user.target

This is the limitation of Ollama: it is not a framework for running large models in production environments. Ollama reserves all of the memory its configuration requires, even if only part of it is currently in use. Even with only 4 parallel requests, Ollama seems to have a very difficult time keeping the entire neural network on the GPU, and no reference documentation explaining this behavior could be found.

What is the maximum context length Ollama can support while loading the model 100% into the GPU? Trying PARAMETER num_ctx 24576 in the Modelfile: even though almost 2GB of VRAM is still free, about 4% of the model is still offloaded to the CPU.
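
To see how much of the model actually sits in VRAM versus system RAM, recent Ollama versions expose a native /api/ps endpoint that lists running models together with their total size and the portion loaded into VRAM. The following is a minimal sketch, assuming that endpoint is available on your Ollama version:

# List running models and how much of each is resident in VRAM
import requests

resp = requests.get("http://<my_ollama_server_ip>:11434/api/ps").json()
for m in resp.get("models", []):
    total, vram = m["size"], m["size_vram"]
    print(f"{m['name']}: {vram / total:.0%} of {total / 1e9:.1f} GB in VRAM")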

vLLM takes a purely GPU-optimized approach, and its GGUF quantization support is still experimental. Even so, after several attempts the RTX 4060 Ti was able to handle a 24576-token context with vLLM as well. The following script sends 50 parallel requests to the vLLM server:

import requests
import concurrent.futures

BASE_URL = "http://<my_vLLM_server_ip>:5000/v1"
API_TOKEN = "Abel-1234"
MODEL = "VLLMQwen2.5-14B"

def create_request_body():
    return {
        "model": MODEL,
        "messages": [
            {"role": "user", "content": "Generate a 1000 word story"}
        ]
    }

def make_request(request_body):
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Content-Type": "application/json"
    }
    response = requests.post(f"{BASE_URL}/chat/completions", json=request_body, headers=headers, verify=False)
    return response.json()

def parallel_requests(num_requests):
    request_body = create_request_body()
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_requests) as executor:
        futures = [executor.submit(make_request, request_body) for _ in range(num_requests)]
        results = [future.result() for future in concurrent.futures.as_completed(futures)]
    return results

if __name__ == "__main__":
    num_requests = 50  # Number of concurrent requests
    responses = parallel_requests(num_requests)
    for i, response in enumerate(responses):
        print(f"Response {i+1}: {response}")

More than 100 tokens per second are obtained, and GPU utilization reaches 100%. With the number of concurrent requests set to 50, up to 50 requests can in theory be processed in parallel.
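
The per-request throughput figures quoted here can be estimated from the usage block that the OpenAI-compatible API returns for non-streaming requests. This is a minimal sketch; the server address, token, and model name are the ones used in the benchmark above:

# Estimate tokens/sec for a single request from the returned usage statistics
import time
import requests

start = time.time()
resp = requests.post(
    "http://<my_vLLM_server_ip>:5000/v1/chat/completions",
    headers={"Authorization": "Bearer Abel-1234"},
    json={
        "model": "VLLMQwen2.5-14B",
        "messages": [{"role": "user", "content": "Generate a 1000 word story"}],
    },
).json()
elapsed = time.time() - start

completion_tokens = resp["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"(~{completion_tokens / elapsed:.1f} tokens/sec)")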

In general, the comprehensive comparison between Ollama and vLLM is as follows:

  • Performance overview: The clear winner is vLLM, which delivers more than a 10% improvement even with a single request (Ollama ~25 tokens/sec vs vLLM ~29 tokens/sec).

  • Resource management: vLLM wins again. Ollama's inability to process multiple requests in parallel is very disappointing; due to inefficient resource management, it could not even handle 4 parallel requests.

  • Ease of use and development: Ollama is easier to use; a single command is enough to start chatting with an LLM. vLLM, by contrast, requires some knowledge of Docker and more parameter configuration.

  • Production environments: vLLM is better suited to production, and many AI service providers already use it as the runtime framework behind their AI service endpoints.

  • Security: vLLM supports API-key (token) authentication, while Ollama does not. Anyone who can reach your Ollama endpoint can therefore use it unless you secure it by other means.

  • Documentation support: The two frameworks adopt different documentation support methods. Ollama's documentation is simple and beginner-friendly, but lacks technical depth, especially in terms of performance and parallel processing. Discussions on GitHub often leave some key questions unanswered. In contrast, vLLM provides comprehensive technical documentation with detailed API references and guides. Its GitHub is well maintained by developers, which helps troubleshooting and understanding, and even has a website dedicated to it.

So if the goal is to quickly experiment with large models in a local environment or even on a remote server, Ollama is definitely the solution of choice. Its ease of use is perfect for rapid prototyping, testing ideas, or for developers just starting out with LLM, with a very smooth learning curve.

However, when the focus shifts to performance, scalability, and resource optimization for production environments, vLLM shines. Its excellent handling of parallel requests, efficient GPU utilization, and robust documentation make it a strong contender for large-scale deployment in production environments. The runtime framework's ability to squeeze maximum performance out of available hardware resources is particularly appealing.


4. Other considerations for large model operation framework

The choice of a framework to run large models must depend on our own specific use case, taking into account the following factors:

  • Project size
  • Technical expertise of the team
  • Application-specific performance requirements
  • Development Timeline and Resources
  • Whether customization and fine-tuning is required
  • Long-term maintenance and support considerations

Essentially, while vLLM can provide excellent performance and scalability for production environments, the simplicity of Ollama may be more valuable for certain scenarios, especially in the early stages of development or demo-level projects.

5. Summary in one sentence

The right large model runtime framework is the one that best matches the unique needs and constraints of your project. In some cases it even makes sense to use both: Ollama for rapid prototyping and initial development, and vLLM for scaling and optimizing in production. This hybrid approach lets us leverage the strengths of different runtime frameworks at different stages of the project lifecycle.