How to choose between Ollama and vLLM for large model deployment? This article explains the advantages, disadvantages, and applicable scenarios of the two frameworks.

Written by Jasper Cole
Updated on: June 27, 2025
Recommendation

Explore new options for large model deployment and gain in-depth understanding of Ollama and vLLM frameworks.

Core content:
1. Cross-platform installation guide for the Ollama framework
2. Quick start and personalized configuration of Ollama models
3. Native API calls and OpenAI-compatible Python interface for the Ollama framework


Ollama

With the rapid development of artificial intelligence technology, large language models (LLMs) are increasingly used. As an innovative open source framework, Ollama provides developers and researchers with a new solution for efficiently deploying and running LLMs in local environments.

Cross-platform installation guide

Ollama supports mainstream operating systems and the installation process is extremely simple:

• Linux users can install via terminal with one click:

curl -fsSL https://ollama.com/install.sh | sh

• macOS users are recommended to use Homebrew:

brew install ollama

• Windows users can easily deploy via WSL

Model Quick Start Example

Starting a pre-trained model only requires a simple command:

ollama run qwen2.5:14b --verbose

Adding the --verbose flag reports the token generation rate in real time, which helps with performance tuning.

Personalized model configuration

We can customize a model in depth through a Modelfile, for example by creating a file named Modelfile with the following content:

FROM qwen2.5:14b

# Model parameter settings
PARAMETER temperature 0.7
PARAMETER num_ctx 16384
PARAMETER num_predict 8192

# Role definition
SYSTEM "You are a professional technical consultant"

Build and run the custom model:

ollama create custom-model -f Modelfile
ollama run custom-model --verbose

Interaction

  1. Native API interface call example:
import requests

response = requests.post(
    'http://<my_ollama_server_ip>:11434/api/chat',
    json={
        'model': 'qwen2.5:14b',
        'messages': [
            {
                'role': 'system',
                'content': 'You are a helpful AI assistant.'
            },
            {
                'role': 'user',
                'content': 'What is AI Agent?'
            }
        ],
        'stream': False
    }
)
print(response.json()['message']['content'])
  2. Python implementation compatible with the OpenAI interface:
from openai import OpenAI

client = OpenAI(
    base_url="http://<my_ollama_server_ip>:11434/v1",
    api_key="xx"  # can be set to any string
)

response = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI Agent?"}
    ]
)
print(response.choices[0].message.content)
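
The --verbose flag is not the only way to see throughput. When streaming is disabled, Ollama's native /api/chat response also includes timing fields such as eval_count (tokens generated) and eval_duration (nanoseconds spent generating), from which the token rate can be derived. A minimal sketch, reusing the server address and model from the examples above:

import requests

# Derive tokens/s from the timing fields Ollama returns in the final
# non-streaming /api/chat response.
response = requests.post(
    'http://<my_ollama_server_ip>:11434/api/chat',
    json={
        'model': 'qwen2.5:14b',
        'messages': [{'role': 'user', 'content': 'What is AI Agent?'}],
        'stream': False
    }
)
data = response.json()
tokens = data.get('eval_count', 0)     # tokens generated
nanos = data.get('eval_duration', 1)   # generation time in nanoseconds
print(f"{tokens} tokens at {tokens / nanos * 1e9:.1f} tokens/s")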

Core Features Highlights

  • Streaming token generation: Ollama streams tokens in real time, is fully compatible with the OpenAI API, and is well suited to responsive applications (a minimal streaming sketch follows this list).

  • Parallel model running: Ollama can serve multiple models at the same time, with one caveat: when VRAM is limited, it will unload one model to start another, so plan resources accordingly.

  • Highly customizable settings: API calls expose a wide range of configuration options. This flexibility is powerful, but it can be less approachable for beginners and for production server setups.

  • CPU fallback and resource management: if VRAM is insufficient, Ollama can offload the model to the CPU, which makes large model serving possible even on systems with limited GPU memory.

  • Programming language agnostic: you can develop against Ollama from Python, JavaScript, Go, or any language that can make HTTP requests.
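
To illustrate the streaming point above, here is a minimal sketch of token-by-token output through Ollama's OpenAI-compatible endpoint; it reuses the server address and model from the earlier examples and assumes nothing beyond the openai client already shown.

from openai import OpenAI

client = OpenAI(
    base_url="http://<my_ollama_server_ip>:11434/v1",
    api_key="xx"  # any string works; Ollama does not check it
)

# With stream=True the client yields chunks as tokens are generated,
# instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="qwen2.5:14b",
    messages=[{"role": "user", "content": "What is AI Agent?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()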

vLLM

In the field of deep learning inference, the vLLM framework stands out for its performance. Built on PyTorch and tightly integrated with CUDA, it provides an industrial-grade runtime for large-scale language model deployment through continuous batching, intelligent memory allocation, and distributed tensor-parallel computation.

Compared with a simpler tool like Ollama, vLLM is better suited to containerized deployment: Docker's standardized packaging sidesteps cross-platform compatibility issues. Before deploying, make sure the following requirements are met:

  1. The Docker runtime environment has been correctly configured
  2. NVIDIA container runtime support installed
  3. 16GB or more physical memory capacity
  4. NVIDIA graphics card with sufficient video memory

Download the model

The following demonstrates how to deploy the Qwen2.5-14B model in a container environment:

First, create a model storage directory and obtain the quantized model:

mkdir -p models/Qwen2.5-14B-Instruct/
curl -L https://huggingface.co/lmstudio-community/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf \
    -o models/Qwen2.5-14B-Instruct/Qwen2.5-14B-Instruct-Q4_K_M.gguf
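
If you prefer to stay in Python rather than shell out to curl, the huggingface_hub package can fetch the same file. This is a sketch under the assumption that huggingface_hub is installed (pip install huggingface_hub); the repository and filename are the ones used in the curl command above.

from huggingface_hub import hf_hub_download

# Downloads the same quantized GGUF file into the directory created above.
path = hf_hub_download(
    repo_id="lmstudio-community/Qwen2.5-14B-Instruct-GGUF",
    filename="Qwen2.5-14B-Instruct-Q4_K_M.gguf",
    local_dir="models/Qwen2.5-14B-Instruct/",
)
print(path)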

In addition to curl (or the huggingface_hub sketch above), you can also download models with a modelscope script:

export HF_ENDPOINT=https://hf-mirror.com

pip install modelscope

The following uses modelscope to download and cache a model to /usr/local; change the model ID to whichever model you want to download.

from modelscope import snapshot_download
from modelscope.hub.api import HubApi

api = HubApi()
# Some models require a login; the access token is shown in the top-right
# corner of your account page on modelscope.cn/models.
# api.login('xxx your account access token')

model_dir = snapshot_download('Qwen/Qwen2.5-72B-Instruct-AWQ', cache_dir='/usr/local', revision='master')
print(model_dir)

Start the model

We also need to provide a generation_config.json file. For testing convenience, temperature is set to 0 here.

{
  "bos_token_id": 151643,
  "pad_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "repetition_penalty": 1.05,
  "temperature": 0.0,
  "top_p": 0.8,
  "top_k": 20,
  "transformers_version": "4.37.0"
}

Therefore, create a folder containing this JSON file and make sure it is named generation_config.json (this is the ./config directory mounted into the container below). Then run the Docker container with the following parameters:

# GPU support required
docker run -it \
    --runtime nvidia \
    --gpus all \
    --network="host" \
    --ipc=host \
    -v ./models:/vllm-workspace/models \
    -v ./config:/vllm-workspace/config \
    vllm/vllm-openai:latest \
    --model models/Qwen2.5-14B-Instruct/Qwen2.5-14B-Instruct-Q4_K_M.gguf \
    --tokenizer Qwen/Qwen2.5-14B-Instruct \
    --host "0.0.0.0" \
    --port 5000 \
    --gpu-memory-utilization 1.0 \
    --served-model-name "VLLMQwen2.5-14B" \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --generation-config config

The meaning of these parameters is as follows:

--runtime nvidia --gpus all: enable NVIDIA GPU support for the container.
--network="host": use host network mode for better performance.
--ipc=host: enable shared memory between the host and the container.
-v ./models:/vllm-workspace/models: mount the local model directory (containing the Qwen2.5-14B GGUF file) into the container.
-v ./config:/vllm-workspace/config: mount the directory containing generation_config.json.
--model: path to the GGUF model file inside the container.
--tokenizer: the HuggingFace tokenizer to use.
--host / --port: address and port the OpenAI-compatible API server listens on.
--gpu-memory-utilization 1.0: set GPU memory utilization to 100%.
--served-model-name: the model name exposed through the API; you can choose any name.
--max-num-batched-tokens: maximum number of tokens in a batch.
--max-num-seqs: maximum number of sequences processed simultaneously.
--max-model-len: maximum context length of the model.
--generation-config: directory containing the generation_config.json to apply.
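
Once the container is up, a quick sanity check is to list the models the server exposes. A minimal sketch, assuming you run it on the host machine (otherwise replace localhost with the server's IP) and that the container above is listening on port 5000:

import requests

# vLLM's OpenAI-compatible server exposes GET /v1/models; the returned id
# should match the --served-model-name passed to the container.
response = requests.get('http://localhost:5000/v1/models', timeout=5)
response.raise_for_status()
for model in response.json()['data']:
    print(model['id'])  # expect "VLLMQwen2.5-14B"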

Interaction

  1. Native API interface call example:
import requests

response = requests.post(
    'http://192.168.123.23:5000/v1/chat/completions',
    json={
        'model': 'VLLMQwen2.5-14B',
        'messages': [
            {
                'role': 'system',
                'content': 'You are a helpful AI assistant.'
            },
            {
                'role': 'user',
                'content': 'What is artificial intelligence?'
            }
        ],
        'stream': False
    }
)
print(response.json()['choices'][0]['message']['content'])
  2. Python implementation compatible with the OpenAI interface:
from openai import OpenAI

client = OpenAI(
    base_url="http://<my_vLLM_server_ip>:5000/v1",
    api_key="xx"
)

response = client.chat.completions.create(
    model="VLLMQwen2.5-14B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is AI Agent?"}
    ]
)
print(response.choices[0].message.content)

Core Features Highlights

vLLM is designed specifically for high-performance inference and production environments. Its main features include:

  • Optimized GPU performance: by combining CUDA with PyTorch, vLLM fully exploits the GPU for faster inference.

  • Batching: continuous batching and efficient memory management improve throughput when serving many parallel requests (see the concurrency sketch after this list).

  • Security: built-in API key support and proper request authentication, rather than leaving the endpoint unauthenticated.

  • Flexible deployment: Full support for Docker, allowing fine-grained control over GPU memory usage and model parameters.
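
As a rough illustration of the batching point above, the sketch below fires several requests at the vLLM endpoint concurrently; continuous batching lets the server interleave them on the GPU instead of queueing them one by one. It assumes the server address and served model name from the previous section and the openai package's async client.

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="http://<my_vLLM_server_ip>:5000/v1",
    api_key="xx",
)

async def ask(question: str) -> str:
    response = await client.chat.completions.create(
        model="VLLMQwen2.5-14B",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

async def main() -> None:
    questions = [f"Explain concept number {i} of AI in one sentence." for i in range(8)]
    # All eight requests are in flight at once; vLLM batches them continuously.
    answers = await asyncio.gather(*(ask(q) for q in questions))
    for question, answer in zip(questions, answers):
        print(question, "->", answer[:80])

asyncio.run(main())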

Ollama vs vLLM

  1. Performance: benchmark tests show that vLLM has a clear speed advantage; its single-request token generation rate is more than 15% higher than Ollama's (measured: vLLM 29 tokens/s vs Ollama 25 tokens/s).

  2. Concurrent processing capability: vLLM uses advanced resource scheduling to handle high-concurrency requests efficiently, whereas Ollama has architectural limitations in parallel request processing; even a small number of concurrent requests (e.g. 4) can cause resource contention.

  3. Development convenience: Ollama stands out with its minimalist interaction, letting developers talk to a model with a single command. In contrast, vLLM requires familiarity with container deployment and an understanding of its performance-tuning parameters.

  4. Production readiness: vLLM's industrial-grade features make it a natural choice for enterprise deployment; many well-known AI service providers have adopted it as their core inference engine. The framework supports fine-grained resource allocation and elastic scaling, and fits cloud-native environments well.

  5. Security mechanism: vLLM ships with built-in authentication and supports token-based access control, while Ollama defaults to open access and needs additional network-layer protection to secure the service.

  6. Technical support: Ollama's documentation focuses on getting started quickly but is light on implementation details, and key technical questions in the community forum often go unanswered. vLLM has built a multi-layered technical support system, including:

  • Detailed API specification documentation

  • Performance Tuning White Paper

  • Active developer community

  • Dedicated technology portal

Comparison Dimension | Ollama | vLLM
--- | --- | ---
Core positioning | Lightweight local large-model runner (suited to personal development and experiments) | Production-grade large-model inference framework (suited to enterprise and high-concurrency scenarios)
Deployment difficulty | Simple: one-click installation, supports Mac/Linux/Windows (WSL) | More complex: depends on the Python environment, requires manual GPU driver and CUDA configuration
Hardware requirements | Low: runs on CPU (16GB+ RAM recommended), GPU acceleration optional | High: requires an NVIDIA GPU (the more VRAM the better), relies on CUDA
Model support | Built-in mainstream open-source models (Llama2, Mistral, DeepSeek, etc.), pre-trained models downloaded automatically | Supports HuggingFace-format models, which must be downloaded and converted manually
Runtime performance | Medium: suited to single Q&A and small-scale interaction | Very high: optimized VRAM management and batching, supports thousands of concurrent requests
Usage scenarios | Personal learning, local testing, rapid prototyping | Enterprise API services, high-concurrency inference, cloud deployment
Interaction | Direct command-line dialogue, with a ChatGPT-like interactive experience | Called through an API (OpenAI-compatible interface), no built-in dialogue interface
Resource usage | Flexible: CPU/memory usage can be tuned, suitable for low-end machines | Fixed: large VRAM footprint, resources must be reserved for peak load
Scalability | Limited: focused on single-machine local operation | Strong: supports distributed deployment, dynamic batching, and multi-GPU parallelism
Newbie friendliness | Very high: works out of the box, no code required | Moderate: requires Python and API development basics
Community support | Active developer community and clear documentation | Maintained by an academic team, frequently updated, documentation tends to be more technical
Typical uses | Personal tasks such as coding, translation, copywriting | Building intelligent customer service, batch document processing, AI-enabled business systems



Summary

If you want to quickly experiment with large models on a local or remote server, Ollama is the easier choice; its simplicity lets developers new to large language models get started smoothly. For production environments that demand performance, scalability, and resource efficiency, vLLM excels: it handles parallel requests efficiently, optimizes GPU utilization, and is well documented, making it a strong candidate for large-scale production deployment, especially where the hardware must be pushed to its limits.