A complete analysis of large language model engines: Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama

Written by
Caleb Hayes
Updated on: June 24, 2025

An in-depth analysis of large language model engines: discover the tool that suits you best and unleash the possibilities of AI.

Core content:
1. Transformers engine: an all-round tool in the NLP field, supporting a variety of pre-trained models
2. vLLM engine: the pinnacle of GPU inference performance, improving the inference speed of large models
3. Llama.cpp engine: a lightweight pioneer on the CPU, running large models without a GPU



This article takes a deep look at the Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama engines to help you find the right tool and unleash the full potential of large language models.


1. Transformers Engine: The All-Rounder of the NLP Field

Developer: Hugging Face

Core features: As the most popular open-source NLP library, Transformers is known as the "Swiss Army knife" of NLP. It supports hundreds of pre-trained models, including well-known families such as GPT, BERT, and T5, and provides a one-stop solution covering model loading, fine-tuning, and inference.

Significant advantages:

  • Strong compatibility: works with both PyTorch and TensorFlow, giving developers more choices.

  • Thriving ecosystem: an active community, a rich model hub, and thorough documentation benefit beginners and experts alike.

  • Widely used: suitable for NLP tasks ranging from academic research to industrial production.

  • Applicable scenarios: when you need to quickly implement tasks such as text classification, generation, or translation, Transformers is the go-to choice for building NLP applications; a minimal usage sketch follows below.
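
As a minimal sketch (assuming the transformers package is installed and using the openly available gpt2 checkpoint purely as an example), text generation takes only a few lines:

    from transformers import pipeline

    # Load a small, openly available checkpoint; any model ID from the Hugging Face Hub
    # works the same way.
    generator = pipeline("text-generation", model="gpt2")

    result = generator("Large language model engines are", max_new_tokens=30)
    print(result[0]["generated_text"])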


2. vLLM Engine: Peak Performance for GPU Inference

Developer: UC Berkeley research team

Core features: vLLM focuses on large language model inference. With innovative memory-management techniques such as PagedAttention, it greatly improves GPU utilization and inference speed, making it a "performance monster" for GPU inference.

Significant advantages:

  • Excellent performance: extremely fast inference, able to meet large-scale deployment requirements.

  • Memory efficient: efficient memory management supports larger batches per GPU.

  • Scenario fit: optimized for GPUs, with excellent performance under high concurrency.

  • Applicable scenarios: if you need to deploy a large language model in production and want maximum performance, vLLM is the obvious choice; it raises inference speed and lowers hardware cost. A minimal usage sketch follows below.
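
A minimal offline-inference sketch (assuming vllm is installed and a CUDA GPU is available; the model ID is only an example):

    from vllm import LLM, SamplingParams

    # PagedAttention and continuous batching are handled internally by the engine.
    llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")  # example model ID
    params = SamplingParams(temperature=0.8, max_tokens=64)

    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)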


3. Llama.cpp Engine: A Lightweight Pioneer on the CPU

Developer: Georgi Gerganov and the open-source community

Core features: Llama.cpp is a C++ implementation originally designed to run Meta's LLaMA models. By optimizing computation and memory management, it makes it practical to run large models on a CPU, earning it the title of "lightweight king" on CPU devices.

Significant advantages:

  • Lightweight: no GPU is required; it runs on ordinary CPU devices.

  • Flexible deployment: suitable for resource-constrained environments such as embedded devices and low-spec servers.

  • Open and extensible: being open source makes it easy to extend and customize.

  • Applicable scenarios: when a device has no GPU but still needs to run a large language model, Llama.cpp is an ideal choice, letting ordinary hardware run large models. A minimal usage sketch follows below.
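
A minimal sketch using the llama-cpp-python bindings (assuming a GGUF model file has already been downloaded; the file name below is only an example):

    from llama_cpp import Llama

    # Path to a locally downloaded GGUF file; 4-bit quantization keeps memory usage low on CPU.
    llm = Llama(model_path="./llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

    out = llm("Q: What is llama.cpp good for? A:", max_tokens=64, stop=["Q:"])
    print(out["choices"][0]["text"])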


4. SGLang Engine: A Rising Star for Efficient Inference

Developer: the SGLang team (originating from researchers at UC Berkeley and Stanford, under the LMSYS organization)

Core features: SGLang focuses on efficient inference for complex, multi-call LLM programs. Its runtime is built around RadixAttention, which automatically reuses the KV cache across calls, and it also supports continuous batching and tensor parallelism (see the SGLang engine section below).

Significant advantages:

  • Scenario optimization: deep optimization for specific workloads significantly improves inference efficiency.

  • Enterprise fit: suitable for enterprise-level applications that require high-performance inference.

  • Applicable scenarios: SGLang is worth trying when running large language models at scale or in distributed environments; it is also a useful window into emerging inference techniques. A hedged usage sketch follows below.
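
A hedged sketch based on SGLang's frontend DSL (the exact API may differ between versions, and it assumes an SGLang server has already been started locally, e.g. with `python -m sglang.launch_server --model-path <model> --port 30000`):

    import sglang as sgl

    # A tiny SGLang program: one user turn, one generated assistant turn.
    @sgl.function
    def qa(s, question):
        s += sgl.user(question)
        s += sgl.assistant(sgl.gen("answer", max_tokens=64))

    # Point the frontend at the locally running SGLang server.
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    state = qa.run(question="What does RadixAttention reuse between calls?")
    print(state["answer"])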


5. MLX Engine: Efficient Computing on Apple Silicon

Developer: Apple

Core features: MLX is Apple's open-source array and machine learning framework, designed for efficient computation and inference on Apple Silicon, where the CPU and GPU share unified memory. The companion mlx-lm package makes it straightforward to run large language models on M-series Macs.

Significant advantages:

  • Hardware fit: optimized for Apple Silicon (M-series chips) and its unified memory architecture.

  • Efficiency first: suitable for scenarios that demand high computing efficiency on local Apple hardware.

  • Applicable scenarios: if you want to run a large language model on a Mac with an M-series chip, MLX is well worth a look. A minimal usage sketch follows below.
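
A minimal sketch using mlx-lm (assuming an Apple Silicon Mac; the model ID below is an example from the mlx-community organization on Hugging Face and is only illustrative):

    from mlx_lm import load, generate

    # The weights live in unified memory shared by the CPU and GPU on Apple Silicon.
    model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")

    text = generate(model, tokenizer, prompt="Explain unified memory in one sentence.", max_tokens=64)
    print(text)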


6. Ollama: A Convenient Choice for Running Large Models Locally

Developer: Community Project

Core features: Ollama is a convenient tool for running large language models locally. Built on top of llama.cpp, it supports open models such as Llama, Mistral, and Gemma, and streamlines model download, deployment, and operation.

Significant advantages:

  • Easy to use: simple to operate, suitable for individual users and developers.

  • Local operation: no cloud resources are needed; models run entirely on the local device.

  • Rich model catalog: supports many open models and is flexible to use.

  • Applicable scenarios: if you want to test or run a large language model on your own device, Ollama is an excellent choice; it frees you from cloud dependence so you can experience large models at any time. A minimal sketch of its local REST API follows below.
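
A minimal sketch (assuming Ollama is running locally on its default port 11434 and the llama3 model has been pulled with `ollama pull llama3`):

    import requests

    # Ollama exposes a local REST API; /api/generate returns a completion.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Why run LLMs locally?", "stream": False},
    )
    print(resp.json()["response"])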


7. Comparison across key metrics

1. Performance comparison

| Engine | Performance characteristics | Hardware support | Applicable model scale |
| --- | --- | --- | --- |
| Transformers | Highly versatile with moderate performance; suited to small and medium-scale model inference and training. | CPU/GPU | Small and medium-scale models |
| vLLM | High-performance inference; optimizes GPU memory and compute efficiency through techniques such as PagedAttention. | GPU | Large-scale models |
| Llama.cpp | Optimized for CPU with moderate performance; suited to resource-constrained environments. | CPU | Small and medium-scale models |
| SGLang | High-performance inference; reuses the KV cache across calls via RadixAttention and supports continuous batching. | GPU | Medium to large-scale models |
| MLX | Optimized for Apple Silicon and its unified memory; solid performance on M-series devices. | Apple Silicon | Medium to large-scale models |
| Ollama | Moderate performance, geared toward local use; does not require high-performance hardware. | CPU/GPU | Small and medium-scale models |

Summary:

  • vLLM  has the best inference performance on GPU and is suitable for large-scale models.

  • Llama.cpp  and  Ollama  are suitable for running small and medium-sized models on CPU or low-end devices.

  • SGLang  and  MLX  have great performance potential, but more practical verification is needed.

2.  Concurrency Comparison

| Engine | Concurrency support | Applicable scenarios |
| --- | --- | --- |
| Transformers | Supports multi-threaded and multi-GPU inference, but concurrency is limited by the framework and hardware. | Small and medium concurrent workloads |
| vLLM | High concurrency; memory optimization and continuous batching significantly improve concurrent throughput. | Highly concurrent inference workloads |
| Llama.cpp | Limited concurrency; best suited to low-concurrency scenarios. | Single tasks or low-concurrency workloads |
| SGLang | Supports high concurrency through RadixAttention-based KV-cache reuse and batching; exact capacity depends on the deployment. | Medium to high-concurrency workloads |
| MLX | Concurrency is bounded by the memory and compute of a single Apple Silicon device. | Low to medium-concurrency workloads |
| Ollama | Moderate concurrency; suited to local, low-concurrency use. | Single tasks or low-concurrency workloads |

Summary:

  • vLLM  performs best in high-concurrency scenarios and is suitable for production environments.

  • Transformers  and  SGLang  are suitable for moderately concurrent tasks.

  • Llama.cpp  and  Ollama  are more suitable for single-task or low-concurrency scenarios.

3.  Comparison of applicable scenarios

| Engine | Applicable scenarios | Advantages |
| --- | --- | --- |
| Transformers | Research, development, and small to medium-scale production environments. | Comprehensive features and strong community support; suitable for a wide range of NLP tasks. |
| vLLM | Large-scale model inference and high-concurrency production environments. | Extreme performance and efficient memory management; well suited to enterprise applications. |
| Llama.cpp | Resource-constrained environments (such as embedded devices and low-spec servers). | Lightweight, requires no GPU, suitable for low-cost deployment. |
| SGLang | Medium to large-scale model inference and distributed or multi-call workloads. | Improves throughput via KV-cache reuse and batching; suitable for exploratory and high-throughput projects. |
| MLX | Apple Silicon hardware (M-series Macs). | Optimized for unified memory; well suited to efficient local computation on Macs. |
| Ollama | Local development, testing, and personal use. | Simple and easy to use, requires no cloud resources, suitable for individual users. |

Summary:

  • Transformers  is the most versatile tool and suits most NLP tasks.

  • vLLM  is the first choice for enterprise-level high-concurrency scenarios.

  • Llama.cpp  and  Ollama  are suitable for individual developers or resource-constrained environments.

  • SGLang  and  MLX  are suitable for scenarios that require high performance or specific hardware support.

4.  Hardware compatibility comparison

| Engine | Hardware support | Applicable device types |
| --- | --- | --- |
| Transformers | CPU/GPU | Ordinary servers, personal computers, cloud servers |
| vLLM | GPU | High-performance GPU servers |
| Llama.cpp | CPU | Low-end devices, embedded devices |
| SGLang | GPU | High-performance servers |
| MLX | Apple Silicon | M-series Macs |
| Ollama | CPU/GPU | Personal computers, ordinary servers |

Summary:

  • Transformers  and  Ollama  are the most compatible and support multiple devices.

  • vLLM  and  SGLang  require high-performance GPUs or servers.

  • Llama.cpp  is suitable for low-end devices, while  MLX  requires Apple Silicon.

8. Comparison of tokens per second (TPS)


1. Factors affecting performance

Before comparing TPS (tokens per second), it helps to clarify the key factors that affect it:

  • Hardware performance: GPU compute power, memory bandwidth, memory capacity, etc.

  • Model size: the more parameters, the slower the inference.

  • Batch size: larger batches increase throughput but also increase GPU memory usage.

  • Engine optimization: engines differ significantly in memory management and computational optimization; a rough back-of-envelope estimate follows below.

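As a simplifying sketch (the numbers below are illustrative assumptions, not benchmarks): single-stream decoding is usually memory-bandwidth bound, because every generated token has to read roughly the full set of model weights from GPU memory. That gives a quick upper bound on per-stream TPS:

    # Rough upper bound for single-stream decoding: each generated token reads roughly
    # the full set of model weights, so speed is limited by memory bandwidth.
    bandwidth_gb_per_s = 2000   # ~2 TB/s, e.g. an A100/A800 80GB
    params_billion = 13         # LLaMA-13B
    bytes_per_param = 2         # FP16 weights

    weights_gb = params_billion * bytes_per_param      # ~26 GB of weights
    tokens_per_s = bandwidth_gb_per_s / weights_gb     # ~77 tokens/s per stream
    print(f"~{tokens_per_s:.0f} tokens/s upper bound for a single stream")

Batching amortizes that weight traffic across many concurrent requests, which is why the vLLM throughput figures below are several times higher than this single-stream bound.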

2.  GPU Performance Comparison

The following is a comparison of the main parameters of A800, A100 and H100:

| GPU model | FP32 compute (TFLOPS) | Memory capacity (GB) | Memory bandwidth (TB/s) | Applicable scenarios |
| --- | --- | --- | --- | --- |
| A800 | 19.5 | 40/80 | 2.0 | Inference, training |
| A100 | 19.5 | 40/80 | 2.0 | High-performance computing, AI training |
| H100 | 67 | 80 | 3.35 | High-performance inference, AI training |


  • H100  is the most capable of the three and suits high-throughput, high-concurrency scenarios.

  • The performance of A100  and  A800  is similar, but A800 is mainly aimed at the Chinese market and complies with export control requirements.


3.  Engine TPS comparison

The following are estimated TPS figures on different GPUs (using the LLaMA-13B model as an example):

| Engine | A800 (TPS) | A100 (TPS) | H100 (TPS) | Remarks |
| --- | --- | --- | --- | --- |
| Transformers | 50-100 | 60-120 | 80-150 | Moderate performance; suitable for small and medium-scale inference. |
| vLLM | 200-400 | 300-600 | 500-1000 | High-performance inference with optimized GPU memory use and batching. |

Notes:

  • vLLM performs best on high-end GPUs (such as the H100), with TPS reaching 500-1000, far exceeding the other engines.

  • Transformers offers medium performance and suits general scenarios.

  • Llama.cpp and Ollama are slower and suit resource-constrained environments.

  • Little published performance data is available for SGLang and MLX; further testing is needed.




9. A brief introduction to installing Xinference


Xinference can be installed on Linux, Windows, and macOS via pip. When using Xinference for model inference, you can specify different engines for different models.

If you want to be able to run all supported models, you can install all of the required dependencies with the following command:

pip install "xinference[all]"

Note:

If you want to use GGML models, it is recommended to manually install the required dependencies for your hardware to take full advantage of hardware acceleration. For more details, see the Llama.cpp engine section.

If you only want to install the dependencies you actually need, the detailed steps are as follows.

Transformers Engine

The PyTorch (transformers) engine supports almost all of the latest models and is the default engine for PyTorch-format models:

pip install "xinference[transformers]"

vLLM Engine

vLLM is a high-performance, high-concurrency inference engine for large models. Xinference automatically selects vLLM as the engine, for higher throughput, when all of the following conditions are met:

  • The model format is pytorch, gptq, or awq.

  • When the model format is pytorch, the quantization option must be none.

  • When the model format is awq, the quantization option must be Int4.

  • When the model format is gptq, the quantization option must be Int3, Int4, or Int8.

  • The operating system is Linux and at least one CUDA-capable device is available.

  • The model_family field of a custom model, or the model_name field of a built-in model, is in the vLLM support list.

Currently, supported models include:

  • llama-2, llama-3, llama-2-chat, llama-3-instruct

  • Baichuan, baichuan-chat, baichuan-2-chat

  • internlm-16k, internlm-chat-7b, internlm-chat-8k, internlm-chat-20b

  • mistral-v0.1, mistral-instruct-v0.1, mistral-instruct-v0.2, mistral-instruct-v0.3

  • codestral-v0.1

  • Yi, Yi-1.5, Yi-chat, Yi-1.5-chat, Yi-1.5-chat-16k

  • code-llama, code-llama-python, code-llama-instruct

  • DeepSeek, deepseek-coder, deepseek-chat, deepseek-coder-instruct

  • codeqwen1.5, codeqwen1.5-chat

  • vicuna-v1.3, vicuna-v1.5

  • internlm2-chat

  • qwen-chat

  • mixtral-instruct-v0.1, mixtral-8x22B-instruct-v0.1

  • chatglm3, chatglm3-32k, chatglm3-128k

  • glm4-chat, glm4-chat-1m

  • qwen1.5-chat, qwen1.5-moe-chat

  • qwen2-instruct, qwen2-moe-instruct

  • gemma-it

  • orion-chat, orion-chat-rag

  • c4ai-command-r-v01

Install xinference and vLLM:

pip install "xinference[vllm]"

Llama.cpp Engine

Xinference supports models in gguf and ggml formats through llama-cpp-python. It is recommended to install the dependencies manually for the hardware you are using to get the best acceleration.

Initial steps:

pip install xinference

Installation methods for different hardware:

  • Apple M series:

    CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
  • NVIDIA graphics cards:

    CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
  • AMD graphics cards:

    CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

SGLang Engine

SGLang provides a high-performance inference runtime built on RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls, and it also supports other common techniques such as continuous batching and tensor parallelism.

Initial steps:

pip install 'xinference[sglang]'
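
After installation, the typical flow is to start a local Xinference server and then drive it from the Python client. The sketch below is hedged: it assumes a server is already running at the default address, uses the built-in qwen-chat model purely as an example, and the exact client calls may differ between Xinference versions:

    from xinference.client import Client

    # Connect to a locally running Xinference server (see XINFERENCE_ENDPOINT below).
    client = Client("http://127.0.0.1:9997")

    # Launch a built-in chat model; Xinference picks a suitable engine automatically
    # (for example vLLM when the conditions listed above are met).
    model_uid = client.launch_model(model_name="qwen-chat")

    model = client.get_model(model_uid)
    print(model.chat(prompt="Which inference engines does Xinference support?"))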

10. Xinference environment variables

XINFERENCE_ENDPOINT

The address of the Xinference service, used by clients to connect to it. The default is http://127.0.0.1:9997; the actual address can also be found in the logs.

XINFERENCE_MODEL_SRC

Configures the model download source. The default is "huggingface"; it can also be set to "modelscope".

XINFERENCE_HOME

By default, Xinference uses <HOME>/.xinference as the directory for storing models and other necessary files such as logs, where <HOME> is the home directory of the current user. You can change this directory by setting this environment variable.

XINFERENCE_HEALTH_CHECK_ATTEMPTS

The number of health-check attempts made when Xinference starts. If the check still fails after this many attempts, startup reports an error. The default value is 3.

XINFERENCE_HEALTH_CHECK_INTERVAL

The interval between health checks when Xinference starts. If the checks keep failing, startup reports an error. The default value is 3.

XINFERENCE_DISABLE_HEALTH_CHECK

When the conditions are met, Xinference will automatically report the worker health status. Setting this environment variable to 1 can disable the health check.

XINFERENCE_DISABLE_VLLM

When the conditions are met, Xinference will automatically use vLLM as the inference engine to improve inference efficiency. You can disable vLLM by setting the environment variable to 1.

XINFERENCE_DISABLE_METRICS

Xinference enables the metrics exporter on the supervisor and workers by default. Setting this environment variable to 1 disables the /metrics endpoint on the supervisor and disables the worker's HTTP service (which exists only to serve /metrics).
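
As a minimal example (the directory below is purely illustrative), these variables are simply exported in the shell before the Xinference service is started, and the service reads them at startup:

    export XINFERENCE_MODEL_SRC=modelscope     # download models from ModelScope instead of Hugging Face
    export XINFERENCE_HOME=/data/xinference    # illustrative directory for models and logs
    export XINFERENCE_DISABLE_METRICS=1        # turn off the /metrics endpoints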