A complete analysis of large language model inference engines: Transformers, vLLM, Llama.cpp, SGLang, MLX, and Ollama

An in-depth look at large language model inference engines to help you find the most suitable tool and unlock the possibilities of AI.
Core content:
1. Transformers engine: an all-round tool in the NLP field, supporting a variety of pre-trained models
2. vLLM engine: the pinnacle of GPU inference performance, improving the inference speed of large models
3. Llama.cpp engine: a lightweight pioneer on the CPU, running large models without a GPU
This article will take you deep into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama engines to help you find the most suitable tools and unleash the full potential of large language models!
1. Transformers Engine: The All-Rounder of the NLP Field
Developer : Hugging Face
Core features : As the most popular open-source NLP library, Transformers is known as the "Swiss Army knife" of NLP. It supports hundreds of pre-trained models, including well-known families such as GPT, BERT, and T5, and provides a one-stop workflow from model loading and fine-tuning to inference.
Significant advantages :
Strong compatibility : Perfectly compatible with PyTorch and TensorFlow, providing developers with more choices.
Thriving ecosystem : an active community, a rich model hub, and thorough documentation mean both beginners and experts can benefit from it.
Widely used : Suitable for various NLP tasks from academic research to industrial production.
Applicable scenarios : When you need to quickly implement tasks such as text classification, generation, or translation, Transformers is usually the first choice and helps you build NLP applications with little effort.
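As a quick illustration of that workflow, here is a minimal text-generation sketch using the Transformers pipeline API; gpt2 is used only as a small example model, so swap in any causal LM you have access to:

```python
# Minimal Transformers text-generation sketch; "gpt2" is just a small example model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=30)
print(result[0]["generated_text"])
```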
2. vLLM Engine: Peak Performance of GPU Inference
Developer : UC Berkeley research team
Core features : vLLM focuses on large language model inference. With innovative memory management techniques such as PagedAttention, it greatly improves GPU utilization and inference speed, making it a "performance monster" for GPU inference.
Significant advantages :
Excellent performance : Extreme inference speed, able to meet large-scale deployment requirements.
Memory efficient : efficient KV-cache management supports larger batch sizes on the same hardware.
Scenario fit : optimized for GPUs, with excellent performance in high-concurrency scenarios.
Applicable scenarios : If you need to deploy a large language model in a production environment and pursue extreme performance, vLLM is undoubtedly the best choice. It can improve the model inference speed and reduce hardware costs.
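To make this concrete, below is a minimal offline-inference sketch with vLLM's Python API; the model name is only an example, and a CUDA-capable GPU is assumed:

```python
# Minimal vLLM offline-inference sketch; assumes a CUDA GPU and an example model name.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")         # weights downloaded from Hugging Face
params = SamplingParams(temperature=0.8, max_tokens=64)  # sampling settings for generation
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```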
3. Llama.cpp Engine: The Lightweight Pioneer on CPU
Developer : Georgi Gerganov and the open-source community
Core features : Llama.cpp is implemented in C/C++ and was originally designed to run Meta's LLaMA models. By optimizing computation and memory management (including aggressive quantization), it makes running large models on a plain CPU practical, earning it the title of "lightweight king" on CPU devices.
Significant advantages :
Lightweight : No GPU is required and it can run on ordinary CPU devices.
Flexible deployment : suitable for resource-constrained environments such as embedded devices and low-spec servers.
Open source and extensible : the open-source codebase is easy to extend and customize.
Applicable scenarios : When the device has no GPU resources but needs to run a large language model, Llama.cpp is an ideal choice, allowing ordinary devices to experience the power of large language models.
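For reference, a minimal CPU-only sketch with the llama-cpp-python bindings might look like the following; the GGUF file path is a placeholder for a quantized model you have downloaded locally:

```python
# Minimal llama-cpp-python sketch; the GGUF path is a placeholder for a local model file.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048, n_threads=8)
out = llm("Q: What is quantization? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```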
4. SGLang Engine: A Rising Star for Efficient Inference
Developer : LMSYS Org (an open-source team with roots at UC Berkeley and Stanford)
Core features : SGLang focuses on efficient inference. Its runtime is built around RadixAttention, which automatically reuses the KV cache across calls, combined with techniques such as continuous batching and tensor parallelism. A relative newcomer, it shows considerable potential.
Significant advantages :
Scenario optimization : deep optimization for multi-call and structured-generation workloads significantly improves inference efficiency.
Enterprise adaptation : suitable for enterprise-level applications that require high-performance inference.
Applicable scenarios : SGLang is worth trying for running large language models in large-scale or distributed serving environments; it is an important window into future inference technology.
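As a taste of how SGLang expresses multi-call programs, here is a sketch of its frontend DSL based on the project's documented usage; it assumes an SGLang server is already running locally (for example via python -m sglang.launch_server), and the exact API may differ between versions:

```python
# Sketch of SGLang's frontend DSL; assumes a local SGLang server on port 30000.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def two_questions(s, q1, q2):
    s += sgl.user(q1)
    s += sgl.assistant(sgl.gen("a1", max_tokens=64))
    # The shared conversation prefix lets RadixAttention reuse the KV cache here.
    s += sgl.user(q2)
    s += sgl.assistant(sgl.gen("a2", max_tokens=64))

state = two_questions.run(q1="What is RadixAttention?", q2="Why does KV-cache reuse help?")
print(state["a1"], state["a2"], sep="\n")
```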
5. MLX Engine: The Future of Efficient Computing
Developer : Apple (machine learning research)
Core features : MLX is Apple's open-source machine learning framework, designed for Apple silicon. Its unified-memory model keeps arrays in memory shared by the CPU and GPU, and the companion mlx-lm package focuses on running and fine-tuning large language models on Macs, making MLX a "future star" of efficient on-device computing.
Significant advantages :
Hardware adaptation : optimized for Apple's M-series chips and their unified memory architecture rather than for NVIDIA GPUs or TPUs.
Efficiency first : Applicable to scenarios that require extreme computing efficiency.
Applicable scenarios : If you need to run a large language model locally on a Mac with Apple silicon, MLX is worth watching; its hardware-level optimization makes efficient on-device inference practical.
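The sketch below shows what running a model through the companion mlx-lm package typically looks like; the model repository name is only an example from the mlx-community organization, and the exact generate signature may vary by version:

```python
# Sketch using mlx-lm on Apple silicon; the model repo name is just an example.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
text = generate(model, tokenizer, prompt="Explain unified memory briefly.", max_tokens=64)
print(text)
```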
6. Ollama: A Convenient Choice for Running Large Models Locally
Developer : Ollama team (open-source project)
Core features : Ollama is a handy tool for running large language models locally. Built on top of llama.cpp, it supports many open models such as Llama, Mistral, and Gemma, and it simplifies model download, deployment, and operation.
Significant advantages :
Easy to use : easy to operate, suitable for individual users and developers.
Local operation : No cloud resources are required, and the model can be run completely on the local device.
Rich models : supports multiple models and is flexible to use.
Applicable scenarios : If you want to test or run a large language model on your personal device, Ollama is an excellent choice, freeing you from cloud dependence so you can experience large models anytime.
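Once the Ollama service is running and a model has been pulled (for example with ollama pull llama3), it can be called over its local REST API; the sketch below uses the default port 11434 and llama3 purely as an example model:

```python
# Sketch calling a locally running Ollama server over its REST API (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why run LLMs locally?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```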
7. Metric comparison
1. Performance comparison
(Table: overall performance comparison of the six engines)
2. Concurrency Comparison
(Table: concurrency comparison of Transformers and vLLM)
vLLM performs best in high-concurrency scenarios and is suitable for production environments.
3. Comparison of applicable scenarios
(Table: applicable scenarios for Transformers, vLLM, MLX, and Ollama)
4. Hardware compatibility comparison
(Table: hardware compatibility of Transformers, vLLM, Llama.cpp, SGLang, MLX, and Ollama)
8. Tokens-per-second (TPS) comparison
1. Factors affecting performance
Before comparing TPS, it is necessary to clarify the key factors that affect performance:
Hardware performance: GPU compute power, memory (VRAM) bandwidth, VRAM capacity, etc.
Model size: the larger the parameter count, the slower the inference.
Batch size: larger batches increase throughput but also increase VRAM usage.
Engine optimization: Different engines have significant differences in performance in terms of memory management, computational optimization, etc.
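TPS itself is simply output tokens divided by wall-clock generation time. The sketch below shows one rough way to measure it for any engine; generate_fn and count_tokens are hypothetical placeholders for the engine call and tokenizer of whatever stack you are benchmarking:

```python
import time

def measure_tps(generate_fn, count_tokens, prompt, runs=3):
    """Rough TPS estimate: total output tokens / total wall-clock generation time."""
    total_tokens, total_seconds = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        text = generate_fn(prompt)          # placeholder: call the engine under test
        total_seconds += time.perf_counter() - start
        total_tokens += count_tokens(text)  # placeholder: count tokens with the model's tokenizer
    return total_tokens / total_seconds
```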
2. GPU Performance Comparison
For reference, the main differences among the A800, A100, and H100: the A800 is the export variant of the A100, with NVLink bandwidth reduced from 600 GB/s to 400 GB/s but otherwise comparable compute, while the H100 is a newer Hopper-generation GPU with substantially higher FP16/FP8 throughput and faster HBM3 memory.
3. Engine TPS comparison
The following are the estimated TPS of each engine on different GPUs (taking the LLaMA-13B model as an example):
(Table: estimated TPS of Transformers and vLLM on each GPU)
Notes:
vLLM performs best on high-performance GPUs (such as H100), with TPS reaching 500-1000, far exceeding other engines.
Transformers offers moderate performance and suits general scenarios.
Llama.cpp and Ollama have lower raw throughput and are suited to resource-constrained environments.
Published performance data for SGLang and MLX is still limited and needs further testing.
9. A brief introduction to installing Xinference
Xinference can be installed with pip on Linux, Windows, and macOS. When using Xinference for model inference, you can specify different engines for different models.
If you want to be able to infer all supported models, you can install all required dependencies with the following command:
pipinstall "xinference[all]"
Note
If you want to use GGML models, it is recommended to manually install the required dependencies according to the hardware you are using to take full advantage of the hardware acceleration capabilities. For more details, please refer to the Llama.cpp engine section.
If you only want to install the dependencies you actually need, the per-engine steps are listed below.
Transformers Engine
The PyTorch (transformers) engine supports almost all of the latest models. It is the default engine for models in PyTorch format:
pipinstall"xinference[transformers]"
vLLM Engine
vLLM is a high-performance inference engine for large models with strong support for high concurrency. Xinference will automatically select vLLM as the engine, to achieve higher throughput, when all of the following conditions are met:
- The model format is pytorch, gptq, or awq.
- When the model format is pytorch, the quantization option must be none.
- When the model format is awq, the quantization option must be Int4.
- When the model format is gptq, the quantization option must be Int3, Int4, or Int8.
- The operating system is Linux and there is at least one CUDA-capable device.
- The model_family field (for custom models) or the model_name field (for built-in models) is in the vLLM support list.
Currently, supported models include:
- llama-2, llama-3, llama-2-chat, llama-3-instruct
- Baichuan, baichuan-chat, baichuan-2-chat
- internlm-16k, internlm-chat-7b, internlm-chat-8k, internlm-chat-20b
- mistral-v0.1, mistral-instruct-v0.1, mistral-instruct-v0.2, mistral-instruct-v0.3
- codestral-v0.1
- Yi, Yi-1.5, Yi-chat, Yi-1.5-chat, Yi-1.5-chat-16k
- code-llama, code-llama-python, code-llama-instruct
- DeepSeek, deepseek-coder, deepseek-chat, deepseek-coder-instruct
- codeqwen1.5, codeqwen1.5-chat
- vicuna-v1.3, vicuna-v1.5
- internlm2-chat
- qwen-chat
- mixtral-instruct-v0.1, mixtral-8x22B-instruct-v0.1
- chatglm3, chatglm3-32k, chatglm3-128k
- glm4-chat, glm4-chat-1m
- qwen1.5-chat, qwen1.5-moe-chat
- qwen2-instruct, qwen2-moe-instruct
- gemma-it
- orion-chat, orion-chat-rag
- c4ai-command-r-v01
Install xinference and vLLM:
pipinstall "xinference[vllm]"
Llama.cpp Engine
Xinference supports models in gguf and ggml format via llama-cpp-python. It is recommended to install the dependencies manually, according to the hardware you are using, to get the best acceleration.
Initial steps:
pip install xinference
Installation methods for different hardware:
Apple M series:
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
NVIDIA graphics cards:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
AMD graphics cards:
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
SGLang Engine
SGLang provides a high-performance inference runtime based on RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls, and it also supports common inference techniques such as continuous batching and tensor parallelism.
Initial steps:
pip install 'xinference[sglang]'
10. Xinference environment variables
XINFERENCE_ENDPOINT
Xinference service address, used to connect to Xinference. The default address is http://127.0.0.1:9997, which can be obtained from the log.
XINFERENCE_MODEL_SRC
Configures the model download source. The default is "huggingface"; it can also be set to "modelscope".
XINFERENCE_HOME
By default, Xinference uses <HOME>/.xinference as the directory for storing models and other necessary files such as logs, where <HOME> is the home directory of the current user. You can change this default directory by setting this environment variable.
XINFERENCE_HEALTH_CHECK_ATTEMPTS
The number of health-check attempts when Xinference starts. If the check still fails after this many attempts, startup reports an error. The default value is 3.
XINFERENCE_HEALTH_CHECK_INTERVAL
The interval between health checks when Xinference starts. If the check does not pass within this period, startup reports an error. The default value is 3.
XINFERENCE_DISABLE_HEALTH_CHECK
When the conditions are met, Xinference will automatically report the worker health status. Setting this environment variable to 1 can disable the health check.
XINFERENCE_DISABLE_VLLM
When the conditions are met, Xinference will automatically use vLLM as the inference engine to improve inference efficiency. You can disable vLLM by setting the environment variable to 1.
XINFERENCE_DISABLE_METRICS
Xinference enables the metrics exporter on the supervisor and workers by default. Setting this environment variable to 1 disables the /metrics endpoint on the supervisor and disables the worker's HTTP service (which only serves the /metrics endpoint).
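As a concrete example, the sketch below sets a few of these variables from Python before starting a local Xinference instance; the ModelScope source and the custom home directory are arbitrary example values:

```python
# Example of configuring Xinference via environment variables before launching it locally.
import os
import subprocess

os.environ["XINFERENCE_MODEL_SRC"] = "modelscope"     # download models from ModelScope
os.environ["XINFERENCE_HOME"] = "/data/xinference"    # example directory for models and logs
os.environ["XINFERENCE_DISABLE_METRICS"] = "1"        # turn off the metrics exporter

# The child process inherits the environment configured above.
subprocess.run(["xinference-local", "--host", "127.0.0.1", "--port", "9997"], check=True)
```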