A complete analysis of large language model inference engines: Transformers, vLLM, Llama.cpp, SGLang, MLX, and Ollama

An in-depth look at large language model inference engines to help you find the most suitable tool and unlock the possibilities of AI.
Core content:
1. Transformers engine: an all-round tool in the NLP field, supporting a variety of pre-trained models
2. vLLM engine: the pinnacle of GPU inference performance, improving the inference speed of large models
3. Llama.cpp engine: a lightweight pioneer on the CPU, running large models without a GPU
This article will take you deep into Transformers, vLLM, Llama.cpp, SGLang, MLX and Ollama engines to help you find the most suitable tools and unleash the full potential of large language models!
1. Transformers Engine: The All-Rounder of the NLP Field
Developer : Hugging Face
Core features : As the most popular open-source NLP library, Transformers is known as the "Swiss Army knife" of NLP. It supports hundreds of pre-trained models, including well-known families such as GPT, BERT, and T5, and provides a one-stop workflow from model loading and fine-tuning to inference.
Significant advantages :
Strong compatibility : Perfectly compatible with PyTorch and TensorFlow, providing developers with more choices.
Thriving ecosystem : an active community, a rich model hub, and thorough documentation mean both beginners and experts can benefit from it.
Widely used : Suitable for various NLP tasks from academic research to industrial production.
Applicable scenarios : When you need to quickly implement tasks such as text classification, generation, or translation, Transformers is usually the first choice and helps you build NLP applications with little effort.
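As a quick illustration of that workflow, here is a minimal text-generation sketch using the Transformers pipeline API; gpt2 is used only as a small example model, so swap in any causal LM you have access to:

```python
# Minimal Transformers text-generation sketch; "gpt2" is just a small example model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Large language models are", max_new_tokens=30)
print(result[0]["generated_text"])
```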
2. vLLM Engine: Peak Performance of GPU Inference
Developer : UC Berkeley research team
Core features : vLLM focuses on large language model inference. With innovative memory management techniques such as PagedAttention, it greatly improves GPU utilization and inference speed, making it a "performance monster" for GPU inference.
Significant advantages :
Excellent performance : Extreme inference speed, able to meet large-scale deployment requirements.
Memory efficient : efficient KV-cache management supports larger batch sizes on the same hardware.
Scenario fit : optimized for GPUs, with excellent performance in high-concurrency scenarios.
Applicable scenarios : If you need to deploy a large language model in a production environment and pursue extreme performance, vLLM is undoubtedly the best choice. It can improve the model inference speed and reduce hardware costs.
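To make this concrete, below is a minimal offline-inference sketch with vLLM's Python API; the model name is only an example, and a CUDA-capable GPU is assumed:

```python
# Minimal vLLM offline-inference sketch; assumes a CUDA GPU and an example model name.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")         # weights downloaded from Hugging Face
params = SamplingParams(temperature=0.8, max_tokens=64)  # sampling settings for generation
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```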
3. Llama.cpp Engine: The Lightweight Pioneer on CPU
Developer : Georgi Gerganov and the open-source community
Core features : Llama.cpp is implemented in C/C++ and was originally designed to run Meta's LLaMA models. By optimizing computation and memory management (including aggressive quantization), it makes running large models on a plain CPU practical, earning it the title of "lightweight king" on CPU devices.
Significant advantages :
Lightweight : No GPU is required and it can run on ordinary CPU devices.
Flexible deployment : suitable for resource-constrained environments such as embedded devices and low-spec servers.
Open source and extensible : the open-source codebase is easy to extend and customize.
Applicable scenarios : When the device has no GPU resources but needs to run a large language model, Llama.cpp is an ideal choice, allowing ordinary devices to experience the power of large language models.
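For reference, a minimal CPU-only sketch with the llama-cpp-python bindings might look like the following; the GGUF file path is a placeholder for a quantized model you have downloaded locally:

```python
# Minimal llama-cpp-python sketch; the GGUF path is a placeholder for a local model file.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048, n_threads=8)
out = llm("Q: What is quantization? A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```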
4. SGLang Engine: A Rising Star for Efficient Inference
Developer : LMSYS Org (an open-source team with roots at UC Berkeley and Stanford)
Core features : SGLang focuses on efficient inference. Its runtime is built around RadixAttention, which automatically reuses the KV cache across calls, combined with techniques such as continuous batching and tensor parallelism. A relative newcomer, it shows considerable potential.
Significant advantages :
Scenario optimization : deep optimization for multi-call and structured-generation workloads significantly improves inference efficiency.
Enterprise adaptation : suitable for enterprise-level applications that require high-performance inference.
Applicable scenarios : SGLang is worth trying for running large language models in large-scale or distributed serving environments; it is an important window into future inference technology.
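As a taste of how SGLang expresses multi-call programs, here is a sketch of its frontend DSL based on the project's documented usage; it assumes an SGLang server is already running locally (for example via python -m sglang.launch_server), and the exact API may differ between versions:

```python
# Sketch of SGLang's frontend DSL; assumes a local SGLang server on port 30000.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def two_questions(s, q1, q2):
    s += sgl.user(q1)
    s += sgl.assistant(sgl.gen("a1", max_tokens=64))
    # The shared conversation prefix lets RadixAttention reuse the KV cache here.
    s += sgl.user(q2)
    s += sgl.assistant(sgl.gen("a2", max_tokens=64))

state = two_questions.run(q1="What is RadixAttention?", q2="Why does KV-cache reuse help?")
print(state["a1"], state["a2"], sep="\n")
```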
5. MLX Engine: The Future of Efficient Computing
Developer : Apple (machine learning research)
Core features : MLX is Apple's open-source machine learning framework, designed for Apple silicon. Its unified-memory model keeps arrays in memory shared by the CPU and GPU, and the companion mlx-lm package focuses on running and fine-tuning large language models on Macs, making MLX a "future star" of efficient on-device computing.
Significant advantages :
Hardware adaptation : optimized for Apple's M-series chips and their unified memory architecture rather than for NVIDIA GPUs or TPUs.
Efficiency first : Applicable to scenarios that require extreme computing efficiency.
Applicable scenarios : If you need to run a large language model locally on a Mac with Apple silicon, MLX is worth watching; its hardware-level optimization makes efficient on-device inference practical.
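The sketch below shows what running a model through the companion mlx-lm package typically looks like; the model repository name is only an example from the mlx-community organization, and the exact generate signature may vary by version:

```python
# Sketch using mlx-lm on Apple silicon; the model repo name is just an example.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")
text = generate(model, tokenizer, prompt="Explain unified memory briefly.", max_tokens=64)
print(text)
```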
6. Ollama: A Convenient Choice for Running Large Models Locally
Developer : Ollama team (open-source project)
Core features : Ollama is a handy tool for running large language models locally. Built on top of llama.cpp, it supports many open models such as Llama, Mistral, and Gemma, and it simplifies model download, deployment, and operation.
Significant advantages :
Easy to use : easy to operate, suitable for individual users and developers.
Local operation : No cloud resources are required, and the model can be run completely on the local device.
Rich models : supports multiple models and is flexible to use.
Applicable scenarios : If you want to test or run a large language model on your personal device, Ollama is an excellent choice, freeing you from cloud dependence so you can experience large models anytime.
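Once the Ollama service is running and a model has been pulled (for example with ollama pull llama3), it can be called over its local REST API; the sketch below uses the default port 11434 and llama3 purely as an example model:

```python
# Sketch calling a locally running Ollama server over its REST API (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why run LLMs locally?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```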
7. Metric comparison
1. Performance comparison
(Table: overall performance comparison of the six engines)
2. Concurrency Comparison
(Table: concurrency comparison of Transformers and vLLM)
vLLM performs best in high-concurrency scenarios and is suitable for production environments.
3. Comparison of applicable scenarios
(Table: applicable scenarios for Transformers, vLLM, MLX, and Ollama)
4. Hardware compatibility comparison
(Table: hardware compatibility of Transformers, vLLM, Llama.cpp, SGLang, MLX, and Ollama)
8. Tokens-per-second (TPS) comparison
1. Factors affecting performance
Before comparing TPS, it is necessary to clarify the key factors that affect performance:
Hardware performance: GPU compute power, memory (VRAM) bandwidth, VRAM capacity, etc.
Model size: the larger the parameter count, the slower the inference.
Batch size: larger batches increase throughput but also increase VRAM usage.
Engine optimization: Different engines have significant differences in performance in terms of memory management, computational optimization, etc.
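TPS itself is simply output tokens divided by wall-clock generation time. The sketch below shows one rough way to measure it for any engine; generate_fn and count_tokens are hypothetical placeholders for the engine call and tokenizer of whatever stack you are benchmarking:

```python
import time

def measure_tps(generate_fn, count_tokens, prompt, runs=3):
    """Rough TPS estimate: total output tokens / total wall-clock generation time."""
    total_tokens, total_seconds = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        text = generate_fn(prompt)          # placeholder: call the engine under test
        total_seconds += time.perf_counter() - start
        total_tokens += count_tokens(text)  # placeholder: count tokens with the model's tokenizer
    return total_tokens / total_seconds
```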
2. GPU Performance Comparison
For reference, the main differences among the A800, A100, and H100: the A800 is the export variant of the A100, with NVLink bandwidth reduced from 600 GB/s to 400 GB/s but otherwise comparable compute, while the H100 is a newer Hopper-generation GPU with substantially higher FP16/FP8 throughput and faster HBM3 memory.
3. Engine TPS comparison
The following are the estimated TPS of each engine on different GPUs (taking the LLaMA-13B model as an example):
(Table: estimated TPS of Transformers and vLLM on each GPU)
Notes:
vLLM performs best on high-performance GPUs (such as H100), with TPS reaching 500-1000, far exceeding other engines.
Transformers offers moderate performance and suits general scenarios.
Llama.cpp and Ollama have lower raw throughput and are suited to resource-constrained environments.
Published performance data for SGLang and MLX is still limited and needs further testing.
9. A brief introduction to installing Xinference
Xinference can be installed with pip on Linux, Windows, and macOS. When using Xinference for model inference, you can specify different engines for different models.
If you want to be able to infer all supported models, you can install all required dependencies with the following command:
pipinstall "xinference[all]"
Note
If you want to use GGML models, it is recommended to manually install the required dependencies according to the hardware you are using to take full advantage of the hardware acceleration capabilities. For more details, please refer to the Llama.cpp engine section.
If you only want to install the dependencies you actually need, the per-engine steps are listed below.
Transformers Engine
The PyTorch (transformers) engine supports almost all of the latest models. It is the default engine for models in PyTorch format:
pipinstall"xinference[transformers]"
vLLM Engine
vLLM is a high-performance inference engine for large models with strong support for high concurrency. Xinference will automatically select vLLM as the engine, to achieve higher throughput, when all of the following conditions are met:
- The model format is pytorch, gptq, or awq.
- When the model format is pytorch, the quantization option must be none.
- When the model format is awq, the quantization option must be Int4.
- When the model format is gptq, the quantization option must be Int3, Int4, or Int8.
- The operating system is Linux and there is at least one CUDA-capable device.
- The model_family field (for custom models) or the model_name field (for built-in models) is in the vLLM support list.
Currently, supported models include:
- llama-2, llama-3, llama-2-chat, llama-3-instruct
- Baichuan, baichuan-chat, baichuan-2-chat
- internlm-16k, internlm-chat-7b, internlm-chat-8k, internlm-chat-20b
- mistral-v0.1, mistral-instruct-v0.1, mistral-instruct-v0.2, mistral-instruct-v0.3
- codestral-v0.1
- Yi, Yi-1.5, Yi-chat, Yi-1.5-chat, Yi-1.5-chat-16k
- code-llama, code-llama-python, code-llama-instruct
- DeepSeek, deepseek-coder, deepseek-chat, deepseek-coder-instruct
- codeqwen1.5, codeqwen1.5-chat
- vicuna-v1.3, vicuna-v1.5
- internlm2-chat
- qwen-chat
- mixtral-instruct-v0.1, mixtral-8x22B-instruct-v0.1
- chatglm3, chatglm3-32k, chatglm3-128k
- glm4-chat, glm4-chat-1m
- qwen1.5-chat, qwen1.5-moe-chat
- qwen2-instruct, qwen2-moe-instruct
- gemma-it
- orion-chat, orion-chat-rag
- c4ai-command-r-v01
Install xinference and vLLM:
pipinstall "xinference[vllm]"
Llama.cpp Engine
Xinference supports models in gguf and ggml format via llama-cpp-python. It is recommended to install the dependencies manually, according to the hardware you are using, to get the best acceleration.
Initial steps:
pip install xinference
Installation methods for different hardware:
Apple M series:
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
NVIDIA graphics cards:
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
AMD graphics cards:
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
SGLang Engine
SGLang provides a high-performance inference runtime based on RadixAttention. It significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls, and it also supports common inference techniques such as continuous batching and tensor parallelism.
Initial steps:
pip install 'xinference[sglang]'
10. Xinference environment variables
XINFERENCE_ENDPOINT
Xinference service address, used to connect to Xinference. The default address is http://127.0.0.1:9997, which can be obtained from the log.
XINFERENCE_MODEL_SRC
Configures the model download source. The default is "huggingface"; it can also be set to "modelscope".
XINFERENCE_HOME
By default, Xinference uses <HOME>/.xinference as the directory for storing models and other necessary files such as logs, where <HOME> is the home directory of the current user. You can change this default directory by setting this environment variable.
XINFERENCE_HEALTH_CHECK_ATTEMPTS
The number of health-check attempts when Xinference starts. If the check still fails after this many attempts, startup reports an error. The default value is 3.
XINFERENCE_HEALTH_CHECK_INTERVAL
The interval between health checks when Xinference starts. If the check does not pass within this period, startup reports an error. The default value is 3.
XINFERENCE_DISABLE_HEALTH_CHECK
When the conditions are met, Xinference will automatically report the worker health status. Setting this environment variable to 1 can disable the health check.
XINFERENCE_DISABLE_VLLM
When the conditions are met, Xinference will automatically use vLLM as the inference engine to improve inference efficiency. You can disable vLLM by setting the environment variable to 1.
XINFERENCE_DISABLE_METRICS
Xinference enables the metrics exporter on the supervisor and workers by default. Setting this environment variable to 1 disables the /metrics endpoint on the supervisor and disables the worker's HTTP service (which only serves the /metrics endpoint).
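As a concrete example, the sketch below sets a few of these variables from Python before starting a local Xinference instance; the ModelScope source and the custom home directory are arbitrary example values:

```python
# Example of configuring Xinference via environment variables before launching it locally.
import os
import subprocess

os.environ["XINFERENCE_MODEL_SRC"] = "modelscope"     # download models from ModelScope
os.environ["XINFERENCE_HOME"] = "/data/xinference"    # example directory for models and logs
os.environ["XINFERENCE_DISABLE_METRICS"] = "1"        # turn off the metrics exporter

# The child process inherits the environment configured above.
subprocess.run(["xinference-local", "--host", "127.0.0.1", "--port", "9997"], check=True)
```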