Xinference: An innovative inference framework

Written by Audrey Miles
Updated on: July 1, 2025
Recommendation

Xinference: Explore a new realm of AI inference by efficiently managing models, optimizing performance, and meeting enterprise-level needs.

Core content:
1. Full model lifecycle management and support for 100+ open source models
2. Optimization for multiple inference engines and broad hardware platform compatibility
3. Enterprise-level features, including permission management, batch processing, and domestic GPU support


Detailed explanation of features

Compared with FastChat, OpenLLM, and RayLLM, Xinference supports all of the following:

  • OpenAI-Compatible RESTful API
  • vLLM Integrations
  • More Inference Engines (GGML, TensorRT)
  • More Platforms (CPU, Metal)
  • Multi-node Cluster Deployment
  • Image Models (Text-to-Image)
  • Text Embedding Models
  • Multimodal Models
  • Audio Models
  • More OpenAI Functionalities (Function Calling)

1. Comprehensive and efficient model management

Xinference provides full lifecycle management of models, covering everything from model import and version control to deployment. It also supports more than 100 of the latest open source models across text, speech, video, embedding/rerank and other domains, so users can quickly adopt the most cutting-edge models.
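
For example, the snippet below is a minimal sketch of this lifecycle using Xinference's Python client, assuming the client's Client, launch_model, list_models, and terminate_model APIs and a server running on the default port; the endpoint and model name are placeholders to adjust for your deployment.

from xinference.client import Client

# Connect to a running Xinference server (adjust the endpoint to your deployment)
client = Client("http://localhost:9997")

# Launch a built-in chat model; the returned UID identifies this running instance
model_uid = client.launch_model(model_name="qwen-chat", model_engine="transformers")

# List everything currently being served
print(client.list_models())

# Tear the model down when it is no longer needed
client.terminate_model(model_uid)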

2. Multiple inference engines and hardware compatibility

To maximize inference performance, Xinference is optimized for a variety of mainstream inference engines, including vLLM, SGLang, and TensorRT. It also supports a wide range of hardware platforms, from international brands to domestic GPUs (such as Huawei Ascend and Hygon), all of which can be integrated seamlessly to serve AI inference tasks.

3. High performance and distributed architecture

With underlying algorithm optimizations and hardware acceleration, Xinference delivers high-performance inference. Its native distributed architecture supports horizontal scaling of clusters, making it easy to handle large-scale processing needs. Multiple scheduling strategies also allow Xinference to adapt flexibly to scenarios such as low latency, long context, and high throughput.

4. Rich enterprise-level features

Beyond its inference capabilities, Xinference provides many enterprise-level features to meet complex business needs, including user permission management, single sign-on, batch processing, multi-tenant isolation, model fine-tuning, and comprehensive observability. These features allow Xinference to ensure data security and compliance while greatly improving the efficiency and flexibility of business operations.

Enterprise Edition vs. Open Source Edition

| Function | Enterprise Edition | Open Source Edition |
| --- | --- | --- |
| User permission management | User permissions, single sign-on, encrypted authentication | Token authorization |
| Cluster capabilities | SLA scheduling, tenant isolation, elastic scaling | Preemptive scheduling |
| Engine support | Optimized vLLM, SGLang, and TensorRT | vLLM, SGLang |
| Batch processing | Custom batch processing for large numbers of calls | None |
| Fine-tuning | Upload datasets for fine-tuning | None |
| Domestic GPU support | Ascend, Hygon, Tianshu, Cambricon, Muxi | None |
| Model management | Model download and management service for private deployment | Depends on ModelScope and Hugging Face |
| Fault detection and recovery | Automatically detects node failures and performs recovery | None |
| High availability | All nodes deployed redundantly for high service availability | None |
| Monitoring | Monitoring metrics API for integration with existing systems | Web page display only |
| Operations | Remote CLI deployment, zero-downtime upgrades | None |
| Service | Remote technical support and automatic upgrade service | Community support |

Mainstream engines

Install all

pip install "xinference[all]"

Transformers Engine

pip install "xinference[transformers]"


vLLM Engine

pip install "xinference[vllm]"

Llama.cpp Engine

pip install xinference
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

SGLang Engine

pip install "xinference[sglang]"

MLX Engine

pip install "xinference[mlx]"

How it works

Run locally

# Create and activate a Python environment
conda create --name xinference python=3.10
conda activate xinference

# Start command
xinference-local --host 0.0.0.0 --port 9997

# Query supported engines for a model
xinference engine -e http://0.0.0.0:9997 --model-name qwen-chat

# Other references: launch a model
xinference launch --model-name <MODEL_NAME> \
                  [--model-engine <MODEL_ENGINE>] \
                  [--model-type <MODEL_TYPE>] \
                  [--model-uid <MODEL_UID>] \
                  [--endpoint "http://<XINFERENCE_HOST>:<XINFERENCE_PORT>"]

Deployment in a cluster

# Start the supervisor. Replace ${supervisor_host} with the IP of the current node.
xinference-supervisor -H "${supervisor_host}"
# Start a worker on each worker node. Replace ${worker_host} with the IP of that node.
xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

After startup is complete, you can access the web UI at http://${supervisor_host}:9997/ui and the API documentation at http://${supervisor_host}:9997/docs.
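
Once the server is up (locally or in a cluster), models can also be called through the OpenAI-compatible RESTful API. Below is a minimal sketch using the standard openai Python package, assuming a chat model has already been launched under the UID qwen-chat and the server is reachable on port 9997.

from openai import OpenAI

# Point the standard OpenAI client at the Xinference endpoint
client = OpenAI(
    base_url="http://localhost:9997/v1",
    api_key="not-needed",  # placeholder; use a real token if authentication is enabled
)

response = client.chat.completions.create(
    model="qwen-chat",  # the model UID assigned when the model was launched
    messages=[{"role": "user", "content": "Summarize Xinference in one sentence."}],
)
print(response.choices[0].message.content)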

Deploy using Docker

# Run on a machine with an NVIDIA GPU
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug
# Run on a CPU-only machine
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Full analysis of model capabilities

Core Functional Modules

  1. Chat & Generation

  • Large Language Models (LLM)

    • Built-in models: Supports mainstream open source models such as Qwen, ChatGLM3, Vicuna, and WizardLM, covering Chinese, English, and multilingual scenarios.

    • Long context processing: Optimized for high-throughput inference, supporting ultra-long text conversations, code generation, and complex logical reasoning.

    • Function calling: Provides structured output for models such as Qwen and ChatGLM3, supports interaction with external APIs (such as weather queries and code execution), and enables intelligent agent development (see the sketch after this list).

  2. Multimodal Processing

  • Vision module

    • Image generation: Integrates models such as Stable Diffusion and supports text-to-image generation.

    • Image and text understanding: Uses large multimodal models (such as Qwen-VL) for tasks like image description and visual question answering.

  • Audio module

    • Speech recognition: Supports the Whisper model for speech-to-text and multi-language translation.

    • Speech generation (experimental): Explores text-to-speech (TTS) capabilities and supports custom voice generation.

  • Video module (experimental)

    • Video understanding: Analyzes video content based on multimodal embedding technology, with support for clip retrieval and summary generation.

  3. Embedding & Reranking

  • Embedding models

    • Text/image vectorization: Supports models such as BGE and M3E to generate unified cross-modal semantic vectors.

    • Application scenarios: Improves recall accuracy for search and recommendation systems and supports mixed-modal retrieval.

  • Rerank models

    • Refined ranking: Re-ranks search results with cross-encoders to improve Top-K accuracy (see the sketch after the built-in model list below).
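
As referenced in the function calling item above, here is a minimal sketch of a tool call through the OpenAI-compatible endpoint. It assumes a tool-capable model (for example Qwen) is already running under the UID qwen-chat at the placeholder URL, and the get_weather tool is purely illustrative.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

# Describe a hypothetical tool the model may decide to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen-chat",
    messages=[{"role": "user", "content": "What is the weather in Hangzhou?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured arguments are returned here
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)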

Built-in Model List

| Model Type | Representative Models | Key Features |
| --- | --- | --- |
| Large language model | Qwen-72B, ChatGLM3-6B, Vicuna-7B | Function calling, long context, multi-turn conversation |
| Embedding model | BGE-Large, M3E-Base | Cross-modal semantic alignment, low-latency inference |
| Image model | Stable Diffusion XL, Qwen-VL | Text-to-image, image description, visual question answering |
| Audio model | Whisper-Large, Bark (experimental) | Speech recognition, multi-language translation, TTS generation |
| Rerank model | bge-reranker-large | Dynamic re-ranking of search results |
| Video model | CLIP-ViT (experimental) | Video content analysis, cross-modal retrieval |

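To make the embedding and rerank rows above concrete, here is a hedged sketch using the Xinference Python client. It assumes the built-in model names bge-large-zh-v1.5 and bge-reranker-large are available, and that model handles expose create_embedding and rerank methods as in recent releases.

from xinference.client import Client

client = Client("http://localhost:9997")

# Launch an embedding model and a rerank model (names are illustrative built-ins)
embed_uid = client.launch_model(model_name="bge-large-zh-v1.5", model_type="embedding")
rerank_uid = client.launch_model(model_name="bge-reranker-large", model_type="rerank")

embed_model = client.get_model(embed_uid)
rerank_model = client.get_model(rerank_uid)

# Vectorize a sentence for downstream retrieval
print(embed_model.create_embedding("Xinference serves embedding models"))

# Re-rank candidate documents against a query to improve Top-K accuracy
documents = ["Xinference is an inference framework", "Bananas are yellow"]
print(rerank_model.rerank(documents, "What is Xinference?"))
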
Core Advantages

  • Performance optimization: Low-latency inference through engines such as vLLM and SGLang, with throughput increased by 2-3x.

  • Enterprise-level support: Distributed deployment, domestic hardware adaptation, and full model lifecycle management.

  • Ecosystem compatibility: Seamless integration with development frameworks such as LangChain and LlamaIndex to accelerate building AI applications (see the sketch below).
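
As one illustration of this ecosystem compatibility, LangChain ships a community integration for Xinference. The sketch below assumes the langchain-community package is installed and that a model has already been launched; the model UID shown is a placeholder.

from langchain_community.llms import Xinference

# Wrap a model served by Xinference as a LangChain LLM
llm = Xinference(
    server_url="http://localhost:9997",
    model_uid="my-model-uid",  # placeholder: the UID returned when the model was launched
)

print(llm.invoke("Explain what an inference engine does in one sentence."))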