Xinference: An innovative inference framework

Written by Audrey Miles
Updated on: July 1, 2025
Recommendation

Xinference: Explore a new realm of AI inference by efficiently managing models, optimizing performance, and meeting enterprise-level needs.

Core content:
1. Full model lifecycle management and support for 100+ open source models
2. Optimization for multiple inference engines and broad hardware platform compatibility
3. Enterprise-level features, including permission management, batch processing, and domestic GPU support


Detailed explanation of features

Compared with FastChat, OpenLLM, and RayLLM, Xinference supports all of the following:

  • OpenAI-Compatible RESTful API
  • vLLM Integrations
  • More Inference Engines (GGML, TensorRT)
  • More Platforms (CPU, Metal)
  • Multi-node Cluster Deployment
  • Image Models (Text-to-Image)
  • Text Embedding Models
  • Multimodal Models
  • Audio Models
  • More OpenAI Functionalities (Function Calling)

1. Comprehensive and efficient model management

Xinference provides full lifecycle management of models, covering everything from model import and version control to deployment. It also supports more than 100 of the latest open source models across text, speech, video, embedding/rerank and other domains, so users can quickly adopt the most cutting-edge models.
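
For example, the snippet below is a minimal sketch of this lifecycle using Xinference's Python client, assuming the client's Client, launch_model, list_models, and terminate_model APIs and a server running on the default port; the endpoint and model name are placeholders to adjust for your deployment.

from xinference.client import Client

# Connect to a running Xinference server (adjust the endpoint to your deployment)
client = Client("http://localhost:9997")

# Launch a built-in chat model; the returned UID identifies this running instance
model_uid = client.launch_model(model_name="qwen-chat", model_engine="transformers")

# List everything currently being served
print(client.list_models())

# Tear the model down when it is no longer needed
client.terminate_model(model_uid)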

2. Multiple inference engines and hardware compatibility

To maximize inference performance, Xinference is optimized for a variety of mainstream inference engines, including vLLM, SGLang, and TensorRT. It also supports a wide range of hardware platforms, from international brands to domestic GPUs (such as Huawei Ascend and Hygon), all of which can be integrated seamlessly to serve AI inference tasks.

3. High performance and distributed architecture

With underlying algorithm optimizations and hardware acceleration, Xinference delivers high-performance inference. Its native distributed architecture supports horizontal scaling of clusters, making it easy to handle large-scale processing needs. Multiple scheduling strategies also allow Xinference to adapt flexibly to scenarios such as low latency, long context, and high throughput.

4. Rich enterprise-level features

Beyond its inference capabilities, Xinference provides many enterprise-level features to meet complex business needs, including user permission management, single sign-on, batch processing, multi-tenant isolation, model fine-tuning, and comprehensive observability. These features allow Xinference to ensure data security and compliance while greatly improving the efficiency and flexibility of business operations.

Enterprise Edition vs. Open Source Edition

| Function | Enterprise Edition | Open Source Edition |
| --- | --- | --- |
| User permission management | User permissions, single sign-on, encrypted authentication | Token authorization |
| Cluster capabilities | SLA scheduling, tenant isolation, elastic scaling | Preemptive scheduling |
| Engine support | Optimized vLLM, SGLang, and TensorRT | vLLM, SGLang |
| Batch processing | Custom batch processing for large numbers of calls | None |
| Fine-tuning | Upload datasets for fine-tuning | None |
| Domestic GPU support | Ascend, Hygon, Tianshu, Cambricon, Muxi | None |
| Model management | Model download and management service for private deployment | Depends on ModelScope and Hugging Face |
| Fault detection and recovery | Automatically detects node failures and performs recovery | None |
| High availability | All nodes deployed redundantly for high service availability | None |
| Monitoring | Monitoring metrics API for integration with existing systems | Web page display only |
| Operations | Remote CLI deployment, zero-downtime upgrades | None |
| Service | Remote technical support and automatic upgrade service | Community support |

Mainstream engines

Install all

pip install "xinference[all]"

Transformers Engine

pip install "xinference[transformers]"


vLLM Engine

pip install "xinference[vllm]"

Llama.cpp Engine

pip install xinference
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python

SGLang Engine

pip install "xinference[sglang]"

MLX Engine

pip install "xinference[mlx]"

How it works

Run locally

# Create and activate a Python environment
conda create --name xinference python=3.10
conda activate xinference

# Start command
xinference-local --host 0.0.0.0 --port 9997

# Query supported engines for a model
xinference engine -e http://0.0.0.0:9997 --model-name qwen-chat

# Other references: launch a model
xinference launch --model-name <MODEL_NAME> \
                  [--model-engine <MODEL_ENGINE>] \
                  [--model-type <MODEL_TYPE>] \
                  [--model-uid <MODEL_UID>] \
                  [--endpoint "http://<XINFERENCE_HOST>:<XINFERENCE_PORT>"]

Deployment in a cluster

# Start the supervisor. Replace ${supervisor_host} with the IP of the current node.
xinference-supervisor -H "${supervisor_host}"
# Start a worker on each worker node. Replace ${worker_host} with the IP of that node.
xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"

After startup is complete, you can access the web UI at http://${supervisor_host}:9997/ui and the API documentation at http://${supervisor_host}:9997/docs.
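
Once the server is up (locally or in a cluster), models can also be called through the OpenAI-compatible RESTful API. Below is a minimal sketch using the standard openai Python package, assuming a chat model has already been launched under the UID qwen-chat and the server is reachable on port 9997.

from openai import OpenAI

# Point the standard OpenAI client at the Xinference endpoint
client = OpenAI(
    base_url="http://localhost:9997/v1",
    api_key="not-needed",  # placeholder; use a real token if authentication is enabled
)

response = client.chat.completions.create(
    model="qwen-chat",  # the model UID assigned when the model was launched
    messages=[{"role": "user", "content": "Summarize Xinference in one sentence."}],
)
print(response.choices[0].message.content)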

Deploy using Docker

# Run on a machine with an NVIDIA GPU
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug
# Run on a CPU-only machine
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug

Full analysis of model capabilities

Core Functional Modules

  1. Chat & Generation

  • Large Language Models (LLM)

    • Built-in models: Supports mainstream open source models such as Qwen, ChatGLM3, Vicuna, and WizardLM, covering Chinese, English, and multilingual scenarios.

    • Long context processing: Optimized for high-throughput inference, supporting ultra-long text conversations, code generation, and complex logical reasoning.

    • Function calling: Provides structured output for models such as Qwen and ChatGLM3, supports interaction with external APIs (such as weather queries and code execution), and enables intelligent agent development (see the sketch after this list).

  2. Multimodal Processing

  • Vision module

    • Image generation: Integrates models such as Stable Diffusion and supports text-to-image generation.

    • Image and text understanding: Uses large multimodal models (such as Qwen-VL) for tasks like image description and visual question answering.

  • Audio module

    • Speech recognition: Supports the Whisper model for speech-to-text and multi-language translation.

    • Speech generation (experimental): Explores text-to-speech (TTS) capabilities and supports custom voice generation.

  • Video module (experimental)

    • Video understanding: Analyzes video content based on multimodal embedding technology, with support for clip retrieval and summary generation.

  3. Embedding & Reranking

  • Embedding models

    • Text/image vectorization: Supports models such as BGE and M3E to generate unified cross-modal semantic vectors.

    • Application scenarios: Improves recall accuracy for search and recommendation systems and supports mixed-modal retrieval.

  • Rerank models

    • Refined ranking: Re-ranks search results with cross-encoders to improve Top-K accuracy (see the sketch after the built-in model list below).
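
As referenced in the function calling item above, here is a minimal sketch of a tool call through the OpenAI-compatible endpoint. It assumes a tool-capable model (for example Qwen) is already running under the UID qwen-chat at the placeholder URL, and the get_weather tool is purely illustrative.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:9997/v1", api_key="not-needed")

# Describe a hypothetical tool the model may decide to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen-chat",
    messages=[{"role": "user", "content": "What is the weather in Hangzhou?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured arguments are returned here
tool_calls = response.choices[0].message.tool_calls
if tool_calls:
    print(tool_calls[0].function.name, tool_calls[0].function.arguments)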

Built-in Model List

| Model Type | Representative Models | Key Features |
| --- | --- | --- |
| Large language model | Qwen-72B, ChatGLM3-6B, Vicuna-7B | Function calling, long context, multi-turn conversation |
| Embedding model | BGE-Large, M3E-Base | Cross-modal semantic alignment, low-latency inference |
| Image model | Stable Diffusion XL, Qwen-VL | Text-to-image, image description, visual question answering |
| Audio model | Whisper-Large, Bark (experimental) | Speech recognition, multi-language translation, TTS generation |
| Rerank model | bge-reranker-large | Dynamic re-ranking of search results |
| Video model | CLIP-ViT (experimental) | Video content analysis, cross-modal retrieval |

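To make the embedding and rerank rows above concrete, here is a hedged sketch using the Xinference Python client. It assumes the built-in model names bge-large-zh-v1.5 and bge-reranker-large are available, and that model handles expose create_embedding and rerank methods as in recent releases.

from xinference.client import Client

client = Client("http://localhost:9997")

# Launch an embedding model and a rerank model (names are illustrative built-ins)
embed_uid = client.launch_model(model_name="bge-large-zh-v1.5", model_type="embedding")
rerank_uid = client.launch_model(model_name="bge-reranker-large", model_type="rerank")

embed_model = client.get_model(embed_uid)
rerank_model = client.get_model(rerank_uid)

# Vectorize a sentence for downstream retrieval
print(embed_model.create_embedding("Xinference serves embedding models"))

# Re-rank candidate documents against a query to improve Top-K accuracy
documents = ["Xinference is an inference framework", "Bananas are yellow"]
print(rerank_model.rerank(documents, "What is Xinference?"))
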
Core Advantages

  • Performance optimization: Low-latency inference through engines such as vLLM and SGLang, with throughput increased by 2-3x.

  • Enterprise-level support: Distributed deployment, domestic hardware adaptation, and full model lifecycle management.

  • Ecosystem compatibility: Seamless integration with development frameworks such as LangChain and LlamaIndex to accelerate building AI applications (see the sketch below).
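
As one illustration of this ecosystem compatibility, LangChain ships a community integration for Xinference. The sketch below assumes the langchain-community package is installed and that a model has already been launched; the model UID shown is a placeholder.

from langchain_community.llms import Xinference

# Wrap a model served by Xinference as a LangChain LLM
llm = Xinference(
    server_url="http://localhost:9997",
    model_uid="my-model-uid",  # placeholder: the UID returned when the model was launched
)

print(llm.invoke("Explain what an inference engine does in one sentence."))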