Xinference: An innovative inference framework

Xinference: explore a new realm of AI inference, efficiently manage models, optimize performance, and meet enterprise-level needs.
Core content:
1. Full model lifecycle management with support for 100+ open-source models
2. Optimization for multiple inference engines and broad hardware-platform compatibility
3. Enterprise-level features, including permission management, batch processing, and support for Chinese domestic GPUs
Detailed explanation of features
| Feature | Xinference | FastChat | OpenLLM | RayLLM |
|---|---|---|---|---|
| OpenAI-Compatible RESTful API | ✅ | ✅ | ✅ | ✅ |
| vLLM Integrations | ✅ | ✅ | ✅ | ✅ |
| More Inference Engines (GGML, TensorRT) | ✅ | ❌ | ✅ | ✅ |
| More Platforms (CPU, Metal) | ✅ | ✅ | ❌ | ❌ |
| Multi-node Cluster Deployment | ✅ | ❌ | ❌ | ✅ |
| Image Models (Text-to-Image) | ✅ | ✅ | ❌ | ❌ |
| Text Embedding Models | ✅ | ❌ | ❌ | ❌ |
| Multimodal Models | ✅ | ❌ | ❌ | ❌ |
| Audio Models | ✅ | ❌ | ❌ | ❌ |
| More OpenAI Functionalities (Function Calling) | ✅ | ❌ | ❌ | ❌ |
1. Comprehensive and efficient model management
Xinference covers the full model lifecycle, from model import and version control through to deployment. It also ships with 100+ of the latest open-source models spanning text, speech, video, and embedding/rerank tasks, so users can quickly adopt the most cutting-edge models. A minimal client-side sketch of that lifecycle is shown below.
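As an illustration only, the sketch below uses the `xinference` Python client and assumes a local server on port 9997 (as started in the deployment section further down); `qwen-chat` and the `transformers` engine are example choices:

```python
# Requires a running server, e.g.: xinference-local --host 0.0.0.0 --port 9997
from xinference.client import Client

# Connect to the Xinference endpoint (adjust host/port to your deployment).
client = Client("http://127.0.0.1:9997")

# Deploy a built-in model; the returned UID identifies this running instance.
model_uid = client.launch_model(model_name="qwen-chat", model_engine="transformers")

# Inspect what is currently deployed.
print(client.list_models())

# Tear the model down when it is no longer needed.
client.terminate_model(model_uid)
```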
2. Multiple inference engines and hardware compatibility
To maximize inference performance, Xinference is optimized for a variety of mainstream inference engines, including vLLM, SGLang, and TensorRT. It also supports a wide range of hardware platforms, from international GPU brands to Chinese domestic GPUs (such as Huawei Ascend and Hygon), integrating them seamlessly to serve AI inference workloads.
3. High performance and distributed architecture
Backed by low-level algorithmic optimization and hardware acceleration, Xinference delivers high-performance inference. Its natively distributed architecture supports horizontal cluster scaling to handle large-scale workloads, and multiple scheduling strategies let it adapt flexibly to low-latency, long-context, and high-throughput scenarios.
4. Rich enterprise-level features
Beyond its inference capabilities, Xinference provides enterprise-level features for complex business needs, including user permission management, single sign-on, batch processing, multi-tenant isolation, model fine-tuning, and comprehensive observability. These features help ensure data security and compliance while improving operational efficiency and flexibility.
Open Source Edition
Comparison of the Enterprise Edition and the Open Source Edition
| Feature | Enterprise Edition | Open Source Edition |
|---|---|---|
| User permission management | User permissions, single sign-on, encrypted authentication | Token-based authorization |
| Cluster capabilities | SLA scheduling, tenant isolation, elastic scaling | Preemptive scheduling |
| Engine support | Optimized vLLM, SGLang, and TensorRT | vLLM, SGLang |
| Batch processing | Custom batch processing for high call volumes | None |
| Fine-tuning | Upload datasets for fine-tuning | None |
| Chinese domestic GPU support | Ascend, Hygon (Haiguang), Tianshu, Cambricon, Muxi | None |
| Model management | Model download and management service for private deployment | Depends on ModelScope and Hugging Face |
| Fault detection and recovery | Automatically detects node failures and resets faulty nodes | None |
| High availability | All nodes deployed redundantly for highly available service | None |
| Monitoring | Metrics API for integration with existing monitoring systems | Web page display |
| Operations | Remote CLI deployment, zero-downtime upgrades | None |
| Support | Remote technical support and automatic upgrade service | Community support |
Mainstream engines
Install all
pip install "xinference[all]"
Transformers Engine
pip install "xinference[transformers]"
vLLM Engine
pip install "xinference[vllm]"
Llama.cpp engine
pip install xinference
pip install xllamacpp --force-reinstall --index-url https://xorbitsai.github.io/xllamacpp/whl/cu124
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
SGLang Engine
pip install "xinference[sglang]"
MLX Engine
pip install "xinference[mlx]"
How it works
Run locally
conda create --name xinference python=3.10
conda activate xinference

# Start a local Xinference server
xinference-local --host 0.0.0.0 --port 9997

# Query the available inference engines for a model
xinference engine -e http://0.0.0.0:9997 --model-name qwen-chat

# Launch a model (general form)
xinference launch --model-name <MODEL_NAME> \
  [--model-engine <MODEL_ENGINE>] \
  [--model-type <MODEL_TYPE>] \
  [--model-uid <MODEL_UID>] \
  [--endpoint "http://<XINFERENCE_HOST>:<XINFERENCE_PORT>"]
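Once a model is launched, it can be called through the OpenAI-compatible RESTful API. The following is a minimal sketch using the official `openai` Python package; the endpoint, the placeholder API key, and the model name `qwen-chat` mirror the commands above and may need adjusting for your deployment:

```python
# pip install openai
from openai import OpenAI

# Xinference typically needs no API key unless token auth is enabled; any placeholder works.
client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="qwen-chat",  # the model name/UID launched above
    messages=[{"role": "user", "content": "Summarize Xinference in one sentence."}],
)
print(response.choices[0].message.content)
```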
Deployment in a cluster
# Start the supervisor; replace ${supervisor_host} with the IP of the current node
xinference-supervisor -H "${supervisor_host}"
# Start a worker on each worker node
xinference-worker -e "http://${supervisor_host}:9997" -H "${worker_host}"
After startup is complete, you can access the web UI at http://${supervisor_host}:9997/ui and the API documentation at http://${supervisor_host}:9997/docs.
Deploy using Docker
# Run on a machine with an NVIDIA GPU
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 --gpus all xprobe/xinference:<your_version> xinference-local -H 0.0.0.0 --log-level debug
# Run on a CPU-only machine
docker run -e XINFERENCE_MODEL_SRC=modelscope -p 9998:9997 xprobe/xinference:<your_version>-cpu xinference-local -H 0.0.0.0 --log-level debug
Full analysis of model capabilities
Core Functional Module
Chat & Generation
Large Language Model (LLM)
Built-in models: Supports mainstream open-source models such as Qwen, ChatGLM3, Vicuna, and WizardLM, covering Chinese, English, and multilingual scenarios.
Long-context processing: Optimized high-throughput inference with support for very long conversations, code generation, and complex logical reasoning.
Function calling: Provides structured output for models such as Qwen and ChatGLM3, supports interaction with external APIs (such as weather queries and code execution), and enables agent development (see the sketch after this list).
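As a hedged illustration of function calling, the sketch below assumes a function-calling-capable model such as `qwen-chat` is already launched and that the server honors the OpenAI-compatible `tools` parameter; `get_weather` is a made-up tool used only for this example:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

# Describe a hypothetical external API the model may decide to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen-chat",
    messages=[{"role": "user", "content": "What's the weather like in Hangzhou?"}],
    tools=tools,
)
# If the model decided to call the tool, its structured arguments appear here.
print(response.choices[0].message.tool_calls)
```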
Multimodal Processing
Vision Module
Image Generation: Integrates models such as Stable Diffusion and supports text-to-image generation.
Image-Text Understanding: Uses large multimodal models (such as Qwen-VL) for tasks like image description and visual question answering (a sketch follows below).
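A hedged sketch of image-text understanding through the OpenAI-compatible chat API; it assumes a vision-language model such as `qwen-vl-chat` has been launched, and the image URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

response = client.chat.completions.create(
    model="qwen-vl-chat",  # a launched multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```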
Audio Module
Speech Recognition: Supports the Whisper model for speech-to-text and multilingual translation.
Speech Generation (Experimental): Explores text-to-speech (TTS) capabilities with support for custom voices (see the sketch below).
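A minimal transcription sketch, assuming a Whisper model (e.g. `whisper-large-v3`) has been launched and the OpenAI-compatible audio endpoint is available; `meeting.wav` is a placeholder file name:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")

# Transcribe a local audio file with a launched Whisper model.
with open("meeting.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(model="whisper-large-v3", file=audio_file)
print(result.text)
```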
Video Module (Experimental)
Video Understanding: Analyzes video content based on multimodal embedding techniques, with support for clip retrieval and summary generation.
Embedding & Reranking
Embedding Model
Text/image vectorization: Supports models such as BGE and M3E to generate unified cross-modal semantic vectors.
Application scenarios: Improves recall accuracy for search and recommendation systems and supports mixed-modal retrieval.
Reranking Model
Refined ranking: Re-ranks search results with cross-encoders to improve top-k accuracy (see the sketch after this item).
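A hedged sketch combining the two: embeddings through the OpenAI-compatible endpoint and reranking through the `xinference` client. The model names `bge-large-zh-v1.5` and `bge-reranker-large` are examples and must already be launched:

```python
from openai import OpenAI
from xinference.client import Client

# Embeddings via the OpenAI-compatible endpoint.
oai = OpenAI(base_url="http://127.0.0.1:9997/v1", api_key="not-used")
emb = oai.embeddings.create(model="bge-large-zh-v1.5", input=["What is Xinference?"])
print(len(emb.data[0].embedding))  # dimensionality of the returned vector

# Reranking via the Xinference client.
xi = Client("http://127.0.0.1:9997")
reranker = xi.get_model("bge-reranker-large")
docs = ["Xinference is a model serving framework.", "Bananas are yellow."]
print(reranker.rerank(documents=docs, query="What is Xinference?"))
```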
Built-in Model List
| Model Type | Representative Models | Key Features |
|---|---|---|
| Large Language Model | Qwen-72B, ChatGLM3-6B, Vicuna-7B | Function calling, long context, multi-turn conversation |
| Embedding Model | BGE-Large, M3E-Base | Cross-modal semantic alignment, low-latency inference |
| Image Model | Stable Diffusion XL, Qwen-VL | Text-to-image, image description, visual question answering |
| Audio Model | Whisper-Large, Bark (experimental) | Speech recognition, multilingual translation, TTS generation |
| Rerank Model | bge-reranker-large | Dynamically re-ranks search results |
| Video Model | CLIP-ViT (experimental) | Video content analysis, cross-modal retrieval |
Core Advantages
Performance optimization: Achieves low-latency inference through engines such as vLLM and SGLang, increasing throughput by 2-3x.
Enterprise-level support: Distributed deployment, Chinese domestic hardware adaptation, and full model lifecycle management.
Ecosystem compatibility: Connects seamlessly to development frameworks such as LangChain and LlamaIndex to accelerate AI application development (a hedged integration sketch follows).
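As an illustration of the ecosystem integration mentioned above, a hedged LangChain sketch, assuming the `langchain-community` package and a model already launched with UID `qwen-chat`:

```python
# pip install langchain-community
from langchain_community.llms import Xinference

# Wrap a running Xinference model as a LangChain LLM.
llm = Xinference(server_url="http://127.0.0.1:9997", model_uid="qwen-chat")
print(llm.invoke("Explain in one sentence what a rerank model does."))
```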